About metadata collection

Danger

data.world University!

Check out our courses - Introduction to Collectors course and Cloud collectors overview course.

Metadata collection is the process of systematically gathering and cataloging technical metadata from the data sources in your landscape. That technical metadata is then displayed in your data catalog and from there users can further enrich those resources with additional context and information to help tell the story about your data within your organization.

The application provides a wide array of metadata collectors that are designed to pull metadata from your systems such as databases and reporting tools. These collectors are used to pull metadata and in some cases lineage from the source and create a graph of that information that can then be ingested into the catalog.

There are two ways of running these metadata collectors:

data.world cloud collectors: The Cloud collectors are managed and hosted by data.world for creating and managing data connections for cataloging metadata for data sources available over the internet. The user runs a simple wizard to configure and schedule the collector runs. Note that cloud collectors are not available for all metadata sources.
Benefits of using cloud collectors
- Cloud collectors work best for data sources that are available on the internet for data.world to be able to securely connect to the data sources.
- Cloud collectors are data.world hosted and fully managed by data.world.
- You don't need any IT involvement to setup, run, or schedule the collectors. The no-code collector setup means collectors can be quickly implemented to more quickly make use of cataloged resources. This requires minimal time from data engineers and IT teams and allows for faster troubleshooting by data.world support.
- Collectors are updated automatically to latest version (on a monthly release cadence) and run seamlessly without any manual intervention.
- The data.world support team has full access to the log files for the collector runs and can help with faster troubleshooting of any issues with collector runs.
  Important
  See this documentation for a quick overview of setting up cloud metadata collectors.
On-premise collectors: If the data sources are not available on the internet and managed behind a private network or firewalls, the application will not be able to connect to them using cloud collectors. In this case, the Metadata Collectors will need to be set up and run in a customer-managed environment, leveraging either a Docker container or JAR files. Other containerization alternatives to Docker, such as Podman, are not formally supported. If you decide to use these tools and run into issues, data.world will not be able to help troubleshoot them.
Benefits of using on-premise collectors:
- Use on-premise collectors for data sources that are behind a private network or firewalls.
- Private networking solutions like AWS PrivateLink are supported for on-premise collectors.
- Cloud collectors are not available for some data sources. For such sources, you will have to use the on-premise collectors.
Important
See this documentation for a quick overview of setting up on-premise metadata collectors.

Frequently Asked Questions

Question: I currently have on-premise collectors for data sources that are now supported through Cloud Collectors. Do I have to switch to could collectors?
Answer: No! You can continue to use the on-premise collectors. However, if you want to move to Cloud collectors to get a fully managed collector experience from data.world, you can make the switch. This will require re-configuring the data sources using Cloud Collectors. Work with your Customer Success Director for a smooth transition from on-premise to cloud collectors.
Question: What if my organization wants to catalog metadata from cloud and on-premise data sources?
Answer: Cloud Collectors are designed for data sources that are cloud accessible via the open internet. They can run in parallel with the on-premise collector, which features sources not available in Cloud Collectors, including many on-premise sources. View collector availability documentation.
Question: I was using Connection Manager for metadata collection. Can I continue to use it?
Answer: Yes! You can continue to use Connection Manager. However, if you want to move to Cloud collectors for more robust experience, you can make the switch. This will require re-configuring your collectors for Cloud Collectors. Work with your Customer Success Director for a smooth transition from Connection Manager to cloud collectors. If it aligns with your business operations, we advocate the move to Cloud collectors for a superior experience.
Question: Can I have a cloud collector and an on-premise collector setup for the same data source?
Answer: Yes, if you have a business need of doing this, you can. However, we recommend that for both collector runs you use different set of datasets and collections so that the collector runs don't overwrite each others output. Work with you Customer Success Director to plan and setup this properly.
Question: How do I get notifications for Could Collector runs?
Answer: Simply setup webhooks at the Organization level from the Organization profile page > Settings tab. The Web hooks will automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.
Question: How will you secure the credentials I enter in the screens while setting Cloud collectors?
Answer: Besides using standard TLS/SSL for all network transactions, we always encrypt credentials at rest, ensuring they are unintelligible and not human-readable. Furthermore, once credentials enter our system, they are never used for anything other than authenticating with the target data system over an encrypted channel.
Question: Can I run on-premise collectors on Windows machines?
Answer: Yes, you can run on-premise collectors on Linux and Windows machines. However, we do not currently produce a Windows Containers. You can however, Update the configuration of Docker on the Windows machine to use Linux containers. Alternatively, you can install a Java Virtual Machine on the Windows machine, and run the collector in Java. The collector software is a Java .jar file that you can download and run directly, without Docker.
Question: Can cloud collectors connect to a server behind a firewall?
Answer: No, if the data sources are not available on the internet and managed behind a private network or firewalls, the application will not be able to connect to them using cloud collectors. Use on-premise collectors for such data sources.
Question: Why would I use AWS PrivateLink with my on-premise collectors?
Answer: Using AWS PrivateLink provides a secure connection to data systems that are behind firewalls or not accessible via the internet, ensuring your data remains protected.
Question: Can I use AWS PrivateLink with cloud-based collectors?
Answer: No, AWS PrivateLink is only supported for on-premise collectors.
Question: How does data.world handle metadata conflicts when two collectors are harvesting from the same source?
Anwser: When two collectors harvest metadata from the same source and persist it in the catalog, the outcome depends on the catalog or collection name specified:
If the same collection name is used (--name option), the metadata from the collector that ran last will take precedence, effectively overwriting previous entries.
If different names are used for the collections, metadata from both collectors is retained. The resources collected from both runs will appear on the same resource page since the IRIs will match. This allows properties from both runs to be displayed together, and users can also see which collections each resource belongs to.

In this section:

About metadata collection

Danger

Important

Important

Frequently Asked Questions

Search results