About metadata collection
Danger
data.world University!
Check out our courses: the Introduction to Collectors course and the Cloud collectors overview course.
Metadata collection is the process of systematically gathering and cataloging technical metadata from the data sources in your landscape. That technical metadata is then displayed in your data catalog, and from there users can further enrich those resources with additional context and information to help tell the story of your data within your organization.
The application provides a wide array of metadata collectors designed to pull metadata from your systems, such as databases and reporting tools. Each collector pulls metadata and, in some cases, lineage from the source and builds a graph of that information that can then be ingested into the catalog.
There are two ways of running these metadata collectors:
data.world cloud collectors: Cloud collectors are hosted and managed by data.world. They create and manage the data connections used to catalog metadata from data sources that are available over the internet. You run a simple wizard to configure and schedule the collector runs. Note that cloud collectors are not available for all metadata sources.
Benefits of using cloud collectors
Cloud collectors work best for data sources that are reachable over the internet, so that data.world can connect to them securely.
Cloud collectors are hosted and fully managed by data.world.
You don't need any IT involvement to set up, run, or schedule the collectors. Because setup is no-code, collectors can be implemented quickly, so you can start using cataloged resources sooner. This requires minimal time from data engineers and IT teams and allows for faster troubleshooting by data.world support.
Collectors are updated automatically to the latest version (on a monthly release cadence) and run without any manual intervention.
The data.world support team has full access to the log files for the collector runs and can help with faster troubleshooting of any issues with collector runs.
Important
See this documentation for a quick overview of setting up cloud metadata collectors.
On-premise collectors: If the data sources are not available on the internet and are managed behind a private network or firewall, the application cannot connect to them using cloud collectors. In this case, the metadata collectors need to be set up and run in a customer-managed environment, using either a Docker container or JAR files. Containerization alternatives to Docker, such as Podman, are not formally supported. If you decide to use these tools and run into issues, data.world will not be able to help troubleshoot them.
Benefits of using on-premise collectors:
Use on-premise collectors for data sources that are behind a private network or firewalls.
Cloud collectors are not available for some data sources. For such sources, you will have to use the on-premise collectors.
Important
See this documentation for a quick overview of setting up on-premise metadata collectors.
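The choice between the two run modes described above can be sketched as a small decision rule. The following Python is illustrative only: the `DataSource` fields and the `choose_collector` function are hypothetical names, not part of any data.world API, and actual collector availability should be confirmed in the collector availability documentation.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical description of a data source, for illustration only."""
    name: str
    reachable_from_internet: bool    # False if behind a private network or firewall
    cloud_collector_available: bool  # cloud collectors do not exist for every source


def choose_collector(source: DataSource) -> str:
    """Return "cloud" or "on-premise" following the guidance in this section."""
    # Cloud collectors need both internet reachability and a cloud collector
    # for that source type; in every other case, fall back to on-premise.
    if source.reachable_from_internet and source.cloud_collector_available:
        return "cloud"
    return "on-premise"
```

For example, a source that is reachable over the internet but has no cloud collector would still resolve to "on-premise" under this rule.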
Frequently Asked Questions
Question: I currently have on-premise collectors for data sources that are now supported through Cloud Collectors. Do I have to switch to cloud collectors?
Answer: No! You can continue to use the on-premise collectors. However, if you want to move to Cloud collectors to get a fully managed collector experience from data.world, you can make the switch. This will require re-configuring the data sources using Cloud Collectors. Work with your Customer Success Director for a smooth transition from on-premise to cloud collectors.
Question: What if my organization wants to catalog metadata from cloud and on-premise data sources?
Answer: Cloud Collectors are designed for data sources that are accessible over the open internet. They can run in parallel with the on-premise collector, which supports sources not available in Cloud Collectors, including many on-premise sources. View collector availability documentation.
Question: I was using Connection Manager for metadata collection. Can I continue to use it?
Answer: Yes! You can continue to use Connection Manager. However, if you want to move to Cloud collectors for a more robust experience, you can make the switch. This will require re-configuring your collectors for Cloud Collectors. Work with your Customer Success Director for a smooth transition from Connection Manager to cloud collectors. If it aligns with your business operations, we recommend moving to Cloud collectors for a better experience.
Question: Can I have a cloud collector and an on-premise collector set up for the same data source?
Answer: Yes, if you have a business need for this, you can. However, we recommend that the two collector runs use different sets of datasets and collections so that they don't overwrite each other's output. Work with your Customer Success Director to plan and set this up properly.
Question: How do I get notifications for Cloud Collector runs?
Answer: Set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks will automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.
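As a rough illustration of consuming these status events, the sketch below classifies a webhook payload by run status. The payload shape is an assumption: the top-level `status` field name is hypothetical and should be checked against the payload your webhook endpoint actually receives. Only the six status names come from the answer above, and grouping them into in-progress, successful, and needs-attention is an interpretation, not documented behavior.

```python
import json

# Status names listed in the answer above, grouped by an assumed meaning.
IN_PROGRESS = {"Pending", "Provisioning", "Running"}
FINISHED_OK = {"Completed"}
NEEDS_ATTENTION = {"Error", "Cancelled"}


def classify_event(raw_body: str) -> str:
    """Return "in_progress", "ok", "attention", or "unknown" for a webhook body."""
    event = json.loads(raw_body)
    if not isinstance(event, dict):
        return "unknown"
    # ASSUMPTION: the run status is carried in a top-level "status" field.
    status = event.get("status")
    if status in IN_PROGRESS:
        return "in_progress"
    if status in FINISHED_OK:
        return "ok"
    if status in NEEDS_ATTENTION:
        return "attention"
    return "unknown"
```

A receiving service could call `classify_event` on each request body and notify a team only when the result is "attention".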
Question: How will you secure the credentials I enter in the screens while setting up Cloud collectors?
Answer: Besides using standard TLS/SSL for all network transactions, we always encrypt credentials at rest, and once they enter our system, we never send them externally. This means data.world will never send a password (even in an encrypted form) to the UI or to any other destination.
Question: Can I run on-premise collectors on Windows machines?
Answer: Yes, you can run on-premise collectors on Linux and Windows machines. However, we do not currently produce a Windows container image. You can instead update the Docker configuration on the Windows machine to use Linux containers. Alternatively, you can install a Java Virtual Machine on the Windows machine and run the collector in Java. The collector software is a Java .jar file that you can download and run directly, without Docker.