About metadata collection

data.world University! 

Check out our courses: the Introduction to Collectors course and the Cloud collectors overview course.

data.world provides a wide array of metadata collectors that pull metadata from your systems, such as databases and reporting tools. Each collector pulls metadata, and in some cases lineage, from the source and creates a graph of that information that can then be ingested into the data.world catalog.

Figure: metadata_collectors.png

There are two ways of running these metadata collectors:

  • data.world cloud collectors: Cloud collectors are hosted and managed by data.world. They create and manage the data connections used to catalog metadata. The user runs a simple wizard to configure and schedule the collector runs. Note that cloud collectors are not available for all metadata sources.

    Benefits of using cloud collectors

    • Cloud collectors work best for data sources that are reachable over the internet, so that data.world can connect to them securely.

    • Cloud collectors are data.world hosted and fully managed by data.world.

    • You don't need any IT involvement to set up, run, or schedule the collectors. Because setup is no-code, collectors can be implemented quickly, so cataloged resources are available sooner. This requires minimal time from data engineers and IT teams and allows for faster troubleshooting by data.world support.

    • Collectors are updated automatically to the latest version (on a monthly release cadence) and run without any manual intervention.

    • The data.world support team has full access to the log files for the collector runs and can help with faster troubleshooting of any issues with collector runs.

      Important

      See this documentation for a quick overview of setting up cloud metadata collectors.

  • On-premise collectors: If the data sources are not available on the internet and are managed behind a private network or firewall, data.world cannot connect to them using cloud collectors. In this case, the metadata collectors must be set up and run in a customer-managed environment, using either a Docker container or JAR files.

    Benefits of using on-premise collectors:

    • Use on-premise collectors for data sources that are behind a private network or firewall.

    • Cloud collectors are not available for some data sources. For such sources, you will have to use the on-premise collectors.

    Important

    See this documentation for a quick overview of setting up on-premise metadata collectors.

Frequently Asked Questions

  1. Question: I currently have on-premise collectors for data sources that are now supported through Cloud Collectors. Do I have to switch to cloud collectors?

    Answer: No! You can continue to use the on-premise collectors. However, if you want to move to Cloud collectors to get a fully managed collector experience from data.world, you can make the switch. This will require re-configuring the data sources using Cloud Collectors. Work with your Customer Success Director for a smooth transition from on-premise to cloud collectors.

  2. Question: What if my organization wants to catalog metadata from cloud and on-premise data sources?

    Answer: Cloud Collectors are designed for data sources that are accessible over the open internet. They can run in parallel with the data.world Collector, which supports sources not available in Cloud Collectors, including many on-premise sources. See the collector availability documentation.

  3. Question: I was using Connection Manager for metadata collection. Can I continue to use it?

    Answer: Yes! You can continue to use Connection Manager. However, if it aligns with your business operations, we recommend moving to Cloud collectors for a more robust experience. This will require re-configuring your collectors for Cloud Collectors. Work with your Customer Success Director for a smooth transition from Connection Manager to cloud collectors.

  4. Question: Can I have a cloud collector and an on-premise collector set up for the same data source?

    Answer: Yes, if you have a business need for this, you can. However, we recommend that the two collector runs use different sets of datasets and collections so that they don't overwrite each other's output. Work with your Customer Success Director to plan and set this up properly.

  5. Question: How do I get notifications for Cloud Collector runs?

    Answer: Set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks will automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.

  6. Question: How will you secure the credentials I enter in the screens while setting up Cloud collectors?

    Answer: Besides using standard TLS/SSL for all network transactions, we always encrypt credentials at rest, and once they enter our system we never send them externally. This means data.world will never send a password (even in an encrypted form) to the UI or to any other source.
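
As an illustration of the webhook notifications described in question 5, the following is a minimal Python sketch of how a receiving service might sort incoming Status events into "still running", "finished", and "failed". Only the six status names come from this page. The payload layout (a JSON body with "status" and "collector" fields) and the function names classify_status and handle_event are assumptions made for the example, not the documented data.world webhook format, so check the actual payload your webhook receives before relying on it.

```python
import json

# Status values listed in the FAQ answer on webhook notifications.
IN_PROGRESS_STATUSES = {"Pending", "Provisioning", "Running"}
TERMINAL_STATUSES = {"Completed", "Error", "Cancelled"}


def classify_status(status: str) -> str:
    """Map a collector run status to a coarse category."""
    if status in IN_PROGRESS_STATUSES:
        return "in_progress"
    if status in TERMINAL_STATUSES:
        return "failed" if status == "Error" else "finished"
    return "unknown"


def handle_event(raw_body: str) -> str:
    """Parse a webhook request body and return a short message.

    ASSUMPTION: the body is JSON with a top-level "status" field and an
    optional "collector" field. The real payload layout may differ.
    """
    event = json.loads(raw_body)
    status = event.get("status", "")
    name = event.get("collector", "<unnamed collector>")
    category = classify_status(status)
    if category == "failed":
        return f"ALERT: run of {name} ended with status {status}"
    if category == "finished":
        return f"Run of {name} ended with status {status}"
    if category == "in_progress":
        return f"Run of {name} is {status}"
    return f"Ignoring event with unrecognized status {status!r}"
```

In practice, handle_event would be called from whatever HTTP endpoint you register as the webhook target, and the returned message could be forwarded to a chat channel or an alerting tool. Treating unrecognized statuses as ignorable keeps the receiver from failing if new status values are introduced.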