
Metadata collectors

Metadata collectors connect to source systems such as databases and BI tools, then collect and generate the metadata that belongs in the data.world catalog. The collector we provide that is used most often is DWCC (the data.world catalog collector).

If the technical lineage server (Manta) is in scope for the customer solution, DWCC collects metadata from Manta and passes it to data.world. This is how data.world receives information from Manta; Manta does not connect to data.world directly.

By default, DWCC runs in the cloud, fully managed by data.world as part of the Connection Manager using tasks. If you have the data.world Bridge set up, DWCC uses your bridge connection just as the data virtualization / federated query engine can. See our article on tasks for more details.

[Image: Connection Manager bridge connections]
[Image: Create a task (DWCC hosted)]

However, for security or infrastructure operations reasons, you may opt to run DWCC in your on-premises compute environment behind your own firewall. If you do, keep in mind that you cannot use the Connection Manager UI. Instead, you use the DWCC command line interface, which is documented in detail in our help documentation. See Configurations for dwcc connections.

DWCC ships as a Docker image which can be loaded and run with a series of command line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
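For a manual upload you first need to find the file the collector wrote. The following is a minimal sketch, assuming the output lands in a local directory that you pass in; the function name latest_catalog_file and the directory are hypothetical and not part of DWCC.

```shell
# Print the most recently modified *.dwec.ttl file in a directory.
# The directory argument is an assumption: use wherever your collector run writes output.
latest_catalog_file() {
  dir="${1:-.}"
  latest=""
  for f in "$dir"/*.dwec.ttl; do
    # When nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    if [ -z "$latest" ] || [ "$f" -nt "$latest" ]; then
      latest="$f"
    fi
  done
  if [ -z "$latest" ]; then
    return 1   # no catalog file found
  fi
  printf '%s\n' "$latest"
}
```

Calling latest_catalog_file with your output directory prints the path of the newest catalog file, or returns a non-zero status if none exists.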

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is also available. Contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.
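The first two prerequisites can be checked from the command line before you install anything. This is a rough sketch for Linux hosts only: it assumes /proc/meminfo is available, treats 2 GB as 2 x 1024 x 1024 kB, and the function names are hypothetical. The kernel usually reports slightly less than the installed RAM, so a shortfall is only a warning here.

```shell
# 2 GB expressed in kB, the unit used by /proc/meminfo.
MIN_MEM_KB=$((2 * 1024 * 1024))

# Succeeds when the given amount of memory (in kB) is at least 2 GB.
meets_min_memory() {
  [ "$1" -ge "$MIN_MEM_KB" ]
}

# Checks that Docker is on the PATH and warns if memory looks low.
check_dwcc_prereqs() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "Docker not found; see https://docs.docker.com/get-docker/" >&2
    return 1
  fi
  if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if ! meets_min_memory "$mem_kb"; then
      echo "Warning: only ${mem_kb} kB of memory reported" >&2
    fi
  fi
  echo "Docker is installed"
}
```

Run check_dwcc_prereqs on the machine that will host the collector. Read access to the resources being cataloged still has to be verified against each source system.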

Installing the collector
  1. For the latest version of DWCC, open a command line interface (e.g., Terminal on Mac) and enter: docker pull datadotworld/dwcc

  2. For a specific version of DWCC, enter: docker pull datadotworld/dwcc:X.XX, where X.XX is the version number you want to pull.
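If you script the installation, the two cases above can be folded into one helper. This is a small sketch; the function name dwcc_image_ref and the DWCC_VERSION variable are hypothetical conveniences, not part of DWCC.

```shell
# Build the image reference: untagged for the latest version,
# or datadotworld/dwcc:<version> when a version is given.
dwcc_image_ref() {
  version="${1:-}"
  if [ -n "$version" ]; then
    printf 'datadotworld/dwcc:%s\n' "$version"
  else
    printf 'datadotworld/dwcc\n'
  fi
}

# Usage (not executed here):
#   docker pull "$(dwcc_image_ref)"                 # latest version
#   docker pull "$(dwcc_image_ref "$DWCC_VERSION")" # the version you set in DWCC_VERSION
```

Pinning a version in scheduled jobs keeps collector behavior stable until you choose to upgrade.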

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM datadotworld/dwcc
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -cacerts \
    -storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t dwcc-cert .

Finally, change the docker run command to use dwcc-cert instead of datadotworld/dwcc.
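After the build, you may want to confirm that the certificate was imported. The sketch below lists the startssl alias in the image's Java trust store. It assumes the image tag is lowercase (Docker requires lowercase repository names, so dwcc-cert is used here), that keytool is on the image's PATH, and that overriding the entrypoint is acceptable for a one-off check. The function name and the DOCKER variable are hypothetical.

```shell
# List one alias in the cacerts trust store of a built image.
# DOCKER can be overridden (for example with "echo") for a dry run.
verify_imported_cert() {
  image="$1"
  alias_name="${2:-startssl}"
  "${DOCKER:-docker}" run --rm --entrypoint keytool "$image" \
    -list -cacerts -storepass changeit -alias "$alias_name"
}

# Usage (not executed here):
#   verify_imported_cert dwcc-cert startssl
```

keytool exits with an error if the alias is missing, so a non-zero status indicates the import did not take effect.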

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
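If you schedule with cron, the cadences discussed above map to crontab entries as in the sketch below. The 02:00 run time, the function name, and the wrapper script path are all hypothetical; the wrapper would contain your own docker run command for the collector.

```shell
# Print a crontab line for a given cadence and wrapper script.
# Fields: minute hour day-of-month month day-of-week command
dwcc_cron_line() {
  cadence="$1"
  script="$2"
  case "$cadence" in
    daily)   schedule='0 2 * * *' ;;  # every day at 02:00
    weekly)  schedule='0 2 * * 0' ;;  # Sundays at 02:00
    monthly) schedule='0 2 1 * *' ;;  # first day of each month at 02:00
    *) echo "unknown cadence: $cadence" >&2; return 1 ;;
  esac
  printf '%s %s\n' "$schedule" "$script"
}

# Usage (not executed here):
#   dwcc_cron_line daily /opt/dwcc/run-dwcc.sh
```

Add the printed line to the crontab of the user that runs DWCC, for example through crontab -e.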