The data.world Collector self-hosted by the customer

For security or infrastructure operations reasons, you may opt to run the data.world Collector in your on-premise compute environment behind your own firewall. In that case, keep in mind that you cannot use the Connection Manager UI. Instead, you use the Collector's command line interface, which is documented in our data.world Collector documentation.

The data.world Collector ships as a Docker image which can be loaded and run with a series of command line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
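For the manual upload path, it can help to locate the most recent output file after a run. The following is a minimal sketch that assumes only the .dwec.ttl extension described above; the function name latest_catalog_file and the /tmp directory in the usage line are illustrative, not part of the Collector.

```shell
#!/usr/bin/env bash
# Print the most recently modified *.dwec.ttl file in a directory
# (defaults to the current directory). Prints nothing if none exist.
latest_catalog_file() {
  local dir="${1:-.}"
  # ls -t sorts newest first; errors from an unmatched glob are hidden
  ls -t "$dir"/*.dwec.ttl 2>/dev/null | head -n 1
}
```

Usage, assuming the Collector wrote its output to /tmp: latest_catalog_file /tmp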

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/.

  • The user defined to run the data.world Collector must have read access to all resources being cataloged.

  • The computer running the data.world Collector needs a Java Runtime Environment (JRE), version 11 or higher. OpenJDK is one option.
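The Docker and JRE prerequisites above can be checked before installing. This is a rough pre-flight sketch, not an official tool: the function names are hypothetical, it assumes docker and java are expected on the PATH, and it does not check memory, CPU, or source access.

```shell
#!/usr/bin/env bash
# Extract the major version from a Java version string.
# Handles the legacy scheme ("1.8.0_292" is Java 8) and the modern one ("11.0.2").
java_major_version() {
  local v="$1"
  v="${v#1.}"          # drop the legacy "1." prefix if present
  v="${v%%[.+_-]*}"    # keep only the leading number
  echo "$v"
}

# Succeeds when the given version string is Java 11 or higher.
java_ok() {
  local major
  major="$(java_major_version "$1")"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as failure
  esac
  [ "$major" -ge 11 ]
}

# Report missing prerequisites on stderr; returns non-zero if any check fails.
check_prereqs() {
  local status=0 raw
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker: not found on PATH" >&2
    status=1
  fi
  if ! command -v java >/dev/null 2>&1; then
    echo "java: not found on PATH" >&2
    status=1
  else
    # java -version writes to stderr, e.g.: openjdk version "11.0.2" 2019-01-15
    raw="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
    if ! java_ok "$raw"; then
      echo "java: version 11 or higher required, found '${raw}'" >&2
      status=1
    fi
  fi
  return "$status"
}
```

Run check_prereqs after sourcing the script; a zero exit status means both tools were found and the Java version is 11 or higher.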

Installing the collector
  1. For the latest version of the data.world Collector, open a command line interface (e.g., Terminal on Mac) and enter: docker pull datadotworld/dwcc

  2. For a specific version of the data.world Collector, enter docker pull datadotworld/dwcc:X.XX where X.XX is the version number you want to pull.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM datadotworld/dwcc:x.y
ADD ./ca.der ca.der 
RUN keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file ca.der

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.80).

Then, in the directory with that Dockerfile:

docker build -t dwcc-cert .

Note

The command must be all lower case, and the trailing dot (.) is required: it tells Docker to use the current directory as the build context.
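To confirm that the certificate actually landed in the image's trust store, the build can be followed by a keytool listing. The sketch below wraps both steps. It assumes keytool is on the image's PATH (the Dockerfile above relies on this as well) and reuses the alias startssl and store password changeit from that Dockerfile. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Succeeds only if the image name is non-empty and has no uppercase letters.
is_lowercase_name() {
  case "$1" in
    ""|*[[:upper:]]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Build the extended image from the Dockerfile in the current directory,
# then list the imported certificate to verify that the import worked.
build_cert_image() {
  local name="${1:-dwcc-cert}"
  if ! is_lowercase_name "$name"; then
    echo "image name must be lower case: $name" >&2
    return 1
  fi
  docker build -t "$name" . || return 1
  # Override the entrypoint so the container runs keytool instead of the Collector.
  docker run --rm --entrypoint keytool "$name" \
    -list -cacerts -storepass changeit -alias startssl
}
```

Calling build_cert_image with no arguments builds dwcc-cert. If the final keytool command prints an entry for startssl, the certificate is in the image's default trust store.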

Finally, change the docker run command to use dwcc-cert instead of dwcc. Here is an example command for cataloging Tableau with a custom SSL certificate:

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log dwcc-cert \
catalog-tableau --tableau-api-base-url <baseUrl> \
--tableau-password <password> --tableau-username <username> \
-a <account> -n <catalogName> -o "/dwcc-output"

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
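As one way to set up the cron option, the sketch below maps the daily, weekly, and monthly cadences discussed above to cron schedules and prints a crontab line. The 02:00 run time, the wrapper script path /opt/dwcc/run-collector.sh, and the log path are hypothetical placeholders; the wrapper would contain your own docker run command.

```shell
#!/usr/bin/env bash
# Map a cadence name to a cron schedule expression (all runs at 02:00).
cron_schedule() {
  case "$1" in
    daily)   echo "0 2 * * *" ;;   # every day
    weekly)  echo "0 2 * * 0" ;;   # every Sunday
    monthly) echo "0 2 1 * *" ;;   # first day of each month
    *) echo "unknown frequency: $1" >&2; return 1 ;;
  esac
}

# Print a crontab line that runs a script on the given cadence and
# appends its stdout and stderr to a log file.
# Arguments: <daily|weekly|monthly> <script path> <log path>
crontab_line() {
  local schedule
  schedule="$(cron_schedule "$1")" || return 1
  echo "$schedule $2 >> $3 2>&1"
}
```

For example, crontab_line weekly /opt/dwcc/run-collector.sh /var/log/dwcc.log prints a line that can be pasted into crontab -e.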