
Run DWCC collectors with Docker

The DWCC collectors are distributed as Docker images on Docker Hub. To run a collector, specify its fully qualified image name on the command line: datadotworld/dwcc:x.y, where x.y is the version of the collector that you wish to run. The Docker client on your machine pulls the image if you don’t already have it, so there is no need to install it explicitly. The image is run with a series of command-line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Database collectors that are shipped with drivers

Our database collectors use JDBC (the Java Database Connectivity API) to connect to each database so that they can harvest metadata. We bundle a database driver with some of the database collectors. For licensing reasons, we cannot bundle other drivers. Please check with your database vendor for the proper driver to use with your database version. The drivers we include are:

  • Hive

  • Postgres

  • Presto

  • Snowflake

  • SQL Server

Where to get a DWCC collector

The DWCC collectors are distributed as images on Docker Hub. If you run a DWCC collector from a CLI, pulling the correct version is part of the run command. Docker first looks for the image locally, and if it doesn't find it, it downloads it from Docker Hub automatically.
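
As a minimal sketch, the image can also be pulled explicitly before the first run. The helper names dwcc_image and pull_dwcc are hypothetical and used only for illustration; docker pull is the standard Docker CLI command, and 2.36 is simply the example version used on this page.

```shell
# Hypothetical helper: print the fully qualified image name for a released version.
dwcc_image() {
  printf 'datadotworld/dwcc:%s\n' "$1"
}

# Hypothetical helper: pull that image from Docker Hub ahead of time.
# This is optional, because docker run pulls a missing image on its own.
pull_dwcc() {
  docker pull "$(dwcc_image "$1")"
}
```

Calling pull_dwcc 2.36 would run docker pull datadotworld/dwcc:2.36.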


If you are unsure which version of a DWCC collector to use, the most current releases of the collectors are always listed in the Catalog collector change log. If you don't know the complete version name, or if you would like to see a list of the DWCC collector versions, you can go to our Docker Hub repositories. There are two repositories, one for released versions and one for release candidate versions:

  • datadotworld/dwcc - Contains all of the officially released versions of the DWCC.

  • datadotworld/dwcc-rc - Contains the "release candidate" versions. Release candidates are test versions; they are not officially released or supported. They are primarily used for quick customer fixes until the official release comes out.


Do not use the versions named Latest from either repository--only specify numeric releases (e.g., dwcc:2.36).


Do not use a release candidate (rc) version of the DWCC unless you have been explicitly directed to do so by your customer success or support representative.

The name you specify on the CLI must exactly match the version name on Docker Hub. For example:

  • The name of the DWCC collector version 2.36 is datadotworld/dwcc:2.36

  • The name of the third release candidate (RC) of DWCC collector version 2.37 is datadotworld/dwcc-rc:2.37-rc-0003 (RC numbers are padded to four digits).
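
The naming rules above can be sketched as two small shell helpers. Both function names are hypothetical, and the check assumes that release tags always have the major.minor form shown in the examples on this page; adjust the pattern if your version looks different.

```shell
# Hypothetical helper: succeed only for numeric release tags such as 2.36,
# so that tags like "Latest" are rejected.
is_numeric_release() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+$'
}

# Hypothetical helper: print the image name for release candidate $2 of version $1.
# printf pads the candidate number to four digits (pass it without leading zeros).
dwcc_rc_image() {
  printf 'datadotworld/dwcc-rc:%s-rc-%04d\n' "$1" "$2"
}
```

For example, dwcc_rc_image 2.37 3 prints datadotworld/dwcc-rc:2.37-rc-0003, and is_numeric_release Latest fails.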

Versions of metadata sources that the collectors were tested against

Here is a list of the API-based collectors in DWCC and the versions of the underlying metadata sources and/or their APIs that we developed and tested against. The collectors may function properly with other versions, but these are the versions we specifically developed against:

  • catalog-awsglue: Collector uses AWS SDK for Java 1.11. We believe this is associated with AWS API version 2020-04-08, per https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md

  • catalog-bigquery: Collector uses Google BigQuery SDK version 1.107.

  • catalog-dremio: Collector developed/tested against Dremio version 4.7.2-202008180758160892-1a34c463.

  • catalog-domo: Domo does not seem to version its API or express a particular “version” of the service/application. Therefore all we can say is that we developed against the API as it existed on the release date.

  • catalog-looker: Collector developed/tested against Looker API v 3.1.

  • catalog-manta: Collector developed/tested against MANTA 1.30 and 1.31.

  • catalog-powerbi: Collector developed/tested against PowerBI Cloud API v 1.0.

  • catalog-tableau: Collector developed/tested against Tableau API versions 3.7-3.10 on Tableau Server v 2020-04.

  • catalog-openapi: Expects OpenAPI specification inputs (aka “Swagger files”) that conform to OpenAPI specification version 2.0.

Validating a manually installed DWCC collector

If you have manually installed a DWCC collector image instead of pulling it as part of the run command in the CLI, you can validate that it is an authorized version by checking the image's hash. The hash for every released version after 2.36 is provided right below the version number in the Catalog collector change log.


To compare the hash from your version to the authorized version, run the following command from your CLI:

docker inspect datadotworld/dwcc:x.y

where x.y is the version of the release (e.g., 2.36).

The command returns a JSON description of the image that includes a RepoDigests entry.


Compare the value in Digest with the value in RepoDigests. If they are the same, you have an authorized version. If they are not the same, contact support.
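
The comparison can also be scripted. The sketch below assumes that the Digest is the hash published in the change log and that it appears as the trailing part of a RepoDigests entry (entries normally look like name@sha256:...). The function names are hypothetical; docker inspect --format with a Go template is standard Docker CLI behavior.

```shell
# Hypothetical helper: read RepoDigests entries from stdin, one per line, and
# succeed if any of them ends with the expected digest given as $1.
digest_matches() {
  expected="$1"
  [ -n "$expected" ] || return 1
  while IFS= read -r entry; do
    case "$entry" in
      *"$expected") return 0 ;;
    esac
  done
  return 1
}

# Hypothetical wrapper: $1 = collector version, $2 = digest from the change log.
verify_dwcc() {
  docker inspect --format '{{range .RepoDigests}}{{println .}}{{end}}' \
    "datadotworld/dwcc:$1" | digest_matches "$2"
}
```

A call such as verify_dwcc 2.36 followed by the published hash succeeds only when the local image carries that digest. If it fails, contact support as described above.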

Editing severity level of reported error messages for DWCC collectors

You can set the level (severity) of the log messages that DWCC collectors write to the console and log file. By default, the collectors write “info” level messages; you can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). Debug logging is useful when troubleshooting problems with a DWCC collector.

If you are using Docker, to set the level to something other than "info", add the option -e log_level=DEBUG to your docker run ... command.
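
As a rough sketch, the log level can be checked and passed along by a small wrapper. Both function names are hypothetical. The wrapper reuses the -it --rm flags from the license command on this page, assumes that INFO is accepted alongside the three levels named above, and passes your own collector options through unchanged.

```shell
# Hypothetical helper: print the environment setting for a supported log level.
# INFO is assumed to be valid because "info" is the documented default.
log_level_setting() {
  case "$1" in
    ERROR|WARN|INFO|DEBUG) printf 'log_level=%s\n' "$1" ;;
    *) echo "unsupported log level: $1" >&2; return 1 ;;
  esac
}

# Hypothetical wrapper: $1 = log level, $2 = collector version,
# remaining arguments = your usual collector options, passed through unchanged.
run_dwcc_with_level() {
  setting="$(log_level_setting "$1")" || return 1
  version="$2"
  shift 2
  docker run -it --rm -e "$setting" "datadotworld/dwcc:$version" "$@"
}
```

With this in place, run_dwcc_with_level DEBUG 2.36 followed by your collector options runs version 2.36 with debug logging turned on.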

How to display DWCC collector license information

To display the licensing information for any version of a DWCC collector after 2.24, run the following command in your terminal window:

docker run -it --rm datadotworld/dwcc:X.XX display-license

where X.XX is the version number for the DWCC collector.