Enterprise docs

About DWCC collectors

The DWCC collectors are distributed as Docker images on Docker Hub. To run a collector, use its fully qualified image name (datadotworld/dwcc:x.y, where x.y is the version of the collector that you wish to run). The Docker client on your machine pulls the image if you don’t have it already, so there is no need to install it explicitly. The image is run with a series of command-line (CLI) options and outputs a file with the extension .dwec.ttl. You can upload the file to data.world manually, or you can have the catalog collector upload it automatically using an API token.

DWCC collectors that are shipped with JDBC drivers

We include JDBC drivers for our various metadata collectors where we have a license to do so. The drivers we include are:

  • Athena

  • Derby

  • Hive

  • Postgres

  • Presto

  • Snowflake

  • SQL Server

For any other collector, you must provide the JDBC driver yourself. If you run a DWCC collector without a driver, you will get a database connection error.
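As a quick pre-flight check, the list above can be encoded in a small shell helper that reports whether you need to supply your own driver. This is only a sketch: the function name driver_is_bundled and the database labels it accepts are illustrative assumptions, not part of DWCC.

```shell
# Hypothetical helper (not part of DWCC): succeeds if the JDBC driver for the
# given database is in the bundled list above, fails otherwise.
driver_is_bundled() {
  # Normalize to lowercase so "Postgres" and "postgres" both match.
  db=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$db" in
    athena|derby|hive|postgres|presto|snowflake|"sql server") return 0 ;;
    *) return 1 ;;
  esac
}

# Example: decide whether a driver must be provided before running a collector.
if driver_is_bundled "Oracle"; then
  echo "driver bundled"
else
  echo "provide your own JDBC driver"
fi
```

Keeping the list in one case statement makes it easy to update if the set of bundled drivers changes in a later release.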

Where to get a DWCC collector

The DWCC collectors are distributed as images on Docker Hub. It is no longer necessary to manually install the Docker image of a collector on your system. If your requirements dictate that you run a DWCC collector from a CLI, pulling the correct version is part of the run command. Docker first looks for the image locally, and if it doesn't find it, it downloads the image from Docker Hub automatically:

dwcc_and_cli.png
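If you prefer to fetch the image ahead of time, the same local-first lookup can be done explicitly. The following is a minimal sketch, using 2.36 as an example version. The ensure_image function and the DOCKER variable are illustrative names; docker image inspect and docker pull are standard Docker CLI commands.

```shell
# DOCKER lets you substitute another Docker-compatible client; defaults to docker.
DOCKER="${DOCKER:-docker}"

# Pull the image only if it is not already present locally.
ensure_image() {
  image="$1"
  if "$DOCKER" image inspect "$image" >/dev/null 2>&1; then
    echo "already present: $image"
  else
    "$DOCKER" pull "$image"
  fi
}

# Usage (2.36 is an example version):
#   ensure_image datadotworld/dwcc:2.36
```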

If you are unsure which version of a DWCC collector to use, the most current releases of the collectors are always listed in the Catalog collector change log. However, if you don't know the complete version name, or if you would like to see a list of the DWCC collector versions, you can go to our Docker Hub repositories. There are two repositories, one for released versions and one for release candidate versions:

  • datadotworld/dwcc - Contains all of the officially released versions of the DWCC.

  • datadotworld/dwcc-rc - Contains the "release candidate" versions. Release candidates are test versions that are not officially released or supported. They are primarily used for quick customer fixes until the official release comes out.

Warning

Do not use a release candidate (rc) version of the DWCC unless you have been explicitly directed to do so by your customer success or support representative.

The name you specify on the CLI must exactly match the version name on Docker Hub. For example:

  • The name of the DWCC collector version 2.36 is datadotworld/dwcc:2.36

  • The name of the third release candidate of DWCC collector version 2.37 is datadotworld/dwcc-rc:2.37-rc-0003 (RC numbers are padded to four digits).
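The naming rules above can be sketched as a small shell function. The name dwcc_image is a hypothetical helper introduced here for illustration; only the repository names and the four-digit RC padding come from the examples above.

```shell
# Hypothetical helper: build the image name for a release or a release candidate.
#   dwcc_image VERSION         -> released image
#   dwcc_image VERSION RC_NUM  -> release candidate, RC number padded to 4 digits
dwcc_image() {
  version="$1"
  rc="${2:-}"
  if [ -z "$rc" ]; then
    printf 'datadotworld/dwcc:%s\n' "$version"
  else
    printf 'datadotworld/dwcc-rc:%s-rc-%04d\n' "$version" "$rc"
  fi
}

dwcc_image 2.36      # prints datadotworld/dwcc:2.36
dwcc_image 2.37 3    # prints datadotworld/dwcc-rc:2.37-rc-0003
```

Pass the RC number as a plain integer (3, not 0003) so that printf applies the padding itself.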

Metadata source versions the collectors were tested against

Here is a list of the API-based collectors in DWCC and the versions of the underlying metadata sources and/or their APIs that we have developed and tested against. The collectors may well function properly with other versions, but these are the versions we specifically developed against:

  • catalog-awsglue: Collector uses AWS SDK for Java 1.11. We believe this is associated with AWS API version 2020-04-08, per https://github.com/aws/aws-sdk-java/blob/master/CHANGELOG.md

  • catalog-bigquery: Collector uses Google BigQuery SDK version 1.107.

  • catalog-dremio: Collector developed/tested against Dremio version 4.7.2-202008180758160892-1a34c463.

  • catalog-domo: Domo does not seem to version its API or express a particular “version” of the service/application. Therefore all we can say is that we developed against the API as it existed on the release date.

  • catalog-looker: Collector developed/tested against Looker API v 3.1.

  • catalog-manta: Collector developed/tested against MANTA 1.30 and 1.31.

  • catalog-powerbi: Collector developed/tested against PowerBI Cloud API v 1.0.

  • catalog-tableau: Collector developed/tested against Tableau API versions 3.7-3.10 on Tableau Server v 2020-04.

  • catalog-openapi: Expects OpenAPI specification inputs (aka “swagger files”) that conform to OpenAPI Specification version 2.0.

Validating a manually installed DWCC collector

If you have manually installed a DWCC collector image instead of pulling it as part of the run command in the CLI, you can validate that it is an authorized version by using the hash on the file. The hash for every released version after 2.36 is provided right below the version number in the Catalog collector change log:

DWCC_hash.png

To compare the hash from your version to the authorized version, run the following command from your CLI:

docker inspect datadotworld/dwcc:x.y

where x.y is the version of the release (e.g., 2.36).

You will get back something that looks like this:

check_hash.png

Compare the value in Digest with the value in RepoDigests and if they are the same, you have an authorized version. If they are not the same, contact support.
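The comparison can also be scripted rather than done by eye. The sketch below assumes the expected hash is a sha256 digest that you copy from the change log. The function names digest_matches and local_repo_digest are hypothetical; docker inspect with --format is a standard Docker CLI option.

```shell
# Hypothetical helper: compare an expected digest with a RepoDigests entry.
# A RepoDigests entry looks like "repository@sha256:<hex>"; only the part
# after "@" is compared.
digest_matches() {
  expected="$1"
  repo_digest="$2"
  [ "${repo_digest##*@}" = "${expected##*@}" ]
}

# Read the first RepoDigests entry of a local image (requires Docker).
local_repo_digest() {
  docker inspect --format '{{index .RepoDigests 0}}' "$1"
}

# Usage (x.y is the version; paste the expected hash from the change log):
#   digest_matches "sha256:<hash from change log>" \
#     "$(local_repo_digest datadotworld/dwcc:x.y)" \
#     && echo "authorized version" || echo "mismatch: contact support"
```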

Editing severity level of reported error messages for DWCC collectors

You can set the level (severity) of log messages that DWCC collectors write to the console and log file. By default, collectors write “info” level messages; you can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). This is useful, for example, when you need to run a DWCC collector with debug logging turned on to troubleshoot a problem.

If you are using Docker, to set the level to something other than "info", add the option -e log_level=DEBUG (or another level) to your docker run command.
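To avoid typos in the level name, the option can be generated by a small helper. This is a sketch only: log_level_option is a hypothetical function, and accepting INFO as an explicit value and lowercase input are assumptions based on the levels described above. The remaining docker run options for your collector are not shown.

```shell
# Hypothetical helper: validate a level and print the matching docker run option.
log_level_option() {
  # Accept lowercase input such as "debug" and normalize it to uppercase.
  level=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$level" in
    ERROR|WARN|INFO|DEBUG) printf '%s\n' "-e log_level=$level" ;;
    *) echo "unsupported log level: $1" >&2; return 1 ;;
  esac
}

log_level_option debug   # prints -e log_level=DEBUG
```

The printed option goes between docker run and the image name, alongside whatever other options your collector requires.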

How to display DWCC collector license information

To display the licensing information for any version of a DWCC collector after 2.24, run the following command in your terminal window:

docker run -it --rm datadotworld/dwcc:X.XX display-license

where X.XX is the version number for the DWCC collector.