The Collector FAQ
What is the Collector?

Unless you use the Connection Manager to catalog your metadata, you will use a command-line program called the Collector to collect your metadata. The Collector pulls only metadata from your source; it does not collect any of your data.

Why use Docker to run the Collector?

Docker is an open-source containerization platform: a tool for building and deploying applications. The idea is that you package your code together with all of its dependencies (runtime, system tools, system libraries, and settings) into a standard, deployable unit called a container image. An image becomes a running container when it is executed on Docker Engine. Because containers isolate software from its environment, a containerized application runs the same way regardless of the underlying infrastructure, for example across development and staging environments. This makes Docker a reliable way to distribute and run the Collector for both Linux and Windows-based environments.

Do I need to have a driver to run the Collector?

Database sources require a JDBC driver to run the Collector. We bundle a JDBC driver with some of the database collectors. For licensing reasons, we cannot bundle other drivers. The drivers we include are:

  • Hive (for Hive and Hive Metastore)

  • Postgres

  • Presto

  • Snowflake

  • SQL Server

If you are cataloging one of the following database sources, please check with the database vendor for the proper driver to use with your version. You will need to obtain and license the driver yourself, and pass the full path to the directory containing it as the value of the appropriate system property (examples are shown in the scripts below):

  • DB2

  • Databricks

  • Denodo

  • Dremio

  • Infor ION

  • MySQL

  • Oracle

  • Redshift

  • SQL Anywhere

  • Vertica
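As a sketch of what this looks like in practice (the host path, the target path, and the collector arguments below are illustrative assumptions, not exact Collector syntax; see the Collector documentation for your source):

```shell
# Illustrative sketch only: mount a local directory containing your licensed
# JDBC driver .jar into the container so the Collector can load it.
# /opt/jdbc-drivers and /dwcc/jdbcdrivers are assumed paths for this example.
docker run -it --rm \
  -v /opt/jdbc-drivers:/dwcc/jdbcdrivers \
  datadotworld/dwcc:X.XX <collector-arguments>
```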

How many Collectors do I need?

You can use one Collector image to catalog as many data sources as you have. All you need to do is change the name of the catalog source and the parameters on the command line.
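For example (a sketch only: the per-source arguments shown here are placeholders, not exact Collector syntax):

```shell
# Illustrative only: one Collector image, run once per data source, with
# different per-source parameters each time (shown as placeholders).
docker run -it --rm datadotworld/dwcc:X.XX <postgres-collector-arguments>
docker run -it --rm datadotworld/dwcc:X.XX <redshift-collector-arguments>
```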

What is needed to run a catalog collector?

Because the collectors are shipped as Docker images, you need to have Docker installed on your local machine. If you can't use Docker, we also have a Java version of the collectors available. For more information, see the Docker documentation.

Note that Docker has recently changed its commercial licensing. Please review Docker's licensing requirements.

The computer running the catalog collector should have network access to the data source.

The user running the catalog collector must have read access to the data resource.

We have licensing permission to distribute some JDBC drivers with the Collector. However, for other data sources you will need to have the JDBC driver for the data source installed on the local machine. The collectors assume the driver .jar file is in the ../jdbcdrivers directory.

Finally, a minimum of 2 GB of memory and a 2 GHz processor are required for all sources. Certain data sources (like BigQuery) may have additional requirements.

What operating system does the collector Docker image use?

Debian Buster (the codename for Debian 10).

How are credentials, like usernames and passwords, stored?

Credentials are not stored anywhere. Each time the command is run, the Collector authenticates with the service it is collecting from: the user enters the authentication information when running the command, or the scheduled job that runs the command passes the credentials in.
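One common pattern (a sketch under assumed names; the Collector's actual credential parameters vary by source) is to keep credentials in environment variables or a secrets manager and pass them in at run time rather than writing them to disk:

```shell
# Illustrative sketch: credentials live in the caller's environment (or a
# secrets manager), never on disk or in the image. DWCC_USER, DWCC_PASSWORD,
# and the collector arguments are assumed names for this example.
docker run -it --rm \
  -e DWCC_USER="$DB_USER" \
  -e DWCC_PASSWORD="$DB_PASSWORD" \
  datadotworld/dwcc:X.XX <collector-arguments>
```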

How often should the collector be run?

A collector should run as often as the organization wants to see updates to metadata reflected in their catalog. Here are some more specific guidelines that we’ve seen from our customer use-cases:

  • A collector that gathers metadata about a database: unless there are lots of schema changes, these are typically run once a week or once a month.

  • A collector for a data analytics tool: if you have an active user base that creates a lot of dashboards or reports, weekly to daily may make sense.

Does the Docker image need to be on the same machine as the database it is cataloging?

The Docker image needs to be on a machine that has access to the database or analytics tool it is cataloging. If it is on a different machine than the source it is gathering metadata from, network access must be set up between the machine running the Docker image and the machine hosting the metadata source.

Here’s an example:

A customer put the Docker image onto an EC2 instance and wanted to use that image to catalog a Salesforce instance. The EC2 instance needed to be able to make requests to the Salesforce API for the collector to successfully collect metadata.

How long does it take the Collector to run and catalog a source?

The best answer is “it depends”.

Some factors that can affect how long it takes to run the collector:

  • How many tables are in the database

  • How many fields those tables have

  • How many reports and dashboards there are in a data analytics tool

  • How many resources the server hosting the source being cataloged has

Since the Docker image is, at its core, a simple Java runtime environment, the code itself does not take long to boot up and run. Some collectors have filters that can isolate certain projects in a source to decrease the runtime.

Should I use a certificate (SSL) to provide extra security?

If the metadata source being cataloged is configured to use a self-signed certificate to secure TLS connections, then the Collector needs to be told to trust that certificate (since the native Java runtime trust store will not recognize it). The Collector documentation contains instructions on how to accomplish this. Note that this only applies if (typically on-premise) metadata sources have been set up to use self-signed certificates for TLS.

The typical example would be an on-premise Postgres or SQL Server database.
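As a general sketch of the underlying technique (the file names and alias below are assumptions; follow the Collector documentation for the exact steps it requires), a self-signed certificate can be added to a Java trust store with the JDK's keytool:

```shell
# Illustrative sketch: import a self-signed certificate (server-cert.pem is
# an assumed file name) into a trust store the Java runtime can consult.
keytool -importcert \
  -alias my-source-cert \
  -file server-cert.pem \
  -keystore custom-truststore.jks \
  -storepass changeit \
  -noprompt
```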

Do I need a Docker license? The Docker website says that certain enterprises do.

The license is for enterprises that need Docker Desktop. Linux environments can run Docker without Docker Desktop. If you need to use Docker on a machine running macOS or Windows and you are a qualifying enterprise (more than 250 employees or more than $10 million in annual revenue), you will need to purchase a license, according to Docker's terms.

How do I display the Collector license information?

To display the licensing information for any version of a Collector after 2.24, run the following command in your terminal window:

docker run -it --rm datadotworld/dwcc:X.XX display-license

where X.XX is the version number for the Collector.

How do I edit the severity level of reported error messages for the Collector?

You can set the level (severity) of log messages the Collector writes to the console and log file. By default, the Collector writes "info"-level messages; you can choose to write only errors (level="ERROR"), errors and warnings (level="WARN"), or all messages including debug traces (level="DEBUG"). Debug logging is useful when troubleshooting problems with the Collector.

If you are using Docker, to set the level to something other than "info", add the option -e log_level=DEBUG (or ERROR, or WARN) to your docker run command.
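For example (X.XX stands for your Collector version, and the collector arguments are placeholders):

```shell
# Run the Collector with debug-level logging enabled via the log_level
# environment variable; the collector arguments are placeholders.
docker run -it --rm -e log_level=DEBUG datadotworld/dwcc:X.XX <collector-arguments>
```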

How do I validate a manually installed Docker image for the Collector?

If you manually installed a Docker image for the Collector instead of pulling it as part of the run command in the CLI, you can validate that it is an authorized version by using the hash on the file. The hash for every released version after 2.36 is provided right below the version number in the Catalog collector change log:


To compare the hash from your version to the authorized version run the following command from your CLI:

docker inspect datadotworld/dwcc:x.y

where x.y is the version of the release (e.g., 2.36).

You will get back something that looks like this:


Compare the value in Digest with the value in RepoDigests. If they are the same, you have an authorized version. If they are not the same, contact support.
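The comparison itself can be scripted. In this sketch the sha256 values are made-up placeholders, and the `--format` extraction shown in the comment is one way to pull `RepoDigests` out of `docker inspect` output:

```shell
# Illustrative sketch: compare the published digest from the change log
# against the digest of the locally installed image.
# The sha256 values here are made-up placeholders.
published_digest="sha256:0123abcd"   # copied from the change log
# In practice the local value would come from:
#   docker inspect --format '{{index .RepoDigests 0}}' datadotworld/dwcc:x.y
local_digest="datadotworld/dwcc@sha256:0123abcd"

# Strip the "repository@" prefix, then compare the raw digests.
if [ "${local_digest#*@}" = "$published_digest" ]; then
  echo "authorized version"
else
  echo "digest mismatch - contact support"
fi
```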