The data.world Collector FAQ

Important

This FAQ applies only to collectors that you run on-premises.

What is the data.world Collector?

Unless you use the Connection Manager to catalog your metadata, you will use a command-line program called the data.world Collector to collect it. The Collector pulls only metadata from your source--it doesn't collect any data.

Where to get the data.world Collector

The data.world Collector is distributed as an image on Dockerhub. If you run the Collector from Docker, the run command will attempt to find the image locally, and if it doesn't find it, it will go to Dockerhub and download it automatically:

dwcc_and_cli.png

If you are running the Collector from a .jar file, you will get the correct file from customer support.

If you are unsure which version of the Collector to use, the most current releases are always listed in the Catalog collector change log. However, if you don't know the complete version name, or if you would like to see a list of the Collector versions, you can go to our Dockerhub repositories. There are two repositories, one for released versions and one for release candidate versions:

  • datadotworld/dwcc - Contains all of the officially released versions of the data.world Collector

  • datadotworld/dwcc-rc - Contains the "release candidate" versions. Release candidates are test versions; they are not officially released or supported. They are primarily used for quick customer fixes until the official release comes out.

Caution

Do not use the versions named Latest from either repository--only specify numeric releases (e.g., dwcc:2.36). The Latest tag refers to the most recent Docker image that has already been downloaded to the host machine. Using Latest in the command instead of a specific version number does not pull the latest version from the Docker repository.

Warning

Do not use a release candidate (rc) version of the Collector unless you have been explicitly directed to do so by your customer success or support representative.

The name you specify on the CLI should match exactly the version name on Dockerhub. For example:

  • The name of the Collector version 2.36 is datadotworld/dwcc:2.36

  • The name of the third RC collector version of 2.37 is datadotworld/dwcc-rc:2.37-rc-0003 (RC versions are padded to four digits).
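The naming rules above can be captured in a small helper so that scripts never fall back to Latest. This is a minimal sketch, not something shipped with the Collector: the function names dwcc_image and dwcc_rc_image are hypothetical, and only the repository names and tag formats are taken from the examples above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the Collector) that build pinned image
# references in the format shown on Dockerhub.

# Released version: datadotworld/dwcc:<version>.
# Refuses an empty tag or any spelling of "latest".
dwcc_image() {
  tag=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$tag" in
    ''|latest)
      echo "specify a numeric release such as 2.36" >&2
      return 1
      ;;
  esac
  printf 'datadotworld/dwcc:%s\n' "$1"
}

# Release candidate: datadotworld/dwcc-rc:<version>-rc-<NNNN>, where the RC
# number is zero-padded to four digits. Pass the number without leading zeros.
dwcc_rc_image() {
  printf 'datadotworld/dwcc-rc:%s-rc-%04d\n' "$1" "$2"
}

dwcc_image 2.36        # prints datadotworld/dwcc:2.36
dwcc_rc_image 2.37 3   # prints datadotworld/dwcc-rc:2.37-rc-0003
```

The output can be passed straight to Docker, for example docker pull "$(dwcc_image 2.36)".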

Why use Docker to run the data.world Collector?

Docker is an open source containerization platform for building and deploying applications. It is based on the idea that you can package your code with its dependencies into a deployable unit called a container, so that the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Images become containers when they run on Docker Engine. Available for both Linux and Windows-based applications, containerized software runs the same way regardless of the infrastructure. Containers isolate software from its environment and ensure that it works uniformly despite differences, for instance, between development and staging.

Do I need to have a driver to run the data.world Collector?

Database sources require a JDBC driver to run the Collector. We bundle a JDBC driver with some of the database collectors we catalog. For licensing reasons, we cannot bundle other drivers. The drivers we include are:

  • Databricks

  • Hive (for Hive and Hive Metastore)

  • PostgreSQL

  • Presto

  • Snowflake

  • SQL Server

If you are cataloging another source, check with the database vendor for the proper driver to use with your version. You will need to obtain and license the driver yourself, and pass the full path to the directory that contains it as the value of the corresponding system property.

How many data.world Collectors do you need?

You can use one Collector to catalog as many data sources as you have. All you need to do is change the name of the catalog source and the parameters on the command line.

What operating system does the collector Docker image use?

The Collector Docker image is based on the eclipse-temurin image for Java 17.

How are credentials, like usernames and passwords, stored?

They are not stored anywhere. Each time the command runs, it authenticates with the service it is collecting from. Either the user enters the authentication information when running the command, or the scheduled job that runs the command passes the credentials in each time.

How often should the collector be run?

A collector should run as often as the organization wants to see updates to metadata reflected in their catalog. Here are some more specific guidelines that we’ve seen from our customer use-cases:

  • A collector that gathers metadata about a database: unless there are lots of schema changes, these are typically run once a week or once a month.

  • A collector for a data analytics tool: if you have an active user base that makes a lot of dashboards or reports, weekly to daily might make sense.
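One way to follow these guidelines is to schedule each collector with cron. The sketch below only prints example crontab entries. The wrapper script paths are hypothetical placeholders for scripts that would contain your own docker run commands, and the times of day are arbitrary.

```shell
#!/bin/sh
# Print example crontab entries.
# Fields: minute hour day-of-month month day-of-week command.
# The script paths are hypothetical; each script would hold your docker run command.
dwcc_cron_entries() {
  cat <<'EOF'
# Database collector: weekly, Sundays at 02:00
0 2 * * 0 /opt/dwcc/run-database-collector.sh
# Analytics tool collector: daily at 03:00
0 3 * * * /opt/dwcc/run-analytics-collector.sh
EOF
}

dwcc_cron_entries
```

Review the output, then add the lines you want with crontab -e. Because cron provides no terminal, commands run this way generally need to omit the -it flags from docker run.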

Does the docker image need to be on the same machine as the database it is cataloging?

The Docker image needs to be located on a machine that has access to the database or analytics tool it is cataloging. If it is on a different machine than the source it gathers metadata from, network access must be set up between the two machines.

Here’s an example:

A customer put the Docker image onto an EC2 instance. They wanted to use that image to catalog a Salesforce instance. The EC2 instance needed to be able to make requests to the Salesforce API for the collector to successfully collect metadata.

How long does it take the Collector to run and catalog a source?

The best answer is “it depends”.

Some factors that can affect how long it takes to run the collector:

  • How many tables are in the database

  • How many fields those tables have

  • How many reports and dashboards there are in a data analytics tool

  • How many resources are available on the server that hosts the source you are cataloging

Since the Docker image is, at its core, a simple Java runtime environment, the code itself does not take a long time to boot up and run. Some collectors have filters that can isolate certain projects in a source to decrease the runtime.

Should I use a certificate (SSL) to provide extra security?

If the metadata source being cataloged is configured to use a self-signed certificate to secure TLS connections, then the data.world Collector needs to be told to trust that certificate (since the native Java runtime trust store will not recognize it). The data.world Collector documentation contains instructions on how to accomplish this. Note that this only applies if (typically on-premise) metadata sources have been set up to use self-signed certificates for TLS.

The typical example would be an on-premise PostgreSQL or SQL Server database.

Do I need a Docker license? The Docker website says that certain enterprises do.

The license is for enterprises that need Docker Desktop. Linux environments can run Docker without Docker Desktop. If you need to use Docker on a machine running macOS or Windows and you are a qualifying enterprise (more than 250 employees OR more than $10 million in annual revenue), you will need to purchase a license, according to Docker’s terms.

How do I display the data.world Collector license information?

To display the licensing information for any version of a Collector after 2.24, run the following command in your terminal window:

docker run -it --rm datadotworld/dwcc:X.XX display-license

where X.XX is the version number for the Collector.

How do I edit the severity level of reported error messages for the data.world Collector?

You can set the level (severity) of the log messages that the data.world Collector writes to the console and log file. By default, the Collector writes “info” level messages. You can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). Debug logging is useful for troubleshooting, for example when support asks you to run the Collector with it turned on.

If you are using Docker, to set the level to something other than "info", add the option -e log_level=DEBUG (or another level) to your docker run command.
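As a rough illustration, the option can be validated before it is added to the command. The helper name dwcc_log_option is hypothetical. ERROR, WARN, and DEBUG are the levels named above, while the INFO spelling for the default level is an assumption. The script prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: validate a log level and print the matching docker option.
# ERROR, WARN and DEBUG are the levels named above; INFO is assumed to be the
# spelling of the default level.
dwcc_log_option() {
  level=$(printf '%s' "${1:-INFO}" | tr '[:lower:]' '[:upper:]')
  case "$level" in
    ERROR|WARN|INFO|DEBUG)
      printf '%s\n' "-e log_level=$level"
      ;;
    *)
      echo "unknown log level: $1" >&2
      return 1
      ;;
  esac
}

# Print, rather than run, a command with debug logging enabled.
# Replace X.XX with your version and append your usual collector command.
echo "docker run -it --rm $(dwcc_log_option debug) datadotworld/dwcc:X.XX"
```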

How do I validate a manually installed Docker image for the data.world Collector?

If you manually installed a Docker image for the data.world Collector instead of pulling it as part of the run command in the CLI, you can validate that it is an authorized version by using the hash on the file. The hash for every released version after 2.36 is provided right below the version number in the Catalog collector change log:

DWCC_hash.png

To compare the hash from your version to the authorized version run the following command from your CLI:

docker inspect datadotworld/dwcc:x.y

where x.y is the version of the release (e.g., 2.36).

You will get back something that looks like this:

check_hash.png

Compare the value in Digest with the value in RepoDigests and if they are the same, you have an authorized version. If they are not the same, contact support.
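The comparison can also be scripted. The following is a minimal sketch under two assumptions: the expected hash has been copied by hand from the change log, and it is written in the same sha256:<hex> form that appears in RepoDigests. The function names are hypothetical, and the digest in the example is made up.

```shell
#!/bin/sh
# Hypothetical helpers for comparing image digests.

# Strip the repository name from a RepoDigests entry
# ("datadotworld/dwcc@sha256:<hex>" becomes "sha256:<hex>").
digest_of() {
  printf '%s\n' "${1##*@}"
}

# Succeeds only when the digest in the entry ($2) equals the expected digest ($1).
digest_matches() {
  [ "$(digest_of "$2")" = "$1" ]
}

# Reads the first RepoDigests entry of a local image ($1) and compares it with
# the expected digest ($2). Requires Docker and a locally installed image.
check_dwcc_image() {
  entry=$(docker inspect --format '{{index .RepoDigests 0}}' "$1") || return 1
  digest_matches "$2" "$entry"
}

# Example with a made-up digest (not a real Collector hash):
if digest_matches "sha256:0123abcd" "datadotworld/dwcc@sha256:0123abcd"; then
  echo "authorized version"
else
  echo "digest mismatch: contact support"
fi
```

On a machine with the image installed, a call such as check_dwcc_image datadotworld/dwcc:2.36 "sha256:<hash from the change log>" performs the same check against the real image.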

Why does my Collector list schemas in the log that are not part of the target database(s)?

We scan the declared schema, or all schemas if you use the -A parameter, but we only capture records within the target database.