Collector FAQ

How many collectors do you need?

You can use one DWCC to catalog as many data sources as you have. For each source, change the name of the catalog source and the parameters on the command line.

What is needed to run a catalog collector?

Because the collectors are shipped as Docker images, you need to have Docker installed on your local machine. If you can't use Docker, a Java version of the collectors is also available. For more information about Docker, see https://docs.docker.com/get-docker/.

Note

Docker has recently changed its commercial licensing. Please review Docker's current licensing requirements.

The computer running the catalog collector should have network access to the data source.

The user running the catalog collector must have read access to the data resource.

We have licensing permissions to distribute some JDBC drivers with DWCC. For other data sources, however, you will need to have the JDBC driver for the data source installed on the local machine. The DWCC collectors assume the driver .jar file is in the ../jdbcdrivers directory.
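Before running a collector against a source whose driver is not bundled, it can help to confirm that a driver file is actually where the collector expects it. The following is a minimal sketch, not part of DWCC: the function name `find_jdbc_drivers` is hypothetical, and it only lists .jar files in a directory without checking that any of them is the right driver for your source.

```python
from pathlib import Path


def find_jdbc_drivers(driver_dir):
    """Return the sorted names of .jar files found directly in driver_dir."""
    directory = Path(driver_dir)
    if not directory.is_dir():
        # A missing directory simply means no drivers were found.
        return []
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".jar"
    )


if __name__ == "__main__":
    # "../jdbcdrivers" is the location named above, resolved relative to
    # the current working directory.
    drivers = find_jdbc_drivers("../jdbcdrivers")
    if drivers:
        print("Found driver files:", ", ".join(drivers))
    else:
        print("No .jar files found in ../jdbcdrivers")
```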

Finally, a minimum of 2 GB of memory and a 2 GHz processor are required for all sources. Certain data sources (like BigQuery) may have additional requirements.

What operating system does the collector Docker image use?

Debian Buster (the development codename for Debian 10).

How are credentials, like usernames and passwords, stored?

They are not stored anywhere. Each time the command runs, it authenticates with the service it is collecting from. The user enters the authentication information when running the command, or the scheduled job that runs the command passes the credentials in each time.
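For scheduled jobs, one common way to pass credentials in at run time without writing them to a file is to read them from environment variables supplied by the scheduler. The sketch below illustrates that pattern only. The variable names `COLLECTOR_USER` and `COLLECTOR_PASSWORD`, the `{user}` and `{password}` placeholders, and both function names are assumptions made for illustration; they are not DWCC options, so substitute the parameters documented for your source.

```python
import os


def credentials_from_env(env=None,
                         user_var="COLLECTOR_USER",
                         password_var="COLLECTOR_PASSWORD"):
    """Read credentials from environment variables; nothing is written to disk.

    The variable names are hypothetical placeholders.
    """
    env = os.environ if env is None else env
    missing = [name for name in (user_var, password_var) if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return env[user_var], env[password_var]


def fill_command(template, user, password):
    """Substitute {user} and {password} placeholders in a list of arguments."""
    return [
        arg.replace("{user}", user).replace("{password}", password)
        for arg in template
    ]
```

The filled-in argument list can then be handed to `subprocess.run`, so the credentials exist only in the environment of the job and in the running process.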

How often should the collector be run?

A collector should run as often as the organization wants to see updates to metadata reflected in their catalog. Here are some more specific guidelines that we’ve seen from our customer use-cases:

  • A collector that gathers metadata about a database: unless there are lots of schema changes, these are typically run once a week or once a month.

  • A collector for a data analytics tool: if you have an active user base that makes a lot of dashboards or reports, weekly to daily might make sense.

What permissions do you need to run the collector?

General permission guidance by kind of collector:

  • Database collector: the account used to authenticate needs to read the information schema of the databases it is collecting metadata about. It does not need read access to the data within the tables.

  • Data analytics tool: we develop using admin credentials. If we have tested the collector with lesser permissions, they are noted in the DWCC doc for the source.

Does the Docker image need to be on the same machine as the database it is cataloging?

The Docker image needs to run on a machine that has access to the database or analytics tool it is cataloging. If that machine is different from the one hosting the metadata source, network access between the two machines needs to be set up.

Here’s an example:

A customer put the Docker image onto an EC2 instance and wanted to use it to catalog a Salesforce instance. The EC2 instance needed to be able to make requests to the Salesforce API for the collector to successfully collect metadata.
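A quick way to confirm basic network access from the machine that will run the collector is to attempt a TCP connection to the source's host and port. This is a minimal sketch: the function name `can_reach` is hypothetical, and the host and port you pass in are whatever your source actually uses. A successful connection shows only that the network path is open, not that the credentials or permissions are correct.

```python
import socket


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run it on the machine hosting the Docker image, for example `can_reach("db.example.internal", 5432)`, where both the hostname and the port are placeholders.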

How long does it take DWCC to run and catalog a source?

The best answer is “it depends”.

Some factors that can affect how long it takes to run the collector:

  • How many tables are in the database

  • How many fields those tables have

  • How many reports and dashboards there are in a data analytics tool

  • How many resources the server hosting the source you are cataloging has

Since the Docker image is, at its core, a simple Java runtime environment, the code itself does not take a long time to boot up and run. Some collectors have filters that can isolate certain projects in a source to decrease the runtime.

Should I use a certificate (SSL) to provide extra security?

If the metadata source being cataloged is configured to use a self-signed certificate to secure TLS connections, then DWCC needs to be told to trust that certificate (since the native Java runtime trust store will not recognize it). The DWCC documentation contains instructions on how to accomplish this. Note that this only applies if metadata sources (typically on-premises) have been set up to use self-signed certificates for TLS.

The typical example would be an on-premises PostgreSQL or SQL Server database.

Do I need a Docker license? The Docker website says that certain enterprises do.

The license is for enterprises that need Docker Desktop. Linux environments can run Docker without Docker Desktop. If you need to use Docker on a machine running macOS or Windows, and you are a qualifying enterprise (more than 250 employees OR more than $10 million in annual revenue), you will need to purchase a license, according to Docker's terms.
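The rule as stated in this answer can be written as a small check. This sketch encodes only the wording above, assuming revenue is expressed in US dollars; the function name is hypothetical, and Docker's own terms remain the authority on who needs a license.

```python
def needs_docker_desktop_license(os_name, employees, annual_revenue_usd):
    """Apply the rule described above: macOS or Windows, plus a qualifying enterprise."""
    # Linux can run Docker without Docker Desktop, so it is excluded here.
    uses_desktop_os = os_name.strip().lower() in {"macos", "windows"}
    # Either threshold alone is enough to qualify (the OR in the text).
    qualifying = employees > 250 or annual_revenue_usd > 10_000_000
    return uses_desktop_os and qualifying
```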