FAQ for on-premise metadata collectors

Important

This FAQ applies only to collectors that are run on-premise.

How many data.world Collectors do you need?

You can use one Collector to catalog as many data sources as you have. All you need to do is change the name of the catalog source and the parameters on the command line.
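As a rough illustration of reusing one Collector for several sources, the sketch below builds one docker run command per source and changes only the catalog source name and the source-specific parameters. The image name example/collector, the catalog-postgres subcommand, and every flag shown (--name, --server, --database) are hypothetical placeholders rather than documented Collector options; substitute the values from the Collector documentation for your source.

```python
from typing import Dict, List

# Hypothetical placeholder; use the image name given in the Collector documentation.
COLLECTOR_IMAGE = "example/collector:latest"


def build_command(subcommand: str, name: str, params: Dict[str, str]) -> List[str]:
    """Return a docker run argument list for one catalog source.

    The subcommand and flag names are illustrative assumptions,
    not the real Collector CLI options.
    """
    cmd = ["docker", "run", "--rm", COLLECTOR_IMAGE, subcommand, f"--name={name}"]
    # Append the source-specific parameters in a stable (sorted) order.
    for key in sorted(params):
        cmd.append(f"--{key}={params[key]}")
    return cmd


# Two sources cataloged with the same image; only names and parameters differ.
SOURCES = [
    ("catalog-postgres", "sales-db", {"server": "db1.example.com", "database": "sales"}),
    ("catalog-postgres", "hr-db", {"server": "db2.example.com", "database": "hr"}),
]

commands = [build_command(sub, name, params) for sub, name, params in SOURCES]

if __name__ == "__main__":
    for cmd in commands:
        print(" ".join(cmd))
```

The commands are only printed here, not executed, so the sketch can be checked without Docker installed.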

What operating system does the collector Docker image use?

The collector Docker image is built on the eclipse-temurin base image for Java 17.

How are credentials, like usernames and passwords, stored?

They are not stored anywhere. Each time the command runs, it authenticates with the service it is collecting from. The authentication information is supplied on every run, either by the user who enters the command or by the scheduled job that runs it.
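A scheduled job might supply the credentials on each run along the following lines. This is a minimal sketch: the environment variable names SOURCE_USER and SOURCE_PASSWORD and the --user and --password flags are hypothetical and used only for illustration. The values are read from the job's environment at run time and nothing is written to disk.

```python
import os
from typing import List, Mapping


def credential_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build credential arguments from the environment at run time.

    SOURCE_USER / SOURCE_PASSWORD and the --user / --password flags are
    hypothetical names, not documented Collector options.
    """
    missing = [k for k in ("SOURCE_USER", "SOURCE_PASSWORD") if k not in env]
    if missing:
        # Fail early instead of running the collector without credentials.
        raise KeyError(f"missing credentials: {', '.join(missing)}")
    return [f"--user={env['SOURCE_USER']}", f"--password={env['SOURCE_PASSWORD']}"]
```

The returned arguments would be appended to the collector command each time the job runs.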

How often should the collector be run?

A collector should run as often as the organization wants to see metadata updates reflected in its catalog. Here are some more specific guidelines based on what we have seen in customer use cases:

What kind of collector is it?	How often should you run it?
A collector that gathers metadata about a database.	Unless there are lots of schema changes, these are typically run once a week or once a month.
A collector that gathers metadata about a data analytics tool.	If you have an active user base that creates a lot of dashboards or reports, running it weekly or even daily might make sense.

Does the docker image need to be on the same machine as the database it is cataloging?

No. The Docker image needs to be located on a machine that has access to the database or analytics tool it is cataloging. If it is on a different machine than the source it gathers metadata from, network access must be set up between the machine hosting the Docker image and the machine hosting the metadata source.

Here’s an example:

A customer put the Docker image on an EC2 instance and wanted to use it to catalog a Salesforce instance. The EC2 instance needed to be able to make requests to the Salesforce API for the collector to collect metadata successfully.
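Before running the collector, one way to confirm that the machine hosting the Docker image can reach a source is a simple TCP connection test. The sketch below is a generic check and not part of the Collector; the host name and port in the usage line are hypothetical. A successful TCP connection only shows that the network path is open, not that authentication will succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host and port; replace with the address of your source.
    print(can_reach("db.example.com", 5432))
```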

How long does it take the Collector to run and catalog a source?

The best answer is “it depends”.

Some factors that can affect how long it takes to run the collector:

How many tables are in the database
How many fields those tables have
How many reports and dashboards there are in a data analytics tool
How many resources are available on the server hosting the source you are cataloging

Since the Docker image is, at its core, a simple Java runtime environment, the code itself does not take long to boot up and run. Some collectors have filters that can isolate certain projects in a source to decrease the runtime.

Should I use a certificate (SSL) to provide extra security?

If the metadata source being cataloged is configured to use a self-signed certificate to secure TLS connections, then the data.world Collector needs to be told to trust that certificate (since the native Java runtime trust store will not recognize it). The data.world Collector documentation contains instructions on how to accomplish this. Note that this only applies to metadata sources (typically on-premise) that have been set up to use self-signed certificates for TLS.

A typical example is an on-premise PostgreSQL or SQL Server database.

Do I need a Docker license? The Docker website says that certain enterprises do.

The license is for enterprises that need Docker Desktop. Linux environments can run Docker without Docker Desktop. If you need to use Docker on a machine running macOS or Windows and you are a qualifying enterprise (more than 250 employees OR more than $10 million in annual revenue), you will need to purchase a license, according to Docker’s terms.

How do I edit the severity level of reported error messages for the collectors?

By default, log messages are written at the info level. Users can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). Running the Collector with debug logging turned on is useful for troubleshooting problems.

If you are using Docker, set the level to something other than info by adding -e log_level=DEBUG to your Docker command. For detailed troubleshooting information, see "Troubleshooting the collectors".
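For jobs that assemble the Docker command programmatically, the log level option can be added as in the sketch below. The -e log_level=... setting comes from this FAQ; the helper name with_log_level and the image and subcommand in the usage example are hypothetical. Docker options must appear before the image name, so the helper inserts the option directly after docker run.

```python
from typing import List

# Levels named in this FAQ; INFO is the default.
VALID_LEVELS = ("ERROR", "WARN", "INFO", "DEBUG")


def with_log_level(docker_args: List[str], level: str) -> List[str]:
    """Return a copy of the command with -e log_level=<level> after 'docker run'."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    if docker_args[:2] != ["docker", "run"]:
        raise ValueError("expected a command starting with 'docker run'")
    return docker_args[:2] + ["-e", f"log_level={level}"] + docker_args[2:]


if __name__ == "__main__":
    # Hypothetical image and subcommand, for illustration only.
    base = ["docker", "run", "--rm", "example/collector:latest", "catalog-postgres"]
    print(" ".join(with_log_level(base, "DEBUG")))
```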

Why does my Collector list schemas in the log that are not part of the target database?

The Collector scans the declared schema, or all schemas if the -A parameter is used, so schemas outside the target database can appear in the log. However, it only captures records within the target database.
