Community docs

dbt metadata collector

Prerequisites
  • The computer running the catalog collector should have internet connectivity or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you cannot use Docker, a Java version of the collector is also available; contact us for more details.

Installing the collector
  1. Request access to a download link from your data.world representative for the catalog collector. Once you receive the link, download the catalog collector Docker image (or programmatically download it with curl).

  2. Load the Docker image into the local computer's Docker environment:

    docker load -i dwdbt-X.Y.tar.gz

    where X.Y is the version number of the dbt collector image.

  3. The previous command returns an <image id>, which needs to be tagged as 'dwdbt'. Copy the <image id> and use it in the docker tag command:

    docker tag <image id> dwdbt
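
    The two steps above can be combined into one short Bash sketch that parses the image id out of the docker load output, so you do not have to copy it by hand. The archive name and the exact wording of the "Loaded image" line are assumptions; check them against your download and your Docker version before relying on this.

    ```shell
    #!/usr/bin/env bash
    # Sketch: load the collector archive and tag the image as 'dwdbt' in one step.
    # "dwdbt-X.Y.tar.gz" is a placeholder; substitute your downloaded version.
    set -euo pipefail

    ARCHIVE="${1:-dwdbt-X.Y.tar.gz}"

    # docker load prints a line like "Loaded image ID: sha256:abc123...";
    # take the last field of that line to get the image id.
    extract_image_id() {
      awk '/Loaded image/ {print $NF}'
    }

    # Guarded so the script is a no-op until Docker and the archive are present.
    if command -v docker >/dev/null 2>&1 && [ -f "$ARCHIVE" ]; then
      IMAGE_ID="$(docker load -i "$ARCHIVE" | extract_image_id)"
      docker tag "$IMAGE_ID" dwdbt
    fi
    ```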

Basic parameters

These are the basic parameters needed to run the collector. A list of all parameter options is at the end of this article. Where available, either short (e.g., -a) or long (--account) forms can be used.

Example of a dbt script

The example below is a nearly copy-and-paste script for any Unix environment that uses a Bash shell (e.g., macOS and Linux). It uses the minimal set of parameters required to run the collector; your instance may require more. Information about the referenced parameters follows, and a complete list of parameters is at the end of the guide. Edit the script by adding any other parameters you wish to use, and by replacing the parameter values with your own information as appropriate. Parameters required by the collector are in bold. When you are finished, run your script.
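As a rough illustration (not the official script), a minimal run might look like the sketch below. Only the --account flag and the dwdbt image name come from this guide; the --output flag, the container path /dbt-output, and the volume mount are hypothetical placeholders, so consult the parameter list at the end of this guide for the real options. The script only prints the assembled command; remove the echo to actually execute it.

```shell
#!/usr/bin/env bash
# Illustrative collector run script; flag names below are placeholders
# except --account, which this guide documents.
set -euo pipefail

# Host directory where the collector should write its output.
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/dwdbt-output}"
mkdir -p "$OUTPUT_DIR"

# Assemble the docker run invocation as an array to keep quoting safe.
CMD=(docker run --rm
     -v "$OUTPUT_DIR":/dbt-output
     dwdbt
     --account your-account
     --output /dbt-output)

# Dry run: print the command. Drop the echo to execute it for real.
echo "${CMD[@]}"
```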

Collector runtime and troubleshooting

The catalog collector may run for several seconds to many minutes depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should appear in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either of these can be sent to support for investigation if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
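
A small helper like the following (a sketch, not part of the collector itself) can be dropped into an automation wrapper to classify the outcome of a run based on the files described above:

```shell
#!/usr/bin/env bash
# Classify a collector run by inspecting its output directory:
# prints OK if a *.dwec.ttl file was produced, LOG if only a *.log
# file exists (send it to support), and MISSING otherwise.
check_collector_output() {
  local dir="$1"
  if ls "$dir"/*.dwec.ttl >/dev/null 2>&1; then
    echo "OK"
  elif ls "$dir"/*.log >/dev/null 2>&1; then
    echo "LOG"
  else
    echo "MISSING"
  fi
}

# Usage: check_collector_output /path/to/output-dir
```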

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, a daily run may be appropriate. For those with schemas that change infrequently and are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for tailored recommendations on how best to optimize your catalog collector processes.
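
For example, a cron-based daily schedule might look like the entry below; the script path and log location are placeholders for wherever you keep your run script:

```
# Illustrative crontab entry: run the collector every night at 02:00
# and append its output to a log file for later review.
0 2 * * * /opt/dwdbt/run-collector.sh >> /var/log/dwdbt.log 2>&1
```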