dbt legacy metadata collector
With the release of the data.world Collector 2.85, we now use our Collector for dbt. This documentation is maintained for customers who prefer to remain on the legacy collector; for all new users, we recommend the new data.world Collector.
Prerequisites
The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.
Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is available as well; contact us for more details.
The files catalog.json, manifest.json, and profiles.yml must be in the same directory on the host machine, e.g., /tmp.
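As an illustrative sketch only (the project path is a placeholder): in a standard dbt project, dbt docs generate writes catalog.json and manifest.json to the project's target/ directory, and profiles.yml lives in its default location, so the three files can be gathered like this:

# Illustrative only: paths assume a standard dbt project and the default profiles.yml location
cd /path/to/your/dbt/project
dbt docs generate    # writes target/catalog.json and target/manifest.json
cp target/catalog.json target/manifest.json ~/.dbt/profiles.yml /tmp/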
Installing the collector
Request a download link for the catalog collector from your data.world representative. Once you receive the link, download the catalog collector Docker image (or download it programmatically with curl).
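For example, a curl download might look like the following (the link is a placeholder for the one you receive):

# Replace <download link> with the link provided by your data.world representative,
# and X.Y with the version you were given
curl -L -o dwdbt-X.Y.tar.gz "<download link>"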
Load the Docker image into the local computer's Docker environment:
docker load -i dwdbt-X.Y.tar.gz
where X.Y is the version number of the dbt collector image.
The previous command will return an <image id>, which needs to be tagged as 'dwdbt'. Copy the <image id> and use it in the docker tag command:
docker tag <image id> dwdbt
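You can confirm the image was tagged correctly by listing it:

docker images dwdbt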
Basic parameters
Each collector has parameters that are required, parameters that are recommended, and parameters that are completely optional. Required parameters must be present for the command to run. Recommended parameters are either:
- parameters that exist in pairs, where one or the other must be present for the command to run (e.g., --agent and --base)
- parameters that we recommend to improve your experience running the command in some way
Together, the required and recommended parameters make up the Basic parameters for each collector. The Basic parameters for this collector are:
-a, --agent, --account=<agent>
- The ID for the data.world account into which you will load this catalog - this is used to generate the namespace for any URIs generated.
-P, --profile-file <profileFile>
- The file containing profile definitions (defaults to the dbt default of .dbt/profiles.yml in the user's home directory)
-g, --target <target>
- The dbt profile target to use to obtain database location information (defaults to the profile's 'target' value)
-p, --profile=<profile>
- The dbt profile to use to obtain database location information (defaults to the first profile found in the profile definitions file)
Example of a dbt command
The example below is an almost copy-and-paste command for any Unix environment that uses a Bash shell (e.g., macOS and Linux). It uses the minimal set of parameters required to run the collector; your instance may require more. Information about the referenced parameters follows, and a complete list of parameters is at the end of this guide. Edit the command by adding any other parameters you wish to use and by replacing the parameter values with your own information as appropriate. Parameters required by the collector are in bold. When you are finished, run your command.
docker run -it --rm --mount type=bind,source=/tmp,target=/dbt-input \
  --mount type=bind,source=/tmp,target=/dbt-output dwdbt -a <account> \
  -P <profileFile> -g <target> -p <profile> /dbt-input /dbt-output
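As a purely illustrative sketch, here is the same command with hypothetical values filled in: an account named my-org, the profile file mounted at /dbt-input/profiles.yml, a target of dev, and a profile named my_profile.

# Hypothetical values for illustration only
docker run -it --rm --mount type=bind,source=/tmp,target=/dbt-input \
  --mount type=bind,source=/tmp,target=/dbt-output dwdbt -a my-org \
  -P /dbt-input/profiles.yml -g dev -p my_profile /dbt-input /dbt-output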
Collector runtime and troubleshooting
The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.
If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output.
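For example, if /tmp was used as the output directory, you can confirm the file was written with:

ls /tmp/*.dwec.ttl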
If there is an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either of those can be sent to support to investigate if the errors are not clear.
A list of common issues and problems encountered when running the collectors is available here.
Automating updates to your metadata catalog
Maintaining an up-to-date metadata catalog is crucial and can be achieved by employing Azure Pipelines, CircleCI, or any automation tool of your preference to execute the catalog collector regularly.
There are two primary strategies for setting up the collector run times:
Scheduled: You can configure the collector according to the anticipated frequency of metadata changes in your data source and the business need to access updated metadata. It's necessary to account for the completion time of the collector run (which depends on the size of the source) and the time required to load the collector's output into your catalog. For instance, this could be daily or weekly. We recommend scheduling the collector run during off-peak times for optimal performance.
Event-triggered: If you have set up automations that refresh the data in a source technology, you can set up the collector to execute whenever the upstream jobs complete successfully. For example, if you're using Airflow, GitHub Actions, dbt, etc., you can configure the collector to run automatically and keep your catalog updated following modifications to your data sources.
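As one illustrative sketch of the scheduled approach (the account name and paths are assumptions, not part of the product), a crontab entry could run the collector nightly, relying on a DW_AUTH_TOKEN environment variable for the upload:

# Hypothetical crontab entry: run the collector every night at 2 AM
# Assumes /tmp holds the dbt artifacts and DW_AUTH_TOKEN is set in the cron environment
0 2 * * * docker run --rm -e DW_AUTH_TOKEN --mount type=bind,source=/tmp,target=/dbt-input --mount type=bind,source=/tmp,target=/dbt-output dwdbt -a my-org -U /dbt-input /dbt-output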
dbt parameters
--help
- Show the help text and exit.
--upload-location=<uploadLocation>
- The dataset to which the catalog is to be uploaded, specified as a simple dataset name to upload to that dataset within the organization's account, or [account/dataset] to upload to a dataset in some other account (ignored if --upload not specified)
-a, --agent, --account=<agent>
- The ID for the data.world account into which you will load this catalog - this is used to generate the namespace for any URIs generated
-b, --base=<base>
- The base URI to use as the namespace for any URIs generated (must use this OR --agent)
-P, --profile-file <profileFile>
- The file containing profile definitions (defaults to the dbt default of .dbt/profiles.yml in the user's home directory)
-g, --target <target>
- The dbt profile target to use to obtain database location information (defaults to the profile's 'target' value)
-p, --profile=<profile>
- The dbt profile to use to obtain database location information (defaults to the first profile found in the profile definitions file)
-t, --api-token=<apiToken>
- The data.world API token to use for authentication; default is to use an environment variable named DW_AUTH_TOKEN
-U, --upload
- Whether to upload the generated catalog to the organization account's catalogs dataset or to another location specified with --upload-location (requires --api-token)
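To illustrate the upload parameters (the organization and dataset names below are hypothetical), a run that also uploads the generated catalog might look like:

# Hypothetical example: upload the catalog to the my-catalog-dataset dataset in the my-org account
# DW_AUTH_TOKEN must be set in the environment (or pass --api-token explicitly)
docker run -it --rm -e DW_AUTH_TOKEN --mount type=bind,source=/tmp,target=/dbt-input \
  --mount type=bind,source=/tmp,target=/dbt-output dwdbt -a my-org \
  -U --upload-location=my-catalog-dataset /dbt-input /dbt-output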