dbt Cloud and the data.world Collector

Note

The latest version of the Collector is 2.137. To view the release notes for this version and all previous versions, please go here.

About the collector

The dbt Cloud collector connects to your dbt Cloud project and harvests dbt assets and lineage relationships from dbt transformations.

How does the dbt Cloud collector work?

The dbt Cloud collector also identifies how dbt moves data between tables (that is, lineage). To do this, the collector must parse view SQL. If the target database connection information is not specified, the collector cannot harvest lineage relationships between columns defined through views. The connection information can be supplied as an optional override in the data.world YAML file or in the CLI command.

Note, however, that the collector does not harvest everything that the target database collector would harvest. For example, the Snowflake collector can harvest profiling, tags, and policies that the dbt Cloud collector will not. It is recommended to run both the dbt Cloud collector and the target database collector to build a comprehensive data catalog.

Supported versions of dbt

The collector supports the following dbt versions:

  • dbt 1.4

Authentication supported

  • The collector supports authenticating to dbt Cloud using an API key.

  • When the collector authenticates to Snowflake as the target database, it supports either of the following (the corresponding collector options are sketched after this list):

    • Username and key pair authentication.

    • Username and password authentication.
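
As a sketch only (the values are placeholders, and the options are described in the parameter table later on this page), the Snowflake credentials are supplied by appending the corresponding options to the collector command:

    # Username and key pair authentication (options appended to the catalog-dbt-cloud
    # command shown later on this page; values are placeholders)
    --database-user=<databaseUser> \
    --snowflake-private-key-file=<snowflakePrivateKey> \
    --snowflake-private-key-file-password=<snowflakePrivateKeyFilePassword>

    # Username and password authentication
    --database-user=<databaseUser> --database-password=<databasePassword>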

What is cataloged

The information cataloged by the collector includes metadata for the following dbt resources:

Table 1. Information cataloged for each object

  • Analysis: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Materialized, Resource type

  • Model: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Materialized, Resource type

  • Project: Name, Project version

  • Snapshot: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Materialized, Resource type

  • Seed: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Materialized, Resource type

  • Source: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Source name, Resource type

  • Test: Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL, Enabled, Materialized, Resource type



Relationship between objects

By default, the data.world catalog includes catalog pages for the resource types below, and each catalog page has relationships to other related resource types. The catalog presentation and relationships are fully configurable; the table below lists the default configuration.

Table 2. Relationships for each resource page

  • Model:

    • Project containing the dbt model

    • Tests testing the integrity of the model

    • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of the model

    • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of the model

  • Project:

    • dbt resources (Test, Seed, Model, Snapshot, Source) contained within the project

  • Snapshot:

    • Project containing the dbt snapshot

    • dbt resources (Test, Seed, Model, Source) that are upstream of the snapshot

    • dbt resources (Test, Seed, Model, Source) that are downstream of the snapshot

  • Seed:

    • Project containing the dbt seed

    • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of the seed

    • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of the seed

  • Source:

    • Project containing the dbt source

    • dbt resources (Test, Seed, Model, Snapshot) that are downstream of the source

    • Database table that is the source of data for the source

  • Test:

    • Project containing the dbt test

    • dbt model that has its integrity tested by this test



Lineage for dbt

  • Relationships between views and referenced database tables and columns for dbt models materialized as views

  • Relationships between dbt resources and the resources that are upstream and downstream (e.g., seeds that are upstream of models, and tests that are downstream of models)

The dbt Cloud collector also harvests column-level lineage for the following target databases:

  • Snowflake

Important

In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of the dbt resource. For example, you can open Eureka Explorer from a downstream Snowflake table resource page to see which upstream Snowflake table was transformed through the view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

Preparing dbt Cloud for collectors

Obtaining account ID, project ID, and job run ID

This section describes how to obtain the account ID, project ID, and job run ID. You will use the account ID for the --dbt-cloud-account parameter, the project ID for the --dbt-cloud-project parameter, and the job run ID for the --dbt-cloud-run parameter.

The dbt Cloud collector assumes that your dbt Cloud instance has an environment and a job set up with at least one successful run.

  1. Under the Deploy menu at the top navigation, go to Jobs.

  2. From the Environment dropdown, select the environment that you want to run the collector against.

  3. Select the Job associated with your Environment.

  4. From the URL of the Job page, copy the account ID, project ID, and job run ID. The URL has the following format:

    https://cloud.getdbt.com/deploy/<accountID>/projects/<projectID>/runs/<jobRunID>

    The account ID is represented as <accountID>, the project ID as <projectID>, and the job run ID as <jobRunID>.
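
    For example, given the hypothetical URL below (all IDs are made-up placeholders), the values map to the collector parameters as follows:

    # Hypothetical Job page URL (placeholder IDs, not real values)
    # https://cloud.getdbt.com/deploy/12345/projects/67890/runs/111213
    #   accountID = 12345   ->  --dbt-cloud-account=12345
    #   projectID = 67890   ->  --dbt-cloud-project=67890
    #   jobRunID  = 111213  ->  --dbt-cloud-run=111213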

Obtaining dbt Cloud API token

  1. From the top right menu dropdown, select Profile Settings.

  2. Navigate to the API section. Click Copy to the right of the API Key. You will use this API key for the --dbt-cloud-api-key parameter when setting up the collector to authenticate to dbt Cloud (one way to handle the key is sketched below).
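
If you prefer not to paste the key directly into the command each time, one option (an assumption for convenience, not a requirement of the collector) is to keep it in an environment variable on the machine where you run the collector:

    # Hypothetical: keep the copied API key in an environment variable (the name
    # DBT_CLOUD_API_KEY is an example only) and reference it in the collector command
    export DBT_CLOUD_API_KEY="<paste the copied API key here>"
    # ...then pass it to the collector with --dbt-cloud-api-key=${DBT_CLOUD_API_KEY}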

Updating job execution settings

  1. Under the Deploy menu at the top navigation, go to Jobs.

  2. From the Environment dropdown, select the environment that you want to run the collector against.

  3. Select the Job associated with your Environment and click Settings.

  4. Under Execution Settings, ensure that Generate docs on run is selected.

Pre-requisites for running the collector

Table 3.

Hardware

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software

  • Docker: Click here to get Docker.

  • Java Runtime Environment: OpenJDK 17 is supported and available here.

data.world specific objects

  • Dataset: You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.
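
You can quickly confirm the software prerequisites from a terminal (the exact version output will vary by installation):

    # Confirm Docker is installed and the Java runtime is OpenJDK 17
    docker --version
    java -version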



Ways to run the data.world Collector

There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.

  • Run the collector through the CLI - Repeat runs of the collector require you to re-enter the command for each run.

Note

This section walks you through the process of running the collector using the CLI.

Details about running the command

The easiest way to create your Collector command is to:

  1. Copy the following example command.

  2. Edit it for your organization and data source.

  3. Open a terminal window in any Unix environment that uses a Bash shell and paste your command into it.

    The example command includes the minimal parameters required to run the collector - your instance may require more. Edit the command by adding any other parameters you wish to use, and by replacing the values for all your parameters with your information as appropriate.

    docker run -it --rm --mount type=bind,source=/tmp,target=/input-data \
    --mount type=bind,source=/tmp,target=/dwcc-output \
    --mount type=bind,source=/tmp,target=/app/log \
    datadotworld/dwcc:<CollectorVersion> catalog-dbt-cloud -a <account> -n <catalogName> \
    --dbt-cloud-api-key=<dbtCloudApiKey> --dbt-cloud-account=<account> \
    --dbt-cloud-project=<project> --dbt-cloud-run=<runIdentifier> \
    -o /dwcc-output

    The following table describes the parameters for the command. Detailed information about the Docker portion of the command can be found here.

    Table 1.

    • dwcc:<CollectorVersion> (Required): The version of the collector you want to use (for example, datadotworld/dwcc:2.113).

    • -t=<apiToken>, --api-token=<apiToken> (Required): The data.world API token to use for authentication. The default is to use an environment variable named ${DW_AUTH_TOKEN} (see the sketch after this table).

    • -a=<agent>, --agent=<agent>, --account=<agent> (Required): The ID for the data.world account into which you will load this catalog. This value is used to generate the namespace for any URIs generated.

    • --site=<site> (Required for private instance installations): Set this parameter only for private instances. Do not set it for public instances or single-tenant installations.

    • -o=<outputDir>, --output=<outputDir> (Required): The output directory into which any catalog files should be written.

    • -L, --no-log-upload (Optional): Do not upload the log of the collector run to the organization account's catalogs dataset or to another location specified with --upload-location (ignored if --upload is not specified).

    • -n=<catalogName> (Required): The name of the collection where the collector output will be stored.

    • --upload-location=<uploadLocation> (Required): The dataset to which the catalog is to be uploaded. You can specify a simple dataset name to upload to that dataset within the organization's account, or [account/dataset] to upload to a dataset in some other account.

    • -H=<apiHost>, --api-host=<apiHost> (Required for single-tenant installations): The host for the data.world API. For example, "api.8bank.data.world" where "8bank" is the name of the single-tenant install.

    • --dbt-cloud-api-key=<dbtCloudApiKey> (Required): A dbt Cloud-issued API key with permissions to access the specified account.

    • --dbt-cloud-account=<account> (Required): The dbt Cloud account that owns the project from which to harvest dbt metadata artifacts.

    • --dbt-cloud-project=<project> (Required): The name or numeric identifier of the project from which to harvest dbt metadata artifacts.

    • --dbt-cloud-run=<runIdentifier> (Optional): The numeric identifier of the run that produced the artifacts to be harvested. If not specified, the most recent successful run that produced artifacts within the project is harvested.

    • --dry-run (Optional): If specified, the collector does not harvest any metadata; it only checks the database connection parameters provided by the user and reports success or failure at connecting.

    • --database-user=<databaseUser> (Optional): The user credential to use in connecting to the target database.

    • --database-password=<databasePassword> (Optional): The password credential to use in connecting to the target database.

    • --snowflake-application=<snowflakeApplication> (Optional): The application connection parameter to use in connecting to the target Snowflake database. Use this option to override the dbt profile; it is ignored for non-Snowflake target databases. Use datadotworld unless otherwise directed.

    • --snowflake-private-key-file=<snowflakePrivateKey> (Optional): The private key file to use for authentication with Snowflake (for example, rsa_key.p8). Use this option to override the dbt profile; it is ignored for non-Snowflake target databases.

    • --snowflake-private-key-file-password=<snowflakePrivateKeyFilePassword> (Optional): The password for the private key file to use for authentication with Snowflake, if the key is encrypted and a password was set. Use this option to override the dbt profile; it is ignored for non-Snowflake target databases.
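
    As noted above, the data.world API token defaults to the ${DW_AUTH_TOKEN} environment variable. One way to make that variable available inside the container (a sketch only; it assumes the token is already exported on the host) is Docker's -e flag:

    # Sketch: pass the host's DW_AUTH_TOKEN into the container so the token does not
    # have to appear on the command line (assumes DW_AUTH_TOKEN is exported on the host)
    docker run -it --rm -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} \
    --mount type=bind,source=/tmp,target=/dwcc-output \
    datadotworld/dwcc:<CollectorVersion> catalog-dbt-cloud -a <account> -n <catalogName> \
    --dbt-cloud-api-key=<dbtCloudApiKey> --dbt-cloud-account=<account> \
    --dbt-cloud-project=<project> --dbt-cloud-run=<runIdentifier> \
    -o /dwcc-output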



Common troubleshooting tasks

Collector runtime and troubleshooting

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
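
For example, assuming you used /tmp as the output directory, as in the example command above, you can check for the output file with:

    # Check for collector output (assumes /tmp was used as the output directory on the host)
    ls -l /tmp/*.dwec.ttl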

Issue: Collector fails to complete because it cannot find catalog.json

  • Cause: The catalog.json file is required but was not generated for the job.

  • Solution: Review the job's Execution Settings and ensure that Generate docs on run is selected.

Automating updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
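
For example, a minimal cron entry for a weekly run might look like the following (a sketch only; the schedule, the script path, and the assumption that your finished collector command is saved in a script are all placeholders):

    # Hypothetical crontab entry: run the collector every Sunday at 2:00 AM
    # (assumes your full docker run command is saved in /opt/dwcc/run-dbt-cloud-collector.sh)
    0 2 * * 0 /opt/dwcc/run-dbt-cloud-collector.sh >> /var/log/dwcc-dbt-cloud.log 2>&1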