About the dbt Cloud collector

The dbt Cloud collector connects to your dbt Cloud project and harvests dbt assets, along with column-level lineage relationships from database views associated with those assets.

Important

The dbt Cloud collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the Collector is 2.239. To view the release notes for this version and all previous versions, please go here.

How does the dbt Cloud collector work?

dbt Cloud is an extract, transform, and load (ETL) platform that allows users to specify jobs that move and transform data between relational database tables and views in a target database. It makes rich metadata available about the processes that perform this movement and transformation of data.

The collector connects to dbt Cloud and harvests metadata from dbt Cloud artifacts such as models and sources.

To fully represent lineage relationships between the tables, for any of the supported target databases, the collector parses the definition of any database views created by dbt Cloud, connecting to the target database to obtain database metadata where necessary.

How does the collector know which dbt Cloud run to harvest from?

To make sure that the correct metadata is being harvested, the dbt Cloud collector needs to know the Project, Job, and Environment combination.

In dbt Cloud, a job is typically executed on a schedule. The dbt Cloud collector uses the following logic to determine the job run from which it will obtain the metadata artifacts to process:

  • The collector examines only dbt Cloud runs associated with the dbt Cloud project specified by the user.

  • Any unsuccessful runs and any runs that did not produce metadata artifacts are discarded.

  • If the user specifies a job (by identifier or name) or an environment (by identifier or name), then only runs from that job and/or environment are examined.

  • If multiple successful runs are found, then the collector harvests the metadata from the most recent run.
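The selection rules above can be sketched in a few lines of Python. This is only an illustration of the described logic, not the collector's actual implementation: the Run class, its fields, and the select_run function are hypothetical names, and the real dbt Cloud API returns richer run objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical, simplified view of a dbt Cloud run (not the real API schema)."""
    run_id: int
    project_id: int
    job_id: int
    environment_id: int
    succeeded: bool
    has_artifacts: bool
    finished_at: datetime


def select_run(
    runs: Iterable[Run],
    project_id: int,
    job_id: Optional[int] = None,
    environment_id: Optional[int] = None,
) -> Optional[Run]:
    """Pick the run to harvest from, following the four rules listed above."""
    candidates = [
        r for r in runs
        # Rule 1: only runs in the user-specified project.
        if r.project_id == project_id
        # Rule 2: discard unsuccessful runs and runs without artifacts.
        and r.succeeded and r.has_artifacts
        # Rule 3: optional job and/or environment filters.
        and (job_id is None or r.job_id == job_id)
        and (environment_id is None or r.environment_id == environment_id)
    ]
    # Rule 4: the most recent remaining run wins; None if nothing qualifies.
    return max(candidates, key=lambda r: r.finished_at, default=None)
```

If no job or environment is given, the sketch considers every successful run in the project, which is why specifying a job or environment is useful when a project has several scheduled jobs.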

Running the target database collector in addition to the dbt Cloud collector

The dbt Cloud collector captures only the lineage relationships; it does not harvest information about the database tables, views, and columns involved in those relationships. To harvest information such as table, view, and column names, descriptions, data types, and table/view-column associations, run the appropriate database collector (for example, the Snowflake collector) on the target database.

What is cataloged

The information cataloged by the collector includes metadata for the following dbt resources:

Note

Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields named either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
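If you read dbt artifacts yourself, a small helper can hide the rename. The sketch below assumes a node is a plain dictionary taken from the nodes section of a manifest.json file; the function name get_sql is hypothetical and is not part of the collector.

```python
from typing import Any, Mapping, Optional, Tuple


def get_sql(node: Mapping[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (raw, compiled) SQL for a manifest node, whichever naming it uses.

    Prefers the dbt 1.3+ names (raw_code, compiled_code) and falls back to
    the older names (raw_sql, compiled_sql). Missing fields yield None.
    """
    raw = node.get("raw_code", node.get("raw_sql"))
    compiled = node.get("compiled_code", node.get("compiled_sql"))
    return raw, compiled
```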

Table 1.

| Object | Information cataloged |
|---|---|
| Analysis | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Model | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Model column | Column name |
| Project | Name, Project version |
| Snapshot | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Seed | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Source | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type |
| Test | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Test result | Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any) |
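To see where the test result fields come from, the following sketch pulls comparable values out of a dbt run_results.json artifact that has already been loaded into a dictionary. It is an illustration only: the function name summarize_test_results and the output keys are hypothetical, and the mapping to the cataloged fields (for example, using the start of the execute timing step as the execution time) is an assumption rather than the collector's documented behavior.

```python
from typing import Any, Dict, List, Mapping


def summarize_test_results(run_results: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Extract time, status, failure count, and message for each test result."""
    summaries: List[Dict[str, Any]] = []
    for result in run_results.get("results", []):
        unique_id = result.get("unique_id", "")
        # dbt unique IDs are prefixed with the resource type, e.g. "test.".
        if not unique_id.startswith("test."):
            continue
        timing = result.get("timing") or []
        # Assumption: treat the start of the "execute" step as the execution time.
        executed_at = next(
            (t.get("started_at") for t in timing if t.get("name") == "execute"),
            None,
        )
        summaries.append(
            {
                "test": unique_id,
                "executed_at": executed_at,
                "status": result.get("status"),
                "failures": result.get("failures"),
                "message": result.get("message"),
            }
        )
    return summaries
```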



Note: The collector also harvests the following information about the dbt Cloud resources. This information is available in the collector output but is not presented in the UI. You can use it for querying and Eureka Automations.

  • The run that produced the artifacts

  • The job that configured the run

  • The environment for the job (that specifies the target database)

  • The dbt Cloud project in which the job is defined

  • The dbt Cloud account

Relationship between objects

By default, the data.world catalog will include catalog pages for the resource types below. Each catalog page will have a relationship to other related resource types. Note that the catalog presentation and relationships are fully configurable, so the table below lists only the default configuration.

Table 2.

Resource page

Relationship

Model

  • Project containing dbt model

  • Tests testing the integrity of model

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of model

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of model

Model column

  • The database column in the manifested table or view

Project

  • dbt resources (Test, Seed, Model, Snapshot, Source) contained within project

Snapshot

  • Project containing dbt snapshot

  • dbt resources (Test, Seed, Model, Source) that are upstream of snapshot

  • dbt resources (Test, Seed, Model, Source) that are downstream of snapshot

Seed

  • Project containing dbt seed

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of seed

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of seed

Source

  • Project containing dbt source

  • dbt resources (Test, Seed, Model, Snapshot) that are downstream of source

  • Database table that is the source of data for source

Test

  • Project containing dbt test

  • dbt model that has its integrity tested by this test

Test result

  • The dbt test that was executed to produce the result



Lineage for dbt

Table 3.

Object

Lineage available

dbt model materialized as view

  • Referenced database tables and columns in dbt model materialized as view

dbt resource

  • dbt resources that are upstream or downstream of the dbt resource (for example, seeds that are upstream of models, and tests that are downstream of models).
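Upstream and downstream relationships between dbt resources form a directed graph. As a rough illustration, the sketch below walks such a graph using a mapping from each resource's unique ID to the IDs of its direct parents, similar in shape to the parent_map section of a dbt manifest.json. The function names are hypothetical, and the collector may derive lineage differently.

```python
from collections import deque
from typing import Dict, List, Mapping, Set


def invert(parent_map: Mapping[str, List[str]]) -> Dict[str, List[str]]:
    """Build a child map (node -> direct children) from a parent map."""
    child_map: Dict[str, List[str]] = {node: [] for node in parent_map}
    for child, parents in parent_map.items():
        for parent in parents:
            child_map.setdefault(parent, []).append(child)
    return child_map


def reachable(start: str, edges: Mapping[str, List[str]]) -> Set[str]:
    """Return every node reachable from start by following edges (breadth-first)."""
    seen: Set[str] = set()
    queue = deque(edges.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(edges.get(node, []))
    return seen


def upstream(node: str, parent_map: Mapping[str, List[str]]) -> Set[str]:
    """All direct and indirect parents of node."""
    return reachable(node, parent_map)


def downstream(node: str, parent_map: Mapping[str, List[str]]) -> Set[str]:
    """All direct and indirect children of node."""
    return reachable(node, invert(parent_map))
```

With a seed feeding a model that is checked by a test, upstream of the test returns the model and the seed, and downstream of the seed returns the model and the test.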



Supported cross-system lineage

Important

In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of dbt. For example, you can open Eureka Explorer from a downstream Snowflake table resource page to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

The currently supported data sources for cross-system lineage are:

  • Snowflake

    Important

    While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Cloud and these sources.

Supported versions of dbt

The collector supports the following dbt versions:

  • dbt 1.0.0

  • dbt 1.0.5

  • dbt 1.1.1

  • dbt 1.3.0

  • dbt 1.4.0

  • dbt 1.5.0

  • dbt 1.6.0

  • dbt 1.7.0

Authentication supported

  • The collector supports authenticating to dbt Cloud using an API key.

  • When the collector authenticates to Snowflake as the target database, the collector supports either:

    • Username and key pair authentication

    • Username and password authentication

      Warning

      dbt is deprecating user tokens on September 18. If you are using legacy user tokens, you will need to generate either a service token or an account-scoped access token to use with the collector.

      In your collector configuration, update the dbt Cloud API key (--dbt-cloud-api-key) parameter with the new token.