
About the dbt Cloud collector

The dbt Cloud collector connects to your dbt Cloud project and harvests dbt assets and column-level lineage relationships from database views associated with dbt assets.

Important

The dbt Cloud collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.251. To view the release notes for this version and all previous versions, see the collector release notes.

How does the dbt Cloud collector work?

The dbt Cloud Collector operates within the dbt Cloud platform, an ETL tool focused on moving and transforming data within relational databases. It provides access to detailed metadata about these processes.

Here is a detailed breakdown of how the dbt Cloud Collector works:

  • Metadata Harvesting:

    • The collector connects to dbt Cloud to extract metadata from artifacts created by dbt jobs, specifically targeting manifest.json and catalog.json files.

    • When a job in dbt Cloud is executed, configured dbt commands run to perform ETL operations. If docs generate is active in these commands, dbt Cloud generates the aforementioned artifacts, storing them with each run's results.

  • Connecting to the dbt Cloud API: The dbt Cloud Collector uses the dbt Cloud API to retrieve these artifacts, facilitating the extraction of metadata.

  • Lineage Metadata Generation:

    • As an ETL platform, dbt produces lineage metadata for database objects, such as tables and views.

    • For the collector to accurately track these objects, it requires precise database information, which it gathers from the dbt Cloud project data accessible via the dbt Cloud API.

  • Target Database Information:

    • Certain crucial details, like database connection passwords, may be absent from the dbt Cloud project API.

    • Because the collector directly connects to the target database to retrieve catalog/schema data, any missing information must be provided through command options, which also allow for modifications to the available database data.

  • Artifact Analysis and Database View Parsing:

    • The collector scans manifest.json to create catalog resources for any identified models, snapshots, seeds, and tests.

    • It processes database view definitions created by dbt Cloud, establishing connections to the target database to ensure a comprehensive representation of table lineage.

  • Lineage Information Over Detailed Metadata: The collector prioritizes lineage information and does not delve into detailed metadata about database objects. Users requiring detailed metadata should utilize the appropriate database collector for their target database.

  • Leveraging run_results.json: The collector utilizes run_results.json files when available from a run’s output. It extracts metadata on processes like model manifestations into database views, including their execution time and status.
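
As a rough illustration of the artifact retrieval described above, the sketch below builds (but does not send) requests for the three artifact files of a single run. It assumes the dbt Cloud Administrative API v2 artifact endpoint, the multi-tenant host cloud.getdbt.com, and the Token authorization scheme; the account ID, run ID, and token are placeholder values, and the collector's actual implementation is not documented here.

```python
import urllib.request

# Assumption: multi-tenant host. Single-tenant or regional accounts use a different host.
DBT_CLOUD_HOST = "https://cloud.getdbt.com"

# The artifact files this page says the collector reads.
ARTIFACTS = ("manifest.json", "catalog.json", "run_results.json")


def artifact_request(account_id: int, run_id: int, artifact: str, api_key: str) -> urllib.request.Request:
    """Build, without sending, a GET request for one artifact of one run."""
    url = f"{DBT_CLOUD_HOST}/api/v2/accounts/{account_id}/runs/{run_id}/artifacts/{artifact}"
    return urllib.request.Request(url, headers={"Authorization": f"Token {api_key}"})


# Placeholder identifiers and token, for illustration only.
requests_to_send = [artifact_request(1, 42, name, "example-token") for name in ARTIFACTS]
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Passing one of these requests to urllib.request.urlopen would perform the download; that step is left out so the sketch runs without network access or credentials.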

 

How does the collector know which dbt Cloud run to harvest from?

Choosing the appropriate run is crucial, because the dbt Cloud collector harvests metadata from that run's artifacts. If you run the dbt Cloud collector and the dbt resources or lineage information in your catalog appear incorrect, the collector may have retrieved artifacts from the wrong job run.

To make sure that the correct metadata is being harvested, the dbt Cloud collector needs to know the Project, Job, and Environment combination.

In dbt Cloud, a job is typically executed on a schedule. The dbt Cloud collector uses the following logic to determine the job run from which it will obtain the metadata artifacts to process:

  • The collector examines only dbt Cloud runs associated with the dbt Cloud project specified by the user.

  • Any unsuccessful runs and any runs that did not produce metadata artifacts are discarded.

  • If the user specifies a job (by identifier or name) or an environment (by identifier or name), then only runs from that job and/or environment are examined.

  • If multiple successful runs are found, then the collector harvests the metadata from the most recent run.
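
The selection rules above can be sketched as a small filter. The Run record and its field names are hypothetical and chosen for illustration; they are not the dbt Cloud API schema or the collector's internal model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Run:
    # Hypothetical record; field names are illustrative only.
    run_id: int
    project_id: int
    job_id: int
    environment_id: int
    succeeded: bool
    has_artifacts: bool
    finished_at: datetime


def select_run(
    runs: Iterable[Run],
    project_id: int,
    job_id: Optional[int] = None,
    environment_id: Optional[int] = None,
) -> Optional[Run]:
    """Return the most recent successful run with artifacts that matches the filters."""
    candidates = [
        r for r in runs
        if r.project_id == project_id                       # only the specified project
        and r.succeeded and r.has_artifacts                 # discard failed or artifact-less runs
        and (job_id is None or r.job_id == job_id)          # optional job filter
        and (environment_id is None or r.environment_id == environment_id)  # optional environment filter
    ]
    # Most recent run wins; None if nothing qualifies.
    return max(candidates, key=lambda r: r.finished_at, default=None)


runs = [
    Run(1, 10, 100, 7, True, True, datetime(2024, 1, 1)),
    Run(2, 10, 100, 7, False, True, datetime(2024, 1, 3)),   # failed
    Run(3, 10, 200, 8, True, True, datetime(2024, 1, 2)),
    Run(4, 10, 200, 8, True, False, datetime(2024, 1, 4)),   # no artifacts
    Run(5, 99, 300, 9, True, True, datetime(2024, 1, 5)),    # other project
]
print(select_run(runs, project_id=10).run_id)
print(select_run(runs, project_id=10, job_id=100).run_id)
```

Without a job or environment filter, the most recent qualifying run in the project is chosen regardless of which job produced it, which is one way the collector can end up reading artifacts from an unintended run.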

Running the target database collector in addition to the dbt Cloud collector

The dbt Cloud collector captures only the lineage relationships. It does not harvest any information about the database tables, views, and columns involved in those relationships. To harvest information such as table, view, or column names, descriptions, data types, and table/view-column associations, you need to run the appropriate database collector (for example, the Snowflake collector) on the target database.

What is cataloged

The information cataloged by the collector includes metadata for the following dbt resources:

Note

Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields for either Raw SQL/Compiled SQL or Raw Code/Compiled Code.

Table 1.

Object

Information cataloged

Analysis

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Model

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Model column

Column name

Project

Name, Project version

Snapshot

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Seed

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Source

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type

Test

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Test result

Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any)
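
To illustrate the raw_sql/raw_code rename noted above, the following sketch reads the code fields from a manifest.json node under either naming. The two sample nodes are invented, and the helper name code_fields is hypothetical; how the collector itself handles the rename is not described in this document.

```python
from typing import Any, Dict, Optional


def code_fields(node: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Return raw and compiled code for a node, accepting both old and new property names."""
    return {
        # dbt 1.3 and later use raw_code/compiled_code; earlier versions use raw_sql/compiled_sql.
        "raw": node.get("raw_code", node.get("raw_sql")),
        "compiled": node.get("compiled_code", node.get("compiled_sql")),
    }


# Invented sample nodes, shaped like entries under "nodes" in manifest.json.
old_style = {
    "unique_id": "model.shop.orders",
    "resource_type": "model",
    "raw_sql": "select * from {{ ref('raw_orders') }}",
    "compiled_sql": "select * from analytics.raw_orders",
}
new_style = {
    "unique_id": "model.shop.customers",
    "resource_type": "model",
    "raw_code": "select * from {{ ref('raw_customers') }}",
    "compiled_code": "select * from analytics.raw_customers",
}

for node in (old_style, new_style):
    print(node["unique_id"], code_fields(node))
```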



Note: The collector also harvests the following information about the dbt Cloud resources. This information is available in the collector output and not presented in the UI. You can use this information for querying and Eureka Automations.

  • The run that produced the artifacts

  • The job that configured the run

  • The environment for the job (that specifies the target database)

  • The dbt Cloud project in which the job is defined

  • The dbt Cloud account

Relationship between objects

By default, the data.world catalog includes catalog pages for the resource types below. Each catalog page has relationships to other related resource types. Note that the catalog presentation and relationships are fully configurable, so the table below describes only the default configuration.

Table 2.

Resource page

Relationship

Model

  • Project containing dbt model

  • Tests testing the integrity of model

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of model

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of model

Model column

  • The database column in the manifested table or view

Project

  • dbt resources (Test, Seed, Model, Snapshot, Source) contained within project

Snapshot

  • Project containing dbt snapshot

  • dbt resources (Test, Seed, Model, Source) that are upstream of snapshot

  • dbt resources (Test, Seed, Model, Source) that are downstream of snapshot

Seed

  • Project containing dbt seed

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of seed

  • dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of seed

Source

  • Project containing dbt source

  • dbt resources (Test, Seed, Model, Snapshot) that are downstream of source

  • Database table that is the source of data for source

Test

  • Project containing dbt test

  • dbt model that has its integrity tested by this test

Test result

  • The dbt test that was executed to produce the result
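
The Test result fields (execution time, status, failure count, message) correspond to entries in run_results.json, each of which points back to its test through unique_id. The sketch below pulls those fields from an invented sample file. Which timestamp the collector reports as the execution time is an assumption here (the completion time of the "execute" timing step), and the helper name is hypothetical.

```python
import json

# Invented sample, shaped like the "results" array in run_results.json.
SAMPLE_RUN_RESULTS = json.loads("""
{
  "results": [
    {
      "unique_id": "model.shop.orders",
      "status": "success",
      "execution_time": 1.8,
      "message": "SUCCESS 1",
      "failures": null,
      "timing": [{"name": "execute", "started_at": "2024-01-01T00:00:01Z", "completed_at": "2024-01-01T00:00:03Z"}]
    },
    {
      "unique_id": "test.shop.not_null_orders_id",
      "status": "fail",
      "execution_time": 0.4,
      "message": "Got 3 results, configured to fail if != 0",
      "failures": 3,
      "timing": [{"name": "execute", "started_at": "2024-01-01T00:00:04Z", "completed_at": "2024-01-01T00:00:05Z"}]
    }
  ]
}
""")


def extract_test_results(run_results: dict) -> list:
    """Collect one row per test result, keyed by the test's unique_id."""
    rows = []
    for result in run_results.get("results", []):
        # dbt unique IDs are prefixed with the resource type, e.g. "test.".
        if not result.get("unique_id", "").startswith("test."):
            continue
        # Assumption: use the completion time of the "execute" step as the execution time.
        executed_at = next(
            (t.get("completed_at") for t in result.get("timing", []) if t.get("name") == "execute"),
            None,
        )
        rows.append({
            "test": result["unique_id"],
            "executed_at": executed_at,
            "status": result.get("status"),
            "failures": result.get("failures"),
            "message": result.get("message"),
        })
    return rows


rows = extract_test_results(SAMPLE_RUN_RESULTS)
print(rows)
```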



Lineage for dbt

Table 3.

Object

Lineage available

dbt model materialized as view

  • Referenced database tables and columns in dbt model materialized as view

dbt resource

  • dbt resources that are upstream and downstream (for example, seeds that are upstream of models, and tests that are downstream of models) of dbt resource.
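
The upstream and downstream relationships between dbt resources can be derived from the depends_on.nodes list that each node carries in manifest.json. The sketch below inverts that list to obtain downstream resources. The sample nodes are invented, and this is one plausible derivation rather than the collector's documented method.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def lineage(nodes: Dict[str, Dict[str, Any]]) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (upstream, downstream) maps keyed by unique_id."""
    # Upstream: what each node declares it depends on.
    upstream = {
        uid: sorted(node.get("depends_on", {}).get("nodes", []))
        for uid, node in nodes.items()
    }
    # Downstream: the inverse of the upstream map.
    downstream = defaultdict(list)
    for uid, parents in upstream.items():
        for parent in parents:
            downstream[parent].append(uid)
    return upstream, {uid: sorted(children) for uid, children in downstream.items()}


# Invented sample, shaped like the "nodes" object in manifest.json.
SAMPLE_NODES = {
    "seed.shop.raw_orders": {"resource_type": "seed", "depends_on": {"nodes": []}},
    "model.shop.orders": {"resource_type": "model", "depends_on": {"nodes": ["seed.shop.raw_orders"]}},
    "test.shop.not_null_orders_id": {"resource_type": "test", "depends_on": {"nodes": ["model.shop.orders"]}},
}

upstream, downstream = lineage(SAMPLE_NODES)
print("upstream of model:", upstream["model.shop.orders"])
print("downstream of model:", downstream["model.shop.orders"])
```

In this sample the seed is upstream of the model and the test is downstream of it, matching the example given in Table 3.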



Supported cross-system lineage

Important

In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of the dbt resource. For example, from the resource page of a downstream Snowflake table, you can open Eureka Explorer to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

The currently supported data sources for cross-system lineage are:

  • Snowflake

    Important

    While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Cloud and these sources.

Supported versions of dbt

The collector supports the following dbt versions:

  • dbt 1.0.0

  • dbt 1.0.5

  • dbt 1.1.1

  • dbt 1.3.0

  • dbt 1.4.0

  • dbt 1.5.0

  • dbt 1.6.0

  • dbt 1.7.0

Authentication supported

  • The collector supports authenticating to dbt Cloud using an API key.

  • When the collector authenticates to Snowflake as the target database, it supports either:

    • Username and key pair authentication.

    • Username and password authentication.

      Warning

      dbt is deprecating user tokens on September 18. If you are using legacy user tokens, you will need to generate either a service token or an account-scoped access token to use with the collector.

      In your collector configuration, update the dbt Cloud API key (--dbt-cloud-api-key) parameter with the new token.