About the dbt Cloud collector

The dbt Cloud collector connects to your dbt Cloud project and harvests dbt assets, along with column-level lineage relationships from the database views associated with those assets.

Important

The dbt Cloud collector can be run in the Cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.292. For details on this version and all previous versions, see the release notes.

How does the dbt Cloud collector work?

The dbt Cloud Collector operates within the dbt Cloud platform, an ETL tool focused on moving and transforming data within relational databases. It provides access to detailed metadata about these processes.

Here is a detailed breakdown of how the dbt Cloud Collector works:

Metadata Harvesting:
- The collector connects to dbt Cloud to extract metadata from artifacts created by dbt jobs, specifically targeting manifest.json and catalog.json files.
- When a job in dbt Cloud is executed, configured dbt commands run to perform ETL operations. If docs generate is active in these commands, dbt Cloud generates the aforementioned artifacts, storing them with each run's results.
Connecting to the dbt Cloud API: The dbt Cloud Collector uses the dbt Cloud API to retrieve these artifacts, facilitating the extraction of metadata.
Lineage Metadata Generation:
- As an ETL platform, dbt produces lineage metadata for database objects, such as tables and views.
- For the collector to accurately track these objects, it requires precise database information, which it gathers from the dbt Cloud project data accessible via the dbt Cloud API.
Target Database Information:
- Certain crucial details, like database connection passwords, may be absent from the dbt Cloud project API.
- Because the collector directly connects to the target database to retrieve catalog/schema data, any missing information must be provided through command options, which also allow for modifications to the available database data.
Artifact Analysis and Database View Parsing:
- The collector scans manifest.json to create catalog resources for any identified models, snapshots, seeds, and tests.
- It processes database view definitions created by dbt Cloud, establishing connections to the target database to ensure a comprehensive representation of table lineage.
Lineage Information Over Detailed Metadata: The collector prioritizes lineage information and does not delve into detailed metadata about database objects. Users requiring detailed metadata should utilize the appropriate database collector for their target database.
Leveraging run_results.json: The collector utilizes run_results.json files when available from a run’s output. It extracts metadata on processes like model manifestations into database views, including their execution time and status.
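The artifact-scanning steps above can be sketched in a few lines of Python. This is a minimal illustration and not the collector's actual implementation. It assumes the usual layout of dbt artifacts (a `nodes` mapping keyed by unique ID in manifest.json, and a `results` list in run_results.json). The sample project `shop`, the function name `summarize`, and the tiny in-memory artifacts are hypothetical.

```python
from collections import defaultdict

# Tiny stand-ins for the two artifacts; real files contain many more fields.
manifest = {
    "nodes": {
        "model.shop.orders": {"resource_type": "model", "name": "orders"},
        "seed.shop.countries": {"resource_type": "seed", "name": "countries"},
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "name": "not_null_orders_id",
        },
    }
}
run_results = {
    "results": [
        {"unique_id": "model.shop.orders", "status": "success", "execution_time": 1.25},
    ]
}


def summarize(manifest, run_results=None):
    """Group manifest nodes by resource type and attach run status when known."""
    # Index run results by unique ID so each node can look up its own result.
    results = {r["unique_id"]: r for r in (run_results or {}).get("results", [])}
    by_type = defaultdict(list)
    for unique_id, node in manifest.get("nodes", {}).items():
        result = results.get(unique_id, {})
        by_type[node["resource_type"]].append(
            {
                "unique_id": unique_id,
                "name": node["name"],
                # Both values stay None when run_results.json has no entry.
                "status": result.get("status"),
                "execution_time": result.get("execution_time"),
            }
        )
    return dict(by_type)


summary = summarize(manifest, run_results)
for resource_type, items in summary.items():
    print(resource_type, [item["name"] for item in items])
```

Because run_results.json is optional, the sketch treats a missing result as "no status known" rather than as an error.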

How does the collector know which dbt Cloud run to harvest from?

Choosing the appropriate run is crucial for ensuring that the dbt Cloud collector harvests metadata from the correct artifacts. If you run the dbt Cloud collector and the dbt resources or lineage information in your catalog appear incorrect, there is a good chance the collector retrieved artifacts from the wrong job run.

To make sure that the correct metadata is being harvested, the dbt Cloud collector needs to know the Project, Job, and Environment combination.

In dbt Cloud, a job is typically executed on a schedule. The dbt Cloud collector uses the following logic to determine the job run from which it will obtain the metadata artifacts to process:

The collector examines only dbt Cloud runs associated with the dbt Cloud project specified by the user.
Any unsuccessful runs and any runs that did not produce metadata artifacts are discarded.
If the user specifies a job (by identifier or name) or an environment (by identifier or name), then only runs from that job and/or environment are examined.
If multiple successful runs are found, then the collector harvests the metadata from the most recent run.
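The selection rules above can be expressed as a short filter. The following is a sketch under stated assumptions, not the collector's code: the `Run` record, its field names, and the `select_run` function are hypothetical and exist only to make the logic concrete.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Run:
    # Hypothetical record describing one dbt Cloud job run.
    run_id: int
    project_id: int
    job_id: int
    environment_id: int
    succeeded: bool
    has_artifacts: bool
    finished_at: datetime


def select_run(runs, project_id, job_id=None, environment_id=None) -> Optional[Run]:
    """Return the most recent successful run with artifacts, or None."""
    candidates = [
        r
        for r in runs
        if r.project_id == project_id                # only the specified project
        and r.succeeded                              # discard unsuccessful runs
        and r.has_artifacts                          # discard runs without artifacts
        and (job_id is None or r.job_id == job_id)   # optional job filter
        and (environment_id is None or r.environment_id == environment_id)
    ]
    # Most recent run wins when several candidates remain.
    return max(candidates, key=lambda r: r.finished_at, default=None)


runs = [
    Run(1, 10, 100, 5, True, True, datetime(2024, 1, 1)),
    Run(2, 10, 100, 5, False, True, datetime(2024, 1, 3)),   # failed run
    Run(3, 10, 200, 6, True, True, datetime(2024, 1, 2)),
    Run(4, 10, 200, 6, True, False, datetime(2024, 1, 4)),   # no artifacts
    Run(5, 99, 300, 7, True, True, datetime(2024, 1, 5)),    # other project
]

chosen = select_run(runs, project_id=10)
chosen_for_job = select_run(runs, project_id=10, job_id=100)
```

In this sample, narrowing by job changes the result, which is why specifying the job or environment matters when a project has several scheduled jobs.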

Running target database collector in addition to dbt Cloud Collector

The dbt Cloud collector only captures the lineage relationships but does not harvest any information about the database tables/views and columns involved in lineage relationships. In order to harvest information such as the table/view or column name, descriptions, data types, and table/view-column associations, you need to run the appropriate database collector (for example, the Snowflake collector) on the target database.

What is cataloged

The information cataloged by the collector includes metadata for the following dbt resources:

Note

Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields for either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
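If you process dbt artifacts yourself, one way to cope with the rename is to look for the newer property first and fall back to the older one. The helper below is an illustrative sketch. The function name `get_code` and the sample nodes are hypothetical, and only the four property names come from dbt.

```python
def get_code(node, kind):
    """Return the raw or compiled code of a manifest node, whichever name is used.

    kind is "raw" or "compiled". Newer artifacts use raw_code/compiled_code,
    older ones use raw_sql/compiled_sql.
    """
    if kind not in ("raw", "compiled"):
        raise ValueError("kind must be 'raw' or 'compiled'")
    for key in (f"{kind}_code", f"{kind}_sql"):
        if key in node:
            return node[key]
    return None  # the property is absent, for example on an uncompiled node


old_node = {"raw_sql": "select 1", "compiled_sql": "select 1"}
new_node = {"raw_code": "select 2", "compiled_code": "select 2"}

print(get_code(old_node, "raw"))
print(get_code(new_node, "compiled"))
```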

Table 1.

Object	Information cataloged
Analysis	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type
Model	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type
Model column	Column name
Project	Name, Project version
Snapshot	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type
Seed	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type
Source	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type
Test	Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type
Test result	Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any)
Semantic Models	Name, Description, Path, Package name, Unique ID, Enabled, Resource Type, Semantic Model Components, Primary Entity
Entities	Title, SQL Expression, Entity Type
Dimensions	Title, Dimension Type
Measures	Title, Description, Has Measure Aggregation
Metrics	Title, Description, Path, Package Name, Unique ID, Metric Type

Note: The collector also harvests the following information about the dbt Cloud resources. This information is available in the collector output but is not presented in the UI. You can use it for querying and for Eureka Automations.

The run that produced the artifacts
The job that configured the run
The environment for the job (that specifies the target database)
The dbt Cloud project in which the job is defined
The dbt Cloud account

Relationship between objects

By default, the data.world catalog will include catalog pages for the resource types below. Each catalog page will have a relationship to other related resource types. Note that the catalog presentation and relationships are fully configurable, so the table below lists the default configuration.

Table 2.

Resource page	Relationship
Model	Project containing dbt model; Tests testing the integrity of model; dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of model; dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of model
Semantic Model	Project containing the semantic model; dbt model related to the semantic model; dbt semantic model components (dimensions, entities, measures); Metric that the semantic model provides context for
Model column	The database column in the manifested table or view
Project	dbt resources (Test, Seed, Model, Snapshot, Source) contained within project
Snapshot	Project containing dbt snapshot; dbt resources (Test, Seed, Model, Source) that are upstream of snapshot; dbt resources (Test, Seed, Model, Source) that are downstream of snapshot
Seed	Project containing dbt seed; dbt resources (Test, Seed, Model, Snapshot, Source) that are upstream of seed; dbt resources (Test, Seed, Model, Snapshot, Source) that are downstream of seed
Source	Project containing dbt source; dbt resources (Test, Seed, Model, Snapshot) that are downstream of source; Database schema that the source represents
Test	Project containing dbt test; dbt model that has its integrity tested by this test
Test result	The dbt test that was executed to produce the result

Lineage for dbt

Table 3.

Object	Lineage available
dbt model materialized as view	Referenced database tables and columns in dbt model materialized as view
dbt resource	dbt resources that are upstream and downstream of the dbt resource (for example, seeds that are upstream of models, and tests that are downstream of models)
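To make the upstream/downstream relationships in the table concrete, the sketch below derives them from a manifest. It assumes each node in manifest.json lists its parents under `depends_on.nodes`, which is how dbt records dependencies. The function name `build_lineage` and the sample project are hypothetical, and this is not the collector's own lineage algorithm.

```python
from collections import defaultdict


def build_lineage(manifest):
    """Return (upstream, downstream) maps keyed by dbt unique ID."""
    upstream = {}
    downstream = defaultdict(list)
    for unique_id, node in manifest.get("nodes", {}).items():
        parents = node.get("depends_on", {}).get("nodes", [])
        upstream[unique_id] = list(parents)
        # Inverting the parent list gives each parent its downstream resources.
        for parent in parents:
            downstream[parent].append(unique_id)
    return upstream, dict(downstream)


manifest = {
    "nodes": {
        "seed.shop.countries": {"depends_on": {"nodes": []}},
        "model.shop.orders": {"depends_on": {"nodes": ["seed.shop.countries"]}},
        "test.shop.not_null_orders_id": {
            "depends_on": {"nodes": ["model.shop.orders"]}
        },
    }
}

upstream, downstream = build_lineage(manifest)
print(upstream["model.shop.orders"])      # parents of the model
print(downstream["model.shop.orders"])    # resources that depend on the model
```

Here the seed is upstream of the model and the test is downstream of it, matching the example in the table.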

Supported cross-system lineage

Important

In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of dbt. For example, you can open Eureka Explorer from a downstream Snowflake table resource page to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

The currently supported data sources for cross-system lineage are:

Snowflake

Important

While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Cloud and these sources.

Supported versions of dbt

The collector supports the following dbt versions:

dbt 1.0.0
dbt 1.0.5
dbt 1.1.1
dbt 1.3.0
dbt 1.4.0
dbt 1.5.0
dbt 1.6.0
dbt 1.7.0

Authentication supported

The collector supports authenticating to dbt Cloud using an API key.
When the collector authenticates to Snowflake as the target database, it supports either:
- Username and key pair authentication
- Username and password authentication

Warning

dbt is deprecating user tokens on September 18. If you are using legacy user tokens, you will need to generate either a service token or an account-scoped access token to use with the collector.
In your collector configuration, update the dbt Cloud API key (--dbt-cloud-api-key) parameter with the new token.
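For orientation, the sketch below shows how a token such as the one passed in --dbt-cloud-api-key is commonly presented when requesting a run artifact from the dbt Cloud API. It rests on several assumptions: the v2 artifact path, the `Token` authorization scheme, and the `cloud.getdbt.com` host (which differs by region and plan) are taken from general dbt Cloud API usage, not from the collector itself. The function name and the sample IDs are hypothetical. The code only builds the request and does not send it.

```python
from urllib.request import Request


def build_artifact_request(host, account_id, run_id, artifact, api_key):
    """Build (but do not send) a GET request for one run artifact."""
    # Assumed endpoint shape; check the dbt Cloud API reference for your account.
    url = f"https://{host}/api/v2/accounts/{account_id}/runs/{run_id}/artifacts/{artifact}"
    # Assumed header format: the token is sent in the Authorization header.
    return Request(url, headers={"Authorization": f"Token {api_key}"})


request = build_artifact_request(
    host="cloud.getdbt.com",   # assumption: default multi-tenant host
    account_id=1,              # hypothetical account ID
    run_id=2,                  # hypothetical run ID
    artifact="manifest.json",
    api_key="replace-with-service-token",
)
print(request.full_url)
```

Keeping the token out of source code (for example, by reading it from an environment variable) is a sensible practice when adapting a sketch like this.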
