About the dbt Cloud collector
The dbt Cloud collector connects to your dbt Cloud project and harvests dbt assets and column-level lineage relationships from database views associated with dbt assets.
Important
The dbt Cloud collector can be run in the Cloud or on-premises using Docker or Jar files.
Note
The latest version of the Collector is 2.251. To view the release notes for this version and all previous versions, please go here.
How does the dbt Cloud collector work?
The dbt Cloud Collector operates within the dbt Cloud platform, an ETL tool focused on moving and transforming data within relational databases. It provides access to detailed metadata about these processes.
Here is a detailed breakdown of how the dbt Cloud Collector works:
Metadata Harvesting:
The collector connects to dbt Cloud to extract metadata from artifacts created by dbt jobs, specifically targeting manifest.json and catalog.json files.
When a job in dbt Cloud is executed, its configured dbt commands run to perform ETL operations. If those commands include docs generate, dbt Cloud generates the manifest.json and catalog.json artifacts and stores them with each run's results.
Connecting to the dbt Cloud API: The dbt Cloud Collector uses the dbt Cloud API to retrieve these artifacts, facilitating the extraction of metadata.
Lineage Metadata Generation:
As an ETL platform, dbt produces lineage metadata for database objects, such as tables and views.
For the collector to accurately track these objects, it requires precise database information, which it gathers from the dbt Cloud project data accessible via the dbt Cloud API.
Target Database Information:
Certain crucial details, like database connection passwords, may be absent from the dbt Cloud project API.
Because the collector directly connects to the target database to retrieve catalog/schema data, any missing information must be provided through command options, which also allow for modifications to the available database data.
Artifact Analysis and Database View Parsing:
The collector scans manifest.json to create catalog resources for any identified models, snapshots, seeds, and tests.
It processes database view definitions created by dbt Cloud, establishing connections to the target database to ensure a comprehensive representation of table lineage.
Lineage Information Over Detailed Metadata: The collector prioritizes lineage information and does not delve into detailed metadata about database objects. Users requiring detailed metadata should utilize the appropriate database collector for their target database.
Leveraging run_results.json: The collector uses run_results.json files when they are available in a run’s output. It extracts metadata on processes such as the materialization of models as database views, including their execution time and status.
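The breakdown above can be illustrated with a minimal Python sketch. It is not the collector's implementation. It assumes the dbt Cloud API v2 "retrieve run artifact" endpoint layout and the multi-tenant base URL shown in the code (verify both against the dbt Cloud API documentation for your account), and it assumes the usual artifact layout: a nodes mapping with a resource_type per node in manifest.json, and a results list with unique_id, status, and execution_time in run_results.json. The function names and sample data are hypothetical.

```python
# Assumption: multi-tenant dbt Cloud base URL; single-tenant accounts differ.
BASE_URL = "https://cloud.getdbt.com/api/v2"


def artifact_url(account_id, run_id, path):
    """Build the (assumed) URL of one artifact, e.g. manifest.json, for a run."""
    return f"{BASE_URL}/accounts/{account_id}/runs/{run_id}/artifacts/{path}"


def resources_from_manifest(manifest):
    """Group the unique IDs of manifest nodes by their resource type."""
    grouped = {}
    for unique_id, node in manifest.get("nodes", {}).items():
        grouped.setdefault(node.get("resource_type"), []).append(unique_id)
    return grouped


def run_summaries(run_results):
    """Map each node's unique ID to its status and execution time."""
    return {
        result["unique_id"]: {
            "status": result.get("status"),
            "execution_time": result.get("execution_time"),
        }
        for result in run_results.get("results", [])
    }


# Hand-written sample data standing in for downloaded artifacts.
sample_manifest = {
    "nodes": {
        "model.shop.orders": {"resource_type": "model"},
        "seed.shop.countries": {"resource_type": "seed"},
    }
}
sample_run_results = {
    "results": [
        {"unique_id": "model.shop.orders", "status": "success", "execution_time": 1.25}
    ]
}

print(artifact_url(1, 2, "manifest.json"))
print(resources_from_manifest(sample_manifest))
print(run_summaries(sample_run_results))
```

The sketch only reads the artifacts. Retrieving them would additionally require an authenticated HTTP request with your dbt Cloud API key.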
How does the collector know which dbt Cloud run to harvest from?
Choosing the appropriate run is crucial: the dbt Cloud collector must process the correct artifacts to harvest the right metadata. If you run the dbt Cloud collector and the dbt resources or lineage information in your catalog appear incorrect, the collector most likely retrieved artifacts from the wrong job run.
To make sure that the correct metadata is being harvested, the dbt Cloud collector needs to know the Project, Job, and Environment combination.
In dbt Cloud, a job is typically executed on a schedule. The dbt Cloud collector uses the following logic to determine the job run from which it will obtain the metadata artifacts to process:
The collector examines only dbt Cloud runs associated with the dbt Cloud project specified by the user.
Any unsuccessful runs and any runs that did not produce metadata artifacts are discarded.
If the user specifies a job (by identifier or name) or an environment (by identifier or name), then only runs from that job and/or environment are examined.
If multiple successful runs are found, then the collector harvests the metadata from the most recent run.
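The selection rules above can be sketched in Python as follows. This is an illustration of the described logic only, not the collector's code. The Run record, its field names, and the select_run function are hypothetical, and the sketch matches jobs and environments by identifier only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Run:
    # Hypothetical, simplified view of a dbt Cloud run.
    run_id: int
    project_id: int
    job_id: int
    environment_id: int
    succeeded: bool
    has_artifacts: bool
    finished_at: datetime


def select_run(
    runs: Iterable[Run],
    project_id: int,
    job_id: Optional[int] = None,
    environment_id: Optional[int] = None,
) -> Optional[Run]:
    """Return the most recent successful run that matches the filters."""
    candidates = [
        run
        for run in runs
        if run.project_id == project_id          # only the user's project
        and run.succeeded                        # discard unsuccessful runs
        and run.has_artifacts                    # discard runs without artifacts
        and (job_id is None or run.job_id == job_id)
        and (environment_id is None or run.environment_id == environment_id)
    ]
    # If several runs qualify, the most recent one wins.
    return max(candidates, key=lambda run: run.finished_at, default=None)


runs = [
    Run(101, 1, 10, 5, True, True, datetime(2024, 1, 1)),
    Run(102, 1, 10, 5, True, True, datetime(2024, 1, 2)),
    Run(103, 1, 10, 5, False, True, datetime(2024, 1, 3)),  # failed run
    Run(104, 1, 11, 6, True, False, datetime(2024, 1, 4)),  # no artifacts
    Run(105, 2, 12, 7, True, True, datetime(2024, 1, 5)),   # other project
]

chosen = select_run(runs, project_id=1)
print(chosen.run_id if chosen else None)
```

Passing job_id or environment_id narrows the candidates, which is why specifying them is the usual remedy when the wrong run is being harvested.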
Running the target database collector in addition to the dbt Cloud collector
The dbt Cloud collector captures only lineage relationships; it does not harvest any information about the database tables, views, and columns involved in those relationships. To harvest information such as table/view or column names, descriptions, data types, and table/view-column associations, you need to run the appropriate database collector (for example, the Snowflake collector) on the target database.
What is cataloged
The information cataloged by the collector includes metadata for the following dbt resources:
Note
Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields for either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
Object | Information cataloged |
---|---|
Analysis | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Model | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Model column | Column name |
Project | Name, Project version |
Snapshot | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Seed | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Source | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type |
Test | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Test result | Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any) |
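As a small illustration of the raw_sql/raw_code rename mentioned in the note before the table, the sketch below reads whichever property a manifest node carries. The helper name code_fields and the sample nodes are hypothetical. Only the four property names come from dbt artifacts.

```python
def code_fields(node):
    """Return raw and compiled code, preferring the dbt 1.3+ property names."""
    return {
        "raw": node.get("raw_code", node.get("raw_sql")),
        "compiled": node.get("compiled_code", node.get("compiled_sql")),
    }


# A node as written before dbt 1.3 and one as written from dbt 1.3 onward.
old_node = {"raw_sql": "select 1", "compiled_sql": "select 1"}
new_node = {"raw_code": "select 2", "compiled_code": "select 2"}

print(code_fields(old_node))
print(code_fields(new_node))
```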
Note: The collector also harvests the following information about the dbt Cloud resources. This information is available in the collector output and not presented in the UI. You can use this information for querying and Eureka Automations.
The run that produced the artifacts
The job that configured the run
The environment for the job (that specifies the target database)
The dbt Cloud project in which the job is defined
The dbt Cloud account
Relationship between objects
By default, the data.world catalog will include catalog pages for the resource types below. Each catalog page will have a relationship to other related resource types. Note that the catalog presentation and relationships are fully configurable, so these will list the default configuration.
Resource page | Relationship |
---|---|
Model | |
Model column | |
Project | dbt resources (Test, Seed, Model, Snapshot, Source) contained within project |
Snapshot | |
Seed | |
Source | |
Test | |
Test result | |
Lineage for dbt
Object | Lineage available |
---|---|
dbt model materialized as view | |
dbt resource | |
Supported cross-system lineage
Important
In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of dbt. For example, from a downstream Snowflake table resource page you can open Eureka Explorer to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.
The currently supported data sources for cross-system lineage are:
Snowflake
Important
While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Cloud and these sources.
Supported versions of dbt
The collector supports the following dbt versions:
dbt 1.0.0
dbt 1.0.5
dbt 1.1.1
dbt 1.3.0
dbt 1.4.0
dbt 1.5.0
dbt 1.6.0
dbt 1.7.0
Authentication supported
The collector supports authenticating to dbt Cloud using an API key.
When the collector authenticates to Snowflake as the target database, it supports either of the following:
Username and key pair authentication
Username and password authentication
Warning
dbt is deprecating user tokens on September 18. If you are using legacy user tokens, you will need to generate either a service token or an account-scoped access token to use with the collector.
In your collector configuration, update the dbt Cloud API key (--dbt-cloud-api-key) parameter with the new token.