About the dbt Core collector
The dbt collector processes artifacts from your dbt Core project to harvest dbt assets and lineage relationships from dbt transformations.
Important
The dbt Core collector can be run on-premises using Docker or JAR files.
Note
The latest version of the collector is 2.247. For release notes covering this version and all previous versions, see the collector release notes.
How does the dbt Core collector work?
The dbt Core collector is designed to extract metadata from artifacts generated by the dbt docs generate command, specifically focusing on manifest.json and catalog.json files. Typically, these files are created or updated in the target subdirectory of your dbt project directory. To ensure that the metadata is up-to-date with the current state of your dbt project, it is recommended to run dbt docs generate immediately after executing dbt run and/or dbt snapshot.
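As a rough pre-flight check before running the collector, the sketch below looks for the two artifacts in a project's target subdirectory and flags a catalog.json that is older than manifest.json. The function names and the modification-time heuristic are illustrative assumptions, not part of the collector, and the sketch assumes the default target location.

```python
from pathlib import Path

# Artifacts the collector reads, as described above.
REQUIRED_ARTIFACTS = ("manifest.json", "catalog.json")


def check_artifacts(project_dir: str) -> dict:
    """Map each required artifact name to its path, or None if it is missing.

    Assumes artifacts live in <project_dir>/target (the typical location).
    """
    target = Path(project_dir) / "target"
    found = {}
    for name in REQUIRED_ARTIFACTS:
        path = target / name
        found[name] = path if path.is_file() else None
    return found


def catalog_older_than_manifest(project_dir: str) -> bool:
    """Heuristic: True if catalog.json was written before manifest.json.

    That ordering suggests dbt docs generate has not been rerun since the
    manifest was last rewritten (for example, by dbt run).
    """
    found = check_artifacts(project_dir)
    missing = [name for name, path in found.items() if path is None]
    if missing:
        raise FileNotFoundError(f"missing artifacts: {', '.join(missing)}")
    catalog_mtime = found["catalog.json"].stat().st_mtime
    manifest_mtime = found["manifest.json"].stat().st_mtime
    return catalog_mtime < manifest_mtime
```

If the check reports a missing or older catalog.json, rerunning dbt docs generate is the likely remedy.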
Some important things to note:
Database lineage metadata: As an Extract-Transform-Load (ETL) tool, dbt primarily generates lineage metadata for database objects such as tables and views. The dbt Core collector requires accurate and relevant database information to properly identify these associated objects. By default, this information is sourced from the profiles.yml file used to configure dbt. This file is typically located in a .dbt subdirectory within the current user's home directory, unless specified otherwise using the --profile-file option. By default, the first profile listed in profiles.yml serves as the environment definition for the scanned artifacts. If a different profile is preferred, it can be specified by name using the --profile option.
Missing database information: Certain database details, such as passwords, may not be included in the profiles.yml file. Since the dbt Core collector needs to connect to the target database to gather catalog/schema information, any missing data must be supplied via command options. These options also enable users to override existing data in profiles.yml if needed.
Artifacts scanning and output: The dbt Core collector scans manifest.json to produce catalog resources for any models, snapshots, seeds, and tests it discovers. For dbt models expressed as database views, it writes lineage metadata linking each view's columns to the columns in source tables referenced in the view's SQL DDL (SELECT statement).
Detailed metadata limitations: It is important to note that the dbt Core collector focuses solely on lineage information rather than detailed metadata about database objects. To gather detailed metadata, users should execute the relevant database collectors for the target databases.
Utilizing run_results.json: The dbt Core collector also utilizes one or more run_results.json files, if they are available in the target artifacts directory. These files provide metadata on the processes that result in lineage, such as the manifestation of models as database views, including timestamps and status information.
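The profile selection and override behavior described in the notes above can be pictured with a small sketch. It assumes profiles.yml has already been parsed into a dictionary (for example, with a YAML library) and follows dbt's usual profile layout of a target key naming one entry under outputs. The function name resolve_connection and the override keys are hypothetical and are not the collector's actual implementation.

```python
from typing import Any, Mapping, Optional


def resolve_connection(
    profiles: Mapping[str, Any],
    profile_name: Optional[str] = None,
    overrides: Optional[Mapping[str, Any]] = None,
) -> dict:
    """Pick a profile, then layer explicitly supplied values on top of it."""
    if not profiles:
        raise ValueError("no profiles found")
    # Default to the first profile in file order (dicts keep insertion order).
    name = profile_name if profile_name is not None else next(iter(profiles))
    if name not in profiles:
        raise KeyError(f"profile {name!r} not found")
    profile = profiles[name]
    # A dbt profile names its active output under the "target" key.
    connection = dict(profile["outputs"][profile["target"]])
    # Supplied values fill gaps (such as a password) and override file values;
    # None means "not supplied" and leaves the file value in place.
    for key, value in (overrides or {}).items():
        if value is not None:
            connection[key] = value
    return connection
```

For example, calling resolve_connection(profiles, overrides={"password": "..."}) on a profile that omits the password returns the file's settings plus the supplied password, while passing profile_name selects a profile other than the first.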
What is cataloged
The information cataloged by the collector includes metadata for the following dbt Core resources:
Important
Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields for either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
Object | Information cataloged |
---|---|
Analysis | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Model | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type, Model Columns |
Model column | Column name |
Project | Name, Project version |
Snapshot | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Seed | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Source | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type, Columns |
Test | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
Test result | Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any) |
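To illustrate the property rename noted above, the following sketch reads a few of the listed fields from the nodes section of an already-loaded manifest.json and accepts either naming. It is a minimal illustration, not the collector's code; the helper names and the shape of the returned rows are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional


def first_present(node: Dict[str, Any], *keys: str) -> Optional[Any]:
    """Return the value of the first key that exists in the node, else None."""
    for key in keys:
        if key in node:
            return node[key]
    return None


def summarize_nodes(manifest: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect a few cataloged fields for each entry under manifest["nodes"]."""
    rows = []
    for unique_id, node in manifest.get("nodes", {}).items():
        rows.append(
            {
                "unique_id": unique_id,
                "resource_type": node.get("resource_type"),
                "name": node.get("name"),
                # dbt 1.3 and later use raw_code; earlier versions use raw_sql.
                "raw": first_present(node, "raw_code", "raw_sql"),
                # Likewise compiled_code replaces compiled_sql.
                "compiled": first_present(node, "compiled_code", "compiled_sql"),
            }
        )
    return rows
```

The manifest argument would typically come from loading target/manifest.json with the standard json module.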
Relationships between objects
By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.
Resource page | Relationship |
---|---|
Model | |
Model column | |
Project | |
Snapshot | |
Seed | |
Source | |
Test | |
Test result | |
Lineage for dbt Core
Object | Lineage available |
---|---|
dbt model materialized as view | Database tables and columns referenced by the dbt model materialized as a view. |
dbt resource | Upstream and downstream dbt resources of the dbt resource (for example, seeds that are upstream of models, and tests that are downstream of models). |
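As a rough picture of the upstream and downstream relationships in the table above, the sketch below walks the parent_map and child_map sections of an already-loaded manifest.json, which map each resource's unique ID to its direct parents and children. The function names are illustrative, and this is not a description of how the collector itself derives lineage.

```python
from collections import deque
from typing import Any, Dict, List


def walk(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first list of every resource reachable from start."""
    visited = {start}
    ordered = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        ordered.append(node)
        queue.extend(graph.get(node, []))
    return ordered


def resource_lineage(manifest: Dict[str, Any], unique_id: str) -> Dict[str, List[str]]:
    """Return all upstream and downstream resources for one unique ID."""
    return {
        "upstream": walk(manifest.get("parent_map", {}), unique_id),
        "downstream": walk(manifest.get("child_map", {}), unique_id),
    }
```

For a model built from a seed and checked by a test, the seed would appear under upstream and the test under downstream.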
Supported cross-system lineage
The currently supported data sources for cross-system lineage are:
Important
While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Core and these sources.
BigQuery
PostgreSQL
Redshift
Snowflake
Azure Synapse (the only supported adapter is dbt-synapse)
Microsoft SQL Server (the only supported adapter is dbt-sqlserver)
Important
In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of dbt. For example, you can open Eureka Explorer from a downstream Snowflake table resource page to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.
Authentication supported
The collector supports the following authentication methods for Snowflake, BigQuery, PostgreSQL, Redshift, Azure Synapse, and Microsoft SQL Server databases:
Username and password authentication
When authenticating to Snowflake, the collector also supports:
Username and key pair authentication
Supported versions of dbt Core
The collector supports the following dbt Core versions:
dbt 1.0.0
dbt 1.0.5
dbt 1.1.1
dbt 1.3.0
dbt 1.5.0
dbt 1.6.0
dbt 1.7.0