
About the dbt Core collector

The dbt collector processes artifacts from your dbt Core project to harvest dbt assets and lineage relationships from dbt transformations.

Important

The dbt Core collector can be run on-premise using Docker or Jar files.

Note

The latest version of the Collector is 2.255. To view the release notes for this version and all previous versions, please go here.

How does the dbt Core collector work?

The dbt Core collector is designed to extract metadata from artifacts generated by the dbt docs generate command, specifically focusing on manifest.json and catalog.json files. Typically, these files are created or updated in the target subdirectory of your dbt project directory. To ensure that the metadata is up-to-date with the current state of your dbt project, it is recommended to run dbt docs generate immediately after executing dbt run and/or dbt snapshot.
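Before pointing the collector at a project, it can help to confirm that both artifacts are present in the target directory. The following is a minimal sketch of such a check, not part of the collector itself; the function name `missing_artifacts` and the example path are hypothetical.

```python
from pathlib import Path

# Artifact names taken from the description above.
REQUIRED_ARTIFACTS = ("manifest.json", "catalog.json")


def missing_artifacts(target_dir):
    """Return the names of required dbt artifacts not found in target_dir."""
    target = Path(target_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (target / name).is_file()]
```

For example, `missing_artifacts("my_dbt_project/target")` returns an empty list when both files exist. A non-empty list suggests that dbt docs generate has not yet been run for the project.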

Some important things to note:

  • Database lineage metadata: As an Extract-Transform-Load (ETL) tool, dbt primarily generates lineage metadata for database objects such as tables and views. The dbt Core collector requires accurate and relevant database information to properly identify these associated objects. By default, this information is sourced from the profiles.yml file used to configure dbt. This file is typically located in a .dbt subdirectory within the current user's home directory. If the file is in a different location, specify that location using the --profile-file option. By default, the first profile listed in profiles.yml serves as the environment definition for the scanned artifacts. To use a different profile, specify it by name using the --profile option.

  • Missing database information: Certain database details, such as passwords, may not be included in the profiles.yml file. Since the dbt Core collector needs to connect to the target database to gather catalog/schema information, any missing data must be supplied via command options. These options also enable users to override existing data in profiles.yml if needed.

  • Artifacts scanning and output: The dbt Core collector scans manifest.json to produce catalog resources for any models, snapshots, seeds, and tests it discovers. For dbt models expressed as database views, it writes lineage metadata linking each view's columns to the columns in source tables referenced in the view's SQL DDL (SELECT statement).

  • Detailed metadata limitations: It is important to note that the dbt Core collector focuses solely on lineage information rather than detailed metadata about database objects. To gather detailed metadata, users should execute the relevant database collectors for the target databases.

  • Utilizing run_results.json: The dbt Core collector also utilizes one or more run_results.json files, if they are available in the target artifacts directory. These files provide metadata on the processes that result in lineage, such as the manifestation of models as database views, including timestamps and status information.
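To get a feel for what the collector will find in manifest.json, the resources in the file can be counted by type. This is an illustrative sketch only, not the collector's implementation. It assumes the usual layout of the dbt manifest, with a top-level `nodes` mapping whose entries carry a `resource_type` field and a separate top-level `sources` mapping; the function name `count_resources` is hypothetical.

```python
from collections import Counter


def count_resources(manifest):
    """Count dbt resources in a parsed manifest.json, grouped by resource type."""
    counts = Counter()
    # Assumption: models, snapshots, seeds, and tests are entries under "nodes".
    for node in manifest.get("nodes", {}).values():
        counts[node.get("resource_type", "unknown")] += 1
    # Assumption: sources are kept in their own top-level "sources" mapping.
    sources = manifest.get("sources", {})
    if sources:
        counts["source"] += len(sources)
    return dict(counts)
```

The argument is the parsed file, for example the result of `json.load` on target/manifest.json.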

What is cataloged

The information cataloged by the collector includes metadata for the following dbt Core resources:

Important

Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields labeled either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
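If you read dbt artifacts in your own scripts, the rename can be handled by checking for both property names. The helper below is a small sketch under that assumption; the name `get_code` is hypothetical and is not part of dbt or the collector.

```python
def get_code(node):
    """Return (raw, compiled) code for a manifest node under either naming scheme.

    Newer artifacts use raw_code/compiled_code; older ones use
    raw_sql/compiled_sql. A value is None when neither key is present.
    """
    raw = node.get("raw_code", node.get("raw_sql"))
    compiled = node.get("compiled_code", node.get("compiled_sql"))
    return raw, compiled
```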

Table 1.

Object

Information cataloged

Analysis

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Model

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type, Model Columns

Model column

Column name

Project

Name, Project version

Snapshot

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Seed

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Source

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type, Columns

Test

Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type

Test result

Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any)
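The test result fields listed above correspond to information that dbt records in run_results.json. As a rough illustration, the sketch below pulls those fields out of a parsed file. It assumes the usual artifact layout (a `results` list whose entries have `unique_id`, `status`, `failures`, `message`, and a `timing` list with an `execute` step); the function name `summarize_test_results` is hypothetical and this is not how the collector itself is implemented.

```python
def summarize_test_results(run_results):
    """Extract execution time, status, failures, and message for each test result."""
    rows = []
    for result in run_results.get("results", []):
        unique_id = result.get("unique_id", "")
        # Assumption: unique IDs of tests start with the "test." prefix.
        if not unique_id.startswith("test."):
            continue
        # Assumption: the "execute" timing step marks when the test actually ran.
        timing = result.get("timing") or []
        execute = next((t for t in timing if t.get("name") == "execute"), None)
        rows.append({
            "test": unique_id,
            "executed_at": execute.get("started_at") if execute else None,
            "status": result.get("status"),
            "failures": result.get("failures"),
            "message": result.get("message"),
        })
    return rows
```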



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

Model

  • Project containing dbt model

  • Tests testing the integrity of model

  • dbt resources (test, seed, model, snapshot, source) that are upstream of model

  • dbt resources (test, seed, model, snapshot, source) that are downstream of model

Model column

  • The database column in the manifested table or view

Project

  • dbt resources (test, seed, model, snapshot, source) contained within project

Snapshot

  • Project containing dbt snapshot

  • dbt resources (test, seed, model, source) that are upstream of snapshot

  • dbt resources (test, seed, model, source) that are downstream of snapshot

Seed

  • Project containing dbt seed

  • dbt resources (test, seed, model, snapshot, source) that are upstream of seed

  • dbt resources (test, seed, model, snapshot, source) that are downstream of seed

Source

  • Project containing dbt source

  • dbt resources (test, seed, model, snapshot) that are downstream of source

  • database table that is the source of data for source

Test

  • Project containing dbt test

  • dbt model that has its integrity tested by this test

Test result

  • The dbt test that was executed to produce the result



Lineage for dbt Core

Table 3.

Object

Lineage available

dbt model materialized as view

Referenced database tables and columns in dbt model materialized as view.

dbt resource

dbt resources that are upstream and downstream (for example, seeds that are upstream of models, and tests that are downstream of models) of dbt resource.
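The upstream and downstream relationships between dbt resources can be derived from the dependencies that dbt records for each node. The sketch below shows one way to do this from a parsed manifest.json. It assumes that each entry under `nodes` lists its parents in `depends_on.nodes`; the function name `build_lineage` is hypothetical and the collector's own logic may differ.

```python
from collections import defaultdict


def build_lineage(manifest):
    """Return (upstream, downstream) maps keyed by dbt unique ID."""
    upstream = {}
    downstream = defaultdict(list)
    for unique_id, node in manifest.get("nodes", {}).items():
        # Assumption: parents are listed under depends_on.nodes.
        parents = node.get("depends_on", {}).get("nodes", [])
        upstream[unique_id] = list(parents)
        # Invert each parent edge to record the downstream direction.
        for parent in parents:
            downstream[parent].append(unique_id)
    return upstream, dict(downstream)
```

With a seed feeding a model that has a test attached, the seed appears in the model's upstream list and the test appears in the model's downstream list, matching the example in the table above.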



Supported cross-system lineage

The currently supported data sources for cross-system lineage are:

Important

While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between dbt Core and these sources.

  • BigQuery

  • PostgreSQL

  • Redshift

  • Snowflake

  • Azure Synapse (the only supported adapter is dbt-synapse)

  • Microsoft SQL Server (the only supported adapter is dbt-sqlserver)

    Important

    In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of dbt. For example, you can open Eureka Explorer from a downstream Snowflake table resource page to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

Authentication supported

The collector supports the following authentication methods for Snowflake, BigQuery, PostgreSQL, Redshift, Azure Synapse, and Microsoft SQL Server databases:

  • Username and password authentication

When authenticating to Snowflake, the collector also supports:

  • Username and key pair authentication

Supported versions of dbt Core

The collector supports the following dbt Core versions:

  • dbt 1.0.0

  • dbt 1.0.5

  • dbt 1.1.1

  • dbt 1.3.0

  • dbt 1.5.0

  • dbt 1.6.0

  • dbt 1.7.0