
About the dbt Core collector

The dbt collector processes artifacts from your dbt Core project to harvest dbt assets and lineage relationships from dbt transformations.

Important

The dbt Core collector can be run on-premises using Docker or JAR files.

Note

The latest version of the Collector is 2.200. To view the release notes for this version and all previous versions, please go here.

How does the dbt Core collector work?

The collector harvests metadata from dbt-generated files.

The dbt Core collector also identifies how dbt moves data between tables (that is, lineage). To do this, the collector must parse view SQL, which requires connection information for the target database. Without that information, the collector cannot harvest lineage relationships between columns that are defined through views. Supply the connection information through dbt’s profiles.yml file, the data.world YAML file, or the CLI command.

Important

If the dbt profiles.yml file is not provided, no lineage relationships between columns and views will be available and no database resources will be harvested.
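As a rough illustration of the inputs described above, the following sketch checks a dbt output directory before a collector run. It assumes the standard dbt artifact file names (manifest.json, catalog.json, run_results.json); this page only says "dbt generated files", so the exact set the collector reads is an assumption, and `preflight` is a hypothetical helper, not part of the collector.

```python
from pathlib import Path

# Assumption: standard dbt artifact names. The collector documentation
# does not list which files it reads.
ASSUMED_ARTIFACTS = ("manifest.json", "catalog.json", "run_results.json")


def preflight(target_dir, profiles_path):
    """Report which assumed artifacts are missing and whether profiles.yml exists."""
    target = Path(target_dir)
    missing = [name for name in ASSUMED_ARTIFACTS if not (target / name).is_file()]
    return {
        "missing_artifacts": missing,
        # Without profiles.yml (or equivalent connection settings),
        # view-based column lineage is not harvested.
        "has_profiles": Path(profiles_path).is_file(),
    }
```

Calling `preflight("target", "profiles.yml")` from the project root would return a dictionary that can be inspected before starting the collector.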

Note, however, that the dbt Core collector does not harvest everything that the target database collector harvests. For example, the Snowflake collector can harvest profiling, tags, and policies, which the dbt Core collector does not harvest. To build a comprehensive data catalog, run both the dbt Core collector and the target database collector.

What is cataloged

The information cataloged by the collector includes metadata for the following dbt Core resources:

Important

Starting with dbt 1.3, the raw_sql and compiled_sql properties in dbt artifacts are renamed to raw_code and compiled_code. Depending on the version of dbt that you are using, you may see fields for either Raw SQL/Compiled SQL or Raw Code/Compiled Code.
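The rename can be handled with a simple fallback when reading artifact nodes yourself. This is a minimal sketch, assuming a node is a plain dictionary loaded from a dbt artifact; `get_code` is a hypothetical helper and does not describe how the collector itself reads these fields.

```python
def get_code(node):
    """Return (raw, compiled) text for a node, whichever property names it uses."""
    # dbt 1.3 and later use raw_code/compiled_code; earlier versions use
    # raw_sql/compiled_sql. Prefer the newer names and fall back to the older ones.
    raw = node.get("raw_code", node.get("raw_sql"))
    compiled = node.get("compiled_code", node.get("compiled_sql"))
    return raw, compiled
```

A field that is absent under both names comes back as `None`, for example for a node that was never compiled.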

Table 1.

| Object | Information cataloged |
| --- | --- |
| Analysis | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Model | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type, Model Columns |
| Model column | Column name |
| Project | Name, Project version |
| Snapshot | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Seed | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Source | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Source name, Resource type, Columns |
| Test | Name, Description, Path, Root path, Package name, Unique ID, Alias, Meta, Raw SQL/Raw Code, Compiled SQL/Compiled Code, Enabled, Materialized, Resource type |
| Test result | Time the test was executed, Status, Count of failures (if any), Message emitted by the test (if any) |
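To preview the test result fields listed above, a dbt run_results.json file can be summarized with a few lines of Python. This is a sketch under assumptions: it relies on the run_results.json layout that dbt writes (a `results` list whose entries carry `unique_id`, `status`, `failures`, `message`, and `timing`), uses the start of the `execute` timing step as the execution time, and `summarize_test_results` is a hypothetical helper rather than collector code.

```python
def summarize_test_results(run_results):
    """Extract execution time, status, failure count, and message per test."""
    rows = []
    for result in run_results.get("results", []):
        unique_id = result.get("unique_id", "")
        # dbt unique IDs start with the resource type, e.g. "test.my_project...".
        if not unique_id.startswith("test."):
            continue
        # Assumption: use the start of the "execute" step as the execution time.
        executed_at = next(
            (t.get("started_at") for t in result.get("timing", [])
             if t.get("name") == "execute"),
            None,
        )
        rows.append({
            "test": unique_id,
            "executed_at": executed_at,
            "status": result.get("status"),
            "failures": result.get("failures"),
            "message": result.get("message"),
        })
    return rows
```

Load the file with `json.load` and pass the resulting dictionary to the function.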



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

Model

  • Project containing dbt model

  • Tests testing the integrity of model

  • dbt resources (test, seed, model, snapshot, source) that are upstream of model

  • dbt resources (test, seed, model, snapshot, source) that are downstream of model

Model column

  • The database column in the manifested table or view

Project

  • dbt resources (test, seed, model, snapshot, source) contained within project

Snapshot

  • Project containing dbt snapshot

  • dbt resources (test, seed, model, source) that are upstream of snapshot

  • dbt resources (test, seed, model, source) that are downstream of snapshot

Seed

  • Project containing dbt seed

  • dbt resources (test, seed, model, snapshot, source) that are upstream of seed

  • dbt resources (test, seed, model, snapshot, source) that are downstream of seed

Source

  • Project containing dbt source

  • dbt resources (test, seed, model, snapshot) that are downstream of source

  • Database table that is the source of data for source

Test

  • Project containing dbt test

  • dbt model that has its integrity tested by this test

Test result

  • The dbt test that was executed to produce the result



Lineage for dbt Core

Table 3.

| Object | Lineage available |
| --- | --- |
| dbt model materialized as view | Referenced database tables and columns in dbt model materialized as view. |
| dbt resource | dbt resources that are upstream and downstream (for example, seeds that are upstream of models, and tests that are downstream of models) of dbt resource. |
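The upstream and downstream relationships between dbt resources can be approximated directly from manifest.json. The sketch below assumes the manifest layout that dbt writes, where `nodes` and `sources` are keyed by unique ID and each node lists its parents under `depends_on.nodes`. `build_lineage` is a hypothetical helper for illustration and is not the collector's implementation.

```python
from collections import defaultdict


def build_lineage(manifest):
    """Return (upstream, downstream) maps keyed by dbt unique ID."""
    resources = {**manifest.get("sources", {}), **manifest.get("nodes", {})}
    upstream = {}
    downstream = defaultdict(list)
    for unique_id, node in resources.items():
        # Sources have no depends_on entry, so they get an empty parent list.
        parents = node.get("depends_on", {}).get("nodes", [])
        upstream[unique_id] = list(parents)
        for parent in parents:
            downstream[parent].append(unique_id)
    return upstream, dict(downstream)
```

For a seed that feeds a model with one test, the seed appears in the model's upstream list and the test appears in the model's downstream list, matching the example in the table.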



The dbt Core collector also harvests column-level lineage for the following target databases:

  • BigQuery

  • PostgreSQL

  • Redshift

  • Snowflake

Important

In Eureka Explorer, these harvested lineage relationships are displayed from the page of the resource that is upstream or downstream of the dbt resource. For example, from the resource page of a downstream Snowflake table, you can open Eureka Explorer to see which upstream Snowflake table was transformed by a view associated with a dbt model. The dbt resource also appears in Eureka Explorer.

Supported versions of dbt Core

The collector supports the following dbt Core versions:

  • dbt 1.0.0

  • dbt 1.0.5

  • dbt 1.1.1

  • dbt 1.3.0

  • dbt 1.5.0

  • dbt 1.6.0

  • dbt 1.7.0
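To check a project against the list above, the dbt version recorded in an artifact can be compared with the listed versions. This is a sketch with two assumptions: the version is read from the `metadata.dbt_version` field that dbt writes into manifest.json, and the match is exact, because this page does not say whether other patch releases of a listed version are also supported. `is_listed_version` is a hypothetical helper.

```python
# Versions copied from the list above.
LISTED_VERSIONS = {"1.0.0", "1.0.5", "1.1.1", "1.3.0", "1.5.0", "1.6.0", "1.7.0"}


def is_listed_version(manifest):
    """Return True if the manifest's dbt version appears in the supported list."""
    # Assumption: exact string match only; nearby patch releases are not inferred.
    version = manifest.get("metadata", {}).get("dbt_version")
    return version in LISTED_VERSIONS
```

A result of `False` means only that the version is not in the list as written, not that the collector is known to fail on it.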

Authentication supported

The collector supports the following authentication methods to the target databases:

  • Username and password authentication

When authenticating to Snowflake, the collector also supports:

  • Username and key pair authentication