
Preparing to run the dbt Core collector

Setting up prerequisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

| Item | Requirement |
| --- | --- |
| Hardware | Note: The following specs are based upon running one collector process at a time. Please adjust the hardware if you are running multiple collectors at the same time. |
| RAM | 8 GB |
| CPU | 2 GHz processor |
| Software | |
| Docker | Click here to get Docker. |
| data.world specific objects | |
| Dataset | You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors. |
| Network connection | |
| Allowlist IPs and domains | Follow these instructions to configure your network. |



Preparing dbt Core for collectors

To harvest metadata from dbt Core artifacts, the artifact files must be in a file system directory for which the user running the collector has at least read access. To harvest intra-database (column-level) lineage for dbt models materialized as views, the collector must also be provided with a credential for dbt’s target database that has SELECT privileges on those views and on the tables they reference. This database credential can be supplied via CLI options or obtained from the profiles.yml file.

The collector harvests information from the following files in the dbt project:

  • manifest.json: The collector gets most of the cataloged metadata from this file, including the attributes of all dbt objects (models, sources, tests, and so on) and the database objects they manifest. The collector also uses the information in manifest.json to harvest lineage between view columns and source table columns.

  • catalog.json: This file provides information about the database columns in the tables and views manifested by models.

  • run_results.json: This file contains the activity part of lineage, such as what caused a model to be executed and at what time.

  • profiles.yml: When provided, the collector captures database connection information from this file.
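Because column-level lineage needs SELECT privileges on views, it can help to know in advance which models are materialized as views. The following is a minimal sketch, not part of the collector, that reads manifest.json and lists them. It assumes the standard manifest layout (a top-level "nodes" mapping whose entries carry "resource_type", "config.materialized", "database", "schema", "name", and "alias"); the function names view_models and load_manifest are hypothetical.

```python
import json
from pathlib import Path


def view_models(manifest: dict) -> list[str]:
    """Return the relation name of each dbt model materialized as a view."""
    views = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip tests, seeds, snapshots, and so on
        if node.get("config", {}).get("materialized") != "view":
            continue
        # dbt uses the alias as the relation name when one is set.
        relation = node.get("alias") or node.get("name")
        parts = (node.get("database"), node.get("schema"), relation)
        views.append(".".join(p for p in parts if p))
    return sorted(views)


def load_manifest(path: str) -> dict:
    """Load manifest.json from the given path (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, view_models(load_manifest("/artifact_directory/manifest.json")) would return the views that the database credential needs to be able to read.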

Generating dbt Core metadata artifacts to pass to the collector

  • profiles.yml - It is located in the ~/.dbt directory by default. For more information, see the dbt connection profiles documentation.

  • dbt_project.yml - It is found at the top level of the dbt project.

  • catalog.json, manifest.json, and run_results.json - These files can be generated by running the dbt docs generate command. For more information about the results JSON file, see this dbt documentation. Ensure that the account used for running the dbt docs generate command has the required permissions to query the metadata in the target database.
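After running dbt docs generate, it may be worth confirming that the artifacts exist and are recent before handing them to the collector. The sketch below is illustrative only. It assumes the artifacts are in a single directory (dbt writes them to the project's target directory unless configured otherwise) and that each JSON artifact has a "metadata" block with "generated_at" and "dbt_version" fields; the function name artifact_metadata is hypothetical.

```python
import json
from pathlib import Path

ARTIFACTS = ("manifest.json", "catalog.json", "run_results.json")


def artifact_metadata(target_dir: str) -> dict:
    """Map each artifact name to its metadata summary, or None if it is missing."""
    summary = {}
    for name in ARTIFACTS:
        path = Path(target_dir) / name
        if not path.is_file():
            summary[name] = None  # artifact was not generated
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        meta = data.get("metadata", {})
        summary[name] = {
            "generated_at": meta.get("generated_at"),
            "dbt_version": meta.get("dbt_version"),
        }
    return summary
```

A None entry in the result means the file was not found, and a stale generated_at value suggests the command should be run again.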

Important

The files catalog.json, manifest.json, and profiles.yml must be in the same directory on the host machine. For example, /artifact_directory.
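One way to satisfy the same-directory requirement is to copy the files into place and then check that each one is readable. The following is a rough sketch under stated assumptions, not an official procedure: the function name stage_artifacts and its parameters are hypothetical, the JSON artifacts are assumed to be in the dbt project's target directory, and profiles.yml is assumed to be in a separate profiles directory such as ~/.dbt.

```python
import os
import shutil
from pathlib import Path

REQUIRED = ("catalog.json", "manifest.json", "profiles.yml")


def stage_artifacts(target_dir: str, profiles_dir: str, artifact_dir: str) -> list[str]:
    """Copy the required files into artifact_dir.

    Returns the names of files that are missing or unreadable afterwards.
    """
    dest = Path(artifact_dir)
    dest.mkdir(parents=True, exist_ok=True)
    problems = []
    for name in REQUIRED:
        # profiles.yml lives apart from the generated JSON artifacts.
        src_dir = profiles_dir if name == "profiles.yml" else target_dir
        src = Path(src_dir) / name
        dst = dest / name
        # Skip the copy when the file is already in place.
        if src.is_file() and src.resolve() != dst.resolve():
            shutil.copy2(src, dst)
        # The user running the collector needs at least read access.
        if not os.access(dst, os.R_OK):
            problems.append(name)
    return problems
```

An empty list from stage_artifacts("target", os.path.expanduser("~/.dbt"), "/artifact_directory") would indicate that all three files are present and readable in one directory. Run the check as the same user that runs the collector, since read access is evaluated for the current user.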