Preparing to run the dbt Core collector

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item	Requirement
Hardware	Note: These specs assume that one collector process runs at a time. Increase the hardware if you run multiple collectors at the same time.
RAM	8 GB
CPU	2 GHz processor
Software	Docker or Java Runtime Environment
Docker	Click here to get Docker.
Java Runtime Environment	OpenJDK 17 is supported and available here.
data.world-specific objects
Dataset	You must have a ddw-catalogs dataset set up to hold your catalog files after the collector finishes running. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.
Network connection
Allowlist IPs and domains	Follow these instructions to configure your network. Use these tools to check network connections before running the collector.

Preparing dbt Core for collectors

Harvesting metadata from the dbt Core artifacts requires that the artifact files be in a file system directory that the user running the collector can at least read. To harvest intra-database (column-level) lineage for dbt models materialized as views, the collector also needs a credential for dbt’s target database with SELECT privileges on those views and on the tables they reference. You can supply this database credential through CLI options, or the collector can obtain it from the profiles.yml file.

The collector harvests information from the following files in the dbt project:

manifest.json: The collector gets most of the cataloged metadata from this file, including the attributes of all dbt objects (models, sources, tests, and so on) and the database objects they manifest. The collector also uses the information in this file to harvest lineage between a view’s columns and source table columns.
catalog.json: This file describes the database columns in the tables and views manifested by models.
run_results.json: This file contains the activity part of lineage: what caused a model to be executed, at what time, and so on.
profiles.yml: When provided, the collector captures database connection information from this file.
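
As a quick sanity check before running the collector, it can help to see what a manifest.json actually contains. The following is a minimal sketch, not part of the collector: it assumes the standard dbt manifest layout, in which the top-level "nodes" and "sources" keys map unique IDs to objects that carry a "resource_type" field. The function names load_manifest and count_resource_types are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_manifest(artifact_dir):
    """Parse manifest.json from the root of the artifact directory."""
    manifest_path = Path(artifact_dir) / "manifest.json"
    return json.loads(manifest_path.read_text(encoding="utf-8"))


def count_resource_types(manifest):
    """Count dbt resources in a parsed manifest by their resource_type."""
    counts = Counter()
    # Assumption: "nodes" holds models, tests, seeds, and snapshots,
    # while "sources" holds the declared source tables.
    for section in ("nodes", "sources"):
        for resource in manifest.get(section, {}).values():
            counts[resource.get("resource_type", "unknown")] += 1
    return counts
```

Calling count_resource_types(load_manifest("/artifact_directory")) returns a Counter keyed by resource type. A manifest with no models usually indicates that the wrong file or project was picked up.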

Generating dbt Core metadata artifacts to pass to the collector

profiles.yml - This file is located in the .dbt directory by default. For more information, see the dbt connection profiles documentation.
dbt_project.yml - This file is found at the top level of the dbt project.
catalog.json, manifest.json, and run_results.json - You can generate these files by running the dbt docs generate command. For more information about the results JSON file, see this dbt documentation. Ensure that the account used to run the dbt docs generate command has the required permissions to query the metadata in the target database.
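
Once dbt docs generate has run, the files still need to be gathered into one place for the collector. The sketch below is one possible way to do that and rests on two assumptions that may not match your setup: that dbt wrote the JSON artifacts to the project's target directory (the dbt default) and that profiles.yml lives in the .dbt directory under the user's home directory unless another location is passed in. The function stage_artifacts and its parameters are hypothetical names.

```python
import shutil
from pathlib import Path

JSON_ARTIFACTS = ("manifest.json", "catalog.json", "run_results.json")


def stage_artifacts(project_dir, artifact_dir, profiles_dir=None):
    """Copy dbt artifacts into artifact_dir and return the names not found."""
    project = Path(project_dir)
    dest = Path(artifact_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Assumption: profiles.yml is in ~/.dbt unless a directory is given.
    profiles = Path(profiles_dir) if profiles_dir else Path.home() / ".dbt"

    # Assumption: dbt wrote its JSON artifacts to <project>/target.
    sources = [project / "target" / name for name in JSON_ARTIFACTS]
    sources.append(profiles / "profiles.yml")

    missing = []
    for src in sources:
        if src.is_file():
            # Copy to the root level of the artifact directory.
            shutil.copy2(src, dest / src.name)
        else:
            missing.append(src.name)
    return missing
```

An empty return value means that all four files were copied. Because profiles.yml can contain database credentials, restrict access to the artifact directory accordingly.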

Important

The files catalog.json, manifest.json, and profiles.yml must be in the same directory on the host machine, such as /artifact_directory, and must be located at the root level of that directory.
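
The placement rule above can be checked before the collector is started. The following is a minimal sketch under the assumption that a simple existence and read-permission check is enough; it does not validate the contents of the files. The function name find_missing_artifacts is hypothetical.

```python
import os
from pathlib import Path

REQUIRED_FILES = ("catalog.json", "manifest.json", "profiles.yml")


def find_missing_artifacts(artifact_dir):
    """Return the required files that are absent or unreadable at the directory root."""
    root = Path(artifact_dir)
    problems = []
    for name in REQUIRED_FILES:
        candidate = root / name
        # The file must sit directly in the directory, not in a subfolder,
        # and the current user needs at least read access to it.
        if not candidate.is_file() or not os.access(candidate, os.R_OK):
            problems.append(name)
    return problems
```

Run the check as the same user that will run the collector, for example find_missing_artifacts("/artifact_directory"), so that the read-permission result reflects what the collector will see.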
