Preparing to run the dbt Core collector
Setting up pre-requisites for running the collector
Make sure that the machine on which you run the collector meets the following hardware and software requirements.
Item | Requirement |
---|---|
Hardware (Note: The following specs assume one collector process running at a time. Adjust the hardware if you run multiple collectors at the same time.) | |
RAM | 8 GB |
CPU | 2 GHz processor |
Software | |
Docker | Docker must be installed on the machine. |
data.world specific objects | |
Dataset | You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors. |
Network connection | |
Allowlist IPs and domains | |
Preparing dbt Core for collectors
Harvesting metadata from the dbt Core artifacts requires that the artifact files be in a file system directory to which the user running the collector has at least read access. To harvest intra-database (column-level) lineage for dbt models materialized as views, the collector must be given a credential for dbt’s target database that has SELECT privileges on those views and on the tables they reference. This database credential can be supplied via CLI options or read from the profiles.yml file.
The collector harvests information from the following files in the dbt project:
manifest.json: This is the file from which the collector gets most of the cataloged metadata: the attributes of all dbt objects (models, sources, tests, and so on) and of the database objects they manifest. The collector also uses the information in manifest.json to harvest lineage between a view’s columns and the source table columns.
catalog.json: This file provides information about the database columns in the tables and views manifested by models.
run_results.json: This file contains the activity part of lineage: what caused a model to be executed, at what time, and so on.
profiles.yml: When provided, the collector captures database connection information from this file.
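To get a feel for what manifest.json holds before running the collector, it can help to count the dbt objects it describes. The following is a minimal Python sketch, not part of the collector. It assumes the manifest keeps its entries under the top-level keys nodes and sources and that each entry carries a resource_type field, as in manifests written by recent dbt versions. The function name count_resource_types is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_resource_types(manifest_path):
    """Count dbt objects in a manifest.json file, grouped by resource type."""
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    counts = Counter()
    # Assumption: models, tests, seeds, etc. live under "nodes",
    # and source definitions live under "sources".
    for section in ("nodes", "sources"):
        for entry in manifest.get(section, {}).values():
            counts[entry.get("resource_type", "unknown")] += 1
    return dict(counts)
```

Calling count_resource_types("/artifact_directory/manifest.json") would return a dictionary that maps each resource type found in the file to the number of objects of that type.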
Generating dbt Core metadata artifacts to pass to the collector
profiles.yml - It is located in the .dbt directory by default. For more information see the dbt connection profiles documentation.
dbt_project.yml - It is found at the top level of the dbt project.
catalog.json, manifest.json and run_results.json - These files can be generated by running the dbt docs generate command. For more information about the results JSON file see this dbt documentation. Ensure that the account used for running the dbt docs generate command has the required permissions to query the metadata in the target database.
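Once the dbt docs generate command has run, the artifacts need to be gathered in one place for the collector. The sketch below shows one possible way to do that in Python. It assumes the project writes its artifacts to dbt's default output directory, target, inside the project. The function name stage_artifacts and its parameters are hypothetical.

```python
import shutil
from pathlib import Path

# Artifacts that dbt writes to the project's output directory.
DBT_ARTIFACTS = ("catalog.json", "manifest.json", "run_results.json")


def stage_artifacts(project_dir, profiles_dir, artifact_dir):
    """Copy dbt artifacts and profiles.yml into the root of artifact_dir.

    Returns the names of the files that were actually copied.
    """
    dest = Path(artifact_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Assumption: the project uses dbt's default output directory, "target".
    sources = [Path(project_dir) / "target" / name for name in DBT_ARTIFACTS]
    sources.append(Path(profiles_dir) / "profiles.yml")

    copied = []
    for src in sources:
        if src.is_file():
            # Files land directly at the root of the artifact directory.
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)
    return copied
```

For example, stage_artifacts("my_dbt_project", Path.home() / ".dbt", "/artifact_directory") would copy whichever of the four files exist, where my_dbt_project is a placeholder for the path to your dbt project.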
Important
The files catalog.json, manifest.json, and profiles.yml must be in the same directory on the host machine, such as /artifact_directory, and must be located at the root level of that directory.
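A quick check before running the collector can catch a misplaced file early. The following is a minimal Python sketch of such a check, not a collector feature. The function name missing_artifacts is hypothetical, and the list of required files is taken from the note above.

```python
from pathlib import Path

# Files the note above says must sit at the root of the artifact directory.
REQUIRED_ARTIFACTS = ("catalog.json", "manifest.json", "profiles.yml")


def missing_artifacts(artifact_dir):
    """Return the required file names not found at the root of artifact_dir."""
    root = Path(artifact_dir)
    # Only the root level is checked; files in subdirectories do not count.
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]
```

An empty list from missing_artifacts("/artifact_directory") means all three files are in place. Any names returned are the files that still need to be copied to the root of the directory.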