
BigQuery and the Collector


The latest version of the Collector is 2.128. To view the release notes for this version and all previous versions, please go here.

About the collector

Use this collector to harvest metadata for BigQuery projects, datasets, tables, and columns across the enterprise and make it searchable and discoverable in data.world. The collector also harvests column-level lineage relationships between tables and views.

Authentication supported

The collector authenticates to BigQuery using a Service Account associated with the project.

What is cataloged

The collector catalogs the following information.

Table 1.

Object
    Information cataloged

Dataset
    ID, name, description, labels (note these are key/value pairs), created date, last modified date, default table expiry, default partition expiry, data location

Table
    Name, description, created date, last modified date, default table expiration, data location, labels, type (standard, external, snapshot, model), partitioned-on field, clustered-by columns (for standard and snapshot tables), partition type (range or time), requires partition filter; for range partitions: start, end, interval; for time partitions: granularity (hour, day, month, year) and expiration

Column
    Name, description, data type, is nullable, column size

View
    Name, description, created date, default table expiration, last modified date, data location, default collation, labels, view SQL, clustered-by columns (for materialized views)

Relationship between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page
    Relationship to other resource pages

Project, Dataset
    Tables, Views

Table, View
    Column, Labels

Column
    Table, View

Label Value
    Table, View, Project, Dataset

Lineage for BigQuery

The following lineage information is collected by the BigQuery collector.

Table 3.

Lineage available

View Column
    The collector identifies the associated column in an upstream view or table:

      • where the data is sourced from

      • that sorts the rows via ORDER BY

      • that filters the rows via WHERE/HAVING

      • that aggregates the rows via GROUP BY

Setting up authentication for BigQuery

The collector connects to BigQuery using a Service Account associated with your project.

To set up authentication for BigQuery:

  1. Create a service account with the following roles: BigQuery Data Viewer and BigQuery User. For additional information on predefined roles and permissions, see the Google Cloud Platform documentation.

      • Details about the BigQuery User role

      • Details about the BigQuery Data Viewer role

  2. After you create a service account, create a key for the account and download the associated JSON key file.

  3. Place this key file on the machine where you plan to run the collector. You will need this file when running the collector.
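As a quick sanity check before mounting the key file, you can verify that the download is actually a service-account key and not some other credential type. This is a minimal sketch; the fields checked are the standard ones present in every GCP service-account JSON key, and the example path is hypothetical:

```shell
# Sanity-check a GCP service-account key file before running the collector.
# A minimal sketch; the fields checked are standard service-account key fields.
check_key() {
  f="$1"
  [ -f "$f" ] || { echo "not found: $f"; return 1; }
  for field in type project_id private_key client_email; do
    grep -q "\"$field\"" "$f" || { echo "missing field: $field"; return 1; }
  done
  grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$f" \
    || { echo "not a service-account key"; return 1; }
  echo "ok: $f"
}

# Example (path is hypothetical):
# check_key /local_creds_dir/key.json
```

If any field is missing, re-download the key from the service account's Keys tab rather than editing the file by hand.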

Prerequisites for running the collector

Make sure that the machine where you run the collector meets the following hardware and software requirements.

Table 4.

Hardware
    RAM: 8 GB
    CPU: 2 GHz processor

Software
    Docker: Click here to get Docker.
    Java Runtime Environment: OpenJDK 17 is supported and available here.

data.world
    You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.

Ways to run the Collector

There are a few different ways to run the Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is especially valuable if you have multiple data sources to catalog, as you don't need to run multiple scripts or CLI commands separately.

  • Run the collector through a CLI - Repeat runs of the collector require you to re-enter the command for each run.


This section walks you through the process of running the collector using the CLI.

Preparing and running the command

The easiest way to create your Collector command is to:

  1. Copy the following example command into a text editor.

  2. Set the required parameters in the command. The example command includes the minimal parameters required to run the collector.

  3. Open a terminal window in any Unix environment that uses a Bash shell, paste the command into it, and run it.

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log \
--mount type=bind,source=/local_creds_dir,target=/creds datadotworld/dwcc:<collectorversion> catalog-bigquery \
-a <account> -n <catalogName> --credentialFile=<credentialFile>  -p <project> -d <dataset> \
-o "/dwcc-output"

The following table describes the parameters for the command. Detailed information about the Docker portion of the command can be found here.

Table 5.

Parameter
    Description

--mount type=bind,source=/local_creds_dir,target=/creds
    Location of the credentials file. Provide the location of the credentials file you generated for authentication.

<collectorversion>
    Replace with the version of the collector you want to use (for example, datadotworld/dwcc:2.113).

-a <account>
    The ID for the account into which you will load this catalog; it is used to generate the namespace for any URIs generated. You must include either --agent or --base. (Required if the base parameter is not provided.)

--base <baseURI>
    The base URI to use as the namespace for any URIs generated. You must include either --agent or --base. (Required if the agent parameter is not provided.)

--credentialFile=<credentialFile>
    GCP service account credential file. Must match the target specified in the --mount command for credentials.

-d <dataset>
    The BigQuery datasets to catalog in the given project. By default, all datasets in a project are cataloged.

-p <project>
    The BigQuery project.

-n <catalogName>
    The name of the catalog. It is used to generate the ID for the catalog as well as the filename into which the catalog file is written.

-o <outputDirectory>
    The output directory into which any catalog files should be written. In our example we use /dwcc-output because the collector runs in a Docker container and that is the mount point we specified in the script. You can change this value to anything you would like as long as it matches what you use in the mount point:

    --mount type=bind,source=/tmp,target=/dwcc-output ... -o /dwcc-output

    In this example, the output is written to the /tmp directory on the local machine, as indicated by the mount point directive. The log file, in addition to any catalog files, is written to the directory specified in the mount point directive.

Log upload option
    Do not upload the log of the Collector run to the organization account's catalogs dataset or to another location specified with --upload-location (ignored if --upload is not specified).

--site
    The name for the site into which you will load this catalog; for example, --site="siteName". Used to generate the namespace for any URIs generated. This parameter should not be used for multi-tenant or VPC instances.

API host
    The host for the API. This parameter is not required for most users; it is only needed for VPCs and private installs, where "site" is the name of the VPC or private install.

--api-token
    The API token to use for authentication; by default, an environment variable named DW_AUTH_TOKEN is used.

--upload
    Whether to upload the generated catalog to the organization account's catalogs dataset or to another location specified with --upload-location (requires --api-token).

--upload-location
    The dataset to which the catalog is uploaded, specified as a simple dataset name to upload to that dataset within the organization's account, or as [account/dataset] to upload to a dataset in some other account (ignored if --upload is not specified).

SPARQL transform
    A file containing a SPARQL query to execute to transform the catalog graph emitted by the collector.
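As a worked example, the following assembles the full command from the parameters above using hypothetical values (my-org, my-gcp-project, sales, and the key file name key.json are all assumptions) and prints it for review before you run it. Note that --credentialFile and -o use container-side paths, which must match the target halves of the corresponding --mount options:

```shell
# Assemble the collector command from hypothetical values; substitute your own.
DWCC_VERSION="2.128"        # collector version (assumed)
ACCOUNT="my-org"            # account to load the catalog into (assumed)
PROJECT="my-gcp-project"    # BigQuery project (assumed)
DATASET="sales"             # dataset to catalog (assumed)

CMD="docker run -it --rm \
--mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log \
--mount type=bind,source=/local_creds_dir,target=/creds \
datadotworld/dwcc:${DWCC_VERSION} catalog-bigquery \
-a ${ACCOUNT} -n ${ACCOUNT}-bigquery-catalog \
--credentialFile=/creds/key.json \
-p ${PROJECT} -d ${DATASET} -o /dwcc-output"

# Print the command for review; when it looks right, run it with: eval "$CMD"
echo "$CMD"
```

Printing the command first is a simple way to confirm that every container-side path in the command lines up with a mount target before anything executes.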


Common troubleshooting tasks

Collector runtime and troubleshooting

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should appear in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.

Issue: Credential file not found

  • Cause: The credential file was not found on the Docker container.

  • Solution: Check that the path for the credential file is properly mounted. You need to mount the local directory containing the credential file to a directory on the Docker container. The --credentialFile value should be the container-side path to the credential file, under the mounted directory.
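A pre-flight check along these lines can catch the problem before the container starts. This is a sketch that assumes the paths from the example command (local /local_creds_dir mounted at /creds) and a hypothetical key file name:

```shell
# Verify the credentials mount before launching the collector. A sketch that
# assumes the example paths: local /local_creds_dir is mounted at /creds.
check_creds_mount() {
  local_dir="$1"; key_name="$2"
  [ -d "$local_dir" ] || { echo "local creds dir not found: $local_dir"; return 1; }
  [ -f "$local_dir/$key_name" ] \
    || { echo "key file not found: $local_dir/$key_name"; return 1; }
  # The same file appears inside the container under the mount target, so
  # --credentialFile must use the container-side path:
  echo "ok: pass --credentialFile=/creds/$key_name"
}

# Example (hypothetical paths):
# check_creds_mount /local_creds_dir key.json
```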

Issue: Correct credentials not set for the service account

  • Cause: An exception occurs while reading the credentials file.

  • Solution: Check that the credentials file contains the correct credential information required for running the collector.

Upload the .ttl file generated from running the Collector

When the Collector runs successfully, it creates a .ttl file in the directory you specified as the dwcc-output directory. The automatically generated file name is databaseName.catalogName.dwec.ttl. You can rename the file or keep the default name, and then upload it to your ddw-catalogs dataset (or wherever you store your catalogs).


If there is already a .ttl catalog file with the same name in your ddw-catalogs dataset, when you add the new one it will overwrite the existing one.

Automating updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your representative for more tailored recommendations on how best to optimize your catalog collector processes.
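On machines that use cron, a scheduled run can be as simple as one crontab entry; the wrapper script path and the schedule below are assumptions for illustration, and the script would contain the docker run command shown earlier:

```shell
# crontab fragment: run the collector every Sunday at 02:00 and append
# output to a log file (script path and schedule are assumptions).
0 2 * * 0 /opt/dwcc/run-collector.sh >> /var/log/dwcc-collector.log 2>&1
```

Redirecting stdout and stderr to a log file keeps a record of each run for troubleshooting.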