BigQuery and DWCC
Prerequisites
The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.
If you are using Docker, it must be installed. For more information see https://docs.docker.com/get-docker/.
The user defined to run DWCC must have read access to all resources being cataloged.
The computer running dwcc needs a Java Runtime Environment (JRE), version 11 or higher. (OpenJDK available here)
Using a YAML file to run a collector
Create a YAML configuration file
The config.yml file provides a way to manage complex metadata collector runs from a CLI. Having all of your options saved in a file eliminates the tedium of retyping the commands every time you run a collector, and also allows you to run multiple collectors at once from the same file.
Note
Running a collector from a YAML file is very similar to running it through the CLI. The file supports all the same command-line options and has all the same prerequisites. However, when running DWCC with Docker from a YAML file, all parameter options must be specified in the YAML file.
The YAML configuration file is broken up into two sections:
global options
command options
Global options are used to express collector parameters that should remain the same for all collector runs in the file. For example, typically all of the collector runs for an organization should use the same value for account. Other parameters which are commonly shared across all collectors and convenient to specify globally are upload and API token.
Command options are used to specify the configuration for a single run of a specific collector. Each command-options section is an element in a YAML array labeled commands. The name of the element is the name of the collector command to run. A list of the current collector commands is here.
For a generic (JDBC) database collector, the YAML file would be set up something like this:
global_options:
  account: <string>
  output: /dwcc-output
  upload: <true | false>
  api_token: <string | empty>
commands:
  - catalog-<datasource_name>:
      server: <string>
      port: <integer>
      user: <string>
      password: <string | ENV variable>
Caution
Formatting and indenting are significant in YAML. See YAML Ain't Markup Language (YAML™) Version 1.2 for a good overview of YAML file structure and formatting.
Note
Some options do not require values when used via the CLI, e.g., --upload and --no-log-upload. In the YAML file, these options must be given a value of true or false, as shown with upload in the above example.
To begin, create an empty config.yml
file, open it in a text editor, and add a configuration like this:
global_options:
  account: democorp
  output: /dwcc-output
  upload: true
  api_token: ${DW_AUTH_TOKEN}
commands:
  - catalog-snowflake:
      server: ${SNOWFLAKE_SERVER}
      database: demo
      all-schemas: true
      user: demo-user
      password: ${SNOWFLAKE_PW}
      role: ACCOUNTADMIN
Note
The commands used for the catalog collectors are on this list.
Note
The above example assumes that your operating system or shell environment has the environment variables DW_AUTH_TOKEN, SNOWFLAKE_SERVER, and SNOWFLAKE_PW.
Tip
Using environment variables for sensitive option values allows the user running dwcc to specify these values at runtime, rather than storing them in the config file.
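For example, in a Bash shell you could set these variables before invoking the collector. The values below are hypothetical placeholders; substitute your own token, server, and password:

```shell
# Hypothetical values -- replace with your own credentials.
export DW_AUTH_TOKEN="your-data.world-api-token"
export SNOWFLAKE_SERVER="yourorg.snowflakecomputing.com"
export SNOWFLAKE_PW="your-snowflake-password"
```

When running with Docker, pass each variable through to the container with a bare -e flag (e.g., -e SNOWFLAKE_PW) so the value itself never appears on the command line.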
Store the file in your current working directory.
You can see a list of the data resources we currently support for metadata collection here with links to the configuration parameters for each of them.
Running the YAML file with Docker
There are a few things to keep in mind when you use Docker to run your metadata collector:
No DWCC options are allowed on the command line after config.yml.
We have licensing permission to distribute some JDBC drivers with our collectors. Drivers that are not supplied need to be mounted. (See an example of the command used to run a collector on the DWCC configuration page for that collector. Here is a list of the sources we currently support with links to the commands for them. )
To run your config.yml, you begin with commands similar to the ones you would use on a command line. However, when you run Docker with a config file, you need to add one more mount statement to indicate that the file is mounted in the container. If you store your dwcc configurations in the directory /dwcc-configs, then you would add the mount directive --mount type=bind,source=/dwcc-configs,target=/dwcc-configs and specify the file as --config-file /dwcc-configs/config.yml.
Here is an example of a command used to run the config file:
docker run -it --rm --mount type=bind,source=/tmp,\
target=/dwcc-output --mount type=bind,source=/tmp,\
target=/app/log --mount type=bind,source=/dwcc-configs,\
target=/dwcc-configs -e SNOWFLAKE_SERVER -e SNOWFLAKE_PW \
datadotworld/dwcc:x.y --config-file /dwcc-configs/config.yml
Important
Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use, e.g., datadotworld/dwcc:2.345.
Basic parameters
These are the basic parameters needed to run the collector. A list of all parameter options is at the end of this article. Where available, either short (e.g., -a) or long (--account) forms can be used.
Base parameters
Writing the DWCC command
The easiest way to create your DWCC command is to:
Copy the example below
Edit it for your organization and data source
Open a terminal window in any Unix environment that uses a Bash shell (e.g., MacOS and Linux) and paste your command into it.
The example command includes the minimal parameters required to run the collector (described below); your instance may require more. A description of all the available parameters is at the end of this article. Edit the command by adding any other parameters you wish to use, and by replacing the values for all your parameters with your information as appropriate. Parameters required by the collector are in bold.
Detailed information about the Docker portion of the command can be found here. When you run the command, docker run will attempt to find the image locally; if it doesn't find it, it will automatically download it from Docker Hub:

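A minimal sketch of such a command is shown below, using the Snowflake collector from the earlier YAML example. The values (account name, server, database, user) are hypothetical placeholders, and the long-form parameter names are assumed to mirror the YAML option names used earlier in this article; consult the configuration page for your specific collector for the authoritative parameter list:

```shell
# Hypothetical sketch -- replace all values with your own, and x.y with
# the DWCC version you want to use. The -e flags pass credentials from
# your shell environment into the container without echoing them.
docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
  -e SNOWFLAKE_PW -e DW_AUTH_TOKEN \
  datadotworld/dwcc:x.y catalog-snowflake \
  --account=democorp \
  --server=${SNOWFLAKE_SERVER} \
  --database=demo \
  --user=demo-user \
  --password=${SNOWFLAKE_PW} \
  --api-token=${DW_AUTH_TOKEN} \
  --output=/dwcc-output \
  --upload
```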
Collector runtime and troubleshooting
The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
Automatic updates to your metadata catalog
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
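As a sketch, a crontab entry that runs a collector nightly might look like the following. The schedule, paths, log destination, and image version are hypothetical placeholders; note that the interactive -it flags are omitted because cron jobs have no terminal attached:

```shell
# Run the catalog collector every day at 2:00 AM.
# m h dom mon dow  command
0 2 * * * docker run --rm --mount type=bind,source=/dwcc-configs,target=/dwcc-configs --mount type=bind,source=/tmp,target=/dwcc-output datadotworld/dwcc:x.y --config-file /dwcc-configs/config.yml >> /var/log/dwcc-cron.log 2>&1
```

Any environment variables referenced in the config file would also need to be supplied in the cron environment, since cron does not inherit your interactive shell's variables.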