Using a YAML file to run a collector

Why use a configuration file to run the Collector

The config.yaml file provides a way to manage complex metadata collector runs from the CLI. Saving all of your options in a file has the following benefits:

A saved file eliminates the tedium of retyping the commands every time you run the data.world Collector.
Troubleshooting the configuration is easier due to the file structure.
You can run multiple Collectors at once from the same file.

What is a YAML config file

A YAML config file is a place to store all the configuration and credential information needed to run the data.world Collector for your source(s). Running the Collector from a YAML file is very similar to running it through the CLI: The file supports and uses all the same command-line options, and has all the same prerequisites.

The file is composed of two sections:

Global options - These are used to express collector parameters that should remain the same for all collector runs in the file. For example, typically all of the collector runs for an organization should use the same value for account. Other parameters which are commonly shared across all collectors and convenient to specify globally are upload and API token.

Command options - These are used to specify the configuration for a single run of a specific collector. Each command-options section is an element in a YAML array labeled commands. The name of the element is the name of the collector command to run.

The following example config.yaml file contains a basic, generic set of options to run a data.world Collector and illustrates the syntax used:

global_options:
    agent: <string>
    output: /dwcc-output
    upload: <true | false>
    api_token: <string | empty>

commands:
- catalog-<datasource_name>:
    name: <catalogName>
    server: <string>
    database: <string>
    all-schemas: <true | false>
    role: <string>
    user: <string>
    password: <string | ENV variable>

Note

Some options do not require values when used via the CLI, e.g., --upload and --no-log-upload. In the YAML file, these options must be set to true or false, as shown with upload in the example above.
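As a small illustration, the fragment below writes the two flags mentioned above as YAML keys with explicit boolean values. Placing no-log-upload under global_options is an assumption made for this sketch; check the configuration page for your collector to confirm where each option belongs.

```yaml
# Sketch only: the placement of no-log-upload is an assumption.
global_options:
    upload: true           # CLI form: --upload (takes no value)
    no-log-upload: true    # CLI form: --no-log-upload (takes no value)
```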

Values of properties in a configuration file can reference system environment variables, which are substituted at collector runtime. The syntax for referencing an environment variable is ${ENV_VARIABLE}; for example, ${MY_VAR} for an environment variable named MY_VAR. This feature lets you configure collectors from a configuration file without storing sensitive values, such as passwords, in the file.

For example, to substitute an environment variable DB_PASSWORD as the value for a database password passed to a JDBC collector (such as PostgreSQL), the commands section would look like this:

commands:
- catalog-postgresql:
    server: mydb.myorg.com
    user: dbuser
    password: ${DB_PASSWORD}
    database: mydb

When the collector uses this configuration file, it looks for a system environment variable named DB_PASSWORD and, if found, substitutes its value into the file. If no such environment variable is found, a warning message is logged and a blank value is used.

Note that formatting and indentation are significant in YAML. See YAML Ain't Markup Language (YAML™) Version 1.2 for a good overview of YAML file structure and formatting.
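Putting these pieces together, a single file can describe more than one run, because commands is an array. The following is a minimal sketch of that idea, assuming two PostgreSQL databases; the account, host names, database names, catalog names, and environment variable names are hypothetical placeholders, not values your installation will recognize.

```yaml
# Illustrative sketch only: all hosts, names, and environment variables
# below are hypothetical placeholders.
global_options:
    agent: my-org                  # shared by every run in this file
    output: /dwcc-output
    upload: true
    api_token: ${DW_AUTH_TOKEN}    # substituted from the environment at runtime

commands:
- catalog-postgresql:              # first run
    name: sales-catalog
    server: sales-db.myorg.com
    database: sales
    all-schemas: true
    user: dbuser
    password: ${SALES_DB_PASSWORD}
- catalog-postgresql:              # second run, different database
    name: hr-catalog
    server: hr-db.myorg.com
    database: hr
    all-schemas: true
    user: dbuser
    password: ${HR_DB_PASSWORD}
```

Options that differ between runs stay under each command, while shared values are stated once in global_options. Each environment variable referenced this way must be defined wherever the collector runs.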

Running the YAML file with Docker

There are a few things to keep in mind when you use Docker to run your metadata collector:

No Collector options are allowed on the command line after config.yml.
We have licensing permission to distribute some JDBC drivers with our collectors. Drivers that are not supplied need to be mounted. (See the example of the command used to run a collector for your data source on the data.world Collector configuration page for that collector.) Here is a list of the sources we currently support with links to the configurations for them.

The config.yml file is run with Docker, so it uses the same Docker commands you would use for your data source. Examples of the commands used for each data source are provided on their respective configuration pages. However, when you run Docker with a config file, you need to add one more mount statement so that the file is available inside the container. If you store your data.world Collector configurations in the directory /dwcc-configs, you would add the mount directive --mount type=bind,source=/dwcc-configs,target=/dwcc-configs and specify the file as --config-file /dwcc-configs/config.yml.

Here is an example of a command used to run the config file:

docker run -it --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/dwcc-configs,target=/dwcc-configs \
  datadotworld/dwcc:x.y \
  --config-file /dwcc-configs/config.yml

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.113).

If your config.yml file includes environment variable references, make sure to define those environment variables when you run the Docker container. To continue the example above, if your database password is defined as an environment variable on your host, pass it into the container (for substitution into the example YAML file) using -e DB_PASSWORD=$DB_PASSWORD.

The command will look like this:

docker run -it --rm -e DB_PASSWORD=$DB_PASSWORD \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/dwcc-configs,target=/dwcc-configs \
  datadotworld/dwcc:x.y \
  --config-file /dwcc-configs/config.yml
