Ways to run a collector

There are currently four ways to catalog your metadata in data.world. They are (in order of ease of use):

  • Connection Manager - A simple interface you use to create connections that can be used both to catalog metadata and to access remote data through a virtual connection. Here is an example of the Connection Manager interface:

    [Image: Connection Manager interface for a Snowflake connection]

    You can find out about the Connection Manager here.

  • Create a configuration file (config.yml) - This is a new option that stores all the information needed to catalog your data sources. It is especially valuable if you have multiple data sources to catalog, because you don't need to run multiple scripts or CLI commands separately. More information on setting up a config file is available here.

  • Create a configuration script - This option is very similar to the one for the config.yml file. We have a quick start with instructions to create and run a script here. The instructions are for Snowflake, but you can get information on the parameters you will need to include for your particular data source from the parameters section of the configuration document for your data source. The sources we currently collect and links to their configurations can be found below.

  • Run the collector through a CLI - This option makes regular, repeated runs of the collector laborious and time-consuming because the commands must be re-entered for each run. Detailed instructions for getting the collectors, configuring them, and running them are available through the links below.

Running a catalog collector from a configuration file

Creating a config.yml file to run the collector is a good option for you if you want to do any or all of the following:

  • Automate running the collector

  • Catalog metadata from multiple sources at the same time

  • Streamline your catalog collection process

In the following articles we will run through:

  • What is in the config.yml file

  • How to create the file

  • How to run the file

Create a YAML configuration file

The config.yml file provides a way to manage complex metadata collector runs from a CLI. Having all of your options saved in a file eliminates the tedium of retyping the commands every time you run a collector, and also allows you to run multiple collectors at once from the same file.

Note

Running a collector from a YAML file is very similar to running it through the CLI. The file supports and uses all the same command-line options, and has all the same prerequisites. Complete information for each collector is found on this list.

The YAML configuration file is broken up into two sections:

  • global options

  • command options

Global options are used to express collector parameters that should remain the same for all collector runs in the file. For example, typically all of the collector runs for an organization should use the same value for account. Other parameters which are commonly shared across all collectors and convenient to specify globally are upload and API token.

Command options are used to specify the configuration for a single run of a specific collector. Each command-options section is an element in a YAML array labeled commands. The name of the element is the name of the collector command to run. A list of the current collector commands is here.

For a generic (JDBC) database collector, the YAML file would be set up something like this:

global_options:
    account: <string>
    output: /dwcc-output
    upload: <true | false>
    api_token: <string | empty>

commands:
- catalog-<datasource_name>:
    server: <string>
    port: <integer>
    user: <string>
    password: <string | ENV variable>

Caution

Formatting and indenting are significant in YAML. See YAML Ain't Markup Language (YAML™) Version 1.2 for a good overview of YAML file structure and formatting.

Note

Some options do not require values when used through the CLI, e.g., --upload, --no-log-upload. In the YAML file, these options must be set to "true" or "false", as shown with upload in the above example.

To begin, create an empty config.yml file, open it in a text editor, and add a configuration like this:

global_options:
    account: democorp
    output: /dwcc-output
    upload: true
    api_token: ${DW_AUTH_TOKEN}

commands:
- catalog-snowflake:
    server: ${SNOWFLAKE_SERVER}
    database: demo
    all-schemas: true
    user: demo-user
    password: ${SNOWFLAKE_PW}
    role: ACCOUNTADMIN

Note

The available collector commands are on this list.

Note

The above example assumes that your operating system or shell environment defines the environment variables DW_AUTH_TOKEN, SNOWFLAKE_SERVER, and SNOWFLAKE_PW.

Tip

Using environment variables for sensitive option values allows the user running dwcc to specify these values at runtime, rather than storing them in the config file.
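As a minimal sketch of this approach, the shell snippet below checks that the variables referenced in the example config.yml are set before dwcc is run. The require_env function is a hypothetical helper written for this illustration and is not part of DWCC; the variable names are the ones used in the example above.

```shell
#!/bin/sh
# require_env is a hypothetical helper (not part of DWCC).
# It returns 0 only if every named variable is set and non-empty.
require_env() {
    missing=0
    for name in "$@"; do
        # Indirect variable lookup that works in POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check the variables that the example config.yml refers to.
if require_env DW_AUTH_TOKEN SNOWFLAKE_SERVER SNOWFLAKE_PW; then
    echo "environment ready"
else
    echo "set the variables listed above before running dwcc" >&2
fi
```

Set the values in your shell session (for example with export) rather than in the script itself, so that the secrets stay out of any file you store alongside config.yml.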

Store the file in your current working directory.

You can see a list of the data resources we currently support for metadata collection here with links to the configuration parameters for each of them.
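Because one reason to use config.yml is to catalog several sources in a single run, here is a sketch of a file with two entries in the commands array that share the same global options. It reuses the Snowflake options from the example above. The second database name (demo_two) is hypothetical, and the sketch assumes that the same collector command may appear more than once in the array; check the configuration page for your collector before relying on that.

```shell
#!/bin/sh
# Write a config.yml with two collector runs that share one set of global options.
# The quoted 'EOF' keeps ${...} literal so the placeholders stay in the file as written.
# "demo_two" is a hypothetical database name used only for illustration.
cat > config.yml <<'EOF'
global_options:
    account: democorp
    output: /dwcc-output
    upload: true
    api_token: ${DW_AUTH_TOKEN}

commands:
- catalog-snowflake:
    server: ${SNOWFLAKE_SERVER}
    database: demo
    all-schemas: true
    user: demo-user
    password: ${SNOWFLAKE_PW}
    role: ACCOUNTADMIN
- catalog-snowflake:
    server: ${SNOWFLAKE_SERVER}
    database: demo_two
    all-schemas: true
    user: demo-user
    password: ${SNOWFLAKE_PW}
    role: ACCOUNTADMIN
EOF

# Count the collector runs defined in the file.
grep -c '^- catalog-' config.yml
```

Values such as account and api_token appear once under global_options, so only the settings that differ between runs (here, database) need to change from one entry to the next.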

Running the YAML file with docker

There are a few things to keep in mind when you use docker to run your metadata collector:

  • No DWCC options are allowed on the command line after config.yml.

  • We have licensing permission to distribute some JDBC drivers with our collectors. Drivers that are not supplied need to be mounted. (See an example of the command used to run a collector on the DWCC configuration page for that collector. Here is a list of the sources we currently support with links to the commands for them.)

To run your config.yml, you begin with commands similar to the ones you would use on a command line. However, when you run docker with a config file, you need to add one more mount statement so that the file is available inside the container. If you store your dwcc configurations in the directory /dwcc-configs, then you would add the mount directive --mount type=bind,source=/dwcc-configs,target=/dwcc-configs and specify the file as --config-file /dwcc-configs/config.yml.

Here is an example of a command used to run the config file:

docker run -it --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/dwcc-configs,target=/dwcc-configs \
  -e DW_AUTH_TOKEN -e SNOWFLAKE_SERVER -e SNOWFLAKE_PW \
  datadotworld/dwcc:x.y --config-file /dwcc-configs/config.yml

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use. E.g., datadotworld/dwcc:2.345.
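One way to avoid leaving the x.y placeholder in place is to build the image name from a version value, as in the sketch below. The dwcc_image function is a hypothetical helper written for this illustration, and 2.345 is only the sample version mentioned above, not a recommendation.

```shell
#!/bin/sh
# dwcc_image is a hypothetical helper (not part of DWCC).
# It prints the image reference for a given version and rejects the x.y placeholder.
dwcc_image() {
    case "$1" in
        ""|x.y)
            echo "set a real DWCC version instead of x.y" >&2
            return 1
            ;;
    esac
    echo "datadotworld/dwcc:$1"
}

# Print the image reference for the sample version from the text.
dwcc_image 2.345
```

The printed value can then be used in place of datadotworld/dwcc:x.y in the docker run command.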