Documentation

How to catalog metadata

Metadata is cataloged using either:

  • The Connection Manager - data.world's GUI for connections, currently available for these data sources. It is a simple interface for creating connections that can be used both to catalog metadata and to access remote data through a virtual connection. Here is an example of the Connection Manager interface:

    connectin_manager_-_snowflake.png
  • The data.world Collector - a data.world program created to catalog metadata. The Collector is run either from a Docker container or from a .jar file. We recommend that you run it from inside a Docker container. If you cannot use Docker, we can provide you with a .jar file containing the correct Collector version for your implementation. See our data.world Collector FAQ for more information about the Collector.

Where to get the data.world Collector

The data.world Collector is distributed as an image on Dockerhub. If you run the Collector from Docker, the run command will attempt to find the image locally, and if it doesn't find it, it will go to Dockerhub and download it automatically:

dwcc_and_cli.png
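
As a rough illustration of that lookup-then-download behavior, the sketch below makes the two steps explicit using docker image inspect and docker pull. The function name ensure_image and the DOCKER variable are hypothetical helpers introduced here for illustration; they are not part of the Collector or of Docker. The tag 2.36 is only the example version used elsewhere on this page.

```shell
#!/bin/sh
# Hypothetical helper: make the "look locally, then download" steps explicit.
# DOCKER is an illustrative override for the docker binary (assumption).
DOCKER="${DOCKER:-docker}"

# ensure_image <image>
# Reports whether the image is already present locally and pulls it if not.
ensure_image() {
  image="$1"
  if "$DOCKER" image inspect "$image" >/dev/null 2>&1; then
    echo "found locally: $image"
  else
    echo "pulling: $image"
    "$DOCKER" pull "$image"
  fi
}

# Example call (commented out so that sourcing this file has no side effects):
# ensure_image datadotworld/dwcc:2.36
```

Pulling ahead of time like this is optional, because docker run performs the same lookup on its own; it can be convenient when you want the download to happen before a scheduled run.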

If you are running the Collector from a .jar file, you will get the correct file from customer support.

If you are unsure which version of the Collector to use, the most current releases are always listed in the Catalog collector change log. However, if you don't know the complete version name, or if you would like to see a list of the Collector versions, you can go to our Dockerhub repositories. There are two repositories, one for released versions and one for release candidate versions:

  • datadotworld/dwcc - Contains all of the officially released versions of the data.world Collector.

  • datadotworld/dwcc-rc - Contains the "release candidate" versions. Release candidates are test versions that have not been officially released and are not officially supported. They are primarily used for quick customer fixes until the official release comes out.

Caution

Do not use the versions named Latest from either repository--only specify numeric releases (e.g., dwcc:2.36).

Warning

Do not use a release candidate (rc) version of the Collector unless you have been explicitly directed to do so by your customer success or support representative.

The name you specify on the CLI should match exactly the version name on Dockerhub. For example:

  • The name of the Collector version 2.36 is datadotworld/dwcc:2.36

  • The name of the third RC collector version of 2.37 is datadotworld/dwcc-rc:2.37-rc-0003 (RC versions are padded to four digits).
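
The naming rules above can be sketched as a small shell function. This is a minimal illustration, not an official tool: the function name dwcc_image is hypothetical, and the tag layout is taken only from the two examples above (a plain version for releases, and -rc- followed by a four-digit, zero-padded number for release candidates).

```shell
#!/bin/sh
# Hypothetical helper: build the Dockerhub image name for a Collector version.
# dwcc_image <version> [rc-number]
#   <version>    numeric release, e.g. 2.36
#   [rc-number]  release candidate number without leading zeros, e.g. 3
dwcc_image() {
  version="$1"
  rc="${2:-}"
  case "$version" in
    ''|latest|Latest)
      echo "error: specify a numeric release such as 2.36" >&2
      return 1
      ;;
  esac
  if [ -n "$rc" ]; then
    # Release candidates live in the dwcc-rc repository; pad to four digits.
    printf 'datadotworld/dwcc-rc:%s-rc-%04d\n' "$version" "$rc"
  else
    printf 'datadotworld/dwcc:%s\n' "$version"
  fi
}
```

With these assumptions, dwcc_image 2.36 prints datadotworld/dwcc:2.36 and dwcc_image 2.37 3 prints datadotworld/dwcc-rc:2.37-rc-0003, matching the names listed above.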

How to run the data.world Collector command

If you use Docker with the Collector, there are several ways to run the Collector:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.

  • Create a configuration script - This option is very similar to the config.yml option. We have a quick start with instructions to create and run a script here. The instructions are for Snowflake, but you can get information on the parameters you will need to include for your particular data source from the Parameters section below.

  • Run the collector through a CLI - This option makes regular, repeating runs of the collector laborious and time-consuming because the commands must be re-entered for each run. Links to the instructions for each of the data sources we support can be found here.

Using a YAML file to run a collector
Create a YAML configuration file

The config.yml file provides a way to manage complex metadata collector runs from a CLI. Having all of your options saved in a file eliminates the tedium of retyping the commands every time you run the data.world Collector, and also allows you to run multiple Collectors at once from the same file.

Note

Running the Collector from a YAML file is very similar to running it through the CLI. The file supports and uses all the same command-line options, and has all the same prerequisites. However, when running the Collector with Docker in a YAML file/script, all parameter options used in the command must be specified in the YAML file.

The YAML configuration file is broken up into two sections:

  • global options

  • commands

Global options are used to express collector parameters that should remain the same for all collector runs in the file. For example, typically all of the collector runs for an organization should use the same value for account. Other parameters which are commonly shared across all collectors and convenient to specify globally are upload and API token.

Command options are used to specify the configuration for a single run of a specific collector. Each command-options section is an element in a YAML array labeled commands. The name of the element is the name of the collector command to run.

Using environment variables for sensitive option values allows the user running the data.world Collector to specify these values at runtime, rather than storing them in the config file. See this article for more information about environment variables.

global_options:
    agent: <string>
    output: /dwcc-output
    upload: <true | false>
    api_token: <string | empty>

commands:
- catalog-<datasource_name>:
    name: <catalogName>
    server: <string>
    database: <string>
    all-schemas: <true | false>
    role: <string>
    user: <string>
    password: <string | ENV variable>

Caution

Formatting and indenting are significant in YAML. See YAML Ain't Markup Language (YAML™) Version 1.2 for a good overview of YAML file structure and formatting.
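
One indentation mistake that is easy to miss is a tab character, which YAML does not allow for indentation. The sketch below is a quick pre-flight check for that single problem; it is not a full YAML validator. The function name check_indent is hypothetical and introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: fail if any line of a YAML file is indented with a tab.
# check_indent <file>
check_indent() {
  file="$1"
  tab="$(printf '\t')"
  # Match lines whose leading whitespace contains a tab; show them on stderr.
  if grep -n "^ *$tab" "$file" >&2; then
    echo "error: $file uses tabs for indentation; use spaces instead" >&2
    return 1
  fi
  echo "ok: no tabs in indentation of $file"
}

# Example call (commented out so that sourcing this file has no side effects):
# check_indent config.yml
```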

Note

Some options do not require values when used via the CLI, e.g., --upload, --no-log-upload. In the YAML file, these options must be set to true or false, as shown with upload in the above example.

To begin, create an empty config.yml file and open it in a text editor. An example of a config file for a Snowflake database that automatically uploads the catalog would look like this:

global_options:
    agent: democorp
    output: /dwcc-output
    upload: true
    api_token: ${DW_AUTH_TOKEN}

commands:
- catalog-snowflake:
    name: Snowflake
    server: ${SNOWFLAKE_SERVER}
    database: demo
    all-schemas: true
    role: PUBLIC
    user: demo-user
    password: ${SNOWFLAKE_PW}

Note

The above example assumes that your operating system or shell environment has the environment variables DW_AUTH_TOKEN, SNOWFLAKE_SERVER, and SNOWFLAKE_PW set.
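
Before starting a run, it can help to confirm that those variables are actually exported in the current shell. The following is a minimal sketch of such a check; the function name require_env is hypothetical, and it relies on the printenv utility being available, which is an assumption about your system.

```shell
#!/bin/sh
# Hypothetical helper: verify that each named variable is exported and non-empty.
# require_env NAME [NAME ...]
require_env() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which are the ones that
    # child processes started from this shell will inherit.
    if [ -z "$(printenv "$name" 2>/dev/null)" ]; then
      echo "error: environment variable $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (commented out so that sourcing this file has no side effects):
# require_env DW_AUTH_TOKEN SNOWFLAKE_SERVER SNOWFLAKE_PW
```

The function reports every missing name rather than stopping at the first one, so a single call shows everything that still needs to be exported.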

Store the file in the directory from which you will run Docker.

A list of the data resources we currently support for metadata collection is here along with links to the configuration parameters for each of them.

Running the YAML file with Docker

There are a few things to keep in mind when you use Docker to run your metadata collector:

  • No data.world Collector options are allowed on the command line after config.yml.

  • We have licensing permission to distribute some JDBC drivers with our collectors. Drivers that are not supplied need to be mounted. (See the example of the command used to run a collector for your data source on the data.world Collector configuration page for that collector.) Here is a list of the sources we currently support with links to the configurations for them.

The config.yml file is run with Docker, so it uses the same Docker commands you would use for your data source. Examples of the commands used for each data source are provided on their respective configuration pages. However, when you run Docker with a config file, you need to add one more mount statement to make the file available inside the container. If you store your data.world Collector configurations in the directory /dwcc-configs, then you would add the mount directive --mount type=bind,source=/dwcc-configs,target=/dwcc-configs and specify the file as --config-file /dwcc-configs/config.yml.

Here is an example of a command used to run the config file:

docker run -it --rm \
  --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log \
  --mount type=bind,source=/dwcc-configs,target=/dwcc-configs \
  datadotworld/dwcc:x.y \
  --config-file /dwcc-configs/config.yml

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.80).
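
To make the placeholder harder to overlook, the command can be assembled by a small wrapper that refuses x.y and latest. The sketch below only prints the command so that you can review it before running it. The function name dwcc_config_cmd is hypothetical. The optional trailing variable names are forwarded with Docker's -e flag; whether the Collector needs them passed this way to resolve ${...} values in config.yml is an assumption, so leave them out if your setup differs. Paths are assumed to contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: print the docker run command for a config-file run.
# dwcc_config_cmd <version> <config-dir> <output-dir> [ENV_VAR ...]
dwcc_config_cmd() {
  if [ "$#" -lt 3 ]; then
    echo "usage: dwcc_config_cmd <version> <config-dir> <output-dir> [ENV_VAR ...]" >&2
    return 2
  fi
  version="$1"
  config_dir="$2"
  output_dir="$3"
  shift 3
  case "$version" in
    x.y|latest|Latest|'')
      echo "error: specify a numeric Collector version, e.g. 2.80" >&2
      return 1
      ;;
  esac
  cmd="docker run -it --rm"
  cmd="$cmd --mount type=bind,source=$output_dir,target=/dwcc-output"
  cmd="$cmd --mount type=bind,source=$output_dir,target=/app/log"
  cmd="$cmd --mount type=bind,source=$config_dir,target=/dwcc-configs"
  # Assumption: forward the named host variables into the container.
  for name in "$@"; do
    cmd="$cmd -e $name"
  done
  cmd="$cmd datadotworld/dwcc:$version --config-file /dwcc-configs/config.yml"
  printf '%s\n' "$cmd"
}
```

For example, dwcc_config_cmd 2.80 /dwcc-configs /tmp prints the same command as the example above with the version filled in, while dwcc_config_cmd x.y /dwcc-configs /tmp stops with an error instead.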