How to catalog metadata

Metadata is cataloged using either:

  • The Connection Manager - data.world's GUI for creating and managing data connections. It is a simple interface used to create connections both for cataloging metadata and for accessing remote data through a virtual connection. The databases that can be configured with the Connection Manager are listed in this table.

    Here is an example of the Connection Manager interface:

    [Screenshot: Connection Manager, Snowflake connection]
  • The data.world Collector - a data.world program created to catalog metadata that runs either from a Docker container or from a jar file. We recommend running it inside a Docker container. If you cannot use Docker, we can provide you with a jar file containing the correct Collector version for your implementation. See our data.world Collector FAQ for more information about the Collector.

Docker is a highly configurable and very well documented application. This is a brief introduction to the Docker options and syntax used with the data.world Collector. Here are two examples of the Docker portion of scripts used with the Collector:

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log \
--mount type=bind,source=/jdbcdrivers,target=/usr/src/dwcc-config/lib \
datadotworld/dwcc:x.y catalog-oracle

docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
--mount type=bind,source=/tmp,target=/app/log \
--mount type=bind,source=/path/to/local/.aws/credentials,target=/root/.aws/credentials,readonly \
-e AWS_PROFILE=<PROFILE> datadotworld/dwcc:x.y catalog-athena...

This is a brief explanation of the various pieces:

  • docker run The docker run command first creates a writeable container layer over the specified image, and then starts it using the specified command.

  • -it A combination of -i (keep STDIN open even if not attached) and -t (allocate a pseudo-TTY). Together they instruct Docker to allocate a pseudo-TTY connected to the container's stdin, which gives the container an interactive terminal session.

  • --rm By default a container’s file system persists even after the container exits. This makes debugging a lot easier (since you can inspect the final state) and you retain all your data by default. But if you are running short-term foreground processes, these container file systems can really pile up. If instead you’d like Docker to automatically clean up the container and remove the file system when the container exits, you can add the --rm flag.

  • --mount Mounts volumes, host directories, and tmpfs mounts in a container. Its value consists of multiple key-value pairs, separated by commas, each in the form <key>=<value>. See the Docker site for more information on bind mounts and their options, and for a comparison between --volume and --mount in the service create command reference.

  • type=bind,source=/tmp,target=/dwcc-output An example of a --mount value. Its keys are explained in the following items.

  • bind bind-mounts a directory or file from the host into the container.

  • src or source Required. Specifies the absolute path to the file or directory to bind-mount (for example, src=/path/on/host/). Docker produces an error if the file or directory does not exist.

  • target Mount path inside the container, for example /some/path/in/container/. If the path does not exist in the container's filesystem, the Engine creates a directory at the specified location before mounting the volume or bind mount.

  • --env or -e Sets environment variables in the container, for example -e AWS_PROFILE=<PROFILE> in the second script above.
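
Because Docker produces an error when a bind-mount source does not exist, it can help to check the host paths before starting the Collector. The following is a minimal sketch and is not part of Docker or the Collector: the helper names mount_field and check_mount_source are hypothetical, and the parsing assumes the simple comma-separated <key>=<value> form described above, with no quoted values.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from a --mount value such as
# "type=bind,source=/tmp,target=/dwcc-output". Returns 1 if the key is absent.
mount_field() {
  local spec="$1" key="$2" pair
  local IFS=','
  for pair in $spec; do
    # Compare the text before the first "=" with the requested key.
    if [ "${pair%%=*}" = "$key" ]; then
      printf '%s\n' "${pair#*=}"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: report whether the host path named by source (or src)
# exists. Returns 0 if it exists, 1 if it is missing, 2 if no source key is set.
check_mount_source() {
  local spec="$1" src
  src="$(mount_field "$spec" source || mount_field "$spec" src)" || {
    echo "no source key in: $spec" >&2
    return 2
  }
  if [ -e "$src" ]; then
    echo "ok: $src"
  else
    echo "missing: $src" >&2
    return 1
  fi
}

check_mount_source "type=bind,source=/tmp,target=/dwcc-output"
```

Running the check once for each --mount value in a script surfaces a missing host directory before docker run is invoked.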

An explanation of all the docker run options is available here.
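
To see how the pieces fit together, the first example can be assembled from variables and printed for review before it is run. This is only a sketch: OUTPUT_DIR, LOG_DIR, DRIVER_DIR, DWCC_TAG, and build_dwcc_command are hypothetical names, x.y remains a placeholder for the Collector version, and the Collector's own options after the command name are left out, as in the examples above. It assumes paths without spaces, since the command is printed as plain text.

```shell
#!/bin/sh
# Hypothetical variables; override them in the environment as needed.
OUTPUT_DIR="${OUTPUT_DIR:-/tmp}"            # host directory bound to /dwcc-output
LOG_DIR="${LOG_DIR:-/tmp}"                  # host directory bound to /app/log
DRIVER_DIR="${DRIVER_DIR:-/jdbcdrivers}"    # host directory holding JDBC drivers
DWCC_TAG="${DWCC_TAG:-x.y}"                 # placeholder for the Collector version

# Print (do not run) the docker run command for the Collector command in $1,
# for example catalog-oracle.
build_dwcc_command() {
  printf 'docker run -it --rm'
  printf ' --mount type=bind,source=%s,target=/dwcc-output' "$OUTPUT_DIR"
  printf ' --mount type=bind,source=%s,target=/app/log' "$LOG_DIR"
  printf ' --mount type=bind,source=%s,target=/usr/src/dwcc-config/lib' "$DRIVER_DIR"
  printf ' datadotworld/dwcc:%s %s\n' "$DWCC_TAG" "$1"
}

build_dwcc_command catalog-oracle
```

Printing the command first makes it easy to confirm each --mount source and target before anything is started.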