Docker and DWCC quickstart

As we continue to add data resources to our Connection manager interface, it becomes increasingly easy to catalog your data through Connection manager tasks. However, there are still many resources for which you must run the data.world catalog collector (DWCC) in a Docker container from a Terminal window.

For some people in their company's data steward role, the Terminal interface is familiar and poses no issues. This quickstart is designed to help people who are less comfortable with Terminal quickly create a basic metadata catalog without having to fully understand Docker or DWCC. It is written for moderately technical people and does not cover all the parameters you can use to catalog your data resources. Once you have completed the quickstart, you can use the full documentation to configure DWCC for your data resource.

This quickstart is a fast walk-through of setting up and running DWCC for the first time. It is broken down into the following steps:

  1. Scan the section on DWCC catalog collectors to familiarize yourself with the concepts.

  2. Verify that your system has all the necessary prerequisites to catalog your data resource.

  3. Install DWCC.

  4. Create a script to catalog your resource with DWCC.

  5. Run the script to create your catalog and validate it.

  6. Upload your catalog to data.world.

Additional prerequisites you'll need to run the catalog collector are:

  • You will need to use the Bash shell in the Terminal window to run your script. (If you are running macOS Catalina or later, here are instructions for changing the default shell to Bash.)
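Before going further, it can help to confirm that the tools this quickstart relies on are present. The snippet below is a minimal sketch rather than an official check: the helper name `require_cmd` is made up for illustration, and it only tests whether the `bash` and `docker` commands can be found on your PATH, not whether Docker is actually running.

```shell
# Report whether a command is available on the PATH.
# (require_cmd is a hypothetical helper, not part of DWCC or Docker.)
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# The quickstart needs Bash for the script and Docker to run DWCC.
require_cmd bash || echo "Install Bash or switch your shell before continuing." >&2
require_cmd docker || echo "Install Docker before continuing." >&2

# Show your default login shell (for example, /bin/zsh on recent macOS).
echo "Default shell: ${SHELL:-unknown}"
```

If either command is reported as missing, resolve that first; the later steps will not work without both.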

The information you will need about data.world and your database is:

  • The name you want to use for your collection (e.g., Snowflake)

  • The name of your organization (e.g., ddw-doccorp)

  • The name of the database you are cataloging (e.g., NIGHTOWL_DB)

  • Your username for the database (MUST have read permissions to the data)

  • The host name for the database (e.g., cu55907.snowflakecomputing.com)

  • The name of the schema in the database (e.g., NIGHTOWL_SCHEMA)

  • Your password to the database

  1. Download the current DWCC zip file from us and put it in the directory where you want to run it.

  2. Follow these steps to load the DWCC into Docker:

    1. Open your Terminal window and navigate to the directory where you put the DWCC zip.

    2. Run: docker load --input dataworld-DWCC-X.X.tar.gz (where X.X is the current DWCC release, e.g., 2.17).

    3. Copy the image ID from the results and run: docker tag <image_id> dwcc (Docker requires the name to be lowercase, and dwcc is the image name the script below uses).
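If you would rather not copy the image ID by hand, the load and tag steps can be chained. The sketch below assumes that `docker load` prints a line of the form `Loaded image ID: sha256:...`, which is what Docker reports for an archive that carries no image name; the function name `extract_image_id` is hypothetical, and the sample text stands in for real output so the snippet runs without Docker.

```shell
# Print the image ID found in `docker load` output read from stdin.
# Assumes the output contains a line such as:
#   Loaded image ID: sha256:0123abcd...
extract_image_id() {
  sed -n 's/^Loaded image ID: //p'
}

# Demonstration with sample text standing in for real `docker load` output.
sample_output='Loaded image ID: sha256:0123abcd'
image_id=$(printf '%s\n' "$sample_output" | extract_image_id)

# Show the tag command you would run next (lowercase name, as Docker requires).
echo "docker tag $image_id dwcc"
```

With the real archive you would replace the sample with `image_id=$(docker load --input dataworld-DWCC-X.X.tar.gz | extract_image_id)` and then run `docker tag "$image_id" dwcc`. If the archive already carries a name, Docker prints `Loaded image: name:tag` instead and the function prints nothing, so check that `image_id` is not empty before tagging.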

The cataloger is now installed and ready to use.

Unless you have a script editor on your computer, you won't be able to create a .sh file directly. To get around that limitation, first create a .txt file with a standard text editor and then rename it, changing the extension to .sh.

  1. Create a plain text (.txt) file with the name you want for your script (e.g., DWCC.txt) in the directory where you want to run DWCC.

  2. After saving your file, rename it to DWCC.sh.

  3. Open the file with your text editor and copy and paste the text below into it, then save and close it:

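    #!/bin/bash
    # Values for your collection and database; replace each placeholder.
    export COLLECTION_NAME="your_collection"
    export ORG_NAME="your_org"
    export DATABASE="database_name"
    export USER="user_name"
    export HOST="database_host_server"
    export SCHEMA="schema_name"
    export PASSWORD="password"
    # Run the Snowflake collector. Both mounts point at ./tmp, so the TTL
    # output and the log file land in the same directory.
    # The -d, -s, -S, -u, and -P connection flags are assumed here; confirm
    # them against the full DWCC documentation for your data resource.
    docker run --mount type=bind,source="${PWD}/tmp",target=/dwcc-output \
    --mount type=bind,source="${PWD}/tmp",target=/app/log \
    dwcc catalog-snowflake -n "$COLLECTION_NAME" -a "$ORG_NAME" \
    -d "$DATABASE" -s "$HOST" -S "$SCHEMA" -u "$USER" -P "$PASSWORD" \
    -o /dwcc-output -r PUBLIC

  4. Make the script executable by typing:

    chmod a+x DWCC.sh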

  5. In the Terminal window, run the command mkdir tmp to create the directory where your output TTL file will be saved.
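A common slip at this stage is leaving one of the values in the script empty. As a rough sketch, a small check like the following could be run before the docker command to catch that. The function name `check_required` is hypothetical, the example values are the ones from the list earlier in this quickstart, and the check only detects unset or empty variables, not placeholders you forgot to replace.

```shell
# Return non-zero and name each listed variable that is unset or empty.
# (check_required is a hypothetical helper, not part of DWCC.)
check_required() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Not set: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example values taken from the list earlier in this quickstart.
COLLECTION_NAME="Snowflake"
ORG_NAME="ddw-doccorp"
DATABASE="NIGHTOWL_DB"

if check_required COLLECTION_NAME ORG_NAME DATABASE; then
  echo "All required values are set."
fi
```

In your own script you would list every variable you export, and stop before calling docker if the check fails.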

The next step is to run the script and catalog your data.

  1. Type the following (in the same Terminal window you have been working in):

    ./DWCC.sh

  2. Open a Finder window and verify that there is a directory named "tmp" inside the directory where you've been working, and that it contains a file named your_database_name.your_collection_name.dwec.ttl. If it doesn't, there should be a log file telling you what went wrong.
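The same check can be done from the Terminal window instead of Finder. This is a minimal sketch under the assumption stated above, that the collector writes a file ending in `.dwec.ttl` into the tmp directory; the function name `check_output` is made up for illustration, and because the log file's name is not given here, the function simply lists whatever is in the directory when no catalog file is found.

```shell
# Look for catalog output (*.dwec.ttl) in a directory, defaulting to ./tmp.
# (check_output is a hypothetical helper, not part of DWCC.)
check_output() {
  dir=${1:-tmp}
  # Expand the glob into the positional parameters of this function.
  set -- "$dir"/*.dwec.ttl
  if [ -e "$1" ]; then
    printf 'Catalog file: %s\n' "$@"
    return 0
  fi
  # No match: show what is in the directory so any log file is easy to spot.
  echo "No .dwec.ttl file found in $dir. Directory contents:" >&2
  ls -l "$dir" >&2
  return 1
}

check_output tmp || echo "Look for a log file in tmp to see what went wrong." >&2
```

Run it from the directory where you ran the script, so that `tmp` resolves to the same directory the docker command mounted.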

  1. Log in to data.world.

  2. Go to your organization home page (https://data.world/{your-organization}).

  3. Select the datasets tile.

  4. Open your ddw-catalogs dataset. If there isn't a dataset named ddw-catalogs, create it, then open it.

  5. Select + Add data, then Upload from computer.

  6. Select the TTL file from the ../tmp directory on your computer.

  7. Go back to your organization page and open the new collection. You should see a list of tables in the database.