
Quickstart for Docker and a DWCC catalog collector

As we continue to add data resources to our Connection manager interface, it becomes increasingly easy to catalog your data through Connection manager tasks. However, there are still many resources for which you must run the data.world catalog collector (DWCC) in a Docker container from a Terminal window.

For some people in their company's data steward role, the Terminal interface is familiar and poses no issues. This quickstart is designed to help people who are less comfortable with Terminal quickly create a basic metadata catalog without having to fully understand Docker or the catalog collector. The steps below are for moderately technical people. They do not include information on all the different parameters you can use to catalog your data resources. Once you have run through the quickstart using your data source, you can use the full documentation to extend the configuration of the catalog collector with your data resource.

How to use this quickstart

This quickstart is a fast walk-through of setting up and running a DWCC catalog collector for the first time. It is broken down into the following steps:

  1. Scan the section on data.world catalog collectors to familiarize yourself with the concepts.

  2. Verify that your system has all the necessary prerequisites to catalog your data resource.

  3. Create a script to catalog your resource with DWCC.

  4. Run the script to create your catalog, then validate it.

  5. Upload your catalog to data.world.

Verify prerequisites

Additional prerequisites you'll need to run the catalog collector are:

  • You will need to use the Bash shell in the Terminal window to run your script. (If you are running macOS Catalina or later, Bash is no longer the default shell, so you will need to follow the instructions for changing the default shell to Bash.)
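A quick way to confirm these prerequisites from the Terminal window is sketched below. It assumes Docker is the only tool besides Bash that needs to be on your PATH; check_tool is a hypothetical helper name used for illustration and is not part of DWCC or Docker.

```shell
# Preflight sketch: report whether Bash and Docker are available.
# check_tool is a hypothetical helper, not part of DWCC or Docker.
check_tool() {
    # Prints "ok: <tool>" and returns 0 if the command is on the PATH,
    # otherwise prints "missing: <tool>" and returns 1.
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# BASH_VERSION is set only when the running shell is Bash.
if [ -n "${BASH_VERSION:-}" ]; then
    echo "Running Bash $BASH_VERSION"
else
    echo "This shell is not Bash; start one by typing: bash"
fi

check_tool docker || echo "Install and start Docker before continuing."
```

If the last line reports that docker is missing, install Docker and open a new Terminal window before moving on.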

The information you will need to have on hand is:

  • The name you want to use for your collection (e.g., Snowflake)

  • The name of your organization (e.g., ddw-doccorp)

  • The name of the database you are cataloging (e.g., NIGHTOWL_DB)

  • Your username for the database (MUST have read permissions to the data)

  • The host name for the database (e.g., cu55907.snowflakecomputing.com)

  • Your password to the database
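Before running the collector, it can help to confirm that none of the values listed above was left empty. The sketch below is one way to do that, using the variable names from the script later in this quickstart. require_vars is a hypothetical helper name, and the sample values are the examples given in the list, not real credentials.

```shell
# require_vars is a hypothetical helper; it is not part of DWCC.
# It prints one line per variable that is unset or empty and
# returns non-zero if any were missing.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup via eval keeps this portable across shells.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing value: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example values taken from the list above; replace them with your own.
COLLECTION_NAME="Snowflake"
ORG_NAME="ddw-doccorp"
DATABASE="NIGHTOWL_DB"
HOST="cu55907.snowflakecomputing.com"

require_vars COLLECTION_NAME ORG_NAME DATABASE USER HOST PASSWORD \
    || echo "Fill in the missing values before running DWCC."
```

Any name reported as missing still needs a value before the collector can connect to your database.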

Create an executable script to run DWCC

Unless you have a script editor on your computer, you won't be able to create a .sh file directly. To get around that limitation, you'll first create a .txt file with a standard text editor and then rename it, changing the extension to .sh.

  1. Create a plain text (.txt) file with the name you want for your script (e.g., DWCC.txt) in the directory where you want to run DWCC.

  2. After saving your file, rename it to DWCC.sh.

  3. Open the file with your text editor and copy and paste the text below into it. Replace each placeholder value in the export lines with your own information, and replace dwcc:x.y in the docker run command with the version of DWCC you are running (e.g., dwcc:2.36). The -A option catalogs all schemas to which the indicated user has access. Finally, save and close the file:

    #!/bin/bash
    # Connection details: replace each placeholder with your own value.
    export COLLECTION_NAME="your_collection"
    export ORG_NAME="your_org"
    export DATABASE="database_name"
    export USER="user_name"
    export HOST="database_host_server"
    export PASSWORD="password"

    # Run the collector. Both mounts point at ./tmp, so the catalog file
    # and the log files are written to the same directory.
    docker run --mount type=bind,source="${PWD}/tmp",target=/dwcc-output \
    --mount type=bind,source="${PWD}/tmp",target=/app/log \
    datadotworld/dwcc:x.y catalog-snowflake -A -n "$COLLECTION_NAME" \
    -a "$ORG_NAME" -d "$DATABASE" -u "$USER" -s "$HOST" -P "$PASSWORD" \
    -o /dwcc-output -r PUBLIC

  4. Make the script executable by typing:

    chmod a+x DWCC.sh

  5. In the Terminal window, run the command mkdir tmp to create the directory where your output TTL file will be saved.
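If you prefer to do the file handling from the Terminal window rather than a file browser, steps 1, 2, 4, and 5 can be sketched as a small function. This assumes the script is named DWCC.sh as in the example above; prepare_dwcc_dir is a hypothetical helper name. Step 3 (pasting and editing the script contents) still has to be done in your text editor.

```shell
# prepare_dwcc_dir is a hypothetical helper; it is not part of DWCC.
# Usage: prepare_dwcc_dir /path/to/working/directory
prepare_dwcc_dir() {
    workdir="${1:-.}"

    # Steps 1 and 2: create the text file, then rename it to .sh.
    # Skipped if DWCC.sh already exists so an edited script is not lost.
    if [ ! -e "$workdir/DWCC.sh" ]; then
        touch "$workdir/DWCC.txt"
        mv "$workdir/DWCC.txt" "$workdir/DWCC.sh"
    fi

    # Step 4: make the script executable.
    chmod a+x "$workdir/DWCC.sh"

    # Step 5: create the output directory; -p avoids an error if it exists.
    mkdir -p "$workdir/tmp"
}
```

Calling prepare_dwcc_dir with no argument works on the current directory, which matches the steps above when you are already in the directory where you want to run DWCC.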

Run and validate the script

The next step is to run the script and catalog your data.

  1. Type the following in the same Terminal window you have been working in, replacing your_scriptname with the name of your script (e.g., DWCC):

    ./your_scriptname.sh

  2. Open a Finder window and verify that there is a directory named "tmp" inside the directory where you've been working, and that it contains a .ttl file named your_database_name.your_collection_name.dwec.ttl. If the file isn't there, look in the same directory for a log file telling you what went wrong.
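The same check can be made from the Terminal window. The sketch below assumes the output directory is the tmp directory created earlier and that a successful run leaves a non-empty .ttl file there; find_catalog is a hypothetical helper name.

```shell
# find_catalog is a hypothetical helper; it is not part of DWCC.
# It lists every non-empty .ttl file in the given directory and
# returns non-zero if there are none.
find_catalog() {
    outdir="${1:-tmp}"
    found=1
    for f in "$outdir"/*.ttl; do
        # -s is true only for files that exist and are not empty.
        if [ -s "$f" ]; then
            echo "catalog file: $f"
            found=0
        fi
    done
    if [ "$found" -ne 0 ]; then
        echo "no .ttl file in $outdir; check the log files there"
    fi
    return "$found"
}
```

Running find_catalog tmp from the directory where you ran the script prints the path of the catalog file, or a reminder to look at the logs if the run did not produce one.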

Upload your catalog to data.world

  1. Log in to data.world.

  2. Go to your organization home page (https://data.world/{your-organization}).

  3. Select the datasets tile.

  4. Open your ddw-catalogs dataset. If there isn't a dataset named ddw-catalogs, create it, then open it.

  5. Select + Add data and choose the Upload from computer option.

  6. Choose the TTL file from the tmp directory you created earlier to upload.

  7. Go back to your organization page and open the new collection. You should see a list of tables in the database.