Quickstart for Docker and the data.world Collector

As we continue to add data resources to our Connection manager interface, it becomes increasingly easy to catalog your data through Connection manager tasks. However, there are still many resources for which you need to run the data.world Collector in a Docker container through a Terminal window.

For some people in their company's data steward role, the Terminal interface is familiar and poses no issues. This quickstart is designed to help people who are less comfortable with Terminal quickly create a basic metadata catalog without having to fully understand Docker or the Collector. The steps assume a moderately technical reader and do not cover all the parameters you can use to catalog your data resources. Once you have run through the quickstart with your own data source, you can use the full documentation to extend the Collector's configuration for that resource.

How to use this quickstart

This quickstart is a fast walk-through of setting up and running the data.world Collector for the first time. It is broken down into the following steps:

  1. Scan the FAQ for the data.world Collector to familiarize yourself with the concepts.

  2. Verify that your system has all the necessary prerequisites to catalog your data resource.

  3. Create a command to catalog your resource with the Collector.

  4. Put the command in a script and run the script to create your catalog and validate it.

  5. Upload your catalog to data.world.

Verify prerequisites

Additional prerequisites you'll need to run the catalog collector are:

  • You will need to use the Bash shell in the Terminal window to run your script. (If you are running macOS Catalina or later, where zsh is the default shell, follow the instructions for changing the default shell to Bash.)
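
A quick way to confirm these prerequisites from the Terminal window is sketched below. It is only an illustration: check_tool is a hypothetical helper name, and the script assumes that Bash and Docker are both expected to be on your PATH.

```shell
# Report whether a command-line tool can be found on the PATH.
# check_tool is a hypothetical helper, not part of the Collector.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: NOT found"
    return 1
  fi
}

# Show the login shell, then check for the two tools this quickstart relies on.
echo "Current login shell: ${SHELL:-unknown}"
check_tool bash || echo "Install Bash or switch your shell before continuing."
check_tool docker || echo "Install and start Docker before continuing."
```

If either tool is reported as not found, resolve that before moving on to the script.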

The information you will need about data.world and your database is:

  • The name you want to use for your catalog (e.g., Snowflake)

  • The name of your organization (e.g., ddw-doccorp)

  • The name of the database you are cataloging (e.g., NIGHTOWL_DB)

  • Your username for the database (MUST have read permissions to the data)

  • The host name for the database (e.g., cu55907.snowflakecomputing.com)

  • Your password to the database

Create an executable script to run the data.world Collector

Unless you have a script editor on your computer, you may not be able to create a *.sh file directly. To get around that limitation, first create a .txt file with a standard text editor and then rename it, changing the extension to .sh.

  1. Create a plain text (.txt) file with the name you want for your script (e.g., collector.txt) in the directory where you want to run the data.world Collector.

  2. After saving your file, rename it to collector.sh.

  3. Open the file with your text editor and copy and paste the command below into it. Replace the placeholder values in the export lines with your own information, and replace dwcc:x.y in the docker run command with the version of the data.world Collector you are running (e.g., dwcc:2.36). The -A option catalogs all schemas to which the indicated user has access. Finally, save and close the file:

    #!/bin/bash
    export COLLECTION_NAME="your_collection"  # name for your catalog
    export ORG_NAME="your_org"                # your data.world organization
    export DATABASE="database_name"           # database to catalog
    export USER="user_name"                   # database user with read permissions
    export HOST="database_host_server"        # database host name
    export PASSWORD="password"                # database password
    
    docker run --mount type=bind,source="${PWD}/tmp",target=/dwcc-output \
    --mount type=bind,source="${PWD}/tmp",target=/app/log \
    datadotworld/dwcc:x.y catalog-snowflake -A -n "$COLLECTION_NAME" \
    -a "$ORG_NAME" -d "$DATABASE" -u "$USER" -s "$HOST" -P "$PASSWORD" \
    -o /dwcc-output -r PUBLIC
  4. Open a terminal window and make the script executable by typing:

    chmod a+x collector.sh

  5. In the Terminal window run the command mkdir tmp to create the directory where your output TTL file will be saved.
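
If you prefer to do the file setup entirely from the Terminal window, steps 1, 2, 4, and 5 above can be sketched as follows. The directory name collector-quickstart is a hypothetical example; use whichever directory you want to run the Collector from. Step 3 still has to be done in your text editor.

```shell
# Hypothetical working directory; change it to wherever you want to run the Collector.
workdir="$PWD/collector-quickstart"
mkdir -p "$workdir"
cd "$workdir"

# Steps 1 and 2: create an empty text file and rename it to a .sh script.
# The guard avoids overwriting a collector.sh that already exists.
if [ ! -e collector.sh ]; then
  touch collector.txt
  mv collector.txt collector.sh
fi

# Step 3 happens in your text editor: paste the command into collector.sh.

# Step 4: make the script executable.
chmod a+x collector.sh

# Step 5: create the directory that will receive the output TTL file.
mkdir -p tmp

# Show the result so you can confirm the executable bit is set.
ls -l collector.sh
```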

Run and validate the script

The next step is to run the script and catalog your data.

  1. Type the following (in the same Terminal window as you have been working):

    ./collector.sh

  2. Open a Finder window and verify that there is a directory named "tmp" inside the directory where you've been working, and that it contains a .ttl file named your_database_name.your_collection_name.dwec.ttl. If the file isn't there, look in the same directory for a log file that tells you what went wrong.
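
The same check can be made from the Terminal window. The function below is a minimal sketch: check_output is a hypothetical helper name, and it assumes only that the catalog file ends in .ttl and that any log files are written to the same directory.

```shell
# Look for catalog (.ttl) files in a directory, "tmp" by default.
# check_output is a hypothetical helper, not part of the Collector.
check_output() {
  dir="${1:-tmp}"
  found=0
  for f in "$dir"/*.ttl; do
    # Skip the unexpanded pattern when no .ttl file exists.
    [ -e "$f" ] || continue
    echo "Catalog file: $f"
    found=1
  done
  if [ "$found" -eq 0 ]; then
    # No catalog was written, so list what is there (for example, log files).
    echo "No .ttl file found in $dir. Directory contents:"
    ls "$dir" 2>/dev/null
    return 1
  fi
}
```

Run check_output tmp from the directory that contains your script. If it reports that no .ttl file was found, open the log file it lists to see what went wrong.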

Upload your catalog to data.world
  1. Log in to data.world.

  2. Go to your organization's profile page.

  3. Click the Datasets tile.

  4. Open the ddw-catalogs dataset. If there isn't a dataset named ddw-catalogs, create it, and then open it.

  5. Select + Add data using the Upload from computer option.

  6. Choose the TTL file from the tmp directory you created on your computer and upload it.

  7. Go back to your organization page and open the new collection. You should see a list of tables in the database.