# Documentation

###### Introduction

The steps to catalog metadata using the data.world Collector are as follows:

1. Read over the data.world Collector FAQ to familiarize yourself with the Collector.

2. Verify that your installation meets the prerequisites below.

3. Validate that you have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.

4. Write the collector command and run it.

5. Upload the resulting file to your ddw-catalogs dataset (or other dataset as configured for your organization).

###### Prerequisites
• The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

• Docker must be installed. For more information see https://docs.docker.com/get-docker/.

• The user defined to run the data.world Collector must have read access to all resources being cataloged.

• The computer running the data.world Collector needs a Java Runtime Environment (JRE), version 11 or higher. (OpenJDK available here)
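The Docker and Java prerequisites can be checked from a terminal before a first run. The sketch below is an illustration only and is not part of the Collector: the function names (`have_command`, `java_major_version`, `check_prerequisites`) are hypothetical, and it assumes `java -version` reports a string such as `openjdk version "11.0.20"`. It does not check memory, processor speed, or read access to the source.

```shell
# Return success if the named command is available on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Extract the leading number of the quoted version string from
# `java -version`-style output, e.g. 'openjdk version "11.0.20"' gives 11.
# Legacy strings such as "1.8.0_292" give 1, which correctly fails the >= 11 test.
java_major_version() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print one line per prerequisite; return non-zero if any check fails.
check_prerequisites() {
  local rc=0 major
  if have_command docker; then
    echo "docker: found"
  else
    echo "docker: missing"
    rc=1
  fi
  if have_command java; then
    # `java -version` writes to stderr, so redirect it into the capture.
    major="$(java_major_version "$(java -version 2>&1)")"
    if [ "${major:-0}" -ge 11 ]; then
      echo "java: version ${major} found"
    else
      echo "java: version 11 or higher is required"
      rc=1
    fi
  else
    echo "java: missing"
    rc=1
  fi
  return "$rc"
}
```

Calling `check_prerequisites` prints one line per tool and returns a non-zero status if either check fails.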

###### Version of data source

This is the version of the underlying metadata source and/or its APIs that we have developed and tested against. The collector may still function properly with other versions, but this is the version we specifically developed against:

###### Looker permissions

To scan a Looker account, you will need to set up a specific role in Looker with these permissions checked:

• access_data

• see_lookml_dashboards

• see_looks

• see_user_dashboards

• explore (optional; needed only if Looker Explores are to be crawled)

• see_lookml

• develop

• see_datagroups

• see_sql

The Looker permissions are set under Admin > Roles > New Permission Set in Looker.

The permissions indicated will enable you to create catalogs for all the models and explores that you have access to. Here is more information from the Looker documentation about how access to content is managed:

Data Access, which controls which data a user is allowed to view, is primarily managed via Model Sets. Model Sets make up one half of a Looker role which is applied to users and groups. Data access can be further restricted within a model using access filters to limit which rows of data they can see, as though there was an automatic filter on their queries. You can also restrict access to specific Explores, joins, views, or fields using access grants.

To restrict access using an access grant you will need to:

1. Define an access grant

2. Apply the required_access_grants to the explore.

Here is an example of how an access grant could be structured based on a user attribute called “department”:

access_grant: datadotworld_scannable_explore {
  user_attribute: department
  allowed_values: [ "datadotworld" ]
}

More info on the access_grant can be found here.

Then you can apply the required_access_grants to the explore:

explore: explore_name {
  required_access_grants: [datadotworld_scannable_explore, access_grant_name, …]
}

###### Ways to run the data.world Collector

There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

• Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.

• Create a configuration script - This option is very similar to the one for the config.yml file. We have a quick start with instructions to create and run a script here. The instructions are for Snowflake, but you can get information on the parameters you will need to include for your particular data source from the Parameters section below.

• Run the collector through a CLI - This option makes regular, repeating runs of the collector laborious and time-consuming, as the commands must be re-entered for each run.

For this example we will be running the command from a CLI.

###### Writing the data.world Collector command

The easiest way to create your Collector command is to:

1. Copy the example command below

2. Edit it for your organization and data source

3. Open a terminal window in any Unix environment that uses a Bash shell and paste your command into it.

The example command includes the minimal parameters required to run the collector (described below); your instance may require more. A description of all the available parameters is at the end of this article. Edit the command by adding any other parameters you wish to use and by replacing the parameter values with your own information. Parameters required by the Collector are in bold.

### Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.80).
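One way to avoid running with the literal placeholder is to build the image reference from a version value and refuse to continue while it still reads x.y. This is a minimal sketch: the helper name `dwcc_image` is hypothetical, and 2.80 is only the example version mentioned above.

```shell
# Build the Collector image reference from a version string.
# Fails if the version is empty or still the x.y placeholder.
dwcc_image() {
  case "$1" in
    ""|x.y)
      echo "error: replace x.y with a real Collector version" >&2
      return 1
      ;;
  esac
  printf 'datadotworld/dwcc:%s\n' "$1"
}

# Example: prints datadotworld/dwcc:2.80
dwcc_image "2.80"
```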

###### Basic parameters

These are the basic parameters needed to run the collector. A list of all parameter options is at the end of this article. Where available, either short (e.g., -a) or long (--account) forms can be used.

###### Docker and the data.world Collector

Detailed information about the Docker portion of the command can be found here. When you run the command, docker run attempts to find the image locally; if it doesn't find it, it goes to Docker Hub and downloads it automatically.
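If you prefer to fetch the image ahead of time (for example, before a scheduled run), the same local-then-remote lookup can be done explicitly. The sketch below relies on the standard `docker image inspect` and `docker pull` commands; the function name `ensure_image` is hypothetical, and 2.80 stands in for whichever Collector version you use.

```shell
# Make sure an image is available locally, pulling it only when needed.
ensure_image() {
  local image="$1"
  # `docker image inspect` exits non-zero when the image is not present locally.
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "found locally: $image"
  else
    echo "not found locally, pulling: $image"
    docker pull "$image"
  fi
}

# Example (requires Docker to be installed and running):
# ensure_image "datadotworld/dwcc:2.80"
```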

### Collector runtime and troubleshooting

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
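Because a successful run prints nothing, a small check after the run can confirm the outcome. The following is a sketch under stated assumptions: the function name `check_collector_output` is hypothetical, and it assumes any *.log files sit in the output directory unless a separate log directory is passed as the second argument.

```shell
# Report catalog files in the output directory; on failure, list logs to review.
# Usage: check_collector_output OUTPUT_DIR [LOG_DIR]
check_collector_output() {
  local out_dir="$1"
  local log_dir="${2:-$1}"   # assumption: logs sit next to the output by default
  local found=1 f

  for f in "$out_dir"/*.dwec.ttl; do
    # When nothing matches, the unexpanded pattern is not an existing file.
    [ -e "$f" ] || continue
    echo "catalog file: $f"
    found=0
  done

  if [ "$found" -ne 0 ]; then
    echo "no *.dwec.ttl file found in $out_dir" >&2
    for f in "$log_dir"/*.log; do
      if [ -e "$f" ]; then
        echo "log file to review: $f" >&2
      fi
    done
  fi
  return "$found"
}
```

The function returns 0 when at least one catalog file exists and 1 otherwise, so it can gate a later upload step in a script.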

###### Upload the .ttl file generated from running the Collector

When the data.world Collector runs successfully, it creates a .ttl file in the directory you specified as the dwcc-output directory. The automatically generated file name is databaseName.catalogName.dwec.ttl. You can rename the file or leave the default, and then upload it to your ddw-catalogs dataset (or wherever you store your catalogs).

### Caution

If there is already a .ttl catalog file with the same name in your ddw-catalogs dataset, when you add the new one it will overwrite the existing one.
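If you want to keep a record of earlier runs before a catalog is replaced, one option is to save a dated local copy of each generated file before uploading it. This is a minimal sketch that assumes the default *.dwec.ttl naming; the helper names `archived_name` and `archive_catalog` and the archive directory are hypothetical, and the upload itself is not shown.

```shell
# Insert a date stamp before the .dwec.ttl suffix of a catalog file name.
# Usage: archived_name FILE STAMP
archived_name() {
  local file stamp
  file="$(basename "$1")"
  stamp="$2"
  printf '%s.%s.dwec.ttl\n' "${file%.dwec.ttl}" "$stamp"
}

# Copy a catalog file into an archive directory under a dated name.
# Usage: archive_catalog FILE ARCHIVE_DIR
archive_catalog() {
  local src="$1" archive_dir="$2" stamp
  stamp="$(date +%Y%m%d)"
  mkdir -p "$archive_dir"
  cp "$src" "$archive_dir/$(archived_name "$src" "$stamp")"
}
```

The original file keeps its name, so uploading it still replaces the existing catalog as described in the caution above; only the local archive copy carries the date.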