Information Schema Catalog Collector (ISCC) and the data.world Collector
There are two parts to cataloging metadata from a database's information schema:
Generate or manually create the CSV files containing the database's metadata
Run the data.world Collector against the CSV files
In this article we cover both parts in the order they need to be done.
About Information Schema Catalog Collector (ISCC)
Occasionally, a database configuration makes it difficult to connect the data.world Collector directly to the data source. In those cases, the Information Schema Catalog Collector (ISCC) can be used to access the information schema of a database as a source for cataloging the database's metadata.
The information schema is an ANSI-standard set of read-only views of all the tables, views, columns, and procedures in an RDBMS. The ISCC works by parsing CSV files created from the information schema and using them as an input source for the data.world Collector. You can find more information on the information schema here. Using the data.world Collector directly is the preferred method for cataloging an RDBMS, but the following instructions provide a secondary access method when needed.
Note
We have tested this collector against an MS SQL Server database, but it can be used on any database for which you can generate the four CSV files described in this guide.
Options and requirements
There are two ways to create the CSV files necessary for the Information Schema Catalog Collector. The first, and easiest, is to run SQL queries directly against the information schema of the database. The requirements for using this method are:
The database supports Information Schema and SQL querying.
You have permissions to query the database.
The second way to create the CSV files is to build them manually. If you use this option, your CSV files must contain the following columns.
For tables.csv:
TABLE_SCHEMA
TABLE_NAME
TABLE_TYPE
For columns.csv:
TABLE_SCHEMA
TABLE_NAME
COLUMN_NAME
ORDINAL_POSITION
IS_NULLABLE
DATA_TYPE
COLUMN_DEFAULT
CHARACTER_MAXIMUM_LENGTH
NUMERIC_PRECISION
For table_constraints.csv:
TABLE_SCHEMA
TABLE_NAME
CONSTRAINT_NAME
CONSTRAINT_TYPE
For constraint_column_usage.csv:
TABLE_SCHEMA
TABLE_NAME
COLUMN_NAME
CONSTRAINT_NAME
Note
If you are creating the CSV files manually, you can find information about the data formats and how null values are handled in this document.
Generate CSV files with SQL queries
To generate the CSV files from SQL queries against a database's information schema, run the following four SQL queries against your database:
select * from information_schema.tables
select * from information_schema.columns
select * from information_schema.table_constraints
select * from information_schema.constraint_column_usage
Export the results of each query to a CSV file with the corresponding name:
tables.csv
columns.csv
table_constraints.csv
constraint_column_usage.csv
These files will be loaded into the csv-file-directory used with the data.world Collector.
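For example, if your source is the MS SQL Server database this collector has been tested against, one way to script the export is with the sqlcmd utility. The sketch below makes assumptions: the server, database, and credential values are placeholders, and the sed step only removes the dashed separator line sqlcmd prints under the header row. If your data contains embedded commas (for example, in COLUMN_DEFAULT values), you may need a more robust export method.

#!/bin/bash
# Sketch: export the four information_schema views to CSV with sqlcmd.
# Replace the placeholder server, database, and credentials with your own.
# -s, sets a comma column separator; -W trims trailing padding from each column.
SQLCMD='sqlcmd -S myserver -d mydatabase -U myuser -P mypassword -s, -W'

export_view () {
  # SET NOCOUNT ON suppresses the "(N rows affected)" footer;
  # sed '2d' drops the dashed line printed under the header row.
  $SQLCMD -Q "SET NOCOUNT ON; SELECT * FROM information_schema.$1" | sed '2d' > "$2"
}

export_view tables tables.csv
export_view columns columns.csv
export_view table_constraints table_constraints.csv
export_view constraint_column_usage constraint_column_usage.csv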
Create CSV files manually
If you use this option to create the CSV files to use with the data.world Collector, they must contain the following columns.
For tables.csv:
TABLE_SCHEMA
TABLE_NAME
TABLE_TYPE
For columns.csv:
TABLE_SCHEMA
TABLE_NAME
COLUMN_NAME
ORDINAL_POSITION
IS_NULLABLE
DATA_TYPE
COLUMN_DEFAULT
CHARACTER_MAXIMUM_LENGTH
NUMERIC_PRECISION
For table_constraints.csv:
TABLE_SCHEMA
TABLE_NAME
CONSTRAINT_NAME
CONSTRAINT_TYPE
For constraint_column_usage.csv:
TABLE_SCHEMA
TABLE_NAME
COLUMN_NAME
CONSTRAINT_NAME
Note
If you are creating the CSV files manually, you can find information about the data formats and how null values are handled in this document.
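For example, a minimal tables.csv built by hand might look like the following. The schema and table names here are hypothetical; the TABLE_TYPE values (BASE TABLE and VIEW) are standard information schema values.

TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
dbo,customers,BASE TABLE
dbo,orders,BASE TABLE
dbo,v_active_customers,VIEW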
Once you have the CSV files, you are ready to run Docker and the data.world Collector against them just as you would for any other data source.
Introduction
Note
The latest version of the Collector is 2.128. To view the release notes for this version and all previous versions, please go here.
The data.world Collector harvests metadata from your source system. Please read over the data.world Collector FAQ to familiarize yourself with the Collector.
Prerequisites
You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.
The machine running the catalog collector should have connectivity to the internet or access to the source instance. A minimum of 8 GB of memory and a 2 GHz processor is recommended.
Docker must be installed. For more information see https://docs.docker.com/get-docker/.
The user defined to run the data.world Collector must have read access to all resources being cataloged.
The computer running the data.world Collector needs a Java Runtime Environment. OpenJDK 17 is supported and available here.
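A quick way to confirm the Docker and Java prerequisites on the machine that will run the collector is to check the installed versions from a terminal:

docker --version
java -version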
Ways to run the data.world Collector
There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:
Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.
Run the collector through the CLI - Repeated runs of the collector require you to re-enter the command for each run.
Note
This section walks you through the process of running the collector using the CLI.
Writing the data.world Collector command
The easiest way to create your Collector command is to:
Copy the following example command
Edit it for your organization and data source
Open a terminal window in any Unix environment that uses a Bash shell and paste your command into it.
The example command includes the minimal parameters required to run the collector (described below); your instance may require more. A description of all the available parameters can be found in this article. Edit the command by adding any other parameters you wish to use, and by replacing the values of the parameters with your own information as appropriate. Parameters required by the Collector are in bold.
Basic parameters
Each collector has parameters that are required, parameters that are recommended, and parameters that are completely optional. Required parameters must be present for the command to run. Recommended parameters are either:
parameters that exist in pairs, and one or the other must be present for the command to run (e.g., --agent and --base)
parameters that we recommend to improve your experience running the command in some way
Together, the required and recommended parameters make up the Basic parameters for each collector. The Basic parameters for this collector are:
Important
Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.113).
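As an illustration only, a command for this collector might look something like the following sketch. The mount paths, organization, and parameter values are placeholders, and the collector name (catalog-information-schema) and the --csv-file-directory option are assumptions made for this example; consult the parameter documentation referenced above for the exact names and values required by your version of the Collector.

docker run -it --rm \
  --mount type=bind,source=${HOME}/dwcc-output,target=/dwcc-output \
  --mount type=bind,source=${HOME}/iscc-csv-files,target=/csv-files \
  datadotworld/dwcc:2.128 catalog-information-schema \
  --agent=my-org \
  --name=my-database-catalog \
  --database=my-database \
  --csv-file-directory=/csv-files \
  --output=/dwcc-output \
  --api-token=${DW_AUTH_TOKEN} \
  --upload=true \
  --upload-location=ddw-catalogs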
Docker and the data.world Collector
Detailed information about the Docker portion of the command can be found here. When you run the command, docker run will attempt to find the image locally, and if it doesn't find it, it will go to Docker Hub and download it automatically.
Collector runtime and troubleshooting
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Either of those can be sent to support for investigation if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
Upload the .ttl file generated from running the Collector
When the data.world Collector runs successfully, it creates a .ttl file in the directory you specified as the dwcc-output directory. The automatically generated file name is databaseName.catalogName.dwec.ttl. You can rename the file or leave the default, and then upload it to your ddw-catalogs dataset (or wherever you store your catalogs).
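If you prefer to script the upload instead of using the browser, the sketch below uses the data.world file-upload API. The organization name (my-org) and token variable are placeholders, and the example assumes your catalog dataset is named ddw-catalogs.

# Upload the generated catalog file to the ddw-catalogs dataset
# (placeholders; substitute your own organization, dataset, token, and file name).
curl -X POST \
  -H "Authorization: Bearer ${DW_AUTH_TOKEN}" \
  -F "file=@databaseName.catalogName.dwec.ttl" \
  "https://api.data.world/v0/uploads/my-org/ddw-catalogs/files"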
Caution
If there is already a .ttl catalog file with the same name in your ddw-catalogs dataset, when you add the new one it will overwrite the existing one.
Automating updates to your metadata catalog
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, daily runs may be appropriate. For those with schemas that do not change often and are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
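For example, if you wrap your full collector command in a script (the path below is hypothetical), a crontab entry like the following runs it every Sunday at 2:00 AM and appends the output to a log file:

# minute hour day-of-month month day-of-week command
0 2 * * 0 /opt/dwcc/run-collector.sh >> /var/log/dwcc-collector.log 2>&1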