Enterprise docs

Metadata catalog collectors

A big part of efficiently using data is understanding and managing its metadata. data.world metadata collectors are designed to be your tools for aggregating and managing the metadata for all of your organization's data. The majority of our collectors use a specific program called DWCC and are called the DWCC collectors. We also have stand-alone collectors (e.g., the dbt ollector). The following articles contain additional information about the collectors in general. See the list of data sources we catalog for specific instructions to run each one.

Information Schema Catalog Collector

There are occasionally times when a database configuration makes it difficult to connect the DWCC directly to the data source. The Information Schema Catalog Collector (ISCC) makes it possible to use the information schema of a database as a source for cataloging the database's metadata.

The information schema is an ANSI-standard set of read-only views of all the tables, views, columns, and procedures in an RDBMS. The ISCC works by parsing CSV files created from the information schema and using them as an input source for DWCC. You can find more information on the information schema here. Using DWCC directly is the preferred method for cataloging an RDMS, but the following instructions provide a secondary access method when needed.

Note

We have tested this collector against a SQLServer database, but it can be used on any database for which you can generate the four CSV files described in this article.

Options and requirements

There are a couple of different ways to create the CSV files necessary for the Information Schema Catalog Collector. The first, and easiest, is to run SQL queries directly against the information schema of the database. Below are the requirements for using this method:

  • The database supports Information Schema and SQL querying.

  • You have permissions to query the database.

The second way to create the CSV files is to build them manually. If you use this option, your CSV files must contain the following columns.

For tables.csv:

  • TABLE_SCHEMA,

  • TABLE_NAME

  • TABLE_TYPE

For columns.csv:

  • TABLE_SCHEMA

  • TABLE_NAME

  • COLUMN_NAME

  • ORDINAL_POSITION

  • IS_NULLABLE

  • DATA_TYPE

  • COLUMN_DEFAULT

  • CHARACTER_MAXIMUM_LENGTH

  • NUMERIC_PRECISION

For table_constraints.csv:

  • TABLE_SCHEMA

  • TABLE_NAME

  • CONSTRAINT_NAME

  • CONSTRAINT_TYPE

For constraint_column_usage:

  • TABLE_SCHEMA

  • TABLE_NAME

  • COLUMN_NAME

  • CONSTRAINT_NAME

Note

If you are creating the CSV files manually you can find information about the data formats and how they are nulled in this document.

Generate CSV files with SQL query

To generate the CSV files from SQL queries against a database's information schema, run the following four SQL queries against your database:

select * from information_schema.tables

select * from information_schema.columns

select * from information_schema.table_constraints

select * from information_schema.constraint_column_usage

Export the results of each query as CSV files with the names:

  • tables.csv

  • columns.csv

  • table_constraints.csv

  • contraint_column_usage.csv

These files will be loaded into the csv-file-directory used with DWCC.

Create CSV files manually

If you use this option to create the CSV files to use with DWCCV, they must contain the following columns.

For tables.csv:

  • TABLE_SCHEMA,

  • TABLE_NAME

  • TABLE_TYPE

For columns.csv:

  • TABLE_SCHEMA

  • TABLE_NAME

  • COLUMN_NAME

  • ORDINAL_POSITION

  • IS_NULLABLE

  • DATA_TYPE

  • COLUMN_DEFAULT

  • CHARACTER_MAXIMUM_LENGTH

  • NUMERIC_PRECISION

For table_constraints.csv:

  • TABLE_SCHEMA

  • TABLE_NAME

  • CONSTRAINT_NAME

  • CONSTRAINT_TYPE

For constraint_column_usage:

  • TABLE_SCHEMA

  • TABLE_NAME

  • COLUMN_NAME

  • CONSTRAINT_NAME

Note

If you are creating the CSV files manually you can find information about the data formats and how they are nulled in this document.