Enterprise docs

Connect to data

With data.world there are several ways you can interact with your data. You can catalog your metadata with the Connection manager or our data.world catalog collectors (DWCC). You can also create live connections to your data that allow you to run queries against it and analyze it without needing to move it into data.world. Live connections can be created with the Connection manager or third-party tools. Details are in this section.

The Connection manager is a one-stop shop for creating and managing org-owned connections to your data sources. These connections can be used to:

  • Catalog your metadata (similar to the data.world catalog collector, or DWCC)

  • Read in your data (either through a live table or data extract)

To create a new connection to your data sources, select the Add connection button:

conn_manage_add_new.png

The dialog that opens lists the database sources for which you can create an organization-level connection:

org-level_connections.png

Select a source to be presented with the configuration screen for your connection.

Select your database from the following list to get specific configuration for it:

CM_three_dot_edit_delete_menu.png

The three-dot menu to the right of a connection name in the Connection manager allows you to both Edit connection and Delete connection.

Edit connection links to the same configuration page used to create the connection; you need the same security information (login, password, etc.) to change anything about the connection. Delete connection prompts you to confirm that you want to proceed, but does not require any further validation.

Note

Any organization admin or other authorized person with the credentials for a connection can modify it, and any data steward or other authorized person with access to the Connection manager can delete connections.

The Manage tasks button to the right of each connection links to a list of all the tasks for that connection. Tasks will eventually expand to include many more options, but for now only a metadata task can be created. A metadata task catalogs the metadata for a data source and places the extracted information into a specified collection.

Note

Before creating a task you need a collection in which to put the task output, so create the collection first.

Caution

Tasks work with version 2.x of DWCC. If you are still using DWCC v1.x, see the article on DWCC v2.0 for more information.

CM_tasks.png

The Create a task button opens a dialog where you can configure your new task. To create a task you will be prompted to choose a database and schema as appropriate. You also need to add it to a collection:

CM_create_a_task.png

 

In addition to creating tasks, you can also sync, edit, or delete them from the three dot menu to the right of the task. Sync runs the task again, pulling in any new metadata. With Edit task details, you can change any of the original configuration (database, schemas, or collection), or you can select Delete task.

CM_three_dot_task_menu.png

We continue to add new data sources to the Connection manager. However, there are still some sources that are only available for cataloging metadata with the data.world catalog collector (DWCC). Here is a list of our currently supported data connections:

Table 1. Supported data sources

Data source      Connection manager    Metadata collection through DWCC
-------------    ------------------    --------------------------------
Athena           yes                   yes
AWS Glue         not yet               yes
Azure Synapse    yes                   beta
BigQuery         not yet               yes
DB2              not yet               yes
Denodo           beta                  yes
DBT              not yet               beta
Domo             not yet               beta
Dremio           not yet               yes
Generic JDBC     not yet               yes
Hive             not yet               yes
Infor Ion        beta                  yes
Looker           not yet               beta
Manta            not yet               yes
MySQL            yes                   yes
OpenAPI          not yet               beta
Oracle           yes                   yes
PostgreSQL       yes                   yes
PowerBI          not yet               beta
Presto           not yet               yes
Redshift         yes                   yes
Snowflake        yes                   yes
SQL Anywhere     not yet               yes
SQL Server       yes                   yes



A big part of efficiently using data is understanding and managing its metadata. The collection and storage of electronic data has proliferated to such an extent that it's not uncommon for there to be no one in an organization who knows where and what everything is. Enter the data.world catalog collector (DWCC), designed to be the tool for aggregating and managing the metadata for all of your organization's data.

The DWCC is distributed as a Docker image, available via Dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of DWCC that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command-line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to data.world manually, or you can have the catalog collector upload it automatically using an API token.
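The run-and-upload pattern just described can be sketched in a few lines of shell. The image name, the `catalog` subcommand, and the `--upload` flag appear elsewhere in this document; the version number and the idea of staging the command in a variable are illustrative assumptions.

```shell
# Sketch only: assemble the DWCC invocation as a string and echo it for
# review. The image name, "catalog" subcommand, and "--upload" come from
# this document; version 2.36 is a placeholder -- substitute your own.
DWCC_VERSION="2.36"
DWCC_IMAGE="datadotworld/dwcc:${DWCC_VERSION}"
CMD="docker run -it --rm ${DWCC_IMAGE} catalog --upload"
echo "${CMD}"    # once reviewed, run the printed command directly
```

On first use the Docker client pulls the named image automatically; no separate install step is needed.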

Here is a list of the current sources supported by the data.world metadata cataloger:

  • Athena

  • AWS Glue

  • BigQuery

  • DB2

  • DBT

  • Denodo

  • Domo

  • Dremio

  • Generic JDBC

  • Hive

  • Infor Ion Data Lake

  • MANTA

  • MySQL

  • OpenAPI

  • Oracle

  • PostgreSQL

  • PowerBI

  • Redshift

  • Snowflake

  • SQL Server

  • Tableau

Depending on your data source, different things are cataloged by the DWCC. What is cataloged also changes as we release new versions of the DWCC. The following is a continually evolving list of data sources and features.

JDBC data sources

When the DWCC is run against a JDBC data source the following metadata is collected:

  • database name

  • connection information

  • schema name

  • table and view names by schema

  • column names

  • column datatypes

  • column length

  • column precision (as appropriate)

  • table and column descriptions (if they exist)

Primary and foreign key information is also collected by the DWCC, but it is not currently displayed in the platform.

JDBC sources include:

  • Athena

  • DB2

  • Denodo

  • Dremio

  • Hive

  • Infor ION

  • MySQL

  • Oracle

  • PostgreSQL

  • Presto

  • Redshift

  • Snowflake

  • SQL Anywhere

  • SQL Server*

Note

* For MS SQL Server, table and column descriptions are not cataloged, even if they exist.

Tableau

When the DWCC is run against Tableau server the following metadata is collected:

  • Workbook name

  • Dashboard name

  • Dashboard title

  • Project a dashboard is in

  • Non-dashboard views

  • Number of dashboard views

  • Tags for objects that have them

  • Relationships between views/dashboards and workbooks

  • Number of dashboard favorites

With the release of v2.0 of the DWCC there are some fundamental changes in how we catalog your data. Starting with DWCC v2.X we now use v2 URIs as the official locator IDs for metadata resources. This is a breaking change (for structural, intentional reasons) which is not backwards compatible with v1 URIs. In this article we cover the migration-related implications of this change and how to determine your best course of action.

We devised v2 URIs to improve how customers use our metadata collectors, and how they can store metadata. We also wanted to ensure uniqueness as well as accommodate MANTA, AWS Glue, and other future use cases (such as 3rd party data quality tools) where data may come from many different collectors and will need to be intelligently merged together in data.world.

Technical details and caveats

The new v2 URIs use server+port+databaseName+organization+context as the five key factors that are hashed together. These factors allow for a deterministic and unique identifier. The “context” component is hardcoded to “default” for now, but it gives us an option for handling both future requirements and one-off customer situations where it is necessary to disambiguate data sources. For non-database systems, a similar but system-appropriate pattern will be applied for v2 URIs. The Tableau collector, which is still separate at the moment, has not yet been transitioned to the new v2 URIs, and we will need to address this situation when we merge it into DWCC.
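Conceptually, hashing the five factors yields a deterministic identifier. The sketch below only illustrates the idea: the actual hash function, delimiter, and encoding DWCC uses are not specified here, so SHA-256 over a pipe-delimited string is purely an assumption for demonstration.

```shell
# Illustration only: DWCC's real hashing scheme is not documented here.
# Hashing the five factors (server, port, database, organization, context)
# with any fixed function yields a deterministic, unique identifier.
SERVER="db.example.com"; PORT="5432"; DB="sales"; ORG="acme"; CONTEXT="default"
URI_KEY=$(printf '%s|%s|%s|%s|%s' "$SERVER" "$PORT" "$DB" "$ORG" "$CONTEXT" \
  | sha256sum | cut -d' ' -f1)
echo "$URI_KEY"
```

Because the inputs fully determine the output, re-running a collector against the same source produces the same identifiers, which is what lets metadata from different collectors be merged intelligently.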

All new DWCC implementations should use v2 URIs. When you use DWCC 2.0 or newer, this happens automatically. Existing customers who either did not make any changes at the user edit layer (aka API or UI metadata edits), or who do not care about the changes made (for example, the edits were just for testing purposes), should either delete or replace/overwrite the v1 catalog files in ddw-catalogs and run the DWCC 2.X collector. Running the 2.X collector will generate the latest catalog with v2 URIs.

Note

The edit layer will not automatically link up with the new URIs, so those edits will essentially be gone.

We recommend all customers shift to v2 URIs as soon as possible to minimize the number of user edit layer changes that will have to be migrated. If you have made significant changes and would still like to migrate now, please contact us for assistance.

If, however, you have already made important changes at the user edit layer, you can continue to use the v1 URIs by including the -use-v1-uris flag when you run DWCC 2.x. On the development roadmap is an API endpoint to allow the user edit layer to be mass updated. We will also provide a SPARQL template to migrate the user edit layer from v1 URIs to v2 URIs. More details on this migration process will be provided as we get closer to the API endpoint being available.
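As a sketch, keeping v1 URIs just means appending the flag above to an otherwise normal DWCC 2.x invocation. The version number below is a placeholder, and the command is echoed for review rather than executed:

```shell
# Sketch: keep v1 URIs on DWCC 2.x by adding -use-v1-uris (flag as given
# in this document). Version 2.0 is a placeholder; echoed for review only.
CMD="docker run -it --rm datadotworld/dwcc:2.0 catalog -use-v1-uris"
echo "$CMD"
```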

When running DWCC via Docker to catalog large bodies of metadata (e.g., a data source with hundreds or thousands of tables and many thousands of columns), you might exhaust the available memory in the Docker container for the DWCC process. To address this problem, increase the memory available to Docker. On Windows and macOS, this is handled via a Docker Desktop preference change. If you are running on a native Linux host, the Docker host and native host are the same, so the memory available to Docker is all machine memory. On a Mac, e.g., go to Docker preferences:

docker_prefernces.png

And select Resources > Advanced. In this example the memory allowance is set to 2 GB. Increase it to 4 GB by moving the slider for Memory:

docker_resources_allocation.png

You can also increase the memory available to the DWCC container by terminating other containers running within the Docker host.
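On the command line, you can also set a single container's allowance explicitly with Docker's standard --memory flag (a generic Docker option, not a DWCC one); on Docker Desktop, the VM-level allowance set in Preferences remains the overall ceiling. The version number below is a placeholder, and the command is echoed for review rather than executed:

```shell
# Sketch: allow the DWCC container up to 4 GB via Docker's --memory flag.
# --memory is standard Docker; version 2.36 is a placeholder.
CMD="docker run -it --rm --memory=4g datadotworld/dwcc:2.36 catalog"
echo "$CMD"
```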

To display the licensing information for DWCC version 2.24 or later, run the following command in your terminal window:

docker run -it --rm datadotworld/dwcc:X.XX display-license

where X.XX is the version number for the DWCC.

You can set the level (severity) of log messages written to the console and log file. By default, we write “info” level messages; you can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). This is useful for troubleshooting; for example, support may ask you to run DWCC with debug logging turned on.

If you are using Docker, to set the level to something other than "info", add the option -e log_level=DEBUG (or whichever level you want) to your docker run command.
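For example (the version number is a placeholder, and the command is echoed for review rather than executed):

```shell
# Sketch: turn on debug logging with the -e log_level=DEBUG option from
# this document. Version 2.36 is a placeholder; echoed for review only.
CMD="docker run -it --rm -e log_level=DEBUG datadotworld/dwcc:2.36 catalog"
echo "$CMD"
```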

If you are having difficulty running one of our metadata catalog collectors, this article contains a regularly updated list of tips for figuring out what went wrong. If you are still having trouble, please contact support@data.world for more assistance.

User permission issues

If your run of the DWCC does not capture everything in the catalog that you think should be there, the first thing to check is the user account you use to connect to your resource to ensure that you can authenticate to the resource outside of DWCC and find those objects. For instance, with a database, you should be able to log into the database with a client (preferably a JDBC client like DBeaver) and see the objects. If the objects don't show up there either, it's a permissions issue.

Overwriting files on upload to the catalog

When you run the DWCC, the output file name is of the form [database name].[collection name].dwec.ttl. The result is that any time the DWCC is run more than once against the same database and uploaded to the same collection, the output file will be overwritten. Overwriting the results when cataloging all schemas in a database is fine, as the previously produced file is just updated.

However, there are instances--e.g., when it is necessary to catalog one schema in a database at a time--where using the same name for the output file results in an overwrite of unique information as opposed to an update. In this case it would be desirable to have unique names for each of the output files before they are uploaded to a collection in the catalog.

Currently the way to upload unique files from different schemas in the same database is to:

  1. Disable automatic upload of the TTL files when running the DWCC

  2. Rename each output file with a unique name after running the DWCC

  3. Manually upload each of the newly created TTL files.
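The rename step above can be sketched as shell commands. All file and directory names below are hypothetical; only the [database name].[collection name].dwec.ttl naming pattern comes from this document.

```shell
# Runnable sketch of step 2. "dwcc-output", "salesdb", "main-catalog",
# and the schema name "finance" are all hypothetical placeholders.
mkdir -p dwcc-output
# Pretend DWCC (run with automatic upload disabled) wrote this file
# after cataloging the "finance" schema:
touch dwcc-output/salesdb.main-catalog.dwec.ttl
# Step 2: give the file a unique, schema-specific name before upload.
mv dwcc-output/salesdb.main-catalog.dwec.ttl \
   dwcc-output/salesdb.finance.main-catalog.dwec.ttl
ls dwcc-output
```

Repeat for each schema, then manually upload each uniquely named TTL file to the collection.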

What is the DWCC?

The DWCC is a self-contained program for extracting metadata from various data sources, including databases like MS SQL Server, Redshift, Amazon Athena, and Snowflake, and non-database sources like Tableau Server. New sources--both database and non-database--are continually being added.

The DWCC is deployed as a command-line application shipped as a Docker image. When it is run, it creates a Docker container that is isolated from everything else on your system except the data source it catalogs, and the directory outside the container for the catalog output.

What does the DWCC collect?

The DWCC pulls only metadata from the source. It doesn't collect any data. For databases, the information gathered includes the number of tables and columns, the names of the tables and columns, key information, and the data types used--information that is useful to data analysts.

How many collectors do you need?

You can use one DWCC to catalog as many data sources as you have. All you need to do is change the name of the catalog source and the parameters on the command line.

What is needed to run the DWCC?

Because the DWCC is shipped as a Docker image, you need to have Docker installed on the local machine. If you can't use Docker, we also have a Java version of DWCC available. For more information about Docker see https://docs.docker.com/get-docker/.

The computer running the catalog collector should have network access to the data source.

The user running the catalog collector must have read access to the data resource.

For many data sources you will also need to have JDBC drivers for the data source installed on the local machine. The DWCC assumes the .jar file driver is in the ../jdbcdrivers directory.

Finally, a minimum of 2 GB of memory and a 2 GHz processor are required for all sources. Certain data sources (like BigQuery) may have additional requirements.

What operating system does the DWCC Docker image use?

Debian Buster (the development codename for Debian 10).

DWCC v2.36

  • Improvements to error messages produced when using a config-file to run DWCC

  • We disallow running catalog-postgres and catalog-redshift in the same config file, as the two collectors use incompatible JDBC drivers

  • Improved error handling throughout DWCC

  • Improvements in representation of Tableau data source names in tableau catalogs

  • Improvements to the MANTA collector

DWCC v2.35 Changes in this release:

  • Upgrade of Denodo collector to Denodo 8

  • Handle edge case of very large field values embedded in manta’s exported artifacts

  • Support for sites

  • Handle edge case of stored procedure columns in manta

DWCC v2.34 This release includes:

  • Enhancements to domo collector output

  • Testing improvements

  • A minor tableau collector enhancement

  • Fix for an issue in the tableau collector in which column fields were sometimes not properly identifying the Tableau Table from which they sourced their data

  • Improvement to the presentation of domo catalogs in the platform UI.

  • Changes to the dockerhub repository where we house images containing non-released versions of dwcc. Previously we were calling these “beta” releases; we now call them “release candidates”. The new repository is datadotworld/dwcc-rc and the image tags are x.y-rc-z where x.y is the next expected dwcc release, and z is an increment.

DWCC v2.33 Adds support for harvesting intra-database lineage from manta scans, and accommodates changes in MANTA R32 (aka 1.32). We no longer support MANTA versions earlier than MANTA R32.

DWCC v2.32 This release adds in collector support for Vertica db.

DWCC v2.31 Issued fix to ensure alignment of identifiers for databases referenced by Tableau and Looker collectors.

DWCC v2.30 Installed config file-driven configuration (as a hidden feature for now). Issued a fix for handling empty PowerBI objects returned by the API.

DWCC v2.29 The data.world catalog collector now supports Tableau Online! Additionally there was a bugfix for PowerBI.

DWCC v2.28 Bugfix release

DWCC v2.27 Added the optional CLI option tableau-graphql-page-size to the Tableau collector, which allows the user to set the number of objects to be included in each page of paginated queries.

DWCC v2.26 Updated the PowerBI collector so that if a report is unavailable via the API it will be logged, and cataloging will continue on the rest of the repository.

DWCC v2.25 This release includes better and more user-friendly error handling and reporting. We have also added an enhanced collection of Tableau metadata via the Tableau Metadata API (graphql endpoint). New metadata includes datasources, databases, fields, metrics, and many more inter-object relationships.

DWCC v2.24 DWCC is now distributed via Dockerhub. Additionally there are changes to the Tableau and PowerBI collectors, the ability to change the level of error messages written to the console and log file, and a new subcommand to display the DWCC license text.

For Tableau:

  • The Tableau collector now emits RDF in which the object of `dct:creator` is a `dwec:Agent` instead of a string literal. This means we write additional details about the Tableau account that created the dashboard, via properties of the `dwec:Agent` resource. These details include: account name, account “full name”, and account email address (if they are populated in Tableau).

For PowerBI:

  • The PowerBI collector writes resources representing powerbi “data sources” that are now of a PowerBI-specific class, rather than `dwec:DataArtifact`.

Logging changes:

  • It is now possible for users to set the level (severity) of log messages written to the console and log file. By default, we write “info” level messages; users can choose to write only errors (level=“ERROR”), errors and warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). This is useful for troubleshooting; for example, support may ask you to run DWCC with debug logging turned on.

Display DWCC license information:

  • License information for DWCC is now available as a subcommand of DWCC. To get all licensing information, run the command docker run -it --rm datadotworld/dwcc:X.XX display-license where X.XX is a version of DWCC greater than or equal to 2.24.

DWCC v2.23 Internal release

DWCC v2.22 Internal release

DWCC v2.21 Fixed some timeout issues with the Looker collector when fetching images from the Looker API. Fixed an issue with cataloging reports and dashboards based on user workspace permissions in PowerBI.

DWCC v2.20 With this release our Tableau collector now supports cataloging of workbooks and non-dashboard views, as well as harvesting tags on workbooks and views. Fixed an issue in the Looker collector where preview images returned from the Looker API were missing.

DWCC v2.19 Includes a clean-up of the embedded help commands for several collectors and:

  • Fixes an issue with the Tableau Server collector when cataloging multi-site server instances.

  • Adds --tableau-site parameter to enable user to restrict cataloging to a single site (not required, by default all sites in the instance are scanned). Value provided to --tableau-site can be a site ID or name.

DWCC v2.18 The Tableau collector now has a flag, --tableau-skip-images, which skips the harvesting of preview images for views. Usage is like this:

... catalog-tableau --tableau-api-base-url=http://ec2-44-192-86-11.compute-1.amazonaws.com/api/3.10/ --tableau-username=admin --tableau-password=password -a sc-test3 -n tableau-test --tableau-skip-images

DWCC v2.17 Adds a collector for Presto

DWCC v2.16 This release:

  • Adds the parameter --all-databases to the Athena collector so that it can catalog all the databases accessible from the logged-in account.

  • Fixes some issues with datatypes for dwec:externalUrl predicates.

DWCC v2.15 This release contains the following:

  • The Tableau collector formerly had a CLI parameter --tableau-project-id which could be used to catalog only assets in the project with the specified ID. The parameter is now --tableau-project and takes either a project ID or a project name.

  • Update to the MANTA collector to accommodate a minor change in the MANTA API with v 1.31. Customers who have updated their MANTA instance to v 1.31+ will want to use DWCC 2.15+.

  • The Looker collector now works for non-admin Looker users; however, when DWCC is run by a non-admin, the emitted catalog will not contain any information about databases used by Looker analysis assets (access to database information in Looker requires admin permissions).

  • All JDBC collectors now populate two new properties for dwec:DatabaseColumn: dwec:columnDefaultValue and dwec:columnIsNullable, which contain the default value for that column in newly inserted rows and whether the column can be null, respectively. (Note that only some databases/drivers provide this metadata; we put it in the catalog if it’s there.)

DWCC v2.14 Adds a collector for Looker. Minor update to the docker-save.sh script that includes available versions in the error message if you don’t supply a version.

DWCC v2.13 Adds CLI parameters that make it possible to pass arbitrary driver properties through to the connection.

DWCC v2.12 Adds a metadata collector for SAP (formerly Sybase) SQL Anywhere.

DWCC v2.11 Improves the Dremio collector’s handling of data sources nested within multiple layers of folders, and fixes a minor issue with the Dremio collector’s harvesting of lineage metadata from the Dremio graph API.

DWCC v2.10 Adds a collector for Domo. JDBC database collectors can now catalog all schemas in the database at once (the default remains to catalog only the user's default schema).

DWCC v2.9 Adds the Tableau Server collector and extends the OpenAPI collector to include a few additional schema metadata properties.

DWCC v2.8 Adds the Infor ION data lake collector. Optimizes collection of JDBC metadata (performance improvement).

DWCC v2.7 Adds a collector for PowerBI.

DWCC v2.6 Adds the Manta collector.

DWCC v2.5 Upgrades the Java runtime.

DWCC v2.4 Extends handling of OpenAPI collector parameters and responses.

DWCC v2.3 Adds support for OpenAPI (fka Swagger) collector.

DWCC v2.2 A refactoring release.

DWCC v2.1 Fixes an issue with the Denodo cataloger JDBC URL port.

DWCC v2.0 We now use v2 URIs as the official locator IDs for metadata resources. This is a breaking change (for structural, intentional reasons) which is not backwards compatible with v1 URIs. For more information see the article on DWCC v2.X.

DWCC v1.20 Addresses some memory issues and open-cursor leaks.

DWCC v.1.19 Adds statements to the catalog graph indicating that the catalog was created by DWCC (with a version). We also added the ability to write database schema objects to the catalog graph.

DWCC v1.18 Allows you to specify alternate organization permissions and upload locations when performing an automatic upload of the metadata.

DWCC v.1.16 and DWCC v.1.17 Address issues with the SQL Server cataloger.

DWCC v.1.15 Adds Dremio support with optional Catalog API lineage fetching.

DWCC v1.14 Enables you to change the amount of memory that gets allocated to a DWCC docker process. See our article on allocating additional memory to Docker for more information.

DWCC v.1.13 Adds support for Microsoft SQL Server, and we enable the JVM to use available memory in the container (useful for creating large catalogs). Additionally, we improve data type recognition in the AWS Glue cataloger.

As of DWCC v1.12 we can support not only Glue ETL jobs, but also Glue Data Catalog tables and columns.

With DWCC v.1.11 you can:

  • Upload generated catalogs via the --upload / -U command-line parameters

  • Upload the DWCC log when uploading generated catalogs with --upload

  • Fetch an organization's current catalog with the fetch-catalog command

In DWCC v1.10 we added support for AWS Glue and AWS Athena including cataloging ETL jobs associated with an AWS account. There is no need to mount in a jdbc drivers directory as the Glue cataloger uses the Glue API, not JDBC.

DWCC v.1.9 is a bug cleanup release.

It is now possible with DWCC v.1.8 to use JDBC drivers on the classpath as well as those found in a user-specified JDBC driver directory (drivers in the directory have higher precedence than classpath drivers).

DWCC v.1.7 is a bug-fix release.

DWCC v.1.6 adds support for arbitrary JDBC data sources and the ability to build one-off Docker images for testing, demos, etc.

With DWCC v.1.5 we add support for Oracle.

In DWCC v.1.4 we add support for Google BigQuery.

DWCC v.1.3 brings much new functionality including:

  • Support for Denodo and Snowflake

  • Compatibility of JDBC catalogs with tables imported through data.world integrations

  • Ability to differentiate source information for databases cataloged from localhost

  • Cataloging of REMARKS fields into dct:description

With DWCC v.1.2 we support Redshift databases.

DWCC v.1.1 contains documentation clarifications and expansions to streamline tags on customer Docker hosts.

The initial release of DWCC v.1.0 provides support for metadata catalog extraction for DB2, Hive, MySQL, and PostgreSQL.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use Docker, we have a Java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--account) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

Catalog collector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
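As one concrete option for scheduling, a cron entry can run the collector nightly. Everything below (schedule, image version, options, log path) is a placeholder to adapt, and the line is echoed so you can review it before installing it with crontab.

```shell
# Sketch: a crontab line that runs DWCC every night at 02:00 and appends
# output to a log. Version, options, and log path are all placeholders.
CRON_LINE='0 2 * * * docker run --rm datadotworld/dwcc:2.36 catalog --upload >> /var/log/dwcc.log 2>&1'
echo "$CRON_LINE"    # review, then install with: crontab -e
```

For weekly or monthly runs, adjust the first five cron fields accordingly.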

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

User permission issues

If your run of the DWCC does not capture everything in the catalog that you think should be there, the first thing to check is the user account you use to connect to your resource to ensure that you can authenticate to the resource outside of DWCC and find those objects. For instance, with a database, you should be able to log into the database with a client (preferably a JDBC client like DBeaver) and see the objects. If the objects don't show up there either, it's a permissions issue.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, a daily run may be appropriate. For those whose schemas change less often and are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for tailored recommendations on how best to optimize your catalog collector processes.
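For example, a crontab entry along these lines would run the collector every night at 2:00 AM. The paths and wrapper-script name are illustrative, not part of the product:

# m h dom mon dow  command
# Run the catalog collector nightly at 02:00; run-dwcc.sh is a
# hypothetical wrapper script containing your docker run command.
0 2 * * * /opt/dwcc/run-dwcc.sh >> /var/log/dwcc.log 2>&1

Redirecting stdout and stderr to a log file, as above, makes failed scheduled runs easier to diagnose.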

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom certificate as shown below (where ./ca.der is the name and location of the certificate file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -cacerts \
    -storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert .

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Installing the collector
  1. Request a download link for the dbt catalog collector from your data.world representative. Once you receive the link, download the collector's Docker image (or download it programmatically with curl).

  2. Load the Docker image into the local computer's Docker environment:

    docker load -i dwdbt-X.Y.tar.gz

    where X.Y is the version number of the dbt collector image.

  3. The previous command returns an <image id>, which needs to be tagged as dwdbt. Copy the <image id> and use it in the docker tag command:

    docker tag <image id> dwdbt
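Put together, the load-and-tag sequence looks like this, with a final check that the tag took effect. The archive version number is illustrative, and <image id> is a placeholder for the id printed by docker load:

# Load the downloaded image archive; Docker prints the loaded
# image's id when it finishes.
docker load -i dwdbt-1.0.tar.gz

# Tag the loaded image as dwdbt, substituting the id printed above.
docker tag <image id> dwdbt

# Verify the tag is in place before running the collector.
docker image ls dwdbt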
Parameters

The following parameters are used to run the dbt collector. Where available, either the short (e.g., -a) or long (e.g., --account) form can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either the short (e.g., -a) or long (--account) form can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.
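The collector is started with docker run. The exact flags depend on your database type (see the Parameters section); the invocation below is an illustrative sketch only, and the account name, host, database name, credentials, and output path are placeholder values:

```shell
# Illustrative sketch -- flag names vary by source type; consult the
# Parameters section for your database. Replace x.y with the DWCC version.
docker run -it --rm \
  --mount type=bind,source="$(pwd)"/dwcc-output,target=/dwcc-output \
  datadotworld/dwcc:x.y catalog \
  --account=my-org \
  --server=db.example.com \
  --database=sales \
  --user=catalog_reader \
  --password="$DB_PASSWORD" \
  --output=/dwcc-output \
  --api-token="$DW_API_TOKEN"
```

The --mount option makes the output .dwec.ttl file available on the host after the container exits, and supplying an API token lets the collector upload the file to data.world automatically rather than requiring a manual upload.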

Catalog collector runtime

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom certificate. In the Dockerfile below, ./ca.der is the name and location of the certificate file:

Dockerfile:

FROM datadotworld/dwcc:x.y
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -cacerts \
    -storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile, build the image (Docker image names must be lowercase):

docker build -t dwcc-cert .

Finally, change the docker run command to use dwcc-cert in place of datadotworld/dwcc:x.y.
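To confirm the certificate landed in the trust store, you can list it from the rebuilt image. This assumes the image was tagged dwcc-cert (Docker image names must be lowercase), that the certificate was imported under the alias used in the Dockerfile, and that keytool is on the image's PATH (it is invoked in the Dockerfile's RUN step):

```shell
# List the imported certificate from the JVM's default trust store.
docker run --rm --entrypoint keytool dwcc-cert \
  -list -cacerts -storepass changeit -alias startssl
```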

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
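For example, with cron you might schedule a weekly run. The wrapper script path and log location below are assumptions; any script that wraps your docker run command will do:

```shell
# m h dom mon dow  command
# Run the catalog collector every Sunday at 02:00 via a hypothetical
# wrapper script around the docker run invocation for DWCC.
0 2 * * 0  /opt/dwcc/run-dwcc.sh >> /var/log/dwcc.log 2>&1
```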

Note

For Snowflake sources, database and schema names are case-sensitive and must be specified with the same case as they appear in Snowflake.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either short (e.g., -a) or long (--acccount) forms can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC will harvest the metadata for everything that the user specified for the connection has access to. To restrict what is being cataloged, specify the database and schema as appropriate.

catalog connector runtime

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der RUN keytool -importcert -alias startssl -cacerts \
-storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert

Finally, change the docker run command to use DWCC-cert instead of DWCC.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.

The DWCC is distributed as a docker image, available via dockerhub. To run the collector, you'll reference the image by a fully-qualified name (datadotworld/dwcc:x.y, where x.y is the version of dwcc that you wish to run). The Docker client on your machine will pull the image if you don’t have it already--there is no need for you to explicitly install it. The image is run with a series of command line (CLI) options and outputs a file with the extension *.dwec.ttl. You can upload the file to to data.world manually, or you can have the catalog collector upload it automatically using an API token.

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use docker, we have a java version available as well -- contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Parameters

The following parameters are used to run the DWCC. Where available, either the short form (e.g., -a) or the long form (e.g., --account) can be used.

Run the collector

Important

Do not forget to replace x.y in the command datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.

Important

For JDBC sources, DWCC harvests the metadata for everything that the user specified in the connection has access to. To restrict what is cataloged, specify the database and schema as appropriate.
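As a sketch, a typical invocation looks like the following; x.y is your DWCC version, and the catalog options themselves (account, database, schema, output location, API token, and so on) are described in the Parameters section and deliberately elided here. The output mount path is an assumption:

```shell
# Sketch only: replace x.y with your DWCC version and supply the
# catalog options from the Parameters section. The volume mount gives
# the collector a place to write its *.dwec.ttl output file
# (the /dwcc-output path is an assumption for illustration).
docker run -it --rm -v "$(pwd)/dwcc-output:/dwcc-output" \
  datadotworld/dwcc:x.y catalog ...
```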

Catalog collector runtime

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM DWCC
ADD ./ca.der ca.der
RUN keytool -importcert -alias startssl -cacerts \
  -storepass changeit -noprompt -file ca.der

Then, in the directory with that Dockerfile:

docker build -t DWCC-cert .

Finally, change the docker run command to use DWCC-cert instead of DWCC.
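Putting the steps together, the build-and-run sequence looks like the sketch below. Note that real Docker image tags must be lowercase, so dwcc-cert is used here in place of the DWCC-cert placeholder, and the catalog options are elided (see the Parameters section):

```shell
# Build the certificate-enabled image from the directory containing
# the Dockerfile shown above (dwcc-cert is an illustrative tag name)
docker build -t dwcc-cert .

# Then run the collector using dwcc-cert instead of datadotworld/dwcc:x.y
# (the catalog options are elided here)
docker run -it --rm dwcc-cert catalog ...
```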

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
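As a sketch, a nightly schedule with cron might look like the following crontab entry; the schedule, log location, and the elided catalog options are all assumptions to adapt to your environment:

```shell
# Hypothetical crontab entry: run the collector every night at 02:00
# and append its output to a log file. Replace x.y and the elided
# catalog options with your own values.
0 2 * * * docker run --rm datadotworld/dwcc:x.y catalog ... >> /var/log/dwcc.log 2>&1
```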

When you connect a database to data.world using Connection manager or an integration from the Integrations gallery, your data continues to live at its source location and is not stored in data.world. This configuration is frequently referred to as data virtualization.

The Connection manager is the best way to create a virtual connection that will be owned by an organization. If you need a connection that you will own personally, you will need to create it from the integrations gallery. See Connection permissions for more information about choosing an owner for a connection.

One of the benefits of data virtualization is that it allows you to view and query data that would otherwise exceed the dataset size limits on data.world. It also ensures that you always have access to your most current data, without needing to schedule synchronizations or wait for the processing time an import or refresh would take.

When you query a live table using data.world, our system will translate your query from our native SQL dialect into the SQL dialect of the target system. That system will then execute the query on its own hardware and return the results to data.world for display. Another benefit of virtualization is that it makes managing permissions and access to the data easier.

Please be aware that cloud database providers frequently charge either by the amount of time that queries run on their systems or by the total amount of data scanned during the query. If this describes your database service, then executing queries against live tables in data.world will also incur charges on those systems.

Connection permissions

When you create a connection to a data source, you are asked to set the owner of the connection. By default, if you are in an organization, that organization is the owner of the connection. However, you can also set yourself as the owner, making it a personal connection.

There are two compelling reasons for having most connections owned by an organization:

  • There is no loss of access to data when an employee leaves and their account is deactivated.

  • Federation across data sources is faster and more efficient if it uses the same connection.

Organization-level connections are shared between admins of the organization and can be used by all of them to create new live tables. Non-administrator users can only query and preview existing live tables.

Note

Organization-owned connections can only be used to add data to datasets owned by that organization. If you are in organizations A and B, you cannot add data to a dataset owned by B using a connection owned by A.

With a personal connection, only the connection owner may create new live tables with the connection; other members of the organization can query and preview existing live tables.

Create a connection from the Integration Gallery

To create a new virtual data connection through the integrations gallery, go to the Database connectors section of the integrations gallery and select the connector you want to use:

Integrations_database_connectors.png

From the connector screen select + Enable Integration to get to the configuration screen.

Tip

Instructions for specific databases are located here.

Test configuration

Enter all the parameters into the configuration window and select Test configuration to make sure it works. If it does, select Configure to save it. You can now use this connection any time you add data.

Add_data.png
Add or modify a connection

Connections owned by an organization can be managed in the Connection manager whether they were created in the Integration gallery or in Connection manager. Personally-owned connections must be managed from your integrations.

To edit a connection from the Integrations gallery, go to Integrations and select My Integrations:

Integrations_page.png

On the screen for the integration select the Manage tab. From there you can Add a new connection for this data source, or edit or delete one of the existing connections.

Manage_Snowflake_connection.png

Note

You will need your original credentials (password or key file) to make changes to an existing connection.

Virtual connection configurations

Athena configuration
  • S3 Output Bucket Location - The Amazon S3 bucket where query results should be stored. The location should start with s3://. For example, to store results in a folder named "test-folder-1" inside an S3 bucket named "query-results-bucket", you would set the location to s3://query-results-bucket/test-folder-1

  • Workgroup - If your Athena instance is configured with multiple workgroups, you can assign your connection to a workgroup here

  • AWS ARN - A dedicated Identity and Access Management (IAM) role created specifically for data.world. This role must be created before you can configure a connection to Athena. See Create a dedicated IAM role for Athena connections for more information.

  • AWS external id - provided in the "Add a new Athena connection" dialog

Note

Before configuring a virtual connection to Athena you need to have set up an IAM role in the AWS console.

Athena_configuration_IG.png
Create a dedicated IAM role for Athena connections

To configure a virtual connection to Athena you will need to create a dedicated IAM role in your Amazon Web Services (AWS) console and enter the AWS Amazon Resource Name (ARN) for it in the Add a new connection dialog. To create the role, however, you will need to first get the AWS External ID from the bottom of the connection dialog. Follow the steps below to create the AWS role and the connection to Athena.

  1. Open the configuration screen as described above.

  2. Copy the External ID and do not close the dialog.

    Warning

    You must leave the Add a new connection dialog open while you connect to the AWS console and create the role needed for the connection, because a new external ID is generated every time you open the dialog.

  3. Go to the AWS console and select Create role.

    AWS_screen_1.png
  4. Use the following parameters for the role:

    • Select type of trusted entity - Another AWS account

    • Account ID - 465428570792

    • Require external ID - checked

    • External ID - The value copied from the Add new connection dialog in data.world

  5. Select Next: Permissions:

    AWS_screen_2.png
  6. Use the search bar to find the following two policies and add them:

    • AmazonAthenaFullAccess

    • AmazonS3FullAccess

    Note

    You may choose to be more fine-grained about precisely which buckets you allow data.world to access. data.world only needs write access to the S3 output bucket location configured earlier, plus the minimum permissions required to read data from the buckets backing your tables.

  7. Select Next: Tags and add any tags you would like.

  8. Select Next: Review.

    AWS_Screen_3.png
  9. Name the role, write a description, verify that the two policies shown above are present, and select Create role.

  10. Find the role you have just created:

    AEWS_find_role.png
  11. Copy its ARN, and paste the ARN into the dialog window you left open for adding a new Athena connection.

    AWS_role_permissions.png
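If you prefer to script these console steps, the same role can be created with the AWS CLI. This is a sketch under assumptions: the role name data-world-athena is hypothetical, and EXTERNAL_ID must be the value copied from the still-open data.world dialog:

```shell
# Hypothetical role name; EXTERNAL_ID is the value copied from the
# data.world "Add a new connection" dialog (keep that dialog open).
ROLE_NAME="data-world-athena"
EXTERNAL_ID="paste-your-external-id-here"

# Trust policy: allow the data.world account (465428570792, from step 4)
# to assume the role only when it presents the matching external ID.
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::465428570792:root" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "${EXTERNAL_ID}" } }
  }]
}
EOF

aws iam create-role --role-name "$ROLE_NAME" \
  --assume-role-policy-document file://trust-policy.json

# Attach the two managed policies from step 6
aws iam attach-role-policy --role-name "$ROLE_NAME" \
  --policy-arn arn:aws:iam::aws:policy/AmazonAthenaFullAccess
aws iam attach-role-policy --role-name "$ROLE_NAME" \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Print the role's ARN to paste into the data.world dialog
aws iam get-role --role-name "$ROLE_NAME" --query Role.Arn --output text
```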
Azure Synapse configuration
Azure_Synapse_config_IG.png
BigQuery configuration
  • Project ID - The unique identifier for your BigQuery project

  • Service account username - A Google account that is associated with your Google project, as opposed to a specific user

  • Service account key file - Provides the authentication information used in the connection configuration. This file will be uploaded when you enter the other configuration information into the dialog

BigQuery_config_IG.png
MS SQL Server configuration
SQL_Server_config_IG.png
MYSQL configuration
My_SQL_config_IG.png
Oracle Database configuration
Oracle_db_config_IG.png
PostgreSQL configuration
PostgreSQL_config_IG.png
Redshift configuration
Redshift_config_IG.png
Snowflake configuration
Snowflake roles, warehouses, and privileges

The Snowflake user specified in the connection must have a default Warehouse set in Snowflake.

All queries run against Snowflake with this connection will use this Warehouse for their compute power.

If the Snowflake user specified in the connection does not have a default Role set in Snowflake, the connection will use the Public role, which may limit privileges to access and query data.

If a default Role is set for the Snowflake user, that Role will be used by the data.world connection.

In order to create a virtualized connection to a table or view, the user must have USAGE privileges on the database and schema and SELECT privileges on the table or view.

If a default database is specified in the data.world connection modal, it must be specified in all uppercase letters.

Snowflake_config_IG.png
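The warehouse, role, and privilege requirements above can be satisfied with statements like the following, shown here via the snowsql CLI. All object names (DATAWORLD_USER, MY_WH, DWCC_ROLE, MYDB, MYSCHEMA, MY_TABLE) are hypothetical, and the statements must be run as a role with sufficient privileges to make these grants:

```shell
# Hypothetical names throughout; run as a role that can make these grants.
# Give the connection user the required default warehouse and a default role.
snowsql -q "ALTER USER DATAWORLD_USER SET DEFAULT_WAREHOUSE = MY_WH DEFAULT_ROLE = DWCC_ROLE;"

# USAGE on the database and schema plus SELECT on the table or view,
# as required for a virtualized connection.
snowsql -q "GRANT USAGE ON DATABASE MYDB TO ROLE DWCC_ROLE;"
snowsql -q "GRANT USAGE ON SCHEMA MYDB.MYSCHEMA TO ROLE DWCC_ROLE;"
snowsql -q "GRANT SELECT ON TABLE MYDB.MYSCHEMA.MY_TABLE TO ROLE DWCC_ROLE;"
```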