AWS Glue and DWCC
Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is also available; contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.
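The prerequisites above can be checked with a short script before the first run. The following is a minimal sketch in Python and is not part of DWCC: the function names are hypothetical, and reading the memory minimum as 2 GiB is an assumption.

```python
import shutil

# Assumption: the 2 GB memory minimum is interpreted as 2 GiB.
MIN_MEMORY_BYTES = 2 * 1024**3


def docker_installed(which=shutil.which):
    """Return True if a `docker` executable is found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return which("docker") is not None


def meets_memory_requirement(total_bytes):
    """Return True if `total_bytes` is at least the assumed 2 GiB minimum."""
    return total_bytes >= MIN_MEMORY_BYTES
```

Calling `docker_installed()` with no arguments checks the current machine. How the total memory is obtained depends on the operating system, so the sketch takes it as an argument.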

Parameters

A list of all parameter options for the collector is at the end of this article. Where available, either the short form (e.g., -a) or the long form (e.g., --account) can be used.

Run the collector

Important

Remember to replace x.y in datadotworld/dwcc:x.y catalog with the version of DWCC you want to use.
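One way to avoid running with the placeholder left in is to build the image reference in code and reject x.y outright. This is a hedged sketch, not an official wrapper: the helper names are hypothetical, the `docker run --rm` invocation around the image is an assumption because the full command is not shown in this section, and `collector_args` stands in for the collector parameters.

```python
PLACEHOLDER_VERSION = "x.y"


def dwcc_image(version):
    """Return the DWCC image reference for a concrete version string."""
    if not version or version == PLACEHOLDER_VERSION:
        raise ValueError("replace x.y with the DWCC version you want to use")
    return f"datadotworld/dwcc:{version}"


def catalog_command(version, collector_args=()):
    """Build an argument list for running the collector in Docker.

    Assumption: the image is started with `docker run --rm`; adjust this
    to match the full command you were given.
    """
    return ["docker", "run", "--rm", dwcc_image(version), "catalog", *collector_args]
```

The resulting list can be passed to `subprocess.run(...)`. Building a list rather than a single string avoids shell quoting problems in parameter values.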

Important

For JDBC sources, DWCC harvests the metadata for everything that the user specified for the connection can access. To restrict what is cataloged, specify the database and schema as appropriate.

Catalog collector runtime

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.

User permission issues for DWCC collectors

If a run of the DWCC collector does not capture everything you expect to see in the catalog, first check the user account used to connect to the resource: confirm that you can authenticate to the resource outside of the collector and find those objects there. For instance, with a database, you should be able to log in with a client (preferably a JDBC client such as DBeaver) and see the objects. If the objects do not show up there either, it is a permissions issue.
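The check described above amounts to comparing three sets of object names: what you expected, what the user can see in a client, and what ended up in the catalog. A minimal sketch of that comparison follows; the function name and the sample object names in the usage note are hypothetical, and gathering the three lists is left to you.

```python
def diagnose_missing(expected, visible_in_client, in_catalog):
    """Classify expected objects that the collector run did not capture.

    Returns a dict mapping each missing object name to a short diagnosis.
    """
    visible = set(visible_in_client)
    report = {}
    for name in sorted(set(expected) - set(in_catalog)):
        if name in visible:
            # The user can see the object, so permissions are not the cause.
            report[name] = "visible to the user; not a permissions issue by this check"
        else:
            # The user cannot see the object outside the collector either.
            report[name] = "not visible to the user; likely a permissions issue"
    return report
```

For example, if `hr.salaries` is expected but appears neither in the client nor in the catalog, it is reported as a likely permissions issue for the connecting user.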

Automatic updates to your metadata catalog

Keep your metadata catalog up to date by using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule runs include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, a daily run may be appropriate. For those with schemas that do not change often and which are less critical, a weekly or even monthly run may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
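The guidance above can be written down as a small lookup that returns a cron expression. This is only an illustrative sketch: the function name is hypothetical, the 02:00 run time is arbitrary, and the weekly choice for mixed cases is an assumption, since the text only covers the two extremes.

```python
# Standard five-field cron expressions: minute hour day-of-month month day-of-week.
DAILY = "0 2 * * *"    # every day at 02:00
WEEKLY = "0 2 * * 0"   # every Sunday at 02:00
MONTHLY = "0 2 1 * *"  # first day of each month at 02:00


def suggest_schedule(changes_often, business_critical):
    """Suggest a cron expression for running the catalog collector."""
    if changes_often and business_critical:
        return DAILY
    if not changes_often and not business_critical:
        # The guidance allows weekly or monthly here; monthly is chosen
        # for this sketch.
        return MONTHLY
    # Assumption: mixed cases fall in between, so run weekly.
    return WEEKLY
```

The returned expression goes in the schedule field of a crontab entry, followed by the command that runs the collector.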