Integrations

Hive and DWCC

More information on collectors

For more information about catalog collectors, see the article data.world catalog collectors.

Prerequisites

  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information, see https://docs.docker.com/get-docker/. If you can't use Docker, a Java version is also available; contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.
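As a rough illustration, the first two prerequisites could be checked with a small pre-flight script before the collector is run. This is only a sketch: the function names (have_cmd, total_mem_mb, preflight) are made up for this example, the memory check reads /proc/meminfo and so only works on Linux, and low memory is reported as a warning rather than an error because the kernel reports slightly less than the installed amount.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these names are part of DWCC.

# Succeeds if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Prints total memory in MB on Linux; prints nothing on other systems.
total_mem_mb() {
    [ -r /proc/meminfo ] || return 0
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Fails if Docker is missing; only warns if memory looks low.
preflight() {
    if ! have_cmd docker; then
        echo "docker not found on PATH" >&2
        return 1
    fi
    mem=$(total_mem_mb)
    if [ -n "$mem" ] && [ "$mem" -lt 2048 ]; then
        echo "warning: ${mem} MB of memory reported; 2 GB is the stated minimum" >&2
    fi
    return 0
}

# Example use in a wrapper script:
#   preflight || exit 1
```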

Capturing table metadata properties from the Hive metastore

The Hive collector can capture table metadata properties, along with other valuable table-level metadata, from the Hive metastore. To catalog information from the metastore, use the --hive-metastore-* DWCC parameters.

You must pass all three --hive-metastore-* options for the collector to attempt to harvest anything from the Hive metastore. If --hive-metastore-jdbc-url isn't passed, the collector writes a warning and harvests the standard JDBC collector content. This doesn't prevent cataloging of the basic JDBC database, schema, table, and column objects; the collector just won't get the table-level metadata from the metastore.
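Because a partial set of metastore options is easy to pass by accident, a wrapper script could count them before invoking the collector. The sketch below is an assumption-laden illustration, not part of DWCC: the function names are hypothetical, and it only counts arguments that begin with --hive-metastore- without knowing the individual option names.

```shell
#!/bin/sh
# Hypothetical helpers for a wrapper script; not provided by DWCC.

# Prints how many arguments start with --hive-metastore-.
count_metastore_opts() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            --hive-metastore-*) count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Succeeds when none or all three metastore options are present.
# With none, the collector is expected to fall back to standard JDBC content.
check_metastore_opts() {
    n=$(count_metastore_opts "$@")
    if [ "$n" -eq 0 ] || [ "$n" -eq 3 ]; then
        return 0
    fi
    echo "expected 3 --hive-metastore-* options, found $n" >&2
    return 1
}

# Example use in a wrapper script, before calling docker run:
#   check_metastore_opts "$@" || exit 1
```

The check works for both the "--option value" and "--option=value" styles, since only the option token itself carries the prefix.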

Important

Make sure to supply a JDBC driver for the specific database as needed. In particular, if your metastore database is Oracle or MySQL, you need to put the driver JAR in the JDBC drivers directory (just as you would if you were running those databases' collectors). If your metastore database is PostgreSQL, Derby, or SQL Server, the necessary drivers ship with DWCC.

Parameters

A list of all parameter options for the collector is at the end of this article. Where available, either the short (e.g., -a) or long (e.g., --account) form can be used.

Run the collector

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use. E.g., datadotworld/dwcc:2.345.

Catalog collector runtime

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled.

Collector troubleshooting

Find a list of common issues and problems encountered when running the collectors in the article Troubleshooting the collectors.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM datadotworld/dwcc:x.y
ADD ./ca.der ca.der 
RUN keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file ca.der

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use. E.g., datadotworld/dwcc:2.345.

Then, in the directory with that Dockerfile:

docker build -t dwcc-cert .

Finally, change the docker run command to use dwcc-cert instead of datadotworld/dwcc:x.y.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
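As one possible way to schedule runs with cron, the sketch below wraps an arbitrary command so that its output is appended to a per-day log file, and then shows crontab entries for the daily, weekly, and monthly cadences mentioned above. The function name run_logged, the script path /opt/dwcc/run-dwcc.sh, the log location, and the 02:00 start time are all assumptions made for illustration; substitute your own docker run command and paths.

```shell
#!/bin/sh
# Hypothetical wrapper; not shipped with DWCC.

# Runs a command, appends its output to a per-day log file,
# and returns the command's exit status.
# Usage: run_logged LOG_DIR COMMAND [ARGS...]
run_logged() {
    log_dir=$1
    shift
    mkdir -p "$log_dir" || return 1
    log_file="$log_dir/dwcc-$(date +%Y-%m-%d).log"
    "$@" >>"$log_file" 2>&1
    rc=$?
    echo "$(date +%Y-%m-%dT%H:%M:%S) exit status $rc" >>"$log_file"
    return "$rc"
}

# Example crontab entries (edit with `crontab -e`), assuming
# /opt/dwcc/run-dwcc.sh calls run_logged with your docker run command.
# Pick one line:
#   Daily at 02:00:
#     0 2 * * * /opt/dwcc/run-dwcc.sh
#   Weekly, Sundays at 02:00:
#     0 2 * * 0 /opt/dwcc/run-dwcc.sh
#   Monthly, on the 1st at 02:00:
#     0 2 1 * * /opt/dwcc/run-dwcc.sh
```

Recording the exit status in the log makes it easier to spot a failed scheduled run, since cron itself does not retry or alert by default.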