Preparing to run the Hive metastore collector

Setting up prerequisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item: Requirement

Hardware

Note: The following specs assume that you run one collector process at a time. Adjust the hardware if you run multiple collectors at the same time.

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software

  • Docker: Click here to get Docker.

data.world specific objects

  • Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

  • Allowlist IPs and domains

Capturing table metadata properties from the Hive metastore

The Hive collector can capture table metadata properties, along with other valuable table-level metadata, from the Hive metastore. To catalog information from the metastore, use the following collector parameters:

  • --hive-metastore-jdbc-url=<hiveMetastoreJdbcUrl> - The JDBC URL for the Hive metastore database. Pass the same value that you specify for javax.jdo.option.ConnectionURL in your Hive config.

  • --hive-metastore-password=<hiveMetastorePassword> - The password to use in authenticating to the Hive metastore database.

  • --hive-metastore-user=<hiveMetastoreUser> - The user to use in authenticating to the Hive metastore database.
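
Because the JDBC URL should match javax.jdo.option.ConnectionURL, it can help to read that value straight out of the Hive config instead of retyping it. The following is a minimal sketch, assuming the config is a standard hive-site.xml made of property elements with name and value children. The function name, the sample host, and the sample database name are hypothetical and are not part of the collector.

```python
import xml.etree.ElementTree as ET
from typing import Optional

CONNECTION_URL_KEY = "javax.jdo.option.ConnectionURL"


def read_metastore_jdbc_url(hive_site_xml: str) -> Optional[str]:
    """Return the metastore JDBC URL found in hive-site.xml text, or None if absent."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name == CONNECTION_URL_KEY:
            value = prop.findtext("value")
            return value.strip() if value else None
    return None


# Hypothetical sample config; the host and database name are placeholders.
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-db.example.com:5432/hive</value>
  </property>
</configuration>"""

jdbc_url = read_metastore_jdbc_url(SAMPLE_HIVE_SITE)
print(f"--hive-metastore-jdbc-url={jdbc_url}")
```

The value returned here is what you would pass to --hive-metastore-jdbc-url.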

You must pass all three --hive-metastore options for the collector to attempt to harvest anything from the Hive metastore. If --hive-metastore-jdbc-url isn't passed, the collector writes a warning and harvests the standard JDBC collector content. This does not prevent cataloging of the basic JDBC database, schema, table, and column objects; the collector just won't get the table-level metadata from the metastore.
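
The all-or-nothing rule for the three options can be checked before the collector is launched. The sketch below is one possible pre-flight check, not part of the collector itself: the helper name hive_metastore_args is hypothetical, and only the three option names come from this page. It returns no arguments when nothing is supplied (the standard JDBC harvest) and rejects a partial set.

```python
from typing import List, Optional


def hive_metastore_args(
    jdbc_url: Optional[str],
    user: Optional[str],
    password: Optional[str],
) -> List[str]:
    """Build the three --hive-metastore arguments, or none at all."""
    values = {
        "--hive-metastore-jdbc-url": jdbc_url,
        "--hive-metastore-user": user,
        "--hive-metastore-password": password,
    }
    supplied = [flag for flag, value in values.items() if value]
    if not supplied:
        # Nothing supplied: only the standard JDBC content would be harvested.
        return []
    if len(supplied) != len(values):
        missing = sorted(set(values) - set(supplied))
        raise ValueError("Missing Hive metastore options: " + ", ".join(missing))
    return [f"{flag}={value}" for flag, value in values.items()]


# Placeholder values for illustration only.
args = hive_metastore_args("jdbc:postgresql://db.example.com:5432/hive", "hive", "secret")
print(args)
```

Failing early on a partial set is a design choice for this sketch; it surfaces a misconfiguration before the run instead of relying on the collector's warning.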

Important

Make sure to supply a JDBC driver for the specific database as needed. In particular, if your metastore database is Oracle or MySQL, you need to put the driver jar in the JDBC drivers directory (just as you would if you were running those databases' collectors). If your metastore database is PostgreSQL, Derby, or SQL Server, we ship the necessary drivers with the data.world collector.
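
The driver rule in the note above can be turned into a quick lookup keyed on the JDBC URL. This is a rough sketch, assuming the metastore URL uses the conventional subprotocol names (oracle, mysql, postgresql, derby, sqlserver). The function name and the mapping are illustrative; any other database returns None because this page does not say whether its driver ships with the collector.

```python
from typing import Optional

# True: you must place the driver jar in the JDBC drivers directory.
# False: the driver ships with the collector, according to the note above.
DRIVER_MUST_BE_SUPPLIED = {
    "oracle": True,
    "mysql": True,
    "postgresql": False,
    "derby": False,
    "sqlserver": False,
}


def must_supply_driver(jdbc_url: str) -> Optional[bool]:
    """Return True/False for known metastore databases, None for unknown ones."""
    parts = jdbc_url.split(":", 2)
    if len(parts) < 3 or parts[0].lower() != "jdbc":
        raise ValueError(f"Not a JDBC URL: {jdbc_url!r}")
    return DRIVER_MUST_BE_SUPPLIED.get(parts[1].lower())


# Placeholder URLs for illustration only.
for url in (
    "jdbc:mysql://db.example.com:3306/hive",
    "jdbc:postgresql://db.example.com:5432/hive",
):
    print(url, "->", must_supply_driver(url))
```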