About the Hive collector

Use this collector to harvest metadata for Hive tables and columns across your enterprise systems and make it searchable and discoverable in data.world.

Important

The Hive collector can be run on-premise using Docker or JAR files.

Note

The latest version of the Collector is 2.326. For details on this version and all previous versions, see the release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

| Object | Information cataloged |
|---|---|
| Columns | Name, Description, JDBC type, Column Type |
| Table | Name, Description, Last DDL time, Row count, Total size, Raw data size, File count, Erasure coded file count, Bucketing version, Is external, External table purge, Is translated to external, Column stats accurate |
| Views | Name, Description, SQL definition |
| Schema | Identifier |
| Database | Type, Name, Identifier, Server, Port, Environment, JDBC URL |



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 1.

| Resource page | Relationship |
|---|---|
| Table | Columns, Table Indexes |
| Columns | Table, Table Indexes |
| Table Indexes | Columns |
| Schema | Database that contains Schema, Table that is part of Schema |
| Database | Schema contained in Database |



Lineage for Hive

The following lineage information is collected by the Hive collector.

Table 2.

Object

Lineage available

Column in View

The collector identifies the associated columns in upstream views or tables:

  • From which the data is sourced

  • That sort the rows via ORDER BY

  • That filter the rows via WHERE/HAVING

  • That aggregate the rows via GROUP BY
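
To make the four lineage roles above concrete, the following minimal sketch pairs a hypothetical HiveQL view with the upstream columns that would fall under each role. The view, table, and column names are invented for illustration, and the dictionary layout is only a reading aid. It is not the collector's output format, which this page does not describe.

```python
# Hypothetical HiveQL view; the table and column names are illustrative only.
VIEW_SQL = """
CREATE VIEW sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM orders
WHERE status = 'shipped'
GROUP BY region
HAVING SUM(amount) > 1000
ORDER BY region
"""

# Upstream columns grouped by the four lineage roles listed above.
# This structure is an assumption made for the example, not collector output.
LINEAGE = {
    # View column -> upstream columns the data is sourced from
    "sourced_from": {
        "sales_summary.region": ["orders.region"],
        "sales_summary.total_amount": ["orders.amount"],
    },
    "sorted_by": ["orders.region"],                      # ORDER BY region
    "filtered_by": ["orders.status", "orders.amount"],   # WHERE and HAVING
    "aggregated_by": ["orders.region"],                  # GROUP BY region
}


def upstream_columns(lineage):
    """Return every distinct upstream column mentioned in any role, sorted."""
    found = set()
    for sources in lineage["sourced_from"].values():
        found.update(sources)
    for role in ("sorted_by", "filtered_by", "aggregated_by"):
        found.update(lineage[role])
    return sorted(found)


print(upstream_columns(LINEAGE))
```

In this example, `orders.region` appears in three roles (source, sort, and aggregation), which shows that a single upstream column can relate to a view in more than one way.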



Authentication supported

  • The collector supports username/password authentication to Hive.

IRI consistency between Hive Metastore and Hive collectors

If you are using both the Hive Metastore collector and the Hive collector to catalog the same Hive environment, configure them with matching parameters to ensure they produce identical IRIs for the same resources. This prevents duplicate entries in data.world and maintains a unified view of your metadata.

Table 3.

| Hive Metastore collector parameter | Hive collector parameter |
|---|---|
| --hive-server-host | --server |
| --hive-server-port | --port |
| --hive-database | --database |
| --hive-database-id | --database-id |



When these values match across both collectors, metadata harvested by either collector resolves to the same resources in your catalog.
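
One way to check the pairing before running both collectors is a small script such as the sketch below. The flag names come from the table above. The helper `find_mismatches`, the dictionary-based representation of the arguments, and all sample values are assumptions made for this example, and the comparison is a plain string equality check rather than the collectors' own IRI logic.

```python
# Parameter pairs taken from the table above:
# Hive Metastore collector flag -> Hive collector flag.
PARAMETER_PAIRS = {
    "--hive-server-host": "--server",
    "--hive-server-port": "--port",
    "--hive-database": "--database",
    "--hive-database-id": "--database-id",
}


def find_mismatches(metastore_args, hive_args):
    """Return (metastore_flag, hive_flag, metastore_value, hive_value) for
    each pair whose values differ. A flag set on only one side counts as a
    difference."""
    mismatches = []
    for ms_flag, hive_flag in PARAMETER_PAIRS.items():
        ms_value = metastore_args.get(ms_flag)
        hive_value = hive_args.get(hive_flag)
        if ms_value != hive_value:
            mismatches.append((ms_flag, hive_flag, ms_value, hive_value))
    return mismatches


# Hypothetical values for illustration only.
metastore_args = {
    "--hive-server-host": "hive.example.com",
    "--hive-server-port": "10000",
    "--hive-database": "sales",
    "--hive-database-id": "sales-prod",
}
hive_args = {
    "--server": "hive.example.com",
    "--port": "10000",
    "--database": "sales",
    "--database-id": "sales_prod",  # differs from the Metastore value
}

for ms_flag, hive_flag, ms_value, hive_value in find_mismatches(metastore_args, hive_args):
    print(f"{ms_flag}={ms_value!r} does not match {hive_flag}={hive_value!r}")
```

With the sample values, only the database ID pair is reported, because `sales-prod` and `sales_prod` are different strings.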