About the Marquez collector

Warning

This collector is in public preview. It has passed our standard testing, but it is not yet widely adopted. You might encounter unforeseen edge cases in your environment. data.world is committed to promptly addressing any issues with public preview collectors. If you face any problems, please report them through your Customer Success Director, implementation team, or support team for assistance.

Use this collector to harvest metadata for Marquez objects such as datasets, jobs, and job runs. The collector harvests lineage relationships among the data resources represented by datasets and the jobs that move data between them.

Important

The Marquez collector can be run on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.328. For details on this version and all previous versions, see the release notes.

Marquez versions supported

The collector supports Marquez version 0.50.1.

Authentication supported

The collector currently harvests from unauthenticated Marquez API instances only.
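To illustrate what an unauthenticated instance means in practice, the following minimal sketch builds (but does not send) a request to the Marquez REST API with no credentials attached. The base URL `http://localhost:5000`, the namespace name, and the helper `build_marquez_request` are assumptions for illustration only; this is not the collector's own code.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def build_marquez_request(base_url: str, namespace: Optional[str] = None) -> Request:
    """Build a GET request for the Marquez API without any credentials.

    With no namespace, the request targets the namespace listing;
    with a namespace, it targets that namespace's datasets.
    """
    path = "api/v1/namespaces"
    if namespace is not None:
        # Percent-encode the namespace so characters such as spaces or
        # slashes do not break the URL path.
        path += f"/{quote(namespace, safe='')}/datasets"
    url = base_url.rstrip("/") + "/" + path
    # Only an Accept header is set; no Authorization header is added.
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# Assumed base URL and namespace, for illustration only.
request = build_marquez_request("http://localhost:5000", "my-namespace")
print(request.full_url)
print(request.has_header("Authorization"))
```

If the Marquez instance sits behind a proxy that requires credentials, a request shaped like this one would be rejected, which is the situation the collector does not currently cover.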

What is cataloged

The collector catalogs the following information.

Table 1.

Object	Information cataloged
Dataset	Identifier, Title (name), Description, Creation time, Last update time, Namespace
Job	Identifier, Title (name), Description, Creation time, Last update time, Namespace
Job Run	Identifier, Last run time, Last run state (error, completed)
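As a rough illustration of Table 1, the sketch below maps a Marquez-style dataset payload onto the cataloged fields. The JSON keys (`name`, `namespace`, `description`, `createdAt`, `updatedAt`) are assumed to match what a Marquez API returns for a dataset, and the `CatalogDataset` type and the `namespace:name` identifier scheme are hypothetical; the collector's internal representation may differ.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CatalogDataset:
    """Hypothetical record holding the fields listed for a Dataset."""
    identifier: str
    title: str
    description: Optional[str]
    created: Optional[str]
    updated: Optional[str]
    namespace: str


def to_catalog_dataset(payload: Mapping[str, Any]) -> CatalogDataset:
    """Map one dataset payload (assumed key names) to the cataloged fields."""
    namespace = payload["namespace"]
    name = payload["name"]
    return CatalogDataset(
        identifier=f"{namespace}:{name}",  # hypothetical identifier scheme
        title=name,
        description=payload.get("description"),  # optional in the payload
        created=payload.get("createdAt"),
        updated=payload.get("updatedAt"),
        namespace=namespace,
    )


# Sample payload invented for illustration.
sample = {
    "name": "public.orders",
    "namespace": "warehouse",
    "description": "Customer orders",
    "createdAt": "2024-01-01T00:00:00Z",
    "updatedAt": "2024-02-01T00:00:00Z",
}
print(to_catalog_dataset(sample))
```

Jobs carry the same set of fields in Table 1, so the same mapping shape would apply to a job payload.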

Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page	Relationship
Dataset	Data resource that the dataset abstracts (facades); jobs for which the dataset serves as input or output
Job	Datasets serving as inputs/outputs for the job; latest run of the job
Job Run	The job for which this run was an execution
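The relationships in Table 2 are two views of the same facts: each job lists its input and output datasets, and each dataset page shows the inverse. A minimal sketch of deriving the dataset-side view from the job-side view follows; the job and dataset names and the `dataset_job_relationships` helper are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence, Tuple


def dataset_job_relationships(
    jobs: Mapping[str, Mapping[str, Sequence[str]]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Invert job -> datasets into dataset -> [(job, role), ...].

    `role` is "input" when the job reads the dataset and "output"
    when the job writes it.
    """
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for job_name, io in jobs.items():
        for dataset in io.get("inputs", ()):
            index[dataset].append((job_name, "input"))
        for dataset in io.get("outputs", ()):
            index[dataset].append((job_name, "output"))
    # Sort each list so the result is deterministic.
    return {dataset: sorted(links) for dataset, links in index.items()}


# Invented example: two jobs chained through one staging dataset.
jobs = {
    "load_orders": {"inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    "build_mart": {"inputs": ["staging.orders"], "outputs": ["mart.orders"]},
}
for dataset, links in sorted(dataset_job_relationships(jobs).items()):
    print(dataset, links)
```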

Lineage for Marquez

Table 3.

Object	Lineage available
Data resource	Data resources from which this data resource was derived, based on lineage events recorded in Marquez.
Database resource	Upstream tables that contribute to the current resource; upstream columns within those tables that contribute to or are derived from columns in the current resource.
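To make "data resources from which this data resource was derived" concrete, the sketch below walks job inputs and outputs backwards from a target dataset to collect everything upstream of it. It works at the table level only, uses invented names, and is an illustration of the idea rather than the collector's actual lineage algorithm.

```python
from collections import deque
from typing import Dict, Mapping, Sequence, Set


def upstream_datasets(
    target: str,
    jobs: Mapping[str, Mapping[str, Sequence[str]]],
) -> Set[str]:
    """Return every dataset that directly or transitively feeds `target`."""
    # For each dataset, collect the inputs of all jobs that write to it.
    parents: Dict[str, Set[str]] = {}
    for io in jobs.values():
        for output in io.get("outputs", ()):
            parents.setdefault(output, set()).update(io.get("inputs", ()))

    seen: Set[str] = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            # Skip datasets already visited, and the target itself,
            # so cyclic lineage does not loop forever.
            if parent not in seen and parent != target:
                seen.add(parent)
                queue.append(parent)
    return seen


# Invented example: raw -> staging -> mart, with a dimension table joined in.
jobs = {
    "load_orders": {"inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    "build_mart": {
        "inputs": ["staging.orders", "dim.customers"],
        "outputs": ["mart.orders"],
    },
}
print(sorted(upstream_datasets("mart.orders", jobs)))
```

Column-level lineage follows the same pattern with (table, column) pairs as the nodes instead of table names.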

Supported cross-system lineage

The currently supported data sources for cross-system lineage are:

AWS Glue
Amazon S3
Microsoft SQL Server
Postgres
Oracle
MySQL
Teradata
Databricks
