
About the Marquez collector

Warning

This collector is in public preview. It has passed our standard testing, but it is not yet widely adopted. You might encounter unforeseen edge cases in your environment. data.world is committed to promptly addressing any issues with public preview collectors. If you face any problems, please report them through your Customer Success Director, implementation team, or support team for assistance.

Use this collector to harvest metadata for Marquez objects such as datasets, jobs, and job runs. The collector harvests lineage relationships among the data resources represented by datasets and the jobs that move data between them.

Important

The Marquez collector can be run on-premise using Docker or Jar files.

Note

The latest version of the collector is 2.272. See the collector release notes for details on this version and all previous versions.

Marquez versions supported

  • The collector supports Marquez version 0.50.1.

Authentication supported

  • The collector currently harvests from unauthenticated Marquez API instances only.
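Because only unauthenticated instances are supported, it can help to confirm that the Marquez API answers without credentials before running the collector. The following is a minimal sketch, not part of the collector itself. It assumes the Marquez REST API is served under `/api/v1` and that a namespace listing is available at `/api/v1/namespaces`; the base URL `http://localhost:5000` in the usage note is a hypothetical example.

```python
import json
import urllib.request


def build_namespaces_request(base_url: str) -> urllib.request.Request:
    """Build a GET request for the namespace listing with no auth headers."""
    # Assumption: the Marquez REST API is exposed under /api/v1.
    url = base_url.rstrip("/") + "/api/v1/namespaces"
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def is_reachable_without_auth(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the API answers an unauthenticated request with JSON."""
    request = build_namespaces_request(base_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            json.load(response)  # raises ValueError if the body is not JSON
            return response.status == 200
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses, so connection
        # failures and 401/403 responses both end up here.
        return False
```

For example, `is_reachable_without_auth("http://localhost:5000")` would return `False` if the instance is down or rejects the request for lack of credentials.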

What is cataloged

The collector catalogs the following information.

Table 1.

| Object | Information cataloged |
| --- | --- |
| Dataset | Identifier, Title (name), Description, Creation time, Last update time, Namespace |
| Job | Identifier, Title (name), Description, Creation time, Last update time, Namespace |
| Job Run | Identifier, Last run time, Last run state (error, completed) |
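The cataloged fields in Table 1 can be pictured as a small data model. The sketch below is illustrative only and does not reflect the collector's internal mapping. The payload keys (`namespace`, `name`, `description`, `createdAt`, `updatedAt`, `id`, `state`, `endedAt`) are assumptions about the Marquez API response, and the `namespace:name` identifier format and the use of `endedAt` as the last run time are hypothetical choices.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CatalogedResource:
    """Fields that Table 1 lists for both datasets and jobs."""
    identifier: str
    title: str
    description: Optional[str]
    created_at: Optional[str]
    updated_at: Optional[str]
    namespace: str


@dataclass(frozen=True)
class CatalogedJobRun:
    """Fields that Table 1 lists for job runs."""
    identifier: str
    last_run_time: Optional[str]
    last_run_state: Optional[str]


def resource_from_payload(payload: Mapping[str, Any]) -> CatalogedResource:
    """Map a dataset or job payload (assumed key names) to cataloged fields."""
    namespace = payload["namespace"]
    name = payload["name"]
    return CatalogedResource(
        identifier=f"{namespace}:{name}",  # hypothetical identifier format
        title=name,
        description=payload.get("description"),
        created_at=payload.get("createdAt"),
        updated_at=payload.get("updatedAt"),
        namespace=namespace,
    )


def job_run_from_payload(payload: Mapping[str, Any]) -> CatalogedJobRun:
    """Map a run payload (assumed key names) to cataloged fields."""
    return CatalogedJobRun(
        identifier=payload["id"],
        last_run_time=payload.get("endedAt"),  # assumption: end time of the run
        last_run_state=payload.get("state"),   # passed through unchanged
    )
```

Optional fields are read with `.get()` so that a payload without a description or timestamps still maps cleanly, leaving those attributes as `None`.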



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

| Resource page | Relationship |
| --- | --- |
| Dataset | Data resource that the dataset abstracts (facades); Jobs for which the dataset serves as input or output |
| Job | Datasets serving as inputs/outputs for the job; Latest run of the job |
| Job Run | The job for which this run was an execution |



Lineage for Marquez

Table 3.

| Object | Lineage available |
| --- | --- |
| Data resource | Data resources from which this data resource was derived, per a lineage event recorded in Marquez |
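To illustrate what "derived from" means here, the following sketch computes, for each output of a job, the set of inputs it was derived from. It is a simplified illustration under stated assumptions, not the collector's algorithm: each job is assumed to be a mapping with `inputs` and `outputs` lists whose entries carry `namespace` and `name` keys, and the function name `derive_upstreams` is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping, Set, Tuple

DatasetKey = Tuple[str, str]  # (namespace, name)


def derive_upstreams(
    jobs: Iterable[Mapping[str, Any]],
) -> Dict[DatasetKey, Set[DatasetKey]]:
    """For each output dataset, collect the input datasets of the jobs writing it."""
    upstreams: Dict[DatasetKey, Set[DatasetKey]] = defaultdict(set)
    for job in jobs:
        # Assumed shape: inputs/outputs are lists of {"namespace": ..., "name": ...}.
        inputs = [(d["namespace"], d["name"]) for d in job.get("inputs", [])]
        for out in job.get("outputs", []):
            key = (out["namespace"], out["name"])
            # A dataset that a job both reads and writes is not its own upstream.
            upstreams[key].update(i for i in inputs if i != key)
    return dict(upstreams)
```

A job that reads `raw.orders` and writes `mart.orders` yields one entry mapping `mart.orders` to `{raw.orders}`. An output written by a job with no inputs appears with an empty set.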



Supported cross-system lineage

The currently supported data sources for cross-system lineage are:

  • AWS Glue

  • Microsoft SQL Server

  • Postgres

  • Oracle

  • MySQL

  • Teradata