Querying lineage metadata in your data catalog

About data lineage

Data lineage describes the journey that data takes from its origin to its final destination, detailing all transformations, processing steps, and intermediate states it undergoes.

This information allows users to trace back or forward through the data lifecycle, which is crucial for auditing, debugging, and understanding the impact of changes.

How data lineage is defined

Lineage is defined by the predicate prov:wasDerivedFrom and its subproperties. The prov:wasDerivedFrom predicate indicates that a resource was derived from another resource. Subproperties of prov:wasDerivedFrom represent more specific types of derivation, such as how one resource is calculated from another.
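As a rough illustration of what "prov:wasDerivedFrom and its subproperties" covers, the following Python sketch collects every predicate that reaches prov:wasDerivedFrom through a chain of rdfs:subPropertyOf statements. The ex: predicate names are hypothetical and are not predicates defined by the catalog. The in-memory list of triples stands in for the catalog's graph.

```python
# Hypothetical triples: (subject, predicate, object). The "ex:" names are
# illustrative only and are not predicates defined by the catalog.
SUBPROPERTY_OF = "rdfs:subPropertyOf"
ROOT = "prov:wasDerivedFrom"

triples = [
    ("ex:wasCalculatedFrom", SUBPROPERTY_OF, ROOT),
    ("ex:wasAggregatedFrom", SUBPROPERTY_OF, "ex:wasCalculatedFrom"),
    ("ex:wasCopiedFrom", SUBPROPERTY_OF, ROOT),
    ("ex:hasOwner", SUBPROPERTY_OF, "ex:hasContact"),  # not a lineage predicate
]


def lineage_predicates(triples, root=ROOT):
    """Return root plus every predicate that reaches it via rdfs:subPropertyOf."""
    found = {root}
    changed = True
    while changed:  # repeat until no new subproperty is discovered
        changed = False
        for subject, predicate, obj in triples:
            if predicate == SUBPROPERTY_OF and obj in found and subject not in found:
                found.add(subject)
                changed = True
    return found


print(sorted(lineage_predicates(triples)))
```

In this sketch, ex:wasAggregatedFrom counts as a lineage predicate even though it is only an indirect subproperty, while ex:hasOwner is excluded because its chain never reaches prov:wasDerivedFrom.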

These predicates capture the relationships between resources (for example, tables, views, and columns) to provide a comprehensive view of how data moves across your systems.
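To show how these relationships chain together, here is a minimal Python sketch that traces a resource back to everything it was derived from, directly or indirectly. The resource names, the ex: predicates, and the edge list are hypothetical. In practice the edges would come from the catalog's lineage metadata.

```python
from collections import deque

# Hypothetical lineage edges: (target, lineage_predicate, source).
# Resource and predicate names are illustrative only.
edges = [
    ("view:monthly_sales", "prov:wasDerivedFrom", "table:orders"),
    ("view:monthly_sales", "prov:wasDerivedFrom", "table:customers"),
    ("column:report.revenue", "ex:wasCalculatedFrom", "view:monthly_sales"),
    ("table:orders", "ex:wasCopiedFrom", "table:raw_orders"),
]


def upstream(resource, edges):
    """Return every resource the given resource was derived from, directly or indirectly."""
    # Index the edges so each target maps to its direct sources.
    sources_of = {}
    for target, _predicate, source in edges:
        sources_of.setdefault(target, []).append(source)

    seen = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for source in sources_of.get(current, []):
            if source not in seen:  # guard against cycles and repeat visits
                seen.add(source)
                queue.append(source)
    return seen


print(sorted(upstream("column:report.revenue", edges)))
```

Tracing forward (for impact analysis) would follow the same pattern with the index reversed, mapping each source to its targets.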

You can query and export lineage metadata, and you can preview and filter the results before downloading them.

Querying all lineage predicates and their definitions

To identify lineage predicates defined in your environment, you can query for the prov:wasDerivedFrom predicate and its subproperties using SPARQL.

The following query returns each direct subproperty of prov:wasDerivedFrom along with its description (the value of rdfs:comment).

Query:

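PREFIX : <https://<your_org_name>.app.linked.data.world/d/ddw-catalogs/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>

# List each direct subproperty of prov:wasDerivedFrom with its description.
SELECT ?subproperty ?comment
WHERE {
  ?subproperty rdfs:subPropertyOf prov:wasDerivedFrom .
  ?subproperty rdfs:comment ?comment .
}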

Sample results:

querying_lineage_metadata_01.png

Querying all resources associated with lineage metadata

To get a full picture of the lineage relationships between resources in your system, run the following query. For every pair of resources connected by a lineage predicate, it returns the name and type of the target, the lineage predicate, and the name and type of the source.

Query:

PREFIX : <https://<your_org_name>.app.linked.data.world/d/ddw-catalogs/>
PREFIX dwec: <https://dwec.data.world/v0/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?target_name ?target_type ?lineage_predicate ?source_name ?source_type
WHERE {
   {
       # Find the lineage predicates that are actually used in the data:
       # prov:wasDerivedFrom itself or any direct or indirect subproperty.
       SELECT DISTINCT ?lineage_predicate {
           {
               SELECT DISTINCT ?predicate {
                   ?s ?predicate ?o .
               }
           } .
           ?predicate rdfs:subPropertyOf* prov:wasDerivedFrom .
           BIND(?predicate AS ?lineage_predicate)
       }
   } .
   # Match each target and source connected by one of those predicates.
   ?target_iri ?lineage_predicate ?source_iri .
   ?target_iri a ?target_type .
   ?target_iri dct:identifier ?target_name .
   ?source_iri a ?source_type .
   ?source_iri dct:identifier ?source_name .
}

Sample results:

querying_lineage_metadata_02.png
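Once the results are downloaded, they can be filtered and grouped outside the catalog. The following Python sketch assumes the results were exported as CSV with the same column names as the SELECT clause above. The export format, the sample rows, and the ex: type and predicate names are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical export of the query results. The rows are made up, and the
# CSV layout is an assumption about how the results were downloaded.
exported = """target_name,target_type,lineage_predicate,source_name,source_type
monthly_sales,ex:View,prov:wasDerivedFrom,orders,ex:Table
monthly_sales,ex:View,prov:wasDerivedFrom,customers,ex:Table
revenue,ex:Column,ex:wasCalculatedFrom,monthly_sales,ex:View
"""


def sources_by_target(csv_text, source_type=None):
    """Group (predicate, source) pairs by target, optionally keeping one source type."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if source_type is not None and row["source_type"] != source_type:
            continue  # skip rows whose source is not of the requested type
        grouped.setdefault(row["target_name"], []).append(
            (row["lineage_predicate"], row["source_name"])
        )
    return grouped


for target, sources in sources_by_target(exported, source_type="ex:Table").items():
    print(target, sources)
```

With a real export, the inline string would be replaced by the contents of the downloaded file.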