Querying lineage metadata in your data catalog
About data lineage
Data lineage describes the journey that data takes from its origin to its final destination, detailing all transformations, processing steps, and intermediate states it undergoes.
This information allows users to trace back or forward through the data lifecycle, which is crucial for auditing, debugging, and understanding the impact of changes.
How data lineage is defined
Lineage is defined by the predicate prov:wasDerivedFrom and its subproperties. The prov:wasDerivedFrom predicate indicates that a resource was derived from another resource. Subproperties of prov:wasDerivedFrom represent more specific types of derivation, such as how one resource is calculated from another.
These predicates capture the relationships between resources such as tables, views, and columns, giving you a comprehensive view of how data moves across your systems.
The system allows you to query and export lineage metadata. Before downloading, you can preview and filter the data.
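To make the definition concrete, here is a minimal Python sketch that models lineage as (target, predicate, source) triples and walks them in either direction. Only prov:wasDerivedFrom comes from the description above; the resource names, the ex:calculatedFrom subproperty, and the helper functions are hypothetical and exist only for illustration. This is not how the catalog stores or traverses lineage internally.

```python
from collections import deque

# Hypothetical lineage triples: (target, predicate, source).
# ex:calculatedFrom stands in for a more specific subproperty of
# prov:wasDerivedFrom; it is an assumption, not a catalog term.
LINEAGE = [
    ("report_view", "prov:wasDerivedFrom", "orders_clean"),
    ("orders_clean", "prov:wasDerivedFrom", "orders_raw"),
    ("report_view.total", "ex:calculatedFrom", "orders_clean.amount"),
]


def upstream(resource, triples):
    """Return every resource that `resource` is directly or indirectly derived from."""
    sources_of = {}
    for target, _predicate, source in triples:
        sources_of.setdefault(target, []).append(source)

    found = []
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for source in sources_of.get(current, []):
            if source not in found:  # guard against cycles and duplicates
                found.append(source)
                queue.append(source)
    return found


def downstream(resource, triples):
    """Return every resource derived, directly or indirectly, from `resource`."""
    # Flipping each triple turns the impact question into an upstream walk.
    flipped = [(source, predicate, target) for target, predicate, source in triples]
    return upstream(resource, flipped)
```

Calling upstream("report_view", LINEAGE) traces back to the origin of the view, while downstream("orders_raw", LINEAGE) lists what would be affected by a change to the raw table.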
Querying all lineage predicates and their definitions
To identify lineage predicates defined in your environment, you can query for the prov:wasDerivedFrom predicate and its subproperties using SPARQL.
The following query returns each direct subproperty of prov:wasDerivedFrom together with its rdfs:comment description.
Query:
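PREFIX : <https://<your_org_name>.app.linked.data.world/d/ddw-catalogs/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?subproperty ?comment
WHERE {
    # Predicates declared as direct subproperties of prov:wasDerivedFrom
    ?subproperty rdfs:subPropertyOf prov:wasDerivedFrom .
    # Human-readable description of each predicate
    ?subproperty rdfs:comment ?comment
}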
Querying all resources associated with lineage metadata
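To get a full picture of the lineage relationships between resources in your system, the following query returns one row per lineage relationship: the name and type of the target resource, the lineage predicate, and the name and type of the source resource. Because the query uses the property path rdfs:subPropertyOf*, it matches prov:wasDerivedFrom itself as well as its direct and indirect subproperties.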
Query:
PREFIX : <https://<your_org_name>.app.linked.data.world/d/ddw-catalogs/>
PREFIX dwec: <https://dwec.data.world/v0/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?target_name ?target_type ?lineage_predicate ?source_name ?source_type
WHERE {
    # Lineage predicates actually used in the data: prov:wasDerivedFrom
    # and anything that is a subproperty of it, at any depth
    {
        SELECT DISTINCT ?lineage_predicate {
            { SELECT DISTINCT ?predicate { ?s ?predicate ?o . } } .
            ?predicate rdfs:subPropertyOf* prov:wasDerivedFrom .
            BIND(?predicate AS ?lineage_predicate)
        }
    } .
    # Every target/source pair linked by one of those predicates
    ?target_iri ?lineage_predicate ?source_iri .
    ?target_iri a ?target_type .
    ?target_iri dct:identifier ?target_name .
    ?source_iri a ?source_type .
    ?source_iri dct:identifier ?source_name .
}
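The logic of this query can be sketched in plain Python over a small in-memory set of triples. This is only an illustration of what the query computes, not a replacement for running SPARQL against the catalog: the ex: resources, the ex:calculatedFrom predicate, and the function names are all hypothetical.

```python
from itertools import product

ROOT = "prov:wasDerivedFrom"

# Hypothetical triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("ex:calculatedFrom", "rdfs:subPropertyOf", ROOT),
    ("ex:view1", ROOT, "ex:table1"),
    ("ex:col2", "ex:calculatedFrom", "ex:col1"),
    ("ex:view1", "rdf:type", "ex:View"),
    ("ex:table1", "rdf:type", "ex:Table"),
    ("ex:col1", "rdf:type", "ex:Column"),
    ("ex:col2", "rdf:type", "ex:Column"),
    ("ex:view1", "dct:identifier", "sales_view"),
    ("ex:table1", "dct:identifier", "sales"),
    ("ex:col1", "dct:identifier", "price"),
    ("ex:col2", "dct:identifier", "margin"),
]


def is_lineage_predicate(predicate, triples):
    """Mimic `?predicate rdfs:subPropertyOf* prov:wasDerivedFrom`.

    The * means zero or more steps, so ROOT itself also qualifies.
    """
    seen = set()
    frontier = [predicate]
    while frontier:
        current = frontier.pop()
        if current == ROOT:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(
            o for s, p, o in triples
            if s == current and p == "rdfs:subPropertyOf"
        )
    return False


def lineage_rows(triples):
    """Return (target_name, target_type, predicate, source_name, source_type) rows."""
    # Inner sub-selects: distinct predicates in use, filtered to lineage ones.
    used = {p for _s, p, _o in triples}
    lineage = {p for p in used if is_lineage_predicate(p, triples)}

    # Lookups for the `a` (rdf:type) and dct:identifier patterns.
    types, names = {}, {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, []).append(o)
        elif p == "dct:identifier":
            names.setdefault(s, []).append(o)

    rows = []
    for target, predicate, source in triples:
        if predicate not in lineage:
            continue
        # A resource missing a type or identifier yields no row,
        # just as the non-optional patterns in the query would.
        for tn, tt, sn, st in product(
            names.get(target, []), types.get(target, []),
            names.get(source, []), types.get(source, []),
        ):
            rows.append((tn, tt, predicate, sn, st))
    return rows
```

One consequence worth noting: because every pattern in the query is required, a resource that lacks either a type or a dct:identifier is silently absent from the results, which the sketch reproduces.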