About the Apache Airflow collector

Warning

This collector is in public preview. It has passed our standard testing, but it is not yet widely adopted. You might encounter unforeseen edge cases in your environment. data.world is committed to promptly addressing any issues with public preview collectors. If you face any problems, please report them through your Customer Success Director, implementation team, or support team for assistance.

Use this collector to:

  • Discover objects (such as DAG, DAG Run, Dataset, Task, and Task Instance) in your Apache Airflow instance.

  • Perform impact analysis to understand how changes to upstream data sources impact Airflow objects.
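
Impact analysis of this kind amounts to a downstream traversal of a lineage graph. The following is a minimal sketch of that idea, not the collector's implementation: the edge map, the resource names, and the function `downstream_impact` are all hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the resources listed as its value.
# Real edges would come from the cataloged metadata, not from a literal dict.
EDGES = {
    "dataset:s3://raw/orders": ["dag:clean_orders"],
    "dag:clean_orders": ["dataset:s3://clean/orders"],
    "dataset:s3://clean/orders": ["dag:report_orders"],
    "dag:report_orders": [],
}


def downstream_impact(start, edges):
    """Return every resource reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `downstream_impact("dataset:s3://raw/orders", EDGES)` lists the DAGs and datasets that a change to the raw dataset could affect. The `seen` set keeps the traversal from revisiting a resource that is reachable along more than one path.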

Important

The Apache Airflow collector can be run on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.258. To view the release notes for this version and all previous versions, see the collector release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

Object

Information cataloged

Directed Acyclic Graph (DAG)

ID, Display name, Is paused, Is active, Is subdag, File path, Owners, Description, Tags, Timetable description, Last parsed time, Last pickled, Last expired, Max active runs, Max active tasks, Has task concurrency limits, Has import errors, Next dag run, Next dag run data interval start, Next dag run data interval end, Next dag run create after, Max consecutive failed dag runs, Source code

Airflow Task

ID, Display name, Operator name, Owner, Trigger rule, Priority weight, Weight rule, Depends on past, Is mapped, Wait for downstream, Template fields

DAG Run

Title, Dag run ID, Dag ID, Logical time/execution date, Start time, End time, Data interval start, Data interval end, Last scheduling decision, Run type, State, External trigger, Run configuration, Note

Dataset

ID, URI, Extra, Created at, Updated at

Airflow Task Instance

Title, Task ID, Task display name, Dag ID, Dag run ID, Execution date, Start date, End date, Duration, Try number, Map index, Hostname, Unixname, Priority weight, Queued when, Trigger ID, Trigger class path, Triggerer job ID, Triggerer job state, Triggerer job type, Triggerer job start date, Triggerer job end date, Triggerer job executor class, Note



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

Directed Acyclic Graph (DAG)

Contains Airflow Tasks, Has sub DAG, Consumes Dataset

Airflow Task

Has downstream Airflow Task

DAG Run

Associated with DAG

Dataset

Was produced by DAG

Airflow Task Instance

Run within DAG Run, Associated with Task



Lineage for Apache Airflow

The Airflow collector gathers the following lineage information. It identifies the source datasets from which destination datasets derive their data through DAG Run activities.

Table 3.

Object

Lineage available

Dataset

  • Dataset that a DAG Run sources its data from

  • Dataset that a DAG Run writes its data to
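
To make the source-to-destination idea concrete, here is a small sketch of how dataset lineage pairs could be derived from the datasets each DAG reads and writes. It is an illustration under stated assumptions, not how the collector works internally: the `DagLineage` record, its fields, and the `dataset_lineage` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DagLineage:
    """Hypothetical record of the datasets one DAG reads and writes."""
    dag_id: str
    consumes: list = field(default_factory=list)
    produces: list = field(default_factory=list)


def dataset_lineage(dags):
    """Return sorted (source, destination, dag_id) triples.

    Every dataset a DAG consumes is treated as a source of every
    dataset that the same DAG produces.
    """
    triples = set()
    for dag in dags:
        for src in dag.consumes:
            for dst in dag.produces:
                triples.add((src, dst, dag.dag_id))
    return sorted(triples)
```

A DAG that consumes a dataset but produces none contributes no lineage triple in this sketch, because there is no destination dataset to link the source to.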



Apache Airflow version supported

  • The collector supports Apache Airflow version 2.10.4.

Authentication supported