About the Azure Data Factory Collector

Important

The collector can be run in the cloud or on-premises using Docker or JAR files.

Azure Data Factory (ADF) lets users collect, transform, and move data. Use this collector to harvest metadata from ADF, including details on pipelines, datasets, dataflows, linked services, triggers, integration runtimes, and global parameters. The collector also gathers lineage information between ADF datasets and between ADF and external sources such as Snowflake.

Note

The latest version of the Collector is 2.247. To view the release notes for this version and all previous versions, please go here.

What is cataloged

The collector catalogs the following information from Azure Data Factory.

Table 1.

Object

Information collected

Factory 

  • ID, Name, ETag, Location, Create Time, Provisioning State, Version, Public Network Access, Factory Tags, Repository configuration (Account name, Collaboration Branch, Repository Name, Disable Publish, Root Folder, Host Name, Client ID, Project Name, Last Commit ID, Tenant ID, Repo Configuration Type).

Pipeline 

  • ID, Name, Description, Etag, Concurrency, Folder, Parameters, Metric Policy Duration, Variables

Pipeline Activity 

  • Name, Description, Type, Inactivity Status, State, User Properties, Activity Policy (Retry, Timeout, Retry Interval In Secs, Secure Input, Secure Output)

Linked Service 

  • ID, Name, Description, Type, Etag, Connection String, Domain, Parameters

    Note: Harvesting of Connection String for SFTP Linked Services is not supported.

Dataset 

  • ID, Name, Etag, Type, Database, Schema, Table, Folder, Container, File Name, Parameters

Dataflow 

  • ID, Name, Etag, Type, Description, Folder

Trigger 

  • ID, Name, Etag, Type, State, Description, Frequency, Interval, Start time, End time

Integration Runtime 

  • ID, Etag, Name, Type, Description, State, Compute Properties (Node Size, Number of Nodes, Max Parallel Execution Per Node, Core Count, Compute Type, Clean up, Number of External Nodes, Number of Pipeline Nodes), SSIS Properties (Catalog Server Endpoint, Catalog Admin Username, Catalog Pricing Tier, License Type, Dual Standby Pair Name, Edition)

Global Parameter 

  • ID, Name, Value, Type

ADF Table 

  • ID, Name

ADF Column 

  • ID, Name, Type, Precision, Scale



Relationships between objects

By default, the data.world catalog includes catalog pages for the resource types below. Each catalog page has relationships to other related resource types. The catalog presentation and relationships are fully configurable, so the table below lists only the default configuration.

Table 2.

Resource page

Relationship

Factory

  • Contains Global Parameter, Contains Pipeline, Contains Dataset, Contains Dataflow, Contains Trigger, Contains Integration Runtime

Pipeline

  • Has Tag (also known as Annotation), Contains Activity

Activity

  • Belongs to Pipeline, Contains Activity, Depends on Activity, Uses Linked Service, Uses Integration Runtime, Uses Dataset

Linked Service

  • Uses Integration Runtime, Has Tag (also known as Annotation)

Dataset

  • Uses Linked Service, Has Tabular Datasource, Has Tag (also known as Annotation)

Dataflow

  • Uses Dataflow, Imports Data From Linked Service, Exports Data From Linked Service, Imports Data From Dataset, Exports Data From Dataset, Has Tag (also known as Annotation)

Integration Runtime

  • Uses Integration Runtime, Uses Linked Service

Trigger

  • Triggers Pipeline, Has Tag (also known as Annotation)



Lineage for Azure Data Factory

The following lineage information is collected by the Azure Data Factory collector.

Table 3.

Object

Lineage available

Dataset

The collector identifies the source of the dataset:

  • when the source is Snowflake, Databricks, PostgreSQL, MySQL, Oracle, Teradata, DB2, or SQL Server.

  • when there is a Copy Activity Run copying data between two datasets.

ADF table

  • The collector identifies the upstream table from which the data is sourced.

ADF column

  • The collector identifies the upstream column from which the data is sourced.
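
As an illustration of the Copy Activity case in the table above, the following minimal sketch derives dataset-to-dataset lineage edges from a pipeline definition. It assumes the JSON shape that ADF uses for Copy activities (an `inputs` and an `outputs` list of dataset references). The function name `copy_lineage_edges` and the sample pipeline are hypothetical, and the sketch reads a static pipeline definition rather than activity runs, so it is not the collector's actual implementation.

```python
from typing import Iterator


def copy_lineage_edges(pipeline: dict) -> Iterator[tuple[str, str]]:
    """Yield (source_dataset, target_dataset) pairs for each Copy activity."""
    activities = pipeline.get("properties", {}).get("activities", [])
    for activity in activities:
        # Only Copy activities move data between two datasets.
        if activity.get("type") != "Copy":
            continue
        sources = [ref["referenceName"] for ref in activity.get("inputs", [])]
        targets = [ref["referenceName"] for ref in activity.get("outputs", [])]
        for source in sources:
            for target in targets:
                yield (source, target)


# Hypothetical pipeline definition, used only for illustration.
sample_pipeline = {
    "name": "CopySales",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesBlob", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesSql", "type": "DatasetReference"}],
            },
            {"name": "Pause", "type": "Wait"},
        ]
    },
}

edges = list(copy_lineage_edges(sample_pipeline))
```

Here `edges` holds one pair linking the hypothetical `SalesBlob` dataset to `SalesSql`; the `Wait` activity contributes nothing because it has no datasets.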



Supported cross-system lineage

The currently supported data sources for cross-system lineage are:

  • Snowflake

  • Databricks

    Important

    While other data sources are not formally supported, running the collector for those sources may still enable you to view cross-system lineage between Azure Data Factory and these sources.

Authentication supported

  • Authenticate to Azure Data Factory using a service principal.
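
For context on what service principal authentication involves, the sketch below builds (but does not send) an OAuth 2.0 client-credentials token request for the Azure Resource Manager scope. It is an illustration under assumptions, not the collector's own configuration: the function name `build_token_request` and the placeholder values are hypothetical, and this section does not show the collector's actual parameters.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> tuple[str, str]:
    """Return the token endpoint URL and form-encoded body for a service principal."""
    # Microsoft identity platform v2.0 token endpoint for the given tenant.
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # Scope for the Azure Resource Manager API.
            "scope": "https://management.azure.com/.default",
        }
    )
    return url, body


# Placeholder values; real ones come from the service principal's app registration.
token_url, token_body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
```

Sending `token_body` as an `application/x-www-form-urlencoded` POST to `token_url` would return an access token if the service principal is valid; the three inputs are the tenant ID, the application (client) ID, and the client secret.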