About the Databricks collector

Use this collector to harvest metadata from data assets in the Databricks Hive metastore, Unity Catalog (including Delta Lake), Workflows, and Notebooks, and make it searchable and discoverable in data.world.

Important

The Databricks collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.189. For details on this version and all previous versions, see the collector release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

| Object | Information cataloged |
|---|---|
| Columns | Name, Description, JDBC type, Column Type, Is Nullable, Default Value, Column size, Column index |
| Table | Name, Description, Schema, Primary key, Foreign key, Owner, Type, Creation, Last Modified, Location, Provider, Version, Size, File Count, Partition Columns, Properties |
| Views | Name, Description, SQL definition |
| Schema | Name |
| Database | Type, Name, Server, Port, Environment, JDBC URL |
| Notebook | Notebook ID, Path, Language Type (SQL, Python, Scala, R) |
| Function | Name, Description, Function Type |
| Job | Title, Description, Creator, Created At, Job run as, Format, Max Concurrent Runs, Notification On Start, Timeouts (sec), Notification On Success, Schedule, Git Source, Notification on Failure, Job Tags, List of tasks, List of clusters |
| Cluster | Name, Description, Node Type ID, Driver Node Type ID, Spark Version, Number of Workers, Autoscale Max Workers, Autoscale Min Workers, AWS Attributes, Cluster Tags |
| Task | Task Key, Type of Task (Notebook, dbt, Spark jar, Python script, Python wheel, Pipeline task, SQL), Task timeout, Retry interval, Cluster used by the task, Max retries, Depends on, Libraries, Notifications (On start, On success, On failure), Notebook File Path, Notebook Source, Notebook Parameters, Spark Jar Main Class Name, Spark Jar Parameters, Python Script File path, Python Script Parameters, Spark Submit Parameters, Pipeline ID, Pipeline Full Refresh, Python Wheel Package Name, Python Wheel Entry Point, Python Wheel Parameters, SQL Warehouse, SQL Query ID, SQL Dashboard ID, SQL Alert ID, Dbt Project Directory, Dbt Profiles Directory, Dbt warehouse, Dbt catalog, Dbt schema, Dbt commands |
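
The task types listed for the Task object correspond to how a task is declared in a Databricks job definition. As a rough illustration only, the sketch below infers a task's type from which `*_task` key is present in a Jobs API-style task object. This page does not describe how the collector itself derives the type, so the mapping, the `task_type` helper, and the sample task are all assumptions made for the example.

```python
# Assumed mapping from Jobs API-style task keys to the task types
# listed for the Task object. This is not the collector's own logic.
TASK_TYPE_KEYS = {
    "notebook_task": "Notebook",
    "dbt_task": "dbt",
    "spark_jar_task": "Spark jar",
    "spark_python_task": "Python script",
    "python_wheel_task": "Python wheel",
    "pipeline_task": "Pipeline task",
    "sql_task": "SQL",
}


def task_type(task: dict) -> str:
    """Return the task type implied by the first recognized *_task key."""
    for key, label in TASK_TYPE_KEYS.items():
        if key in task:
            return label
    return "Unknown"


# Hypothetical task definition, for illustration only.
sample_task = {
    "task_key": "load_orders",
    "notebook_task": {"notebook_path": "/Shared/load_orders", "source": "WORKSPACE"},
    "max_retries": 2,
    "timeout_seconds": 600,
}

print(task_type(sample_task))
```

A lookup table keeps the example easy to extend if further task types need to be recognized.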



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 3.

Relationship page

Relationship

Table

  • Columns contained in Table

Schema

  • Database that contains Schema

  • Table that is part of Schema

Database

  • Schema contained in Database

Columns

  • Table containing Column

Job

  • Clusters used by tasks in Job

  • Tasks contained within Job

Cluster

  • Cluster Tag referenced by Cluster

  • Cluster contained in Job

  • Task using Cluster

Task

  • Job containing Task

  • Cluster used by Task

  • Tasks depending on Task

Notebook

  • Folder containing Notebook

  • Task sourcing data from Notebook

Folder

  • Folders contained in Folder

  • Notebooks contained in Folder

Job Tag

  • Jobs containing Job Tag

Cluster Tag

  • Clusters containing Cluster Tag
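
To make the Job, Task, and Cluster relationships above more concrete, here is a minimal in-memory sketch. It is not the collector's data model: the `Job` and `Task` classes, the helper functions, and the sample job are hypothetical and exist only to show how relationships such as "Clusters used by tasks in Job" and "Tasks depending on Task" can be derived from task-level attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    task_key: str
    cluster: str  # name of the cluster the task runs on
    depends_on: List[str] = field(default_factory=list)  # upstream task keys


@dataclass
class Job:
    title: str
    tasks: List[Task]


def clusters_used_by_job(job: Job) -> List[str]:
    """'Clusters used by tasks in Job': distinct cluster names, sorted."""
    return sorted({t.cluster for t in job.tasks})


def tasks_depending_on(job: Job, task_key: str) -> List[str]:
    """'Tasks depending on Task': keys of tasks that list task_key upstream."""
    return [t.task_key for t in job.tasks if task_key in t.depends_on]


# Hypothetical job, for illustration only.
job = Job(
    title="nightly_etl",
    tasks=[
        Task("extract", cluster="etl-small"),
        Task("transform", cluster="etl-large", depends_on=["extract"]),
        Task("publish", cluster="etl-small", depends_on=["transform"]),
    ],
)

print(clusters_used_by_job(job))
print(tasks_depending_on(job, "extract"))
```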



Lineage for Databricks

The following lineage information is collected by the Databricks collector.

Table 4.

Object

Lineage available

Column in view

For both Hive metastore and Unity Catalog, the collector identifies the columns in upstream views or tables that:

  • Supply the data for the column

  • Sort the rows via ORDER BY

  • Filter the rows via WHERE/HAVING

  • Aggregate the rows via GROUP BY

Notebook

Tasks that reference the Notebook (only if Databricks Unity Catalog is enabled).

Table

The collector identifies the upstream and downstream tables, along with the intermediate Job (only if Databricks Unity Catalog is enabled).
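
As an illustration of the "Column in view" lineage described above, the sketch below pairs a small view definition with a hand-annotated list of the upstream columns that play each role (source, sort, filter, aggregate). The view, the table and column names, and the `upstream_columns` helper are hypothetical. The annotations are written by hand for this example and are not output produced by the collector.

```python
# Hypothetical view definition, for illustration only.
VIEW_SQL = """
CREATE VIEW sales.region_totals AS
SELECT region, SUM(amount) AS total_amount
FROM sales.orders
WHERE status = 'shipped'
GROUP BY region
ORDER BY region
"""

# Hand-annotated upstream columns for each lineage role.
LINEAGE_BY_ROLE = {
    "source": ["sales.orders.amount", "sales.orders.region"],  # SELECT list
    "sort": ["sales.orders.region"],                           # ORDER BY
    "filter": ["sales.orders.status"],                         # WHERE/HAVING
    "aggregate": ["sales.orders.region"],                      # GROUP BY
}


def upstream_columns(lineage: dict) -> list:
    """Return every distinct upstream column across all roles, sorted."""
    return sorted({col for cols in lineage.values() for col in cols})


print(upstream_columns(LINEAGE_BY_ROLE))
```

Note that `status` never appears in the view's output, yet it still counts as an upstream column because it filters the rows.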