About the Databricks collector

Use this collector to harvest metadata from data assets in Databricks Hive Metadata, Unity Catalog (including Delta Lake), Workflows, and Notebooks, and make it searchable and discoverable in data.world.

Important

The Databricks collector can be run in the Cloud or on-premise using Docker or Jar files.

Note

The latest version of the Collector is 2.292. To view the release notes for this version and all previous versions, please go here.

What is cataloged

The collector catalogs the following information.

Table 1.

Object	Information cataloged
Columns	Name, Description, JDBC type, Column Type, Is Nullable, Default Value, Column size, Column index Extended metadata: Tags Note: Deprecated columns and any lineage related to these deprecated columns are not cataloged.
Table	Name, Description, Schema, Primary key, Foreign key Extended metadata: Tags, Owner, Type, Creation date, Last Modified, Location, Provider, Version, Size, File Count, Partition Columns, Properties
Views	Name, Description, SQL definition, Tags
Schema	Name Extended metadata: Tags
Database	Type, Name, Server, Port, Environment, JDBC URL Extended metadata: Tags
Notebook	Notebook ID, Path, Language Type (SQL, Python, Scala, R)
Function	Name, Description, Function Type
Job	Title, Description, Creator, Created At, Job run as, Format, Max Concurrent Runs, Notification On Start, Timeouts (sec), Notification On Success, Schedule, Git Source, Notification on Failure, Tags, List of tasks, List of clusters
Cluster	Name, Description, Node Type ID, Driver Node Type ID, Spark Version, Number of Workers, Autoscale Max Workers, Autoscale Min Workers, AWS Attributes, Tags
Task	Task Key, Type of Task (Notebook, dbt, Spark jar, Python script, Python wheel, Pipeline task, SQL), Task timeout, Retry interval, Cluster used by the task, Max retries, Depends on, Libraries, Notifications (On start, On success, On failure), Notebook File Path, Notebook Source, Notebook Parameters, Spark Jar Main Class Name, Spark Jar Parameters, Python Script File path, Python Script Parameters, Spark Submit Parameters, Pipeline ID, Pipeline Full Refresh, Python Wheel Package Name, Python Wheel Entry Point, Python Wheel Parameters, SQL Warehouse, SQL Query ID, SQL Dashboard ID, SQL Alert ID, Dbt Project Directory, Dbt Profiles Directory, Dbt warehouse, Dbt catalog, Dbt schema, Dbt commands
External Location	Name, External URL, Description, Data Source Type, Created Date, Created By, Owner
Storage Credential	Name, Description, Credential, Created Date, Created By, Owner

Profiling and sampling specific information

If you include the profiling and sampling specific parameters while running the collector, the following additional information is harvested for Columns.

Important

The user/role must have read access to data to be able to harvest profiling information (column statistics).

Table 2.

Object	Information cataloged
Column	Average Length (sample) Average Value (sample) Data Distribution Distinct Values Estimated Distinct Values Estimated Non-null Values Maximum Length (sample) Maximum Value (sample) sorted numerically or alphabetically (z-a) Minimum Length (sample) Minimum Value (sample) sorted numerically or alphabetically (a-z) Non-null Values (sample) Sample String Values (first 5 items in a column)
Table	Row Count Sample Count (Target sample size)

Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 3.

Relationship page	Relationship
Table	Columns contained in Table
Schema	Database that contains Schema Table that is part of Schema
Database	Schema contained in Database
Columns	Table containing Column
Job	Clusters used by tasks in Job Tasks contained within Job
Cluster	Cluster contained in job Task using Cluster
Task	Job containing Task Cluster used by Task Tasks depending on Task
Notebook	Folder containing Notebook Task sourcing data from Notebook
Folder	Folders contained in Folder Notebooks contained in Folder
External Location	Uses storage credential Connects to datasource (S3 bucket, S3 Object, Azure container or Azure blob)
Storage Credential	Used by External Location

Lineage for Databricks

The following lineage information is collected by the Databricks collector.

Note

Any lineage for SQL Statements defined via variable statements are not supported.

Table 4.

Object	Lineage available
Column in view	The collector identifies the associated column in an upstream view or table for both Hive metastore and Unity Catalog: Where the data is sourced from That sort the rows via ORDER BY That filter the rows via WHERE/HAVING That aggregate the rows via GROUP BY Note: Deprecated columns and any lineage related to these deprecated columns are not cataloged.
Notebook	Tasks that reference Notebook. (Only if Databricks Unity Catalog is enabled).
Table	The collector identifies the upstream and downstream tables and their external locations (S3 and ADLS Gen2 resources) along with the intermediate Job. The collector harvests the Databricks table lineage to S3 object.

Object

Lineage available

Column in view

The collector identifies the associated column in an upstream view or table for both Hive metastore and Unity Catalog:

Where the data is sourced from
That sort the rows via ORDER BY
That filter the rows via WHERE/HAVING
That aggregate the rows via GROUP BY

Note: Deprecated columns and any lineage related to these deprecated columns are not cataloged.

Notebook

Tasks that reference Notebook. (Only if Databricks Unity Catalog is enabled).

Table

The collector identifies the upstream and downstream tables and their external locations (S3 and ADLS Gen2 resources) along with the intermediate Job.
The collector harvests the Databricks table lineage to S3 object.

Authentication supported

The Databricks collector supports personal access token authentication and Oauth service principal authentication.
Warning
Support for Databricks-managed password authentication is discontinued on July 8, 2024. If you used this method of authentication, you must change the authentication to personal access token.

In this section:

About the Databricks collector

Important

Note

What is cataloged

Important

Relationships between objects

Lineage for Databricks

Important

Note

Authentication supported

Warning

Search results