About the Databricks collector
Use this collector to harvest metadata from data assets in the Databricks Hive metastore, Unity Catalog (including Delta Lake), Workflows, and Notebooks, and make it searchable and discoverable in data.world.
Important
The Databricks collector can be run in the cloud or on-premises using Docker or JAR files.
Note
The latest version of the collector is 2.235. For details on this version and all previous versions, see the collector release notes.
What is cataloged
The collector catalogs the following information.
Object | Information cataloged |
---|---|
Columns | Name, Description, JDBC type, Column Type, Is Nullable, Default Value, Column size, Column index. Extended metadata: Tags. Note: Deprecated columns and any lineage related to them are not cataloged. |
Table | Name, Description, Schema, Primary key, Foreign key. Extended metadata: Tags, Owner, Type, Creation date, Last Modified, Location, Provider, Version, Size, File Count, Partition Columns, Properties |
Views | Name, Description, SQL definition, Tags |
Schema | Name. Extended metadata: Tags |
Database | Type, Name, Server, Port, Environment, JDBC URL. Extended metadata: Tags |
Notebook | Notebook ID, Path, Language Type (SQL, Python, Scala, R) |
Function | Name, Description, Function Type |
Job | Title, Description, Creator, Created At, Job run as, Format, Max Concurrent Runs, Notification On Start, Timeouts (sec), Notification On Success, Schedule, Git Source, Notification on Failure, Tags, List of tasks, List of clusters |
Cluster | Name, Description, Node Type ID, Driver Node Type ID, Spark Version, Number of Workers, Autoscale Max Workers, Autoscale Min Workers, AWS Attributes, Tags |
Task | Task Key, Type of Task (Notebook, dbt, Spark jar, Python script, Python wheel, Pipeline task, SQL), Task timeout, Retry interval, Cluster used by the task, Max retries, Depends on, Libraries, Notifications (On start, On success, On failure), Notebook File Path, Notebook Source, Notebook Parameters, Spark Jar Main Class Name, Spark Jar Parameters, Python Script File path, Python Script Parameters, Spark Submit Parameters, Pipeline ID, Pipeline Full Refresh, Python Wheel Package Name, Python Wheel Entry Point, Python Wheel Parameters, SQL Warehouse, SQL Query ID, SQL Dashboard ID, SQL Alert ID, Dbt Project Directory, Dbt Profiles Directory, Dbt warehouse, Dbt catalog, Dbt schema, Dbt commands |
External Location | Name, External URL, Description, Data Source Type, Created Date, Created By, Owner |
Storage Credential | Name, Description, Credential, Created Date, Created By, Owner |
Relationships between objects
By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.
Relationship page | Relationship |
---|---|
Table | |
Schema | |
Database | |
Columns | |
Job | |
Cluster | |
Task | |
Notebook | |
Folder | |
External Location | |
Storage Credential | |
Lineage for Databricks
The following lineage information is collected by the Databricks collector.
Note
Lineage for SQL statements defined through variable statements is not supported.
Object | Lineage available |
---|---|
Column in view | The collector identifies the associated column in an upstream view or table for both Hive metastore and Unity Catalog. Note: Deprecated columns and any lineage related to them are not cataloged. |
Notebook | Tasks that reference the notebook (only if Databricks Unity Catalog is enabled). |
Table | The collector identifies the upstream and downstream tables and their external locations (S3 and ADLS Gen2 resources) along with the intermediate Job. |
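To make the table-level lineage described above more concrete, the following is a minimal sketch of how upstream and downstream relationships with an intermediate job could be represented and queried. It is not the collector's internal data model; the edge list, table names, job names, and the `build_index` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream resource, intermediate job, downstream table).
# An upstream resource can be a table or an external location such as an S3 path.
edges = [
    ("s3://example-bucket/raw/orders", "ingest_orders", "main.sales.orders_raw"),
    ("main.sales.orders_raw", "clean_orders", "main.sales.orders"),
    ("main.sales.orders", "daily_summary", "main.sales.orders_daily"),
]


def build_index(edges):
    """Index edges so each resource maps to its upstream and downstream neighbors."""
    upstream = defaultdict(list)
    downstream = defaultdict(list)
    for source, job, target in edges:
        downstream[source].append((job, target))
        upstream[target].append((job, source))
    return upstream, downstream


upstream, downstream = build_index(edges)

# Each entry pairs the intermediate job with the related resource.
print(upstream["main.sales.orders"])
print(downstream["main.sales.orders"])
```

In this sketch, looking up a table in `upstream` or `downstream` returns the neighboring resources together with the job that connects them, which mirrors how the lineage view links tables through jobs.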
Authentication supported
The Databricks collector supports personal access token authentication.
Warning
Support for Databricks-managed password authentication was discontinued on July 8, 2024. If you used this authentication method, you must switch to personal access token authentication.
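Before configuring the collector, it can be useful to confirm that a personal access token is accepted by the workspace. The sketch below builds a Databricks REST API request that passes the token as a bearer token in the `Authorization` header. It is not part of the collector and does not show collector parameters; the workspace URL and token are placeholder values, and `build_databricks_request` is a hypothetical helper. The request is built but not sent.

```python
import urllib.request


def build_databricks_request(workspace_url: str, token: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a GET request authenticated with a personal access token."""
    url = workspace_url.rstrip("/") + "/" + path.lstrip("/")
    # Databricks REST APIs accept a personal access token as a bearer token.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values; replace with your own workspace URL and token.
request = build_databricks_request(
    "https://example.cloud.databricks.com",
    "dapiEXAMPLETOKEN",
    "/api/2.1/unity-catalog/catalogs",
)

print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` against a real workspace should return the list of Unity Catalog catalogs if the token is valid and has sufficient permissions.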