Preparing to run the Databricks collector

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item	Requirement
Hardware (for on-premise runs only)
Note: These specifications assume that one collector process runs at a time. Increase the hardware if you run multiple collectors at the same time.
RAM	8 GB
CPU	2 GHz processor
Software (for on-premise runs only): Docker or Java Runtime Environment
Docker	Click here to get Docker.
Java Runtime Environment	OpenJDK 17 is supported and available here.
data.world specific objects (for both cloud and on-premise runs)
Dataset	You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.
Network connection
Allowlist IPs and domains	Follow these instructions to configure your network. Use these tools to check network connections before running the collector.
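
For on-premise runs, a quick way to confirm the software requirement is to check whether Docker or Java is on the PATH. The following is a minimal sketch; the function names are made up for this example, and it only checks that an executable exists. It does not verify that the Java version is OpenJDK 17.

```python
import shutil


def available_runtimes(which=shutil.which):
    """Return the collector runtimes found on PATH, in the order checked.

    `which` is injectable so the check can be exercised without touching
    the real PATH; by default it is shutil.which.
    """
    candidates = ("docker", "java")
    return [name for name in candidates if which(name) is not None]


def has_supported_runtime(which=shutil.which):
    """True if at least one of Docker or Java is available."""
    return bool(available_runtimes(which))
```

Calling `has_supported_runtime()` with no arguments inspects the current machine.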

Preparing Databricks for collectors

Sizing considerations for the collector

Are serverless Databricks SQL Warehouses supported?
Yes, serverless Databricks SQL Warehouses can be used for metadata collection.

Are there any Databricks compute requirements?
There are no specific compute requirements for running the Databricks collector, because the load depends largely on the workload and data volume. You can start with a General Purpose cluster. The smallest cluster configuration often suffices; adjust it based on actual performance and workload needs.

Does the collector workload benefit from additional workers on the Spark cluster?
The Databricks collector itself is not highly compute-intensive, so adding workers to the Spark cluster specifically for the collector is unlikely to yield significant benefits.

What driver and worker node memory are recommended?
While the Databricks collector does involve some memory-intensive tasks, there is no one-size-fits-all recommendation for memory size. Start with the minimum available configuration and monitor performance. If memory issues arise, increase the memory size incrementally.

What is the recommended node instance type and number of cores?
The instance type and number of cores required for the Databricks collector depend on the scale of the data being cataloged. For small Databricks instances, starting with the smallest configuration available, typically 8 GB of memory and 2 cores, is sufficient. Adjustments should be made based on specific performance and data volume considerations.

Generating personal access token

Important

Configure this if you use personal access token authentication for Databricks.

To generate a personal access token:

In the Databricks workspace, click your username in the top bar. Select Settings from the drop-down.
Click Developer.
Next to Access tokens, click Manage.
Click Generate New Token.
Enter a comment that helps you identify the token, and change the token’s lifetime as required.
To create a token with the maximum lifetime of 730 days, leave the lifetime box empty. Click Generate.
Copy the displayed token and click Done. Save this token for future use.
Alternatively, you can use the token API to generate a PAT.
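
As a rough illustration of the token API route, the sketch below builds a request to the Databricks Token API endpoint `/api/2.0/token/create`. The helper name `build_create_token_request` and its arguments are made up for this example, and the call still has to be authenticated with a credential you already hold. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_create_token_request(host, existing_token, comment, lifetime_seconds=None):
    """Build (but do not send) a POST request that creates a personal access token.

    host             -- workspace host name, for example "<databricks-host>"
    existing_token   -- a credential that is already valid for this workspace
    comment          -- text that helps you identify the token later
    lifetime_seconds -- optional lifetime; omitted from the body when None
    """
    body = {"comment": comment}
    if lifetime_seconds is not None:
        body["lifetime_seconds"] = lifetime_seconds
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/token/create",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {existing_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; the new token is expected in the `token_value` field of the JSON response.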

Generating service principal secrets

Important

Configure this if you use OAuth service principal authentication for Databricks.

To generate a service principal secret:

Important

The user should be an Account Admin to create and manage service principals.

In the Databricks workspace, click your username in the top bar.
Select Settings from the drop-down. Go to the Identity and access tab. Click the Manage button next to the Service principals option.
Click the Add service principal button. A pop-up window opens.
Click Add new. Provide a name for the new service principal and click Add.
Click the newly created service principal and go to the Secrets tab.
Click Generate secret. Enter the lifetime for the secret. Click Generate.
Copy the client ID and client secret shown in the pop-up window.
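
To check that the client ID and client secret work before handing them to the collector, you can request an OAuth access token with the client credentials grant. The sketch below builds such a request against the workspace token endpoint `/oidc/v1/token` with the `all-apis` scope. The function name is made up for this example, and the request is only constructed, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_oauth_token_request(host, client_id, client_secret):
    """Build (but do not send) a client-credentials token request.

    The client ID and secret are sent as HTTP Basic credentials, and the
    grant type and scope are sent as a URL-encoded form body.
    """
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(credentials).decode("ascii")
    form = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    )
    return urllib.request.Request(
        url=f"https://{host}/oidc/v1/token",
        data=form.encode("ascii"),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` should return a JSON document with an `access_token` field if the credentials are valid.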

Setting permissions for Unity Catalog

In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.

You need to grant the user that runs the collector the appropriate permissions to harvest resources from Unity Catalog.

The user needs at least the USE CATALOG, USE SCHEMA, and SELECT permissions on the catalog to access the data. See the Databricks documentation for details about these permissions. A user with USE CATALOG and BROWSE permissions on a catalog can also access the data within it. However, these permissions do not grant the ability to retrieve column statistics, system functions, and certain additional table metadata.

To grant the permissions:

Click on the catalog on which you want to grant permission to the user.
Select the Permissions tab and click the Grant button.
Select the user and the permissions. Click GRANT.
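
The same grants can also be issued as SQL statements instead of through the UI. The sketch below renders one `GRANT` statement per minimum privilege for a catalog and principal; the function name is made up for this example, and it assumes the privileges are granted at the catalog level so that they are inherited by the schemas and tables inside it.

```python
def catalog_grant_statements(catalog, principal):
    """Return the GRANT statements for the collector's minimum privileges.

    Names are wrapped in backticks, so a name that itself contains a
    backtick is rejected rather than escaped.
    """
    for name in (catalog, principal):
        if "`" in name:
            raise ValueError(f"unsupported character in name: {name!r}")
    privileges = ("USE CATALOG", "USE SCHEMA", "SELECT")
    return [
        f"GRANT {privilege} ON CATALOG `{catalog}` TO `{principal}`;"
        for privilege in privileges
    ]


for statement in catalog_grant_statements("<catalog>", "<userid>"):
    print(statement)
```

Replace `<catalog>` and `<userid>` with real values and run the printed statements in a SQL editor as a user who is allowed to grant on the catalog.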

The user should also have the Can Use permission on the existing cluster or SQL warehouse, or be able to create their own compute resources.

To grant Can Use permission on the compute resource:

Click the three dots at the far right of the resource and select Permissions.
Add the user and select the appropriate permission.

Setting permissions for Jobs

For the collector to harvest Jobs, set up permissions on each Job that you want the collector to harvest. The user that runs the collector needs at least the Can View permission. For details about setting Job permissions, see the Databricks documentation.
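
When many Jobs are involved, the permission can be set through the Databricks Permissions API rather than the UI. The sketch below builds a `PATCH` request to `/api/2.0/permissions/jobs/<job_id>` that adds the `CAN_VIEW` level for one user. The function name is made up for this example, the request is only constructed, and the token used must belong to someone allowed to manage the Job.

```python
import json
import urllib.request


def build_job_view_permission_request(host, token, job_id, user_name):
    """Build (but do not send) a request granting CAN_VIEW on one Job.

    PATCH adds to the Job's existing access control list instead of
    replacing it.
    """
    body = {
        "access_control_list": [
            {"user_name": user_name, "permission_level": "CAN_VIEW"}
        ]
    }
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/permissions/jobs/{job_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )
```

Looping over a list of Job IDs and sending each request with `urllib.request.urlopen` applies the permission in bulk.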

Setting permissions for harvesting lineage from system tables

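Work with your Databricks admin to enable the system.access schema, which the collector uses to harvest lineage from system tables. To enable it, make a PUT request to the Unity Catalog API with the following parameters:
- schema_name: access (the name of the system.access schema without the system prefix, as it appears at the end of the URL in the example below).
- metastore_id: Get the ID from the Databricks workspace. Go to Catalog > Settings and open the Metastore info section.
Example API call: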
```
curl -X PUT -H "Authorization: Bearer <PAT_Token>" "https://<databricks-host>/api/2.1/unity-catalog/metastores/<metastore_id>/systemschemas/access"
```
If the user running the collector is not an account admin, grant the user the following permissions:
- USE SCHEMA on system.access
- SELECT on the tables system.access.table_lineage and system.access.column_lineage
Example SQL commands for granting the permissions:
```
GRANT USE SCHEMA ON SCHEMA system.access TO `<userid>`;
GRANT SELECT ON TABLE system.access.table_lineage TO `<userid>`;
GRANT SELECT ON TABLE system.access.column_lineage TO `<userid>`;
```
