
Preparing to run the Databricks collector

Setting up prerequisites for running the collector

Make sure that the machine from which you run the collector meets the following hardware and software requirements.

Table 1. Hardware and software requirements

Hardware (for on-premise runs only)

Note: The following specs are based on running one collector process at a time. Adjust the hardware if you are running multiple collectors at the same time.

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software (for on-premise runs only)

  • Docker: Install Docker on the machine that will run the collector.

data.world-specific objects (for both cloud and on-premise runs)

  • Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

  • Allowlist IPs and domains: Follow these instructions to configure your network.
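Once Docker is installed, you can verify that the machine is ready with a quick check (a sketch; any recent Docker version works):

    docker --version
    docker run --rm hello-world

Both commands should complete without errors before you attempt a collector run.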



Preparing Databricks for collectors

Sizing considerations for the collector

  • Are serverless Databricks SQL Warehouses supported?

    Yes, serverless Databricks SQL Warehouses can be used for metadata collection.

  • Are there any Databricks compute requirements?

    There are no specific compute requirements for running the Databricks collector, as the load largely depends on the workload and data volume. You can start with a General Purpose cluster; the smallest cluster configuration often suffices, but adjust based on actual performance and workload needs.

  • Does the collector workload benefit from additional workers on the Spark cluster?

    The Databricks collector itself is not highly compute-intensive, so adding additional workers to the Spark cluster specifically for the collector might not yield significant benefits.

  • What driver and worker node memory are recommended?

    While the Databricks collector does involve some memory-intensive tasks, there is no one-size-fits-all recommendation for memory size. Start with the minimum available configuration, monitor performance, and increase the memory size incrementally only if memory issues arise.

  • What is the recommended node instance type and number of cores?

    The instance type and number of cores required for the Databricks collector depend on the scale of the data being cataloged. For small Databricks instances, starting with the smallest configuration available, typically 8 GB of memory and 2 cores, is sufficient. Adjust based on specific performance and data volume considerations (see the sketch after this list).
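If you choose to provision a dedicated cluster for collector runs, the following is a minimal sketch of creating a small general-purpose cluster through the Databricks Clusters API. The host, cluster name, runtime version, and node type are placeholders, not recommendations; substitute values that are valid for your workspace and cloud provider.

    curl -X POST "https://<databricks-host>/api/2.0/clusters/create" \
      -H "Authorization: Bearer <PAT_Token>" \
      -H "Content-Type: application/json" \
      -d '{
            "cluster_name": "<cluster_name>",
            "spark_version": "<runtime_version>",
            "node_type_id": "<small_node_type>",
            "num_workers": 1,
            "autotermination_minutes": 60
          }'

The num_workers and autotermination_minutes values here reflect the guidance above: a single worker is usually enough for the collector workload, and auto-termination keeps an idle cluster from accruing cost.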

Generating a personal access token

To generate a personal access token:

Important

The user should have the Can Use or Can Manage permission in the Databricks workspace to generate a PAT.

  1. In the Databricks workspace, click your username in the top bar.

  2. Select User Settings from the drop-down menu and go to the Access tokens tab. Click the Generate New Token button.

  3. Enter a comment that helps you identify the token and change the token’s lifetime as required.

  4. To create a token with no lifetime, leave the lifetime box empty. Click Generate.

  5. Copy the displayed token and click Done. Save this token for future use.

  6. Alternatively, you can use the token API to generate a PAT, as shown in the example below.
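For reference, a minimal sketch of generating a PAT with the Token API is shown below. The host, the authenticating token, the comment, and the lifetime are placeholders; adjust them to your environment.

    curl -X POST "https://<databricks-host>/api/2.0/token/create" \
      -H "Authorization: Bearer <existing_token>" \
      -H "Content-Type: application/json" \
      -d '{"comment": "data.world collector", "lifetime_seconds": 7776000}'

The response contains the new token in the token_value field; save it just as you would a token generated from the UI.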

Setting permissions for Unity Catalog

In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.

You will need to grant the user that runs the collector the appropriate permissions to harvest resources from Unity Catalog.

The user should have at minimum the USE CATALOG, USE SCHEMA, and SELECT permissions on the catalog to access the data. See the Databricks documentation for details about these permissions.

To grant the permissions:

  1. Click on the catalog on which you want to grant permission to the user.

  2. Select the Permissions tab and click the Grant button.

  3. Select the user and the permissions. Click GRANT.
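If you prefer SQL, the same privileges can be granted from a notebook or the SQL editor. A minimal sketch, with the catalog name and user as placeholders; because privileges are inherited downward, granting them on the catalog covers the schemas and tables it contains:

    GRANT USE CATALOG ON CATALOG <catalog_name> TO `<userid>`;
    GRANT USE SCHEMA ON CATALOG <catalog_name> TO `<userid>`;
    GRANT SELECT ON CATALOG <catalog_name> TO `<userid>`;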

The user should also have the Can Use permission on the existing cluster or SQL warehouse, or they should be able to create their own compute resources.

To grant Can Use permission on the compute resource:

  1. Click the three dots at the far right of the resource and select Permissions.

  2. Add the user and select appropriate permission.

Setting permissions for Jobs

  • For the collector to harvest Jobs, you need to set up proper permissions on the Jobs that you want the collector to harvest. The user that runs the collector should have at minimum the Can View permission. For details about setting Job permissions, see the Databricks documentation or the sketch below.
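The following is a minimal sketch of granting Can View through the Permissions API; the host, job ID, and user are placeholders. A PATCH request adds the entry without replacing the job's existing access control list.

    curl -X PATCH "https://<databricks-host>/api/2.0/permissions/jobs/<job_id>" \
      -H "Authorization: Bearer <PAT_Token>" \
      -H "Content-Type: application/json" \
      -d '{
            "access_control_list": [
              { "user_name": "<userid>", "permission_level": "CAN_VIEW" }
            ]
          }'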

Setting permissions for harvesting lineage from system tables

  1. Work with your Databricks admin to enable the system.access schema for harvesting lineage from system tables. Make a PUT request to the API, where:

    • schema_name: system.access

    • metastore_id: Get the ID from the Databricks workspace. Go to Catalog > Settings and click the Metastore info section.

    Example API call:

    curl -X PUT -H "Authorization: Bearer <PAT_Token>" https://<databricks-host>/api/2.1/unity-catalog/metastores/<metastore_id>/systemschemas/access
  2. If the user running the collector is not an account admin, the following permissions should be granted to the user:

    • USE SCHEMA on the system.access schema.

    • SELECT on the tables system.access.table_lineage and system.access.column_lineage.

    Example SQL commands for granting the permissions:

    GRANT USE SCHEMA ON SCHEMA system.access TO `<userid>`;
    GRANT SELECT ON TABLE system.access.table_lineage TO `<userid>`;
    GRANT SELECT ON TABLE system.access.column_lineage TO `<userid>`;