
Preparing to run the Databricks collector

Setting up pre-requisites for running the collector

Make sure that the machine from which you run the collector meets the following hardware and software requirements.

Table 1.

Item: Requirement

Hardware (for on-premise runs only)

Note: The following specs are based upon running one collector process at a time. Please adjust the hardware if you are running multiple collectors at the same time.

RAM: 8 GB

CPU: 2 GHz processor

Software (for on-premise runs only)

Docker: Click here to get Docker.

data.world specific objects (for both cloud and on-premise runs)

Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.
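For on-premise runs, the Docker requirement can be checked before you start the collector. The following is a minimal sketch, not part of the collector itself: the function name docker_available is hypothetical, and the check only confirms that a docker executable is on the PATH, not that the Docker daemon is running.

```python
import shutil


def docker_available(which=shutil.which):
    """Return True if a 'docker' executable can be found on the PATH.

    The 'which' parameter exists only so the lookup can be swapped out
    in tests; by default the standard library's shutil.which is used.
    """
    return which("docker") is not None


if __name__ == "__main__":
    if docker_available():
        print("Docker found on PATH.")
    else:
        print("Docker not found. Install Docker before running the collector.")
```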



Preparing Databricks for collectors

Generating personal access token

To generate a personal access token:

Important

To generate a personal access token (PAT), the user must have the Can Use or Can Manage permission in the Databricks workspace.

  1. In the Databricks workspace, click your username in the top bar.

  2. Select User Settings from the drop-down menu, go to the Access tokens tab, and click the Generate New Token button.

  3. Enter a comment that helps you to identify the token and change the token’s lifetime as required.

  4. To create a token that never expires, leave the lifetime box empty. Click Generate.

  5. Copy the displayed token and click Done. Save this token for future use.

  6. Alternatively, you can use the token API to generate a PAT.
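The token API route mentioned in the last step can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: it assumes the Databricks Token API endpoint /api/2.0/token/create, which accepts a comment and an optional lifetime_seconds, and it assumes you already hold credentials (here, an existing token) to authenticate the call. The function names, the workspace URL, and the token values are placeholders.

```python
import json
import urllib.request


def build_token_request(workspace_url, existing_token, comment, lifetime_seconds=None):
    """Build (but do not send) a POST request to the Token API."""
    payload = {"comment": comment}
    # Omitting lifetime_seconds mirrors leaving the lifetime box empty in the UI.
    if lifetime_seconds is not None:
        payload["lifetime_seconds"] = lifetime_seconds
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/token/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + existing_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_token(request):
    """Send the request and return the new token string from the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["token_value"]


# Example (placeholder values; create_token performs a real network call):
# request = build_token_request(
#     "https://<your-workspace-host>", "<existing-token>", "data.world collector", 7776000
# )
# new_token = create_token(request)
```

As with the UI flow, the token value is only returned once, so store it where the collector configuration can read it.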

Setting permissions for Unity Catalog

In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward.

You need to grant the user who runs the collector the appropriate permissions to harvest resources from Unity Catalog.

The user needs at least the USE CATALOG, USE SCHEMA, and SELECT permissions on the catalog to access the data. See the Databricks documentation for details about these permissions.

To grant the permissions:

  1. Click on the catalog on which you want to grant permission to the user.

  2. Select the Permissions tab and click the Grant button.

  3. Select the user and the permissions. Click GRANT.
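The same grant can also be expressed as a Databricks SQL statement instead of through the UI. The sketch below only builds the statement text; it assumes the standard GRANT ... ON CATALOG ... TO ... syntax, and the helper names, the catalog name main, and the principal collector@example.com are hypothetical. You would run the resulting statement yourself in a SQL editor or notebook as a user who is allowed to grant on that catalog.

```python
def quote_identifier(name):
    """Wrap a name in backticks, doubling any backtick inside it."""
    return "`" + name.replace("`", "``") + "`"


def build_grant_statement(catalog, principal,
                          privileges=("USE CATALOG", "USE SCHEMA", "SELECT")):
    """Return one GRANT statement covering the minimum collector privileges."""
    return "GRANT {} ON CATALOG {} TO {}".format(
        ", ".join(privileges),
        quote_identifier(catalog),
        quote_identifier(principal),
    )


if __name__ == "__main__":
    print(build_grant_statement("main", "collector@example.com"))
```

Because privileges are inherited downward, granting at the catalog level covers the schemas and tables inside that catalog.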

The user should also have the Can Use permission on an existing cluster or SQL warehouse, or be able to create their own compute resources.

To grant Can Use permission on the compute resource:

  1. Click the three dots at the far right of the resource and select Permissions.

  2. Add the user and select the appropriate permission.

Setting permissions for Jobs

  • For the collector to harvest Jobs, set up the proper permissions on each Job that you want the collector to harvest. The user who runs the collector needs at least the Can View permission. For details about setting Job permissions, see the Databricks documentation.
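If many Jobs need the Can View permission, setting it through the REST API may be easier than clicking through each Job. The sketch below is an assumption-laden illustration: it assumes the Databricks Permissions API endpoint /api/2.0/permissions/jobs/{job_id} and the CAN_VIEW permission level, uses PATCH so that existing permissions on the Job are kept, and only builds the request. The function name, workspace URL, token, job ID, and user name are placeholders.

```python
import json
import urllib.request


def build_job_view_request(workspace_url, admin_token, job_id, user_name):
    """Build (but do not send) a PATCH request granting CAN_VIEW on one Job."""
    payload = {
        "access_control_list": [
            {"user_name": user_name, "permission_level": "CAN_VIEW"}
        ]
    }
    return urllib.request.Request(
        url="{}/api/2.0/permissions/jobs/{}".format(workspace_url.rstrip("/"), job_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + admin_token,
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


# Example (placeholder values; urlopen performs a real network call):
# request = build_job_view_request(
#     "https://<your-workspace-host>", "<admin-token>", 123, "collector@example.com"
# )
# urllib.request.urlopen(request)
```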