
Preparing to run the Azure Data Lake Storage Gen2 collector

Note

The latest version of the collector is 2.255. For the release notes for this version and all previous versions, see the release notes page.

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1. Requirements (Item: Requirement)

Hardware (for on-premises runs only)

Note: The following specifications assume one collector process running at a time. Increase the hardware accordingly if you run multiple collectors at the same time.

  RAM: 8 GB

  CPU: 2 GHz processor

Software (for on-premises runs only)

  Docker: Click here to get Docker.

data.world specific objects (for both cloud and on-premises runs)

  Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

  Allowlist IPs and domains
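
As a quick pre-flight check for the Docker requirement on an on-premises machine, something like the following Python sketch could be used. It only tests whether a `docker` executable is on the PATH; it does not verify the Docker version, the RAM, or the CPU, and the function name is illustrative rather than part of any data.world tooling.

```python
import shutil


def docker_available(executable: str = "docker") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if docker_available():
        print("Docker CLI found on PATH.")
    else:
        print("Docker CLI not found; install Docker before running the collector.")
```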



Setting up access for cataloging Azure Data Lake Storage Gen2 resources

Authentication types supported

The Azure Data Lake Storage Gen2 collector authenticates using an Azure service principal.
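
To illustrate what service principal authentication generally involves, here is a minimal Python sketch of the standard OAuth 2.0 client credentials request to the Microsoft identity platform, scoped to Azure Storage. This page does not describe how the collector authenticates internally, so treat this as background rather than the collector's implementation. The function name and the placeholder values are assumptions; the sketch builds the request but does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client credentials token request for Azure Storage."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # The ".default" scope requests the permissions already granted on Azure Storage.
            "scope": "https://storage.azure.com/.default",
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    # Placeholder values only; real values come from the registration steps.
    request = build_token_request(
        "00000000-0000-0000-0000-000000000000", "example-client-id", "example-secret"
    )
    print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) should return a JSON document containing an access token when the credentials are valid, which can be a convenient way to confirm that the client ID, client secret, and tenant ID are correct before running the collector.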

STEP 1: Registering your application

To register a new application:

  1. Go to the Azure Portal.

  2. Select Azure Active Directory.

  3. Click the App Registrations option in the left sidebar.

  4. Click New Registration and enter the following information:

    1. Application Name: DataDotWorldADLSGen2Application.

    2. Supported account types: Accounts in this organizational directory only.

  5. Click Register to complete the registration.

STEP 2: Creating a client secret and getting the client ID

To create a client secret:

  1. On the page of the application you created, select Certificates and Secrets.

  2. Under the Client secrets tab, click the New client secret button.

  3. Add a Description.

  4. Set the expiration for the client secret.

  5. Click Add, and copy the secret value. Copy it right away, because Azure does not display the value again after you leave the page.

To get the Client ID from the Azure portal:

  1. Click on the Overview tab in the left sidebar of the application home page.

  2. Copy the Application (Client) ID from the Essentials section.
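
Once you have the client ID and client secret, one way to keep them out of scripts and shell history is to read them from environment variables. The sketch below assumes two hypothetical variable names, `ADLS_CLIENT_ID` and `ADLS_CLIENT_SECRET`; they are not defined by the collector and can be renamed freely.

```python
import os


def load_client_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if either is missing."""
    # Hypothetical variable names chosen for this example.
    names = ("ADLS_CLIENT_ID", "ADLS_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ADLS_CLIENT_ID"], env["ADLS_CLIENT_SECRET"]


# Example usage (requires both variables to be set):
# client_id, client_secret = load_client_credentials()
```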

STEP 3: Obtaining Subscription ID and Tenant ID

  1. On the page of the application you created in Step 1, copy and save the Directory (tenant) ID. You will use this for the --tenant-id parameter.

  2. Navigate to a storage account that you would like to harvest from. From the Overview page, copy the Subscription ID. You will use this for the --subscription-id parameter.
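
Both the tenant ID and the subscription ID are GUIDs, so a copy-and-paste mistake can be caught before the collector runs. The following Python sketch validates the two values and arranges them as arguments for the `--tenant-id` and `--subscription-id` parameters named above. The helper names are illustrative, and the collector's remaining parameters are not covered here.

```python
import uuid


def validated_guid(value: str, label: str) -> str:
    """Return the GUID in canonical lowercase form, or raise ValueError if malformed."""
    try:
        return str(uuid.UUID(value.strip()))
    except ValueError as exc:
        raise ValueError(f"{label} is not a valid GUID: {value!r}") from exc


def collector_id_args(tenant_id: str, subscription_id: str) -> list:
    """Build the argument pairs for the two ID parameters described in this step."""
    return [
        "--tenant-id", validated_guid(tenant_id, "Tenant ID"),
        "--subscription-id", validated_guid(subscription_id, "Subscription ID"),
    ]
```

Normalizing through `uuid.UUID` also strips surrounding whitespace and lowercases the value, which avoids subtle mismatches when the IDs are pasted from the portal.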

Enable access to the detailed data source information (like tables and columns) provided by Power BI through the read-only admin APIs. For details about doing this task, please see this documentation.

STEP 4: Granting the service principal access to each storage account

Important

Perform these tasks for each Storage Account you plan to harvest.

  1. Go to Storage Account. Click on Access Control (IAM).

  2. Click Add > Add role assignment.

  3. On the Role tab, under Job function roles, select Storage Blob Data Reader.

  4. Click the Members tab, then click Select members.

  5. Find and click the Service Principal you created earlier. Click Select.

  6. Click Review + assign.

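  7. Repeat steps 1-6, this time selecting the Reader role, so that the service principal holds both the Storage Blob Data Reader and Reader roles on the storage account.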
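
The steps above use the Azure portal. If you have many storage accounts, the same two role assignments could instead be scripted with the Azure CLI, which is an alternative to the documented portal procedure rather than part of it. The Python sketch below only builds the `az role assignment create` commands as argument lists and does not run them. The resource group and storage account names are hypothetical inputs you would supply.

```python
# The two roles assigned in this step.
ROLES = ("Storage Blob Data Reader", "Reader")


def storage_account_scope(subscription_id: str, resource_group: str, account_name: str) -> str:
    """Return the Azure resource ID of a storage account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_commands(client_id: str, scope: str) -> list:
    """Build one Azure CLI command per role for the given service principal and scope."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", client_id, "--role", role, "--scope", scope]
        for role in ROLES
    ]


if __name__ == "__main__":
    # Placeholder values; substitute your own subscription, resource group, and account.
    scope = storage_account_scope(
        "00000000-0000-0000-0000-000000000000", "example-rg", "examplestorage"
    )
    for command in role_assignment_commands("example-client-id", scope):
        print(" ".join(f'"{part}"' if " " in part else part for part in command))
```

Each printed command can be reviewed and then run in a shell where the Azure CLI is signed in with an account that is allowed to create role assignments on the storage account.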