
Preparing to run the Power BI Gov collector

Setting up pre-requisites for running the collector

Make sure that the machine from where you are running the collector meets the following hardware and software requirements.

Table 1. Pre-requisites for running the collector

Hardware (for on-premise runs only)

Note: The following specs are based on running one collector process at a time. Please adjust the hardware if you are running multiple collectors at the same time.

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software (for on-premise runs only)

  • Docker: Install Docker on the machine that will run the collector.

data.world specific objects (for both cloud and on-premise runs)

  • Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

  • Allowlist IPs and domains.


Setting up access for cataloging Power BI resources

The collector authenticates to Power BI Gov using a service principal. This section walks you through the process of setting up the authentication.

Important things to note:

The collector currently uses Azure Active Directory to authenticate to the Power BI Cloud API. You will need to create an application registration in Azure AD, enable Power BI API authentication for it, and create a client secret.

The collector harvests metadata for all Power BI apps and workspaces to which the supplied account has access.

STEP 1: Registering your application

To register a new application:

  1. Go to the Azure Portal.

  2. Select Azure Active Directory.

  3. Click the App Registrations option in the Azure services.

  4. Click New Registration and enter the following information:

    1. Application Name: DataDotWorldPowerBIApplication

    2. Supported account types: Accounts in this organizational directory only

  5. Click Register to complete the registration.

STEP 2: Creating Client secret and getting the Client ID

To create a Client Secret:

  1. Go to the Azure Portal.

  2. On the application page, select Certificates and Secrets.

  3. Click New client secret and add a description.

  4. Select the desired expiration date.

  5. Click on Create, and copy the secret value.

To get the Client ID from the Azure portal:

  1. Go to the Azure Portal.

  2. Click on the Overview tab in the left sidebar of the application home page.

  3. Copy the Application (client) ID from the Essentials section.
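
If you want to confirm the client ID and secret before running the collector, you can request a token with the OAuth 2.0 client credentials flow. The following is a minimal sketch, not part of the collector itself; it assumes the commercial Azure AD authority (login.microsoftonline.com) and the standard Power BI scope, and the authority and scope for your Government cloud (GCC, GCC High, or DoD) may differ from these placeholder values.

    # Minimal sketch: verify that the service principal can obtain an access token.
    # Assumptions: commercial Azure AD authority (login.microsoftonline.com) and the
    # standard Power BI scope; adjust both for your Government cloud if needed.
    import requests

    TENANT_ID = "<your-tenant-id>"          # Directory (tenant) ID from the Azure AD Overview page
    CLIENT_ID = "<your-client-id>"          # Application (client) ID from STEP 2
    CLIENT_SECRET = "<your-client-secret>"  # secret value from STEP 2

    token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
    response = requests.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            # Scope for the commercial Power BI API; Gov clouds may use a different resource URI.
            "scope": "https://analysis.windows.net/powerbi/api/.default",
        },
        timeout=30,
    )
    response.raise_for_status()
    access_token = response.json()["access_token"]
    print("Token acquired; expires in", response.json()["expires_in"], "seconds")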

STEP 3: Setting up metadata scanning

Enable access to the detailed data source information (like tables and columns) provided by Power BI through the read-only admin APIs. For details about doing this task, please see this documentation.

STEP 4: Setting up REST API for service principals

Important

Perform this task only if you are using a service principal for authentication.

If you are using a service principal as your authentication type, ensure that service principals are enabled to use the Power BI APIs. For detailed instructions for doing this task, please see this documentation.
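
After the tenant setting is enabled, you can optionally confirm that the service principal can call the Power BI REST API, for example by listing the workspaces it can access. The sketch below is an illustration only; it assumes the Power BI US Government (GCC) API host api.powerbigov.us, and GCC High or DoD tenants use different hosts.

    # Minimal sketch: list the workspaces visible to the service principal.
    # Assumption: Power BI US Government (GCC) API host api.powerbigov.us;
    # GCC High and DoD tenants use different hosts.
    import requests

    API_BASE = "https://api.powerbigov.us/v1.0/myorg"
    access_token = "<access-token-from-the-previous-sketch>"
    headers = {"Authorization": f"Bearer {access_token}"}

    response = requests.get(f"{API_BASE}/groups", headers=headers, timeout=30)
    response.raise_for_status()
    for workspace in response.json().get("value", []):
        print(workspace["id"], workspace["name"])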

Configuring Power BI for Report Image Harvesting

You must perform these tasks to enable the harvesting of preview images from Power BI reports. Note that report image harvesting is not supported for Power BI Apps.

  1. Enable the Export reports as image files setting from the Admin settings.

  2. Ensure that the reports to be exported are located in a workspace with Premium, Embedded, or Fabric capacity. For details, see the Power BI documentation.

Setting up the ODBC data sources YAML file

Use this option when Power BI datasets use an ODBC connection or any other source that relies on a DSN. In these cases, Power BI does not supply the host or database type for the source, so you need to create a YAML file that maps each DSN to a specific database host and type.

This is an optional task for harvesting lineage information.

  • Create a YAML file (datasources.yml) with the list of data sources and the corresponding host and database type.

     datasources:
     - name: "Name-for-datasource"
       host: <my-datasource-host>
       databaseType: <type-of-database>

    The possible databaseType values are: postgres, redshift, bigquery, oracle, mysql, netezza, snowflake, sqlanywhere, sqlserver, databricks. The values are not case sensitive but must be a single word with no spaces.

    For example:

    datasources:
    - name: "SQL Server"
      databaseType: sqlserver
      host: 8bank-sqlserver.cpetgx.us-east-1.rds.amazonaws.com