Preparing to run the Power BI Service collector

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item: Requirement

Hardware (for on-premise runs only)

Note: The following specs assume one collector process running at a time. Adjust the hardware if you run multiple collectors at the same time.

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software (for on-premise runs only)

  • Docker: Click here to get Docker.

data.world specific objects (for both cloud and on-premise runs)

  • Dataset: You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.



Setting up access for cataloging Power BI resources

Important things to note

  • A Power BI administrator is needed to enable settings in the Power BI Admin Portal.

  • Dataflows require the user/service principal to be added to the workspace with at least contributor access. When authenticating with username/password, the app registration needs to have Dataflow.Read.All permissions in API permissions.

  • Power BI does not include user workspaces when using service principal authentication.

STEP 1: Registering your application

To register a new application:

  1. Go to the Azure Portal.

  2. Click the App Registrations option in the left sidebar.

  3. Click New Registration and enter the following information:

    1. Application Name: DataDotWorldPowerBIApplication

    2. Supported account types: Accounts in this organizational directory only

  4. Click Register to complete the registration.

STEP 2: Creating Client secret and getting the Client ID

To create a Client Secret:

  1. Go to the Azure Portal.

  2. On the application page, select Certificates and Secrets.

  3. Click on Secret and add a description.

  4. Set the expiration to Never.

  5. Click on Create, and copy the secret value. You will use this value while setting the parameters for the collector.

To get the Client ID from the Azure portal:

  1. Click on the Overview tab in the left sidebar of the application home page.

  2. Copy the Client ID from the Essentials section. You will use this value while setting the parameters for the collector.
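If you go on to use service principal authentication, one way to sanity-check the Client ID and Client Secret is to request a token from the Microsoft identity platform. The sketch below is not part of the collector. `build_token_request` is a hypothetical helper, the endpoint and scope are the standard client-credentials values for Power BI, and the tenant ID is assumed to come from STEP 5. The code only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Standard Microsoft identity platform token endpoint and Power BI scope.
TOKEN_URL = "https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
POWER_BI_SCOPE = "https://analysis.windows.net/powerbi/api/.default"


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client-credentials token request.

    Hypothetical helper for illustration; the collector does not expose it.
    """
    form = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": POWER_BI_SCOPE,
    }).encode("utf-8")
    return Request(
        TOKEN_URL.format(tenant_id=tenant_id),
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values only; substitute your own IDs and secret.
request = build_token_request(
    "00000000-0000-0000-0000-000000000000",
    "11111111-1111-1111-1111-111111111111",
    "example-secret",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` should return a JSON document containing an `access_token` when the values are valid. Treat the secret like a password and keep it out of source control.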

STEP 3: Setting up authentication

There are two separate ways to authenticate to Power BI:

  • Service principal

  • User and password

This section will walk you through the process for both authentication types.

OPTION 1: Setting up REST API for service principals

Important

Perform this task only if you are using the service principal for authentication. You do not need to do this task if you are using user and password for authentication. You can refer to the Microsoft documentation for more details.

  • When using Service Principal authentication, the collector will automatically harvest all the objects listed here except for personal workspaces, user workspaces, and report pages.

  • If you want to harvest all apps and any workspace in the tenant:

    • Use the --all-workspaces-and-apps parameter. This excludes the harvesting of personal and user workspaces.

    • To harvest Personal Workspaces and My Workspaces, add the parameter:

      --include-user-workspace

    • To harvest report pages, you need to give the Service principal access to each workspace that you want to harvest. This is needed because the admin API used to access all workspaces and apps in the tenant does not have an API endpoint for report pages.
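The scope choices above map to two collector parameters. As a rough sketch, a wrapper script could assemble them as follows. `workspace_flags` is a hypothetical helper, and only the two flag names come from this page. The rest of the collector command is not shown.

```python
def workspace_flags(all_workspaces_and_apps: bool = False,
                    include_user_workspace: bool = False) -> list:
    """Return the scope-related collector parameters for the chosen options.

    Hypothetical helper; only the flag strings are taken from the documentation.
    """
    flags = []
    if all_workspaces_and_apps:
        # Harvest all apps and any workspace in the tenant.
        flags.append("--all-workspaces-and-apps")
    if include_user_workspace:
        # Also harvest Personal Workspaces and My Workspaces.
        flags.append("--include-user-workspace")
    return flags


print(workspace_flags(all_workspaces_and_apps=True, include_user_workspace=True))
```

The returned list is meant to be appended to whatever collector command line you already use.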

OPTION 2: Setting up permissions for username & password authentication

Important

Perform this task only if you are using user and password for authentication. You do not need to do this task if you are using service principal authentication. You can refer to the Microsoft documentation for more details.

  • If you are using User authentication, the collector will automatically harvest all the objects listed here except for personal workspaces, user workspaces, and report pages.

  • If you want to harvest all apps and any workspace in the tenant:

    • Use the --all-workspaces-and-apps parameter. This excludes the harvesting of personal and user workspaces.

    • To harvest Personal Workspaces and My Workspaces, add the parameter:

      --include-user-workspace

    • To harvest report pages, you need to give the user access to each workspace that you want to harvest. This is needed because the admin API used to access all workspaces and apps in the tenant does not have an API endpoint for report pages.

To add permissions:

  1. If you are planning to use the all-workspaces-and-apps option while running the collector, the user must have administrator rights (such as Microsoft 365 Global Administrator or Power BI Service Administrator) to use metadata scanning. For details, see the Power BI documentation.

  2. If you are not planning to use the all-workspaces-and-apps option, do the following:

    1. Click on API Permissions, and select Add Permission.

    2. Search for the Microsoft Graph and select the following permissions:

      • Application permission: Application.Read.All

      • Delegated permission: User.Read (assigned by default)

    3. Search for the Power BI service, and click on Delegated permissions. Select the following permissions:

      • App.Read.All

      • Dashboard.Read.All

      • Dataflow.Read.All

      • Dataset.Read.All

      • Report.Read.All

      • Tenant.Read.All

      • Workspace.Read.All

    4. Click on the Grant Admin consent button, which is located next to the Add permission button. This allows the data.world collector to run as a daemon without having to ask the user for permission on every collector run.

    Note

    Only administrators of the tenant can grant admin consent.
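To double-check the Power BI service delegated permissions listed above, a short comparison against the set you actually granted can help. This is a sketch only. `missing_permissions` is a hypothetical helper, and the `granted` list is assumed to be copied by hand from the API Permissions page.

```python
# Delegated Power BI service permissions named in the steps above.
REQUIRED_POWER_BI_DELEGATED = frozenset({
    "App.Read.All",
    "Dashboard.Read.All",
    "Dataflow.Read.All",
    "Dataset.Read.All",
    "Report.Read.All",
    "Tenant.Read.All",
    "Workspace.Read.All",
})


def missing_permissions(granted) -> list:
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_POWER_BI_DELEGATED - set(granted))


# Example: permissions copied from the app registration (illustrative values).
granted = ["App.Read.All", "Dataset.Read.All", "Report.Read.All"]
for name in missing_permissions(granted):
    print("missing:", name)
```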

STEP 4: Setting up metadata scanning

Set up metadata scanning to enable access to the detailed data source information (like tables and columns) provided by Power BI through the read-only admin APIs. Before metadata scanning can be run over an organization's Power BI workspaces, it must be set up by a Power BI administrator.

Option 1: When using service principal authentication

  1. Follow the Power BI documentation to enable service principal authentication for Power BI read-only APIs.

  2. Next, follow the Power BI documentation to enable the following enhanced tenant settings for metadata scanning.

    • Enhance admin APIs responses with detailed metadata

    • Enhance admin APIs responses with DAX and mashup expressions

Option 2: When using username and password authentication

Important

The user must have administrator rights (such as Microsoft 365 Global Administrator or Power BI Service Administrator) to use metadata scanning. For details, see the Power BI documentation.

  • Follow the Power BI documentation to enable the following enhanced tenant settings for metadata scanning:

    • Enhance admin APIs responses with detailed metadata

    • Enhance admin APIs responses with DAX and mashup expressions

STEP 5: Getting the Tenant ID

  1. To find the tenant ID, click the question mark in the Power BI app and then choose About Power BI.

  2. The tenant ID can be found at the end of the Tenant URL. You will use this value while setting the parameters for the collector.
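Because the tenant ID sits at the end of the Tenant URL, it can be copied out programmatically. The sketch below assumes the ID is a GUID and takes the last GUID found in the URL. `tenant_id_from_url` is a hypothetical helper, and the sample URL is illustrative rather than taken from a real tenant.

```python
import re

# Matches a GUID such as 00000000-0000-0000-0000-000000000000.
GUID_PATTERN = re.compile(r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}")


def tenant_id_from_url(tenant_url: str) -> str:
    """Return the last GUID in the Tenant URL, lowercased.

    Assumes the tenant ID is a GUID at the end of the URL; raises
    ValueError if the URL contains no GUID.
    """
    matches = GUID_PATTERN.findall(tenant_url)
    if not matches:
        raise ValueError("no GUID found in tenant URL")
    return matches[-1].lower()


# Illustrative URL shape; copy the real one from About Power BI.
sample_url = "https://app.powerbi.com/home?ctid=00000000-0000-0000-0000-000000000000"
print(tenant_id_from_url(sample_url))
```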

Setting up ODBC data sources YAML file

Use this option when Power BI datasets utilize an ODBC connection or any other source that employs a DSN. In such cases, Power BI does not supply the host or database type for the source. Therefore, you will need to establish a YAML file with a mapping from the DSN to a specific database host and type.

This is an optional task for harvesting lineage information.

  • Create a YAML file (datasources.yml) with the list of data sources and the corresponding host and database type.

     datasources:
     - name: "Name-for-datasource"
       host: <my-datasource-host>
       databaseType: <type-of-database>

    The possible databaseType values are: postgres, redshift, bigquery, oracle, mysql, netezza, snowflake, sqlanywhere, sqlserver, databricks. The values are not case sensitive but must be a single word with no spaces.

    For example:

    datasources:
    - name: "SQL Server"
      databaseType: sqlserver
      host: 8bank-sqlserver.cpetgx.us-east-1.rds.amazonaws.com
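Before handing the file to the collector, it can be worth checking each entry against the rules above. The following is a minimal sketch. `validate_datasources` is a hypothetical helper, and it assumes the YAML has already been loaded into a list of dictionaries (for example with a YAML parser of your choice, which is not shown here).

```python
# databaseType values listed in the documentation above.
ALLOWED_DATABASE_TYPES = frozenset({
    "postgres", "redshift", "bigquery", "oracle", "mysql",
    "netezza", "snowflake", "sqlanywhere", "sqlserver", "databricks",
})
REQUIRED_KEYS = ("name", "host", "databaseType")


def validate_datasources(entries) -> list:
    """Return a list of human-readable problems; an empty list means no issues."""
    errors = []
    for index, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                errors.append(f"entry {index}: missing '{key}'")
        db_type = str(entry.get("databaseType") or "")
        # Comparison is case insensitive; values with spaces never match.
        if db_type and db_type.lower() not in ALLOWED_DATABASE_TYPES:
            errors.append(f"entry {index}: unsupported databaseType '{db_type}'")
    return errors


# Mirrors the example file above.
entries = [{
    "name": "SQL Server",
    "databaseType": "sqlserver",
    "host": "8bank-sqlserver.cpetgx.us-east-1.rds.amazonaws.com",
}]
print(validate_datasources(entries))
```

A value such as "SQL Server" in databaseType would be reported, because it contains a space and does not match any listed type.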