Running the Microsoft Fabric collector on-premises

Note

The latest version of the Collector is 2.270. To view the release notes for this version and all previous versions, please go here.

Ways to run the data.world Collector

There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.

  • Run the collector through a CLI - Repeat runs of the collector require you to re-enter the command for each run.

Note

This section walks you through the process of running the collector using the CLI.

Preparing and running the command

The easiest way to create your Collector command is to:

  1. Copy the following example command into a text editor.

  2. Set the required parameters in the command. The example command includes the minimal parameters required to run the collector.

  3. Open a terminal window in any Unix environment that uses a Bash shell, paste the command into it, and run it.

docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.270 \
  catalog-microsoft-fabric --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --client-id=${CLIENT_ID} \
  --client-secret=${CLIENT_SECRET} --tenant-id=${TENANT_ID}
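
The example command reads its credentials from environment variables. A minimal sketch of setting them in the same Bash session before running the command (the values shown are placeholders for your own credentials):

# Placeholders - replace with your own values before running the collector
export DW_AUTH_TOKEN="<your data.world API token>"
export CLIENT_ID="<client ID of the application registered in Microsoft Fabric>"
export CLIENT_SECRET="<client secret for the registered application>"
export TENANT_ID="<tenant ID that identifies your organization in Microsoft Fabric>"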

The following table describes the parameters for the command.

Table 1.

Parameter

Details

Required?

dwcc: <CollectorVersion>

The version of the collector you want to use (for example, datadotworld/dwcc:2.270).

Yes

-t= <apiToken>

--api-token= <apiToken>

The data.world API token to use for authentication. Default is to use an environment variable named ${DW_AUTH_TOKEN}.

Yes

-n= <catalogName>

--name= <catalogName>

Specify the collection name where the collector output will be saved. Ensure you use a distinct collection for each collector.

Yes

-o= <outputDir>

--output= <outputDir>

The output directory into which any catalog files should be written.

Yes

--output-name

Specify the collector output file name to override the default file name. The system automatically adds .dwec.ttl to the end of the provided file name.

No

--upload-location= <uploadLocation>

The name of the dataset where the collector output (catalog file) should be uploaded. In the example command, this is the ddw-catalogs dataset.

Yes

-H= <apiHost>

--api-host= <apiHost>

The host for the data.world API. NOTE: This parameter is required for single-tenant installations. For example, "api.site.data.world" where "site" is the name of the single-tenant install.

Yes

(for single-tenant installations)

-a= <agent>

--agent= <agent>

--account= <agent>

The ID for the data.world account into which you will load this catalog - this is used to generate the namespace for any URIs generated.

Yes

--site= <site>

Set this parameter only for private instances. Do not set it for public instances or single-tenant installations.

Yes

(required for private instance installations)

-U

--upload

Whether to upload the generated catalog to the organization account's catalogs dataset.

Yes

--tenant-id

The tenant ID that identifies the organization in Microsoft Fabric.

Yes

--client-id

The client ID of the registered application in Microsoft Fabric.

Yes

--client-secret

The client secret for the registered application in Microsoft Fabric.

Yes

--include-workspace

Specify the workspaces to be collected, using either a workspace name or a regular expression to match.

Use the parameter multiple times for multiple workspaces. For example, --include-workspace="workspaceA" --include-workspace="workspaceB"

Note: If the workspace name includes special characters (. + * ? ^ $ ( ) [ ] { } | \), use a backslash (\) before each special character to escape it. For instance, Workspace [Dev] should be changed to Workspace \[Dev\].

No

--exclude-workspace

Specify the workspaces and contents to exclude from being cataloged, using either a workspace name or a regular expression to match.

Use the parameter multiple times for multiple workspaces. For example, --exclude-workspace="workspaceA" --exclude-workspace="workspaceB"

If both --include-workspace and --exclude-workspace are used, --include-workspace takes precedence.

Note: If the workspace name includes special characters (. + * ? ^ $ ( ) [ ] { } | \), use a backslash (\) before each special character to escape it. For instance, Workspace [Dev] should be changed to Workspace \[Dev\].

No

--api-max-retries

Specify the number of times to retry an API call which has failed. The default value is 5.

No

--api-retry-delay

Specify the amount of time, in seconds, to wait between retries of a failed API call. The default is 2 seconds.

No

--disable-max-requests-wait

Disable waiting up to an hour for the Power BI API endpoints to reset throttling limits (error code 429 - too many requests). When not disabled, the collector retries every 5 minutes for up to an hour. If this option is disabled, the Max retries and Retry delay options will be used instead.

No

--disable-expression-lineage

Skip harvesting lineage metadata from table source expressions in semantic models and dataflows.

No

--image-collection

Specify whether the collector should catalog preview images. The default setting is false. Ensure that you have met all prerequisites for using this feature.

No

--max-parseable-expression-length

Set the maximum number of characters in a Table expression (coming from Semantic Models and Dataflows) that will be parsed for lineage metadata. Expressions longer than this will be skipped. Default is 32000.

No

--datasource-mapping-file

Provide the location of the datasources.yml file, if you have configured source details in that file.

Note: You should have placed your datasources.yml file in the source directory of your host machine. The value in this field is relative to the mount location of the container (target). For example, if your mount target is set to /dwcc-output, the value for --datasource-mapping-file will be /dwcc-output/datasources.yml.

No

The following options apply only for cataloging database resources from Warehouses and Lakehouses.

-A, --all-schemas

Catalog all schemas.

Yes (if --schema is not set)

-S, --schema

Specify the names of the database schemas to be cataloged.

Yes

(if --all-schemas is not set)

--collect-extended-properties

Harvest information about extended properties from SQL Server type databases.

No

--disable-extended-metadata, --disable-extended-metadata-collection

Skip harvesting of extended metadata for resource types such as databases, schemas, tables, columns, functions, stored procedures, user-defined types, and synonyms. Basic metadata for these resource types will still be harvested.

No

--disable-lineage-collection

Skip harvesting of intra-database lineage metadata. This applies only to database resources such as views.

No

--enable-column-statistics

Enable harvesting of column statistics (i.e. data profiling).

Note: Activating the profiling feature may extend the running time of the collector. This is because the collector needs to read the table data to be able to gather metadata for profiling.

No

--exclude-schema

Specify the name of a database schema to exclude, or a regular expression that matches the schemas to exclude. Applicable only if --all-schemas is specified.

No

--exclude-system-functions

Specify this option to exclude system functions from the catalog.

No

--include-information-schema

When --all-schemas is specified, include the database's Information Schema in catalog collection (ignored if --all-schemas is not specified).

No

--jdbc-property

JDBC driver properties to pass through to the driver connection, as a name=value pair. Use the parameter multiple times for multiple properties. For example, --jdbc-property property1=value1 --jdbc-property property2=value2

Note: By default, the collector uses authentication=ActiveDirectoryServicePrincipal and encrypt=true.

No

--sample-string-values

Enable sampling and storage of sample values for columns with string values.

Note: Only applies if Enable column statistics collection is turned on.

No

--target-sample-size

Controls the number of rows sampled for computation of column statistics and string-value histograms. For example, to sample 1000 rows, set the parameter as: --target-sample-size=1000. Default is 100000.

Note: Only applies if Enable column statistics collection is turned on.

No
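
For example, a run that limits cataloging to a specific workspace and schema and enables column profiling might look like the following sketch. The workspace name Sales Analytics and schema name dbo are placeholders; replace them with values from your environment and combine only the options your use case requires.

docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.270 \
  catalog-microsoft-fabric --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --client-id=${CLIENT_ID} \
  --client-secret=${CLIENT_SECRET} --tenant-id=${TENANT_ID} \
  --include-workspace="Sales Analytics" --schema=dbo \
  --enable-column-statistics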



Common troubleshooting tasks

  • A list of common issues and problems encountered when running the collectors is available here.

Automating updates to your metadata catalog

Maintaining an up-to-date metadata catalog is crucial and can be achieved by employing Azure Pipelines, CircleCI, or any automation tool of your preference to execute the catalog collector regularly.

There are two primary strategies for setting up the collector run times:

  • Scheduled: You can configure the collector according to the anticipated frequency of metadata changes in your data source and the business need to access updated metadata. Account for the completion time of the collector run (which depends on the size of the source) and the time required to load the collector's output into your catalog. For instance, the schedule could be daily or weekly. We recommend scheduling the collector run during off-peak times for optimal performance; a minimal scheduling sketch follows this list.

  • Event-triggered: If you have set up automations that refresh the data in a source technology, you can set up the collector to execute whenever the upstream jobs complete successfully. For example, if you're using Airflow, GitHub Actions, dbt, etc., you can configure the collector to run automatically and keep your catalog updated following modifications to your data sources.
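
The following is a minimal sketch of the scheduled approach using cron on the machine that hosts Docker. The script path /opt/dwcc/run-collector.sh and the log location are placeholders for a shell script that wraps the docker run command shown earlier; adjust the schedule and paths to your environment.

# Example crontab entry: run the collector daily at 02:00 and append output to a log file
0 2 * * * /opt/dwcc/run-collector.sh >> /var/log/dwcc-collector.log 2>&1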