Running the Tableau collector on-premise

Important

This collector is available in Private Preview. If you would like access to this collector, please contact your Customer Success Director.

Note

The latest version of the Collector is 2.256. See the release notes for details on this version and all previous versions.

Ways to run the data.world Collector

There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.

  • Run the collector through a CLI - Repeat runs of the collector require you to re-enter the command for each run.

Note

This section walks you through the process of running the collector using the CLI.

Preparing and running the command

The easiest way to create your Collector command is to:

  1. Copy the following example command into a text editor.

  2. Set the required parameters in the command. The example command includes the minimal parameters required to run the collector.

  3. Open a terminal window in any Unix environment that uses a Bash shell, paste the command into it, and run it.

Sample command with username and password parameters.

docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.124 \
  catalog-tableau-preview --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-username=8bank-user --tableau-password=${DW_TABLEAU_PASSWORD} \
  --tableau-api-base-url="http://8bank/api/3.10/" --tableau-skip-images=false

Sample command with PAT parameters.

docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.124 \
  catalog-tableau-preview --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-pat-name=8bank_token \
  --tableau-pat-secret=${DW_TABLEAU_SECRET} --tableau-api-base-url="http://8bank/api/3.10/" \
  --tableau-skip-images=false
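
Both sample commands read secrets from environment variables and bind-mount ${HOME}/dwcc. Docker's --mount type=bind option fails if the source directory does not exist, so it can help to check these prerequisites before running the collector. The following is a minimal sketch, assuming the variable names used in the samples above; the function name preflight is hypothetical and is not part of the collector.

```shell
# Pre-flight check before running the collector (illustrative sketch only).
# Variable names come from the sample commands; `preflight` is a hypothetical helper.
preflight() {
  # Output directory to bind-mount; defaults to the one used in the samples.
  out_dir="${1:-${HOME}/dwcc}"

  # The data.world API token is used by both sample commands.
  if [ -z "${DW_AUTH_TOKEN:-}" ]; then
    echo "DW_AUTH_TOKEN is not set" >&2
    return 1
  fi

  # Either the Tableau password or the PAT secret must be available.
  if [ -z "${DW_TABLEAU_PASSWORD:-}" ] && [ -z "${DW_TABLEAU_SECRET:-}" ]; then
    echo "Set DW_TABLEAU_PASSWORD or DW_TABLEAU_SECRET" >&2
    return 1
  fi

  # --mount type=bind requires the source directory to exist already.
  mkdir -p "${out_dir}"
}
```

Calling preflight before the docker run command (for example, preflight && docker run ...) stops the run early with a readable message instead of a Docker or authentication error.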

The following table describes the parameters for the command. Detailed information about the Docker portion of the command can be found in Introduction to Docker.

Table 1.

| Parameter | Details | Required? |
|---|---|---|
| -t=<apiToken>, --api-token=<apiToken> | The data.world API token to use for authentication. Default is to use an environment variable named ${DW_AUTH_TOKEN}. | Yes |
| -n=<catalogName>, --name=<catalogName> | Specify the collection name where the collector output will be saved. Ensure you use a distinct collection for each collector. | Yes |
| -o=<outputDir>, --output=<outputDir> | The output directory into which any catalog files should be written. | Yes |
| --output-name | Specify the collector output file name to override the default file name. The system automatically adds .dwec.ttl to the end of the provided file name. | No |
| --upload-location=<uploadLocation> | Enter the name of your dataset to display a list of available datasets. From this list, select the dataset where you want to upload the catalog file. By default, the search is restricted to the organization you are in. To search across all organizations you have access to, uncheck the Limit search results to this organization option. | Yes |
| -H=<apiHost>, --api-host=<apiHost> | The host for the data.world API. NOTE: This parameter is required for single-tenant installations. For example, "api.site.data.world" where "site" is the name of the single-tenant install. | Yes (for single-tenant installations) |
| -a=<agent>, --agent=<agent>, --account=<agent> | The ID for the data.world account into which you will load this catalog. This ID is used to generate the namespace for any URIs generated. | Yes |
| --site=<site> | This parameter should be set only for private instances. Do not set it for public instances and single-tenant installations. | Yes (required for private instance installations) |
| -U, --upload | Whether to upload the generated catalog to the organization account's catalogs dataset. | Yes |
| dwcc:<CollectorVersion> | The version of the collector you want to use (for example, datadotworld/dwcc:2.168). | Yes |
| --tableau-api-base-url=<baseUrl> | Base URL of the Tableau API. For example: http://8bank/api/3.10/ | Yes |
| --tableau-username=<username> | Tableau username for authentication. | Yes (or set the personal access token (PAT) parameters) |
| --tableau-password=<password> | Tableau password for authentication. | Yes (or set the personal access token (PAT) parameters) |
| --tableau-pat-name=<personalAccessTokenName> | Tableau personal access token name. | Yes (or set the username and password parameters) |
| --tableau-pat-secret=<personalAccessTokenSecret> | Tableau personal access token secret. | Yes (or set the username and password parameters) |
| --tableau-skip-images=<true_or_false> | Whether to skip the cataloging of preview images. | No |
| --tableau-site=<siteID> | ID or name of the Tableau site to catalog. If not provided, the collector catalogs all sites accessible to the user. | Required if you are using Tableau Cloud. |
| --tableau-graphql-page-size=<size> | Page size to use for paginated GraphQL (Metadata API) queries. | No |
| --tableau-project=<projectId> | ID or name of the Tableau project to catalog. If not provided, the collector catalogs all projects. Use the parameter multiple times for multiple projects. Note: Sub-projects (projects nested within another project) must be specified individually. | No |
| --tableau-exclude=<exclusions> | Exclude Tableau object types from being cataloged. The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric. Use the parameter multiple times to exclude multiple object types. | No |
| --tableau-exclude-project=<excludedProjectIds> | Specify the ID or name of a Tableau project to exclude. If not provided, the collector catalogs all projects. If provided, overrides any projects specified with --tableau-project. Use the parameter multiple times for multiple projects. For example, --tableau-exclude-project="projectA" --tableau-exclude-project="projectB". | No |
| --tableau-filter-descendant-project | Specify whether to apply the Tableau projects (--tableau-project) and Tableau project exclude (--tableau-exclude-project) options to descendant projects. | No |
| --tableau-catalog-personal-space-workbooks | Specify whether to catalog personal space workbooks. Personal space workbooks are those workbooks that are not associated with a project. | No |
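
Several parameters, such as --tableau-project, --tableau-exclude, and --tableau-exclude-project, are repeated once per value. The sketch below expands a list of values into repeated flags that can be appended to the collector command. The helper name repeat_flag and the project names are hypothetical and used only for illustration; the object types come from the supported list for --tableau-exclude.

```shell
# Expand a list of values into one flag per value (illustrative sketch).
# `repeat_flag` is a hypothetical helper, not part of the collector.
# Values must not themselves contain double quotes.
repeat_flag() {
  flag="$1"
  shift
  for value in "$@"; do
    printf ' %s="%s"' "${flag}" "${value}"
  done
}

# Hypothetical project names; Metric and BinField are supported object types.
project_flags="$(repeat_flag --tableau-project "Sales" "Finance Reports")"
exclude_flags="$(repeat_flag --tableau-exclude Metric BinField)"

echo "Append to the collector command:${project_flags}${exclude_flags}"
```

Because sub-projects must be specified individually, each nested project needs its own entry in the list passed to the helper.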



Automating updates to your metadata catalog

To keep your metadata catalog up to date, use Azure Pipelines, CircleCI, or any automation tool of your choice to run the collector regularly.

There are two primary strategies for setting up the collector run times:

  • Scheduled: Run the collector on a fixed schedule, for instance daily or weekly, based on how often metadata changes in your data source and how quickly the business needs updated metadata. Account for the time the collector run takes to complete (which depends on the size of the source) and the time required to load the collector's output into your catalog. We recommend scheduling the collector run during off-peak times for optimal performance.

  • Event-triggered: If you have set up automations that refresh the data in a source technology, you can configure the collector to run whenever the upstream jobs complete successfully. For example, if you use Airflow, GitHub Actions, or dbt, you can configure the collector to run automatically after your data sources are modified, keeping your catalog up to date.
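
Both strategies can be sketched in a few lines of shell. This assumes the full docker run command has been saved in an executable script. The script path /opt/dwcc/run-collector.sh, the 02:00 schedule, the helper name run_after, and the upstream job are all hypothetical; adapt them to your own automation tool.

```shell
# Scheduled: a crontab entry (added with `crontab -e`) that runs the saved
# collector script every day at 02:00, an assumed off-peak time.
# The script path is hypothetical.
#   0 2 * * * /opt/dwcc/run-collector.sh >> "$HOME/dwcc/cron.log" 2>&1

# Event-triggered: run the collector only if the upstream job succeeds.
# `run_after` is a hypothetical helper. Its first argument is the collector
# command; the remaining arguments are the upstream command and its arguments.
run_after() {
  collector="$1"
  shift
  if "$@"; then
    "${collector}"
  else
    status=$?
    echo "Upstream job failed (exit ${status}); skipping collector run" >&2
    return "${status}"
  fi
}

# Example usage with a hypothetical upstream job:
#   run_after /opt/dwcc/run-collector.sh dbt run
```

The helper returns the upstream job's exit code when it fails, so a surrounding pipeline can still report the original failure.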