Running the Tableau collector on-premise
Note
The latest version of the Collector is 2.253. To view the release notes for this version and all previous versions, please go here.
Ways to run the data.world Collector
There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:
Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog as you don't need to run multiple scripts or CLI commands separately.
Run the collector through a CLI - Repeat runs of the collector require you to re-enter the command for each run.
Note
This section walks you through the process of running the collector using the CLI.
Preparing and running the command
The easiest way to create your Collector command is to:
Copy the following example command in a text editor.
Set the required parameters in the command. The example command includes the minimal parameters required to run the collector.
Open a terminal window in any Unix environment that uses a Bash shell, paste the command, and run it.
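Before pasting the command, it can help to confirm that the shell has everything the sample commands expand. The following is a minimal sketch, not part of the collector itself: `prepare_host_dir` and `check_required_vars` are hypothetical helper names. It assumes the host directory `${HOME}/dwcc` from the samples, and relies on the fact that Docker rejects a `--mount type=bind` whose source directory does not exist.

```shell
# Hypothetical helper: create the host directory used as the bind-mount source.
# Docker refuses a --mount type=bind whose source path is missing.
prepare_host_dir() {
  mkdir -p "$1"
}

# Hypothetical helper: report every named environment variable that is unset
# or empty. Returns 0 when all are set, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the username and password sample, this could be called as `prepare_host_dir "${HOME}/dwcc"` followed by `check_required_vars DW_AUTH_TOKEN DW_TABLEAU_PASSWORD`; for the PAT sample, replace `DW_TABLEAU_PASSWORD` with `DW_TABLEAU_SECRET`.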
Sample command with username and password parameters.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.124 \
  catalog-tableau-preview --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-username=8bank-user --tableau-password=${DW_TABLEAU_PASSWORD} \
  --tableau-api-base-url="http://8bank/api/3.10/" --tableau-skip-images=false
Sample command with PAT parameters.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=/app/log datadotworld/dwcc:2.124 \
  catalog-tableau-preview --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-pat-name=8bank_token \
  --tableau-pat-secret=${DW_TABLEAU_SECRET} --tableau-api-base-url="http://8bank/api/3.10/" \
  --tableau-skip-images=false
The following table describes the parameters for the command. Detailed information about the Docker portion of the command can be found here.
Parameter | Details | Required? |
---|---|---|
-t= <apiToken> --api-token= <apiToken> | The data.world API token to use for authentication. Default is to use an environment variable named ${DW_AUTH_TOKEN}. | Yes |
-n= <catalogName> --name= <catalogName> | Specify the collection name where the collector output will be saved. Ensure you use a distinct collection for each collector. | Yes |
-o= <outputDir> --output= <outputDir> | The output directory into which any catalog files should be written. | Yes |
--output-name | Specify the collector output file name to override the default file name. The system automatically adds .dwec.ttl to the end of the provided file name. | No |
--upload-location= <uploadLocation> | The name of the dataset to which the catalog file is uploaded (for example, ddw-catalogs, as in the sample commands). | Yes |
-H= <apiHost> --api-host= <apiHost> | The host for the data.world API. NOTE: This parameter is required for single-tenant installations. For example, "api.site.data.world" where "site" is the name of the single-tenant install. | Yes (for single-tenant installations) |
-a= <agent> --agent= <agent> --account= <agent> | The ID for the data.world account into which you will load this catalog - this is used to generate the namespace for any URIs generated. | Yes |
--site= <site> | This parameter should be set only for Private instances. Do not set it for public instances and single-tenant installations. Required for private instance installations. | Yes (required for private instance installations) |
-U --upload | Whether to upload the generated catalog to the organization account's catalogs dataset. | Yes |
dwcc: <CollectorVersion> | The version of the collector you want to use (For example, | Yes |
--tableau-api-base-url= <baseUrl> | Base URL of the Tableau API. For example: http://8bank/api/3.10/ | Yes |
--tableau-username= <username> | Tableau username for authentication. | Yes (or set the personal access token (PAT) parameters instead) |
--tableau-password= <password> | Tableau password for authentication. | |
--tableau-pat-name= <personalAccessTokenName> | Tableau personal access token name. | Yes (or set the username and password parameters instead) |
--tableau-pat-secret= <personalAccessTokenSecret> | Tableau personal access token secret | |
--tableau-skip-images= <true_or_false> | Whether to skip the cataloging of preview images. | No |
--tableau-site= <siteID> | ID or name of the Tableau site to catalog (if not provided, will catalog all sites accessible to the user). | Required if you are using Tableau Cloud. |
--tableau-graphql-page-size= <size> | Page size to use for paginated GraphQL (Metadata API) queries. | No |
--tableau-project= <projectId> | ID or name of the Tableau project to catalog. If not provided, will catalog all projects. Use the parameter multiple times for multiple projects. Note: Sub-projects (projects nested within another project) must be specified individually. | No |
--tableau-exclude= <exclusions> | Exclude Tableau object types from being cataloged. The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric. Use the parameter multiple times to exclude multiple object types. | No |
--tableau-exclude-project= <excludedProjectIds> | Specify the ID or name of a Tableau project to exclude. If not provided, will catalog all projects. If provided, overrides any projects specified with --tableau-project. Use the parameter multiple times for multiple projects. For example, --tableau-exclude-project="projectA" --tableau-exclude-project="projectB". | No |
--tableau-filter-descendant-project | Specify whether to apply the Tableau projects (--tableau-project) and Tableau project exclude (--tableau-exclude-project) options to descendant projects. | No |
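Several of the filter parameters above are repeated once per value. As an illustrative sketch only, the loop below builds the repeated `--tableau-project` and `--tableau-exclude` flags in the shell's positional parameters. The project names `Finance` and `Marketing` are hypothetical; the object types come from the supported list in the table.

```shell
# Start with an empty argument list.
set --

# One --tableau-project flag per project (hypothetical project names).
for project in "Finance" "Marketing"; do
  set -- "$@" "--tableau-project=${project}"
done

# One --tableau-exclude flag per object type to skip.
for object_type in Metric CustomSQLTable; do
  set -- "$@" "--tableau-exclude=${object_type}"
done

# Print the flags, one per line, for review before running the collector.
printf '%s\n' "$@"
```

Appending `"$@"` to the end of the docker command then passes each flag as a separate argument, which also keeps project names containing spaces intact.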
Automating updates to your metadata catalog
Maintaining an up-to-date metadata catalog is crucial and can be achieved by employing Azure Pipelines, CircleCI, or any automation tool of your preference to execute the catalog collector regularly.
There are two primary strategies for setting up the collector run times:
Scheduled: You can run the collector on a fixed schedule, for instance daily or weekly, based on how often metadata changes in your data source and how quickly the business needs the updated metadata. Account for the completion time of the collector run (which depends on the size of the source) and the time required to load the collector's output into your catalog. We recommend scheduling the collector run during off-peak times for optimal performance.
Event-triggered: If you have set up automations that refresh the data in a source technology, you can set up the collector to execute whenever the upstream jobs complete successfully. For example, if you're using Airflow, GitHub Actions, dbt, etc., you can configure the collector to run automatically and keep your catalog updated following modifications to your data sources.
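The event-triggered strategy can be sketched in plain shell as a guard that runs the collector only when the upstream job exits successfully. This is an assumption-laden illustration rather than a data.world feature: `run_collector_after` is a hypothetical function, and both of its arguments are names of commands or functions that you would define yourself (for example, a function wrapping the docker command shown earlier).

```shell
# Hypothetical helper: run the collector only when the upstream job succeeds.
#   $1 - name of the command or function that refreshes the data source
#   $2 - name of the command or function that runs the collector
run_collector_after() {
  upstream_cmd=$1
  collector_cmd=$2
  if "$upstream_cmd"; then
    "$collector_cmd"
  else
    echo "Upstream job failed; skipping collector run" >&2
    return 1
  fi
}
```

For the scheduled strategy, a wrapper script containing the collector command could instead be registered with cron. For example, the crontab entry `0 2 * * * /path/to/run-collector.sh` (a hypothetical path) would start it every day at 02:00.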