Tableau and the data.world Collector
Note
The latest version of the Collector is 2.138. To view the release notes for this version and all previous versions, please go here.
About the collector
Use this collector to:
Discover Tableau workbooks and dashboards across your enterprise
Perform impact analysis to understand how changes to upstream data sources impact Tableau reports
Tableau version supported
The collector supports Tableau API versions 3.7-3.10 on Tableau Server v 2022.1
Authentication supported
The Tableau collector supports the following methods for authentication:
Username and password
Personal access token (PAT)
These authentication details are used while generating the CLI or YAML file for the collector.
What is cataloged
The collector catalogs the following information.
Object | Information cataloged |
---|---|
Databases | Name, Description, Database Connection Type |
Database tables | Name |
Database columns | Name |
Projects | Name, Description |
Workbooks | Name, Description, Creator Email, Creator Name, Creator Tableau User, Preview Image, and Workbook URL |
Dashboards | Name, Creator Email, Creator Name, Creator Tableau User, Preview Image, and Dashboard URL |
Views | Name, Creator Email, Creator Name, Creator Tableau User, Number of Views, Number of Favorites, Preview Image, and View URL |
Fields | Name, Identifier, Description |
Calculated fields | Name, Description, Calculation Formula |
Dimensions | Name, Identifier, Description |
Measures | Name, Identifier, Description |
Metrics | Name, Identifier, Creator, Creation Date, Modified Date, Metrics URL, Field Data Type, Field Format, Field Type |
Custom SQL tables | Name, Identifier, Description, Query |
Embedded data sources | Name, Identifier, Description |
Published data sources | Name, Identifier, Description |
Relationships between objects
By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.
Resource page | Relationship |
---|---|
Databases | |
Database tables | |
Database columns | |
Projects | |
Workbooks | |
Dashboards | |
Views | |
Fields | |
Calculated fields | |
Dimensions | |
Measures | |
Custom SQL tables | |
Embedded data sources | |
Published data sources | |
Lineage for Tableau
Object | Lineage available |
---|---|
Database columns and tables | Fields that use database columns and tables |
Dashboards | Fields and tables that dashboards source their data from |
Views | Fields and tables that views source their data from |
Fields | Columns, tables, and other fields that a field uses its data from |
Important things to note about improving the performance of collector runs
Depending on the size of your Tableau instance, you may want to exclude or include specific resources from your catalog.
Exclude object types: Use the --tableau-exclude parameter to exclude harvesting of certain object types. The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric
Filter to specific Tableau site: Use the --tableau-site parameter to filter to a specific site.
Filter to specific Tableau projects: Use the --tableau-project parameter to harvest only from specific Tableau projects. Use the parameter multiple times for multiple projects.
GraphQL page size: Use the --tableau-graphql-page-size parameter to adjust the GraphQL page size. The maximum page size is 1000.
Increase Docker resources: If you run into out-of-memory errors, increase the memory on the machine running the collector, increase the Java heap size when running a jar file, or use the filtering options described in this list.
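As a rough illustration of how the repeatable filtering parameters combine, the sketch below prints one flag per value so the result can be appended to the command generated for your setup. The repeat_flag helper and the project names Finance and Marketing are hypothetical; only the parameter names and object types come from this page.

```shell
# Print "<flag>=<value>" once per value, one per line.
# Usage: repeat_flag <flag> <value> [<value> ...]
repeat_flag() {
  flag="$1"
  shift
  for value in "$@"; do
    printf '%s=%s\n' "$flag" "$value"
  done
}

# Hypothetical example: harvest two projects and skip two object types.
repeat_flag --tableau-project "Finance" "Marketing"
repeat_flag --tableau-exclude "Metric" "BinField"
```

Each printed line is a separate argument, which matches the guidance to use the parameter multiple times rather than passing a comma-separated list.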
Setting up Tableau
Enabling Metadata API
The collector harvests Tableau content using the Tableau Metadata API. The Metadata API is always enabled for Tableau Cloud. However, it is disabled by default for Tableau Server.
Important
For detailed instructions, see the Tableau documentation.
Prerequisites:
You must be on Tableau Server 2019.3 or later.
The REST API must not be disabled.
You must use an account with server admin role to enable the Metadata API on Tableau Server using the Tableau Services Manager (TSM) command line interface (CLI).
To enable the Metadata API:
Open a command prompt as an admin on the initial node where TSM is installed in the cluster.
Run the following command:
tsm maintenance metadata-services enable
If you do not have the Data Management license, you will need to enable derived permissions to see related external assets. For details see the Tableau Cloud documentation and Tableau Server documentation.
Setting up permissions
Set up a new user in Tableau with the Server Admin role.
Create a Personal Access Token (PAT) for the new user. See Tableau docs for details.
Setting up pre-requisites for running the collector
Make sure that the machine from where you are running the collector meets the following hardware and software requirements.
Item | Requirement |
---|---|
Hardware | |
RAM | 8 GB |
CPU | 2 GHz processor |
Software | |
Docker | Click here to get Docker. |
Java Runtime Environment | OpenJDK 17 is supported and available here. |
data.world specific objects | |
Dataset | You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector. |
Generating the command or YAML file
This section walks you through the process of generating the command or YAML file for running the collector from Windows, Linux, or macOS.
To generate the command or YAML file:
On the Organization profile page, go to the Settings tab > Metadata collectors section.
Click the Help me set up a collector button.
On the On-prem collector setup prerequisites screen, read the pre-requisites and click Next.
On the On which platform will this collector execute? screen, select whether you will be running the collector on Windows, macOS, or Linux. This determines the format of the YAML and CLI that is generated at the end. Click Next.
On the Choose metadata collector type you would like to setup screen, select Tableau. Click Next.
On the Configure a new on-premises Tableau Collector screen, set the following properties and click Next.
On the next screen, set the following properties and click Next.
Table 3.
Field name | Corresponding parameter name | Description | Required? |
---|---|---|---|
Authentication | | Select one of the following options and set the corresponding authentication details: Authenticate using Username and Password, or Authenticate using a private access token (PAT). | Yes |
Authenticate using Username and Password | | | Yes (or set the Authenticate using a private access token (PAT) parameters) |
Username | --tableau-username=<username> | Tableau username for authentication. | |
Password | --tableau-password=<password> | Tableau password for authentication. | |
Authenticate using a private access token (PAT) | | | Yes (or set the Authenticate using Username and Password parameters) |
PAT Name | --tableau-pat-name=<personalAccessTokenName> | Tableau personal access token name. | |
PAT Secret | --tableau-pat-secret=<personalAccessTokenSecret> | Tableau personal access token secret. | |
Tableau Base URL | --tableau-api-base-url=<baseUrl> | Base URL of the Tableau API. For example: http://8bank/api/3.10/ | Yes |
Tableau Site | --tableau-site=<siteID> | ID or name of the Tableau site to catalog (if not provided, will catalog all sites accessible to the user). | No |
Skip cataloging of preview images | --tableau-skip-images=<true_or_false> | Whether to skip the cataloging of preview images. | No |
Tableau Projects | --tableau-project=<projectId> | ID or name of the Tableau project to catalog. If not provided, will catalog all projects. Use the parameter multiple times for multiple projects. Note: Sub-projects (projects nested within another project) must be specified individually. | No |
Exclude Object Types | --tableau-exclude=<exclusions> | Exclude Tableau object types from being cataloged. The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric. Use the parameter multiple times to exclude multiple object types. | No |
Tableau Pagination Size | --tableau-graphql-page-size=<size> | Page size to use for paginated GraphQL (Metadata API) queries. | No |
On the Finalize your Tableau Collector configuration screen, you are notified about the environment variables and directories you need to set up for running the collector. Select whether you want to generate a Configuration file (YAML) or Command line arguments (CLI). Click Next.
Important
You must ensure that you have set up these environment variables and directories before you run the collector.
The next screen gives you an option to download the YAML configuration file or copy the CLI command. Click Done. If you are generating a YAML file, click Next.
Sample YAML file.
The Tableau command screen gives you the command to use for running the collector using the YAML file.
You will notice that the YAML/CLI has the following additional parameters that are automatically set for you.
Important
Except for the collector version, you should not change the values of any of the parameters listed here.
Table 4.
Parameter name | Details | Required? |
---|---|---|
-a=<agent>, --agent=<agent>, --account=<agent> | The ID for the data.world account into which you will load this catalog. This is used to generate the namespace for any URIs generated. | Yes |
--site=<site> | This parameter should be set only for Private instances. Do not set it for public instances and single-tenant installations. Required for private instance installations. | Yes (required for private instance installations) |
-U, --upload | Whether to upload the generated catalog to the organization account's catalogs dataset. | Yes |
-L, --no-log-upload | Do not upload the log of the Collector run to the organization account's catalogs dataset. | Yes |
dwcc:<CollectorVersion> | The version of the collector you want to use (for example, datadotworld/dwcc:2.113). | Yes |
Add the following additional parameter to test run the collector.
--dry-run: If specified, the collector does not actually harvest any metadata, but just checks the database connection parameters provided by the user and reports success or failure at connecting.
Verifying environment variables and directories
Verify that you have set up all the required environment variables that were identified by the Collector Wizard before running the collector. Alternatively, you can set these credentials in a credential vault and use a script to retrieve those credentials.
Verify that you have set up all the required directories that were identified by the Collector Wizard.
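The two checks above can be scripted so that a run stops early when something is missing. This is a minimal sketch: the require_env and require_dir helpers are hypothetical, and the variable names and the ${HOME}/dwcc directory are taken from the sample commands on this page, so substitute whatever the Collector Wizard listed for your setup.

```shell
# Report every named environment variable that is unset or empty.
# Returns non-zero if any are missing.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Return non-zero if the given directory does not exist.
require_dir() {
  [ -d "$1" ] || { echo "Missing directory: $1" >&2; return 1; }
}

# Example usage before launching the collector:
# require_env DW_AUTH_TOKEN DW_TABLEAU_PASSWORD || exit 1
# require_dir "${HOME}/dwcc" || exit 1
```

If the credentials come from a vault, the retrieval script can export the variables first and then call require_env as a final check.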
Running the collector
Important
Before you begin running the collector make sure you have the correct version of collectors downloaded and available.
Running collector using YAML file
Go to the server where you have set up Docker to run the collector.
Make sure you have downloaded the correct version of the collector. This version should match the version of the collector specified in the command you are using to run the collector.
Place the YAML file generated from the Collector wizard in the correct directory.
From the command line, run the command generated from the application for executing the YAML file.
Caution
Note that this is just a sample command for showing the syntax. You must generate the command specific to your setup from the application UI.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} \
  -e DW_TABLEAU_PASSWORD=${DW_TABLEAU_PASSWORD} datadotworld/dwcc:2.124 \
  --config-file=/dwcc-output/config-tableau.yml
The collector automatically uploads the file to the specified dataset and you can also find the output at the location you specified while running the collector.
At a later point, if you download a newer version of collector from Docker, you can edit the collector version in the generated command to run the collector with the newer version.
Running collector without the YAML file
Go to the server where you have set up Docker to run the collector.
Make sure you have downloaded the version of the collector from here. This version should match the version of the collector specified in the command you are using to run the collector.
From the command line, run the command generated from the application. Here is a sample command.
Caution
Note that these are just sample commands for showing the syntax. You must generate the command specific to your setup from the application UI.
Sample command with username and password parameters.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc datadotworld/dwcc:2.124 \
  catalog-tableau --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-username=8bank-user --tableau-password=${DW_TABLEAU_PASSWORD} \
  --tableau-api-base-url="http://8bank/api/3.10/" --tableau-skip-images=false
Sample command with PAT parameters.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc datadotworld/dwcc:2.124 \
  catalog-tableau --agent=8bank-catalog-sources --site=solutions \
  --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
  --name=8bank-catalog-sources-collection --output=/dwcc-output \
  --upload-location=ddw-catalogs --tableau-pat-name=8bank_token \
  --tableau-pat-secret=${DW_TABLEAU_SECRET} --tableau-api-base-url="http://8bank/api/3.10/" \
  --tableau-skip-images=false
The collector automatically uploads the file to the specified dataset and you can also find the output at the location you specified while running the collector.
At a later point, if you download a newer version of collector from Docker, you can edit the collector version in the generated command to run the collector with the newer version.
Overview
Some enterprise systems support the use of Secure Sockets Layer (SSL) encrypted communications on all external traffic. If you are harvesting metadata from a source system that requires SSL, you will need to add a CA certificate or self-signed certificate.
Obtaining the Custom SSL Certificate
Obtain the root certificate for your source system issued by your company. Typically your system administrator should be able to provide you with this.
Extending Docker to use custom SSL certificates
If the collector is run via Docker, extend the Docker image and install the custom certificate.
STEP 1: Prepare the Docker File
First, prepare a Dockerfile with the instructions for Docker to install the custom certificate and extend the Docker image.
Ensure you are on the machine where you have downloaded the Docker Image and plan to execute the Collector.
In a directory create the new Dockerfile with the following parameters for your custom SSL Certificate:
FROM datadotworld/dwcc:<collector_version>
ADD ./<custom_certificate_file_path> <custom_certificate_file_name>
RUN keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file <custom_certificate_file_name>
Replace <collector_version> with the version of the Collector you want to use (For example, datadotworld/dwcc:2.120)
Replace <custom_certificate_file_path> with the path to the custom SSL Certificate.
Replace <custom_certificate_file_name> with the name of your custom SSL Certificate file.
For example, the Dockerfile will look like:
FROM datadotworld/dwcc:2.120
ADD ./ca.der certificate
RUN keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file certificate
STEP 2: Install the certificate and extend the docker image
Next, build from the Dockerfile to install the certificate and extend the data.world Collector Docker image.
Using your terminal of choice, ensure you are in the directory containing the Dockerfile created in step 1.
Next, create the new extended Docker image, called dwcc-cert in this example, by executing the following command:
docker build -t dwcc-cert .
Important things to note:
The command must be all lowercase.
The command must include the period (.) at the end, which directs Docker to use the local directory for the Dockerfile created above.
For the new image, the command uses the name dwcc-cert. You can change the name if you want.
STEP 3: Run collector using the custom certificate
Finally, run the collector using the custom Certificate.
Get the standard docker run command for the Data Source you are collecting from.
Change the docker run command to use dwcc-cert image instead of dwcc image.
Sample command for Tableau.
docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
  --mount type=bind,source=/tmp,target=/app/log dwcc-cert \
  catalog-tableau --tableau-api-base-url <baseUrl> \
  --tableau-password <password> --tableau-username <username> \
  -a <account> -n <catalogName> -o "/dwcc-output"
If you are using a YAML file for running the collector, edit the command to use the dwcc-cert image instead of the dwcc image.
docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
  --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} \
  -e DW_TABLEAU_PASSWORD=${DW_TABLEAU_PASSWORD} dwcc-cert \
  --config-file=/dwcc-output/config-tableau.yml
Adding custom SSL certificates when using jar
If the collector is run via jar, add the certificate to the JVM truststore.
From the terminal, navigate to the directory containing the certificate.
Run the following command to add the SSL certificate to the truststore:
keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file <custom_certificate_file_path>
Replace <custom_certificate_file_path> with the path to the custom SSL Certificate.
For example, the command will look like:
keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file ca.der
Finally, run the collector using the original jar file command. Note that this command does not need any modifications.
Troubleshooting SSL certificate issues
Issue
The following error occurs while running the collector:
Caused by: javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Description
There was an issue connecting to the source system using the SSL certificate.
Solution
Check to make sure the SSL certificate has not expired.
Ensure you have the correct SSL certificate for the source system.
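One way to check the expiry date is with the OpenSSL command line tool, assuming it is installed on the machine that holds the certificate. The cert_is_current and cert_expiry helpers below are hypothetical names, and ca.der is the example file name used earlier on this page. This is a sketch of the check, not part of the collector itself.

```shell
# Succeed only if the certificate has not expired yet.
# $1: path to the certificate; $2: encoding, DER (default) or PEM.
cert_is_current() {
  openssl x509 -inform "${2:-DER}" -in "$1" -noout -checkend 0 >/dev/null
}

# Print the certificate's expiry date as reported by OpenSSL (notAfter=...).
cert_expiry() {
  openssl x509 -inform "${2:-DER}" -in "$1" -noout -enddate
}

# Example usage:
# cert_expiry ca.der
# cert_is_current ca.der || echo "Certificate has expired" >&2
```

If the certificate is current and the error persists, the certificate is most likely not the root for the source system, or it was imported into a different truststore than the one the collector uses.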
Common troubleshooting tasks
Collector runtime and troubleshooting
The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should appear in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both can be sent to support for investigation if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.
Issue 1: The collector is taking a long time to harvest from Tableau
Cause: The large size of a Tableau environment results in a long time to harvest metadata.
Solution: Increase the GraphQL page size (--tableau-graphql-page-size) to reduce the number of API calls.
Issue 2: Out of memory errors while running the collector
Cause: There is not enough memory allocated to the collector on the machine where the collector is running. Note that the collector may run for a long time even if there is sufficient memory due to the size of the source system.
Solution: Increase memory of the machine running the collector. Run docker system info to see the total memory available for Docker. The collector will use 80% of what is allocated to the container. You can run docker stats to see how much memory is used by the container when the collector runs.
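The two Docker commands mentioned in this solution can be wrapped in a small helper for quick checks. This is a sketch under the assumption that the Docker CLI is on the PATH. The docker_total_memory_mib function name is hypothetical, and MemTotal is the field Docker exposes in its system info output.

```shell
# Print the total memory available to Docker, in MiB.
docker_total_memory_mib() {
  bytes="$(docker system info --format '{{.MemTotal}}')" || return 1
  echo "$(( bytes / 1024 / 1024 ))"
}

# Example usage:
# docker_total_memory_mib
# While the collector is running, take a one-off snapshot of container usage:
# docker stats --no-stream
```

Comparing the snapshot against the total shows whether the container is approaching its limit before an out-of-memory error occurs.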
Issue 3: Info message "Authentication to Tableau API failed, reauthenticating" observed in log file
Cause: The connection to Tableau API expired due to a long run.
Solution: No action is required as the collector re-authenticates automatically to Tableau.
Issue 4: Partial results error observed
The following error message is observed: Showing partial results. The request exceeded the ‘n’ node limit. Use pagination, additional filtering, or both in the query to adjust results.
Cause: When you increase the GraphQL page size, you may run into warning messages in the logs due to nested queries.
Solution: Try a smaller page size or increase the max node limit. Increase the max node limit by setting metadata.query.limits.count which defaults to 20,000.
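On Tableau Server, configuration keys such as metadata.query.limits.count are normally changed with the TSM command line interface. The sketch below assumes the tsm configuration set and tsm pending-changes apply commands and the --force-keys option as described in the Tableau Server documentation; verify them against the documentation for your server version. The set_metadata_node_limit helper is a hypothetical name and 40000 is an arbitrary illustrative value, not a recommendation.

```shell
# Raise the Metadata API node limit on Tableau Server.
# Run on the initial node with an account that can use TSM.
set_metadata_node_limit() {
  limit="$1"
  tsm configuration set -k metadata.query.limits.count -v "$limit" --force-keys &&
    tsm pending-changes apply
}

# Example usage (illustrative value only):
# set_metadata_node_limit 40000
```

Applying pending changes may restart Tableau Server services, so schedule this change accordingly.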
Automating updates to your metadata catalog
Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:
Frequency of changes to the schema
Business criticality of up-to-date data
For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
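As one possible way to schedule runs with cron, the sketch below wraps the YAML-based sample command from this page in a function and shows a matching crontab entry in the comments. The script path, log path, DWCC_VERSION variable, and run_collector function name are all hypothetical, and the docker command should be replaced with the one generated for your setup. The -it options are left out because cron jobs have no terminal attached.

```shell
# run-tableau-collector.sh: wrapper intended to be called from cron.
# Example crontab entry for a daily run at 02:00 (paths are hypothetical):
#   0 2 * * * /opt/dwcc/run-tableau-collector.sh >> /var/log/dwcc-tableau.log 2>&1

run_collector() {
  echo "Starting Tableau collector run: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  # Replace with the command generated for your setup by the application UI.
  docker run --rm \
    --mount type=bind,source="${HOME}/dwcc",target=/dwcc-output \
    -e DW_AUTH_TOKEN="${DW_AUTH_TOKEN}" \
    -e DW_TABLEAU_PASSWORD="${DW_TABLEAU_PASSWORD}" \
    "datadotworld/dwcc:${DWCC_VERSION:-2.124}" \
    --config-file=/dwcc-output/config-tableau.yml
}

# Uncomment when saving this as a script:
# run_collector
```

Cron runs with a minimal environment, so DW_AUTH_TOKEN and DW_TABLEAU_PASSWORD need to be set inside the script or loaded from your credential store before run_collector is called.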