Tableau and the data.world Collector

Note

The latest version of the Collector is 2.138. To view the release notes for this version and all previous versions, please go here.

About the collector

Use this collector to:

  • Discover Tableau workbooks and dashboards across your enterprise

  • Perform impact analysis to understand how changes to upstream data sources impact Tableau reports

Tableau version supported

  • The collector supports Tableau API versions 3.7-3.10 on Tableau Server v 2022.1
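A quick way to sanity-check a base URL against the supported range is sketched below. It assumes the base URL carries the API version after /api/, as in the http://8bank/api/3.10/ example used later in this document; the function name api_version_supported is hypothetical and is not part of the collector.

```shell
# Hypothetical helper (not part of the collector): succeeds only when the
# API version embedded in the base URL is within the documented 3.7-3.10 range.
api_version_supported() {
  local url="$1" version major minor
  # Extract the "<major>.<minor>" segment that follows /api/.
  version="$(printf '%s\n' "$url" | sed -n 's#.*/api/\([0-9][0-9]*\.[0-9][0-9]*\).*#\1#p')"
  [ -n "$version" ] || return 1
  major="${version%%.*}"
  minor="${version#*.}"
  # Compare numerically so that 3.10 is treated as newer than 3.7.
  [ "$major" -eq 3 ] && [ "$minor" -ge 7 ] && [ "$minor" -le 10 ]
}

if api_version_supported "http://8bank/api/3.10/"; then
  echo "API version is in the supported range"
else
  echo "API version is outside the supported range"
fi
```

The comparison is done on the minor number rather than as a decimal, because 3.10 would otherwise sort below 3.7.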

Authentication supported

The Tableau collector supports the following methods for authentication:

  • Username and password

  • Personal access token (PAT)

These authentication details are used while generating the CLI or YAML file for the collector.

What is cataloged

The collector catalogs the following information.

Table 1.

Object | Information cataloged
Databases | Name, Description, Database Connection Type
Database tables | Name
Database columns | Name
Projects | Name, Description
Workbooks | Name, Description, Creator Email, Creator Name, Creator Tableau User, Preview Image, and Workbook URL
Dashboards | Name, Creator Email, Creator Name, Creator Tableau User, Preview Image, and Dashboard URL
Views | Name, Creator Email, Creator Name, Creator Tableau User, Number of Views, Number of Favorites, Preview Image, and View URL
Fields | Name, Identifier, Description
Calculated fields | Name, Description, Calculation Formula
Dimensions | Name, Identifier, Description
Measures | Name, Identifier, Description
Metrics | Name, Identifier, Creator, Creation Date, Modified Date, Metrics URL, Field Data Type, Field Format, Field Type
Custom SQL tables | Name, Identifier, Description, Query
Embedded data sources | Name, Identifier, Description
Published data sources | Name, Identifier, Description



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

Databases

  • Schemas contained within database

  • Tables contained within database

Database tables

  • Views that use database table

  • Schema containing database table

  • Database containing the database table

Database columns

  • Table that a database column is part of

Projects

  • Views contained within the project

  • Workbooks contained within the project

  • Dashboards contained within project

  • Subprojects contained within project

Workbooks

  • Projects that contain workbook

  • Data sources embedded within workbook

  • Views contained within workbook

Dashboards

  • Fields used by dashboard

  • Projects containing dashboard

  • Tables used by dashboard

  • Workbooks containing dashboard

  • Views embedded in dashboard

Views

  • Fields used by view

  • Projects containing view

  • Tables used by view

  • Workbooks containing view

  • Dashboards which embed the view

Fields

  • Data Sources containing field

  • Views using field

Calculated fields

  • Views that use the calculated field

  • Data sources that contain the calculated field

Dimensions

  • Data sources containing dimension

  • Table related to dimension

Measures

  • Data Source containing measure

  • Views using measure

Custom SQL tables

  • Views using Custom SQL table

Embedded data sources

  • Fields contained within embedded data source

  • Workbook embedding embedded data source

Published data sources

  • Fields contained within published data source



Lineage for Tableau

Table 3.

Object | Lineage available
Database columns and tables | Fields that use database columns and tables
Dashboards | Fields and tables that dashboards source their data from
Views | Fields and tables that views source their data from
Fields | Columns, tables, and other fields that a field sources its data from



Important things to note about improving the performance of collector runs

Depending on the size of your Tableau instance, you may want to exclude or include specific resources from your catalog.

  1. Exclude object types: Use the --tableau-exclude parameter to exclude harvesting of certain object types. The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric

  2. Filter to specific Tableau site: Use the --tableau-site parameter to filter to a specific site.

  3. Filter to specific Tableau projects: Use the --tableau-project parameter to harvest only from specific Tableau projects. Use the parameter multiple times for multiple projects.

  4. GraphQL page size: Use the --tableau-graphql-page-size parameter to adjust the GraphQL page size. The maximum page size is 1000.

  5. Increase Docker resources: If you run into out-of-memory errors, increase the memory on the machine running the collector, increase the Java heap size (when running the jar file), or use the filtering options above.
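As a rough sketch, the filtering parameters above could be assembled as follows before being appended to the command generated by the application. The function name tableau_filter_args and the sample site, project, and exclusion values are illustrative assumptions; only the parameter names and the 1000 page-size limit come from this document.

```shell
# Hypothetical helper: prints the filtering flags described above, one per line.
# Usage: tableau_filter_args <site> <page_size> <exclusions_csv> <project>...
tableau_filter_args() {
  local site="$1" page_size="$2" exclusions="$3"
  shift 3
  # The documented maximum GraphQL page size is 1000.
  if [ "$page_size" -gt 1000 ]; then
    echo "page size must not exceed 1000" >&2
    return 1
  fi
  printf '%s\n' "--tableau-site=${site}"
  printf '%s\n' "--tableau-graphql-page-size=${page_size}"
  # --tableau-exclude is repeated once per excluded object type.
  local IFS=','
  local object_type
  for object_type in $exclusions; do
    printf '%s\n' "--tableau-exclude=${object_type}"
  done
  # --tableau-project is repeated once per project.
  local project
  for project in "$@"; do
    printf '%s\n' "--tableau-project=${project}"
  done
}

# Placeholder values, not taken from a real Tableau site.
tableau_filter_args "solutions" 500 "Metric,BinField" "Finance" "Sales"
```

Repeating a flag once per value mirrors the way the collector expects multiple projects and multiple excluded object types to be passed.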

Setting up Tableau

Enabling Metadata API

The collector harvests Tableau content using the Tableau Metadata API. The Metadata API is always enabled for Tableau Cloud. However, it is disabled by default for Tableau Server.

Important

For detailed instructions, see the Tableau documentation.

Prerequisites:

  1. You must be on Tableau Server 2019.3 or later.

  2. The REST API must not be disabled.

  3. You must use an account with server admin role to enable the Metadata API on Tableau Server using the Tableau Services Manager (TSM) command line interface (CLI).

To enable the Metadata API:

  1. Open a command prompt as an admin on the initial node where TSM is installed in the cluster.

  2. Run the following command:

    tsm maintenance metadata-services enable
  3. If you do not have the Data Management license, you will need to enable derived permissions to see related external assets. For details see the Tableau Cloud documentation and Tableau Server documentation.

Setting up permissions

  1. Set up a new user in Tableau with the Server Admin role.

  2. Create a Personal Access Token (PAT) for the new user. See Tableau docs for details.

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item | Requirement
Hardware |
RAM | 8 GB
CPU | 2 GHz processor
Software |
Docker | Click here to get Docker.
Java Runtime Environment | OpenJDK 17 is supported and available here.
data.world specific objects |
Dataset | You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.



Generating the command or YAML file

This section walks you through the process of generating the command or YAML file for running the collector from Windows, Linux, or macOS.

To generate the command or YAML file:

  1. On the Organization profile page, go to the Settings tab > Metadata collectors section.

  2. Click the Help me set up a collector button.

  3. On the On-prem collector setup prerequisites screen, read the pre-requisites and click Next.

  4. On the On which platform will this collector execute? screen, select whether you will run the collector on Windows, macOS, or Linux. This determines the format of the YAML and CLI that are generated at the end. Click Next.

  5. On the Choose metadata collector type you would like to setup screen, select Tableau. Click Next.

  6. On the Configure a new on-premises Tableau Collector screen, set the following properties and click Next.

  7. On the next screen, set the following properties and click Next.

    Table 3.

    Field name

    Corresponding parameter name

    Description

    Required?

    Authentication

    Select one of the following options and set the corresponding authentication details.

    • Authenticate using Username and Password

    • Authenticate using a private access token (PAT)

    Yes

    Authenticate using Username and Password

    Yes

    (or set the Authenticate using a private access token (PAT) parameters)

    Username

    --tableau-username=<username>

    Tableau username for authentication.

    Password

    --tableau-password=<password>

    Tableau password for authentication.

    Authenticate using a private access token (PAT)

    Yes

    (or set the Authenticate using Username and Password parameters.)

    PAT Name

    --tableau-pat-name=<personalAccessTokenName>

    Tableau personal access token name

    PAT Secret

    --tableau-pat-secret=<personalAccessTokenSecret>

    Tableau personal access token secret

    Tableau Base URL

    --tableau-api-base-url=<baseUrl>

    Base URL of the Tableau API. For example: http://8bank/api/3.10/

    Yes

    Tableau Site

    --tableau-site=<siteID>

    ID or name of the Tableau site to catalog (if not provided, will catalog all sites accessible to the user).

    No

    Skip cataloging of preview images

    --tableau-skip-images=<true_or_false>

    Whether to skip the cataloging of preview images.

    No

    Tableau Projects

    --tableau-project=<projectId>

    ID or name of the Tableau project to catalog. If not provided, will catalog all projects. Use the parameter multiple times for multiple projects.

    Note: Sub-projects (projects nested within another project) must be specified individually.

    No

    Exclude Object Types

    --tableau-exclude=<exclusions>

    Exclude Tableau object types from being cataloged.

    The supported object types are: View, Dashboard, Database, PublishedDataSource, EmbeddedDataSource, CalculatedField, ColumnField, BinField, GroupField, DatasourceField, CustomSQLTable, Metric.

    Use the parameter multiple times to exclude multiple object types.

    No

    Tableau Pagination Size

    --tableau-graphql-page-size=<size>

    Page size to use for paginated graphql (metadata api) queries.

    No



  8. On the Finalize your Tableau Collector configuration screen, you are notified about the environment variables and directories you need to set up for running the collector. Select whether you want to generate a Configuration file (YAML) or Command line arguments (CLI). Click Next.

    Important

    You must ensure that you have set up these environment variables and directories before you run the collector.

  9. The next screen gives you an option to download the YAML configuration file or copy the CLI command. Click Done. If you are generating a YAML file, click Next.


    [Image: sample YAML file]
  10. The Tableau command screen gives you the command to use for running the collector using the YAML file.

  11. You will notice that the YAML/CLI has the following additional parameters that are automatically set for you.

    Important

    Except for the collector version, you should not change the values of any of the parameters listed here.

    Table 4.

    Parameter name

    Details

    Required?

    -a=<agent>

    --agent=<agent>

    --account=<agent>

    The ID for the data.world account into which you will load this catalog. This is used to generate the namespace for any URIs generated.

    Yes

    --site=<site>

    This parameter should be set only for Private instances. Do not set it for public instances and single-tenant installations. Required for private instance installations.

    Yes (required for private instance installations)

    -U

    --upload

    Whether to upload the generated catalog to the organization account's catalogs dataset.

    Yes

    -L

    --no-log-upload

    Do not upload the log of the Collector run to the organization account's catalogs dataset.

    Yes

    dwcc:<CollectorVersion>

    The version of the collector you want to use (For example, datadotworld/dwcc:2.113)

    Yes



  12. Add the following additional parameter to test run the collector.

    • --dry-run: If specified, the collector does not harvest any metadata. It only checks the connection parameters provided by the user and reports whether the connection succeeded or failed.

Verifying environment variables and directories

  1. Before running the collector, verify that you have set up all the required environment variables that were identified by the Collector Wizard. Alternatively, you can store these credentials in a credential vault and use a script to retrieve them.

  2. Verify that you have set up all the required directories that were identified by the Collector Wizard.
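The checks above can be scripted. The following is a minimal sketch, assuming the environment variable names (DW_AUTH_TOKEN, DW_TABLEAU_PASSWORD) and the ${HOME}/dwcc directory that appear in the sample commands in this document; your generated command may call for different ones. The function name check_collector_env is hypothetical.

```shell
# Hypothetical pre-flight check: returns non-zero if a required environment
# variable is empty or unset, or if the output directory does not exist.
# Usage: check_collector_env <output_dir> <VAR_NAME>...
check_collector_env() {
  local output_dir="$1"
  shift
  local missing=0 name value
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  if [ ! -d "$output_dir" ]; then
    echo "missing directory: $output_dir" >&2
    missing=1
  fi
  return "$missing"
}

# Variable names and directory are taken from the sample commands (assumed setup).
if check_collector_env "${HOME}/dwcc" DW_AUTH_TOKEN DW_TABLEAU_PASSWORD; then
  echo "environment looks ready"
else
  echo "fix the items listed above before running the collector"
fi
```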

Running the collector

Important

Before you begin running the collector make sure you have the correct version of collectors downloaded and available.

Running collector using YAML file

  1. Go to the server where you have set up Docker to run the collector.

  2. Make sure you have downloaded the correct version of the collector. This version should match the version of the collector specified in the command you are using to run the collector.

  3. Place the YAML file generated from the Collector wizard in the correct directory.

  4. From the command line, run the command generated from the application for executing the YAML file.

    Caution

    Note that this is just a sample command for showing the syntax. You must generate the command specific to your setup from the application UI.

    docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
      --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} \
      -e DW_TABLEAU_PASSWORD=${DW_TABLEAU_PASSWORD} datadotworld/dwcc:2.124 \
      --config-file=/dwcc-output/config-tableau.yml
  5. The collector automatically uploads the file to the specified dataset and you can also find the output at the location you specified while running the collector.

  6. At a later point, if you download a newer version of collector from Docker, you can edit the collector version in the generated command to run the collector with the newer version.

Running collector without the YAML file

  1. Go to the server where you have set up Docker to run the collector.

  2. Make sure you have downloaded the correct version of the collector from here. This version should match the version of the collector specified in the command you are using to run the collector.

  3. From the command line, run the command generated from the application. Here is a sample command.

    Caution

    Note that these are just sample commands for showing the syntax. You must generate the command specific to your setup from the application UI.

    Sample command with username and password parameters.

    docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
      --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc datadotworld/dwcc:2.124 \
      catalog-tableau --agent=8bank-catalog-sources --site=solutions \
      --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
      --name=8bank-catalog-sources-collection --output=/dwcc-output \
      --upload-location=ddw-catalogs --tableau-username=8bank-user --tableau-password=${DW_TABLEAU_PASSWORD} \
      --tableau-api-base-url="http://8bank/api/3.10/" --tableau-skip-images=false

    Sample command with PAT parameters.

    docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
      --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc datadotworld/dwcc:2.124 \
      catalog-tableau --agent=8bank-catalog-sources --site=solutions \
      --no-log-upload=false --upload=true --api-token=${DW_AUTH_TOKEN} \
      --name=8bank-catalog-sources-collection --output=/dwcc-output \
      --upload-location=ddw-catalogs --tableau-pat-name=8bank_token \
      --tableau-pat-secret=${DW_TABLEAU_SECRET} --tableau-api-base-url="http://8bank/api/3.10/" \
      --tableau-skip-images=false
  4. The collector automatically uploads the file to the specified dataset and you can also find the output at the location you specified while running the collector.

  5. At a later point, if you download a newer version of collector from Docker, you can edit the collector version in the generated command to run the collector with the newer version.

Overview

Some enterprise systems support the use of Secure Sockets Layer (SSL) encrypted communications on all external traffic. If you are harvesting metadata from a source system that requires SSL, you will need to add a CA certificate or self-signed certificate.

Obtaining the Custom SSL Certificate

  • Obtain the root certificate for your source system issued by your company. Typically your system administrator should be able to provide you with this.

Extending Docker to use custom SSL certificates

If the collector is run via Docker, extend the Docker image and install the custom certificate.

STEP 1: Prepare the Docker File

First, prepare a Dockerfile with the instructions for Docker to install the custom certificate and extend the Docker image.

  1. Ensure you are on the machine where you have downloaded the Docker Image and plan to execute the Collector.

  2. In a directory, create a new Dockerfile with the following instructions for your custom SSL Certificate:

    FROM datadotworld/dwcc:<collector_version>
    ADD ./<custom_certificate_file_path> <custom_certificate_file_name>
    RUN keytool -importcert -alias startssl -cacerts -storepass changeit \
        -noprompt -file <custom_certificate_file_name>
    • Replace <collector_version> with the version of the Collector you want to use (For example, datadotworld/dwcc:2.120)

    • Replace <custom_certificate_file_path> with the path to the custom SSL Certificate.

    • Replace <custom_certificate_file_name> with the name of your custom SSL Certificate file.

    For example, the Dockerfile will look like:

    FROM datadotworld/dwcc:2.120
    ADD ./ca.der certificate
    RUN keytool -importcert -alias startssl -cacerts -storepass changeit \
        -noprompt -file certificate

STEP 2: Install the certificate and extend the docker image

Next, build the Dockerfile to install the certificate and extend the data.world Collector Docker Image.

  1. Using your terminal of choice, ensure you are in the directory containing the Dockerfile created in step 1.

  2. Next, create the new extended Docker image, called dwcc-cert in this example, by executing the following command:

    docker build -t dwcc-cert .

    Important things to note:

    • The command must be all lowercase.

    • The command must include the period (.) at the end, which directs Docker to use the local directory for the Dockerfile created above.

    • For the new image, the command uses the name dwcc-cert. You can change the name if you want.

STEP 3: Run collector using the custom certificate

Finally, run the collector using the custom Certificate.

  1. Get the standard docker run command for the Data Source you are collecting from.

  2. Change the docker run command to use the dwcc-cert image instead of the dwcc image.

    Sample command for Tableau.

    docker run -it --rm --mount type=bind,source=/tmp,target=/dwcc-output \
    --mount type=bind,source=/tmp,target=/app/log dwcc-cert \
    catalog-tableau --tableau-api-base-url <baseUrl> \
    --tableau-password <password> --tableau-username <username> \
    -a <account> -n <catalogName> -o "/dwcc-output"

    If you are using a YAML file for running the collector, edit the command to use the dwcc-cert image instead of the dwcc image.

    docker run -it --rm --mount type=bind,source=${HOME}/dwcc,target=/dwcc-output \
      --mount type=bind,source=${HOME}/dwcc,target=${HOME}/dwcc -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} \
      -e DW_TABLEAU_PASSWORD=${DW_TABLEAU_PASSWORD} dwcc-cert \
      --config-file=/dwcc-output/config-tableau.yml

Adding custom SSL certificates when using jar

If the collector is run via jar, add the certificate to the JVM truststore.

  1. From the terminal, navigate to the directory containing the certificate.

  2. Run the following command to add the SSL certificate to the truststore:

    keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file <custom_certificate_file_path>

    Replace <custom_certificate_file_path> with the path to the custom SSL Certificate.

    For example, the command will look like:

    keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file ca.der
  3. Finally, run the collector using the original jar file command. Note that this command does not need any modifications.

Troubleshooting SSL certificate issues

Issue

  • The following error occurs while running the collector:

Caused by: javax.ws.rs.ProcessingException: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Description

  • There was an issue connecting to the source system using the SSL certificate.

Solution

  1. Check to make sure the SSL certificate has not expired.

  2. Ensure you have the correct SSL certificate for the source system.

Common troubleshooting tasks

Collector runtime and troubleshooting

The catalog collector may take anywhere from several seconds to many minutes to run, depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.

Issue 1: The collector is taking a long time to harvest from Tableau

  • Cause: The large size of a Tableau environment results in a long time to harvest metadata.

  • Solution: Increase the graphql page size to reduce the number of API calls.

Issue 2: Out of memory errors while running the collector

  • Cause: There is not enough memory allocated to the collector on the machine on which the collector is running. Note that, because of the size of the source system, the collector may run for a long time even if there is sufficient memory.

  • Solution: Increase memory of the machine running the collector. Run docker system info to see the total memory available for Docker. The collector will use 80% of what is allocated to the container. You can run docker stats to see how much memory is used by the container when the collector runs.
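To estimate how much memory the collector can actually use, the 80% rule above can be applied to the container allocation. This is a small arithmetic sketch; the function name collector_usable_mib is hypothetical, and the 8192 MiB input is only an example value.

```shell
# Hypothetical helper: applies the 80% figure stated above to a container
# memory allocation given in MiB (integer arithmetic, rounded down).
collector_usable_mib() {
  local container_mib="$1"
  echo $(( container_mib * 80 / 100 ))
}

# Example: a container limited to 8 GiB (8192 MiB).
echo "usable by collector: $(collector_usable_mib 8192) MiB"
```

Compare the result with the usage reported by docker stats while the collector runs to judge whether the allocation needs to grow.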

Issue 3: Info message "Authentication to Tableau API failed, reauthenticating" observed in log file

  • Cause: The connection to Tableau API expired due to a long run.

  • Solution: No action is required as the collector re-authenticates automatically to Tableau.

Issue 4: Partial results error observed

The following error message is observed: Showing partial results. The request exceeded the ‘n’ node limit. Use pagination, additional filtering, or both in the query to adjust results.

Automating updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
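If you schedule with cron, the daily, weekly, and monthly cadences mentioned above map to crontab entries along the following lines. This is a sketch only: the function name cron_line, the 02:00 run time, and the wrapper script path are assumptions chosen for illustration, and the wrapper script is expected to contain the command generated for your setup.

```shell
# Hypothetical helper: prints a crontab line that runs a wrapper script
# at 02:00 for the chosen cadence (daily, weekly on Sunday, or monthly on the 1st).
cron_line() {
  local cadence="$1" script="$2"
  case "$cadence" in
    daily)   echo "0 2 * * * $script" ;;
    weekly)  echo "0 2 * * 0 $script" ;;
    monthly) echo "0 2 1 * * $script" ;;
    *) echo "unknown cadence: $cadence" >&2; return 1 ;;
  esac
}

# The script path is a placeholder for a file that wraps your generated docker run command.
cron_line weekly "/opt/dwcc/run-tableau-collector.sh"
```

The printed line can be added with crontab -e on the machine that runs the collector.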