
Snowflake and the data.world Collector

Introduction

Note

The latest version of the Collector is 2.119. To view the release notes for this version and all previous versions, please go here.

The data.world Collector harvests metadata from your source system. Please read over the data.world Collector FAQ to familiarize yourself with the Collector.

Prerequisites

  • You must have a ddw-catalogs dataset (or another dataset) set up to hold your catalog files when you are done running the collector.

  • The machine running the catalog collector should have connectivity to the internet or access to the source instance. For Linux- or Unix-based machines, a minimum of 2 GB of memory and a 2 GHz processor is recommended. For Windows-based machines, a minimum of 4 GB of memory and a 2 GHz processor is recommended.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/.

  • The user defined to run the data.world Collector must have read access to all resources being cataloged.

  • The computer running the data.world Collector needs a Java Runtime Environment. OpenJDK 17 is supported and available here.
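Before running the collector, you can confirm the Docker and Java prerequisites with two quick checks (these are standard Docker and Java commands, not collector-specific):

docker --version
java -version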

Permissions required to catalog Snowflake

Important

The Snowflake collector supports the username and password authentication method.

These permissions have been tested as of 2/16/2022.

  1. Copy and paste the SQL statements below into a Snowflake worksheet.

  2. Replace <warehouse_name>, <database_name>, and <password> with your values.

  3. Run all of the statements.

  4. Test the collector using the DDW_ACCOUNT user and the password you filled in.

Note

The statements below grant permissions on all Snowflake schemas, tables, external tables, and views to the DDW_ACCOUNT_ROLE role. The scope of the grant statements can be narrowed to grant permissions only on specified objects, which limits the metadata cataloged from Snowflake.

create or replace role DDW_ACCOUNT_ROLE;

grant operate, usage on warehouse <warehouse_name> to role DDW_ACCOUNT_ROLE;

grant usage on database <database_name> to role DDW_ACCOUNT_ROLE;
grant usage on all schemas in database <database_name> to role DDW_ACCOUNT_ROLE;
grant select on all tables in database <database_name> to role DDW_ACCOUNT_ROLE;
grant select on all external tables in database <database_name> to role DDW_ACCOUNT_ROLE;
grant select on all views in database <database_name> to role DDW_ACCOUNT_ROLE;

grant usage on future schemas in database <database_name> to role DDW_ACCOUNT_ROLE;
grant select on future tables in database <database_name> to role DDW_ACCOUNT_ROLE;

create user DDW_ACCOUNT password = '<password>' default_role = DDW_ACCOUNT_ROLE default_warehouse = '<warehouse_name>' display_name = 'data.world';
grant role DDW_ACCOUNT_ROLE to user DDW_ACCOUNT;

// Optional: permission to enable lineage based on Snowflake query history.
// This grant must be run using the ACCOUNTADMIN role.
grant imported privileges on database snowflake to role DDW_ACCOUNT_ROLE;

// You can check the grants on the role with this statement.
show grants to role DDW_ACCOUNT_ROLE;
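To verify the new user before running the collector (step 4 above), you can open a session as DDW_ACCOUNT with any Snowflake client. A minimal sketch using the SnowSQL CLI, where <account_identifier> is a placeholder for your Snowflake account identifier:

snowsql -a <account_identifier> -u DDW_ACCOUNT -w <warehouse_name> -d <database_name> -r DDW_ACCOUNT_ROLE

A successful login confirms the credentials work; running the show grants statement above from that session confirms the expected privileges are attached to the role.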

What is cataloged

The following sections describe the information cataloged by the collector.

Important details about Snowflake tags and policies

The Snowflake collector harvests Snowflake object tags, Snowflake tag-based masking policies, Snowflake masking policies, and Snowflake row access policies. This information enhances the data governance experience by making all tags and policies applied across Snowflake tables and columns discoverable in the data catalog.

In the following example, you can see a Snowflake tag-based masking policy applied to sensitive data columns: routing number, bank name, and bank account number. In this view, you can also see the associated tag (Classification:confidential) and technical details about the policy, such as the Policy Body, which explains how the policy works.

policy_info.png
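For context, here is a minimal sketch of how a tag-based masking policy like this is defined on the Snowflake side. All object names (classification, mask_confidential, orders, bank_account_number, ANALYST_ROLE) are hypothetical; the statements follow standard Snowflake syntax:

create tag classification;

create masking policy mask_confidential as (val string) returns string ->
  case
    when current_role() in ('ANALYST_ROLE') then val
    else '***MASKED***'
  end;

// Attaching the policy to the tag protects every column that carries the tag.
alter tag classification set masking policy mask_confidential;

// Tagging a sensitive column applies the masking policy to it automatically.
alter table orders modify column bank_account_number set tag classification = 'confidential';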
Table 1. Snowflake tags and policies

  • Snowflake object tags: Tags enable data stewards to track sensitive data for compliance, discovery, protection, and resource usage use cases through either a centralized or decentralized data governance management approach. For details, see the Snowflake documentation.

  • Snowflake masking policies: Masking policies define what data is masked and not shown to users, allowing only authorized users to access sensitive data at query runtime. For details, see the Snowflake documentation.

  • Snowflake tag-based masking policies: When a masking policy is applied to a Snowflake tag, the columns with the same tag are automatically protected based on the conditions set in the masking policy. For details, see the Snowflake documentation.

  • Snowflake row access policies: Row access policies determine which rows of data a user can see, while masking policies define what data within those rows is masked and not shown. For details, see the Snowflake documentation.

  • Table query count: For each table, the Snowflake collector computes a popularity percentile and a query count based on the access history in Snowflake. This information is then used to assign a popularity ranking to Snowflake tables in data.world. For details, see the Snowflake documentation.
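The collector's exact popularity computation is internal to the product, but the query counts described above come from Snowflake's access history. A minimal sketch of this kind of aggregation against the SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view (the 30-day window is an arbitrary choice for illustration):

select
  obj.value:objectName::string as table_name,
  count(distinct query_id) as query_count
from snowflake.account_usage.access_history,
  lateral flatten(input => base_objects_accessed) obj
where query_start_time >= dateadd(day, -30, current_timestamp())
group by table_name
order by query_count desc;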



When this data is harvested and displayed in data.world, you see the following information in your catalog: Snowflake masking policies, Snowflake row access policies, Snowflake tag values, and Snowflake tags.

snowflake_harvested_info.png

When you browse to a table harvested from Snowflake, you will notice that the appropriate policies are applied to the table. In the following example, the policy and tag values are applied to the Order table.

snowflake_harvested_info02.png

Users can click through to the tag values and policies to see their details.

Note

Users can also use the Search feature in the application to locate all of these resources easily.

snowflake_harvested_info03.png

Snowflake lineage

The Snowflake collector also harvests:

  • Tags

  • Masking Policies

  • Row Access Policies

and associates these resources with the database objects to which they apply. Two parameters are used to include them (both appear in the example command in the next section):

  • --tag-collection

  • --policy-collection

Ways to run the data.world Collector

There are a few different ways to run the data.world Collector, any of which can be combined with an automation strategy to keep your catalog up to date:

  • Create a configuration file (config.yml) - This option stores all the information needed to catalog your data sources. It is an especially valuable option if you have multiple data sources to catalog, as you don't need to run multiple scripts or CLI commands separately.

  • Run the collector through a CLI - Clear-cut and efficient, but regular, repeated runs of the collector become laborious and time-consuming because the commands must be re-entered for each run.

Writing the data.world Collector command

The easiest way to create your Collector command is to:

  1. Copy the following example command

  2. Edit it for your organization and data source

  3. Open a terminal window in any Unix environment that uses a Bash shell and paste your command into it.

The example command includes the minimal parameters required to run the collector (described below); your instance may require more. A description of all the available parameters is available in this article. Edit the command by adding any other parameters you wish to use and by replacing the values for all parameters with your information as appropriate. Parameters required by the Collector are in bold.

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of the Collector you want to use (e.g., datadotworld/dwcc:2.113).
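A minimal sketch of what such a command can look like for Snowflake. The catalog-snowflake subcommand and the exact flags shown follow the general dwcc pattern but should be treated as assumptions; verify them against the parameters article referenced above. <account>, <org-agent>, and <catalog_name> are placeholders for your values:

docker run -it --rm \
  --mount type=bind,source=${HOME}/dwcc-output,target=/dwcc-output \
  datadotworld/dwcc:x.y catalog-snowflake \
  --agent=<org-agent> \
  --server=<account>.snowflakecomputing.com \
  --database=<database_name> \
  --user=DDW_ACCOUNT \
  --password=<password> \
  --name=<catalog_name> \
  --output=/dwcc-output \
  --tag-collection \
  --policy-collection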

Basic parameters

Each collector has parameters that are required, parameters that are recommended, and parameters that are completely optional. Required parameters must be present for the command to run. Recommended parameters are either:

  • parameters that exist in pairs, and one or the other must be present for the command to run (e.g., --agent and --base)

  • parameters that we recommend to improve your experience running the command in some way

Together, the required and recommended parameters make up the Basic parameters for each collector. The Basic parameters for this collector are described in the collector parameters article linked above.

Docker and the data.world Collector

Detailed information about the Docker portion of the command can be found here. When you run the command, Docker will attempt to find the image locally; if it doesn't find it, it will download the image from Docker Hub automatically:

dwcc_and_cli.png
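If you prefer to fetch the image ahead of time, you can pull it explicitly with standard Docker commands (replace x.y with your Collector version):

docker pull datadotworld/dwcc:x.y
docker images datadotworld/dwcc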

Collector runtime and troubleshooting

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of these can be sent to support for investigation if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.

Upload the .ttl file generated from running the Collector

When the data.world Collector runs successfully, it creates a .ttl file in the directory you specified as the dwcc-output directory. The automatically generated file name is databaseName.catalogName.dwec.ttl. You can rename the file or keep the default name, and then upload it to your ddw-catalogs dataset (or wherever you store your catalogs).

Caution

If there is already a .ttl catalog file with the same name in your ddw-catalogs dataset, when you add the new one it will overwrite the existing one.
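You can upload the file through the data.world UI, or script the upload using the data.world file-upload API. A minimal sketch using curl, where <owner> and <your-api-token> are placeholders and ddw-catalogs is assumed to be the target dataset:

curl -X POST "https://api.data.world/v0/uploads/<owner>/ddw-catalogs/files" \
  -H "Authorization: Bearer <your-api-token>" \
  -F "file=@databaseName.catalogName.dwec.ttl"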

Automatic updates to your metadata catalog

Keep your metadata catalog up to date by using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule it include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, a daily run may be appropriate. For organizations whose schemas change less often and are less critical, weekly or even monthly runs may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes. A minimal cron sketch for a daily run appears below.
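This crontab entry runs the collector daily at 2:00 a.m., assuming the full docker run command has been saved to a script at /opt/dwcc/run-collector.sh (a hypothetical path):

# m h dom mon dow  command
0 2 * * * /opt/dwcc/run-collector.sh >> /var/log/dwcc-collector.log 2>&1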