Running the Snowflake collector in Cloud

Configuring the cloud collector for Snowflake

To configure the cloud collector for Snowflake:

On the Organization profile page, go to the Settings tab > Metadata collectors section.
Click the Add a collector button.
On the Choose metadata collector screen, select the correct metadata source. Click Next.
On the Choose where the collector will run screen, in the Cloud section, select data.world. Click Next.

On the Configure a cloud Snowflake Collector screen, set the following:

Table 1.

Field name	Description	Required?
Collection Name	Specify the collection name where the collector output will be saved. Ensure you use a distinct collection for each collector.	Yes
Output file name	Specify the collector output file name to override the default file name. The system automatically adds .dwec.ttl to the end of the provided file name.	No
Automatic upload location	Enter the name of your dataset to display a list of available datasets. From this list, select the dataset where you want to upload the catalog file. By default, the search is restricted to the organization you are in. To search across all organizations you have access to, uncheck the Limit search results to this organization option.	Yes

On the next screen, set the following properties and click Next.

Important

Snowflake is case-sensitive, so specify database and schema names in exactly the same case as they appear in Snowflake.

Table 2.

Field name	Description	Required?
Server	The hostname of the database server to connect to.	Yes
Server port	The port of the database server (if not the default).	No
Database	The name of the database to connect to. You can add multiple databases by clicking the Add item button. Note: If you don't specify this property, the collector will harvest metadata from all databases. You can then use the optional exclude database parameter to exclude specific databases.	No
Snowflake role	The role used to execute the query.	Yes
Authentication	Select the Authenticate using a private key file option.	Yes
Authenticate with a username & password	Do not select this option. Note: Snowflake is transitioning away from username & password authentication. For details, see this field notice.
Username	-	-
Password	-	-
Authenticate using a private key file	Select this option and set the fields in the rows that follow.
Username	Specify the username to use to make the JDBC connection.	Yes
Snowflake private key file	Upload the private key file to use for authentication.	Yes
Private key file password	The password for the private key file, if the key is encrypted and a password was set.	No
Schemas to collect	Select from one of the following options: Collect all schema, Specify which schema to collect	Yes
Collect all schema	Catalog all schemas to which the user has access.
Specify which schema to collect	Select this option and then specify the names of the database schemas to be cataloged.
Exclude Schema	Specify the name or regular expression of the database schemas to be excluded. Applicable only if the Collect all schema option is selected.	No
Information schema	Set this option to also collect the Information schema for the database. You can set this option only when you have selected the Collect all schema option.	No
Excluded database	When the Database parameter is not provided, the collector harvests metadata from all databases. To skip specific databases, use the Excluded database parameter and specify one or more regular expressions that match the databases not to be cataloged. Note: This parameter is ignored if the Database parameter is specified, so do not set Database if you want to use Excluded database.	No
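
The interaction between the Database and Excluded database fields can be sketched as follows. This is only an illustration, not the collector's implementation: the function name and sample database names are hypothetical, and the use of Python's re module with whole-name matching is an assumption (the collector may use a different regular expression dialect or matching rule).

```python
import re

def databases_to_harvest(all_databases, database=None, excluded_patterns=None):
    """Hypothetical sketch of how Database and Excluded database interact."""
    # If Database is set, Excluded database is ignored.
    if database:
        return list(database)
    patterns = [re.compile(p) for p in (excluded_patterns or [])]
    # Assumption: a database is excluded when a pattern matches its whole name.
    return [db for db in all_databases
            if not any(p.fullmatch(db) for p in patterns)]

# Made-up database names, upper case as Snowflake typically stores them.
available = ["SALES", "SALES_DEV", "HR", "TMP_SCRATCH"]
print(databases_to_harvest(available, excluded_patterns=[r".*_DEV", r"TMP_.*"]))
print(databases_to_harvest(available, database=["HR"], excluded_patterns=[r"HR"]))
```

In the second call the exclusion pattern has no effect because a database was named explicitly, which mirrors the note in the Excluded database row.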

On the next screen, set the following optional properties and click Next.

Important

If you are using Catalog Toolkit, make sure you select the relevant module for specific Snowflake features.

Table 3.

Field name	Description	Required?
Disable lineage collection	Skip harvesting of intra-database lineage metadata.	No
Disable Extended Metadata collection	Skip harvesting of extended metadata for resource types such as databases, schemas, tables, columns, functions, stored procedures, user-defined types, and synonyms. Basic metadata for these resource types will still be harvested.	No
Collect Snowflake tag information	Harvest information about Snowflake tags in or associated with the database specified via --database.	No
Collect all Snowflake tag information	Harvest information about Snowflake tags regardless of the database in which they reside.	No
Collect Snowflake policy information	Harvest information about Snowflake masking and row-access policies in or associated with the database specified via --database.	No
Collect all Snowflake policy information	Harvest information about Snowflake masking and row-access policies regardless of the database in which they reside.	No
Collect Streamlit app information	Harvest information about Snowflake Streamlit applications.	No
Collect data metric function information	Specify to harvest metadata about data metric functions and their associations to tables.	No
Number of most recent data metric observations to harvest	The number of most recent data metric observations to harvest. The default is 1. Ignored unless --data-metric-function-collection is set.	No
Collect Snowflake table usage information	Harvests metadata about Snowflake table usage in queries (popularity). Calculates, for each table in the database being harvested, the percentage of tables in the database that have been queried no fewer times than the subject table.	No
Table usage lookback days	Number of days in the past at which to begin harvesting table usage (default=7 days).	No
Enable Sample String Values collection	Enables harvesting of sample values and histograms for columns containing string data. Note: Only applies if Enable column statistics collection is turned on.	No
Enable column statistics collection	Enables harvesting of column statistics (that is, data profiling). Note: Activating the profiling feature may extend the running time of the collector, because the collector needs to read the table data to gather metadata for profiling.	No
Target sample size for column statistics	Controls the number of rows sampled for computation of column statistics and string-value histograms. For example, to sample 1000 rows, set the parameter as: --target-sample-size=1000. The default is 100000. Note: Only applies if Enable column statistics collection is turned on.	No
Snowflake warehouse	The Snowflake warehouse to use when connecting. If not specified, the user's assigned default warehouse is used.	No
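
The popularity figure described for Collect Snowflake table usage information can be illustrated with a small calculation. This is a sketch of one reading of that description, not the collector's code: the function name and query counts are made up, and it assumes the subject table counts itself among the tables "queried no fewer times".

```python
def usage_percentages(query_counts):
    """For each table, the percentage of tables queried at least as often as it."""
    total = len(query_counts)
    return {
        table: 100.0 * sum(1 for c in query_counts.values() if c >= count) / total
        for table, count in query_counts.items()
    }

# Hypothetical query counts within the lookback window.
counts = {"ORDERS": 120, "CUSTOMERS": 40, "AUDIT_LOG": 40, "ARCHIVE": 0}
print(usage_percentages(counts))
```

Under this reading, a lower percentage means a more heavily used table: the most-queried table of four gets 25.0, while a table that is never queried gets 100.0.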

On the next screen, set the following optional properties and click Next.

Table 4.

Field name	Description	Required?
Server environment	If your provided server name is localhost, use this to give a friendly name to the environment in which your database server runs. It helps differentiate it from other environments.	No
Database ID	A unique identifier for this database, used to generate the ID for the database. This field is optional. Provide it only if the database name used for the connection is not sufficiently unique to completely identify the database.	No
JDBC Properties	JDBC driver properties to pass through to driver connection.	No
Snowflake Collector incremental collection mode	Specify whether the collector should harvest only schemata that have changed. For more details, see Setting incremental runs for Snowflake Cloud collector. Note: Use this feature only when advised to do so by the data.world team.	No

On the next screen, if you want to set up sensitive data classification, set the following properties. Click Next.

Table 5.

Field name	Description	Required?
Enable sensitive data discovery	Enable the sensitive data classification option.	Yes
Sample Size for Sensitive Data Classification	Specify the number of rows to sample from each column for classification. The default is 1000 if not specified. Note: The actual sample size may vary due to database sampling methods.	No

On the next screen, provide the Collector configuration name and an optional Configuration description, and set the run schedule. You can also set the schedule at a later point.
Click Save and View to go to the collector details page.

Testing collector configuration

Test your cloud collector configurations before saving or running them to make sure your connection details are valid and the collector can access the source system.

Use this feature after you have finished entering all configuration fields — such as the server address, authentication credentials, databases, and schemas — to quickly confirm:

The system can connect to the database.
Credentials are valid.
Required schemas are accessible.

To test a collector configuration:

Before saving, use Test configuration to validate your connection and confirm that the collector can access the database and schemas you have specified.

On the Metadata collectors page, click Add a collector.
While configuring the collector, proceed through the setup steps. On the final screen, where you enter the collector configuration name, click Test [Collector Name] configuration.
To test an existing collector, open the configuration by clicking Edit configuration.
Click Next through the setup screens until you reach the final screen, then click Test [Collector Name] configuration.
When you click the Test button, the system attempts to:
- Connect to the database using the credentials you provided.
- Authenticate using the specified role.
- Check schema access, if specific schemas were listed.
Based on the outcome, the collector shows whether the configuration is valid, partially valid — for example, the connection works but a schema is inaccessible — or invalid — for example, due to bad credentials.
If the test fails, an error message appears to help you troubleshoot the issue. For example, Incorrect username or password was specified.
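
The three outcomes described above can be summarized in a short sketch. The function and argument names are hypothetical and the logic is only an interpretation of the steps listed; it is not how the product implements the test.

```python
def classify_test_result(connected, authenticated, schema_access):
    """Map the three checks to an outcome.

    schema_access maps each listed schema name to True if it is accessible.
    An empty mapping means no specific schemas were listed.
    """
    if not (connected and authenticated):
        return "invalid"
    if all(schema_access.values()):
        return "valid"
    return "partially valid"

print(classify_test_result(True, True, {"PUBLIC": True, "FINANCE": False}))
print(classify_test_result(True, False, {}))
```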

Scheduling collector runs

Important things to note:

Different collectors can be scheduled to run at the same time, but each collector can run only once a day.
Schedule the runs in off-peak hours where possible.
The collector runs in the time zone of the user who sets the schedule. For example, if the schedule is set from the PST time zone, the collector runs follow PST.
Runs may start up to one hour after the scheduled time.
Cloud collectors are designed to automatically run against the latest version of the collector supported by the UI.
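
To see what the time zone and start-delay notes mean in practice, the following sketch converts a scheduled time to UTC and computes the window in which the run may start. The function name and the example date are hypothetical, and PST is modeled as a fixed UTC-8 offset, which ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST as a fixed UTC-8 offset (no daylight saving handling).
PST = timezone(timedelta(hours=-8), "PST")

def start_window_utc(scheduled):
    """Return the (earliest, latest) possible start times in UTC."""
    earliest = scheduled.astimezone(timezone.utc)
    # Runs may start up to one hour after the scheduled time.
    return earliest, earliest + timedelta(hours=1)

earliest, latest = start_window_utc(datetime(2024, 1, 15, 2, 0, tzinfo=PST))
print(earliest.isoformat(), latest.isoformat())
```

A run scheduled for 02:00 PST would therefore start between 10:00 and 11:00 UTC, which is worth checking against the off-peak hours of the Snowflake account.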

To schedule collector runs:

On the Configured collectors page, locate the collector you want to run on a schedule.
Click the Edit configurations button.
Go to the screen where you can set the schedule for the collector.
Enable the Scheduled runs option.
From the Frequency dropdown, select from Daily, Weekly, or Monthly.
For Weekly and Monthly options, select the day when the collector should run.
Select the time for running the collector.
Click Save and view. The schedule and next run date and time are displayed on the collector details page.
To get notifications about the collector runs, set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.
Sample data captured by the webhook.
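
As a rough sketch of how a webhook consumer might act on these status events, the following function sorts the listed statuses into finished and in-progress runs. This is not the webhook payload itself, whose shape is not reproduced here, so the function takes only the status string. The function name is hypothetical, and treating Completed, Error, and Cancelled as the end of a run is an assumption.

```python
# Status names come from the list above; which ones end a run is an assumption.
RUN_STATUSES = ("Pending", "Provisioning", "Running", "Completed", "Error", "Cancelled")
TERMINAL_STATUSES = {"Completed", "Error", "Cancelled"}

def describe_status(status):
    """Return a short, human-readable summary for a collector run status."""
    if status not in RUN_STATUSES:
        raise ValueError(f"Unknown status: {status!r}")
    if status in TERMINAL_STATUSES:
        return f"{status}: run finished"
    return f"{status}: run still in progress"

for s in RUN_STATUSES:
    print(describe_status(s))
```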

Running collectors manually

After setting up the collector configuration, it's recommended to manually execute it once to confirm everything is set up correctly. Even for collectors scheduled to run automatically, you can initiate them manually at any time. Cloud collectors are designed to automatically run against the latest version of the collector supported by the UI.

To run the collectors manually:

On the Configured collectors page, locate the collector you want to run.
On the collector configuration details page, click the Run now button. Alternatively, on the Configured collectors page, open the Three dot menu and click Run/Sync now.
On both pages, the Status field shows the status as Running with information about time elapsed since the run was started.
The collector starts running in the background and you can navigate away from the page at any time. If a collector run errors out, the Status section and the Status field update to an Error state.
After the collector has completed the required pre-configuration steps and starts harvesting the metadata, you can cancel the harvesting process. If you cancel, the Status section and the Status field update to Canceled.
After the collector run has completed, the Status section of the collector configuration details page updates to show the successful status. The Last run summary page also updates to show the total number of resources collected and the total number of resource types collected. The Resources collected by type section gives a detailed breakdown of the number of resources collected for each type of resource.
Browse to the Collection and Dataset specified while running the collector to view the collector output.
To get notifications about the collector runs, set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.
Sample data captured by the webhook.

Canceling a collector run

After starting a collector run, you can cancel it if needed.

Important things to note:

Logs are generated only after the collector starts up, which takes about 5 minutes. No logs are available if the run is canceled before then.
After you cancel a run, the collector produces the logs within 5 minutes.

To stop a running collector:

Locate the running collector you want to stop. After the collector has completed the required pre-configuration steps and starts harvesting the metadata, you get an option to Cancel it.
On the collector configuration details page, click the Cancel run button. Alternatively, on the Configured collectors page, open the Three dot menu and click Cancel run.
The collector stops running. On both pages, the Status field shows the status as Canceled with information about time elapsed since the run was canceled.
The collector produces a log file within 5 minutes after it stops running. To download the logs, click the View debugging info link. A pop-up window opens.
In the pop-up window, click Export logs to download the log file. The window also includes the Run ID of the collector run. When reporting an issue to the data.world support team, include this Run ID to help expedite the troubleshooting process.

Copying collector configurations

After you have configured a collector for a source system, you can copy the configuration to set up another collector for the same source system with different parameters.

To copy collector configurations:

On the Configured collectors page, locate the collector configuration you want to copy.
From the Three dot menu, click Duplicate configuration.
In the Edit Collector window, provide a new name for the collector configuration. Optionally, set a schedule. Click Save and view.
You are taken to the copied collector configuration page. Click the Edit Configuration button to adjust the details of the configuration.

Deleting configurations

Important things to note:

Deleting the configuration will not affect the resources that were collected from previous runs.
Any scheduled future runs for the collector are suspended.

To delete a configuration:

On the Configured collectors page, locate the collector configuration you want to delete.
From the Three dot menu, click the Delete configuration button.
Confirm the deletion. The configuration is deleted and removed from the Configured collectors page.
