Skip to main content

Setting incremental runs for Snowflake Cloud collector

Important

This feature is available exclusively for Snowflake Cloud collectors and should be used only under the guidance of the data.world team.

How does incremental metadata collection work?

The Snowflake Collector’s incremental collection feature harvests changed objects rather than performing a full scan from Snowflake. It identifies changes to objects in Snowflake and collects metadata at the schema level only for schemas containing changed Snowflake objects since the previous collector run.

Changes in Snowflake are identified through Snowflake’s account_usage schema, which captures changes for most object types in Snowflake. Changes include new objects, modifications to existing objects, and deletions.

Changes to objects in Snowflake are logged in the account_usage table with an approximate latency of 3 hours. During incremental collection, schemas containing modified objects are identified after they have been recorded in Snowflake. For more information, refer the Snowflake documentation. Therefore, the collector scans for changes starting from the last run time minus a 3-hour margin to accommodate this latency. We do not recommend running the collector at intervals shorter than this latency period.

The incrementally collected catalog is combined with the previous full catalog via an interleaving process. All objects within schemas with detected changes are removed from the original catalog, and the catalog is then merged with the incrementally collected catalog.

Important

Incremental collection only works when the catalog file is automatically uploaded to data.world. It will not work for local catalog files generated when running the collector via command line.

What is not supported?

  • Tag collection: If tag collection is enabled, not all changed tag values are harvested during incremental collection unless the associated tag was part of a schema with changes identified by Snowflake. This specifically pertains to tag values (for example, adding a tag value to an existing tag).

  • Streamlit app collection: If Streamlit app collection is enabled, not all changed Streamlit apps are harvested during incremental collection unless the Streamlit app was part of a schema with changes identified by Snowflake.

  • Extended metadata collection: If extended metadata collection is enabled, not all changes to extended metadata are identifiable via Snowflake. Specifically, changes to the owner of an object (Snowflake role that owns the object) are not supported.

  • Column statistics: If column stats are enabled, updates only apply to all columns in schemas with objects that had changes identified by Snowflake.

FAQ

  • How are log messages recorded for incremental runs?

    Historical logs are not available for direct download. Log messages written to the catalog output file only include log messages from the last incremental run and does not include log messages from prior runs.

  • Can I change the cloud collector configurations after setting up incremental updates?

    You should only change the following configurations: Snowflake role, Snowflake warehouse, user, password, and scheduling. Changing any other configuration will cause incremental collector runs to fail.

  • Can I run incremental or full collection on-demand?

    Yes, you can run an incremental or full collection on-demand. Do not run both full and incremental collections simultaneously.

  • Can I manually change the uploaded catalog file in catalog-sources?

    No. Incremental collection identifies the last successful run time from the catalog output file in the upload dataset location. Changing the catalog file name will break incremental collection. Do not manually change the catalog file contents. The collector uses the last uploaded catalog file to determine the state of the last collection and to combine it with the incrementally collected catalog.

  • Why do I sometimes see a full collection when incremental collection is enabled?

    The collector performs a full collection, even if incremental mode is enabled, if there are new or changed parameters available. This ensures the consistency of your catalog.

Setting up incremental runs for Snowflake collector

  1. Configure a cloud collector.

  2. Complete one successful run of the collector. Ensure you are satisfied with the collector configurations and output. Once incremental collection is set up, only very limited configurations can be changed for the collector.

  3. Next, edit the cloud collector configuration to enable the Snowflake collector incremental collection mode setting.

    snowflake_incremental_collection.png
  4. Run the collector manually or on a schedule. The collector will now only harvest incremental metadata. Be aware of the 3-hour logging latency in Snowflake. Adjust your scheduling accordingly to ensure all changes are captured.

Setting up periodic full collection runs for Snowflake collector

Since incremental runs do not harvest some specific data, we recommend that you periodically perform a full collection.

To set periodic full collection:

  1. Create a copy of the existing collector configuration used for incremental updates.

  2. Disable the Snowflake collector incremental collection mode setting to convert the collector configuration to full collection mode. This is the only property you should change.

  3. If the incremental collector runs on a schedule, change the schedule of this configuration to ensure full and incremental collections do not run simultaneously.

  4. Run the collector either manually or on a schedule.