
Running the dbt Cloud collector in Cloud

Configuring the cloud collector for dbt Cloud

To configure the cloud collector for dbt Cloud:

  1. On the Organization profile page, go to the Settings tab > Metadata collectors section.

  2. Click the Add a collector button.

    add_a_collector.png
  3. On the Choose metadata collector screen, select the correct metadata source. Click Next.

  4. On the Choose where the collector will run screen, in the Cloud section, select data.world. Click Next.

    select_cloud.png
  5. On the Configure a cloud dbt cloud Collector screen, set the following:

  6. On the next screen, set the following properties and click Next.

    Table 2.

    Field name | Description | Required?
    dbt Cloud host | Specify the host for your organization's account on dbt cloud. If left unspecified, the default host cloud.getdbt.com is assumed. | No
    dbt cloud account ID | The dbt cloud account that owns the project from which to harvest dbt metadata artifacts. | Yes
    dbt cloud API key | A dbt cloud-issued API key with permissions to access the specified account. | Yes
    dbt cloud project | The name or numeric identifier of the project from which to harvest dbt metadata artifacts. | Yes
    dbt cloud run | The numeric identifier of the run that produced the artifacts to be harvested. If not specified, the most recent successful run that produced artifacts within the project is harvested. | No
    dbt Cloud environment | Specify the dbt Cloud environment (ID or name) used to filter the job runs from which to harvest dbt metadata artifacts. | No
    dbt Cloud job | Specify the dbt Cloud job (ID or name) used to filter the job runs from which to harvest dbt metadata artifacts. | No



  7. On the next screen, set the following advanced options and click Next.

    Important: You must set the information on this screen if you want to harvest Snowflake lineage relationships between columns specified through views.

    • You can authenticate to Snowflake using either a username and password or a private key file and password.

    • By default, the collector obtains the connection information to Snowflake from the identified dbt Cloud run. This connection information includes the Snowflake account, role, and warehouse used to authenticate to Snowflake. You have the option to override the connection information for a given run using the Snowflake account, Snowflake role, and Snowflake warehouse override fields.

    Table 3.

    Field name | Description | Required?
    Authentication | Select one of the following authentication options. | Yes

    Option 1: Snowflake username and password overrides
    Username | The user credential to use in connecting to the target database. |
    Password | The password credential to use in connecting to the target database. |

    Option 2: Snowflake private key file overrides
    Database username | Specify the username to use in connecting to the target database. |
    Snowflake key file path | The private key file to use for authentication with Snowflake (for example, rsa_key.p8). Use this option to override the dbt profile. |
    Snowflake key file password | The password for the private key file to use for authentication with Snowflake, if the key is encrypted and a password was set. Use this option to override the dbt profile or cloud configuration. |

    Other optional resources
    Snowflake application | The application connection parameter to use in connecting to the target Snowflake database. Use this option to override the dbt profile or cloud configuration. Use datadotworld unless otherwise directed. | No
    Snowflake account | The Snowflake account/tenant. | No
    Snowflake role | The role to use in connecting to the target Snowflake database. Use this option to override the dbt profile or cloud configuration. This is case-insensitive. | No
    Snowflake warehouse | The warehouse to use in connecting to the target Snowflake database. Use this option to override the dbt profile or cloud configuration. This is case-insensitive. | No



  8. On the next screen, set the following advanced options and click Next.

    Table 4.

    Field name | Description | Required?
    Max retries | Specify the number of times to retry an API call that has failed. The default value is 5. | No
    Retry delay | Specify the time in seconds to wait between retries of an API call that has failed. The default is 2 seconds. | No
    API HTTP header | Specify name-value pairs that the collector will include as HTTP headers in any calls to the HTTP API used by the collector to harvest metadata. Use the option multiple times for multiple headers. Note: Use this option only after consulting the data.world Support team. | No



  9. On the next screen, provide the Collector configuration name and set the run schedule. You can also set the schedule at a later point.

  10. Click Save and View to go to the collector details page.
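
The run-selection and retry settings described in the field tables above can be pictured with a small Python sketch. It is an illustration only, not the collector's implementation: the run records and their keys (id, succeeded, has_artifacts, environment, job, finished_at) are hypothetical, and it assumes that an explicit run ID takes precedence over the environment and job filters. The defaults of 5 retries and a 2-second delay come from the table.

```python
import time

DEFAULT_MAX_RETRIES = 5          # default stated for "Max retries"
DEFAULT_RETRY_DELAY_SECONDS = 2  # default stated for "Retry delay"


def select_run(runs, run_id=None, environment=None, job=None):
    """Pick the run to harvest from a list of hypothetical run records."""
    if run_id is not None:
        # Assumption: an explicit run ID bypasses the other filters.
        return next((r for r in runs if r["id"] == run_id), None)
    candidates = [
        r for r in runs
        if r["succeeded"] and r["has_artifacts"]
        and (environment is None or r["environment"] == environment)
        and (job is None or r["job"] == job)
    ]
    # Most recent successful run that produced artifacts, or None.
    return max(candidates, key=lambda r: r["finished_at"], default=None)


def call_with_retries(call, max_retries=DEFAULT_MAX_RETRIES,
                      retry_delay=DEFAULT_RETRY_DELAY_SECONDS,
                      sleep=time.sleep):
    """Run call(); on failure, retry up to max_retries times with a fixed delay."""
    retries_used = 0
    while True:
        try:
            return call()
        except Exception:
            if retries_used >= max_retries:
                raise  # retries exhausted: surface the last error
            retries_used += 1
            sleep(retry_delay)
```

With the defaults, a call that keeps failing is attempted six times in total (one initial call plus five retries). Passing a custom sleep function makes the sketch easy to exercise without waiting.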

Scheduling collector runs

Important things to note:

  • Different collectors can be scheduled to run at the same time, but a given collector can run only once a day.

  • It is recommended that you schedule the runs in off-peak hours.

  • The collector runs in the time zone of the person who sets the schedule (the scheduler). For example, if the scheduler sets the collector runs from the PST time zone, the collector follows PST.

  • Runs may start up to one hour after the scheduled time.
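
To see what the time zone and start-delay notes above mean for teammates in other regions, the following Python sketch converts a scheduled time to UTC and computes the window in which the run may start. It is a rough aid under stated assumptions: PST is modeled as a fixed UTC-8 offset with no daylight-saving handling, and the function name start_window is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST as a fixed offset of UTC-8 (daylight saving is ignored).
PST = timezone(timedelta(hours=-8), "PST")


def start_window(scheduled: datetime):
    """Return (earliest, latest) possible start times in UTC.

    Runs may start up to one hour after the scheduled time.
    """
    earliest = scheduled.astimezone(timezone.utc)
    return earliest, earliest + timedelta(hours=1)


# A run scheduled for 02:00 PST on 15 January 2024.
earliest, latest = start_window(datetime(2024, 1, 15, 2, 0, tzinfo=PST))
print(earliest.isoformat(), "to", latest.isoformat())
```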

To schedule collector runs:

  1. On the Configured collectors page, locate the collector you want to run on a schedule.

  2. Click the Edit configurations button.

  3. Go to the screen where you can set the schedule for the collector.

  4. Enable the Scheduled runs option.

  5. From the Frequency dropdown, select from Daily, Weekly, or Monthly.

  6. For Weekly and Monthly options, select the day when the collector should run.

  7. Select the time for running the collector.

  8. Click Save and view. The schedule and next run date and time are displayed on the collector details page.

  9. To get notifications about the collector runs, set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.

    org_webhooks.png

    Sample data captured by the webhook.

    hooks_notfication_collectors.png
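
A service that receives these webhook events might sort them by status. The Python sketch below shows one way to do that. The six status names are the ones listed above; the payload keys ("collector" and "status") and the grouping into in-progress and finished states are assumptions for illustration, so check them against the sample data your webhook actually captures.

```python
# Status names are taken from the list above; the grouping is an assumption.
IN_PROGRESS = {"Pending", "Provisioning", "Running"}
FINISHED = {"Completed", "Error", "Cancelled"}


def describe_event(event: dict) -> str:
    """Turn a (hypothetical) webhook payload into a one-line message."""
    name = event.get("collector", "unknown collector")  # hypothetical key
    status = event.get("status")                        # hypothetical key
    if status in IN_PROGRESS:
        return f"{name}: still in progress ({status})"
    if status == "Completed":
        return f"{name}: finished successfully"
    if status in FINISHED:
        return f"{name}: needs attention ({status})"
    return f"{name}: unrecognized status {status!r}"
```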

Running collectors manually

After setting up the collector configuration, it is advisable to run it manually once to confirm that the configuration is correct. Even collectors that are scheduled to run automatically can be started manually at any time.

To run the collectors manually:

  1. On the Configured collectors page, locate the collector you want to run.

  2. On the collector configuration details page, click the Run now button. Alternatively, on the Configured collectors page, open the Three dot menu and click the Run/Sync now button.

  3. On both pages, the Status field shows the status as Running with information about time elapsed since the run was started.

    The collector starts running in the background and you can navigate away from the page at any time. A long-running collector run that does not complete within one week is terminated automatically, and the Status section and the Status field update to an Error state.

  4. After the collector has completed the required pre-configuration steps and starts harvesting the metadata, you have the option to Cancel the harvesting process. If you cancel, the Status section and the Status field update to Cancelled.

  5. After the collector run has completed, the Status section of the collector configuration details page updates to show the successful status. The Last run summary page also updates to show the total number of resources collected and the total number of types of resources collected. Resources collected by type gives a granular breakdown of the number of resources collected for each type of resource.

  6. Browse to the Collection and Dataset specified while running the collector to view the collector output.

  7. To get notifications about the collector runs, set up webhooks at the Organization level from the Organization profile page > Settings tab. The webhooks automatically start capturing the Status events (Pending, Provisioning, Running, Completed, Error, Cancelled) for the collector runs.

    org_webhooks.png

    Sample data captured by the webhook.

    hooks_notfication_collectors.png
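
The three figures in the last run summary (total resources, number of resource types, and the per-type breakdown) relate to each other as in the Python sketch below. It is only an illustration of the arithmetic: the function name and the resource type labels in the usage example are hypothetical, not values the collector is known to report.

```python
from collections import Counter


def summarize_run(resource_types):
    """Summarize a run from one type label per collected resource."""
    by_type = Counter(resource_types)
    return {
        "total_resources": sum(by_type.values()),  # every collected resource
        "total_types": len(by_type),               # distinct resource types
        "by_type": dict(by_type),                  # count for each type
    }


# Hypothetical type labels, one per collected resource.
summary = summarize_run(["model", "model", "column", "column", "column", "source"])
print(summary)
```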

Copying collector configurations

After you have configured a collector for a source system, you can create a copy of the configuration to set up another collector for the same source system with different parameters.

To copy collector configurations:

  1. On the Configured collectors page, locate the collector configuration you want to copy.

  2. From the Three dot menu, click Duplicate configuration.

  3. In the Edit Collector window, provide a new name for the collector configuration. Optionally, set a schedule. Click Save and view.

  4. You are taken to the copied collector configuration page. Click the Edit Configuration button to adjust the details of the configuration.
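
Conceptually, duplicating a configuration means keeping every setting, giving the copy a new name, and then changing only the parameters that differ. The Python sketch below models that idea with a dataclass. The CollectorConfig class and its fields are hypothetical stand-ins chosen from the field tables earlier on this page; they are not a data.world API.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CollectorConfig:
    """Hypothetical model of a collector configuration."""
    name: str
    account_id: str
    project: str
    job: Optional[str] = None
    schedule: Optional[str] = None


def duplicate(config: CollectorConfig, new_name: str, **changes) -> CollectorConfig:
    """Copy a configuration under a new name, overriding selected fields."""
    return replace(config, name=new_name, **changes)


nightly = CollectorConfig(name="dbt nightly", account_id="12345",
                          project="analytics", job="nightly")
hourly = duplicate(nightly, "dbt hourly", job="hourly")
```

The original configuration is left unchanged, which mirrors the way the copied configuration is edited on its own page after duplication.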

Deleting configurations

Important things to note:

  • Deleting the configuration will not affect the resources that were collected from previous runs.

  • Any scheduled future runs for the collector are suspended.

To delete a configuration:

  1. On the Configured collectors page, locate the collector configuration you want to delete.

  2. From the Three dot menu, click the Delete configuration button.

  3. Confirm the deletion. The configuration is deleted and removed from the Configured collectors page.