
Technical Overview: Data pipeline concepts

How do BB Bots and Hoots work together to automate tracking the quality of your data?

To prepare the data used in a given data product, a data pipeline runs a series of data processing steps. Each step extracts data from one place, applies some transformations, and/or loads the result somewhere else to be used by the next step. This pattern is referred to as ETL (Extract, Transform, Load). For example, a data pipeline may look like this.

[Image: data_pipeline.png — an example data pipeline]
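
To make the ETL idea concrete, here is a minimal sketch of a step chain in Python. It is purely illustrative: the file names and field names are made up, and it is not related to any hoot or BB bot code.

```python
# Minimal sketch of an ETL-style step: extract data from one place,
# transform it, and load it somewhere for the next step (or report) to use.
# The file paths and the "amount" field are hypothetical placeholders.
import csv
import json

def extract(path):
    # Extract: read raw rows from a source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: keep only rows with an amount and normalize it to a float
    return [
        {**row, "amount": float(row["amount"])}
        for row in rows
        if row.get("amount")
    ]

def load(rows, path):
    # Load: write the result where the next step reads it
    with open(path, "w") as f:
        json.dump(rows, f)

if __name__ == "__main__":
    load(transform(extract("raw_orders.csv")), "clean_orders.json")
```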

For each data processing step, you can configure a BB bot to watch for the successful execution of that step and to collect specific information about the execution (BB bot metadata). Each BB bot reports its information to the hoot app. The hoot app maintains an association between each BB bot and the hoots for the reports that use, directly or indirectly, the output data produced by that BB bot's data processing step.
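
As an illustration of the kind of information a BB bot might collect, the sketch below reports a few made-up metadata fields after a step finishes. The report_run helper, the field names, and the API URL are all hypothetical assumptions, not part of the actual BB bot or hoot API.

```python
# Hypothetical sketch of what a BB bot might report after a step runs.
# None of these function or field names come from the hoot product;
# they stand in for "collect execution metadata and send it to the hoot app".
import json
import time
import urllib.request

def report_run(bot_id, status, row_count, api_url):
    metadata = {
        "bot_id": bot_id,            # which BB bot is reporting
        "status": status,            # e.g. "ok" or "failed"
        "row_count": row_count,      # example of step-specific metadata
        "finished_at": time.time(),  # when the step finished
    }
    req = urllib.request.Request(
        api_url,
        data=json.dumps(metadata).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# After the (hypothetical) Python script step completes:
# report_run("python-script-bot", "ok", row_count=1200,
#            api_url="https://example.invalid/hoot/runs")
```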

When a BB bot reports an issue with the data or with the execution of its data processing step, the hoot app forwards that information to the hoots associated with that BB bot and automatically updates each hoot Badge to reflect the new information.
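
Conceptually, the hoot app keeps a mapping from each BB bot to its associated hoots and updates each of those badges when a report arrives. The sketch below is a simplified, hypothetical model of that behaviour; the class and field names are invented and do not reflect hoot's actual implementation.

```python
# Simplified, hypothetical model of how issue reports fan out to hoots.
class HootApp:
    def __init__(self):
        # bot_id -> list of hoot ids whose reports use that bot's output
        self.hoots_for_bot = {}
        # hoot_id -> current badge state, e.g. "healthy" or "issue"
        self.badges = {}

    def associate(self, bot_id, hoot_id):
        self.hoots_for_bot.setdefault(bot_id, []).append(hoot_id)
        self.badges.setdefault(hoot_id, "healthy")

    def receive_report(self, bot_id, has_issue):
        # Forward the report to every hoot associated with this BB bot
        for hoot_id in self.hoots_for_bot.get(bot_id, []):
            self.badges[hoot_id] = "issue" if has_issue else "healthy"
```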

Can I connect multiple hoots to the same BB bot?

Yes!

When the data produced by a data processing step is used by multiple reports, the hoot for each report can be connected to the BB bot for that step. When the BB bot indicates a concern with the data processing, all connected hoots are informed of the situation.

For example, in the pipeline above, a BB bot associated with the Tableau extract will report any issues to the hoot associated with the Tableau report. If an analyst develops another dashboard that uses the same Tableau extract, the hoot for the new dashboard can be connected directly to the same Tableau extract BB bot.
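
Continuing the hypothetical HootApp sketch above, connecting a second dashboard's hoot to the same Tableau extract BB bot might look like this:

```python
# Hypothetical usage: two report hoots share one Tableau-extract BB bot.
app = HootApp()
app.associate("tableau-extract-bot", "sales-report-hoot")
app.associate("tableau-extract-bot", "new-dashboard-hoot")

# When the BB bot reports an issue, both hoot Badges change together.
app.receive_report("tableau-extract-bot", has_issue=True)
print(app.badges)  # {'sales-report-hoot': 'issue', 'new-dashboard-hoot': 'issue'}
```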

How do I connect BB bots to reflect my data pipeline?

A data processing step extracts data either from direct inputs or from data loaded by another data processing step. When data processing step B extracts data that was loaded earlier by data processing step A, you can reflect this relationship in the BB bots for the two steps. hoot allows you to specify this relationship by defining the BB bot for step B as directly downstream from the BB bot for step A, or, equivalently, by defining the BB bot for step A as directly upstream from the BB bot for step B.

For example, a BB bot associated with the Python script in the above pipeline would be directly upstream from a BB bot associated with the dbt processing step. A BB bot associated with the dbt processing step would be directly downstream from a BB bot associated with the Python script.

By defining the direct upstream-downstream relationships between BB bots, you can build a BB bot model that reflects the actual flow of data in your data pipeline.
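
One way to picture the resulting BB bot model is as a small directed graph, with an edge from each upstream BB bot to its directly downstream BB bot. The sketch below is illustrative only: the bot identifiers are made up (loosely following the example pipeline), and the graph representation is an assumption about how such a model could be expressed, not hoot's internal format.

```python
# Hypothetical sketch: the BB bot model as a directed graph.
# Each entry lists the BB bots directly upstream of a given BB bot.
upstream_of = {
    "dbt-bot": ["python-script-bot"],
    "tableau-extract-bot": ["dbt-bot"],
}

def all_upstream(bot_id, graph=upstream_of):
    """Walk the graph to find every bot upstream of bot_id, directly or indirectly."""
    seen = set()
    stack = list(graph.get(bot_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

print(sorted(all_upstream("tableau-extract-bot")))  # ['dbt-bot', 'python-script-bot']
```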

How do BB bot upstream-downstream relationships help the hoot report more accurately?

The BB bot relationships help the hoot app assess the effects of reported issues more accurately, because they give it a better understanding of the pipeline's timing and execution.

Consider the case above, with data processing step A upstream of data processing step B. When step B reads the output data from step A, that read happens at a particular point in time T. Later, step A may run again and refresh its output. Even later, at the end of the pipeline, the final data product incorporates the version of step A's data that was read at time T. When hoot displays the hoot Badge on that data product, the badge should reflect the state of the step A BB bot at time T, rather than its state at the time the hoot Badge is being rendered. The upstream-downstream relationships between the BB bots, together with their historical states, enable hoot to take this timing into account.
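
A simplified way to model this is to keep a timestamped history of each BB bot's state and, when rendering a badge, look up the upstream BB bot's state as of the time its output was read rather than its latest state. The sketch below is hypothetical and only illustrates that lookup; it is not hoot's actual mechanism.

```python
# Hypothetical sketch: resolving an upstream BB bot's state "as of" time T.
from bisect import bisect_right

# Timestamped state history for step A's BB bot: (timestamp, state)
history_a = [
    (100, "issue"),    # step A's run at time 100 reported a problem
    (200, "healthy"),  # a later rerun of step A succeeded
]

def state_as_of(history, t):
    """Return the most recent state recorded at or before time t."""
    times = [ts for ts, _ in history]
    i = bisect_right(times, t)
    return history[i - 1][1] if i else None

# Step B read step A's output at T = 150, while the issue was in effect.
# Step A reran successfully at 200, but the final product still contains
# the data read at T, so the badge should still reflect the state at T.
print(state_as_of(history_a, 150))  # issue   <- what the badge should show
print(state_as_of(history_a, 250))  # healthy <- latest state, not what was incorporated
```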