Technical Overview: Data pipeline concepts
How do Sentries and hoots work together to automate tracking the quality of your data?
To prepare the data used in a given data product, the data pipeline goes through a series of data processing steps that extract data from one place, apply some transformations to it, and/or load it into another place to be used by the next data processing step. This pattern is referred to as ETL (Extract, Transform, Load). For example, a data pipeline may look like this:
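To make the pattern concrete, here is a minimal Python sketch of a single data processing step; the file names, columns, and transformation are invented for the example.

```python
import csv

def extract(path):
    """Extract: read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: keep only completed orders and normalize the amount."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r.get("status") == "completed"
    ]

def load(rows, path):
    """Load: write the cleaned rows where the next step will pick them up."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)

# One data processing step: extract -> transform -> load.
load(transform(extract("raw_orders.csv")), "clean_orders.csv")
```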
For each data processing step, you can configure a Sentry to watch for the successful execution of that step and to collect specific information about the execution (Sentry metadata). Each Sentry reports its information to the hoot app. The hoot app maintains an association between each Sentry and the hoots for the reports that use, directly or indirectly, the output data produced by that Sentry's data processing step.
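The Sentry configuration and reporting API is not shown in this overview, so the sketch below uses hypothetical names purely to illustrate the kind of execution metadata a Sentry might collect for one step before handing it to the hoot app.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SentryReport:
    """Hypothetical shape of the metadata a Sentry collects for one run."""
    step_name: str
    succeeded: bool
    rows_processed: int
    issues: list = field(default_factory=list)
    finished_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def run_step_with_sentry(step_name, step_fn):
    """Run a data processing step and build a report about its execution."""
    try:
        rows = step_fn()
        return SentryReport(step_name, succeeded=True, rows_processed=len(rows))
    except Exception as exc:
        return SentryReport(step_name, succeeded=False, rows_processed=0,
                            issues=[str(exc)])

# The resulting report would then be sent to the hoot app (transport not shown).
report = run_step_with_sentry("python_script", lambda: [{"order_id": 1}])
```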
When a Sentry reports an issue with the data or with the execution of its data processing step, the hoot app forwards that information to the hoots associated with that Sentry and automatically updates each of their hoot Badges to reflect the new information.
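As a rough illustration only (not how the hoot app actually computes a Badge), the following sketch shows the idea that the issues reported by the Sentries feeding a report determine what the hoot Badge displays; the states and the rule are assumptions.

```python
def badge_state(sentry_reports):
    """Hypothetical rule: a failed run or any reported issue degrades the Badge."""
    if any(not r["succeeded"] for r in sentry_reports):
        return "ERROR"
    if any(r["issues"] for r in sentry_reports):
        return "WARNING"
    return "OK"

# An issue on any Sentry feeding the report changes what the Badge shows.
print(badge_state([{"succeeded": True, "issues": []}]))                   # OK
print(badge_state([{"succeeded": True, "issues": ["null order_ids"]}]))   # WARNING
print(badge_state([{"succeeded": False, "issues": ["dbt run failed"]}]))  # ERROR
```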
Can I connect multiple hoots to the same Sentry?
Yes!
In cases where the data produced by some data processing step is used by multiple reports, the hoot for each report can be connected to the Sentry related to that data processing step. When the Sentry indicates a concern with the data processing, all hoots will be informed of the situation.
For example, in the pipeline above, a Sentry associated with the Tableau extract will report any issues to the hoot associated with the Tableau report. If the analyst develops another dashboard that uses the same Tableau extract, then the hoot for the new dashboard can connect directly to the same Tableau extract Sentry.
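A minimal sketch of this one-to-many connection might look like the following; the Sentry and hoot identifiers, the data structure, and the function are assumptions for illustration, not hoot's API.

```python
from collections import defaultdict

# Hypothetical mapping of each Sentry to the hoots connected to it.
hoots_for_sentry = defaultdict(set)
hoots_for_sentry["tableau_extract_sentry"].add("sales_report_hoot")
hoots_for_sentry["tableau_extract_sentry"].add("new_dashboard_hoot")

def notify_hoots(sentry_id, issue):
    """Fan an issue out to every hoot connected to the reporting Sentry."""
    for hoot_id in hoots_for_sentry[sentry_id]:
        print(f"{hoot_id}: issue from {sentry_id} -> {issue}")

notify_hoots("tableau_extract_sentry", "extract refresh failed")
```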
How do I connect Sentries to reflect my data pipeline?
A data processing step extracts data either from direct inputs or from data loaded by another data processing step. When data processing step B extracts data that was loaded earlier by data processing step A, you can reflect this relationship in the Sentries for the two steps. hoot lets you specify the relationship by defining the Sentry for step B as directly downstream from the Sentry for step A, or equivalently, by defining the Sentry for step A as directly upstream from the Sentry for step B.
For example, in the pipeline above, a Sentry associated with the Python script would be directly upstream from a Sentry associated with the dbt processing step, and equivalently, the dbt Sentry would be directly downstream from the Python script Sentry.
By defining the direct upstream-downstream relationships between Sentries, you can build a Sentry model that reflects the actual flow of data in your data pipeline.
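One way to picture such a Sentry model is as a small graph in which each Sentry lists its direct upstream Sentries. The sketch below uses the pipeline steps mentioned above, but the representation itself is only an assumption for illustration.

```python
# Hypothetical representation: each Sentry lists its direct upstream Sentries.
upstream_of = {
    "python_script": [],
    "dbt": ["python_script"],
    "tableau_extract": ["dbt"],
}

def all_upstream(sentry, graph):
    """Walk the direct relationships to find every Sentry upstream of this one."""
    seen = set()
    stack = list(graph.get(sentry, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen

print(all_upstream("tableau_extract", upstream_of))  # {'dbt', 'python_script'}
```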
How do Sentry upstream-downstream relationships help the hoot report more accurately?
The Sentry relationships help the hoot app assess the effects of reported issues more accurately, because they give it a clearer picture of the pipeline's timing and execution order.
Consider the case above, with data processing step A upstream of data processing step B. Step B reads the output data from step A at a particular point in time, T. Later, step A may run again and refresh its output. Even later, at the end of the pipeline, the final data product incorporates the version of the data that step B read at time T. When hoot displays the hoot Badge on that data product, the Badge should reflect the state of the step A Sentry at time T, rather than its state at the moment the Badge is rendered. The upstream-downstream relationships between the Sentries, together with their historical states, enable hoot to take this timing into account.
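To illustrate the idea (only the idea, not hoot's implementation), the sketch below keeps a small history of states for the step A Sentry and looks up the state that was in effect at the time step B read the data; the timestamps and states are invented.

```python
from datetime import datetime

# Hypothetical history of the step A Sentry's state over time, oldest first.
step_a_history = [
    (datetime(2024, 1, 1, 6, 0), "OK"),
    (datetime(2024, 1, 1, 12, 0), "WARNING"),  # refresh after step B already read the data
]

def state_at(history, t):
    """Return the most recent state recorded at or before time t."""
    state = None
    for recorded_at, recorded_state in history:
        if recorded_at <= t:
            state = recorded_state
        else:
            break
    return state

# Step B read step A's output at 09:00, so the Badge should reflect "OK",
# even though step A's Sentry later moved to "WARNING".
time_of_read = datetime(2024, 1, 1, 9, 0)
print(state_at(step_a_history, time_of_read))  # OK
```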