Windowed Aggregations
An introduction to windowed aggregations.
Windowed aggregations partition the results from a SQL query into groups in order to perform calculations across adjacent rows of the query result. Currently windowed aggregations cannot be combined in the same SELECT
statement with GROUP BY
, HAVING
, or any other aggregations. They are placed before an ORDER BY
clause if one is used. To invoke a WINDOW
function you use a special syntax with the OVER
clause to specify the WINDOW
. There are three parts in a windowed aggregation:
PARTITION BY
- The partition specification works similarly to theGROUP BY
clause and indicates how the query results are divided into groups. It is always optional.ORDER BY
- The ordering specification determines in what order the aggregations will be applied. It is required for the new functions and optional when using standard aggregations as windowed functions.ROWS BETWEEN
orRANGE BETWEEN
- The window frame designates a subset of consecutive rows adjacent to the current row in the window that will be evaluated as a group. The difference betweenROWS BETWEEN
andRANGE BETWEEN
is thatROWS BETWEEN
specifies the distance in number of rows andRANGE BETWEEN
specifies the distance in the values. Options for both are:N PRECEDING AND M FOLLOWING
- where N and M are positive integers.N PRECEDING AND M PRECEDING
- legal but uncommonN FOLLOWING AND M FOLLOWING
- legal but uncommonCURRENT
- can replace either aPRECEDING
value or aFOLLOWING
value as inROWS BETWEEN CURRENT AND 7 FOLLOWING
orROWS BETWEEN 7 PRECEDING AND CURRENT
UNBOUNDED
- can replace either aPRECEDING
value or aFOLLOWING
value as inROWS BETWEEN 7 PRECEDING AND UNBOUNDED
orROWS BETWEEN UNBOUNDED and CURRENT
If not specified, the default range is the beginning of the partition through the current row. It is always optional.
Per the SQL standard, the newly supported windowed functions do NOT work with WINDOW
frames. WINDOW
frames are only for other aggregate functions.
Windowed functions also take WINDOW
clauses which are similar to WITH
clauses elsewhere in SQL. They allow you to define a WINDOW
in the clause and then finish it’s definition in the aggregation, but you can’t actually override anything.That is to say, you can specify one or two of the three parts of the aggregation (PARTITION BY
, ORDER BY
, or ROWS BETWEEN
) but they can’t also be specified in the aggregation. WINDOW
clauses can also be nested. In other words you can define one, call it from another, and then use the second one in the aggregation. Though there are many ways you can use WINDOW
clauses, the most common one is to fully define a window within the clause and then use it multiple times in different aggregations.
Following are some examples of common uses of windowed functions:
AVG OVER
If we wanted to know the average order value for a customer ordered from oldest to newest and grouped into threes (an average of the current order, the preceding order, and the following order) the query would look like this:
SELECT account, create_date, AVG(order_value) OVER (PARTITION BY account ORDER BY create_date ROWS BETWEEN 1 preceding and 1 following)
FROM jun_2017_orders
And a subset of the results would look like this:
account | create_date | avg |
---|---|---|
Betasoloin | 6/3/17 | 3993.5 |
Betasoloin | 6/4/17 | 4159.666667 |
Betasoloin | 6/6/17 | 4187 |
Betasoloin | 6/9/17 | 3570 |
Betasoloin | 6/24/17 | 3109 |
Gekko & Co | 6/2/17 | 3189.5 |
Gekko & Co | 6/11/17 | 2480.333333 |
Gekko & Co | 6/12/17 | 878 |
Gekko & Co | 6/14/17 | 542 |
ROW_NUMBER
This is a useful function for knowing what row you are on within your query results:
SELECT name, regional_office, ROW_NUMBER() OVER ( PARTITION BY regional_office ORDER BY name)
FROM employees
Results show the row numbers based on the order by constraint:
name | regional_office | row_number |
---|---|---|
Carl Lin | West | 1 |
Carol Thompson | West | 2 |
Elease Gluck | West | 3 |
Hayden Neloms | West | 4 |
James Ascencio | West | 5 |
Kami Bicknell | West | 6 |
Kary Hendrixson | West | 7 |
Markita Hansen | West | 8 |
Maureen Marcano | West | 9 |
RANK
The are a few things to note about rank on a range of values in a window partition:
- If ranked values are the same they get the same rank.
- A row that doesn’t have the same value as the preceding row has a rank equal to its row number.
- Because of the way it handles ties,
RANK
can have gaps in the sequence.
Here is an example of a query that returns the home run ranking of baseball players in 2000 from the Lahman Sabermetrics dataset:
SELECT playerid, HR, RANK() OVER ( ORDER BY HR DESC)
FROM batting
WHERE yearId = 2000
ORDER BY HR DESC, playerid
And here are the results:
playerid | HR | rank |
---|---|---|
sosasa01 | 50 | 1 |
bondsba01 | 49 | 2 |
bagweje01 | 47 | 3 |
glaustr01 | 47 | 3 |
guerrvl01 | 44 | 5 |
hidalri01 | 44 | 5 |
giambja01 | 43 | 7 |
sheffga01 | 43 | 7 |
thomafr04 | 43 | 7 |
edmonji01 | 42 | 10 |
heltoto01 | 42 | 10 |
batisto01 | 41 | 12 |
delgaca01 | 41 | 12 |
rodrial01 | 41 | 12 |
griffke02 | 40 | 15 |
FIRST_VALUE
The FIRST_VALUE
function returns the first value for each partition, as in this query which returns the first order value for each account in the jun_2017_ orders table:
SELECT account, create_date, order_value, FIRST_VALUE(order_value) OVER (PARTITION BY account ORDER BY create_date)
FROM jun_2017_orders
account | create_date | order_value | first_value |
---|---|---|---|
Betasoloin | 6/3/17 | 5100 | 5100 |
Betasoloin | 6/4/17 | 2887 | 5100 |
Betasoloin | 6/9/17 | 5182 | 5100 |
Betasoloin | 6/24/17 | 1036 | 5100 |
Gekko & Co | 6/2/17 | 5304 | 5304 |
Gekko & Co | 6/11/17 | 1075 | 5304 |
Gekko & Co | 6/12/17 | 1062 | 5304 |
Gekko & Co | 6/14/17 | 497 | 5304 |
Gekko & Co | 6/19/17 | 67 | 5304 |
Gekko & Co | 6/25/17 | 5031 | 5304 |
Gekko & Co | 6/26/17 | 5622 | 5304 |
Plexzap | 6/11/17 | 3850 | 3850 |
LEAD
If you want to know something like both the current order value and the next order value for a customer you could use the LEAD
function. LEAD
takes three arguments:
- The name of the column to evaluate
- The number of rows to offset
- The value to put if there are no more rows that meet the partition requirement
Here is an example of a query that aggregates the results by account and either returns the next row’s value for the company or returns -1 if it’s the last order for the account:
SELECT account, create_date, order_value, LEAD(order_value, 1, -1) OVER (PARTITION BY account ORDER BY create_date)
FROM jun_2017_orders
account | create_date | order_value | lead |
---|---|---|---|
Betasoloin | 6/3/17 | 5100 | 2887 |
Betasoloin | 6/4/17 | 2887 | 4492 |
Betasoloin | 6/6/17 | 4492 | 5182 |
Betasoloin | 6/9/17 | 5182 | 1036 |
Betasoloin | 6/24/17 | 1036 | -1 |
Gekko & Co | 6/2/17 | 5304 | 1075 |
Gekko & Co | 6/11/17 | 1075 | 1062 |
Gekko & Co | 6/12/17 | 1062 | 497 |
LAG
The LAG
function works like the LEAD
function except that it returns a value from any preceding row:
SELECT account, create_date, order_value, LAG(order_value, 1, -1) OVER (PARTITION BY account ORDER BY create_date)
FROM jun_2017_orders
count | create_date | order_value | lag |
---|---|---|---|
Betatech | 2017-06-02 | 3666 | -1 |
Betatech | 2017-06-05 | 460 | 3666 |
Betatech | 2017-06-06 | 41 | 460 |
Betatech | 2017-06-11 | 1067 | 41 |
Betatech | 2017-06-12 | 5370 | 1067 |
Betatech | 2017-06-17 | 3822 | 5370 |
Betatech | 2017-06-24 | 4410 | 3822 |
Betatech | 2017-06-25 | 619 | 4410 |
Ganjaflex | 2017-06-02 | 3780 | -1 |
Here is a list of all the new windowed functions with links to their reference pages:
An introduction to ROLLUP
and CUBE
.