GROUP BY and FILTER
An introduction to the GROUP BY
clause and FILTER
modifier.
GROUP BY
enables you to use aggregate functions on
groups of data returned from a query.
FILTER
is a modifier used on an aggregate function to
limit the values used in an aggregation. All the columns in the select
statement that aren’t aggregated should be specified in a GROUP BY
clause in
the query.
GROUP BY
Returning to a previous section, when we were working with aggregations, we
used the aggregate function AVG
to find out the average deal size. If we
wanted to know the average value of the deals won by each sales person from
highest average to lowest, the query would look like:
SELECT sales_agent,
AVG(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent
ORDER BY AVG(close_value) DESC
sales_agent | avg |
---|---|
Elease Gluck | 3614.9375 |
Darcel Schlecht | 3304.3381088825213 |
Rosalina Dieter | 3269.4861111111113 |
Daniell Hammack | 3194.9912280701756 |
James Ascencio | 3063.2074074074076 |
Rosie Papadopoulos | 2950.8846153846152 |
Wilburn Farren | 2866.181818181818 |
Reed Clapper | 2827.974193548387 |
Donn Cantrell | 2821.8987341772154 |
We could even ascertain the average value of deals aggregated by manager by running a query with a join like this:
SELECT sales_teams.manager,
AVG(sales_pipeline.close_value)
FROM sales_teams
JOIN sales_pipeline ON (sales_teams.sales_agent = sales_pipeline.sales_agent)
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_teams.manager
manager | avg |
---|---|
Dustin Brinkmann | 1465.0107095046853 |
Summer Sewald | 2372.886473429952 |
Melvin Marxen | 2553.2086167800453 |
Cara Losch | 2354.26875 |
Celia Rouche | 2629.339344262295 |
Rocco Neubert | 2837.257597684515 |
Though it’s not required by SQL, it is advisable to include
all non-aggregated columns from your SELECT
clause in your GROUP BY
clause.
If you don’t, there are cases where the query will return the desired results,
there are also instances where a random value from the non-aggregated row will
be used as the representative for all the values returned by the query.
For example, let’s say you wanted to know the average deal by sales agent for each of their customers. If you used the query:
SELECT sales_agent,
account,
SUM(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent
ORDER BY sales_agent
you would get back the following table which shows each sales agent one time and chooses a value at random from the accounts won by that sales person:
|sales_agent|account|sum| |-|-| |Anna Snelling|Bioholding|275,056| |Boris Faz|Mathtouch|261,631| |Cassey Cress|Sumace|450,489| |Cecily Lampkin|Funholding|229,800| |Corliss Cosme|Zencorporation|421,036| |Daniell Hammack|Zathunicon|364,229| |Darcel Schlecht|Isdom|1,153,214| |Donn Cantrell|Bluth Company|445,860| |Elease Gluck|Dalttechnology|289,195| |Garret Kinder|Konex|197,773|
To get the average deal by sales agent for each account the query would look like this:
SELECT sales_agent,
account,
SUM(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent, account
ORDER BY sales_agent, account
The first several rows of the table returned would look like this:
sales_agent | account | sum |
---|---|---|
Anna Snelling | Betatech | 11,340 |
Anna Snelling | Bioholding | 12,382 |
Anna Snelling | Cancity | 1,496 |
Anna Snelling | Codehow | 22,479 |
Anna Snelling | Condax | 1,572 |
Anna Snelling | Conecom | 4,186 |
Anna Snelling | Domzoom | 6,639 |
Anna Snelling | Dontechi | 5,578 |
Anna Snelling | Finhigh | 1,762 |
Anna Snelling | Funholding | 5,654 |
FILTER
If you wanted to refine your query even more by running your aggregations
against a limited set of the values in a column you could use the FILTER
keyword. For example, if you wanted to know both the number of deals won by
a sales agent and the number of those deals that had a value greater than 1000,
you could use the query:
SELECT sales_agent,
COUNT(sales_pipeline.close_value) AS total,
COUNT(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `over 1000`
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
The first several rows of the resulting table would look like this:
sales_agent | total | over 1000 |
---|---|---|
Boris Faz | 101 | 70 |
Maureen Marcano | 149 | 96 |
Vicki Laflamme | 221 | 111 |
Donn Cantrell | 158 | 106 |
Jonathan Berthelot | 171 | 74 |
Wilburn Farren | 55 | 38 |
Elease Gluck | 80 | 32 |
Cassey Cress | 163 | 112 |
James Ascencio | 135 | 88 |
Kami Bicknell | 174 | 78 |
Anna Snelling | 208 | 68 |
Violet Mclelland | 122 | 33 |
As we saw in the aggregate functions section, WHERE
also
limits the values in a query against which an aggregate function is run.
FILTER
is more flexible than WHERE
because you can use more than one
FILTER
modifier in an aggregate query while you can only use only one WHERE
clause.
For example:
SELECT sales_agent,
COUNT(sales_pipeline.close_value) AS `number won`,
COUNT(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `number won > 1000`,
AVG(sales_pipeline.close_value) AS `average of all`,
AVG(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `avg > 1000`
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
The first several rows returned by the above query would look like:
sales_agent | number won | number > 1000 | average of all | avg > 1000 |
---|---|---|---|---|
Rosalina Dieter | 72 | 30 | 3,269.49 | 7,537.83 |
Daniell Hammack | 114 | 91 | 3,194.999 | 3,869.47 |
Gladys Colclough | 135 | 88 | 2,560.55 | 3,791.19 |
Rosie Papadopoulos | 78 | 49 | 2,950.885 | 4,559.16 |
Kary Hendrixson | 209 | 98 | 2,173.675 | 4,071.13 |
Cecily Lampkin | 107 | 63 | 2,147.665 | 3,439.74 |
Lajuana Vencill | 127 | 49 | 1,532.54 | 3,262.94 |
Markita Hansen | 130 | 68 | 2,529.17 | 4,512.37 |
Moses Frase | 129 | 51 | 1,606.067 | 3,441.45 |
Darcel Schlecht | 349 | 272 | 3,304.34 | 4,115.04 |
Hayden Neloms | 107 | 82 | 2,543.09 | 3,215.06 |
Reed Clapper | 155 | 110 | 2,827.97 | 3,808.98 |
Zane Levy | 161 | 99 | 2,671.23 | 4,038.43 |
Garret Kinder | 75 | 46 | 2,636.97 | 4,041.37 |
Niesha Huffines | 105 | 56 | 1,685.34 | 2,747.96 |
Exercises (Continued from previous section)
There are two ways to do these exercises. The first is to use the “Try query” links to test your queries without saving them. The second is to create a data.world project and save your queries to it. If you are reading this documentation and completing the exercises as a tutorial, you will need to create your own project to save your work. Details and instructions are in the SQL tutorial which has instructions for setting up your project and links to all the current exercises.
Exercise 27
Write a query that returns the patient column and a count of all the allergies the patient has from allergies table. Group your results by patient, and order them by the number of allergies from greatest to least.
Hint
SELECT
the column patient,
and a COUNT
of the (description)
column FROM
the allergies
table and GROUP
the results BY
patient
. ORDER
the results BY
the COUNT
of the (description)
column in DESC
ending order.
Exercise 28
Write a query that returns the patient column, the average of the value column relabeled as Avg BMI
, the count of the value column relabeled as Number of readings
,
and the maximum value of the value column filtered for values over 30 and label it as Max Obese BMI
. The query should be written against the observations_cleaned table and the results should all be for records where the description is “Body Mass Index”. Group your results by the patient column.
Hint
SELECT
the column patient,
the AVG
of the (value)
column relabeled AS
`AVG BMI`
,
a COUNT
of the (value)
column relabeled AS
`Number of Readings`
,
and the MAX
of the column (value)
. FILTER
the results to return only records(WHERE
the value
column is > 30)
and label the resulting column AS
`Max Obese BMI`
. All of these columns should be selected FROM
the observations_cleaned
table. Restrict your results to only include records WHERE
the description
column =
“Body Mass Index”
. GROUP
the results BY
the patient
column.
An introduction to the HAVING
clause.