GROUP BY and FILTER

An introduction to the GROUP BY clause and FILTER modifier.

GROUP BY enables you to use aggregate functions on groups of data returned from a query.

FILTER is a modifier used on an aggregate function to limit the values used in an aggregation. All the columns in the select statement that aren’t aggregated should be specified in a GROUP BY clause in the query.

`GROUP BY`

Returning to a previous section, when we were working with aggregations, we used the aggregate function AVG to find out the average deal size. If we wanted to know the average value of the deals won by each sales person from highest average to lowest, the query would look like:

SELECT sales_agent,
       AVG(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_agent
 ORDER BY AVG(close_value) DESC

siyeh/sql-crm-example-data Run query Copy code

sales_agent	avg
Elease Gluck	3614.9375
Darcel Schlecht	3304.3381088825213
Rosalina Dieter	3269.4861111111113
Daniell Hammack	3194.9912280701756
James Ascencio	3063.2074074074076
Rosie Papadopoulos	2950.8846153846152
Wilburn Farren	2866.181818181818
Reed Clapper	2827.974193548387
Donn Cantrell	2821.8987341772154

We could even ascertain the average value of deals aggregated by manager by running a query with a join like this:

SELECT sales_teams.manager,
       AVG(sales_pipeline.close_value)
  FROM sales_teams
       JOIN sales_pipeline ON (sales_teams.sales_agent = sales_pipeline.sales_agent)
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_teams.manager

siyeh/sql-crm-example-data Run query Copy code

manager	avg
Dustin Brinkmann	1465.0107095046853
Summer Sewald	2372.886473429952
Melvin Marxen	2553.2086167800453
Cara Losch	2354.26875
Celia Rouche	2629.339344262295
Rocco Neubert	2837.257597684515

Though it’s not required by SQL, it is advisable to include all non-aggregated columns from your SELECT clause in your GROUP BY clause. If you don’t, there are cases where the query will return the desired results, there are also instances where a random value from the non-aggregated row will be used as the representative for all the values returned by the query.

For example, let’s say you wanted to know the average deal by sales agent for each of their customers. If you used the query:

SELECT sales_agent,
       account,
       SUM(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_agent
 ORDER BY sales_agent

siyeh/sql-crm-example-data Run query Copy code

you would get back the following table which shows each sales agent one time and chooses a value at random from the accounts won by that sales person:

To get the average deal by sales agent for each account the query would look like this:

SELECT sales_agent,
       account,
       SUM(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_agent, account
 ORDER BY sales_agent, account

siyeh/sql-crm-example-data Run query Copy code

The first several rows of the table returned would look like this:

sales_agent	account	sum
Anna Snelling	Betatech	11,340
Anna Snelling	Bioholding	12,382
Anna Snelling	Cancity	1,496
Anna Snelling	Codehow	22,479
Anna Snelling	Condax	1,572
Anna Snelling	Conecom	4,186
Anna Snelling	Domzoom	6,639
Anna Snelling	Dontechi	5,578
Anna Snelling	Finhigh	1,762
Anna Snelling	Funholding	5,654

`FILTER`

If you wanted to refine your query even more by running your aggregations against a limited set of the values in a column you could use the FILTER keyword. For example, if you wanted to know both the number of deals won by a sales agent and the number of those deals that had a value greater than 1000, you could use the query:

SELECT sales_agent,
       COUNT(sales_pipeline.close_value) AS total,
       COUNT(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `over 1000`
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_pipeline.sales_agent

siyeh/sql-crm-example-data Run query Copy code

The first several rows of the resulting table would look like this:

sales_agent	total	over 1000
Boris Faz	101	70
Maureen Marcano	149	96
Vicki Laflamme	221	111
Donn Cantrell	158	106
Jonathan Berthelot	171	74
Wilburn Farren	55	38
Elease Gluck	80	32
Cassey Cress	163	112
James Ascencio	135	88
Kami Bicknell	174	78
Anna Snelling	208	68
Violet Mclelland	122	33

As we saw in the aggregate functions section, WHERE also limits the values in a query against which an aggregate function is run. FILTER is more flexible than WHERE because you can use more than one FILTER modifier in an aggregate query while you can only use only one WHERE clause.

For example:

SELECT sales_agent,
       COUNT(sales_pipeline.close_value) AS `number won`,
       COUNT(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `number won > 1000`,
       AVG(sales_pipeline.close_value) AS `average of all`,
       AVG(sales_pipeline.close_value)
FILTER(WHERE sales_pipeline.close_value > 1000) AS `avg > 1000`
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
 GROUP BY sales_pipeline.sales_agent

siyeh/sql-crm-example-data Run query Copy code

The first several rows returned by the above query would look like:

sales_agent	number won	number > 1000	average of all	avg > 1000
Rosalina Dieter	72	30	3,269.49	7,537.83
Daniell Hammack	114	91	3,194.999	3,869.47
Gladys Colclough	135	88	2,560.55	3,791.19
Rosie Papadopoulos	78	49	2,950.885	4,559.16
Kary Hendrixson	209	98	2,173.675	4,071.13
Cecily Lampkin	107	63	2,147.665	3,439.74
Lajuana Vencill	127	49	1,532.54	3,262.94
Markita Hansen	130	68	2,529.17	4,512.37
Moses Frase	129	51	1,606.067	3,441.45
Darcel Schlecht	349	272	3,304.34	4,115.04
Hayden Neloms	107	82	2,543.09	3,215.06
Reed Clapper	155	110	2,827.97	3,808.98
Zane Levy	161	99	2,671.23	4,038.43
Garret Kinder	75	46	2,636.97	4,041.37
Niesha Huffines	105	56	1,685.34	2,747.96

Exercises (Continued from previous section)

There are two ways to do these exercises. The first is to use the “Try query” links to test your queries without saving them. The second is to create a data.world project and save your queries to it. If you are reading this documentation and completing the exercises as a tutorial, you will need to create your own project to save your work. Details and instructions are in the SQL tutorial which has instructions for setting up your project and links to all the current exercises.

Exercise 27

Write a query that returns the patient column and a count of all the allergies the patient has from allergies table. Group your results by patient, and order them by the number of allergies from greatest to least.

Try query

Hint

SELECT the column patient, and a COUNT of the (description) column FROM the allergies table and GROUP the results BY patient. ORDER the results BY the COUNT of the (description) column in DESCending order.

See solution

Exercise 28

Write a query that returns the patient column, the average of the value column relabeled as Avg BMI, the count of the value column relabeled as Number of readings , and the maximum value of the value column filtered for values over 30 and label it as Max Obese BMI. The query should be written against the observations_cleaned table and the results should all be for records where the description is “Body Mass Index”. Group your results by the patient column.

Try query

Hint

SELECT the column patient, the AVG of the (value) column relabeled AS `AVG BMI` , a COUNT of the (value) column relabeled AS `Number of Readings` , and the MAX of the column (value). FILTER the results to return only records(WHERE the value column is > 30) and label the resulting column AS `Max Obese BMI`. All of these columns should be selected FROM the observations_cleaned table. Restrict your results to only include records WHERE the description column = “Body Mass Index”. GROUP the results BY the patient column.

See solution

Next up: HAVING

An introduction to the HAVING clause.