Aggregations
An introduction to aggregations, aggregation functions, and aggregation modifiers.
An aggregation is the result returned from running an aggregate function in a query and is displayed in a new column in the results table. An aggregate function is a function that results of a set of the data queried being condensed into an aggregation.
The most common aggregate functions are SUM
, COUNT
, AVG
, MIN
, MAX
,
and GROUP_CONCAT
. SUM
returns the sum of all numeric value in a specified
column. COUNT
preceding a column name returns the number of rows for which
the value is not NULL
and can be used with both numeric and non-numeric
field-types. COUNT(*)
returns the number of rows
in the table specified in the FROM
clause. AVG
returns the average of all
the entries in a column which are not NULL. AVG
can only be used with numeric
data. MIN
returns the lowest value in a specified column and can be used with
numeric and non-numeric data. MAX
returns the greatest
value in a specified column and can be used with both numeric and non-numeric
data. GROUP_CONCAT
is the only aggregate function designed to work with
strings. It takes string values and aggregates them into one value for each row
with an optional separator.
SUM
If we wanted to know the total value of all the deals closed we would write:
SELECT SUM(close_value)
FROM sales_pipeline
and the resulting table would be:
sum |
---|
10,005,534 |
COUNT
Similarly, if we wanted to see the number of deals we had won in our sales pipeline we could write a query like this:
SELECT COUNT(*)
FROM sales_pipeline
WHERE deal_stage = "Won"
and the result returned would look like:
count |
---|
4,238 |
AVG
Likewise we could show the average deal won:
SELECT AVG(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
avg |
---|
2,360.91 |
Our pipeline data has a stored value of ‘0’ instead of
a blank field for close_value
of lost deals. If we run an average on the
close_value
column without specifying we want to only include the deals won
we’ll get an average of all the completed deals regardless of whether they were
won and lost. While an average of all finished deals is potentially interesting
data, it isn’t the same as an average value of the actual sales.
MAX
and MIN
For the largest deal we would query:
SELECT MAX(close_value)
FROM sales_pipeline
max |
---|
30,288 |
And for the smallest deal:
SELECT MIN(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
min |
---|
38 |
As for AVG
, we had to include the qualification to only use
data from deals that were won because otherwise the smallest deal size would
have been ‘0’–the value for lost sales.
GROUP CONCAT
Until we look at the GROUP BY
clause in the next section, the only use we
have for GROUP_CONCAT
at this point is to get a list of the values from one
column in a table and to return it as a string. For example, if we wanted
a list of all of the managers we could run the query:
SELECT GROUP_CONCAT(manager) Managers
FROM sales_teams
The result would be:
Managers |
---|
Rocco Neubert Celia Rouche Summer Sewald Summer Sewald Celia Rouche Melvin Marxen Melvin Marxen Summer Sewald Dustin Brinkmann Melvin Marxen Melvin Marxen Rocco Neubert Summer Sewald Melvin Marxen Rocco Neubert Cara Losch Celia Rouche Cara Losch Celia Rouche Rocco Neubert Dustin Brinkmann Cara Losch Rocco Neubert Cara Losch Dustin Brinkmann Cara Losch Summer Sewald Celia Rouche Dustin Brinkmann Dustin Brinkmann |
There are two problems with that result. First, it is not a list of the
managers, it’s a list of all the instances of manager (one for each sale agent
in the table). It we want to return each manager’s name only once we need to
use the DISTINCT
modifier. Second, the names all run together, and it’s
likely that if we wanted a list we’d want the entries separated. Fortunately,
GROUP_CONCAT
has the modifier SEPARATOR
which inserts the specified string
between each value returned by GROUP_CONCAT
.
The amended search would look like:
SELECT GROUP_CONCAT(DISTINCT manager SEPARATOR ", ") AS Managers
FROM sales_teams
And the result would be:
Managers |
---|
Rocco Neubert, Celia Rouche, Summer Sewald, Melvin Marxen, Dustin Brinkmann, Cara Losch |
The keyword AS
was used in the preceding query because
without it the column name used in the table would have been undefined.
Exercises (Continued from previous section)
There are two ways to do these exercises. The first is to use the “Try query” links to test your queries without saving them. The second is to create a data.world project and save your queries to it. If you are reading this documentation and completing the exercises as a tutorial, you will need to create your own project to save your work. Details and instructions are in the SQL tutorial which has instructions for setting up your project and links to all the current exercises.
Exercise 24
Write a query that returns the description and the number of patients who have an allergy to fish from the allergies table.
You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.
Hint
SELECT
the column description,
and a COUNT
of all the records (*)
FROM
the allergies
table in the case WHERE
the description
column =
“Allergy to fish”
Exercise 25
As we saw in exercise 23, real-world data is not always nice and tidy and sometimes we have to do a little clean-up to get it in the format we need to analyze it. While the CAST
expression is very good at turning strings into integers or decimals, it’s a pain to have to include it in every aggregation clause of your queries to clean up your data whenever there is a datatype mismatch. Fortunately we’re doing these exercises on data.world, and we have a little shortcut to take the sting out of data clean-up.
For the next exercise, take the query result from exercise 23 and modify it to include the descriptions “Body Height” and “Body Mass Index”. Save your query as “Observations Cleaned”. Then go to the Download button underneath the query text and select “Save to dataset or project”. If you need help with the “Save to dataset or project option, see the help article Why “Save to dataset or project” is important and how to use it You’ll be prompted to save the table with the name of the query to your current project. Select Save. Once you have saved this table you can use it to do aggregations and other numerical calculations on the data.
Now write a query against this new table that returns the description column, the average body weight of all the patients, and the units from the observations table.
You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.
Hint
SELECT
the column description,
the AVG
of the column (value),
and the column units
FROM
the observations_cleaned
table for the records WHERE
the description
column =
“Body Weight”
Exercise 26
Write a query that returns the description column, the maximum body height of all the patients, and the units from the observations table.
You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.
Hint
SELECT
the column description,
the MAX
of the column (value),
column, and the units
column FROM
the observations_cleaned
table for the records WHERE
the description
column =
“Body Height”
An introduction to the GROUP BY
clause and FILTER
modifier.