Aggregations

An introduction to aggregations, aggregation functions, and aggregation modifiers.

An aggregation is the result returned from running an aggregate function in a query and is displayed in a new column in the results table. An aggregate function is a function that results of a set of the data queried being condensed into an aggregation.

The most common aggregate functions are SUM, COUNT, AVG, MIN, MAX, and GROUP_CONCAT. SUM returns the sum of all numeric value in a specified column. COUNT preceding a column name returns the number of rows for which the value is not NULL and can be used with both numeric and non-numeric field-types. COUNT(*) returns the number of rows in the table specified in the FROM clause. AVG returns the average of all the entries in a column which are not NULL. AVG can only be used with numeric data. MIN returns the lowest value in a specified column and can be used with numeric and non-numeric data. MAX returns the greatest value in a specified column and can be used with both numeric and non-numeric data. GROUP_CONCAT is the only aggregate function designed to work with strings. It takes string values and aggregates them into one value for each row with an optional separator.

SUM

If we wanted to know the total value of all the deals closed we would write:

SELECT SUM(close_value)
  FROM sales_pipeline

and the resulting table would be:

sum
10,005,534

COUNT

Similarly, if we wanted to see the number of deals we had won in our sales pipeline we could write a query like this:

SELECT COUNT(*)
  FROM sales_pipeline
 WHERE deal_stage = "Won"

and the result returned would look like:

count
4,238

AVG

Likewise we could show the average deal won:

SELECT AVG(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
avg
2,360.91

Our pipeline data has a stored value of ‘0’ instead of a blank field for close_value of lost deals. If we run an average on the close_value column without specifying we want to only include the deals won we’ll get an average of all the completed deals regardless of whether they were won and lost. While an average of all finished deals is potentially interesting data, it isn’t the same as an average value of the actual sales.

MAXand MIN

For the largest deal we would query:

SELECT MAX(close_value)
  FROM sales_pipeline
max
30,288

And for the smallest deal:

SELECT MIN(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
min
38

As for AVG, we had to include the qualification to only use data from deals that were won because otherwise the smallest deal size would have been ‘0’–the value for lost sales.

GROUP CONCAT

Until we look at the GROUP BY clause in the next section, the only use we have for GROUP_CONCAT at this point is to get a list of the values from one column in a table and to return it as a string. For example, if we wanted a list of all of the managers we could run the query:

SELECT GROUP_CONCAT(manager) Managers
  FROM sales_teams

The result would be:

Managers
Rocco Neubert Celia Rouche Summer Sewald Summer Sewald Celia Rouche Melvin Marxen Melvin Marxen Summer Sewald Dustin Brinkmann Melvin Marxen Melvin Marxen Rocco Neubert Summer Sewald Melvin Marxen Rocco Neubert Cara Losch Celia Rouche Cara Losch Celia Rouche Rocco Neubert Dustin Brinkmann Cara Losch Rocco Neubert Cara Losch Dustin Brinkmann Cara Losch Summer Sewald Celia Rouche Dustin Brinkmann Dustin Brinkmann

There are two problems with that result. First, it is not a list of the managers, it’s a list of all the instances of manager (one for each sale agent in the table). It we want to return each manager’s name only once we need to use the DISTINCT modifier. Second, the names all run together, and it’s likely that if we wanted a list we’d want the entries separated. Fortunately, GROUP_CONCAT has the modifier SEPARATOR which inserts the specified string between each value returned by GROUP_CONCAT.

The amended search would look like:

SELECT GROUP_CONCAT(DISTINCT manager SEPARATOR ", ") AS Managers
  FROM sales_teams

And the result would be:

Managers
Rocco Neubert, Celia Rouche, Summer Sewald, Melvin Marxen, Dustin Brinkmann, Cara Losch

The keyword ASwas used in the preceding query because without it the column name used in the table would have been undefined.

Exercises (Continued from previous section)

There are two ways to do these exercises. The first is to use the “Try query” links to test your queries without saving them. The second is to create a data.world project and save your queries to it. If you are reading this documentation and completing the exercises as a tutorial, you will need to create your own project to save your work. Details and instructions are in the SQL tutorial which has instructions for setting up your project and links to all the current exercises.

Exercise 24

Write a query that returns the description and the number of patients who have an allergy to fish from the allergies table.

You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.

Try query

Hint

SELECT the column description, and a COUNT of all the records (*) FROM the allergies table in the case WHERE the description column = “Allergy to fish”

See solution

Exercise 25

As we saw in exercise 23, real-world data is not always nice and tidy and sometimes we have to do a little clean-up to get it in the format we need to analyze it. While the CAST expression is very good at turning strings into integers or decimals, it’s a pain to have to include it in every aggregation clause of your queries to clean up your data whenever there is a datatype mismatch. Fortunately we’re doing these exercises on data.world, and we have a little shortcut to take the sting out of data clean-up.

For the next exercise, take the query result from exercise 23 and modify it to include the descriptions “Body Height” and “Body Mass Index”. Save your query as “Observations Cleaned”. Then go to the Download button underneath the query text and select “Save to dataset or project”. If you need help with the “Save to dataset or project option, see the help article Why “Save to dataset or project” is important and how to use it You’ll be prompted to save the table with the name of the query to your current project. Select Save. Once you have saved this table you can use it to do aggregations and other numerical calculations on the data.

Now write a query against this new table that returns the description column, the average body weight of all the patients, and the units from the observations table.

You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.

Try query

Hint

SELECT the column description, the AVG of the column (value), and the column units FROM the observations_cleaned table for the records WHERE the description column = “Body Weight”

See solution

Exercise 26

Write a query that returns the description column, the maximum body height of all the patients, and the units from the observations table.

You should get a warning that the “Field of aggregated query neither grouped no aggregated”. Grouping is not necessary for this query and will be discussed in the next section.

Try query

Hint

SELECT the column description, the MAX of the column (value), column, and the units column FROM the observations_cleaned table for the records WHERE the description column = “Body Height”

See solution