SQL on data.world
For querying datasets stored on data.world, a dialect of SQL named dwSQL has been developed. dwSQL is a full-featured dialect of SQL for reading and analyzing your data.world datasets, and for joining in data from other data.world datasets. This tutorial will cover what is supported with dwSQL to assist you in building SQL queries on data.world.
Important note: SQL is currently supported only for reading data from data.world datasets. None of the standard SQL data manipulation statements for inserting, modifying, or deleting data are available on data.world.
The most common way to query a dataset using dwSQL is with a
SELECT queries include the following clauses:
SELECTclause, indicating just what fields should be returned by the query
- An optional
FROMclause, indicating which tables should be used for the query, and possibly how those tables should be joined.
- An optional
WHEREclause, allowing the results limited to those that satisfy one or more criteria.
- An optional
BYclause, describing how the results should be sorted.
- An optional
BYclause, describing how the results should be grouped together for aggregation purposes.
- An optional
HAVINGclause, similar to the
WHEREclause, but for limiting aggregated results by one or more criteria.
- An optional
LIMITclause, limiting to the results returned to some fixed number.
- An optional
OFFSETclause, indicating that some fixed number of results should be skipped.
Here's an example dwSQL query which joins two tables: MyTable and MyOtherTable. It shows all of the clauses that dwSQL supports
SELECT MyTable.name, SUM(MyOtherTable.score) as total FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId) WHERE MyOtherTable.score > 3 ORDER by name GROUP by name HAVING total > 42 LIMIT 10 OFFSET 20
SELECT clause indicates just which columns or calculated values should
be returned from a query. These may be explicitly listed by name
SELECT MyTable.name, MyOtherTable.score FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
or implicitly specified using a wildcard select, which causes all of the fields in all tables involved in the query to be returned
SELECT * FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
Wildcards are very easy to use, but don't allow much control. A restricted wildcard syntax is also available, which only returns the columns from one table
SELECT MyTable.* FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
If marked with
DISTINCT in the select clause, results from the query will
be deduplicated before being returned
SELECT DISTINCT name FROM MyTable
dwSQL supports all of the standard formats of SQL
FROM clauses, including
implicit join syntax:
SELECT MyTable.name, MyOtherTable.address FROM MyTable, MyOtherTable where MyTable.id = MyOtherTable.myTableId
explicit join syntax with the
SELECT MyTable.name, MyOtherTable.address FROM MyTable JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
explicit join syntax with the
SELECT MyTable.name, MyOtherTable.address FROM MyTable JOIN MyOtherTable USING(ID)
NATURAL join syntax:
SELECT MyTable.name, MyOtherTable.address FROM MyTable NATURAL JOIN MyOtherTable
Left, right and full outer joins are also supported.
SELECT MyTable.name, MyOtherTable.address FROM MyTable LEFT OUTER JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
To limit the results from a query based on some criteria, dwSQL provides
support for SQL
WHERE clauses. A
WHERE clause indicates that a result
should only be returned by the query if a specified boolean expression
evaluates to true. The following standard expressions are supported
by dwSQL for construction of where clauses and calculated fields
- Comparisons (>, <, <=, >=, !=, =)
- Numeric operations (+, - %, /)
CASEexpressions, allowing different results to be calculated based on
INexpressions, for testing containment in some set
NULL, for testing whether a value is available
LIKE, for matching strings
BETWEENfor testing whether numeric values are in a given range
CASTfor converting values between available data types
- A variety of standard built-in functions, listed in a later section
Order By Clauses
dwSQL fully supports standard SQL
ORDER BY syntax for sorting the results
of a query. The following query returns the names for the rows in MyTable,
sorted by name.
SELECT name FROM MyTable ORDER BY name
It is also possible to sort results by more than one field, lexicographically.
The standard SQL sort modifiers of
DESC are supported by dwSQL,
for indicated whether fields should be sorted in ascending or descending
SELECT age, name FROM MyTable ORDER BY age DESC, name ASC
Group By Clauses and Aggregation
A common use for a dwSQL query is to group a set of results together based on some subset of their values, and then perform aggregate calculations based on that grouping
SELECT name, SUM(score) as totalScore FROM MyTable GROUP BY name
The following standard SQL aggregate functions are supported by dwSQL:
GROUP_CONCAT. Also supported are statistical functions
VAR_SAMP, for calculating
the variances and standard deviations of sets of query results, either on a population
or sample basis. A two-argument
CORRELATION aggregation is also available, calculating
the Pearson's correlation coefficient for a pair of variables. For boolean expressions,
BOOL_OR aggregations allow for calculating the aggregate conjunction or
disjunction for a set of boolean values. For every aggregate functions, dwSQL allows calculations
to be marked as
DISTINCT, indicating that the result sets should be deduplicated
before they are aggregated
SELECT name, COUNT(DISTINCT address) as numAddresses FROM MyTable GROUP BY name
Additionally dwSQL supports filtered aggregation, allowing aggregate calculations to be performed over specified partitions of the rows available
SELECT name, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016 FROM MyTable GROUP BY name
Similar to a
WHERE clause, a
HAVING clause allows the results of a query
to be filtered based on some boolean criteria. While a
WHERE clause operates
on raw data values, a
HAVING clause operates on aggregated values.
SELECT name, SUM(score) as totalScore FROM MyTable GROUP BY name HAVING totalScore > 100
Limit and Offset Clauses
In order to allow only a portion of the results of a query to be returned,
OFFSET clauses. A
LIMIT clause provides the
maximum number of rows that should be returned from a query. An
clause lists the number of rows that should be skipped. Using
OFFSET together allow pagination through large query results.
SELECT name FROM MyTable ORDER BY name LIMIT 10 OFFSET 10
Union, Intersect, and Minus
SELECT queries are the most common way of getting data
from dwSQL, sometimes it is necessary to combine results from multiple
queries to get the data that you want. For this purpose, dwSQL provides
support for combining individual select queries into
UNION queries combine the results of two queries, returning all results
from either query.
SELECT name FROM MyTable UNION SELECT name FROM MyOtherTable
MINUS queries remove the results of one query from the results of another.
SELECT name FROM MyTable MINUS SELECT name FROM MyOtherTable
INTERSECT queries combine the results of two queries, only those results
that are common to both queries.
SELECT name FROM MyTable INTERSECT SELECT name FROM MyOtherTable
In order to support the ability to form complex queries out of simple queries, dwSQL supports
WITH form of defining subqueries.
WITH subqueries allow you to define and name one
or more queries which can then be used to define one main query. Once a named subquery is defined
in a later subquery or the main query, just as though it were another "virtual" table.
WITH Derived AS (SELECT name FROM MyTable WHERE name LAKE 'Dave%') SELECT name, COUNT(*) FROM Derived GROUP BY name ORDER BY name
We at data.world are committed to making your data not just easy to access, but also easy to use with other people's data. In dwSQL, it is possible to write queries not just against one dataset, but against multiple different data.world datasets, using what we call a 'federated query'. Federated queries look just like local queries, but require a bit more work to reference the remote tables. You need to specify not just the name of the remote table, but also the name of the remote data.world dataset the table is in.
SELECT age, name FROM MyTable join dave.lahman.Batting using(playerId)
That query joins data from your dataset with data in the the Lahman Sabremetrics baseball dataset (named "lahman"), maintained by user "dave" (that's me).
To keep from having to write out the full name of the dataset and table, you can use a table alias. Once you've defined a table alias, you can use it elsewhere in the query to reference that table.
SELECT age, name FROM MyTable join dave.lahman.Batting as Batting using(playerId) ORDER BY Batting.yearId
A common use case with data.world data sets is to have tabular data that is split among multiple files. For instance, you might upload a new file each month of ledger data for your business or non-profit organization. You might have files named "ledger_jan.csv", "ledger_feb.csv", and "ledger_mar.csv". Upon ingest by data.world, each file would result in a separate table, in this case "ledger_jan","ledger_feb", and "ledger_mar". If you wanted to query about data from all of these in most SQL dialects, you would need to use a "UNION" construct, which would rapidly get cumbersome. Because this is expected to be a common use-case for data.world, we've created the concept of a multi-table select to dwSQL. The select form the three ledger tables, you could list the tables explicitly between square brackets like this:
SELECT * FROM [ledger_jan, ledger_feb, ledger_mar] as ledger
or use a regular expression to select from all tables in the dataset starting with the name "ledger"
SELECT * FROM [/ledger.*/] as ledger
Note that tables to be selected this way must have all the same column names, unless you explicitly specify the names of the columns you wish to query.
SELECT * FROM [ledger_jan, ledger_feb, ledger_mar](col1, col2, col3) as ledger
Specifying the columns makes it possible to using multi-table queries where the underlying tables have evolved over time, for instance by adding new columns. Even if you do explicitly specify the columns for a multi-table query, all of the specified columns must be present in all of the tables, and must all have the same type in each table.
It is often useful to know which from which table a give row in a multi-table was selected. You can query this using the
table_name column added to each multi-table
SELECT ledger.*, ledger.table_name FROM [ledger_jan, ledger_feb, ledger_mar] as ledger
For more general combinations of tables, you'll need to use the "UNION" construct instead.
To do any querying with dwSQL, you need to know the names of the tables and columns in your dataset. These names are generated by data.world when you upload your data, based on what we could infer about the structure of your data. To find out exactly what tables and columns you have available, you can just query that information from dwSQL. Each dataset contains two artificial tables named "Tables" and "TableColumns" which you can query just as though they were tables that you imported into your dataset. "Tables" lists the tables in your dataset, while "TableColumns" additionally lists the columns available for each table.
SELECT * FROM Tables
For each table, dwSQL generates two special columns,
just refers to the order of the row in the table, starting with 0.
row_address is a unique URL
which is generated for the row. Under the covers, dwSQL is implemented using semantic web technology,
and one of the key ideas of the semantic web is that every entity should have it's own unique URL.
Note that unlike normal columns,
row_address will not be returned by default using the
SELECT * form of a dwSQL query. Instead they must be explicitly listed. To query all of the data for a
table and it's associated special columns, you would use the following form.
SELECT MyTable.*, row_index, row_address FROM MyTable
dwSQL identifiers are used to reference the tables and columns that your query operates over. Simple dwSQL identifiers consist of alphanumeric characters or underscores, and must start with an alphabetic character. Identifiers with other characters may be used, but must be surrounded with back-quotes. This is usually only done to refer to tables or columns which were imported with non-alphanumeric characters in their names.
SELECT name FROM `MyTable/with/slashes/in/the/name`
dwSQL allows inline comments, to allow query logic to be annotated with user intent. Comments start with either # or -- and continue to the end of the line.
--A very simple query SELECT name FROM MyTable # the end
dwSQL keywords are case-insensitive. dwSQL identifiers are also case-insensitive.
All string operations and comparisons supported by dwSQL are case-sensitive,
with the exception of
LIKE which are case-insensitive
for compatibility with SQL standards.
Keywords and Functions
These are the standard SQL keywords which are supported by dwSQL.
The following built-in functions are supported by dwSQL.
- General functions
if- Executes it's boolean first argument, then returns the evaluation of the second argument if true or the third if false. Takes three arguments.
coalesce- Evaluates the arguments in order and returns the first expression that does not evaluate to NULL. Takes any number of arguments.
- Numeric functions
abs- The absolute value of a numeric value. Takes one argument.
floor- The greatest integer less than or equal to a numeric value. Takes one argument.
ceiling- The least integer greater to or equal to a numeric value. Takes one argument.
round- When called with one numeric argument,
roundreturns the nearest integer to that argument. When called with two numeric arguments, the second argument indicates how many decimal places to round the first argument to.
pi- The mathematical constant pi, 3.1415... Takes no arguments.
exp- The mathematical exponential function, e to the power x. Takes one argument.
exp10- The decimal exponential function, 10 to the power x. Takes one argument.
ln) - The natural log function, inverse of
exp. Takes one argument.
power) - The general exponentiation function. Takes two arguments.
sqrt- The square root function. Takes one argument.
sin- The sine function, from trigonometry. Takes one argument.
cos- The cosine function, from trigonometry. Takes one argument.
tan- The tangent function, from trigonometry. Takes one argument.
asin- The inverse sine function, from trigonometry. Takes one argument.
acos- The inverse cosine function, from trigonometry. Takes one argument.
atan- The inverse tangent function, from trigonometry. Takes one argument.
atan2- The two-argument form of the inverse tangent function, from trigonometry.
atan(x, y)is equivalent to
atan(x/y)if y is not 0. Takes two arguments
sinh- The hyperbolic sine function, from trigonometry. Takes one argument.
cosh- The hyperbolic cosine function, from trigonometry. Takes one argument.
tanh- The hyperbolic tangent function, from trigonometry. Takes one argument.
radians- Convert an angle from degrees to radians. Takes one argument.
degrees- Convert an angle from degrees to radians. Takes one argument.
sign- The signum of a numeric argument: -1, 0, or 1. Takes one argument
- String functions
length- The length of a string. Takes one argument.
upper- Convert a string to upper case. Takes one argument.
lower- Convert a string to lower case. Takes one argument.
concat- Concatenates two strings, returning a string consisting of the first argument followed by the second. Takes two arguments.
trim- Trims whitespace from the start and end of a string. Takes one argument.
ltrim- Trims whitespace from the start of a string. Takes one argument.
rtrim- Trims whitespace from the end of a string. Takes one argument.
lpad- pad the start of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).
rpad- pad the end of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).
substring- Extract a substring from a string. Takes two or three arguments: the original string, the position of the character to start at (the first character position "1"), and (optionally) the number of characters to extract
replace- For a given string, replace each non-overlapping instance of a regular expression pattern by a fixed string. Takes three arguments: the original string, the pattern to replace, and the the string to replace it with.
regex- Check whether a string matches a given regular expression. Similar to
LIKE, but more powerful. Takes two arguments, the string and the regex to match to it.
left- Extract a substring from the leftmost characters in a string. Takes two arguments: the original string and the number of characters to extract
right- Extract a substring from the rightmost characters in a string. Takes two arguments: the original string and the number of characters to extract
mid- Extract a substring from the middle of a string. Takes three arguments: the original string, the position of the character to start at (the first character position "1"), and the number of characters to extract
- Date/Time functions
year- Extract the year from a date or date time value, as an integer. Takes one argument
month- Extract the month value from a date or date time value, as an integer (January = 1). Takes one argument
day- Extract the day of the month from a date or date time value, as an integer. Takes one argument
hours- Extract the hours from a time or date time value, as an integer. Takes one argument
minutes- Extract the minutes from a time or date time value, as an integer. Takes one argument
seconds- Extract the seconds from a time or date time value, as an integer. Takes one argument
now- Returns time for the current query execution. This will be the same for every call to
nowin a query, but exactly which time is not specified. Takes no arguments
date_add- Adds a fixed duration to a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")
date_sub- Subtracts a fixed duration from a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")
date_part- Extracts a date part from a date or time, as an integer. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")
date_trunc- Truncates a date to a given precision. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")
date_format- Formats a date, time, or datetime as a string. Takes two arguments, a date to format and a format string as specified here: https://www.w3.org/TR/NOTE-datetime
date_diff- Finds the difference between two dates, expressed as in integral number of some date part (e.g, 3 years). Takes three arguments, a start date, an end date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")