SQL on data.world
For querying datasets stored on data.world, a dialect of SQL named dwSQL has been developed. dwSQL is a full-featured dialect of SQL for reading and analyzing your data.world datasets, and for joining in data from other data.world datasets. This tutorial will cover what is supported with dwSQL to assist you in building SQL queries on data.world.
Important note: SQL is currently supported only for reading data from data.world datasets. None of the standard SQL data manipulation statements for inserting, modifying, or deleting data are available on data.world.
Select queries
The most common way to query a dataset using dwSQL is with a SELECT query.
SELECT queries include the following clauses:
- A
SELECTclause, indicating just what fields should be returned by the query - An optional
FROMclause, indicating which tables should be used for the query, and possibly how those tables should be joined. - An optional
WHEREclause, allowing the results limited to those that satisfy one or more criteria. - An optional
ORDERBYclause, describing how the results should be sorted. - An optional
GROUPBYclause, describing how the results should be grouped together for aggregation purposes. - An optional
HAVINGclause, similar to theWHEREclause, but for limiting aggregated results by one or more criteria. - An optional
LIMITclause, limiting to the results returned to some fixed number. - An optional
OFFSETclause, indicating that some fixed number of results should be skipped.
Here's an example dwSQL query which joins two tables: MyTable and MyOtherTable. It shows all of the clauses that dwSQL supports
SELECT MyTable.name, SUM(MyOtherTable.score) as total
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
WHERE MyOtherTable.score > 3
ORDER by name
GROUP by name
HAVING total > 42
LIMIT 10
OFFSET 20
Select Clauses
A SELECT clause indicates just which columns or calculated values should
be returned from a query. These may be explicitly listed by name
SELECT MyTable.name, MyOtherTable.score
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
or implicitly specified using a wildcard select, which causes all of the fields in all tables involved in the query to be returned
SELECT *
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
Wildcards are very easy to use, but don't allow much control. A restricted wildcard syntax is also available, which only returns the columns from one table
SELECT MyTable.*
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
If marked with DISTINCT in the select clause, results from the query will
be deduplicated before being returned
SELECT DISTINCT name
FROM MyTable
From Clauses
dwSQL supports all of the standard formats of SQL FROM clauses, including
implicit join syntax:
SELECT MyTable.name, MyOtherTable.address
FROM MyTable, MyOtherTable
where MyTable.id = MyOtherTable.myTableId
explicit join syntax with the ON keyword:
SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
explicit join syntax with the USING keyword:
SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable USING(ID)
and NATURAL join syntax:
SELECT MyTable.name, MyOtherTable.address
FROM MyTable NATURAL JOIN MyOtherTable
Left, right and full outer joins are also supported.
SELECT MyTable.name, MyOtherTable.address
FROM MyTable LEFT OUTER JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
Where Clauses
To limit the results from a query based on some criteria, dwSQL provides
support for SQL WHERE clauses. A WHERE clause indicates that a result
should only be returned by the query if a specified boolean expression
evaluates to true. The following standard expressions are supported
by dwSQL for construction of where clauses and calculated fields
- Comparisons (>, <, <=, >=, !=, =)
- Numeric operations (+, - %, /)
CASEexpressions, allowing different results to be calculated based onINandNOTINexpressions, for testing containment in some setISNULLandISNOTNULL, for testing whether a value is availableLIKEandNOTLIKE, for matching stringsBETWEENfor testing whether numeric values are in a given rangeCASTfor converting values between available data types- A variety of standard built-in functions, listed in a later section
Order By Clauses
dwSQL fully supports standard SQL ORDER BY syntax for sorting the results
of a query. The following query returns the names for the rows in MyTable,
sorted by name.
SELECT name
FROM MyTable
ORDER BY name
It is also possible to sort results by more than one field, lexicographically.
The standard SQL sort modifiers of ASC and DESC are supported by dwSQL,
for indicated whether fields should be sorted in ascending or descending
order.
SELECT age, name
FROM MyTable
ORDER BY age DESC, name ASC
Group By Clauses and Aggregation
A common use for a dwSQL query is to group a set of results together based on some subset of their values, and then perform aggregate calculations based on that grouping
SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name
The following standard SQL aggregate functions are supported by dwSQL: SUM, MIN,
MAX, COUNT, AVG, and GROUP_CONCAT. Also supported are statistical functions
STDEV, STD_SAMP, STD_POP, VARIANCE, VAR_POP, and VAR_SAMP, for calculating
the variances and standard deviations of sets of query results, either on a population
or sample basis. A two-argument CORRELATION aggregation is also available, calculating
the Pearson's correlation coefficient for a pair of variables. For boolean expressions,
the BOOL_AND and BOOL_OR aggregations allow for calculating the aggregate conjunction or
disjunction for a set of boolean values. For every aggregate functions, dwSQL allows calculations
to be marked as DISTINCT, indicating that the result sets should be deduplicated
before they are aggregated
SELECT name, COUNT(DISTINCT address) as numAddresses
FROM MyTable
GROUP BY name
Additionally dwSQL supports filtered aggregation, allowing aggregate calculations to be performed over specified partitions of the rows available
SELECT name, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016
FROM MyTable
GROUP BY name
Having Clauses
Similar to a WHERE clause, a HAVING clause allows the results of a query
to be filtered based on some boolean criteria. While a WHERE clause operates
on raw data values, a HAVING clause operates on aggregated values.
SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name
HAVING totalScore > 100
Limit and Offset Clauses
In order to allow only a portion of the results of a query to be returned,
dwSQL provides LIMIT and OFFSET clauses. A LIMIT clause provides the
maximum number of rows that should be returned from a query. An OFFSET
clause lists the number of rows that should be skipped. Using LIMIT and
OFFSET together allow pagination through large query results.
SELECT name
FROM MyTable
ORDER BY name
LIMIT 10
OFFSET 10
Union, Intersect, and Minus
While individual SELECT queries are the most common way of getting data
from dwSQL, sometimes it is necessary to combine results from multiple
queries to get the data that you want. For this purpose, dwSQL provides
support for combining individual select queries into UNION, MINUS,
and INTERSECT queries.
UNION queries combine the results of two queries, returning all results
from either query.
SELECT name
FROM MyTable
UNION
SELECT name
FROM MyOtherTable
MINUS queries remove the results of one query from the results of another.
SELECT name
FROM MyTable
MINUS
SELECT name
FROM MyOtherTable
INTERSECT queries combine the results of two queries, only those results
that are common to both queries.
SELECT name
FROM MyTable
INTERSECT
SELECT name
FROM MyOtherTable
With Queries
In order to support the ability to form complex queries out of simple queries, dwSQL supports
the standard WITH form of defining subqueries. WITH subqueries allow you to define and name one
or more queries which can then be used to define one main query. Once a named subquery is defined
in a later subquery or the main query, just as though it were another "virtual" table.
WITH Derived AS (SELECT name
FROM MyTable
WHERE name LAKE 'Dave%')
SELECT name, COUNT(*)
FROM Derived
GROUP BY name
ORDER BY name
Federated Queries
We at data.world are committed to making your data not just easy to access, but also easy to use with other people's data. In dwSQL, it is possible to write queries not just against one dataset, but against multiple different data.world datasets, using what we call a 'federated query'. Federated queries look just like local queries, but require a bit more work to reference the remote tables. You need to specify not just the name of the remote table, but also the name of the remote data.world dataset the table is in.
SELECT age, name
FROM MyTable join dave.lahman.Batting using(playerId)
That query joins data from your dataset with data in the the Lahman Sabremetrics baseball dataset (named "lahman"), maintained by user "dave" (that's me).
To keep from having to write out the full name of the dataset and table, you can use a table alias. Once you've defined a table alias, you can use it elsewhere in the query to reference that table.
SELECT age, name
FROM MyTable join dave.lahman.Batting as Batting using(playerId)
ORDER BY Batting.yearId
Multi-table Selects
A common use case with data.world data sets is to have tabular data that is split among multiple files. For instance, you might upload a new file each month of ledger data for your business or non-profit organization. You might have files named "ledger_jan.csv", "ledger_feb.csv", and "ledger_mar.csv". Upon ingest by data.world, each file would result in a separate table, in this case "ledger_jan","ledger_feb", and "ledger_mar". If you wanted to query about data from all of these in most SQL dialects, you would need to use a "UNION" construct, which would rapidly get cumbersome. Because this is expected to be a common use-case for data.world, we've created the concept of a multi-table select to dwSQL. The select form the three ledger tables, you could list the tables explicitly between square brackets like this:
SELECT *
FROM [ledger_jan, ledger_feb, ledger_mar] as ledger
or use a regular expression to select from all tables in the dataset starting with the name "ledger"
SELECT *
FROM [/ledger.*/] as ledger
Note that tables to be selected this way must have all the same column names and types. For more general combinations of tables, you'll need to use the "UNION" construct instead.
Metadata tables
To do any querying with dwSQL, you need to know the names of the tables and columns in your dataset. These names are generated by data.world when you upload your data, based on what we could infer about the structure of your data. To find out exactly what tables and columns you have available, you can just query that information from dwSQL. Each dataset contains two artificial tables named "Tables" and "TableColumns" which you can query just as though they were tables that you imported into your dataset. "Tables" lists the tables in your dataset, while "TableColumns" additionally lists the columns available for each table.
SELECT *
FROM Tables
Identifiers
dwSQL identifiers are used to reference the tables and columns that your query operates over. Simple dwSQL identifiers consist of alphanumeric characters or underscores, and must start with an alphabetic character. Identifiers with other characters may be used, but must be surrounded with back-quotes. This is usually only done to refer to tables or columns which were imported with non-alphanumeric characters in their names.
SELECT name
FROM `MyTable/with/slashes/in/the/name`
Comments
dwSQL allows inline comments, to allow query logic to be annotated with user intent. Comments start with either # or -- and continue to the end of the line.
--A very simple query
SELECT name
FROM MyTable
# the end
Case Sensitivity
dwSQL keywords are case-insensitive. dwSQL identifiers are case-sensitive.
All string operations and comparisons supported by dwSQL are also case-sensitive,
with the exception of LIKE and NOT LIKE which are case-insensitive
for compatibility with SQL standards.
Keywords and Functions
Keywords
These are the standard SQL keywords which are supported by dwSQL.
SELECTDISTINCTFROMJOINLEFTRIGHTFULLOUTERINNERONUSINGNATURALWHEREGROUPBYORDERASCDESCHAVINGLIMITOFFSETUNIONINTERSECTMINUSASANDORNOTINNULLLIKECASTCASEWHENTHENELSEENDANYALLSOME
Built-in Functions
The following built-in functions are supported by dwSQL.
- Numeric functions
abs- The absolute value of a numeric value. Takes one argument.floor- The greatest integer less than or equal to a numeric value. Takes one argument.ceiling- The least integer greater to or equal to a numeric value. Takes one argument.round- When called with one numeric argument,roundreturns the nearest integer to that argument. When called with two numeric arguments, the second argument indicates how many decimal places to round the first argument to.pi- The mathematical constant pi, 3.1415... Takes no arguments.exp- The mathematical exponential function, e to the power x. Takes one argument.exp10- The decimal exponential function, 10 to the power x. Takes one argument.log(with synonymln) - The natural log function, inverse ofexp. Takes one argument.pow(with synonympower) - The general exponentiation function. Takes two arguments.sqrt- The square root function. Takes one argument.sin- The sine function, from trigonometry. Takes one argument.cos- The cosine function, from trigonometry. Takes one argument.tan- The tangent function, from trigonometry. Takes one argument.asin- The inverse sine function, from trigonometry. Takes one argument.acos- The inverse cosine function, from trigonometry. Takes one argument.atan- The inverse tangent function, from trigonometry. Takes one argument.atan2- The two-argument form of the inverse tangent function, from trigonometry.atan(x, y)is equivalent toatan(x/y)if y is not 0. Takes two argumentssinh- The hyperbolic sine function, from trigonometry. Takes one argument.cosh- The hyperbolic cosine function, from trigonometry. Takes one argument.tanh- The hyperbolic tangent function, from trigonometry. Takes one argument.radians- Convert an angle from degrees to radians. Takes one argument.degrees- Convert an angle from degrees to radians. Takes one argument.sign- The signum of a numeric argument: -1, 0, or 1. Takes one argument
- String functions
length- The length of a string. Takes one argument.upper- Convert a string to upper case. Takes one argument.lower- Convert a string to lower case. Takes one argument.concat- Concatenates two strings, returning a string consisting of the first argument followed by the second. Takes two arguments.trim- Trims whitespace from the start and end of a string. Takes one argument.ltrim- Trims whitespace from the start of a string. Takes one argument.rtrim- Trims whitespace from the end of a string. Takes one argument.lpad- pad the start of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).rpad- pad the end of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).substring- Extract a substring from a string. Takes two or three arguments: the original string, the position of the character to start at (the first character position "1"), and (optionally) the number of characters to extractreplace- For a given string, replace each non-overlapping instance of a regular expression pattern by a fixed string. Takes three arguments: the original string, the pattern to replace, and the the string to replace it with.regex- Check whether a string matches a given regular expression. Similar toLIKE, but more powerful. Takes two arguments, the string and the regex to match to it.left- Extract a substring from the leftmost characters in a string. Takes two arguments: the original string and the number of characters to extractright- Extract a substring from the rightmost characters in a string. Takes two arguments: the original string and the number of characters to extractmid- Extract a substring from the middle of a string. Takes three arguments: the original string, the position of the character to start at (the first character position "1"), and the number of characters to extract
- Date/Time functions
year- Extract the year from a date or date time value, as an integer. Takes one argumentmonth- Extract the month value from a date or date time value, as an integer (January = 1). Takes one argumentday- Extract the day of the month from a date or date time value, as an integer. Takes one argumenthours- Extract the hours from a time or date time value, as an integer. Takes one argumentminutes- Extract the minutes from a time or date time value, as an integer. Takes one argumentseconds- Extract the seconds from a time or date time value, as an integer. Takes one argumentnow- Returns time for the current query execution. This will be the same for every call tonowin a query, but exactly which time is not specified. Takes no argumentsdate_add- Adds a fixed duration to a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")date_sub- Subtracts a fixed duration from a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")date_part- Extracts a date part from a date or time, as an integer. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")date_trunc- Truncates a date to a given precision. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")date_format- Formats a date, time, or datetime as a string. Takes two arguments, a date to format and a format string as specified here: https://www.w3.org/TR/NOTE-datetimedate_diff- Finds the difference between two dates, expressed as in integral number of some date part (e.g, 3 years). Takes three arguments, a start date, an end date, and a date part string (one of "year", "month", "day", "hour", "minute", or "second")