SQL on data.world
For querying datasets stored on data.world, a dialect of SQL named dwSQL has been developed. dwSQL is a full-featured dialect of SQL for reading and analyzing your data.world datasets, and for joining in data from other data.world datasets. This tutorial will cover what is supported with dwSQL to assist you in building SQL queries on data.world.
Important note: SQL is currently supported only for reading data from data.world datasets. None of the standard SQL data manipulation statements for inserting, modifying, or deleting data are available on data.world.
The most common way to query a dataset using dwSQL is with a
SELECT queries include the following clauses:
SELECTclause, indicating just what fields should be returned by the query
FROMclause, indicating which tables should be used for the query, and possibly how those tables should be joined.
- An optional
WHEREclause, allowing the results limited to those that satisfy one or more criteria.
- An optional
BYclause, describing how the results should be sorted.
- An optional
BYclause, describing how the results should be grouped together for aggregation purposes.
- An optional
HAVINGclause, similar to the
WHEREclause, but for limiting aggregated results by one or more criteria.
- An optional
LIMITclause, limiting to the results returned to some fixed number.
- An optional
OFFSETclause, indicating that some fixed number of results should be skipped.
Here's an example dwSQL query which joins two tables: MyTable and MyOtherTable. It shows all of the clauses that dwSQL supports
SELECT MyTable.name, SUM(MyOtherTable.score) as total FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId) WHERE MyOtherTable.score > 3 ORDER by name GROUP by name HAVING total > 42 LIMIT 10 OFFSET 20
SELECT clause indicates just which columns or calculated values should
be returned from a query. These may be explicitly listed by name
SELECT MyTable.name, MyOtherTable.score FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
or implicitly specified using a wildcard select, which causes all of the fields in all tables involved in the query to be returned
SELECT * FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
Wildcards are very easy to use, but don't allow much control. A restricted wildcard syntax is also available, which only returns the columns from one table
SELECT MyTable.* FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
If marked with
DISTINCT in the select clause, results from the query will
be deduplicated before being returned
SELECT DISTINCT name FROM MyTable
dwSQL supports all of the standard formats of SQL
FROM clauses, including
implicit join syntax:
SELECT MyTable.name, MyOtherTable.address FROM MyTable, MyOtherTable where MyTable.id = MyOtherTable.myTableId
explicit join syntax with the
SELECT MyTable.name, MyOtherTable.address FROM MyTable JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
explicit join syntax with the
SELECT MyTable.name, MyOtherTable.address FROM MyTable JOIN MyOtherTable USING(ID)
NATURAL join syntax:
SELECT MyTable.name, MyOtherTable.address FROM MyTable NATURAL JOIN MyOtherTable
Left, right and full outer joins are also supported.
SELECT MyTable.name, MyOtherTable.address FROM MyTable LEFT OUTER JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)
To limit the results from a query based on some criteria, dwSQL provides
support for SQL
WHERE clauses. A
WHERE clause indicates that a result
should only be returned by the query if a specified boolean expression
evaluates to true. The following standard expressions are supported
by dwSQL for construction of where clauses and calculated fields
- Comparisons (>, <, <=, >=, !=, =)
- Numeric operations (+, - %, /)
CASEexpressions, allowing different results to be calculated based on
INexpressions, for testing containment in some set
NULL, for testing whether a value is available
LIKE, for matching strings
BETWEENfor testing whether numeric values are in a given range
CASTfor converting values between available data types
- A variety of standard built-in functions, listed in a later section
Order By Clauses
dwSQL fully supports standard SQL
ORDER BY syntax for sorting the results
of a query. The following query returns the names for the rows in MyTable,
sorted by name.
SELECT name FROM MyTable ORDER BY name
It is also possible to sort results by more than one field, lexicographically.
The standard SQL sort modifiers of
DESC are supported by dwSQL,
for indicated whether fields should be sorted in ascending or descending
SELECT age, name FROM MyTable ORDER BY age DESC, name ASC
Group By Clauses and Aggregation
A common use for a dwSQL query is to group a set of results together based on some subset of their values, and then perform aggregate calculations based on that grouping
SELECT name, SUM(score) as totalScore FROM MyTable GROUP BY name
The following standard SQL aggregate functions are supported by dwSQL:
GROUP_CONCAT. Also supported are statistical functions
VAR_SAMP, for calculating
the variances and standard deviations of sets of query results, either on a population
or sample basis. For every aggregate functions, dwSQL allows calculations
to be marked as
DISTINCT, indicating that the result sets should be deduplicated
before they are aggregated
SELECT name, COUNT(DISTINCT address) as numAddresses FROM MyTable GROUP BY name
Additionally dwSQL supports filtered aggregation, allowing aggregate calculations to be performed over specified partitions of the rows available
SELECT name, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016 FROM MyTable GROUP BY name
Similar to a
WHERE clause, a
HAVING clause allows the results of a query
to be filtered based on some boolean criteria. While a
WHERE clause operates
on raw data values, a
HAVING clause operates on aggregated values.
SELECT name, SUM(score) as totalScore FROM MyTable GROUP BY name HAVING totalScore > 100
Limit and Offset Clauses
In order to allow only a portion of the results of a query to be returned,
OFFSET clauses. A
LIMIT clause provides the
maximum number of rows that should be returned from a query. An
clause lists the number of rows that should be skipped. Using
OFFSET together allow pagination through large query results.
SELECT name FROM MyTable ORDER BY name LIMIT 10 OFFSET 10
Union, Intersect, and Minus
SELECT queries are the most common way of getting data
from dwSQL, sometimes it is necessary to combine results from multiple
queries to get the data that you want. For this purpose, dwSQL provides
support for combining individual select queries into
UNION queries combine the results of two queries, returning all results
from either query.
SELECT name FROM MyTable UNION SELECT name FROM MyOtherTable
MINUS queries remove the results of one query from the results of another.
SELECT name FROM MyTable MINUS SELECT name FROM MyOtherTable
INTERSECT queries combine the results of two queries, only those results
that are common to both queries.
SELECT name FROM MyTable INTERSECT SELECT name FROM MyOtherTable
We at data.world are committed to making your data not just easy to access, but also easy to use with other people's data. In dwSQL, it is possible to write queries not just against one dataset, but against multiple different data.world datasets, using what we call a 'federated query'. Federated queries look just like local queries, but require a bit more work to reference the remote tables. You need to specify not just the name of the remote table, but also the name of the remote data.world dataset the table is in.
SELECT age, name FROM MyTable join dave.lahman.Batting using(playerId)
That query joins data from your dataset with data in the the Lahman Sabremetrics baseball dataset (named "lahman"), maintained by user "dave" (that's me).
To keep from having to write out the full name of the dataset and table, you can use a table alias. Once you've defined a table alias, you can use it elsewhere in the query to reference that table.
SELECT age, name FROM MyTable join dave.lahman.Batting as Batting using(playerId) ORDER BY Batting.yearId
A common use case with data.world data sets is to have tabular data that is split among multiple files. For instance, you might upload a new file each month of ledger data for your business or non-profit organization. You might have files named "ledger_jan.csv", "ledger_feb.csv", and "ledger_mar.csv". Upon ingest by data.world, each file would result in a separate table, in this case "ledger_jan","ledger_feb", and "ledger_mar". If you wanted to query about data from all of these in most SQL dialects, you would need to use a "UNION" construct, which would rapidly get cumbersome. Because this is expected to be a common use-case for data.world, we've created the concept of a multi-table select to dwSQL. The select form the three ledger tables, you could list the tables explicitly between square brackets like this:
SELECT * FROM [ledger_jan, ledger_feb, ledger_mar] as ledger
or use a regular expression to select from all tables in the dataset starting with the name "ledger"
SELECT * FROM [/ledger.*/] as ledger
Note that tables to be selected this way must have all the same column names and types. For more general combinations of tables, you'll need to use the "UNION" construct instead.
To do any querying with dwSQL, you need to know the names of the tables and columns in your dataset. These names are generated by data.world when you upload your data, based on what we could infer about the structure of your data. To find out exactly what tables and columns you have available, you can just query that information from dwSQL. Each dataset contains two artificial tables named "Tables" and "TableColumns" which you can query just as though they were tables that you imported into your dataset. "Tables" lists the tables in your dataset, while "TableColumns" additionally lists the columns available for each table.
SELECT * FROM Tables
What's missing from dwSQL
Subqueries - Subqueries are not currently fully supported in dwSQL. In particular subqueries in the
FROMsection of a query are prohibiited. Subqueries are allowed as part of
NOT LIKE- Unlike other SQL implementations,
NOT LIKEonly allow comparison to fixed pattern strings, not arbitrary expressions. Thus you can say
select name from MyTable where name like 'Bob%'
select name from MyTable where name like UPPER('Bob%')
dwSQL identifiers are used to reference the tables and columns that your query operates over. Simple dwSQL identifiers consist of alphanumeric characters or underscores, and must start with an alphabetic character. Identifiers with other characters may be used, but must be surrounded with back-quotes. This is usually only done to refer to tables or columns which were imported with non-alphanumeric characters in their names.
SELECT name FROM `MyTable/with/slashes/in/the/name`
dwSQL allows inline comments, to allow query logic to be annotated with user intent. Comments start with either # or -- and continue to the end of the line.
--A very simple query SELECT name FROM MyTable # the end
dwSQL keywords are case-insensitive. dwSQL identifiers are case-sensitive.
All string operations and comparisons supported by dwSQL are also case-sensitive,
with the exception of
LIKE which are case-insensitive
for compatibility with SQL standards.
Keywords and Functions
These are the standard SQL keywords which are supported by dwSQL.
The following built-in functions are supported by dwSQL.
- Numeric functions
- String functions
- Date/Time functions