SQL on data.world

For querying datasets stored on data.world, a dialect of SQL named dwSQL has been developed. dwSQL is a full-featured dialect of SQL for reading and analyzing your data.world datasets, and for joining in data from other data.world datasets. This tutorial will cover what is supported with dwSQL to assist you in building SQL queries on data.world.

Important note: SQL is currently supported only for reading data from data.world datasets. None of the standard SQL data manipulation statements for inserting, modifying, or deleting data are available on data.world.

Select queries

The most common way to query a dataset using dwSQL is with a SELECT query. SELECT queries include the following clauses:

  • A SELECT clause, indicating just what fields should be returned by the query
  • A FROM clause, indicating which tables should be used for the query, and possibly how those tables should be joined.
  • An optional WHERE clause, allowing the results limited to those that satisfy one or more criteria.
  • An optional ORDER BY clause, describing how the results should be sorted.
  • An optional GROUP BY clause, describing how the results should be grouped together for aggregation purposes.
  • An optional HAVING clause, similar to the WHERE clause, but for limiting aggregated results by one or more criteria.
  • An optional LIMIT clause, limiting to the results returned to some fixed number.
  • An optional OFFSET clause, indicating that some fixed number of results should be skipped.

Here's an example dwSQL query which joins two tables: MyTable and MyOtherTable. It shows all of the clauses that dwSQL supports

SELECT MyTable.name, SUM(MyOtherTable.score) as total
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
WHERE MyOtherTable.score > 3
ORDER by name
GROUP by name
HAVING total > 42
LIMIT 10
OFFSET 20

Select Clauses

A SELECT clause indicates just which columns or calculated values should be returned from a query. These may be explicitly listed by name

SELECT MyTable.name, MyOtherTable.score
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

or implicitly specified using a wildcard select, which causes all of the fields in all tables involved in the query to be returned

SELECT *  
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

Wildcards are very easy to use, but don't allow much control. A restricted wildcard syntax is also available, which only returns the columns from one table

SELECT MyTable.*
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

If marked with DISTINCT in the select clause, results from the query will be deduplicated before being returned

SELECT DISTINCT name
FROM MyTable

From Clauses

dwSQL supports all of the standard formats of SQL FROM clauses, including implicit join syntax:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable, MyOtherTable
where MyTable.id = MyOtherTable.myTableId

explicit join syntax with the ON keyword:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)

explicit join syntax with the USING keyword:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable USING(ID)

and NATURAL join syntax:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable NATURAL JOIN MyOtherTable

Left, right and full outer joins are also supported.

SELECT MyTable.name, MyOtherTable.address
FROM MyTable LEFT OUTER JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)

Where Clauses

To limit the results from a query based on some criteria, dwSQL provides support for SQL WHERE clauses. A WHERE clause indicates that a result should only be returned by the query if a specified boolean expression evaluates to true. The following standard expressions are supported by dwSQL for construction of where clauses and calculated fields

  • Comparisons (>, <, <=, >=, !=, =)
  • Numeric operations (+, - %, /)
  • CASE expressions, allowing different results to be calculated based on
  • IN and NOT IN expressions, for testing containment in some set
  • IS NULL and IS NOT NULL, for testing whether a value is available
  • LIKE and NOT LIKE, for matching strings
  • BETWEEN for testing whether numeric values are in a given range
  • CAST for converting values between available data types
  • A variety of standard built-in functions, listed in a later section

Order By Clauses

dwSQL fully supports standard SQL ORDER BY syntax for sorting the results of a query. The following query returns the names for the rows in MyTable, sorted by name.

SELECT name
FROM MyTable
ORDER BY name

It is also possible to sort results by more than one field, lexicographically. The standard SQL sort modifiers of ASC and DESC are supported by dwSQL, for indicated whether fields should be sorted in ascending or descending order.

SELECT age, name
FROM MyTable
ORDER BY age DESC, name ASC

Group By Clauses and Aggregation

A common use for a dwSQL query is to group a set of results together based on some subset of their values, and then perform aggregate calculations based on that grouping

SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name

The following standard SQL aggregate functions are supported by dwSQL: SUM, MIN, MAX, COUNT, AVG, and GROUP_CONCAT. Also supported are statistical functions STDEV, STD_SAMP, STD_POP, VARIANCE, VAR_POP, and VAR_SAMP, for calculating the variances and standard deviations of sets of query results, either on a population or sample basis. For every aggregate functions, dwSQL allows calculations to be marked as DISTINCT, indicating that the result sets should be deduplicated before they are aggregated

SELECT name, COUNT(DISTINCT address) as numAddresses
FROM MyTable
GROUP BY name

Additionally dwSQL supports filtered aggregation, allowing aggregate calculations to be performed over specified partitions of the rows available

SELECT name, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016
FROM MyTable
GROUP BY name

Having Clauses

Similar to a WHERE clause, a HAVING clause allows the results of a query to be filtered based on some boolean criteria. While a WHERE clause operates on raw data values, a HAVING clause operates on aggregated values.

SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name
HAVING totalScore > 100

Limit and Offset Clauses

In order to allow only a portion of the results of a query to be returned, dwSQL provides LIMIT and OFFSET clauses. A LIMIT clause provides the maximum number of rows that should be returned from a query. An OFFSET clause lists the number of rows that should be skipped. Using LIMIT and OFFSET together allow pagination through large query results.

SELECT name
FROM MyTable
ORDER BY name
LIMIT 10
OFFSET 10

Union, Intersect, and Minus

While individual SELECT queries are the most common way of getting data from dwSQL, sometimes it is necessary to combine results from multiple queries to get the data that you want. For this purpose, dwSQL provides support for combining individual select queries into UNION, MINUS, and INTERSECT queries. UNION queries combine the results of two queries, returning all results from either query.

SELECT name
FROM MyTable
UNION
SELECT name
FROM MyOtherTable

MINUS queries remove the results of one query from the results of another.

SELECT name
FROM MyTable
MINUS
SELECT name
FROM MyOtherTable

INTERSECT queries combine the results of two queries, only those results that are common to both queries.

SELECT name
FROM MyTable
INTERSECT
SELECT name
FROM MyOtherTable

Federated Queries

We at data.world are committed to making your data not just easy to access, but also easy to use with other people's data. In dwSQL, it is possible to write queries not just against one dataset, but against multiple different data.world datasets, using what we call a 'federated query'. Federated queries look just like local queries, but require a bit more work to reference the remote tables. You need to specify not just the name of the remote table, but also the name of the remote data.world dataset the table is in.

SELECT age, name
FROM MyTable join dave.lahman.Batting using(playerId)

That query joins data from your dataset with data in the the Lahman Sabremetrics baseball dataset (named "lahman"), maintained by user "dave" (that's me).

To keep from having to write out the full name of the dataset and table, you can use a table alias. Once you've defined a table alias, you can use it elsewhere in the query to reference that table.

SELECT age, name
FROM MyTable join dave.lahman.Batting as Batting using(playerId)
ORDER BY Batting.yearId

Multi-table Selects

A common use case with data.world data sets is to have tabular data that is split among multiple files. For instance, you might upload a new file each month of ledger data for your business or non-profit organization. You might have files named "ledger_jan.csv", "ledger_feb.csv", and "ledger_mar.csv". Upon ingest by data.world, each file would result in a separate table, in this case "ledger_jan","ledger_feb", and "ledger_mar". If you wanted to query about data from all of these in most SQL dialects, you would need to use a "UNION" construct, which would rapidly get cumbersome. Because this is expected to be a common use-case for data.world, we've created the concept of a multi-table select to dwSQL. The select form the three ledger tables, you could list the tables explicitly between square brackets like this:

SELECT *
FROM [ledger_jan, ledger_feb, ledger_mar] as ledger

or use a regular expression to select from all tables in the dataset starting with the name "ledger"

SELECT *
FROM [/ledger.*/] as ledger

Note that tables to be selected this way must have all the same column names and types. For more general combinations of tables, you'll need to use the "UNION" construct instead.

Metadata tables

To do any querying with dwSQL, you need to know the names of the tables and columns in your dataset. These names are generated by data.world when you upload your data, based on what we could infer about the structure of your data. To find out exactly what tables and columns you have available, you can just query that information from dwSQL. Each dataset contains two artificial tables named "Tables" and "TableColumns" which you can query just as though they were tables that you imported into your dataset. "Tables" lists the tables in your dataset, while "TableColumns" additionally lists the columns available for each table.

SELECT *
FROM Tables

What's missing from dwSQL

  • Subqueries - Subqueries are not currently fully supported in dwSQL. In particular subqueries in the FROM section of a query are prohibiited. Subqueries are allowed as part of EXISTS, NOT EXISTS, IN and NOT IN constructs.

  • Fully general LIKE and NOT LIKE - Unlike other SQL implementations, LIKE and NOT LIKE only allow comparison to fixed pattern strings, not arbitrary expressions. Thus you can say

    select name
    from MyTable
    where name like 'Bob%'
    

    but not

    select name
    from MyTable
    where name like UPPER('Bob%')
    

Identifiers

dwSQL identifiers are used to reference the tables and columns that your query operates over. Simple dwSQL identifiers consist of alphanumeric characters or underscores, and must start with an alphabetic character. Identifiers with other characters may be used, but must be surrounded with back-quotes. This is usually only done to refer to tables or columns which were imported with non-alphanumeric characters in their names.

SELECT name
FROM `MyTable/with/slashes/in/the/name`

Comments

dwSQL allows inline comments, to allow query logic to be annotated with user intent. Comments start with either # or -- and continue to the end of the line.

--A very simple query
SELECT name
FROM MyTable
# the end

Case Sensitivity

dwSQL keywords are case-insensitive. dwSQL identifiers are case-sensitive. All string operations and comparisons supported by dwSQL are also case-sensitive, with the exception of LIKE and NOT LIKE which are case-insensitive for compatibility with SQL standards.

Keywords and Functions

Keywords

These are the standard SQL keywords which are supported by dwSQL.

  • SELECT
  • DISTINCT
  • FROM
  • JOIN
  • LEFT
  • RIGHT
  • FULL
  • OUTER
  • INNER
  • ON
  • USING
  • NATURAL
  • WHERE
  • GROUP
  • BY
  • ORDER
  • ASC
  • DESC
  • HAVING
  • LIMIT
  • OFFSET
  • UNION
  • INTERSECT
  • MINUS
  • AS
  • AND
  • OR
  • NOT
  • IN
  • NULL
  • LIKE
  • CAST
  • CASE
  • WHEN
  • THEN
  • ELSE
  • END

Built-in Functions

The following built-in functions are supported by dwSQL.

  • Numeric functions
    • abs
    • floor
    • ceiling
    • round
    • pi
    • exp
    • exp10
    • log (with synonym ln)
    • pow (with synonym power)
    • sqrt
    • sin
    • cos
    • tan
    • asin
    • acos
    • atan
    • atan2
  • String functions
    • length
    • upper
    • lower
    • concat
    • length
    • substring
    • replace
    • regex
  • Date/Time functions
    • year
    • month
    • day
    • hours
    • minutes
    • seconds
    • now

results matching ""

    No results matching ""