SQL on data.world

For querying datasets stored on data.world, a dialect of SQL named dwSQL has been developed. dwSQL is a full-featured dialect of SQL for reading and analyzing your data.world datasets, and for joining in data from other data.world datasets. This tutorial will cover what is supported with dwSQL to assist you in building SQL queries on data.world.

Important note: SQL is currently supported only for reading data from data.world datasets. None of the standard SQL data manipulation statements for inserting, modifying, or deleting data are available on data.world.

Select queries

The most common way to query a dataset using dwSQL is with a SELECT query. SELECT queries include the following clauses:

  • A SELECT clause, indicating just what fields should be returned by the query
  • An optional FROM clause, indicating which tables should be used for the query, and possibly how those tables should be joined.
  • An optional WHERE clause, allowing the results limited to those that satisfy one or more criteria.
  • An optional ORDER BY clause, describing how the results should be sorted.
  • An optional GROUP BY clause, describing how the results should be grouped together for aggregation purposes.
  • An optional HAVING clause, similar to the WHERE clause, but for limiting aggregated results by one or more criteria.
  • An optional LIMIT clause, limiting to the results returned to some fixed number.
  • An optional OFFSET clause, indicating that some fixed number of results should be skipped.

Here's an example dwSQL query which joins two tables: MyTable and MyOtherTable. It shows all of the clauses that dwSQL supports

SELECT MyTable.name, SUM(MyOtherTable.score) as total
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)
WHERE MyOtherTable.score > 3
ORDER by name
GROUP by name
HAVING total > 42
LIMIT 10
OFFSET 20

Select Clauses

A SELECT clause indicates just which columns or calculated values should be returned from a query. These may be explicitly listed by name

SELECT MyTable.name, MyOtherTable.score
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

or implicitly specified using a wildcard select, which causes all of the fields in all tables involved in the query to be returned

SELECT *  
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

Wildcards are very easy to use, but don't allow much control. A restricted wildcard syntax is also available, which only returns the columns from one table

SELECT MyTable.*
FROM MyTable join MyOtherTable on(MyTable.id = MyOtherTable.myTableId)

If marked with DISTINCT in the select clause, results from the query will be deduplicated before being returned

SELECT DISTINCT name
FROM MyTable

From Clauses

dwSQL supports all of the standard formats of SQL FROM clauses, including implicit join syntax:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable, MyOtherTable
where MyTable.id = MyOtherTable.myTableId

explicit join syntax with the ON keyword:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)

explicit join syntax with the USING keyword:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable JOIN MyOtherTable USING(ID)

and NATURAL join syntax:

SELECT MyTable.name, MyOtherTable.address
FROM MyTable NATURAL JOIN MyOtherTable

Left, right and full outer joins are also supported.

SELECT MyTable.name, MyOtherTable.address
FROM MyTable LEFT OUTER JOIN MyOtherTable ON (MyTable.id = MyOtherTable.myTableId)

Where Clauses

To limit the results from a query based on some criteria, dwSQL provides support for SQL WHERE clauses. A WHERE clause indicates that a result should only be returned by the query if a specified boolean expression evaluates to true. The following standard expressions are supported by dwSQL for construction of where clauses and calculated fields

  • Comparisons (>, <, <=, >=, !=, =)
  • Numeric operations (+, - %, /)
  • CASE expressions, allowing different results to be calculated based on
  • IN and NOT IN expressions, for testing containment in some set
  • IS NULL and IS NOT NULL, for testing whether a value is available
  • LIKE and NOT LIKE, for matching strings
  • POSITION, for finding the first index of one string inside another
  • BETWEEN for testing whether numeric values are in a given range
  • CAST for converting values between available data types
  • A variety of standard built-in functions, listed in a later section

Order By Clauses

dwSQL fully supports standard SQL ORDER BY syntax for sorting the results of a query. The following query returns the names for the rows in MyTable, sorted by name.

SELECT name
FROM MyTable
ORDER BY name

It is also possible to sort results by more than one field, lexicographically. The standard SQL sort modifiers of ASC and DESC are supported by dwSQL, for indicated whether fields should be sorted in ascending or descending order.

SELECT age, name
FROM MyTable
ORDER BY age DESC, name ASC

Group By Clauses and Aggregation

A common use for a dwSQL query is to group a set of results together based on some subset of their values, and then perform aggregate calculations based on that grouping

SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name

The following standard SQL aggregate functions are supported by dwSQL: SUM, MIN, MAX, COUNT, AVG, and GROUP_CONCAT. Also supported are statistical functions STDEV, STD_SAMP, STD_POP, VARIANCE, VAR_POP, and VAR_SAMP, for calculating the variances and standard deviations of sets of query results, either on a population or sample basis. A two-argument CORRELATION aggregation is also available, calculating the Pearson's correlation coefficient for a pair of variables, as are COVAR_POP and COVAR_SAMP for calculating population and sampling covariances for pairs of variables. For more advanced statistical queries, standard linear regression aggregations are available, including REGR_SLOPE (slope of the linear regression), REGR_INTERCEPT (intercept of the linear regression) and REGR_COUNT (number of points used for the linear regression), as well as REGR_R2, REGR_SXX,REGR_SYY,REGR_SXY,REGR_AVGX, andREGR_AVGY, detailing various component calculations of the regression. For boolean expressions, the BOOL_AND and BOOL_OR aggregations allow for calculating the aggregate conjunction or disjunction for a set of boolean values. For every aggregate functions, dwSQL allows calculations to be marked as DISTINCT, indicating that the result sets should be deduplicated before they are aggregated

SELECT name, COUNT(DISTINCT address) as numAddresses
FROM MyTable
GROUP BY name

Additionally dwSQL supports filtered aggregation, allowing aggregate calculations to be performed over specified partitions of the rows available

SELECT name, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016
FROM MyTable
GROUP BY name

Finally, for more detailed reporting needs, dwSQL supports ROLLUP and CUBE aggregations, allowing values to be aggregated on several different sets of axes

#aggregates by name, and by name and department
SELECT name, department, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016
FROM MyTable
GROUP BY ROLLUP(name, department)
#aggregates by name, by departement, and by name and department
SELECT name, department, SUM(score) as totalScore, SUM(score) FILTER(WHERE year=2016) as totalScore2016
FROM MyTable
GROUP BY CUBE(name, department)

Having Clauses

Similar to a WHERE clause, a HAVING clause allows the results of a query to be filtered based on some boolean criteria. While a WHERE clause operates on raw data values, a HAVING clause operates on aggregated values.

SELECT name, SUM(score) as totalScore
FROM MyTable
GROUP BY name
HAVING totalScore > 100

Limit and Offset Clauses

In order to allow only a portion of the results of a query to be returned, dwSQL provides LIMIT and OFFSET clauses. A LIMIT clause provides the maximum number of rows that should be returned from a query. An OFFSET clause lists the number of rows that should be skipped. Using LIMIT and OFFSET together allow pagination through large query results.

SELECT name
FROM MyTable
ORDER BY name
LIMIT 10
OFFSET 10

Union, Intersect, and Minus

While individual SELECT queries are the most common way of getting data from dwSQL, sometimes it is necessary to combine results from multiple queries to get the data that you want. For this purpose, dwSQL provides support for combining individual select queries into UNION, MINUS, and INTERSECT queries. UNION queries combine the results of two queries, returning all results from either query.

SELECT name
FROM MyTable
UNION
SELECT name
FROM MyOtherTable

MINUS queries remove the results of one query from the results of another.

SELECT name
FROM MyTable
MINUS
SELECT name
FROM MyOtherTable

INTERSECT queries combine the results of two queries, only those results that are common to both queries.

SELECT name
FROM MyTable
INTERSECT
SELECT name
FROM MyOtherTable

With Queries

In order to support the ability to form complex queries out of simple queries, dwSQL supports the standard WITH form of defining subqueries. WITH subqueries allow you to define and name one or more queries which can then be used to define one main query. Once a named subquery is defined in a later subquery or the main query, just as though it were another "virtual" table.

WITH Derived AS (SELECT name
              FROM MyTable
              WHERE name LIKE 'Dave%')
SELECT name, COUNT(*)
FROM Derived   
GROUP BY name
ORDER BY name

Inline Subqueries

dwSQL also support inline subqueries, where one query can be used directly as a building block of another. Inline subqueries are supported in the following contexts:

  • In EXISTS or NOT EXISTS expressions
  • On the right side of IN or NOT IN expressions
  • On the right side of comparison expressions, with modifiers ANY, ALL or SOME
  • As a component of a FROM clause, in place of a table name. In this case, the subquery must must be "uncorrelated", not referencing any fields from the containing query.
  • Anywhere an expression is allowed. In that case the subquery is subject to several restrictions. It must return exactly one row and one column. It may not contain GROUP BY or HAVING clauses, nor must it uses UNION, INTERSECTION or MINUS. Finally, the subquery must must be "uncorrelated", not referencing any fields from the containing query.

Federated Queries

We at data.world are committed to making your data not just easy to access, but also easy to use with other people's data. In dwSQL, it is possible to write queries not just against one dataset, but against multiple different data.world datasets, using what we call a 'federated query'. Federated queries look just like local queries, but require a bit more work to reference the remote tables. You need to specify not just the name of the remote table, but also the name of the remote data.world dataset the table is in.

SELECT age, name
FROM MyTable join dave.lahman.Batting using(playerId)

That query joins data from your dataset with data in the the Lahman Sabremetrics baseball dataset (named "lahman"), maintained by user "dave" (that's me).

To keep from having to write out the full name of the dataset and table, you can use a table alias. Once you've defined a table alias, you can use it elsewhere in the query to reference that table.

SELECT age, name
FROM MyTable join dave.lahman.Batting as Batting using(playerId)
ORDER BY Batting.yearId

Multi-table Selects

A common use case with data.world data sets is to have tabular data that is split among multiple files. For instance, you might upload a new file each month of ledger data for your business or non-profit organization. You might have files named "ledger_jan.csv", "ledger_feb.csv", and "ledger_mar.csv". Upon ingest by data.world, each file would result in a separate table, in this case "ledger_jan","ledger_feb", and "ledger_mar". If you wanted to query about data from all of these in most SQL dialects, you would need to use a "UNION" construct, which would rapidly get cumbersome. Because this is expected to be a common use-case for data.world, we've created the concept of a multi-table select to dwSQL. The select form the three ledger tables, you could list the tables explicitly between square brackets like this:

SELECT *
FROM [ledger_jan, ledger_feb, ledger_mar] as ledger

or use a regular expression to select from all tables in the dataset starting with the name "ledger"

SELECT *
FROM [/ledger.*/] as ledger

Note that tables to be selected this way must have all the same column names, unless you explicitly specify the names of the columns you wish to query.

SELECT *
FROM [ledger_jan, ledger_feb, ledger_mar](col1, col2, col3) as ledger

Specifying the columns makes it possible to using multi-table queries where the underlying tables have evolved over time, for instance by adding new columns. Even if you do explicitly specify the columns for a multi-table query, all of the specified columns must be present in all of the tables, and must all have the same type in each table.

It is often useful to know which from which table a give row in a multi-table was selected. You can query this using the special table_name column added to each multi-table

SELECT ledger.*, ledger.table_name
FROM [ledger_jan, ledger_feb, ledger_mar] as ledger

For more general combinations of tables, you'll need to use the "UNION" construct instead.

Metadata Tables

To do any querying with dwSQL, you need to know the names of the tables and columns in your dataset. These names are generated by data.world when you upload your data, based on what we could infer about the structure of your data. To find out exactly what tables and columns you have available, you can just query that information from dwSQL. Each dataset contains two artificial tables named "Tables" and "TableColumns" which you can query just as though they were tables that you imported into your dataset. "Tables" lists the tables in your dataset, while "TableColumns" additionally lists the columns available for each table.

SELECT *
FROM Tables

Special Columns

For each table, dwSQL generates two special columns, row_index and row_address. row_index just refers to the order of the row in the table, starting with 0. row_address is a unique URL which is generated for the row. Under the covers, dwSQL is implemented using semantic web technology, and one of the key ideas of the semantic web is that every entity should have it's own unique URL.

Note that unlike normal columns, row_index and row_address will not be returned by default using the SELECT * form of a dwSQL query. Instead they must be explicitly listed. To query all of the data for a table and it's associated special columns, you would use the following form.

SELECT MyTable.*, row_index, row_address
FROM MyTable

Identifiers

dwSQL identifiers are used to reference the tables and columns that your query operates over. Simple dwSQL identifiers consist of alphanumeric characters or underscores, and must start with an alphabetic character. Identifiers with other characters may be used, but must be surrounded with back-quotes. This is usually only done to refer to tables or columns which were imported with non-alphanumeric characters in their names.

SELECT name
FROM `MyTable/with/slashes/in/the/name`

Comments

dwSQL allows inline comments, to allow query logic to be annotated with user intent. Comments start with either # or -- and continue to the end of the line.

--A very simple query
SELECT name
FROM MyTable
# the end

Case Sensitivity

dwSQL keywords are case-insensitive. dwSQL identifiers are also case-insensitive. All string operations and comparisons supported by dwSQL are case-sensitive, with the exception of LIKE and NOT LIKE which are case-insensitive for compatibility with SQL standards.

Keywords and Functions

Keywords

These are the standard SQL keywords which are supported by dwSQL.

  • SELECT
  • DISTINCT
  • FROM
  • JOIN
  • LEFT
  • RIGHT
  • FULL
  • OUTER
  • INNER
  • ON
  • USE
  • USING
  • NATURAL
  • WHERE
  • GROUP
  • BY
  • ORDER
  • ASC
  • DESC
  • HAVING
  • LIMIT
  • OFFSET
  • UNION
  • INTERSECT
  • MINUS
  • AS
  • AND
  • OR
  • NOT
  • IN
  • NULL
  • LIKE
  • CAST
  • CASE
  • WHEN
  • THEN
  • ELSE
  • END
  • ANY
  • ALL
  • SOME
  • FILTER

Built-in Functions

The following built-in functions are supported by dwSQL.

  • General functions
    • if - Executes it's boolean first argument, then returns the evaluation of the second argument if true or the third if false. Takes three arguments.
    • coalesce - Evaluates the arguments in order and returns the first expression that does not evaluate to NULL. Takes any number of arguments.
  • Numeric functions
    • abs - The absolute value of a numeric value. Takes one argument.
    • floor - The greatest integer less than or equal to a numeric value. Takes one argument.
    • ceiling - The least integer greater to or equal to a numeric value. Takes one argument.
    • round - When called with one numeric argument, round returns the nearest integer to that argument. When called with two numeric arguments, the second argument indicates how many decimal places to round the first argument to.
    • pi - The mathematical constant pi, 3.1415... Takes no arguments.
    • exp - The mathematical exponential function, e to the power x. Takes one argument.
    • exp10 - The decimal exponential function, 10 to the power x. Takes one argument.
    • log (with synonym ln) - The natural log function, inverse of exp. Takes one argument.
    • pow (with synonym power) - The general exponentiation function. Takes two arguments.
    • sqrt - The square root function. Takes one argument.
    • sin - The sine function, from trigonometry. Takes one argument.
    • cos- The cosine function, from trigonometry. Takes one argument.
    • tan - The tangent function, from trigonometry. Takes one argument.
    • asin - The inverse sine function, from trigonometry. Takes one argument.
    • acos - The inverse cosine function, from trigonometry. Takes one argument.
    • atan - The inverse tangent function, from trigonometry. Takes one argument.
    • atan2 - The two-argument form of the inverse tangent function, from trigonometry. atan(x, y) is equivalent to atan(x/y) if y is not 0. Takes two arguments
    • sinh - The hyperbolic sine function, from trigonometry. Takes one argument.
    • cosh - The hyperbolic cosine function, from trigonometry. Takes one argument.
    • tanh - The hyperbolic tangent function, from trigonometry. Takes one argument.
    • radians - Convert an angle from degrees to radians. Takes one argument.
    • degrees - Convert an angle from degrees to radians. Takes one argument.
    • sign - The signum of a numeric argument: -1, 0, or 1. Takes one argument
  • String functions
    • length - The length of a string. Takes one argument.
    • upper - Convert a string to upper case. Takes one argument.
    • lower - Convert a string to lower case. Takes one argument.
    • concat - Concatenates two strings, returning a string consisting of the first argument followed by the second. Takes two arguments.
    • trim - Trims whitespace from the start and end of a string. Takes one argument.
    • ltrim - Trims whitespace from the start of a string. Takes one argument.
    • rtrim - Trims whitespace from the end of a string. Takes one argument.
    • lpad - pad the start of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).
    • rpad - pad the end of a string with extra characters, to bring the up to a specified length. Takes two or three arguments: the string to pad, the length to pad it to, and (optionally) the character pad with (space is the default).
    • substring - Extract a substring from a string. Takes two or three arguments: the original string, the position of the character to start at (the first character position "1"), and (optionally) the number of characters to extract
    • replace - For a given string, replace each non-overlapping instance of a regular expression pattern by a fixed string. Takes three arguments: the original string, the pattern to replace, and the the string to replace it with.
    • regex - Check whether a string matches a given regular expression. Similar to LIKE, but more powerful. Takes two arguments, the string and the regex to match to it.
    • left - Extract a substring from the leftmost characters in a string. Takes two arguments: the original string and the number of characters to extract
    • right - Extract a substring from the rightmost characters in a string. Takes two arguments: the original string and the number of characters to extract
    • mid - Extract a substring from the middle of a string. Takes three arguments: the original string, the position of the character to start at (the first character position "1"), and the number of characters to extract
  • Date/Time functions
    • year - Extract the year from a date or date time value, as an integer. Takes one argument
    • month - Extract the month value from a date or date time value, as an integer (January = 1). Takes one argument
    • day - Extract the day of the month from a date or date time value, as an integer. Takes one argument
    • hours - Extract the hours from a time or date time value, as an integer. Takes one argument
    • minutes - Extract the minutes from a time or date time value, as an integer. Takes one argument
    • seconds - Extract the seconds from a time or date time value, as an integer. Takes one argument
    • now - Returns time for the current query execution. This will be the same for every call to now in a query, but exactly which time is not specified. Takes no arguments
    • date_add - Adds a fixed duration to a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", "second", "quarter", "decade", or "century")
    • date_sub - Subtracts a fixed duration from a date. Takes three arguments, the original date, a count, and a date part string (one of "year", "month", "day", "hour", "minute", "second", "quarter", "decade", or "century")
    • date_part - Extracts a date part from a date or time, as an integer. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", "second", "quarter", "decade", "century", "timezone", "timezone_hour", or "timezone_minute")
    • date_trunc - Truncates a date to a given precision. Takes two arguments, the original date, and a date part string (one of "year", "month", "day", "hour", "minute", "second", "quarter", "decade", or "century")
    • date_format - Formats a date, time, or datetime as a string. Takes two arguments, a date to format and a format string as specified here: https://www.w3.org/TR/NOTE-datetime
    • date_diff - Finds the difference between two dates, expressed as in integral number of some date part (e.g, 3 years). Takes three arguments, a start date, an end date, and a date part string (one of "year", "month", "day", "hour", "minute", "second", "quarter", "decade", or "century")
    • at_time_zone - Either sets or changes the timezone of a datetime or datetimestamp value. Takes two arguments, a datetime and a string representing a timezone. This may be either a standard three-letter abbreviation, or an IANA time zone name.

results matching ""

    No results matching ""