Enterprise docs

Query data

A query, in the context of data, is a request for information written in a query language. A query language uses a specified set of terms in a specific order with proscribed punctuation. The query languages used on data.world are SQL and SPARQL. While data.world is built on semantic web technologies that are best queried using SPARQL, we recognize that most people are more familiar with SQL. For that reason, we've developed our own version of SQL to easily query data.world datasets. If you are new to SQL, we have documentation that will walk you through everything from 'what is a database' to 'how do I join all these different tables together so I can query them all at the same time'. There is also a tutorial based on the exercises in the documentation. For those who would rather work with SPARQL, data.world also has a tutorial for SPARQL that will get you up to speed regardless of your skill level.

Create a query

To write a SQL query against data in a project go to the upper left corner of the workspace and select SQL Query from the + Add menu:

Image_2019-09-18_at_10.50.33_AM.png

When the query editor comes up in the center panel of the screen, the Project Schema also comes up in the right side panel. If the project schema isn't showing, select the arrow to the right of the Run query button to expand the sidebar:

Image_2019-09-18_at_11.19.50_AM.png
Auto-complete of terms and objects

As you type your query, the editor suggests SQL terms--operators, aggregations, and functions--for you. Hitting 'enter' will autocomplete the highlighted suggestion in the list. Arrow down or up to choose another option:

Image_2019-09-18_at_11.48.00_AM.png
Click-to-copy for columns and tables

You have the option of typing in the table and column names for the data you are querying, but you can also use the click-to-copy feature on the project schema to copy the names for you so you don't have to worry about typos.

Image_2019-09-18_at_2.31.00_PM.png
Pop-up help text

Help text is provided for all SQL functions and aggregations when you hover over the function name:

Image_2019-09-18_at_11.52.24_AM.png
How to auto-format queries

While you can type on one line type your SQL query all on one line--it is perfectly legal and will run--it's good practice to split it up based on clauses and punctuation. The line breaks make it easier to see where you may have forgotten a comma or some other element of the query, and it also allows us to give you better error messages if there are problems with your query.

Here's an example of a query before and after running it through our auto-formatter:

Image_2019-09-18_at_2.04.12_PM.png
Image_2019-09-18_at_2.04.55_PM.png

To auto-format your query, use the keyboard shortcut Cmd + Option + L . If you forget the command there is a dropdown menu to the right of the Run query button which has a link to a list of all the keyboard shortcuts.

Image_2019-09-18_at_1.14.15_PM.png

From the dropdown menu you can also manage your query (rename, copy, etc.) as well as get to our SQL tutorial for additional help.

Image_2019-09-18_at_1.13.59_PM.png
Troubleshooting

If you try to auto-format your query and it doesn't work, that's a sign that you have an error in the query. If you run it you'll get an error message telling you what the problem is and where to find it:

Image_2019-09-18_at_2.18.18_PM.png

In the above example the query is written to search for all the records that have the value "won" in the deal_stage column in the sales_pipeline table. However the way the query is written it looks like it is searching for records where the value in the deal_stage column matches the value in another column named Won. For "won" to be recognized as a value, it has to be surrounded by double-quotes. The error message returned from running the query tells you that the problem is in the 4th line and the 36th character (column). The 36th character should have been the opening " on the 4th line:

Image_2019-09-18_at_2.46.44_PM.png
Saving queries

When your query is complete, save it for later use by selecting Save to the left of the Run query button. You'll be prompted to name the query, add a description, determine where to save it, and indicate who can see it.

It's good practice to give the query a short descriptive name for both your own later use and for use by others working on your project. Adding a description also makes your query more useable as the description shows up on the resource card returned from using search on data.world.

Saving a query to a dataset

Queries are saved to projects by default. The logic behind this structure is that datasets are for storing files and tables, and projects are for querying and analyzing those files and tables. A dataset is meant to be reused in multiple projects, and if queries are saved to it instead of to the projects using it then the dataset can rapidly fill with irrelevant queries making it difficult to use. However if the queries specific to a project are all stored in that project, the linked dataset remains clean and ready for reuse.

The reasoning above covers 80% of the use cases, but what about the times you really do want to save a query to a dataset? Maybe you want to clean up the data, join tables and preserve the lineage of the original tables for reference, or just use the query in multiple projects without having to rewrite it (you might even want to parameterize it). In those cases it is useful to be able to save your query to the dataset, and you can do that. After running your query, to save it to the dataset select the Save link and click the drop-down link to the right of the + New Project option. In addition to New project you'll also see the name of the dataset. Select it and the query will be saved to the dataset and you'll still be in an untitled, unsaved project:

Image_2019-10-16_at_5.10.12_PM.png

One thing about saving queries to the dataset instead of to the project is that queries saved to a dataset won't show up in the queries list of any project the dataset is used in. Instead they'll be displayed under the connected datasets info:

Image_2019-10-16_at_5.17.38_PM.png

Ways to use a query

Queries against your data are the building blocks for an analysis of the data. We covered the basics of writing your own queries in the article on creating a query. In this article we'll look at all the things you can do with an existing query.

Use other people's queries

As you work in projects with other people, you might find a query that you want to use and edit, but it's not your query so you don't have permissions to save changes to it. The easiest thing to do is to make a copy of the query by selecting the three dots to the right of the query name in the project workspace and selecting Save a copy:

You'll be prompted to name the query and write a description. For tracking lineage it's a good idea to include the name of the original query in the description. Other save options include where--the current project or the original dataset--and who can see the new query (hidden by the location selection in the image below:

Image_2019-10-08_at_2.07.00_PM.png
Copy a query to another dataset or project

Normally the choice would be to save the new query in the current project because then it would show up in the list of all the queries and would be limited to contributors on the project. However if the query would be useful to others who create projects from that dataset, it might make more sense to save it directly to the dataset. Saving it to the dataset enables anyone with access to the dataset to use it (unless you also set it to share with no one). It's also the easiest way to save the query to another project that is linked to the same dataset. One note: queries saved to datasets show up in the Connected datasets section of the Project directory in the left sidebar--not under Queries: Alternatively you could copy and paste the query into a new query window in the other project and save it directly there.

Make a template from a query

Another way you might want to modify a query is to make a template from it so you can change the input strings to various parts of it. For example we could turn the query we just saved as Animal breed query into a query template by selecting the New template link at the top right of the query panel:

Image_2019-10-08_at_3.58.53_PM.png

A dialog window will pop up giving you basic instructions to how query templates work and linking you to the SQL documentation for query templates:

Image_2019-10-18_at_10.38.12_AM.png

Select Add a sample statement a sample of the DECLARE statement used in query templates, a sample comment, and a link to the documentation will be inserted in the top of your query window:

Image_2019-10-08_at_4.05.53_PM.png

With just a couple of modifications to the SQL you can have a query where other users can input values for animal types without ever having to know SQL themselves. Here we moved the comment to the top line so it would show up as a prompt in the input area, and we indicated where in the original query the input value would go (in the WHERE clause):

Image_2019-10-08_at_4.15.41_PM.png

You can preview the query to see what other users will see and to test the query yourself by selecting the Preview query template link in the upper right corner of the query panel. Then fill in a value and hit Run query:

Image_2019-10-08_at_4.21.45_PM.png

For more detailed information on query templates see the article on using query templates.

Download query results

Another way to use a query is to download the results from it in either a csv or an xlsx file. After running your query, click the Download link below the query window and select the option you want:

Image_2019-10-17_at_11.01.42_AM.png
Save to project or dataset

Another option for working with your query results is to save them as a tabular file in your current project, a different project, in an existing dataset, or in a new dataset:

Image_2019-10-17_at_11.11.21_AM.png

For more information on why this feature is important, see the article Save the results of a query as a new table.

Copy URL or embed code

Two other ways you can use the results of a query are to make them available for download from a URL, or to embed them on a Markdown page. Both options are found in the Download menu under Copy URL or embed code. The URL is a download link for a csv file of the results and the embed code option renders the results table in a Markdown page.

Work with third-party apps

There are many applications you can use to analyze and display the results of your queries, and we make it easy to work with your favorites. You can find out about the applications we integrate with on our integrations page. If you don't have any integrations configured yet you can still make visualizations of your results with our built-in Chart Builder app:

Image_2019-10-18_at_9.03.33_AM.png

For instructions on how to use Chart Builder see one of the following articles on it:

If you already have an integration configured that you want to use or want to integrate a new application, the name of the first application in your list of integrations will show up as a menu to the right of the Download button:

Image_2019-10-18_at_9.06.22_AM.png

You can find out what integrations we currently offer and how to use them on our Integrations page.

Save the results of a query as a new table

When you run a query in data.world, one of the things you can do with the results is to save them as a new table in a dataset or project. The main reasons you might want to save your query as a new data file are security and ease of analysis. Saving the query allows you to:

  • Present a subset of the data from the original file

  • Reorder the data

  • Eliminate the need for joins or subqueries in future queries

  • Work with columns as rows and vice versa without having to include either a PIVOT or an UNPIVOT clause in your queries

Present a subset of the data

When data files are streamed into data.world they include all the columns in the original file. Sometimes there are columns which are irrelevant to your analysis or which contain substandard or incomplete data. Some columns might contain sensitive information that can't be shared outside of the original audience of the dataset but also have data that would be useful to a wider group. While you could manually clean up the data by downloading, editing, and re-uploading it, you would lose the ability for it to automatically update from the original source if that is how the initial data was set up in data.world. However you can both clean the data and maintain automatic updates by running a query against the original data file and saving the results to another dataset or project.

For example, the data file USDA Honey Census data in the dataset USDA bee colony inventories and loss is automatically brought in and synced daily through a URL. Because the original dataset contains information on much more than bees and honey, there are many columns in the table that contain either identical data for all the rows or no data at all.

A quick look at the data pulled in from the URL looks like this:

Screen_Shot_2019-03-14_at_12.29.05_PM.png

All columns displayed above except the second one contain duplicate data. Scrolling through the rest of the columns reveals eight more which have either no values or all the same value. To make this data more useful we can run the following query:

Screen_Shot_2019-03-14_at_1.23.37_PM.png

and save it to another dataset (we could also have saved it to the current dataset). To save the query select Save to dataset or project from the Download link:

Screen_Shot_2019-05-22_at_4.31.19_PM.png

There you will be prompted to either select a dataset or project from your recently viewed items , type the name of a different dataset or project and the system will autofill items for you, or save the data as a new dataset:

Image_2019-10-04_at_11.59.32_AM.png

When saving the query results as a table in a brand new dataset, you create what is called a derviative dataset. The new dataset updates from the original dataset which updates from the source, but it only pulls in the data you want to see:

Screen_Shot_2019-03-14_at_1.44.34_PM.png

To keep the file up-to-date you can either manually sync it (using the Sync now button on the right of the screen shown above), or by going to the dataset/project overview page and setting the Autosync on:

Screen_Shot_2019-03-14_at_2.01.27_PM.png

Note that you have the option upon saving to either allow the query underlying the table to update based on changes to the original table or not. If you allow the query to update then changes to the original table will be reflected in your results (e.g., column name changes, columns added, columns deleted, etc.).

Reorder the data

When you use the SQL SELECT * clause to return data from a dataset, the columns in the resulting table are presented in the order in which they are found in the original data source. If you would always like to see them in a different order, you can accomplish this by running a query to order them as you would like to see them and then saving the query as a table in the current dataset. This query would be similar to the one shown above for presenting a subset of the data and the steps to save it would be the same.

Eliminate the need for joins or subqueries

Sometimes all the data you need to analyze is not stored in one table. While you could still access it using a join or a subquery, it's easier and more convenient to be able to query one table. Another reason for saving the data into one table is that queries against large amounts of data run faster if there is no subquery. Finally, it makes it easier for the members of your team who are not well-versed in SQL to do their own analyses of the data.

For this example we'll use the dataset SQL CRM Example Data. With the way the dataset is structured we can see a lot of information about what’s in the pipeline for sales agents (in the table sales_pipeline), but we can’t easily see the connection between their managers (information stored in the table sales-teams) and the pipeline. To see which managers had the most closed deals and for which accounts, we could run a query like this:

Screen_Shot_2019-03-25_at_5.53.19_PM.png

And that's nice, but if someone else wanted different information the query would have to be modified and the person running the query would have to be comfortable enough with SQL to know how to do it. Far easier would be to to write another query with the WITH clause and to create a new table from it. Here is an example of such a query:

Screen_Shot_2019-03-25_at_5.35.38_PM.png

The results would look like this:

Screen_Shot_2019-03-25_at_5.50.37_PM.png

By saving the query as a new table it could be queried directly without having to use the join. The first query in this example would now look like this:

Screen_Shot_2019-05-15_at_11.08.38_AM.png
Work with columns as rows and vice versa

SQL has two clauses that allow you to rearrange the data in a table: PIVOT and UNPIVOT. PIVOT lets you see rows as columns, and UNPIVOT shows columns as rows. These are both complicated clauses and do not allow certain other clauses to follow them (e.g., WHERE, GROUP BY, and HAVING). if you find yourself using PIVOT or UNPIVOT frequently, or if you'd like to be able to use one of the incompatible clauses, it would be far better to write the query using PIVOT or UNPIVOT once and then save the resulting table for further analysis. See our SQL documentation on PIVOT and UNPIVOT for more information.

Using query templates

A query template creates a graphical interface that simplifies running a query with different parameters. This allows an end-user to easily change the values that a query uses without needing to access or understand the underlying SQL or SPARQL code.

As an example, let's take a look at the Data Professional Salary Survey and examine the most popular databases from that survey in 2018. To do that, I could use the following SQL query:

Screen_Shot_2019-07-18_at_3.38.20_PM.png

For those with a bit of SQL knowledge, modifying this query to get the results for the year 2017 is trivial. One would just need to change the year value on line 3 from 2018 to 2017.

But when a query involves JOINs, UNIONs, nested queries, and aliases, the number of lines in the query could go from 5 lines to 50 or more.

Finding the desired value buried in the query becomes far more challenging. What if we'd like someone with no knowledge of SQL to change the year and run the query without our assistance?

Wouldn't it be nice of they saw something like this instead?

query-template-02.png

We can do that by creating a query template.

Creating query templates

Creating query templates is possible for both SQL and SPARQL queries. In both cases, you need to create a variable to turn a standard query into a parameterized query (though creating a such a variable requires a different process in each). Then it is just one more click to create that graphical query template form.

SQL
  1. From the project workspace, create a new SQL query.

  2. Click the New template link on the top right of the query editor

  3. This brings up a new window that shows example code for adding a variable

  4. Click on the Add a sample statement button on the bottom left to insert this into your query

  5. Change the example values as needed

Referring to the previous example above, you would modify the sample statement to read:

query-template-03.png

In this case, year is the variable name, INTEGER is the data type, and 2018 is the optional default value.

If no default value is specified and the query is run without entering a value manually, the query will return an error.

Allowed data types are: STRING, INTEGER, DECIMAL, BOOLEAN, DATE, TIME, and DATETIME

Additional DECLARE statements can be added to create more parameters in the query.

Using the variable in the query

Now that the variable exists, choose where the variable will be used in the query - this will often be in the WHERE clause of the query - though it could also appear in an ORDER BY, LIMIT, or other clause . In our example, we'll use it in the WHERE clause on line 5:

query-template-04.png

Note: This example is a bit complicated because the survey_year has the data type YEAR in our database, but we can't declare a variable with that type. Instead, we declare our variable as a INTEGER and then used the CAST function to convert the value of survey_year from YEAR to INTEGER to make the comparison on Line 5.

Adding Comments

Comments added before the DECLARE statement will display in the final query template form. The creator of the query can provide instructions and present valid input values.

Denote comments by adding the # or -- characters before the desired text:

query-template-05.png
Creating the query template

After declaring and referencing the variable(s), click the "Preview query template" link that now shows above the query editor:

query-template-07.png

This will create the graphical form where an end-user can enter the desired value for the variable. We'll now see that friendly form originally shown in the Overview section:

query-template-06.png

Creating this query template will lock the query from further edits. To unlock it to make a change, just click the Edit query link above. If another user owns the query, then you will need to save a copy of it after making any edits.

That completes SQL query template - but keep in mind that you can create additional variables to make a more powerful, flexible query.

SPARQL

The SPARQL syntax can be a bit more difficult to understand for those who are unfamiliar with it. For a great primer, check out our SPARQL tutorial!

To get started on our SPARQL query template, create a new SPARQL query within your project. We can mimic the SQL query above with the following SPARQL query:

query-template-08.png
Declaring a variable

Due to the syntax of SPARQL, a query will contain at least one variable by default.

To turn that into a parameter that can be used in a query template, change the character in front of the variable name from ? to $. Make sure to perform this change for every occurrence of the variable within the query.

A new variable can also be declared by starting its name with with $ character.

In our query, we'll do the latter by creating a new variable $year_of_interest and replacing the value 2018 in the FILTER function on line 9.

query-template-09.png
Setting default values

In the form we just created, you may have noticed we did not see a default value populate.

Setting an optional default value for a variable is more complicated in SPARQL than SQL. To do that, we need to add a conditional BIND statement on the first line of the WHERE clause. The general format of that statement is:

BIND(IF(BOUND($variable), $variable, <DEFAULT_VALUE>) AS $variable)

In our case, let's go back into the editor and add a new line on line 7:

BIND(IF(BOUND($year_of_interest), $year_of_interest, 2018) AS $year_of_interest)
query-template-10.png

Although this will set the default value, that value will not be shown in the textbox in the query template form as it does in SQL. Instead it will be hidden from the end-user. Even though that value is not visible, the query will now use 2018 as the default year until another value is entered.

Adding comments

Comments can be added to a SPARQL query template as well. Only the # character (and not the -- characters that are available in SQL) can be used to specify a comment. The query template will show all comments before the first SPARQL statement (typically a PREFIX statement).

Since the query template won't show any default values, making a note of such values in the comments is a great idea.

query-template-11.png
Creating the Query Template

Clicking the Preview query template link above once more will generate our final template:

query-template-12.png

Just like the SQL version, this query will be locked from edits until the "Edit Query" link above is clicked.

API

Parameterized queries can also be made through the data.world public API. Note that non-parameterized queries can be executed with a GET request to the same endpoint, but in order to specify the parameters, a POST request must be made with the parameters specified in the body of the request.

Example curl command:

curl -X POST \

https://api.data.world/v0/queries/${DW_QUERY_ID}/results \

-H 'Accept: application/json' \

-H 'Authorization: Bearer ${DW_API_TOKEN}' \

-H 'Content-Type: application/json' \

-d '{

"parameters" : {

"parameterName1" : "parameter value 1",

"parameterName2" : "parameter value 2" }

}'

See the API docs for more information.

Sharing

Queries that are executed with a new parameter value may be shared with others by simply copying the URL and sending it to another user. If parameters have been filled in, then these parameters will be encoded into the URL for sharing.

SQL queries - native vs emulated

Whenever possible, data.world will translate your query into the proper syntax to run on the target system. There are cases, however, where a query cannot be fully translated. When that happens, part of the query will run natively on the target system, and part will run on data.world as an emulation. As a result, the total query processing time will frequently be greater than if the entire query ran on the target system.

The most up-to-date information about the support for aggregations and functions is in our dataset Function Support Matrix.

function_support_matrix.png

Select Explore the dataset in the upper right of the screen and you'll be taken to a project workspace for the dataset. There, if you select Aggregations pivoted you see columns with the aggregations, the urls to our support docs for them, and the support status for each system:

agg_support_pivot_query.png

If you are comfortable with SQL you can write your own query to search for support for aggregations in a specific system:

query_for_Athena_agg_support.png

Tables for the aggregation and function support for each system Is in the Reference section.