Enterprise docs

Query data

A query, in the context of data, is a request for information written in a query language. A query language uses a specified set of terms in a specific order with proscribed punctuation. The query languages used on data.world are SQL and SPARQL. While data.world is built on semantic web technologies that are best queried using SPARQL, we recognize that most people are more familiar with SQL. For that reason, we've developed our own version of SQL to easily query data.world datasets. If you are new to SQL, we have documentation that will walk you through everything from 'what is a database' to 'how do I join all these different tables together so I can query them all at the same time'. There is also a tutorial based on the exercises in the documentation. For those who would rather work with SPARQL, data.world also has a tutorial for SPARQL that will get you up to speed regardless of your skill level.

Queries against your data are the building blocks for an analysis of the data. We covered the basics of writing your own queries in the article on creating a query. In this article we'll look at all the things you can do with an existing query.

As you work in projects with other people, you might find a query that you want to use and edit, but it's not your query so you don't have permissions to save changes to it. The easiest thing to do is to make a copy of the query by selecting the three dots to the right of the query name in the project workspace and selecting Save a copy:

You'll be prompted to name the query and write a description. For tracking lineage it's a good idea to include the name of the original query in the description. Other save options include where--the current project or the original dataset--and who can see the new query (hidden by the location selection in the image below:

Image_2019-10-08_at_2.07.00_PM.png

Normally the choice would be to save the new query in the current project because then it would show up in the list of all the queries and would be limited to contributors on the project. However if the query would be useful to others who create projects from that dataset, it might make more sense to save it directly to the dataset. Saving it to the dataset enables anyone with access to the dataset to use it (unless you also set it to share with no one). It's also the easiest way to save the query to another project that is linked to the same dataset. One note: queries saved to datasets show up in the Connected datasets section of the Project directory in the left sidebar--not under Queries: Alternatively you could copy and paste the query into a new query window in the other project and save it directly there.

Another way you might want to modify a query is to make a template from it so you can change the input strings to various parts of it. For example we could turn the query we just saved as Animal breed query into a query template by selecting the New template link at the top right of the query panel:

Image_2019-10-08_at_3.58.53_PM.png

A dialog window will pop up giving you basic instructions to how query templates work and linking you to the SQL documentation for query templates:Using query templates

Image_2019-10-18_at_10.38.12_AM.png

Select Add a sample statement a sample of the DECLARE statement used in query templates, a sample comment, and a link to the documentation will be inserted in the top of your query window:

Image_2019-10-08_at_4.05.53_PM.png

With just a couple of modifications to the SQL you can have a query where other users can input values for animal types without ever having to know SQL themselves. Here we moved the comment to the top line so it would show up as a prompt in the input area, and we indicated where in the original query the input value would go (in the WHERE clause):

Image_2019-10-08_at_4.15.41_PM.png

You can preview the query to see what other users will see and to test the query yourself by selecting the Preview query template link in the upper right corner of the query panel. Then fill in a value and hit Run query:

Image_2019-10-08_at_4.21.45_PM.png

For more detailed information on query templates see the article on using query templates.Using query templates

Another way to use a query is to download the results from it in either a csv or an xlsx file. After running your query, click the Download link below the query window and select the option you want:

Image_2019-10-17_at_11.01.42_AM.png

Another option for working with your query results is to save them as a tabular file in your current project, a different project, in an existing dataset, or in a new dataset:

Image_2019-10-17_at_11.11.21_AM.png

For more information on why this feature is important, see the article Save the results of a query as a new table.Save the results of a query as a new table

Two other ways you can use the results of a query are to make them available for download from a URL, or to embed them on a Markdown page. Both options are found in the Download menu under Copy URL or embed code. The URL is a download link for a csv file of the results and the embed code option renders the results table in a Markdown page.

There are many applications you can use to analyze and display the results of your queries, and we make it easy to work with your favorites. You can find out about the applications we integrate with on our integrations page. If you don't have any integrations configured yet you can still make visualizations of your results with our built-in Chart Builder app:

Image_2019-10-18_at_9.03.33_AM.png

For instructions on how to use Chart Builder see one of the following articles on it:

If you already have an integration configured that you want to use or want to integrate a new application, the name of the first application in your list of integrations will show up as a menu to the right of the Download button:

Image_2019-10-18_at_9.06.22_AM.png

You can find out what integrations we currently offer and how to use them on our Integrations page.

With data.world's federation capabilities it is possible to join tables from disparate data sources without having to worry about the source data formats. While this is a very powerful functionality, it also allows construction of poorly formed queries that can time out when they are run. We have a query optimizer in process that will fix many issues without you even having to think about them, but in the meantime here are some things you can do with your joins to make your queries more efficient.

  • Be careful when creating joins to ensure that all applicable join conditions are specified.

  • Use inner joins rather than outer joins, where possible.

  • Be wary of joins on calculated conditions, including especially joins which require casting.

  • When possible, group together joined tables that are virtualized from the same sources.

When you run a query in data.world, one of the things you can do with the results is to save them as a new table in a dataset or project. The main reasons you might want to save your query as a new data file are security and ease of analysis. Saving the query allows you to:

  • Present a subset of the data from the original file

  • Reorder the data

  • Eliminate the need for joins or subqueries in future queries

  • Work with columns as rows and vice versa without having to include either a PIVOT or an UNPIVOT clause in your queries

When data files are streamed into data.world they include all the columns in the original file. Sometimes there are columns which are irrelevant to your analysis or which contain substandard or incomplete data. Some columns might contain sensitive information that can't be shared outside of the original audience of the dataset but also have data that would be useful to a wider group. While you could manually clean up the data by downloading, editing, and re-uploading it, you would lose the ability for it to automatically update from the original source if that is how the initial data was set up in data.world. However you can both clean the data and maintain automatic updates by running a query against the original data file and saving the results to another dataset or project.

For example, the data file USDA Honey Census data in the dataset USDA bee colony inventories and loss is automatically brought in and synced daily through a URL. Because the original dataset contains information on much more than bees and honey, there are many columns in the table that contain either identical data for all the rows or no data at all.

A quick look at the data pulled in from the URL looks like this:

Screen_Shot_2019-03-14_at_12.29.05_PM.png

All columns displayed above except the second one contain duplicate data. Scrolling through the rest of the columns reveals eight more which have either no values or all the same value. To make this data more useful we can run the following query:

Screen_Shot_2019-03-14_at_1.23.37_PM.png

and save it to another dataset (we could also have saved it to the current dataset). To save the query select Save to dataset or project from the Download link:

Screen_Shot_2019-05-22_at_4.31.19_PM.png

There you will be prompted to either select a dataset or project from your recently viewed items , type the name of a different dataset or project and the system will autofill items for you, or save the data as a new dataset:

Image_2019-10-04_at_11.59.32_AM.png

When saving the query results as a table in a brand new dataset, you create what is called a derviative dataset. The new dataset updates from the original dataset which updates from the source, but it only pulls in the data you want to see:

Screen_Shot_2019-03-14_at_1.44.34_PM.png

To keep the file up-to-date you can either manually sync it (using the Sync now button on the right of the screen shown above), or by going to the dataset/project overview page and setting the Autosync on:

Screen_Shot_2019-03-14_at_2.01.27_PM.png

Note that you have the option upon saving to either allow the query underlying the table to update based on changes to the original table or not. If you allow the query to update then changes to the original table will be reflected in your results (e.g., column name changes, columns added, columns deleted, etc.).

When you use the SQL SELECT * clause to return data from a dataset, the columns in the resulting table are presented in the order in which they are found in the original data source. If you would always like to see them in a different order, you can accomplish this by running a query to order them as you would like to see them and then saving the query as a table in the current dataset. This query would be similar to the one shown above for presenting a subset of the data and the steps to save it would be the same.

Sometimes all the data you need to analyze is not stored in one table. While you could still access it using a join or a subquery, it's easier and more convenient to be able to query one table. Another reason for saving the data into one table is that queries against large amounts of data run faster if there is no subquery. Finally, it makes it easier for the members of your team who are not well-versed in SQL to do their own analyses of the data.

For this example we'll use the dataset SQL CRM Example Data. With the way the dataset is structured we can see a lot of information about what’s in the pipeline for sales agents (in the table sales_pipeline), but we can’t easily see the connection between their managers (information stored in the table sales-teams) and the pipeline. To see which managers had the most closed deals and for which accounts, we could run a query like this:

Screen_Shot_2019-03-25_at_5.53.19_PM.png

And that's nice, but if someone else wanted different information the query would have to be modified and the person running the query would have to be comfortable enough with SQL to know how to do it. Far easier would be to to write another query with the WITH clause and to create a new table from it. Here is an example of such a query:

Screen_Shot_2019-03-25_at_5.35.38_PM.png

The results would look like this:

Screen_Shot_2019-03-25_at_5.50.37_PM.png

By saving the query as a new table it could be queried directly without having to use the join. The first query in this example would now look like this:

Screen_Shot_2019-05-15_at_11.08.38_AM.png

SQL has two clauses that allow you to rearrange the data in a table: PIVOT and UNPIVOT. PIVOT lets you see rows as columns, and UNPIVOT shows columns as rows. These are both complicated clauses and do not allow certain other clauses to follow them (e.g., WHERE, GROUP BY, and HAVING). if you find yourself using PIVOT or UNPIVOT frequently, or if you'd like to be able to use one of the incompatible clauses, it would be far better to write the query using PIVOT or UNPIVOT once and then save the resulting table for further analysis. See our SQL documentation on PIVOT and UNPIVOT for more information.

A query template creates a graphical interface that simplifies running a query with different parameters. This allows an end-user to easily change the values that a query uses without needing to access or understand the underlying SQL or SPARQL code.

As an example, let's take a look at the Data Professional Salary Survey and examine the most popular databases from that survey in 2018. To do that, I could use the following SQL query:

Screen_Shot_2019-07-18_at_3.38.20_PM.png

For those with a bit of SQL knowledge, modifying this query to get the results for the year 2017 is trivial. One would just need to change the year value on line 3 from 2018 to 2017.

But when a query involves JOINs, UNIONs, nested queries, and aliases, the number of lines in the query could go from 5 lines to 50 or more.

Finding the desired value buried in the query becomes far more challenging. What if we'd like someone with no knowledge of SQL to change the year and run the query without our assistance?

Wouldn't it be nice of they saw something like this instead?

query-template-02.png

We can do that by creating a query template.

Creating query templates is possible in both SQL and SPARQL queries. In both cases, we'll need to create a variable to turn our standard query into a parameterized query (though creating a such a variable requires a different process in each). Then it's just one more click to create that graphical query template form.

  1. From within a project workspace, create a new SQL query

  2. Click the New template link on the top right of the query editor

  3. This brings up a new window that shows example code for adding a variable

  4. Click on the Add a sample statement button on the bottom left to insert this into your query

  5. Change the example values as needed

Referring to the previous example above, you would modify the sample statement to read:

query-template-03.png

In this case, year is the variable name, INTEGER is the data type, and 2018 is the optional default value.

If no default value is specified and the query is run without entering a value manually, the query will return an error.

Allowed data types are: STRING, INTEGER, DECIMAL, BOOLEAN, DATE, TIME, and DATETIME

Additional DECLARE statements can be added to create more parameters in the query.

Now that the variable exists, choose where the variable will be used in the query - this will often be in the WHERE clause of the query - though it could also appear in an ORDER BY, LIMIT, or other clause . In our example, we'll use it in the WHERE clause on line 5:

query-template-04.png

Note: This example is a bit complicated because the survey_year has the data type YEAR in our database, but we can't declare a variable with that type. Instead, we declare our variable as a INTEGER and then used the CAST function to convert the value of survey_year from YEAR to INTEGER to make the comparison on Line 5.

Comments added before the DECLARE statement will display in the final query template form. The creator of the query can provide instructions and present valid input values.

Denote comments by adding the # or -- characters before the desired text:

query-template-05.png

After declaring and referencing the variable(s), click the "Preview query template" link that now shows above the query editor:

query-template-07.png

This will create the graphical form where an end-user can enter the desired value for the variable. We'll now see that friendly form originally shown in the Overview section:

query-template-06.png

Creating this query template will lock the query from further edits. To unlock it to make a change, just click the Edit query link above. If another user owns the query, then you will need to save a copy of it after making any edits.

That completes SQL query template - but keep in mind that you can create additional variables to make a more powerful, flexible query.

The SPARQL syntax can be a bit more difficult to understand for those who are unfamiliar with it. For a great primer, check out our SPARQL tutorial!

To get started on our SPARQL query template, create a new SPARQL query within your project. We can mimic the SQL query above with the following SPARQL query:

query-template-08.png

Due to the syntax of SPARQL, a query will contain at least one variable by default.

To turn that into a parameter that can be used in a query template, change the character in front of the variable name from ? to $. Make sure to perform this change for every occurrence of the variable within the query.

A new variable can also be declared by starting its name with with $ character.

In our query, we'll do the latter by creating a new variable $year_of_interest and replacing the value 2018 in the FILTER function on line 9.

query-template-09.png

In the form we just created, you may have noticed we did not see a default value populate.

Setting an optional default value for a variable is more complicated in SPARQL than SQL. To do that, we need to add a conditional BIND statement on the first line of the WHERE clause. The general format of that statement is:

BIND(IF(BOUND($variable), $variable, <DEFAULT_VALUE>) AS $variable)

In our case, let's go back into the editor and add a new line on line 7:

BIND(IF(BOUND($year_of_interest), $year_of_interest, 2018) AS $year_of_interest)
query-template-10.png

Although this will set the default value, that value will not be shown in the textbox in the query template form as it does in SQL. Instead it will be hidden from the end-user. Even though that value is not visible, the query will now use 2018 as the default year until another value is entered.

Comments can be added to a SPARQL query template as well. Only the # character (and not the -- characters that are available in SQL) can be used to specify a comment. The query template will show all comments before the first SPARQL statement (typically a PREFIX statement).

Since the query template won't show any default values, making a note of such values in the comments is a great idea.

query-template-11.png

Clicking the Preview query template link above once more will generate our final template:

query-template-12.png

Just like the SQL version, this query will be locked from edits until the "Edit Query" link above is clicked.

Parameterized queries can also be made through the data.world public API. Note that non-parameterized queries can be executed with a GET request to the same endpoint, but in order to specify the parameters, a POST request must be made with the parameters specified in the body of the request.

Example curl command:

curl -X POST \

https://api.data.world/v0/queries/${DW_QUERY_ID}/results \

-H 'Accept: application/json' \

-H 'Authorization: Bearer ${DW_API_TOKEN}' \

-H 'Content-Type: application/json' \

-d '{

"parameters" : {

"parameterName1" : "parameter value 1",

"parameterName2" : "parameter value 2" }

}'

See our API docs for more information.

Queries that are executed with a new parameter value may be shared with others by simply copying the URL and sending it to another user. If parameters have been filled in, then these parameters will be encoded into the URL for sharing.

Whenever possible, data.world will translate your query into the proper syntax to run on the target system. There are cases, however, where a query cannot be fully translated. When that happens, part of the query will run natively on the target system, and part will run on data.world as an emulation. As a result, the total query processing time will frequently be greater than if the entire query ran on the target system.

The most up-to-date information about the support for aggregations and functions is in our dataset Function Support Matrix.

function_support_matrix.png

Select Explore the dataset in the upper right of the screen and you'll be taken to a project workspace for the dataset. There, if you select Aggregations pivoted you see columns with the aggregations, the urls to our support docs for them, and the support status for each system:

agg_support_pivot_query.png

If you are comfortable with SQL you can write your own query to search for support for aggregations in a specific system:

query_for_Athena_agg_support.png

Tables for the aggregation and function support for each system Is in the Reference section.

SPARQL (pronounced "Sparkle") is a powerful query language to retrieve, modify and make the best use out of linked data. It is recognized as one of the key technologies of the semantic web due to its flexibility as well as ease of joining complex data structures and detecting intricate patterns in data. Its also the query language upon which the data.world platform is based.

If you're new to SPARQL and would like to learn more, see our Getting Started with SPARQL tutorial to jump into the basics.