Federated Queries

An introduction to federated queries.

Another piece of data.world-specific functionality is the ability to join tables which are in different datasets and which might not even have the same owner. Normally querying in SQL is database or dataset-specific. However as more and more datasets come on-line, it’s increasingly important to be able to pull information from combinations of them to present a more complete picture of the data.

Queries that join tables from different datasets are called federated queries. Their syntax is very similar to that of other join queries with additional location information specified.

In the CRM dataset we have been using, we have a table called accounts which has a small bit of information about our client accounts. We have recently learned that one of the regional offices has a dataset with the id “crm-account-stats” which has the table “external_account” which contains other information about the accounts, but doesn’t have some of the data we have. If we combined these tables, we could run a federated query like the following against them:

SELECT accounts.account,
       accounts2.sector,
       accounts2.year_established,
       accounts2.subsidiary_of,
       accounts.revenue
  FROM siyeh.`crm-account-stats`.external_account AS accounts2, accounts
 WHERE accounts2.account = accounts.account

siyeh/crm-project Run query Copy code

account	sector	year_established	subsidiary_of	revenue
Acme Corporation	technolgy	1996		1,100.04
Betasoloin	medical	1999		251.41
Codehow	software	1998	Acme Corporation	2,714.90
Condax	medical	2017		4.54
Conecom	technolgy	2005		1,520.66
Dalttechnology	software	2013	Bubba Gump	98.79
dambase	marketing	1995	Inity	2,173.98
Domzoom	entertainment	1998		217.87

To keep from having to write out the full name of the dataset and table, you can use a table alias. If you define a table alias, you can use it elsewhere in the query to reference that table.

The most difficult part of using federated queries is to know the exact dataset id, table and column names for the table that’s not in your dataset. Currently you can find the dataset id by opening a page for that dataset from data.world and looking in the url. After data.world comes the owner of the dataset’s name, and after that is the dataset’s name. On the main page for the dataset you can also see a list of all the tables in it as well as all of the column names.

If the dataset referenced in the federated query has a “-“ in its name the dataset name will need to surrounded by `` in the query.

You can achieve implicit federation of tables in different datasets by putting them in your own project. For example if we wanted to see the sales price of the products in all the west office orders, we would need to write a federated query because the west office order table is in one dataset an the product table is in another. However, because we have both datasets in the same project the query can be written like this:

SELECT west.create_date,
       west.sales_agent,
       west.product,
       products.sales_price
  FROM west_office_orders AS west, products
 WHERE west.product = products.product

siyeh/crm-project Run query Copy code

The only reason we used a table alias was to shorten a long table name–we didn’t have to specify the owner of the dataset. The first several rows of the results table would look like:

create_date	sales_agent	product	sales_price
2017-12-27	Markita Hansen	GTX Basic	550
2017-09-16	Vicki Laflamme	GTX Basic	550
2017-12-04	Zane Levy	GTX Basic	550
2017-09-14	James Ascencio	GTX Basic	550
2017-12-04	Vicki Laflamme	GTX Basic	550
2017-03-27	Maureen Marcano	GTX Basic	550
2017-12-17	Kary Hendrixson	GTX Basic	550
2017-11-03	Vicki Laflamme	GTX Basic	550
2017-11-26	Zane Levy	GTX Basic	550
2017-11-04	James Ascencio	GTX Basic	550

Next up: Parameterized Queries

An introduction to parameterized queries and query templates.