Federated Queries

An introduction to federated queries.

Another piece of data.world-specific functionality is the ability to join tables which are in different datasets and which might not even have the same owner. Normally querying in SQL is database or dataset-specific. However as more and more datasets come on-line, it’s increasingly important to be able to pull information from combinations of them to present a more complete picture of the data.

Queries that join tables from different datasets are called federated queries. Their syntax is very similar to that of other join queries with additional location information specified.

In the CRM dataset we have been using, we have a table called accounts which has a small bit of information about our client accounts. We have recently learned that one of the regional offices has a dataset with the id “crm-account-stats” which has the table “external_account” which contains other information about the accounts, but doesn’t have some of the data we have. If we combined these tables, we could run a federated query like the following against them:

SELECT accounts.account,
       accounts2.sector,
       accounts2.year_established,
       accounts2.subsidiary_of,
       accounts.revenue
  FROM siyeh.`crm-account-stats`.external_account AS accounts2, accounts
 WHERE accounts2.account = accounts.account
account sector year_established subsidiary_of revenue
Acme Corporation technolgy 1996 1,100.04
Betasoloin medical 1999 251.41
Codehow software 1998 Acme Corporation 2,714.90
Condax medical 2017 4.54
Conecom technolgy 2005 1,520.66
Dalttechnology software 2013 Bubba Gump 98.79
dambase marketing 1995 Inity 2,173.98
Domzoom entertainment 1998 217.87

To keep from having to write out the full name of the dataset and table, you can use a table alias. If you define a table alias, you can use it elsewhere in the query to reference that table.

The most difficult part of using federated queries is to know the exact dataset id, table and column names for the table that’s not in your dataset. Currently you can find the dataset id by opening a page for that dataset from data.world and looking in the url. After data.world comes the owner of the dataset’s name, and after that is the dataset’s name. On the main page for the dataset you can also see a list of all the tables in it as well as all of the column names.

If the dataset referenced in the federated query has a “-“ in its name the dataset name will need to surrounded by `` in the query.

You can achieve implicit federation of tables in different datasets by putting them in your own project. For example if we wanted to see the sales price of the products in all the west office orders, we would need to write a federated query because the west office order table is in one dataset an the product table is in another. However, because we have both datasets in the same project the query can be written like this:

SELECT west.create_date,
       west.sales_agent,
       west.product,
       products.sales_price
  FROM west_office_orders AS west, products
 WHERE west.product = products.product

The only reason we used a table alias was to shorten a long table name–we didn’t have to specify the owner of the dataset. The first several rows of the results table would look like:

create_date sales_agent product sales_price
2017-12-27 Markita Hansen GTX Basic 550
2017-09-16 Vicki Laflamme GTX Basic 550
2017-12-04 Zane Levy GTX Basic 550
2017-09-14 James Ascencio GTX Basic 550
2017-12-04 Vicki Laflamme GTX Basic 550
2017-03-27 Maureen Marcano GTX Basic 550
2017-12-17 Kary Hendrixson GTX Basic 550
2017-11-03 Vicki Laflamme GTX Basic 550
2017-11-26 Zane Levy GTX Basic 550
2017-11-04 James Ascencio GTX Basic 550