Find us on GitHub

Combining Data

Objectives

  • Explain the operation of a query that joins two tables.
  • Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.
  • Write queries that join tables on equal keys.
  • Explain what primary and foreign keys are, and why they are useful.
  • Explain what atomic values are, and why database fields should only contain atomic values.

So far we haven't been able to display the authors' names in our query results, because those names are not in the Works but in the Authors table. What's worse, because there can be many authors to a title and many title associated with each author, there is a third table, Works_Authors that's taking care of this many-to-many relationship. How can these tables be joined?

The SQL command to do this is JOIN. To see how it works, let's start with the somewhat easier case of the Items table and try joining it to the Works table:

%load_ext sqlitemagic
%%sqlite swclib.db
SELECT * FROM Items JOIN Works;
1 1 081722942611 2009 Loaned 1 SQL in a nutshell 9780596518844 2009 Sebastopol O'Reilly 3rd ed. 578
1 1 081722942611 2009 Loaned 2 SQL for dummies 9781118607961 2013 Hoboken Wiley 8th ed.
1 1 081722942611 2009 Loaned 3 PHP & MySQL 9781449325572 2013 Sebastopol O'Reilly 2nd ed. 532
1 1 081722942611 2009 Loaned 4 Using SQLite 9780596521189 2010 Sebastopol O'Reilly 1st ed. 503
1 1 081722942611 2009 Loaned 5 Geek sublime 9780571310302 2014 London Faber & Faber 258

The result above was truncated for display because... the query returned a list of 800 records! In fact, JOIN creates the cross product of two tables, i.e., it joins each record of one with each record of the other to give all possible combinations. Since there are 20 records in Works and 40 in Items, the join's output has 20*40=800 records. And since Works has 8 fields and Items has 5, the output has 8+5=13 fields.

What the join hasn't done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations where Work_ID matches in both tables:

%%sqlite swclib.db
SELECT * FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID;
1 1 081722942611 2009 Loaned 1 SQL in a nutshell 9780596518844 2009 Sebastopol O'Reilly 3rd ed. 578
2 1 492437609065 2011 On shelf 1 SQL in a nutshell 9780596518844 2009 Sebastopol O'Reilly 3rd ed. 578
3 2 172480710952 2013 On shelf 2 SQL for dummies 9781118607961 2013 Hoboken Wiley 8th ed.
4 3 708014968732 2013 Missing 3 PHP & MySQL 9781449325572 2013 Sebastopol O'Reilly 2nd ed. 532
5 3 819783404942 2014 Loaned 3 PHP & MySQL 9781449325572 2013 Sebastopol O'Reilly 2nd ed. 532
6 4 257370237291 2010 Missing 4 Using SQLite 9780596521189 2010 Sebastopol O'Reilly 1st ed. 503
7 5 002905925356 2014 Loaned 5 Geek sublime 9780571310302 2014 London Faber & Faber 258
8 5 964583604781 2014 Loaned 5 Geek sublime 9780571310302 2014 London Faber & Faber 258
9 6 701630524534 2014 Loaned 6 Capital in the 21st century 9780674430006 2014 Cambridge Belknap Press 685
10 6 722040919616 2014 On shelf 6 Capital in the 21st century 9780674430006 2014 Cambridge Belknap Press 685
11 6 026655281484 2014 On shelf 6 Capital in the 21st century 9780674430006 2014 Cambridge Belknap Press 685
12 7 422970103061 2010 On shelf 7 SQL 9780071548649 2009 New York McGraw-Hill 3rd ed. 534
13 8 655280484976 2011 Loaned 8 Discovering SQL 9781118002674 2011 Indianapolis Wiley 400
14 9 610721228318 2005 Missing 9 SQL 0321334175 2005 Berkeley Peachpit 2nd ed. 460
15 10 148164881245 2008 Missing 10 A guide to SQL 9780324597684 2008 Mason South-Western 8th ed. 309
16 10 445317012796 2010 On shelf 10 A guide to SQL 9780324597684 2008 Mason South-Western 8th ed. 309
17 11 291006691199 2008 On shelf 11 SQL bible 9780470229064 2008 Indianapolis Wiley 2nd ed. 857
18 12 665741505651 2009 On shelf 12 Learning SQL 9780596520830 2009 Sebastopol O'Reilly 2nd ed. 320
19 12 623061160016 2009 Loaned 12 Learning SQL 9780596520830 2009 Sebastopol O'Reilly 2nd ed. 320
20 13 827361553957 2010 On shelf 13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
21 13 228598347653 2010 Missing 13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
22 13 585952782539 2010 On shelf 13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
23 13 701532568017 2011 On shelf 13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
24 14 989297622703 2008 On shelf 14 Beginning SQL queries 9781590599433 2008 Berkeley Apress 218
25 15 640793136396 2005 On shelf 15 Beginning SQL 0764577328 2005 Indianapolis Wiley 501
26 15 521089986565 2005 On shelf 15 Beginning SQL 0764577328 2005 Indianapolis Wiley 501
27 16 139685507140 2013 Loaned 16 Microsoft SQL server 2012 9780132977661 2013 Indianapolis Sams
28 16 853183712696 2013 Loaned 16 Microsoft SQL server 2012 9780132977661 2013 Indianapolis Sams
29 17 257153081154 2011 On shelf 17 SQL all-in-one 9780470929964 2011 Hoboken Wiley 2nd ed. 708
30 18 208546921091 2013 Loaned 18 Access 2013 all-in-one 9781118510551 2013 Hoboken Wiley 760
31 19 921664426379 2004 On shelf 19 SQL in a nutshell 0596004818 2004 Cambridge O'Reilly 2nd ed. 691
32 19 298308111210 2004 Loaned 19 SQL in a nutshell 0596004818 2004 Cambridge O'Reilly 2nd ed. 691
33 19 210139559101 2004 Missing 19 SQL in a nutshell 0596004818 2004 Cambridge O'Reilly 2nd ed. 691
34 20 344919897556 2005 On shelf 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
35 20 035230397910 2005 On shelf 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
36 20 527524003500 2005 On shelf 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
37 20 467701665668 2005 Missing 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
38 20 082665141572 2005 On shelf 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
39 20 273837764866 2006 Loaned 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321
40 20 582937020090 2006 On shelf 20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321

ON does the same job as WHERE: it only keeps records that pass some test. (The difference between the two is that ON filters records as they're being created, while WHERE waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined items with unrelated works, leaving us with just the ones we want.

Notice that we used table.field to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we're talking about.

We can now use the same dotted notation to select only the columns we are interested in displaying:

%%sqlite swclib.db
SELECT Items.Barcode, Works.Title, Works.ISBN FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID LIMIT 10;
081722942611 SQL in a nutshell 9780596518844
492437609065 SQL in a nutshell 9780596518844
172480710952 SQL for dummies 9781118607961
708014968732 PHP & MySQL 9781449325572
819783404942 PHP & MySQL 9781449325572
257370237291 Using SQLite 9780596521189
002905925356 Geek sublime 9780571310302
964583604781 Geek sublime 9780571310302
701630524534 Capital in the 21st century 9780674430006
722040919616 Capital in the 21st century 9780674430006

We can now try to tackle the case of the Authors table. To list the contributors associated with the first item on the Works table (Work_ID=1, SQL in a nutshell 3rd ed.), we write:

%%sqlite swclib.db
SELECT Works_Authors.Role, Authors.Personal, Authors.Family 
FROM   Works_Authors 
JOIN   Authors 
ON     Authors.Author_ID=Works_Authors.Author_ID 
WHERE  Works_Authors.Work_ID=1;
Author Kevin E. Kline
Contributor Daniel Kline
Contributor Brand Hunt

Or inversely, if we want to list all the works that Allen G. Taylor (Author_ID=4) has authored or contributed to, we can write:

%%sqlite swclib.db
SELECT Works.Title, Works.Date, Works.Edition, Works_Authors.Role 
FROM   Works 
JOIN   Works_Authors 
ON     Works.Work_ID=Works_Authors.Work_ID 
WHERE  Works_Authors.Author_ID=4;
SQL for dummies 2013 8th ed. Author
SQL for dummies 2010 7th ed. Author
SQL all-in-one 2011 2nd ed. Author
Access 2013 all-in-one 2013 Contributor

If joining two tables is good, then joining more tables must be better. In fact, we can join any number of tables simply by adding more JOIN clauses to our query, and more ON tests to filter out combinations of records that don't make sense.

We can tell which records from Works, Authors, Items and Works_Authors correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. Another way of saying this is that a foreign key is the primary key of one table that appears in some other table. In our database, Works.Work_ID is the primary key in the Works table, while Items.Work_ID is a foreign key relating the Items table's entries to entries in Works. The Authors_Works table contains only foreign keys relating to entries in the Works and Authors tables.

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like "student numbers" and "library card numbers", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite actually numbers records automatically as they're added to tables, and this number could have been used instead of the ID numbers that were specified in the example tables:

%%sqlite swclib.db
SELECT rowid, * from Items LIMIT 5;
1 1 1 081722942611 2009 Loaned
2 2 1 492437609065 2011 On shelf
3 3 2 172480710952 2013 On shelf
4 4 3 708014968732 2013 Missing
5 5 3 819783404942 2014 Loaned

Data Hygiene

Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".

The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the Work_ID field in the Works table), or even a combination of values.

The third rule is that there should be no redundant information. For example, we could get rid of the Works table and rewrite the Items table something like this:

Item_ID Title Publisher Date Barcode Acquired Status
1 SQL in a nutshell Sebastopol: O'Reilly 2009 081722942611 2009 Loaned
2 SQL in a nutshell Sebastopol: O'Reilly 2009 492437609065 2011 On shelf
3 SQL for dummies Hoboken: Wiley 2013 172480710952 2013 On shelf
4 PHP & MySQL Sebastopol: O'Reilly 2013 708014968732 2013 Missing
5 PHP & MySQL Sebastopol: O'Reilly 2013 819783404942 2014 Loaned

In fact, we could use a single table that recorded all the information about each item in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the bibliographic information associated with a series of items is wrong, we have to change multiple records in the database. Similarly, storing authors' and contributors' names in separate columns would mean adding an extra column every time we encounter a title that has more contributors than all the previous items. If we have only one item that has twenty contributors, a spreadsheet design would require twenty separate columns to store this information, and those columns would sit empty for the majority of the items. This is one of the reasons why authority files are in separate tables.

The fourth rule is that the units for every value should be stored explicitly. In this particular database, it's not very important (although we could specify that the Pages field contains the count of pages in any given book). This can be a problem in databases that contain scientific data, for example.

Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.

Challenges

  1. Write a query that lists all works written by people whose Family name start with the letter "K".

  2. Write a query that lists all authors that have written at least one book that is currently on loan from the library.

  3. To which item does the barcode 722040919616 refer to, what is the title of this book and who are its authors?

Key Points

  • Every fact should be represented in a database exactly once.
  • A join produces all combinations of records from one table with records from another.
  • A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
  • A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
  • We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
  • Keys should be atomic values to make joins simpler and more efficient.