So far we haven't been able to display the authors' names in our query results, because those names are not in the Works
but in the Authors
table. What's worse, because there can be many authors to a title and many title associated with each author, there is a third table, Works_Authors
that's taking care of this many-to-many relationship. How can these tables be joined?
The SQL command to do this is JOIN
. To see how it works, let's start with the somewhat easier case of the Items
table and try joining it to the Works
table:
%load_ext sqlitemagic
%%sqlite swclib.db SELECT * FROM Items JOIN Works;
1 | 1 | 081722942611 | 2009 | Loaned | 1 | SQL in a nutshell | 9780596518844 | 2009 | Sebastopol | O'Reilly | 3rd ed. | 578 |
1 | 1 | 081722942611 | 2009 | Loaned | 2 | SQL for dummies | 9781118607961 | 2013 | Hoboken | Wiley | 8th ed. | |
1 | 1 | 081722942611 | 2009 | Loaned | 3 | PHP & MySQL | 9781449325572 | 2013 | Sebastopol | O'Reilly | 2nd ed. | 532 |
1 | 1 | 081722942611 | 2009 | Loaned | 4 | Using SQLite | 9780596521189 | 2010 | Sebastopol | O'Reilly | 1st ed. | 503 |
1 | 1 | 081722942611 | 2009 | Loaned | 5 | Geek sublime | 9780571310302 | 2014 | London | Faber & Faber | 258 |
The result above was truncated for display because... the query returned a list of 800 records! In fact, JOIN
creates the cross product of two tables, i.e., it joins each record of one with each record of the other to give all possible combinations. Since there are 20 records in Works
and 40 in Items
, the join's output has 20*40=800 records. And since Works
has 8 fields and Items
has 5, the output has 8+5=13 fields.
What the join hasn't done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations where Work_ID
matches in both tables:
%%sqlite swclib.db SELECT * FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID;
1 | 1 | 081722942611 | 2009 | Loaned | 1 | SQL in a nutshell | 9780596518844 | 2009 | Sebastopol | O'Reilly | 3rd ed. | 578 |
2 | 1 | 492437609065 | 2011 | On shelf | 1 | SQL in a nutshell | 9780596518844 | 2009 | Sebastopol | O'Reilly | 3rd ed. | 578 |
3 | 2 | 172480710952 | 2013 | On shelf | 2 | SQL for dummies | 9781118607961 | 2013 | Hoboken | Wiley | 8th ed. | |
4 | 3 | 708014968732 | 2013 | Missing | 3 | PHP & MySQL | 9781449325572 | 2013 | Sebastopol | O'Reilly | 2nd ed. | 532 |
5 | 3 | 819783404942 | 2014 | Loaned | 3 | PHP & MySQL | 9781449325572 | 2013 | Sebastopol | O'Reilly | 2nd ed. | 532 |
6 | 4 | 257370237291 | 2010 | Missing | 4 | Using SQLite | 9780596521189 | 2010 | Sebastopol | O'Reilly | 1st ed. | 503 |
7 | 5 | 002905925356 | 2014 | Loaned | 5 | Geek sublime | 9780571310302 | 2014 | London | Faber & Faber | 258 | |
8 | 5 | 964583604781 | 2014 | Loaned | 5 | Geek sublime | 9780571310302 | 2014 | London | Faber & Faber | 258 | |
9 | 6 | 701630524534 | 2014 | Loaned | 6 | Capital in the 21st century | 9780674430006 | 2014 | Cambridge | Belknap Press | 685 | |
10 | 6 | 722040919616 | 2014 | On shelf | 6 | Capital in the 21st century | 9780674430006 | 2014 | Cambridge | Belknap Press | 685 | |
11 | 6 | 026655281484 | 2014 | On shelf | 6 | Capital in the 21st century | 9780674430006 | 2014 | Cambridge | Belknap Press | 685 | |
12 | 7 | 422970103061 | 2010 | On shelf | 7 | SQL | 9780071548649 | 2009 | New York | McGraw-Hill | 3rd ed. | 534 |
13 | 8 | 655280484976 | 2011 | Loaned | 8 | Discovering SQL | 9781118002674 | 2011 | Indianapolis | Wiley | 400 | |
14 | 9 | 610721228318 | 2005 | Missing | 9 | SQL | 0321334175 | 2005 | Berkeley | Peachpit | 2nd ed. | 460 |
15 | 10 | 148164881245 | 2008 | Missing | 10 | A guide to SQL | 9780324597684 | 2008 | Mason | South-Western | 8th ed. | 309 |
16 | 10 | 445317012796 | 2010 | On shelf | 10 | A guide to SQL | 9780324597684 | 2008 | Mason | South-Western | 8th ed. | 309 |
17 | 11 | 291006691199 | 2008 | On shelf | 11 | SQL bible | 9780470229064 | 2008 | Indianapolis | Wiley | 2nd ed. | 857 |
18 | 12 | 665741505651 | 2009 | On shelf | 12 | Learning SQL | 9780596520830 | 2009 | Sebastopol | O'Reilly | 2nd ed. | 320 |
19 | 12 | 623061160016 | 2009 | Loaned | 12 | Learning SQL | 9780596520830 | 2009 | Sebastopol | O'Reilly | 2nd ed. | 320 |
20 | 13 | 827361553957 | 2010 | On shelf | 13 | SQL for dummies | 9780470557419 | 2010 | Hoboken | Wiley | 7th ed. | 440 |
21 | 13 | 228598347653 | 2010 | Missing | 13 | SQL for dummies | 9780470557419 | 2010 | Hoboken | Wiley | 7th ed. | 440 |
22 | 13 | 585952782539 | 2010 | On shelf | 13 | SQL for dummies | 9780470557419 | 2010 | Hoboken | Wiley | 7th ed. | 440 |
23 | 13 | 701532568017 | 2011 | On shelf | 13 | SQL for dummies | 9780470557419 | 2010 | Hoboken | Wiley | 7th ed. | 440 |
24 | 14 | 989297622703 | 2008 | On shelf | 14 | Beginning SQL queries | 9781590599433 | 2008 | Berkeley | Apress | 218 | |
25 | 15 | 640793136396 | 2005 | On shelf | 15 | Beginning SQL | 0764577328 | 2005 | Indianapolis | Wiley | 501 | |
26 | 15 | 521089986565 | 2005 | On shelf | 15 | Beginning SQL | 0764577328 | 2005 | Indianapolis | Wiley | 501 | |
27 | 16 | 139685507140 | 2013 | Loaned | 16 | Microsoft SQL server 2012 | 9780132977661 | 2013 | Indianapolis | Sams | ||
28 | 16 | 853183712696 | 2013 | Loaned | 16 | Microsoft SQL server 2012 | 9780132977661 | 2013 | Indianapolis | Sams | ||
29 | 17 | 257153081154 | 2011 | On shelf | 17 | SQL all-in-one | 9780470929964 | 2011 | Hoboken | Wiley | 2nd ed. | 708 |
30 | 18 | 208546921091 | 2013 | Loaned | 18 | Access 2013 all-in-one | 9781118510551 | 2013 | Hoboken | Wiley | 760 | |
31 | 19 | 921664426379 | 2004 | On shelf | 19 | SQL in a nutshell | 0596004818 | 2004 | Cambridge | O'Reilly | 2nd ed. | 691 |
32 | 19 | 298308111210 | 2004 | Loaned | 19 | SQL in a nutshell | 0596004818 | 2004 | Cambridge | O'Reilly | 2nd ed. | 691 |
33 | 19 | 210139559101 | 2004 | Missing | 19 | SQL in a nutshell | 0596004818 | 2004 | Cambridge | O'Reilly | 2nd ed. | 691 |
34 | 20 | 344919897556 | 2005 | On shelf | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
35 | 20 | 035230397910 | 2005 | On shelf | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
36 | 20 | 527524003500 | 2005 | On shelf | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
37 | 20 | 467701665668 | 2005 | Missing | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
38 | 20 | 082665141572 | 2005 | On shelf | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
39 | 20 | 273837764866 | 2006 | Loaned | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
40 | 20 | 582937020090 | 2006 | On shelf | 20 | MySQL in a nutshell | 0596007892 | 2005 | Sebastopol | O'Reilly | 1st ed. | 321 |
ON
does the same job as WHERE
: it only keeps records that pass some test. (The difference between the two is that ON
filters records as they're being created, while WHERE
waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined items with unrelated works, leaving us with just the ones we want.
Notice that we used table.field
to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we're talking about.
We can now use the same dotted notation to select only the columns we are interested in displaying:
%%sqlite swclib.db SELECT Items.Barcode, Works.Title, Works.ISBN FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID LIMIT 10;
081722942611 | SQL in a nutshell | 9780596518844 |
492437609065 | SQL in a nutshell | 9780596518844 |
172480710952 | SQL for dummies | 9781118607961 |
708014968732 | PHP & MySQL | 9781449325572 |
819783404942 | PHP & MySQL | 9781449325572 |
257370237291 | Using SQLite | 9780596521189 |
002905925356 | Geek sublime | 9780571310302 |
964583604781 | Geek sublime | 9780571310302 |
701630524534 | Capital in the 21st century | 9780674430006 |
722040919616 | Capital in the 21st century | 9780674430006 |
We can now try to tackle the case of the Authors
table. To list the contributors associated with the first item on the Works
table (Work_ID=1
, SQL in a nutshell 3rd ed.), we write:
%%sqlite swclib.db SELECT Works_Authors.Role, Authors.Personal, Authors.Family FROM Works_Authors JOIN Authors ON Authors.Author_ID=Works_Authors.Author_ID WHERE Works_Authors.Work_ID=1;
Author | Kevin E. | Kline |
Contributor | Daniel | Kline |
Contributor | Brand | Hunt |
Or inversely, if we want to list all the works that Allen G. Taylor (Author_ID=4
) has authored or contributed to, we can write:
%%sqlite swclib.db SELECT Works.Title, Works.Date, Works.Edition, Works_Authors.Role FROM Works JOIN Works_Authors ON Works.Work_ID=Works_Authors.Work_ID WHERE Works_Authors.Author_ID=4;
SQL for dummies | 2013 | 8th ed. | Author |
SQL for dummies | 2010 | 7th ed. | Author |
SQL all-in-one | 2011 | 2nd ed. | Author |
Access 2013 all-in-one | 2013 | Contributor |
If joining two tables is good, then joining more tables must be better. In fact, we can join any number of tables simply by adding more JOIN
clauses to our query, and more ON
tests to filter out combinations of records that don't make sense.
We can tell which records from Works
, Authors
, Items
and Works_Authors
correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. Another way of saying this is that a foreign key is the primary key of one table that appears in some other table. In our database, Works.Work_ID
is the primary key in the Works
table, while Items.Work_ID
is a foreign key relating the Items
table's entries to entries in Works
. The Authors_Works
table contains only foreign keys relating to entries in the Works
and Authors
tables.
Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like "student numbers" and "library card numbers", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite actually numbers records automatically as they're added to tables, and this number could have been used instead of the ID numbers that were specified in the example tables:
%%sqlite swclib.db SELECT rowid, * from Items LIMIT 5;
1 | 1 | 1 | 081722942611 | 2009 | Loaned |
2 | 2 | 1 | 492437609065 | 2011 | On shelf |
3 | 3 | 2 | 172480710952 | 2013 | On shelf |
4 | 4 | 3 | 708014968732 | 2013 | Missing |
5 | 5 | 3 | 819783404942 | 2014 | Loaned |
Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".
The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the Work_ID
field in the Works
table), or even a combination of values.
The third rule is that there should be no redundant information. For example, we could get rid of the Works
table and rewrite the Items
table something like this:
Item_ID | Title | Publisher | Date | Barcode | Acquired | Status |
---|---|---|---|---|---|---|
1 | SQL in a nutshell | Sebastopol: O'Reilly | 2009 | 081722942611 | 2009 | Loaned |
2 | SQL in a nutshell | Sebastopol: O'Reilly | 2009 | 492437609065 | 2011 | On shelf |
3 | SQL for dummies | Hoboken: Wiley | 2013 | 172480710952 | 2013 | On shelf |
4 | PHP & MySQL | Sebastopol: O'Reilly | 2013 | 708014968732 | 2013 | Missing |
5 | PHP & MySQL | Sebastopol: O'Reilly | 2013 | 819783404942 | 2014 | Loaned |
In fact, we could use a single table that recorded all the information about each item in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the bibliographic information associated with a series of items is wrong, we have to change multiple records in the database. Similarly, storing authors' and contributors' names in separate columns would mean adding an extra column every time we encounter a title that has more contributors than all the previous items. If we have only one item that has twenty contributors, a spreadsheet design would require twenty separate columns to store this information, and those columns would sit empty for the majority of the items. This is one of the reasons why authority files are in separate tables.
The fourth rule is that the units for every value should be stored explicitly. In this particular database, it's not very important (although we could specify that the Pages
field contains the count of pages in any given book). This can be a problem in databases that contain scientific data, for example.
Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.
Write a query that lists all works written by people whose Family name start with the letter "K".
Write a query that lists all authors that have written at least one book that is currently on loan from the library.
To which item does the barcode 722040919616
refer to, what is the title of this book and who are its authors?