Software Carpentry:

Combining Data

Objectives

Explain the operation of a query that joins two tables.
Explain how to restrict the output of a query containing a join to only include meaningful combinations of values.
Write queries that join tables on equal keys.
Explain what primary and foreign keys are, and why they are useful.
Explain what atomic values are, and why database fields should only contain atomic values.

So far we haven't been able to display the authors' names in our query results, because those names are not in the Works but in the Authors table. What's worse, because there can be many authors to a title and many title associated with each author, there is a third table, Works_Authors that's taking care of this many-to-many relationship. How can these tables be joined?

The SQL command to do this is JOIN. To see how it works, let's start with the somewhat easier case of the Items table and try joining it to the Works table:

%load_ext sqlitemagic

%%sqlite swclib.db
SELECT * FROM Items JOIN Works;


	1
	1
	081722942611
	2009
	Loaned
	1
	SQL in a nutshell
	9780596518844
	2009
	Sebastopol
	O'Reilly
	3rd ed.
	578
	
	1
	1
	081722942611
	2009
	Loaned
	2
	SQL for dummies
	9781118607961
	2013
	Hoboken
	Wiley
	8th ed.
	
	
	1
	1
	081722942611
	2009
	Loaned
	3
	PHP & MySQL
	9781449325572
	2013
	Sebastopol
	O'Reilly
	2nd ed.
	532
	
	1
	1
	081722942611
	2009
	Loaned
	4
	Using SQLite
	9780596521189
	2010
	Sebastopol
	O'Reilly
	1st ed.
	503
	
	1
	1
	081722942611
	2009
	Loaned
	5
	Geek sublime
	9780571310302
	2014
	London
	Faber & Faber
	
	258

The result above was truncated for display because... the query returned a list of 800 records! In fact, JOIN creates the cross product of two tables, i.e., it joins each record of one with each record of the other to give all possible combinations. Since there are 20 records in Works and 40 in Items, the join's output has 20*40=800 records. And since Works has 8 fields and Items has 5, the output has 8+5=13 fields.

What the join hasn't done is figure out if the records being joined have anything to do with each other. It has no way of knowing whether they do or not until we tell it how. To do that, we add a clause specifying that we're only interested in combinations where Work_ID matches in both tables:

%%sqlite swclib.db
SELECT * FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID;


	1
	1
	081722942611
	2009
	Loaned
	1
	SQL in a nutshell
	9780596518844
	2009
	Sebastopol
	O'Reilly
	3rd ed.
	578
	
	2
	1
	492437609065
	2011
	On shelf
	1
	SQL in a nutshell
	9780596518844
	2009
	Sebastopol
	O'Reilly
	3rd ed.
	578
	
	3
	2
	172480710952
	2013
	On shelf
	2
	SQL for dummies
	9781118607961
	2013
	Hoboken
	Wiley
	8th ed.
	
	
	4
	3
	708014968732
	2013
	Missing
	3
	PHP & MySQL
	9781449325572
	2013
	Sebastopol
	O'Reilly
	2nd ed.
	532
	
	5
	3
	819783404942
	2014
	Loaned
	3
	PHP & MySQL
	9781449325572
	2013
	Sebastopol
	O'Reilly
	2nd ed.
	532
	
	6
	4
	257370237291
	2010
	Missing
	4
	Using SQLite
	9780596521189
	2010
	Sebastopol
	O'Reilly
	1st ed.
	503
	
	7
	5
	002905925356
	2014
	Loaned
	5
	Geek sublime
	9780571310302
	2014
	London
	Faber & Faber
	
	258
	
	8
	5
	964583604781
	2014
	Loaned
	5
	Geek sublime
	9780571310302
	2014
	London
	Faber & Faber
	
	258
	
	9
	6
	701630524534
	2014
	Loaned
	6
	Capital in the 21st century
	9780674430006
	2014
	Cambridge
	Belknap Press
	
	685
	
	10
	6
	722040919616
	2014
	On shelf
	6
	Capital in the 21st century
	9780674430006
	2014
	Cambridge
	Belknap Press
	
	685
	
	11
	6
	026655281484
	2014
	On shelf
	6
	Capital in the 21st century
	9780674430006
	2014
	Cambridge
	Belknap Press
	
	685
	
	12
	7
	422970103061
	2010
	On shelf
	7
	SQL
	9780071548649
	2009
	New York
	McGraw-Hill
	3rd ed.
	534
	
	13
	8
	655280484976
	2011
	Loaned
	8
	Discovering SQL
	9781118002674
	2011
	Indianapolis
	Wiley
	
	400
	
	14
	9
	610721228318
	2005
	Missing
	9
	SQL
	0321334175
	2005
	Berkeley
	Peachpit
	2nd ed.
	460
	
	15
	10
	148164881245
	2008
	Missing
	10
	A guide to SQL
	9780324597684
	2008
	Mason
	South-Western
	8th ed.
	309
	
	16
	10
	445317012796
	2010
	On shelf
	10
	A guide to SQL
	9780324597684
	2008
	Mason
	South-Western
	8th ed.
	309
	
	17
	11
	291006691199
	2008
	On shelf
	11
	SQL bible
	9780470229064
	2008
	Indianapolis
	Wiley
	2nd ed.
	857
	
	18
	12
	665741505651
	2009
	On shelf
	12
	Learning SQL
	9780596520830
	2009
	Sebastopol
	O'Reilly
	2nd ed.
	320
	
	19
	12
	623061160016
	2009
	Loaned
	12
	Learning SQL
	9780596520830
	2009
	Sebastopol
	O'Reilly
	2nd ed.
	320
	
	20
	13
	827361553957
	2010
	On shelf
	13
	SQL for dummies
	9780470557419
	2010
	Hoboken
	Wiley
	7th ed.
	440
	
	21
	13
	228598347653
	2010
	Missing
	13
	SQL for dummies
	9780470557419
	2010
	Hoboken
	Wiley
	7th ed.
	440
	
	22
	13
	585952782539
	2010
	On shelf
	13
	SQL for dummies
	9780470557419
	2010
	Hoboken
	Wiley
	7th ed.
	440
	
	23
	13
	701532568017
	2011
	On shelf
	13
	SQL for dummies
	9780470557419
	2010
	Hoboken
	Wiley
	7th ed.
	440
	
	24
	14
	989297622703
	2008
	On shelf
	14
	Beginning SQL queries
	9781590599433
	2008
	Berkeley
	Apress
	
	218
	
	25
	15
	640793136396
	2005
	On shelf
	15
	Beginning SQL
	0764577328
	2005
	Indianapolis
	Wiley
	
	501
	
	26
	15
	521089986565
	2005
	On shelf
	15
	Beginning SQL
	0764577328
	2005
	Indianapolis
	Wiley
	
	501
	
	27
	16
	139685507140
	2013
	Loaned
	16
	Microsoft SQL server 2012
	9780132977661
	2013
	Indianapolis
	Sams
	
	
	
	28
	16
	853183712696
	2013
	Loaned
	16
	Microsoft SQL server 2012
	9780132977661
	2013
	Indianapolis
	Sams
	
	
	
	29
	17
	257153081154
	2011
	On shelf
	17
	SQL all-in-one
	9780470929964
	2011
	Hoboken
	Wiley
	2nd ed.
	708
	
	30
	18
	208546921091
	2013
	Loaned
	18
	Access 2013 all-in-one
	9781118510551
	2013
	Hoboken
	Wiley
	
	760
	
	31
	19
	921664426379
	2004
	On shelf
	19
	SQL in a nutshell
	0596004818
	2004
	Cambridge
	O'Reilly
	2nd ed.
	691
	
	32
	19
	298308111210
	2004
	Loaned
	19
	SQL in a nutshell
	0596004818
	2004
	Cambridge
	O'Reilly
	2nd ed.
	691
	
	33
	19
	210139559101
	2004
	Missing
	19
	SQL in a nutshell
	0596004818
	2004
	Cambridge
	O'Reilly
	2nd ed.
	691
	
	34
	20
	344919897556
	2005
	On shelf
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	35
	20
	035230397910
	2005
	On shelf
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	36
	20
	527524003500
	2005
	On shelf
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	37
	20
	467701665668
	2005
	Missing
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	38
	20
	082665141572
	2005
	On shelf
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	39
	20
	273837764866
	2006
	Loaned
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321
	
	40
	20
	582937020090
	2006
	On shelf
	20
	MySQL in a nutshell
	0596007892
	2005
	Sebastopol
	O'Reilly
	1st ed.
	321

ON does the same job as WHERE: it only keeps records that pass some test. (The difference between the two is that ON filters records as they're being created, while WHERE waits until the join is done and then does the filtering.) Once we add this to our query, the database manager throws away records that combined items with unrelated works, leaving us with just the ones we want.

Notice that we used table.field to specify field names in the output of the join. We do this because tables can have fields with the same name, and we need to be specific which ones we're talking about.

We can now use the same dotted notation to select only the columns we are interested in displaying:

%%sqlite swclib.db
SELECT Items.Barcode, Works.Title, Works.ISBN FROM Items JOIN Works ON Items.Work_ID=Works.Work_ID LIMIT 10;


	081722942611
	SQL in a nutshell
	9780596518844
	
	492437609065
	SQL in a nutshell
	9780596518844
	
	172480710952
	SQL for dummies
	9781118607961
	
	708014968732
	PHP & MySQL
	9781449325572
	
	819783404942
	PHP & MySQL
	9781449325572
	
	257370237291
	Using SQLite
	9780596521189
	
	002905925356
	Geek sublime
	9780571310302
	
	964583604781
	Geek sublime
	9780571310302
	
	701630524534
	Capital in the 21st century
	9780674430006
	
	722040919616
	Capital in the 21st century
	9780674430006

We can now try to tackle the case of the Authors table. To list the contributors associated with the first item on the Works table (Work_ID=1, SQL in a nutshell 3rd ed.), we write:

%%sqlite swclib.db
SELECT Works_Authors.Role, Authors.Personal, Authors.Family 
FROM   Works_Authors 
JOIN   Authors 
ON     Authors.Author_ID=Works_Authors.Author_ID 
WHERE  Works_Authors.Work_ID=1;


	Author
	Kevin E.
	Kline
	
	Contributor
	Daniel
	Kline
	
	Contributor
	Brand
	Hunt

Or inversely, if we want to list all the works that Allen G. Taylor (Author_ID=4) has authored or contributed to, we can write:

%%sqlite swclib.db
SELECT Works.Title, Works.Date, Works.Edition, Works_Authors.Role 
FROM   Works 
JOIN   Works_Authors 
ON     Works.Work_ID=Works_Authors.Work_ID 
WHERE  Works_Authors.Author_ID=4;


	SQL for dummies
	2013
	8th ed.
	Author
	
	SQL for dummies
	2010
	7th ed.
	Author
	
	SQL all-in-one
	2011
	2nd ed.
	Author
	
	Access 2013 all-in-one
	2013
	
	Contributor

If joining two tables is good, then joining more tables must be better. In fact, we can join any number of tables simply by adding more JOIN clauses to our query, and more ON tests to filter out combinations of records that don't make sense.

We can tell which records from Works, Authors, Items and Works_Authors correspond with each other because those tables contain primary keys and foreign keys. A primary key is a value, or combination of values, that uniquely identifies each record in a table. A foreign key is a value (or combination of values) from one table that identifies a unique record in another table. Another way of saying this is that a foreign key is the primary key of one table that appears in some other table. In our database, Works.Work_ID is the primary key in the Works table, while Items.Work_ID is a foreign key relating the Items table's entries to entries in Works. The Authors_Works table contains only foreign keys relating to entries in the Works and Authors tables.

Most database designers believe that every table should have a well-defined primary key. They also believe that this key should be separate from the data itself, so that if we ever need to change the data, we only need to make one change in one place. One easy way to do this is to create an arbitrary, unique ID for each record as we add it to the database. This is actually very common: those IDs have names like "student numbers" and "library card numbers", and they almost always turn out to have originally been a unique record identifier in some database system or other. As the query below demonstrates, SQLite actually numbers records automatically as they're added to tables, and this number could have been used instead of the ID numbers that were specified in the example tables:

%%sqlite swclib.db
SELECT rowid, * from Items LIMIT 5;


	1
	1
	1
	081722942611
	2009
	Loaned
	
	2
	2
	1
	492437609065
	2011
	On shelf
	
	3
	3
	2
	172480710952
	2013
	On shelf
	
	4
	4
	3
	708014968732
	2013
	Missing
	
	5
	5
	3
	819783404942
	2014
	Loaned

Data Hygiene

Now that we have seen how joins work, we can see why the relational model is so useful and how best to use it. The first rule is that every value should be atomic, i.e., not contain parts that we might want to work with separately. We store personal and family names in separate columns instead of putting the entire name in one column so that we don't have to use substring operations to get the name's components. More importantly, we store the two parts of the name separately because splitting on spaces is unreliable: just think of a name like "Eloise St. Cyr" or "Jan Mikkel Steubart".

The second rule is that every record should have a unique primary key. This can be a serial number that has no intrinsic meaning, one of the values in the record (like the Work_ID field in the Works table), or even a combination of values.

The third rule is that there should be no redundant information. For example, we could get rid of the Works table and rewrite the Items table something like this:

Item_ID	Title	Publisher	Date	Barcode	Acquired	Status
1	SQL in a nutshell	Sebastopol: O'Reilly	2009	081722942611	2009	Loaned
2	SQL in a nutshell	Sebastopol: O'Reilly	2009	492437609065	2011	On shelf
3	SQL for dummies	Hoboken: Wiley	2013	172480710952	2013	On shelf
4	PHP & MySQL	Sebastopol: O'Reilly	2013	708014968732	2013	Missing
5	PHP & MySQL	Sebastopol: O'Reilly	2013	819783404942	2014	Loaned

In fact, we could use a single table that recorded all the information about each item in each row, just as a spreadsheet would. The problem is that it's very hard to keep data organized this way consistent: if we realize that the bibliographic information associated with a series of items is wrong, we have to change multiple records in the database. Similarly, storing authors' and contributors' names in separate columns would mean adding an extra column every time we encounter a title that has more contributors than all the previous items. If we have only one item that has twenty contributors, a spreadsheet design would require twenty separate columns to store this information, and those columns would sit empty for the majority of the items. This is one of the reasons why authority files are in separate tables.

The fourth rule is that the units for every value should be stored explicitly. In this particular database, it's not very important (although we could specify that the Pages field contains the count of pages in any given book). This can be a problem in databases that contain scientific data, for example.

Stepping back, data and the tools used to store it have a symbiotic relationship: we use tables and joins because it's efficient, provided our data is organized a certain way, but organize our data that way because we have tools to manipulate it efficiently if it's in a certain form. As anthropologists say, the tool shapes the hand that shapes the tool.

Challenges

Write a query that lists all works written by people whose Family name start with the letter "K".
Write a query that lists all authors that have written at least one book that is currently on loan from the library.
To which item does the barcode 722040919616 refer to, what is the title of this book and who are its authors?

Key Points

Every fact should be represented in a database exactly once.
A join produces all combinations of records from one table with records from another.
A primary key is a field (or set of fields) whose values uniquely identify the records in a table.
A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.
We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.
Keys should be atomic values to make joins simpler and more efficient.

081722942611	SQL in a nutshell	9780596518844
492437609065	SQL in a nutshell	9780596518844
172480710952	SQL for dummies	9781118607961
708014968732	PHP & MySQL	9781449325572
819783404942	PHP & MySQL	9781449325572
257370237291	Using SQLite	9780596521189
002905925356	Geek sublime	9780571310302
964583604781	Geek sublime	9780571310302
701630524534	Capital in the 21st century	9780674430006
722040919616	Capital in the 21st century	9780674430006

Author	Kevin E.	Kline
Contributor	Daniel	Kline
Contributor	Brand	Hunt

SQL for dummies	2013	8th ed.	Author
SQL for dummies	2010	7th ed.	Author
SQL all-in-one	2011	2nd ed.	Author
Access 2013 all-in-one	2013		Contributor