Software Carpentry:

Selecting Data

For this lesson, we're going to use the library of the SQL Learners Academy as an example. Since the Academy is much more concerned about teaching SQL than proper cataloguing, its library catalogue is quite rudimentary. The catalogue is based on a few tables that are in relation to each other, enough to get the concept of relational databases across, and run some simple quries!

More importantly, the raw database is always available for download as an SQLite database file or as an SQL dump so that students can learn to tinker SQL.

Objectives

Explain the difference between a table, a record, and a field.
Explain the difference between a database and a database manager.
Write a query to select all values for specific fields from a single table.

A Few Definitions

A relational database is a way to store and manipulate information that is arranged as tables. Each table has columns (also known as fields) which describe the data, and rows (also known as records) which contain the data.

When we are using a spreadsheet, we put formulas into cells to calculate new values based on old ones. When we are using a database, we send commands (usually called queries) to a database manager: a program that manipulates the database for us. The database manager does whatever lookups and calculations the query specifies, returning the results in a tabular form that we can then use as a starting point for further queries.

Every database manager—Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite—stores data in a different way, so a database created with one cannot be used directly by another. However, every database manager can import and export data in a variety of formats, so it is possible to move information from one to another.

Queries are written in a language called SQL, which stands for "Structured Query Language". SQL provides hundreds of different ways to analyze and recombine data; we will only look at a handful, but that handful accounts for most of what scientists do.

The tables below show the database we will use in our library example. Students using SQLite can download the database to try the queries along. Only the first 5 rows of each table are displayed here.

Works: simplistic bibliographical information for the works available in the library

Work_ID	Title	ISBN	Date	Place	Publisher	Edition	Pages
1	SQL in a nutshell	9780596518844	2009	Sebastopol	O'Reilly	3rd ed.	578
2	SQL for dummies	9781118607961	2013	Hoboken	Wiley	8th ed.
3	PHP & MySQL	9781449325572	2013	Sebastopol	O'Reilly	2nd ed.	532
4	Using SQLite	9780596521189	2010	Sebastopol	O'Reilly	1st ed.	503
5	Geek sublime	9780571310302	2014	London	Faber & Faber		258

Authors: the "authority file", containing information about the Authors of the Works

Author_ID	Family	Personal	Birth
1	Kline	Kevin E.	1966
2	Kline	Daniel
3	Hunt	Brand
4	Taylor	Allen G.	1945
5	McLaughlin	Brett

Works_Authors: The relationship between the Works and the Authors (more on that later)

Work_ID	Author_ID	Role
1	1	Author
1	2	Contributor
1	3	Contributor
2	4	Author
3	5	Author

Items: the actual copies of the Works owned by the library

Item_ID	Work_ID	Barcode	Acquired	Status
1	1	081722942611	2009	Loaned
2	1	492437609065	2011	On shelf
3	2	172480710952	2013	On shelf
4	3	708014968732	2013	Missing
5	3	819783404942	2014	Loaned

Notice that some entries are shown in red because they don't contain any actual data: we'll return to these missing values later. For now, let's write an SQL query that displays the names of all authors present in the catalogue. We do this using the SQL command SELECT, giving it the names of the columns we want and the table we want them from. Our query and its output look like this:

%load_ext sqlitemagic

%%sqlite swclib.db
SELECT Family, Personal FROM Authors;


	Kline
	Kevin E.
	
	Kline
	Daniel
	
	Hunt
	Brand
	
	Taylor
	Allen G.
	
	McLaughlin
	Brett
	
	Kreibich
	Jay A.
	
	Chandra
	Vikram
	
	Piketty
	Thomas
	
	Goldhammer
	Arthur
	
	Oppel
	Andrew J.
	
	Sheldon
	Robert
	
	Kriegel
	Alex
	
	Fehily
	Chris
	
	Pratt
	Philipp J. 
	
	Last
	Mary Z.
	
	Beaulieu
	Alan
	
	Churcher
	Clare
	
	Wilton
	Paul
	
	Colby
	John W.
	
	Mistry
	Ross
	
	Seenarine
	Shirmattie
	
	Barrows
	Alison
	
	Stockman
	Joseph C.
	
	Dyer
	Russel J. T.

The semi-colon at the end of the query tells the database manager that the query is complete and ready to run. We have written our commands in UPPER CASE, and the column and the table name in Title Case, but we don't have to: as the example below shows, SQL is case insensitive.

%%sqlite swclib.db
SeLeCt FAMILY, PERSONAL from auTHORS limit 5;


	Kline
	Kevin E.
	
	Kline
	Daniel
	
	Hunt
	Brand
	
	Taylor
	Allen G.
	
	McLaughlin
	Brett

Whatever casing convention you choose, please be consistent: complex queries are hard enough to read without the extra cognitive load of random capitalization. An usual practice is to type SQL commands in UPPER CASE, but it's really up to you.

Note also the use of the LIMIT command in the above example. As the name implies, this limits output by only displaying the first 5 rows of data. This command can be notably useful when trying out queries to a large database, helping you make sure you're getting out what you're looking for without wasting time displaying the entire table on every attempt.

Going back to our query, it's important to understand that the rows and columns in a database table aren't actually stored in any particular order. They will always be displayed in some order, but we can control that in various ways. For example, we could swap the columns in the output by writing our query as:

%%sqlite swclib.db
SELECT Personal, Family FROM Authors LIMIT 5;


	Kevin E.
	Kline
	
	Daniel
	Kline
	
	Brand
	Hunt
	
	Allen G.
	Taylor
	
	Brett
	McLaughlin

or even repeat columns:

%%sqlite swclib.db
SELECT Personal, Family, Personal FROM Authors LIMIT 5


	Kevin E.
	Kline
	Kevin E.
	
	Daniel
	Kline
	Daniel
	
	Brand
	Hunt
	Brand
	
	Allen G.
	Taylor
	Allen G.
	
	Brett
	McLaughlin
	Brett

As a shortcut, we can select all of the columns in a table using *:

%%sqlite swclib.db
SELECT * FROM Authors LIMIT 5;


	1
	Kline
	Kevin E.
	
	1966
	
	
	2
	Kline
	Daniel
	
	
	
	
	3
	Hunt
	Brand
	
	
	
	
	4
	Taylor
	Allen G.
	
	1945
	
	
	5
	McLaughlin
	Brett

It may seem strange to use personal and family as field names instead of first and last, but it's a necessary first step toward handling cultural differences. For example, consider the following rules:

Full Name	Alphabetized Under	Reason
Liu Xiaobo	Liu	Chinese family names come first
Leonardo da Vinci	Leonardo	"da Vinci" just means "from Vinci"
Catherine de Medici	Medici	family name
Jean de La Fontaine	La Fontaine	family name is "La Fontaine"
Juan Ponce de Leon	Ponce de Leon	full family name is "Ponce de Leon"
Gabriel Garcia Marquez	Garcia Marquez	double-barrelled Spanish surnames
Wernher von Braun	von or Braun	depending on whether he was in Germany or the US
Elizabeth Alexandra May Windsor	Elizabeth	monarchs alphabetize by the name under which they reigned
Thomas a Beckett	Thomas	and saints according to the names by which they were canonized

Clearly, even a two-part division into "personal" and "family" isn't enough...

Challenges

Write a query that selects only titles from the Works table.
Write a query that selects the first 10 barcodes and status from the Items table.
Many people format queries as:
```
SELECT personal, family FROM authors;
```
or as:
```
select Personal, Family from AUTHORS;
```
What style do you find easiest to read, and why?

Key Points

A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.
A database manager is a program that manipulates information stored in a database.
We write queries in a specialized language called SQL to extract information from databases.
SQL is case-insensitive.