Find us on GitHub

Missing Data

Objectives

  • Explain how databases represent missing information.
  • Explain the three-valued logic databases use when manipulating missing information.
  • Write queries that handle missing information correctly.

Real-world data is never complete—there are always holes. Databases represent these holes using special value called NULL. NULL is not zero, False, or the empty string; it is a one-of-a-kind value that means "nothing here". Dealing with NULL requires a few special tricks and some careful thinking.

To start, let's have a look at the Works table. There are a total of 20 records. The page number is missing for records #2 and #26—for them, the Pages field is null:

%load_ext sqlitemagic
%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works;
1 SQL in a nutshell 578
2 SQL for dummies
3 PHP & MySQL 532
4 Using SQLite 503
5 Geek sublime 258
6 Capital in the 21st century 685
7 SQL 534
8 Discovering SQL 400
9 SQL 460
10 A guide to SQL 309
11 SQL bible 857
12 Learning SQL 320
13 SQL for dummies 440
14 Beginning SQL queries 218
15 Beginning SQL 501
16 Microsoft SQL server 2012
17 SQL all-in-one 708
18 Access 2013 all-in-one 760
19 SQL in a nutshell 691
20 MySQL in a nutshell 321

Null doesn't behave like other values. If we select the works that have less than 300 pages:

%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages<300
5 Geek sublime 258
14 Beginning SQL queries 218

we get two results, and if we select the ones that have 300 or more pages:

%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages>=300
1 SQL in a nutshell 578
3 PHP & MySQL 532
4 Using SQLite 503
6 Capital in the 21st century 685
7 SQL 534
8 Discovering SQL 400
9 SQL 460
10 A guide to SQL 309
11 SQL bible 857
12 Learning SQL 320
13 SQL for dummies 440
15 Beginning SQL 501
17 SQL all-in-one 708
18 Access 2013 all-in-one 760
19 SQL in a nutshell 691
20 MySQL in a nutshell 321

we get sixteen results. Records #2 and #16 aren't in either set of results. The reason is that NULL<300 is neither true nor false: null means, "We don't know," and if we don't know the value on the left side of a comparison, we don't know whether the comparison is true or false. Since databases represent "don't know" as null, the value of NULL<300 is actually NULL. NULL>=300 is also null because we can't answer to that question either. And since the only records kept by a WHERE are those for which the test is true, records #2 and #16 aren't included in either set of results.

Comparisons aren't the only operations that behave this way with nulls. 1+NULL is NULL, 5*NULL is NULL, log(NULL) is NULL, and so on. In particular, comparing things to NULL with = and != produces NULL:

%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages=NULL;
%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages!=NULL;

To check whether a value is NULL or not, we must use a special test IS NULL:

%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages IS NULL
2 SQL for dummies
16 Microsoft SQL server 2012

or its inverse IS NOT NULL:

%%sqlite swclib.db
SELECT Work_ID, Title, Pages FROM Works WHERE Pages IS NOT NULL
1 SQL in a nutshell 578
3 PHP & MySQL 532
4 Using SQLite 503
5 Geek sublime 258
6 Capital in the 21st century 685
7 SQL 534
8 Discovering SQL 400
9 SQL 460
10 A guide to SQL 309
11 SQL bible 857
12 Learning SQL 320
13 SQL for dummies 440
14 Beginning SQL queries 218
15 Beginning SQL 501
17 SQL all-in-one 708
18 Access 2013 all-in-one 760
19 SQL in a nutshell 691
20 MySQL in a nutshell 321

Null values cause headaches wherever they appear. For example, suppose we want to find all books about SQL that have less than 500 pages. It's natural to write the query like this:

%%sqlite swclib.db
SELECT * FROM Works WHERE Title LIKE "%SQL%" AND Pages<500;
8 Discovering SQL 9781118002674 2011 Indianapolis Wiley 400
9 SQL 0321334175 2005 Berkeley Peachpit 2nd ed. 460
10 A guide to SQL 9780324597684 2008 Mason South-Western 8th ed. 309
12 Learning SQL 9780596520830 2009 Sebastopol O'Reilly 2nd ed. 320
13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
14 Beginning SQL queries 9781590599433 2008 Berkeley Apress 218
20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321

but this query filters omits the records where we don't know the number of pages. Once again, the reason is that when Pages is NULL, the < comparison produces NULL, so the record isn't kept in our results. If we want to keep these records we need to add an explicit check:

%%sqlite swclib.db
SELECT * FROM Works WHERE Title LIKE "%SQL%" AND (Pages<500 OR Pages IS NULL);
2 SQL for dummies 9781118607961 2013 Hoboken Wiley 8th ed.
8 Discovering SQL 9781118002674 2011 Indianapolis Wiley 400
9 SQL 0321334175 2005 Berkeley Peachpit 2nd ed. 460
10 A guide to SQL 9780324597684 2008 Mason South-Western 8th ed. 309
12 Learning SQL 9780596520830 2009 Sebastopol O'Reilly 2nd ed. 320
13 SQL for dummies 9780470557419 2010 Hoboken Wiley 7th ed. 440
14 Beginning SQL queries 9781590599433 2008 Berkeley Apress 218
16 Microsoft SQL server 2012 9780132977661 2013 Indianapolis Sams
20 MySQL in a nutshell 0596007892 2005 Sebastopol O'Reilly 1st ed. 321

We still have to decide whether this is the right thing to do or not. If we want to be absolutely sure that we aren't including any books with less than 500 pages, we need to exclude all the records for which we don't know the number of pages.

Challenges

  1. Write a query that sorts the records in Works by the number of pages, omitting entries for which this information is not known (i.e., is null).

  2. What do you expect the query:

    SELECT * FROM Works WHERE Edition in ('1st ed.', null);

    to produce? What does it actually produce?

  3. Some database designers prefer to use a sentinel value to mark missing data rather than NULL. For example, they will use the date "0000-00-00" to mark a missing date, or -1 to mark a missing page number. What does this simplify? What burdens or risks does it introduce?

Key Points

  • Databases use null to represent missing information.
  • Any arithmetic or Boolean operation involving null produces null as a result.
  • The only operators that can safely be used with null are is null and is not null.