Sets and Dictionaries

Opening

Fan Fullerene has just joined Molecules'R'Us, a nanotechnology startup that fabricates molecules using only the highest quality atoms. His first job is to build a simple inventory management system that compares incoming orders for molecules to the stock of atoms in the company's supercooled warehouse to see how many of those molecules we can build. For example, if the warehouse holds 20 hydrogen atoms, 5 oxygen atoms, and 11 nitrogen atoms, Fan could make 10 water molecules (H2O) or 6 ammonia molecules (NH3), but could not make any methane (CH4) because there isn't any carbon.

Fan could solve this problem using the tools we've seen so far. As we'll see, though, it's a lot more efficient to do it using a different data structure. And "efficient" means both "takes less programmer time to create" and "takes less computer time to execute": the data structures introduced in this chapter are both simpler to use and faster than the lists most programmers are introduced to first.

Instructors

The ostensible goal of this set of lessons is to introduce learners to non-linear data structures. Most have ony ever seen arrays or lists, i.e., things that are accessed using sequential numeric indices. Sets and dictionaries are usually their first exposure to accessing content by value rather than by location, and to the bigger idea that there are lots of other data structures they might want to learn about. (Unfortunately, there still isn't a good data structure handbook for Python programmers that we can point them at.)

These lessons also introduce JSON as a general-purpose data format that requires less effort to work with than flat text or CSV. We discuss its shortcomings as well as its benefits to help learners see what forces are at play when designing and/or choosing data representations.

Finally, these lessons are also our first chance to introduce the idea of computational complexity via back-of-the-envelope calculations of how the number of steps required to look things up in an unordered list grows with the number of things being looked up. We return to this idea in the extended example of invasion percolation, and to the notion that algorithmic improvements help more than tuning code, but this is a chance to touch on the idea in classes that don't get to that example. The discussion of hash tables is also good preparation for the discussion of relational databases, but isn't required.

Everything in this lesson except the final example on phylogenetic trees can be covered in two hours, assuming that only three short exercises are given (one for sets, one for basic dictionary operations, and one related to aggregation).

  • Start with sets: they're a familiar concept, there's no confusion between keys and values, and they are enough to motivate discussion of hash tables.
  • Python's requirement that entries in hash-based data structures be immutable only makes sense once the mechanics of hash tables are explained. Terms like "hash codes" and "hash function" also come up in error messages and Python's documentation, so learners are likely to become confused without some kind of hand-waving overview. Tuples are also easy to explain as "how to create immutable multi-part keys", and it's easy to explain why entries can't be looked up by parts (e.g., why a tuple containing a first and a last name can't be looked up by last name only) in terms of hash functions.
  • Finally, explaining why hash tables are fast is a good first encounter with the idea of "big oh" complexity.
  • Once sets have been mastered, dictionaries can be explained as "sets with extra information attached to each entry". The canonical example—counting things—shows why that "extra information" is useful. The original motivating problem then uses both a dictionary and a dictionary of dictionaries; when introducing the latter, compare it to a list of lists.
  • Use the nanotechnology inventory example to re-emphasize how code is build top-down by writing code as if desired functions existed, then filling them in.
  • Only tackle the phylogenetic tree example with very advanced learners. The algorithm is usually presented as a table, which makes an array a natural representation. Showing how and why to use dictionaries instead is as important as showing vector operations when introducing NumPy, but the example is hard to follow (and debug) without a graphical representation of the evolving tree.

Prerequisites

Basic data types (strings, numbers, lists); loops; file I/O; conditionals; string operations; references and aliasing; creating functions; top-down development.

Sets

Notebook

Objectives

  • Explain why some programs that use lists become proportionally slower as data sizes increase.
  • Explain the three adjectives in "unordered collection of distinct values".
  • Use a set to eliminate duplicate values from data.

Lesson

Let's start with something simpler than our actual inventory problem. Suppose we have a list of all the atoms in the warehouse, and we want to know which different kinds we have—not how many, but just their types. We could solve this problem using a list to store the unique atomic symbols we have seen. Here's a function to add a new atom to the list:

def another_atom(seen, atom):
    for i in range(len(seen)):
        if seen[i] == atom:
            return # atom is already present, so do not re-add
    seen.append(atom)

another_atom's arguments are a list of the unique atoms we've already seen, and the symbol of the atom we're adding. Inside the function, we loop over the atoms that are already in the list. If we find the one we're trying to add, we exit the function immediately: we aren't supposed to have duplicates in our list, so there's nothing to add. If we reach the end of the list without finding this symbol, though, we append it. This is a common design pattern: either we find pre-existing data in a loop and return right away, or take some default action if we finish the loop without finding a match.

Let's watch this function in action. We start with an empty list. If the first atomic symbol is 'Na', we find no match (since the list is empty), so we add it. The next symbol is 'Fe'; it doesn't match 'Na', so we add it as well. Our third symbol is 'Na' again. It matches the first entry in the list, so we exit the function immediately.

Before Adding After
[] 'Na' ['Na']
['Na'] 'Fe' ['Na', 'Fe']
['Na', 'Fe'] 'Na' ['Na', 'Fe']

This code works, but it is inefficient. Suppose there are V distinct atomic symbols in our data, and N symbols in total. Each time we add an observation to our list, we have to look through an average of V/2 entries. The total running time for our program is therefore approximately NV/2. If V is small, this is only a few times larger than N, but what happens if we're keeping track of something like patient records rather than atoms? In that case, most values are distinct, so V is approximately the same as N, which means that our running time is proportional to N2/2. That's bad news: if we double the size of our data set, our program runs four times slower, and if we double it again, our program will have slowed down by a factor of 16.

There's a better way to solve this problem that is simpler to use and runs much faster. The trick is to use a set to store the symbols. A set is an unordered collection of distinct items. The word "collection" means that a set can hold zero or more values. The word "distinct" means that any particular value is either in the set or not: a set can't store two or more copies of the same thing. And finally, "unordered" means that values are simply "in" the set. They're not in any particular order, and there's no first value or last value. (They actually are stored in some order, but as we'll discuss in the next section, that order is as random as the computer can make it.)

To create a set, we simply write down its elements inside curly braces:

>>> primes = {3, 5, 7}
A Simple Set
Figure 1: A Simple Set

However, we have to use set() to create an empty set, because the symbol {} was already being used for something else when sets were added to Python:

>>> even_primes = set() # not '{}' as in math

We'll meet that "something else" later in this chapter.

To see what we can do with sets, let's create three holding the integers 0 through 9, the first half of that same range of numbers (0 through 4), and the odd values 1, 3, 5, 7, and 9:

>>> ten  = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> lows = {0, 1, 2, 3, 4}
>>> odds = {1, 3, 5, 7, 9}

If we ask Python to display one of our sets, it shows us this:

>>> print lows
set([0, 1, 2, 3, 4])

rather than using the curly-bracket notation. I personally regard this as a design flaw, but it does remind us that we can create always create a set from a list.

Sets have methods just like strings and lists, and, like the methods of strings and lists, most of those methods create new sets instead of modifying the set they are called for. These three come straight from mathematics:

>>> print lows.union(odds)
set([0, 1, 2, 3, 4, 5, 7, 9])
>>> print lows.intersection(odds)
set([1, 3])
>>> print lows.difference(odds)
set([0, 2, 4])

Another method that creates a new set is symmetric_difference, which is sometimes called "exclusive or":

>>> print lows.symmetric_difference(odds)
set([0, 2, 4, 5, 7, 9])

It returns the values that are in one set or another, but not in both.

Not all set methods return new sets. For example, issubset returns True or False depending on whether all the elements in one set are present in another:

>>> print lows.issubset(ten)
True

A complementary method called issuperset also exists, and does the obvious thing:

>>> print lows.issuperset(odds)
False

We can count how many values are in a set using len (just as we would to find the length of a list or string), and check whether a particular value is in the set or not using in:

>>> print len(odds)
7
>>> print 6 in odds
False

Some methods modify the sets they are called for. The most commonly used is add, which adds an element to the set:

>>> lows.add(9)
>>> print lows
set([0, 1, 2, 3, 4, 9])

If the thing being added is already in the set, add has no effect, because any specific thing can appear in a set at most once:

>>> lows.add(9)
>>> print lows
set([0, 1, 2, 3, 4, 9])

This behavior is different from that of list.append, which always adds a new element to a list.

Finally, we can remove individual elements from the set:

>>> lows.remove(0)
>>> print lows
set([1, 2, 3, 4])

or clear it entirely:

>>> lows.clear()
>>> print lows
set()

Removing elements is similar to deleting things from a list, but there's an important difference. When we delete something from a list, we specify its location. When we delete something from a set, though, we must specify the value that we want to take out, because sets are not ordered. If that value isn't in the set, remove does nothing.

To help make programs easier to type and read, most of the methods we've just seen can be written using arithmetic operators as well. For example, instead of lows.issubset(ten), we can write lows <= ten, just as if we were using pen and paper. There are even a couple of operators, like the strict subset test <, that don't have long-winded equivalents.

Operation As Method Using Operator
difference lows.difference(odds) lows - odds
intersection lows.intersection(odds) lows & odds
subset lows.issubset(ten) lows <= ten
strict subset lows < ten
superset lows.issuperset(ten) lows >= odds
strict superset lows >= odds
exclusive or lows.symmetric_difference(odds) lows ^ odds
union lows.union(odds) lows | odds

The fact that the values in a set are distinct makes them a convenient way to get rid of duplicate values, like the "unique atoms" problem at the start of this section. Suppose we have a file containing the names of all the atoms in our warehouse, and our task is to produce a list of the their types. Here's how simple that code is:

def unique_atoms(filename):
    atoms = set()
    with open(filename, 'r') as source:
        for line in source:
            name = line.strip()
            atoms.add(name)
    return atoms

We start by creating an empty set which we will fill with atomic symbols and opening the file containing our data. As we read the lines in the file, we strip off any whitespace (such as the newline character at the end of the line) and put the resulting strings in the set. When we're done, we print the set. If our input is the file:

Na
Fe
Na
Si
Pd
Na

then our output is:

set(['Na', 'Fe', 'Si', 'Pd'])

The right atoms are there, but what are those extra square brackets for? The answer is that if we want to construct a set with values using set(), we have to pass those values in a single object, such as a list. This syntax:

set('Na', 'Fe', 'Si', 'Pd')

does not work, even though it seems more natural. On the other hand, this means that we can construct a set from almost anything that a for loop can iterate over:

>>> set('lithium')
set(['i', 'h', 'm', 'l', 'u', 't'])

But hang on: if we're adding characters to the set in the order 'l', 'i', 't', 'h', 'i', 'u', 'm', why does Python show them in the order 'i', 'h', 'm', 'l', 'u', 't'? To answer that question, we need to look at how sets are actually stored, and why they're stored that way.

Key Points

  • Use sets to store distinct unique values.
  • Create sets using set() or {v1, v2, ...}.
  • Sets are mutable, i.e., they can be updated in place like lists.
  • A loop over a set produces each element once, in arbitrary order.
  • Use sets to find unique things.

Challenges

  1. Mathematicians are quite comfortable negating sets: for example, the negation of the set {1, 2} is all numbers that aren't 1 or 2. Why don't Python's sets have a not operator?

  2. Fan has created a set containing the names of five noble gases:

    >>> print gases
    set(['helium', 'argon', 'neon', 'xenon', 'radon'])
    

    He would like to print them in alphabetical order. What is one simple way to do this? (Hint: the list function converts its arguments to a list.)

  3. Fan has the following code:

    left = {'He', 'Ar', 'Ne'}
    right = set()
    while len(left) > len(right):
        temp = left.pop()
        right.add(temp)
    

    What values could left and right have after this code is finished running? Explain why your answer makes this code hard to test.

  4. Fan has written the following code:

    left = {'He', 'Ar', 'Ne'}
    right = {'Ar', 'Xe'}
    for element in left:                # X
        if element not in right:        # X
            right.add(element)          # X
    assert left.issubset(right)
    

    What single line could be used in place of the three marked with 'X' to achieve the same effect?

  5. Fan has written a program to print the names of the distinct atoms in a data file:

    # Print the name of each atom in the data file once.
    reader = open('atoms.txt', 'r')
    seen = set()
    for line in reader:
        name = line.strip()
        if name in seen:
            print name
        else:
            seen.add(name)
    reader.close()
    

    When he runs the program on this data file:

    Na
    Fe
    Na
    

    it only prints:

    Na
    

    What is the simplest change you can make to the program so that it produces the correct answer?

Storage

Notebook

Objectives

  • Draw a diagram showing how hash tables are implemented, and correctly label the main parts.
  • Explain the purpose of a hash function.
  • Explain why using mutable values as keys in a hash table can cause problems.
  • Correctly identify the error messages Python produces when programs try to put mutable values in hashed data structures.
  • Explain the similarities and differences between tuples and lists.
  • Explain why using tuples is better than concatenating values when storing multi-part data in hashed data structures.

Lesson

Let's create a set, add the string 'lithium' to it (as a single item, not character by character), and print the result:

>>> things = set()
>>> things.add('lithium')
>>> print things
set(['lithium'])

As expected, the string is in the set. Now let's try adding a list to the same set:

>>> things.add([1, 2, 3])
TypeError: unhashable type: 'list'

Why doesn't that work? And what does that word "unhashable" mean?

When we create a set, the computer allocates a block of memory to store references to the set's elements. When we add something to the set, or try to look something up, the computer uses a hash function to figure out where to look. A hash function is any function that produces a seemingly-random number when given some data as input. For example, one way to hash a string is to add up the numerical values of its characters. If the string is "zebra", those values are 97 for lower-case 'a', 98 for lower-case 'b', and so on up to 122 for lower-case 'z'. When we add them up, we will always get the same result: in this case, 532. If our hash table has 8 slots, we can take the remainder 532%8=4 to figure out where to store a reference to our string in the hash table (Figure 2).

Hashing a String
Figure 2: Hashing a String

Now let's take a look at how a list would be stored. If the list contains the same five characters, so that its hash code is still 4, it would be stored as shown in Figure 3:

Hashing a List
Figure 3: Hashing a List

But what happens if we change the characters in the list after we've added it to the set? For example, suppose that we change the first letter in the list from 'z' to 'X'. The hash function's value is now 498 instead of 532, which means that the modified list belongs in slot 2 rather than slot 4. However, the reference to the list is still in the old location: the set doesn't know that the list's contents have changed, so it hasn't moved its reference to the right location (Figure 4):

After Mutation
Figure 4: After Mutation

This is bad news. If we now ask, "Is the list containing 's', 'e', 'b', 'r', and 'a' in the set?" the answer will be "no", because the reference to the list isn't stored in the location that our hash function tells us to look. It's as if someone changed their name from "Tom Riddle" to "Lord Voldemort", but we left all the personnel records filed under 'R'.

This problem arises with any mutable structure—i.e., any structure whose contents or value can be changed after its creation. Integers and strings are safe to hash because their values are fixed, but the whole point of lists is that we can grow them, shrink them, and overwrite their contents.

Different languages and libraries handle this problem in different ways. One option is to have each list keep track of the sets that it is in, and move itself whenever its values change. However, this is expensive: every time a program touched a list, it would have to see if it was in any sets, and if it was, recalculate its hash code and update all the references to it.

A second option is to shrug and say, "It's the programmer's fault." This is what most languages do, but it's also expensive: programmers can spend hours tracking down the bugs that arise from data being in the wrong place.

Python uses a third option: it only allows programmers to put immutable values in sets. After all, if something's value can't change, neither can its hash code or its location in a hash table.

But if sets can only hold immutable values, what do we do with mutable ones? In particular, how should we store things like (x,y) coordinates, which are naturally represented as lists, or people's names, which are naturally represented as lists of first, middle, and last names? Again, there are several options.

The first is to concatenate those values somehow. For example, if we want to store "Charles" and "Darwin", we'd create the string "Charles Darwin" and store that. This is simple to do, but our code will wind up being littered with string joins and string splits, which will make it slower to run and harder to read. More importantly, it's only safe to do if we can find a concatenator that can never come up in our data. (If we join "Paul Antoine" and "St. Cyr" using a space, there would be three possible ways to split it apart again.)

The second option—the right one—is to use tuples instead of lists. A tuple is an immutable list, i.e., a sequence of values that cannot be changed after its creation. Tuples are created exactly like lists, except we use parentheses instead of square brackets:

>>> full_name = ('Charles', 'Darwin')

They are indexed the same way, too, and functions like len do exactly what we'd expect:

>>> print full_name[0]
Charles
>>> print len(full_name)
2

What we cannot do is assign a new value to a tuple element, i.e., change the tuple after it has been created:

>>> full_name[0] = 'Erasmus'
TypeError: 'tuple' object does not support item assignment

This means that a tuple's hash code never changes, and that means that tuples can be put in sets:

>>> names = set()
>>> names.add(('Charles', 'Darwin'))
>>> print names
set([('Charles', 'Darwin')])

Key Points

  • Sets are stored in hash tables, which guarantee fast access for arbitrary values.
  • The values in sets must be immutable to prevent hash tables misplacing them.
  • Use tuples to store multi-part values in sets.

Challenges

  1. A friend of yours argues, "Finding a value in an unordered list of length N takes N/2 steps on average. Finding it in a hash table takes only one step, but it's a more expensive step, since we have to calculate a hash code for that value. We should therefore use lists for small data sets, and only use things like sets for large ones." Explain the flaws in your friend's reasoning.

  2. Nelle has inherited the following function:

    def is_sample_repeated(left_channel, right_channel, history):
        '''Report repeated samples.  Both channels' values are integers in [0..10] inclusive.'''
        combined = 1000 * left_channel + right_channel
        if combined in history:
            return True
        else:
            history.add(combined)
            return False
    

    How would you improve this function, and why?

  3. Nelle has a function that extracts the latitudes and longitudes of data collection sites from a file:

    >>> sites = extract_sites('north-pacific.dat')
    >>> print sites[:3]
    [[52.097, -173.505], [52.071, -173.510], [51.985, -173.507]]
    

    Write another function called filter_duplicate_sites that takes a list of this kind as its only input, and returns a set (not a list) containing only the unique latitude/longitude pairs.

  4. A list containing just the number 5 is written as [5], and a set containing just that same number is written as {5}. However, a tuple containing just that number must be written with a comma as (5,). Why?

Dictionaries

Notebook

Objectives

  • Explain the similarities and differences between sets and dictionaries.
  • Perform common operations on dictionaries.

Lesson

Now that we know how to find out what kinds of atoms are in our inventory, we want to find out how many of each we have. Our input is a list of several thousand atomic symbols, and the output we want is a list of names and counts.

Once again, we could use a list to store names and counts, but the right solution is to use another new data strucure called a dictionary. A dictionary is a unordered collection of key-value pairs (Fixture 5). The keys are immutable, unique, and unordered, just like the elements of a set. There are no restrictions on the values stored with those keys: they don't have to be immutable or unique. However, we can only look up entries by their keys, not by their values.

A Simple Dictionary
Figure 5: A Simple Dictionary

We create a new dictionary by putting key-value pairs inside curly braces with a colon between the two parts of each pair:

>>> birthdays = {'Newton' : 1642, 'Darwin' : 1809}

The dictionary's keys are the strings 'Newton' and 'Darwin'. The value associated with 'Newton' is 1642, while the value associated with 'Darwin' is 1809. We can think of this as a two-column table:

Key Value
'Newton' 1642
'Darwin' 1809

but it's important to remember that the entries aren't necessarily stored in this order (or any other specific order).

We can get the value associated with a key by putting the key in square brackets:

>>> print birthdays['Newton']
1642

This looks just like subscripting a string or list, except dictionary keys don't have to be integers—they can be strings, tuples, or any other immutable object. It's just like using a phonebook or a real dictionary: instead of looking things up by location using an integer index, we look things up by name.

If we want to add another entry to a dictionary, we just assign a value to the key, just as we create a new variable in a program by assigning it a value:

>>> birthdays['Turing'] = 1612
>>> print birthdays
{'Turing' : 1612, 'Newton' : 1642, 'Darwin' : 1809}

If the key is already in the dictionary, assignment replaces the value associated with it rather than adding another entry (since each key can appear at most once). Let's fix Turing's birthday by replacing 1612 with 1912:

>>> birthdays['Turing'] = 1912
>>> print birthdays
{'Turing' : 1912, 'Newton' : 1642, 'Darwin' : 1809}

Trying to get the value associated with a key that isn't in the dictionary is an error, just like trying to access a nonexistent variable or get an out-of-bounds element from a list. For example, let's try to look up Florence Nightingale's birthday:

>>> print birthdays['Nightingale']
KeyError: 'Nightingale'

If we're not sure whether a key is in a dictionary or not, we can test for it using in:

>>> print 'Nightingale' in birthdays
False
>>> print 'Darwin' in birthdays
True

And we can see how many entries are in the dictionary using len:

>>> print len(birthdays)
3

and loop over the keys in a dictionary using for:

>>> for name in birthdays:
...     print name, birthdays[name]
...
Turing 1912
Newton 1642
Darwin 1809

This is a little bit different from looping over a list. When we loop over a list we get the values in the list. When we loop over a dictionary, on the other hand, the loop gives us the keys, which we can use to look up the values.

We're now ready to count atoms. The main body of our program looks like this:

def main(filename):
    counts = count_atoms(filename)
    for atom in counts:
        print atom, counts[atom]

count_atoms reads atomic symbols from a file, one per line, and creates a dictionary of atomic symbols and counts. Once we have that dictionary, we use a loop like the one we just saw to print out its contents.

Here's the function that does the counting:

def count_atoms(filename):
    '''Count unique atoms, returning a dictionary.'''

    result = {}
    with open(filename, 'r') as reader:
        for line in reader:
            atom = line.strip()
            if atom not in result:
                result[atom] = 1
            else:
                result[atom] = result[atom] + 1
    return result

We start with a docstring to explain the function's purpose to whoever has to read it next. We then create an empty dictionary to fill with data, and use a loop to process the lines from the input file one by one. Notice that the empty dictionary is written {}: this is the "previous use" we referred to when explaining why an empty set had to be written set().

After stripping whitespace off the atom's symbol, we check to see if we've seen it before. If we haven't, we set its count to 1, because we've now seen that atom one time. If we have seen it before, we add one to the previous count and store that new value back in the dictionary. When the loop is done, we return the dictionary we have created.

Let's watch this function in action. Before we read any data, our dictionary is empty. After we see 'Na' for the first time, our dictionary has one entry: its key is 'Na', and its value is 1. When we see 'Fe', we add another entry to the dictionary with that string as a key and 1 as a value. Finally, when we see 'Na' for the second time, we add one to its count.

Input Dictionary
start {}
Na {'Na' : 1}
Fe {'Na' : 1, 'Fe' : 1}
Na {'Na' : 2, 'Fe' : 1}

Just as we use tuples for multi-part entries in sets, we can use them for multi-part keys in dictionaries. For example, if we want to store the years in which scientists were born using their full names, we could do this:

birthdays = {
    ('Isaac', 'Newton') : 1642,
    ('Charles', 'Robert', 'Darwin') : 1809,
    ('Alan', 'Mathison', 'Turing') : 1912
}

If we do this, though, we always have to look things up by the full key: there is no way to ask for all the entries whose keys contain the word 'Darwin', because Python cannot match part of a tuple.

If we think of a dictionary as a two-column table, it is occasionally useful to get one or the other column, i.e., just the keys or just the values:

all_keys = birthdays.keys()
print all_keys
[('Isaac', 'Newton'), ('Alan', 'Mathison', 'Turing'), ('Charles', 'Robert', 'Darwin')]
all_values = birthdays.values()
print all_values
[1642, 1912, 1809]

These methods should be used sparingly: the dictionary doesn't store the keys or values in a list, these methods both actually create a new list as their result. In particular, we shouldn't loop over a dictionary's entries like this:

for key in some_dict.keys():
    ...do something with key and some_dict[key]

since "for key in some_dict" is shorter and much more efficient.

Key Points

  • Use dictionaries to store key-value pairs with distinct keys.
  • Create dictionaries using {k1:v1, k2:v2, ...}
  • Dictionaries are mutable, i.e., they can be updated in place.
  • Dictionary keys must be immutable, but values can be anything.
  • Use tuples to store multi-part keys in dictionaries.
  • dict[key] refers to the dictionary entry with a particular key.
  • key in dict tests whether a key is in a dictionary.
  • len(dict) returns the number of entries in a dictionary.
  • A loop over a dictionary produces each key once, in arbitrary order.
  • dict.keys() creates a list of the keys in a dictionary.
  • dict.values() creates a list of the keys in a dictionary.

Challenges

  1. What is one possible output of the following program? And why does this question say "one possible output" instead of "the output"?

    periods = {'Mercury' : 87.97, 'Venus' : 224.70}
    print periods
    periods.update({'Earth' : 3.6526, 'Mars' : 686.98})
    print periods
    periods['Earthy'] = 365.26
    print periods
    
  2. Fan has a table with the pH levels of samples as the keys, and the percentage of carbon-12 as the values:

    pH C12
    7.43 0.48
    7.51 0.47
    7.56 0.45

    He needs to interpolate between these values, i.e., to predict the percentage of carbon-12 in the sample for a pH 7.50. Will storing his data in a dictionary:

    {7.43 : 0.48, 7.51 : 0.47, 7.56 : 0.45}
    

    be any more efficient than storing it in a list of pairs:

    [ [7.43, 0.48], [7.51, 0.47], [7.56, 0.45] ]
    

    Why or why not?

  3. Before sets were added to Python, people frequently imitated them using dictionaries; the dictionary's keys were the set's elements, and the dictionary's values were all None. This function calculates the intersection of two such "sets":

    def setdict_intersect(left, right):
        '''Return new dictionary with intersection of keys from left and right.'''
        result = {}
        for key in left:
            if key in right:
                result[key] = None
        return result
    

    Write a function setdict_union that calculates the union of two sets represented in this way.

  4. What does the following function do? Explain when and why you would use it, and write a small example that calls it with sample data.

    def show(writer, format, data):
        keys = data.keys()
        keys.sort()
        for k in keys:
            print >> writer, format % (key, data[key])
    
  5. Dictionaries are more general than lists, since you can trivially simulate a list like ['first', 'second', 'third'] using {0 : 'first', 1 : 'second', 2 : 'third'}. Given that, when and why should you use a list rather than a dictionary?

  6. Why should you not use the name dict as a variable?

Notebook

Aggregation

Objectives

  • Recognize problems that can be solved by aggregating values.
  • Use dictionaries to aggregate values.
  • Explain why actual data values should be used as initializers rather than "impossible" values.

Lesson

To see how useful dictionaries can be, let's switch tracks and do some birdwatching. We'll start by asking how early in the day we saw each kind of bird? Our data consists of the date and time of the observation, the bird's name, and an optional comment:

2010-07-03    05:38    loon
2010-07-03    06:02    goose
2010-07-03    06:07    loon
2010-07-04    05:09    ostrich   # hallucinating?
2010-07-04    05:29    loon
     …           …        …

Rephrasing our problem, we want the minimum of all the times associated with each bird name. If our data was stored in memory like this:

loon = ['05:38', '06:07', '05:20', ...]

the solution would simply be min(loon), and similarly for the other birds. However, we have to work with the data we have, so let's start by reading our data file and creating a list of tuples, each of which contains a date, time, and bird name as strings:

def read_observations(filename):
    '''Read data, returning [(date, time, bird)...].'''

    reader = open(filename, 'r')
    result = []

    for line in reader:
        fields = line.split('#')[0].strip().split()
        assert len(fields) == 3, 'Bad line "%s"' % line
        result.append(fields)

    return result

This function follows the pattern we've seen many times before. We set up by opening the input file and creating an empty list that we'll append records to. We then process each line of the file in turn. Splitting the line on the '#' character and taking the first part of the result gets rid of any comment that might be present; stripping off whitespace and then splitting breaks the remainder into fields.

To prevent trouble later on, we check that there actually are three fields before going on. (An industrial-strength version of this function would also check that the date and time were properly formatted, but we'll skip that for now.) Once we've done our check, we append the triple containing the date, time, and bird name to the list we're going to return.

Here's the function that turns that list of tuples into a dictionary:

def earliest_observation(data):
    '''How early did we see each bird?'''

    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            result[bird] = time
        else:
            result[bird] = min(result[bird], time)

    return result

Once again, the pattern should by now be familiar. We start by creating an empty dictionary, then use a loop to inspect each tuple in turn. The loop explodes the tuple into separate variables for the date, time and bird. If the bird's name is not already a key in our dictionary, this must be the first time we've seen it, so we store the time we saw it in the dictionary. If the bird's name is already there, on the other hand, we keep the minimum of the stored time and the new time. This is almost exactly the same as our earlier counting example, but instead of either storing 1 or adding 1 to the count so far, we're either storing the time or taking the minimum of it and the least time so far.

Now, what if we want to find out which birds were seen on particular days? Once again, we are aggregating values, i.e., combining many separate values to create one new one. However, since we probably saw more than one kind of bird each day, that "new value" needs to be a collection of some kind. We're only interested in which birds we saw, so the right kind of collection is a set. Here's our function:

def birds_by_date(data):
    '''Which birds were seen on each day?'''

    result = {}
    for (date, time, bird) in data:
        if date not in result:
            result[date] = {bird}
        else:
            result[date].add(bird)

    return result

Again, we start by creating an empty dictionary, and then process each tuple in turn. Since we're recording birds by date, the keys in our dictionary are dates rather than bird names. If the current date isn't already a key in the dictionary, we create a set containing only this bird, and store it in the dictionary with the date as the key. Otherwise, we add this bird to the set associated with the date. (As always, we don't need to check whether the bird is already in that set, since the set will automatically eliminate any duplication.)

Let's watch this function in action for the first few records from our data:

Input Dictionary
start {}
2010-07-03  05:38  loon {'2010-07-03' : {'loon'}}
2010-07-03  06:02  goose {'2010-07-03' : {'goose', 'loon'}}
2010-07-03  06:07  loon {'2010-07-03' : {'goose', 'loon'}}
2010-07-04  05:09  ostrich {'2010-07-03' : {'goose', 'loon'}, '2010-07-04' : {'ostrich'}}
2010-07-04  05:29  loon {'2010-07-03' : {'goose', 'loon'}, '2010-07-04' : {'ostrich', 'loon'}}

For our last example, we'll figure out which bird we saw least frequently—or rather, which birds, since two or more may be tied for the low score. Forgetting that values may not be unique is a common mistake in data crunching, and often a hard one to track down.

Our first strategy is simple: figure out how many times we've seen each bird, then find the minimum of those counts and get the set of birds we've seen that many times. The function below implements this fairly directly:

def least_common_birds(data):
    '''Which bird or birds have been seen least frequently?'''
    
    counts = count_by_bird(data)
    least = min(counts.values())
    result = set()
    for bird in counts:
        if counts[bird] == least:
            result.add(bird)
    return result

least_common_birds depends on a function count_by_bird, but this is yet another example of using a dictionary to aggregate values (in this case, to sum the number of birds we have seen). Just for variety's sake, we'll use a slightly different strategy that we've used before: whenever we see a new kind of bird, we'll set its count to zero, and then always add one to the stored count:

def count_by_bird(data):
    '''How many times was each bird seen?'''
    result = {}
    for (date, time, bird) in data:
        if bird not in result:
            result[bird] = 0
        result[bird] += 1
    return result

Finally, we'll test our function:

print least_common_birds(entries)
set(['goose', 'ostrich'])

This does the job, but is somewhat inefficient: we do one pass through all the data while counting birds, then another pass through all the birds to find those that we've seen the least number of times. We can actually do the whole job with a single pass through the data, but as we'll see in the challenges, the resulting code is significantly more complex than what we have written so far. Unless we're sure that the second pass is really a performance bottleneck, we should stick with this simple implementation.

Key Points

  • Use dictionaries to count things.
  • Initialize values from actual data instead of trying to guess what values could "never" occur.

Challenges

  1. Draw a blob-and-arrow diagram of the two dictionaries in least_common_birds and all the data they refer to after the following seven lines of data have been processed:

    2013-06-23    05:31    sparrow
    2013-06-25    06:19    robin
    2013-07-03    06:21    robin
    2013-07-17    05:28    cardinal
    2013-07-19    05:28    robin
    2013-07-19    05:29    penguin
    2013-07-19    05:30    penguin
    
  2. We have frequently used the idiom:

    if key in data:
        data[key] = data[key] + 1
    else:
        data[key] = 1
    

    to either update the value associated with a key, or insert a value if the key isn't present. We can rewrite this as:

    if key not in data:
        data[key] = 0
    data[key] += 1
    

    but it's even better to use:

    data[key] = data.get(key, 0) + 1
    

    Rewrite the examples in this lesson to use this idiom, and explain why we can't simplify it even further by writing:

    data.get(key, 0) += 1
    
  3. Modify least_common_birds so that it returns a list of all the birds that have been seen, sorted from least common to most common. (Birds that appeared with equal frequency should be sorted alphabetically by name.)

  4. Write a function dict_subtract that "subtracts" one dictionary mapping names to numbers from another. For example:

    assert dict_subtract({'X' : 3},          {'X' : 2})          == {'X' : 1}
    assert dict_subtract({'X' : 3, 'Y' : 2}, {'X' : 5, 'Z' : 1}) == {'X' : -2, 'Y' : 2, 'Z' : -1}
    
  5. It's possible to figure out which birds have been seen the least number of times using only a single pass through the data. The strategy is:

    • Use a dictionary counts_by_bird to keep track of how many times each bird has been seen, and another birds_by_count to keep track of which birds have been seen how often. The first uses bird names as keys, and counts as values; the second uses counts as keys, and sets of bird names as values.
    • When a bird is seen for the first time, it is added to the set stored with birds_by_count[0], and counts_by_bird[bird] is set to 1.
    • When a bird is seen for the second or subsequent time, counts_by_bird[bird] is incremented, and the bird is taken out of the set stored in birds_by_count[old_count] and added to the set stored in birds_by_count[new_count].
    • Once all the data has been read, the set associated with the smallest key in birds_by_count is returned.

    The diagram below shows the two data structures used by this algorithm and how they change when "loon" is read for the third time:

    Single Pass Aggregation
    Single Pass Aggregation

    Do you think this approach will actually run faster than the one used in the lesson? If so, why? If not, why not? And in either case, how much more complex do you think the code will be than the code given in the lesson? What measure of "complex" did you use, and why?

Nanotech Inventory

Notebook

Objectives

  • Create and manipulate nested dictionaries.
  • Explain the similarities and differences between nested dictionaries and nested lists.

Lesson

We can now solve Fan's original nanotech inventory problem. As explained in the introduction, our goal is to find out how many molecules of various kinds we can make using the atoms in our warehouse. The number of molecules of any particular type we can make is limited by the scarcest atom that molecule requires. For example, if we have five nitrogen atoms and ten hydrogen atoms, we can only make three ammonia molecules, because we need three hydrogen atoms for each.

The formulas for the molecules we know how to make are stored in a file like this:

# Molecular formula file

helium : He 1
water : H 2 O 1
hydrogen : H 2

and our inventory is stored in a file like this:

# Atom inventory file

He 1
H 4
O 3

Let's start by reading in our inventory. It consists of pairs of strings and numbers, which by now should suggest using a dictionary for storage. The keys will be atomic symbols, and the values will be the number of atoms of that kind we currently have (Figure 6). If an atom isn't listed in our inventory, we'll assume that we don't have any.

Nanotech Inventory
Figure 6: Nanotech Inventory

What about the formulas for the molecules we know how to make? Once again, we want to use strings—the names of molecules—as indices, which suggests a dictionary. Each of its values will be something storing atomic symbols and the number of atoms of that type in the molecule—the same structure, in fact, that we're using for our inventory. Figure 7 shows what this looks like in memory if the only molecules we know how to make are water and ammonia.

Storing Formulas
Figure 7: Storing Formulas

Finally, we'll store the results of our calculation in yet another dictionary, this one mapping the names of molecules to how many molecules of that kind we can make (Figure 8).

Nanotech Results
Figure 8: Nanotech Results

The main body of the program is straightforward: it reads in the input files, does our calculation, and prints the result:

def main(inventory_file, formula_file):
    inventory = read_inventory(inventory_file)
    formulas = read_formulas(formula_file)
    counts = calculate_counts(inventory, formulas)
    show_counts(counts)

Reading the inventory file is simple. We take each interesting line in the file, split it to get an atomic symbol and a count, and store them together in a dictionary:

def read_inventory(inventory_file):
    result = {}
    with open(inventory_file, 'r') as reader:
        for line in reader:
            name, count = line.strip().split()
            result[name] = int(count)
    return result

Let's test it:

print read_inventory('inventory-03.txt')
ValueError                                Traceback (most recent call last)
 in ()
----> 1 print read_inventory('inventory-03.txt')

 in read_inventory(inventory_file)
      3     with open(inventory_file, 'r') as reader:
      4         for line in reader:
----> 5             name, count = line.strip().split()
      6             result[name] = int(count)
      7     return result

ValueError: too many values to unpack

Our mistake was to forget that files can contain blank lines and comments. It's easy enough to modify the function to handle them, though it complicates the logic:

def read_inventory(inventory_file):
    result = {}
    with open(inventory_file, 'r') as reader:
        for line in reader:
            line = line.strip()
            if (not line) or line.startswith('#'):
                continue
            name, count = line.split()
            result[name] = int(count)
    return result

print read_inventory('inventory-03.txt')
{'H': 4, 'O': 3, 'He': 1}

The next step is to read the files containing formulas. Since the file format is more complicated, the function is as well. In fact, it's complicated enough that we'll come back later and simplify it.

def read_formulas(formula_file):
    result = {}
    with open(formula_file, 'r') as reader:
        for line in reader:
            line = line.strip()
            if (not line) or line.startswith('#'):
                continue
            name, atoms = line.split(':')
            name = name.strip()
            atoms = atoms.strip().split()
    
            formula = {}
            for i in range(0, len(atoms), 2):
                symbol = atoms[i].strip()
                count = int(atoms[i+1])
                formula[symbol] = count
            result[name] = formula

    return result

We start by creating a dictionary to hold our results. We then split each interesting line in the data file on the colon ':' to separate the molecule's name (which may contain spaces) from its formula. We then split the formulas into a list of strings. These alternate between atomic symbols and numbers, so in the inner loop, we move forward through those values two elements at a time, storing the atomic symbol and count in a dictionary. Once we're done, we store that dictionary as the value for the molecule name in the main dictionary. When we've processed all the lines, we return the final result. Here's a simple test:

print read_formulas('formulas-03.txt')
{'water': {'H': 2, 'O': 1}, 'hydrogen': {'H': 2}, 'helium': {'He': 1}}

Now that we have all our data, it's time to calculate how many molecules of each kind we can make. inventory maps atomic symbols to counts, and so does formulas[name], so let's loop over all the molecules we know how to make and "divide" the inventory by each one:

def calculate_counts(inventory, formulas):
    '''Calculate how many of each molecule can be made with inventory.'''

    counts = {}
    for name in formulas:
        counts[name] = dict_divide(inventory, formulas[name])

    return counts

We say we're "dividing" the inventory by each molecule because we're trying to find out how many of that molecule we can make without requiring more of any particular atom than we actually have. (By analogy, when we divide 11 by 3, we're trying to find out how many 3's we can make from 11.) The function that does the division is:

def dict_divide(inventory, molecule):
    number = None
    for atom in molecule:
        required = molecule[atom]
        available = inventory.get(atom, 0)
        limit = available / required
        if (number is None) or (limit < number):
            number = limit

    return number

This function loops over all the atoms in the molecule we're trying to build, see what limit the available inventory puts on us, and return the minimum of all those results. This function uses a few patterns that come up frequently in many kinds of programs:

  1. The first pattern is to initialize the value we're going to return to None, then test for that value inside the loop to make sure we re-set it to a legal value the first time we have real data. In this case, we could just as easily use -1 or some other impossible value as an "uninitialized" flag for number.
  2. Since we're looping over the keys of molecule, we know that we can get the value stored in molecule[atom]. However, that atom might not be a key in inventory, so we use inventory.get(atom, 0) to get either the stored value or a sensible default. In this case zero, the sensible default is 0, because if the atom's symbol isn't in the dictionary, we don't have any of it. This is our second pattern.
  3. The third is using calculate, test, and store to find a single value—in this case, the minimum—from a set of calculated values. We could calculate the list of available over required values, then find the minimum of the list, but doing the minimum test as we go along saves us having to store the list of intermediate values. It's probably not a noticeable time saving in this case, but it would be with larger data sets.

The last step in building our program is to show how many molecules of each kind we can make. We could just loop over our result dictionary, printing each molecule's name and the number of times we could make it, but let's put the results in alphabetical order to make it easier to read:

def show_counts(counts):
    names = counts.keys()
    names.sort()
    for name in names:
        print name, counts[name]

It's time to test our code. Let's start by using an empty inventory and a single formula:

Inventory Formulas Output
# inventory-00.txt
# formulas-00.txt

      

There's no output, which is what we expect. Let's add a formula but no atoms:

Inventory Formulas Output
# inventory-00.txt
# formulas-01.txt
helium : He 1
helium 0

and now an atom:

Inventory Formulas Output
# inventory-01.txt
He 1
# formulas-01.txt
helium : He 1
helium 1

That seems right as well. Let's add some hydrogen and another formula:

Inventory Formulas Output
# inventory-02.txt
He 1
H 4
# formulas-01.txt
helium : He 1
water : H 2 O 1
helium 1
water 0

The output doesn't change, which is correct. Our final test adds some oxygen:

Inventory Formulas Output
# inventory-03.txt
He 1
H 4
O 3
# formulas-03.txt
helium : He 1
water: H 2 O 1
hydrogen : H 2
helium 1
water 2

That's right too: we can make two water molecules (because we don't have enough hydrogen to pair with our three oxygen atoms).

Refactoring

There are quite a few other interesting tests still to run, but before we do that, we should clean up our code. Both of our input functions handle comments and blank lines the same way; let's put that code in a helper function:

def readlines(filename):
    result = []
    with open(filename, 'r') as reader:
        for line in reader:
            line = line.strip()
            if line and (not line.startswith('#')):
                result.append(line)
    return result

If we convert read_inventory to use it, the result is six lines long instead of ten. More importantly, the logic of what we're doing is much clearer:

def read_inventory(inventory_file):
    result = {}
    for line in readlines(inventory_file):
        name, count = line.split()
        result[name] = int(count)
    return result

The converted version of read_formulas is 15 lines instead of 19:

def read_formulas(formula_file):
    result = {}
    for line in readlines(formula_file):
        name, atoms = line.split(':')
        name = name.strip()
        atoms = atoms.strip().split()

        formula = {}
        for i in range(0, len(atoms), 2):
            symbol = atoms[i].strip()
            count = int(atoms[i+1])
            formula[symbol] = count
        result[name] = formula

    return result

but we can do better still by putting the code that handles atom/count pairs in a helper function of its own:

def read_formulas(formula_file):
    result = {}
    for line in readlines(formula_file):
        name, atoms = line.split(':')
        name = name.strip()
        result[name] = make_formula(atoms)
    return result

def make_formula(atoms):
    formula = {}
    atoms = atoms.strip().split()
    for i in range(0, len(atoms), 2):
        symbol = atoms[i].strip()
        count = int(atoms[i+1])
        formula[symbol] = count
    return formula

This change has actually made the code slightly longer, but each function now does one small job, and as a bonus, the code in make_formula (which is moderately complex) can now be tested on its own.

Key Points

  • Whenever names are used to label things, consider using dictionaries to store them.
  • Use nested dictionaries to store hierarchical values (like molecule names and atomic counts).
  • Get it right, then refactor to make each part simple.
  • Test after each refactoring step.

Challenges

  1. Trace the behavior of read_formulas by showing the value of each variable each time line #6 finishes executing when given the data file:

    helium  : He 1
    ammonia : N 1 H 3
    cyanide : H 1 C 1 N 1
    
    result line name atoms formula
    1) after "helium":
    2) after "ammonia":
    3) after "cyanide":
  2. Can one dictionary be used as a key in another? I.e., is it possible to create the structure:

    { {'site' : 3, 'affinity' : 6} : 'sampled'}
    

    If so, give an example showing when this would be useful. If not, explain why not.

  3. A geographic information system stores the distance between survey points in a dictionary of dictionaries like this:

    dist = {
        'Left Bend'  : {'Sump Creek' : 25.6,
                        'Brents Bay' : 31.1,
                        'Ogalla'     :  4.0},
        'Sump Creek' : {'Brents Bay' : 17.5,
                        'Ogalla'     : 19.2},
        'Brents Bay' : {'Ogalla'     : 20.1}
    }
    

    Given this structure, what is the simplest Python function that will return the distance between any two survey points?

  4. Fan has inherited an activity log for an experimental project formatted as shown below:

    2012-11-30: Re-setting equipment.
    2012-12-12: First run with acidic reagants.
    2012-12-12: Re-ran acidic reagants.
    2012-12-14: Tried neutral reagants again.
    2013-02-05: Back to this stuff after writing up the CSRTI paper.
    2013-02-06: Trying basic reagants this time.
    

    He has written a function to translate this into a dictionary of dictionaries of sets, where the outer dictionary's keys are years (as strings), the inner dictionary's keys are months (also as strings), and the innermost sets are the days (strings again) for which there are comments. For example, the output of this function for the data sample above is supposed to be:

    {
        '2012' : {
            '11' : {'30'},
            '12' : {'12', '14'}
        },
        '2013' : {
            '02' : {'05', '06'}
        }
    }
    

    His function is:

    def extract_dates(filename):
        reader = open(filename, 'r')
        result = {}
        for line in reader:
            year, month, day = line.strip().split(' ', 1)[0].split('-')
            if year not in result:
                pass # fill in 1
            if month not in result[year]:
                pass # fill in 2
            pass # fill in 3
        reader.close()
        return result
    

    Fill in the three missing lines with a single statement each so that this function returns the right answer.

JSON

Notebook

Objectives

  • Correctly define "JSON" and give simple examples of valid JSON structures.
  • Describe JSON's strengths and weaknesses as a storage format.
  • Write code to read and write JSON-formatted data files using standard libraries.

Lesson

The example above used two data file formats: one for storing molecular formulas, the other for storing inventory. Both formats were specific to this application, which means we needed to write, debug, document, and maintain functions to handle them. Those functions weren't particularly difficult to create, but they still took time to create, and if anyone ever wants to read our files in Java, MATLAB, or Perl, they'll have to write equivalent functions themselves.

A growing number of programs avoid these problems by using a flexible data format called JSON, which stands for "JavaScript Object Notation". Despite the name, it is a language-independent way to store nested data structures made up of strings, numbers, Booleans, lists, dictionaries, and the special value null (equivalent to Python's None)—in short, the basic data types that almost every language supports. For example, let's convert a dictionary of scientists' birthdays to a string:

>>> import json
>>> birthdays = {'Curie' : 1867, 'Hopper' : 1906, 'Franklin' : 1920}
>>> as_string = json.dumps(birthdays)
>>> print as_string
{"Curie": 1867, "Hopper": 1906, "Franklin": 1920}
>>> print type(as_string)
<type 'str'>

json.dumps doesn't seem to do much, but that's kind of the point: the textual representation of the data structure looks pretty much like what a programmer would type in to re-create it. The advantage is that this representation can be saved in a file:

>>> import json
>>>
>>> writer = open('/tmp/example.json', 'w')
>>> json.dump(birthdays, writer)
>>> writer.close()
>>>
>>> reader = open('/tmp/example.json', 'r')
>>> duplicate = json.load(reader)
>>> reader.close()
>>>
>>> print 'original:', birthdays
original: {'Curie': 1867, 'Hopper': 1906, 'Franklin': 1920}
>>> print 'duplicate:', duplicate
duplicate: {u'Curie': 1867, u'Hopper': 1906, u'Franklin': 1920}
>>> print 'original == duplicate:', birthdays == duplicate
original == duplicate: True
>>> print 'original is duplicate:', birthdays is duplicate
original is duplicate: False

As the example above shows, saving and loading data is as simple as opening a file and then calling one function. The data file holds what we'd type in to create the data in a program:

$ cat /tmp/example.json
{"Curie": 1867, "Hopper": 1906, "Franklin": 1920}

which makes it easy to edit by hand.

How is this different in practice from what we had? First, our inventory file now looks like this:

{"He" : 1,
 "H" : 4,
 "O" : 3}

while our formulas files look like:

{"helium"   : {"He" : 1},
 "water"    : {"H" : 2, "O" : 1},
 "hydrogen" : {"H" : 2}}

Those aren't as intuitive for non-programmers as the original flat text files, but they're not too bad. The worst thing is the lack of comments: unfortunately—very unfortunately—the JSON format doesn't support them. (And note that JSON requires us to use a double-quote for strings: unlike Python, we cannot substitute single quotes.)

The good news is that given files like these, we can rewrite our program as:

'''Calculate how many molecules of each type can be made with the atoms on hand.'''
import json

def main(inventory_file, formulas_file):
    '''Main driver for program.'''
    with open(inventory_file, 'r') as reader:
        inventory = json.load(reader)
    with open(formulas_file, 'r') as reader:
        formulas = json.load(reader)
    counts = calculate_counts(inventory, formulas)
    show_counts(counts)

def calculate_counts(inventory, formulas):
    ...as before...

def dict_divide(inventory, molecule):
    ...as before...

def show_counts(counts):
    ...as before...

The two functions that read formula and inventory files have been replaced with a single function that reads JSON. Nothing else has to change, because the data structures loaded from the data files are exactly what we had before. The end result is 51 lines long compared to the 80 we started with, a reduction of more than a third.

Nothing's Perfekt

JSON's greatest weakness isn't its lack of support for comments, but the fact that it doesn't recognize and manage aliases. Instead, each occurrence of an aliased structure is treated as something brand new when data is being saved. For example:

>>> inner = ['name']
>>> outer = [inner, inner] # Creating an alias.
>>> print outer
[['name'], ['name']]
>>> print outer[0] is outer[1]
True
>>> as_string = json.dumps(outer)
>>> duplicate = json.loads(as_string)
>>> print duplicate
[[u'name'], [u'name']]
>>> print duplicate[0] is duplicate[1]
False

Figure 9 shows the difference between the original data structure (referred to by outer) and what winds up in duplicate. If aliases might be present in our data, and it's important to preserve their structure, we must either record the aliasing ourself (which is tricky), or use some other format. Luckily, a lot of data either doesn't contain aliases, or the aliasing in it isn't important.

Aliasing in JSON
Figure 9: Aliasing in JSON

Key Points

  • The JSON data format can represent arbitrarily-nested lists and dictionaries containing strings, numbers, Booleans, and None.
  • Using JSON reduces the code we have to write ourselves and improves interoperability with other programming languages.
  • JSON doesn't allow for comments, and doesn't handle aliasing.

Challenges

  1. A friend of yours says, "I understand why flat text files are not ideal, but wouldn't it be better to use comma-separated values (CSV) than JSON? It's easier to read, and more programs support it." What example could you show your friend to explain JSON's advantages?

  2. json.dump has an extra parameter called sort_keys; its default value is False, but if it is True, then all dictionaries are printed with keys in sorted order. Explain why this option isn't True by default, and how setting it to True can be useful in testing.

  3. If we really do need to add comments to JSON files, how can we do it without altering the format?

  4. The bird watching data from an earlier section was stored like this:

    2010-07-03    05:38    loon
    2010-07-03    06:02    goose
    2010-07-03    06:07    loon
    2010-07-04    05:09    ostrich
    2010-07-04    05:29    loon
    

    How would you represent this as JSON? If you rewrite the early_bird.py program (that finds the earliest time each bird was seen) so that it uses your JSON format, how much code do you save?

Phylogenetic Trees

Objectives

  • That many "matrix" problems may be best solved using dictionaries.
  • Why the values in multi-part keys should be ordered.

Lesson

As Theodosius Dobzhansky said almost a century ago, nothing in biology makes sense except in the light of evolution. Since mutations usually occur one at a time, the more similarities there are between the DNA of two species, the more recently they had a common ancestor. We can use this idea to reconstruct the evolutionary tree for a group of organisms using a hierarchical clustering algorithm.

We don't have to look at the natural world very hard to realize that some organisms are more alike than others. For example, if we look at the appearance, anatomy, and lifecycles of the seven fish shown in Figure 10, we can see that three pairs are closely related. But where does the seventh fit in? And how do the pairs relate to each other?

Pairing Up Species Pairing Up Species
Figure 10: Pairing Up Species

The first step is to find the two species that are most similar, and construct their plausible common ancestor. We then pair two more, and two more, and start joining pairs to individuals, or pairs with other pairs. Eventually, all the organisms are connected. We can redraw those connections as a tree, using the heights of branches to show the number of differences between the species we're joining up (Figure 11).

Pairing Up Species
Figure 11: Tree of Life

Let's turn this into an algorithm:

S = {all organisms}
while S != {}:
  a, b = two closest entries in U
  p = common parent of {a, b}
  S = S - {a, b}
  S = S + {p}

Initially, the set S contains all the species we're interested in. Each time through the loop, we find the two that are closest, create their common parent, remove the two we just paired up from the set, and insert the newly-created parent. Since the set shrinks by one element each time (two out, one in), we can be sure this algorithm eventually terminates.

But how do we calculate the distance between an inferred parent and other species? One simple rule is to use the average distance between that other species and the two species that were combined to create that parent. Let's illustrate it by calculating a phylogenetic tree for humans, vampires, werewolves, and mermaids. The distances between each pair of species is shown in Figure 12 and in the table below. (We only show the lower triangle because it's symmetric.)

Distances Between Species Distances Between Species Distances Between Species Distances Between Species
Figure 12: Distances Between Species
  human vampire werewolf mermaid
human        
vampire 13      
werewolf 5 6    
mermaid 12 15 29  

The closest entries—i.e., the pair with minimum distance—are human and werewolf. We replace this with a common ancestor, which we will call HW, then set the distance between it and each other species X to be (HX + WX)/2, i.e., the average of the human-to-X and werewolf-to-X distances. This gives us a new table:

  HW vampire mermaid
HW      
vampire 9.5    
mermaid 20.5 15  

Repeating this step, we combine HW with V:

  HWV mermaid
HWV    
mermaid 17.75  

and finally HWV with M.

We illustrated our algorithm with a triangular matrix, but the order of the rows and columns is arbitrary. The matrix is really just a lookup table mapping species to distances, and as soon as we think of lookup tables, we should think of dictionaries. The keys are species—either the ones we started with, or the ones we created—and the values are the distances between them, so our original table becomes:

{
    ('human',   'mermaid')  : 12,
    ('human',   'vampire')  : 13,
    ('human',   'werewolf') :  5,
    ('mermaid', 'vampire')  : 15,
    ('mermaid', 'werewolf') : 29,
    ('vampire', 'werewolf') :  6
}

There is one trick here. Whenever we have a distance, such as that between mermaids and vampires, we have to decide whether to use the key ('mermaid', 'vampire') or ('vampire', 'mermaid') (or to record the value twice, once under each key).

Let's start by setting up our test case and then calling a top-level function to process our data:

if __name__ == '__main__':

    species = {'human', 'mermaid', 'werewolf', 'vampire'}

    scores = {
        ('human',   'mermaid')  : 12,
        ('human',   'vampire')  : 13,
        ('human',   'werewolf') :  5,
        ('mermaid', 'vampire')  : 15,
        ('mermaid', 'werewolf') : 29,
        ('vampire', 'werewolf') :  6
    }

    order = main(species, scores)
    print order

In a real program, of course, the data would be read in from a file, and the set of actual species' names would be generated from it, but this will do for now.

Next, let's translate our algorithm into something that could be runnable Python:

def main(species, scores):
    result = []
    while len(species) > 1:
        left, right = find_min_pair(species, scores)
        result.append(make_pair(left, right))
        species -= {left, right}
        make_new_pairs(species, scores, left, right)
        species.add(make_name(left, right))
    return result

This is almost a direct translation of our starting point; the only significant difference is that we're keeping the set of "active" species in a set, and the scores in a dictionary. As species are combined, we remove their names from the set and add a made-up name for their parent. We never actually remove scores from the table; once the name of a species is out of the set species, we'll never try to look up anything associated with it in scores again.

The next step is to write find_min_pair to find the lowest score currently in the table:

def find_min_pair(species, scores):
    min_pair = None
    min_val = None
    for left in species:
        for right in species:
            if left < right:
                this_pair = make_pair(left, right)
                if (min_val is None) or (scores[this_pair] < min_val):
                    min_pair = this_pair
                    min_val = scores[this_pair]
    return min_pair

This function loops over all possible combinations of species names, but only actually uses the ones that pass our ordering test (i.e., the ones for which the first species name comes before the second species name). If this is the first score we've looked at, or if it's lower than a previously-seen score, we record the pair of species and the associated score. When we're done, we return the pair of species.

The function that makes new entries for the table is fairly straightforward as well. It just loops over all the active species, averages the distances between them and the two species being combined, and puts a new score in the table:

def make_new_pairs(species, scores, left, right):
    for current in species:
        left_score = scores[make_pair(current, left)]
        right_score = scores[make_pair(current, right)]
        new_score = (left_score + right_score) / 2.0
        scores[make_pair(current, make_name(left, right))] = new_score

Finally, the make_pair and make_name functions are simply:

def make_pair(left, right):
    if left < right:
        return (left, right)
    else:
        return (right, left)

def make_name(left, right):
    return '<%s, %s>' % make_pair(left, right)

Let's try running the program:

$ python phylogen.py
[('human', 'werewolf'), ('<human, werewolf>', 'vampire'), ('<<human, werewolf>, vampire>', 'mermaid')]

This shows that humans and werewolves were combined first, that their pairing was then combined with vampires, and that mermaids were added to the cluster last. We obviously should do a lot more testing, but so far, we seem to be on the right track.

Key Points

  • Problems that are described using matrices can often be solved more efficiently using dictionaries.
  • When using tuples as multi-part dictionary keys, order the tuple entries to avoid accidental duplication.

Challenges

  1. Work through the clustering algorithm by hand for the following distance matrix:

      centaur hippogriff pegasus unicorn
    centaur        
    hippogriff 19      
    pegasus 7 23    
    unicorn 7 12 15  

    Check your answers against the output of the program. Would you use this as a test case to check the program's correctness? Why or why not?

  2. The body of the loop in main is:

    left, right = find_min_pair(species, scores)
    result.append(make_pair(left, right))
    species -= {left, right}
    make_new_pairs(species, scores, left, right)
    species.add(make_name(left, right))
    

    What happens if the last line (the call to species.add) is move up above the call to make_new_pairs? (Hint: what assumptions does make_new_pairs make that wouldn't be true if the call to species.add was moved up one line?)

  3. What would happen if the line:

    species -= {left, right}
    

    was moved down one line, so that it was executed after the call to make_new_pairs? (Hint: what assumptions does make_new_pairs make that wouldn't be true if the set subtraction was moved down one line?)

  4. Write docstrings for all five functions in this program. How self-contained are they? How much do you have to explain about who is going to call them, and when, in order to explain what they do and why?

Summary

Every programmer meets lists (or arrays or matrices) early in her career. Many in science never meet sets and dictionaries, and that's a shame: they often make programs easier to write and faster to run at the same time.

Before we leave this topic, try running the function globals at an interactive Python prompt:

>>> globals()
{'__builtins__': <module '__builtin__' (built-in)>,
 '__doc__': None,
 '__name__': '__main__',
 '__package__': None}

That's right—Python actually stores the program's variables in a dictionary. In fact, it uses one dictionary for the global variables and one for each currently-active function call:

>>> def example(first, second):
...     print 'globals in example', globals()
...     print 'locals in example', locals()
... 
>>> example(22, 33)
globals in example {'__builtins__': <module '__builtin__' (built-in)>,
                    '__doc__': None,
                    '__name__': '__main__',
                    '__package__': None,
                    'example': <function example at 0x50b630>}
locals in example {'second': 33,
                   'first': 22}

You now know everything you need to know in order to build a programming language of your own. But please don't: the world will be much better off if you keep doing science instead.