COSC122 Lab 5

Hashing

Goals

This lab will give you some practice with hashing and hash tables; in this lab you will:

• test a simple spell-checker that uses linear search;

• complete a spell-checker that uses the following hash table implementations;

– chaining

– linear probing

– quadratic probing

• Investigate how the hash table load factor affects performance.

You should be familiar with the material in Section 5.2.3 – Hashing 1 of the textbook before attempting this lab.

Important Note

To get the most from this lab you really need to carefully work through this handout doing each task in

order. If you jump sections then your code may not work as expected, for example, in the chaining hash

table we start with a simple hashing function and then fix it to get better results. If you jump ahead then

you miss the point where it is fixed and it will take a very long time to do trial runs! Also, some doc-tests

will fail initially and we explain where this is ok.

Hashing

The goal of hashing and hash tables is to produce a collection with O(1) performance by reducing items

to a hash value and storing them in a table of possible hash values. There are three main components

to this process:

1. the hash table,

2. the hash function, and

3. the method for dealing with hash collisions.

Spell Checking

Automated spell checkers compare every word in a document to the words in a pre-defined dictionary.

2 While this may seem like a trivially easy task, the challenge is in making the process fast; there are over

600,000 words in the Oxford English Dictionary, and many more that are used every day, but aren’t on

that list—grammatical variants, jargon and technical terms, slang, foreign words, etc. Checking a single

sentence could easily cost millions of comparisons if you’re not careful.

The provided spelling.py module contains a few methods and stubs that will be used in this lab,

such as:

1http://interactivepython.org/runestone/static/pythonds/SortSearch/Hashing.html

2Modern spell checkers are a bit more advanced, but for this lab, we’ll keep things simple.

• load_dict_words: loads the words from a dictionary file into a list.

• load_doc_words: reads the words (sequences of characters between A–Z or a–z) from a document, converts them all into lowercase, and returns them as a list.

• ChainingHashTable: an initially inefficient chaining hash table class.

• LinearHashTable: an open addressing hash table class with linear probing.

• QuadraticHashTable: an open addressing hash table class with quadratic probing.

There are several other files provided in the lab archive, which will be used as test input:

• tester.py contains some starter tests, that you will expand upon.

• words_latin-1.txt: a file containing over 610,000 words 3

– the dictionary we will use to check

spelling.

• sherlock.txt: the first adventure of Sir Arthur Conan Doyle’s Memoirs of Sherlock Holmes; ~1,000

words.

• trgov.txt: John Locke’s Two Treatises of Government; ~56,000 words.

• humnature.txt: David Hume’s A Treatise of Human Nature; ~225,000 words.

The easy, but slow, way

Complete the spellcheck_with_list stub method in the spelling module. The function receives

lists of all words in the document and dictionary, and should go through every word in the document to

see if it is also in the dictionary, using a naïve, linear method (ie, ’in’ will be handy). The function should

print out each word in the document that doesn’t appear in the dictionary and keep a running total of

the number of words that weren’t in the dictionary, along with a set of words that were misspelled (ie,

unique_errors is a set of words that were misspelled and therefore contains each word only once).

You can test your spell checker using the shell, as in the example below. But, it will be better to use

the tester.py module that we provide and add in various tests of your own. You can then use Wing’s

Debug function to see the output as it happens, because full runs take some time.

>>> from spelling import *

>>> dictionary = load_dict_words(“words_latin-1.txt”)

>>> document = load_doc_words(“sherlock.txt”)

>>> spellcheck_with_list(document, dictionary)

————————————————–

Misspelled words:

————————————————–

pyland

favorite

…

Many of the “incorrect” words are actually correctly spelt (or the American spelling of ) names

and places! Even with just over 613,000 words, our dictionary is far from complete.

As you should observe, it takes quite a while to check the 10,000 words in sherlock.txt. Think

about how long it would take to check the larger documents (with up to 225,000 words).

> Complete the Sherlock question(s) in Lab quiz Quiz 5.

3From SCOWL: http://wordlist.sf.net/

2

The ChainingHashTable Class

If we kept the dictionary in a hash table, we could significantly reduce the number of comparisons it

takes to check if a word is in it. (In theory! The actual efficiency in practice is dependent on the hashing

function we use and the load factor.)

The ChainingHashTable class contains the implementation for a simple chaining hash table. The

constructor creates a list of empty lists used to store the items. The outer list is sized to a certain number

of slots (by default, 11); the inner lists contain the items for each slot (initially there is just an empty list

in each slot). The __repr__ method is implemented in the class and the example below shows a print

of a simple hash table (note we have a simple hash function at the moment…):

>>> from spelling import ChainingHashTable

>>> hash_table = ChainingHashTable(5)

>>> hash_table.store(“Paul”)

>>> hash_table.store(“Peter”)

>>> hash_table.store(“Paula”)

>>> hash_table.store(“David”)

>>> hash_table.store(“Bob”)

>>> hash_table.store(“Di”)

>>> print(hash_table)

ChainingHashTable:

slot 0 = [“Paul”, “Peter”, “Paula”, “David”, “Bob”, “Di”]

slot 1 = []

slot 2 = []

slot 3 = []

slot 4 = []

Notice how all the names are in slot 0. This is due to our currently very lazy _hash function that

always returns 0. This is intended to demonstrate the worst case scenario for a chaining hash table.

You should now implement the __contains__ method, this will allow you to use the in operator to

see whether a given item is in the hash table, for example:

>>> print(‘Bob’ in hash_table)

True

>>> print(‘Knob’ in hash_table)

False

Spell Checking with ChainingHashTable

Complete the spellcheck_with_hashtable(document, dictionary, ht_type, ht_size) function. This function is similar to that for the simple list based search but it also includes a parameter for

hash table type and hash table size. We provide the code that loads the dictionary words in to the hash

table. You will need to insert the code to check the words in the document against the words in the hash

table.

Once you have completed the function you can call it with calls such as:

• spellcheck_with_hashtable(document, dictionary, ’Chaining’, 11).

Stick to a hash table size of 11 for now – we will crank it up later when we look at the load factor.

After completing spellcheck_with_hashtable, test it in the same way that you did for spellcheck_with_list.

You can, at this stage, ignore the doctest failure relating to the ChainingHashTable, we will fix this

later. . . To start with you should spell check the sherlock.txt file as it is the shortest. Then try trgov.txt,

this will take some time!

3

You should notice, however, that this doesn’t seem any better than the linear method! The example

print out of a chaining hash table above gives you an indication of what is going wrong. Are the items

spread around the hash table evenly?

The ChainingHashTable class uses its _hash method to compute the hash value for items in the

table; but at the moment it always returns 0—all items have the same hash value! This means that all

words will be chained into the same slot, and when it comes to searching for an item, it will perform a

linear search across that slot—effectively doing a linear search over the list of every single item that was

inserted. So, now let’s implement a more sensible hash function in ChainingHashTable.

Better Hash Functions

A perfect hash function transforms every item into a unique hash value. However, this usually isn’t

possible—it requires knowing what every item in the hash table will be before you write the function.4

Instead, hash tables typically settle for a hash function that gives a uniform distribution—the hash values are spread across a large range of possible values.

In Python, defining a hash function for an object is done by adding a __hash__ method that returns

an integer.5 Most built-in objects (numbers, strings, lists, etc.) already have an implementation of this

method; it’s called using the built-in hash function:

>>> hash(“a”)

-1008721619

>>> hash(“Alpha”)

1222515508

But, Python’s inbuilt hash function doesn’t necessarily return the same result on different runs (even

on the same operating system) and it can return negative values that aren’t much fun for list indexes.

So, we have provided a nice_hash function that is platform independent and consistent across runs

but still uses an algorithm very close to Python’s inbuilt hash function.

>>> nice_hash(“a”)

3826102752

>>> nice_hash(“Alpha”)

2147701689

However, you can’t just use the hash value as-is: the range of possible hash values spans over 4

billion integers—far more slots than will exist in our table. The hash values need to be scaled down to

be in the range of the number of slots—typically done using the modulo operation (%). If using this

method then it is wise to set the number of slots to a prime number, to minimise collisions and spread

values more evenly.

Re-implement the _hash function in ChainingHashTable to return the nice_hash of the object,

scaled to the number of slots in the hashtable.

4We could construct a perfect hash function for our dictionary of words, since it won’t change over the course of this lab; but

this typically isn’t the case.

5

If you do this you must also write an __eq__ method that ensures h(x) == h(y) when x == y.

4

For the test example above (adding ’Paul’,’Peter’,’Paula’,’David’,’Bob’ and ’Di’, in that order, to a 5 slot

hash table) should result in a hash table that looks like:

ChainingHashTable:

slot 0 = []

slot 1 = []

slot 2 = [“David”, “Bob”]

slot 3 = [“Paul”, “Peter”, “Di”]

slot 4 = [“Paula”]

Notice the collisions at slots 2 and 3. In this example, collisions will be unavoidable given the number of items is greater than the number of slots. 6

Test your spellcheck_with_hashtable method again, still using ’Chaining’ and a hash table size

of 11.

The Load Factor

Your spell checker should now be faster than the first version you implemented. However, you can still

do much better.

The load factor of a hash table is a metric of how full the table is; it’s calculated using the following

formula:

λ =

number of items

number of slots

A load factor of 0 would indicate an empty table, while a load factor greater than 1 would indicate a table

that has more items than slots (and needs to chain them). Note that the load factor doesn’t consider

the hash function—you can have a low load factor and still have poor performance because your hash

function doesn’t distribute items well.

By default, our HashTable creates a table with 11 slots. If you calculate the load factor for our table

when it’s populated with the entire dictionary of words, you’ll note that it’s incredibly high. Assuming a

perfect hash function, this means that it is creating chains of tens-of-thousands of items that are to be

searched through!

You can adjust the number of slots in the table, by changing the parameter to the HashTable constructor or indirectly by changing the hash table size value that you send to spellcheck_with_hashtable

with some more reasonable hash table sizes. Will a size of 23 help much? how about 101? Try some

load factors under 1.0 —this will require more slots that words in the dictionary7

. Try to find the sweet

spot. Test out the other two documents—trgov.txt and humnature.txt—to ensure larger documents don’t melt your computer.

Checking the spelling of a document should now be almost-instant. You will notice that the construction and population of the dictionary hash table is the most time-consuming task, spell checking

the document takes less than a few seconds.

> Complete the More bad spalling etc question(s) in Lab Quiz 5.

Dealing with Collisions

There are two main ways of dealing with hash collisions:

1. Chaining: A list is maintained at each hash slot with the list containing all items that hash to the

slot.

6One benefit of chaining is that the hash table can hold more items than it has slots so you don’t need to know how many items

are to be added when you create the hash table and you won’t get an error if you add large numbers of items – until your computer

runs out of memory.

7

see http://www.bigprimes.net/archive/prime/ for a selection of big prime numbers

5

2. Open Addressing: If a collision occurs then a probe is done to find another empty slot in the hash

table. All items are stored in the slots of the hash table, therefore space is limited…

So far we have covered chaining and will now experiment with open addressing.

Open addressing

When using open addressing items are stored one per slot in the hash table, therefore the hash table

cannot store more items than it has slots. For example, a five slot hash table that uses open addressing

can store a maximum of 5 items. If an item hashes to a slot that already contains an item then we need to

find another empty location in the table using a probing function. The hash plus probe sequence must

be consistent so that a search will find the item. Depending on the hash table size, probing function

and load factor an empty slot may not be located, even if the table isn’t full! So it is important to get a

feel for how open addressing works.

Linear Probing

If a collision occurs, increment the slot number and try that slot, continuing until an empty slot is found

or the original slot is reached (which indicates the table is full). To avoid running off the end of the hash

table use the modulo function to ensure the new slot number is less than the number of slots. For

example the ith probe will be at sl ot = (cur r _sl ot + i)%t able_si ze – or the next probe will be at the

current slot plus one. Our lab classes have an object variable called n_slots to make this easy for you.

Now is the time to implement the store method in the LinearHashTable class, so that the doc test

returns no errors.

It is good to work through the addition of items in a step by step process, looking at the hash table

as it grows. For example, running commands such as:

from spelling import *

hash_table = LinearHashTable(7)

hash_table.store(“Tennis”)

print(hash_table)

hash_table.store(“Cricket”)

print (hash_table)

hash_table.store(“Swimming”)

print (hash_table)

hash_table.store(“Underwater Motorbike Hockey”)

print (hash_table)

hash_table.store(“Soccer”)

print(hash_table)

> Complete the Linear Probing questions in Lab Quiz 5

Once you are happy that you have your LinearHashTable class working, test your spell checking

routine using LinearHashTables of size 650011, 700001, 800011, 2000003, 3000017, 5000011, 10000019.

Then try using a size of 600011. What should happen here? Does your code deal with this?

• eg, spellcheck_with_hashtable(document, dictionary, ’Linear’, 2000003).

How does the speed change when the size of the hash table is increased?

Quadratic Probing

With quadratic probing, the sequence for probing is based on a quadratic function and in this lab we

will use a simple quadratic function. That is, the ith probe will be at sl ot = (h +i

2

)%t able_si ze, where

h is the original hash collision slot. So, if a clash occurred at slot 0 (in a hash table of size 101) then

the sequence of slots to probe would be 1, 4, 9, 16, 25, 36,… With quadratic probing it helps to set the

6

number of slots in the hash table to a prime number, especially if you are going to have a high load

factor. You should be aware that, for load factors over 0.5, quadratic probing isn’t guaranteed to find an

empty slot.

Now is the time to implement the store and __contains__ methods in the QuadraticHashTable

class, so that the doc test returns no errors. Then do some of your own testing by creating a hash table

and adding in some items, printing out the table after each item is added.

Once you are happy that you have your QuadraticHashTable class working, test your spell checking routine using QuadraticHashTable’s of size 650011, 700001, 800011, 2000003, 3000017, 5000011,

10000019. Then try using a size of 600011. What should happen here? Does your code deal with this?

• eg, spellcheck_with_hashtable(document, dictionary, ’Quadratic’, 700001).

How does the speed change when the size of the hash table is increased?

> Complete the Quadratic Probing questions in lab Quiz 5

Extras

• Instead of using spellcheck_with_list, write a function that uses a binary search of the dictionary, call it something entertaining like spellcheck_bin . How does it compare with the three

hash table implementations? Will this always be the case?

• Change the store method of ChainingHashTable to keep the items in each chain sorted; adjust

the __contains__ method to take advantage of this. Given a load factor λ, and number of items

n, what’s the average number of comparisons needed to find an item using this method (assuming

a perfect hash function)?

7

Sale!