Tutorial 13, CPSC 331, Fall 2010

Mike Jacobson

 Hash Tables

About This Exercise

The objective of this exercise is to review hash tables, including collision-resolution strategies and the analysis of hashing, and to practice applying these ideas.

Students should read through and try to solve the problems on this exercise before attending the tutorial on Wednesday, November 3.

Expected Background

This exercise is based on material presented during the lectures on Friday, October 29 and Monday, November 1. It is expected that students have attended the above lectures or carefully reviewed the notes for these classes that are now available online.

The exercise also makes use of material in the lecture supplement on hash functions and additional hashing strategies. Students should review this supplemental material before spending time on this exercise as well.

Warmup Exercises — To Check That You Understand the Basics

It is to be expected that all of the questions in this first section can be answered by any student who has successfully completed the prerequisites for this course, attended the lectures in which hash tables were discussed (reviewing the lecture notes, as needed, too), and completed the assigned reading mentioned above.

Students who do have difficulty with these questions, after completing the above, should contact the course instructor to get extra help.

  1. Give a brief description of each of the following.

    1. a hash table with chaining
    2. a hash table with open addressing
    3. a collision between two keys
    4. the load factor in a hash table with chaining
    5. the simple uniform hashing assumption
    6. the uniform hashing assumption
    7. a probe sequence
    8. linear probing
    9. quadratic probing
    10. double hashing
    11. primary clustering
    12. secondary clustering
    13. the division method
    14. the multiplication method
    15. universal hashing
  2. Suppose that you inserted the following values into a hash table of size m=59 (using “the division method”):

    473, 418, 356, 312, 119, 237, 605, 307

    You should assume that the “universe” U is the set of integers between 0 and 999 and that the hash table is empty before 473 is inserted into it.

    Show the hash tables that would result from this sequence of insertions when each of the following kinds of hash tables is used.

    1. a hash table with chaining
    2. a hash table with open addressing; linear probing is used for collision resolution;
    3. a hash table with open addressing; quadratic probing is used for collision resolution;
    4. a hash table with open addressing; double hashing is used for collision resolution.

    When using double hashing, please use

    h0(x) = x mod 59

    as suggested above, and use

    h1(x) = (x mod 58) + 1.
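
    As a sanity check before drawing the tables, the division-method values can be computed directly. The sketch below (class and method names are our own, not from any course code) prints h0 and h1 for each key; note that 473, 119, and 237 all hash to slot 1, so each kind of table must resolve collisions there.

    ```java
    // Sketch: the hash values used in Question 2.
    // h0 is the primary (division-method) hash for table size m = 59;
    // h1 is the secondary hash used only for double hashing.
    public class DivisionHash {
        static int h0(int x) { return x % 59; }          // primary: division method
        static int h1(int x) { return (x % 58) + 1; }    // secondary: never zero

        public static void main(String[] args) {
            int[] keys = {473, 418, 356, 312, 119, 237, 605, 307};
            for (int k : keys) {
                System.out.println(k + ": h0 = " + h0(k) + ", h1 = " + h1(k));
            }
        }
    }
    ```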

  3. Consider the following hash table (with size m=8).

    [Figure: a hash table with size m = 8]

    1. Confirm that this table could be produced from an empty table by inserting the keys

      25, 2, 12, 14, 22

      using the hash function

      h(k, i) = (k + i) mod 8.

    2. Draw the tables produced, and state the results you would obtain, if the following sequence of operations were performed, starting with the table given above and using the above hash function.

      1. delete 14;
      2. search for 22;
      3. insert 20;
      4. insert 28.
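
The operations in this question can be traced mechanically. Below is a minimal sketch of open addressing with linear probing for h(k, i) = (k + i) mod 8, using a sentinel value for deleted cells; all names here are our own, and the sentinel choice is just one illustration of a "deleted" marker, not the course's implementation.

```java
// Sketch of open addressing with linear probing, h(k, i) = (k + i) mod 8,
// with a DELETED sentinel so searches can probe past removed keys.
public class LinearProbing {
    static final Integer DELETED = Integer.MIN_VALUE; // sentinel for deleted slots
    Integer[] table = new Integer[8];

    int h(int k, int i) { return (k + i) % 8; }

    void insert(int k) {
        for (int i = 0; i < 8; i++) {
            int j = h(k, i);
            if (table[j] == null || table[j].equals(DELETED)) { table[j] = k; return; }
        }
        throw new IllegalStateException("table full");
    }

    int search(int k) {                        // returns slot index, or -1 if absent
        for (int i = 0; i < 8; i++) {
            int j = h(k, i);
            if (table[j] == null) return -1;   // an empty slot ends the probe sequence
            if (table[j].equals(k)) return j;  // DELETED slots are probed past
        }
        return -1;
    }

    void delete(int k) {
        int j = search(k);
        if (j >= 0) table[j] = DELETED;        // mark, don't empty
    }

    public static void main(String[] args) {
        LinearProbing t = new LinearProbing();
        for (int k : new int[]{25, 2, 12, 14, 22}) t.insert(k);
        t.delete(14);                          // slot 6 becomes DELETED
        System.out.println(t.search(22));      // still found at slot 7, past the marker
        t.insert(20);                          // 20 mod 8 = 4 is taken; lands in slot 5
        t.insert(28);                          // slots 4 and 5 taken; reuses DELETED slot 6
    }
}
```

If delete simply emptied slot 6 instead of marking it, the subsequent search for 22 would stop at the empty slot and wrongly report the key absent.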
  4. Describe how the algorithm to insert a key into a hash table with open addressing could be modified in order to allow cells storing the “deleted” marker to be overwritten with the key you wish to insert.

    What could go wrong if you simply replaced the first copy of “deleted” with the key to be inserted, as soon as one such copy was found?

  5. The worst-case performance of an operation on any of the (reasonably simple) kinds of hash tables, that have been described in lectures, is significantly worse than the worst-case performance of the corresponding operation on a balanced binary search tree. Yet, there are situations in which hash tables are considered to be preferable. Why?

  6. Describe three (necessary or) desirable properties that hash functions should have.

  7. Describe the hashCode method that is provided by every Object in Java.

    Notice that this should not be used, all by itself, as the “hash function” that maps values to locations in a hash table. Why is that? Why is this function useful (when implementing hash tables) anyway?

  8. What is linearity of expectation? How was this used when the expected shape of a hash table (and cost of operations on it) was analyzed?
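
As a reminder (a sketch, not necessarily the course's exact statement), linearity of expectation says that the expected value of a weighted sum of random variables is the corresponding sum of their expected values, with no independence assumption needed:

```latex
E\!\left[\sum_{i=1}^{n} c_i X_i\right] \;=\; \sum_{i=1}^{n} c_i \, E[X_i]
\qquad \text{for any random variables } X_1, \dots, X_n
\text{ and constants } c_1, \dots, c_n.
```

In the chaining analysis, the length of a chain is a sum of indicator variables, one per key, so its expected value is simply a sum of probabilities.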

To Be Discussed in the Tutorial

While the deterministic hashing strategies (and hash functions) that have been discussed so far have good expected performance, under various assumptions about the sets being represented, there are sets (that could be stored in hash tables) that would cause each of these to perform quite badly, as well. In particular, for every fixed hash function, it is always possible to choose a set of values that would be hashed (by this function) to the same location.

In the next questions we will consider a “randomized” hashing strategy, universal hashing. In this strategy, the hash function to be used is randomly chosen from a set of hash functions; all operations are deterministic after that. While an “unlucky choice” can result in poor performance, there is no fixed set that would produce a badly shaped hash table using all of the hash functions that are available.

Consider the problem of representing subsets of a universe U using a hash table of size m. In order to do this we must use a hash function mapping elements of U to hash table locations, that is, integers between 0 and m−1.

Suppose that H is a finite (and nonempty) set, or “collection,” of functions from U to {0, 1, … m−1}. We say that this set is universal if, for every pair of values x and y in U such that x ≠ y, the number of functions h in H such that h(x) = h(y) is at most |H|/m.

  1. Suppose that you wish to provide access to a subset S of U, using a hash table of size m. Suppose, as well, that a universal collection H of hash functions (mapping elements of U to the locations in a hash table with size m) is available.

    Suppose, in particular, that we construct a hash table for S as follows:

    1. Choose a function h uniformly and randomly from H. Under these conditions, each function h ∈ H is chosen with probability 1/|H|.
    2. Insert the entries of S into a hash table with size m, using hashing with chaining, with the function selected in the previous step as the hash function.
    3. Use the resulting hash table to access the set S using the search algorithm described in class.

    Notice that, for each of the analyses to be considered below, you will need to consider random variables for which each of the following claims is correct.

    • The sample space is (effectively) the same as the set H of hash functions that could be chosen in the first step, above.
    • As mentioned above, hash functions are chosen “uniformly” from H. Thus each of the outcomes (that is, elements) of the above sample space is selected with the same probability, namely, 1/|H|.
    • Each of the things we will want to measure below can be modelled as a random variable. We will therefore discover information about the “average” value of this by computing the expected value of this random variable, with respect to the sample space and probability function that have just been defined.

    Let n be the size of the set S.

    1. Consider any element x of S. Show that the expected number of other elements y of S that are hashed to the same location as x is (n−1)/m.
    2. Use the above to show that the expected number of comparisons needed for a search is at most 1 + n/m, so that it is at most two (and, therefore, in O(1)) if m ≥ n.
  2. Now suppose that p is a prime number that is larger than the size of the universe U.

    For example, if U is the set of integers that can be represented exactly using Java’s int data type, then p must be chosen to be a prime number that is larger than 4,294,967,296. A variety of prime numbers are available that are not too much larger than this (indeed, there is a prime number between x and 2x for every positive integer x), so one might choose p to be a prime number that is greater than the above value, and that can be represented exactly using Java’s long data type, in this case.

    1. Suppose that x and y are distinct real numbers (that is, x ≠ y), and let r and s be any pair of real numbers.

      Confirm that if

      a = (r−s)/(x−y)      and      b = (sx−ry)/(x−y)

      then ax+b=r and ay+b=s.
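
      The identities in this part can be spot-checked numerically. The sketch below is a check of the algebra only, using arbitrary sample values chosen so that the divisions are exact in floating point.

      ```java
      // Check of part 1: with a = (r−s)/(x−y) and b = (sx−ry)/(x−y),
      // we get ax + b = r and ay + b = s (the line through (x, r) and (y, s)).
      public class LineCheck {
          static double slope(double x, double y, double r, double s) {
              return (r - s) / (x - y);
          }
          static double intercept(double x, double y, double r, double s) {
              return (s * x - r * y) / (x - y);
          }

          public static void main(String[] args) {
              double x = 1, y = 3, r = 7, s = 1;        // arbitrary sample values
              double a = slope(x, y, r, s);             // (7−1)/(1−3) = −3
              double b = intercept(x, y, r, s);         // (1−21)/(−2) = 10
              System.out.println((a * x + b) + " " + (a * y + b)); // prints 7.0 1.0
          }
      }
      ```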

    2. Now, instead of real numbers (and real arithmetic) suppose that we consider the set Zp of integers between 0 and p−1 (and arithmetic modulo the prime p) instead.

      Since p is prime, it can be shown that every nonzero element z of this set (that is, every integer z between 1 and p−1) has a “multiplicative inverse,” that is, there exists an integer w in the same set such that

      z × w ≡ 1    (mod p).

      Students in MATH 271 (or students who have already taken this course and still have their textbook) can find a discussion of this, along with an algorithm to compute the multiplicative inverse of a given value, in Section 10.4 of Susanna Epp’s book “Discrete Mathematics with Applications.”

      Use this to show that if x and y are distinct integers that are each between 0 and p−1, and r and s are integers between 0 and p−1 as well (that are not necessarily distinct), then there exist integers a and b that are each between 0 and p−1, such that

      ax+b ≡ r    (mod p)      and      ay+b ≡ s    (mod p).
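
      Carrying this out concretely: once x − y has a multiplicative inverse modulo p, a and b can be computed directly as a ≡ (r−s)(x−y)⁻¹ and b ≡ r − ax (mod p). The sketch below leans on java.math.BigInteger's modInverse method (which does the extended-Euclidean work); the sample values are arbitrary.

      ```java
      import java.math.BigInteger;

      // Sketch of part 2 over Z_p: solve ax + b ≡ r and ay + b ≡ s (mod p)
      // for a and b, given x ≠ y and a prime p.
      public class ModularSolve {
          // returns {a, b} with ax + b ≡ r and ay + b ≡ s (mod p)
          static long[] solve(long x, long y, long r, long s, long p) {
              BigInteger P = BigInteger.valueOf(p);
              BigInteger inv = BigInteger.valueOf(x - y).mod(P).modInverse(P);
              long a = BigInteger.valueOf(r - s).mod(P).multiply(inv).mod(P).longValue();
              long b = Math.floorMod(r - a * x, p);
              return new long[]{a, b};
          }

          public static void main(String[] args) {
              long p = 101, x = 17, y = 42, r = 5, s = 90;   // arbitrary sample values
              long[] ab = solve(x, y, r, s, p);
              System.out.println("a = " + ab[0] + ", b = " + ab[1]);
              System.out.println(Math.floorMod(ab[0] * x + ab[1], p)); // recovers r
              System.out.println(Math.floorMod(ab[0] * y + ab[1], p)); // recovers s
          }
      }
      ```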

    3. Count the number of choices of r and s that might be made, as well as the number of solutions a and b that might be found, for the above equations.

      Use this to argue that it is not only true that there exist integers a and b, between 0 and p−1 that satisfy the above equations but, furthermore, that these integers are unique.

    4. Use this to argue that if a and b are integers such that 1 ≤ a ≤ p−1 and such that 0 ≤ b ≤ p−1, and

      fa, b(x) = (ax + b)  mod  p

      is a function mapping elements of the above set Zp to the same set Zp, then, for all pairs of integers x and y in this set,

      fa, b(x) = fa, b(y)     if and only if     x = y.

    5. Now consider the “hash function” ha, b: Zp → {0, 1, …, m−1}, where

      ha, b(x) = fa, b(x)  mod  m

      and where the function fa, b is as given above. Notice that, by the above definition

      ha, b(x) = ha, b(y)     if and only if     fa, b(x) ≡ fa, b(y)   (mod m).

      Note that if r is an integer between 0 and p−1, then the number of integers s between 0 and p−1 that are not equal to r, such that r ≡ s (mod m), is at most

      ⌈p/m⌉−1  ≤  ((p+m−1)/m)−1  =  (p−1)/m.

      Let H be the set of hash functions

      H = { ha, b  |  1 ≤ a ≤ p−1  and  0 ≤ b ≤ p−1 }

      (noting that this is a set of size p(p−1)).

      Using all of the above, show that H is a universal collection of hash functions, mapping elements of Zp to positions in hash tables of size m.
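
      For a small case, the universality bound can also be checked exhaustively. The sketch below takes p = 13 and m = 3 (so |H| = p(p−1) = 156 and |H|/m = 52) and counts, for a fixed pair x ≠ y, how many functions ha, b collide on it; this is an empirical illustration, not a substitute for the proof.

      ```java
      // Exhaustive check of the universality bound for p = 13, m = 3:
      // for every pair x != y in Z_p, the number of (a, b) with
      // h_{a,b}(x) = h_{a,b}(y) should be at most |H| / m = 52.
      public class UniversalCheck {
          static int collisions(int x, int y, int p, int m) {
              int count = 0;
              for (int a = 1; a < p; a++)
                  for (int b = 0; b < p; b++)
                      if (((a * x + b) % p) % m == ((a * y + b) % p) % m) count++;
              return count;
          }

          public static void main(String[] args) {
              int p = 13, m = 3, bound = p * (p - 1) / m;   // |H| / m = 52
              for (int x = 0; x < p; x++)
                  for (int y = x + 1; y < p; y++)
                      if (collisions(x, y, p, m) > bound)
                          System.out.println("bound violated for " + x + ", " + y);
              System.out.println("checked all pairs against bound " + bound);
          }
      }
      ```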

    6. Recall that p was chosen to be a prime number that was larger than the size of some finite universe U. Explain briefly how you could use the above to randomly generate hash functions for sets whose values are chosen from U (instead of integers between 0 and p−1).

Additional Problems

The following “challenge problems” are problems that students will not necessarily be able to solve, or find to be of interest. That noted, they do introduce another interesting idea about hash tables, and they can be discussed with the instructor by students who are interested in them.

You may have noticed that, when we analyze operations on hash tables, we generally think of the set that we are representing as being fairly “static,” that is, its elements are fixed (or changes to this set are extremely rare). In the extreme case, the only operations that are performed (after the set’s elements have been inserted into the hash table) are searches.

Perfect hashing, and related strategies that use perfect hash functions, are strategies that are really only useful for this extreme case — because the set that is to be represented is used to decide which hash function to use!

  1. Let S be a subset of U with size n, and let m be an integer (as usual, used as our hash table size) such that m ≥ n2.

    Suppose that a hash function h (for this universe U and table size m) is randomly and uniformly selected from a universal collection of hash functions. The elements of the above set S should then be inserted into the table using hash function h.

    Show that the probability that there are any collisions between keys, at all (so that more than one access into the table is ever needed to find a key) is less than one-half.
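
    One way to approach this (a hint, not the only route): let X count the colliding pairs of keys; universality bounds E[X], and Markov's inequality converts that into a bound on the probability that X ≥ 1:

    ```latex
    % X = number of colliding pairs among the n keys; each pair collides
    % with probability at most 1/m, and m >= n^2.
    E[X] \;\le\; \binom{n}{2}\cdot\frac{1}{m}
    \;=\; \frac{n(n-1)}{2m}
    \;\le\; \frac{n(n-1)}{2n^{2}}
    \;<\; \frac{1}{2},
    \qquad
    \Pr[X \ge 1] \;\le\; E[X] \;<\; \frac{1}{2}.
    ```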

Notice that this immediately gives us a bad way to randomly generate hash tables with good performance: Randomly choose hash functions from a universal collection, as described above — rejecting the hash function, and trying again, until a hash function hashing all the elements of S into different positions in the table has been found.

And, of course, this is not a good idea because far too much space is being wasted here! We would like the size of our structure to be linear in the size of the set that is being represented rather than quadratic in it.

  2. With that noted, consider a two-level structure, instead.

    1. The top level (or main part) of this structure is a hash table whose size is the same as the size of the set being represented. The hash function used to produce this should be uniformly and randomly selected from a universal collection of hash functions.
    2. Just as for hashing with chaining, each position in the main hash table is a reference to a data structure that is used to store the values that are hashed to this position in the table. However, instead of a linked list, a second hash table (again, one for each position in the main table) is used to store these values.

    The second-level hash tables are constructed as suggested in the previous problem, above — so that each has size that is quadratic in the number of values that are to be stored in it.

    Some additional notation will be helpful: Suppose that the set S, whose elements are to be stored, includes the elements x1, x2, …, xn.

    For integers i between 0 and m−1, and for integers j and k that are each between 1 and n, consider the following random variables:

    • Xi, j has value 1 if h(xj) = i and has value 0 otherwise.
    • Yi, j, k has value 1 if h(xj) = h(xk) = i and has value 0 otherwise.

    The sample space used to define these random variables is (in effect) the universal collection of hash functions, from which the main hash function h has been chosen. Since the hash function is uniformly chosen from this collection, each element of this sample space has probability 1/|H|.

    1. Write the size of the hash table corresponding to position i of the main table as a linear combination of some of these random variables.

    2. Use your answer to the above question, and write the sum of the sizes of all of these second-level hash tables as a linear combination of the above random variables, as well.

    3. Use linearity of expectation, and your answer to the above question, to show that the expected value of the sum of the sizes of these tables is at most 2n.

    4. Use this to prove that the probability, that the sum of these sizes is greater than or equal to 3n, is at most 2/3.

    5. Finally, use all of the above to describe a way to randomly construct a hash table whose size is linear in the size of a given set S (whose values are available to you, ahead of time), and such that the number of steps needed to search in this table is in O(1) in the worst case.
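
    For parts 3 and 4, the following identity (which can be verified by expanding the square, using the fact that X_{i,j}² = X_{i,j}) may be a useful starting point:

    ```latex
    \sum_{i=0}^{m-1}\Big(\sum_{j=1}^{n} X_{i,j}\Big)^{2}
    \;=\; \sum_{i=0}^{m-1}\sum_{j=1}^{n} X_{i,j}
        \;+\; \sum_{i=0}^{m-1}\sum_{j \ne k} Y_{i,j,k}
    \;=\; n \;+\; \sum_{i=0}^{m-1}\sum_{j \ne k} Y_{i,j,k}.
    ```

    Here the left-hand side is the sum of the second-level table sizes, since the table at position i has size equal to the square of the number of keys hashed there; Markov's inequality then relates the expected sum to the probability that the sum is large.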


http://www.cpsc.ucalgary.ca/~jacobs/cpsc331/F10/tutorials/tutorial13.html