Hash Functions, CPSC 331, Winter 2012

Hash Functions and Additional Hashing Strategies

Hash Functions

In order to implement a hash table — or analyze one’s behaviour — it is necessary to know more about the hash function that is to be used. Recall that this is a function

h: U → {0, 1, …, m−1}

where U is the universe of objects that might be stored in the table, and m is the table size.

Desirable Properties

A hash function should have the following properties in order to be useful.

Low Cost: Evaluation of the hash function should not be expensive! While the cost will necessarily depend on the size of the input (that is, the length of the representation of the value x that is supplied), either it should not depend on the size of the set being stored in the table (or on the table size) at all, or it should grow very slowly as a function of these.

To see why this is important, consider a hash function whose evaluation has cost in Ω(log n), where n is the size of the set stored in the hash table: the worst-case cost of any hash table operation would then be asymptotically no better than the worst-case cost of the same operation if a balanced search tree were used to represent the set instead.
Well Defined: If the hash function is applied repeatedly to the same value x then the same location should be returned (as “h(x)”) every time.
Uniformity: The hash function should distribute keys as evenly over the locations in the hash table as possible.

Structure of Hash Functions To Be Considered

We will consider hash functions that are compositions of functions,

h = h₁ ∘ h₂

so that h(x) = h₁(h₂(x)) for any value x.

The function

h₂: U → Z

(which is actually the first function you will apply) maps objects to integers.

The implementation of this depends on the kind of objects that are included in the “universe” U. In Java, this function is implemented as the function

public int hashCode()

which, as a method of the Object class at the root of the class hierarchy, is available for every class and object in Java.

Additional information about Java’s implementation of this (for various kinds of classes) can be found as the description of the hashCode method, for each class, in the online Java documentation.

The function

h₁: Z → {0, 1, …, m−1}

maps integers to locations in the hash table.

Since the function h₂ is implemented in Java by the hashCode function, which returns values of the int data type, it can be assumed (if necessary) that the integers that are given as inputs to the function h₁ are small enough to be represented using Java’s int data type, that is, they have values between −2,147,483,648 and 2,147,483,647 (inclusive).
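The composition h(x) = h₁(h₂(x)) can be sketched in Java as follows. This is only an illustration: the class name ComposedHash and the method h are hypothetical, and the choice of h₁ is left as a parameter. The main method assumes reduction modulo m using Math.floorMod, which is one possible choice of h₁ and not necessarily the one used in class.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical helper for illustration; not part of the course code.
final class ComposedHash {
    // h(x) = h1(h2(x)), where h2 is Java's hashCode().
    static int h(Object x, IntUnaryOperator h1) {
        int k = x.hashCode();      // h2: object -> int (may be negative)
        return h1.applyAsInt(k);   // h1: int -> table location
    }

    public static void main(String[] args) {
        int m = 5; // assumed table size
        IntUnaryOperator h1 = k -> Math.floorMod(k, m); // assumed choice of h1
        System.out.println(h(Integer.valueOf(-7), h1));
        System.out.println(h("hash", h1));
    }
}
```

Integer.hashCode() returns the int value itself, so the first call applies h₁ to −7. Negative inputs like this are the reason h₁ must be defined for every value of the int data type, not only for nonnegative ones.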

Simplifying Assumption: In our analysis we will generally assume that, if a hash table is being used to represent a set of n distinct objects

x₁, x₂, …, x_n

from the universe U, then the integers

h₂(x₁), h₂(x₂), …, h₂(x_n)

are distinct as well; that is, h₂(x_i) ≠ h₂(x_j) for all integers i and j between 1 and n such that i ≠ j. While this cannot be guaranteed in all cases, it considerably simplifies the analysis of algorithms that access or modify hash tables, and exceptions are rare enough that they do not complicate the use of hash tables in practice.
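A small illustration of why distinctness cannot be guaranteed: the two distinct strings used below are a well-known pair with the same hashCode value, since for a two-character string s the documented String.hashCode() formula reduces to s[0]·31 + s[1]. The class name HashCodeCollision and the method sameCode are hypothetical names chosen for this sketch.

```java
// Hypothetical demonstration class; not part of the course code.
final class HashCodeCollision {
    // True if two objects that are not equal share a hashCode value.
    static boolean sameCode(Object x, Object y) {
        return !x.equals(y) && x.hashCode() == y.hashCode();
    }

    public static void main(String[] args) {
        // 'A' = 65, 'a' = 97, 'B' = 66, so 65*31 + 97 = 66*31 + 66 = 2112.
        System.out.println("Aa".hashCode());
        System.out.println("BB".hashCode());
        System.out.println(sameCode("Aa", "BB"));
    }
}
```

Such pairs exist for any class with more than 2³² possible values, because hashCode can return only 2³² different results.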

In the rest of these notes we will concentrate on functions that map integers to hash table locations and that are used in practice as the function h₁ above.

The Division Method

In the division method one simply maps the integer k to the integer h₁(k) = k mod m.

This seems to work well if m is a prime number (but not too close in value to either a power of two or a power of ten). It is less effective, in practice, if m has lots of small prime factors.
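The following sketch shows the division method in Java, together with a small experiment on the choice of m. The class name DivisionMethod, the helper distinctLocations, and the sample keys are all made up for this illustration. One Java-specific assumption is needed: the % operator can return a negative remainder when k is negative, so Math.floorMod is used to keep the result in {0, 1, …, m−1}.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demonstration class; not part of the course code.
final class DivisionMethod {
    // h1(k) = k mod m, with the result always in {0, ..., m-1}.
    static int h1(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Number of distinct table locations used by the given keys.
    static int distinctLocations(int[] keys, int m) {
        Set<Integer> used = new HashSet<>();
        for (int k : keys) {
            used.add(h1(k, m));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int[] keys = {0, 8, 16, 24, 32, 40, 48}; // made-up keys, all multiples of 8
        System.out.println(distinctLocations(keys, 8)); // m is a power of two
        System.out.println(distinctLocations(keys, 7)); // m is prime
        System.out.println(-7 % 5);    // Java remainder: negative
        System.out.println(h1(-7, 5)); // floor modulus: in range
    }
}
```

With m = 8 all seven keys land in a single location, while with m = 7 they occupy seven different locations. The last two lines print −2 and 3, which shows why the plain % operator is not safe as an array index for negative keys.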

The Multiplication Method

In the multiplication method h₁(k) is computed as follows. Here, A is a fixed real number such that 0 < A < 1.

Compute the real number c = k × A.
Replace c with its “fractional part”, c − ⌊c⌋ (the result is a real number that is greater than or equal to zero and strictly less than one).
Return h₁(k) = ⌊m × c⌋ (the result is an integer between 0 and m−1)

One advantage of this method is that the choice of table size is less critical (it can be used for virtually any table size). Instead, the choice of the real number A is significant. A disadvantage is that the method requires arithmetic on floating point numbers and, of course, computers do not carry out such computations exactly.
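The three steps above can be sketched in Java using double arithmetic. The class name MultiplicationMethod is hypothetical, and the value A = (√5 − 1)/2 ≈ 0.618 is an assumption made for this example (it is a commonly suggested constant, but these notes do not prescribe a particular A). The final Math.min is a defensive guard, added here because floating point rounding is inexact.

```java
// Hypothetical demonstration class; not part of the course code.
final class MultiplicationMethod {
    // Assumed constant: A = (sqrt(5) - 1) / 2, roughly 0.618.
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;

    static int h1(int k, int m) {
        double c = k * A;                        // step 1: c = k * A
        c = c - Math.floor(c);                   // step 2: fractional part, in [0, 1)
        int location = (int) Math.floor(m * c);  // step 3: floor(m * c)
        return Math.min(location, m - 1);        // guard in case rounding yields m
    }

    public static void main(String[] args) {
        System.out.println(h1(123456, 16384));
        System.out.println(h1(-123456, 16384)); // negative keys also map into range
    }
}
```

For k = 123456 and m = 16384, the product k × A is approximately 76300.0041, so the fractional part is approximately 0.0041 and the sketch returns ⌊16384 × 0.0041…⌋ = 67.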

Additional Hashing Strategies

A Common Situation

One situation in which hash tables seem to be very useful, which will be explored here, is the situation in which the set being accessed is static: values are neither added to it nor removed from it after it has been created (and the table has been filled with the set to be used), so that only searches are performed after the table has been set up.

The rest of these notes considers this situation.

Universal Hashing

Universal hashing is a variation on the hashing strategies that we have talked about in which a randomized algorithm is used to choose the hash function that will be used to create and access a hash table. Once this hash function has been chosen, the hash table is used as described in class.

In particular, the hash function is chosen uniformly at random from a universal collection of hash functions, where this is as defined below.

Definition: Consider the use of a hash table where (as usual) values to be stored are chosen from a large finite universe U, and where the table size is a positive integer m — so that the hash function should be a function

h: U → {0, 1, …, m−1}.

A finite collection H of functions from U to {0, 1, …, m−1} is said to be universal if, for every pair of distinct values s and t in U, the number of functions h ∈ H such that h(s) = h(t) is at most |H|/m.

Now consider the behaviour of a hash table when it is used to store some set S ⊆ U using “Universal Hashing.” Even when the set S is “fixed,” the hash table is not! Instead, properties of the hash table, and the number of steps needed to find keys in it, are all random variables that are defined for a sample space consisting of the universal collection H from which the hash function was chosen. In an analysis of this kind of hashing, we try to find upper bounds for the expected values of these random variables.

It turns out that universal collections of hash functions do exist, and that they are not too difficult to describe or use. A universal collection, for the universe U consisting of all the integers that can be represented using Java’s int data type, will be considered as part of the tutorial exercise on hash tables.
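As an illustration, here is a sketch of one standard universal collection, which is described in the Cormen, Leiserson, Rivest and Stein text; it is not necessarily the collection used in the tutorial exercise. Each function has the form h(k) = ((a·k + b) mod p) mod m, where p is a prime and a and b are chosen at random with 1 ≤ a < p and 0 ≤ b < p. To keep the arithmetic within Java's long type, the sketch assumes a smaller universe than all int values: keys k with 0 ≤ k < p, where p = 2³¹ − 1. The class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical demonstration class; not part of the course code.
final class UniversalHashSketch {
    static final long P = 2147483647L; // the prime 2^31 - 1

    // h_{a,b}(k) = ((a*k + b) mod p) mod m, for 1 <= a < p, 0 <= b < p, 0 <= k < p.
    static int evaluate(long a, long b, long p, int m, long k) {
        return (int) (((a * k + b) % p) % m);
    }

    private final long a;
    private final long b;
    private final int m;

    // Chooses one function from the collection uniformly at random.
    UniversalHashSketch(int m, Random rng) {
        this.a = 1 + rng.nextInt((int) (P - 1)); // uniform in {1, ..., P-1}
        this.b = rng.nextInt((int) P);           // uniform in {0, ..., P-1}
        this.m = m;
    }

    int hash(int k) {
        if (k < 0 || k >= P) {
            throw new IllegalArgumentException("key outside the assumed universe");
        }
        return evaluate(a, b, P, m, k);
    }

    public static void main(String[] args) {
        UniversalHashSketch h = new UniversalHashSketch(10, new Random());
        System.out.println(h.hash(42)); // location varies from run to run
    }
}
```

Once the constructor has picked a and b, the function is fixed: repeated calls to hash with the same key return the same location, so the "well defined" property still holds for the table that is built.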

Perfect Hashing

Using universal hashing, and two levels of hashing (so that the entries in the main hash table are references to smaller hash tables, and the smaller hash tables are as discussed in class) it is possible to design a hashing structure in which there are no collisions at all — so that the worst-case time to search for a value is in O(1).

This is related to the concept of a perfect hash function, that is, a hash function mapping distinct elements of the finite set S ⊆ U (whose entries we wish to store) to hash table locations with no collisions.
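A rough sketch of such a two-level structure for a static set follows. It makes several assumptions that do not come from these notes: keys are distinct ints with 0 ≤ k < 2³¹ − 1, both levels use randomly chosen functions of the form ((a·k + b) mod p) mod m, the main table has one entry per key, and a bucket holding c keys gets a secondary table of size c², whose hash function is re-chosen until no two of its keys collide. A complete design would also re-choose the top-level function to keep the total space linear, which is omitted here. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical demonstration class; not part of the course code.
final class TwoLevelTableSketch {
    private static final long P = 2147483647L; // prime 2^31 - 1; keys assumed in [0, P)
    private static final int EMPTY = -1;       // safe marker because keys are nonnegative

    // A randomly chosen function k -> ((a*k + b) mod P) mod m.
    private static final class Hash {
        final long a, b;
        final int m;
        Hash(int m, Random rng) {
            this.a = 1 + rng.nextInt((int) (P - 1));
            this.b = rng.nextInt((int) P);
            this.m = m;
        }
        int of(int k) { return (int) (((a * k + b) % P) % m); }
    }

    private final Hash top;       // first-level hash function
    private final Hash[] inner;   // one second-level function per bucket
    private final int[][] slots;  // slots[i] is the secondary table for bucket i

    // keys must be distinct; duplicates would make the retry loop run forever.
    TwoLevelTableSketch(int[] keys, Random rng) {
        int n = Math.max(1, keys.length);
        top = new Hash(n, rng);
        int[] counts = new int[n];
        for (int k : keys) counts[top.of(k)]++;
        inner = new Hash[n];
        slots = new int[n][];
        for (int i = 0; i < n; i++) {
            int size = counts[i] * counts[i];
            slots[i] = new int[size];
            boolean collisionFree = (size == 0);
            while (!collisionFree) {          // retry until bucket i has no collisions
                inner[i] = new Hash(size, rng);
                Arrays.fill(slots[i], EMPTY);
                collisionFree = true;
                for (int k : keys) {          // simple but slow: rescans all keys
                    if (top.of(k) != i) continue;
                    int j = inner[i].of(k);
                    if (slots[i][j] != EMPTY) { collisionFree = false; break; }
                    slots[i][j] = k;
                }
            }
        }
    }

    // Two hash evaluations and one comparison, whatever the size of the set.
    boolean contains(int k) {
        if (k < 0 || k >= P) return false;
        int i = top.of(k);
        return slots[i].length > 0 && slots[i][inner[i].of(k)] == k;
    }

    public static void main(String[] args) {
        int[] keys = {0, 3, 17, 42, 1000003};
        TwoLevelTableSketch t = new TwoLevelTableSketch(keys, new Random());
        System.out.println(t.contains(42));
        System.out.println(t.contains(5));
    }
}
```

Because the set is static, all of the randomized work happens once, when the table is built; every later search follows the same fixed, collision-free path.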

Students who are interested in learning more about this should consider the optional “challenge problems” that are given at the end of the tutorial exercise on hash tables.

Additional References

The main reference, used as the source of information when creating the lecture notes on this topic, has been Cormen, Leiserson, Rivest and Stein’s text “Introduction to Algorithms,” which can be read online as an ebook. Hash tables are discussed in Chapter 11 of this text. The Wikipedia articles on hash functions, universal hashing, and perfect hash functions are all quite readable and provide additional information about these topics.

http://www.cpsc.ucalgary.ca/~jacobs/Courses/cpsc331/W12/handouts/lecture20-supplement.html