Hashing ======= Searches discussed so far vary in the efficiency with which they can find the appropriate key. Obviously, the ideal is to have a way of searching which was always O(1). While this is rarely possible, there is a method which is O(1) or some multiple rather than being dependent on the number of data elements. Hash tables are a way of finding a record using keys which are a part of the record and which also can be found by using specific algorithms which operate on the key to determine the location of the information. The information (records for example) can be stored in the hash table or it may be stored on disk with disk offsets to the file position stored in the hash table. The process of determining the location of the information in the hash table uses a hash function. The process of BUILDING or FINDING information in the hash table is called hashing. Hashing is the antithesis of sorting. Sorting arranges the records in some pattern. Hashing scatters the records throughout the hash table in a completely random fashion. Therefore, hashing is appropriate for implementing a specified relationship among elements but it does not lend itself to operations which attempt to make use of other relationships among the data. In implementing the hash table, other relationships may be destroyed. For example an ordered list can be put into a hash table but since hashing is the antithesis of sorting, the sorted nature of the data is lost. It is therefore difficult to determine the fourth item in the ordered sequence. The price you pay for efficient searching is the loss of relationships which previously existed in the data. Suppose that a company has 500 employees and that the record for each employee is to be hashed (placed in the table) by social insurance number. Using the whole number would require 1 billion storage locations in order to guarantee that each number would end up in a unique location. This is an obvious waste of memory and is in fact not likely possible. What is needed is a way to use the social security number with a hashing function which would determine a unique location for each name and really only use an array of records which is slightly longer than 500 records. Hashing, by definition, assumes that there is random access to an element which contains the key. It therefore assumes that the implementation provide this (at least initially) and therefore the structure employed is an array. Hash Functions ============== Since hash functions must manipulate the data in some way to find a hash location, arithmetic and logical process may form some part of any one of the categories. Further, the language chosen may also affect how a hash function is actually implemented. The ideal is that the key exists in a form which would uniquely identify each entity and that if an array were constructed, the number of elements in the array would be equal to the number of pieces of data. Since this situation rarely exists, we try to find a way to fit all of the data into an array of the smallest size that will accommodate all of the data. This array size is also dependent on how the hashing process is implemented. Initially, we will assume that the array must be at least equal in size to the number of pieces of data that will be inserted but in most cases, it will be larger than the number of items. Hashing is the process of chopping up the key and mixing it up in various ways in order to obtain an index which will be uniformly distributed over the range of indices -- hence the 'hashing'. There are several common ways of doing this. Truncation ---------- This is a method in which parts of the key are ignored and the remaining portion becomes the index. We take the given key and produce a hash location by taking portions of the key (truncating the key so that the result is smaller). For example, if a hash table can hold 1000 entries and an 8-digit number is used as the key, the third, fifth, and seventh digits starting from the left of the key could be used to produce the index. [e.g... key is 62538194 the hash location is 589]. Notice that the range of possibilities using this hash function is from 0 to 999 which is 1000 different values. Although this type of function is simple and easy to implement, it unfortunately often fails to distribute the keys evenly. When keys fall into groups in the table, clustering is said to occur. By using only portions of the key, the possibilities are reduced and in many cases (such as specific digits in social security numbers) many may be the same. Folding ------- This process breaks the key into several parts and recombines the parts to form an index. The parts may be recombined by addition, subtraction, multiplication, etc and may have to be truncated as well. Such a process is usually better than truncation by itself since it produces a better distribution because all of the numbers in the key contribute to the hash function. [ e.g... using a key of 62538194, and breaking it into 3 numbers using the first 3, the second 3 and the last 2 digits would produce 625, 381, and 94. These could be added to get 1100 which could be truncated to 100. They could also be multiplied together and then three digits chosen from the middle of the number produced]. If the language allows, bit shifting and oring are forms of folding. We are really manipulating parts of a key at the lowest level in order to obtain new combinations to provide an index to the hash table. Modular Arithmetic ------------------ This process essentially assures that the index produced falls within a specified range. The key is converted to an integer which is divided by the range of the index with the resulting function being the value of the remainder. In its simplest form the key is an integer and the modulus of the key (the % operator)is found directly. In other cases, converting to an integer may use any number of processes to convert the key to an integer and also might employ something from the two categories above. If the value of the modulus is a prime number, the distribution of indices obtained is quite uniform, in fact, hash tables almost always have a size which is a prime number. A table whose size is some number which has many prime factors provides the possibilty of many indices which are the same. You would not choose a hashsize which is a multiple of 2, for example. Therefore, if a hash table is to be constructed for about 1000 records, the best bet would be to have a table which can hold at least 1009 records. Sometimes the hash function is suggested by the structure of the key. In other cases, the best hash function can only be determined by experimentation. Perfect minimal hash functions do exist. These are hash functions which will place every item uniquely in the smallest possible hashsize (ideally equal to the number of items). Generally these hash functions are produced when all of the keys are known in advance. Current algorithms do not work well for large numbers of values ( > 500) because the algorithms are generally around O(N^6). @@@@@@@ Hash functions which use all of the key are almost always better than those which use only some of the key. When only portions are used, information is lost and therefore the number of possibilities for the final key are reduced. Similarly, if we work with only integers or characters, there is a limit to what can be done. However, if we deal with the integer in its binary form, then the number of pieces that can be manipulated by the hash function is greatly increased. This is one of the instances where working at the lowest level is advantageous. It is obvious that no matter what function is used, the possibility exists that the use of the function will produce an index which is a duplicate of an index which already exists. This is termed a collision. Therefore, choosing a hash function which is efficiently calculated and which yields a good distribution is only solving some of the problems in hashing. The resolution of collisions must also be addressed. Collision Resolution -------------------- The process of using a hash function to find some initial location in a hash table is often referred to as open hashing. If the position obtained is already occupied, then another open position must be found since there is a collision. Collisions are resolved by a process of rehashing (also called closed hashing). There are two major ways of resolving the collisions. Open Addressing Resolving collisions by open addressing is resolving the problem by taking the next open space as determined by rehashing the key according to some algorithm. Since the hash table has enough space to hold all of the records some space must be open -- thus the 'open addressing'. Some open addressing procedures are: - Linear Probing ---------------- This process starts with the position at which the collision occurred and does a sequential search for the next open position. This essentially the function newposition = (currentposition + 1 ) % hashsize One of the major problems with this approach is that once the table becomes quite full the length of the search for an open space increases. In practice, clustering occurs, that is, the used spaces tend to appear in groups which tend to grow and thus increase the search time to reach an open space. It does, however, eventually find an open space. -Incremental Functions ---------------------- In order to try to avoid clustering, a method which does not look for the first open space must be used. There are numerous incremental functions. Some just vary the amount by which they jump to the next position, However, others use the key and rehash it. This makes the increment dependent on the key that collides with another and is often used. These methods are called key dependent Increments. - Key dependent Increments -------------------------- If the original hash function results in a good distribution, then key-dependent functions work quite well for rehashing and all locations in the table will eventually be probed for a place to put the element. Key-dependent increments are determined by using the key to calculate a new value and then using this as an increment to determine successive probes. Some checks should also be done before the increment obtained is used to search for new positions. For example, if the original hash function was (key % 401) we might choose a function of (key % 13) to find the increment. Thus the closed hash function becomes newposition = currentposition + (key % 13) The increment is now dependent on the key and different increments exist. If the increment is equal to hashsize the same position will be probed each time so this value cannot be used. Generally, a different number is used to find the modulus and while the open hash function could be used to find an increment for the closed hash function, this should be avoided if possible. Another problem exists if the size of the hash table is not a prime. For a number of the increment values only some of the positions will be probed regardless of the number of increments. If we ensure that the hashsize is prime and the divisors for the open and closed hash are different and prime, then this method will eventually access all positions as does the linear probe. However, using a key-dependent method does not usually result in as much clustering and therefore searches for an empty position should not be as long as for the linear method. There are many more ways of creating hash tables. Some are better than the ones mentioned here but most are also much more complex. They should be investigated if the nature of the key is such that it is difficult to provide good open and closed hash functions which distribute the data well. Effect of Deletions =================== Deletions from a hash table cause considerable problems. Memory space which comes available because of a deletion should be re-used. If there have been no collisions at that location resulting from hashing the key, then the deletion would pose no problem. However, if after inserting the first element into the location, subsequent hashing has resulted in the same location and rehashing has been used to place the element in another position, taking out the first element causes problems. When searching for an element in a hash table, it is important to note that the SEARCHING ALGORITHM IS really the same as the INSERTION ALGORITHM. Therefore, if there is a collision, some mechanism is available to resolve the problem and place the element. The probe continues until an open space is found. Removing any element breaks the search because an open element indicates that an element can be placed, or conversely, indicates that no more elements hash to that location given the initial key. In order to cope with this, any deletion must set a flag in the hash table to indicate that the space is now open for the insertion of a new element if insertion is taking place but that if the hash table is being searched, there are other elements which follow and the collision resolution procedures should be used to continue searching the table. This flag which indicates that a value did exist but has been deleted is often called a tombstone. Because deletions cause considerable difficulties and can reduce the efficiency of using hash tables, they should be avoided whenever possible. Hash tables are not good data structures if the data changes often. Thus under normal conditions, deletions occur just before the table is rebuilt. Collision Resolution by Chaining ================================ External Chaining ----------------- Using arrays for hash tables is reasonable because there is random access to any element. Thus access can be reduced to O(1), or if collisions occur some multiple depending on the number of rehashes required before the collision is resolved. However, arrays can use substantial amounts of memory and in order to make collision resolution effective it is often necessary to declare arrays which are larger than are actually filled with elements (so searches for empty positions do not reach an overflow condition). An option is to make the hash table an array of pointers to linked lists. Then when a key is hashed, the position in the table is accessed which points to a linked lists. Resolution of collisions is simple -- it requires only that a node be added to a linked list. This insertion is usually as the first item in the list which is the easiest to accomplish since it does not require a traversal of the list. Deletions also pose no special problems. As long as the hash function produces a reasonable distribution, none of the linked lists is likely to be very long. If the open hash is a poor one, then the lists will be long and the search to find a key is sequential through the list. The search for a key can soon become more than O(lg n) when keys hash to only a few hash locations. Analysis of Searching Using Hash Tables ======================================= As mentioned, searching follows the same procedure as insertion. When collisions occur, a method exists for resolving the collisions. When searching, if the element is not found at the first locations, the rehashing procedure is used to determine the next position to check until the search is either successful or not. With any reasonable size problem, collisions are likely to occur, therefore, these should be minimized but using functions which produce a uniform distribution and collision resolution procedures should be efficient (not access too many positions which are already filled or produce clusters). In analyzing search efficiency, the average is usually used. The average is the sum of the search lengths for all of the items divided by the number of items. Searching with hash tables is highly dependent on how full the table is since as the table approaches a full state, more rehashes are necessary. The proportion of the table which is full is called its load factor. When collisions are resolved using open addressing the maximum load factor is 1. Using chaining however, the load factor can be greater than 1 when the table is full but the linked list attached to each hash address can have more than 1 element. The table below gives the average number of probes for different collision resolution methods at various load levels for the hash table. The values result from experimentation with the different hash tables and methods. The tables are obviously quite large (and must be more than 430). The important thing to look at the magnitude of the differences at different loads and with different methods of collision resolution. Note that the average number of probes is different when the search is successful and unsuccessful. Load factor 0.10 0.50 0.80 0.90 0.99 2.00 ============================================================ Successful Chaining 1.04 1.2 1.4 1.4 1.5 2.0 Key dep 1.04 1.5 2.1 2.7 5.2 Linear 1.05 1.6 3.4 6.2 21.3 _ Unsuccessful Chaining 0.11 0.53 0.78 0.90 0.99 2.04 Key dep 1.13 2.2 5.2 11.9 126 Linear 1.13 2.7 15.4 59.8 430 Obviously, chaining consistently requires fewer probes than open addressing. However, traversal of the linked list is slow and if the records are small it may be just as well to use open addressing. With a good open hash function which distributes the keys evenly, chaining is best under two conditions -- when the number of unsuccessful searches is large or when the information stored is large (assuming that the records are stored in the table). Open addressing would likely be a reasonable choice when most searches are likely to be successful, the load factor is moderate, and the information stored with the key is relatively small. In comparison to other methods of search it is important to note that the number of probes is dependent only on the load factor on the hash table and not on the absolute number of items in the table. On average, retrieval from a hash table which contains 20000 items out of a possible 40000 items is no slower than retrieval from a table of 200 items out of a possible 400. With sequential search, if the number of items is increased 100 times, it would take 100 times as long to search. If a binary search tree is increased 100 times, the search would take about twice as long. Obviously, the price paid for the efficiency of the hash table is that large amounts of memory must be available for a large amount of data in order to keep the load factor moderate.