CmSc 250 Intro to Algorithms
Hashing: Collision Resolution - Separate Chaining
- Collision resolution
Problem: Obviously, a mapping from a potentially huge set of strings to a small set of integers
will not be unique. The hash function maps keys into indices in many-to-one fashion.
Having a second key into a previously used slot is called a collision.Collision resolution: deals with keys that are mapped to same addresses.
Methods:
Separate chaining
Open addressing
Linear probing
Quadratic probing
Double hashing
- Separate chaining
Invented by H. P. Luhn, an IBM engineer, in January 1953.
Idea: Keys hashing to same address are kept in lists attached to that addressFor each table address, a linked list of the records whose keys hash to that address, is built. This method is useful for highly dynamic situations, where the number of the search keys cannot be predicted in advance.
Example
Records: (each character is a key):
Key: A S E A R C H I N G E X A M P L E Hash (M = 11): 1 8 5 1 7 3 8 9 3 7 5 2 1 2 5 1 5 Separate chaining: 0 1 2 3 4 5 6 7 8 9 10 = L M N = E = G H I = A X C P R S = A = = E = = A E = =Each column is a linked list. Thus we maintain M lists with M list header nodes. We can use the list search and insert procedures for sequential searching.
Complexity of separate chaining
The time to compute the index of a given key is a constant. Then we have to search in a list for the record. Therefore the time depends on the length of the lists. It has been shown empirically that on average the list length is N/M (the load factor L), provided M is prime and we use a function that gives good distribution.
Unsuccessful searches go to the end of some list, hence we have L comparisons. Successful searches are expected to go half the way down some list. On average the number of comparisons in successful search is L/2. Therefore we can say that runtime complexity of separate chaining is O(L).
Note, that what really matters is the load factor rather than the size of the table or the number of records, taken separately.
How to choose M in separate chaining?
Since the method is used in cases when we cannot predict the number of records in advance, the choice of M basically depends on other factors such as available memory. Typically M is chosen relatively small so as not to use up a large area of contiguous memory, but enough large so that the lists are short for more efficient sequential search. Recommendations in the literature vary form M to be about one tenth of N - the number of the records to M to be equal (or close to) N.
Other methods of chaining:
- Keep the lists ordered: useful if there are much more searches than inserts,
and if most of the searches are unsuccessful.- Represent the chains as binary search tree. Extra effort needed – not efficient.
Advantages of separate chaining– used when memory is of concern, easily implemented.
Disadvantages – unevenly distributed keys – long lists and many empty spaces in the table.
Learning goals
Exam-like questions
S I M P L E T A S K Hash : 7 3 5 7 5 3 7 1 7 4What would be the disposition of the keys using separate chaining in a table of size 7?