CmSc 250 Intro to Algorithms


Hashing: Collision Resolution - Separate Chaining

  1. Collision resolution
  2. Separate chaining

Learning Goals
Exam-like questions


  1. Collision resolution

  2. Problem: Obviously, a mapping from a potentially huge set of strings to a small set of integers
    will not be unique. The hash function maps keys into indices in many-to-one fashion.
    Having a second key into a previously used slot is called a collision.

    Collision resolution: deals with keys that are mapped to same addresses.

    Methods:

    Separate chaining
    Open addressing

    Linear probing
    Quadratic probing
    Double hashing

  3. Separate chaining

  4. Invented by H. P. Luhn, an IBM engineer, in January 1953.
    Idea: Keys hashing to same address are kept in lists attached to that address

    For each table address, a linked list of the records whose keys hash to that address, is built. This method is useful for highly dynamic situations, where the number of the search keys cannot be predicted in advance.

    Example

    Records: (each character is a key):

    
    Key:             A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
    Hash (M = 11):   1  8  5  1  7  3  8  9  3  7  5  2  1  2  5  1  5
    
    Separate chaining:
    		 0  1  2  3  4	5  6  7	 8  9 10
    		 =  L  M  N  =	E  =  G	 H  I  =
    		    A  X  C	P     R	 S  =
    		    A  =  =	E     =	 =
    		    A		E
    		    =		=
    
    

    Each column is a linked list. Thus we maintain M lists with M list header nodes. We can use the list search and insert procedures for sequential searching.

    Complexity of separate chaining

    The time to compute the index of a given key is a constant. Then we have to search in a list for the record. Therefore the time depends on the length of the lists. It has been shown empirically that on average the list length is N/M (the load factor L), provided M is prime and we use a function that gives good distribution.

    Unsuccessful searches go to the end of some list, hence we have L comparisons. Successful searches are expected to go half the way down some list. On average the number of comparisons in successful search is L/2. Therefore we can say that runtime complexity of separate chaining is O(L).

    Note, that what really matters is the load factor rather than the size of the table or the number of records, taken separately.

    How to choose M in separate chaining?

    Since the method is used in cases when we cannot predict the number of records in advance, the choice of M basically depends on other factors such as available memory. Typically M is chosen relatively small so as not to use up a large area of contiguous memory, but enough large so that the lists are short for more efficient sequential search. Recommendations in the literature vary form M to be about one tenth of N - the number of the records to M to be equal (or close to) N.

    Other methods of chaining:

    • Keep the lists ordered: useful if there are much more searches than inserts,
      and if most of the searches are unsuccessful.
    • Represent the chains as binary search tree. Extra effort needed not efficient.

    Advantages of separate chaining used when memory is of concern, easily implemented.
    Disadvantages unevenly distributed keys long lists and many empty spaces in the table.


Learning goals

  • Know about collision resolution methods
  • Be able to explain extendible hashing

Exam-like questions

  1. Assume that each letter is a key of a record, and the hash function is computed for each key:
    
    	S I M P L E T A S K
    Hash :  7 3 5 7 5 3 7 1 7 4
    
    What would be the disposition of the keys using separate chaining in a table of size 7?

  • Discsuss the complexity of search in separate chaining.
  • Discuss the choice of the table size in separate chaining.

  • Back to Contents page
    Created by Lydia Sinapova