CmSc 250 Intro to Algorithms


Hashing: Open Addressing, Rehashing, Extendible Hashing

  1. Open addressing
  2. Rehashing
  3. Extendible Hashing
  4. Conclusion

Learning Goals
Exam-like questions


  1. Open addressing

  2. Invented by A. P. Ershov and W. W. Peterson in 1957 independently.

    Idea: Store collisions in the hash table itself.

    The method uses a collision resolution function in addition to the hash functon.
    If collision occurs, next probes are performed following the formula:

    hi(x) = (hash(x) + f(i)) mod TableSize

    where:
    hash(x) is the hash function
    f(i) is the collision resolution function
    i is the number of the current attempt (probe) to insert an element.

    1. Linear probing (linear hashing, sequential probing): f(i) = i

    2. Insert: When there is a collision we just probe the next slot in the table.
      If it is unoccupied – we store the key there.
      If it is occupied – we continue probing the next slot.

      Search: If the key hashes to a position that is occupied and there is no match,
      we probe the next position.

      a) match – successful search
      b) empty position – unsuccessful search
      c) occupied and no match – continue probing.

      When the end of the table is reached, the probing continues from the beginning,
      until the original starting position is reached.

      Problems with delete: a special flag is needed to distinguish deleted from empty positions.
      This is necessary for the search function – if we come to a "deleted" position,
      the search has to continue as the deletion might have been done after
      the insertion of the key we are looking for, and it might be further in the table.

      Here is an example of Linear probing

      Total amount of memory space – less, since no pointers are maintained.

      Disadvantage: " Primary clustering"

      Large clusters tend to build up: if an empty slot is preceded by i filled slots, the probability that the empty slot is the next one to be filled is (i+1)/M.
      If the preceding slot was empty, the probability is 1/M.

      This means that when the table begins to fill up, many other slots are examined.
      Linear probing runs slowly for nearly full tables.

    3. Quadratic probing: f(i) = i2

    4. A guadratic function is used to compute the next index in the table to be probed.

      Example:

      In linear probing we check the i-th position. If it is occupied, we check the i+1st position,
      next we check the i+2nd, etc.
      In quadric probing, if the i-th position is occupied we check the i+1st,
      next we check the i+4th, next - i + 9th etc.

      The idea here is to skip regions in the table with possible clusters.

    5. Double hashing: f(i) = i * hash2(x)

    6. Purpose – same as in quadratic probing : to overcome the disadvantage of clustering.

      Instead of examining each successive entry following a collided position, we use
      a second hash function to get a fixed increment for the "probe" sequence.
      The second function should be chosen so that the increment and M are relatively prime.
      Otherwise there will be slots that would remain unexamined.

      Example: hash2(x) = R - (x mod R), R is smaller than TableSize, prime.

    In open addressing the load factor L is less than 1.
    Good strategy is to keep L < 0.5
    If the table is close to full, the search time grows and may become equal to the table size

  3. Rehashing
  4. If the table is close to full, the search time grows and may become equal to the table size.

    When the load factor exceeds a certain value (e.g. greater than 0.5) we do rehashing :

    Build a second table twice as large as the original
    and rehash there all the keys of the original table.

    Rehashing is expensive operation, with running time O(N)
    However, once done, the new hash table will have good performance.

  5. Extendible hashing
  6. Used when the amount of data is too large to fit in main memory and external storage is used.

    N records in total to store, M records in one disk block

    The problem: in ordinary hashing several disk blocks may be examined to find an element -
    a time consuming process.

    Extendible hashing: no more than two blocks are examined.

    Idea:

    Keys are grouped according to the first m bits in their code.
    Each group is stored in one disk block.
    If the block becomes full and no more records can be inserted, each group is split into two,
    and m+1 bits are considered to determine the location of a record.

    Example: lets' say we have 4 groups of keys according to the first two bits:

    
    
    00		01		10		11
    
    00010		01001		10001		11000
    00100		01010		10100		11010
    		01100
    

    Each disk block in the example can contain 3 records only, 4 blocks are needed to store the above keys

    New key to be inserted: 01011.

    Block2 is full, so we start considering 3 bits:

    
    000/001		010		011		100/101	       110/111
    (still on same block)
    
    00010		01001		01100		10001		11000
    00100		01010				10100		11010
    		01011						
    
    

    The second group of keys is split onto two disk blocks - one for keys staring with 010,
    and one for keys starting with 011.

    A directory is maintained in main memory with pointers to the disk blocks for each bit pattern.
    The size of the directory is 2D = O(N(1+1/M)/M), where
    D - number of bits considered
    N - number of records
    M - number of disk blocks.

  7. Conclusion

  8. Hashing is the best search method (constant running time) if we don't need to have the records sorted.

    The choice of the hash function remains the most difficult part of the task and depends very much on the nature of the keys.

    Separate chaining or open addressing?

    Open addressing is the preferred method if there is enough memory
    to keep a table twice larger than the number of the records.

    Separate chaining is used when we don't know in advance the number of the records to be stored. Though it requires additional time for list processing, it is simpler to implement.

    Some application areas

    Dictionaries, on-line spell checkers, compiler symbol tables.


Learning goals

  • Know about collision resolution methods
  • Be able to explain extendible hashing

Exam-like questions

  1. Assume that each letter is a key of a record, and the hash function is computed for each key:
    
            S I M P L E T A S K
    Hash :  7 3 5 7 5 3 7 1 7 4
    
    What would be the disposition of the keys using linear probing in a table of size 15?

  • What should be the value of the load factor with open addressing so that we have good performance?
  • When is separate chaining preferred over open addressing?

  • Back to Contents page
    Created by Lydia Sinapova