CmSc 250 Intro to Algorithms


Hashing

Learning Goals
Exam-like questions


  1. Introduction
  2. Accessing elements in an array is extremely efficient. Array elements are accessed by index. If we can find a mapping between the search keys and indices, we can store each record in the element with the corresponding index. Thus each element would be found with one operation only.

    Example:

    To keep information about 1000 students, we can assign and identification number to each student
    – a number between 0 and 999, and then use an array of 1000 elements.

    Let us assume that we want to use the SSN of each student. SSN is a 9-digit number. If we want to use directly the number as an index, the table should have much more elements than the number of the students, which is a great waste of space.

    However, if we had a mapping between the 1000 SSNs of the students
    and the numbers between 0-999, than we can use the SSNs as search keys and still use an array of 1000 elements to store the records.

    Approach: Directly referencing records in a table by doing arithmetic operations on keys
    to map them onto table addresses.

    Advantage: the records can be references directly - ideally the search time is a constant , complexity O(1)

    Question: how to find such correspondence?
    Answers:

    direct access tables
    hash tables

  3. Direct-address tables
  4. Direct-address tables – the most elementary form of hashing.

    Assumption – direct one-to-one correspondence
    between the keys and numbers 0, 1, …, m-1., m – not very large.

    Searching is fast, but there is cost – the size of the array we need is the size of the largest key.
    Not very useful if only a few keys are widely distributed.

  5. Hash functions
  6. Hash function: function that transforms the search key into a table address.

    Hash functions transform the keys into numbers within a predetermined interval. These numbers are then used as indices in an array (table, hash table) to store the records

    Here is the code of a hash function using base = 32. The key is a string of characters, stored in an array key of length Key_Length.

    tbl_size is the size of the table.

    
    int hash32 (char key[Key_Length])
    {
      int h = 0;
      int i;
    
      for (i=0; i < Key_Length ; i++)
      {
        h = (32 * h + key[i]) % tbl_size;
      }
      
      return h;
    }
    

  7. Hash tables: Basic concepts
  8. Once we have found the method of mapping keys to indexes, the questions to be solved is how to choose the size of the table (array) to store the records, and how to perform the basic operations:

    insert
    search
    delete

    Let N be the number of the records to be stored, and M - the size of the array (hash table). The integer, generated by a hash function between 0 and M-1 is used as an index in a hash table of M elements.

    Initially all slots in the table are blank. This is shown either by a sentinel value, or a special field in each slot.
    To insert use the hash function to generate an address for each value to be inserted.
    To search for a key in the table the same hash function is used.
    To delete a record with a given key - first we apply the search method and when the key is found we delete the record.

    Size of the table: Ideally we would like to store N records in a table with size N. However, in many cases we don't know in advance the exact number of records. Also, the hash function can map two keys to one and the same index, and some cells in the array will not be used. Hence we assume that the size of the table can be different from the number of the records. We use M to denote the size of the table.

    A characteristic of the hash table is its

    load factor L = N/M: the ratio between the number of records to be stored and the size of the table. The method to choose the size of the table depends on the chosen method of collision resolution, discussed below.

    M should be a prime number.
    It has been proved that if M is a prime number, we obtain better (more even) distribution
    of the keys over the table.

    Collision resolution

    Collision is the case when two or more keys hash to one and the same index in the hash table.

    Collision resolution deals with keys that are mapped to same indexes.

    Methods:

    Separate chaining
    Open addressing

    Linear probing
    Quadric probing
    Double hashing

Learning goals

  • Know the basic idea of hashing
  • Know the Horner's method to represent and compute polynomials
  • Know what a hash function is and the basic method to compute an index

Exam-like questions

  • What does a hash function do?
  • Give the Horner's representation of the following polynomial:
    6x4 + 3x3 + 2x2 + 6x + 1
  • Find the decimal number corresponding to IOWA, when it is treated as a base-32 number.
  • The size of the hash table is usually chosen to be a prime number. Why?

Back to Contents page
Created by Lydia Sinapova