CmSc 250 Intro to Algorithms
Hashing
Learning Goals
Exam-like questions
- Introduction
Accessing elements in an array is extremely efficient. Array elements are accessed by index. If we can find a mapping between the search keys and indices, we can store each record in the element with the corresponding index. Thus each element would be found with one operation only.
Example:
To keep information about 1000 students, we can assign and identification number to each student
– a number between 0 and 999, and then use an array of 1000 elements.Let us assume that we want to use the SSN of each student. SSN is a 9-digit number. If we want to use directly the number as an index, the table should have much more elements than the number of the students, which is a great waste of space.
However, if we had a mapping between the 1000 SSNs of the students
and the numbers between 0-999, than we can use the SSNs as search keys and still use an array of 1000 elements to store the records.Approach: Directly referencing records in a table by doing arithmetic operations on keys
to map them onto table addresses.Advantage: the records can be references directly - ideally the search time is a constant , complexity O(1)
Question: how to find such correspondence?
Answers:direct access tables
hash tables
- Direct-address tables
Direct-address tables – the most elementary form of hashing.
Assumption – direct one-to-one correspondence
between the keys and numbers 0, 1, …, m-1., m – not very large.Searching is fast, but there is cost – the size of the array we need is the size of the largest key.
Not very useful if only a few keys are widely distributed.
- Hash functions
Hash function: function that transforms the search key into a table address.
Hash functions transform the keys into numbers within a predetermined interval. These numbers are then used as indices in an array (table, hash table) to store the records
Keys – numbers.
If M is the size of the array, then h(key) = key % M.
This will map all the keys into numbers within the interval [0, M-1].- Keys – strings of characters
Treat the binary representation of a key as a number, and then apply the first case.
How keys are treated as numbers: If each character is represented with m bits,
then the string can be treated as base-2m number.Example:
A K E Y : 00001 01011 00101 11001 = 1 . 323 + 11 . 322 + 5 . 321 + 25 . 320 = 44271
If the keys are very long, an overflow may occur.
A solution to this is to apply the Horner’s method in computing the hash function.Horner’s method – representation of polynomials:
anxn + an-1.xn-1 + an-2.xn-2 + … + a1x1 + a0x0 =
x(x(…x(x (an.x +an-1) + an-2 ) + …. ) + a1) + a0
Example
4x5 + 2x4 + 3x3 + x2 + 7x1 + 9x0 =
x(x(x(x (4.x +2) + 3) + 1 ) + 7) + 9
The polynomial can be computed by alternating the multiplication and addition operations.
V E R Y L O N G K E Y
10110 00101 10010 11001 01100 01111 01110 00111 01011 00101 11001
22 5 18 25 12 15 14 7 11 5 25
22.3210 + 5. 329 + 18. 328 + 25. 327 + 12. 326 + 15.325 + 14.324 + 7. 323 +
+ 11. 322 + 5. 321 + 25. 320
=(((((((((22.32 + 5)32 + 18)32 + 25)32 + 12)32 + 15)32 + 14)32 + 7)32 +11)32 +5)32 + 25
Having this representation we compute the hash function by applying
the mod operation at each step, thus avoiding overflowing.First we compute h0 = (22.32 +5)%N
Then we compute h1 = (32.h0 + 18)%N
Then h2 = (32.h1 +25)%N
Etc.
Here is the code of a hash function using base = 32. The key is a string of characters, stored in an array key of length Key_Length.
tbl_size is the size of the table.
int hash32 (char key[Key_Length]) { int h = 0; int i; for (i=0; i < Key_Length ; i++) { h = (32 * h + key[i]) % tbl_size; } return h; }
- Hash tables: Basic concepts
Once we have found the method of mapping keys to indexes, the questions to be solved is how to choose the size of the table (array) to store the records, and how to perform the basic operations:
insert
search
deleteLet N be the number of the records to be stored, and M - the size of the array (hash table). The integer, generated by a hash function between 0 and M-1 is used as an index in a hash table of M elements.
Initially all slots in the table are blank. This is shown either by a sentinel value, or a special field in each slot.
To insert use the hash function to generate an address for each value to be inserted.
To search for a key in the table the same hash function is used.
To delete a record with a given key - first we apply the search method and when the key is found we delete the record.Size of the table: Ideally we would like to store N records in a table with size N. However, in many cases we don't know in advance the exact number of records. Also, the hash function can map two keys to one and the same index, and some cells in the array will not be used. Hence we assume that the size of the table can be different from the number of the records. We use M to denote the size of the table.
A characteristic of the hash table is its
load factor L = N/M: the ratio between the number of records to be stored and the size of the table. The method to choose the size of the table depends on the chosen method of collision resolution, discussed below. M should be a prime number.
It has been proved that if M is a prime number, we obtain better (more even) distribution
of the keys over the table.Collision resolution
Collision is the case when two or more keys hash to one and the same index in the hash table.Collision resolution deals with keys that are mapped to same indexes.
Methods:
Separate chaining
Open addressingLinear probing
Quadric probing
Double hashing
Learning goals
Exam-like questions