On the previous slide, we looked at the structure of a hash map, which assigns each key/value pair to one of a number of “buckets” (linked lists, in our case) based on a hash function.
In our simple but impractical example, we took the length of the string as the hash function. If we solve the problem of collisions by keeping a linked list in each bucket, then taking the string length as the hash function will work in principle. But it has an obvious problem if, say, we want our keys to be 100,000 words of English. In that case, the linked list at array position 6, along with those at other common string lengths, will still have thousands of entries, while the positions for very long strings will have practically none. Sure, it’s better to scan a list of 10,000 entries than a list of 100,000 when looking for a key. But really, taking the string length as the hash function isn’t adequate.
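A minimal sketch of that length-based scheme might look like the following (the names lengthBucket and numBuckets are made up for this example, not taken from any library):

static int lengthBucket(String key, int numBuckets) {
    // A deliberately poor hash function: the bucket index depends only on
    // the length of the key, so all words of a common length collide.
    return key.length() % numBuckets;
}

For example, "planet", "silver" and "bottle" all end up in the same bucket, simply because each is six characters long.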
We need to choose a better hash function that ideally has these properties:
* It can distribute the keys more or less evenly over the range of positions in the array. We want to avoid the situation where position 6 has a list of 10,000 entries and position 13 has a list of only 20 entries.
* It can distribute the keys over a larger range of values. If there are 100,000 words in the map and a perfectly-distributed hash function, then having 50 positions in the array would still give an average list size of 2,000 items. But if we could have 5,000 positions, there'd be just 20 items in each list. And if we had in the region of 100,000 positions then we'd on average make just one or two comparisons to find the mapping we were looking for.
In practice, we combine these requirements and define our task as coming up with a hash function that distributes hash codes evenly over the entire range of possible integers (in Java, a 4-byte int, giving about 4 billion possible values). Then we can assume that whatever the size of the array in our hash map (whatever the modulus), the mappings will be distributed more or less evenly.
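As a sketch of that two-step scheme (the method name bucketIndex and the capacity parameter are mine, not from any particular library):

static int bucketIndex(String key, int capacity) {
    // Step 1: a hash code spread over the full 32-bit int range
    // (here we simply reuse String.hashCode()).
    int h = key.hashCode();
    // Step 2: the map reduces that code modulo its own array length;
    // clearing the sign bit keeps the result non-negative.
    return (h & 0x7fffffff) % capacity;
}

With 100,000 keys, a capacity of 50 gives chains of roughly 2,000 entries, while a capacity of 5,000 brings that down to roughly 20, matching the arithmetic above. (java.util.HashMap itself keeps its capacity a power of two and uses a bitwise AND in place of the modulus, which amounts to the same idea.)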
Let us have a look at how the Java String class actually computes its hash codes. The algorithm goes something like this:
public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++) {
        // multiply the running hash by 31 and mix in the next character
        hash = hash * 31 + charAt(i);
    }
    // the raw 32-bit value is returned as-is; it is the hash map,
    // not the string, that reduces it modulo its array size
    return hash;
}
This is not literally the code, but it gives a fair idea of the real thing. The good bits about it:
* the function takes account of the values of all the characters in the string;
* on each iteration, it combines two different operations, a multiplication and an addition, which mixes up the bits rather than just summing the character values.
Like any generic algorithm, this one has some worst cases. But it turns out that for many “typical” sets of strings, the algorithm used by the String class works quite well. That is, it tends to generate hash codes with reasonably random lower bits.
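In fact, it is easy to exhibit a collision by hand: since 31*'A' + 'a' and 31*'B' + 'B' are both 2112, the strings "Aa" and "BB" share a hash code. The little test class below (the names HashDemo and polyHash are mine) also checks that the formula above reproduces String.hashCode() exactly:

public class HashDemo {
    // The same polynomial formula as in the sketch above.
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        // The formula matches the library implementation.
        System.out.println(polyHash("hello") == "hello".hashCode()); // true

        // A hand-made collision: 31*'A' + 'a' == 31*'B' + 'B' == 2112.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
    }
}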
So, there you go. This was a very shallow primer on hashing and its applications.
Note that we can hash:
* strings,
* integers,
* arrays,
* anything else, really; you name it.
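In Java, for instance, every object already has a hashCode(), and the standard library provides ready-made helpers for arrays and composite keys:

import java.util.Arrays;
import java.util.Objects;

public class HashAnything {
    public static void main(String[] args) {
        System.out.println("hashing".hashCode());                  // a string
        System.out.println(Integer.hashCode(42));                  // an int (the value itself)
        System.out.println(Arrays.hashCode(new int[] {1, 2, 3}));  // the contents of an array
        System.out.println(Objects.hash("key", 42, true));         // a composite key
    }
}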
With a good hash function, the average-case cost of a search is O(1) comparisons, multiplied by the cost of computing the hash and the cost of a single key comparison.
The worst case for search with hashing is when all the keys collide: the search may then take N comparisons, so the cost is O(N), again multiplied by the cost of computing the hash and of a single comparison.
However, for good enough hash functions the worst case is extremely rare and is almost never encountered with random data.
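To see that worst case in isolation, one can force every key into the same bucket with a deliberately constant hash code; the key class below is contrived, written only for the demonstration:

import java.util.HashMap;
import java.util.Map;

public class WorstCaseDemo {
    // A contrived key type: keys are equal by value, but every key
    // reports the same hash code, so all entries share one bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 7; // every key collides
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), "value" + i);
        }
        // Each lookup must now search one enormous bucket instead of
        // finding its entry after one or two comparisons.
        System.out.println(map.get(new BadKey(99_999)));
    }
}

(Since Java 8, java.util.HashMap softens this particular blow by converting very long chains into balanced trees, but a plain chained hash table really would pay the N comparisons.)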