How Java HashMap Works Internally: Bucket, Hash Code & Collision Handling
Introduction
Java HashMap is one of the most frequently used data structures because of its efficient storage and retrieval, with average constant-time complexity. Internally, HashMap uses a combination of an array, a linked list, and a balanced tree (Red-Black) to store its entries. Let's explore this in detail.
Storage
Let's start with the very basics. When we put a value into a HashMap using the put method, map.put(K, V), HashMap internally uses an array of Node objects to store it. This array is often referred to as the Bucket (in the JDK source, the field is named table).
A Node<K,V> is a class that acts as an individual entry and contains the key and the value.
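For reference, the relevant pieces of the OpenJDK source look roughly like this (a simplified excerpt, not a complete class):

```java
// Simplified excerpt from java.util.HashMap (OpenJDK).
transient Node<K,V>[] table; // the bucket array, allocated lazily

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;     // cached hash of the key
    final K key;
    V value;
    Node<K,V> next;     // link to the next node in the same bucket
    // constructor and Map.Entry methods omitted
}
```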
If we try to visualize the basic storage of the HashMap, we can say the Bucket is an array of Node<K,V> objects, where each Node holds a key and a value.
Initial size of the Bucket?
The initial size is 16, defined by DEFAULT_INITIAL_CAPACITY in the code. 16 is a pragmatic choice made by the JDK developers, but the size should always be a power of two. Why? We will get into that in a moment.
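The constant is defined in the JDK source like this:

```java
// From java.util.HashMap (OpenJDK): the capacity must be a power of two.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
```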
Now, since the Bucket is a plain array, every Node needs an index at which to be stored. How is it calculated?
Index Calculation
The index is calculated from the hashCode of the Key. Let's say we have a key "name" and we want to put map.put("name", "Batman"). We can get the hash code of the Key using "name".hashCode(), which is equal to 3373707.
Note that HashMap internally uses its own method, int hash(Object key), which calculates the hash slightly differently: it spreads the higher bits of the key's hashCode into the lower bits.
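The actual method in the OpenJDK source is short:

```java
// From java.util.HashMap (OpenJDK): XORs the high 16 bits into the
// low 16 bits, since the index computation uses only the low bits.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```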
But 3373707 is larger than the bucket capacity of 16, so we cannot use this number directly as an index. We can use a modulo operation to fit the number into the bucket capacity: 3373707 % 16 = 11. So the entry will be stored at index 11 of the Bucket.
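A one-liner sketch of that arithmetic (my example, using the raw hashCode for simplicity):

```java
public class IndexFromHashCode {
    public static void main(String[] args) {
        int capacity = 16;
        int hash = "name".hashCode();   // 3373707
        // Modulo folds the large hash into the valid range [0, 15].
        System.out.println(hash % capacity); // 11
    }
}
```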
Why Should the Capacity of the Bucket Be a Power of Two?
Since the modulo operation (%) is relatively slow, HashMap instead uses the bitwise AND (&) operator to calculate the index: (capacity - 1) & hash. This trick is equivalent to the modulo only when the capacity is a power of two, because then capacity - 1 is a mask of all ones (for example, 16 - 1 = 15 = 0b1111).
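A quick sketch (mine, not from the JDK) showing the equivalence for a power-of-two capacity:

```java
public class PowerOfTwoIndex {
    public static void main(String[] args) {
        int capacity = 16;                  // power of two
        int hash = "name".hashCode();       // 3373707
        // capacity - 1 == 0b1111, so the AND keeps only the lowest
        // 4 bits of the hash; for non-negative hashes this equals
        // hash % capacity, without the cost of a division.
        System.out.println(hash % capacity);       // 11
        System.out.println((capacity - 1) & hash); // 11
    }
}
```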
Collision Handling
Now, what if two different Keys have the same hash code? For example, "Aa" and "BB" both generate the hash code 2112, which maps to index 0 (2112 % 16 = 0). We cannot put both values into the single slot at index 0. To overcome this, HashMap uses a linked list and/or a Red-Black tree to store the colliding entries at the same index.
Instead of storing a single Node at a particular index, HashMap stores either a linked list or a tree, depending on how many Keys collide at that index. Node<K,V> already acts as a singly linked list node through its next field, and a separate class, TreeNode<K,V>, is used for the red-black tree.
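The tree node in the OpenJDK source looks roughly like this (simplified excerpt):

```java
// Simplified from java.util.HashMap (OpenJDK): used once a bin is
// converted from a chain of Nodes into a red-black tree.
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    // tree operations omitted
}
```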
When the number of colliding Keys at a single index exceeds the TREEIFY_THRESHOLD (8) defined in the code, the linked list is converted to a balanced tree. Why? A linked list lookup is O(n), while a Red-Black tree always maintains O(log n), which is a significant improvement over a linked list for sizes greater than 8. (There is one extra condition: the table capacity must have reached MIN_TREEIFY_CAPACITY, 64; below that, HashMap simply resizes instead of treeifying.) The balanced (Red-Black) tree was introduced in Java 8. On the other hand, when the number of entries in a bin shrinks to 6, defined by UNTREEIFY_THRESHOLD in the code, it is converted back from a tree to a linked list.
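These thresholds are constants in the JDK source:

```java
// From java.util.HashMap (OpenJDK):
static final int TREEIFY_THRESHOLD = 8;    // list -> tree above this
static final int UNTREEIFY_THRESHOLD = 6;  // tree -> list at or below this
static final int MIN_TREEIFY_CAPACITY = 64; // min table size to treeify
```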
So if we put values for the keys Aa and BB using put, map.put("Aa", "val1") and map.put("BB", "val2"), the closest visualization is a single bucket at index 0 holding a two-node chain: table[0] -> ("Aa", "val1") -> ("BB", "val2").
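We can verify the collision with a small demo (this snippet is mine, not from the article's original embed):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" share the hash code 2112, and 2112 % 16 == 0,
        // so both entries land in bucket 0 of a capacity-16 table.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        Map<String, String> map = new HashMap<>();
        map.put("Aa", "val1");
        map.put("BB", "val2");
        // Both live in the same bucket, chained via next, yet lookups
        // still resolve correctly because HashMap also calls equals().
        System.out.println(map.get("Aa")); // val1
        System.out.println(map.get("BB")); // val2
    }
}
```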
Resizing, Rehashing and Load Factor
HashMap uses a loadFactor to decide when to resize the Bucket. The default load factor is 0.75, defined in the code as DEFAULT_LOAD_FACTOR. It means that when the Bucket is approximately 75% full, it is resized to increase the capacity and accommodate more entries. For example, if the current Bucket capacity is 16 and it already holds 0.75 * 16 = 12 Keys, the capacity is doubled to 32 (the next power of two after 16).
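The corresponding constant in the JDK source:

```java
// From java.util.HashMap (OpenJDK): resize when size exceeds
// capacity * loadFactor (the "threshold").
static final float DEFAULT_LOAD_FACTOR = 0.75f;
```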
When the Bucket is resized, every stored entry's index is recalculated so the entries are redistributed across the new capacity, which is known as rehashing. As mentioned, the index is calculated from the capacity, (capacity - 1) & hash, so every entry's index must be recomputed using the new, larger capacity. (The cached hash itself does not change; with a power-of-two doubling, each entry either stays at its old index or moves to old index + old capacity.)
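A tiny sketch (mine) of how an index shifts on resize:

```java
public class RehashDemo {
    public static void main(String[] args) {
        int hash = 27; // an arbitrary example hash
        // Capacity 16: index = 27 & 15 = 11
        System.out.println((16 - 1) & hash); // 11
        // Capacity 32: index = 27 & 31 = 27, i.e. 11 + 16.
        // After a power-of-two doubling, an entry either keeps its
        // index or moves up by exactly the old capacity, depending
        // on the newly unmasked hash bit.
        System.out.println((32 - 1) & hash); // 27
    }
}
```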
Support for null or Empty String as a Key
If we look into HashMap's internal hash method, it returns 0 for a null key. Hence, when null is used as a Key, the entry is always stored at index 0 of the Bucket.
The same goes for the empty string: both "".hashCode() and new String().hashCode() return 0 as the hash code, so an empty-string key also lands at index 0.
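A quick demo (mine) showing that both keys coexist in bucket 0:

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(null, "null key");  // hash(null) == 0 -> bucket 0
        map.put("", "empty key");   // "".hashCode() == 0 -> bucket 0
        // Both occupy bucket 0, chained like any other collision.
        System.out.println("".hashCode()); // 0
        System.out.println(map.get(null)); // null key
        System.out.println(map.get(""));   // empty key
    }
}
```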
Performance
Since HashMap uses a plain array for its storage, the average complexity of put, get, and remove is constant time, O(1). But in the worst case, when multiple Keys end up in the same bucket, the complexity degrades to O(n), because HashMap has to walk the linked list to find the Key. Since Java 8, the worst case is reduced to O(log n) by the balanced tree we discussed above.
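To see the worst case, we can force every key into one bucket with an intentionally bad hashCode (a contrived class of mine, not something from the article):

```java
import java.util.HashMap;
import java.util.Map;

// A key whose hashCode is constant: every instance collides, so all
// entries pile up in a single bucket.
final class BadKey {
    private final int id;
    BadKey(int id) { this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int hashCode() { return 42; } // constant on purpose
}

public class WorstCaseDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(new BadKey(i), i);
        }
        // All 100 entries share one bucket. Before Java 8 this lookup
        // walked a list (O(n)); since Java 8 the bin is treeified, so
        // it costs O(log n) instead.
        System.out.println(map.get(new BadKey(77))); // 77
    }
}
```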