How Java HashMap Works Internally: Bucket, Hash Code & Collision Handling
Introduction
Java HashMap is one of the most frequently used data structures because of its efficient storage and retrieval, with average constant-time complexity. Internally, HashMap uses a combination of an array, a linked list, and a balanced tree (Red-Black) to store its entries. Let's explore this in detail.
Storage
Let's start with the very basics. When we put a value into a HashMap using the put method, map.put(K, V), HashMap internally uses an array of Node objects to store it. This array is often referred to as the Bucket (in the JDK source, the field is named table).
A Node<K,V> is a class that acts as an individual entry and contains the key and the value.
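For reference, the relevant pieces of the OpenJDK source look roughly like this (a simplified excerpt, not a complete class):

```java
// Simplified excerpt from java.util.HashMap (OpenJDK).
transient Node<K,V>[] table; // the bucket array, allocated lazily

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;     // cached hash of the key
    final K key;
    V value;
    Node<K,V> next;     // link to the next node in the same bucket
    // constructor and Map.Entry methods omitted
}
```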
If we try to visualize the basic storage of the HashMap, we can say the Bucket is an array of Node<K,V> objects, where each Node holds a key and a value.
Initial size of the Bucket?
The initial size is 16, defined by DEFAULT_INITIAL_CAPACITY in the code. 16 is a pragmatic choice made by the JDK developers, but the size should always be a power of two. Why? We will get into that in a moment.
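The constant is defined in the JDK source like this:

```java
// From java.util.HashMap (OpenJDK): the capacity must be a power of two.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
```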
Now, since the Bucket is a plain array, every Node needs an index at which to be stored. How is it calculated?
Index Calculation
The index is calculated from the hashCode of the Key. Let's say we have a key "name" and we want to put map.put("name", "Batman"). We can get the hash code of the Key using "name".hashCode(), which is equal to 3373707.
Note that HashMap internally uses its own method, int hash(Object key), which calculates the hash slightly differently: it spreads the higher bits of the key's hashCode into the lower bits.
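The actual method in the OpenJDK source is short:

```java
// From java.util.HashMap (OpenJDK): XORs the high 16 bits into the
// low 16 bits, since the index computation uses only the low bits.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```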
But 3373707 is larger than the bucket capacity of 16, so we cannot use this number directly as an index. We can use a modulo operation to fit the number into the bucket capacity: 3373707 % 16 = 11. So the entry will be stored at index 11 of the Bucket.
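A one-liner sketch of that arithmetic (my example, using the raw hashCode for simplicity):

```java
public class IndexFromHashCode {
    public static void main(String[] args) {
        int capacity = 16;
        int hash = "name".hashCode();   // 3373707
        // Modulo folds the large hash into the valid range [0, 15].
        System.out.println(hash % capacity); // 11
    }
}
```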
Why Should the Capacity of the Bucket Be a Power of Two?
Since the modulo operation (%) is relatively slow, HashMap instead uses the bitwise AND (&) operator to calculate the index: (capacity - 1) & hash. This trick is equivalent to the modulo only when the capacity is a power of two, because then capacity - 1 is a mask of all ones (for example, 16 - 1 = 15 = 0b1111).
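A quick sketch (mine, not from the JDK) showing the equivalence for a power-of-two capacity:

```java
public class PowerOfTwoIndex {
    public static void main(String[] args) {
        int capacity = 16;                  // power of two
        int hash = "name".hashCode();       // 3373707
        // capacity - 1 == 0b1111, so the AND keeps only the lowest
        // 4 bits of the hash; for non-negative hashes this equals
        // hash % capacity, without the cost of a division.
        System.out.println(hash % capacity);       // 11
        System.out.println((capacity - 1) & hash); // 11
    }
}
```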
Collision Handling
Now, what if two different Keys have the same hash code? For example, "Aa" and "BB" both generate the hash code 2112, which maps to index 0 (2112 % 16 = 0). We cannot put both values into the single slot at index 0. To overcome this, HashMap uses a linked list and/or a Red-Black tree to store the colliding entries at the same index.
Instead of storing a single Node at a particular index, HashMap stores either a linked list or a tree, depending on how many Keys collide at that index. Node<K,V> already acts as a singly linked list node through its next field, and a separate class, TreeNode<K,V>, is used for the red-black tree.
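The tree node in the OpenJDK source looks roughly like this (simplified excerpt):

```java
// Simplified from java.util.HashMap (OpenJDK): used once a bin is
// converted from a chain of Nodes into a red-black tree.
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
    // tree operations omitted
}
```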
When the number of colliding Keys at a single index exceeds the TREEIFY_THRESHOLD (8) defined in the code, the linked list is converted to a balanced tree. Why? A linked list lookup is O(n), while a Red-Black tree always maintains O(log n), which is a significant improvement over a linked list for sizes greater than 8. (There is one extra condition: the table capacity must have reached MIN_TREEIFY_CAPACITY, 64; below that, HashMap simply resizes instead of treeifying.) The balanced (Red-Black) tree was introduced in Java 8. On the other hand, when the number of entries in a bin shrinks to 6, defined by UNTREEIFY_THRESHOLD in the code, it is converted back from a tree to a linked list.
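These thresholds are constants in the JDK source:

```java
// From java.util.HashMap (OpenJDK):
static final int TREEIFY_THRESHOLD = 8;    // list -> tree above this
static final int UNTREEIFY_THRESHOLD = 6;  // tree -> list at or below this
static final int MIN_TREEIFY_CAPACITY = 64; // min table size to treeify
```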
So if we put values for the keys Aa and BB using put, map.put("Aa", "val1") and map.put("BB", "val2"), the closest visualization is a single bucket at index 0 holding a two-node chain: table[0] -> ("Aa", "val1") -> ("BB", "val2").
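We can verify the collision with a small demo (this snippet is mine, not from the article's original embed):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" share the hash code 2112, and 2112 % 16 == 0,
        // so both entries land in bucket 0 of a capacity-16 table.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112

        Map<String, String> map = new HashMap<>();
        map.put("Aa", "val1");
        map.put("BB", "val2");
        // Both live in the same bucket, chained via next, yet lookups
        // still resolve correctly because HashMap also calls equals().
        System.out.println(map.get("Aa")); // val1
        System.out.println(map.get("BB")); // val2
    }
}
```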
Resizing, Rehashing and Load Factor
HashMap uses a loadFactor to decide when to resize the Bucket. The default load factor is 0.75, defined in the code as DEFAULT_LOAD_FACTOR. It means that when the Bucket is approximately 75% full, it is resized to increase the capacity and accommodate more entries. For example, if the current Bucket capacity is 16 and it already holds 0.75 * 16 = 12 Keys, the capacity is doubled to 32 (the next power of two after 16).
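The corresponding constant in the JDK source:

```java
// From java.util.HashMap (OpenJDK): resize when size exceeds
// capacity * loadFactor (the "threshold").
static final float DEFAULT_LOAD_FACTOR = 0.75f;
```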
When the Bucket is resized, every stored entry's index is recalculated so the entries are redistributed across the new capacity, which is known as rehashing. As mentioned, the index is calculated from the capacity, (capacity - 1) & hash, so every entry's index must be recomputed using the new, larger capacity. (The cached hash itself does not change; with a power-of-two doubling, each entry either stays at its old index or moves to old index + old capacity.)
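A tiny sketch (mine) of how an index shifts on resize:

```java
public class RehashDemo {
    public static void main(String[] args) {
        int hash = 27; // an arbitrary example hash
        // Capacity 16: index = 27 & 15 = 11
        System.out.println((16 - 1) & hash); // 11
        // Capacity 32: index = 27 & 31 = 27, i.e. 11 + 16.
        // After a power-of-two doubling, an entry either keeps its
        // index or moves up by exactly the old capacity, depending
        // on the newly unmasked hash bit.
        System.out.println((32 - 1) & hash); // 27
    }
}
```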
Support for null or Empty String as a Key
If we look into HashMap's internal hash method, it returns 0 for a null key. Hence, when null is used as a Key, the entry is always stored at index 0 of the Bucket.
The same goes for the empty string: both "".hashCode() and new String().hashCode() return 0 as the hash code, so an empty-string key also lands at index 0.
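A quick demo (mine) showing that both keys coexist in bucket 0:

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(null, "null key");  // hash(null) == 0 -> bucket 0
        map.put("", "empty key");   // "".hashCode() == 0 -> bucket 0
        // Both occupy bucket 0, chained like any other collision.
        System.out.println("".hashCode()); // 0
        System.out.println(map.get(null)); // null key
        System.out.println(map.get(""));   // empty key
    }
}
```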
Performance
Since HashMap uses a plain array for its storage, the average complexity of put, get, and remove is constant time, O(1). But in the worst case, when multiple Keys end up in the same bucket, the complexity degrades to O(n), because HashMap has to walk the linked list to find the Key. Since Java 8, the worst case is reduced to O(log n) by the balanced tree we discussed above.
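To see the worst case, we can force every key into one bucket with an intentionally bad hashCode (a contrived class of mine, not something from the article):

```java
import java.util.HashMap;
import java.util.Map;

// A key whose hashCode is constant: every instance collides, so all
// entries pile up in a single bucket.
final class BadKey {
    private final int id;
    BadKey(int id) { this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int hashCode() { return 42; } // constant on purpose
}

public class WorstCaseDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(new BadKey(i), i);
        }
        // All 100 entries share one bucket. Before Java 8 this lookup
        // walked a list (O(n)); since Java 8 the bin is treeified, so
        // it costs O(log n) instead.
        System.out.println(map.get(new BadKey(77))); // 77
    }
}
```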