How to implement a hash table in Java (Part 4)

Marcio Endo
October 5, 2022

edit 2022-10-05: leftmost should be rightmost edit 2023-03-03: spelling

We left the previous post of this series with a mostly functional hash table. The latest iteration of the hash table is capable of performing both put and get operations, it handles hash collisions and it is capable of replacing the value of existing mappings.

However, it still has a major limitation: we cannot add more than four distinct mappings to it. The hash table falls into an infinite loop when we try to "put" a fifth mapping. This was discussed during iteration #11.

To overcome this limitation our hash table needs to automatically increase its capacity. The capacity is the number of buckets in the hash table.

So, in this fourth and last post of our series, we will improve our hash table: it must increase its capacity as needed.

Let's continue.

Before we continue

While not strictly required, I recommend reading the previous parts of this series if you haven't already:

In particular:

Iteration 13: resizing the arrays is not enough

Increasing the capacity of a hash table involves more steps than increasing the capacity of e.g. an array list.

So, before we go into the proper solution, let's see why the array list solution does not work for a hash table.

Iteration 13: test case

Let's start with the following test case:

public class HashTableTest {
  @Test(description = """
  Why do we need a rehash operation?
  """)
  public void iter13() {
    var ht = new HashTable<Key, String>();

    var a = new Key("AAA", 1);
    var b = new Key("BBB", 12);
    var c = new Key("CCC", 23);
    var d = new Key("DDD", 34);
    var e = new Key("EEE", 45);

    ht.put(a, "aaa");
    ht.put(b, "bbb");
    ht.put(c, "ccc");
    ht.put(d, "ddd");
    ht.put(e, "eee");

    assertEquals(ht.get(a), "aaa");
    assertEquals(ht.get(b), "bbb");
    assertEquals(ht.get(c), "ccc");
    assertEquals(ht.get(d), "ddd");
    assertEquals(ht.get(e), "eee");
  }
}

We put five mappings into our hash table. Remember it caused an infinite loop during iteration #11.

We proceed and verify that we are able to retrieve the correct values associated to each key.

Iteration 13: the (incorrect) implementation

As mentioned in the section's title, simply resizing the array is not enough. To see why, the following implementation does exactly it:

// does not work!
final class HashTable<K, V> extends iter12.HashTable<K, V> {
  @Override
  protected final V putInsert(K key, V value, int bucket) {
    V result = super.putInsert(key, value, bucket);

    if (size == keys.length) {
      resize();
    }

    return result;
  }

  // does not work!
  private void resize() {
    var newLength = keys.length << 1;

    keys = Arrays.copyOf(keys, newLength);

    values = Arrays.copyOf(values, newLength);
  }
}

We override the putInsert method. The putInsert method is invoked during a put operation when a new mapping is to be stored in the hash table. The implementation of putInsert was discussed in Part 1.

After the new mapping has been inserted, we verify whether our hash table is full or not. If our hash table is full, the resize method is invoked.

The resize method creates new and larger copies of the keys and of the values array. It uses the Arrays::copyOf method. The copies have a larger length newLength that is the double of the previous length.

Iteration 13: running our test

Running our test fails:

java.lang.AssertionError: expected [bbb] but found [null]
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1413)
	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
	at org.testng.Assert.assertEquals(Assert.java:131)
	at org.testng.Assert.assertEquals(Assert.java:655)
	at org.testng.Assert.assertEquals(Assert.java:665)
	at iter13.HashTableTest.iter13(HashTableTest.java:43)

While the five "put" operations seem to have been successful, assertion fails at the second "get" operation.

So what is happening?

Iteration 13: bucket is a function of the capacity

Remember our current hash function (implemented during iteration #06):

@Override
protected int bucket(Object key) {
  int hc = key.hashCode();
  
  return Math.floorMod(hc, keys.length);
}

Given an existing key in the hash table its computed bucket is a function of:

Therefore, if we are to increase the capacity of the hash table, the computed bucket of an existing key changes.

Consider the five Key instances of our test:

var a = new Key("AAA", 1);
var b = new Key("BBB", 12);
var c = new Key("CCC", 23);
var d = new Key("DDD", 34);
var e = new Key("EEE", 45);

With the initial capacity of 4, applying the keys to the bucket method yields:

Key(AAA, 1)  = 1
Key(BBB, 12) = 0
Key(CCC, 23) = 3
Key(DDD, 34) = 2
Key(EEE, 45) = 1

On the other hand, when the capacity is increased to 8, the same keys produce the following values:

Key(AAA, 1)  = 1
Key(BBB, 12) = 4
Key(CCC, 23) = 7
Key(DDD, 34) = 2
Key(EEE, 45) = 5

So, after the hash table capacity is increased, key b is expected to be at bucket idx = 4. However, it is still at bucket idx = 0. The test, therefore, fails.

So how can we solve this?

Iteration 13: resize and rehash

After the capacity is increased all of the existing mappings have to be rehashed. In other words:

This constitutes what is called a rehash operation.

Iteration 14: a proper rehash operation

Now that we have discussed the rehash operation, let's go and implement a proper one for our hash table.

Before we do, we need to answer two questions:

Let's discuss the first of the questions.

Iteration 14: space-time tradeoff

So, when should our hash table grow? In other words, when should the rehash operation occur?

At first, it might seem reasonable for the answer to be "when the hash table is full". However, consider the following example:

var ht = new HashTable<Key, String>();

var a = new Key("AAA", 1);
var b = new Key("BBB", 5);

ht.put(a, "aaa");
ht.put(b, "bbb");

assertEquals(
  ht.toString(),
  """
  +-----+-----+-----+
  | idx | key | val |
  +-----+-----+-----+
  |   0 |     |     |
  |   1 | AAA | aaa |
  |   2 | BBB | bbb |
  |   3 |     |     |
  +-----+-----+-----+
  """
);

Even though keys a and b have distinct hash code values, they cause a hash collision for the initial capacity of 4. Both keys want to stay at bucket idx = 1.

On the other hand, if our hash table had a capacity of 6 we would expect the following disposition:

assertEquals(
  ht.toString(),
  """
  +-----+-----+-----+
  | idx | key | val |
  +-----+-----+-----+
  |   0 |     |     |
  |   1 | AAA | aaa |
  |   2 |     |     |
  |   3 |     |     |
  |   4 |     |     |
  |   5 | BBB | bbb |
  +-----+-----+-----+
  """
);

In other words, no collisions!

So, considering keys a and b, when the hash table capacity is 4:

On the other hand, when the capacity is 6:

This represents what is called a space-time tradeoff.

So resizing the hash table "when it is full" might not be best option.

Iteration 14: a quick note on iteration

We will not implement any iteration. In other words, from the Map interface, we will not implement e.g.:

But, regarding iteration, as indirectly provided by such methods, I'd like to note on the following.

Consider two hash table instances:

Iteration over keys, entries or values will be faster for the hash table having the smaller capacity. Remember, they both contain the same key-value pairs.

So, while put and get operations can be "faster" for higher capacities; iterations are definitely slower.

This is another space-time tradeoff regarding hash tables.

Iteration 14: the load factor

In a hash table, the "when to grow?" question is controlled by the load factor. The HashMap documentation presents the following defintion:

The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

A usual load factor value is 0.75. This is, for instance, the default load factor of the HashMap class.

And what does it mean for a hash table to have a load factor of 0.75?

It means that, when the number of mappings in the hash table exceeds three quarters of its capacity, the internal arrays are resized. As our hash table has a initial capacity of 4, it must increase the number of available buckets when the fourth (distinct) mapping is stored.

Iteration 14: the growth factor

The second question we have to answer before writing our test case is:

As we will discuss during the next iteration, we have reasons to want a capacity that is a power of two number.

So, during the rehash operation, we will simply double the hash table capacity.

Iteration 14: test case

Having our two questions answered:

We proceed to writing our test case.

We start by creating three distinct Key instances. We associate them to their respective names in lowercase.

public class HashTableTest {
  @Test(description = """
  put() test case

  - load factor = 0.75
  - should resize on 4th insert
  - no hash collisions
  """)
  public void iter14() {
    var ht = new HashTable<Key, String>();

    var a = new Key("AAA", 2);
    var b = new Key("BBB", 6);
    var c = new Key("CCC", 3);

    assertEquals(ht.put(a, "aaa"), null);
    assertEquals(ht.put(b, "bbb"), null);
    assertEquals(ht.put(c, "ccc"), null);
    
    // more...
  }
}

The hash code values of the three keys are chosen so that:

assertEquals(
  ht.toString(),
  """
  +-----+-----+-----+
  | idx | key | val |
  +-----+-----+-----+
  |   0 | CCC | ccc |
  |   1 |     |     |
  |   2 | AAA | aaa |
  |   3 | BBB | bbb |
  +-----+-----+-----+
  """
);

We verify the expected internal disposition by asserting on the toString output.

At this point, we have three quarters of the buckets occupied. Therefore, the hash table will need to be resized after the fourth distinct mapping is associated:

var d = new Key("DDD", 8);

assertEquals(ht.put(d, "ddd"), null);
assertEquals(ht.size(), 4);

assertEquals(
  ht.toString(),
  """
  +-----+-----+-----+
  | idx | key | val |
  +-----+-----+-----+
  |   0 | DDD | ddd |
  |   1 |     |     |
  |   2 | AAA | aaa |
  |   3 | CCC | ccc |
  |   4 |     |     |
  |   5 |     |     |
  |   6 | BBB | bbb |
  |   7 |     |     |
  +-----+-----+-----+
  """
);

We expect the capacity to be 8.

And, with such capacity, we expect no hash collisions. "Manually" computing the expected buckets we have:

Iteration 14: running our test

Our test fails when it is run against the implementation of the previous iteration. Here is the (slightly edited) error message:

java.lang.AssertionError: expected [
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
|   0 | DDD | ddd |
|   1 |     |     |
|   2 | AAA | aaa |
|   3 | CCC | ccc |
|   4 |     |     |
|   5 |     |     |
|   6 | BBB | bbb |
|   7 |     |     |
+-----+-----+-----+
] but found [
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
|   0 | CCC | ccc |
|   1 | DDD | ddd |
|   2 | AAA | aaa |
|   3 | BBB | bbb |
+-----+-----+-----+
]
	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1413)
	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
	at org.testng.Assert.assertEquals(Assert.java:131)
	at org.testng.Assert.assertEquals(Assert.java:655)
	at org.testng.Assert.assertEquals(Assert.java:665)
	at iter14.HashTableTest.iter14(HashTableTest.java:61)

So, the put operation with key d succeeds but the capacity remains unchanged.

Iteration 14: implementation

And here is an implementation that makes our test pass:

public class HashTable<K, V> extends iter12.HashTable<K, V> {
  private int rehashSize = 3;

  @Override
  protected final V putInsert(K key, V value, int bucket) {
    V result = super.putInsert(key, value, bucket);

    if (size > rehashSize) {
      rehash();
    }

    return result;
  }

  @SuppressWarnings("unchecked")
  private void rehash() {
    var capacity = keys.length << 1;

    if (capacity < 0) {
      throw new OutOfMemoryError();
    }

    var oldKeys = keys;

    var oldValues = values;

    keys = new Object[capacity];

    values = new Object[capacity];

    rehashSize = (int) (capacity * 0.75);

    size = 0;

    for (int i = 0; i < oldKeys.length; i++) {
      var key = oldKeys[i];

      if (key == null) {
        continue;
      }

      var value = oldValues[i];

      put0((K) key, (V) value);
    }
  }
}

Let's discuss each section of the code individually.

Iteration 14: the rehashSize field

The hash table has a new rehashSize field:

public class HashTable<K, V> extends iter12.HashTable<K, V> {
  private int rehashSize = 3;
  
  ...
}

It answers our "when should our hash table grow?" question. In other words, the rehash operation must occur when the number of mappings, the hash table size property, exceeds the value given by rehashSize.

Its value is given the product of the load factor and the _capacity. As for our hash table:

The rehashSize is initialized to the value 3.

Iteration 14: rehash after a put operation (if necessary)

Let's now focus on the putInsert method:

@Override
protected final V putInsert(K key, V value, int bucket) {
  V result = super.putInsert(key, value, bucket);

  if (size > rehashSize) {
    rehash();
  }

  return result;
}

Remember, the putInsert method is invoked when a new key-value pair is to be stored in the hash table.

Before returning the original result it verifies if the current size exceeds the rehashSize. When that is the case it invokes the rehash method.

The technique of rehashing after a new mapping has been inserted is the same used by the HashMap class.

Iteration 14: the rehash method

The rehash method, as the name implies, performs the rehash operation.

@SuppressWarnings("unchecked")
private void rehash() {
  var capacity = keys.length << 1;

  if (capacity < 0) {
    throw new OutOfMemoryError();
  }

  var oldKeys = keys;

  var oldValues = values;

  keys = new Object[capacity];

  values = new Object[capacity];

  rehashSize = (int) (capacity * 0.75);

  size = 0;

  for (int i = 0; i < oldKeys.length; i++) {
    var key = oldKeys[i];

    if (key == null) {
      continue;
    }

    var value = oldValues[i];

    put0((K) key, (V) value);
  }
}

It starts by computing the new capacity:

var capacity = keys.length << 1;

if (capacity < 0) {
  throw new OutOfMemoryError();
}

The new capacity is the double of the current capacity. If the capacity overflows, it throws an OutOfMemoryError.

With the new capacity, it then creates two new backing array instances:

var oldKeys = keys;

var oldValues = values;

keys = new Object[capacity];

values = new Object[capacity];

But, before creating the new arrays, it saves a reference to the old ones in two variables.

As the capacity has changed, the rehashSize field is updated accordingly:

rehashSize = (int) (capacity * 0.75);

As mentioned before, its value is given by the product of the load factor and the hash table capacity.

Then, finally, all of the existing entries are rehashed:

size = 0;

for (int i = 0; i < oldKeys.length; i++) {
  var key = oldKeys[i];

  if (key == null) {
    continue;
  }

  var value = oldValues[i];

  put0((K) key, (V) value);
}

This is done by iterating over all of the entries in the hash table. Each key-value pair is then reinserted into the hash table by invoking the put0 method. Before the iteration occurs, the hash table's size is reset to zero.

The put0 method is invoked by the put method after the key and value have been null-checked.
It was defined during iteration #02:

public final V put(K key, V value) {
  Objects.requireNonNull(key, "key == null");
  Objects.requireNonNull(value, "value == null");

  return put0(key, value);
}

Iteration 14: a quick note on the load factor

In the spirit of keeping things simple, the load factor in our implementation is a constant. Its value of 0.75 is "hard coded" and cannot be changed.

Keep in mind, however, that the load factor is a property of a hash table. In other words, it is a value that can be configured.

As an example, the HashMap class offers a constructor that allows for creating an instance with a custom load factor:

// hash map with capacity = 32 and load factor = 0.6
var map = new HashMap<Key, String>(32, 0.6f);

Iteration 14: a quick note on the HashMap::newHashMap factory

Since JDK 19, the HashMap class offers a new factory method:

// argument is the number of expected mappings, i.e., key-value pairs.
var map = HashMap.<Key, String> newHashMap(10);

In the example, the returned map will be large enough to hold 10 mappings. It has the default load factor of 0.75. Therefore, its capacity will be greater than the supplied argument of 10. The capacity will be given by dividing the number of mappings by the load factor:

              10  <-- no. of mappings
capacity = -----
            0.75  <-- load factor

It is different from the constructor:

// argument is the initial _capacity_
var map = new HashMap<Key, String>(10);

In this case, the argument 10 is the initial capacity (actual capacity will be a power of two number).

Also note that the former will invoke the latter.

Iteration 15: a small improvement to our hash function

We have mentioned a few times about the power-of-two choice for the hash table's capacity. We will finally use this fact to our advantage.

In this iteration we will refactor the bucket method and introduce a small improvement to our hash function.

Iteration 15: test case

As this is a refactoring, we will not write a test case.

Iteration 15: the current hash function

The current hash function was implemented during iteration #06:

@Override
protected int bucket(Object key) {
  int hc = key.hashCode();

  return Math.floorMod(hc, keys.length);
}

Remember that, when the key's hash code value is a positive integer, the code is equivalent to:

int hc = key.hashCode();

return hc % keys.length;

It is the remainder of the division of the key's hash code value by the capacity (number of buckets).

Iteration 15: remainder of the division by a power-of-ten number

Suppose a positive integer number represented in base 10. For example:

123456

We can "visually" compute the remainder of its division by a power-of-ten number:

123456 %    123456 %    123456 % 
    10         100        1000
------      ------      ------
     6          56         456

And so on. Rewriting the divisor in "powers of ten" (and a different presentation) we have:

123456 mod 10^1 =   6
123456 mod 10^2 =  56
123456 mod 10^3 = 456

So, to obtain the remainder, we get the rightmost n digits. Where n is the exponent of the divisor.

We can apply an analogous procedure when the divisor is a power-of-two number. Except the dividend has to be represented in base 2.

Iteration 15: remainder of the division by a power-of-two number

Let's consider the number 27 our dividend. The divisors will be the power-of-two numbers 2, 4, 8 and 16. As a "cheat sheet" the following are the modulus results in base 10:

27 mod  2 =  1 
27 mod  4 =  3
27 mod  8 =  3
27 mod 16 = 11

In binary the number 27 is represented by:

11011

Let's apply the same procedure from the previous section:

11011 %    11011 %    11011 %    11011 %
   10        100       1000      10000
-----      -----      -----      -----
    1         11        011       1011

Or, if you prefer:

11011 mod 00010 =    1
11011 mod 00100 =   11
11011 mod 01000 =  011
11011 mod 10000 = 1011

Which matches our cheat sheet as:

dec  |  bin
-----+-----
1    | 0001
3    | 0011
3    | 0011
11   | 1011

Iteration 15: remainder as a bitwise AND operation

Our current example:

11011 %    11011 %    11011 %    11011 %
   10        100       1000      10000
-----      -----      -----      -----
    1         11        011       1011

Can be rewritten as:

11011 &    11011 &    11011 &    11011 &
   01        011       0111      01111
-----      -----      -----      -----
    1         11        011       1011

Where the operation & is the bitwise AND operation.

Notice how the divisor from the first form turned into the bitwise complement in the second form:

NOT    10 =    01
NOT   100 =   011
NOT  1000 =  0111
NOT 10000 = 01111

And as the operand is a power-of-two number, the NOT operation is equivalent to "minus one" subtraction:

00010 - 1 =    01
00100 - 1 =   011
01000 - 1 =  0111
10000 - 1 = 01111

Written as Java code we have:

public class HashTable<K, V> extends iter14.HashTable<K, V> {
  @Override
  protected final int bucket(Object key) {
    var hc = key.hashCode();

    var mask = keys.length - 1;

    return hc & mask;
  }
}

This is the technique used by the HashMap class.

Iteration 15: advantages?

What is the advantage, if any, of using the "bitwise AND" approach over the Math::floodMod one?

Simply put, the bitwise AND solution is much faster than the Math::floorMod one.

Suppose a hash table for which the keys are String values. As an example, we will use three HTTP request header field names as keys. The following lists their names associated with their String hash code values.

User-Agent      = -1844712829
Accept-Encoding = -1099743112
Connection      =  1217813246

We write the following JMH benchmark:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class Benchmarks {
  @Param({"-1844712829", "-1099743112", "1217813246"})
  public int hashCode;

  public int capacity = 16;

  @Benchmark
  public int bitwiseAnd() {
    var mask = capacity - 1;

    return hashCode & mask;
  }

  @Benchmark
  public int floorMod() {
    return Math.floorMod(hashCode, capacity);
  }
}

The benchmark computes the bucket of our keys for a hypothetical hash table with a capacity of 16. It does so using both the "bitwise AND" method and the floorMod one.

It produces the following on my machine:

Benchmark                      (hashCode)  Mode  Cnt   Score   Error  Units
Benchmarks.bitwiseAnd         -1844712829  avgt    5   1.256 ± 0.002  ns/op
Benchmarks.bitwiseAnd         -1099743112  avgt    5   1.286 ± 0.255  ns/op
Benchmarks.bitwiseAnd          1217813246  avgt    5   1.351 ± 0.542  ns/op
Benchmarks.floorMod           -1844712829  avgt    5  19.164 ± 0.036  ns/op
Benchmarks.floorMod           -1099743112  avgt    5  19.159 ± 0.036  ns/op
Benchmarks.floorMod            1217813246  avgt    5  19.157 ± 0.033  ns/op

The bitwise AND version takes a around 1.2ns per operation. The floorMod version, on the other hand, takes around 19.1ns per operation. So the former is faster than the latter (even when considering the higher error of the former).

Iteration 16: putting it all together

We have developed our hash table in a series of incremental steps. It helped making each increment visible.

It has, however, a drawback.

A test from an earlier iteration is not run against the code of a later iteration. In other words, the test from iteration #05, for example, may never touch the code written during iteration #10.

Therefore, as a final step, we will:

Having done that, we will verify whether all of the tests continue to pass.

Iteration 16: test case

The test case is written by simply combining the test from each iteration in a single test file:

public class HashTableTest {
  @Test(description = """
  size() method

  - empty hash table must return 0
  """)
  public void iter01() {
    ...
  }

  @Test(description = """
  put() method

  - it must reject null keys
  - it must reject null values
  """)
  public void iter02() {
    ...
  }
  
  // iterations #03 to #14
  
  @Test(description = """
  put() test case

  - refactor bucket() method
  """)
  public void iter15() {
    ...
  }
}

For brevity we've only listed a small section of the test. If you are interested, you can find the full version here.

Iteration 16: implementation

The following is a small section of our implementation:

public final class HashTable<K, V> {
  private Object[] keys = new Object[4];

  private Object[] values = new Object[4];

  private int size;

  private int rehashSize = 3;

  @SuppressWarnings("unchecked")
  public final V get(Object key) {
    Objects.requireNonNull(key, "key == null");

    var bucket = bucket(key);

    var candidate = keys[bucket];

    if (key.equals(candidate)) {
      return (V) values[bucket];
    }

    return get0(key, bucket, candidate);
  }

  ...

  private int bucket(Object key) {
    var hc = key.hashCode();

    var mask = keys.length - 1;

    return hc & mask;
  }

  private V get0(Object key, int bucket, Object candidate) {
    if (candidate == null) {
      return null;
    }

    return get1(key, bucket);
  }

  ...
}

Once again, for brevity reasons, only a small section of the code is listed. You can find the full version here.

Two things are worth noting when compared to the versions of previous iterations:

Running the test against this implementation passes.

Great!

Conclusion

And we are done!

In this four part blog post series we have implemented a bare-bones hash table in Java that:

We additionally implemented both the size and the toString methods as testing aids.

The source code for all of the examples in this post can be found at this GitHub repository.