How to implement a hash table in Java (Part 4)
edit 2022-10-05: leftmost should be rightmost edit 2023-03-03: spelling
We left the previous post
of this series with a mostly functional hash table.
The latest iteration of the hash table is capable of performing both put
and get
operations,
it handles hash collisions and it is capable of replacing the value of existing mappings.
However, it still has a major limitation: we cannot add more than four distinct mappings to it. The hash table falls into an infinite loop when we try to "put" a fifth mapping. This was discussed during iteration #11.
To overcome this limitation our hash table needs to automatically increase its capacity. The capacity is the number of buckets in the hash table.
So, in this fourth and last post of our series, we will improve our hash table: it must increase its capacity as needed.
Let's continue.
Before we continue
While not strictly required, I recommend reading the previous parts of this series if you haven't already:
In particular:
-
the "Writing style" section of Part 1 explains why the examples are written the way they are; and
-
the "Iteration 11: infinite loop?" section of Part 3 show the problem our hash table currently has.
Iteration 13: resizing the arrays is not enough
Increasing the capacity of a hash table involves more steps than increasing the capacity of e.g. an array list.
So, before we go into the proper solution, let's see why the array list solution does not work for a hash table.
Iteration 13: test case
Let's start with the following test case:
public class HashTableTest {
@Test(description = """
Why do we need a rehash operation?
""")
public void iter13() {
var ht = new HashTable<Key, String>();
var a = new Key("AAA", 1);
var b = new Key("BBB", 12);
var c = new Key("CCC", 23);
var d = new Key("DDD", 34);
var e = new Key("EEE", 45);
ht.put(a, "aaa");
ht.put(b, "bbb");
ht.put(c, "ccc");
ht.put(d, "ddd");
ht.put(e, "eee");
assertEquals(ht.get(a), "aaa");
assertEquals(ht.get(b), "bbb");
assertEquals(ht.get(c), "ccc");
assertEquals(ht.get(d), "ddd");
assertEquals(ht.get(e), "eee");
}
}
We put five mappings into our hash table. Remember it caused an infinite loop during iteration #11.
We proceed and verify that we are able to retrieve the correct values associated to each key.
Iteration 13: the (incorrect) implementation
As mentioned in the section's title, simply resizing the array is not enough. To see why, the following implementation does exactly it:
// does not work!
final class HashTable<K, V> extends iter12.HashTable<K, V> {
@Override
protected final V putInsert(K key, V value, int bucket) {
V result = super.putInsert(key, value, bucket);
if (size == keys.length) {
resize();
}
return result;
}
// does not work!
private void resize() {
var newLength = keys.length << 1;
keys = Arrays.copyOf(keys, newLength);
values = Arrays.copyOf(values, newLength);
}
}
We override the putInsert
method.
The putInsert
method is invoked during a put
operation when a new mapping is to be stored in the hash table.
The implementation of putInsert
was discussed in Part 1.
After the new mapping has been inserted, we verify whether our hash table is full or not.
If our hash table is full, the resize
method is invoked.
The resize
method creates new and larger copies of the keys
and of the values
array.
It uses the Arrays::copyOf method.
The copies have a larger length newLength
that is the double of the previous length.
Iteration 13: running our test
Running our test fails:
java.lang.AssertionError: expected [bbb] but found [null]
at org.testng.Assert.fail(Assert.java:110)
at org.testng.Assert.failNotEquals(Assert.java:1413)
at org.testng.Assert.assertEqualsImpl(Assert.java:149)
at org.testng.Assert.assertEquals(Assert.java:131)
at org.testng.Assert.assertEquals(Assert.java:655)
at org.testng.Assert.assertEquals(Assert.java:665)
at iter13.HashTableTest.iter13(HashTableTest.java:43)
While the five "put" operations seem to have been successful, assertion fails at the second "get" operation.
So what is happening?
Iteration 13: bucket is a function of the capacity
Remember our current hash function (implemented during iteration #06):
@Override
protected int bucket(Object key) {
int hc = key.hashCode();
return Math.floorMod(hc, keys.length);
}
Given an existing key in the hash table its computed bucket is a function of:
-
the key's hash code value; and
-
the hash table capacity.
Therefore, if we are to increase the capacity of the hash table, the computed bucket of an existing key changes.
Consider the five Key
instances of our test:
var a = new Key("AAA", 1);
var b = new Key("BBB", 12);
var c = new Key("CCC", 23);
var d = new Key("DDD", 34);
var e = new Key("EEE", 45);
With the initial capacity of 4
, applying the keys to the bucket
method yields:
Key(AAA, 1) = 1
Key(BBB, 12) = 0
Key(CCC, 23) = 3
Key(DDD, 34) = 2
Key(EEE, 45) = 1
On the other hand, when the capacity is increased to 8
, the same keys produce the following values:
Key(AAA, 1) = 1
Key(BBB, 12) = 4
Key(CCC, 23) = 7
Key(DDD, 34) = 2
Key(EEE, 45) = 5
So, after the hash table capacity is increased, key b
is expected to be at bucket idx = 4
.
However, it is still at bucket idx = 0
.
The test, therefore, fails.
So how can we solve this?
Iteration 13: resize and rehash
After the capacity is increased all of the existing mappings have to be rehashed. In other words:
-
the buckets for all of the entries in the hash able must be recomputed; and
-
all of the existing key-value pairs must be relocated to the new, recomputed buckets.
This constitutes what is called a rehash operation.
Iteration 14: a proper rehash operation
Now that we have discussed the rehash operation, let's go and implement a proper one for our hash table.
Before we do, we need to answer two questions:
-
when should our hash table grow? and
-
by what factor should the capacity increase?
Let's discuss the first of the questions.
Iteration 14: space-time tradeoff
So, when should our hash table grow? In other words, when should the rehash operation occur?
At first, it might seem reasonable for the answer to be "when the hash table is full". However, consider the following example:
var ht = new HashTable<Key, String>();
var a = new Key("AAA", 1);
var b = new Key("BBB", 5);
ht.put(a, "aaa");
ht.put(b, "bbb");
assertEquals(
ht.toString(),
"""
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | | |
| 1 | AAA | aaa |
| 2 | BBB | bbb |
| 3 | | |
+-----+-----+-----+
"""
);
Even though keys a
and b
have distinct hash code values, they cause a hash collision for the initial capacity of 4
.
Both keys want to stay at bucket idx = 1
.
On the other hand, if our hash table had a capacity of 6
we would expect the following disposition:
assertEquals(
ht.toString(),
"""
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | | |
| 1 | AAA | aaa |
| 2 | | |
| 3 | | |
| 4 | | |
| 5 | BBB | bbb |
+-----+-----+-----+
"""
);
In other words, no collisions!
So, considering keys a
and b
, when the hash table capacity is 4
:
-
we use less memory (as our arrays are smaller); but
-
put
andget
operations are slower (as they have to resolve hash collisions);
On the other hand, when the capacity is 6
:
-
put
andget
operations are optimal; but -
we use more memory (as our arrays are larger).
This represents what is called a space-time tradeoff.
So resizing the hash table "when it is full" might not be best option.
Iteration 14: a quick note on iteration
We will not implement any iteration.
In other words, from the Map
interface, we will not implement e.g.:
-
the
keySet
method; -
the
entrySet
method; nor -
the
values
method.
But, regarding iteration, as indirectly provided by such methods, I'd like to note on the following.
Consider two hash table instances:
-
they both contain the same mappings; but
-
one has a capacity that is greater than the other.
Iteration over keys, entries or values will be faster for the hash table having the smaller capacity. Remember, they both contain the same key-value pairs.
So, while put
and get
operations can be "faster" for higher capacities;
iterations are definitely slower.
This is another space-time tradeoff regarding hash tables.
Iteration 14: the load factor
In a hash table, the "when to grow?" question is controlled by the load factor.
The HashMap
documentation presents the following defintion:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
A usual load factor value is 0.75
.
This is, for instance, the default load factor of the HashMap
class.
And what does it mean for a hash table to have a load factor of 0.75
?
It means that, when the number of mappings in the hash table exceeds three quarters of its capacity, the internal arrays are resized.
As our hash table has a initial capacity of 4
, it must increase the number of available buckets when the fourth (distinct) mapping is stored.
Iteration 14: the growth factor
The second question we have to answer before writing our test case is:
-
by what factor should our capacity increase?
As we will discuss during the next iteration, we have reasons to want a capacity that is a power of two number.
So, during the rehash operation, we will simply double the hash table capacity.
Iteration 14: test case
Having our two questions answered:
-
our hash table will have a load factor of
.75
; and -
we will double the hash table capacity.
We proceed to writing our test case.
We start by creating three distinct Key
instances.
We associate them to their respective names in lowercase.
public class HashTableTest {
@Test(description = """
put() test case
- load factor = 0.75
- should resize on 4th insert
- no hash collisions
""")
public void iter14() {
var ht = new HashTable<Key, String>();
var a = new Key("AAA", 2);
var b = new Key("BBB", 6);
var c = new Key("CCC", 3);
assertEquals(ht.put(a, "aaa"), null);
assertEquals(ht.put(b, "bbb"), null);
assertEquals(ht.put(c, "ccc"), null);
// more...
}
}
The hash code values of the three keys are chosen so that:
-
keys
a
andb
will want to occupy the same bucketidx = 2
; and -
key
c
will try to stay at bucketidx = 3
. But, by the time it is mapped, its bucket is already taken by keyb
.
assertEquals(
ht.toString(),
"""
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | CCC | ccc |
| 1 | | |
| 2 | AAA | aaa |
| 3 | BBB | bbb |
+-----+-----+-----+
"""
);
We verify the expected internal disposition by asserting on the toString
output.
At this point, we have three quarters of the buckets occupied. Therefore, the hash table will need to be resized after the fourth distinct mapping is associated:
var d = new Key("DDD", 8);
assertEquals(ht.put(d, "ddd"), null);
assertEquals(ht.size(), 4);
assertEquals(
ht.toString(),
"""
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | DDD | ddd |
| 1 | | |
| 2 | AAA | aaa |
| 3 | CCC | ccc |
| 4 | | |
| 5 | | |
| 6 | BBB | bbb |
| 7 | | |
+-----+-----+-----+
"""
);
We expect the capacity to be 8
.
And, with such capacity, we expect no hash collisions. "Manually" computing the expected buckets we have:
-
Key("AAA", 2)
=>2 % 8 = 2
-
Key("BBB", 6)
=>6 % 8 = 6
-
Key("CCC", 3)
=>3 % 8 = 3
-
Key("DDD", 8)
=>8 % 8 = 0
Iteration 14: running our test
Our test fails when it is run against the implementation of the previous iteration. Here is the (slightly edited) error message:
java.lang.AssertionError: expected [
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | DDD | ddd |
| 1 | | |
| 2 | AAA | aaa |
| 3 | CCC | ccc |
| 4 | | |
| 5 | | |
| 6 | BBB | bbb |
| 7 | | |
+-----+-----+-----+
] but found [
+-----+-----+-----+
| idx | key | val |
+-----+-----+-----+
| 0 | CCC | ccc |
| 1 | DDD | ddd |
| 2 | AAA | aaa |
| 3 | BBB | bbb |
+-----+-----+-----+
]
at org.testng.Assert.fail(Assert.java:110)
at org.testng.Assert.failNotEquals(Assert.java:1413)
at org.testng.Assert.assertEqualsImpl(Assert.java:149)
at org.testng.Assert.assertEquals(Assert.java:131)
at org.testng.Assert.assertEquals(Assert.java:655)
at org.testng.Assert.assertEquals(Assert.java:665)
at iter14.HashTableTest.iter14(HashTableTest.java:61)
So, the put
operation with key d
succeeds but the capacity remains unchanged.
Iteration 14: implementation
And here is an implementation that makes our test pass:
public class HashTable<K, V> extends iter12.HashTable<K, V> {
private int rehashSize = 3;
@Override
protected final V putInsert(K key, V value, int bucket) {
V result = super.putInsert(key, value, bucket);
if (size > rehashSize) {
rehash();
}
return result;
}
@SuppressWarnings("unchecked")
private void rehash() {
var capacity = keys.length << 1;
if (capacity < 0) {
throw new OutOfMemoryError();
}
var oldKeys = keys;
var oldValues = values;
keys = new Object[capacity];
values = new Object[capacity];
rehashSize = (int) (capacity * 0.75);
size = 0;
for (int i = 0; i < oldKeys.length; i++) {
var key = oldKeys[i];
if (key == null) {
continue;
}
var value = oldValues[i];
put0((K) key, (V) value);
}
}
}
Let's discuss each section of the code individually.
Iteration 14: the rehashSize
field
The hash table has a new rehashSize
field:
public class HashTable<K, V> extends iter12.HashTable<K, V> {
private int rehashSize = 3;
...
}
It answers our "when should our hash table grow?" question.
In other words, the rehash operation must occur when the number of mappings, the hash table size
property, exceeds the value given by rehashSize
.
Its value is given the product of the load factor and the _capacity. As for our hash table:
-
the load factor is
0.75
; and -
the initial capacity is
4
.
The rehashSize
is initialized to the value 3
.
Iteration 14: rehash after a put
operation (if necessary)
Let's now focus on the putInsert
method:
@Override
protected final V putInsert(K key, V value, int bucket) {
V result = super.putInsert(key, value, bucket);
if (size > rehashSize) {
rehash();
}
return result;
}
Remember, the putInsert
method is invoked when a new key-value pair is to be stored in the hash table.
Before returning the original result
it verifies if the current size
exceeds the rehashSize
.
When that is the case it invokes the rehash
method.
The technique of rehashing after a new mapping has been inserted is the same used by the HashMap
class.
Iteration 14: the rehash
method
The rehash
method, as the name implies, performs the rehash operation.
@SuppressWarnings("unchecked")
private void rehash() {
var capacity = keys.length << 1;
if (capacity < 0) {
throw new OutOfMemoryError();
}
var oldKeys = keys;
var oldValues = values;
keys = new Object[capacity];
values = new Object[capacity];
rehashSize = (int) (capacity * 0.75);
size = 0;
for (int i = 0; i < oldKeys.length; i++) {
var key = oldKeys[i];
if (key == null) {
continue;
}
var value = oldValues[i];
put0((K) key, (V) value);
}
}
It starts by computing the new capacity:
var capacity = keys.length << 1;
if (capacity < 0) {
throw new OutOfMemoryError();
}
The new capacity
is the double of the current capacity.
If the capacity
overflows, it throws an OutOfMemoryError
.
With the new capacity
, it then creates two new backing array instances:
var oldKeys = keys;
var oldValues = values;
keys = new Object[capacity];
values = new Object[capacity];
But, before creating the new arrays, it saves a reference to the old ones in two variables.
As the capacity
has changed, the rehashSize
field is updated accordingly:
rehashSize = (int) (capacity * 0.75);
As mentioned before, its value is given by the product of the load factor and the hash table capacity.
Then, finally, all of the existing entries are rehashed:
size = 0;
for (int i = 0; i < oldKeys.length; i++) {
var key = oldKeys[i];
if (key == null) {
continue;
}
var value = oldValues[i];
put0((K) key, (V) value);
}
This is done by iterating over all of the entries in the hash table.
Each key-value pair is then reinserted into the hash table by invoking the put0
method.
Before the iteration occurs, the hash table's size
is reset to zero.
The put0
method is invoked by the put
method after the key and value have been null-checked.
It was defined during iteration #02:
public final V put(K key, V value) {
Objects.requireNonNull(key, "key == null");
Objects.requireNonNull(value, "value == null");
return put0(key, value);
}
Iteration 14: a quick note on the load factor
In the spirit of keeping things simple, the load factor in our implementation is a constant.
Its value of 0.75
is "hard coded" and cannot be changed.
Keep in mind, however, that the load factor is a property of a hash table. In other words, it is a value that can be configured.
As an example, the HashMap
class offers a constructor that allows for creating an instance with a custom load factor:
// hash map with capacity = 32 and load factor = 0.6
var map = new HashMap<Key, String>(32, 0.6f);
Iteration 14: a quick note on the HashMap::newHashMap
factory
Since JDK 19, the HashMap
class offers a new factory method:
// argument is the number of expected mappings, i.e., key-value pairs.
var map = HashMap.<Key, String> newHashMap(10);
In the example, the returned map will be large enough to hold 10
mappings.
It has the default load factor of 0.75
.
Therefore, its capacity will be greater than the supplied argument of 10
.
The capacity will be given by dividing the number of mappings by the load factor:
10 <-- no. of mappings
capacity = -----
0.75 <-- load factor
It is different from the constructor:
// argument is the initial _capacity_
var map = new HashMap<Key, String>(10);
In this case, the argument 10
is the initial capacity (actual capacity will be a power of two number).
Also note that the former will invoke the latter.
Iteration 15: a small improvement to our hash function
We have mentioned a few times about the power-of-two choice for the hash table's capacity. We will finally use this fact to our advantage.
In this iteration we will refactor the bucket
method and introduce a small improvement to our hash function.
Iteration 15: test case
As this is a refactoring, we will not write a test case.
Iteration 15: the current hash function
The current hash function was implemented during iteration #06:
@Override
protected int bucket(Object key) {
int hc = key.hashCode();
return Math.floorMod(hc, keys.length);
}
Remember that, when the key's hash code value is a positive integer, the code is equivalent to:
int hc = key.hashCode();
return hc % keys.length;
It is the remainder of the division of the key's hash code value by the capacity (number of buckets).
Iteration 15: remainder of the division by a power-of-ten number
Suppose a positive integer number represented in base 10. For example:
123456
We can "visually" compute the remainder of its division by a power-of-ten number:
123456 % 123456 % 123456 %
10 100 1000
------ ------ ------
6 56 456
And so on. Rewriting the divisor in "powers of ten" (and a different presentation) we have:
123456 mod 10^1 = 6
123456 mod 10^2 = 56
123456 mod 10^3 = 456
So, to obtain the remainder, we get the rightmost n
digits.
Where n
is the exponent of the divisor.
We can apply an analogous procedure when the divisor is a power-of-two number. Except the dividend has to be represented in base 2.
Iteration 15: remainder of the division by a power-of-two number
Let's consider the number 27
our dividend.
The divisors will be the power-of-two numbers 2
, 4
, 8
and 16
.
As a "cheat sheet" the following are the modulus results in base 10:
27 mod 2 = 1
27 mod 4 = 3
27 mod 8 = 3
27 mod 16 = 11
In binary the number 27
is represented by:
11011
Let's apply the same procedure from the previous section:
11011 % 11011 % 11011 % 11011 %
10 100 1000 10000
----- ----- ----- -----
1 11 011 1011
Or, if you prefer:
11011 mod 00010 = 1
11011 mod 00100 = 11
11011 mod 01000 = 011
11011 mod 10000 = 1011
Which matches our cheat sheet as:
dec | bin
-----+-----
1 | 0001
3 | 0011
3 | 0011
11 | 1011
Iteration 15: remainder as a bitwise AND operation
Our current example:
11011 % 11011 % 11011 % 11011 %
10 100 1000 10000
----- ----- ----- -----
1 11 011 1011
Can be rewritten as:
11011 & 11011 & 11011 & 11011 &
01 011 0111 01111
----- ----- ----- -----
1 11 011 1011
Where the operation &
is the bitwise AND operation.
Notice how the divisor from the first form turned into the bitwise complement in the second form:
NOT 10 = 01
NOT 100 = 011
NOT 1000 = 0111
NOT 10000 = 01111
And as the operand is a power-of-two number, the NOT operation is equivalent to "minus one" subtraction:
00010 - 1 = 01
00100 - 1 = 011
01000 - 1 = 0111
10000 - 1 = 01111
Written as Java code we have:
public class HashTable<K, V> extends iter14.HashTable<K, V> {
@Override
protected final int bucket(Object key) {
var hc = key.hashCode();
var mask = keys.length - 1;
return hc & mask;
}
}
This is the technique used by the HashMap
class.
Iteration 15: advantages?
What is the advantage, if any, of using the "bitwise AND" approach over the Math::floodMod
one?
Simply put, the bitwise AND solution is much faster than the Math::floorMod
one.
Suppose a hash table for which the keys are String
values.
As an example, we will use three HTTP request header field names as keys.
The following lists their names associated with their String
hash code values.
User-Agent = -1844712829
Accept-Encoding = -1099743112
Connection = 1217813246
We write the following JMH benchmark:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class Benchmarks {
@Param({"-1844712829", "-1099743112", "1217813246"})
public int hashCode;
public int capacity = 16;
@Benchmark
public int bitwiseAnd() {
var mask = capacity - 1;
return hashCode & mask;
}
@Benchmark
public int floorMod() {
return Math.floorMod(hashCode, capacity);
}
}
The benchmark computes the bucket of our keys for a hypothetical hash table with a capacity of 16
.
It does so using both the "bitwise AND" method and the floorMod
one.
It produces the following on my machine:
Benchmark (hashCode) Mode Cnt Score Error Units
Benchmarks.bitwiseAnd -1844712829 avgt 5 1.256 ± 0.002 ns/op
Benchmarks.bitwiseAnd -1099743112 avgt 5 1.286 ± 0.255 ns/op
Benchmarks.bitwiseAnd 1217813246 avgt 5 1.351 ± 0.542 ns/op
Benchmarks.floorMod -1844712829 avgt 5 19.164 ± 0.036 ns/op
Benchmarks.floorMod -1099743112 avgt 5 19.159 ± 0.036 ns/op
Benchmarks.floorMod 1217813246 avgt 5 19.157 ± 0.033 ns/op
The bitwise AND version takes a around 1.2ns per operation.
The floorMod
version, on the other hand, takes around 19.1ns per operation.
So the former is faster than the latter (even when considering the higher error of the former).
Iteration 16: putting it all together
We have developed our hash table in a series of incremental steps. It helped making each increment visible.
It has, however, a drawback.
A test from an earlier iteration is not run against the code of a later iteration. In other words, the test from iteration #05, for example, may never touch the code written during iteration #10.
Therefore, as a final step, we will:
-
combine all of the tests in a single test class; and
-
combine all of the hash table in a single
HashTable
class.
Having done that, we will verify whether all of the tests continue to pass.
Iteration 16: test case
The test case is written by simply combining the test from each iteration in a single test file:
public class HashTableTest {
@Test(description = """
size() method
- empty hash table must return 0
""")
public void iter01() {
...
}
@Test(description = """
put() method
- it must reject null keys
- it must reject null values
""")
public void iter02() {
...
}
// iterations #03 to #14
@Test(description = """
put() test case
- refactor bucket() method
""")
public void iter15() {
...
}
}
For brevity we've only listed a small section of the test. If you are interested, you can find the full version here.
Iteration 16: implementation
The following is a small section of our implementation:
public final class HashTable<K, V> {
private Object[] keys = new Object[4];
private Object[] values = new Object[4];
private int size;
private int rehashSize = 3;
@SuppressWarnings("unchecked")
public final V get(Object key) {
Objects.requireNonNull(key, "key == null");
var bucket = bucket(key);
var candidate = keys[bucket];
if (key.equals(candidate)) {
return (V) values[bucket];
}
return get0(key, bucket, candidate);
}
...
private int bucket(Object key) {
var hc = key.hashCode();
var mask = keys.length - 1;
return hc & mask;
}
private V get0(Object key, int bucket, Object candidate) {
if (candidate == null) {
return null;
}
return get1(key, bucket);
}
...
}
Once again, for brevity reasons, only a small section of the code is listed. You can find the full version here.
Two things are worth noting when compared to the versions of previous iterations:
-
the class does not have an
extends
declaration; and -
protected
fields and methods are allprivate
in this version.
Running the test against this implementation passes.
Great!
Conclusion
And we are done!
In this four part blog post series we have implemented a bare-bones hash table in Java that:
-
supports both the
put
and theget
methods as defined by theMap
interface; -
rejects
null
keys andnull
values; -
uses linear probing for handling hash collisions; and
-
dynamically resizes (and rehashes) itself when required.
We additionally implemented both the size
and the toString
methods as testing aids.
The source code for all of the examples in this post can be found at this GitHub repository.