How to implement a hash table in Java (Part 1)
If you are developing an application in Java chances are you are currently using an implementation of java.util.Map
or java.util.Set
in your programs.
When you use any of the following you are using a hash table data structure in your program:
-
java.util.HashMap
-
java.util.HashSet
which internally just delegates to aHashMap
-
java.util.ImmutableCollections.MapN
as returned by theMap.of
methods -
java.util.ImmutableCollections.SetN
as returned by theSet.of
methods -
java.util.EnumMap
These are only a few of the examples from the JDK; the JDK has more hash-based implementations.
I've always wanted to know the details of how a hash table work.
I find the HashMap
source code to be quite difficult to parse though.
So I was a bit intimidated to try and implement one.
I have read this blog post from Ben Hoyt sometime last year. It describes the implementation of a hash table structure in the C programming language. It encouraged me to try and implement one in Java from scratch.
So, in this blog series, I will guide you through the implementation of a hash table in Java.
Introduction
Before we begin, just a couple of notes.
Not quite like a HashMap
I should preface saying the hash table we will implement is going to be somewhat different from java.util.HashMap
.
It will still give a good understanding (hopefully) of the internals. But, implementation-wise, it will be closer to:
-
java.util.ImmutableCollections.MapN
as returned by theMap.of
methods; -
java.util.ImmutableCollections.SetN
as returned by theSet.of
methods; and to -
java.util.EnumMap
Writing style
We will create our hash table implementation in a series of small steps. Each iteration will add on top of the previous one.
We want each increment to be visible. In other words, we want to easily see what has changed from the previous step.
To achieve this goal we will have the implementation from one step extend the one from the previous step. This may be quite particular.
Keep in mind the sole purpose of this usage is to make each change from a previous iteration more apparent.
In other words, I am not advocating for the use of this technique in production code.
A small set of requirements
So we know where we are headed to, let's write down a small set of requirements.
We will implement a very small subset of the java.util.Map
interface.
Representing it as a Java interface we would have the following:
public interface HashTable<K, V> {
V put(K key, V value);
V get(Object key);
int size();
}
Please note that, to keep things simple, we will create our implementation directly as a class. The interface is listed only to show the methods we want to implement.
Our target is to implement three methods: the put
, the get
and the size
methods.
I should note that the size
method is not strictly required.
But it will help writing some of the assertions along the way so we will implement it.
As an additional and final requirement our implementation must reject null
keys and null
values.
Rejecting null
keys in particular will simplify our implementation.
Iteration 01: the size() method implementation
We want to implement our hash table in a series of small steps.
So it is fitting for our first step to be to implement the size
method.
Iteration 01: test case
Let's start with a test case. It verifies that a newly created hash table has a size of zero:
public class HashTableTest {
@Test(description = """
size() method
- empty hash table must return 0
""")
public void iter01() {
var ht = new HashTable<Object, Object>();
assertEquals(ht.size(), 0);
}
}
Note that we are not "putting" nor "getting" anything from our hash table.
For this reason the types of the keys and of the values are not important.
So, in this test, our hash table instance simply maps an Object
to another Object
.
Iteration 01: implementation
The implementation is trivial. It makes our test pass:
public class HashTable<K, V> {
protected int size;
public final int size() { return size; }
}
The only thing to notice is that the visibility of size
field: protected
.
It is required so that subclasses can directly read or modify its value.
Iteration 02: reject null
keys and null
values
As a next step let's implement the requirement that our hash table must reject both null
keys and null
values.
It means we will implement a first version of the put
method.
Granted, it will not do much yet.
Iteration 02: test case
Once again, we start with a test case:
public class HashTableTest {
@Test(description = """
put() method
- it must reject null keys
- it must reject null values
""")
public void iter02() {
var ht = new HashTable<String, String>();
assertEquals(ht.size(), 0);
try {
ht.put(null, "non-null value");
Assert.fail("Expected a NPE");
} catch (NullPointerException expected) {
assertEquals(expected.getMessage(), "key == null");
}
try {
ht.put("non-null key", null);
Assert.fail("Expected a NPE");
} catch (NullPointerException expected) {
assertEquals(expected.getMessage(), "value == null");
}
assertEquals(ht.size(), 0);
}
}
We first verify that an attempt to put a null
key and a non-null value fails with a NullPointerException
.
Next, we verify that a null
value (with a non-null key) also causes a NullPointerException
to be thrown.
In both cases we verify that the exception contains the expected error message.
Finally, we verify that the hash table did not change by asserting its size remains zero. Note that this is not really sufficient to verify that no key-value mappings were added. But, at the moment, it is all we can do.
Iteration 02: implementation
Here's an implementation that make our test pass:
public class HashTable<K, V> extends iter01.HashTable<K, V> {
public final V put(K key, V value) {
Objects.requireNonNull(key, "key == null");
Objects.requireNonNull(value, "value == null");
return put0(key, value);
}
protected V put0(K key, V value) {
throw new UnsupportedOperationException("Implement me");
}
}
The first thing to notice is that this class is a subclass of the implementation of the previous iteration. I mentioned this at the beginning of the post. We want to easily see what has changed from the previous iteration.
And what has changed from the previous iteration?
We created a put
implementation that rejects both null
keys and null
values.
If, on the other hand, both the key
and the value
are non-null, it continues execution by calling a put0
method.
At this iteration the put0
method simply throws an UnsupportedOperationException
.
Iteration 03: put
key-value mappings (happy path)
We handled the put
method preconditions in the previous section.
Now we will actually add key-value mappings to our hash table exercising a happy path.
Our happy path consists of:
-
each inserted key-value pair is stored at a distinct bucket (i.e., no hash collisions);
-
each inserted key-value pair is distinct from each other (i.e., no value replacements); and
-
the hash code values are all positive
If you don't know what a hash collision is, don't worry. We will discuss it in this series.
Iteration 03: test case
We start with a test case.
Given the requirements of our test, let's use Integer
instances as keys.
The hash code value of a Integer
object is simply the int
value it wraps.
Integer
instances, therefore, make it easy to define an exact hash code value for our key.
Here's our test listing:
public class HashTableTest {
@Test(description = """
put() and size() methods
- positive hash codes
- no hash collisions
- no value replacements
""")
public void iter03() {
var ht = new HashTable<Integer, String>();
assertEquals(ht.size(), 0);
assertEquals(ht.put(1, "One"), null);
assertEquals(ht.size(), 1);
assertEquals(ht.put(2, "Two"), null);
assertEquals(ht.size(), 2);
}
}
We start by creating a hash table instance that maps Integer
values to their names in English.
We add two key-value mappings to our hash table;
and, since there are no value replacements, we assert that the put
method returns null
.
Remember the Javadocs for put
method.
The put
method returns:
the previous value associated with key, or null if there was no mapping for key.
Additionally, we verify that the size of our hash table increases by one after each key-value insertion.
Iteration 03: implementation
And here's an implementation that makes our third test pass:
public class HashTable<K, V> extends iter02.HashTable<K, V> {
protected Object[] keys = new Object[4];
protected Object[] values = new Object[4];
protected int bucket(Object key) {
var hc = key.hashCode();
return hc % keys.length;
}
@Override
protected final V put0(K key, V value) {
var bucket = bucket(key);
var existing = keys[bucket];
if (existing == null) {
return putInsert(key, value, bucket);
}
return put1(key, value, bucket, existing);
}
protected V put1(K key, V value, int bucket, Object existing) {
throw new UnsupportedOperationException("Implement me");
}
protected V putInsert(K key, V value, int bucket) {
keys[bucket] = key;
values[bucket] = value;
size++;
return null;
}
}
Let's take a closer look into the implementation.
Iteration 03: the keys
and values
arrays
Our class declares two arrays of the same length: keys
and values
.
As their names indicate they are used to store the key-value mappings.
A key-value pair mapping is stored at the two arrays in the following manner:
-
the
keys
array holds the key of the mapping; -
the
values
array holds the value of the mapping; -
both arrays have the same length; and
-
the index
bucket
at which the key and value are stored are the same.
In other words, a single key-value pair can be stored like so:
keys[bucket] = key;
values[bucket] = value;
Similarly, a single key-value pair can be read like so:
var key = keys[bucket];
var value = values[bucket];
The following diagram illustrates a possible disposition for the mappings 1=One
and 2=Two
.
--------|-------+-------+-------+-------+
index:| 0 | 1 | 2 | 3 |
--------|-------+-------+-------+-------+
keys:| | 1 | 2 | |
--------|-------+-------+-------+-------+
values:| | "One" | "Two" | |
--------|-------+-------+-------+-------+
But how is the bucket
index computed?
Iteration 03: the initial array length
Before we look into the bucket
method, notice how the hash table arrays were both initialized with a length of 4
:
protected Object[] keys = new Object[4];
protected Object[] values = new Object[4];
The choice of a power-of-two number is not a coincidence. We will look into it later. For now, just consider it an arbitrary positive value.
Iteration 03: the bucket
method
The bucket
method is responsible for computing the index.
At this iteration the method returns the remainder of the division between the key's hash code value and the keys
array length.
// the iter 03 'incomplete' implementation
protected int bucket(Object key) {
var hc = key.hashCode();
return hc % keys.length;
}
As an example, consider the following:
-
the
keys
array length is4
; and -
the specified
key
has a positive hash code value
Then the bucket
variable will hold a value between:
-
zero (inclusive); and
-
the
keys
length (exclusive).
These are all of the valid values for the indices of the keys
array.
And, as keys
and values
have the same length, these are also all of the valid values for the indices of the values
array.
As a final note I should mention that the implementation of this iteration is known to be incomplete. It fails to handle negative hash code values. We will look into it during iteration #05.
Iteration 03: is this a new mapping?
With the bucket
index computed, the put0
method goes on to verify if hash table contains the specified key
or not.
It does so by checking if the keys
array component at that index is null
.
var bucket = bucket(key);
var existing = keys[bucket];
if (existing == null) {
// new mapping
return putInsert(key, value, bucket);
}
...
Remember, our hash table disallows null
keys;
so if a component of the keys
array is null
, it means there is no existing mapping at that location.
Iteration 03: the putInsert
method
The actual insertion is done at the putInsert
method.
After storing both the specified key
and value
it increments the hash table's size
value by one and returns null
.
protected V putInsert(K key, V value, int bucket) {
keys[bucket] = key;
values[bucket] = value;
size++;
return null;
}
Iteration 03: same key or collision?
If the keys
array component at the computed index is non-null, it means that either:
-
we have the same key at the bucket (and possibly a distinct value); or
-
we have a distinct key at the same bucket.
if (existing == null) {
return putInsert(key, value, bucket);
}
// what to do here?
return put1(key, value, bucket, existing);
We will look into those situations in a moment.
For now, we throw an UnsupportedOperationException
just in case we inadvertently reach this branch.
protected V put1(K key, V value, int bucket, Object existing) {
throw new UnsupportedOperationException("Implement me");
}
Iteration 03: a note on the two arrays
In our hash table implementation we are storing the keys and the values in two distinct arrays. While I try to have code as closer to "real-life" as possible, this is a blog post. In other words, I believe the "easy to understand" requirement is more important.
In a more "production-ready" hash table implementation, however, we would probably store the key-value pairs in a single array. For an example from the JDK see java.util.ImmutableCollections.MapN.
One reason would be to reduce memory requirements of our hash table:
-
the hash table would have one less field; and
-
having one less array, it would require one less array object header.
A second reason would be to (probably) improve performance. I am sorry I do not have much to add on this topic as I do not know much about performance. I believe it relates to this article but don't take my word for it.
Iteration 04: get
the value mapped to a key (happy path)
We have implemented the put
method for a happy path.
Let's now implement the get
method for the same case:
-
the hash table contains the key-value mapping. In other words, the
get
method will return a non-null value; -
no hash collisions; and
-
positive hash code values.
Iteration 04: test case
We start with a test case once more:
public class HashTableTest {
@Test(description = """
get() method
- hash table contains the mapping
- no hash collisions
- positive hash codes
""")
public void iter04() {
var ht = new HashTable<Integer, String>();
assertEquals(ht.size(), 0);
assertEquals(ht.put(1, "One"), null);
assertEquals(ht.size(), 1);
assertEquals(ht.get(1), "One");
assertEquals(ht.put(2, "Two"), null);
assertEquals(ht.size(), 2);
assertEquals(ht.get(2), "Two");
}
}
The test is very similar to the one from the previous iteration.
The difference is that we verify that the hash table has correctly stored the key-value pairings.
We do so by invoking the get
method for each stored key: we expected it to return the correct value.
Iteration 04: implementation
And here's an implementation that make our fourth test pass:
public class HashTable<K, V> extends iter03.HashTable<K, V> {
@SuppressWarnings("unchecked")
public final V get(Object key) {
Objects.requireNonNull(key, "key == null");
var bucket = bucket(key);
var candidate = keys[bucket];
if (key.equals(candidate)) {
return (V) values[bucket];
}
return get0(key, bucket, candidate);
}
protected V get0(Object key, int bucket, Object candidate) {
throw new UnsupportedOperationException("Implement me");
}
}
Let's look at our get
implementation.
It starts by rejecting null
keys.
The Javadocs of the get(Object key)
method states that it should throw a NullPointerException
:
if the specified key is null and this map does not permit null keys
Next, it invokes the bucket
method created in the previous iteration.
Remember that this method computes and returns the array index at which to store the specified key
.
So, if the specified key
is equal to the value at the index returned by the bucket
method, it means our hash table contains a mapping for the specified key.
In this case it returns the corresponding value from the values
array.
Note that the expression in the if
statement is "null-safe" as key
is guaranteed to be non-null.
If the expression evaluates to false
then our get
method invokes the get0
method.
For this iteration it simply throws a UnsupportedOperationException
.
In the next blog posts of this series
We have a bare-bones hash table implementation so far. We are able to both "put" and "get" mappings, albeit in a very limited set of conditions.
In the next blog posts of this series, among other things, we will:
-
handle keys with negative hash code values
-
handle hash collisions
-
perform a rehash operation, i.e., resize the inner arrays so our hash table can hold more mappings.
Stay tuned.
The source code of all of the examples can be found in this GitHub repository. It also contains the draft code for the upcoming posts.
Continue reading
The second part of this series is already available.
You can continue reading by following the link.