How to implement a hash table in Java (Part 1)

Marcio Endo
September 14, 2022

If you are developing an application in Java chances are you are currently using an implementation of java.util.Map or java.util.Set in your programs.

When you use any of the following you are using a hash table data structure in your program:

These are only a few of the examples from the JDK; the JDK has more hash-based implementations.

I've always wanted to know the details of how a hash table work. I find the HashMap source code to be quite difficult to parse though. So I was a bit intimidated to try and implement one.

I have read this blog post from Ben Hoyt sometime last year. It describes the implementation of a hash table structure in the C programming language. It encouraged me to try and implement one in Java from scratch.

So, in this blog series, I will guide you through the implementation of a hash table in Java.

Introduction

Before we begin, just a couple of notes.

Not quite like a HashMap

I should preface saying the hash table we will implement is going to be somewhat different from java.util.HashMap.

It will still give a good understanding (hopefully) of the internals. But, implementation-wise, it will be closer to:

Writing style

We will create our hash table implementation in a series of small steps. Each iteration will add on top of the previous one.

We want each increment to be visible. In other words, we want to easily see what has changed from the previous step.

To achieve this goal we will have the implementation from one step extend the one from the previous step. This may be quite particular.

Keep in mind the sole purpose of this usage is to make each change from a previous iteration more apparent.

In other words, I am not advocating for the use of this technique in production code.

A small set of requirements

So we know where we are headed to, let's write down a small set of requirements.

We will implement a very small subset of the java.util.Map interface. Representing it as a Java interface we would have the following:

public interface HashTable<K, V> {
  V put(K key, V value);
  
  V get(Object key);
  
  int size();
}

Please note that, to keep things simple, we will create our implementation directly as a class. The interface is listed only to show the methods we want to implement.

Our target is to implement three methods: the put, the get and the size methods. I should note that the size method is not strictly required. But it will help writing some of the assertions along the way so we will implement it.

As an additional and final requirement our implementation must reject null keys and null values. Rejecting null keys in particular will simplify our implementation.

Iteration 01: the size() method implementation

We want to implement our hash table in a series of small steps. So it is fitting for our first step to be to implement the size method.

Iteration 01: test case

Let's start with a test case. It verifies that a newly created hash table has a size of zero:

public class HashTableTest {
  @Test(description = """
  size() method

  - empty hash table must return 0
  """)
  public void iter01() {
    var ht = new HashTable<Object, Object>();
    assertEquals(ht.size(), 0);
  }
}

Note that we are not "putting" nor "getting" anything from our hash table. For this reason the types of the keys and of the values are not important. So, in this test, our hash table instance simply maps an Object to another Object.

Iteration 01: implementation

The implementation is trivial. It makes our test pass:

public class HashTable<K, V> {
  protected int size;

  public final int size() { return size; }
}

The only thing to notice is that the visibility of size field: protected. It is required so that subclasses can directly read or modify its value.

Iteration 02: reject null keys and null values

As a next step let's implement the requirement that our hash table must reject both null keys and null values. It means we will implement a first version of the put method. Granted, it will not do much yet.

Iteration 02: test case

Once again, we start with a test case:

public class HashTableTest {
  @Test(description = """
  put() method

  - it must reject null keys
  - it must reject null values
  """)
  public void iter02() {
    var ht = new HashTable<String, String>();
    assertEquals(ht.size(), 0);

    try {
      ht.put(null, "non-null value");

      Assert.fail("Expected a NPE");
    } catch (NullPointerException expected) {
      assertEquals(expected.getMessage(), "key == null");
    }

    try {
      ht.put("non-null key", null);

      Assert.fail("Expected a NPE");
    } catch (NullPointerException expected) {
      assertEquals(expected.getMessage(), "value == null");
    }

    assertEquals(ht.size(), 0);
  }
}

We first verify that an attempt to put a null key and a non-null value fails with a NullPointerException. Next, we verify that a null value (with a non-null key) also causes a NullPointerException to be thrown. In both cases we verify that the exception contains the expected error message.

Finally, we verify that the hash table did not change by asserting its size remains zero. Note that this is not really sufficient to verify that no key-value mappings were added. But, at the moment, it is all we can do.

Iteration 02: implementation

Here's an implementation that make our test pass:

public class HashTable<K, V> extends iter01.HashTable<K, V> {
  public final V put(K key, V value) {
    Objects.requireNonNull(key, "key == null");
    Objects.requireNonNull(value, "value == null");

    return put0(key, value);
  }

  protected V put0(K key, V value) {
    throw new UnsupportedOperationException("Implement me");
  }
}

The first thing to notice is that this class is a subclass of the implementation of the previous iteration. I mentioned this at the beginning of the post. We want to easily see what has changed from the previous iteration.

And what has changed from the previous iteration? We created a put implementation that rejects both null keys and null values.

If, on the other hand, both the key and the value are non-null, it continues execution by calling a put0 method. At this iteration the put0 method simply throws an UnsupportedOperationException.

Iteration 03: put key-value mappings (happy path)

We handled the put method preconditions in the previous section. Now we will actually add key-value mappings to our hash table exercising a happy path.

Our happy path consists of:

If you don't know what a hash collision is, don't worry. We will discuss it in this series.

Iteration 03: test case

We start with a test case.

Given the requirements of our test, let's use Integer instances as keys. The hash code value of a Integer object is simply the int value it wraps. Integer instances, therefore, make it easy to define an exact hash code value for our key.

Here's our test listing:

public class HashTableTest {
  @Test(description = """
  put() and size() methods

  - positive hash codes
  - no hash collisions
  - no value replacements
  """)
  public void iter03() {
    var ht = new HashTable<Integer, String>();
    assertEquals(ht.size(), 0);

    assertEquals(ht.put(1, "One"), null);
    assertEquals(ht.size(), 1);

    assertEquals(ht.put(2, "Two"), null);
    assertEquals(ht.size(), 2);
  }
}

We start by creating a hash table instance that maps Integer values to their names in English.

We add two key-value mappings to our hash table; and, since there are no value replacements, we assert that the put method returns null. Remember the Javadocs for put method. The put method returns:

the previous value associated with key, or null if there was no mapping for key.

Additionally, we verify that the size of our hash table increases by one after each key-value insertion.

Iteration 03: implementation

And here's an implementation that makes our third test pass:

public class HashTable<K, V> extends iter02.HashTable<K, V> {
  protected Object[] keys = new Object[4];

  protected Object[] values = new Object[4];

  protected int bucket(Object key) {
    var hc = key.hashCode();

    return hc % keys.length;
  }

  @Override
  protected final V put0(K key, V value) {
    var bucket = bucket(key);

    var existing = keys[bucket];

    if (existing == null) {
      return putInsert(key, value, bucket);
    }

    return put1(key, value, bucket, existing);
  }

  protected V put1(K key, V value, int bucket, Object existing) {
    throw new UnsupportedOperationException("Implement me");
  }

  protected V putInsert(K key, V value, int bucket) {
    keys[bucket] = key;

    values[bucket] = value;

    size++;

    return null;
  }
}

Let's take a closer look into the implementation.

Iteration 03: the keys and values arrays

Our class declares two arrays of the same length: keys and values. As their names indicate they are used to store the key-value mappings. A key-value pair mapping is stored at the two arrays in the following manner:

In other words, a single key-value pair can be stored like so:

keys[bucket] = key;
values[bucket] = value;

Similarly, a single key-value pair can be read like so:

var key = keys[bucket];
var value = values[bucket];

The following diagram illustrates a possible disposition for the mappings 1=One and 2=Two.

--------|-------+-------+-------+-------+
  index:|     0 |     1 |     2 |     3 |
--------|-------+-------+-------+-------+  
   keys:|       |     1 |     2 |       |
--------|-------+-------+-------+-------+
 values:|       | "One" | "Two" |       |
--------|-------+-------+-------+-------+

But how is the bucket index computed?

Iteration 03: the initial array length

Before we look into the bucket method, notice how the hash table arrays were both initialized with a length of 4:

protected Object[] keys = new Object[4];

protected Object[] values = new Object[4];

The choice of a power-of-two number is not a coincidence. We will look into it later. For now, just consider it an arbitrary positive value.

Iteration 03: the bucket method

The bucket method is responsible for computing the index. At this iteration the method returns the remainder of the division between the key's hash code value and the keys array length.

// the iter 03 'incomplete' implementation
protected int bucket(Object key) {
  var hc = key.hashCode();

  return hc % keys.length;
}

As an example, consider the following:

Then the bucket variable will hold a value between:

These are all of the valid values for the indices of the keys array. And, as keys and values have the same length, these are also all of the valid values for the indices of the values array.

As a final note I should mention that the implementation of this iteration is known to be incomplete. It fails to handle negative hash code values. We will look into it during iteration #05.

Iteration 03: is this a new mapping?

With the bucket index computed, the put0 method goes on to verify if hash table contains the specified key or not. It does so by checking if the keys array component at that index is null.

var bucket = bucket(key);

var existing = keys[bucket];

if (existing == null) {
  // new mapping
  return putInsert(key, value, bucket);
}

...

Remember, our hash table disallows null keys; so if a component of the keys array is null, it means there is no existing mapping at that location.

Iteration 03: the putInsert method

The actual insertion is done at the putInsert method. After storing both the specified key and value it increments the hash table's size value by one and returns null.

protected V putInsert(K key, V value, int bucket) {
  keys[bucket] = key;

  values[bucket] = value;

  size++;

  return null;
}

Iteration 03: same key or collision?

If the keys array component at the computed index is non-null, it means that either:

if (existing == null) {
  return putInsert(key, value, bucket);
}

// what to do here?
return put1(key, value, bucket, existing);

We will look into those situations in a moment. For now, we throw an UnsupportedOperationException just in case we inadvertently reach this branch.

protected V put1(K key, V value, int bucket, Object existing) {
  throw new UnsupportedOperationException("Implement me");
}

Iteration 03: a note on the two arrays

In our hash table implementation we are storing the keys and the values in two distinct arrays. While I try to have code as closer to "real-life" as possible, this is a blog post. In other words, I believe the "easy to understand" requirement is more important.

In a more "production-ready" hash table implementation, however, we would probably store the key-value pairs in a single array. For an example from the JDK see java.util.ImmutableCollections.MapN.

One reason would be to reduce memory requirements of our hash table:

A second reason would be to (probably) improve performance. I am sorry I do not have much to add on this topic as I do not know much about performance. I believe it relates to this article but don't take my word for it.

Iteration 04: get the value mapped to a key (happy path)

We have implemented the put method for a happy path. Let's now implement the get method for the same case:

Iteration 04: test case

We start with a test case once more:

public class HashTableTest {
  @Test(description = """
  get() method

  - hash table contains the mapping
  - no hash collisions
  - positive hash codes
  """)
  public void iter04() {
    var ht = new HashTable<Integer, String>();
    assertEquals(ht.size(), 0);

    assertEquals(ht.put(1, "One"), null);
    assertEquals(ht.size(), 1);
    assertEquals(ht.get(1), "One");

    assertEquals(ht.put(2, "Two"), null);
    assertEquals(ht.size(), 2);
    assertEquals(ht.get(2), "Two");
  }
}

The test is very similar to the one from the previous iteration. The difference is that we verify that the hash table has correctly stored the key-value pairings. We do so by invoking the get method for each stored key: we expected it to return the correct value.

Iteration 04: implementation

And here's an implementation that make our fourth test pass:

public class HashTable<K, V> extends iter03.HashTable<K, V> {
  @SuppressWarnings("unchecked")
  public final V get(Object key) {
    Objects.requireNonNull(key, "key == null");

    var bucket = bucket(key);

    var candidate = keys[bucket];

    if (key.equals(candidate)) {
      return (V) values[bucket];
    }

    return get0(key, bucket, candidate);
  }

  protected V get0(Object key, int bucket, Object candidate) {
    throw new UnsupportedOperationException("Implement me");
  }
}

Let's look at our get implementation.

It starts by rejecting null keys. The Javadocs of the get(Object key) method states that it should throw a NullPointerException:

if the specified key is null and this map does not permit null keys

Next, it invokes the bucket method created in the previous iteration. Remember that this method computes and returns the array index at which to store the specified key.

So, if the specified key is equal to the value at the index returned by the bucket method, it means our hash table contains a mapping for the specified key. In this case it returns the corresponding value from the values array.

Note that the expression in the if statement is "null-safe" as key is guaranteed to be non-null.

If the expression evaluates to false then our get method invokes the get0 method. For this iteration it simply throws a UnsupportedOperationException.

In the next blog posts of this series

We have a bare-bones hash table implementation so far. We are able to both "put" and "get" mappings, albeit in a very limited set of conditions.

In the next blog posts of this series, among other things, we will:

Stay tuned.

The source code of all of the examples can be found in this GitHub repository. It also contains the draft code for the upcoming posts.

Continue reading

The second part of this series is already available.

You can continue reading by following the link.