Implementing a two-method array-backed growable list in Java

Marcio Endo
June 22, 2022

edit 2022-06-22: ImmutableCollections.List12 is not array-backed

Every Java application will eventually use an instance of the List interface.

I do not have actual numbers but I believe it is true. Case in point here is a snippet from the ClassLoader class:

// The classes loaded by this class loader. The only purpose of this table
// is to keep the classes from being GC'ed until the loader is GC'ed.
private final ArrayList<Class<?>> classes = new ArrayList<>();

It means a "Hello world" program will use List objects, albeit not directly in its source code.

Among List implementations, the ones backed by an array are the most common ones. Some examples from the JDK are:

Again, I have no numbers but I am pretty confident on this claim. Though I should stress the fact I am speaking about Java. This may not hold true in other languages, Lisp dialects in particular. Note however I know very little about Lisp.

In this blog post we will focus on the ArrayList class. We will implement a List like class having only two of its methods:

In doing so we will discuss some of the ArrayList design choices and internals. Though the number of methods discussed might seem low there is a lot to unpack. So let's get started.

Our running test case

We will start with a test case. It exercises a happy path. It is enough for the purposes of this blog post:

@Test
public void test() {
  var list = new GrowableList<UUID>();

  var arr = new Object[100_000];

  for (int i = 0; i < arr.length; i++) {
    var uuid = UUID.randomUUID();

    assertTrue(list.add(uuid));

    arr[i] = uuid;
  }

  assertEquals(list.toArray(), arr);
}

The name of our class is GrowableList. It only declares a add and a toArray methods.

The test creates an instance of our list and a object array of length 100_000. It then fills the array with random UUID instances. It adds the same instances in the same order to our list. Finally it verifies if the list contains the objects in the same order as they were added. It does so by comparing the toArray return value with the array filled with the UUID instances.

It is worth noting that this test has a chicken-and-egg situation. We are using two unimplemented methods to verify the correct implementation of each other. We could have used the list's actual array for the assertion. But for the purposes of this post this test should be enough.

Starting with an "always throws" implementation

With our test defined let's start writing our class. As a starting point, we will write an "always throws" implementation:

public class GrowableList<E> {
  public boolean add(E e) {
    throw new UnsupportedOperationException("Implement me");
  }

  public Object[] toArray() {
    throw new UnsupportedOperationException("Implement me");
  }
}

It allows us to run our test and see it fail. That is what we are aiming for at the moment.

Note that the class does not implement the java.util.List interface. Remember, our goal is to create a List like class declaring two methods only. The signatures of both methods are identical to those from the List interface.

The test should fail in the first add method invocation:

FAILED: test
java.lang.UnsupportedOperationException: Implement me
        at iter1.GrowableList.add(GrowableList.java:10)
        at iter1.GrowableListTest.test(GrowableListTest.java:24)

Great. It is (not) working in the expected way.

Backing array: what declared type?

So the first step to make our test pass is to implement the add method.

To do it, we must declare our backing array. Since our list is generic on the type parameter E we might me tempted to declare it as:

private E[] data;

While we can declare the field this way, we cannot create an array instance of type E. In other words, while the previous snippet compiles, the following snippet does not:

// does not compile
private E[] data = new E[10];

Due to the language's type erasure, the E[] type is erased to a Object[] type at runtime. Or perhaps, to be more precise, the compiler does the type erasure when generating bytecode. So let's just declare our data field like so:

public class GrowableList<E> {
  private Object[] data;

  public boolean add(E e) {
    throw new UnsupportedOperationException("Implement me");
  }

  public Object[] toArray() {
    throw new UnsupportedOperationException("Implement me");
  }
}

So this is the first thing to know about the ArrayList implementation:

1) the declared type of the backing array of an ArrayList is Object[]

A corollary is that an unchecked cast is required in read methods. For example the get method implementation in ArrayList is equivalent to:

@SuppressWarnings("unchecked")
public E get(int index) {
  Objects.checkIndex(index, size);
  return (E) data[index];
}

The cast is fine. The reason is that all write methods allow only instances of E to be added to the list.

The only catch in the previous statement is: as long as no raw types operations are performed. The following example compiles:

public class RawList {
  @SuppressWarnings({"rawtypes", "unchecked"})
  public static void main(String[] args) {
    List raw = new ArrayList<String>();

    raw.add("1");
    raw.add("2");
    raw.add(Integer.valueOf(3));

    List<String> unchecked = raw;

    for (String s : unchecked) {
      System.out.println(s);
    }
  }
}

But causes a ClassCastException during execution:

1
2
Exception in thread "main" java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
        at iter1.ListUnchecked.main(ListUnchecked.java:22)

But then we were aware of the risks by explicitly suppressing two of the compiler warnings.

Backing array: what initial value?

We have defined the type of our backing array: Object[].

Great. Now, we have to initialize it. Or do we have to? If so to what initial length?

Before we go into that let's discuss a little about memory consumption.

An experiment: a very large array

What happens if we try to allocate a very large array? In other words, is it possible or will we run into any trouble?

Let's find out.

The following is a simple program that creates a very large array of strings. It assigns a few instances to the array and prints "Bye" before exiting:

public class VMLimit {
  public static void main(String[] args) {
    var abc = new String[Integer.MAX_VALUE];

    System.out.println("Actual array length is " + abc.length);

    var i = 0;

    abc[i++] = "A";
    abc[i++] = "B";
    abc[i++] = "C";

    System.out.println("Bye");
  }
}

On my machine it does not work:

$ java src/main/java/iter2/VMLimit.java
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at iter2.VMLimit.main(VMLimit.java:10)

Given the message, let's try a length that is a little smaller. We can look at the JDK's ArraysSupport source code for some inspiration:

public class PlayItSafe {
  public static void main(String[] args) {
    var abc = new String[Integer.MAX_VALUE - 8];

    System.out.println("Actual array length is " + abc.length);

    var i = 0;

    abc[i++] = "A";
    abc[i++] = "B";
    abc[i++] = "C";

    System.out.println("Bye");
  }
}

And the output is slightly different:

$ java src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at iter2.PlayItSafe.main(PlayItSafe.java:10)

Let's try to increase the heap to 4G:

$ java -Xmx4G src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at iter2.PlayItSafe.main(PlayItSafe.java:10)

Maybe 8G:

$ java -Xmx8G src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at iter2.PlayItSafe.main(PlayItSafe.java:10)

Ok, perhaps 10G. Let's also time the execution:

$ time java -Xmx10G src/main/java/iter2/PlayItSafe.java
Actual array length is 2,147,483,639
Bye

real    0m5,366s
user    0m3,153s
sys     0m3,159s

Our simple program:

Granted, regarding the last statement, my Linux box runs on an old Phenom II X4 955. But still.

Arrays consume memory

The conclusion of our experiment might have already been a known fact for most. Still it is worth noting it.

In our experiment:

the String[] array requires heap memory even though the array is mostly "empty"

In the example the array has only 3 string references assigned. The remaining indices of all of the 2,147,483,639 components are null references.

To find out the actual memory requirements of such array we can use the JOL tool. JOL stands for "Java Object Layout". I must say I am very far from being an expert user of the tool.

Having said that, running it with our very large array gives the following output:

String[2147483639]
[Ljava.lang.String; object internals:
OFF  SZ               TYPE DESCRIPTION               VALUE
  0   8                    (object header: mark)     0x0000000000000001 (non-biasable; age: 0)
  8   4                    (object header: class)    0x00008bd0
 12   4                    (array length)            2147483639
 12   4                    (alignment/padding gap)
 16 8589934556   java.lang.String String;.<elements>        N/A
8589934572   4                    (object alignment gap)
Instance size: 8589934576 bytes
Space losses: 4 bytes internal + 4 bytes external = 8 bytes total

We are interested in the "Instance size" output: 8,589,934,576 bytes. Which might explain why our program in the previous section failed to run with a heap of 8G. The array itself is almost 8GiB large. It does not leave space for the JVM itself, i.e., main thread, class loaders, classes.

For perspective, running the tool with string array String[1_000]:

String[1_000]
[Ljava.lang.String; object internals:
OFF  SZ               TYPE DESCRIPTION               VALUE
  0   8                    (object header: mark)     0x0000000000000001 (non-biasable; age: 0)
  8   4                    (object header: class)    0x00008bd0
 12   4                    (array length)            1000
 12   4                    (alignment/padding gap)
 16 4000   java.lang.String String;.<elements>        N/A
Instance size: 4016 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total

Requires close to 4KiB.

Note that this is the size of the array themselves. In other words, it does not account for the string objects the array might eventually reference.

Back to the backing array: what initial value?

After running our experiment, let's go back to our pending question.

We saw that arrays consume memory. For this reason, in the ArrayList class, the backing array is lazily initialized. The following snippet tries to explain it:

List<String> list = new ArrayList<>();
// backing array is equivalent to:
// new Object[0]

list.add("initialize");
// backing array is:
// new Object[10]

When a new ArrayList is created using the no-args constructor the backing array is initially zero-length. In the snippet above it is mentioned that it is equivalent to a new Object[0]. It is equivalent because the zero-length array is not created each time. Instead a singleton zero-length array is reused.

Put in Java code the previous paragraph becomes:

public class GrowableList<E> {
  private static final Object[] NO_DATA = new Object[0];

  private Object[] data = NO_DATA;

  public boolean add(E e) {
    throw new UnsupportedOperationException("Implement me");
  }

  public Object[] toArray() {
    throw new UnsupportedOperationException("Implement me");
  }
}

So this is the second thing to know about the ArrayList implementation:

2) the backing array of an ArrayList is lazily initialized (when created using the no-args constructor or with a explicit zero capacity)

Why is this design choice a good thing? Simply put it reduces memory consumption.

A partial add implementation

Our backing array is initialized. It cannot hold any values yet but we can at least partially write the add method.

The Javadocs for the List.add method states:

Appends the specified element to the end of this list (optional operation).

So we need to store the index representing the end of the list. In other words, we need to store the location where the element should be appended to.

If the list is empty, the element should be stored at the index 0. If the list has one element, it should be stored at the index 1. If the list has two elements, index 2. And so on.

The index matches the size of the list. Let's name it size. After an element is added to the list, we increase it by one. The implementation becomes:

public class GrowableList<E> {
  private static final Object[] NO_DATA = new Object[0];

  private Object[] data = NO_DATA;

  private int size;

  public boolean add(E e) {
    data[size++] = e;

    return true;
  }

  public Object[] toArray() {
    throw new UnsupportedOperationException("Implement me");
  }
}

In this post we will not discuss whether our list should accept or reject null values.

Let's run our test. We expect it to fail when trying to assign the array component with index 0:

FAILED: test
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
        at iter1.GrowableList.add(GrowableList.java:16)
        at iter1.GrowableListTest.test(GrowableListTest.java:24)

It is (not) working in the expected way.

Backing array: what initial length?

So we know the backing array is lazily initialized. But what is the length of the array when it needs to be created?

In a previous code snippet we hinted that the ArrayList default capacity is 10. It says so in the ArrayList no-args constructor Javadocs:

Constructs an empty list with an initial capacity of ten.

So, the first time the add method is invoked, we should create a new array with a length of 10. The implementation becomes:

public class GrowableList<E> {
  private static final Object[] NO_DATA = new Object[0];

  private Object[] data = NO_DATA;

  private int size;

  public boolean add(E e) {
    if (data == NO_DATA) {
      data = new Object[10];
    }

    data[size++] = e;

    return true;
  }

  public Object[] toArray() {
    throw new UnsupportedOperationException("Implement me");
  }
}

Let's run our test. It should fail like the previous run. Except it should now fail at the index 10:

FAILED: test
java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
        at iter1.GrowableList.add(GrowableList.java:20)
        at iter1.GrowableListTest.test(GrowableListTest.java:24)

Once again, it is (not) working in the expected way.

Backing array: grow only when required

Let's refactor our code before we implement the array resizing. As long as the array can hold additional elements, no resize is needed. Let's add it to our implementation:

public boolean add(E e) {
  if (data == NO_DATA) {
    data = new Object[10];
  }

  if (size < data.length) {
    data[size++] = e;

    return true;
  }

  throw new UnsupportedOperationException(
    "Implement me: resize array");
}

Let's rerun our test. It should fail with the UnsupportedOperationException:

FAILED: test
java.lang.UnsupportedOperationException: Implement me: resize array
        at iter1.GrowableList.add(GrowableList.java:26)
        at iter1.GrowableListTest.test(GrowableListTest.java:24)

Indeed.

Backing array: what growth rate?

The next step to fix our failing test is to resize the backing array. The resize must occur once the array cannot hold any additional elements.

To be precise, as the length of an array is fixed, the steps are:

The question is:

There are two competing issues here:

  1. we should minimize the number of copying operations. Copying references from one array to the other costs CPU cycles. The larger the array the higher the cost; and

  2. we should minimize memory consumption.

In other words:

the new array should be large enough to minimize copying operations. But not larger: it can potentially waste memory.

It is a matter of tradeoffs. So the third thing to know about the ArrayList is:

3) ArrayList grows its backing array by a factor of 1.5x (when required by an add method invocation).

Let's add it to our code:

public boolean add(E e) {
  if (data == NO_DATA) {
    data = new Object[10];
  }

  if (size < data.length) {
    return append(e);
  }

  int halfLength = data.length >> 1;

  int newLength = data.length + halfLength;

  data = Arrays.copyOf(data, newLength);

  return append(e);
}

private boolean append(E e) {
  data[size++] = e;

  return true;
}

Let's run our test. If our resize code is correct, we expect the test to fail on the missing toArray implementation:

FAILED: test
java.lang.UnsupportedOperationException: Implement me
        at iter1.GrowableList.toArray(GrowableList.java:36)
        at iter1.GrowableListTest.test(GrowableListTest.java:29)

Great. The array resizing code seems to be working properly.

The toArray method implementation

The toArray Javadocs states:

Returns an array containing all of the elements in this list in proper sequence (from first to last element). (...) this method must allocate a new array even if this list is backed by an array

It must return a copy of the backing array limited to the actual elements present in the list. If the list has 3 elements, the length of returned array must be 3. Even if the backing array length might be greater than that.

The implementation is trivial when using the Arrays.copyOf method:

public Object[] toArray() {
  return Arrays.copyOf(data, size);
}

And the test now passes:

PASSED: test

Great!

int overflow

The following program shows how the length of backing array grows. It starts with a length of 10 and continually applies the "grows by a factor of 1.5x" logic to it. It loops while the length is positive.

public class GrowthRate {
  public static void main(String[] args) {
    System.out.println("     iter :         length");
    System.out.println("--------------------------");

    int iter = 1;

    int length = 10;

    do {
      System.out.format("%9d : %,14d%n", iter++, length);

      int halfLength = length >> 1;

      length = length + halfLength;
    } while (length > 0);

    System.out.format("%9d : %,14d%n", iter, length);

    System.out.println();

    System.out.format("Soft limit: %,14d%n", Integer.MAX_VALUE - 8);
  }
}

Running the program gives the following output:

-    iter :         length
--------------------------
        1 :             10
        2 :             15
        3 :             22
        4 :             33
        5 :             49
(...)
       45 :    532,254,060
       46 :    798,381,090
       47 :  1,197,571,635
       48 :  1,796,357,452
       49 : -1,600,431,118

Soft limit:  2,147,483,639

We can see two things:

A solution is:

private static final int SOFT_MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

public boolean add(E e) {
  if (data == NO_DATA) {
    data = new Object[10];
  }

  if (size < data.length) {
    return append(e);
  }

  int halfLength = data.length >> 1;

  int newLength = data.length + halfLength;

  if (newLength < 0) {
    newLength = SOFT_MAX_ARRAY_LENGTH;
  }

  data = Arrays.copyOf(data, newLength);

  return append(e);
}

If newLength overflows the array is resized to the SOFT_MAX_ARRAY_LENGTH value instead.

This is close to what ArrayList does. The actual solution used by ArrayList involves additional checks. The reason being it is required to support adding elements from a collection:

boolean addAll(Collection<? extends E> c);

Let's go back to the output of our GrowthRate program:

-    iter :         length
--------------------------
        1 :             10
        2 :             15

Suppose an ArrayList has 10 elements. We would be in the first row of the table. Any add operation would require a backing array resize.

Next, suppose we invoke the addAll method on our list with a collection containing 8 additional elements.

From the table, if the preferred growth rate was to be used, the new array length would be 15. An array of length 15 is not enough to hold all required 18 elements. So ArrayList would increase the backing array to 18 instead.

But we are not implementing the addAll method (defined by the Collection interface). Therefore, for the purposes of this blog post, the suggested solution is enough.

Regardless, this is the fourth thing to know about the ArrayList:

4) ArrayList takes int overflows into account when growing the array

Of course any growable array must take int overflows into account. It is not restricted to the ArrayList.

Conclusion

This was a much longer post than anticipated.

In this blog post we implemented an array-backed List like class. Even though our class had only two methods there was a lot to unpack:

We might revisit this subject in a future blog post. There is more to be discussed.