Implementing a two-method array-backed growable list in Java
edit 2022-06-22: ImmutableCollections.List12 is not array-backed
Every Java application will eventually use an instance of the List
interface.
I do not have actual numbers but I believe it is true.
Case in point here is a snippet from the ClassLoader
class:
// The classes loaded by this class loader. The only purpose of this table
// is to keep the classes from being GC'ed until the loader is GC'ed.
private final ArrayList<Class<?>> classes = new ArrayList<>();
It means a "Hello world" program will use List
objects, albeit not directly in its source code.
Among List
implementations, the ones backed by an array are the most common ones.
Some examples from the JDK are:
-
ArrayList
-
Arrays.ArrayList
as returned by theArrays.asList
method -
ImmutableCollections.ListN
as returned by the manyList.of
methods
Again, I have no numbers but I am pretty confident on this claim. Though I should stress the fact I am speaking about Java. This may not hold true in other languages, Lisp dialects in particular. Note however I know very little about Lisp.
In this blog post we will focus on the ArrayList
class.
We will implement a List
like class having only two of its methods:
-
add(E e)
-
toArray()
In doing so we will discuss some of the ArrayList
design choices and internals.
Though the number of methods discussed might seem low there is a lot to unpack.
So let's get started.
Our running test case
We will start with a test case. It exercises a happy path. It is enough for the purposes of this blog post:
@Test
public void test() {
var list = new GrowableList<UUID>();
var arr = new Object[100_000];
for (int i = 0; i < arr.length; i++) {
var uuid = UUID.randomUUID();
assertTrue(list.add(uuid));
arr[i] = uuid;
}
assertEquals(list.toArray(), arr);
}
The name of our class is GrowableList
.
It only declares a add
and a toArray
methods.
The test creates an instance of our list and a object array of length 100_000
.
It then fills the array with random UUID
instances.
It adds the same instances in the same order to our list.
Finally it verifies if the list contains the objects in the same order as they were added.
It does so by comparing the toArray
return value with the array filled with the UUID
instances.
It is worth noting that this test has a chicken-and-egg situation. We are using two unimplemented methods to verify the correct implementation of each other. We could have used the list's actual array for the assertion. But for the purposes of this post this test should be enough.
Starting with an "always throws" implementation
With our test defined let's start writing our class. As a starting point, we will write an "always throws" implementation:
public class GrowableList<E> {
public boolean add(E e) {
throw new UnsupportedOperationException("Implement me");
}
public Object[] toArray() {
throw new UnsupportedOperationException("Implement me");
}
}
It allows us to run our test and see it fail. That is what we are aiming for at the moment.
Note that the class does not implement the java.util.List
interface.
Remember, our goal is to create a List
like class declaring two methods only.
The signatures of both methods are identical to those from the List
interface.
The test should fail in the first add
method invocation:
FAILED: test
java.lang.UnsupportedOperationException: Implement me
at iter1.GrowableList.add(GrowableList.java:10)
at iter1.GrowableListTest.test(GrowableListTest.java:24)
Great. It is (not) working in the expected way.
Backing array: what declared type?
So the first step to make our test pass is to implement the add
method.
To do it, we must declare our backing array.
Since our list is generic on the type parameter E
we might me tempted to declare it as:
private E[] data;
While we can declare the field this way, we cannot create an array instance of type E
.
In other words, while the previous snippet compiles, the following snippet does not:
// does not compile
private E[] data = new E[10];
Due to the language's type erasure, the E[]
type is erased to a Object[]
type at runtime.
Or perhaps, to be more precise, the compiler does the type erasure when generating bytecode.
So let's just declare our data
field like so:
public class GrowableList<E> {
private Object[] data;
public boolean add(E e) {
throw new UnsupportedOperationException("Implement me");
}
public Object[] toArray() {
throw new UnsupportedOperationException("Implement me");
}
}
So this is the first thing to know about the ArrayList
implementation:
1) the declared type of the backing array of an ArrayList is Object[]
A corollary is that an unchecked cast is required in read methods.
For example the get
method implementation in ArrayList
is equivalent to:
@SuppressWarnings("unchecked")
public E get(int index) {
Objects.checkIndex(index, size);
return (E) data[index];
}
The cast is fine.
The reason is that all write methods allow only instances of E
to be added to the list.
The only catch in the previous statement is: as long as no raw types operations are performed. The following example compiles:
public class RawList {
@SuppressWarnings({"rawtypes", "unchecked"})
public static void main(String[] args) {
List raw = new ArrayList<String>();
raw.add("1");
raw.add("2");
raw.add(Integer.valueOf(3));
List<String> unchecked = raw;
for (String s : unchecked) {
System.out.println(s);
}
}
}
But causes a ClassCastException
during execution:
1
2
Exception in thread "main" java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.String (java.lang.Integer and java.lang.String are in module java.base of loader 'bootstrap')
at iter1.ListUnchecked.main(ListUnchecked.java:22)
But then we were aware of the risks by explicitly suppressing two of the compiler warnings.
Backing array: what initial value?
We have defined the type of our backing array: Object[]
.
Great. Now, we have to initialize it. Or do we have to? If so to what initial length?
Before we go into that let's discuss a little about memory consumption.
An experiment: a very large array
What happens if we try to allocate a very large array? In other words, is it possible or will we run into any trouble?
Let's find out.
The following is a simple program that creates a very large array of strings. It assigns a few instances to the array and prints "Bye" before exiting:
public class VMLimit {
public static void main(String[] args) {
var abc = new String[Integer.MAX_VALUE];
System.out.println("Actual array length is " + abc.length);
var i = 0;
abc[i++] = "A";
abc[i++] = "B";
abc[i++] = "C";
System.out.println("Bye");
}
}
On my machine it does not work:
$ java src/main/java/iter2/VMLimit.java
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at iter2.VMLimit.main(VMLimit.java:10)
Given the message, let's try a length that is a little smaller.
We can look at the JDK's ArraysSupport
source code for some inspiration:
public class PlayItSafe {
public static void main(String[] args) {
var abc = new String[Integer.MAX_VALUE - 8];
System.out.println("Actual array length is " + abc.length);
var i = 0;
abc[i++] = "A";
abc[i++] = "B";
abc[i++] = "C";
System.out.println("Bye");
}
}
And the output is slightly different:
$ java src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at iter2.PlayItSafe.main(PlayItSafe.java:10)
Let's try to increase the heap to 4G
:
$ java -Xmx4G src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at iter2.PlayItSafe.main(PlayItSafe.java:10)
Maybe 8G
:
$ java -Xmx8G src/main/java/iter2/PlayItSafe.java
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at iter2.PlayItSafe.main(PlayItSafe.java:10)
Ok, perhaps 10G
.
Let's also time the execution:
$ time java -Xmx10G src/main/java/iter2/PlayItSafe.java
Actual array length is 2,147,483,639
Bye
real 0m5,366s
user 0m3,153s
sys 0m3,159s
Our simple program:
-
requires a number between
8G
and10G
of heap to execute; and -
took more than 5 seconds (real time) to execute.
Granted, regarding the last statement, my Linux box runs on an old Phenom II X4 955. But still.
Arrays consume memory
The conclusion of our experiment might have already been a known fact for most. Still it is worth noting it.
In our experiment:
the
String[]
array requires heap memory even though the array is mostly "empty"
In the example the array has only 3 string references assigned.
The remaining indices of all of the 2,147,483,639
components are null
references.
To find out the actual memory requirements of such array we can use the JOL tool. JOL stands for "Java Object Layout". I must say I am very far from being an expert user of the tool.
Having said that, running it with our very large array gives the following output:
String[2147483639]
[Ljava.lang.String; object internals:
OFF SZ TYPE DESCRIPTION VALUE
0 8 (object header: mark) 0x0000000000000001 (non-biasable; age: 0)
8 4 (object header: class) 0x00008bd0
12 4 (array length) 2147483639
12 4 (alignment/padding gap)
16 8589934556 java.lang.String String;.<elements> N/A
8589934572 4 (object alignment gap)
Instance size: 8589934576 bytes
Space losses: 4 bytes internal + 4 bytes external = 8 bytes total
We are interested in the "Instance size" output: 8,589,934,576 bytes.
Which might explain why our program in the previous section failed to run with a heap of 8G
.
The array itself is almost 8GiB large.
It does not leave space for the JVM itself, i.e., main thread, class loaders, classes.
For perspective, running the tool with string array String[1_000]
:
String[1_000]
[Ljava.lang.String; object internals:
OFF SZ TYPE DESCRIPTION VALUE
0 8 (object header: mark) 0x0000000000000001 (non-biasable; age: 0)
8 4 (object header: class) 0x00008bd0
12 4 (array length) 1000
12 4 (alignment/padding gap)
16 4000 java.lang.String String;.<elements> N/A
Instance size: 4016 bytes
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
Requires close to 4KiB.
Note that this is the size of the array themselves. In other words, it does not account for the string objects the array might eventually reference.
Back to the backing array: what initial value?
After running our experiment, let's go back to our pending question.
We saw that arrays consume memory.
For this reason, in the ArrayList
class, the backing array is lazily initialized.
The following snippet tries to explain it:
List<String> list = new ArrayList<>();
// backing array is equivalent to:
// new Object[0]
list.add("initialize");
// backing array is:
// new Object[10]
When a new ArrayList
is created using the no-args constructor the backing array is initially zero-length.
In the snippet above it is mentioned that it is equivalent to a new Object[0]
.
It is equivalent because the zero-length array is not created each time.
Instead a singleton zero-length array is reused.
Put in Java code the previous paragraph becomes:
public class GrowableList<E> {
private static final Object[] NO_DATA = new Object[0];
private Object[] data = NO_DATA;
public boolean add(E e) {
throw new UnsupportedOperationException("Implement me");
}
public Object[] toArray() {
throw new UnsupportedOperationException("Implement me");
}
}
So this is the second thing to know about the ArrayList
implementation:
2) the backing array of an ArrayList is lazily initialized (when created using the no-args constructor or with a explicit zero capacity)
Why is this design choice a good thing? Simply put it reduces memory consumption.
A partial add
implementation
Our backing array is initialized.
It cannot hold any values yet but we can at least partially write the add
method.
The Javadocs for the List.add
method states:
Appends the specified element to the end of this list (optional operation).
So we need to store the index representing the end of the list. In other words, we need to store the location where the element should be appended to.
If the list is empty, the element should be stored at the index 0
.
If the list has one element, it should be stored at the index 1
.
If the list has two elements, index 2
.
And so on.
The index matches the size of the list.
Let's name it size
.
After an element is added to the list, we increase it by one.
The implementation becomes:
public class GrowableList<E> {
private static final Object[] NO_DATA = new Object[0];
private Object[] data = NO_DATA;
private int size;
public boolean add(E e) {
data[size++] = e;
return true;
}
public Object[] toArray() {
throw new UnsupportedOperationException("Implement me");
}
}
In this post we will not discuss whether our list should accept or reject null
values.
Let's run our test.
We expect it to fail when trying to assign the array component with index 0
:
FAILED: test
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
at iter1.GrowableList.add(GrowableList.java:16)
at iter1.GrowableListTest.test(GrowableListTest.java:24)
It is (not) working in the expected way.
Backing array: what initial length?
So we know the backing array is lazily initialized. But what is the length of the array when it needs to be created?
In a previous code snippet we hinted that the ArrayList
default capacity is 10
.
It says so in the ArrayList
no-args constructor Javadocs:
Constructs an empty list with an initial capacity of ten.
So, the first time the add
method is invoked, we should create a new array with a length of 10.
The implementation becomes:
public class GrowableList<E> {
private static final Object[] NO_DATA = new Object[0];
private Object[] data = NO_DATA;
private int size;
public boolean add(E e) {
if (data == NO_DATA) {
data = new Object[10];
}
data[size++] = e;
return true;
}
public Object[] toArray() {
throw new UnsupportedOperationException("Implement me");
}
}
Let's run our test.
It should fail like the previous run.
Except it should now fail at the index 10
:
FAILED: test
java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
at iter1.GrowableList.add(GrowableList.java:20)
at iter1.GrowableListTest.test(GrowableListTest.java:24)
Once again, it is (not) working in the expected way.
Backing array: grow only when required
Let's refactor our code before we implement the array resizing. As long as the array can hold additional elements, no resize is needed. Let's add it to our implementation:
public boolean add(E e) {
if (data == NO_DATA) {
data = new Object[10];
}
if (size < data.length) {
data[size++] = e;
return true;
}
throw new UnsupportedOperationException(
"Implement me: resize array");
}
Let's rerun our test.
It should fail with the UnsupportedOperationException
:
FAILED: test
java.lang.UnsupportedOperationException: Implement me: resize array
at iter1.GrowableList.add(GrowableList.java:26)
at iter1.GrowableListTest.test(GrowableListTest.java:24)
Indeed.
Backing array: what growth rate?
The next step to fix our failing test is to resize the backing array. The resize must occur once the array cannot hold any additional elements.
To be precise, as the length of an array is fixed, the steps are:
-
a new larger array must be allocated (if possible);
-
the references in the old array must be copied to the new array (in the same order); and
-
our backing array field must now reference the new larger array.
The question is:
-
how big should the new array be?
There are two competing issues here:
-
we should minimize the number of copying operations. Copying references from one array to the other costs CPU cycles. The larger the array the higher the cost; and
-
we should minimize memory consumption.
In other words:
the new array should be large enough to minimize copying operations. But not larger: it can potentially waste memory.
It is a matter of tradeoffs.
So the third thing to know about the ArrayList
is:
3) ArrayList grows its backing array by a factor of 1.5x (when required by an add method invocation).
Let's add it to our code:
public boolean add(E e) {
if (data == NO_DATA) {
data = new Object[10];
}
if (size < data.length) {
return append(e);
}
int halfLength = data.length >> 1;
int newLength = data.length + halfLength;
data = Arrays.copyOf(data, newLength);
return append(e);
}
private boolean append(E e) {
data[size++] = e;
return true;
}
Let's run our test.
If our resize code is correct, we expect the test to fail on the missing toArray
implementation:
FAILED: test
java.lang.UnsupportedOperationException: Implement me
at iter1.GrowableList.toArray(GrowableList.java:36)
at iter1.GrowableListTest.test(GrowableListTest.java:29)
Great. The array resizing code seems to be working properly.
The toArray
method implementation
The toArray
Javadocs states:
Returns an array containing all of the elements in this list in proper sequence (from first to last element). (...) this method must allocate a new array even if this list is backed by an array
It must return a copy of the backing array limited to the actual elements present in the list. If the list has 3 elements, the length of returned array must be 3. Even if the backing array length might be greater than that.
The implementation is trivial when using the Arrays.copyOf
method:
public Object[] toArray() {
return Arrays.copyOf(data, size);
}
And the test now passes:
PASSED: test
Great!
int
overflow
The following program shows how the length of backing array grows. It starts with a length of 10 and continually applies the "grows by a factor of 1.5x" logic to it. It loops while the length is positive.
public class GrowthRate {
public static void main(String[] args) {
System.out.println(" iter : length");
System.out.println("--------------------------");
int iter = 1;
int length = 10;
do {
System.out.format("%9d : %,14d%n", iter++, length);
int halfLength = length >> 1;
length = length + halfLength;
} while (length > 0);
System.out.format("%9d : %,14d%n", iter, length);
System.out.println();
System.out.format("Soft limit: %,14d%n", Integer.MAX_VALUE - 8);
}
}
Running the program gives the following output:
- iter : length
--------------------------
1 : 10
2 : 15
3 : 22
4 : 33
5 : 49
(...)
45 : 532,254,060
46 : 798,381,090
47 : 1,197,571,635
48 : 1,796,357,452
49 : -1,600,431,118
Soft limit: 2,147,483,639
We can see two things:
-
the length overflows in the 49th iteration; and
-
the length from the previous iteration is still far from the soft limit.
A solution is:
private static final int SOFT_MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;
public boolean add(E e) {
if (data == NO_DATA) {
data = new Object[10];
}
if (size < data.length) {
return append(e);
}
int halfLength = data.length >> 1;
int newLength = data.length + halfLength;
if (newLength < 0) {
newLength = SOFT_MAX_ARRAY_LENGTH;
}
data = Arrays.copyOf(data, newLength);
return append(e);
}
If newLength
overflows the array is resized to the SOFT_MAX_ARRAY_LENGTH
value instead.
This is close to what ArrayList
does.
The actual solution used by ArrayList
involves additional checks.
The reason being it is required to support adding elements from a collection:
boolean addAll(Collection<? extends E> c);
Let's go back to the output of our GrowthRate
program:
- iter : length
--------------------------
1 : 10
2 : 15
Suppose an ArrayList
has 10 elements.
We would be in the first row of the table.
Any add operation would require a backing array resize.
Next, suppose we invoke the addAll
method on our list with a collection containing 8 additional elements.
From the table, if the preferred growth rate was to be used, the new array length would be 15.
An array of length 15 is not enough to hold all required 18 elements.
So ArrayList
would increase the backing array to 18 instead.
But we are not implementing the addAll
method (defined by the Collection
interface).
Therefore, for the purposes of this blog post, the suggested solution is enough.
Regardless, this is the fourth thing to know about the ArrayList
:
4) ArrayList takes int overflows into account when growing the array
Of course any growable array must take int overflows into account.
It is not restricted to the ArrayList
.
Conclusion
This was a much longer post than anticipated.
In this blog post we implemented an array-backed List
like class.
Even though our class had only two methods there was a lot to unpack:
-
the declared type of the backing array
-
arrays consume memory
-
the JOL tool
-
the initial value of the backing array
-
the initial length of the backing array
-
the growth rate of the backing array
-
accounting for
int
overflow
We might revisit this subject in a future blog post. There is more to be discussed.