Scanning Java Class Files #4: String Literals

In the previous post in this series, we learned that the constant pool is a sequence of varying-size elements and that, while the ClassFile structure provides the number of entries in the constant pool, we can only know the location of a particular entry by walking the constant pool up to that entry.
We also improved our scanner implementation: it iterates through the entries in the constant pool, storing the index where each entry begins and printing the entry number and its kind. While our scanner visits every entry in the constant pool, it does not read the value associated with the entry.
In this fourth blog post in this series, we'll read and print the value of all string literals in the constant pool.
A Note on the Class-File API
We are not using the Class-File API in this blog post series, as it abstracts away the low-level details of the Java class file format we want to learn about.
Print the Value of All String Literals
We want our scanner to visit and process all string literals defined in the constant pool. To verify that our implementation works correctly, we will have it print the string literal values as we visit them. However, part of the output our scanner currently produces is just noise for this goal. So, before working on the string literals, we'll suppress the undesired output.
09: Skip Version
Our scanner currently prints the minor and major version numbers. Instead of doing that, we will simply skip them. Let's rename the printVersion method to skipVersion and change its implementation to the following:
private void skipVersion() throws IOException {
  check(4);

  idx += 4;
}
In the method, we:
- Check if the class file has at least 4 more bytes available, throwing if not.
- Advance the idx instance variable so it points to the byte immediately after the class file version.
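To see the two steps in isolation, here is a minimal, self-contained sketch. The SkipVersionSketch class, its constructor, and the position method are hypothetical, and the body of check is only our assumption of what the method from the earlier posts does (fail when fewer than the requested bytes remain).

```java
import java.io.IOException;

final class SkipVersionSketch {
  private final byte[] data;

  private int idx;

  SkipVersionSketch(byte[] data, int startIndex) {
    this.data = data;
    this.idx = startIndex;
  }

  // Assumed behavior: throw if fewer than `required` bytes remain.
  private void check(int required) throws IOException {
    if (data.length - idx < required) {
      throw new IOException("Unexpected end of file");
    }
  }

  // Skips minor_version (2 bytes) and major_version (2 bytes).
  void skipVersion() throws IOException {
    check(4);

    idx += 4;
  }

  int position() {
    return idx;
  }

  public static void main(String[] args) throws IOException {
    // magic (4 bytes) followed by minor and major version (2 bytes each)
    byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 65};

    // start right after the magic number
    SkipVersionSketch scanner = new SkipVersionSketch(header, 4);

    scanner.skipVersion();

    System.out.println(scanner.position()); // prints 8
  }
}
```

If the array held fewer than 4 bytes after the magic number, skipVersion would throw instead of advancing.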
10: Quiet Constant Pool Traversal
Our scanner, while traversing and indexing the constant pool, currently prints the entry number and the constant kind name. Let's suppress this output. We'll change the switch statement in the readConstantPoolEntry method to the following:
switch (tag) {
  case CONSTANT_Utf8 -> {
    int l = readU2();

    idx += l;
  }

  case CONSTANT_Integer -> { idx += 4; }

  case CONSTANT_Float -> { idx += 4; }

  case CONSTANT_Long -> { idx += 8; entry++; }

  ...

  default -> throw new IOException("Unknown constant pool tag=" + tag);
}
We've removed the p method and all of its invocations found in the switch statement.
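As an illustration of what the quiet traversal leaves behind, the following sketch indexes a tiny hand-built constant pool. It is not our scanner's code: the QuietTraversalSketch class and its index method are hypothetical, it handles only four entry kinds, and it assumes the tag values listed in the JVMS (1 for Utf8, 3 for Integer, 5 for Long, 8 for String).

```java
import java.io.IOException;
import java.util.Arrays;

final class QuietTraversalSketch {
  static final byte CONSTANT_Utf8 = 1;
  static final byte CONSTANT_Integer = 3;
  static final byte CONSTANT_Long = 5;
  static final byte CONSTANT_String = 8;

  // Maps each entry number to the offset of its tag byte within `pool`.
  // `count` plays the role of constant_pool_count (number of entries plus one).
  // Slot 0 and the slot following a Long entry are never written.
  static int[] index(byte[] pool, int count) throws IOException {
    int[] starts = new int[count];

    int idx = 0;

    for (int entry = 1; entry < count; entry++) {
      starts[entry] = idx;

      byte tag = pool[idx++];

      switch (tag) {
        case CONSTANT_Utf8 -> {
          int length = ((pool[idx] & 0xFF) << 8) | (pool[idx + 1] & 0xFF);

          idx += 2 + length;
        }

        case CONSTANT_Integer -> idx += 4;

        // a Long takes up two entry numbers
        case CONSTANT_Long -> { idx += 8; entry++; }

        case CONSTANT_String -> idx += 2;

        default -> throw new IOException("Unknown constant pool tag=" + tag);
      }
    }

    return starts;
  }

  public static void main(String[] args) throws IOException {
    byte[] pool = {
      1, 0, 2, 'H', 'i',            // #1: Utf8 "Hi"
      5, 0, 0, 0, 0, 0, 0, 0, 42,   // #2: Long 42 (also occupies #3)
      8, 0, 1                       // #4: String -> #1
    };

    System.out.println(Arrays.toString(index(pool, 5))); // prints [0, 0, 5, 0, 14]
  }
}
```

Nothing is printed during the walk itself; the only result is the array of starting offsets.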
Next, let's start writing the string literal visiting code.
11: The processStringLiterals Method
In the traverseConstantPool method, we iterate through each entry in the constant pool. On every iteration, we record the starting index of the entry in the constantPoolIndex array. Once the method finishes, this array maps each constant pool entry number to its starting position within the class file.
To visit and process all string literals in the constant pool, we write the following processStringLiterals method:
private void processStringLiterals() throws IOException {
  for (int entry = 1; entry < constantPoolIndex.length; entry++) {
    int index;
    index = constantPoolIndex[entry];

    byte tag;
    tag = data[index];

    if (tag != CONSTANT_String) {
      continue;
    }

    processStringLiteral(index);
  }
}

private void processStringLiteral(int index) throws IOException {
  ...
}
Here's a breakdown:
- It consists of a for loop statement, which iterates over the components of the constantPoolIndex array.
- We initialize the entry variable to 1, as the constant pool uses 1-based indexing.
- Inside the loop, we retrieve the starting position of each entry from the constantPoolIndex array and store it in the index local variable.
- Using the index value, we read the corresponding tag byte from the data array.
- If the tag value does not match CONSTANT_String, we skip to the next constant pool entry.
- If the tag indicates a string literal, we call processStringLiteral with the current index value to handle it.
The CONSTANT_String_info Structure
The structure for the String constant pool entry is defined in Section 4.4.3 of the JVMS:
CONSTANT_String_info {
u1 tag;
u2 string_index;
}
It consists of the 1-byte tag value followed by a u2 value named string_index. The latter represents the entry number of a Utf8 entry in the same constant pool. The Utf8 entry stores the string literal contents.
12: The processStringLiteral Method
Let's translate the CONSTANT_String_info structure into a Java method. We write the processStringLiteral method as follows:
private void processStringLiteral(int index) throws IOException {
  int stringIndex;
  stringIndex = readU2(index + 1);

  String utf8;
  utf8 = readUtf8(stringIndex);

  System.out.printf(" %s%n", utf8);
}

private String readUtf8(int entryNumber) throws IOException {
  ...
}
Here's a breakdown:
- The method takes an index parameter, whose value is equal to the position of the tag value of our String constant pool entry.
- We begin the method by reading the u2 value that follows the tag. We use a modified readU2 method, which we'll see in more detail in a bit.
- This u2 value represents the entry number of a Utf8 entry in the same constant pool. We use this entry number to call the readUtf8 method.
- The readUtf8 method is not implemented yet, but it should return the decoded Utf8 entry value as a String object.
- We print the string literal value.
13: Modified readU2 Method
For completeness, here's the modified readU2 we used in the previous section:
private int readU2() throws IOException {
  check(2);

  int index = idx;

  idx += 2;

  return readU2(index);
}

private int readU2(int index) {
  byte b0 = data[index + 0];
  byte b1 = data[index + 1];

  int v0 = toInt(b0, 8);
  int v1 = toInt(b1, 0);

  return v0 | v1;
}
The original no-argument readU2 method was refactored to delegate to the new readU2 overload: the bottom section of the original, which combines the two bytes, was moved into the new method.
The new readU2 takes an index parameter instead of relying on the idx instance variable. Additionally, it does not perform the bounds check on data, as it is only used to read a section of the data array which was already visited during the constant pool traversal.
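The toInt helper comes from an earlier post in this series and is not shown here. The sketch below assumes it masks the byte to an unsigned value and then shifts it into place; the ReadU2Sketch class is hypothetical, and readU2 takes the data array as a parameter only to keep the example self-contained.

```java
final class ReadU2Sketch {
  // Assumed implementation: treat the byte as unsigned, then shift it.
  static int toInt(byte b, int shift) {
    return (b & 0xFF) << shift;
  }

  // Reads a big-endian unsigned 16-bit value starting at `index`.
  static int readU2(byte[] data, int index) {
    byte b0 = data[index + 0];
    byte b1 = data[index + 1];

    int v0 = toInt(b0, 8);
    int v1 = toInt(b1, 0);

    return v0 | v1;
  }

  public static void main(String[] args) {
    byte[] data = {0x01, 0x02, 0x00, (byte) 0x80};

    System.out.println(readU2(data, 0)); // prints 258
    System.out.println(readU2(data, 2)); // prints 128
  }
}
```

The mask matters because Java bytes are signed: without it, the byte 0x80 would widen to -128 and the second call would return -128 rather than 128.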
The CONSTANT_Utf8_info Structure
In Section 4.4.7 of the JVMS we find the structure for the Utf8 constant pool entry:
CONSTANT_Utf8_info {
u1 tag;
u2 length;
u1 bytes[length];
}
The tag value is followed by a u2 value named length, which represents the number of bytes in the encoded string that follows.
14: The readUtf8 Method
We translate the information of the previous section into the readUtf8 method:
private String readUtf8(int entryNumber) throws IOException {
  int index;
  index = constantPoolIndex[entryNumber];

  byte tag;
  tag = data[index];

  if (tag != CONSTANT_Utf8) {
    throw new IOException("Malformed constant pool");
  }

  int length;
  length = readU2(index + 1);

  return new String(data, index + 3, length, StandardCharsets.UTF_8);
}
Here's a breakdown:
- The method accepts an entryNumber parameter, which is the entry number of our CONSTANT_Utf8_info structure.
- We begin the method by retrieving the starting position of the entry from the constantPoolIndex array.
- Next, we read the tag byte from the data array.
- We check if the tag value is equal to CONSTANT_Utf8; if not, we throw an IOException signaling a malformed constant pool.
- We then read the length value using the readU2 method.
- Finally, we create a new String directly from the data array, starting at the byte that follows the length value.
The last step has an issue which we'll see in a bit. In the meantime, let's test our implementation as it currently stands.
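Before running against a real class file, the two-step lookup can be exercised on a tiny hand-built constant pool. The sketch below is a simplified, static rendition of the methods above, not our scanner's code: the StringLiteralSketch class and the stringLiteral method are hypothetical, the arrays are passed as parameters, and it throws IllegalArgumentException instead of IOException to keep the example short. It uses an ASCII-only string, so the standard UTF-8 decoding is not a concern here.

```java
import java.nio.charset.StandardCharsets;

final class StringLiteralSketch {
  static final byte CONSTANT_Utf8 = 1;
  static final byte CONSTANT_String = 8;

  static int readU2(byte[] data, int index) {
    return ((data[index] & 0xFF) << 8) | (data[index + 1] & 0xFF);
  }

  // Decodes the Utf8 entry with the given entry number.
  static String readUtf8(byte[] data, int[] constantPoolIndex, int entryNumber) {
    int index = constantPoolIndex[entryNumber];

    if (data[index] != CONSTANT_Utf8) {
      throw new IllegalArgumentException("Malformed constant pool");
    }

    int length = readU2(data, index + 1);

    // skip the tag (1 byte) and the length (2 bytes)
    return new String(data, index + 3, length, StandardCharsets.UTF_8);
  }

  // Resolves a String entry: tag, then string_index pointing at a Utf8 entry.
  static String stringLiteral(byte[] data, int[] constantPoolIndex, int entryNumber) {
    int index = constantPoolIndex[entryNumber];

    if (data[index] != CONSTANT_String) {
      throw new IllegalArgumentException("Not a String entry");
    }

    int stringIndex = readU2(data, index + 1);

    return readUtf8(data, constantPoolIndex, stringIndex);
  }

  public static void main(String[] args) {
    byte[] data = {
      8, 0, 2,           // #1: String -> #2
      1, 0, 2, 'H', 'i'  // #2: Utf8 "Hi"
    };

    int[] constantPoolIndex = {0, 0, 3};

    System.out.println(stringLiteral(data, constantPoolIndex, 1)); // prints Hi
  }
}
```

In this example the String entry comes before the Utf8 entry it refers to. Since string_index may point at an entry located later in the pool, the whole pool has to be indexed before any string literal is resolved.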
Testing Our Implementation
First, let's test our implementation against our HelloWorld.class:
$ java --enable-preview ClassFile4.java HelloWorld.class
ClassFile /tmp/HelloWorld.class
Hello, World!
It found and printed a single string literal, our "Hello, World!" message.
Next, let's try it on a class having more string literals.
We write the following Java class:
public class StringLiterals {
  static final String A = "Static Field Init";

  final String b = "Instance Field Init";

  void m() {
    var s = "Local Variable";

    t("Method arg");
  }

  void t(String s) {}
}
And we run our implementation against its compiled class file:
$ java --enable-preview ClassFile4.java StringLiterals.class
ClassFile /tmp/StringLiterals.class
Instance Field Init
Local Variable
Method arg
Static Field Init
Our implementation seems to be working. To be sure, let's test our implementation against some emoji.
Code Points Above U+FFFF
Our current implementation does not work with Unicode code points that are above U+FFFF.
As an example, let's test it with the following Java class:
public class Emoji { String example = "Hi 😀";}
It declares a string literal containing an emoji. Let's run our implementation against its compiled class file:
$ java --enable-preview ClassFile4.java Emoji.class
ClassFile /tmp/Emoji.class
Hi ��
That's not the correct value:
- Utf8 constant pool entries encode the string value using a modified UTF-8 encoding.
- Our implementation wrongly assumes the byte stream uses the standard UTF-8 encoding.
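To make the difference concrete, the sketch below prints the bytes of our example string in both encodings. It is only an illustration: the ModifiedUtf8Sketch class and its methods are hypothetical. It relies on DataOutputStream.writeUTF from the JDK, which writes a 2-byte length followed by the string in modified UTF-8; the sketch drops the length prefix to compare the encoded bytes only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HexFormat;

final class ModifiedUtf8Sketch {
  static byte[] standard(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Encodes `s` in modified UTF-8 by way of DataOutputStream.writeUTF.
  static byte[] modified(String s) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();

      try (DataOutputStream out = new DataOutputStream(bytes)) {
        out.writeUTF(s);
      }

      byte[] all = bytes.toByteArray();

      // drop the 2-byte length prefix written by writeUTF
      return Arrays.copyOfRange(all, 2, all.length);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static String hex(byte[] bytes) {
    return HexFormat.ofDelimiter(" ").formatHex(bytes);
  }

  public static void main(String[] args) {
    // "Hi " followed by U+1F600, written as a surrogate pair
    String s = "Hi \uD83D\uDE00";

    System.out.println(hex(standard(s))); // prints 48 69 20 f0 9f 98 80
    System.out.println(hex(modified(s))); // prints 48 69 20 ed a0 bd ed b8 80
  }
}
```

The standard encoding uses a single 4-byte sequence for the emoji, while the modified encoding uses two 3-byte sequences, one for each surrogate of the pair. A standard UTF-8 decoder rejects those surrogate sequences, which explains the replacement characters in our output.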
We should work on this issue next.
In the Next Blog Post in This Series
In this fourth blog post in this series, we learned that the String constant pool entry, apart from its tag value, has a 2-byte string_index value. The latter represents the entry number of a Utf8 entry in the same constant pool containing the actual string literal value.
We also learned that the Utf8 constant pool entry stores the string value using a modified UTF-8 encoding.
In this blog post, we had our class file scanner visit all string literals defined in a Java class file.
However, our scanner does not print the correct value if the string literal contains Unicode code points above U+FFFF. So, in the next blog post in this series, we'll have our class file scanner correctly decode all Utf8 string values found in a Java class file.
You can find the source code used in this blog post in this Gist.