Scanning Java Class Files #4: String Literals

Marcio Endo, Mar 16, 2025

In the previous post in this series, we learned that the constant pool is a sequence of varying-size elements, and, while the ClassFile structure provides the number of entries in a constant pool, we can only know the location of a particular entry by walking the constant pool up to that entry.

We also improved our scanner implementation; it iterates through the entries in the constant pool, storing the index where each entry begins, and printing the entry number and its kind. While our scanner visits every entry in the constant pool, it does not read the value associated with the entry.

In this fourth blog post in this series, we'll read and print the value of all string literals in the constant pool.

A Note on the Class-File API

We are not using the Class-File API in this blog post series, as it abstracts away the low-level details of the Java class file format we want to learn about.

We want our scanner to visit and process all string literals defined in the constant pool. To verify our implementation works correctly, we will have it print the string literal values as we visit them. However, parts of the output currently being produced add noise that distracts from this goal. So, before working on the string literals, we'll suppress the undesired output.

09: Skip Version

Our scanner currently prints the minor and major version numbers. Instead of doing that, we will simply skip them. Let's rename the printVersion method to skipVersion, and we'll change its implementation to the following:

private void skipVersion() throws IOException {
  check(4);

  idx += 4;
}

In the method, we:

  • Check if the class file has at least 4 more bytes available, throwing if not.

  • Advance the idx instance variable so it points to the byte immediately after the class file version.

10: Quiet Constant Pool Traversal

Our scanner, while traversing and indexing the constant pool, currently prints the entry number and the constant kind name. Let's suppress this output. We'll change the switch statement in the readConstantPoolEntry method to the following:

switch (tag) {
  case CONSTANT_Utf8 -> { int l = readU2(); idx += l; }
  case CONSTANT_Integer -> { idx += 4; }
  case CONSTANT_Float -> { idx += 4; }
  case CONSTANT_Long -> { idx += 8; entry++; }
  ...
  default -> throw new IOException("Unknown constant pool tag=" + tag);
}

We've removed the p method and all of its invocations found in the switch statement. Next, let's start writing the string literal visiting code.

11: The processStringLiterals Method

In the traverseConstantPool method, we iterate through each entry in the constant pool. On every iteration, we record the starting index of the entry in the constantPoolIndex array. Once the method finishes, this array maps each constant pool entry number to its starting position within the class file.

To visit and process all string literals in the constant pool, we write the following processStringLiterals method:

private void processStringLiterals() throws IOException {
  for (int entry = 1; entry < constantPoolIndex.length; entry++) {
    int index;
    index = constantPoolIndex[entry];

    byte tag;
    tag = data[index];

    if (tag != CONSTANT_String) {
      continue;
    }

    processStringLiteral(index);
  }
}

private void processStringLiteral(int index) throws IOException {
  ...
}

Here's a breakdown:

  1. The method consists of a single for loop, which iterates over the components of the constantPoolIndex array.

  2. We initialize the entry variable to 1, as the constant pool uses 1-based indexing.

  3. Inside the loop, we retrieve the starting position of each entry from the constantPoolIndex array and store it in the index local variable.

  4. Using the index value, we read the corresponding tag byte from the data array.

  5. If the tag value does not match CONSTANT_String, we skip to the next constant pool entry.

  6. If the tag indicates a string literal, we call processStringLiteral with the current index value to handle it.
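
The loop described above can be tried in isolation. The following is a minimal sketch that runs on a hand-built constant pool fragment. The class name StringEntryFilterExample, the stringEntries helper, and the sample bytes are hypothetical; they are not part of our scanner. Instead of processing each string literal, the sketch simply collects the entry numbers of the String entries it finds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, standalone illustration; not part of the scanner.
class StringEntryFilterExample {

  static final byte CONSTANT_String = 8;

  // Returns the entry numbers of all String entries, using the same
  // loop shape as processStringLiterals.
  static List<Integer> stringEntries(byte[] data, int[] constantPoolIndex) {
    List<Integer> result = new ArrayList<>();

    // entry starts at 1, as the constant pool uses 1-based indexing
    for (int entry = 1; entry < constantPoolIndex.length; entry++) {
      int index = constantPoolIndex[entry];

      byte tag = data[index];

      if (tag != CONSTANT_String) {
        continue;
      }

      result.add(entry);
    }

    return result;
  }

  public static void main(String[] args) {
    // A hand-built fragment holding three constant pool entries
    byte[] data = {
      0x01, 0x00, 0x02, 0x48, 0x69, // #1 Utf8 "Hi"       (starts at 0)
      0x08, 0x00, 0x01,             // #2 String -> #1    (starts at 5)
      0x03, 0x00, 0x00, 0x00, 0x2A  // #3 Integer 42      (starts at 8)
    };

    // Component 0 is unused; entry numbers start at 1
    int[] constantPoolIndex = {0, 0, 5, 8};

    System.out.println(stringEntries(data, constantPoolIndex)); // prints [2]
  }
}
```

Only entry #2 carries the CONSTANT_String tag, so it is the only entry number collected.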

The CONSTANT_String_info Structure

The structure for the String constant pool entry is defined in Section 4.4.3 of the JVMS:

CONSTANT_String_info {
    u1 tag;
    u2 string_index;
}

It consists of the 1-byte tag value followed by a u2 value named string_index. The latter represents the entry number of a Utf8 entry in the same constant pool. The Utf8 entry stores the string literal contents.
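
To make the layout concrete, here is a small sketch that decodes a hand-built String entry. The class name StringInfoExample, the stringIndex helper, and the sample bytes are hypothetical and serve only as an illustration; the only fact taken from the JVMS is that the String tag value is 8 and that string_index is stored in big-endian order.

```java
// Hypothetical, standalone illustration; not part of the scanner.
class StringInfoExample {

  static final byte CONSTANT_String = 8;

  // Returns the string_index of the CONSTANT_String_info starting at index.
  static int stringIndex(byte[] data, int index) {
    if (data[index] != CONSTANT_String) {
      throw new IllegalArgumentException("Not a String entry");
    }

    // u2 values are big-endian: high byte first, then low byte.
    // Masking with 0xFF undoes the sign extension of Java's signed bytes.
    int hi = data[index + 1] & 0xFF;
    int lo = data[index + 2] & 0xFF;

    return (hi << 8) | lo;
  }

  public static void main(String[] args) {
    // tag = 8, string_index = 0x000D
    byte[] entry = {0x08, 0x00, 0x0D};

    System.out.println(stringIndex(entry, 0)); // prints 13
  }
}
```

In this example, the string literal contents would be found in the Utf8 entry number 13 of the same constant pool.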

12: The processStringLiteral Method

Let's translate the CONSTANT_String_info structure into a Java method. We write the processStringLiteral method as follows:

private void processStringLiteral(int index) throws IOException {
  int stringIndex;
  stringIndex = readU2(index + 1);

  String utf8;
  utf8 = readUtf8(stringIndex);

  System.out.printf("  %s%n", utf8);
}

private String readUtf8(int entryNumber) throws IOException {
  ...
}

Here's a breakdown:

  1. The method takes an index parameter, whose value is equal to the position of the tag value of our String constant pool entry.

  2. We begin the method by reading the u2 value that follows the tag. We use a modified readU2 method, which we'll see in more detail in a bit.

  3. This u2 value represents the number of a Utf8 entry in the same constant pool. We use this entry number to call the readUtf8 method.

  4. The readUtf8 method is not implemented yet, but it should return the decoded Utf8 entry value as a String object.

  5. We print the string literal value.

13: Modified readU2 Method

For completeness, here's the modified readU2 we used in the previous section:

private int readU2() throws IOException {
  check(2);

  int index = idx;

  idx += 2;

  return readU2(index);
}

private int readU2(int index) {
  byte b0 = data[index + 0];
  byte b1 = data[index + 1];

  int v0 = toInt(b0, 8);
  int v1 = toInt(b1, 0);

  return v0 | v1;
}

The original readU2 method was refactored to delegate to the new readU2(int) overload: the part that combines the two bytes into an int value was moved into the new method.

The modified readU2 takes an index parameter, instead of relying on the idx instance variable. Additionally, it does not perform the data bounds check, as it is used to read a section of the data array which was already visited before.
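
As a quick sanity check, the index-based readU2 can be exercised on its own. The sketch below is a standalone illustration: the class name ReadU2Example is hypothetical, and the toInt helper shown here is an assumption about what the helper from the earlier posts does (mask the byte to an unsigned value, then shift it left).

```java
// Hypothetical, standalone illustration; not part of the scanner.
class ReadU2Example {

  // Assumption: equivalent to the toInt helper used by the scanner.
  // The 0xFF mask is needed because Java bytes are signed.
  static int toInt(byte b, int shift) {
    return (b & 0xFF) << shift;
  }

  // Reads a big-endian unsigned 16-bit value starting at index.
  // Like the modified readU2, it performs no bounds check of its own.
  static int readU2(byte[] data, int index) {
    byte b0 = data[index + 0];
    byte b1 = data[index + 1];

    int v0 = toInt(b0, 8);
    int v1 = toInt(b1, 0);

    return v0 | v1;
  }

  public static void main(String[] args) {
    byte[] data = {(byte) 0xCA, (byte) 0xFE, 0x01, 0x02};

    System.out.println(readU2(data, 0)); // 0xCAFE, prints 51966
    System.out.println(readU2(data, 2)); // 0x0102, prints 258
  }
}
```

Since the method takes the position as an argument, it can be called any number of times, in any order, without affecting where the scanner is currently reading from.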

The CONSTANT_Utf8_info Structure

In Section 4.4.7 of the JVMS we find the structure for the Utf8 constant pool entry:

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

The tag value is followed by a u2 value, which represents the number of bytes in the encoded string that follows.
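
To see what such an entry looks like as raw bytes, here is a minimal sketch that goes in the opposite direction of our scanner: it builds the bytes of a Utf8 entry from a string. The class name Utf8InfoExample and the encode helper are hypothetical. The sketch assumes the string contains ASCII characters only, excluding the NUL character, so the encoding question does not come into play.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Hypothetical, standalone illustration; not part of the scanner.
class Utf8InfoExample {

  static final byte CONSTANT_Utf8 = 1;

  // Builds the bytes of a CONSTANT_Utf8_info structure.
  // Assumption: the value is ASCII-only and contains no NUL character.
  static byte[] encode(String ascii) {
    byte[] bytes = ascii.getBytes(StandardCharsets.US_ASCII);

    byte[] entry = new byte[3 + bytes.length];

    // u1 tag
    entry[0] = CONSTANT_Utf8;

    // u2 length, in big-endian order
    entry[1] = (byte) (bytes.length >>> 8);
    entry[2] = (byte) bytes.length;

    // u1 bytes[length]
    System.arraycopy(bytes, 0, entry, 3, bytes.length);

    return entry;
  }

  public static void main(String[] args) {
    byte[] entry = encode("Hi");

    System.out.println(HexFormat.ofDelimiter(" ").formatHex(entry));
    // prints: 01 00 02 48 69
  }
}
```

The encoded string always starts 3 bytes after the tag: 1 byte for the tag itself plus 2 bytes for the length.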

14: The readUtf8 Method

We translate the information of the previous section into the readUtf8 method:

private String readUtf8(int entryNumber) throws IOException {
  int index;
  index = constantPoolIndex[entryNumber];

  byte tag;
  tag = data[index];

  if (tag != CONSTANT_Utf8) {
    throw new IOException("Malformed constant pool");
  }

  int length;
  length = readU2(index + 1);

  return new String(data, index + 3, length, StandardCharsets.UTF_8);
}

Here's a breakdown:

  1. The method accepts an entryNumber parameter, which is the entry number of our CONSTANT_Utf8_info structure.

  2. We begin the method by retrieving the starting position of the entry from the constantPoolIndex array.

  3. Next, we read the tag byte from the data array.

  4. We check if the tag value is equal to CONSTANT_Utf8; if not, we throw an IOException signaling a malformed constant pool.

  5. We then read the length value using the readU2 method.

  6. Finally, we create a new String directly from the data array, decoding length bytes starting at index + 3, that is, right after the tag and length values.

The last step has an issue which we'll see in a bit. In the meantime, let's test our implementation as it currently stands.

Testing Our Implementation

First, let's test our implementation against our HelloWorld.class:

$ java --enable-preview ClassFile4.java HelloWorld.class
ClassFile /tmp/HelloWorld.class
  Hello, World!

It found and printed a single string literal, our "Hello, World!" message. Next, let's try it on a class having more string literals. We write the following Java class:

public class StringLiterals {
  static final String A = "Static Field Init";

  final String b = "Instance Field Init";

  void m() {
    var s = "Local Variable";
    t("Method arg");
  }

  void t(String s) {}
}

And we run our implementation against its compiled class file:

$ java --enable-preview ClassFile4.java StringLiterals.class 
ClassFile /tmp/StringLiterals.class
  Instance Field Init
  Local Variable
  Method arg
  Static Field Init

Our implementation seems to be working. To be sure, let's test our implementation against some emoji.

Code Points Above U+FFFF

Our current implementation does not work with Unicode code points that are above U+FFFF. As an example, let's test it with the following Java class:

public class Emoji {
  String example = "Hi 😀";
}

It declares a string literal containing an emoji. Let's run our implementation against its compiled class file:

$ java --enable-preview ClassFile4.java Emoji.class 
ClassFile /tmp/Emoji.class
  Hi ��

That's not the correct value:

  • Utf8 constant pool entries encode the string value using a modified UTF-8 encoding.

  • Our implementation wrongly assumes the byte stream uses the standard UTF-8 encoding.
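
The difference between the two encodings can be observed without a class file. The sketch below is a standalone illustration with hypothetical names (ModifiedUtf8Example, standardUtf8, modifiedUtf8). It relies on DataOutputStream.writeUTF, which the JDK documents as writing a 2-byte length followed by the string in modified UTF-8; we assume here that this matches what the compiler stores in the Utf8 entry for this particular string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HexFormat;

// Hypothetical, standalone illustration; not part of the scanner.
class ModifiedUtf8Example {

  static byte[] standardUtf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    try (DataOutputStream out = new DataOutputStream(bytes)) {
      // writes a 2-byte length followed by the modified UTF-8 bytes
      out.writeUTF(s);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }

    byte[] all = bytes.toByteArray();

    // drop the 2-byte length prefix, keep only the encoded string
    return Arrays.copyOfRange(all, 2, all.length);
  }

  public static void main(String[] args) {
    // "Hi " followed by U+1F600, written as a surrogate pair
    String s = "Hi \uD83D\uDE00";

    HexFormat hex = HexFormat.ofDelimiter(" ");

    System.out.println(hex.formatHex(standardUtf8(s)));
    // prints: 48 69 20 f0 9f 98 80

    System.out.println(hex.formatHex(modifiedUtf8(s)));
    // prints: 48 69 20 ed a0 bd ed b8 80
  }
}
```

Standard UTF-8 encodes the emoji as a single 4-byte sequence, while the modified encoding writes each of the two surrogates as its own 3-byte sequence. A standard UTF-8 decoder rejects encoded surrogates, which would explain the replacement characters in our output.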

We should work on this issue next.

In the Next Blog Post in This Series

In this fourth blog post in this series, we learned that the String constant pool entry, apart from its tag value, has a 2-byte string_index value. The latter represents the entry number of a Utf8 entry in the same constant pool containing the actual string literal value. We also learned that a Utf8 constant pool entry stores the string value using a modified UTF-8 encoding.

In this blog post, we had our class file scanner visit all string literals defined in a Java class file. However, our scanner does not print the correct value if the string literal contains Unicode code points above U+FFFF. So, in the next blog post in this series, we'll have our class file scanner correctly decode all Utf8 string values found in a Java class file.

You can find the source code used in this blog post in this Gist.