Scanning Java Class Files #5: Modified UTF-8

Marcio Endo · Apr 13, 2025

In the previous post in this series, we had our Java class file scanner print the value of all the string literals found in a Java class file. But it has a bug that prevents emoji characters from being printed correctly.

In this fifth blog post in this series, we'll fix our class file scanner so it correctly prints all string values. In doing so, we'll discuss the UTF-32 and UTF-16 character encodings used by the String class; and the modified UTF-8 character encoding used by Java class files.

Java Strings and Unicode

In Java, instances of the String class represent sequences of Unicode characters. So, in a Java source file, string literals may contain any character defined in the Unicode standard. But what exactly are Unicode characters?

Unicode Code Points

In the Unicode Standard, characters are assigned to integer numbers called code points. So we call a Unicode character a Unicode code point, or simply a code point. A code point is written in the U+CAFE notation, i.e., the U+ prefix followed by the hexadecimal representation of the code point value. For example:

  • The Latin Capital Letter A is assigned to the U+0041 code point.

  • The Japanese Hiragana letter あ is assigned to the U+3042 code point.

  • The "Grinning Face" emoji ๐Ÿ˜€ is assigned to the U+1F600 code point.

Unicode Codespace

The range of all code points is called the codespace and consists of the integers from U+0000 to U+10FFFF. In decimal notation, the range is from 0 to 1,114,111.

Not all code points are assigned. In fact, just over 24% of the code points are assigned by the standard, and an even smaller percentage, just over 13% of the code points, are assigned to a graphic character.

Regardless, in order to be compatible with Unicode, a system must support all values in the Unicode Codespace.

Unicode Character Encoding

A Java class file is a stream of u1, u2 and u4 values. The u1, u2 and u4 types represent unsigned 8-bit, 16-bit and 32-bit quantities, respectively.

So, to store the value of a string literal in a class file, we need to come up with a function that maps a Unicode code point to one or more u1, u2 or u4 values. In other words, we need to come up with a function that encodes a code point into what's formally called a code unit.

UTF-32

The highest Unicode code point is U+10FFFF which requires 21 bits. The smallest Java integer data type that can encode 21 bits is the 32-bit int primitive type. So we can use int values to encode all code points in the codespace.

For example, the String::codePoints method returns an IntStream where each int value directly maps to a code point.
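A short sketch of this distinction: a string holding a single emoji reports two UTF-16 chars, but its codePoints stream yields one int per code point.

```java
public class Utf32View {
    public static void main(String[] args) {
        String s = "😀"; // one code point, U+1F600

        System.out.println(s.length());             // 2: number of UTF-16 code units
        System.out.println(s.codePoints().count()); // 1: number of code points

        // Each int in the stream directly maps to a code point.
        s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}
```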

This encoding form is formally called UTF-32. The UTF acronym stands for Unicode Transformation Format, and the number 32 refers to the size in bits of the smallest quantity used in the encoding.

UTF-32 is useful when processing strings in memory. But UTF-32 is wasteful when used to store string values in a class file. In applications in the English language, for example, the majority of code points will require 8 bits. In applications in other languages, the majority of code points will typically require 16 bits at most.

UTF-16

If most code points fit in 16 bits at most, why not use this quantity to encode characters? This is the idea behind the UTF-16 format.

In Java, the char primitive type represents a UTF-16 code unit. A char value represents a code point from U+0000 to U+FFFF. Interestingly, code points from U+D800 to U+DFFF are reserved. They do not represent any graphic character and, instead, they are used exclusively by the UTF-16 encoding.

To encode code points above U+FFFF, UTF-16 uses a pair of code units from that reserved range. First, the offset of the code point from U+10000 is computed, resulting in a 20-bit number from 0x00000 to 0xFFFFF. The value is then split into two 10-bit parts:

  • The top 10 bits are added to U+D800 resulting in a code unit from U+D800 to U+DBFF called High Surrogate.

  • The bottom 10 bits are added to U+DC00 resulting in a code unit from U+DC00 to U+DFFF called Low Surrogate.
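The steps above can be cross-checked against the standard library's Character.highSurrogate and Character.lowSurrogate methods; a sketch using U+1F600 as the example:

```java
public class Surrogates {
    public static void main(String[] args) {
        int cp = 0x1F600;          // "Grinning Face"
        int offset = cp - 0x10000; // 20-bit offset: 0x0F600

        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits

        System.out.printf("%04X %04X%n", (int) high, (int) low); // D83D DE00

        // Cross-check with the standard library:
        System.out.println(high == Character.highSurrogate(cp)); // true
        System.out.println(low == Character.lowSurrogate(cp));   // true
    }
}
```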

UTF-16 is an improvement over UTF-32 in terms of required storage space, but it's wasteful when string literals are mostly in the English language.

UTF-8

I/O APIs typically work with 8-bit byte quantities. In Java, for example, the InputStream and OutputStream classes work by reading and writing byte sequences. So, apart from storage requirements, it'd be nice to have a byte-oriented encoding form. This is the idea behind the UTF-8 format.

In UTF-8, each code point is mapped to a sequence of one to four 8-bit code units, like so:

Code Point                  | Byte 1   | Byte 2   | Byte 3   | Byte 4
0yyyzzzz                    | 0yyyzzzz |          |          |
00000xxx yyyyzzzz           | 110xxxyy | 10yyzzzz |          |
wwwwxxxx yyyyzzzz           | 1110wwww | 10xxxxyy | 10yyzzzz |
000uvvvv wwwwxxxx yyyyzzzz  | 11110uvv | 10vvwwww | 10xxxxyy | 10yyzzzz

In other words:

  • Code points from U+0000 to U+007F use the 1-byte form.

  • Code points from U+0080 to U+07FF use the 2-byte form.

  • Code points from U+0800 to U+FFFF use the 3-byte form.

  • Code points from U+10000 to U+10FFFF use the 4-byte form.
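These ranges can be confirmed by encoding one sample character from each range with the standard UTF-8 charset; a minimal sketch (only the byte counts matter here):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // One sample character from each range.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 (U+0041)
        System.out.println("¡".getBytes(StandardCharsets.UTF_8).length);  // 2 (U+00A1)
        System.out.println("あ".getBytes(StandardCharsets.UTF_8).length); // 3 (U+3042)
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1F600)
    }
}
```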

Therefore, UTF-8 will be worse than UTF-16 in terms of storage requirements if most code points in an application fall in the U+0800 to U+FFFF range. On the other hand, it is a byte-oriented encoding form, and it is compatible with ASCII, meaning that an ASCII-encoded string is also a UTF-8-encoded string.

A variation of the UTF-8 encoding form is used by Java class files to store string literal values.

Modified UTF-8 (U+0000 to U+FFFF)

When Java was first introduced, Unicode code points were 16 bits in size. In other words, at that time, valid code points were restricted to the U+0000 to U+FFFF range.

Shortly after Java 1.0 was released, version 2.0 of the Unicode standard was released, extending the number of code points to the current U+0000 to U+10FFFF range. Full support for all Unicode code points would be introduced in Java 5.0.

So, prior to the Java 5.0 release, only code points from U+0000 to U+FFFF could be encoded. As mentioned, a form of UTF-8 was chosen to encode the string values in the Java class file. It works like the regular UTF-8 encoding shown in the previous section, with one modification:

  • The U+0000 code point is encoded using the 2-byte form.

It means that, in a Java class file, an encoded string value will never contain the null byte.
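This modification is observable outside of class files too: DataOutputStream::writeUTF also emits modified UTF-8, prefixed by a u2 length. A sketch showing that U+0000 becomes the two bytes 0xC0 0x80:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NullByteForm {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        new DataOutputStream(bytes).writeUTF("\0");

        byte[] result = bytes.toByteArray();

        // First two bytes are the u2 length (2), then the 2-byte form of U+0000.
        for (byte b : result) {
            System.out.printf("%02X ", Byte.toUnsignedInt(b));
        }
        System.out.println(); // prints: 00 02 C0 80
    }
}
```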

Modified UTF-8 (U+10000 to U+10FFFF)

Starting with Java 5.0, support for Unicode code points in the range U+10000 to U+10FFFF was introduced. This required Java class files to handle these code points, while maintaining compatibility with existing class files.

For code points up to U+FFFF, the encoding remained unchanged. However, for code points above U+FFFF, it was decided to deviate from regular UTF-8. Instead of using the 4-byte form:

  • First, the code point is encoded to the UTF-16 format, resulting in two surrogate code units.

  • Next, each surrogate is individually encoded using the UTF-8 3-byte form, resulting in 6 bytes total for each code point above U+FFFF.

The following table illustrates the encoding of the two surrogate code units:

Surrogate          | Byte 1   | Byte 2   | Byte 3
110110xx yyyyzzzz  | 11101101 | 1010xxyy | 10yyzzzz
110111xx yyyyzzzz  | 11101101 | 1011xxyy | 10yyzzzz

The first row shows the high surrogate while the second row shows the low surrogate.
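Since DataOutputStream::writeUTF also emits modified UTF-8 (prefixed by a u2 length), the 6-byte form can be observed directly; a sketch using U+1F600:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SixByteForm {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        new DataOutputStream(bytes).writeUTF("😀"); // U+1F600

        byte[] result = bytes.toByteArray();

        // u2 length prefix (6) followed by the two encoded surrogates,
        // D83D and DE00, each in the UTF-8 3-byte form.
        for (byte b : result) {
            System.out.printf("%02X ", Byte.toUnsignedInt(b));
        }
        System.out.println(); // prints: 00 06 ED A0 BD ED B8 80
    }
}
```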

Decoding Modified UTF-8

Now that we know how string literals are encoded in the class file, let's work on our scanner so it properly decodes all Utf8 string values.

15: The Updated readUtf8 Method

First, we'll update our current readUtf8 method implementation to the following:

private String readUtf8(int entryNumber) throws IOException {
  int entryIndex;
  entryIndex = constantPoolIndex[entryNumber];

  byte tag;
  tag = data[entryIndex];

  if (tag != CONSTANT_Utf8) {
    throw new IOException("Malformed constant pool");
  }

  int length;
  length = readU2(entryIndex + 1);

  int startIndex;
  startIndex = entryIndex + 3;

  return decodeUtf8(startIndex, length);
}

So, instead of returning a new String instance assuming a regular UTF-8 encoding, we call the new decodeUtf8 method.

16: The decodeUtf8 Method

Our decodeUtf8 method takes two int parameters:

  • The startIndex parameter represents the index of the first byte of the encoded string value.

  • The length parameter represents the number of bytes of the encoded string value.

private String decodeUtf8(int startIndex, int length) throws IOException {
  int index;
  index = startIndex;

  int endIndex;
  endIndex = startIndex + length;

  StringBuilder out;
  out = new StringBuilder(length);

  while (index < endIndex) {
    // decoding loop
  }

  return out.toString();
}

Here's a breakdown:

  • First, we declare the index and endIndex variables to iterate over the bytes of the encoded string.

  • Next, we create a StringBuilder instance to hold the decoded string value. We initialize it with length, the largest possible decoded string length, which occurs when all characters are encoded using the 1-byte form. The string will be shorter if it contains any character encoded with more than 1 byte.

  • Then, we enter the while loop responsible for the actual decoding process. We'll discuss the loop in the next section.

  • Finally, after we've decoded all characters, we return the resulting string.

17: The Decoding Loop

In the decoding loop we iterate over all bytes of the encoded string value. Here's the full listing:

while (index < endIndex) {
  byte byte0;
  byte0 = data[index++];

  int highBits;
  highBits = Byte.toUnsignedInt(byte0) >> 4;

  char c;
  c = switch (highBits) {
    // 0yyyzzzz
    case 0b0000, 0b0001,
         0b0010, 0b0011,
         0b0100, 0b0101, 0b0110, 0b0111 -> decode(byte0);

    // 110xxxyy 10yyzzzz
    case 0b1100, 0b1101 -> decode(byte0, read(index++));

    // 1110wwww 10xxxxyy 10yyzzzz
    case 0b1110 -> decode(byte0, read(index++), read(index++));

    default -> throw new IOException("Malformed UTF-8 value");
  };

  out.append(c);
}

In the loop:

  • We begin by getting the four highest bits of the current byte and assigning them to the highBits variable.

  • If highBits matches the 0yyy bit pattern, we decode a 1-byte sequence.

  • If highBits matches the 110x bit pattern, we decode a 2-byte sequence.

  • If highBits is exactly 1110, we decode a 3-byte sequence.

  • If the four highest bits don't match any of the previous three cases, then we have an invalid encoded sequence.

  • At the end of the loop, we append the decoded char value to our StringBuilder instance.

18: The read Method

For completeness, here's the read method we've used in the previous section:

private byte read(int index) throws IOException {
  try {
    return data[index];
  } catch (ArrayIndexOutOfBoundsException e) {
    throw new IOException("Invalid class file: Utf8 value", e);
  }
}

We assume the index is valid. If an ArrayIndexOutOfBoundsException is thrown, we rethrow it wrapped in an IOException, indicating an invalid class file.

19: The decode Method: 1-byte form

The method decodes a UTF-8 1-byte sequence:

private char decode(byte byte0) {
  return (char) byte0;
}

In the UTF-8 1-byte form, the code unit maps directly to a code point.

20: The decode Method: 2-byte form

The two bytes in the 2-byte sequence conform to the following bit patterns:

  • 110xxxyy

  • 10yyzzzz

Here's the method that decodes it:

private char decode(byte byte0, byte byte1) throws IOException {
  checkUtf8(byte1);

  int bits0 = (byte0 & 0b0001_1111) << 6;
  int bits1 = (byte1 & 0b0011_1111) << 0;

  return (char) (bits0 | bits1);
}

private void checkUtf8(byte b) throws IOException {
  int topTwoBits = b & 0b1100_0000;

  if (topTwoBits != 0b1000_0000) {
    throw new IOException("Malformed UTF-8 value");
  }
}

The checkUtf8 utility method verifies that the bit pattern of the second encoded byte conforms to 10yyzzzz.
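As a quick sanity check of the bit arithmetic (a standalone sketch, not part of the scanner), decoding the 2-byte sequence 0xC2 0xA1 should yield '¡' (U+00A1):

```java
public class TwoByteDecode {
    public static void main(String[] args) {
        byte byte0 = (byte) 0xC2; // 110xxxyy = 11000010
        byte byte1 = (byte) 0xA1; // 10yyzzzz = 10100001

        int bits0 = (byte0 & 0b0001_1111) << 6; // keep the low 5 bits
        int bits1 = byte1 & 0b0011_1111;        // keep the low 6 bits

        char c = (char) (bits0 | bits1);

        System.out.printf("U+%04X%n", (int) c); // U+00A1
    }
}
```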

21: The decode Method: 3-byte form

The three bytes in the 3-byte sequence conform to the following bit patterns:

  • 1110wwww

  • 10xxxxyy

  • 10yyzzzz

And here's the method that decodes it:

private char decode(byte byte0, byte byte1, byte byte2) throws IOException {
  checkUtf8(byte1);
  checkUtf8(byte2);

  int bits0 = (byte0 & 0b0000_1111) << 12;
  int bits1 = (byte1 & 0b0011_1111) << 6;
  int bits2 = (byte2 & 0b0011_1111) << 0;

  return (char) (bits0 | bits1 | bits2);
}

It uses the same checkUtf8 utility method defined earlier.
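Again as a standalone sanity check of the bit arithmetic, the 3-byte sequence 0xE3 0x81 0x82 should decode to 'あ' (U+3042):

```java
public class ThreeByteDecode {
    public static void main(String[] args) {
        byte byte0 = (byte) 0xE3; // 1110wwww = 11100011
        byte byte1 = (byte) 0x81; // 10xxxxyy = 10000001
        byte byte2 = (byte) 0x82; // 10yyzzzz = 10000010

        int bits0 = (byte0 & 0b0000_1111) << 12; // keep the low 4 bits
        int bits1 = (byte1 & 0b0011_1111) << 6;  // keep the low 6 bits
        int bits2 = byte2 & 0b0011_1111;         // keep the low 6 bits

        char c = (char) (bits0 | bits1 | bits2);

        System.out.printf("U+%04X%n", (int) c); // U+3042
    }
}
```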

Testing the Current Iteration

To test the current iteration of our scanner, we create the following Java class:

public class ModifiedUtf8 {
  String byte1Form = " !\"#$%&'()*+,-./0123456789:;<=>?@" +
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";

  String byte2Form = "\0¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ";

  String byte3Form = "一二三四五六七八九十";

  String byte6Form = "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓";
}

The class declares four string values:

  • The first string contains characters that, when encoded in the class file, will use the Modified UTF-8 1-byte form only.

  • The characters in the second string will use the Modified UTF-8 2-byte form only.

  • The characters in the third string will use the Modified UTF-8 3-byte form only.

  • The characters in the fourth string will use the Modified UTF-8 6-byte form only.

When executed, our program prints:

ClassFile /tmp/ModifiedUtf8.class
   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  \u0000¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ
  一二三四五六七八九十
  😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓

It works.

I should note that the processStringLiteral method was modified so that any character for which Character::isISOControl returns true is printed in its Unicode escape sequence representation.

Conclusion

In this fifth blog post in this series, we learned about the different Unicode Transformation Formats. We also learned that Java class files use a modified UTF-8 encoding. It differs from regular UTF-8 in two ways:

  • First, it uses the 2-byte form to encode the U+0000 code point, whereas regular UTF-8 uses the 1-byte form.

  • Second, code points above U+FFFF are first encoded to UTF-16, and the two resulting surrogate code units are each encoded using the UTF-8 3-byte form.

Finally, we fixed our Java class file scanner so it correctly decodes all string literal values.

You can find the source code used in this blog post in this Gist.