Scanning Java Class Files #5: Modified UTF-8

In the previous post in this series, we had our Java class file scanner print the value of all the string literals found in a Java class file. But it has a bug that prevents emoji characters from being printed correctly.
In this fifth blog post in the series, we'll fix our class file scanner so it correctly prints all string values. In doing so, we'll discuss the UTF-32 and UTF-16 character encodings used by the `String` class, and the modified UTF-8 character encoding used by Java class files.
Java Strings and Unicode
In Java, instances of the `String` class represent sequences of Unicode characters. So, in a Java source file, string literals may contain any character defined in the Unicode Standard.
But what exactly are Unicode characters?
Unicode Code Points
In the Unicode Standard, characters are assigned to integer numbers called code points. So we call a Unicode character a Unicode code point, or simply a code point. A code point is written in the `U+CAFE` notation, i.e., the `U+` prefix followed by the hexadecimal representation of the code point value.
For example:

- The Latin Capital Letter A is assigned to the `U+0041` code point.
- The Japanese Hiragana letter あ is assigned to the `U+3042` code point.
- The "Grinning Face" emoji 😀 is assigned to the `U+1F600` code point.
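As a quick check, we can query these assignments with the standard `String::codePointAt` method. A small sketch (the class name is just for illustration):

```java
public class CodePoints {
    public static void main(String[] args) {
        // The Latin Capital Letter A
        System.out.println("A".codePointAt(0) == 0x41);      // true
        // The Japanese Hiragana letter A
        System.out.println("あ".codePointAt(0) == 0x3042);   // true
        // The "Grinning Face" emoji lies above U+FFFF, yet
        // codePointAt still returns the full code point value
        System.out.println("😀".codePointAt(0) == 0x1F600);  // true
    }
}
```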
Unicode Codespace
The range of all code points is called the codespace and consists of the integers from `U+0000` to `U+10FFFF`. In decimal notation, the range is from 0 to 1,114,111.
Not all code points are assigned. In fact, just over 24% of the code points are assigned by the standard, and an even smaller percentage, just over 13% of the code points, are assigned to a graphic character.
Regardless, in order to be compatible with Unicode, a system must support all values in the Unicode Codespace.
Unicode Character Encoding
A Java class file is a stream of `u1`, `u2` and `u4` values. The `u1`, `u2` and `u4` types represent unsigned 8-bit, 16-bit and 32-bit quantities, respectively. So, to store the value of a string literal in a class file, we need a function that maps a Unicode code point to one or more `u1`, `u2` or `u4` values. In other words, we need a function that encodes a code point into what are formally called code units.
UTF-32
The highest Unicode code point is `U+10FFFF`, which requires 21 bits. The smallest Java integer data type that can hold 21 bits is the 32-bit `int` primitive type. So we can use `int` values to encode all code points in the codespace. For example, the `String::codePoints` method returns an `IntStream` where each `int` value directly maps to a code point.
This encoding form is formally called UTF-32. The UTF acronym stands for Unicode Transformation Format, and the number 32 refers to the size in bits of the smallest quantity used in the encoding.
UTF-32 is useful when processing strings in memory. But it is wasteful when used to store string values in a class file. In English-language applications, for example, the majority of code points require only 8 bits. In applications in other languages, the majority of code points typically require 16 bits at most.
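Here's a minimal sketch of this UTF-32 view of a string, using `String::codePoints` (the class name is illustrative):

```java
import java.util.Arrays;

public class Utf32View {
    public static void main(String[] args) {
        String s = "A😀";
        // length() counts UTF-16 code units: one for 'A', two for the emoji
        System.out.println(s.length()); // 3
        // codePoints() yields one int per code point: effectively a UTF-32 view
        int[] codePoints = s.codePoints().toArray();
        System.out.println(Arrays.toString(codePoints)); // [65, 128512]
    }
}
```

Note that `128512` is the decimal value of `U+1F600`.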
UTF-16
If most code points fit in 16 bits, why not use that quantity to encode characters? This is the idea behind the UTF-16 format.
In Java, the `char` primitive type represents a UTF-16 code unit. A `char` value represents a code point from `U+0000` to `U+FFFF`. Interestingly, code points from `U+D800` to `U+DFFF` are reserved. They do not represent any graphic character; instead, they are used exclusively by the UTF-16 encoding.
To encode code points above `U+FFFF`, UTF-16 uses a pair of code units from that reserved range. First, the offset of the code point from `U+10000` is computed, resulting in a 20-bit number from `0x00000` to `0xFFFFF`. The value is then split into two 10-bit parts:

- The top 10 bits are added to `U+D800`, resulting in a code unit from `U+D800` to `U+DBFF` called the high surrogate.
- The bottom 10 bits are added to `U+DC00`, resulting in a code unit from `U+DC00` to `U+DFFF` called the low surrogate.
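The steps above can be sketched in a few lines of Java. The standard `Character::highSurrogate` and `Character::lowSurrogate` methods compute the same pair, so we can check our arithmetic against them (class and variable names are illustrative):

```java
public class SurrogatePair {
    public static void main(String[] args) {
        int codePoint = 0x1F600; // the "Grinning Face" emoji
        // 1) compute the offset from U+10000: a 20-bit number
        int offset = codePoint - 0x10000;
        // 2) add the top 10 bits to U+D800: the high surrogate
        char high = (char) (0xD800 + (offset >>> 10));
        // 3) add the bottom 10 bits to U+DC00: the low surrogate
        char low = (char) (0xDC00 + (offset & 0x3FF));
        System.out.printf("U+%04X U+%04X%n", (int) high, (int) low); // U+D83D U+DE00
        // the standard library computes the same pair
        System.out.println(high == Character.highSurrogate(codePoint)); // true
        System.out.println(low == Character.lowSurrogate(codePoint));   // true
    }
}
```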
UTF-16 is an improvement over UTF-32 in terms of required storage space, but it is still wasteful when string literals are mostly in English.
UTF-8
I/O APIs typically work with 8-bit `byte` quantities. In Java, for example, the `InputStream` and `OutputStream` classes work by reading and writing `byte` sequences. So, apart from storage requirements, it'd be nice to have a byte-oriented encoding form. This is the idea behind the UTF-8 format.
In UTF-8, each code point is mapped to a sequence of one to four 8-bit code units, like so:

| Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| `0yyyzzzz` | `0yyyzzzz` | | | |
| `00000xxx yyyyzzzz` | `110xxxyy` | `10yyzzzz` | | |
| `wwwwxxxx yyyyzzzz` | `1110wwww` | `10xxxxyy` | `10yyzzzz` | |
| `000uvvvv wwwwxxxx yyyyzzzz` | `11110uvv` | `10vvwwww` | `10xxxxyy` | `10yyzzzz` |
In other words:

- Code points from `U+0000` to `U+007F` use the 1-byte form.
- Code points from `U+0080` to `U+07FF` use the 2-byte form.
- Code points from `U+0800` to `U+FFFF` use the 3-byte form.
- Code points from `U+10000` to `U+10FFFF` use the 4-byte form.
Therefore, UTF-8 will be worse than UTF-16 in terms of storage requirements if most code points in an application fall in the `U+0800` to `U+FFFF` range. On the other hand, it is a byte-oriented encoding form, and it is compatible with ASCII, meaning that an ASCII-encoded string is also a valid UTF-8-encoded string.
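To see the four length classes in action, we can encode one sample character from each range with `String::getBytes` (a small sketch; the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // one sample character from each UTF-8 length class
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 (U+0041)
        System.out.println("µ".getBytes(StandardCharsets.UTF_8).length);  // 2 (U+00B5)
        System.out.println("あ".getBytes(StandardCharsets.UTF_8).length); // 3 (U+3042)
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1F600)
    }
}
```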
A variation of the UTF-8 encoding form is used by Java class files to store string literal values.
Modified UTF-8 (U+0000 to U+FFFF)
When Java was first introduced, Unicode code points were 16 bits in size. In other words, at that time, valid code points were restricted to the `U+0000` to `U+FFFF` range. Shortly after Java 1.0 was released, version 2.0 of the Unicode Standard extended the number of code points to the current `U+0000` to `U+10FFFF` range. Full support for all Unicode code points would only be introduced in Java 5.0. So, prior to the Java 5.0 release, only code points from `U+0000` to `U+FFFF` could be encoded.
As mentioned, a form of UTF-8 was chosen to encode string values in the Java class file. It works like the regular UTF-8 encoding shown in the previous section, with one modification:

- The `U+0000` code point is encoded using the 2-byte form.

This means that, in a Java class file, an encoded string value never contains the null byte.
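We can observe this with `DataOutputStream::writeUTF`, which is documented to use this same modified UTF-8 encoding (a small sketch; the class name is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NullByteDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("\0"); // a string containing only U+0000
        }
        // writeUTF emits a u2 byte length followed by the encoded bytes
        for (byte b : bytes.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        // prints: 00 02 C0 80
        // two bytes of length, then the 2-byte form of U+0000: no null byte
    }
}
```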
Modified UTF-8 (U+10000 to U+10FFFF)
Starting with Java 5.0, support for Unicode code points in the range `U+10000` to `U+10FFFF` was introduced. This required Java class files to handle these code points while maintaining compatibility with existing class files. For code points up to `U+FFFF`, the encoding remained unchanged. For code points above `U+FFFF`, however, it was decided to deviate from regular UTF-8. Instead of using the 4-byte form:

- First, the code point is encoded in UTF-16, resulting in two surrogate code units.
- Next, each surrogate is individually encoded using the UTF-8 3-byte form, resulting in a total of 6 bytes for each code point above `U+FFFF`.
The following table illustrates the encoding of the two surrogate code units:

| Surrogate | Byte 1 | Byte 2 | Byte 3 |
|---|---|---|---|
| `110110xx yyyyzzzz` | `11101101` | `1010xxyy` | `10yyzzzz` |
| `110111xx yyyyzzzz` | `11101101` | `1011xxyy` | `10yyzzzz` |
The first row shows the high surrogate while the second row shows the low surrogate.
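We can verify the 6-byte encoding with `DataOutputStream::writeUTF`, which uses the same modified UTF-8 format (a sketch; the class name is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SixByteDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF("😀"); // U+1F600, i.e., the pair U+D83D U+DE00
        }
        byte[] encoded = bytes.toByteArray();
        // skip the u2 length prefix (00 06) and print the six encoded bytes
        for (int i = 2; i < encoded.length; i++) {
            System.out.printf("%02X ", encoded[i] & 0xFF);
        }
        // prints: ED A0 BD ED B8 80
        // ED A0 BD is the 3-byte form of the high surrogate U+D83D
        // ED B8 80 is the 3-byte form of the low surrogate U+DE00
    }
}
```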
Decoding Modified UTF-8
Now that we know how string literals are encoded in the class file, let's update our scanner so it properly decodes all `Utf8` string values.
15: The Updated readUtf8 Method
First, we'll update our current `readUtf8` method implementation to the following:

```java
private String readUtf8(int entryNumber) throws IOException {
    int entryIndex;
    entryIndex = constantPoolIndex[entryNumber];

    byte tag;
    tag = data[entryIndex];

    if (tag != CONSTANT_Utf8) {
        throw new IOException("Malformed constant pool");
    }

    int length;
    length = readU2(entryIndex + 1);

    int startIndex;
    startIndex = entryIndex + 3;

    return decodeUtf8(startIndex, length);
}
```
So, instead of returning a new `String` instance assuming a regular UTF-8 encoding, we call the new `decodeUtf8` method.
16: The decodeUtf8 Method
Our `decodeUtf8` method takes two `int` parameters:

- The `startIndex` parameter represents the index of the first byte of the encoded string value.
- The `length` parameter represents the number of bytes of the encoded string value.

```java
private String decodeUtf8(int startIndex, int length) throws IOException {
    int index;
    index = startIndex;

    int endIndex;
    endIndex = startIndex + length;

    StringBuilder out;
    out = new StringBuilder(length);

    while (index < endIndex) {
        // decoding loop
    }

    return out.toString();
}
```
Here's a breakdown:

- First, we declare the `index` and `endIndex` variables to iterate over the bytes of the encoded string.
- Next, we create a `StringBuilder` instance to hold the decoded string value. We initialize it with `length`, the largest possible decoded string length, reached when all characters are encoded using the 1-byte form. The string will be shorter if it contains any character encoded with more than one byte.
- Then, we enter the `while` loop responsible for the actual decoding process. We'll discuss the loop in the next section.
- Finally, after we've decoded all characters, we return the resulting string.
17: The Decoding Loop
In the decoding loop we iterate over all bytes of the encoded string value. Here's the full listing:

```java
while (index < endIndex) {
    byte byte0;
    byte0 = data[index++];

    int highBits;
    highBits = Byte.toUnsignedInt(byte0) >> 4;

    char c;
    c = switch (highBits) {
        // 0yyyzzzz
        case 0b0000, 0b0001, 0b0010, 0b0011,
             0b0100, 0b0101, 0b0110, 0b0111 -> decode(byte0);

        // 110xxxyy 10yyzzzz
        case 0b1100, 0b1101 -> decode(byte0, read(index++));

        // 1110wwww 10xxxxyy 10yyzzzz
        case 0b1110 -> decode(byte0, read(index++), read(index++));

        default -> throw new IOException("Malformed UTF-8 value");
    };

    out.append(c);
}
```
In the loop:

- We begin by getting the four highest bits of the current byte, and we assign them to the `highBits` variable.
- If `highBits` matches the `0yyy` bit pattern, we decode a 1-byte sequence.
- If `highBits` matches the `110x` bit pattern, we decode a 2-byte sequence.
- If `highBits` is exactly `1110`, we decode a 3-byte sequence.
- If the four highest bits don't match any of the previous three cases, then we have an invalid encoded sequence.
- At the end of the loop, we append the decoded `char` value to our `StringBuilder` instance.
18: The read Method
For completeness, here's the `read` method we've used in the previous section:

```java
private byte read(int index) throws IOException {
    try {
        return data[index];
    } catch (ArrayIndexOutOfBoundsException e) {
        throw new IOException("Invalid class file: Utf8 value", e);
    }
}
```
We assume the index is valid. If an `ArrayIndexOutOfBoundsException` is thrown, we rethrow it wrapped in an `IOException`, indicating an invalid class file.
19: The decode Method: 1-byte form
The method decodes a UTF-8 1-byte sequence:

```java
private char decode(byte byte0) {
    return (char) byte0;
}
```
In the UTF-8 1-byte form, the code unit maps directly to a code point.
20: The decode Method: 2-bytes form
The two bytes in the 2-byte sequence conform to the following bit patterns:

- `110xxxyy`
- `10yyzzzz`

Here's the method that decodes it:

```java
private char decode(byte byte0, byte byte1) throws IOException {
    checkUtf8(byte1);

    int bits0 = (byte0 & 0b0001_1111) << 6;
    int bits1 = (byte1 & 0b0011_1111) << 0;

    return (char) (bits0 | bits1);
}

private void checkUtf8(byte b) throws IOException {
    int topTwoBits = b & 0b1100_0000;

    if (topTwoBits != 0b1000_0000) {
        throw new IOException("Malformed UTF-8 value");
    }
}
```

The `checkUtf8` utility method verifies that the bit pattern of the second encoded byte conforms to `10yyzzzz`.
21: The decode Method: 3-bytes form
The three bytes in the 3-byte sequence conform to the following bit patterns:

- `1110wwww`
- `10xxxxyy`
- `10yyzzzz`

And here's the method that decodes it:

```java
private char decode(byte byte0, byte byte1, byte byte2) throws IOException {
    checkUtf8(byte1);
    checkUtf8(byte2);

    int bits0 = (byte0 & 0b0000_1111) << 12;
    int bits1 = (byte1 & 0b0011_1111) << 6;
    int bits2 = (byte2 & 0b0011_1111) << 0;

    return (char) (bits0 | bits1 | bits2);
}
```

It uses the same `checkUtf8` utility method defined earlier.
Testing the Current Iteration
To test the current iteration of our scanner, we create the following Java class:

```java
public class ModifiedUtf8 {
    String byte1Form = " !\"#$%&'()*+,-./0123456789:;<=>?@"
            + "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~";

    String byte2Form = "\0¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ";

    String byte3Form = "一二三四五六七八九十";

    String byte6Form = "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓";
}
```
The class declares four string values:

- The first string contains characters that, when encoded in the class file, will use the Modified UTF-8 1-byte form only.
- The characters in the second string will use the Modified UTF-8 2-byte form only.
- The characters in the third string will use the Modified UTF-8 3-byte form only.
- The characters in the fourth string will use the Modified UTF-8 6-byte form only.
When executed, our program prints:

```
ClassFile /tmp/ModifiedUtf8.class
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\u0000¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ
一二三四五六七八九十
😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓
```

It works.
I should note that the `processStringLiteral` method was modified so that any character for which `Character::isISOControl` returns `true` is printed in its Unicode escape sequence representation.
Conclusion
In this fifth blog post in the series, we learned about the different Unicode Transformation Formats. We also learned that Java class files use a modified UTF-8 encoding. It differs from regular UTF-8 in two points:

- First, it uses the 2-byte form to encode the `U+0000` code point, whereas regular UTF-8 uses the 1-byte form.
- Second, code points above `U+FFFF` are first encoded in UTF-16, and the two resulting surrogate code units are each encoded using the UTF-8 3-byte form.
Finally, we fixed our Java class file scanner so it correctly decodes all string literal values.
You can find the source code used in this blog post in this Gist.