Difference between compact strings and compressed strings in Java 9

Java Problem Overview


What are the advantages of compact strings over compressed strings in JDK9?

Java Solutions


Solution 1 - Java

Compressed strings (Java 6) and compact strings (Java 9) both have the same motivation (strings are often effectively Latin-1, so half the space is wasted) and goal (make those strings small) but the implementations differ a lot.

Compressed Strings

In an interview Aleksey Shipilëv (who was in charge of implementing the Java 9 feature) had this to say about compressed strings:

> UseCompressedStrings feature was rather conservative: while distinguishing between char[] and byte[] case, and trying to compress the char[] into byte[] on String construction, it done most String operations on char[], which required to unpack the String. Therefore, it benefited only a special type of workloads, where most strings are compressible (so compression does not go to waste), and only a limited amount of known String operations are performed on them (so no unpacking is needed). In great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.
>
> [...] UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar, which was loaded once the VM option is supplied. Optional features are harder to test, since they double the number of option combinations to try.

Compact Strings

In Java 9 on the other hand, compact strings are fully integrated into the JDK source. String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two. Most operations do a check to see which is the case, e.g. charAt:

public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);
    } else {
        return StringUTF16.charAt(value, index);
    }
}

Compact strings are enabled by default and can be partially disabled with -XX:-CompactStrings. "Partially" because strings are still backed by a byte[], so operations returning chars must still assemble them from two separate bytes (because of intrinsics, it is hard to say whether this has a performance impact).
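To make the dispatch in charAt and the "two separate bytes" remark more concrete, here is a minimal sketch. The class CompactCharAtSketch, its constants, and the big-endian byte layout are assumptions made for illustration only; the real StringLatin1 and StringUTF16 classes are JDK internals and rely on intrinsics.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the JDK's actual implementation.
class CompactCharAtSketch {
    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    // Reads the character at the given index from a byte[] in the given encoding.
    static char charAt(byte[] value, byte coder, int index) {
        if (coder == LATIN1) {
            // Mask to avoid sign extension for bytes >= 0x80.
            return (char) (value[index] & 0xFF);
        }
        // Assumed big-endian layout: high byte first, then low byte.
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        byte[] latin1 = "first".getBytes(StandardCharsets.ISO_8859_1);
        // Unicode escapes spell a Cyrillic word, independent of source file encoding.
        byte[] utf16 = "\u043f\u0435\u0440\u0432\u044b\u0439".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(charAt(latin1, LATIN1, 0));
        System.out.println((int) charAt(utf16, UTF16, 0));
    }
}
```

In the Latin-1 branch a char is a single masked byte, while the UTF-16 branch has to combine a high and a low byte, which is the extra work that remains even when compaction is switched off.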

More

If you're interested in more background on compact strings, I recommend reading the interview I linked to above and/or watching this great talk by the same Aleksey Shipilëv (which also explains the new string concatenation).

Solution 2 - Java

-XX:+UseCompressedStrings and Compact Strings are different things.

UseCompressedStrings meant that Strings that are ASCII-only could be converted to byte[], but this was off by default. In JDK 9 this kind of optimization is on by default, not via that flag but built in.

Until Java 9, Strings were stored internally as a char[] in UTF-16 encoding. From Java 9 on, they are stored as a byte[]. Why?

Because in ISO_LATIN_1 each character can be encoded in a single byte (8 bits) instead of the 16 bits used until now, 8 of which were never used. This works only for ISO_LATIN_1, but that covers the majority of Strings used anyway.

So this is done to save space.
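Whether a given String falls into that single-byte category can be checked from user code. The following is a minimal sketch using the standard CharsetEncoder API; the class name Latin1Check and the method fitsLatin1 are made up for illustration, and this is not how String itself decides internally.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it only illustrates the "fits in one byte" condition.
class Latin1Check {
    // True if every character of s can be represented in ISO-8859-1 (Latin-1).
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("first"));      // plain ASCII
        System.out.println(fitsLatin1("caf\u00e9"));  // e with acute accent, still Latin-1
        System.out.println(fitsLatin1("\u043f\u0435\u0440\u0432\u044b\u0439")); // Cyrillic
    }
}
```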

Here is a small example that should make things clearer:

class StringCharVsByte {
    public static void main(String[] args) {
        String first = "first";          // Latin-1 characters only
        String russianFirst = "первый";  // Cyrillic, outside Latin-1

        char[] c1 = first.toCharArray();
        char[] c2 = russianFirst.toCharArray();

        // Print the most significant 8 bits of each 16-bit char.
        for (char c : c1) {
            System.out.println(c >>> 8);
        }

        for (char c : c2) {
            System.out.println(c >>> 8);
        }
    }
}

In the first case we get only zeroes, meaning that the most significant 8 bits of every char are zero; in the second case we get non-zero values, meaning that at least one of the most significant 8 bits is set.

That means that if Strings are stored internally as an array of chars, there are string literals that waste half of each char. It turns out that many applications waste a lot of space because of this.

You have a String made of 10 Latin-1 characters? You just lost 80 bits, or 10 bytes. String compression was introduced to mitigate this, and now there is no space loss for these Strings.
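The arithmetic can be sketched as follows. The class PayloadSize and its methods are hypothetical, and the numbers cover only the character data, not object or array headers.

```java
// Hypothetical helper that compares character payload sizes only.
class PayloadSize {
    // Bytes of character data when every char takes two bytes (char[]-backed String).
    static int charArrayBytes(String s) {
        return 2 * s.length();
    }

    // Bytes of character data when Latin-1-only strings take one byte per char.
    static int compactBytes(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return 2 * s.length(); // at least one char needs two bytes
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        String tenLatin1 = "0123456789";
        // 20 bytes as char[] versus 10 bytes compact: 10 bytes saved.
        System.out.println(charArrayBytes(tenLatin1) - compactBytes(tenLatin1));
    }
}
```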

Internally this also means some very nice things. To distinguish between Strings that are LATIN1 and those that are UTF-16, there is a field called coder:

/**
 * The identifier of the encoding used to encode the bytes in
 * {@code value}. The supported values in this implementation are
 *
 * LATIN1
 * UTF16
 *
 * @implNote This field is trusted by the VM, and is a subject to
 * constant folding if String instance is constant. Overwriting this
 * field after construction will cause problems.
 */
private final byte coder;

Now, based on this field, length is computed differently:

public int length() {
    return value.length >> coder();
}

If our String is Latin-1 only, coder is zero, so the length of value (the byte array) is the number of chars. For a non-Latin-1 String, coder is one, so the shift divides the array length by two.
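The shift can be tried out on plain byte arrays. This is a minimal sketch: the class LengthShift is made up, and the arrays come from getBytes rather than from String's private value field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the length() computation shown above.
class LengthShift {
    static final byte LATIN1 = 0; // shift by 0: one byte per char
    static final byte UTF16 = 1;  // shift by 1: two bytes per char

    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        byte[] latin1 = "first".getBytes(StandardCharsets.ISO_8859_1); // 5 bytes
        byte[] utf16 = "\u043f\u0435\u0440\u0432\u044b\u0439"
                .getBytes(StandardCharsets.UTF_16BE);                  // 12 bytes
        System.out.println(length(latin1, LATIN1)); // 5
        System.out.println(length(utf16, UTF16));   // 6
    }
}
```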

Solution 3 - Java

Compact Strings offer the best of both worlds.

As can be seen in the definition provided in OpenJDK documentation:

> The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

As mentioned by @Eugene, most strings can be encoded in Latin-1, which requires one byte per character, and hence do not need the 2 bytes per character provided by the pre-Java 9 String class implementation.

The new String class implementation will shift from UTF-16 char array to a byte array plus an encoding-flag field. The additional encoding field will show whether the characters are stored using UTF-16 or Latin-1 format.
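A byte array plus an encoding flag could look roughly like the sketch below. CompactTextSketch, its fields, and the big-endian layout for the two-byte case are assumptions for illustration; the real java.lang.String differs in many details.

```java
// Hypothetical sketch of "byte[] plus encoding flag"; not the JDK's String.
class CompactTextSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value; // encoded characters
    final byte coder;   // which encoding value uses

    CompactTextSketch(char[] chars) {
        if (allLatin1(chars)) {
            // Every char fits in one byte: store one byte per character.
            value = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                value[i] = (byte) chars[i];
            }
            coder = LATIN1;
        } else {
            // Otherwise store two bytes per character (high byte first).
            value = new byte[chars.length * 2];
            for (int i = 0; i < chars.length; i++) {
                value[2 * i] = (byte) (chars[i] >> 8);
                value[2 * i + 1] = (byte) chars[i];
            }
            coder = UTF16;
        }
    }

    private static boolean allLatin1(char[] chars) {
        for (char c : chars) {
            if (c > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CompactTextSketch a = new CompactTextSketch("first".toCharArray());
        CompactTextSketch b = new CompactTextSketch("\u043f\u0435\u0440\u0432\u044b\u0439".toCharArray());
        System.out.println(a.value.length + " bytes, coder " + a.coder);
        System.out.println(b.value.length + " bytes, coder " + b.coder);
    }
}
```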

It follows that strings can still be stored in UTF-16 when required. This is also the main difference between the Compressed Strings of Java 6 and the Compact Strings of Java 9: in Compressed Strings, a byte[] was used only for strings that could be represented as pure ASCII.

Solution 4 - Java

Compressed Strings (-XX:+UseCompressedStrings)

This was an optional feature introduced in Java 6 Update 21 to improve SPECjbb performance by encoding US-ASCII-only Strings with one byte per character.

The feature could be enabled with an -XX flag (-XX:+UseCompressedStrings). When it was enabled, String.value was changed to an Object reference that pointed either to a byte[], for strings containing only 7-bit US-ASCII characters, or else to a char[].
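The "Object reference pointing to a byte[] or a char[]" idea can be pictured with the sketch below. It is purely illustrative: CompressedStringSketch and its methods are invented here and do not reproduce the Java 6 sources.

```java
// Hypothetical sketch of the Java 6 idea; not the actual alt-rt.jar code.
class CompressedStringSketch {
    // Either a byte[] (7-bit US-ASCII only) or a char[] (everything else).
    private final Object value;

    CompressedStringSketch(char[] chars) {
        boolean ascii = true;
        for (char c : chars) {
            if (c > 0x7F) {
                ascii = false;
                break;
            }
        }
        if (ascii) {
            byte[] packed = new byte[chars.length];
            for (int i = 0; i < chars.length; i++) {
                packed[i] = (byte) chars[i];
            }
            value = packed;
        } else {
            value = chars.clone();
        }
    }

    boolean isCompressed() {
        return value instanceof byte[];
    }

    // Operations that only understand char[] have to unpack the bytes first.
    char[] toChars() {
        if (value instanceof byte[]) {
            byte[] packed = (byte[]) value;
            char[] unpacked = new char[packed.length];
            for (int i = 0; i < packed.length; i++) {
                unpacked[i] = (char) (packed[i] & 0xFF);
            }
            return unpacked;
        }
        return ((char[]) value).clone();
    }

    public static void main(String[] args) {
        System.out.println(new CompressedStringSketch("abc".toCharArray()).isCompressed());
        System.out.println(new CompressedStringSketch("caf\u00e9".toCharArray()).isCompressed());
    }
}
```

Note that in this sketch a Latin-1 character such as the accented e already forces the char[] form, whereas the compact strings of Java 9 would still store it in one byte.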

It was later removed in Java 7 because it was costly to maintain and hard to test.

Compact String

This is a new feature introduced in Java 9 to build a memory efficient String.

Before Java 9, the String class stored characters in a char array, using two bytes for each character. From Java 9 on, the String class stores characters in a byte[] plus an encoding-flag field, using either one byte per character (Latin-1) or two bytes per character (UTF-16), based upon the contents of the string. If all characters fit in Latin-1, the one-byte encoding is used; otherwise UTF-16 is used. The encoding flag indicates which encoding is in use.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | soorapadman | View Question on Stackoverflow
Solution 1 - Java | Nicolai Parlog | View Answer on Stackoverflow
Solution 2 - Java | Eugene | View Answer on Stackoverflow
Solution 3 - Java | Dhaval Simaria | View Answer on Stackoverflow
Solution 4 - Java | Mohit Tyagi | View Answer on Stackoverflow