Declaring multiple arrays with 64 elements 1000 times faster than declaring array of 65 elements

Java, Arrays

Java Problem Overview


Recently I noticed that declaring an array of 64 elements is much faster (more than 1000-fold) than declaring the same type of array with 65 elements.

Here is the code I used to test this:

public class Tests {
    public static void main(String[] args) {
        long start = System.nanoTime(); // nanoTime() returns a long
        int job = 100000000; // 100 million iterations
        for (int i = 0; i < job; i++) {
            double[] test = new double[64]; // allocated but never used
        }
        long end = System.nanoTime();
        // Divide by a double so the result keeps fractional milliseconds.
        System.out.println("Total runtime = " + (end - start) / 1000000.0 + " ms");
    }
}

This runs in approximately 6 ms. If I replace new double[64] with new double[65], it takes approximately 7 seconds. The problem becomes exponentially more severe as the job is spread across more and more threads, which is where my problem originates.

The problem also occurs with other array types, such as int[65] or String[65]. It does not occur with large strings, for example String test = "many characters";, but it does start occurring when this is changed to String test = i + "";

I was wondering why this is the case and if it is possible to circumvent this problem.

Java Solutions


Solution 1 - Java

You are observing a behavior that is caused by the optimizations done by the JIT compiler of your Java VM. This behavior is reproducibly triggered with scalar arrays of up to 64 elements, and is not triggered with arrays larger than 64.

Before going into details, let's take a closer look at the body of the loop:

double[] test = new double[64];

The body has no effect (no observable behavior). That means it makes no difference outside of the program execution whether this statement is executed or not. The same is true for the whole loop. So the code optimizer may translate the loop into something (or nothing) with the same functional behavior but different timing behavior.

For benchmarks you should at least adhere to the following two guidelines. If you had done so, the difference would have been significantly smaller.

  • Warm up the JIT compiler (and optimizer) by executing the benchmark several times.
  • Use the result of every expression and print it at the end of the benchmark.
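The two guidelines above can be sketched roughly as follows. This is a minimal hand-rolled example, not a rigorous benchmark: the class name AllocationBenchmark, the method run, the round count, and the iteration count are all made up for illustration. It only keeps the loop from being discarded as a whole; the JIT compiler is still free to optimize the individual allocations.

```java
// Illustrative sketch only; names and iteration counts are assumptions.
class AllocationBenchmark {

    // Allocates `iterations` arrays of `size` doubles and folds one element
    // and the length of each array into the returned checksum, so that the
    // work feeds a value the caller eventually prints.
    static long run(int size, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            double[] test = new double[size];
            test[0] = i;
            checksum += (long) test[0] + test.length;
        }
        return checksum;
    }

    public static void main(String[] args) {
        int iterations = 1_000_000;
        long sink = 0;

        // Guideline 1: run the measured code several times before timing it,
        // so the JIT compiler has a chance to compile it.
        for (int round = 0; round < 5; round++) {
            sink += run(64, iterations) + run(65, iterations);
        }

        for (int size : new int[] {64, 65}) {
            long start = System.nanoTime();
            sink += run(size, iterations);
            long elapsedNanos = System.nanoTime() - start;
            System.out.println("size " + size + ": " + elapsedNanos / 1_000_000.0 + " ms");
        }

        // Guideline 2: print the accumulated result at the end.
        System.out.println("checksum = " + sink);
    }
}
```

For anything beyond a quick check, a dedicated harness such as JMH is probably a better fit than a hand-written loop like this one.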

Now let's go into details. Not surprisingly, there is an optimization that is triggered for scalar arrays of no more than 64 elements. The optimization is part of escape analysis. It puts small objects and small arrays onto the stack instead of allocating them on the heap, or, even better, optimizes them away entirely. Brian Goetz described the technique in an article written in 2005.

The optimization can be disabled with the command line option -XX:-DoEscapeAnalysis. The magic value 64 for scalar arrays can also be changed on the command line. If you execute your program as follows, there will be no difference between arrays with 64 and 65 elements:

java -XX:EliminateAllocationArraySizeLimit=65 Tests

Having said that, I strongly discourage using such command line options. I doubt that they make a huge difference in a realistic application. I would only use them if I were absolutely convinced of the necessity, and not based on the results of some pseudo benchmarks.
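To see which values a running JVM actually uses for the two options mentioned above, something like the following sketch may help. It assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean and includes the server compiler; the class name EscapeAnalysisFlags and the helper flagValue are made up for this example. On other JVMs the lookup may fail, which the sketch reports as "unavailable".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sketch; assumes a HotSpot JVM with the diagnostic MXBean.
class EscapeAnalysisFlags {

    // Returns the current value of the named VM option as a string, or
    // "unavailable" if this JVM does not know the option or the bean.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown option names or a missing bean both end up here.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        String[] names = {"DoEscapeAnalysis", "EliminateAllocationArraySizeLimit"};
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

Running it once without options and once with -XX:EliminateAllocationArraySizeLimit=65 should show whether the setting was picked up.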

Solution 2 - Java

There are any number of ways that there can be a difference, based on the size of an object.

As nosid stated, the JITC may be (most likely is) allocating small "local" objects on the stack, and the size cutoff for "small" arrays may be at 64 elements.

Allocating on the stack is significantly faster than allocating on the heap, and, more to the point, the stack does not need to be garbage collected, so GC overhead is greatly reduced. (And for this test case GC overhead is likely 80-90% of the total execution time.)

Further, once the value is stack-allocated, the JITC can perform "dead code elimination": it can determine that the result of the new is never used anywhere and, after ensuring that no side effects would be lost, eliminate the entire new operation and then the (now empty) loop itself.
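What "local" means here can be illustrated with a small sketch. The class EscapeExamples, its two methods, and the field published are hypothetical names chosen for this example. Whether a particular JVM actually removes the first allocation depends on its version and settings, so the comments describe what the compiler is allowed to do, not what it is guaranteed to do.

```java
// Illustrative sketch; all names are made up for this example.
class EscapeExamples {

    // A field that other code could read after the method has returned.
    static double[] published;

    // The array is only read and written inside this method. It is not
    // returned, stored in a field, or passed to other code, so an optimizer
    // may treat it as local and avoid the heap allocation.
    static double nonEscaping(double x) {
        double[] pair = new double[2];
        pair[0] = x;
        pair[1] = x * 2;
        return pair[0] + pair[1];
    }

    // Same computation, but the array is stored in a static field. It stays
    // reachable after the method returns, so the allocation has an
    // observable effect and cannot simply be dropped.
    static double escaping(double x) {
        double[] pair = new double[2];
        pair[0] = x;
        pair[1] = x * 2;
        published = pair;
        return pair[0] + pair[1];
    }

    public static void main(String[] args) {
        System.out.println(nonEscaping(1.5));
        System.out.println(escaping(1.5));
        System.out.println(published.length);
    }
}
```

The loop in the question resembles the first method, except that the array contents are never read at all, which leaves nothing that has to be kept.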

Even if the JITC does not do stack allocation, it's entirely possible for objects smaller than a certain size to be allocated on the heap differently (e.g., from a different "space") than larger objects. (Normally this would not produce quite so dramatic timing differences, though.)

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         Original Author    Original Content on Stack Overflow
Question             Sipko              View Question on Stack Overflow
Solution 1 - Java    nosid              View Answer on Stack Overflow
Solution 2 - Java    Hot Licks          View Answer on Stack Overflow