Why does a 4 billion-iteration Java loop take only 2 ms?

JavaFor LoopJvm

Java Problem Overview


I'm running the following Java code on a laptop with 2.7 GHz Intel Core i7. I intended to let it measure how long it takes to finish a loop with 2^32 iterations, which I expected to be roughly 1.48 seconds (4/2.7 = 1.48).

But actually it only takes 2 milliseconds, instead of 1.48 s. I'm wondering if this is a result of any JVM optimization underneath?

public static void main(String[] args)
{
    long start = System.nanoTime();

    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++){
    }
    long finish = System.nanoTime();
    long d = (finish - start) / 1000000;

    System.out.println("Used " + d);
}

Java Solutions


Solution 1 - Java

There are one of two possibilities going on here:

  1. The compiler realized that the loop is redundant and doing nothing so it optimized it away.

  2. The JIT (just-in-time compiler) realized that the loop is redundant and doing nothing, so it optimized it away.

Modern compilers are very intelligent; they can see when code is useless. Try putting an empty loop into GodBolt and look at the output, then turn on -O2 optimizations, you will see that the output is something along the lines of

main():
    xor eax, eax
    ret

I would like to clarify something, in Java most of the optimizations are done by the JIT. In some other languages (like C/C++) most of the optimizations are done by the first compiler.

Solution 2 - Java

It looks like it was optimized away by JIT compiler. When I turn it off (-Djava.compiler=NONE), the code runs much slower:

$ javac MyClass.java
$ java MyClass
Used 4
$ java -Djava.compiler=NONE MyClass
Used 40409

I put OP's code inside of class MyClass.

Solution 3 - Java

I just will state the obvious - that this is a JVM optimization that happens, the loop will simply be remove at all. Here is a small test that shows what a huge difference JIT has when enabled/enabled only for C1 Compiler and disabled at all.

Disclaimer: don't write tests like this - this is just to prove that the actual loop "removal" happens in the C2 Compiler:

@Benchmark
@Fork(1)
public void full() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        ++result;
    }
}

@Benchmark
@Fork(1)
public void minusOne() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-XX:TieredStopAtLevel=1" })
public void withoutC2() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-Xint" })
public void withoutAll() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

The results show that depending on which part of the JIT is enabled, method gets faster (so much faster that it looks like it's doing "nothing" - loop removal, which seems to be happening in the C2 Compiler - which is the maximum level):

 Benchmark                Mode  Cnt      Score   Error  Units
 Loop.full        avgt    210⁻⁷          ms/op
 Loop.minusOne    avgt    210⁻⁶          ms/op
 Loop.withoutAll  avgt    2  51782.751          ms/op
 Loop.withoutC2   avgt    2   1699.137          ms/op 

Solution 4 - Java

As already pointed out, JIT (just-in-time) compiler can optimize an empty loop in order to remove unnecessary iterations. But how?

Actually, there are two JIT compilers: C1 & C2. First, the code is compiled with the C1. C1 collects the statistics and helps the JVM to discover that in 100% cases our empty loop doesn't change anything and is useless. In this situation C2 enters the stage. When the code is get called very often, it can be optimized and compiled with the C2 using collected statistics.

As an example, I will test the next code snippet (my JDK is set to slowdebug build 9-internal):

public class Demo {
    private static void run() {
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        }
        System.out.println("Done!");
    }
}

With the following command line options:

-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Demo.run

And there are different versions of my run method, compiled with the C1 and C2 appropriately. For me, the final variant (C2) looks something like this:

...

; B1: #	B3 B2 <- BLOCK HEAD IS JUNK  Freq: 1
0x00000000125461b0: mov   dword ptr [rsp+0ffffffffffff7000h], eax
0x00000000125461b7: push  rbp
0x00000000125461b8: sub   rsp, 40h
0x00000000125461bc: mov   ebp, dword ptr [rdx]
0x00000000125461be: mov   rcx, rdx
0x00000000125461c1: mov   r10, 57fbc220h
0x00000000125461cb: call  indirect r10    ; *iload_1

0x00000000125461ce: cmp   ebp, 7fffffffh  ; 7fffffff => 2147483647
0x00000000125461d4: jnl   125461dbh       ; jump if not less

; B2: #	B3 <- B1  Freq: 0.999999
0x00000000125461d6: mov   ebp, 7fffffffh  ; *if_icmpge

; B3: #	N44 <- B1 B2  Freq: 1       
0x00000000125461db: mov   edx, 0ffffff5dh
0x0000000012837d60: nop
0x0000000012837d61: nop
0x0000000012837d62: nop
0x0000000012837d63: call  0ae86fa0h

...

It is a little bit messy, but If you look closely, you may notice that there is no long running loop here. There are 3 blocks: B1, B2 and B3 and the execution steps can be B1 -> B2 -> B3 or B1 -> B3. Where Freq: 1 - normalized estimated frequency of a block execution.

Solution 5 - Java

You are measuring the time it take to detect the loop doesn't do anything, compile the code in a background thread and eliminate the code.

for (int t = 0; t < 5; t++) {
    long start = System.nanoTime();
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }
    long time = System.nanoTime() - start;

    String s = String.format("%d: Took %.6f ms", t, time / 1e6);
    Thread.sleep(50);
    System.out.println(s);
    Thread.sleep(50);
}

If you run this with -XX:+PrintCompilation you can see the code has been compiled in the background to level 3 or C1 compiler and after a few loops to level 4 of C4.

    129   34 %     3       A::main @ 15 (93 bytes)
    130   35       3       A::main (93 bytes)
    130   36 %     4       A::main @ 15 (93 bytes)
    131   34 %     3       A::main @ -2 (93 bytes)   made not entrant
    131   36 %     4       A::main @ -2 (93 bytes)   made not entrant
0: Took 2.510408 ms
    268   75 %     3       A::main @ 15 (93 bytes)
    271   76 %     4       A::main @ 15 (93 bytes)
    274   75 %     3       A::main @ -2 (93 bytes)   made not entrant
1: Took 5.629456 ms
2: Took 0.000000 ms
3: Took 0.000364 ms
4: Took 0.000365 ms

If you change the loop to use a long it doesn't get as optimised.

    for (long i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }

instead you get

0: Took 1579.267321 ms
1: Took 1674.148662 ms
2: Took 1885.692166 ms
3: Took 1709.870567 ms
4: Took 1754.005112 ms

Solution 6 - Java

You consider start and finish time in nanosecond and you divide by 10^6 for calculate the latency

long d = (finish - start) / 1000000

it should be 10^9 because 1 second = 10^9 nanosecond.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiontwimoView Question on Stackoverflow
Solution 1 - Javavan denchView Answer on Stackoverflow
Solution 2 - JavaAkavallView Answer on Stackoverflow
Solution 3 - JavaEugeneView Answer on Stackoverflow
Solution 4 - JavaOleksandr PyrohovView Answer on Stackoverflow
Solution 5 - JavaPeter LawreyView Answer on Stackoverflow
Solution 6 - JavaDHARMENDRA SINGHView Answer on Stackoverflow