Why is 2 * (i * i) faster than 2 * i * i in Java?

JavaPerformanceBenchmarkingBytecodeJit

Java Problem Overview


The following Java program takes on average between 0.50 secs and 0.55 secs to run:

public static void main(String[] args) {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000 + " s");
    System.out.println("n = " + n);
}

If I replace 2 * (i * i) with 2 * i * i, it takes between 0.60 and 0.65 secs to run. How come?

I ran each version of the program 15 times, alternating between the two. Here are the results:

 2*(i*i)  |  2*i*i
----------+----------
0.5183738 | 0.6246434
0.5298337 | 0.6049722
0.5308647 | 0.6603363
0.5133458 | 0.6243328
0.5003011 | 0.6541802
0.5366181 | 0.6312638
0.515149  | 0.6241105
0.5237389 | 0.627815
0.5249942 | 0.6114252
0.5641624 | 0.6781033
0.538412  | 0.6393969
0.5466744 | 0.6608845
0.531159  | 0.6201077
0.5048032 | 0.6511559
0.5232789 | 0.6544526

The fastest run of 2 * i * i took longer than the slowest run of 2 * (i * i). If they had the same efficiency, the probability of this happening would be less than 1/2^15 * 100% = 0.00305%.

Java Solutions


Solution 1 - Java

There is a slight difference in the ordering of the bytecode.

2 * (i * i):

     iconst_2
     iload0
     iload0
     imul
     imul
     iadd

vs 2 * i * i:

     iconst_2
     iload0
     imul
     iload0
     imul
     iadd

At first sight this should not make a difference; if anything the second version is more optimal since it uses one slot less.

So we need to dig deeper into the lower level (JIT)1.

Remember that JIT tends to unroll small loops very aggressively. Indeed we observe a 16x unrolling for the 2 * (i * i) case:

030   B2: #	B2 B3 <- B1 B2 	Loop: B2-B2 inner main of N18 Freq: 1e+006
030   	addl    R11, RBP	# int
033   	movl    RBP, R13	# spill
036   	addl    RBP, #14	# int
039   	imull   RBP, RBP	# int
03c   	movl    R9, R13	# spill
03f   	addl    R9, #13	# int
043   	imull   R9, R9	# int
047   	sall    RBP, #1
049   	sall    R9, #1
04c   	movl    R8, R13	# spill
04f   	addl    R8, #15	# int
053   	movl    R10, R8	# spill
056   	movdl   XMM1, R8	# spill
05b   	imull   R10, R8	# int
05f   	movl    R8, R13	# spill
062   	addl    R8, #12	# int
066   	imull   R8, R8	# int
06a   	sall    R10, #1
06d   	movl    [rsp + #32], R10	# spill
072   	sall    R8, #1
075   	movl    RBX, R13	# spill
078   	addl    RBX, #11	# int
07b   	imull   RBX, RBX	# int
07e   	movl    RCX, R13	# spill
081   	addl    RCX, #10	# int
084   	imull   RCX, RCX	# int
087   	sall    RBX, #1
089   	sall    RCX, #1
08b   	movl    RDX, R13	# spill
08e   	addl    RDX, #8	# int
091   	imull   RDX, RDX	# int
094   	movl    RDI, R13	# spill
097   	addl    RDI, #7	# int
09a   	imull   RDI, RDI	# int
09d   	sall    RDX, #1
09f   	sall    RDI, #1
0a1   	movl    RAX, R13	# spill
0a4   	addl    RAX, #6	# int
0a7   	imull   RAX, RAX	# int
0aa   	movl    RSI, R13	# spill
0ad   	addl    RSI, #4	# int
0b0   	imull   RSI, RSI	# int
0b3   	sall    RAX, #1
0b5   	sall    RSI, #1
0b7   	movl    R10, R13	# spill
0ba   	addl    R10, #2	# int
0be   	imull   R10, R10	# int
0c2   	movl    R14, R13	# spill
0c5   	incl    R14	# int
0c8   	imull   R14, R14	# int
0cc   	sall    R10, #1
0cf   	sall    R14, #1
0d2   	addl    R14, R11	# int
0d5   	addl    R14, R10	# int
0d8   	movl    R10, R13	# spill
0db   	addl    R10, #3	# int
0df   	imull   R10, R10	# int
0e3   	movl    R11, R13	# spill
0e6   	addl    R11, #5	# int
0ea   	imull   R11, R11	# int
0ee   	sall    R10, #1
0f1   	addl    R10, R14	# int
0f4   	addl    R10, RSI	# int
0f7   	sall    R11, #1
0fa   	addl    R11, R10	# int
0fd   	addl    R11, RAX	# int
100   	addl    R11, RDI	# int
103   	addl    R11, RDX	# int
106   	movl    R10, R13	# spill
109   	addl    R10, #9	# int
10d   	imull   R10, R10	# int
111   	sall    R10, #1
114   	addl    R10, R11	# int
117   	addl    R10, RCX	# int
11a   	addl    R10, RBX	# int
11d   	addl    R10, R8	# int
120   	addl    R9, R10	# int
123   	addl    RBP, R9	# int
126   	addl    RBP, [RSP + #32 (32-bit)]	# int
12a   	addl    R13, #16	# int
12e   	movl    R11, R13	# spill
131   	imull   R11, R13	# int
135   	sall    R11, #1
138   	cmpl    R13, #999999985
13f   	jl     B2	# loop end  P=1.000000 C=6554623.000000

We see that there is 1 register that is "spilled" onto the stack.

And for the 2 * i * i version:

05a   B3: #	B2 B4 <- B1 B2 	Loop: B3-B2 inner main of N18 Freq: 1e+006
05a   	addl    RBX, R11	# int
05d   	movl    [rsp + #32], RBX	# spill
061   	movl    R11, R8	# spill
064   	addl    R11, #15	# int
068   	movl    [rsp + #36], R11	# spill
06d   	movl    R11, R8	# spill
070   	addl    R11, #14	# int
074   	movl    R10, R9	# spill
077   	addl    R10, #16	# int
07b   	movdl   XMM2, R10	# spill
080   	movl    RCX, R9	# spill
083   	addl    RCX, #14	# int
086   	movdl   XMM1, RCX	# spill
08a   	movl    R10, R9	# spill
08d   	addl    R10, #12	# int
091   	movdl   XMM4, R10	# spill
096   	movl    RCX, R9	# spill
099   	addl    RCX, #10	# int
09c   	movdl   XMM6, RCX	# spill
0a0   	movl    RBX, R9	# spill
0a3   	addl    RBX, #8	# int
0a6   	movl    RCX, R9	# spill
0a9   	addl    RCX, #6	# int
0ac   	movl    RDX, R9	# spill
0af   	addl    RDX, #4	# int
0b2   	addl    R9, #2	# int
0b6   	movl    R10, R14	# spill
0b9   	addl    R10, #22	# int
0bd   	movdl   XMM3, R10	# spill
0c2   	movl    RDI, R14	# spill
0c5   	addl    RDI, #20	# int
0c8   	movl    RAX, R14	# spill
0cb   	addl    RAX, #32	# int
0ce   	movl    RSI, R14	# spill
0d1   	addl    RSI, #18	# int
0d4   	movl    R13, R14	# spill
0d7   	addl    R13, #24	# int
0db   	movl    R10, R14	# spill
0de   	addl    R10, #26	# int
0e2   	movl    [rsp + #40], R10	# spill
0e7   	movl    RBP, R14	# spill
0ea   	addl    RBP, #28	# int
0ed   	imull   RBP, R11	# int
0f1   	addl    R14, #30	# int
0f5   	imull   R14, [RSP + #36 (32-bit)]	# int
0fb   	movl    R10, R8	# spill
0fe   	addl    R10, #11	# int
102   	movdl   R11, XMM3	# spill
107   	imull   R11, R10	# int
10b   	movl    [rsp + #44], R11	# spill
110   	movl    R10, R8	# spill
113   	addl    R10, #10	# int
117   	imull   RDI, R10	# int
11b   	movl    R11, R8	# spill
11e   	addl    R11, #8	# int
122   	movdl   R10, XMM2	# spill
127   	imull   R10, R11	# int
12b   	movl    [rsp + #48], R10	# spill
130   	movl    R10, R8	# spill
133   	addl    R10, #7	# int
137   	movdl   R11, XMM1	# spill
13c   	imull   R11, R10	# int
140   	movl    [rsp + #52], R11	# spill
145   	movl    R11, R8	# spill
148   	addl    R11, #6	# int
14c   	movdl   R10, XMM4	# spill
151   	imull   R10, R11	# int
155   	movl    [rsp + #56], R10	# spill
15a   	movl    R10, R8	# spill
15d   	addl    R10, #5	# int
161   	movdl   R11, XMM6	# spill
166   	imull   R11, R10	# int
16a   	movl    [rsp + #60], R11	# spill
16f   	movl    R11, R8	# spill
172   	addl    R11, #4	# int
176   	imull   RBX, R11	# int
17a   	movl    R11, R8	# spill
17d   	addl    R11, #3	# int
181   	imull   RCX, R11	# int
185   	movl    R10, R8	# spill
188   	addl    R10, #2	# int
18c   	imull   RDX, R10	# int
190   	movl    R11, R8	# spill
193   	incl    R11	# int
196   	imull   R9, R11	# int
19a   	addl    R9, [RSP + #32 (32-bit)]	# int
19f   	addl    R9, RDX	# int
1a2   	addl    R9, RCX	# int
1a5   	addl    R9, RBX	# int
1a8   	addl    R9, [RSP + #60 (32-bit)]	# int
1ad   	addl    R9, [RSP + #56 (32-bit)]	# int
1b2   	addl    R9, [RSP + #52 (32-bit)]	# int
1b7   	addl    R9, [RSP + #48 (32-bit)]	# int
1bc   	movl    R10, R8	# spill
1bf   	addl    R10, #9	# int
1c3   	imull   R10, RSI	# int
1c7   	addl    R10, R9	# int
1ca   	addl    R10, RDI	# int
1cd   	addl    R10, [RSP + #44 (32-bit)]	# int
1d2   	movl    R11, R8	# spill
1d5   	addl    R11, #12	# int
1d9   	imull   R13, R11	# int
1dd   	addl    R13, R10	# int
1e0   	movl    R10, R8	# spill
1e3   	addl    R10, #13	# int
1e7   	imull   R10, [RSP + #40 (32-bit)]	# int
1ed   	addl    R10, R13	# int
1f0   	addl    RBP, R10	# int
1f3   	addl    R14, RBP	# int
1f6   	movl    R10, R8	# spill
1f9   	addl    R10, #16	# int
1fd   	cmpl    R10, #999999985
204   	jl     B2	# loop end  P=1.000000 C=7419903.000000

Here we observe much more "spilling" and more accesses to the stack [RSP + ...], due to more intermediate results that need to be preserved.

Thus the answer to the question is simple: 2 * (i * i) is faster than 2 * i * i because the JIT generates more optimal assembly code for the first case.


But of course it is obvious that neither the first nor the second version is any good; the loop could really benefit from vectorization, since any x86-64 CPU has at least SSE2 support.

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.

In fact, modern x86-64 CPUs break down the instructions further into micro-ops (µops) and with features like register renaming, µop caches and loop buffers, loop optimization takes a lot more finesse than a simple unrolling for optimal performance. According to Agner Fog's optimization guide:

> The gain in performance due to the µop cache can be quite > considerable if the average instruction length is more than 4 bytes. > The following methods of optimizing the use of the µop cache may > be considered: > > * Make sure that critical loops are small enough to fit into the µop cache. > * Align the most critical loop entries and function entries by 32. > * Avoid unnecessary loop unrolling. > * Avoid instructions that have extra load time
> . . .

Regarding those load times - even the fastest L1D hit costs 4 cycles, an extra register and µop, so yes, even a few accesses to memory will hurt performance in tight loops.

But back to the vectorization opportunity - to see how fast it can be, we can compile a similar C application with GCC, which outright vectorizes it (AVX2 is shown, SSE2 is similar)2:

  vmovdqa ymm0, YMMWORD PTR .LC0[rip]
  vmovdqa ymm3, YMMWORD PTR .LC1[rip]
  xor eax, eax
  vpxor xmm2, xmm2, xmm2
.L2:
  vpmulld ymm1, ymm0, ymm0
  inc eax
  vpaddd ymm0, ymm0, ymm3
  vpslld ymm1, ymm1, 1
  vpaddd ymm2, ymm2, ymm1
  cmp eax, 125000000      ; 8 calculations per iteration
  jne .L2
  vmovdqa xmm0, xmm2
  vextracti128 xmm2, ymm2, 1
  vpaddd xmm2, xmm0, xmm2
  vpsrldq xmm0, xmm2, 8
  vpaddd xmm0, xmm2, xmm0
  vpsrldq xmm1, xmm0, 4
  vpaddd xmm0, xmm0, xmm1
  vmovd eax, xmm0
  vzeroupper

With run times:

  • SSE: 0.24 s, or 2 times as fast.
  • AVX: 0.15 s, or 3 times as fast.
  • AVX2: 0.08 s, or 5 times as fast.

1 To get JIT generated assembly output, get a debug JVM and run with -XX:+PrintOptoAssembly

2 The C version is compiled with the -fwrapv flag, which enables GCC to treat signed integer overflow as a two's-complement wrap-around.

Solution 2 - Java

(Editor's note: this answer is contradicted by evidence from looking at the asm, as shown by another answer. This was a guess backed up by some experiments, but it turned out not to be correct.)


When the multiplication is 2 * (i * i), the JVM is able to factor out the multiplication by 2 from the loop, resulting in this equivalent but more efficient code:

int n = 0;
for (int i = 0; i < 1000000000; i++) {
    n += i * i;
}
n *= 2;

but when the multiplication is (2 * i) * i, the JVM doesn't optimize it since the multiplication by a constant is no longer right before the n += addition.

Here are a few reasons why I think this is the case:

  • Adding an if (n == 0) n = 1 statement at the start of the loop results in both versions being as efficient, since factoring out the multiplication no longer guarantees that the result will be the same
  • The optimized version (by factoring out the multiplication by 2) is exactly as fast as the 2 * (i * i) version

Here is the test code that I used to draw these conclusions:

public static void main(String[] args) {
    long fastVersion = 0;
    long slowVersion = 0;
    long optimizedVersion = 0;
    long modifiedFastVersion = 0;
    long modifiedSlowVersion = 0;

    for (int i = 0; i < 10; i++) {
        fastVersion += fastVersion();
        slowVersion += slowVersion();
        optimizedVersion += optimizedVersion();
        modifiedFastVersion += modifiedFastVersion();
        modifiedSlowVersion += modifiedSlowVersion();
    }

    System.out.println("Fast version: " + (double) fastVersion / 1000000000 + " s");
    System.out.println("Slow version: " + (double) slowVersion / 1000000000 + " s");
    System.out.println("Optimized version: " + (double) optimizedVersion / 1000000000 + " s");
    System.out.println("Modified fast version: " + (double) modifiedFastVersion / 1000000000 + " s");
    System.out.println("Modified slow version: " + (double) modifiedSlowVersion / 1000000000 + " s");
}

private static long fastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long slowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

private static long optimizedVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        n += i * i;
    }
    n *= 2;
    return System.nanoTime() - startTime;
}

private static long modifiedFastVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * (i * i);
    }
    return System.nanoTime() - startTime;
}

private static long modifiedSlowVersion() {
    long startTime = System.nanoTime();
    int n = 0;
    for (int i = 0; i < 1000000000; i++) {
        if (n == 0) n = 1;
        n += 2 * i * i;
    }
    return System.nanoTime() - startTime;
}

And here are the results:

Fast version: 5.7274411 s
Slow version: 7.6190804 s
Optimized version: 5.1348007 s
Modified fast version: 7.1492705 s
Modified slow version: 7.2952668 s

Solution 3 - Java

Byte codes: https://cs.nyu.edu/courses/fall00/V22.0201-001/jvm2.html Byte codes Viewer: https://github.com/Konloch/bytecode-viewer

On my JDK (Windows 10 64 bit, 1.8.0_65-b17) I can reproduce and explain:

public static void main(String[] args) {
    int repeat = 10;
    long A = 0;
    long B = 0;
    for (int i = 0; i < repeat; i++) {
        A += test();
        B += testB();
    }

    System.out.println(A / repeat + " ms");
    System.out.println(B / repeat + " ms");
}


private static long test() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multi(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multi(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms A " + n);
    return ms;
}


private static long testB() {
    int n = 0;
    for (int i = 0; i < 1000; i++) {
        n += multiB(i);
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < 1000000000; i++) {
        n += multiB(i);
    }
    long ms = (System.currentTimeMillis() - startTime);
    System.out.println(ms + " ms B " + n);
    return ms;
}

private static int multiB(int i) {
    return 2 * (i * i);
}

private static int multi(int i) {
    return 2 * i * i;
}

Output:

...
405 ms A 785527736
327 ms B 785527736
404 ms A 785527736
329 ms B 785527736
404 ms A 785527736
328 ms B 785527736
404 ms A 785527736
328 ms B 785527736
410 ms
333 ms

So why? The byte code is this:

 private static multiB(int arg0) { // 2 * (i * i)
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         iload0
         imul
         imul
         ireturn
     }
     L2 {
     }
 }

 private static multi(int arg0) { // 2 * i * i
     <localVar:index=0, name=i , desc=I, sig=null, start=L1, end=L2>

     L1 {
         iconst_2
         iload0
         imul
         iload0
         imul
         ireturn
     }
     L2 {
     }
 }

The difference being: With brackets (2 * (i * i)):

  • push const stack
  • push local on stack
  • push local on stack
  • multiply top of stack
  • multiply top of stack

Without brackets (2 * i * i):

  • push const stack
  • push local on stack
  • multiply top of stack
  • push local on stack
  • multiply top of stack

Loading all on the stack and then working back down is faster than switching between putting on the stack and operating on it.

Solution 4 - Java

Kasperd asked in a comment of the accepted answer: > The Java and C examples use quite different register names. Are both example using the AMD64 ISA?

xor edx, edx
xor eax, eax
.L2:
mov ecx, edx
imul ecx, edx
add edx, 1
lea eax, [rax+rcx*2]
cmp edx, 1000000000
jne .L2

I don't have enough reputation to answer this in the comments, but these are the same ISA. It's worth pointing out that the GCC version uses 32-bit integer logic and the JVM compiled version uses 64-bit integer logic internally.

R8 to R15 are just new X86_64 registers. EAX to EDX are the lower parts of the RAX to RDX general purpose registers. The important part in the answer is that the GCC version is not unrolled. It simply executes one round of the loop per actual machine code loop. While the JVM version has 16 rounds of the loop in one physical loop (based on rustyx answer, I did not reinterpret the assembly). This is one of the reasons why there are more registers being used since the loop body is actually 16 times longer.

Solution 5 - Java

While not directly related to the question's environment, just for the curiosity, I did the same test on .NET Core 2.1, x64, release mode.

Here is the interesting result, confirming similar phonomena (other way around) happening over the dark side of the force. Code:

static void Main(string[] args)
{
    Stopwatch watch = new Stopwatch();

    Console.WriteLine("2 * (i * i)");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * (i * i);
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds} ms");
    }

    Console.WriteLine();
    Console.WriteLine("2 * i * i");

    for (int a = 0; a < 10; a++)
    {
        int n = 0;

        watch.Restart();

        for (int i = 0; i < 1000000000; i++)
        {
            n += 2 * i * i;
        }

        watch.Stop();

        Console.WriteLine($"result:{n}, {watch.ElapsedMilliseconds}ms");
    }
}

Result:

2 * (i * i)

  • result:119860736, 438 ms
  • result:119860736, 433 ms
  • result:119860736, 437 ms
  • result:119860736, 435 ms
  • result:119860736, 436 ms
  • result:119860736, 435 ms
  • result:119860736, 435 ms
  • result:119860736, 439 ms
  • result:119860736, 436 ms
  • result:119860736, 437 ms

2 * i * i

  • result:119860736, 417 ms
  • result:119860736, 417 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms
  • result:119860736, 418 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms
  • result:119860736, 416 ms
  • result:119860736, 417 ms
  • result:119860736, 418 ms

Solution 6 - Java

I got similar results:

2 * (i * i): 0.458765943 s, n=119860736
2 * i * i: 0.580255126 s, n=119860736

I got the SAME results if both loops were in the same program, or each was in a separate .java file/.class, executed on a separate run.

Finally, here is a javap -c -v <.java> decompile of each:

     3: ldc           #3                  // String 2 * (i * i):
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: iload         4
    30: imul
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

vs.

     3: ldc           #3                  // String 2 * i * i:
     5: invokevirtual #4                  // Method java/io/PrintStream.print:(Ljava/lang/String;)V
     8: invokestatic  #5                  // Method java/lang/System.nanoTime:()J
    11: lstore_1
    12: iconst_0
    13: istore_3
    14: iconst_0
    15: istore        4
    17: iload         4
    19: ldc           #6                  // int 1000000000
    21: if_icmpge     40
    24: iload_3
    25: iconst_2
    26: iload         4
    28: imul
    29: iload         4
    31: imul
    32: iadd
    33: istore_3
    34: iinc          4, 1
    37: goto          17

FYI -

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Solution 7 - Java

Interesting observation using Java 11 and switching off loop unrolling with the following VM option:

-XX:LoopUnrollLimit=0

The loop with the 2 * (i * i) expression results in more compact native code1:

L0001: add    eax,r11d
       inc    r8d
       mov    r11d,r8d
       imul   r11d,r8d
       shl    r11d,1h
       cmp    r8d,r10d
       jl     L0001

in comparison with the 2 * i * i version:

L0001: add    eax,r11d
       mov    r11d,r8d
       shl    r11d,1h
       add    r11d,2h
       inc    r8d
       imul   r11d,r8d
       cmp    r8d,r10d
       jl     L0001

Java version:

java version "11" 2018-09-25
Java(TM) SE Runtime Environment 18.9 (build 11+28)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

Benchmark results:

Benchmark          (size)  Mode  Cnt    Score     Error  Units
LoopTest.fast  1000000000  avgt    5  694,868 ±  36,470  ms/op
LoopTest.slow  1000000000  avgt    5  769,840 ± 135,006  ms/op

Benchmark source code:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(1)
public class LoopTest {

    @Param("1000000000") private int size;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(LoopTest.class.getSimpleName())
            .jvmArgs("-XX:LoopUnrollLimit=0")
            .build();
        new Runner(opt).run();
    }

    @Benchmark
    public int slow() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * i * i;
        return n;
    }

    @Benchmark
    public int fast() {
        int n = 0;
        for (int i = 0; i < size; i++)
            n += 2 * (i * i);
        return n;
    }
}

1 - VM options used: -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:LoopUnrollLimit=0

Solution 8 - Java

I tried a JMH using the default archetype: I also added an optimized version based on Runemoro's explanation.

@State(Scope.Benchmark)
@Warmup(iterations = 2)
@Fork(1)
@Measurement(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
//@BenchmarkMode({ Mode.All })
@BenchmarkMode(Mode.AverageTime)
public class MyBenchmark {
  @Param({ "100", "1000", "1000000000" })
  private int size;

  @Benchmark
  public int two_square_i() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * (i * i);
    }
    return n;
  }

  @Benchmark
  public int square_i_two() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += i * i;
    }
    return 2*n;
  }

  @Benchmark
  public int two_i_() {
    int n = 0;
    for (int i = 0; i < size; i++) {
      n += 2 * i * i;
    }
    return n;
  }
}

The result are here:

Benchmark                           (size)  Mode  Samples          Score   Score error  Units
o.s.MyBenchmark.square_i_two           100  avgt       10         58,062         1,410  ns/op
o.s.MyBenchmark.square_i_two          1000  avgt       10        547,393        12,851  ns/op
o.s.MyBenchmark.square_i_two    1000000000  avgt       10  540343681,267  16795210,324  ns/op
o.s.MyBenchmark.two_i_                 100  avgt       10         87,491         2,004  ns/op
o.s.MyBenchmark.two_i_                1000  avgt       10       1015,388        30,313  ns/op
o.s.MyBenchmark.two_i_          1000000000  avgt       10  967100076,600  24929570,556  ns/op
o.s.MyBenchmark.two_square_i           100  avgt       10         70,715         2,107  ns/op
o.s.MyBenchmark.two_square_i          1000  avgt       10        686,977        24,613  ns/op
o.s.MyBenchmark.two_square_i    1000000000  avgt       10  652736811,450  27015580,488  ns/op

On my PC (Core i7 860 - it is doing nothing much apart from reading on my smartphone):

  • n += i*i then n*2 is first
  • 2 * (i * i) is second.

The JVM is clearly not optimizing the same way than a human does (based on Runemoro's answer).

Now then, reading bytecode: javap -c -v ./target/classes/org/sample/MyBenchmark.class

I am not expert on bytecode, but we iload_2 before we imul: that's probably where you get the difference: I can suppose that the JVM optimize reading i twice (i is already here, and there is no need to load it again) whilst in the 2*i*i it can't.

Solution 9 - Java

More of an addendum. I did repro the experiment using the latest Java 8 JVM from IBM:

java version "1.8.0_191"
Java(TM) 2 Runtime Environment, Standard Edition (IBM build 1.8.0_191-b12 26_Oct_2018_18_45 Mac OS X x64(SR5 FP25))
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

And this shows very similar results:

0.374653912 s
n = 119860736
0.447778698 s
n = 119860736

(second results using 2 * i * i).

Interestingly enough, when running on the same machine, but using Oracle Java:

Java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

results are on average a bit slower:

0.414331815 s
n = 119860736
0.491430656 s
n = 119860736

Long story short: even the minor version number of HotSpot matter here, as subtle differences within the JIT implementation can have notable effects.

Solution 10 - Java

The two methods of adding do generate slightly different byte code:

  17: iconst_2
  18: iload         4
  20: iload         4
  22: imul
  23: imul
  24: iadd

For 2 * (i * i) vs:

  17: iconst_2
  18: iload         4
  20: imul
  21: iload         4
  23: imul
  24: iadd

For 2 * i * i.

And when using a JMH benchmark like this:

@Warmup(iterations = 5, batchSize = 1)
@Measurement(iterations = 5, batchSize = 1)
@Fork(1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MyBenchmark {

	@Benchmark
	public int noBrackets() {
		int n = 0;
	    for (int i = 0; i < 1000000000; i++) {
	        n += 2 * i * i;
	    }
	    return n;
	}

	@Benchmark
	public int brackets() {
		int n = 0;
	    for (int i = 0; i < 1000000000; i++) {
	        n += 2 * (i * i);
	    }
	    return n;
	}

}

The difference is clear:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: <none>

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  380.889 ± 58.011  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  512.464 ± 11.098  ms/op

What you observe is correct, and not just an anomaly of your benchmarking style (i.e. no warmup, see How do I write a correct micro-benchmark in Java?)

Running again with Graal:

# JMH version: 1.21
# VM version: JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28
# VM options: -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler

Benchmark                      (n)  Mode  Cnt    Score    Error  Units
MyBenchmark.brackets    1000000000  avgt    5  335.100 ± 23.085  ms/op
MyBenchmark.noBrackets  1000000000  avgt    5  331.163 ± 50.670  ms/op

You see that the results are much closer, which makes sense, since Graal is an overall better performing, more modern, compiler.

So this is really just up to how well the JIT compiler is able to optimize a particular piece of code, and doesn't necessarily have a logical reason to it.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionStefanView Question on Stackoverflow
Solution 1 - JavarustyxView Answer on Stackoverflow
Solution 2 - Javauser1779715View Answer on Stackoverflow
Solution 3 - JavaDSchmidtView Answer on Stackoverflow
Solution 4 - JavaPuzzledView Answer on Stackoverflow
Solution 5 - JavaÜnsal ErsözView Answer on Stackoverflow
Solution 6 - Javapaulsm4View Answer on Stackoverflow
Solution 7 - JavaOleksandr PyrohovView Answer on Stackoverflow
Solution 8 - JavaNoDataFoundView Answer on Stackoverflow
Solution 9 - JavaGhostCatView Answer on Stackoverflow
Solution 10 - JavaJorn VerneeView Answer on Stackoverflow