Why is indirect increment faster than direct increment?

Java, Performance, Jvm

Java Problem Overview


This question had been asked by another SO member, but was disappointingly deleted. The comments said the measurements were flawed and did not make sense.

However I was able to reproduce the original problem with a small benchmark under JMH:

package bench;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.*;
import org.openjdk.jmh.runner.options.*;
import java.util.concurrent.*;

@State(Scope.Benchmark)
public class LoopInc {

    private int getValue() {
        return ThreadLocalRandom.current().nextInt(2);
    }

    @Benchmark
    public int directInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    result++;
                    break;
            }
        }
        return result;
    }

    @Benchmark
    public int indirectInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            boolean incr = false;
            switch (getValue()) {
                case 0:
                    break;
                case 1:
                    incr = true;
                    break;
            }

            if (incr) {
                result++;
            }
        }
        return result;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include("bench.LoopInc.*")
                .warmupIterations(5)
                .measurementIterations(10)
                .forks(3)
                .timeUnit(TimeUnit.MILLISECONDS)
                .build();
        new Runner(options).run();
    }
}

The benchmark shows indirectInc running about 3 times faster, though the "optimization" is not obvious at all. One would assume indirectInc should be a bit slower because it involves an extra intermediate operation.

Benchmark             Mode  Cnt    Score   Error   Units
LoopInc.directInc    thrpt   30  127.301 ± 0.202  ops/ms
LoopInc.indirectInc  thrpt   30  378.147 ± 1.144  ops/ms

 

java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

What causes the JIT to compile indirectInc better than the similar directInc?

Java Solutions


Solution 1 - Java

Ok, this is how you approach these things.

  1. Try to reproduce it. Okay, it reproduces:

    Benchmark             Mode  Cnt    Score   Error   Units
    LoopInc.directInc    thrpt   15  175.678 ± 1.118  ops/ms
    LoopInc.indirectInc  thrpt   15  641.413 ± 9.722  ops/ms

  2. Try to see the generated assembly with -prof perfasm. Long story short, loop unrolling produces a lot of generated code here, so we probably want to limit it. But unrolling can itself affect performance, and can pretty much be the cause, so let's re-run with -XX:LoopUnrollLimit=1. Okay, the scores are lower, but the difference is still there, excellent (a sketch of wiring these options into the benchmark's own main() follows right after the results):

    Benchmark             Mode  Cnt    Score   Error   Units
    LoopInc.directInc    thrpt   15  161.147 ± 6.101  ops/ms
    LoopInc.indirectInc  thrpt   15  489.430 ± 1.698  ops/ms
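
As an aside that is not part of the original answer: none of this requires the command line. The same options can be wired into the benchmark's existing main() through JMH's OptionsBuilder. A minimal sketch, reusing the imports from the LoopInc class above and assuming a Linux box with perf and the hsdis disassembler available:

    Options options = new OptionsBuilder()
            .include("bench.LoopInc.*")
            .warmupIterations(5)
            .measurementIterations(10)
            .forks(3)
            .timeUnit(TimeUnit.MILLISECONDS)
            // Limit loop unrolling in the forked JVMs, as in step 2.
            .jvmArgsAppend("-XX:LoopUnrollLimit=1")
            // Attach the assembly profiler (equivalent to -prof perfasm).
            .addProfiler("perfasm")
            .build();
    new Runner(options).run();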

  3. Take another look at the generated code; still nothing pops out to the eye. Well, this seems interesting. Let's get at this properly. Can we characterize the workload? Of course we can, with the help of -prof perfnorm, which normalizes the hardware counters per benchmark op. Let's see:

    Benchmark                                   Mode  Cnt      Score      Error   Units
    LoopInc.directInc                          thrpt   15    161.875 ±    3.038  ops/ms
    LoopInc.directInc:·CPI                     thrpt    3      0.967 ±    0.196    #/op
    LoopInc.directInc:·L1-dcache-load-misses   thrpt    3      0.394 ±    3.663    #/op
    LoopInc.directInc:·L1-dcache-loads         thrpt    3   2149.594 ±  228.166    #/op
    LoopInc.directInc:·L1-dcache-store-misses  thrpt    3      0.114 ±    1.001    #/op
    LoopInc.directInc:·L1-dcache-stores        thrpt    3   1073.666 ±   96.066    #/op
    LoopInc.directInc:·L1-icache-load-misses   thrpt    3      0.965 ±   22.984    #/op
    LoopInc.directInc:·LLC-loads               thrpt    3      0.204 ±    2.763    #/op
    LoopInc.directInc:·LLC-stores              thrpt    3      0.060 ±    0.633    #/op
    LoopInc.directInc:·branch-misses           thrpt    3    536.068 ±   43.293    #/op
    LoopInc.directInc:·branches                thrpt    3   3728.890 ±  220.539    #/op
    LoopInc.directInc:·cycles                  thrpt    3  26219.146 ± 6287.590    #/op
    LoopInc.directInc:·dTLB-load-misses        thrpt    3      0.063 ±    0.124    #/op
    LoopInc.directInc:·dTLB-loads              thrpt    3   2136.942 ±  165.990    #/op
    LoopInc.directInc:·dTLB-store-misses       thrpt    3      0.022 ±    0.029    #/op
    LoopInc.directInc:·dTLB-stores             thrpt    3   1084.787 ±  417.281    #/op
    LoopInc.directInc:·iTLB-load-misses        thrpt    3      0.081 ±    0.333    #/op
    LoopInc.directInc:·iTLB-loads              thrpt    3      3.623 ±   19.955    #/op
    LoopInc.directInc:·instructions            thrpt    3  27114.052 ± 1843.720    #/op

    LoopInc.indirectInc                          thrpt   15    489.164 ±    2.692  ops/ms
    LoopInc.indirectInc:·CPI                     thrpt    3      0.281 ±    0.015    #/op
    LoopInc.indirectInc:·L1-dcache-load-misses   thrpt    3      0.503 ±    9.071    #/op
    LoopInc.indirectInc:·L1-dcache-loads         thrpt    3   2149.806 ±  369.040    #/op
    LoopInc.indirectInc:·L1-dcache-store-misses  thrpt    3      0.167 ±    1.370    #/op
    LoopInc.indirectInc:·L1-dcache-stores        thrpt    3   1073.895 ±  186.741    #/op
    LoopInc.indirectInc:·L1-icache-load-misses   thrpt    3      0.313 ±    1.275    #/op
    LoopInc.indirectInc:·branch-misses           thrpt    3      1.102 ±    0.375    #/op
    LoopInc.indirectInc:·branches                thrpt    3   2143.670 ±  228.475    #/op
    LoopInc.indirectInc:·cycles                  thrpt    3   8701.665 ±  706.183    #/op
    LoopInc.indirectInc:·dTLB-load-misses        thrpt    3      0.020 ±    0.301    #/op
    LoopInc.indirectInc:·dTLB-loads              thrpt    3   2141.965 ±  135.852    #/op
    LoopInc.indirectInc:·dTLB-store-misses       thrpt    3      0.002 ±    0.029    #/op
    LoopInc.indirectInc:·dTLB-stores             thrpt    3   1070.376 ±   81.445    #/op
    LoopInc.indirectInc:·iTLB-load-misses        thrpt    3      0.007 ±    0.135    #/op
    LoopInc.indirectInc:·iTLB-loads              thrpt    3      0.310 ±    5.768    #/op
    LoopInc.indirectInc:·instructions            thrpt    3  30968.207 ± 3627.540    #/op

Oh, both benchmarks execute a comparable number of instructions. The slower one takes many more cycles (that's why CPI is also not ideal in `directInc`; `indirectInc`, however, produces a close-to-ideal CPI). If you look closely at the possible causes: there are not many cache misses and not many TLB misses, but the slow benchmark has lots of branch misses. AHA! Now we know what to look for in the generated code.
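
A quick aside, not from the original answer: the CPI figure is simply cycles divided by instructions per op, so it can be re-derived from the perfnorm numbers above. A trivial check:

    // Not from the original answer: CPI = cycles / instructions, using the figures above.
    public class CpiCheck {
        public static void main(String[] args) {
            System.out.printf("directInc   CPI = %.3f%n", 26219.146 / 27114.052); // ~0.967
            System.out.printf("indirectInc CPI = %.3f%n",  8701.665 / 30968.207); // ~0.281
        }
    }
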
  4. Let's look at the generated code again. -prof perfasm conveniently highlights the jumps. And then you will see this...

    directInc:

                     ╭│      0x00007fa0a82a50ff: jmp    0x00007fa0a82a5116
    11.39%   16.90%  ││ ↗    0x00007fa0a82a5101: inc    %edx               ;*iinc
                     ││ │                                                  ; - org.openjdk.LoopInc::directInc@46 (line 18)
    12.52%   23.11%  ││ │↗↗  0x00007fa0a82a5103: mov    %r10,0xe8(%r11)    ;*invokevirtual putLong
                     ││ │││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
    12.00%    8.14%  ││ │││  0x00007fa0a82a510a: inc    %r8d               ;*iinc
                     ││ │││                                                ; - org.openjdk.LoopInc::directInc@46 (line 18)
     0.03%    0.03%  ││ │││  0x00007fa0a82a510d: cmp    $0x3e8,%r8d
                     │╰ │││  0x00007fa0a82a5114: jge    0x00007fa0a82a50c7  ;*aload_0
                     │  │││                                                ; - org.openjdk.LoopInc::directInc@11 (line 19)
     0.80%    0.91%  ↘  │││  0x00007fa0a82a5116: mov    0xf0(%r11),%r10d   ;*invokevirtual getInt
                        │││                                                ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
     4.28%    1.23%     │││  0x00007fa0a82a511d: test   %r10d,%r10d
                       ╭│││  0x00007fa0a82a5120: je     0x00007fa0a82a517b  ;*ifne
                       ││││                                                ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
     2.11%    0.01%    ││││  0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10
     0.01%    0.07%    ││││  0x00007fa0a82a512c: add    0xe8(%r11),%r10    ;*ladd
                       ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
     7.73%    1.89%    ││││  0x00007fa0a82a5133: mov    %r10,%r9
     1.21%    1.84%    ││││  0x00007fa0a82a5136: shr    $0x21,%r9
     1.90%    0.03%    ││││  0x00007fa0a82a513a: xor    %r10,%r9
     2.02%    0.03%    ││││  0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx
     0.94%    1.82%    ││││  0x00007fa0a82a5147: imul   %rcx,%r9           ;*lmul
                       ││││                                                ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
     7.01%    2.40%    ││││  0x00007fa0a82a514b: mov    %r9,%rcx
                       ││││  0x00007fa0a82a514e: shr    $0x21,%rcx
     1.89%    0.70%    ││││  0x00007fa0a82a5152: xor    %r9,%rcx
     3.11%    2.55%    ││││  0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9
     0.99%    1.50%    ││││  0x00007fa0a82a515f: imul   %r9,%rcx
     7.66%    2.89%    ││││  0x00007fa0a82a5163: shr    $0x20,%rcx
     3.70%    1.97%    ││││  0x00007fa0a82a5167: mov    %ecx,%r9d
              0.11%    ││││  0x00007fa0a82a516a: and    $0x1,%r9d          ;*iand
                       ││││                                                ; - java.util.concurrent.ThreadLocalRandom::nextInt@34 (line 356)
     3.76%   11.13%    ││││  0x00007fa0a82a516e: cmp    $0x1,%r9d
                       │╰││  0x00007fa0a82a5172: je     0x00007fa0a82a5101
    10.48%   16.62%    │ ││  0x00007fa0a82a5174: test   %r9d,%r9d
                       │ ╰│  0x00007fa0a82a5177: je     0x00007fa0a82a5103  ;*lookupswitch
                       │  │                                                ; - org.openjdk.LoopInc::directInc@15 (line 19)
                       │  ╰  0x00007fa0a82a5179: jmp    0x00007fa0a82a5103  ;*aload_0
                       │                                                   ; - org.openjdk.LoopInc::directInc@11 (line 19)
                       ↘     0x00007fa0a82a517b: mov    $0xffffff5d,%esi
    
    indirectInc:
      0.01%    0.01%  ↗  0x00007f65588d8260: mov    %edx,%r9d
      0.01%           │  0x00007f65588d8263: nopw   0x0(%rax,%rax,1)
     11.99%   11.38%  │  0x00007f65588d826c: data16 data16 xchg %ax,%ax  ;*iconst_0
                      │                                                ; - org.openjdk.LoopInc::indirectInc@11 (line 34)
                      │  0x00007f65588d8270: mov    0xf0(%r8),%r10d    ;*invokevirtual getInt
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::current@9 (line 222)
                      │  0x00007f65588d8277: test   %r10d,%r10d
                      │  0x00007f65588d827a: je     0x00007f65588d8331  ;*ifne
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::current@12 (line 222)
      0.01%           │  0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10
     11.80%   11.49%  │  0x00007f65588d828a: add    0xe8(%r8),%r10     ;*ladd
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@24 (line 242)
      0.01%    0.01%  │  0x00007f65588d8291: mov    %r10,0xe8(%r8)     ;*invokevirtual putLong
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::nextSeed@27 (line 241)
                      │  0x00007f65588d8298: mov    %r9d,%edx
      0.01%    0.01%  │  0x00007f65588d829b: inc    %edx
     11.12%   12.40%  │  0x00007f65588d829d: mov    %r10,%rcx
               0.01%  │  0x00007f65588d82a0: shr    $0x21,%rcx
      0.03%           │  0x00007f65588d82a4: xor    %r10,%rcx
      0.06%    0.03%  │  0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10
     12.38%   13.94%  │  0x00007f65588d82b1: imul   %r10,%rcx          ;*lmul
                      │                                                ; - java.util.concurrent.ThreadLocalRandom::mix32@9 (line 182)
      0.03%    0.01%  │  0x00007f65588d82b5: mov    %rcx,%r10
                      │  0x00007f65588d82b8: shr    $0x21,%r10
               0.03%  │  0x00007f65588d82bc: xor    %rcx,%r10
     11.43%   12.62%  │  0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx
               0.01%  │  0x00007f65588d82c9: imul   %rcx,%r10
      0.34%    0.30%  │  0x00007f65588d82cd: shr    $0x20,%r10
      0.85%    0.76%  │  0x00007f65588d82d1: mov    %r10d,%r10d
     11.81%   11.51%  │  0x00007f65588d82d4: and    $0x1,%r10d
      2.16%    1.78%  │  0x00007f65588d82d8: cmp    $0x1,%r10d
      3.45%    3.00%  │  0x00007f65588d82dc: cmovne %r9d,%edx         <----- HERE IT IS
     17.55%   15.86%  │  0x00007f65588d82e0: inc    %r11d              ;*iinc
                      │                                                ; - org.openjdk.LoopInc::indirectInc@56 (line 33)
                      │  0x00007f65588d82e3: cmp    $0x3e8,%r11d
                      ╰  0x00007f65588d82ea: jl     0x00007f65588d8260  ;*if_icmpge
                                                               ; - org.openjdk.LoopInc::indirectInc@8 (line 33)

Notice the `cmovne` instead of a jump -- this is why we have more "predictable" branches. HotSpot profiles the branches and emits a conditional move when the branch profile is very flat. In other words, it dodges a very likely branch misprediction by paying a bit for the added latency of the conditional move. However, in this case the switch is special: it has *more than two* alternatives (0, 1, and "nothing"). This is why, I speculate, the `result` increment is not folded into a cmov. (Generally speaking, HotSpot could have stored zero to `result` in "default", but it blew it, oh well)
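
A side note that is not part of the original answer: the same effect can also be provoked at the source level by removing the data-dependent branch altogether, for example with a ternary. Whether C2 really emits a cmov for this hypothetical branchFreeInc variant would still have to be verified with -prof perfasm:

    @Benchmark
    public int branchFreeInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            // Only a two-way outcome here, so the JIT is free to use a conditional
            // move (or plain flag arithmetic) instead of an unpredictable jump.
            result += (getValue() == 1) ? 1 : 0;
        }
        return result;
    }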

  5. To confirm the hypothesis that the incomplete switch is what blocks the cmov, let's make a directCompleteInc case, where we still use the switch, but now cover all the cases:

    @Benchmark
    public int directCompleteInc() {
        int result = 0;
        for (int i = 0; i < 1000; i++) {
            switch (getValue()) {
                case 1:
                    result++;
                    break;
                default:
                    break;
            }
        }
        return result;
    }

...and measure it, and this time without any options, like the OP did:

    Benchmark                   Mode  Cnt    Score    Error   Units
    LoopInc.directCompleteInc  thrpt    5  644.414 ±  0.371  ops/ms
    LoopInc.directInc          thrpt    5  174.974 ±  0.103  ops/ms
    LoopInc.indirectInc        thrpt    5  644.015 ±  0.533  ops/ms

THERE. 

  6. Confirm directCompleteInc is using cmov with -prof perfasm. It does.

  7. Drink up.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author  | Original Content on Stackoverflow
Question           | apangin          | View Question on Stackoverflow
Solution 1 - Java  | Aleksey Shipilev | View Answer on Stackoverflow