Why is "while (i++ < n) {}" significantly slower than "while (++i < n) {}"

Tags: Java, Performance, Compiler Optimization, Post Increment, Pre Increment

Java Problem Overview


Apparently, on my Windows 8 laptop with HotSpot JDK 1.7.0_45 (with all compiler/VM options set to default), the loop below

final int n = Integer.MAX_VALUE;
int i = 0;
while (++i < n) {
}

is at least 2 orders of magnitude faster (~10 ms vs. ~5000 ms) than:

final int n = Integer.MAX_VALUE;
int i = 0;
while (i++ < n) {
}

I happened to notice this while writing a loop to evaluate a different, unrelated performance issue, and the difference between ++i < n and i++ < n was large enough to significantly influence the result.

If we look at the bytecode, the loop body of the faster version is:

iinc
iload
ldc
if_icmplt

And for the slower version:

iload
iinc
ldc
if_icmplt

So for ++i < n, the local variable i is first incremented by 1 and then pushed onto the operand stack, while i++ < n performs those two steps in the reverse order. But that doesn't seem to explain why the former is so much faster. Is there a temp copy involved in the latter case? Or is it something beyond the bytecode (VM implementation, hardware, etc.) that is responsible for the performance difference?

I've read some other discussions regarding ++i and i++ (not exhaustively, though), but didn't find any answer that is Java-specific and directly related to the case where ++i or i++ is involved in a value comparison.
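For reference, a minimal class that reproduces the two loop bodies above (compiling it and running javap -c on the class shows the iinc/iload ordering quoted; the class name is chosen here only for illustration):

// Minimal reproduction of the two loops; inspect the bytecode with
// "javap -c IncrementLoops" after compiling.
class IncrementLoops {
    static void preIncrement() {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (++i < n) {}   // iinc, iload, ldc, if_icmplt
    }

    static void postIncrement() {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (i++ < n) {}   // iload, iinc, ldc, if_icmplt
    }
}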

Java Solutions


Solution 1 - Java

As others have pointed out, the test is flawed in many ways.

You did not tell us exactly how you did this test. However, I tried to implement a "naive" test (no offense) like this:

class PrePostIncrement
{
    public static void main(String args[])
    {
        for (int j=0; j<3; j++)
        {
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runPreIncrement();
                long after = System.nanoTime();
                System.out.println("pre  : "+(after-before)/1e6);
            }
            for (int i=0; i<5; i++)
            {
                long before = System.nanoTime();
                runPostIncrement();
                long after = System.nanoTime();
                System.out.println("post : "+(after-before)/1e6);
            }
        }
    }

    private static void runPreIncrement()
    {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (++i < n) {}
    }

    private static void runPostIncrement()
    {
        final int n = Integer.MAX_VALUE;
        int i = 0;
        while (i++ < n) {}
    }
}

When running this with default settings, there seems to be a small difference. But the real flaw of the benchmark becomes obvious when you run it with the -server flag. The results in my case are then something like this:

...
pre  : 6.96E-4
pre  : 6.96E-4
pre  : 0.001044
pre  : 3.48E-4
pre  : 3.48E-4
post : 1279.734543
post : 1295.989086
post : 1284.654267
post : 1282.349093
post : 1275.204583

Obviously, the pre-increment version has been completely optimized away. The reason is rather simple: The result is not used. It does not matter at all whether the loop is executed or not, so the JIT simply removes it.

This is confirmed by a look at the HotSpot disassembly. The pre-increment version results in this code:

[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x0000000055060500} 'runPreIncrement' '()V' in 'PrePostIncrement'
  #           [sp+0x20]  (sp of caller)
  0x000000000286fd80: sub    $0x18,%rsp
  0x000000000286fd87: mov    %rbp,0x10(%rsp)    ;*synchronization entry
												; - PrePostIncrement::runPreIncrement@-1 (line 28)

  0x000000000286fd8c: add    $0x10,%rsp
  0x000000000286fd90: pop    %rbp
  0x000000000286fd91: test   %eax,-0x243fd97(%rip)        # 0x0000000000430000
												;   {poll_return}
  0x000000000286fd97: retq   
  0x000000000286fd98: hlt    
  0x000000000286fd99: hlt    
  0x000000000286fd9a: hlt    
  0x000000000286fd9b: hlt    
  0x000000000286fd9c: hlt    
  0x000000000286fd9d: hlt    
  0x000000000286fd9e: hlt    
  0x000000000286fd9f: hlt    

The post-increment version results in this code:

[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00000000550605b8} 'runPostIncrement' '()V' in 'PrePostIncrement'
  #           [sp+0x20]  (sp of caller)
  0x000000000286d0c0: sub    $0x18,%rsp
  0x000000000286d0c7: mov    %rbp,0x10(%rsp)    ;*synchronization entry
												; - PrePostIncrement::runPostIncrement@-1 (line 35)

  0x000000000286d0cc: mov    $0x1,%r11d
  0x000000000286d0d2: jmp    0x000000000286d0e3
  0x000000000286d0d4: nopl   0x0(%rax,%rax,1)
  0x000000000286d0dc: data32 data32 xchg %ax,%ax
  0x000000000286d0e0: inc    %r11d              ; OopMap{off=35}
												;*goto
												; - PrePostIncrement::runPostIncrement@11 (line 36)

  0x000000000286d0e3: test   %eax,-0x243d0e9(%rip)        # 0x0000000000430000
												;*goto
												; - PrePostIncrement::runPostIncrement@11 (line 36)
												;   {poll}
  0x000000000286d0e9: cmp    $0x7fffffff,%r11d
  0x000000000286d0f0: jl     0x000000000286d0e0  ;*if_icmpge
												; - PrePostIncrement::runPostIncrement@8 (line 36)

  0x000000000286d0f2: add    $0x10,%rsp
  0x000000000286d0f6: pop    %rbp
  0x000000000286d0f7: test   %eax,-0x243d0fd(%rip)        # 0x0000000000430000
												;   {poll_return}
  0x000000000286d0fd: retq   
  0x000000000286d0fe: hlt    
  0x000000000286d0ff: hlt    

It's not entirely clear to me why it seemingly does not remove the post-increment version (in fact, I am considering asking this as a separate question). But at least this explains why you might see differences of an "order of magnitude"...
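As a side note, one way to keep the JIT from eliminating such a loop is to make its result observable, for example like this (a minimal sketch, not part of the benchmark above; the method name is made up here):

// Sketch: returning i (and having the caller actually use the returned
// value, e.g. sum it up and print it) gives the loop an observable
// result, so the JIT can no longer remove it as dead code.
private static int runPreIncrementUsed() {
    final int n = Integer.MAX_VALUE;
    int i = 0;
    while (++i < n) {}
    return i;
}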


EDIT: Interestingly, when changing the upper limit of the loop from Integer.MAX_VALUE to Integer.MAX_VALUE-1, both versions are optimized away and require "zero" time. Somehow this limit (which still appears as 0x7fffffff in the assembly) prevents the optimization. Presumably, this has something to do with the comparison being mapped to a (signed!) cmp instruction, but I cannot give a profound reason beyond that. The JIT works in mysterious ways...
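For reference, the variant mentioned in this edit differs only in the upper bound (a sketch of the changed method; everything else stays as above):

// Only the bound changes: with Integer.MAX_VALUE - 1 both the pre- and
// post-increment versions were reportedly optimized away in the setup
// described above.
private static void runPostIncrement() {
    final int n = Integer.MAX_VALUE - 1;
    int i = 0;
    while (i++ < n) {}
}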

Solution 2 - Java

The difference between ++i and i++ is that ++i effectively increments the variable and 'returns' that new value. i++, on the other hand, effectively creates a temp variable to hold the current value of i, then increments the variable, 'returning' the temp variable's value. This is where the extra overhead comes from.

// i++ evaluates to something like this
// Imagine though that somehow i was passed by reference
int temp = i;
i = i + 1;
return temp;

// ++i evaluates to
i = i + 1;
return i;

In your case, it appears that the increment won't be optimized away by the JVM because you are using the result in an expression. The JVM can, on the other hand, optimize a loop like this:

for( int i = 0; i < Integer.MAX_VALUE; i++ ) {}

This is because the result of i++ is never used. In a loop like this, you should be able to use either ++i or i++ and get the same performance, as the sketch below illustrates.
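To see this, it is enough to compile the two loop variants below and compare their bytecode (a quick sketch; javap -c shows a single iinc in both cases, because the value of the increment expression is discarded):

// When the result of the increment is not used, javac emits the same
// iinc instruction for both forms, so these loops compile identically.
for (int i = 0; i < Integer.MAX_VALUE; i++) {}
for (int i = 0; i < Integer.MAX_VALUE; ++i) {}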

Solution 3 - Java

EDIT 2

You should really look here:

http://hg.openjdk.java.net/code-tools/jmh/file/f90aef7f1d2c/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java

EDIT: The more I think about it, the more I realise that this test is somewhat flawed; the loop will get seriously optimized by the JVM.

I think that you should just drop the @Param and let n = 2.

This way you will test the performance of the while loop itself. The results I get in this case:

o.m.t.WhileTest.testFirst      avgt         5        0.787        0.086    ns/op
o.m.t.WhileTest.testSecond     avgt         5        0.782        0.087    ns/op

There is almost no difference.

The very first question you should ask yourself is how you test and measure this. This is micro-benchmarking, and in Java that is an art; an ordinary user (like me) will almost always get the results wrong. You should rely on a proper benchmark harness and a very good tool for that. I used JMH to test this:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@State(Scope.Benchmark)
public class WhileTest {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(".*" + WhileTest.class.getSimpleName() + ".*")
                .threads(1)
                .build();

        new Runner(opt).run();
    }

    @Param({"100", "10000", "100000", "1000000"})
    private int n;

    /*
    @State(Scope.Benchmark)
    public static class HOLDER_I {
        int x;
    }
    */

    @Benchmark
    public int testFirst() {
        int i = 0;
        while (++i < n) {
        }
        return i;
    }

    @Benchmark
    public int testSecond() {
        int i = 0;
        while (i++ < n) {
        }
        return i;
    }
}

Someone far more experienced in JMH might correct these results (I really hope so, since I am not that well-versed in JMH yet), but they show that the difference is pretty darn small:

Benchmark                        (n)   Mode   Samples        Score  Score error    Units
o.m.t.WhileTest.testFirst        100   avgt         5        1.271        0.096    ns/op
o.m.t.WhileTest.testFirst      10000   avgt         5        1.319        0.125    ns/op
o.m.t.WhileTest.testFirst     100000   avgt         5        1.327        0.241    ns/op
o.m.t.WhileTest.testFirst    1000000   avgt         5        1.311        0.136    ns/op
o.m.t.WhileTest.testSecond       100   avgt         5        1.450        0.525    ns/op
o.m.t.WhileTest.testSecond     10000   avgt         5        1.563        0.479    ns/op
o.m.t.WhileTest.testSecond    100000   avgt         5        1.418        0.428    ns/op
o.m.t.WhileTest.testSecond   1000000   avgt         5        1.344        0.120    ns/op

The Score field is the one you are interested in.
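As an alternative to returning i, JMH's Blackhole can be used to consume the value explicitly; a minimal sketch of such a benchmark method (assuming the same n field as in the class above, plus an import of org.openjdk.jmh.infra.Blackhole):

// Consuming i via the Blackhole has the same effect as returning it:
// the JIT cannot treat the loop as dead code.
@Benchmark
public void testFirstBlackhole(Blackhole bh) {
    int i = 0;
    while (++i < n) {
    }
    bh.consume(i);
}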

Solution 4 - Java

This test is probably not enough to draw conclusions from, but I would say that if this is the case, the JVM can optimize this expression by changing i++ to ++i, since the stored value of i++ (the pre-increment value) is never used in this loop.

Solution 5 - Java

I suggest that you (whenever possible) always use ++c rather than c++, as the former will never be slower: conceptually, a deep copy of c has to be taken in the latter case in order to return the previous value.

Indeed, many optimisers will optimise away an unnecessary copy, but they can't easily do that if you're making use of the expression value. And you're doing just that in your case.
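To illustrate that last point with a hypothetical snippet (not from the question): the copy only matters when the value of the expression is actually used, as it is in the loop condition here.

// Hypothetical example where the expression value is used, so the
// pre/post distinction changes the result.
int[] values = {10, 20, 30};
int i = 0;
int a = values[i++];   // reads values[0], then i becomes 1
int b = values[++i];   // i becomes 2 first, then reads values[2]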

Many folk disagree, though: they see it as a micro-optimisation.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type       | Original Author | Original Content on Stackoverflow
Question           | sikan           | View Question on Stackoverflow
Solution 1 - Java  | Marco13         | View Answer on Stackoverflow
Solution 2 - Java  | Smith_61        | View Answer on Stackoverflow
Solution 3 - Java  | Eugene          | View Answer on Stackoverflow
Solution 4 - Java  | danibuiza       | View Answer on Stackoverflow
Solution 5 - Java  | Bathsheba       | View Answer on Stackoverflow