If statement vs if-else statement, which is faster?

C++, Performance, C++11, Assembly, Microbenchmark

C++ Problem Overview


I argued with a friend the other day about those two snippets. Which is faster, and why?

value = 5;
if (condition) {
    value = 6;
}

and:

if (condition) {
    value = 6;
} else {
    value = 5;
}

What if value is a matrix?

Note: I know that value = condition ? 6 : 5; exists and I expect it to be faster, but it wasn't an option.

Edit (requested by staff since question is on hold at the moment):

  • please answer by considering the x86 assembly generated by mainstream compilers (say g++, clang++, VC, MinGW), in both optimized and non-optimized builds, or MIPS assembly.
  • when the assembly differs, explain why one version is faster and when (e.g. "better because there is no branching, and branching has the following issue blahblah")

C++ Solutions


Solution 1 - C++

TL;DR: In unoptimized code, if without else seems negligibly more efficient, but with even the most basic level of optimization enabled the code is essentially rewritten to value = condition + 5.


I gave it a try and generated the assembly for the following code:

int ifonly(bool condition, int value)
{
    value = 5;
    if (condition) {
        value = 6;
    }
    return value;
}

int ifelse(bool condition, int value)
{
    if (condition) {
        value = 6;
    } else {
        value = 5;
    }
    return value;
}

On gcc 6.3 with optimizations disabled (-O0), the relevant difference is:

 mov     DWORD PTR [rbp-8], 5
 cmp     BYTE PTR [rbp-4], 0
 je      .L2
 mov     DWORD PTR [rbp-8], 6
.L2:
 mov     eax, DWORD PTR [rbp-8]

for ifonly, while ifelse has

 cmp     BYTE PTR [rbp-4], 0
 je      .L5
 mov     DWORD PTR [rbp-8], 6
 jmp     .L6
.L5:
 mov     DWORD PTR [rbp-8], 5
.L6:
 mov     eax, DWORD PTR [rbp-8]

The latter looks slightly less efficient because it has an extra jump, but both perform at least two and at most three assignments, so unless you really need to squeeze every last drop of performance (hint: unless you are working on a space shuttle you don't, and even then you probably don't) the difference won't be noticeable.

However, even with the lowest optimization level (-O1) both functions reduce to the same:

test    dil, dil
setne   al
movzx   eax, al
add     eax, 5

which is basically the equivalent of

return 5 + condition;

assuming condition is zero or one. Higher optimization levels don't really change the output, except they manage to avoid the movzx by efficiently zeroing out the EAX register at the start.


Disclaimer: You probably shouldn't write 5 + condition yourself (even though the standard guarantees that converting true to an integer type gives 1) because your intent might not be immediately obvious to people reading your code (which may include your future self). The point of this code is to show that what the compiler produces in both cases is (practically) identical. Ciprian Tomoiaga states it quite well in the comments:

> a human's job is to write code for humans and let the compiler write code for the machine.

Solution 2 - C++

The answer from CompuChip shows that for int they are both optimized to the same assembly, so it doesn't matter.

> What if value is a matrix?

I will interpret this in a more general way, i.e. what if value is of a type whose constructions and assignments are expensive (and moves are cheap).

then

T value = init1;
if (condition)
   value = init2;

is sub-optimal because when condition is true you perform an unnecessary initialization to init1, followed by a copy assignment.

T value;
if (condition)
   value = init2;
else
   value = init3;

This is better. But it is still sub-optimal if default construction is expensive, or if copy assignment is more expensive than initialization.

You have the conditional operator solution which is good:

T value = condition ? init1 : init2;

Or, if you don't like the conditional operator, you can create a helper function like this:

T create(bool condition)
{
  if (condition)
     return {init1};
  else
     return {init2};
}

T value = create(condition);

Depending on what init1 and init2 are you can also consider this:

auto final_init = condition ? init1 : init2;
T value = final_init;

But again I must emphasize that this is relevant only when construction and assignment are really expensive for the given type. And even then, only profiling tells you for sure.

Solution 3 - C++

In pseudo-assembly language,

    li    #0, r0
    test  r1
    beq   L1
    li    #1, r0
L1:

may or may not be faster than

    test  r1
    beq   L1
    li    #1, r0
    bra   L2
L1:
    li    #0, r0
L2:

depending on how sophisticated the actual CPU is. Going from simplest to fanciest:

  • With any CPU manufactured after roughly 1990, good performance depends on the code fitting within the instruction cache. When in doubt, therefore, minimize code size. This weighs in favor of the first example.

  • With a basic "in-order, five-stage pipeline" CPU, which is still roughly what you get in many microcontrollers, there is a pipeline bubble every time a branch—conditional or unconditional—is taken, so it is also important to minimize the number of branch instructions. This also weighs in favor of the first example.

  • Somewhat more sophisticated CPUs—fancy enough to do "out-of-order execution", but not fancy enough to use the best known implementations of that concept—may incur pipeline bubbles whenever they encounter write-after-write hazards. This weighs in favor of the second example, where r0 is written only once no matter what. These CPUs are usually fancy enough to process unconditional branches in the instruction fetcher, so you aren't just trading the write-after-write penalty for a branch penalty.

I don't know if anyone still makes this kind of CPU. However, the CPUs that do use the "best known implementations" of out-of-order execution are likely to cut corners on less frequently used instructions, so you need to be aware that this sort of thing can happen. A real example is the false data dependency on the destination registers of popcnt and lzcnt on Sandy Bridge CPUs.

  • At the highest end, the OOO engine will wind up issuing exactly the same sequence of internal operations for both code fragments—this is the hardware version of "don't worry about it, the compiler will generate the same machine code either way." However, code size still does matter, and now you also should be worrying about the predictability of the conditional branch. Branch prediction failures potentially cause a complete pipeline flush, which is catastrophic for performance; see https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array to understand how much difference this can make.

If the branch is highly unpredictable, and your CPU has conditional-set or conditional-move instructions, this is the time to use them:

<!-- language: lang-none -->

        li    #0, r0
        test  r1
        setne r0

or

<!-- language: lang-none -->

        li    #0, r0
        li    #1, r2
        test  r1
        movne r2, r0

The conditional-set version is also more compact than any other alternative; if that instruction is available it is practically guaranteed to be the Right Thing for this scenario, even if the branch was predictable.  The conditional-move version requires an additional scratch register, and always wastes one `li` instruction's worth of dispatch and execute resources; if the branch was in fact predictable, the branchy version may well be faster.

Solution 4 - C++

In unoptimised code, the first example always assigns the variable once and sometimes twice. The second example only ever assigns the variable once. The conditional is the same on both code paths, so that shouldn't matter. In optimised code, it depends on the compiler.

As always, if you are that concerned, generate the assembly and see what the compiler is actually doing.

Solution 5 - C++

What would make you think any of them, even the one-liner, is faster or slower?

unsigned int fun0 ( unsigned int condition, unsigned int value )
{
    value = 5;
    if (condition) {
        value = 6;
    }
    return(value);
}
unsigned int fun1 ( unsigned int condition, unsigned int value )
{

    if (condition) {
        value = 6;
    } else {
        value = 5;
    }
    return(value);
}
unsigned int fun2 ( unsigned int condition, unsigned int value )
{
    value = condition ? 6 : 5;
    return(value);
}

More lines of code in a high-level language give the compiler more to work with, so if you want to make a general rule about it, give the compiler more code to work with. If the algorithm is the same, as in the cases above, then one would expect the compiler, even with minimal optimization, to figure that out.

00000000 <fun0>:
   0:	e3500000 	cmp	r0, #0
   4:	03a00005 	moveq	r0, #5
   8:	13a00006 	movne	r0, #6
   c:	e12fff1e 	bx	lr

00000010 <fun1>:
  10:	e3500000 	cmp	r0, #0
  14:	13a00006 	movne	r0, #6
  18:	03a00005 	moveq	r0, #5
  1c:	e12fff1e 	bx	lr

00000020 <fun2>:
  20:	e3500000 	cmp	r0, #0
  24:	13a00006 	movne	r0, #6
  28:	03a00005 	moveq	r0, #5
  2c:	e12fff1e 	bx	lr
 

Not a big surprise: it did the first function in a different order, but the execution time is the same.

0000000000000000 <fun0>:
   0:	7100001f 	cmp	w0, #0x0
   4:	1a9f07e0 	cset	w0, ne
   8:	11001400 	add	w0, w0, #0x5
   c:	d65f03c0 	ret

0000000000000010 <fun1>:
  10:	7100001f 	cmp	w0, #0x0
  14:	1a9f07e0 	cset	w0, ne
  18:	11001400 	add	w0, w0, #0x5
  1c:	d65f03c0 	ret

0000000000000020 <fun2>:
  20:	7100001f 	cmp	w0, #0x0
  24:	1a9f07e0 	cset	w0, ne
  28:	11001400 	add	w0, w0, #0x5
  2c:	d65f03c0 	ret

Hopefully you get the idea: you could have just tried this yourself if it wasn't obvious that the different implementations were not actually different.

As far as a matrix goes, I am not sure how that matters:

if(condition)
{
 big blob of code a
}
else
{
 big blob of code b
}

The compiler is just going to put the same if-then-else wrapper around the big blobs of code, whether they are value=5 or something more complicated. Likewise, the comparison, even if it is a big blob of code, still has to be computed. And equal-to or not-equal-to something is often compiled with the negative: if (condition) do something is often compiled as if not condition, goto.

00000000 <fun0>:
   0:	0f 93       	tst	r15		
   2:	03 24       	jz	$+8      	;abs 0xa
   4:	3f 40 06 00 	mov	#6,	r15	;#0x0006
   8:	30 41       	ret			
   a:	3f 40 05 00 	mov	#5,	r15	;#0x0005
   e:	30 41       	ret			

00000010 <fun1>:
  10:	0f 93       	tst	r15		
  12:	03 20       	jnz	$+8      	;abs 0x1a
  14:	3f 40 05 00 	mov	#5,	r15	;#0x0005
  18:	30 41       	ret			
  1a:	3f 40 06 00 	mov	#6,	r15	;#0x0006
  1e:	30 41       	ret			

00000020 <fun2>:
  20:	0f 93       	tst	r15		
  22:	03 20       	jnz	$+8      	;abs 0x2a
  24:	3f 40 05 00 	mov	#5,	r15	;#0x0005
  28:	30 41       	ret			
  2a:	3f 40 06 00 	mov	#6,	r15	;#0x0006
  2e:	30 41

We just went through this exercise with someone else recently on Stack Overflow. Interestingly, in that case this MIPS compiler not only realized the functions were the same, but had one function simply jump to the other to save code space. It didn't do that here, though:

00000000 <fun0>:
   0:	0004102b 	sltu	$2,$0,$4
   4:	03e00008 	jr	$31
   8:	24420005 	addiu	$2,$2,5

0000000c <fun1>:
   c:	0004102b 	sltu	$2,$0,$4
  10:	03e00008 	jr	$31
  14:	24420005 	addiu	$2,$2,5

00000018 <fun2>:
  18:	0004102b 	sltu	$2,$0,$4
  1c:	03e00008 	jr	$31
  20:	24420005 	addiu	$2,$2,5

Some more targets:

00000000 <_fun0>:
   0:	1166           	mov	r5, -(sp)
   2:	1185           	mov	sp, r5
   4:	0bf5 0004      	tst	4(r5)
   8:	0304           	beq	12 <_fun0+0x12>
   a:	15c0 0006      	mov	$6, r0
   e:	1585           	mov	(sp)+, r5
  10:	0087           	rts	pc
  12:	15c0 0005      	mov	$5, r0
  16:	1585           	mov	(sp)+, r5
  18:	0087           	rts	pc

0000001a <_fun1>:
  1a:	1166           	mov	r5, -(sp)
  1c:	1185           	mov	sp, r5
  1e:	0bf5 0004      	tst	4(r5)
  22:	0204           	bne	2c <_fun1+0x12>
  24:	15c0 0005      	mov	$5, r0
  28:	1585           	mov	(sp)+, r5
  2a:	0087           	rts	pc
  2c:	15c0 0006      	mov	$6, r0
  30:	1585           	mov	(sp)+, r5
  32:	0087           	rts	pc

00000034 <_fun2>:
  34:	1166           	mov	r5, -(sp)
  36:	1185           	mov	sp, r5
  38:	0bf5 0004      	tst	4(r5)
  3c:	0204           	bne	46 <_fun2+0x12>
  3e:	15c0 0005      	mov	$5, r0
  42:	1585           	mov	(sp)+, r5
  44:	0087           	rts	pc
  46:	15c0 0006      	mov	$6, r0
  4a:	1585           	mov	(sp)+, r5
  4c:	0087           	rts	pc

00000000 <fun0>:
   0:	00a03533          	snez	x10,x10
   4:	0515                	addi	x10,x10,5
   6:	8082                	ret

00000008 <fun1>:
   8:	00a03533          	snez	x10,x10
   c:	0515                	addi	x10,x10,5
   e:	8082                	ret

00000010 <fun2>:
  10:	00a03533          	snez	x10,x10
  14:	0515                	addi	x10,x10,5
  16:	8082                	ret

And other compilers:

With this IR code one would expect the different targets to match as well:

define i32 @fun0(i32 %condition, i32 %value) #0 {
  %1 = icmp ne i32 %condition, 0
  %. = select i1 %1, i32 6, i32 5
  ret i32 %.
}

; Function Attrs: norecurse nounwind readnone
define i32 @fun1(i32 %condition, i32 %value) #0 {
  %1 = icmp eq i32 %condition, 0
  %. = select i1 %1, i32 5, i32 6
  ret i32 %.
}

; Function Attrs: norecurse nounwind readnone
define i32 @fun2(i32 %condition, i32 %value) #0 {
  %1 = icmp ne i32 %condition, 0
  %2 = select i1 %1, i32 6, i32 5
  ret i32 %2
}


00000000 <fun0>:
   0:	e3a01005 	mov	r1, #5
   4:	e3500000 	cmp	r0, #0
   8:	13a01006 	movne	r1, #6
   c:	e1a00001 	mov	r0, r1
  10:	e12fff1e 	bx	lr

00000014 <fun1>:
  14:	e3a01006 	mov	r1, #6
  18:	e3500000 	cmp	r0, #0
  1c:	03a01005 	moveq	r1, #5
  20:	e1a00001 	mov	r0, r1
  24:	e12fff1e 	bx	lr

00000028 <fun2>:
  28:	e3a01005 	mov	r1, #5
  2c:	e3500000 	cmp	r0, #0
  30:	13a01006 	movne	r1, #6
  34:	e1a00001 	mov	r0, r1
  38:	e12fff1e 	bx	lr


fun0:
	push.w	r4
	mov.w	r1, r4
	mov.w	r15, r12
	mov.w	#6, r15
	cmp.w	#0, r12
	jne	.LBB0_2
	mov.w	#5, r15
.LBB0_2:
	pop.w	r4
	ret

fun1:
	push.w	r4
	mov.w	r1, r4
	mov.w	r15, r12
	mov.w	#5, r15
	cmp.w	#0, r12
	jeq	.LBB1_2
	mov.w	#6, r15
.LBB1_2:
	pop.w	r4
	ret


fun2:
	push.w	r4
	mov.w	r1, r4
	mov.w	r15, r12
	mov.w	#6, r15
	cmp.w	#0, r12
	jne	.LBB2_2
	mov.w	#5, r15
.LBB2_2:
	pop.w	r4
	ret

Now technically there is a performance difference in some of these solutions: sometimes the result-is-5 case has a jump over the result-is-6 code, and vice versa. Is a branch faster than falling through? One could argue either way, and the execution times should vary. But that is more a matter of if condition versus if not condition in the code, with the compiler doing the "if this, jump over the else, otherwise fall through" arrangement; it is not necessarily due to the coding style, but to how the comparison, the if case, and the else case are expressed in whatever syntax.

Solution 6 - C++

Ok, since assembly is one of the tags, I will just assume your code is pseudocode (and not necessarily C) and hand-translate it into 6502 assembly.

1st Option (without else)

        ldy #$00
        lda #$05
        dey
        bmi false
        lda #$06
false   brk

2nd Option (with else)

        ldy #$00
        dey
        bmi else
        lda #$06
        sec
        bcs end
else    lda #$05
end     brk

Assumptions: the condition is in the Y register; set it to 0 or 1 on the first line of either option. The result will be in the accumulator.

So, after counting cycles for both possibilities of each case, we see that the 1st construct is generally faster: 9 cycles when the condition is 0 and 10 cycles when it is 1, whereas option two is also 9 cycles when the condition is 0, but 13 cycles when it is 1. (Cycle counts do not include the BRK at the end.)

Conclusion: the if-only construct is faster than the if-else construct.

And for completeness, here is an optimized value = condition + 5 solution:

ldy #$00
clc         ; clear carry so ADC adds exactly 5 (the original LDA #$00 was dead code and left carry undefined)
tya
adc #$05
brk

This cuts our time down to 8 cycles (again not including the BRK at the end).

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Julien__ | View Question on Stackoverflow |
| Solution 1 - C++ | CompuChip | View Answer on Stackoverflow |
| Solution 2 - C++ | bolov | View Answer on Stackoverflow |
| Solution 3 - C++ | zwol | View Answer on Stackoverflow |
| Solution 4 - C++ | Neil | View Answer on Stackoverflow |
| Solution 5 - C++ | old_timer | View Answer on Stackoverflow |
| Solution 6 - C++ | Glen Yates | View Answer on Stackoverflow |