What is the advantage of GCC's __builtin_expect in if else statements?

Tags: C, Linux, GCC, built-in, likely/unlikely

C Problem Overview


I came across a #define in which they use __builtin_expect.
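
Such a #define usually wraps __builtin_expect the way the Linux kernel's likely/unlikely macros do; here is a minimal, self-contained sketch of that pattern (msg is just placeholder data):

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int main(void) {
    const char *msg = "hello";       /* placeholder data */
    if (unlikely(msg == NULL))       /* error path, expected not to be taken */
        return 1;
    puts(msg);                       /* common path */
    return 0;
}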

The documentation says:

> ### Built-in Function: long __builtin_expect (long exp, long c)
>
> You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
>
> The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example:
>
>     if (__builtin_expect (x, 0))
>       foo ();
>
> would indicate that we do not expect to call foo, since we expect x to be zero.

So why not directly use:

if (x)
    foo ();

instead of the complicated syntax with __builtin_expect?

C Solutions


Solution 1 - C

Imagine the assembly code that would be generated from:

if (__builtin_expect(x, 0)) {
    foo();
    ...
} else {
    bar();
    ...
}

I guess it should be something like:

  cmp   $x, 0
  jne   _foo
_bar:
  call  bar
  ...
  jmp   after_if
_foo:
  call  foo
  ...
after_if:

You can see that the instructions are arranged in such an order that the bar case precedes the foo case (as opposed to the C code). This can utilise the CPU pipeline better, since a jump thrashes the already fetched instructions.

Before the jump is executed, the instructions below it (the bar case) have already been pushed into the pipeline. Since the foo case is unlikely, taking the jump is unlikely too, hence thrashing the pipeline is unlikely.

Solution 2 - C

Let's disassemble the output to see what GCC 4.8 does with it.

Blagovest mentioned branch inversion to improve the pipeline, but do current compilers really do it? Let's find out!

Without __builtin_expect

#include "stdio.h"
#include "time.h"

int main() {
    /* Use time to prevent it from being optimized away. */
    int i = !time(NULL);
    if (i)
        puts("a");
    return 0;
}

Compile and decompile with GCC 4.8.2 x86_64 Linux:

gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o

Output:

0000000000000000 <main>:
   0:       48 83 ec 08             sub    $0x8,%rsp
   4:       31 ff                   xor    %edi,%edi
   6:       e8 00 00 00 00          callq  b <main+0xb>
                    7: R_X86_64_PC32        time-0x4
   b:       48 85 c0                test   %rax,%rax
   e:       75 0a                   jne    1a <main+0x1a>
  10:       bf 00 00 00 00          mov    $0x0,%edi
                    11: R_X86_64_32 .rodata.str1.1
  15:       e8 00 00 00 00          callq  1a <main+0x1a>
                    16: R_X86_64_PC32       puts-0x4
  1a:       31 c0                   xor    %eax,%eax
  1c:       48 83 c4 08             add    $0x8,%rsp
  20:       c3                      retq

The instruction order in memory was unchanged: first the puts call and then the retq return.

With __builtin_expect

Now replace if (i) with:

if (__builtin_expect(i, 0))

and we get:

0000000000000000 <main>:
   0:       48 83 ec 08             sub    $0x8,%rsp
   4:       31 ff                   xor    %edi,%edi
   6:       e8 00 00 00 00          callq  b <main+0xb>
                    7: R_X86_64_PC32        time-0x4
   b:       48 85 c0                test   %rax,%rax
   e:       74 07                   je     17 <main+0x17>
  10:       31 c0                   xor    %eax,%eax
  12:       48 83 c4 08             add    $0x8,%rsp
  16:       c3                      retq
  17:       bf 00 00 00 00          mov    $0x0,%edi
                    18: R_X86_64_32 .rodata.str1.1
  1c:       e8 00 00 00 00          callq  21 <main+0x21>
                    1d: R_X86_64_PC32       puts-0x4
  21:       eb ed                   jmp    10 <main+0x10>

The puts call was moved to the very end of the function, after the retq return!

The new code is basically the same as:

int i = !time(NULL);
if (i)
    goto puts;
ret:
return 0;
puts:
puts("a");
goto ret;

This optimization was not done with -O0.

But good luck writing an example that actually runs faster with __builtin_expect than without; CPUs are really smart these days. My naive attempts are here.
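
If you want to try anyway, a micro-benchmark along the following lines is a starting point (a sketch, not from the original answer: the run_plain/run_hinted split and the data distribution are illustrative choices). On a modern out-of-order CPU the dynamic branch predictor usually learns the pattern quickly, so expect little or no measurable difference:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hot loop with a plain branch. */
static long run_plain(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] == 0)
            sum -= 1;           /* rare path */
        else
            sum += data[i];     /* common path */
    }
    return sum;
}

/* Same loop, but the rare branch is annotated as unlikely. */
static long run_hinted(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(data[i] == 0))
            sum -= 1;
        else
            sum += data[i];
    }
    return sum;
}

int main(void) {
    const size_t n = 1u << 24;
    int *data = malloc(n * sizeof *data);
    if (!data)
        return 1;
    for (size_t i = 0; i < n; i++)
        data[i] = (i % 1024 == 0) ? 0 : 1;   /* zeros are rare */

    clock_t t0 = clock();
    long a = run_plain(data, n);
    clock_t t1 = clock();
    long b = run_hinted(data, n);
    clock_t t2 = clock();

    printf("plain : %ld in %ld ticks\n", a, (long)(t1 - t0));
    printf("hinted: %ld in %ld ticks\n", b, (long)(t2 - t1));
    free(data);
    return 0;
}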

C++20 [[likely]] and [[unlikely]]

C++20 standardized these hints as the [[likely]] and [[unlikely]] attributes: https://stackoverflow.com/questions/51797959/how-to-use-c20s-likely-unlikely-attribute-in-if-else-statement They will likely (a pun!) do the same thing.

Solution 3 - C

The idea of __builtin_expect is to tell the compiler that you'll usually find that the expression evaluates to c, so that the compiler can optimize for that case.

I'd guess that someone thought they were being clever and that they were speeding things up by doing this.

Unfortunately, unless the situation is very well understood (it's likely that they have done no such thing), it may well have made things worse. The documentation even says:

> In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

In general, you shouldn't be using __builtin_expect unless:

  • You have a very real performance issue
  • You've already optimized the algorithms in the system appropriately
  • You've got performance data to back up your assertion that a particular case is the most likely (one way to collect such data is sketched below)
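
For that last point, GCC's profile-guided optimization workflow is the usual way to collect the data (a sketch; main.c is a placeholder file name):

gcc -O2 -fprofile-generate main.c -o main   # instrumented build
./main                                      # run on representative input
gcc -O2 -fprofile-use main.c -o main        # rebuild using the recorded profile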

Solution 4 - C

Well, as it says in the description, the first version adds a predictive element to the construction, telling the compiler that the x == 0 branch is the more likely one - that is, it's the branch that will be taken more often by your program.

With that in mind, the compiler can optimize the conditional so that it requires the least amount of work when the expected condition holds, at the expense of maybe having to do more work in case of the unexpected condition.

Take a look at how conditionals are implemented during the compilation phase, and also in the resulting assembly, to see how one branch may be less work than the other.
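
For example, one way to compare the two layouts with GCC (the file names are placeholders) is to emit the assembly for a variant with and without the hint and diff them:

gcc -O2 -S -o without_hint.s without_hint.c
gcc -O2 -S -o with_hint.s with_hint.c
diff without_hint.s with_hint.s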

However, I would only expect this optimization to have noticeable effect if the conditional in question is part of a tight inner loop that gets called a lot, since the difference in the resulting code is relatively small. And if you optimize it the wrong way round, you may well decrease your performance.

Solution 5 - C

I don't see any of the answers addressing the question that I think you were asking, paraphrased:

> Is there a more portable way of hinting branch prediction to the compiler?

The title of your question made me think of doing it this way:

if ( !x ) {} else foo();

If the compiler assumes that 'true' is more likely, it could optimize for not calling foo().

The problem here is just that you don't, in general, know what the compiler will assume -- so any code that uses this kind of technique would need to be carefully measured (and possibly monitored over time if the context changes).

Solution 6 - C

I tested this on a Mac, following @Blagovest Buyukliev and @Ciro. The assembly listings are clear, and I have added comments.

The commands are gcc -c -O3 -std=gnu11 testOpt.c; otool -tVI testOpt.o

With -O3, the output looks the same whether __builtin_expect(i, 0) is present or not.

testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000	pushq	%rbp     
0000000000000001	movq	%rsp, %rbp    // open function stack
0000000000000004	xorl	%edi, %edi       // set time args 0 (NULL)
0000000000000006	callq	_time      // call time(NULL)
000000000000000b	testq	%rax, %rax   // check time(NULL)  result
000000000000000e	je	0x14           //  jump to 0x14 if the testq result is 0, i.e. to the puts part
0000000000000010	xorl	%eax, %eax   //  return 0  (the return path comes first)
0000000000000012	popq	%rbp    //  return 0
0000000000000013	retq                     //  return 0
0000000000000014	leaq	0x9(%rip), %rdi  ## literal pool for: "a"  // the puts part comes afterwards
000000000000001b	callq	_puts
0000000000000020	xorl	%eax, %eax
0000000000000022	popq	%rbp
0000000000000023	retq

When compiled with -O2, the output differs with and without __builtin_expect(i, 0).

First, without it:

testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000	pushq	%rbp
0000000000000001	movq	%rsp, %rbp
0000000000000004	xorl	%edi, %edi
0000000000000006	callq	_time
000000000000000b	testq	%rax, %rax
000000000000000e	jne	0x1c       //   jump to 0x1c (the return) if not zero
0000000000000010	leaq	0x9(%rip), %rdi ## literal pool for: "a"   //   the puts part appears first, right after the jne
0000000000000017	callq	_puts
000000000000001c	xorl	%eax, %eax     // the return part comes afterwards
000000000000001e	popq	%rbp
000000000000001f	retq

Now with __builtin_expect(i, 0):

testOpt.o:
(__TEXT,__text) section
_main:
0000000000000000	pushq	%rbp
0000000000000001	movq	%rsp, %rbp
0000000000000004	xorl	%edi, %edi
0000000000000006	callq	_time
000000000000000b	testq	%rax, %rax
000000000000000e	je	0x14   // jump to 0x14 (the puts part) if zero, otherwise fall through and return
0000000000000010	xorl	%eax, %eax   // the return path appears first
0000000000000012	popq	%rbp
0000000000000013	retq
0000000000000014	leaq	0x7(%rip), %rdi ## literal pool for: "a"
000000000000001b	callq	_puts
0000000000000020	jmp	0x10

To summarize, __builtin_expect makes a difference in the last case (-O2): the puts branch is moved out of the straight-line path.

Solution 7 - C

In most cases, you should leave branch prediction as it is and you do not need to worry about it.

One case where it is beneficial is CPU-intensive algorithms with a lot of branching. In some cases, the jumps could make the hot code exceed the CPU's instruction cache, making the CPU wait while the next part of the program is fetched. By pushing the unlikely branches to the end, you keep the hot code close together and only jump in the unlikely cases.
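
As a sketch of that layout idea (not from the original answer; report_error and process are placeholder names), combining __builtin_expect with GCC's cold function attribute moves an error path out of the hot code:

#include <stdio.h>
#include <stdlib.h>

/* Calls to a cold function are treated as unlikely, and its code may be
   placed away from the hot path (e.g. in a .text.unlikely section). */
__attribute__((cold, noinline))
static void report_error(const char *msg) {
    fprintf(stderr, "error: %s\n", msg);
    exit(EXIT_FAILURE);
}

static int process(int value) {
    if (__builtin_expect(value < 0, 0))   /* unlikely error path */
        report_error("negative value");
    return value * 2;                     /* hot path stays compact */
}

int main(void) {
    printf("%d\n", process(21));
    return 0;
}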

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | RajSanpui | View Question on Stackoverflow |
| Solution 1 - C | Blagovest Buyukliev | View Answer on Stackoverflow |
| Solution 2 - C | Ciro Santilli Путлер Капут 六四事 | View Answer on Stackoverflow |
| Solution 3 - C | Michael Kohne | View Answer on Stackoverflow |
| Solution 4 - C | Kerrek SB | View Answer on Stackoverflow |
| Solution 5 - C | Brent Bradburn | View Answer on Stackoverflow |
| Solution 6 - C | Victor Choy | View Answer on Stackoverflow |
| Solution 7 - C | Alexis Paques | View Answer on Stackoverflow |