Empty loop is slower than a non-empty one in C

Tags: C, Performance, Loops

C Problem Overview


While trying to measure how long a single line of C code takes to execute, I noticed this weird thing:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main (char argc, char * argv[]) {
    time_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<(1<<31)-1; i++) {
        A += B%2;
    }
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    return(0);
}

Which, when executed, displays:

5.873425
4.826874

Why does the empty loop take more time than the second one, which has an instruction inside it? Of course I've tried many variants, but every time an empty loop takes more time than one with a single instruction inside.

Note that I have tried swapping the order of the loops and adding some warmup code, and it didn't change the problem at all.

I'm using Code::Blocks as my IDE with the GNU gcc compiler, on Ubuntu 14.04 Linux, with a quad-core Intel i5 at 2.3 GHz (I've tried running the program on a single core; it doesn't change the result).

C Solutions


Solution 1 - C

Assuming that your code uses a 32-bit int type (which your system probably does), then nothing can be determined from your code. Instead, it exhibits undefined behavior. Compiling it (here with clang) makes the problems visible:

foo.c:5:5: error: first parameter of 'main' (argument count) must be of type 'int'
int main (char argc, char * argv[]) {
    ^
foo.c:13:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++);
                         ^
foo.c:19:26: warning: overflow in expression; result is 2147483647 with type 'int' [-Winteger-overflow]
    for (i = 0; i<(1<<31)-1; i++) {
                         ^

Let's try to fix that:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <limits.h>

int main (int argc, char * argv[]) {
    time_t begin, end;
    uint64_t i;
    double total_time, free_time;
    int A = 1;
    int B = 1;

    begin = clock();
    for (i = 0; i<INT_MAX; i++);
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    begin = clock();
    for (i = 0; i<INT_MAX; i++) {
        A += B%2;
    }
    end = clock();
    free_time = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("%f\n", free_time);

    return(0);
}

Now, let's look at the assembly output of this code. Personally, I find LLVM's internal assembly very readable, so I'm going to show that. I'll produce it by running:

clang -O3 foo.c -S -emit-llvm -std=gnu99

Here's the relevant portion of the output (the main function):

define i32 @main(i32 %argc, i8** nocapture readnone %argv) #0 {
  %1 = tail call i64 @"\01_clock"() #3
  %2 = tail call i64 @"\01_clock"() #3
  %3 = sub nsw i64 %2, %1
  %4 = sitofp i64 %3 to double
  %5 = fdiv double %4, 1.000000e+06
  %6 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %5) #3
  %7 = tail call i64 @"\01_clock"() #3
  %8 = tail call i64 @"\01_clock"() #3
  %9 = sub nsw i64 %8, %7
  %10 = sitofp i64 %9 to double
  %11 = fdiv double %10, 1.000000e+06
  %12 = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([4 x i8]* @.str, i64 0, i64 0), double %11) #3
  ret i32 0
}

Note that in both cases there are no operations between the calls to clock(): the compiler has removed the loops entirely, so both measurements time exactly the same thing.
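In other words, with -O3 both timed regions are empty. A rough C-level paraphrase of that IR (my paraphrase, not compiler output) would be:

#include <stdio.h>
#include <time.h>

int main(void) {
    clock_t begin, end;

    /* First "loop": nothing left between the two clock() calls */
    begin = clock();
    end = clock();
    printf("%f\n", (double)(end - begin) / CLOCKS_PER_SEC);

    /* Second "loop": also nothing left between the two clock() calls */
    begin = clock();
    end = clock();
    printf("%f\n", (double)(end - begin) / CLOCKS_PER_SEC);

    return 0;
}

Both printed values measure only the overhead of two back-to-back clock() calls, which is why the original timing code cannot tell you anything once the optimizer is enabled.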

Solution 2 - C

The fact is that modern processors are complicated. All the instructions executed interact with each other in complicated and interesting ways. Thanks to "that other guy" for posting the code.

Both the OP and "that other guy" apparently found that the short loop takes 11 cycles, while the long one takes 9 cycles. For the long loop, 9 cycles is plenty of time even though there are a lot of operations. For the short loop, there must be some stall caused by it being so short, and just adding a nop makes the loop long enough to avoid the stall.

Here's what happens if we look at the code:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

We read i and write it back (addq). We read it immediately again, and compare it (cmpq). And then we loop. But the loop uses branch prediction. So at the time when the addq is executed, the processor isn't really sure it is allowed to write to i (because branch prediction could be wrong).

Then we compare i. The processor will try to avoid reading i from memory, because reading it takes a long time. Instead, some bit of hardware remembers that we just wrote to i by adding to it, and instead of reading i, the cmpq instruction can get the data straight from the store instruction. Unfortunately, we are not sure at this point whether the write to i actually happened or not! So that could introduce a stall here.

The problem here is that the conditional jump, the addq which leads to a conditional store, and the cmpq which isn't sure where to get its data from, are all unusually close together. It could be that they are so close together that the processor cannot figure out in time whether to take i from the store instruction or to read it from memory, so it reads it from memory, which is slower because it has to wait for the store to finish. Adding just one nop gives the processor enough time (a C-level sketch of that experiment follows the two lists below).

Usually you think that there is RAM, and there is cache. On a modern Intel processor, a read can be satisfied from (slowest to fastest):

  1. Memory (RAM)
  2. L3 cache (optional)
  3. L2 cache
  4. L1 cache
  5. Previous store instruction that hasn't written to L1 cache yet.

So what the processor does internally in the short, slow loop:

  1. Read i from L1 cache
  2. Add 1 to i
  3. Write i to L1 cache
  4. Wait until i is written to L1 cache
  5. Read i from L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.

In the long, fast loop the processor does:

  1. Lots of stuff
  2. Read i from L1 cache
  3. Add 1 to i
  4. Do a "store" instruction that will write i to L1 cache
  5. Read i directly from the "store" instruction without touching L1 cache
  6. Compare i with INT_MAX
  7. Branch to (1) if it is less.
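Solution 4 below performs exactly this nop experiment at the assembly level. If you want to try it from C, one way (a sketch of mine, assuming GCC-style inline assembly and compiling without optimization, as in the question) is:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main (void) {
    clock_t begin, end;
    uint64_t i;

    begin = clock();
    for (i = 0; i < INT_MAX; i++) {
        __asm__ volatile ("nop");  /* one extra instruction in the otherwise empty body */
    }
    end = clock();
    printf("%f\n", (double)(end - begin) / CLOCKS_PER_SEC);

    return 0;
}

If the explanation above is right, this padded "empty" loop should time about the same as the loop that does A += B%2.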

Solution 3 - C

This answer assumes that you've already understood and addressed the excellent points regarding undefined behavior that sharth makes in his answer. He also points out tricks the compiler may play on your code. You should take steps to make sure the compiler doesn't recognize the entire loop as useless. For example, changing the iterator declaration to volatile uint64_t i; will prevent removal of the loop, and volatile int A; will ensure that the second loop actually does more work than the first. But even if you do all of that, you may still discover that:

Code later in a program may well execute more quickly than earlier code.

The clock() library function could have caused an icache miss after reading the timer and before returning. This would add some extra time to the first measured interval. (For later calls, the code is already in cache.) However, this effect will be tiny, certainly too small for clock() to measure, even if it was a page fault all the way to disk. Random context switches can add to either time interval.

More importantly, you have an i5 CPU, which has dynamic clocking. When your program begins execution, the clock rate is most likely low, because the CPU has been idle. Just running the program makes the CPU no longer idle, so after a short delay the clock speed will increase. The ratio between the idle and TurboBoosted CPU clock frequencies can be significant. (On my ultrabook's Haswell i5-4200U, the former multiplier is 8 and the latter is 26, making startup code run less than 30% as fast as later code! "Calibrated" loops for implementing delays are a terrible idea on modern computers!)

Including a warmup phase (running the benchmark repeatedly and throwing the first result away) for more precise timing is not only for managed frameworks with JIT compilers!
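Putting the volatile suggestions and the warmup advice together, a minimal sketch (mine, not code from the answer) could look like this:

#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Empty loop: the volatile counter forces a real load/store per iteration,
   so the compiler cannot remove the loop. */
static double time_empty(void) {
    volatile uint64_t i;
    clock_t begin = clock();
    for (i = 0; i < INT_MAX; i++);
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}

/* Loop with a body: volatile A guarantees the addition is actually performed
   every iteration, so this loop really does more work than the empty one. */
static double time_with_body(void) {
    volatile uint64_t i;
    volatile int A = 1;
    int B = 1;
    clock_t begin = clock();
    for (i = 0; i < INT_MAX; i++) {
        A += B % 2;
    }
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}

int main(void) {
    (void)time_empty();                /* warmup run: result thrown away */
    printf("%f\n", time_empty());      /* measured after the CPU has sped up */
    printf("%f\n", time_with_body());
    return 0;
}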

Solution 4 - C

I am able to reproduce this with GCC 4.8.2-19ubuntu1 with no optimization:

$ ./a.out 
4.780179
3.762356

Here is the empty loop:

0x00000000004005af <+50>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b4 <+55>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bc <+63>:    jb     0x4005af <main+50>

And here's the non-empty one:

0x000000000040061a <+157>:   mov    -0x24(%rbp),%eax
0x000000000040061d <+160>:   cltd   
0x000000000040061e <+161>:   shr    $0x1f,%edx
0x0000000000400621 <+164>:   add    %edx,%eax
0x0000000000400623 <+166>:   and    $0x1,%eax
0x0000000000400626 <+169>:   sub    %edx,%eax
0x0000000000400628 <+171>:   add    %eax,-0x28(%rbp)
0x000000000040062b <+174>:   addq   $0x1,-0x20(%rbp)
0x0000000000400630 <+179>:   cmpq   $0x7fffffff,-0x20(%rbp)
0x0000000000400638 <+187>:   jb     0x40061a <main+157>

Let's insert a nop in the empty loop:

0x00000000004005af <+50>:    nop
0x00000000004005b0 <+51>:    addq   $0x1,-0x20(%rbp)
0x00000000004005b5 <+56>:    cmpq   $0x7fffffff,-0x20(%rbp)
0x00000000004005bd <+64>:    jb     0x4005af <main+50>

They now run equally fast:

$ ./a.out 
3.846031
3.705035

I imagine this shows the importance of alignment, but I'm afraid I can't specifically say how :|

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type   | Original Author | Original Content
Question       | Celerio         | View Question on Stackoverflow
Solution 1 - C | Bill Lynch      | View Answer on Stackoverflow
Solution 2 - C | gnasher729      | View Answer on Stackoverflow
Solution 3 - C | Ben Voigt       | View Answer on Stackoverflow
Solution 4 - C | that other guy  | View Answer on Stackoverflow