Why does this delay-loop start to run faster after several iterations with no sleep?

C++LinuxPerformanceBenchmarking

C++ Problem Overview


Consider:

#include <time.h>
#include <unistd.h>
#include <iostream>
using namespace std;

const int times = 1000;
const int N = 100000;

void run() {
  for (int j = 0; j < N; j++) {
  }
}

int main() {
  clock_t main_start = clock();
  for (int i = 0; i < times; i++) {
    clock_t start = clock();
    run();
    cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
    //usleep(1000);
  }
  cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}

Here is the example code. In the first 26 iterations of the timing loop, the run function costs about 0.4 ms, but then the cost reduces to 0.2 ms.

When the usleep is uncommented, the delay-loop takes 0.4 ms for all runs, never speeding up. Why?

The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It's run on Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, with 3.13.0-32-generic Ubuntu 14.04.1 LTS (Trusty Tahr).

C++ Solutions


Solution 1 - C++

After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.

If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).

If you tested on a Skylake with kernel support for the new power-management mode (where the hardware takes full control of the clock speed), ramp-up would happen much faster.

If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See https://stackoverflow.com/questions/36363613/why-cant-my-cpu-maintain-peak-performance-in-hpc for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)


Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)



comments on other theories:

re: David's theory that a potential context switch from usleep could pollute caches: That's not a bad idea in general, but it doesn't help explain this code.

Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!

In more detail for x86:

The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.

The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.

gcc -O0 will compile the OP's code to something like this (Godbolt Compiler explorer): keeping the loop counter in memory on the stack.

The empty loop keeps the loop counter in stack memory, so on a typical Intel x86 CPU the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple cache misses (~200 cycles each for code-fetch misses which prevent further instructions from issuing until they're resolved).

Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).

Even if the loop-counter was kept in a register, 100k cycles is a lot.

Solution 2 - C++

A call to usleep may or may not result in a context switch. If it does, it will take longer than if it doesn't.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionphyxnjView Question on Stackoverflow
Solution 1 - C++Peter CordesView Answer on Stackoverflow
Solution 2 - C++David SchwartzView Answer on Stackoverflow