C++: Mysteriously huge speedup from keeping one operand in a register

C++ · C · Performance · Optimization · Assembly

C++ Problem Overview


I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array, using the following code (I am aware that I should just scale the result by 'a' at the end; the point is to do both a multiply and an add within the loop - so far, the compiler hasn't figured out that it can factor 'a' out):

double sum(double a,double* X,int size)
{
	double total = 0.0;
	for(int i = 0;  i < size; ++i)
	{
		total += a*X[i];
	}
	return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
	int operand_size = (32*KB)/(sizeof(double)*2);
	printf("Operand size: %d\n", operand_size);
	double* X = new double[operand_size];
	fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
	return 0;
}

Note that the timer() and fill() routines are not included for brevity; their full source can be found here if you want to run the code:

http://codepad.org/agPWItZS

Now, here is where it gets interesting. This is the output:

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

This is totally un-cached performance, despite the fact that all elements of X should be held in cache between loop iterations. Looking at the assembly code generated by:

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

I notice one oddity in the sum function loop:

L55:
	movsd	(%r12,%rax,8), %xmm0
	mulsd	%xmm1, %xmm0
	addsd	-72(%rbp), %xmm0
	movsd	%xmm0, -72(%rbp)
	incq	%rax
	cmpq	$2048, %rax
	jne	L55

The instructions:

	addsd	-72(%rbp), %xmm0
	movsd	%xmm0, -72(%rbp)

indicate that it is storing the value of "total" in sum() on the stack, and reading and writing it at every loop iteration. I modified the assembly so that this operand is kept in a register:

...
addsd	%xmm0, %xmm3
...

This small change creates a huge performance boost:

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl;dr My question is: why does replacing a single memory-location access with a register speed up the code so much, given that the single location should be stored in L1 cache? What architectural factors make this possible? It seems very strange that writing one stack location repeatedly would completely destroy the effectiveness of a cache.

Appendix

My gcc version is:

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

My CPU is:

Intel Xeon X5650

C++ Solutions


Solution 1 - C++

It's likely a combination of a longer dependency chain and Load Misprediction*.


Longer Dependency Chain:

First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)

In the unoptimized version, the critical dependency path is:

  • addsd -72(%rbp), %xmm0
  • movsd %xmm0, -72(%rbp)

Internally, it probably breaks up into:

  • load (2 cycles)
  • addsd (3 cycles)
  • store (3 cycles)

If we look at the optimized version, it's just:

  • addsd (3 cycles)

So you have 8 cycles vs. 3 cycles. Almost a factor of 3.
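
As a rough sanity check that these latencies line up with the measured numbers, here is a back-of-the-envelope calculation (a sketch only: it assumes the X5650's nominal 2.66 GHz clock, ignores Turbo Boost, and assumes the loop is limited purely by the loop-carried chain):

#include <cstdio>

int main()
{
    const double clock_hz       = 2.66e9; // assumed nominal X5650 clock, no Turbo Boost
    const double flops_per_iter = 2;      // one mulsd + one addsd per element

    // Loop-carried chain: 8 cycles/iteration (load + add + store) vs. 3 (add only)
    printf("spilled accumulator : ~%.0f mflops\n", 1e-6 * flops_per_iter * clock_hz / 8);
    printf("register accumulator: ~%.0f mflops\n", 1e-6 * flops_per_iter * clock_hz / 3);
    return 0;
}

That predicts roughly 665 vs. 1773 mflops, which is in the same ballpark as the measured 588.8 and 1958.9 (Turbo Boost, loop overhead, and imperfect overlap move the real numbers around).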

I'm not sure how sensitive the Nehalem processor line is to store-load dependencies and how well it does forwarding. But it's reasonable to believe that it's not zero.


Load-store Misprediction:

Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.

When a processor sees a load, it will issue it immediately, before all pending writes have finished. It assumes that those writes will not conflict with the loaded values.

If an earlier write turns out to conflict with a load, then the load must be re-executed and the computation rolled back to the point of the load. (in much the same way that branch mispredictions roll back)

How it is relevant here:

Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.

The result? The previous store conflicts with the load - thus a misprediction and a roll back.

*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.
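
Incidentally, the same store/reload round trip can be forced from C++ itself, which makes the effect easy to reproduce even with a compiler that would otherwise keep the accumulator in a register. This is only an illustration (the asker's code does not use volatile, and the name sum_spilled is mine): declaring the accumulator volatile obliges the compiler to write it to memory and read it back on every iteration, recreating the addsd/movsd pattern from the slow assembly.

double sum_spilled(double a, const double* X, int size)
{
    volatile double total = 0.0;      // forced out to memory on every iteration
    for (int i = 0; i < size; ++i)
        total = total + a * X[i];     // load total, add, store total
    return total;
}

Timing this variant against the original sum() should show roughly the same gap as hand-editing the assembly did.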

Solution 2 - C++

I would surmise the problem isn't in the cache/memory access but in the processor (execution of your code). There are several visible bottlenecks here.

The performance numbers here are based on the boxes I was using (either Sandy Bridge or Westmere).

Peak performance for scalar math is 2.7 GHz × 2 FLOPS/clock = 5.4 GFLOPS, since the processor can do an add and a multiply simultaneously. The theoretical efficiency of the code is 0.6/(2.7*2) = 11%.

Bandwidth needed: one 8-byte double is loaded for each (+)/(×) pair -> 4 bytes/FLOP; 4 bytes/FLOP * 5.4 GFLOPS = 21.6 GB/s.

If you know the data was read recently, it's likely in L1 (89 GB/s), L2 (42 GB/s) or L3 (24 GB/s), so we can rule out cache bandwidth.

The memory subsystem delivers 18.9 GB/s, so even out of main memory, performance should approach 18.9/21.6 = 87.5% of peak.
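
For the record, the arithmetic above in a few lines (nothing here is measured; these are just the figures quoted in this answer and in the question's output):

#include <cstdio>

int main()
{
    const double peak_gflops     = 2.7 * 2;   // 2.7 GHz, one add + one multiply per clock
    const double measured_gflops = 0.5888;    // 588.8 mflops from the question's output
    const double bytes_per_flop  = 8.0 / 2.0; // one 8-byte load per add+multiply pair
    const double dram_gbs        = 18.9;      // quoted memory-subsystem bandwidth

    printf("efficiency       : %.0f%%\n",    100 * measured_gflops / peak_gflops);              // ~11%
    printf("bandwidth needed : %.1f GB/s\n", bytes_per_flop * peak_gflops);                     // 21.6
    printf("DRAM-bound limit : %.1f%%\n",    100 * dram_gbs / (bytes_per_flop * peak_gflops));  // 87.5
    return 0;
}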

  • You may want to batch up work (via unrolling) as early as possible

Even with speculative execution, the adds in tot += a * X[i] will be serialized, because tot(n) needs to be evaluated before tot(n+1) can be kicked off.

First, unroll the loop: advance i in steps of 8 and do

double sum(double a, double* X, int size)
{   // your func, unrolled; assumes size is a multiple of 8
    double tot = 0.0;
    for( int i = 0; i < size; i += 8 ){
        tot += a * X[i];
        tot += a * X[i+1];
        tot += a * X[i+2];
        tot += a * X[i+3];
        tot += a * X[i+4];
        tot += a * X[i+5];
        tot += a * X[i+6];
        tot += a * X[i+7];
    }
    return tot;
}

Use multiple accumulators
This will break dependencies and allow us to avoid stalling on the addition pipeline

double sum(double a, double* X, int size)
{   // your func, with four accumulators; assumes size is a multiple of 8
    double tot = 0.0, tot2 = 0.0, tot3 = 0.0, tot4 = 0.0;
    for( int i = 0; i < size; i += 8 )
    {
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}
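
Two notes on this rewrite. First, splitting the sum across four accumulators changes the order of the floating-point additions, and FP addition is not associative, so a compiler will not make this transformation on its own under the default rules. Second, both sketches assume size is a multiple of 8 (true for the question's 2048); for arbitrary sizes you need a small remainder loop. A minimal version handling that (the name sum_any_size is mine):

double sum_any_size(double a, const double* X, int size)
{
    double tot = 0.0, tot2 = 0.0, tot3 = 0.0, tot4 = 0.0;
    int i = 0;
    for (; i + 8 <= size; i += 8)     // unrolled body, four independent accumulators
    {
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    for (; i < size; ++i)             // remainder: at most 7 elements
        tot += a * X[i];
    return tot + tot2 + tot3 + tot4;
}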

UPDATE: After running this on a Sandy Bridge box I have access to (2.7 GHz Sandy Bridge, compiled with -O2 -march=native -mtune=native):

Original code:

Operand size: 2048  
Vector size 2048: mflops=2206.2, result=61.8  
2.206 / 5.4 = 40.8%

Improved Code:

Operand size: 2048  
Vector size 2048: mflops=5313.7, result=61.8  
5.3137 / 5.4 = 98.4%  

Solution 3 - C++

I can't actually reproduce this because my compiler (gcc 4.7.2) keeps total in a register.

I suspect the main reason for the slowness doesn't have to do with the L1 cache, but rather is due to the data dependency between the store in

movsd   %xmm0, -72(%rbp)

and the load on the subsequent iteration:

addsd   -72(%rbp), %xmm0

Attributions

All content on this page is sourced from the original question and answers on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Question: Sam Manzer
Solution 1 - C++: Mysticial
Solution 2 - C++: UpAndAdam
Solution 3 - C++: NPE