When, if ever, is loop unrolling still useful?


Performance Problem Overview

I've been trying to optimize some extremely performance-critical code (a quicksort that's called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:

// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}

I tried unrolling to something like:

while(true) {
    if(!(myArray[++index1] < pivot)) break;  // stop at the first element that is not < pivot
    if(!(myArray[++index1] < pivot)) break;
    // More unrolling
}


while(true) {
    if(!(pivot < myArray[--index2])) break;  // stop at the first element that is not > pivot
    if(!(pivot < myArray[--index2])) break;
    // More unrolling
}

This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?

Performance Solutions

Solution 1 - Performance

Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.

A simple example:

for (int i=0; i<n; i++)
{
  sum += data[i];
}

Here every addition depends on the result of the previous one, so there is only a single dependency chain. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.

On the other hand this code:

for (int i=0; i<n-3; i+=4)  // note the n-3 bound for starting i + 0..3
{
  sum1 += data[i+0];
  sum2 += data[i+1];
  sum3 += data[i+2];
  sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever

could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.

(See https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs, with the throughput vs. latency characteristics of their pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer latencies but SIMD throughput is often similar.)
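As a minimal, self-contained sketch of the multiple-accumulator idea above (the function names `sum_rolled` and `sum_unrolled4` are made up for illustration, and `double` is an assumption, since the original snippet does not state the element type), the cleanup loop for the final 0..3 elements can be written like this:

```c
#include <stddef.h>

/* Rolled version: every addition depends on the previous one. */
double sum_rolled(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Unrolled by four with four independent accumulators, so a stall in
   one chain does not block the other three. */
double sum_unrolled4(const double *data, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    double sum = (s0 + s1) + (s2 + s3);

    /* Cleanup loop: the final 0..3 elements when n is not a multiple of 4. */
    for (; i < n; i++)
        sum += data[i];

    return sum;
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits of the result. That is also why a compiler will generally not make this transformation for floating-point sums on its own unless it is told reassociation is allowed (for example with GCC's `-ffast-math`).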

Solution 2 - Performance

Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:

for (int i=0; i<200; i++) {
  doStuff();
}

write:

for (int i=0; i<50; i++) {
  doStuff();
  doStuff();
  doStuff();
  doStuff();
}

Even then it almost certainly won't matter, but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
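The example above works because 200 is a multiple of 4. For an arbitrary count, a sketch along the following lines adds a cleanup loop for the leftover iterations. Here `do_stuff`, `calls`, and `repeat_unrolled4` are hypothetical names, with `do_stuff` standing in for `doStuff()` and merely counting its invocations:

```c
#include <stddef.h>

/* Hypothetical stand-in for doStuff(): counts how often it is called. */
unsigned long calls = 0;

void do_stuff(void)
{
    calls++;
}

/* Calls do_stuff() exactly n times, testing the loop condition roughly
   n/4 times instead of n times. */
void repeat_unrolled4(size_t n)
{
    size_t i = 0;

    /* Main loop: four calls per comparison. */
    for (; i + 4 <= n; i += 4) {
        do_stuff();
        do_stuff();
        do_stuff();
        do_stuff();
    }

    /* Cleanup loop: the remaining 0..3 calls. */
    for (; i < n; i++)
        do_stuff();
}
```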

Manual loop unrolling in general is largely an artifact of history, however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it to whatever is best.

Basically there's increasingly less need to second-guess your compiler.

Solution 3 - Performance

Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.

It would be worthwhile finding out how many optimizations your compiler does for you.

I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.

Solution 4 - Performance

As far as I understand it, modern compilers already unroll loops where appropriate. One example is gcc: when passed the relevant optimisation flag (-funroll-loops), the manual says it will:

> Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.

So, in practice, it's likely that your compiler will handle the trivial cases for you. It's therefore up to you to write as many of your loops as possible so that the compiler can easily determine how many iterations they need.
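To make the distinction concrete, here is a rough sketch of three loop shapes. All names (`BLOCK`, `sum_block`, `sum_n`, `first_negative`) are invented for illustration, and whether a particular compiler actually unrolls any of them depends on its version and flags:

```c
#include <stddef.h>

enum { BLOCK = 8 };  /* hypothetical compile-time trip count */

/* Trip count is a compile-time constant. */
int sum_block(const int *v)
{
    int sum = 0;
    for (int i = 0; i < BLOCK; i++)
        sum += v[i];
    return sum;
}

/* Trip count is not known at compile time, but it is fixed on entry
   to the loop because n never changes inside it. */
long sum_n(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* The exit depends on the data, so the number of iterations is not
   known on entry. Returns the index of the first negative element,
   or n if there is none. */
size_t first_negative(const int *v, size_t n)
{
    size_t i = 0;
    while (i < n && v[i] >= 0)
        i++;
    return i;
}
```

The scanning loops in the question resemble the third shape: how far they run depends on the array contents, so they do not fit the manual's description quoted above.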

Solution 5 - Performance

Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
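A very rough sketch of such a comparison, using only the standard `clock()` function, might look like the following. The names `sum_fn`, `sum_rolled`, `sum_unrolled4`, and `time_variant` are hypothetical, and an integer sum is used only as a placeholder workload. A serious benchmark would also need warm-up runs, repeated measurements, and realistic data sizes.

```c
#include <stddef.h>
#include <time.h>

/* Signature shared by the variants being compared. */
typedef long long (*sum_fn)(const int *data, size_t n);

/* Rolled baseline. */
long long sum_rolled(const int *data, size_t n)
{
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Hand-unrolled by four, with a cleanup loop for the last 0..3 elements. */
long long sum_unrolled4(const int *data, size_t n)
{
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    long long sum = s0 + s1 + s2 + s3;
    for (; i < n; i++)
        sum += data[i];
    return sum;
}

/* Runs fn `reps` times and returns the CPU time used, in seconds.
   The accumulated result is stored in *checksum so that the variants
   can be checked against each other and the calls are harder for the
   compiler to discard. */
double time_variant(sum_fn fn, const int *data, size_t n, int reps,
                    long long *checksum)
{
    long long total = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        total += fn(data, n);
    clock_t stop = clock();
    *checksum = total;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling `time_variant` once with `sum_rolled` and once with `sum_unrolled4` on the same array, on each target CPU and with the real build flags, gives two times to compare and two checksums that should match.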

Solution 6 - Performance

Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?

All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.

Here's an example of how to get maximum performance.

Solution 7 - Performance

Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!

It can, for instance, allow scalar replacement or efficient insertion of software prefetching. You would actually be surprised how useful it can be: you can easily get a 10% speedup on most loops, even with -O3, by unrolling aggressively.

As was said before, though, it depends a lot on the loop and the compiler, and experimentation is necessary. It's hard to make a rule (otherwise the compiler heuristic for unrolling would be perfect).
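As one illustration of scalar replacement, consider computing `b[i] = a[i] + a[i+1]`. This is a hand-written sketch with made-up function names, not something taken from a particular compiler. The rolled loop reads every interior element of `a` twice. Unrolling by two lets each loaded value live in a local variable and be reused by the next statement, without an extra copy to shift values between iterations:

```c
#include <stddef.h>

/* Rolled version: a[i+1] is loaded here and again as a[i] in the
   next iteration. Writes n-1 results to b. */
void adjacent_sums_rolled(const int *a, int *b, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)
        b[i] = a[i] + a[i + 1];
}

/* Unrolled by two with scalar replacement: each element of a is loaded
   once inside the main loop and kept in x0 or x1. */
void adjacent_sums_unrolled2(const int *a, int *b, size_t n)
{
    if (n < 2)
        return;

    size_t m = n - 1;  /* number of outputs */
    int x0 = a[0];     /* invariant: x0 == a[i] at the top of the loop */
    size_t i = 0;

    for (; i + 2 <= m; i += 2) {
        int x1 = a[i + 1];
        b[i] = x0 + x1;
        x0 = a[i + 2];
        b[i + 1] = x1 + x0;
    }

    /* Cleanup: one leftover output when m is odd. */
    if (i < m)
        b[i] = x0 + a[i + 1];
}
```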

Solution 8 - Performance

Loop unrolling entirely depends on your problem size. It is entirely dependent on your algorithm being able to reduce the size into smaller groups of work. What you did above does not look like that. I am not sure if a Monte Carlo simulation can even be unrolled.

A good scenario for loop unrolling would be rotating an image, since you could rotate separate groups of work. To get this to work you would have to reduce the number of iterations.

Solution 9 - Performance

Loop unrolling is still useful if there are a lot of local variables both inside and outside the loop: it lets you reuse those registers more, instead of reserving one for the loop index.

In your example, you use a small number of local variables, so you are not overusing the registers.

The comparison (for the loop end) is also a major drawback if it is heavy (i.e. not a simple test instruction), especially if it depends on an external function.

Loop unrolling also helps the CPU's branch prediction, but those branches occur anyway.

Content Type	Original Author	Original Content on Stackoverflow
Question	dsimcha	View Question on Stackoverflow
Solution 1 - Performance	Nils Pipenbrinck	View Answer on Stackoverflow
Solution 2 - Performance	cletus	View Answer on Stackoverflow
Solution 3 - Performance	Peter Alexander	View Answer on Stackoverflow
Solution 4 - Performance	Rich Bradshaw	View Answer on Stackoverflow
Solution 5 - Performance	Paul R	View Answer on Stackoverflow
Solution 6 - Performance	Mike Dunlavey	View Answer on Stackoverflow
Solution 7 - Performance	Kamchatka	View Answer on Stackoverflow
Solution 8 - Performance	jwendl	View Answer on Stackoverflow
Solution 9 - Performance	LiraNuna	View Answer on Stackoverflow