In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing)

Tags: C++, Performance, Caching, Optimization, Strict Aliasing

C++ Problem Overview


Consider the following code (p is of type unsigned char* and bitmap->width is of some integer type, exactly which is unknown and depends on which version of some external library we're using):

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Is it worth optimizing it [..]

Could there be a case where writing it this way yields more efficient results:

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

... or is this trivial for the compiler to optimize?

What would you consider to be "better" code?

Note from editor (Ike): for those wondering about the strikeout text, the original question, as phrased, was dangerously close to off-topic territory and was very close to being closed in spite of positive feedback. These have been stricken out. Yet please do not punish the answerers who addressed these stricken sections of the question.

C++ Solutions


Solution 1 - C++

At first glance, I thought the compiler could generate equivalent assembly for both versions with optimization flags activated. When I checked it, I was surprised to see the result:

Source unoptimized.cpp

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    for (unsigned x = 0 ; x < static_cast<unsigned>(bitmap.width) ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}
Source optimized.cpp

note: this code is not meant to be executed.

struct bitmap_t
{
    long long width;
} bitmap;

int main(int argc, char** argv)
{
    const unsigned width = static_cast<unsigned>(bitmap.width);
    for (unsigned x = 0 ; x < width ; ++x)
    {
        argv[x][0] = '\0';
    }
    return 0;
}

Compilation
  • $ g++ -S -O3 unoptimized.cpp
  • $ g++ -S -O3 optimized.cpp

Assembly (unoptimized.s)
	.file	"unoptimized.cpp"
	.text
	.p2align 4,,15
.globl main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	.cfi_personality 0x3,__gxx_personality_v0
	movl	bitmap(%rip), %eax
	testl	%eax, %eax
	je	.L2
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L3:
	mov	%eax, %edx
	addl	$1, %eax
	movq	(%rsi,%rdx,8), %rdx
	movb	$0, (%rdx)
	cmpl	bitmap(%rip), %eax
	jb	.L3
.L2:
	xorl	%eax, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
.globl bitmap
	.bss
	.align 8
	.type	bitmap, @object
	.size	bitmap, 8
bitmap:
	.zero	8
	.ident	"GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
	.section	.note.GNU-stack,"",@progbits
Assembly (optimized.s)
	.file	"optimized.cpp"
	.text
	.p2align 4,,15
.globl main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	.cfi_personality 0x3,__gxx_personality_v0
	movl	bitmap(%rip), %eax
	testl	%eax, %eax
	je	.L2
	subl	$1, %eax
	leaq	8(,%rax,8), %rcx
	xorl	%eax, %eax
	.p2align 4,,10
	.p2align 3
.L3:
	movq	(%rsi,%rax), %rdx
	addq	$8, %rax
	cmpq	%rcx, %rax
	movb	$0, (%rdx)
	jne	.L3
.L2:
	xorl	%eax, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
.globl bitmap
	.bss
	.align 8
	.type	bitmap, @object
	.size	bitmap, 8
bitmap:
	.zero	8
	.ident	"GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-16)"
	.section	.note.GNU-stack,"",@progbits
diff
$ diff -uN unoptimized.s optimized.s
--- unoptimized.s	2015-11-24 16:11:55.837922223 +0000
+++ optimized.s	2015-11-24 16:12:02.628922941 +0000
@@ -1,4 +1,4 @@
-	.file	"unoptimized.cpp"
+	.file	"optimized.cpp"
 	.text
 	.p2align 4,,15
 .globl main
@@ -10,16 +10,17 @@
 	movl	bitmap(%rip), %eax
 	testl	%eax, %eax
 	je	.L2
+	subl	$1, %eax
+	leaq	8(,%rax,8), %rcx
 	xorl	%eax, %eax
 	.p2align 4,,10
 	.p2align 3
 .L3:
-	mov	%eax, %edx
-	addl	$1, %eax
-	movq	(%rsi,%rdx,8), %rdx
+	movq	(%rsi,%rax), %rdx
+	addq	$8, %rax
+	cmpq	%rcx, %rax
 	movb	$0, (%rdx)
-	cmpl	bitmap(%rip), %eax
-	jb	.L3
+	jne	.L3
 .L2:
 	xorl	%eax, %eax
 	ret

The generated assembly for the optimized version computes the loop bound once, up front (the lea before the loop), and compares against a register, whereas the unoptimized version re-reads bitmap from memory on every iteration (the cmpl bitmap(%rip) inside the loop).

When I get some time, I'll eventually post a benchmark on that. Good question.

Solution 2 - C++

There is actually insufficient information in your code snippet to be able to tell, and the one thing I can think of is aliasing. From our point of view, it's pretty clear that you don't want p and bitmap to point to the same location in memory, but the compiler doesn't know that, and (because p is of type char*) it has to make this code work even if p and bitmap overlap.

This means in this case that if the loop changes bitmap->width through the pointer p then that has to be seen when re-reading bitmap->width later on, which in turn means that storing it in a local variable would be illegal.

That being said, I believe some compilers will actually sometimes generate two versions of the same code (I have seen circumstantial evidence of this, but have never directly sought out information on what the compiler does in this case), quickly check at runtime whether the pointers alias, and run the faster version if it determines they don't.

Even so, I stand by my comment about simply measuring the performance of the two versions: my money is on not seeing any consistent performance difference between them.

In my opinion, questions like these are fine if your purpose is to learn about compiler optimization theories and techniques, but they are a waste of time (a useless micro-optimization) if your end goal is to make the program run faster.

Solution 3 - C++

Ok, guys, so I've measured it, with GCC -O3 (using GCC 4.9 on Linux x64).

Turns out, the second version runs 54% faster!

So I guess aliasing is the thing; I hadn't thought about it.

[Edit]

I've tried the first version again with all pointers declared __restrict__, and the results are the same. Weird... either aliasing is not the problem, or, for some reason, the compiler doesn't optimize it well even with __restrict__.

[Edit 2]

Ok, I think I was pretty much able to prove that aliasing is the problem. I repeated my original test, this time using an array rather than a pointer:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

And measured (had to use "-mcmodel=large" to link it). Then I tried:

const std::size_t n = 0x80000000ull;
bitmap->width = n;
static unsigned char d[n*3];
std::size_t i=0;
unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x < width;  ++x)
{
    d[i++] = 0xAA;
    d[i++] = 0xBB;
    d[i++] = 0xCC;
}

The measured results were the same; it seems the compiler was able to optimize it by itself.

Then I tried the original code (with a pointer p), but this time with p of type std::uint16_t*. Again, the two versions performed the same, thanks to strict aliasing. Then I tried building with "-fno-strict-aliasing", and once again saw a difference in time.

Solution 4 - C++

Other answers have pointed out that hoisting the read of bitmap->width out of the loop may change defined behaviour, because the aliasing rules allow char to alias anything; hence it is not an allowable optimisation for a compiler, even though in most cases it is obviously correct to a human programmer.

They have also pointed out that hoisting the operation out of the loop is usually, but not always, an improvement from a performance point of view, and is often a negative from a readability point of view.

I would like to point out that there is often a "third way": rather than counting up to the number of iterations you want, you can count down to zero. This means the number of iterations is needed only once, at the start of the loop, and doesn't have to be stored after that. Better still, at the assembler level it often eliminates the need for an explicit comparison, since the decrement operation usually sets flags indicating whether the counter was zero both before (carry flag) and after (zero flag) the decrement.

for (unsigned x = static_cast<unsigned>(bitmap->width); x > 0; x--)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

Note that this version of the loop gives x values in the range 1..width rather than the range 0..(width-1). That doesn't matter in your case because you aren't actually using x for anything, but it's something to be aware of. If you want a count-down loop with x values in the range 0..(width-1), you can do:

for (unsigned x = static_cast<unsigned>(bitmap->width); x-- > 0;)
{
    *p++ = 0xAA;
    *p++ = 0xBB;
    *p++ = 0xCC;
}

You can also get rid of the casts in the above examples, if you want, without worrying about the impact on comparison rules, since all you are doing with bitmap->width is assigning it directly to a variable.

Solution 5 - C++

The only thing here that can prevent the optimization is the strict aliasing rule. In short:

> "Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"
>
> […]
>
> The exception to the rule is a char*, which is allowed to point to any type.

The exception also applies to unsigned and signed char pointers.

This is the case in your code: You're modifying *p through p which is an unsigned char*, so the compiler must assume that it could point to bitmap->width. Hence the caching of bitmap->width is an invalid optimization. This optimization-preventing behavior is shown in YSC's answer.

Only if p pointed to a type that is neither a character type nor decltype(bitmap->width) would the caching be a possible optimization.

Solution 6 - C++

The question originally asked:

> Is it worth optimizing it?

And my answer to that (which garnered a good mix of both up- and downvotes):

> Let the compiler worry about it.
>
> The compiler will almost certainly do a better job than you. And there's no guarantee that your 'optimization' is any better than the 'obvious' code - have you measured it??
>
> More importantly, have you any proof that the code you're optimizing has any impact on the performance of your program?

Despite the downvotes (and now seeing the aliasing issue), I'm still happy with that as a valid answer. If you don't know if it's worth optimizing something, it probably isn't.

A rather different question, of course, would be this:

> How can I tell if it's worth optimizing a fragment of code?

First, does your application or library need to run faster than it currently does? Is the user kept waiting too long? Does your software forecast yesterday's weather instead of tomorrow's?

Only you can really tell this, based on what your software is for and what your users expect.

Assuming your software does need some optimization, the next thing to do is start measuring. Profilers will tell you where your code spends its time. If your fragment isn't showing up as a bottleneck, it's best left alone. Profilers and other measuring tools will also tell you whether your changes have made a difference. It's possible to spend hours attempting to optimize code, only to find you've made no discernible difference.

> What do you mean by 'optimizing', anyway?

If you're not writing 'optimized' code, then your code should be as clear, clean and concise as you can make it. The "premature optimization is evil" argument isn't an excuse for sloppy or inefficient code.

Optimized code normally sacrifices some of the attributes above for performance. It could involve introducing additional local variables, having objects with wider than expected scope, or even reversing normal loop ordering. All of these may be less clear or concise, so document (briefly!) why you're doing it.

But often, with 'slow' code, these micro-optimizations are the last resort. The first place to look is at algorithms and data structures. Is there a way of avoiding doing the work at all? Can linear searches be replaced with binary ones? Would a linked list be faster here than a vector? Or a hash table? Can I cache results? Making good 'efficient' decisions here can often affect performance by an order of magnitude or more!

Solution 7 - C++

I use the following pattern in situations like this. It is almost as short as your first case, and better than the second, because it keeps the temporary variable local to the loop.

for (unsigned int x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
{
  *p++ = 0xAA;
  *p++ = 0xBB;
  *p++ = 0xCC;
}

This will be faster with a less-than-smart compiler, a debug build, or certain compilation flags.

Edit 1: Hoisting a constant operation out of a loop is a good programming pattern. It shows an understanding of the basics of machine operation, especially in C/C++. I'd argue that the burden of proof should be on people who do not follow this practice. If a compiler punishes a good pattern, it is a bug in the compiler.

Edit 2: I've measured my suggestion against the original code on VS2013 and got a 1% improvement. Can we do better? A simple manual optimization gives a 3x improvement over the original loop on an x64 machine, without resorting to exotic instructions. The code below assumes a little-endian system and a properly aligned bitmap. TEST 0 is the original (9 sec), TEST 1 is faster (3 sec). I bet somebody could make this even faster, and the result would depend on the size of the bitmap. Someday the compiler will be able to produce the fastest code consistently; I'm afraid that will be when the compiler is also a programmer AI, and we'll all be out of work. But for now, just write code that shows you know the extra operation in the loop is not needed.

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <time.h>

struct Bitmap_line
{
  int blah;
  unsigned int width;
  Bitmap_line(unsigned int w)
  {
    blah = 0;
    width = w;
  }
};

#define TEST 0 //define 1 for faster test

int main(int argc, char* argv[])
{
  unsigned int size = (4 * 1024 * 1024) / 3 * 3; //makes it divisible by 3
  unsigned char* pointer = (unsigned char*)malloc(size);
  memset(pointer, 0, size);
  std::unique_ptr<Bitmap_line> bitmap(new Bitmap_line(size / 3));
  clock_t told = clock();
#if TEST == 0
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    for (unsigned x = 0; x < static_cast<unsigned>(bitmap->width); ++x)
    //for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#else
  for (int iter = 0; iter < 10000; iter++)
  {
    unsigned char* p = pointer;
    unsigned x = 0;
    for (const unsigned n = static_cast<unsigned>(bitmap->width) - 4; x < n; x += 4)
    {
      *(int64_t*)p = 0xBBAACCBBAACCBBAALL;
      p += 8;
      *(int32_t*)p = 0xCCBBAACC;
      p += 4;
    }

    for (const unsigned n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      *p++ = 0xAA;
      *p++ = 0xBB;
      *p++ = 0xCC;
    }
  }
#endif
  double ms = 1000.0 * double(clock() - told) / CLOCKS_PER_SEC;
  printf("time %0.3f\n", ms);

  {
    //verify
    unsigned char* p = pointer;
    for (unsigned x = 0, n = static_cast<unsigned>(bitmap->width); x < n; ++x)
    {
      if ((*p++ != 0xAA) || (*p++ != 0xBB) || (*p++ != 0xCC))
      {
        printf("EEEEEEEEEEEEERRRRORRRR!!!\n");
        abort();
      }
    }
  }

  return 0;
}

Solution 8 - C++

There are two things to consider.

A) How often will the optimization run?

If the answer is not very often, like only when a user clicks a button, then don't bother if it makes your code unreadable. If the answer is 1000 times a second then you will probably want to go with the optimization. If it is even a bit complex be sure to put a comment in to explain what is going on to help the next guy that comes along.

B) Will this make the code harder to upkeep/troubleshoot?

If you're not seeing a huge gain in performance then making your code cryptic simply to save a few clock ticks is not a good idea. Lots of people will tell you that any good programmer should be able to look at the code and figure out what is going on. This is true. The problem is that in the business world the extra time figuring that out costs money. So, if you can make it prettier to read then do it. Your friends will thank you for it.

That said I'd personally use the B example.

Solution 9 - C++

The compiler is able to optimize a lot of things. For your example, you should go for readability, maintainability, and whatever follows your code standard. For more information about what can be optimized (with GCC), see this blog post.

Solution 10 - C++

As a general rule, let the compiler do the optimization for you, until you determine that you should take over. The logic for this has nothing to do with performance, but rather with human readability. In the vast majority of cases, the readability of your program is more important than its performance. You should aim to write code which is easier for a human to read, and then only worry about optimization when you are convinced that performance is more important than the maintainability of your code.

Once you do see that performance matters, you should run a profiler on the code to determine which loops are being inefficient, and optimize those individually. There may indeed be cases where you want to do that optimization (especially if you migrate towards C++, where STL containers get involved), but the cost in terms of readability is great.

In addition, I can think of pathological situations where it could actually slow the code down. For example, consider the case where the compiler could not prove that bitmap->width was constant through the process. By adding the width variable you force the compiler to maintain a local variable in that scope. If, for some platform-specific reason, that extra variable prevented some stack-space optimization, the compiler may have to reorganize how it emits machine code, and produce something less efficient.

As an example, on Windows x64, one is obliged to call a special API, __chkstk, in the preamble of a function if it will use more than one page of local variables. This function gives Windows a chance to manage the guard pages it uses to expand the stack when needed. If your extra variable pushes stack usage from below one page to at or above one page, your function becomes obliged to call __chkstk every time it is entered. If you were to optimize this loop on a slow path, you may actually slow the fast path down more than you saved on the slow path!

Sure, it's a bit pathological, but the point of that example is that you can actually slow the code down. It just shows that you do have to profile your work to determine where the optimizations go. In the meantime, please don't sacrifice readability in any way for an optimization that may or may not matter.

Solution 11 - C++

The comparison is wrong since the two code snippets

for (unsigned x = 0;  x < static_cast<unsigned>(bitmap->width);  ++x)

and

unsigned width(static_cast<unsigned>(bitmap->width));
for (unsigned x = 0;  x<width ;  ++x)

are not equivalent.

In the first case, width is read from memory that may be written through p and is not const, so one cannot assume it doesn't change between subsequent iterations. Thus it cannot be cached, but has to be re-checked on every pass of the loop.

In your optimized case, a local variable is assigned the value of bitmap->width at some point during program execution, and the compiler can verify that it does not change afterwards.

Did you think about multi-threading? Or maybe the value is externally dependent, such that it is effectively volatile? How would one expect the compiler to figure all these things out if you don't tell it?

The compiler can only do as well as your code lets it.

Solution 12 - C++

Unless you know exactly how the compiler optimizes the code, it is better to do your own optimizations while keeping the code readable and well designed. In practice, it is hard to check the assembly of every function we write against every new compiler version.

Solution 13 - C++

The compiler cannot cache bitmap->width because its value can change between iterations. The most common reasons are:

  1. Multi-threading: the compiler cannot predict whether another thread is about to change the value.
  2. Modification inside the loop: it is not always simple to tell whether the variable will be changed inside the loop.
  3. The bound is a function call, e.g. iterator::end() or container::size(), and it is hard to predict whether it will always return the same result.

To sum up (my personal opinion): for places that require a high level of optimization, you need to do it yourself; in other places, just leave it. The compiler may or may not optimize it, and if there is no big difference, code readability is the main target.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  • Question: Yaron Cohen-Tal (View Question on Stackoverflow)
  • Solution 1 - C++: YSC (View Answer on Stackoverflow)
  • Solution 2 - C++: SirGuy (View Answer on Stackoverflow)
  • Solution 3 - C++: Yaron Cohen-Tal (View Answer on Stackoverflow)
  • Solution 4 - C++: plugwash (View Answer on Stackoverflow)
  • Solution 5 - C++: emlai (View Answer on Stackoverflow)
  • Solution 6 - C++: Roddy (View Answer on Stackoverflow)
  • Solution 7 - C++: 0kcats (View Answer on Stackoverflow)
  • Solution 8 - C++: soulsabr (View Answer on Stackoverflow)
  • Solution 9 - C++: Guillaume Racicot (View Answer on Stackoverflow)
  • Solution 10 - C++: Cort Ammon (View Answer on Stackoverflow)
  • Solution 11 - C++: g24l (View Answer on Stackoverflow)
  • Solution 12 - C++: Vinayak S M (View Answer on Stackoverflow)
  • Solution 13 - C++: ST3 (View Answer on Stackoverflow)