Why can't GCC optimize the logical / bitwise AND pair in "x && (x & 4242)" to "x & 4242"?

C++ · GCC · Optimization · Compiler Optimization

C++ Problem Overview


Here are two functions which I claim do exactly the same thing:

bool fast(int x)
{
  return x & 4242;
}

bool slow(int x)
{
  return x && (x & 4242);
}

Logically they do the same thing, and just to be 100% sure I wrote a test that ran all four billion possible inputs through both of them, and they matched. (x & 4242 can only be non-zero if x has a bit set in one of the positions selected by 4242, which means x itself is non-zero, so the separate x != 0 test on the left of the logical && is redundant.) But the assembly code is a different story:

fast:
    andl    $4242, %edi
    setne   %al
    ret

slow:
    xorl    %eax, %eax
    testl   %edi, %edi
    je      .L3
    andl    $4242, %edi
    setne   %al
.L3:
    rep
    ret

I was surprised that GCC could not make the leap of logic to eliminate the redundant test. I tried g++ 4.4.3 and 4.7.2 with -O2, -O3, and -Os, all of which generated the same code. The platform is Linux x86_64.

Can someone explain why GCC shouldn't be smart enough to generate the same code in both cases?

Edit to add test harness:

#include <cstdlib>
#include <vector>
using namespace std;

// fast() and slow() as defined above must also be present in this file

int main(int argc, char* argv[])
{
    // make vector filled with numbers starting from argv[1]
    int seed = atoi(argv[1]);
    vector<int> v(100000);
    for (int j = 0; j < 100000; ++j)
        v[j] = j + seed;

    // count how many times the function returns true
    int result = 0;
    for (int j = 0; j < 100000; ++j)
        for (int i : v)
            result += slow(i); // or fast(i), try both

    return result;
}

I tested the above with clang 5.1 on Mac OS with -O3. It took 2.9 seconds using fast() and 3.8 seconds using slow(). If I instead use a vector of all zeros, there is no significant difference in performance between the two functions.
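
For reference, a minimal sketch of the exhaustive equivalence check mentioned at the top (assuming a 32-bit int; the cast from the unsigned counter is implementation-defined for values above INT_MAX, but wraps on common compilers):

#include <cstdint>
#include <cstdio>

bool fast(int x) { return x & 4242; }
bool slow(int x) { return x && (x & 4242); }

int main()
{
    std::uint32_t u = 0;
    do {
        int x = static_cast<int>(u);   // reinterpret the bit pattern as int
        if (fast(x) != slow(x)) {
            std::printf("mismatch at %d\n", x);
            return 1;
        }
    } while (++u != 0);                // wraps back to 0 after 0xFFFFFFFF
    std::printf("all four billion inputs match\n");
    return 0;
}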


Other compilers:

  • Mainline clang 3.7 and later do the optimization even for &&; clang 3.6 and earlier don't. https://godbolt.org/z/v5bjrvrP1
  • The latest GCC trunk (March 2022) and GCC 11.2 still don't.
  • Current MSVC does both parts with branches, not using setcc.
  • ICC emits asm like GCC's; the LLVM-based ICX behaves like clang. https://godbolt.org/z/cjKfr8r5b

C++ Solutions


Solution 1 - C++

Exactly why should it be able to optimize the code? You're assuming that any transformation that is valid will be done. That's not at all how optimizers work. They're not Artificial Intelligences. They simply work by parametrically replacing known patterns. E.g. Common Subexpression Elimination scans an expression for common subexpressions and hoists them so they are computed only once, provided that does not change side effects.

(BTW, CSE shows that optimizers are already quite aware of what code movement is allowed in the possible presence of side effects. They know that you have to be careful with &&. Whether expr && expr can be CSE-optimized or not depends on the side effects of expr.)
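
As a rough illustration of the kind of pattern such a pass matches, here is a toy example of what CSE-style rewriting does (hypothetical function names, not GCC internals): the left-hand form can become the right-hand form because a * b has no side effects.

int before(int a, int b)
{
    return (a * b) + (a * b);   // the subexpression a * b appears twice
}

int after(int a, int b)
{
    int t = a * b;              // computed once, then reused
    return t + t;
}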

So, in summary: which pattern do you think applies here?

Solution 2 - C++

You are correct that this appears to be a deficiency, and possibly an outright bug, in the optimizer.

Consider:

bool slow(int x)
{
  return x && (x & 4242);
}

bool slow2(int x)
{
  return (x & 4242) && x;
}

Assembly emitted by GCC 4.8.1 (-O3):

slow:
    xorl    %eax, %eax
    testl   %edi, %edi
    je      .L2
    andl    $4242, %edi
    setne   %al
.L2:
    rep ret

slow2:
    andl    $4242, %edi
    setne   %al
    ret

In other words, slow2 is misnamed.

I have only contributed the occasional patch to GCC, so whether my point of view carries any weight is debatable :-). But it is certainly strange, in my view, for GCC to optimize one of these and not the other. I suggest filing a bug report.

[Update]

Surprisingly small changes appear to make a big difference. For example:

bool slow3(int x)
{
  int y = x & 4242;
  return y && x;
}

...generates "slow" code again. I have no hypothesis for this behavior.

You can experiment with all of these on multiple compilers on Compiler Explorer (godbolt.org).

Solution 3 - C++

This is how your code looks on ARM, which should make slow run faster when the input is 0.

fast(int):
	movw	r3, #4242
	and	r3, r0, r3
	adds	r0, r3, #0
	movne	r0, #1
	bx	lr
slow(int):
	cmp	r0, #0
	bxeq	lr
	movw	r3, #4242
	and	r3, r0, r3
	adds	r0, r3, #0
	movne	r0, #1
	bx	lr

However, GCC optimizes very nicely once you actually start using such trivial functions.

bool foo() {
  	return fast(4242) && slow(42);
}

becomes

foo():
	mov	r0, #1
	bx	lr

My point is that sometimes such code requires more context to be optimized further, so why should the implementers of optimizers (improvers!) bother?

Another example:

bool bar(int c) {
  if (fast(c))
    return slow(c);
}

becomes

bar(int):
	movw	r3, #4242
	and	r3, r0, r3
	cmp	r3, #0
	movne	r0, #1
	bxne	lr
	bx	lr

Solution 4 - C++

To perform this optimization, one needs to study the expression for two distinct cases: x == 0, which simplifies to false, and x != 0, which simplifies to x & 4242, and then be smart enough to see that the second expression also yields the correct value for x == 0.

Let us imagine that the compiler performs this case analysis and finds the simplifications.

If x != 0, the expression simplifies to x & 4242.

If x == 0, the expression simplifies to false.

After simplification, we obtain two completely unrelated expressions. To reconcile them, the compiler would have to ask itself unnatural questions:

If x != 0, can false be used instead of x & 4242 anyway? [No]

If x == 0, can x & 4242 be used instead of false anyway? [Yes]
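
A sketch of that case analysis written out as runtime checks, just to make the two questions concrete (this is not how a compiler would prove it):

#include <cassert>

void case_analysis(int x)
{
    if (x == 0)
        assert((x & 4242) == 0);                         // x & 4242 is also "false" here
    else
        assert((x && (x & 4242)) == ((x & 4242) != 0));  // the extra x test changes nothing
}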

Solution 5 - C++

It is mildly interesting to note that this optimisation is not valid on all machines. Specifically, on a machine that uses the one's complement representation of negative numbers, negative zero compares equal to 0 but has a non-zero bit pattern (all ones), so for x holding negative zero:

x & 4242            is non-zero, i.e. true
x && (x & 4242)     is false

GCC has never supported such representations, but they are allowed for by the C standard.

Solution 6 - C++

The last compiler I worked on did not do these sorts of optimizations. Writing an optimizer pass to combine bitwise and logical operators like this will not speed up applications, mainly because people do not mix the operators that way very often. Many people don't feel comfortable with bitwise operators, and those who do will typically not write redundant operations that need to be optimized away.

If I go to the trouble of writing

return (x & 4242);

and I understand what that means, why would I bother with the extra step? For the same reason I would not write this suboptimal code:

if (x==0) return false;
if (x==1) return false;
if (x==0xFFFEFD6) return true;
if (x==4242) return true;
return (x & 4242);

There are just better uses of compiler devs' time than optimizing stuff that makes no difference. There are so many bigger fish to fry in compiler optimization.

Solution 7 - C++

C places fewer restrictions on the behavior of signed integral types than on unsigned integral types. Negative values in particular can legally do strange things with bit operations. If any possible argument to the bit operation has legally unconstrained behavior, the compiler can't remove it.

For example, "x/y == 1 || true" might crash the program if you divide by zero, so the compiler can't skip evaluating the division. Negative signed values and bit operations never actually do things like that on any common system, but I'm not sure the language definition rules it out.

You should try the code with unsigned ints and see if that helps. If it does, you'll know it's an issue with the types and not the expression.
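
For example, a variant to test that suggestion (same constants, unsigned parameter; whether it actually changes GCC's output is something you would have to check):

bool fast_u(unsigned x)
{
  return x & 4242u;
}

bool slow_u(unsigned x)
{
  return x && (x & 4242u);
}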

Solution 8 - C++

Not an answer, but a note on the topic, which could well be phrased: "should" the compiler optimize it?

Logical means bool: either 0 (false) or non-zero (true), and the operator that yields these is &&, with keyword and.

Bitwise means Boolean logic applied to each bit, and the operator is &, with keyword bitand.

&& essentially wraps each operand in (x != 0) ? 1 : 0, i.e. "is it non-zero?" or "if it's != 0 then it's 1".

& keeps only the bits that are set in both operands. That works as expected for bool values, but for anything else you just get the bits that are set in all the values.

You can play with the equivalents yourself. (The confusion arises because values != 0 also evaluate to true - which raises another question: shouldn't they just be "undefined" and generate a warning, to avoid people confusing the two?)

So if you're dealing with just bool values, you can use bitwise AND for both evaluations.

bool fast(bool x)
{
  return x & 4242;
}

bool slow(bool x)
{
  return x & (x & 4242);
}

That gets optimized just fine.

If each & operand is a 0, a 1, or a bool, then & is a drop-in replacement for &&. But (y && (x & z)) and (y & (x & z)) are not equivalent if any value is greater than 1. For example, 1 && (2 & 2) is true, while 1 & (2 & 2) is false. They happen to agree again for 1 && (3 & 3), but it should be clear that the two forms don't test the same thing: the former tests whether y is true and x & z is non-zero, while the latter keeps only the bits that are set in all of y, x and z.
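
A quick way to check those examples (expected output noted in the comments):

#include <iostream>

int main()
{
    std::cout << (1 && (2 & 2)) << '\n';   // 1: 2 & 2 is 2, which is non-zero
    std::cout << (1 & (2 & 2))  << '\n';   // 0: 1 and 2 share no set bits
    std::cout << (1 && (3 & 3)) << '\n';   // 1
    std::cout << (1 & (3 & 3))  << '\n';   // 1: the two forms happen to agree here
}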

See also: https://stackoverflow.com/questions/6577504/is-there-any-difference-between-and-with-bools/6577545#6577545 and https://stackoverflow.com/questions/47243955/boolean-values-as-8-bit-in-compilers-are-operations-on-them-inefficient

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | John Zwinck | View Question on Stackoverflow
Solution 1 - C++ | MSalters | View Answer on Stackoverflow
Solution 2 - C++ | Nemo | View Answer on Stackoverflow
Solution 3 - C++ | auselen | View Answer on Stackoverflow
Solution 4 - C++ | Yves Daoust | View Answer on Stackoverflow
Solution 5 - C++ | andypea | View Answer on Stackoverflow
Solution 6 - C++ | Marc | View Answer on Stackoverflow
Solution 7 - C++ | eggcrook | View Answer on Stackoverflow
Solution 8 - C++ | dagelf | View Answer on Stackoverflow