Why can't GCC optimize the logical / bitwise AND pair in "x && (x & 4242)" to "x & 4242"?
Tags: c++, gcc, optimization, compiler-optimization

Problem Overview
Here are two functions which I claim do exactly the same thing:
bool fast(int x)
{
    return x & 4242;
}

bool slow(int x)
{
    return x && (x & 4242);
}
Logically they do the same thing, and just to be 100% sure I wrote a test that ran all four billion possible inputs through both of them, and they matched. (x & 4242 is only non-zero if it has set bits in specific positions, which means x has a non-zero value, so testing x != 0 separately as the other side of a logical && is redundant.) But the assembly code is a different story:
fast:
    andl  $4242, %edi
    setne %al
    ret

slow:
    xorl  %eax, %eax
    testl %edi, %edi
    je    .L3
    andl  $4242, %edi
    setne %al
.L3:
    rep ret
I was surprised that GCC could not make the leap of logic to eliminate the redundant test. I tried g++ 4.4.3 and 4.7.2 with -O2, -O3, and -Os, all of which generated the same code. The platform is Linux x86_64.
Can someone explain why GCC isn't smart enough to generate the same code in both cases?
Edit to add test harness:
#include <cstdlib>
#include <vector>
using namespace std;
int main(int argc, char* argv[])
{
    // make vector filled with numbers starting from argv[1]
    int seed = atoi(argv[1]);
    vector<int> v(100000);
    for (int j = 0; j < 100000; ++j)
        v[j] = j + seed;

    // count how many times the function returns true
    int result = 0;
    for (int j = 0; j < 100000; ++j)
        for (int i : v)
            result += slow(i); // or fast(i), try both

    return result;
}
I tested the above with clang 5.1 on Mac OS with -O3. It took 2.9 seconds using fast() and 3.8 seconds using slow(). If I instead use a vector of all zeros, there is no significant difference in performance between the two functions.
Other compilers:
- mainline clang 3.7 and later do the optimization even for &&; clang 3.6 and earlier don't. https://godbolt.org/z/v5bjrvrP1
- latest GCC trunk (March 2022) and 11.2 still don't.
- Current MSVC does both parts with branches, not using setcc.
- ICC makes asm like GCC's; LLVM-based ICX is like clang. https://godbolt.org/z/cjKfr8r5b
C++ Solutions
Solution 1 - C++
Exactly why should it be able to optimize the code? You're assuming that any transformation that works will be done. That's not at all how optimizers work. They're not artificial intelligences; they simply work by parametrically replacing known patterns. E.g. common subexpression elimination scans an expression for common subexpressions and hoists them forward, if that does not change side effects.
(BTW, CSE shows that optimizers are already quite aware of what code movement is allowed in the possible presence of side effects. They know that you have to be careful with &&. Whether expr && expr can be CSE-optimized or not depends on the side effects of expr.)
So, in summary: which pattern do you think applies here?
Solution 2 - C++
You are correct that this appears to be a deficiency, and possibly an outright bug, in the optimizer.
Consider:
bool slow(int x)
{
    return x && (x & 4242);
}

bool slow2(int x)
{
    return (x & 4242) && x;
}
Assembly emitted by GCC 4.8.1 (-O3):
slow:
    xorl  %eax, %eax
    testl %edi, %edi
    je    .L2
    andl  $4242, %edi
    setne %al
.L2:
    rep ret

slow2:
    andl  $4242, %edi
    setne %al
    ret
In other words, slow2 is misnamed.
I have only contributed the occasional patch to GCC, so whether my point of view carries any weight is debatable :-). But it is certainly strange, in my view, for GCC to optimize one of these and not the other. I suggest filing a bug report.
[Update]
Surprisingly small changes appear to make a big difference. For example:
bool slow3(int x)
{
    int y = x & 4242;
    return y && x;
}
...generates "slow" code again. I have no hypothesis for this behavior.
You can experiment with all of these on multiple compilers here.
Solution 3 - C++
This is how your code looks in ARM, which should make slow run faster when the input is 0.
fast(int):
    movw  r3, #4242
    and   r3, r0, r3
    adds  r0, r3, #0
    movne r0, #1
    bx    lr

slow(int):
    cmp   r0, #0
    bxeq  lr
    movw  r3, #4242
    and   r3, r0, r3
    adds  r0, r3, #0
    movne r0, #1
    bx    lr
However, GCC optimizes very nicely once such trivial functions are used in a wider context anyway.
bool foo() {
    return fast(4242) && slow(42);
}
becomes
foo():
    mov r0, #1
    bx  lr
My point is that such code sometimes requires more context to be optimized further, so why should the implementers of optimizers (improvers!) bother?
Another example:
bool bar(int c) {
    if (fast(c))
        return slow(c);
}
becomes
bar(int):
    movw  r3, #4242
    and   r3, r0, r3
    cmp   r3, #0
    movne r0, #1
    bxne  lr
    bx    lr
Solution 4 - C++
To perform this optimization, one needs to study the expression for two distinct cases: x == 0, simplifying to false, and x != 0, simplifying to x & 4242. And then be smart enough to see that the value of the second expression also yields the correct value even for x == 0.
Let us imagine that the compiler performs a case study and finds simplifications:
- If x != 0, the expression simplifies to x & 4242.
- If x == 0, the expression simplifies to false.

After simplification, we obtain two completely unrelated expressions. To reconcile them, the compiler should ask unnatural questions:
- If x != 0, can false be used instead of x & 4242 anyway? [No]
- If x == 0, can x & 4242 be used instead of false anyway? [Yes]
Solution 5 - C++
It is mildly interesting to note that this optimisation is not valid on all machines. Specifically, if you run on a machine which uses the ones' complement representation of negative numbers then:

-0 & 4242 == true
-0 && (-0 & 4242) == false
GCC has never supported such representations, but they are allowed for by the C standard.
Solution 6 - C++
The last compiler I worked on did not do these sorts of optimizations. Writing an optimizer to take advantage of optimizations that combine bitwise and logical operators will not speed up applications. The main reason for this is that people do not use bitwise operators like that very often. Many people don't feel comfortable with bitwise operators, and those who do will typically not write useless operations that need to be optimized.
If I go to the trouble of writing

    return (x & 4242);

and I understand what that means, why would I bother with the extra step? For the same reason I would not write this suboptimal code:

    if (x == 0) return false;
    if (x == 1) return true;
    if (x == 0xFFFEFD6) return false;
    if (x == 4242) return true;
    return (x & 4242);
There are just better uses of compiler developers' time than optimizing stuff that makes no difference. There are so many bigger fish to fry in compiler optimization.
Solution 7 - C++
C places fewer restrictions on the behavior of signed integral types than on unsigned integral types. Negative values in particular can legally do strange things with bit operations. If any possible argument to a bit operation has legally unconstrained behavior, the compiler can't remove it.

For example, "x/y == 1 || true" might crash the program if you divide by zero, so the compiler can't skip evaluating the division. Negative signed values and bit operations never actually do things like that on any common system, but I'm not sure the language definition rules it out.
You should try the code with unsigned ints and see if that helps. If it does you'll know it's an issue with the types and not the expression.
Solution 8 - C++
Not an answer, but a note on the topic - which could well be phrased "Should" the compiler optimize it:
Logical means bool, which is either 0 meaning false or non-zero meaning true, and the operator that yields these is && with keyword and.

Bitwise means boolean logic applied per bit, and the operator is & with keyword bitand.
&& essentially wraps each term with (x != 0) ? 1 : 0, i.e. "is it non-0?", or "if it's != 0 then it's 1".

& checks bit sameness, i.e. "give me the bits that are the same". This works as expected for bool values, but for any other values you just get the bits that are set in all of them.

You can play with equivalents here. (The confusion arises because values != 0 also evaluate to true, which raises another question: shouldn't they just be "undefined" and generate a warning, to avoid people mistaking these?)
So if you're dealing with just bool values, you can just bitwise AND for both evaluations.
bool fast(bool x)
{
    return x & 4242;
}

bool slow(bool x)
{
    return x & (x & 4242);
}
That gets optimized just fine. See here.
If each & produces a 0 or 1, or is a bool, then it's a drop-in replacement. But (y && (x & z)) and (y & (x & z)) are not equivalent if any value is greater than 1. For example, 1 && (2 & 2) is true, while 1 & (2 & 2) is false. It is again equivalent for 1 && (3 & 3), but it should be clear that these don't compare the same things: the former tests whether y is true and whether x and z have set bits in common, while the latter tests which bits are the same across x, y, and z. (See here)
See also: https://stackoverflow.com/questions/6577504/is-there-any-difference-between-and-with-bools/6577545#6577545 and https://stackoverflow.com/questions/47243955/boolean-values-as-8-bit-in-compilers-are-operations-on-them-inefficient