Why don't modern compilers coalesce neighboring memory accesses?

C++AssemblyOptimizationCompiler Optimization

C++ Problem Overview


Consider the following code:

bool AllZeroes(const char buf[4])
{
    return buf[0] == 0 &&
           buf[1] == 0 &&
           buf[2] == 0 &&
           buf[3] == 0;
}

Output assembly from Clang 13 with -O3:

AllZeroes(char const*):                        # @AllZeroes(char const*)
        cmp     byte ptr [rdi], 0
        je      .LBB0_2
        xor     eax, eax
        ret
.LBB0_2:
        cmp     byte ptr [rdi + 1], 0
        je      .LBB0_4
        xor     eax, eax
        ret
.LBB0_4:
        cmp     byte ptr [rdi + 2], 0
        je      .LBB0_6
        xor     eax, eax
        ret
.LBB0_6:
        cmp     byte ptr [rdi + 3], 0
        sete    al
        ret

Each byte is compared individually, but it could've been optimized into a single 32-bit int comparison:

bool AllZeroes(const char buf[4])
{
    return *(int*)buf == 0;
}

Resulting in:

AllZeroes2(char const*):                      # @AllZeroes2(char const*)
        cmp     dword ptr [rdi], 0
        sete    al
        ret

I've also checked GCC and MSVC, and neither of them does this optimization. Is this disallowed by the C++ specification?

Edit: Changing the short-circuited AND (&&) to bitwise AND (&) will generate the optimized code. Also, changing the order the bytes are compared doesn't affect the code gen: https://godbolt.org/z/Y7TcG93sP

C++ Solutions


Solution 1 - C++

If buf[0] is nonzero, the code will not access buf[1]. So the function should return false without checking the other buf elements. If buf is close to the end of the last memory page, buf[1] may trigger an access fault. The compiler should be very careful to not read stuff which may be forbidden to read.

Solution 2 - C++

The first thing to understand is that f(const char buf[4]) does not guarantee that the pointer points to 4 elements, it means exactly the same as const char *buf, the 4 is completely ignored by the language. (C99 has a solution to this, but it's not supported in C++, more on that below)

Given AllZeroes(memset(malloc(1),~0,1)), the implementation

bool AllZeroes(const char buf[4])
{
    return buf[0] == 0 &&
           buf[1] == 0 &&
           buf[2] == 0 &&
           buf[3] == 0;
}

should work, because it never tries to read byte #2 (which doesn't exist) when it notices that byte #1 is non-zero, while the implementation

bool AllZeroes(const int32_t *buf)
{
    return (*buf == 0);
}

should segfault as it tries to read the first 4 bytes while only 1 byte exists (malloced 1 byte only)

FWIW Clang gets it right (and GCC doesn't) in C99 with the implementation

_Bool AllZeroes(const char buf[static 4])
{
    return buf[0] == 0 &
           buf[1] == 0 &
           buf[2] == 0 &
           buf[3] == 0;
}

which compiles to the same as

_Bool AllZeroes(const int32_t *buf)
{
    return (*buf == 0);
}

see https://godbolt.org/z/Grqs3En3K (thanks to Caze @libera #C for finding that)

  • unfortunately buf[static 4], which in C99 is a guarantee-from-the-programmer-to-the-compiler that the pointer points to minimum 4 elements, is not supported in C++

Solution 3 - C++

There's the short-circuit evaluation thing. So it can't be optimized as you think. If buf[0] == 0 is false buf[1] == 0 must not be checked. It can be UB or something forbidden to use or whatever - this all must still work.

https://en.wikipedia.org/wiki/Short-circuit_evaluation

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDanielView Question on Stackoverflow
Solution 1 - C++anatolygView Answer on Stackoverflow
Solution 2 - C++hanshenrikView Answer on Stackoverflow
Solution 3 - C++Алексей НеудачинView Answer on Stackoverflow