Why does using the ternary operator to return a string generate considerably different code from returning in an equivalent if/else block?

C++AssemblyOptimizationClangCompiler Optimization

C++ Problem Overview


I was playing with the Compiler Explorer and I stumbled upon an interesting behavior with the ternary operator when using something like this:

std::string get_string(bool b)
{
    return b ? "Hello" : "Stack-overflow";
}

The compiler generated code for this (clang trunk, with -O3) is this:

get_string[abi:cxx11](bool):                 # @get_string[abi:cxx11](bool)
        push    r15
        push    r14
        push    rbx
        mov     rbx, rdi
        mov     ecx, offset .L.str
        mov     eax, offset .L.str.1
        test    esi, esi
        cmovne  rax, rcx
        add     rdi, 16 #< Why is the compiler storing the length of the string
        mov     qword ptr [rbx], rdi
        xor     sil, 1
        movzx   ecx, sil
        lea     r15, [rcx + 8*rcx]
        lea     r14, [rcx + 8*rcx]
        add     r14, 5 #< I also think this is the length of "Hello" (but not sure)
        mov     rsi, rax
        mov     rdx, r14
        call    memcpy #< Why is there a call to memcpy
        mov     qword ptr [rbx + 8], r14
        mov     byte ptr [rbx + r15 + 21], 0
        mov     rax, rbx
        pop     rbx
        pop     r14
        pop     r15
        ret
.L.str:
        .asciz  "Hello"

.L.str.1:
        .asciz  "Stack-Overflow"

However, the compiler generated code for the following snippet is considerably smaller and with no calls to memcpy, and does not care about knowing the length of both strings at the same time. There are 2 different labels that it jumps to

std::string better_string(bool b)
{
    if (b)
    {
        return "Hello";
    }
    else
    {
        return "Stack-Overflow";
    }
}

The compiler generated code for the above snippet (clang trunk with -O3) is this:

better_string[abi:cxx11](bool):              # @better_string[abi:cxx11](bool)
        mov     rax, rdi
        lea     rcx, [rdi + 16]
        mov     qword ptr [rdi], rcx
        test    sil, sil
        je      .LBB0_2
        mov     dword ptr [rcx], 1819043144
        mov     word ptr [rcx + 4], 111
        mov     ecx, 5
        mov     qword ptr [rax + 8], rcx
        ret
.LBB0_2:
        movabs  rdx, 8606216600190023247
        mov     qword ptr [rcx + 6], rdx
        movabs  rdx, 8525082558887720019
        mov     qword ptr [rcx], rdx
        mov     byte ptr [rax + 30], 0
        mov     ecx, 14
        mov     qword ptr [rax + 8], rcx
        ret

The same result is when I use the ternary operator with:

std::string get_string(bool b)
{
    return b ? std::string("Hello") : std::string("Stack-Overflow");
}

I would like to know why the ternary operator in the first example generates that compiler code. I believe that the culprit lies within the const char[].

P.S: GCC does calls to strlen in the first example but Clang doesn't.

Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G

Thank you for your time!

sorry for the wall of code

C++ Solutions


Solution 1 - C++

The overarching difference here is that the first version is branchless.

16 isn’t the length of any string here (the longer one, with NUL, is only 15 bytes long); it’s an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of allocation). The lengths are 5 or 5+1+8 stored in R14, which is stored in the std::string as well as passed to memcpy (along with a pointer chosen by CMOVNE) to load the actual string bytes.

The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.

As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.

Solution 2 - C++

The first version returns a string object which is initialized with a not-constant expression yielding one of the string literals, so the constructor is run as for any other variable string object, thus the memcpy to do the initialization.

The other variants return either one string object initialized with a string literal or another string object initialized with another string literal, both of which can be optimized to a string object constructed from a constant expression where no memcpy is needed.

So the real answer is: the first version operates the ?: operator on char[] expressions before initializing the objects and the other versions on the string objects already being initialized.

It does not matter whether one of the versions is branchless.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionMarius TView Question on Stackoverflow
Solution 1 - C++Davis HerringView Answer on Stackoverflow
Solution 2 - C++JuergenView Answer on Stackoverflow