What is the fastest way to update a variable on a condition?

C++Optimization

C++ Problem Overview


I have a pointer, ptr, and a condition, cond. I need the fastest possible way to reset ptr if cond is true, or keep ptr unchanged if cond is false. The current implementation is, trivially:

void reset_if_true(void*& ptr, bool cond)
{
    if (cond)
        ptr = nullptr;
}

I'm aware that the above code performance is good, and I cannot expect a major performance boost optimizing it. However, this code is called several millions of times per second and every little nanosecond saved is relevant.

I was thinking about something that get rid of the branch, e.g.:

void* p[] = { ptr, nullptr };
ptr = p[cond];

but I'm not sure it's the best way to proceed.

C++ Solutions


Solution 1 - C++

void reset_if_true(void*& ptr, bool cond)
{
    if (cond)
        ptr = nullptr;
}

The naive solution will undoubtedly be the fastest in the majority of cases. Although it has a branch, which can be slow on modern pipelined processors, it is only slow if the branch is mispredicted. Since branch predictors are very good nowadays, unless the value of cond is extremely unpredictable, it's likely that a simple conditional branch is the fastest way to write the code.

And if it is not, a good compiler should know that and be able to optimize the code to something better, considering the target architecture. Which goes to gnasher729's point: just write the code the simple way and leave optimization in the hands of the optimizer.

While this is good advice in general, sometimes it is taken too far. If you actually care about the speed of this code, you need to check and see what the compiler is actually doing with it. Check the object code that it is generating, and make sure that it is sensible and that the function's code is getting inlined.

Such an examination can be quite revealing. For example, let's consider x86-64, where branches can be quite expensive in cases where branch prediction is foiled (which is really the only time when this is an interesting question, so let's assume that cond is completely unpredictable). Almost all compilers are going to generate the following for the naive implementation:

reset_if_true(void*&, bool):
    test   sil, sil              ; test 'cond'
    je     CondIsFalse
    mov    QWORD PTR [rdi], 0    ; set 'ptr' to nullptr, and fall through
  CondIsFalse:
    ret

This is about as tight of code as you could imagine. But if you put the branch predictor in a pathological case, it might end up being slower than using a conditional move:

reset_if_true(void*&, bool):
    xor    eax, eax              ; pre-zero the register RAX
    test   sil, sil              ; test 'cond'
    cmove  rax, QWORD PTR [rdi]  ; if 'cond' is false, set the register RAX to 'ptr'
    mov    QWORD PTR [rdi], rax  ; set 'ptr' to the value in the register RAX
    ret                          ;  (which is either 'ptr' or 0)

Conditional moves have a relatively high latency, so they are considerably slower than a well-predicted branch, but they might be faster than a completely unpredictable branch. You would expect a compiler to know this when targeting the x86 architecture, but it doesn't (at least in this simple example) have any knowledge about cond's predictability. It assumes the simple case, that branch prediction will be on your side, and generates code A instead of code B.

If you decide that you want to encourage the compiler to generate branchless code because of an unpredictable condition, you might try the following:

void reset_if_true_alt(void*& ptr, bool cond)
{
    ptr = (cond) ? nullptr : ptr;
}

This succeeds in persuading modern versions of Clang to generate branchless code B, but is a complete pessimization in GCC and MSVC. If you hadn't checked the generated assembly, you wouldn't have known that. If you want to force GCC and MSVC to generate branchless code, you will have to work harder. For example, you might use the variation posted in the question:

void reset_if_true(void*& ptr, bool cond)
{
    void* p[] = { ptr, nullptr };
    ptr = p[cond];
}

When targeting x86, all compilers generate branchless code for that, but it is not especially pretty code. In fact, none of them generate conditional moves. Instead, you get multiple accesses to memory in order to build the array:

reset_if_true_alt(void*&, bool):
    mov     rax, QWORD PTR [rdi]
    movzx   esi, sil
    mov     QWORD PTR [rsp-16], 0
    mov     QWORD PTR [rsp-24], rax
    mov     rax, QWORD PTR [rsp-24+rsi*8]
    mov     QWORD PTR [rdi], rax
    ret

Ugly and probably very inefficient. I'd predict that it gives the conditional jump version a run for its money even in the case where the branch is mispredicted. You'd have to benchmark it to be sure, of course, but it is probably not a good choice.

If you were still desperate to eliminate the branch on MSVC or GCC, you'd have to do something uglier involving reinterpreting the pointer bits and twiddling them. Something like:

void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    p &= -(!cond);
    ptr = reinterpret_cast<void*>(p);
}

That will give you the following:

reset_if_true_alt(void*&, bool):
    xor   eax, eax
    test  sil, sil
    sete  al
    neg   eax
    cdqe
    and   QWORD PTR [rdi], rax
    ret

Again, here we've got more instructions than a simple branch, but at least they're relatively low-latency instructions. A benchmark on realistic data will tell you if the tradeoff is worth it. And give you the justification you need to put in a comment if you're going to actually check-in code like this.

Once I went down the bit-twiddling rabbit hole, I was able to force MSVC and GCC to use conditional move instructions. Apparently they weren't doing this optimization because we were dealing with a pointer:

void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    ptr = reinterpret_cast<void*>(cond ? 0 : p);
}

reset_if_true_alt(void*&, bool):
    mov    rax, QWORD PTR [rdi]
    xor    edx, edx
    test   sil, sil
    cmovne rax, rdx
    mov    QWORD PTR [rdi], rax
    ret

Given the latency of CMOVNE and the similar number of instructions, I'm not sure if this would actually be faster than the previous version. The benchmark you ran would tell you if it was.

Similarly, if we bit-twiddle the condition, we save ourselves one memory access:

void reset_if_true_alt(void*& ptr, bool cond)
{
   std::uintptr_t c = (cond ? 0 : -1);
   reinterpret_cast<std::uintptr_t&>(ptr) &= c;
}

reset_if_true_alt(void*&, bool):
     xor    esi, 1
     movzx  esi, sil
     neg    rsi
     and    QWORD PTR [rdi], rsi
     ret

(That's GCC. MSVC does something slightly different, preferring its characteristic sequence of neg, sbb, neg, and dec instructions, but the two are morally equivalent. Clang transforms it into the same conditional move that we saw it generate above.) This may be the best code yet if we need to avoid branches, considering that it generates sane output on all tested compilers while preserving (some degree of) readability in the source code.

Solution 2 - C++

The lowest-hanging fruit here isn't what you think it is. As discussed in several other answers, reset_if_true is going to be compiled to machine code that is as fast as you can reasonably expect to get for what it does. If that's not fast enough, you need to start thinking about changing what it does. I see two options there, one easy, one not so easy:

  1. Change the calling convention:

     template <class T>
     inline T* reset_if_true(T* ptr, bool condition)
     {
         return condition ? nullptr : ptr;
     }
    

    and then change the caller(s) to read something like

     ptr_var = reset_if_true(ptr_var, expression);
    

    What this does is make it more likely that ptr_var will get to live in a register during the critical innermost loop that's calling reset_if_true millions of times a second, and there won't be any memory accesses associated with it. ptr_var getting forced out to memory is the most expensive thing in your code the way it is right now; even more expensive than potentially mispredicted branches. (A sufficiently good compiler may make this transformation for you provided reset_if_true is inlinable, but it's not always possible for it to do so.)

  2. Change the surrounding algorithm, so that reset_if_true does not get called millions of times a second anymore.

    Since you didn't tell us what the surrounding algorithm is, I can't help you with that. I can, however, tell you that doing something involving checking a condition millions of times a second, probably indicates an algorithm with quadratic time complexity or worse, and that always means you should at least think about finding a better one. (There may not be a better one, alas.)

Solution 3 - C++

As long as we have sizeof(size_t) == sizeof(void*), nullptr being represented in binary as 0 and size_t using all bits (or having std::uintptr_t), you can do this:

// typedef std::uintptr_t ptrint_t; // uncomment if you have it
typedef size_t ptrint_t; // comment out if you have std::uintptr_t

void reset_if_true(void*& ptr, bool cond)
{
	((ptrint_t&)ptr) &= -ptrint_t( !cond );
}

Note, however, that the time the cast from bool to size_t takes is very much implementation-dependent and might take a branch in itself.

Solution 4 - C++

The code is absolutely straightforward.

You certainly make things a lot faster by inlining the function (if the compiler didn't inline it on its own). For example, inlining could mean that the pointer variable that you are setting to null could stay in a register.

Other than that, this code is so straightforward, if there are any tricks that could be used to make it faster, the compiler would use them.

Solution 5 - C++

Update: I reimplemented my answer.

In the following code the idea is converting the pointer into a number and multiplying it by a number (cond). Note inline used. Multiplication may help using an architecture that uses pipelining.

#include <cstdint>

template <typename T>
inline T* reset_if_true(T* p, bool cond) {
  void* ptr = (void*)p; // The optimising compiler (-O3) will get rid of unnecessary variables.
  intptr_t ptrint;
  // This is an unrecommended practice.
  ptrint = (intptr_t)ptr;
  ptrint = ptrint * cond;  // Multiply the integer
  void* ptr2 = (void*)ptrint;
  T* ptrv = (T*)ptr2;
  return ptrv;
}

Example usage:

#include <iostream>
#include <vector>

void test1(){
    //doulbe d = 3.141592;
    //typedef std::vector<double> mytype;
    std::vector<double> data = {3,1,4};
    auto ptr = &data;
    std::cout << (void*)ptr << std::endl;
    auto ptr2 = reset_if_true(ptr, 1);
    //auto ptr2 = (mytype*)reset_if_true(ptr, 1);
    std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))).size() << std::endl;
    std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))).size() << std::endl;
    std::cout << reset_if_true(ptr, 0) <<
        " is null? " << (reset_if_true(ptr, 0) == NULL) <<  // Dont dereference a null.
        std::endl;
}


void test2(){
    double data = 3.141500123;
    auto ptr = &data;
    std::cout << (void*)ptr << std::endl;
    auto ptr2 = reset_if_true(ptr, 1);
    //auto ptr2 = (mytype*)reset_if_true(ptr, 1);
    std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))) << std::endl;
    std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))) << std::endl;
    std::cout << reset_if_true(ptr, 0) <<
        " is null? " << (reset_if_true(ptr, 0) == NULL) <<  // Dont dereference a null.
        std::endl;

}

int main(){ test1(); test2(); }

Compile using these flags: -O3 -std=c++14. The output is:

0x5690
0x5690 -> 3
0x5690 -> 3
0 is null? 1
0x5690
0x5690 -> 3.1415
0x5690 -> 3.1415
0 is null? 1

It might have memory alignment problems when such options are used in the compiler commandline -s FORCE_ALIGNED_MEMORY=1 . Also see reinterpret_cast. Don't forget to use -O3.

The cond can be any non-zero value. There is some room for performance improvement here if we know it is not other than 0 or 1. In that case, you can use int another integer type for cond.

PS. This is an updated answer. The previous answer, as I already clearly mentioned in my answer, had issues. The solution is using intptr_t, and of course inline.

Compiler options used:

 em++ reset_if_true.cpp -O3 -std=c++14 -o reset_if_true.js
 node reset_if_true.js

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionshrikeView Question on Stackoverflow
Solution 1 - C++Cody GrayView Answer on Stackoverflow
Solution 2 - C++zwolView Answer on Stackoverflow
Solution 3 - C++lorroView Answer on Stackoverflow
Solution 4 - C++gnasher729View Answer on Stackoverflow
Solution 5 - C++Sohail SiView Answer on Stackoverflow