How do memory_order_seq_cst and memory_order_acq_rel differ?

Tags: C++, C++11, Memory Model, Stdatomic

C++ Problem Overview


For both orderings, stores are release operations and loads are acquire operations. I know that memory_order_seq_cst additionally imposes a total order on all such operations, but I'm failing to build an example where that matters, i.e. where the behaviour changes if all the memory_order_seq_cst operations are replaced by memory_order_acq_rel.

Am I missing something, or is the difference just a documentation effect, i.e. one should use memory_order_seq_cst when one does not intend to play with a more relaxed model, and memory_order_acq_rel when constraining the relaxed model?

C++ Solutions


Solution 1 - C++

http://en.cppreference.com/w/cpp/atomic/memory_order has a good example at the bottom that only works with memory_order_seq_cst. Essentially memory_order_acq_rel provides read and write orderings relative to the atomic variable, while memory_order_seq_cst provides read and write ordering globally. That is, the sequentially consistent operations are visible in the same order across all threads.

The example boils down to this:

std::atomic<bool> x{false};
std::atomic<bool> y{false};
std::atomic<int>  z{0};

void a() { x = true; }                // seq_cst store (the default)
void b() { y = true; }                // seq_cst store (the default)
void c() { while (!x); if (y) ++z; }  // seq_cst loads (the default)
void d() { while (!y); if (x) ++z; }  // seq_cst loads (the default)

// kick off a, b, c, d, join all threads
assert(z != 0);

Operations on z are guarded by two atomic variables, not one, so you can't use acquire-release semantics to enforce that z is always incremented.
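For contrast, here is a sketch of the same functions with the seq_cst operations weakened (a store cannot take an acquire ordering and a load cannot take a release ordering, so replacing seq_cst with acq_rel effectively means release stores and acquire loads). Nothing now forces the two stores into one order that all threads agree on, so c may observe x == true, y == false while d simultaneously observes y == true, x == false:

void a() { x.store(true, std::memory_order_release); }
void b() { y.store(true, std::memory_order_release); }
void c() { while (!x.load(std::memory_order_acquire)); if (y.load(std::memory_order_acquire)) ++z; }
void d() { while (!y.load(std::memory_order_acquire)); if (x.load(std::memory_order_acquire)) ++z; }

// With these orderings the assert above is allowed to fire: z can stay 0.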

Solution 2 - C++

On ISAs like x86 where atomics map to barriers, and the actual machine model includes a store buffer:

  • seq_cst stores require flushing the store buffer so this thread's later reads are delayed until after the store is globally visible.

  • acquire or release operations do not have to flush the store buffer. Normal x86 loads and stores already have acquire and release semantics, respectively. (The x86 machine model is essentially seq_cst plus a store buffer with store forwarding.)

    But x86 atomic RMW operations always get promoted to seq_cst because the x86 asm lock prefix is a full memory barrier. Other ISAs can do relaxed or acq_rel RMWs in asm, with the store side being able to do limited reordering with later stores. (But not in ways that would make the RMW appear non-atomic: https://stackoverflow.com/questions/65568185/for-purposes-of-ordering-is-atomic-read-modify-write-one-operation-or-two)
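To make the RMW point concrete, here is a minimal sketch (the function name is hypothetical; the asm comment reflects typical GCC/Clang output for x86-64):

#include <atomic>

std::atomic<int> counter{0};

int bump() {
    // Even with relaxed ordering, x86 has nothing weaker than the lock
    // prefix for an atomic RMW: this compiles to `lock xadd`, which is
    // a full memory barrier, i.e. effectively seq_cst.
    return counter.fetch_add(1, std::memory_order_relaxed);
}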


https://preshing.com/20120515/memory-reordering-caught-in-the-act is an instructive example of the difference between a seq_cst store and a plain release store. (It's actually mov + mfence vs. plain mov in x86 asm. In practice xchg is a more efficient way to do a seq_cst store on most x86 CPUs, but GCC does use mov+mfence)
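As a sketch of that asm difference (function names are hypothetical; the comments show the typical GCC output for x86-64 mentioned above):

#include <atomic>

std::atomic<int> g{0};

void store_seq_cst(int v) {
    g.store(v, std::memory_order_seq_cst); // mov + mfence (clang uses xchg)
}

void store_release(int v) {
    g.store(v, std::memory_order_release); // plain mov, no barrier needed
}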


Fun fact: AArch64's LDAR acquire-load instruction is actually a sequential-acquire, having a special interaction with STLR. Not until ARMv8.3 LDAPR can arm64 do plain acquire operations that can reorder with earlier release and seq_cst stores (STLR). (seq_cst loads still use LDAR because they need that interaction with STLR to recover sequential consistency; seq_cst and release stores both use STLR).

With STLR / LDAR you get sequential consistency, but the store buffer only has to drain before the next LDAR, not right away after each seq_cst store before other operations can happen. I think real AArch64 HW does implement it this way, rather than simply draining the store buffer before committing an STLR.

Strengthening rel or acq_rel to seq_cst by using LDAR / STLR doesn't need to be expensive, unless you seq_cst store something, and then seq_cst load something else. Then it's just as bad as x86.
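A sketch of that expensive store-then-load case (names are hypothetical; the comments reflect the STLR / LDAR mapping described above):

#include <atomic>

std::atomic<int> a{0}, b{0};

int store_then_load() {
    // AArch64: STLR
    a.store(1, std::memory_order_seq_cst);
    // AArch64: LDAR, which cannot complete until the earlier STLR has
    // drained from the store buffer; the same cost pattern as x86's
    // mov + mfence.
    return b.load(std::memory_order_seq_cst);
}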

Some other ISAs (like PowerPC) have more choices of barriers and can strengthen up to mo_rel or mo_acq_rel more cheaply than mo_seq_cst, but their seq_cst can't be as cheap as AArch64's; seq_cst stores need a full barrier.

So AArch64 is an exception to the rule that seq_cst stores drain the store buffer on the spot, either with a special instruction or a barrier instruction afterwards. It's not a coincidence that ARMv8 was designed after C++11 / Java / etc. had basically settled on seq_cst being the default for lockless atomic operations, so making them efficient was important, and after CPU architects had had a few years to think about alternatives to providing barrier instructions or just acquire/release vs. relaxed load/store instructions.

Solution 3 - C++

Try to build Dekker's or Peterson's algorithm with just acquire/release semantics.

That won't work, because acquire/release semantics do not provide a [StoreLoad] fence.

In the case of Dekker's algorithm:

flag[self] = 1;              // <-- STORE
while (true) {
    if (flag[other] == 0) {  // <-- LOAD
        break;
    }
    flag[self] = 0;
    while (turn == other);
    flag[self] = 1;
}

Without a [StoreLoad] fence, the load of flag[other] can be reordered ahead of the store to flag[self], and then the algorithm breaks: two threads could simultaneously see the other's flag as 0, set their own flag, and continue. And now you have 2 threads within the critical section.
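Here is a minimal C++ rendering of that entry protocol with std::atomic (a sketch of Dekker's algorithm for two threads with ids 0 and 1; all operations use the default memory_order_seq_cst, which supplies the [StoreLoad] ordering, and weakening the flag store/load pair to release/acquire reintroduces the bug above):

#include <atomic>

std::atomic<int> flag[2] = {0, 0};
std::atomic<int> turn{0};

void lock(int self) {
    int other = 1 - self;
    flag[self].store(1);                 // <-- STORE (seq_cst by default)
    while (flag[other].load() != 0) {    // <-- LOAD  (seq_cst by default)
        if (turn.load() == other) {
            flag[self].store(0);
            while (turn.load() == other) {} // back off until it is our turn
            flag[self].store(1);
        }
    }
}

void unlock(int self) {
    turn.store(1 - self);                // hand the turn to the other thread
    flag[self].store(0);
}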

Solution 4 - C++

Use the definition and example from the cppreference memory_order page, but replace memory_order_seq_cst with memory_order_release in the stores and memory_order_acquire in the loads.

Release-Acquire ordering guarantees that everything which happened before a store in one thread becomes a visible side effect in the thread that performed the matching load. But in our example, nothing happens before the stores in thread0 and thread1.

x.store(true, std::memory_order_release); // thread0

y.store(true, std::memory_order_release); // thread1

Furthermore, without memory_order_seq_cst, the ordering of the loads within thread2 and thread3 is not guaranteed. You can imagine they become:

if (y.load(std::memory_order_acquire)) { ++z; } // thread2, load y first
while (!x.load(std::memory_order_acquire)); // and then, load x

if (x.load(std::memory_order_acquire)) { ++z; } // thread3, load x first
while (!y.load(std::memory_order_acquire)); // and then, load y

So, if the reordered if loads in thread2 and thread3 execute before the stores in thread0 and thread1, both loads see false; thus ++z is never reached, z stays 0, and the assert fires.

However, if memory_order_seq_cst enters the picture, it establishes a single total order over all atomic operations so tagged, consistent with each thread's program order. Thus in thread2 the x.load really does happen before the y.load, and in thread3 the y.load really does happen before the x.load; the reordering imagined above is forbidden.
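For reference, here are the loads as actually written in the cppreference example; under seq_cst each thread's spin-load is ordered before its if load in the single total order, so at least one of the two if loads must observe true:

while (!x.load(std::memory_order_seq_cst)); // thread2: spin on x first
if (y.load(std::memory_order_seq_cst)) { ++z; } // then test y

while (!y.load(std::memory_order_seq_cst)); // thread3: spin on y first
if (x.load(std::memory_order_seq_cst)) { ++z; } // then test x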

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author    Original Content on Stackoverflow
Question            AProgrammer        View Question on Stackoverflow
Solution 1 - C++    MSN                View Answer on Stackoverflow
Solution 2 - C++    Peter Cordes       View Answer on Stackoverflow
Solution 3 - C++    pveentjer          View Answer on Stackoverflow
Solution 4 - C++    jun.wu             View Answer on Stackoverflow