How do "acquire" and "consume" memory orders differ, and when is "consume" preferable?

C++ Problem Overview

The C++11 standard defines a memory model (1.7, 1.10) which contains memory orderings, which are, roughly, "sequentially-consistent", "acquire", "consume", "release", and "relaxed". Equally roughly, a program is correct only if it is race-free, which happens if all actions can be put in some order in which one action happens-before another one. The way that an action X happens-before an action Y is that either X is sequenced before Y (within one thread), or X inter-thread-happens-before Y. The latter condition is given, among others, when

X synchronizes with Y, or
X is dependency-ordered before Y.

Synchronizing-with happens when X is an atomic store with "release" ordering on some atomic variable, and Y is an atomic load with "acquire" ordering on the same variable. Being dependency-ordered-before happens for the analogous situation where Y is load with "consume" ordering (and a suitable memory access). The notion of synchronizes-with extends the happens-before relationship transitively across actions being sequenced-before one another within a thread, but being dependency-ordered-before is extended transitively only through a strict subset of sequenced-before called carries-dependency, which follows a largish set of rules, and notably can be interrupted with std::kill_dependency.

Now then, what is the purpose of the notion of "dependency ordering"? What advantage does it provide over the simpler sequenced-before / synchronizes-with ordering? Since the rules for it are stricter, I assume that can be implemented more efficiently.

Can you give an example of a program where switching from release/acquire to release/consume is both correct and provides a non-trivial advantage? And when would std::kill_dependency provide an improvement? High-level arguments would be nice, but bonus points for hardware-specific differences.

C++ Solutions

Solution 1 - C++

Data dependency ordering was introduced by N2492 with the following rationale:

> There are two significant use cases where the current working draft (N2461) does not support scalability near that possible on some existing hardware. > > * read access to rarely written concurrent data structures > > Rarely written concurrent data structures are quite common, both in operating-system kernels and in server-style applications. Examples include data structures representing outside state (such as routing tables), software configuration (modules currently loaded), hardware configuration (storage device currently in use), and security policies (access control permissions, firewall rules). Read-to-write ratios well in excess of a billion to one are quite common. > > * publish-subscribe semantics for pointer-mediated publication > > Much communication between threads is pointer-mediated, in which the producer publishes a pointer through which the consumer can access information. Access to that data is possible without full acquire semantics. > > In such cases, use of inter-thread data-dependency ordering has resulted in order-of-magnitude speedups and similar improvements in scalability on machines that support inter-thread data-dependency ordering. Such speedups are possible because such machines can avoid the expensive lock acquisitions, atomic instructions, or memory fences that are otherwise required.

emphasis mine

the motivating use case presented there is rcu_dereference() from the Linux kernel

Solution 2 - C++

Load-consume is much like load-acquire, except that it induces happens-before relationships only to expression evaluations that are data-dependent on the load-consume. Wrapping an expression with kill_dependency results in a value that no longer carries a dependency from the load-consume.

The key use case is for the writer to construct a data structure sequentially, then swing a shared pointer to the new structure (using a release or acq_rel atomic). The reader uses load-consume to read the pointer, and dereferences it to get to the data structure. The dereference creates a data dependency, so the reader is guaranteed to see the initialized data.

std::atomic<int *> foo {nullptr};
std::atomic<int> bar;

void thread1()
{
    bar = 7;
    int * x = new int {51};
    foo.store(x, std::memory_order_release);
}

void thread2()
{
    int *y = foo.load(std::memory_order_consume)
    if (y)
    {
        assert(*y == 51); //succeeds
        // assert(bar == 7); //undefined behavior - could race with the store to bar 
        // assert(kill_dependency(*y) + bar == 58) // undefined behavior (same reason)
        assert(*y + bar == 58); // succeeds - evaluation of bar pulled into the dependency 
    }
}

There are two reasons for providing load-consume. The primary reason is that ARM and Power loads are guaranteed to consume, but require additional fencing to turn them into acquires. (On x86, all loads are acquires, so consume provides no direct performance advantage under naive compilation.) The secondary reason is that the compiler can move later operations without data dependence up to before the consume, which it can't do for an acquire. (Enabling such optimizations is the big reason for building all of this memory ordering into the language.)

Wrapping a value with kill_dependency allows computation of an expression that depends on the value to be moved to before the load-consume. This is useful e.g. when the value is an index into an array that was previously read.

Note that the use of consume results in a happens-before relation that is no longer transitive (though it is still guaranteed to be acyclic). For example, the store to bar happens before the store to foo, which happens before the dereference of y, which happens before the read of bar (in the commented-out assert), but the store to bar doesn't happen before the read of bar. This leads to a rather more complicated definition of happens-before, but you can imagine how it works (start with sequenced-before, then propagate through any number of release-consume-dataDependency or release-acquire-sequencedBefore links)

Solution 3 - C++

Jeff Preshing has a great blog post answering this question. I can't add anything myself, but think anyone wondering about consume vs. acquire should read his post:

http://preshing.com/20140709/the-purpose-of-memory_order_consume-in-cpp11/

He shows a specific C++ example with corresponding benchmarked assembly code across three different architectures. Compared to memory_order_acquire, memory_order_consume potentially offers a 3x speedup on PowerPC, 1.6x speedup on ARM, and negligible speedup on x86 which has strong consistency anyway. The catch is that as of when he wrote it, only GCC actually treated consume semantics any differently from acquire, and probably because of a bug. Nonetheless, it demonstrates that a speedup is available if the compiler writers can figure out how to take advantage of it.

Solution 4 - C++

I'd like to record a partial finding, even though it's not a real answer and doesn't mean that there won't be a big bounty for a proper answer.

After staring at 1.10 for a while, and in particular the very helpful note in paragraph 11, I think this isn't actually so hard. The big difference between synchronizes-with (henceforth: s/w) and dependency-ordered-before (dob) is that a happens-before relationship can be established by concatenating sequenced-before (s/b) and s/w arbitrarily, but not so for dob. Note one of the definitions for inter-thread happens before:

> A synchronizes-with X and X is sequenced before B

But the analogous statement for ~~A is dependency-ordered before X~~ is missing!

So with release/acquire (i.e. s/w) we can order arbitrary events:

A1    s/b    B1                                            Thread 1
                   s/w
                          C1    s/b    D1                  Thread 2

But now consider an arbitrary sequence of events like this:

A2    s/b    B2                                            Thread 1
                   dob
                          C2    s/b    D2                  Thread 2

In this sequenece, it is still true that A2 happens-before C2 (because A2 is s/b B2 and B2 inter-thread happens before C2 on account of dob; but we could argue that you can never actually tell!). However, it is not true that A2 happens-before D2. The events A2 and D2 are not ordered with respect to one another, unless it actually holds that C2 carries dependency to D2. This is a stricter requirement, and absent that requirement, A2-to-D2 cannot be ordered "across" the release/consume pair.

In other words, a release/consume pair only propagates an ordering of actions which carry a dependency from one to the next. Everything that's not dependent is not ordered across the release/consume pair.

Furthermore, note that the ordering is restored if we append a final, stronger release/acquire pair:

A2    s/b    B2                                                         Th 1
                   dob
                          C2    s/b    D2                               Th 2
                                             s/w
                                                    E2    s/b    F2     Th 3

Now, by the quoted rule, D2 inter-thread happens before F2, and therefore so do C2 and B2, and so A2 happens-before F2. But note that there is still no ordering between A2 and D2 — the ordering is only between A2 and later events.

In summary and in closing, dependency carrying is a strict subset of general sequencing, and release/consume pairs provide an ordering only among actions that carry dependency. As long as no stronger ordering is required (e.g. by passing through a release/acquire pair), there is theoretically a potential for additional optimization, since everything that is not in the dependency chain may be reordered freely.

Maybe here is an example that makes sense?

std::atomic<int> foo(0);

int x = 0;

void thread1()
{
    x = 51;
    foo.store(10, std::memory_order_release);
}

void thread2()
{
    if (foo.load(std::memory_order_acquire) == 10)
    {
        assert(x == 51);
    }
}

As written, the code is race-free and the assertion will hold, because the release/acquire pair orderes the store x = 51 before the load in the assertion. However, by changing "acquire" into "consume", this would no longer be true and the program would have a data race on x, since x = 51 carries no dependency into the store to foo. The optimization point is that this store can be reordered freely without concern to what foo is doing, because there is no dependency.

Content Type	Original Author	Original Content on Stackoverflow
Question	Kerrek SB	View Question on Stackoverflow
Solution 1 - C++	Cubbi	View Answer on Stackoverflow
Solution 2 - C++	user2949652	View Answer on Stackoverflow
Solution 3 - C++	user3188445	View Answer on Stackoverflow
Solution 4 - C++	Kerrek SB	View Answer on Stackoverflow