Is there a production ready lock-free queue or hash implementation in C++

Tags: C++, STL, Lock-Free

C++ Problem Overview


I've been googling quite a bit for a lock-free queue in C++. I found some code and some trials - but nothing that I was able to compile. A lock-free hash would also be welcome.

SUMMARY: So far I have no positive answer. There is no "production ready" library, and amazingly none of the existing libraries conforms to the API of STL containers.

C++ Solutions


Solution 1 - C++

As of version 1.53, Boost provides a set of lock-free data structures, including queues, stacks, and single-producer/single-consumer queues (i.e. ring buffers).

Solution 2 - C++

The starting point would be either of Herb Sutter's DDJ articles, for either a single producer and consumer or multiple ones. The code he gives (inline, starting on the second page of each article) uses the C++0x-style atomic<T> template type, which you can imitate using the Boost Interprocess library.

The boost code is buried in the depths of the interprocess library, but having read through the appropriate header file (atomic.hpp) the implementations for the necessary compare-and-swap operations on the systems I am familiar with look sound.

Solution 3 - C++

Facebook's Folly seems to have lock-free data structures based on C++11 <atomic>.

I would dare say these are currently used in production, so I guess they can safely be used in other projects.

Cheers!

Solution 4 - C++

Yes!

I wrote a lock-free queue. It has Features™:

  • Fully wait-free (no CAS loops)
  • Super fast (over a hundred million enqueue/dequeue operations per second)
  • Uses C++11 move semantics
  • Grows as needed (but only if you want it to)
  • Does lock-free memory management for the elements (using pre-allocated contiguous blocks)
  • Stand-alone (two headers plus a license and readme)
  • Compiles under MSVC2010+, Intel ICC 13, and GCC 4.7.2 (and should work under any C++11 fully-compliant compiler)

It's available on GitHub under the simplified BSD license (feel free to fork it!).

Caveats:

  • Only for single-producer single-consumer architecture (i.e. two threads)
  • Thoroughly tested on x86(-64); should work on ARM, PowerPC, and other CPUs where aligned native-sized integer and pointer loads and stores are naturally atomic, but it has not been field-tested on non-x86 CPUs (if someone has one to test it on, let me know)
  • No idea if any patents are infringed upon (use at your own risk, etc.). Note that I did design and implement it myself from scratch.

Solution 5 - C++

There is such a library, but it's in C.

Wrapping it in C++ should be straightforward.

liblfds

Solution 6 - C++

boost.lockfree is an attempt to create C++ implementations of lock-free stack and FIFO classes.

public git repository

Solution 7 - C++

After having checked most of the given answers, I can only state:

The answer is NO.

There is no such thing that could be used right out of the box.

Solution 8 - C++

The closest thing I am aware of is Windows Interlocked Singly Linked Lists. Of course, it is Windows only.

Solution 9 - C++

If you have a multiple-producer / single-consumer queue/FIFO, you can easily make one lock-free using an SLIST or a trivial lock-free LIFO stack. What you do is have a second "private" stack for the consumer (which can also be an SLIST for simplicity, or any other stack model you choose). The consumer pops items off the private stack. Whenever the private LIFO is exhausted, instead of popping off the shared concurrent SLIST you do a Flush (grabbing the entire SLIST chain) and then walk the flushed list in order, pushing items onto the private stack.

That works for single-producer / single-consumer and for multiple-producer / single-consumer.

However, it does not work for multiple-consumer cases (with either single-producer or multiple-producers).

Also, as far as hash tables go, they are an ideal candidate for "striping", which just means dividing the hash table into segments and having a lock per segment. This is how the Java concurrent library does it (using 32 stripes). If you have a lightweight reader-writer lock, the hash table can be accessed concurrently for simultaneous reads, and you will only stall when writes are occurring on contested stripes (and possibly when growing the hash table, if you allow that).

If you roll your own, make sure to interleave your locks with the hash entries rather than putting all your locks in an array next to each other, so that you're less likely to have false sharing.

Solution 10 - C++

And then Intel Threading Building Blocks came. And for a time, it was good.

P.S.: you are looking for concurrent_queue and concurrent_hash_map.

Solution 11 - C++

I may come a bit late on this.

The absence of solutions (at the time the question was asked) was mainly due to an important issue in C++ (before C++0x/11): C++ had no concurrency memory model.

Now, using std::atomic, you can control memory-ordering issues and have proper compare-and-swap operations. I've written an implementation of Michael & Scott's lock-free queue (PODC '96) myself using C++11 and Michael's hazard pointers (IEEE TPDS 2004) to avoid early-free and ABA problems. It's working fine, but it's a quick and dirty implementation and I'm not satisfied with the actual performance. Code is available on Bitbucket: LockFreeExperiment

It's also possible to implement a lock-free queue without hazard pointers using double-word CAS (but 64-bit versions are only possible on x86-64 using cmpxchg16b). I have a blog post about that (with untested code for the queue) here: Implementing generic double-word compare and swap for x86/x86-64 (LSE blog.)

My own benchmarks show that the double-locked queue (also in the Michael & Scott 1996 paper) performs as well as the lock-free one (I haven't reached enough contention for the locked data structures to show performance issues; my benchmarks are too light for now). The concurrent queue from Intel's TBB seems even better (twice as fast) for a relatively small number of threads, but shows very strange behavior beyond that (the execution time of my simple benchmark jumps from seconds to hours!). The threshold depends on the operating system; the lowest bound I've found so far is under FreeBSD 9: 8 threads on an i7 with 4 HT cores, and thus 8 logical CPUs.

One more limitation of lock-free queues with respect to the STL style: having iterators on a lock-free queue makes no sense.

Solution 12 - C++

To the best of my knowledge, there is no such thing publicly available yet. One issue an implementer needs to solve is memory allocation: you need a lock-free memory allocator. Such allocators exist, though I cannot seem to find the link right now.

Solution 13 - C++

The following is from Herb Sutter's article on a concurrent lock-free queue, http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=1 . I have made some changes, such as adding compiler-reordering barriers. GCC v4.4+ is needed to compile this code.

#include <atomic>
#include <iostream>
using namespace std;

//compile with g++ setting -std=c++0x

#define CACHE_LINE_SIZE 64

template <typename T>
struct LowLockQueue {
private:
    struct Node {
	Node( T* val ) : value(val), next(nullptr) { }
	T* value;
	atomic<Node*> next;
	char pad[CACHE_LINE_SIZE - sizeof(T*)- sizeof(atomic<Node*>)];
    };
    char pad0[CACHE_LINE_SIZE];

// for one consumer at a time
    Node* first;

    char pad1[CACHE_LINE_SIZE
	      - sizeof(Node*)];

// shared among consumers
    atomic<bool> consumerLock;

    char pad2[CACHE_LINE_SIZE
	      - sizeof(atomic<bool>)];

// for one producer at a time
    Node* last;

    char pad3[CACHE_LINE_SIZE
	      - sizeof(Node*)];

// shared among producers
    atomic<bool> producerLock;

    char pad4[CACHE_LINE_SIZE
	      - sizeof(atomic<bool>)];

public:
    LowLockQueue() {
	first = last = new Node( nullptr );
	producerLock = consumerLock = false;
    }
    ~LowLockQueue() {
	while( first != nullptr ) {      // release the list
	    Node* tmp = first;
	    first = tmp->next;
	    delete tmp->value;       // no-op if null
	    delete tmp;
	}
    }

    void Produce( const T& t ) {
	Node* tmp = new Node( new T(t) );
	asm volatile("" ::: "memory");                            // prevent compiler reordering
	while( producerLock.exchange(true) )
	    { }   // acquire exclusivity
	last->next = tmp;         // publish to consumers
	last = tmp;             // swing last forward
	producerLock = false;       // release exclusivity
    }

    bool Consume( T& result ) {
	while( consumerLock.exchange(true) )
	    { }    // acquire exclusivity
	Node* theFirst = first;
	Node* theNext = first->next;
	if( theNext != nullptr ) {   // if queue is nonempty
	    T* val = theNext->value;    // take it out
	    asm volatile("" ::: "memory");                            // prevent compiler reordering
	    theNext->value = nullptr;  // of the Node
	    first = theNext;          // swing first forward
	    consumerLock = false;             // release exclusivity
	    result = *val;    // now copy it back
	    delete val;       // clean up the value
	    delete theFirst;      // and the old dummy
	    return true;      // and report success
	}
	consumerLock = false;   // release exclusivity
	return false;                  // report queue was empty
    }
};

int main(int argc, char* argv[])
{
    // Instead of this mumbo jumbo, one could use pthreads on Linux to test more comprehensively
    LowLockQueue<int> Q;
    Q.Produce(2);
    Q.Produce(6);

    int a;
    Q.Consume(a);
    cout << a << endl;
    Q.Consume(a);
    cout << a << endl;

    return 0;
}

Solution 14 - C++

I found another solution written in C:

http://www.ddj.com/hpc-high-performance-computing/219500200

Solution 15 - C++

I wrote this at some point, probably back in 2010, I'm sure with help from different references. It is multi-producer, single-consumer.

#include <windows.h>	// for InterlockedExchangePointer

template <typename T>
class MPSCLockFreeQueue 
{
private:
	struct Node 
	{
		Node( T val ) : value(val), next(NULL) { }
		T value;
		Node* next;
	};
	Node * Head;			   
	__declspec(align(4)) Node * InsertionPoint;	 // __declspec(align(4)) forces 32-bit alignment; this must be changed to align(8) for 64-bit builds.

public:
	MPSCLockFreeQueue() 
	{
		InsertionPoint = new Node( T() );
		Head = InsertionPoint;
	}
	~MPSCLockFreeQueue() 
	{
		// release the list
		T result;
		while( Consume(result) ) 
		{   
			//The list should be cleaned up before the destructor is called, as there is no way to know whether or not to delete the value.
			//So we just do our best.
		}
		delete Head;	// the final dummy node is still allocated
	}

	void Produce( const T& t ) 
	{
		Node * node = new Node(t);
		Node * oldInsertionPoint = (Node *) InterlockedExchangePointer( (PVOID volatile *)&InsertionPoint, node );
		oldInsertionPoint->next = node;
	}

	bool Consume( T& result ) 
	{
		if (Head->next)
		{
			Node * oldHead = Head;
			Head = Head->next;
			delete oldHead;
			result = Head->value;
			return true;
		}		
		return false;           	// else report empty
	}

};

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      | Original Author    | Original Content on Stackoverflow
Question          | RED SOFT ADAIR     | View Question on Stackoverflow
Solution 1 - C++  | Nova               | View Answer on Stackoverflow
Solution 2 - C++  | Steve Gilham       | View Answer on Stackoverflow
Solution 3 - C++  | André Neves        | View Answer on Stackoverflow
Solution 4 - C++  | Cameron            | View Answer on Stackoverflow
Solution 5 - C++  | user82238          | View Answer on Stackoverflow
Solution 6 - C++  | tim                | View Answer on Stackoverflow
Solution 7 - C++  | RED SOFT ADAIR     | View Answer on Stackoverflow
Solution 8 - C++  | Nemanja Trifunovic | View Answer on Stackoverflow
Solution 9 - C++  | Adisak             | View Answer on Stackoverflow
Solution 10 - C++ | Edouard A.         | View Answer on Stackoverflow
Solution 11 - C++ | Marwan Burelle     | View Answer on Stackoverflow
Solution 12 - C++ | Tobias             | View Answer on Stackoverflow
Solution 13 - C++ | enthusiasticgeek   | View Answer on Stackoverflow
Solution 14 - C++ | RED SOFT ADAIR     | View Answer on Stackoverflow
Solution 15 - C++ | curlyhairedgenius  | View Answer on Stackoverflow