Why does code mutating a shared variable across threads apparently NOT suffer from a race condition?
C++ Problem Overview
I'm using Cygwin GCC and run this code:
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

unsigned u = 0;

void foo()
{
    u++;
}

int main()
{
    vector<thread> threads;
    for (int i = 0; i < 1000; i++) {
        threads.push_back(thread(foo));
    }
    for (auto& t : threads) t.join();
    cout << u << endl;
    return 0;
}
Compiled with the line: g++ -Wall -fexceptions -g -std=c++14 -c main.cpp -o main.o
It prints 1000, which is correct. However, I expected a smaller number, because threads should be overwriting each other's previously incremented values. Why does this code not suffer from concurrent access to the shared variable?
My test machine has 4 cores, and I put no restrictions on the program that I know of.
The same behavior persists when I replace the body of the shared foo with something more complex, e.g.:
if (u % 3 == 0) {
    u += 4;
} else {
    u -= 1;
}
C++ Solutions
Solution 1 - C++
foo() is so short that each thread probably finishes before the next one even gets spawned. If you add a sleep for a random time in foo() before the u++, you may start seeing what you expect.
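A minimal sketch of that suggestion (the random delay is my own choice of 1–10 ms; any value long enough to keep several threads alive at once will do). The sleep widens the window between the read and the write of u, so the unsynchronized increments can actually collide:

```cpp
#include <chrono>
#include <random>
#include <thread>

unsigned u = 0;

// Variant of foo(): sleep a random few milliseconds first, so threads
// are alive at the same time and their u++ calls can interleave.
void foo()
{
    thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist(1, 10);
    std::this_thread::sleep_for(std::chrono::milliseconds(dist(gen)));
    u++;  // still a data race; the sleep only makes it easier to observe
}
```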
Solution 2 - C++
It is important to understand that a race condition does not guarantee the code will run incorrectly, merely that it could do anything, since the behavior is undefined. That includes running exactly as expected.
Particularly on x86 and AMD64 machines, race conditions in some cases rarely cause issues, as many of the instructions are atomic and the coherency guarantees are very high. These guarantees are somewhat reduced on multiprocessor systems, where the lock prefix is needed for many instructions to be atomic.
If increment is an atomic operation on your machine, this will likely run correctly even though, according to the language standard, it is undefined behavior.
Specifically, I expect the code in this case is compiled to an atomic fetch-and-add instruction (ADD or XADD in x86 assembly), which is indeed atomic on single-processor systems. On multiprocessor systems, however, this is not guaranteed to be atomic, and a lock would be required to make it so. If you are running on a multiprocessor system, there will be a window in which threads could interfere and produce incorrect results.
Specifically, I compiled your code to assembly using https://godbolt.org/ and foo() compiles to:
foo():
    add DWORD PTR u[rip], 1
    ret
This means it performs a single add instruction, which on a single-processor machine is atomic (though, as mentioned above, not so on a multiprocessor system).
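To make the increment genuinely atomic on any machine, the standard tool is std::atomic; a minimal sketch of the fix (on x86-64, GCC typically compiles this fetch_add down to a lock-prefixed add):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared counter as std::atomic: the increment becomes a single atomic
// read-modify-write, so no update can be lost even across cores.
std::atomic<unsigned> u{0};

void foo()
{
    // relaxed ordering suffices here: we only need the count itself,
    // not any synchronizes-with relationship between threads.
    u.fetch_add(1, std::memory_order_relaxed);
}
```

With this change the program prints 1000 every time, on any number of processors, and the behavior is well-defined by the standard.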
Solution 3 - C++
I think it is not so much a question of whether you put a sleep before or after the u++. Rather, the operation u++ translates to code that is performed so quickly, compared to the overhead of spawning the threads that call foo, that it is unlikely to be interrupted. However, if you "prolong" the operation u++, the race condition becomes much more likely:
void foo()
{
    unsigned i = u;                    // read the shared value
    for (int s = 0; s < 10000; s++);   // busy-wait to widen the window
    u = i + 1;                         // write back a possibly stale result
}
result: 694
BTW: I also tried
if (u % 2) {
    u += 2;
} else {
    u -= 1;
}
and it gave me 1997 most of the time, but sometimes 1995.
Solution 4 - C++
It does suffer from a race condition. Put usleep(1000); before u++; in foo and I see different output (< 1000) each time.
Solution 5 - C++
- The likely answer to why the race condition didn't manifest for you, though it does exist, is that foo() is so fast, compared to the time it takes to start a thread, that each thread finishes before the next can even start. But...
- Even with your original version, the result varies by system: I tried it your way on a (quad-core) Macbook, and in ten runs I got 1000 three times, 999 six times, and 998 once. So the race is somewhat rare, but clearly present.
- You compiled with '-g', which has a way of making bugs disappear. I recompiled your code, still unchanged but without the '-g', and the race became much more pronounced: I got 1000 once, 999 three times, 998 twice, 997 twice, 996 once, and 992 once.
- Re. the suggestion of adding a sleep: that helps, but (a) a fixed sleep time leaves the threads still skewed by start time (subject to timer resolution), and (b) a random sleep spreads them out when what we want is to pull them closer together. Instead, I'd code them to wait for a start signal, so I can create them all before letting them get to work. With this version (with or without '-g'), I get results all over the place, as low as 974 and no higher than 998:

#include <iostream>
#include <thread>
#include <vector>

using namespace std;

unsigned u = 0;
bool start = false;

void foo()
{
    while (!start) {
        std::this_thread::yield();
    }
    u++;
}

int main()
{
    vector<thread> threads;
    for (int i = 0; i < 1000; i++) {
        threads.push_back(thread(foo));
    }
    start = true;
    for (auto& t : threads) t.join();
    cout << u << endl;
    return 0;
}
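One caveat worth noting: the plain bool start in this version is itself written by main while the threads read it, which is formally a second data race. A sketch of the same gating technique with the flag made std::atomic<bool> (keeping the race on u deliberately, since that is what we want to observe); the run() helper is my own packaging of the answer's main:

```cpp
#include <atomic>
#include <thread>
#include <vector>

unsigned u = 0;                  // intentionally left unsynchronized
std::atomic<bool> start{false};  // atomic flag: no race on the gate itself

void foo()
{
    while (!start.load()) {      // spin until main releases all threads
        std::this_thread::yield();
    }
    u++;                         // the race on u is deliberately kept
}

// Spawn n threads, release them all at once, and return the final count.
unsigned run(int n)
{
    std::vector<std::thread> threads;
    for (int i = 0; i < n; i++)
        threads.emplace_back(foo);
    start.store(true);           // open the gate: everyone increments now
    for (auto& t : threads)
        t.join();
    return u;
}
```

Because all n threads are created before any of them can touch u, the read-increment-write sequences overlap far more often, and the returned count routinely falls short of n.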