What does the "lock" instruction mean in x86 assembly?
C++ Problem Overview
I saw some x86 assembly in Qt's source:
q_atomic_increment:
    movl 4(%esp), %ecx
    lock
    incl (%ecx)
    mov $0,%eax
    setne %al
    ret
    .align 4,0x90
    .type q_atomic_increment,@function
    .size q_atomic_increment,.-q_atomic_increment
- From Googling, I learned that the lock instruction causes the CPU to lock the bus, but I don't know when the CPU frees the bus.
- As for the code above as a whole, I don't understand how it implements the Add.
C++ Solutions
Solution 1 - C++
- LOCK is not an instruction itself: it is an instruction prefix, which applies to the following instruction. That instruction must be something that does a read-modify-write on memory (INC, XCHG, CMPXCHG etc.); in this case it is the incl (%ecx) instruction, which increments (inc) the long word (l) at the address held in the ecx register.
  The LOCK prefix ensures that the CPU has exclusive ownership of the appropriate cache line for the duration of the operation, and provides certain additional ordering guarantees. This may be achieved by asserting a bus lock, but the CPU will avoid this where possible. If the bus is locked, then it is only for the duration of the locked instruction.
- This code copies the address of the variable to be incremented off the stack into the ecx register, then it does lock incl (%ecx) to atomically increment that variable by 1. The next two instructions set the eax register (which holds the return value from the function) to 0 if the new value of the variable is 0, and 1 otherwise. The operation is an increment, not an add (hence the name).
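For comparison, the behaviour described above can be sketched in portable C++ with std::atomic. This is an illustrative equivalent, not Qt's actual source, and the function name atomic_increment is made up for this example. On x86, compilers typically turn such an atomic increment into a LOCK-prefixed instruction.

```cpp
#include <atomic>

// Hypothetical C++ counterpart of q_atomic_increment (name is an assumption):
// atomically adds 1 to the variable and reports whether the new value is nonzero.
bool atomic_increment(std::atomic<int>& value) {
    // Pre-increment on std::atomic is a single atomic read-modify-write
    // and yields the new value.
    return ++value != 0;
}
```

Called on a variable holding -1, this returns false (the new value is 0); called again, it returns true (the new value is 1).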
Solution 2 - C++
What you may be missing is that incrementing a value in memory is a read-modify-write: the CPU has to read the old value first, add one, and write the result back.
The LOCK prefix forces the multiple micro-operations that actually occur to appear as a single atomic operation.
If two threads each try to increment the same variable and both read the same original value at the same time, they both compute the same new value and both write it out.
Instead of the variable being incremented twice, which is the typical expectation, it ends up incremented only once.
The LOCK prefix prevents this from happening.
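The interleaving described above can be replayed deterministically in a single thread. The sketch below uses no real threads and no LOCK prefix; it only spells out the order of loads and stores that loses an update, and the function names are made up for illustration.

```cpp
// Replay of the lost-update interleaving: both "threads" load the old
// value before either one stores its result.
int racy_double_increment(int counter) {
    int seen_by_a = counter;   // thread A: load
    int seen_by_b = counter;   // thread B: load (same old value)
    counter = seen_by_a + 1;   // thread A: add and store
    counter = seen_by_b + 1;   // thread B: add and store, overwriting A's result
    return counter;
}

// The same two increments as indivisible read-modify-write steps, which is
// the behaviour the LOCK prefix guarantees for each increment.
int atomic_double_increment(int counter) {
    counter = counter + 1;     // thread A: whole increment
    counter = counter + 1;     // thread B: whole increment
    return counter;
}
```

Starting from 0, the racy ordering ends at 1 while the indivisible ordering ends at 2.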
Solution 3 - C++
> From google, I knew lock instruction will cause cpu lock the bus, but I don't know when cpu free the bus?
LOCK is an instruction prefix, so it only applies to the following instruction. The source doesn't make it very clear here, but the real instruction is LOCK INC. So the bus is locked for the increment, then unlocked.
> About the whole above code, I don't understand how these code implemented the Add?
They don't implement an add; they implement an increment, along with a return value indicating whether the new value is nonzero. An addition would use LOCK XADD (however, Windows InterlockedIncrement/Decrement are also implemented with LOCK XADD).
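The difference between an atomic add and an atomic increment can be sketched with std::atomic's fetch_add, which returns the value held before the addition, much as LOCK XADD leaves the old value in its register operand. This is an illustration under that assumption, not the Windows or Qt implementation; both helper names are hypothetical.

```cpp
#include <atomic>

// Hypothetical helper: atomically adds `amount` and returns the value the
// variable held before the addition.
int atomic_add(std::atomic<int>& value, int amount) {
    return value.fetch_add(amount);
}

// An increment built on top of the add: the new value is the old value plus 1.
int atomic_increment_via_add(std::atomic<int>& value) {
    return atomic_add(value, 1) + 1;
}
```

Because the add hands back the old value, an increment that needs to return the new value can be layered on top of it, which is one way an increment can end up implemented with an add instruction.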
Solution 4 - C++
Minimal runnable C++ threads + LOCK inline assembly example
main.cpp
#include <atomic>
#include <cassert>
#include <iostream>
#include <string>  // std::stoull
#include <thread>
#include <vector>

std::atomic_ulong my_atomic_ulong(0);
unsigned long my_non_atomic_ulong = 0;
unsigned long my_arch_atomic_ulong = 0;
unsigned long my_arch_non_atomic_ulong = 0;
size_t niters;

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
        // Atomic increment through the C++ standard library.
        my_atomic_ulong++;
        // Plain increment: a data race when several threads run this.
        my_non_atomic_ulong++;
        // x86-64 increment without LOCK: updates can be lost.
        __asm__ __volatile__ (
            "incq %0;"
            : "+m" (my_arch_non_atomic_ulong)
            :
            :
        );
        // x86-64 increment with the LOCK prefix: atomic.
        __asm__ __volatile__ (
            "lock;"
            "incq %0;"
            : "+m" (my_arch_atomic_ulong)
            :
            :
        );
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10000;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    // The atomic counters must equal the exact total.
    assert(my_atomic_ulong.load() == nthreads * niters);
    assert(my_atomic_ulong == my_atomic_ulong.load());
    std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
    assert(my_arch_atomic_ulong == nthreads * niters);
    std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
}
Compile and run:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o main.out main.cpp -pthread
./main.out 2 10000
Possible output:
my_non_atomic_ulong 15264
my_arch_non_atomic_ulong 15267
From this we see that the LOCK prefix made the addition atomic: without it we have race conditions on many of the adds, and the total count at the end is less than the synchronized 20000.
The LOCK prefix is used to implement:
- C++11 std::atomic: https://stackoverflow.com/questions/31978324/what-exactly-is-stdatomic/58904448#58904448
- C11 atomic_int: https://stackoverflow.com/questions/56810/how-do-i-start-threads-in-plain-c/52453291#52453291
Tested in Ubuntu 19.04 amd64.