This series is my notes for the talk https://www.infoq.com/presentations/go-locks/.
The talk talks about how locks are implemented in modern languages, taking Go as an example, since it has good concurrency design - its goroutines, which is user-space threads.
On Intel machines, CMPXCHG
(compare-and-swap) used together with the LOCK
prefix can be used to implement simple spin locks. Lock
prefix will lock the system bus, so the the associated instruction can't be interrupted to guarantee atomicity. More details refer to [Let's Talk Locks!] #1: x86 LOCK prefix.
TSX or HLE also has this:
Intel's backward-compatible HLE is based on two new instruction prefixes (rather than new instructions): XACQUIRE (F2) and XRELEASE (F3). You basically put the XACQUIRE prefix on the instruction that starts your critical section, and XRELEASE on the instruction that ends it.
Just using a atomic compare-and-swap implements a spin lock, which means while a thead is holding the system bus, other threads will keep retrying in a loop - its compare-and-swap test will return 1 and thus it keeps looping until CAS returns 0. This wastes CPU cycles.
On x86 such code works, though with a lot of room for improvement:
Note that some memory barriers may be required to correctly implement a spin lock, to avoid compiler reordering - like maybe some other mov eax, 0
from other part of the program can be reordered before mov eax 1
, which would cause the code to not work correctly. However, x86's total store ordering already guarantees this, so no memory barrier is needed for this to work on x86.
Instead of just letting the waiting thread burning CPU cycles, an alternative approach is to put it to sleep and only awake it when the ongoing thread gives up the lock. Linux implements this as futex
. The pros is that the CPU cycles are saved and can run other tasks, but context switch can be expensive too.
An example of using futex to implement mutex:
Because the kernel needs to keep track of what thread is waiting on what futex (aka, uaddr
- the address of the lock integer on which atomic CAS is performed). This is implemented by hashing the address uaddr
to a hash table and store a bucket for each hash key. When different threads are operating the same bucket, a lock is needed. So that is - to implement a futex lock, we first need a lock. In this case, spin lock is used, which makes sense, because we know the bucket operation won't be too long but we are not sure how long the futex owner will keep the lock.
On linux pthread_mutex_lock
is implemented using futex.
More details:
https://lwn.net/Articles/360699/
If we know the waiting thread won't wait for too long, it might be cheaper to just spinning rather than being scheduled out by the OS. A hybrid lock first tries to spin for sometime, then request the OS to put it to sleep.
Instead of requesting the OS to do thread scheduling, which has high context switch cost, Go tries to schedule the threads at runtime system level. So a Go thread may be scheduled but the OS thread could still be the same OS thread. A go routine is essentially a coroutine or a continuation.
More advanced questions involve something like how to avoid the situation that some threads keep being put to sleep and waked up and always can't get a lock - since it takes some time to be able to run after it's added to the ready queue.
Probably the coolest part of the talk!