C++ Thread-Safe Counter

# C++ Thread-Safe Counter ## TLDR This article implements simple thread-safe counter in C++, and benchmark the speedup. Concepts like data race and mutex granularity are discussed. Codes can be found at [github/ThreadSafeCounter](https://github.com/TypeErrorEngine2022/ThreadSafeCounter). ## Pre-requisistes This article assumes reader understanding on basic usage of C++ threads. If you are not familiar with it, feel free to check my article about [Chapter 2 Managing Threads](/6bHqWGzhT4-l_O4bCtK6Kg) ## Introduction We will start with a non thread-safe version, then proceed to a thread-safe version. Finally, we will try to improve the performance of the thread-safe version. All implementations have following public methods: - `increment()`: Increments the counter by 1. - `get_count()`: Returns the total current count. ## Non Thread-safe version To start with, here is a non thread-safe counter. This class only provide the default constructor, so to be consistent with other counter implementations. ```C++ class NonThreadSafeCounter { public: NonThreadSafeCounter() : count_(0) { } NonThreadSafeCounter(const NonThreadSafeCounter&) = delete; NonThreadSafeCounter& operator=(const NonThreadSafeCounter&) = delete; NonThreadSafeCounter(NonThreadSafeCounter&&) = delete; NonThreadSafeCounter& operator=(NonThreadSafeCounter&&) = delete; ~NonThreadSafeCounter() = default; void increment() { ++count_; } long long get_count() const { return count_; } private: long long count_; }; ``` To see whether it is thread-safe, we spawn 4 threads and call `increment` by 10000 times in each thread. Do you think the final count is 40000? The answer is -- *it depends*. You may see weird number like 39997, 39999 or 40000, and the result will be different for every time. This is because **data race** occurs. According to [C++ in Action](https://www.manning.com/books/c-plus-plus-concurrency-in-action-second-edition), data race occurs when there is concurrent modidication to a single object. The later thread may overrite the result with a incorrect value. Please go to the next section to know more. ```C++ #include "headers/non-threadsafe-counter.h" #include <thread> void work_func(NonThreadSafeCounter& counter) { for(int i = 0; i < 10000; ++i) { counter.increment(); } } int main() { NonThreadSafeCounter counter; std::vector<std::thread> threads; for(int i = 0; i < 4; ++i) { threads.emplace_back(work_func, std::ref(counter)); } for(auto& thread : threads) { thread.join(); } // will this be 4 * 10000 = 40000? std::cout << "Final count: " << counter.get_count() << std::endl; return 0; } ``` ## Data race in Non-Thread-Safe Counter At iteration 9999 of each thread for `increment()` function, following events occur. The events are listed in chronological order. Due to unpredictable scheduling of threads, the threads may read and write in an interleave manner. Note that the statement `++count;` actually consists of 2 steps -- `read current x` and then `write x + 1`. For example, thread 1 and 2 read `x=39996` simultaneously, but thread 2 write the result slower. Ideally, after 4 threads increment `x`, it should increase from `39996` to `40000`. However, due to the data race, the final value is `39997`, as some updates are overwritten. With the above driver program, we have no control on thread scheduling and this event list is only possible snapshot. | Thread 1 | Thread 2 | Thread 3 | Thread 4 | | ------------- | ------------- | ------------- | ------------- | | read x=39996 | read x=39996 | | | | | | read x=39996 | | | write x=39997 | | | | | | write x=39997 | | read x=39997 | | | | | write x=39998 | | | | write x=39997 | | ## Thread-safe Counter To prevent data race, the simplest method is to prevent concurrent modification through **mutex**. Mutex is the shorthand of `mutually exclusive`, which acts as a lock to the shared data. Compared to the non-thread-safe version, the following `ThreadSafeCounter` use `std::lock_guard<std::mutex> lock(mutex_)` to lock the `count__` variable in `increment()` and `get_count()`. The first thread that invoke the function will get the lock, a.k.a. first come first serves. The later thread can only keep waiting at the lock statement, and cannot proceed to the section after the lock statement, which is also known as **critical section**. Critical section spans right after the lock statement to the unlock statement. You may wonder where is the unlock statement. Actually the unlock is called when the `std::lock_guard` object is destroyed, i.e. the end of function. In C++, this binding of resource (lock) and object (lock_guard) lifecycle is called **Resource Acquisition Is Initialization(RAII)**, you can access the [C++ RAII Doc](https://en.cppreference.com/w/cpp/language/raii) for reference. ```C++ #ifndef THREADSAFE_COUNTER_H #define THREADSAFE_COUNTER_H #include <iostream> #include <mutex> #include <thread> #include <vector> /// @brief Thread-safe counter implementation using std::mutex class ThreadSafeCounter { public: ThreadSafeCounter() : count_(0) { } ThreadSafeCounter(const ThreadSafeCounter&) = delete; ThreadSafeCounter& operator=(const ThreadSafeCounter&) = delete; ThreadSafeCounter(ThreadSafeCounter&&) = delete; ThreadSafeCounter& operator=(ThreadSafeCounter&&) = delete; ~ThreadSafeCounter() = default; void increment() { std::lock_guard<std::mutex> lock(mutex_); // LOCK !!! // critical section begins ++count_; // critical section ends } long long get_count() const { std::lock_guard<std::mutex> lock(mutex_); // LOCK !!! // critical section begins return count_; // critical section ends } private: long long count_; // mutable keyword allows const methods to lock the mutex mutable std::mutex mutex_; }; #endif // THREADSAFE_COUNTER_H ``` Since this class contains a non-copyable and non-movable `std::mutex` object, this makes the class also non-copyable and non-movable. We delete the related constructor to mark it. With mutex, the access to `x` becomes mutually exclusive, only one thread can modify x at one time. One possible snapshot is shown below. Again, due to unpredictable thread scheduling, we won't know the actual execution order. | Thread 1 | Thread 2 | Thread 3 | Thread 4 | | ------------- | ------------- | ------------- | ------------- | | read x=39996 | | | | | write x=39997 | | | | | | | read x=39997 | | | | | write x=39998 | | | | | | read x=39998 | | | | | write x=39999 | | | read x=39999 | | | | | write x=40000 | | | ## Improve Performance by Sub-Counters In our simple counter, performance is not a problem and *premature optimization is the root of all evil*. However, it is still educative to demonstrate optimization skills and hopefully you can apply it on larger scale. With mutex, we force the update of x to be sequential. but we lose the concurrent advantage. We can separate x to be multiple sub-counters and each thread can concurrently update its own sub-counter! In real-world, big companies like Instagram or Youtube both use similar techniques by counting the number of likes/view in different regions, and aggreagte them when needed. | Thread 1 | Thread 2 | Thread 3 | Thread 4 | | -------- | -------- | -------- | -------- | | Counter 1 | Counter 2 | Counter 3 | Counter 4 | ```C++ #ifndef THREADSAFE_COUNTER_SUB_H #define THREADSAFE_COUNTER_SUB_H #include <iostream> #include <list> #include <mutex> #include <numeric> #include <thread> #include <vector> class SubCounter { private: long long count_; mutable std::mutex mutex_; // mutable keyword allows const methods to lock the mutex public: SubCounter() : count_(0LL) { } ~SubCounter() = default; long long increment() { std::lock_guard<std::mutex> lock(mutex_); return ++count_; } long long get_count() const { std::lock_guard<std::mutex> lock(mutex_); return count_; } }; /// @brief Thread-safe counter with sub-counters. class ThreadSafeCounterWithSubCounter { private: std::list<SubCounter> sub_count_; mutable std::mutex mutex_; // mutable keyword allows const methods to lock the mutex public: ThreadSafeCounterWithSubCounter() = default; ThreadSafeCounterWithSubCounter(const ThreadSafeCounterWithSubCounter&) = delete; ThreadSafeCounterWithSubCounter& operator=(const ThreadSafeCounterWithSubCounter&) = delete; ThreadSafeCounterWithSubCounter(ThreadSafeCounterWithSubCounter&&) = delete; ThreadSafeCounterWithSubCounter& operator=(ThreadSafeCounterWithSubCounter&&) = delete; ~ThreadSafeCounterWithSubCounter() = default; std::list<SubCounter>::iterator register_counter() { std::lock_guard<std::mutex> lock(mutex_); // Since SubCounter contains a mutex, which makes it non-copyable, // we need to use emplace to create a new SubCounter in place. return sub_count_.emplace(sub_count_.end()); } long long increment(std::list<SubCounter>::iterator it) { return it->increment(); } long long get_count() { std::lock_guard<std::mutex> lock(mutex_); auto sum_sub_count = [](long long running_sum, const SubCounter& sub_counter) { return running_sum + sub_counter.get_count(); }; long long sum = std::accumulate(sub_count_.begin(), sub_count_.end(), 0LL, sum_sub_count); return sum; } }; #endif // THREADSAFE_COUNTER_SUB_H ``` Let's break down the program step by step. The `register_counter()` function is called by each thread at the start to obtain an iterator to a `SubCounter` object, which encapsulates a counter and a mutex. Since `SubCounter` is neither movable nor copyable, we use `std::list` instead of `std::vector`. Unlike `std::vector`, which stores objects contiguously in memory and requires copying or moving objects during resizing, `std::list` supports non-copyable and non-movable objects by avoiding such operations. ```C++ std::list<SubCounter>::iterator register_counter() { std::lock_guard<std::mutex> lock(mutex_); // Since SubCounter contains a mutex, which makes it non-copyable, // we need to use emplace to create a new SubCounter in place. return sub_count_.emplace(sub_count_.end()); } ``` The `ThreadSafeCounterWithSubCounter::increment()` function is similar with previous implementations, except it requires an iterator to call the `SubCounter::increment()`. ```C++ long long SubCounter::increment() { std::lock_guard<std::mutex> lock(mutex_); return ++count_; } long long ThreadSafeCounterWithSubCounter::increment(std::list<SubCounter>::iterator it) { return it->increment(); } ``` The `get_count()` function changes a lot. We need to sum up all sub-counters to get the total count. `std::accumulate` is used with a lambda `sum_sub_count` to sum up the sub-counter list (`sub_count_`). 2 lock statements are called. The first lock `ThreadSafeCounterWithSubCounter::mutex_` at the beginning of `ThreadSafeCounterWithSubCounter::get_count()`, while the second lock `SubCounter::mutex_` inside `SubCounter::get_count()`. Although they share the same name `mutex_`, they are distinct variables. In our 4-thread scenario, a `ThreadSafeCounterWithSubCounter` has a `std::list<SubCounter>` and a `std::mutex`, which contains 4 `SubCounter`. Each `SubCounter` contains a `long long` and a `std::mutex`. Inside `ThreadSafeCounterWithSubCounter::get_count()`, the first big lock is locked, and then the 4 `SubCounter`'s locks are lock-and-unlock sequentially. The big lock is not locked in `ThreadSafeCounterWithSubCounter::increment()` function, which allows concurrent update of sub-counters. ```C++ long long SubCounter::get_count() const { std::lock_guard<std::mutex> lock(mutex_); // small lock return count_; } long long ThreadSafeCounterWithSubCounter::get_count() { std::lock_guard<std::mutex> lock(mutex_); // big lock auto sum_sub_count = [](long long running_sum, const SubCounter& sub_counter) { return running_sum + sub_counter.get_count(); }; long long sum = std::accumulate(sub_count_.begin(), sub_count_.end(), 0LL, sum_sub_count); return sum; } ``` | Lock Sequence | | --------------------------------------------- | | Lock ThreadSafeCounterWithSubCounter::mutex | | Lock SubCounter::mutex of Thread 1 | | Unlock SubCounter::mutex of Thread 1 | | Lock SubCounter::mutex of Thread 2 | | Unlock SubCounter::mutex of Thread 2 | | Lock SubCounter::mutex of Thread 3 | | Unlock SubCounter::mutex of Thread 3 | | Lock SubCounter::mutex of Thread 4 | | Unlock SubCounter::mutex of Thread 4 | | UnLock ThreadSafeCounterWithSubCounter::mutex | That's it for `ThreadSafeCounterWithSubCounter`! Before going to the benchmark, can you spot a **critical design bug** in `ThreadSafeCounterWithSubCounter::get_count`? Feel free to leave your answer in the comment! ## Benchmark Since data race occurs in non-thread-safe version, this benchmark will exclude it. This benchmark spwan {1, 2, 4, 8, 16} threads and increment the counter by 100000 in each thread. The average time and speed-up ratio are shown below. It is shown that sub-counter approach significantly improve the performance. This is acheive by finer/smaller lock. ![threadcount-vs-time](https://hackmd.io/_uploads/Sksph-Kxxe.png) ![speedup-ratio](https://hackmd.io/_uploads/ryTT3bFglg.png) ## Further reading *Distributed systems is like multi-threading in a larger scale.* - [C++ Concurrency in Action 2nd Edition, Chapter 3 Sharing Data Between Threads](https://www.manning.com/books/c-plus-plus-concurrency-in-action-second-edition) - [How Facebook & YouTube Handle BILLIONS of Likes & Views!](https://youtu.be/LGOnP9Udffo?si=WiXo0uORdImVZkpF) ## About Me A self-motivated programmer who is exploring low-latency C++. >Personal Website: https://jacky-chen-portfolio.vercel.app/ >GitHub: https://github.com/TypeErrorEngine2022 >Email: jackychenworkcontact@gmail.com