# linux internals 2022

## Role of mutex in Fibonacci driver

[gist](https://gist.github.com/cantfindagoodname/11e0348b60354c6a900b71a138471150)

There are no shared resources in the algorithm for calculating the Fibonacci sequence itself, hence no critical section is present in the algorithm. Shared resources are only needed when we wish to store computed results so that we do not waste CPU cycles recomputing the same values.

![](https://i.imgur.com/0lGJShJ.png)

Without dynamic programming there is repeated computation; for example, `Fib(1)` is called twice. We can calculate the Fibonacci sequence bottom-up by storing the results in an array.

![](https://i.imgur.com/ZOjTIPh.png)

The array would be shared, hence a mutex could be used to ensure the critical-section problem does not occur (mutual exclusion, progress, and bounded waiting).

```c
/* Array to be passed as an argument.
 * 4 marks "not yet computed", since 4 is not a Fibonacci number. */
unsigned long long arr[93] = {[0] = 0, [1 ... 2] = 1, [3 ... 92] = 4};

/* Naive recursion */
unsigned long long fib(int n)
{
    if (n < 2)
        return n;
    else if (n == 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}

/* Recursion with memoization: sub-results are written back into arr */
unsigned long long fib_m(unsigned long long *arr, int n)
{
    if (arr[n] != 4)
        return arr[n];
    return (arr[n - 1] = fib_m(arr, n - 1)) + (arr[n - 2] = fib_m(arr, n - 2));
}
```

The actual program is not that simple: we compute the Fibonacci sequence using the fast doubling method. Instead of traversing the sequence one element at a time, we traverse it bit by bit.

```text
 n    a                      b
--------------------------------------------
 0    0                      1
 1    1                      1
 2    1                      2
 5    5                      8
11    89                     144
23    28657                  46368
46    1836311903             2971215073
92    7540113804746346429    12200160415121876738
```

This tightens the time complexity bound from roughly $O(2^n)$ (naive recursion) to $O(\lg n)$.

```shell
Result of Fib(92)

Fib(92)
Iterative w/ memoization Time : 6208
Fast Doubling Time : 551
```

However, when we perform many calculations, memoization pays off:

```shell
Result of Fib(0), Fib(1), ... , Fib(92)

Fib(0) - Fib(92)
Iterative w/ memoization Time : 7470
Fast Doubling Time : 21223
```

We can see that fast doubling, which does not avoid recomputing already-known values, is significantly slower than the iterative version here.

:::info
The performance improvement from memoization is mitigated by compiler optimization; however, this does not change the fact that memoization with a random-access array makes the computation faster.

```shell
Result of Fib(92) w/ -O3 flag

Fib(92)
Iterative w/ memoization Time : 4690
Fast Doubling Time : 93

Result of Fib(0), Fib(1), ... , Fib(92) w/ -O3 flag

Fib(0) - Fib(92)
Iterative w/ memoization Time : 1268
Fast Doubling Time : 107
```
:::

An attempt was made to store the computed values in a separate data structure. The fast doubling method traverses the bit pattern of `n`, taking the current index `k` to `2k` if the bit is `0` and to `2k + 1` if the bit is `1`. A binary tree is constructed to store the `RHS` of the Fibonacci values: a node is inserted as the left child if the current bit is `0`, and as the right child if the current bit is `1`.

>$$
\begin{split}
F(2k) &= F(k)[2F(k+1) - F(k)] \\
F(2k+1) &= F(k+1)^2 + F(k)^2
\end{split}
>$$

The following tree would be constructed after `Fib(11)` is called:

![](https://i.imgur.com/FcnKXiK.png)

When `Fib(10)` is called right after, the program traverses to `(5)`, then continues constructing the tree while computing the Fibonacci number.

![](https://i.imgur.com/7UpXC7e.png)

There is additional overhead to construct and traverse the tree, which is expected to be $O(d)$, `d` being the number of bits of `n`; therefore, constructing the tree is expected to slow the program down.
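For reference, the fast-doubling traversal that the tree mirrors can be written as a short loop over the bits of `n`. Below is a minimal sketch of my own (the function name `fib_fd` and the use of `__builtin_clz` are my choices, not code from the gist): each loop iteration corresponds to one level of the tree, going to `2k` on a `0` bit and to `2k + 1` on a `1` bit.

```c
/* Fast doubling, based on the recurrences above:
 *   F(2k)   = F(k) * (2*F(k+1) - F(k))
 *   F(2k+1) = F(k)^2 + F(k+1)^2
 * Invariant: a = F(k), b = F(k+1), where k is the prefix of n read so far.
 */
unsigned long long fib_fd(unsigned int n)
{
    unsigned long long a = 0, b = 1; /* F(0), F(1) */

    if (n == 0)
        return 0;

    /* __builtin_clz (GCC/Clang) locates the highest set bit of n */
    for (unsigned int mask = 1u << (31 - __builtin_clz(n)); mask; mask >>= 1) {
        unsigned long long c = a * (2 * b - a); /* F(2k)   */
        unsigned long long d = a * a + b * b;   /* F(2k+1) */

        if (n & mask) { /* bit 1: k -> 2k + 1 (right child) */
            a = d;
            b = c + d;
        } else {        /* bit 0: k -> 2k (left child)      */
            a = c;
            b = d;
        }
    }
    return a; /* a == F(n); fits in 64 bits up to n = 92 */
}
```

The loop body runs once per bit of `n`, which is where both the $O(\lg n)$ bound and the tree depth `d` come from.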
```shell
Fib(92)
Iterative w/ memoization Time : 5983
Fast Doubling Time : 546
Fast Doubling w/ memoization Time : 720
```

Even so, the memoized version will not outperform plain fast doubling unless the overhead of constructing and traversing the tree is smaller than the cost of the multiplications and additions it saves. The results below show that this is not the case, even when the tree is already fully constructed and its traversal should benefit from spatial locality.

```shell
Fib(0) - Fib(92)
Iterative w/ memoization Time : 7823
Fast Doubling Time : 21300
Fast Doubling w/ memoization Time : 26259
Fast Doubling w/ memoization2 (after the tree is fully constructed) Time : 24699
```

Moreover, the tree data structure relies on pointers. Since compiler optimization is hampered by potential pointer aliasing, we face severe performance degradation when we attempt to use the tree data structure:

```shell
Fib(92)
Iterative w/ memoization Time : 6029
Fast Doubling Time : 126
Fast Doubling w/ memoization Time : 534

Fib(0) - Fib(92)
Iterative w/ memoization Time : 1512
Fast Doubling Time : 96
Fast Doubling w/ memoization Time : 15669
Fast Doubling w/ memoization2 Time : 15140
```

So when using fast doubling, instead of trying to speed the program up with memoization, optimizing the multiplication algorithm might be the better approach. Shared resources are not needed, hence mutex locks are very likely not needed either.

:::spoiler refcnt / RCU
Synchronization mechanisms:
- locking (semaphores, mutexes)
- atomics
- RCU

Mechanisms for Acquisition / Release Synchronization:

---

btree ref: simple counting, no atomics / barriers; the refcount is protected by the same lock as the structure

```c
struct _ref {
    refcount;
};

/* both callers already hold the lock protecting the structure */
remove() {
    ref = ref_count();
    ref -= ref_to_drop;
}

insert() {
    ref = ref_count();
    ref += ref_to_add;
}
```

---

kref: protected by atomics (fetch_add / fetch_sub)

```c
struct kref {
    refcount;
};

init() { refcount_set(1); }
get()  { refcount_inc(); }
put()  { if (refcount_dec_and_test()) release(); }
```

---

protected by atomics + barrier; used in the networking layer (`dst_clone`)

```c
clone() {
    atomic_inc();
}

release() {
    barrier();
    atomic_dec();
}
```

---

protected by atomics + check + barrier; `fget()` / `fput()`

```c
fget() {
    rcu_read_lock();
    ...
    if (!inc_not_zero())
        rcu_read_unlock();
    ...
    rcu_read_unlock();
}

fput() {
    if (dec_and_test())
        call_rcu();
}
```

---

rwlock
- a writer waits until there are no readers, and vice versa
- readers cannot coexist with a writer (readers can coexist with each other)

---

seqlock
- like an rwlock, but writes are favored over reads (the exact opposite of rwlock)
- often used when there are plenty of reads and few writes
- writers are prioritized; readers may have to retry
- an additional lock (e.g. mutex, spinlock) serializes the writers
- the sequence counter is incremented when a writer acquires the lock and again when it releases it (odd = write in progress, even = stable)
- readers check the sequence counter and retry if it changed during the read

---

RCU: read-copy-update
- often used when there are plenty of reads and few writes
- read operations need no locks, atomics, or barriers
- a write operation reads the current object, copies it, modifies the copy, then updates the pointer
- this protects the object pointed to by the pointer, since updating a pointer (one word wide) is an atomic operation
- a read that races with a write sees either the value before the update or the value after it, never anything in between
:::
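To make the RCU pattern in the notes above concrete, here is a minimal kernel-style sketch of my own (the `struct config` object, its `threshold` field, and the function names are invented for illustration; they are not from the Fibonacci driver or any specific kernel code). Readers take no locks on the fast path; the writer publishes a new copy with a single pointer update and frees the old copy only after pre-existing readers are done.

```c
#include <linux/errno.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Hypothetical RCU-protected object, for illustration only. */
struct config {
    int threshold;
};

/* Assumed to point at a valid object before any reader runs. */
static struct config __rcu *cur_config;

/* Reader: no locks, no atomics, no barriers on the fast path. */
static int read_threshold(void)
{
    int val;

    rcu_read_lock();
    val = rcu_dereference(cur_config)->threshold;
    rcu_read_unlock();
    return val;
}

/* Writer: read, copy, then update the pointer; free the old copy
 * only after every reader that might still see it has finished. */
static int update_threshold(int new_threshold)
{
    struct config *new_cfg = kmalloc(sizeof(*new_cfg), GFP_KERNEL);
    struct config *old_cfg;

    if (!new_cfg)
        return -ENOMEM;
    new_cfg->threshold = new_threshold;

    old_cfg = rcu_dereference_protected(cur_config, 1); /* writer-side access */
    rcu_assign_pointer(cur_config, new_cfg);            /* one-word atomic publish */
    synchronize_rcu();                  /* wait for pre-existing readers */
    kfree(old_cfg);
    return 0;
}
```

Note that RCU only protects readers against the writer; if there can be more than one writer, the writers must still serialize against each other with a conventional lock such as a mutex or spinlock.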