# 2022q1 Homework3 (fibdrv) contributed by < [cantfindagoodname](https://github.com/cantfindagoodname) > > [作業要求](https://hackmd.io/@sysprog/linux2022-lab0) ## 實驗環境 ```shell $ cat /proc/version Linux version 5.13.0-30-generic (buildd@lcy02-amd64-003) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 $ gcc --version gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 ``` ## 開發記錄 ### Suppress Noise > Before Suppression ![](https://i.imgur.com/QLah3W3.png) > After Suppression ![](https://i.imgur.com/mfwAVGV.png) > Combine two plots ![](https://i.imgur.com/3rmt5z4.png) #### Modify parameters for bootloader in `/etc/default/grub` Add `isolcpus=0` to `GRUB_CMDLINE_LINUX_DEFAULT` Run `update-grub` in shell ```shell $ sudo taskset -c $CPUID <executable> ``` As the name suggest, `isolcpus` isolates a cpu and `taskset` let user designate a core to run the program. Notice that this will **NOT** speed up the program when all CPU are not busy. To check this is working : > Note : <kbd>1</kbd> upon entering `top` ```shell $ top # run after hammering CPU with tasks ``` > Result ```shell %Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 98.0 us, 2.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 98.3 us, 1.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 99.3 us, 0.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 96.4 us, 3.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 98.4 us, 1.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 96.1 us, 3.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 99.7 us, 0.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 97.0 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 :100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 98.0 us, 2.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 85.7 us, 2.0 sy, 0.0 ni, 12.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st ``` #### ASLR ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` 3. Suppress scaling governor Stop changing CPU clock speed ```shell for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor do echo performance > ${i} done ``` #### Turn off Turbo mode ```shell $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` The performance would not necessary be better, but less noise would present. // TODO Remove outlier following statistics rules ### Representing Fibonacci with larger results ```shell $ # Correctness test $ sudo ./client $ ... $ Reading from /dev/fibonacci at offset 500, returned the sequence 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125. $ ... ``` The idea to calculate large number is to represent the value with group of characters instead of a integer data structure. From `man ascii` ```shell Oct Dec Hex Char ────────────────────── 060 48 30 0 061 49 31 1 062 50 32 2 063 51 33 3 064 52 34 4 065 53 35 5 066 54 36 6 067 55 37 7 070 56 38 8 071 57 39 9 ``` We can observe that every character can be converted to their integer counter part by applying a mask `0x0F`, then basic arithmetic can be done accordingly with rules of arithmetic in base-10. #### Addition and Subtraction The largest challenge of implementing addition and subtraction would be handling value pass to the other. Which are `carry` for addition, and `borrow` for subtraction. As the number of digit of result is not known in advance, it would be troublesome to allocate a buffer to store the result temporary, to resolve this problem, I would swap both operands before hand to ensure the value with more digits is always before the value with less digits. ```c /* addition */ if (strlen(a) < strlen(b)) SWAP((size_t *)&a,(size_t *)&b,size_t); /* subtraction */ if (strlen(a) < strlen(b)) SWAP((size_t *)&a,(size_t *)&b,size_t); else if (strlen(a) == strlen(b) && a[0] < b[0]) SWAP((size_t *)&a,(size_t *)&b,size_t); ``` Subtraction would do an extra check to prevent the result to be negative, as the subtraction in fast doubling would only extract the difference (absolute value) of two values. They would then track the `carry` / `borrow` values when doing the arithmetic operations. :::spoiler Implementation ```c /* Addition */ for (i = 0;i < strlen(b);++i){ sum = (a[strlen(a) - i - 1] & 0x0F) + (b[strlen(b) - i - 1] & 0x0F) + carry; buf[i] = (sum % 10) | 0x30; carry = sum / 10; } for (i = strlen(b);i < strlen(a);++i){ sum = (a[strlen(a) - i - 1] & 0x0F) + carry; buf[i] = (sum % 10) | 0x30; carry = sum / 10; } if (carry) buf[i++] = carry | 0x30; buf[i] = '\0'; /* Subtraction */ for (i = 0;i < strlen(b);++i){ sum = (a[strlen(a) - i - 1] & 0x0F) - (b[strlen(b) - i - 1] & 0x0F) - borrow; borrow = sum < 0; buf[i] = (sum + (10 & -(borrow))) | 0x30; } for (i = strlen(b);i < strlen(a);++i){ sum = (a[strlen(a) - i - 1] & 0x0F) - borrow; borrow = sum < 0; buf[i] = (sum + (10 & -(borrow))) | 0x30; } buf[i] = '\0'; ``` ::: #### Multiplication Multiplication iterate over each digits of the multiplier in reverse, then repeatedly add multiplicand by itself to simulate the process of multiplication, with the result shifted according the current multiplier digit. :::spoiler Implementation ```c for (int dig = 0;i >= 0;--i,++dig){ memset(addNum, '\0', sizeof(addNum)); for (int ldigit = multiplier[i] & 0x0F;ldigit > 0;--ldigit){ add(multiplicand, addNum, addNum); } char *cp = addNum + strlen(addNum); for (int t = 0;t < dig;++t) *cp++ = '0'; *cp = '\0'; add(result,addNum,result); } ``` ::: #### `sstring_t` data structure We can now represent very large Fibonacci number. However, data structure using character array causes some problem. Character array would not dynamically allocate spaces, we have to specify the length of array when initializing. When a value too small is specified, the program can not represent the value correctly. When a value too large is specified, the program is very memory inefficient. Hence, we used `malloc`, `free` functions to dynamically allocate space. Even `malloc` would correctly allocate spaces for the very large number. A new problem arises, the program would need to reallocate memory every time the number of digits changes, making the program very inefficient. Even more so for finding Fibonacci sequence, as the arithmetic functions are called very frequently. This then became the performance bottleneck of the program, hence a new data structure `sstring_t` is introduced. ```c struct sstring_t{ char *value; size_t size; size_t capacity; }; ``` `sstring_t` stores both `size` and `capacity` of the very large number, this data structure would allocate `1 << capacity` amount of spaces beforehand, which is the upper-limit of `size`. It only reallocates memory when `size` needed surpass `1 << capacity`. #### Current result ![](https://i.imgur.com/RY7hGnN.png) Some point to notes for the result : - The program is still very inefficient by this time, peaking at around `2.5s` for `fib(500)` - There are some outliers in the result which could be dropped from the data set ### Dropping outliers with statistics kernel time by doing `fib(100)` 64 times : ![](https://i.imgur.com/A5V6V10.png) We would remove outlier by applying [Chebyshev's Theorem](https://en.wikipedia.org/wiki/Chebyshev%27s_inequality), as [68-95-99.7 Rule](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule) can only apply to data sets that are known to be Normal Distribution. However, they both drops data 3 standard deviation away from mean. The goal is to remove spikes from the graph above and make it as flat as possible. `fib(100)` ( mean of ) 64 times by taking only $(\mu-3\sigma,\mu+3\sigma)$ : ![](https://i.imgur.com/nfEwtqT.png) #### current result ![](https://i.imgur.com/zALni9g.png) ### Re: Representing Fibonacci with larger result #### Porting to LKM `fibdrv.c` - Removes memory access error detected by [Address Sanitizer](https://github.com/google/sanitizers) and [valgrind](https://valgrind.org/) - Use `copy_to_user()` to copy continuous memory segment to buffer from user program. - Handles `n == 0` case as `__builtin_clz()` has undefined result when the argument is `0` #### Improve performance of Fibonacci number computation Current : 4-bit arithmetic calculation Improve : 32-bit (-2 for carry) arithmetic calculation progress - not yet implement capacity + size (currently only size) to reduce realloc overhead ![](https://i.imgur.com/R0ufng6.png) > compare with sstring.h (int : improved, sstring : not-improved) ![](https://i.imgur.com/l4MtzkZ.png) One point to note though is the graph above for int array is using iterative method, and sstring is using fast-doubling. Current fast-doubling for integer array would be slower than iterative method as multiply is extremely expensive. Sstring don't have this issue as the overhead of multiplying is dominated by the number of digits (one element per digit) and overhead of resizing array. > sstring (not-improved) versus int (improved) for fast doubling ![](https://i.imgur.com/zMdTfwb.png) > int iterative versus int fast-doubling ![](https://i.imgur.com/Ehvd1rz.png) // TODO improve fast-doubling (should surpass iterative by far : $O(n)\ against\ O(lgn)$ possible overhead : - algorithm of multiplication (currently using long multiplication) - resizing algorithm (yet to implement capacity + size for `bn_t`) ### Synchronization There are several use of mutex in the program. The program would do `mutex_init()` in definition of `__init` initialize function, and `mutex_destroy()` in definition of `__exit` clean up function. The program would use the lock when a process tries to open the device file, the method are defined as `.open` of `fops`. Then release the lock upon closing the device file, defined as `.release` in `fops`. When one process attempts to obtain the lock, it will call `mutex_trylock` to accquire the lock, if the lock is held by other process, the program returns immediately. ```shell $ # attempt to open driver with multiple process $ sudo ./client Failed to open character device: Device or resource busy ``` #### Write a multithreaded program to run client ```c /* create, join */ for (int i = 0;i < thread_count;++i) pthread_create(thread_handler + i, NULL, fib, (void *)(i << 16 | fd)); for (int i = 0;i < thread_count;++i) pthread_join(thread_handler[i], NULL); /* ... */ void *fib(void *fdp) { int offset = 100; char buf[1024]; int fd = open(FIB_DEV, O_RDWR); if (fd < 0) { perror("Failed to open character device"); return NULL; } int thread_count = (int)((long)fdp); lseek(fd, offset, SEEK_SET); read(fd, buf, 1024); printf("[Thread %d] fib(%d) = %s\n", thread_count, offset, buf); fflush(stdout); close(fd); return NULL; } ``` Result : ```shell Failed to open character device: Device or resource busy Failed to open character device: Device or resource busy Failed to open character device: Device or resource busy [Thread 0] fib(100) = 354224848179261915075 Failed to open character device: Device or resource busy Failed to open character device: Device or resource busy [Thread 7] fib(100) = 354224848179261915075 Failed to open character device: Device or resource busy ``` We can see `fibdrv` would block any thread attempting to open device when other device held the mutex lock. After removing code related to mutex lock, run the code again : ```shell [Thread 0] fib(100) = 354224848179261915075 [Thread 1] fib(100) = 354224848179261915075 [Thread 6] fib(100) = 354224848179261915075 [Thread 3] fib(100) = 354224848179261915075 [Thread 2] fib(100) = 354224848179261915075 [Thread 7] fib(100) = 354224848179261915075 [Thread 4] fib(100) = 354224848179261915075 [Thread 5] fib(100) = 354224848179261915075 ``` This result is to be expected, as the program only locks the mutex upon opening the driver, plus none of the shared resource would affect the result of computation. So why do we need mutex. Mutex prevents race condition, which happens when multiple processes are contending for the same shared resource. Write a simple function to trace the program : ```c /* Shared memory, initialized at (dev_read) at this experiment */ static int ref_threads; /* ... */ static ssize_t dev_write(struct file *filep, const char *buffer, size_t len, loff_t *offset) { ++ref_threads; int error_count = 0; error_count = copy_from_user(message, buffer, len); if (error_count == 0){ message_len = strlen(message); printk(KERN_INFO "Thread > %s\n", message); return len; } else { printk(KERN_INFO "write fail\n", len); return -EFAULT; } } static int fib_release(struct inode *inode, struct file *file) { mutex_unlock(&fib_mutex); printk(KERN_INFO "%d\n", ref_threads); return 0; } ``` We can see that with mutex lock, when closing the device, the threads successfully open the device would show the correct ref_threads count. Without mutex lock, sometimes ref_threads would show a result less than the correct count, even though all processes opened the device. // TODO : memoization with mutex ### Reference counting // TODO