# 2022q1 Homework3 (fibdrv)

contributed by < `kdnvt` >

## Experiment Environment

```bash
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           126
Model name:                      Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
Stepping:                        5
CPU MHz:                         1200.000
CPU max MHz:                     3600.0000
CPU min MHz:                     400.0000
BogoMIPS:                        2380.80
Virtualization:                  VT-x
L1d cache:                       192 KiB
L1i cache:                       128 KiB
L2 cache:                        2 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
```

## Assignment Requirements

> [fibdrv](https://hackmd.io/PMozqaxqRA-jF2KmgTZcPQ?view#K04-fibdrv)

## Development Process

### kmalloc vs vmalloc

When allocating memory, the kernel uses kmalloc and vmalloc instead of malloc. To become familiar with these APIs, I read the [Memory Allocation Guide](https://www.kernel.org/doc/html/v5.0/core-api/memory-allocation.html) and [virtual memory and linux](http://malgenomeproject.org/os2018fall/10_Linux_virtual_memory.pdf) to understand the difference between kmalloc and vmalloc.

kmalloc returns memory allocated by the slab allocator; because it is physically contiguous, it is suitable for DMA. vmalloc returns memory allocated by the page allocator, which is contiguous only in the virtual address space. kmalloc is better suited for objects smaller than a page, while vmalloc is suitable for allocations that span multiple pages.

In fibdrv, I don't expect the numbers to become very large, so I use kmalloc to implement big number arithmetic. However, I wrap memory allocation and release in the functions bn_new, bn_znew, and bn_free, so I can change the underlying implementation at any time.

### big number arithmetic

The fibdrv homework description suggests using a decimal string to implement big number arithmetic, but I think that wastes space. Therefore, I follow the approach of [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#%E5%A4%A7%E6%95%B8%E9%81%8B%E7%AE%97): use an array of unsigned long long and treat the concatenated bits of all elements as the binary representation of a big number.

For a number $x$, a decimal string needs $\lfloor \log_{10}x \rfloor + 1$ bytes of memory, while my method only needs $\lceil (\lfloor \log_2 x \rfloor + 1)/8 \rceil$ bytes. The decimal representation therefore uses about $8/\log_2 10 \approx 2.4$ times as much memory.

Besides memory usage, this representation also speeds up data transfer between kernel space and user space, because much less data has to be copied.

```c
typedef struct _bn {
    unsigned long long length;
    unsigned long long *num;
} bn_t;
```

### bn_new

```c
bool bn_new(bn_t *bn_ptr, unsigned long long length)
{
    bn_ptr->length = 0;
    if ((bn_ptr->num =
             kmalloc(sizeof(unsigned long long) * length, GFP_KERNEL)))
        bn_ptr->length = length;
    return !!bn_ptr->num;
}
```

This function allocates memory only for the member num in bn_t. The reason I don't let it allocate the whole structure is that I can declare a variable directly and use bn_new like this:

```c
bn_t tmp;
bn_new(&tmp, 1);
```

However, after implementing fibdrv with this design, I think it is somewhat confusing to users. Besides calling the API, the user also needs to understand the internal behavior of bn_t, which is bad for an abstract interface.
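To make the ownership rule concrete, here is a minimal sketch of a caller; `fib_example()` is a hypothetical kernel-side function of mine, not part of fibdrv. The caller owns the `bn_t` object itself, while `bn_new()`/`bn_free()` only manage the `num` array inside it, and that is exactly the detail the user has to keep in mind.

```c
#include <linux/errno.h>

/* fib_example() is a hypothetical caller (not in fibdrv): the caller owns
 * the bn_t object; bn_new()/bn_free() only manage the num array inside it. */
static int fib_example(void)
{
    bn_t tmp;

    if (!bn_new(&tmp, 1))   /* allocates tmp.num with kmalloc */
        return -ENOMEM;
    tmp.num[0] = 42;        /* limbs are accessed directly through num */
    /* ... arithmetic on tmp ... */
    bn_free(&tmp);          /* frees tmp.num only, not tmp itself */
    return 0;
}
```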
### bn_extend

```c
bool bn_zrenew(bn_t *bn_ptr, unsigned long long length)
{
    if (length == bn_ptr->length) {
        memset(bn_ptr->num, 0, sizeof(unsigned long long) * length);
        return true;
    }
    unsigned long long *tmp = krealloc(
        bn_ptr->num, sizeof(unsigned long long) * length, GFP_KERNEL);
    if (tmp == NULL)
        return false;
    memset(tmp, 0, sizeof(unsigned long long) * length);
    bn_ptr->length = length;
    bn_ptr->num = tmp;
    return true;
}

bool bn_extend(bn_t *bn_ptr, unsigned long long length)
{
    unsigned long long origin = bn_ptr->length;
    if (length < origin)
        return true;
    if (length == bn_ptr->length)
        return true;
    unsigned long long *tmp = krealloc(
        bn_ptr->num, sizeof(unsigned long long) * length, GFP_KERNEL);
    if (tmp == NULL)
        return false;
    memset(tmp + origin, 0, sizeof(unsigned long long) * (length - origin));
    bn_ptr->length = length;
    bn_ptr->num = tmp;
    return true;
}
```

bn_zrenew and bn_extend both resize the memory held by bn_ptr. bn_zrenew resizes it and sets all bits to zero, while bn_extend preserves the original value. Notice that when calling krealloc, instead of assigning the return value to bn_ptr->num directly, I keep it in tmp and check it first: if krealloc fails and NULL were assigned directly, the original bn_ptr->num would be lost and leaked.

### bn_free

```c
void bn_free(bn_t *bn_ptr)
{
    kfree(bn_ptr->num);
    bn_ptr->num = NULL;
    bn_ptr->length = 0;
}
```

This function releases the memory of num. Whenever other functions need to allocate, resize, or release memory, they go through these wrappers instead of calling kmalloc or krealloc directly, so I can replace kmalloc with alternatives such as kvmalloc or vmalloc without changing the rest of the code.

### bn_add / bn_sub

```c
bool bn_add(const bn_t *a, const bn_t *b, bn_t *res)
{
    if (a->length < b->length ||
        (a->length == b->length &&
         b->num[b->length - 1] > a->num[a->length - 1]))
        swap(a, b);
    res->length = a->length + (a->num[a->length - 1] > 0);
    bn_free(res);
    if (!bn_znew(res, res->length))
        return false;
    bn_move(a, res);
    bn_add_carry(b, res, 0);
    return true;
}

// The bn_t is an unsigned big number, so the caller must make sure that
// a is bigger than b before calling bn_sub.
// cppcheck-suppress unusedFunction
bool bn_sub(const bn_t *a, const bn_t *b, bn_t *res)
{
    res->length = max(a->length, b->length);
    bn_free(res);
    if (!bn_znew(res, res->length))
        return false;
    if (a->length > b->length) {
        bn_toggle_move(a, res);
        bn_add_carry(b, res, 0);   // res = (~a+b) = b-a-1
        bn_toggle_move(res, res);  // res = ~(b-a-1) = (a-b+1) -1 = a-b
    } else {
        bn_toggle_move(b, res);
        bn_add_carry(a, res, 1);
    }
    return true;
}

void bn_add_carry(const bn_t *b, bn_t *res, int carry)
{
    int i;
    /*
     * Use bit operations to check overflow.
     * There are four cases which overflow:
     *
     *      msb b       : 0 1 1 1
     *      msb res     : 1 0 1 1
     *      msb b + res : 0 0 0 1
     *
     * If overflow occurs, it means the carry will
     * be one on the next loop iteration.
     */
    for (i = 0; i < b->length; i++) {
        unsigned long long overflow = b->num[i] & res->num[i];
        unsigned long long msb = b->num[i] | res->num[i];
        res->num[i] += b->num[i] + carry;
        msb &= ~res->num[i];
        overflow |= msb;
        carry = !!(overflow & 1ULL << 63);
    }
    for (; i < res->length && carry; i++) {
        unsigned long long msb = res->num[i];
        res->num[i] += carry;
        msb &= ~res->num[i];
        carry = !!(msb & 1ULL << 63);
    }
}
```

bn_add_carry implements the big number addition; bn_add and bn_sub build addition and subtraction on top of it. bn_move copies the value of the first big number to the second, while bn_toggle_move copies the 1's complement of the first big number to the second.
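To double-check the carry-detection trick outside the kernel, here is a small user-space test; `check_carry()` is a hypothetical helper of mine, not part of fibdrv, that applies the same bit expression to a single 64-bit limb.

```c
#include <stdio.h>

/* The carry out of a + b + cin is (a & b) | ((a | b) & ~sum), tested at
 * bit 63, which is the expression used in bn_add_carry(). */
static int check_carry(unsigned long long a, unsigned long long b, int cin)
{
    unsigned long long sum = a + b + cin;
    unsigned long long carry_bits = (a & b) | ((a | b) & ~sum);
    return !!(carry_bits & (1ULL << 63));
}

int main(void)
{
    /* Each row is {a, b, carry-in, expected carry-out}. */
    struct {
        unsigned long long a, b;
        int cin, expect;
    } cases[] = {
        {0xffffffffffffffffULL, 1ULL, 0, 1},
        {0xffffffffffffffffULL, 0ULL, 1, 1},
        {0x8000000000000000ULL, 0x8000000000000000ULL, 0, 1},
        {0x7fffffffffffffffULL, 1ULL, 0, 0},
    };

    for (int i = 0; i < 4; i++)
        printf("case %d: got %d, expected %d\n", i,
               check_carry(cases[i].a, cases[i].b, cases[i].cin),
               cases[i].expect);
    return 0;
}
```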
The original versions of `bn_add` and `bn_sub` shown above free and reallocate memory on every call; the improved versions use the resize functions instead, as shown below:

```c
bool bn_add(const bn_t *a, const bn_t *b, bn_t *res)
{
    if (a->length < b->length ||
        (a->length == b->length &&
         b->num[b->length - 1] > a->num[a->length - 1]))
        swap(a, b);

    if (!bn_zrenew(res,
                   a->length + ((a->num[a->length - 1] &
                                 1LLU << (sizeof(unsigned long long) * 8 - 1)) > 0)))
        return false;
    bn_move(a, res);
    bn_add_carry(b, res, 0);
    return true;
}

bool bn_sub(const bn_t *a, const bn_t *b, bn_t *res)
{
    if (!bn_zrenew(res, max(a->length, b->length)))
        return false;
    if (a->length > b->length) {
        bn_toggle_move(a, res);
        bn_add_carry(b, res, 0);   // res = (~a+b) = b-a-1
        bn_toggle_move(res, res);  // res = ~(b-a-1) = (a-b+1) -1 = a-b
    } else {
        bn_toggle_move(b, res);
        bn_add_carry(a, res, 1);
    }
    return true;
}
```

### bn_mult

```c
bool bn_mult_slow(bn_t *a, bn_t *res)
{
    bn_shrink(a);
    bn_shrink(res);
    bn_t sum;
    if (!bn_znew(&sum, a->length + res->length))
        return false;
    bn_t tmp;
    if (!bn_znew(&tmp, a->length + res->length))
        return false;
    for (int i = 0; i < a->length; i++) {
        for (int j = 0; j < 64; j++) {
            if (a->num[i] & (1ULL << j)) {
                bn_move(res, &tmp);
                if (!bn_lshift(&tmp, i * 64 + j))
                    return false;
                bn_add_carry(&tmp, &sum, 0);
            }
        }
    }
    bn_swap(&sum, res);
    bn_free(&sum);
    bn_free(&tmp);
    bn_shrink(res);
    return true;
}
```

This was my first implementation of multiplication. It traverses every bit of the first big number; whenever a bit is set, it left-shifts the second big number by the corresponding number of bits and adds the result to `sum`. However, its performance was terrible, so I changed the implementation.

```c
bool bn_mult(bn_t *a, bn_t *res)
{
    bn_shrink(a);
    bn_shrink(res);
    bn_t low = {};
    if (!bn_znew(&low, res->length))
        return false;
    bn_t high = {};
    if (!bn_znew(&high, a->length + res->length + 5))
        return false;
    bn_t sum = {};
    if (!bn_znew(&sum, a->length + res->length + 5))
        return false;
    for (int i = 0; i < a->length; i++) {
        for (int j = 0; j < 64; j += 32) {
            unsigned long long multiplier = ((a->num[i]) >> j) & 0xffffffffULL;
            bn_extend(&low, res->length);
            bn_move(res, &low);
            bn_extend(&high, res->length);
            bn_move(res, &high);
            bn_rshift(&high, 32);
            bn_mask(&low, 0xffffffffULL);
            bn_mask(&high, 0xffffffffULL);
            for (int k = 0; k < low.length; k++) {
                low.num[k] *= multiplier;
            }
            for (int k = 0; k < high.length; k++) {
                high.num[k] *= multiplier;
            }
            bn_lshift(&high, 32);
            bn_extend(&high, low.length + 1);
            bn_add_carry(&low, &high, 0);
            bn_lshift(&high, i * 64 + j);
            bn_extend(&sum, high.length + 1);
            bn_add_carry(&high, &sum, 0);
        }
    }
    bn_swap(&sum, res);
    bn_free(&sum);
    bn_free(&low);
    bn_free(&high);
    bn_shrink(res);
    return true;
}
```

This version multiplies 32 bits of data at a time. The reason for using 32-bit chunks is to prevent overflow.
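To spell out the overflow argument: after `bn_mask(..., 0xffffffffULL)` each limb of `low` and `high` holds a value below $2^{32}$, and `multiplier` is also below $2^{32}$, so every partial product still fits in a 64-bit limb:

$$
(2^{32}-1)\times(2^{32}-1) = 2^{64} - 2^{33} + 1 < 2^{64}
$$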
### Fibonacci number computing and performance measurement

```c
static unsigned long long fib_sequence(long long k, bn_t *ret)
{
    /* FIXME: use clz/ctz and fast algorithms to speed up */
    if (k == 0 || k == 1) {
        ret->num = kmalloc(sizeof(unsigned long long), GFP_KERNEL);
        ret->num[0] = k;
        ret->length = 1;
        return 1;
    }
    bn_t a, b, res = {};
    bn_znew(&a, 1);
    bn_znew(&b, 1);
    if (!a.num || !b.num) {
        bn_free(&a);
        bn_free(&b);
        return 0;
    }
    a.num[0] = 0;
    b.num[0] = 1;
    bool err = false;
    for (int i = 2; i <= k; i++) {
        if (!bn_add(&a, &b, &res)) {
            err = true;
            break;
        }
        bn_swap(&a, &b);
        bn_swap(&b, &res);
    }
    bn_free(&a);
    bn_free(&res);
    bn_swap(ret, &b);
    if (err) {
        bn_free(ret);
        return 0;
    }
    return ret->length;
}
```

This function computes the Fibonacci number by repeatedly adding the previous two numbers.

The `bn_to_string` function in `client.c` transforms the binary representation of a big number into a decimal string; the implementation is shown below:

```c
int bn_to_string(unsigned long long *bn, int bn_len, char **str_ptr)
{
    int width = sizeof(*bn) * 8;
    int bn_bits = bn_len * width;
    int len = bn_bits / 3 + 2;
    int total_len = 1;
    char *str = calloc(len, 1);
    if (!str)
        return 0;
    for (int i = bn_len - 1; i >= 0; i--) {
        for (int j = width - 1; j >= 0; j--) {
            int carry = 0;
            if (bn[i] & (1ULL << j))
                carry = 1;
            for (int k = 0; k < len && k < total_len; k++) {
                str[k] += str[k] + carry;
                carry = 0;
                if (str[k] > 9) {
                    carry = 1;
                    str[k] -= 10;
                }
            }
            if (carry)
                str[total_len++] = 1;
        }
    }
    for (int k = 0; k < total_len; k++) {
        str[k] += '0';
    }
    reverse(str, total_len);
    *str_ptr = str;
    return len;
}

void reverse(char *str, int len)
{
    int half = len / 2;
    for (int i = 0; i < half; i++) {
        char tmp = str[i];
        str[i] = str[len - 1 - i];
        str[len - 1 - i] = tmp;
    }
}
```

After implementing the basic Fibonacci computation, I decided to measure its performance first, so I can see the difference when improving it later. I record four kinds of execution time:

- ktime: the kernel time spent computing the Fibonacci number.
- utime: the user-space time spent transforming the binary representation into a decimal string.
- k_to_utime: the time spent copying data from kernel space to user space.
- total time: the sum of all of the above.

Then I use gnuplot to draw the diagram. The initial diagram is shown below:

![](https://i.imgur.com/le2x0eA.png)

As we can see, there is some interference. Following the homework instructions, I reduced it by running the shell script below:

```shell
for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > ${i}
done
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
```

In addition, I isolated the eighth core (CPU 7) and masked most interrupts off it with the shell script below:

```shell
for file in `find /proc/irq -name "smp_affinity"`
do
    var=0x`cat ${file}`
    var="$(( $var & 0x7f ))"
    var=`printf '%.2x' ${var}`
    sudo bash -c "echo ${var} > ${file}"
done
```

> reference: [KYG-yaya573142](https://hackmd.io/BZjwYw1FQ1O0NPY5kIS7VQ?view#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0)

Running the client on the eighth core, here is the diagram after these changes:

![](https://i.imgur.com/8DWaNx7.png)

The measurement is still affected by some noise. Using the command:

```
cat /proc/interrupts
```

I can see that some interrupts still occur while `client` is running. Finally, I run `client` multiple times, remove the outliers, and take the average.
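The timing instrumentation itself is not reproduced in this report. As a rough sketch (not the actual `client.c`), the user-space measurements such as utime could be collected with `clock_gettime(CLOCK_MONOTONIC)`; `do_work()` below is a placeholder of mine standing in for the measured operation, e.g. `bn_to_string()`:

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the operation being measured. */
static void do_work(void)
{
    volatile unsigned long long x = 0;
    for (int i = 0; i < 1000000; i++)
        x += i;
}

int main(void)
{
    struct timespec t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &t2);

    /* Elapsed time in nanoseconds. */
    long long ns = (t2.tv_sec - t1.tv_sec) * 1000000000LL +
                   (t2.tv_nsec - t1.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}
```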
The `client` writes the measurements to files; each line of a file contains the samples for the same Fibonacci index. After `client` finishes, I run `data.py` to process the data. The code is shown below:

```python
import numpy as np
import math

def outlier_filter(datas, threshold = 2):
    datas = np.array(datas).astype(np.int64)
    if (math.isclose(0, datas.std(), rel_tol = 1e-10)):
        return datas
    z = np.abs((datas - datas.mean()) / datas.std())
    return datas[z < threshold]

if __name__ == "__main__":
    klines = []
    ulines = []
    k2ulines = []
    totallines = []
    with open("ktime.txt") as textFile:
        klines = [line.split() for line in textFile]
    with open("utime.txt") as textFile:
        ulines = [line.split() for line in textFile]
    with open("k_to_utime.txt") as textFile:
        k2ulines = [line.split() for line in textFile]
    with open("total_time.txt") as textFile:
        totallines = [line.split() for line in textFile]
    fp = open("plot.txt", "w")
    klines = np.array(klines)
    ulines = np.array(ulines)
    k2ulines = np.array(k2ulines)
    totallines = np.array(totallines)
    for i in range(klines.shape[0] - 1):
        fp.write(str(i) + " ")
        fp.write(str(outlier_filter(klines[i]).mean()) + " ")
        fp.write(str(outlier_filter(ulines[i]).mean()) + " ")
        fp.write(str(outlier_filter(k2ulines[i]).mean()) + " ")
        fp.write(str(outlier_filter(totallines[i]).mean()) + "\n")
    fp.close()
```

> referenced [4ce43c4](https://github.com/colinyoyo26/fibdrv/commit/4ce43c4e7ee0b80c4ce9e6dcd995bbfdf239206c)

And here is the diagram:

![](https://i.imgur.com/0xnUShi.png)

For Fibonacci numbers up to 1000:

![](https://i.imgur.com/tDaVqdW.png)

### Improve bn_to_string

`bn_to_string` accounted for a large proportion of the execution time, so I decided to improve it first. The original implementation has to traverse the whole decimal string for every bit of the binary representation, which is a large overhead. I improved the performance by consuming 32 bits of the big number at a time, so the whole string only needs to be traversed once per 32 bits. The reason for 32 bits is that an unsigned long long can then hold the remaining carry without overflow. The implementation and the execution time diagram are shown below:

```c
int bn_to_string(unsigned long long *bn, int bn_len, char **str_ptr)
{
    int width = sizeof(*bn) * 8;
    int bn_bits = bn_len * width;
    int len = bn_bits / 3 + 2;
    int total_len = 1;
    unsigned char *str = calloc(len, 1);
    if (!str)
        return 0;

    for (int i = bn_len - 1; i >= 0; i--) {
        for (int j = width - 32; j >= 0; j -= 32) {
            unsigned long long carry = bn[i] >> j & 0xffffffffllu;
            /* ++total_len only if k == total_len && carry != 0 */
            for (int k = 0;
                 k < len && (k < total_len || (carry && ++total_len)); k++) {
                carry += (unsigned long long) str[k] << 32;
                str[k] = carry % 10;
                carry /= 10;
            }
        }
    }
    for (int k = 0; k < total_len; k++)
        str[k] += '0';
    reverse((char *) str, total_len);
    *str_ptr = (char *) str;
    return len;
}
```

![](https://i.imgur.com/R7dHxj4.png)

After replacing the free-and-allocate pattern in `bn_add` with the resize function, the performance improved a little:

![](https://i.imgur.com/4pkPO5h.png)

### Fast doubling

Next, I tried to reduce the Fibonacci computation time by using the fast doubling technique.
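For reference, with $a = F(k)$ and $b = F(k+1)$, the fast doubling identities are:

$$
F(2k) = F(k)\bigl(2F(k+1) - F(k)\bigr), \qquad F(2k+1) = F(k)^2 + F(k+1)^2
$$

which correspond to `t2 = a*(2*b - a)` and `t4 = a^2 + b^2` in the code below.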
The implementation and the diagram are shown below:

```c
static unsigned long long fib_doubling(long long k, bn_t *ret)
{
    if (k == 0 || k == 1) {
        ret->num = kmalloc(sizeof(unsigned long long), GFP_KERNEL);
        ret->num[0] = k;
        ret->length = 1;
        return 1;
    }
    bn_t a, b;
    bn_znew(&a, 2);
    bn_znew(&b, 2);
    int bits = 32 - __builtin_clz(k);
    if (!a.num || !b.num) {
        bn_free(&a);
        bn_free(&b);
        return 0;
    }
    a.num[0] = 0;
    b.num[0] = 1;
    bool err = false;
    for (int i = bits - 1; i >= 0; i--) {
        bn_t t1 = {}, t2 = {}, t3 = {}, t4 = {};
        err |= !bn_new(&t1, b.length);
        err |= !bn_move(&b, &t1);
        err |= !bn_lshift(&t1, 1);       // t1 = 2*b
        err |= !bn_sub(&t1, &a, &t2);    // t2 = 2*b - a

        err |= !bn_new(&t3, a.length);
        err |= !bn_move(&a, &t3);        // t3 = a
        err |= !bn_mult(&t3, &t2);       // t2 = a*(2*b - a)

        err |= !bn_mult(&a, &t3);
        err |= !bn_move(&b, &t1);
        err |= !bn_mult(&b, &t1);
        err |= !bn_add(&t1, &t3, &t4);   // t4 = a^2 + b^2

        err |= !bn_extend(&a, t2.length);
        err |= !bn_extend(&b, t4.length);
        err |= !bn_move(&t2, &a);
        err |= !bn_move(&t4, &b);
        if (k & 1 << i) {
            err |= !bn_add(&a, &b, &t1); // t1 = a+b
            err |= !bn_extend(&a, b.length);
            err |= !bn_move(&b, &a);     // a = b
            err |= !bn_extend(&b, t1.length);
            err |= !bn_move(&t1, &b);    // b = t1
        }
        bn_free(&t1);
        bn_free(&t2);
        bn_free(&t3);
        bn_free(&t4);
        if (err)
            break;
    }
    bn_swap(&a, ret);
    bn_free(&a);
    bn_free(&b);
    if (err) {
        bn_free(ret);
        return 0;
    }
    return ret->length;
}
```

![](https://i.imgur.com/bhxFBd7.png)

The performance is even worse than the iterative version:

![](https://i.imgur.com/kF9INbu.png)

I suspected the multiplication was the bottleneck, so I replaced it with the improved implementation mentioned above ([bn_mult](https://hackmd.io/@kdnvt/fibdrv#bn_mult)); the resulting diagram is shown below:

![](https://i.imgur.com/VHYskY0.png)

And the iterative version is shown below:

![](https://i.imgur.com/Tfc0zpn.png)

With the faster multiplication, fast doubling now outperforms the iterative version.