# fibdrv contributed by < `jhan1998` > ## large number calculation ### Compute Fibonacci numbers after F~93~ (inclusive) Since the existing Fibonacci sequence is calculated using long long int, when $f(n) > 2^{64} - 1$, overflow will be generated, and the resulting sequence will be wrong, as can be seen from the following: $$ \begin{gather*} f(93) = 12200160415121876738\\ 2^{64} - 1 = 9223372036854775807\\ f(93) > 2^{64} - 1\\ \end{gather*} $$ After item 93, there is no way to output the sequence correctly, so I use the code in [Small/Short String Optimization](https://hackmd.io/@sysprog/linux2021-quiz3) of the third week quiz to perform string The addition is implemented so that the subsequent sequence can be run. ### Implementation Since `fibdrv.c` will be executed in the Linux kernel mode through LKM, it cannot refer to the general C header file, so some malloc, free, assert must be changed to the corresponding kernel api. For example: ```c malloc() => kmalloc() or kzalloc() free() => kfree() assert() => BUG_ON() ``` Then, implement `string_number_add()`: ```c static void string_number_add(xs *a, xs *b, xs *out) { char *data_a, *data_b; int i, carry = 0; int sum; size_t size_a, size_b; /* * Make sure the string length of 'a' is always greater than * the one of 'b'. */ if (xs_size(a) < xs_size(b)) __swap((void *) &a, (void *) &b); data_a = xs_data(a); data_b = xs_data(b); size_a = xs_size(a); size_b = xs_size(b); reverse_str(data_a, size_a); reverse_str(data_b, size_b); char buf[size_a + 2]; for (i = 0; i < size_b; i++) { sum = (data_a[i] - '0') + (data_b[i] - '0') + carry; buf[i] = '0' + sum % 10; carry = sum / 10; } for (i = size_b; i < size_a; i++) { sum = (data_a[i] - '0') + carry; buf[i] = '0' + sum % 10; carry = sum / 10; } if (carry) buf[i++] = '0' + carry; buf[i] = 0; reverse_str(buf, i); /* Restore the original string */ reverse_str(data_a, size_a); reverse_str(data_b, size_b); if (out) { BUG_ON(sizeof(buf) >= MAX_STR_LEN); *out = *xs_new(&xs_literal_empty(), buf); } } ``` The principle is not difficult to understand, that is, the original string is reversed, so that it can be added character by character from the lowest bit, and finally the result of the addition is reversed. After the implementation, the corresponding `fib_sequence()` is changed to: ```c static xs fib_sequence_str(long long k) { /* FIXME: use clz/ctz and fast algorithms to speed up */ xs f[k + 2]; f[0] = *xs_new(&xs_literal_empty(), "0"); f[1] = *xs_new(&xs_literal_empty(), "1"); for (int i = 2; i <= k; i++) { string_number_add(&f[i - 1], &f[i - 2], &f[i]); } return f[k]; } ``` In addition, `fib_read()` will return the string to user space. ```c static ssize_t fib_read(struct file *file, char *buf, size_t size, loff_t *offset) { long long int k = *offset; xs fib = fib_sequence_str(k); copy_to_user(buf, xs_data(&fib), strlen(xs_data(&fib)) + 1); return 0; } ``` In the end just change `MAX_LENGTH` and we're done. ### Verify correctness Change the value of `fibdrv/scripts/expected.txt` before verifying, otherwise the result will be wrong. ```c $ make check cc -o client client.c make -C /lib/modules/5.4.0-70-generic/build M=fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.4.0-70-generic' CC [M] fibdrv/fibdrv.o Building modules, stage 2. MODPOST 1 modules CC [M] fibdrv/fibdrv.mod.o LD [M] /fibdrv/fibdrv.ko make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-70-generic' make unload make[1]: Entering directory 'fibdrv' sudo rmmod fibdrv || true >/dev/null rmmod: ERROR: Module fibdrv is not currently loaded make[1]: Leaving directory 'fibdrv' make load make[1]: Entering directory 'fibdrv' sudo insmod fibdrv.ko make[1]: Leaving directory 'fibdrv' sudo ./client > out make unload make[1]: Entering directory 'fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory 'fibdrv' Passed [-] ``` Then use [The first 500 Fibonacci numbers](https://blog.abelotech.com/posts/first-500-fibonacci-numbers/) to verify whether the calculation to $F_{500}$ is correct. Change the answer in `fibdrv/scripts/expected.txt` to [The first 500 Fibonacci numbers](https://blog.abelotech.com/posts/first-500-fibonacci-numbers/) series. ```c 0 0 1 1 2 1 3 2 4 3 5 5 6 8 7 13 8 21 9 34 10 55 . . . . 495 12571784316098613244768412217102088629494571867212494830322628505768565710528091492618664254204672140130 496 20341574322680408081083829243820203612317308197211964554628215486203974898255803242740333222721700974747 497 32913358638779021325852241460922292241811880064424459384950843991972540608783894735358997476926373114877 498 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624 499 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501 500 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125 ``` Next verify if it is correct. ```c $ make check make -C /lib/modules/5.4.0-70-generic/build M=fibdrv modules make[1]: Entering directory '/usr/src/linux-headers-5.4.0-70-generic' Building modules, stage 2. MODPOST 1 modules make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-70-generic' make unload make[1]: Entering directory 'fibdrv' sudo rmmod fibdrv || true >/dev/null rmmod: ERROR: Module fibdrv is not currently loaded make[1]: Leaving directory 'fibdrv' make load make[1]: Entering directory 'fibdrv' sudo insmod fibdrv.ko make[1]: Leaving directory 'fibdrv' sudo ./client > out make unload make[1]: Entering directory 'fibdrv' sudo rmmod fibdrv || true >/dev/null make[1]: Leaving directory 'fibdrv' Passed [-] ``` From this we can see that this implementation is correct at least up to $F_{500}$. ## Exclude factors that interfere with performance analysis ### Limit the CPU to a specific program First modify `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` to specify that cpu 7 is not used for kernel schduler scheduling, change it to `"quiet splash isolcpus=7"` and then need `sudo update-grub` to Update the settings, and finally restart the machine. You can use `cat /sys/devices/system/cpu/isolated` command to know whether it is successful or not. ```shell $ cat /sys/devices/system/cpu/isolated 7 ``` Looking at the executable cores of `pid 1` also shows that only 0 - 6 are left. ```shell $ taskset -cp 1 pid 1's current affinity list: 0-6 ``` Use taskset to specify the CPU core to execute the program: ```shell taskset -c 7 ./client ``` ### Eliminate performance interference factors Next, compare the difference before and after exclusion. Here, fast doubling and iterate are compared. **before exclusion:** ![](https://i.imgur.com/5Pfz4dz.png) It can be seen that the amplitude of the shock is also large. Next, eliminate the interfering factors * suppress [address space layout randomization (ASLR)](https://en.wikipedia.org/wiki/Address_space_layout_randomization) ```shell $ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space" ``` * Set scaling_governor to performance Since here we only need to set the 7th cpu as perfoemance, so just execute: ```shell echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor ``` * For Intel processors, turn off turbo mode ```shell sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" ``` **After exclusion:** ![](https://i.imgur.com/EPkyXzA.png) There are still some slight vibrations. #### SMP IRQ affinity > Refer to [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%25) notes We can specify or limit which CPU can do IRQ by modifying `/proc/irq/IRQ#/smp_affinity`. Execute the following script: ```shell #!/bin/bash for file in `find /proc/irq -name "smp_affinity"` do var=0x`cat ${file}` var="$(( $var & 0x7f ))" var = `printf '%.2x' ${var}` sudo bash -c "echo ${var} > ${file}" done sudo bash -c "echo 7f > /proc/irq/default_smp_affinity" ``` Combine the previous steps into one script: ```shell #!/bin/bash CPUID=7 ORIG_ASLR=`cat /proc/sys/kernel/randomize_va_space` ORIG_GOV=`cat /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor` ORIG_TURBO=`cat /sys/devices/system/cpu/intel_pstate/no_turbo` sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space" sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor" sudo bash -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" make unload make load rm -f plot.txt sudo taskset -c 7 ./client_plot > plot.txt gnuplot before.gp itr=0 for file in `find /proc/irq -name "smp_affinity"` do var=0x`cat ${file}` arr[${itr}]=${var} var="$(( $var & 0x7f ))" var=`printf '%.2x' ${var}` sudo bash -c "echo ${var} > ${file}" (( itr++ )) done sudo bash -c "echo 7f > /proc/irq/default_smp_affinity" rm -f plot.txt sudo taskset -c 7 ./client_plot > plot.txt gnuplot after.gp make unload # restore the original system settings sudo bash -c "echo $ORIG_ASLR > /proc/sys/kernel/randomize_va_space" sudo bash -c "echo $ORIG_GOV > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor" sudo bash -c "echo $ORIG_TURBO > /sys/devices/system/cpu/intel_pstate/no_turbo" itr=0 for file in `find /proc/irq -name "smp_affinity"` do var=${arr[${itr}]} var=`printf '%.2x' ${var}` sudo bash -c "echo ${var} > ${file}" (( itr++ )) done sudo bash -c "echo ff > /proc/irq/default_smp_affinity" ``` Finally, a more stable result can be obtained ![](https://i.imgur.com/6iS2OZO.png) ## Measure user space and kernel space time In `client_plot.c`, I use `clock_gettime` to get the time, where `CLOCK_MONOTONIC` refers to the monotonically increasing time, which will not be affected by the switching of the system clock, but will be affected by `adjtime()` and `NTP` influences. > the CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP. Among them, `adjtime()` is a function used to adjust the system clock, which will adjust the speed of the clock according to the timeval structure brought in. > The adjtime() function gradually adjusts the system clock (as returned by gettimeofday(2)). The amount of time by which the clock is to be adjusted is specified in the structure pointed to by delta. This structure has the following form: > ```cpp > struct timeval { > time_t tv_sec; /* seconds */ > suseconds_t tv_usec; /* microseconds */ > }; > ``` > If the adjustment in delta is positive, then the system clock is speeded up by some small percentage (i.e., by adding a small amount of time to the clock value in each second) until the adjustment has been completed. If the adjustment in delta is negative, then the clock is slowed down in a similar fashion. ```c clock_gettime(CLOCK_MONOTONIC, &start); kernel_t = write(fd, write_buf, 0); /* recursion w/ cache */ clock_gettime(CLOCK_MONOTONIC, &end); long long user_t = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - (start.tv_sec * 1e9 + start.tv_nsec); printf("%d %lld %lld %lld\n", i, user_t, kernel_t, user_t - kernel_t); ``` In the kernel space, declare `ktime_t` to record the time, and use write to return the time according to the prompt of the job description. ```c kt = ktime_get(); fib_sequence(*offset); kt = ktime_sub(ktime_get(), kt); ``` Finally, subtract the time in kernel space from the time measured in user space, and we can get the time from user space to kernel space plus kernel space and back to user space. Here it is measured with the default `fib_sequence()` function: ![](https://i.imgur.com/16MZ7fK.png) It can be seen that the later items will take more time to calculate, and the time trend of user space and kernel space is the same, except for the overhead passed in the middle. Among them, the overhead of the system call is about 800 ns except for the large n = 1 and 2 at the beginning. Next, we can know the overhead of user to kernel and kernel to user by returning the time point when entering the kernel space. ```c case 2: return (ssize_t) ktime_to_ns(ktime_get()); ``` ```c clock_gettime(CLOCK_MONOTONIC, &start); kernel_t = write(fd, write_buf, 2); /* recursion w/ cache */ clock_gettime(CLOCK_MONOTONIC, &end); long long kernel_t_u = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - kernel_t; long long user_t_k = kernel_t - (long long)(start.tv_sec * 1e9 + start.tv_nsec); printf("%d %lld %lld\n", i, user_t_k, kernel_t_u); ``` ![](https://i.imgur.com/4LSjA5B.png) It turns out that the overhead of kernel to user will be higher than that of user to kernel, and the reason is still being discussed. :::warning user to kernel is mainly a mode switch, no special memory range check is required :notes:jserv ::: ### Page fault Next, to solve the problem of excessive overhead in the first few items, in [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#%E9%87%8F%E6%B8%AC%E6%99%82 %E9%96%93%E7%9A%84%E6%96%B9%E5%BC%8F) In the notes, the teacher mentioned that it is a page fault problem. After reading [Threaded RT-application with memory locking and stack handling example](https://rt.wiki.kernel.org/index.php/Threaded_RT-application_with_memory_locking_and_stack_handling_example), I learned that [mlock](https://man7 .org/linux/man-pages/man2/mlock.2.html) series of system calls, so that the memory will not be swap out and can always stay in RAM. The method used in [Threaded RT-application with memory locking and stack handling example](https://rt.wiki.kernel.org/index.php/Threaded_RT-application_with_memory_locking_and_stack_handling_example) is to first use `mlockall` to combine current and future The memory is locked to prevent the system from swapping it out, and then a memory map is required to be reserved in RAM, so that the program can use the pre-reserved memory to avoid page faults. The detailed code is as follows. ```c static void configure_malloc_behavior(void) { /* Now lock all current and future pages from preventing of being paged */ if (mlockall(MCL_CURRENT | MCL_FUTURE)) perror("mlockall failed:"); /* Turn off malloc trimming.*/ mallopt(M_TRIM_THRESHOLD, -1); /* Turn off mmap usage. */ mallopt(M_MMAP_MAX, 0); } ``` Among them, `malopt(M_TRIM_THRESHOLD, -1)` will automatically release the memory after free when the memory grows to a critical point. [mallopt](https://man7.org/linux/man-pages/man3/mallopt.3.html) > When the amount of contiguous free memory at the top of the heap grows sufficiently large, free(3) employs sbrk(2) to release this memory back to the system. Setting -1 here is to turn off this function. And `malopt(M_MMAP_MAX, 0)` is to avoid making `mmap()` do large allocation request. > M_MMAP_MAX This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance. The default value is 65,536, a value which has no special significance and which serves only as a safeguard. Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests. Modify the measurement time code: ```c if (mlockall(MCL_CURRENT | MCL_FUTURE)) perror("mlockall failed:"); /* Turn off malloc trimming.*/ mallopt(M_TRIM_THRESHOLD, -1); /* Turn off mmap usage. */ mallopt(M_MMAP_MAX, 0); for (int i = 0; i <= offset; i++) { lseek(fd, 0, SEEK_SET); clock_gettime(CLOCK_MONOTONIC, &start); kernel_t = write(fd, write_buf, 2); clock_gettime(CLOCK_MONOTONIC, &end); } ``` The same is to perform mlock first and then map a section of space first. ![](https://i.imgur.com/IgO5Nl4.png) After the actual measurement, the overhead when n = 1 and 2 can be effectively reduced. Re-measure with the default `fib_sequence()` function: ![](https://i.imgur.com/7z0fYcl.png) ## Use statistical methods to remove extreme values According to the job description, outliers beyond two standard deviations are removed, and the calculation results of the remaining samples are averaged 50 times ![](https://i.imgur.com/k8ch8Vx.png) Due to the low performance of large number calculations, the measurement results after $F_{100}$ will be optimized. ## fast doubling & clz / ctz Use statistical methods to remove extreme values to measure the performance difference between fast doubling and general implementation, and the difference between fast doubling itself after introducing clz / ctz. Regarding the implementation of fast doubling, refer to the job description to calculate the current item based on the value of each bit. now known $$ F(2K) = F(k)[2F(K+1)-F(k)] \\ F(2K+1) = F(k+1)^2 +F(k)^2 $$ So when the current bits are 0, we can use $F(k)$ and $F(k+1)$ to calculate the value of $F(2k)$ and $F(2k+1)$, and when When the bits is 1, the obtained result can be used to calculate $F(2k+1)$ and $F(2k+2)$. The basic implementation is: ```c static long long fib_fast_double(long long k) { /* FIXME: use clz/ctz and fast algorithms to speed up */ if (k < 2) { return k; } long long f[2]; f[0] = 0; f[1] = 1; for (unsigned int i = 1U << 31; i; i >>= 1) { long long k1 = f[0] * (f[1] * 2 - f[0]); long long k2 = f[0] * f[0] + f[1] * f[1]; if (k & i) { f[0] = k2; f[1] = k2 + k1; } else { f[0] = k1; f[1] = k2; } } return f[0]; } ``` Comparison with Default Practice ![](https://i.imgur.com/HhBvChf.png) Furthermore, since the currently implemented fast doubling will complete 32 times regardless of the number of items, we use `__builtin_clz()` to reduce the number of calculations. where the loop will start from ```c 1 << 31 - __builtin_clz(k) ``` Start moving right. Finally got the result ![](https://i.imgur.com/saNvE5p.png) It can be seen that after adding `__builtin_clz()`, it has become significantly faster.