contributed by < jhan1998
>
Since the existing Fibonacci sequence is calculated using long long int, when , overflow will be generated, and the resulting sequence will be wrong, as can be seen from the following:
After item 93, there is no way to output the sequence correctly, so I use the code in Small/Short String Optimization of the third week quiz to perform string The addition is implemented so that the subsequent sequence can be run.
Since fibdrv.c
will be executed in the Linux kernel mode through LKM, it cannot refer to the general C header file, so some malloc, free, assert must be changed to the corresponding kernel api. For example:
Then, implement string_number_add()
:
The principle is not difficult to understand, that is, the original string is reversed, so that it can be added character by character from the lowest bit, and finally the result of the addition is reversed.
After the implementation, the corresponding fib_sequence()
is changed to:
In addition, fib_read()
will return the string to user space.
In the end just change MAX_LENGTH
and we're done.
Change the value of fibdrv/scripts/expected.txt
before verifying, otherwise the result will be wrong.
Then use The first 500 Fibonacci numbers to verify whether the calculation to is correct.
Change the answer in fibdrv/scripts/expected.txt
to The first 500 Fibonacci numbers series.
Next verify if it is correct.
From this we can see that this implementation is correct at least up to .
First modify GRUB_CMDLINE_LINUX_DEFAULT
in /etc/default/grub
to specify that cpu 7 is not used for kernel schduler scheduling, change it to "quiet splash isolcpus=7"
and then need sudo update-grub
to Update the settings, and finally restart the machine.
You can use cat /sys/devices/system/cpu/isolated
command to know whether it is successful or not.
Looking at the executable cores of pid 1
also shows that only 0 - 6 are left.
Use taskset to specify the CPU core to execute the program:
Next, compare the difference before and after exclusion. Here, fast doubling and iterate are compared.
before exclusion:
It can be seen that the amplitude of the shock is also large.
Next, eliminate the interfering factors
Since here we only need to set the 7th cpu as perfoemance, so just execute:
After exclusion:
There are still some slight vibrations.
Refer to KYG-yaya573142 notes
We can specify or limit which CPU can do IRQ by modifying /proc/irq/IRQ#/smp_affinity
.
Execute the following script:
Combine the previous steps into one script:
Finally, a more stable result can be obtained
In client_plot.c
, I use clock_gettime
to get the time, where CLOCK_MONOTONIC
refers to the monotonically increasing time, which will not be affected by the switching of the system clock, but will be affected by adjtime()
and NTP
influences.
the CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.
Among them, adjtime()
is a function used to adjust the system clock, which will adjust the speed of the clock according to the timeval structure brought in.
The adjtime() function gradually adjusts the system clock (as returned by gettimeofday(2)). The amount of time by which the clock is to be adjusted is specified in the structure pointed to by delta. This structure has the following form:
If the adjustment in delta is positive, then the system clock is speeded up by some small percentage (i.e., by adding a small amount of time to the clock value in each second) until the adjustment has been completed. If the adjustment in delta is negative, then the clock is slowed down in a similar fashion.
In the kernel space, declare ktime_t
to record the time, and use write to return the time according to the prompt of the job description.
Finally, subtract the time in kernel space from the time measured in user space, and we can get the time from user space to kernel space plus kernel space and back to user space.
Here it is measured with the default fib_sequence()
function:
It can be seen that the later items will take more time to calculate, and the time trend of user space and kernel space is the same, except for the overhead passed in the middle.
Among them, the overhead of the system call is about 800 ns except for the large n = 1 and 2 at the beginning.
Next, we can know the overhead of user to kernel and kernel to user by returning the time point when entering the kernel space.
It turns out that the overhead of kernel to user will be higher than that of user to kernel, and the reason is still being discussed.
user to kernel is mainly a mode switch, no special memory range check is required
jserv
Next, to solve the problem of excessive overhead in the first few items, in [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#量測時 %E9%96%93%E7%9A%84%E6%96%B9%E5%BC%8F) In the notes, the teacher mentioned that it is a page fault problem.
After reading Threaded RT-application with memory locking and stack handling example, I learned that [mlock](https://man7 .org/linux/man-pages/man2/mlock.2.html) series of system calls, so that the memory will not be swap out and can always stay in RAM.
The method used in Threaded RT-application with memory locking and stack handling example is to first use mlockall
to combine current and future The memory is locked to prevent the system from swapping it out, and then a memory map is required to be reserved in RAM, so that the program can use the pre-reserved memory to avoid page faults. The detailed code is as follows.
Among them, malopt(M_TRIM_THRESHOLD, -1)
will automatically release the memory after free when the memory grows to a critical point.
When the amount of contiguous free memory at the top of the heap grows sufficiently large, free(3) employs sbrk(2) to release this memory back to the system.
Setting -1 here is to turn off this function.
And malopt(M_MMAP_MAX, 0)
is to avoid making mmap()
do large allocation request.
M_MMAP_MAX
This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance.
The default value is 65,536, a value which has no special significance and which serves only as a safeguard.
Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests.
Modify the measurement time code:
The same is to perform mlock first and then map a section of space first.
After the actual measurement, the overhead when n = 1 and 2 can be effectively reduced.
Re-measure with the default fib_sequence()
function:
According to the job description, outliers beyond two standard deviations are removed, and the calculation results of the remaining samples are averaged 50 times
Due to the low performance of large number calculations, the measurement results after will be optimized.
Use statistical methods to remove extreme values to measure the performance difference between fast doubling and general implementation, and the difference between fast doubling itself after introducing clz / ctz.
Regarding the implementation of fast doubling, refer to the job description to calculate the current item based on the value of each bit.
now known
So when the current bits are 0, we can use and to calculate the value of and , and when When the bits is 1, the obtained result can be used to calculate and .
The basic implementation is:
Comparison with Default Practice
Furthermore, since the currently implemented fast doubling will complete 32 times regardless of the number of items, we use __builtin_clz()
to reduce the number of calculations.
where the loop will start from
Start moving right.
Finally got the result
It can be seen that after adding __builtin_clz()
, it has become significantly faster.