contributed by <yaohwang99>
Refer to KYG-yaya573142's report and the homework description.
First, create another client script for measuring the time.
Here, I set the number of samples to 10 for a more stable result after post-processing the data.
As suggested in the homework description, I use write()
to measure time at the kernel level.
Because many other processes are running, the result is unstable.
Following the guide in the homework description, I create a shell script that pins client_stat to CPU 7
and restores the parameters afterwards.
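The script looks roughly like the following. This is a sketch based on the steps in the homework description; the sysfs paths assume an Intel CPU with the intel_pstate driver, and CPU 7 is assumed to have been isolated with `isolcpus` at boot:

```shell
#!/bin/bash
# Save the current settings so they can be restored afterwards.
orig_aslr=$(cat /proc/sys/kernel/randomize_va_space)
orig_gov=$(cat /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor)
orig_turbo=$(cat /sys/devices/system/cpu/intel_pstate/no_turbo)

# Disable ASLR and turbo boost, and fix the frequency governor.
sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"
sudo sh -c "echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor"
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

# Run the measurement client pinned to the isolated CPU 7.
sudo taskset -c 7 ./client_stat > statistic.txt

# Restore the original parameters.
sudo sh -c "echo $orig_aslr > /proc/sys/kernel/randomize_va_space"
sudo sh -c "echo $orig_gov > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor"
sudo sh -c "echo $orig_turbo > /sys/devices/system/cpu/intel_pstate/no_turbo"
```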
The result is now more acceptable.
C99 variable-length arrays (VLAs) are not allowed in the Linux kernel; we can solve the problem with a fixed-size array (or three variables).
The important part is that once the next Fibonacci number is calculated, we must keep track of which of the stored values is reused in the following iteration.
Ignoring the outliers, the result is much better than before.
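The fixed-size-array idea can be sketched in user space as follows (a minimal sketch; with 64-bit values it is only valid up to F(93), which is enough to check the logic):

```c
#include <stdint.h>

/* Iterative Fibonacci with a fixed-size two-element array instead of a
 * C99 VLA. Indexing with i & 1 alternates between the two slots, so
 * the value that is overwritten is exactly the one no longer needed. */
static uint64_t fib_iterative(unsigned n)
{
    uint64_t f[2] = {0, 1}; /* f[0] = F(0), f[1] = F(1) */
    for (unsigned i = 2; i <= n; i++)
        f[i & 1] = f[0] + f[1];
    return f[n & 1];
}
```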
Create a custom data structure for big numbers.
Because we need to allocate memory, I first use malloc()
in user mode to make sure there is no memory leak.
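A sketch of the structure I have in mind follows; the field and function names here are illustrative, not the exact ones in the driver:

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t u32;

/* Big number stored as an array of 32-bit blocks, least significant
 * block first. block_num is the number of allocated blocks. */
typedef struct {
    u32 *block;
    size_t block_num;
} big_num_t;

static big_num_t *big_num_new(size_t block_num)
{
    big_num_t *a = malloc(sizeof(big_num_t));
    if (!a)
        return NULL;
    a->block = calloc(block_num, sizeof(u32)); /* zero-initialized */
    if (!a->block) {
        free(a);
        return NULL;
    }
    a->block_num = block_num;
    return a;
}

static void big_num_free(big_num_t *a)
{
    if (a) {
        free(a->block);
        free(a);
    }
}
```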
Converting a big number to a string is a little tricky because we must estimate the length of the decimal string in advance, including the terminating '\0'. Every decimal digit needs more than 3 bits (log2(10) ≈ 3.32), so the digit count is at most the bit count divided by 3; therefore size_t len = (a->block_num * sizeof(u32) * 8) / 3 + 2;
is sufficient (one extra character for rounding and one for the '\0').

To build the decimal value, scan the bits from the most significant one: double the accumulated value at each step, and add 1 when the current bit is set. For example, converting 0b110101 (= 53), each "yes" column starts a new row of doubling powers, and the column sums give the running value:

| bit | 1 | 1 | 0 | 1 | 0 | 1 |
|---|---|---|---|---|---|---|
| plus 1 | yes | yes | no | yes | no | yes |
| | 1 | 2 | 4 | 8 | 16 | 32 |
| | | 1 | 2 | 4 | 8 | 16 |
| | | | | 1 | 2 | 4 |
| | | | | | | 1 |
| return | 1 | 3 | 6 | 13 | 26 | 53 |
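A user-space sketch of this double-and-add conversion, with the decimal digits kept directly in the output buffer (block layout and names are assumptions):

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;

/* Convert an array of 32-bit blocks (least significant block first)
 * to a decimal string of length len - 1 with leading zeros: scan the
 * bits from the most significant one, doubling the decimal digits and
 * adding 1 ("plus 1") when the bit is set. */
static void big_num_to_string(const u32 *block, size_t block_num,
                              char *buf, size_t len)
{
    memset(buf, '0', len - 1);
    buf[len - 1] = '\0';
    for (size_t i = block_num; i-- > 0;) {
        for (u32 mask = 1u << 31; mask; mask >>= 1) {
            int carry = (block[i] & mask) ? 1 : 0; /* the "plus 1" */
            for (size_t j = len - 1; j-- > 0;) {   /* double and add */
                int d = (buf[j] - '0') * 2 + carry;
                buf[j] = d % 10 + '0';
                carry = d / 10;
            }
        }
    }
}
```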
Test the result in user mode, using the basic iterative approach.
Free the unused big numbers so there is no memory leak.
Verify that there is no memory leak with valgrind.
kvmalloc()
tries kmalloc() first; if that fails (because there is no contiguous physical memory available), it falls back to vmalloc().
kvrealloc()
and kvfree()
are defined in linux/mm/util.c; however, kvrealloc() is not declared in the headers that we can include, so I defined it myself according to linux/mm/util.c.
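As a user-space analogue of that helper (malloc()/free() standing in for kvmalloc()/kvfree(); the kernel version follows linux/mm/util.c instead):

```c
#include <stdlib.h>
#include <string.h>

/* User-space analogue of a kvrealloc()-style helper: allocate a new
 * buffer, copy the old contents over, then free the old buffer. In
 * the kernel version, kvmalloc() and kvfree() take the places of
 * malloc() and free(). */
static void *my_kvrealloc(void *p, size_t oldsize, size_t newsize)
{
    void *newp = malloc(newsize);
    if (!newp)
        return NULL;
    if (p) {
        memcpy(newp, p, oldsize < newsize ? oldsize : newsize);
        free(p);
    }
    return newp;
}
```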
In kernel space, eliminate the leading zeros, copy the char
array back to user space, then release the memory.
In user space, use the last argument to choose which case to enter.
In this case, the memory in big_num_buf
will be written by copy_to_user
in kernel space.
If the buffer big_num_buf
is too small, copy_to_user
will return the number of bytes that could not be copied.
In client.c
, I print out the result in the following format and successfully passed verify.py.
Calculate each Fibonacci number a few times.
Plotting the median of the samples produces a stable result.
Notice that there are big jumps at certain values of n; that is because one more block of memory is required to store the result there.
Also notice that the numbers are measured in a shuffled order
instead of sequentially from small to large.
This way, nearby n are measured at very different times, so the samples are less biased.
The big jumps occur exactly where a new block of memory needs to be allocated.
To calculate Fibonacci numbers using fast doubling, we need multiplication and subtraction of big numbers.
For subtraction, calculate the 2's complement, add, and discard the overflow. Here we assume a is greater than b.
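A user-space sketch of this (the block layout and names are assumptions; a >= b is required, as noted):

```c
#include <stdint.h>
#include <stddef.h>

/* Subtract b from a (assumes a >= b), both stored as little-endian
 * arrays of 32-bit blocks of equal length n: add the two's complement
 * of b (~b plus an initial carry of 1) and discard the final carry. */
static void big_num_sub(uint32_t *c, const uint32_t *a,
                        const uint32_t *b, size_t n)
{
    uint64_t carry = 1; /* the +1 that completes the two's complement */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)a[i] + (uint32_t)~b[i] + carry;
        c[i] = (uint32_t)t;
        carry = t >> 32; /* the last carry out is simply dropped */
    }
}
```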
For multiplication, we use the standard way of multiplying two binary numbers:
iterate through all the bits in a
, add b
to the result if the current bit is 1
, then shift b
left by one bit for the next iteration.
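On single machine words, the same shift-and-add loop looks like this (the big-number version applies the add and the shift across all blocks):

```c
#include <stdint.h>

/* Shift-and-add multiplication: for each set bit of a, add the
 * correspondingly shifted b to the result. Arithmetic is modulo
 * 2^64, matching ordinary unsigned multiplication. */
static uint64_t mul_shift_add(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    while (a) {
        if (a & 1)   /* add b when the current bit of a is 1 */
            result += b;
        b <<= 1;     /* shift b left by one bit for the next bit of a */
        a >>= 1;
    }
    return result;
}
```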
Implement fast doubling with the above functions.
big_num_square()
duplicates the argument and then uses big_num_mul().
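A word-sized sketch of the fast doubling loop follows. In the real driver the arithmetic below is done with big_num_mul(), big_num_square(), and big_num_sub(); uint64_t overflows for large n, so this only illustrates the identities F(2k) = F(k)·(2·F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)²:

```c
#include <stdint.h>

/* Fast doubling: walk the bits of n from the most significant one.
 * Each bit doubles the index k; a set bit advances k by one more. */
static uint64_t fib_fdouble(unsigned n)
{
    uint64_t a = 0, b = 1; /* a = F(k), b = F(k+1), starting at k = 0 */
    for (unsigned mask = 1u << 31; mask; mask >>= 1) {
        uint64_t t1 = a * (2 * b - a); /* F(2k)   */
        uint64_t t2 = a * a + b * b;   /* F(2k+1) */
        a = t1;
        b = t2;
        if (n & mask) { /* advance one step when the bit is set */
            t1 = a + b;
            a = b;
            b = t1;
        }
    }
    return a;
}
```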
The time measurement result is worse than the iterative approach.
Referring to KYG-yaya573142's report, the reasons may include the following:
- In fib_sequence_big_num_fdouble()
, I use a lot of temporary pointers and memory copies, which could be avoided.
- The block size could be larger: use u64
for each block and __int128
from gcc if needed.

Use perf to check the program's hot spots.
At first glance, we might think that multiplication is slow. However, if we look at the distribution of the callees in the following statistics, kfree()
and kmalloc()
consume a lot of resources because big_num_mul()
and other functions call them very frequently.
We can save a lot of resources by reusing the existing memory blocks, for example void big_num_add(big_num_t *c, big_num_t *a, big_num_t *b)
instead of big_num_t *big_num_add(big_num_t *a, big_num_t *b)
, because with the latter, the old c
must be freed before receiving the return value.
The result is much better with a 40% improvement, from 790 ms to 475 ms.
Check again with perf
Next, I try to improve big_num_lshift().
By using cnt
to record how many bits to shift, and only shifting when the current bit is 1, we don't have to shift in every iteration.
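Sketched on machine words, the cnt trick defers the shifts until they are actually needed:

```c
#include <stdint.h>

/* Shift-and-add multiplication with deferred shifts: cnt accumulates
 * how many positions b still has to move, and the shift is performed
 * in one go only when a set bit of a requires an addition. */
static uint64_t mul_batched_shift(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    unsigned cnt = 0;
    while (a) {
        if (a & 1) {
            b <<= cnt;   /* perform all deferred shifts at once */
            cnt = 0;
            result += b;
        }
        cnt++;           /* one more pending shift for the next bit */
        a >>= 1;
    }
    return result;
}
```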
The result is finally better than the iterative approach.
The reference is bignum.
Now the bottleneck is that I use too many unnecessary memory operations.
Take big_num_sub()
as an example: we don't need to calculate the 2's complement of the whole big number or set c
to 0
first.
We can improve big_num_square()
by symmetry.
[Reference]
Squaring can be at most 2x faster than regular multiplication of arbitrary numbers because of symmetry. For example, calculate the square of 1011 and try to spot a pattern that we can exploit; u0–u3 represent the bits of the number from the most significant to the least significant.
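A word-sized sketch of the symmetry: each diagonal term u_i² appears once, while every cross term u_i·u_j (i < j) appears twice, so we can sum the cross terms once and double the total:

```c
#include <stdint.h>

/* Square a 32-bit value bit by bit, exploiting symmetry: diagonal
 * contributions 2^(2i) are added once; cross contributions 2^(i+j)
 * for i < j are accumulated once and doubled at the end. */
static uint64_t square_by_symmetry(uint32_t x)
{
    uint64_t result = 0, cross = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (!((x >> i) & 1))
            continue;
        result += 1ULL << (2 * i);        /* diagonal term u_i^2 */
        for (unsigned j = i + 1; j < 32; j++)
            if ((x >> j) & 1)
                cross += 1ULL << (i + j); /* each cross term, once */
    }
    return result + 2 * cross;            /* cross terms appear twice */
}
```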
The result is not improved much because of the addition cost.
Now each Fibonacci number below F(1000) can be computed within 10 microseconds.
The bottleneck is still memory allocation and free.
We can modify the structure to record the current block number (the number of blocks in use, i.e. non-zero) and the allocated block number.
In the resize function, we only allocate new memory when num > true_block_num
, which will not happen if we pre-allocate sufficient memory.
Based on observation, the number of blocks needed increases by one roughly every 46 Fibonacci numbers.
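A user-space sketch of the resize strategy (field names follow the description above but are otherwise assumptions):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t *block;
    size_t block_num;      /* blocks currently in use (non-zero) */
    size_t true_block_num; /* blocks actually allocated */
} big_num_t;

/* Grow or shrink the logical size. Memory is touched only when the
 * requested size exceeds the allocated capacity, so pre-allocating
 * enough blocks makes resizing free. */
static int big_num_resize(big_num_t *a, size_t num)
{
    if (num > a->true_block_num) {
        uint32_t *p = realloc(a->block, num * sizeof(uint32_t));
        if (!p)
            return -1;
        /* zero the newly allocated blocks */
        memset(p + a->true_block_num, 0,
               (num - a->true_block_num) * sizeof(uint32_t));
        a->block = p;
        a->true_block_num = num;
    }
    a->block_num = num;
    return 0;
}
```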
After pre-allocating memory, the result is so much closer to the reference.
If we measure the time for fib(10000), it is clear that we still need lots of improvement.
A 64-bit CPU provides instructions for addition/multiplication of 64-bit numbers.
By changing the block size of the big number to 64 bits, we can be twice as fast as the previous version.
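For example, with u64 blocks, gcc's unsigned __int128 can capture the carry out of a 64-bit addition in a single operation (a sketch; the same idea applies to the high half of a 64x64 multiplication):

```c
#include <stdint.h>

/* Add two 64-bit blocks plus an incoming carry; the 128-bit
 * intermediate holds both the 64-bit sum and the carry out. */
static uint64_t add_with_carry(uint64_t a, uint64_t b, uint64_t *carry)
{
    unsigned __int128 t = (unsigned __int128)a + b + *carry;
    *carry = (uint64_t)(t >> 64); /* carry out for the next block */
    return (uint64_t)t;           /* low 64 bits of the sum */
}
```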
We could also use __asm__
instead of a 128-bit integer. Result: