fibdrv

contributed by < jhan1998 >

large number calculation

Compute Fibonacci numbers after F₉₃ (inclusive)

Since the existing Fibonacci sequence is calculated using long long int, when

f (n) > 2^{64} - 1

, overflow will be generated, and the resulting sequence will be wrong, as can be seen from the following:

\begin{matrix} f (93) = 12200160415121876738 \\ 2^{64} - 1 = 9223372036854775807 \\ f (93) > 2^{64} - 1 \end{matrix}

After item 93, there is no way to output the sequence correctly, so I use the code in Small/Short String Optimization of the third week quiz to perform string The addition is implemented so that the subsequent sequence can be run.

Implementation

Since fibdrv.c will be executed in the Linux kernel mode through LKM, it cannot refer to the general C header file, so some malloc, free, assert must be changed to the corresponding kernel api. For example:

malloc() => kmalloc() or kzalloc()
free() => kfree()
assert() => BUG_ON()

Then, implement string_number_add()：

static void string_number_add(xs *a, xs *b, xs *out)
{
    char *data_a, *data_b;
    int i, carry = 0;
    int sum;
    size_t size_a, size_b;

    /*
     * Make sure the string length of 'a' is always greater than
     * the one of 'b'.
     */
    if (xs_size(a) < xs_size(b))
        __swap((void *) &a, (void *) &b);

    data_a = xs_data(a);
    data_b = xs_data(b);

    size_a = xs_size(a);
    size_b = xs_size(b);

    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    char buf[size_a + 2];

    for (i = 0; i < size_b; i++) {
        sum = (data_a[i] - '0') + (data_b[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    for (i = size_b; i < size_a; i++) {
        sum = (data_a[i] - '0') + carry;
        buf[i] = '0' + sum % 10;
        carry = sum / 10;
    }

    if (carry)
        buf[i++] = '0' + carry;

    buf[i] = 0;

    reverse_str(buf, i);

    /* Restore the original string */
    reverse_str(data_a, size_a);
    reverse_str(data_b, size_b);

    if (out) {
        BUG_ON(sizeof(buf) >= MAX_STR_LEN);
        *out = *xs_new(&xs_literal_empty(), buf);
    }
}

The principle is not difficult to understand, that is, the original string is reversed, so that it can be added character by character from the lowest bit, and finally the result of the addition is reversed.

After the implementation, the corresponding fib_sequence() is changed to:

static xs fib_sequence_str(long long k)
{
    /* FIXME: use clz/ctz and fast algorithms to speed up */
    xs f[k + 2];

    f[0] = *xs_new(&xs_literal_empty(), "0");
    f[1] = *xs_new(&xs_literal_empty(), "1");
    for (int i = 2; i <= k; i++) {
        string_number_add(&f[i - 1], &f[i - 2], &f[i]);
    }
    return f[k];
}

In addition, fib_read() will return the string to user space.

static ssize_t fib_read(struct file *file,
                         char *buf,
                         size_t size,
                         loff_t *offset)
{
     long long int k = *offset;
     xs fib = fib_sequence_str(k);
     copy_to_user(buf, xs_data(&fib), strlen(xs_data(&fib)) + 1);
     return 0;
}

In the end just change MAX_LENGTH and we're done.

Verify correctness

Change the value of fibdrv/scripts/expected.txt before verifying, otherwise the result will be wrong.

$ make check
cc -o client client.c
make -C /lib/modules/5.4.0-70-generic/build M=fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.4.0-70-generic'
  CC [M]  fibdrv/fibdrv.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC [M]  fibdrv/fibdrv.mod.o
  LD [M]  /fibdrv/fibdrv.ko
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-70-generic'
make unload
make[1]: Entering directory 'fibdrv'
sudo rmmod fibdrv || true >/dev/null
rmmod: ERROR: Module fibdrv is not currently loaded
make[1]: Leaving directory 'fibdrv'
make load
make[1]: Entering directory 'fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory 'fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory 'fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory 'fibdrv'
 Passed [-]

Then use The first 500 Fibonacci numbers to verify whether the calculation to

F_{500}

is correct.

Change the answer in fibdrv/scripts/expected.txt to The first 500 Fibonacci numbers series.

  0 0
  1 1
  2 1
  3 2
  4 3
  5 5
  6 8
  7 13
  8 21
  9 34
 10 55
 .
 .
 .
 .
495 12571784316098613244768412217102088629494571867212494830322628505768565710528091492618664254204672140130
496 20341574322680408081083829243820203612317308197211964554628215486203974898255803242740333222721700974747
497 32913358638779021325852241460922292241811880064424459384950843991972540608783894735358997476926373114877
498 53254932961459429406936070704742495854129188261636423939579059478176515507039697978099330699648074089624
499 86168291600238450732788312165664788095941068326060883324529903470149056115823592713458328176574447204501
500 139423224561697880139724382870407283950070256587697307264108962948325571622863290691557658876222521294125

Next verify if it is correct.

$ make check
make -C /lib/modules/5.4.0-70-generic/build M=fibdrv modules
make[1]: Entering directory '/usr/src/linux-headers-5.4.0-70-generic'
  Building modules, stage 2.
  MODPOST 1 modules
make[1]: Leaving directory '/usr/src/linux-headers-5.4.0-70-generic'
make unload
make[1]: Entering directory 'fibdrv'
sudo rmmod fibdrv || true >/dev/null
rmmod: ERROR: Module fibdrv is not currently loaded
make[1]: Leaving directory 'fibdrv'
make load
make[1]: Entering directory 'fibdrv'
sudo insmod fibdrv.ko
make[1]: Leaving directory 'fibdrv'
sudo ./client > out
make unload
make[1]: Entering directory 'fibdrv'
sudo rmmod fibdrv || true >/dev/null
make[1]: Leaving directory 'fibdrv'
 Passed [-]

From this we can see that this implementation is correct at least up to

F_{500}

Exclude factors that interfere with performance analysis

Limit the CPU to a specific program

First modify GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub to specify that cpu 7 is not used for kernel schduler scheduling, change it to "quiet splash isolcpus=7" and then need sudo update-grub to Update the settings, and finally restart the machine.

You can use cat /sys/devices/system/cpu/isolated command to know whether it is successful or not.

$ cat /sys/devices/system/cpu/isolated
7

Looking at the executable cores of pid 1 also shows that only 0 - 6 are left.

$ taskset -cp 1
pid 1's current affinity list: 0-6

Use taskset to specify the CPU core to execute the program:

taskset -c 7 ./client

Eliminate performance interference factors

Next, compare the difference before and after exclusion. Here, fast doubling and iterate are compared.

before exclusion:

It can be seen that the amplitude of the shock is also large.

Next, eliminate the interfering factors

suppress address space layout randomization (ASLR)

$ sudo sh -c "echo 0 > /proc/sys/kernel/randomize_va_space"

Set scaling_governor to performance

Since here we only need to set the 7th cpu as perfoemance, so just execute:

echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

For Intel processors, turn off turbo mode

sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

After exclusion:

There are still some slight vibrations.

SMP IRQ affinity

Refer to KYG-yaya573142 notes

We can specify or limit which CPU can do IRQ by modifying /proc/irq/IRQ#/smp_affinity.

Execute the following script:

#!/bin/bash
for file in `find /proc/irq -name "smp_affinity"`
do
     var=0x`cat ${file}`
     var="$(( $var & 0x7f ))"
     var = `printf '%.2x' ${var}`
     sudo bash -c "echo ${var} > ${file}"
done
sudo bash -c "echo 7f > /proc/irq/default_smp_affinity"

Combine the previous steps into one script:

#!/bin/bash

CPUID=7
ORIG_ASLR=`cat /proc/sys/kernel/randomize_va_space`
ORIG_GOV=`cat /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor`
ORIG_TURBO=`cat /sys/devices/system/cpu/intel_pstate/no_turbo`

sudo bash -c "echo 0 > /proc/sys/kernel/randomize_va_space"
sudo bash -c "echo performance > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
sudo bash -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

make unload
make load
rm -f plot.txt
sudo taskset -c 7 ./client_plot > plot.txt
gnuplot before.gp

itr=0
for file in `find /proc/irq -name "smp_affinity"`
do
    var=0x`cat ${file}`
    arr[${itr}]=${var}
    var="$(( $var & 0x7f ))"
    var=`printf '%.2x' ${var}`
    sudo bash -c "echo ${var} > ${file}"
    (( itr++ ))
done
sudo bash -c "echo 7f > /proc/irq/default_smp_affinity"

rm -f plot.txt
sudo taskset -c 7 ./client_plot > plot.txt
gnuplot after.gp
make unload

# restore the original system settings
sudo bash -c "echo $ORIG_ASLR >  /proc/sys/kernel/randomize_va_space"
sudo bash -c "echo $ORIG_GOV > /sys/devices/system/cpu/cpu$CPUID/cpufreq/scaling_governor"
sudo bash -c "echo $ORIG_TURBO > /sys/devices/system/cpu/intel_pstate/no_turbo"

itr=0
for file in `find /proc/irq -name "smp_affinity"`
do
    var=${arr[${itr}]}
    var=`printf '%.2x' ${var}`
    sudo bash -c "echo ${var} > ${file}"
    (( itr++ ))
done
sudo bash -c "echo ff > /proc/irq/default_smp_affinity"

Finally, a more stable result can be obtained

Measure user space and kernel space time

In client_plot.c, I use clock_gettime to get the time, where CLOCK_MONOTONIC refers to the monotonically increasing time, which will not be affected by the switching of the system clock, but will be affected by adjtime() and NTP influences.

the CLOCK_MONOTONIC clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.

Among them, adjtime() is a function used to adjust the system clock, which will adjust the speed of the clock according to the timeval structure brought in.

The adjtime() function gradually adjusts the system clock (as returned by gettimeofday(2)). The amount of time by which the clock is to be adjusted is specified in the structure pointed to by delta. This structure has the following form:
       struct timeval {
           time_t      tv_sec;     /* seconds */
           suseconds_t tv_usec;    /* microseconds */
       };
If the adjustment in delta is positive, then the system clock is speeded up by some small percentage (i.e., by adding a small amount of time to the clock value in each second) until the adjustment has been completed. If the adjustment in delta is negative, then the clock is slowed down in a similar fashion.

clock_gettime(CLOCK_MONOTONIC, &start);
        kernel_t = write(fd, write_buf, 0); /* recursion w/ cache */
        clock_gettime(CLOCK_MONOTONIC, &end);
        long long user_t = (long long)(end.tv_sec * 1e9 + end.tv_nsec)
             - (start.tv_sec * 1e9 + start.tv_nsec);
        printf("%d %lld %lld %lld\n", i, user_t, kernel_t, user_t - kernel_t);

In the kernel space, declare ktime_t to record the time, and use write to return the time according to the prompt of the job description.

kt = ktime_get();
fib_sequence(*offset);
kt = ktime_sub(ktime_get(), kt);

Finally, subtract the time in kernel space from the time measured in user space, and we can get the time from user space to kernel space plus kernel space and back to user space.

Here it is measured with the default fib_sequence() function:

It can be seen that the later items will take more time to calculate, and the time trend of user space and kernel space is the same, except for the overhead passed in the middle.

Among them, the overhead of the system call is about 800 ns except for the large n = 1 and 2 at the beginning.

Next, we can know the overhead of user to kernel and kernel to user by returning the time point when entering the kernel space.

case 2:
    return (ssize_t) ktime_to_ns(ktime_get());

clock_gettime(CLOCK_MONOTONIC, &start);
kernel_t = write(fd, write_buf, 2); /* recursion w/ cache */
clock_gettime(CLOCK_MONOTONIC, &end);
long long kernel_t_u = (long long)(end.tv_sec * 1e9 + end.tv_nsec) - kernel_t;
long long user_t_k = kernel_t - (long long)(start.tv_sec * 1e9 + start.tv_nsec);
printf("%d %lld %lld\n", i, user_t_k, kernel_t_u);

It turns out that the overhead of kernel to user will be higher than that of user to kernel, and the reason is still being discussed.

user to kernel is mainly a mode switch, no special memory range check is required

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

Page fault

Next, to solve the problem of excessive overhead in the first few items, in [KYG-yaya573142](https://hackmd.io/@KYWeng/rkGdultSU#量測時 %E9%96%93%E7%9A%84%E6%96%B9%E5%BC%8F) In the notes, the teacher mentioned that it is a page fault problem.

After reading Threaded RT-application with memory locking and stack handling example, I learned that [mlock](https://man7 .org/linux/man-pages/man2/mlock.2.html) series of system calls, so that the memory will not be swap out and can always stay in RAM.

The method used in Threaded RT-application with memory locking and stack handling example is to first use mlockall to combine current and future The memory is locked to prevent the system from swapping it out, and then a memory map is required to be reserved in RAM, so that the program can use the pre-reserved memory to avoid page faults. The detailed code is as follows.

static void configure_malloc_behavior(void)
{
    /* Now lock all current and future pages 
       from preventing of being paged */
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
   	    perror("mlockall failed:");
   
   /* Turn off malloc trimming.*/
    mallopt(M_TRIM_THRESHOLD, -1);
   
   /* Turn off mmap usage. */
    mallopt(M_MMAP_MAX, 0);
}

Among them, malopt(M_TRIM_THRESHOLD, -1) will automatically release the memory after free when the memory grows to a critical point.

mallopt

When the amount of contiguous free memory at the top of the heap grows sufficiently large, free(3) employs sbrk(2) to release this memory back to the system.

Setting -1 here is to turn off this function.

And malopt(M_MMAP_MAX, 0) is to avoid making mmap() do large allocation request.

M_MMAP_MAX
This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance.
The default value is 65,536, a value which has no special significance and which serves only as a safeguard.
Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests.

Modify the measurement time code:

if (mlockall(MCL_CURRENT | MCL_FUTURE))
        perror("mlockall failed:");

    /* Turn off malloc trimming.*/
    mallopt(M_TRIM_THRESHOLD, -1);

    /* Turn off mmap usage. */
    mallopt(M_MMAP_MAX, 0);

    for (int i = 0; i <= offset; i++) {
        lseek(fd, 0, SEEK_SET);
        clock_gettime(CLOCK_MONOTONIC, &start);
        kernel_t = write(fd, write_buf, 2);
        clock_gettime(CLOCK_MONOTONIC, &end);
    }

The same is to perform mlock first and then map a section of space first.

After the actual measurement, the overhead when n = 1 and 2 can be effectively reduced.

Re-measure with the default fib_sequence() function:

Use statistical methods to remove extreme values

According to the job description, outliers beyond two standard deviations are removed, and the calculation results of the remaining samples are averaged 50 times

Due to the low performance of large number calculations, the measurement results after

F_{100}

will be optimized.

fast doubling & clz / ctz

Use statistical methods to remove extreme values to measure the performance difference between fast doubling and general implementation, and the difference between fast doubling itself after introducing clz / ctz.

Regarding the implementation of fast doubling, refer to the job description to calculate the current item based on the value of each bit.

now known

F (2 K) = F (k) [2 F (K + 1) - F (k)] F (2 K + 1) = F (k + 1)^{2} + F (k)^{2}

So when the current bits are 0, we can use

F (k)

and

F (k + 1)

to calculate the value of

F (2 k)

and

F (2 k + 1)

, and when When the bits is 1, the obtained result can be used to calculate

F (2 k + 1)

and

F (2 k + 2)

The basic implementation is:

static long long fib_fast_double(long long k)
{
    /* FIXME: use clz/ctz and fast algorithms to speed up */
    if (k < 2) {
        return k;
    }

    long long f[2];

    f[0] = 0;
    f[1] = 1;

    for (unsigned int i = 1U << 31; i; i >>= 1) {
        long long k1 = f[0] * (f[1] * 2 - f[0]);
        long long k2 = f[0] * f[0] + f[1] * f[1];

        if (k & i) {
            f[0] = k2;
            f[1] = k2 + k1;
        } else {
            f[0] = k1;
            f[1] = k2;
        }
    }

    return f[0];
}

Comparison with Default Practice

Furthermore, since the currently implemented fast doubling will complete 32 times regardless of the number of items, we use __builtin_clz() to reduce the number of calculations.

where the loop will start from

1 << 31 - __builtin_clz(k)

Start moving right.

Finally got the result

It can be seen that after adding __builtin_clz(), it has become significantly faster.

fibdrv

large number calculation

Compute Fibonacci numbers after F93 (inclusive)

Implementation

Verify correctness

Exclude factors that interfere with performance analysis

Limit the CPU to a specific program

Eliminate performance interference factors

SMP IRQ affinity

Measure user space and kernel space time

Page fault

Use statistical methods to remove extreme values

fast doubling & clz / ctz

Read more

Kata container

Benchmark Your Computer Black Box

Key-Value Stroages

Kata container deploy record

Compute Fibonacci numbers after F₉₃ (inclusive)