
SMP OS Kernel for RISC-V

莊易騰

TODO

  • Use QEMU to run uCore-SMP and fix related issues
  • Trace uCore-SMP using GDB + QEMU
  • Explain how the OS kernel on an SMP processor initializes each hart
  • Explain the use of CLINT in uCore-SMP
  • Explore the implementation of HSM, OpenSBI and uCore-SMP
  • Examine the system call design of uCore-SMP, identify its design flaws, and propose improvements
  • Discuss CPU scheduling issues in SMP, using uCore-SMP as an example, and analyze its fairness

Source Code

The source code for this project is available on GitHub.

You should download the Alpine Python3 RootFS and place it in the user directory.

Tool Version

  • QEMU Emulator: Version 5.0.0
  • RISC-V GCC Compiler: riscv64-unknown-elf-gcc (g04696df09) Version 14.2.0
  • OpenSBI: Version 0.9 (fw_jump.bin in the bootloader directory)

Please refer to commits 4834d56, a2b16cf, and 95021ed.
These commits resolve compilation errors and modify the Makefile in the repository to ensure successful operation.

To complete the compilation process, you can use the all.sh script as follows:

$ chmod +x all.sh
$ ./all.sh

If successful, the result will look like the image below:

image

We can use make run to launch QEMU and execute the kernel:

$ make run

Trace uCore-SMP using GDB + QEMU

GDB Background
An inferior is a general term that refers to "an entity being debugged using GDB."

How to Run GDB on uCore-SMP
We can configure GDB to use an initialization file called .gdbinit.
To set this up, follow these steps before running GDB:

$ mkdir ~/.config/gdb
$ echo "add-auto-load-safe-path your_uCore-SMP_path/.gdbinit" >> ~/.config/gdb/gdbinit

Next, start the GDB server with the following command:

$ make gdb_debug

Finally, open another terminal and use the following command to connect to the GDB server:

$ riscv64-unknown-elf-gdb

Common GDB commands

  • r: Run the program being debugged
  • l: Show the source code around the current execution point
  • info breakpoints: List all breakpoints
  • d breakpoint_number: Delete a specific breakpoint
  • n: Execute the next line of code, stepping over function calls
  • s: Execute the next line of code, stepping into function calls
  • c: Resume execution until the next breakpoint
  • bt: Display the current call stack
  • info registers: Display the current values of CPU registers
  • q: Exit GDB

The issue appeared in the fu740 branch, but I discovered that I should be using the fu740-python branch.

When I used the make run command to launch QEMU, I encountered the following issue:

image

Using GDB, I determined that the problem originates in the kinit function.

image

To address the issue, I traced the code of the kinit function. The following image illustrates the flow of the kinit function. We can observe that the freerange function repeatedly calls the recycle_physical_page function. Within the recycle_physical_page function, the memset function is invoked to initialize each physical page.

image

The issue I encountered was caused by the memory size being too small. To resolve this, we need to adjust the memory size when launching QEMU.

According to the image below, we need to allocate 256 MiB of memory space, spanning from 0x80000000 to 0x90000000.

image

According to the QEMU documentation, the default RAM size is 128 MiB.

image


After resolving the issue mentioned above, I encountered another problem:

image

After discussing with my instructor, I attempted to utilize the QEMU monitor to resolve the issue and learn how to use it effectively.

QEMU Monitor Background
The QEMU monitor is used to give complex commands to the QEMU emulator.

How to run QEMU monitor on uCore-SMP
To use the QEMU monitor for memory inspection, I modified the uCore-SMP makefile.
You can launch the QEMU monitor by running the following command:

$ make monitor_debug

Finally, open another terminal and use the following command to connect to the QEMU monitor:

$ telnet 127.0.0.1 4444

Common commands

  • info qtree: Show device tree
  • xp /fmt addr: Physical memory dump starting at addr

I used the QEMU monitor to inspect memory state and discovered that the content of the mount point consisted entirely of zeros. This led me to suspect that the issue might have been caused by the format of the drive image.

Ultimately, I identified the root cause and resolved the issue (Exploration Process). The image below demonstrates Python executing successfully on the kernel.

image

Exploration Process
This section explains how I resolved the f_mount failure issue.

Upon encountering the issue, I began investigating the functionality of the f_mount API and learned that it is used to mount a filesystem object to a logical volume. Additionally, I examined the meaning of f_mount's return value and found that the return value of 13 indicates FR_NO_FILESYSTEM.

However, I initially lacked a clear understanding of FR_NO_FILESYSTEM, so I consulted with my instructor. My instructor suggested using the QEMU monitor to inspect the memory's status. Following this advice, I eventually discovered that the drive image's format was not as expected.

To investigate further, I examined the contents of fs.img using the file command, which identified the type of fs.img as "data".

user/fs.img: data

I also analyzed the contents of riscv64-rootfs.img with the file command, and the result was as follows:

user/riscv64-rootfs.img: DOS/MBR boot sector MS-MBR Windows 7 english at offset 0x163 "Invalid partition table" at offset 0x17b "Error loading operating system" at offset 0x19a "Missing operating system", disk signature 0xc7aafbf3; partition 1 : ID=0xb, active, start-CHS (0x0,32,33), end-CHS (0xc,190,50), startsector 2048, 202752 sectors

From this analysis, I concluded that fs.img was not usable. I then revisited the README.md file for further clarification. Upon reviewing it, I realized that I had been executing the kernel on the wrong branch. After switching to the correct branch, I was able to successfully launch the kernel.

Explain how the OS kernel on an SMP processor initializes each hart

Before entering the kernel's main function, _entry.S allocates a boot stack for each hart.
Since the code only allocates 8 stacks of 4 KB each, the number of CPUs should not exceed 8.

image

When the first hart begins executing the main function, it stores the hartid in the tp register and sets the static variable first_hart to 0.

The first hart then calls the init_cpu function to initialize the status of all CPUs, including setting the core_id and start_cycle.

for (int i = 0; i < NCPU; i++) {
    ...
    cpus[i].core_id = i;
    ...
    // The r_time function returns the value of the time csr
    cpus[i].start_cycle = r_time();
}

Next, the first hart calls the trapinit_hart function to set stvec and enable the supervisor global interrupt by setting sstatus.SIE. It then enables the supervisor external interrupt and the supervisor software interrupt by setting sie.SEIE and sie.SSIE, respectively.

PLIC Background
PLIC supports up to 1023 interrupts (0 is reserved) and 15872 contexts.

  • Each interrupt has a priority value that can be set by the programmer
  • Each target can be interrupted by N interrupt sources, so there are N interrupt enable bits to control whether the target receives these interrupts
  • Each target has an interrupt threshold. For an interrupt to occur, its priority value must be greater than the threshold

image

After executing the trapinit_hart function, the first hart executes the plicinit and plicinithart functions. The plicinit function sets the priority of interrupt 1 to 1. The plicinithart function enables interrupt 1 for supervisor mode and sets the supervisor's interrupt threshold to 0.

Then, the first hart executes the kvminit function to create the kernel page table and the kvminithart function to write to the satp CSR, enabling virtual memory. After enabling virtual memory, the first hart executes the sfence.vma zero, zero instruction to flush all TLB entries.

After the first hart completes all initialization, it calls the init_booted function to set both static arrays, booted and halted, to all zeros:

volatile int booted[NCPU]; // whether a core is booted
volatile int halted[NCPU]; // whether a core is halted

The first hart then sets booted[hartid] to 1, marking itself as booted, and begins booting the other harts by calling the start_hart function.

booted[hartid] = 1; // Mark the first hart as booted
int CPU_START = 0; // Starting index for CPUs to boot
for (int i = CPU_START; i < NCPU; i++) {
    if (i != hartid) { // Skip the first hart (already booted)
        start_hart(i, (uint64_t)_entry, 0);
        while (booted[i] == 0) {
            // Wait until the hart has updated its booted status
        }
    }
}

OpenSBI Background
An SBI extension defines a set of SBI functions which provide a particular functionality to supervisor-mode software.
The higher-privilege software providing the SBI interface to the supervisor-mode software is referred to as an SBI implementation or Supervisor Execution Environment (SEE).
image

All SBI functions share a single binary encoding, which facilitates the mixing of SBI extensions. The SBI specification follows the calling convention below.

  • An ECALL is used as the control transfer instruction between the supervisor and the SEE
  • a7 encodes the SBI extension ID (EID)
  • a6 encodes the SBI function ID (FID)
  • SBI functions must return a pair of values in a0 and a1, with a0 returning an error code.
    struct sbiret {
        long error;
        long value;
    };

The start_hart function internally calls the a_sbi_ecall function, which uses the ecall instruction to invoke the Supervisor Execution Environment (SEE).

void start_hart(uint64 hartid, uint64 start_addr, uint64 a1) {
    a_sbi_ecall(0x48534D, 0, hartid, start_addr, a1, 0, 0, 0);
}

The value 0x48534D represents the Extension ID (EID) for the Hart State Management (HSM) extension in the RISC-V SBI.

The second argument of the a_sbi_ecall function is the Function ID (FID).

In the HSM extension, Function ID 0 corresponds to the sbi_hart_start function, which requests the SBI implementation to start executing the target hart in supervisor mode.

struct sbiret sbi_hart_start(
    unsigned long hartid,
    unsigned long start_addr,
    unsigned long opaque
)

After the first hart completes its initialization, the other harts execute the hart_bootcamp function as follows:

void hart_bootcamp(uint64 hartid, uint64 a1) {
    w_tp(hartid);       // Write hart ID to the thread pointer
    kvminithart();      // Initialize kernel virtual memory for this hart
    trapinit_hart();    // Set up the trap handling for this hart
    plicinithart();     // Configure the PLIC for this hart
    booted[hartid] = 1; // Mark this hart as booted
}

Once each hart (besides the first) has executed the hart_bootcamp function, it enters a while loop to wait for all harts to complete booting:

while (!all_started) {
    ; // Busy wait until all harts have started
}

When the first hart completes booting all the other harts, it will execute the wait_all_boot function to signal that all harts have started by setting all_started to 1.

void wait_all_boot() {
    ...
    all_started = 1;
}

Once all harts have been booted and the all_started flag is set to 1 by the first hart, all harts proceed to execute the scheduler function.

Explain the use of CLINT in uCore-SMP

Explore the implementation of HSM, OpenSBI and uCore-SMP

This content is referenced from fw.md.
OpenSBI currently supports three different types of firmwares:

  • Firmware with Dynamic Information (FW_DYNAMIC)
  • Firmware with Jump Address (FW_JUMP)
    The FW_JUMP firmware assumes a fixed address of the next booting stage entry, e.g. a bootloader or an OS kernel, without directly including the binary code for this next stage.
  • Firmware with Payload (FW_PAYLOAD)

The uCore-SMP utilizes the FW_JUMP firmware, so I will explain the FW_JUMP flow.

At the beginning, the _start function determines which hart will execute the _relocate function. Assume that hart 0 executes the amoadd.w instruction first. As shown in the image below, hart 0 reads a zero value from memory, while the other harts will read a non-zero value from memory. Therefore, hart 0 is designated to execute the _relocate function.
image

The next image illustrates the execution flow for harts other than hart 0 (the boot core).
These non-boot cores will wait for the boot core to complete relocation and initialization before proceeding.
image

The following image displays the execution flow for hart 0 (the boot core).
image

After one hart executes _start_warm, it will call the sbi_init function.
In the sbi_init function, a lottery mechanism is used to select a coldboot hart from all available harts. Only one hart will execute init_coldboot, while the other harts will execute init_warmboot.

if (next_mode_supported && atomic_xchg(&coldboot_lottery, 1) == 0)
    coldboot = TRUE;

if (coldboot)
    init_coldboot(scratch, hartid);
else
    init_warmboot(scratch, hartid);

HSM Extension Background
The Hart State Management (HSM) Extension introduces a set of hart states and a set of functions which allow the supervisor-mode software to request a hart state change.

Examine the system call design of uCore-SMP, identify its design flaws, and propose improvements

Discuss CPU scheduling issues in SMP, using uCore-SMP as an example, and analyze its fairness

The OS kernel manages all processes and determines which process each hart will execute by scheduling them using a specific algorithm.

The uCore-SMP kernel uses a structure called proc to record the state of each process.

struct proc {
    struct spinlock lock;
    enum procstate state;
    int pid;
    ...
    struct context context;
    ...
    uint64 stride;
    uint64 priority;
    ...
};

We can see that there are two variables, stride and priority, in the proc structure. The uCore-SMP primarily uses these two variables to determine which process will be executed, following an algorithm called Stride Scheduling.

Scheduling Metrics
To compare different scheduling policies effectively, we need a scheduling metric.

The turnaround time of a job is defined as the difference between the time the job completes and the time it arrives in the system:

T_turnaround = T_completion − T_arrival

While turnaround time serves as a performance metric, scheduling algorithms can also be evaluated in terms of fairness. However, performance and fairness often conflict, making it challenging to optimize both simultaneously.

The advent of time-shared systems introduced a new metric called response time. Response time is defined as the duration between a job's arrival in the system and the first time it is scheduled to run:

T_response = T_firstrun − T_arrival

Proportional-Share Background
The proportional-share scheduler, also known as a fair-share scheduler, is built around a straightforward concept: instead of optimizing for metrics like turnaround time or response time, this scheduler focuses on ensuring that each job receives a specific percentage of CPU time.

Stride scheduling is a deterministic fair-share scheduler that is also straightforward to implement. Each job in the system is assigned a stride, which is inversely proportional to the number of tickets it holds. The stride for each job is calculated by dividing a large constant value by the number of tickets assigned to it.

The pseudocode for stride scheduling is shown below:

curr = remove_min(queue);
schedule(curr);
curr->pass += curr->stride;
insert(queue, curr);

Assume we have three jobs, and their strides are illustrated in the image below:
image

The uCore-SMP kernel uses the following code to find the process with the smallest stride:

for (struct proc *p = pool; p < &pool[NPROC]; p++)
{
    if (p->state == RUNNABLE && !p->lock.locked)
    {
        if (p->stride < min_stride)
        {
            min_stride = p->stride;
            next_proc = p;
        }
    }
}

Once the scheduler identifies the process with the smallest stride, it updates the next_proc's stride using the following code:

uint64 pass = BIGSTRIDE / (next_proc->priority);
next_proc->stride += pass;

In the current implementation of uCore-SMP, a process's priority is fixed at 16, resulting in all processes having the same pass value, calculated as:

uint64 pass = BIGSTRIDE / 16;

Since the pass value is identical for all processes, the stride adjustment does not account for variations in priority, causing uniform progression of stride values across all processes. This effectively makes the scheduler behave as though all processes have equal priority.

In the uCore-SMP implementation, while the default priority is fixed at 16, the system includes a system call, sys_setpriority, which allows processes to dynamically adjust their priority. The function is defined as follows:

/**
 * @brief Set the priority of the current process
 * 
 * @param priority The desired priority value (must be >= 2).
 * @return int64 Returns the set priority value if successful, or -1 if the input is invalid.
 */
int64 sys_setpriority(int64 priority) {
    if (2 <= priority) {
        struct proc *p = curr_proc();
        acquire(&p->lock);
        p->priority = priority;
        release(&p->lock);
        return priority;
    }
    return -1;
}

Reference

6.1 Statements and Declarations in Expressions
What does 'inferior' mean in the term 'inferior debugger'?
2.1.3 What GDB Does During Startup
2.1.4 Initialization Files