# SMP OS Kernel for RISC-V
> 莊易騰
## TODO
- [x] Use QEMU to run uCore-SMP and fix related issues
- [x] Trace uCore-SMP using GDB + QEMU
- [x] Explain how the OS kernel on an SMP processor initializes each hart
- [ ] Explain the use of CLINT and uCore-SMP
- [ ] Explore the implementation of HSM, OpenSBI and uCore-SMP
- [ ] Examine the system call design of uCore-SMP, identify its design flaws, and propose improvements
- [ ] Discuss CPU scheduling issues in SMP, using uCore-SMP as an example, and analyze its fairness
## Source Code
The source code for this project is available on [GitHub](https://github.com/404allen404/uCore-SMP).
:::info
You should download the [Alpine Python3 RootFS](https://github.com/NKU-EmbeddedSystem/uCore-SMP/releases) and place it in the `user` directory.
:::
## Tool Version
- QEMU Emulator: Version 5.0.0
- RISC-V GCC Compiler: riscv64-unknown-elf-gcc (g04696df09) Version 14.2.0
- OpenSBI: Version 0.9 (`fw_jump.bin` in the bootloader directory)
## Use QEMU to run uCore-SMP and Fix Related Issues
Please refer to commits `4834d56`, `a2b16cf`, and `95021ed`.
These commits resolve compilation errors and modify the Makefile in the repository to ensure successful operation.
To complete the compilation process, you can use the `all.sh` script as follows:
```shell
$ chmod +x all.sh
$ ./all.sh
```
If successful, the result will look like the image below:

We can use `make run` to launch QEMU and execute the kernel:
```shell
$ make run
```
## Trace uCore-SMP using GDB + QEMU
:::info
**GDB Background**
An **inferior** is a general term that refers to "an entity being debugged using GDB."
**How to Run GDB on uCore-SMP**
We can configure GDB to use an initialization file called `.gdbinit`.
To set this up, follow these steps before running GDB:
```shell
$ mkdir -p ~/.config/gdb
$ echo "add-auto-load-safe-path your_uCore-SMP_path/.gdbinit" >> ~/.config/gdb/gdbinit
```
Next, start the GDB server with the following command:
```shell
$ make gdb_debug
```
Finally, open another terminal and use the following command to connect to the GDB server:
```shell
$ riscv64-unknown-elf-gdb
```
**Common GDB commands**
- **`r`**: Run the program being debugged
- **`l`**: Show the source code around the current execution point
- **`info breakpoints`**: List all breakpoints
- **`d breakpoint_number`**: Delete a specific breakpoint
- **`n`**: Execute the next line of code, stepping over function calls
- **`s`**: Execute the next line of code, stepping into function calls
- **`c`**: Resume execution until the next breakpoint
- **`bt`**: Display the current call stack
- **`info registers`**: Display the current values of CPU registers
- **`q`**: Exit GDB
:::
:::warning
The issue appeared in the `fu740` branch, but I discovered that I should be using the `fu740-python` branch.
:::
When I used the `make run` command to launch QEMU, I encountered the following issue:

Using GDB, I determined that the problem originates in the `kinit` function.

To address the issue, I traced the code of the `kinit` function. The following image illustrates the flow of the `kinit` function. We can observe that the `freerange` function repeatedly calls the `recycle_physical_page` function, which in turn invokes `memset` to initialize each physical page.

The issue I encountered was caused by the memory size being too small. To resolve it, we need to enlarge the guest RAM when launching QEMU, which is controlled by QEMU's `-m` option (e.g. `-m 256M`).
According to the image below, the kernel expects **256 MiB** of memory, spanning from `0x80000000` to `0x90000000`.

:::info
According to the QEMU documentation, the default RAM size is **128 MiB**.

:::
:::warning
The issue appeared in the `fu740` branch, but I discovered that I should be using the `fu740-python` branch.
:::
After resolving the issue mentioned above, I encountered another problem:

After discussing the problem with my instructor, I tried using the **QEMU monitor** to resolve the issue and to learn how to use it effectively.
:::info
**QEMU Monitor Background**
The QEMU monitor is used to give complex commands to the QEMU emulator.
**How to run QEMU monitor on uCore-SMP**
To use the QEMU monitor for memory inspection, I modified the uCore-SMP makefile.
You can launch the QEMU monitor by running the following command:
```shell
make monitor_debug
```
Finally, open another terminal and use the following command to connect to the QEMU monitor:
```shell
$ telnet 127.0.0.1 4444
```
**Common commands**
- **`info qtree`**: Show device tree
- **`xp /fmt addr`**: Physical memory dump starting at addr
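For example, `xp /16x 0x80000000` dumps 16 words of physical memory starting at the base of RAM on the QEMU `virt` machine.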
:::
I used the QEMU monitor to inspect the memory state and discovered that the content at the mount point consisted entirely of zeros. This led me to suspect that the issue might have been caused by the format of the drive image.
Ultimately, I identified the root cause and resolved the issue ([Exploration Process](#exploration-process)). The image below demonstrates Python executing successfully on the kernel.

<a id="exploration-process"></a>
:::info
**Exploration Process**
This section explains how I resolved the `f_mount` failure issue.
Upon encountering the issue, I began investigating the functionality of the `f_mount` API and learned that it is used to mount a filesystem object to a logical volume. Additionally, I examined the meaning of `f_mount`'s return value and found that the return value of 13 indicates **FR_NO_FILESYSTEM**.
However, I initially lacked a clear understanding of **FR_NO_FILESYSTEM**, so I consulted with my instructor. My instructor suggested using the **QEMU monitor** to inspect the memory's status. Following this advice, I eventually discovered that the drive image's format was not as expected.
To investigate further, I examined the contents of `fs.img` using the **file** command, which identified the type of `fs.img` as "data".
```
user/fs.img: data
```
I also analyzed the contents of `riscv64-rootfs.img` with the **file** command, and the result was as follows:
```
user/riscv64-rootfs.img: DOS/MBR boot sector MS-MBR Windows 7 english at offset 0x163 "Invalid partition table" at offset 0x17b "Error loading operating system" at offset 0x19a "Missing operating system", disk signature 0xc7aafbf3; partition 1 : ID=0xb, active, start-CHS (0x0,32,33), end-CHS (0xc,190,50), startsector 2048, 202752 sectors
```
From this analysis, I concluded that `fs.img` was not usable. I then revisited the `README.md` file for further clarification. Upon reviewing it, I realized that I had been executing the kernel on the wrong branch. After switching to the correct branch, I was able to successfully launch the kernel.
:::
## Explain how the OS kernel on an SMP processor initializes each hart
Before entering the kernel's main function, `_entry.S` allocates a boot stack for each hart.
Since the code only allocates 8 stacks of 4 KB each, the number of CPUs should not exceed 8.

When the first hart begins executing the `main` function, it stores the **hartid** in the `tp` register and sets the static variable **first_hart** to **0**.
The first hart then calls the `init_cpu` function to initialize the status of all CPUs, including setting the **core_id** and **start_cycle**.
```c
for (int i = 0; i < NCPU; i++) {
    ...
    cpus[i].core_id = i;
    ...
    // The r_time function returns the value of the time CSR
    cpus[i].start_cycle = r_time();
}
```
Next, the first hart calls the `trapinit_hart` function to set **stvec** and enable supervisor-mode global interrupts by setting **sstatus.SIE**. It then enables the supervisor external interrupt and the supervisor software interrupt by setting **sie.SEIE** and **sie.SSIE**, respectively.
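The corresponding CSR writes look roughly like the sketch below. The bit positions come from the RISC-V privileged specification, the helpers are xv6-style inline-assembly wrappers, and `kernelvec` stands in for whatever trap entry symbol uCore-SMP actually installs:
```c
#include <stdint.h>

#define SSTATUS_SIE (1L << 1) // sstatus.SIE: S-mode global interrupt enable
#define SIE_SEIE    (1L << 9) // sie.SEIE: S-mode external interrupt enable
#define SIE_SSIE    (1L << 1) // sie.SSIE: S-mode software interrupt enable

extern void kernelvec(void);  // trap entry point (name assumed)

static inline uint64_t r_sstatus(void) {
    uint64_t x;
    asm volatile("csrr %0, sstatus" : "=r"(x));
    return x;
}
static inline void w_sstatus(uint64_t x) { asm volatile("csrw sstatus, %0" : : "r"(x)); }
static inline uint64_t r_sie(void) {
    uint64_t x;
    asm volatile("csrr %0, sie" : "=r"(x));
    return x;
}
static inline void w_sie(uint64_t x) { asm volatile("csrw sie, %0" : : "r"(x)); }
static inline void w_stvec(uint64_t x) { asm volatile("csrw stvec, %0" : : "r"(x)); }

void trapinit_hart_sketch(void) {
    w_stvec((uint64_t)kernelvec);          // install the trap handler
    w_sstatus(r_sstatus() | SSTATUS_SIE);  // enable S-mode interrupts globally
    w_sie(r_sie() | SIE_SEIE | SIE_SSIE);  // enable external + software interrupts
}
```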
:::info
**PLIC Background**
The PLIC supports up to **1023** interrupts (interrupt 0 is reserved) and **15872** contexts.
- Each interrupt has a priority value that can be set by the programmer
- Each target can be interrupted by N interrupt sources, so there are N interrupt enable bits to control whether the target receives these interrupts
- Each target has an interrupt threshold. For an interrupt to occur, its priority value must be greater than the threshold

:::
After executing the `trapinit_hart` function, the first hart executes the `plicinit` and `plicinithart` functions. The `plicinit` function sets the priority of interrupt 1 to 1. The `plicinithart` function enables interrupt 1 for supervisor mode and sets the supervisor's interrupt threshold to 0.
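In terms of hardware access, both functions boil down to a few MMIO stores. The sketch below assumes the QEMU `virt` machine's PLIC layout (base `0x0c000000`) with xv6-style macros; uCore-SMP's own memory layout definitions may differ. On `virt`, interrupt source 1 is the first virtio device:
```c
#include <stdint.h>

#define PLIC 0x0c000000L                                          // PLIC base on QEMU virt
#define PLIC_SENABLE(hart)   (PLIC + 0x2080 + (hart) * 0x100)     // S-mode enable bits
#define PLIC_SPRIORITY(hart) (PLIC + 0x201000 + (hart) * 0x2000)  // S-mode threshold

void plicinit_sketch(void) {
    // Give interrupt source 1 a nonzero priority so it can fire at all
    *(volatile uint32_t *)(PLIC + 1 * 4) = 1;
}

void plicinithart_sketch(int hart) {
    // Enable interrupt source 1 for this hart's supervisor context
    *(volatile uint32_t *)PLIC_SENABLE(hart) = (1 << 1);
    // Threshold 0: accept every interrupt with priority > 0
    *(volatile uint32_t *)PLIC_SPRIORITY(hart) = 0;
}
```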
Then, the first hart executes the `kvminit` function to create the kernel page table and the `kvminithart` function to write to the **satp** CSR, enabling virtual memory. After enabling virtual memory, the first hart executes the **sfence.vma zero, zero** instruction to flush all TLB entries.
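Enabling paging comes down to one CSR write plus a TLB flush; here is a minimal sketch using inline assembly (the `MAKE_SATP` macro packs the Sv39 mode bits with the page table's physical page number, following the xv6 convention):
```c
#include <stdint.h>

#define SATP_SV39 (8L << 60)  // Sv39 paging mode in satp[63:60]
#define MAKE_SATP(pagetable) (SATP_SV39 | (((uint64_t)(pagetable)) >> 12))

static inline void w_satp(uint64_t x) {
    asm volatile("csrw satp, %0" : : "r"(x));
}

static inline void sfence_vma(void) {
    asm volatile("sfence.vma zero, zero");  // flush all TLB entries
}

void kvminithart_sketch(uint64_t kernel_pagetable) {
    w_satp(MAKE_SATP(kernel_pagetable));    // turn on virtual memory
    sfence_vma();                           // discard stale translations
}
```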
After the first hart completes all initialization, it calls the `init_booted` function to set both static arrays, **booted** and **halted**, to all zeros:
```c
volatile int booted[NCPU]; // whether a core is booted
volatile int halted[NCPU]; // whether a core is halted
```
The first hart then sets `booted[hartid]` to **1**, marking itself as booted, and begins booting the other harts by calling the `start_hart` function.
```c
booted[hartid] = 1; // Mark the first hart as booted
int CPU_START = 0;  // Starting index for CPUs to boot
for (int i = CPU_START; i < NCPU; i++) {
    if (i != hartid) { // Skip the first hart (already booted)
        start_hart(i, (uint64_t)_entry, 0);
        while (booted[i] == 0) {
            // Wait until the hart has updated its booted status
        }
    }
}
```
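Note that `booted` is declared `volatile`; without that qualifier, the compiler could hoist the load of `booted[i]` out of the busy-wait loop and spin forever on a stale value.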
:::info
**OpenSBI Background**
An SBI extension defines a set of SBI functions that provide a particular piece of functionality to supervisor-mode software.
The higher-privilege software providing the SBI interface to supervisor-mode software is referred to as an SBI implementation or Supervisor Execution Environment (SEE).

All SBI functions share a single binary encoding, which facilitates the mixing of SBI extensions. The SBI specification defines the following calling convention:
- An **ECALL** is used as the control transfer instruction between the supervisor and the SEE
- **a7** encodes the SBI extension ID (EID)
- **a6** encodes the SBI function ID (FID)
- SBI functions must return a pair of values in **a0** and **a1**, with **a0** returning an error code.
```c
struct sbiret {
    long error;
    long value;
};
```
:::
The `start_hart` function internally calls the `a_sbi_ecall` function, which uses the **ecall** instruction to invoke the Supervisor Execution Environment (SEE).
```c
void start_hart(uint64 hartid, uint64 start_addr, uint64 a1) {
    a_sbi_ecall(0x48534D, 0, hartid, start_addr, a1, 0, 0, 0);
}
```
:::info
The value **0x48534D** represents the Extension ID (EID) for the **Hart State Management (HSM)** extension in the RISC-V SBI.
The **second argument** of the `a_sbi_ecall` function is the Function ID (FID).
In the HSM extension, Function ID 0 corresponds to the `sbi_hart_start` function, which requests the SBI implementation to start executing the target hart in supervisor mode.
```c
struct sbiret sbi_hart_start(
    unsigned long hartid,
    unsigned long start_addr,
    unsigned long opaque
)
```
:::
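Putting the calling convention together, a wrapper like `a_sbi_ecall` places the EID in **a7**, the FID in **a6**, and the arguments in **a0**–**a5** before executing **ecall**. The following is a minimal sketch using GCC inline assembly and the `struct sbiret` shown earlier; uCore-SMP's actual implementation may differ in detail:
```c
struct sbiret a_sbi_ecall_sketch(unsigned long eid, unsigned long fid,
                                 unsigned long arg0, unsigned long arg1,
                                 unsigned long arg2, unsigned long arg3,
                                 unsigned long arg4, unsigned long arg5) {
    register unsigned long a0 asm("a0") = arg0;
    register unsigned long a1 asm("a1") = arg1;
    register unsigned long a2 asm("a2") = arg2;
    register unsigned long a3 asm("a3") = arg3;
    register unsigned long a4 asm("a4") = arg4;
    register unsigned long a5 asm("a5") = arg5;
    register unsigned long a6 asm("a6") = fid;  // SBI function ID
    register unsigned long a7 asm("a7") = eid;  // SBI extension ID
    asm volatile("ecall"
                 : "+r"(a0), "+r"(a1)           // a0/a1 carry the return pair
                 : "r"(a2), "r"(a3), "r"(a4), "r"(a5),
                   "r"(a6), "r"(a7)
                 : "memory");
    return (struct sbiret){ .error = (long)a0, .value = (long)a1 };
}
```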
After the first hart completes its initialization, the other harts execute the `hart_bootcamp` function as follows:
```c
void hart_bootcamp(uint64 hartid, uint64 a1) {
    w_tp(hartid);       // Write hart ID to the thread pointer
    kvminithart();      // Initialize kernel virtual memory for this hart
    trapinit_hart();    // Set up trap handling for this hart
    plicinithart();     // Configure the PLIC for this hart
    booted[hartid] = 1; // Mark this hart as booted
}
```
Once each hart (besides the first) has executed the `hart_bootcamp` function, it enters a `while` loop to wait for all harts to complete booting:
```c
while (!all_started) {
    ; // Busy-wait until all harts have started
}
```
When the first hart completes booting all the other harts, it will execute the `wait_all_boot` function to signal that all harts have started by setting **all_started** to **1**.
```c
void wait_all_boot() {
    ...
    all_started = 1;
}
```
Once all harts have been booted and the **all_started** flag is set to **1** by the first hart, all harts proceed to execute the `scheduler` function.
## Explain the use of CLINT and uCore-SMP
## Explore the implementation of HSM, OpenSBI and uCore-SMP
:::info
**This content is referenced from the [fw.md](https://github.com/riscv-software-src/opensbi/blob/v0.9/docs/firmware/fw.md).**
OpenSBI currently supports three different types of firmware:
- Firmware with Dynamic Information (FW_DYNAMIC)
- Firmware with Jump Address (FW_JUMP)
**The FW_JUMP firmware assumes a fixed address of the next booting stage entry, e.g. a bootloader or an OS kernel, without directly including the binary code for this next stage.**
- Firmware with Payload (FW_PAYLOAD)
:::
The **uCore-SMP** utilizes the **FW_JUMP** firmware, so I will explain the **FW_JUMP** flow.
At the beginning, the `_start` function determines which hart will execute the `_relocate` function. Assume hart 0 executes the **amoadd.w** instruction first. As shown in the image below, hart 0 reads a zero value from memory, while the other harts read non-zero values. Therefore, hart 0 is designated to execute the `_relocate` function.

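The lottery is a simple fetch-and-add: every hart atomically increments a counter that starts at zero, and only the hart that reads back the old value 0 wins. OpenSBI implements this in assembly (in `fw_base.S`), but the idea can be sketched in C11 as follows:
```c
#include <stdatomic.h>

static atomic_int relocate_lottery = 0;  // initially 0 in memory

void lottery_sketch(void) {
    // Like amoadd.w, atomic_fetch_add returns the *old* value,
    // so exactly one hart observes 0 and wins the lottery.
    if (atomic_fetch_add(&relocate_lottery, 1) == 0) {
        // Winner: perform relocation on behalf of all harts
    } else {
        // Losers: wait until the winner finishes relocating
    }
}
```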
The next image illustrates the execution flow for harts other than hart 0 (the boot core).
These non-boot cores will wait for the boot core to complete relocation and initialization before proceeding.

The following image displays the execution flow for hart 0 (the boot core).

Once a hart reaches `_start_warm`, it calls the `sbi_init` function.
In the `sbi_init` function, a lottery mechanism is used to select a coldboot hart from all available harts. Only one hart will execute `init_coldboot`, while the other harts will execute `init_warmboot`.
```c
if (next_mode_supported && atomic_xchg(&coldboot_lottery, 1) == 0)
    coldboot = TRUE;

if (coldboot)
    init_coldboot(scratch, hartid);
else
    init_warmboot(scratch, hartid);
```
:::info
**HSM Extension Background**
The Hart State Management (HSM) Extension introduces a set of hart states and a set of functions which allow the supervisor-mode software to request a hart state change.
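In SBI v0.2 (the version OpenSBI 0.9 implements), a hart is in one of four states; the enum below is a descriptive sketch, not OpenSBI's exact identifiers:
```c
enum hsm_hart_state {
    HART_STARTED       = 0, // executing normally in supervisor mode
    HART_STOPPED       = 1, // not executing; must be restarted via sbi_hart_start()
    HART_START_PENDING = 2, // another hart has requested sbi_hart_start()
    HART_STOP_PENDING  = 3, // the hart itself has requested sbi_hart_stop()
};
```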
:::
## Examine the system call design of uCore-SMP, identify its design flaws, and propose improvements
## Discuss CPU scheduling issues in SMP, using uCore-SMP as an example, and analyze its fairness
The OS kernel manages all processes and determines which process each hart will execute by scheduling them using a specific algorithm.
The **uCore-SMP** kernel uses a structure called **proc** to record the state of each process.
```c
struct proc {
    struct spinlock lock;
    enum procstate state;
    int pid;
    ...
    struct context context;
    ...
    uint64 stride;
    uint64 priority;
    ...
};
```
We can see that there are two variables, **stride** and **priority**, in the **proc** structure. The **uCore-SMP** primarily uses these two variables to determine which process will be executed, following an algorithm called **Stride Scheduling**.
:::info
**Scheduling Metrics**
To compare different scheduling policies effectively, we need a **scheduling metric**.
The **turnaround time** of a job is defined as the difference between the time the job completes and the time it arrives in the system:
$$T_{turnaround}=T_{completion}-T_{arrival}$$
While turnaround time serves as a **performance** metric, scheduling algorithms can also be evaluated in terms of **fairness**. However, performance and fairness often conflict, making it challenging to optimize both simultaneously.
The advent of time-shared systems introduced a new metric called **response time**. Response time is defined as the duration between a job's arrival in the system and the first time it is scheduled to run:
$$T_{response}=T_{firstrun}-T_{arrival}$$
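For example, a job that arrives at time $0$, is first scheduled at time $5$, and completes at time $20$ has a response time of $5$ and a turnaround time of $20$.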
:::
:::info
**Proportional-Share Background**
The **proportional-share** scheduler, also known as a **fair-share** scheduler, is built around a straightforward concept: instead of optimizing for metrics like turnaround time or response time, this scheduler focuses on ensuring that each job receives a specific percentage of CPU time.
**Stride scheduling** is a deterministic fair-share scheduler that is also straightforward to implement. Each job in the system is assigned a stride, which is inversely proportional to the number of tickets it holds. The stride for each job is calculated by dividing a large constant value by the number of tickets assigned to it.
The pseudocode for stride scheduling is shown below:
```c
curr = remove_min(queue);   // pick the job with the minimum pass value
schedule(curr);             // run it for a time slice
curr->pass += curr->stride; // advance its pass by its stride
insert(queue, curr);        // return the job to the queue
```
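For example, with a constant of 10000, three jobs holding 100, 50, and 250 tickets receive strides of 100, 200, and 40, respectively (these are the classic numbers from the OSTEP book). The job with the most tickets accumulates pass value most slowly, so it is selected most often.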
Assume we have three jobs, and their strides are illustrated in the image below:

:::
The **uCore-SMP** kernel uses the following code to find the process with the smallest stride:
```c
for (struct proc *p = pool; p < &pool[NPROC]; p++)
{
    if (p->state == RUNNABLE && !p->lock.locked)
    {
        if (p->stride < min_stride)
        {
            min_stride = p->stride;
            next_proc = p;
        }
    }
}
```
Once the scheduler identifies the process with the smallest stride, it updates the `next_proc`'s stride using the following code:
```c
uint64 pass = BIGSTRIDE / (next_proc->priority);
next_proc->stride += pass;
```
:::info
In the current implementation of **uCore-SMP**, a process's priority is fixed at **16**, resulting in all processes having the same `pass` value, calculated as:
```c
uint64 pass = BIGSTRIDE / 16;
```
Since the `pass` value is identical for all processes, the stride adjustment does not account for variations in priority, so `stride` values advance uniformly across all processes. This effectively makes the scheduler behave as though every process has equal priority.
In the **uCore-SMP** implementation, while the default priority is fixed at **16**, the system includes a system call, `sys_setpriority`, which allows processes to dynamically adjust their priority. The function is defined as follows:
```c
/**
 * @brief Set the priority of the current process
 *
 * @param priority The desired priority value (must be >= 2).
 * @return int64 Returns the set priority value if successful, or -1 if the input is invalid.
 */
int64 sys_setpriority(int64 priority) {
    if (2 <= priority) {
        struct proc *p = curr_proc();
        acquire(&p->lock);
        p->priority = priority;
        release(&p->lock);
        return priority;
    }
    return -1;
}
```
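Because the scheduler adds `BIGSTRIDE / priority` to a process's stride each time it runs, a process with a higher priority accumulates stride in smaller steps and is therefore selected more often; in the long run, each process's share of CPU time is proportional to its priority. For instance, a process that calls `sys_setpriority(32)` advances its stride half as fast as one with the default priority of 16, so it should receive roughly twice as much CPU time.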
:::
## Reference
[6.1 Statements and Declarations in Expressions](https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html)
[What does 'inferior' mean in the term 'inferior debugger'?](https://stackoverflow.com/questions/16591485/what-does-inferior-mean-in-the-term-inferior-debugger)
[2.1.3 What GDB Does During Startup](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Startup.html#Init%20File%20in%20the%20Current%20Directory%20during%20Startup)
[2.1.4 Initialization Files](https://sourceware.org/gdb/current/onlinedocs/gdb.html/Initialization-Files.html#Initialization-Files)