王景霈, 傅孟楷
Your task is to read Operating System in 1,000 Lines and experiment on QEMU, gradually building a small RISC-V operating system. Then extend the code with POSIX Thread-style interfaces such as mutex, condition variable, semaphore, and barrier, taking priority inversion into consideration.
reference: https://hackmd.io/@sysprog/concurrency-thread-package
Operating System: Ubuntu 22.04.4 LTS
Kernel: Linux 6.8.0-47-generic
Architecture: x86-64
Instruction | rd is x0 | rs1 is x0 | Reads CSR | Writes CSR |
---|---|---|---|---|
CSRRW | Yes | - | No | Yes |
CSRRW | No | - | Yes | Yes |
CSRRS/CSRRC | - | Yes | Yes | No |
CSRRS/CSRRC | - | No | Yes | Yes |

Instruction | rd is x0 | uimm = 0 | Reads CSR | Writes CSR |
---|---|---|---|---|
CSRRWI | Yes | - | No | Yes |
CSRRWI | No | - | Yes | Yes |
CSRRSI/CSRRCI | - | Yes | Yes | No |
CSRRSI/CSRRCI | - | No | Yes | Yes |
`#define NULL ((void *)0)`
`void *` is a generic pointer type in the C language, which represents a "pointer without a specific type".
On some platforms or compilers, directly using 0 as a pointer might result in type mismatch warnings or errors.
Casting 0 explicitly to void * indicates that "this is a null pointer, not just the numeric value 0," ensuring semantic clarity and cross-platform consistency.
`__VA_ARGS__` is a standard C feature for variadic macros, allowing a macro to accept a variable number of arguments.
`##__VA_ARGS__` (a GNU compiler extension) removes the preceding comma when the variable argument list is empty, preventing compilation errors.
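`do { ... } while(0)` in macros
- Wraps a multi-statement macro body into a single statement, so the macro behaves correctly inside `if`-`else` structures.
- Lets the macro be followed by a semicolon (`;`) after its usage, like an ordinary function call.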
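To make the two macro techniques above concrete, here is a minimal host-side sketch. The `LOG` macro and the `describe` function are hypothetical names invented for this illustration, not the kernel's own `PANIC` macro, and the sketch assumes a compiler that accepts the GNU `##__VA_ARGS__` extension (GCC and Clang do).

```c
#include <stdio.h>

/* Hypothetical logging macro, for illustration only.
 * do { ... } while (0) makes the body one statement;
 * ##__VA_ARGS__ drops the comma when no extra arguments are given. */
#define LOG(buf, size, fmt, ...) \
    do { \
        snprintf((buf), (size), "LOG: " fmt "\n", ##__VA_ARGS__); \
    } while (0)

/* The if/else below only parses because each LOG(...) use, together
 * with its trailing semicolon, forms exactly one statement. */
static void describe(char *buf, size_t size, int value)
{
    if (value < 0)
        LOG(buf, size, "negative");          /* empty variadic list */
    else
        LOG(buf, size, "value=%d", value);   /* one variadic argument */
}
```

If the macro body were a bare `{ ... }` block instead, the semicolon after `LOG(...)` would end the `if` statement and the `else` would no longer compile.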
Simplified RISC-V exception handling process:
1. Save the current instruction address to `sepc` (the CSR that stores the faulting instruction address).
2. Record the exception cause in `scause`.
3. Jump to the exception handler based on `stvec`.
4. The stack pointer (SP) is set to kernel space (using `sscratch` to store the old user SP).
5. The trap entry code saves the necessary register values to the stack (the hardware does not save general-purpose registers automatically).
6. Calls `handle_trap`, passing the stack pointer in `a0`, to analyze `scause`.
7. Performs specific exception handling and may modify saved register values.
8. Restores saved register contents.
9. Uses `sret` to return to the original execution.
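The `sscratch` register is usually used to hold the user-mode SP while an exception is being handled. After handling the trap, when we want to return to user mode, we must restore the original user stack pointer from `sscratch`.
The stack pointer is placed in `a0` and passed to `exception_handler`.
`WRITE_CSR` sets `stvec` to the address of the exception handler entry (`kernel_entry`).
`kernel_entry` responsibilities: save the registers to the stack, call the trap handler with the stack pointer in `a0`, then restore the registers and return with `sret`.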

    void kernel_main(void)
    {
        memset(__bss, 0, (size_t)__bss_end - (size_t)__bss);
        WRITE_CSR(stvec, (uint32_t)kernel_entry);
        __asm__ __volatile__("unimp"); // pseudo-instruction that triggers an illegal instruction trap
        for (;;)
        {
            __asm__ __volatile__("wfi");
        }
    }

    void handle_trap(struct trap_frame *f)
    {
        uint32_t scause = READ_CSR(scause);
        uint32_t stval = READ_CSR(stval);
        uint32_t user_pc = READ_CSR(sepc);
        PANIC("unexpected trap scause=%x, stval=%x, sepc=%x\n", scause, stval, user_pc);
    }

- `scause`: Stores the type of exception.
- `stval` (Supervisor Trap Value Register): Stores the value associated with the exception, such as the faulting address.
- `sepc`: Stores the address of the instruction where the exception occurred.
- `#reg` turns `reg` into a string literal, so the CSR name is pasted into the assembly template: `__asm__ __volatile__("csrw " #reg ", %0" ::"r"(__tmp))`
A linker script is written in a specialized scripting language that controls how the linker arranges memory usage and program layout when combining multiple object files into an executable.
It allows defining symbols like `__stack_top`, making them accessible in C for memory allocation control.
By defining this in the linker script instead of hardcoding addresses, the linker can choose a position that does not overlap with the kernel's static data.
`PAGE_SIZE` is defined in `common.h` instead of `kernel.h` because it is a general constant used in multiple modules, not just for kernel management but also for broader memory management.
`next_paddr` tracks the next free address for allocation. It must persist across function calls, which makes it suitable for a static variable.
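The role of the static variable can be sketched with a host-side bump allocator. This is a simplified illustration under stated assumptions: `arena` and `ARENA_PAGES` are hypothetical stand-ins for the free RAM region that the linker script would provide, the allocator tracks an offset instead of a physical address, and it returns `NULL` on exhaustion where a kernel would more likely panic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define ARENA_PAGES 8 /* hypothetical pool size for this sketch */

/* Stand-in for the free RAM region defined by the linker script. */
static uint8_t arena[ARENA_PAGES * PAGE_SIZE];

/* Bump allocator: hands out n zeroed pages and never frees them. */
static void *alloc_pages(uint32_t n)
{
    static size_t next_offset = 0; /* persists across calls, like next_paddr */

    if (n > ARENA_PAGES || next_offset > (ARENA_PAGES - n) * (size_t)PAGE_SIZE)
        return NULL; /* out of memory */

    void *pages = &arena[next_offset];
    next_offset += (size_t)n * PAGE_SIZE;
    memset(pages, 0, (size_t)n * PAGE_SIZE);
    return pages;
}
```

Each call continues from where the previous one stopped, which is only possible because `next_offset` keeps its value between calls.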
In RISC-V, `s0` to `s11` are callee-saved registers. Other registers such as `a0` are caller-saved and have already been saved on the stack by the caller.
`yield()` searches for a runnable process; if there is none, it runs `idle_proc`.
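The selection rule in `yield()` can be sketched as a round-robin search with an idle fallback. The structures below are reduced, hypothetical versions for a host-side illustration (`pick_next`, `idle`, and the slot layout where slot `i` holds pid `i + 1` are assumptions), and the actual context switch is left out.

```c
#define PROCS_MAX 8

enum proc_state { PROC_UNUSED, PROC_RUNNABLE, PROC_EXITED };

struct process {
    int pid;               /* slot i holds pid i + 1; 0 means idle or unused */
    enum proc_state state;
};

static struct process procs[PROCS_MAX];
static struct process idle = { 0, PROC_RUNNABLE }; /* plays the role of idle_proc */

/* Round-robin: start at the slot after the current process and wrap
 * around, so the current process itself is considered last. */
static struct process *pick_next(const struct process *current)
{
    for (int i = 0; i < PROCS_MAX; i++) {
        struct process *p = &procs[(current->pid + i) % PROCS_MAX];
        if (p->state == PROC_RUNNABLE && p->pid > 0)
            return p;
    }
    return &idle; /* nothing runnable: fall back to the idle process */
}
```

Starting the search after the current slot is what keeps one runnable process from monopolizing the CPU when others are waiting.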
When each process has its own kernel stack, `sscratch` holds the kernel stack pointer of the current process.
`csrrw sp, sscratch, sp` is a swap operation: it exchanges the user-mode `sp` with the kernel-mode `sp` kept in `sscratch`, so the trap handler runs on the kernel stack after the switch from user mode to supervisor mode.
Each process has its own independent kernel stack!
`a0`, `a1` and `naked`
When a function is marked as `naked`, the compiler omits the standard prologue and epilogue, and the function relies entirely on inline assembly to do its work. As a result, the first and second parameters of the function directly correspond to the `a0` and `a1` registers in the inline assembly.

    __attribute__((naked)) void switch_context(uint32_t *prev_sp,
                                               uint32_t *next_sp) {
        __asm__ __volatile__(
            // omitted
            // Switch the stack pointer.
            "sw sp, (a0)\n" // *prev_sp = sp;
            "lw sp, (a1)\n" // sp = *next_sp; the stack pointer is switched here
            // omitted
        );
    }

In C, we can take the address of the element just past the end of an array (e.g., `&array[size]`), but we cannot read from or write to it. Forming a pointer to any other index outside the array bounds, or dereferencing an out-of-bounds pointer, is undefined behavior.
The effects of the `++` and `--` operators in C are determined by the type of the operand. For a pointer, they move it by the size of the pointed-to type. In the example below, `--sp` moves `sp` back by `sizeof(uint32_t)` (4 bytes) before the store:

    uint32_t *sp = (uint32_t *) &proc->stack[sizeof(proc->stack)];
    *--sp = 0;

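The step size of `--` can be measured directly. The sketch below is a host-side illustration with hypothetical names (`stack_words`, `step_u32`, `step_u8`). It uses a `uint32_t` array rather than a byte array so that the example has no alignment concerns.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t stack_words[16];

/* Bytes moved by one pre-decrement of a uint32_t pointer. */
static ptrdiff_t step_u32(void)
{
    uint32_t *end = &stack_words[16]; /* one past the end: address only */
    uint32_t *sp = end;
    *--sp = 0;                        /* step back one element, then write stack_words[15] */
    return (uint8_t *)end - (uint8_t *)sp;
}

/* Bytes moved by one pre-decrement of a uint8_t pointer. */
static ptrdiff_t step_u8(void)
{
    uint8_t *end = (uint8_t *)&stack_words[16];
    uint8_t *p = end;
    *--p = 0;                         /* step back one byte, then write it */
    return end - p;
}
```

The same `--` operator moves 4 bytes in the first function and 1 byte in the second, which is why the process stack is filled through a `uint32_t *`.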
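`map_page()` maps `vaddr` to `paddr` and creates the corresponding page tables.
`sfence.vma` is used to manage the TLB: it flushes stale address translations after the page table changes.
The kernel base address is set in the linker script:

    SECTIONS {
        . = 0x80200000;
        __kernel_base = .;
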
Examine whether the kernel virtual address (0x80200000) is correctly mapped to its physical address.

```
# satp
| MODE (1 bit) | ASID (9 bits) | PPN (22 bits) |
```

`satp = 80080243`; `satp` stores the physical page number of the L1 page table. We can predict as follows:

1. Compute the L1 table base from `satp`:
   ```
   PPN = (0x80080243 & 0x3FFFFF) = 0x80243
   L1 base physical address = 0x80243 * 4096 = 0x80243000
   ```
2. Compute the page table indices:
   ```
   VPN1 = (0x80200000 >> 22) & 0x3FF = 512
   VPN0 = (0x80200000 >> 12) & 0x3FF = 512
   ```
3. Read the L1 entry:
   ```
   (qemu) xp /x 0x80243000+512*4
   0000000080243800: 0x20091001
   ```
4. Compute the L2 table base from the L1 entry:
   ```
   PPN = 0x20091001 >> 10 = 0x80244
   L2 Physical Address = PPN * 4096
                       = 0x80244 * 4096
                       = 0x80244000
   ```
5. Read the L2 entry:
   ```
   (qemu) xp /x 0x80244000+512*4
   0000000080244800: 0x200800cf
   L2 PTE = 0x200800cf
   ```
6. Calculate the physical address the PTE points to:
   ```
   # PTE
   | PPN (22 bits) | Flags (10 bits) |
   PPN = 0x200800cf >> 10 = 0x80200
   Physical Address = PPN * 4096
                    = 0x80200 * 4096
                    = 0x80200000
   ```

This indicates that 0x80200000 (virtual address) is mapped to 0x80200000 (physical address).
Also, we can check with:

```
(qemu) info mem
vaddr    paddr            size     attr
-------- ---------------- -------- -------
80200000 0000000080200000 00001000 rwx--ad
```
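The address arithmetic of this walk can be reproduced with a few helper functions. This is a minimal sketch assuming the Sv32 layouts shown above. The function names are hypothetical, and the memory reads themselves (the `xp` commands) are not simulated, so the PTE values are taken from the QEMU output.

```c
#include <stdint.h>

/* Physical address of the L1 table: low 22 bits of satp are the PPN. */
static uint64_t satp_root_paddr(uint32_t satp)
{
    return (uint64_t)(satp & 0x3FFFFFu) << 12; /* PPN * 4096 */
}

/* Index into the first-level table: bits 31..22 of the virtual address. */
static uint32_t vpn1(uint32_t vaddr) { return (vaddr >> 22) & 0x3FFu; }

/* Index into the second-level table: bits 21..12 of the virtual address. */
static uint32_t vpn0(uint32_t vaddr) { return (vaddr >> 12) & 0x3FFu; }

/* Address of entry `index` in a table; each Sv32 PTE is 4 bytes. */
static uint64_t pte_slot(uint64_t table_paddr, uint32_t index)
{
    return table_paddr + 4u * index;
}

/* Physical address a PTE points to: drop 10 flag bits, then PPN * 4096. */
static uint64_t pte_paddr(uint32_t pte)
{
    return (uint64_t)(pte >> 10) << 12;
}
```

The results are 64-bit because a 22-bit PPN shifted left by 12 can describe a 34-bit physical address, which would not fit in `uint32_t`.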
In the current implementation, each process has its own page table, and all page tables include mappings to the same kernel space.
The switch to user mode happens after running `switch_context()`.
If we encounter an interrupt or exception, we want to do a context switch to save the information.
This works because, when creating a process, we set `ra` to `user_entry`:

    struct process *create_process(const void *image, size_t image_size) {
        // omitted
        *--sp = 0;                      // s3
        *--sp = 0;                      // s2
        *--sp = 0;                      // s1
        *--sp = 0;                      // s0
        *--sp = (uint32_t) user_entry;  // ra (changed!)

During `switch_context`, the next process's `ra` is loaded, so the final `ret` jumps to `user_entry`. `user_entry` sets `sepc` to `USER_BASE` and executes `sret`, which continues execution at `USER_BASE`. Thus, we change to user mode.
After the steps above, execution reaches `start()` in `user.c`, which calls `main`, defined in `shell.c`. (`common.c`, `user.c`, and `shell.c` are compiled into `shell.elf`, so `start()` can call `main`.)
`exit()`

1. The user program (`shell.c`) calls the `exit()` function to terminate the process.
2. `exit()` in user space internally triggers a system call: `syscall(SYS_EXIT, 0, 0, 0);`
   - `SYS_EXIT` (with value `3`) is passed as the system call number in the `a3` register.
   - The `syscall` function executes the RISC-V `ecall` instruction to request a transition to kernel mode.
3. The `ecall` instruction generates a synchronous exception. The CPU sets:
   - `sepc`: Address of the `ecall` instruction.
   - `scause`: Exception cause (set to `8`, indicating a user-mode `ecall`).

   It then jumps to the address in the `stvec` register, which is the kernel's trap entry point: `stvec = kernel_entry;`
4. Trap entry (`kernel_entry`): In `kernel_entry`, the kernel saves the trap frame (registers and state) and calls the main trap handler function: `handle_trap(struct trap_frame *f);`
5. Trap handler (`handle_trap`): The `handle_trap` function reads the `scause` register:
   - `scause == 8`: Indicates a user-mode `ecall`, so it calls `handle_syscall(f);`
6. System call handler (`handle_syscall`): In `handle_syscall`, the kernel checks the system call number passed in the `a3` register of the trap frame:
   - `f->a3 == SYS_EXIT` (value `3`).
   - It prints the exit message: `printf("process %d exited\n", current_proc->pid);`
   - It marks the process as exited: `current_proc->state = PROC_EXITED;`
   - It calls `yield()` to trigger process scheduling.
7. Scheduling (`yield()`): The exited process (`current_proc`) is not rescheduled.
`readfile`
When we attempted the final `readfile` in Chapter 16, following the textbook's code exactly, we encountered an issue: the file system initialized successfully and listed two files, but using the `readfile` command to read the hardcoded `hello.txt` in `shell.c` failed to find the file.
Eventually, we resolved the issue by changing the hardcoded `"hello.txt"` to `"./hello.txt"` in `shell.c`. The reason is that the glob in the command

    (cd disk && tar cf ../disk.tar --format=ustar ./*.txt)

expands to file names with a relative path prefix. As a result, the file names stored in the file system include the `./` prefix, and the name hardcoded in `shell.c` must match it to locate the file successfully.
For example:

    shiky@SY:~/ncku_master/CO/disk$ ls ./*.txt
    ./hello.txt  ./meow.txt
    shiky@SY:~/ncku_master/CO/disk$ ls *.txt
    hello.txt  meow.txt

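Another way to tolerate the prefix, instead of changing the hardcoded name, would be to ignore a leading `./` when comparing file names. The helpers below are hypothetical and are not part of the textbook's file system code. They only sketch the idea.

```c
#include <string.h>

/* Skip any leading "./" components of a file name. */
static const char *strip_dot_slash(const char *name)
{
    while (name[0] == '.' && name[1] == '/')
        name += 2;
    return name;
}

/* Compare two file names, ignoring leading "./" prefixes on both. */
static int same_file_name(const char *a, const char *b)
{
    return strcmp(strip_dot_slash(a), strip_dot_slash(b)) == 0;
}
```

As the `ls` output above suggests, archiving with `*.txt` instead of `./*.txt` should also avoid the mismatch, because the glob then expands to names without the prefix.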
Steps:
1. `syscall`
2. `mutex`, `priority inheritance` in kernel thread
3. `futex`

Check the line where the exception happened:

    llvm-addr2line -e kernel.elf {sepc}

When transitioning the existing minimal system to support threads, we encountered several challenges. The original system was not designed to implement threads. While using kernel-mode threads, the functionality worked as expected, including context switching. However, when attempting to create user-mode threads, we faced two critical issues.
When using a syscall to create threads and directly assigning the user-mode function as the entry point, context switching from the original kernel mode to the new thread was not possible. This is because context switching relies on `ret` to return to the address stored in the `ra` register. However, in kernel mode, `ret` alone cannot return to a different stack space or privilege mode.
To address this, an additional function is required to rewrite `sepc` and `sstatus` and finally invoke `sret` to return to the specified address in user mode.
Even if the above issue is resolved, switching to user mode leads to execution errors for almost every instruction. Further investigation revealed that this is also a mode management issue. Each user-mode thread must have a dedicated user stack to utilize in user mode. It is not permissible to directly access addresses in the kernel stack while in user mode.
PANIC: kernel.c:772: unexpected trap scause=0000000f, stval=80225d58, sepc=01000026
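Here `scause=0xf` is a store/AMO page fault: the instruction at `sepc=0x01000026` attempted to write to the address in `stval` (`0x80225d58`), which lies in the kernel's address range (the kernel is linked at `0x80200000`).
This issue remains unresolved at the moment.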
In this GitHub example:
After `sp` is set at user entry, it points to an address in the user-mode text section, but that address cannot be read to execute the shell function, which eventually triggers a panic.
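When reading a PANIC line like the one above, it helps to translate `scause` into a name. The function below is a small sketch covering only some of the exception codes defined by the RISC-V privileged specification. The function name and the grouping of the remaining codes under "other exception" are choices made for this illustration.

```c
#include <stdint.h>

/* Map a 32-bit scause value to a short description.
 * The top bit distinguishes interrupts from exceptions. */
static const char *scause_name(uint32_t scause)
{
    if (scause & 0x80000000u)
        return "interrupt";

    switch (scause) {
    case 2:  return "illegal instruction";
    case 8:  return "environment call from U-mode";
    case 12: return "instruction page fault";
    case 13: return "load page fault";
    case 15: return "store/AMO page fault";
    default: return "other exception";
    }
}
```

For the trap reported above, `scause_name(0x0000000f)` names a store/AMO page fault, which matches a user-mode write to an address that user mode is not allowed to access.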