The entry point of the program is the symbol _entry. In this project it is declared by ENTRY(_entry) in kernel/kernel.ld and defined in entry.S. _entry is placed at 0x80000000 because QEMU expects the kernel to be loaded at that address and jumps there. So entry.S is where the kernel actually starts running:
.section .text
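tells the assembler to:
1. Place the code that follows into the .text section.
2. Treat it as executable code.
In modern programs, usually only the .text section is marked executable.
Besides .text, there are several other common sections:
.data: initialized global and static variables.
.bss: uninitialized global and static variables.
.rodata: read-only data, such as string literals and constants.
.comment: comments or metadata.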
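At line 15 of entry.S, why do we need to add 1 to a1 (the hartid)? Without this line the result would be "sp = stack0 + (hartid * 4096)", which is the lowest address of that hart's stack. The comment on line 11 actually reads "# sp = stack0 + ((hartid+1) * 4096)". The reason is that the stack grows downwards in memory, so sp has to start at the top (the highest address) of each hart's 4096-byte region.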
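The stack-pointer arithmetic can be checked in plain C. This is only a sketch, not xv6 code: initial_sp and STACK_SIZE are hypothetical names, and the address of stack0 is passed in as an ordinary integer.

```c
#include <stdint.h>

#define STACK_SIZE 4096  /* bytes of stack per hart, as in entry.S */

/* Hypothetical helper: initial sp for a hart, given the base address of
 * stack0. The +1 makes sp point at the top (highest address) of the hart's
 * region, because the stack grows downwards from there. */
uint64_t initial_sp(uint64_t stack0, uint64_t hartid)
{
    return stack0 + (hartid + 1) * STACK_SIZE;
}
```

With a made-up base of 0x80008000, hart 0 would start at 0x80009000 and hart 1 at 0x8000a000, so each hart pushes downwards into its own 4096-byte region.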
Then we jump to start() in start.c
#include "types.h"
#include "param.h"
#include "memlayout.h"
#include "riscv.h"
#include "defs.h"
void main();
void timerinit();
// entry.S needs one stack per CPU.
__attribute__ ((aligned (16))) char stack0[4096 * NCPU];
// entry.S jumps here in machine mode on stack0.
void
start()
{
// set M Previous Privilege mode to Supervisor, for mret.
unsigned long x = r_mstatus(); //* n: read current machine status register
x &= ~MSTATUS_MPP_MASK; //* n: Clears the MPP (Machine Previous Privilege) bits in mstatus. These bits specify the privilege mode to return to after an mret (machine return) instruction.
x |= MSTATUS_MPP_S; //* n: setting those cleared bits to be the supervisor level of privilege
w_mstatus(x);
// set M Exception Program Counter to main, for mret.
// requires gcc -mcmodel=medany
w_mepc((uint64)main); //* n: Writes the address of the main function to the mepc register, which holds the return address for exceptions in M-mode.
//* n: Casts the address of main to a 64-bit integer and writes it to mepc. When mret executes, the CPU jumps to main and starts running the core of the operating system.
// disable paging for now.
w_satp(0); //* n: turns off virtual memory
// delegate all interrupts and exceptions to supervisor mode.
w_medeleg(0xffff);
w_mideleg(0xffff);
w_sie(r_sie() | SIE_SEIE | SIE_STIE | SIE_SSIE); //* n: supervisor mode is ready and willing to handle external (SIE_SEIE), timer (SIE_STIE), and software (SIE_SSIE) interrupts
// configure Physical Memory Protection to give supervisor mode
// access to all of physical memory.
w_pmpaddr0(0x3fffffffffffffull); //* n: lets S-mode access all of physical memory.
w_pmpcfg0(0xf); //* n: sets the pmpcfg0 register to 0xf, granting read, write, and execute permission.
// ask for clock interrupts.
timerinit();
// keep each CPU's hartid in its tp register, for cpuid().
int id = r_mhartid();
w_tp(id);
// switch to supervisor mode and jump to main().
asm volatile("mret");
}
// ask each hart to generate timer interrupts.
void
timerinit()
{
// enable supervisor-mode timer interrupts.
w_mie(r_mie() | MIE_STIE);
// enable the sstc extension (i.e. stimecmp).
w_menvcfg(r_menvcfg() | (1L << 63));
// allow supervisor to use stimecmp and time.
w_mcounteren(r_mcounteren() | 2);
// ask for the very first timer interrupt.
w_stimecmp(r_time() + 1000000);
}
The purpose of start() is to take us out of machine mode and into supervisor mode. Before we can do that, we need to set up a few things that can only be configured in machine mode.
timerinit(): arranges to receive timer interrupts. In older versions of xv6 these arrived in machine mode (not supervisor mode) at timervec in kernelvec.S, which turned them into software interrupts for devintr() in trap.c. The version shown above instead enables the sstc extension (stimecmp), so timer interrupts are delivered directly to supervisor mode.
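The read-modify-write of mstatus in start() can be tried on an ordinary integer. A minimal sketch, assuming the bit positions used in xv6's riscv.h (MPP is the two-bit field at bits 11..12); set_mpp_supervisor is a hypothetical helper, not an xv6 function.

```c
#include <stdint.h>

/* Bit positions as in xv6's riscv.h: MPP occupies bits 11..12 of mstatus. */
#define MSTATUS_MPP_MASK (UINT64_C(3) << 11)
#define MSTATUS_MPP_S    (UINT64_C(1) << 11)

/* Hypothetical helper: the value start() writes back to mstatus. */
uint64_t set_mpp_supervisor(uint64_t mstatus)
{
    mstatus &= ~MSTATUS_MPP_MASK;  /* clear the previous-privilege field */
    mstatus |= MSTATUS_MPP_S;      /* 01 = supervisor mode */
    return mstatus;
}
```

Clearing before setting matters: if MPP held 11 (machine mode), OR-ing in 01 alone would leave it at 11, and mret would stay in machine mode.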
Each core will begin executing this main() function in parallel
#include "types.h"
#include "param.h"
#include "memlayout.h"
#include "riscv.h"
#include "defs.h"
volatile static int started = 0;
// start() jumps here in supervisor mode on all CPUs.
void
main()
{
if(cpuid() == 0){ //* n: cpuid() reads and returns the value of the tp register
//* n: so only core 0 executes the if branch; all other cores execute the else branch.
consoleinit();
printfinit();
printf("\n");
printf("xv6 kernel is booting\n");
printf("\n");
kinit(); // physical page allocator
kvminit(); // create kernel page table
kvminithart(); // turn on paging
procinit(); // process table
trapinit(); // trap vectors
trapinithart(); // install kernel trap vector
plicinit(); // set up interrupt controller
plicinithart(); // ask PLIC for device interrupts
binit(); // buffer cache
iinit(); // inode table
fileinit(); // file table
virtio_disk_init(); // emulated hard disk
userinit(); // first user process
__sync_synchronize();
started = 1;
} else {
while(started == 0)
;
__sync_synchronize();
printf("hart %d starting\n", cpuid());
kvminithart(); // turn on paging
trapinithart(); // install kernel trap vector
plicinithart(); // ask PLIC for device interrupts
}
scheduler();
}
__sync_synchronize() is a memory barrier: it tells the compiler and the CPU that all memory accesses above this line must be finished before any memory access after this line is executed.
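The started flag in main() follows a publish/consume pattern. The sketch below shows the shape of that pattern in isolation; publish, try_consume, shared_data and ready are hypothetical names, and the real kernel spins in a loop rather than returning 0.

```c
/* Hypothetical stand-ins for the state that hart 0 sets up in main(). */
static int shared_data;          /* plays the role of the initialised kernel state */
static volatile int ready = 0;   /* plays the role of `started` */

/* Writer side: finish the writes, then raise the flag. */
void publish(int value)
{
    shared_data = value;
    __sync_synchronize();   /* the write above must be visible before the flag */
    ready = 1;
}

/* Reader side: returns 1 and stores the data in *out once the flag is up,
 * or 0 if the writer has not published yet. */
int try_consume(int *out)
{
    if (ready == 0)
        return 0;
    __sync_synchronize();   /* do not read the data before seeing the flag */
    *out = shared_data;
    return 1;
}
```

The two barriers come in a pair: one keeps the data write ahead of the flag write, the other keeps the flag read ahead of the data read.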
void
consoleinit(void)
{
initlock(&cons.lock, "cons");
uartinit();
// connect read and write system calls
// to consoleread and consolewrite.
devsw[CONSOLE].read = consoleread;
devsw[CONSOLE].write = consolewrite;
}
The console is an abstraction of the physical hardware: it is something that you can read from and write to.
// map major device number to device functions.
struct devsw {
int (*read)(int, uint64, int);
int (*write)(int, uint64, int);
};
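To see why a table of function pointers gives this abstraction, here is a small standalone sketch. The struct matches devsw above, but fake_consolewrite, fake_consoleinit and dev_write are hypothetical, and the values of NDEV and CONSOLE are assumptions for the demo.

```c
#include <stdint.h>

#define NDEV    10   /* table size; assumed for this demo */
#define CONSOLE 1    /* major device number; assumed for this demo */

struct devsw {
    int (*read)(int, uint64_t, int);
    int (*write)(int, uint64_t, int);
};

struct devsw devsw[NDEV];

/* Hypothetical driver routine standing in for consolewrite():
 * pretends to write n bytes and reports how many were written. */
int fake_consolewrite(int user_src, uint64_t src, int n)
{
    (void)user_src;
    (void)src;
    return n;
}

/* Registration, in the same way consoleinit() fills in the real table. */
void fake_consoleinit(void)
{
    devsw[CONSOLE].write = fake_consolewrite;
}

/* Dispatch by major number. Returns -1 if no driver is registered. */
int dev_write(int major, uint64_t src, int n)
{
    if (major < 0 || major >= NDEV || devsw[major].write == 0)
        return -1;
    return devsw[major].write(1, src, n);
}
```

The caller of dev_write only knows a device number; which hardware sits behind it is decided by whoever filled in the table.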
//
// low-level driver routines for 16550a UART.
//
void
uartinit(void)
{
// disable interrupts.
WriteReg(IER, 0x00); //* n: write zero to interrupt enable register
// special mode to set baud rate.
WriteReg(LCR, LCR_BAUD_LATCH);
// LSB for baud rate of 38.4K.
WriteReg(0, 0x03);
// MSB for baud rate of 38.4K.
WriteReg(1, 0x00);
// leave set-baud mode,
// and set word length to 8 bits, no parity.
WriteReg(LCR, LCR_EIGHT_BITS);
// reset and enable FIFOs.
WriteReg(FCR, FCR_FIFO_ENABLE | FCR_FIFO_CLEAR);
// enable transmit and receive interrupts.
WriteReg(IER, IER_TX_ENABLE | IER_RX_ENABLE);
initlock(&uart_tx_lock, "uart");
}
The functions here concern only the UART chip (16550a UART), not the whole system. For example, "disable interrupts" here only stops the chip from raising interrupts to the rest of the system.
// Print to the console.
int
printf(char *fmt, ...) //* n: ...: variadic argument list
{
va_list ap; //* n: va_list: variadic argument list
int i, cx, c0, c1, c2, locking;
char *s;
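locking = pr.locking; //* n: checks whether this output operation needs to take the lock; it may not (for example, in a single-threaded context)
if(locking)
acquire(&pr.lock); //* n: if locking is needed, spin until the lock is acquired
va_start(ap, fmt); //* n: initializes the va_list. va_start uses the last fixed parameter (fmt) to locate where the variadic arguments begin and stores that position in ap. After initialization, ap acts like a pointer to the first variadic argument.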
for(i = 0; (cx = fmt[i] & 0xff) != 0; i++){
if(cx != '%'){
consputc(cx); //* n: console put character
continue;
}
i++;
c0 = fmt[i+0] & 0xff;
c1 = c2 = 0;
if(c0) c1 = fmt[i+1] & 0xff;
if(c1) c2 = fmt[i+2] & 0xff;
if(c0 == 'd'){
printint(va_arg(ap, int), 10, 1);
} else if(c0 == 'l' && c1 == 'd'){
printint(va_arg(ap, uint64), 10, 1);
i += 1;
} else if(c0 == 'l' && c1 == 'l' && c2 == 'd'){
printint(va_arg(ap, uint64), 10, 1);
i += 2;
} else if(c0 == 'u'){
printint(va_arg(ap, int), 10, 0);
} else if(c0 == 'l' && c1 == 'u'){
printint(va_arg(ap, uint64), 10, 0);
i += 1;
} else if(c0 == 'l' && c1 == 'l' && c2 == 'u'){
printint(va_arg(ap, uint64), 10, 0);
i += 2;
} else if(c0 == 'x'){
printint(va_arg(ap, int), 16, 0);
} else if(c0 == 'l' && c1 == 'x'){
printint(va_arg(ap, uint64), 16, 0);
i += 1;
} else if(c0 == 'l' && c1 == 'l' && c2 == 'x'){
printint(va_arg(ap, uint64), 16, 0);
i += 2;
} else if(c0 == 'p'){
printptr(va_arg(ap, uint64));
} else if(c0 == 's'){
if((s = va_arg(ap, char*)) == 0)
s = "(null)";
for(; *s; s++)
consputc(*s);
} else if(c0 == '%'){
consputc('%');
} else if(c0 == 0){
break;
} else {
// Print unknown % sequence to draw attention.
consputc('%');
consputc(c0);
}
#if 0
switch(c){
case 'd':
printint(va_arg(ap, int), 10, 1);
break;
case 'x':
printint(va_arg(ap, int), 16, 1);
break;
case 'p':
printptr(va_arg(ap, uint64));
break;
case 's':
if((s = va_arg(ap, char*)) == 0)
s = "(null)";
for(; *s; s++)
consputc(*s);
break;
case '%':
consputc('%');
break;
default:
// Print unknown % sequence to draw attention.
consputc('%');
consputc(c);
break;
}
#endif
}
va_end(ap);
if(locking)
release(&pr.lock);
return 0;
}
locking = pr.locking;
if(locking)
acquire(&pr.lock);
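In a multithreaded environment, several threads calling printf at the same time can cause the following problems:
Interleaved output: characters from different threads get mixed together and the result is unreadable. For example, if thread 1 prints "Hello, " and thread 2 prints "World!", the output might be: HeWorlldo, !
Race condition: when several threads modify a shared resource (such as the output buffer) at the same time, the data may be corrupted or left inconsistent.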
static void
printint(long long xx, int base, int sign)
{
char buf[16];
int i;
unsigned long long x;
if(sign && (sign = (xx < 0)))
x = -xx;
else
x = xx;
i = 0;
do {
buf[i++] = digits[x % base];
} while((x /= base) != 0);
if(sign)
buf[i++] = '-';
while(--i >= 0)
consputc(buf[i]);
}
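This function prints an integer to the console. It converts each digit of the number to a character and then prints it.
buf[16] stores the output string backwards (least significant digit first), which is why the final loop walks the buffer in reverse.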
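The backwards buffer is easier to see when the characters go into a string instead of the console. A sketch under that assumption: format_int is a hypothetical variant of printint(), not part of xv6, and it uses a larger buffer than the original.

```c
static const char digits_demo[] = "0123456789abcdef";

/* Hypothetical variant of printint() that writes into out instead of the
 * console. base must be 10 or 16, and out must hold at least 22 bytes.
 * Returns the number of characters written, not counting the NUL. */
int format_int(long long xx, int base, int sign, char *out)
{
    char buf[24];
    int i = 0, n = 0;
    int neg = sign && xx < 0;
    unsigned long long x;

    /* Negate in unsigned arithmetic so the most negative value is safe. */
    x = neg ? 0ULL - (unsigned long long)xx : (unsigned long long)xx;

    do {
        buf[i++] = digits_demo[x % base];  /* least significant digit first */
    } while ((x /= base) != 0);
    if (neg)
        buf[i++] = '-';

    while (--i >= 0)                       /* emit in reverse order */
        out[n++] = buf[i];
    out[n] = '\0';
    return n;
}
```

For -123 in base 10, buf holds '3', '2', '1', '-' before the reversing loop runs.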
static void
printptr(uint64 x)
{
int i;
consputc('0');
consputc('x');
for (i = 0; i < (sizeof(uint64) * 2); i++, x <<= 4)
consputc(digits[x >> (sizeof(uint64) * 8 - 4)]);
}
This is how a pointer is printed.
We first print "0x". Then
consputc(digits[x >> (sizeof(uint64) * 8 - 4)]);
can be read as consputc(digits[x >> (8 * 8 - 4)]);
which is consputc(digits[x >> 60]);
So on each iteration we print the leftmost hex digit and then shift x left by 4 bits, which discards that digit and moves the next one into the top position.
Comparing the two print functions, printptr() is much more concise. That is because printint() may have to print decimal numbers, and decimal does not map cleanly onto binary, while each hexadecimal digit corresponds to exactly 4 bits.
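The same top-nibble loop can be run against a string buffer to inspect the result. format_ptr is a hypothetical variant of printptr() written for this note, not xv6 code.

```c
#include <stdint.h>

/* Hypothetical variant of printptr(): writes "0x" followed by 16 hex digits
 * into out, which must hold at least 19 bytes including the NUL. */
void format_ptr(uint64_t x, char *out)
{
    static const char digits[] = "0123456789abcdef";
    int n = 0;

    out[n++] = '0';
    out[n++] = 'x';
    for (unsigned i = 0; i < sizeof(uint64_t) * 2; i++, x <<= 4)
        out[n++] = digits[x >> (sizeof(uint64_t) * 8 - 4)];  /* top 4 bits */
    out[n] = '\0';
}
```

Because the loop always runs 16 times, leading zeros are printed too: 0x80000000 comes out as 0x0000000080000000.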
kernel.ld
OUTPUT_ARCH( "riscv" ) //* n: tells the linker to produce an executable for the RISC-V architecture.
ENTRY( _entry ) //* n: sets the program's entry point to _entry, a symbol defined in entry.S
SECTIONS
{
/*
* ensure that entry.S / _entry is at 0x80000000,
* where qemu's -kernel jumps.
*/
. = 0x80000000;
.text : {
*(.text .text.*)
. = ALIGN(0x1000);
_trampoline = .;
*(trampsec)
. = ALIGN(0x1000);
ASSERT(. - _trampoline == 0x1000, "error: trampoline larger than one page");
PROVIDE(etext = .);
}
.rodata : {
. = ALIGN(16);
*(.srodata .srodata.*) /* do not need to distinguish this from .rodata */
. = ALIGN(16);
*(.rodata .rodata.*)
}
.data : {
. = ALIGN(16);
*(.sdata .sdata.*) /* do not need to distinguish this from .data */
. = ALIGN(16);
*(.data .data.*)
}
.bss : {
. = ALIGN(16);
*(.sbss .sbss.*) /* do not need to distinguish this from .bss */
. = ALIGN(16);
*(.bss .bss.*)
}
PROVIDE(end = .); //* n: defines the symbol end, which marks where the kernel's memory image ends
}
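This linker script is used to build the operating system kernel executable. Its job is to control the memory layout of each part of the program (the code section, read-only data section, data section, and so on).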
kalloc.c (kernel allocation)
// Physical memory allocator, for user processes,
// kernel stacks, page-table pages,
// and pipe buffers. Allocates whole 4096-byte pages.
#include "types.h"
#include "param.h"
#include "memlayout.h"
#include "spinlock.h"
#include "riscv.h"
#include "defs.h"
void freerange(void *pa_start, void *pa_end);
extern char end[]; // first address after kernel.
// defined by kernel.ld.
struct run { //* n: the data structure for a free page. Each free page is represented by a struct run, whose next field points to the next free page.
struct run *next;
};
struct {
struct spinlock lock; //* n: lock: protects the free list against race conditions when several cores access it at once.
struct run *freelist; //* n: freelist: points to the head of the free-page list.
} kmem;
void
kinit()
{
initlock(&kmem.lock, "kmem");
freerange(end, (void*)PHYSTOP); //* n:
}
void
freerange(void *pa_start, void *pa_end)
{
char *p;
p = (char*)PGROUNDUP((uint64)pa_start);
for(; p + PGSIZE <= (char*)pa_end; p += PGSIZE)
kfree(p);
}
// Free the page of physical memory pointed at by pa,
// which normally should have been returned by a
// call to kalloc(). (The exception is when
// initializing the allocator; see kinit above.)
void
kfree(void *pa)
{
struct run *r;
if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
panic("kfree");
// Fill with junk to catch dangling refs.
memset(pa, 1, PGSIZE);
r = (struct run*)pa;
acquire(&kmem.lock);
r->next = kmem.freelist;
kmem.freelist = r;
release(&kmem.lock);
}
// Allocate one 4096-byte page of physical memory.
// Returns a pointer that the kernel can use.
// Returns 0 if the memory cannot be allocated.
void *
kalloc(void)
{
struct run *r;
acquire(&kmem.lock);
r = kmem.freelist;
if(r)
kmem.freelist = r->next;
release(&kmem.lock);
if(r)
memset((char*)r, 5, PGSIZE); // fill with junk
return (void*)r;
}
When the system boots, all of the pages from end up to PHYSTOP should be free.
#define PGROUNDUP(sz) (((sz)+PGSIZE-1) & ~(PGSIZE-1))
This macro rounds sz up to the next multiple of PGSIZE (4096).
For example, if sz = 4097:
sz = 4097 = 0000...0001 0000 0000 0001 (64 bits)
PGSIZE-1 = 4095 = 0000...0000 1111 1111 1111
sz+PGSIZE-1 = 0000...0010 0000 0000 0000 ----A
~(PGSIZE-1) = 1111...1111 0000 0000 0000 ----B
A & B = 0000...0010 0000 0000 0000 = 8192
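The rounding can be checked on a few values in user space. The macros are copied from the note and from xv6's riscv.h; the wrapper functions round_up and round_down are hypothetical and exist only to make the macros callable.

```c
#include <stdint.h>

#define PGSIZE 4096
#define PGROUNDUP(sz)  (((sz) + PGSIZE - 1) & ~(PGSIZE - 1))
#define PGROUNDDOWN(a) (((a)) & ~(PGSIZE - 1))

/* Hypothetical wrappers so the macros can be exercised like functions. */
uint64_t round_up(uint64_t sz)  { return PGROUNDUP(sz); }
uint64_t round_down(uint64_t a) { return PGROUNDDOWN(a); }
```

An address that is already page-aligned is left unchanged by both macros, which is why freerange() can safely apply PGROUNDUP to end.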
// Fill with junk to catch dangling refs.
memset(pa, 1, PGSIZE);
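Its purpose is to fill the whole page with the byte 1 (0x01) when a physical page is freed, so that dangling references and reads of uninitialized memory are easier to detect. This is a common debugging technique in kernel development for catching memory bugs. Without this memset(), a freed page could still contain old data. For example:
char *p = kalloc();
strcpy(p, "Hello world");
kfree(p);
// suppose kalloc() hands out the same page again
char *q = kalloc();
printf("%s\n", q); // might print "Hello world", but this is undefined behavior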
After the first kfree()
the first 8 bytes are NULL (64 bits of 0), because freelist is NULL at first, so r->next is also NULL, which is equivalent to
*(uint64*)pa = NULL;
The second kfree()
After the Last round
The first 8 bytes store PHYSTOP-8KiB because that is the start address of the next free page.
So the mind-blowing thing is that there is no separate structure keeping track of all the free pages, such as a table recording which pages are free and which are not. The memory itself is THE data structure: each free page stores the linked-list node (struct run) in its own first bytes, and freelist is all we need to track every free page.
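A toy version of this allocator fits in a few lines of user-space C. This is a sketch under loud assumptions: a small static arena stands in for physical memory, there is no locking and no junk-filling, and every toy_ name is hypothetical.

```c
#include <stddef.h>

#define PGSIZE 4096
#define NPAGES 4                   /* size of the toy arena; assumed */

struct run { struct run *next; };

/* A page is either a list node (while free) or raw bytes (while in use). */
union page {
    struct run run;
    unsigned char bytes[PGSIZE];
};

static union page arena[NPAGES];   /* stands in for physical memory */
static struct run *freelist;

/* Push a page on the front of the list; the link lives in the page itself. */
void toy_kfree(void *pa)
{
    struct run *r = (struct run *)pa;
    r->next = freelist;
    freelist = r;
}

/* Pop the first free page, or return NULL when none is left. */
void *toy_kalloc(void)
{
    struct run *r = freelist;
    if (r)
        freelist = r->next;
    return r;
}

/* Like freerange(): free every page, lowest address first. */
void toy_kinit(void)
{
    freelist = NULL;
    for (int i = 0; i < NPAGES; i++)
        toy_kfree(&arena[i]);
}

/* Address of page i, for inspecting what the allocator hands out. */
void *toy_arena_page(int i) { return &arena[i]; }
```

After toy_kinit() the list runs from the highest page down to the lowest, so the first toy_kalloc() returns the last page, just as the first kalloc() in xv6 returns the page just below PHYSTOP.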
vm.c
pagetable_t
kvmmake(void)
{
pagetable_t kpgtbl;
kpgtbl = (pagetable_t) kalloc(); //* n: allocate a 4 KiB page (returns a pointer)
memset(kpgtbl, 0, PGSIZE); //* n: set all 4096 bytes of the page table to 0
//* n: map the hardware devices
// uart registers
//* n: the virtual address being mapped is UART0 (2nd argument) and the physical address (3rd argument) is also UART0, so this is a one-to-one mapping
kvmmap(kpgtbl, UART0, UART0, PGSIZE, PTE_R | PTE_W);
// virtio mmio disk interface
kvmmap(kpgtbl, VIRTIO0, VIRTIO0, PGSIZE, PTE_R | PTE_W);
// PLIC
kvmmap(kpgtbl, PLIC, PLIC, 0x4000000, PTE_R | PTE_W);
// map kernel text executable and read-only.
kvmmap(kpgtbl, KERNBASE, KERNBASE, (uint64)etext-KERNBASE, PTE_R | PTE_X);
// map kernel data and the physical RAM we'll make use of.
kvmmap(kpgtbl, (uint64)etext, (uint64)etext, PHYSTOP-(uint64)etext, PTE_R | PTE_W);
// map the trampoline for trap entry/exit to
// the highest virtual address in the kernel.
kvmmap(kpgtbl, TRAMPOLINE, (uint64)trampoline, PGSIZE, PTE_R | PTE_X);
// allocate and map a kernel stack for each process.
proc_mapstacks(kpgtbl);
return kpgtbl;
}
// Initialize the one kernel_pagetable
void
kvminit(void)
{
kernel_pagetable = kvmmake();
}
// Switch h/w page table register to the kernel's page table,
// and enable paging.
void
kvminithart()
{
// wait for any previous writes to the page table memory to finish.
sfence_vma();
w_satp(MAKE_SATP(kernel_pagetable));
// flush stale entries from the TLB.
sfence_vma();
}
etext is the end address of the kernel code. It is a symbol defined by PROVIDE(etext = .); in kernel.ld (the linker script).
//kernel.ld
.text : {
*(.text .text.*)
. = ALIGN(0x1000);
_trampoline = .;
*(trampsec)
. = ALIGN(0x1000);
ASSERT(. - _trampoline == 0x1000, "error: trampoline larger than one page");
PROVIDE(etext = .);
}
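The .text section holds the kernel's code (executable) and starts at 0x80000000. etext records where the code section ends, that is, the end of .text.
The memory layout in xv6:
0x80000000 ┌──────────┐
           │ .text    │ (executable code)
etext →    ├──────────┤
           │ .rodata  │ (read-only data)
           │ .data    │ (global variables)
           │ .bss     │ (uninitialized global variables)
end →      ├──────────┤
           │ heap     │
           │ ...      │
PHYSTOP →  └──────────┘ (end of usable memory)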
void
kvminit(void)
{
kernel_pagetable = kvmmake();
}
/*
* the kernel's page table.
*/
pagetable_t kernel_pagetable;
typedef uint64 *pagetable_t; // 512 PTEs
So kernel_pagetable is just a pointer to a uint64, that is, to the first of the 512 PTEs in the root page-table page.
Each 9-bit VPN field selects one of 512 PTEs, and the 12-bit page offset gives a page size of 4096 bytes (4 KiB).
// Create PTEs for virtual addresses starting at va that refer to
// physical addresses starting at pa.
// va and size MUST be page-aligned.
// Returns 0 on success, -1 if walk() couldn't
// allocate a needed page-table page.
int
mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm)
{
uint64 a, last;
pte_t *pte;
if((va % PGSIZE) != 0)
panic("mappages: va not aligned");
if((size % PGSIZE) != 0)
panic("mappages: size not aligned");
if(size == 0)
panic("mappages: size");
a = va; //* n: virtual address of the first page to map
last = va + size - PGSIZE; //* n: virtual address of the last page to map
for(;;){
if((pte = walk(pagetable, a, 1)) == 0) //* n: if walk() works, then pte will be a pointer point to a PTE
return -1;
if(*pte & PTE_V)
panic("mappages: remap");
*pte = PA2PTE(pa) | perm | PTE_V;
if(a == last)
break;
a += PGSIZE;
pa += PGSIZE;
}
return 0;
}
Why do we need if((va % PGSIZE) != 0) panic("mappages: va not aligned"); ?
Shouldn't va always be aligned to PGSIZE? Why not just write it like below ???
Why can't I use run and debug like him?
// Return the address of the PTE in page table pagetable
// that corresponds to virtual address va. If alloc!=0,
// create any required page-table pages.
//
// The risc-v Sv39 scheme has three levels of page-table
// pages. A page-table page contains 512 64-bit PTEs.
// A 64-bit virtual address is split into five fields:
// 39..63 -- must be zero.
// 30..38 -- 9 bits of level-2 index.
// 21..29 -- 9 bits of level-1 index.
// 12..20 -- 9 bits of level-0 index.
// 0..11 -- 12 bits of byte offset within the page.
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
if(va >= MAXVA)
panic("walk");
for(int level = 2; level > 0; level--) {
pte_t *pte = &pagetable[PX(level, va)];
if(*pte & PTE_V) { //* n: PTE_V is valid bit
pagetable = (pagetable_t)PTE2PA(*pte); //* n: get the address of the next-level page table
} else {
if(!alloc || (pagetable = (pde_t*)kalloc()) == 0)
return 0;
memset(pagetable, 0, PGSIZE);
*pte = PA2PTE(pagetable) | PTE_V;
}
}
return &pagetable[PX(0, va)];
}
// extract the three 9-bit page table indices from a virtual address.
#define PXMASK 0x1FF // 9 bits
#define PXSHIFT(level) (PGSHIFT+(9*(level))) //* n: computes the bit offset of the index for this level
#define PX(level, va) ((((uint64) (va)) >> PXSHIFT(level)) & PXMASK) //* n: extracts the VPN (page-table index) for this level
// one beyond the highest possible virtual address.
// MAXVA is actually one bit less than the max allowed by
// Sv39, to avoid having to sign-extend virtual addresses
// that have the high bit set.
#define MAXVA (1L << (9 + 9 + 9 + 12 - 1))
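The index extraction can be tried on real addresses outside the kernel. The macros below are the ones shown above with uint64 replaced by uint64_t; the wrapper px is a hypothetical helper for this note.

```c
#include <stdint.h>

#define PGSHIFT 12                          /* bits of offset within a page */
#define PXMASK  0x1FF                       /* 9 bits */
#define PXSHIFT(level) (PGSHIFT + (9 * (level)))
#define PX(level, va)  ((((uint64_t)(va)) >> PXSHIFT(level)) & PXMASK)

/* Hypothetical wrapper so PX can be exercised like a function. */
uint64_t px(int level, uint64_t va) { return PX(level, va); }
```

For UART0 at 0x10000000 this gives index 0 at level 2, 128 at level 1, and 0 at level 0; for KERNBASE at 0x80000000 the level-2 index is 2.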
riscv manual page 118
how do multi-level page table work, e.g. 4-level
video teach
// UART0: 0x1000 0000
// PGSIZE: 4096
kvmmap(kpgtbl, UART0, UART0, PGSIZE, PTE_R | PTE_W);
// va: 0x1000 0000
// pa: 0x1000 0000
// sz: 4096
// perm: PTE_R | PTE_W = 0x2 | 0x4 = 0x6
void
kvmmap(pagetable_t kpgtbl, uint64 va, uint64 pa, uint64 sz, int perm)
{
if(mappages(kpgtbl, va, sz, pa, perm) != 0)
panic("kvmmap");
}
// Create PTEs for virtual addresses starting at va that refer to
// physical addresses starting at pa.
// va and size MUST be page-aligned.
// Returns 0 on success, -1 if walk() couldn't
// allocate a needed page-table page.
int
mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm)
{
uint64 a, last;
pte_t *pte;
if((va % PGSIZE) != 0)
panic("mappages: va not aligned");
if((size % PGSIZE) != 0)
panic("mappages: size not aligned");
if(size == 0)
panic("mappages: size");
a = va;
last = va + size - PGSIZE;
for(;;){
if((pte = walk(pagetable, a, 1)) == 0)
return -1;
if(*pte & PTE_V)
panic("mappages: remap");
*pte = PA2PTE(pa) | perm | PTE_V;
if(a == last)
break;
a += PGSIZE;
pa += PGSIZE;
}
return 0;
}
walk(pagetable, a, 1)
// pgtable: 0x87ff f000
// va: 0x1000 0000 (268435456)
// alloc: 1
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
if(va >= MAXVA)
panic("walk");
for(int level = 2; level > 0; level--) {
pte_t *pte = &pagetable[PX(level, va)]; // pte points to PTE number VPN[level] of va in pagetable. PX(2, 0x1000 0000) = 0, so pte = &pagetable[0] = 0x87ff f000. The page table was just zeroed, so pagetable[0] = 0 and *pte = 0
if(*pte & PTE_V) { // *pte = 0, so the valid bit of PTE 0 is 0 and the condition is false: go to else
pagetable = (pagetable_t)PTE2PA(*pte);
} else {
if(!alloc || (pagetable = (pde_t*)kalloc()) == 0) // allocate a new page for the next level; in this example pagetable = 0x87ff e000 != 0
return 0;
memset(pagetable, 0, PGSIZE); // set every byte of the page at 0x87ff e000 to 0
*pte = PA2PTE(pagetable) | PTE_V;
// pagetable: 0x87ff e000
// pagetable: 1000 0111 1111 1111 1110 0000 0000 0000
// PA2PTE(pagetable): 0010 0001 1111 1111 1111 1000 0000 0000
// PA2PTE(pagetable): 0x21ff f800
// *pte = L2pagetable[0] = 0x21ff f801
}
}
return &pagetable[PX(0, va)];
}
L2pagetable PTE0 stores 0x21fff801, and all other PTE are 0
// pgtable: 0x87ff e000
// va: 0x1000 0000 (268435456)
// alloc: 1
pte_t *
walk(pagetable_t pagetable, uint64 va, int alloc)
{
if(va >= MAXVA)
panic("walk");
for(int level = 2; level > 0; level--) {
pte_t *pte = &pagetable[PX(level, va)];
//va: 0001 0000 0000 0000 0000 0000 0000 0000
//PX(1,va): bits 29..21 of va = 0 1000 0000 = 128
//&pagetable[128] = 0x87ff e000 + 128*8 (each PTE is 8 bytes) = 0x87ff e000 + 0x400 = 0x87ff e400
//pte: 0x87ff e400
if(*pte & PTE_V) { // *pte = 0 because pagetable[128] = 0, so the condition is false: go to else
pagetable = (pagetable_t)PTE2PA(*pte);
} else {
if(!alloc || (pagetable = (pde_t*)kalloc()) == 0) // allocate a new page for the next level; in this example pagetable = 0x87ff d000 != 0
return 0;
memset(pagetable, 0, PGSIZE); // set every byte of the page at 0x87ff d000 to 0
*pte = PA2PTE(pagetable) | PTE_V;
// pagetable: 0x87ff d000
// pagetable: 1000 0111 1111 1111 1101 0000 0000 0000
// PA2PTE(pagetable): 0010 0001 1111 1111 1111 0100 0000 0000
// PA2PTE(pagetable): 0x21ff f400
// *pte=L1pagetable[128] = 0x21ff f401
}
}
return &pagetable[PX(0, va)];
}
L1pagetable PTE128 stores 0x21fff401, and all other PTEs are 0. All PTEs in L0pagetable are 0.
After this, level = 0, so walk() returns &pagetable[PX(0, va)], which is &L0pagetable[0] = 0x87ff d000
After walk() returns 0x87ffd000, we go back to mappages (line 21)
// pa: 0x10000000
// perm: 6 = 0110
mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm)
{
uint64 a, last;
pte_t *pte;
if((va % PGSIZE) != 0)
panic("mappages: va not aligned");
if((size % PGSIZE) != 0)
panic("mappages: size not aligned");
if(size == 0)
panic("mappages: size");
a = va;
last = va + size - PGSIZE;
for(;;){
if((pte = walk(pagetable, a, 1)) == 0)
return -1;
if(*pte & PTE_V) // <------------ start from here. *pte = 0, so the condition is false: go to line 23
panic("mappages: remap");
*pte = PA2PTE(pa) | perm | PTE_V; //* n: after walk(), we have arrived at PTE0 of L0pagetable, but it has no value yet; assigning it is the job of this line
// PA2PTE(pa): 0x4000000
// 0x4000000 | 6 | 1 = 0x4000007
// *pte = 0x4000007
// so PTE0 in L0pagetable stores 0x4000007
if(a == last)
break;
a += PGSIZE;
pa += PGSIZE;
}
return 0;
}
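The PTE values in this walkthrough can be recomputed outside the kernel. The macros follow xv6's riscv.h with uint64 replaced by uint64_t; make_pte and pte_to_pa are hypothetical wrappers for this note.

```c
#include <stdint.h>

#define PTE_V (1L << 0)   /* valid */
#define PTE_R (1L << 1)   /* readable */
#define PTE_W (1L << 2)   /* writable */

/* Drop the 12-bit page offset, then make room for the 10 flag bits. */
#define PA2PTE(pa)  ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte) (((pte) >> 10) << 12)

/* Hypothetical wrappers mirroring the assignments in mappages() and walk(). */
uint64_t make_pte(uint64_t pa, uint64_t perm) { return PA2PTE(pa) | perm | PTE_V; }
uint64_t pte_to_pa(uint64_t pte)              { return PTE2PA(pte); }
```

make_pte(0x10000000, PTE_R | PTE_W) gives 0x4000007, and make_pte(0x87ffe000, 0) gives 0x21fff801, matching the values traced above.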
On the left, xv6’s kernel address space. RWX refer to PTE read, write, and execute permissions. On the right, the RISC-V physical address space that xv6 expects to see
user/kill.c
#include "kernel/types.h"
#include "kernel/stat.h"
#include "user/user.h"
int
main(int argc, char **argv)
{
int i;
if(argc < 2){
fprintf(2, "usage: kill pid...\n");
exit(1);
}
for(i=1; i<argc; i++)
kill(atoi(argv[i]));
exit(0);
}
//user.h
// system calls
int fork(void);
int exit(int) __attribute__((noreturn));
int wait(int*);
int pipe(int*);
int write(int, const void*, int);
int read(int, void*, int);
int close(int);
int kill(int);
int exec(const char*, char**);
int open(const char*, int);
int mknod(const char*, short, short);
int unlink(const char*);
int fstat(int fd, struct stat*);
int link(const char*, const char*);
int mkdir(const char*);
int chdir(const char*);
int dup(int);
int getpid(void);
char* sbrk(int);
int sleep(int);
int uptime(void);
These are the system calls. The bodies of these functions are generated automatically by a Perl script called usys.pl.
int exit(int) __attribute__((noreturn));
__attribute__ is built into GCC/Clang, so you don't need to define it, and you won't find its definition in VS Code. int exit(int) __attribute__((noreturn)) tells the compiler that this function will not return.
#!/usr/bin/perl -w
# Generate usys.S, the stubs for syscalls.
print "# generated by usys.pl - do not edit\n";
print "#include \"kernel/syscall.h\"\n";
sub entry {
my $name = shift;
print ".global $name\n";
print "${name}:\n";
print " li a7, SYS_${name}\n";
print " ecall\n";
print " ret\n";
}
entry("fork");
entry("exit");
entry("wait");
entry("pipe");
entry("read");
entry("write");
entry("close");
entry("kill");
entry("exec");
entry("open");
entry("mknod");
entry("unlink");
entry("fstat");
entry("link");
entry("mkdir");
entry("chdir");
entry("dup");
entry("getpid");
entry("sbrk");
entry("sleep");
entry("uptime");
This script generates assembly code, which goes into the file usys.S. First it prints two header lines (the strings after the word 'print'), and then it emits a small assembly stub for each of the 21 system calls.
'shift' is a Perl built-in function that removes and returns the first element of an array; inside a sub with no argument it takes the first element of the argument list (@_).
So the generated code looks like:
# generated by usys.pl - do not edit
#include "kernel/syscall.h"
.global fork
fork:
li a7, SYS_fork
ecall
ret
.global exit
exit:
li a7, SYS_exit
ecall
ret
......
'ecall' ends execution in user mode and switches to kernel mode. After the kernel completes the requested job (such as fork), it returns to user mode and executes the next instruction (ret).
Arguments are passed in the 'a' registers. For example:
int open(const char*, int);
The first argument (const char*) goes in a0 and the second in a1. If there are more, they go in a2, a3, and so on. The return value comes back in a0.
satp (Supervisor Address Translation and Protection) is a CSR (control and status register). It is set to point to a page table.
Page tables are kept in main memory, and at any one time exactly one page table is in use: the one pointed to by the satp register.
Virtual address translation is always on in S and U mode after initialization.
There is one page table for the kernel, and all cores share it. It provides a (mostly) one-to-one mapping.
There's also one page table per user process
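What "points to a page table" means for satp can be shown with the MAKE_SATP macro that kvminithart() uses. The definitions below follow xv6's riscv.h as far as I recall them, so treat them as an assumption; make_satp is a hypothetical wrapper.

```c
#include <stdint.h>

/* Mode field in the top 4 bits: 8 selects Sv39. */
#define SATP_SV39 (UINT64_C(8) << 60)

/* satp holds the mode plus the root page table's physical page number. */
#define MAKE_SATP(pagetable) (SATP_SV39 | (((uint64_t)(pagetable)) >> 12))

/* Hypothetical wrapper: the value written to satp for a given root page. */
uint64_t make_satp(uint64_t root_pa) { return MAKE_SATP(root_pa); }
```

So satp does not store the address itself but the address shifted right by 12 bits, alongside the mode. A value of 0, as written by w_satp(0) in start(), means no translation at all.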
TRAPS:
Examples:
It is a CSR that holds a pointer: the address of the first instruction of the code that will handle whatever trap has occurred.
There are two pieces of code in xv6 that handle traps:
These registers can only be read or written in supervisor mode; user mode cannot access them. Machine mode has a similar set of registers, but xv6 only uses them when handling timer interrupts.
When a trap occurs, the RISC-V CPU performs the following steps to handle it:
The user page table will always mark the top two pages as not accessible in user mode, but readable and executable (R/X/~U)
It contains code(uservec and userret) virtual addresses for user and kernel 07:30
🖥️ Step 1: Open the first terminal and start QEMU
1️⃣ Go to the xv6-riscv directory
cd ~/xv6-riscv
Make sure xv6 has been built. If it has not, first run:
make clean
make qemu-gdb
If the qemu-gdb target does not exist, try:
make qemu GDB=1
2️⃣ Run QEMU with the GDB server enabled
make qemu-gdb
📌 On success you should see output similar to this:
*** Now run 'gdb' in another window.
/home/neat/qemu-9.2.0/build/qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel …
This means QEMU has started and is listening for a GDB connection on localhost:26000.
🖥️ Step 2: Open a second terminal and start GDB
1️⃣ Open a new terminal, then go to the xv6-riscv directory
cd ~/xv6-riscv
2️⃣ Start GDB
riscv64-unknown-elf-gdb
📌 On success you should see output similar to this:
GNU gdb (GDB) 15.2 Copyright © 2024 Free Software Foundation, Inc. … (gdb)
3️⃣ Load the xv6 kernel. Inside GDB, type:
file kernel/kernel
📌 On success you should see:
Reading symbols from kernel/kernel…
This step tells GDB which program to debug. If you skip it, GDB cannot resolve symbols and cannot set breakpoints.
4️⃣ Connect to QEMU. Inside GDB, type:
target remote localhost:26000
📌 On success you should see:
Remote debugging using localhost:26000 0x0000000080001da2 in ?? ()
This means GDB has connected to QEMU.
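🖥️ Step 3: Set a breakpoint and start debugging
1️⃣ Set a breakpoint. Inside GDB, type:
b main
📌 On success you should see:
Breakpoint 1 at 0x80001234: file main.c, line 10.
This means the breakpoint at main() has been set.
2️⃣ Let the program continue. Inside GDB, type:
c
📌 On success, execution should stop at main() and show:
Breakpoint 1, main () at main.c:10 10 { (gdb)
This means xv6 has paused at main(), and you can now start debugging.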