Try   HackMD

OS 2020 HW1 Nachos Overview & Project 1 - Thread Management

Motivation

根據助教的指示,測試執行是否正常

./nachos -e ../test/test2 -e ../test/test1

執行結果如下

Ticks: total 200, idle 32, system 40, user 128
Disk I/O: reads 0, writes 0
Console I/O: reads 0, writes 0
Paging: faults 0
Network I/O: packets received 0, sent 0
user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code/userprog$ ./nachos -e ../test/test2 -e ../test/test1
Total threads number is 2
Thread ../test/test2 is executing.
Thread ../test/test1 is executing.
Print integer:20
Print integer:21
Print integer:22
Print integer:9
Print integer:8
Print integer:7
Print integer:6
Print integer:23
Print integer:22
Print integer:21
Print integer:20
Print integer:19
return value:0
Print integer:18
Print integer:17
Print integer:16
Print integer:15
Print integer:14
Print integer:13
Print integer:12
Print integer:11
Print integer:10
Print integer:9
Print integer:8
Print integer:7
Print integer:6
return value:0
No threads ready or runnable, and no pending interrupts.
Assuming the program completed.
Machine halting!

Ticks: total 600, idle 23, system 100, user 477
Disk I/O: reads 0, writes 0
Console I/O: reads 0, writes 0
Paging: faults 0
Network I/O: packets received 0, sent 0

仔細觀察結果後發現一個大問題,原本test2的輸出是遞增,但在載入test1之後,test2的輸出卻變成了遞減。於是開始檢查test1與test2的程式內容。

test1.c :

#include "syscall.h" main() { int n; for (n=9;n>5;n--) PrintInt(n); }

test2.c :

#include "syscall.h" main() { int n; for (n=20;n<=25;n++) PrintInt(n); }

看完程式碼後發現test2改變的行為不只是遞減,連停止條件都不一樣了。但很明顯的,test2的行為恰好變成了test1的行為(遞減並且只有在比5大的時候才執行),感覺像是被test1覆蓋過去一樣,所以可能是context switching的時候出了問題導致這樣的結果,但也不排除有其他原因造成此現象。

Implementation

尋找錯誤

如果是context switching的問題,那可能的原因是在context switching的時候,test2在記憶體的部分狀態(可能是PCB也可能是儲存指令的記憶體內容)被test1覆蓋掉了。所以首先檢查userprog底下的最像是記憶體管理的檔案。
addrspace .cc :

bool AddrSpace::Load(char *fileName) { OpenFile *executable = kernel->fileSystem->Open(fileName); NoffHeader noffH; unsigned int size; if (executable == NULL) { cerr << "Unable to open file " << fileName << "\n"; return FALSE; } executable->ReadAt((char *)&noffH, sizeof(noffH), 0); if ((noffH.noffMagic != NOFFMAGIC) && (WordToHost(noffH.noffMagic) == NOFFMAGIC)) SwapHeader(&noffH); ASSERT(noffH.noffMagic == NOFFMAGIC); // how big is address space? size = noffH.code.size + noffH.initData.size + noffH.uninitData.size + UserStackSize; // we need to increase the size // to leave room for the stack numPages = divRoundUp(size, PageSize); // cout << "number of pages of " << fileName<< " is "<<numPages<<endl; size = numPages * PageSize; ASSERT(numPages <= NumPhysPages); // check we're not trying // to run anything too big -- // at least until we have // virtual memory DEBUG(dbgAddr, "Initializing address space: " << numPages << ", " << size); // then, copy in the code and data segments into memory if (noffH.code.size > 0) { DEBUG(dbgAddr, "Initializing code segment."); DEBUG(dbgAddr, noffH.code.virtualAddr << ", " << noffH.code.size); executable->ReadAt( &(kernel->machine->mainMemory[noffH.code.virtualAddr]), noffH.code.size, noffH.code.inFileAddr); } if (noffH.initData.size > 0) { DEBUG(dbgAddr, "Initializing data segment."); DEBUG(dbgAddr, noffH.initData.virtualAddr << ", " << noffH.initData.size); executable->ReadAt( &(kernel->machine->mainMemory[noffH.initData.virtualAddr]), noffH.initData.size, noffH.initData.inFileAddr); } delete executable; // close file return TRUE; // success }

若真的是context switching的時候發生錯誤,那錯誤應該會發生在addrspace.cc的130行以及137行。假如一起執行test1, test2時,noffH.code.virtualAddr的值都一樣的話,就代表context switching導致了上述之錯誤。因此我想要使用內建debug功能追蹤,而程式很剛好地在128行就有virtual address的debug,所以只要在error message找到它就好。以下擷取debug message中含有Initializing的部分。

Initializing address space: 11, 1408
Initializing code segment.
0, 336
Initializing stack pointer: 1392
...
Initializing address space: 11, 1408
Initializing code segment.
0, 336
Initializing stack pointer: 1392
...

可以看到的是兩次code segment的virtual address是相同的,在這個addrspace.cc的寫法之下,兩次的physical address也會是相同的,也就是說test1的指令真的覆蓋掉了test2的指令。不過這裡我就產生了一個疑問,既然記憶體的內容被覆蓋了,而且也沒有看到Initializing data segment,那為什麼test2中n的內容沒有被改變呢?

解決錯誤

因為兩次的virtual address都是相同,所以不能再繼續將virtual address直接對應到physical address。我們必須對每一個程式維護一個page table,使得兩個程式所使用的physical address不會重複。為此我們必須要有一個維護physical address的功能。為此增加phyPageStat用於管理physical page的使用情形,以及numFreePages方便記錄剩餘可用的physical page數目。

addrspace.h:

class AddrSapce{
...
    static bool phyPageStat[NumPhysPages];  // Managing usage of physical pages

    static unsigned int numFreePages;  // Recording the physical pages can be used
...
}

為了增加程式的可讀性,定義unused跟occupied,兩者用於表示physical page的使用情形。

addrspace.h:

#define UNUSED false
#define OCCU true

於程式的一開始初始化physical page相關的變數。
addrspace .cc :

bool AddrSpace::phyPageStat[NumPhysPages] = {UNUSED};
unsigned int AddrSpace::numFreePages = NumPhysPages;

因為新增加了維護physical page的工作,所以在destruction的時候,記得要把physical page設定成unused,並一同維護可用page數目。
addrspace .cc :

AddrSpace::~AddrSpace()
{
   unsigned int i = 0;
   for(i = 0; i < numPages; i += 1){
	AddrSpace::phyPageStat[pageTable[i].physicalPage] = UNUSED;
	AddrSpace::numFreePages += 1;
   }
   delete pageTable;
}

我在page table維護的策略就是從頭開始找physical address,有能用的就先佔下來,不能用的就跳過。
addrspace .cc :

AddrSpace::Load(char *fileName)
{
    ...
// find the unused physical memory to used
   unsigned int i = 0, phyPageId = 0;
   for(; i < numPages; i += 1){
	while((phyPageId < NumPhysPages) && (AddrSpace::phyPageStat[phyPageId] == OCCU)) phyPageId += 1;
	AddrSpace::phyPageStat[phyPageId] = OCCU;
	AddrSpace::numFreePages--;
	bzero(&(kernel->machine->mainMemory[phyPageId * PageSize]), PageSize);
	pageTable[i].physicalPage = phyPageId;
	pageTable[i].use = false; 
	pageTable[i].valid = true; 
   }
   ...
}

接下來在把code讀進memory的時候必須要把這邊的virtual address轉回physical address,這就是利用剛剛建立的page table的時候了。我們首先使用noffH.code.virtualAddr / PageSize求出這段程式在virtual memory中是位於第幾個page,接著丟進pageTable中找出它在physical memory中的page index,最後不要忘記要把offset(noffH.code.virtualAddr % PageSize)加回來。
而對於data的部分也是同理。
addrspace .cc :

AddrSpace::Load(char *fileName)
{
    ...
// then, copy in the code and data segments into memory
	if (noffH.code.size > 0) {
        DEBUG(dbgAddr, "Initializing code segment.");
	DEBUG(dbgAddr, noffH.code.virtualAddr << ", " << noffH.code.size);
        	executable->ReadAt(
		&(kernel->machine->mainMemory[
					pageTable[noffH.code.virtualAddr / PageSize].physicalPage * PageSize + (noffH.code.virtualAddr % PageSize)
					]), 
			noffH.code.size, noffH.code.inFileAddr);
    }
	if (noffH.initData.size > 0) {
        DEBUG(dbgAddr, "Initializing data segment.");
	DEBUG(dbgAddr, noffH.initData.virtualAddr << ", " << noffH.initData.size);
        executable->ReadAt(
		&(kernel->machine->mainMemory[
					pageTable[noffH.initData.virtualAddr / PageSize].physicalPage * PageSize + (noffH.initData.virtualAddr % PageSize)
					]),
			noffH.initData.size, noffH.initData.inFileAddr);
        }
        ...
}

Result

修改後結果

一樣輸入一開始所使用的指令

./nachos -e ../test/test2 -e ../test/test1

執行結果如下

Total threads number is 2
Thread ./test/test2 is executing.
Thread ./test/test1 is executing.
Print integer:20
Print integer:21
Print integer:22
Print integer:9
Print integer:8
Print integer:7
Print integer:6
Print integer:23
Print integer:24
Print integer:25
return value:0
return value:0
No threads ready or runnable, and no pending interrupts.
Assuming the program completed.
Machine halting!

Ticks: total 300, idle 8, system 70, user 222
Disk I/O: reads 0, writes 0
Console I/O: reads 0, writes 0
Paging: faults 0
Network I/O: packets received 0, sent 0

已經恢復正常運作,看來只有contex switching的時候有問題,順利解決。

觀察

發現問題

test1雖然覆蓋了test2的指令,而且也沒有看到Initializing data segment(即initData.size <= 0),那為什麼沒有連test2中n的內容都改變呢?換言之,test2的n在context switching的時候是儲存在甚麼地方,為甚麼沒有被改到?

嘗試解答

在我看來n是屬於data的範疇,而不是code。但是從debug message看起來initData.size <= 0。所以首先檢查initData.sizeuninitData.size。於Load函式中加上DEBUG觀察這兩個變數。

addrspace .cc :

AddrSpace::Load(char *fileName)
{
    ...
    DEBUG(dbgTu, "initDataSize = " << noffH.initData.size << ", uninitDataSize = " << noffH.uninitData.size);
    ...
}

為此我也寫了一支程式來看看上述兩個data size究竟是甚麼代表意思。
test3.c :

#include <syscall.h>

int main()
{
    int a = 0, b = 0, i = 0;
    for(i = 0; i < 10; i += 1) a = i;
    b = a + b;
    return 0;
}

debug message:

...
initDataSize = 0, uninitDataSize = 0
...

代表這些變數根本就不是放在data那裡,緊接著的疑問是,那究竟甚麼樣的東西會放在data那邊呢?其實仔細想想main本身就是一個函式,所以是不是nachos把main中的所有東西都放到了code segment,而所謂的data放的是全域變數呢?
test3.c :

#include <syscall.h>

char tt3 = 'a';
short tt4;

int main()
{
    int a = 0, b = 0, i = 0;
    for(i = 0; i < 10; i += 1) a = i;
    b = a + b;
    return 0;
}

debug message:

...
initDataSize = 16, uninitDataSize = 32
...

這樣就印證了對於全域變數和data之間的關係的猜測是正確的,但是這個結果非但沒有解決最初的問題,反而還讓問題變得複雜。根據nachos的程式寫法,他把資料區分為data和code兩者,既然test2中的n不是放在data之中,那就一定是放在code之中了。既然如此,在context switching發生錯誤時,n的值不是更應該被更動嗎?
重新執行一次test1 test2

./nachos -e ../test/test2 -e ../test/test1

debug message:

Switching from: main to: test/test2
...
Switching from: test/test2 to: test/test1
...
Switching from: test/test1 to: test/test2
...

透過此debug message發現context switching程式碼的關鍵字有switch,所以想到可以利用這個關鍵字找到程式碼的位置。

grep -r "Switching" *

透過grep找到此訊息出處為./code/threads/scheduler.cc之中,而其中又有幾行耐人尋味的程式碼。
scheduler. cc :

void
Scheduler::Run (Thread *nextThread, bool finishing)
{
    ...
    if (oldThread->space != NULL) {     // if this thread is a user program,
        oldThread->SaveUserState();     // save the user's CPU registers
        oldThread->space->SaveState();
    }
    ...
}

thread. cc :

void
Thread::SaveUserState()
{
    for (int i = 0; i < NumTotalRegs; i++)
        userRegisters[i] = kernel->machine->ReadRegister(i);
}

thread.h :

class Thread {
...
    int userRegisters[NumTotalRegs];    // user-level CPU register state
...
    }

這下終於找到了n到底存在哪裡,scheduler在變更thread的時候會先把舊的User Sate存到位於threaduserRegisters之中,執行新的thread時,會把User State從該threaduserRegisters中拿出來。而這幾個userRegisters就是test1, test2各自的n所待的地方。

Debug

First Attemption

為了追蹤noffH.code.virtualAddr,先從./code/threads/main.cc中把debugArg的值改成"a"(代表開啟dbgAddr flag),這樣就可以在error output中看到debug message。另外也發現debug flag都放在./code/lib/debug.h之中,而檔案最上方的註解也寫明鼓勵加入自己的flag。

Futhermore

透過此文章得知在指令中添加-d參數即可指定debug flag,這個方法明顯比我一開始嘗試的方法好很多。
例如:

./nachos -d a -e ../test/test2 -e ../test/test1

雜談

尋找檔案名時,利用find指令(regular expression)。

例如:

user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ find . -name "kernel.*"
./userprog/kernel.o
./threads/kernel.cc
./threads/kernel.o
./threads/kernel.h
./filesys/kernel.o
./network/kernel.o

尋找method在哪裡被用過,或尋找變數或function在哪裡被定義,可以利用grep指令。

例如:

user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ grep -r "Interrupt"
Binary file userprog/mipssim.o matches
Binary file userprog/kernel.o matches
userprog/synchconsole.cc://      Interrupt handler called when keystroke is hit; wake up
userprog/synchconsole.cc://      Interrupt handler called when it's safe to send the next 
Binary file userprog/elevator.o matches
Binary file userprog/disk.o matches
userprog/exception.cc://	Interrupts (which can also cause control to transfer from user
Binary file userprog/machine.o matches
Binary file userprog/timer.o matches
Binary file userprog/userkernel.o matches
Binary file userprog/console.o matches
Binary file userprog/nachos matches
Binary file userprog/alarm.o matches
Binary file userprog/scheduler.o matches
Binary file userprog/thread.o matches
Binary file userprog/addrspace.o matches
Binary file userprog/synch.o matches
Binary file userprog/main.o matches
Binary file userprog/interrupt.o matches
Binary file userprog/exception.o matches
machine/timer.cc:    SetInterrupt();
...

可能會跑出一堆亂七八糟的東西,此時利用指令pipeline的技巧,試著過濾出想要的東西。

user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ grep -r "Interrupt" | grep "cc"
userprog/synchconsole.cc://      Interrupt handler called when keystroke is hit; wake up
userprog/synchconsole.cc://      Interrupt handler called when it's safe to send the next 
userprog/exception.cc://	Interrupts (which can also cause control to transfer from user
machine/timer.cc:    SetInterrupt();
machine/timer.cc:    SetInterrupt();	// do last, to let software interrupt handler
machine/timer.cc:// Timer::SetInterrupt
machine/timer.cc:Timer::SetInterrupt() 
machine/interrupt.cc:// PendingInterrupt::PendingInterrupt
machine/interrupt.cc:PendingInterrupt::PendingInterrupt(CallBackObj *callOnInt, 
machine/interrupt.cc:    callOnInterrupt = callOnInt;
machine/interrupt.cc:PendingCompare (PendingInterrupt *x, PendingInterrupt *y)
machine/interrupt.cc:// Interrupt::Interrupt
machine/interrupt.cc://	Interrupts start disabled, with no interrupts pending, etc.
machine/interrupt.cc:Interrupt::Interrupt()
machine/interrupt.cc:    pending = new SortedList<PendingInterrupt *>(PendingCompare);
machine/interrupt.cc:// Interrupt::~Interrupt
machine/interrupt.cc:Interrupt::~Interrupt()
machine/interrupt.cc:// Interrupt::ChangeLevel
machine/interrupt.cc:Interrupt::ChangeLevel(IntStatus old, IntStatus now)
machine/interrupt.cc:// Interrupt::SetLevel
machine/interrupt.cc:Interrupt::SetLevel(IntStatus now)
...

Reference

Nachos Beginner's Guide
滑鐵盧大學 ppt
偉大的學長姐
也是偉大的學長姐
nachos debug教學
nachos document
助教的作業指示