# OS 2020 HW1 Nachos Overview & Project 1 - Thread Management ## Motivation 根據助教的指示,測試執行是否正常 ``` ./nachos -e ../test/test2 -e ../test/test1 ``` 執行結果如下 ``` Ticks: total 200, idle 32, system 40, user 128 Disk I/O: reads 0, writes 0 Console I/O: reads 0, writes 0 Paging: faults 0 Network I/O: packets received 0, sent 0 user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code/userprog$ ./nachos -e ../test/test2 -e ../test/test1 Total threads number is 2 Thread ../test/test2 is executing. Thread ../test/test1 is executing. Print integer:20 Print integer:21 Print integer:22 Print integer:9 Print integer:8 Print integer:7 Print integer:6 Print integer:23 Print integer:22 Print integer:21 Print integer:20 Print integer:19 return value:0 Print integer:18 Print integer:17 Print integer:16 Print integer:15 Print integer:14 Print integer:13 Print integer:12 Print integer:11 Print integer:10 Print integer:9 Print integer:8 Print integer:7 Print integer:6 return value:0 No threads ready or runnable, and no pending interrupts. Assuming the program completed. Machine halting! Ticks: total 600, idle 23, system 100, user 477 Disk I/O: reads 0, writes 0 Console I/O: reads 0, writes 0 Paging: faults 0 Network I/O: packets received 0, sent 0 ``` 仔細觀察結果後發現一個大問題,原本test2的輸出是遞增,但在載入test1之後,test2的輸出卻變成了遞減。於是開始檢查test1與test2的程式內容。 ***test1.c*** : ```C= #include "syscall.h" main() { int n; for (n=9;n>5;n--) PrintInt(n); } ``` ***test2.c*** : ```C= #include "syscall.h" main() { int n; for (n=20;n<=25;n++) PrintInt(n); } ``` 看完程式碼後發現test2改變的行為不只是遞減,連停止條件都不一樣了。但很明顯的,test2的行為恰好變成了test1的行為(遞減並且只有在比5大的時候才執行),感覺像是被test1覆蓋過去一樣,所以可能是context switching的時候出了問題導致這樣的結果,但也不排除有其他原因造成此現象。 ## Implementation ### 尋找錯誤 如果是context switching的問題,那可能的原因是在context switching的時候,test2在記憶體的部分狀態(可能是PCB也可能是儲存指令的記憶體內容)被test1覆蓋掉了。所以首先檢查userprog底下的最像是記憶體管理的檔案。 ***addrspace .cc*** : ```C=93 bool AddrSpace::Load(char *fileName) { OpenFile *executable = kernel->fileSystem->Open(fileName); NoffHeader noffH; unsigned int size; if (executable == NULL) { cerr << "Unable to open file " << fileName << "\n"; return FALSE; } executable->ReadAt((char *)&noffH, sizeof(noffH), 0); if ((noffH.noffMagic != NOFFMAGIC) && (WordToHost(noffH.noffMagic) == NOFFMAGIC)) SwapHeader(&noffH); ASSERT(noffH.noffMagic == NOFFMAGIC); // how big is address space? size = noffH.code.size + noffH.initData.size + noffH.uninitData.size + UserStackSize; // we need to increase the size // to leave room for the stack numPages = divRoundUp(size, PageSize); // cout << "number of pages of " << fileName<< " is "<<numPages<<endl; size = numPages * PageSize; ASSERT(numPages <= NumPhysPages); // check we're not trying // to run anything too big -- // at least until we have // virtual memory DEBUG(dbgAddr, "Initializing address space: " << numPages << ", " << size); // then, copy in the code and data segments into memory if (noffH.code.size > 0) { DEBUG(dbgAddr, "Initializing code segment."); DEBUG(dbgAddr, noffH.code.virtualAddr << ", " << noffH.code.size); executable->ReadAt( &(kernel->machine->mainMemory[noffH.code.virtualAddr]), noffH.code.size, noffH.code.inFileAddr); } if (noffH.initData.size > 0) { DEBUG(dbgAddr, "Initializing data segment."); DEBUG(dbgAddr, noffH.initData.virtualAddr << ", " << noffH.initData.size); executable->ReadAt( &(kernel->machine->mainMemory[noffH.initData.virtualAddr]), noffH.initData.size, noffH.initData.inFileAddr); } delete executable; // close file return TRUE; // success } ``` 若真的是context switching的時候發生錯誤,那錯誤應該會發生在`addrspace.cc`的130行以及137行。假如一起執行test1, test2時,`noffH.code.virtualAddr`的值都一樣的話,就代表context switching導致了上述之錯誤。因此我想要使用內建debug功能追蹤,而程式很剛好地在128行就有virtual address的debug,所以只要在error message找到它就好。以下擷取debug message中含有`Initializing`的部分。 ``` Initializing address space: 11, 1408 Initializing code segment. 0, 336 Initializing stack pointer: 1392 ... Initializing address space: 11, 1408 Initializing code segment. 0, 336 Initializing stack pointer: 1392 ... ``` 可以看到的是兩次code segment的virtual address是相同的,在這個`addrspace.cc`的寫法之下,兩次的physical address也會是相同的,也就是說test1的指令真的覆蓋掉了test2的指令。不過這裡我就產生了一個疑問,既然記憶體的內容被覆蓋了,而且也沒有看到`Initializing data segment`,那為什麼test2中n的內容沒有被改變呢? ### 解決錯誤 因為兩次的virtual address都是相同,所以不能再繼續將virtual address直接對應到physical address。我們必須對每一個程式維護一個page table,使得兩個程式所使用的physical address不會重複。為此我們必須要有一個維護physical address的功能。為此增加`phyPageStat`用於管理physical page的使用情形,以及`numFreePages`方便記錄剩餘可用的physical page數目。 ***addrspace.h***: ```C++ class AddrSapce{ ... static bool phyPageStat[NumPhysPages]; // Managing usage of physical pages static unsigned int numFreePages; // Recording the physical pages can be used ... } ``` 為了增加程式的可讀性,定義unused跟occupied,兩者用於表示physical page的使用情形。 ***addrspace.h***: ```C++ #define UNUSED false #define OCCU true ``` 於程式的一開始初始化physical page相關的變數。 ***addrspace .cc*** : ```C++ bool AddrSpace::phyPageStat[NumPhysPages] = {UNUSED}; unsigned int AddrSpace::numFreePages = NumPhysPages; ``` 因為新增加了維護physical page的工作,所以在destruction的時候,記得要把physical page設定成unused,並一同維護可用page數目。 ***addrspace .cc*** : ```C++ AddrSpace::~AddrSpace() { unsigned int i = 0; for(i = 0; i < numPages; i += 1){ AddrSpace::phyPageStat[pageTable[i].physicalPage] = UNUSED; AddrSpace::numFreePages += 1; } delete pageTable; } ``` 我在page table維護的策略就是從頭開始找physical address,有能用的就先佔下來,不能用的就跳過。 ***addrspace .cc*** : ```C++ AddrSpace::Load(char *fileName) { ... // find the unused physical memory to used unsigned int i = 0, phyPageId = 0; for(; i < numPages; i += 1){ while((phyPageId < NumPhysPages) && (AddrSpace::phyPageStat[phyPageId] == OCCU)) phyPageId += 1; AddrSpace::phyPageStat[phyPageId] = OCCU; AddrSpace::numFreePages--; bzero(&(kernel->machine->mainMemory[phyPageId * PageSize]), PageSize); pageTable[i].physicalPage = phyPageId; pageTable[i].use = false; pageTable[i].valid = true; } ... } ``` 接下來在把code讀進memory的時候必須要把這邊的virtual address轉回physical address,這就是利用剛剛建立的page table的時候了。我們首先使用`noffH.code.virtualAddr / PageSize`求出這段程式在virtual memory中是位於第幾個page,接著丟進`pageTable`中找出它在physical memory中的page index,最後不要忘記要把offset(`noffH.code.virtualAddr % PageSize`)加回來。 而對於data的部分也是同理。 ***addrspace .cc*** : ```C++ AddrSpace::Load(char *fileName) { ... // then, copy in the code and data segments into memory if (noffH.code.size > 0) { DEBUG(dbgAddr, "Initializing code segment."); DEBUG(dbgAddr, noffH.code.virtualAddr << ", " << noffH.code.size); executable->ReadAt( &(kernel->machine->mainMemory[ pageTable[noffH.code.virtualAddr / PageSize].physicalPage * PageSize + (noffH.code.virtualAddr % PageSize) ]), noffH.code.size, noffH.code.inFileAddr); } if (noffH.initData.size > 0) { DEBUG(dbgAddr, "Initializing data segment."); DEBUG(dbgAddr, noffH.initData.virtualAddr << ", " << noffH.initData.size); executable->ReadAt( &(kernel->machine->mainMemory[ pageTable[noffH.initData.virtualAddr / PageSize].physicalPage * PageSize + (noffH.initData.virtualAddr % PageSize) ]), noffH.initData.size, noffH.initData.inFileAddr); } ... } ``` ## Result ### 修改後結果 一樣輸入一開始所使用的指令 ``` ./nachos -e ../test/test2 -e ../test/test1 ``` 執行結果如下 ``` Total threads number is 2 Thread ./test/test2 is executing. Thread ./test/test1 is executing. Print integer:20 Print integer:21 Print integer:22 Print integer:9 Print integer:8 Print integer:7 Print integer:6 Print integer:23 Print integer:24 Print integer:25 return value:0 return value:0 No threads ready or runnable, and no pending interrupts. Assuming the program completed. Machine halting! Ticks: total 300, idle 8, system 70, user 222 Disk I/O: reads 0, writes 0 Console I/O: reads 0, writes 0 Paging: faults 0 Network I/O: packets received 0, sent 0 ``` 已經恢復正常運作,看來只有contex switching的時候有問題,順利解決。 ### 觀察 #### 發現問題 test1雖然覆蓋了test2的指令,而且也沒有看到`Initializing data segment`(即`initData.size` <= 0),那為什麼沒有連test2中n的內容都改變呢?換言之,test2的n在context switching的時候是儲存在甚麼地方,為甚麼沒有被改到? #### 嘗試解答 在我看來n是屬於data的範疇,而不是code。但是從debug message看起來`initData.size` <= 0。所以首先檢查`initData.size`和`uninitData.size`。於`Load`函式中加上DEBUG觀察這兩個變數。 ***addrspace .cc*** : ```C++ AddrSpace::Load(char *fileName) { ... DEBUG(dbgTu, "initDataSize = " << noffH.initData.size << ", uninitDataSize = " << noffH.uninitData.size); ... } ``` 為此我也寫了一支程式來看看上述兩個data size究竟是甚麼代表意思。 ***test3.c*** : ```C #include <syscall.h> int main() { int a = 0, b = 0, i = 0; for(i = 0; i < 10; i += 1) a = i; b = a + b; return 0; } ``` debug message: ``` ... initDataSize = 0, uninitDataSize = 0 ... ``` 代表這些變數根本就不是放在data那裡,緊接著的疑問是,那究竟甚麼樣的東西會放在data那邊呢?其實仔細想想`main`本身就是一個函式,所以是不是nachos把`main`中的所有東西都放到了code segment,而所謂的data放的是全域變數呢? ***test3.c*** : ```C #include <syscall.h> char tt3 = 'a'; short tt4; int main() { int a = 0, b = 0, i = 0; for(i = 0; i < 10; i += 1) a = i; b = a + b; return 0; } ``` debug message: ``` ... initDataSize = 16, uninitDataSize = 32 ... ``` 這樣就印證了對於全域變數和data之間的關係的猜測是正確的,但是這個結果非但沒有解決最初的問題,反而還讓問題變得複雜。根據nachos的程式寫法,他把資料區分為data和code兩者,既然test2中的n不是放在data之中,那就一定是放在code之中了。既然如此,在context switching發生錯誤時,n的值不是更應該被更動嗎? 重新執行一次test1 test2 ``` ./nachos -e ../test/test2 -e ../test/test1 ``` debug message: ``` Switching from: main to: test/test2 ... Switching from: test/test2 to: test/test1 ... Switching from: test/test1 to: test/test2 ... ``` 透過此debug message發現context switching程式碼的關鍵字有switch,所以想到可以利用這個關鍵字找到程式碼的位置。 ``` grep -r "Switching" * ``` 透過`grep`找到此訊息出處為`./code/threads/scheduler.cc`之中,而其中又有幾行耐人尋味的程式碼。 ***scheduler. cc*** : ```c void Scheduler::Run (Thread *nextThread, bool finishing) { ... if (oldThread->space != NULL) { // if this thread is a user program, oldThread->SaveUserState(); // save the user's CPU registers oldThread->space->SaveState(); } ... } ``` ***thread. cc*** : ```c void Thread::SaveUserState() { for (int i = 0; i < NumTotalRegs; i++) userRegisters[i] = kernel->machine->ReadRegister(i); } ``` ***thread.h*** : ```c class Thread { ... int userRegisters[NumTotalRegs]; // user-level CPU register state ... } ``` 這下終於找到了n到底存在哪裡,scheduler在變更thread的時候會先把舊的User Sate存到位於`thread`的`userRegisters`之中,執行新的thread時,會把User State從該`thread`的`userRegisters`中拿出來。而這幾個`userRegisters`就是test1, test2各自的n所待的地方。 ### Debug #### First Attemption 為了追蹤`noffH.code.virtualAddr`,先從`./code/threads/main.cc`中把`debugArg`的值改成`"a"`(代表開啟dbgAddr flag),這樣就可以在error output中看到debug message。另外也發現debug flag都放在`./code/lib/debug.h`之中,而檔案最上方的註解也寫明鼓勵加入自己的flag。 #### Futhermore 透過[此文章](https://puremonkey2010.blogspot.com/2013/03/nachos-40-debugging-nachos.html)得知在指令中添加-d參數即可指定debug flag,這個方法明顯比我一開始嘗試的方法好很多。 例如: ``` ./nachos -d a -e ../test/test2 -e ../test/test1 ``` ### 雜談 尋找檔案名時,利用`find`指令(regular expression)。 例如: ``` user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ find . -name "kernel.*" ./userprog/kernel.o ./threads/kernel.cc ./threads/kernel.o ./threads/kernel.h ./filesys/kernel.o ./network/kernel.o ``` 尋找method在哪裡被用過,或尋找變數或function在哪裡被定義,可以利用`grep`指令。 例如: ``` user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ grep -r "Interrupt" Binary file userprog/mipssim.o matches Binary file userprog/kernel.o matches userprog/synchconsole.cc:// Interrupt handler called when keystroke is hit; wake up userprog/synchconsole.cc:// Interrupt handler called when it's safe to send the next Binary file userprog/elevator.o matches Binary file userprog/disk.o matches userprog/exception.cc:// Interrupts (which can also cause control to transfer from user Binary file userprog/machine.o matches Binary file userprog/timer.o matches Binary file userprog/userkernel.o matches Binary file userprog/console.o matches Binary file userprog/nachos matches Binary file userprog/alarm.o matches Binary file userprog/scheduler.o matches Binary file userprog/thread.o matches Binary file userprog/addrspace.o matches Binary file userprog/synch.o matches Binary file userprog/main.o matches Binary file userprog/interrupt.o matches Binary file userprog/exception.o matches machine/timer.cc: SetInterrupt(); ... ``` 可能會跑出一堆亂七八糟的東西,此時利用指令pipeline的技巧,試著過濾出想要的東西。 ``` user@user-VirtualBox:~/Documents/nachos/nachos-4.0/code$ grep -r "Interrupt" | grep "cc" userprog/synchconsole.cc:// Interrupt handler called when keystroke is hit; wake up userprog/synchconsole.cc:// Interrupt handler called when it's safe to send the next userprog/exception.cc:// Interrupts (which can also cause control to transfer from user machine/timer.cc: SetInterrupt(); machine/timer.cc: SetInterrupt(); // do last, to let software interrupt handler machine/timer.cc:// Timer::SetInterrupt machine/timer.cc:Timer::SetInterrupt() machine/interrupt.cc:// PendingInterrupt::PendingInterrupt machine/interrupt.cc:PendingInterrupt::PendingInterrupt(CallBackObj *callOnInt, machine/interrupt.cc: callOnInterrupt = callOnInt; machine/interrupt.cc:PendingCompare (PendingInterrupt *x, PendingInterrupt *y) machine/interrupt.cc:// Interrupt::Interrupt machine/interrupt.cc:// Interrupts start disabled, with no interrupts pending, etc. machine/interrupt.cc:Interrupt::Interrupt() machine/interrupt.cc: pending = new SortedList<PendingInterrupt *>(PendingCompare); machine/interrupt.cc:// Interrupt::~Interrupt machine/interrupt.cc:Interrupt::~Interrupt() machine/interrupt.cc:// Interrupt::ChangeLevel machine/interrupt.cc:Interrupt::ChangeLevel(IntStatus old, IntStatus now) machine/interrupt.cc:// Interrupt::SetLevel machine/interrupt.cc:Interrupt::SetLevel(IntStatus now) ... ``` ## Reference [Nachos Beginner's Guide](https://www.ida.liu.se/~TDDI04/material/begguide/) [滑鐵盧大學 ppt](https://student.cs.uwaterloo.ca/~cs350/common/NachosTutorialF06.pdf) [偉大的學長姐](http://blog.terrynini.tw/tw/OS-NachOS-HW1/) [也是偉大的學長姐](https://morris821028.github.io/2014/05/24/lesson/hw-nachos4/) [nachos debug教學](https://puremonkey2010.blogspot.com/2013/03/nachos-40-debugging-nachos.html) [nachos document](https://web.ics.purdue.edu/~cs354/Nachos/index.html) [助教的作業指示](https://hackmd.io/@r83V7BYGQYqK6_RFjL0vbA/B1Et9W_8w#/)