# 2019q1 Homework4 (riscv)
contributed by < `0xff07` >
## VirtFS & virtio
> On the guest, running `$ dmesg | grep 9pnet` shows the string 9P2000. How is this related to the VirtFS mentioned above? Explain how it works and design an experiment.
> virtio appears many times in the riscv-emu source code. What does this mechanism do for the host and the guest? After reading "Virtio: An I/O virtualization framework for Linux" and comparing it with the source code, what did you find?
### 9P Protocol
Clues about 9P can be found in the [reference document](https://www.kernel.org/doc/Documentation/filesystems/9p.txt) given in homework 2. Quoting from it:
> v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
This tells us it is a mechanism for remote access to the Plan 9 file system. The paper it cites, [Grave Robbers from Outer Space: Using 9P2000 Under Linux](http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html), explains some of the design motivation behind Plan 9:
> In Plan 9, all system resources and interfaces are represented as files. UNIX pioneered the concept of treating devices as files, providing a simple, clear interface to system hardware. In the 8th edition, this methodology was taken further through the introduction of the /proc synthetic file system to manage user processes.
> ...
> Plan 9 took the file system metaphor further, using file operations as the simple, well-defined interface to all system and application services. The intuition behind the design was based on the assumption that any programmer knows how to interact with files.
So one of Plan 9's design philosophies is to push UNIX's "everything is a file" further into other layers of the operating system, similar to how one can interact with the kernel by reading and writing certain files under `/proc`, or control devices through `sysfs` (for example, [controlling the GPIO on a Raspberry Pi](https://coldnew.github.io/f7349436/)). As for 9P:
> 9P represents the abstract interface used to access resources under Plan 9. It is somewhat analogous to the VFS layer in Linux. In Plan 9, the same protocol operations are used to access both local and remote resources, making the transition from local resources to cluster resources to grid resources completely transparent from an implementation standpoint.
9P is the name of this file system abstraction layer. It is similar to the VFS layer in Linux: as long as a file system provides implementations of the abstractions such as `inode` and `superblock`, the same mechanism can be used to access different file systems. 9P is also sometimes referred to as the 9P protocol.
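As a reminder of what the VFS analogy means on the Linux side: a file system plugs into the VFS by registering a `file_system_type` and supplying implementations of the abstract objects (superblock, inode and dentry operations). The sketch below is a generic illustration of that registration with made-up `myfs_*` names; it is not code taken from v9fs, which does the same kind of registration under `fs/9p/`:

```c
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/module.h>

/* Generic sketch of how a file system registers with the Linux VFS.
 * All myfs_* names are hypothetical and only illustrate the interface. */
static struct dentry *myfs_mount(struct file_system_type *fs_type, int flags,
                                 const char *dev_name, void *data)
{
    /* A real implementation builds a superblock here, fills in s_op and the
     * inode/dentry operations, and returns the root dentry. */
    return ERR_PTR(-ENOSYS);
}

static struct file_system_type myfs_type = {
    .owner   = THIS_MODULE,
    .name    = "myfs",
    .mount   = myfs_mount,
    .kill_sb = kill_anon_super,
};

static int __init myfs_init(void)
{
    return register_filesystem(&myfs_type);
}

static void __exit myfs_exit(void)
{
    unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```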
9P2000 is a newer revision of 9P:
> In the fourth edition of Plan 9 released in 2002, 9P was redesigned to address a number of shortcomings and retitled 9P2000
There are also extended variants such as `9P2000.U` and `9P2000.L`.
The 9P protocol is used in a number of places, and VirtFS is one of them. The v9fs document mentioned above lists:
> Other applications are described in the following papers:
> * XCPU & Clustering
> http://xcpu.org/papers/xcpu-talk.pdf
> * KVMFS: control file system for KVM
> http://xcpu.org/papers/kvmfs.pdf
> * CellFS: A New Programming Model for the Cell BE
> http://xcpu.org/papers/cellfs-talk.pdf
> * PROSE I/O: Using 9p to enable Application Partitions
> http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> * VirtFS: A Virtualization Aware File System pass-through
> http://goo.gl/3WPDg
### VirtFS
See [VirtFS: A Virtualization Aware File System pass-through](https://landley.net/kdocs/ols/2010/ols2010-pages-109-120.pdf) from the list above. The problem this paper addresses is sharing file systems between the guest OS and the host OS. If the virtualization mechanism fully emulates every device, a large price is paid on device access. For example:
1. Synchronization:
> One of the principle problems with virtualized storage in the form of virtual disks is that data on the disk cannot be concurrently accessed by multiple guests (or indeed even by the host and the guest) unless the disk is read-only. This is because of the large amount of in-memory state maintained by traditional disk file systems along with the aggressive nature of the Linux dcache and page cache.
>
2. Duplicated access control:
> We also encounter problems with two management domains for the purposes of user ids, group ids, ACLs, and so forth.
3. The same data being cached twice:
> The distributed file systems also incur the double-cache behavior of the virtual disks.
4. Semantic differences among file system operations on the guest OS:
> The other problem is that many distributed file systems impose their own file system semantics on operations,
The most direct idea might be to let the guest OS access the host OS file system directly, but that clearly goes against the spirit of virtualization. The next best question to ask is: can we open a dedicated channel between guest and host that skips the layered path the guest OS would otherwise take to reach a physical device, without throwing away the guest's existing file system stack? This is the motivation behind VirtFS.
The VirtFS architecture looks roughly like this:

As the paper describes:
> The QEMU server elects to export a portion of its file system hierarchy, and the client on the guest mounts this using 9P2000.L protocol. Guest users see the mount point just like any of the local file systems, while the reads and writes are actually happening on the host file system.
In other words, to the guest OS, v9fs looks just like any other mount point. When the guest performs VFS operations on what it sees as v9fs, those operations are actually forwarded to the 9P protocol interface (the VirtFS client), which talks over the 9P protocol to the 9P interface in QEMU (the VirtFS server), which in turn translates them into VFS accesses on the host OS.
For example, in Linux kernel 5.0, [fs/9p/vfs_inode.c](https://elixir.bootlin.com/linux/v5.0/source/fs/9p/vfs_inode.c#L1450) defines the inode operations that v9fs plugs into the VFS:
```c=
static const struct inode_operations v9fs_dir_inode_operations = {
.create = v9fs_vfs_create,
.lookup = v9fs_vfs_lookup,
.atomic_open = v9fs_vfs_atomic_open,
.unlink = v9fs_vfs_unlink,
.mkdir = v9fs_vfs_mkdir,
.rmdir = v9fs_vfs_rmdir,
.mknod = v9fs_vfs_mknod,
.rename = v9fs_vfs_rename,
.getattr = v9fs_vfs_getattr,
.setattr = v9fs_vfs_setattr,
};
```
In other words, when the guest OS performs file operations on the v9fs file system, it is really just talking to the VirtFS server, wrapped in a VFS interface.
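For a sense of what "talking to the VirtFS server" looks like on the wire: every 9P message begins with a small common header (a 4-byte little-endian size, a 1-byte type, and a 2-byte tag), where T-type messages are requests from the client (the guest) and R-type messages are the server's replies. The struct below is only an illustration of that framing based on the 9P documentation, not code from the kernel or riscv-emu:

```c
#include <stdint.h>

/* Common prefix of every 9P message: size[4] type[1] tag[2], little-endian.
 * T-messages (Tversion, Twalk, Topen, Tread, ...) are client requests;
 * the matching R-messages are the server's replies, paired up by tag. */
struct p9_header {
    uint32_t size;  /* total message length, including this header */
    uint8_t  type;  /* message type, e.g. Tread or Rread */
    uint16_t tag;   /* identifies which outstanding request a reply answers */
} __attribute__((packed));
```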
However, the channel between this server and client needs an efficient transport; otherwise the double-cache problem described earlier could simply reappear in the implementation of this middle layer. One such transport is virtio.
### virtio
Regarding virtio, the paper says:
> VirtIO is a paravirtual IO bus based on a hypervisor neutral DMA API. With KVM on x86, which is the dominant platform targetted by this work, the underlying transport layer is implemented in terms of a PCI device.
>
> The VirtIO PCI implementation makes extensive use of shared memory. This includes the use of lockless ring queues to establish a message passing interface and indirect reference to scatter/gather buffers to enable zero-copy bulk data transfer.
> These properties of the VirtIO PCI transport allow VirtFS to be implemented in such a way that guest driven I/O operations can be zero-copy.
That is, the key to the virtio mechanism is using lockless ring queues to pass data between the two sides. For the details, see [Virtio: An I/O virtualization framework for Linux](https://www.ibm.com/developerworks/library/l-virtio/index.html).
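To make the "lockless ring queue" concrete: a virtqueue consists of a descriptor table in guest memory whose entries point at the guest's scatter/gather buffers, an "available" ring written only by the guest, and a "used" ring written only by the host. Because the host reads the guest's buffers directly through these descriptors, bulk data never has to be copied. The layout below follows the virtio specification and is shown purely as an illustration; it is not copied from the riscv-emu sources:

```c
#include <stdint.h>

/* Virtqueue descriptor: the guest fills in guest-physical addresses of its
 * buffers and the host accesses them directly, enabling zero-copy I/O. */
struct virtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint16_t flags;  /* NEXT = chained, WRITE = host writes into this buffer */
    uint16_t next;   /* index of the next descriptor in the chain */
};

/* "available" ring: descriptor chains the guest has posted (guest writes it) */
struct virtq_avail {
    uint16_t flags;
    uint16_t idx;    /* free-running index of the next slot the guest will fill */
    uint16_t ring[]; /* queue-size entries of descriptor-chain head indices */
};

/* "used" ring: chains the host has finished with (host writes it) */
struct virtq_used_elem {
    uint32_t id;     /* head descriptor index of the completed chain */
    uint32_t len;    /* number of bytes the host wrote */
};

struct virtq_used {
    uint16_t flags;
    uint16_t idx;
    struct virtq_used_elem ring[];
};
```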
### Experiment
If VFS operations are really client/server exchanges in VirtFS, we should be able to observe this when the guest issues system calls such as `open`. That requires compiling a test program, but diskimage-linux-riscv ships no toolchain, so we first have to produce one. Fortunately a buildroot configuration is provided, so the toolchain can be generated with buildroot.
However, while running make, the following error appeared while compiling `host-m4 1.4.17`:
```gcc
freadahead.c: In function 'freadahead':
freadahead.c:91:3: error: #error "Please port gnulib freadahead.c to your platform! Look at the definition of fflush, fread, ungetc on your system, then report this to bug-gnulib."
#error "Please port gnulib freadahead.c to your platform! Look at the definition of fflush, fread, ungetc on your system, then report this to bug-gnulib."
^~~~~
/usr/bin/gcc -I. -I/home/f/Workspace/riscv-emu/test/output/host/usr/include -O2 -I/home/f/Workspace/riscv-emu/test/output/host/usr/include -c -o fseek.o fseek.c
/usr/bin/gcc -I. -I/home/f/Workspace/riscv-emu/test/output/host/usr/include -O2 -I/home/f/Workspace/riscv-emu/test/output/host/usr/include -c -o fseeko.o fseeko.c
make[4]: *** [Makefile:1842: freadahead.o] Error 1
make[4]: *** Waiting for unfinished jobs....
fseeko.c: In function 'rpl_fseeko':
fseeko.c:109:4: error: #error "Please port gnulib fseeko.c to your platform! Look at the code in fseeko.c, then report this to bug-gnulib."
#error "Please port gnulib fseeko.c to your platform! Look at the code in fseeko.c, then report this to bug-gnulib."
^~~~~
```
It [turns out](https://github.com/buildroot/buildroot/commit/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3#diff-38fed1a307e1301cace43c735cb2038f) this seems to be caused by Ubuntu 18.10 shipping glibc 2.28. The fix is to replace the contents of the `package/m4` directory in the downloaded tree with the files from that [commit](https://github.com/buildroot/buildroot/tree/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3/package/m4) (including the two .patch files), after which the build can continue:
```shell=
$ cd package/m4
$ mv m4.mk m4.mk.old
$ mv m4.hash m4.hash.old
$ wget "https://raw.githubusercontent.com/buildroot/buildroot/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3/package/m4/0001-fflush-adjust-to-glibc-2.28-libio.h-removal.patch"
$ wget "https://raw.githubusercontent.com/buildroot/buildroot/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3/package/m4/0002-fflush-be-more-paranoid-about-libio.h-change.patch"
$ wget "https://raw.githubusercontent.com/buildroot/buildroot/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3/package/m4/m4.hash"
$ wget "https://raw.githubusercontent.com/buildroot/buildroot/c48f8a64626c60bd1b46804b7cf1a699ff53cdf3/package/m4/m4.mk"
```
After that, the build completes successfully.
Next, I wanted to build a version with debug symbols, so I turned on every debug-related option I could find:
1. Build options:
   * build packages with debugging symbols: debug level set to 3
   * strip command for binaries on target: set to (none)
2. Toolchain:
   * Build cross gdb for the host: enabled
   * Python support: enabled, along with everything else under this option (I do not actually know what each of them does; I just wanted to try)
3. Target packages:
   * Debugging, profiling and benchmark
     * gdb
       * gdbserver
       * full debugger
   * Development tools
     * git: enabled just for fun, to see whether it could be installed
The result was:
```gcc
Error: target not supported by gdbserver.
make: *** [package/pkg-generic.mk:188: /home/f/Workspace/riscv-emu/buildroot-debug/output/build/gdb-7.10.1/.stamp_configured] Error 1
```
Somewhat awkward. I went back, turned off the gdbserver option and selected full debugger instead, but the build still fails at gdb 7.10.1:
```gcc
*** BFD does not support target riscv64-buildroot-linux-gnu.
*** Look in bfd/config.bfd for supported targets.
make[2]: *** [Makefile:2661: configure-bfd] Error 1
```
This happens whether building the gdb in target packages or the cross-platform gdb, so for now I gave up on tracing the file operations with gdb and simply built a basic image first.
## cross compilation
> With `$ temu root-riscv64.cfg`, we get a RISC-V/Linux emulated environment in which we can run gcc and produce executables; later we instead run riscv64-buildroot-linux-gnu-gcc. What is the difference between the two? (Hint: cross-compiler; review 你所不知道的 C 語言: 編譯器和最佳化原理篇)
>
For the gcc inside the emulator, both host and target are riscv64; for riscv64-buildroot-linux-gnu-gcc, the host is x86-64 while the target is riscv64.
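One way to see the difference, assuming a file named `hello.c` with the content below: compile it once with the gcc inside the emulated guest and once with riscv64-buildroot-linux-gnu-gcc on the host. Both binaries are riscv64 executables that only run inside the emulator, and both report a RISC-V target through the compiler's predefined macros; what differs is the machine the compiler itself runs on (the guest in the first case, the x86-64 host in the second).

```c
#include <stdio.h>

int main(void)
{
    /* Predefined compiler macros reveal the architecture this binary targets. */
#if defined(__riscv)
    printf("built for riscv, XLEN = %d\n", __riscv_xlen);
#elif defined(__x86_64__)
    printf("built for x86-64\n");
#else
    printf("built for some other architecture\n");
#endif
    return 0;
}
```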
## root file system & initramfs
> Why does the root file system exist from the Linux kernel's point of view, and what considerations lie behind the existence of initramfs?
The following is based on the Gentoo Linux [Initramfs Guide](https://wiki.gentoo.org/wiki/Initramfs/Guide).
After boot control has been handed over to the kernel, it goes on to run `init` (the process with PID 1 that we learned about in the systems programming course), and `init` performs further setup, including mounting file systems.
However, there is a subtle problem. For example, `/usr` may live on a separate partition with a different file system, so it looks like those system programs cannot be used before the file systems are mounted; or the entire root file system may be encrypted, so booting cannot proceed any further until it is decrypted. Therefore, between the point where the kernel takes over and the point where `init` starts, some preparation is needed to handle these situations. `initramfs` was designed to solve exactly this:
> An *initramfs* is an initial ram file system based on *tmpfs* (a size-flexible, in-memory lightweight file system), which also did not use a separate block device (so no caching was done and all overhead mentioned earlier disappears). Just like the *initrd*, it contains the tools and scripts needed to mount the file systems before the init binary on the real root file system is called. These tools can be decryption abstraction layers (for encrypted file systems), logical volume managers, software raid, bluetooth driver based file system loaders, etc.
>
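To make this concrete, an initramfs is essentially a cpio archive whose `/init` performs this early preparation before handing control to the real system. Below is a minimal, hypothetical `/init` written in C; the device `/dev/vda`, the file system type and the mount point `/newroot` are placeholders, and a real init would also handle LVM, RAID or decryption as the quote describes, then switch_root into the mounted root:

```c
#include <stdio.h>
#include <sys/mount.h>
#include <unistd.h>

/* Minimal sketch of an initramfs /init: mount pseudo file systems, mount the
 * real root, then hand over.  Error handling and switch_root are elided. */
int main(void)
{
    mount("proc", "/proc", "proc", 0, NULL);
    mount("sysfs", "/sys", "sysfs", 0, NULL);

    /* A real init would assemble RAID/LVM or decrypt the root device first. */
    if (mount("/dev/vda", "/newroot", "ext4", MS_RDONLY, NULL) < 0)
        perror("mount root");

    /* Normally: exec switch_root /newroot /sbin/init.  Fall back to a shell. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return 1;
}
```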
## softFP
> riscv-emu has a built-in floating-point emulator that uses the SoftFP library. Taking sqrt as an example, explain how sqrt_sf32, sqrt_sf64, and sqrt_sf128 work, and how they are hooked into the RISC-V CPU emulator.
>
A floating-point number is essentially scientific notation in binary. Suppose
$$
a = 1.m \times 2^{exp}
$$
then taking the square root is really just computing:
$$
\sqrt a = \sqrt{1.m} \times 2^{\frac {exp}{2}}
$$
So for a floating-point representation, we take the square root of the mantissa 1.m and divide the exponent by two, and we are done. There are two small details, though:
1. If the stored mantissa bits are $[m]_2$, the normalized value is really $[1.m]_2$.
2. If exp is odd, an extra factor of $\sqrt{2}$ shows up.
The odd-exponent case can be handled as follows: binary scientific notation requires $1 \leq 1.m < 2$, so $2 \leq 2 \cdot (1.m) < 4$, and taking square roots on both sides keeps $\sqrt{2} \leq \sqrt{2 \cdot 1.m} < 2$, which is still inside $[1, 2)$. So when the exponent is odd, we fold a factor of 2 into the mantissa:
$$
1.m \times 2^{exp} = (2 \cdot 1.m) \times 2^{exp - 1} := u \times 2^{exp - 1}
$$
From the discussion above we know:
$$
\begin{cases}
2 \leq u < 4 \\
(exp - 1) \text{ is even}
\end{cases}
$$
Therefore:
$$
\sqrt{u} \times 2^{(exp - 1)/2}
$$
is still a valid binary scientific-notation number. For convenience, assume from now on that the value has been rewritten into a form whose exponent is even, i.e.:
$$
a = 1.m \times 2^{exp} = u \times 2^{e}
$$
where $1 \leq u < 4$ and $e$ is even.
Suppose that after taking the square root, the representation becomes:
$$
\sqrt{a} = 1.m' \times 2^{exp'}
$$
Then:
$$
a = (1.m')^2 \times 2^{2exp'}
$$
Hence:
$$
(1.m')^2 = 1 + 2 \times 0.m' + (0.m')^2 = u
$$
Note the latter part:
$$
1 + \underbrace{2 \times 0.m' + (0.m')^2}_{m}
$$
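The derivation above maps fairly directly onto a soft-float routine: make the exponent even, take an integer square root of the fixed-point mantissa, and halve the exponent. The sketch below applies the idea to binary32 only; it assumes a positive, normalized input and ignores rounding modes, exceptions and subnormals, so it is emphatically not the SoftFP code (`sqrt_sf32`/`sqrt_sf64`/`sqrt_sf128` additionally implement IEEE 754 rounding and the special cases for each width):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy binary32 square root: assumes x is positive and normalized. */
static float toy_sqrtf(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));

    int32_t exp = (int32_t)((bits >> 23) & 0xff) - 127;  /* unbiased exponent */
    uint64_t mant = (bits & 0x7fffff) | (1u << 23);      /* 1.m as 1.23 fixed point */

    /* Odd exponent: fold one factor of 2 into the mantissa,
     * 1.m * 2^exp = (2 * 1.m) * 2^(exp-1), so the exponent becomes even. */
    if (exp & 1) {
        mant <<= 1;
        exp -= 1;
    }

    /* Integer square root, scaled so the result is again 1.23 fixed point:
     * r = floor(sqrt(mant << 23)), which always lands in [2^23, 2^24). */
    uint64_t val = mant << 23;
    uint64_t r = 0;
    for (int b = 23; b >= 0; b--) {
        uint64_t t = r | (1ull << b);
        if (t * t <= val)
            r = t;
    }

    exp /= 2;  /* exact, since exp is even here */
    uint32_t out = ((uint32_t)(exp + 127) << 23) | (uint32_t)(r & 0x7fffff);
    float y;
    memcpy(&y, &out, sizeof(y));
    return y;  /* truncated rather than correctly rounded */
}

int main(void)
{
    printf("%.7f vs %.7f\n", toy_sqrtf(2.0f), sqrtf(2.0f));
    return 0;
}
```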
## e2fsprog
## busybox
> What purpose does a tool like busybox serve? Explain with reference to its source code.
From `man busybox` we find the following description:
```
DESCRIPTION
BusyBox combines tiny versions of many common UNIX utilities into a
single small executable. It provides minimalist replacements for most
of the utilities you usually find in GNU coreutils, util-linux, etc.
The utilities in BusyBox generally have fewer options than their full-
featured GNU cousins; however, the options that are included provide
the expected functionality and behave very much like their GNU
counterparts.
```
In other words, it packs stripped-down versions of many common system utilities (such as `ls`, `mv`, and so on) into a single executable. After downloading the source code, running:
```shell
$ make defconfig && make doc
```
builds the documentation, which ends up in the `doc` directory. From the documentation we learn:
> BusyBox is a multi-call binary. A multi-call binary is an executable program that performs the same job as more than one utility program. That means there is just a single BusyBox binary, but that single binary acts like a large number of utilities. This allows BusyBox to be smaller since all the built-in utility programs (we call them applets) can share code for many common operations.
> ...
> For example, entering
>
> ln -s /bin/busybox ls
> ./ls
> will cause BusyBox to behave as 'ls' (if the 'ls' command has been compiled into BusyBox). Generally speaking, you should never need to make all these links yourself, as the BusyBox build system will do this for you when you run the 'make install' command.
That is, by creating symbolic links from the corresponding utility names to busybox, it looks as if the real utilities exist. In `riscv-emu`, running:
```shell
$ ls -l /bin
```
shows:
```c=
...
lrwxrwxrwx 1 root root 7 May 23 2017 grep -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 gunzip -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 gzip -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 hostname -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 hush -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 ionice -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 iostat -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 ip -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 ipaddr -> busybox
lrwxrwxrwx 1 root root 7 May 23 2017 ipcalc -> busybox
...
```
So the commands in there are really all symlinks pointing to busybox.
The main program is in `libbb/appletlib.c`. The mechanism works roughly as follows:
1. There is an array of function pointers, `applet_main`, defined in `/include/applet_tables.h`. Opening it, we see:
```c=
int (*const applet_main[])(int argc, char **argv) = {
test_main,
test_main,
acpid_main,
add_remove_shell_main,
addgroup_main,
adduser_main,
adjtimex_main,
uname_main,
arp_main,
arping_main,
ash_main,
awk_main,
base64_main,
basename_main,
bc_main,
...
ls_main,
...
};
```
Each utility's entry point is named `<name>_main`, with its implementation in its own .c file; for example, `cat_main` is implemented in `coreutils/cat.c`.
The applet names are recorded in another array, `applet_names[]`:
```c=
const char applet_names[] ALIGN1 = ""
"[" "\0"
"[[" "\0"
"acpid" "\0"
"add-shell" "\0"
"addgroup" "\0"
"adduser" "\0"
"adjtimex" "\0"
"arch" "\0"
"arp" "\0"
"arping" "\0"
"ash" "\0"
...
```
This is one extremely long string with entries separated by `\0`, so the lookup mechanism is a bit different from an ordinary array:
```c=
// In find_applet_by_name(), before linear search, narrow it down
// by looking at N "equidistant" names. With ~350 applets:
// KNOWN_APPNAME_OFFSETS cycles
// 0 9057
// 2 4604 + ~100 bytes of code
// 4 2407 + 4 bytes
// 8 1342 + 8 bytes
// 16 908 + 16 bytes
// 32 884 + 32 bytes
// With 8, int16_t applet_nameofs[] table has 7 elements.
int KNOWN_APPNAME_OFFSETS = 8;
// With 128 applets we do two linear searches, with 1..7 strcmp's in the first one
// and 1..16 strcmp's in the second. With 256 apps, second search does 1..32 strcmp's.
```
In short, the applet names in this huge string are sorted lexicographically. A set of anchor offsets is defined first; they are recorded in the same file:
```c=
const uint16_t applet_nameofs[] ALIGN2 = {
304,
653,
984,
1324,
1663,
2050,
2374,
};
```
Because the names are sorted, the lookup first compares the command against the strings at those offsets to find which interval it falls into, narrowing the search from the 2000-odd-character string down to a window of roughly 300 characters.
2. For example, suppose the scenario is:
```
$ ln -s <path_to_busybox>/busybox ls
$ ./ls
```
First, the part of `argv[0]` after the last slash (that is, the name of the utility being invoked) is extracted as `applet_name`, which is then passed to `run_applet_and_exit(applet_name, argv);`.
3. `run_applet_and_exit(applet_name, argv)` first checks whether `argv[0]` is busybox itself (the `busybox <something> ...` case). If not, it looks up the index of the corresponding utility in `applet_main` and calls `run_applet_no_and_exit`.
4. `run_applet_no_and_exit` performs some checks (depending on the compile-time configuration) and then calls `applet_main[applet_no](argc, argv)` (for `ls`, that is the function pointer `ls_main`). A simplified sketch of this multi-call dispatch follows the list.
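The following is a toy multi-call binary, not BusyBox's actual code: it dispatches on the name it was invoked under, and keeps its (tiny) applet table sorted by name, the same property that lets BusyBox narrow the range inside `applet_names` before doing a linear scan. The `hello` and `bye` applets are made up for illustration:

```c
#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Toy multi-call binary illustrating the BusyBox dispatch idea. */
static int hello_main(int argc, char **argv) { (void)argc; (void)argv; puts("hello"); return 0; }
static int bye_main(int argc, char **argv)   { (void)argc; (void)argv; puts("bye");   return 0; }

struct applet {
    const char *name;
    int (*main)(int argc, char **argv);
};

/* Kept sorted by name, so the search range could first be narrowed (e.g. by
 * binary search, or by sampling "equidistant" names as BusyBox does). */
static const struct applet applets[] = {
    { "bye",   bye_main   },
    { "hello", hello_main },
};

int main(int argc, char **argv)
{
    /* The applet name is whatever comes after the last slash of argv[0]. */
    const char *name = basename(argv[0]);
    for (size_t i = 0; i < sizeof(applets) / sizeof(applets[0]); i++)
        if (strcmp(name, applets[i].name) == 0)
            return applets[i].main(argc, argv);
    fprintf(stderr, "%s: applet not found\n", name);
    return 1;
}
```

If this were compiled as, say, `multicall`, then `ln -s multicall hello && ./hello` would print `hello`, in the same way that `ln -s busybox ls && ./ls` makes BusyBox behave as `ls`.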