2024q1 Homework6 (integration)

# 2024q1 Homework6 (integration) contributed by < `david965154` > ## 實體電腦運作 GNU/Linux 在執行`$ sudo sh performance.sh` 時遇到以下問題 ``` performance.sh: 3: cannot create /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor: Directory nonexistent ``` 搜尋了許久 1. 參考了之前修課的同學的[筆記](https://hackmd.io/@matches4453/B1rprvoBI#%E8%A8%AD%E5%AE%9A-scaling_governor-%E7%82%BA-performance)，裡面說 > 更改 shell script 可以解決路徑問題但我發現他修改後的與本次作業的完全相同，因此我再去找了當時的的作業，完全看不出他修改了什麼，因此只好往 google 搜尋。 2. 進入 GRUB 更改電源管理、[CPUfreq Setup](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/power_management_guide/cpufreq_setup)、[Can't use "userspace" cpufreq governor and set cpu frequency](https://unix.stackexchange.com/questions/153693/cant-use-userspace-cpufreq-governor-and-set-cpu-frequency)、[CPU frequency scaling](https://wiki.archlinux.org/title/CPU_frequency_scaling)、[Kernel module](https://wiki.archlinux.org/title/Kernel_module)，以上有 Arch Linux、 Red hat、 Ubuntu 不同的討論及關官方文件，但都沒辦法解決問題。所以先將此問題擱置，以下為實驗環境，使用 KVM 。 ```shell $ uname -a Linux venv 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) E-2144G CPU @ 3.60GHz CPU family: 6 Model: 158 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 Stepping: 10 BogoMIPS: 7200.00 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc a cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall n x rdtscp lm constant_tsc rep_good nopl xtopology nonsto p_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx r drand hypervisor lahf_lm abm 3dnowprefetch invpcid_sing le pti tpr_shadow flexpriority ept vpid fsgsbase bmi1 a vx2 bmi2 invpcid rdseed clflushopt vnmi md_clear flush_ l1d Virtualization features: Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 128 KiB (4 instances) L1i: 128 KiB (4 instances) L2: 1 MiB (4 instances) L3: 32 MiB (4 instances) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-3 Vulnerabilities: Gather data sampling: Unknown: Dependent on hypervisor status Itlb multihit: KVM: Mitigation: VMX disabled L1tf: Mitigation; PTE Inversion; VMX conditional cache flushe s, SMT disabled Mds: Mitigation; Clear CPU buffers; SMT Host state unknown Meltdown: Mitigation; PTI Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown Retbleed: Vulnerable Spec rstack overflow: Not affected Spec store bypass: Vulnerable Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PB RSB-eIBRS Not affected Srbds: Unknown: Dependent on hypervisor status Tsx async abort: Not affected ``` 其中一核，可以看到其頻率僅在基本頻率 3600 MHz (其他三核相同) ```shell $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 158 model name : Intel(R) Xeon(R) E-2144G CPU @ 3.60GHz stepping : 10 cpu MHz : 3600.004 cache size : 8192 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 22 wp : yes ... ``` 以下可以看到都是 Not Available ，我選擇的驅動為 `speedstep_lib` ， [CPU frequency scaling](https://wiki.archlinux.org/title/CPU_frequency_scaling) 裡面所說的 `Xeon` 處理器需選擇 `p4_clockmod` 反而會出現 no device 的錯誤。 ```shell $ cpupower frequency-info analyzing CPU 0: no or unknown cpufreq driver is active on this CPU CPUs which run at the same hardware frequency: Not Available CPUs which need to have their frequency coordinated by software: Not Available maximum transition latency: Cannot determine or is not supported. Not Available available cpufreq governors: Not Available Unable to determine current policy current CPU frequency: Unable to call hardware current CPU frequency: Unable to call to kernel boost state support: Supported: no Active: no ``` ## [Linux 核心模組運作原理](https://hackmd.io/@sysprog/linux-kernel-module) #### `hello world` 在將 hello.c 掛載至核心的時候，突然有了為什麼要掛載的問題，是因為那些 include 的 header file 需在核心層級內才能使用嗎? 在透過 `insmod` `rmmod` 掛載及卸載時呼叫以下兩函式 ``` module_init(hello_init); module_exit(hello_exit); ``` 達成以下 ``` [15731.636304] Hello, world [16805.487312] Goodbye, cruel world [17376.662707] Hello, world [17392.460133] Goodbye, cruel world ``` 不過可以看到其實執行了兩次，因為我掛載卸載各兩次，而 dmesg 會紀綠核心相關的訊息 (包含開機時 kernel 被載入記憶體，然後module/driver 開始驅動硬體產生的訊息) ，因此在掛卸載時皆會被記錄。 #### `fibdrv` `make check` 並掛載 `fibdrv.ko` 後，執行 ```shell $ ls -l /dev/fibonacci crw------- 1 root root 240, 0 四 27 00:00 /dev/fibonacci $ cat /sys/class/fibonacci/fibonacci/dev 240:0 ``` 對照 [fibdev.c](https://github.com/sysprog21/fibdrv/blob/master/fibdrv.c#L108)，找尋彼此的關聯。透過 `dmesg` 可以看到以下訊息 ``` 18473.690275] do_init_module: 'fibdrv'->init suspiciously returned 240, it should follow 0/-E convention do_init_module: loading module anyway... [18473.690279] CPU: 0 PID: 5631 Comm: insmod Tainted: G OE 6.5.0-28-generic #29~22.04.1-Ubuntu ``` 訊息顯示 `do_init_module` 在 `'fibdrv'->init` 回傳了 `240` ，而這個回傳值來自 register_chrdev() 函式**動態分配**的主要設備號。在掛載驅動程式時需向 kernel 註冊並分配給其一個設備號並註冊字元設備。為什麼需要呢 ? 因為 user application 便可以透過此符合此 major number 的 device file 與硬體溝通參考 [Linux 驅動程式觀念解析, #5: 依流程來實作 -- Virtual Device Driver](https://www.jollen.org/blog/2006/05/linux_5_virtual_device_driver.html) 而這裡會需要註冊字元設備則是因為可以為該設備定義所需的系統調用操作，例如 open、read、write 和 close (可在以下程式碼中看到這些功能)，而前面的 hello world 則不需要達成這些功能因此無須註冊字元設備。 ```c const struct file_operations fib_fops = { .owner = THIS_MODULE, .read = fib_read, .write = fib_write, .open = fib_open, .release = fib_release, .llseek = fib_device_lseek, }; ``` #### Linux 核心模組掛載機制 ##### `insmod` 命令背後，對應 Linux 核心內部有什麼操作 ```c static const char __PASTE(__PASTE(__UNIQUE_ID_, author), __COUNTER__)[] \ __used __attribute__((section(".modinfo"), unused, aligned(1))) \ = __stringify(author) "=" "National Cheng Kung University, Taiwan" ``` 這裡的用法，只要把 string literal 並排就等同於一個合併起來的字串。執行過程 `SYSCALL_DEFINE3` 讀取一個檔案 `.ko` 得到其 file descriptor 為。並傳入 `finit_module` > `idempotent_init_module` > `init_module_from_file` > `load_module` > `do_init_module` (先透過 `layout_and_allocate` 為載入的 module 進行記憶體配置) > `do_one_initcall` > `do_trace_initcall_start` `ret = fn();` `do_trace_initcall_finish` `fn` 亦即傳入的 `mod->init`，核心模組的 `init_function` 在 `ret = fn();` 被執行，為核心模組進行真正的初始化的工作。 ##### 為何 `module_init` 所設定的函式得以執行 `__attribute__((alias( ..)))` 的用法 > The alias attribute causes the declaration to be emitted as an alias for another symbol, which must have been previously declared with the same type, and for variables, also the same size and alignment. Declaring an alias with a different type than the target is undefined and may be diagnosed. 就是給予一個 declaration 別名 (必需要 same type, and for variables, also the same size and alignment，否則為 undefined ) 。為什麼? ```c /* These are either module local, or the kernel's dummy ones. */ extern int init_module(void); ``` `extern` 為 `init_module` 可以使用，但是不在這個地方實作。其實就是 `init_fib_dev`，因為使用 `alias` 將 `init_module` 取了一個別名叫作 `init_fib_dev`。 :::warning 問題: 根據教材 > `module_init` 根據 `MODULE` 是否有被定義，`MODULE` 是在此核心模組被編譯時期所定義，若此模組編譯時已內建於核心則不會被定義。我自己認為前面練習是先編譯後才透過 `insmod` 掛載入核心，因此 `MODULE` 應被定義，但是底下卻寫到 > 從前置處理過後的結果可知，這裡選用沒定義 `MODULE` 的 `module_init(x)` ，也就是之後會繼續展開 `__initcall(x)`。 ::: 若是 MODULE 已定義的情況下 ```c #else /* MODULE */ ... /* Each module must use one module_init(). */ #define module_init(initfn) \ static inline initcall_t __maybe_unused __inittest(void) \ { return initfn; } \ int init_module(void) __copy(initfn) __attribute__((alias(#initfn))); ``` `init_module` (即為 init_fib_dev ) 就可以理解 :::warning 問題: ::: 若是未定義 `MODULE` 的情況下 ```c #ifndef MODULE #define module_init(x) __initcall(x); #define __initcall(fn) device_initcall(fn) #define device_initcall(fn) __define_initcall(fn, 6) ``` 教材直接說: > `device_initcall` 展開成 `__define_initcall(fn, 6)`，最後會變成我們預處理看到的結果。最後展開的巨集 ```c static inline __attribute__((unused)) __attribute__((no_instrument_function)) initcall_t __attribute__((unused)) __inittest(void) { return init_fib_dev; } int init_module(void) __attribute__((alias("init_fib_dev")));; ``` > 回傳的 return init_fib_dev，他的資料型態必須要和 initcall_t 相同，否則編譯器會報錯。 $\rightarrow$ 可在編譯時期就知道傳入的 function pointer 是不是合法的。 > 系統呼叫 init_module 讓我們把一個 ELF image 載入到 kernel space 但其實我看不出 `__define_initcall(fn, 6)` 跟以上擴展巨集有什麼關係，中間感覺少了幾項步驟，因此再往下找自 [linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h#L282) ```c #define ____define_initcall(fn, __unused, __name, __sec) \ static initcall_t __name __used \ __attribute__((__section__(__sec))) = fn; #define __unique_initcall(fn, id, __sec, __iid) \ ____define_initcall(fn, \ __initcall_stub(fn, __iid, id), \ __initcall_name(initcall, __iid, id), \ __initcall_section(__sec, __iid)) #define ___define_initcall(fn, id, __sec) \ __unique_initcall(fn, id, __sec, __initcall_id(fn)) #define __define_initcall(fn, id) ___define_initcall(fn, id, .initcall##id) ``` >`__section__(__sec)` 用於將函數放置在指定儲存記憶體位置。而 `fn` 將 function pointer 賦予給前面的變數，並告訴 linker 其將該變數放在指定記憶體位置中，以便在掛載時能找到該函式，即 `init_fib_dev` 。也仍找不到能擴展成教材所述的模樣，也找不到能出現 `init_module` 的地方。我自己認為應該是透過以下 ```c static initcall_t __name __used \ __attribute__((__section__(__sec))) = fn; ``` 達成 `fn` = `mod->init` ，而非前面透過 `alias` 將 `init_module` 取了一個別名叫作 `init_fib_dev` 使 `init_fib_dev` 可以執行。 ## [The Linux Kernel Module Programming Guide](https://sysprog21.github.io/lkmpg/) 第 `12` 節 `Mutex` > If processes running on different CPUs or in different threads try to access the same memory, then it is possible that strange things can happen or your system can lock up. To avoid this, various types of mutual exclusion kernel functions are available. These indicate if a section of code is "locked" or "unlocked" so that simultaneous attempts to run it can not happen. ```c=16 // 嘗試鎖住 mymutex ret = mutex_trylock(&mymutex); // 測試回傳值 ret != 0 代表其有做上鎖的動作 if (ret != 0) { pr_info("mutex is locked\n"); // 驗證是否真的有鎖上 if (mutex_is_locked(&mymutex) == 0) pr_info("The mutex failed to lock!\n"); // 再嘗試解鎖 mutex_unlock(&mymutex); pr_info("mutex is unlocked\n"); } else // 上鎖失敗 pr_info("Failed to lock\n"); ``` 第 `20` 行或許不太適當，其先輸出 `mutex is locked` 的資訊才去做驗證( `if (mutex_is_locked(&mymutex) == 0)` )，應該要先驗證才輸出 `mutex is locked` 。 [simrupt/simrupt.c](https://github.com/sysprog21/simrupt/blob/main/simrupt.c#L143) 函式 `simrupt_work_func` ```c while (1) { /* Consume data from the circular buffer */ mutex_lock(&consumer_lock); val = fast_buf_get(); mutex_unlock(&consumer_lock); if (val < 0) break; /* Store data to the kfifo buffer */ mutex_lock(&producer_lock); produce_data(val); mutex_unlock(&producer_lock); } ``` 同註釋，前半部分為 Consume ，後半為 Store ，為防止不同的 CPU 或者線程嘗試存取同一個記憶體位置 (可能會有 unconsistency 的問題)，這裡使用了 mutex lock 的方法，在要消耗或產生資料時皆透過專屬的 lock 去作鎖及解鎖的動作。函式 `fast_buf_get` 這裡消耗資料的程式碼也很重要， ```c unsigned long head = READ_ONCE(ring->head), tail = ring->tail; int ret; /* return if empty */ if (unlikely(!CIRC_CNT(head, tail, PAGE_SIZE))) return -ENOENT; /* read index before reading contents at that index */ smp_rmb(); /* extract item from the buffer */ ret = ring->buf[tail]; /* finish reading descriptor before incrementing tail */ smp_mb(); /* increment the tail pointer */ ring->tail = (tail + 1) & (PAGE_SIZE - 1); ``` `READ_ONCE` 透過 `volatile` 讓編譯器不要對此進行優化，每次都自記憶體重新讀取而非 register ，以確保其能讀取到正確的值。 **mb (memory barrier)** 再來是 `smp_rmb` 及 `smp_mb` ，用途為確保編譯後是按照順序執行，在 cache miss 發生時，其指令會被放回 queue 中直到操作對象可用，使 CPU 處理其他指令，這樣可能會找至寫入順序不一樣導致錯誤的問題，因此 `smp_rmb` 及 `smp_mb` 會在記憶體添加屏障從而避免編譯器最佳化或上述等原因而導致亂序，其中 `smp_rmb` 為保證讀操作按照順序， `smp_mb` 為保證讀寫操作按照順序，另外 `smp_wmb` 為保證寫操作按照順序，且在使用時需要讓一個 `smp_rmb` 對應到一個 `smp_wmb` (當然也可以是 `smp_mb` )，反之亦然，因為這樣可以保證先讀取 (寫入) 後才會被後寫入 (讀取) ，每一個 memory barrier 都代表其之前的對應操作需先完成。因此這裡在先保證了讀到正確的 tail 的值，才能寫入正確的 `ret` 。 > Memory Barrier 常用場合包括：實現同步原語（synchronization primitives）實現無鎖資料結構（lock-free data structures）驅動程式參考 [理解Memory Barrier（記憶體屏障）](https://topic.alibabacloud.com/tc/a/understanding-memories-barrier-memory-barrier_8_8_10252214.html) ### 探討能否改寫為 lock-free #### [並行程式設計: Lock-Free Programming](https://hackmd.io/@sysprog/concurrency-lockfree) Lock-Free 和 Lock-Less 的分野 - lock-free: 強調以系統的觀點來看，只要執行足夠長的時間，至少會有一個執行緒會有進展 (progress)，並非完全不用 mutex lock，而是指整個程式的執行不會被其單個執行緒 lock 鎖住。 - lockless: 避免使用 lock 達到安全無誤地操作共用資料可以看到 lockless 可能會造成無止境地等待，而 lock-free 可能會等待很久，但至少會有進展。以下就是 lockless 的例子，假設 `old = *p;` 成功，但在執行函式 `CAS` 的時候，若有其他處理動到 `*p` ，那麼 `if (*a == old_val)` 將不會成立並重新執行，直到不受其他處理干擾。 ```c int CAS(int *a, int old_val, int new_val) { if (*a == old_val) { *a = new_val; return old_val; } return *a; } int old, new; do { old = *p; new = NEW_VALUE; } while(CAS(*p, old, new) != new); ``` livelock: 兩邊的進展反而造成整個程序的停滯，以下為例，若第一及第二個執行緒都完成了 `while (x == 0)` ，那麼兩個都會執行 `x = 1 - x` 的動作而造成 `x` 變為 1 又變為 0，而造成停滯不前。 ```c while (x == 0) { x = 1 - x; } ``` wait-free (保證所有的指令可以 bound 在某個時間內完成) > lock-free (保證整體程式的其他部份會有進展) > obstruction-free (當有其他執行緒在動的時候，可能會 livelock) Win32 支援的 CAS: `_InterlockedCompareExchange` ```c void LockFreeQueue::push(Node* newHead) { for (;;) { // Copy a shared variable (m_Head) to a local. Node* oldHead = m_Head; // Do some speculative work, not yet visible to other threads. newHead->next = oldHead; // Next, attempt to publish our changes to the shared variable. // If the shared variable hasn't changed, the CAS succeeds and we return. // Otherwise, repeat. if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead) return; } } ``` 如果其中一個執行緒的迴圈無法通過 `_InterlockedCompareExchange` ，代表有其他的執行緒成功了，程式整體是有進展的。改寫: `simrupt` to `lock free` ```c /* Insert a value into the kfifo buffer */ static int produce_data(unsigned char val) { if (val >= 0){ /* Implement a kind of circular FIFO here (skip oldest element if kfifo * buffer is full). */ unsigned int len = kfifo_in(&rx_fifo, &val, sizeof(val)); if (unlikely(len < sizeof(val)) && printk_ratelimit()) pr_warn("%s: %zu bytes dropped\n", __func__, sizeof(val) - len); pr_debug("simrupt: %s: in %u/%u bytes\n", __func__, len, kfifo_len(&rx_fifo)); return 1; } return 0; } /* Workqueue handler: executed by a kernel thread */ static void simrupt_work_func(struct work_struct *w) { int val, cpu; /* This code runs from a kernel thread, so softirqs and hard-irqs must * be enabled. */ WARN_ON_ONCE(in_softirq()); WARN_ON_ONCE(in_interrupt()); /* Pretend to simulate access to per-CPU data, disabling preemption * during the pr_info(). */ cpu = get_cpu(); pr_info("simrupt: [CPU#%d] %s\n", cpu, __func__); put_cpu(); while (produce_data(fast_buf_get())); wake_up_interruptible(&rx_wait); } ``` 我認為 simrupt 裡要改成 lock less 對我來說有點困難，首先函式 `fast_buf_get()` 會從 buffer 中取出一項目，因此如果失敗則要放回去，但此時 `val` 若已經被其他 `thread` 更改，那麼不但無法進到 `kfifo` 中，也已經失去了該項目的位置，因此無法放回去。最後我嘗試寫成 `produce_data(fast_buf_get())` ，最少化這裡的程式碼，試圖讓他變成 atomic `operation` ，而檢查 `val < 0` 則交給 `produce_data` (因此也更改了函式 `produce_data` )，透過回傳執決定是否結束。不過如果會被分割成兩部分，最後仍會產生競爭的問題。 ## Timsort, Pattern Defeating Quicksort (pdqsort) 及 Linux 核心 lib/sort.c ## [CMWQ](https://www.kernel.org/doc/html/latest/core-api/workqueue.html) (Concurrency Managed Workqueue)