# CA2025 Quiz6 > 黃偉峰

## Question 21: Why is context switch expensive on modern CPUs?

Source: [[CS61C SU20] Lecture 18.3 - Operating Systems & Virtual Memory: Multiprogramming & Time Sharing](https://www.youtube.com/watch?v=2-FGndKozHk&list=PLDoI-XvXO0arq3vh2gcLvr5isLK4CuK3B&index=3) / [UCB Virtual Memory I](https://docs.google.com/presentation/d/1_KtUjyxALrATgiT9JsiK8GKQXNw5M0BAllsdz78Lm74/edit?slide=id.g31386ae6ad6_1_912#slide=id.g31386ae6ad6_1_912) / [UCB Virtual Memory II](https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g317a71d716e_2_46#slide=id.g317a71d716e_2_46)

### What is a context switch?

A **context switch** is the operation performed by the operating system in which the CPU stops executing one process or thread and starts executing another by **saving the current execution context and restoring a different one**.

### Why is it expensive?

1. The operating system must save the current process’s CPU state (such as registers and the program counter) and restore the state of another process. This already costs extra instructions and memory accesses.
2. After a context switch, **the CPU often cannot reuse cached data from the previous process because each process uses different data and instructions.** This causes more cache misses and slows down execution.
3. Because different processes have different virtual memory mappings, a context switch may cause the **TLB to lose useful entries, forcing the CPU to look up page tables again**, which is slow.

In short, context switches are expensive because they add extra overhead and reduce CPU efficiency by **breaking cache and TLB locality**.

### Experiment: Measuring Context Switch Cost with `lat_ctx`

`lat_ctx` is a microbenchmark from lmbench used to measure the **cost of context switching** on a real machine. The program creates multiple **runnable processes and repeatedly switches execution among them**.
By timing these switches, `lat_ctx` **reports the average time per context switch.** In this experiment, we vary: - The working set size per process, which controls cache and TLB pressure When the **working set fits in the CPU cache, context switch latency is relatively low**. Once the **working set exceeds cache capacity, latency increases significantly** due to cache misses and TLB misses. Because `lat_ctx` runs directly on the system, the results reflect real hardware and operating system behavior, including cache hierarchy, TLBs, and scheduler overhead. #### Setup The experiment was conducted on a local machine with the following hardware configuration: ``` ukp66482@ukp66482-Ubuntu:~$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 39 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: GenuineIntel Model name: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz CPU family: 6 Model: 167 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 Stepping: 1 CPU(s) scaling MHz: 53% CPU max MHz: 4900.0000 CPU min MHz: 800.0000 BogoMIPS: 4992.00 Virtualization features: Virtualization: VT-x Caches (sum of all): L1d: 384 KiB (8 instances) -> 48 KiB per core L1i: 256 KiB (8 instances) -> 32 KiB per core L2: 4 MiB (8 instances) -> 512 KiB per core L3: 16 MiB (1 instance) ``` --- For each working set size, the following command is executed: ``` taskset -c 0 lat_ctx -s <size in KBytes> 2 ``` The working set size \<size in KBytes\> is chosen as: ``` 4 KB 16 KB 32 KB 48 KB <- L1 boundary 64 KB 128 KB 256 KB 512 KB <- L2 boundary 1024 KB ``` The experiment is pinned to **a single CPU core** using `taskset -c 0` to prevent process migration across cores. This ensures that all context switches occur on the same core and therefore share the same **per-core L1 and L2 caches**, avoiding interference from other cores and the shared L3 cache. 
These working set sizes are selected to span different levels of the CPU cache hierarchy, allowing us to clearly observe how context switch latency increases as the per-process working set exceeds each cache level.

### Result

![image](https://hackmd.io/_uploads/SJx-MBp7We.png)

```
ukp66482@ukp66482-Ubuntu:~$ sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses taskset -c 0 lat_ctx -s 64 2
"size=64k ovr=1.52
2 1.47

 Performance counter stats for 'taskset -c 0 lat_ctx -s 64 2':

          65,8262      cache-references
           5,6468      cache-misses          # 8.58% of all cache refs
      10,3511,2352     dTLB-loads
             2947      dTLB-load-misses      # 0.00% of all dTLB cache accesses

       0.587681764 seconds time elapsed
       0.539686000 seconds user
       0.048064000 seconds sys

ukp66482@ukp66482-Ubuntu:~$ sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses taskset -c 0 lat_ctx -s 128 2
"size=128k ovr=2.70
2 1.48

 Performance counter stats for 'taskset -c 0 lat_ctx -s 128 2':

         167,8647      cache-references
           5,8313      cache-misses          # 3.47% of all cache refs
      11,1062,6209     dTLB-loads
             2814      dTLB-load-misses      # 0.00% of all dTLB cache accesses

       0.580705546 seconds time elapsed
       0.544577000 seconds user
       0.036158000 seconds sys

ukp66482@ukp66482-Ubuntu:~$ sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses taskset -c 0 lat_ctx -s 512 2
"size=512k ovr=10.30
2 2.86

 Performance counter stats for 'taskset -c 0 lat_ctx -s 512 2':

      1,3487,2278      cache-references
           9,6815      cache-misses          # 0.07% of all cache refs
      24,0450,4435     dTLB-loads
             6787      dTLB-load-misses      # 0.00% of all dTLB cache accesses

       1.304165259 seconds time elapsed
       1.271728000 seconds user
       0.032322000 seconds sys

ukp66482@ukp66482-Ubuntu:~$ sudo perf stat -e cache-references,cache-misses,dTLB-loads,dTLB-load-misses taskset -c 0 lat_ctx -s 1024 2
[sudo] password for ukp66482:
"size=1024k ovr=21.65
2 2.36

 Performance counter stats for 'taskset -c 0 lat_ctx -s 1024 2':

        9155,7405      cache-references
          11,2874      cache-misses          # 0.12% of all cache refs
      11,3768,3921     dTLB-loads
             3901      dTLB-load-misses      # 0.00% of all dTLB cache accesses

       0.589024822 seconds time elapsed
       0.574979000 seconds user
       0.014023000 seconds sys
```

The figure above shows the **average context switch latency as the per-process working set size increases.**

- **Working set = 64 KB and 128 KB**
    - The context switch latency stays low and stable at around **1.5 µs** (1.47–1.48 µs in the runs above).
    - This means the working set still fits within the **L2 cache**.
    - `perf` shows **moderate cache activity** in this range.
    - The **dTLB miss rate is very small**, indicating that TLB behavior does not have a significant impact on the context switch cost **in this range**.
- **Working set = 512 KB**
    - This size is **close to the L2 cache capacity**.
    - The context switch latency increases sharply to about **2.86 µs**.
    - The number of **cache references becomes much higher**, meaning cache lines are frequently replaced.
    - This indicates **cache thrashing at the L2 cache boundary**.
    - The **dTLB miss rate remains near zero**, showing that the latency increase mainly comes from **cache effects rather than TLB misses**.
- **Working set = 1024 KB**
    - This working set is **much larger than the L2 cache size**.
    - The context switch latency **decreases compared to 512 KB** (2.36 µs vs. 2.86 µs), but is still higher than in the small working set cases.
    - At this size, memory accesses are served more **consistently from the shared L3 cache**.
    - This reduces unstable L2 cache thrashing, leading to a **more stable access pattern**.
    - The **dTLB miss rate remains negligible in this experiment**, indicating that **TLB effects are not observable under the current workload and memory access pattern**. The context switch cost is therefore dominated by cache hierarchy effects.

Overall, the results show that context switch latency increases when the working set is larger than the cache size, and that **cache effects are the main contributor to the higher cost** in this experiment.
### Back to the question

Context switches are expensive mainly because they break cache locality. Although there is a fixed overhead for saving and restoring registers, the loss of cache efficiency forces the CPU to access slower cache levels. The experiment shows that cache effects can significantly increase context switch latency near the L2 cache boundary.

:::danger
Check `perf` on Linux as well.
:::

## Question 22: Why is page fault handling slow?

Source: [UCB Virtual Memory I](https://docs.google.com/presentation/d/1_KtUjyxALrATgiT9JsiK8GKQXNw5M0BAllsdz78Lm74/edit?slide=id.g31386ae6ad6_1_912#slide=id.g31386ae6ad6_1_912) / [UCB Virtual Memory II](https://docs.google.com/presentation/d/1KB2HKnvqsKq65udERzMw9imTMtV8KREMWmKc7w8XOfk/edit?slide=id.g317a71d716e_2_46#slide=id.g317a71d716e_2_46)

### What is a Page Fault?

A **page fault** occurs when a program attempts to access a virtual memory page that is **not currently present in physical memory (RAM)** or when the access **violates memory access permissions**.

Modern operating systems use **virtual memory**, where each process works with virtual addresses instead of physical addresses. When the CPU translates a virtual address using the page table and discovers that the required page is missing or invalid, it triggers a **page fault exception** and hands control to the operating system.

If the access is **valid**, the operating system loads the missing page from disk into memory, updates the page table, and resumes program execution. If the access is **invalid** (for example, accessing an unmapped address or writing to a read-only page), the operating system terminates the process, typically resulting in a segmentation fault.

### Why Is It Slow?

This section focuses on **major page faults**, which are the main reason page faults are slow.

A page fault is slow because it requires extra work beyond a normal memory access. When a **major page fault** occurs, the operating system must:

1. Stop the running program.
2. Read the required page from disk. 3. Load the page into memory. 4. Update the page table. 5. Resume the program. Disk access is much slower than accessing RAM, which makes a major page fault especially expensive. As a result, **frequent major page faults can significantly reduce program performance**. ### Experiment #### Experimental Setup - Operating System: Linux (Ubuntu) - RAM: 32GB - Page size: 4096 bytes - Measurement tool: `/usr/bin/time -v` - Memory access method: `mmap` (file-backed memory mapping) - Access pattern: Sequential, one access per page :::info `mmap` is used to enable demand paging, so pages are loaded into RAM only upon access, making page faults observable without allocating all memory at once. ::: #### Prepare a large file A large file is created on disk and mapped into memory using `mmap`, so each virtual memory page is backed by data from the file. ```bash dd if=/dev/zero of=bigfile.bin bs=1M count=40960 ``` #### Program Description The following program maps a portion of the file into virtual memory using `mmap` and accesses one byte per page sequentially. 
```c #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #include <sys/time.h> #define PAGE_SIZE 4096 long now_us() { struct timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec * 1000000L + tv.tv_usec; } int main(int argc, char **argv) { if (argc != 2) { fprintf(stderr, "usage: %s <MB>\n", argv[0]); return 1; } int mb = atoi(argv[1]); size_t size = (size_t)mb * 1024 * 1024; int fd = open("bigfile.bin", O_RDONLY); if (fd < 0) { perror("open"); return 1; } char *buf = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0); if (buf == MAP_FAILED) { perror("mmap"); return 1; } long start = now_us(); volatile unsigned long sum = 0; for (size_t i = 0; i < size; i += PAGE_SIZE) { sum += buf[i]; // one access per page } long end = now_us(); printf("%d MB mmap read: %.3f sec (sum=%lu)\n", mb, (end - start) / 1e6, sum); munmap(buf, size); close(fd); return 0; } ``` #### Baseline measurement (small working set) A small portion of the file is mapped and accessed. The working set easily fits in physical memory. ```bash /usr/bin/time -v ./pagefault 1024 ``` #### Page fault measurement (large working set) A much larger portion of the file is mapped, exceeding available physical memory. 
```bash /usr/bin/time -v ./pagefault 20480 ``` ### Result ``` ukp66482@ukp66482-Ubuntu:~/Desktop/CA_Quiz6/22_page_fault$ /usr/bin/time -v ./pagefault 1024 1024 MB mmap read: 0.047 sec (sum=0) Command being timed: "./pagefault 1024" User time (seconds): 0.01 System time (seconds): 0.05 Percent of CPU this job got: 98% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.06 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 1049984 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 16463 Voluntary context switches: 1 Involuntary context switches: 3 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ukp66482@ukp66482-Ubuntu:~/Desktop/CA_Quiz6/22_page_fault$ /usr/bin/time -v ./pagefault 20480 20480 MB mmap read: 14.153 sec (sum=0) Command being timed: "./pagefault 20480" User time (seconds): 0.15 System time (seconds): 3.74 Percent of CPU this job got: 26% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.47 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 20972544 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 512 Minor (reclaiming a frame) page faults: 378750 Voluntary context switches: 129240 Involuntary context switches: 74 Swaps: 0 File system inputs: 33063688 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` | Size | Time (s) | CPU % | Major PF | Minor PF | | ----- | -------- | ----- | -------- | -------- | | 1 GB | 0.06 | 98% | 0 | 16,463 | | 20 GB | 14.47 | 26% | 512 | 378,750 | #### 1 GB (Small Working Set) When mapping **1 GB** of 
the file, the working set fits entirely in physical memory. The execution time is very short (**0.06 s**), and CPU utilization is high (**98%**), indicating that the program is **CPU-bound**. No **major page faults** are observed, and only a small number of **minor page faults** occur, which are mainly caused by page table updates rather than disk access. As a result, memory accesses are fast and **do not involve disk I/O**. #### 20 GB (Large Working Set) When mapping **20 GB** of the file, the working set exceeds the available physical memory at runtime. The execution time increases dramatically to **14.47 s**, and CPU utilization drops to **26%**, showing that the program becomes **I/O-bound**. Although the number of **major page faults** is relatively small (**512**), each major page fault triggers expensive disk I/O. Due to **sequential access**, the operating system performs **readahead**, converting many subsequent accesses into **minor page faults**. Nevertheless, disk access dominates the runtime, resulting in a significant performance degradation. ## Question 24: Why are interrupts considered control hazards? Source: [I/O: Devices, Polling, Interrupts](https://drive.google.com/drive/folders/1NRoyYSbAyarMhmPD5LiY6YEqYpLoZx_s) / [Lectures 21-23: RISC-V 5-Stage Pipeline](https://docs.google.com/presentation/d/1v-Squx8lK-oOrflFOwBZh-ue94seVHZudqDOgJmFf5Q/edit?slide=id.g2fa69f884ef_0_2244#slide=id.g2fa69f884ef_0_2244) ### Traps, Interrupts, and Exceptions: Control Flow Alteration ![image](https://hackmd.io/_uploads/By0qNfkEWg.png) Traps, interrupts, and exceptions are events that **alter the normal sequential flow of program execution** by redirecting control to a handler routine. #### Normal Program Execution Under normal execution, instructions are executed sequentially, such as: ``` I_i-1 → I_i → I_i+1 ``` The processor assumes that the Program Counter increases normally and continues fetching instructions along this path. 
#### Trap / Interrupt Occurrence When a trap, interrupt, or exception occurs during the execution of instruction I_i, the normal control flow is disrupted. At this point: - The current program state must be preserved - The PC of the interrupted instruction is saved into a special register (`MEPC`) This allows the processor to resume execution later without losing program correctness. #### Trap Handler Execution After saving the PC: - Control is transferred to the trap handler The handler consists of instructions: ``` HI_1 → HI_2 → ... → HI_n ``` These handler instructions execute to: - Service the interrupt - Handle the exception - Or resolve the trap condition #### Returning from the Trap When the trap handler completes: - The PC is restored from MEPC - Execution resumes exactly where the program left off ### Why this is a control hazard Traps, interrupts, and exceptions are considered control hazards because: - They change the PC unexpectedly - Instructions already fetched into the pipeline may belong to the wrong control path - The pipeline must be flushed and redirected to the handler - Normal sequential execution assumptions are violated ## Question 25: Describe the purpose of the C extension Source: [RISC-V C Extension P160 ~ P173](https://docs.riscv.org/reference/isa/_attachments/riscv-unprivileged.pdf) ### Purpose of the C extension in RISC-V The **RISC-V Compressed C extension** is designed to **reduce instruction size and improve instruction cache efficiency**. In the base RISC-V ISA, all instructions are 32 bits long. However, many common instructions do not need such a large encoding, such as simple arithmetic instructions (`c.addi`), register moves (`c.mv`), branches (`c.beqz`, `c.bnez`), and load/store instructions (`c.lw`, `c.sw`). The C extension introduces **16-bit compressed instructions** for these frequently used operations. 
With smaller instructions, more code can fit into the instruction cache, which reduces instruction cache misses and improves instruction fetch performance. Compressed instructions are fully compatible with normal 32-bit instructions and are automatically expanded during instruction decode, so no changes are needed in the execution pipeline. In summary, the purpose of the RISC-V C extension is to **make programs smaller and improve instruction cache behavior without adding complexity**. ### Experiment To evaluate the impact of the RISC-V C extension on instruction cache behavior, we conduct the experiment using the **Ripes simulator**. We use a single C program as the benchmark and compile it twice with identical optimization settings. The only difference between the two binaries is the instruction set configuration: - `rv32im` (without compressed instructions) - `rv32imc` (with compressed instructions) Both binaries are executed on the same RISC-V processor configuration in Ripes. During execution, we observe the instruction cache statistics provided by Ripes, such as cache hit rate and miss rate. By comparing the instruction cache behavior of these two configurations, we can directly evaluate how instruction compression affects instruction cache performance. #### C code ```c // icache_test.c #include <stdint.h> volatile uint32_t sink; __attribute__((noinline)) uint32_t kernel(uint32_t x) { uint32_t s = 0; for (uint32_t i = 0; i < 1024; i++) { s += (x & 1) ? 
3 : 7;
        s ^= (x << 1);
        s += 5;
        s ^= (x >> 2);
        s += 9;
        x += s;
    }
    return s;
}

int main(void) {
    uint32_t r = 0;
    for (uint32_t t = 0; t < 10; t++) {
        r += kernel(t);
    }
    sink = r;
}
```

#### Compile Flags

```bash
riscv64-unknown-elf-gcc icache_test.c \
    -O2 \
    -march=rv32im \
    -mabi=ilp32 \
    -fno-unwind-tables -fno-asynchronous-unwind-tables \
    -fomit-frame-pointer \
    -nostdlib \
    -o test_no_c.elf

riscv64-unknown-elf-gcc icache_test.c \
    -O2 \
    -march=rv32imc \
    -mabi=ilp32 \
    -fno-unwind-tables -fno-asynchronous-unwind-tables \
    -fomit-frame-pointer \
    -nostdlib \
    -o test_with_c.elf
```

#### Cache Configuration

The instruction cache is configured with a small cache size to amplify the effect of instruction compression.

- Cache type: L1 Instruction Cache
- Number of lines: 2³ = 8
- Associativity: 1-way (direct-mapped)
- Words per line: 2¹ = 2
- Replacement policy: LRU

This configuration keeps the instruction cache capacity small, allowing the impact of the RISC-V C extension on the instruction cache hit rate to be observed more clearly.

### Result

#### .text size

```
ukp66482@ukp66482-Ubuntu:~/Desktop$ riscv64-unknown-elf-size test_no_c.elf
   text    data     bss     dec     hex filename
    160       0       4     164      a4 test_no_c.elf
ukp66482@ukp66482-Ubuntu:~/Desktop$ riscv64-unknown-elf-size test_with_c.elf
   text    data     bss     dec     hex filename
     94       0       4      98      62 test_with_c.elf
```

The `.text` section size is used to represent the static instruction footprint of the program.

- Without C extension: **160 bytes**
- With C extension: **94 bytes**

Enabling the RISC-V C extension reduces the `.text` size by approximately **41%**, showing a significant improvement in code density.

#### I-Cache Hit Rate Without C extension

![without_c](https://hackmd.io/_uploads/r11ov-J4bg.png)

#### I-Cache Hit Rate With C extension

![with_c](https://hackmd.io/_uploads/rkyoDW14Zg.png)

The instruction cache hit rate is around **0.50** without the C extension and improves to about **0.75** when the C extension is enabled.
The experimental results show that enabling the RISC-V C extension significantly improves instruction cache performance. By reducing instruction size, more instructions can fit into the same instruction cache capacity, leading to a higher cache hit rate. In this experiment, the instruction cache hit rate increases from approximately **0.50** to **0.75** when the C extension is enabled. This clearly demonstrates the benefit of instruction compression, especially under a small instruction cache configuration.