# CA2025 Quiz6
> 郭瀚文
## 14. Why is return address prediction special?
### Relevant Course Material
Week 6
[Lecture 13: Compiling-Assembling-Linking-Loading](https://docs.google.com/presentation/d/1uAURy-tL-K9oMtUP1gY9034gsOrxbhhDL9MnoTm1NKM/edit?slide=id.p1#slide=id.p1)
Page 12 - Producing Machine Code → **Slide 1**

Page 14 - What about Other References? → **Slide 2**

Page 21 - Linker Patches Together Multiple Object Modules → **Slide 3**

Page 22 - Which Addresses Need Relocating? → **Slide 4**

Page 23 - Which Instructions Need Relocation Editing? → **Slide 5**

Week 17
1. [Introduction to Dynamic Branch Prediction](https://www.youtube.com/watch?v=PFmx2p6NA0A) (Video)
Explains the motivation behind prediction, the role of the BTB, and the performance penalty of stalls. → **Video 1**
2. [Advanced Branch Prediction](https://www.youtube.com/watch?v=avp3bDqCXYM) (Video)
Details the Gshare predictor and how global history is used to predict branch directions. → **Video 2**
3. [Tournament Predictors and Branch Prediction Accuracy](https://www.youtube.com/watch?v=-T8yDJ8vTuI) (Video)
Summarizes different predictor accuracies and the hardware structures (BHT, Gshare, etc.) used in modern CPUs. → **Video 3**
### What is return address prediction?
```c=
#include <stdio.h>

int foo(int x) {
    return x + 1;              // foo returns x + 1
}

int bar(int y) {
    int a = foo(y);            // call foo and store the return value
    return a * 2;              // bar returns a * 2
}

int main() {
    int p = foo(10);           // call foo
    int q = bar(20);           // call bar (which also calls foo)
    printf("%d %d\n", p, q);
    return 0;
}
```
#### 1. What is a Return Address (RA)?
When the CPU executes a call (e.g., `foo(10)` or `bar(20)`), it saves the address of the next instruction after the call. This saved address is the return address (RA). When the callee finishes and executes a return instruction (`ret`), the CPU jumps back to this RA to continue execution.
#### 2. Where do return addresses come from in this program?
There are three calls, so there are three return addresses created:
- `main()` calls `foo(10)`
- `main()` calls `bar(20)`
- `bar()` calls `foo(20)`
#### 3. Why is predicting ret targets hard for a normal BTB-style predictor?
`foo()` typically ends with one return instruction (`ret`), but `foo` is called from multiple call sites:
- `main()` calls `foo(10)` directly
- `bar()` also calls `foo(y)`
Therefore, the same ret instruction in `foo` may need to return to:
- the instruction after `foo(10)` in `main`
- or the instruction after `foo(y)` in `bar`
This is a many-to-one mapping: one `ret` PC → many possible targets. A predictor that tries to learn a single `ret_PC → target` mapping (like a BTB/indirect target predictor) can easily oscillate between targets and mispredict.
#### 4. How does RAS (Return Address Stack) solve it?
Modern CPUs use a small hardware stack called the Return Address Stack (RAS):
- On call: push the return address into the RAS
- On ret: pop the top entry and use it as the predicted return target
Consider the execution of `q = bar(20)` (nested control flow):
(1) `main` calls `bar` → **push** `RA_main_after_bar`
(2) `bar` calls `foo` → **push** `RA_bar_after_foo`
(3) `foo` executes `ret` → **pop** `RA_bar_after_foo` (correctly returns to `bar` to compute `a * 2`)
(4) `bar` executes `ret` → **pop** `RA_main_after_bar` (correctly returns to `main`)
LIFO (Last-In-First-Out) means the return address from the most recent call must be used first, which is exactly why a stack (RAS) works for predicting ret targets.
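To make the LIFO argument concrete, here is a tiny simulation (my own sketch, not course code: the trace strings, `RAS_DEPTH`, and the single-entry "last-target" table are illustrative). It replays the calls and `foo`'s two returns from the program above and compares a RAS against a BTB-style predictor that only remembers the most recent target of `foo`'s `ret`.
```c
#include <stdio.h>
#include <string.h>

#define RAS_DEPTH 8

int main(void) {
    const char *ras[RAS_DEPTH];
    int top = 0;
    const char *last_target_of_foo_ret = NULL;   // BTB-style single entry

    // Event trace for the example program:
    //   main calls foo -> push RA1, foo returns to RA1
    //   main calls bar -> push RA2
    //   bar  calls foo -> push RA3, foo returns to RA3
    // (bar's own return, which pops RA2, is omitted to keep the focus on foo's ret)
    struct { char kind; const char *ra; } trace[] = {
        {'C', "RA1: main, after foo(10)"}, {'R', "RA1: main, after foo(10)"},
        {'C', "RA2: main, after bar(20)"},
        {'C', "RA3: bar, after foo(y)"},   {'R', "RA3: bar, after foo(y)"},
    };

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        if (trace[i].kind == 'C') {
            ras[top++] = trace[i].ra;                      // call: push the RA
        } else {
            const char *actual   = trace[i].ra;
            const char *ras_pred = ras[--top];             // ret: pop the top entry
            const char *btb_pred = last_target_of_foo_ret; // "most recent target"
            printf("foo ret -> %-28s | RAS: %s | last-target: %s\n", actual,
                   strcmp(ras_pred, actual) == 0 ? "HIT " : "MISS",
                   (btb_pred && strcmp(btb_pred, actual) == 0) ? "HIT " : "MISS");
            last_target_of_foo_ret = actual;               // train the single entry
        }
    }
    return 0;
}
```
On this trace the RAS predicts both of `foo`'s returns correctly, while the last-target entry misses both (cold the first time, then trained on the wrong call site), which is exactly the many-to-one problem from the previous subsection.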
### Answer of Question 14
1. **Dynamic Target Addresses (The One-to-Many Problem)**
- Contrast with Conditional Branches: As discussed in **Video 1**, standard conditional branches (like `if-else`) only have two possible outcomes: Taken or Not Taken.
- The Challenge of Returns: A `return` instruction is an indirect jump. A single function (e.g., `printf`) can be called from hundreds of different locations in a program. Consequently, the same `return` instruction has many different possible target addresses.
- BTB Limitations: The Branch Target Buffer (BTB), introduced in **Video 1**, typically stores only the most recent target address for a specific branch. For `return` instructions, the `most recent target` is frequently wrong if the function is called from multiple sites, leading to high misprediction rates.
2. **Incompatibility with Direction-Based Predictors**
- Direction vs. Target: **Video 2** and **Video 3** focus heavily on "Direction Prediction"—predicting whether a branch will jump.
- Fixed Direction: For a `return` instruction, the "direction" is almost always Taken. The difficulty lies in predicting where it will jump to.
- Correlation Limitations: Advanced predictors like Gshare **(Video 2)** use XOR logic between the PC and Global History to find correlations between branches. While Gshare is excellent at predicting Taken/Not-Taken patterns, it is not designed to produce a full target address for the high-entropy one-to-many mapping that returns require.
3. **Specialized Hardware: The Return Address Stack (RAS)**
- LIFO Property: Function calls follow a Last-In, First-Out (LIFO) structure. The most recently called function is the first one to return.
- The Mechanism: Because of this perfect nesting logic, processors use a specialized hardware structure called the Return Address Stack (RAS) instead of the standard Branch History Table (BHT) or Gshare tables summarized in **Video 3**.
- When a `call` instruction is fetched, the address of the next instruction is pushed onto the RAS.
- When a `ret` instruction is fetched, the address at the top of the RAS is popped and used as the predicted target.
- Accuracy: This mechanism is highly efficient and provides nearly 100% accuracy as long as the stack does not overflow, which is much higher than what a standard BTB could achieve for returns.
### Importance of PIC/PIE (Position-independent code/execution)
#### PIC vs PIE
- **PIC (Position-Independent Code)**: code (often in shared libraries) runs correctly regardless of where it is loaded.
- **PIE (Position-Independent Executable)**: the executable itself is built like PIC, enabling full **ASLR** for the main program.
#### Why PIC/PIE matters in modern systems
- **Shared libraries** can be loaded at different base addresses across processes.
- **ASLR** randomizes code locations to mitigate exploitation.
- Better **memory sharing**: multiple processes can map the same read-only code pages.
#### What does the course lecture discuss?
In the course handout **Lecture 13: Compiling, Assembling, Linking, and Loading**, the relationship between **PIC/PIE** and **PC-relative addressing** is explained clearly.
First, **Slide 1** introduces PC-relative control flow (e.g., `beq/bne/jal`) and PC-relative address formation (e.g., `auipc`), which together support Position-Independent Code (PIC).
The key idea is that with PC-relative addressing, the target address can be computed from the current **PC** at runtime, so the code does not rely on fixed absolute addresses.
When a program has cross-file references or accesses data whose addresses are not yet known, **Slide 2** notes that the assembler records what must be fixed later in **Relocation information** and **Symbol tables**, so the linker can resolve them.
Then, **Slide 3** states that the linker uses the relocation table to combine multiple object modules and **patch** the necessary references/addresses during linking.
Finally, **Slide 4** and **Slide 5** highlight that PC-relative addressing (including `auipc/addi`) remains valid even after the text section is relocated, so PIC avoids embedding fixed absolute addresses in the text section; any unresolved symbols/data are handled by the linker/loader.
#### PC-relative addressing enables PIC/PIE
If the loader changes the base address, **absolute addresses break**.
With **PC-relative addressing**, targets are computed from the current **PC**, so offsets remain valid after relocation.
#### RISC-V viewpoint: minimal ISA but still practical
RISC-V supports PIC/PIE with a small set of essential instructions:
- control flow: `beq/bne/jal` provide PC-relative branches and jumps; `jalr` provides register-indirect jumps (e.g., returns and calls through computed addresses).
- address formation: `auipc` + `addi/lw` builds PC-relative addresses, commonly used for data and symbol access (a worked example of the arithmetic follows this list).
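The arithmetic can be shown with a tiny C sketch (the addresses and immediates below are made up for illustration; `pc_relative` is a helper defined here, not part of any toolchain): `auipc` adds a 20-bit upper immediate, shifted left by 12, to the current PC, and a following `addi` supplies the lower 12 bits.
```c
#include <stdint.h>
#include <stdio.h>

// Compute the address reached by the RV32 pair:
//   auipc rd, hi      ; rd = pc + (hi << 12)
//   addi  rd, rd, lo  ; rd = rd + sign-extended lo (lo passed here already sign-extended)
static uint32_t pc_relative(uint32_t pc, int32_t hi20, int32_t lo12) {
    return pc + ((uint32_t)hi20 << 12) + (uint32_t)lo12;
}

int main(void) {
    int32_t hi = 0x00001, lo = 0x234;              // illustrative immediates
    // The same instruction pair, with the image loaded at two different bases:
    uint32_t addr_a = pc_relative(0x00010074, hi, lo);
    uint32_t addr_b = pc_relative(0x40010074, hi, lo);
    printf("base A -> symbol at 0x%08x\n", addr_a);
    printf("base B -> symbol at 0x%08x\n", addr_b);
    // The offset (addr - pc) is identical in both cases, so the same
    // instruction bytes work wherever the image is loaded: PIC/PIE.
    printf("offset = 0x%08x (same for both)\n", addr_a - 0x00010074);
    return 0;
}
```
In a real toolchain the immediates are filled in by the assembler and linker through relocations; the point of the sketch is only that the target is computed relative to the PC, so relocating the whole image does not invalidate the instruction bytes.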
### Design Experiment
#### Goal
Compare a standard `ret` **(RAS-predicted return)** with an indirect-jump-based return **(RAS-bypassing)** under a many-call-site workload.
#### Setup (Many-to-One Call Test)
- Create **100 distinct call sites** (100 different wrapper functions) that all call the same `helper`.
- Run the workload repeatedly and measure cycles using `RDTSC`.
- Two versions:
- 1. **RET (RAS-predicted return)**
- 2. **Indirect-jump return (RAS-bypassing)**
ChatGPT was used to assist in generating the code for the RET and indirect-jump-return versions.
:::spoiler C Code
```c=
#include <stdio.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_cycles(void) { return __rdtsc(); }
#else
static inline uint64_t read_cycles(void) { return 0; } // fallback
#endif

#define NOINLINE __attribute__((noinline))
#define CLAMP_UNUSED(x) asm volatile("" :: "r"(x) : "memory")

// ---------------- helper A: RET (RAS-predicted return) ----------------
NOINLINE int helper_ret(int x) {
    return x + 1;
}

// ---------------- helper B: Indirect-jump return (RAS-bypassing) ----------------
// Works on Windows x86-64 (Microsoft x64 ABI):
// - first int arg in ecx
// - return in eax
// - stack top at entry is return address (because function is naked => no prologue)
#if defined(_WIN64) && (defined(__x86_64__) || defined(__amd64__))
__attribute__((naked, noinline)) int helper_jmp(int x) {
    __asm__ volatile(
        // eax = ecx + 1 (compute return value)
        "lea 1(%ecx), %eax \n\t"
        // pop return address into r11
        "pop %r11          \n\t"
        // jump to return address (bypass 'ret')
        "jmp *%r11         \n\t"
    );
}
#else
// If not Windows x86-64, provide a safe fallback (still runs, but won't test RAS bypass)
NOINLINE int helper_jmp(int x) {
    return x + 1;
}
#endif

// ---------------- 100 call sites via X-macros ----------------
#define X_LIST_100(X) \
    X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) \
    X(10) X(11) X(12) X(13) X(14) X(15) X(16) X(17) X(18) X(19) \
    X(20) X(21) X(22) X(23) X(24) X(25) X(26) X(27) X(28) X(29) \
    X(30) X(31) X(32) X(33) X(34) X(35) X(36) X(37) X(38) X(39) \
    X(40) X(41) X(42) X(43) X(44) X(45) X(46) X(47) X(48) X(49) \
    X(50) X(51) X(52) X(53) X(54) X(55) X(56) X(57) X(58) X(59) \
    X(60) X(61) X(62) X(63) X(64) X(65) X(66) X(67) X(68) X(69) \
    X(70) X(71) X(72) X(73) X(74) X(75) X(76) X(77) X(78) X(79) \
    X(80) X(81) X(82) X(83) X(84) X(85) X(86) X(87) X(88) X(89) \
    X(90) X(91) X(92) X(93) X(94) X(95) X(96) X(97) X(98) X(99)

#define DECL_WRAPPER_RET(i) \
    NOINLINE int callsite_ret_##i(volatile int *sink) { \
        int v = helper_ret((i) + (*sink & 1)); \
        *sink += v; \
        return v; \
    }

#define DECL_WRAPPER_JMP(i) \
    NOINLINE int callsite_jmp_##i(volatile int *sink) { \
        int v = helper_jmp((i) + (*sink & 1)); \
        *sink += v; \
        return v; \
    }

X_LIST_100(DECL_WRAPPER_RET)
X_LIST_100(DECL_WRAPPER_JMP)

// make each expanded item a full statement
#define DO_CALL_RET(i) do { callsite_ret_##i(&sink); } while (0);
#define DO_CALL_JMP(i) do { callsite_jmp_##i(&sink); } while (0);

static NOINLINE void run_100_calls_ret(volatile int *psink) {
    volatile int sink = *psink;
    X_LIST_100(DO_CALL_RET)
    *psink = sink;
}

static NOINLINE void run_100_calls_jmp(volatile int *psink) {
    volatile int sink = *psink;
    X_LIST_100(DO_CALL_JMP)
    *psink = sink;
}

// Repeat the same benchmark multiple times and report average/min/max.
static void bench5(const char *name, void (*fn)(volatile int*), volatile int *sink) {
    const int REPS = 200000;
    const int TRIALS = 5;

    // One-time warm-up (not counted)
    for (int i = 0; i < 500; i++) fn(sink);

    uint64_t sum = 0;
    uint64_t minv = (uint64_t)(-1);
    uint64_t maxv = 0;

    printf("\n[%s]\n", name);
    for (int t = 0; t < TRIALS; t++) {
        // small warm-up before each trial (optional)
        for (int i = 0; i < 100; i++) fn(sink);

        uint64_t t0 = read_cycles();
        for (int r = 0; r < REPS; r++) fn(sink);
        uint64_t t1 = read_cycles();

        uint64_t cycles = t1 - t0;
        printf(" trial %d: %llu\n", t + 1, (unsigned long long)cycles);
        sum += cycles;
        if (cycles < minv) minv = cycles;
        if (cycles > maxv) maxv = cycles;
    }

    double avg = (double)sum / (double)TRIALS;
    printf(" avg: %.0f, min: %llu, max: %llu\n",
           avg, (unsigned long long)minv, (unsigned long long)maxv);
}

int main(void) {
    volatile int sink = 0;
    bench5("Cycles of RET (RAS-predicted return)", run_100_calls_ret, &sink);
    bench5("Cycles of Indirect-jump return (RAS-bypassing)", run_100_calls_jmp, &sink);
    CLAMP_UNUSED(sink);
    printf("\nsink=%d\n", (int)sink);
    return 0;
}
```
:::
#### Results (5 trials)
```
[Cycles of RET (RAS-predicted return)]
trial 1: 308746290
trial 2: 316889577
trial 3: 306649457
trial 4: 297831628
trial 5: 307664802
avg: 307556351, min: 297831628, max: 316889577
[Cycles of Indirect-jump return (RAS-bypassing)]
trial 1: 789891529
trial 2: 726795753
trial 3: 803017542
trial 4: 834809540
trial 5: 744662375
avg: 779835348, min: 726795753, max: 834809540
sink=1620265408
```
#### Slowdown (avg)
slowdown = 779,835,348 / 307,556,351 ≈ 2.54×, i.e., the indirect-jump return is about 2.54× slower on average.
#### Interpretation
- A normal `ret` is recognized as a **return** and can use the **Return Address Stack (RAS)** to predict the correct return target (LIFO pairing of call/return).
- The RAS-bypassing version uses an **indirect jump**, which is likely not recognized as a return and therefore cannot use the RAS; it must rely on generic indirect-branch target prediction.
- Under **many call sites** (one return instruction with many possible targets), generic target prediction is less stable, leading to more pipeline disruption and higher cycle count.
#### Conclusion
In a many-call-site experiment, using the standard `ret` (RAS-enabled return prediction) significantly improves performance; replacing it with an indirect jump makes execution about **2.54× slower** in this experiment.
## 15. Why does cache miss latency matter more than hit latency?
### Relevant Course Material
Week12
[Lecture 24: Caches (1)](https://docs.google.com/presentation/d/1dv9zkRV5mS2heGojIPm00Ms0fAqKDQ2D7yE26P4MgJI/edit?slide=id.p1#slide=id.p1) →**Slide 1**
Page 12 - Processor-DRAM Gap (Latency)
Page 17 - Characteristics of the Memory Hierarchy
[Lecture 25: Caches II](https://docs.google.com/presentation/d/1rmFD2GZBMjCSsNuHikNiFiDgoDWORYT9anz52l7bgzc/edit?slide=id.p1#slide=id.p1) →**Slide 2**
Page 20 - Cache Terms
[CS61C 2024Fa L27 Caches IV](https://docs.google.com/presentation/d/18ISfU5CJujmOwekab3fJ8A8W6z2cFgZ_BNy6moyX8Po/edit?slide=id.p1#slide=id.p1) →**Slide 3**
Page 15 - Big Idea
Page 19 - Example
Page 22 - Example: with L2 cache
Page 23 - Example: without L2 cache
### Answer of Question 15
Cache miss latency matters more than hit latency because a miss is orders of magnitude more expensive than a hit: it must go to a lower level of the hierarchy such as DRAM. In the performance model, the miss penalty is multiplied by the miss rate, and this term often dominates the average access time.
#### 1. AMAT shows miss penalty is the amplified term
The objective is to minimize:
$$
\text{AMAT}=\text{Hit Time}+\text{Miss Penalty}\times\text{Miss Rate}
$$

Hit time is added once per access, but the miss penalty is weighted by the miss rate; as long as the miss rate is not zero, a large miss penalty can dominate the average.
**Reference:** Slide 3, P15.
---
#### 2. Miss penalty is much larger than hit time (CPU–DRAM latency gap)
- **Slide 1** states that in 2020 a CPU can execute about **1000 instructions during one DRAM access**, and emphasizes that slow memory access has a **disastrous impact** on performance.
**Reference:** Slide 1, P12.
- **Slide 1** also explains that upper levels (caches) are smaller and use faster memory, so hit time is small, while the access time to the next lower level is the major component of miss penalty.
**Reference:** Slide 1, P17.
---
#### 3. Definitions: miss penalty is the cost of fetching from the lower level
**Slide 2** defines:
- **Miss penalty:** time to replace a block from the lower level in the memory hierarchy into the cache
- **Hit time:** time to access cache memory, including tag comparison
This highlights that a miss is an expensive cross-level operation, while a hit is just a fast access within the cache.
**Reference:** Slide 2, P20.
---
#### 4. Numerical examples: large miss penalty explodes AMAT even with small miss rate
**Slide 3** gives examples showing the impact directly:
- If hit time = 1 cycle, miss rate = 5%, miss penalty = 20 cycles → AMAT = 2 cycles
**Reference:** Slide 3, P19.
- Without L2, L1 miss penalty = 200 cycles → AMAT = 11 cycles; with L2, AMAT drops to 2.75 cycles
**Reference:** Slide 3, P22 and P23.
These examples demonstrate that once miss penalty is large, it dominates overall performance even when miss rate is modest.
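As a quick check on those numbers, the arithmetic can be reproduced in a few lines of C (the 2-cycle-hit case at the end is an added variation of mine, not from the slide, to contrast a hit-time change with a penalty change):
```c
#include <stdio.h>

// amat = hit_time + miss_rate * miss_penalty
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    // Slide 3, P19: 1-cycle hit, 5% miss rate, 20-cycle miss penalty -> 2 cycles
    printf("small penalty : %.2f cycles\n", amat(1.0, 0.05, 20.0));
    // Slide 3, P23: same L1, but misses go straight to DRAM (200 cycles) -> 11 cycles
    printf("DRAM penalty  : %.2f cycles\n", amat(1.0, 0.05, 200.0));
    // Added variation: doubling the hit time only adds one cycle.
    printf("2-cycle hit   : %.2f cycles\n", amat(2.0, 0.05, 200.0));
    return 0;
}
```
Raising the miss penalty from 20 to 200 cycles multiplies AMAT by 5.5×, while doubling the hit time adds only a single cycle: this asymmetry is why miss latency dominates.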
### Design Experiment
- Prepare two programs:
1) **High locality**, such as blocked or tiled matrix access, with a low miss rate.
2) **Low locality**, such as streaming array scans or pointer chasing, with a high miss rate.
- Run two controlled comparisons in a simulator or any environment where latency can be tuned:
1) **Change only L1 hit time** from 1 to 2 to 4 cycles, while keeping everything else constant.
2) **Change only miss penalty** by increasing DRAM latency from 100 to 200 to 300 cycles, or by adding or removing an L2 cache to change the L1 miss penalty, while keeping everything else constant.
- For each run, measure **runtime or CPI** and record the **miss rate** to ensure it stays roughly the same within each comparison; a minimal C sketch of the two locality extremes follows below.
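The two workload extremes from step 1 can be sketched in a few dozen lines of C. This is only an illustration under my own assumptions (buffer sizes, step counts, `clock_gettime` timing on a POSIX system), not the actual course experiment; on real hardware the wall-clock gap it shows is what the controlled comparison would attribute to miss rate × miss penalty.
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// high locality = repeatedly sweeping a buffer small enough to stay in cache
// low locality  = pointer chasing through a large random cycle, where every
//                 load depends on the previous one and almost always misses

static double seconds(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    const size_t SMALL = 1 << 13;          // 8K entries (~64 KiB): fits in L1/L2
    const size_t BIG   = 1 << 24;          // 16M entries (~128 MiB): far beyond LLC
    const size_t STEPS = 1u << 25;         // same number of loads for both workloads

    size_t *small_buf = malloc(SMALL * sizeof *small_buf);
    size_t *next      = malloc(BIG * sizeof *next);
    if (!small_buf || !next) return 1;

    for (size_t i = 0; i < SMALL; i++) small_buf[i] = i;

    // Build a single random cycle (Sattolo's algorithm) so the chase visits
    // every element once per lap instead of looping in a short sub-cycle.
    for (size_t i = 0; i < BIG; i++) next[i] = i;
    srand(42);
    for (size_t i = BIG - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    volatile size_t sink = 0;

    // High locality: repeated sweeps over the small buffer (low miss rate).
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < STEPS; s++) sink += small_buf[s & (SMALL - 1)];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("high locality: %.3f s\n", seconds(t0, t1));

    // Low locality: dependent loads over the big buffer (high miss rate).
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < STEPS; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("low locality : %.3f s\n", seconds(t0, t1));

    sink += p;
    free(small_buf);
    free(next);
    return (int)(sink & 1);
}
```
Both loops issue the same number of loads; only the miss rate differs, so the runtime gap directly reflects how much the miss penalty term contributes to AMAT.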