CA2025 Quiz6 - HackMD

# CA2025 Quiz6 > Van Nguyen Thi Thao 阮氏草芸 >[!Note] AI tools usage >I use ChatGPT to assist with the Quiz 6 discussion by providing code explanations, grammar revisions, pre-work research, code summaries, and explanations of standard RISC-V instruction usage. ## Question 1: Why is PC-relative addressing important? How to validate your answer with experiments? Use MyCPU for further discussions ##### Answer: ### **PC-relative addressing:** An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.[1] ### **The reason that PC-relative addressing important:** PC-relative addressing is important because it enables position-independent code (PIC), which allows programs to execute correctly regardless of their absolute memory location. This addressing mode calculates target addresses by adding a signed offset to the current Program Counter (PC) value, rather than using absolute addresses.[2] * Key Benefits [3]: * Position Independence: Code can be loaded at any memory address without modification * Efficient Branch/Jump Operations: Short offsets require fewer bits than full addresses * Relocatable Code: Libraries and dynamic linking become feasible * Memory Efficiency: Reduces instruction encoding size for local jumps ### **Connection to MyCPU Implementation:** MyCPU serves as a concrete platform to understand why PC-relative addressing is architecturally important. Through implementing a RISC-V CPU, we can observe how PC-relative addressing impacts multiple stages of instruction processing. **1. Immediate Encoding and Extraction:** Location: `InstructionDecode.scala` PC-relative addressing requires careful immediate extraction from instruction encodings. The B-type and J-type formats demonstrate why this addressing mode is efficient: ```scala= val immB = Cat( Fill(Parameters.DataBits - 13, instruction(31)), // Sign extension instruction(31), // bit [12] instruction(7), // bit [11] instruction(30,25), // bits [10:5] instruction(11,8), // bits [4:1] 0.U(1.W) // bit [0] = 0 (alignment) ) val immJ = Cat( Fill(Parameters.DataBits - 21, instruction(31)), // Sign extension instruction(31), // bit [20] instruction(19,12), // bits [19:12] instruction(20), // bit [11] instruction(30,21), // bits [10:1] 0.U(1.W) // bit [0] = 0 (alignment) ) ``` Architectural Insight: The bit reordering and LSB=0 encoding reveal PC-relative addressing's encoding efficiency: * LSB is always 0 because RISC-V instructions are 2-byte aligned (compressed) or 4-byte aligned (standard). By not encoding this redundant bit, the architecture gains one extra bit of range. * B-type: 13 bits with LSB=0 → ±4KB range (sufficient for most loops and local branches) * J-type: 21 bits with LSB=0 → ±1MB range (sufficient for function calls within modules) * Comparison to absolute addressing: A 32-bit absolute address would require full instruction width just for the address, leaving no room for opcode, registers, or other fields. PC-relative offsets fit within existing instruction formats while maintaining reasonable jump distances. **2. Address Calculation Logic:** Location: `Execute.scala` MyCPU's Execute stage demonstrates the fundamental difference between PC-relative and register-based addressing. ```scala= // PC-relative addressing (branches and JAL) val branchTarget = io.instruction_address + io.immediate val jalTarget = branchTarget // JAL and Branch use same calculation method // Register-based addressing (JALR only) val jalrSum = io.reg1_data + io.immediate val jalrTarget = Cat(jalrSum(Parameters.DataBits - 1, 1), 0.U(1.W)) ``` Architectural Insight: This implementation reveals why position independence requires both addressing modes. * **PC-relative (branches, JAL):** * Calculate `target = current_PC + signed_offset` * Works regardless of where code is loaded in memory * Enables relocatable code, shared libraries, ASLR security * **Register-based (JALR):** * Calculate `target = (register_value + offset) & ~1` * Necessary for indirect jumps and function returns * Return addresses stored in registers cannot use PC-relative calculation * **Critical Design Principle:** If JALR also used PC-relative addressing, function returns would break because: * Return address is stored in `ra` register (not predictable from PC) * Function could be called from multiple locations (different return addresses) * PC-relative would always jump to same location relative to current PC ### **Validate answer using experiments:** I have written the assembly code to verify the position-independence feature. **Hypothesis**: Code using PC-relative branches should execute identically when loaded at different memory addresses. Test program: ```scala= # Position Independence Test Program # Demonstrates PC-relative addressing in RISC-V .section .text .globl _start _start: # Initialize counter to 0 li a0, 0 loop: # Increment counter addi a0, a0, 1 # Load comparison value (10) li t0, 10 # PC-relative branch: if (a0 < 10) goto loop blt a0, t0, loop # PC-relative jump to end_func jal ra, end_func end_func: # Store final result to memory address 0 sw a0, 0(zero) # Infinite halt loop halt: j halt # Some padding nop nop nop nop ``` Experiment procedure: 1. Compile test program: ``` cd 1-single-cycle mkdir -p experiments cd experiments riscv64-unknown-elf-as -march=rv32i -mabi=ilp32 -o test_pic.o test_pic.S ``` 2. Create linker scripts with different load addresses: * link_0x1000.lds (default): ``` SECTIONS { . = 0x00001000; .text : { *(.text) } } ``` * link_0x2000.lds (relocated): ``` SECTIONS { . = 0x00002000; .text : { *(.text) } } ``` 3. Link the ELF files: * Version 1: loaded at 0x1000: ``` riscv-none-elf-ld -T link_0x1000.lds -o test_0x1000.elf test_pic.o ``` * Version 2: loaded at 0x2000 ``` riscv-none-elf-ld -T link_0x1000.lds -o test_0x1000.elf test_pic.o ``` 4. Generate disassembly code ```scala= # Disassemble version 1 riscv64-unknown-elf-objdump -d -M no-aliases test_0x1000.elf > test_0x1000.dis # Disassemble version 2 riscv64-unknown-elf-objdump -d -M no-aliases test_0x2000.elf > test_0x2000.dis ``` Disassebly result: ``` test_0x1000.elf: file format elf32-littleriscv Disassembly of section .text: 00001000 <_start>: 1000: 00000513 addi a0,zero,0 00001004 <loop>: 1004: 00150513 addi a0,a0,1 1008: 00a00293 addi t0,zero,10 100c: fe554ce3 blt a0,t0,1004 <loop> 1010: 004000ef jal ra,1014 <end_func> 00001014 <end_func>: 1014: 00a02023 sw a0,0(zero) # 0 <_start-0x1000> 00001018 <halt>: 1018: 0000006f jal zero,1018 <halt> 101c: 00000013 addi zero,zero,0 1020: 00000013 addi zero,zero,0 1024: 00000013 addi zero,zero,0 1028: 00000013 addi zero,zero,0 ``` ``` test_0x2000.elf: file format elf32-littleriscv Disassembly of section .text: 00002000 <_start>: 2000: 00000513 addi a0,zero,0 00002004 <loop>: 2004: 00150513 addi a0,a0,1 2008: 00a00293 addi t0,zero,10 200c: fe554ce3 blt a0,t0,2004 <loop> 2010: 004000ef jal ra,2014 <end_func> 00002014 <end_func>: 2014: 00a02023 sw a0,0(zero) # 0 <_start-0x2000> 00002018 <halt>: 2018: 0000006f jal zero,2018 <halt> 201c: 00000013 addi zero,zero,0 2020: 00000013 addi zero,zero,0 2024: 00000013 addi zero,zero,0 2028: 00000013 addi zero,zero,0 ``` **Observation:** | Version | Address | Instructions | Encoding | Target | | ------- | ------- | -------------- | -------- | --- | | 1 | 0x100c | `blt a0,t0, loop` | fe554ce3 | 0x1004 | | 2 | 0x200c | `blt a0,t0, loop` | fe554ce3 | 0x2004 | I verified position independence through static disassembly analysis. By comparing the two disassembly files, I can prove: * Instruction encodings are identical (fe554ce3 for the branch) * PC-relative offsets are constant (-8 bytes in both cases) * Target addresses adjust automatically based on PC ## Question 2: Why does a deeper pipeline increase branch misprediction penalty? How to validate your answer with experiments? Use MyCPU for further discussions ##### Answer: A deeper pipeline increases branch misprediction penalty because more instructions are speculatively fetched and partially executed before the branch outcome is determined. When a branch is mispredicted, all these speculatively executed instructions must be flushed (discarded), wasting more cycles [4]. Core Principle: Branch Misprediction Penalty = Number of pipeline stages between Instruction Fetch (IF) and Branch Resolution stage In deeper pipelines: * Branch resolution occurs later in the pipeline * More instructions enter the pipeline speculatively * Misprediction requires flushing more stages * Results in higher cycle penalty ### **Connection to MyCPU Implementation:** **1. Theoretical Analysis** 1.1. Single-Cycle CPU (No Pipeline) Characteristics: * Every instruction completes in 1 cycle * Branch outcome known immediately * No speculative execution Branch Misprediction Penalty: 0 cycles ``` Cycle 1: BEQ instruction executes, branch taken/not-taken determined Cycle 2: Correct next instruction fetched and executed ``` 1.2. 3-Stage Pipeline Architecture: IF → ID → EX/MEM/WB (3 stages, EX/MEM/WB folded into one) Pipeline Stages: * IF (Instruction Fetch) - Fetch instruction from memory * ID (Instruction Decode) - Decode instruction, read registers * EX (Execute) - ALU operations, memory access, write back, branch evaluation Branch Resolution: Occurs in EX stage (stage 3) Location: `threestage/Execute.scala` ```scala= val instruction_jump_flag = (opcode === InstructionTypes.B) && MuxLookup(funct3, false.B)( IndexedSeq( InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data), // Comparison in EX InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data), // ... ) ) ``` When branch is taken, `threestage/Control.scala` generates flush signal, which clears both pipeline registers in `threestage/CPU.scala`: ```scala= if2id.io.flush := ctrl.io.Flush // Flush IF/ID register id2ex.io.flush := ctrl.io.Flush // Flush ID/EX register ``` Timeline of Branch Misprediction: Cycle 1: BEQ enters IF Cycle 2: BEQ in ID, Next+1 enters IF (speculative - wrong path) Cycle 3: BEQ in EX, Next+1 in ID, Next+2 enters IF (speculative - wrong path) * Branch evaluated in Execute.scala * Misprediction detected Cycle 4: Flush Next+1 (in ID) and Next+2 (in IF) Fetch correct target instruction Cycle 5: Correct instruction in ID Cycle 6: Correct instruction in EX Instructions speculatively fetched: 2 (Next+1, Next+2) Pipeline registers flushed: 2 (IF2ID, ID2EX) Penalty = 2 cycles Formula verification: Branch resolved at stage 3 (EX) Penalty = (Stage 3) - 1 = 2 cycles 1.3. 5-Stage Pipeline with Stalling (FiveStageStall) Architecture: IF → ID → EX → MEM → WB (full 5 stages, no optimizations) Pipeline Stages: * IF (Instruction Fetch) - Fetch instruction from memory * ID (Instruction Decode) - Decode instruction, read registers * EX (Execute) - ALU operations, branch evaluation * MEM (Memory Access) - Load/store operations * WB (Write Back) - Write result to register file Branch Resolution: Occurs in EX stage (stage 3) Location: `fivestage_stall/Execute.scala` ```scala= val instruction_jump_flag = (opcode === InstructionTypes.jal) || (opcode === InstructionTypes.jalr) || ((opcode === InstructionTypes.B) && branchCondition) // Branch condition evaluated in EX stage val branchCondition = MuxLookup(funct3, false.B)( Seq( InstructionsTypeB.beq -> (io.reg1_data === io.reg2_data), InstructionsTypeB.bne -> (io.reg1_data =/= io.reg2_data), InstructionsTypeB.blt -> (io.reg1_data.asSInt < io.reg2_data.asSInt), InstructionsTypeB.bge -> (io.reg1_data.asSInt >= io.reg2_data.asSInt), InstructionsTypeB.bltu -> (io.reg1_data < io.reg2_data), InstructionsTypeB.bgeu -> (io.reg1_data >= io.reg2_data) ) ) ``` Flush logic `threestage/Control.scala`: ```scala= io.if_flush := ex_jump_flag // Flush IF/ID io.id_flush := ex_jump_flag // Flush ID/EX ``` Timeline of Branch Misprediction: Cycle 1: BEQ enters IF Cycle 2: BEQ in ID, Next+1 enters IF (speculative) Cycle 3: BEQ in EX, Next+1 in ID, Next+2 enters IF (speculative) * Branch evaluated in EX stage * Misprediction detected * Flush signal generated Cycle 4: BEQ in MEM, Next+1 flushed, Next+2 flushed Fetch correct target instruction Cycle 5: BEQ in WB, Target in ID Cycle 6: Target in EX Key Difference from ThreeStage: * Despite having 5 stages total, branch still resolves at stage 3 (EX) * Same 2-cycle penalty as ThreeStage * Additional MEM and WB stages don't affect branch resolution timing * No forwarding → Must wait for clean register values from WB Instructions speculatively fetched: 2 (Next+1, Next+2) Pipeline registers flushed: 2 (IF2ID, ID2EX) Penalty = 2 cycles Formula verification: Branch resolved at stage 3 (EX) Penalty = (Stage 3) - 1 = 2 cycles ### **Validate answer using experiments:** Branch Misprediction Penalty Analysis Using Single-Cycle vs ThreeStage Pipeline Objective: Demonstrate why deeper pipelines increase branch misprediction penalty Test program: I wrote both test files `BranchPenaltyTest.scala` for single cycle CPU and pipelined CPU Location: `ca2025-mycpu/1-single-cycle/src/test/scala/riscv/singlecycle/BranchPenaltyTest.scala` [Github link](https://github.com/scarlett910/ca2025-mycpu/blob/main/1-single-cycle/src/test/scala/riscv/singlecycle/BranchPenaltyTest.scala) Run test: ``` sbt "singleCycle/testOnly riscv.singlecycle.BranchPenaltyTest" ``` Result: ``` [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/harrypotter/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/) [info] compiling 1 Scala source to /home/harrypotter/ca2025-mycpu/1-single-cycle/target/scala-2.13/test-classes ... [info] BranchPenaltyTest: [info] Single Cycle CPU - Branch Penalty Measurement ====================================================================== SINGLE-CYCLE CPU - Cycle Measurement ====================================================================== Program: fibonacci.asmbin (recursive fib(10)) Expected result: 55 ---------------------------------------------------------------------- ... 10000 cycles ====================================================================== RESULTS: ====================================================================== Status: ✓ PASS Cycles: 10,500 Result: 55 (expected: 55) ====================================================================== ✓ Single-Cycle measurement complete ✓ Record this number: 10500 cycles ✓ Result saved to: /tmp/branch_penalty_results.txt ====================================================================== [info] - should measure execution cycles for Fibonacci [info] Run completed in 14 seconds, 813 milliseconds. [info] Total number of tests run: 1 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 22 s, completed ``` Location: `ca2025-mycpu/3-pipeline/src/test/scala/riscv/BranchPenaltyTest.scala` [Github link](https://github.com/scarlett910/ca2025-mycpu/blob/main/3-pipeline/src/test/scala/riscv/BranchPenaltyTest.scala) Run test: ``` sbt "pipeline/testOnly riscv.BranchPenaltyTest" ``` Result: ``` [info] welcome to sbt 1.10.7 (Eclipse Adoptium Java 11.0.29) [info] loading project definition from /home/harrypotter/ca2025-mycpu/project [info] loading settings for project root from build.sbt... [info] set current project to mycpu-root (in build file:/home/harrypotter/ca2025-mycpu/) [info] compiling 1 Scala source to /home/harrypotter/ca2025-mycpu/3-pipeline/target/scala-2.13/test-classes ... [info] BranchPenaltyTest: [info] Pipeline - Cycle Measurement ====================================================================== Three-stage Pipelined CPU ====================================================================== ====================================================================== RESULT: 13,400 cycles ====================================================================== [info] - should measure cycles for Fibonacci (Three-stage Pipelined CPU) ====================================================================== Five-stage Pipelined CPU with Stalling ====================================================================== ====================================================================== RESULT: 19,300 cycles ====================================================================== [info] - should measure cycles for Fibonacci (Five-stage Pipelined CPU with Stalling) [info] Run completed in 43 seconds, 453 milliseconds. [info] Total number of tests run: 4 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [success] Total time: 51 s, completed ``` Experimental Results: * Single-Cycle (1 stage): 10,500 cycles → No penalty * Three-Stage (3 stages): 13,400 cycles → 28% overhead * Five-Stage (5 stages): 19,300 cycles → 84% overhead A deeper pipeline increases the overall performance penalty because: * More speculative instructions are fetched When a branch is fetched, the pipeline immediately fetches the next sequential instructions Deeper pipeline = more stages = more wrong-path instructions before branch resolves These must all be flushed on misprediction * Branch resolution happens at the same stage (EX), but more stages exist Both Three-Stage and Five-Stage resolve branches at stage 3 (EX) Formula: Penalty = (Resolution_Stage - 1) = 2 cycles However, the Five-Stage has additional stages (MEM, WB) that create more pipeline hazards * Accumulation of penalties over entire program Single-Cycle: No penalty, every instruction completes correctly Three-Stage: 2-cycle penalty per branch misprediction Five-Stage: 2-cycle branch penalty + additional data hazard stalls Although the data hazards also affect the cycles of pipelined CPU, we can still clearly observe the impact of deeper pipelines on branch misprediction penalty, as the additional stages increase the amount of wrong-path work that must be flushed and amplify the overall performance loss. ## References: [[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design RISC-V Edition: The Hardware/Software Interface. San Francisco, CA: Morgan Kaufmann, 2017.](https://www.cs.sfu.ca/~ashriram/Courses/CS295/assets/books/HandP_RISCV.pdf) [[2] CS61C - Chapter 12: Instruction Format II](https://docs.google.com/presentation/d/1rYJeuDUPezDdk01M8ib3XTo15IGLdsMJclzD4Vv9XGQ/edit?slide=id.p8#slide=id.p8) [[3] Difference between PC relative and Base register Addressing Modes - GeeksforGeeks](https://www.geeksforgeeks.org/computer-organization-architecture/difference-between-pc-relative-and-base-register-addressing-modes/) [[4] C. Terman, “L15: Pipelining the Beta — 37. Branch Prediction,” Pipelining the Beta, MIT 6.004 Computation Structures. ](https://computationstructures.org/lectures/pbeta/pbeta.html#37)