
# CA2025 Quiz6
> 張家瑋

## Q4: What ISA feature most directly impacts compiler optimization freedom?
### Ans:
The number of registers is fixed by the ISA and cannot be changed once the CPU is built. A smaller register file allows faster access, but it also reduces the compiler's optimization freedom: the compiler has to fit the same set of live values into fewer registers, and whatever does not fit must be spilled to memory.
### Reference:
- Lecture 7: Intro to RISC-V
  - `Each ISA a predetermined number of registers` - slide 15
  - `RISC-V code must be very carefully put together to efficiently use registers` - slide 15
  - `Why 32? Smaller is faster, but too small is bad` - slide 15
- Lecture 8: RISC-V Data Transfer
  - `Each kernel has a register footprint and note that register use can be optimized better than I did before` - slide 3

### How to validate my answer with experiments
Constrain the compiler to use fewer registers while targeting the same RISC-V pipeline.
#### Experiment Design 1: compare number of lw/sw instructions
- Write a high register-pressure program
  Write a high register-pressure workload in C so that the compiler performs the register allocation. For example, we can reuse the `bf16_sqrt` algorithm from [Quiz 1 Problem C](https://hackmd.io/@sysprog/arch2025-quiz1-sol#Problem-C). This program has to keep many temporaries live at the same time. In addition, RV32I has no `mul` instruction, so multiplication must be implemented another way, which introduces even more live temporaries. Together, these conditions make it a high register-pressure program.
- Compile two versions
  We then compile the same C program with and without register restrictions (via `-ffixed-xN`).
- command
  ```bash=
  # Normal build
  riscv64-unknown-elf-gcc -O2 -S -march=rv32i -mabi=ilp32 bf16_sqrt.c -o normal.s
  # Restricted build (example: reserve x5-x7 and x28-x30, i.e. t0-t5)
  riscv64-unknown-elf-gcc -O2 -S -march=rv32i -mabi=ilp32 -ffixed-x5 -ffixed-x6 -ffixed-x7 -ffixed-x28 -ffixed-x29 -ffixed-x30 bf16_sqrt.c -o limited.s
  ```
- Measure strategy
  Now we check whether the number of `sw`/`lw` instructions increases significantly in the restricted version. With fewer registers, the compiler cannot keep all values in registers, so it must spill them to the stack and reload them later, which introduces extra `sw`/`lw` instructions.

#### Experiment Design 2: Run on MyCPU (3-stage pipeline) and analyze waveform
- Setup
  - Target: MyCPU (3-pipeline)
  - Analysis Tools: Verilator, GTKWave
- Compile two versions
  ```bash=
  # Normal
  riscv64-unknown-elf-gcc -O2 -march=rv32i -mabi=ilp32 main.c bf16_sqrt.c -o normal.elf
  # Restricted
  riscv64-unknown-elf-gcc -O2 -march=rv32i -mabi=ilp32 \
    -ffixed-x5 -ffixed-x6 -ffixed-x7 -ffixed-x28 -ffixed-x29 -ffixed-x30 \
    main.c bf16_sqrt.c -o limited.elf
  ```
- Create .asmbin file
  ```bash=
  riscv64-unknown-elf-objcopy -O binary -j .text -j .data normal.elf normal.asmbin
  riscv64-unknown-elf-objcopy -O binary -j .text -j .data limited.elf limited.asmbin
  ```
  Then we copy the `.asmbin` files into `/src/main/resources/`.
- Run on Verilator and generate waveform
  ```bash=
  # Normal
  make sim SIM_ARGS="-instruction src/main/resources/normal.asmbin"
  cp trace.vcd trace_normal.vcd
  # Restricted
  make sim SIM_ARGS="-instruction src/main/resources/limited.asmbin"
  cp trace.vcd trace_limited.vcd
  ```
- Observation
  ```bash=
  gtkwave trace_normal.vcd
  gtkwave trace_limited.vcd
  ```
- Verification
  Now we check whether the number of pulses on `mem_read_enable` / `mem_write_enable` is significantly higher in the restricted version. More pulses mean more frequent data-memory accesses, largely caused by the additional `lw`/`sw` instructions from stack spills and reloads.
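Experiment Design 1 assumes that, without the M extension, multiplication is built from shifts and adds. A minimal C sketch of such a routine is given here to show which values stay live in every iteration (multiplicand, multiplier, and running product). The name `mul32_shift_add` is hypothetical, and this is not the code that GCC or libgcc actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (assumption, not from bf16_sqrt.c or libgcc):
 * 32-bit multiply using only shift, add, and bit test, i.e. operations
 * that exist in RV32I without the M extension. */
uint32_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint32_t product = 0;      /* running sum, live across the whole loop */
    while (b != 0) {
        if (b & 1u)
            product += a;      /* add the shifted multiplicand for a set bit */
        a <<= 1;               /* next partial product */
        b >>= 1;               /* consume one multiplier bit */
    }
    return product;            /* result is taken modulo 2^32 */
}

/* Tiny self-check; call it from main() when trying the sketch. */
void mul32_self_check(void)
{
    assert(mul32_shift_add(181u, 181u) == 32761u);
}
```

If this loop were inlined into the square-root search, `a`, `b`, and `product` would all compete for registers with the search variables, which is the pressure the experiment was designed to create.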
### Implementation
- Encountered problem
  At first, I expected that, because the compiler cannot use the `mul` instruction, it would expand the multiplication inline, which would produce many live temporaries. However, GCC instead calls the software runtime helper `__mulsi3` (from libgcc) to perform 32-bit integer multiplication.
  ```asm
      .text
      .globl __mulsi3
      ...
      call __mulsi3
      ...
  ```
  Because the multiplication is no longer expanded inline, the number of live temporaries we have to keep track of drops significantly. Moreover, with `-O2` enabled, the compiler-generated code is actually quite good, certainly better than my line-by-line C-to-assembly translation. As a result, register pressure under the original limited-register setting is not really an issue in this case.
- Solution (make register usage more restricted)
  command
  ```bash=
  riscv64-unknown-elf-gcc -O2 -S -march=rv32i -mabi=ilp32 -ffixed-x5 -ffixed-x6 -ffixed-x7 -ffixed-x9 -ffixed-x18 -ffixed-x19 -ffixed-x20 -ffixed-x21 -ffixed-x22 -ffixed-x23 -ffixed-x24 -ffixed-x25 -ffixed-x26 -ffixed-x27 -ffixed-x28 -ffixed-x29 -ffixed-x30 bf16_sqrt.c -o limited.s
  ```
  Here we forbid the compiler from using registers `t0-t5` and `s1-s11` when generating `limited.s`.
- Observation
  ```bash=
  blz@localhost:~/ca2025-mycpu/3-pipeline/csrc$ grep -E '^\s*(lw|sw)\b' normal.s | wc -l
  18
  blz@localhost:~/ca2025-mycpu/3-pipeline/csrc$ grep -E '^\s*(lw|sw)\b' limited.s | wc -l
  26
  ```
  The number of `lw`/`sw` instructions did increase, from 18 in `normal.s` to 26 in `limited.s`.
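The `grep` above counts lines whose mnemonic is `lw` or `sw`. The same measurement, together with the relative increase, can be sketched in C as follows. `count_mem_ops` and `percent_increase` are hypothetical helper names introduced only for illustration; the counts 18 and 26 come from the observation above. Note that this is a static count: it does not say how often each instruction executes, and a spill inside a loop costs more at run time than one in the prologue.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count lines of assembly text whose first token is
 * exactly "lw" or "sw" (so "lwu" or a label such as ".L11:" is not counted). */
size_t count_mem_ops(const char *asm_text)
{
    assert(asm_text != NULL);
    size_t count = 0;
    const char *p = asm_text;
    while (*p != '\0') {
        /* Skip leading blanks on this line. */
        while (*p == ' ' || *p == '\t')
            p++;
        if ((strncmp(p, "lw", 2) == 0 || strncmp(p, "sw", 2) == 0) &&
            (p[2] == ' ' || p[2] == '\t'))
            count++;
        /* Advance to the start of the next line. */
        while (*p != '\0' && *p != '\n')
            p++;
        if (*p == '\n')
            p++;
    }
    return count;
}

/* Relative growth of `after` over `before`, in whole percent (truncated). */
unsigned percent_increase(unsigned before, unsigned after)
{
    assert(before != 0 && after >= before);
    return (after - before) * 100u / before;
}
```

With the measured values, `percent_increase(18, 26)` evaluates to 44, i.e. the restricted build contains roughly 44% more load/store instructions for this function.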
<details>
<summary>normal.s </summary>

```asm=
    .file "bf16_sqrt.c"
    .option nopic
    .attribute arch, "rv32i2p1"
    .attribute unaligned_access, 0
    .attribute stack_align, 16
    .text
    .globl __mulsi3
    .align 2
    .globl bf16_sqrt
    .type bf16_sqrt, @function
bf16_sqrt:
    slli a0,a0,16
    srli a0,a0,16
    addi sp,sp,-32
    srli a5,a0,7
    sw s5,4(sp)
    sw ra,28(sp)
    andi a5,a5,0xff
    li a3,255
    srli a4,a0,15
    andi s5,a0,127
    beq a5,a3,.L28
    bne a5,zero,.L4
    li a0,0
    beq s5,zero,.L3
    li a0,32768
    addi a0,a0,-64
    neg a4,a4
    and a0,a0,a4
.L3:
    lw ra,28(sp)
    lw s5,4(sp)
    addi sp,sp,32
    jr ra
.L4:
    li a0,32768
    addi a0,a0,-64
    bne a4,zero,.L3
    sw s0,24(sp)
    addi s0,a5,-127
    sw s1,20(sp)
    sw s2,16(sp)
    sw s3,12(sp)
    sw s4,8(sp)
    andi a4,s0,1
    ori s5,s5,128
    bne a4,zero,.L29
    srai s0,s0,1
    addi s0,s0,127
.L8:
    li s3,128
    li s4,256
    li s2,90
.L11:
    add s1,s2,s4
    srli s1,s1,1
    mv a1,s1
    mv a0,s1
    call __mulsi3
    srli a0,a0,7
    bltu s5,a0,.L9
    addi s2,s1,1
    mv s3,s1
.L10:
    bgeu s4,s2,.L11
    li a5,255
    bleu s3,a5,.L12
    addi a0,s0,1
    slli a0,a0,7
    slli a0,a0,16
    srli s3,s3,1
    srli a0,a0,16
.L13:
    lw s0,24(sp)
    lw ra,28(sp)
    andi s3,s3,127
    lw s1,20(sp)
    lw s2,16(sp)
    lw s4,8(sp)
    lw s5,4(sp)
    or a0,a0,s3
    lw s3,12(sp)
    addi sp,sp,32
    jr ra
.L28:
    bne s5,zero,.L3
    beq a4,zero,.L3
    lw ra,28(sp)
    li a0,32768
    lw s5,4(sp)
    addi a0,a0,-64
    addi sp,sp,32
    jr ra
.L29:
    addi a5,a5,-128
    srai a5,a5,1
    slli s5,s5,1
    addi s0,a5,127
    j .L8
.L9:
    addi s4,s1,-1
    j .L10
.L12:
    li a5,127
    bgtu s3,a5,.L14
    li a4,1
    j .L16
.L15:
    beq s0,a4,.L30
.L16:
    slli s3,s3,1
    addi s0,s0,-1
    bleu s3,a5,.L15
.L14:
    slli a0,s0,7
    slli a0,a0,16
    srli a0,a0,16
    j .L13
.L30:
    li a0,128
    j .L13
    .size bf16_sqrt, .-bf16_sqrt
    .ident "GCC: (13.2.0-11ubuntu1+12) 13.2.0"
```

</details>

<details>
<summary>limited.s </summary>

```asm=
    .file "bf16_sqrt.c"
    .option nopic
    .attribute arch, "rv32i2p1"
    .attribute unaligned_access, 0
    .attribute stack_align, 16
    .text
    .globl __mulsi3
    .align 2
    .globl bf16_sqrt
    .type bf16_sqrt, @function
bf16_sqrt:
    slli a0,a0,16
    srli a0,a0,16
    srli a5,a0,7
    andi a5,a5,0xff
    li a3,255
    srli a4,a0,15
    andi a2,a0,127
    beq a5,a3,.L30
    bne a5,zero,.L4
    li a0,0
    beq a2,zero,.L31
    li a0,32768
    neg a4,a4
    addi a0,a0,-64
    and a0,a0,a4
    ret
.L4:
    li a0,32768
    addi a0,a0,-64
    bne a4,zero,.L32
    addi sp,sp,-48
    addi a4,a5,-127
    ori a2,a2,128
    sw ra,44(sp)
    sw s0,40(sp)
    andi a3,a4,1
    sw a2,24(sp)
    bne a3,zero,.L33
    srai a4,a4,1
    addi a5,a4,127
    sw a5,28(sp)
.L8:
    li a5,128
    sw a5,16(sp)
    li a4,256
    li s0,90
.L11:
    add a1,s0,a4
    srli a1,a1,1
    mv a0,a1
    sw a4,20(sp)
    sw a1,12(sp)
    call __mulsi3
    lw a1,12(sp)
    lw a5,24(sp)
    srli a0,a0,7
    addi a4,a1,-1
    bltu a5,a0,.L10
    lw a4,20(sp)
    addi s0,a1,1
    sw a1,16(sp)
.L10:
    bgeu a4,s0,.L11
    lw a4,16(sp)
    li a5,255
    bleu a4,a5,.L12
    lw a5,28(sp)
    addi a0,a5,1
    slli a0,a0,7
    srli a5,a4,1
    slli a0,a0,16
    sw a5,16(sp)
    srli a0,a0,16
.L13:
    lw a5,16(sp)
    lw ra,44(sp)
    andi s0,a5,127
    or a0,a0,s0
    lw s0,40(sp)
    addi sp,sp,48
    jr ra
.L30:
    bne a2,zero,.L27
    beq a4,zero,.L34
    li a0,32768
    addi a0,a0,-64
    ret
.L27:
    ret
.L33:
    addi a5,a5,-128
    srai a5,a5,1
    slli a4,a2,1
    addi a5,a5,127
    sw a4,24(sp)
    sw a5,28(sp)
    j .L8
.L34:
    ret
.L12:
    lw a4,16(sp)
    li a5,127
    bgtu a4,a5,.L14
    li a4,1
    j .L16
.L15:
    lw a3,28(sp)
    beq a3,a4,.L35
.L16:
    lw a3,16(sp)
    lw a2,28(sp)
    slli a3,a3,1
    addi a2,a2,-1
    sw a3,16(sp)
    sw a2,28(sp)
    bleu a3,a5,.L15
    slli a0,a2,7
    slli a0,a0,16
    srli a0,a0,16
    j .L13
.L31:
    ret
.L32:
    ret
.L14:
    lw a5,28(sp)
    slli a0,a5,7
    slli a0,a0,16
    srli a0,a0,16
    j .L13
.L35:
    li a0,128
    j .L13
    .size bf16_sqrt, .-bf16_sqrt
    .ident "GCC: (13.2.0-11ubuntu1+12) 13.2.0"
```

</details>

From the listings above, we can see that the compiler keeps long-lived variables in `s0-s5` in `normal.s`. In `limited.s`, on the other hand, the register restriction leaves too few registers to hold the long-lived variables, which results in many spills and reloads.
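To make the register pressure concrete, the loop at `.L11` can be read back into C. The following is a reconstruction from the `normal.s` listing, not the original `bf16_sqrt.c` source; the function name `sqrt_mantissa_search` and the variable names are assumptions, and the register names in the comments refer to `normal.s`.

```c
#include <assert.h>
#include <stdint.h>

/* Reconstructed sketch (assumption): binary search for the largest mid in
 * [90, 256] such that (mid * mid) / 128 <= m. In the listing, m is the
 * mantissa with the implicit bit set, optionally doubled, so it is at
 * most 510. */
uint32_t sqrt_mantissa_search(uint32_t m)
{
    assert(m <= 510u);
    uint32_t low = 90;       /* s2 */
    uint32_t high = 256;     /* s4 */
    uint32_t result = 128;   /* s3 */

    while (low <= high) {
        uint32_t mid = (low + high) >> 1;   /* s1 */
        /* On RV32I this multiply becomes `call __mulsi3`, so low, high,
         * mid, result, and m must all survive a function call. */
        uint32_t sq = (mid * mid) >> 7;
        if (sq <= m) {
            result = mid;
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return result;
}
```

Five values are live across the multiplication. In the restricted build only `s0` remains as a callee-saved register, so most of them have to be kept on the stack around the call.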
This is visible at label `.L11` (the binary search in the original code).
- `.L11` section in `normal.s`
  - `low`, `mid`, `high`, `result`, and `m` all live in callee-saved `s` registers (`s2`, `s1`, `s4`, `s3`, `s5`)
  - there is no `lw`/`sw` inside the loop
  ```asm
      add s1,s2,s4
      srli s1,s1,1        # mid = (low+high)/2
      mv a1,s1
      mv a0,s1
      call __mulsi3
      srli a0,a0,7        # sq = (mid * mid) / 128;
      bltu s5,a0,.L9
      addi s2,s1,1
      mv s3,s1
  .L10:
      bgeu s4,s2,.L11
  ```
- `.L11` section in `limited.s`
  ```asm
      add a1,s0,a4        # s0=low, a4=high
      srli a1,a1,1        # a1=mid
      mv a0,a1
      sw a4,20(sp)        # spill high
      sw a1,12(sp)        # spill mid
      call __mulsi3
      lw a1,12(sp)        # reload mid
      lw a5,24(sp)        # reload m
      srli a0,a0,7
      addi a4,a1,-1       # high = mid - 1 (kept if sq > m)
      bltu a5,a0,.L10
      lw a4,20(sp)        # reload high
      addi s0,a1,1        # low = mid + 1
      sw a1,16(sp)        # spill result
  .L10:
      bgeu a4,s0,.L11
  ```
  Because multiplication requires `call __mulsi3`, the compiler has to assume that all caller-saved registers (the `a` and `t` registers) are clobbered by the call. Any value that is needed after the call must therefore be kept somewhere the callee will not overwrite, but with `s1-s11` reserved, `s0` is the only callee-saved register left. The only way to preserve the remaining values is to spill them to memory (the stack) and reload them when they are needed. In particular, `high` is repeatedly stored to and loaded from `sp + 20`, `mid` from `sp + 12`, and so on.
- Waveform validation
  Next, we run the assembly program on the pipelined RISC-V CPU. The workflow is very similar to [our Assignment 3](https://hackmd.io/mNi8JittTgWWEjnLOy3ybw?view#Modified-Assembly), so I'll skip the steps here and just validate the result.

>trace_normal.vcd
![image](https://hackmd.io/_uploads/SJPT0T3Ebe.png)

First, let's look at `trace_normal.vcd`. As mentioned above, we focus on the signals `mem_io_memory_read_enable` and `mem_io_memory_write_enable`, which correspond to `lw`/`sw` activity. As we can see, these signals are asserted only a few times over hundreds of executed instructions.
>trace_limited.vcd
![image](https://hackmd.io/_uploads/BJ58g0hV-g.png)

This time, `mem_io_memory_read_enable` and `mem_io_memory_write_enable` are asserted significantly more often, which indicates more frequent memory accesses (loads/stores) during execution. This matches our expectation: with fewer available registers, the compiler spills more long-lived values to the stack and later reloads them, leading to more `lw`/`sw` activity and higher memory traffic.

## Q5: Why is PC-relative addressing important?
### Ans:
Rather than using absolute addressing, which typically requires building a 32-bit immediate, PC-relative addressing needs only a short immediate field to encode the jump target (a 12-bit field for conditional branches and a 20-bit field for `jal`). Moreover, it enables Position-Independent Code (PIC). Absolute addressing is brittle to code movement, whereas PC-relative addressing needs no editing even if the text is relocated, because the offset between instructions stays constant when the whole text segment moves.
### Reference:
- Lecture 12: Instruction Formats (2)
  - `Use sparingly: Brittle to code movement/need to build 32-bit immediate` - slide 8
  - ``“Position-Independent Code”: If all of code moves, relative offsets don’t change!`` - slide 8
- Lecture 13: Compiling-Assembling-Linking-Loading
  - `Never relocate Position-independent code (PIC)` - slide 22
  - `Again, conditional branches (B-Type) don’t need editing!` - slide 23
  - `PC-relative addressing preserved even if text is relocated` - slide 23

### How to validate my answer with experiments
Check whether an instruction that refers to a symbol's address needs relocation.
#### Experiment Design
- test code
  First, we write a small program that loads the address of a symbol defined in another section. Note that the GNU assembler expands `la` into an `auipc`/`addi` pair in non-PIC code, so this address is itself formed relative to the PC; a strictly absolute sequence would use `lui`/`addi` with `%hi`/`%lo`.
  ```asm
      .section .text
      .globl _start
  _start:
      la t0, global_var
      j .

      .section .data
      .globl global_var
  global_var:
      .word 0x12345678
  ```
  The assembler does not know the final address of `global_var` when it produces a relocatable object file, because the linker later combines multiple object files and decides where each section is placed.
  At link time, the linker goes through the relocation table and fills in every address that was left unresolved.
- Turn into a relocatable object file
  ```bash=
  riscv64-unknown-elf-gcc -march=rv32i -mabi=ilp32 -c test.S -o test.o
  ```
- Disassemble and print relocation entries
  - `-d`: disassemble the executable sections of the object file
  - `-r`: print relocation entries
  ```bash=
  riscv64-unknown-elf-objdump -dr test.o
  ```
- Validation
  Now we check whether the code generated for `la t0, global_var` carries relocation entries (lines containing `R_RISCV_` in the output).

## Reference
https://wiki.csie.ncku.edu.tw/arch/schedule