
# Lab4_2

[1. Reference Workbook](https://drive.google.com/file/d/16MzSuKLZ-L679w-FC9tt5ThfWpS33fkX/view)
[2. github code package](https://github.com/Raywang908/lab4/tree/master/lab4_2/lab-caravel_fir)
[3. no_buffer fir design](https://github.com/Eltdot/SoC_Lab3_Fir)

## Spec

* `0x30000000` = ap_idle, ap_start, ap_done
* `0x30000010` = data_length
* `0x30000014` = tap number
* `0x30000040` = x[n] input
* `0x30000044` = y[n] output
* `0x30000080 ~ 0x300000FF` = memory to store coefficient (tap)

## Overview

### run_sim

```
rm -f counter_la_fir.hex

riscv32-unknown-elf-gcc -O3 -g -Wl,--no-warn-rwx-segments -g \
	--save-temps \
	-Xlinker -Map=output.map \
	-I../../firmware \
	-march=rv32i -mabi=ilp32 -D__vexriscv__ \
	-Wl,-Bstatic,-T,../../firmware/sections.lds,--strip-discarded \
	-ffreestanding -nostartfiles -o counter_la_fir.elf ../../firmware/crt0_vex.S ../../firmware/isr.c fir.c counter_la_fir.c
# -nostartfiles
cp fir_opt.hex counter_la_fir.hex
#riscv32-unknown-elf-objcopy -O verilog counter_la_fir.elf counter_la_fir.hex
riscv32-unknown-elf-objdump -D counter_la_fir.elf > counter_la_fir.out

# to fix flash base address
sed -ie 's/@10/@00/g' counter_la_fir.hex

iverilog -Ttyp -DFUNCTIONAL -DSIM -DUNIT_DELAY=#1 \
	-f./include.rtl.list -o counter_la_fir.vvp counter_la_fir_tb.v

vvp counter_la_fir.vvp
# with GNU sed, "-ie" is "-i" with backup suffix "e", which leaves counter_la_fir.hexe behind
rm -f counter_la_fir.vvp counter_la_fir.elf counter_la_fir.hexe
```

* `cp fir_opt.hex counter_la_fir.hex` means that the hand-modified `fir_opt.hex` is used as the instruction image instead of the `objcopy` output (that line is commented out). The reason is discussed in the `.hex` section below; it is mainly to reduce pipeline stalls.
* `riscv32-unknown-elf-gcc -O3 -g -Wl,--no-warn-rwx-segments -g \` compiles with `-O3` because we want the compiler to unroll the loop and reduce the number of instructions needed when sending x and receiving y.

| Option                       | Description                                                                 |
|------------------------------|-----------------------------------------------------------------------------|
| `-O0`                        | No optimization (default), used for debugging.                              |
| `-O1`                        | Basic optimization, fast compilation with minor code changes.               |
| `-O2`                        | Performs most safe optimizations without changing semantics.                |
| `-O3`                        | Enables further optimizations like loop unrolling and inlining, for performance needs but may increase code size. |
| `-Os`                        | Optimizes for smaller code size (suitable for embedded systems).            |
| `-Ofast`                     | All `-O3` optimizations plus aggressive ones that may break IEEE compliance; higher risk. |
| `-g`                         | Includes debugging information (can be combined with `-O2`).                |
| `-Wl,--no-warn-rwx-segments` | Passes to linker (`ld`); suppresses RWX segment warnings.                   |

### fir.h

```cpp
#ifndef __FIR_H__
#define __FIR_H__

#include <stdint.h>

#define N 11
#define data_length 64

// MMIO register addresses (variable names kept as-is)
#define reg_fir_control 0x30000000
#define reg_data_length 0x30000010
#define reg_tap_number  0x30000014
#define fir_x           0x30000040
#define fir_y           0x30000044
#define reg_fir_coeff   0x30000080 // base address for tap coefficients

// MMIO access macros
#define write_reg(addr, data) (*(volatile uint32_t*)(addr) = (data))
#define read_reg(addr) (*(volatile uint32_t*)(addr))

// Data buffers
int taps[N] = {0, -10, -9, 23, 56, 63, 56, 23, -9, -10, 0};
int inputbuffer[N];
int outputsignal[N];
//int reg_fir_x;
int reg_fir_y;

#endif
```

* We use `write_reg` to write data into a register.
* We use `read_reg` to read data from a register.

### fir.c

```cpp
#include "fir.h"
#include <defs.h>

void __attribute__ ( ( section ( ".mprjram" ) ) ) initfir() {
	//initial your fir
	//reg_fir_x = 0;
	reg_fir_y = 0;

	// configure the engine: number of samples and number of taps
	write_reg(reg_data_length, data_length);
	write_reg(reg_tap_number, N);

	// write each tap into the coefficient memory (one 32-bit word per tap)
	for (int i = 0; i < N; i = i + 1){
		write_reg((reg_fir_coeff + (4 * i)), taps[i]);
	}

	// read each tap back and drive it onto mprj_io
	for (int i = 0; i < N; i = i + 1){
		reg_mprj_datal = (read_reg(reg_fir_coeff + (4 * i)) << 16);
	}
}
```

* We first write `data_length` and the tap number (`N`) to the FIR engine (`fir.v`).
* The first `for` loop writes the taps into BRAM (in user_project).
* The second `for` loop reads each tap back from BRAM and drives it onto `reg_mprj_datal`, shifted left by 16.
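The write-then-read-back pattern in `initfir()` can be tried on a host machine without the Caravel hardware. The following is a minimal sketch, assuming the register window starts at `0x30000000` and the taps live at `reg_fir_coeff + 4 * i` as in `fir.h`. The names `mock_regs`, `mock_write`, `mock_read`, and `load_and_check_taps` are hypothetical and stand in for the real MMIO accesses; they are not part of the lab code.

```c
#include <assert.h>
#include <stdint.h>

#define FIR_BASE      0x30000000u  /* start of the register window (from fir.h) */
#define REG_FIR_COEFF 0x30000080u  /* base address for tap coefficients (from fir.h) */
#define NUM_TAPS      11

/* Hypothetical stand-in for the 0x00..0xFF register window, one word per slot. */
static uint32_t mock_regs[0x100 / 4];

/* Host-side replacement for write_reg(): stores into the mock window. */
static void mock_write(uint32_t addr, uint32_t data)
{
    assert(addr >= FIR_BASE && addr - FIR_BASE < 0x100u && addr % 4u == 0u);
    mock_regs[(addr - FIR_BASE) / 4u] = data;
}

/* Host-side replacement for read_reg(): loads from the mock window. */
static uint32_t mock_read(uint32_t addr)
{
    assert(addr >= FIR_BASE && addr - FIR_BASE < 0x100u && addr % 4u == 0u);
    return mock_regs[(addr - FIR_BASE) / 4u];
}

static const int32_t taps[NUM_TAPS] = {0, -10, -9, 23, 56, 63, 56, 23, -9, -10, 0};

/* Writes every tap, then reads each one back; returns the number of mismatches. */
static int load_and_check_taps(void)
{
    int mismatches = 0;
    for (int i = 0; i < NUM_TAPS; i++) {
        mock_write(REG_FIR_COEFF + 4u * (uint32_t)i, (uint32_t)taps[i]);
    }
    for (int i = 0; i < NUM_TAPS; i++) {
        if (mock_read(REG_FIR_COEFF + 4u * (uint32_t)i) != (uint32_t)taps[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

With this layout, tap 5 (value 63) sits at `0x30000094` and the last tap at `0x300000A8`, so the 11 taps use only part of the coefficient window.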
```cpp
int* __attribute__ ( ( section ( ".mprjram" ) ) ) fir(){
	//write down your fir
	reg_mprj_datal = 0x00A50000;
	initfir();

	// polling: wait for ap_idle (bit 2), then set ap_start (bit 0)
	while(1) {
		if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
			write_reg(reg_fir_control, 1);
			break;
		}
	}
	// first run: only write x and read y
	for(int i = 0; i < data_length; i = i + 1){
		//reg_fir_x = i;
		write_reg(fir_x, i);
		reg_fir_y = read_reg(fir_y);
		//reg_mprj_datal = reg_fir_y << 16;
	}
	reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);

	// second run: also send every y to the testbench via mprj_io
	reg_mprj_datal = 0x00A50000;
	while(1) {
		if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
			write_reg(reg_fir_control, 1);
			break;
		}
	}
	for(int i = 0; i < data_length; i = i + 1){
		//reg_fir_x = i;
		write_reg(fir_x, i);
		reg_fir_y = read_reg(fir_y);
		reg_mprj_datal = reg_fir_y << 16;
	}
	reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);

	// third run: same as the second run
	reg_mprj_datal = 0x00A50000;
	while(1) {
		if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
			write_reg(reg_fir_control, 1);
			break;
		}
	}
	for(int i = 0; i < data_length; i = i + 1){
		//reg_fir_x = i;
		write_reg(fir_x, i);
		reg_fir_y = read_reg(fir_y);
		reg_mprj_datal = reg_fir_y << 16;
	}
	reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);

	return outputsignal;
}
```

* Each `while` loop reads ap_ctrl (`reg_fir_control`) and checks whether the engine reports `ap_idle == 1` (bit 2). Once it is idle, the firmware asserts `ap_start` by writing 1.
* The first `for` loop only writes x and receives y, while the other two `for` loops also send each y to the testbench via `mprj_io`.

### .hex

```
38000224:	04f72023          	sw	a5,64(a4) # 30000040 <_esram_rom+0x1ffffd08>
38000228:	04472683          	lw	a3,68(a4)
3800022c:	00178793          	addi	a5,a5,1
38000230:	fec79ae3          	bne	a5,a2,38000224 <fir+0x124>
38000234:	01869793          	slli	a5,a3,0x18
38000238:	005a06b7          	lui	a3,0x5a0
```

![image](https://hackmd.io/_uploads/BkMVTeAggg.png)

* This is the result if we only use `-O3` to optimize the .hex file. The interval between consecutive x inputs, and between consecutive y outputs, is **17 cycles**.
* We observe that the CPU keeps repeating the instructions above, and even after the `SM` handshake, `38000230` still remains in the cache. So we assume that this RISC-V core does not have forwarding and that the only way it resolves hazards is to stall the pipeline. Additionally, the cache fetches `38000234` and `38000238`, so the core appears to use `always not taken` branch prediction.
* To reduce the number of cycles, I replace `bne a5,a2,38000224` with `beq a5,a2,38000238` followed by `jal x0, 38000224`, so that the branch prediction is correct on every iteration except the last. I also swap `lw a3,68(a4)` and `addi a5,a5,1` to resolve the data hazard (RAW) between `addi a5,a5,1` and the branch that reads `a5`.

![image](https://hackmd.io/_uploads/B1Wxzb0exe.png)

* This is the result after modifying the .hex file. The interval between consecutive x inputs, and between consecutive y outputs, is now **13 cycles**.
* The latency can be reduced to **12 cycles**, since `valid` is designed to be pulled up in the cycle after `wbs_cyc == 1`.

![image](https://hackmd.io/_uploads/ry2NQZRlxe.png)
![image](https://hackmd.io/_uploads/BymLX-Aegl.png)

* Final version

![image](https://hackmd.io/_uploads/BJgjQ9dZxx.png)

* With the no-buffer FIR design, the latency is 51 cycles.

## Waveform View

![image](https://hackmd.io/_uploads/H1PnNWRleg.png)

* The activity on the left of iBus is the system initialization. When execution reaches `100000d8 <sram_loop>:`, the CPU copies the instructions of `fir.c` into mprj_ram (`0x38000000`), which causes the activity on dBus.

![image](https://hackmd.io/_uploads/rk1KIW0ggg.png)

* Then the taps are written and read back, as described in the `fir.c` section.

![image](https://hackmd.io/_uploads/Sk1GDZ0lxe.png)

* The last part is the ss and sm transfers, that is, writing x and receiving y.
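Editing the `.hex` file by hand, as in the `bne` to `beq` + `jal` change described in the `.hex` section, requires the 32-bit encodings of the new instructions. The sketch below shows one way to compute them from the RV32I B-type and J-type formats. The helper names `encode_btype` and `encode_jal` are hypothetical. The placement is an assumption based on the addresses quoted in the text: `beq` stays at `38000230` (offset +8 to `38000238`) and `jal` sits at `38000234` (offset -16 to `38000224`). The actual contents of `fir_opt.hex` are not shown in this note and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* RV32I B-type: imm[12|10:5] rs2 rs1 funct3 imm[4:1|11] opcode(0x63).
 * funct3 = 0 encodes beq, funct3 = 1 encodes bne. offset is in bytes. */
static uint32_t encode_btype(uint32_t funct3, uint32_t rs1, uint32_t rs2, int32_t offset)
{
    uint32_t imm = (uint32_t)offset;
    assert((imm & 1u) == 0u);  /* branch targets are 2-byte aligned */
    return (((imm >> 12) & 0x1u)  << 31) |
           (((imm >> 5)  & 0x3Fu) << 25) |
           ((rs2 & 0x1Fu)         << 20) |
           ((rs1 & 0x1Fu)         << 15) |
           ((funct3 & 0x7u)       << 12) |
           (((imm >> 1)  & 0xFu)  << 8)  |
           (((imm >> 11) & 0x1u)  << 7)  |
           0x63u;
}

/* RV32I J-type (jal): imm[20|10:1|11|19:12] rd opcode(0x6F). offset is in bytes. */
static uint32_t encode_jal(uint32_t rd, int32_t offset)
{
    uint32_t imm = (uint32_t)offset;
    assert((imm & 1u) == 0u);  /* jump targets are 2-byte aligned */
    return (((imm >> 20) & 0x1u)   << 31) |
           (((imm >> 1)  & 0x3FFu) << 21) |
           (((imm >> 11) & 0x1u)   << 20) |
           (((imm >> 12) & 0xFFu)  << 12) |
           ((rd & 0x1Fu)           << 7)  |
           0x6Fu;
}
```

As a sanity check, `encode_btype(1, 15, 12, -12)` (that is, `bne a5,a2` with `a5 = x15`, `a2 = x12`, jumping from `38000230` back to `38000224`) reproduces the word `fec79ae3` from the listing. Under the assumed placement, `encode_btype(0, 15, 12, 8)` gives the `beq` word and `encode_jal(0, -16)` gives the `jal x0` word. A hex file produced by `objcopy -O verilog` is normally byte-oriented, so each word would likely appear least significant byte first.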