# SoC Lab4-2 Caravel FIR

[GitHub Link](https://github.com/dqrengg/SoC_Laboratory/tree/main/Lab4/Lab4-2)

## 1. Block Diagram

![caravel_fir_block_diagram](https://hackmd.io/_uploads/H1eFHGSeeg.png)

## 2. User Project

### 2.1 WB Decode

* `user_project_wrapper` serves as the datapath.
* Selection signals: `0x3000_0000` maps to `mprj`, `0x3800_0000` maps to `mprjram`.

```
assign mem_sel = (wbs_adr_i[31:22] == 10'b0011_1000_00);
assign axi_sel = (wbs_adr_i[31:20] == 12'b0011_0000_0000);
```

* Cycle selection

```
assign mem_cyc = wbs_cyc_i & mem_sel;
assign axi_cyc = wbs_cyc_i & axi_sel;
```

* Shared signals (`ack`, `dat`)

```
assign wbs_ack_o = mem_ack | axi_ack;
assign wbs_dat_o = ({ 32{mem_sel} } & mem_dat) | ({ 32{axi_sel} } & axi_dat);
```

### 2.2 External Memory

* Read with 10-cycle latency

```
always @(posedge wb_clk_i or posedge wb_rst_i) begin
    if (wb_rst_i) begin
        count <= 0;
    end else begin
        if (count == 4'd10)
            count <= 0;
        else if (wbs_cyc_i & !wbs_we_i)
            count <= count + 1;
    end
end
```

* WB-BRAM conversion

```
assign valid = wbs_cyc_i & wbs_stb_i;
assign we = valid & wbs_we_i;
assign en = 1'b1;
```

* Shared `ack`: writes are acknowledged immediately, reads once the counter reaches 10.

```
assign wbs_ack_o = (valid & wbs_we_i) | (count == 4'd10);
```

* BRAM `Do` connects to `wb_dat_o`

### 2.3 WB-AXI Conversion

#### Address Decode

* `!we` for read, `we` for write

```
assign axilite_sel = (wbs_adr_i[31:8] == 24'h3000_00)
                   & (wbs_adr_i != 32'h3000_0040)
                   & (wbs_adr_i != 32'h3000_0044);
assign axis_s_sel = (wbs_adr_i == 32'h3000_0040);
assign axis_m_sel = (wbs_adr_i == 32'h3000_0044);
```

#### Shared `ack`

* AXI-Lite read / write response signals
* AXIS waits for ready or valid

```
assign axilite_ack = (awready & wready) | rvalid;
assign axis_s_ack = axis_s_sel & ss_tready;
assign axis_m_ack = axis_m_sel & sm_tvalid;
assign axis_ack = axis_m_ack | axis_s_ack;
assign wbs_ack_o = wbs_cyc_i & (axilite_ack | axis_ack);
```

#### Shared `dat`

* Only read operations return data

```
assign wbs_dat_o = { 32{axis_m_sel} } & sm_tdata | { 32{rvalid} } & rdata;
```

#### AXI-Lite Write

```
assign awvalid = valid & wbs_we_i & axilite_sel;
assign awaddr = wbs_adr_i[11:0];
assign wvalid = valid & wbs_we_i & axilite_sel;
assign wdata = wbs_dat_i;
```

#### AXI-Lite Read

* De-assert `arvalid` after the address handshake, and re-enable it once the read data handshake completes.

```
always @(posedge wb_clk_i or posedge wb_rst_i) begin
    if (wb_rst_i)
        arvalid_en <= 1;
    else begin
        if (arvalid && arready)
            arvalid_en <= 0;
        else if (rvalid && rready)
            arvalid_en <= 1;
    end
end
assign rready = wbs_cyc_i & !wbs_we_i & axilite_sel;
assign arvalid = valid & !wbs_we_i & axilite_sel & arvalid_en;
assign araddr = wbs_adr_i[11:0];
```

#### AXIS Master

* CPU reads from `0x3000_0044`

```
assign sm_tready = wbs_cyc_i & !wbs_we_i & axis_m_sel;
```

#### AXIS Slave

* CPU writes to `0x3000_0040`
* Use a counter to generate `ss_tlast`. The counter is reloaded with the written value when the CPU writes to `0x3000_0010`, and decrements on every accepted `X`.

```
assign ss_tvalid = valid & wbs_we_i & axis_s_sel;
assign ss_tdata = wbs_dat_i;
assign ss_tlast = ss_tvalid & (ss_last_count == 1);
always @(posedge wb_clk_i) begin
    if (valid & wbs_we_i & (wbs_adr_i == 32'h3000_0010)) begin
        ss_last_count <= wbs_dat_i;
    end else if (ss_tvalid & ss_tready) begin
        ss_last_count <= ss_last_count - 1;
    end
end
```

### 2.4 FIR

* The FIR from Lab3.

## 3. Performance Analysis

### 3.1 Initial Version - Normal While Loop

* After the CPU fetches the code from user memory, the instructions of the FIR calculation loop stay in the cache.
* Firmware code

```
uint32_t i = 0;
while (i < N) {
    reg_fir_x = i;
    output_signals[i] = reg_fir_y;
    i++;
}
```

* Performance issues
    * **Hardware latency**: The firmware reads `Y` immediately after writing `X`. However, the FIR engine takes 3 cycles to compute `Y` after receiving the corresponding `X`, so the CPU has to wait for the hardware.
    * Pipeline hazards
* Waveform - 16 cycles per Y
  ![waveform1](https://hackmd.io/_uploads/r1z7YmSxgl.png)

### 3.2 Optimization 1 - Reorder Read/Write in While Loop

* Modify the firmware code, putting `bne` between 'write `X`' and 'read `Y`'.

```
uint32_t i = 0;
reg_fir_x = i;
while (i < N-1) {
    output_signals[i] = reg_fir_y;
    i++;
    reg_fir_x = i;
}
output_signals[i] = reg_fir_y;
```

* Performance issues: **Pipeline hazards**
    * **RAW**: `i` increment and 'write `X`'
* Waveform - 15 cycles per Y
  ![waveform2](https://hackmd.io/_uploads/B1A7Ymrglg.png)

### 3.3 Optimization 2 - Counter Increment after Write X

* Firmware code

```
uint32_t i = 1;
reg_fir_x = 0;
while (i < N) {
    output_signals[i-1] = reg_fir_y;
    reg_fir_x = i;
    i++;
}
output_signals[N-1] = reg_fir_y;
```

* Performance issues: **Pipeline hazards**
    * **RAW**: `addi` and `sw`, `addi` and `bne`
* Waveform - 19 cycles per Y
  ![waveform3](https://hackmd.io/_uploads/r154t7Helg.png)

### 3.4 Optimization 3 - Reorder Instructions to Solve Pipeline Hazards

* Original instruction order: RAW hazards exist between `addi` and the first `sw`, as well as between `addi` and `bne`.

```
380000a0: 0446a603 lw   a2,68(a3)   # 30000044
380000a4: 00470713 addi a4,a4,4     # a50004
380000a8: fec72e23 sw   a2,-4(a4)
380000ac: 04f6a023 sw   a5,64(a3)
380000b0: 00178793 addi a5,a5,1
380000b4: feb796e3 bne  a5,a1,380000a0
```

* Optimized instruction order, eliminating the RAW hazards

```
380000a0: 0446a603 lw   a2,68(a3)   # 30000044
380000a4: 00470713 addi a4,a4,4     # a50004
380000a8: 04f6a023 sw   a5,64(a3)
380000ac: 00178793 addi a5,a5,1
380000b0: fec72e23 sw   a2,-4(a4)
380000b4: feb796e3 bne  a5,a1,380000a0
```

* Waveform - 12 cycles per Y
  ![waveform4](https://hackmd.io/_uploads/SJDHF7Sxeg.png)
* It seems that this CPU cannot forward data in some cases, even when the assembly code is reordered.

## 4. Observation

### Potential Bug

* In my implementation of the WB-AXIS master, `ack` is returned within one cycle.
* In the management core, the WB data selection `slave_sel_r` has one cycle of latency.

```
slave_sel_r <= slave_sel;
```

```
shared_dat_r = ((((((({32{slave_sel_r[0]}} & mgmtsoc_vexriscv_debug_bus_dat_r)
             | ({32{slave_sel_r[1]}} & dff_bus_dat_r))
             | ({32{slave_sel_r[2]}} & dff2_bus_dat_r))
             | ({32{slave_sel_r[3]}} & mgmtsoc_litespimmap_bus_dat_r))
             | ({32{slave_sel_r[4]}} & mprj_dat_r))
             | ({32{slave_sel_r[5]}} & hk_dat_r))
             | ({32{slave_sel_r[6]}} & mgmtsoc_wishbone_dat_r));
```

* Therefore, a WB read operation should have at least one cycle of latency.
* In fact, running the firmware code in 3.3 triggers this bug: WB cannot get the AXIS data, and MPRJ_IO outputs an incorrect final `Y` value.
* Changing the firmware code as below temporarily works around the bug, because the final `Y` is still being computed when the read arrives, so WB cannot read the data in one cycle.

```
uint32_t i = 2;
reg_fir_x = 0;
reg_fir_x = 1;
while (i < N) {
    outputsignal[i-2] = reg_fir_y;
    reg_fir_x = i;
    i++;
}
outputsignal[N-2] = reg_fir_y;
outputsignal[N-1] = reg_fir_y;
```

* Alternatively, the WB-AXIS read operation can be designed to complete in 2 cycles, but the cost per output then rises to 13 cycles/Y.
* Although the design passes verification with the firmware code above, it may cause bugs in future designs.
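
The one-cycle lag of `slave_sel_r` discussed under Potential Bug can be illustrated with a small behavioral model in C. This is only a sketch under stated assumptions, not the Caravel RTL: `wb_mux_t` and `wb_read` are hypothetical names, only a single slave is modeled, and the registered select is assumed to be 0 (bus idle or another slave selected) before the transfer starts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the management-core read mux: the select that
 * steers the read data is registered (slave_sel_r <= slave_sel). */
typedef struct {
    int slave_sel_r; /* lags the combinational slave_sel by one cycle */
} wb_mux_t;

/* Returns the data the CPU samples when the slave raises ack `ack_cycle`
 * cycles after the transfer starts (0 = same cycle as cyc/stb).
 * Assumption: the slave drives `slave_data` for the whole transfer. */
static uint32_t wb_read(wb_mux_t *m, uint32_t slave_data, int ack_cycle)
{
    uint32_t sampled = 0;
    assert(ack_cycle >= 0);
    for (int cycle = 0; cycle <= ack_cycle; cycle++) {
        int slave_sel = 1; /* the address decodes to this slave */
        /* read data is muxed with the registered select */
        uint32_t shared_dat_r = m->slave_sel_r ? slave_data : 0;
        if (cycle == ack_cycle)
            sampled = shared_dat_r; /* CPU latches the data on ack */
        m->slave_sel_r = slave_sel; /* clock edge updates the register */
    }
    m->slave_sel_r = 0; /* assumed: bus returns to idle after the transfer */
    return sampled;
}
```

Under these assumptions, `wb_read(&m, y, 0)` returns 0 instead of `y`, while `wb_read(&m, y, 1)` returns `y`, which is consistent with the conclusion that a WB read needs at least one cycle of latency.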