# SoC Lab4-2 Caravel FIR
[Github Link](https://github.com/dqrengg/SoC_Laboratory/tree/main/Lab4/Lab4-2)
## 1. Block Diagram

## 2. User Project
### 2.1 WB Decode
* `user_project_wrapper` serves as the datapath.
* Selection signals: `0x3000_0000` maps to `mprj`, `0x3800_0000` maps to `mprjram`.
```
assign mem_sel = (wbs_adr_i[31:22] == 10'b0011_1000_00);
assign axi_sel = (wbs_adr_i[31:20] == 12'b0011_0000_0000);
```
* Cycle selection
```
assign mem_cyc = wbs_cyc_i & mem_sel;
assign axi_cyc = wbs_cyc_i & axi_sel;
```
* Shared signals (`ack`, `dat`)
```
assign wbs_ack_o = mem_ack | axi_ack;
assign wbs_dat_o = ({ 32{mem_sel} } & mem_dat) | ({ 32{axi_sel} } & axi_dat);
```
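The decode above can be checked against a small software model. This is a minimal sketch, assuming only the two bit-slice comparisons shown in the RTL; the names `wb_sel_t` and `wb_decode` are hypothetical and do not appear in the project.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the wrapper's address decode. */
typedef struct {
    bool mem_sel; /* request targets mprjram (0x3800_0000 window) */
    bool axi_sel; /* request targets mprj    (0x3000_0000 window) */
} wb_sel_t;

wb_sel_t wb_decode(uint32_t adr)
{
    wb_sel_t s;
    /* wbs_adr_i[31:22] == 10'b0011_1000_00 (0x0E0) */
    s.mem_sel = (adr >> 22) == 0x0E0u;
    /* wbs_adr_i[31:20] == 12'b0011_0000_0000 (0x300) */
    s.axi_sel = (adr >> 20) == 0x300u;
    return s;
}
```

With these comparisons, `mem_sel` covers `0x3800_0000` to `0x383F_FFFF` and `axi_sel` covers `0x3000_0000` to `0x300F_FFFF`; the two windows never overlap, so at most one select is active.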
### 2.2 External Memory
* Read with 10-cycle latency
```
always @(posedge wb_clk_i or posedge wb_rst_i) begin
    if (wb_rst_i) begin
        count <= 0;
    end else begin
        if (count == 4'd10) count <= 0;
        else if (wbs_cyc_i & !wbs_we_i) count <= count + 1;
    end
end
```
* WB-BRAM conversion
```
assign valid = wbs_cyc_i & wbs_stb_i;
assign we = valid & wbs_we_i;
assign en = 1'b1;
```
* Shared `ack`
```
assign wbs_ack_o = (valid & wbs_we_i) | (count == 4'd10);
```
* BRAM `Do` connects to `wbs_dat_o`
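The read latency produced by the delay counter can be reproduced with a short cycle-level model. This is a sketch under the assumption that the counter starts at 0 and that `wbs_cyc_i` stays high with `wbs_we_i` low for the whole read; `delay_count_next` and `read_ack_latency` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

#define READ_DELAY 10u /* matches the 4'd10 compare in the RTL */

/* One rising clock edge of the delay counter (reset is not modelled). */
uint8_t delay_count_next(uint8_t count, bool cyc, bool we)
{
    if (count == READ_DELAY) return 0;
    if (cyc && !we) return (uint8_t)(count + 1u);
    return count;
}

/* Number of clock edges from the start of a read until ack is asserted. */
unsigned read_ack_latency(void)
{
    uint8_t count = 0;
    unsigned edges = 0;
    while (count != READ_DELAY) { /* read ack = (count == 4'd10) */
        count = delay_count_next(count, true, false);
        edges++;
    }
    return edges;
}
```

In this model a read is acknowledged after 10 clock edges, while a write leaves the counter untouched and is acknowledged combinationally through `valid & wbs_we_i`.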
### 2.3 WB-AXI conversion
#### Address Decode
* `!we` for read, `we` for write
```
assign axilite_sel = (wbs_adr_i[31:8] == 24'h3000_00)
& (wbs_adr_i != 32'h3000_0040) & (wbs_adr_i != 32'h3000_0044);
assign axis_s_sel = (wbs_adr_i == 32'h3000_0040);
assign axis_m_sel = (wbs_adr_i == 32'h3000_0044);
```
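As a cross-check of the three select signals, the same decode can be written as a single function. This is a minimal sketch; the enum `wb_target_t` and the function `axi_decode` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical target classification mirroring the three select signals. */
typedef enum {
    TGT_NONE,
    TGT_AXILITE,     /* axilite_sel */
    TGT_AXIS_SLAVE,  /* axis_s_sel: X input  at 0x3000_0040 */
    TGT_AXIS_MASTER  /* axis_m_sel: Y output at 0x3000_0044 */
} wb_target_t;

wb_target_t axi_decode(uint32_t adr)
{
    if (adr == 0x30000040u) return TGT_AXIS_SLAVE;
    if (adr == 0x30000044u) return TGT_AXIS_MASTER;
    /* wbs_adr_i[31:8] == 24'h3000_00, minus the two stream addresses */
    if ((adr >> 8) == 0x300000u) return TGT_AXILITE;
    return TGT_NONE;
}
```

Checking the two stream addresses first plays the role of the `!=` terms in `axilite_sel`, so the three targets are mutually exclusive.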
#### Shared `ack`
* AXI-Lite read / write response signals
* AXIS waits for ready or valid
```
assign axilite_ack = (awready & wready) | rvalid;
assign axis_s_ack = axis_s_sel & ss_tready;
assign axis_m_ack = axis_m_sel & sm_tvalid;
assign axis_ack = axis_m_ack | axis_s_ack;
assign wbs_ack_o = wbs_cyc_i & (axilite_ack | axis_ack);
```
#### Shared `dat`
* Only read operations return data
```
assign wbs_dat_o = { 32{axis_m_sel} } & sm_tdata | { 32{rvalid} } & rdata;
```
#### AXI-Lite Write
```
assign awvalid = valid & wbs_we_i & axilite_sel;
assign awaddr = wbs_adr_i[11:0];
assign wvalid = valid & wbs_we_i & axilite_sel;
assign wdata = wbs_dat_i;
```
#### AXI-Lite Read
* De-assert `arvalid` after the address handshake and re-enable it once the read data has been accepted
```
always @(posedge wb_clk_i or posedge wb_rst_i) begin
    if (wb_rst_i) arvalid_en <= 1;
    else begin
        if (arvalid && arready) arvalid_en <= 0;
        else if (rvalid && rready) arvalid_en <= 1;
    end
end
assign rready = wbs_cyc_i & !wbs_we_i & axilite_sel;
assign arvalid = valid & !wbs_we_i & axilite_sel & arvalid_en;
assign araddr = wbs_adr_i[11:0];
```
#### AXIS Master
* CPU reads from `0x3000_0044`
```
assign sm_tready = wbs_cyc_i & !wbs_we_i & axis_m_sel;
```
#### AXIS Slave
* CPU writes to `0x3000_0040`
* Use a counter to generate `ss_tlast`. The counter is reloaded with the written value whenever the CPU writes to `0x3000_0010`, and it decrements on every accepted transfer.
```
assign ss_tvalid = valid & wbs_we_i & axis_s_sel;
assign ss_tdata = wbs_dat_i;
assign ss_tlast = ss_tvalid & (ss_last_count == 1);
always @(posedge wb_clk_i) begin
    if (valid & wbs_we_i & (wbs_adr_i == 32'h3000_0010)) begin
        ss_last_count <= wbs_dat_i;
    end else if (ss_tvalid & ss_tready) begin
        ss_last_count <= ss_last_count - 1;
    end
end
```
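The `ss_tlast` counter can be illustrated with a small behavioral model. This is a sketch, assuming every modelled beat is an accepted transfer (`ss_tvalid & ss_tready`); `axis_s_model_t`, `axis_s_set_length`, and `axis_s_write_beat` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the ss_last_count register. */
typedef struct {
    uint32_t last_count;
} axis_s_model_t;

/* Models a CPU write to 0x3000_0010: the counter loads the written value. */
void axis_s_set_length(axis_s_model_t *m, uint32_t n)
{
    m->last_count = n;
}

/* Models one accepted write to 0x3000_0040; returns ss_tlast for that beat. */
bool axis_s_write_beat(axis_s_model_t *m)
{
    bool tlast = (m->last_count == 1u); /* ss_tlast = ss_tvalid & (count == 1) */
    m->last_count -= 1u;                /* decrement on ss_tvalid & ss_tready */
    return tlast;
}
```

After loading a length of 4, the first three beats return `false` and the fourth returns `true`, so `ss_tlast` marks exactly the last sample of the stream.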
### 2.4 FIR
* Reuses the FIR engine from Lab3.
## 3. Performance Analysis
### 3.1 Initial Version - Normal While Loop
* After the CPU fetches the code from user memory, the instructions of the FIR calculation loop stay in the cache.
* Firmware code
```
uint32_t i = 0;
while (i < N) {
    reg_fir_x = i;
    output_signals[i] = reg_fir_y;
    i++;
}
```
* Performance issues
    * **Hardware latency**: The firmware reads `Y` immediately after writing `X`. However, the FIR engine needs 3 cycles to compute `Y` after it receives the corresponding `X`, so the CPU has to wait for the hardware.
    * **Pipeline hazards**
* Waveform - 16 cycles per Y

### 3.2 Optimization 1 - Reorder Read/Write in While Loop
* Modify the firmware code so that the loop branch (`bne`) falls between 'write `X`' and 'read `Y`', giving the FIR engine time to compute `Y`.
```
uint32_t i = 0;
reg_fir_x = i;
while (i < N-1) {
    output_signals[i] = reg_fir_y;
    i++;
    reg_fir_x = i;
}
output_signals[i] = reg_fir_y;
```
* Performance issues: **Pipeline hazards**
    * **RAW**: `i` increment and 'write `X`'
* Waveform - 15 cycles per Y

### 3.3 Optimization 2 - Counter Increment after Write X
* Firmware code
```
uint32_t i = 1;
reg_fir_x = 0;
while (i < N) {
    output_signals[i-1] = reg_fir_y;
    reg_fir_x = i;
    i++;
}
// The loop fills indices 0..N-2, so the final Y belongs at index N-1.
output_signals[N-1] = reg_fir_y;
```
* Performance issues: **Pipeline hazards**
    * **RAW**: `addi` and `sw`, `addi` and `bne`
* Waveform - 19 cycles per Y

### 3.4 Optimization 3 - Reorder Instructions to Solve Pipeline Hazards
* Original instruction order: RAW hazards exist between `addi` and the first `sw`, and between `addi` and `bne`.
```
380000a0: 0446a603 lw a2,68(a3) # 30000044
380000a4: 00470713 addi a4,a4,4 # a50004
380000a8: fec72e23 sw a2,-4(a4)
380000ac: 04f6a023 sw a5,64(a3)
380000b0: 00178793 addi a5,a5,1
380000b4: feb796e3 bne a5,a1,380000a0
```
* Optimized instruction order: each `addi` is separated from the instruction that consumes its result, which eliminates the RAW hazards.
```
380000a0: 0446a603 lw a2,68(a3) # 30000044
380000a4: 00470713 addi a4,a4,4 # a50004
380000a8: 04f6a023 sw a5,64(a3)
380000ac: 00178793 addi a5,a5,1
380000b0: fec72e23 sw a2,-4(a4)
380000b4: feb796e3 bne a5,a1,380000a0
```
* Waveform - 12 cycles per Y

* It seems that this CPU cannot forward data in some cases, so some stalls remain even after the assembly code is reordered.
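The measured cycles per `Y` can be compared with a few lines of arithmetic. This is a minimal sketch using only the waveform figures quoted in sections 3.1 to 3.4; the helper names are hypothetical, the sample count passed to `loop_cycles` is an illustrative assumption, and startup and drain cycles are ignored.

```c
/* Cycles per output Y from the waveforms in sections 3.1, 3.2, 3.3, 3.4. */
static const unsigned cycles_per_y[4] = { 16, 15, 19, 12 };

/* Estimated loop cycles for n outputs (startup and drain are ignored). */
unsigned long loop_cycles(unsigned version, unsigned long n)
{
    return (unsigned long)cycles_per_y[version] * n;
}

/* Speedup of a version relative to the initial loop (version 0). */
double speedup_vs_initial(unsigned version)
{
    return (double)cycles_per_y[0] / (double)cycles_per_y[version];
}
```

For a hypothetical 64 samples, `loop_cycles(3, 64)` gives 768 cycles against 1024 for the initial loop, a speedup of about 1.33, while the 19-cycle version in 3.3 is slower than the initial loop.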
## 4. Observation
### Potential Bug
* My implementation of the WB-AXIS master returns `ack` in one cycle.
* In the management core, the WB data selection signal `slave_sel_r` has one cycle of latency.
```
slave_sel_r <= slave_sel;
```
```
shared_dat_r = ((((((({32{slave_sel_r[0]}} & mgmtsoc_vexriscv_debug_bus_dat_r)
| ({32{slave_sel_r[1]}} & dff_bus_dat_r))
| ({32{slave_sel_r[2]}} & dff2_bus_dat_r))
| ({32{slave_sel_r[3]}} & mgmtsoc_litespimmap_bus_dat_r))
| ({32{slave_sel_r[4]}} & mprj_dat_r))
| ({32{slave_sel_r[5]}} & hk_dat_r))
| ({32{slave_sel_r[6]}} & mgmtsoc_wishbone_dat_r));
```
* Therefore, a WB read operation should have at least one cycle of latency.
* In fact, running the firmware code in 3.3 triggers this bug: WB cannot get the AXIS data, and MPRJ_IO outputs an incorrect final `Y` value.
* Changing the firmware code as below temporarily works around the bug, because the final `Y` is still being computed when the read arrives, so the WB read cannot complete in one cycle.
```
uint32_t i = 2;
reg_fir_x = 0;
reg_fir_x = 1;
while (i < N) {
    outputsignal[i-2] = reg_fir_y;
    reg_fir_x = i;
    i++;
}
outputsignal[N-2] = reg_fir_y;
outputsignal[N-1] = reg_fir_y;
```
* Alternatively, the WB-AXIS read operation can be designed to complete in 2 cycles, but the cost per output then rises to 13 cycles/Y, which lowers the FIR throughput.
* Although the design passes verification with the firmware code above, the one-cycle `ack` may still cause bugs in future designs.
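The effect of the registered select can be sketched with a tiny cycle-level model. It is a simplification under stated assumptions: only the `mprj` branch of the data mux is modelled, `slave_sel_r` is cleared before the request, and the request is held until `ack`. The function name `wb_read_model` and the parameter `ack_delay` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of a WB read from the user project.
 * ack_delay = cycles the slave waits after the request appears before
 * asserting ack (0 means ack in the same cycle as the request).
 * Returns the data the management core would sample when ack arrives.
 */
uint32_t wb_read_model(uint32_t mprj_dat, unsigned ack_delay)
{
    bool slave_sel_r = false;   /* registered select, cleared before the request */
    for (unsigned cycle = 0; ; cycle++) {
        bool slave_sel = true;  /* request is held until ack */
        if (cycle == ack_delay) {
            /* shared_dat_r = {32{slave_sel_r}} & mprj_dat_r */
            return slave_sel_r ? mprj_dat : 0u;
        }
        slave_sel_r = slave_sel; /* slave_sel_r <= slave_sel */
    }
}
```

In this model an `ack` in the request cycle returns 0 because `slave_sel_r` has not been updated yet, while any `ack` delayed by at least one cycle returns the slave data, which matches the observation that a WB read needs at least one cycle of latency.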