# Lab4_2
[1. Reference Workbook](https://drive.google.com/file/d/16MzSuKLZ-L679w-FC9tt5ThfWpS33fkX/view)
[2. github code package](https://github.com/Raywang908/lab4/tree/master/lab4_2/lab-caravel_fir)
[3. no_buffer fir design](https://github.com/Eltdot/SoC_Lab3_Fir)
## Spec
* `0x30000000` = ap_idle, ap_start, ap_done
* `0x30000010` = data_length
* `0x30000014` = tap number
* `0x30000040` = x[n] input
* `0x30000044` = y[n] output
* `0x30000080 ~ 0x300000FF` = memory to store coefficient (tap)
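
The register map above can be captured as named offsets so that firmware does not hard-code raw numbers. The following is a minimal sketch, assuming the 32-bit base `0x30000000` that `fir.h` uses; the names `FIR_BASE`, `FIR_COEFF_WORDS`, and `fir_coeff_addr` are hypothetical and are not part of the lab package.

```c
#include <stdint.h>

/* Hypothetical names; the values follow the register map listed above. */
#define FIR_BASE        0x30000000u
#define FIR_CTRL_OFF    0x00u  /* ap_start / ap_done / ap_idle */
#define FIR_LEN_OFF     0x10u  /* data_length */
#define FIR_TAPNUM_OFF  0x14u  /* tap number */
#define FIR_X_OFF       0x40u  /* x[n] input */
#define FIR_Y_OFF       0x44u  /* y[n] output */
#define FIR_COEFF_OFF   0x80u  /* first coefficient word */
#define FIR_COEFF_END   0xFFu  /* last byte of the coefficient window */

/* Number of 32-bit coefficient words that fit in 0x80..0xFF. */
#define FIR_COEFF_WORDS ((FIR_COEFF_END - FIR_COEFF_OFF + 1u) / 4u)

/* Address of coefficient i, or 0 if i falls outside the window. */
static uint32_t fir_coeff_addr(uint32_t i)
{
    if (i >= FIR_COEFF_WORDS) {
        return 0u;
    }
    return FIR_BASE + FIR_COEFF_OFF + 4u * i;
}
```

With 11 taps, the last coefficient lands at `0x300000A8`, well inside the 32-word window.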
## Overview
### run_sim
```
rm -f counter_la_fir.hex
riscv32-unknown-elf-gcc -O3 -g -Wl,--no-warn-rwx-segments -g \
--save-temps \
-Xlinker -Map=output.map \
-I../../firmware \
-march=rv32i -mabi=ilp32 -D__vexriscv__ \
-Wl,-Bstatic,-T,../../firmware/sections.lds,--strip-discarded \
-ffreestanding -nostartfiles -o counter_la_fir.elf ../../firmware/crt0_vex.S ../../firmware/isr.c fir.c counter_la_fir.c
# -nostartfiles
cp fir_opt.hex counter_la_fir.hex
#riscv32-unknown-elf-objcopy -O verilog counter_la_fir.elf counter_la_fir.hex
riscv32-unknown-elf-objdump -D counter_la_fir.elf > counter_la_fir.out
# to fix flash base address
sed -ie 's/@10/@00/g' counter_la_fir.hex
iverilog -Ttyp -DFUNCTIONAL -DSIM -DUNIT_DELAY=#1 \
-f./include.rtl.list -o counter_la_fir.vvp counter_la_fir_tb.v
vvp counter_la_fir.vvp
rm -f counter_la_fir.vvp counter_la_fir.elf counter_la_fir.hexe
```
* `cp fir_opt.hex counter_la_fir.hex`: the simulation loads the hand-optimized `fir_opt.hex` as the instruction image instead of the `.hex` generated from the `.elf`. The reason is discussed in the `.hex` section below; the main goal is to reduce pipeline stalls.
* `riscv32-unknown-elf-gcc -O3 -g -Wl,--no-warn-rwx-segments -g \`: we compile with `-O3` so that the compiler can unroll loops and reduce the number of instructions executed when sending x and receiving y.
| Option | Description |
|------------------------------|-----------------------------------------------------------------------------|
| `-O0` | No optimization (default), used for debugging. |
| `-O1` | Basic optimization, fast compilation with minor code changes. |
| `-O2` | Performs most safe optimizations without changing semantics. |
| `-O3` | Enables further optimizations like loop unrolling and inlining, for performance needs but may increase code size. |
| `-Os` | Optimizes for smaller code size (suitable for embedded systems). |
| `-Ofast` | All `-O3` optimizations plus aggressive ones that may break IEEE compliance; higher risk. |
| `-g` | Includes debugging information (can be combined with `-O2`). |
| `-Wl,--no-warn-rwx-segments` | Passes to linker (`ld`); suppresses RWX segment warnings. |
### fir.h
```cpp
#ifndef __FIR_H__
#define __FIR_H__
#include <stdint.h>
#define N 11
#define data_length 64
// MMIO register addresses (variable names preserved)
#define reg_fir_control 0x30000000
#define reg_data_length 0x30000010
#define reg_tap_number 0x30000014
#define fir_x 0x30000040
#define fir_y 0x30000044
#define reg_fir_coeff 0x30000080 // base address for tap coefficients
// MMIO access macros
#define write_reg(addr, data) (*(volatile uint32_t*)(addr) = (data))
#define read_reg(addr) (*(volatile uint32_t*)(addr))
// Data buffers
int taps[N] = {0, -10, -9, 23, 56, 63, 56, 23, -9, -10, 0};
int inputbuffer[N];
int outputsignal[N];
//int reg_fir_x;
int reg_fir_y;
#endif
```
* `write_reg` writes data to a memory-mapped register.
* `read_reg` reads data from a memory-mapped register.
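
Both macros cast the address to a `volatile uint32_t *`, which tells the compiler that every access must actually be performed, even under `-O3`. The sketch below exercises the same macro bodies on a host machine. It assumes a hypothetical array `fake_regs` standing in for the FIR register window, and it adds a `uintptr_t` cast so that the macros also build on a 64-bit host; neither is part of the lab firmware.

```c
#include <stdint.h>

/* Same macro bodies as fir.h, plus a uintptr_t cast for 64-bit hosts. */
#define write_reg(addr, data) (*(volatile uint32_t *)(uintptr_t)(addr) = (data))
#define read_reg(addr)        (*(volatile uint32_t *)(uintptr_t)(addr))

/* Hypothetical stand-in for the register window 0x00..0xFF (host only). */
static uint32_t fake_regs[64];

/* Translate a register offset into the address of its stand-in word. */
static uintptr_t fake_addr(uint32_t offset)
{
    return (uintptr_t)&fake_regs[offset / 4u];
}

/* Write data_length and the tap number, then read both back.
   Returns 1 if both values match, 0 otherwise. */
static int mmio_roundtrip(uint32_t length, uint32_t taps)
{
    write_reg(fake_addr(0x10u), length);
    write_reg(fake_addr(0x14u), taps);
    return read_reg(fake_addr(0x10u)) == length &&
           read_reg(fake_addr(0x14u)) == taps;
}
```

Calling `mmio_roundtrip(64, 11)` mirrors the two configuration writes that `initfir()` performs on the real registers.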
### fir.c
```cpp
#include "fir.h"
#include <defs.h>
void __attribute__ ( ( section ( ".mprjram" ) ) ) initfir() {
//initial your fir
//reg_fir_x = 0;
reg_fir_y = 0;
write_reg(reg_data_length, data_length);
write_reg(reg_tap_number, N);
for (int i = 0; i < N; i = i + 1){
write_reg((reg_fir_coeff + (4 * i)), taps[i]);
}
for (int i = 0; i < N; i = i + 1){
reg_mprj_datal = (read_reg(reg_fir_coeff + (4 * i)) << 16);
}
}
```
* We first write `data_length` and the tap number (`N`) to the FIR engine (`fir.v`).
* The first `for` loop writes the tap coefficients into BRAM (in user_project).
* The second `for` loop reads the taps back from BRAM and drives each value onto `reg_mprj_datal`.
```cpp
int* __attribute__ ( ( section ( ".mprjram" ) ) ) fir(){
//write down your fir
reg_mprj_datal = 0x00A50000;
initfir();
// polling
while(1) {
if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
write_reg(reg_fir_control, 1);
break;
}
}
for(int i = 0; i < data_length; i = i + 1){
//reg_fir_x = i;
write_reg(fir_x, i);
reg_fir_y = read_reg(fir_y);
//reg_mprj_datal = reg_fir_y << 16;
}
reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);
reg_mprj_datal = 0x00A50000;
while(1) {
if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
write_reg(reg_fir_control, 1);
break;
}
}
for(int i = 0; i < data_length; i = i + 1){
//reg_fir_x = i;
write_reg(fir_x, i);
reg_fir_y = read_reg(fir_y);
reg_mprj_datal = reg_fir_y << 16;
}
reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);
reg_mprj_datal = 0x00A50000;
while(1) {
if((read_reg(reg_fir_control) & (1 << 2)) == 0x00000004){
write_reg(reg_fir_control, 1);
break;
}
}
for(int i = 0; i < data_length; i = i + 1){
//reg_fir_x = i;
write_reg(fir_x, i);
reg_fir_y = read_reg(fir_y);
reg_mprj_datal = reg_fir_y << 16;
}
reg_mprj_datal = (reg_fir_y << 24) | (0x005A0000);
return outputsignal;
}
```
* Each `while` loop reads `ap_ctrl` (`reg_fir_control`) and checks bit 2, `ap_idle`. Once the engine is idle, the firmware writes 1 to assert `ap_start`.
* The first `for` loop only writes x and reads y, while the other two `for` loops also send each y to the testbench through `mprj_io`.
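
The poll, start, and stream sequence above can be checked on a host against a small software model of the engine. This is only a sketch: the `sim_*` functions, `fir_pass`, and `status_word` are hypothetical names, and the model assumes that the engine clears its sample history on `ap_start` and returns to idle after `data_length` samples. The taps and the `x[n] = n` stimulus come from `fir.h` and `fir.c`.

```c
#include <stdint.h>

#define N 11
#define DATA_LENGTH 64
#define AP_START (1u << 0)
#define AP_IDLE  (1u << 2)

static const int32_t taps[N] = {0, -10, -9, 23, 56, 63, 56, 23, -9, -10, 0};

/* Hypothetical software model of the engine (host-side testing only). */
static int32_t  hist[N];         /* hist[k] holds x[n-k] */
static int32_t  y_reg;           /* most recent output */
static int      remaining;       /* samples left in the current run */
static uint32_t ctrl = AP_IDLE;  /* engine is idle after reset */

static uint32_t sim_read_ctrl(void) { return ctrl; }

static void sim_write_ctrl(uint32_t v)
{
    if (v & AP_START) {
        ctrl &= ~AP_IDLE;                          /* engine leaves idle */
        remaining = DATA_LENGTH;
        for (int k = 0; k < N; k++) hist[k] = 0;   /* assumed reset of history */
    }
}

static void sim_write_x(int32_t x)
{
    for (int k = N - 1; k > 0; k--) hist[k] = hist[k - 1];
    hist[0] = x;
    y_reg = 0;
    for (int k = 0; k < N; k++) y_reg += taps[k] * hist[k];
    if (remaining > 0 && --remaining == 0) ctrl |= AP_IDLE;  /* run finished */
}

static int32_t sim_read_y(void) { return y_reg; }

/* One pass: poll ap_idle, assert ap_start, stream x[n] = n, keep the last y. */
static int32_t fir_pass(void)
{
    while ((sim_read_ctrl() & AP_IDLE) == 0u) { /* wait for idle */ }
    sim_write_ctrl(AP_START);
    int32_t y = 0;
    for (int i = 0; i < DATA_LENGTH; i++) {
        sim_write_x(i);
        y = sim_read_y();
    }
    return y;
}

/* Status word as built in fir.c: low byte of y in bits 31..24, marker 0x5A below. */
static uint32_t status_word(int32_t y)
{
    return ((uint32_t)y << 24) | 0x005A0000u;
}
```

In this model the last output of a pass is 10614 (`0x2976`), so `status_word` yields `0x765A0000`. Wrapping `fir_pass()` in a loop of three iterations would also replace the three copies of the polling and streaming code in `fir()`.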
### .hex
```
38000224: 04f72023 sw a5,64(a4) # 30000040 <_esram_rom+0x1ffffd08>
38000228: 04472683 lw a3,68(a4)
3800022c: 00178793 addi a5,a5,1
38000230: fec79ae3 bne a5,a2,38000224 <fir+0x124>
38000234: 01869793 slli a5,a3,0x18
38000238: 005a06b7 lui a3,0x5a0
```

* This is the loop generated when we only use `-O3`. The latency between consecutive x inputs, and likewise between consecutive y outputs, is **17 cycles**.
* We observe that the CPU keeps repeating the instructions above, and even after the `SM` handshake, `38000230` still remains in the cache. We therefore assume that this RISC-V core has no forwarding, so the only way to resolve a hazard is to stall the pipeline. Additionally, the cache fetches `38000234` and `38000238` after the branch, so the core appears to use `always not taken` branch prediction.
* To reduce the cycle count, I replace `bne a5,a2,38000224` with `beq a5,a2,38000238` followed by `jal x0, 38000224`, so that the not-taken prediction is correct on every iteration except the last. I also swap `lw a3,68(a4)` and `addi a5,a5,1` to remove the RAW data hazard between `addi a5,a5,1` and the branch that reads `a5`.

* This is the result after modifying the `.hex` file: the latency between consecutive x inputs, and between consecutive y outputs, is **13 cycles**.
* The latency can be reduced to **12 cycles**, since `valid` is designed to be pulled up in the cycle after `wbs_cyc == 1`.
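
To put the per-sample latencies in perspective, the rough estimate below multiplies each one by `data_length` (64). It is a simplification that counts only the steady-state spacing between samples and ignores polling and start-up overhead; the helper names are hypothetical.

```c
#include <stdint.h>

#define DATA_LENGTH 64u

/* Cycles spent streaming one pass, counting only the per-sample spacing. */
static uint32_t pass_cycles(uint32_t cycles_per_sample)
{
    return DATA_LENGTH * cycles_per_sample;
}

/* Percentage of cycles saved going from `before` to `after`, rounded down. */
static uint32_t saving_percent(uint32_t before, uint32_t after)
{
    return (before - after) * 100u / before;
}
```

Under this estimate one pass takes 1088 cycles at 17 cycles per sample, 832 at 13, and 768 at 12, a saving of roughly 23% and 29% respectively.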


* Final version

* Using the FIR design with no buffer, the latency is 51 cycles.
## Waveform View

* The activity on the left of iBus is the system initialization. When the CPU executes the code at `100000d8 <sram_loop>:`, it copies the instructions of fir.c into mprj_ram (`0x38000000`), which produces the activity on dBus.

* Then we write the taps and read them back, as described in the `fir.c` section.

* The last part is the ss and sm transfers, which write x and receive y.