## Lab3 FIR
### Overview
This `FIR` module is implemented with an `AXI4-Lite` interface for configuration and `AXI4-Stream` interfaces for input and output data transfer. The design includes a main `FSM` for overall FIR operation control, a `core engine` for computation, and a `BRAM signal processor` for managing coefficient addressing and data storage.

* **Axi-lite interface** : Used for reading and writing parameters, including the number of taps(`tap_num`), number of data(`data_length`) and coefficients
* **FSM**
* **Read state fsm :** Controls the `rvalid` signal, it includes two state--`RIDLE` and `RVALID`
* **Entire FIR fsm :** Controls the operation of FIR, it inclues three states--`IDLE`, `CALC` and `DONE`
* **Data transfer fsm :** Controls when new data can be sent in via the SS Bus and when processed data can be transferred out via the SM Bus. It includes four states--`DT_WAIT`, `DT_PROC`, `DT_DONE` and `DT_IDLE`.
* **Configuration registers :** They store the value of `tap_num`, `data length` and the fir operation states(`ap_ctrl`).
* **Bram32 signal processor :** In this module, we have to process the input signal to Bram32, and get the desired value from it.
* **tapRAM signal processor :** Before the core engine starts, first **write** the corresponding values and addresses via the AXI-Lite interface, then **read** them out one by one to verify the values. After the engine starts, continuously **generate addresses** to perform the convolution.
* **dataRAM signal processor :** Before the core engine starts, first **initialize** it to ensure all addresses store 0. After the engine starts, for each data transfer cycle, first **write** in a new data value, then generate the remaining values by continuously **reading** from the previously stored addresses.
* **Axi-stream interface** : Handles streaming data input (SS Bus) and processed data output (SM Bus) with valid/ready handshaking
* **Core engine :** Do the calculation `y[t] = ∑(h[i] ⋅ x[t−i])`
* **Counter :** Count how many `y` values are generated and transferred out.
---
### Design SPEC
* Data_Width : 32 bits
* `tap_num` : Programmable, with a maximum value of 32
* `data_length` : Programmable
* Interface : The valid/ready rate is varied
* Axi-lite : Configure coefficients
* Axi-stream : Stream in(x), stream out(y)
* Using one multiplier and adder, the two operations are in seperate pipeline cycles
* Program `ap_start` to initiate FIR core engine
---
### Implementation
#### Axi-lite interface
**Write channel**
✅Case 1 : Valid write

The testbench could send the address and data in different cycles, and their sequence doesn't matter. To **synchronize** the address and data, after both `valid` are sampled, I will set the `ready` signal to `1` for one cycle. Then, all four signals will be pulled down, and the system will wait for the next write operation.

❌Case 2 : Invalid write
If axi-lite channel tempts to write in data when the engine is doing `CALC`, it is invalid write, ignore this operation. Note that `data_length` and `tap_num` could not be overwritten for this case.
**Read channel**
✅Case 1 : Valid read

We should first get the address, and then fetch the data corresponding to this address.
For `arready`, I set it to pull up after detecting `arvalid`, and it stays high for one cycle. During this cycle, I will latch the transferred address.
Next, I use a FSM to determine the value of `rvalid`. Once both `arready` and `arvalid` are high, `rvalid` can pull up, and it will pull down together with `rready`.


❌Case 2 : Invalid read
If axi-lite channel tempts to read tap parameters when the engine is doing `CALC`, it is invalid read, return `32'hffffffff`. Note that `data_length`, `tap_num`, `ap_ctrl` could be read at any state, just return their values at that moment.
#### Entire FIR fsm

* **`IDLE`** : (1) Configure coefficients, including write/read(check) `data_length`, `tap_num` and tap coefficients. (2) Initialize dataRAM
* **`CALC`** : After `ap_start`, the core engine starts to do calculation.
* **`DONE`** : After the last data is transferred, enter this state. When reading address 0x00, it will exit this state and return to the `IDLE` state.
#### Configuration registers
* **`0x00`**: `ap_ctrl` = {ap_idle, ap_done, ap_start}
* **`0x10-14`**: `data_length`, valid wirte during `IDLE` state
* **`0x14-18`**:` tap_num`, valid wirte during `IDLE` state
#### tapRAM signal processor
* **Write operation :** `IDLE` state, axilite write channel

Write in the tap parameters sent in via axilite write channel.
* **Read operation :** `IDLE` state, axilite read channel

Read out the written parameters and check their values.
* **Address generation :** `CALC` state

For each data cycle, tapRAM should continuously generate tap parameters for the calculation. Use `tap_cnt` to track these parameters : In each cycle, `tap_cnt` resets to 0 and then increments continuously.
#### dataRAM signal processor
* Initialize: Before new calculation starts, can be in `DONE` or `IDLE` state

Before performing calculations, each address in `bram32` should be set to 0.
* Read/ Write operation : `CALC` state
After both `ss_tvalid` and `ss_tready` are `1`, FIR enters a new data transfer cycle. The first data is provided by `ss_tdata_latch`, then stored into dataRAM, and the remaining values are generated by continuously reading from the previously stored addresses.
* Address accessing

* First data (right after `ss_tvalid && ss_tready`)

Write in the new data stored in `ss_tdata_latch`
* Remaining data

Reading previously stored data
#### Axi-stream interface (using Data transfer fsm)
Define a complete **data transfer cycle** as :
```
receiving input data (ss_tready && ss_tvalid)
-> doing calculation (dt_state == CALC)
-> transferred output data (sm_tready && sm_tvalid)
-> wait till next input comes (dt_state == WAIT)
```
I use a **Data transfer fsm** to control the data flow :

* **`DT_WAIT`** : Last data cycle complete, `ss_tready` pulls up and wait for testbench signal `ss_tvalid` to start a new data cycle
* **`DT_PROC`**: SS bus handshake complete, start to calculate `y`
* **`DT_DONE`** : A valid `y` has been calculated, `sm_tvalid` is asserted, and it waits for the testbench signal `sm_tready` to complete the SM bus handshake.
* **`DT_IDLE`** : Finish all calculations and wait for a new set of data, once `ap_start` is programmed, go to `DT_WAIT` and start a new data transfer cycle.
And the **data transfer waveform** would be like :

* Each "->" means delay one cycle
-tap transfer : tap_cnt -> tap_Do -> h
-data transfer : x_r_cnt -> data_Do -> x
-calculation : x * h -> mul -> (sigma) y -> sm_tdata
* When tap_cnt == 3, y should be set to the first mul value so that it can start accumulating for the new data

#### Core engine
Since addition and multiplication occur in separate pipeline cycles, I store the two operations in different registers.

#### Counter
* `y_cnt` counts the number of valid y outputs. When a valid `y` is successfully transferred after an SM bus handshake, `y_cnt` increments by `1`.
* When the last `y` is transferred (i.e., `y_cnt == data_length`), `sm_tlast` will be set to 1.

---
### Results
* Throuput : 11116
```verilog=
-----------Congratulations! Pass-------------
-------------------third dataset complete cycle count 11116(to know best case throuput) ---------------------
```
* Total cell area : 18500.832235
```verilog=
Number of ports: 330
Number of nets: 3304
Number of cells: 2978
Number of combinational cells: 2648
Number of sequential cells: 321
Number of macros/black boxes: 0
Number of buf/inv: 558
Number of references: 83
Combinational area: 13495.305825
Buf/Inv area: 2483.712042
Noncombinational area: 5005.526410
Macro/Black Box area: 0.000000
Net Interconnect area: undefined (No wire load specified)
Total cell area: 18500.832235
Total area: undefined
```
---
### Debugging
Initially, I didn't consider the possibility of implementing multiple test cases in a single testbench, so I only used three data states: `DT_WAIT`, `DT_PROC`, and `DT_DONE` to control the data flow. However, once a test case has completed all calculations, data transfer should pause until the next ap_start is triggered, allowing the data transfer cycle to restart. So, I added an extra state : `DT_IDLE`, to handle this situation.
---
### Fir rule check
#### Rule 1: Design should not be custimized for testbench
✅ Designed for spec
#### Rule 2: Do not use specific hardcoded constant in design
❌ I should use `localparam` to define those constant
Original:
```verilog=
reg [4:0] x_w_cnt;
wire [4:0] x_w_cnt_tmp;
reg [4:0] x_r_cnt;
reg [4:0] x_r_cnt_tmp;
reg [4:0] addr_cnt;
```
Revised:
```verilog=
parameter pDATA_WIDTH = 32;
localparam CNT_WIDTH = $clog2(pDATA_WIDTH);
reg [(CNT_WIDTH-1):0] x_w_cnt;
wire [(CNT_WIDTH-1):0] x_w_cnt_tmp;
reg [(CNT_WIDTH-1):0] x_r_cnt;
reg [(CNT_WIDTH-1):0] x_r_cnt_tmp;
reg [(CNT_WIDTH-1):0] addr_cnt;
```
#### Rule 3: Mul/Add in separate pipeline cycle
✅ I use registers to store the result of mul/add, so they are in seperate pipeline cycle.
```verilog=
//************************************************************************************************************//
//****************************************** core engine: convolution*****************************************//
// multipler
assign x_tmp = (tap_cnt == 1) ? ss_tdata_latch : data_Do;
assign h_tmp = tap_Do;
assign mul_tmp = x * h;
// adder
always@ (posedge axis_clk or negedge axis_rst_n) begin
if(!axis_rst_n) begin
y <= 0;
end else if(state == IDLE) begin
y <= 0;
end else if (tap_cnt == 3) begin
y <= mul;
end else begin
y <= y + mul;
end
end
always@ (posedge axis_clk or negedge axis_rst_n) begin
if(!axis_rst_n) begin
x <= 0;
h <= 0;
mul <= 0;
end else if(state == IDLE) begin
x <= 0;
h <= 0;
mul <= 0;
end else begin
x <= x_tmp;
h <= h_tmp;
mul <= mul_tmp;
end
end
//****************************************** core engine: convolution*****************************************//
//************************************************************************************************************//
```
#### Rule 4: Do not use DSP
❌ Since I didn't specify *use_dsp = "no"*, when simulating the fir module, it may use DSP.
#### Rule 5: Coding should be concise
✅ I think I’ve used clear signal names and added helpful comments.
❌ I could design a cleaner data path, for example by relying solely on FSMs to control the signals.
#### Rule 6: Avoid Faulty Logic - Not qualify by control signal
✅ I've clearly read the spec, so I think there's no fauty logic.
#### Rule 7: ap_start, ap_idle, ap_done should be separatly controlled
❌ I use `ap_ctrl[2:0]` to control those signals.
Original:
```verilog=
// 0x00: ap_ctrl = {ap_idle, ap_done, ap_start}
always @* begin
ap_ctrl[0] = (awaddr == 12'h00 && wdata [0] == 1'b1 && awready && wready) ? 1'b1 : 1'b0;
ap_ctrl[1] = (state == DONE) ? 1'b1 : 1'b0;
ap_ctrl[2] = (state == IDLE) ? 1'b1 : 1'b0;
end
```
Revised:
```verilog=
// 0x00: ap_ctrl = {ap_idle, ap_done, ap_start}
always @* begin
ap_start = (awaddr == 12'h00 && wdata [0] == 1'b1 && awready && wready) ? 1'b1 : 1'b0;
ap_done = (state == DONE) ? 1'b1 : 1'b0;
ap_idle = (state == IDLE) ? 1'b1 : 1'b0;
end
```
#### Rule 8: Avoid input to output path
❌ I use a **Data transfer fsm** to control this data path from input to output:

However, I think if we reverse the design to use separate FSMs and registers (to latch the current I/O), the data path would be more efficient. This way, we wouldn’t have to wait for the last data cycle to complete before starting the next one. The valid data could be held in registers, allowing the next data path to proceed immediately.
#### Rule 9: AXI bus signals should not be in the FSM
✅ As specified in **Rule 8**, the data path control excludes AXI bus signals.
#### Rule 10: Should not Lock-step on Xin (ss_tready), Yout (sm_tavlid)
❌ This problem has already illustrated in **Rule 8**.
#### Rule 11: Avoid Unnessary registers/latches
✅ I’ve reviewed and reorganized my design, and I believe I didn’t use any redundant registers or latches.
#### Rule 12: What is your design II, i.e. Y output rate?
```verilog
11116/300 ≈ 37
```
The ideal case is `32`. I think I should have done these two things to achieve the ideal case:
(1) The counting should not rely on `tap_cnt`, but rather on the FSM state.
(2) Change the write of 1c and the read of 32c into a read of 32c by using a multiplexer to select the first input to the multiplier from the current input.
---
### Github link
[lab3_fir](https://github.com/nina1032024/SoC/tree/main/spring_lab/lab3_fir)