## Lab3 FIR ### Overview This `FIR` module is implemented with an `AXI4-Lite` interface for configuration and `AXI4-Stream` interfaces for input and output data transfer. The design includes a main `FSM` for overall FIR operation control, a `core engine` for computation, and a `BRAM signal processor` for managing coefficient addressing and data storage. ![image](https://hackmd.io/_uploads/rkMYoIYhJe.png) * **Axi-lite interface** : Used for reading and writing parameters, including the number of taps(`tap_num`), number of data(`data_length`) and coefficients * **FSM** * **Read state fsm :** Controls the `rvalid` signal, it includes two state--`RIDLE` and `RVALID` * **Entire FIR fsm :** Controls the operation of FIR, it inclues three states--`IDLE`, `CALC` and `DONE` * **Data transfer fsm :** Controls when new data can be sent in via the SS Bus and when processed data can be transferred out via the SM Bus. It includes four states--`DT_WAIT`, `DT_PROC`, `DT_DONE` and `DT_IDLE`. * **Configuration registers :** They store the value of `tap_num`, `data length` and the fir operation states(`ap_ctrl`). * **Bram32 signal processor :** In this module, we have to process the input signal to Bram32, and get the desired value from it. * **tapRAM signal processor :** Before the core engine starts, first **write** the corresponding values and addresses via the AXI-Lite interface, then **read** them out one by one to verify the values. After the engine starts, continuously **generate addresses** to perform the convolution. * **dataRAM signal processor :** Before the core engine starts, first **initialize** it to ensure all addresses store 0. After the engine starts, for each data transfer cycle, first **write** in a new data value, then generate the remaining values by continuously **reading** from the previously stored addresses. * **Axi-stream interface** : Handles streaming data input (SS Bus) and processed data output (SM Bus) with valid/ready handshaking * **Core engine :** Do the calculation `y[t] = ∑(h[i] ⋅ x[t−i])` * **Counter :** Count how many `y` values are generated and transferred out. --- ### Design SPEC * Data_Width : 32 bits * `tap_num` : Programmable, with a maximum value of 32 * `data_length` : Programmable * Interface : The valid/ready rate is varied * Axi-lite : Configure coefficients * Axi-stream : Stream in(x), stream out(y) * Using one multiplier and adder, the two operations are in seperate pipeline cycles * Program `ap_start` to initiate FIR core engine --- ### Implementation #### Axi-lite interface **Write channel** ✅Case 1 : Valid write ![image](https://hackmd.io/_uploads/ByDRHHKnJg.png) The testbench could send the address and data in different cycles, and their sequence doesn't matter. To **synchronize** the address and data, after both `valid` are sampled, I will set the `ready` signal to `1` for one cycle. Then, all four signals will be pulled down, and the system will wait for the next write operation. ![image](https://hackmd.io/_uploads/SyK6keo2kx.png) ❌Case 2 : Invalid write If axi-lite channel tempts to write in data when the engine is doing `CALC`, it is invalid write, ignore this operation. Note that `data_length` and `tap_num` could not be overwritten for this case. **Read channel** ✅Case 1 : Valid read ![image](https://hackmd.io/_uploads/rJH318Fnkx.png) We should first get the address, and then fetch the data corresponding to this address. For `arready`, I set it to pull up after detecting `arvalid`, and it stays high for one cycle. During this cycle, I will latch the transferred address. Next, I use a FSM to determine the value of `rvalid`. Once both `arready` and `arvalid` are high, `rvalid` can pull up, and it will pull down together with `rready`. ![image](https://hackmd.io/_uploads/r18gMLt3Jx.png) ![image](https://hackmd.io/_uploads/SJqmgeonJx.png) ❌Case 2 : Invalid read If axi-lite channel tempts to read tap parameters when the engine is doing `CALC`, it is invalid read, return `32'hffffffff`. Note that `data_length`, `tap_num`, `ap_ctrl` could be read at any state, just return their values at that moment. #### Entire FIR fsm ![image](https://hackmd.io/_uploads/Hk2TrUYnyl.png) * **`IDLE`** : (1) Configure coefficients, including write/read(check) `data_length`, `tap_num` and tap coefficients. (2) Initialize dataRAM * **`CALC`** : After `ap_start`, the core engine starts to do calculation. * **`DONE`** : After the last data is transferred, enter this state. When reading address 0x00, it will exit this state and return to the `IDLE` state. #### Configuration registers * **`0x00`**: `ap_ctrl` = {ap_idle, ap_done, ap_start} * **`0x10-14`**: `data_length`, valid wirte during `IDLE` state * **`0x14-18`**:` tap_num`, valid wirte during `IDLE` state #### tapRAM signal processor * **Write operation :** `IDLE` state, axilite write channel ![image](https://hackmd.io/_uploads/SJN_-85nJx.png) Write in the tap parameters sent in via axilite write channel. * **Read operation :** `IDLE` state, axilite read channel ![image](https://hackmd.io/_uploads/rykFZIcnyg.png) Read out the written parameters and check their values. * **Address generation :** `CALC` state ![image](https://hackmd.io/_uploads/BJL7PI92Je.png) For each data cycle, tapRAM should continuously generate tap parameters for the calculation. Use `tap_cnt` to track these parameters : In each cycle, `tap_cnt` resets to 0 and then increments continuously. #### dataRAM signal processor * Initialize: Before new calculation starts, can be in `DONE` or `IDLE` state ![image](https://hackmd.io/_uploads/HJCNw852Jl.png) Before performing calculations, each address in `bram32` should be set to 0. * Read/ Write operation : `CALC` state After both `ss_tvalid` and `ss_tready` are `1`, FIR enters a new data transfer cycle. The first data is provided by `ss_tdata_latch`, then stored into dataRAM, and the remaining values are generated by continuously reading from the previously stored addresses. * Address accessing ![image](https://hackmd.io/_uploads/B10rt85hkg.png) * First data (right after `ss_tvalid && ss_tready`) ![image](https://hackmd.io/_uploads/SJSYP8chyl.png) Write in the new data stored in `ss_tdata_latch` * Remaining data ![image](https://hackmd.io/_uploads/SkaKD85nyg.png) Reading previously stored data #### Axi-stream interface (using Data transfer fsm) Define a complete **data transfer cycle** as : ``` receiving input data (ss_tready && ss_tvalid) -> doing calculation (dt_state == CALC) -> transferred output data (sm_tready && sm_tvalid) -> wait till next input comes (dt_state == WAIT) ``` I use a **Data transfer fsm** to control the data flow : ![image](https://hackmd.io/_uploads/Hy6Ak9c2kg.png) * **`DT_WAIT`** : Last data cycle complete, `ss_tready` pulls up and wait for testbench signal `ss_tvalid` to start a new data cycle * **`DT_PROC`**: SS bus handshake complete, start to calculate `y` * **`DT_DONE`** : A valid `y` has been calculated, `sm_tvalid` is asserted, and it waits for the testbench signal `sm_tready` to complete the SM bus handshake. * **`DT_IDLE`** : Finish all calculations and wait for a new set of data, once `ap_start` is programmed, go to `DT_WAIT` and start a new data transfer cycle. And the **data transfer waveform** would be like : ![image](https://hackmd.io/_uploads/S1cVgqqn1g.png) * Each "->" means delay one cycle -tap transfer : tap_cnt -> tap_Do -> h -data transfer : x_r_cnt -> data_Do -> x -calculation : x * h -> mul -> (sigma) y -> sm_tdata * When tap_cnt == 3, y should be set to the first mul value so that it can start accumulating for the new data ![DE9CECA4-0316-4019-B710-9C047BD3A0BC (1)](https://hackmd.io/_uploads/SkiBfejnyx.jpg) #### Core engine Since addition and multiplication occur in separate pipeline cycles, I store the two operations in different registers. ![image](https://hackmd.io/_uploads/SklLWa1o31e.png) #### Counter * `y_cnt` counts the number of valid y outputs. When a valid `y` is successfully transferred after an SM bus handshake, `y_cnt` increments by `1`. * When the last `y` is transferred (i.e., `y_cnt == data_length`), `sm_tlast` will be set to 1. ![image](https://hackmd.io/_uploads/HkQfkeonkl.png) --- ### Results * Throuput : 11116 ```verilog= -----------Congratulations! Pass------------- -------------------third dataset complete cycle count 11116(to know best case throuput) --------------------- ``` * Total cell area : 18500.832235 ```verilog= Number of ports: 330 Number of nets: 3304 Number of cells: 2978 Number of combinational cells: 2648 Number of sequential cells: 321 Number of macros/black boxes: 0 Number of buf/inv: 558 Number of references: 83 Combinational area: 13495.305825 Buf/Inv area: 2483.712042 Noncombinational area: 5005.526410 Macro/Black Box area: 0.000000 Net Interconnect area: undefined (No wire load specified) Total cell area: 18500.832235 Total area: undefined ``` --- ### Debugging Initially, I didn't consider the possibility of implementing multiple test cases in a single testbench, so I only used three data states: `DT_WAIT`, `DT_PROC`, and `DT_DONE` to control the data flow. However, once a test case has completed all calculations, data transfer should pause until the next ap_start is triggered, allowing the data transfer cycle to restart. So, I added an extra state : `DT_IDLE`, to handle this situation. --- ### Fir rule check #### Rule 1: Design should not be custimized for testbench ✅ Designed for spec #### Rule 2: Do not use specific hardcoded constant in design ❌ I should use `localparam` to define those constant Original: ```verilog= reg [4:0] x_w_cnt; wire [4:0] x_w_cnt_tmp; reg [4:0] x_r_cnt; reg [4:0] x_r_cnt_tmp; reg [4:0] addr_cnt; ``` Revised: ```verilog= parameter pDATA_WIDTH = 32; localparam CNT_WIDTH = $clog2(pDATA_WIDTH); reg [(CNT_WIDTH-1):0] x_w_cnt; wire [(CNT_WIDTH-1):0] x_w_cnt_tmp; reg [(CNT_WIDTH-1):0] x_r_cnt; reg [(CNT_WIDTH-1):0] x_r_cnt_tmp; reg [(CNT_WIDTH-1):0] addr_cnt; ``` #### Rule 3: Mul/Add in separate pipeline cycle ✅ I use registers to store the result of mul/add, so they are in seperate pipeline cycle. ```verilog= //************************************************************************************************************// //****************************************** core engine: convolution*****************************************// // multipler assign x_tmp = (tap_cnt == 1) ? ss_tdata_latch : data_Do; assign h_tmp = tap_Do; assign mul_tmp = x * h; // adder always@ (posedge axis_clk or negedge axis_rst_n) begin if(!axis_rst_n) begin y <= 0; end else if(state == IDLE) begin y <= 0; end else if (tap_cnt == 3) begin y <= mul; end else begin y <= y + mul; end end always@ (posedge axis_clk or negedge axis_rst_n) begin if(!axis_rst_n) begin x <= 0; h <= 0; mul <= 0; end else if(state == IDLE) begin x <= 0; h <= 0; mul <= 0; end else begin x <= x_tmp; h <= h_tmp; mul <= mul_tmp; end end //****************************************** core engine: convolution*****************************************// //************************************************************************************************************// ``` #### Rule 4: Do not use DSP ❌ Since I didn't specify *use_dsp = "no"*, when simulating the fir module, it may use DSP. #### Rule 5: Coding should be concise ✅ I think I’ve used clear signal names and added helpful comments. ❌ I could design a cleaner data path, for example by relying solely on FSMs to control the signals. #### Rule 6: Avoid Faulty Logic - Not qualify by control signal ✅ I've clearly read the spec, so I think there's no fauty logic. #### Rule 7: ap_start, ap_idle, ap_done should be separatly controlled ❌ I use `ap_ctrl[2:0]` to control those signals. Original: ```verilog= // 0x00: ap_ctrl = {ap_idle, ap_done, ap_start} always @* begin ap_ctrl[0] = (awaddr == 12'h00 && wdata [0] == 1'b1 && awready && wready) ? 1'b1 : 1'b0; ap_ctrl[1] = (state == DONE) ? 1'b1 : 1'b0; ap_ctrl[2] = (state == IDLE) ? 1'b1 : 1'b0; end ``` Revised: ```verilog= // 0x00: ap_ctrl = {ap_idle, ap_done, ap_start} always @* begin ap_start = (awaddr == 12'h00 && wdata [0] == 1'b1 && awready && wready) ? 1'b1 : 1'b0; ap_done = (state == DONE) ? 1'b1 : 1'b0; ap_idle = (state == IDLE) ? 1'b1 : 1'b0; end ``` #### Rule 8: Avoid input to output path ❌ I use a **Data transfer fsm** to control this data path from input to output: ![image](https://hackmd.io/_uploads/Hy6Ak9c2kg.png) However, I think if we reverse the design to use separate FSMs and registers (to latch the current I/O), the data path would be more efficient. This way, we wouldn’t have to wait for the last data cycle to complete before starting the next one. The valid data could be held in registers, allowing the next data path to proceed immediately. #### Rule 9: AXI bus signals should not be in the FSM ✅ As specified in **Rule 8**, the data path control excludes AXI bus signals. #### Rule 10: Should not Lock-step on Xin (ss_tready), Yout (sm_tavlid) ❌ This problem has already illustrated in **Rule 8**. #### Rule 11: Avoid Unnessary registers/latches ✅ I’ve reviewed and reorganized my design, and I believe I didn’t use any redundant registers or latches. #### Rule 12: What is your design II, i.e. Y output rate? ```verilog 11116/300 ≈ 37 ``` The ideal case is `32`. I think I should have done these two things to achieve the ideal case: (1) The counting should not rely on `tap_cnt`, but rather on the FSM state. (2) Change the write of 1c and the read of 32c into a read of 32c by using a multiplexer to select the first input to the multiplier from the current input. --- ### Github link [lab3_fir](https://github.com/nina1032024/SoC/tree/main/spring_lab/lab3_fir)