# SoC Lab3 Verilog FIR
[Github Link](https://github.com/dqrengg/SoC_Laboratory/tree/main/Lab3)
## FIR Design
### 1. Block Diagram

### 2. FSM
#### State diagram

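```verilog
// FSM transition conditions
assign init_done = (init_addr == Tape_Num - 1);           // INIT -> WAIT: clearing reaches the last data SRAM word
assign cfg_done  = (awaddr == 12'h000) & awvalid & awready // WAIT -> CALC: host writes ap_start = 1
                 & (wdata[0] == 1'b1) & wvalid & wready;   //   to address 0x000
assign all_done  = sm_tvalid & sm_tready & sm_tlast;       // leaves CALC: last Y accepted
```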
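
A minimal Python sketch of the next-state logic, assuming the three conditions above are the only transitions between `INIT`, `WAIT`, and `CALC`. The state the FSM enters once `all_done` fires is not stated in this section, so `CALC_EXIT_STATE` is a hypothetical choice; all function names are illustrative and not taken from the RTL.

```python
from enum import Enum

TAPE_NUM = 11  # Tape_Num in the RTL: 11 taps


class State(Enum):
    INIT = 0  # data SRAM is cleared, taps may be written
    WAIT = 1  # waiting for the host to write ap_start = 1
    CALC = 2  # streaming X in and Y out


# Hypothetical: where CALC goes after all_done is an assumption, not documented.
CALC_EXIT_STATE = State.INIT


def init_done(init_addr: int) -> bool:
    return init_addr == TAPE_NUM - 1


def cfg_done(awaddr: int, awvalid: bool, awready: bool,
             wdata: int, wvalid: bool, wready: bool) -> bool:
    # Write handshake to address 0x000 with ap_start (bit 0) set.
    return bool(awaddr == 0x000 and awvalid and awready
                and (wdata & 1) == 1 and wvalid and wready)


def all_done(sm_tvalid: bool, sm_tready: bool, sm_tlast: bool) -> bool:
    # Handshake of the last Y sample.
    return bool(sm_tvalid and sm_tready and sm_tlast)


def next_state(state: State, init_ok: bool, cfg_ok: bool, done_ok: bool) -> State:
    if state is State.INIT and init_ok:
        return State.WAIT
    if state is State.WAIT and cfg_ok:
        return State.CALC
    if state is State.CALC and done_ok:
        return CALC_EXIT_STATE
    return state  # otherwise hold the current state
```

For example, `next_state(State.WAIT, False, cfg_done(0x000, True, True, 0x1, True, True), False)` yields `State.CALC`.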
#### Logic controlled by FSM
| Signals | INIT | WAIT | CALC |
|---------|------|------|------|
| `Tap_Di` | `rdata` | `rdata` | N/A |
| `Tap_A` | `araddr` | `araddr` | `tap_addr` |
| `Tap_EN` | `awvalid` & `wvalid` \| `arvalid` | `awvalid` & `wvalid` \| `arvalid` | `1` |
| `Tap_WE` | `awvalid` & `wvalid` | `awvalid` & `wvalid` | N/A |
| `Data_Di` | `0` | N/A | `ss_tdata` |
| `Data_A` | `init_addr` | N/A | `data_addr` |
| `Data_EN` | `1` | `0` | `1` |
| `Data_WE` | `1` | `0` | `tap_addr == 0` |
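
The `Data_*` rows of the table act as a per-state multiplexer. The Python sketch below encodes only those four rows, using `None` for N/A (don't care); the function name and the string state labels are hypothetical.

```python
from typing import Optional, Tuple

# (Data_EN, Data_WE, Data_A, Data_Di); None means N/A in the table
Ports = Tuple[int, int, Optional[int], Optional[int]]


def data_sram_ports(state: str, init_addr: int, data_addr: int,
                    tap_addr: int, ss_tdata: int) -> Ports:
    """Drive the data SRAM ports as listed in the table."""
    if state == "INIT":
        # Clear the SRAM: write 0 at the address given by the init counter.
        return 1, 1, init_addr, 0
    if state == "WAIT":
        # Data SRAM disabled; address and data are don't care.
        return 0, 0, None, None
    if state == "CALC":
        # Always enabled; a new X is written only on the tap-0 cycle.
        return 1, int(tap_addr == 0), data_addr, ss_tdata
    raise ValueError(f"unknown state: {state}")
```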
### 3. ap_* protocol

#### ap_start
1. Set by the host when it writes `ap_start` = 1.
2. Reset by the engine when the first `X` input is sampled.
#### ap_done
3. Set by the engine when the last `Y` is transferred.
4. Reset by the engine when the host reads address 0x000 over AXI-Lite.
#### ap_idle
5. Reset by the engine when the first `X` input is sampled.
6. Set by the engine when the last `Y` is transferred.
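
A small event-driven Python sketch of the `ap_start`/`ap_done` rules. Two points are assumptions rather than statements from the text: writes to `ap_start` during calculation are dropped (in line with the AXI-Lite rule that writes are ignored in CALC), and a read of `0x000` returns the value before `ap_done` is cleared. `ap_idle` is left out; the class and method names are hypothetical.

```python
class ApStatus:
    """Tracks ap_start / ap_done in response to host and engine events."""

    def __init__(self):
        self.ap_start = 0
        self.ap_done = 0
        self.calc = False  # True while the engine is calculating

    def host_write_start(self):
        # 1. The host sets ap_start; assumed to be ignored during calculation.
        if not self.calc:
            self.ap_start = 1
            self.calc = True

    def first_x_sampled(self):
        # 2. The engine clears ap_start once the first X is accepted.
        self.ap_start = 0

    def last_y_transferred(self):
        # 3. The engine sets ap_done after the last Y handshake.
        self.ap_done = 1
        self.calc = False

    def host_read_0x000(self):
        # 4. Assumed clear-on-read: the host sees ap_done, then it is cleared.
        seen = (self.ap_start, self.ap_done)
        self.ap_done = 0
        return seen
```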
### 4. FIR Core
#### Pipeline stages
1. Address generation
2. SRAM access
3. Multiplication
4. Addition
#### Input/Output control
##### X input
* Tap address counts down from `10` to `0` for every output.
* Data address starts at `0` and advances by one each cycle; after `9` it wraps back to `0` in the next cycle.
* While Tap address is not `0`, the engine reads `x` from SRAM.
* While Tap address = `0`, `X` input overwrites the previous data in SRAM and the engine reads `x` from bypass.
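* No extra pointer is needed to emulate a shift register: each output takes 11 cycles while the data address wraps every 10, so the window of stored samples shifts by exactly one sample per output.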
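
The addressing scheme can be checked with a Python reference model that replays it sample by sample and compares the result with a direct convolution. This is a sketch of the arithmetic only (no pipeline, buffers, or stalls); `fir_circular` and `fir_direct` are hypothetical names, and any tap values used with it are arbitrary.

```python
TAPE_NUM = 11              # taps; tap address counts 10 -> 0
DATA_DEPTH = TAPE_NUM - 1  # data address counts 0 -> 9, then wraps


def fir_circular(taps, xs):
    """Replay the tap/data addressing; return (outputs, cycles), one tap per cycle."""
    assert len(taps) == TAPE_NUM
    sram = [0] * DATA_DEPTH  # data SRAM, cleared in INIT
    data_addr = 0
    ys, cycles = [], 0
    for x in xs:
        acc = 0
        for tap_addr in range(TAPE_NUM - 1, -1, -1):
            if tap_addr != 0:
                sample = sram[data_addr]   # older sample read from SRAM
            else:
                sram[data_addr] = x        # new X overwrites the oldest sample
                sample = x                 # and is used via the bypass
            acc += taps[tap_addr] * sample
            data_addr = (data_addr + 1) % DATA_DEPTH
            cycles += 1
        ys.append(acc)
    return ys, cycles


def fir_direct(taps, xs):
    """Golden model: y[n] = sum_k taps[k] * x[n - k], with x[m] = 0 for m < 0."""
    return [sum(taps[k] * xs[n - k] for k in range(len(taps)) if n >= k)
            for n in range(len(xs))]
```

With 600 input samples the model runs 600 × 11 = 6600 cycles, matching the 11 cycles per `Y` reported in the cycle count.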
##### Y output
* The engine passes the Tap address down the pipeline along with each operation.
* When the Tap address in the final stage (addition) is `0`, the `Y` output is ready.
#### Pipeline stall
##### Stall conditions
1. When no `X` arrives and the input buffer is empty.
2. When `Y` is not accepted and the output buffer is full.
##### How to resolve stall
Every stage, including address generation, keeps its data from the previous cycle until the stall condition clears.
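
A Python sketch of how the Tap address travels through the four stages and how a stall freezes everything. The stage names come from the list above; the class, the function, and the choice to model a single output starting at cycle 0 are illustrative assumptions.

```python
TAPE_NUM = 11
STAGES = ("address", "sram", "multiply", "add")  # the four pipeline stages


class TagPipeline:
    """Carries the tap address of each operation through the stages."""

    def __init__(self):
        self.slots = [None] * len(STAGES)  # slots[-1] is the addition stage

    def step(self, new_tap, stall):
        # On a stall every stage keeps last cycle's contents.
        if not stall:
            self.slots = [new_tap] + self.slots[:-1]
        # Y is ready when tap 0 sits in the addition stage.
        return self.slots[-1] == 0


def first_y_ready_cycle(stalled_cycles, max_cycles=64):
    """Cycle index at which the first Y becomes ready (None if never)."""
    pipe = TagPipeline()
    order = list(range(TAPE_NUM - 1, -1, -1))  # 10, 9, ..., 0
    i = 0
    for cycle in range(max_cycles):
        stall = cycle in stalled_cycles
        new_tap = order[i] if i < len(order) else None
        if not stall and i < len(order):
            i += 1  # the address counter also freezes on a stall
        if pipe.step(new_tap, stall):
            return cycle
    return None
```

Without stalls, tap 0 is generated in cycle 10 and reaches the addition stage three cycles later, so `first_y_ready_cycle(set())` is 13; each stalled cycle before that point delays `Y` by one cycle.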
### 5. AXI-Lite

* For **read operations**, `arready` is held at `1` except in the cases below. After a read address handshake (`arvalid` and `arready`), `rdata` becomes valid (`rvalid` asserted) in the next cycle, and `arready` stays de-asserted until the host accepts `rdata`.
* For **write operations**, the engine starts processing a write request once both `awvalid` and `wvalid` are asserted, and asserts `awready` and `wready` in the next cycle.
* When read and write requests arrive in the same cycle, the engine processes the read request first.
* If a cycle carries only a write request, the engine processes it in the next cycle. During that cycle the engine cannot accept a read request, so `arready` is `0`.
* During calculation, the engine ignores any write request but still asserts `awready` and `wready` as usual, so the handshake completes.
* During calculation, the engine returns `32'hFFFF_FFFF` for reading Tap and a valid value for reading status.
* `awaddr` and `rdata` are controlled by FSM output to handle invalid operations.
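
A cycle-level Python sketch of the arbitration rules listed above. The three state names, the event strings, and ending the read response on `rready` are assumptions made for illustration; the returned tap/status read data is not modeled.

```python
IDLE, WRITE, READ_RESP = "IDLE", "WRITE", "READ_RESP"


def axil_step(state, arvalid, awvalid, wvalid, rready, calc):
    """One clock of the AXI-Lite slave.

    Returns (next_state, arready, awready_wready, event) for this cycle.
    """
    if state == WRITE:
        # Write seen last cycle completes now; no read can be accepted.
        return IDLE, 0, 1, ("write_ignored" if calc else "write_done")
    if state == READ_RESP:
        # rdata is valid; arready stays low until the host takes it.
        if rready:
            return IDLE, 0, 0, "read_done"
        return READ_RESP, 0, 0, None
    # IDLE: arready is high; a read wins over a simultaneous write.
    if arvalid:
        return READ_RESP, 1, 0, "read_accept"
    if awvalid and wvalid:
        return WRITE, 1, 0, "write_accept"
    return IDLE, 1, 0, None
```

A simultaneous read and write is therefore served as `read_accept`, `read_done`, then `write_accept`, `write_done` (or `write_ignored` in CALC).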
### 6. AXIS
#### AXIS buffer
* The buffers sit between the engine and the `X` input / `Y` output.
* They are zero-delay buffers: while a buffer is empty, data passes straight through a bypass path.
* The pipeline stalls when the `Y` buffer is full or the `X` buffer is empty.
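
A minimal Python sketch of a zero-delay (fall-through) buffer. The depth of 2 is a hypothetical value, as are the class and method names; the model only shows the bypass and full/empty behavior described above. In the engine, an `X` buffer with `out_valid` low while an `X` is needed, or a `Y` buffer reporting `full()`, would correspond to the stall conditions.

```python
from collections import deque


class ZeroDelayBuffer:
    """Fall-through FIFO: when empty, input data is passed straight to the output."""

    def __init__(self, depth=2):  # depth is a hypothetical value
        self.q = deque()
        self.depth = depth

    def full(self):
        return len(self.q) == self.depth

    def cycle(self, in_valid, in_data, out_ready):
        """Simulate one clock. Returns (in_ready, out_valid, out_data)."""
        in_ready = not self.full()
        if self.q:
            out_valid, out_data = True, self.q[0]    # oldest stored entry
        else:
            out_valid, out_data = in_valid, in_data  # bypass path while empty
        push = in_valid and in_ready
        pop = out_valid and out_ready
        bypassed = push and pop and not self.q       # consumed in the same cycle
        if pop and self.q:
            self.q.popleft()
        if push and not bypassed:
            self.q.append(in_data)
        return in_ready, out_valid, out_data
```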
#### Master

#### Slave

## Testbench
### 1. Cycle count
* 6601 cycles in total for `data_length` = 600, i.e. 11 cycles per `Y` output.
* For this measurement, the AXIS interfaces are driven with no latency.
### 2. AXI-Lite Tap configuration
#### Valid configure
* The host first writes all taps into SRAM, then reads them back to verify.

* The host writes and reads back at the same time (for the same address, the write comes before the read).
* Short latency (0~3 ns)

* Long latency (0~2 cycles)

#### Invalid configure
* The host tries to write `0` to the taps during calculation; the engine should ignore the write.
* The host tries to read the taps during calculation; the engine should return `32'hFFFF_FFFF`.
(Only the long-latency case is shown.)

### 3. AXIS
* Check whether the `Y` outputs match the golden outputs.
* No latency (11 cycles per `Y` output)

* Short latency (11 cycles + 0~3 ns)

* Long latency (11 + 0~2 cycles)

### 4. Status check
* The host tries to write `ap_start` = 1, `ap_done` = 1, and `ap_idle` = 1 after the engine has started. The engine should ignore these invalid writes.

* The host checks the status after the calculation completes.

## Reports
### 1. Timing
* Clock period = 7 ns, freq = 142.857 MHz
* Slack = 0.904 ns
```
---------------------------------------------------------------------------------------------------
From Clock: axis_clk
To Clock: axis_clk
Setup : 0 Failing Endpoints, Worst Slack 0.904ns, Total Violation 0.000ns
Hold : 0 Failing Endpoints, Worst Slack 0.145ns, Total Violation 0.000ns
PW : 0 Failing Endpoints, Worst Slack 3.000ns, Total Violation 0.000ns
---------------------------------------------------------------------------------------------------
```
* Max delay path: from the FSM state register (`state_reg[2]`) to the clock enable of the X buffer (`x_buf_reg[0][0]/CE`)
```
Max Delay Paths
--------------------------------------------------------------------------------------
Slack (MET) : 0.904ns (required time - arrival time)
Source: state_reg[2]/C
(rising edge-triggered cell FDCE clocked by axis_clk {rise@0.000ns fall@3.500ns period=7.000ns})
Destination: x_buf_reg[0][0]/CE
(rising edge-triggered cell FDCE clocked by axis_clk {rise@0.000ns fall@3.500ns period=7.000ns})
Path Group: axis_clk
Path Type: Setup (Max at Slow Process Corner)
Requirement: 7.000ns (axis_clk rise@7.000ns - axis_clk rise@0.000ns)
Data Path Delay: 5.714ns (logic 1.269ns (22.209%) route 4.445ns (77.791%))
Logic Levels: 5 (LUT3=2 LUT5=2 LUT6=1)
Clock Path Skew: -0.145ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 2.128ns = ( 9.128 - 7.000 )
Source Clock Delay (SCD): 2.456ns
Clock Pessimism Removal (CPR): 0.184ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock axis_clk rise edge)
0.000 0.000 r
0.000 0.000 r axis_clk (IN)
net (fo=0) 0.000 0.000 axis_clk
IBUF (Prop_ibuf_I_O) 0.972 0.972 r axis_clk_IBUF_inst/O
net (fo=1, unplaced) 0.800 1.771 axis_clk_IBUF
BUFG (Prop_bufg_I_O) 0.101 1.872 r axis_clk_IBUF_BUFG_inst/O
net (fo=412, unplaced) 0.584 2.456 axis_clk_IBUF_BUFG
FDCE r state_reg[2]/C
------------------------------------------------------------------- -------------------
FDCE (Prop_fdce_C_Q) 0.478 2.934 f state_reg[2]/Q
net (fo=55, unplaced) 0.826 3.760 state[2]
LUT3 (Prop_lut3_I0_O) 0.295 4.055 f ss_tready_OBUF_inst_i_4/O
net (fo=23, unplaced) 1.174 5.229 ss_tready_OBUF_inst_i_4_n_0
LUT5 (Prop_lut5_I0_O) 0.124 5.353 f tap_addr_pre[9]_i_2/O
net (fo=2, unplaced) 0.460 5.813 y_stall
LUT6 (Prop_lut6_I5_O) 0.124 5.937 f tap_A_OBUF[11]_inst_i_2/O
net (fo=26, unplaced) 0.968 6.905 stall
LUT5 (Prop_lut5_I0_O) 0.124 7.029 r x_buf_wp[1]_i_2/O
net (fo=4, unplaced) 0.473 7.502 x_buf_wp0
LUT3 (Prop_lut3_I1_O) 0.124 7.626 r x_buf[0][31]_i_1/O
net (fo=32, unplaced) 0.544 8.170 x_buf[0][31]_i_1_n_0
FDCE r x_buf_reg[0][0]/CE
------------------------------------------------------------------- -------------------
(clock axis_clk rise edge)
7.000 7.000 r
0.000 7.000 r axis_clk (IN)
net (fo=0) 0.000 7.000 axis_clk
IBUF (Prop_ibuf_I_O) 0.838 7.838 r axis_clk_IBUF_inst/O
net (fo=1, unplaced) 0.760 8.598 axis_clk_IBUF
BUFG (Prop_bufg_I_O) 0.091 8.689 r axis_clk_IBUF_BUFG_inst/O
net (fo=412, unplaced) 0.439 9.128 axis_clk_IBUF_BUFG
FDCE r x_buf_reg[0][0]/C
clock pessimism 0.184 9.311
clock uncertainty -0.035 9.276
FDCE (Setup_fdce_C_CE) -0.202 9.074 x_buf_reg[0][0]
-------------------------------------------------------------------
required time 9.074
arrival time -8.170
-------------------------------------------------------------------
slack 0.904
```
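
The slack in the report can be recomputed from its own line items. The short Python sketch below copies the numbers from the report; because the report rounds each value to 1 ps (for example, it shows 9.311 after adding clock pessimism, where the sum of the printed values is 9.312), the recomputed slack lands within about 1 ps of the reported 0.904 ns.

```python
# Numbers copied from the max-delay path report (ns).
PERIOD = 7.000
SCD = 2.456                 # source clock delay
DATA_PATH = 1.269 + 4.445   # logic + route = 5.714
DST_CLK = 9.128             # destination clock edge incl. DCD (7.000 + 2.128)
CPR = 0.184                 # clock pessimism removal
UNCERTAINTY = 0.035
SETUP = 0.202               # FDCE setup (C -> CE)

arrival = SCD + DATA_PATH
required = DST_CLK + CPR - UNCERTAINTY - SETUP
slack = required - arrival
freq_mhz = 1000.0 / PERIOD

print(f"arrival={arrival:.3f} required={required:.3f} slack={slack:.3f}")
print(f"clock={freq_mhz:.3f} MHz")
```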
### 2. Resources
* LUT, FF
```
+-------------------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+-------------------------+------+-------+------------+-----------+-------+
| Slice LUTs* | 427 | 0 | 0 | 53200 | 0.80 |
| LUT as Logic | 427 | 0 | 0 | 53200 | 0.80 |
| LUT as Memory | 0 | 0 | 0 | 17400 | 0.00 |
| Slice Registers | 412 | 0 | 0 | 106400 | 0.39 |
| Register as Flip Flop | 412 | 0 | 0 | 106400 | 0.39 |
| Register as Latch | 0 | 0 | 0 | 106400 | 0.00 |
| F7 Muxes | 0 | 0 | 0 | 26600 | 0.00 |
| F8 Muxes | 0 | 0 | 0 | 13300 | 0.00 |
+-------------------------+------+-------+------------+-----------+-------+
```
* BRAM
```
+----------------+------+-------+------------+-----------+-------+
| Site Type | Used | Fixed | Prohibited | Available | Util% |
+----------------+------+-------+------------+-----------+-------+
| Block RAM Tile | 0 | 0 | 0 | 140 | 0.00 |
| RAMB36/FIFO* | 0 | 0 | 0 | 140 | 0.00 |
| RAMB18 | 0 | 0 | 0 | 280 | 0.00 |
+----------------+------+-------+------------+-----------+-------+
```