# Lab 4-2 Caravel FIR
Student ID: 20726557

This lab integrates Lab 3-FIR and Lab 4-1 exmem-FIR into the Caravel user project area. A Wishbone-to-AXI interface is designed to communicate with the FIR engine from Lab 3.

---
## Design block diagram – datapath, control-path

## The interface protocol between firmware, user project, and testbench
### Firmware and User Project
The firmware and the user project exchange data over the Wishbone protocol. The RISC-V CPU executes the instructions in the firmware, which transmit data through the Wishbone interface to the FIR engine and the EXMEM. A Wishbone decoder examines the address of each transaction: the FIR engine is accessed at 0x3000_xxxx, while the EXMEM is accessed at 0x3800_xxxx.
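
The decoding rule above can be sketched in C as a host-side model. This is a minimal illustration, not the RTL of the decoder; the names `wb_target_t`, `wb_decode`, and `wb_offset` are hypothetical, and the sketch assumes the decoder looks only at the upper 16 address bits.

```c
#include <assert.h>
#include <stdint.h>

/* Targets behind the Wishbone decoder (illustrative names). */
typedef enum { TARGET_NONE, TARGET_FIR, TARGET_EXMEM } wb_target_t;

/* Select a target from the upper 16 address bits:
   0x3000_xxxx -> FIR engine, 0x3800_xxxx -> EXMEM. */
static wb_target_t wb_decode(uint32_t addr)
{
    switch (addr >> 16) {
    case 0x3000u: return TARGET_FIR;
    case 0x3800u: return TARGET_EXMEM;
    default:      return TARGET_NONE;
    }
}

/* Offset of the access inside the selected 64 KiB window. */
static uint32_t wb_offset(uint32_t addr)
{
    assert(wb_decode(addr) != TARGET_NONE); /* only valid for decoded addresses */
    return addr & 0xFFFFu;
}
```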

The firmware accesses the FIR engine in the User Project through a Wishbone-to-AXI conversion layer. The Wishbone-to-AXI layer detects the Wishbone request to access the FIR engine and converts the access into AXI-Lite or AXI-Stream format.
This lab uses the following address offsets to access the FIR engine registers:
* 0x00: AP Configuration Register (AXI-Lite)
* 0x10-13: Data Length Register (AXI-Lite)
* 0x14-17: Number of Taps (AXI-Lite)
* 0x40-43: X[n] input (AXI-Stream)
* 0x44-47: Y[n] output (AXI-Stream)
* 0x80-100: Tap parameters (AXI-Lite)
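
The register list can be written down as a small C header-style sketch, together with the choice between AXI-Lite and AXI-Stream that the conversion layer makes. The macro and function names are hypothetical, and the word-alignment check is an assumption rather than something the design is known to enforce.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offsets from the FIR base address, taken from the list above.
   Macro names are illustrative. */
#define FIR_AP_CTRL   0x00u
#define FIR_DATA_LEN  0x10u
#define FIR_TAP_NUM   0x14u
#define FIR_X_IN      0x40u
#define FIR_Y_OUT     0x44u
#define FIR_TAP_BASE  0x80u

typedef enum { PROTO_AXI_LITE, PROTO_AXI_STREAM } fir_proto_t;

/* X[n] and Y[n] travel over AXI-Stream; every other offset is a
   configuration access and goes over AXI-Lite. */
static fir_proto_t fir_protocol(uint32_t offset)
{
    assert(offset % 4u == 0u); /* word-aligned accesses only (assumption) */
    if (offset == FIR_X_IN || offset == FIR_Y_OUT)
        return PROTO_AXI_STREAM;
    return PROTO_AXI_LITE;
}
```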

The EXMEM is accessed directly through the Wishbone decoder to store and load the firmware code for the FIR engine. The Wishbone signals are converted into BRAM control signals. The Wishbone master receives an acknowledgement after a delay set by a parameter.
### Firmware and Testbench
The firmware and testbench communicate through the mprj_io signals. The testbench uses mprj_io[23:16] as the checkbits to determine the state of the project. When the firmware loads the FIR engine, it sets the checkbits to 8'hA5 to notify the testbench to start checking the FIR output and measuring the latency. After the FIR process has completed, the firmware writes 8'h5A to the checkbits. The testbench then compares the final Y[n] output with a reference output and records the latency. This is repeated 3 times and the total latency is recorded.

---
## Waveform and analysis of the hardware/software behavior.
### Software Behavior
#### Moving code from SPIFlash to EXMEM

The firmware copies the code that accesses the FIR engine from SPIFlash to the EXMEM. This is done through the Wishbone interface. After a delay set by a parameter, an acknowledgement is sent back through the Wishbone interface to complete the write operation.
#### Reading code from EXMEM

When the FIR engine is enabled, the code is fetched from the EXMEM. This also goes through the Wishbone interface, and the acknowledgement is returned after the same delay once the read has completed.
#### Initfir()

The FIR engine is initialized by writing the data length, the number of taps, and the tap coefficients to the corresponding addresses. The software does this by writing the values through the Wishbone interface.
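
A minimal host-side sketch of this initialization sequence follows. It is not the lab's actual `initfir()`: the array `fir_regs` is a stand-in for the memory-mapped FIR window so that the code runs without hardware, and `fir_write` and `initfir_sketch` are hypothetical names. The offsets are the ones listed in the register map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIR_DATA_LEN  0x10u
#define FIR_TAP_NUM   0x14u
#define FIR_TAP_BASE  0x80u

/* Stand-in for the memory-mapped FIR register space (assumption).
   On the target each store would become one Wishbone write. */
static uint32_t fir_regs[0x100 / 4];

static void fir_write(uint32_t offset, uint32_t value)
{
    assert(offset % 4u == 0u && offset < sizeof fir_regs);
    fir_regs[offset / 4u] = value;
}

/* Program data length, tap count, then each tap coefficient. */
static void initfir_sketch(uint32_t data_len, const int32_t *taps, uint32_t n_taps)
{
    assert(taps != NULL);
    fir_write(FIR_DATA_LEN, data_len);
    fir_write(FIR_TAP_NUM, n_taps);
    for (uint32_t i = 0; i < n_taps; i++)
        fir_write(FIR_TAP_BASE + 4u * i, (uint32_t)taps[i]);
}
```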

#### fir()

The software starts the FIR engine by setting the AP configuration register to 0x01. The software then transmits X[n] and receives Y[n] through the Wishbone interface. The data length is 64.
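
For reference, the values that Y[n] is expected to take can be computed with a direct-form FIR in software. The sketch below is a generic reference model, not the lab's firmware or testbench code; it assumes the sample history starts at zero, and `fir_reference` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[n] = sum over k of taps[k] * x[n-k], with x[m] = 0 for m < 0. */
static void fir_reference(const int32_t *taps, size_t n_taps,
                          const int32_t *x, int32_t *y, size_t len)
{
    assert(taps != NULL && x != NULL && y != NULL);
    for (size_t n = 0; n < len; n++) {
        int32_t acc = 0;
        /* k <= n keeps the index n - k inside the input array */
        for (size_t k = 0; k < n_taps && k <= n; k++)
            acc += taps[k] * x[n - k];
        y[n] = acc;
    }
}
```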

#### Communication with Testbench

The software communicates with the Verilog testbench through mprj_io. When the checkbits are set to 8'hA5, the testbench starts the latency counter. When the FIR engine completes, the software sets the checkbits to 8'h5A, allowing the testbench to determine the total latency.
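
The checkbits handshake amounts to updating bits [23:16] of the word that drives mprj_io. A minimal sketch follows, in which the variable `mprj_io_model` stands in for the real GPIO data register (an assumption so the code runs on a host) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CHECKBITS_SHIFT 16u
#define CHECKBITS_MASK  (0xFFu << CHECKBITS_SHIFT)
#define CHECK_START     0xA5u   /* testbench starts the latency counter */
#define CHECK_DONE      0x5Au   /* testbench stops it and checks Y[n]   */

/* Stand-in for the register that drives mprj_io (assumption). */
static uint32_t mprj_io_model;

/* Update only mprj_io[23:16], leaving the other bits untouched. */
static void set_checkbits(uint8_t code)
{
    assert(code == CHECK_START || code == CHECK_DONE);
    mprj_io_model = (mprj_io_model & ~CHECKBITS_MASK)
                  | ((uint32_t)code << CHECKBITS_SHIFT);
}

static uint8_t get_checkbits(void)
{
    return (uint8_t)((mprj_io_model & CHECKBITS_MASK) >> CHECKBITS_SHIFT);
}
```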

### Hardware Behavior
#### Wishbone-to-EXMEM Delay Interface

The Wishbone-to-EXMEM Delay Interface handles read/write operations between the firmware and the EXMEM. It converts the Wishbone signals into BRAM control signals so that the data can be read and written. The wbs_ack_o acknowledgement signal is asserted after a delay set by the DELAYS parameter, implemented with a delay counter.

#### Wishbone-to-AXI Interface
AXI-Lite

AXI-Stream

To communicate with the FIR engine, the Wishbone signals from the CPU are converted to the AXI protocol. Accesses to the AP Configuration Register, the Data Length and Tap Number registers, and the Tap Coefficient RAM are converted to the AXI-Lite protocol. The data read and write transactions of the FIR engine (X[n] and Y[n]) are converted from Wishbone to the AXI-Stream protocol.

#### FIR Engine

The FIR Engine receives the AXI-Lite and AXI-Stream transactions to compute the FIR outputs. When ap_start is set to 1, the FIR engine starts the computation by reading X[n] and outputting Y[n] through the AXI-Stream interface.

---
## What is the FIR engine theoretical throughput, i.e. data rate? Measured throughput?

The theoretical delay per Y[n] output is 33 cycles. With a data length of 64, the theoretical total delay is 33 × 64 = 2112 cycles.
Therefore, the theoretical throughput, counted in runs of 64 data, is the following:
```
Theoretical throughput: 1/2112 = 0.00047348484 runs (of 64 data) per cycle
```
However, the measured delay for a data length of 64 is 40006 cycles, which is much larger than expected.

```
Measured throughput: 1/40006 = 0.00002499625 runs (of 64 data) per cycle
```
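
The same comparison can be expressed per sample by dividing the data length by each cycle count. A minimal sketch (the function name is illustrative):

```c
#include <assert.h>

/* Samples produced per clock cycle for a run of `samples` outputs. */
static double samples_per_cycle(unsigned samples, unsigned long cycles)
{
    assert(cycles > 0);
    return (double)samples / (double)cycles;
}
```

With the cycle counts above, `samples_per_cycle(64, 2112)` is about 0.0303 and `samples_per_cycle(64, 40006)` is about 0.0016, so the measured run is roughly 19 times slower than the theoretical one.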
The measured delay exceeds the theoretical value because of the time between successive X[n] inputs. The FIR engine stalls and waits until the next X[n] input is received through the AXI-Stream interface.

---
## What is the latency for firmware to feed data?
The latency for the firmware to feed data can be determined from the number of cycles between two consecutive X[n] inputs on the AXI-Stream interface.


```
Latency: (1586537500 - 1574937500) ps / (25000) ps = 464 cycles
```
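
The conversion from waveform timestamps to cycles can be sketched as below. The 25000 ps clock period in the calculation above corresponds to a 40 MHz clock; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Whole clock cycles between two waveform timestamps given in ps. */
static uint64_t ps_to_cycles(uint64_t t_start_ps, uint64_t t_end_ps,
                             uint64_t period_ps)
{
    assert(period_ps > 0 && t_end_ps >= t_start_ps);
    return (t_end_ps - t_start_ps) / period_ps;
}
```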
---
## What techniques are used to improve the throughput?
Several techniques can be used to improve the throughput. In terms of hardware, the FIR engine can be optimized.
The FIR engine can only receive one X[n] and send one Y[n] at a time. If no data is sent from the RISC-V CPU to the FIR engine, the FIR engine stalls, which reduces the throughput. To improve this, buffers can be added to store X[n] inputs and Y[n] outputs, so that the FIR engine does not stall even when tvalid on the SS/SM bus is not asserted.
Another possible optimization in the FIR engine is parallelism. In the current design, the multiply and sum operations are done in separate cycles. The FIR engine can exploit parallelism by using the multiplier and the adder in the same cycle: while the previous data is using the adder, the next data can use the multiplier. This reduces the FIR calculation delay and improves the throughput.
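
The benefit of overlapping the multiplier and the adder can be estimated with an idealized cycle-count model. This is only a sketch under stated assumptions (one cycle per multiply, one per add, no other overhead); it is not derived from the actual design and does not reproduce its 33-cycle figure. The function names are hypothetical.

```c
#include <assert.h>

/* Multiply and add in separate, non-overlapping cycles:
   two cycles per tap. */
static unsigned cycles_serial(unsigned n_taps)
{
    return 2u * n_taps;
}

/* The multiply of tap k+1 overlaps the add of tap k:
   one cycle per tap plus one cycle to drain the final add. */
static unsigned cycles_pipelined(unsigned n_taps)
{
    assert(n_taps > 0u);
    return n_taps + 1u;
}
```

For a hypothetical 11-tap filter the model gives 22 cycles without overlap and 12 cycles with it.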
In terms of software, the throughput can be improved as well. The firmware currently takes a long time to feed each data item. To improve this, the X[n] input can be decoupled from the Y[n] output, so that the next X[n] can be sent into the user project area before the previous Y[n] output has been received. This shortens the delay between data inputs and improves the throughput.

---
## Any other insights?
### Synthesis
#### Utilization Report


#### Timing Report
Using the 10 MHz FPGA synthesis


### GitHub
https://github.com/AnthonyGithub/EESM6000C-Lab-4/tree/main/Lab%204-2