[TOC]
---
# SOC Lab1
• Brief introduction of the system:
o Multiplication.cpp:
The C++ code implements the multiplication of two numbers.
o MultipTester.cpp:
The C++ testbench tests the inputs from 1 * 1 to 9 * 9.
If there is any error, it prints “Test Failed!”
Otherwise, it prints the result and “Test Passed!”
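The test flow described above can be sketched in Python as follows. This only illustrates the loop structure and is not the actual `MultipTester.cpp`; the names `multiply` and `run_test` are assumptions made for this sketch.

```python
def multiply(a: int, b: int) -> int:
    """Stand-in for the multiplication kernel (hypothetical name)."""
    return a * b


def run_test() -> bool:
    """Check every product from 1 * 1 to 9 * 9 against a reference value."""
    for a in range(1, 10):
        for b in range(1, 10):
            result = multiply(a, b)
            expected = sum(a for _ in range(b))  # reference by repeated addition
            if result != expected:
                print("Test Failed!")
                return False
            print(f"{a} * {b} = {result}")
    print("Test Passed!")
    return True


if __name__ == "__main__":
    run_test()
```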
• What’s observed and learned
o Installation of Ubuntu and VM:
We started using the terminal, got used to the common commands, and learned the important concept of directories.
o HLS:
When running the simulation and synthesis, we must comment out “#pragma HLS INTERFACE ap_ctrl_none port=return”. ap_ctrl_none removes the block-level control handshake (ap_start, ap_done, and so on), so it is meant for designs such as purely combinational logic that do not need that handshake.
o Vivado:
Importing an IP into a block design was new to us.
We also reviewed some implementation steps learned in the logic design lab course.
• Performance
o HLS:


• Utilization
o Vivado:

• Co-simulation transcript

• Jupyter Notebook



# SOC Lab2
• Brief introduction of the system:
o In this lab we implement a finite impulse response (FIR) filter.
o Two interfaces are used to implement it: M_AXI and Stream.
o What’s FIR?
A Finite Impulse Response (FIR) filter is a type of digital filter used in digital signal processing. Its response to an input signal has finite duration, which means that the output is determined solely by a finite number of past and present input samples. In other words, the output of an FIR filter is a weighted sum of the input samples within a finite time window.
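The weighted sum described above can be written as a short direct-form sketch. The function name `fir_filter` and the example taps are assumptions chosen for illustration; they are not the coefficients used in the lab.

```python
from typing import List, Sequence


def fir_filter(coeffs: Sequence[float], samples: Sequence[float]) -> List[float]:
    """Direct-form FIR: y[n] = sum_k coeffs[k] * x[n - k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, b in enumerate(coeffs):
            if n - k >= 0:  # samples before the start of the stream count as zero
                acc += b * samples[n - k]
        out.append(acc)
    return out


if __name__ == "__main__":
    taps = [0.25, 0.5, 0.25]        # symmetric taps, illustration only
    x = [1.0, 0.0, 0.0, 0.0, 0.0]   # unit impulse
    print(fir_filter(taps, x))      # the impulse response reproduces the taps
```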
• What’s observed and learned?
o What are the properties of FIR?
1. It requires no feedback. This means that rounding errors are not compounded by summed iterations; the same relative error occurs in each calculation. This also makes implementation simpler.
2. It is inherently stable, since the output is a sum of a finite number of finite multiples of the input values, so it can be no greater than $\sum_i |b_i|$ times the largest value appearing in the input, where $b_i$ are the filter coefficients.
3. It can easily be designed to be linear phase by making the coefficient sequence symmetric. This property is sometimes desired for phase-sensitive applications, for example data communications, seismology, crossover filters, and mastering.
o M_AXI:
Characteristics:
1. Max burst-length <= 256
2. Max transfer size <= 4KB
3. Bus width – power of two, between 32 and 512 bits
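As a rough illustration of how the first two limits interact, the sketch below splits one large transfer into bursts. It assumes the 4KB limit means that a burst may not cross a 4 KB address boundary (the AXI rule), and that the start address and length are aligned to the bus width. The function name `split_into_bursts` is hypothetical.

```python
from typing import List, Tuple

MAX_BEATS = 256   # maximum burst length listed above
BOUNDARY = 4096   # 4 KB limit listed above, treated as an address boundary


def split_into_bursts(start: int, nbytes: int, bus_bytes: int) -> List[Tuple[int, int]]:
    """Return (address, beats) pairs that together cover [start, start + nbytes)."""
    if start % bus_bytes or nbytes % bus_bytes:
        raise ValueError("start and nbytes must be aligned to the bus width")
    bursts = []
    addr, remaining = start, nbytes
    while remaining > 0:
        to_boundary = BOUNDARY - (addr % BOUNDARY)  # bytes left before the next 4 KB boundary
        chunk = min(remaining, to_boundary, MAX_BEATS * bus_bytes)
        bursts.append((addr, chunk // bus_bytes))
        addr += chunk
        remaining -= chunk
    return bursts


if __name__ == "__main__":
    # 1 KB starting 256 bytes below a 4 KB boundary on a 32-bit (4-byte) bus
    for address, beats in split_into_bursts(0x0F00, 1024, 4):
        print(hex(address), beats)
```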
Single vs Burst transfer:

Transaction waveform:

o Stream:
Basic handshake:
The following figure shows the transfer of data in an AXI4-Stream channel. The tvalid signal is driven by the source (master) side of the channel and tready is driven by the destination (slave) side. The tvalid signal indicates that the values in the payload fields (tdata and tlast) are valid. The tready signal indicates that the slave is ready to accept data. When both tvalid and tready are asserted in the same clock cycle, a transfer occurs.
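The handshake rule can also be checked with a small cycle-by-cycle model. This is a behavioral sketch, not the lab's RTL: the source is assumed to hold tvalid high whenever it still has data, and the sink's tready is supplied as a list with one entry per clock cycle. The name `stream_transfer` is hypothetical.

```python
from typing import List, Sequence, Tuple


def stream_transfer(data: Sequence[int], tready_pattern: Sequence[bool]) -> List[Tuple[int, int, bool]]:
    """Return (cycle, tdata, tlast) for every cycle in which a transfer occurs."""
    transfers = []
    idx = 0
    for cycle, tready in enumerate(tready_pattern):
        tvalid = idx < len(data)          # source has a valid beat to offer
        if tvalid and tready:             # transfer only when both are asserted
            tlast = idx == len(data) - 1  # mark the final beat of the packet
            transfers.append((cycle, data[idx], tlast))
            idx += 1
    return transfers


if __name__ == "__main__":
    # the sink stalls in cycles 0 and 3, so those cycles carry no data
    print(stream_transfer([10, 20, 30], [False, True, True, False, True]))
```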
Data transfer :

• Performance
o MAXI:

o Stream:

• Utilization
o MAXI:
HLS:

Vivado:

o Stream:
HLS:

Vivado:

• Interface
o MAXI:

o Stream:

• Jupyter Notebook
o FIRN11Stream on PYNQ :

o FIR_MAXI on KV-260 :

# SOC Lab3
• Brief introduction of the system:
o Just like lab2, but this time we had to use our own Verilog code to implement FIR.
o We use two interfaces, AXI-Lite and Stream, to implement the finite impulse response filter.
o What’s FIR?
A Finite Impulse Response (FIR) filter is a type of digital filter used in digital signal processing. Its response to an input signal has finite duration, which means that the output is determined solely by a finite number of past and present input samples. In other words, the output of an FIR filter is a weighted sum of the input samples within a finite time window.
• What’s observed and learned?
o What are the properties of FIR?
1. It requires no feedback. This means that rounding errors are not compounded by summed iterations; the same relative error occurs in each calculation. This also makes implementation simpler.
2. It is inherently stable, since the output is a sum of a finite number of finite multiples of the input values, so it can be no greater than $\sum_i |b_i|$ times the largest value appearing in the input, where $b_i$ are the filter coefficients.
3. It can easily be designed to be linear phase by making the coefficient sequence symmetric. This property is sometimes desired for phase-sensitive applications, for example data communications, seismology, crossover filters, and mastering.
o Verilog:
1. We implemented FIR in our own Verilog code, which made us more familiar with Verilog design, including timing issues, which were the hardest part to debug.
2. Implementing the pipelined calculation was also a major lesson in this lab, from calculating the addresses to handling all the control signals.
3. Determining which signals are asynchronous and which are synchronous is also vital in this lab.
o AXI-Lite:
1. The AXI-Lite interface has master components, an interconnect, and slave components. User logic connects to the AXI-Lite masters and slaves, and the masters and slaves are connected via the AXI-Lite interconnect.
2. Read Transaction

3. Write Transaction

o AXI-Stream:
1. Basic handshake:
The following figure shows the transfer of data in an AXI4-Stream channel. The tvalid signal is driven by the source (master) side of the channel and tready is driven by the destination (slave) side. The tvalid signal indicates that the values in the payload fields (tdata and tlast) are valid. The tready signal indicates that the slave is ready to accept data. When both tvalid and tready are asserted in the same clock cycle, a transfer occurs.
2. Data transfer :

• Verilog Implementation:
o Block diagram (with I/O)

o Datapath for calculation

o Datapath for calculation


o FIR flow and ram arrangement



• Timing report
o Timing constraint

o Timing summary

o Longest path

o Max CLK cycle: 10ns
• Utilization



• Simulation Waveform
o Coefficient program and read back
1. Write coefficient

2. Read

o Data-in stream-in and Data-out stream-out

o RAM control signals

# SOC Lab4
1. Design block diagram – Datapath, control-path
Datapath

2. The interface protocol between firmware, user project and testbench
a. Firmware
i. fir.h

In fir.h, we define parameters, addresses, and functions to make the firmware code more readable.
Functions:
1. adr_ofst: calculates the offset address
2. wb_write: writes data to the target address over Wishbone
3. wb_read: reads data from the target address
ii. fir.c
The firmware code follows this implementation flow:

1. Initialization:
a. Program the data length (the address and data are already defined in the .h file).
b. Use a for loop to write the coefficients. For the address offset, we use the function defined in the .h file.
c. Write 0x00CC0000 to the checkbits address to let the testbench know that all the coefficients have been written.
d. Check the coefficients: read each coefficient back and write it to the checkbits address so that the testbench can check it.
2. Fir implementation

a. Write 0x00A50000 to checkbits address as start mark.
b. Write ap_start.
c. Write the data (we set data[x] = x, which is 0-64). Data is written when ss_tready is asserted.
d. If sm_tvalid is asserted, which means y[t] is valid, we store y in the outputsignal[t] array.
e. Write the final outputsignal[N-1] and the end mark to MPRJ[31:0] over Wishbone for the testbench.
f. Run the calculation again.
g. Run it a third time.
b. Testbench (counter_la_fir_tb.v)
Testbench implementation flow:
i. Check that the data length is written.
ii. Check that the coefficients are written.
iii. Check that the coefficients are correct.
iv. Check whether ap_start is asserted. If so, start counting cycles until the final y is valid.
v. Display the total execution time, cycle count, and Y.
c. User project
Implements the Wishbone interface.
3. Waveform and analysis of the hardware/software behavior.
a. hardware behavior:
i. Synthesis
1. Slice logic

2. Memory (BRAM11, BRAM)
ii. timing summary
clock period: 25ns (wb_clk)

b. software behavior:
In counter_la_fir.out, we can see the RISC-V assembly code.

Unlike the FIR code in lab4-1, this time the firmware C code does not execute the FIR calculation itself. Instead, it handles the data transfer and the control signals.
4. What is the FIR engine theoretical throughput, i.e. data rate? Actual measured throughput?

Theoretical throughput = 1
Actual measured throughput = 1 if we look only at the data calculation part. However, many cycles of delay are added by the firmware instructions and memory accesses.
5. What is latency for firmware to feed data?

The blue lines mark wbs_adr_i = 0x30000080 (x[n]).


Therefore, the latency for the firmware to feed the data ≈ 2153880 - 2138700 = 15180 cycles.
6. What techniques are used to improve the throughput?
a. We used pipelining to improve the throughput (just as we did in lab3).
b. Use unsigned integers instead of signed integers (double the performance!).
c. Don’t use x++ in the function (this saves only a small number of cycles).
7. Does bram12 give better performance, and in what way?
We used the old bram11 provided in lab3, which can only be read or written at any one time.
However, bram12 can be read and written at the same time. With bram11 we always have to wait for the read to finish before we can write, and vice versa, so bram12 should give better performance.
8. Can you suggest other methods to improve the performance?
a. Use unsigned integers instead of signed integers (double the performance!).
b. Don’t use x++ in the function (this saves only a small number of cycles).
# SOC Lab5
1. Block diagram

PS PL interconnect

Detailed datapath with IP and interconnect (important wires are marked and explained)
2. FPGA utilization
a. ReadROMcode


b. OutputPin

c. Caravel_PS

d. Spiflash

e. CaravelSoC

f. Block RAM

3. The function of IP in this design
a. HLS
i. Read_romcode
ii. ResetControl
iii. Caravel_ps
b. Verilog: spiflash
4. Run these workloads on the Caravel FPGA
Done; the results are shown below.
5. Screenshots of the execution results
a. counter_wb

b. counter_la

c. gcd_la

6. caravel_fpga.ipynb
a. Initialize ROM_SIZE (max 8K).
b. ol = Overlay of the .bit file.
c. Allocate a DRAM buffer for ReadROMCODE and initialize it to 0.
d. Open the firmware code (.hex).
e. Calculate np_ROM_offset and npROM_index to determine the length of the code (32-bit alignment).
f. Write the firmware code into the BRAM.

g. Check the MPRJ_IO pin before the execution.

h. Print the ipOUTPIN.read(0x10).

0x10 controls the reset signal. Reset is active low, so it is 0 before the execution starts.
To start the Caravel SoC, we write 1 to the reset and read back 0x10 to confirm that the write succeeded.
i. The last part shows the result of the execution.

We simply print the values out to check.
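Steps d to f above (reading the .hex file and packing it into 32-bit words) can be sketched without any board-specific calls. The file format is an assumption: a line starting with `@` gives a byte address in hex, and every other line holds space-separated hex bytes. Little-endian packing and the function name `load_hex_words` are also assumptions, and the ROM_SIZE check is left out.

```python
import struct
from typing import Dict


def load_hex_words(hex_text: str) -> Dict[int, int]:
    """Parse hex text into {word_index: 32-bit little-endian word}."""
    words: Dict[int, int] = {}
    index = 0
    pending = bytearray()

    def flush() -> None:
        """Store the pending bytes as one word, zero-padding a partial word."""
        nonlocal index, pending
        if pending:
            pending.extend(b"\x00" * (4 - len(pending)))
            words[index] = struct.unpack("<I", bytes(pending))[0]
            index += 1
            pending = bytearray()

    for line in hex_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            flush()
            index = int(line[1:], 16) >> 2  # byte address -> 32-bit word index
            continue
        for token in line.split():
            pending.append(int(token, 16))
            if len(pending) == 4:
                flush()
    flush()
    return words


if __name__ == "__main__":
    sample = "@00000000\n6F 00 00 0B 13 00 00 00\n@00000010\nAA BB\n"
    for idx, word in sorted(load_hex_words(sample).items()):
        print(idx, hex(word))
```

In the notebook, each resulting word would then be copied into the allocated buffer at its word index.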

# SOC Lab6
• Firmware simulation
o Matrix Multiplication
Wave view (whole)

Wave view (write data in)

Wave view (check if the data written is correct)
While examining the waveform, we had a question:
Why is the array initialization written to address 0x00000000?
We found that the array used in the firmware is stored in the “dff” memory region, whose address is 0x00000000 (from firmware/sections.lds).

That is why any array calculation accesses this address to load or store data.


• Data1 = 1

• Data2 = 2

• Data3 = 3

o Quick Sort
Wave view (whole)

.header file

Why is the array initialization written to address 0x00000000?
We found that the array used in the firmware is stored in the “dff” memory region, whose address is 0x00000000 (from firmware/sections.lds).

That is why any array calculation accesses this address to load or store data.
To check the data written
Start mark = AB40

• Data1 = 893

Our question here was why wbs_addr_i = 0x00000000.
• Data2 = 40

• Data3 = 3233

• End mark = AB51

o FIR
We’ll skip this part since this is identical to lab4-1
o ISR – UART/rx & tx
UART structure
• uart_rx

• uart_tx

Wave view (overall)

./run_sim

• Block design


• Timing report
o Timing summary

o Clock summary
o MAX delay path

o Min delay path

• Resource report after synthesis

o Axi_uartlite

o blk_mem_gen

o blk_mem_gen_MEMORY

o Caravel_PS

o Caravel_SOC

o Caravel_SOC_MEMORY

o outputPIN

o Read_ROMcode

o Read_ROMcode_MEMORY

o Reset_ctrl

o Spiflash

• Latency for a character loop back using UART

The latency is the time difference between the orange line and the yellow line, which is approximately 1.4 ms.
• How do you verify your answer from notebook
o run_sim with all 4 functions combined

o FPGA verification on Jupyter notebook

o Discussion
After run_sim succeeded, we had some difficulty running the implementation on the FPGA from the Jupyter notebook. We later found that Python could not receive the data sent by the firmware because the Python code runs relatively slowly, so the firmware had already finished before Python sampled the signals for verification.
The solution we chose was to slow the firmware down: before each signal we want to sample for verification, we add a while loop (30000 counts) to wait for the Python code.
Below is one of the signals we want to test:


Also, the RISC-V compilation of the firmware code needs a small modification. The original compiler optimization setting produced a .hex file that was too large, so we changed the optimization flag to -O1 to get a smaller .hex file.
• Suggestions for improving latency for UART loop back
o Increase Baud Rate:
One of the simplest ways to improve UART communication speed is to increase the baud rate. Higher baud rates allow for faster data transmission between devices. Ensure that both the transmitter and receiver can support the selected baud rate.
o FIFO Buffers:
Use hardware FIFO buffers if available. These buffers can store multiple bytes of data, reducing the time the CPU spends handling individual characters. This can significantly improve throughput and reduce latency.
o DMA (Direct Memory Access):
If your microcontroller or platform supports it, use DMA to transfer data between the UART and memory without CPU intervention. DMA can reduce the latency introduced by CPU involvement in data transfer.
o Interrupts:
Utilize UART interrupts to handle data reception and transmission. This allows the CPU to perform other tasks while waiting for UART events, reducing overall latency.
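To see how much the baud rate alone contributes, the time one character spends on the wire can be estimated as below. The baud rates in the example are illustrative only; the report does not state which rate the lab design uses, and the function name `uart_frame_time_s` is hypothetical.

```python
def uart_frame_time_s(baud_rate: int, data_bits: int = 8,
                      parity_bits: int = 0, stop_bits: int = 1) -> float:
    """Time on the wire for one character: start bit + data + parity + stop bits."""
    bits = 1 + data_bits + parity_bits + stop_bits
    return bits / baud_rate


if __name__ == "__main__":
    for baud in (9600, 115200):  # example rates, not taken from the lab
        print(baud, f"{uart_frame_time_s(baud) * 1e3:.4f} ms per character")
```

A loop back sends the character in both directions, so the wire time counts twice before any interrupt handling is added.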
# SOC Lab Final Project
## Computation System Overview
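In this final project, building on Lab 4 and Lab 6, we designed an arbiter (Arbitrary), DMA, SDRAM, and hardware accelerators to improve on our earlier results. The hardware accelerators comprise three modules: the Fir used in Lab 3 and Lab 4, plus the newly designed Matrix Multiplication and Qsort. Three DMAs handle their data transfers, and the Arbitrary module decides the order in which their requests are served. Instructions and data sets are stored in SDRAM, so the hardware does not have to go through the CPU to send and receive data, which saves cycles. A further Prefetch design fetches more data within the 3T latency, which reduces the time spent fetching data even more.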

---
## Firmware
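First we modify sections.lds so that the data to be computed is placed in SDRAM.

In this final project, our firmware mainly sets the data addresses and confirms that all computations have completed. A, B, and C are the locations used by matrix multiplication, X and Y are the locations used by FIR, and Q is the location used by qsort.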



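Because the three computations run concurrently, and the waveform shows that FIR is the slowest, we report the end mark AB51 once the last FIR data is received. This also lets us confirm that the computed values have actually been written to their SDRAM locations.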

---
## Hardware Accelerator
In Lab 6, we ran fir, mm, and qs as firmware code on the Caravel SoC, but the CPU took too long to compute them, so we use hardware to speed up the computation.
### FIR & DMA

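The Fir design reuses the architecture from Lab 3 and Lab 4 and adds a Y_buffer, from which the DMA receives the computed data. When Fir finishes a computation, it sends the data to y_buffer and raises a full signal so that the DMA collects the data, while also waiting for the DMA to send in a new X.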

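DMA_fir covers the functions of the Lab 4 decoder and of the DMA itself. Its operation revolves around 4 states: RESET, IDLE, X_addr, and Y_addr.
In the IDLE state, if X_FF is empty (~x_FF_full), it enters the X_addr state and waits for arb to send data in; when dma_ack_o is raised, it returns to IDLE. If X_FF is full and y_FF is also full (y_FF_full), it enters the Y_addr state and waits for arb to collect the data; when dma_ack_o is raised, it also returns to IDLE.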
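
The state transitions described above can be summarized in a small behavioral model. This is a sketch of the next-state logic only, not the actual RTL. The assumption that RESET always moves to IDLE is ours, and the Python names are hypothetical.

```python
from enum import Enum


class State(Enum):
    RESET = 0
    IDLE = 1
    X_ADDR = 2
    Y_ADDR = 3


def next_state(state: State, x_ff_full: bool, y_ff_full: bool, dma_ack_o: bool) -> State:
    """Next-state function of the DMA_fir controller as described in the text."""
    if state is State.RESET:
        return State.IDLE          # assumption: RESET always falls through to IDLE
    if state is State.IDLE:
        if not x_ff_full:
            return State.X_ADDR    # X_FF empty: request a new X from arb
        if y_ff_full:
            return State.Y_ADDR    # X_FF full and y_FF full: hand Y over to arb
        return State.IDLE
    # X_ADDR and Y_ADDR both hold until the acknowledge arrives
    return State.IDLE if dma_ack_o else state


if __name__ == "__main__":
    s = State.RESET
    for x_full, y_full, ack in [(False, False, False), (False, False, False),
                                (False, False, True), (True, True, False)]:
        s = next_state(s, x_full, y_full, ack)
        print(s.name)
```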

### Matrix Multiplication
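The MM datapath is shown in the figure below. We build A_Ram and B_Ram from shift registers so that they match the multiplication steps that follow. We again use a pipelined design, so all the data can be computed in 16 cycles.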

### Q Sort
We insert elements using insertion sort: ten comparators find the index that decides where each new element is inserted.
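
A software sketch of this idea: every stored value is compared with the incoming value (one comparison per comparator), and the number of comparators that report "stored <= new" gives the insert position. The function names are hypothetical, and the sketch models the behavior only, not the hardware timing.

```python
from typing import List


def insert_index(sorted_values: List[int], new_value: int) -> int:
    """Insert position = number of stored values that are <= new_value."""
    flags = [stored <= new_value for stored in sorted_values]  # comparator outputs
    return sum(flags)


def insertion_sort_stream(values: List[int]) -> List[int]:
    """Insert the values one at a time, as they would arrive from the DMA."""
    result: List[int] = []
    for v in values:
        result.insert(insert_index(result, v), v)
    return result


if __name__ == "__main__":
    print(insertion_sort_stream([893, 40, 3233, 7, 40]))
```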
 
### Arbitrary
An arbiter (arb) that serves requests in priority order.
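
A minimal sketch of a priority arbiter is shown below. The priority order used in the example (fir, then mm, then qs) is purely hypothetical, since the report does not state the actual order, and the function name `arbitrate` is also an assumption.

```python
from typing import Dict, Optional, Sequence


def arbitrate(requests: Dict[str, bool], priority: Sequence[str]) -> Optional[str]:
    """Grant the first requester in priority order, or None if nobody requests."""
    for name in priority:
        if requests.get(name, False):
            return name
    return None


if __name__ == "__main__":
    order = ["fir", "mm", "qs"]  # hypothetical priority order
    print(arbitrate({"fir": False, "mm": True, "qs": True}, order))
```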

---
## SDRAM with SDRAM Controller
### Original Block Diagram
- The overall diagram of the SDRAM is shown below:

- The Wishbone cycle passes through the SDRAM controller, which reads data from and writes data into the SDRAM. We made some optimizations because the memory size in the original SDRAM source code is not enough.
#### FSM in SDRAM controller:

- Some details about each state:

- **tCASL=3T tPRE=3T tACT=3T tREF=7T**
#### In SDRAM:

- We decode the command sent from the controller and the mode register defined by the user.

- Read/write enable and address/data input/output

- Command pipelined

- A MUX selects the operation for the current state.
- A MUX detects read/write commands.
**You may wonder what the `Active` and `Precharge` states mean. Here is a brief explanation of SDRAM:**
1. Dynamic Storage:
*SDRAM stores data in cells that use capacitors to hold charge. The cells lose their charge over time due to leakage, so SDRAM needs to refresh the charge in the cells periodically to keep the stored data.*
2. Row Activation:
*Activating a row in SDRAM loads the contents of that row into a row buffer for read/write operations, which allows faster access to the data in the row.*
3. Precharge Operation:
*After a row has been accessed, the bank must be precharged: the open row is closed and the bit lines are returned to their idle level. This is essential for the proper functioning of the memory and prepares the bank for the activation of other rows.*
### Problems and Solutions for the Original Code:
#### Problem 1 - Address Mapping
``` verilog=
`define BA 9:8
`define RA 22:10
`define CA 7:0
```
- All data may end up stored in the same bank, since our linker script is designed as shown below:


- Each data type needs at least 12 bits, but the SDRAM uses only 8 bits for the column address, and bank_address[9:8] restricts our size.
#### Problem 2 - Address send into SDRAM
- The SDRAM has 4 banks to store data. The source code is shown below:
``` verilog=
blkRam #(.SIZE(mem_sizes), .BIT_WIDTH(DQ_BITS))
Bank0 (
    .clk(Sys_clk),
    .we(bwen[0]),
    .re(bren[0]),
    .waddr(Col_brst[9:0]),
    .raddr(Col_brst[9:0]),
    .d(bdi[0]),
    .q(bqd[0])
);
```
- Only 10 bits are used for the column address:
``` verilog=
READ: begin
    cmd_d = CMD_READ;
    a_d = {2'b0, 1'b0, addr_q[7:0], 2'b0};
    state_d = WAIT;
end
```
- Since the address in the assembly code increases by 4 each time, we may not need to append the `2'b0` at the LSB.
=> **Original memory size: 2^10(width)/4(shift left) * 4(banks) = 1K bytes**
#### Solution
- My design (remapping)
``` verilog=
`define BA 13:12
`define RA 22:14
`define CA 11:0
```
- Then we get 12 bits for the column address.
- I also remap `a_d`, since `a_d[10]` is the precharge signal. Note that `BA` is 13:12.
``` verilog=
READ: begin
    cmd_d = CMD_READ;
    a_d = {addr_q[11:10], 1'b0, addr_q[9:0]};
    ba_d = addr_q[9:8];
    state_d = WAIT;
end
```
- When reading:
``` verilog=
if (Read_enable) begin
    Bank <= Ba;
    Col <= {Addr[12:11], Addr[9:0]};
    Col_brst <= {Addr[12:11], Addr[9:0]};
end
```
- Decode the remapped address so that `addr[10]` (the precharge bit) is not loaded into the block memory.
- Now the address loaded into the memory has 12 bits:
``` verilog=
blkRam #(.SIZE(mem_sizes), .BIT_WIDTH(DQ_BITS))
Bank0 (
    .clk(Sys_clk),
    .we(bwen[0]),
    .re(bren[0]),
    .waddr(Col_brst[11:0]),
    .raddr(Col_brst[11:0]),
    .d(bdi[0]),
    .q(bqd[0])
);
```
- Also, the block memory in the source code does not seem to use the row address, so it may not support the open/closed page behavior.
=> **Memory size: 2^12(width) * 4 = 16K bytes**
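
The bit manipulation in this section can be cross-checked in software. The sketch below mirrors the bit ranges from the `define` lines and the two concatenations (`a_d` in the controller, `Col` in the SDRAM model). The Python function names are hypothetical, and the code models only the bit slicing, not the timing.

```python
from typing import Tuple


def split_address(addr: int, ba=(13, 12), ra=(22, 14), ca=(11, 0)) -> Tuple[int, int, int]:
    """Return (bank, row, column) using [msb:lsb] ranges like the `define lines."""
    def field(hi: int, lo: int) -> int:
        return (addr >> lo) & ((1 << (hi - lo + 1)) - 1)
    return field(*ba), field(*ra), field(*ca)


def pack_a_d(col: int) -> int:
    """Mirror a_d = {col[11:10], 1'b0, col[9:0]}: bit 10 (precharge) stays low."""
    return (((col >> 10) & 0x3) << 11) | (col & 0x3FF)


def unpack_col(a_d: int) -> int:
    """Mirror Col = {Addr[12:11], Addr[9:0]}: bit 10 is dropped again."""
    return (((a_d >> 11) & 0x3) << 10) | (a_d & 0x3FF)


if __name__ == "__main__":
    addr = 0x3ABC
    print("remapped:", split_address(addr))
    print("original:", split_address(addr, ba=(9, 8), ra=(22, 10), ca=(7, 0)))
    # every 12-bit column must survive the round trip with bit 10 kept low
    ok = all(unpack_col(pack_a_d(c)) == c and not (pack_a_d(c) >> 10) & 1
             for c in range(1 << 12))
    print("round trip ok:", ok)
```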
---
## Optimization
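First, if data reads go smoothly with latency = 1T, the three hardware blocks need 64x11 = 704 cycles (fir), 10x2+10+10x2 = 50 cycles (qsort), and 32x2+16+16 = 96 cycles (mm). Our optimization should therefore target the longest of these cycle counts.
For the first part, our design is as follows:

We measured the time required by fir, qsort, and mm separately: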
fir (1471 cycles)

mm (756 cycles)

qsort (315 cycles)

However, thanks to the arb, the three can actually run concurrently!
So the actual time to complete all three is 1570 cycles:

waveform:



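From the above we can see that an SDRAM read takes at least 7 extra cycles before the ack returns. To optimize further, we need to give the SDRAM a burst capability and combine it with prefetch.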
---
## SDRAM with Prefetch Buffer
### Design Consideration
- We prefetch data for faster read access, since SDRAM has a 3T CAS latency on reads.
### Block Diagram

### Prefetch Buffer:

- We have 3 prefetch buffers (FIR/MM/QS); here I use the FIR buffer to explain our idea.
Our prefetch buffer acts like a shift register: it shifts out the stored data when a read access arrives. Before the workload starts, it prefetches data until all buffers are filled. If a buffer is empty, the controller sends an `Empty` signal to tell the **arbiter** to give that buffer the lowest priority, since the buffer is `busy` being refilled.
### Prefetch Controller:

The figure above shows the design of our prefetch controller.
--- Setup
- During setup (before all buffers are FULL), data is stored in SDRAM first.
- After we get the initial address (sent by DMA) => start prefetching.
- Fill in data up to the buffer length, then set the state of that buffer to 'FULL'.
- After all prefetch buffers are 'FULL', assert ap_start.
--- Running
- If the input address matches the saved address => HIT
- On a HIT, terminate the Wishbone read request by sending the ACK immediately.
- If the buffer is Empty, start prefetching data from SDRAM into the buffer.
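
The setup and running phases above can be modeled in a few lines. This is a behavioral sketch under simplifying assumptions: addresses are word indices into a Python list that stands in for the SDRAM, accesses are sequential, and a refill completes instantly instead of taking a burst. The class name `PrefetchBuffer` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class PrefetchBuffer:
    """Sequential prefetch buffer of `depth` words in front of a slow memory."""

    def __init__(self, memory: List[int], start_addr: int, depth: int = 8) -> None:
        self.memory = memory
        self.depth = depth
        self.next_fetch = start_addr  # next address to prefetch
        self.entries: Deque[Tuple[int, int]] = deque()
        self.fill()                   # setup phase: fill before the workload starts

    def fill(self) -> None:
        """Fetch from memory until the buffer is FULL (models one burst)."""
        while len(self.entries) < self.depth and self.next_fetch < len(self.memory):
            self.entries.append((self.next_fetch, self.memory[self.next_fetch]))
            self.next_fetch += 1

    @property
    def empty(self) -> bool:
        return not self.entries

    def read(self, addr: int) -> Tuple[int, bool]:
        """Return (data, hit). A hit shifts the head entry out of the buffer."""
        if self.entries and self.entries[0][0] == addr:
            _, data = self.entries.popleft()
            hit = True
        else:
            data, hit = self.memory[addr], False  # miss: pay the SDRAM latency
        if self.empty:
            self.fill()                           # refill once the buffer drains
        return data, hit


if __name__ == "__main__":
    sdram = list(range(100, 120))
    buf = PrefetchBuffer(sdram, start_addr=0, depth=8)
    print([buf.read(a) for a in range(4)])
```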
#### SDRAM burst
- Since our prefetch buffer has a length of 8, a burst length of 8 lets us fill an empty buffer quickly.
## burst result




---
## Uart with I/O FIFO
### Design for Optimization
Original UART implementation flow vs. UART with FIFO
With a FIFO, we can lower the number of interrupts and make execution faster: we first keep the incoming data in the buffer, wait until the buffer is full, and then send the data all at once.
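
A simplified count of the interrupts involved is sketched below. It assumes one interrupt each time the FIFO fills, plus one more for any leftover characters at the end. The real design may raise interrupts differently, and the function name `interrupt_count` is hypothetical.

```python
import math


def interrupt_count(num_chars: int, fifo_depth: int = 1) -> int:
    """Interrupts needed when one is raised per full FIFO (or for the remainder)."""
    return math.ceil(num_chars / fifo_depth)


if __name__ == "__main__":
    for n in (8, 512):  # the data counts used in simulation and on the FPGA
        print(n, interrupt_count(n, 1), interrupt_count(n, 4))
```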
### How to implement into original design

Signals in FIFO


#### Simulation
In simulation we sent only 8 data items.
**Without FIFO**

Latency = 8397900 ns
**FIFO with depth 4**

Latency = 1966020 ns
### **FPGA**
On the FPGA, we sent 512 data items.
**Without FIFO**

Latency = 2.68s
**FIFO with depth 4**

Latency = 2.17s
#### Performance
| | Latency(cycle * period) | Metric(ms) | Improvement |
|:-----:|:--------:|:-------------------------:|:-------------------------:|
| Without FIFO | 114582 * 25ns | 44.78 | * |
| FIFO depth 4 | 54076 * 25ns | 10.48 |4.27x|
---
## Performance Summary
| | Software | Hardware without prefetch | Hardware with prefetch |
|:-----:|:--------:|:-------------------------:|:-------------------------: |
| MM | 55303 | 756 | X |
| FIR | 65890 | 1471 | 893 |
| QS | 14394 | 315 | X |
| total | 135587 | 1570 | 946 |
---
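The speedups implied by the "total" row of the performance summary table can be computed directly from its cycle counts. The dictionary keys and the function name `speedup` below are hypothetical labels for the three columns.

```python
# Total cycle counts copied from the "total" row of the performance summary table.
cycles = {
    "software": 135587,
    "hardware without prefetch": 1570,
    "hardware with prefetch": 946,
}


def speedup(baseline: str, improved: str) -> float:
    """Ratio of cycle counts: how many times faster `improved` is than `baseline`."""
    return cycles[baseline] / cycles[improved]


if __name__ == "__main__":
    print(f"{speedup('software', 'hardware without prefetch'):.1f}x")
    print(f"{speedup('software', 'hardware with prefetch'):.1f}x")
    print(f"{speedup('hardware without prefetch', 'hardware with prefetch'):.2f}x")
```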
## Problem & Solution