# Lab1
## Brief Introduction about the Overall System
This lab uses Vitis HLS and Vivado to turn C/C++ code into RTL hardware code, so that we can focus on the behavior/algorithm of the hardware instead of the RTL coding.

The two tools are connected as follows:
* Vitis HLS: Synthesizes C++ algorithms into hardware IP.
* Vivado: Generates block design applications for interfacing with the synthesized IP.

In Vitis HLS, we go through these three steps to verify the C code and the design generated from it:
* C sim (C simulation): Simulates the correctness of the C++ algorithm.
* C synthesis: Synthesizes C++ code into RTL hardware language.
* Co-sim (C/RTL co-simulation): Compares C-simulation and RTL-simulation results.
## What is Observed & Learned
* Basic usage of Vivado and Vitis HLS
* How the .bit, .hwh, and .ipynb files are used when running on the board
* Pragmas are used for directives, enabling constraints and optimizations.
# Lab2
## Brief Introduction about the Overall System
We need to implement the FIR computation in hardware using different AXI protocols. We used two types of AXI interface:
* M-AXI (FIR_N11_MAXI): Directly accesses DDR using the M-AXI protocol. No need for DMA assistance.
* S-AXI (FIR_N11_Stream): Uses AXI-Stream as the interface. Requires DMA (adapter) assistance for external DDR read/write. Offers a more modular approach by delegating address handling to other IPs.
## What is Observed & Learned
MAXI & Stream interface differences:
* Both have handshake mechanisms (valid & ready).
* AXI-Stream lacks an address interface. It only transmits and receives data and does not care about the address or the burst length.
* M-AXI involves address-based read/write operations.

While writing the Lab2 report, we also learned the following:
* AXI-Stream, AXI-Full (M-AXI, AXI master/slave), and AXI-Lite all belong to the AMBA AXI4 protocol family.
* AXI-Lite handles small data transfers with address and handshake.
* AXI-Full (unlike AXI-Lite) supports burst transfers, moving large blocks of data after a single address handshake.
* AXI-Stream lacks an address concept but includes handshake for data transmission.
# Lab3
GitHub link: https://github.com/b3nsonchang/SoC_lab3_fir

This repository includes the waveform, design & testbench, synthesis report, and the lab report.
## Circuit & Function SPEC
Implementation of a finite impulse response (FIR) filter with n = 11, using only 1 multiplier, 1 adder, and 2 BRAMs with 11 entries each. The circuit needs to compute

![image](https://hackmd.io/_uploads/HkS_5Wo_p.png)
## What is Observed & Learned
* The Verilog code has to be careful about the number of arithmetic operators it uses; otherwise the design exceeds the one-multiplier, one-adder limit and fails the spec.
* AXI-Lite protocol decoding
# Lab4-1
## Brief Introduction about the Overall System
The firmware code **fir.c** simulates a shift RAM, multiplies its contents with the corresponding tap parameters, and accumulates the result.

**counter_la_fir.out** is used to store the registers into data as an init step.
# Lab4-2
## Brief Introduction about the Overall System
We design a Wishbone protocol decoder that translates the Wishbone protocol into the AXI-Lite protocol, so the design we made in Lab3 can do the same computation even if the system does not use the same data transfer protocol as the design.

With this diagram,

![image](https://hackmd.io/_uploads/ByDKbzou6.png)

we can draw a more detailed block diagram,

![image](https://hackmd.io/_uploads/Hy3aZfiOp.png)

There are other address specs we need to take care of, such as the AP signal address and the X[n] & Y[n] addresses.
## What is Observed & Learned
* FIR Engine Theoretical Throughput versus Actual Throughput
  For 64 output data, each requiring 11 multiplications and 10 additions, the total number of operations is 21 * 64 = 1344. The theoretical time can be calculated based on the cycles for AXI-Lite and AXI-Stream.
  The overall operation involves 12 cycles for each set of 5 data and 11 tap parameters, plus additional cycles for control signals. The theoretical time is **1793** cycles, while the actual time is **32471** cycles. The discrepancy comes from waiting for the Wishbone interface to complete firmware operations before AXI-Lite confirmation.
* Latency for Firmware to Feed Data
  The firmware needs at least 8 cycles to send one Xn to the design (5 cycles for WB to AXI-Lite read, and 3 cycles for WB to AXI-Stream input).
# Lab5
## Brief Introduction about the Overall System
There are 4 IPs in this system:
* read_romcode
  * Purpose: Read the ROM code from system memory (DDR) using AXI master and store it in BRAM.
  * Operation: Iteratively reads ROM code from system memory and writes it to BRAM. Limits the read count if it exceeds the available data.
* spiflash
  * Purpose: Interfaces with SPI flash for reading ROM code.
  * Operation: Utilizes input (io0) as SPI flash input and output (io1) as SPI flash output. Reads ROM code from BRAM and outputs it accordingly.
* ResetControl
  * Purpose: Sends OUTPIN control signals using AXI-Lite. The ap_ctrl sends signals internally to control the RISCV CPU.
* caravel_ps
  * Purpose: Interconnect IP for the PS side. Manages the interface between MPRJ and PS, allowing data transfer between them.

This lab's tool usage is the same as in the previous lab, so we skip the "What is Observed & Learned" part.
# Lab6
## Brief Introduction about the Overall System
![image](https://hackmd.io/_uploads/H1uBFzsdp.png)

From the above diagram, we can draw a new block diagram for UART,

![image](https://hackmd.io/_uploads/BkPsKfsu6.png)

We didn't need to design the RTL.
## What is Observed & Learned
* Verification on board with the ipynb notebook
  Referring to the discussion in GitHub #175, the team used a while loop to print the checkbit and the corresponding time for verification.
  Additionally, in Top.c, a while loop was inserted at each reg_mprj_datal output interval to slow down the firmware’s execution speed for easier confirmation in the Jupyter notebook.
# LabD
## Brief Introduction about the Overall System
![image](https://hackmd.io/_uploads/HkNh6zidT.png)

In this lab, we modify the controller to add a "prefetch" operation. It lets the controller use registers to prefetch SDRAM data around the last access address, so that the next time the system requests something near the previous address, the controller can hit directly and output the corresponding data.
## What is Observed & Learned
SDRAM controller flow

![image](https://hackmd.io/_uploads/Sk3QeXjdT.png)

If the refresh flag is raised, the controller ignores the incoming request and performs PRECHARGE and REFRESH. The refresh operation needs at least 12T to complete.

![image](https://hackmd.io/_uploads/r1xN-QjOa.png)
# Final Project
We replace the firmware workloads in Lab6 (matmul, fir, and qs) with hardware accelerators. In addition, we add the SDRAM prefetch scheme we designed in LabD to make instruction reads even faster.
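
To illustrate why prefetching around the last access address helps sequential instruction reads, here is a minimal behavioral sketch in Python. It is not the controller RTL from LabD. The window size (`PREFETCH_WORDS`), the cycle costs (`HIT_CYCLES`, `MISS_CYCLES`), the aligned-window refill policy, and the class name `PrefetchModel` are all hypothetical choices made only for illustration.

```python
# Behavioral sketch of a prefetch buffer in front of SDRAM (not the lab's RTL).
# Every constant below is an illustrative assumption, not a measured value.
PREFETCH_WORDS = 8   # hypothetical: number of words held in the prefetch registers
HIT_CYCLES = 1       # hypothetical: cost of serving a word from the registers
MISS_CYCLES = 10     # hypothetical: cost of a full SDRAM access
WORD_BYTES = 4


class PrefetchModel:
    def __init__(self):
        self.base = None   # start address of the currently buffered window
        self.hits = 0
        self.misses = 0
        self.cycles = 0

    def read(self, addr):
        """Account for one word read at byte address `addr`."""
        window = PREFETCH_WORDS * WORD_BYTES
        if self.base is not None and self.base <= addr < self.base + window:
            # The requested word is already in the prefetch registers.
            self.hits += 1
            self.cycles += HIT_CYCLES
        else:
            # Miss: pay the SDRAM access and refill the aligned window.
            self.misses += 1
            self.cycles += MISS_CYCLES
            self.base = addr - (addr % window)


def run(addresses):
    model = PrefetchModel()
    for a in addresses:
        model.read(a)
    return model


# Sequential fetch of 32 consecutive words, as an instruction stream would do.
sequential = [0x38000000 + WORD_BYTES * i for i in range(32)]
m = run(sequential)
print(m.hits, m.misses, m.cycles)
```

Under these assumed costs, a sequential stream only pays the full SDRAM latency once per window, which is the effect the prefetch scheme relies on. A stream that jumps between distant addresses would miss every time and gain nothing.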
![image](https://hackmd.io/_uploads/H1Kn52ndT.png)

All simulations run at 25 ns/cycle.
* Lab6's latency for each workload
  mm latency = **73729**, fir latency = **162936**, qs latency = **29525**
```
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
LA mm 1 started
mm latency: 73729
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x003e
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0044
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x004a
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0050
LA mm 2 passed
LA fir 1 started
LA fir 2 passed
fir latency: 162936
LA qs 1 started
qs latency: 29525
Call function qsort() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0028
Received 40
Received 893
Received 2541
Received 2669
LA qs 2 passed
```
* After replacing the workloads with hardware computation, accessed through Wishbone address assignment
  mm latency = **3837**, fir latency = **3436**, qs latency = **4218**
```
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
VCD info: dumpfile top.vcd opened for output.
start time 1205513
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 4218
start time 1396988
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 3436
start time 1595938
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 3837
Congrats 1861813
```
* Finally, we add the prefetch scheme to optimize the read latency
  mm latency = **3000**, fir latency = **2804**, qs latency = **3730**
```
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
VCD info: dumpfile top.vcd opened for output.
start time 1180513
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 3730
start time 1359788
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 2804
start time 1542938
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 3000
Congrats 1787888
```
Compared with the original firmware execution in Lab6, and based on the latencies reported above, the final design improves latency by about **7.9** times for qs, **24.6** times for mm, and **58.1** times for fir.
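
The speedups can be checked directly from the latencies printed in the three logs. The short Python sketch below only does that arithmetic. The dictionary names and the `speedup` helper are introduced here for illustration, and the fir value for the accelerator-only run is taken from the log line (`fir latency: 3436`).

```python
# Latencies copied from the simulation logs above (units as reported there).
baseline = {"mm": 73729, "fir": 162936, "qs": 29525}   # Lab6 firmware
hw_accel = {"mm": 3837, "fir": 3436, "qs": 4218}       # hardware accelerators
prefetch = {"mm": 3000, "fir": 2804, "qs": 3730}       # accelerators + prefetch


def speedup(before, after):
    """Return before/after latency ratio for each workload."""
    return {name: before[name] / after[name] for name in before}


for label, ref, result in (
    ("firmware -> accelerators", baseline, hw_accel),
    ("firmware -> accelerators + prefetch", baseline, prefetch),
    ("prefetch alone", hw_accel, prefetch),
):
    ratios = speedup(ref, result)
    print(label, {k: round(v, 2) for k, v in ratios.items()})
```

The last row isolates the contribution of the prefetch scheme itself, which is between about 1.13x (qs) and 1.28x (mm) on top of the accelerators.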