# Lab 1

## Brief Introduction about the Overall System

HLS (High-Level Synthesis) is the process of synthesizing hardware from high-level languages. In Xilinx's Vitis HLS, algorithms are described in C/C++, letting developers focus on the algorithm rather than on low-level hardware details.

Lab 1 involves two main tools: Vitis HLS and Vivado.

- **Vitis HLS:** Synthesizes C++ algorithms into hardware IP.
- **Vivado:** Builds the block design that interfaces with the synthesized IP.

### Vitis HLS Phases

1. **C sim (C simulation):** Verifies the correctness of the C++ algorithm in software.
2. **C synthesis:** Synthesizes the C++ code into RTL.
3. **Co-sim (C/RTL co-simulation):** Compares C-simulation and RTL-simulation results.

### Vivado Application

- Selects the FPGA development platform.
- Uses the Vitis HLS-generated IP in the block design.
- Generates the `.bit` and `.hwh` files.
- Deploys the application on a real FPGA (Lab 1 uses a Jupyter notebook with PYNQ's `Overlay`).

## What is Observed & Learned

- Learned the basics of the Vivado and Vitis tools.
- Observed the use of an overlay in Python for hardware read/write operations.
- Investigated the register locations in the generated `.bit` file.
- Explored the AXI-Lite protocol for register access.
- Pragmas provide directives that express constraints and optimizations.
- Thinking in HLS becomes easier once you picture the RTL architecture the code will map to.

<br>
<br>
<br>

# Lab 2

## Brief Introduction about the Overall System

Lab 2 implements a Finite Impulse Response (FIR) filter in hardware. Two different IPs are used: M-AXI and Stream.

- **M-AXI (FIR_N11_MAXI):**
  - Accesses DDR directly using the M-AXI protocol.
  - Needs no DMA assistance.
- **Stream (FIR_N11_Stream):**
  - Uses AXI-Stream as the data interface.
  - Requires a DMA (adapter) for external DDR reads/writes.
  - Offers a more modular approach by delegating address handling to other IPs.

## What's Observed & Learned?

1. **Differences between the M-AXI & Stream interfaces:**
   - Both are high-performance protocols with handshake mechanisms (valid & ready).
   - AXI-Stream has no address channel; it focuses solely on data transmission and reception.
   - M-AXI performs address-based read/write operations.
2. **Differences between csim and cosim:**
   - **C simulation (csim):**
     - Functional verification using software-based simulation.
     - Focuses on verifying the correctness of the C/C++ code.
   - **Co-simulation (cosim):**
     - Hardware/software simulation involving both the synthesized hardware and the host CPU.
     - Models hardware/software interaction, providing a more accurate performance estimate.
     - Useful for performance validation, bottleneck identification, and interface correctness.

### Additional Summary:

- AXI-Stream, AXI-Full (M-AXI, AXI master/slave), and AXI-Lite all belong to the AMBA AXI4 protocol family.
- AXI-Lite handles small data transfers with an address channel and handshake.
- AXI-Full supports burst transfers of large data after a single address handshake (AXI-Lite does not support bursts).
- AXI-Stream has no address concept but keeps the handshake for data transmission.
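As a rough illustration of how these two interface styles are expressed at the HLS level, the sketch below shows a hypothetical `fir_maxi` top function (not the lab's actual FIR_N11 source) with M-AXI data ports and an AXI-Lite control bundle; the Stream variant would instead expose the data ports as `axis` interfaces and leave addressing to the DMA.

```c
// Hypothetical Vitis HLS top function in the style of FIR_N11_MAXI:
// data is fetched from DDR through AXI master ports, while scalars and the
// start/done handshake travel over an AXI-Lite control bundle.
#define N 11

void fir_maxi(int *x, int *y, int taps[N], int len) {
#pragma HLS INTERFACE m_axi     port=x      offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=y      offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=taps   bundle=control
#pragma HLS INTERFACE s_axilite port=len    bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    int shift_reg[N] = {0};
    for (int n = 0; n < len; n++) {
        // shift in the newest sample
        for (int i = N - 1; i > 0; i--)
            shift_reg[i] = shift_reg[i - 1];
        shift_reg[0] = x[n];
        // multiply-accumulate over the 11 taps
        int acc = 0;
        for (int i = 0; i < N; i++)
            acc += taps[i] * shift_reg[i];
        y[n] = acc;
    }
}
```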
<br>
<br>
<br>

# Lab 3

### This lab implements a finite impulse response filter with n = 11, using only ***1 multiplier and 1 adder***, and ***2 BRAMs with 11 entries each***.

<br>

Folder hierarchy:
---
Put the following files in the same directory, and make sure you have a simulator to run the Verilog code.

- ***fir_tb.v***
- ***fir.v***
- ***bram11.v***
- ***samples_triangular_wave.dat***
- ***out_gold.dat***
- ***makefile***

<br>

Execute:
---
```sh
make run
make clean
```

Function spec:
---
$Y[n] = \sum_{i=0}^{10} H[i] \cdot X[n-i]$

<br>

Design spec:
---
- Data width = 32
- Tap num = 11 (number of coefficients)
- Data num = 600 (number of input data)
- Interface:
  - data_in (Xn): AXI-Stream
  - data_out (Yn): AXI-Stream
  - Coeff [10:0]: AXI-Lite
  - Data num: AXI-Lite
  - ap_start: AXI-Lite
  - ap_done: AXI-Lite
- Uses one multiplier and one adder

<br>

Operation:
---
- ap_start initiates the FIR engine (ap_start is valid for 1 cycle)
- Xn data in: AXI-Stream
- Yn data out: AXI-Stream

<br>

Configuration register address map:
---
0x00:
- [0]: ap_start, set when the host performs an AXI-Lite write with data = 1, cleared when the first AXI-Stream data comes in
- [1]: ap_done, asserted when the FIR has processed all the data
- [2]: ap_idle, asserted when the FIR engine is idle (not processing data)

0x10: data-length (600)
<br>
0x20: tap parameters (11 coefficients at 0x20, 0x24, ..., 0x48)

<br>

Test dataset:
---
- samples_triangular_wave.dat (Xn)
- out_gold.dat (expected Yn)

<br>

Waveform: https://github.com/holyuming/Verilog-FIR

# Lab 4-1

## Explanation of Firmware Code

### `fir.c`
Simulates a shift RAM, multiplies each entry by the corresponding tap parameter, and accumulates the result.

### `counter_la_fir.out`
Stores the register values into data memory as an initialization step.

## Assembly Code

- Calculates the corresponding address from the `user_counter` declarations and reads from it.
- Uses the `reg_mprj_io` pins for addressing.

## Corresponding C Code

### `counter_la_fir.c` (Firmware Code)

```c
// Sketch of the FIR firmware loop (identifiers are illustrative; a5/a4 in the
// generated assembly hold the loop bound and index compared each iteration).
void fir(void) {
    for (int n = 0; n < DATA_NUM; n++) {
        // shift the sample RAM and insert the newest input
        for (int i = TAP_NUM - 1; i > 0; i--)
            data[i] = data[i - 1];
        data[0] = x[n];
        // multiply each entry by its tap coefficient and accumulate
        int acc = 0;
        for (int i = 0; i < TAP_NUM; i++)
            acc += taps[i] * data[i];
        y[n] = acc;
    }
}
```

# Lab 4-2

## Design Block Diagram and Protocol between WB & Verilog-FIR

Referring to the orange box in the diagram, `wbs_addr_i` addresses starting with `0x30` map to the Verilog-FIR block, where the user project performs the FIR calculation. The firmware uses `wbs_addr_i` addresses starting with `0x38` to fetch data for exmem-FIR. The data path is drawn with black arrows, while control signals are drawn with blue arrows.

### Protocol between WB & Verilog-FIR

**AXI Write & AXI Read**

## Wishbone & Design Interaction Waveforms

### Wishbone to AXI-Lite Read

### Wishbone to AXI-Lite Write

## Design Block Diagram and Protocol between WB & exmem-FIR (Firmware)

The right half of the orange box represents the firmware side (exmem-FIR). `wbs_addr_i` addresses starting with `0x38` carry its control and data signals.

### Wishbone & Firmware Interaction Waveforms

### Firmware Read

### Firmware Write

### Wishbone to AXI-Stream Input

1. Confirm `ap[4]` is 1 (5 cycles).
2. Input data (`Xn`) using `0x30000080` as `wbs_addr` (3 cycles).

### Wishbone to AXI-Stream Output

1. Confirm `ap[5]` is 1 (5 cycles).
2. Read data (`Yn`) using `0x30000084` as `wbs_addr` (3 cycles).

## FIR Engine Theoretical Throughput versus Actual Throughput

For 64 output data, each requiring 11 multiplications and 10 additions, the total operation count is 21 * 64 = 1344. The theoretical time can be calculated from the cycles needed for the AXI-Lite and AXI-Stream transfers: each set of 5 data and 11 tap parameters takes 12 cycles, plus additional cycles for control signals, giving a theoretical time of 1793 cycles. The actual run takes 32471 cycles. The discrepancy comes from waiting for the Wishbone interface to complete the firmware operations before each AXI-Lite confirmation.
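The per-sample sequence just described (check `ap[4]`, write `Xn` to `0x30000080`, check `ap[5]`, read `Yn` from `0x30000084`) can be sketched in firmware roughly as follows; the base address of the `ap` flag register (`0x30000000`) and the macro names are assumptions for illustration, not the lab's exact code.

```c
#include <stdint.h>

// Hypothetical register map based on the description above; only the Xn/Yn
// addresses are quoted in the text, the ap register base is an assumption.
#define FIR_AP    (*(volatile uint32_t *)0x30000000)  // ap flags (start/done/idle, ready/valid)
#define FIR_X_IN  (*(volatile uint32_t *)0x30000080)  // Xn, Wishbone -> AXI-Stream input
#define FIR_Y_OUT (*(volatile uint32_t *)0x30000084)  // Yn, AXI-Stream output -> Wishbone

// Feed one sample and read back one result, polling the handshake bits.
static int32_t fir_transfer(int32_t xn) {
    while (!(FIR_AP & (1u << 4)))   // wait until the engine can accept Xn (~5 WB cycles)
        ;
    FIR_X_IN = (uint32_t)xn;        // Wishbone-to-AXI-Stream input (~3 WB cycles)

    while (!(FIR_AP & (1u << 5)))   // wait until Yn is valid
        ;
    return (int32_t)FIR_Y_OUT;      // Wishbone read of the stream output
}
```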
## Latency for Firmware to Feed Data

The firmware needs at least 8 cycles to send one `Xn` to the design (5 cycles for the WB-to-AXI-Lite read, plus 3 cycles for the WB-to-AXI-Stream input).

## Techniques to Improve Throughput

1. Use the `while`/`wait` loops in `counter_la_tb.v` efficiently.
2. Avoid re-sending the tap parameters for each FIR calculation.
3. Use the Wishbone protocol for output from the design.
4. Implement 11 multipliers in the design.
5. Declare `tap_0 = 0` directly instead of `tap_0 = taps[0]` to reduce load delay.

## Other Methods to Improve Performance

Use separate Wishbone interfaces for the firmware and the user project, or use Wishbone exclusively instead of the AXI protocol.

## Is BRAM12 Better?

Not necessarily: the current design already achieves a fast FIR calculation, taking 11 cycles for the 11 tap coefficients.

## Synthesis Information

- [Synthesis Report](https://github.com/holyuming/NYCU-2023-SoCLab/blob/master/lab-caravel_fir/synthesis/synthesis.runs/synth_1/user_project_wrapper_utilization_synth.rpt)

### Resource Utilization

- BRAM: [details]
- IO & Arithmetic: [details]
- Primitives: [details]

## Timing Report

- [Timing Report](https://github.com/holyuming/NYCU-2023-SoCLab/blob/master/lab-caravel_fir/synthesis/timing_report.txt)
- Period: 5 ns

### Slack

MET: no timing violations observed.

# Lab 5

## Block Diagram

After compilation, the firmware code is stored in the PS-side DDR and loaded into the hardware's BRAM by the `read_romcode` IP via AXI master. Once the firmware is loaded, another IP (`ResetControl`) releases the reset, allowing the RISC-V CPU to start fetching the firmware code through the SPI flash interface. After execution, the firmware drives values onto the MPRJ pins. To observe these values, an IP (`caravel_ps`) reflects them onto AXI-Lite so that the CPU on the PS side can perform MMIO reads.

## IP Usage

### 1. read_romcode & spiflash

#### read_romcode

- Purpose: Read the ROM code from system memory (DDR) using an AXI master and store it in BRAM.
- Operation: Iteratively reads the ROM code from system memory and writes it to BRAM, limiting the read count if it exceeds the available data.

#### spiflash

- Purpose: Interfaces with the SPI flash for reading the ROM code.
- Operation: Uses `io0` as the SPI flash input and `io1` as the SPI flash output; reads the ROM code from BRAM and shifts it out accordingly.

### 2. ResetControl & caravel_ps

Both use AXI-Lite for communication with the system interface.

#### ResetControl

- Purpose: Sends the OUTPIN control signals over AXI-Lite. The `ap_ctrl` signals are used internally to control the RISC-V CPU.

#### caravel_ps

- Purpose: Interconnect IP for the PS side. Manages the interface between MPRJ and the PS, allowing data transfer between them.

## Workload on Caravel FPGA

- Counter_wb.c: the input data starts from `0xAB610000`.
- Counter_la.hex: the input data starts from `0xAB510000`.
- GCD_la.hex: the input data starts from `0xAB510000`.

## Study of the Caravel FPGA Control Flow

1. Declare the DDR size for the firmware code (`8K`).
2. Load the `.hex` firmware file into fiROM and parse it to place the corresponding data into npROM (DRAM).
3. Use the HLS-generated IP to fill the BRAM with the ROM code from DDR.
4. De-assert the reset signal.
5. Read the final output pin. The resulting output matches the `.c` firmware code (`0xAB51`).
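To make step 3 of this flow more concrete, here is a hypothetical Vitis HLS sketch in the spirit of the `read_romcode` IP described above: it copies the firmware image from DDR (reached over an AXI master port) into a BRAM and clamps the transfer length. The function signature, port names, and the 2048-word (8 KB) BRAM size are assumptions, not the actual IP source.

```c
#include <stdint.h>

#define BRAM_WORDS 2048   // assumed: 8 KB of firmware / 4 bytes per word

void read_romcode(const uint32_t *ddr, uint32_t bram[BRAM_WORDS], uint32_t len) {
#pragma HLS INTERFACE m_axi     port=ddr  offset=slave bundle=gmem
#pragma HLS INTERFACE bram      port=bram
#pragma HLS INTERFACE s_axilite port=len  bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    if (len > BRAM_WORDS)          // limit the read count to the available space
        len = BRAM_WORDS;
    for (uint32_t i = 0; i < len; i++)
        bram[i] = ddr[i];          // iterative DDR read -> BRAM write
}
```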
## FPGA Utilization

### Overall Design Wrapper Utilization (Synthesis)

- Design_1_xbar
- Design_1_spiflash_0_0
- Design_1_rst_ps7_0_50M_0
- Design_1_read_romcode_0_0
- Design_1_output_pin_0_0
- Design_1_caravel_ps_0_0
- Design_1_caravel_0_0
- Design_1_blk_mem_gen_0_0
- Design_1_auto_us_0
- Design_1_auto_pc_1
- Design_1_auto_pc_0
- Design_1_processing_system

<br>
<br>
<br>

# Lab 6

## Block Diagram for UART

Observing the block diagram, the `ctrl` block's `tx` and `rx` ports run in opposite directions (one is an output where the other is an input).

## Method to Verify Answers from the Notebook

Following the discussion in GitHub issue #175, the team used a while loop to print the check bit and the corresponding time for verification. Additionally, in Top.c, a while loop was inserted at each `reg_mprj_datal` output interval to slow down the firmware's execution so the results are easier to confirm in the Jupyter notebook.

## Timing Report / Resource Report After Synthesis

After synthesis, all timing constraints are met (`slack > 0`).

## Latency for a Character Loopback Using UART

The average latency for a character loopback over UART is approximately 4.17 ms.

## Suggestions for Improving UART Loopback Latency

1. **Baud rate:** Explore higher frequencies beyond 40 MHz; higher frequencies tend to give lower latency. Make sure both the transmitter and the receiver support the chosen frequency without errors.
2. **DMA:** Consider using DMA instead of UART. DMA performs memory reads/writes independently of the processor, reducing latency.

## RTL Verification

Fork two processes to verify `mm_qs_fir` and the UART functionality simultaneously.

<br>
<br>
<br>

# LabD

## SDRAM Controller Design and SDRAM Bus Protocol

Our SDRAM controller design is similar to the original, but we added two states: PREFETCH and PREFETCH_RES. When the bank is open during IDLE or ACTIVATE, PREFETCH is triggered: it fetches the data for the next 8 cycles (8 consecutive instructions) from SDRAM and stores it in the cache. When there is a cache hit in the IDLE state, we can output the value directly from the cache without entering the READ state, which reduces latency compared to the original exmem, where a read requires 10 cycles. The cache design is simple, with only 8 entries of 32-bit data each. The data in these 8 entries is contiguous, so there is no need to maintain per-entry addresses inside the cache (see the behavioral sketch later in this section).

## Introducing the Prefetch Scheme

The prefetch scheme is a strategy to optimize system performance. It uses the processor's idle time to prefetch data or instructions that are likely to be needed, reducing data access latency and improving overall execution speed.

## Bank Interleave for Code and Data

The memory is divided into 4 banks, each independently accessible. The SDRAM output (`Dqo`) fetches output values from different banks depending on the situation, achieving bank interleaving. To ensure the firmware code and data are placed in different banks, we modified the SDRAM controller to use bits [9:8] of the user address (Wishbone address) as the bank address, and adjusted the linker mapping accordingly.

## Observing SDRAM Access Conflicts with SDRAM Refresh

To observe access conflicts, we increased the refresh frequency by shortening the refresh interval from its original 750T.

### FSM State During a Refresh/Access Conflict

The waveform shows a conflict in which the system wants to read data at 0x3800184 while the controller is in the refresh state. The request is processed only after the refresh completes, causing additional wait cycles.
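Returning to the prefetch cache introduced at the start of this lab, the following is a small behavioral C model (not the actual Verilog controller): because the 8 cached words are contiguous, a single base address is enough to decide a hit, and a hit in IDLE returns data without going through the READ state. All names here are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

#define PREFETCH_WORDS 8

typedef struct {
    uint32_t base;                    // byte address of entry 0
    uint32_t data[PREFETCH_WORDS];    // 8 consecutive 32-bit words
    bool     valid;
} prefetch_cache_t;

// Fill the cache with 8 consecutive words starting at `addr`
// (models the PREFETCH / PREFETCH_RES states).
static void prefetch_fill(prefetch_cache_t *c, uint32_t addr, const uint32_t *sdram) {
    c->base = addr;
    for (int i = 0; i < PREFETCH_WORDS; i++)
        c->data[i] = sdram[addr / 4 + i];
    c->valid = true;
}

// On a hit in IDLE, return the word directly from the cache instead of issuing a READ.
static bool prefetch_lookup(const prefetch_cache_t *c, uint32_t addr, uint32_t *word) {
    if (!c->valid || addr < c->base || addr >= c->base + PREFETCH_WORDS * 4)
        return false;                 // miss: controller must go through ACTIVATE/READ
    *word = c->data[(addr - c->base) / 4];
    return true;
}
```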
## Address Remapping for Larger Firmware Code Space

By remapping the bank address to higher address bits, we can provide a larger contiguous space for the firmware code.

## Conclusion

In the final project, we plan to further optimize the bank address position so that the full BRAM address space of each bank can be used.

<br>
<br>
<br>

# Final Project

# 2023 SoC Design Lab Final Project: Workload Optimized SoC (WLOS)

### Simulation for 3 workloads (matmul, fir, sort)

```sh
cd final_project/testbench/top
source run_clean
source run_sim
```

<br>

### First: execute (matmul, fir, sort) via the `ARM processor`

```sh
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
start time 2851738
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 41737
start time 3895438
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 174090
start time 8274788
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 90444
Congrats 10535888
```

<br>

### Thus, we built our own customized hardware for the 3 different workloads.

The HDL file is `final_project/rtl/user/design.v`, which contains 3 Verilog modules.

```sh
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
start time 1180513
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 7148
start time 1359488
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 7303
start time 1542338
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 9744
Congrats 1785938
```

### Conclusion: the latency of each workload drops significantly with the custom hardware.

| Task | Latency w/o Hardware | Latency w/ Hardware |
|:----:|:--------------------:|:-------------------:|
| qs   | 41,737               | 7,148               |
| fir  | 174,090              | 7,303               |
| mm   | 90,444               | 9,744               |
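These latencies correspond to speedups of roughly 5.8x for qs, 23.8x for fir, and 9.3x for mm; the small C program below simply reproduces these ratios from the numbers in the table.

```c
#include <stdio.h>

// Reproduces the speedup factors implied by the table above
// (software-only latency divided by hardware-accelerated latency).
int main(void) {
    const char  *task[] = {"qs", "fir", "mm"};
    const double sw[]   = {41737, 174090, 90444};   // latency w/o hardware
    const double hw[]   = {7148, 7303, 9744};       // latency w/ hardware

    for (int i = 0; i < 3; i++)
        printf("%-3s speedup: %.1fx\n", task[i], sw[i] / hw[i]);
    return 0;
}
```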