GitHub Link: https://github.com/holyuming/NYCU-2023-SoCLab

# Lab1

## Brief Introduction about the Overall System
The kernel function of this lab is very simple: it takes 2 inputs and returns their product. The inputs and the result, however, must pass through the registers of the AXI-Lite adapter. IN1 and IN2 are written into the registers using the AW and W channels of AXI-Lite, and the result is read back from a register using the AR and R channels.

## What is Observed & Learned
- **Content of the AXI-Lite protocol**
  - how the master and slave communicate
- **How to wrap C++ into an IP by using Vitis HLS**
- **How to instantiate the IP in Vivado**

# Lab2

## Brief Introduction about the Overall System
Lab 2 implements an FIR filter in hardware, producing two distinct IPs: one with an AXI-Master interface and one with an AXI-Stream interface.

- **FIR with AXI-Master interface**:
  - The FIR input and output samples are transferred over AXI-Master, while the FIR parameters are transferred over AXI-Lite, so the generated IP communicates through both an AXI-Lite and an AXI-Master interface. The AXI-Lite port is connected through an interconnect to a GP port on the PS side, and the AXI-Master port is connected through an interconnect to an HP port on the PS side.
- **FIR with AXI-Stream interface**:
  - Here the FIR input and output are transferred as streams instead, so they cannot be connected to the PS side directly through the interconnect. A DMA is needed in between, so two extra DMAs are added for the IN and OUT connections. The rest is roughly the same as the AXI-Master version.

## What is Observed & Learned
- **Content of the AXI-Master and AXI-Stream protocols**
  - how IN and OUT communicate with the PS side
- **Difference between csim & cosim**
  - The difference between csim and cosim lies in the kernel: the former simulates the C model, while the latter simulates the generated Verilog.

# Lab3

## Brief Introduction about the Overall System
- This lab is roughly the same as Lab2, except that this time the FIR is implemented in hand-written RTL with AXI-Lite and AXI-Stream interfaces.
- The FIR has n = 11 taps and only ***1 multiplier and 1 adder***, so each output is produced by multiplying and accumulating over 11 cycles.

## What is Observed & Learned
- **AXI4-Lite read & write transactions**
- **SRAM interface implementation**

# Lab4

## **Lab4-1**

### Brief Introduction about the Overall System
- Execute FIR firmware code from user project memory (BRAM)

### What is Observed & Learned
- **BRAM internal read and write mechanism**

## **Lab4-2**

### Brief Introduction about the Overall System
- Interactions between firmware in user project memory and the design in the user project area
- Integrate lab-fir & lab-exmem-fir into the Caravel user project (add a Wishbone interface)

### What is Observed & Learned
- **Firmware code to move data in/out of the FIR** (see the sketch below)
- **HW/SW co-design**
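To make the HW/SW co-design point concrete, here is a minimal firmware sketch of how data could be moved in and out of the FIR through the Caravel user-project Wishbone window. The register offsets (`ap_ctrl`, data length, taps, X/Y) are hypothetical placeholders; the real map is whatever the exmem-fir address decoder in this repo defines.

```c
/* Hedged firmware sketch, NOT the repo's actual driver.
 * Assumes hypothetical register offsets inside the Caravel
 * user-project Wishbone space starting at 0x30000000. */
#include <stdint.h>

#define USER_WB_BASE  0x30000000UL
#define FIR_AP_CTRL   (*(volatile uint32_t *)(USER_WB_BASE + 0x00)) /* hypothetical */
#define FIR_DATA_LEN  (*(volatile uint32_t *)(USER_WB_BASE + 0x10)) /* hypothetical */
#define FIR_COEF(i)   (*(volatile uint32_t *)(USER_WB_BASE + 0x40 + 4*(i))) /* hypothetical */
#define FIR_X_IN      (*(volatile uint32_t *)(USER_WB_BASE + 0x80)) /* hypothetical */
#define FIR_Y_OUT     (*(volatile uint32_t *)(USER_WB_BASE + 0x84)) /* hypothetical */

void fir_run(const int32_t *x, int32_t *y, int n, const int32_t *coef)
{
    FIR_DATA_LEN = n;                /* program data length          */
    for (int i = 0; i < 11; i++)     /* program the 11 taps          */
        FIR_COEF(i) = coef[i];
    FIR_AP_CTRL = 1;                 /* ap_start                     */

    for (int i = 0; i < n; i++) {
        FIR_X_IN = x[i];             /* push one sample in           */
        y[i] = FIR_Y_OUT;            /* read one result back         */
        /* a real driver would poll ready/valid status bits before
         * each access; omitted here for brevity */
    }
}
```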
# Lab5

## IP usage
- `ResetControl`: a memory-mapped-IO block that controls the Caravel reset pin.
- `read_romcode`: reads the firmware hex data from PS/DDR memory into the FPGA BRAM.
- `spiflash`: emulates the read behavior of a SPI flash device, serving data from BRAM.
- `caravel_ps`: allows the PS side to use memory-mapped IO to read/write the mprj pins.

## Brief Introduction about the Overall System
The compiled firmware code is stored in the DDR on the PS side. It is loaded into the hardware's BRAM through an IP called `read_romcode` using the AXI-Master protocol. After the firmware has been loaded, another IP (`ResetControl`) releases RESET, and the RISC-V CPU starts running, fetching firmware code through `spiflash`. When it finishes, the firmware drives the result onto the mprj pins. Since no simulator can read values directly from the pins, these values are reflected onto the AXI-Lite protocol through an IP (`caravel_ps`), so that the CPU on the PS side can read them via MMIO.

## What is Observed & Learned
- **Getting familiar with the Caravel SoC control flow & putting it onto an FPGA**
- **Using GTKWave to view simulated waveforms**
- **Converting a Verilog testbench into Python code**

# Lab6

## Brief Introduction about the Overall System
- Integrate different firmware code (Quick Sort, FIR, Matrix Multiplication, UART) into one program.
- The hardware integrates exmem-fir with the UART design in the user project area, like Lab4-2.
- Modify the testbench to include tests for Matrix Multiplication, Quick Sort, FIR and UART.
- Verify that the firmware code can execute on the FPGA.

## Timing Violation
- We opened Vivado and changed the implementation strategy from the default to `Performance_NetDelay_high`, which fixed our hold time violation.

## What is Observed & Learned
- **UART protocol**
  - very slow......

# LabD

## Brief Introduction about the Overall System
- Refer to lab-exmem (Lab4), but replace the BRAM with SDRAM (SDRAM controller + SDRAM).
- On top of the original design, we add a prefetch function.

We added 2 additional states, `PREFETCH` and `PREFETCH_RES`. During `IDLE` or `ACTIVATE`, if the bank is already open, the controller jumps to `PREFETCH`, grabs 8 consecutive SDRAM words, and stores them in the cache. The next time there is a cache hit in the `IDLE` state, the cached value can be driven out as `data_out` directly, without going through the `READ` state. The resulting latency is much lower than the 10T required by the original exmem.

Our cache design is small: it has only 8 entries, each holding 32 bits of data. The data in these 8 entries is consecutive, so there is no need to keep a separate address per entry in the cache (see the behavioral sketch after this section).

## What is Observed & Learned
- **SDRAM internal read and write mechanism**
- **Prefetch scheme**
- **Bank interleave**
  - In this lab the memory is divided into 4 banks, and each bank can be accessed independently. The SDRAM output `Dqo` captures the output values of the different banks depending on the situation, which realizes bank interleaving.
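A minimal behavioral sketch (in C, not the actual RTL) of the prefetch cache described above: because the 8 cached words are consecutive, a single base address serves as the tag, and a hit needs only one comparison plus a 3-bit index. All names here are illustrative.

```c
/* Behavioral model of the LabD prefetch cache, for explanation only. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t base;      /* word address of entry 0 (the shared tag)      */
    uint32_t data[8];   /* 8 consecutive words filled by PREFETCH        */
    bool     valid;
} prefetch_cache_t;

/* Returns true on a hit and places the word in *out; on a miss the
 * controller would instead go through ACTIVATE/READ (or PREFETCH). */
static bool cache_lookup(const prefetch_cache_t *c, uint32_t word_addr, uint32_t *out)
{
    if (c->valid && word_addr >= c->base && word_addr - c->base < 8) {
        *out = c->data[word_addr - c->base];   /* 3-bit index into the block */
        return true;
    }
    return false;
}
```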
# Final Project: Workload Optimized SoC (WLOS)

## Brief Introduction about the Overall System
We rewrite `Quick Sort`, `FIR`, and `Matrix Multiplication` as Verilog hardware accelerators.

### Simulation for the 3 workloads (matmul, fir, sort)
```sh
cd final_project/testbench/top
source run_clean
source run_sim
```

### First: execute (matmul, fir, sort) in software on the processor
```sh
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
LA mm 1 started
mm latency: 73729
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x003e
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0044
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x004a
Call function matmul() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0050
LA mm 2 passed
LA fir 1 started
LA fir 2 passed
fir latency: 162936
LA qs 1 started
qs latency: 29525
Call function qsort() in User Project BRAM (mprjram, 0x38000000) return value passed, 0x0028
Received 40
Received 893
Received 2541
Received 2669
LA qs 2 passed
```

### Next: we build our own customized hardware for the 3 workloads
The HDL is in `final_project/rtl/user/design.v`, which contains 3 Verilog modules.
```sh
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
VCD info: dumpfile top.vcd opened for output.
start time 1205513
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 4218
start time 1396988
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 3436
start time 1595938
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 3837
Congrats 1861813
```

### Last: we add an instruction prefetch technique by modifying `sdram_controller.v` to minimize instruction read latency
```sh
Reading top.hex
top.hex loaded into memory
Memory 5 bytes = 0x6f 0x00 0x00 0x0b 0x13
VCD info: dumpfile top.vcd opened for output.
start time 1180513
LA qs 1 started
Received 40
Received 893
Received 2541
Received 2669
Received 3233
Received 4267
Received 4622
Received 5681
Received 6023
Received 9073
LA qs 2 passed
QS latency: 3730
start time 1359788
LA fir 1 started
Received: 0
Received: -10
Received: -29
Received: -25
Received: 35
Received: 158
Received: 337
Received: 539
Received: 732
Received: 915
Received: 1098
LA fir 2 passed
fir latency: 2804
start time 1542938
LA mm 1 started
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
Received: 62
Received: 68
Received: 74
Received: 80
LA mm 2 passed
MM latency: 3000
Congrats 1787888
```

### Conclusion
The latency of each workload drops significantly.

Note: latency is measured from checkbits `0xAB40` to checkbits `0xAB51` for each workload.
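For reference, a minimal firmware sketch of how a workload can bracket itself with these checkbits, assuming the usual Caravel convention that the testbench samples the upper 16 bits of `reg_mprj_datal` as the checkbits; the register address below should be checked against the `defs.h` used in this repo.

```c
/* Hedged sketch of the latency-measurement handshake, not the repo's code. */
#include <stdint.h>

/* mprj data register (low word) as defined in Caravel's defs.h;
 * verify the address against the header used in this course. */
#define reg_mprj_datal (*(volatile uint32_t *)0x2600000C)

void run_fir_workload(void)
{
    reg_mprj_datal = 0xAB400000;   /* checkbits = 0xAB40 -> testbench starts counting */

    /* ... start the FIR accelerator and collect its outputs ... */

    reg_mprj_datal = 0xAB510000;   /* checkbits = 0xAB51 -> testbench stops counting  */
}
```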
| Task | Latency w/o Hardware | Latency w/ Hardware | Hardware + Prefetch | Hardware + Prefetch + Compiler -O1 | Speed Up |
|:--------:|:--------------------:|:-------------------:|:-------------------:|:----------------------------------:|:------------:|
| qsort()  | 29525  | 4218 | 3730 | 2736 | **10.8×** |
| fir()    | 162936 | 3436 | 2804 | 1767 | **92.2×** |
| matmul() | 73729  | 3837 | 3000 | 1957 | **37.7×** |

## What is Observed & Learned
- **Hardware accelerator design**