# SOC Study Journal

[TOC]

## Self-Learning
- [SystemVerilog](https://hackmd.io/@vic9112/HJgiDaiF6)
- [FSIC (Full-Stack IC) architecture](https://hackmd.io/@vic9112/HykJ2Lpt6)

## 2023-Summer Lab
### [Summer Lab 3 - Combinational](https://hackmd.io/@vic9112/SJiwC_Qs3)
### [Summer Lab 4 - Sequential](https://hackmd.io/@vic9112/BJyWMylsn)
### Summer Discussion
### Lint Rules
https://hackmd.io/@vic9112/Hyz7HzKch

## 2023-Fall Lab
### Lab 3 - FIR
- [Contents of Lab-FIR](https://github.com/vic9112/SOC/tree/main/Lab3-FIR)

![image](https://hackmd.io/_uploads/ry42GxSKT.png)
![image](https://hackmd.io/_uploads/ByY6feHF6.png)
![image](https://hackmd.io/_uploads/SJX0zeHKT.png)
![image](https://hackmd.io/_uploads/S1XymgBK6.png)
#### **Achieved 11T per data!!**
![Screenshot 2024-01-17 143559](https://hackmd.io/_uploads/S1RPBxBYa.png)
#### Review / Correction
[https://hackmd.io/@vic9112/HyGQqjONa]

### Lab 4-1
![Screenshot 2024-01-17 142632](https://hackmd.io/_uploads/r1CmmlHFT.png)
![Screenshot 2024-01-17 143700](https://hackmd.io/_uploads/HkliSert6.png)
#### Firmware code
- How does the firmware execute a multiplication in assembly code? Let's look at the `__mulsi3` function:

![image](https://hackmd.io/_uploads/r1HtUlHta.png)
- `mv a2, a0`: copies the value in register `a0` into register `a2`.
- `li a0, 0`: loads the immediate value 0 into `a0`, clearing it so that it can hold the running result.
- `andi a3, a1, 1`: performs a bitwise AND between `a1` and the immediate value 1 and stores the result in `a3`. In other words, it extracts the least significant bit of `a1`.
- `beqz a3, 38000014 <__mulsi3+0x14>`: a conditional branch (beqz = branch if equal to zero). If `a3` (the least significant bit of `a1`) is zero, the program jumps to address 38000014, which skips the next instruction.
- `add a0, a0, a2`: adds the value in `a2` to `a0` and stores the result in `a0`, accumulating `a2` into the result.
- `srli a1, a1, 0x1`: logical right shift of `a1` by 1 bit, which effectively divides `a1` by 2.
- `slli a2, a2, 0x1`: logical left shift of `a2` by 1 bit, which effectively multiplies `a2` by 2.
- `bnez a1, 38000008 <__mulsi3+0x8>`: a conditional branch (bnez = branch if not equal to zero). If `a1` is non-zero, the program jumps back to address 38000008 (`andi a3, a1, 1`), creating a loop that repeats until `a1` becomes zero.
- `ret`: marks the end of the `__mulsi3` function and returns control to the caller.
- Let's look at the following example: 7 * 6 = 42
  → 000111_2 * 000110_2
  = 000111 * 000100 + 000111 * 000010
  = 011100 * 1 + 001110 * 1 (`a2` shifts left, `a1` shifts right)
  = 011100 + 001110 (accumulate the value)
  = 101010_2
- Flow of doing multiplication

![image](https://hackmd.io/_uploads/rku0IeSFp.png)

From the assembly code, we can tell that the waveform below shows the execution of `initfir()`, which our firmware uses to initialize the input buffer and the output buffer.
![image](https://hackmd.io/_uploads/rym_OeHK6.png)
Zooming in, we can see that the `ACK` signal is delayed by 10 clock cycles.
![image](https://hackmd.io/_uploads/H1i8veSY6.png)
#### Comment
- This lab uses the Wishbone protocol. I believe it is simpler and more straightforward to implement here than AXI-Lite. Its signals map closely onto what block RAM requires; for example, `CYC` and `STB` are very similar to AXI's `VALID`.
In class, the teacher mentioned that using `STB` as `VALID` and `ACK` as `READY` is sufficient. However, in my user project I also included `CYC` to indicate that a bus cycle from the processor is in progress.
- Next are `WE` and `SEL`, representing read/write and byte enable, respectively. By extending `WE` to four bits and ANDing it with `SEL`, we can use the result as the block RAM's write enable. The subsequent steps are quite intuitive.
- After completing the `run_sim` phase, I created a Makefile.
    - Available on my [GitHub-Lab4-1-counter_la_fir](https://github.com/vic9112/SOC/tree/main/Lab4-1-counter_la_fir/buildup%26xsim).
    - Its purpose is to run xsim. However, I ran into an issue with the original `include.rtl.list` file: the `-v` at the beginning of each line caused errors during xsim. Consequently, I created a new file called `include.rtl.list.xsim` with all the `-v` prefixes removed. In the Makefile, I added the following command:
    ``` shell=
    $ xvlog -f ./include.rtl.list.xsim counter_la_fir_tb.v
    ```
    - This modification fixed the problem and let me proceed with xsim without errors.
    - Even though the files now ran, the simulator reported an error: "port redeclaring." To address this, I went through the various .v files in caravel-soc and noticed that the code declared inputs/outputs first and then declared the matching reg/wire elsewhere. The simulator treated this as redefining the same port, which led to the errors.
    - I then moved the net type definitions into the port declarations of each module where it was necessary, reran the make command, and the simulation passed.
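Returning to the `__mulsi3` walkthrough in the Firmware code notes of this lab, the loop can be replayed with a small behavioural model. This is only a sketch in Python, not the libgcc source: the function name `mulsi3`, the 32-bit mask, and the one-statement-per-instruction layout are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are modelled as 32-bit unsigned values


def mulsi3(a0: int, a1: int) -> int:
    """Shift-and-add multiply, one Python statement per assembly instruction."""
    a0 &= MASK32
    a1 &= MASK32
    a2 = a0                            # mv   a2, a0   (copy the multiplicand)
    a0 = 0                             # li   a0, 0    (clear the accumulator)
    while True:
        a3 = a1 & 1                    # andi a3, a1, 1
        if a3 != 0:                    # beqz a3, +0x14 (skip the add when the bit is 0)
            a0 = (a0 + a2) & MASK32    # add  a0, a0, a2
        a1 >>= 1                       # srli a1, a1, 1 (next multiplier bit)
        a2 = (a2 << 1) & MASK32        # slli a2, a2, 1 (multiplicand * 2)
        if a1 == 0:                    # bnez a1, +0x8  (loop while bits remain)
            break
    return a0                          # ret


if __name__ == "__main__":
    print(mulsi3(7, 6))
```

The loop runs once per significant bit of the multiplier in `a1`, so small multipliers finish in fewer iterations than large ones.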
### Lab 4-2
- [Contents of Lab 4-2](https://github.com/vic9112/SOC/tree/main/Lab4-2)
#### Original Design / Bottleneck
- During this lab, we encountered a problem caused by the latency of CPU data accesses:

![Screenshot 2024-01-17 145602](https://hackmd.io/_uploads/Sk8VcgrtT.png)
- The orange part is the latency of the hardware calculating the FIR, and the blue part is the latency of the CPU fetching code from the user project and executing it.
- The original design runs the FIR calculation in **7497 cycles**.

![Screenshot 2024-01-17 145907](https://hackmd.io/_uploads/Hy6pqeSYa.png)

While our ability to make significant changes to the hardware is limited, there are still potential improvements in the time it takes the CPU to read assembly code:
- Instruction Pipeline Optimization: Evaluate and optimize the instruction pipeline to enable the concurrent execution of instructions in different stages, enhancing overall execution efficiency. This can be achieved through redesigning pipeline stages or increasing instruction caching.
- Cache Optimization: Ensure that frequently used instructions and data are stored in high-speed caches, reducing the need to fetch them from main memory and improving overall system performance.
- First, let's clarify some definitions.

![Screenshot 2024-01-17 150225](https://hackmd.io/_uploads/SJQ5igBt6.png)
- Then take a look at the while loop in our original firmware code:

![image](https://hackmd.io/_uploads/r1e82lrK6.png =70%x)
- It generates the following assembly code:

![Screenshot 2024-01-17 150617](https://hackmd.io/_uploads/By9uhxBKa.png)
- But the cache depth in our Caravel SOC is only 16:

![image](https://hackmd.io/_uploads/SkoA3lSFT.png =70%x)
- The CPU's instruction cache holds 16 instructions, but the loop code has 17 instructions. **So, our first goal is to shorten the assembly code.**
#### Optimization
![Screenshot 2024-01-17 151008](https://hackmd.io/_uploads/By8PTlBKa.png =80%x)
- We can add the flags `-O2`, `-O3`, or `-Ofast` to optimize the assembly code generated by our RISC-V toolchain.
Results are shown below:

![Screenshot 2024-01-17 151143](https://hackmd.io/_uploads/Syk6alSFa.png =80%x)
- **Executes in 1701T**

![Screenshot 2024-01-17 151252](https://hackmd.io/_uploads/SyqZRxBYa.png)
![Screenshot 2024-01-17 151345](https://hackmd.io/_uploads/rJv4AerF6.png)
- Result:

![Screenshot 2024-01-17 151413](https://hackmd.io/_uploads/rJHUAlHYT.png)
#### **Further Optimization**
- We can optimize further by reordering some of the assembly instructions.
- Original code:

![image](https://hackmd.io/_uploads/Hyzlr-Btp.png)
![Screenshot 2024-01-17 154300](https://hackmd.io/_uploads/Sk7EHWHY6.png)
- Reordered into the following result:

![Screenshot 2024-01-17 154536](https://hackmd.io/_uploads/HJm2BZSFa.png)
- Y->X result

![Screenshot 2024-01-17 154612](https://hackmd.io/_uploads/H100BZBFT.png)
#### Some Problem
- We may wonder why there is a `STB` without `CYC` here:

![Screenshot 2024-01-17 154807](https://hackmd.io/_uploads/SJrBIWrtp.png)
![Screenshot 2024-01-17 154836](https://hackmd.io/_uploads/rygDL-BFp.png)
![Screenshot 2024-01-17 154906](https://hackmd.io/_uploads/Hykt8ZrKT.png =80%x)
![Screenshot 2024-01-17 155110](https://hackmd.io/_uploads/rJGZw-HK6.png)
![Screenshot 2024-01-17 155143](https://hackmd.io/_uploads/BkxmDWBFa.png)

### Lab 5
- Block Diagram

![Screenshot 2024-01-17 155603](https://hackmd.io/_uploads/BJwY_WBYa.png)
- PS-PL interface

![Screenshot 2024-01-17 155619](https://hackmd.io/_uploads/S1DhuZrKp.png)

### Lab 6 - wlos
- [Contents of Lab6-wlos](https://github.com/vic9112/SOC/tree/main/Lab6-wlos)
- In this lab, we combined 4 workloads with UART/interrupt and executed them on the Caravel SOC.

### Lab - SDRAM
- SDRAM with SDRAM Controller
#### Original Block Diagram
- The overall diagram of the SDRAM is shown below:

![Screenshot 2024-01-16 211732](https://hackmd.io/_uploads/SyJbM-Vta.png =40%x)
- The Wishbone cycle passes through the SDRAM controller, which reads data from and writes data into the SDRAM. We made some optimizations because the memory size in the original SDRAM source code is not enough.
#### FSM in SDRAM controller:
![Screenshot 2024-01-16 215237](https://hackmd.io/_uploads/ryYVcWNFp.png =70%x)
- Some details about each state:

![Screenshot 2024-01-16 215401](https://hackmd.io/_uploads/H1s5c-4YT.png)
- **tCASL=3T tPRE=3T tACT=3T tREF=7T**
#### In SDRAM:
![Screenshot 2024-01-16 215526](https://hackmd.io/_uploads/S19Acb4Fp.png)
- We decode the command sent from the controller and the mode register defined by the user.

![Screenshot 2024-01-16 215632](https://hackmd.io/_uploads/HJkmi-VKp.png =70%x)
- Read/write enable and address/data input/output

![Screenshot 2024-01-16 221328](https://hackmd.io/_uploads/rJrSyMVYp.png =70%x)
- Command pipelined

![Screenshot 2024-01-16 221527](https://hackmd.io/_uploads/HkAF1f4tp.png =40%x)
- A MUX selects the operation for the current state.
- A MUX detects the read/write command.

**We may be curious about the meaning of the `Active` state and the `Precharge` state. Here is a brief explanation of SDRAM:**
1. Dynamic Storage: *SDRAM stores data in cells that use capacitors to hold charge. The cells lose their charge over time due to leakage, so SDRAM needs to periodically refresh the charge in the cells to retain the stored data.*
2. Row Activation: *Activating a row in SDRAM loads the contents of that row into a row buffer for read/write operations, which allows faster access to the data in the row.*
3. Precharge Operation: *After accessing a row, the bank needs to be precharged, which closes the open row and returns the bit lines to their reference level.
This is essential for the proper functioning of the memory and prepares it for the activation of other rows.*

### Problems/Solved about Original code:
#### Problem 1 - Address Mapping
``` verilog=
`define BA      9:8     // bank address bits
`define RA      22:10   // row address bits
`define CA      7:0     // column address bits
```
- We may encounter a situation where all data is stored in the same bank, because our linker script is laid out as shown below:

![Screenshot 2024-01-16 212631](https://hackmd.io/_uploads/HJImV-4Kp.png)
![Screenshot 2024-01-16 212642](https://hackmd.io/_uploads/Hyy4E-4tT.png =70%x)
- Each data type needs at least 12 bits, but the SDRAM only takes 8 bits for the column address, and the bank address at bits [9:8] restricts our size.
#### Problem 2 - Address sent into SDRAM
- In the SDRAM, we have 4 banks to store data. The source code is shown below:
``` verilog=
blkRam#(.SIZE(mem_sizes), .BIT_WIDTH(DQ_BITS)) Bank0(
    .clk(Sys_clk),
    .we(bwen[0]),
    .re(bren[0]),
    .waddr(Col_brst[9:0]),   // 10-bit write address
    .raddr(Col_brst[9:0]),   // 10-bit read address
    .d(bdi[0]),
    .q(bqd[0])
);
```
- Only 10 bits for the column address
``` verilog=
READ: begin
    cmd_d = CMD_READ;
    a_d = {2'b0, 1'b0, addr_q[7:0], 2'b0};  // 8-bit column, shifted left by 2
    state_d = WAIT;
end
```
- Since the addresses in the assembly code increase by 4 each time, we may not need to append the 2'b0 at the LSB.

=> **Original memory size: 2^10(width)/4(shift left) * 4(banks) = 1K bytes**
#### Solution
- My design (remapping)
``` verilog=
`define BA      13:12
`define RA      22:14
`define CA      11:0
```
- Then we get 12 bits for the column address.
- I also remapped `a_d`, since `a_d[10]` is the precharge signal. Note that `BA` is 13:12.
``` verilog=
READ: begin
    cmd_d = CMD_READ;
    a_d = {addr_q[11:10], 1'b0, addr_q[9:0]};  // a_d[10] (precharge bit) is held at 0
    ba_d = addr_q[9:8];
    state_d = WAIT;
end
```
- When read:
``` verilog=
if (Read_enable) begin
    Bank <= Ba;
    Col <= {Addr[12:11], Addr[9:0]};        // skip Addr[10]
    Col_brst <= {Addr[12:11], Addr[9:0]};
end
```
- This decodes the remapped address and prevents `Addr[10]` (the precharge bit) from being loaded into the block memory.
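The remapping above can be sanity-checked with a small Python model of the bit slicing. This is a sketch under assumptions: the helper names (`bits`, `split_address`, `pack_column`, `unpack_column`) are hypothetical, the field positions are taken from the `` `define `` lines of the solution, and only the column path (`a_d` packing in the controller, `Col` unpacking in the SDRAM model) is modelled.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value[hi:lo], like a Verilog part-select."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def split_address(addr: int) -> dict:
    """Split an address with the remapped fields: BA 13:12, RA 22:14, CA 11:0."""
    return {"ba": bits(addr, 13, 12), "ra": bits(addr, 22, 14), "ca": bits(addr, 11, 0)}


def pack_column(ca: int) -> int:
    """Controller side: a_d = {ca[11:10], 1'b0, ca[9:0]} (bit 10 stays 0)."""
    return (bits(ca, 11, 10) << 11) | bits(ca, 9, 0)


def unpack_column(a_d: int) -> int:
    """SDRAM side: Col = {Addr[12:11], Addr[9:0]} (bit 10 is skipped)."""
    return (bits(a_d, 12, 11) << 10) | bits(a_d, 9, 0)


if __name__ == "__main__":
    addr = (5 << 14) | (2 << 12) | 0xABC
    fields = split_address(addr)
    print(fields, hex(pack_column(fields["ca"])))
```

Packing and then unpacking returns the original 12-bit column for every value, and bit 10 of the packed address is always 0, so the precharge bit never collides with column data.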
- Now the address loaded into memory has 12 bits:
``` verilog=
blkRam#(.SIZE(mem_sizes), .BIT_WIDTH(DQ_BITS)) Bank0(
    .clk(Sys_clk),
    .we(bwen[0]),
    .re(bren[0]),
    .waddr(Col_brst[11:0]),  // 12-bit write address
    .raddr(Col_brst[11:0]),  // 12-bit read address
    .d(bdi[0]),
    .q(bqd[0])
);
```
- Also, the block memory in the source code does not seem to take the row address, so it may not support the on-page/off-page characteristic.

=> **Memory size: 2^12(width) * 4 = 16K bytes**

### Final Project
- For the contents, please refer to [Github-FinalProject](https://github.com/vic9112/SOC/tree/main/FinalProject)

## 2024-Spring Lab
- Labs based on SOC-FSIC

![FSIC](https://hackmd.io/_uploads/HkinPFEUR.png)

### Lab 1: FSIC Simulation
- [Lab Contents](https://github.com/vic9112/Advance_SOC/tree/main/lab01%20-%20fsic-sim)
- [Report](https://github.com/vic9112/Advance_SOC/blob/main/lab01%20-%20fsic-sim/report.pdf)
#### Introduction:
- In Lab 1, we explored the FSIC (Full-Stack IC) architecture to implement an IC validation system based on CaravelSOC.
#### Tasks and Experiments:
1. Configuration Programming:
    - Wrote a task to configure data from the SOC side using `tb_fsic.v`.
    - Verified the programming configuration address `32'h3000_5000` to observe changes in `user_prj_sel[4:0]`.
2. FIR Initialization:
    - Described the process of initializing the FIR from both the SOC and FPGA sides.
    - Used the AXILite-AXIS module to generate the AXI-Lite write transaction.
3. Data Streaming:
    - Explained the method to stream data X from the FPGA to the FIR module.
    - Used `fpga_axis_req` for continuous data streaming until the transfer was complete.
4. Output Verification:
    - Described how to capture output Y in the testbench and compare it with golden values.
    - Implemented an infinite loop to detect streaming output and validate the results.
5. Debugging:
    - Encountered simulation stalls due to unfamiliarity with design signals.
    - Employed print statements before and after tasks to locate stalling points.
#### Results:
- Successfully completed Test#1 and Test#2, with screenshots of simulation results and waveform analysis showing configuration cycles and data streaming.

### Lab 2-1: Catapult HLS - FIR
#### Objective:
- To improve the throughput of the FIR filter using Catapult High-Level Synthesis (HLS).
#### Experiments:
1. Throughput Enhancement:
    - Improved the throughput of `fir_multi_blks` from 10 to 1.
    - Conducted RTL simulation on QuestaSim and observed the impact of the decimator on data output.
#### Results:
- Achieved the desired throughput improvement with simulation results confirming the expected data flow.

### Lab 2-2: Catapult HLS – Edge Detection on FSIC
- [Lab Contents](https://github.com/vic9112/Advance_SOC/tree/main/lab02%20-%20edge_detect)
- [Report](https://github.com/vic9112/Advance_SOC/blob/main/lab02%20-%20edge_detect/report.pdf)
#### Objective:
- To implement edge detection on FSIC using Catapult HLS and optimize the design for improved performance.
#### Experiments:
1. Design Modifications:
    - Processed four pixels per clock cycle using packed 32-bit data.
    - Implemented Sum of Absolute Difference (SAD) for edge magnitude calculation.
    - Added CRC32 calculations for input and output images.
    - Removed angle calculation for simplified output.
2. Optimization:
    - Unrolled loops and pipelined functions to improve throughput from 6 to 1.
    - Adjusted FIFO depth and minimized register resets for area reduction.
3. Simulation:
    - Conducted RTL simulation to verify the design with different throughput settings.
    - Controlled stalling behavior in the testbench to ensure robustness against input variations.
#### Results:
- Successfully integrated the optimized design into FSIC.
- Verified simulation results on Catapult GUI, noting challenges with throughput=1 due to AA module structure.
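To make the Lab 2-2 design modifications more concrete, here is a minimal software sketch in Python. It is not the Catapult source: it assumes that pixel 0 sits in the low byte of each packed 32-bit word, that the SAD magnitude is `|dx| + |dy|` saturated to 8 bits, that the gradients are simple central differences with clamped borders, and that the CRC is the standard CRC-32 provided by `zlib.crc32`. The lab's actual kernel, packing order, and CRC polynomial may differ, and all function names here are hypothetical.

```python
import zlib


def unpack_pixels(word: int) -> list:
    """Split a packed 32-bit word into four 8-bit pixels (pixel 0 in the low byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def pack_pixels(pixels: list) -> int:
    """Inverse of unpack_pixels: pack four 8-bit pixels into one 32-bit word."""
    word = 0
    for i, pixel in enumerate(pixels):
        word |= (pixel & 0xFF) << (8 * i)
    return word


def edge_magnitude(dx: int, dy: int) -> int:
    """Sum-of-absolute-differences magnitude, saturated to 8 bits."""
    return min(abs(dx) + abs(dy), 255)


def edge_detect(image: list) -> list:
    """Central-difference gradients with clamped borders, then SAD magnitude."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            dx = image[y][min(x + 1, width - 1)] - image[y][max(x - 1, 0)]
            dy = image[min(y + 1, height - 1)][x] - image[max(y - 1, 0)][x]
            row.append(edge_magnitude(dx, dy))
        result.append(row)
    return result


def image_crc32(image: list) -> int:
    """CRC-32 over the pixels in row-major order (standard zlib polynomial)."""
    return zlib.crc32(bytes(pixel for row in image for pixel in row))
```

Computing one CRC over the input image and one over the output image gives two compact signatures, so a testbench can compare two numbers instead of every pixel.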

### Lab 3: Synopsys Flow
- [Lab Contents](https://github.com/vic9112/Advance_SOC/tree/main/lab03%20-%20snps_flow)
- [Report](https://github.com/vic9112/Advance_SOC/blob/main/lab03%20-%20snps_flow/report.pdf)
#### Objective:
- To perform a comprehensive flow from front-end synthesis to back-end implementation using Synopsys tools.
#### Steps:
1. Front-End (Design Compiler):
    - Analyzed and synthesized HDL code from RTL to gate-netlist.
    - Applied constraints and optimized for timing, area, and power.
2. Floorplan:
    - Created initial chip floorplan with power and ground nets.
    - Generated reports on clock skew, design exceptions, and timing checks.
3. Back-End (IC Compiler 2, StarRC, PrimeTime):
    - Conducted placement, routing, and detailed timing analysis.
    - Implemented power planning with mesh patterns and rectangular rings.
4. Verification (IC Validator, Formality):
    - Performed DRC, LVS, and formal verification to ensure design correctness.
#### Results:
- Completed synthesis and physical implementation with detailed reporting on quality of results and constraint violations.

### Lab 4: Caravel-FSIC FPGA
- [Lab Contents](https://github.com/vic9112/Advance_SOC/tree/main/lab04%20-%20fsic_fpga)
- [Report](https://github.com/vic9112/Advance_SOC/blob/main/lab04%20-%20fsic_fpga/report.pdf)
#### Objective:
- To validate the integration of user FIR into Caravel SoC and perform DMA operations from FPGA to FIR and vice versa.
#### Experiments:
1. DMA Design and Configuration:
    - Redesigned DMA for FIR data transmission.
    - Configured registers for DMA operations, including error detection mechanisms.
2. FIR Initialization and DMA Start:
    - Initialized FIR using AXI-Lite transactions.
    - Started DMA operations to stream data to and from FIR.
3. Simulation and Validation:
    - Loaded data into memory and configured DMA for FIR processing.
    - Validated FIR output against expected results using Python scripts.
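The validation scripts themselves are not reproduced in these notes, so the following is only a sketch of what such a Python check might look like. It assumes a direct-form FIR with zero initial state; the function names `fir_golden` and `find_mismatches`, and the tap and sample values in the usage part, are hypothetical placeholders rather than the lab's data.

```python
def fir_golden(taps: list, samples: list) -> list:
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k], with x[n] = 0 for n < 0."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        outputs.append(acc)
    return outputs


def find_mismatches(golden: list, actual: list) -> list:
    """Return (index, expected, got) for every differing output sample."""
    mismatches = [
        (i, expected, got)
        for i, (expected, got) in enumerate(zip(golden, actual))
        if expected != got
    ]
    if len(golden) != len(actual):
        mismatches.append(("length", len(golden), len(actual)))
    return mismatches


if __name__ == "__main__":
    taps = [1, 2, 3]                   # placeholder coefficients
    samples = [1, 1, 1, 1]             # placeholder input stream
    hardware_output = [1, 3, 6, 6]     # stand-in for data read back through DMA
    print(find_mismatches(fir_golden(taps, samples), hardware_output))
```

Returning the list of mismatches, rather than a single pass/fail flag, makes it easier to see whether an error is a one-off sample or an offset in the whole stream.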
#### Results:
- Successfully simulated and validated the FIR-DMA interaction on Caravel-FSIC.

### Final Project
- [Project Contents](https://github.com/vic9112/PQC_Falcon)
- [Report](https://github.com/vic9112/PQC_Falcon/blob/main/impl_ASIC/report_20240623.pdf)
#### Introduction
- In the context of post-quantum cryptography (PQC), our project focuses on the Falcon algorithm, recognized for its resistance to quantum attacks. The main objective was to transform Falcon from C code into synthesizable Hardware Description Language (HDL) for hardware acceleration, primarily using FFT, IFFT, NTT, and INTT as core functions.
#### FPGA Implementation – Vitis HLS
- Initial Implementation on HLS:
    - Implemented the iterative radix-2 FFT algorithm with bit-reversal permutation, optimizing memory use and processing steps for real-time applications.
    - Converted the original C code into HLS for synthesis, addressing non-synthesizable elements by transforming recursive loops into iterative loops.
- Optimization:
    - **Synthesizable Implementation**: Restructured the code such that each stage is computed in separate modules, enabling parallelism and pipelining.
    - **Datapath Restructure**: Separated each stage to reuse processing elements (PEs), thus saving memory and area.
    - **Complex Multiplication**: Improved the complex multiplier structure to reduce the number of multipliers required, thus saving area.
    - **Combining Algorithms**: Combined FFT and NTT, as well as their inverse counterparts, to share logic and reduce resource usage.
    - **Shared Memories**: Used a self-defined datatype to combine buffers, reducing memory usage significantly.
    - **Double Shifter & Negation**: Implemented a shifter based on IEEE-754 double precision to handle division operations, reducing DSP usage.
    - **Shared Multiplier**: Encapsulated adders and multipliers to limit usage through pragma, enabling the sharing of these resources.
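As a reference for the "iterative radix-2 FFT with bit-reversal permutation" mentioned above, here is a minimal Python sketch of the generic textbook algorithm. It is not the project's HLS code and not Falcon's exact transform (which operates on its own polynomial ring); the function names are hypothetical, and `naive_dft` is included only as a slow reference to compare against.

```python
import cmath


def reverse_bits(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def fft_iterative(data: list) -> list:
    """In-place style radix-2 decimation-in-time FFT without recursion."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stages = n.bit_length() - 1
    # Bit-reversal permutation replaces the reordering done by the recursive calls.
    a = [complex(data[reverse_bits(i, stages)]) for i in range(n)]
    half = 1
    for _ in range(stages):                          # one pass per butterfly stage
        step = cmath.exp(-1j * cmath.pi / half)      # twiddle increment for this stage
        for start in range(0, n, 2 * half):
            w = 1 + 0j
            for k in range(half):
                even = a[start + k]
                odd = a[start + k + half] * w
                a[start + k] = even + odd            # butterfly, upper output
                a[start + k + half] = even - odd     # butterfly, lower output
                w *= step
        half *= 2
    return a


def naive_dft(data: list) -> list:
    """O(n^2) reference DFT used only to check the FFT."""
    n = len(data)
    return [
        sum(data[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

Each pass of the outer loop corresponds to one stage, which is the unit that the notes above describe as being split into separate modules for pipelining.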
- Middleware
    - Designed a middleware on the PS side to allocate and control hardware kernels without modifying the Falcon code.
    - The middleware structure allows scalability and ensures efficient communication between the PS and PL sides.
#### ASIC Implementation – Catapult HLS
- Modules:
    - in_copy Module: Changed data reading method from AXI-Master to ac_channel, transferring data into local SRAM.
    - out_copy Module: Modified stage module and out_copy module into a Catapult-acceptable coding style, sharing SRAM using a struct.
- Wiring User Project:
    - Designed a state machine for different module executions, starting with data loading by DMA, transitioning through computation states, and finally writing results back to memory.
#### DMA and Interrupt
- DMA: Handled data transfer between system memory and device efficiently.
- Interrupt: Implemented an interrupt controller to manage kernel status updates and ensure synchronization between computation and data transfer.
#### Simulation & Validation
- FSIC Simulation:
    - Simulated FFT, iFFT, NTT, and iNTT to verify data accuracy and execution time. Found that optimizations are needed for kernel performance.
- On-Board Validation:
    - Used Mailbox on FPGA to generate an ISR, reducing hardware and software overhead. Faced issues with interrupt and data movement synchronization due to multithreading.
#### Synopsys IC Flow
- Conducted synthesis using Design Compiler, focusing on 'fiFFNTT'. Encountered challenges in floor planning, indicating a need for further optimization.
#### Conclusion:
This project achieved significant milestones in transforming and optimizing the Falcon algorithm for FPGA and ASIC implementations. Further work is needed to enhance kernel optimization, integrate complete Falcon flow tests, and address synchronization issues during on-board validation.