Lab01 (2023/09/28):
Brief introduction to the overall system: First, write the design you need in C++. Then use HLS to convert the C code into a Verilog IP. Next, in Vivado, connect the generated IP to the PS (Zynq) of the PYNQ-Z2, so that the PS and PL sides of the PYNQ-Z2 can communicate. The communication is driven by Python code written in Jupyter; simply put, Python code on the PS side controls the RTL that has been programmed into the PL.

PYNQ board introduction: The PYNQ-Z2 has two very important parts, the Processing System (PS) and the Programmable Logic (PL). The PS, i.e. the Zynq processor, usually handles the operating system and data from external peripherals, while the PL is used to accelerate specific functions such as DSP. Communication between PS and PL normally goes over AXI, which is also what this Lab 1 uses: the AXI interconnect and the AXI protocol let the Zynq talk to the IP we wrote.

Pragmas: the interface pragma configures the AXI-Lite and block-level interfaces. During HLS, pragmas tell the tool which architecture the code should be turned into in hardware. For example, #pragma HLS INTERFACE defines the ports used for hardware communication, such as how data is transferred, and pragmas like unroll and pipeline likewise control the form of the generated hardware.

There are two ways to add directives: inline (using #pragma) or through a directives.tcl file. When should we use inline, and when should we use the directive file?
Inline: the architecture is defined by the pragmas, so the pragmas are part of the design. If the code is put on GitHub with the pragmas kept in a separate file, the design is split across multiple files; someone who only receives the source code gets an incomplete design. It is therefore usually recommended to use the inline style and keep the pragmas in the source code, so that handing over the source code hands over the full design information and the design stays in one piece.
Directive file: early in an experiment, when trying out different architectures, we open many solution folders but want the source code to stay the same while trying different pragmas in different solutions; this is when directives.tcl is used. For the final release, switch back to inline pragmas so the design is complete.

Why do we have to comment out the ap_ctrl_none pragma before co-simulation? As its name suggests, ap_ctrl_none tells HLS that the function has no block-level control interface. When running co-simulation, HLS still needs a block-level interface to drive the simulation to completion, which is why ap_ctrl_none must be commented out.
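Since the inline style is the recommended one, here is a minimal sketch of what a top-level HLS function with inline interface pragmas could look like. The function `vec_scale`, its ports, and the bundle name are hypothetical placeholders, not the lab's actual design; the pragmas themselves (axis, s_axilite, ap_ctrl_none, PIPELINE) are standard Vitis HLS directives.

```c
// Minimal sketch of the inline pragma style (hypothetical top function,
// not the actual lab design): data ports as AXI4-Stream, configuration and
// block-level control (ap_start/ap_done/ap_idle) on one AXI4-Lite bundle.
#define N 32

void vec_scale(int in[N], int out[N], int factor) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=factor bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
// #pragma HLS INTERFACE ap_ctrl_none port=return  // if this is used instead of
// the s_axilite control above, comment it out before running co-simulation

    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * factor;
    }
}
```

Because the pragmas sit next to the loop and the port declarations, anyone who clones the source file gets the complete architectural intent, which is exactly the argument for the inline style above.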
Lab02 (2023/10/05):
Zynq PS-PL interfaces:
AXI Master:
(1) AXI HP/ACP
(2) Typically higher performance IP
(3) An IP with an AXI Master interface can be connected directly to the PS-PL interface through its HP port.
AXI Stream:
(1) Via DMA
(2) HP/ACP ports for data path
(3) GP slave for control
(4) The PS-PL interface has no Stream port, and a Stream interface carries no address, so the data must go through a DMA. The DMA converts the Stream interface into a Master interface, which does carry address information. That address information is set up over AXI-Lite by programming the registers inside the DMA, for example to tell it the base address of the buffer.
Summary: a Stream has no address; add a DMA to supply the address information and it becomes a Master interface with addresses inside. (The DMA is mandatory.)
AXI (Lite) Slave IP:
(1) General Purpose ports
(2) Typically lower performance IP
(3) The slave is programmed from the PS.

DMA Class
(1) Direct memory access
  (a) Transfer data between memories directly
    - PS-PL
  (b) Bypasses CPU
    - Doesn't waste CPU cycles on data transfer
  (c) Speed up memory transfers with burst transactions
(2) Xilinx AXI Direct Memory Access IP block supported in PYNQ
  (a) Read and write paths from PL to DDR and DDR to PL
  (b) Memory Mapped to Stream
  (c) Stream to Memory Mapped
(3) Needs to stream to/from an allocated memory buffer
  (a) PYNQ DMA class inherits from xlnk for memory allocation

AXI Direct Memory Access IP
(1) AXI Lite control interface (AXI GP port)
(2) Memory mapped interface (AXI interface, HP/ACP ports)
(3) AXI Stream interface (AXI stream accelerator)
(4) Transfer between streams and memory mapped locations
  (a) Paths from PL to DRAM and DRAM to PL
  (b) Memory Mapped to Stream (MM2S)
  (c) Stream to Memory Mapped (S2MM)

What is the difference between C simulation (csim) and co-simulation (cosim)?
csim is used to verify the functional correctness of the C source code. cosim not only verifies functional correctness but also verifies that the RTL is functionally identical to the C source code.
csim and cosim can be broken into the following steps (see the figure):
0. Different from an event-driven hardware simulator.
1. Phase #1: C simulation is executed to prepare the "input vectors" for the top-level function.
2. Phase #2: RTL simulation starts; it takes the input vectors and generates the "output vectors".
3. Phase #3: C simulation of the test bench main() function continues; it takes the "output vectors" returned from the RTL simulation and performs verification of the result.
In short, the input vectors produced by C simulation are fed to the RTL in the middle, and the output of the RTL simulation is then checked for correctness. As stated above, csim only checks whether the C code is functionally correct, while cosim checks not only whether the RTL code is functionally correct but also whether the cosim results match the csim results.
![](https://hackmd.io/_uploads/rJAw2gnga.png)
C/RTL co-simulation uses a C test bench, running the main() function, to automatically verify the RTL design running in behavioral simulation. The C/RTL verification process consists of three phases:
  (a) The C simulation is executed and the inputs to the top-level function, or the Design-Under-Test (DUT), are saved as "input vectors."
  (b) The "input vectors" are used in an RTL simulation using the RTL created by Vitis HLS in the Vivado simulator, or a supported third-party HDL simulator. The outputs from the RTL, or results of simulation, are saved as "output vectors."
  (c) The "output vectors" from the RTL simulation are returned to the main() function of the C test bench to verify the results are correct. The C test bench performs verification of the results, in some cases by comparing to known good results.
Reference: Vitis High-Level Synthesis User Guide (UG1399), https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Automatically-Verifying-the-RTL
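To make the three phases concrete, here is a minimal C test-bench sketch in the style UG1399 describes, reusing the hypothetical `vec_scale` top function from the earlier sketch. Everything here (DUT name, data size, golden model) is an illustrative placeholder, not the lab's real test bench.

```c
// Minimal C test-bench sketch (hypothetical DUT vec_scale and golden model).
// Phase 1: main() prepares the inputs ("input vectors").
// Phase 2: in co-simulation, the call into the DUT is executed as an RTL
//          simulation driven with those input vectors.
// Phase 3: control returns here and the "output vectors" are verified.
#include <stdio.h>

#define N 32
void vec_scale(int in[N], int out[N], int factor);  // the top function (DUT)

int main(void) {
    int in[N], out[N], golden[N];
    int factor = 3;

    for (int i = 0; i < N; i++) {   // prepare the input vectors
        in[i] = i;
        golden[i] = i * factor;     // known-good reference results
    }

    vec_scale(in, out, factor);     // DUT call (replaced by RTL sim in cosim)

    int errors = 0;
    for (int i = 0; i < N; i++)     // verify the output vectors
        if (out[i] != golden[i]) errors++;

    if (errors) {
        printf("FAIL: %d mismatches\n", errors);
        return 1;                   // a non-zero return marks the test as failed
    }
    printf("PASS\n");
    return 0;
}
```

The same main() is used unchanged for csim and cosim; only what sits behind the `vec_scale` call differs, which is why a passing cosim shows the RTL matches the C model.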
Lab03 (2023/10/20):
In this lab, the TB imitates software to control a FIR engine designed in Verilog. First, following the figure below from the workbook, the testbench shown in the figure is completed.
In it:
(1) reset_task drives the reset signal
(2) load_input_task reads the test patterns from the txt files
(3) check_idle_task checks whether the design's ap_idle is 1
(4) axi_in_task sends data_length, the tap parameters, and ap_start
(5) transmit_Xn_task sends the x data over AXI-Stream
(6) receive_Yn_task receives the y data (the answers) over AXI-Stream
(7) polling_ap_done_task polls ap_done over AXI-Lite to check whether the design has finished
(8) cal_latency_task computes the latency
![](https://hackmd.io/_uploads/rJg79iyf6.png)
In the TB I completed each of the tasks below in order:
![](https://hackmd.io/_uploads/BJiwqsyG6.png)
On the design side, I split the computation across three FSMs:
(1) The FSM in the top row handles:
  (a) waiting for ap_start in the idle state
  (b) getting Xn from the TB
  (c) storing Xn into the RAM
  (d) reading Xn and the taps from the RAM and computing the FIR
  (e) sending the computed result back to the TB
  (f) repeating (b)~(e) until all the data are processed
  (g) sending ap_done to the TB
The coefficients used in the computation, together with ap_start, ap_done, and ap_idle, are handled in the leftmost idle state and the rightmost ap_done state.
(2) The lower-left FSM implements the AXI-Lite read channel:
  (a) wait for arvalid in the idle state (wait for the request)
  (b) based on the received address, fetch the ap signals or a coefficient from the design or the RAM
  (c) assert arready and wait for rready
  (d) assert rvalid and return the ap signals or the coefficient
(3) The lower-right FSM implements the AXI-Lite write channel:
  (a) wait for awvalid in the idle state (wait for the AXI request)
  (b) assert awready and move to the next state
  (c) assert wready and wait for wvalid
  (d) based on the received address and data, write the ap signals or the coefficient into the design or the RAM
![](https://hackmd.io/_uploads/HJP8jjJMT.png)
• How to receive data-in and tap parameters and place them into SRAM?
Using the FSM above and the AXI-Lite protocol: first wait for awvalid, then assert awready and wready and wait for wvalid; at that point the tap parameters are received. Data-in instead comes over the AXI-Stream protocol: when ss_tvalid = 1, assert ss_tready to take the data. Writing into the SRAM is handled in a dedicated state with the control signals set accordingly; a counter tracks the required write/read timing and computes the write address. If the word is a tap parameter it is written into the SRAM, otherwise it is handled separately. (In the figure below, the counter is cleared at (1), the writes/reads and the address are controlled at (2), and at (3) the tap parameters are read out to compute the FIR.)
![](https://hackmd.io/_uploads/Bye6isyMT.png)
• How to access the shiftRAM and tapRAM to do the computation
As shown in the figure above, (3) is the state that reads the tap parameters. The biggest issue when reading is that the access has to be issued two cycles early: in the first cycle the values are driven onto A/EN/WE/Di, in the second cycle the SRAM latches them, and in the third cycle Dout is available and is stored. I use the same approach for the data RAM.
• How ap_done is generated
While the TB streams in data, the design keeps monitoring ss_tlast (see the figure below). When ss_tlast = 1 the event is recorded, and once it is confirmed that the next output will be the last answer, the FSM shown in the first question moves to the done state, which sets ap_done = 1.
![](https://hackmd.io/_uploads/SJOynsyfT.png)

Caravel SoC: User Design Interface, TB, FW
User Project Wrapper:
* Provides the interface between the Management Core and the User Project
* Wishbone
  * Range 0x30000000 ~ 0x3FFFFFFF
* Logic Analyzer [127:0]
* MPRJ_IO [37:0]
* User Clock
* IRQ [2:0]
* Implementation
  * user_project_wrapper.v
When the user project detects cycle and strobe = 1, it must respond on the Wishbone bus.

Midterm notes:
HLS: https://hackmd.io/p5uP-8I_Q-S2ApXcU0r40w
Interrupt: https://hackmd.io/4p9dTB73Q-C3H4fVZyrumw

Lab4-1:
Lab 4-1 focuses on executing code from user memory in the context of the SoC design. Key activities include:
* Preparing the firmware code and the RTL (Register Transfer Level) design.
* Working with FIR (Finite Impulse Response) in C code.
* Managing the firmware in the main function.
* Arranging addresses using the linker.
* Designing the BRAM (Block RAM) in the user project.
* Compiling the code, including steps such as converting .elf files to .hex format and exporting the assembly code for debugging.
* Synthesizing and verifying the designs.
* Writing the FIR C code and RTL in the user project, including a controller for the delayed response and for the communication between the BRAM and the Wishbone bus.
The lab also provides specific file paths and scripts to assist in these tasks.

Lab4-2:
Lab 4-2 integrates the previously designed FIR hardware accelerator and firmware into the Caravel user project area. This involves executing the RISC-V FIR firmware from the user project memory and optimizing performance through software/hardware co-design. Key tasks include:
* Integrating Lab3-FIR and exmem-FIR (from Lab4-1) with a Wishbone interface.
* Designing a memory-mapped I/O (MMIO) configuration for the RISC-V core to interface with the FIR (see the firmware sketch below).
* Developing a RISC-V/FIR handshake protocol for data movement and latency measurement.
* Simulating and synthesizing the integrated user project in the Caravel SoC, focusing on performance metrics such as latency and resource utilization.
The lab challenges students to enhance system performance by creatively designing the firmware and hardware components.
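Because the MMIO configuration and handshake in Lab 4-2 are only summarized above, here is a minimal firmware-side sketch of the idea, using the user-project Wishbone window (0x30000000 ~ 0x3FFFFFFF) mentioned in the Caravel notes. All register offsets, bit positions, and the stall-until-ready behavior below are assumptions for illustration, not the lab's actual register map or protocol.

```c
// Minimal firmware sketch for an MMIO handshake with a FIR accelerator
// mapped into the user-project Wishbone range (0x30000000 ~ 0x3FFFFFFF).
// All offsets and bit fields below are hypothetical, not the lab's real map.
#include <stdint.h>

#define FIR_BASE      0x30000000UL
#define REG_AP_CTRL   (*(volatile uint32_t *)(FIR_BASE + 0x00)) // bit0 ap_start, bit1 ap_done (assumed)
#define REG_DATA_LEN  (*(volatile uint32_t *)(FIR_BASE + 0x10)) // number of samples (assumed)
#define REG_TAP(i)    (*(volatile uint32_t *)(FIR_BASE + 0x40 + 4*(i))) // tap coefficients (assumed)
#define REG_X_IN      (*(volatile uint32_t *)(FIR_BASE + 0x80)) // stream-in data word (assumed)
#define REG_Y_OUT     (*(volatile uint32_t *)(FIR_BASE + 0x84)) // stream-out result word (assumed)

void fir_run(const int32_t *taps, int n_taps, const int32_t *x, int32_t *y, int len) {
    for (int i = 0; i < n_taps; i++)   // program the tap parameters
        REG_TAP(i) = (uint32_t)taps[i];
    REG_DATA_LEN = (uint32_t)len;      // program the data length
    REG_AP_CTRL  = 0x1;                // ap_start

    for (int i = 0; i < len; i++) {    // move data: one x in, one y out
        REG_X_IN = (uint32_t)x[i];
        y[i] = (int32_t)REG_Y_OUT;     // assumes the hardware stalls the Wishbone
    }                                  // access until the result is ready

    while ((REG_AP_CTRL & 0x2) == 0)   // poll ap_done
        ;
}
```

The key design point is that every configuration write and every data move is an ordinary load/store into the Wishbone window, so latency can be measured simply by counting cycles around this function.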
Lab5:
Lab 5 focuses on integrating the Caravel SoC with the FPGA. It involves tasks such as:
* Using a Jupyter Notebook project with the bitstream, firmware, Python code, and Vitis HLS projects.
* Implementing and simulating design elements such as Read_ROMcode, Spiflash, and the Caravel GPIO.
* Managing the project structure, including the Caravel SoC source code and the testbench code.
* Executing the Vitis and Vivado scripts to build the HLS projects and generate the bitstream.
* Developing Python host code to interact with and verify the FPGA implementation.
The lab emphasizes hands-on experience with FPGA integration and real-world application of SoC design principles.

Lab6:
Lab 6 centers on developing a workload-optimized SoC (System on Chip) using baseline approaches. It involves:
* Firmware development for tasks such as matrix multiplication, quick sort, FIR, and the ISR (UART receive/transmit).
* Hardware integration, including elements from Lab4-1 (exmem-FIR) and the UART design.
* Verification through simulation and FPGA implementation, with a focus on various performance aspects.
* Managing the project structure and running the simulations and implementations for the different components (matrix multiplication, quick sort, FIR, UART).
The lab aims to enhance students' understanding of SoC design and optimization through practical, hands-on experience.

Lab D:
Lab D focuses on the integration and optimization of SDRAM (Synchronous Dynamic Random-Access Memory) in the SoC design. Key aspects of the lab include:
* Replacing the BRAM with SDRAM in the SoC, including the implementation of an SDRAM controller and device.
* Designing the SDRAM controller to support page mode and to manage the user-interface signals.
* Implementing state management in the SDRAM controller, including the initialization, idle, read, write, precharge, and refresh states.
* Addressing memory remapping, prefetching in the SDRAM controller, and bank interleaving to optimize code execution and data fetch (a small address-split sketch follows at the end of these notes).
* Analyzing the SDRAM controller design, the prefetch scheme, bank interleaving, and refresh conflicts.
* Writing a detailed report on the design, implementation, and performance analysis of the SDRAM integration in the SoC project.

For all of the labs above, please refer to the final reports we wrote.
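As an appendix to the Lab D notes (referenced in the bank-interleaving item above), here is a minimal sketch of how a bank-interleaved mapping could split a word address into row/bank/column fields so that consecutive blocks land in different banks. The field widths and the address granularity are hypothetical and would have to match the real SDRAM geometry and controller.

```c
// Hypothetical address split for a bank-interleaved SDRAM mapping.
// Placing the bank bits just above the column bits makes consecutive
// column-sized blocks fall into different banks, so the controller can
// activate the next bank while the current one is still busy.
// Field widths are placeholders, not the lab's real geometry.
#include <stdint.h>
#include <stdio.h>

#define COL_BITS   8   // column address width (assumed)
#define BANK_BITS  2   // 4 banks (assumed)
#define ROW_BITS  12   // row address width (assumed)

typedef struct { uint32_t row, bank, col; } sdram_addr_t;

static sdram_addr_t split_addr(uint32_t word_addr) {
    sdram_addr_t a;
    a.col  =  word_addr                            & ((1u << COL_BITS)  - 1);
    a.bank = (word_addr >> COL_BITS)               & ((1u << BANK_BITS) - 1);
    a.row  = (word_addr >> (COL_BITS + BANK_BITS)) & ((1u << ROW_BITS)  - 1);
    return a;
}

int main(void) {
    // Consecutive 256-word blocks map to different banks, so sequential
    // code fetch can overlap the activation of the next bank.
    for (uint32_t w = 0; w < 4; w++) {
        sdram_addr_t a = split_addr(w << COL_BITS);
        printf("word 0x%05x -> bank %u row %u col %u\n",
               (unsigned)(w << COL_BITS), (unsigned)a.bank,
               (unsigned)a.row, (unsigned)a.col);
    }
    return 0;
}
```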