# SoC study journal

## Introduction

This course is the first part of a two-semester SoC design course. It starts with basic design skills (Verilog/HLS), moves on to SoC architecture, and ends with designing an IP and integrating it into an SoC. The laboratories are based on the Caravel SoC running on PYNQ or an FPGA. You will also develop embedded programs that run on both the Caravel/RISC-V and the FPGA/ARM processors. The course equips participants with the skills and knowledge required to become full-stack IC designers, able to handle all development stages from front-end design to system debugging and embedded programming.

## Lab1 & Lab2

### Brief introduction

First, write the desired design in C++. Next, use HLS to convert the C++ code into a Verilog IP. Then, in Vivado, connect the generated IP to the PS (Zynq) of the PYNQ-Z2 board. This enables communication between the PS and the PL on the PYNQ-Z2, driven by Python code written in Jupyter. In simple terms, Python code on the PS controls the loading and operation of the RTL code in the PL.

### Directive

Directives can be added in two ways: inline in the source (with #pragma), or through directives.tcl.

#### #pragma

![](https://hackmd.io/_uploads/ryeWDCYNl6.png)

#### directives.tcl

![](https://hackmd.io/_uploads/rksdAYNgp.png)

### Block-level interface protocol

Block-level interface protocols provide a mechanism for controlling the operation of an RTL module such that other modules and software applications can control the RTL module using the associated block-level interface. Vitis HLS uses the following interface types to specify whether an RTL IP is implemented with block-level handshake signals.

* ap_ctrl_none: No block-level handshake signals; useful for data-driven designs.
* ap_ctrl_hs: Default mode; whether execution is pipelined or non-pipelined depends on the ap_ready signal.
* ap_ctrl_chain: Pipelined, high performance.
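To make the ap_ctrl_hs handshake more concrete, here is a toy cycle-level model in Python. It is only a sketch under assumptions, not anything generated by Vitis HLS: the names `ApCtrlHsBlock`, `latency`, and `run_once` are hypothetical, and the model covers only a non-pipelined block in which ap_done and ap_ready pulse together in the final cycle and ap_start is held high until ap_ready is seen.

```python
class ApCtrlHsBlock:
    """Toy model of a non-pipelined block with ap_ctrl_hs signals (assumed behavior)."""

    def __init__(self, latency):
        self.latency = latency   # hypothetical: cycles of work per invocation
        self.remaining = 0       # cycles left in the current invocation
        self.ap_idle = True
        self.ap_done = False
        self.ap_ready = False

    def tick(self, ap_start):
        """Advance one clock cycle with the given ap_start level."""
        # ap_done and ap_ready are single-cycle pulses in this model.
        self.ap_done = False
        self.ap_ready = False
        if self.remaining == 0:
            if ap_start:
                # Start a new invocation: the block leaves the idle state.
                self.remaining = self.latency
                self.ap_idle = False
            else:
                self.ap_idle = True
            return
        self.remaining -= 1
        if self.remaining == 0:
            # Last cycle of the operation: signal completion.
            self.ap_done = True
            self.ap_ready = True


def run_once(block):
    """Hold ap_start high until ap_ready, then release it; return cycles taken."""
    cycles = 0
    while True:
        block.tick(ap_start=True)
        cycles += 1
        if block.ap_ready:
            break
    block.tick(ap_start=False)   # release ap_start; block returns to idle
    return cycles


block = ApCtrlHsBlock(latency=3)
print("cycles with ap_start held:", run_once(block))
print("idle again:", block.ap_idle)
```

The driver mirrors what a software driver does through the control register: set start, wait for the block to report completion, then let it fall back to idle.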
You can specify the block-level control protocol on the function return using the INTERFACE pragma or directive. If the C/C++ code does not return a value, you can still specify the control protocol on the function return. If the C/C++ code uses a function return, Vitis HLS creates an output port ap_return for the return value.

:star2: When the function return is specified as an AXI4-Lite interface (s_axilite), all the ports in the control protocol are bundled into the s_axilite interface. This is a common practice for software-controllable kernels or IP when an application or software driver is used to configure and control when the block starts and stops operation. This is a requirement of XRT and the Vitis kernel flow.

For details, see: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Block-Level-Control-Protocols

### Port-level interface protocol

There are three types of port protocol:

* AXI4-Master (pointer to an array, large amounts of data)
* AXI4-Lite (scalar inputs, small amounts of data)
* AXI4-Stream

#### AXI4-Master:

The complete AXI4 transfer bus is suitable for high-speed internal interconnects, but it has a complex structure and consumes a significant amount of resources.

#### AXI4-Lite:

A simplified version of the AXI4 protocol: a simple, low-throughput, address-mapped communication bus designed for interfacing with control-register-style components and enabling the creation of simple component interfaces. It can be thought of as a bus suitable for connecting low-speed peripherals.

#### AXI4-Stream:

Compared to AXI4, it eliminates the address lines, retaining only simple send and receive operations.
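The "simple send and receive" behavior of a stream can be sketched as a valid/ready handshake: a beat of data moves only in a cycle where the sender's valid and the receiver's ready are both high. The Python model below is a minimal illustration under assumptions; `stream_transfer` and `ready_pattern` are hypothetical names, the sender is assumed to always have data (valid held high), and no real AXI signals other than valid/ready are modeled.

```python
def stream_transfer(data, ready_pattern):
    """Send `data` over a valid/ready link.

    `ready_pattern` is a repeating list of booleans giving the receiver's
    ready level per cycle. Returns (received_items, cycles_used).
    """
    if data and not any(ready_pattern):
        raise ValueError("receiver is never ready; transfer cannot finish")
    received = []
    index = 0
    cycle = 0
    while index < len(data):
        tvalid = True                                    # sender always has data
        tready = ready_pattern[cycle % len(ready_pattern)]
        if tvalid and tready:
            # Handshake: both sides agree, so one beat is transferred.
            received.append(data[index])
            index += 1
        cycle += 1
    return received, cycle


items, cycles = stream_transfer([1, 2, 3], [True, False])
print(items, cycles)
```

Note that no address appears anywhere in the model: the order of arrival is the only thing that identifies a beat.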
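However, AXI-Stream sends no address, so how can memory be read or written without one? DMA is the answer: it reads and writes memory directly and independently, without the processor having to issue addresses. At the Export IP stage the design passes through an adapter: when the HLS block finishes its operation, ap_done, ap_idle, and ap_ready are exposed as hardware output ports, and the PS/processor can then read these signals from those output ports (registers).

## Lab3

### AXI-lite:

#### Relationship

AXI requires the following three relationships between channels:

* A write response must follow the last write transfer in the transaction.
* Read data must be produced after the read address has been received.
* Handshakes between channels must satisfy the inter-channel handshake dependencies.

#### Handshake

The protocol specifies handshake dependencies in order to prevent deadlock; they are defined in section A3.3 of the specification. The main principles are still the two stated in Chapter 1:

* The sender's VALID must not depend on the receiver's READY.
* The receiver's READY may wait until VALID is detected as asserted before being asserted; in other words, READY may depend on VALID.

Consider the read transaction first. On the read address channel the master is the sender and the slave is the receiver; on the read data channel the master is the receiver and the slave is the sender. In the figure:

* A single-headed arrow means the signal it points to may be asserted before or after the signal at the start of the arrow (no dependency).
* A double-headed arrow means the signal it points to must be asserted only after the signal at the start of the arrow (the pointed-to signal depends on the starting signal).

![image](https://hackmd.io/_uploads/HyJIBS1wT.png)

#### Write channel

The write address channel and the write data channel are independent, so the data can be written first and the address can arrive afterwards; there is no need to wait for the address before writing. The two channels can therefore be written as two separate always blocks. I wrote the write channel so that valid and ready are high in the same cycle and the handshake completes there.

```Verilog=
always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        awready_reg <= 1'b0;
    else if(awvalid && ~awready_reg)
        awready_reg <= 1'b1;
    else
        awready_reg <= 1'b0;
end

always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        wready_reg <= 1'b0;
    else if(wvalid && ~wready_reg)
        wready_reg <= 1'b1;
    else
        wready_reg <= 1'b0;
end
```

![image](https://hackmd.io/_uploads/BJejXAxRUT.png)

![image](https://hackmd.io/_uploads/rk9AtgCLp.png)

#### Read channel

Put simply, on the read channel the data can be read only after the address has been read. So I wrote it as follows:

```Verilog=
//** Axi lite read **//
always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        arready_reg <= 1'b0;
    else if(arvalid) begin
        if(~arready_reg && rvalid_reg)
            arready_reg <= 1'b0;
        else if(arready_reg)
            arready_reg <= 1'b0;
        else if(~arready_reg)
            arready_reg <= 1'b1;
    end
    else
        arready_reg <= 1'b0;
end

//rvalid
always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        rvalid_reg <= 1'b0;
    else if(rready &&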
            arvalid && arready_reg)
        rvalid_reg <= 1'b1;
    else if(rvalid_reg && rready)
        rvalid_reg <= 1'b0;
    else
        rvalid_reg <= 1'b0;
end
```

There are quite a few design flaws, though, such as arready. This version was written to pass verification, so it works, but I will probably rewrite it when I have time.

```Verilog=
always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        arready_reg <= 1'b0;
    else if(arvalid) begin
        if(~arready_reg && rvalid_reg)
            arready_reg <= 1'b0;
        else if(arready_reg)
            arready_reg <= 1'b0;
        else if(~arready_reg)
            arready_reg <= 1'b1;
    end
    else
        arready_reg <= 1'b0;
end
```

![image](https://hackmd.io/_uploads/HyJIBS1wT.png)

Some signals need neither a register or latch nor a reset; when they carry a valid value they can simply be assigned. tap_Di is one example: it was originally a register with a reset.

```Verilog=
always @(posedge axis_clk, negedge axis_rst_n) begin
    if(~axis_rst_n)
        tap_Di_reg <= 0;
    else if(wvalid && ~wready_reg && awaddr != 12'h00)
        tap_Di_reg <= wdata;
    else
        tap_Di_reg <= 0;
end
```

After rewriting it as a continuous assignment, a small amount of resource is saved.

```Verilog=
assign tap_Di = (wvalid && wready_reg && awaddr != 12'h00) ? wdata : 0;
```

## Lab4-1

### Firmware

This is meant to verify that we can configure the pads for the user project area.
The firmware configures the upper 16 I/O pads (16 to 31) in the user space as outputs:

```C=
reg_mprj_io_31 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_30 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_29 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_28 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_27 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_26 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_25 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_24 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_23 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_22 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_21 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_20 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_19 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_18 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_17 = GPIO_MODE_MGMT_STD_OUTPUT;
reg_mprj_io_16 = GPIO_MODE_MGMT_STD_OUTPUT;
```

Then, the firmware applies the pad configuration by enabling the serial transfer on the shift register responsible for configuring the pads and waits until the transfer is done.

![image](https://hackmd.io/_uploads/HJRnn4lPp.png)

Manual configuration

```C=
reg_mprj_xfer = 1;
while (reg_mprj_xfer == 1);
```

The firmware then flags the start/success/end of the simulation by writing a certain value to the I/Os, which the testbench checks to know whether the test started/ended/succeeded. For example, the testbench checks the value of the upper 16 of the 32 I/Os; if it is equal to 16'hAB40, then we know that the test started.

```C=
// Flag start of the test
reg_mprj_datal = 0xAB400000;
```

```Verilog=
wait(checkbits == 16'hAB40); //checkbits = reg_mprj_datal[31:16]
$display("LA Test 1 started");
```

The firmware configures the logic analyzer (LA) probes [31:0] as inputs to the management SoC to monitor the counter value, and configures the LA probes [63:32] as outputs from the management SoC (inputs to the user_proj_example) to set the counter initial value. This is done by writing to the LA probes enable registers. Note that the output enable is active low, while the input enable is active high.
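Every channel can be configured for input, output, or both independently.

```C=
reg_la0_oenb = reg_la0_iena = 0x00000000;    // [31:0]
reg_la1_oenb = reg_la1_iena = 0xFFFFFFFF;    // [63:32]
reg_la2_oenb = reg_la2_iena = 0x00000000;    // [95:64]
reg_la3_oenb = reg_la3_iena = 0x00000000;    // [127:96]
```

By default, main() is placed in the flash section. Once main() is moved to mprjram, the compiler simplifies some instructions, and main() no longer has to jump to mprjram to fetch instructions.

Before:

```C=
void main(){

}
```

After:

```C=
void __attribute__ ( ( section ( ".mprjram" ) ) ) main(){

}
```

### Hardware

When firmware is loaded through spiflash into the user project BRAM, there is a delay of 10 cycles (10T) plus 2 cycles (2T) for the response, 12 cycles in total. For ease of implementation, a counter is used.

![image](https://hackmd.io/_uploads/Byyq3vgvp.png)

```Verilog=
//-----------------------------------
// WB MI A
assign valid = wbs_cyc_i && wbs_stb_i;
assign wstrb = wbs_sel_i & {4{wbs_we_i}} & {4{wbs_cyc_i}};
assign wbs_dat_o = rdata;
assign wdata = wbs_dat_i;
//-----------------------------------
// wbs_ack_o
assign wbs_ack_o = (counter == 4'd12);
// counter
always @(posedge clk) begin
    if(rst | (counter == 4'd12))
        counter <= 4'd1;
    else if(valid == 1'b1)
        counter <= counter + 1'b1;
    else
        counter <= 4'd1;
end
//-----------------------------------
```

## Lab4-2

These are the various optimization options of the GCC (GNU Compiler Collection) compiler. With the right settings, the firmware may match the behavior of the hardware you designed perfectly.

* `-O0`: The default. No optimization is performed; compilation is fastest, but the generated code may run slower.
* `-O1`: A balance between compile time and execution efficiency. It enables the most basic optimizations without significantly increasing compile time.
* `-O2`: Performs more optimizations, such as instruction reordering and code merging. It usually produces faster code than `-O1`, at the cost of slightly longer compile time.
* `-O3`: The highest standard optimization level. It adds further optimizations on top of `-O2`, including ones that may increase code size. It usually produces the fastest code, but compile time is the longest and code size may grow.
* `-Os`: Focuses on reducing code size rather than maximizing speed. This is very useful for memory-constrained systems.
* `-Ofast`: Includes all the optimizations of `-O3` and adds further ones that may not comply with the language standards but can give extra performance.
* `-Oz`: Optimizes aggressively for code size. It is similar to `-Os` but goes further in shrinking the generated binary. This is very useful on systems with very limited memory (for example, some embedded systems or microcontrollers). Note, however, that `-Oz` may sacrifice more execution efficiency in exchange for the smaller size.
* `-Og`: An optimization level aimed at debuggability. `-Og` provides reasonable optimization while preserving as much of the information needed for debugging as possible. It performs better than `-O0` (no optimization) without making the code hard to follow and debug the way `-O1` and higher levels can. `-Og` is the ideal choice when debugging during development.
* `-funroll-loops`: Unrolls loops. This can speed up loop-intensive code but may increase code size.
* `-finline-functions`: Considers functions for inlining even when they are not declared inline, which can reduce function-call overhead.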
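* `-fomit-frame-pointer`: Omits the frame pointer in functions. This usually saves a little memory and improves speed, but may make debugging harder.
* `-fstrict-aliasing`: Enables strict aliasing analysis. This can improve performance, but requires the code to follow the C/C++ standard's pointer aliasing rules strictly.
* `-ffast-math`: Relaxes the accuracy and standards-conformance requirements of floating-point math in exchange for speed.
* `-fexpensive-optimizations`: Enables a number of relatively time-consuming optimizations.
* `-floop-optimize`: Enables loop-related optimizations, such as loop unrolling and loop-invariant code motion.
* `-finline-small-functions`: Automatically inlines small functions.
* `-fthread-jumps`: Performs jump threading, a control-flow optimization.
* `-falign-functions`, `-falign-jumps`, `-falign-loops`: Set the alignment of functions, jump targets, and loops, which can sometimes improve performance.
* `-fprefetch-loop-arrays`: Prefetches array data used in loops, which helps reduce memory access latency.
* `-march`, `-mtune`: These options optimize for a specific CPU type. `-march` specifies the target architecture for code generation, and `-mtune` specifies the processor to tune performance for.

## Lab5

### Spiflash

* Master Output Slave Input (MOSI)
* Master Input Slave Output (MISO)
* Serial Clock (SCK)
* Slave Select (SS)

![image](https://hackmd.io/_uploads/Hk4ALRi_p.png)

#### Spiflash.v

BRAM interface: the BRAM is only ever read through SPI here, so the data input and the write enable are both tied off. The enable is asserted only once bytecount >= 4; judging from the receive logic, this is because the first four bytes carry the command and the 24-bit address, so data is read only after the address is complete. The memory part acts as a decoder: spi_addr selects which byte of the 32-bit BRAM data to take, and that byte is passed to outbuf for serial output.

```verilog
// BRAM Interface
assign romcode_Addr_A = {8'b0, spi_addr};
assign romcode_Din_A = 32'b0;
assign romcode_EN_A = (bytecount >= 4);
assign romcode_WEN_A = 4'b0;
assign romcode_Clk_A = ap_clk;
assign romcode_Rst_A = ap_rst;

// 16 MB (128Mb) Flash
// reg [7:0] memory [0:16*1024*1024-1];
wire [7:0] memory;
assign memory = (spi_addr[1:0] == 2'b00) ? romcode_Dout_A[7:0] :
                (spi_addr[1:0] == 2'b01) ? romcode_Dout_A[15:8] :
                (spi_addr[1:0] == 2'b10) ?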
                romcode_Dout_A[23:16] :
                romcode_Dout_A[31:24];
```

SPI data reception: each time the buffer fills with 8 bits, bytecount is incremented by 1, which means one byte has been received. Once 32 bits have been received (the command byte and the three address bytes), data transmission can begin.

```verilog
wire [7:0] buffer_next = {buffer[6:0], io0};

always @(posedge spiclk or posedge csb) begin
    // csb deassert -> reset internal states
    if (csb) begin
        buffer <= 0;
        bitcount <= 0;
        bytecount <= 0;
    end else begin
        // csb active -> count bit, byte
        buffer <= buffer_next;
        bitcount <= bitcount + 1;
        if (bitcount == 7) begin
            bitcount <= 0;
            bytecount <= bytecount + 1;
            // spi_action;
            if(bytecount == 0) spi_cmd <= buffer_next; // command
            if(bytecount == 1) spi_addr[23:16] <= buffer_next;
            if(bytecount == 2) spi_addr[15:8] <= buffer_next;
            if(bytecount == 3) spi_addr[7:0] <= buffer_next;
            if(bytecount >= 4 && spi_cmd == 'h03) begin
                // buffer <= memory;
                spi_addr <= spi_addr + 1;
            end
        end
    end
end
```

SPI data transmission: once the 32-bit data is ready, data is read from the BRAM into outbuf and sent to Caravel. outbuf is a left-shift register (serial output), which is what the SPI protocol requires.

```verilog=
// io1 output
// assign io1 = buffer[7];
assign io1 = outbuf[7];
```

```Verilog=
always @(negedge spiclk or posedge csb) begin
    if(csb) begin
        outbuf <= 0;
    end else begin
        outbuf <= {outbuf[6:0],1'b0};
        if(bitcount == 0 && bytecount >= 4) begin
            outbuf <= memory;
        end
    end
end
```

#### Study caravel_fpga.ipynb

Declare an 8K ROM size.

```python=
ROM_SIZE = 0x2000 #8K
```

Allocate an 8K buffer in the DRAM of the FPGA board. ROM_SIZE is shifted right by 2 bits because each element is 32 bits (4 bytes), which also means the data is 32-bit aligned.

```python=
# Allocate dram buffer will assign physical address to ip ipReadROMCODE
npROM = allocate(shape=(ROM_SIZE >> 2,), dtype=np.uint32)
```

Check whether each line starts with '@'. If it does, the line gives the memory address at which the following data will be loaded; if it does not, the line holds data to be written.

```python=
for line in fiROM:
    # offset header
    if line.startswith('@'):
        # Ignore first char @
        npROM_offset = int(line[1:].strip(b'\x00'.decode()), base = 16)
        npROM_offset = npROM_offset >> 2 # 4byte per offset
        #print (npROM_offset)
        npROM_index = 0
        continue
```

Convert each byte to an int and shift it into position so that 4 bytes form one 32-bit word. Then check whether 4 bytes have been collected: as soon as there are exactly 4, the word is written to DRAM; if bytes are left over at the end of the line, the remaining data is written to DRAM as well, to keep 32-bit alignment.

```python=
    # We suppose the
    # data must be 32bit alignment
    # (this fragment continues inside the `for line in fiROM` loop)
    buffer = 0
    bytecount = 0
    for line_byte in line.strip(b'\x00'.decode()).split():
        buffer += int(line_byte, base = 16) << (8 * bytecount)
        bytecount += 1
        # Collect 4 bytes, write to npROM
        if(bytecount == 4):
            npROM[npROM_offset + npROM_index] = buffer
            # Clear buffer and bytecount
            buffer = 0
            bytecount = 0
            npROM_index += 1
            #print (npROM_index)
            continue
    # Fill rest data if not alignment 4 bytes
    if (bytecount != 0):
        npROM[npROM_offset + npROM_index] = buffer
        npROM_index += 1
```

Write to the BRAM.

```python=
# Program physical address for the romcode base address
ipReadROMCODE.write(0x10, npROM.device_address)
ipReadROMCODE.write(0x14, 0)

# Program length of moving data
ipReadROMCODE.write(0x1C, rom_size_final)

# ipReadROMCODE start to move the data from rom_buffer to bram
ipReadROMCODE.write(0x00, 1) # IP Start
while (ipReadROMCODE.read(0x00) & 0x04) == 0x00: # wait for done
    continue

print("Write to bram done")
```

## Lab6

#### Simulation on Jupyter Notebook

The firmware code runs much faster than the PS side, so the PS side easily misses the value at offset 0x1C. Following an approach from the discussion forum, I added while loops to both the Python code and the firmware code, so that the firmware waits until the PS side has captured the value at 0x1C.

Refer to: https://github.com/bol-edu/HLS-SOC-Discussions/discussions/175#discussioncomment-7789823

```python=
async def task_check():
    while((ipPS.read(0x1c)&0xffff0000) != 0xAB410000):
        continue
    print("matmul started")
    while((ipPS.read(0x1c)&0xffff0000) != 0x003E0000):
        continue
    print("matmul 1 passed")
    while((ipPS.read(0x1c)&0xffff0000) != 0x00500000):
        continue
    print("matmul 2 passed")
    while((ipPS.read(0x1c)&0xffff0000) != 0xAB420000):
        continue
    print("qsort started")
    while((ipPS.read(0x1c)&0xffff0000) != 0x00280000):
        continue
    print("qsort 1 passed")
    while((ipPS.read(0x1c)&0xffff0000) != 0x23710000):
        continue
    print("qsort 2 passed")
    while((ipPS.read(0x1c)&0xffff0000) != 0xAB430000):
        continue
    print("fir 1 started")
    while((ipPS.read(0x1c)&0xffff0000) != 0x00000000):
        continue
    print("fir 2 started")
    while((ipPS.read(0x1c)&0xffff0000) != 0x044A0000):
        continue
    print("fir passed")
    print("ALL Test Done")
```

```c=
int count = 0;
while(count < 100000) count++;   // delay so the PS side can catch up
int *tmp = matmul();
reg_mprj_datal = *tmp << 16; //0x1c
```

#### How to fix a hold time violation

In the .tcl script, add the following line after the implementation run is opened:

```
open_run impl_1
phys_opt_design -hold_fix    ;# <--- add this line
```

This fixes the violation. Alternatively, entering the command in the Tcl Console in Vivado also works.

![image](https://hackmd.io/_uploads/ByC94BlYp.png)

However, the command only takes effect if you first go to Report Timing Summary under IMPLEMENTATION and then enter it.

Before:

![image](https://hackmd.io/_uploads/Byb7HBlYT.png)

After:

![image](https://hackmd.io/_uploads/ByoUrBlYp.png)
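Returning to the polling in the Jupyter simulation of this lab: each busy-wait loop in task_check spins forever if the expected flag never shows up at 0x1C. Below is a minimal sketch of a bounded poll that gives up after a timeout. Everything in it is an assumption made for illustration: `wait_for_flag`, `make_fake_reg`, and the `timeout_s` / `poll_s` parameters are hypothetical names, and the fake register stands in for `ipPS.read(0x1c)` so the example runs without the board.

```python
import time


def wait_for_flag(read_reg, expected, mask=0xFFFF0000, timeout_s=1.0, poll_s=0.0):
    """Poll read_reg() until (value & mask) == expected.

    Returns True on a match and False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if (read_reg() & mask) == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        if poll_s:
            time.sleep(poll_s)   # optional pause between reads


def make_fake_reg(values):
    """Stand-in for the hardware register: replays `values`, then repeats the last one."""
    iterator = iter(values)
    last = [0]

    def read():
        last[0] = next(iterator, last[0])
        return last[0]

    return read


# The flag appears on the second read, so the poll succeeds.
print(wait_for_flag(make_fake_reg([0x00000000, 0xAB410000]), 0xAB410000))
# The flag never appears, so the poll gives up after the timeout.
print(wait_for_flag(make_fake_reg([0x00000000]), 0xAB410000, timeout_s=0.01))
```

On the board, `read_reg` would be something like `lambda: ipPS.read(0x1c)`; a False result then points to a missed or never-written flag instead of a hung notebook cell.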