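# Caravel_SOC

## 1. Vitis/Vivado Tools Implementation

### 1.1. Vitis HLS (generate IP)

#### 1.1.1. Add files to the project

Place the xxx.cpp and xxx.h design files and the xxx.cpp test bench file into the corresponding locations of the Vitis HLS project. Then select xxx.cpp and xxx.h under Source, and the test bench xxx.cpp under Test Bench.

#### 1.1.2. Set the Top Function name

#### 1.1.3. Set directives

Directives can be set with #pragma in the .cpp file, or in the panel on the right.

#### 1.1.4. C simulation (result log: multip_2num_csim.log)

Next, run C Simulation to verify the behavior of the IP. This produces the xxx_csim.log file.

#### 1.1.5. C synthesis

Then run C Synthesis. When it finishes, the synthesis report is shown in the main window. The reports are xxx.rpt and xxx_csynth.rpt.

#### 1.1.6. Cosimulation

Next, run Cosimulation. At this point the directive #pragma HLS INTERFACE ap_ctrl_none port=return must be commented out, and C Synthesis must be rerun before Cosimulation. Then click Cosimulation to verify the design. To inspect the waveform, run Open Wave Viewer.

#### 1.1.7. Export IP

### 1.2. Vivado (IP and SoC integration)

#### 1.2.1. Import IP

After the IP design is complete, export the IP from Vitis HLS, then import that exported IP into Vivado.

#### 1.2.2. Create Block Design

In the project manager, click Create Block Design and add the components (ZYNQ7 Processing System and multip_2num_0) in the Diagram tab. Because the block design contains only one IP, the wiring can be completed automatically: just run Run Connection Automation. Because block automation resets the ZYNQ configuration, always run Run Block Automation first and only then finish configuring the ZYNQ.

AXI-Lite and AXI-Master use different ports on the processing system block, so after Run Block Automation, double-click the processing system block and enable the HP port. In addition, AXI-Master to stream conversion is implemented with the Xilinx DMA IP, which needs to be adjusted and configured. In the block design, besides importing the HLS fir_n11_strm IP and adding it to the Diagram, also add the Xilinx DMA IP component, then double-click the DMA block:

1. Disable Scatter-Gather Mode.
2. Adjust the Width of Buffer Length Register.
3. Select a single Read Channel, a single Write Channel, or both Read/Write Channels. This depends on the design requirements; this lab needs two DMAs, one read-only as dma in and one write-only as dma out.

The DMA block names can be changed in the left panel. The python code refers to the dma in and dma out names, so remember to rename them to axi_dma_in_0 and axi_dma_out_0. Then keep running Run Connection Automation until the prompt no longer appears. Note that two wires must be connected manually:

* pstrminput of FIR_N11_STRM to M_AXIS_MM2S of axi_dma_in_0
* pstrmoutput of FIR_N11_STRM to S_AXIS_S2MM of axi_dma_out_0

#### 1.2.3. Generate Bitstream (Synthesis / Placement / Routing in one step)

Run MakeBit.bat to produce the .bit and .hwh files that the FPGA needs at run time. They are generated in the root directory.

### 1.3. Online FPGA

Rent an online FPGA board and log in to Jupyter Notebook. Upload the .bit and .hwh files obtained above together with Multip2Num.ipynb, then run Multip2Num.ipynb to get the verification result. (On the KV260 the file location is ol = Overlay("/home/root/jupyter_notebook/FIRN11Stream.bit").)

### 1.4. Observed & learned

Lab2 implements two of the port-level protocols: AXI-Master and AXI-Stream. An AXI-Master interface can connect directly to the HP port on the PYNQ, so only the interface itself is needed in between. AXI-Stream is a streaming protocol with no notion of address and cannot connect directly to the HP port, so an axi_dma is needed in between as an adapter. Besides converting between AXI-Stream and AXI-Master, the axi_dma has its addresses programmed over AXI-Lite through the GP port on the PYNQ. In both cases the locations of the input, output, coefficients, and transfer-length register must be understood and fully set before asserting ap_start. Lab2 also confirmed that AXI-Master was faster than AXI-Stream. We gained a deeper understanding of data communication across the whole SoC, and the python code settings are gradually becoming clear as well.

## 2. fir hardware

### 2.1. Describe operation

#### 2.1.1. How to receive data-in and tap parameters and place into SRAM

When tap_EN = 1 and tap_WE = 4'hF, wdata is written to the location given by awaddr (after shifting). When data_EN = 1 and data_WE = 4'hF, ss_tdata is written to the location given by data_A.

#### 2.1.2. How to access shiftram and tapRAM to do computation

The ctrl module outputs the matching tap_addr and data_addr, and the values are then read from the RAMs.

e.g. RAM_size = 3 * data_size, data: 1, 2, 3, 4, tap: -1, 3, -1

* 1st

| data_ram_addr | A0 | A1 | A2 |
|---|---|---|---|
| tap_ram_addr | A2 | A0 | A1 |
| data | 1 | 0 | 0 |
| tap | -1 | -1 | 3 |

* 2nd

| data_ram_addr | A0 | A1 | A2 |
|---|---|---|---|
| tap_ram_addr | A1 | A2 | A0 |
| data | 1 | 2 | 0 |
| tap | 3 | -1 | -1 |

* 3rd

| data_ram_addr | A0 | A1 | A2 |
|---|---|---|---|
| tap_ram_addr | A0 | A1 | A2 |
| data | 1 | 2 | 3 |
| tap | -1 | 3 | -1 |

* 4th

| data_ram_addr | A0 | A1 | A2 |
|---|---|---|---|
| tap_ram_addr | A2 | A0 | A1 |
| data | 4 | 2 | 3 |
| tap | -1 | -1 | 3 |

#### 2.1.3. How ap_done is generated

After ss_tlast = 1 and sm_tlast = 1, state = ap_done (2'b10).

### 2.2. Simulation Waveform

1. Initially state = ap_idle (2'b00). The taps are loaded first and stored in RAM.
2. The testbench then reads back and checks the taps.
3. When the ap_start signal from the testbench is received, ap_start_sig = 1 and state = ap_start (2'b01).
4. Once state = ap_start (2'b01), ss_tdata is written into RAM and ctrl_tap_ready = 1. The ctrl module starts to output the matching tap_addr and data_addr, the values are read from RAM and multiply-accumulated, and the result is output with sm_tvalid = 1.
5. After ss_tlast = 1 and sm_tlast = 1, state = ap_done (2'b10).

## 3. fir firmware

### 3.1. Prepare firmware code & RTL

#### 3.1.1. Generate data in header file – fir.h

Define the tap parameters and the input signal in the header file, as in lab3.

#### 3.1.2. C code – fir.c

Implement the FIR function in C.

#### 3.1.3. Firmware management in main()

In testbench/counter_la_fir.c, the parameter reg_mprj_xfer is initially set to 1, and the FIR is not started until it is cleared to 0 externally.

#### 3.1.4. Linker for address arrangement

In firmware/sections.lds, mprjram is our BRAM. Its origin address is 0x38000000 and its size is 4 KB.

#### 3.1.5. Design BRAM in user_project

Estimate the required RAM size. Design the controller that connects to the Wishbone bus; the ack response must be returned after a delay (10 delay cycles).

### 3.2. Compilation

#### 3.2.1. Run_clean

#### 3.2.2. Run_sim

## 4. fir hardware & firmware combined

### 4.1. The interface protocol between firmware, user project and testbench

1. Firmware and the user project communicate over Wishbone.
2. The testbench communicates with the firmware and the user project through mprj.

### 4.2. Waveform and analysis of the hardware/software behavior

1. fir.v first loads the taps; the tap address is 0x30000040.
2. fir.v runs 3 times; the x address is 0x30000080.
3. fir.v finally reads the output; the y address is 0x30000084.

### 4.3. What is the FIR engine theoretical throughput, i.e. data rate? Actually measured throughput?

1. Clock period: 25 ns
2. Theoretical throughput (data rate): 13 clocks
3. Actually measured throughput: 72 clocks

### 4.4. What is latency for firmware to feed data?

1. 59 clocks

### 4.5. What techniques used to improve the throughput?

#### 4.5.1. Does bram12 give better performance, in what way?

Yes. bram12 has one more location, so one more data item can be stored and accessed. If the data is prefetched, subsequent processing speeds up, which gives better performance.

#### 4.5.2. Can you suggest other method to improve the performance?

Increase the bandwidth of the data input.

### 4.6. Compilation

#### 4.6.1. Run_clean

#### 4.6.2. Run_sim

### 4.7. Metrics

Metrics = #-of-clock (latency-timer) * clock_period * gate-resource

Metrics = 72 * 13.120 ns * (547 + 265) = 767,047.68

## 5. Verification on jupyter notebook

### 5.1. Run_vitis

#### 5.1.1. Vitis HLS (generate IP)

### 5.2. Run_vivado

#### 5.2.1. Vivado (IP and SoC integration)

### 5.3. Online FPGA

After renting a board, verify the design with python on jupyter.

### 5.4. Explain the function of IP in this design

#### 5.4.1. HLS

1. read_romcode: the .hex file is initially loaded into the DDR on the PS side and has to be moved into the hardware BRAM, so the read_romcode IP is used to transfer the data.
2. ResetControl: the IP the PS side needs in order to talk to the caravel gpio and reset. For the reset required before execution starts, the python code sets the reset signal through the ResetControl IP and caravel starts running. After the firmware finishes, the reset can be released.
3. caravel_ps: the IP the PS side needs in order to talk to the caravel mprj_io. When caravel finishes, it sends some information through mprj_io and the caravel_ps IP to the python code, which uses it to check that the function is correct.

#### 5.4.2. Verilog

1. Spiflash: holds the ".hex" data from the BRAM and supplies it to caravel.

### 5.5. Execution result on all workload (jupyter)

1. Open the ".hex" file.
2. Write the ".hex" contents into the BRAM. The ".hex" contents start with "@".
3. Check the mprj_io output.
4. Drive reset to 1; caravel re-executes the firmware.
5. After the ".hex" firmware has run, check the mprj_io output, i.e. the value at 0x1c.

### 5.6. Study caravel_fpga.ipynb, and be familiar with caravel SoC control flow

In lab4 the test bench relies on the simulator to run the interaction in between. In lab5 the python code runs the verification, and the interaction in between is carried out by actual IPs that move the data. The .hex file is initially loaded into the DDR on the PS side and has to be moved into the hardware BRAM, so the read_romcode IP is used to transfer the data. There is also the ResetControl IP, which the PS side needs to talk to the caravel gpio and reset, and the caravel_ps IP, which the PS side needs to talk to the caravel mprj_io. For the reset required before execution starts, the python code sets the reset signal through the ResetControl IP and caravel starts running. After the firmware finishes, the reset can be released. When caravel finishes, it sends some information through mprj_io and the caravel_ps IP to the python code, which uses it to check that the function is correct.

## 6. UART & interrupt

### 6.1. Suggestion for improving latency for UART loop

1. Optimize the firmware code.
2. Support the UART in hardware.
3. Interrupts: use interrupt-driven UART communication. This lets the microcontroller perform other tasks while waiting for data, reducing idle time and improving overall system performance.
4. Baud rate optimization: set the baud rate to the highest value that both the sending and receiving devices support reliably. A higher baud rate gives faster data transfer.
5. FIFO buffer: a FIFO buffer can hold several bytes before the microcontroller has to process the data, which reduces processing time.
6. DMA: use DMA for UART data transfer. DMA moves data directly between memory and the peripheral without involving the CPU, which lowers the CPU load and improves latency.
7. Buffering and data flow: overlap data processing with data transfer.

### 6.2. What else do you observe

lab6 implements interrupt handling together with UART transfer.

On the firmware side, when caravel receives an external interrupt (if USER_PROJ_IRQ0_EN), it updates the mask in the CSR to 2 (binary 10) and enables the IRQ. It then executes isr.c, which in turn runs uart.c, to read the data that caused the interrupt and then write the data back out. It includes user_uart.h, which defines the read (0x30000000) and write (0x30000004) addresses.

On the hardware side, when caravel receives an external interrupt (if USER_PROJ_IRQ0_EN), the external data enters through mprj_io[5] into uart_rx. When caravel executes the read in isr.c, uart_ctrl passes the data to caravel. When caravel executes the write in isr.c, uart_ctrl passes the data to uart_tx, and it is returned to the outside through mprj_io[6].

For verification, the test bench simulation uses tbuart to read and write data on mprj_io[5] and mprj_io[6], and the simulator runs the interaction in between. When python code runs the verification instead, the interaction in between is carried out by actual IPs that move the data. The .hex file is initially loaded into the DDR on the PS side and has to be moved into the hardware BRAM, so the read_romcode IP is used to transfer the data. There is also the ResetControl IP, which the PS side needs to talk to the caravel gpio and reset, and the caravel_ps IP, which the PS side needs to talk to the caravel mprj_io. For the reset required before execution starts, the python code sets the reset signal through the ResetControl IP and caravel starts running. After the firmware finishes, the reset can be released. When caravel finishes, it sends some information through mprj_io and the caravel_ps IP to the python code, which uses it to check that the function is correct.

Because this lab involves UART and interrupts, axi_uartlite and interrupt_ctrl are added as well. When the PS side sends data to caravel, the data goes through the interconnect and then through axi_uartlite to caravel, that is, axi_uartlite_tx drives caravel_uart_rx. In the other direction, when caravel sends data to the PS side, the data goes through axi_uartlite, then the interconnect, and then interrupt_ctrl to the PS side, that is, caravel_uart_tx drives axi_uartlite_rx. The role of interrupt_ctrl is to let the PS side know that caravel has sent data that is ready to be read.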
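
The read-then-write-back behavior of the ISR described in this section can be modeled in a few lines of Python. This is only a behavioral sketch under assumptions: the two addresses are the ones quoted from user_uart.h, but the class `UartModel`, its methods, the `isr` and `echo` helpers, and the choice to drain every pending byte in one ISR call are hypothetical stand-ins for uart_rx, uart_ctrl, uart_tx, and isr.c, not the actual RTL or firmware.

```python
from collections import deque

RX_DATA_ADDR = 0x30000000  # read address quoted from user_uart.h
TX_DATA_ADDR = 0x30000004  # write address quoted from user_uart.h


class UartModel:
    """Hypothetical behavioral stand-in for uart_rx / uart_ctrl / uart_tx."""

    def __init__(self):
        self.rx_fifo = deque()    # bytes that arrived on mprj_io[5]
        self.tx_log = []          # bytes sent back out on mprj_io[6]
        self.irq_pending = False  # models the user-project interrupt line

    def receive_from_pin(self, byte):
        """The external side drives one byte in, which raises the interrupt."""
        self.rx_fifo.append(byte & 0xFF)
        self.irq_pending = True

    def read(self, addr):
        """Bus read; only the RX data address is modeled."""
        if addr != RX_DATA_ADDR:
            raise ValueError(f"unmapped read address {addr:#x}")
        byte = self.rx_fifo.popleft()
        # The interrupt stays asserted only while unread bytes remain.
        self.irq_pending = bool(self.rx_fifo)
        return byte

    def write(self, addr, byte):
        """Bus write; only the TX data address is modeled."""
        if addr != TX_DATA_ADDR:
            raise ValueError(f"unmapped write address {addr:#x}")
        self.tx_log.append(byte & 0xFF)


def isr(uart):
    """Echo handler: read the byte that caused the interrupt, write it back."""
    while uart.irq_pending:
        uart.write(TX_DATA_ADDR, uart.read(RX_DATA_ADDR))


def echo(payload):
    """Feed payload one byte at a time, servicing the interrupt after each."""
    uart = UartModel()
    for b in payload:
        uart.receive_from_pin(b)
        isr(uart)
    return bytes(uart.tx_log)
```

Calling `echo(b"hi")` returns `b"hi"`, mirroring what tbuart would observe on mprj_io[6] after driving mprj_io[5]. Keeping the received bytes in a queue also shows why the FIFO suggestion in 6.1 helps: several bytes can arrive before the handler runs, and one ISR invocation then services all of them.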