Lab A Question

## Dataflow Debug and Analysis Run on Vitis HW-emulation

### Introduce the Dataflow viewer, cross-referenced with the code

```
Hardware_Acceleration/Feature_Tutorials/03-dataflow_debug_and_optimization/reference_files/dataflow/diamond.cpp
```

Apply the DATAFLOW pragma or directive to your design for the Dataflow viewer to be populated.

```c
void diamond(data_t vecIn[N], data_t vecOut[N]) {
    // Intermediate buffers between the four tasks
    data_t c1[N], c2[N], c3[N], c4[N];
#pragma HLS dataflow
    funcA(vecIn, c1, c2);   // produces c1 and c2
    funcB(c1, c3);          // consumes c1, produces c3
    funcC(c2, c4);          // consumes c2, produces c4
    funcD(c3, c4, vecOut);  // consumes c3 and c4
}
```

After synthesizing your design ...

![synthesis_summary_report_view (1)](https://hackmd.io/_uploads/By-n6Ua5gx.png)

![open_dataflow_viewer](https://hackmd.io/_uploads/Skd668Tqxe.png)

Open a new Dataflow viewer window for the function as shown below.

![dataflow_viewer_start_page (1)](https://hackmd.io/_uploads/HyYRaL65xx.png)

Compare the dataflow graph with the code above. funcB and funcC do not depend on each other, so they can execute in parallel.

* Blue edges represent data dependencies between the functions
* Green edges represent the inferred FIFO channels between the functions
* Gold edges represent the inferred PIPO channels

c1, c2, c3, and c4 are fixed-size arrays (implemented as RAM). funcA writes to c1 and c2 while funcB reads from c1 and funcC reads from c2 at the same time, so the tool automatically converts these intermediate arrays into PIPOs.

### View the Dataflow Graph after RTL co-simulation

Run co-simulation.

![rtl-cosim (1)](https://hackmd.io/_uploads/rk4Txwpclg.png)

When the simulation finishes, the Vivado XSIM waveform viewer opens (because of the Wave Debug option) so that you can inspect the waveforms recorded during simulation (because of the Dump Trace option).

After C/RTL co-simulation, the elements of the graph are filled in with performance data, and the Process and Channel tables beneath the graph are filled in as well.
![process_table (1)](https://hackmd.io/_uploads/H1fm4D6cex.png)

**Run multiple iterations to make sure the design is flushing the FIFOs**

Note that the testbench runs this function for 3 iterations.

![waveform_summary (1)](https://hackmd.io/_uploads/Syh_bvp9ll.png)

Note: a single function call gives the latency; multiple function calls are needed to estimate the initiation interval (II).

Since the functions communicate via PIPOs, stalling can occur.

![waveform_detailed (1)](https://hackmd.io/_uploads/rJWIQDT5xe.png)

While funcB and funcC are still processing data, funcA has to wait after it finishes, which may cause a stall. Similarly, funcB and funcC may stall while waiting for funcD.

### Use the Dataflow viewer to investigate a deadlock case

```
Hardware_Acceleration/Feature_Tutorials/03-dataflow_debug_and_optimization/reference_files/deadlock/example.cpp
```

Run co-simulation. Select the Channel (PIPO/FIFO) Profiling option before clicking OK.

The GUI automatically launches the Dataflow Viewer because a deadlock is detected in this design. Click "+".

![dataflow_deadlock_view2 (1)](https://hackmd.io/_uploads/BkP0rw6qeg.png)

Deadlocked processes are shown in red in the graph.

### Resize FIFO to resolve the deadlock

Add

```
#pragma HLS stream depth=10 variable=data_channel1
#pragma HLS stream depth=1 variable=data_channel2
```

to each function where the hls::stream variables are declared. The appropriate FIFO depth can be found in three ways through the GUI.

### For better illustration:

**Show the kernel code and the testbench code**

The top function.
```c
void example(hls::stream<int>& A, hls::stream<int>& B){
#pragma HLS dataflow
#pragma HLS INTERFACE ap_fifo port=&A
#pragma HLS INTERFACE ap_fifo port=&B
    hls::stream<int> data_channel1;
    hls::stream<int> data_channel2;

    proc_1(A, data_channel1, data_channel2);
    proc_2(data_channel1, data_channel2, B);
}
```

proc_1 contains two parts:

```c
void proc_1(hls::stream<int>& A, hls::stream<int>& B, hls::stream<int>& C){
#pragma HLS dataflow
    hls::stream<int> data_channel1;
    hls::stream<int> data_channel2;

    proc_1_1(A, data_channel1, data_channel2);
    proc_1_2(B, C, data_channel1, data_channel2);
}
```

Here are the inner tasks.

```c
void proc_1_1(hls::stream<int>& A, hls::stream<int>& data_channel1, hls::stream<int>& data_channel2){
    int i;
    int tmp;
    // First loop: fill data_channel1 only
    for(i = 0; i < 10; i++){
        tmp = A.read();
        data_channel1.write(tmp);
    }
    // Second loop: data_channel2 is written only after the first loop ends
    for(i = 0; i < 10; i++){
        data_channel2.write(tmp);
    }
}

void proc_1_2(hls::stream<int>& B, hls::stream<int>& C, hls::stream<int>& data_channel1, hls::stream<int>& data_channel2){
    int i;
    int tmp;
    // Each iteration needs one item from both channels
    for(i = 0; i < 10; i++){
        tmp = data_channel2.read() + data_channel1.read();
        B.write(tmp);
    }
    for(i = 0; i < 10; i++){
        C.write(tmp);
    }
}
```

The function proc_1_1 first writes to data_channel1 10 times, and only then starts writing to data_channel2. The two loops run one after the other, not in parallel.

Meanwhile, every iteration of the first loop in proc_1_2 reads from both data_channel1 and data_channel2. While data_channel2 is empty, that loop stalls.

If the FIFO depth of data_channel1 is less than 10, proc_1_1 blocks on the full data_channel1 before it ever writes to data_channel2, while proc_1_2 keeps waiting on the empty data_channel2. Neither process can proceed, so a deadlock occurs.

Note that this example only shows how to handle a deadlock caused by insufficient FIFO depth. Resizing the FIFO resolves the deadlock, but I think a better solution is to apply task-level parallelism to the two for loops in proc_1_1 and proc_1_2.

Here is the testbench.
```c
int main()
{
    int i;
    hls::stream<int> A;
    hls::stream<int> B;

    int time = 0;
    // Call the top function four times, refilling the input stream each time
    for (time = 0 ; time < 4; time ++) {
        for(i=0; i < SIZE; i++){
            A << (i + time);
        }
        example(A,B);
    }
    return 0;
}
```

**Show the Dataflow table – block time, max-depth**

This is the table when a deadlock occurs.

![screenshot](https://hackmd.io/_uploads/B1zghFp5xl.png)

The corresponding buffer depth is marked in red. After resizing the FIFO, the cosim MAX Depth indicates the required FIFO depth.

![screenshot](https://hackmd.io/_uploads/SJBB2YTqeg.png)

There are three ways to modify the FIFO depth.

### Introduce different methods of FIFO sizing

* Manual FIFO sizing
* Global FIFO sizing
* Auto FIFO sizing

### Experiment with manual setting channel with different FIFO size, and run co-sim. Explain why the FIFO size is

Right click...

![adjusting_fifo_depth (1)](https://hackmd.io/_uploads/H1FA6Y69eg.png)

Modify the depth.

![modify_depth](https://hackmd.io/_uploads/rkckAFa9gl.png)

The GUI will prompt you to rerun the co-simulation. After modifying all FIFO depths, click Yes to rerun the co-simulation without changing the source code.

Once you determine a good set of FIFO depths through experimentation, **remember to update the source code accordingly**. There are two ways to update the source code:

1. Directly add the pragma in the source code.
2. Use the GUI to apply the changes.

![screenshot](https://hackmd.io/_uploads/BkQXJc6qge.png)

I suggest adding the pragma directly in the source code, because the GUI sometimes fails to apply it to a few functions.

Here the cosim MAX Depth indicates the required FIFO depth.

### Use Global FIFO Sizing, Max-Depth

After **C simulation**, find this line in the report.

```
The maximum depth reached by any of the 26 hls::stream() instances in the design is 40
```

Then we can force all FIFO depths to 40, and determine the required FIFO depth after co-simulation.
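The idea of over-provisioning every FIFO and then reading back the observed maximum depth can be illustrated with a small software-only model of the proc_1_1 / proc_1_2 exchange. This is only a sketch under stated assumptions, not Vitis HLS code and not cycle-accurate: `Fifo`, `Result`, and `simulate` are hypothetical names, the consumer is assumed to fire only when both channels hold data, and the output streams B and C are assumed never to block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical bounded-FIFO model that records its peak occupancy.
struct Fifo {
    std::size_t depth;   // capacity under test
    std::size_t count;   // current occupancy
    std::size_t peak;    // highest occupancy seen so far
    explicit Fifo(std::size_t d) : depth(d), count(0), peak(0) {}
    bool can_write() const { return count < depth; }
    bool can_read() const { return count > 0; }
    void write() { assert(can_write()); ++count; peak = std::max(peak, count); }
    void read() { assert(can_read()); --count; }
};

struct Result {
    bool completed;      // false means neither process could advance
    std::size_t peak1;   // peak occupancy of data_channel1
    std::size_t peak2;   // peak occupancy of data_channel2
};

// Replays the first loops of proc_1_1 and proc_1_2 with the given depths.
Result simulate(std::size_t depth1, std::size_t depth2, int n = 10) {
    Fifo ch1(depth1), ch2(depth2);
    int w1 = 0, w2 = 0, r = 0;
    while (r < n) {
        bool progress = false;
        // Producer: n writes to channel 1, and only then n writes to channel 2.
        if (w1 < n) {
            if (ch1.can_write()) { ch1.write(); ++w1; progress = true; }
        } else if (w2 < n) {
            if (ch2.can_write()) { ch2.write(); ++w2; progress = true; }
        }
        // Consumer: each iteration needs one item from both channels.
        if (ch1.can_read() && ch2.can_read()) {
            ch1.read(); ch2.read(); ++r; progress = true;
        }
        // Nobody moved in this step: the model is deadlocked.
        if (!progress) return Result{false, ch1.peak, ch2.peak};
    }
    return Result{true, ch1.peak, ch2.peak};
}
```

In this model, `simulate(40, 40)` completes with peak occupancies of 10 and 1, which agrees with the depths used in the `#pragma HLS stream` lines earlier, while `simulate(9, 1)` reports a deadlock. The co-simulation report remains the authority for the real design, since actual scheduling can differ from this simplified step order.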
![solution_settings (1)](https://hackmd.io/_uploads/HkEDxcT9eg.png)

![global_setting1 (1)](https://hackmd.io/_uploads/rkmug5T5lg.png)

### Automated FIFO Sizing

Co-sim → enable Dynamic Deadlock Prevention.

![auto_fifo_setting (1)](https://hackmd.io/_uploads/Synheq69el.png)

![auto_fifo_sizing (1)](https://hackmd.io/_uploads/rkYJbcT5xx.png)

### Compare the result of the three methods of FIFO sizing