# HLS LabA Chapter4

## Lab 1

```cpp
#include "adders.h"

int adders(int in1, int in2, int in3) {

// Prevent IO protocols on all input ports
#pragma HLS INTERFACE ap_none port=in3
#pragma HLS INTERFACE ap_none port=in2
#pragma HLS INTERFACE ap_none port=in1

	int sum;

	sum = in1 + in2 + in3;

	return sum;
}
```

![image](https://hackmd.io/_uploads/SJDisyC9le.png)

![image](https://hackmd.io/_uploads/B1S9Ijn5xg.png)

```cpp
#include "adders.h"

int adders(int in1, int in2, int in3) {
#pragma HLS INTERFACE ap_ctrl_none port=return

// Prevent IO protocols on all input ports
#pragma HLS INTERFACE ap_none port=in3
#pragma HLS INTERFACE ap_none port=in2
#pragma HLS INTERFACE ap_none port=in1

	int sum;

	sum = in1 + in2 + in3;

	return sum;
}
```

![image](https://hackmd.io/_uploads/rJKvMi25lx.png)

The difference between the two versions is whether `#pragma HLS INTERFACE ap_ctrl_none port=return` is applied:

- Without it, Vitis HLS uses the default `ap_ctrl_hs` and automatically inserts the control handshake signals `ap_start`, `ap_done`, `ap_idle`, and `ap_ready`.
- With `ap_ctrl_none`, those `ap_*` control ports are not generated, and the block is treated as continuously running without block-level handshaking.

### Question

1. Show the default block-level, port-level protocol table

![image](https://hackmd.io/_uploads/H1F_Z2aqxg.png)

2. How to specify the block-level protocol?

Add the pragma `#pragma HLS INTERFACE <mode> port=return`, where `<mode>` is `ap_ctrl_hs` (the default), `ap_ctrl_chain`, or `ap_ctrl_none`.

3. Show Interface table, and cross-reference signals and corresponding protocol

| C        | Protocol     | RTL Signals                                                     |
| -------- | ------------ | --------------------------------------------------------------- |
| `return` | `ap_ctrl_hs` | `ap_start`, `ap_done`, `ap_idle`, `ap_ready`, `ap_return[31:0]` |
| `in1`    | `ap_none`    | `in1[31:0]`                                                     |
| `in2`    | `ap_none`    | `in2[31:0]`                                                     |
| `in3`    | `ap_none`    | `in3[31:0]`                                                     |

4. Show co-simulation waveform, explain the ap_ctrl_hs interface behavior

![image](https://hackmd.io/_uploads/H14mBnT5lx.png)

![image](https://hackmd.io/_uploads/B1JWwnp9ll.png)

When `ap_start` and `ap_ready` are both 1 on a rising clock edge, a transaction is accepted. `ap_done` goes high when that transaction completes (in this design, in the same cycle, because the latency is zero cycles). Because the block is fully combinational and has no in-flight work across cycles, `ap_idle` remains 1. `ap_ready` staying high indicates that the block can accept new data every cycle (II=1).

5. Use ap_ctrl_none -> Cosim failures

![image](https://hackmd.io/_uploads/HyL8t2acll.png)

No failure was observed.

## Lab 2

```tcl
############################################################
## This file is generated automatically by Vitis HLS.
## Please DO NOT edit it.
## Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
############################################################
set_directive_interface -mode ap_vld "adders_io" in1
set_directive_interface -mode ap_ack "adders_io" in2
set_directive_interface -mode ap_hs "adders_io" in_out1
```

![image](https://hackmd.io/_uploads/BkRjAjh5xx.png)

![image](https://hackmd.io/_uploads/rynQnk09ll.png)

![image](https://hackmd.io/_uploads/rkp2ajhceg.png)

Note: `ap_hs` includes both `ap_vld` and `ap_ack`. The pointer argument `in_out1` is both an input and an output of the function.

### Question

1. List all the port-level protocol from Vitis HLS (2022.1) manual

`ap_none` / `ap_vld` / `ap_ack` / `ap_hs` / `ap_ovld` / `ap_fifo` / `ap_memory` / `bram` / `axis` / `s_axilite` / `m_axi`

2. Show the interface table & waveform to explain the signal behavior

`in1` (`ap_vld`): `in1_ap_vld = 1` means the data on `in1` is valid in that cycle, but it does not confirm whether the block actually consumed it.

`in2` (`ap_ack`): `in2_ap_ack = 1` means the block accepted (consumed) the data in that cycle. The driver should hold the input stable and change it only after seeing `ap_ack`.

`in_out1` (`ap_hs`): It has both `vld` and `ack`.
Whether for input or output, a transfer completes only when `vld` and `ack` are both 1 on the clock edge.

## Lab 3

### Single-port RAM (Solution1)

```cpp
#include "array_io.h"

// The data comes in organized in a single array.
// - The first sample for the first channel (CHAN)
// - Then the first sample for the 2nd channel etc.
// The channels are accumulated independently
// E.g. For 8 channels:
// Array Order :  0  1  2  3  4  5  6  7  8     9     10    etc. 16       etc...
// Sample Order: A0 B0 C0 D0 E0 F0 G0 H0 A1    B1    C1    etc. A2       etc...
// Output Order: A0 B0 C0 D0 E0 F0 G0 H0 A0+A1 B0+B1 C0+C1 etc. A0+A1+A2 etc...

void array_io (dout_t d_o[N], din_t d_i[N]) {
	int i, rem;

	// Store accumulated data
	static dacc_t acc[CHANNELS];
	dacc_t temp;

	// Accumulate each channel
	For_Loop: for (i=0;i<N;i++) {
		rem=i%CHANNELS;
		temp = acc[rem] + d_i[i];
		acc[rem] = temp;
		d_o[i] = acc[rem];
	}
}
```

![image](https://hackmd.io/_uploads/HyyGrpn5xl.png)

### Dual-port RAM (Solution2)

```tcl
############################################################
## This file is generated automatically by Vitis HLS.
## Please DO NOT edit it.
## Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
############################################################
set_directive_top -name array_io "array_io"
set_directive_unroll "array_io/For_Loop"
set_directive_interface -mode bram -storage_impl bram -storage_type ram_2p "array_io" d_i
set_directive_interface -mode ap_fifo "array_io" d_o
```

![image](https://hackmd.io/_uploads/Sy9gdT35gg.png)

### Array Partition (Solution3)

```tcl
############################################################
## This file is generated automatically by Vitis HLS.
## Please DO NOT edit it.
## Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
############################################################
set_directive_top -name array_io "array_io"
set_directive_unroll "array_io/For_Loop"
set_directive_array_partition -dim 1 -factor 2 -type block "array_io" d_i
set_directive_array_partition -dim 1 -factor 4 -type block "array_io" d_o
set_directive_interface -mode ap_fifo "array_io" d_o
set_directive_interface -mode bram -storage_type ram_2p -storage_impl bram "array_io" d_i
```

![image](https://hackmd.io/_uploads/ryoY5Tn5lg.png)

### Fully Partitioned Array Interfaces (Solution4)

```tcl
############################################################
## This file is generated automatically by Vitis HLS.
## Please DO NOT edit it.
## Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
############################################################
set_directive_top -name array_io "array_io"
set_directive_unroll "array_io/For_Loop"
set_directive_array_partition -dim 1 -type complete "array_io" d_i
set_directive_array_partition -dim 1 -type complete "array_io" d_o
set_directive_interface -mode ap_fifo "array_io" d_o
```

![image](https://hackmd.io/_uploads/Hk3Kiah9ex.png)

### Result Compare

#### Performance

![image](https://hackmd.io/_uploads/Sk3wh6n9xx.png)

#### Utilization

![image](https://hackmd.io/_uploads/BkWt2a25xg.png)

### Question

1. Rolled loop, use dual-port RAM. What does the synthesis report show?

![image](https://hackmd.io/_uploads/SynmWTpcex.png)

In the figure, the `d_i` side shows the extra `q1`, `ce1`, and `address1` signals, which indicates that it supports dual-port reads.

2. Unroll the loop and compare the latency for each combination of input (single/dual port) and output (single/dual port). Explain why.

Single/Dual => solution1/solution2

Making `d_i` dual-port can improve the input-side Initiation Interval (II), but the output `d_o` is implemented as a FIFO, which accepts at most one write per cycle. The output side therefore remains the bottleneck, the achieved II stays the same, and the overall latency is nearly unchanged between the two designs.
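The bottleneck argument can be sketched as a back-of-the-envelope model. This is only an illustration under stated assumptions, not what the Vitis HLS scheduler actually computes: every loop iteration is assumed to need one read of `d_i` and one write of `d_o`, pipeline depth and the `acc` dependency are ignored, and the function names and the value `N = 32` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Cycles needed to issue `accesses` operations on an interface that can
// perform `ports` operations per clock cycle (ceiling division).
inline int cycles_for(int accesses, int ports) {
    assert(accesses >= 0 && ports > 0);
    return (accesses + ports - 1) / ports;
}

// Simplified lower bound on the cycle count of the unrolled loop:
// the slower of the input side and the output side dominates.
inline int loop_cycle_bound(int n, int read_ports, int write_ports) {
    return std::max(cycles_for(n, read_ports), cycles_for(n, write_ports));
}
```

In this model, with an illustrative `N = 32`, `loop_cycle_bound(32, 1, 1)` and `loop_cycle_bound(32, 2, 1)` both give 32: doubling the read ports does not help while the FIFO output still takes one write per cycle. Only when both sides offer two accesses per cycle does the bound drop to 16.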
![image](https://hackmd.io/_uploads/BkqL-aa9xe.png)

3. Array partition

- Unroll & array_partition with different type = block/cyclic/complete, factor = 2, 4
- Observe latency, resource used and explain why?

Solution 4/5/6/7/8 => complete/block2/block4/cyclic2/cyclic4

![image](https://hackmd.io/_uploads/ByeqxkCclg.png)

Latency: Complete < cyclic 4 < block 4 < cyclic 2 < block 2

Resource used: Complete < block 4 < block 2 < cyclic 2 ~= cyclic 4

**Latency**:

- Complete: Everything is in registers, so there are no bank conflicts and the latency is the smallest.
- Cyclic4: Data are striped across banks (bank1: 0,4,8…; bank2: 1,5,9…; bank3: 2,6,10…; bank4: 3,7,11…). Our loop mostly accesses consecutive indices, so conflicts are rare and the latency is second best.
- Block4: Consecutive addresses fall into the same bank, so conflicts are more frequent than with Cyclic4 and the latency is slightly worse.
- Cyclic2 / Block2: Fewer banks increase contention, which gives the worst latency among the options.

**Resource Usage**:

- Complete: Implemented entirely with a small number of registers and minimal control logic, which is more efficient than using RAM in this case.
- Block4 vs. Block2: Block4 uses fewer resources; Block2 sees more frequent conflicts and therefore needs more arbitration/scheduling logic.
- Cyclic4 ≈ Cyclic2: Both need bank selection (effectively a modulo-based mapping) and mux/demux control. With this small total size, that control overhead dominates, so their resource usage is about the same.

## Lab 4

### Cyclic Partition (Axis I/O)

![image](https://hackmd.io/_uploads/HJAXgR2cle.png)

![image](https://hackmd.io/_uploads/HJOUeC2clx.png)

### Axi4-lite Imp Addr

```cpp
// ==============================================================
// Vitis HLS - High-Level Synthesis from C, C++ and OpenCL v2022.1 (64-bit)
// Tool Version Limit: 2022.04
// Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
// ==============================================================
// control
// 0x0 : Control signals
//       bit 0  - ap_start (Read/Write/COH)
//       bit 1  - ap_done (Read/COR)
//       bit 2  - ap_idle (Read)
//       bit 3  - ap_ready (Read/COR)
//       bit 7  - auto_restart (Read/Write)
//       bit 9  - interrupt (Read)
//       others - reserved
// 0x4 : Global Interrupt Enable Register
//       bit 0  - Global Interrupt Enable (Read/Write)
//       others - reserved
// 0x8 : IP Interrupt Enable Register (Read/Write)
//       bit 0 - enable ap_done interrupt (Read/Write)
//       bit 1 - enable ap_ready interrupt (Read/Write)
//       others - reserved
// 0xc : IP Interrupt Status Register (Read/COR)
//       bit 0 - ap_done (Read/COR)
//       bit 1 - ap_ready (Read/COR)
//       others - reserved
// (SC = Self Clear, COR = Clear on Read, TOW = Toggle on Write, COH = Clear on Handshake)

#define XAXI_INTERFACES_CONTROL_ADDR_AP_CTRL 0x0
#define XAXI_INTERFACES_CONTROL_ADDR_GIE     0x4
#define XAXI_INTERFACES_CONTROL_ADDR_IER     0x8
#define XAXI_INTERFACES_CONTROL_ADDR_ISR     0xc
```

### Question

1. Stream

- Unroll the loop and observe how many axis channels are created
- Compare the area with Labs 1-3

![image](https://hackmd.io/_uploads/HJLIqy0cxg.png)

48 axis channels are created.

The area is comparable to Lab 3 (Complete), smaller than Lab 3's Cyclic and Block configurations, and much larger than in Lab 1 and Lab 2.

2. Axilite

It is used to communicate with the host program.

- Show `_hw.h` and explain its content

The generated `_hw.h` file is the listing shown under "Axi4-lite Imp Addr" above. This file provides the addresses of the relevant control registers and details the bit-field mapping of each register.
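As a rough illustration of how a host program might use this register map, the sketch below starts the IP and polls for completion. It is a minimal sketch, not the generated Xilinx driver: the helper names (`reg_write`, `reg_read`, `start_ip`, `is_done`, `enable_done_interrupt`) and the constants are hypothetical, and the register space is modelled as a plain pointer to 32-bit words. On real hardware, `base` is assumed to point at the memory-mapped AXI4-Lite region. Only the offsets and bit positions come from the `_hw.h` comments.

```cpp
#include <cassert>
#include <cstdint>

// Offsets taken from the _hw.h listing (XAXI_INTERFACES_CONTROL_ADDR_*).
constexpr std::uint32_t kAddrApCtrl = 0x0;
constexpr std::uint32_t kAddrGie    = 0x4;
constexpr std::uint32_t kAddrIer    = 0x8;

// Bit positions from the "Control signals" comment block.
constexpr std::uint32_t kApStart     = 1u << 0;
constexpr std::uint32_t kApDone      = 1u << 1;
constexpr std::uint32_t kAutoRestart = 1u << 7;

// Hypothetical accessors: byte offsets are converted to 32-bit word indices.
inline void reg_write(volatile std::uint32_t* base, std::uint32_t offset,
                      std::uint32_t value) {
    assert(offset % 4 == 0);
    base[offset / 4] = value;
}

inline std::uint32_t reg_read(volatile std::uint32_t* base,
                              std::uint32_t offset) {
    assert(offset % 4 == 0);
    return base[offset / 4];
}

// Set ap_start while preserving the current auto_restart setting.
inline void start_ip(volatile std::uint32_t* base) {
    std::uint32_t ctrl = reg_read(base, kAddrApCtrl) & kAutoRestart;
    reg_write(base, kAddrApCtrl, ctrl | kApStart);
}

// True when the ap_done bit is set in the control register.
inline bool is_done(volatile std::uint32_t* base) {
    return (reg_read(base, kAddrApCtrl) & kApDone) != 0;
}

// Enable the ap_done interrupt (IER bit 0) and the global enable (GIE bit 0).
inline void enable_done_interrupt(volatile std::uint32_t* base) {
    reg_write(base, kAddrIer, 1u);
    reg_write(base, kAddrGie, 1u);
}
```

A host would typically call `start_ip(base)` and then loop on `is_done(base)`. Because `ap_done` is clear-on-read, each successful poll also clears the flag.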