# HLS Study Journal - NTU 2024 AAHLS SP
## Table of Contents
:::warning
[toc]
:::
## Lab Note
:::warning
- [Lab2 - FIR_AXI4-Master vs. FIR_AXI4-Stream](/iAZN1pzISJKt-eiq4ZdE1A)
- [Lab3 - OpenCL for U50 PCIe FPGA](/xVnrTaZqRtONEf6pCyv0JA)
- [LabA - Hardware Acceleration Tutorial - Mixing C++ & RTL Kernels](/CiEGfLIrTiiu5PkxcH6gwg)
- [LabB - Systolic Array](/FhQPwB5FRFaDavJ8_-KIxA)
:::
## Question AR
### Week 3
:::success
Explain the Vitis programming model and the following models of data movement
1\. Kernel directly accesses host memory
2\. Host-to-kernel streaming
Watch the video & draw a diagram to assist the explanation
:::

In the Vitis Development Flow, the data required for computation on the FPGA side can be obtained through five methods. The second and third, _Kernel Directly Access Host Memory_ and _Host to Kernel Streaming_, are discussed in detail below.

In the _Kernel Directly Access Host Memory_ approach, the traditional method transfers data from host memory on the host side to global memory on the FPGA side for the kernel function to use. In contrast, _Kernel Directly Access Host Memory_ lets the kernel function access host memory directly through DMA (Direct Memory Access), without first copying the data to global memory.
This eliminates the latency of moving data from host memory to global memory, but the latency of each individual access grows considerably. Direct host-memory access also introduces uncertainty: accesses may need to arbitrate with other resources, leading to potential contention.
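
As a rough illustration of how this mode is requested from the host side in the Vitis OpenCL flow, the sketch below allocates a buffer that stays in host memory using the Xilinx `XCL_MEM_EXT_HOST_ONLY` extension flag, so the kernel accesses it directly instead of through global memory. This is a minimal sketch under that assumption, not code from the course; the helper name is made up and error handling is omitted.
```clike=
#include <CL/cl.h>
#include <CL/cl_ext_xilinx.h>

// Allocate a buffer that stays in host memory so the kernel can access it
// directly (no copy to FPGA global memory). Sketch only; helper name is
// made up and error handling is omitted.
cl_mem make_host_only_buffer(cl_context context, size_t size_in_bytes) {
    cl_mem_ext_ptr_t ext;
    ext.flags = XCL_MEM_EXT_HOST_ONLY;  // Xilinx extension: host-only buffer
    ext.obj   = NULL;
    ext.param = 0;
    cl_int err;
    // The kernel will reach this buffer over the host-memory path, trading
    // the host-to-global-memory copy for higher per-access latency.
    return clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_EXT_PTR_XILINX,
                          size_in_bytes, &ext, &err);
}
```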

In the *Host to Kernel Streaming* approach, the traditional method transmits control and data separately from the host side to the FPGA side, and the kernel function starts executing only once both have arrived; this method can suffer from synchronization issues.
The diagram on the bottom left illustrates this: the host side first sends a command, and based on that command the DMA starts reading data from memory and transferring it to the kernel function. The kernel begins execution only when it holds both the control and the data information, which incurs significant latency.
The diagram on the right depicts a hybrid method in which control and data are combined into a single package transmitted via streaming: the first part of the package carries the control information and the latter part carries the data. At the block-level protocol, the data-driven `ap_ctrl_none` mode handles the transmission, so the kernel function starts computing as soon as data arrives, eliminating the need for `ap_start` and `ap_done` signals to coordinate transfers between the host side and the FPGA side.
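
A minimal sketch of such a data-driven kernel (my own example, not from the slides): with `ap_ctrl_none` there is no `ap_start`/`ap_done` handshake, and the kernel simply processes the AXI-Stream packet until `TLAST`:
```clike=
#include <hls_stream.h>
#include <ap_axi_sdata.h>

// Data-driven kernel: no block-level handshake, starts as soon as data arrives.
void stream_incr(hls::stream<ap_axis<32, 0, 0, 0>> &in,
                 hls::stream<ap_axis<32, 0, 0, 0>> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    ap_axis<32, 0, 0, 0> w;
    do {
#pragma HLS PIPELINE II=1
        w = in.read();       // blocks until the host streams data in
        w.data += 1;         // placeholder computation
        out.write(w);
    } while (!w.last);       // TLAST marks the end of the packet
}
```
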
### Week 4
:::success
Explain the use of I/O interfaces for the top function (kernel)
:::

This slide discusses the I/O interfaces between the kernel function (top function) and the surrounding circuits; the interface to each circuit is explained below.
Taking *Kernel #1* as the top function, the communication with the host side depends on the development board used: PYNQ-series boards (e.g., PYNQ Z2) use the AXI interface, while Alveo-series boards (e.g., Alveo U50) use the PCIe interface.
Communication with external I/O uses streaming (`axis`, AXI Stream).
Communication with global memory uses the `m_axi` (AXI Master) interface. AXI Master transfers use Memory-Mapped I/O (MMIO): a specific region of the memory space is designated as I/O, and read/write operations on that region replace traditional I/O read/write behavior.
Communication with other kernel IPs uses streaming (AXI Stream), just like external I/O.
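
As an illustration, a top function combining these interfaces might look like the sketch below. The kernel, port, and bundle names are made up, and the `s_axilite` control interface is the usual companion for host-controlled scalar ports; none of this is the lecture's exact code.
```clike=
#include <hls_stream.h>
#include <ap_axi_sdata.h>

// Illustrative top function combining the interface types discussed above.
void kernel1(int *gmem,                                   // global memory via AXI Master (MMIO)
             hls::stream<ap_axis<32, 0, 0, 0>> &io_in,    // external I/O via AXI Stream
             hls::stream<ap_axis<32, 0, 0, 0>> &k2k_out,  // to another kernel via AXI Stream
             int n) {
#pragma HLS INTERFACE m_axi     port=gmem bundle=gmem0 offset=slave
#pragma HLS INTERFACE axis      port=io_in
#pragma HLS INTERFACE axis      port=k2k_out
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return  // host control path (AXI-Lite; driven over PCIe on Alveo)
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        ap_axis<32, 0, 0, 0> w = io_in.read();
        w.data += gmem[i];   // memory-mapped read from global memory
        k2k_out.write(w);
    }
}
```
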
### Week 5
:::success
Explain the best practice for task/data-level parallelism
:::

For HLS code, the best practices for task-level and data-level parallelism are as shown in the diagram above. Suppose an HLS kernel contains three functions: _Function A_, _Function B_, and _Function C_. Task-level and data-level parallelism can then be explained along the horizontal and vertical dimensions respectively.
**In the horizontal dimension**, task-level parallelism covers both the optimization inside each function and the data transfers between functions. Assuming each function is written as a nested loop, the HLS directive `#pragma HLS PIPELINE` pipelines the loop body inside each function. For the transfers between functions, the _dataflow_ approach is used: a FIFO is inserted between each pair of functions to hold the results produced by the upstream function and feed them, in order, into the downstream function.
**In the vertical dimension**, data-level parallelism concerns the memory or buffer feeding data into the kernel function for computation, typically via streaming. In the HLS code, data can be declared with the `hls::vector` datatype, which enables SIMD (Single Instruction, Multiple Data) operation: if the vector size is `N`, the kernel can consume and produce `N` elements at once.
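
A compact sketch combining both dimensions (made-up functions, not the diagram's exact code): `DATAFLOW` with inter-function FIFOs provides the task-level parallelism, `PIPELINE` optimizes each function internally, and `hls::vector` carries `N = 8` elements per stream word for the data-level parallelism.
```clike=
#include <hls_stream.h>
#include <hls_vector.h>

typedef hls::vector<int, 8> vec_t;  // one stream word carries 8 elements (SIMD)

static void funcA(hls::stream<vec_t> &in, hls::stream<vec_t> &out) {
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1           // pipeline inside the function
        vec_t v = in.read();
        out.write(v + v);           // element-wise op on all 8 lanes at once
    }
}

static void funcB(hls::stream<vec_t> &in, hls::stream<vec_t> &out) {
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        vec_t v = in.read();
        out.write(v * v);
    }
}

static void funcC(hls::stream<vec_t> &in, hls::stream<vec_t> &out) {
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        out.write(in.read());
    }
}

void top(hls::stream<vec_t> &in, hls::stream<vec_t> &out) {
#pragma HLS DATAFLOW                 // run A, B, C as concurrent tasks
    static hls::stream<vec_t> a2b("a2b"), b2c("b2c");  // FIFOs between functions
    funcA(in, a2b);
    funcB(a2b, b2c);
    funcC(b2c, out);
}
```
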
### Week 6
:::success
Handling different rates and different latencies (p#32, 33)
:::
#### Processes Executing at Different Rates – Rate Matching

In HLS design, the II (Initiation Interval) can be viewed as the throughput of the hardware design. When processes run at different rates, rate matching can bring the overall hardware design to an II of 1. The bottom-left diagram shows the block diagram of the HLS code above: `inStream` feeds a demux whose outputs, `outStream1` and `outStream2`, feed `worker1` and `worker2` respectively. The demux has an II of 1, `worker1` an II of 2, and `worker2` an II of 1, so the overall II is 2. The bottom-right diagram shows the optimized version, in which the `worker1` kernel function is replicated so that the overall II becomes 1.
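
A stand-alone sketch of the replication idea (my own simplified example; the slide's demux/worker1/worker2 code is not reproduced here): tokens are alternated between two copies of the slow worker, so the dataflow region as a whole sustains one token per cycle.
```clike=
#include <hls_stream.h>

typedef int data_t;

// The slow process: one result every 2 cycles (II = 2).
static void worker1(hls::stream<data_t> &in, hls::stream<data_t> &out) {
    for (int i = 0; i < 32; i++) {
#pragma HLS PIPELINE II=2
        out.write(in.read() * 3);
    }
}

// Alternate incoming tokens between the two worker copies.
static void split(hls::stream<data_t> &in,
                  hls::stream<data_t> &a, hls::stream<data_t> &b) {
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        data_t v = in.read();
        if (i & 1) b.write(v); else a.write(v);
    }
}

// Re-interleave results in the original order.
static void join(hls::stream<data_t> &a, hls::stream<data_t> &b,
                 hls::stream<data_t> &out) {
    for (int i = 0; i < 64; i++) {
#pragma HLS PIPELINE II=1
        out.write((i & 1) ? b.read() : a.read());
    }
}

void rate_matched(hls::stream<data_t> &inStream, hls::stream<data_t> &outStream) {
#pragma HLS DATAFLOW
    static hls::stream<data_t> a("a"), b("b"), oa("oa"), ob("ob");
    split(inStream, a, b);
    worker1(a, oa);   // copy A handles even tokens
    worker1(b, ob);   // copy B handles odd tokens
    join(oa, ob, outStream);
}
```
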
#### Processes at Different Latencies – Latency Matching (Estimating HLS Stream Depth)

In a process design, you might encounter tasks that fork and later join, where the forked tasks have different latencies; the diagram below illustrates how to handle this situation.
In the code segment, the `Duplicate` function runs first, then forks into two functions, `GaussianBlur<5,5>` and `GaussianBlur<3,3>`, which execute in parallel and join back at the `AbsDiff` function for further processing. However, `GaussianBlur<5,5>` has a latency of 5 cycles while `GaussianBlur<3,3>` has a latency of 3 cycles, a gap of 2 cycles. Adding a FIFO with a depth of 2 after `GaussianBlur<3,3>` resolves the latency difference.
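
A minimal sketch of the fork/join structure with the depth-2 FIFO (my own stand-in filter bodies, since the real `GaussianBlur` code is not shown; only the topology and the `STREAM` pragma matter here):
```clike=
#include <hls_stream.h>

typedef int pix_t;

// Stand-ins for GaussianBlur<5,5> / GaussianBlur<3,3>; real filters omitted.
static void Blur5x5(hls::stream<pix_t> &in, hls::stream<pix_t> &out) {
    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        out.write(in.read());   // latency 5 in the lecture's example
    }
}
static void Blur3x3(hls::stream<pix_t> &in, hls::stream<pix_t> &out) {
    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        out.write(in.read());   // latency 3 in the lecture's example
    }
}
static void Duplicate(hls::stream<pix_t> &in,
                      hls::stream<pix_t> &a, hls::stream<pix_t> &b) {
    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        pix_t v = in.read();
        a.write(v);
        b.write(v);
    }
}
static void AbsDiff(hls::stream<pix_t> &a, hls::stream<pix_t> &b,
                    hls::stream<pix_t> &out) {
    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        pix_t x = a.read(), y = b.read();
        out.write(x > y ? x - y : y - x);
    }
}

void fork_join(hls::stream<pix_t> &in, hls::stream<pix_t> &out) {
#pragma HLS DATAFLOW
    static hls::stream<pix_t> a("a"), b("b"), blur5("blur5"), blur3("blur3");
#pragma HLS STREAM variable=blur3 depth=2   // absorbs the 2-cycle latency gap
    Duplicate(in, a, b);
    Blur5x5(a, blur5);
    Blur3x3(b, blur3);
    AbsDiff(blur5, blur3, out);
}
```
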
### Week 7
:::success
Explain how to describe structure in C++, and explain the code of an example counter
:::

In hardware design, HLS is well suited to algorithm-related development, while I/O interfaces are better handled in RTL design, since they require cycle-by-cycle control of operations. However, a purely-HLS design runs C simulation on the CPU, whereas mixing HLS and RTL code requires C/RTL co-simulation, in which the HLS code runs on the CPU and the RTL code runs on a Verilog simulator, the two processes communicating via inter-process communication. Functional simulation therefore takes much longer, which makes structural design in HLS an important topic.
In hardware design, sequential logic consists of combinational logic and storage elements.
The left half describes the Verilog design, which is divided into two parts: combinational logic and state-variable assignment. The combinational block is sensitive to the listed input signals and specifies the equations among them to produce the corresponding output and next state. The state-variable assignment block updates the current state with the next state computed by the combinational logic at every positive clock edge (`posedge clk`).
The right half describes the HLS design. It first declares the inputs (`InS`), outputs (`OutS`), and state (`State`) used by the sequential logic. The combinational part updates the next state (`NextS`) through the `F` function and the output signal `Temp_Out` through the `Y` function. The top module is declared and the current state initialized, after which the `F` and `Y` functions yield the current state (`CurState`) and the output.
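
Below is a minimal sketch of that right-half structure (my own example; the `F` and `Y` bodies are placeholders, and only the decomposition into next-state logic, output logic, and a static state register matters):
```clike=
#include <ap_int.h>

typedef ap_uint<4> state_t;
typedef ap_uint<1> in_t;
typedef ap_uint<1> out_t;

// Next-state logic (combinational) -- placeholder body
static state_t F(state_t s, in_t x) { return s + x; }
// Output logic (combinational) -- placeholder body
static out_t Y(state_t s, in_t x) { return (s & 1) ^ x; }

void Top(in_t InS, out_t &OutS) {
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE ap_none port=InS
#pragma HLS INTERFACE ap_none port=OutS
    static state_t CurState = 0;        // storage element (register)
    out_t Temp_Out = Y(CurState, InS);  // compute output from current state
    CurState = F(CurState, InS);        // update state for the next invocation
    OutS = Temp_Out;
}
```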


```clike=
#include <ap_int.h>

typedef enum {wait_for_one, wait_for_zero} counter_state_type;

// 7-segment encodings for digits 0-9 (assumed active-low, common-anode;
// the actual table was not shown on the slide)
static const ap_uint<8> seven_segment_code[10] = {
    0xC0, 0xF9, 0xA4, 0xB0, 0x99, 0x92, 0x82, 0xF8, 0x80, 0x90
};

ap_uint<8> get_seven_segment_code(ap_uint<5> number);

void counter(
    bool count_button,
    ap_uint<8> &seven_segment_data,
    ap_uint<4> &seven_segment_enable){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE ap_none port=count_button
#pragma HLS INTERFACE ap_none port=seven_segment_data
#pragma HLS INTERFACE ap_none port=seven_segment_enable
    // storage elements: current count and current FSM state
    static ap_uint<5> number = 0;
    static counter_state_type state = wait_for_one;
    ap_uint<5> next_number;
    counter_state_type next_state;
    // combinational next-state / next-count logic
    switch (state) {
    case wait_for_one:                 // waiting for the button to be pressed
        if (count_button == 1) {
            if (number + 1 == 10) {    // wrap around after 9
                next_number = 0;
            } else {
                next_number = number + 1;
            }
            next_state = wait_for_zero;
        } else {
            next_number = number;
            next_state = wait_for_one;
        }
        break;
    case wait_for_zero:                // waiting for the button to be released
        if (count_button == 1) {
            next_number = number;
            next_state = wait_for_zero;
        } else {
            next_number = number;
            next_state = wait_for_one;
        }
        break;
    default:
        next_number = number;
        next_state = wait_for_one;
        break;
    }
    // state update
    number = next_number;
    state = next_state;
    // drive the outputs
    seven_segment_data = get_seven_segment_code(number);
    seven_segment_enable = 0b1110;     // enable only the rightmost digit (active low)
}

ap_uint<8> get_seven_segment_code(ap_uint<5> number){
#pragma HLS INLINE
    ap_uint<8> code = seven_segment_code[0];
    switch (number) {
    case 0: code = seven_segment_code[0]; break;
    case 1: code = seven_segment_code[1]; break;
    case 2: code = seven_segment_code[2]; break;
    case 3: code = seven_segment_code[3]; break;
    case 4: code = seven_segment_code[4]; break;
    case 5: code = seven_segment_code[5]; break;
    case 6: code = seven_segment_code[6]; break;
    case 7: code = seven_segment_code[7]; break;
    case 8: code = seven_segment_code[8]; break;
    case 9: code = seven_segment_code[9]; break;
    default: break;
    }
    return code;
}
```
In the `counter` function, static variables are first defined to hold the current count and the current state. Variables `next_number` and `next_state` are then declared to hold the upcoming count and state. A `switch-case` over the current state implements all the state transitions, after which the count (`number`) and the current state are updated. Finally, the output values are assigned to the output ports.
### Week 8
:::success
Construct a data-flow design (Canonical Form) using code examples (P#41-44)
:::
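
The example below (adapted from the lecture slides, P#41-44) builds a two-level dataflow design in canonical form: each dataflow region contains only function calls connected by `hls::stream` FIFOs, the leaf processes are flushable pipelines (`style=flp`), and the second-level `icmp_server` is inlined so its processes dissolve into the top-level dataflow region.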


```clike=
#include <hls_stream.h>
#include <ap_int.h>

// axiWord was not defined on the slides; a typical AXI-Stream word is assumed.
struct axiWord {
    ap_uint<64> data;
    ap_uint<8>  keep;
    ap_uint<1>  last;
};

// First Level - ICMP leaf processes (bodies elided on the slides)
void createReply(hls::stream<axiWord> &inData, hls::stream<axiWord> &outData,
                 hls::stream<ap_uint<1>> &validBuffer,
                 hls::stream<ap_uint<16>> &checksumBuffer){
#pragma HLS PIPELINE II=1 style=flp
...
}
void dropper(hls::stream<axiWord> &inData, hls::stream<ap_uint<1>> &validBuffer,
             hls::stream<axiWord> &outData){
#pragma HLS PIPELINE II=1 style=flp
...
}
void insertChecksum(hls::stream<axiWord> &inData,
                    hls::stream<ap_uint<16>> &checksumBuffer,
                    hls::stream<axiWord> &outData){
#pragma HLS PIPELINE II=1 style=flp
...
}
// Other leaf processes used by the top level; declarations inferred from the
// call sites below (bodies not shown on the slides)
void parser(hls::stream<axiWord> &inData, hls::stream<axiWord> &toArp,
            hls::stream<axiWord> &toIcmp, hls::stream<axiWord> &toLoopback);
void arp_server(hls::stream<axiWord> &inData, hls::stream<axiWord> &outData);
void loopback(hls::stream<axiWord> &inData, hls::stream<axiWord> &outData);
void merge(hls::stream<axiWord> &fromArp, hls::stream<axiWord> &fromIcmp,
           hls::stream<axiWord> &fromLoopback, hls::stream<axiWord> &outData);

// Second Level - ICMP
void icmp_server(hls::stream<axiWord> &inData, hls::stream<axiWord> &outData){
#pragma HLS DATAFLOW
#pragma HLS INLINE   // dissolve this region into the parent dataflow
    static hls::stream<axiWord> cr2dropper("cr2dropper");
    static hls::stream<axiWord> drop2checksum("drop2checksum");
    static hls::stream<ap_uint<1>> validBuffer("validBuffer");
    static hls::stream<ap_uint<16>> cr2checksum("cr2checksum");
#pragma HLS STREAM variable=cr2dropper depth=16
#pragma HLS STREAM variable=drop2checksum depth=16
#pragma HLS STREAM variable=validBuffer depth=16
#pragma HLS STREAM variable=cr2checksum depth=16
    createReply(inData, cr2dropper, validBuffer, cr2checksum);
    dropper(cr2dropper, validBuffer, drop2checksum);
    insertChecksum(drop2checksum, cr2checksum, outData);
}

// Top Level - top
void top(hls::stream<axiWord> &inData, hls::stream<axiWord> &outData){
#pragma HLS DATAFLOW
    static hls::stream<axiWord> parser2arp("parser2arp");
    static hls::stream<axiWord> parser2cmp("parser2cmp");
    static hls::stream<axiWord> parser2loopback("parser2loopback");
    static hls::stream<axiWord> arp2merge("arp2merge");
    static hls::stream<axiWord> cmp2merge("cmp2merge");
    static hls::stream<axiWord> loopback2merge("loopback2merge");
#pragma HLS STREAM variable=parser2arp depth=16
#pragma HLS STREAM variable=parser2cmp depth=16
#pragma HLS STREAM variable=parser2loopback depth=16
#pragma HLS STREAM variable=arp2merge depth=16
#pragma HLS STREAM variable=cmp2merge depth=16
#pragma HLS STREAM variable=loopback2merge depth=16
    parser(inData, parser2arp, parser2cmp, parser2loopback);
    arp_server(parser2arp, arp2merge);
    icmp_server(parser2cmp, cmp2merge);
    loopback(parser2loopback, loopback2merge);
    merge(arp2merge, cmp2merge, loopback2merge, outData);
}
```
```