---

title: hls_ch7_lab_A

---

## HLS Lab A, Chapter 7: Design Optimization
### Lab 1 : Optimizing a Matrix Multiplier
### Objective
1. Understand how function **latency** is calculated.
2. Learn to interpret synthesis **log messages** (INFO / WARNING).
3. Explore memory resource limitations (limited BRAM ports) and their impact on pipeline performance.
4. Apply `#pragma HLS ARRAY_RESHAPE` to resolve memory access conflicts.
5. Analyze **pipeline behavior** at the function level in terms of **latency, throughput, and resource utilization**.
### Setup
* Tool: Vitis HLS 
* FPGA device: xc7z020clg400-1
* Baseline design: `matrixmul`
```c=
#include "matrixmul.h"

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  // Iterate over the rows of the A matrix
   Row: for(int i = 0; i < MAT_A_ROWS; i++) {
      // Iterate over the columns of the B matrix
      Col: for(int j = 0; j < MAT_B_COLS; j++) {
         res[i][j] = 0;
         // Do the inner product of a row of A and col of B
         Product: for(int k = 0; k < MAT_B_ROWS; k++) {
            res[i][j] += a[i][k] * b[k][j];
         }
      }
   }
}
```
### Analysis of Different Optimization Strategies
#### A. Baseline -- no pragma (Solution 1)
##### 1. Code
![image](https://hackmd.io/_uploads/H1Oauz7qle.png)

##### 2. Analysis
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  6.270 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       24|       24|  0.240 us|  0.240 us|   25|   25|       no|
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |           |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row_Col  |       22|       22|         7|          2|          1|     9|       yes|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
```
* Latency: 24 cycles
Total latency = 1 (input) + 22 (loop) + 1 (output) = 24 cycles.
* Interval: 25
Since the function itself is not pipelined, a new transaction cannot start until the previous one has been fully processed and its output written, so the interval is the latency plus one cycle.

#### B. Pipeline -- innermost loop (Solution 2)
##### 1. Code
![image](https://hackmd.io/_uploads/HkuyYfQ5el.png)

##### 2. Analysis
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  7.061 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       34|       34|  0.340 us|  0.340 us|   35|   35|       no|
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +-------------------+---------+---------+----------+-----------+-----------+------+----------+
        |                   |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        |     Loop Name     |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-------------------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row_Col_Product  |       32|       32|         7|          1|          1|    27|       yes|
        +-------------------+---------+---------+----------+-----------+-----------+------+----------+

```

* **Increased latency: loop-carried dependency**
In the inner loop, each iteration depends on the result of the previous one: `res[i][j] += a[i][k] * b[k][j]`.
When loop pipelining is applied, HLS inserts pipeline stages into the multiply and add, which increases the latency of a single iteration. Because of the data dependency, however, the iterations cannot fully overlap in the pipeline: the design must still wait for each addition's result before starting the next iteration. As a result, pipelining actually increases the overall latency, since the extra pipeline register stages are added without the overlap that would normally hide them.
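Since the code for this solution appears only as a screenshot, the sketch below shows the likely pragma placement for pipelining the innermost `Product` loop. The element types (`char` inputs, `short` results) and 3×3 sizes are inferred from the synthesis warning signature reported later in this document; the actual `matrixmul.h` may differ.

```c
/* Types and sizes inferred from the HLS warning signature
   'matrixmul(char (*) [3], char (*) [3], short (*) [3])'. */
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  Row: for (int i = 0; i < MAT_A_ROWS; i++) {
    Col: for (int j = 0; j < MAT_B_COLS; j++) {
      res[i][j] = 0;
      Product: for (int k = 0; k < MAT_B_ROWS; k++) {
#pragma HLS PIPELINE II=1
        /* Pipelining here cannot hide the loop-carried dependency:
           each += must wait for the previous iteration's sum. */
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```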

#### C. Pipeline -- outer loop (Solution 3)
##### 1. Code
![image](https://hackmd.io/_uploads/H1YZYzQ9xg.png)
##### 2. Analysis
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  6.270 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       24|       24|  0.240 us|  0.240 us|   25|   25|       no|
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |           |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row_Col  |       22|       22|         7|          2|          1|     9|       yes|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
```
* **Warning messages:**
```
WARNING: [HLS 200-885] The II Violation in module 'matrixmul' (loop 'Row_Col'): Unable to schedule 'load' operation 
('b_load', materials/lab1/matrixmul.cpp:15) on array 'b' due to limited memory ports (II = 1). Please consider using 
a memory core with more ports or partitioning the array 'b'. 
```
* **Limited memory ports**
When pipelining is applied at the Col loop level, the tool tries to access multiple elements of arrays `a` and `b` in parallel across iterations. However, the arrays are mapped to BRAM, which provides at most two ports each, so the design needs more simultaneous memory accesses than the memory architecture can supply; this produces the "limited memory ports" II violation above. The tool falls back to an achieved II of 2 for the Row_Col loop (the target was 1), so the overall latency remains the same as the baseline.
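The code for this solution is also shown only as an image; the sketch below illustrates the pragma placement implied by the report (types and 3×3 sizes inferred from the synthesis warning signature elsewhere in this document). Pipelining the Col loop makes HLS fully unroll the `Product` loop nested inside it, which is what demands more `b` ports than the BRAM provides:

```c
/* Assumed types/sizes, inferred from the HLS warning signature. */
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  Row: for (int i = 0; i < MAT_A_ROWS; i++) {
    Col: for (int j = 0; j < MAT_B_COLS; j++) {
#pragma HLS PIPELINE II=1
      /* PIPELINE at this level fully unrolls the Product loop below,
         requesting MAT_B_ROWS parallel reads of b[][] per cycle --
         more than the two BRAM ports can deliver, hence the warning. */
      res[i][j] = 0;
      Product: for (int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```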

#### D. Apply ARRAY_RESHAPE (Solution 4)
##### 1. Code
![image](https://hackmd.io/_uploads/r1ace1Ncge.png)

##### 2. Analysis
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  6.270 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       15|       15|  0.150 us|  0.150 us|   16|   16|       no|
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |           |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row_Col  |       13|       13|         6|          1|          1|     9|       yes|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
```
* **Use ARRAY_RESHAPE to solve the memory resource issue**
Using PIPELINE alone runs into either the loop-carried dependency (in the Product loop) or memory port contention (in the Col loop). Applying **ARRAY_RESHAPE** packs the elements along one dimension of each array into a **single wide memory word**, so an entire row of `a` (or column of `b`) can be fetched in **one access**, effectively providing **more parallel data per port**. This removes the memory bottleneck, allowing the pipelined Col loop to achieve **II=1** and reducing latency, at the cost of wider memories and higher resource usage.
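The reshape pragmas for this solution can be sketched as below; the `variable`/`dim` choices match the reshape pragmas that appear in the Lab 2 source later in this report, and the types/sizes are inferred from the synthesis warning signature:

```c
/* Assumed types/sizes, inferred from the HLS warning signature. */
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  /* Pack a's column dimension and b's row dimension into one wide
     word each, so a single read delivers all MAT_B_ROWS operands. */
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
  Row: for (int i = 0; i < MAT_A_ROWS; i++) {
    Col: for (int j = 0; j < MAT_B_COLS; j++) {
#pragma HLS PIPELINE II=1
      res[i][j] = 0;
      Product: for (int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```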

#### E. Apply FIFO interface (Solution 5)
##### 1. Code
![image](https://hackmd.io/_uploads/rybagJ49xe.png)

##### 2. Analysis
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  5.577 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       23|       23|  0.230 us|  0.230 us|   24|   24|       no|
    +---------+---------+----------+----------+-----+-----+---------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +----------+---------+---------+----------+-----------+-----------+------+----------+
        |          |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name|   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +----------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row     |       21|       21|        10|          6|          1|     3|       yes|
        +----------+---------+---------+----------+-----------+-----------+------+----------+
```
* **Warning**
```markdown=
WARNING: [HLS 214-237] The INTERFACE pragma actions in object field. If on struct field, disaggregate pragma is required; If on array element, array_partition pragma is required. If no, this interface pragma will be viewed as invalid and ignored. In function 'matrixmul(char (*) [3], char (*) [3], short (*) [3])' (materials/lab1/matrixmul.cpp:7:0)
WARNING: [HLS 214-142] Implementing stream: may cause mismatch if read and write accesses are not in sequential order on port 'b' (materials/lab1/matrixmul.cpp:7:0)
WARNING: [HLS 214-142] Implementing stream: may cause mismatch if read and write accesses are not in sequential order on port 'res' (materials/lab1/matrixmul.cpp:7:0)
```
* **FIFO order mismatch**
After declaring `b` and `res` as FIFO/stream ports, warnings appear because HLS requires FIFO accesses to be strictly sequential. In this matrix multiplication:
    * `b[k][j]` reads stride across rows, not sequentially
    * `res[i][j]` is read back and re-written on every Product iteration, so it is not a single sequential write stream
    * `a[i][k]` accesses are row-contiguous, so no warning occurs for `a`.

* **Solution** 
Reorder the data access to be sequential (Lab 2)
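The interface pragmas for this solution can be sketched as below, matching the `ap_fifo` pragmas that appear in the Lab 2 source later in this report (types and sizes are inferred from the synthesis warning signature). Note that in plain C the pragmas have no effect; only HLS interprets them:

```c
/* Assumed types/sizes, inferred from the HLS warning signature. */
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  /* Stream the top-level arrays instead of memory-mapping them.
     HLS warns on 'b' and 'res' because their access order below
     is not sequential, which a FIFO cannot serve. */
#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res
  Row: for (int i = 0; i < MAT_A_ROWS; i++) {
    Col: for (int j = 0; j < MAT_B_COLS; j++) {
      res[i][j] = 0;
      Product: for (int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```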
#### F. Pipeline the whole function (Solution 6)
##### 1. Code
![image](https://hackmd.io/_uploads/S1F9Yz75el.png)
##### 2. Analysis

```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  6.492 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+---------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline|
    |   min   |   max   |    min   |    max   | min | max |   Type  |
    +---------+---------+----------+----------+-----+-----+---------+
    |       10|       10|  0.100 us|  0.100 us|    5|    5|      yes|
    +---------+---------+----------+----------+-----+-----+---------+
```
* Performance
The design now completes in fewer clock cycles and can start a new transaction every 5 clock cycles. However, area and resource usage have increased substantially because pipelining the function forced all of its loops to be fully unrolled.
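The pragma placement for a function-level pipeline can be sketched as below (types and sizes inferred from the synthesis warning signature; the exact lab code may also combine this with the reshape pragmas):

```c
/* Assumed types/sizes, inferred from the HLS warning signature. */
#define MAT_A_ROWS 3
#define MAT_A_COLS 3
#define MAT_B_ROWS 3
#define MAT_B_COLS 3
typedef char  mat_a_t;
typedef char  mat_b_t;
typedef short result_t;

void matrixmul(mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
               mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
               result_t res[MAT_A_ROWS][MAT_B_COLS])
{
  /* Pipelining the whole function forces every loop in its body to be
     fully unrolled, trading a large area increase for a short interval. */
#pragma HLS PIPELINE II=1
  Row: for (int i = 0; i < MAT_A_ROWS; i++) {
    Col: for (int j = 0; j < MAT_B_COLS; j++) {
      res[i][j] = 0;
      Product: for (int k = 0; k < MAT_B_ROWS; k++) {
        res[i][j] += a[i][k] * b[k][j];
      }
    }
  }
}
```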

### Comparison
![image](https://hackmd.io/_uploads/H15ZKbNqgx.png)

Comparing **Solution 4 (array reshape)** and **Solution 6 (array reshape + function pipeline)**, we can see that **Solution 6** has a shorter latency but a larger area.

---
### Lab 2 : C code modified for Stream access
### Objective
* Explain the requirements for using Stream/FIFO access.
* Observe and analyze the **memory-mapped (MM)** access pattern.
* Derive and implement the necessary code modifications, including the introduction of a **cache buffer**, and explain how the buffer is utilized.
* Compare latency and resource utilization across each optimization step to evaluate the **trade-offs between performance and hardware cost**.

### Bottleneck In Lab1 : Unable to use streaming interface
![image](https://hackmd.io/_uploads/B1x2eGE5ee.png)
As `i`, `j`, and `k` iterate from 0 to 3, Figure 7-20 shows the read/write addresses for `a`, `b`, and `res`, with `res` reset to zero at each Product loop. For sequential streaming, only the red-highlighted port accesses are allowed: reads must be cached internally, and writes to `res` stored temporarily and committed only at the red cycles.

### Modification : Streaming access
```c=
#include "matrixmul.h"

void matrixmul(
      mat_a_t a[MAT_A_ROWS][MAT_A_COLS],
      mat_b_t b[MAT_B_ROWS][MAT_B_COLS],
      result_t res[MAT_A_ROWS][MAT_B_COLS])
{
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res
  mat_a_t a_row[MAT_A_ROWS];
  mat_b_t b_copy[MAT_B_ROWS][MAT_B_COLS];
  int tmp = 0;

  // Iterate over the rows of the A matrix
  Row: for(int i = 0; i < MAT_A_ROWS; i++) {
    // Iterate over the columns of the B matrix
    Col: for(int j = 0; j < MAT_B_COLS; j++) {
      #pragma HLS PIPELINE rewind
      // Do the inner product of a row of A and col of B
      tmp=0;
      // Cache each row (so it's only read once per function)
      if (j == 0)
        Cache_Row: for(int k = 0; k < MAT_A_ROWS; k++)
          a_row[k] = a[i][k];
      
      // Cache all cols (so they are only read once per function)
      if (i == 0)
        Cache_Col: for(int k = 0; k < MAT_B_ROWS; k++)
          b_copy[k][j] = b[k][j];

      Product: for(int k = 0; k < MAT_B_ROWS; k++) {
        tmp += a_row[k] * b_copy[k][j];
      }
      res[i][j] = tmp;
    }
  }
}
```

* Caching improves performance by reducing repeated FIFO accesses and reusing data locally.
* Flow:
1. Read once from FIFO: fetch a full row of `a` or a full column of `b` into local buffers.
2. Store in fast local memory: keep the data in registers or BRAM for quick access.
3. Reuse for computation: use cached values multiple times across different products without re-reading the FIFO.
4. Sequential output: write final results to `res` in order, ensuring FIFO-compatible sequential writes.

### Synthesis result
```markdown=
================================================================
== Performance Estimates
================================================================
+ Timing: 
    * Summary: 
    +--------+----------+----------+------------+
    |  Clock |  Target  | Estimated| Uncertainty|
    +--------+----------+----------+------------+
    |ap_clk  |  10.00 ns|  6.270 ns|     2.70 ns|
    +--------+----------+----------+------------+

+ Latency: 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+------------------------------------------+
    |  Latency (cycles) |  Latency (absolute) |  Interval |                 Pipeline                 |
    |   min   |   max   |    min   |    max   | min | max |                   Type                   |
    +---------+---------+----------+----------+-----+-----+------------------------------------------+
    |       13|       14|  0.130 us|  0.140 us|    9|    9|  loop rewind stp(delay=0 clock cycles(s))|
    +---------+---------+----------+----------+-----+-----+------------------------------------------+

    + Detail: 
        * Instance: 
        N/A

        * Loop: 
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |           |  Latency (cycles) | Iteration|  Initiation Interval  | Trip |          |
        | Loop Name |   min   |   max   |  Latency |  achieved |   target  | Count| Pipelined|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
        |- Row_Col  |       13|       13|         6|          1|          1|     9|       yes|
        +-----------+---------+---------+----------+-----------+-----------+------+----------+
```

```markdown=
================================================================
== Utilization Estimates
================================================================
* Summary: 
+-----------------+---------+-----+--------+-------+-----+
|       Name      | BRAM_18K| DSP |   FF   |  LUT  | URAM|
+-----------------+---------+-----+--------+-------+-----+
|DSP              |        -|    2|       -|      -|    -|
|Expression       |        -|    -|       0|    240|    -|
|FIFO             |        -|    -|       -|      -|    -|
|Instance         |        -|    0|       0|     83|    -|
|Memory           |        -|    -|       -|      -|    -|
|Multiplexer      |        -|    -|       -|    188|    -|
|Register         |        -|    -|     243|     32|    -|
+-----------------+---------+-----+--------+-------+-----+
|Total            |        0|    2|     243|    543|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        0|   ~0|      ~0|      1|    0|
+-----------------+---------+-----+--------+-------+-----+
```
### Comparison


| Solution                                 | Latency | DSP | FF  | LUT |
|------------------------------------------|---------|-----|-----|-----|
| Lab 1 (Sol 6): array reshape + pipeline  | 10      | 18  | 632 | 560 |
| Lab 2: cache + FIFO                      | 13      | 2   | 243 | 543 |

- **Lab1**: Faster, higher DSP usage due to full parallelism and pipelining.  
- **Lab2**: Slightly slower, but uses fewer resources.