# Groq module - MEM & HLS 1

## Spec

![image](https://hackmd.io/_uploads/HJpk0Vd5eg.png)

## Function

| Instruction | Description | Parameters | Example |
|-----------------|------------------------------------------------------|----------------------------------------|------------------------------------------------------------------------|
| Read a, s | Load data from main memory into a stream register | a: memory address<br>s: stream register | `Read 100, s1` → Loads Mem[100] into s1 |
| Write a, s | Store the contents of a stream register into memory | a: memory address<br>s: stream register | `Write 200, s1` → Stores the content of s1 into Mem[200] |
| Gather s, map | Indirect read: load data from addresses listed in map into stream | s: stream register<br>map: list of addresses | `map = [100, 200, 300]`<br>`Gather s1, map` → s1 = [Mem[100], Mem[200], Mem[300]] |
| Scatter s, map | Indirect write: store stream data into addresses listed in map | s: stream register<br>map: list of addresses | `s1 = [10, 20, 30]`, `map = [100, 200, 300]`<br>`Scatter s1, map` → Mem[100]=10, Mem[200]=20, Mem[300]=30 |

## Instruction

```cpp
// MEM instruction fields packed into 32 bits
typedef struct mem_inst_t {
    ap_uint<MEM_DFUNC_BITS> dfunc;             // bits [2:0]
    ap_uint<MEM_DSKEW_BITS> dskew;             // bits [6:3]
    ap_uint<MEM_OPCODE_BITS> opcode;           // bits [10:7]
    ap_uint<MEM_SRCID_BITS> srcID;             // bits [16:11]
    ap_uint<MEM_DSTID_BITS> dstID;             // bits [22:17]
    ap_uint<MEM_SCA_GAT_BITS> scatter_gather;  // bits [23]
    ap_uint<MEM_RES_BITS> reserved;            // bits [31:24]
} mem_inst_t;

// Enumeration of MEM opcodes
typedef enum mem_opcode {
    MEM_OP_READ,
    MEM_OP_WRITE,
    // Gather s, map
    // Indirectly read the addresses pointed to by map and put the data onto stream s
    // The mapping addresses come from a stream register
    MEM_OP_GATHER_READ,
    // Scatter s, map
    // Indirectly store stream s into the addresses in the map stream
    // The mapping addresses come from a stream register
    MEM_OP_SCATTER_WRITE
} mem_opcode_t;
```

## Design insight

In this design, we aim to leverage a Large
Language Model (LLM) as an assistant for code generation. Beyond simply producing code, the LLM also serves as a tool for exploring and understanding the overall program architecture. By integrating the model into the development process, we can accelerate implementation while gaining deeper insight into how the components of the system are structured and interact. This dual approach improves efficiency and helps us learn and adopt best practices in software design.

## Code - 1st version

### MEM.cpp

<details>
<summary>HLS for MEM.cpp</summary>

```cpp
#include <ap_int.h>
#include <hls_stream.h>
#include <stdint.h>

// ---------------------------
// Instruction encoding
// ---------------------------
#define MEM_DFUNC_BITS 3
#define MEM_DSKEW_BITS 4
#define MEM_SRC_DST_ID_BITS 12
#define MEM_BANK_BITS 1
#define MEM_RES_BITS 6
#define MEM_W_OR_E_BITS 1
#define MEM_ADDR_MAP_BITS 1
#define MEM_OPCODE_BITS 4
// The three widths below were missing from the original listing. The values
// are taken from the bit layout in the Instruction section so the struct compiles.
#define MEM_SRCID_BITS 6
#define MEM_DSTID_BITS 6
#define MEM_SCA_GAT_BITS 1

// word size = 16 bytes
#define WORD_BYTES 16

// Example slice memory size (small for HLS synth)
// 16 lanes * 4096 entries per bank * 8 bits per byte
#define MEM_SLICE_BYTES (16*4096*8)

typedef struct mem_inst_t {
    ap_uint<MEM_DFUNC_BITS> dfunc;
    ap_uint<MEM_DSKEW_BITS> dskew;
    ap_uint<MEM_OPCODE_BITS> opcode;
    ap_uint<MEM_SRCID_BITS> srcID;
    ap_uint<MEM_DSTID_BITS> dstID;
    ap_uint<MEM_SCA_GAT_BITS> scatter_gather;
    ap_uint<MEM_RES_BITS> reserved;
} mem_inst_t;

// Opcodes
typedef enum mem_opcode {
    MEM_OP_READ = 0,
    MEM_OP_WRITE = 1,
    MEM_OP_GATHER_READ = 2,
    MEM_OP_SCATTER_WRITE = 3
} mem_opcode_t;

// Data and Address words
typedef ap_uint<8> byte_t;
struct DataWord { byte_t data[WORD_BYTES]; };
struct AddrWord { ap_uint<32> addr; };

// ---------------------------
// Simple slice memory
// ---------------------------
static byte_t mem_bank0[MEM_SLICE_BYTES];
static byte_t mem_bank1[MEM_SLICE_BYTES];

// ---------------------------
// MEM Slice module
// ---------------------------
extern "C" {

void mem_slice(
    // instruction streams
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    // data streams
    hls::stream<DataWord> &west_in,
    hls::stream<DataWord> &west_out,
    hls::stream<DataWord> &east_in,
    hls::stream<DataWord> &east_out,
    // address streams: referenced by the pragmas and the gather/scatter
    // cases below, so they are declared here (missing in the original listing)
    hls::stream<AddrWord> &addr_in,
    hls::stream<AddrWord> &addr_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE axis port=addr_in
#pragma HLS INTERFACE axis port=addr_out
#pragma HLS INTERFACE ap_ctrl_none port=return

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    ap_uint<MEM_OPCODE_BITS> opc = inst.opcode;

    // pipeline forward instruction
    inst_out.write(inst);

    switch (opc) {
        case MEM_OP_READ: {
            // READ: base address = srcID << 4
            // (widen first: shifting an ap_uint keeps the operand width)
            AddrWord base;
            base.addr = (ap_uint<32>)inst.srcID << 4;
            DataWord out;
            for (int i = 0; i < WORD_BYTES; i++) {
                out.data[i] = mem_bank0[(unsigned int)(base.addr + i) % MEM_SLICE_BYTES];
            }
            west_out.write(out); // output to west side
            break;
        }
        case MEM_OP_WRITE: {
            // WRITE: base address = dstID << 4
            if (west_in.empty()) break;
            AddrWord base;
            base.addr = (ap_uint<32>)inst.dstID << 4;
            DataWord in = west_in.read();
            for (int i = 0; i < WORD_BYTES; i++) {
                mem_bank0[(unsigned int)(base.addr + i) % MEM_SLICE_BYTES] = in.data[i];
            }
            break;
        }
        case MEM_OP_GATHER_READ: {
            // Gather: consume addr_in, output data
            if (addr_in.empty()) break;
            AddrWord a = addr_in.read();
            DataWord out;
            for (int i = 0; i < WORD_BYTES; i++) {
                out.data[i] = mem_bank0[(unsigned int)(a.addr + i) % MEM_SLICE_BYTES];
            }
            east_out.write(out); // output to east side
            break;
        }
        case MEM_OP_SCATTER_WRITE: {
            // Scatter: consume addr_in + east_in
            if (addr_in.empty() || east_in.empty()) break;
            AddrWord a = addr_in.read();
            DataWord in = east_in.read();
            for (int i = 0; i < WORD_BYTES; i++) {
                mem_bank0[(unsigned int)(a.addr + i) % MEM_SLICE_BYTES] = in.data[i];
            }
            break;
        }
        default:
            // NOP
            break;
    }
}

}
```

</details>

## Code - 2nd version

Here, we design the instruction
format such that for SCATTER or GATHER operations, the instruction can carry up to two addresses. With this capability, dividing the MEM slice into two banks becomes natural and efficient, since both banks can be accessed in parallel, performing either two reads or two writes simultaneously.

:::warning
* This version does not yet define which stream the data input/output comes from.
:::

### mem_slice.h

<details>
<summary>HLS for mem_slice.h</summary>

```cpp
#ifndef MEM_SLICE_H
#define MEM_SLICE_H

#include <ap_int.h>
#include <hls_stream.h>
#include <stdint.h>

// ---------------------------
// bit layout (LSB..MSB):
//   3:0   dfunc    (4)
//   6:4   dskew    (3)
//   8:7   opcode   (2)
//  20:9   addr     (12)
//  21     addr_map (1)
//  22     w_or_e   (1)
//  23     bankbit  (1)
//  31:24  reserved (8)
// ---------------------------
#define MEM_DFUNC_BITS 4
#define MEM_DSKEW_BITS 3
#define MEM_OPCODE_BITS 2
#define MEM_ADDR_BITS 12
#define MEM_ADDR_MAP_BITS 1
#define MEM_W_OR_E_BITS 1
#define MEM_BANK_BITS 1
#define MEM_RES_BITS 8

#define WORD_BYTES 16
#define MEM_SLICE_BYTES (16*4096)

// ---------- types ----------
typedef struct { ap_uint<32> raw; } mem_inst_t;
typedef ap_uint<8> byte_t;
struct DataWord { byte_t data[WORD_BYTES]; };

typedef enum mem_opcode {
    MEM_OP_READ = 0,
    MEM_OP_WRITE = 1,
    MEM_OP_GATHER_READ = 2,
    MEM_OP_SCATTER_WRITE = 3
} mem_opcode_t;

// ---------- decode helpers ----------
inline ap_uint<MEM_DFUNC_BITS> get_dfunc(const mem_inst_t &i) { return i.raw.range(3,0); }
inline ap_uint<MEM_DSKEW_BITS> get_dskew(const mem_inst_t &i) { return i.raw.range(6,4); }
inline ap_uint<MEM_OPCODE_BITS> get_opcode(const mem_inst_t &i) { return i.raw.range(8,7); }
inline ap_uint<MEM_ADDR_BITS> get_addr(const mem_inst_t &i) { return i.raw.range(20,9); }
inline ap_uint<MEM_ADDR_MAP_BITS> get_addrmap(const mem_inst_t &i) { return i.raw.bit(21); }
inline ap_uint<MEM_W_OR_E_BITS> get_wore(const mem_inst_t &i) { return i.raw.bit(22); }
inline ap_uint<MEM_BANK_BITS> get_bankbit(const mem_inst_t &i) { return i.raw.bit(23); }
inline
ap_uint<MEM_RES_BITS> get_res(const mem_inst_t &i) { return i.raw.range(31,24); }

// ---------- memory bank access function declarations ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataWord &dw);
DataWord read_bank(unsigned bank_id, ap_uint<32> byte_addr);

// ---------- top-level function ----------
extern "C" void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<DataWord> &west_in,
    hls::stream<DataWord> &west_out,
    hls::stream<DataWord> &east_in,
    hls::stream<DataWord> &east_out
);

#endif // MEM_SLICE_H
```

</details>

### mem_slice.cpp

<details>
<summary>HLS for mem_slice.cpp</summary>

```cpp
#include "mem_slice.h"

// ---------- memory banks ----------
byte_t mem_bank0[MEM_SLICE_BYTES];
byte_t mem_bank1[MEM_SLICE_BYTES];

// ---------- memory access functions ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataWord &dw) {
#pragma HLS INLINE
    unsigned int idx0 = (unsigned int)(byte_addr % MEM_SLICE_BYTES);
    for (int i = 0; i < WORD_BYTES; i++) {
#pragma HLS UNROLL
        unsigned int idx = (idx0 + i) % MEM_SLICE_BYTES;
        if (bank_id == 0) mem_bank0[idx] = dw.data[i];
        else              mem_bank1[idx] = dw.data[i];
    }
}

DataWord read_bank(unsigned bank_id, ap_uint<32> byte_addr) {
#pragma HLS INLINE
    DataWord out;
    unsigned int idx0 = (unsigned int)(byte_addr % MEM_SLICE_BYTES);
    for (int i = 0; i < WORD_BYTES; i++) {
#pragma HLS UNROLL
        unsigned int idx = (idx0 + i) % MEM_SLICE_BYTES;
        if (bank_id == 0) out.data[i] = mem_bank0[idx];
        else              out.data[i] = mem_bank1[idx];
    }
    return out;
}

// ---------- top-level DUT ----------
extern "C" {

void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<DataWord> &west_in,
    hls::stream<DataWord> &west_out,
    hls::stream<DataWord> &east_in,
    hls::stream<DataWord> &east_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE ap_ctrl_none port=return

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    ap_uint<MEM_OPCODE_BITS> opc = get_opcode(inst);
    ap_uint<MEM_DSKEW_BITS> dsk = get_dskew(inst);
    ap_uint<MEM_ADDR_MAP_BITS> amap = get_addrmap(inst);
    ap_uint<MEM_W_OR_E_BITS> side = get_wore(inst);
    ap_uint<MEM_BANK_BITS> bankbit = get_bankbit(inst);
    ap_uint<MEM_ADDR_BITS> addr12 = get_addr(inst);
    ap_uint<MEM_RES_BITS> res8 = get_res(inst);
    ap_uint<MEM_DFUNC_BITS> dfunc = get_dfunc(inst);

    // forward
    inst_out.write(inst);

    // dskew: idle for dsk iterations
    for (int dd = 0; dd < (int)dsk; ++dd) {
#pragma HLS PIPELINE II=1
    }

    // addresses used only by gather/scatter
    ap_uint<12> bank0_addr12 = addr12;
    ap_uint<12> bank1_addr12 = ((ap_uint<12>)res8 << 4) | dfunc;

    switch (opc) {
        case MEM_OP_READ: {
            DataWord dout = read_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4);
            if (side == 0) west_out.write(dout);
            else           east_out.write(dout);
            break;
        }
        case MEM_OP_WRITE: {
            if (side == 0 && !west_in.empty()) {
                DataWord dw = west_in.read();
                write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
            } else if (side == 1 && !east_in.empty()) {
                DataWord dw = east_in.read();
                write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
            }
            break;
        }
        case MEM_OP_GATHER_READ: {
            if (amap == 1) {
                DataWord d0 = read_bank(0, ((ap_uint<32>)bank0_addr12) << 4);
                DataWord d1 = read_bank(1, ((ap_uint<32>)bank1_addr12) << 4);
                west_out.write(d0);
                east_out.write(d1);
            } else {
                DataWord dout = read_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4);
                if (side == 0) west_out.write(dout);
                else           east_out.write(dout);
            }
            break;
        }
        case MEM_OP_SCATTER_WRITE: {
            if (amap == 1) {
                if (!west_in.empty()) {
                    DataWord dw = west_in.read();
                    write_bank(0, ((ap_uint<32>)bank0_addr12) << 4, dw);
                }
                if (!east_in.empty()) {
                    DataWord dw = east_in.read();
                    write_bank(1, ((ap_uint<32>)bank1_addr12) << 4, dw);
                }
            } else {
                if (side == 0 && !west_in.empty()) {
                    DataWord dw = west_in.read();
                    write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
                } else if (side == 1 && !east_in.empty()) {
                    DataWord dw = east_in.read();
                    write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
                }
            }
            break;
        }
        default:
            break;
    }
}

} // extern "C"
```

</details>

### mem_slice_test.cpp

<details>
<summary>HLS for mem_slice_test.cpp</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

void print_dataword(const DataWord &dw, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++)
        cout << setw(3) << (int)dw.data[i] << " ";
    cout << endl;
}

mem_inst_t make_inst(mem_opcode_t opc, unsigned bankbit, unsigned side,
                     unsigned addr12, unsigned amap=0,
                     unsigned res8=0, unsigned dfunc=0) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(8,7) = opc;
    inst.raw.bit(22) = side;
    inst.raw.bit(23) = bankbit;
    inst.raw.range(20,9) = addr12;
    inst.raw.bit(21) = amap;
    inst.raw.range(31,24) = res8;
    inst.raw.range(3,0) = dfunc;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<DataWord> west_in, west_out, east_in, east_out;

    // --------------------------
    // 1. WRITE bank0 west
    DataWord w;
    for (int i=0;i<WORD_BYTES;i++) w.data[i] = i;
    west_in.write(w);
    inst_in.write(make_inst(MEM_OP_WRITE, 0, 0, 0x010));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // 2. READ bank0 west
    inst_in.write(make_inst(MEM_OP_READ, 0, 0, 0x010));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!west_out.empty()) {
        DataWord r = west_out.read();
        print_dataword(r, "READ back bank0");
    }

    // 3. SCATTER_WRITE amap=1
    DataWord w2, e2;
    for (int i=0;i<WORD_BYTES;i++) { w2.data[i]=i+10; e2.data[i]=i+100; }
    west_in.write(w2);
    east_in.write(e2);
    inst_in.write(make_inst(MEM_OP_SCATTER_WRITE, 0, 0, 0x020, 1, 0x00, 0x03));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // 4.
    // GATHER_READ amap=1
    inst_in.write(make_inst(MEM_OP_GATHER_READ, 0, 0, 0x020, 1, 0x00, 0x03));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!west_out.empty()) print_dataword(west_out.read(), "GATHER bank0");
    if (!east_out.empty()) print_dataword(east_out.read(), "GATHER bank1");

    // Solution1
    while (!inst_out.empty()) {
        auto tmp = inst_out.read();
        // Optional: print it out for inspection
        // cout << "inst_out leftover: " << tmp.raw << endl;
    }
    return 0;
}
```

</details>

### Error1 @ mem_slice_test.cpp

```
INFO: [SIM 2] *************** CSIM start ***************
INFO: [SIM 4] CSIM will launch GCC as the compiler.
Compiling ../../../../mem_slice_test.cpp in debug mode
Compiling ../../../../mem_slice.cpp in debug mode
Generating csim.exe
READ back bank0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
GATHER bank0: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
GATHER bank1: 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
WARNING: Hls::stream 'hls::stream<mem_inst_t, 0>1' contains leftover data, which may result in RTL simulation hanging.
INFO: [SIM 1] CSim done with 0 errors.
INFO: [SIM 3] *************** CSIM finish ***************
```

* The test bench never reads from `hls::stream<mem_inst_t> inst_out`, so the instructions forwarded by `mem_slice` are left in the stream.

### Solution1 @ mem_slice_test.cpp

```cpp
while (!inst_out.empty()) {
    auto tmp = inst_out.read();
    // Optional: print it out for inspection
    // cout << "inst_out leftover: " << tmp.raw << endl;
}
```

### Syn Report & Scheduling

![image](https://hackmd.io/_uploads/S1ts3kDogx.png)

![image](https://hackmd.io/_uploads/rJE33kvsee.png)

![image](https://hackmd.io/_uploads/r1gJCkDogg.png)

:::info
`read_bank` and `write_bank` are the reason the latency is 18 cycles.
Taking `write_bank` as an example: although we add a pragma to unroll the loop, each bank has only two ports, so the 16 unrolled byte accesses still have to share those ports and become a resource-sharing bottleneck.
:::

```cpp
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataWord &dw) {
#pragma HLS INLINE
    unsigned int idx0 = (unsigned int)(byte_addr % MEM_SLICE_BYTES);
    for (int i = 0; i < WORD_BYTES; i++) {
#pragma HLS UNROLL
        unsigned int idx = (idx0 + i) % MEM_SLICE_BYTES;
        if (bank_id == 0) mem_bank0[idx] = dw.data[i];
        else              mem_bank1[idx] = dw.data[i];
    }
}
```

## Code - 3rd version

### Solution1 - RAM Bandwidth = 128 bit

<details>
<summary>mem_slice.h: widen bank port to 128 bit</summary>

```cpp
#ifndef MEM_SLICE_H
#define MEM_SLICE_H

#include <ap_int.h>
#include <hls_stream.h>
#include <stdint.h>

// ---------------------------
// bit layout (LSB..MSB):
//   3:0   dfunc    (4)
//   6:4   dskew    (3)
//   8:7   opcode   (2)
//  20:9   addr     (12)
//  21     addr_map (1)
//  22     w_or_e   (1)
//  23     bankbit  (1)
//  31:24  reserved (8)
// ---------------------------
#define MEM_DFUNC_BITS 4
#define MEM_DSKEW_BITS 3
#define MEM_OPCODE_BITS 2
#define MEM_ADDR_BITS 12
#define MEM_ADDR_MAP_BITS 1
#define MEM_W_OR_E_BITS 1
#define MEM_BANK_BITS 1
#define MEM_RES_BITS 8

#define WORD_BYTES 16
#define MEM_SLICE_BYTES (16*4096)

// ---------- types ----------
typedef struct { ap_uint<32> raw; } mem_inst_t;
typedef ap_uint<8> byte_t;
//struct DataWord { byte_t data[WORD_BYTES]; };
typedef ap_uint<128> DataWord;

typedef enum mem_opcode {
    MEM_OP_READ = 0,
    MEM_OP_WRITE = 1,
    MEM_OP_GATHER_READ = 2,
    MEM_OP_SCATTER_WRITE = 3
} mem_opcode_t;

typedef ap_uint<128> wide_t;
#define MEM_SLICE_WORDS (MEM_SLICE_BYTES / WORD_BYTES)

// ---------- decode helpers ----------
inline ap_uint<MEM_DFUNC_BITS> get_dfunc(const mem_inst_t &i) { return i.raw.range(3,0); }
inline ap_uint<MEM_DSKEW_BITS> get_dskew(const mem_inst_t &i) { return i.raw.range(6,4); }
inline ap_uint<MEM_OPCODE_BITS> get_opcode(const mem_inst_t &i) { return i.raw.range(8,7); }
inline ap_uint<MEM_ADDR_BITS> get_addr(const mem_inst_t
    &i) { return i.raw.range(20,9); }
inline ap_uint<MEM_ADDR_MAP_BITS> get_addrmap(const mem_inst_t &i) { return i.raw.bit(21); }
inline ap_uint<MEM_W_OR_E_BITS> get_wore(const mem_inst_t &i) { return i.raw.bit(22); }
inline ap_uint<MEM_BANK_BITS> get_bankbit(const mem_inst_t &i) { return i.raw.bit(23); }
inline ap_uint<MEM_RES_BITS> get_res(const mem_inst_t &i) { return i.raw.range(31,24); }

// ---------- memory bank access function declarations ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataWord &dw);
DataWord read_bank(unsigned bank_id, ap_uint<32> byte_addr);

// ---------- top-level function ----------
extern "C" void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<DataWord> &west_in,
    hls::stream<DataWord> &west_out,
    hls::stream<DataWord> &east_in,
    hls::stream<DataWord> &east_out
);

#endif // MEM_SLICE_H
```

</details>

<details>
<summary>mem_slice.cpp: widen bank port to 128 bit</summary>

```cpp
#include "mem_slice.h"

// ---------- memory banks ----------
static wide_t mem_bank0[MEM_SLICE_WORDS];
static wide_t mem_bank1[MEM_SLICE_WORDS];

// ---------- memory access functions ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataWord &dw) {
#pragma HLS INLINE
    // divide byte_addr by 16 to get the 128-bit word index
    unsigned int idx = (unsigned int)((byte_addr / 16) % MEM_SLICE_WORDS);
    if (bank_id == 0) mem_bank0[idx] = dw; // data is 128-bit
    else              mem_bank1[idx] = dw;
}

DataWord read_bank(unsigned bank_id, ap_uint<32> byte_addr) {
#pragma HLS INLINE
    DataWord dw;
    unsigned int idx = (unsigned int)((byte_addr / 16) % MEM_SLICE_WORDS);
    if (bank_id == 0) dw = mem_bank0[idx];
    else              dw = mem_bank1[idx];
    return dw;
}

// ---------- top-level DUT ----------
extern "C" {

void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<DataWord> &west_in,
    hls::stream<DataWord> &west_out,
    hls::stream<DataWord> &east_in,
    hls::stream<DataWord> &east_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE ap_ctrl_none port=return

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    ap_uint<MEM_OPCODE_BITS> opc = get_opcode(inst);
    ap_uint<MEM_DSKEW_BITS> dsk = get_dskew(inst);
    ap_uint<MEM_ADDR_MAP_BITS> amap = get_addrmap(inst);
    ap_uint<MEM_W_OR_E_BITS> side = get_wore(inst);
    ap_uint<MEM_BANK_BITS> bankbit = get_bankbit(inst);
    ap_uint<MEM_ADDR_BITS> addr12 = get_addr(inst);
    ap_uint<MEM_RES_BITS> res8 = get_res(inst);
    ap_uint<MEM_DFUNC_BITS> dfunc = get_dfunc(inst);

    // forward
    inst_out.write(inst);

    // dskew: idle for dsk iterations
    for (int dd = 0; dd < (int)dsk; ++dd) {
#pragma HLS PIPELINE II=1
    }

    // addresses used only by gather/scatter
    ap_uint<12> bank0_addr12 = addr12;
    ap_uint<12> bank1_addr12 = ((ap_uint<12>)res8 << 4) | dfunc;

    switch (opc) {
        case MEM_OP_READ: {
            DataWord dout = read_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4);
            if (side == 0) west_out.write(dout);
            else           east_out.write(dout);
            break;
        }
        case MEM_OP_WRITE: {
            if (side == 0 && !west_in.empty()) {
                DataWord dw = west_in.read();
                write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
            } else if (side == 1 && !east_in.empty()) {
                DataWord dw = east_in.read();
                write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
            }
            break;
        }
        case MEM_OP_GATHER_READ: {
            if (amap == 1) {
                DataWord d0 = read_bank(0, ((ap_uint<32>)bank0_addr12) << 4);
                DataWord d1 = read_bank(1, ((ap_uint<32>)bank1_addr12) << 4);
                west_out.write(d0);
                east_out.write(d1);
            } else {
                DataWord dout = read_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4);
                if (side == 0) west_out.write(dout);
                else           east_out.write(dout);
            }
            break;
        }
        case MEM_OP_SCATTER_WRITE: {
            if (amap == 1) {
                if (!west_in.empty()) {
                    DataWord dw = west_in.read();
                    write_bank(0, ((ap_uint<32>)bank0_addr12) << 4, dw);
                }
                if (!east_in.empty()) {
                    DataWord dw =
                        east_in.read();
                    write_bank(1, ((ap_uint<32>)bank1_addr12) << 4, dw);
                }
            } else {
                if (side == 0 && !west_in.empty()) {
                    DataWord dw = west_in.read();
                    write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
                } else if (side == 1 && !east_in.empty()) {
                    DataWord dw = east_in.read();
                    write_bank((unsigned)bankbit, ((ap_uint<32>)addr12) << 4, dw);
                }
            }
            break;
        }
        default:
            break;
    }
}

} // extern "C"
```

</details>

<details>
<summary>mem_slice_test.cpp: widen bank port to 128 bit</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

// --------------------------
// print DataWord (ap_uint<128>)
void print_dataword(const DataWord &dw, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++) {
        ap_uint<8> b = dw.range((i + 1) * 8 - 1, i * 8);
        cout << setw(3) << (int)b << " ";
    }
    cout << endl;
}

// --------------------------
// helper to create mem_inst_t
mem_inst_t make_inst(mem_opcode_t opc, unsigned bankbit, unsigned side,
                     unsigned addr12, unsigned amap = 0,
                     unsigned res8 = 0, unsigned dfunc = 0) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(8, 7) = opc;
    inst.raw.bit(22) = side;
    inst.raw.bit(23) = bankbit;
    inst.raw.range(20, 9) = addr12;
    inst.raw.bit(21) = amap;
    inst.raw.range(31, 24) = res8;
    inst.raw.range(3, 0) = dfunc;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<DataWord> west_in, west_out, east_in, east_out;

    // --------------------------
    // 1. WRITE bank0 west
    DataWord w = 0;
    for (int i = 0; i < WORD_BYTES; i++)
        w.range((i + 1) * 8 - 1, i * 8) = i;
    west_in.write(w);
    inst_in.write(make_inst(MEM_OP_WRITE, 0, 0, 0x010));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // --------------------------
    // 2.
    // READ bank0 west
    inst_in.write(make_inst(MEM_OP_READ, 0, 0, 0x010));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!west_out.empty()) {
        DataWord r = west_out.read();
        print_dataword(r, "READ back bank0");
    }

    // --------------------------
    // 3. SCATTER_WRITE amap=1
    DataWord w2 = 0, e2 = 0;
    for (int i = 0; i < WORD_BYTES; i++) {
        w2.range((i + 1) * 8 - 1, i * 8) = i + 10;
        e2.range((i + 1) * 8 - 1, i * 8) = i + 100;
    }
    west_in.write(w2);
    east_in.write(e2);
    inst_in.write(make_inst(MEM_OP_SCATTER_WRITE, 0, 0, 0x020, 1, 0x00, 0x03));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // --------------------------
    // 4. GATHER_READ amap=1
    inst_in.write(make_inst(MEM_OP_GATHER_READ, 0, 0, 0x020, 1, 0x00, 0x03));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!west_out.empty()) print_dataword(west_out.read(), "GATHER bank0");
    if (!east_out.empty()) print_dataword(east_out.read(), "GATHER bank1");

    // --------------------------
    // drain inst_out to avoid the leftover-data warning
    while (!inst_out.empty()) inst_out.read();

    return 0;
}
```

</details>

### Result Compare & Scheduling

![image](https://hackmd.io/_uploads/B1RVE6voll.png)

![image](https://hackmd.io/_uploads/SkYwHawjxg.png)

## Code - 4th version

To address the limitation of the 3rd version, we redesigned the instruction format with two opcodes, one per bank, so that the two banks can operate independently. The instruction still carries only one 12-bit address field, however, which means both banks access the same address. The bits freed by carrying a single address are used to specify the stream source and destination for each bank.
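
The 32-bit packing described above can be sketched in plain C++. This is only an illustration, not the HLS code itself: it assumes the bit positions used by the decode helpers in this version's `mem_slice.h` (`dskew` in bits [3:0], the two opcodes in [5:4] and [7:6], the shared address in [19:8], the two west/east bits at 20 and 21, and the two 5-bit stream indices in [26:22] and [31:27]). The names `MemInstFields`, `bits`, `encode` and `decode` are hypothetical, and `uint32_t` stands in for `ap_uint<32>` so the sketch runs without the HLS headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain-C++ mirror of the 4th-version instruction fields.
struct MemInstFields {
    uint32_t dskew;    // bits [3:0]
    uint32_t opcode0;  // bits [5:4]   operation for bank 0
    uint32_t opcode1;  // bits [7:6]   operation for bank 1
    uint32_t addr;     // bits [19:8]  one address shared by both banks
    uint32_t side0;    // bit  20      0 = west, 1 = east (bank 0)
    uint32_t side1;    // bit  21      0 = west, 1 = east (bank 1)
    uint32_t stream0;  // bits [26:22] stream index for bank 0
    uint32_t stream1;  // bits [31:27] stream index for bank 1
};

// Extract `width` bits of `raw`, starting at bit `lsb`.
inline uint32_t bits(uint32_t raw, unsigned lsb, unsigned width) {
    return (raw >> lsb) & ((1u << width) - 1u);
}

// Pack the fields into one 32-bit word (4+2+2+12+1+1+5+5 = 32 bits).
inline uint32_t encode(const MemInstFields &f) {
    // Every field must fit in its slot, otherwise it would spill into a neighbour.
    assert(f.dskew < 16 && f.opcode0 < 4 && f.opcode1 < 4 && f.addr < 4096);
    assert(f.side0 < 2 && f.side1 < 2 && f.stream0 < 32 && f.stream1 < 32);
    return f.dskew | (f.opcode0 << 4) | (f.opcode1 << 6) | (f.addr << 8) |
           (f.side0 << 20) | (f.side1 << 21) | (f.stream0 << 22) | (f.stream1 << 27);
}

// Unpack a 32-bit word back into its fields.
inline MemInstFields decode(uint32_t raw) {
    return MemInstFields{bits(raw, 0, 4),  bits(raw, 4, 2),  bits(raw, 6, 2),
                         bits(raw, 8, 12), bits(raw, 20, 1), bits(raw, 21, 1),
                         bits(raw, 22, 5), bits(raw, 27, 5)};
}
```

Under these assumptions, a bank-0 write from west stream 3 at address `0x001` with bank 1 idle (`opcode0 = 1`, `opcode1 = 3`) encodes to `0x00C001D0`. Because there is a single `addr` field, two different addresses for the two banks cannot be expressed in one instruction.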
* The `GATHER` and `SCATTER` operations are removed.

<details>
<summary>mem_slice.h</summary>

```cpp
#ifndef MEM_SLICE_H
#define MEM_SLICE_H

// Must be defined before <ap_int.h> so that the 4096-bit data_full_t is allowed
#define AP_INT_MAX_W 4096
#include <ap_int.h>
#include <hls_stream.h>
#include <stdint.h>

// ---------------------------
// bit layout (LSB..MSB):
//   0:3    dskew           (4)
//   4:5    opcode0         (2)
//   6:7    opcode1         (2)
//   8:19   addr            (12)
//  20      w_or_e0         (1)
//  21      w_or_e1         (1)
//  22:26   stream_src_dst0 (5)
//  27:31   stream_src_dst1 (5)
// ---------------------------
#define MEM_DSKEW_BITS 4
#define MEM_OPCODE0_BITS 2
#define MEM_OPCODE1_BITS 2
#define MEM_ADDR_BITS 12
#define MEM_W_OR_E0_BITS 1
#define MEM_W_OR_E1_BITS 1
#define MEM_SRC_DST0_BITS 5
#define MEM_SRC_DST1_BITS 5

#define WORD_BYTES 16
#define STREAM_WIDTH 128
#define NUM_OF_STREAMS 32
#define MEM_SLICE_BYTES (16*4096)
#define MEM_SLICE_WORDS (MEM_SLICE_BYTES / WORD_BYTES)

// ---------- types ----------
typedef struct { ap_uint<32> raw; } mem_inst_t;
typedef ap_uint<STREAM_WIDTH> DataUnit;
typedef ap_uint<STREAM_WIDTH * NUM_OF_STREAMS> data_full_t;

typedef enum mem_opcode {
    MEM_OP_READ = 0,
    MEM_OP_WRITE = 1,
    NOP = 3
} mem_opcode_t;

// ---------- decode helpers ----------
inline ap_uint<MEM_DSKEW_BITS> get_dskew(const mem_inst_t &i) { return i.raw.range(3,0); }
inline ap_uint<MEM_OPCODE0_BITS> get_opcode0(const mem_inst_t &i) { return i.raw.range(5,4); }
inline ap_uint<MEM_OPCODE1_BITS> get_opcode1(const mem_inst_t &i) { return i.raw.range(7,6); }
inline ap_uint<MEM_ADDR_BITS> get_addr(const mem_inst_t &i) { return i.raw.range(19,8); }
inline ap_uint<MEM_W_OR_E0_BITS> get_wore0(const mem_inst_t &i) { return i.raw.bit(20); }
inline ap_uint<MEM_W_OR_E1_BITS> get_wore1(const mem_inst_t &i) { return i.raw.bit(21); }
inline ap_uint<MEM_SRC_DST0_BITS> get_srcdst0(const mem_inst_t &i) { return i.raw.range(26,22); }
inline ap_uint<MEM_SRC_DST1_BITS> get_srcdst1(const mem_inst_t &i) { return i.raw.range(31,27); }

// ---------- memory bank access function declarations ----------
void write_bank(unsigned bank_id, ap_uint<32>
    byte_addr, const DataUnit &dw);
DataUnit read_bank(unsigned bank_id, ap_uint<32> byte_addr);
DataUnit get_stream_slice(const data_full_t &full, unsigned idx);
void set_stream_slice(data_full_t &full, unsigned idx, const DataUnit &val);

// ---------- top-level function ----------
extern "C" void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<data_full_t> &west_in,
    hls::stream<data_full_t> &west_out,
    hls::stream<data_full_t> &east_in,
    hls::stream<data_full_t> &east_out
);

#endif // MEM_SLICE_H
```

</details>

<details>
<summary>mem_slice.cpp</summary>

```cpp
#include "mem_slice.h"

// ---------- memory banks ----------
static DataUnit mem_bank0[MEM_SLICE_WORDS];
static DataUnit mem_bank1[MEM_SLICE_WORDS];

// ---------- memory access ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataUnit &dw) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    if (bank_id == 0) mem_bank0[idx] = dw;
    else              mem_bank1[idx] = dw;
}

DataUnit read_bank(unsigned bank_id, ap_uint<32> byte_addr) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    return (bank_id == 0) ?
        mem_bank0[idx] : mem_bank1[idx];
}

// slice helpers
// (no C++ `inline` keyword here: the header declares these as ordinary
// functions and the test bench calls them from another translation unit)
DataUnit get_stream_slice(const data_full_t &full, unsigned idx) {
#pragma HLS INLINE
    return full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH);
}

void set_stream_slice(data_full_t &full, unsigned idx, const DataUnit &val) {
#pragma HLS INLINE
    full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH) = val;
}

// ---------- top-level DUT ----------
extern "C" {

void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<data_full_t> &west_in,
    hls::stream<data_full_t> &west_out,
    hls::stream<data_full_t> &east_in,
    hls::stream<data_full_t> &east_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE ap_ctrl_none port=return

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    inst_out.write(inst); // forward

    // decode
    ap_uint<MEM_DSKEW_BITS> dsk = get_dskew(inst);
    ap_uint<MEM_OPCODE0_BITS> op0 = get_opcode0(inst);
    ap_uint<MEM_OPCODE1_BITS> op1 = get_opcode1(inst);
    ap_uint<MEM_ADDR_BITS> addr12 = get_addr(inst);
    ap_uint<MEM_W_OR_E0_BITS> side0 = get_wore0(inst);
    ap_uint<MEM_W_OR_E1_BITS> side1 = get_wore1(inst);
    ap_uint<MEM_SRC_DST0_BITS> srcdst0 = get_srcdst0(inst);
    ap_uint<MEM_SRC_DST1_BITS> srcdst1 = get_srcdst1(inst);

    // dskew delay
    for (int dd = 0; dd < (int)dsk; ++dd) {
#pragma HLS PIPELINE II=1
    }

    // ---- Bank0 ----
    switch ((mem_opcode_t)(unsigned)op0) {
        case MEM_OP_READ: {
            DataUnit dout = read_bank(0, ((ap_uint<32>)addr12) << 4);
            data_full_t full_out = 0;
            set_stream_slice(full_out, srcdst0, dout);
            if (side0 == 0) west_out.write(full_out);
            else            east_out.write(full_out);
            break;
        }
        case MEM_OP_WRITE: {
            if (side0 == 0 && !west_in.empty()) {
                data_full_t full_in = west_in.read();
                write_bank(0, ((ap_uint<32>)addr12) << 4, get_stream_slice(full_in,
                    srcdst0));
            } else if (side0 == 1 && !east_in.empty()) {
                data_full_t full_in = east_in.read();
                write_bank(0, ((ap_uint<32>)addr12) << 4, get_stream_slice(full_in, srcdst0));
            }
            break;
        }
        case NOP:
        default:
            break;
    }

    // ---- Bank1 ----
    switch ((mem_opcode_t)(unsigned)op1) {
        case MEM_OP_READ: {
            DataUnit dout = read_bank(1, ((ap_uint<32>)addr12) << 4);
            data_full_t full_out = 0;
            set_stream_slice(full_out, srcdst1, dout);
            if (side1 == 0) west_out.write(full_out);
            else            east_out.write(full_out);
            break;
        }
        case MEM_OP_WRITE: {
            if (side1 == 0 && !west_in.empty()) {
                data_full_t full_in = west_in.read();
                write_bank(1, ((ap_uint<32>)addr12) << 4, get_stream_slice(full_in, srcdst1));
            } else if (side1 == 1 && !east_in.empty()) {
                data_full_t full_in = east_in.read();
                write_bank(1, ((ap_uint<32>)addr12) << 4, get_stream_slice(full_in, srcdst1));
            }
            break;
        }
        case NOP:
        default:
            break;
    }
}

} // extern "C"
```

</details>

<details>
<summary>mem_slice test (test 1: read & write)</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

// Helper: build a DataUnit (128-bit) filled with a fixed pattern
DataUnit make_data(int base) {
    DataUnit d = 0;
    for (int i = 0; i < WORD_BYTES; i++) {
        d.range((i+1)*8-1, i*8) = base + i;
    }
    return d;
}

// Helper: print a DataUnit
void print_data(const DataUnit &d, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++) {
        int val = d.range((i+1)*8-1, i*8);
        cout << setw(3) << val << " ";
    }
    cout << endl;
}

// Build an instruction
mem_inst_t make_inst(
    unsigned dskew,
    unsigned op0, unsigned op1,
    unsigned addr12,
    unsigned side0, unsigned side1,
    unsigned srcdst0, unsigned srcdst1
) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(3,0) = dskew;
    inst.raw.range(5,4) = op0;
    inst.raw.range(7,6) = op1;
    inst.raw.range(19,8) = addr12;
    inst.raw.bit(20) = side0;
    inst.raw.bit(21) = side1;
    inst.raw.range(26,22) = srcdst0;
    inst.raw.range(31,27) = srcdst1;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<data_full_t> west_in,
        west_out, east_in, east_out;

    // -----------------------
    // 1. WRITE -> Bank0, stream idx=3, side=west
    data_full_t din0 = 0;
    set_stream_slice(din0, 3, make_data(10)); // fill with [10..25]
    west_in.write(din0);
    inst_in.write(make_inst(0, MEM_OP_WRITE, NOP, 0x001, 0, 0, 3, 0));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // -----------------------
    // 2. READ <- Bank0, stream idx=3, side=west
    inst_in.write(make_inst(0, MEM_OP_READ, NOP, 0x001, 0, 0, 3, 0));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!west_out.empty()) {
        data_full_t dout_full = west_out.read();
        DataUnit dout = get_stream_slice(dout_full, 3);
        print_data(dout, "READ Bank0[addr=0x001]");
    }

    // -----------------------
    // 3. WRITE -> Bank1, stream idx=5, side=east
    data_full_t din1 = 0;
    set_stream_slice(din1, 5, make_data(100)); // fill with [100..115]
    east_in.write(din1);
    inst_in.write(make_inst(0, NOP, MEM_OP_WRITE, 0x002, 0, 1, 0, 5));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);

    // -----------------------
    // 4. READ <- Bank1, stream idx=5, side=east
    inst_in.write(make_inst(0, NOP, MEM_OP_READ, 0x002, 0, 1, 0, 5));
    mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
    if (!east_out.empty()) {
        data_full_t dout_full = east_out.read();
        DataUnit dout = get_stream_slice(dout_full, 5);
        print_data(dout, "READ Bank1[addr=0x002]");
    }

    // Drain inst_out so nothing is left over
    while (!inst_out.empty()) inst_out.read();

    return 0;
}
```

</details>

<details>
<summary>mem_slice_test.cpp (test: alternating write + read)</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

DataUnit make_data(int base) {
    DataUnit d = 0;
    for (int i = 0; i < WORD_BYTES; i++)
        d.range((i+1)*8-1, i*8) = base + i;
    return d;
}

void print_data(const DataUnit &d, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++)
        cout << setw(3) << d.range((i+1)*8-1, i*8) << " ";
    cout << endl;
}

mem_inst_t make_inst(unsigned dskew, unsigned op0, unsigned op1,
                     unsigned addr12, unsigned side0, unsigned side1,
                     unsigned srcdst0, unsigned srcdst1) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(3,0) = dskew;
    inst.raw.range(5,4) = op0;
    inst.raw.range(7,6) = op1;
    inst.raw.range(19,8) = addr12;
    inst.raw.bit(20) = side0;
    inst.raw.bit(21) = side1;
    inst.raw.range(26,22) = srcdst0;
    inst.raw.range(31,27) = srcdst1;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<data_full_t> west_in, west_out, east_in, east_out;

    const int num_instr = 10;

    // Alternate WRITE and READ to the same address
    for (int i = 0; i < num_instr; i++) {
        // WRITE
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(10 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_WRITE, NOP, 0x001 + i, 0, 0, i % NUM_OF_STREAMS, 0));

        // READ
        inst_in.write(make_inst(0, MEM_OP_READ, NOP, 0x001 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    int cycle = 0, i = 0;
    while (!inst_in.empty() || !west_out.empty()) {
        mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
        cycle++;
        if
            (!west_out.empty()) {
            data_full_t dout_full = west_out.read();
            DataUnit dout = get_stream_slice(dout_full, i);
            i++;
            cout << "Cycle " << cycle << ": ";
            print_data(dout, "Output");
        }
        while (!inst_out.empty()) inst_out.read();
    }

    cout << "Total cycles: " << cycle << endl;
    cout << "Estimated II = " << (double)cycle / num_instr << endl;
    return 0;
}
```

</details>

<details>
<summary>mem_slice_test.cpp (test: all writes first, then all reads)</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

DataUnit make_data(int base) {
    DataUnit d = 0;
    for (int i = 0; i < WORD_BYTES; i++)
        d.range((i+1)*8-1, i*8) = base + i;
    return d;
}

void print_data(const DataUnit &d, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++)
        cout << setw(3) << d.range((i+1)*8-1, i*8) << " ";
    cout << endl;
}

mem_inst_t make_inst(unsigned dskew, unsigned op0, unsigned op1,
                     unsigned addr12, unsigned side0, unsigned side1,
                     unsigned srcdst0, unsigned srcdst1) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(3,0) = dskew;
    inst.raw.range(5,4) = op0;
    inst.raw.range(7,6) = op1;
    inst.raw.range(19,8) = addr12;
    inst.raw.bit(20) = side0;
    inst.raw.bit(21) = side1;
    inst.raw.range(26,22) = srcdst0;
    inst.raw.range(31,27) = srcdst1;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<data_full_t> west_in, west_out, east_in, east_out;

    const int num_instr = 10;

    // -----------------------
    // Send all WRITEs first
    for (int i = 0; i < num_instr; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(10 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_WRITE, NOP, 0x010 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    // Then send all READs
    for (int i = 0; i < num_instr; i++) {
        inst_in.write(make_inst(0, MEM_OP_READ, NOP, 0x010 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    int cycle = 0, i = 0;
    while (!inst_in.empty() || !west_out.empty()) {
        mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
        cycle++;
        if
            (!west_out.empty()) {
            data_full_t dout_full = west_out.read();
            DataUnit dout = get_stream_slice(dout_full, i);
            i++;
            cout << "Cycle " << cycle << ": ";
            print_data(dout, "Output");
        }
        while (!inst_out.empty()) inst_out.read();
    }

    cout << "Total cycles: " << cycle << endl;
    cout << "Estimated II = " << (double)cycle / num_instr << endl;
    return 0;
}
```

</details>

### Syn Report (unpipelined)

![image](https://hackmd.io/_uploads/BJAnF1xhgx.png)
![image](https://hackmd.io/_uploads/rJJe9Jlhxl.png)
* The 1st cycle forwards the instruction to `inst_out`.

![image](https://hackmd.io/_uploads/rJPUcJenll.png)
![image](https://hackmd.io/_uploads/SJknjkgnel.png)
* The 2nd cycle reads or writes data at bank0.

![image](https://hackmd.io/_uploads/S1ji5ylhxe.png)
* The 3rd cycle writes the output to the FIFO.
* The 4th and 5th cycles repeat the 2nd and 3rd cycles for bank1.

![image](https://hackmd.io/_uploads/Sybv2ylnel.png)

### Syn Report (Pipelined)
* Add `#pragma HLS pipeline II=1`.

![image](https://hackmd.io/_uploads/BJCx7wZhel.png)
![image](https://hackmd.io/_uploads/rk_7XP-2xx.png)

### Reason for II = 1 Violation

```cpp
if (side0 == 0) west_out.write(full_out); // Bank0
...
if (side1 == 0) west_out.write(full_out); // Bank1
```
* Writing to the same AXI port twice within a single cycle is not allowed and leads to a violation.

```cpp
if (side0 == 0 && !west_in.empty()) {
    data_full_t full_in = west_in.read(); // first read
    ...
}
if (side1 == 0 && !west_in.empty()) {
    data_full_t full_in = west_in.read(); // second read
    ...
}
```
* This code shows the same issue on the input side: the same port is read twice in one cycle.

### Solution for II = 1 Violation
* Buffer the port values in local variables: read each input port once into a temporary and collect the outputs for each side into a single word, so that both banks share the same value within one cycle and each port is accessed at most once.
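The buffering idea can be sketched in plain C++ without the HLS headers. Everything named here is an assumption made for illustration: `Port` is a hypothetical stand-in for `hls::stream` that counts accesses, a 32-bit word with four 8-bit lanes stands in for `data_full_t`, and `emit_buffered` / `consume_buffered` are not functions from the design. The sketch only shows that merging through a temporary leaves one access per port per call; it says nothing about the II the HLS tool will actually report.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-in for hls::stream<T> that counts port accesses.
struct Port {
    std::queue<std::uint32_t> fifo;
    int reads = 0;
    int writes = 0;
    bool empty() const { return fifo.empty(); }
    std::uint32_t read() {
        std::uint32_t v = fifo.front();
        fifo.pop();
        ++reads;
        return v;
    }
    void write(std::uint32_t v) {
        fifo.push(v);
        ++writes;
    }
};

// Put an 8-bit value into lane 0..3 of a 32-bit word
// (plays the role of set_stream_slice).
inline std::uint32_t set_lane(std::uint32_t word, unsigned lane, std::uint8_t val) {
    assert(lane < 4);
    const unsigned shift = lane * 8;
    return (word & ~(0xFFu << shift)) | (static_cast<std::uint32_t>(val) << shift);
}

// Extract lane 0..3 from a 32-bit word (plays the role of get_stream_slice).
inline std::uint8_t get_lane(std::uint32_t word, unsigned lane) {
    assert(lane < 4);
    return static_cast<std::uint8_t>((word >> (lane * 8)) & 0xFFu);
}

// Output side: both banks target the same port, so merge first and write once.
inline void emit_buffered(Port &out, std::uint8_t bank0_val, unsigned lane0,
                          std::uint8_t bank1_val, unsigned lane1) {
    std::uint32_t full = 0;
    full = set_lane(full, lane0, bank0_val);
    full = set_lane(full, lane1, bank1_val);
    out.write(full);  // single write per call
}

// Input side: pop the port once; both banks slice the same temporary.
inline bool consume_buffered(Port &in, unsigned lane0, unsigned lane1,
                             std::uint8_t &to_bank0, std::uint8_t &to_bank1) {
    if (in.empty()) return false;
    const std::uint32_t tmp = in.read();  // single read per call
    to_bank0 = get_lane(tmp, lane0);
    to_bank1 = get_lane(tmp, lane1);
    return true;
}
```

Calling `emit_buffered(out, 0xAA, 1, 0xBB, 3)` leaves `out.writes == 1` with both lanes filled, where the per-bank code shown under "Reason for II = 1 Violation" would have written the port twice.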
<details>
<summary>mem_slice.cpp</summary>

```cpp
#include "mem_slice.h"

// ---------- memory banks ----------
static DataUnit mem_bank0[MEM_SLICE_WORDS];
static DataUnit mem_bank1[MEM_SLICE_WORDS];

// ---------- memory access ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataUnit &dw) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    if (bank_id == 0) mem_bank0[idx] = dw;
    else mem_bank1[idx] = dw;
}

DataUnit read_bank(unsigned bank_id, ap_uint<32> byte_addr) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    return (bank_id == 0) ? mem_bank0[idx] : mem_bank1[idx];
}

// slice helper
inline DataUnit get_stream_slice(const data_full_t &full, unsigned idx) {
#pragma HLS INLINE
    return full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH);
}

inline void set_stream_slice(data_full_t &full, unsigned idx, const DataUnit &val) {
#pragma HLS INLINE
    full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH) = val;
}

// ---------- top-level DUT ----------
extern "C" {
void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<data_full_t> &west_in,
    hls::stream<data_full_t> &west_out,
    hls::stream<data_full_t> &east_in,
    hls::stream<data_full_t> &east_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS PIPELINE II=1

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    inst_out.write(inst); // forward

    // decode
    ap_uint<MEM_DSKEW_BITS> dsk = get_dskew(inst);
    ap_uint<MEM_OPCODE0_BITS> op0 = get_opcode0(inst);
    ap_uint<MEM_OPCODE1_BITS> op1 = get_opcode1(inst);
    ap_uint<MEM_ADDR_BITS> addr12 = get_addr(inst);
    ap_uint<MEM_W_OR_E0_BITS> side0 = get_wore0(inst);
    ap_uint<MEM_W_OR_E1_BITS> side1 = get_wore1(inst);
    ap_uint<MEM_SRC_DST0_BITS> srcdst0 = get_srcdst0(inst);
    ap_uint<MEM_SRC_DST1_BITS> srcdst1 = get_srcdst1(inst);

    // dskew delay (placeholder: an empty loop holds no state and may be optimized away)
    for (int dd = 0; dd < (int)dsk; ++dd) {
#pragma HLS PIPELINE II=1
    }

    // buffer input: each input port is read at most once per call
    data_full_t west_tmp = 0;
    data_full_t east_tmp = 0;
    bool west_n_empty = !west_in.empty();
    bool east_n_empty = !east_in.empty();
    if (west_n_empty) west_tmp = west_in.read();
    if (east_n_empty) east_tmp = east_in.read();

    // Precompute bank reads
    DataUnit bank0_dout = 0;
    bool bank0_do_read = ((mem_opcode_t)(unsigned)op0) == MEM_OP_READ;
    if (bank0_do_read) {
        bank0_dout = read_bank(0, ((ap_uint<32>)addr12) << 4);
    }
    DataUnit bank1_dout = 0;
    bool bank1_do_read = ((mem_opcode_t)(unsigned)op1) == MEM_OP_READ;
    if (bank1_do_read) {
        bank1_dout = read_bank(1, ((ap_uint<32>)addr12) << 4);
    }

    // Accumulate outputs per side to ensure at most one write per side
    bool west_valid = false;
    data_full_t west_full = 0;
    bool east_valid = false;
    data_full_t east_full = 0;

    // ---- Bank0 ----
    switch ((mem_opcode_t)(unsigned)op0) {
        case MEM_OP_READ: {
            if (side0 == 0) {
                set_stream_slice(west_full, srcdst0, bank0_dout);
                west_valid = true;
            } else {
                set_stream_slice(east_full, srcdst0, bank0_dout);
                east_valid = true;
            }
            break;
        }
        case MEM_OP_WRITE: {
            if (side0 == 0 && west_n_empty) {
                write_bank(0, ((ap_uint<32>)addr12) << 4, get_stream_slice(west_tmp, srcdst0));
            } else if (side0 == 1 && east_n_empty) {
                write_bank(0, ((ap_uint<32>)addr12) << 4, get_stream_slice(east_tmp, srcdst0));
            }
            break;
        }
        case NOP:
        default:
            break;
    }

    // ---- Bank1 ----
    switch ((mem_opcode_t)(unsigned)op1) {
        case MEM_OP_READ: {
            if (side1 == 0) {
                set_stream_slice(west_full, srcdst1, bank1_dout);
                west_valid = true;
            } else {
                set_stream_slice(east_full, srcdst1, bank1_dout);
                east_valid = true;
            }
            break;
        }
        case MEM_OP_WRITE: {
            if (side1 == 0 && west_n_empty) {
                write_bank(1, ((ap_uint<32>)addr12) << 4, get_stream_slice(west_tmp, srcdst1));
            } else if (side1 == 1 && east_n_empty) {
                write_bank(1, ((ap_uint<32>)addr12) << 4, get_stream_slice(east_tmp, srcdst1));
            }
            break;
        }
        case NOP:
        default:
            break;
    }

    // Single write to each side per iteration
    if (west_valid) {
        west_out.write(west_full);
    }
    if (east_valid) {
        east_out.write(east_full);
    }
}
} // extern "C"
```

</details>

### Syn Report

![image](https://hackmd.io/_uploads/BkCeM_Q3xg.png)
![image](https://hackmd.io/_uploads/HJSsW_m2lx.png)

* In `mem_slice_test.cpp`, the alternating write/read test may cause problems: each read follows the write to the same address by only one cycle, which can lead to a read-after-write timing issue in the pipelined design.

## Code5 - use fewer resources

<details>
<summary>mem_slice.cpp (min resource)</summary>

```cpp
#include "mem_slice.h"

// ---------- memory banks ----------
static DataUnit mem_bank0[MEM_SLICE_WORDS];
static DataUnit mem_bank1[MEM_SLICE_WORDS];

// ---------- memory access ----------
void write_bank(unsigned bank_id, ap_uint<32> byte_addr, const DataUnit &dw) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    if (bank_id == 0) mem_bank0[idx] = dw;
    else mem_bank1[idx] = dw;
}

DataUnit read_bank(unsigned bank_id, ap_uint<32> byte_addr) {
#pragma HLS INLINE
    unsigned int idx = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
    return (bank_id == 0) ?
        mem_bank0[idx] : mem_bank1[idx];
}

// slice helper
inline DataUnit get_stream_slice(const data_full_t &full, unsigned idx) {
#pragma HLS INLINE
    return full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH);
}

inline void set_stream_slice(data_full_t &full, unsigned idx, const DataUnit &val) {
#pragma HLS INLINE
    full.range((idx+1)*STREAM_WIDTH-1, idx*STREAM_WIDTH) = val;
}

// ---------- top-level DUT ----------
extern "C" {
void mem_slice(
    hls::stream<mem_inst_t> &inst_in,
    hls::stream<mem_inst_t> &inst_out,
    hls::stream<data_full_t> &west_in,
    hls::stream<data_full_t> &west_out,
    hls::stream<data_full_t> &east_in,
    hls::stream<data_full_t> &east_out
) {
#pragma HLS INTERFACE axis port=inst_in
#pragma HLS INTERFACE axis port=inst_out
#pragma HLS INTERFACE axis port=west_in
#pragma HLS INTERFACE axis port=west_out
#pragma HLS INTERFACE axis port=east_in
#pragma HLS INTERFACE axis port=east_out
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS PIPELINE II=1
#pragma HLS LATENCY min=1 max=6

    if (inst_in.empty()) return;

    mem_inst_t inst = inst_in.read();
    inst_out.write(inst); // forward

    // Decode directly and handle all operations in parallel
    ap_uint<MEM_DSKEW_BITS> dsk = get_dskew(inst);
    ap_uint<MEM_OPCODE0_BITS> op0 = get_opcode0(inst);
    ap_uint<MEM_OPCODE1_BITS> op1 = get_opcode1(inst);
    ap_uint<MEM_ADDR_BITS> addr12 = get_addr(inst);
    ap_uint<MEM_W_OR_E0_BITS> side0 = get_wore0(inst);
    ap_uint<MEM_W_OR_E1_BITS> side1 = get_wore1(inst);
    ap_uint<MEM_SRC_DST0_BITS> srcdst0 = get_srcdst0(inst);
    ap_uint<MEM_SRC_DST1_BITS> srcdst1 = get_srcdst1(inst);

    // dskew delay (placeholder: an empty loop holds no state and may be optimized away)
    for (int dd = 0; dd < (int)dsk; ++dd) {
#pragma HLS PIPELINE II=1
    }

    // Precompute all needed values
    ap_uint<32> byte_addr = ((ap_uint<32>)addr12) << 4;
    bool west_n_empty = !west_in.empty();
    bool east_n_empty = !east_in.empty();

    // Read memory and input data in parallel
    DataUnit bank0_dout = 0;
    DataUnit bank1_dout = 0;
    data_full_t west_tmp = 0;
    data_full_t east_tmp = 0;

#pragma HLS DEPENDENCE variable=mem_bank0 inter false
#pragma HLS DEPENDENCE variable=mem_bank1 inter false

    // Perform all reads in parallel - memory access inlined directly
    if (op0 == MEM_OP_READ) {
        unsigned int idx0 = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
        bank0_dout = mem_bank0[idx0];
    }
    if (op1 == MEM_OP_READ) {
        unsigned int idx1 = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
        bank1_dout = mem_bank1[idx1];
    }
    if (west_n_empty) west_tmp = west_in.read();
    if (east_n_empty) east_tmp = east_in.read();

    // Handle writes in parallel - memory access inlined directly
    if (op0 == MEM_OP_WRITE) {
        unsigned int idx0 = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
        if (side0 == 0 && west_n_empty) {
            mem_bank0[idx0] = get_stream_slice(west_tmp, srcdst0);
        } else if (side0 == 1 && east_n_empty) {
            mem_bank0[idx0] = get_stream_slice(east_tmp, srcdst0);
        }
    }
    if (op1 == MEM_OP_WRITE) {
        unsigned int idx1 = (unsigned int)((byte_addr / WORD_BYTES) % MEM_SLICE_WORDS);
        if (side1 == 0 && west_n_empty) {
            mem_bank1[idx1] = get_stream_slice(west_tmp, srcdst1);
        } else if (side1 == 1 && east_n_empty) {
            mem_bank1[idx1] = get_stream_slice(east_tmp, srcdst1);
        }
    }

    // Assemble read outputs in parallel
    data_full_t west_full = 0;
    data_full_t east_full = 0;
    bool west_valid = false;
    bool east_valid = false;

    if (op0 == MEM_OP_READ) {
        if (side0 == 0) {
            set_stream_slice(west_full, srcdst0, bank0_dout);
            west_valid = true;
        } else {
            set_stream_slice(east_full, srcdst0, bank0_dout);
            east_valid = true;
        }
    }
    if (op1 == MEM_OP_READ) {
        if (side1 == 0) {
            set_stream_slice(west_full, srcdst1, bank1_dout);
            west_valid = true;
        } else {
            set_stream_slice(east_full, srcdst1, bank1_dout);
            east_valid = true;
        }
    }

    // Emit results
    if (west_valid) west_out.write(west_full);
    if (east_valid) east_out.write(east_full);
}
} // extern "C"
```

</details>

* Simpler control logic, for example flat `if` statements instead of nested `switch` / `if`-`else` blocks, appears to reduce resource usage.
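Before comparing resource numbers, it helps to confirm that a flattened form still decides the same thing as the nested one. The sketch below is a plain C++ model of only the `west_valid` / `east_valid` decision; the opcode values in `Op`, the `Valid` struct, and all function names are assumptions made for illustration (the real encoding lives in `mem_slice.h`). The flat variant goes one step further than the code above by writing each flag as a single boolean expression; whether that saves resources is something only a synthesis run can tell.

```cpp
#include <cstdint>

// Hypothetical opcode values; the real ones are defined in mem_slice.h.
enum Op : unsigned { OP_READ = 0, OP_WRITE = 1, OP_NOP = 2 };

struct Valid {
    bool west;
    bool east;
};

inline bool same(const Valid &a, const Valid &b) {
    return a.west == b.west && a.east == b.east;
}

// Nested form: one switch per bank with an if/else inside.
inline Valid valid_nested(unsigned op0, unsigned side0, unsigned op1, unsigned side1) {
    Valid v{false, false};
    switch (op0) {
        case OP_READ:
            if (side0 == 0) v.west = true;
            else v.east = true;
            break;
        default:
            break;
    }
    switch (op1) {
        case OP_READ:
            if (side1 == 0) v.west = true;
            else v.east = true;
            break;
        default:
            break;
    }
    return v;
}

// Flat form: each flag is one boolean expression over the decoded fields.
inline Valid valid_flat(unsigned op0, unsigned side0, unsigned op1, unsigned side1) {
    const bool r0 = (op0 == OP_READ);
    const bool r1 = (op1 == OP_READ);
    return Valid{(r0 && side0 == 0) || (r1 && side1 == 0),
                 (r0 && side0 != 0) || (r1 && side1 != 0)};
}

// Compare both forms over every opcode/side combination;
// returns the number of combinations where they disagree.
inline int count_mismatches() {
    int bad = 0;
    for (unsigned op0 = 0; op0 < 4; ++op0)
        for (unsigned side0 = 0; side0 < 2; ++side0)
            for (unsigned op1 = 0; op1 < 4; ++op1)
                for (unsigned side1 = 0; side1 < 2; ++side1)
                    if (!same(valid_nested(op0, side0, op1, side1),
                              valid_flat(op0, side0, op1, side1)))
                        ++bad;
    return bad;
}
```

If `count_mismatches()` returns 0, the two forms are interchangeable for this decision, so either can be tried in synthesis and the reports compared.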
### Syn Report

![image](https://hackmd.io/_uploads/rJFI6Kmnxx.png)
![image](https://hackmd.io/_uploads/Hkt5Ttm2xl.png)
![image](https://hackmd.io/_uploads/ByK36tmhxl.png)
![image](https://hackmd.io/_uploads/rJbbAKmhxl.png)

<details>
<summary>mem_slice_test.cpp (all cases)</summary>

```cpp
#include "mem_slice.h"
#include <iostream>
#include <iomanip>
using namespace std;

DataUnit make_data(int base) {
    DataUnit d = 0;
    for (int i = 0; i < WORD_BYTES; i++)
        d.range((i+1)*8-1, i*8) = base + i;
    return d;
}

void print_data(const DataUnit &d, const char *tag) {
    cout << tag << ": ";
    for (int i = 0; i < WORD_BYTES; i++)
        cout << setw(3) << d.range((i+1)*8-1, i*8) << " ";
    cout << endl;
}

mem_inst_t make_inst(unsigned dskew, unsigned op0, unsigned op1,
                     unsigned addr12, unsigned side0, unsigned side1,
                     unsigned srcdst0, unsigned srcdst1) {
    mem_inst_t inst;
    inst.raw = 0;
    inst.raw.range(3,0) = dskew;
    inst.raw.range(5,4) = op0;
    inst.raw.range(7,6) = op1;
    inst.raw.range(19,8) = addr12;
    inst.raw.bit(20) = side0;
    inst.raw.bit(21) = side1;
    inst.raw.range(26,22) = srcdst0;
    inst.raw.range(31,27) = srcdst1;
    return inst;
}

int main() {
    hls::stream<mem_inst_t> inst_in, inst_out;
    hls::stream<data_full_t> west_in, west_out, east_in, east_out;

    cout << "=== Starting multi-instruction combined test of mem_slice ===\n" << endl;

    // Test 1: basic WRITE and READ (multiple instructions)
    cout << "Test 1: basic WRITE and READ (5 instructions)" << endl;
    for (int i = 0; i < 5; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(100 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_WRITE, NOP, 0x010 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }
    for (int i = 0; i < 5; i++) {
        inst_in.write(make_inst(0, MEM_OP_READ, NOP, 0x010 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    // Test 2: both banks READ at the same time (multiple instructions)
    cout << "\nTest 2: both banks READ simultaneously (4 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        inst_in.write(make_inst(0, MEM_OP_READ, MEM_OP_READ, 0x020 + i, 0, 1, i % NUM_OF_STREAMS, (i+1) % NUM_OF_STREAMS));
    }

    // Test 3: both banks WRITE at the same time (multiple instructions)
    cout << "\nTest 3: both banks WRITE simultaneously (4 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din_west = 0, din_east = 0;
        set_stream_slice(din_west, i % NUM_OF_STREAMS, make_data(200 + i*WORD_BYTES));
        set_stream_slice(din_east, (i+1) % NUM_OF_STREAMS, make_data(300 + i*WORD_BYTES));
        west_in.write(din_west);
        east_in.write(din_east);
        inst_in.write(make_inst(0, MEM_OP_WRITE, MEM_OP_WRITE, 0x030 + i, 0, 1, i % NUM_OF_STREAMS, (i+1) % NUM_OF_STREAMS));
    }

    // Test 4: mixed operations (multiple instructions)
    cout << "\nTest 4: mixed operations (Bank0 READ, Bank1 WRITE) (4 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(400 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_READ, MEM_OP_WRITE, 0x040 + i, 0, 0, i % NUM_OF_STREAMS, i % NUM_OF_STREAMS));
    }

    // Test 5: with dskew delay (multiple instructions)
    cout << "\nTest 5: with dskew delay (6 instructions)" << endl;
    for (int i = 0; i < 3; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(500 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(i+1, MEM_OP_WRITE, NOP, 0x050 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }
    for (int i = 0; i < 3; i++) {
        inst_in.write(make_inst(i+1, MEM_OP_READ, NOP, 0x050 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    // Test 6: both banks write to the same port (tests merging) (multiple instructions)
    cout << "\nTest 6: both banks output to west (tests merging) (4 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(600 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_READ, MEM_OP_READ, 0x060 + i, 0, 0, i % NUM_OF_STREAMS, (i+1) % NUM_OF_STREAMS));
    }

    // Test 7: both banks output to east (multiple instructions)
    cout << "\nTest 7: both banks output to east (4 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(700 + i*WORD_BYTES));
        east_in.write(din);
        inst_in.write(make_inst(0, MEM_OP_READ, MEM_OP_READ, 0x070 + i, 1, 1, i % NUM_OF_STREAMS, (i+1) % NUM_OF_STREAMS));
    }

    // Test 8: consecutive dskew delays (multiple instructions)
    cout << "\nTest 8: consecutive dskew delays (6 instructions)" << endl;
    for (int i = 0; i < 3; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(800 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst(3, MEM_OP_WRITE, NOP, 0x080 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }
    for (int i = 0; i < 3; i++) {
        inst_in.write(make_inst(3, MEM_OP_READ, NOP, 0x080 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    // Test 9: mixed dskew delays (multiple instructions)
    cout << "\nTest 9: mixed dskew delays (8 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din = 0;
        set_stream_slice(din, i % NUM_OF_STREAMS, make_data(900 + i*WORD_BYTES));
        west_in.write(din);
        inst_in.write(make_inst((i%3)+1, MEM_OP_WRITE, NOP, 0x090 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }
    for (int i = 0; i < 4; i++) {
        inst_in.write(make_inst((i%3)+1, MEM_OP_READ, NOP, 0x090 + i, 0, 0, i % NUM_OF_STREAMS, 0));
    }

    // Test 10: complex mixed operations (multiple instructions)
    cout << "\nTest 10: complex mixed operations (8 instructions)" << endl;
    for (int i = 0; i < 4; i++) {
        data_full_t din_west = 0, din_east = 0;
        set_stream_slice(din_west, i % NUM_OF_STREAMS, make_data(1000 + i*WORD_BYTES));
        set_stream_slice(din_east, (i+1) % NUM_OF_STREAMS, make_data(1100 + i*WORD_BYTES));
        west_in.write(din_west);
        east_in.write(din_east);
        inst_in.write(make_inst(0, MEM_OP_WRITE, MEM_OP_READ, 0x0A0 + i, 0, 1, i % NUM_OF_STREAMS, (i+1) % NUM_OF_STREAMS));
    }
    for (int i = 0; i < 4; i++) {
        inst_in.write(make_inst(0, MEM_OP_READ, MEM_OP_WRITE, 0x0A0 + i, 1, 0, (i+1) % NUM_OF_STREAMS, i % NUM_OF_STREAMS));
    }

    // Run all tests
    int cycle = 0, read_count = 0;
    while (!inst_in.empty() || !west_out.empty() || !east_out.empty()) {
        mem_slice(inst_in, inst_out, west_in, west_out, east_in, east_out);
        cycle++;

        if (!west_out.empty()) {
            data_full_t dout_full = west_out.read();
            DataUnit dout = get_stream_slice(dout_full, read_count % NUM_OF_STREAMS);
            read_count++;
            cout << "Cycle " << cycle << " [WEST]: ";
            print_data(dout, "Output");
        }

        if (!east_out.empty()) {
            data_full_t dout_full = east_out.read();
            DataUnit dout = get_stream_slice(dout_full, read_count % NUM_OF_STREAMS);
            read_count++;
            cout << "Cycle " << cycle << " [EAST]: ";
            print_data(dout, "Output");
        }

        while (!inst_out.empty()) inst_out.read();
    }

    cout << "\n=== Tests complete ===" << endl;
    cout << "Total cycles: " << cycle << endl;
    cout << "Total reads: " << read_count << endl;
    cout << "Estimated II = " << (double)cycle / read_count << endl;
    // Instructions issued by tests 1-10 (test 4 was missing from the original sum): 58
    cout << "Total instructions: " << (5+5 + 4 + 4 + 4 + 3+3 + 4 + 4 + 3+3 + 4+4 + 4+4) << endl;

    return 0;
}
```

</details>

## Analysis

### Common Data Types

| Type | Typical Bitwidth | Notes |
|-------------------|-----------------|-------|
| `char` | 8 bits | 1 byte |
| `short` / `short int` | 16 bits | 2 bytes |
| `int` / `signed int` | 32 bits | 4 bytes |
| `long long` / `long long int` | 64 bits | 8 bytes |

| Category | Header | Syntax Example | Description | Bitwidth Limit / Extension |
|----------|--------|----------------|-------------|-------------------------------|
| **Integer**<br>`ap_int<N>` / `ap_uint<N>` | `#include "ap_int.h"` | `ap_int<12> a;`<br>`ap_uint<8> b;` | Signed/unsigned integers with user-defined bitwidth, very resource-efficient | Default max `N = 1024`<br>Can be extended to `4096` by defining:<br>`#define AP_INT_MAX_W 4096` **before** `#include "ap_int.h"` |
| **Fixed-point**<br>`ap_fixed<T,I>` / `ap_ufixed<T,I>` | `#include "ap_fixed.h"` | `ap_fixed<16,4> fx;` | Customizable fixed-point numbers (`T` = total bits, `I` = integer bits), useful for precise fractional math | Default max `T = 1024`, can also be extended to `4096` with `AP_INT_MAX_W` |
| **Floating-point**<br>`float` / `double` | *(Standard C/C++)* | `float f;`<br>`double d;` | IEEE 754 floating-point numbers, easy to use but hardware-expensive | **Fixed bitwidth**:<br>• `float`: 32 bits → **1 sign + 8 exponent + 23 mantissa**<br>• `double`: 64 bits → **1 sign + 11 exponent + 52 mantissa**<br>Cannot be extended |
| **Bit / Logic**<br>`ap_bit<1>` / `ap_logic<1>` | `#include "ap_int.h"` | `ap_bit<1> x;` | Single-bit signals; `ap_logic` can also represent `X/Z` states | Only 1 bit |
| **Stream**<br>`hls::stream<T>` | `#include "hls_stream.h"` | `hls::stream<ap_uint<8>> s;` | FIFO-style streaming channel, commonly used for AXI-Stream interface | `T` is any supported type above |

### Struct vs Enum vs Union

| Feature | Struct | Union | Enum |
| ----------- | --------------------------- | ---------------------------------- | --------------------------------- |
| Purpose | Encapsulate multiple fields | Multiple views of the same data | Enumerate discrete states/options |
| Bitwidth | Sum of all members | Max of all members | Minimal bits to encode all values |
| Hardware | Multiple signals/registers | Shared register, one active member | Single small-width signal (FSM) |
| Typical Use | Instruction/data packets | Type punning, word/byte conversion | FSM states, control logic |

### Struct vs. Bit range

#### Version1

```cpp
struct Inst {
    ap_uint<4> dskew;
    ap_uint<2> opcode;
    ap_uint<13> addr;
    ap_uint<1> side;
    ap_uint<5> srcdst;
    ap_uint<5> reserved;
    ap_uint<2> icu;
};
```
* Every field becomes its own register or flip-flop.

#### Version2

```cpp
ap_uint<32> raw; // packed instruction word (the `raw` member of mem_inst_t)

inline ap_uint<MEM_ADDR_BITS> get_addr(const mem_inst_t &i) {
    return i.raw.range(18,6);
}
```
From a synthesis perspective:
* No extra registers are created.
* All bit extraction is done through pure wiring (no logic).
* In RTL, this simply becomes:

```verilog
assign addr = raw[18:6];
```

### hls::stream

| Method | Description | Example |
|-----------------|-----------------------------------------------------------------------------|---------|
| `write(T val)` | Writes an element into the FIFO (blocking by default) | `inst_in.write(inst);` |
| `read()` | Reads an element from the FIFO (blocking if empty) | `mem_inst_t inst = inst_in.read();` |
| `read_nb(T &val)` | Non-blocking read; returns false if FIFO is empty | `if (!inst_in.read_nb(inst)) { /* handle empty */ }` |
| `write_nb(T val)` | Non-blocking write; returns false if FIFO is full | `if (!inst_in.write_nb(inst)) { /* handle full */ }` |
| `empty()` | Checks whether the FIFO is empty | `if (!inst_in.empty()) { /* read data */ }` |
| `full()` | Checks whether the FIFO is full | `if (!inst_in.full()) { /* write data */ }` |
| `num_available()` | Returns the number of readable elements currently in the FIFO | `int n = inst_in.num_available();` |

### Return in HLS

```cpp
static bool skew_active = false;
static ap_uint<4> skew_cnt = 0;

if (!skew_active && dskew > 0) {
    skew_cnt = dskew;
    skew_active = true;
    return;
}
if (skew_active) {
    if (skew_cnt > 0) {
        skew_cnt--;
        return;
    } else {
        skew_active = false;
    }
}
```

Each early `return` ends the current invocation while the `static` variables keep their values, so the HLS code above maps to register updates along these lines in Verilog:

```verilog
if (!skew_active && dskew > 0) begin
    skew_cnt <= dskew;
    skew_active <= 1;
end else if (skew_active) begin
    if (skew_cnt > 0)
        skew_cnt <= skew_cnt - 1;
    else
        skew_active <= 0;
end
```
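To see how many invocations the early `return`s actually hold back, the counter logic can be modeled in plain C++. This is a sketch under stated assumptions: `SkewGate` and `stalled_calls` are hypothetical names, struct members replace the `static` variables, a masked `uint8_t` stands in for `ap_uint<4>`, and `dskew` is passed in unchanged on every call.

```cpp
#include <cstdint>

// Software model of the skew counter; members replace the static variables.
struct SkewGate {
    bool skew_active = false;
    std::uint8_t skew_cnt = 0;  // stand-in for ap_uint<4>

    // One call models one invocation of the top function.
    // Returns true when the call falls through to the real work,
    // false when it takes one of the early returns.
    bool step(unsigned dskew) {
        if (!skew_active && dskew > 0) {
            skew_cnt = static_cast<std::uint8_t>(dskew & 0xFu);  // load counter
            skew_active = true;
            return false;
        }
        if (skew_active) {
            if (skew_cnt > 0) {
                --skew_cnt;  // count down
                return false;
            }
            skew_active = false;  // counter expired, fall through
        }
        return true;
    }
};

// Count how many calls are held back before one proceeds, for a fixed dskew.
inline int stalled_calls(unsigned dskew) {
    SkewGate g;
    int stalls = 0;
    while (!g.step(dskew)) ++stalls;
    return stalls;
}
```

In this model a non-zero `dskew` holds back `dskew + 1` calls (one to load the counter, then `dskew` to count down), so `stalled_calls(3)` returns 4 and `stalled_calls(0)` returns 0. If exactly `dskew` stalled calls are wanted, the counter could be loaded with `dskew - 1` instead.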