# Cursor Implementation Log for Lab B

I optimized my Cursor rules:

1. Reduce the set to about 40 rules; with too many rules, Cursor fails to take them all in.
2. Use fewer words per rule.

Always define the block/port protocol explicitly; do not let Cursor choose. Always set rules to "Always Apply". Even with the same rules, Cursor seldom produces the same output twice.

## Cursor rule suggestions

* Keep it short and clear: one rule per line, no long sentences.
* Use action-oriented verbs: Always / Prefer / Avoid / Do not.
* Maintain consistency: no contradictory rules.
* Handle exceptions: specify exceptions in the same line.
* Focus on results: keep only rules that affect RTL / II / resources.
* Remove vagueness: avoid words like "maybe" or "consider"; give explicit directions.
* Make suggestions conditional: add conditions when suggesting alternatives.
  * Vague: `- Prefer shift/add over mul`
  * Clear: `- Prefer shift/add over mul when II=1 cannot be met`
* Control the count: keep fewer than 50 rules. (I'm still working on this.)

Using GPT, I converted the Vitis HLS User Guide (UG1399) into Cursor rules. It is also recommended to write prompts in rule format.

## FIR128

### First try without rules

I downloaded the project spec (fir.cpp, fir.h, and the testbench) from the lab website and put the files into my Cursor workspace. Then I gave Cursor the following prompt, with no rules enabled:

```
Develop the FIR using Vitis HLS based on the files in the fir128 folder. Create a baseline folder inside fir128 for this project. Directly edit the file in fir128, don't move source and test file to baseline folder. Run csim, synthesis, cosim. Also, add exit in tcl to preveng hang.
```

```c
void fir (
  data_t *y,
  data_t x
  ) {
  coef_t c[N] = {10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};

  static data_t shift_register[N];
  acc_t accumulator = 0;
  int i;

  // Shift register: make room for new sample
  for (i = N - 1; i > 0; i--) {
    shift_register[i] = shift_register[i - 1];
  }
  shift_register[0] = x;

  // Convolution: dot product of coefficients and samples
  for (i = 0; i < N; i++) {
    accumulator += c[i] * shift_register[i];
  }

  *y = (data_t)accumulator;
}
```

Cursor generated a FIR without any optimizations: the two loops were kept separate and run sequentially.

syn report

```
PS: '+' for module; 'o' for loop; '*' for dataflow
+---------------------------------+------+------+---------+-----------+----------+---------+------+----------+---------+---------+-----------+-----------+-----+
|             Modules             | Issue|      | Latency |  Latency  | Iteration|         | Trip |          |         |         |           |           |     |
|             & Loops             | Type | Slack| (cycles)|    (ns)   |  Latency | Interval| Count| Pipelined|   BRAM  |   DSP   |     FF    |    LUT    | URAM|
+---------------------------------+------+------+---------+-----------+----------+---------+------+----------+---------+---------+-----------+-----------+-----+
|+ fir                            |     -|  0.39|      267|  2.670e+03|         -|      268|     -|        no|   3 (1%)|  1 (~0%)|  400 (~0%)|  370 (~0%)|    -|
| + fir_Pipeline_VITIS_LOOP_27_1  |     -|  0.79|      129|  1.290e+03|         -|      129|     -|        no|        -|        -|   17 (~0%)|   62 (~0%)|    -|
|  o VITIS_LOOP_27_1              |     -|  7.30|      127|  1.270e+03|         2|        1|   127|       yes|        -|        -|          -|          -|    -|
| + fir_Pipeline_VITIS_LOOP_33_2  |     -|  0.39|      133|  1.330e+03|         -|      133|     -|        no|  1 (~0%)|  1 (~0%)|  375 (~0%)|  194 (~0%)|    -|
|  o VITIS_LOOP_33_2              |     -|  7.30|      131|  1.310e+03|         5|        1|   128|       yes|        -|        -|          -|          -|    -|
+---------------------------------+------+------+---------+-----------+----------+---------+------+----------+---------+---------+-----------+-----------+-----+
```

cosim report

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|            263|            263|            263|            264|            264|            264|                158399|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

Throughput: one output every 264 cycles (interval = 264).

Generally, we prefer to use more resources in exchange for lower II and higher throughput.

### First try with rules

I gave Cursor the same prompt:

```
Develop the FIR using Vitis HLS based on the files in the fir128 folder. Create a baseline folder inside fir128 for this project. Directly edit the file in fir128, don't move source and test file to baseline folder. Run csim, synthesis, cosim. Also, add exit in tcl to preveng hang.
```

Some conclusions about its thinking process:

* Recognized that achieving II=1 requires managing adder and multiplier latencies.
* Identified two options: an adder tree with full unroll (high resources) vs. a simple shift-and-sum (baseline). It chose the straightforward approach to meet the requirements.

I did not see much discussion of the detailed implementation.

```c
void fir (
  data_t *y,
  data_t x
  ) {
  // Coefficients (constant ROM)
  static const coef_t c[N] = {10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
#pragma HLS ARRAY_PARTITION variable=c complete dim=1

  // Shift register for past inputs
  static data_t shift_reg[N];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1

  // Initialize shift register on first call
  static bool initialized = false;
  if (!initialized) {
    init_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
      shift_reg[i] = 0;
    }
    initialized = true;
  }

  // Update shift register: shift right, insert new sample at index 0
  shift_loop:
  for (int i = N - 1; i > 0; i--) {
#pragma HLS UNROLL
    shift_reg[i] = shift_reg[i - 1];
  }
  shift_reg[0] = x;

  // Multiply-accumulate
  acc_t acc = 0;
  mac_loop:
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    acc += (acc_t)shift_reg[i] * (acc_t)c[i];
  }

  *y = (data_t)acc;
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|            132|            132|            132|            133|            133|            133|                 79799|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

Since Cursor chose a single-lane implementation, the II is 133, which is close to the tap count of 128. It's better, but still not optimized. Besides, the loop can be rewound, since it is a streaming loop with a fixed trip count. Next prompt:

```
use rewind
```

Some conclusions about its thinking process:

* FIR function is designed for single-sample inputs, but the testbench calls it repeatedly. Cursor recognizes the loop should use `#pragma HLS PIPELINE II=1 rewind` for streaming-style execution.
* User wants to keep the fir128 implementation without stream interfaces. Cursor explores combining shift and MAC into one loop to apply rewind, even though the loop runs N times per call.
* Replace two separate loops (shift + MAC) with one combined loop iterating N-1 → 0. Apply PIPELINE II=1 rewind so each call can restart without latency. Keep array partition pragmas.
* Maintain tabs for indentation. Initialize acc=0 before loop. Label the loop (comb_loop or mac_shift_loop) for clarity.
  Ensure types and initialization remain consistent.

```c
void fir (
  data_t *y,
  data_t x
  ) {
  // Coefficients (constant ROM)
  static const coef_t c[N] = {10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
#pragma HLS ARRAY_PARTITION variable=c complete dim=1

  // Shift register for past inputs (power-on zero)
  static data_t shift_reg[N] = {0};
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1

  // Combined shift + MAC with rewinded pipeline
  acc_t acc = 0;
  comb_loop:
  for (int i = N - 1; i >= 0; --i) {
#pragma HLS PIPELINE II=1 rewind
    data_t si = (i == 0) ? x : shift_reg[i - 1];
    shift_reg[i] = si;
    acc += (acc_t)si * (acc_t)c[i];
  }

  *y = (data_t)acc;
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|            130|            130|            130|            128|            128|            128|                 76802|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

With rewind, the baseline design reaches its throughput limit: the interval drops from 133 to 128 cycles, one cycle per tap on a single MAC lane.

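The fused `comb_loop` works because it walks from the oldest tap down to tap 0, so each `shift_reg[i - 1]` is read before it is overwritten. As a sanity check, here is a minimal plain C++ sketch (no HLS headers, no pragmas) that compares the fused loop with the two-loop form. The tap count `TAPS`, the shortened coefficient list, and the function names are assumptions made for illustration; the delay line is passed in explicitly instead of being `static` so both versions can run side by side.

```cpp
#include <cassert>

// Hypothetical shortened filter: the first 8 coefficients of the lab's table.
const int TAPS = 8;
typedef int coef_t;
typedef int data_t;
typedef int acc_t;

static const coef_t coef[TAPS] = {10, 11, 11, 8, 3, -3, -8, -11};

// Reference form: shift the delay line, then run a separate MAC loop.
data_t fir_two_loops(data_t sr[TAPS], data_t x) {
    for (int i = TAPS - 1; i > 0; i--) {
        sr[i] = sr[i - 1];
    }
    sr[0] = x;
    acc_t acc = 0;
    for (int i = 0; i < TAPS; i++) {
        acc += (acc_t)sr[i] * (acc_t)coef[i];
    }
    return (data_t)acc;
}

// Fused form: one pass from the oldest tap down to tap 0.
// sr[i - 1] still holds the previous sample when iteration i reads it.
data_t fir_fused(data_t sr[TAPS], data_t x) {
    acc_t acc = 0;
    for (int i = TAPS - 1; i >= 0; --i) {
        data_t si = (i == 0) ? x : sr[i - 1];
        sr[i] = si;
        acc += (acc_t)si * (acc_t)coef[i];
    }
    return (data_t)acc;
}

// Feed both versions the same samples and require identical outputs.
void check_fused_matches_reference(int samples) {
    data_t sr_ref[TAPS] = {0};
    data_t sr_fused[TAPS] = {0};
    for (int n = 0; n < samples; n++) {
        data_t x = (n * 37) % 101 - 50;  // arbitrary deterministic test input
        assert(fir_two_loops(sr_ref, x) == fir_fused(sr_fused, x));
    }
}
```

Calling `check_fused_matches_reference(200)` from a testbench `main` is enough to exercise it. This only checks functional equivalence in software; the II still has to be read from the synthesis and cosim reports.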
The rewind result above is a rare case where rewind can still be used with ap_ctrl_hs. Cursor normally does not apply rewind because the top-level protocol is ap_ctrl_hs, which conflicts with rewind: rewind assumes the pipeline never needs to be flushed. I tried to make Cursor implement the FIR with an AXIS top and loop rewind using only one lane, but have not succeeded yet. Cursor simply does not follow my rules and prompt; I am still looking for the reason.

Here are the updated rules:

```
- CRITICAL: Use #pragma HLS PIPELINE II=1 rewind on streaming or fixed-tripcount loops that run back-to-back
- Use plain PIPELINE II=1 only for one-shot transactional loops needing full reset
- Do not add rewind if loop has multiple exits or state must reset each run
```

### FIR128 with an AXIS top

I gave Cursor the following prompt:

```
- Develop the FIR using Vitis HLS with sources in fir128 folder
- Create a baseline folder inside fir128 for project outputs
- Directly edit files in fir128; do not move source or test files to baseline
- Use AXIS interfaces for data ports and ap_ctrl_none for top control
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

Cursor thinking:

* Plans to refactor FIR into a streaming design using hls::stream input/output. Will pipeline the inner loop (i loop) with II=1, not the sample loop. This breaks the old testbench, so it must be modified to read 600 samples from input.dat.
* Switches from Vivado HLS to Vitis HLS commands. Core commands (open_project, set_top, add_files) remain. Updates script.tcl with clock and project settings. Top function renamed to fir, with header and typedefs updated to ap_int.
* New signature: `void fir(hls::stream<data_t> &in_stream, hls::stream<data_t> &out_stream);`
* Includes hls_stream.h. Uses ap_ctrl_none for continuous operation. Keeps state across calls with static shift_reg[N] and reset logic.
* Implements a `while (!in_stream.empty())` loop. For each input, initializes acc=0, shifts register, and performs MAC.
  Uses `#pragma HLS PIPELINE II=1`. Prevents overflow by widening accumulator (acc_t). Applies `#pragma HLS ARRAY_PARTITION` for optimization.
* Declares const coef_t c[N] as static const (register storage). No need for BRAM binding. Uses LUT-friendly coefficients.
* Updates testbench to include hls_stream.h. Ensures correct file paths in TCL. Creates baseline directory inside fir128.
* To-do includes: update headers, implement FIR, modify testbench, update TCL script. Plans to use apply_patch for fir.h first.

```c
#ifndef FIR_H_
#define FIR_H_

#include <hls_stream.h>

const int N=128;

typedef int coef_t;
typedef int data_t;
typedef int acc_t;

// AXI-Stream streaming top with ap_ctrl_none
void fir (
  hls::stream<data_t> &s_in,
  hls::stream<data_t> &s_out
  );

#endif
```

```c
void fir (
  hls::stream<data_t> &s_in,
  hls::stream<data_t> &s_out
  ) {
#pragma HLS INTERFACE axis port=s_in
#pragma HLS INTERFACE axis port=s_out
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS STREAM variable=s_in depth=8
#pragma HLS STREAM variable=s_out depth=8

  // Coefficients (constant)
  static const coef_t c[N] = {10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 8, 3, -3, -8, -11, -11, -10, -10, -10, -10, -10, -10, -10, -10, -11, -11, -8, -3, 3, 8, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10};

  // Delay line for FIR shift register
  static data_t shift_reg[N];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1

#ifdef __SYNTHESIS__
  ReadWriteLoop:
  while (1) {
#pragma HLS PIPELINE II=1 rewind
    data_t x;
    bool has = s_in.read_nb(x);
    if (has) {
      // Shift register: move older samples
      for (int i = N-1; i > 0; i--) {
#pragma HLS UNROLL
        shift_reg[i] = shift_reg[i-1];
      }
      shift_reg[0] = x;

      // MAC
      acc_t acc = 0;
      for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        acc += (acc_t)shift_reg[i] * (acc_t)c[i];
      }
      data_t y = (data_t)acc;
      s_out.write(y);
    }
  }
#else
  if (!s_in.empty()) {
    data_t x = s_in.read();
    for (int i = N-1; i > 0; i--) {
#pragma HLS UNROLL
      shift_reg[i] = shift_reg[i-1];
    }
    shift_reg[0] = x;
    acc_t acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
      acc += (acc_t)shift_reg[i] * (acc_t)c[i];
    }
    data_t y = (data_t)acc;
    s_out.write(y);
  }
#endif
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|             21|             21|             22|              1|              1|              2|                   621|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
* Summary:
+ Performance & Resource Estimates:
PS: '+' for module; 'o' for loop; '*' for dataflow
+------------------+------+------+---------+--------+----------+---------+------+----------+------+---------+-------------+-------------+-----+
|      Modules     | Issue|      | Latency | Latency| Iteration|         | Trip |          |      |         |             |             |     |
|      & Loops     | Type | Slack| (cycles)|  (ns)  |  Latency | Interval| Count| Pipelined| BRAM |   DSP   |      FF     |     LUT     | URAM|
+------------------+------+------+---------+--------+----------+---------+------+----------+------+---------+-------------+-------------+-----+
|+ fir             |     -|  0.38|        -|       -|         -|        -|     -|        no|     -|  1 (~0%)|  15203 (14%)|  10140 (19%)|    -|
| o ReadWriteLoop  |     -|  7.30|        -|       -|        21|        1|     -|       yes|     -|        -|            -|            -|    -|
+------------------+------+------+---------+--------+----------+---------+------+----------+------+---------+-------------+-------------+-----+

+-----------------+---------+-----+--------+-------+-----+
|       Name      | BRAM_18K| DSP |   FF   |  LUT  | URAM|
+-----------------+---------+-----+--------+-------+-----+
|DSP              |        -|    -|       -|      -|    -|
|Expression       |        -|    -|       0|   4121|    -|
|FIFO             |        -|    -|       -|      -|    -|
|Instance         |        -|    1|     165|     50|    -|
|Memory           |        -|    -|       -|      -|    -|
|Multiplexer      |        -|    -|       -|   4267|    -|
|Register         |        -|    -|   15038|   1696|    -|
+-----------------+---------+-----+--------+-------+-----+
|Total            |        0|    1|   15203|  10134|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        0|   ~0|      14|     19|    0|
+-----------------+---------+-----+--------+-------+-----+
```

Note that this time Cursor added rewind to the loop. It used only 1 DSP, which is remarkable, and still achieved an interval of 1 cycle per sample. Even at about 10K LUTs and 15K FFs (which is not much), this result is better. I'm still trying to figure out why only one DSP is needed. A likely explanation is that the coefficients are small compile-time constants (magnitudes 3, 8, 10, 11), so after full unrolling the tool can build each constant multiply from shifts and adds in LUTs; the 4121 Expression LUTs are consistent with that, but I have not confirmed it. Or maybe vitis_hls is fooling me.

Here are the updated rules:

```
- use while(1) with PIPELINE II=1 rewind for streaming kernels that process hls::stream input/output continuously
- use if/for based finite loops only when the loop bound is known and fixed (e.g., matrix multiply, FFT butterflies)
- use __SYNTHESIS__ guard: in synthesis path keep while(1), in simulation path replace with if (!stream.empty()) or finite loop to avoid infinite csim
```

## Cordic

### Cordic without rules

I will focus on II and resources first, and then on precision.

I gave Cursor the following prompt:

```
- Develop the cordic using Vitis HLS with sources in cordic folder
- Create a baseline folder inside cordic for project outputs
- Directly edit files in cordic, do not move source or test files to baseline
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

Cursor thinking:

* Vectoring equations confirmed:
  * `X_{i+1} = X_i - d_i * Y_i * 2^{-i}`
  * `Y_{i+1} = Y_i + d_i * X_i * 2^{-i}`
* Gain compensation handled by multiplying x with K_inv. Algorithm matches expectations for vectoring.
* Re-checked Kvalues: they're based on sqrt(1 + 2^{-2i}), not just 2^{-i}. Must ensure precomputed array is correct. Using floats is acceptable for CORDIC in HLS.

After csim, an error occurred:

* Current vectoring mode gives wrong theta in quadrants II and III. Rotation only considers the sign of y, but must also account for x < 0.
* Correct vectoring requires converging to [-π, π). Add pre-rotation: if x < 0, flip signs of x, y and adjust theta by ±π depending on y. This ensures full range coverage like atan2.
* Standard equations for (x_new, y_new, z_new) still apply, but rotation direction must follow quadrant logic. Dynamic gain from Kvalues must be handled properly.
* Avoid computing the gain at runtime (too heavy). Precompute as constant or table. Use float data_t, keep Kvalues and angle tables at file scope. Avoid math.h; replace M_PI with explicit constant.
* Code fix plan:
  * Introduce theta_offset from pre-rotation.
  * Apply patch with changes (no baseline copy needed).
  * Update TCL to use absolute output directories.
  * Rerun csim to confirm corrections, especially for tests 3 & 4 (quadrant errors).

```c
#ifndef CORDICCART2POL_H
#define CORDICCART2POL_H

#define NO_ITER 16

typedef int coef_t;
typedef float data_t;
typedef float acc_t;

void cordiccart2pol(data_t x, data_t y, data_t * r, data_t * theta);

#endif
```

```c
void cordiccart2pol(data_t x, data_t y, data_t * r, data_t * theta)
{
  data_t current_x = x;
  data_t current_y = y;
  data_t current_theta = 0.0f;

  // Quadrant correction to emulate atan2: pre-rotate by +/-pi if x < 0
  data_t theta_offset = 0.0f;
  if (current_x < 0.0f) {
    if (current_y >= 0.0f) {
      theta_offset = 3.14159265358979323846f;
    } else {
      theta_offset = -3.14159265358979323846f;
    }
    current_x = -current_x;
    current_y = -current_y;
  }

  // Vectoring mode CORDIC iterations
  for (int i = 0; i < NO_ITER; i++) {
    data_t pow2neg = Kvalues[i];
    data_t x_shift = current_y * pow2neg;
    data_t y_shift = current_x * pow2neg;

    if (current_y > 0) {
      current_x = current_x + x_shift;
      current_y = current_y - y_shift;
      current_theta = current_theta + angles[i];
    } else {
      current_x = current_x - x_shift;
      current_y = current_y + y_shift;
      current_theta = current_theta - angles[i];
    }
  }

  // Compute the CORDIC gain compensation
  data_t gain = 1.0f;
  for (int i = 0; i < NO_ITER; i++) {
    data_t v = Kvalues[i];
    gain = gain * (1.0f / sqrtf(1.0f + v * v));
  }

  *r = current_x * gain;
  *theta = current_theta + theta_offset;
}
```

This is Cursor's first output. It seems that without rules or hints in the prompt, Cursor won't add any pragmas and will only implement the base algorithm.

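Cursor's plan said to avoid computing the gain at runtime, yet this first output still runs a `sqrtf` loop on every call. The sketch below shows one way the constant could be produced offline and pasted into the kernel. It assumes `Kvalues[i]` holds 2^-i (which is how the kernel uses it); `cordic_inverse_gain`, `K_INV_16`, `compensate_gain`, and `check_gain_constant` are hypothetical names, not part of the lab sources.

```cpp
#include <cassert>
#include <cmath>

const int NO_ITER = 16;  // same iteration count as the header above

// Host-side helper (not meant for synthesis): product of 1/sqrt(1 + 2^-2i).
// Assumes the per-iteration factor is 2^-i.
double cordic_inverse_gain(int iterations) {
    double k = 1.0;
    for (int i = 0; i < iterations; i++) {
        double p = std::ldexp(1.0, -i);  // 2^-i
        k /= std::sqrt(1.0 + p * p);
    }
    return k;
}

// Hypothetical constant pasted from the helper's result for 16 iterations.
static const float K_INV_16 = 0.6072529f;

// What the end of the kernel reduces to: one multiply, no sqrtf loop.
float compensate_gain(float x_after_iterations) {
    return x_after_iterations * K_INV_16;
}

// Testbench-side guard: the pasted constant must track the computed value.
void check_gain_constant() {
    assert(std::fabs((double)K_INV_16 - cordic_inverse_gain(NO_ITER)) < 1e-6);
}
```

Keeping `check_gain_constant()` in the testbench means a later change to `NO_ITER` is caught in csim instead of silently scaling `r` by the wrong factor.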
```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|            261|            261|            261|            262|            262|            262|                  1047|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
* Summary:
+-----------------+---------+-----+--------+-------+-----+
|       Name      | BRAM_18K| DSP |   FF   |  LUT  | URAM|
+-----------------+---------+-----+--------+-------+-----+
|DSP              |        -|    -|       -|      -|    -|
|Expression       |        -|    -|       0|    253|    -|
|FIFO             |        -|    -|       -|      -|    -|
|Instance         |        0|   10|    1396|   2108|    -|
|Memory           |        0|    -|      32|      8|    -|
|Multiplexer      |        -|    -|       -|    281|    -|
|Register         |        -|    -|     126|      -|    -|
+-----------------+---------+-----+--------+-------+-----+
|Total            |        0|   10|    1554|   2650|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        0|    4|       1|      4|    0|
+-----------------+---------+-----+--------+-------+-----+
```

First, multiplication by a fixed constant should use shift-add rather than a multiplier, since it saves resources and improves timing. Second, ap_fixed or ap_int<> is a better choice than float. Also, calling cmath functions (here sqrtf) inside an HLS core is not recommended.

### Cordic with rules

Same prompt, but with the Cursor rules enabled:

```
- Develop the cordic using Vitis HLS with sources in cordic folder
- Create a baseline folder inside cordic for project outputs
- Directly edit files in cordic, do not move source or test files to baseline
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

Record of Cursor's thinking process:

* Plan to normalize with z_n = fmod(theta + π, 2π) → result in (-π, π]. If z_n > π/2, subtract π and flip cosine/sine signs; if z_n < -π/2, add π. This prepares the angle for the CORDIC iterations.
* Replace while-loop with for-loop (0–3) for bounded normalization. Apply: `#pragma HLS LOOP_TRIPCOUNT min=0 max=4`. Ensure compatibility with pipelining and fixed-point shifts.
* Ensure K_CONST and tan table use correct angle_t and trig_t types. Keep constants at file scope as static for reuse.
* Current CORDIC output shows cosine/sine errors because input angles are not normalized. The algorithm only converges reliably within about ±99° (≈ ±1.74 rad), so angles should be reduced to [-π/2, π/2] first.
* Define static constants (TWO_PI, PI, HALF_PI) as ap_fixed or double precomputed values. Normalize angles by bounded subtraction/addition of TWO_PI. Use a small for-loop with `#pragma HLS LOOP_TRIPCOUNT` for synthesizable normalization.
* Avoid unbounded while-loops; use bounded loops or conditional adjustments. Ensure angle_t width is sufficient for normalization. Add required HLS pragmas for pipelining and streaming loops.
* With 16 iterations, expect ~1e-5 accuracy (threshold 1e-3 is already satisfied). Can increase to 20 iterations if higher precision is needed.

After csim, one error remained:

* Angle normalization is missing; the single flip_sign logic flips both cos/sin and breaks quadrants II/III.
* Map θ → [0, 2π) via a bounded loop (for up to 4; `#pragma HLS LOOP_TRIPCOUNT min=0 max=4`), using adds/subs of TWO_PI.
* Keep trig_t/angle_t as ap_fixed. Define static const angle_t PI, HALF_PI, TWO_PI. For sign multiplication, use a temp wider variable (then cast back to trig_t) to avoid narrowing.
* Main CORDIC iteration loop: `#pragma HLS PIPELINE II=1` (add rewind if streaming). Keep the normalization loop bounded and synthesizable.

The thinking process is quite long because this algorithm is more complex, and addressing precision issues is not straightforward.

```c
#ifndef CORDIC_H
#define CORDIC_H

#include <ap_fixed.h>

// Fixed-point types for angles and outputs
typedef ap_fixed<24, 4> angle_t; // range ~[-8, 8), ~20 fractional bits
typedef ap_fixed<24, 2> trig_t;  // range ~[-2, 2), ~22 fractional bits

// Top-level CORDIC function: computes cos and sin for input angle (radians)
void cordic(angle_t theta, trig_t &cos_out, trig_t &sin_out);

#endif // CORDIC_H
```

```c
#include "cordic.h"

// Number of CORDIC iterations controls precision
static const int NUM_ITER = 16;

// Precomputed arctangent values in radians for i = 0..NUM_ITER-1
static const angle_t atan_table[NUM_ITER] = {
    angle_t(0.7853981633974483), // atan(2^-0)
    angle_t(0.4636476090008061), // atan(2^-1)
    angle_t(0.24497866312686414), // atan(2^-2)
    angle_t(0.12435499454676144), // atan(2^-3)
    angle_t(0.06241880999595735), // atan(2^-4)
    angle_t(0.031239833430268277),
    angle_t(0.015623728620476831),
    angle_t(0.007812341060101111),
    angle_t(0.0039062301319669718),
    angle_t(0.0019531225164788188),
    angle_t(0.0009765621895593195),
    angle_t(0.0004882812111948983),
    angle_t(0.00024414062014936177),
    angle_t(0.00012207031189367021),
    angle_t(0.00006103515617420877),
    angle_t(0.000030517578115526096)
};

// Gain compensation constant for NUM_ITER iterations in rotation mode
static const trig_t K_CONST = trig_t(0.6072529350088812561694);

void cordic(angle_t theta, trig_t &cos_out, trig_t &sin_out)
{
#pragma HLS INLINE off
    // Internal accumulators use a slightly wider type to avoid growth
    typedef ap_fixed<26, 3> acc_t;

    // Normalize angle to [0, 2*pi) and map to [0, pi/2] with quadrant signs
    static const angle_t PI = angle_t(3.14159265358979323846);
    static const angle_t TWO_PI = angle_t(6.28318530717958647692);
    static const angle_t HALF_PI = angle_t(1.57079632679489661923);

    angle_t t = theta;
#pragma HLS LOOP_TRIPCOUNT min=0 max=4
    for (int k = 0; k < 4; k++) {
        if (t < angle_t(0)) {
            t = t + TWO_PI;
        } else if (t >= TWO_PI) {
            t = t - TWO_PI;
        } else {
            break;
        }
    }

    angle_t z;
    trig_t cos_sign = trig_t(1);
    trig_t sin_sign = trig_t(1);
    if (t <= HALF_PI) {
        z = t;
    } else if (t <= PI) {
        z = PI - t;
        cos_sign = trig_t(-1);
    } else if (t <= (PI + HALF_PI)) {
        z = t - PI;
        cos_sign = trig_t(-1);
        sin_sign = trig_t(-1);
    } else {
        z = TWO_PI - t;
        sin_sign = trig_t(-1);
    }

    acc_t x = acc_t(K_CONST);
    acc_t y = acc_t(0);

    CORDIC_LOOP:
    for (int i = 0; i < NUM_ITER; i++) {
#pragma HLS PIPELINE II=1
        // Select rotation direction based on sign of z
        bool z_nonneg = (z >= 0);
        acc_t x_shift = x >> i;
        acc_t y_shift = y >> i;
        if (z_nonneg) {
            // Rotate by -atan(2^-i)
            acc_t x_next = x - y_shift;
            acc_t y_next = y + x_shift;
            x = x_next;
            y = y_next;
            z = z - atan_table[i];
        } else {
            // Rotate by +atan(2^-i)
            acc_t x_next = x + y_shift;
            acc_t y_next = y - x_shift;
            x = x_next;
            y = y_next;
            z = z + atan_table[i];
        }
    }

    trig_t c_tmp = trig_t(x) * cos_sign;
    trig_t s_tmp = trig_t(y) * sin_sign;
    cos_out = c_tmp;
    sin_out = s_tmp;
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|             21|             21|             23|             22|             22|             24|                  3263|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
* Summary:
+---------------------+---------+------+---------+---------+-----+
|         Name        | BRAM_18K|  DSP |    FF   |   LUT   | URAM|
+---------------------+---------+------+---------+---------+-----+
|DSP                  |        -|     -|        -|        -|    -|
|Expression           |        -|     -|        0|      347|    -|
|FIFO                 |        -|     -|        -|        -|    -|
|Instance             |        0|     -|      151|      702|    -|
|Memory               |        -|     -|        -|        -|    -|
|Multiplexer          |        -|     -|        -|       31|    -|
|Register             |        -|     -|      108|        -|    -|
+---------------------+---------+------+---------+---------+-----+
|Total                |        0|     0|      259|     1080|    0|
+---------------------+---------+------+---------+---------+-----+
|Available SLR        |     1440|  2280|   788160|   394080|  320|
+---------------------+---------+------+---------+---------+-----+
|Utilization SLR (%)  |        0|     0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
|Available            |     4320|  6840|  2364480|  1182240|  960|
+---------------------+---------+------+---------+---------+-----+
|Utilization (%)      |        0|     0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
```

The interval is 22 cycles (24 at worst) with lower resource usage, which is a clear improvement. Cursor used ap_fixed instead of float and replaced multiply-add with shift-add, so the design differs from the previous baseline. However, since NUM_ITER is set to 16, the best achievable interval for this iterative loop is theoretically 16 cycles.

First, the initial normalization loop is redundant: it only maps theta into [0, 2π), and four iterations are unnecessary. Second, the static array should be fully partitioned since its depth is only 16, though this may not improve throughput. Finally, after removing the first loop, the core loop can be rewound. Next prompt:

```
- Simplified angle wrap to single-step instead of loop.
- Partition the ROM
- rewind the loop
```

Cursor thinking:

* Key updates are: replace the angle-normalization loop with a single if/else step, partition the atan_table ROM for parallel access, and add `#pragma HLS PIPELINE II=1 rewind` to the main CORDIC loop. Following ugextract rules ensures proper pipelining.
* Input angles span -360° to 360°.
  Normalization can be done in one step: add 2π if the angle is below 0, subtract 2π if it is 2π or above. This removes redundant iterations and saves resources.
* atan_table can be partitioned either with `#pragma HLS ARRAY_PARTITION` inside the function or via set_directive_array_partition in TCL. Scope depends on placement: cordic/atan_table if local, atan_table if global.
* Use inline pragmas for ROM partitioning. Keep ap_fixed types and tab indentation. Remove the old loop but preserve quadrant mapping logic.
* Apply these edits in cordic.cpp, then re-run the full Vitis HLS flow (csim → csynth → cosim) to confirm correctness and performance.

```c
#include "cordic.h"

// Number of CORDIC iterations controls precision
static const int NUM_ITER = 16;

// Precomputed arctangent values in radians for i = 0..NUM_ITER-1
static const angle_t atan_table[NUM_ITER] = {
    angle_t(0.7853981633974483), // atan(2^-0)
    angle_t(0.4636476090008061), // atan(2^-1)
    angle_t(0.24497866312686414), // atan(2^-2)
    angle_t(0.12435499454676144), // atan(2^-3)
    angle_t(0.06241880999595735), // atan(2^-4)
    angle_t(0.031239833430268277),
    angle_t(0.015623728620476831),
    angle_t(0.007812341060101111),
    angle_t(0.0039062301319669718),
    angle_t(0.0019531225164788188),
    angle_t(0.0009765621895593195),
    angle_t(0.0004882812111948983),
    angle_t(0.00024414062014936177),
    angle_t(0.00012207031189367021),
    angle_t(0.00006103515617420877),
    angle_t(0.000030517578115526096)
};

// Gain compensation constant for NUM_ITER iterations in rotation mode
static const trig_t K_CONST = trig_t(0.6072529350088812561694);

void cordic(angle_t theta, trig_t &cos_out, trig_t &sin_out)
{
#pragma HLS INLINE off
    // Internal accumulators use a slightly wider type to avoid growth
    typedef ap_fixed<26, 3> acc_t;

    // Normalize angle to [0, 2*pi) and map to [0, pi/2] with quadrant signs
    static const angle_t PI = angle_t(3.14159265358979323846);
    static const angle_t TWO_PI = angle_t(6.28318530717958647692);
    static const angle_t HALF_PI = angle_t(1.57079632679489661923);

    angle_t t = theta;
    // Single-step wrap sufficient for |theta| <= 2*pi
    if (t < angle_t(0)) {
        t = t + TWO_PI;
    } else if (t >= TWO_PI) {
        t = t - TWO_PI;
    }

    angle_t z;
    trig_t cos_sign = trig_t(1);
    trig_t sin_sign = trig_t(1);
    if (t <= HALF_PI) {
        z = t;
    } else if (t <= PI) {
        z = PI - t;
        cos_sign = trig_t(-1);
    } else if (t <= (PI + HALF_PI)) {
        z = t - PI;
        cos_sign = trig_t(-1);
        sin_sign = trig_t(-1);
    } else {
        z = TWO_PI - t;
        sin_sign = trig_t(-1);
    }

    acc_t x = acc_t(K_CONST);
    acc_t y = acc_t(0);

    // Partition the ROM to maximize throughput
#pragma HLS ARRAY_PARTITION variable=atan_table complete dim=1

    CORDIC_LOOP:
    for (int i = 0; i < NUM_ITER; i++) {
#pragma HLS PIPELINE II=1 rewind
        // Select rotation direction based on sign of z
        bool z_nonneg = (z >= 0);
        acc_t x_shift = x >> i;
        acc_t y_shift = y >> i;
        if (z_nonneg) {
            // Rotate by -atan(2^-i)
            acc_t x_next = x - y_shift;
            acc_t y_next = y + x_shift;
            x = x_next;
            y = y_next;
            z = z - atan_table[i];
        } else {
            // Rotate by +atan(2^-i)
            acc_t x_next = x + y_shift;
            acc_t y_next = y - x_shift;
            x = x_next;
            y = y_next;
            z = z + atan_table[i];
        }
    }

    trig_t c_tmp = trig_t(x) * cos_sign;
    trig_t s_tmp = trig_t(y) * sin_sign;
    cos_out = c_tmp;
    sin_out = s_tmp;
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|          |          |             Latency(Clock Cycles)             |            Interval(Clock Cycles)             | Total Execution Time |
+   RTL    +  Status  +-----------------------------------------------+-----------------------------------------------+    (Clock Cycles)    +
|          |          |      min      |      avg      |      max      |      min      |      avg      |      max      |                      |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
|      VHDL|        NA|             NA|             NA|             NA|             NA|             NA|             NA|                    NA|
|   Verilog|      Pass|             16|             16|             16|             16|             16|             16|                  2320|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
* Summary:
+---------------------+---------+------+---------+---------+-----+
|         Name        | BRAM_18K|  DSP |    FF   |   LUT   | URAM|
+---------------------+---------+------+---------+---------+-----+
|DSP                  |        -|     -|        -|        -|    -|
|Expression           |        -|     -|        0|      906|    -|
|FIFO                 |        -|     -|        -|        -|    -|
|Instance             |        -|     -|        0|       65|    -|
|Memory               |        -|     -|        -|        -|    -|
|Multiplexer          |        -|     -|        -|      145|    -|
|Register             |        -|     -|      119|        -|    -|
+---------------------+---------+------+---------+---------+-----+
|Total                |        0|     0|      119|     1116|    0|
+---------------------+---------+------+---------+---------+-----+
|Available SLR        |     1440|  2280|   788160|   394080|  320|
+---------------------+---------+------+---------+---------+-----+
|Utilization SLR (%)  |        0|     0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
|Available            |     4320|  6840|  2364480|  1182240|  960|
+---------------------+---------+------+---------+---------+-----+
|Utilization (%)      |        0|     0|       ~0|       ~0|    0|
+---------------------+---------+------+---------+---------+-----+
```

The interval improved from 22 to 16 cycles with almost the same resources. I need to add a rule about removing unnecessary loops. Rules about array partitioning are already in Cursor's rules, but Cursor may have missed them since I placed them at the end. As for the rewind issue, Cursor usually won't add rewind to this kind of structure, since the top-level protocol uses ap_ctrl_hs.

```
- Do not use for-loops to describe simple conditional logic
- Prefer plain combinational expressions over loops when no iteration is required
- Avoid complex or nested for-loops unless necessary; refactor into separate functions or streams
- Do not remove for-loops that describe core pipelines; optimize them with PIPELINE/UNROLL/rewind instead
```

I may fine-tune these rules after several tries.
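The "do not use for-loops to describe simple conditional logic" rule can be illustrated with a small software-only sketch. Everything here is an assumption made for illustration: `double` stands in for the `ap_fixed` angle type, `wrap_loop` is a hypothetical reconstruction of a loop-style normalization (the removed loop itself is not shown in these notes), and `wrap_single` mirrors the single-step wrap Cursor generated. The check sweeps angles strictly inside ±360°, which is where a single step is sufficient.

```cpp
#include <cassert>
#include <cmath>

// Plain double stands in for the ap_fixed angle type (assumption).
static const double TWO_PI = 6.28318530717958647692;

// Hypothetical loop-style wrap: iterates even though one step is enough
// for inputs limited to |theta| <= 2*pi.
double wrap_loop(double t) {
    for (int i = 0; i < 4; i++) {
        if (t < 0.0) {
            t += TWO_PI;
        } else if (t >= TWO_PI) {
            t -= TWO_PI;
        }
    }
    return t;
}

// Single-step wrap: plain conditional logic, no loop to schedule.
double wrap_single(double t) {
    if (t < 0.0) {
        t += TWO_PI;
    } else if (t >= TWO_PI) {
        t -= TWO_PI;
    }
    return t;
}

// Returns true when both versions agree on a sweep of -350..350 degrees
// and every wrapped angle lands in [0, 2*pi).
bool wraps_agree() {
    for (int deg = -350; deg <= 350; deg += 10) {
        const double theta = deg * TWO_PI / 360.0;
        const double a = wrap_loop(theta);
        const double b = wrap_single(theta);
        if (std::fabs(a - b) > 1e-12) return false;
        if (b < 0.0 || b >= TWO_PI) return false;
    }
    return true;
}
```

Calling `wraps_agree()` from a small test harness should return `true`. In hardware the benefit is that the single-step form is pure combinational logic in front of the CORDIC loop, instead of an extra sequential loop.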
I found that Cursor always uses ap_ctrl_hs (the default block protocol) if the prompt does not specify otherwise, so it is important to specify the top-level protocol.

I found that 20–40 rules is the most efficient range. When I used more than 40 rules, Cursor sometimes missed some of them. It is important to eliminate rules that do not affect Cursor's behavior.

### Cordic with AXIS top

Always define the block/port protocol; do not let Cursor choose.

Prompt:

```
- Develop the cordiccart2pol using Vitis HLS with sources in cordic folder
- Create a baseline folder inside cordiccart2pol for project outputs
- Directly edit files in cordiccart2pol , do not create another source or test files
- Use ap_ctrl_none and axi-stream
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

Cursor thinking:

* Implementing the cordiccart2pol top function using AXIS streaming interfaces with ap_ctrl_none. The project structure is organized under cordiccart2pol/baseline, with script.tcl handling setup, sources, testbench, clock, and simulations.
* The algorithm uses fixed-point arithmetic with iterative rotations, precomputed angles, and a constant K_inv to calculate r and θ.
* Input and output are managed through streams, with data packed into ap_uint and unpacked using .range for safe casting.
* The testbench is refactored to work with streams, verify results against a golden model, and compute RMSE.
* The design ensures continuous streaming in hardware but exits cleanly in simulation.
* Added ap_ctrl_none + AXIS with depth=1. Handled ap_fixed casting and set K_INV=1.646760. Testbench expects prototype cordiccart2pol(...).
* Shifted to stream I/O, updated testbench with hls_stream/axis64_t. Adjusted script.tcl with baseline, set_top, and exit.
* Loop with empty() needs preload. In hardware, ap_ctrl_none runs continuously; in C-sim, must exit after inputs. Test pushes 4 (x,y) pairs, computes golden outputs, and checks results.
``` #include "cordiccart2pol.h" // Precomputed atan(2^-i) angles for i = 0..15 static const data_t atan_table[NO_ITER] = { (data_t)0.785398163397448, (data_t)0.463647609000806, (data_t)0.244978663126864, (data_t)0.124354994546761, (data_t)0.0624188099959574, (data_t)0.0312398334302683, (data_t)0.0156237286204768, (data_t)0.00781234106010111, (data_t)0.00390623013196697, (data_t)0.00195312251647882, (data_t)0.000976562189559320, (data_t)0.000488281211194898, (data_t)0.000244140620149362, (data_t)0.000122070311893670, (data_t)0.0000610351561742088, (data_t)0.0000305175781155261 }; // Inverse of the CORDIC gain for NO_ITER=16 (1/K) static const data_t K_INV = (data_t)0.6072529350088814; void cordiccart2pol(hls::stream<axis64_t> &in_stream, hls::stream<axis64_t> &out_stream) { #pragma HLS INTERFACE axis port=in_stream depth=1 #pragma HLS INTERFACE axis port=out_stream depth=1 #pragma HLS INTERFACE ap_ctrl_none port=return // Process a full AXI4-Stream frame until TLAST #pragma HLS PIPELINE II=1 axis64_t in_word = in_stream.read(); ap_uint<64> in_bits = in_word.data; ap_uint<32> x_bits = in_bits.range(63, 32); ap_uint<32> y_bits = in_bits.range(31, 0); data_t x; data_t y; x.range(31, 0) = x_bits; y.range(31, 0) = y_bits; data_t xr = x; data_t yr = y; data_t zr = (data_t)0; // Pre-rotate 180 degrees when x is negative to keep radius positive if (xr < (data_t)0) { bool y_nonneg = (yr >= (data_t)0); xr = -xr; yr = -yr; zr = zr + (data_t)(y_nonneg ? 
3.14159265358979323846 : -3.14159265358979323846); } // Vectoring CORDIC iterations to drive y toward zero for (int i = 0; i < NO_ITER; i++) { #pragma HLS UNROLL factor=1 data_t x_shift = xr >> i; data_t y_shift = yr >> i; if (yr >= (data_t)0) { xr = xr + y_shift; yr = yr - x_shift; zr = zr + atan_table[i]; } else { xr = xr - y_shift; yr = yr + x_shift; zr = zr - atan_table[i]; } } data_t r = xr * K_INV; data_t theta = zr; ap_uint<32> r_bits = r.range(31, 0); ap_uint<32> theta_bits = theta.range(31, 0); ap_uint<64> out_bits = 0; out_bits.range(31, 0) = r_bits; out_bits.range(63, 32) = theta_bits; axis64_t out_word; out_word.data = out_bits; out_word.keep = (ap_uint<8>)-1; out_word.strb = (ap_uint<8>)0; out_word.user = 0; out_word.last = in_word.last; out_word.id = 0; out_word.dest = 0; out_stream.write(out_word); } ``` ``` +----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+ | | | Latency(Clock Cycles) | Interval(Clock Cycles) | Total Execution Time | + RTL + Status +-----------------------------------------------+-----------------------------------------------+ (Clock Cycles) + | | | min | avg | max | min | avg | max | | +----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+ | VHDL| NA| NA| NA| NA| NA| NA| NA| NA| | Verilog| Pass| 12| 12| 13| 1| 1| 2| 1040| +----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+ ``` ``` * Summary: +-----------------+---------+-----+--------+-------+-----+ | Name | BRAM_18K| DSP | FF | LUT | URAM| +-----------------+---------+-----+--------+-------+-----+ |DSP | -| -| -| -| -| |Expression | -| -| 0| 4543| -| |FIFO | -| -| -| -| -| |Instance | -| 3| 165| 50| -| |Memory | -| -| -| -| -| |Multiplexer | -| -| -| 18| -| |Register | -| -| 1765| 192| -| 
+-----------------+---------+-----+--------+-------+-----+
|Total            |        0|    3|    1930|   4803|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        0|    1|       1|      9|    0|
+-----------------+---------+-----+--------+-------+-----+
```

Although it uses more resources, Cursor reduced the interval to 1. The Cursor rules always make Cursor unroll loops when the design is in dataflow. Here Cursor chose to unroll the loop instead of pipelining it, which pushed the II down to 1.

Rule update:

```
- use shift for power-of-two scaling instead of multiplication or division; replace x*2^k or x/2^k with x << k or x >> k to reduce operator latency and resource usage.
```

## DFT1024

### DFT with no rules

Prompt:

```
- Develop the dft using Vitis HLS with sources in dft folder
- Create a baseline folder inside dft for project outputs
- Directly edit files in dft , do not create another source or test files
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

```
#include "dft.h"
#include "coefficients256.h"

void dft(DTYPE In_R[SIZE], DTYPE In_I[SIZE]) {
    DTYPE out_r[SIZE];
    DTYPE out_i[SIZE];

    for (int k = 0; k < SIZE; k++) {
        DTYPE sum_r = 0;
        DTYPE sum_i = 0;
        for (int n = 0; n < SIZE; n++) {
            int idx = (k * n) % SIZE;
            DTYPE c = cos_coefficients_table[idx];
            DTYPE s = sin_coefficients_table[idx];
            DTYPE xr = In_R[n];
            DTYPE xi = In_I[n];
            sum_r += xr * c - xi * s;
            sum_i += xr * s + xi * c;
        }
        out_r[k] = sum_r;
        out_i[k] = sum_i;
    }

    for (int k = 0; k < SIZE; k++) {
        In_R[k] = out_r[k];
        In_I[k] = out_i[k];
    }
}
```

```
+ Performance & Resource Estimates:
PS: '+' for module; 'o' for loop; '*' for dataflow
+------------------------------------------------+------+------+---------+-----------+----------+---------+-------+----------+---------+--------+-----------+-----------+-----+
| Modules | Issue| | Latency | Latency | Iteration| | Trip | | | 
| | | |
| & Loops | Type | Slack| (cycles)| (ns) | Latency | Interval| Count | Pipelined| BRAM | DSP | FF | LUT | URAM|
+------------------------------------------------+------+------+---------+-----------+----------+---------+-------+----------+---------+--------+-----------+-----------+-----+
|+ dft | -| 0.04| 393493| 3.935e+06| -| 393494| -| no| 4 (1%)| 5 (2%)| 1229 (1%)| 1544 (2%)| -|
| + dft_Pipeline_VITIS_LOOP_9_1_VITIS_LOOP_13_2 | -| 0.04| 393232| 3.932e+06| -| 393232| -| no| 2 (~0%)| 5 (2%)| 1202 (1%)| 1308 (2%)| -|
| o VITIS_LOOP_9_1_VITIS_LOOP_13_2 | II| 7.30| 393230| 3.932e+06| 21| 6| 65536| yes| -| -| -| -| -|
| + dft_Pipeline_VITIS_LOOP_27_3 | -| 0.79| 258| 2.580e+03| -| 258| -| no| -| -| 21 (~0%)| 63 (~0%)| -|
| o VITIS_LOOP_27_3 | -| 7.30| 256| 2.560e+03| 2| 1| 256| yes| -| -| -| -| -|
+------------------------------------------------+------+------+---------+-----------+----------+---------+-------+----------+---------+--------+-----------+-----------+-----+
```

Cosim takes too long to run.
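As a rough sanity check on the report above, the latency of a pipelined loop is commonly estimated as (trip count − 1) × II + iteration latency. This is the usual textbook estimate, not a formula quoted from the tool. Plugging in the reported trip count 65536, II = 6, and iteration latency 21 lands within one cycle of the reported 393230:

$$
L \approx (N_{\text{trip}} - 1)\cdot II + L_{\text{iter}} = 65535 \cdot 6 + 21 = 393231
$$

With the same estimate, II = 1 would give about 65535 + 21 = 65556 cycles, so the II of 6 accounts for nearly all of the latency.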
### DFT with rules

Prompt:

```
- Develop the dft using Vitis HLS with sources in dft folder
- Create a baseline folder inside dft for project outputs
- Directly edit files in dft , do not create another source or test files
- Use ap_ctrl_hs and axi-master
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

```
#include "dft.h"
#include "coefficients256.h"

// Top-level DFT kernel
void dft(DTYPE In_R[SIZE], DTYPE In_I[SIZE]) {
#pragma HLS INTERFACE m_axi port=In_R offset=slave bundle=gmem depth=256
#pragma HLS INTERFACE m_axi port=In_I offset=slave bundle=gmem depth=256
#pragma HLS INTERFACE s_axilite port=In_R bundle=control
#pragma HLS INTERFACE s_axilite port=In_I bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    // Local copies to avoid in-place overwrite corrupting reads
    DTYPE in_r_local[SIZE];
    DTYPE in_i_local[SIZE];
    DTYPE out_r_local[SIZE];
    DTYPE out_i_local[SIZE];

    // Read inputs
    for (int i = 0; i < SIZE; i++) {
#pragma HLS PIPELINE II=1
        in_r_local[i] = In_R[i];
        in_i_local[i] = In_I[i];
    }

    // Compute DFT
    for (int k = 0; k < SIZE; k++) {
        DTYPE sum_real = 0;
        DTYPE sum_imag = 0;
        for (int n = 0; n < SIZE; n++) {
#pragma HLS PIPELINE II=1
            int idx = (n * k) & (SIZE - 1);
            DTYPE c = cos_coefficients_table[idx];
            DTYPE s = sin_coefficients_table[idx];
            DTYPE xr = in_r_local[n];
            DTYPE xi = in_i_local[n];
            sum_real += xr * c - xi * s;
            sum_imag += xi * c + xr * s;
        }
        out_r_local[k] = sum_real;
        out_i_local[k] = sum_imag;
    }

    // Write outputs
    for (int k = 0; k < SIZE; k++) {
#pragma HLS PIPELINE II=1
        In_R[k] = out_r_local[k];
        In_I[k] = out_i_local[k];
    }
}
```

```
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
| | | Latency(Clock Cycles) | Interval(Clock Cycles) | Total Execution Time |
+ RTL + Status +-----------------------------------------------+-----------------------------------------------+ (Clock Cycles) 
+
| | | min | avg | max | min | avg | max | |
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
| VHDL| NA| NA| NA| NA| NA| NA| NA| NA|
| Verilog| Pass| 264564| 264564| 264564| NA| NA| NA| 264564|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
* Summary:
+-----------------+---------+-----+--------+-------+-----+
|       Name      | BRAM_18K| DSP |   FF   |  LUT  | URAM|
+-----------------+---------+-----+--------+-------+-----+
|DSP              |        -|    -|       -|      -|    -|
|Expression       |        -|    -|       -|      -|    -|
|FIFO             |        -|    -|       -|      -|    -|
|Instance         |        2|    5|    2663|   3657|    -|
|Memory           |        4|    -|       0|      0|    0|
|Multiplexer      |        -|    -|       -|    230|    -|
|Register         |        -|    -|     133|      -|    -|
+-----------------+---------+-----+--------+-------+-----+
|Total            |        6|    5|    2796|   3887|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        2|    2|       2|      7|    0|
+-----------------+---------+-----+--------+-------+-----+
```

The result feels the same as with no rules, so it seems the rules must be modified. By the way, the optimized interval should be close to 65536 cycles (SIZE × SIZE = 256 × 256 iterations at II = 1, with no loop unrolling).
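A quick arithmetic check, using only the numbers reported above: the nested compute loop runs SIZE × SIZE = 65536 iterations, so the cosim latency corresponds to roughly four cycles per iteration.

$$
\frac{264564}{256 \times 256} = \frac{264564}{65536} \approx 4.04
$$

This would be consistent with the `sum_real += ...` accumulation preventing the requested II=1 in the inner loop. The excerpt here does not show the achieved II directly, so treat this as an inference rather than a reported figure.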
next prompt: ``` exchange the loop index ``` ``` #include "dft.h" #include "coefficients256.h" // Top-level DFT kernel void dft(DTYPE In_R[SIZE], DTYPE In_I[SIZE]) { #pragma HLS INTERFACE m_axi port=In_R offset=slave bundle=gmem depth=256 #pragma HLS INTERFACE m_axi port=In_I offset=slave bundle=gmem depth=256 #pragma HLS INTERFACE s_axilite port=In_R bundle=control #pragma HLS INTERFACE s_axilite port=In_I bundle=control #pragma HLS INTERFACE s_axilite port=return bundle=control // Local copies to avoid in-place overwrite corrupting reads DTYPE in_r_local[SIZE]; DTYPE in_i_local[SIZE]; DTYPE out_r_local[SIZE]; DTYPE out_i_local[SIZE]; // Read inputs for (int i = 0; i < SIZE; i++) { #pragma HLS PIPELINE II=1 in_r_local[i] = In_R[i]; in_i_local[i] = In_I[i]; } // Compute DFT with loop interchange to remove accumulation dependency in pipelined loop // Accumulate across samples n into per-bin arrays for (int k = 0; k < SIZE; k++) { out_r_local[k] = 0; out_i_local[k] = 0; } for (int n = 0; n < SIZE; n++) { DTYPE xr = in_r_local[n]; DTYPE xi = in_i_local[n]; #pragma HLS DEPENDENCE variable=out_r_local inter false #pragma HLS DEPENDENCE variable=out_i_local inter false for (int k = 0; k < SIZE; k++) { #pragma HLS PIPELINE II=1 int idx = (n * k) & (SIZE - 1); DTYPE c = cos_coefficients_table[idx]; DTYPE s = sin_coefficients_table[idx]; out_r_local[k] += xr * c - xi * s; out_i_local[k] += xi * c + xr * s; } } // Write outputs for (int k = 0; k < SIZE; k++) { #pragma HLS PIPELINE II=1 In_R[k] = out_r_local[k]; In_I[k] = out_i_local[k]; } } ``` ``` +----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+ | | | Latency(Clock Cycles) | Interval(Clock Cycles) | Total Execution Time | + RTL + Status +-----------------------------------------------+-----------------------------------------------+ (Clock Cycles) + | | | min | avg | max | min | avg | max | | 
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
| VHDL| NA| NA| NA| NA| NA| NA| NA| NA|
| Verilog| Pass| 67956| 67956| 67956| NA| NA| NA| 67956|
+----------+----------+-----------------------------------------------+-----------------------------------------------+----------------------+
```

```
+-----------------+---------+-----+--------+-------+-----+
|       Name      | BRAM_18K| DSP |   FF   |  LUT  | URAM|
+-----------------+---------+-----+--------+-------+-----+
|DSP              |        -|    -|       -|      -|    -|
|Expression       |        -|    -|       0|      2|    -|
|FIFO             |        -|    -|       -|      -|    -|
|Instance         |        2|   20|    3687|   5539|    -|
|Memory           |        6|    -|       0|      0|    0|
|Multiplexer      |        -|    -|       -|    310|    -|
|Register         |        -|    -|     134|      -|    -|
+-----------------+---------+-----+--------+-------+-----+
|Total            |        8|   20|    3821|   5851|    0|
+-----------------+---------+-----+--------+-------+-----+
|Available        |      280|  220|  106400|  53200|    0|
+-----------------+---------+-----+--------+-------+-----+
|Utilization (%)  |        2|    9|       3|     10|    0|
+-----------------+---------+-----+--------+-------+-----+
```

`#pragma HLS DEPENDENCE variable=out_r_local inter false` was added by Cursor. Cursor found that each index of out_r_local and out_i_local is accessed only once in each pass of the inner loop.

Rule update:

```
- Repeating updates the same scalar across iterations (sum += …) in innermost body of nested loop is strictly prohibit, perform loop interchange so the recurrence moves to the outer loop.
```

Key words: "strictly prohibit". Tell Cursor directly not to do it. If the rule leaves room to try, Cursor will always choose not to interchange the loops.

The result is close, but not yet optimal.
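The loop interchange described above can be checked in software before trusting it in hardware. The sketch below is a plain C++ model under stated assumptions: `SIZE_DEMO`, `Spectrum`, `twiddle`, and both function names are hypothetical, the twiddle factors are computed with `std::cos`/`std::sin` instead of the lab's coefficient tables, and the sign convention of `s` is a guess that does not affect the comparison. It shows that moving `n` to the outer loop turns the scalar recurrence into per-bin accumulators without changing the result.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Small power-of-two size for a software check; the lab design uses 256.
static const int SIZE_DEMO = 8;

struct Spectrum {
    std::vector<float> re, im;
};

// Twiddle factors for a table index (stand-in for the coefficient tables).
static void twiddle(int idx, float &c, float &s) {
    const double w = 2.0 * 3.14159265358979323846 * idx / SIZE_DEMO;
    c = static_cast<float>(std::cos(w));
    s = static_cast<float>(-std::sin(w));
}

// k outer, n inner: sum_r/sum_i are updated on every inner iteration,
// which is the recurrence the rule forbids in the innermost loop.
Spectrum dft_k_outer(const std::vector<float> &xr, const std::vector<float> &xi) {
    Spectrum out{std::vector<float>(SIZE_DEMO), std::vector<float>(SIZE_DEMO)};
    for (int k = 0; k < SIZE_DEMO; k++) {
        float sum_r = 0.0f, sum_i = 0.0f;
        for (int n = 0; n < SIZE_DEMO; n++) {
            float c, s;
            twiddle((n * k) & (SIZE_DEMO - 1), c, s);
            sum_r += xr[n] * c - xi[n] * s;
            sum_i += xi[n] * c + xr[n] * s;
        }
        out.re[k] = sum_r;
        out.im[k] = sum_i;
    }
    return out;
}

// n outer, k inner: consecutive inner iterations touch different bins,
// so each accumulator is revisited only once per outer iteration.
Spectrum dft_n_outer(const std::vector<float> &xr, const std::vector<float> &xi) {
    Spectrum out{std::vector<float>(SIZE_DEMO, 0.0f), std::vector<float>(SIZE_DEMO, 0.0f)};
    for (int n = 0; n < SIZE_DEMO; n++) {
        for (int k = 0; k < SIZE_DEMO; k++) {
            float c, s;
            twiddle((n * k) & (SIZE_DEMO - 1), c, s);
            out.re[k] += xr[n] * c - xi[n] * s;
            out.im[k] += xi[n] * c + xr[n] * s;
        }
    }
    return out;
}

// Largest absolute difference between two spectra.
float max_abs_diff(const Spectrum &a, const Spectrum &b) {
    float worst = 0.0f;
    for (int k = 0; k < SIZE_DEMO; k++) {
        worst = std::fmax(worst, std::fabs(a.re[k] - b.re[k]));
        worst = std::fmax(worst, std::fabs(a.im[k] - b.im[k]));
    }
    return worst;
}
```

Feeding both functions the same input and checking that `max_abs_diff` stays below a small tolerance is enough to confirm the interchange is safe. For each bin `k` the terms are still added in the order `n = 0..SIZE_DEMO-1`, so only the schedule changes, not the arithmetic.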
New prompt:

```
- Develop the dft using Vitis HLS with sources in dft folder
- Create a folder inside dft folder for project solution
- Directly edit files in dft folder
- Use out.gold.dat that prvided to test
- Use ap_ctrl_hs and axi-master
- Run csim, synthesis, and cosim to validate design
- Add exit command in TCL scripts to prevent hang
```

Avoid the word 'baseline' in the prompt; otherwise Cursor misunderstands it and produces a baseline HLS design.

Updated rule:

```
- CRITICAL: Apply #pragma HLS DEPENDENCE false when you are 100% certain the dependency is false
```

This is important since vitis_hls may incorrectly detect a data dependency.

## 8*8 Systolic Array

Prompt used:

```
- Develop the 8*8 Wavefront SystolicArray using Vitis HLS.
- Build a folder for all meterals.
- Use wavefront structure, not outer-product structure.
- Run csim, synthesis, and cosim to validate design.
- Test 32 sets of data.
- Add exit command in TCL scripts to prevent hang.
- Adher to rules.

Use AXI-Stream interfaces and ap_ctrl_none.
The stream width matches the width of one fully-partitioned matrix dimension.
Pack row/column into AXI words
One input matrix is streamed in by rows, while the other is streamed in by columns.
Each Processing Element (PE) performs multiply-and-accumulate operations.
The array contains a total of 64 PEs.
```

Cursor always adopts an outer-product structure rather than a wavefront structure, so it's important to state which structure is wanted. Providing a clear description of the algorithm is crucial, since there are many possible implementations of a systolic array.

I have tried many times and iterated on the rules repeatedly. The optimal performance for a wavefront systolic array is N + 2M − 2 cycles (22 cycles for N = M = 8). Therefore, if Cursor achieves an execution interval less than or equal to N + 2M − 2 cycles, I consider it successful.

Since I've iterated so many times, I'll just document the common errors Cursor makes:

1. Incorrect structure (outer-product instead of wavefront).
1. AXI write inside hot loop.
1. Insufficient FIFO depth.
1. Global broadcast in hot loop: uses global RAM access for read/write.
1. BRAM arrays not partitioned to match unrolled compute lanes.
1. No unrolling: data feed and compute loops remain serial, and the lack of unrolling causes DSP sharing.
1. No overlap of compute/read/write.
1. Forgot to add synthesis guard, leading to deadlock in dataflow.

The following shows one design generated by Cursor on its first try with the rules.

```c
// Wavefront systolic compute core (8x8), no off-chip access in hot loop
#pragma once
#include "types.hpp"
#include "pe.hpp"

namespace sa8x8 {

// Compute one output tile C = A x B for 8x8 using wavefront propagation.
// Scheduling: A enters from the left (row i, advancing with t), B enters from the top (column j).
// At time t, PE(i,j) processes k = t - i - j if k in [0, N-1].
static inline void compute_wavefront(
    const elem_t A_rows[N][N], // row-major, A_rows[i][k] = A[i][k]
    const elem_t B_cols[N][N], // column-major, B_cols[j][k] = B[k][j]
    acc_t C[N][N]
) {
#pragma HLS INLINE off
    ProcessingElement pe[N][N];
#pragma HLS ARRAY_PARTITION variable=pe complete dim=0

    // Token pipelines
    elem_t a_pipe[N][N];
    elem_t b_pipe[N][N];
#pragma HLS ARRAY_PARTITION variable=a_pipe complete dim=0
#pragma HLS ARRAY_PARTITION variable=b_pipe complete dim=0

    // Initialize
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        for (int j = 0; j < N; j++) {
#pragma HLS UNROLL
            pe[i][j].reset();
            a_pipe[i][j] = 0;
            b_pipe[i][j] = 0;
        }
    }

    // Total steps: fill + compute + drain = 3N - 2 iterations (t = 0..3N-3)
    const int T = 3 * N - 2;
    for (int t = 0; t < T; t++) {
#pragma HLS PIPELINE II=1 rewind
        // Inject row-wise A on left boundary and column-wise B on top boundary
        for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
            int k_a = t - i; // stagger by row so tokens meet with B at PE(i,j)
            elem_t a_in = 0;
            if (k_a >= 0 && k_a 
< N) a_in = A_rows[i][k_a]; a_pipe[i][0] = a_in; } for (int j = 0; j < N; j++) { #pragma HLS UNROLL int k_b = t - j; // stagger by column elem_t b_in = 0; if (k_b >= 0 && k_b < N) b_in = B_cols[j][k_b]; b_pipe[0][j] = b_in; } // Compute and propagate a right, b down for (int i = N - 1; i >= 0; i--) { #pragma HLS UNROLL for (int j = N - 1; j >= 0; j--) { #pragma HLS UNROLL elem_t a_out, b_out; pe[i][j].step(a_pipe[i][j], b_pipe[i][j], a_out, b_out); if (j + 1 < N) a_pipe[i][j + 1] = a_out; // safe with j descending if (i + 1 < N) b_pipe[i + 1][j] = b_out; // safe with i descending } } } // Output accumulators for (int i = 0; i < N; i++) { #pragma HLS UNROLL for (int j = 0; j < N; j++) { #pragma HLS UNROLL C[i][j] = pe[i][j].acc; } } } } // namespace sa8x8 ``` ```c // Processing Element for wavefront SA 8x8 #pragma once #include "types.hpp" namespace sa8x8 { // Single-cycle MAC bound to DSP; forwards a and b to neighbors. struct ProcessingElement { acc_t acc; ProcessingElement() : acc(0) {} void reset() { acc = 0; } void step(const elem_t a_in, const elem_t b_in, elem_t &a_out, elem_t &b_out) { #pragma HLS INLINE // Multiply in DSP by type selection, accumulate in a wide register acc_t prod = (acc_t)a_in * (acc_t)b_in; // multiplication maps to DSP with 16-bit operands acc = acc + prod; a_out = a_in; b_out = b_in; } }; } // namespace sa8x8 ``` ```c // Top-level interfaces and dataflow wrapper #pragma once #include "types.hpp" namespace sa8x8 { // Intentionally no namespaced top declaration to avoid ambiguity with global top. 
} // namespace sa8x8 // Global wrapper top for HLS tool compatibility void sa8x8_top( hls::stream<sa8x8::axis_ab_t> &A_rows_axis, hls::stream<sa8x8::axis_ab_t> &B_cols_axis, hls::stream<sa8x8::axis_c_t> &C_rows_axis, int num_sets ); ``` ```c // Copyright (c) 2025 // Types and AXI helpers for 8x8 wavefront systolic array #pragma once #include <ap_int.h> #include <hls_stream.h> #include <ap_axi_sdata.h> namespace sa8x8 { // --- Matrix configuration --- static const int N = 8; // Matrix dimension (8x8) static const int ELEM_W = 16; // Operand bitwidth (DSP-friendly <=18) static const int ACC_W = 48; // Accumulator width (<=48 typical) typedef ap_int<ELEM_W> elem_t; typedef ap_int<ACC_W> acc_t; // Packed word widths for AXI-Stream (pack a full row/column per beat) static const int A_B_AXIS_W = N * ELEM_W; // 8 * 16 = 128 bits static const int C_AXIS_W = N * ACC_W; // 8 * 48 = 384 bits // AXI-Stream payload types using ap_axiu typedef ap_axiu<A_B_AXIS_W, 0, 0, 0> axis_ab_t; // for A rows and B columns typedef ap_axiu<C_AXIS_W, 0, 0, 0> axis_c_t; // for C rows // ----------------------------------------------------------------------------- // Packing helpers: pack/unpack arrays of N elements into AXI words // ----------------------------------------------------------------------------- // Pack N operands (elem_t) into an ap_uint<A_B_AXIS_W>, LSB-first by index static inline ap_uint<A_B_AXIS_W> pack_elem_row(const elem_t values[N]) { #pragma HLS INLINE ap_uint<A_B_AXIS_W> word = 0; for (int i = 0; i < N; i++) { #pragma HLS UNROLL ap_uint<ELEM_W> lane = (ap_uint<ELEM_W>)values[i]; word.range((i + 1) * ELEM_W - 1, i * ELEM_W) = lane; } return word; } // Unpack an ap_uint<A_B_AXIS_W> into N operands (elem_t) static inline void unpack_elem_row(const ap_uint<A_B_AXIS_W> word, elem_t values[N]) { #pragma HLS INLINE for (int i = 0; i < N; i++) { #pragma HLS UNROLL ap_uint<ELEM_W> lane = word.range((i + 1) * ELEM_W - 1, i * ELEM_W); values[i] = (elem_t)lane; } } // Pack N 
accumulators (acc_t) into an ap_uint<C_AXIS_W>, LSB-first by index static inline ap_uint<C_AXIS_W> pack_acc_row(const acc_t values[N]) { #pragma HLS INLINE ap_uint<C_AXIS_W> word = 0; for (int i = 0; i < N; i++) { #pragma HLS UNROLL ap_uint<ACC_W> lane = (ap_uint<ACC_W>)values[i]; word.range((i + 1) * ACC_W - 1, i * ACC_W) = lane; } return word; } // Unpack an ap_uint<C_AXIS_W> into N accumulators (acc_t) static inline void unpack_acc_row(const ap_uint<C_AXIS_W> word, acc_t values[N]) { #pragma HLS INLINE for (int i = 0; i < N; i++) { #pragma HLS UNROLL ap_uint<ACC_W> lane = word.range((i + 1) * ACC_W - 1, i * ACC_W); values[i] = (acc_t)lane; } } } // namespace sa8x8 ``` ```c #include "top.hpp" #include "compute.hpp" namespace sa8x8 { static void load_A_rows(hls::stream<axis_ab_t> &in_axis, elem_t A_rows[N][N]) { #pragma HLS INLINE // Read N beats, each packs N elements of a row for (int r = 0; r < N; r++) { #pragma HLS PIPELINE II=1 axis_ab_t beat = in_axis.read(); ap_uint<A_B_AXIS_W> payload = beat.data; unpack_elem_row(payload, A_rows[r]); } } static void load_B_cols(hls::stream<axis_ab_t> &in_axis, elem_t B_cols[N][N]) { #pragma HLS INLINE // Read N beats, each packs N elements of a column for (int c = 0; c < N; c++) { #pragma HLS PIPELINE II=1 axis_ab_t beat = in_axis.read(); ap_uint<A_B_AXIS_W> payload = beat.data; unpack_elem_row(payload, B_cols[c]); } } static void store_C_rows(const acc_t C[N][N], hls::stream<axis_c_t> &out_axis) { #pragma HLS INLINE // Emit N beats, each packs N accumulators of a row for (int r = 0; r < N; r++) { #pragma HLS PIPELINE II=1 acc_t row[N]; #pragma HLS ARRAY_PARTITION variable=row complete dim=1 for (int j = 0; j < N; j++) { #pragma HLS UNROLL row[j] = C[r][j]; } axis_c_t out_beat; out_beat.data = pack_acc_row(row); out_beat.keep = (ap_uint<C_AXIS_W/8>) -1; out_beat.strb = (ap_uint<C_AXIS_W/8>) -1; out_beat.user = 0; out_beat.id = 0; out_beat.dest = 0; out_beat.last = (r == N - 1); out_axis.write(out_beat); } } static void 
sa8x8_namespace_top( hls::stream<axis_ab_t> &A_rows_axis, hls::stream<axis_ab_t> &B_cols_axis, hls::stream<axis_c_t> &C_rows_axis, int num_sets ) { #pragma HLS INTERFACE ap_ctrl_hs port=return #pragma HLS INTERFACE s_axilite port=num_sets bundle=control #pragma HLS INTERFACE axis port=A_rows_axis #pragma HLS INTERFACE axis port=B_cols_axis #pragma HLS INTERFACE axis port=C_rows_axis // Process multiple datasets in a streaming dataflow; ensure overlap without backpressure dataflow_sets: for (int ds = 0; ds < num_sets; ds++) { #pragma HLS LOOP_TRIPCOUNT min=1 max=32 #pragma HLS DATAFLOW // Local fully partitioned buffers per dataset elem_t A_rows[N][N]; elem_t B_cols[N][N]; acc_t C[N][N]; #pragma HLS ARRAY_PARTITION variable=A_rows complete dim=0 #pragma HLS ARRAY_PARTITION variable=B_cols complete dim=0 #pragma HLS ARRAY_PARTITION variable=C complete dim=0 load_A_rows(A_rows_axis, A_rows); load_B_cols(B_cols_axis, B_cols); compute_wavefront(A_rows, B_cols, C); store_C_rows(C, C_rows_axis); } } } // namespace sa8x8 // Non-namespaced top that forwards to namespaced implementation for tool visibility void sa8x8_top( hls::stream<sa8x8::axis_ab_t> &A_rows_axis, hls::stream<sa8x8::axis_ab_t> &B_cols_axis, hls::stream<sa8x8::axis_c_t> &C_rows_axis, int num_sets ) { #pragma HLS INTERFACE ap_ctrl_hs port=return #pragma HLS INTERFACE s_axilite port=num_sets bundle=control #pragma HLS INTERFACE axis port=A_rows_axis #pragma HLS INTERFACE axis port=B_cols_axis #pragma HLS INTERFACE axis port=C_rows_axis sa8x8::sa8x8_namespace_top(A_rows_axis, B_cols_axis, C_rows_axis, num_sets); } ``` ``` + Performance & Resource Estimates: PS: '+' for module; 'o' for loop; '*' for dataflow +--------------------------------------+------+------+---------+-----------+----------+---------+------+----------+------+----------+-------------+------------+-----+ | Modules | Issue| | Latency | Latency | Iteration| | Trip | | | | | | | | & Loops | Type | Slack| (cycles)| (ns) | Latency | Interval| Count| 
Pipelined| BRAM | DSP | FF | LUT | URAM|
+--------------------------------------+------+------+---------+-----------+----------+---------+------+----------+------+----------+-------------+------------+-----+
|+ sa8x8_top | -| 0.12| 735| 2.425e+03| -| 736| -| no| -| 64 (~0%)| 29275 (1%)| 21418 (1%)| -|
| + sa8x8_namespace_top | -| 0.12| 732| 2.416e+03| -| 732| -| no| -| 64 (~0%)| 29168 (1%)| 21264 (1%)| -|
| o dataflow_sets | -| 2.41| 731| 2.412e+03| 50| -| 32| no| -| -| -| -| -|
| + dataflow_in_loop_dataflow_sets* | -| 0.12| 48| 158.400| -| 22| -| dataflow| -| 64 (~0%)| 28660 (1%)| 21144 (1%)| -|
| + VITIS_LOOP_9_1_proc | -| 1.31| 10| 33.000| -| 10| -| no| -| -| 2054 (~0%)| 637 (~0%)| -|
| o VITIS_LOOP_9_1 | -| 2.41| 8| 26.400| 1| 1| 8| yes| -| -| -| -| -|
| + VITIS_LOOP_20_1_proc | -| 1.31| 10| 33.000| -| 10| -| no| -| -| 2054 (~0%)| 637 (~0%)| -|
| o VITIS_LOOP_20_1 | -| 2.41| 8| 26.400| 1| 1| 8| yes| -| -| -| -| -|
| + compute_wavefront | -| 0.12| 26| 85.800| -| 26| -| no| -| 64 (~0%)| 10846 (~0%)| 9293 (~0%)| -|
| o VITIS_LOOP_41_3 | -| 2.41| 25| 82.500| 5| 1| 22| yes| -| -| -| -| -|
| + VITIS_LOOP_31_1_proc | -| 1.21| 10| 33.000| -| 10| -| no| -| -| 392 (~0%)| 425 (~0%)| -|
| o VITIS_LOOP_31_1 | -| 2.41| 8| 26.400| 2| 1| 8| yes| -| -| -| -| -|
+--------------------------------------+------+------+---------+-----------+----------+---------+------+----------+------+----------+-------------+------------+-----+
```

There are 32 sets of test data. It is very hard to succeed on the first try. Most of the time the average is about 60 cycles per data set, while the ideal is 22 cycles. When that happens, I just tell Cursor to improve the design with a follow-up prompt, for example: "Improve the overall latency to under 800 cycles." The second try may then succeed.

## FFT