Here’s a practical, battle-tested checklist for debugging real-time performance in [DSP](https://www.ampheo.com/c/dsp-digital-signal-processors) apps (audio, comms, motor control, radar, etc.). Work through it top-down; fix the pipeline first, then the hot code.

**1) Define the real-time budget**
* Frame period: T_f = samples_per_block / sample_rate.
* CPU budget per frame: B = T_f × CPU_freq (in cycles); see the sketch after this list.
* Headroom target: 20–30% slack under worst-case input.
* Track XRUNs (buffer under/overruns) or dropped frames—those are your ground truth.
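To make the budget concrete, here is a minimal sketch; the sample rate, block size, clock frequency, and headroom values are assumptions to replace with your own.
```
// Budget sketch: all numeric values below are illustrative assumptions.
#define SAMPLE_RATE_HZ    48000u
#define SAMPLES_PER_BLOCK 64u
#define CPU_HZ            480000000u   /* e.g. a 480 MHz Cortex-M7 */

/* Frame period: T_f = samples_per_block / sample_rate  (~1.33 ms here) */
#define FRAME_PERIOD_S ((float)SAMPLES_PER_BLOCK / (float)SAMPLE_RATE_HZ)

/* Cycle budget: B = T_f * CPU_freq  (~640k cycles here) */
#define CYCLE_BUDGET   (FRAME_PERIOD_S * (float)CPU_HZ)

/* Measured worst-case cycles per frame must stay below this target. */
#define CYCLE_TARGET   (CYCLE_BUDGET * 0.75f)   /* keep 25% slack */
```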
**2) Reproduce + instrument correctly**
* Disable printf/logging in the real-time path.
* Mark the pipeline: ISR entry/exit, DMA half/full callbacks, task start/end.
* Measure time with cycle counters or a scope:
* Cortex-M (M4F/M7):
```
// one-time init
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
uint32_t start = DWT->CYCCNT;
// ... work ...
uint32_t cycles = DWT->CYCCNT - start;
```
* Toggle a GPIO high/low around the code and measure the pulse width with a logic analyzer/oscilloscope for nanosecond accuracy (see the sketch after this list).
* Use your vendor tools for tracing/profiling (e.g., ITM/ETM/SWO, RTOS trace, CCS/SEGGER/Ozone/Lauterbach).
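For the GPIO method, a minimal sketch is below; it assumes an STM32-style port with a BSRR set/reset register, and the PROF_PORT/PROF_PIN choices are hypothetical.
```
// GPIO-pulse timing sketch. Assumes an STM32-style GPIO with a BSRR register;
// PROF_PORT and PROF_PIN are placeholders for a spare pin on your board.
#define PROF_PORT GPIOA
#define PROF_PIN  (1u << 5)

static inline void prof_high(void) { PROF_PORT->BSRR = PROF_PIN; }         /* set pin   */
static inline void prof_low(void)  { PROF_PORT->BSRR = (PROF_PIN << 16); } /* reset pin */

// usage: pulse width on the analyzer == execution time of process_block()
prof_high();
process_block(buf, N);
prof_low();
```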
**3) Triage the pipeline (I/O first)**
* DMA everywhere you can (ADC/DAC/I²S/SPI). No memcpy in the hot path.
* Ping-pong (double) buffering; process in both the half-complete and full-complete DMA callbacks to cut latency (sketch after this list).
* Right-size block size: larger blocks = better throughput but higher latency; find the knee.
* Verify clocking (PLL/DFS) and that freq scaling/low-power modes aren’t throttling you.
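A minimal ping-pong sketch, assuming circular DMA into a double-length buffer; the callback names and the signal_dsp_task()/wait_for_signal() helpers are placeholders for your DMA driver and RTOS primitives.
```
// Ping-pong sketch: the DMA fills one half while the CPU processes the other.
#define BLOCK 64
static int16_t adc_buf[2 * BLOCK];              /* DMA writes here in circular mode */
static int16_t * volatile ready_half;           /* half that is safe to process     */

void dma_half_complete_isr(void) { ready_half = &adc_buf[0];     signal_dsp_task(); }
void dma_full_complete_isr(void) { ready_half = &adc_buf[BLOCK]; signal_dsp_task(); }

void dsp_task(void)                             /* high-priority task, not the ISR */
{
    for (;;) {
        wait_for_signal();
        process_block(ready_half, BLOCK);       /* must finish within one half-period */
    }
}
```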
**4) Find where time goes (hotspots)**
* Sampling profiler (if available) or manual timing around each stage: windowing → FFT → magnitude → filtering → post-proc (per-stage sketch after this list).
* Rank by cycles and fix in this order: (1) memory moves, (2) math kernels (FFTs, FIR/IIR, resamplers), (3) format conversions, (4) control/UI.
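A per-stage timing sketch using the DWT counter from step 2; the stage names and processing functions are illustrative.
```
// Keep a worst-case cycle count per stage, then fix the biggest one first.
enum { ST_WINDOW, ST_FFT, ST_MAG, ST_FILTER, ST_POST, ST_COUNT };
static uint32_t stage_max[ST_COUNT];

#define TIME_STAGE(id, call)                          \
    do {                                              \
        uint32_t t0_ = DWT->CYCCNT;                   \
        call;                                         \
        uint32_t c_ = DWT->CYCCNT - t0_;              \
        if (c_ > stage_max[id]) stage_max[id] = c_;   \
    } while (0)

// usage inside the frame handler
TIME_STAGE(ST_WINDOW, apply_window(buf, N));
TIME_STAGE(ST_FFT,    run_fft(buf, spec, N));
```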
**5) Attack common hotspots**
**Math kernels**
* Use vendor-optimized libraries (CMSIS-DSP, Arm Ne10 with NEON, TI DSPLIB, KissFFT/FFTS with SIMD).
* Choose power-of-two FFT sizes; precompute twiddles; reuse plans; avoid repeated init (see the CMSIS-DSP sketch after this list).
* For FIR/IIR: block processing, polyphase for resamplers, symmetric FIR trick, biquad cascades with SOS form.
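As an example of plan reuse, a sketch with CMSIS-DSP's real FFT follows; the 1024-point length and buffer names are assumptions.
```
// CMSIS-DSP sketch: initialize the FFT instance once, reuse it every frame.
#include "arm_math.h"

#define NFFT 1024
static arm_rfft_fast_instance_f32 rfft;
static float32_t spectrum[NFFT];              /* packed complex output */
static float32_t mag[NFFT / 2];               /* magnitude per bin     */

void dsp_init(void)
{
    arm_rfft_fast_init_f32(&rfft, NFFT);      /* twiddle tables set up once, not per frame */
}

void dsp_frame(float32_t *frame)              /* NFFT real samples; modified in place */
{
    arm_rfft_fast_f32(&rfft, frame, spectrum, 0);   /* 0 = forward transform */
    /* bin 0 of the packed format holds DC (real) and Nyquist (imag) together */
    arm_cmplx_mag_f32(spectrum, mag, NFFT / 2);
}
```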
**Numerics**
* Watch for denormals/subnormals in decaying float paths (filter/reverb tails); they cause sudden, huge slowdowns.
* Enable flush-to-zero/denormals-are-zero or add a tiny DC offset (e.g., 1e-20f); see the sketch after this list.
* Use fixed-point with saturation if FPU is weak; pick Q format, scale once, then MAC.
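Two anti-denormal options as a sketch, assuming a Cortex-M part with an FPU and the CMSIS core header already included.
```
// (1) Flush-to-zero: set the FZ bit (bit 24) of FPSCR via the CMSIS intrinsics.
void enable_flush_to_zero(void)
{
    __set_FPSCR(__get_FPSCR() | (1u << 24));
}

// (2) Tiny DC offset: keeps IIR/reverb tails from decaying into subnormals.
void add_anti_denormal_bias(float *x, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        x[i] += 1e-20f;             /* far below any audible level */
}
```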
**Compiler/flags**
* Build release with -O3 (or -Ofast if acceptable), LTO, and target CPU flags (e.g., -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard).
* Consider PGO (profile-guided optimization) for big apps.
**6) Minimize memory traffic (often the #1 limiter)**
* Keep working sets in TCM/scratchpad or L1 cache; align buffers (32/64-byte).
* Interleave/deinterleave once; keep data in the format kernels expect.
* Eliminate memcpy; process in place; use restrict pointers to help the compiler (sketch after this list).
* Prefetch/stream in order; avoid strided/random access in inner loops.
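A small sketch of the restrict idea: with no possible aliasing and unit-stride access, the compiler is free to vectorize the loop.
```
// restrict tells the compiler these buffers never overlap, enabling SIMD.
void gain_accumulate(float * restrict out,
                     const float * restrict in,
                     const float * restrict gain,
                     unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        out[i] += gain[i] * in[i];   /* unit stride, no aliasing, easy to vectorize */
}
```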
**7) Concurrency & RTOS hygiene**
* Do almost nothing in ISRs: move work to high-priority tasks via lock-free queues.
* Set ISR and task priorities so the audio/ADC path cannot be preempted by non-critical work.
* Avoid priority inversion: keep mutexes out of the RT path; use bounded mailboxes or SPSC ring buffers (sketch after this list).
* No dynamic allocation (malloc/free) in the real-time path; pre-allocate pools at init.
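A minimal SPSC ring buffer sketch for ISR-to-task handoff on a single-core MCU; on multi-core or out-of-order parts you would add memory barriers or use your RTOS's lock-free queue instead.
```
// Bounded single-producer/single-consumer queue: no locks, no allocation.
#include <stdbool.h>
#include <stdint.h>

#define QLEN 8u                              /* must be a power of two */
typedef struct { int16_t *block; uint32_t len; } msg_t;

static msg_t q[QLEN];
static volatile uint32_t q_head;             /* written by the ISR (producer)  */
static volatile uint32_t q_tail;             /* written by the task (consumer) */

bool q_push(msg_t m)                         /* ISR side */
{
    uint32_t h = q_head;
    if (h - q_tail == QLEN) return false;    /* full: count it as an overrun */
    q[h & (QLEN - 1u)] = m;
    q_head = h + 1u;                         /* publish only after the payload write */
    return true;
}

bool q_pop(msg_t *out)                       /* task side */
{
    uint32_t t = q_tail;
    if (t == q_head) return false;           /* empty */
    *out = q[t & (QLEN - 1u)];
    q_tail = t + 1u;
    return true;
}
```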
**8) Quick stabilizers (while you dig deeper)**
* Increase buffer depth one notch to avoid audible/visible glitches.
* Lower CPU load: reduce sample rate, FFT size, [filter](https://www.onzuu.com/category/filters) order, or update rate for non-critical features.
* Implement backpressure or frame drop policy for non-critical streams.
**9) Validate worst-case, not just average**
* Feed worst-case vectors: max amplitude, impulses/steps, multi-tone near Nyquist, pathological filter inputs.
* Sweep across operating points (temperatures, supply voltages, different PLL settings).
* Measure WCET (worst-case execution time) and confirm slack ≥ headroom target.
**10) Small but mighty tricks**
* Replace divides with reciprocal multiplies; hoist invariant math out of loops (sketch after this list).
* Unroll tight loops; ensure SIMD vectorization (NEON/SSE) actually triggers (inspect disassembly or compiler reports).
* Use CORDIC or LUTs for expensive trig if accuracy allows.
* Keep printf/logging out of the fast path (or buffer to a low-priority task).
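For the divide-to-multiply and hoisting tricks, a minimal sketch:
```
// One divide outside the loop, multiplies inside: same math, far fewer cycles.
void normalize(float *x, unsigned n, float peak)
{
    const float inv_peak = 1.0f / peak;   /* hoisted loop invariant */
    for (unsigned i = 0; i < n; i++)
        x[i] *= inv_peak;
}
```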
**Minimal timing template (drop-in)**
```
// Assumes the DWT cycle counter was enabled as in step 2.
static inline uint32_t cycles_begin(void)
{
    return DWT->CYCCNT;
}

static inline uint32_t cycles_end(uint32_t start)
{
    return DWT->CYCCNT - start;   /* unsigned math handles counter wrap */
}

// usage
uint32_t t0  = cycles_begin();
process_block(buf, N);
uint32_t cyc = cycles_end(t0);
g_stats.cyc_max = MAX(g_stats.cyc_max, cyc);   /* track worst case, not average */
```
Compute CPU load: load = cyc / (CPU_HZ * frame_period).
**Common culprits checklist**
* Misconfigured clock tree → CPU really slower than you think.
* DMA not used / cache maintenance missing (clean before the DMA reads, invalidate before the CPU reads) → stalls or corrupted data (see the sketch below).
* Denormals on float paths → sudden 10–100× slowdown.
* Excessive copies / format conversions between stages.
* Doing work in ISR; low-priority thread preempting high-priority path.
* Using generic library instead of the [SoC](https://www.ampheo.com/c/system-on-chip-soc)’s optimized [DSP](https://www.ampheoelec.de/c/dsp-digital-signal-processors) kernels.
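For the cache-maintenance culprit, a sketch assuming a Cortex-M7 with the D-cache enabled and the CMSIS SCB cache functions available; the buffer names and start_tx_dma() helper are placeholders.
```
// Keep DMA and CPU views coherent: buffers aligned and padded to 32-byte cache lines.
#define BLOCK 64
static int16_t rx_buf[BLOCK] __attribute__((aligned(32)));
static int16_t tx_buf[BLOCK] __attribute__((aligned(32)));

void on_rx_dma_complete(void)
{
    /* DMA wrote RAM behind the cache: invalidate before the CPU reads it */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, sizeof rx_buf);

    process_block(rx_buf, tx_buf, BLOCK);

    /* CPU wrote into the cache: clean to RAM before the DMA reads it */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, sizeof tx_buf);
    start_tx_dma(tx_buf, BLOCK);
}
```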