Here’s a practical, battle-tested checklist for debugging real-time performance in [DSP](https://www.ampheo.com/c/dsp-digital-signal-processors) apps (audio, comms, motor control, radar, etc.). Work through it top-down; fix the pipeline first, then the hot code.

**1) Define the real-time budget**
* Frame period: T_f = samples_per_block / sample_rate.
* CPU budget per frame: B = T_f × CPU_freq (in cycles); see the sketch after this list.
* Headroom target: 20–30% slack under worst-case input.
* Track XRUNs (buffer under/overruns) or dropped frames—those are your ground truth.
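To make the budget concrete, here is a minimal sketch; the sample rate, block size, clock frequency, and headroom values are assumptions to replace with your own.
```
// Budget sketch: all numeric values below are illustrative assumptions.
#define SAMPLE_RATE_HZ    48000u
#define SAMPLES_PER_BLOCK 64u
#define CPU_HZ            480000000u   /* e.g. a 480 MHz Cortex-M7 */

/* Frame period: T_f = samples_per_block / sample_rate  (~1.33 ms here) */
#define FRAME_PERIOD_S ((float)SAMPLES_PER_BLOCK / (float)SAMPLE_RATE_HZ)

/* Cycle budget: B = T_f * CPU_freq  (~640k cycles here) */
#define CYCLE_BUDGET   (FRAME_PERIOD_S * (float)CPU_HZ)

/* Measured worst-case cycles per frame must stay below this target. */
#define CYCLE_TARGET   (CYCLE_BUDGET * 0.75f)   /* keep 25% slack */
```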
**2) Reproduce + instrument correctly**
* Disable printf/logging in the real-time path.
* Mark the pipeline: ISR entry/exit, DMA half/full callbacks, task start/end.
* Measure time with cycle counters or a scope:
* Cortex-M (M4F/M7):
```
// one-time init
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
uint32_t start = DWT->CYCCNT;
// ... work ...
uint32_t cycles = DWT->CYCCNT - start;
```
* Toggle a GPIO high/low around the code and measure the pulse width with a logic analyzer/oscilloscope for nanosecond accuracy (see the sketch after this list).
* Use your vendor tools for tracing/profiling (e.g., ITM/ETM/SWO, RTOS trace, CCS/SEGGER/Ozone/Lauterbach).
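For the GPIO method, a minimal sketch is below; it assumes an STM32-style port with a BSRR set/reset register, and the PROF_PORT/PROF_PIN choices are hypothetical.
```
// GPIO-pulse timing sketch. Assumes an STM32-style GPIO with a BSRR register;
// PROF_PORT and PROF_PIN are placeholders for a spare pin on your board.
#define PROF_PORT GPIOA
#define PROF_PIN  (1u << 5)

static inline void prof_high(void) { PROF_PORT->BSRR = PROF_PIN; }         /* set pin   */
static inline void prof_low(void)  { PROF_PORT->BSRR = (PROF_PIN << 16); } /* reset pin */

// usage: pulse width on the analyzer == execution time of process_block()
prof_high();
process_block(buf, N);
prof_low();
```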
**3) Triage the pipeline (I/O first)**
* DMA everywhere you can (ADC/DAC/I²S/SPI). No memcpy in the hot path.
* Ping-pong (double) buffering; process in both the half-complete and full-complete DMA callbacks to cut latency (sketch after this list).
* Right-size block size: larger blocks = better throughput but higher latency; find the knee.
* Verify clocking (PLL/DFS) and that freq scaling/low-power modes aren’t throttling you.
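A minimal ping-pong sketch, assuming circular DMA into a double-length buffer; the callback names and the signal_dsp_task()/wait_for_signal() helpers are placeholders for your DMA driver and RTOS primitives.
```
// Ping-pong sketch: the DMA fills one half while the CPU processes the other.
#define BLOCK 64
static int16_t adc_buf[2 * BLOCK];              /* DMA writes here in circular mode */
static int16_t * volatile ready_half;           /* half that is safe to process     */

void dma_half_complete_isr(void) { ready_half = &adc_buf[0];     signal_dsp_task(); }
void dma_full_complete_isr(void) { ready_half = &adc_buf[BLOCK]; signal_dsp_task(); }

void dsp_task(void)                             /* high-priority task, not the ISR */
{
    for (;;) {
        wait_for_signal();
        process_block(ready_half, BLOCK);       /* must finish within one half-period */
    }
}
```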
**4) Find where time goes (hotspots)**
* Sampling profiler (if available) or manual timing around each stage: windowing → FFT → magnitude → filtering → post-proc (per-stage sketch after this list).
* Rank by cycles and fix in this order: (1) memory moves, (2) math kernels (FFTs, FIR/IIR, resamplers), (3) format conversions, (4) control/UI.
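A per-stage timing sketch using the DWT counter from step 2; the stage names and processing functions are illustrative.
```
// Keep a worst-case cycle count per stage, then fix the biggest one first.
enum { ST_WINDOW, ST_FFT, ST_MAG, ST_FILTER, ST_POST, ST_COUNT };
static uint32_t stage_max[ST_COUNT];

#define TIME_STAGE(id, call)                          \
    do {                                              \
        uint32_t t0_ = DWT->CYCCNT;                   \
        call;                                         \
        uint32_t c_ = DWT->CYCCNT - t0_;              \
        if (c_ > stage_max[id]) stage_max[id] = c_;   \
    } while (0)

// usage inside the frame handler
TIME_STAGE(ST_WINDOW, apply_window(buf, N));
TIME_STAGE(ST_FFT,    run_fft(buf, spec, N));
```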
**5) Attack common hotspots**
**Math kernels**
* Use vendor-optimized libraries (CMSIS-DSP, Arm Ne10 with NEON, TI DSPLIB, KissFFT/FFTS with SIMD).
* Choose power-of-two FFT sizes; precompute twiddles; reuse plans; avoid repeated init (see the CMSIS-DSP sketch after this list).
* For FIR/IIR: block processing, polyphase for resamplers, symmetric FIR trick, biquad cascades with SOS form.
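As an example of plan reuse, a sketch with CMSIS-DSP's real FFT follows; the 1024-point length and buffer names are assumptions.
```
// CMSIS-DSP sketch: initialize the FFT instance once, reuse it every frame.
#include "arm_math.h"

#define NFFT 1024
static arm_rfft_fast_instance_f32 rfft;
static float32_t spectrum[NFFT];              /* packed complex output */
static float32_t mag[NFFT / 2];               /* magnitude per bin     */

void dsp_init(void)
{
    arm_rfft_fast_init_f32(&rfft, NFFT);      /* twiddle tables set up once, not per frame */
}

void dsp_frame(float32_t *frame)              /* NFFT real samples; modified in place */
{
    arm_rfft_fast_f32(&rfft, frame, spectrum, 0);   /* 0 = forward transform */
    /* bin 0 of the packed format holds DC (real) and Nyquist (imag) together */
    arm_cmplx_mag_f32(spectrum, mag, NFFT / 2);
}
```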
**Numerics**
* Watch for denormals/subnormals in decaying float paths (filter/reverb tails); they cause sudden, huge slowdowns.
* Enable flush-to-zero/denormals-are-zero or add a tiny DC offset (e.g., 1e-20f); see the sketch after this list.
* Use fixed-point with saturation if FPU is weak; pick Q format, scale once, then MAC.
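Two anti-denormal options as a sketch, assuming a Cortex-M part with an FPU and the CMSIS core header already included.
```
// (1) Flush-to-zero: set the FZ bit (bit 24) of FPSCR via the CMSIS intrinsics.
void enable_flush_to_zero(void)
{
    __set_FPSCR(__get_FPSCR() | (1u << 24));
}

// (2) Tiny DC offset: keeps IIR/reverb tails from decaying into subnormals.
void add_anti_denormal_bias(float *x, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        x[i] += 1e-20f;             /* far below any audible level */
}
```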
**Compiler/flags**
* Build release with -O3 (or -Ofast if acceptable), LTO, and target CPU flags (e.g., -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard).
* Consider PGO (profile-guided optimization) for big apps.
**6) Minimize memory traffic (often the #1 limiter)**
* Keep working sets in TCM/scratchpad or L1 cache; align buffers (32/64-byte).
* Interleave/deinterleave once; keep data in the format kernels expect.
* Eliminate memcpy; process in place; use restrict pointers to help the compiler (sketch after this list).
* Prefetch/stream in order; avoid strided/random access in inner loops.
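A small sketch of the restrict idea: with no possible aliasing and unit-stride access, the compiler is free to vectorize the loop.
```
// restrict tells the compiler these buffers never overlap, enabling SIMD.
void gain_accumulate(float * restrict out,
                     const float * restrict in,
                     const float * restrict gain,
                     unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        out[i] += gain[i] * in[i];   /* unit stride, no aliasing, easy to vectorize */
}
```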
**7) Concurrency & RTOS hygiene**
* Do almost nothing in ISRs: move work to high-priority tasks via lock-free queues.
* Set ISR and task priorities so the audio/ADC path cannot be preempted by non-critical work.
* Avoid priority inversion: keep mutexes out of the RT path; use bounded mailboxes or SPSC ring buffers (sketch after this list).
* No dynamic allocation (malloc/free) in the real-time path; pre-allocate pools at init.
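A minimal SPSC ring buffer sketch for ISR-to-task handoff on a single-core MCU; on multi-core or out-of-order parts you would add memory barriers or use your RTOS's lock-free queue instead.
```
// Bounded single-producer/single-consumer queue: no locks, no allocation.
#include <stdbool.h>
#include <stdint.h>

#define QLEN 8u                              /* must be a power of two */
typedef struct { int16_t *block; uint32_t len; } msg_t;

static msg_t q[QLEN];
static volatile uint32_t q_head;             /* written by the ISR (producer)  */
static volatile uint32_t q_tail;             /* written by the task (consumer) */

bool q_push(msg_t m)                         /* ISR side */
{
    uint32_t h = q_head;
    if (h - q_tail == QLEN) return false;    /* full: count it as an overrun */
    q[h & (QLEN - 1u)] = m;
    q_head = h + 1u;                         /* publish only after the payload write */
    return true;
}

bool q_pop(msg_t *out)                       /* task side */
{
    uint32_t t = q_tail;
    if (t == q_head) return false;           /* empty */
    *out = q[t & (QLEN - 1u)];
    q_tail = t + 1u;
    return true;
}
```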
**8) Quick stabilizers (while you dig deeper)**
* Increase buffer depth one notch to avoid audible/visible glitches.
* Lower CPU load: reduce sample rate, FFT size, [filter](https://www.onzuu.com/category/filters) order, or update rate for non-critical features.
* Implement backpressure or frame drop policy for non-critical streams.
**9) Validate worst-case, not just average**
* Feed worst-case vectors: max amplitude, impulses/steps, multi-tone near Nyquist, pathological filter inputs.
* Sweep across operating points (temperatures, supply voltages, different PLL settings).
* Measure WCET (worst-case execution time) and confirm slack ≥ headroom target.
**10) Small but mighty tricks**
* Replace divides with reciprocal multiplies; hoist invariant math out of loops (sketch after this list).
* Unroll tight loops; ensure SIMD vectorization (NEON/SSE) actually triggers (inspect disassembly or compiler reports).
* Use CORDIC or LUTs for expensive trig if accuracy allows.
* Keep printf/logging out of the fast path (or buffer to a low-priority task).
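For the divide-to-multiply and hoisting tricks, a minimal sketch:
```
// One divide outside the loop, multiplies inside: same math, far fewer cycles.
void normalize(float *x, unsigned n, float peak)
{
    const float inv_peak = 1.0f / peak;   /* hoisted loop invariant */
    for (unsigned i = 0; i < n; i++)
        x[i] *= inv_peak;
}
```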
**Minimal timing template (drop-in)**
```
// Assumes the DWT cycle counter was enabled as in step 2.
static inline uint32_t cycles_begin(void)
{
    return DWT->CYCCNT;
}

static inline uint32_t cycles_end(uint32_t start)
{
    return DWT->CYCCNT - start;   /* unsigned math handles counter wrap */
}

// usage
uint32_t t0  = cycles_begin();
process_block(buf, N);
uint32_t cyc = cycles_end(t0);
g_stats.cyc_max = MAX(g_stats.cyc_max, cyc);   /* track worst case, not average */
```
Compute CPU load: load = cyc / (CPU_HZ * frame_period).
**Common culprits checklist**
* Misconfigured clock tree → CPU really slower than you think.
* DMA not used / cache maintenance missing (clean before the DMA reads, invalidate before the CPU reads) → stalls or corrupted data (see the sketch below).
* Denormals on float paths → sudden 10–100× slowdown.
* Excessive copies / format conversions between stages.
* Doing work in ISR; low-priority thread preempting high-priority path.
* Using generic library instead of the [SoC](https://www.ampheo.com/c/system-on-chip-soc)’s optimized [DSP](https://www.ampheoelec.de/c/dsp-digital-signal-processors) kernels.
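For the cache-maintenance culprit, a sketch assuming a Cortex-M7 with the D-cache enabled and the CMSIS SCB cache functions available; the buffer names and start_tx_dma() helper are placeholders.
```
// Keep DMA and CPU views coherent: buffers aligned and padded to 32-byte cache lines.
#define BLOCK 64
static int16_t rx_buf[BLOCK] __attribute__((aligned(32)));
static int16_t tx_buf[BLOCK] __attribute__((aligned(32)));

void on_rx_dma_complete(void)
{
    /* DMA wrote RAM behind the cache: invalidate before the CPU reads it */
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, sizeof rx_buf);

    process_block(rx_buf, tx_buf, BLOCK);

    /* CPU wrote into the cache: clean to RAM before the DMA reads it */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, sizeof tx_buf);
    start_tx_dma(tx_buf, BLOCK);
}
```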