What is DMA (Direct Memory Access) and how does it work?

DMA (Direct Memory Access) lets hardware copy data between memory and peripherals (or memory↔memory) without the CPU moving each byte. A small engine, the DMA controller, becomes a bus master, reads a descriptor (source, destination, length, options), moves the bytes in bursts, then raises an interrupt when done (or on error).

![img_temp_659678c1ec1aa8-99987897-92075426](https://hackmd.io/_uploads/H1VI9D4Kgl.png)

**Why use DMA?**

* Offloads the CPU: frees cycles for real work or lets it sleep.
* Higher throughput & lower latency jitter: burst transfers at bus speed.
* Energy efficient: fewer interrupts and context switches for large/continuous data.

**How it works (typical flow)**

1. CPU sets up a descriptor: source addr, dest addr, transfer size, increment modes, burst size, peripheral request, etc.
2. CPU starts the DMA channel.
3. Peripheral or timer triggers the DMA (e.g., UART RX not-empty, ADC end-of-conversion) or it runs immediately for mem↔mem.
4. DMA reads/writes the bus directly in bursts, updating addresses and counters.
5. On completion/half-transfer/error, DMA raises an interrupt; the ISR handles bookkeeping (e.g., advance a ring buffer).

**Common transfer types**

* Peripheral→Memory: e.g., ADC samples into a RAM buffer; UART RX into ring buffer.
* Memory→Peripheral: e.g., SPI TX from a frame buffer; DAC playback.
* Memory→Memory: fast memcpy/fill for large blocks (some [MCUs](https://www.onzuu.com/category/microcontrollers)/SoCs).

Options you’ll see:

* Increment/Fixed address per side (peripherals often fixed).
* Burst size (beats per burst).
* Circular (ring) mode for continuous capture/playback.
* Scatter–gather (linked descriptors) for noncontiguous buffers; common in NICs, SD/PCIe, AXI DMA.

**Microcontrollers vs. SoCs/FPGA**

* [MCUs](https://www.ampheo.com/c/microcontrollers) (e.g., [STM32](https://www.ampheo.com/search/STM32), [NXP](https://www.ampheo.com/manufacturer/nxp-semiconductors)): a central DMA/DMAMUX services on-chip buses (AHB/APB).
  You select a channel/stream and a request (UART/ADC/SPI).
* Linux/[SoC](https://www.ampheo.com/c/system-on-chip-soc) world: many devices contain their own DMA engines; drivers program descriptor rings, and completion is signaled by an interrupt (MSI/MSI-X on PCIe).
* [FPGA](https://www.ampheo.com/c/fpgas-field-programmable-gate-array) (AXI systems): use an AXI DMA/VDMA (memory-mapped↔AXI-Stream) with descriptors in DDR; great for high-rate ADC/[DSP](https://www.ampheo.com/c/dsp-digital-signal-processors)/video pipelines.

**Minimal MCU example (UART RX to circular buffer)**

Conceptual steps:

1. Allocate rx_buf[N].
2. Configure DMA: peripheral=UART_DR, memory=rx_buf, dir=Periph→Mem, circular, mem increment, half/full complete IRQs.
3. In DMA ISR: on the half-transfer flag, process the first half of the buffer; on the transfer-complete flag, process the second half; then clear the flags.
4. Main loop stays free; no per-byte interrupts.

This pattern scales to ADC continuous sampling, I²S audio, SPI camera streams, etc.

**Performance & correctness tips**

* Burst and beat size: match the beat size to the bus/peripheral data width and the burst length to the FIFO depth or threshold (e.g., 4- or 8-beat bursts on a 32-bit bus).
* Arbitration & priorities: heavy DMA can starve the CPU or other DMA channels; tune priorities.
* Cache coherency ([SoCs](https://www.ampheoelec.de/c/system-on-chip-soc)/CPUs with caches):
  * Device→RAM: invalidate the cache lines covering the buffer before the CPU reads it.
  * RAM→Device: clean/flush the cache before starting the DMA.
  * Prefer noncacheable/“coherent” buffers (e.g., dma_alloc_coherent) or a hardware-coherent path (e.g., ACP/ACE on ARM) if available.
* Alignment: align buffers to the bus width or cache-line size (often 4/8/32/64 bytes).
* Interrupt strategy: use half-transfer for steady processing; avoid tiny fragments.
* Error handling: handle FIFO overruns, bus faults, and transfer-abort paths.

**When to use (and not)**

* Use DMA for large or continuous transfers (audio, video, [sensors](https://www.ampheo.com/c/sensors), storage, comms).
* Prefer interrupts/polling for tiny, sporadic transfers, where DMA setup overhead can dominate.

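
The UART RX circular-buffer steps described earlier can be sketched as a small, host-testable C fragment. This is only an illustration under stated assumptions: `RX_BUF_LEN`, the `DMA_FLAG_*` bits, `process_block`, and `dma_rx_isr` are hypothetical names, not any vendor's API. On real hardware the flags would come from the DMA controller's status register, and the returned mask would be written to its flag-clear register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_LEN 8u                 /* hypothetical size; must be even */

/* Hypothetical flag bits; real names and values come from the MCU header. */
#define DMA_FLAG_HALF (1u << 0)       /* first half of rx_buf has been filled  */
#define DMA_FLAG_FULL (1u << 1)       /* second half of rx_buf has been filled */

static_assert((RX_BUF_LEN % 2u) == 0u, "RX_BUF_LEN must be even");

/* The DMA would write here in circular mode. Real firmware must also make
 * sure the CPU sees fresh data (volatile access, barriers, cache invalidate). */
static uint8_t rx_buf[RX_BUF_LEN];

static uint32_t rx_sum;               /* stand-in for real processing */
static size_t rx_count;               /* total bytes consumed so far  */

/* Consume one half of the buffer; here we only accumulate a checksum. */
static void process_block(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        rx_sum += data[i];
    }
    rx_count += len;
}

/* Body of the DMA ISR: takes the pending flags, returns the flags to clear. */
static uint32_t dma_rx_isr(uint32_t flags)
{
    const size_t half = RX_BUF_LEN / 2u;
    uint32_t handled = 0u;

    if (flags & DMA_FLAG_HALF) {      /* DMA is now filling the second half */
        process_block(&rx_buf[0], half);
        handled |= DMA_FLAG_HALF;
    }
    if (flags & DMA_FLAG_FULL) {      /* DMA wrapped and is refilling the first half */
        process_block(&rx_buf[half], half);
        handled |= DMA_FLAG_FULL;
    }
    return handled;
}
```

Each half is processed while the DMA writes the other one, so the work done per interrupt has to finish before the DMA comes back around to that half.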

**Typical pitfalls**

* Forgetting to enable the peripheral’s DMA request or wrong channel mapping.
* Cache bugs (stale/dirty data) on cached CPUs.
* Mis-sized burst/beat causing FIFO underruns/overruns.
* Reading/writing the buffer in the CPU while DMA is modifying it; use double buffers or producer/consumer indices.

In one line: DMA is a bus-master copy engine that moves data between memory and devices autonomously; you configure descriptors, let it stream in the background, and handle completion with lightweight interrupts.
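
The producer/consumer-index idea from the pitfalls list can be sketched as follows. It assumes the DMA runs in circular mode and exposes a down-counting "transfers remaining" register that reloads to the buffer length on wrap; that register's name and exact behaviour vary by part. `RING_LEN`, `ring_head`, and `ring_drain` are hypothetical helpers for illustration, not a library API.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_LEN 16u   /* hypothetical ring size, equal to the DMA transfer count */

/* Producer index: where the DMA will write next. `remaining` is the value
 * read from the controller's down-counting register (assumed 1..RING_LEN). */
static size_t ring_head(size_t remaining)
{
    return (RING_LEN - remaining) % RING_LEN;
}

/* Consumer side: copy unread bytes in [*tail, head) to `out`, following the
 * wrap at the end of the ring. Only the CPU ever advances *tail, so the CPU
 * never touches the region the DMA is currently writing.
 * Returns the number of bytes copied (at most out_cap). */
static size_t ring_drain(const volatile uint8_t *ring, size_t *tail,
                         size_t head, uint8_t *out, size_t out_cap)
{
    size_t n = 0u;

    while (*tail != head && n < out_cap) {
        out[n++] = ring[*tail];
        *tail = (*tail + 1u) % RING_LEN;
    }
    return n;
}
```

This sketch does not detect overruns: if the DMA laps the consumer before `ring_drain` runs, old data is silently overwritten, so the ring has to be sized for the worst-case service latency.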