Here’s the fast, practical way to understand (and predict) [FPGA](https://www.ampheo.com/c/fpgas-field-programmable-gate-array) resources from RTL: what maps to LUT/FF/BRAM/[DSP](https://www.ampheo.com/c/dsp-digital-signal-processors), and how to verify it in the tools without doing a full place-and-route.

![_qG7tfw3alS](https://hackmd.io/_uploads/ByW1TaZgWg.jpg)

**1) The only numbers that matter: run synthesis-only**

You don’t need implementation to see resources. Post-synthesis utilization is usually close to the final figure, although implementation-stage optimization can still shift the counts slightly.

**[AMD](https://www.ampheo.com/manufacturer/amd)/[Xilinx](https://www.vemeko.com/product/#xilinx) (Vivado):**

```
# Non-project mode: read sources, synthesize, write a hierarchical report.
# Tcl's glob is not recursive: this pattern matches files one directory below src/.
read_verilog [glob src/**/*.v]
synth_design -top <top>
report_utilization -hierarchical -file util.rpt

# Project mode alternative: open an existing synthesis run and report on it.
open_run synth_1
report_utilization -hierarchical
```

Useful GUIs: Open Elaborated Design (quick structure view), Open Synthesized Design → Utilization by Hierarchy, Technology Schematic (carry chains, SRLs, LUTRAM, DSP, BRAM).

**[Intel](https://www.ampheo.com/manufacturer/intel) (Quartus):**

* Run Analysis & Synthesis only.
* Open Compilation Report → Resource Utilization by Entity / Flow Summary.

**[Lattice](https://www.ampheo.com/manufacturer/lattice-semiconductor) (Radiant/iCEcube2/Trellis):**

* Run synthesis (LSE or Synplify in the vendor tools; the open-source Trellis flow uses Yosys instead). Open the Mapping/Utilization report per module.

Look at the hierarchical breakdown to see which RTL blocks are expensive.

**2) “RTL → resource” mental map (cheatsheet)**

**Combinational logic**

* `assign`, `case`, boolean ops → LUTs. Wider equations use more LUTs and may add logic levels (which costs timing).
* Wide muxes: for a w-bit, N-to-1 mux, LUT count grows roughly linearly with N. A 6-input LUT holds one 4:1 mux, so expect about w × (N−1)/3 LUTs when no dedicated mux primitives help. Logic depth is what grows with log_K(N) (K = LUT inputs; 6 on many devices).

**Registers / pipelines**

`always_ff @(posedge clk)` flops → FFs. Enable/reset use the flop's control inputs (no extra LUTs unless the control signal itself is gated).

**Add/Sub/Compare**

`+`, `-`, and comparators map to carry chains + LUTs; roughly 1 LUT/bit plus dedicated carry wiring.

**Multipliers / MAC**

`*` (and `a*b + c`) → DSP blocks if widths fit the device’s DSP size.
Otherwise the multiply is split across multiple DSPs, or falls back to LUTs if DSP inference is disabled.

**Memories**

`reg [W-1:0] mem [0:D-1];`

* Single/dual-port, synchronous read/write → Block RAM (BRAM/URAM/M20K).
* Asynchronous read, many write enables, tiny depth → often distributed RAM (LUTRAM) or FFs.

ROMs (case tables or initialized arrays) → BRAM if large/regular; else LUTs.

**Shift registers / FIFOs**

Long shift chains with enables → SRL/SRL16/SRL32 (LUT-based shifters) on Xilinx; otherwise FF chains or small BRAM FIFOs.

**State machines**

One-hot vs binary encoding changes the FF count, while the LUT count depends on the transition logic. Synthesis usually picks the encoding for timing.

**3) Nudge the mapper (portable hints)**

Prefer dedicated hardware

* Force DSP use for mults/MACs (if your tool supports it): `(* use_dsp = "yes" *)` on Xilinx, `multstyle` on Intel, or the equivalent multiplier-style attribute in the Lattice flows (check your tool's attribute reference for the accepted values).
* Pick RAM type:
  * Xilinx: `(* ram_style = "block" | "distributed" | "ultra" *)`
  * Intel: `// synthesis ramstyle = "M20K" | "MLAB"` or `(* ramstyle = "M20K" *)`

Enable SRLs (Xilinx): keep simple shift patterns; avoid async reset on every tap.

Avoid accidental LUT RAM: a single-clock RAM with a registered (synchronous) read and clean write enables maps to BRAM. An asynchronous read pushes the tool toward LUTRAM or FFs.

**4) Quick estimation rules (sanity checks)**

* Adder/subtractor: ≈ 1 LUT/bit + carry chain (timing-friendly).
* Multiplier: 1 DSP if within the native width (e.g., 25×18 on 7-series DSP48E1, 27×18 on UltraScale DSP48E2); larger → multiple DSPs.
* Block RAM need: bits = W × D. Compare to the on-chip block size (e.g., 18 Kb/36 Kb on Xilinx, 20 Kb on Intel M20K). The block count also depends on the width/depth configurations the block supports, so a wide, shallow memory can need more blocks than the bit count suggests. Add 10–20% headroom for control logic.
* SRL vs FF: long, simple shifts (no async reset per stage) → SRL; else FFs.
* Mux farms / buses: LUT count grows with every added input, and each roughly 4× increase in inputs adds a LUT level, so pipeline wide mux trees.

**5) Verify by hierarchy (what to click)**

* Vivado: after synthesis → Report Utilization → Hierarchical, sort by LUT/BRAM/DSP; double-click a heavy block → Schematic to see whether that multiplier became a [DSP](https://www.onzuu.com/category/dsp), whether memories are BRAM, etc.
* Quartus: Resource Utilization by Entity and Inferred RAM/ROM summary; open Technology Map Viewer (post-map) to see which blocks were mapped to DSP and RAM primitives.

**6) Common pitfalls that skew resources**

* Async reads/writes or two writes/clock → can break BRAM inference → LUT/FF explosion.
* Resets on every stage of long shifters → prevent SRL mapping.
* Disabled [DSP](https://www.ampheoelec.de/c/dsp-digital-signal-processors) inference (global setting or attribute) → multipliers burn LUTs.
* Tiny memories sprinkled everywhere → many LUTRAMs; consider packing them into one BRAM with a shared controller.
* Over-wide buses without pipelining → deep LUT levels and timing pressure (and sometimes more logic duplication).

**7) Minimal experiment pattern**

Create a tiny, isolated module for the structure you care about (e.g., a 64×16 dual-port RAM, or a 24×24 multiplier with pipeline registers), synthesize it out-of-context, and check the utilization. Iterate until the mapping is what you want, then drop it into the main design.
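
As a concrete starting point for the experiment pattern in section 7, here is a minimal sketch of the two structures mentioned there. The module names (`exp_ram`, `exp_mult`), port names, and pipeline depth are hypothetical choices for illustration, and the attributes assume Vivado; other tools would need their own equivalents from section 3.

```systemverilog
// Hypothetical experiment module: 64-deep x 16-bit simple dual-port RAM
// (one write port, one read port, single clock).
module exp_ram #(
  parameter int W = 16,   // data width in bits
  parameter int D = 64    // depth in words
) (
  input  logic                 clk,
  input  logic                 we,
  input  logic [$clog2(D)-1:0] waddr,
  input  logic [W-1:0]         wdata,
  input  logic [$clog2(D)-1:0] raddr,
  output logic [W-1:0]         rdata
);
  // Assumption: Vivado attribute. Change "block" to "distributed", or
  // remove the attribute, and compare the utilization reports.
  (* ram_style = "block" *) logic [W-1:0] mem [0:D-1];

  always_ff @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata <= mem[raddr];   // registered (synchronous) read
  end
endmodule

// Hypothetical experiment module: 24x24 signed multiplier with input,
// product, and output registers (three clock cycles of latency).
// Assumption: Vivado attribute requesting DSP mapping.
(* use_dsp = "yes" *)
module exp_mult (
  input  logic               clk,
  input  logic signed [23:0] a, b,
  output logic signed [47:0] p
);
  logic signed [23:0] a_q, b_q;   // input registers
  logic signed [47:0] m_q;        // product register

  always_ff @(posedge clk) begin
    a_q <= a;
    b_q <= b;
    m_q <= a_q * b_q;   // no reset, so the registers can pack into the DSP
    p   <= m_q;
  end
endmodule
```

In Vivado, each module can be synthesized on its own with `synth_design -top exp_ram -mode out_of_context` followed by `report_utilization`. Two things are worth checking in the report: whether the small RAM lands in a block RAM or in LUTRAM once the attribute is changed, and how many DSPs the 24×24 multiply takes, since it is wider than the native multiplier on many devices and may be split across more than one block.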