# Hopper (SM90) Tensor Core Programming: A Technical Reference
**Scope:** This document is a reference for writing high-performance kernels that target the NVIDIA Hopper architecture (compute capability 9.0, architecture label `sm_90a`) — specifically the H100 family. It covers the hardware model, the async proxy memory model, mbarriers, TMA, WGMMA, thread block clusters/DSMEM, producer–consumer warp specialization, and resource budgeting, with verified PTX, CUDA/CuTe C++, CuTe-DSL (Python), and Triton/Gluon examples.
**Citation conventions.** Every factual claim carries a parenthetical citation. The most common short-forms are:
- **PTX ISA 9.2, §X.Y.Z** — NVIDIA Parallel Thread Execution ISA, v9.2 (`docs.nvidia.com/cuda/parallel-thread-execution/`)
- **Hopper Tuning Guide, §X.Y.Z** — NVIDIA Hopper Tuning Guide v13.1 (`docs.nvidia.com/cuda/hopper-tuning-guide/`)
- **Colfax WGMMA Tutorial** — `research.colfax-intl.com/cutlass-tutorial-wgmma-hopper/`
- **Colfax TMA Tutorial** — `research.colfax-intl.com/tutorial-hopper-tma/`
- **Colfax Pipelining Tutorial** — `research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/`
- **PyTorch TMA Blog** — `pytorch.org/blog/hopper-tma-unit/`
- **PyTorch WS Blog** — `pytorch.org/blog/warp-specialization/`
- **Yadav, Unweaving WS** — `rohany.github.io/blog/warp-specialization/`
- **Veitner Pipelining Blog** — `veitner.bearblog.dev/cutedsl-on-hopper-pipelining/`
- **Triton TTNG Dialect Docs** — `triton-lang.org/main/dialects/TritonNvidiaGPUOps.html`
Anything I could not independently verify from a primary source is marked `[UNVERIFIED]`. When sources conflict, the PTX ISA and the CUDA/Hopper Tuning Guide are treated as ground truth.
---
## Section 1: H100 SM Hardware Model
### 1.1. GPU- and SM-level resource counts
**Die vs. product distinction.** The Hopper silicon exists in multiple SKUs with different enabled-SM counts. These matter for persistent-kernel grid sizing, tile scheduling, and L2 reasoning:
- **Full GH100 die:** 144 SMs total (per Wikipedia, citing the NVIDIA Hopper architecture whitepaper; note the architecture whitepaper states "up to 144 SMs"). Organized as 8 GPCs × 18 SMs per GPC `[UNVERIFIED by official NVIDIA document — Aleksa Gordic's "Inside NVIDIA GPUs" blog cites 8 GPCs × 18 SMs; NVIDIA's Hopper whitepaper states "up to 144 SMs" but I could not verify the GPC breakdown from a first-party NVIDIA source]`.
- **H100 SXM5 product:** 132 SMs enabled `[UNVERIFIED — commonly cited across community sources such as Aleksa Gordic's blog, but I could not confirm 132 from a direct NVIDIA datasheet fetch in this session]`.
- **H100 PCIe product:** 114 SMs enabled `[UNVERIFIED — again a commonly cited number, not confirmed against a first-party NVIDIA datasheet in this session]`.
- **H200 SXM:** same GH100 die with 132 SMs enabled, different (HBM3e) memory configuration — `[UNVERIFIED in this session]`.
**Why care about the difference.** A persistent kernel that hardcodes "grid = 132" launches 12 fewer CTAs than a full 144-SM GH100 die could host, and 18 more than a 114-SM PCIe H100 has SMs, over-subscribing it. Read the SM count from `cudaGetDeviceProperties().multiProcessorCount` at launch time rather than hardcoding it; a minimal sketch follows.
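The sketch below uses only the CUDA runtime API; the `persistent_kernel` launch in the usage comment is a placeholder, not taken from any cited source.
```cpp
#include <cuda_runtime.h>

// Sketch: size a persistent grid from the device's actual SM count.
int persistent_grid_size(int device = 0) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, device);
  return prop.multiProcessorCount;   // e.g. 132 on H100 SXM5, 114 on H100 PCIe
}
// Usage: persistent_kernel<<<persistent_grid_size(), block_dim, smem_bytes>>>(...);
```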
### 1.2. Per-SM resources (verified)
From the Hopper Tuning Guide §1.4.1.1 ("Occupancy"):
- **Maximum concurrent warps per SM: 64** — same as Ampere (Hopper Tuning Guide, §1.4.1.1). That is 64 × 32 = **2048 resident threads per SM**.
- **Register file: 65,536 × 32-bit registers per SM** (the guide writes "64K 32-bit registers per SM", Hopper Tuning Guide, §1.4.1.1). At 4 bytes each, this is **256 KB of RF per SM**.
- **Maximum registers per thread: 255** (Hopper Tuning Guide, §1.4.1.1).
- **Maximum thread blocks per SM: 32** for compute capability 9.0 (Hopper Tuning Guide, §1.4.1.1).
- **Maximum shared memory per SM: 228 KB** (Hopper Tuning Guide, §1.4.1.1). This is a 39% increase over A100's 164 KB.
- **Maximum shared memory per thread block: 227 KB** (Hopper Tuning Guide, §1.4.1.1). CUDA reserves 1 KB per block.
- **Configurable carveouts: 0, 8, 16, 32, 64, 100, 132, 164, 196, 228 KB** (Hopper Tuning Guide, §1.4.2.4).
- **Combined L1 + texture + shared: 256 KB per SM** (Hopper Tuning Guide, §1.4.2.4), a 33% increase over A100.
- **Static SMEM still capped at 48 KB** for architectural compatibility; larger allocations require the dynamic-SMEM opt-in (`cudaFuncAttributeMaxDynamicSharedMemorySize`) (Hopper Tuning Guide, §1.4.2.4).
**Tensor cores and warp schedulers per SM.** The Hopper Tuning Guide does not enumerate these exact counts in its text, but it is widely reported that:
- **4 warp schedulers per SM** (one per processing block / SM partition; each SM is split into 4 SM partitions), and
- **4 Tensor Cores per SM** (one per processing block).
This matches the four-quadrant SM diagram in Yadav's "Unweaving Warp Specialization" (Yadav, Unweaving WS: "A Hopper SM (pictured below) has four execution contexts that can host an active warp ... At any cycle, at most 4 warps may issue instructions into the SM to execute."). This is cross-confirmed by the WGMMA Tutorial comment ("there are 4 Tensor Cores per SM", Colfax WGMMA Tutorial, comment by Ryo). I flag this as **confirmed by two secondary sources; not located in a first-party NVIDIA document in this session**.
### 1.3. The warpgroup
The warpgroup is the unit of issue for WGMMA. Its definition is identical across the PTX ISA and every secondary source:
- **A warpgroup is exactly 4 contiguous warps, i.e. 128 contiguous threads** (PTX ISA 9.2, §9.7.15.1, "Warpgroup"; Colfax WGMMA Tutorial: "A warpgroup consists of four contiguous warps, i.e., 128 contiguous threads").
- The **warp-rank of the first warp in a warpgroup must be a multiple of 4** (Colfax WGMMA Tutorial; PTX ISA 9.2, §9.7.15.1 has the same constraint).
- `wgmma.mma_async` is executed **collectively by all 128 threads** in the warpgroup; the `.sync.aligned` qualifier enforces this (PTX ISA 9.2, §9.7.15.5; Colfax WGMMA Tutorial).
**Why the warpgroup is the unit of issue.** Hopper's tensor cores are physically laid out so that a single `wgmma.mma_async` instruction fans out across the four tensor cores in the SM (one per processing block). The four warps of a warpgroup each map 1:1 to one of the four tensor cores, so the full warpgroup together drives the complete MMA. This is why the accumulator fragment is distributed across all 128 threads — not 32, not 256.
### 1.4. SM execution model and implications for performance engineering
At most **four warps issue instructions per cycle** on a Hopper SM — one per warp scheduler / processing block (Yadav, Unweaving WS, confirmed visually against the H100 architecture diagram).
This is the single most important fact for reasoning about Hopper kernels. Three immediate consequences:
1. **Peak issue throughput is fixed at 4 instructions/cycle per SM regardless of occupancy.** 4 warps vs. 64 warps does not change peak issue rate. More warps only help by giving the four schedulers more ready warps to pick from when some are stalled.
2. **Warp specialization is a scheduling technique, not a throughput technique.** Splitting a kernel into producer and consumer warps gives the schedulers more independent instruction streams to interleave when one stream blocks. Yadav derives this from first principles (Yadav, Unweaving WS): warp specialization is useful when (a) resource constraints (register-per-thread limit, etc.) require partitioning, (b) variable-latency instructions are hard to statically schedule (memory operations, in particular), or (c) blocking synchronization interferes with instruction issue.
3. **Warpgroup-level synchronization is cheap; block-level is not.** Because `wgmma.fence.sync.aligned` and friends synchronize across 4 warps, they are much lighter than `__syncthreads()` across 8 or 16 warps of a large block. This is a key reason warp-specialized mainloops on Hopper use per-warpgroup fences rather than block-wide barriers in the inner loop.
---
## Section 2: The Async Proxy Memory Model
Hopper introduces asynchronous hardware units (TMA, the async wgmma pipeline) that access shared and global memory through a **separate "async proxy"** from the conventional load/store pipeline (which operates through the **"generic proxy"**). The memory consistency model treats these as two different observation domains that may see each other's writes **out of order** unless an explicit fence is inserted.
### 2.1. What lives in which proxy
- **Generic proxy:** `ld`, `st`, `ld.shared`, `st.shared`, `ld.global`, `st.global`, `cp.async` (Ampere-style async copy), `ldmatrix`, `stmatrix`, atomics other than async ones (PTX ISA 9.2, §8.6 "Proxies", and §9.7.9.25.2 "Async Proxy").
- **Async proxy:** `cp.async.bulk`, `cp.async.bulk.tensor` (all TMA variants), `cp.reduce.async.bulk[.tensor]`, and `wgmma.mma_async` (PTX ISA 9.2, §9.7.9.25.2 "Async Proxy"; PTX ISA 9.2, §9.7.15.4 "Async Proxy" — the wgmma-specific proxy section).
The "Hitchhiker's Guide to TMA" puts it pragmatically: writes by one proxy are not automatically visible to the other proxy (Accelerated Computing Academy, Fall 2025 TMA interface notes).
### 2.2. When `fence.proxy.async.shared::cta` is required
The rule, cross-checked across PTX ISA 9.2 §8.6 and the Colfax WGMMA Tutorial's "Synchronization for WGMMA" section:
- **Generic → Async:** If a thread writes SMEM through the generic proxy (e.g., a plain `st.shared` or `cp.async`), and a subsequent `wgmma.mma_async` (async proxy) or TMA store (async proxy) reads that SMEM, a **`fence.proxy.async.shared::cta`** is **required** before the async op. Without it, the async unit may not see the generic proxy's writes. Silent wrong answer.
- **Async → Async:** If SMEM was filled by **TMA** (async proxy) and then consumed by **`wgmma.mma_async`** (also async proxy), **no proxy fence is needed**. The two operations are already in the same proxy. (Colfax WGMMA Tutorial: *"if operations performed in the generic proxy affect the SMEM read by `wgmma.mma_async`, we need to issue `fence.proxy.async`. For example, this would be the case if we copied A and B into SMEM via ordinary `ld.global`/`st.shared` operations. Since we use TMA load, we don't need `fence.proxy.async` in our example, and indeed it doesn't appear in the WGMMA tutorial code or in the mainloop of CUTLASS Hopper GEMM kernels."*)
- **Async → Generic:** Writes done by the async proxy (e.g., TMA load writing SMEM) are visible to generic-proxy reads (e.g., ordinary `ld.shared` from a consumer warp) once the completion mechanism has fired (mbarrier phase flip or async-group wait). No separate proxy fence is required beyond that. (PTX ISA 9.2, §8.6; this is implicit in the way the mbarrier completion mechanism is specified.)
- **TMA store prolog:** If SMEM was filled by generic-proxy writes (`st.shared` from consumer warps writing the epilogue tile) and then a **TMA store** is issued (async proxy), you **must** `fence.proxy.async.shared::cta` before the store (Colfax TMA Tutorial, "Memory fence" subsection: *"on line 35 we have the CuTe method `tma_store_fence()` that wraps the PTX instruction `fence.proxy.async.shared::cta`"*).
**Practical rule of thumb for Hopper GEMM mainloops:**
- TMA load → WGMMA consume: **no fence** (same async proxy).
- Generic stores (register → SMEM via `stmatrix`, etc.) → TMA store: **fence required**.
- Stores to SMEM via generic proxy before a WGMMA reads that SMEM: **fence required**.
### 2.3. PTX syntax for the fences
```ptx
// Before a TMA store that depends on generic-proxy SMEM writes:
fence.proxy.async.shared::cta; // Colfax TMA Tutorial confirms exact spelling
```
`.shared::cta` is the narrowest scope — only visibility within the CTA. There are also `.shared::cluster` and `.global` scoped variants (PTX ISA 9.2, §9.7.13.4 "membar / fence"). Use the narrowest scope that covers the set of threads that observe both sides.
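In CUDA C++ this is a one-line inline-asm wrapper (a sketch of what CuTe's `tma_store_fence()` boils down to, per the Colfax TMA Tutorial; the function name here is ours):
```cpp
// Sketch: generic -> async proxy fence at CTA scope. Call after generic-proxy SMEM
// writes (st.shared / stmatrix) and before the TMA store or WGMMA that reads them.
__device__ __forceinline__ void fence_proxy_async_shared_cta() {
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
}
```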
---
## Section 3: mbarrier — The Async Transaction Barrier
### 3.1. Prerequisite: what barriers and fences are
GPU synchronization has two conceptually distinct jobs, both of which are needed for correctness:
- A **fence** orders memory operations *by a single thread* (or within a scope). After a release fence, any observer that later does an acquire fence is guaranteed to see writes issued before the release. Fences do not wait for anybody — they reorder the local thread.
- A **barrier** makes a set of threads *wait* until all of them (or a specified count) have arrived. A barrier typically includes a memory-ordering effect (release on the arrive side, acquire on the wait side), so after the barrier, all participants see all writes issued before the arrives.
In CUDA pre-Hopper, the main barrier was `__syncthreads()` (PTX `bar.sync 0`), which counts threads; Ampere added the **mbarrier**, an asynchronous arrive/wait barrier used alongside `cp.async`. Hopper extends the mbarrier so that, in addition to counting thread arrivals, it tracks **transaction bytes** delivered by asynchronous engines (specifically, TMA).
### 3.2. The mbarrier object
(PTX ISA 9.2, §9.7.13.15.1–§9.7.13.15.8)
- **Size and alignment:** an mbarrier is a single **64-bit** (`.b64`) object in **shared memory**, aligned to 8 bytes (PTX ISA 9.2, §9.7.13.15.1).
- **State space:** `.shared::cta` (CTA-local) or `.shared::cluster` (cluster-visible). The cluster-scoped variant is what makes cross-CTA synchronization within a thread block cluster possible (PTX ISA 9.2, §9.7.13.15.8).
- **Counters tracked** (PTX ISA 9.2, §9.7.13.15.2):
1. **Expected arrival count**, set at `mbarrier.init` time.
2. **Pending arrival count**, which starts equal to the expected arrival count and decrements as `mbarrier.arrive`-family ops are issued.
3. **Transaction count (tx-count)**, which tracks bytes delivered by async ops like TMA. The barrier's current phase completes only when both the pending arrival count hits zero **and** the tx-count reaches zero.
- **Phase bit:** a single bit that starts at 0 after `mbarrier.init` and **flips each time the current phase completes** (i.e., arrivals and tx-count both satisfied) (PTX ISA 9.2, §9.7.13.15.4). This is how the barrier is reusable without re-initialization — you track whose turn (0 or 1) this phase is in your own code.
**How TMA uses tx-count.** Before issuing a TMA load, a single thread calls `mbarrier.arrive.expect_tx` with a byte count equal to the number of bytes the TMA load will deliver. The TMA load, on completion, invokes the `complete_tx::bytes` completion mechanism on the same mbarrier, decrementing the tx-count by that number of bytes. The barrier's phase flips when both counts hit zero. This lets the consumer wait on a single object, without caring how many separate TMA operations the producer actually issued — the total byte-count is the unit of agreement (PTX ISA 9.2, §9.7.13.15.5; Colfax TMA Tutorial).
### 3.3. Relevant PTX instructions (verified syntax)
From PTX ISA 9.2 §9.7.13.15.9 through §9.7.13.15.16 and cross-referenced with the Colfax TMA Tutorial's quotations of PTX:
**Initialization** (PTX ISA 9.2, §9.7.13.15.9):
```ptx
mbarrier.init.shared::cta.b64 [mbar_addr], count;
// mbar_addr: shared-memory address of the .b64 object
// count: expected arrival count (unsigned 32-bit)
```
The CuTe method `initialize_barrier(bar, count)` wraps this (Colfax TMA Tutorial).
**Arrive and set expected transaction bytes** (PTX ISA 9.2, §9.7.13.15.11 "mbarrier.expect_tx", and §9.7.13.15.13 "mbarrier.arrive"; the `.arrive.expect_tx` variant fuses them):
```ptx
mbarrier.arrive.expect_tx.shared::cta.b64 _, [mbar_addr], tx_count;
// _ is the "token" destination; we typically discard it and use .parity waits instead.
// tx_count: 32-bit unsigned byte count the producer promises to deliver.
```
Wrapped by CuTe as `set_barrier_transaction_bytes(bar, bytes)` (Colfax TMA Tutorial).
**Arrive without tx-count** (PTX ISA 9.2, §9.7.13.15.13):
```ptx
mbarrier.arrive.shared::cta.b64 _, [mbar_addr];
```
Used by consumer warps to signal "I'm done" on the empty-barrier side of a ping-pong pipeline.
**Try-wait (blocking, may spuriously return, requires loop)** (PTX ISA 9.2, §9.7.13.15.16):
```ptx
mbarrier.try_wait.parity.shared::cta.b64 %pred, [mbar_addr], %phase;
// %pred: predicate destination; true if the awaited phase has completed
// %phase: the parity (0 or 1) of the phase whose completion the caller is waiting for
// Optional third operand: suspendTimeHint (nanoseconds), omitted here
```
Per the CUDA Programming Guide §4.9 (Asynchronous Barriers): *"An even phase has integer parity 0 and an odd phase has integer parity 1. When we initialize a barrier, its phase has parity 0."* The phase argument tells the PTX instruction the **parity it is waiting on**. It may time out and return `false` even if the phase has not yet completed (PyTorch/Academic TMA writeups: "a single call to try_wait does not mean that the phase has completed, it may just mean the timeout was reached"), so **it must be used in a spin loop** (Triton's MLIR operation `nvvm.mbarrier.try_wait.parity` "generates a loop that enforces the test to complete", per MLIR NVVM dialect docs).
**Test-wait (non-blocking, single poll)** (PTX ISA 9.2, §9.7.13.15.16):
```ptx
mbarrier.test_wait.parity.shared::cta.b64 %pred, [mbar_addr], %phase;
// Same operands as try_wait, but this never suspends: it just reports
// "has the phase completed?" as a predicate. Use in non-blocking polling code.
```
**There is no blocking `mbarrier.wait.parity`.** The PTX ISA provides only these two non-blocking forms; any "blocking wait" is always a software spin loop around `try_wait.parity`. (This is explicit in the MLIR NVVM dialect docs for `nvvm.mbarrier.try_wait.parity`, and cross-confirmed by the "Hitchhiker's Guide to TMA" from Stanford Accelerated Computing Academy. The PyTorch TMA blog quotes the lowered Triton instruction as a `.try_wait.parity` as well.) When Colfax writes that "`try_wait` ... indicates that the wait is a blocking instruction," they mean it *can* suspend the warp, not that it guarantees completion on return — correctness always requires the loop.
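Because the blocking wait is always a software loop, the device-side helper is typically written as a spin on the predicate (a sketch of the same loop structure CUTLASS's cluster-barrier `wait` uses; `mbar_addr` is the barrier's 32-bit shared-space address and the helper name is ours):
```cpp
// Sketch: "blocking" mbarrier wait = spin on try_wait.parity until it succeeds.
__device__ __forceinline__ void mbarrier_wait_parity(uint32_t mbar_addr, uint32_t phase) {
  asm volatile(
      "{\n"
      "  .reg .pred P;\n"
      "LAB_WAIT:\n"
      "  mbarrier.try_wait.parity.shared::cta.b64 P, [%0], %1;\n"
      "  @P bra.uni DONE;\n"
      "  bra.uni LAB_WAIT;\n"
      "DONE:\n"
      "}\n"
      :
      : "r"(mbar_addr), "r"(phase));
}
```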
**Cluster-scoped variants** (PTX ISA 9.2, §9.7.13.15.8): the arrival-side instructions (`mbarrier.arrive`, `mbarrier.arrive_drop`, `mbarrier.expect_tx`, `mbarrier.complete_tx`) accept a `.shared::cluster` state-space qualifier, so a thread may arrive on an mbarrier that lives in another CTA's shared memory, addressed through the DSMEM window (§6.3). `mbarrier.init`, `mbarrier.inval`, and the two wait instructions take only `.shared::cta`, i.e. a CTA initializes, invalidates, and waits on its own mbarriers:
```ptx
mbarrier.init.shared::cta.b64 [mbar_addr], count;           // always on the owning CTA
mbarrier.arrive.shared::cluster.b64 _, [remote_mbar_addr];  // remote arrival via DSMEM
```
See also `barrier.cluster.arrive` and `barrier.cluster.wait` (PTX ISA 9.2, §9.7.13.3 "barrier.cluster"), which are named barriers at cluster scope used for bootstrap.
**Invalidation** (PTX ISA 9.2, §9.7.13.15.10):
```ptx
mbarrier.inval.shared::cta.b64 [mbar_addr];
```
Required before the memory backing an mbarrier is reused for something else (Triton TTNG docs on `ttng.inval_barrier`).
### 3.4. Phase tracking for a circular buffer of N stages
After `mbarrier.init`, the barrier's phase is 0. When `arrivals == expected AND tx-count == 0` fire, the phase flips to 1. Next time, it flips back to 0. And so on.
In a **circular buffer with N stages**, each stage has its own mbarrier. After the producer has issued `N` loads, the producer is now about to load into stage 0 again — for the **second time** — and the mbarrier for stage 0 has been re-armed by the consumer's release. The phase the producer waits for is now **1**. On the third pass through stage 0, phase 0 again. So:
```
stage_phase[s] = (iteration / N) & 1
```
In code, this is usually tracked by a thread-local integer that starts at 0, advances with the loop counter, and XORs the phase bit when the stage index wraps. The CUTLASS `PipelineState<N>` class encapsulates exactly this (Colfax Pipelining Tutorial: *"the index is incremented modulo N, and the phase is flipped when the index increments to 0."*):
```cpp
// Sketch — not a literal quote.
template <int N>
struct PipelineState {
  int index = 0;   // 0 .. N-1
  int phase = 0;   // 0 or 1
  void operator++() {
    if (++index == N) { index = 0; phase ^= 1; }
  }
};
```
When the producer waits on the **empty_barrier** for stage `index`, and the consumer waits on the **full_barrier** for stage `index`, both use `phase` as the parity argument to `try_wait.parity`.
---
## Section 4: TMA — The Tensor Memory Accelerator
### 4.1. What TMA offloads, and why it matters for performance
TMA is a single-threaded, asynchronous hardware copy engine between global memory and shared memory, introduced on Hopper (Hopper Tuning Guide, §1.4.1.2). Its hardware responsibilities include:
- **Address generation** from a descriptor + per-kernel coordinates — no software loop computing pointers (Hopper Tuning Guide, §1.4.1.2; PyTorch TMA Blog: *"TMA handles the address computation for the shared memory destination where the data requested should be placed."*).
- **Strided and multi-dimensional access** (1D through 5D tensors) without per-thread index math (Hopper Tuning Guide, §1.4.1.2).
- **Bounds checking / out-of-bounds predication** — TMA predicates OOB reads and fills with a default (see §4.4) (Colfax TMA Tutorial: *"the TMA copy unit will necessarily predicate the memory copy to not read out-of-bounds"*).
- **Swizzle application** so the arriving SMEM tile is already in a layout compatible with `ldmatrix` and/or WGMMA (PyTorch TMA Blog: *"TMA can lay out the arriving data based on swizzling instructions, to ensure the arriving data can be read as fast as possible by consumers, as the swizzling pattern helps avoid shared memory bank conflicts."*).
**Why this matters for performance engineering.** Before TMA, bulk GMEM→SMEM copy required many threads (and many registers holding intermediate pointers) to compute addresses and issue `ld.global/st.shared` pairs — exactly the "address arithmetic taxes a whole warpgroup of registers" problem that makes warp specialization expensive on pre-Hopper architectures. With TMA, **a single thread** issues a copy instruction and the rest of the block can immediately do something else (Hopper Tuning Guide, §1.4.1.2). This is the hardware enabler that makes Hopper-style warp specialization (tiny producer warpgroup, register-heavy consumer warpgroups) pay off — see Sections 7 and 9.
### 4.2. cuTensorMap descriptor (host-side setup)
TMA loads take a `cuTensorMap` (CUtensorMap) object as a descriptor operand, created on the host via the CUDA driver API `cuTensorMapEncodeTiled` (and variants) (Colfax TMA Tutorial, quoting the CUDA driver API). In CUTLASS/CuTe, the descriptor is constructed by `cute::make_tma_copy`, which internally calls `cuTensorMapEncodeTiled` and returns an object aliased to `CUtensorMap` (Colfax TMA Tutorial: *"this descriptor object is an alias for `CUtensorMap`"*).
The parameters to `cuTensorMapEncodeTiled`, per the CUDA Driver API (as referenced from the Colfax TMA Tutorial and the PyTorch TMA blog), are:
| Parameter | Purpose |
|---|---|
| `tensorDataType` | Element type (FP16, BF16, FP32, FP8, TF32, INT8, etc.) |
| `tensorRank` | Number of dimensions, 1–5 |
| `globalAddress` | GMEM pointer to the tensor |
| `globalDim[rank]` | Size along each GMEM dim (in elements) |
| `globalStrides[rank-1]` | Strides along outer dims (in bytes, must be multiples of 16) |
| `boxDim[rank]` | Tile/box size in each dim (SMEM tile shape, in elements) |
| `elementStrides[rank]` | Element-level strides (typically all 1) |
| `interleave` | Interleave layout — NONE, 16B, 32B |
| `swizzle` | Swizzle mode — NONE, 32B, 64B, 128B (see §4.5) |
| `l2Promotion` | L2 cache policy (NONE, 64B, 128B, 256B) |
| `oobFill` | OOB fill behavior (ZERO or NAN request) |
(Parameter names and high-level purposes match the NVIDIA CUDA driver API documentation as cited in the Colfax TMA Tutorial. The exact enumerator identifiers — `CU_TENSOR_MAP_*` — are in the CUDA driver headers and are not quoted here to avoid reproducing them verbatim.)
Key constraints from the Colfax TMA Tutorial:
- **16-byte boundary on GMEM strides**: *"we need to assume that the tile being copied has (i) a contiguous direction (stride 1), and (ii) other strides as multiples of 16 bytes. This is asserted in the CUTLASS codebase."* For a row-major `(m, n)` float matrix, this means `n % 4 == 0`.
- The **descriptor is `__grid_constant__ const`** when passed by value to a kernel (Colfax TMA Tutorial: *"the `tma_load` argument for the kernel must be annotated with `__grid_constant__ const`."*).
### 4.3. L2 promotion and why it saves bandwidth
TMA's descriptor has an L2 promotion hint (64B / 128B / 256B) that biases the L2 cache policy toward treating TMA-loaded lines as reusable across CTAs. Because TMA is asynchronous and runs through the async proxy, the hardware can keep tensor tiles resident in L2 when the same input tile is consumed by many CTAs (typical for GEMM, where each column of `B` is used by every row of `A`). This is in addition to **TMA multicast** (§4.7), which is a more targeted mechanism for sharing inputs across CTAs in the same cluster.
The Hopper Tuning Guide §1.4.2.2 ("Increased L2 Capacity") confirms that *"the NVIDIA Hopper architecture allows CUDA users to control the persistence of data in L2 cache similar to the NVIDIA Ampere GPU Architecture."* The TMA-level L2 promotion is an ergonomic wrapper — the programmer sets it once in the descriptor and then every TMA operation benefits.
### 4.4. Out-of-bounds handling
TMA handles remainder tiles (the last tile in each dim, when the problem size isn't a multiple of the box size) **natively via predication**: it will not read off the end of the tensor, and the corresponding SMEM slots are filled with zero or NaN depending on the `oobFill` setting (Colfax TMA Tutorial: *"we can dispense with this assumption entirely. Instead of needing to handle the out-of-bounds logic ourselves when loading in remainder tiles from GMEM to SMEM, the TMA copy unit will necessarily predicate the memory copy to not read out-of-bounds."*).
This is a substantial simplification vs. pre-Hopper, where you needed explicit predicate tensors and per-thread bounds math.
### 4.5. Swizzle modes and why they must match WGMMA
There are **four TMA swizzle modes** that TMA can apply as it writes tiles into SMEM (Colfax WGMMA Tutorial; PTX ISA 9.2, §5.5.7 "Swizzling Modes"):
- **None** (implicit 16B atom — no cross-element mixing).
- **32B swizzle** — permutes groups of 2 consecutive 16-byte segments.
- **64B swizzle** — permutes groups of 4 consecutive 16-byte segments.
- **128B swizzle** — permutes groups of 8 consecutive 16-byte segments.
**Why they exist: bank conflict avoidance.** Hopper SMEM has 32 banks of 4 bytes each. In a naive row-major `(M, K)` SMEM tile of FP16, the row pitch is `K` × 2 bytes; whenever that pitch is a multiple of 128 bytes (32 banks × 4 bytes), elements in the same column of successive rows land in the same bank, so the strided, column-direction accesses issued by `ldmatrix` or the WGMMA core-matrix reads conflict. Swizzling XORs the low bits of the address with bits from a higher dimension, so that SMEM accesses from `ldmatrix` or WGMMA descriptors land in distinct banks.
**Critical constraint: the TMA descriptor's swizzle must match the WGMMA SMEM descriptor's swizzle.** The two components of the hardware pipeline don't negotiate — if TMA writes the tile with 128B swizzle but the WGMMA descriptor says 64B swizzle, the WGMMA will interpret the SMEM bytes under the wrong permutation and produce **silently wrong output** (no hardware fault, no runtime error; just wrong numbers). This is easy to see in the CUTLASS GMMA layout atoms: the CUTLASS convention (Colfax WGMMA Tutorial) ties the TMA swizzle to the SMEM layout atom:
```cpp
// Layout atoms that set TMA swizzle = 128B and the compatible SMEM layout.
GMMA::Layout_MN_SW128_Atom<T>
GMMA::Layout_K_SW128_Atom<T>
// (Atoms for 64B, 32B, and "no swizzle" / 16B INTER are analogous.)
```
(Colfax WGMMA Tutorial, "SMEM layout constraints for WGMMA".)
The length of the layout atom in the contiguous direction equals the number of swizzle bytes — 16B for none, 32/64/128 for the others (Colfax WGMMA Tutorial). For FP16, 128B ÷ 2 = 64 elements, which is why the canonical SMEM tile for 128B swizzle has a leading-dim extent of 64 FP16 elements.
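A CuTe-flavored sketch of how CUTLASS keeps the two sides in agreement (the element type, the 64×64 tile, and the 3-stage count are illustrative, not quoted from the tutorial): the SMEM layout is produced by tiling the swizzle atom, and `cute::make_tma_copy` later derives the TMA descriptor's swizzle mode from that same layout, so a mismatch cannot even be expressed.
```cpp
// Sketch (CUTLASS/CuTe, sm_90a): one source of truth for the swizzle.
#include <cute/tensor.hpp>
#include <cute/atom/mma_traits_sm90_gmma.hpp>
using namespace cute;

using ElementA    = cute::half_t;
// SMEM layout: 128B-swizzle GMMA atom tiled up to (bM = 64, bK = 64, stages = 3).
using SmemLayoutA = decltype(tile_to_shape(
    GMMA::Layout_K_SW128_Atom<ElementA>{},
    make_shape(Int<64>{}, Int<64>{}, Int<3>{})));
// make_tma_copy(SM90_TMA_LOAD{}, gmem_tensor, SmemLayoutA{}(_, _, 0)) reads the
// swizzle back out of this layout and encodes the matching mode into the
// CUtensorMap it builds, so the TMA write and the WGMMA descriptor agree.
```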
### 4.6. `elect.sync` — picking the single TMA-issuing thread
TMA operations are "single-thread" instructions in the sense that only one thread in the warpgroup should actually issue each copy; the rest must not duplicate it. The PTX primitive for this is `elect.sync`, which deterministically elects one thread out of an active mask (PTX ISA 9.2, §9.7.13.14).
**Exact syntax, verified against the PTX emitted by Triton (quoted in the PyTorch TMA Blog):**
```ptx
elect.sync _|%pred, 0xffffffff;
// ^ ^ ^
// | | membermask (which threads participate; 0xffffffff = all 32 in the warp)
// | destination predicate: true for exactly one (the elected) lane, false for others
// destination register for the elected lane-id (underscore = ignore)
```
**The operand order matters.** Note the `_|%pred`, not `%pred|_`. The left side is the elected-lane-id destination (32-bit reg), and the right side is the boolean predicate destination. The PyTorch TMA Blog shows exactly this form in compiled Triton output: `elect.sync _|%p7, 0xffffffff;`.
In CUTLASS, this is wrapped as `cute::elect_one_sync()` (Colfax Pipelining Tutorial uses it: *"we make sure that only the first warp is involved in this, and that one thread, chosen using `elect_one_sync`, is responsible for the TMA call."*).
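A raw-CUDA sketch of the same election, for kernels that don't use CuTe (the wrapper name is ours; it assumes all 32 lanes of the warp are active, matching the `0xffffffff` membermask):
```cpp
// Sketch: returns true for exactly one lane of the calling warp, false for the rest.
__device__ __forceinline__ bool elect_one_in_warp() {
  uint32_t is_elected;
  asm volatile(
      "{\n"
      "  .reg .pred P;\n"
      "  elect.sync _|P, 0xffffffff;\n"
      "  selp.u32 %0, 1, 0, P;\n"
      "}\n"
      : "=r"(is_elected));
  return is_elected != 0;
}
// Typical use: if (warp_idx == 0 && elect_one_in_warp()) { /* issue the TMA */ }
```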
### 4.7. Full TMA load PTX
For a 2D TMA tile load from GMEM to CTA-local SMEM with mbarrier-based completion, the full PTX (quoted verbatim from the Colfax TMA Tutorial, which quotes the CUTLASS codebase):
```ptx
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
[%0], [%1, {%3, %4}], [%2];
// Operands (in order as they appear in the inline asm above):
// %0 = smem_int_ptr — destination SMEM pointer (.shared::cta or .shared::cluster)
// %1 = gmem_int_desc — pointer to the cuTensorMap descriptor
// %3, %4 = crd0, crd1 — 2D coordinates into the GMEM tensor (per dimension)
// %2 = smem_int_mbar — pointer to the mbarrier object tracking this load
```
General syntax (PTX ISA 9.2, §9.7.9.25.5.2 "cp.async.bulk.tensor", quoted verbatim from the Colfax TMA Tutorial):
```
cp.async.bulk.tensor.dim.dst.src{.load_mode}.completion_mechanism
{.multicast}{.level::cache_hint}
[dstMem],
[tensorMap, tensorCoords],
[mbar]
{, im2colOffsets}
{, ctaMask}
{, cache-policy}
;
.dst = { .shared::cluster }
.src = { .global }
.dim = { .1d, .2d, .3d, .4d, .5d }
.completion_mechanism = { .mbarrier::complete_tx::bytes }
.load_mode = { .tile, .im2col }
.level::cache_hint = { .L2::cache_hint }
.multicast = { .multicast::cluster }
```
The per-dimension coordinate tuple `{crd0, crd1, ...}` has one 32-bit integer per dimension; the rank must match the descriptor's rank.
### 4.8. TMA store
TMA store copies from SMEM to GMEM. Two big differences from TMA load (Colfax TMA Tutorial, "Memory fence" and "TMA STORE ARRIVE AND WAIT" subsections):
1. **Completion mechanism is async-group, not mbarrier.** The store is committed into a "bulk group" via `cp.async.bulk.commit_group` (PTX ISA 9.2, §9.7.9.25.6.1), and later waited on via `cp.async.bulk.wait_group N` (PTX ISA 9.2, §9.7.9.25.6.2), which waits until at most N bulk groups remain outstanding. There is no mbarrier involved.
2. **Fence is required on the prolog side.** Because the SMEM being stored was written by the generic proxy (register→SMEM via `stmatrix` or `st.shared` in the epilogue), you **must** `fence.proxy.async.shared::cta` before issuing the TMA store so that the async unit sees the writes (Colfax TMA Tutorial: *"on line 35 we have the CuTe method `tma_store_fence()` that wraps the PTX instruction `fence.proxy.async.shared::cta`."*).
Full CuTe idiom:
```cpp
tma_store_fence(); // fence.proxy.async.shared::cta
if (thread_is_elected) {
cute::copy(tma_store, /*src=*/smem_tile, /*dst=*/gmem_coord_tile);
tma_store_arrive(); // cp.async.bulk.commit_group
}
tma_store_wait<0>(); // cp.async.bulk.wait_group 0
```
(Source: Colfax TMA Tutorial — "TMA STORE ARRIVE AND WAIT" subsection.)
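For reference, the raw PTX underneath those four CuTe calls, written as inline asm. This is a sketch: `desc_ptr`, `smem_addr`, `crd0`/`crd1`, and `elected` stand for the descriptor's generic address, the SMEM source pointer, the tile coordinates, and the `elect.sync` result, and the 2D store spelling is my transcription of the analogous CUTLASS copy atom (check it against `copy_sm90_tma.hpp` before relying on it).
```cpp
// Sketch: epilogue TMA store of one 2D tile, hand-rolled in inline PTX.
__device__ void tma_store_tile_2d(uint64_t desc_ptr, uint32_t smem_addr,
                                  int32_t crd0, int32_t crd1, bool elected) {
  // tma_store_fence(): make generic-proxy SMEM writes visible to the async proxy.
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
  if (elected) {
    // The store itself: SMEM -> GMEM, completion tracked via the bulk async-group.
    asm volatile(
        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%0, {%2, %3}], [%1];"
        :
        : "l"(desc_ptr), "r"(smem_addr), "r"(crd0), "r"(crd1)
        : "memory");
    // tma_store_arrive(): close the current bulk async-group.
    asm volatile("cp.async.bulk.commit_group;");
  }
  // tma_store_wait<0>(): block until no bulk async-groups remain outstanding.
  asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
}
```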
### 4.9. TMA multicast
Multicast lets one TMA load deliver the same GMEM tile to the SMEM of **multiple CTAs in the same thread block cluster** at once (Colfax TMA Tutorial, "TMA Load Multicast"). This is a bandwidth saving for GEMM: when every row-tile CTA needs a column-tile `B`, rather than each one issuing a load, CTAs in the same cluster split the load.
**Requirements** (Colfax TMA Tutorial):
- Cluster must be non-trivial (e.g., `(1,1,2)` or `(2,1,1)`).
- Descriptor built with `SM90_TMA_LOAD_MULTICAST` instead of `SM90_TMA_LOAD`.
- Destination SMEM is `.shared::cluster`, not `.shared::cta` — this enables distributed shared memory addressing across the cluster.
- A **16-bit bitmask** `ctaMask` specifies which CTAs in the cluster participate. Bit `i` set means cluster-rank `i` is a destination. Max 16 CTAs in a cluster (the H100 non-portable max — see Hopper Tuning Guide, §1.4.1.3).
- The elected thread, on one CTA, issues the `cp.async.bulk.tensor...multicast::cluster` with the mask, and each listed destination CTA gets the tile in its SMEM at the same CTA-relative offset.
Multicast PTX is identical to the single-CTA variant with the `.multicast::cluster` qualifier added and the mask operand appended (PTX ISA 9.2, §9.7.9.25.5.2, syntax quoted in §4.7 above).
**Slicing subtlety** (Colfax TMA Tutorial, "TMA Load Multicast" subsection, final paragraphs): in CuTe, the `TiledCopy` for a multicast load has one "logical thread" per cluster-CTA — when you `get_slice(block_rank_in_cluster)`, you get the *subset* of the source tile that this particular CTA will fetch. Giving the wrong cluster rank produces correct-looking but actually wrong SMEM (Colfax TMA Tutorial has a concrete illustration: with two CTAs both sliced with rank 0, each only gets half the tile; with both sliced with rank 1, each gets a different half). Get this right: one CTA loads "its half," the other loads "its half," and the tile is delivered whole to both CTAs' SMEM.
---
## Section 5: WGMMA — Warpgroup-Level MMA
### 5.1. Shape constraints and register counts
For `wgmma.mma_async` with 16-bit operands (FP16/BF16), the only allowed atom shape parameters are (PTX ISA 9.2, §9.7.15.2 "Matrix Shape"; Colfax WGMMA Tutorial):
- **M = 64, always.**
- **K × sizeof(dtype) = 32 bytes.** For FP16/BF16, that's K = 16. For FP8, K = 32. For TF32, K = 8. For integer 8-bit, K = 32.
- **N is a multiple of 8 in [8, 256].** For FP16, possible N values are 8, 16, 24, ..., 256.
**Accumulator register-per-thread math.** The 128-thread warpgroup holds the M×N output tile in registers. For FP16/BF16/FP32 accumulator with 32-bit register granularity, the math is:
```
regs_per_thread = (M * N * bytes_per_acc_elem) / (128 threads * 4 bytes/reg)
= (M * N * bytes_per_acc_elem) / 512
```
Concrete numbers for FP16 inputs with **FP32 accumulator** (each FP32 accumulator element occupies a full 32-bit register of its own):
| Shape | M×N output elements | FP32 regs / thread | Source for verification |
|---|---:|---:|---|
| m64n64k16 | 4096 | **32** | Derived; matches Colfax WGMMA Tutorial's example comment stating the FP16/FP16/FP16 variant uses "16 `uint32` variables for the accumulator" = 16 regs/thread for FP16 accum |
| m64n128k16 | 8192 | **64** | Derived from the formula |
| m64n256k16 | 16384 | **128** | Derived from the formula |
Note: for FP16 **input** with **FP16 accumulator** (no upcast), each 32-bit register packs two FP16 values, so the count halves: m64n64k16 with FP16 accum gives **16 regs/thread**, which is exactly the 16 `uint32` operands in the verbatim PTX in the Colfax WGMMA Tutorial. This matches:
```
regs_per_thread(FP16 acc) = (M * N * 2 bytes) / (128 * 4) = M*N / 256
```
For m64n64k16 FP16-acc: 4096/256 = 16. ✓
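The same arithmetic as compile-time checks, a trivial sketch whose only purpose is to make the two formulas concrete:
```cpp
// Sketch: accumulator registers per thread for a wgmma.mma_async m64nNk16 atom.
constexpr int wgmma_acc_regs_per_thread(int M, int N, int acc_bytes) {
  return (M * N * acc_bytes) / (128 * 4);   // 128 threads, 4 bytes per register
}
static_assert(wgmma_acc_regs_per_thread(64,  64, /*FP32*/ 4) ==  32, "");
static_assert(wgmma_acc_regs_per_thread(64, 128, /*FP32*/ 4) ==  64, "");
static_assert(wgmma_acc_regs_per_thread(64, 256, /*FP32*/ 4) == 128, "");
static_assert(wgmma_acc_regs_per_thread(64,  64, /*FP16*/ 2) ==  16, "");  // 2 halves per reg
```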
### 5.2. SS vs. RS mode
- **SS mode**: both A and B are sourced from **SMEM** via 64-bit matrix descriptors. A is *not* copied into registers. This is the default for Hopper GEMM mainloops and has the best register efficiency (Colfax WGMMA Tutorial: *"accessing the values of `tCrA` and `tCrB` instead accesses these 64-bit descriptors."*). Used when both operand tiles are prepared by TMA into SMEM.
- **RS mode**: A is sourced from **registers** (loaded there by `ldmatrix`), B is sourced from SMEM. Used when A is computed on-chip (e.g., softmax output of FlashAttention-2/3's first GEMM feeding the second GEMM, or when A needs some register-level transformation). The register pressure per consumer thread increases; trade-off is register-to-register MMA pipelining (Ash Vardanian observations: *"At least one operand must reside in shared memory, with dual shared memory operands potentially offering 5–10% better performance"*, but that quote is [UNVERIFIED] against a first-party NVIDIA source and the precise number should be taken as suggestive, not definitive).
**B is always in SMEM regardless of mode** (Colfax WGMMA Tutorial; PTX ISA 9.2, §9.7.15.5).
### 5.3. `ldmatrix` for RS mode
When A needs to come from registers but is still being staged through SMEM, the warp-level `ldmatrix.sync.aligned` loads 8×8 matrix tiles (of 16-bit elements) from SMEM into per-thread register fragments with the exact layout the MMA/WGMMA wants (PTX ISA 9.2, §9.7.14.5.15).
**Full syntax for the 4-matrix, 16-bit variant:**
```ptx
ldmatrix.sync.aligned.m8n8.x4.shared::cta.b16 {%r0, %r1, %r2, %r3}, [%smem_ptr];
// ^^ number of 8x8 matrices loaded: 1, 2, or 4
```
There is also a transpose variant:
```ptx
ldmatrix.sync.aligned.m8n8.x4.trans.shared::cta.b16 {%r0, %r1, %r2, %r3}, [%smem_ptr];
```
The `.trans` qualifier tells the instruction to deliver the transpose of what's in SMEM. This is needed when the SMEM tile is K-major but the MMA wants MN-major A.
**Warp-collective semantics.** `ldmatrix` is a warp-level instruction — all 32 threads of the warp participate, and each contributes one address. The 32 threads' addresses should each point to a 16-byte-aligned row of the 8×8 matrix (for `.x4`, each thread provides an address for one of 32 rows across four matrices). The result distributes the 8×8 matrix across the 32 threads in the canonical MMA A-fragment layout (PTX ISA 9.2, §9.7.14.5.15).
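A device-side sketch of the `.x4` variant as inline asm (the wrapper name is ours; every lane of the warp must call it, and `smem_row_addr` is that lane's shared-space address of its assigned 8×8 row):
```cpp
// Sketch: warp-collective ldmatrix of four 8x8 b16 tiles into 4 registers per lane.
__device__ __forceinline__ void ldmatrix_x4_b16(uint32_t (&frag)[4],
                                                uint32_t smem_row_addr) {
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared::cta.b16 {%0, %1, %2, %3}, [%4];\n"
      : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
      : "r"(smem_row_addr));
}
```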
### 5.4. WGMMA synchronization: the three mandatory instructions, in order
(PTX ISA 9.2, §9.7.15.7; Colfax WGMMA Tutorial, "Synchronization for WGMMA"):
```ptx
wgmma.fence.sync.aligned; // (1) Fence
wgmma.mma_async.sync.aligned.m64nNk16.f32.f16.f16 {...}, ..., ..., ..., ..., ..., ...;
// ... possibly more wgmma.mma_async issues ...
wgmma.commit_group.sync.aligned; // (2) Commit into a group
wgmma.wait_group.sync.aligned 0; // (3) Wait until ≤0 groups outstanding
```
**(1) `wgmma.fence.sync.aligned`** (PTX ISA 9.2, §9.7.15.7.1):
> "`wgmma.fence` establishes an ordering between prior accesses to any warpgroup registers and subsequent accesses to the same registers by a `wgmma.mma_async` instruction."
(PTX ISA 9.2 wording quoted in Aleksa Gordic's annotated PTX walkthrough; also summarized in Colfax WGMMA Tutorial: *"a `wgmma.fence` instruction ensures that `wgmma.mma_async` only accesses certain RMEM addresses after all prior accesses to such addresses have finished. Without the `wgmma.fence`, the behavior is undefined."*)
All four warps of the warpgroup must execute the fence before the warpgroup's first `wgmma.mma_async` (Aleksa Gordic, confirmed). The fence is **not** needed between successive `wgmma.mma_async` instructions that write to the same accumulator tile — only when the warpgroup has touched the relevant registers through the generic proxy (e.g., after a register-to-register zero init, or after a non-wgmma write).
**(2) `wgmma.commit_group.sync.aligned`** (PTX ISA 9.2, §9.7.15.7.2):
> Batches all prior uncommitted `wgmma.mma_async` instructions issued by this warpgroup into a new "wgmma-group." Each warpgroup has its own independent wgmma-group counter (Colfax WGMMA Tutorial).
**(3) `wgmma.wait_group.sync.aligned N`** (PTX ISA 9.2, §9.7.15.7.3):
> Blocks until no more than `N` of the most recent wgmma-groups are outstanding (i.e., waits until all but the N most recent are complete). `N = 0` means "drain everything."
**Variable N is useful for overlap.** When the consumer can do some independent work in parallel with the tail of the MMA (FlashAttention-3's softmax overlap pattern), use `wgmma.wait_group.sync.aligned 1` to wait until all but one group are done, then do the independent work, then wait for 0. (Colfax WGMMA Tutorial: *"flexibility with the parameter N comes in handy. For example, this comes into play with the GEMM-softmax overlapping strategy employed in the design of FlashAttention-3."*)
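Minimal inline-asm equivalents of the CUTLASS helpers used later in §7.3's pseudocode (`warpgroup_arrive`, `warpgroup_commit_batch`, `warpgroup_wait<N>`), written as a sketch following the CUTLASS names:
```cpp
// Sketch: the three WGMMA synchronization instructions as device helpers.
__device__ __forceinline__ void warpgroup_arrive() {        // (1) fence
  asm volatile("wgmma.fence.sync.aligned;" ::: "memory");
}
__device__ __forceinline__ void warpgroup_commit_batch() {  // (2) commit a group
  asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
}
template <int N>                                            // (3) wait: <= N groups pending
__device__ __forceinline__ void warpgroup_wait() {
  static_assert(N >= 0 && N <= 7, "wgmma.wait_group N must be in [0, 7]");
  asm volatile("wgmma.wait_group.sync.aligned %0;" :: "n"(N) : "memory");
}
```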
### 5.5. The `wgmma.mma_async` instruction — full reference for FP16→FP16 m64n64k16, SS mode
Quoted verbatim from the CUTLASS codebase as reproduced in the Colfax WGMMA Tutorial (the `fma` method of `SM90_64x64x16_F16F16F16_SS`):
```cpp
asm volatile(
"{\n"
".reg .pred p;\n"
"setp.ne.b32 p, %18, 0;\n"
"wgmma.mma_async.sync.aligned.m64n64k16.f16.f16.f16 "
"{%0, %1, %2, %3, %4, %5, %6, %7, "
" %8, %9, %10, %11, %12, %13, %14, %15},"
" %16," // desc-a (64-bit SMEM descriptor for A)
" %17," // desc-b (64-bit SMEM descriptor for B)
" p, %19, %20, %21, %22;\n"
"}\n"
: "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), // 16 accumulator regs
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), // (uint32, each packs
"+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), // 2 FP16s for FP16 acc)
"+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
: "l"(desc_a), // 64-bit matrix descriptor A
"l"(desc_b), // 64-bit matrix descriptor B
"r"(int32_t(scale_D)), // 1 = accumulate, 0 = overwrite
"n"(int32_t(scaleA)), // +1 or -1 — negate A flag
"n"(int32_t(scaleB)), // +1 or -1 — negate B flag
"n"(int32_t(tnspA)), // 0 = K-major, 1 = MN-major
"n"(int32_t(tnspB)) // 0 = K-major, 1 = MN-major
);
```
**So the PTX operand count for this variant is: 16 accumulator outputs + 2 descriptors + 1 scale_D + 2 negate flags + 2 transpose flags = 23 operands.** (The transpose flags are only present for the 16-bit variants; they're absent for 8-bit and narrower types, which are required to be K-major — Colfax WGMMA Tutorial, comment by Jay Shah citing PTX docs: *"The transpose operation is only supported for the `wgmma.mma_async` variants with .f16/.bf16 types on matrices accessed from shared memory using matrix descriptors."*)
PTX documentation for this instruction: PTX ISA 9.2, §9.7.15.5.2 "Asynchronous Multiply-and-Accumulate Instruction: `wgmma.mma_async`". For the RS variant where A is in registers, the descriptor `%16` is replaced by the tuple of A register operands (the A fragment layout is specified in PTX ISA 9.2, §9.7.15.5.1.1.1 "Matrix Fragments for `wgmma.mma_async.m64nNk16`").
### 5.6. Matrix descriptor format (for operands in SMEM)
The 64-bit SMEM matrix descriptor (PTX ISA 9.2, §9.7.15.5.1.2.2 "Matrix Descriptor Format"; Colfax WGMMA Tutorial, "WGMMA core matrices") encodes:
| Field | Purpose |
|---|---|
| Start address (14 bits of the SMEM byte address, after shifting by 4) | Where the operand starts in SMEM |
| **Leading Dimension Byte Offset (LBO)** | Byte distance between adjacent **core matrices in the K direction** |
| **Stride Dimension Byte Offset (SBO)** | Byte distance between adjacent core matrices in the **M (for A)** or **N (for B)** direction |
| Swizzle mode | 2 bits selecting none, 32B, 64B, or 128B swizzle |
| Matrix base offset | Small correction for non-aligned SMEM starts |
"Core matrix" is a fixed-size sub-block: 8 rows × 16 bytes per core matrix, regardless of data type (Colfax WGMMA Tutorial). For FP16 (2 bytes/elem), that's 8 rows × 8 elements. `wA` (the A operand tile for one `wgmma.mma_async`) is **8×2 core matrices** (for M=64, K=16 FP16); `wB` is **2×(N/8) core matrices**.
In CUTLASS, the descriptor is built by `cute::make_gmma_desc` from the SMEM layout and is maintained as a `GMMA::DescriptorIterator` alongside the SMEM tensor — which is why printing `tCrA` in CuTe shows `GMMA::DescriptorIterator` and not register-backed values (Colfax WGMMA Tutorial).
---
## Section 6: Thread Block Clusters, GPCs, and DSMEM
### 6.1. Cluster concept
A **thread block cluster** is a new programmability-level grouping above the CTA (thread block): a small number of CTAs (up to 16 on H100 non-portable, 8 portable — see Hopper Tuning Guide §1.4.1.3) that are guaranteed by the runtime to be co-scheduled on SMs within the **same GPU Processing Cluster (GPC)**. Within a cluster, CTAs can directly access each other's shared memory ("Distributed Shared Memory" or DSMEM), synchronize via cluster-scoped barriers, and participate together in TMA multicast (Hopper Tuning Guide, §1.4.1.3).
**Why a GPC?** Physical co-location on the die keeps the SM-to-SM wires short, so an SMEM-to-SMEM transfer within a cluster costs roughly as much as an L1/SMEM access rather than an L2 round trip, and far less than staging the data through global memory. Hopper Tuning Guide §1.4.1.3: *"Distributed Shared Memory can be used by an SM simultaneously with L2 cache accesses. This can benefit applications that need to communicate data between SMs by utilizing the combined bandwidth of both distributed shared memory and L2."*
### 6.2. Cluster launch
In CUDA C++, a cluster kernel uses `__cluster_dims__(x, y, z)` on the kernel or a launcher that sets the cluster dimensions. In CUTLASS:
```cpp
// Sketch — not literal. The exact launcher call is cutlass::launch_kernel_on_cluster.
dim3 gridDim = {M/bM, N/bN, 1};
dim3 blockDim = {128, 1, 1};
dim3 clusterDim = {2, 1, 1}; // 2-CTA cluster along M
cutlass::launch_kernel_on_cluster({gridDim, blockDim, clusterDim, smem_bytes},
(void const*)kernel, args...);
```
**Cluster dimensions must evenly divide the grid dimensions**, or the launch fails (Colfax TMA Tutorial, multicast subsection).
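Without CUTLASS, the same launch is expressible with the CUDA runtime's extended launch API (a sketch; `cluster_kernel` and its single argument are placeholders):
```cpp
#include <cuda_runtime.h>

__global__ void cluster_kernel(float* out);   // placeholder kernel

// Sketch: launch with a (2,1,1) thread block cluster via cudaLaunchKernelEx.
cudaError_t launch_with_cluster(float* out, dim3 grid, dim3 block,
                                size_t smem_bytes, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim          = grid;      // must be divisible by the cluster dims
  config.blockDim         = block;
  config.dynamicSmemBytes = smem_bytes;
  config.stream           = stream;

  cudaLaunchAttribute attr = {};
  attr.id               = cudaLaunchAttributeClusterDimension;
  attr.val.clusterDim.x = 2;           // 2-CTA cluster along x
  attr.val.clusterDim.y = 1;
  attr.val.clusterDim.z = 1;
  config.attrs    = &attr;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, cluster_kernel, out);
}
```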
### 6.3. DSMEM addressing
Pre-Hopper shared-memory pointers are local to a CTA and cannot be dereferenced from another CTA. Hopper introduces **`.shared::cluster`** state space, and `mapa` / `getctarank` PTX instructions that convert between local (CTA-relative) and cluster-visible shared-memory addresses (PTX ISA 9.2, §9.7.9.23 "mapa", §9.7.9.24 "getctarank").
The idiomatic use:
```ptx
.reg .b64 local_addr, cluster_addr;
.shared .b64 shMem;
cvta.shared.u64 local_addr, shMem; // local-SMEM alias
mapa.u64 cluster_addr, local_addr, 0; // cluster-visible alias for CTA 0 in cluster
```
(Quoted from NVIDIA developer forums excerpt of the PTX ISA Chapter 9 cluster examples.)
After this, `cluster_addr` can be dereferenced by any CTA in the cluster and will read/write the shared memory of **CTA 0** in the cluster.
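The same mapping is exposed to CUDA C++ through cooperative groups, with no hand-written PTX (a sketch; the kernel, its `__cluster_dims__(2,1,1)` shape, and the ping-style exchange are illustrative):
```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: each CTA exposes one SMEM counter; its cluster peer increments it via DSMEM.
__global__ void __cluster_dims__(2, 1, 1) dsmem_ping(int* out) {
  __shared__ int counter;
  cg::cluster_group cluster = cg::this_cluster();
  if (threadIdx.x == 0) counter = 0;
  cluster.sync();                                           // init visible cluster-wide

  unsigned peer = cluster.block_rank() ^ 1;                 // the other CTA in a 2-CTA cluster
  int* remote   = cluster.map_shared_rank(&counter, peer);  // DSMEM address in the peer
  if (threadIdx.x == 0) atomicAdd(remote, 1);               // write into the peer's SMEM

  cluster.sync();                                           // remote writes visible before reading
  if (threadIdx.x == 0) out[blockIdx.x] = counter;
}
```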
### 6.4. Cluster-scope barriers
Two synchronization primitives:
1. **`barrier.cluster.arrive` / `barrier.cluster.wait`** (PTX ISA 9.2, §9.7.13.3 "barrier.cluster") — a named barrier at cluster scope. Used for bootstrapping (before any cluster-wide mbarrier is valid, all CTAs need a synchronization point). CuTe wrapper: `cluster_sync()` (Colfax TMA Tutorial, multicast subsection, which calls `cute::cluster_sync()` after mbarrier initialization).
2. **Cluster-scoped mbarrier** — an mbarrier placed in `.shared::cluster` SMEM. Any thread in any CTA of the cluster can arrive on it; arrivals are counted across the whole cluster. Used for warp-specialized pipelines that cross CTA boundaries (e.g., when the producer CTA of a cluster drives TMA multicast loads for the whole cluster).
### 6.5. TMA multicast integration
Multicast TMA (§4.9) requires cluster scope because one load delivers the tile into several CTAs' SMEM and must signal completion in each of them. The wiring, per CTA in the cluster:
1. Each participating CTA initializes its per-stage full-barriers in its own SMEM with `mbarrier.init.shared::cta.b64` (`mbarrier.init` has no `::cluster` form), with an arrival count covering all threads that will arrive on it, including any remote arrivals from other CTAs if the pipeline uses them.
2. After init, `barrier.cluster.arrive; barrier.cluster.wait;` so that every CTA's barriers are initialized before any CTA's TMA can signal them.
3. On each participating CTA, the elected thread does `mbarrier.arrive.expect_tx` on its own full-barrier for the total bytes it will receive, then issues its slice of the load as `cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster [...]` with the ctaMask; the hardware performs the `complete_tx` on the barrier at the corresponding SMEM offset in every destination CTA named by the mask.
4. Consumers in each listed CTA spin on `mbarrier.try_wait.parity.shared::cta.b64` on their own local barrier (the wait instructions have no `::cluster` form; see §3.3).
(Based on the synthesis of PTX ISA 9.2 §§9.7.9.25.5.2, 9.7.13.15.8, 9.7.13.3; Colfax TMA Tutorial; and the CuTe-DSL Hopper pipelining blog.)
---
## Section 7: Producer–Consumer Warp Specialization
### 7.1. First-principles derivation of when WS helps
Yadav's "Unweaving Warp Specialization" is the cleanest derivation of this question. Summarized from Yadav, Unweaving WS (direct quotes selectively used):
**Starting point.** "An SM has some fixed number of compute resources available (ALU's, LSU's, a Tensor Core) and issue slots per clock cycle, regardless of how many warps a thread block uses. Therefore, a kernel has the same theoretical peak compute throughput and peak instructions issued per cycle on an H100 SM whether it uses 4 or 64 warps." Because the Hopper SM has four warp schedulers each able to issue one instruction per cycle, the hardware ceiling is 4 instructions/cycle regardless.
So warp specialization cannot *increase peak throughput*. What it can do is make it easier to actually reach that peak. Yadav identifies three independent reasons:
**Condition 1 — Resource constraints mandate partitioning.** The accumulator for a large output tile may not fit under the 255-registers-per-thread hardware limit for a single warpgroup. Example (Yadav's own derivation): a 256×256 FP16 accumulator with 128 threads is 256×256/128/2 = 256 registers per thread — over the limit. The only way to fit this in the register file is to split the accumulator across more warps (more warpgroups), each holding less. This forces at least *some* specialization — two consumer warpgroups, each holding half the accumulator.
**Condition 2 — Variable-latency instructions are hard to statically schedule.** "Statically constructing a tight schedule when an instruction may take between 10–100 cycles instead of always 25 cycles is significantly harder. This is precisely the second case where warp specialization is useful: dynamic scheduling with the warp scheduler can gracefully handle variable instruction latency, which is common for memory-related operations." TMA loads have wildly variable latency depending on L2/DRAM hits, and `wgmma` has variable latency depending on whether descriptors are already in the WGMMA issue queue. A statically scheduled single-warp program must guess; a warp-specialized program gets dynamic interleaving for free.
**Condition 3 — Blocking synchronization serializes instruction issue.** "These synchronization instructions are essentially semaphores that put warps to sleep until the invoked accelerator has completed. If these synchronization instructions are placed sub-optimally in the instruction stream, the warp may be blocked from executing independent instructions until the operation being synchronized against completes ... Warp specialization allows the programmer to ignore the effects of this synchronization: by breaking the computation into separate warps, other warps can immediately execute as soon as one warp blocks waiting on synchronization."
**The summary idea** (Yadav's framing): "Warp specialization essentially turns the SM from an in-order processor into a quasi-out-of-order processor, specifically at the specialization boundaries at each warp."
**Important nuance.** Yadav shows that even for H100 GEMM (256×256 tile, 8192×8192 shape), it is **possible** to reach CuBLAS-competitive performance *without* warp specialization, using careful software pipelining that pre-issues MMAs to cover the blocking wait. This takes more human effort per kernel but produces a more compiler-friendly IR. So WS is a point in a trade-off space, not a hardware mandate — but for kernels with heavy non-GEMM intra-loop work (FlashAttention-3 softmax) it is effectively required.
### 7.2. The two-barrier circular buffer
The canonical pattern (Colfax Pipelining Tutorial, "The CUTLASS Pipeline abstraction"; implemented identically in CuTe-DSL's `pipeline.PipelineTmaAsync`, the Triton WS infrastructure, and manually in PTX):
Two arrays of mbarriers, each of length `N` (the stage count):
- **`full_barrier[N]`** — producer signals this when stage `s` is filled. Consumer waits on this.
- **`empty_barrier[N]`** — consumer signals this when stage `s` has been consumed (i.e., WGMMA completed and SMEM buffer is free for reuse). Producer waits on this before overwriting.
Each thread (or each thread that participates in the pipeline) keeps a thread-local **(index, phase)** `PipelineState` that starts at `(0, 0)` on the consumer side and `(0, 1)` on the producer side (because buffers start out empty, the producer initially "sees" phase 1 on the empty barrier before writing — Colfax Pipelining Tutorial: *"we initialize smem_pipe_write to start with an opposite phase (i.e., 1 instead of 0), since the buffers start out as empty."*). Each `advance()` increments the index modulo N; when the index wraps, the phase XORs.
### 7.3. The full warp-specialized mainloop as pseudocode
Verified against the CUTLASS `CollectiveMma` warp-specialized mainloop structure (as exposed in Colfax Pipelining Tutorial) and CuTe-DSL's `dense_gemm.py` (Veitner Pipelining Blog). All three phases:
```
// ============ PROLOGUE ============
if (role == PRODUCER) {
for s in 0 .. min(N, K_tiles)-1:
pipeline.producer_acquire(state); // wait for empty_barrier[s], phase ^=
if (is_leader_thread):
tma_load(A_tile[s], B_tile[s]); // full_barrier[s] is the load's mbarrier
pipeline.producer_commit(state); // NOP for TMA pipelines (arrival
// is baked into the TMA completion)
state.advance();
}
// Consumer also advances past any prologue iterations if designed that way.
// ============ MAIN LOOP ============
for k in prologue_count .. K_tiles - 1 (PRODUCER) or 0 .. K_tiles - 1 (CONSUMER):
if (role == PRODUCER):
pipeline.producer_acquire(prod_state); // wait for empty_barrier[s] to flip
if (is_leader_thread):
tma_load(A_tile[k], B_tile[k]);
pipeline.producer_commit(prod_state);
prod_state.advance();
if (role == CONSUMER):
pipeline.consumer_wait(cons_state); // wait for full_barrier[s] to flip
warpgroup_arrive(); // wgmma.fence.sync.aligned
for k_block in 0 .. num_k_blocks - 1:
gemm(tiled_mma, tCrA[k, k_block],
tCrB[k, k_block],
tCrC); // issues wgmma.mma_async
warpgroup_commit_batch(); // wgmma.commit_group.sync.aligned
warpgroup_wait<0>(); // wgmma.wait_group.sync.aligned 0
pipeline.consumer_release(cons_state); // arrive on empty_barrier[s]
cons_state.advance();
// ============ EPILOGUE ============
// Consumer drains any in-flight wgmmas and writes out the accumulator to GMEM
// (via register→SMEM stmatrix + fence.proxy.async.shared::cta + TMA store).
// Producer just exits after its last iteration.
```
### 7.4. The persistent-kernel reset bug
**The trap.** In a **persistent** warp-specialized kernel, the CTA lives across many output tiles. The consumer's empty-barrier release is on `empty_barrier[s]` for the current tile only — but the producer, when it finishes loading all `K_tiles` stages for the *current* output tile and moves on to the *next* output tile, wraps its stage index back to 0. If the consumer is still working on the previous output tile's stage 0, the producer will overwrite it.
**The naive fix that breaks.** A common misunderstanding is to reset the producer's empty-barrier phase tracking at the start of each output tile (because the `block_k_idx` is reset to 0). This is wrong: the phase tracking is over the lifetime of the mbarrier object, not over the output-tile iteration. The empty_barrier for stage `s` has flipped exactly once per (mainloop iteration that used stage s and released it), regardless of how that iteration maps to output tiles.
**The right fix** (as implemented in CUTLASS's warp-specialized persistent kernel and in Triton's persistent WS GEMM tutorial, per PyTorch WS blog): the condition for `producer_acquire(empty_barrier[s])` must account for *all* previous K-iterations across *all* previous output tiles — equivalently, the producer's `PipelineState.phase` is incremented continuously across output-tile boundaries and never reset. The consumer's state likewise.
This is a subtle invariant, and the PyTorch WS blog notes: *"Warp-specialized pipelining supports complex control flow. For example, our warp-specialized persistent GEMM kernel uses a doubly-nested loop, allowing the producer to begin fetching data for the next output tile while the consumer is finishing the compute for the prior tile."* That "doubly-nested loop" only works correctly if phase tracking is over the inner iteration count, not reset per output tile.
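A sketch of the correct structure (the tile-scheduler calls, `producer_acquire`, `tma_load_stage`, `empty_barrier`, `elected`, and `kStages` are hypothetical names; `PipelineState` is the §3.4 sketch): both pipeline states are constructed once, before the output-tile loop, and are only ever advanced.
```cpp
// Sketch: producer side of a persistent warp-specialized mainloop. No per-tile reset.
PipelineState<kStages> prod_state;   // index = 0
prod_state.phase = 1;                // producer starts at phase 1: buffers begin empty

for (int tile = first_tile(); tile >= 0; tile = next_tile()) {  // persistent outer loop
  for (int k = 0; k < k_tiles; ++k) {
    producer_acquire(empty_barrier[prod_state.index], prod_state.phase);  // spin until free
    if (elected) tma_load_stage(tile, k, prod_state.index);  // arrives on full_barrier
    ++prod_state;  // index = (index + 1) % kStages; phase flips on wrap, never resets
  }
  // WRONG: prod_state = PipelineState<kStages>{};  // would desync against live mbarrier phases
}
```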
### 7.5. How Triton automates this
From the PyTorch WS Blog:
> "To enable warp specialization, users simply need to specify two autotune flags: `num_consumer_groups` and `num_buffers_warp_spec`. [...] Users can enable warp specialization by setting a non-zero value for `num_consumer_groups`, which defines the number of consumer warp groups. There is no corresponding flag to set the number of producer warp groups, as currently only one producer is supported. The `num_buffers_warp_spec` flag specifies the number of buffers the producer warp group will use to communicate with the consumer warp groups."
The compiler, given these two flags, performs:
- **Task partitioning** — decides which ops go to the producer warp group and which to the consumer warp groups, based on dependency analysis.
- **Data partitioning** — for multiple consumer groups, partitions the accumulator (default: along M-mode; falls back to N-mode if the partition is too small for the WGMMA atom).
- **Dataflow pipelining** — inserts cyclic SMEM buffers automatically.
- **Communication ops** — inserts `ProducerAcquireOp`, `ProducerCommitOp`, `ConsumerWaitOp`, `ConsumerReleaseOp` into TTGIR.
- **Code partitioning** — outlines each async task into a separate code region guarded by a warpgroup-ID check.
- **PTX materialization** — lowers the communication ops into `mbarrier.*` PTX.
This is the same structure as a hand-written CUTLASS WS kernel; the user just doesn't write the boilerplate.
---
## Section 8: Code Examples in Each DSL
Each example below uses **only** PTX/API syntax I verified against primary sources in this session. Inline comments flag where syntax is quoted verbatim.
### 8.1. Minimal CUDA/PTX: single TMA load with mbarrier synchronization
This kernel shows the canonical elect.sync → `mbarrier.arrive.expect_tx` → `cp.async.bulk.tensor` → `mbarrier.try_wait.parity` spin loop. The host-side descriptor construction is sketched only (the driver-API call sequence is long — see §4.2). All PTX here is verified against the Colfax TMA Tutorial (rank-2 TMA load PTX, quoted verbatim from CUTLASS `include/cute/arch/copy_sm90_tma.hpp`) and the PyTorch TMA Blog (elect.sync form).
```cpp
#include <cuda.h>
#include <cuda/barrier>
#include <cuda_runtime.h>
// Host helper: make a 2D, tile-mode TMA descriptor using cuTensorMapEncodeTiled.
// Fields and enums as documented in the CUDA Driver API; names:
// CU_TENSOR_MAP_DATA_TYPE_FLOAT16, CU_TENSOR_MAP_SWIZZLE_128B,
// CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
// CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE, etc.
// (Names are from the CUDA driver API headers; I cite the existence of these
// enum values from the Colfax TMA Tutorial's reference to cuTensorMapEncodeTiled
// but do not reproduce the entire header here.)
CUtensorMap make_tma_desc_2d_fp16(const __half* gmem, int M, int N,
int tileM, int tileN) {
CUtensorMap desc = {};
// Fill fields: data type FP16, rank 2, base ptr, globalDim = {N, M} (the
// innermost dim is first in the driver API's convention), globalStrides =
// {N * sizeof(__half)} for a row-major tensor, boxDim = {tileN, tileM},
// elementStrides = {1, 1}, interleave = NONE, swizzle = 128B (to match
// the WGMMA descriptor's expected layout), L2 promotion = 128B,
// oobFill = NONE.
// The exact call is cuTensorMapEncodeTiled(&desc, ...args...);
// (Colfax TMA Tutorial references the CUDA Driver API TMA documentation
// for the full argument list.)
// ... fill in fields, call cuTensorMapEncodeTiled ...
return desc;
}
// Device kernel: single thread issues one TMA load of a tileM x tileN FP16
// tile and waits for it to land in SMEM.
__global__ void tma_minimal_kernel(const __grid_constant__ CUtensorMap desc,
int tile_x, int tile_y)
{
constexpr int TILE_M = 64;
constexpr int TILE_N = 64;
constexpr int NUM_BYTES = TILE_M * TILE_N * sizeof(__half); // 8192 bytes
__shared__ alignas(128) __half smem_tile[TILE_M * TILE_N];
__shared__ alignas(8) uint64_t mbar; // mbarrier object is .b64, 8-byte aligned
uint32_t smem_ptr = __cvta_generic_to_shared(&smem_tile[0]);
uint32_t smem_mbar = __cvta_generic_to_shared(&mbar);
uint64_t desc_ptr = (uint64_t)(&desc); // grid-constant, pass-by-value
// ---- One thread initializes the mbarrier ----
if (threadIdx.x == 0) {
// mbarrier.init.shared::cta.b64 [mbar], 1;
// Expected arrival count = 1 (single thread will "arrive" via expect_tx).
asm volatile("mbarrier.init.shared::cta.b64 [%0], 1;\n"
:: "r"(smem_mbar));
}
__syncthreads(); // bootstrap: all threads see the initialized mbarrier
// ---- One thread (elected) issues expect_tx + TMA load ----
// The elect.sync form is taken from the PyTorch TMA Blog:
// elect.sync _|%p7, 0xffffffff;
// Note the *underscore*-before-pred operand order: destination reg (underscore
// = ignore), then predicate, then membermask.
uint32_t is_elected;
asm volatile("{\n"
".reg .pred P;\n"
"elect.sync _|P, 0xffffffff;\n" // exact PTX form, verified
"selp.u32 %0, 1, 0, P;\n"
"}\n" : "=r"(is_elected));
  if (is_elected && (threadIdx.x / 32) == 0) { // elected lane of warp 0: exactly one thread in the CTA
// mbarrier.arrive.expect_tx.shared::cta.b64 _, [mbar], NUM_BYTES;
asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;\n"
:: "r"(smem_mbar), "r"((uint32_t)NUM_BYTES));
// The TMA load itself. PTX quoted verbatim from Colfax TMA Tutorial.
asm volatile(
"cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes"
" [%0], [%1, {%3, %4}], [%2];\n"
:: "r"(smem_ptr), // %0 - dst SMEM ptr
"l"(desc_ptr), // %1 - TMA descriptor ptr (64-bit)
"r"(smem_mbar), // %2 - mbarrier ptr
"r"(tile_x), // %3 - dim-0 coord
"r"(tile_y) // %4 - dim-1 coord
: "memory");
}
// ---- All threads wait on the mbarrier in a spin loop ----
// Phase parity = 0 for the first use of the mbarrier.
constexpr int PHASE = 0;
uint32_t done = 0;
while (!done) {
asm volatile("{\n"
".reg .pred P;\n"
"mbarrier.try_wait.parity.shared::cta.b64 P, [%1], %2;\n"
"selp.u32 %0, 1, 0, P;\n"
"}\n"
: "=r"(done)
: "r"(smem_mbar), "r"(PHASE));
}
// After this point, smem_tile is guaranteed to contain the loaded data,
// visible to every thread in the CTA (the mbarrier wait provides the
// acquire-side visibility guarantee).
// ... consumer code uses smem_tile ...
}
```
Points to flag:
- The `elect.sync` operand order `_|%pred` (underscore first) is quoted verbatim from compiled Triton output in the PyTorch TMA Blog. The wrong order `%pred|_` is a common mistake in hand-written PTX.
- `mbarrier.try_wait.parity` sits in a `while (!done)` loop because **PTX has no unbounded blocking wait**: `try_wait` only waits up to a system-dependent time limit and can return `false` before the phase completes, so it has to be retried. The MLIR `nvvm.mbarrier.try_wait.parity` docs explicitly say they *generate* this loop to provide true blocking semantics (MLIR NVVM dialect docs).
- `__grid_constant__ const CUtensorMap` is the required way to pass the descriptor by value to the kernel (Colfax TMA Tutorial).
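The host helper above elides the actual encode call. The sketch below fills it in under stated assumptions: it presumes the `cuTensorMapEncodeTiled` argument order documented in the CUDA Driver API (tensor map, data type, rank, global address, global dims, global strides, box dims, element strides, interleave, swizzle, L2 promotion, OOB fill), with the contiguous dimension listed first and `globalStrides` holding rank-1 byte strides. It was not compiled in this session; verify against `cuda.h` before use.
```cpp
#include <cuda.h>
#include <cuda_fp16.h>

// Hedged sketch of the §8.1 host helper; field choices mirror the comments in the
// listing above (FP16, rank 2, 128B swizzle, 128B L2 promotion, no OOB fill).
CUtensorMap make_tma_desc_2d_fp16(const __half* gmem, int M, int N,
                                  int tileM, int tileN) {
  CUtensorMap desc = {};
  // Row-major M x N tensor of FP16: the contiguous dimension (N) comes first.
  cuuint64_t globalDim[2]      = {(cuuint64_t)N, (cuuint64_t)M};
  // globalStrides has rank-1 entries (strides of dims 1..rank-1), in bytes.
  cuuint64_t globalStrides[1]  = {(cuuint64_t)N * sizeof(__half)};
  cuuint32_t boxDim[2]         = {(cuuint32_t)tileN, (cuuint32_t)tileM};
  cuuint32_t elementStrides[2] = {1, 1};
  CUresult res = cuTensorMapEncodeTiled(
      &desc,
      CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
      /*tensorRank=*/2,
      const_cast<__half*>(gmem),            // global base address
      globalDim,
      globalStrides,
      boxDim,
      elementStrides,
      CU_TENSOR_MAP_INTERLEAVE_NONE,
      CU_TENSOR_MAP_SWIZZLE_128B,           // must match the SMEM layout WGMMA expects
      CU_TENSOR_MAP_L2_PROMOTION_L2_128B,
      CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  // A production helper would check res == CUDA_SUCCESS.
  (void)res;
  return desc;
}
```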
### 8.2. CUDA/PTX: producer-consumer warp-specialized GEMM mainloop skeleton
A skeleton showing the two-barrier circular buffer. The prologue, main K-loop, and epilogue drain are shown explicitly. Full descriptors, SMEM layout, and accumulator handling are omitted for brevity (they're covered in §8.3 and §5).
```cpp
template <int NUM_STAGES, int BM, int BN, int BK>
__global__ __launch_bounds__(384 /* 1 producer + 2 consumer warpgroups */, 1)
void ws_gemm_kernel(const __grid_constant__ CUtensorMap descA,
const __grid_constant__ CUtensorMap descB,
float* C, int M, int N, int K,
int tile_m, int tile_n)
{
// --- SMEM layout (conceptual) ---
extern __shared__ uint8_t smem_buf[];
auto smem_A = reinterpret_cast<__half*>(smem_buf); // [NUM_STAGES][BM][BK]
auto smem_B = smem_A + NUM_STAGES * BM * BK; // [NUM_STAGES][BN][BK]
auto full_bar = reinterpret_cast<uint64_t*>( // NUM_STAGES mbarriers
smem_A + NUM_STAGES * (BM + BN) * BK);
auto empty_bar = full_bar + NUM_STAGES;
const int wg_id = (threadIdx.x / 128); // 0 = producer, 1..2 = consumer
const int is_producer = (wg_id == 0);
// --- Register-count redistribution per warpgroup ---
// (See §9 for the register budget math.)
  // The register file holds 65,536 regs; per thread, summed across the three
  // warpgroups, the budget is 65,536 / 128 = 512 regs (§9.3).
  // Producer needs little; give it 40. Consumers get 232 each (40 + 2*232 = 504 <= 512).
if (is_producer) {
// setmaxnreg.dec.sync.aligned.u32 40; // PTX ISA 9.2, §9.7.19.5
asm volatile("setmaxnreg.dec.sync.aligned.u32 40;");
} else {
// setmaxnreg.inc.sync.aligned.u32 232; // PTX ISA 9.2, §9.7.19.5
asm volatile("setmaxnreg.inc.sync.aligned.u32 232;");
}
// --- Initialize barriers (producer does it; consumer waits via __syncthreads) ---
if (threadIdx.x == 0) {
for (int s = 0; s < NUM_STAGES; ++s) {
uint32_t fb = __cvta_generic_to_shared(&full_bar[s]);
uint32_t eb = __cvta_generic_to_shared(&empty_bar[s]);
// Full barrier: the producer's elect'd thread will arrive-and-expect-tx.
// Arrival count = 1 (single producer thread).
asm volatile("mbarrier.init.shared::cta.b64 [%0], 1;\n" :: "r"(fb));
// Empty barrier: each consumer warpgroup arrives (128 threads per group x 2 groups).
// Arrival count is NUM_CONSUMER_THREADS if every thread arrives; or 2 if
      // one designated thread per warpgroup arrives.
      // We use the latter convention (matching the consumer path below).
asm volatile("mbarrier.init.shared::cta.b64 [%0], 2;\n" :: "r"(eb));
}
}
__syncthreads();
const int K_tiles = (K + BK - 1) / BK;
// Thread-local phase tracking. Starts at 0 for consumer's full-barrier wait
// (phase flips to 1 on first completion), and 1 for producer's empty-barrier
// wait (buffers start "empty" conceptually — the producer does not wait on
// the first pass through each stage).
int prod_phase = 1; // producer's next-expected empty_barrier phase
int cons_phase = 0; // consumer's next-expected full_barrier phase
if (is_producer) {
// --- PRODUCER PATH ---
// Elect a single thread in the producer warpgroup's first warp.
uint32_t is_elected;
asm volatile("{\n"
".reg .pred P;\n"
"elect.sync _|P, 0xffffffff;\n"
"selp.u32 %0, 1, 0, P;\n"
"}\n" : "=r"(is_elected));
is_elected &= ((threadIdx.x & 127) < 32); // first warp of the wg only
int producer_idx = 0; // stage index, 0..NUM_STAGES-1
for (int k = 0; k < K_tiles; ++k) {
const int s = producer_idx;
uint32_t fb = __cvta_generic_to_shared(&full_bar[s]);
uint32_t eb = __cvta_generic_to_shared(&empty_bar[s]);
// Wait for the consumer to have released this stage (except in prologue,
// where the first NUM_STAGES iterations should not wait).
if (k >= NUM_STAGES) {
uint32_t done = 0;
while (!done) {
asm volatile("{\n"
".reg .pred P;\n"
"mbarrier.try_wait.parity.shared::cta.b64 P, [%1], %2;\n"
"selp.u32 %0, 1, 0, P;\n"
"}\n" : "=r"(done) : "r"(eb), "r"(prod_phase));
}
}
// Arrive and expect tx, then issue TMA loads into stage s.
if (is_elected) {
uint32_t tx_bytes = (BM*BK + BN*BK) * sizeof(__half);
asm volatile("mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;"
:: "r"(fb), "r"(tx_bytes));
uint32_t smemA_s = __cvta_generic_to_shared(&smem_A[s * BM * BK]);
uint32_t smemB_s = __cvta_generic_to_shared(&smem_B[s * BN * BK]);
asm volatile(
"cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes"
" [%0], [%1, {%3, %4}], [%2];"
:: "r"(smemA_s), "l"((uint64_t)&descA), "r"(fb),
"r"(k * BK), "r"(tile_m * BM) : "memory");
asm volatile(
"cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes"
" [%0], [%1, {%3, %4}], [%2];"
:: "r"(smemB_s), "l"((uint64_t)&descB), "r"(fb),
"r"(k * BK), "r"(tile_n * BN) : "memory");
}
// Advance stage.
producer_idx = (producer_idx + 1) % NUM_STAGES;
if (producer_idx == 0) prod_phase ^= 1;
}
// Epilogue: drain not needed on the producer side — the consumer's
// consumer_release on the final stages is what completes things.
} else {
// --- CONSUMER PATH ---
// Two consumer warpgroups. Each owns half of the output tile (along M).
int consumer_idx = 0;
// ... allocate accumulator registers (see §5) ...
for (int k = 0; k < K_tiles; ++k) {
const int s = consumer_idx;
uint32_t fb = __cvta_generic_to_shared(&full_bar[s]);
uint32_t eb = __cvta_generic_to_shared(&empty_bar[s]);
// Wait for producer to fill stage s.
uint32_t done = 0;
while (!done) {
asm volatile("{\n"
".reg .pred P;\n"
"mbarrier.try_wait.parity.shared::cta.b64 P, [%1], %2;\n"
"selp.u32 %0, 1, 0, P;\n"
"}\n" : "=r"(done) : "r"(fb), "r"(cons_phase));
}
// WGMMA sequence (§5.4):
asm volatile("wgmma.fence.sync.aligned;");
// ... issue wgmma.mma_async over the MMA_K dimension of this stage ...
asm volatile("wgmma.commit_group.sync.aligned;");
asm volatile("wgmma.wait_group.sync.aligned 0;");
// Signal empty_bar[s] — one thread per warpgroup arrives.
if ((threadIdx.x & 127) == 0) {
asm volatile("mbarrier.arrive.shared::cta.b64 _, [%0];" :: "r"(eb));
}
consumer_idx = (consumer_idx + 1) % NUM_STAGES;
if (consumer_idx == 0) cons_phase ^= 1;
}
// Epilogue: stmatrix → fence.proxy.async.shared::cta → TMA store.
// (Omitted for brevity; see §4.8.)
}
}
```
Notes on verified syntax:
- `setmaxnreg.{inc,dec}.sync.aligned.u32 <count>` — PTX ISA 9.2, §9.7.19.5 ("setmaxnreg"). The `.dec` form releases registers (producer path); `.inc` requests additional registers (consumer path). Both take a per-thread register count between 24 and 256 inclusive, in multiples of 8 (Colfax Pipelining Tutorial: *"setmaxnreg can be set to a value between 24 and 256 (inclusive) at multiples of 8"*).
- The producer can skip the `empty_barrier` wait on the first `NUM_STAGES` iterations — the buffer is empty to start. This is the "producer starts with phase = 1" convention (Colfax Pipelining Tutorial).
- Arrival-count of 1 on the full_barrier assumes a single elected producer thread. Arrival-count of 2 on the empty_barrier assumes one arrival per consumer warpgroup (two consumers = 2). If every consumer thread arrived, the count would be 256. CUTLASS's `ClusterTransactionBarrier` encapsulates these counting conventions.
### 8.3. CuTe C++ warp-specialized GEMM with `PipelineTmaAsync`
Based on the structure shown in the Colfax Pipelining Tutorial, "Warp-specialization" subsection. This is abbreviated but structurally faithful.
```cpp
#include <cute/tensor.hpp>
#include <cutlass/pipeline/sm90_pipeline.hpp>
#include <cutlass/arch/reg_reconfig.h>
using namespace cute;
using MainloopPipeline = typename cutlass::PipelineTmaAsync<NUM_STAGES>;
using PipelineState = typename cutlass::PipelineState<NUM_STAGES>;
template <class TmaLoadA, class TmaLoadB, class TiledMma, /* ... */>
__global__ void ws_gemm(__grid_constant__ const TmaLoadA tma_a,
__grid_constant__ const TmaLoadB tma_b,
TiledMma tiled_mma,
/* ... gmem/smem layouts ... */)
{
extern __shared__ char smem_raw[];
auto& storage = *reinterpret_cast<SharedStorage*>(smem_raw);
// Build the pipeline. TMA pipeline uses transaction-count-based signaling
// of full_barrier (§3.2, §7.2); producer_commit is thus a NOP.
typename MainloopPipeline::Params params;
params.transaction_bytes = TmaTransactionBytes; // BM*BK + BN*BK, in bytes
params.role = (warp_group_idx == 0)
? MainloopPipeline::ThreadCategory::Producer
: MainloopPipeline::ThreadCategory::Consumer;
params.is_leader = (warp_group_thread_idx == 0);
params.num_consumers = 128; // one consumer warpgroup in this sketch
auto cluster_shape = Shape<_1,_1,_1>{};
MainloopPipeline pipeline(storage.pipeline_storage, params, cluster_shape);
__syncthreads();
PipelineState smem_pipe_write = cutlass::make_producer_start_state<MainloopPipeline>();
PipelineState smem_pipe_read;
if (warp_group_idx == 0) {
// ===== PRODUCER =====
// Release excess registers; give them to consumers. 40 regs/thread is typical.
cutlass::arch::warpgroup_reg_dealloc<40>();
// Single elected thread in warp 0 issues TMA.
int lane_predicate = cute::elect_one_sync();
if (warp_idx_in_warpgroup == 0 && lane_predicate) {
for (int k = 0; k < k_tile_count; ++k) {
pipeline.producer_acquire(smem_pipe_write);
// producer_acquire internally:
// (a) waits on empty_barrier[stage] via mbarrier.try_wait.parity
// (b) if leader, does mbarrier.arrive.expect_tx.shared::cta.b64
// on full_barrier[stage] with params.transaction_bytes.
// Pass full_barrier[stage] as the mbarrier operand to cute::copy,
// which dispatches to cp.async.bulk.tensor.2d.* (§4.7).
int stage = smem_pipe_write.index();
copy(tma_a.with(*pipeline.producer_get_barrier(smem_pipe_write), 0),
tAgA(_, k), tAsA(_, stage));
copy(tma_b.with(*pipeline.producer_get_barrier(smem_pipe_write), 0),
tBgB(_, k), tBsB(_, stage));
// producer_commit is a NOP for PipelineTmaAsync — the full_barrier
// arrival is baked into the TMA descriptor's `mbarrier::complete_tx`
// completion mechanism (Colfax Pipelining Tutorial).
pipeline.producer_commit(smem_pipe_write);
++smem_pipe_write;
}
}
} else {
// ===== CONSUMER =====
// Request extra registers for the accumulator + descriptors.
cutlass::arch::warpgroup_reg_alloc<232>();
Tensor tCrC = ...; // register-backed accumulator fragment
clear(tCrC);
for (int k = 0; k < k_tile_count; ++k) {
pipeline.consumer_wait(smem_pipe_read);
// internally: mbarrier.try_wait.parity.shared::cta.b64 on full_barrier[stage]
int stage = smem_pipe_read.index();
cute::warpgroup_arrive(); // wgmma.fence.sync.aligned
cute::gemm(tiled_mma,
tCrA(_, _, _, stage),
tCrB(_, _, _, stage),
tCrC); // issues a sequence of wgmma.mma_async
cute::warpgroup_commit_batch(); // wgmma.commit_group.sync.aligned
cute::warpgroup_wait<0>(); // wgmma.wait_group.sync.aligned 0
pipeline.consumer_release(smem_pipe_read);
// internally: mbarrier.arrive.shared::cta.b64 on empty_barrier[stage]
++smem_pipe_read;
}
// Epilogue: register → SMEM (stmatrix), fence, TMA store.
// tma_store_fence(); // fence.proxy.async.shared::cta
// if (is_tma_thread) {
// copy(tma_c, sC, gC);
// tma_store_arrive(); // cp.async.bulk.commit_group
// }
// tma_store_wait<0>(); // cp.async.bulk.wait_group 0
}
}
```
Key points, all verified from the Colfax Pipelining Tutorial:
- **`producer_commit` is a no-op for `PipelineTmaAsync`.** Colfax Pipelining Tutorial: *"With the `cute::copy` call thus linked to the mbarrier object `full_barrier` of the pipeline, we can then use the transaction count-based completion mechanism of TMA to signal the consumer that the buffer is ready to be used, obviating the need to invoke `producer_commit` from the pipeline object itself. This is why CUTLASS makes `producer_commit` a no-op for `PipelineTmaAsync`."*
- **`producer_acquire` internally does both the empty-barrier wait and the expect-tx arrival on the full barrier.** Colfax Pipelining Tutorial quotes the source: *"`if (barrier_token != BarrierStatus::WaitDone) { empty_barrier_ptr_[stage].wait(phase); }` ... `if (params_.is_leader) { full_barrier_ptr_[stage].arrive_and_expect_tx(params_.transaction_bytes); }`"*.
- Register-count conventions — see §9.
### 8.4. CuTe-DSL (Python) pipelined kernel — structure from `dense_gemm.py`
Structure verified against the Veitner Pipelining Blog's walkthrough of the official CUTLASS `examples/python/CuTeDSL/hopper/dense_gemm.py`. Abbreviated; full example is in the CUTLASS repo.
```python
import cutlass
import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync, warpgroup
from cutlass.utils import pipeline
@cute.kernel
def kernel(self,
tma_atom_a, mA_mkl, # TMA atom + GMEM tensor for A
tma_atom_b, mB_nkl, # TMA atom + GMEM tensor for B
tma_atom_c, mC_mnl, # TMA atom + GMEM tensor for C (store)
tiled_mma, # cute.TiledMma (wraps wgmma atom)
cta_layout_mnk,
a_smem_layout_staged, b_smem_layout_staged, epi_smem_layout_staged):
# ---------- SMEM allocation ----------
smem = cutlass.utils.SmemAllocator()
storage = smem.allocate(self.shared_storage)
# ---------- Prefetch TMA descriptors (warp 0) ----------
if warp_idx == 0:
cute.nvgpu.cpasync.prefetch_descriptor(tma_atom_a)
cute.nvgpu.cpasync.prefetch_descriptor(tma_atom_b)
# ---------- Build the TMA-based producer/consumer pipeline ----------
mainloop_pipeline_array_ptr = storage.mainloop_pipeline_array_ptr.data_ptr()
mainloop_pipeline_producer_group = pipeline.CooperativeGroup(pipeline.Agent.Thread)
num_warps = self.threads_per_cta // 32
consumer_arrive_cnt = (self.num_mcast_ctas_a + self.num_mcast_ctas_b - 1) * num_warps
mainloop_pipeline_consumer_group = pipeline.CooperativeGroup(
pipeline.Agent.Thread, consumer_arrive_cnt)
cta_layout_vmnk = cute.make_layout((1, *cta_layout_mnk.shape))
mainloop_pipeline = pipeline.PipelineTmaAsync.create(
barrier_storage = mainloop_pipeline_array_ptr,
num_stages = self.ab_stage,
producer_group = mainloop_pipeline_producer_group,
consumer_group = mainloop_pipeline_consumer_group,
tx_count = tma_copy_bytes,
cta_layout_vmnk = cta_layout_vmnk,
)
# ---------- Pipeline state (index + phase) ----------
k_tile_cnt = cute.size(gA_mkl, mode=[2])
mainloop_producer_state = pipeline.make_pipeline_state(
pipeline.PipelineUserType.Producer, self.ab_stage)
mainloop_consumer_read_state = pipeline.make_pipeline_state(
pipeline.PipelineUserType.Consumer, self.ab_stage)
mainloop_consumer_release_state = pipeline.make_pipeline_state(
pipeline.PipelineUserType.Consumer, self.ab_stage)
# ---------- PROLOGUE: prefetch N stages ----------
prefetch_k_tile_cnt = cutlass.max(cutlass.min(self.ab_stage, k_tile_cnt), 0)
if warp_idx == 0:
for prefetch_idx in cutlass.range(prefetch_k_tile_cnt, unroll=1):
mainloop_pipeline.producer_acquire(mainloop_producer_state)
tAsA_pipe = tAsA[(None, mainloop_producer_state.index)]
tBsB_pipe = tBsB[(None, mainloop_producer_state.index)]
tAgA_k = tAgA_mkl[(None, mainloop_producer_state.count)]
tBgB_k = tBgB_nkl[(None, mainloop_producer_state.count)]
cute.copy(tma_atom_a, tAgA_k, tAsA_pipe,
tma_bar_ptr=mainloop_pipeline.producer_get_barrier(
mainloop_producer_state),
mcast_mask=a_mcast_mask)
cute.copy(tma_atom_b, tBgB_k, tBsB_pipe,
tma_bar_ptr=mainloop_pipeline.producer_get_barrier(
mainloop_producer_state),
mcast_mask=b_mcast_mask)
mainloop_pipeline.producer_commit(mainloop_producer_state) # NOP
mainloop_producer_state.advance()
# ---------- MAINLOOP: interleaved producer + consumer ----------
tiled_mma.set(cute.nvgpu.warpgroup.Field.ACCUMULATE, False)
for k_tile in cutlass.range(k_pipe_mmas, k_tile_cnt, 1, unroll=1):
# Consumer side: wait, run the wgmma group, release.
mainloop_pipeline.consumer_wait(mainloop_consumer_read_state,
peek_ab_full_status)
cute.nvgpu.warpgroup.fence() # wgmma.fence.sync.aligned
for k_block_idx in cutlass.range(num_k_blocks, unroll_full=True):
k_block_coord = (None, None, k_block_idx,
mainloop_consumer_read_state.index)
cute.gemm(tiled_mma,
accumulators,
tCrA[k_block_coord],
tCrB[k_block_coord],
accumulators) # wgmma.mma_async
cute.nvgpu.warpgroup.commit_group() # wgmma.commit_group.sync.aligned
cute.nvgpu.warpgroup.wait_group(k_pipe_mmas) # wgmma.wait_group N
mainloop_pipeline.consumer_release(mainloop_consumer_release_state)
mainloop_consumer_read_state.advance()
mainloop_consumer_release_state.advance()
# Producer side: if not drained, fire another TMA load.
if warp_idx == 0 and mainloop_producer_state.count < k_tile_cnt:
mainloop_pipeline.producer_acquire(mainloop_producer_state)
# ... same cute.copy calls as in prologue ...
mainloop_pipeline.producer_commit(mainloop_producer_state)
mainloop_producer_state.advance()
# ---------- EPILOGUE: drain wgmmas, write accumulator out via stmatrix + TMA ----------
cute.nvgpu.warpgroup.wait_group(0)
cute.arch.sync_threads()
# ... register → SMEM via StMatrix8x8x16bOp, fence.proxy.async, TMA store ...
```
Verified against Veitner Pipelining Blog:
- `pipeline.PipelineTmaAsync.create(...)` API surface and argument list (`barrier_storage`, `num_stages`, `producer_group`, `consumer_group`, `tx_count`, `cta_layout_vmnk`).
- The producer `producer_acquire → cute.copy → producer_commit(NOP)` sequence.
- The consumer `consumer_wait → cute.nvgpu.warpgroup.fence() → cute.gemm → warpgroup.commit_group() → warpgroup.wait_group(N) → consumer_release` sequence.
- The prologue prefetch pattern using `prefetch_k_tile_cnt = min(stages, k_tiles)`.
### 8.5. Triton warp-specialized persistent GEMM
The autotuner config below is quoted from the PyTorch WS Blog (with minor reformatting); the kernel body beneath it is a standard Triton matmul skeleton shown for context:
```python
import triton
import triton.language as tl
@triton.autotune(
configs=[
triton.Config(
{
"BLOCK_SIZE_M": 128,
"BLOCK_SIZE_N": 256,
"BLOCK_SIZE_K": 64,
"GROUP_SIZE_M": 8,
},
num_stages=2,
num_warps=4,
num_consumer_groups=2, # <<< WS-specific flag
num_buffers_warp_spec=3, # <<< WS-specific flag
),
],
key=["M", "N", "K"],
)
@triton.jit
def matmul_persistent_ws_kernel(
a_ptr, b_ptr, c_ptr, M, N, K,
stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    # GROUP_SIZE_M-based tile swizzling is omitted in this excerpt.
    pid_m = pid // num_pid_n
    pid_n = pid % num_pid_n
    offs_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_n = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak)
    b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn)
    acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        a = tl.load(a_ptrs)   # compiler lowers to TMA if it can
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)   # compiler lowers to wgmma
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk
    c = acc.to(tl.float16)
    c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :]
    tl.store(c_ptrs, c)
```
**What the Triton compiler does** (PyTorch WS Blog):
> - **Task Partitioning**: the kernel is automatically divided into async tasks for one producer warp group and `num_consumer_groups` consumer warp groups, based on dependency analysis and anchor-op heuristics. Shared memory is chosen as the inter-warpgroup transport to minimize register spill to SMEM.
> - **Data Partitioning**: the output tile is split across consumer groups along M by default; the compiler falls back to N-split if the M-split would produce a sub-MMA tile.
> - **Dataflow Pipelining**: cyclic SMEM buffers of depth `num_buffers_warp_spec` are allocated automatically; the compiler handles double/triple buffering and produces correct phase tracking.
> - **Communication Operations**: four TTGIR ops — `ProducerAcquireOp`, `ProducerCommitOp`, `ConsumerWaitOp`, `ConsumerReleaseOp` — manage the dataflow. These support both TMA and non-TMA memory ops.
> - **Code Partitioning**: each async task is outlined into a code region guarded by a warpgroup-ID check.
> - **TTGIR → LLVM/PTX**: the communication ops lower into `mbarrier.init.shared::cta.b64`, `mbarrier.try_wait.parity.shared::cta.b64`, `mbarrier.arrive.shared::cta.b64`, and similar PTX barriers. The fence and async-proxy ops lower into `fence.proxy.async`. `tl.dot` on Hopper lowers into a sequence of `wgmma.mma_async` calls with `wgmma.fence/commit/wait` (Triton TTNG Dialect Docs confirm the lowerings for `ttng.init_barrier`, `ttng.wait_barrier`, `ttng.fence_async_shared`, `ttng.warp_group_dot`, `ttng.warp_group_dot_wait`).
**What the programmer controls explicitly** — only the autotune flags:
- `num_consumer_groups` — 0 disables WS; non-zero enables.
- `num_buffers_warp_spec` — SMEM buffer depth.
- `num_stages`, `num_warps`, tile shape — unchanged from non-WS Triton.
- The compiler currently supports **only one producer warp group** (PyTorch WS Blog).
**Gluon** (Yadav, Unweaving WS, citing Triton issue #7392): a lower-level Triton dialect that lets expert users bypass the compiler's WS and write warp-specialized code by hand. This exists because compiler-driven WS heuristics are not always optimal; for complex kernels like FlashAttention, expert-written Gluon can still outperform the compiler.
---
## Section 9: Resource Budgeting
### 9.1. SMEM for staged A/B buffers
The formula is mechanical:
```
bytes_per_stage = (BM * BK + BN * BK) * sizeof(dtype)
smem_mainloop = stages * bytes_per_stage
```
**Worked example — FP16, BM=64, BN=128, BK=16, 4 stages:**
```
bytes_per_stage = (64*16 + 128*16) * 2
= (1024 + 2048) * 2
= 3072 * 2
= 6144 bytes
smem_mainloop = 4 * 6144 = 24,576 bytes (24 KB)
```
Leaves 228 − 24 = 204 KB for the epilogue SMEM stage, mbarrier storage (16 × 8 = 128 bytes for 16 mbarriers — well under 1 KB), and any other kernel state. (Hopper Tuning Guide §1.4.1.1 gives the 228 KB cap.)
**Bigger example — FP8, BM=128, BN=256, BK=128, 4 stages:**
```
bytes_per_stage = (128*128 + 256*128) * 1
= 16384 + 32768
= 49,152 bytes
smem_mainloop = 4 * 49152 = 196,608 bytes (192 KB)
```
This almost saturates the SMEM, leaving ~36 KB for the epilogue tile and barriers — workable but tight. Dropping to 3 stages gives 144 KB, opening up 84 KB of headroom.
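These budgets are mechanical enough to check at compile time. A minimal host-side sketch (the helper and constant names are illustrative; 228 KB is the per-SM cap cited above, and a real kernel must also leave room for the epilogue stage, mbarriers, and the per-CTA opt-in limit):
```cpp
#include <cstddef>

// Illustrative helper: mainloop SMEM bytes for a staged A/B pipeline (§9.1 formula).
constexpr std::size_t smem_mainloop_bytes(int BM, int BN, int BK,
                                          int stages, std::size_t elem_bytes) {
  return std::size_t(stages) *
         (std::size_t(BM) * BK + std::size_t(BN) * BK) * elem_bytes;
}

constexpr std::size_t SMEM_PER_SM_BYTES = 228 * 1024;  // per-SM cap (Hopper Tuning Guide)

// The two worked examples above, checked at compile time:
static_assert(smem_mainloop_bytes(64, 128, 16, 4, 2)   == 24 * 1024,  "FP16 example");
static_assert(smem_mainloop_bytes(128, 256, 128, 4, 1) == 192 * 1024, "FP8 example");
static_assert(smem_mainloop_bytes(128, 256, 128, 4, 1) < SMEM_PER_SM_BYTES,
              "leave headroom for the epilogue tile and barriers");
```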
### 9.2. Registers for the accumulator
The formula (derived in §5.1):
```
regs_per_thread_acc = (BM * BN * bytes_per_acc_elem) / (num_threads * 4)
```
Where `num_threads = 128 * num_consumer_warpgroups` (the accumulator is spread across all consumer warpgroups' threads).
**Worked examples, all with FP32 accumulator (4 bytes per element):**
| BM | BN | num_consumer_wgs | num_threads | regs/thread |
|---:|---:|---:|---:|---:|
| 128 | 128 | 1 | 128 | (128·128·4)/(128·4) = 128 |
| 128 | 128 | 2 | 256 | (128·128·4)/(256·4) = 64 |
| 256 | 256 | 1 | 128 | (256·256·4)/(128·4) = 512 (exceeds 255 limit; forbidden) |
| 256 | 256 | 2 | 256 | 256 (just over 255 — forbidden) |
| 256 | 256 | 3 | 384 | ~171 (fits) |
| 128 | 256 | 1 | 128 | 256 (exceeds 255) |
| 128 | 256 | 2 | 256 | 128 |
Because the per-thread max is 255 (Hopper Tuning Guide §1.4.1.1), any tile that gives `regs_per_thread_acc > 255` **forces** enough consumer warpgroups to bring that number down to 255 or below. This is Yadav's "condition 1" for WS (Yadav, Unweaving WS).
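The same formula as a compile-time helper (illustrative; note that integer division gives 170 for the 256×256 three-warpgroup row, versus the rounded ~171 in the table):
```cpp
// Illustrative helper for the §9.2 accumulator-register formula.
constexpr int acc_regs_per_thread(int BM, int BN, int bytes_per_acc_elem,
                                  int num_consumer_wgs) {
  const int num_threads = 128 * num_consumer_wgs;
  return (BM * BN * bytes_per_acc_elem) / (num_threads * 4);
}

// Rows of the table above, FP32 accumulator (4 bytes per element):
static_assert(acc_regs_per_thread(128, 128, 4, 1) == 128, "");
static_assert(acc_regs_per_thread(128, 128, 4, 2) == 64,  "");
static_assert(acc_regs_per_thread(256, 256, 4, 1) == 512, "");  // > 255: forbidden
static_assert(acc_regs_per_thread(256, 256, 4, 3) == 170, "");  // fits under 255
static_assert(acc_regs_per_thread(128, 256, 4, 1) == 256, "");  // > 255: forbidden
static_assert(acc_regs_per_thread(128, 256, 4, 2) == 128, "");
```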
### 9.3. The SMEM–register–occupancy trade-off
All three resources draw from a fixed SM budget. Some rules of thumb:
- **Stages vs. SMEM.** Each additional stage costs one `bytes_per_stage` of SMEM. On a 228-KB-SMEM SM with a BM=128, BN=128, BK=64 FP16 setup, `bytes_per_stage = (128·64 + 128·64)·2 = 32,768`, so 2 stages = 64 KB, 3 stages = 96 KB, 4 stages = 128 KB. Going from 3 to 4 stages can tip the design from 2 resident CTAs per SM to 1, halving occupancy but doubling the pipeline depth per CTA.
- **Warpgroups vs. registers.** Each consumer warpgroup adds 128 thread slots and eats its chunk of the 65,536-register file. With register reallocation, if the producer takes `R_p` registers per thread and each of the `W − 1` consumers takes `R_c`, the constraint is `R_p + (W−1)·R_c ≤ 512` (since 512 × 128 = 65,536). The Colfax Pipelining Tutorial gives the examples 24/240/240 = 504 for one producer + two consumers and 32/160/160/160 = 512 for one producer + three consumers. (A compile-time sketch of this constraint follows the list.)
- **Why WS helps register budget.** Without register reallocation, all warpgroups get the same register count (the max of what any of them needs). With reallocation, the producer — which only issues TMA and doesn't need much — can be budgeted tightly, freeing register headroom for the consumer to hold a large accumulator. Colfax Pipelining Tutorial: *"A larger portion of registers can be assigned to the consumer warps, which typically need more of them."*
- **Why occupancy is fragile.** At 228 KB SMEM/SM and a 128 KB per-CTA SMEM footprint (e.g., 4-stage FP16 big tile), only one CTA fits per SM. This is common for high-performance Hopper GEMM, which is why persistent kernels (one CTA per SM lifetime-of-kernel) are natural on Hopper (Aleksa Gordic observation, confirmed: *"with persistent kernels, we often launch just one thread block per SM, so it's important to structure work so that all warp schedulers are kept busy"*).
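Here is that sketch, folding in the `setmaxnreg` granularity rules quoted in §8.2 (per-thread counts in [24, 256], multiples of 8); the helper name is illustrative:
```cpp
// Illustrative check for a 1-producer + (W-1)-consumer register split (§9.3).
constexpr bool valid_reg_split(int r_producer, int r_consumer, int num_warpgroups) {
  const bool in_range = r_producer >= 24 && r_producer <= 256 &&
                        r_consumer >= 24 && r_consumer <= 256;   // setmaxnreg range
  const bool aligned  = (r_producer % 8 == 0) && (r_consumer % 8 == 0);
  const bool fits_rf  = r_producer + (num_warpgroups - 1) * r_consumer <= 512;
  return in_range && aligned && fits_rf;
}

static_assert( valid_reg_split(40, 232, 3), "");  // 40 + 2*232 = 504 (the §8.2 choice)
static_assert( valid_reg_split(24, 240, 3), "");  // 24 + 2*240 = 504 (Colfax example)
static_assert( valid_reg_split(32, 160, 4), "");  // 32 + 3*160 = 512
static_assert(!valid_reg_split(64, 240, 3), "");  // 64 + 2*240 = 544 > 512
```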
### 9.4. Sanity check on register split choices (from Colfax Pipelining Tutorial)
The tutorial observed that:
- **With 3 warpgroups (1P + 2C):** `ptxas` reports "Used 168 registers", which is `504 / 3 = 168`. That 504-register per-thread budget is then redistributed with `setmaxnreg`: e.g. 40/232/232 (this document's §8.2 choice) or 24/240/240, with the producer dropping to its small allocation via `setmaxnreg.dec` and each consumer raised via `setmaxnreg.inc`.
- **With 4 warpgroups (1P + 3C):** `ptxas` reports "Used 128 registers" = `512/4`. So a 32/160/160/160 split works exactly up to the 64K RF cap.
- **FP16 accumulator, small enough that register usage is 90/thread without reallocation:** no benefit from reallocation (register count is already the bottleneck, not SMEM or MMA).
- **FP32 accumulator, reallocation removed:** "2784 bytes stack frame, 4764 bytes spill stores, 4760 bytes spill loads ... Used 168 registers ... about 21 TFLOP/s, a catastrophic loss of performance." Register reallocation is not optional for FP32-accum WS kernels on Hopper.
---
## Appendix: Quick reference — PTX synchronization for Hopper kernels
| Purpose | PTX (verified syntax) | Section |
|---|---|---|
| Initialize mbarrier (CTA scope) | `mbarrier.init.shared::cta.b64 [mbar], count;` | §3.3 |
| Arrive + set expected-tx | `mbarrier.arrive.expect_tx.shared::cta.b64 _, [mbar], tx_bytes;` | §3.3 |
| Plain arrive | `mbarrier.arrive.shared::cta.b64 _, [mbar];` | §3.3 |
| Try-wait (use in spin loop) | `mbarrier.try_wait.parity.shared::cta.b64 P, [mbar], phase;` | §3.3 |
| Test-wait (single poll) | `mbarrier.test_wait.parity.shared::cta.b64 P, [mbar], phase;` | §3.3 |
| Invalidate mbarrier | `mbarrier.inval.shared::cta.b64 [mbar];` | §3.3 |
| Elect single thread | `elect.sync _\|%pred, 0xffffffff;` | §4.6 |
| Proxy fence (generic → async SMEM) | `fence.proxy.async.shared::cta;` | §2.3, §4.8 |
| TMA 2D load | `cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [smem], [desc, {x, y}], [mbar];` | §4.7 |
| TMA 2D store commit | `cp.async.bulk.commit_group;` | §4.8 |
| TMA 2D store wait | `cp.async.bulk.wait_group N;` | §4.8 |
| Cluster barrier arrive | `barrier.cluster.arrive;` | §6.4 |
| Cluster barrier wait | `barrier.cluster.wait;` | §6.4 |
| WGMMA fence | `wgmma.fence.sync.aligned;` | §5.4 |
| WGMMA commit | `wgmma.commit_group.sync.aligned;` | §5.4 |
| WGMMA wait ≤ N groups | `wgmma.wait_group.sync.aligned N;` | §5.4 |
| Reallocate registers (release) | `setmaxnreg.dec.sync.aligned.u32 <count>;` | §8.2, PTX ISA 9.2 §9.7.19.5 |
| Reallocate registers (request) | `setmaxnreg.inc.sync.aligned.u32 <count>;` | §8.2, PTX ISA 9.2 §9.7.19.5 |
---
## Honest caveats and gaps in this session
Items I could not fully verify from primary sources within this session, flagged as `[UNVERIFIED]` above:
1. **SM counts for specific H100 SKUs** (132 for SXM5, 114 for PCIe, 144 for full GH100). These are consistently cited across multiple secondary sources (Wikipedia, Aleksa Gordic, SemiAnalysis), but I did not fetch a first-party NVIDIA datasheet confirming each SKU's exact count.
2. **8 GPCs × 18 SMs = 144** breakdown on Hopper. Cited by Aleksa Gordic's "Inside NVIDIA GPUs"; not fetched from a first-party NVIDIA document in this session.
3. **"4 warp schedulers per SM" / "4 Tensor Cores per SM"**. Cross-confirmed across two secondary sources (Yadav, Colfax WGMMA Tutorial comment by Ryo) and consistent with the architecture's observable behavior, but not located verbatim in the Hopper Tuning Guide or whitepaper in this session.
4. **"SS mode is 5–10% faster than RS mode"** (Ash Vardanian blog). Cited as an observation, not an NVIDIA-documented specification. Take as suggestive.
5. **The exact layout of the FP32 accumulator register fragment for each m64nNk16 variant.** The PTX ISA does provide these in §9.7.15.5.1.1 (Matrix Fragments for `wgmma.mma_async.m64nNk16`), but I did not open and transcribe that table in this session. The per-thread register counts in §5.1 are derived by formula and consistent with the 16-register count shown in the verbatim FP16-acc example from Colfax.
For any production kernel, I would strongly recommend a final cross-check against the PTX ISA 9.2 sections cited (particularly §9.7.15.5.1 and §9.7.13.15.x) before committing bit-level layouts to code. The mental model in this document is accurate, but exact bit positions and register-index orderings are only reliable up to the level of detail verified in the cited sources and should be re-verified before they go into hand-written PTX.