<h1 style="text-align:center;">Parallel Programming HW3</h1>
<h3 style="text-align:center;">110062305 謝諺緯</h3>

# 1. Implementation

## a. Describe how you implemented the FlashAttention forward pass using CUDA.

#### Key Steps and Improvements Over Sequential Version

##### 1. Matrix Blocking
- **Sequential Version**:
  - The sequential implementation divided `Q`, `K`, and `V` into manageable blocks (`br` and `bc`). The second `for` loop in `flash_attention` iteratively processed blocks of `Q`, reloading them for each block of `K`.
- **CUDA Implementation**:
  - I flattened the second loop and process all rows of `Q` in a single pass. Instead of staging blocks of `Q`, the CUDA implementation copies the entire `Q` matrix to the GPU once.
  - This eliminates the need to repeatedly move blocks of `Q` in and out of memory, significantly reducing memory transfer overhead.

##### 2. SRAM Usage
- **Sequential Version**:
  - Intermediate matrices (e.g., `sij`, `pij`) were stored in main memory, and each block computation accessed them repeatedly.
- **CUDA Implementation**:
  - Shared memory (`__shared__`) stores tiles of `Q` and `K`, as well as temporary results such as partial dot products and attention scores. For instance:
    ```cpp
    __shared__ float Q_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float K_tile[TILE_SIZE][TILE_SIZE];
    ```
  - **Usage**:
    - During the computation of `QK^T`, tiles of `Q` and `K` are kept in shared memory. Threads within the same block collaboratively load the data and reuse it for multiple calculations, minimizing HBM accesses.
    - For example, `QKDotAndScalarKernel` performs its dot products on these shared-memory tiles, reducing redundant global memory transactions.

##### 3. Reducing `cudaMemcpy` Calls
- **Sequential Version**:
  - Each batch of `Q`, `K`, and `V` was processed independently; porting that structure directly would require multiple `cudaMemcpy` calls per batch to move intermediate results between host and device memory.
- **CUDA Implementation**:
  - By flattening the batch loop into the `flash_attention_all_batches` function, all batches stay in GPU memory throughout the computation.
  - This eliminates transfers of intermediate results back and forth, significantly reducing the number of `cudaMemcpy` calls.

##### 4. Key Algorithm Steps
- **Dot Product (`QK^T`)**:
  - **Sequential Version**: computed row by row for each block of `Q` and `K`, iteratively updating the results.
  - **CUDA Implementation**: the second loop is flattened so that `QKDotAndScalarKernel` computes `QK^T` for all rows of `Q` at once. Shared-memory tiles of `Q` and `K` let many rows be processed in parallel.
- **Row-wise Maximum (`m`)**:
  - Identical to the sequential version, but implemented in parallel by `RowMaxKernel`, where threads compute row-wise maxima efficiently using warp reductions (see the sketch after this list).
- **Softmax Computation**:
  - **Sequential Version**: normalization was performed row by row with attention to numerical stability.
  - **CUDA Implementation**: parallelized across rows using `MinusMaxAndExpKernel` and `RowSumKernel`.
- **Weighted Sum and Output (`O`)**:
  - `UpdateMiLiOiKernel` computes the output matrix `O` in parallel, using shared memory for partial sums to reduce global memory accesses.
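As a concrete illustration of the warp-reduction idea behind `RowMaxKernel`, here is a minimal sketch. The kernel name, launch shape, and argument list are assumptions for illustration; the submitted kernel differs in its details.

```cpp
#include <cfloat>
#include <cstddef>
#include <cuda_runtime.h>

// Illustrative only: one warp per row of a (numRows x width) score matrix,
// where width stands for the current column-block size (bc).
// Each lane strides over its row, then a shuffle reduction combines the
// per-lane maxima so that lane 0 holds the row maximum.
__global__ void RowMaxSketch(float *rowMax, const float *scores,
                             int numRows, int width) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    int lane   = threadIdx.x % 32;
    if (warpId >= numRows) return;

    const float *row = scores + (size_t)warpId * width;
    float m = -FLT_MAX;
    for (int j = lane; j < width; j += 32)
        m = fmaxf(m, row[j]);

    // Warp-level reduction via shuffles; no shared memory is needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        m = fmaxf(m, __shfl_down_sync(0xffffffff, m, offset));

    if (lane == 0)
        rowMax[warpId] = m;
}
```

Launched with 256-thread blocks, each block of this sketch would handle eight rows.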
##### 5. Improvements Over Sequential Version
- **Loop Flattening**:
  - In the sequential implementation, the second loop in `flash_attention` repeatedly loaded blocks of `Q` for each block of `K`. In my CUDA implementation, the entire `Q` matrix is processed at once, removing the block-by-block computation and significantly improving performance.
- **Memory Access**:
  - The sequential version made frequent, redundant accesses to HBM for intermediate results.
  - My implementation minimizes HBM access by loading `Q`, `K`, and `V` into GPU memory once and reusing them throughout the computation.
- **Parallelism**:
  - The sequential implementation processed computations row by row and block by block, with no parallelism.
  - My implementation parallelizes across rows and blocks, fully utilizing GPU threads for both computation and memory operations.

## b. Explain how matrices Q, K, and V are divided into blocks and processed in parallel.

1. **Row-Block for Q**
   - The code treats `Q` in row blocks by using a grid dimension of `B * N`. Each thread (or small group of threads) handles one row index `i` (from 0 to \(N-1\) within each batch) and loads the corresponding part of \(Q\) from global memory as needed. In other words, one row of length \(d\) is considered at a time when performing the dot product with a column block of \(K\).
2. **Column-Block for K and V**
   - A column block size `bc` is chosen. A loop (e.g., `for (int jBlock = 0; jBlock < tc; jBlock++)`) iterates over the range \([0, N)\) in steps of `bc`.
   - For each block, the threads process the columns `startJ` through `startJ + bc - 1`. This block of columns is loaded into shared memory (e.g., in `QKDotAndScalarKernel`) to perform the partial dot products or to combine with \(V\) for output accumulation.
3. **Tiled Approach & Shared Memory**
   - In kernels like `QKDotAndScalarKernel`, tiles of \(Q\) and \(K\) (up to `TILE_SIZE x TILE_SIZE`) are copied into shared memory, allowing each thread block to reuse those elements several times before moving on to the next chunk of the `d` dimension.

## c. Describe how you choose the block sizes B_r and B_c and why.

1. **Block Size (`B_c`)**:
   - `B_c` is the number of columns of `K` and `V` processed in each block.
   - The value was chosen based on the shared memory available on the GPU. Shared memory is limited (e.g., 48 KB per streaming multiprocessor), so `B_c` was calculated such that an entire block of `K` or `V` columns fits:
     - Shared memory usage for `K_tile` = `B_c × d × sizeof(float)`.
     - Additional shared memory is reserved for temporary buffers and intermediate results.
   - For example, with `d = 64` and `sizeof(float) = 4` bytes, `B_c` was set to 128 to stay within the shared memory limit (see the sketch after this list).
2. **Why B_r is Not Needed**:
   - The entire `Q` matrix is processed at once without dividing it into row blocks: each thread processes one row of `Q` directly, so no `B_r` parameter is required.
   - By not tiling `Q`, memory transfer overhead is reduced and computation is parallelized efficiently across the rows of `Q`.
3. **Why This Strategy?**
   - **Shared Memory Efficiency**: `B_c` ensures that shared memory is fully utilized for `K` and `V`, whose tiles are reused many times during the dot-product computation.
   - **Simplified Parallelism**: Each thread directly processes a row of `Q`, while the `K` and `V` tiles are shared among the threads of a block for maximum efficiency.
   - **Memory Access Optimization**: This strategy reduces global memory transactions by keeping the frequently accessed `K` and `V` tiles in shared memory.
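A minimal host-side sketch of the `B_c` calculation described above, assuming a 48 KB per-block shared-memory budget, a small reserve for temporaries, and counting only the `K` tile; the helper name `chooseBc`, the 4 KB reserve, and the cap at 128 are illustrative values matching the numbers quoted in the text, not the submitted code.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of deriving the column-block size B_c from the shared-memory budget.
// The 48 KB limit, the 4 KB reserve, and the 128 cap are assumptions.
int chooseBc(int d, std::size_t sharedMemBytes = 48 * 1024) {
    const std::size_t reserve   = 4 * 1024;                               // temporaries, partial sums
    const std::size_t perColumn = static_cast<std::size_t>(d) * sizeof(float); // one K_tile column
    int bc = static_cast<int>((sharedMemBytes - reserve) / perColumn);
    return std::min(bc, 128);   // e.g., d = 64 gives B_c = 128, as used in the report
}
```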
## d. Specify the configurations for CUDA kernel launches.

1. **Threads Per Block**:
   - For matrix operations such as the dot product (`QK^T`) and softmax:
     - A **2D block** of threads is used, with `blockDim.x = 16` and `blockDim.y = 16`.
     - Each thread computes a single element of the output or intermediate matrix for its assigned row and column.
2. **Grid Dimensions**:
   - For `QKDotAndScalarKernel`:
     - A **2D grid** is used, where:
       - `gridDim.x = ceil(B * N / blockDim.x)` covers the row blocks across all batches.
       - `gridDim.y = ceil(bc / blockDim.y)` covers the column blocks.
   - For row-wise operations (e.g., `RowMaxKernel` and `RowSumKernel`):
     - A **1D grid** is used with `gridDim.x = ceil(B * N / blockDim.x)` to process rows in parallel.
3. **Shared Memory Allocation**:
   - **Tiled matrices (`Q_tile` and `K_tile`)**:
     - Shared memory is allocated for tiles of size `TILE_SIZE x TILE_SIZE` (16x16).
     - Each tile of `Q` and `K` is collaboratively loaded by the threads of a block.
     - Example:
       ```cpp
       __shared__ float Q_tile[16][16];
       __shared__ float K_tile[16][16];
       ```
   - **Intermediate results**:
     - Shared memory is reused for partial dot-product results within each block.
4. **Key Considerations**:
   - **Shared memory optimization**: shared memory usage was tuned to stay within the hardware limit while maximizing utilization.
   - **Occupancy**: the number of threads per block and the grid size were chosen to maximize the number of active warps and achieve high GPU occupancy.
   - **Coalesced memory access**: threads access global memory in a coalesced manner when loading `Q`, `K`, and `V`, minimizing latency.
5. **Example Configuration**:
   - Kernel launch for the dot product:
     ```cpp
     dim3 blockDim(16, 16);
     dim3 gridDim(
         (B * N + blockDim.x - 1) / blockDim.x,
         (bc + blockDim.y - 1) / blockDim.y
     );
     QKDotAndScalarKernel<<<gridDim, blockDim>>>(...);
     ```

## e. Justify your choices and how they relate to the blocking factors and the SRAM size.

1. **Blocking Factors and Shared Memory Usage**:
   - The **blocking factor (`B_c`)** was chosen so that tiles of `K` and `V` fit into shared memory (SRAM), ensuring each tile is processed entirely on chip without extra global memory accesses.
   - Shared memory allocation:
     - `Q_tile` and `K_tile` are `TILE_SIZE x TILE_SIZE` (16x16 for a 256-thread block), which fits comfortably within the shared memory of modern GPUs (~48 KB per streaming multiprocessor).
     - This allows efficient reuse of the `Q` and `K` tiles across multiple dot-product calculations.
2. **Thread Block and Grid Dimensions**:
   - The **block size** (`16x16`) provides a good balance between:
     - **Parallelism**: each block processes a 16x16 submatrix, letting 256 threads work in parallel.
     - **Shared memory availability**: two 16x16 tiles require 2 x 16 x 16 x 4 bytes = 2 KB, leaving ample space for additional buffers and avoiding shared memory overflow.
   - **Grid dimensions** are determined by the matrix sizes, ensuring the entire `Q`, `K`, and `V` matrices are covered. This setup supports high GPU occupancy while maintaining coalesced memory access.
3. **SRAM Size Optimization**:
   - **Efficient tiling**:
     - By splitting `K` and `V` into column blocks (`B_c`), only a manageable portion of these matrices is loaded into SRAM at a time, reducing global memory accesses.
     - Tiling ensures that SRAM is reused efficiently, as each tile is shared across multiple threads for multiple operations (e.g., dot product, normalization).
4. **GPU Occupancy and Warp Utilization**:
   - The block size (`16x16` = 256 threads) is a multiple of the warp size (32), so every warp is fully populated.
   - The grid size was calculated to spread active thread blocks across all SMs, ensuring high occupancy and parallelism.
5. **Minimizing Latency**:
   - Keeping tiles of `Q` and `K` in shared memory significantly reduces latency compared to reading global memory for each computation.
   - The chosen blocking factor (`B_c`) minimizes tile-switching overhead, balancing computational workload and memory efficiency.
6. **Numerical Stability**:
   - The blocking strategy lets each block of rows in `Q` interact with multiple blocks of columns in `K` and `V` independently, allowing precise, parallel computation of row-wise maxima and normalization.
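To make the mapping from the launch configuration in section d to the tiling in section e concrete, here is a minimal sketch of a tiled `QK^T` kernel. It is written under the assumptions stated in its comments (16x16 tiles, one output element per thread, rows flattened across batches, `N` a multiple of `TILE_SIZE`); the kernel name `QKDotSketch` and its argument list are illustrative, and the submitted `QKDotAndScalarKernel` differs in its details.

```cpp
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Illustrative tiled QK^T kernel for one column block [startJ, startJ + bc).
// Assumed layout (not necessarily identical to the submitted kernel):
// Q is row-major (B*N, d), K is row-major (B, N, d), and S stores the scores
// for the current column block as a row-major (B*N, bc) buffer.
// N is assumed to be a multiple of TILE_SIZE so a thread block never
// straddles two batches.
__global__ void QKDotSketch(float *S, const float *Q, const float *K,
                            int B, int N, int d, int startJ, int bc,
                            float scale) {
    __shared__ float Q_tile[TILE_SIZE][TILE_SIZE];
    __shared__ float K_tile[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.x * TILE_SIZE + threadIdx.x;   // row index over all batches
    int col = blockIdx.y * TILE_SIZE + threadIdx.y;   // column inside this block
    int b   = (blockIdx.x * TILE_SIZE) / N;           // batch of this thread block
    float acc = 0.0f;

    // March over the d dimension one TILE_SIZE-wide chunk at a time.
    for (int t0 = 0; t0 < d; t0 += TILE_SIZE) {
        int tQ = t0 + threadIdx.y;                    // Q element loaded by this thread
        int tK = t0 + threadIdx.x;                    // K element loaded by this thread
        int j  = startJ + col;                        // global column index into K

        Q_tile[threadIdx.x][threadIdx.y] =
            (row < B * N && tQ < d) ? Q[(size_t)row * d + tQ] : 0.0f;
        K_tile[threadIdx.x][threadIdx.y] =
            (j < N && tK < d) ? K[((size_t)b * N + j) * d + tK] : 0.0f;
        __syncthreads();

        // Each thread accumulates the partial dot product for its (row, col).
        for (int t = 0; t < TILE_SIZE; t++)
            acc += Q_tile[threadIdx.x][t] * K_tile[t][threadIdx.y];
        __syncthreads();
    }

    if (row < B * N && col < bc && startJ + col < N)
        S[(size_t)row * bc + col] = acc * scale;
}
```

With the launch configuration from section d (`dim3 blockDim(16, 16)` and `gridDim = ((B*N + 15) / 16, (bc + 15) / 16)`), each thread block of this sketch produces one 16x16 tile of the score matrix for the current column block.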
# 2. Profiling Results

### Results (tested on testcase t05)
```
Invocations   Metric Name               Metric Description                    Min         Max         Avg
Device "NVIDIA GeForce GTX 1080 (0)"
    Kernel: RowSumKernel(float*, float const *, int, int)
          1   achieved_occupancy        Achieved Occupancy               0.717591    0.717591    0.717591
          1   sm_efficiency             Multiprocessor Activity            99.91%      99.91%      99.91%
          1   shared_load_throughput    Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1   shared_store_throughput   Shared Memory Store Throughput 0.00000B/s  0.00000B/s  0.00000B/s
          1   gld_throughput            Global Load Throughput         386.32GB/s  386.32GB/s  386.32GB/s
          1   gst_throughput            Global Store Throughput        386.32MB/s  386.32MB/s  386.32MB/s
    Kernel: RowMaxKernel(float*, float const *, int, int)
          1   achieved_occupancy        Achieved Occupancy               0.673466    0.673466    0.673466
          1   sm_efficiency             Multiprocessor Activity            99.88%      99.88%      99.88%
          1   shared_load_throughput    Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1   shared_store_throughput   Shared Memory Store Throughput 0.00000B/s  0.00000B/s  0.00000B/s
          1   gld_throughput            Global Load Throughput         342.79GB/s  342.79GB/s  342.79GB/s
          1   gst_throughput            Global Store Throughput        342.79MB/s  342.79MB/s  342.79MB/s
    Kernel: QKDotAndScalarKernel(float*, float const *, float const *, int, int, int, int, float, int)
          1   achieved_occupancy        Achieved Occupancy               0.973949    0.973949    0.973949
          1   sm_efficiency             Multiprocessor Activity            99.99%      99.99%      99.99%
          1   shared_load_throughput    Shared Memory Load Throughput  3320.4GB/s  3320.4GB/s  3320.4GB/s
          1   shared_store_throughput   Shared Memory Store Throughput 664.07GB/s  664.07GB/s  664.07GB/s
          1   gld_throughput            Global Load Throughput         207.52GB/s  207.52GB/s  207.52GB/s
          1   gst_throughput            Global Store Throughput        166.02GB/s  166.02GB/s  166.02GB/s
    Kernel: UpdateMiLiOiKernel(float*, float*, float*, float const *, float const *, float const *, float const *, int, int, int, int, int)
          1   achieved_occupancy        Achieved Occupancy               0.714618    0.714618    0.714618
          1   sm_efficiency             Multiprocessor Activity            99.91%      99.91%      99.91%
          1   shared_load_throughput    Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1   shared_store_throughput   Shared Memory Store Throughput 0.00000B/s  0.00000B/s  0.00000B/s
          1   gld_throughput            Global Load Throughput         47.888GB/s  47.888GB/s  47.888GB/s
          1   gst_throughput            Global Store Throughput        5.3532GB/s  5.3532GB/s  5.3532GB/s
    Kernel: MinusMaxAndExpKernel(float*, float const *, float const *, int, int)
          1   achieved_occupancy        Achieved Occupancy               0.838661    0.838661    0.838661
          1   sm_efficiency             Multiprocessor Activity            99.98%      99.98%      99.98%
          1   shared_load_throughput    Shared Memory Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
          1   shared_store_throughput   Shared Memory Store Throughput 0.00000B/s  0.00000B/s  0.00000B/s
          1   gld_throughput            Global Load Throughput         198.51GB/s  198.51GB/s  198.51GB/s
          1   gst_throughput            Global Store Throughput        352.91GB/s  352.91GB/s  352.91GB/s
```
### Analysis

#### a. Achieved Occupancy
- **Key Observations**:
  - Most kernels achieve high occupancy (e.g., `QKDotAndScalarKernel` at 97.39% and `UpdateMiLiOiKernel` at 71.46%).
  - High occupancy indicates effective utilization of GPU resources, with enough active threads to hide memory latency.

#### b. SM Efficiency (Multiprocessor Activity)
- **Key Observations**:
  - SM efficiency is near 100% for all kernels, indicating that the streaming multiprocessors (SMs) are fully utilized during kernel execution.
  - This demonstrates good parallelization and workload distribution.

#### c. Shared Memory Load/Store Throughput
- **Key Observations**:
  - Shared memory is used heavily by `QKDotAndScalarKernel`, with a **shared load throughput of 3320.4 GB/s** and a **store throughput of 664.07 GB/s**.
  - The other kernels (`RowMaxKernel`, `RowSumKernel`, `MinusMaxAndExpKernel`, `UpdateMiLiOiKernel`) do not rely on shared memory, as their operations are row-wise reductions or point-wise updates over global memory.

#### d. Global Memory Throughput
- **Global Load Throughput**:
  - `RowSumKernel` (386.32 GB/s) and `RowMaxKernel` (342.79 GB/s) stream their inputs at the highest rates, while `QKDotAndScalarKernel` still reaches 207.52 GB/s even though it reuses tiles of `Q` and `K` from shared memory.
  - The consistently high load throughput across kernels indicates coalesced memory access patterns.
- **Global Store Throughput**:
  - Global store throughput is highest in `MinusMaxAndExpKernel` (352.91 GB/s), which writes a full matrix of exponentiated scores, followed by `QKDotAndScalarKernel` (166.02 GB/s).
  - The row-reduction kernels write only one value per row, so their store throughput is much lower (`RowSumKernel`: 386.32 MB/s, `RowMaxKernel`: 342.79 MB/s).
  - `UpdateMiLiOiKernel` also shows relatively low store throughput (5.35 GB/s), as it combines intermediate results rather than performing bulk writes.

#### e. Analysis Per Kernel
1. **`QKDotAndScalarKernel`**:
   - High occupancy (97.39%) and SM efficiency (99.99%) indicate this kernel is well optimized for parallel execution.
   - Shared memory is used very efficiently, with over 3 TB/s of load throughput and 664 GB/s of store throughput.
   - Coalesced global memory access contributes to the 207.52 GB/s global load throughput.
2. **`RowMaxKernel`**:
   - Good occupancy (67.35%) and SM efficiency (99.88%) show the parallel reduction is effective.
   - It uses no shared memory, which is expected because the kernel reduces each row directly from global memory.
3. **`MinusMaxAndExpKernel`**:
   - Occupancy (83.86%) and SM efficiency (99.98%) are strong, so the numerically stable exponentiation is well parallelized.
   - Shared memory is not needed, since the kernel performs point-wise operations across rows.
4. **`RowSumKernel`**:
   - High occupancy (71.76%) with good SM efficiency (99.91%).
   - High global load throughput (386.32 GB/s) shows the row-wise summation streams its input efficiently.
5. **`UpdateMiLiOiKernel`**:
   - Occupancy (71.46%) and SM efficiency (99.91%) are solid.
   - Moderate global load throughput (47.88 GB/s) and low store throughput (5.35 GB/s) reflect the complexity of combining intermediate results.
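For reference, the "combining of intermediate results" in `UpdateMiLiOiKernel` corresponds to the standard FlashAttention online-softmax update, assuming the kernel follows the original paper's recurrence (\(\tilde{m}_{ij}\), \(\tilde{P}_{ij}\), and \(\tilde{\ell}_{ij}\) denote the block-local row maxima, exponentiated scores, and row sums produced by `RowMaxKernel`, `MinusMaxAndExpKernel`, and `RowSumKernel`):

$$
m_i^{\text{new}} = \max\big(m_i,\ \tilde{m}_{ij}\big), \qquad
\ell_i^{\text{new}} = e^{\,m_i - m_i^{\text{new}}}\,\ell_i + e^{\,\tilde{m}_{ij} - m_i^{\text{new}}}\,\tilde{\ell}_{ij},
$$

$$
O_i \leftarrow \operatorname{diag}\big(\ell_i^{\text{new}}\big)^{-1}\Big(\operatorname{diag}(\ell_i)\,e^{\,m_i - m_i^{\text{new}}}\,O_i \;+\; e^{\,\tilde{m}_{ij} - m_i^{\text{new}}}\,\tilde{P}_{ij}\,V_j\Big).
$$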
#### f. Optimization Opportunities
1. **Kernel-Specific Optimizations**:
   - For kernels like `RowMaxKernel` and `MinusMaxAndExpKernel`, explore opportunities to use shared memory or to reduce redundant global memory accesses.
2. **Global Memory Throughput**:
   - The global memory throughput of `UpdateMiLiOiKernel` could potentially be improved by better coalescing or by reorganizing its memory access patterns.
3. **Shared Memory Utilization**:
   - Optimize shared memory allocation in kernels where it is underutilized, to balance memory bandwidth better.

# 3. Experiment & Analysis

## System Spec
The experiments were conducted on the Apollo-GPU platform.

## Optimization

### Overview Charts
![output](https://hackmd.io/_uploads/r1mqqZdBkx.png)
![output (1)](https://hackmd.io/_uploads/HkQc9WOrJg.png)

### `Version 1`: The Original Version
- **Approach**
  - A host-side loop over `B` (the batch dimension).
  - Inside each batch, another set of nested loops over sub-blocks `(i, j)`, repeatedly launching kernels for small `(br x bc)` chunks.
  - `QKDotAndScalarKernel` itself uses a for-loop (over `t` in `[0..d-1]`) to perform the dot product sequentially.
- **Drawbacks**
  - Excessive kernel launches (the host calls a kernel for every small `(i, j)` block).
  - Poor GPU utilization (many small blocks, and the device code still contains inner loops).
  - Long execution times, e.g., 30+ seconds on `(B, N, d) = (13600, 128, 32)`.

---

### `Version 2`: Flattening the Inner For-Loop
- **Key Changes**
  - Instead of having the device kernel handle a tiny sub-block `(i, j)` plus an inner for-loop for the dot products, this version uses 2D grids and blocks to handle a larger region at once.
  - The host still loops over `B`, but for each batch the kernel covers the entire `(N x bc)` region of `QKDotAndScalarKernel`.
- **Benefits**
  - Fewer kernel launches: each call covers all `(i, j)` in that block.
  - Better concurrency for the dot products (less host overhead).
  - Performance improves significantly over Version 1 (for example, from ~35 seconds to a few seconds).

---

### `Version 3`: Bringing `B` into the Function
- **Key Changes**
  - Eliminates the outer `for (b = 0..B-1)` loop on the host side.
  - All `B*N` rows are processed in a single pass on the GPU: the kernel indexing uses `idx = b*N + i` (see the sketch after Version 4).
  - This drastically reduces kernel-launch overhead by running one set of kernels for *all* batches rather than launching separately for each batch.
- **Benefits**
  - Far fewer kernel calls (no more loop over `B`).
  - Higher GPU occupancy when `B` is large (e.g., `B = 13600`).
  - Speeds up the computation when `B` is large: the runtime can drop from several seconds to less than a second in some cases.

---

### `Version 4`: Shared Memory & Tiled Approach
- **Key Changes**
  - Same "all batches at once" approach as Version 3, but further optimizes the dot product in `QKDotAndScalarKernel` via **tile-based** (blocked) matrix multiplication.
  - Shared memory holds tiles of size `TILE_SIZE x TILE_SIZE` (e.g., 16x16) from `Q` and `K` during the dot-product loop.
  - This reduces global memory traffic along the inner `d` dimension and lets partial sums accumulate in fast shared memory.
- **Benefits**
  - Significant speedup for large `d`: the tile-based method provides better memory reuse and higher throughput.
  - It yields the best performance among the four versions; for example, the runtime can drop from seconds to well below 1 second when `(B, N, d)` are well suited.
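A minimal sketch of the batch-flattened indexing used from Version 3 onward; the kernel name `RowWiseSketch` and its body are illustrative, not the submitted code.

```cpp
#include <cuda_runtime.h>

// Illustrative row-wise kernel: one thread per row across every batch,
// so the host launches it once instead of once per batch.
__global__ void RowWiseSketch(float *out, const float *in, int B, int N, int d) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. B*N - 1
    if (idx >= B * N) return;

    int b = idx / N;                                   // batch index
    int i = idx % N;                                   // row index within the batch
    const float *row = in + ((size_t)b * N + i) * d;   // same as in + (size_t)idx * d

    // Example body: a row-wise reduction, as a RowSumKernel-style kernel would do.
    float sum = 0.0f;
    for (int t = 0; t < d; t++)
        sum += row[t];
    out[idx] = sum;
}
```

A launch such as `RowWiseSketch<<<(B * N + 255) / 256, 256>>>(out, in, B, N, d)` then covers every row of every batch in one call.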
---

### Overall Trend
1. **Version 1** is the naive, original approach with many host loops and small kernel blocks.
2. **Version 2** flattens the inner loop, improving concurrency and reducing overhead.
3. **Version 3** merges `B` into the GPU processing, cutting down on the repeated kernel launches for each batch.
4. **Version 4** adds the **shared-memory, tile-based** dot product, typically producing the largest performance gain when `(N, d)` are large.

In practice, Version 4 can offer orders-of-magnitude improvement over Version 1, especially on large GPUs and large `(B, N, d)` workloads.

# Experience & Conclusion

#### a. What have you learned from this homework?
Through this homework, I gained valuable experience in implementing efficient CUDA kernels and leveraging GPU resources effectively. I learned how to optimize performance with techniques such as tiling, shared memory utilization, and coalesced memory access. Profiling with NVIDIA `nvprof` allowed me to analyze kernel performance in detail, deepening my understanding of the GPU memory hierarchy and of optimization strategies. The task also highlighted the importance of balancing memory usage, parallelism, and occupancy to achieve optimal performance.

#### b. Feedback (optional)
Thanks to the TAs and the professor for designing this assignment. It provided a challenging and rewarding experience that greatly enhanced my understanding of CUDA programming and GPU optimization.