### CUDA Programming Model
- In order to use the GPU to accelerate our program, we need to understand the **CUDA programming model**.
- NVIDIA invented the CUDA programming model to address these challenges. CUDA is a parallel computing platform and programming model for general computing on graphics processing units (GPUs). With CUDA, you can speed up applications by harnessing the power of GPUs.
- The CUDA programming model provides an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware.
- For easy adoption, CUDA provides a simple interface based on C/C++. A great benefit of the CUDA programming model is that it allows you to write a scalar program; the CUDA compiler uses programming abstractions to leverage the parallelism built into the CUDA programming model.
<img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_f0769edb2750b2ca0b3524312102dc71.png" width=500>
- In our working environment, the hardware is an NVIDIA A100. The memory hierarchy of the A100 is shown in the GPU Memory Hierarchy section below.

### Software Programming View vs. Hardware Design View
- If you have ever looked at a document or article about NVIDIA GPUs or CUDA programming, you have probably run into a lot of confusing terminology. First of all, understand that these terms fall into two categories -- **software and hardware**. For example, when we write a CUDA program we launch a kernel, which means we tell the NVIDIA GPU which function we want it to run. In other words, it is **not** the case that once you write a CUDA program, the whole program runs directly on the GPU.
- The following are terms you need to know before we get into the world of CUDA programming.
  - **Software** view
    - Kernel:
      - A function launched on the GPU. Notice that you can launch many kernels in one CUDA program.
        ```cpp=
        // in your_cuda_program.cu
        kernel1<<<numBlocks1, numThreads1>>>(...);
        kernel2<<<numBlocks2, numThreads2>>>(...);
        ...
        ```
    - Grid:
      - A kernel is launched as a grid of thread blocks. You can roughly equate one grid with one **kernel** launch.
    - Thread block:
      - A thread block is divided into multiple warps.
    - Warp:
      - The basic execution unit.
      - Contains 32 threads in general.
      - SIMD-style execution: the threads in the same warp execute the same instruction on different data.
    - Thread:
      - One thread executes on one core.
  - Hardware view
    - GPC (Graphics Processing Cluster):
      - The full GA100 GPU has 8 GPCs (7 are enabled in the A100 product).
      - Each GPC contains 8 TPCs.
      - GPCs connect to the L2 cache.
    - TPC (Texture Processing Cluster):
      - Contains a texture cache.
      - Contains 2 SMs per TPC.
    - SM (Streaming Multiprocessor):
      - Holds multiple resident thread blocks.
      - All threads of one thread block must be located in the same SM.
      - Contains warp schedulers, and each scheduler issues instructions for only **one** warp at a time.
        - Therefore, context switches between warps are frequent! The cost of a warp context switch is essentially free (the scheduler only has to switch which warp's context it points to).
        - Since a warp switch is just a pointer change (the values in the registers never need to be moved), and the number of registers in an SM is limited, the number of thread blocks resident in an SM is also limited!
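      - These per-SM resources can be queried at run time. Below is a minimal, self-contained sketch (not part of the lab code) that prints the limits discussed above with `cudaGetDeviceProperties`; the file name `query.cu` in the compile command is just an example.
        ```cpp=
        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
            cudaDeviceProp prop;
            // Query the properties of GPU 0 (the default device).
            cudaGetDeviceProperties(&prop, 0);
            printf("Device name           : %s\n", prop.name);
            printf("SM count              : %d\n", prop.multiProcessorCount);
            printf("Warp size             : %d threads\n", prop.warpSize);
            printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
            printf("Max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
            printf("Registers per SM      : %d\n", prop.regsPerMultiprocessor);
            printf("Shared memory per SM  : %zu bytes\n", prop.sharedMemPerMultiprocessor);
            return 0;
        }
        ```
        Compile it like the other lab programs (e.g. `nvcc query.cu -o query`); on an A100 you should see 108 SMs and a warp size of 32.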
    - Tensor Core:
      - A specialized unit inside each SM that accelerates matrix multiply-accumulate operations (ordinary per-thread instructions execute on the CUDA cores instead).

![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_9a3ed9410d5bf735773c953db9f6ccc8.png)
- The NVIDIA A100 Tensor Core GPU implementation of the GA100 GPU includes the following units:
  - 7 GPCs
  - 7 or 8 TPCs/GPC
  - 2 SMs/TPC, up to 16 SMs/GPC, 108 SMs
  - 64 FP32 CUDA Cores/SM, 6912 FP32 CUDA Cores per GPU
  - 4 third-generation Tensor Cores/SM, 432 third-generation Tensor Cores per GPU
  - 5 HBM2 stacks, 10 512-bit Memory Controllers
  <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_f43194bfb0add2f713240c08a6436562.png">
- The relationship between the **software programming model** and the **hardware design blocks**:
  <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_696c2d24ccad10e7fa327a86471cf10b.png" width=500>

### GPU Memory Hierarchy and Programming
In CUDA, any data a thread wants to use must first be placed in device memory. Device memory itself is split between DRAM and on-chip SRAM, and in practice it is exposed as the following memory types:
- registers - read-write, per thread (SRAM)
- local memory - read-write, per thread (DRAM)
- shared memory - read-write, per block (SRAM)
- global memory - read-write, per grid (DRAM)
- constant memory - read-only, per grid (DRAM)
- texture memory - read-only, per grid (DRAM)

Notice that registers are private to each thread, while shared memory belongs to a block and is shared by every thread in that block. Both of these are on-chip memories, so they are much faster than the memories that reside in DRAM. (Local memory, despite its name, is not on-chip; it resides in DRAM, so accessing it is not fast.)

The global, constant, and texture memory of a grid can be used by threads in different blocks. These three reside in DRAM and persist across the different kernels of the same program; the difference between them lies in how they are optimized. For example, because constant and texture memory are read-only to threads, they are backed by caching mechanisms that speed up reads.

<img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_6e71ff4ea20c889e7d30fda27b65d797.png" width=500>

:::info
Be aware that the on-chip storage of each SM is split into two parts: a portion dedicated to **shared memory** and a portion dedicated to the **L1 cache**. The split between shared memory and L1 cache is configurable, and the GPU allocates the specified amount for each. Shared memory holds data that is explicitly shared among the threads of a block, while the L1 cache transparently caches global and local memory accesses. The L1 cache therefore reduces the number of global memory accesses and improves performance.
:::
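To make the memory spaces listed above concrete, here is a minimal sketch (the kernel and variable names are illustrative only, not part of the lab code) showing where each space appears in CUDA C++:

```cpp=
#include <cuda_runtime.h>

// Constant memory: read-only for kernels, visible to every block of a grid
// (filled from the host with cudaMemcpyToSymbol before the launch).
__constant__ int coeff[16];

__global__ void memory_spaces_demo(const int *g_in, int *g_out, int n) {
    // g_in/g_out point to global memory (allocated with cudaMalloc on the host).
    // Shared memory: one copy per thread block, visible to all of its threads.
    __shared__ int tile[256];                 // assumes blockDim.x <= 256
    // Ordinary local variables normally live in registers (per thread).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        tile[threadIdx.x] = g_in[idx];        // global -> shared
        __syncthreads();                      // wait until the whole tile is loaded
        g_out[idx] = tile[threadIdx.x] * coeff[0];   // shared + constant -> global
    }
    // Large per-thread arrays or register spills end up in local memory (DRAM).
}
```

Launching it with at most 256 threads per block keeps the `tile` array in bounds; texture memory is accessed through separate texture-object APIs and is not shown here.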
## Other Things You Need to Know for the Labs
- How to compile a CUDA program in the course container
```shell=
$ nvcc your_cuda_program.cu -o your_cuda_program
```
- How to run and profile a CUDA program
```shell=
$ ./your_cuda_program
$ sudo nvprof ./your_cuda_program
```

## Lab 14-0 Working Environment Setup
### Step 1 : Clone the Lab14 repository
```shell=
## bring up the AIAS course docker container
## clone the lab14 files
$ cd /workspace/projects
$ git clone https://playlab.computing.ncku.edu.tw:4001/acai-spring-2023/Lab14.git Lab14
$ cd Lab14

## add your private upstream repositories
## make sure you have created the project repo under your gitlab account
$ git remote add gitlab https://playlab.computing.ncku.edu.tw:4001/<your ldap name>/Lab14.git
$ git remote -v
gitlab https://playlab.computing.ncku.edu.tw:4001/<your ldap name>/Lab14.git (fetch)
gitlab https://playlab.computing.ncku.edu.tw:4001/<your ldap name>/Lab14.git (push)
origin https://playlab.computing.ncku.edu.tw:4001/acai-spring-2023/Lab14.git (fetch)
origin https://playlab.computing.ncku.edu.tw:4001/acai-spring-2023/Lab14.git (push)
```

### Step 2 : Setup your project upstream for submission
- **When you are done with your code**, you have to push your code back to your own gitlab account with the following commands...
```shell=
## the first time
$ git push --set-upstream https://playlab.computing.ncku.edu.tw:4001/<your ldap name>/Lab14.git master

## after the first time
$ git fetch origin <branch name>
$ git push gitlab <branch name>
```

## Lab 14-1 Simple Matrix Multiply
- In this lab, we want to use CUDA in C/C++ to accelerate the multiplication of two matrices, A and B, each of size 1024x1024 (INT).
- Here we directly demonstrate a reasonably optimized way to speed up the kernel. Notice that if you do not fully understand how your code maps to the actual execution on the GPU, you might get lower performance on the GPU than on the CPU. For instance, you can try on your own to launch the kernel with GridDim = (1, 1).
- In our program, we make the following assumptions:
  - 1024x1024x**sizeof(int)** = 4MB for each of the A, B, and C matrices.
  - Total MAC (Multiply-Accumulate) operations = 1024x1024x1024 = 1G MACs, i.e. about 2G integer operations.
  - The number of threads in a thread block = 32x32
  - The number of thread blocks in a grid = (1024/32)x(1024/32) = 32x32
- Let's first check whether the program runs correctly
```shell=
## under ./Lab14
$ cd Lab14-1
$ nvcc matmul.cu -o matmul
$ ./matmul 1024
matMul_kernel time = 0.003276 seconds
GPU time = 0.076076 seconds
CPU time = 7.163367 seconds
```

---

### Now, let's dive into our code!
- **main** function
  - Initialize the matrices A, B, and C
  - Call the **matmul_gpu** function to prepare and launch the kernel
  - Call **matmul_cpu** as a benchmark for the kernel function
```cpp=
int main(int argc, char **argv) {
    clock_t start_t, end_t;
    int N = argc > 1 ? (atoi(argv[1])) : (1 << 9);
    //Initialize matrices
    int *mat_a = (int *)malloc(sizeof(int)*N*N);
    int *mat_b = (int *)malloc(sizeof(int)*N*N);
    int *mat_c = (int *)malloc(sizeof(int)*N*N);
    init_matrix(mat_a, N);
    init_matrix(mat_b, N);
    init_matrix(mat_c, N);

    start_t = clock();
    matmul_gpu(mat_a, mat_b, mat_c, N);
    end_t = clock();
    printf("GPU time = %f seconds\n", (double)(end_t - start_t)/CLOCKS_PER_SEC);

    start_t = clock();
    matmul_cpu(mat_a, mat_b, mat_c, N);
    end_t = clock();
    printf("CPU time = %f seconds\n", (double)(end_t - start_t)/CLOCKS_PER_SEC);

    return 0;
}
```
- **matmul_gpu** function
  - step 1: Allocate GPU memory
    ```C=
    int *device_a, *device_b, *device_c;
    cudaMalloc((void **)&device_a, N*N*sizeof(int));
    cudaMalloc((void **)&device_b, N*N*sizeof(int));
    cudaMalloc((void **)&device_c, N*N*sizeof(int));
    ```
  - step 2: Copy to the GPU
    ```C=
    cudaMemcpy(device_a, mat_a, N*N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(device_b, mat_b, N*N*sizeof(int), cudaMemcpyHostToDevice);
    ```
  - step 3: Assign the sizes of threads and blocks
    - We take the ceiling when computing **numBlocks** so that we do not launch fewer threads than needed.
    - **dim3**: an integer vector type provided by CUDA for holding up to 3 dimensions. In our case we use only the x and y dimensions; the unassigned z dimension is set to 1 automatically.
    ```c=
    unsigned int numThreadPerBlock = 32;
    unsigned int numBlocks = (N + numThreadPerBlock - 1)/numThreadPerBlock;
    dim3 threads(numThreadPerBlock, numThreadPerBlock);
    dim3 blocks(numBlocks, numBlocks);
    ```
  - step 4: Launch the kernel
    ```cpp=
    matMul_kernel<<<blocks, threads>>>(device_a, device_b, device_c, N);
    ```
  - step 5: Copy back from the GPU
    ```cpp=
    cudaMemcpy(mat_c, device_c, N*N*sizeof(int), cudaMemcpyDeviceToHost);
    ```
  - step 6: Deallocate GPU memory
    ```cpp=
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
    ```

#### GPU kernel function -- matMul_kernel
- Keyword explanation
  - These keywords declare where a function can be called from and where it is executed.
  - Note that the term **host** refers to the CPU, while **device** refers to the GPU.

| Keyword | Callable from | Executed on |
| ------------ | --------------- | ----------- |
| \_\_host\_\_ | Host | Host |
| \_\_global\_\_ | Host (or Device) | Device |
| \_\_device\_\_ | Device | Device |
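As a small illustration of these qualifiers (the function names below are made up for this example and are not part of matmul.cu):

```cpp=
// __device__: a helper that can only be called from code running on the GPU.
__device__ int square(int x) { return x * x; }

// __global__: a kernel, called from the host (with <<<...>>>) and executed on the device.
__global__ void square_kernel(const int *in, int *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = square(in[idx]);   // device code calling a __device__ function
}

// __host__ (the default for plain functions): called and executed on the CPU.
__host__ void launch_square(const int *d_in, int *d_out, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    square_kernel<<<blocks, threads>>>(d_in, d_out, n);
}
```

A `__global__` kernel can also be launched from device code on GPUs that support dynamic parallelism, which is why the table lists "Host (or Device)".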
- **blockIdx.x**, **gridDim.x** vs. **threadIdx.x**, **blockDim.x** (the analogous definitions hold for the y and z dimensions, respectively)

| Name | Definition |
| --------- | ----------------------------------- |
| blockIdx.x | the index of the block in the x dimension within a grid |
| gridDim.x | the total number of blocks in the x dimension of a grid |
| threadIdx.x | the index of the thread in the x dimension within a block |
| blockDim.x | the total number of threads in the x dimension of a block |

<img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_64f96752d86740ff35a8691c65d2e3fc.png" alt="drawing" width="600"/>
<img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_dab793b5206e004af214b57c29da43fa.png" alt="drawing" width="450"/>

- Code for the launched **matMul_kernel**
```cpp=
__global__ void matMul_kernel(const int *a, const int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int Pvalue = 0;
    for (int k = 0; k < N; k++) {
        // Accumulate results for a single element
        Pvalue += a[row * N + k] * b[k * N + col];
    }
    c[row*N + col] = Pvalue;
}
```
- Comparing with **matmul_cpu**
```cpp=
void matmul_cpu(int *a, int *b, int *c, int N) {
    for(int i = 0; i < N; i++){      // x dimension
        for(int j = 0; j < N; j++){  // y dimension
            c[i*N + j] = 0;
            for(int k = 0; k < N; k++){
                c[i*N + j] += a[i*N + k]*b[k*N + j];
            }
        }
    }
}
```
- We remove the two outer loops over i and j from **matmul_cpu** and replace the i and j indices with row and col in **matMul_kernel**, respectively. That is, each element of matrix C finishes its own calculation in its assigned thread independently.

#### Test the result
- After you finish the code, please uncomment the function **verify_result** to check whether your code is correct. After confirming correctness, comment **verify_result** out again and check the GPU, CPU, and GPU-kernel times.
- The difference between the GPU time and the GPU-kernel time is that the former also includes the time to copy data between host and device.

#### Result
```shell=
$ nvcc matmul.cu -o matmul
$ ./matmul 1024
matMul_kernel time = 0.003438 seconds
GPU time = 0.107383 seconds
CPU time = 7.960088 seconds
```

## Lab 14-2 Tiled Matrix Multiply
- In the previous example, we load the data of matrices A and B from global memory without using shared memory. Shared memory is an on-chip memory, so we can access data in shared memory at a much higher speed.
- In this lab, we want to speed up our matrix multiplication code by using shared memory together with the tiling technique.
- The point of tiling is to guarantee data reuse in shared memory. You could make your tile size different from your shared-memory allocation, but that is pointless, since the purpose of tiling is to make the best use of shared memory.
- After we create a shared-memory space for each tile (one sub-matrix of A and one of B, each 32x32 in our case), we need to make sure the tile size matches the shared memory allocated for each thread block.
- Before introducing the tiling algorithm, let's take a look at **shared memory**
  - `kernel<<<numBlocks, numThread, size_of_SharedMemory>>>(...);`
  - It can be declared dynamically, but the size needs to be decided before launching the kernel, and only a **1-D array** declaration is allowed (see the sketch below).
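  - A minimal sketch of a dynamically sized shared-memory buffer (the kernel name and sizes here are illustrative, not from the lab code):
    ```cpp=
    // The size of this array is not known at compile time; it is supplied
    // as the third launch parameter.
    extern __shared__ int s_buf[];

    __global__ void scale_with_shared(const int *in, int *out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) {
            s_buf[threadIdx.x] = in[idx];       // stage the data in shared memory
            out[idx] = 2 * s_buf[threadIdx.x];
        }
    }

    // Host side: request blockDim.x * sizeof(int) bytes of shared memory per block.
    // scale_with_shared<<<numBlocks, numThread, numThread * sizeof(int)>>>(d_in, d_out, n);
    ```
  - Statically sized shared memory (e.g. `__shared__ int s[TILE_SIZE][TILE_SIZE];` inside the kernel), as used in the tiled kernel below, may be multi-dimensional because its size is fixed at compile time.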
- Compare the number of memory accesses with/without shared memory
  - Without shared memory: **6 memory accesses** are needed
    ```c=
    __global__ void withoutSM(int k, int *D, int n){
        int i = threadIdx.x;
        int j = threadIdx.y;
        if(D[i*n + j] > D[i*n + k] + D[k*n + j])  // six accesses to D in total
            D[i*n + j] = D[i*n + k] + D[k*n + j];
    }
    ```
  - Using shared memory: only **3 memory accesses**
    ```c=
    extern __shared__ int S[];

    __global__ void withSM(int k, int *D, int n){
        int i = threadIdx.x;
        int j = threadIdx.y;
        S[i*n + j] = D[i*n + j];   // load D into shared memory once
        __syncthreads();
        if(S[i*n + j] > S[i*n + k] + S[k*n + j])
            D[i*n + j] = S[i*n + k] + S[k*n + j];
    }
    ```

### Code for Tiled Matrix Multiplication -- tiled_matmul.cu
![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_082896eb1aee24e1a41228c75742c4c9.png)
```cpp=
__global__ void matrixMul(const int *a, const int *b, int *c, int N) {
  // Compute each thread's global row and column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  // Statically allocated shared memory
  __shared__ int s_a[TILE_SIZE][TILE_SIZE];
  __shared__ int s_b[TILE_SIZE][TILE_SIZE];

  // Accumulate in temporary variable
  int tmp = 0;

  // Sweep tile across matrix
  for (int tile = 0; tile < N/TILE_SIZE; tile++) {
    // Load in elements for this tile
    s_a[ty][tx] = a[row * N + tile*TILE_SIZE + threadIdx.x];
    s_b[ty][tx] = b[tile*TILE_SIZE * N + threadIdx.y * N + col];

    // Wait for both tiles to be loaded in before doing computation
    __syncthreads();

    // Do matrix multiplication on the small matrix
    for (int j = 0; j < TILE_SIZE; j++) {
      tmp += s_a[ty][j] * s_b[j][tx];
    }

    // Wait for all threads to finish using current tiles before loading in new
    // ones
    __syncthreads();
  }

  // Write back results
  c[row * N + col] = tmp;
}
```
- The details of the code
  - Lines 17-18: load one tile of A and one tile of B in each iteration
  - Line 21: the first `__syncthreads()` makes sure the whole tile has been loaded before any thread starts computing
  - Lines 24-26: accumulate the partial sum for this thread's element of C
  - Line 30: the second `__syncthreads()` makes sure every thread is done with the current tiles before the next iteration overwrites `s_a` and `s_b`
- Result:
```shell=
$ nvcc -o tile_matmul tile_matmul.cu
$ ./tile_matmul 1024
gpu kernel time = 0.000012 seconds
gpu time = 0.098726 seconds
cpu time = 7.500089 seconds
```

## Lab 14-3 Shared Memory Data Reuse and Memory Bandwidth Benefit Analysis
- In this lab, we introduce:
  - The concept of **memory coalescing**
  - The profiling tools **ncu** and **Nsight Compute**, using the naive matrix multiplication and the tiled matrix multiplication as examples

### Memory Coalescing
- Threads in the same warp access consecutive memory locations served by the same burst (a burst is the chunk of data DRAM returns for one access; in general, DRAM delivers a number of consecutive bytes per burst)
- Example: matrix multiplication
```c=
int row = blockDim.y*blockIdx.y + threadIdx.y;
int col = blockDim.x*blockIdx.x + threadIdx.x;
for(unsigned int i = 0; i < N; i++)
    sum += A[row*N + i] * B[i*N + col];
```
- For threads 0~31 (that is, threadIdx.y = 0 and threadIdx.x = 0~31) in the first iteration, every thread reads the same element A[0], while for matrix B thread 0 reads B[0], thread 1 reads B[1], and so on -- consecutive addresses. Both access patterns can therefore be served in a single burst.
- Memory divergence
  - As opposed to memory coalescing, it means that the threads in the same warp fetch data from different bursts (a sketch contrasting the two patterns follows below).
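  - The following minimal sketch (not part of the lab code) contrasts the two patterns when reading an N x N row-major matrix:
    ```cpp=
    #define N 1024

    // Coalesced: consecutive threads (threadIdx.x) read consecutive addresses,
    // so one warp's 32 loads fall into the same burst.
    __global__ void copy_coalesced(const int *in, int *out) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * N + col] = in[row * N + col];
    }

    // Divergent: consecutive threads read addresses that are N elements apart
    // (a column walk through a row-major matrix), so each load lands in a
    // different burst. This kernel writes the transpose of `in`.
    __global__ void copy_divergent(const int *in, int *out) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * N + col] = in[col * N + row];
    }
    ```
  - Both kernels do the same amount of work per thread, but the divergent read pattern needs far more DRAM bursts per warp, which is exactly what the profiling below makes visible.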
### Profiling -- Using **ncu** and Nsight Compute
- We use the code from Lab 14-1 and Lab 14-2 as examples
- Please install [Nsight Compute](https://developer.nvidia.com/downloads) on your computer in advance.
- Check the [**NVIDIA profiling guide**](https://docs.nvidia.com/nsight-compute/ProfilingGuide/) if you want more information about each parameter and term used in **Nsight Compute**
- Take **tile_matmul.cu** as an example:
  - step 0_1: recompile tile_matmul.cu
    ```shell=
    $ nvcc tile_matmul.cu -o tile_matmul -lineinfo
    ```
  - step 0_2: check the information from ncu
    ```shell=
    $ sudo ncu ./tile_matmul
    ```
    - It generates 3 sections: **GPU Speed Of Light Throughput**, **Launch Statistics**, and **Occupancy**
    - **GPU Speed Of Light Throughput**: check the Duration first, which indicates the total time spent executing the kernel. Then check **Memory [%]** and **Compute (SM) [%]** to determine whether the kernel is memory bound or compute bound.
      - The **Memory [%]** value is determined by three elements: **DRAM Throughput**, **L1/TEX Cache Throughput**, and **L2 Cache Throughput**
      - In our case, we can conclude that compute and memory are well balanced. There is still room for improvement, but all in all the kernel is already in pretty good shape.
    ```shell=
    ----------------------------------------- --------------- ------------------------------
    DRAM Frequency                              cycle/nsecond                           6.79
    SM Frequency                                cycle/nsecond                           1.41
    Elapsed Cycles                              cycle                              3,768,034
    Memory [%]                                  %                                      74.36
    DRAM Throughput                             %                                      10.25
    Duration                                    msecond                                 2.68
    L1/TEX Cache Throughput                     %                                      95.29
    L2 Cache Throughput                         %                                      13.57
    SM Active Cycles                            cycle                           3,754,091.17
    Compute (SM) [%]                            %                                      74.36
    ----------------------------------------- --------------- ------------------------------
    INF   Compute and Memory are well-balanced: To reduce runtime, both computation and
          memory traffic must be reduced. Check both the Compute Workload Analysis and
          Memory Workload Analysis sections.
    ```
  - step 1: create **tile_matmul.ncu-rep**
    ```shell=
    $ sudo ncu --set full -f -o tile_matmul ./tile_matmul 1024
    ```
  - step 2: open **tile_matmul.ncu-rep** in **Nsight Compute**
    <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_f9497dba0cf2b490ec8bcf90ac889aeb.png" alt="drawing" width="300"/>
  - step 3: check the physical limits of your GPU and the GPU throughput of the current kernel
    - Notice that in this example we run tiled_matmul.cu on an RTX 2070, where a thread block may contain at most 1024 threads. Therefore, for a 2-D thread block, **blockDim.x** and **blockDim.y** cannot exceed 32. If you set the total number of threads per block above 1024, the kernel launch fails and the following check around the cudaMemcpy reports the error.
    ```cpp=
    cudaError_t err = cudaMemcpy(...);
    if(err != cudaSuccess){
        // print the CUDA error message and exit
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        exit(-1);
    }
    ```
    - Click the **Occupancy Calculator** to get the physical limits of your GPU
    ![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_df50d189246ea473c11b381611a95b8e.png)
    - In the following graph, the blue part is the tiled_matmul kernel, and the baseline (green part) is the matmul kernel without tiling.
    - First, check the duration: it is reduced by 16.09% by the tiling optimization.
    - Notice that even though both the compute and the memory throughput are lower, we get better performance. Therefore, we cannot determine the bottleneck from a single metric alone.
    <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_df35d63ca84a5a5cfc2ba7ee1ccd9bea.png" alt="drawing" width="900"/>
  - step 4: **Memory Workload Analysis**
    - First of all, since there is still room for the memory side to improve, we go on to check the memory workload.
    - In the following graphs, we can see that although the L1 cache hit rate is lower, the usage of shared memory increases, and therefore the number of global memory accesses decreases.
    - We can check the **warp state statistics** for further information.
    - <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_dcd5fd229fe416dde4dcce79f8ac002b.png" alt="drawing" width="2000"/>
    - <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_5af2de27adaed00d540ce57047d37cbf.png" alt="drawing" width="1000"/>
  - step 5: Performance Latency Analysis
    - In order to hide latency, we need to increase the occupancy.
    - The occupancy may be limited by the amount of shared memory assigned per block: the bigger the shared memory, the fewer blocks each SM can hold. Therefore, increasing the shared memory size may decrease the occupancy (a sketch for querying this is given after this step-by-step list).
    - **Warp analysis**: the warp scheduler can issue one instruction per cycle; however, in our kernel the scheduler only issues one instruction every four cycles. Remember that in our tiling algorithm we use **__syncthreads()** to avoid incorrect results. Therefore, there is no way to improve this further without applying other optimizations, for example, **thread coarsening**.
    ![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_bec88f11247cc9629fc1ca0663d96412.png)
    - The following graph contains two important terms: **Stall LG Throttle** and **Stall MIO Throttle**.
      - **Stall LG Throttle**: the average number of warps resident per issue cycle that are waiting for a free entry in the LSU instruction queue. (LSU: load/store unit)
      - **Stall MIO Throttle**: the average number of warps resident per issue cycle that are waiting for a free entry in the MIO instruction queue. (MIO: memory input/output)
      - The **matmul kernel** has zero **Stall MIO Throttle** and a high **Stall LG Throttle**, since it uses no shared memory and no **__syncthreads()**; its warps never have to wait for each other to finish their own part before moving on to the next step.
    ![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_ce23530a69219f786cb129821c8786a1.png)
  - step 6: Source Code Analysis
    - Please switch the page from **Details** to **Source** (the menu in the upper-right corner).
    - You can check which line of code causes which kind of problem.
    - For the **naive matmul kernel**
    ![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_ee322e04b762812a079724ebf650cf89.png)
    - For the **tiled matmul kernel**
    ![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_77fc872be15132f0a4748d6f87c54102.png)
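Step 5 above notes that occupancy drops as the per-block shared memory grows. If you want to check this outside the profiler, here is a minimal sketch (not part of the lab code; the kernel shown is only a stand-in) using the CUDA occupancy API:

```cpp=
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_tiled_kernel(const int *a, int *c) {
    __shared__ int tile[32][32];                 // 4 KB of static shared memory
    tile[threadIdx.y][threadIdx.x] = a[threadIdx.y * 32 + threadIdx.x];
    __syncthreads();
    c[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}

int main() {
    int maxBlocksPerSM = 0;
    int blockSize = 32 * 32;                     // 1024 threads per block, as in the labs
    // How many blocks of this kernel can be resident on one SM at the same time?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  dummy_tiled_kernel,
                                                  blockSize,
                                                  0 /* dynamic shared memory */);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Resident blocks/SM : %d\n", maxBlocksPerSM);
    printf("Occupancy          : %.1f %%\n",
           100.0 * maxBlocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

Enlarging the static shared-memory array (or the dynamic shared-memory argument) and rerunning this query shows the resident-block count, and hence the occupancy, going down.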
### Thread Coarsening
- A single thread is assigned multiple units of parallelism to process
- Less redundant work and less synchronization
- Optimization of tiled matrix multiplication using **thread coarsening**
- <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_ac253757d04dc7691cc58245aca80a96.jpg" alt="drawing" width="400"/>
  <img src="https://playlab.computing.ncku.edu.tw:3001/uploads/upload_eb7e0a356e3768f8d823a16e543fe42b.jpg" alt="drawing" width="500"/>
- In this example, we choose the tile of matrix A for reuse.
```cpp=
__global__ void matrixMul_coarse(const int *a, const int *b, int *c, int N) {
  // Compute each thread's global row and starting column index
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int colStart = blockIdx.x * blockDim.x * COARSE_FACTOR + threadIdx.x;
  int tx = threadIdx.x;
  int ty = threadIdx.y;

  // Statically allocated shared memory
  __shared__ int s_a[TILE_SIZE][TILE_SIZE];
  __shared__ int s_b[TILE_SIZE][TILE_SIZE];

  // Partial sums, one per coarsened column
  int sum[COARSE_FACTOR];
  for(unsigned int i = 0; i < COARSE_FACTOR; i++)
    sum[i] = 0;

  // Sweep tile across matrix
  for (int tile = 0; tile < N/TILE_SIZE; tile++) {
    // Load the A tile once and reuse it for COARSE_FACTOR B tiles
    s_a[ty][tx] = a[row * N + tile*TILE_SIZE + threadIdx.x];

    for(unsigned int i = 0; i < COARSE_FACTOR; i++){
      unsigned int col = colStart + i*TILE_SIZE;
      s_b[ty][tx] = b[tile*TILE_SIZE * N + threadIdx.y * N + col];

      // Wait for both tiles to be loaded in before doing computation
      __syncthreads();

      // Do matrix multiplication on the small matrix
      for (int j = 0; j < TILE_SIZE; j++) {
        sum[i] += s_a[ty][j] * s_b[j][tx];
      }

      // Wait for all threads to finish using current tiles before loading in
      // new ones
      __syncthreads();
    }
  }

  // Write back results
  for(unsigned int cof = 0; cof < COARSE_FACTOR; cof++){
    int col = colStart + cof * TILE_SIZE;
    c[row*N + col] = sum[cof];
  }
}
```
- The performance difference between the thread-coarsened and the non-coarsened kernel is not significant. Furthermore, the cost of the thread-coarsening technique is increased register usage. Therefore, when you keep increasing **COARSE_FACTOR**, the performance gets worse.
![](https://playlab.computing.ncku.edu.tw:3001/uploads/upload_b8d7b7dbf8b4107fc8366ec590f58877.png)

---

## Homework 14
Unlike the labs, please notice that you need to handle boundary conditions. For example, when you have a 2-D thread block of dimension 32x32, how would you deal with a matrix of dimension 1023x1023? Take the simple matrix multiplication as an example:
```cpp=
__global__ void matMul_kernel(const int *a, const int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int Pvalue = 0;
    if(col < N && row < N){ // **boundary condition**
        for (int k = 0; k < N; k++) {
            // Accumulate results for a single element
            Pvalue += a[row * N + k] * b[k * N + col];
        }
        c[row*N + col] = Pvalue;
    }
}
```

## Homework 14-1 Simple Conv2D
### Description
Please write your simple conv2D in simpleConv2D.cu and compile it once you complete the function design. If your answer is correct, the terminal should display the following:
```shell=
$ nvcc simpleConv2D.cu -o simpleConv2D
$ ./simpleConv2D
Congratulation !
Your simpleConv2D is correct!
```

### Details for implementation
How do you apply the matrix multiplication operation to the 2D convolution problem? One way is to use the [**im2col algorithm**](https://cloud.tencent.com/developer/article/1781621) (a sketch of the idea is given after the note below). There are 3 more alternatives you could use:
- [im2row](https://iq.opengenus.org/im2row-convolution/)
- kn2col
- kn2row

To determine which one is better, you have to consider the following factors:
- The hardware you choose (GPU or CPU)
- The memory limits of your server
- The sizes of the image and the kernel in your model

:::info
Why do we apply **matrix multiplication** when computing **conv2D**?
Of course, you could simply compute the conv2D in the naive way (that is, just assign one output element to each thread). However, because the optimization of matrix multiplication has evolved over a long time, we can use packages or functions for matrix multiplication provided by CUDA libraries to speed up the computation.
:::
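As a reference for the idea (not the required implementation), here is a minimal host-side im2col sketch for a single-channel image with stride 1 and no padding; the function and variable names are made up for this example, and the convolution is correlation-style (no kernel flip), as is common in deep learning:

```cpp=
// Unfold every KxK window of an HxW image into one column of a
// (K*K) x (outH*outW) matrix, so that convolution becomes a matrix multiply.
void im2col(const int *image, int H, int W, int K, int *cols) {
    int outH = H - K + 1, outW = W - K + 1;
    for (int oy = 0; oy < outH; oy++)
        for (int ox = 0; ox < outW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    cols[(ky * K + kx) * (outH * outW) + (oy * outW + ox)] =
                        image[(oy + ky) * W + (ox + kx)];
}

// conv2D expressed as a (1 x K*K) * (K*K x outH*outW) matrix multiplication.
void conv2d_via_im2col(const int *image, const int *kernel,
                       int H, int W, int K, int *out, int *cols) {
    int outH = H - K + 1, outW = W - K + 1;
    im2col(image, H, W, K, cols);
    for (int p = 0; p < outH * outW; p++) {
        int acc = 0;
        for (int q = 0; q < K * K; q++)
            acc += kernel[q] * cols[q * (outH * outW) + p];
        out[p] = acc;
    }
}
```

The inner matrix multiply is exactly the operation the earlier kernels optimize, so on the GPU it can be replaced by your tiled kernel (or a library GEMM); the caller-provided `cols` buffer must hold K*K x outH x outW integers.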
## Homework 14-2 2D Tiled Convolution Kernel
Please write your tiled conv2D in tiledConv2D.cu and compile it once you complete the function design. If your answer is correct, the terminal should display the following:
```shell=
$ nvcc tiledConv2D.cu -o tiledConv2D
$ ./tiledConv2D
Congratulation !
Your tiledConv2D is correct!
```

## Homework 14-3 Shared Memory Data Reuse and Memory Bandwidth Benefit Analysis for 1D and 2D Tiled Convolution Kernels