# HW6: GPU Simulator

## Introduction

GPGPU-Sim (General Purpose Graphics Processing Unit Simulator) is a cycle-accurate simulator for Graphics Processing Units (GPUs). It is an open-source software framework that simulates both the compute units and the memory hierarchy of a GPU, giving researchers and developers a flexible platform for studying GPU architecture and programming, and allowing detailed analysis of performance metrics such as latency, bandwidth, and power consumption.

### Review of the 5-stage CPU Pipeline

1. Instruction fetch (IF) - fetch the instruction from the instruction cache
2. Decode (ID) - decode the instruction & fetch registers
3. Execute (EXE) - perform the ALU operation
4. Memory access (MEM) - memory store / load
5. Write back (WB) - write the ALU result or loaded value back to the register file

![](https://hackmd.io/_uploads/B1yxzF_N3.png)

### Review of the Multiple-Issue Pipeline (Tomasulo's Algorithm)

- Introduced by Robert Tomasulo at IBM; covered in *Computer Architecture: A Quantitative Approach*
- Designed for out-of-order CPU pipelines
- Exploits Instruction-Level Parallelism (ILP)
    - Allows more efficient and effective use of CPU resources, enabling the CPU to execute instructions in parallel and avoid unnecessary stalls
- Contains multiple compute units (e.g., floating-point, integer, ...)

Reservation stations:

- listen for source operands to become ready (enforcing data dependencies)
- issue the instruction to its functional unit once all operands are ready

![](https://hackmd.io/_uploads/HyTRDYOV3.png)

### GPGPU-Sim GPU Pipeline

GPGPU-Sim simulates each SIMT core with a per-core cycle function:

> This function calls a set of member functions that simulate the core's pipeline stages in reverse order to model the pipelining effect

1. `fetch`: Instruction fetch (IF)
2. `decode`: Instruction decode (ID)
3. `issue`: Dispatch ready instructions (similar to a reservation station)
4. `read_operands`: Fetch registers (ID)
5. `execute`: Execute the instruction on the corresponding compute unit (EXE)
6. `writeback`: Memory store/load & register writeback (MEM, WB)

![](https://hackmd.io/_uploads/H163ZtuNh.png)

![](https://hackmd.io/_uploads/SJYPWKuVn.png)

SIMT stack:

- controls which SIMD lanes are active
- tracks branch divergence and re-convergence

## Running GPGPU-Sim

### Compiling GPGPU-Sim from source

1. Start the `IPC23 GPGPU-SIM` container
2. Clone the GPGPU-Sim GitHub repository:
    ```
    git clone https://github.com/samuel21119/gpgpu-sim_distribution ~/gpgpu-sim_distribution
    ```
3. Build the simulator:
    ```
    cd ~/gpgpu-sim_distribution
    env -i bash build_ipc.sh
    ```
4. The output should be:
    ![](https://hackmd.io/_uploads/HJ6tFXYN3.png)

### Simulating a CUDA program

1. Set the proper environment variables:
    ```bash
    export CUDA_INSTALL_PATH=/opt/cuda
    source ~/gpgpu-sim_distribution/setup_environment
    ```
2. Compile your CUDA program with:
    ```
    nvcc hw6.cu -o hw6.exe -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart
    ```
    **It is important to build the program with the flag `-lcudart`.**
    You can append the flags `-Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart` to the sample Makefile from the previous assignment.
3. Copy the GPU hardware configuration to your project directory (we will simulate a TITAN X, which is a Pascal-architecture device):
    ```
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/gpgpusim.config ./
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/config_pascal_islip.icnt ./
    ```
4. Execute your program just like before:
    ```
    ./hw6.exe <argv1> <argv2> ...
    ```
5. GPGPU-Sim will print a large amount of information:
    ![](https://hackmd.io/_uploads/HkGhzQK43.png)
6. After the kernel has been simulated, detailed simulation statistics are printed (a minimal program for sanity-checking this setup is sketched below):
    ![](https://hackmd.io/_uploads/ByjZX7FN3.png)
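If you want to confirm that GPGPU-Sim is really intercepting your binary rather than running it on real hardware, a small smoke test like the following can help. This is our own illustration, not part of the assignment: the file name `check_sim.cu` and the kernel are hypothetical. Build it with the `nvcc` flags from step 2 and run it from a directory containing the two configuration files.

```cpp
// check_sim.cu - hypothetical smoke-test program (not part of HW6).
// Build: nvcc check_sim.cu -o check_sim.exe -arch=sm_61 \
//            -gencode arch=compute_61,code=compute_61 -lcudart
#include <cstdio>

// Trivial kernel: each thread doubles one element.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<1, n>>>(d, n);   // this launch is what GPGPU-Sim simulates
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %.1f (expected 2.0)\n", h[0]);
    return 0;
}
```

If this runs silently like a normal CUDA program instead of printing simulation output, verify that `setup_environment` was sourced in the current shell; `ldd check_sim.exe` shows which `libcudart` the binary actually resolves to.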
7. To preserve the output data, redirect stdout to a file:
    ```
    ./hw6.exe > gpgpu-sim.log
    ```

## Assignment

### Overview

In this assignment, you will implement a simple reduction in CUDA. Your task is to implement at least three different versions of the code:

1. Simple reduction using an atomic operation (`hw6-1.cu`)
2. Interleaved addressing parallel reduction (`hw6-2.cu`)
3. Sequential addressing parallel reduction (`hw6-3.cu`)

For the second and third versions, please refer to the following slide deck from NVIDIA's website, specifically pages 7 and 14: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Note that for each version, you should ensure that your code is correct and produces the expected results. You should also measure the performance of each version and report your findings.

### Sample code

The following is the sample code for the assignment. You should first implement a function that generates random floating-point numbers (in double format). Next, implement a CUDA version of the provided `CPU_reduction` function, which computes the minimum value of a fixed-size (1024) array of double-precision floating-point numbers.

```cpp=
#include <iostream>

#define N 1024

// To be implemented by you:
void generate_random_doubles(double *arr, int n);                // fill arr with n random doubles
__global__ void cuda_reduction(double *arr, int n, double *ret); // GPU reduction kernel

double min(double a, double b) { return a < b ? a : b; }

double CPU_reduction(double *arr, int n) {
    double ret = arr[0];
    for (int i = 1; i < n; i++) {
        ret = min(ret, arr[i]);
    }
    return ret;
}

int main() {
    double *ret = new double;
    double *arr = new double[N];
    generate_random_doubles(arr, N);

    std::cout << "Generated numbers:";
    for (int i = 0; i < N; i++) {
        std::cout << ' ' << arr[i];
    }
    std::cout << '\n';

    // cudaMalloc and cudaMemcpy are required: arr and ret live in host
    // memory and must be copied to/from the device around the launch.
    cuda_reduction<<<1, N>>>(arr, N, ret);

    std::cout << "The minimum value: " << *ret << '\n';

    delete ret;
    delete[] arr;
    return 0;
}
```

### Grading

#### Environment Preparation (15%)

1. Submit a screenshot showing the successful build log and your student ID. (5%)
2. Explain the purpose of the build flag `-gencode arch=compute_61,code=compute_61`. (3%)
3. Explain how GPGPU-Sim hooks into the CUDA program, including: (7%)
    - The purpose of the build flag `-lcudart`.
    - What happens if you do not source the `setup_environment` file?
    - Hint: dynamic linking and the `ldd` command.

#### Program implementation (25%)

1. Implement the simple reduction using an atomic operation in `hw6-1.cu`. (10%)
2. Implement the interleaved addressing parallel reduction in `hw6-2.cu`. (10%)
3. Implement the sequential addressing parallel reduction in `hw6-3.cu`. (5%)

#### Experiment & discussion (55%)

Points may be deducted if an explanation is unclear or does not effectively convey the intended meaning.

1. Explain the meaning and purpose of the following performance metrics (10%):
    - `gpu_sim_cycle`
    - `Stall:`
    - `gpgpu_n_stall_shd_mem`

    If you find any other metrics that help explain your performance results, introduce them here.
2. Compare the performance of `hw6-1`, `hw6-2`, and `hw6-3` in terms of the above metrics. Include figures and plots if available. (25%)
    - Provide a brief summary of the performance comparison.
    - Explain any observed differences in performance between the implementations.
    - Evaluate the impact of the different optimization strategies on performance.
    - Provide insights and recommendations for future optimization.
3. Hardware configuration exploration. (10%)
   You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is `gpgpusim.config`. You can discuss the following configurations separately.
    1. The metric `gpgpu_n_stall_shd_mem` may be the same for `hw6-2` and `hw6-3`. Explain why this differs from what NVIDIA's slides suggest, and modify `gpgpusim.config` so that the two implementations produce different values for `gpgpu_n_stall_shd_mem`. **Hint: shared memory banks**
    2. By default, the **L1 cache** is disabled. Turn it on and set its latency to 24 (default: 82). Compare the performance of the implementations with and without the L1 cache enabled, and explain your observations and the reasons behind them.
4. Warp Shuffle Version (10%)
   The previous versions of parallel reduction exchange data through shared memory. Modern NVIDIA GPUs, however, provide warp shuffle instructions that allow data exchange within a warp without using shared memory. In `hw6-4`, you need to implement a parallel reduction that uses warp shuffle instructions to reduce the values within each warp, and then reduces the per-warp values to obtain the final result.
   To implement `hw6-4`, you can use the `__shfl_down()` function to exchange data between threads within a warp. Divide the thread block into warps and reduce the values within each warp. Afterward, use a simple parallel reduction to obtain the final result; a sketch of this two-stage approach is given after this list. For more details, please refer to: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
    1. Explain how you implemented your program.
    2. Compare the performance of `hw6-4` to the previous versions.
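To make the two-stage idea concrete, here is a minimal sketch under the sample code's assumptions (a single block of `N = 1024` threads, i.e., exactly 32 full warps). The names `warp_reduce_min` and `shfl_reduction` are ours, not required by the spec, and `__shfl_down_sync()` is the CUDA 9+ spelling of the `__shfl_down()` primitive mentioned above; use whichever your toolchain provides.

```cpp
// Hedged sketch of a shuffle-based minimum reduction (hw6-4 style).
// Assumes blockDim.x == 1024, so every warp is full and warp_min is
// completely populated before stage 2.
__inline__ __device__ double warp_reduce_min(double v) {
    // Each step pulls a value from the lane `offset` positions higher,
    // halving the number of useful lanes until lane 0 holds the minimum.
    for (int offset = 16; offset > 0; offset /= 2)
        v = fmin(v, __shfl_down_sync(0xffffffff, v, offset));
    return v;
}

__global__ void shfl_reduction(double *arr, int n, double *ret) {
    __shared__ double warp_min[32];        // one partial result per warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    // n is fixed at 1024 in this sketch, matching the launch below.

    // Stage 1: reduce within each warp using shuffles only.
    double v = warp_reduce_min(arr[threadIdx.x]);
    if (lane == 0) warp_min[warp] = v;     // lane 0 publishes the warp result
    __syncthreads();

    // Stage 2: the first warp reduces the 32 per-warp results.
    if (warp == 0) {
        v = warp_reduce_min(warp_min[lane]);
        if (lane == 0) *ret = v;
    }
}
```

A launch such as `shfl_reduction<<<1, N>>>(d_arr, N, d_ret);` (with device pointers) then leaves the minimum in `*d_ret`. Note that stage 2 here reuses shuffles within the first warp, as in the linked blog post; finishing with a simple shared-memory reduction, as the spec suggests, works equally well.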
#### Code Tracing (10%)

Your TA has modified GPGPU-Sim to report the total number of memory accesses (`tol_mem_access_num`) generated by the CUDA cores. You are required to explain how this modification works, including:

- How do you add another custom metric for GPGPU-Sim to report?
- How is the `tol_mem_access_num` metric related to the memory access coalescing mechanism? If the threads within a warp all read from the same cache block, what is the value of `tol_mem_access_num`: 1 or 32? Please provide a brief explanation.
- Will a shared memory access be counted by `tol_mem_access_num`? Why or why not?

You can check the modified part here: https://github.com/samuel21119/gpgpu-sim_distribution/commit/608a679eb36f58689b59f39b88d13ef0e56288f7

## Submission

Submit the following files to eeclass before <span style="color:red;">6/6 23:59</span>:

- `hw6-1.cu`
- `hw6-2.cu`
- `hw6-3.cu`
- `hw6-4.cu`
- Makefile
- report.pdf

## References

- GPGPU-Sim Manual: http://gpgpu-sim.org/manual/index.php/Main_Page
- Parallel Reduction: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
- Warp Shuffle: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/

###### tags: `spec`