HW6: GPU Simulator

Introduction

GPGPU-Sim (General Purpose Graphics Processing Unit Simulator) is a cycle-accurate simulator for Graphics Processing Units (GPUs). It is an open-source software framework that simulates both the compute units and the memory hierarchy of a GPU, giving researchers and developers a flexible platform for studying GPU architecture and programming and for analyzing detailed performance metrics such as latency, bandwidth, and power consumption.

Review of the 5-stage CPU Pipeline

  1. Instruction fetch (IF)
    • fetch the instruction from the instruction cache
  2. Decode (ID)
    • decode the instruction & fetch registers
  3. Execute (EXE)
  4. Memory access (MEM)
    • memory store / load
  5. Write back (WB)
    • write back the ALU / memory loaded result to register

Review of the Multiple-Issue Pipeline (Tomasulo’s Algorithm)

  • Described in "Computer Architecture: A Quantitative Approach"
  • Designed for out-of-order execution in CPU pipelines
  • Exploits instruction-level parallelism (ILP)
  • Allows more efficient and effective use of CPU resources, enabling the CPU to execute independent instructions in parallel and avoid unnecessary stalls
  • Contains multiple compute units (e.g., floating point, integer)

Reservation station:

  • listens for its source operands to become ready (enforcing data dependencies)
  • executes the instruction once all operands are ready

For example, while a long-latency floating-point divide waits in one reservation station, an independent integer instruction can issue to another unit and complete out of order.

GPGPU-Sim GPU Pipeline

In GPGPU-Sim, the shader core's cycle() function calls a set of member functions that simulate the core's pipeline stages in reverse order to model the pipelining effect: because writeback runs before execute, and so on up the pipe, each stage consumes only the values its predecessor produced on the previous cycle. The stages, listed in program order, are (a sketch follows the list):

  1. fetch: Instruction fetch (IF)
  2. decode: Instruction decode (ID)
  3. issue: Dispatch ready instructions (similar to a reservation station)
  4. read_operands: Fetch register operands (ID)
  5. execute: Execute the instruction on the corresponding compute unit (EXE)
  6. writeback: Memory store/load & register writeback (MEM, WB)
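
A minimal sketch of that reverse-order call structure (a toy illustration of the idea with stub stage functions, not GPGPU-Sim's actual code):

// Toy sketch: invoking the stages in reverse order each cycle means every
// stage reads the pipeline state its predecessor wrote on the PREVIOUS
// cycle, so one pass through this function models one pipeline cycle.
void writeback();      // MEM / WB
void execute();        // EXE
void read_operands();  // fetch register operands
void issue();          // dispatch ready instructions
void decode();         // ID
void fetch();          // IF

void core_cycle() {
    writeback();
    execute();
    read_operands();
    issue();
    decode();
    fetch();
}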

SIMT-Stack:

  • The SIMT stack controls which SIMD lanes should be active
  • Branch divergence: when threads in a warp take different sides of a branch, the stack records an active mask and a reconvergence point for each path; the paths then execute serially until the warp reconverges
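
A toy CUDA kernel illustrating the divergence the SIMT stack manages (the kernel and its names are ours, for illustration only):

__global__ void divergent(int *data) {
    int lane = threadIdx.x % 32;     // lane index within the warp
    if (lane < 16) {
        data[threadIdx.x] += 1;      // lanes 0-15 active, 16-31 masked off
    } else {
        data[threadIdx.x] -= 1;      // lanes 16-31 active, 0-15 masked off
    }
    // Both paths execute serially; the SIMT stack restores the full active
    // mask at the reconvergence point here.
}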

Running GPGPU-Sim

Compiling GPGPU-Sim from source

  1. Start the IPC23 GPGPU-Sim container
  2. Clone the GPGPU-Sim GitHub repository:
    git clone https://github.com/samuel21119/gpgpu-sim_distribution ~/gpgpu-sim_distribution
    
  3. Build the simulator:
    cd ~/gpgpu-sim_distribution
    env -i bash build_ipc.sh
    
  4. The output should indicate a successful build.

Simulating a CUDA program

  1. Set the proper environment variables:
    export CUDA_INSTALL_PATH=/opt/cuda
    source ~/gpgpu-sim_distribution/setup_environment
    
  2. Compile your CUDA program with:
    nvcc hw6.cu -o hw6.exe -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart

    It is important to build the program with the flag -lcudart.
    You can append the flags -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart to the sample Makefile from the previous assignment.
  3. Copy the GPU hardware configuration files to your project directory (we will simulate the TITAN X, a Pascal-architecture device):
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/gpgpusim.config ./
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/config_pascal_islip.icnt ./
    
  4. Execute your program just like before:
    ./hw6.exe <argv1> <argv2> ...
    
  5. GPGPU-Sim prints a large amount of information while your program runs.
  6. After the kernel is simulated by GPGPU-Sim, detailed simulation statistics are printed.
  7. To preserve the output data, redirect STDOUT to a file:
    ./hw6.exe > gpgpu-sim.log
    

Assignment

Overview

In this assignment, you will implement a simple reduction in CUDA. Your task is to implement at least three different versions of the code:

  1. Simple reduction by atomic operation (hw6-1.cu)
  2. Interleaved addressing parallel reduction (hw6-2.cu)
  3. Sequential addressing parallel reduction (hw6-3.cu)
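
For the first version, note that CUDA provides no native atomicMin for double. One common workaround (a sketch under our own naming, not a required implementation) is a compare-and-swap loop on the value's 64-bit pattern:

// Sketch: atomic minimum for double built on atomicCAS over the bit
// pattern. atomicMinDouble is our own (hypothetical) helper name.
__device__ double atomicMinDouble(double *addr, double val) {
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        if (__longlong_as_double(assumed) <= val)
            break;                              // stored value already smaller
        old = atomicCAS(p, assumed, __double_as_longlong(val));
    } while (old != assumed);                   // retry if another thread won
    return __longlong_as_double(old);
}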

For the second and third versions, please refer to the following slide deck from NVIDIA's website, specifically pages 7 and 14:
https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
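
As a rough guide, the loop structures on those two pages look as follows when adapted from sum to min (a sketch only; reduce_interleaved, reduce_sequential, and sdata are our own names, and blockDim.x is assumed to be a power of two):

// Sketch for hw6-2: interleaved addressing (cf. slide p.7).
__global__ void reduce_interleaved(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    // Stride grows; active threads are spread out (divergent branching).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] = fmin(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Sketch for hw6-3: sequential addressing (cf. slide p.14).
__global__ void reduce_sequential(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    // Stride shrinks; active threads stay contiguous (conflict-free access).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmin(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

Both kernels would be launched with dynamic shared memory, e.g. reduce_sequential<<<1, N, N * sizeof(double)>>>(d_in, d_out);.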

Note that for each version, you should ensure that your code is correct and that it produces the expected results. You should also measure the performance of each version and report your findings.

Sample code

The following is the sample code for the assignment. First, implement a function that generates random floating-point numbers (in double format). Next, implement a CUDA version of the provided CPU_reduction function, which computes the minimum value of a fixed-size (1024) array of double-precision floating-point numbers.

#include <iostream>

#define N 1024

double min(double a, double b) { return a < b ? a : b; }

double CPU_reduction(double *arr, int n) {
    double ret = arr[0];
    for (int i = 1; i < n; i++) {
        ret = min(ret, arr[i]);
    }
    return ret;
}

int main() {
    double *ret = new double;
    double *arr = new double[N];
    generate_random_doubles(arr, N);

    std::cout << "Generated numbers:";
    for (int i = 0; i < N; i++) {
        std::cout << ' ' << arr[i];
    }
    std::cout << '\n';

    // cudaMalloc and cudaMemcpy are required
    cuda_reduction<<<1, N>>>(arr, N, ret);

    std::cout << "The minimum value: " << *ret << '\n';

    delete ret;
    delete[] arr;
    return 0;
}
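
The comment in main() marks the part you must fill in; a minimal sketch of that device-memory handling (d_arr and d_ret are our own names) might look like:

double *d_arr, *d_ret;
cudaMalloc(&d_arr, N * sizeof(double));
cudaMalloc(&d_ret, sizeof(double));
cudaMemcpy(d_arr, arr, N * sizeof(double), cudaMemcpyHostToDevice);

cuda_reduction<<<1, N>>>(d_arr, N, d_ret);   // your kernel

cudaMemcpy(ret, d_ret, sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_arr);
cudaFree(d_ret);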

Grading

Environment Preparation (15%)

  1. Submit a screenshot showing the successful build log and your student ID. (5%)
  2. Explain the purpose of the build flag -gencode arch=compute_61,code=compute_61. (3%)
  3. Explain how GPGPU-SIM hacks the CUDA program, including: (7%)
    • The purpose of the build flag -lcudart.
    • What happens if you do not source the setup_environment file?
    • Hint: dynamic linking and the ldd command.

Program implementation (25%)

  1. Implement the simple reduction by atomic operation in hw6-1.cu. (10%)
  2. Implement the interleaved addressing parallel reduction in hw6-2.cu. (10%)
  3. Implement the sequential addressing parallel reduction in hw6-3.cu. (5%)

Experiment & discussion (55%)

Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.

  1. Explain the meaning and purpose of the following performance metrics (10%):

    • gpu_sim_cycle
    • Stall: gpgpu_n_stall_shd_mem
      If you find any other metrics that can be used to explain your performance results, introduce them here.
  2. Compare the performance of hw6-1, hw6-2, and hw6-3 in terms of the above metrics. Include figures and plots if available. (25%)

    • Provide a brief summary of the performance comparison.
    • Explain any observed differences in performance between the implementations.
    • Evaluate the impact of various optimization strategies on performance.
    • Provide insights and recommendations for future optimization.
  3. Hardware configuration exploration. (10%)
    You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is gpgpusim.config. You can discuss the following configurations separately.

    1. The metric gpgpu_n_stall_shd_mem may be the same for hw6-2 and hw6-3.
      Explain why this differs from NVIDIA's slides, and modify gpgpusim.config so that the two implementations produce different values of gpgpu_n_stall_shd_mem.
      Hint: shared memory banks
    2. By default, the L1 cache is disabled. Turn it on and set its latency to 24 cycles (default: 82). Compare the performance of the implementations with and without the L1 cache enabled, and explain your observations and the reasons behind them.
  4. Warp Shuffle Version (10%)
    The previous versions of parallel reduction exchange data through shared memory. Modern NVIDIA GPUs, however, provide warp shuffle instructions that allow data exchange within a warp without using shared memory. In hw6-4, implement a parallel reduction that uses warp shuffle instructions to reduce the values within each warp, and then reduces across warps to obtain the final result.
    To implement hw6-4, you can use the __shfl_down() function to exchange data between threads within a warp. Divide the thread block into warps and reduce the values within each warp; afterward, use a simple parallel reduction to obtain the final result (a sketch of the warp-level step follows this item).
    For more details, please refer to: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/

    1. Explain how you implemented your program.
    2. Compare the performance of hw6-4 to the previous versions.
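
A minimal sketch of the warp-level step (warp_reduce_min is our own helper name; __shfl_down_sync is the CUDA 9+ spelling of the __shfl_down intrinsic mentioned above):

// Sketch: reduce a double across the 32 lanes of a full warp.
__device__ double warp_reduce_min(double val) {
    // Halve the shuffle distance each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset /= 2)
        val = fmin(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;   // lane 0 ends up holding the warp's minimum
}

Each warp's lane 0 can then write its partial minimum to shared memory, and one final reduction over those partials produces the result, as described in the linked blog post.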

Code Tracing (10%)

Your TA has modified GPGPU-Sim to report the total number of memory accesses (tol_mem_access_num) generated by the CUDA cores. You are required to explain how this modification works, including:

  • How would you add another custom metric for GPGPU-Sim to report?
  • How is the tol_mem_access_num metric related to the memory access coalescing mechanism?
    If all threads within a warp read from the same cache block, what is the value of tol_mem_access_num: 1 or 32?
    Please provide a brief explanation.
  • Will a shared memory access be counted by tol_mem_access_num? Why or why not?

You can check the modified part here:
https://github.com/samuel21119/gpgpu-sim_distribution/commit/608a679eb36f58689b59f39b88d13ef0e56288f7

Submission

Submit the following files to eeclass before 6/6 23:59:

  • hw6-1.cu
  • hw6-2.cu
  • hw6-3.cu
  • hw6-4.cu
  • Makefile
  • report.pdf

tags: spec