HW6: GPU Simulator

Introduction

GPGPU-Sim (General Purpose Graphics Processing Unit Simulator) is a cycle-accurate simulator for Graphics Processing Units (GPUs). It is an open-source software framework that simulates both the compute units and the memory hierarchy of a GPU, giving researchers and developers a flexible platform for studying GPU architecture and programming and for analyzing detailed performance metrics such as latency, bandwidth, and power consumption.

Review of the 5-stage CPU Pipeline

  1. Instruction fetch (IF)
    • fetch the instruction from the instruction cache
  2. Decode (ID)
    • decode the instruction & fetch registers
  3. Execute (EXE)
  4. Memory access (MEM)
    • memory store / load
  5. Write back (WB)
    • write back the ALU / memory loaded result to register

Review of the Multiple-Issue Pipeline (Tomasulo’s Algorithm)

  • Described in "Computer Architecture: A Quantitative Approach"
  • Designed for out-of-order execution in CPU pipelines
  • Exploits instruction-level parallelism (ILP)
  • Allows more efficient and effective use of CPU resources, enabling the CPU to execute independent instructions in parallel and avoid unnecessary stalls
  • Contains multiple compute units (e.g., floating point, integer)

Reservation station:

  • listens for its source operands to become ready (enforcing data dependencies)
  • executes the instruction once all operands are ready

For example, while a long-latency floating-point divide waits in one reservation station, an independent integer instruction can issue to another unit and complete out of order.

GPGPU-Sim GPU Pipeline

In GPGPU-Sim, the shader core's cycle() function calls a set of member functions that simulate the core's pipeline stages in reverse order to model the pipelining effect: because writeback runs before execute, and so on up the pipe, each stage consumes only the values its predecessor produced on the previous cycle. The stages, listed in program order, are (a sketch follows the list):

  1. fetch: Instruction fetch (IF)
  2. decode: Instruction decode (ID)
  3. issue: Dispatch ready instructions (similar to a reservation station)
  4. read_operands: Fetch register operands (ID)
  5. execute: Execute the instruction on the corresponding compute unit (EXE)
  6. writeback: Memory store/load & register writeback (MEM, WB)
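
A minimal sketch of that reverse-order call structure (a toy illustration of the idea with stub stage functions, not GPGPU-Sim's actual code):

// Toy sketch: invoking the stages in reverse order each cycle means every
// stage reads the pipeline state its predecessor wrote on the PREVIOUS
// cycle, so one pass through this function models one pipeline cycle.
void writeback();      // MEM / WB
void execute();        // EXE
void read_operands();  // fetch register operands
void issue();          // dispatch ready instructions
void decode();         // ID
void fetch();          // IF

void core_cycle() {
    writeback();
    execute();
    read_operands();
    issue();
    decode();
    fetch();
}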

SIMT-Stack:

  • The SIMT stack controls which SIMD lanes should be active
  • Branch divergence: when threads in a warp take different sides of a branch, the stack records an active mask and a reconvergence point for each path; the paths then execute serially until the warp reconverges
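
A toy CUDA kernel illustrating the divergence the SIMT stack manages (the kernel and its names are ours, for illustration only):

__global__ void divergent(int *data) {
    int lane = threadIdx.x % 32;     // lane index within the warp
    if (lane < 16) {
        data[threadIdx.x] += 1;      // lanes 0-15 active, 16-31 masked off
    } else {
        data[threadIdx.x] -= 1;      // lanes 16-31 active, 0-15 masked off
    }
    // Both paths execute serially; the SIMT stack restores the full active
    // mask at the reconvergence point here.
}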

Running GPGPU-Sim

Compiling GPGPU-Sim from source

  1. Start the IPC23 GPGPU-Sim container
  2. Clone the GPGPU-Sim GitHub repository:
    git clone https://github.com/samuel21119/gpgpu-sim_distribution ~/gpgpu-sim_distribution
    
  3. Build the simulator:
    cd ~/gpgpu-sim_distribution
    env -i bash build_ipc.sh
    
  4. The output should indicate a successful build.

Simulating a CUDA program

  1. Set the proper environment variables:
    export CUDA_INSTALL_PATH=/opt/cuda
    source ~/gpgpu-sim_distribution/setup_environment
    
  2. Compile your CUDA program with:
    nvcc hw6.cu -o hw6.exe -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart

    It is important to build the program with the flag -lcudart.
    You can append the flags -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart to the sample Makefile from the previous assignment.
  3. Copy the GPU hardware configuration files to your project directory (we will simulate the TITAN X, a Pascal-architecture device):
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/gpgpusim.config ./
    cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/config_pascal_islip.icnt ./
    
  4. Execute your program just like before:
    ./hw6.exe <argv1> <argv2> ...
    
  5. GPGPU-Sim prints a large amount of information while your program runs.
  6. After the kernel is simulated by GPGPU-Sim, detailed simulation statistics are printed.
  7. To preserve the output data, redirect STDOUT to a file:
    ./hw6.exe > gpgpu-sim.log
    

Assignment

Overview

In this assignment, you will implement a simple reduction in CUDA. Your task is to implement at least three different versions of the code:

  1. Simple reduction by atomic operation (hw6-1.cu)
  2. Interleaved addressing parallel reduction (hw6-2.cu)
  3. Sequential addressing parallel reduction (hw6-3.cu)
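
For the first version, note that CUDA provides no native atomicMin for double. One common workaround (a sketch under our own naming, not a required implementation) is a compare-and-swap loop on the value's 64-bit pattern:

// Sketch: atomic minimum for double built on atomicCAS over the bit
// pattern. atomicMinDouble is our own (hypothetical) helper name.
__device__ double atomicMinDouble(double *addr, double val) {
    unsigned long long *p = (unsigned long long *)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        if (__longlong_as_double(assumed) <= val)
            break;                              // stored value already smaller
        old = atomicCAS(p, assumed, __double_as_longlong(val));
    } while (old != assumed);                   // retry if another thread won
    return __longlong_as_double(old);
}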

For the second and third versions, please refer to the following slide deck from NVIDIA's website, specifically pages 7 and 14:
https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
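
As a rough guide, the loop structures on those two pages look as follows when adapted from sum to min (a sketch only; reduce_interleaved, reduce_sequential, and sdata are our own names, and blockDim.x is assumed to be a power of two):

// Sketch for hw6-2: interleaved addressing (cf. slide p.7).
__global__ void reduce_interleaved(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    // Stride grows; active threads are spread out (divergent branching).
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] = fmin(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Sketch for hw6-3: sequential addressing (cf. slide p.14).
__global__ void reduce_sequential(const double *in, double *out) {
    extern __shared__ double sdata[];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    // Stride shrinks; active threads stay contiguous (conflict-free access).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmin(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

Both kernels would be launched with dynamic shared memory, e.g. reduce_sequential<<<1, N, N * sizeof(double)>>>(d_in, d_out);.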

Note that for each version, you should ensure that your code is correct and that it produces the expected results. You should also measure the performance of each version and report your findings.

Sample code

The following is the sample code for the assignment. First, implement a function that generates random floating-point numbers (in double format). Next, implement a CUDA version of the provided CPU_reduction function, which computes the minimum value of a fixed-size (1024) array of double-precision floating-point numbers.

#include <iostream>

#define N 1024

double min(double a, double b) { return a < b ? a : b; }

double CPU_reduction(double *arr, int n) {
    double ret = arr[0];
    for (int i = 1; i < n; i++) {
        ret = min(ret, arr[i]);
    }
    return ret;
}

int main() {
    double *ret = new double;
    double *arr = new double[N];
    generate_random_doubles(arr, N);

    std::cout << "Generated numbers:";
    for (int i = 0; i < N; i++) {
        std::cout << ' ' << arr[i];
    }
    std::cout << '\n';

    // cudaMalloc and cudaMemcpy are required
    cuda_reduction<<<1, N>>>(arr, N, ret);

    std::cout << "The minimum value: " << *ret << '\n';

    delete ret;
    delete[] arr;
    return 0;
}
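
The comment in main() marks the part you must fill in; a minimal sketch of that device-memory handling (d_arr and d_ret are our own names) might look like:

double *d_arr, *d_ret;
cudaMalloc(&d_arr, N * sizeof(double));
cudaMalloc(&d_ret, sizeof(double));
cudaMemcpy(d_arr, arr, N * sizeof(double), cudaMemcpyHostToDevice);

cuda_reduction<<<1, N>>>(d_arr, N, d_ret);   // your kernel

cudaMemcpy(ret, d_ret, sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_arr);
cudaFree(d_ret);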

Grading

Environment Preparation (15%)

  1. Submit a screenshot showing the successful build log and your student ID. (5%)
  2. Explain the purpose of the build flag -gencode arch=compute_61,code=compute_61. (3%)
  3. Explain how GPGPU-SIM hacks the CUDA program, including: (7%)
    • The purpose of the build flag -lcudart.
    • What happens if you do not source the setup_environment file?
    • Hint: dynamic linking and the ldd command.

Program implementation (25%)

  1. Implement the simple reduction by atomic operation in hw6-1.cu. (10%)
  2. Implement the interleaved addressing parallel reduction in hw6-2.cu. (10%)
  3. Implement the sequential addressing parallel reduction in hw6-3.cu. (5%)

Experiment & discussion (55%)

Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.

  1. Explain the meaning and purpose of the following performance metrics (10%):

    • gpu_sim_cycle
    • Stall: gpgpu_n_stall_shd_mem
      If you find any other metrics that can be used to explain your performance results, introduce them here.
  2. Compare the performance of hw6-1, hw6-2, and hw6-3 in terms of the above metrics. Include figures and plots if available. (25%)

    • Provide a brief summary of the performance comparison.
    • Explain any observed differences in performance between the implementations.
    • Evaluate the impact of various optimization strategies on performance.
    • Provide insights and recommendations for future optimization.
  3. Hardware configuration exploration. (10%)
    You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is gpgpusim.config. You can discuss the following configurations separately.

    1. The metric gpgpu_n_stall_shd_mem may be the same for hw6-2 and hw6-3.
      Explain why this differs from NVIDIA's slides, and modify gpgpusim.config so that the two implementations produce different values of gpgpu_n_stall_shd_mem.
      Hint: shared memory banks
    2. By default, the L1 cache is disabled. Turn it on and set its latency to 24 cycles (default: 82). Compare the performance of the implementations with and without the L1 cache enabled, and explain your observations and the reasons behind them.
  4. Warp Shuffle Version (10%)
    The previous versions of parallel reduction exchange data through shared memory. Modern NVIDIA GPUs, however, provide warp shuffle instructions that allow data exchange within a warp without using shared memory. In hw6-4, implement a parallel reduction that uses warp shuffle instructions to reduce the values within each warp, and then reduces across warps to obtain the final result.
    To implement hw6-4, you can use the __shfl_down() function to exchange data between threads within a warp. Divide the thread block into warps and reduce the values within each warp; afterward, use a simple parallel reduction to obtain the final result (a sketch of the warp-level step follows this item).
    For more details, please refer to: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/

    1. Explain how you implemented your program.
    2. Compare the performance of hw6-4 to the previous versions.
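
A minimal sketch of the warp-level step (warp_reduce_min is our own helper name; __shfl_down_sync is the CUDA 9+ spelling of the __shfl_down intrinsic mentioned above):

// Sketch: reduce a double across the 32 lanes of a full warp.
__device__ double warp_reduce_min(double val) {
    // Halve the shuffle distance each step: 16, 8, 4, 2, 1.
    for (int offset = 16; offset > 0; offset /= 2)
        val = fmin(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;   // lane 0 ends up holding the warp's minimum
}

Each warp's lane 0 can then write its partial minimum to shared memory, and one final reduction over those partials produces the result, as described in the linked blog post.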

Code Tracing (10%)

Your TA has modified GPGPU-Sim to report the total number of memory accesses (tol_mem_access_num) generated by the CUDA cores. You are required to explain how this modification works, including:

  • How would you add another custom metric for GPGPU-Sim to report?
  • How is the tol_mem_access_num metric related to the memory access coalescing mechanism?
    If all threads within a warp read from the same cache block, what is the value of tol_mem_access_num: 1 or 32?
    Please provide a brief explanation.
  • Will a shared memory access be counted by tol_mem_access_num? Why or why not?

You can check the modified part here:
https://github.com/samuel21119/gpgpu-sim_distribution/commit/608a679eb36f58689b59f39b88d13ef0e56288f7

Submission

Submit the following files to eeclass before 6/6 23:59:

  • hw6-1.cu
  • hw6-2.cu
  • hw6-3.cu
  • hw6-4.cu
  • Makefile
  • report.pdf

tags: spec