GPGPU-Sim (General Purpose Graphics Processing Unit Simulator) is a cycle-accurate, open-source simulator for Graphics Processing Units (GPUs). It simulates both the compute units and the memory hierarchy of a GPU, giving researchers and developers a flexible platform to study GPU architecture and programming and to analyze performance metrics such as latency, bandwidth, and power consumption.
Reservation station:
This function calls a set of member functions that simulate the core's pipeline stages in reverse order to model the pipelining effect:
- fetch: Instruction fetch (IF)
- decode: Instruction decode (ID)
- issue: Dispatch a ready instruction (similar to a reservation station)
- read_operands: Fetch register operands (ID)
- execute: Execute the instruction on the corresponding compute unit (EXE)
- writeback: Memory store/load and register writeback (MEM, WB)

SIMT-Stack:
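The reverse calling order can be illustrated with a minimal sketch (hypothetical code, not GPGPU-Sim's actual source): invoking the stages back-to-front within a single simulated cycle lets every stage consume the pipeline-register contents its upstream neighbor wrote on the previous cycle, so one pass models one step of the whole pipeline.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of a simulated core. Each stage method stands in for
// the corresponding GPGPU-Sim member function; trace records the call order.
struct Core {
    std::vector<std::string> trace;

    void writeback()     { trace.push_back("WB"); }
    void execute()       { trace.push_back("EXE"); }
    void read_operands() { trace.push_back("RF"); }
    void issue()         { trace.push_back("ISSUE"); }
    void decode()        { trace.push_back("ID"); }
    void fetch()         { trace.push_back("IF"); }

    // One simulated core cycle: stages run in reverse pipeline order, so a
    // stage never consumes a value produced earlier in the SAME cycle.
    void cycle() {
        writeback();
        execute();
        read_operands();
        issue();
        decode();
        fetch();
    }
};
```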
IPC

GPGPU-SIM
git clone https://github.com/samuel21119/gpgpu-sim_distribution ~/gpgpu-sim_distribution
cd ~/gpgpu-sim_distribution
env -i bash build_ipc.sh
export CUDA_INSTALL_PATH=/opt/cuda
source ~/gpgpu-sim_distribution/setup_environment
nvcc hw6.cu -o hw6.exe -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart
Link with -lcudart.
Add -Xptxas=-v -arch=sm_61 -gencode arch=compute_61,code=compute_61 -lcudart to the sample Makefile in the previous assignment.
cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/gpgpusim.config ./
cp ~/gpgpu-sim_distribution/configs/tested-cfgs/SM6_TITANX/config_pascal_islip.icnt ./
./hw6.exe <argv1> <argv2> ...
./hw6.exe > gpgpu-sim.log
In this assignment, you will be implementing a simple reduction CUDA code. Your task is to implement at least three different versions of the code:
- hw6-1.cu
- hw6-2.cu
- hw6-3.cu
For the second and third versions, please refer to the following slide deck from NVIDIA's website, specifically pages 7 and 14:
https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
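The deck's reduction variants differ mainly in how each step pairs up array elements. As a CPU-side illustration (the mapping of pages 7 and 14 to the interleaved-addressing and sequential-addressing kernels is an assumption), the sketch below performs both tree reductions for min; each outer iteration corresponds to one __syncthreads()-separated step of the CUDA kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Interleaved addressing: the stride between paired elements grows 1, 2, 4,
// ...; on the GPU only threads whose index is a multiple of 2*stride work.
double reduce_interleaved(std::vector<double> s) {
    for (size_t stride = 1; stride < s.size(); stride *= 2)
        for (size_t t = 0; t + stride < s.size(); t += 2 * stride)
            s[t] = std::min(s[t], s[t + stride]);
    return s[0];
}

// Sequential addressing: the stride shrinks n/2, n/4, ...; the first `stride`
// threads stay active and touch contiguous shared-memory locations, which
// avoids bank conflicts. Assumes the array size is a power of two.
double reduce_sequential(std::vector<double> s) {
    for (size_t stride = s.size() / 2; stride > 0; stride /= 2)
        for (size_t t = 0; t < stride; ++t)
            s[t] = std::min(s[t], s[t + stride]);
    return s[0];
}
```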
Note that for each version, you should ensure that your code is correct and that it produces the expected results. You should also measure the performance of each version and report your findings.
The following is the sample code for the assignment. You should first implement a function that generates random floating points (in double format). Next, implement a CUDA version of the CPU_reduction function provided, which computes the minimum value of a fixed-size (1024) array of double-precision floating-point numbers.
#include <iostream>
#define N 1024
double min(double a, double b) {return a < b ? a : b;}
double CPU_reduction(double *arr, int n) {
double ret = arr[0];
for (int i = 1; i < n; i++) {
ret = min(ret, arr[i]);
}
return ret;
}
int main() {
double *ret = new double;
double *arr = new double[N];
generate_random_doubles(arr, N);
std::cout << "Generated numbers:";
for (int i = 0; i < N; i++) {
std::cout << ' ' << arr[i];
}
std::cout << '\n';
// cudaMalloc and cudaMemcpy are required
cuda_reduction<<<1,N>>>(arr, N, ret);
std::cout << "The minimum value: " << *ret << '\n';
delete ret;
delete [] arr;
return 0;
}
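The helper generate_random_doubles is left unimplemented in the skeleton above. One possible sketch using the standard <random> facilities follows; the value range [-1000, 1000) is an arbitrary assumption, and the signature matches the skeleton's call site.

```cpp
#include <random>

// Fills arr[0..n-1] with uniformly distributed double-precision values.
// The range is an arbitrary choice for illustration.
void generate_random_doubles(double *arr, int n) {
    std::mt19937_64 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(-1000.0, 1000.0);
    for (int i = 0; i < n; i++)
        arr[i] = dist(gen);
}
```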
- Explain the purpose of -gencode arch=compute_61,code=compute_61. (3%)
- Explain the purpose of -lcudart.
- What does the setup_environment file do? (Hint: the ldd command.)
- Explain your implementation in hw6-1.cu. (10%)
- Explain your implementation in hw6-2.cu. (10%)
- Explain your implementation in hw6-3.cu. (5%)
Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.
Explain the meaning and purpose of the following performance metrics (10%):
- gpu_sim_cycle
- Stall: gpgpu_n_stall_shd_mem
Compare the performance of hw6-1, hw6-2, and hw6-3 in terms of the above metrics. Include figures and plots if available. (25%)
Hardware configuration exploration. (10%)
You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is gpgpusim.config
. You can discuss the following configurations separately.
gpgpu_n_stall_shd_mem may be the same for hw6-2 and hw6-3. Modify gpgpusim.config to create two different values for gpgpu_n_stall_shd_mem for each implementation.

Warp Shuffle Version (10%)
The previous versions of parallel reduction required exchanging data in shared memory. However, modern NVIDIA GPUs have introduced warp shuffle instructions that allow data exchange within a warp without using shared memory. In hw6-4
, you need to implement a parallel reduction that uses warp shuffle instructions to reduce the values within a warp and then reduce the values across warps to obtain the final result.
To implement hw6-4
, you can use the __shfl_down()
function to exchange data between threads within a warp. You can divide the thread block into warps and reduce the values within each warp. Afterward, you can use the simple parallel reduction to obtain the final result.
For more details, please refer to: https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
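The data movement of a shuffle-based warp reduction can be emulated on the CPU. In the sketch below, each outer loop step stands in for one __shfl_down_sync(0xffffffff, val, offset) call (the blog's pre-CUDA 9 code uses __shfl_down): lane i reads the value held by lane i + offset, and after log2(32) = 5 steps lane 0 holds the warp-wide minimum. The snapshot array models all 32 lanes reading their sources simultaneously.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU emulation of a warp-level min reduction via shuffle-down.
// lanes must hold exactly 32 values (one per thread in the warp).
double warp_min_emulated(std::vector<double> lanes) {
    for (int offset = 16; offset > 0; offset /= 2) {
        std::vector<double> old = lanes;  // all lanes read pre-step values
        for (int i = 0; i + offset < 32; ++i)
            lanes[i] = std::min(old[i], old[i + offset]);
    }
    return lanes[0];  // lane 0 ends up with the minimum of all 32 lanes
}
```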
Compare hw6-4 to the previous versions.
Your TA has modified GPGPU-Sim to report the total number of memory accesses (tol_mem_access_num) generated by the CUDA core. You are required to explain how this modification works, including:
- How is the tol_mem_access_num metric associated with the memory access coalescing mechanism?
- … tol_mem_access_num? 1 or 32?
- … tol_mem_access_num? Why or why not?
You can check the modified part here:
https://github.com/samuel21119/gpgpu-sim_distribution/commit/608a679eb36f58689b59f39b88d13ef0e56288f7
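A simplified model of coalescing (an illustration only, not the actual code in the commit above) helps frame these questions: a warp's 32 addresses can be merged into one memory access per distinct memory segment they touch, so a fully coalesced warp would add 1 to such a counter while a fully scattered warp would add 32. The 128-byte segment size is an assumption for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Counts how many distinct 128-byte segments a warp's byte-addresses touch;
// each distinct segment stands in for one generated memory access.
int coalesced_access_count(const std::vector<uint64_t> &addrs) {
    std::set<uint64_t> segments;
    for (uint64_t a : addrs)
        segments.insert(a / 128);  // segment index of this address
    return static_cast<int>(segments.size());
}
```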
Submit the following files to eeclass before 6/6 23:59:
- hw6-1.cu
- hw6-2.cu
- hw6-3.cu
- hw6-4.cu
- spec