Task | Score |
---|---|
Environment Preparation | 15% |
Program implementation | 25% |
Experiment & discussion | 50% |
Code Tracing | 10% |
Total | 100% |
`-gencode arch=compute_61,code=compute_61`. (3%)
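`-lcudart`.
`setup_environment` file?
`ldd` command.

`-lcudart` tells the compiler not to statically link the CUDA library into the CUDA program's binary file, but to load the cudart library dynamically when the program runs. Without `-lcudart`, the program still runs correctly, but because it is statically linked, GPGPU-SIM cannot be used to hack the CUDA program. `setup_environment` modifies `LD_LIBRARY_PATH` so that the first libcudart the system finds is the one from GPGPU-SIM rather than the one from the CUDA toolkit.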
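The "first libcudart found" behaviour can be illustrated with a small host-side sketch. This is only a simplified model of a search over a colon-separated path such as `LD_LIBRARY_PATH`; the real dynamic loader also consults RPATH/RUNPATH, the ld.so cache, and default system directories. The function name `firstDirWithLibrary` and the `hasLibrary` callback are hypothetical names introduced for illustration.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Returns the first directory in a colon-separated search path for which
// hasLibrary(dir) is true, or an empty string if no directory matches.
// Simplified model only: it ignores RPATH/RUNPATH, the ld.so cache, and
// the default system directories that a real loader would also check.
std::string firstDirWithLibrary(
    const std::string& searchPath,
    const std::function<bool(const std::string&)>& hasLibrary) {
    std::istringstream stream(searchPath);
    std::string dir;
    while (std::getline(stream, dir, ':')) {
        if (!dir.empty() && hasLibrary(dir)) {
            return dir;  // earlier entries shadow later ones
        }
    }
    return "";
}
```

If two directories both contain a libcudart, whichever appears earlier in the path wins, which is why prepending the simulator's directory is enough to redirect a dynamically linked program.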
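Points will be deducted if what happens in the `-lcudart` case is not explained.
`hw6-1.cu`. (10%)
`hw6-2.cu`. (10%)
`hw6-3.cu`. (5%)
If the generated random numbers differ from run to run, one point is deducted for each version (that is, if all three programs have the same problem, three points are deducted).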
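One common way to keep the generated input identical across runs is to seed the generator with a fixed constant. The following is a minimal host-side sketch under that assumption; the helper name `makeInput`, the seed value, and the [0, 1) range are arbitrary choices for illustration and are not taken from the assignment.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fills a vector with pseudo-random floats from a generator seeded with a
// fixed value, so every run (and every version) receives the same input.
// The seed and the [0, 1) range are arbitrary illustrative choices.
std::vector<float> makeInput(std::size_t count, unsigned seed = 12345u) {
    std::mt19937 generator(seed);
    std::vector<float> data(count);
    for (float& value : data) {
        // Keep the top 24 bits of the 32-bit output and scale by 2^24,
        // which maps exactly onto representable floats in [0, 1).
        value = static_cast<float>(generator() >> 8) / 16777216.0f;
    }
    return data;
}
```

Calling `makeInput(n)` in each of the three programs with the same seed gives them identical data, so their outputs and simulator statistics can be compared directly.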
Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.
Explain the meaning and purpose of the following performance metrics (10%):
`gpu_sim_cycle`
`gpu_sim_cycle`, points will be deducted.
Stall: `gpgpu_n_stall_shd_mem`
If you find any other metrics that help explain your performance results, introduce them here.
Compare the performance of `hw6-1`, `hw6-2`, and `hw6-3` in terms of the above metrics. Include figures and plots if available. (25%)
Hardware configuration exploration. (15%)
You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is `gpgpusim.config`. You can discuss the following configurations separately.
`gpgpu_n_stall_shd_mem` may be the same for `hw6-2` and `hw6-3`.
Modify `gpgpusim.config` to create two different values for `gpgpu_n_stall_shd_mem` for each implementation.

Your TA has modified GPGPU-SIM to report the total number of memory accesses (`tol_mem_access_num`) generated by the CUDA core. You are required to explain how this modification works, including:
`tol_mem_access_num` metric associated with the memory access coalescing mechanism? (4%)
`tol_mem_access_num`? 1 or 32?
`tol_mem_access_num`? Why or why not? (3%)

Answer:
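The function `gpgpu_sim::update_stats` prints the newly added variable. It is executed at the end of every kernel.
The total number of memory requests generated after the function `inst.generate_mem_accesses()` executes. `generate_mem_accesses` runs after each instruction executes and computes how many memory requests need to be generated; the memory coalescing mechanism is part of this function.
The queue size of `m_accessq`.
`generate_mem_accesses` determines the value of `cache_block_size`. `m_accessq` is modified only when that value is nonzero. For shared memory, however, `cache_block_size` is 0, so the count is not updated.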
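The relationship between coalescing and the request count can be sketched with a simplified host-side model. This is not GPGPU-SIM's actual code: it assumes that all addresses falling into the same aligned block of `blockSize` bytes merge into one request, and it treats a block size of 0 as the shared-memory case in which nothing is queued. The function name `countMemoryRequests` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Counts the memory requests one warp would generate under a simplified
// coalescing model: addresses inside the same aligned block of blockSize
// bytes are merged into a single request.
// blockSize == 0 models the shared-memory case described in the answer,
// where no request is queued and the counter stays unchanged.
std::size_t countMemoryRequests(const std::vector<std::uint64_t>& addresses,
                                std::uint64_t blockSize) {
    if (blockSize == 0) {
        return 0;
    }
    std::set<std::uint64_t> blocks;  // distinct block indices touched
    for (std::uint64_t address : addresses) {
        blocks.insert(address / blockSize);
    }
    return blocks.size();
}
```

Under this model, 32 threads reading consecutive 4-byte words from an aligned base address with a 128-byte block produce 1 request, while the same threads reading with a 128-byte stride produce 32.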