# HW6 Grading Policy

Task | Score
-----|------
Environment Preparation | 15%
Program implementation | 25%
Experiment & discussion | 50%
Code Tracing | 10%
**Total** | **100%**

## Environment Preparation (15%)

1. Submit a screenshot showing the successful build log and your student ID. (5%)
2. Explain the purpose of the build flag `-gencode arch=compute_61,code=compute_61`. (3%)
   - Answer: Since the target device is a TITAN X with compute capability 6.1, we need to generate PTX code targeting compute capability 6.1 devices.
3. Explain how GPGPU-SIM hacks the CUDA program, including: (7%)
   - The purpose of the build flag `-lcudart`.
   - What happens if you do not source the `setup_environment` file?
   - Hint: dynamic linking and the `ldd` command.
   - Answer: What GPGPU-SIM ultimately builds is a dynamic library named libcudart. `-lcudart` tells the compiler not to statically link the CUDA runtime library into the binary, but instead to dynamically load the cudart library at run time. Without `-lcudart`, the program still executes correctly, but because the runtime is statically linked, GPGPU-SIM cannot hack the CUDA program. Sourcing `setup_environment` modifies `LD_LIBRARY_PATH` so that the first libcudart the loader finds is GPGPU-SIM's library rather than the CUDA toolkit's.
   - Points are deducted if the answer does not explain what happens when `-lcudart` is omitted.
   - Note: if `setup_environment` was not sourced, you can still force your own libcudart to be loaded when running the CUDA program via `LD_PRELOAD`: `LD_PRELOAD=/.../gpgpu-sim/.../libcudart.so ./a.out`

## Program implementation (25%)

1. Implement the simple reduction by atomic operation in `hw6-1.cu`. (10%)
2. Implement the interleaved addressing parallel reduction in `hw6-2.cu`. (10%)
3. Implement the sequential addressing parallel reduction in `hw6-3.cu`. (5%)

If the random numbers generated differ between runs, one point is deducted per version (i.e., if all three programs share the same problem, three points are deducted).

## Experiment & discussion (50%)

> Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.

1.
Explain the meaning and purpose of the following performance metrics (10%):
   - `gpu_sim_cycle`
     - Answer: the number of GPU cycles a single kernel takes to execute. Points are deducted if later experiments measure "kernel execution efficiency" with a metric other than `gpu_sim_cycle`.
   - `Stall:`
     - Answer: the number of pipeline stall cycles, from all causes.
   - `gpgpu_n_stall_shd_mem`
     - Answer: the number of pipeline stall cycles caused by shared memory accesses.

   If you find any other metrics that can be used to explain your performance result, introduce them here.

2. Compare the performance of `hw6-1`, `hw6-2`, and `hw6-3` in terms of the above metrics. Include figures and plots if available. (25%)
   - Provide a brief summary of the performance comparison. (6%)
   - Explain any observed differences in performance between the implementations. (7%)
   - Evaluate the impact of various optimization strategies on performance. (6%)
   - Provide insights and recommendations for future optimization. (6%)
3. Hardware configuration exploration. (15%)

   You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is `gpgpusim.config`. You can discuss the following configurations separately.

   1. The metric `gpgpu_n_stall_shd_mem` may be the same for `hw6-2` and `hw6-3`. Explain why it differs from NVIDIA's slide, and modify the `gpgpusim.config` to create two different values of `gpgpu_n_stall_shd_mem` for the two implementations.

      Answer: this phenomenon appears only if the code from the earlier assignment guarantees that the data distribution is identical across runs. Reducing the number of shared memory banks should produce different values for hw6-2 and hw6-3.

      **Hint: shared memory banks**

   2. By default, the **L1 cache** is disabled. Turn it on and set its latency to 24 (default: 82). Compare the performance of the implementations with and without the L1 cache enabled, and explain your observations and the reasons behind them.

      Answer: performance should not change much (version 1 may even get slower). Almost every piece of data is accessed only once, so each read still has to go to the L2 cache or global memory, and the L1 cache only adds extra delay.

## Code Tracing (10%)

Your TA has modified the GPGPU-SIM to report the total number of memory accesses (`tol_mem_access_num`) generated by the CUDA core.
You are required to explain how this modification works, including:

- How to add another custom metric for GPGPU-SIM to report? (3%)
- How is the `tol_mem_access_num` metric associated with the memory access coalescing mechanism? (4%) If threads within a warp all read the same cache block, what is the value of `tol_mem_access_num`: 1 or 32? Please provide a brief explanation.
- Will a shared memory access be counted by `tol_mem_access_num`? Why or why not? (3%)

Answer:

- Add the variable to be recorded to the shader core (SM) class, and update its value inside the target function being observed. Then print the new variable in the `gpgpu_sim::update_stats` function, which is executed at the end of every kernel.
- 1. The variable counts how many memory requests are produced in total after `inst.generate_mem_accesses()` executes. `generate_mem_accesses` runs after each instruction to compute how many memory requests need to be generated, and it contains the memory coalescing mechanism.
- No. The TA's added code records the queue size of `m_accessq`. Inside `generate_mem_accesses`, the value of `cache_block_size` is decided, and `m_accessq` is only modified when that value is non-zero. For shared memory, `cache_block_size` is 0, so the queue is never updated.

###### tags: `grading policy`