HW6 Grading Policy

Task                      Score
Environment Preparation   15%
Program implementation    25%
Experiment & discussion   50%
Code Tracing              10%
Total                     100%

Environment Preparation (15%)

  1. Submit a screenshot showing the successful build log and your student ID. (5%)
  2. Explain the purpose of the build flag -gencode arch=compute_61,code=compute_61. (3%)
    • Answer: Because the device we use is a TITAN X with compute capability 6.1, we need to generate PTX code targeting compute capability 6.1 devices.
  3. Explain how GPGPU-SIM hacks the CUDA program, including: (7%)
    • The purpose of the build flag -lcudart.
    • What happens if you do not source the setup_environment file?
    • Hint: dynamic linking and the ldd command.
    • Answer: What GPGPU-SIM ultimately builds is a dynamic library named libcudart. -lcudart tells the compiler not to statically link the CUDA runtime into the binary, but to dynamically load the cudart library when the program runs. Without -lcudart the program still runs correctly, but because the runtime is then statically linked, GPGPU-SIM can no longer hack the CUDA program. setup_environment modifies LD_LIBRARY_PATH so that the first libcudart the dynamic loader finds is GPGPU-SIM's library rather than the one from the CUDA toolkit.
      • Points are deducted if the answer does not explain what happens when -lcudart is not used.
      • Supplementary note: if setup_environment has not been sourced, you can force-load GPGPU-SIM's own libcudart when running the CUDA program with LD_PRELOAD: LD_PRELOAD=//gpgpu-sim//libcudart.so ./a.out

Program implementation (25%)

  1. Implement the simple reduction by atomic operation in hw6-1.cu. (10%)
  2. Implement the interleaved addressing parallel reduction in hw6-2.cu. (10%)
  3. Implement the sequential addressing parallel reduction in hw6-3.cu. (5%)

If the random numbers generated differ from run to run, one point is deducted per version (that is, if all three programs have the same problem, three points are deducted). Minimal reference sketches of the three reduction variants are given below.
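For reference, the following is a minimal sketch of the three variants. It assumes each block reduces one tile of a float array into a partial sum and accumulates it into a zero-initialized global result with atomicAdd; the kernel names, launch configuration, and accumulation strategy are illustrative assumptions, not the required implementation.

// Illustrative sketches only; they assume blockDim.x is a power of two and
// that *out is zero-initialized before launch.

// hw6-1 style: every thread adds its element directly with an atomic.
__global__ void reduce_atomic(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// hw6-2 style: interleaved addressing. The stride grows from 1 upward, so
// the active threads (tid % (2*s) == 0) become sparse, causing warp
// divergence and shared-memory bank conflicts.
__global__ void reduce_interleaved(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, sdata[0]);
}

// hw6-3 style: sequential addressing. The stride shrinks from blockDim.x/2
// downward and the active threads (tid < s) stay contiguous, avoiding the
// divergence and bank conflicts of the interleaved version.
__global__ void reduce_sequential(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, sdata[0]);
}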

Experiment & discussion (50%)

Points may be deducted if the explanation provided is unclear or does not effectively convey the intended meaning.

  1. Explain the meaning and purpose of the following performance metrics (10%):

    • gpu_sim_cycle
      • Answer: The number of GPU cycles a single kernel takes to execute. Points are deducted if later experiments measure "kernel execution efficiency" with a metric other than gpu_sim_cycle.
    • Stall
      • Answer: The number of pipeline stall cycles, regardless of cause.
    • gpgpu_n_stall_shd_mem
      • Answer: The number of pipeline stall cycles caused by shared-memory accesses.

    If you find any other metrics that can be used to explain your performance results, introduce them here.

  2. Compare the performance of hw6-1, hw6-2, and hw6-3 in terms of the above metrics. Include figures and plots if available. (25%)

    • Provide a brief summary of the performance comparison. (6%)
    • Explain any observed differences in performance between the implementations. (7%)
    • Evaluate the impact of various optimization strategies on performance. (6%)
    • Provide insights and recommendations for future optimization. (6%)
  3. Hardware configuration exploration. (15%)
    You will explore the hardware configuration of the GPU and its impact on the performance of the implementations. The configuration file is gpgpusim.config. You can discuss the following configurations separately.

    1. The metric gpgpu_n_stall_shd_mem may be the same for hw6-2 and hw6-3.
      Explain why this differs from NVIDIA’s slide, and modify gpgpusim.config so that the two implementations produce different values of gpgpu_n_stall_shd_mem.
      Answer: This phenomenon only appears if the code from the earlier part guarantees that the data distribution is exactly the same on every run. Reducing the number of shared-memory banks should make hw6-2 and hw6-3 report different numbers (see the illustrative kernel after this list).
      Hint: shared memory banks
    2. By default, the L1 cache is disabled. Turn it on and set its latency to 24 (default: 82). Compare the performance of the implementations with and without the L1 cache enabled, and explain your observations and the reasons behind them.
      Answer: The performance should not differ much (version 1 may even get slower). Almost all of the data is accessed only once, so every read still has to go to the L2 cache or global memory, and the L1 cache only adds extra delay.
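    To make the shared-memory-bank hint concrete, here is a tiny illustrative kernel (the kernel name and launch setup are made up for this note and are not part of the assignment). It shows why the access stride within a warp decides whether a shared-memory access is conflict-free or serialized, and that serialization is what appears as shared-memory pipeline stalls when you change the bank count in gpgpusim.config.

    // Illustrative only: assuming 32 banks of 4-byte words, bank = (word index) % 32.
    // With stride == 1 each thread of a warp hits a different bank (no conflict);
    // with stride == 32 all 32 threads hit bank 0 and the access is serialized.
    __global__ void bank_demo(float *out, int stride) {
        __shared__ float sdata[32 * 32];
        int tid = threadIdx.x;              // launch with one warp (32 threads)
        sdata[tid * stride] = (float)tid;
        __syncthreads();
        out[tid] = sdata[tid * stride];
    }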

Code Tracing (10%)

Your TA has modified the GPGPU-SIM to report the total number of memory accesses (tol_mem_access_num) generated by the CUDA core. You are required to explain how this modification works, including:

  • How do you add another custom metric for GPGPU-SIM to report? (3%)
  • How is the tol_mem_access_num metric associated with the memory access coalescing mechanism? (4%)
    If threads within a warp all read from the same cache block, what is the value of tol_mem_access_num? 1 or 32?
    Please provide a brief explanation.
  • Will a shared memory access be counted by tol_mem_access_num? Why or why not? (3%)

Answer:

  • Add the variable you want to record to the shader core (SM) class, and update its value inside the target function you want to observe. Then print the new variable in the gpgpu_sim::update_stats function, which is executed at the end of every kernel. (A minimal toy sketch of this pattern follows this list.)
  • The value is 1. The counter records how many memory requests in total are produced after inst.generate_mem_accesses() executes. generate_mem_accesses() is called after each instruction executes to compute how many memory requests need to be generated, and it contains the memory-coalescing mechanism; the 32 accesses to the same cache block are coalesced into a single request, so tol_mem_access_num is 1.
  • No. The TA's added code records the queue size of m_accessq. Inside generate_mem_accesses(), the value of cache_block_size is determined, and m_accessq is only modified when that value is nonzero. For shared memory, cache_block_size is 0, so the counter is never updated.
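The same pattern can be illustrated with a small self-contained sketch (a toy model, not GPGPU-SIM source; the struct and function names below are invented for illustration): keep a counter in the core object, update it at the observation point, and print it when per-kernel statistics are dumped.

#include <cstdio>

// Toy illustration of the pattern described above, NOT GPGPU-SIM code:
// 1) add a counter to the "core" object, 2) update it at the point you want
// to observe (here, whenever memory accesses are generated), 3) print it
// when per-kernel statistics are reported.
struct toy_core {
    unsigned long long tol_mem_access_num = 0;   // the custom metric

    // Stand-in for the place where generate_mem_accesses() pushes requests
    // onto m_accessq in the real simulator.
    void on_mem_accesses_generated(unsigned n_requests) {
        tol_mem_access_num += n_requests;
    }

    // Stand-in for gpgpu_sim::update_stats(), called at the end of a kernel.
    void update_stats() const {
        std::printf("tol_mem_access_num = %llu\n", tol_mem_access_num);
    }
};

int main() {
    toy_core core;
    core.on_mem_accesses_generated(1);   // e.g. a fully coalesced warp load
    core.on_mem_accesses_generated(32);  // e.g. a fully divergent warp load
    core.update_stats();                 // prints 33
    return 0;
}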
tags: grading policy