HW3 Grading Policy === $$ score = correctness*0.4 + performance*0.25 + report $$ ## Correctness (40%) - $X$: passed tests - $N$: total number of tests $$ correctness = \frac{X}{N} * 100 $$ ## Performance (25%) ### (15%) sobel GPU kernel function. - $T$: time + panelty (from the [correctness-kernel scoreboard](https://apollo.cs.nthu.edu.tw/ipc23/scoreboard/hw3_correctness-kernel/)) - $T_i$: student $i$'s $T$ - $T_{best}$: the minimum $T$ of all the students $$ performance = \frac{T_{best}}{T_i}*100 $$ ### (10%) Total CPU time. - $T$: time + panelty (from the [correctness scoreboard](https://apollo.cs.nthu.edu.tw/ipc23/scoreboard/hw3_correctness/)) - $T_i$: student $i$'s $T$ - $T_{best}$: the minimum $T$ of all the students $$ performance = \frac{T_{best}}{T_i}*100 $$ ## Report (35%) 1. (8%+ <span style="color:red;">bonus 5%</span>) How did you parallelize the code? - (2%) Which CUDA APIs did you use? - Metion at least 3 CUDA APIs used. - Which functions are ported to CUDA? How did you distribute the workload to blocks and threads? - (1%) Mention at least 1 function ported to CUDA. - (2%) Elaberate on how they distribute the workload distribution to blocks and threads - (<span style="color:red;">bonus, 5%</span>) Conduct experiment to show their workload distribution is better than some other combinations. - How do you implement shared memory? - (1%) Mention which variables are stored in shared memory. - (2%) Explain why they stored them in shared memory. 2. (5%) Which optimization techniques did you apply to your code? - Metion at least 1 optimization technique they used, excluding simple parallelization with CUDA and simple shared memory implementation. 3. (6%) What's the difference between `cudaMalloc` and `cudaMallocManaged`? When will you pick one over another? - (3%) Mention at least 1 main difference - (1.5%) Describe 1 scenario that should use `cudaMalloc` - (1.5%) Describe 1 scenario that should use `cudaMallocManaged` 4. (11%) Experiment: - Measure the GPU kernel time using `nvprof`. Show the difference with and without shared memory. In addition, measure the global memory load throughput (`gld_throughput`) and instruction per cycle (`ipc`) and explain your observation. - (2%) Show the GPU kernel time difference of using and not using shared memory - (2%) Show the global memory load throughput and instruction per cycle - (2%) Given at least 1 observation - (2%) Explain some meaningful thing about the observation (e.g., possible causes) - Profile your program by measuring the time spend in I/O, memory copy, CPU, and kernel. - (1%) Show the total time - (0.5%) Show the time spend in I/O - (0.5%) Show the time spend in memory copy - (0.5%) Show the time spend in CPU - (0.5%) Show the time spend in kernel 5. (5%) Pick any image that is not in the sample test cases, run your implementation with the image, and showcase both the input and output in your report. - Show both the input and output image 6. (optional, 5%) Any constructive suggestions or feedback for the homework are welcome. - At least 1 meaningful suggestions or constructive feedback to the assignment or spec ###### tags: `grading policy`