HW3 Grading Policy
===
$$
score = correctness*0.4 + performance*0.25 + report
$$
## Correctness (40%)
- $X$: passed tests
- $N$: total number of tests
$$
correctness = \frac{X}{N} * 100
$$
## Performance (25%)
### (15%) Sobel GPU kernel function.
- $T$: time + penalty (from the [correctness-kernel scoreboard](https://apollo.cs.nthu.edu.tw/ipc23/scoreboard/hw3_correctness-kernel/))
- $T_i$: student $i$'s $T$
- $T_{best}$: the minimum $T$ of all the students
$$
performance = \frac{T_{best}}{T_i}*100
$$
### (10%) Total CPU time.
- $T$: time + penalty (from the [correctness scoreboard](https://apollo.cs.nthu.edu.tw/ipc23/scoreboard/hw3_correctness/))
- $T_i$: student $i$'s $T$
- $T_{best}$: the minimum $T$ of all the students
$$
performance = \frac{T_{best}}{T_i}*100
$$
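
As a rough illustration of how the formulas above combine, here is a minimal Python sketch. The function names are hypothetical, and two points are assumptions rather than stated policy: the 25% performance weight is split as 15% for the kernel score and 10% for the total-time score (following the sub-section headings), and `report_points` is the report grade already expressed in points (out of 35, before any bonus).

```python
def correctness_score(passed: int, total: int) -> float:
    """correctness = X / N * 100, where X = passed tests and N = total tests."""
    if total <= 0:
        raise ValueError("total number of tests must be positive")
    return passed / total * 100


def performance_score(t_best: float, t_i: float) -> float:
    """performance = T_best / T_i * 100, where T = time + penalty."""
    if t_i <= 0:
        raise ValueError("T_i must be positive")
    return t_best / t_i * 100


def final_score(correctness: float, kernel_perf: float,
                total_perf: float, report_points: float) -> float:
    # Assumption: performance * 0.25 is split as 0.15 (kernel) + 0.10 (total time).
    return (correctness * 0.4
            + kernel_perf * 0.15
            + total_perf * 0.10
            + report_points)


# Made-up numbers, for illustration only.
c = correctness_score(passed=9, total=10)         # 9 of 10 tests passed
k = performance_score(t_best=2.0, t_i=4.0)        # kernel time twice the best
t = performance_score(t_best=3.0, t_i=3.0)        # this student holds the best total time
print(final_score(c, k, t, report_points=30.0))   # roughly 36 + 7.5 + 10 + 30
```

Under these assumptions, a student with the best $T$ always receives the full weight for that performance component, and every other student's share scales inversely with their own $T$.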
## Report (35%)
1. (8%+ <span style="color:red;">bonus 5%</span>) How did you parallelize the code?
    - (2%) Which CUDA APIs did you use?
        - Mention at least 3 CUDA APIs used.
    - Which functions are ported to CUDA? How did you distribute the workload to blocks and threads?
        - (1%) Mention at least 1 function ported to CUDA.
        - (2%) Elaborate on how the workload is distributed to blocks and threads
        - (<span style="color:red;">bonus, 5%</span>) Conduct an experiment to show that this workload distribution is better than some other combinations.
    - How do you implement shared memory?
        - (1%) Mention which variables are stored in shared memory.
        - (2%) Explain why they are stored in shared memory.
2. (5%) Which optimization techniques did you apply to your code?
    - Mention at least 1 optimization technique used, excluding simple parallelization with CUDA and a simple shared memory implementation.
3. (6%) What's the difference between `cudaMalloc` and `cudaMallocManaged`? When would you pick one over the other?
    - (3%) Mention at least 1 main difference
    - (1.5%) Describe 1 scenario that should use `cudaMalloc`
    - (1.5%) Describe 1 scenario that should use `cudaMallocManaged`
4. (11%) Experiment:
    - Measure the GPU kernel time using `nvprof`. Show the difference with and without shared memory. In addition, measure the global memory load throughput (`gld_throughput`) and instructions per cycle (`ipc`), and explain your observations.
        - (2%) Show the difference in GPU kernel time with and without shared memory
        - (2%) Show the global memory load throughput and instructions per cycle
        - (2%) Give at least 1 observation
        - (2%) Give a meaningful explanation of the observation (e.g., possible causes)
    - Profile your program by measuring the time spent in I/O, memory copy, CPU, and kernel.
        - (1%) Show the total time
        - (0.5%) Show the time spent in I/O
        - (0.5%) Show the time spent in memory copy
        - (0.5%) Show the time spent in CPU
        - (0.5%) Show the time spent in kernel
5. (5%) Pick any image that is not in the sample test cases, run your implementation with the image, and showcase both the input and output in your report.
    - Show both the input and output images
6. (optional, 5%) Any constructive suggestions or feedback for the homework are welcome.
    - At least 1 meaningful suggestion or piece of constructive feedback on the assignment or spec
###### tags: `grading policy`