# Solutions: Final Project Part 2 - Threads
#### P1 (1pt): Which implementation provides the best performance? The worst?
```
Best OMP implementation: the nn implementation (calc_grads_impl = 2), 0.053 seconds
Worst OMP implementation: the n implementation (calc_grads_impl = 4), 0.348 seconds
Note: runtimes may vary from run to run; these answers were judged against the answers in P2.
dataset|training_inputs_count|omp_threads|activate_impl|calc_grads_impl|param3|param4|full_name |function |layer|layer_type|reps|misses |IC |cycles |runtime|PAPI_REF_CYC|MHz |TIC |Tmisses |IPC |CT |CPI |ET |
-------|---------------------|-----------|-------------|---------------|------|------|---------------------------------|----------|-----|----------|----|--------|--------|--------|-------|------------|-------|--------|--------|-----|--------|-----|-----|
mininet|4.0 |1.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.355 |7.03e+08 |1.9e+03|1.35e+09|8.43e+07|2.02 |5.29e-10|0.495|0.355|
mininet|4.0 |1.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.27e+08|1.65e+09|8.39e+08|0.444 |8.8e+08 |1.9e+03|1.65e+09|1.27e+08|1.97 |5.29e-10|0.508|0.444|
mininet|4.0 |1.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.48e+07|1.36e+09|6.03e+08|0.323 |6.33e+08 |1.9e+03|1.36e+09|8.48e+07|2.25 |5.36e-10|0.444|0.323|
mininet|4.0 |1.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |2.16e+08|2.6e+09 |1.22e+09|0.664 |1.28e+09 |1.9e+03|2.6e+09 |2.16e+08|2.12 |5.43e-10|0.471|0.664|
mininet|4.0 |1.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.94e+07|1.44e+09|6.53e+08|0.413 |6.86e+08 |1.9e+03|1.44e+09|8.94e+07|2.2 |6.32e-10|0.455|0.413|
mininet|4.0 |2.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.71e+08|0.355 |7.04e+08 |1.9e+03|2.71e+09|1.69e+08|2.02 |5.29e-10|0.496|0.355|
mininet|4.0 |2.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.47e+07|8.29e+08|4.62e+08|0.245 |4.84e+08 |1.9e+03|1.66e+09|1.29e+08|1.8 |5.31e-10|0.557|0.245|
mininet|4.0 |2.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |4.26e+07|6.84e+08|4.39e+08|0.238 |4.61e+08 |1.9e+03|1.37e+09|8.51e+07|1.56 |5.42e-10|0.642|0.238|
mininet|4.0 |2.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.28e+08|1.31e+09|1.09e+09|0.596 |1.15e+09 |1.9e+03|2.63e+09|2.56e+08|1.2 |5.46e-10|0.831|0.596|
mininet|4.0 |2.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |4.03e+07|7.43e+08|5.51e+08|0.334 |5.79e+08 |1.9e+03|1.49e+09|8.06e+07|1.35 |6.05e-10|0.742|0.334|
mininet|4.0 |4.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.354 |7.03e+08 |1.9e+03|5.41e+09|3.37e+08|2.02 |5.28e-10|0.495|0.354|
mininet|4.0 |4.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |3.36e+07|4.18e+08|2.42e+08|0.129 |2.54e+08 |1.9e+03|1.67e+09|1.35e+08|1.73 |5.33e-10|0.578|0.129|
mininet|4.0 |4.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |2.15e+07|3.48e+08|3.5e+08 |0.191 |3.67e+08 |1.9e+03|1.39e+09|8.62e+07|0.995|5.45e-10|1.01 |0.191|
mininet|4.0 |4.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.24e+07|6.95e+08|1.1e+09 |0.59 |1.16e+09 |1.9e+03|2.78e+09|2.5e+08 |0.631|5.36e-10|1.58 |0.59 |
mininet|4.0 |4.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.56e+07|4.13e+08|2.98e+08|0.285 |3.14e+08 |1.9e+03|1.65e+09|6.23e+07|1.39 |9.54e-10|0.722|0.285|
mininet|4.0 |8.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.354 |7.03e+08 |1.9e+03|1.08e+10|6.74e+08|2.02 |5.29e-10|0.495|0.354|
mininet|4.0 |8.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.63e+07|2.15e+08|2.66e+08|0.142 |2.79e+08 |1.9e+03|1.72e+09|1.3e+08 |0.809|5.35e-10|1.24 |0.142|
mininet|4.0 |8.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.67e+07|3.5e+08 |5.41e+08|0.296 |5.68e+08 |1.9e+03|2.8e+09 |1.33e+08|0.646|5.48e-10|1.55 |0.296|
mininet|4.0 |8.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.45e+07|7.06e+08|1.41e+09|0.767 |1.47e+09 |1.9e+03|5.65e+09|5.16e+08|0.502|5.45e-10|1.99 |0.767|
mininet|4.0 |8.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |9.85e+06|2.69e+08|3.58e+08|0.336 |3.76e+08 |1.9e+03|2.15e+09|7.88e+07|0.752|9.39e-10|1.33 |0.336|
```
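For reference, a minimal pandas sketch for pulling the best- and worst-performing rows out of a table like the one above. The filename `data.csv` and the parsing details are assumptions (the table is presumed saved verbatim, pipe-delimited); this is a sketch, not part of the assignment.
```
# Hypothetical helper: load the P1 table and rank rows by ET.
import pandas as pd

df = pd.read_csv("data.csv", sep="|", skiprows=[1])  # skip the '----' rule row
df.columns = df.columns.str.strip()                  # headers are space-padded
df = df.dropna(axis=1, how="all")                    # trailing '|' adds an empty column

best, worst = df.loc[df["ET"].idxmin()], df.loc[df["ET"].idxmax()]
for label, row in [("best", best), ("worst", worst)]:
    print(f"{label}: impl={row['calc_grads_impl']:.0f} "
          f"threads={row['omp_threads']:.0f} ET={row['ET']}s")
```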
#### P2 (10pts): Draw 5 graphs, one for each of the 5 implementations (`calc_grads_thread_baseline()` and the four variations). Each one should plot the normalized values for the first 7 values (`misses` through `ET` in the table above) in the data you collected above (y-axis) vs. thread count (x-axis). The graphs, their axes, and their legends should be clearly labeled (e.g., 'calc_grads_thread_baseline_nn()'). I'll do an example of how to draw these graphs efficiently in class.
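One efficient way to draw these (a sketch, not the in-class solution): loop over the implementations, normalize each metric column to its 1-thread value, and plot against `omp_threads`. It assumes `df` was loaded as in the P1 sketch; the implementation-number-to-name mapping and the particular seven columns chosen are assumptions inferred from P1/P3/P12, not given in the assignment.
```
# Sketch: one figure per calc_grads implementation, metrics normalized to 1 thread.
# ASSUMED mapping: 1=baseline, 2=_nn, 3=_b (b loop), 4=_n, 5=_i.
import matplotlib.pyplot as plt

impl_names = {1: "", 2: "_nn", 3: "_b", 4: "_n", 5: "_i"}
metrics = ["misses", "IC", "cycles", "runtime", "IPC", "CPI", "ET"]

for impl, suffix in impl_names.items():
    sub = df[df["calc_grads_impl"] == impl].sort_values("omp_threads")
    norm = sub[metrics].div(sub[metrics].iloc[0])   # normalize to the 1-thread row
    plt.figure()
    for m in metrics:
        plt.plot(sub["omp_threads"], norm[m], marker="o", label=m)
    plt.title(f"calc_grads_thread_baseline{suffix}()")
    plt.xlabel("thread count")
    plt.ylabel("normalized value (1 thread = 1.0)")
    plt.legend()
plt.show()
```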




#### P3 (1 pt): Best Implementation and thread count
```calc_grads_thread_baseline_nn() with 4 threads```
#### P4 (2 pts): Consider Version 3 (b loop). Use the Moneta trace to determine approximately how many kB of ```grads_out``` each thread accesses and roughly how many times it accesses those bytes. Provide and label one screen capture showing how these values were arrived at.
```
Each thread accesses ~32 kB of grads_out.
Each element is accessed many times (too many to count precisely from the trace).
```

#### P5 (2 pts): Consider Version 3 (b loop). Provide a screen capture of ```weights```. Approximately how large is the weights tensor in kB?

```
Size of weights tensor ≈ 120,000 kB (~120 MB)
Each element is accessed 4 times
```
#### P6 (2 pts): What did you learn from P4 and P5 that could explain the high CPI for version 3 (b loop)?
```
Simply increasing the number of threads does not improve the application's performance. Increasing the thread count introduces capacity misses: all of the threads share the cache, so while one thread is using an element, the other threads' accesses evict it, producing a high cache miss rate and hence the high CPI.
```
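A back-of-the-envelope check of that claim, using the weights size measured in P5. The cache size below is a hypothetical figure chosen for illustration, not a measurement from the assignment.
```
# Rough capacity argument for the b-loop version (version 3).
weights_kb = 120_000        # weights tensor size measured in P5 (~120 MB)
llc_kb = 8 * 1024           # HYPOTHETICAL shared last-level cache of 8 MB
threads = 4
# All threads stream the same tensor through one shared cache, so the
# effective capacity available to each thread shrinks with thread count:
per_thread_kb = llc_kb / threads
print(f"working set is {weights_kb / per_thread_kb:.0f}x the per-thread cache share")
# ~59x: data cannot stay resident between the 4 passes over each element (P5),
# so each pass misses again and CPI climbs.
```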
#### P7 (2 pts): Consider ```calc_grads_thread_baseline_n()```. How much of the weights tensor does each thread access repeatedly before moving onto more data (i.e., how big is its working set)? How many times does it access each entry of the tensor? Provide and label a Moneta screen capture that supports your conclusions.
```
Each entry of the weights tensor is accessed 4 times
The working set size = 32 kB
```

#### P8 (2 pts): Consider ```calc_grads_thread_baseline_n()```. What's the ratio of IC for calc_grads_thread_baseline() to IC for calc_grads_thread_baseline_n() with 1 thread? Paste in a copy of the C++ code that corresponds to the extra instructions.
```
IC ratio = 1.35e+09 / 2.60e+09 ≈ 0.52; i.e., calc_grads_thread_baseline_n() executes roughly 2x as many instructions.
The extra instructions come from merging each thread's private grads_out_local back into the shared grads_out:

#pragma omp critical
for (int i = 0; i < grads_out.size.x; i++) {
    grads_out(i, 0, 0, b) += grads_out_local(i, 0, 0, 0);
}
```
#### P9 (2 pts): Consider ```calc_grads_thread_baseline_i()```. How much of the weights tensor does each thread access repeatedly before moving onto more data (i.e., how big is its working set)? Give your answer in kB. Provide and label a screen capture illustrating your conclusion.
```
Working set size = 8 kB (with 4 threads, each thread of baseline_i accesses 8 kB of weights)
```

#### P10 (2 pts): Consider ```calc_grads_thread_baseline_i()```. How much (in kB) of the ```grads_out``` tensor does each thread access? Provide and label a screen capture illustrating your conclusion.
```
Working set size = 8 kB
```

#### P11 (2 pts): Use your answers to P9 and P10 and similar measurements of the trace of ```calc_grads_thread_baseline()``` to explain the difference between the number of misses in ```calc_grads_thread_baseline_i()``` and ```calc_grads_thread_baseline()```. Two sentences max.
```
In baseline_i() each thread's working set is small: ~8 kB of weights plus ~8 kB of grads_out, so it fits in the cache, whereas baseline()'s working set does not, causing far more capacity misses. Increasing the thread count shrinks each thread's share of the data further, improving locality, so baseline_i() performs better than baseline() as threads are added.
```
#### P12 (5 pts): Compare the data for calc_grads_thread_baseline_nn() and calc_grads_thread_baseline_i() with 4 threads using all three terms of the PE (IC, CPI, and CT). For each term compute (value for calc_grads_thread_baseline_i)/(value for calc_grads_thread_baseline_nn). Which term explains calc_grads_thread_baseline_nn()'s lower ET? What could be one underlying cause? (2 sentences max)
```
IC ratio  = 4.13e+08 / 4.18e+08 ≈ 0.988
CPI ratio = 0.722 / 0.578 ≈ 1.25
CT ratio  = 9.54e-10 / 5.33e-10 ≈ 1.79
The cycle time (CT) term explains calc_grads_thread_baseline_nn()'s lower ET.
One possible underlying cause is thermal throttling: baseline_i() runs at a lower clock frequency (a higher cycle time).
```
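As a cross-check, the product of the three ratios must reproduce the measured ET ratio, since ET = IC × CPI × CT. A quick sketch using only the 4-thread rows from the P1 table:
```
# Cross-check: ET = IC * CPI * CT, so the three term ratios should multiply
# out to the ET ratio taken directly from the 4-thread rows in P1.
ic_ratio  = 4.13e8 / 4.18e8        # ~0.988
cpi_ratio = 0.722 / 0.578          # ~1.25
ct_ratio  = 9.54e-10 / 5.33e-10    # ~1.79
print(ic_ratio * cpi_ratio * ct_ratio)  # ~2.21
print(0.285 / 0.129)                    # measured ET ratio (_i / _nn): ~2.21
```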