# Solutions: Final Project Part 1
---
### Experience `-O3`
Let's see how much the compiler is helping us. Set `OPTIMIZE=-O3` in `config.env`, and measure performance for our benchmark:
```
runlab --run-git-remotely -- make benchmark.csv
```
#### P1 (1pt): Provide the runtime for `train_model` reported in `benchmark.csv`.
```
-O3 execution time: 20.9s
```
Then set `OPTIMIZE=-O0` and run it again.
#### P2 (1pt): Provide the runtime for `train_model` in `benchmark.csv`.
```
-O0 execution time: 198s
Speedup from -O3: ~9.5X
```
### Get Yourself a Map
Next, we analyze the performance of the existing code and figure out which functions take the majority of the execution time in this new CNN architecture of ours.
To generate benchmark results, first enable `gprof` by uncommenting the line `GPROF=yes` and setting `OPTIMIZE=-O3`. Commit, push, and run `make benchmark.csv` again with `--run-git-remotely`.
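For reference, the relevant `config.env` lines after this step should look something like this (a sketch assuming only the variable names quoted above; the rest of the file is unchanged):
```
# config.env -- enable profiling at full optimization
OPTIMIZE=-O3
GPROF=yes
```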
Examine the output in `benchmark.gprof` and answer the following:
#### P3 (3pt): List the functions (as named in `benchmark.gprof`) that together account for 99% of the total execution time. Report their percent proportions as well.
```
conv_layer_t::calc_grads(tensor_t<double> const&)
conv_layer_t::activate(tensor_t<double>&)
fc_layer_t::activate(tensor_t<double>&)
fc_layer_t::fix_weights()
conv_layer_t::fix_weights()
fc_layer_t::calc_grads
-- At least the top 4 are required
```
**Note:** Remember to turn off `gprof` when you don't need it. It adds overhead.
### Tile `fc_layer_t::activate`
In the starter code you'll find an implementation of `fc_layer_t::activate` that tiles all three loops based on the iteration space analysis I did during the lecture for the lab.
**Note** This means you should look at the table in the slides and select the tile sizes that, according to my analysis, would provide the lowest cache miss rate. That is the implementation you should measure in P4 and P5 below. You don't need to run multiple experiments to compare different tile sizes for those problems. You'll do that in P6 and P7.
Compare the performance of the resulting code to the original code in `fc_layer_t.hpp` (i.e., `fc_layer_t::activate`) on layer 14 of the model. You should get a speedup of 9-10x.
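As a refresher on the technique, here is a minimal sketch of three-level loop tiling for a fully connected layer's `activate` (this is not the starter code: the flat-array layout, the dimension constants, and the tile sizes are illustrative assumptions):
```cpp
#include <vector>
#include <algorithm>

// Illustrative sizes only; the real dimensions come from layer 14 of the
// model, and the best tile sizes come from the iteration-space analysis
// table in the lecture slides.
constexpr int B = 4, N = 1024, I = 1024;
constexpr int B_TILE = 4, N_TILE = 32, I_TILE = 64;

// Computes out[b][n] += in[b][i] * weights[i][n] with all three loops tiled.
void tiled_activate(const std::vector<double>& in,      // B x I
                    const std::vector<double>& weights, // I x N
                    std::vector<double>& out)           // B x N
{
    for (int bb = 0; bb < B; bb += B_TILE)
        for (int nn = 0; nn < N; nn += N_TILE)
            for (int ii = 0; ii < I; ii += I_TILE)
                // Process one B_TILE x N_TILE x I_TILE block at a time so the
                // slices of `in` and `weights` it touches stay cache-resident.
                for (int b = bb; b < std::min(bb + B_TILE, B); b++)
                    for (int n = nn; n < std::min(nn + N_TILE, N); n++)
                        for (int i = ii; i < std::min(ii + I_TILE, I); i++)
                            out[b * N + n] += in[b * I + i] * weights[i * N + n];
}
```
Tiling shrinks the working set of the inner loop nest to roughly an `I_TILE x N_TILE` slice of `weights`, which is what keeps the miss rate (and the measured speedup) where the analysis predicts.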
Use the `--param*` command line options to do this in one run.
#### P4 (3pt) Write the `runlab` command and the values of `CMD_LINE_ARGS` and `IMPL_SEL_ARGS` used to compare your optimization with the original code.
```
runlab command: runlab --run-git-remotely -- make cnn.csv
CMD_LINE_ARGS: --scale 2 --test-layer 14 --function activate
IMPL_SEL_ARGS: --param1-name activate_impl --param1 0 1 --param2-name i_tile_size --param2 1 64 --param3-name b_tile_size --param3 1 4 --param4-name n_tile_size --param4 1 32
```
#### P5 (1pt) Note down both execution times and calculate the speedup.
```
Original execution time: (Depends on scale)
Optimized execution time: (Depends on scale)
Speedup: 8X - 12X
```
#### P6 (4pt) Use `--param*` to explore the range of tiling sizes for `activate()`. Draw a graph that shows the impact on execution time of tiling `I` and `N` at tile sizes from 1 to 128 (all combinations). Make sure it is clearly labeled and legible. You are plotting execution time against two variables, so you will need to account for that in the graph.
```
Your graph here
```
#### P7 (2pt) Based on your data, what are the optimal tile sizes? How much speedup do the optimal sizes provide relative to the performance you measured for P5?
```
Optimal I_TILE_SIZE: 32
Optimal B_TILE_SIZE: 4
Optimal N_TILE_SIZE: 64
Speedup of Optimal over P5: 1.2X - 1.6X
-- Optimal sizes may vary between experiments; grading is based on the graph
```
# Solutions: Final Project Part 2 - Threads
#### P1 (1pt): Which implementation provides the best performance? The worst?
```
Best OMP implementation: NN implementation, Case 2, 0.053 seconds
Worst OMP implementation: N implementation, Case 4, 0.348 seconds
PS: Runtimes may vary; the values are judged against the data reported in P2.
dataset|training_inputs_count|omp_threads|activate_impl|calc_grads_impl|param3|param4|full_name |function |layer|layer_type|reps|misses |IC |cycles |runtime|PAPI_REF_CYC|MHz |TIC |Tmisses |IPC |CT |CPI |ET |
-------|---------------------|-----------|-------------|---------------|------|------|---------------------------------|----------|-----|----------|----|--------|--------|--------|-------|------------|-------|--------|--------|-----|--------|-----|-----|
mininet|4.0 |1.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.355 |7.03e+08 |1.9e+03|1.35e+09|8.43e+07|2.02 |5.29e-10|0.495|0.355|
mininet|4.0 |1.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.27e+08|1.65e+09|8.39e+08|0.444 |8.8e+08 |1.9e+03|1.65e+09|1.27e+08|1.97 |5.29e-10|0.508|0.444|
mininet|4.0 |1.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.48e+07|1.36e+09|6.03e+08|0.323 |6.33e+08 |1.9e+03|1.36e+09|8.48e+07|2.25 |5.36e-10|0.444|0.323|
mininet|4.0 |1.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |2.16e+08|2.6e+09 |1.22e+09|0.664 |1.28e+09 |1.9e+03|2.6e+09 |2.16e+08|2.12 |5.43e-10|0.471|0.664|
mininet|4.0 |1.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.94e+07|1.44e+09|6.53e+08|0.413 |6.86e+08 |1.9e+03|1.44e+09|8.94e+07|2.2 |6.32e-10|0.455|0.413|
mininet|4.0 |2.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.71e+08|0.355 |7.04e+08 |1.9e+03|2.71e+09|1.69e+08|2.02 |5.29e-10|0.496|0.355|
mininet|4.0 |2.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.47e+07|8.29e+08|4.62e+08|0.245 |4.84e+08 |1.9e+03|1.66e+09|1.29e+08|1.8 |5.31e-10|0.557|0.245|
mininet|4.0 |2.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |4.26e+07|6.84e+08|4.39e+08|0.238 |4.61e+08 |1.9e+03|1.37e+09|8.51e+07|1.56 |5.42e-10|0.642|0.238|
mininet|4.0 |2.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.28e+08|1.31e+09|1.09e+09|0.596 |1.15e+09 |1.9e+03|2.63e+09|2.56e+08|1.2 |5.46e-10|0.831|0.596|
mininet|4.0 |2.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |4.03e+07|7.43e+08|5.51e+08|0.334 |5.79e+08 |1.9e+03|1.49e+09|8.06e+07|1.35 |6.05e-10|0.742|0.334|
mininet|4.0 |4.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.354 |7.03e+08 |1.9e+03|5.41e+09|3.37e+08|2.02 |5.28e-10|0.495|0.354|
mininet|4.0 |4.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |3.36e+07|4.18e+08|2.42e+08|0.129 |2.54e+08 |1.9e+03|1.67e+09|1.35e+08|1.73 |5.33e-10|0.578|0.129|
mininet|4.0 |4.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |2.15e+07|3.48e+08|3.5e+08 |0.191 |3.67e+08 |1.9e+03|1.39e+09|8.62e+07|0.995|5.45e-10|1.01 |0.191|
mininet|4.0 |4.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.24e+07|6.95e+08|1.1e+09 |0.59 |1.16e+09 |1.9e+03|2.78e+09|2.5e+08 |0.631|5.36e-10|1.58 |0.59 |
mininet|4.0 |4.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.56e+07|4.13e+08|2.98e+08|0.285 |3.14e+08 |1.9e+03|1.65e+09|6.23e+07|1.39 |9.54e-10|0.722|0.285|
mininet|4.0 |8.0 |1.0 |1.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |8.43e+07|1.35e+09|6.7e+08 |0.354 |7.03e+08 |1.9e+03|1.08e+10|6.74e+08|2.02 |5.29e-10|0.495|0.354|
mininet|4.0 |8.0 |1.0 |2.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.63e+07|2.15e+08|2.66e+08|0.142 |2.79e+08 |1.9e+03|1.72e+09|1.3e+08 |0.809|5.35e-10|1.24 |0.142|
mininet|4.0 |8.0 |1.0 |3.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |1.67e+07|3.5e+08 |5.41e+08|0.296 |5.68e+08 |1.9e+03|2.8e+09 |1.33e+08|0.646|5.48e-10|1.55 |0.296|
mininet|4.0 |8.0 |1.0 |4.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |6.45e+07|7.06e+08|1.41e+09|0.767 |1.47e+09 |1.9e+03|5.65e+09|5.16e+08|0.502|5.45e-10|1.99 |0.767|
mininet|4.0 |8.0 |1.0 |5.0 |1.0 |1.0 |layer[14]->calc_grads fc_layer_t|calc_grads|14.0 |fc_layer_t|5.0 |9.85e+06|2.69e+08|3.58e+08|0.336 |3.76e+08 |1.9e+03|2.15e+09|7.88e+07|0.752|9.39e-10|1.33 |0.336|
```
#### P2 (10pts): Draw 5 graphs, one for each of the 5 implementations (`calc_grads_thread_baseline()` and the four variations). Each one should plot the normalized values of the metrics (`misses` through `ET` in the table above) from the data you collected (y-axis) vs. thread count (x-axis). The graphs, their axes, and their legends should be clearly labeled (e.g., 'calc_grads_thread_baseline_nn()'). I'll do an example of how to draw these graphs efficiently in class.
```
Your graphs here
```
#### P3 (1 pt): Best implementation and thread count
```
Baseline_nn and 4 threads
```
#### P4 (2 pts): Consider Version 3 (b loop). Use the Moneta trace to determine approximately how many kB of `grads_out` each thread accesses and roughly how many times it accesses those bytes. Provide and label one screen capture showing how these values were arrived at.
```
Each thread accesses ~32 kB
Each element is accessed many times (too many to count in the trace)
```

#### P5 (2 pts): Consider Version 3 (b loop). Provide a screen capture of `weights`. Approximately how large is the `weights` tensor in kB?

```
Size of weights tensor = 120,000 kB
How many times is each element accessed = 4
```
#### P6 (2 pts): What did you learn from P4 and P5 that could explain the high CPI for version 3 (b loop)?
```
Simply increasing the number of threads doesn't improve performance. In the b-loop version, every thread streams the entire ~120,000 kB weights tensor (P5), which causes capacity misses: lines one thread is still using get evicted by the other threads' accesses, so the cache miss rate, and with it the CPI, stays high.
```
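For intuition, here is a hedged sketch of what a b-loop parallelization like version 3 presumably looks like (the names, flat layout, and loop order are assumptions, not the actual code; the sizes are chosen to match the ~32 kB per-thread `grads_out` slice from P4 and the ~120,000 kB `weights` tensor from P5):
```cpp
#include <vector>

// 4096 x 3750 doubles ~= 120,000 kB of weights; one 4096-double row of
// grads_out per batch entry ~= 32 kB.
constexpr int B = 4, N = 3750, I = 4096;

// Version-3 style: one thread per batch entry b. Every thread executes the
// whole i x n loop nest, so every thread streams the ENTIRE weights tensor.
void calc_grads_b_loop_sketch(const std::vector<double>& weights,   // I x N
                              const std::vector<double>& grad_next, // B x N
                              std::vector<double>& grads_out)       // B x I
{
    #pragma omp parallel for
    for (int b = 0; b < B; b++)
        for (int i = 0; i < I; i++)
            for (int n = 0; n < N; n++)
                // weights is far larger than the cache, so each thread's
                // streaming accesses keep evicting lines the other threads
                // just brought in: capacity misses grow with thread count.
                grads_out[b * I + i] += grad_next[b * N + n] * weights[i * N + n];
}
```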
#### P7 (2 pts): Consider `calc_grads_thread_baseline_n()`. How much of the weights tensor does each thread access repeatedly before moving on to more data (i.e., how big is its working set)? How many times does it access each entry of the tensor? Provide and label a Moneta screen capture that supports your conclusions.
```
Each entry of the weights tensor is accessed 4 times
The working set size = 32 kB
```

#### P8 (2 pts) Consider `calc_grads_thread_baseline_n()`. What's the ratio of IC for `calc_grads_thread_baseline()` to IC for `calc_grads_thread_baseline_n()` with 1 thread? Paste in a copy of the C++ code that corresponds to the extra instructions.
```
IC Ratio = 1.35 x 10^9 / 2.6 x 10^9 ≈ 0.52 (i.e., baseline_n() executes roughly 2x the instructions)
#pragma omp critical
for (int i = 0; i < grads_out.size.x; i++) {
    grads_out(i, 0, 0, b) += grads_out_local(i, 0, 0, 0);
}
```
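The critical section above implies a per-thread private buffer that gets merged into `grads_out`. Here is a hedged guess at the surrounding structure of `calc_grads_thread_baseline_n()` (only the merge loop comes from the actual code; `grad_next`, the flat layout, and the loop placement are assumptions chosen to be consistent with the ~2x IC measured above):
```cpp
#include <vector>

// Hypothetical shape of calc_grads_thread_baseline_n() for one batch entry:
// the n loop is split across threads, each n's contributions land in a
// private grads_out_local, and the quoted critical section merges them into
// grads_out. If the merge runs once per n iteration, it adds a second full
// read-modify-write pass over grads_out for every n, which would account
// for IC roughly doubling relative to calc_grads_thread_baseline().
void baseline_n_sketch(const std::vector<double>& weights,   // I x N
                       const std::vector<double>& grad_next, // N
                       std::vector<double>& grads_out,       // I
                       int I, int N)
{
    #pragma omp parallel for
    for (int n = 0; n < N; n++) {
        std::vector<double> grads_out_local(I, 0.0); // private scratch (assumed)
        for (int i = 0; i < I; i++)
            grads_out_local[i] = grad_next[n] * weights[i * N + n];

        // Merge, serialized across threads (the code quoted in the answer).
        #pragma omp critical
        for (int i = 0; i < I; i++)
            grads_out[i] += grads_out_local[i];
    }
}
```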
#### P9 (2 pts) Consider `calc_grads_thread_baseline_i()`. How much of the weights tensor does each thread access repeatedly before moving on to more data (i.e., how big is its working set)? Give your answer in kB. Provide and label a screen capture illustrating your conclusion.
```
Working set size = 8 kB (4 threads access 8 kB each for baseline_i)
```

#### P10 (2 pts) Consider `calc_grads_thread_baseline_i()`. How much (in kB) of the `grads_out` tensor does each thread access? Provide and label a screen capture illustrating your conclusion.
```
Working set size = 8 kB
```

#### P11 (2pts) Use your answers to P9 and P10 and similar measurements of the trace of calc_grads_thread_baseline() to explain the difference between the number of misses in calc_grads_thread_baseline_i() and calc_grads_thread_baseline(). Two sentences max.
```
In baseline_i each thread's working set (its 8 kB slices of weights and grads_out, per P9 and P10) fits in the cache, so it suffers far fewer capacity misses than baseline, whose threads repeatedly stream data too large to stay resident.
As the thread count increases this gap widens, so baseline_i performs better than baseline.
```
#### P12 (5pts) Compare the data for `calc_grads_thread_baseline_nn()` and `calc_grads_thread_baseline_i()` with 4 threads using all three terms of the PE (IC, CPI, and CT). For each term compute (value for `calc_grads_thread_baseline_i`)/(value for `calc_grads_thread_baseline_nn`). Which term explains `calc_grads_thread_baseline_nn()`'s lower ET? What could be one underlying cause? (2 sentences max)
```
IC Ratio = 4.13 * 10^8 / 4.18 * 10^8 = 0.988
CPI Ratio = 0.722 / 0.578 = 1.25
CT Ratio = 9.54 * 10^-10 / 5.33 * 10^-10 = 1.79
Check: 0.988 * 1.25 * 1.79 ≈ 2.2, matching the ET ratio (0.285 / 0.129).
The cycle time (CT) difference explains nn's lower ET.
This could be because of thermal throttling, with baseline_i() running at a lower frequency (higher cycle time).
```
# Solutions: Final Project Part 3 - Vectors
#### P1 (1pt) Add `-O3 -fopt-info-vec-optimized` to `MICROBENCH_OPTIMIZE` in `config.env` and then run `runlab --run-git-remotely -- make build/microbench.o` to generate the optimization report. It'll show up in the terminal and in `STDOUT.txt`. Paste in the lines that contain the string `microbench.cpp` below (there will be fewer than 10 of them).
```
build/microbench.cpp:46:13: note: loop vectorized
build/microbench.cpp:97:13: note: loop vectorized
build/microbench.cpp:211:25: note: basic block vectorized
build/microbench.cpp:74:12: note: loop vectorized
build/microbench.cpp:117:12: note: loop vectorized
```
#### P2 (1pt) Which functions contain loops that were vectorized?
```
serial_improved
openmp_simd
openmp_threads
openmp_threads_simd
```
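For reference, here is a minimal sketch of an explicitly vectorized kernel in the style the report is confirming (the actual microbench kernels and their signatures are not shown in this document; this is an assumed reduction loop):
```cpp
#include <cstddef>

// `#pragma omp simd` asks the compiler to vectorize the loop outright
// (compile with -fopenmp or -fopenmp-simd); the "loop vectorized" notes in
// the optimization report above are what confirm that this succeeded.
double openmp_simd_sketch(const double* a, const double* b, std::size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```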
Set `MICROBENCH_CMD_LINE_ARGS=--stat-set PE.cfg --impl openmp_simd serial` (leave `MICROBENCH_OPTIMIZE` as it is) and then `runlab --run-by-proxy -- make microbench.csv` (or commit and use `--run-git-remotely`). Use the resulting `microbench.csv` to answer the following question.
#### P3 (3pt) Compute the impact of SIMD on the following terms of the performance equation using the data for `openmp_simd` and `serial`. For each term, compute `(value for serial)/(value for openmp_simd)`.
```
omp_threads|size |function |IC |cycles |runtime|PAPI_REF_CYC|MHz |IPC |CT |CPI |ET |
-----------|--------|-----------|--------|--------|-------|------------|-----|-----|--------|-----|----|
1.0 |5.37e+08|openmp_simd|4.16e+09|5.16e+09|2.76 |4.3e+09 |2e+03|0.806|5.35e-10|1.24 |2.76|
1.0 |5.37e+08|serial |1.23e+10|1.07e+10|5.12 |8.98e+09 |2e+03|1.15 |4.77e-10|0.869|5.12|
IC: 12.3/4.16 = 2.96
CT: 4.77/5.35 = 0.89
CPI: 0.869/1.24 = 0.70
ET: 5.12/2.76 = 1.85
Check: ET = IC * CPI * CT, and the ratios multiply out: 2.96 * 0.70 * 0.89 ≈ 1.85.
```
#### P4 (2pt) If Intel improved their processors so that the CPI for vector instructions matched that of normal instructions, how much speedup would `openmp_simd` achieve relative to `serial`?
```
Speedup = (IC_serial * CT_serial) / (IC_simd * CT_simd) = (1.23 * 10^10 * 4.77 * 10^-10) / (4.16 * 10^9 * 5.35 * 10^-10) ≈ 2.63 (the common CPI cancels out)
```
#### P5 (10pt) Here's 10 free points because we couldn't get vectorization to do anything useful on our code base. Put whatever you'd like below.
```
+10
```
---
# YOU ARE DONE! PLEASE ENJOY YOUR SUMMER!
---