# CSE142L Lab 2: Characterizing a Perceptron (Solution)

## Enabling the Profiler

#### P1 (1pt) Which function accounts for the most time?

A: `tensor_t<double>::get(int, int, int, int)`

#### P2 (1pt) What percentage of time does it account for?

A: A range of values is observed, roughly 30-60%.

#### P3 (1pt) According to Amdahl's Law, how much speedup could you possibly achieve by optimizing this function?

```
1/(1 - x) where x = (answer in P2)/100, i.e., around 1.43 (at 30%) to 2.5 (at 60%)
```

<div style="page-break-after: always;"></div>

## Taking a Closer Look at the Code

#### P3 (1pt) We noticed that `fc_layer_t::activate(tensor_t<double>&)` took 46.7% of the total execution time. If we speed up this function by 3x, how much will the total program be sped up?

```
1/((1 - 0.467) + 0.467/3) ≈ 1.452
```

## What's the Compiler Doing?

#### P1 (1pt) How many instructions does the function `point_t::point_t(int, int, int, int)` execute when called?

A: 24 including the call to `mcount` (inserted by gprof instrumentation); 23 otherwise.

## Looking at Performance Counters

#### P1 (4pt) Compute the instruction mix for each dataset and enter them below (as %)

| Dataset  | Memory insts | Branches | Uncond. branches |
|----------|--------------|----------|------------------|
| mnist    | 70.57        | 6.58     | 6.13             |
| emnist   | 70.56        | 6.53     | 6.12             |
| cifar10  | 70.6         | 6.57     | 6.13             |
| cifar100 | 70.57        | 6.52     | 6.12             |

#### P2 (4pt) Fill out the table (total data processed is the product of the model size and the number of training inputs)

| Dataset  | Model size (B) | training_inputs_count | Total data processed | Memory ops |
|----------|----------------|-----------------------|----------------------|------------|
| mnist    | 74K            | 200                   | 14800K               | 8.34e+08   |
| emnist   | 230K           | 150                   | 34500K               | 2.23e+09   |
| cifar10  | 290K           | 100                   | 29000K               | 1.6e+09    |
| cifar100 | 2500K          | 5                     | 12500K               | 8.33e+08   |

#### P3 (4pt) Prepare a bar graph from your table that plots the number of memory operations per byte of data processed and the number of branches per byte of data processed for each workload.

For mnist: memory ops per byte = memory ops / total data processed (B), and branches per byte = #branches / total data processed (B). Similarly for the other datasets.

*> NOTE: This is a sample graph from one of the submissions and not the only solution. The y-axis values can be different for you.*

Note the axis labels, dataset labels, clear title, and appropriate colors. A variety of answers have been accepted unless the graph is missing something critical or is not legible. Partial marks if you did not plot "per byte". 0 marks if you plotted memory ops vs. branches.

![](https://i.imgur.com/FxQwWBW.png)
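The per-byte metrics in P3 are simple ratios. Below is a minimal sketch of the computation and plot; the memory-op and byte totals come from the tables above, while the branch counts are placeholders you would replace with your own measured totals.

```python
# Sketch: compute memory ops/byte and branches/byte and plot them (P3).
# Memory-op counts and byte totals come from the P2 table above; the
# branch counts are placeholders -- substitute the totals you measured.
import matplotlib.pyplot as plt
import numpy as np

datasets    = ["mnist", "emnist", "cifar10", "cifar100"]
total_bytes = np.array([14800e3, 34500e3, 29000e3, 12500e3])  # model size * inputs
mem_ops     = np.array([8.34e8, 2.23e9, 1.6e9, 8.33e8])       # from the P2 table
branches    = np.array([7.8e7, 2.1e8, 1.5e8, 7.7e7])          # placeholder values

x, width = np.arange(len(datasets)), 0.35
plt.bar(x - width/2, mem_ops / total_bytes, width, label="Memory ops / byte")
plt.bar(x + width/2, branches / total_bytes, width, label="Branches / byte")
plt.xticks(x, datasets)
plt.ylabel("Operations per byte of data processed")
plt.title("Per-byte instruction counts by workload")
plt.legend()
plt.savefig("per_byte.png")
```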
## Asking the Compiler to Do More

#### P1 (4pt) Compute the instruction mix for each dataset for the optimized code and enter them below (as %)

*> NOTE: You will have slightly different values for the columns below. If your values differ widely, for example by an order of magnitude, then check where you went wrong.*

| Dataset  | Memory insts | Branches | Uncond. branches |
|----------|--------------|----------|------------------|
| mnist    | 34.66%       | 11.46%   | 0.94%            |
| emnist   | 36.68%       | 11.01%   | 0.37%            |
| cifar10  | 35.74%       | 11.94%   | 0.87%            |
| cifar100 | 35.31%       | 11.35%   | 0.13%            |

#### P2 (4pt) Fill out the table for the optimized code (total data processed is the product of the model size and the number of training inputs)

*> NOTE: You will have different values for the memory ops column below.*

*> NOTE: If you did not report memory ops from the optimized (gprof or non-gprof) version of the code, marks have been deducted; the counts will be off by a factor of 10 or 100.*

| Dataset  | Model size (B) | training_inputs_count | Total data processed | Memory ops |
|----------|----------------|-----------------------|----------------------|------------|
| mnist    | 74K            | 200                   | 14800K               | 1.33e+07   |
| emnist   | 230K           | 150                   | 34500K               | 3.18e+07   |
| cifar10  | 290K           | 100                   | 29000K               | 5.72e+07   |
| cifar100 | 2500K          | 5                     | 12500K               | 4.28e+07   |

#### P3 (4pt) Prepare a bar graph from your table that plots the number of memory operations per byte of data processed and the number of branches per byte of data processed for each workload for the optimized code.

For mnist: memory ops per byte = memory ops / total data processed (B), and branches per byte = #branches / total data processed (B). Similarly for the other datasets.

*> NOTE: This is a sample graph from one of the submissions and not the only solution. The y-axis values can be different for you.*

Note the axis labels, dataset labels, clear title, and appropriate colors. A variety of answers have been accepted unless the graph is missing something critical or is not legible. Partial marks if you did not plot "per byte". 0 marks if you plotted memory ops vs. branches.

![](https://i.imgur.com/CjH0oG8.png)

For the following questions, compute the answers based on the total number of instructions, cycles, etc. across all the workloads.

#### P4 (1pt) Based on the data in `optimized-pe.csv`, how much speedup from `-O3` do you expect due to the change in IC?

A: Speedup = Total unoptimized IC / Total optimized IC. Generally a range of values from 25x to 40x, but graded based on what you reported above.

#### P5 (1pt) Based on the data in `optimized-pe.csv`, how much speedup from `-O3` do you expect due to the change in CPI?

A: Opt CPI = Total optimized cycle count / Total optimized IC, and similarly for Unopt CPI. Then Speedup = Unopt CPI / Opt CPI. Generally a range of values from 1.05x to 1.7x, but graded based on what you reported above. The reciprocal of this value is accepted as well if you reported Opt CPI / Unopt CPI instead.

#### P6 (1pt) Based on the data in `optimized-pe.csv`, how much speedup from `-O3` do you expect from the combination of IC and CPI?

A: Multiply the two speedups above. Generally a range of values from ~30x to ~50x, but graded based on what you reported above.
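As a worked example of P4-P6, the sketch below decomposes the total speedup into an IC term and a CPI term, since execution time = IC × CPI × cycle time and the cycle time is unchanged. It uses the totals from the P7 table immediately below as example inputs; your own counts will differ.

```python
# Sketch: decompose the -O3 speedup into an IC term and a CPI term (P4-P6),
# using the example totals from the P7 table as inputs.
ic_unopt, cyc_unopt = 8.02e9, 3.57e9   # unoptimized totals across workloads
ic_opt,   cyc_opt   = 2.27e8, 8.32e7   # optimized totals across workloads

speedup_ic  = ic_unopt / ic_opt        # ~35.3x from executing fewer instructions
cpi_unopt   = cyc_unopt / ic_unopt     # ~0.445 cycles per instruction
cpi_opt     = cyc_opt / ic_opt         # ~0.367 cycles per instruction
speedup_cpi = cpi_unopt / cpi_opt      # ~1.21x from lower CPI

print(f"IC: {speedup_ic:.1f}x  CPI: {speedup_cpi:.2f}x  "
      f"combined: {speedup_ic * speedup_cpi:.1f}x")  # ~42.9x total
```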
#### P7 (5pt) Fill in the data below

*> NOTE: You will have different values for the table below. The grading for calculated values is based on the numbers you reported.*

| Assembly Code                    | Unoptimized                       | Optimized |
|----------------------------------|-----------------------------------|-----------|
| Instruction count                | 8.02e9                            | 2.27e8    |
| Cycle count                      | 3.57e9                            | 8.32e7    |
| Cycle time                       | 0.5ns                             | 0.5ns     |
| Projected execution time         | Cycle count x Cycle time = 1.785s | 0.041637s |
| Projected speedup vs unoptimized | 1                                 | 42.83x    |
| Actual execution time            | 1.0586s                           | 0.026s    |
| Actual speedup vs unoptimized    | 1                                 | 40.71x    |

#### P8 (4pt) How accurately did the PE model the performance of this program on these workloads?

A: 95.05%. Compare the projected speedup and actual speedup above. Any reasonable and clear explanation is accepted, e.g. 95%, "Poorly", "Very close", based on your numbers above. Fractions like 1.3 are ambiguous and given 0 marks.

#### P9 (4pt) Based on profile data with -O3 turned on, which functions should you target for optimization?

A: At least 2 of `activate`, `calc_grads`, or `fix_weights`, unless only one function reported close to 100% for you.

#### P10 (4pt) For the functions you listed, what's the largest speedup you could hope to achieve?

```
Use Amdahl's Law for each function. For example:
If activate takes 30%, 1/(1 - 0.3) ≈ 1.43.
If calc_grads takes 65%, 1/(1 - 0.65) ≈ 2.86.
No marks if you added the proportions of all the functions above and reported infinite speedup.
```

<div style="page-break-after: always;"></div>

## Measuring Actual Performance

For the following questions, compute the answers based on the total number of instructions, cycles, etc. across all the workloads.

#### P1 (3pt) How much overhead (i.e., increase) does gprof cause in terms of the following?

REMOVED.

## Reasoning About Performance

#### P1 (1pt) Which function accounts for the largest fraction of time in the optimized gprof data?

A: One of `activate`/`fix_weights`/`calc_grads`. No marks if you reported `get` or `operator`.

#### P2 (1pt) What's the O() complexity of that function?

A: O(mn) if the hot function is `activate`/`fix_weights`/`calc_grads`. Please check the definitions of `m` and `n` in the README if you reported O(m^2n), O(n^3), or something similar.

#### P3 (4pt) Fill out this table using data from your per-workload gprof outputs for your hot function.

*> NOTE: You will have different values for the measured ET and ET rel. to mnist columns below. 0.01s for all measured ET values is not a correct measurement and has been awarded 0.*

| Dataset  | Measured ET  | ET rel. to mnist | Big-O estimate rel. to mnist     |
|----------|--------------|------------------|----------------------------------|
| mnist    | 0.0000396775 | 1                | 1                                |
| emnist   | 0.000208805  | 5.262            | `28*28*62/(28*28*10) = 6.2`      |
| cifar10  | 0.000047555  | 1.198            | `32*32*3*10/(28*28*10) = 3.918`  |
| cifar100 | 0.0019966    | 50.32            | `32*32*3*100/(28*28*10) = 39.18` |

#### P4 (4pt) Draw a scatter plot with the relative values of m*n on the x-axis and relative execution time on the y-axis. Plot the data for measured ET and your O() estimate.

*> NOTE: The following graph is graded based on the table you reported above.*

Note the x-label, y-label, dataset labels, clear x and y scales, and title in the graph below. The graph needs to convey "relative to MNIST" through either the title or the axis labels.

![](https://i.imgur.com/VKaRtCI.png)

(Ignore imagenet above)
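A small sketch of how the relative columns above can be computed, where `m` is the input size in pixels and `n` is the number of output classes, so the Big-O estimate relative to mnist is just the ratio of `m*n` products. The measured ETs here are the example values from the table; substitute your own gprof numbers.

```python
# Sketch: relative Big-O estimates (m*n) and relative measured ET for the
# hot function (P3/P4). Measured ETs are the example values from the table.
datasets = {
    #            (m = input size, n = classes, measured ET in seconds)
    "mnist":     (28 * 28,        10,          0.0000396775),
    "emnist":    (28 * 28,        62,          0.000208805),
    "cifar10":   (32 * 32 * 3,    10,          0.000047555),
    "cifar100":  (32 * 32 * 3,    100,         0.0019966),
}

base_mn, base_et = 28 * 28 * 10, datasets["mnist"][2]
for name, (m, n, et) in datasets.items():
    print(f"{name:9s}  O() rel. to mnist: {m * n / base_mn:6.2f}  "
          f"ET rel. to mnist: {et / base_et:6.2f}")
```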
"Poorly", "Close", "dataset X is an outlier" based on your graph and/or numbers above. ## Changing the Clock Rate and Measuring Power #### P1 (4pt) Draw a line graph with clock speed on the x-axis and execution time on the y-axis. ![](https://i.imgur.com/RPcb757.png) #### P2 (4pt) Draw a line graph with clock speed on the x-axis and energy on the y-axis. ![](https://i.imgur.com/EH1CFFJ.png) #### P3 (4pt) Draw a line graph with clock speed on the x-axis and power on the y-axis. ![](https://i.imgur.com/pcgoz5I.png) <!--- ## Memory Accesses with Moneta #### P1 (4pt) What does the memory accesses to the weight tensor during training of the Perceptron look like? Include the cache hit rate statistics in your graph. ``` Your Graph here ``` -->