# CSE142L Lab 2 Characterizing a Perceptron Solution
## Enabling the Profiler
#### P1 (1pt) Which function accounts for the most time?
A: `tensor_t<double>::get(int, int, int, int)`
#### P2 (1pt) What percentage of time does it account for?
A: A range of values was observed, roughly 30% to 60%.
#### P3 (1pt) According to Amdahl's Law, how much speedup could you possibly achieve by optimizing this function?
```
Speedup = 1/(1 - x), where x = (answer in P2)/100
i.e., about 1.43 for 30%, up to 2.5 for 60%
```
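A minimal Python sketch of this calculation (the `amdahl` helper is ours, not part of the lab code). With `s` left at infinity it gives the upper bound above; the same formula with a finite `s` covers the 3x case asked in the next section.

```python
def amdahl(p, s=float("inf")):
    """Overall speedup when a fraction p of runtime is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl(0.30))      # ~1.43: upper bound if get() took 30% of the time
print(amdahl(0.60))      # 2.5:  upper bound if get() took 60% of the time
print(amdahl(0.467, 3))  # ~1.45: the 3x activate() case in the next section
```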
<div style="page-break-after: always;"></div>
## Taking a Closer Look at the Code
#### P3 (1pt) We noticed that `fc_layer_t::activate(tensor_t<double>&)` took 46.7% of the total execution time. If we speed up this function by 3x, how much will the total program be sped up?
```
1/((1 - 0.467) + 0.467/3) ≈ 1.452
```
## What's the Compiler Doing?
#### P1 (1pt) How many instructions does the function 'point_t::point_t(int, int, int, int)' execute when called?
A: 24 including the call to `mcount` inserted by gprof instrumentation; 23 otherwise.
## Looking at Performance Counters
#### P1 (4pt) Compute the instruction mix for each dataset and enter it below (as %)
| Dataset | Memory insts | Branches | uncond. branches |
|----------|--------------|----------|------------------|
| mnist | 70.57% | 6.58% | 6.13% |
| emnist | 70.56% | 6.53% | 6.12% |
| cifar10 | 70.6% | 6.57% | 6.13% |
| cifar100 | 70.57% | 6.52% | 6.12% |
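For reference, a sketch of how these percentages fall out of raw counter values. The counts below are hypothetical numbers chosen to reproduce the mnist row, not measured data; plug in your own counter dump.

```python
def instruction_mix(total, mem, branches, uncond):
    """Raw counter values -> (memory %, branch %, unconditional branch %)."""
    return tuple(100.0 * x / total for x in (mem, branches, uncond))

# Hypothetical counts that reproduce the mnist row above.
print(instruction_mix(total=1e9, mem=7.057e8, branches=6.58e7, uncond=6.13e7))
# -> (70.57, 6.58, 6.13)
```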
#### P2 (4pt) Fill out the table (total data processed is the product of the model size and the number of training inputs)
| Dataset | Model size (B) | training_inputs_count | total data processed | Memory ops |
|----------|----------------|-----------------------|----------------------|------------|
| mnist | 74K | 200 | 14800K | 8.34e+08 |
| emnist | 230K | 150 | 34500K | 2.23e+09 |
| cifar10 | 290K | 100 | 29000K | 1.6e+09 |
| cifar100 | 2500K | 5 | 12300K | 8.33e+08 |
#### P3 (4pt) Prepare a bar graph from your table that plots the number of memory operation/byte of data processed and the number of branches per byte of data processed for each workload.
For mnist,
Memory ops per byte = memory ops / total data processed (B).
Branches per byte = #branches / total data processed (B).
Similarly for the other datasets.
*> NOTE: This is a sample graph from one of the submissions, not the only solution. Your y-axis values may differ.*
Note the axis labels, dataset labels, clear title, and appropriate colors. A variety of answers were accepted unless the graph is missing something critical or is not legible.
Partial marks if you did not plot "per byte"; 0 marks if you plotted memory ops vs. branches.

## Asking the Compiler to Do More
#### P1 (4pt) Compute the instruction mix for each dataset for the optimized code and enter it below (as %)
*> NOTE: You will have slightly different values for the columns below. If your values differ widely, for example by an order of magnitude, check where you went wrong.*
| Dataset | Memory insts | Branches | uncond. branches |
|----------|--------------|----------|------------------|
| mnist | 34.66% | 11.46% | 0.94% |
| emnist | 36.68% | 11.01% | 0.37% |
| cifar10 | 35.74% | 11.94% | 0.87% |
| cifar100 | 35.31% | 11.35% | 0.13% |
#### P2 (4pt) Fill out the table for the optimized code (total data processed is the product of the model size and the number of training inputs)
*> NOTE: You will have different values for the memory ops column below.*
*> NOTE: Marks were deducted if you did not report memory ops from the optimized (gprof or non-gprof) version of the code; such counts are off by a factor of 10 or 100.*
| Dataset | Model size (B) | training_inputs_count | total data processed | Memory ops |
|----------|----------------|-----------------------|----------------------|------------|
| mnist | 74K | 200 | 14800K | 1.33e+07 |
| emnist | 230K | 150 | 34500K | 3.18e+07 |
| cifar10 | 290K | 100 | 29000K | 5.72e+07 |
| cifar100 | 2500K | 5 | 12300K | 4.28e+07 |
#### P3 (4pt) Prepare a bar graph from your table that plots the number of memory operation/byte of data processed and the number of branches per byte of data processed for each workload for the optimized code.
For mnist,
Memory ops per byte = memory ops / total data processed (B).
Branches per byte = #branches / total data processed (B).
Similarly for the other datasets. The plotting sketch shown earlier applies here as well, with the optimized counts substituted.
*> NOTE: This is a sample graph from one of the submissions, not the only solution. Your y-axis values may differ.*
Note the axis labels, dataset labels, clear title, and appropriate colors. A variety of answers were accepted unless the graph is missing something critical or is not legible.
Partial marks if you did not plot "per byte"; 0 marks if you plotted memory ops vs. branches.

For the following questions, compute the answers based on the total number of instructions, cycles, etc. across all the workloads.
#### P4 (1pt) Based on the data in optimized-pe.csv, how much speedup from '-O3' do you expect due to change in IC?
A: Speedup = Total unoptimized IC / Total optimized IC. Generally a range of values from 25x to 40x, but graded based on what you reported above.
#### P5 (1pt) Based on the data in optimized-pe.csv, how much speedup from '-O3' do you expect due to change in CPI?
A: Opt CPI = total optimized cycle count / total optimized IC, and similarly for Unopt CPI.
Then Speedup = Unopt CPI / Opt CPI. Generally a range of values from 1.05x to 1.7x, but graded based on what you reported above.
The reciprocal of this value was accepted as well if you reported Opt CPI / Unopt CPI instead.
#### P6 (1pt) Based on the data in optimized-pe.csv, how much speedup from '-O3' do you expect from the combination of IC and CPI?
A: Multiply the two speedups above. Generally a range of values from ~30x to ~50x, but graded based on what you reported above.
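A sketch of the P4-P6 arithmetic, using the totals from the P7 table below; substitute your own counts.

```python
unopt_ic, opt_ic   = 8.02e9, 2.27e8   # instruction counts from the P7 table
unopt_cyc, opt_cyc = 3.57e9, 8.32e7   # cycle counts from the P7 table

ic_speedup  = unopt_ic / opt_ic                            # P4: ~35.3x
cpi_speedup = (unopt_cyc / unopt_ic) / (opt_cyc / opt_ic)  # P5: ~1.21x
print(ic_speedup, cpi_speedup, ic_speedup * cpi_speedup)   # P6: ~42.9x
```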
#### P7 (5pt) Fill in the data below
*> NOTE: You will have different values for the table below. The grading for calculated values is based on the numbers you reported.*
| Assembly Code | Unoptimized | Optimized |
|----------------------------------|------------------------|------------------------|
| Instruction count | 8.02e9 | 2.27e8 |
| Cycle count | 3.57e9 | 8.32e7 |
| Cycle time | 0.5ns | 0.5ns |
| Projected execution time | 1.785s (cycle count × cycle time) | 0.041637s |
| Projected speedup vs unoptimized | 1 | 42.83x |
| Actual execution time | 1.0586s | 0.026s |
| Actual speedup vs unoptimized | 1 | 40.71x |
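The derived rows of this table, and the P8 accuracy figure, follow mechanically from the raw counts; a sketch:

```python
cycle_time = 0.5e-9                    # 0.5 ns per cycle
proj_unopt = 3.57e9 * cycle_time       # 1.785 s
proj_opt   = 8.32e7 * cycle_time       # ~0.0416 s

proj_speedup   = proj_unopt / proj_opt     # ~42.9x
actual_speedup = 1.0586 / 0.026            # ~40.7x
print(actual_speedup / proj_speedup)       # ~0.95 -> the PE is ~95% accurate (P8)
```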
#### P8 (4pt) How accurately did the PE model the performance of this program on these workloads?
A: 95.05%. Compare the projected and actual speedups above. Any reasonable, clear characterization was accepted (e.g., "95%", "poorly", "very close") based on your numbers. Bare fractions like 1.3 are ambiguous and received 0 marks.
#### P9 (4pt) Based on profile data with -O3 turned on, which functions should you target for optimization?
A: At least two of `activate`, `calc_grads`, or `fix_weights`, unless a single function accounted for close to 100% of the time in your profile.
#### P10 (4pt) For the functions you listed, what's the largest speed up you could hope to achieve?
```
Use Amdahl's Law for each function. For example:
if activate takes 30%, 1/(1 - 0.30) ≈ 1.43;
if calc_grads takes 65%, 1/(1 - 0.65) ≈ 2.86.
No marks if you summed the percentages of all the functions
and reported infinite speedup.
```
<div style="page-break-after: always;"></div>
## Measuring Actual Performance
For the following questions, compute the answers based on the total number of instructions, cycles, etc. across all the workloads.
#### P1 (3pt) How much overhead (i.e., increase) does gprof cause in terms of the following?
REMOVED.
## Reasoning About Performance
#### P1 (1pt) Which function accounts for the largest fraction of time in optimized gprof data?
A: One of `activate`/`fix_weights`/`calc_grads`. No marks if you reported `get` or `operator`.
#### P2 (1pt) What's the O() complexity of that function?
A: O(mn) if the hot function is activate/fix_weights/calc_grads. Check the definitions of `m` and `n` in the README if you reported O(m^2n), O(n^3), or something similar.
#### P3 (4pt) Fill out this table using data from your per-workload gprof outputs for your hot function.
*> NOTE: You will have different values for the "measured ET" and "ET rel. to mnist" columns below. Reporting 0.01s for all measured ET values is not a correct measurement and was awarded 0.*
| dataset | measured ET (s) | ET rel. to mnist | Big-O estimate rel. to mnist |
|----------|-------------|------------------|------------------------------|
| mnist | 0.0000396775 | 1 | 1 |
| emnist | 0.000208805 | 5.262 | `28*28*62/(28*28*10) = 6.2` |
| cifar10 | 0.000047555 | 1.198 | `32*32*3*10/(28*28*10) = 3.918` |
| cifar100 | 0.0019966 | 50.32 | `32*32*3*100/(28*28*10) = 39.18` |
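The last column above is mechanical; a sketch, assuming the hot function is O(mn) with m = input size and n = number of output classes (see the README):

```python
# m = input size in pixels (x channels), n = number of output classes.
shapes = {"mnist": (28 * 28, 10), "emnist": (28 * 28, 62),
          "cifar10": (32 * 32 * 3, 10), "cifar100": (32 * 32 * 3, 100)}
base = shapes["mnist"][0] * shapes["mnist"][1]
for name, (m, n) in shapes.items():
    print(name, round(m * n / base, 2))   # 1.0, 6.2, 3.92, 39.18
```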
#### P4 (4pt) Draw a scatter plot with the relative values of m*n on the x-axis and relative execution time on the y-axis. Plot the data for measured ET and your O() estimate.
*> NOTE: The following graph is graded based on table you reported above.*
Note the x-axis label, y-axis label, dataset labels, clear x and y scales, and title in the graph below. The graph needs to convey "relative to mnist" through either the title or the axis labels.

(Ignore the imagenet point in the sample graph above.)
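A matplotlib sketch of one acceptable scatter plot, using the relative values from the table above:

```python
import matplotlib.pyplot as plt

labels = ["mnist", "emnist", "cifar10", "cifar100"]
rel_mn = [1, 6.2, 3.918, 39.18]     # Big-O estimate rel. to mnist
rel_et = [1, 5.262, 1.198, 50.32]   # measured ET rel. to mnist

plt.scatter(rel_mn, rel_et, label="measured ET")
plt.scatter(rel_mn, rel_mn, marker="x", label="O(mn) estimate")
for x, y, name in zip(rel_mn, rel_et, labels):
    plt.annotate(name, (x, y))
plt.xlabel("m*n relative to mnist")
plt.ylabel("execution time relative to mnist")
plt.title("Hot-function scaling relative to mnist")
plt.legend()
plt.savefig("scaling.png")
```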
#### P5 (1pt) How well does your O() match actual performance?
A: Interpret based on the graph above. Any reasonable interpretation was accepted (e.g., "poorly", "close", "dataset X is an outlier") based on your graph and/or numbers above.
## Changing the Clock Rate and Measuring Power
#### P1 (4pt) Draw a line graph with clock speed on the x-axis and execution time on the y-axis.

#### P2 (4pt) Draw a line graph with clock speed on the x-axis and energy on the y-axis.

#### P3 (4pt) Draw a line graph with clock speed on the x-axis and power on the y-axis.

<!---
## Memory Accesses with Moneta
#### P1 (4pt) What does the memory accesses to the weight tensor during training of the Perceptron look like? Include the cache hit rate statistics in your graph.
```
Your Graph here
```
-->