# Performance Definition
* For different target the index for estimate performance will be different
1. Personal Use : Execution Time
2. Server : Throughput
* We define the $Performance = 1/Execution Time$.To determine that $PC_{a}$ is faster than $PC_{b}$ by *n* times and n is calculated by $Performance_a / Performance_b$
* **Performance is inversely proportion to Execution time**
> 
* Since the I/O time and System CPU time is uncontrollable.We will focus on the **User CPU time**.
# User Time
## Components of User Time
* To avoid errors caused by the data transfer speed.We use a *clock* to synchronize the data and ensure the result is correct
* Clock can be derived by $Clock = 1 / CPU\,Speed$ the *CPU Speed also known as **clock rate***
> 
> 
* To compare **different machine** we only need to focus on the $Avg\,CPI * Cycle\,Time$(*Intruction Time*) since the IC will remain the same when running same program.
* To compare **different program** we focus on the $IC * Avg\,CPI$(*CPU Clocks*) since the cycle time will be the same
---
> Example:
> For A the $Instruction\,time = 250 * 2 = 500$
> For B the $Instruction\,time = 500 * 1.2 = 600$
> =>A is faster than B by $600/500 = 1.2\ times$
> Example :
> $Cycles = Execution time / clock\ rate = 10 / (1/2*10^9)$ = 20G
> For B $Cycles_B = 20 * 1.2 = 24G$ the desired clock $speed = 24G/6 = 4Ghz$
> Example:
> 
> 
> 1. $Instructions_1 = 2 + 1 + 2 = 5$,$Instructions_2 = 4 + 1 + 1 = 6$
> 2. Sequence 2
> 3. $CPI_1 = (2*1+2*1+3*2)/5 = 2$, $CPI_2 = (4*1+1*2+3*1) = 1.5$
>Example: 
> (a.) $IPC_1 = 20*10^9/2*10^9 = 10$,$IPC_2 = 30*10^9/1.5*10^9 = 20$,$IPC_3 = 90*10^9/3*10^9 = 30$
> (b.) P2 is slower than P1 by 7/10 = 0.7times => To speed up to 7s the clock rate needs to be $1.5/0.7 = 2.14\ Ghz$
> (c.) Same as above,the ratio is 0.9 => $30*0.9 = 27 * 10^9$
## Factors that will influence the User Time
>
# Million Instructions Per Second(MIPS)
* Can be used for estimate efficiency but **it's not absolutely correct**
> 
> And the instruction time can be expressed like below
> 
* Like what we said before,MIPS can't be used for compare PC's efficiency
> 
> 1. For CISC,one instruction can perform more difficult job than a RISC instruction
> 2. Different program will have different MIPS since the instructions inside is different
---
> Example:
> 
> $MIPS = 10^7/0.25 * 10^9 = 40$
> $CPI = (100*10^6*0.25)/10^7 = 2.5$
> Example:
> 
> (a). $CPI_{M1} = (0.6+0.6+0.4) = 1.6$,$CPI_{M2} = (1.2+0.9+0.4) = 2.5$
> (b). $IT = 1/(MIPS)*10^6$ => $MIPS = 1/IT*10^6$. $MIPS_{M1} = 1/((1/{80*10^6})*1.6)*10^6 = 50$,$MIPS_{M2} = 1/((1/{100*10^6})*2.5)*10^6 = 40$
> (c). Choose the most frequent instructions to reduce its CPI => Reduce A'CPI to 1 => $CPI_{M2} = (0.6+0.9+0.4) = 1.9$ => $MIPS_{M2} = 1/((1/{100*10^6})*1.9)*10^6 = 52>50$
> Example:
> 
> Since every mul 4 instruction's CPI is reduced to 2(2 * 1) thus the time we saved is coming from it => $10-9 * 1Ghz = 1G = Total\ Clocks\ Saved$ => $1G/2 = 0.5G = Total\ Instruction\ Swapped$
> Example:
> 
> 1. $MIPS_A = (L+F*N)/Y*10^6$ $MIPS_B = (L+F)/Z*10^6$
> 2. Put the numbers into the equation above => $Y = (L+F*N)/MIPS_A*10^6 = 100ms$,$Z = (L+F)/MIPS_B*10^6 = 91.67ms$
> 3. Yes,with the co-processor the excution time is faster.
# Amdahl's Law
* Used for describe the improvement of the execution time
> 
---
> Example:
> $Speedup = 10/(5/5)+5 = 5/3$
> Example:
> Let total *time spend on fp arithmetic = x* => $3 = 100/(x/5)+(100-x) = 250/3$
---
* From the example above we can drived an formula for speedup by letting the *time affected = f* and *improvement = s*
> 
---
> Example:
> $Time\ after\ speedup = 60/3 = 20$ => $20 = 20 + 40/s$ => s needs to be infinity that the speedup of 3 is achieveable => impossible
> Example:
> 1. False,clock rate increase the execution time will reduce
> 2. True
> 3. True
> 4. False,$1/(0.4/40+0.6) = 1.63$
> 5. False,$1/(0.5/20+0.5 = 1.9$
> 6. True,$1/(0.6/10+0.4) = 2.17$
> Example:
> (1).$1/(0.2/4+0.8) = 1.17$
> (2).$1/(0.5/2 + 0.5) = 1.3$
> (3).$1/(0.2/4+0.5/2+0.3) = 1.6$
> Example:
> Each loop is independent => Can be perform in parallel
> Without parallelized => $Cycles_{Before} = 20 + 200 * 200 = 40020$
> In parallel => $Cycles_{After} = 20 + 200 = 220$
> $Answer = 40020 / 220 = 181.91$
> Example:
> (a) Let `X` be the time used for excuting MMX => $2 = \frac{1}{\frac{x}{10}+(1-x)} => x = 55\%$
> (b) Let `T` be total time spend before and since the time spend on MMX mode is 10 times faster => $Answer = \frac{\frac{0.55T}{10}}{\frac{0.55T}{10}+0.45T} = \frac{\frac{0.55}{10}}{\frac{0.55}{10}+0.45} = 10.89\%$
> (c) If whole program is in MMX then the speedup will be 10 => half is 5 => $5 = \frac{1}{\frac{x}{10}+(1-x)} => x =88\%$
> Example:
> Enhanced mode used 80% of the time means the *80% of the time spend after have been speedup*
> (1). Since the time after is speedup 10 times => Let `T` be the time spend after => Original time spend on fast mode = $10 * 0.8 * T = 8T$ => $SpeedUp = \frac{8T}{8T+0.2T} = 8.2$
> (2). $Answer = \frac{8T}{8.2T} = 97\%$
# Performance Summarizing
## Arithmetic Mean
* Consider a time of excution chart below
> 
* We can get performace by $\frac{Time_B}{Time_A}$,for example => for program 1 computer B is faster than computer A by $\frac{100}{20} = 5$ times faster
* And if we take the arithmetic mean for all program, we can get a overall comparison for those two
> 
* To get a more accurate result,we can consider adding *weight* to different program
## Geometric Mean
* We take one machine as reference and *normalized* the run time for other machine with it by the formula $SPECratio = \frac{Time_{Reference}}{Time_{Measured}}$
> 
* With the use of GM,we can compare machines without worrying the reference changed
> 
* But it's **not suitable to compare** performance since it don't predict excution time
> 
> From the chart above, we can see that B'S overall performance is better than A but the GM of both of them are the same
---
> Example:
>1. The excution time will increase by 10%
>2. The excution time will increase by $1.1*1.05 = 1.155$
>3.The reference time is original time => $SPECratio = \frac{1}{1.155} = 0.86$ => Decreased by 14%
> Example:
> 
> 1. $(1*0.1)+(3*0.1)+(2*0.5)+(5*0.1)+(4*0.2) = 2.7$
> 2. $10^9 * 2.7 * (10*10^-9) = 27sec$
> 3. Since the multiply can be replaced by the mul-add by 5% and it needs to perform arithmetic too,thus the percentage of arithmetic instruction will reduce by 5% too.
> => Jump = 10% Branch = 10% Arithmetic = 45% Multiply = 5% Memory = 20% Mul-Add = 5%
> * The total percentage isn't 100% when calculating CPI,we need to **divide the current total percentage** to get a correct answer =>
> $CPI = (1*0.1)+(3*0.1)+(2*0.45)+(5*0.05)+(4*0.2) + (5*0.05) / 0.95 = 2.74$
> 4. $10^9 * 0.95 * 2.74 * (10*10^-9) = 26.03sec$
> Example:
> 1. $(1*0.4)+(3*0.2)+(3*0.2)+(2*0.2) = 2$
> 2. Load
> 3. ALU = 40% Load = 10% Store = 20% Branch = 20% =>
> $CPI = ((2*0.4)+(3*0.1)+(3*0.2)+(2*0.2)) / 0.9= 2.33$
> =>No,since both CPI and Cycle Time have increased,the excution time will also increase