###### tags: `cs:app`
# Chapter 5 : Optimizing Program Performance
## [Here's the slide](https://hackmd.io/@hiFbws9HS-uyBrkLZP8c_w/BkrfGULeY)
## Capabilities and Limitations of Optimizing Compilers
- Compilers may apply only safe optimizations to a program.
- Some optimizations might seem beneficial but go wrong in certain cases. Here are some common situations.
```cpp=
void one(int *x, int *y) {
    *x += *y;
    *x += *y;
}
void two(int *x, int *y) {
    *x += 2 * *y;
}
/*
two looks obviously faster than one (three memory references
instead of six), but when x == y the results differ: if *x starts
as t, one produces 4t while two produces 3t, so the compiler
cannot substitute one for the other.
*/
```
```cpp=
int cnt = 0;
int f() {
    return cnt++;
}
int one() {
    return f() + f();
}
int two() {
    return f() * 2;
}
/*
The two versions might look equivalent, but not when the function
has a side effect: starting from cnt == 0, one() returns 0 + 1 = 1
and leaves cnt at 2, while two() returns 0 * 2 = 0 and leaves cnt at 1.
Side effect: a function has a side effect when it modifies some
global state during its execution.
*/
```
- Memory aliasing: two different pointers may reference the same memory location.
- If a compiler cannot determine whether an optimization is safe, it will not apply it.
## Expressing Program Performance
- A metric: *CPE* (short for *cycles per element*) is used to express the performance of a program that operates on a sequence of elements. We'll use it throughout the rest of this chapter.
- The clock frequency of a CPU is expressed in cycles per second, e.g. a 4 GHz CPU runs at $4\times10^9$ cycles per second; CPE measures how many of those clock cycles are spent per element processed.
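As a worked example (the numbers here are made up for illustration): suppose that measuring a procedure for several element counts $n$ fits the linear model $$T(n) \approx 80 + 4.0\,n \text{ cycles}$$ The constant $80$ cycles is call and loop overhead, and the slope, $4.0$, is the procedure's CPE.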
## Some Prerequisites
- The following optimizations are all measured on the reference machine, an Intel *Core i7 Haswell* processor, so there is no guarantee they behave the same way on other machines or with other compilers.
- The rest of this chapter walks through a series of optimizations applied to a common baseline.
- Many factors complicate the measurement of CPE, so the reported results should be treated as accurate only to within a small rounding error.
- Here is the baseline function that we'll progressively optimize:
```cpp=
/*
vec_ptr is a structure representing a vector.
data_t is the data type we're using; it could be {int, double}.
OP is the combining operation; it could be {+, *}.
IDENT is the identity element of the operation (0 for +, 1 for *).
*/
void combine1(vec_ptr v, data_t *dest) {
    int i;
    *dest = IDENT;
    for (i = 0; i < size(v); i++) {
        data_t val;
        get_vec_ele(v, i, &val);
        *dest = *dest OP val;
    }
}
/*
combine1 simply applies OP across every element of the vector,
accumulating the result into *dest.
*/
```
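The notes never show the vector interface itself. Here's a minimal sketch of what `vec_ptr`, `size()`, `get_vec_ele()`, and `get_vec_start()` could look like (the names follow the snippets in these notes; the actual CS:APP code uses slightly different names such as `vec_length` and `get_vec_element`):
```cpp=
typedef struct {
    int len;        // number of elements
    data_t *data;   // underlying array
} vec_rec, *vec_ptr;

int size(vec_ptr v) { return v->len; }

// Copy element i into *dest; return 0 on an out-of-bounds index.
int get_vec_ele(vec_ptr v, int i, data_t *dest) {
    if (i < 0 || i >= v->len)
        return 0;
    *dest = v->data[i];
    return 1;
}

// Expose the raw data array (used by combine3 and later versions).
data_t *get_vec_start(vec_ptr v) { return v->data; }
```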
- These are the CPE measurements for the baseline code, both unoptimized and with `-O1` optimization.

## Eliminating Code Inefficiencies - Code Motion
- As described in Chapter 3, every loop has a test section that checks whether the loop condition still holds and decides whether to exit the loop. The following code performs poorly because of that test:
```cpp=
void lower1(char *s) {
    int i;
    for (i = 0; i < strlen(s); i++) {
        // e.g. convert s[i] to lowercase
    }
}
// strlen takes O(n), and it is re-evaluated in the loop test
// on every iteration.
```
but if we move the strlen(s) call out of the loop:
```cpp=
void lower2(char *s) {
    int i, siz = strlen(s);
    for (i = 0; i < siz; i++) {
        // same work as lower1
    }
}
```
then the loop test no longer has to evaluate $strlen(s)$ on every iteration, which leads to the following *CPE* results:

And this is called *code motion*.
- Though it looks like a fairly simple and harmless optimization, the compiler can't be sure whether the function in the loop test has side effects, so it won't perform this transformation; programmers have to do it by hand.
- In case anyone doesn't know: $lower1$ runs in $O(n^2)$ overall, while $lower2$ runs in $O(n)$.
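Applied to our running example, code motion hoists the `size(v)` call out of the loop test. A sketch of this version (it corresponds to the book's `combine2`, using the same assumed vector interface as above):
```cpp=
void combine2(vec_ptr v, data_t *dest) {
    int i, siz = size(v);    // compute the length once, outside the loop
    *dest = IDENT;
    for (i = 0; i < siz; i++) {
        data_t val;
        get_vec_ele(v, i, &val);
        *dest = *dest OP val;
    }
}
```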
## Some Slight Changes
- Now we make some slight changes to enable the following optimizations. The code looks like this:
```cpp=
void combine3(vec_ptr v, data_t *dest) {
    int i, siz = size(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < siz; i++) {
        *dest = *dest OP data[i];
    }
}
```
We remove the call to *get_vec_ele()* and instead grab a pointer to the start of the underlying array and index the elements through it directly.
This doesn't change the *CPE* much, but it enables some other optimizations we'll make use of later.
## Eliminating Unneeded Memory References
- Now let's have a peek into the assembly code of $combine3()$

It's easy to see that we read and write $*dest$ on every iteration, which is wasteful.
We introduce a temporary variable $acc$ to accumulate the computed value, so the code looks like this:
```cpp=
void combine4(vec_ptr v, data_t *dest) {
    int i, siz = size(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (i = 0; i < siz; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}
```
And the assembly code:

The redundant reads and writes have vanished. The $CPE$ now becomes:

- And again, this might seem like a simple and harmless optimization, but if any element of the vector has the same address as $dest$, the two versions produce different results. So the compiler won't help us with this either; it also has to be done by the programmer.
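To make the aliasing problem concrete, here's a small hypothetical scenario (values chosen for illustration, in the spirit of the book's discussion):
```cpp=
/* Suppose OP is *, IDENT is 1, v holds [2, 3, 5], and we call
 *     combine3(v, get_vec_start(v) + 2);   // dest aliases v[2]
 *
 * combine3 updates *dest on every iteration, so v[2] is overwritten
 * before it is read: it evolves 1 -> 2 -> 6 -> 36.
 * combine4 accumulates 2 * 3 * 5 = 30 in acc and stores it only at
 * the end, so v[2] becomes 30.
 * Since 36 != 30, the compiler cannot prove the two versions
 * equivalent and will not perform the transformation for us. */
```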
## Understanding Modern Processors
The optimizations so far only reduce the overhead of procedure calls and memory references. To further improve performance, we have to exploit the design of the processor itself, so we need a basic understanding of modern processors.
- The following figure shows a very simplified view of modern processors.

These processors are described as being *superscalar*, which means they can perform multiple operations on every clock cycle, and *out of order*, meaning that the order in which instructions execute need not correspond to their ordering in the machine-level program.
In brief, the processor splits into two parts:
    - *Instruction Control Unit (ICU)*: reads instructions from memory and converts them into a set of primitive operations.
    - *Execution Unit (EU)*: contains hardware called *functional units*, each specialized to handle a particular type of operation. For example, the following figure shows our reference machine's functional units:

Modern processors also employ many other elaborate techniques, but in my opinion those aren't necessary for the goal we're pursuing, so I won't go into them for now. Anyone who's interested can refer to page 518.
- And here's an important concept: the *ICU* fetches and decodes instructions well ahead of where execution currently is, so that it has enough time to decode them and send them to the *EU*. When it encounters a branch, it uses a technique called *branch prediction* to guess which way the branch will go; a wrong guess can hurt performance. We'll come back to this later.
## Performance Bounds Due to the Processor's Architecture
- The figure below documents the performance of some of the arithmetic operations on our reference machine:

*Latency* is the total time required to perform the operation. *Issue* is the minimum time between two successive operations of the same kind. *Capacity* is the number of functional units that can perform the operation.
- Now we bring up a concept mentioned before: *pipelining*. Addition and multiplication have issue times of one because they can be fully pipelined, i.e. a new operation can start every cycle. Division, on the other hand, has an issue time equal to its latency, meaning it has to wait for the previous operation to finish before a new one can start.
- To express the impact of issue time, we define the *throughput* as the reciprocal of the issue time. A *fully pipelined* operation has a throughput of one operation per cycle, while others are lower. Multiple functional units can also increase the throughput, so intuitively it can be calculated as: $$\text{Throughput} = \frac{\text{Capacity}}{\text{Issue time}}$$
- In summary, the architecture sets two lower bounds on the CPE of a program:
    - The *latency bound* comes from operations that must be performed in strict sequence, each one waiting for the result of the previous one.
    - The *throughput bound* comes from the raw computing capacity (throughput) of the processor's functional units.
Here are the bounds for our function $combine4()$:

Take integer addition for example: the operation has a latency of $1$, so its latency bound is $1.00$. Our reference machine has $4$ functional units that can do integer arithmetic but only $2$ load units, and since every element must be loaded from memory, at most two elements can be processed per cycle, giving a throughput bound of $1 / 2 = 0.50$.
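Similarly, for double-precision multiplication (assuming the Haswell figures cited in the book: latency 5, issue time 1, capacity 2): $$\text{latency bound} = 5.00, \qquad \text{throughput bound} = \frac{1}{\text{throughput}} = \frac{1}{2} = 0.50$$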
- And now we introduce a useful tool: the *data-flow graph*. In order to draw one, we need to look at the machine-level code of the function:

From the instructions and their data dependencies we can draw a graph that looks like this:

We omit some parts because they don't really matter for performance.
We can easily see that the computation in each iteration splits into two chains: one on the left with the mul and the other on the right with the add. Since the loop runs exactly $n$ times, we replicate this template $n$ times:

And since the left chain has a latency of $5$ per iteration while the right chain has only $1$, we call the left chain the *critical path*: it dictates the overall running time of the function.
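Concretely, for the double multiplication case (latency 5) the critical path predicts a total time of roughly $5n$ cycles for $n$ elements, i.e. $$\text{CPE} \approx \frac{5n}{n} = 5.00$$ which is exactly the latency bound for that operation.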
## Loop Unrolling
- First, let's look at the code below, which makes a slight change to $combine4()$:
```cpp=
void combine5(vec_ptr v, data_t *dest) {
    // Declarations and setup same as combine4
    int limit = siz - 1;
    for (i = 0; i < limit; i += 2) {   // combine two elements per iteration
        acc = (acc OP data[i]) OP data[i + 1];
    }
    for (; i < siz; i++) {             // finish any leftover element
        acc = acc OP data[i];
    }
    *dest = acc;
}
```
The only real difference is the number of elements combined in each iteration. Here we combine two elements per iteration, so this is called "$2 \times 1$ loop unrolling"; to generalize, a loop that combines $k$ elements per iteration is a "$k \times 1$ loop unrolling".
After this change, we get the following CPEs:

We can see that the CPE for integer addition drops to the latency bound, because halving the number of iterations eliminates some loop overhead. But the rest remain unchanged.
- To understand why the operations other than integer addition remain unchanged, we have to draw the data-flow graph:

Though we changed the number of combining operations per iteration, the overall data dependency chain is unchanged: $$\frac{n}{k} \times (k \times 5) = 5n$$
So the critical path is still $5n$.
- In order to solve the above problem, we introduce a further optimized version:
```cpp=
void combine6(vec_ptr v, data_t *dest) {
    // Declarations and setup same as combine4
    data_t acc1 = IDENT, acc2 = IDENT;
    int i, limit = siz - 1;
    for (i = 0; i < limit; i += 2) {   // even indices go to acc1, odd to acc2
        acc1 = acc1 OP data[i];
        acc2 = acc2 OP data[i + 1];
    }
    for (; i < siz; i++) {             // finish any leftover element
        acc1 = acc1 OP data[i];
    }
    *dest = acc1 OP acc2;
}
```
With this change, our data-flow graph becomes:

We can see that the original critical path splits into two parallel chains, so each chain now has a latency of: $$\frac{n}{2} \times 5 = \frac{5n}{2}$$
The CPEs then become:

And to generalize this, a loop that uses $k$ separate accumulators to combine $k$ elements per iteration in parallel is called "$k \times k$ loop unrolling".
- For an operation with latency $L$ and capacity $C$, an unrolling factor of $k = C \times L$ is the practical limit, since beyond that there aren't enough functional units to keep busy. Also, too much unrolling may cause *register spilling*: there are no longer enough registers, so values have to be kept in memory, which slows the program down.
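As a worked example (using the same assumed figures as above, latency 5 and capacity 2 for double multiplication): $$k = C \times L = 2 \times 5 = 10$$ so an unrolling factor of around 10 (e.g. $10 \times 10$ unrolling) is needed before that operation can reach its throughput bound; going further mainly adds register pressure.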
- There's another technique that can be used to make use of all the available resources, called the *reassociation transformation*. We simply change the inner loop of the $2 \times 1$ unrolled version from
```cpp=
void combine5(vec_ptr v, data_t *dest) {
    // Same as before; only the combining statement matters here
    acc = (acc OP data[i]) OP data[i + 1];
}
```
to
```cpp=
void combine7(vec_ptr v, data_t *dest) {
    // Same as before, but with the combining reassociated
    acc = acc OP (data[i] OP data[i + 1]);
}
```
```
The data-flow graph then looks like this:

The operation ```data[i] OP data[i + 1]``` no longer depends on the accumulated value, so it is detached from the critical path, giving the following CPE:

## Other Limiting Factors
- Register spilling (already discussed above with unrolling factors).
- Branch prediction and misprediction penalties: as mentioned before, the ICU works well ahead of the current execution state, so it has to predict the outcome whenever it runs into a branch. When the prediction turns out to be wrong, there is a misprediction penalty, and this can be a big factor that slows down performance.
To reduce these penalties, we should write conditional operations in a functional style so that the compiler can use the conditional data transfers (conditional moves) described in Chapter 3 instead of conditional jumps; conditional moves don't need to be predicted at all. See the sketch below.
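A sketch of what writing a conditional "in a functional style" means, modeled on the minmax example from the book (the function names and types here are illustrative):
```cpp=
// Imperative style: whether the branch is taken depends on the data,
// so for random inputs it is mispredicted roughly half the time.
void minmax1(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

// Functional style: compute both values unconditionally; the compiler
// can turn the ?: operators into conditional moves, leaving no
// data-dependent branch to mispredict.
void minmax2(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```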
- Load dependencies: when a memory location is loaded, the load has to wait until all pending writes to that location have completed, which creates a dependency. Consider the following code:
```cpp=
void copy(int *src, int *dest, int n) {
    for (int i = 0; i < n; i++) {
        dest[i] = src[i];
    }
}
```
When calling ```copy(a, a + 1, siz)``` and ```copy(a, a, siz)```, the latter performs much better. This is because in the former, the location loaded in each iteration is the same location that was written in the previous iteration, creating a store-to-load dependency.
## Summary of Performance Improvement Techniques
- High-level design: choose an algorithm and data structures with better asymptotic complexity.
- Basic coding principles:
    - Eliminate excessive function calls.
    - Move computations out of loops when possible.
    - Consider selective compromises of program modularity to gain greater efficiency.
    - Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results. Store a result in an array or global variable only when the final value has been computed.
- Low-level optimizations:
    - Unroll loops to reduce overhead and to enable further optimizations.
    - Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation.
    - Rewrite conditional operations in a functional style to enable compilation via conditional data transfers.
## Program Profiling
Profiling tools can identify the bottlenecks of a program; by Amdahl's law, improving the parts where the program spends most of its time yields the largest overall gain.
Examples: gprof, VTune, Valgrind.
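Amdahl's law makes this precise: if the part being improved accounts for a fraction $\alpha$ of the original running time and is sped up by a factor of $k$, the overall speedup is $$S = \frac{1}{(1 - \alpha) + \alpha / k}$$ For example, speeding up code that accounts for 60% of the run time by a factor of 3 gives only $S = 1 / (0.4 + 0.2) \approx 1.67$, so it pays to profile and attack the dominant parts first.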
## Summary
A cool sentence I think is great for summarizing this chapter:
> Premature optimization is **the root of all evil**.
> [name=Donald Knuth]