# Assignment4: Cache
contributed by < [`kaeteyaruyo`](https://github.com/kaeteyaruyo) >
## Exercise 1 - A Couple of Memory Access Scenarios
### Scenario 1
#### Parameters

* Program Parameters:
    * Array Size (a0): 128 (bytes)
    * Step Size (a1): 8 (words)
    * Rep Count (a2): 4
    * Option (a3): 0
* Cache Parameters:
    * Cache Levels: 1
    * Block Size: 8 (bytes)
    * Number of Blocks: 4
    * Placement Policy: Direct Mapped
    * Associativity: 1 (no other choice because it's direct mapped)
    * Block Replacement Policy: LRU
:::info
In Ripes, each block is always 4 bytes (this can be observed from the indexing breakdown). If we want a different block size, we need to change the number of blocks instead.
:::
#### Final State

#### Tasks
1. What combination of parameters is producing the hit rate you observe?
    * Because the `Step Size` in bytes (8 words = 32 bytes) is exactly equal to the `Cache Size = Block Size * Number of Blocks` (8 × 4 = 32 bytes), every access maps to the same cache block with a different tag, so no access ever hits.
2. What is our hit rate if we increase `Rep Count` arbitrarily? Why?
    * It will still be 0.
    * Because the step size equals the cache size, every access to the array maps to the same cache block (with a different tag each time), so every access misses. Increasing `Rep Count` doesn't help.
3. How could we modify one program parameter to increase our hit rate?
    * Change `Step Size` to 1. It will increase the hit rate to 0.5. Each block can hold 2 words, so when we access the array word by word, the first word of each block causes a cache miss and the second one causes a cache hit.
    * Change `Option` to 1. It will increase the hit rate to 0.75. In this case we perform 4 accesses within a single block: a read and a write on the 1st word, then a read and a write on the 2nd word, and the only one that misses (out of the 4 accesses) is the read of the 1st word.
    * Change `Array Size` to 32 or less. It will increase the hit rate to 0.875. Because the array is no larger than the cache, every byte of the array fits in the cache at once, so the remaining misses are all cold-start misses. We can't avoid them, but their impact becomes negligible as the total number of accesses grows. (A sketch of the access loop these parameters drive is given right after this list.)
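For reference, below is a minimal C model of the access loop I believe the test program runs; the real program is RISC-V assembly, so the exact code here is an assumption based on the parameters above and on the `array[index] = array[index] + 1;` statement quoted in Scenario 2.
```c
#include <stdint.h>

/* Hypothetical model of the cache-test loop:
 *   arraysize - a0, in bytes
 *   stepsize  - a1, in words
 *   repcount  - a2
 *   option    - a3: 0 = write only, 1 = read then write */
void cache_test(volatile int32_t *array, int arraysize, int stepsize,
                int repcount, int option)
{
    int num_words = arraysize / 4;
    for (int rep = 0; rep < repcount; ++rep)
        for (int index = 0; index < num_words; index += stepsize) {
            if (option == 0)
                array[index] = 0;                /* 1 access: write         */
            else
                array[index] = array[index] + 1; /* 2 accesses: read, write */
        }
}
```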
### Scenario 2
#### Parameters

* Program Parameters:
    * Array Size (a0): 256 (bytes)
    * Step Size (a1): 2 (words)
    * Rep Count (a2): 1
    * Option (a3): 1
* Cache Parameters:
    * Cache Levels: 1
    * Block Size: 16 (bytes)
    * Number of Blocks: 16
    * Placement Policy: N-Way Set Associative
    * Associativity: 4
    * Block Replacement Policy: LRU
#### Final State

#### Tasks
1. How many memory accesses are there **per iteration** of the inner loop?
    * With `Option` set to 1, each iteration executes `array[index] = array[index] + 1;`, which involves one read and one write. Thus there are 2 memory accesses per iteration.
2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
    * It misses once and then hits 3 times, in a pattern that repeats every 4 accesses.
    * Each block in the cache holds 4 words (4 elements of an integer array), and because `Step Size` is 2 words we only ever touch the 1st and 3rd word of each block. So each block receives 4 accesses: read 1st word, write 1st word, read 3rd word, write 3rd word. The first access misses and causes the cache to fetch the block from memory, so the remaining 3 accesses hit.
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
    * $Hit\ Rate = \frac{hit}{access\ per\ round} = \frac{3}{4} = 0.75$
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
    * The hit rate will approach 1.
    * Because the cache size equals the array size, every element of the array fits in the cache at the same time, and the only 16 misses are all **cold-start misses**: they happen only when the program has just started and the cache is still empty. Each rep performs $\frac{64\ words}{2\ words\ per\ step} \times 2 = 64$ accesses, so after $R$ reps $Hit\ Rate = \frac{64R - 16}{64R} \rightarrow 1$; since $Miss\ Rate = \frac{misses}{total\ accesses}$, the cold-start misses become negligible as the number of accesses grows.
5. Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn’t matter if Rep k is applied to element `arr[i]` before Rep k is applied to element `arr[i+1]`, etc.
* "...We want to exhibit more locality so that our caches can take advantage of our predictable behavior. SO, instead, we should try to access **one element** of the array at a time and apply all of the **function** to that **element** so we can be completely done with it before moving on, thereby keeping that **element** hot in the cache and not having to circle back to it later on!"
### Scenario 3
* Ripes doesn't provide a Level 2 cache, so I'm afraid I have to skip this question.
## Exercise 2 - Loop Ordering and Matrix Multiplication
:::info
I wrote an animated visualization for these 6 kinds of matrix multiplication. You can check it out [here](https://kaeteyaruyo.github.io/Computer-Architecture/hw4/matrixMultiply.html) and choose a loop order to see how the multiplication proceeds under that order.
:::
#### Execution Result
```
ijk: n = 1000, 1.268 Gflop/s
ikj: n = 1000, 0.200 Gflop/s
jik: n = 1000, 1.193 Gflop/s
jki: n = 1000, 4.774 Gflop/s
kij: n = 1000, 0.226 Gflop/s
kji: n = 1000, 4.239 Gflop/s
```
#### Tasks
* Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
    * The `jki` order performs best. In C, multidimensional arrays are stored in [row-major order](https://en.wikipedia.org/wiki/Row-_and_column-major_order), so a block in the cache holds elements that are adjacent in memory. As [the animation](https://kaeteyaruyo.github.io/Computer-Architecture/hw4/matrixMultiply.html) illustrates, with the `jki` order the innermost loop walks through each matrix along consecutive memory locations, which gives the best locality.
    * Sometimes the `kji` order outperforms the `jki` order, because it also walks through the A and C matrices in storage order. It does access the B matrix with a large stride, but it only moves to the next element of B once per full pass of the innermost loop (every 10 accesses in the animation, where 10 is the matrix dimension there), which is not very often. So this order still has rather good locality.
* Which ordering(s) perform the worst? Why?
    * The `ikj` order performs the worst, because it accesses every matrix column by column, which gives the worst locality: every time we move to the next element we need to bring a new block from memory into the cache, and after accessing that one element we throw the block away without making use of its other elements.
* How does the way we stride through the matrices with respect to the innermost loop affect performance?
    * The innermost loop directly determines how far apart consecutive memory accesses are. The smaller the stride of the innermost loop, the better the locality we get. So we should make sure the innermost loop traverses the matrices in the order they are laid out in memory (a sketch of the `jki` loop follows this list).
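For reference, here is a minimal sketch of the `jki` ordering. The flat `i + j*n` indexing is my assumption about how the benchmark stores its matrices; the point is simply that the innermost index `i` steps through `A` and `C` contiguously while the `B` element stays fixed for the whole inner loop.
```c
#include <stddef.h>

/* jki ordering (sketch): the innermost index i walks C and A with stride 1,
 * and B[k + j*n] is loop-invariant inside the i loop, so almost every access
 * hits in the cache. The i + j*n indexing is an assumption, not the
 * benchmark's verbatim code. */
void square_multiply_jki(size_t n, const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t k = 0; k < n; ++k) {
            double b = B[k + j * n];              /* constant in the i loop */
            for (size_t i = 0; i < n; ++i)
                C[i + j * n] += A[i + k * n] * b; /* stride-1 accesses      */
        }
}
```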
:::warning
I tried to convert the given C code into assembly code capable of running on Ripes. But Ripes doesn't seem to support some RISC-V assembler features (like `%hi()` and `%lo()`) or some standard-library functions (like `memcpy`, `memset`, etc.), so the program couldn't run with its original behavior. I am not sure whether the result would still be correct if I shrank the matrices, so I decided not to include the assembly version. If I manage to write the assembly version by hand, I will update this section.
:::
## Exercise 3 - Cache Blocking and Matrix Transposition
My implementation of `transpose_blocking()` is:
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src)
{
    // Walk the matrices tile by tile; (x, y) is the top-left corner of a tile.
    for (int x = 0; x < n; x += blocksize)
    {
        for (int y = 0; y < n; y += blocksize)
        {
            // Clamp the tile at the matrix edge when blocksize doesn't divide n.
            int x_steps = blocksize < n - x ? blocksize : n - x;
            int y_steps = blocksize < n - y ? blocksize : n - y;
            // Transpose one blocksize-by-blocksize tile; its working set is
            // small enough to stay in the cache while it is being reused.
            for (int i = 0; i < x_steps; ++i)
            {
                for (int j = 0; j < y_steps; ++j)
                {
                    dst[(x + i) + (y + j) * n] = src[(y + j) + (x + i) * n];
                }
            }
        }
    }
}
```
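For comparison, my understanding of the non-blocked baseline is the standard double loop below (an assumption about what `transpose_naive()` looks like; the starter code may differ in minor details).
```c
/* Sketch of the naive baseline. It accepts blocksize only for interface
 * compatibility and ignores it (as noted in Part 2 below). */
void transpose_naive(int n, int blocksize, int *dst, int *src)
{
    (void)blocksize;
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            /* The read from src has stride n, so for large n every iteration
             * touches a different cache block that is soon evicted. */
            dst[y + x * n] = src[x + y * n];
}
```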
### Part 1 - Changing Array Sizes
#### Requirement
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
#### Result
The result is plotted below (both the x and y axes are in log scale):

The difference is not very visible at this scale, so I plotted another one over the range `[100..10000..100]`.

:::warning
There is some oscillation in the plot. I think this is because this process is not the only one running on the machine: other processes may also modify the cache, causing additional cache misses and making the execution time longer.
:::
#### Tasks
* At what point does cache blocked version of transpose become faster than the non-cache blocked version?
    * When the matrix dimension is larger than 1000 (more precisely, above about 1600), the cache-blocked version becomes significantly faster than the naive one, and the ratio of the blocked version's speed to the naive version's keeps growing quickly as the matrix size increases.
* Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
    * Because if the matrix is small enough, it can be stored entirely in the cache. In that case no cache block replacement occurs during the transpose, so the two versions perform almost the same. But as the matrix grows past the cache size, the blocked version causes far fewer block replacements and therefore outperforms the naive version (a back-of-the-envelope size check follows this list).
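A back-of-the-envelope size check (the cache sizes below are my assumptions about a typical desktop CPU, since the lab doesn't specify the machine): at $n = 1000$ a single `int` matrix already occupies

$$1000 \times 1000 \times 4\ \text{bytes} \approx 3.8\ \text{MiB},$$

and the transpose touches two such matrices. That is far beyond a typical L1 cache (tens of KiB) and comparable to a last-level cache (a few MiB), which is consistent with the crossover showing up around $n \approx 1000$–$1600$ in the plot above.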
### Part 2 - Changing Blocksize
#### Requirement
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
#### Result
The result is plotted below (only the x axis is in log scale):

The result in range `[50..5000..50]` (x axis in normal scale):

:::warning
The naive version doesn't make use of the `blocksize` parameter, so its execution time should stay the same throughout this experiment. But it is a little higher at the beginning; I guess this is because of cold-start misses.
:::
#### Task
* How does performance change as blocksize increases? Why is this the case?
    * Performance drops as the blocksize grows.
    * When the tile gets larger, transposing a single tile touches more cache blocks than the cache can hold, so blocks are evicted before they are reused and the locality is lost. As the blocksize approaches the matrix dimension, the blocked version degenerates into the naive version, so the two curves should meet when the blocksize reaches 10000.