# Lab4: Cache
###### tags: `Computer Architecture 2020`
## Exercise 1 - A Couple of Memory Access Scenarios
The purpose of this part is to get acquainted with cache visualization tools. First, take the [sample code](https://github.com/61c-teach/su20-lab-starter/blob/master/lab07/cache.s) provided by CS61C.
### Scenario 1
There are two groups of parameters that need to be changed:
* Program Parameters
* Cache Parameters
In Scenario 1, the program parameters need to be set as:
* Array Size (a0): 128 (bytes)
* Step Size (a1): 8
* Rep Count (a2): 4
* Option (a3): 0
The corresponding code is:
```
main:   li a0, 128  # array size in BYTES (power of 2 < array size)
        li a1, 8    # step size (power of 2 > 0)
        li a2, 4    # rep count (int > 0)
        li a3, 0    # 0 - option 0, 1 - option 1
```
Second, we need to change cache parameters in Ripes:
* Cache Levels: 1 (nothing to change; Ripes models the L1 cache)
* Block Size: 8
* Number of Blocks: 4
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
Because the placement policy is direct mapped, the cache is one-way associative and each address maps to exactly one cache line. Each block shown in the Ripes diagram is one 4-byte word (the byte offset is 2 bits), so with a block size of 8 bytes each cache line contains 2 of these words.
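As a sanity check, the address breakdown follows directly from these parameters: offset $= \log_2 8 = 3$ bits (2 for the byte, 1 for the word), index $= \log_2 4 = 2$ bits, and tag $= 32 - 3 - 2 = 27$ bits.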
The following is a reference screenshot of the settings and result:

### Task for scenario 1
Q1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
A1. As the cache indexing breakdown shows, an address consists of 2 bits of byte offset, 1 bit of block offset, 2 bits of index, and 27 bits of tag. The hit rate in Scenario 1 is zero because the step size in bytes ($8 \times 4 = 32$) is exactly equal to the cache size in bytes ($4 \times 8 = 32$): every access maps to the same cache line with a different tag.
Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?
A2. In the experiment, the hit rate stays at zero no matter how large Rep Count grows. The data in the cache is always evicted before it can be reused: with a step size of 8 words the low-order address bits never change, so every access maps to the same cache line.
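To see the eviction concretely: the accessed byte addresses are $0, 32, 64, 96$ (step size of $8$ words $\times$ $4$ bytes). Dropping the 3 offset bits, the index bits (bits 3 and 4) are $00$ for all four addresses, so they all map to cache line 0 with different tags, and each access evicts the previous one.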
Q3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)
A3. Change `step size` to 1 and the hit rate increases to $0.5$. Each miss loads a whole 8-byte block (two words), so with a step size of 1 every second access lands in the block that the previous miss just loaded.
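Worked out: with step size 1, the accesses are words $0, 1, 2, \dots, 31$; each block holds two words, so the pattern is miss, hit, miss, hit, $\dots$, giving a hit rate of $16/32 = 0.5$.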
### Scenario 2
Program Parameters:
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
Cache Parameters: (set these in the Cache tab)
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
Given the block size (16 bytes) and the number of blocks (16), each block holds four 4-byte words (2 bits of word offset), and with 4-way associativity there are $16/4 = 4$ sets (2 bits of index).
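As arithmetic: offset $= \log_2 16 = 4$ bits (2 byte + 2 word), index $= \log_2 4 = 2$ bits, tag $= 32 - 4 - 2 = 26$ bits.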
After configuring the program parameters as well, the reference output is:

### Task for scenario 2
Q1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
A1.
Two; the starter code already mentions it:
```
array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
```
Q2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
A2. Each block holds four words and the step size is 2, so each block covers two of the accessed elements. The first read of a block misses and loads the whole block; the write to that same word and both the read and the write of the next element in the block then hit. The pattern is therefore miss, hit, hit, hit, repeating every 4 accesses.
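Concretely, the first four accesses are: read word 0 (miss, loads words 0 through 3), write word 0 (hit), read word 2 (hit), write word 2 (hit); the same pattern then repeats starting at word 4.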
Q3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
A3. Every two inner iterations load one new block into the cache and generate four memory accesses. The first access is guaranteed to miss and the remaining three are guaranteed to hit, so the hit rate is $3/4 = 0.75$.
Q4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
A4. As Rep Count grows, the hit rate increases toward 1. In the experiment, a Rep Count of 1000 yields a hit rate of $0.9998$. The cache is big enough to contain the entire array, so no data is ever evicted: apart from the compulsory misses of the first pass, every later access hits.
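The measured value checks out: each rep makes $2 \times 32 = 64$ accesses (32 elements touched, read and write each), so 1000 reps make $64{,}000$ accesses; the only misses are the $256 / 16 = 16$ compulsory misses of the first pass, giving $1 - 16/64000 \approx 0.9998$.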
Q5. Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.
A5. Because Rep Count is 1 in this scenario, the problem is simpler: it becomes how to apply a function to every element of an array that is larger than the cache. We should access the array one cache-block-sized chunk at a time (four consecutive elements, brought in by a single miss) and apply all of the work to that chunk so we are completely done with it before moving on, thereby keeping the chunk hot in the cache and never having to circle back to it later, as in the sketch below.
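A minimal C sketch of this restructuring (the function name `process` and the `CHUNK` constant are illustrative, not part of the lab code):
```c
#define CHUNK 4 /* elements per cache block: 16-byte block / 4-byte int */

/* Hypothetical sketch: apply every rep to one cache-block-sized chunk
 * before moving on, so each block is fetched from memory only once. */
void process(int *arr, int len, int reps) {
    for (int i = 0; i < len; i += CHUNK)        /* one cache block at a time */
        for (int r = 0; r < reps; r++)          /* all reps while it is hot */
            for (int k = i; k < i + CHUNK && k < len; k++)
                arr[k] += 1;                    /* option 1: read AND write */
}
```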
### Scenario 3
Program parameters
* Array Size (a0): 128 (bytes)
* Step Size (a1): 1
* Rep Count (a2): 1
* Option (a3): 0
Cache Parameters (Cache Levels: 2)

L1 cache:
* Block Size: 8
* Number of Blocks: 8
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
L2 cache:
* Block Size: 8
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
### Task for scenario 3
Q1. What is the hit rate of our L1 cache? Our L2 cache? Overall?
A1.
Hit rate of L1 cache: $0\%$
Hit rate of L2 cache: $50\%$
Overall: $25\%$
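One consistent reading of the overall figure (an assumption about how it is counted) is that it pools hits across both levels over all cache accesses: $(0 + 16)/(32 + 32) = 25\%$.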
Q2. How many accesses do we have to the L1 cache total? How many of them are misses?
A2. 32 accesses and 32 misses.
Q3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
A3. (1) 32 accesses. (2) The 32 accesses to the L2 cache equal the 32 misses in the L1 cache: only when an access misses in L1 does it go on to look up the data in L2.
Q4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
A4. Increase Rep Count to 2; the L2 hit rate rises to $75\%$ while the L1 hit rate stays the same. The L2 cache is big enough to accommodate the entire array, so the second pass hits in L2 on every access.
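This is consistent with the counts above: with Rep Count 2 the L2 cache sees $2 \times 32 = 64$ accesses; the first pass hits on 16 of 32 as before, and the second pass, with the whole 128-byte array resident in the 128-byte L2, hits on all 32, so the L2 hit rate is $(16 + 32)/64 = 75\%$.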
Q5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
A5. As the number of L1 blocks grows, the L1 hit rate increases and the L2 hit rate decreases, since L1 absorbs accesses that previously hit in L2. Increasing the L1 block size likewise raises the L1 hit rate, because each miss brings more consecutive words into the cache.
## Exercise 2 - Loop Ordering and Matrix Multiplication
The purpose of this part is to show how important locality is, especially for performance.
### Tasks
Q1. Record your results in answers.txt
A1.
```
ijk: n = 1000, 1.357 Gflop/s
ikj: n = 1000, 0.208 Gflop/s
jik: n = 1000, 1.170 Gflop/s
jki: n = 1000, 5.848 Gflop/s
kij: n = 1000, 0.213 Gflop/s
kji: n = 1000, 4.570 Gflop/s
```
Q2. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
A2. (1) jki, with kji close behind. (2) Beyond the raw measurements, the `jki` order has the best spatial locality: with `i` as the innermost loop index, consecutive iterations access consecutive elements of `C` and `A` (stride 1), while the element of `B` being read does not change at all inside the inner loop.
Q3. Which ordering(s) perform the worst? Why?
A3. (1) ikj, with kij nearly tied; we name the worst case in this experiment. (2) With `j` as the innermost loop index, consecutive iterations access `C` and `B` with stride `n` (their indices contain `j*n`), so there is essentially no spatial locality in the inner loop.
Q4. How does the way we stride through the matrices with respect to the innermost loop affect performance?
A4. The smaller the stride through the matrices in the innermost loop, the better the spatial locality, and hence the better the performance.
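For concreteness, a sketch of the two extreme orderings, assuming the lab's column-major kernel `C[i+j*n] += A[i+k*n] * B[k+j*n]`:
```c
/* jki (fastest here): the innermost index i walks C and A with
 * stride 1, and the B element is invariant inside the i loop. */
void matmul_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}

/* ikj (slowest here): the innermost index j walks C and B with
 * stride n, touching addresses 8*n bytes apart on every iteration. */
void matmul_ikj(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}
```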
## Exercise 3 - Cache Blocking and Matrix Transposition
This part uses the advantages of locality to improve performance through cache blocking.
First of all, we need to implement the blocked transpose:
```c
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    /* Outer loops walk the matrix one blocksize-by-blocksize tile at a time. */
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            /* Inner loops transpose one tile; the "&& k < n" and "&& l < n"
             * guards handle tiles that stick out past the matrix edge. */
            for (int k = i; k < i + blocksize && k < n; ++k)
                for (int l = j; l < j + blocksize && l < n; ++l)
                    dst[k + l * n] = src[l + k * n];
}
```
Note: we may not assume that the matrix width `n` is a multiple of the block size. As a result, the inner loops must check the boundary conditions (`k < n` and `l < n`).
Execution example:
```
./transpose <n> <blocksize>
```
### Part 1 - Changing Array Sizes
In this part, fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
Q1. Record your results in answers.txt
A1.
n = 100:
```
Testing naive transpose: 0.024 milliseconds
Testing transpose with blocking: 0.031 milliseconds
```
n = 1000:
```
Testing naive transpose: 3.842 milliseconds
Testing transpose with blocking: 1.388 milliseconds
```
n = 2000:
```
Testing naive transpose: 21.414 milliseconds
Testing transpose with blocking: 6.239 milliseconds
```
n = 5000:
```
Testing naive transpose: 203.536 milliseconds
Testing transpose with blocking: 44.091 milliseconds
```
n = 10000:
```
Testing naive transpose: 1082.8 milliseconds
Testing transpose with blocking: 251.057 milliseconds
```
Q2. At what point does cache blocked version of transpose become faster than the non-cache blocked version?
A2. Starting at n = 1000; at n = 100 the naive version is still faster.
Q3. Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
A3. When the matrix is small it already fits in the cache, so the naive transpose suffers few misses anyway; the locality benefit of blocking is then outweighed by its extra loop overhead (four loops vs. three).
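Rough arithmetic supports this (assuming 4-byte ints and typical cache sizes): at $n = 100$ the source and destination matrices together occupy $2 \times 100^2 \times 4 \approx 80$ KB, which fits in the nearer cache levels, while at $n = 1000$ they occupy about $8$ MB, which does not, and blocking starts to pay off.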
### Part 2 - Changing Blocksize
In this part, fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
Q1. Record your results in answers.txt
A1.
blocksize = 50:
```
Testing naive transpose: 1086.61 milliseconds
Testing transpose with blocking: 224.224 milliseconds
```
blocksize = 100:
```
Testing naive transpose: 1056.79 milliseconds
Testing transpose with blocking: 276.932 milliseconds
```
blocksize = 500:
```
Testing naive transpose: 1081.4 milliseconds
Testing transpose with blocking: 243.465 milliseconds
```
blocksize = 1000:
```
Testing naive transpose: 1241.58 milliseconds
Testing transpose with blocking: 277.814 milliseconds
```
blocksize = 5000:
```
Testing naive transpose: 1199.85 milliseconds
Testing transpose with blocking: 999.678 milliseconds
```
Q2. How does performance change as blocksize increases? Why is this the case?
A2. Performance stays roughly flat for small block sizes, but once the block size grows too large, the benefit brought by locality fades because a tile no longer fits in the cache. In the extreme, a block approaches the size of the matrix itself, and there is no difference between the naive transpose and the blocked transpose.
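Rough working-set arithmetic (assuming 4-byte ints): transposing one tile touches a $b \times b$ tile of `src` and one of `dst`, about $2b^2 \times 4$ bytes in total, i.e. roughly $20$ KB for $b = 50$, $8$ MB for $b = 1000$, and $200$ MB for $b = 5000$; once the pair of tiles no longer fits in cache, the blocked transpose degenerates toward the naive one.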