# Assignment4: Cache
contributed by <`徐郁淞`>
###### tags: `Computer Architecture`
[TOC]
## Requirements
- [ ] Following the instructions of [Lab 7](https://cs61c.org/su20/labs/lab07/), you shall do the exercises via [Cache Simulation](https://github.com/mortbopet/Ripes/wiki/Cache-Simulation) provided by [Ripes](https://github.com/mortbopet/Ripes).
- [ ] With [the given source code](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07), you shall prepare C programs which are eventually compiled to RISC-V assembly using [Compiler Explorer](https://godbolt.org/). Then, you must write your answers to the questions written in the Task section of each exercise specific to [Lab 7](https://cs61c.org/su20/labs/lab07/).
* For tips on how to convert [Compiler Explorer](https://godbolt.org/) generated RISC-V assembly to assembly compatible with the Ripes assembler, refer to the page [Adapting Compiler Explorer generated RISC V assembly code](https://github.com/mortbopet/Ripes/wiki/Adapting-Compiler-Explorer-generated-RISC-V-assembly-code).
* You definitely have to modify [the given source code](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07) to fit the expectation of [Ripes](https://github.com/mortbopet/Ripes).
- [ ] Write down your thoughts and progress in [HackMD notes](https://hackmd.io/s/features).
## Exercise 1 - A Couple of Memory Access Scenarios
**Task**: Simulate the following scenarios and record the final cache hit rates with the program in `cache.s`.
`cache.s`:
```
# PSEUDOCODE:
# int array[]; //Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) { // repeat repcount times
# /* Step through the selected array segment with the given step size. */
# for (index = 0; index < arraysize; index += stepsize) {
# if(option==0)
# array[index] = 0; // Option 0: One cache access - write
# else
# array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
# }
# }
```
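For reference, a direct C translation of this pseudocode (a sketch written for this note, not part of the lab starter; the parameter values mirror Scenario 1 below, and `ARRAY_BYTES` corresponds to the byte count passed in a0):
```c
#include <stdio.h>

#define ARRAY_BYTES 128  /* a0: array size in bytes            */
#define STEP        8    /* a1: step size in ints              */
#define REPCOUNT    4    /* a2: repetition count               */
#define OPTION      0    /* a3: 0 = write only, 1 = read+write */

int array[ARRAY_BYTES / 4];  /* sizeof(int) == 4 assumed, as in the lab */

int main(void) {
    int nelems = ARRAY_BYTES / 4;
    long accesses = 0;

    for (int k = 0; k < REPCOUNT; k++) {
        for (int index = 0; index < nelems; index += STEP) {
            if (OPTION == 0) {
                array[index] = 0;                 /* one access: write        */
                accesses += 1;
            } else {
                array[index] = array[index] + 1;  /* two accesses: read+write */
                accesses += 2;
            }
        }
    }
    printf("total memory accesses: %ld\n", accesses);
    return 0;
}
```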
### Scenario 1:
>Program Parameters: (set these by initializing the a registers in the code)
>* Array Size(a0): 128 (bytes)
>* Step Size(a1): 8
>* Rep Count(a2): 4
>* Option(a3): 0
>
> Cache Parameters: (set these in the Cache tab)
>* Cache Levels: 1
>* Block Size: 8
>* Number of Blocks: 4
>* Enable?: Should be green
>* Placement Policy: Direct Mapped
>* Associativity: 1 (Venus won’t let you change this, why?)
>* Block Replacement Policy: LRU
Set the program parameters in `cache.s`:
```
main: li a0, 128 # array size in BYTES (power of 2 < array size)
li a1, 8 # step size (power of 2 > 0)
li a2, 4 # rep count (int > 0)
li a3, 0 # 0 - option 0, 1 - option 1
```
A direct-mapped cache can also be called a "one-way set-associative" cache, which is why the simulator fixes the associativity at 1 for this placement policy.
The cache configuration in Ripes: *(screenshot omitted)*

### Tasks for Scenario 1
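Before answering, it helps to work out the address split from the given cache parameters (pure arithmetic, no measurement needed):
```
cache size = 4 blocks × 8 B                 = 32 B
address    = | tag | index: 2 bits | offset: 3 bits |
stride     = step × sizeof(int) = 8 × 4 B   = 32 B per access
→ every access has identical index bits (set 0) but a different tag
```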
**Q1.** What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
* Ans: The hit rate is 0%. The step size in bytes (8 ints × 4 B = 32 B) is exactly equal to the cache size in bytes (4 blocks × 8 B = 32 B), so every access maps to the same set with a different tag and misses.
**Q2.** What is our hit rate if we increase the Rep Count arbitrarily? Why?
* Ans: The hit rate stays at zero no matter how large the rep count grows, because every block brought into the cache is evicted before it is ever reused: the 32-byte stride leaves the index bits unchanged, so all four accessed addresses conflict in the single set they map to.
**Q3.** How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)
* Ans: Change the Step Size (a1) to `1`. Consecutive accesses then fall in the same 8-byte block (two ints per block), so every second access hits instead of each block being evicted before reuse; the hit rate rises from 0% to 50%. The simulator sketch below confirms this.
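To double-check Q1–Q3, here is a minimal direct-mapped cache simulator (a sketch written for this note, not part of the lab code) that replays the Scenario 1 access pattern; changing `STEP` from 8 to 1 moves the printed hit rate from 0/16 to 64/128 (50%):
```c
#include <stdio.h>

#define BLOCK_SIZE  8    /* bytes per block                         */
#define NUM_BLOCKS  4    /* direct mapped: one block per set        */
#define ARRAY_BYTES 128
#define STEP        8    /* in ints; set to 1 to raise the hit rate */
#define REPCOUNT    4

int main(void) {
    long tags[NUM_BLOCKS] = {0};
    int valid[NUM_BLOCKS] = {0};
    long hits = 0, accesses = 0;

    for (int k = 0; k < REPCOUNT; k++) {
        for (int index = 0; index * 4 < ARRAY_BYTES; index += STEP) {
            long addr  = index * 4;          /* byte address          */
            long block = addr / BLOCK_SIZE;  /* strip the offset bits */
            int  set   = block % NUM_BLOCKS; /* index bits            */
            long tag   = block / NUM_BLOCKS; /* remaining high bits   */
            accesses++;
            if (valid[set] && tags[set] == tag) {
                hits++;                      /* block still resident  */
            } else {
                valid[set] = 1;              /* miss: fill this set   */
                tags[set]  = tag;
            }
        }
    }
    printf("hit rate: %ld/%ld\n", hits, accesses);
    return 0;
}
```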
---
### Scenario 2:
>Program Parameters: (set these by initializing the a registers in the code)
>* Array Size (a0): 256 (bytes)
>* Step Size (a1): 2
>* Rep Count (a2): 1
>* Option (a3): 1
>
>Cache Parameters: (set these in the Cache tab)
>* Cache Levels: 1
>* Block Size: 16
>* Number of Blocks: 16
>* Enable?: Should be green
>* Placement Policy: N-Way Set Associative
>* Associativity: 4
>* Block Replacement Policy: LRU
The cache configuration in Ripes: *(screenshot omitted)*

**Q1.** How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
* Ans: Two. With Option 1, each inner-loop iteration both reads and writes `array[index]`, so there are two cache accesses per iteration.
**Q2.** What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
* Ans: The pattern is miss, hit, hit, hit. The block size is 16 bytes, so one block holds four ints; with a step size of 2 ints, two iterations (four accesses) touch each block. The first read of a block is a compulsory miss, while the write to the same element and the read and write of the next element in the block all hit. See the trace below.
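Tracing the first few accesses makes the pattern visible (byte addresses; each 16 B block holds four consecutive ints):
```
access        addr   block   result
read  a[0]       0     0     miss  (first touch of block 0)
write a[0]       0     0     hit
read  a[2]       8     0     hit
write a[2]       8     0     hit
read  a[4]      16     1     miss  (first touch of block 1)
write a[4]      16     1     hit
...
```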
**Q3.** This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
* Ans: Each group of four accesses has one miss and three hits, so the hit rate is 3/4 = 0.75.
**Q4.** Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
* Ans: The hit rate approaches 100%. The whole 256-byte array fits in the cache (16 blocks × 16 B = 256 B), so the only misses are the 16 compulsory misses of the first repetition; every access in later repetitions hits.
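Counting accesses confirms the limit: each repetition makes 32 iterations × 2 accesses = 64 accesses, and only the first repetition pays the 16 compulsory misses (one per 16 B block of the 256 B array):

$$\text{hit rate}(R) = \frac{64R - 16}{64R} \xrightarrow{R \to \infty} 100\%$$

(For R = 1 this gives 48/64 = 75%, matching Q3.)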
---
### Scenario 3:
>Program Parameters: (set these by initializing the a registers in the code)
>* Array Size (a0): 128 (bytes)
>* Step Size (a1): 1
>* Rep Count (a2): 1
>* Option (a3): 0
>
>Cache Parameters: (set these in the Cache tab)
>* Cache Levels: 2
>
>NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)
>* Block Size: 8
>* Number of Blocks: 8
>* Enable?: Should be green
>* Placement Policy: Direct Mapped
>* Associativity: 1
>* Block Replacement Policy: LRU
>
>NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)
>* Block Size: 8
>* Number of Blocks: 16
>* Enable?: Should be green
>* Placement Policy: Direct Mapped
>* Associativity: 1
>* Block Replacement Policy: LRU
**Q1.** What is the hit rate of our L1 cache? Our L2 cache? Overall?
* Ans:
Hit rate of the L1 cache: 50%
Hit rate of the L2 cache: 0%
Overall: 50% (16 hits out of 32 total accesses)
**Q2.** How many accesses do we have to the L1 cache total? How many of them are misses?
* Ans:
Total: 32 (the 128-byte array is written one int at a time, and 128 / 4 = 32)
Misses: 16 (each 8-byte L1 block holds two ints, so every block gives one miss followed by one hit)
**Q3.** How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
* Ans:
Total: 16
The L2 cache is accessed only when the L1 cache misses: on an L1 miss, the request is forwarded down to L2 to find the data. The number of L2 accesses is therefore exactly the number of L1 misses.
**Q4.** What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
* Ans: Increase the Rep Count (a2). The L2 cache (16 blocks × 8 B = 128 B) can hold the entire 128-byte array, so after the compulsory misses of the first repetition every later L2 access hits. The L1 cache (8 blocks × 8 B = 64 B) holds only half of the array, so its miss/hit pattern, and therefore its 50% hit rate, repeats unchanged on every pass.
**Q5.** What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
* Ans: With Rep Count = 1 a block is never revisited after its two ints are written, so increasing the number of L1 blocks changes neither hit rate. Increasing the L1 block size does raise the L1 hit rate, since more consecutive ints share one block (e.g., 16-byte blocks give a miss, hit, hit, hit pattern and a 75% L1 hit rate) and fewer misses reach L2; the L2 hit rate stays at 0% because its misses are all compulsory.
## Exercise 2 - Loop Ordering and Matrix Multiplication
```
~/su20-lab-starter/lab07/ [master+] make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 1000, 1.680 Gflop/s
ikj: n = 1000, 0.254 Gflop/s
jik: n = 1000, 1.740 Gflop/s
jki: n = 1000, 11.746 Gflop/s
kij: n = 1000, 0.257 Gflop/s
kji: n = 1000, 9.872 Gflop/s
```
### Tasks
**Q1.** Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
* Ans:
`jki: n = 1000, 11.746 Gflop/s` is the best (with `kji` close behind).
In the `jki` order, the innermost loop:
* moves through B with stride 0
* moves through A with stride 1
* moves through C with stride 1
The innermost-loop strides in `jki` order are the smallest, so it has the best spatial locality: every element of a fetched cache block is used before the block is evicted.
**Q2.** Which ordering(s) perform the worst? Why?
* Ans:
`ikj: n = 1000, 0.254 Gflop/s` is the worst (`kij` is nearly as bad).
In the `ikj` order, the innermost loop:
* moves through B with stride n
* moves through A with stride 0
* moves through C with stride n
The innermost-loop strides in `ikj` order are the largest, so cache blocks are evicted before their remaining elements can be reused.
**Q3.** How does the way we stride through the matrices with respect to the innermost loop affect performance?
* Ans: The innermost-loop stride dominates performance. With stride 1, consecutive accesses fall in the same cache block, so spatial locality keeps the hit rate high. With stride n, each access lands in a different block, so nearly every access misses and the data must be fetched from the L2 cache or main memory, which costs far more cycles. The `jki` ordering is sketched in C below.
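For reference, a sketch of the `jki` ordering in C. It assumes the column-major indexing used by the lab's `matrixMultiply.c` (element (i, j) stored at offset `i + j * n`); the function name here is illustrative, not the lab's:
```c
/* jki ordering: the innermost loop walks one column of A and of C. */
void square_dgemm_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j * n];               /* fixed: stride 0  */
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * b;  /* stride-1 columns */
        }
}
```
Because both stride-1 walks run down columns that are contiguous in memory, each cache block fetched from A and C is fully consumed before eviction.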
## Exercise 3 - Cache Blocking and Matrix Transposition
Implement cache blocking in the `transpose_blocking()` function:
```c=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    /* Walk the matrix one blocksize-by-blocksize tile at a time. */
    for (int block2 = 0; block2 < n; block2 += blocksize) {
        for (int block1 = 0; block1 < n; block1 += blocksize) {
            /* Transpose one tile; the "&& x < n" / "&& y < n" guards
             * handle the ragged edge when n is not a multiple of blocksize. */
            for (int x = block2; x < block2 + blocksize && x < n; x++) {
                for (int y = block1; y < block1 + blocksize && y < n; y++) {
                    dst[x + y * n] = src[y + x * n];
                }
            }
        }
    }
}
```
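A quick correctness check for the blocked transpose (a self-contained sketch; the naive reference below is written from the transpose definition and may differ cosmetically from the lab's `transpose.c`). It deliberately uses an `n` that is not a multiple of `blocksize`:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void transpose_blocking(int n, int blocksize, int *dst, int *src); /* above */

/* Naive reference: scatter src row by row into dst. */
static void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}

int main(void) {
    int n = 103, blocksize = 20;  /* 103 is not a multiple of 20 */
    int *src = malloc(n * n * sizeof(int));
    int *d1  = malloc(n * n * sizeof(int));
    int *d2  = malloc(n * n * sizeof(int));
    for (int i = 0; i < n * n; i++) src[i] = i;

    transpose_naive(n, d1, src);
    transpose_blocking(n, blocksize, d2, src);
    printf("%s\n", memcmp(d1, d2, n * n * sizeof(int)) ? "MISMATCH" : "OK");

    free(src); free(d1); free(d2);
    return 0;
}
```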
### Part1 - Changing Array Sizes
**Task**
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
**Q1:** Record your results
* n = 100 :
```
Testing naive transpose: 0.008 milliseconds
Testing transpose with blocking: 0.015 milliseconds
```
* n = 1000 :
```
Testing naive transpose: 2.457 milliseconds
Testing transpose with blocking: 1.105 milliseconds
```
* n = 2000 :
```
Testing naive transpose: 17.037 milliseconds
Testing transpose with blocking: 8.076 milliseconds
```
* n = 5000 :
```
Testing naive transpose: 139.944 milliseconds
Testing transpose with blocking: 35.341 milliseconds
```
* n = 10000 :
```
Testing naive transpose: 720.717 milliseconds
Testing transpose with blocking: 146.127 milliseconds
```
**Q2:** At what point does cache blocked version of transpose become faster than the non-cache blocked version?
* Ans: Somewhere between `n = 100` and `n = 1000` (with `blocksize = 20`): at `n = 100` the naive version is still faster, and from `n = 1000` onward the blocked version wins.
**Q3:** Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
* Ans: Cache blocking increases temporal (and spatial) locality, which only pays off once the matrix is too large to fit in the cache. For small `n` the whole matrix stays cache-resident, so the naive code already hits on almost every access and blocking merely adds loop overhead; once the matrix outgrows the cache, the naive transpose's stride-`n` accesses miss constantly while blocking keeps its working set resident, making the program run faster.
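A rough working-set calculation shows why (the cache sizes here are illustrative assumptions, not measured on the test machine):
```
n =   100 : 2 matrices ×   100² × 4 B ≈  80 KB  → fits in a typical L2/L3; naive is already fast
n =  1000 : 2 matrices ×  1000² × 4 B ≈   8 MB  → spills past the caches; blocking starts to win
n = 10000 : 2 matrices × 10000² × 4 B ≈ 800 MB  → far beyond any cache; blocking wins by ~5×
```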
### Part2 - Changing Blocksize
**Task**
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
**Q1.** Record your results
* blocksize = 50
```
Testing naive transpose: 705.651 milliseconds
Testing transpose with blocking: 187.858 milliseconds
```
* blocksize = 100
```
Testing naive transpose: 704.197 milliseconds
Testing transpose with blocking: 141.941 milliseconds
```
* blocksize = 500
```
Testing naive transpose: 690.991 milliseconds
Testing transpose with blocking: 169.921 milliseconds
```
* blocksize = 1000
```
Testing naive transpose: 703.905 milliseconds
Testing transpose with blocking: 196.639 milliseconds
```
* blocksize = 5000
```
Testing naive transpose: 703.286 milliseconds
Testing transpose with blocking: 661.781 milliseconds
```
**Q2.** How does performance change as blocksize increases? why is this the case?
* Ans: Speedup increases with blocksize up to about `100`, then decreases as blocksize grows further.
Once the blocksize is too big, the two tiles being worked on (one of `src`, one of `dst`) no longer fit in the cache at the same time, so the memory accesses within a tile can no longer take advantage of spatial and temporal locality; at blocksize = 5000 the blocked version is barely faster than the naive one.
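The tile working set grows quadratically with blocksize, which matches the measured optimum near 100 (again assuming illustrative cache sizes rather than measured ones):
```
blocksize =   50 : 2 tiles ×   50² × 4 B ≈  20 KB   fits comfortably
blocksize =  100 : 2 tiles ×  100² × 4 B ≈  80 KB   near the sweet spot
blocksize =  500 : 2 tiles ×  500² × 4 B ≈   2 MB   starts to spill
blocksize = 5000 : 2 tiles × 5000² × 4 B ≈ 200 MB   each tile is a quarter of the matrix; blocking no longer helps
```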