# Assignments4: Cache
###### tags: `108-1_RISC-V`
* This assignment follows [Assignment4](https://hackmd.io/@sysprog/2020-arch-homework4) and [CS61C Lab 7](https://cs61c.org/su20/labs/lab07/).
* This exercise shows how cache configuration affects performance. The [sample code](https://github.com/61c-teach/su20-lab-starter/blob/master/lab07/cache.s) is provided by CS61C.
## Exercise 1 - A Couple of Memory Access Scenarios
### Scenario 1:
#### Program Parameters: (set these by initializing the a registers in the code)
* Array Size(a0): 128(bytes)
* Step Size(a1): 8
* Rep Count(a2): 4
* Option(a3):0
#### Cache Parameters: (set these in the Cache tab)
* Cache Levels: 1
* Block Size: 8
* Number of Blocks: 4
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1 (Venus won’t let you change this, why?)
* Block Replacement Policy: LRU
#### Task
1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
2. What is our hit rate if we increase Rep Count arbitrarily? Why?
3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)
#### code preview
:::spoiler
```
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
# int array[]; //Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) { // repeat repcount times
# /* Step through the selected array segment with the given step size. */
# for (index = 0; index < arraysize; index += stepsize) {
# if(option==0)
# array[index] = 0; // Option 0: One cache access - write
# else
# array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
# }
# }
.data
array: .word 2048 # max array size specified in BYTES (DO NOT CHANGE)
.text
##################################################################################################
# You MAY change the code below this section
main: li a0, 128 # array size in BYTES (power of 2 < array size)
li a1, 8 # step size (power of 2 > 0)
li a2, 4 # rep count (int > 0)
li a3, 0 # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
jal accessWords # lw/sw
#jal accessBytes # lb/sb
li a0,10 # exit
ecall
# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)
accessWords:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
slli t1, a1, 2 # multiply stepsize by 4 because WORDS
wordLoop:
beq a3, zero, wordZero
lw t0, 0(s0) # array[index/4]++
addi t0, t0, 1
sw t0, 0(s0)
j wordCheck
wordZero:
sw zero, 0(s0) # array[index/4] = 0
wordCheck:
add s0, s0, t1 # increment ptr
blt s0, s1, wordLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessWords # outer loop done?
jr ra
accessBytes:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
byteLoop:
beq a3, zero, byteZero
lbu t0, 0(s0) # array[index]++
addi t0, t0, 1
sb t0, 0(s0)
j byteCheck
byteZero:
sb zero, 0(s0) # array[index] = 0
byteCheck:
add s0, s0, a1 # increment ptr
blt s0, s1, byteLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessBytes # outer loop done?
jr ra
```
:::
#### cache configuration
The cache parameters are configured as shown below:

and the resulting cache state after the run is shown below:

#### Answer of Tasks
1. What combination of parameters is producing the hit rate you observe?

* hit rate: 12/16 = 0.75
* Because the array size in bytes is exactly equal to the cache size in bytes, the whole array fits in the cache: the first Rep misses on all 4 accesses and fills the cache, so the remaining 3 Reps hit on every access.
2. What is our hit rate if we increase Rep Count arbitrarily? Why?
* The hit rate rises toward 100%.
* The first Rep already brings the whole array into the cache, so every access in later Reps hits; as Rep Count grows, the 4 compulsory misses become a negligible fraction of all accesses.
3. How could we modify one program parameter to increase our hit rate?
* Change Step Size to 1. A smaller stride touches more of each cache block that a miss brings in, so block utilization rises and with it the hit rate.
### Scenario 2:
#### Program Parameters: (set these by initializing the a registers in the code)
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
#### Cache Parameters: (set these in the Cache tab)
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
#### Tasks
1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following:
#### Code modified part
```
# You MAY change the code below this section
main: li a0, 256 # array size in BYTES (power of 2 < array size)
li a1, 2 # step size (power of 2 > 0)
li a2, 1 # rep count (int > 0)
li a3, 1 # 0 - option 0, 1 - option 1
# You MAY change the code above this section
```
#### cache configuration
The cache parameters are configured as shown below:

and the resulting cache state after the run is shown below:

#### Answer of Tasks
1. How many memory accesses are there per iteration of the inner loop?
Because Option is 1, each inner-loop iteration performs both a `lw` and a `sw`, so there are two memory accesses per iteration.
2. What is the repeating hit/miss pattern? WHY?


The repeating pattern is 1 miss followed by 15 hits: only the first access to each block misses.
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
:::success
* Each miss loads an entire block into the cache, so every subsequent access to that block hits.
* With Step Size 2, 8 elements of each block are accessed, each with a read followed by a write (16 accesses per block); only the first read of each block misses, so the pattern is 1 miss and 15 hits, i.e., a hit rate of 15/16.
:::
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?
:::success
* Increasing Rep Count drives the hit rate toward 100%.
* During the first pass of the inner loop, the whole array is brought into the cache (the 256-byte array easily fits). Every access in later Reps therefore finds its data in the cache, so the hit rate keeps rising toward 1 as Rep Count grows.
:::
## Exercise 2 - Loop Ordering and Matrix Multiplication
### Main focus
* The loop order does not affect the correctness of the matrix multiplication; it only affects performance.
* Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality.
### Code slice
```c=
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
The unit “Gflop/s” reads “giga-floating-point-operations per second.” __THE BIGGER THE NUMBER, THE FASTER IT IS RUNNING!__
### Tasks
* Record your results in answers.txt
```shell=
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 1000, 0.336 Gflop/s
ikj: n = 1000, 0.090 Gflop/s
jik: n = 1000, 0.272 Gflop/s
jki: n = 1000, 0.903 Gflop/s
kij: n = 1000, 0.085 Gflop/s
kji: n = 1000, 0.630 Gflop/s
```
* Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
1. Ordering jki performs best.
2. Observe the log below
```shell
C_index[0] = A_index[0] * B_index[0]
C_index[1] = A_index[1] * B_index[0]
C_index[2] = A_index[2] * B_index[0]
C_index[3] = A_index[3] * B_index[0]
C_index[4] = A_index[4] * B_index[0]
C_index[5] = A_index[5] * B_index[0]
C_index[6] = A_index[6] * B_index[0]
C_index[7] = A_index[7] * B_index[0]
C_index[8] = A_index[8] * B_index[0]
C_index[9] = A_index[9] * B_index[0]
```
* Because the innermost loop increments `i`, consecutive iterations access adjacent elements of `C` and `A` (unit stride) while reusing the same element of `B`. This spatial and temporal locality lets most accesses hit in the cache, reducing memory traffic and speeding up the whole computation.
* Which ordering(s) perform the worst? Why?
1. Ordering kij performs the worst.
2. Observe the log below
```shell=
C_index[0] = A_index[0] * B_index[0]
C_index[10] = A_index[0] * B_index[10]
C_index[20] = A_index[0] * B_index[20]
C_index[30] = A_index[0] * B_index[30]
C_index[40] = A_index[0] * B_index[40]
C_index[50] = A_index[0] * B_index[50]
C_index[60] = A_index[0] * B_index[60]
C_index[70] = A_index[0] * B_index[70]
C_index[80] = A_index[0] * B_index[80]
C_index[90] = A_index[0] * B_index[90]
```
* Contrary to the previous case, consecutive iterations here access elements of `C` and `B` that are `n` entries apart, so almost every access misses in the cache and goes to memory, which greatly reduces the Gflop/s.
* How does the way we stride through the matrices with respect to the innermost loop affect performance?
* The smaller the stride with which the innermost loop walks through the matrices, the better the spatial and temporal locality the cache can exploit, and the better the performance; a large innermost stride has the opposite effect.
## Exercise 3 - Cache Blocking and Matrix Transposition
### Code
:::spoiler
```cpp=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int i, j, x, y, block_Edge, remainBlock;
    remainBlock = n % blocksize;   /* size of the partial block on the edge */
    block_Edge = n / blocksize;    /* number of complete blocks per side */
    for (i = 0; i <= block_Edge; i++) {
        for (j = 0; j <= block_Edge; j++) {
            if (j == block_Edge && i == block_Edge) {
                /* bottom-right corner: partial block in both dimensions */
                for (x = 0; x < remainBlock; x++)
                    for (y = 0; y < remainBlock; y++)
                        dst[(x + blocksize * i) + n * (y + blocksize * j)] =
                            src[(y + blocksize * j) + n * (x + blocksize * i)];
            } else if (j == block_Edge) {
                /* right edge: partial block in the j dimension */
                for (x = 0; x < blocksize; x++)
                    for (y = 0; y < remainBlock; y++)
                        dst[(x + blocksize * i) + n * (y + blocksize * j)] =
                            src[(y + blocksize * j) + n * (x + blocksize * i)];
            } else if (i == block_Edge) {
                /* bottom edge: partial block in the i dimension */
                for (x = 0; x < remainBlock; x++)
                    for (y = 0; y < blocksize; y++)
                        dst[(x + blocksize * i) + n * (y + blocksize * j)] =
                            src[(y + blocksize * j) + n * (x + blocksize * i)];
            } else {
                /* interior: full blocksize x blocksize block */
                for (x = 0; x < blocksize; x++)
                    for (y = 0; y < blocksize; y++)
                        dst[(x + blocksize * i) + n * (y + blocksize * j)] =
                            src[(y + blocksize * j) + n * (x + blocksize * i)];
            }
        }
    }
}
```
:::
### Matrix Transposition

### Cache Blocking

### Main focus
* We can improve the amount of data reuse in the caches by implementing a technique called cache blocking.
* Cache blocking attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses.
### Part 1 - Changing Array Sizes
#### Task
:::warning
Fix the **blocksize** to be **20**, and run your code with **n equal to 100, 1000, 2000, 5000, and 10000**.
:::
:::success
* Unit: milliseconds
|n|100|1000|2000|5000|10000|
|---|---|---|---|---|---|
| __naive__ |0.012|2.079|21.517|187.948|957.832|
| __transpose__|0.009|1.973|8.255|54.948|306.126|
:::
* Record your results in answers.txt
* At what point does cache blocked version of transpose become faster than the non-cache blocked version?
* ***ANS:*** For **every tested n**, the cache-blocked transpose is **faster** than the non-cache-blocked version; the advantage grows sharply once n ≥ 2000.
* Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
* ***ANS:*** For a small matrix, the whole working set already fits in the cache, so the naive version rarely misses and blocking has little to offer. As the matrix grows, the naive transpose's column accesses are n elements apart, so cache blocks are evicted before they are reused and the hit rate drops; only then does keeping each tile inside the cache pay off.
### Part 2 - Changing Blocksize
#### Task
:::warning
Fix **n** to be **10000**, and run your code with **blocksize equal to 50, 100, 500, 1000, 5000**.
:::
:::success
* Unit: milliseconds
|blocksize|50|100|500|1000|5000|
|---|---|---|---|---|---|
| __naive__ |946.487|938.279|941.322|940.372|946.53 |
| __transpose__|246.727|213.69 |263.115|284.604|923.413|
:::
* Record your results in answers.txt
* How does performance change as blocksize increases? Why is this the case?
* Performance improves up to a point (best here at blocksize 100) and then degrades. Once the blocksize is so large that a source tile plus a destination tile no longer fit in the cache, the tiles themselves start missing; at blocksize 5000 the blocked version is barely faster than the naive one.