# Assignment 4: Cache
contributed by < `Cheng Yu` >
###### tags: `Computer Architecture`
## Review - Hit and Miss Policies
### Hit policy
* **Write-back** ==means that on a write hit, data is written to the cache only, and when this write happens, the dirty bit for the block that was written becomes 1.== Writing to the cache is fast, so write latency in write-back caches is usually quite small. However, when a block is evicted from a write-back cache, if the dirty bit is 1, memory must be updated with the contents of that block, as it contains changes that are not yet reflected in memory. This makes write-back caches more difficult to implement in hardware.
* **Write-through** ==means that on a write hit, data is written to both the cache and main memory.== Writing to the cache is fast, but writing to main memory is slow; this makes write latency in write-through caches slower than in a write-back cache. However, write-through caches mean simpler hardware, since we can assume in write-through caches that memory always has the most up-to-date data.
* **Write-around** ==means that in every situation, data is written to main memory only; if we have the block we’re writing in the cache, the valid bit of the block is changed to invalid.== Essentially, there is no such thing as a write hit in a write-around cache; a write “hit” does the same thing as a write miss.
### Miss policy
* **Write-allocate** ==means that on a write miss, you pull the block you missed on into the cache.== For write-back, write-allocate caches, this can mean that memory is never written to directly; instead, writes are always to the cache and memory is updated upon eviction.
* **No write-allocate** ==means that on a write miss, you do not pull the block you missed on into the cache.== Only memory is updated.
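To tie the hit and miss policies together, here is a minimal, self-contained C sketch of a single cache block under these write policies. The single-block model and all names are my own simplification, not the internals of any particular simulator:
```c
#include <string.h>

#define BLOCK_BYTES 8

// One cache block backed by one "main memory" block (a toy model).
struct block { unsigned char data[BLOCK_BYTES]; int valid, dirty; };
static unsigned char memory[BLOCK_BYTES];

void write_byte(struct block *b, int offset, unsigned char v,
                int write_back, int write_allocate) {
    int hit = b->valid;                       // ignoring tags for brevity
    if (!hit && write_allocate) {             // write miss + write-allocate:
        memcpy(b->data, memory, BLOCK_BYTES); //   pull the missed block in first
        b->valid = 1;
        hit = 1;
    }
    if (hit) {
        b->data[offset] = v;                  // write hit: update the cache
        if (write_back) b->dirty = 1;         // write-back: defer the memory update
        else            memory[offset] = v;   // write-through: update memory as well
    } else {
        memory[offset] = v;                   // no write-allocate: memory only
    }
    // (Under write-around, even a hit would take the memory-only path
    //  and additionally set b->valid = 0.)
}

void evict(struct block *b) {
    if (b->valid && b->dirty)                 // dirty block: flush it on eviction
        memcpy(memory, b->data, BLOCK_BYTES);
    b->valid = b->dirty = 0;
}
```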
### Replacement policy
* **LRU** - **L**east **R**ecently **U**sed: when we decide to evict a cache block to make space, we select the block whose most recent use is furthest in the past.
* **Random** - When we decide to evict a cache block to make space, we randomly select one of the blocks in the cache to evict.
* **MRU** - **M**ost **R**ecently **U**sed: when we decide to evict a cache block to make space, we select the block that was used most recently.
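A minimal sketch of how these replacement policies choose a victim within one set, assuming we keep a last-use timestamp per block (my own illustration, not Ripes/Venus internals):
```c
#include <stdlib.h>

typedef enum { LRU, MRU, RANDOM } policy_t;

// Return the way (block index within the set) to evict.
int pick_victim(const unsigned long last_use[], int ways, policy_t p) {
    if (p == RANDOM)
        return rand() % ways;                  // Random: any block will do
    int victim = 0;
    for (int w = 1; w < ways; w++) {
        if (p == LRU && last_use[w] < last_use[victim]) victim = w;  // oldest use
        if (p == MRU && last_use[w] > last_use[victim]) victim = w;  // newest use
    }
    return victim;
}
```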
## Preliminary work
* Download [Ripes](https://github.com/mortbopet/Ripes) simulator.
> **Ripes** is a graphical 5-stage RISC-V pipeline simulator & assembly editor. [color=#70B3DB]
* Prepare the files from [here](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07).
## Exercise 1 - A Couple of Memory Access Scenarios
:::info
**Task** : Simulate the following scenarios and record the final cache hit rates with the program in **cache.s**.
:::
### Cache.s
:::spoiler see the code here
```c=0
# Rewriten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
# int array[]; //Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) { // repeat repcount times
# /* Step through the selected array segment with the given step size. */
# for (index = 0; index < arraysize; index += stepsize) {
# if(option==0)
# array[index] = 0; // Option 0: One cache access - write
# else
# array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
# }
# }
.data
array: .word 2048 # max array size specified in BYTES (DO NOT CHANGE)
.text
##################################################################################################
# You MAY change the code below this section
main: li a0, 256 # array size in BYTES (power of 2 < array size)
li a1, 2 # step size (power of 2 > 0)
li a2, 1 # rep count (int > 0)
li a3, 1 # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
jal accessWords # lw/sw
#jal accessBytes # lb/sb
li a7,10 # exit (change a0 to a7 for Ripes)
ecall
# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)
accessWords:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
slli t1, a1, 2 # multiply stepsize by 4 because WORDS
wordLoop:
beq a3, zero, wordZero
lw t0, 0(s0) # array[index/4]++
addi t0, t0, 1
sw t0, 0(s0)
j wordCheck
wordZero:
sw zero, 0(s0) # array[index/4] = 0
wordCheck:
add s0, s0, t1 # increment ptr
blt s0, s1, wordLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessWords # outer loop done?
jr ra
accessBytes:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
byteLoop:
beq a3, zero, byteZero
lbu t0, 0(s0) # array[index]++
addi t0, t0, 1
sb t0, 0(s0)
j byteCheck
byteZero:
sb zero, 0(s0) # array[index] = 0
byteCheck:
add s0, s0, a1 # increment ptr
blt s0, s1, byteLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessBytes # outer loop done?
jr ra
```
:::
### Scenario 1 :
#### Program Parameters
(set these by initializing the a registers in **cache.s**)
| Array Size (a0) | Step Size (a1) |Rep Count (a2)|Option (a3)|
|:------:|:-----------:|:----------:|:----------:|
| 128 (bytes) | 8 | 4 | 0 |
#### Cache Parameters
(set these in the Cache tab)
| Cache Levels | Block Size | Number of Blocks | Placement Policy | Associativity | Replacement Policy |
|:------:|:-----------:|:----------:|:----------:|:-----------:|:-----------:|
| 1 | 8 | 4 | Direct Mapped | 1 | LRU |
#### Ripes setting
( Ripes stores one word per block and groups blocks into lines, so a block size of 8 bytes corresponds to 8/4 = 2 blocks (words) per line. )

#### Result

#### Tasks
* **What combination of parameters is producing the hit rate you observe?**
    + The **step size** in this scenario is 8 words (32 bytes), which is exactly the **cache size** (8-byte blocks × 4 blocks = 32 bytes).
    + So every memory access maps to the same cache block; in other words, every access misses (a small index-mapping check follows the screenshot below).
* **What is our hit rate if we increase Rep Count arbitrarily? Why?**
    + The hit rate won't change.
    + The memory access pattern is identical in every repetition: each access still maps to the same block and evicts the previous contents, so every access remains a miss.
* **How could we modify one program parameter to increase our hit rate?**
    + Change the **step size** to **1**.
    + Each miss then brings in an 8-byte block holding two words, so the following access hits.
    + The hit rate becomes 50%.

### Scenario 2 :
#### Program Parameters
(set these by initializing the a registers in **cache.s**)
| Array Size (a0) | Step Size (a1) |Rep Count (a2)|Option (a3)|
|:------:|:-----------:|:----------:|:----------:|
| 256 (bytes) | 2 | 1 | 1 |
#### Cache Parameters
(set these in the Cache tab)
| Cache Levels | Block Size | Number of Blocks | Placement Policy | Associativity | Replacement Policy |
|:------:|:-----------:|:----------:|:----------:|:-----------:|:-----------:|
| 1 | 16 | 16 | N-Way Set Associative | 4 | LRU |
#### Ripes setting

#### Result

#### Tasks
* **How many memory accesses are there per iteration of the inner loop?**
    + There are **two** memory accesses per iteration of the inner loop, because both the **lw** and the **sw** access memory.
* **What is the repeating hit/miss pattern? WHY?**
    + The pattern repeats as **1 miss followed by 3 hits** in **every 4 accesses**.
    + A block holds 4 words and the step size is 2 words, so two consecutive iterations (4 accesses) fall in the same block: the first **lw** misses and pulls the block in, and the remaining three accesses hit.
* **This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.**
    + Because the step size is **2 words** and a block holds **4 words**, every 4 accesses (**two steps**) move on to the next block.
    + There are **32 steps** (256 / (2 × 4)) and **every 2 steps fill one block**, so the whole array exactly fills the 16-block cache and nothing gets evicted.
    + Therefore the hit rate of the whole program is the same as that of each group of 4 accesses, which is **$\frac{3}{4} = 0.75$**.
* **Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?**
    + After the first outer loop finishes, all the data we need is already in the cache, so every memory access in the remaining repetitions is a hit.
    + For that reason, the hit rate approaches **1** as Rep Count grows (a worked formula follows the screenshot below).
    + This is the result when Rep Count is changed to 100.

* **QUESTION:**
:::spoiler **open question description**
Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).
Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario?
:::
+ The following is the original program
```c=0
int array[16384];
int func ()
{
for (int k = 0; k < 5; k++)
{
for (int index = 0; index < 16384; index++)
{
array[index] = array[index] + k;
}
}
}
```
+ ::: spoiler Compile the above program to RISC-V assembly using [Compiler Explorer](https://godbolt.org/).
```c=0
.data
array:
.word 65536
.text
main:
jal ra, func
jal zero, end
func: # @func
addi sp, sp, -32
sw ra, 28(sp) # 4-byte Folded Spill
sw s0, 24(sp) # 4-byte Folded Spill
addi s0, sp, 32
mv a0, zero
sw a0, -16(s0)
j .LBB0_1
.LBB0_1: # =>This Loop Header: Depth=1
lw a1, -16(s0)
addi a0, zero, 4
blt a0, a1, .LBB0_8
j .LBB0_2
.LBB0_2: # in Loop: Header=BB0_1 Depth=1
mv a0, zero
sw a0, -20(s0)
j .LBB0_3
.LBB0_3: # Parent Loop BB0_1 Depth=1
lw a1, -20(s0)
lui a0, 4
addi a0, a0, -1
blt a0, a1, .LBB0_6
j .LBB0_4
.LBB0_4: # in Loop: Header=BB0_3 Depth=2
lw a0, -20(s0)
la a1, array
slli a0, a0, 2
add a1, a1, a0
lw a0, 0(a1)
lw a2, -16(s0)
add a0, a0, a2
sw a0, 0(a1)
j .LBB0_5
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
lw a0, -20(s0)
addi a0, a0, 1
sw a0, -20(s0)
j .LBB0_3
.LBB0_6: # in Loop: Header=BB0_1 Depth=1
j .LBB0_7
.LBB0_7: # in Loop: Header=BB0_1 Depth=1
lw a0, -16(s0)
addi a0, a0, 1
sw a0, -16(s0)
j .LBB0_1
.LBB0_8:
lw a0, -12(s0)
lw s0, 24(sp) # 4-byte Folded Reload
lw ra, 28(sp) # 4-byte Folded Reload
addi sp, sp, 32
ret
end:
li a7,10 # exit
ecall
```
+ Result

---
+ Rewrite the program
```c=0
int array[16384];
int func ()
{
for (int index = 0; index < 16384; index++)
{
for (int k = 0; k < 5; k++)
{
array[index] = array[index] + k;
}
}
}
```
+ ::: spoiler Compile the above program to RISC-V assembly using [Compiler Explorer](https://godbolt.org/).
```c=0
.data
array:
.word 65536
.text
main:
jal ra, func
jal zero, end
func: # @func
addi sp, sp, -32
sw ra, 28(sp) # 4-byte Folded Spill
sw s0, 24(sp) # 4-byte Folded Spill
addi s0, sp, 32
mv a0, zero
sw a0, -16(s0)
j .LBB0_1
.LBB0_1: # =>This Loop Header: Depth=1
lw a1, -16(s0)
lui a0, 4
addi a0, a0, -1
blt a0, a1, .LBB0_8
j .LBB0_2
.LBB0_2: # in Loop: Header=BB0_1 Depth=1
mv a0, zero
sw a0, -20(s0)
j .LBB0_3
.LBB0_3: # Parent Loop BB0_1 Depth=1
lw a1, -20(s0)
addi a0, zero, 4
blt a0, a1, .LBB0_6
j .LBB0_4
.LBB0_4: # in Loop: Header=BB0_3 Depth=2
lw a0, -16(s0)
la a1, array
slli a0, a0, 2
add a1, a1, a0
lw a0, 0(a1)
lw a2, -20(s0)
add a0, a0, a2
sw a0, 0(a1)
j .LBB0_5
.LBB0_5: # in Loop: Header=BB0_3 Depth=2
lw a0, -20(s0)
addi a0, a0, 1
sw a0, -20(s0)
j .LBB0_3
.LBB0_6: # in Loop: Header=BB0_1 Depth=1
j .LBB0_7
.LBB0_7: # in Loop: Header=BB0_1 Depth=1
lw a0, -16(s0)
addi a0, a0, 1
sw a0, -16(s0)
j .LBB0_1
.LBB0_8:
lw a0, -12(s0)
lw s0, 24(sp) # 4-byte Folded Reload
lw ra, 28(sp) # 4-byte Folded Reload
addi sp, sp, 32
ret
end:
li a7,10 # exit
ecall
```
+ Result

---
+ ::: success
We can see that if we keep one of the array elements hot in cache instead of circling back to it later on, then the hit rate will be higher.
:::
### Scenario 3 :
> Ripes doesn't support multilevel caches, so I did this scenario with [Venus](https://venus.cs61c.org/). [color=#70B3DB]
#### Program Parameters
(set these by initializing the a registers in **cache.s**)
| Array Size (a0) | Step Size (a1) |Rep Count (a2)|Option (a3)|
|:------:|:-----------:|:----------:|:----------:|
| 128 (bytes) | 1 | 1 | 0 |
#### Cache Parameters
(set these in the Cache tab)
* Cache Levels : 2
* L1 cache
| Block Size | Number of Blocks | Enable? | Placement Policy | Associativity | Replacement Policy |
|:------:|:-----------:|:----------:|:----------:|:-----------:|:-----------:|
| 8 | 8 | Should be green | Direct Mapped | 1 | LRU |
* L2 cache
| Block Size | Number of Blocks | Enable? | Placement Policy | Associativity | Replacement Policy |
|:------:|:-----------:|:----------:|:----------:|:-----------:|:-----------:|
| 8 | 16 | Should be green | Direct Mapped | 1 | LRU |
#### Result
L1 :

L2 :

#### Tasks
* **What is the hit rate of our L1 cache? Our L2 cache? Overall?**
+ L1 hit rate = 0.5
+ L2 hit rate = 0
    + Overall = 0.5 (a worked calculation follows this task list)
* **How many accesses do we have to the L1 cache total? How many of them are misses?**
+ 32 accesses
+ 16 of them are misses
* **How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?**
+ 16 accesses
    + This equals the number of L1 misses, since the L2 cache is only accessed when the L1 cache misses.
* **What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?**
    + Every 2 memory accesses produce 1 miss in the L1 cache, so the accesses that reach the L2 cache are one L1 block (8 bytes) apart.
    + Since the L2 block size is also 8 bytes, every L2 access lands in a block that has never been loaded before, i.e. a compulsory miss.
    + To improve the L2 hit rate we can simply increase Rep Count: the whole 128-byte array fits in the L2 cache (16 × 8 bytes), so the accesses of the following repetitions all hit in L2, while the L1 hit rate stays the same (the array does not fit in the 64-byte L1, so L1 keeps missing every other access).

* **What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?**
    + Increasing the **number of blocks** in L1 changes **nothing**: with Rep Count = 1 all L1 misses are compulsory, so only the block size matters.
    + On the other hand, increasing the L1 **block size** makes the L1 hit rate **increase**, since each miss now brings in more consecutive words.
    + But the **L2 hit rate is still 0**: the L1 misses (and therefore the L2 accesses) become farther apart, so every L2 access is still a compulsory miss.
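Putting the counts from the tasks above together (a worked check of the reported hit rates):
$$
\text{L1 hit rate} = \frac{32 - 16}{32} = 0.5,\qquad
\text{L2 hit rate} = \frac{0}{16} = 0,\qquad
\text{overall hit rate} = \frac{16 + 0}{32} = 0.5
$$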
## Exercise 2 - Loop Ordering and Matrix Multiplication
:::spoiler See the source code here
```c=0
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
/* To save you time, we are including all 6 variants of the loop ordering
as separate functions and then calling them using function pointers.
The reason for having separate functions that are nearly identical is
to avoid counting any extraneous processing towards the computation
time. This includes I/O accesses (printf) and conditionals (if/switch).
I/O accesses are slow and conditional/branching statements could
unfairly bias results (lower cases in switches must run through more
case statements on each iteration).
*/
void multMat1( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is ijk loop order. */
for( i = 0; i < n; i++ )
for( j = 0; j < n; j++ )
for( k = 0; k < n; k++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
void multMat2( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is ikj loop order. */
for( i = 0; i < n; i++ )
for( k = 0; k < n; k++ )
for( j = 0; j < n; j++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
void multMat3( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is jik loop order. */
for( j = 0; j < n; j++ )
for( i = 0; i < n; i++ )
for( k = 0; k < n; k++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
void multMat4( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is jki loop order. */
for( j = 0; j < n; j++ )
for( k = 0; k < n; k++ )
for( i = 0; i < n; i++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
void multMat5( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is kij loop order. */
for( k = 0; k < n; k++ )
for( i = 0; i < n; i++ )
for( j = 0; j < n; j++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
void multMat6( int n, float *A, float *B, float *C ) {
int i,j,k;
/* This is kji loop order. */
for( k = 0; k < n; k++ )
for( j = 0; j < n; j++ )
for( i = 0; i < n; i++ )
C[i+j*n] += A[i+k*n]*B[k+j*n];
}
/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
int nmax = 1000,i;
void (*orderings[])(int,float *,float *,float *) =
{&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
char *names[] = {"ijk","ikj","jik","jki","kij","kji"};
float *A = (float *)malloc( nmax*nmax * sizeof(float));
float *B = (float *)malloc( nmax*nmax * sizeof(float));
float *C = (float *)malloc( nmax*nmax * sizeof(float));
struct timeval start, end;
/* fill matrices with random numbers */
for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;
for( i = 0; i < 6; i++) {
/* multiply matrices and measure the time */
gettimeofday( &start, NULL );
(*orderings[i])( nmax, A, B, C );
gettimeofday( &end, NULL );
/* convert time to Gflop/s */
double seconds = (end.tv_sec - start.tv_sec) +
1.0e-6 * (end.tv_usec - start.tv_usec);
double Gflops = 2e-9*nmax*nmax*nmax/seconds;
printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
}
free( A );
free( B );
free( C );
printf("\n\n");
return 0;
}
```
:::
### The stride in the innermost loop of each function

### Tasks
* **Record your results**
```c
ijk: n = 1000, 2.193 Gflop/s
ikj: n = 1000, 0.315 Gflop/s
jik: n = 1000, 2.099 Gflop/s
jki: n = 1000, 13.613 Gflop/s
kij: n = 1000, 0.301 Gflop/s
kji: n = 1000, 13.412 Gflop/s
```
* **Which ordering(s) perform best for these 1000-by-1000 matrices? Why?**
    + **multMat4 (jki)** and **multMat6 (kji)** perform the best.
    + **jki** and **kji** have the same and the **smallest strides (1, 1, 0)** (for C, A, B) in the innermost loop, which gives the most **spatial locality and cache-line reuse** and makes them the **fastest**.
* **Which ordering(s) perform the worst? Why?**
    + **multMat2 (ikj)** and **multMat5 (kij)** perform the worst.
    + They both have **large strides (n, 0, n)** (for C, A, B) in the innermost loop, so nearly every access to C and B touches a new cache line and misses, which slows them down.
* **How does the way we stride through the matrices with respect to the innermost loop affect performance?**
    + **Smaller** strides are **faster**: unit-stride accesses keep reusing the cache line that was just fetched, so the hit rate is **higher** (a rough line-count sketch follows).
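A rough way to see this effect (my own sketch, assuming 64-byte cache lines and 4-byte floats; not part of the lab code): count how many distinct cache lines one pass of the innermost loop touches per array.
```c
#include <stdio.h>

// Estimate how many distinct cache lines one pass of the innermost loop
// touches for a given element stride (assumes 4-byte floats, 64-byte lines).
static long lines_touched(int n, int stride_elems) {
    const int line_elems = 64 / 4;             // 16 floats per cache line
    long lines = 0, last = -1;
    for (int t = 0; t < n; t++) {
        long line = (long)t * stride_elems / line_elems;
        if (line != last) { lines++; last = line; }
    }
    return lines;
}

int main(void) {
    int n = 1000;
    printf("stride 1: %ld lines per pass\n", lines_touched(n, 1));  // ~63 lines
    printf("stride n: %ld lines per pass\n", lines_touched(n, n));  // 1000 lines
    return 0;
}
```
With stride 1 (jki/kji) a fetched line serves 16 consecutive iterations, while with stride n (ikj/kij) every iteration needs a new line, which matches the measured Gflop/s gap.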
## Exercise 3 - Cache Blocking and Matrix Transposition
The following C code transposes one submatrix (block) at a time; repeating it over all blocks transposes the entire matrix.
```clike=
// transpose() transposes one (blocksize x blocksize) block of an (n x n) matrix.
// The starting indices of the block must be specified.
void transpose(int IndexX, int IndexY, int blocksize, int n, int* dst, int* src) {
    for (int y = IndexY; y < (IndexY + blocksize); y++) {
        for (int x = IndexX; x < (IndexX + blocksize); x++) {
            if (y < n && x < n) {   // skip indices that fall outside the matrix
                dst[y + x * n] = src[x + y * n];
            }
        }
    }
}
// transpose_blocking() iterates over the matrix block by block
// and transposes each block.
void transpose_blocking(int n, int blocksize, int* dst, int* src) {
    for (int blockX = 0; blockX < n; blockX += blocksize) {
        for (int blockY = 0; blockY < n; blockY += blocksize) {
            transpose(blockX, blockY, blocksize, n, dst, src);
        }
    }
}
```
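For comparison, the `Naive transpose` baseline used in the measurements below presumably looks like the following (a sketch with the same index convention as above; the lab's actual `transpose_naive()` may differ in its exact signature):
```c
// Non-blocked baseline: walk the whole matrix in one pass.
void transpose_naive(int n, int* dst, int* src) {
    for (int x = 0; x < n; x++) {
        for (int y = 0; y < n; y++) {
            dst[y + x * n] = src[x + y * n];  // one stride-1 and one stride-n access
        }
    }
}
```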
### Part 1 - Changing Array Sizes
#### Task
**Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.**
* **Results**
| Execution time (ms) | n = 100 | n = 1000 | n = 2000 | n = 5000 | n = 10000 |
| --- | ---- | ---- | ---- | ---- | ----- |
| Naive transpose | 0.004 | 0.838 | 23.996 | 157.556 | 1013.84 |
| Transpose with blocking | 0.009 | 0.142 | 7.23 | 50.97 | 212.791 |

`Transpose with blocking` becomes increasingly faster than `Naive transpose` as n grows.
* **At what point does cache blocked version of transpose become faster than the non-cache blocked version?**
**Ans:**
When n=1000, `Transpose with blocking` starts to be faster than `Naive transpose`.
* **Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?**
**Ans:**
When the matrix size n is small, `Naive transpose` enjoys good locality too: the cache can hold most of the matrix, so even the column-order accesses hit almost every time, and blocking only adds loop overhead.
However, as the matrix grows, each iteration accesses a pair of elements mirrored across the diagonal (`src[x + y*n]` and `dst[y + x*n]`), and these two accesses become very far apart in memory. Traversing column by column then forces the cache to replace blocks constantly, and the hit rate drops. (A rough working-set estimate follows below.)
Therefore `Transpose with blocking` outperforms `Naive transpose` once the matrix exceeds a certain size.
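A rough estimate of where the crossover should happen (assuming 4-byte ints and 64-byte cache lines; the actual cache sizes of the test machine are unknown): the strided accesses of `Naive transpose` touch about one new cache line per matrix row, so covering one column's worth of rows needs roughly
$$
n \times 64\ \text{bytes} \;\approx\; 6.4\ \text{KB at } n = 100 \quad\text{vs.}\quad 64\ \text{KB at } n = 1000,
$$
which stops fitting in a typical 32 KB L1 data cache somewhere between those two sizes. This is consistent with blocking starting to win at n = 1000.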
### Part 2 - Changing Blocksize
#### Task
**Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.**
* **Results**
| Execution time (ms) | blocksize = 50 | blocksize = 100 | blocksize = 500 | blocksize = 1000 | blocksize = 5000 |
| --- | ---- | ---- | ---- | ---- | ----- |
| Naive transpose | 1027.04 | 898.784 | 904.159 | 967.571 | 1033.54 |
| Transpose with blocking | 250.063 | 219.284 | 136.375 | 221.261 | 783.823 |

**Ans:**
Since the matrix size is fixed, the execution time of `Naive transpose` is essentially the same regardless of the block size. Overall, `Naive transpose` is more time-consuming than `Transpose with blocking`.
* **How does performance change as blocksize increases? Why is this the case?**
**Ans:**
Performance first improves and then degrades as the block size increases: in these measurements `Transpose with blocking` is fastest around blocksize 500 and becomes dramatically slower at blocksize 5000. As long as a pair of blocks (one tile of `src` and one of `dst`) fits in the cache, a larger block amortizes more work over each fetched line; once the blocks grow too large to fit, the accesses within a block start evicting each other, the hit rate falls, and the behavior approaches that of the naive transpose. (A rough working-set estimate follows.)
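A rough working-set estimate (assuming 4-byte ints) that matches this trend: transposing one block touches a blocksize × blocksize tile of `src` and of `dst`, so
$$
\text{working set per block pair} \;\approx\; 2 \times \text{blocksize}^2 \times 4\ \text{bytes},
$$
which is about 20 KB at blocksize 50, 2 MB at 500, and 200 MB at 5000; the last value cannot fit in any cache level, which is consistent with the sharp slowdown measured at blocksize 5000.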