# Assignment 4: Cache
contributed by < `WeiCheng14159` >
###### tags: `computer architecture 2020`
## Table of Contents
[TOC]
## Objectives
- Analyze how memory access patterns determine cache hit rates
- Analyze and discover which memory access patterns produce GOOD hit rates
- Analyze hit rates for caches and be able to optimize code accesses to produce good hit rates
## Experiment setup
The following experiments are done on the Ripes simulator with the following CPU setup:
A single-cycle processor supporting the RV32IM ISA.
## Exercise 1 - A Couple of Memory Access Scenarios
Simulate the following scenarios and record the **final cache hit rates** with the program in `cache.s`.
### cache.c
```c=
int array[]; // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) { // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0; // Option 0: One cache access - write
        else
            array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
    }
}
```
### cache.s
:::spoiler assembly code
```c=1
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
# int array[]; //Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) { // repeat repcount times
# /* Step through the selected array segment with the given step size. */
# for (index = 0; index < arraysize; index += stepsize) {
# if(option==0)
# array[index] = 0; // Option 0: One cache access - write
# else
# array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
# }
# }
.data
array: .word 2048 # max array size specified in BYTES (DO NOT CHANGE)
.text
##################################################################################################
# You MAY change the code below this section
main: li a0, 128 # array size in BYTES (power of 2 < array size)
li a1, 8 # step size (power of 2 > 0)
li a2, 4 # rep count (int > 0)
li a3, 0 # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
jal accessWords # lw/sw
#jal accessBytes # lb/sb
li a0,10 # exit
ecall
# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)
accessWords:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
slli t1, a1, 2 # multiply stepsize by 4 because WORDS
wordLoop:
beq a3, zero, wordZero
lw t0, 0(s0) # array[index/4]++
addi t0, t0, 1
sw t0, 0(s0)
j wordCheck
wordZero:
sw zero, 0(s0) # array[index/4] = 0
wordCheck:
add s0, s0, t1 # increment ptr
blt s0, s1, wordLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessWords # outer loop done?
jr ra
accessBytes:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
byteLoop:
beq a3, zero, byteZero
lbu t0, 0(s0) # array[index]++
addi t0, t0, 1
sb t0, 0(s0)
j byteCheck
byteZero:
sb zero, 0(s0) # array[index] = 0
byteCheck:
add s0, s0, a1 # increment ptr
blt s0, s1, byteLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessBytes # outer loop done?
jr ra
```
:::
### Scenario 1:
#### Program Parameters
Program Parameters are set in `cache.s`
| Array size | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 128 | 8 | 4 | 0 |
#### Cache Parameters
Cache Parameters are set in Ripes simulator
| Cache Levels | Block Size | # of Blocks | Placement Policy | Associativity | Replacement Policy |
| ------------ | ---------- | ----------- | ---------------- | ------------- | ------------------ |
| 1* | 8 | 4 | Direct Mapped | 1 | LRU |
`* Ripes simulator only supports 1 level cache`
Assumption of cache behavior set in Ripes
| Cache hit policy | Cache miss policy |
| - | - |
| Write-back | Write-allocate |
#### Results
- Cache setup
- 
- tag: [31:7]
- index: [6:5]
- block offset: [4:2]
- byte offset: [1:0]
- Data memory access pattern
- The program writes to the following addresses in each repetition
| Address | tag [31:7] | index [6:5] | block offset [4:2] | byte offset [1:0] |
| -- | -- | -- | -- | -- |
| 0x10000000 | 0x200000 | 2'b00 | 3'b000 | 2'b00 |
| 0x10000020 | 0x200000 | 2'b01 | 3'b000 | 2'b00 |
| 0x10000040 | 0x200000 | 2'b10 | 3'b000 | 2'b00 |
| 0x10000060 | 0x200000 | 2'b11 | 3'b000 | 2'b00 |
- Notice the accessed address advances by **$20_{hex}$ ($32_{dec}$ bytes)** each iteration, since each array element is 4 bytes and the step size is 8 elements.
- The same access pattern repeats **4** times. The first iteration causes **4 compulsory cache misses**; the later iterations contribute **12 cache hits**.
- Thus, the final cache hit rate is **12/16 = 0.75**.
- Data cache
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.75 | 12 | 4 | 0 |
- Instruction cache
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.98 | 163 | 3 | 0 |
#### Task:
- What combination of parameters is producing the hit rate you observe?
- The **stride size** of this program is **32 bytes**, the same as the cache's **block size** (8 words = 32 bytes), so every access in the first repetition touches a new block and misses.
- What is our hit rate if we increase Rep Count arbitrarily? Why?
- The hit rate gets **close to 1.0** since the **effect of the compulsory cache misses is amortized** over more repetitions.
- How could we modify one program parameter to increase our hit rate?
- We can increase the hit rate by **setting the step size to 1**. The hit rate rises to **0.97** with the same setup.
### Scenario 2:
#### Program Parameters
Program Parameters are set in `cache.s`
| Array size | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 256 | 2 | 1 | 1 |
#### Cache Parameters
Cache Parameters are set in Ripes simulator
| Cache Levels | Block Size | # of Blocks | Placement Policy | Associativity | Replacement Policy |
| -- | -- | -- | -- | -- | -- |
| 1* | 16 | 16** | 4-Way Set Associative | 4 | LRU |
`* Ripes simulator only supports 1 level cache`
`** Assume this is the number of cache lines (sets) per way, NOT the total number of blocks in the cache`
Assumption of cache behavior set in Ripes
| Cache hit policy | Cache miss policy |
| - | - |
| Write-back | Write-allocate |
#### Results
- Cache setup
- 
- tag: [31:10]
- index: [9:6]
- block offset: [5:2]
- byte offset: [1:0]
- Data memory access pattern
- The program reads and writes the following addresses in each repetition
| Address | tag [31:10] | index [9:6] | block offset [5:2] | byte offset [1:0] |
| -- | -- | -- | -- | -- |
| 0x10000000 (R/W) | 0x40000 | **4'b0000** | 4'b0000 | 2'b00 |
| 0x10000008 (R/W) | 0x40000 | 4'b0000 | 4'b0010 | 2'b00 |
| 0x10000010 (R/W) | 0x40000 | 4'b0000 | 4'b0100 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x10000040 (R/W) | 0x40000 | **4'b0001** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x10000080 (R/W) | 0x40000 | **4'b0010** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x10000100 (R/W) | 0x40000 | **4'b0011** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |
- Notice the accessed address advances by **$8_{hex}$ ($8_{dec}$ bytes)** each iteration, since each array element is 4 bytes and the step size is 2 elements.
- Data cache
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9375 | 60 | 4 | 0 |
- Instruction cache
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.99 | 235 | 3 | 0 |
#### Task:
- How many memory accesses are there per iteration of the inner loop?
- There are **two memory accesses per iteration**, marked by `lw t0, 0(s0)` and `sw t0, 0(s0)` in the assembly program.
- What is the repeating hit/miss pattern? WHY?
- The hit/miss pattern can be observed from the following table
- | Address | index [9:6] | block offset [5:2] | byte offset [1:0] | # of hit/miss |
| -- | -- | -- | -- | -- |
| 0x10000000 (R/W) | **4'b0000** | 4'b0000 | 2'b00 | 1m1h |
| 0x10000008 (R/W) | 4'b0000 | 4'b0010 | 2'b00 | 2h |
| 0x10000010 (R/W) | 4'b0000 | 4'b0100 | 2'b00 | 2h |
| 0x10000018 (R/W) | 4'b0000 | 4'b0110 | 2'b00 | 2h |
| 0x10000020 (R/W) | 4'b0000 | 4'b1000 | 2'b00 | 2h |
| 0x10000028 (R/W) | 4'b0000 | 4'b1010 | 2'b00 | 2h |
| 0x10000030 (R/W) | 4'b0000 | 4'b1100 | 2'b00 | 2h |
| 0x10000038 (R/W) | 4'b0000 | 4'b1110 | 2'b00 | 2h |
| 0x10000040 (R/W) | **4'b0001** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |
| 0x10000080 (R/W) | **4'b0010** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |
| 0x10000100 (R/W) | **4'b0011** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |
- Notice the **bold** numbers mark the **first access to each cache block**. A cache miss forces the cache system to load the **whole block** from memory into the cache.
- The hit/miss pattern of this program is therefore **1 miss per 16 memory accesses**.
- This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
- As expected, the hit rate is **$\frac{15}{16} = 0.9375$**
- Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?
- The hit rate **approaches 1.0** as Rep Count approaches infinity.
- A single repetition makes **64 memory accesses**. Based on the analysis above, **4 compulsory misses** occur, all in the **first repetition**.
- As the number of repetitions goes to infinity, the **effect of the compulsory misses is amortized to nearly zero**.
- :::spoiler QUESTION
Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep). QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario ?
:::
- Each function is defined as a **numerical operation on an array element**. A function might be the same across repetitions (e.g. increment by 1) or differ across repetitions (e.g. increment by 2k, where k is the repetition number).
- Consider the following example: `if Rep Count = 4, we map 4 different functions onto each of the 65536 array elements, one per Rep.` The C code example is as follows:
- ```c=1
// array size = 65536
// rep count = 4
int array[65536];
void accu() {
    for (int k = 0; k < 4; k++) {            // repeat repcount times
        for (int i = 0; i < 65536; i += 1) {
            array[i] = array[i] + 2 * k;     // function depends on k
        }
    }
}
```
- Assume the C program above is our target application and the cache configuration is the same as in the previous scenario. To provide a solid example for discussion, the setup is as follows:
- | Array size (bytes) | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 262144 | 1 | 4 | 1 |
- Compile the C code on `Compiler Explorer` with the `-O0` flag and run the result in Ripes; the data cache statistics are as follows:
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9911 | 1818645 | 16389 | 16325 |
- Given that the **array size is much larger than the cache size**, the program becomes much more **cache friendly** if it is restructured as follows:
- Notice the **two for loops are swapped** (loop interchange)
- ```c=1
int array[65536];
void accu() {
    for (int i = 0; i < 65536; i += 1) {
        for (int k = 0; k < 4; k++) {        // repeat repcount times
            array[i] = array[i] + 2 * k;     // function depends on k
        }
    }
}
```
- Compile the C code on `Compiler Explorer` with the `-O0` flag and run the result in Ripes; the data cache statistics are as follows:
| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9981 | 2158596 | 4034 | 4098 |
- **The result shows a 0.7% improvement in cache hit rate** and only **1/4 as many cache misses** compared to the previous approach! The restructured program performs better because of **temporal locality**: each element is fully processed while its block is still resident in the cache.
### Scenario 3:
Skipped: the Ripes simulator doesn't support a 2-level cache.
## Exercise 2 - Loop Ordering and Matrix Multiplication
Matrix multiplication source code
```c=1
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
C[i+j*n] += A[i+k*n] * B[k+j*n];
```
### Matrix multiplication problem analysis
- Matrix elements are `float` which is **4 bytes**
- $O(N^3)$ total operations
- Illustration: 
### Tasks
- Record your results in answers.txt
- ```c=1
./matrixMultiply
ijk: n = 1000, 2.048 Gflop/s
ikj: n = 1000, 0.294 Gflop/s
jik: n = 1000, 1.786 Gflop/s
jki: n = 1000, 10.682 Gflop/s
kij: n = 1000, 0.295 Gflop/s
kji: n = 1000, 8.760 Gflop/s
```
- Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
- Assumptions about the problem
- Matrices A, B, C are stored in **column-major** order (element $(i, j)$ is at offset $i + j \times n$, as the indexing above shows)
- Matrix dimension (N) is very large, such that 1/N approaches zero
- Cache is **NOT big enough** to hold multiple rows/columns
- Stride analysis (stride of each matrix in the **innermost** loop; stride 0 means the element is loop-invariant and can stay in a register):
- | ijk | C | A | B |
| ---------- | - | - | - |
|Stride size | 0 | n | 1 |
- | ikj | C | A | B |
| ---------- | - | - | - |
|Stride size | n | 0 | n |
- | jik | C | A | B |
| ---------- | - | - | - |
|Stride size | 0 | n | 1 |
- | jki | C | A | B |
| ---------- | - | - | - |
|Stride size | 1 | 1 | 0 |
- | kij | C | A | B |
| ---------- | - | - | - |
|Stride size | n | 0 | n |
- | kji | C | A | B |
| ---------- | - | - | - |
|Stride size | 1 | 1 | 0 |
- The **jki ordering** performs best (with **kji** close behind)! This matches the stride analysis above: the innermost loop walks C and A with unit stride while B's element stays in a register.
- Which ordering(s) perform the worst? Why?
- **ikj and kij perform the worst** because, in their innermost loop, both C and B are accessed with **stride n**, so nearly every access touches a new cache block.
- How does the way we stride through the matrices with respect to the innermost loop affect performance?
- Given the matrix multiplication problem, the three for loops should be arranged so that the **innermost loop makes stride-1 (unit-stride) memory accesses** rather than stride-n accesses. This pattern **maximizes both spatial and temporal locality**.
## Exercise 3 - Cache Blocking and Matrix Transposition
### Matrix transpose with cache blocking
:::spoiler C code
```c=1
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    size_t remains = n % blocksize;
    size_t iter = n / blocksize;
    size_t block_base_i = 0;
    size_t block_base_j = 0;
    /* Transpose the blocksize-aligned part, one block at a time. */
    for (size_t block_i = 0; block_i < iter; block_i++) {
        for (size_t block_j = 0; block_j < iter; block_j++) {
            block_base_i = block_i * blocksize;
            block_base_j = block_j * blocksize;
            for (size_t i = 0; i < blocksize; i++) {
                for (size_t j = 0; j < blocksize; j++) {
                    size_t temp_i = block_base_i + i;
                    size_t temp_j = block_base_j + j;
                    dst[temp_j + temp_i * n] = src[temp_i + temp_j * n];
                }
            }
        }
    }
    /* Handle the remainder rows/columns when n is not a multiple of
     * blocksize. Recompute the bases (rather than incrementing the
     * stale values from the loop above) so this also works when
     * n < blocksize and the block loops never ran. */
    block_base_i = iter * blocksize;
    block_base_j = iter * blocksize;
    for (size_t i = block_base_i; i < (size_t)n; i++) {
        for (size_t j = 0; j < (size_t)n - remains; j++) {
            dst[j + i * n] = src[i + j * n];
        }
    }
    for (size_t j = block_base_j; j < (size_t)n; j++) {
        for (size_t i = 0; i < (size_t)n; i++) {
            dst[j + i * n] = src[i + j * n];
        }
    }
}
```
:::
### Part 1 - Changing Array Sizes
- Fix the `blocksize` to be 20 and run your code with `n` equal to 100, 1000, 2000, 5000, 10000
- Naive
- | blocksize = 20 | n=100 | 1000 | 2000 | 5000 | 10000 |
| -------------- | ----- | ---- | ---- | ---- | ----- |
| time (ms) | 0.007 | 1.695| 13.415| 116.718|647.225|
- Blocking
- | blocksize = 20 | n=100 | 1000 | 2000 | 5000 | 10000 |
| -------------- | ----- | ---- | ---- | ---- | ----- |
| time (ms) | 0.003 | 1.411| 4.606 | 33.128 |149.103|
- 
#### Task
- At what point does cache blocked version of transpose become faster than the non-cache blocked version?
- The cache-blocked version of transpose is **always** faster than the non-blocked version in this setup.
- Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
- I **don't** observe this phenomenon on my machine: the blocked version wins at every tested size. Presumably on machines with a different cache hierarchy, the blocking overhead (extra loop bookkeeping) can dominate for matrices small enough to fit in cache.
### Part 2 - Changing Blocksize
- Fix `n` to be 10000, and run your code with `blocksize` equal to 50, 100, 500, 1000, 5000.
- Naive
- | n = 10000 | blocksize=50 | 100 | 500 | 1000 | 5000 |
| -------------- | ----- | ---- | ---- | ---- | ----- |
| time (ms) | 615.668 | 598.47| 595.875| 611.302|599.406|
- Blocking
- | n = 10000 | blocksize=50 | 100 | 500 | 1000 | 5000 |
| -------------- | ----- | ---- | ---- | ---- | ----- |
| time (ms) | 132.042 | 139.508| 123.631 | 155.441 |594.837|
- 
#### Task
- How does performance change as blocksize increases? Why is this the case?
- As the block size increases, performance of the blocked version first stays roughly flat, then degrades toward the naive version: once a blocksize-by-blocksize tile no longer fits in the cache, blocking no longer provides any locality benefit (see blocksize = 5000, which is nearly as slow as the non-blocked runs).