# Assignment 4: Cache
---
## Requirements
1. Following the instructions of Lab 7, you shall do the exercises using the cache simulation provided by Ripes.
2. With the given source code, you shall prepare C programs which are eventually compiled to RISC-V assembly using Compiler Explorer. Then, you must write your answers to the questions in the Task section of each exercise in Lab 7.
- For tips on how to convert Compiler Explorer generated RISC-V assembly into assembly compatible with the Ripes assembler, refer to the page Adapting Compiler Explorer generated RISC-V assembly code.
- You definitely have to modify the given source code to fit the expectations of Ripes.
3. Write down your thoughts and progress in HackMD notes.
## Exercise 1
---
Simulate the following scenarios and record the final cache hit rates with the program in cache.s.
The pseudocode is like:
```clike
int array[]; // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) { // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0; // Option 0: One cache access - write
        else
            array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
    }
}
```
and cache.s:
```asm
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
# int array[]; //Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) { // repeat repcount times
# /* Step through the selected array segment with the given step size. */
# for (index = 0; index < arraysize; index += stepsize) {
# if(option==0)
# array[index] = 0; // Option 0: One cache access - write
# else
# array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
# }
# }
.data
array: .word 2048 # max array size specified in BYTES (DO NOT CHANGE)
.text
##################################################################################################
# You MAY change the code below this section
main: li a0, 256 # array size in BYTES (power of 2 < array size)
li a1, 2 # step size (power of 2 > 0)
li a2, 1 # rep count (int > 0)
li a3, 1 # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
jal accessWords # lw/sw
#jal accessBytes # lb/sb
li a0,10 # exit
ecall
# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)
accessWords:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
slli t1, a1, 2 # multiply stepsize by 4 because WORDS
wordLoop:
beq a3, zero, wordZero
lw t0, 0(s0) # array[index/4]++
addi t0, t0, 1
sw t0, 0(s0)
j wordCheck
wordZero:
sw zero, 0(s0) # array[index/4] = 0
wordCheck:
add s0, s0, t1 # increment ptr
blt s0, s1, wordLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessWords # outer loop done?
jr ra
accessBytes:
la s0, array # ptr to array
add s1, s0, a0 # hardcode array limit (ptr)
byteLoop:
beq a3, zero, byteZero
lbu t0, 0(s0) # array[index]++
addi t0, t0, 1
sb t0, 0(s0)
j byteCheck
byteZero:
sb zero, 0(s0) # array[index] = 0
byteCheck:
add s0, s0, a1 # increment ptr
blt s0, s1, byteLoop # inner loop done?
addi a2, a2, -1
bgtz a2, accessBytes # outer loop done?
jr ra
```
You implement each scenario by modifying the parameters in this block:
```clike=
main: li a0, 256 # array size in BYTES (power of 2 < array size)
li a1, 2 # step size (power of 2 > 0)
li a2, 1 # rep count (int > 0)
li a3, 1 # 0 - option 0, 1 - option 1
```
### Scenario 1
In Scenario 1, the program parameters are as follows:
- Array Size (a0): 128 (bytes)
- Step Size (a1): 8
- Rep Count (a2): 4
- Option (a3): 0

and the cache parameters (set in the Cache tab) are:
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 4
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1 (Venus won’t let you change this, why?)
- Block Replacement Policy: LRU
Note that the Ripes simulator only supports a single cache level.

This program has a pair of nested loops. The outer loop, controlled by ```repcount```, determines how many times the whole array segment is traversed, while the ```step size``` in the inner loop determines the address of each memory access within a traversal.
**Q1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)**
A1: Because the ```step size``` in bytes is exactly equal to the block size in bytes. The step size of 8 words is 32 bytes, and the block size of 8 appears to be counted in words here (i.e. 32 bytes per block, which matches the observed hit rate), so every access in a pass lands in a new block and misses. Since the 128-byte array fits exactly in the 4-block cache, the remaining 3 passes hit on every access, giving 12 hits in 16 accesses = 0.75.
**Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?**
A2: The hit rate approaches 1. All misses occur in the first pass over the array (the array fits entirely in the cache), so every extra repetition adds only hits; the execution time, however, also increases.
**Q3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)**
A3: We can reduce the ```step size``` or increase the ```repcount```.
Increasing ```repcount``` to ```8``` improves the hit rate from 0.75 to 0.875.
Reducing the ```step size``` to ```2``` also improves it to 0.875.
Modifying both gives 0.9375, but the program takes much longer to execute.
### Scenario 2
In Scenario 2, the program parameters are as follows:
- Array Size (a0): 256 (bytes)
- Step Size (a1): 2
- Rep Count (a2): 1
- Option (a3): 1

and the cache parameters (set in the Cache tab) are:
- Cache Levels: 1
- Block Size: 16
- Number of Blocks: 16
- Enable?: Should be green
- Placement Policy: N-Way Set Associative
- Associativity: 4
- Block Replacement Policy: LRU
**Q1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.**
A1: There are two memory accesses per iteration of ```wordLoop```: the load ```lw t0, 0(s0)``` and the store ```sw t0, 0(s0)```.
**Q2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).**
A2: One miss followed by 15 hits, repeating every 16 accesses. With a block size of 16 words, a cache line is 64 bytes and covers 8 array elements at this step size (2 words), and each element is accessed twice (a read, then a write); so the first read of a new line is a compulsory miss and the next 15 accesses to the same line all hit.
**Q3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.**
A3:
With one miss per 16 accesses, the hit rate is 15/16 = 0.9375.
**Q4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?**
A4: The hit rate approaches 1 as Rep Count goes to infinity. The 256-byte array fits entirely in the cache, so only the first pass takes compulsory misses; every additional pass contributes only hits, and the fixed number of misses becomes negligible compared with the total number of accesses.
### Scenario 3:
In Scenario 3, the program parameters are as follows:
- Array Size (a0): 128 (bytes)
- Step Size (a1): 1
- Rep Count (a2): 1
- Option (a3): 0

and the cache parameters (set in the Cache tab) are:
- Cache Levels: 2

NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)
- Block Size: 8
- Number of Blocks: 8
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU

***NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)***
- Block Size: 8
- Number of Blocks: 16
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
**Q1. What is the hit rate of our L1 cache? Our L2 cache? Overall?
How many accesses do we have to the L1 cache total? How many of them are misses?**
A1:
**Q2. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?**
A2:
**Q3. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?**
A3:
**Q4. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?**
A4:
## Exercise 2
```clike=
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
```shell
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 100, 2.372 Gflop/s
ikj: n = 100, 4.751 Gflop/s
jik: n = 100, 2.558 Gflop/s
jki: n = 100, 20.833 Gflop/s
kij: n = 100, 4.819 Gflop/s
kji: n = 100, 18.018 Gflop/s
```
```shell
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 1000, 1.569 Gflop/s
ikj: n = 1000, 0.249 Gflop/s
jik: n = 1000, 1.649 Gflop/s
jki: n = 1000, 10.492 Gflop/s
kij: n = 1000, 0.250 Gflop/s
kji: n = 1000, 9.392 Gflop/s
```
```shell
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 2000, 0.586 Gflop/s
ikj: n = 2000, 0.216 Gflop/s
jik: n = 2000, 0.565 Gflop/s
jki: n = 2000, 7.889 Gflop/s
kij: n = 2000, 0.211 Gflop/s
kji: n = 2000, 7.576 Gflop/s
```
**Q1. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?**
A1: The best orderings are ```jki``` and ```kji```. With the ```i + j*n``` indexing, elements with consecutive ```i``` are adjacent in memory, so making ```i``` the innermost loop walks both ```C``` and ```A``` with stride 1 while ```B[k+j*n]``` stays fixed, which gives excellent spatial locality.
**Q2. Which ordering(s) perform the worst? Why?**
A2: For the larger matrices, the worst orderings are ```ikj``` and ```kij```: with ```j``` innermost, both ```C[i+j*n]``` and ```B[k+j*n]``` are accessed with stride ```n```, so almost every access misses the cache. (I got a different ranking for n = 100, where the matrices are small enough that most of the data stays cached.)
## Exercise 3
---
I referred to someone else's code to run this test, since I didn't manage to finish my own implementation.
```clike
for (int block2 = 0; block2 < n; block2 += blocksize) {
    for (int block1 = 0; block1 < n; block1 += blocksize) {
        for (int x = block2; x < block2 + blocksize && x < n; x++) {
            for (int y = block1; y < block1 + blocksize && y < n; y++) {
                dst[x + y * n] = src[y + x * n];
            }
        }
    }
}
```
First, I fixed the blocksize to 20 and varied n.
n = 100
```shell
./transpose 100 20
Testing naive transpose: 0.009 milliseconds
Testing transpose with blocking: 0.014 milliseconds
```
n = 1000
```shell
$ ./transpose 1000 20
Testing naive transpose: 1.302 milliseconds
Testing transpose with blocking: 0.926 milliseconds
```
n = 2000
```shell
$ ./transpose 2000 20
Testing naive transpose: 16.573 milliseconds
Testing transpose with blocking: 4.054 milliseconds
```
n = 5000
```shell
$ ./transpose 5000 20
Testing naive transpose: 145.306 milliseconds
Testing transpose with blocking: 32.478 milliseconds
```
n = 10000
```shell
$ ./transpose 10000 20
Testing naive transpose: 723.129 milliseconds
Testing transpose with blocking: 126.496 milliseconds
```
Then I fixed n to 10000 and varied the blocksize:
blocksize = 10
```shell
$ ./transpose 10000 10
Testing naive transpose: 1213.38 milliseconds
Testing transpose with blocking: 195.02 milliseconds
```
blocksize = 20
```shell
$ ./transpose 10000 20
Testing naive transpose: 682.792 milliseconds
Testing transpose with blocking: 145.704 milliseconds
```
blocksize = 50
```shell
$ ./transpose 10000 50
Testing naive transpose: 679.057 milliseconds
Testing transpose with blocking: 182.012 milliseconds
```
blocksize = 100
```shell
$ ./transpose 10000 100
Testing naive transpose: 710.215 milliseconds
Testing transpose with blocking: 143.774 milliseconds
```
blocksize = 500
```shell
$ ./transpose 10000 500
Testing naive transpose: 1057.32 milliseconds
Testing transpose with blocking: 183.054 milliseconds
```
blocksize = 1000
```shell
./transpose 10000 1000
Testing naive transpose: 836.307 milliseconds
Testing transpose with blocking: 202.642 milliseconds
```
blocksize = 5000
```shell
$ ./transpose 10000 5000
Testing naive transpose: 725.658 milliseconds
Testing transpose with blocking: 668.387 milliseconds
```
At first, increasing the blocksize reduces the runtime, because each tile is reused while it is still in the cache. Once the blocksize becomes too large, though, a tile no longer fits in the cache and the runtime starts to increase again.