# Term Project : Cache simulation and case study
contributed by < [`YaRu056`](https://github.com/YaRu056/2023_Computuer_architecture.git) >、< [`ChengChiTing`](https://github.com/ChengChiTing) >
In our term project, we follow the instructions of [Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/) in CS 61C and finish the tasks via [Venus](https://venus.cs61c.org/). There are three main purposes in our project:
1. Analyze how memory access patterns determine cache hit rates.
2. Analyze and discover which memory access patterns produce GOOD hit rates.
3. Analyze hit rates for caches and be able to optimize code accesses to produce good hit rates.
## Exercise 1 - A Couple of Memory Access Scenarios
### Scenario 1:
:::info
**Program Parameters:**
* Array Size (`a0`): 128 (bytes)
* Step Size (`a1`): 8
* Rep Count (`a2`): 4
* Option (`a3`): 0
**Cache Parameters:**
* Cache Levels: 1
* Block Size: 8
* Number of Blocks: 4
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::
* Test environment: [Ripes](https://github.com/mortbopet/Ripes), [Venus](https://venus.cs61c.org/)
* The layout of the cache can be seen here:

* Set the required cache parameters in Ripes:

* After running the program, the status of the cache becomes:
| Venus | Ripes |
| :--------: | :--------: |
|  |  |
We can see that the blocks at Index 1, 2, and 3 are all empty, which means that every block replacement occurs at Index 0. We call this phenomenon the ping-pong effect: alternating requests map into the same cache slot and keep evicting each other.
| Hit (1:hit ,0: miss) | Miss count |
| :--------: | :--------: |
|  |  |
Total misses: 16 (all 16 accesses miss)
### Tasks
1. What combination of parameters is producing the hit rate you observe?
>Because the step size in bytes (8 words × 4 bytes = 32 bytes) is exactly equal to the cache size (4 blocks × 8 bytes = 32 bytes), every access maps to the same index. Since the cache is direct mapped, each access evicts the block brought in by the previous one, so the hit rate is 0 in this case. (A small sketch of this index calculation appears after this task list.)
2. What is our hit rate if we increase Rep Count arbitrarily? Why?
>The hit rate does not change when we increase Rep Count, because each repetition simply repeats the same access pattern. The only change in the result is that the total number of misses grows.
>We increase Rep Count to `20`; the result is shown in the following figure.
>
>Total misses: 80, hit rate: 0
3. How could we modify one program parameter to increase our hit rate?
**Note:** we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.
>We can change the step size to 1. In this case we make use of every word in each cache block (spatial locality), so the hit rate increases to 0.5. Because the step size is smaller, the total number of accesses, and therefore the absolute number of misses, also increases. (See the sketch after this task list.)
>
>| Venus | Ripes |
>| :--------: | :--------: |
>|  |  |
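As a sanity check, here is a minimal C sketch (not part of the lab code; it assumes 4-byte words and this scenario's cache parameters) that reproduces both observations: with the 32-byte step every access maps to set 0, while with a one-word step the accesses alternate miss/hit, giving a 0.5 hit rate.
```c=
#include <stdio.h>

/* Sketch only: mimics the direct-mapped index calculation of this scenario
 * (block size 8 bytes, 4 blocks, 4-byte words). */
int main(void) {
    const int block_size = 8, num_blocks = 4, array_bytes = 128;

    /* Step size 8 words (32 bytes): every access lands in set 0 -> ping-pong, hit rate 0. */
    for (int addr = 0; addr < array_bytes; addr += 8 * 4)
        printf("step 8: byte %3d -> set %d\n", addr, (addr / block_size) % num_blocks);

    /* Step size 1 word (4 bytes): only the first word of each block misses -> hit rate 0.5. */
    int hits = 0, total = 0;
    for (int addr = 0; addr < array_bytes; addr += 4) {
        hits += (addr % block_size) != 0;   /* the second word of each block is a hit */
        total++;
    }
    printf("step 1: hit rate = %.2f\n", (double)hits / total);
    return 0;
}
```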
### Scenario 2:
:::info
**Program Parameters:**
* Array Size (`a0`): 256 (bytes)
* Step Size (`a1`): 2
* Rep Count (`a2`): 1
* Option (`a3`): 1
**Cache Parameters:**
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
:::
* Test environment: [Ripes](https://github.com/mortbopet/Ripes), [Venus](https://venus.cs61c.org/)
* The layout of the cache can be seen here:

* Set the required cache parameters in Ripes:

* After running the program, the status of the cache becomes:
| Venus | Ripes |
| :--------: | :--------: |
|  |  |
Misses occur at access counts `0`, `4`, `8`, `12`, `16`, `20`, `24`, `28`, `32`, `36`, `40`, `44`, `48`, `52`, `56`, `60` (every fourth access).
Total misses: 16
All of these misses are compulsory misses.
| Hit (1:hit ,0: miss) | Miss count |
| :--------: | :--------: |
|  | |
### Tasks
1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
>Because the option is 1, each iteration reads the element and then writes the incremented value back, so there are 2 memory accesses (a read and a write) per iteration.
> ```C=
>if (option == 0)
>    array[index] = 0;                  // Option 0: one cache access - write
>else
>    array[index] = array[index] + 1;   // Option 1: two cache accesses - read AND write
>```
2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
>Repeating pattern: Miss -> Hit -> Hit -> Hit
>The first read of a block is a **compulsory miss**, and the write to the same element then hits. Because the cache block size is 4 words and the step size is 2 words, the read and the write of the next element fall in the same block and also hit.
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
>Hit rate = $\frac{\text{cache hits}}{\text{cache hits} + \text{cache misses}} \times 100\%$
>= $\frac{3}{4} \times 100\%$
>= 75%
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
>We increase Rep Count to `10`; the result is shown in the following figure.
>
>Except for the compulsory misses, all of the remaining accesses are hits, and the hit rate increases.
>Because there is no replacement in this case (the whole 256-byte array fits in the 256-byte cache), the number of misses stays at 16 (the compulsory misses) while the number of hits keeps increasing, so the hit rate approaches 1 as Rep Count grows.
>We can see the data in the cache in the following figure.
>
5. Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).
QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.
>We should process the array in cache-sized chunks: bring one chunk into the cache, apply **all of the Rep Count functions** (all the R/W accesses) to that chunk while it is still resident, and only then move on to the next chunk. That keeps the chunk hot in the cache, so we never have to circle back to it later on. (A minimal sketch of this restructuring follows below.)
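A minimal sketch of that restructuring in C, under stated assumptions: `apply_fn` is a hypothetical stand-in for the per-Rep function, and `CHUNK` is chosen so that `CHUNK * sizeof(int)` fits in the cache.
```c=
#include <stddef.h>

#define CHUNK 64   /* assumption: CHUNK * sizeof(int) fits in the cache */

/* Hypothetical stand-in for the function applied during repetition `rep`. */
static int apply_fn(int rep, int x) { return x + rep; }

void process_blocked(int *arr, size_t n, int repcount) {
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (start + CHUNK < n) ? start + CHUNK : n;
        /* finish every Rep on this chunk before touching the next chunk */
        for (int rep = 0; rep < repcount; rep++)
            for (size_t i = start; i < end; i++)
                arr[i] = apply_fn(rep, arr[i]);
    }
}
```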
### Scenario 3:
:::info
**Program Parameters:**
* Array Size (`a0`): 128 (bytes)
* Step Size (`a1`): 1
* Rep Count (`a2`): 1
* Option (`a3`): 0
**Cache Parameters:**
* Cache Levels: 2
* Block Size: 8
* Number of Blocks: 8
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::
| L1 | L2 |
| :--------: | :--------: |
|  |  |
### Tasks
1. What is the hit rate of our L1 cache? Our L2 cache? Overall?
>The hit rate of our L1 cache is 0.5 and the hit rate of our L2 cache is 0. The overall hit rate is $0.5 + (1 - 0.5) \times 0 = 0.5$. (A small check of these numbers appears after this task list.)
2. How many accesses do we have to the L1 cache total? How many of them are misses?
>We have 32 accesses to the L1 cache and 16 misses.
3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
>We have 16 accesses to the L2 cache. When the L1 cache misses, the request goes down to the L2 cache to check whether it hits or misses there. Therefore, the number of L1 misses equals the number of L2 accesses.
4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
>We can increase the block size of the L2 cache; each L2 miss then brings more neighbouring elements into the cache, which increases the probability of later L2 hits while leaving the L1 behaviour unchanged.
5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
>If we increase the number of blocks in the L1 cache, the hit rates of L1 and L2 do not change (with Rep Count 1, every block is still brought in exactly once). If we instead increase the L1 block size, the L1 hit rate increases and the number of L2 accesses decreases accordingly.
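A quick back-of-the-envelope check of the numbers above, sketched in C (assumes 4-byte words and Rep Count 1):
```c=
#include <stdio.h>

int main(void) {
    const int array_bytes = 128, word = 4, step_words = 1, l1_block = 8;

    int accesses  = array_bytes / word / step_words;   /* 32 L1 accesses        */
    int l1_misses = accesses / (l1_block / word);      /* 1 miss per block = 16 */

    printf("L1: %d accesses, %d misses, hit rate %.2f\n",
           accesses, l1_misses, 1.0 - (double)l1_misses / accesses);
    printf("L2: %d accesses (= L1 misses), all compulsory -> hit rate 0\n",
           l1_misses);
    return 0;
}
```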
## Exercise 2 - Loop Ordering and Matrix Multiplication
Cache design for matrix element access can be optimized by following these methods :
* Access the matrix elements in the order in which they are laid out in memory (row-major or column-major, matching the storage format), so that consecutive accesses fall in the same cache line; this takes advantage of the CPU's cache and reduces cache misses.
* Transpose the matrix so that the rows become columns and the columns become rows. The arrangement of the matrix elements in memory then matches the access pattern better, which can improve access efficiency.
* Use space-filling curves to optimize the order of matrix element accesses, thereby improving the cache hit rate.
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:
```C=
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
>Fact: Matrix multiplication operations are at the heart of many linear algebra algorithms, and efficient matrix multiplication is critical for many applications within the applied sciences.
>
In the above code, note that the loops are ordered `i`, `j`, `k`. If we examine the innermost loop (the one that increments `k`), we see that it strides through `B` with step 1, through `A` with step `n`, and does not move through `C` at all (step 0). The step sizes for all six loop orderings are summarized below:
| loop order | Step size across B | Step size across A | Step size across C |
| :----------: | :---------------------------: | :---------------------------: | :---------------------------: |
| ijk | 1 | n | 0 |
| ikj | n | 0 | n |
| jik | 1 | n | 0 |
| jki | 0 | 1 | 1 |
| kij | n | 0 | n |
| kji | 0 | 1 | 1 |
The unit "Gflop/s" reads, "giga floating-point operations per second."
THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!
```text
ijk: n = 1000, 3.864 Gflop/s
ikj: n = 1000, 1.620 Gflop/s
jik: n = 1000, 3.774 Gflop/s
jki: n = 1000, 17.336 Gflop/s
kij: n = 1000, 1.205 Gflop/s
kji: n = 1000, 13.328 Gflop/s
```
### Tasks
1. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
> `jki` (with `kji` close behind).
> When `i` is the innermost index, arrays A and C are walked with stride `1` instead of `n` on every iteration, so they get far more cache hits. (A C sketch of the `jki` ordering appears after this task list.)
> Looking at the order of updates for a 2*2 matrix:
>:::spoiler jki
>`C[0] += A[0]*B[0]`
>`C[1] += A[1]*B[0]`
>`C[0] += A[2]*B[1]`
>`C[1] += A[3]*B[1]`
>`C[2] += A[0]*B[2]`
>`C[3] += A[1]*B[2]`
>`C[2] += A[2]*B[3]`
>`C[3] += A[3]*B[3]`
>:::
>
2. Which ordering(s) perform the worst? Why?
>`kij` (with `ikj` close behind).
>When `j` is the innermost index, arrays B and C are walked with stride `n` on every iteration, which causes many more cache block replacements.
> Looking at the order of updates for a 2*2 matrix:
> ::: spoiler kij
>`C[0] += A[0]*B[0]`
>`C[2] += A[0]*B[2]`
>`C[1] += A[1]*B[0]`
>`C[3] += A[1]*B[2]`
>`C[0] += A[2]*B[1]`
>`C[2] += A[2]*B[3]`
>`C[1] += A[3]*B[1]`
>`C[3] += A[3]*B[3]`
>:::
3. How does the way we stride through the matrices with respect to the innermost loop affect performance?
>The stride of the innermost loop through each matrix determines **the cache hit rate**, and therefore the performance of the program. **The index of the innermost loop** is the key: we should choose as the innermost loop the index that is the stride-1 index (the row index in this column-major layout) of as many matrices as possible. Here `i` is the stride-1 index for both A and C, which is why `jki` and `kji` give the lowest miss rates and the best performance.
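For reference, a minimal sketch of the best-performing `jki` ordering under the same column-major layout (the `double` element type is an assumption; this is not the lab's benchmark code):
```c=
/* Same column-major multiply as the ijk version above, but in jki order:
 * the innermost i loop walks both A and C with stride 1. */
void matmul_jki(int n, const double *A, const double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j*n];              /* constant while i varies */
            for (int i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * b;   /* stride-1 accesses to A and C */
        }
}
```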
### In different cache parameters
We rewrite the C code into assembly and set the cache parameters according to the three scenarios mentioned in Exercise 1. We test the program under these different conditions and then summarize the critical points that affect the hit rate.
::: spoiler Assembly code runs in Venus
``` assembly=
.data
arr: .word 0
length: .word 16
.text
la s0, arr # load the address of arr into s0
add s7,x0,s0 # set s7 as addr A[0]
lw s1, length # load length into s1
add s10, x0,s1
lop:
addi t4, s1, 0
loop1:
addi t2, x0, 0 # set t2 as 0
addi t3, s1, 0
addi t4, t4, -1
loop2:
sw t2, 0(s0) # save t2 value into the address of arr[0] pointed to by s0
addi t2, t2, 1 # t2 plus 1
addi s0, s0, 4 # s0 points to arr[i+1]
addi t3, t3, -1
bnez t3, loop2
bnez t4, loop1
addi s10,s10,-1
beqz s10,Next
addi s0, s0, 4 # set s8 as addr B[0]
add s8,s0,x0
j lop
Next:
addi s0, s0, 4 # set s9 as C[0]
add s9,s0,x0
add s10, x0,s1
addi s4, x0, 0 # set j as 0 s4=j
j:
addi s5, x0, 0 # set k as 0 s5=k
k:
addi s6, x0, 0 # set i as 0 s6=i
i:
slli t0,s4,4
add t0,t0,s5
slli t0,t0,2
add t0,t0,s8
lw t2,0(t0) # B[k+j*n]
slli t0,s5,4
add t0,t0,s6
slli t0,t0,2
add t0,t0,s7
lw t1,0(t0) # A[i+k*n]
slli t0,s4,4
add t0,t0,s6
slli t0,t0,2
add t0,t0,s9
lw t4,0(t0) # C[i+j*n]
mul t3,t1,t2
add t4,t4,t3
sw t4,0(t0) # C[i+j*n]
addi s6, s6,1 #i++
blt s6, s10, i # i < n or not
addi s5, s5, 1 # k++
blt s5, s10, k # k < n or not
addi s4, s4, 1 # j++
blt s4, s10, j # j < n
li a0,10 # exit
ecall
```
:::
| Scenario | Cache Parameters | L1 hit rate | L2 hit rate | L1GMR | L2GMR |
|:--------:|:-----------------------------------------------------------------------------------------------------------------:|:----------------------------:|:-----------:|:-----:|:-----:|
| 1 |  | $\frac{7072}{16384}$ = 0.43 | | | |
| 2 |  | $\frac{15231}{16384}$ = 0.93 | | | |
| 3 |  | 0.46 | 0 | 0.54 | 1 |
| 3-1 | Based on Scenario 3, we increase the block size to 16 bytes in the L2 cache. | 0.46 | 0.37 | 0.54 | 0.2 |
| 3-2 | Based on Scenario 3, we increase the number of blocks to 16 in the L2 cache. | 0.46 | 0.31 | 0.54 | 0.37 |
| 3-3 | Based on Scenario 3, we increase both the block size to 16 bytes and the number of blocks to 16 in the L2 cache. | 0.46 | 0.67 | 0.54 | 0.18 |
| 4 |  | $\frac{15231}{16384}$ = 0.93 | | | |
## Exercise 3 - Cache Blocking and Matrix Transposition
In this exercise, we finish the C code in `transpose.c` and compare the performance difference between the naive matrix transposition and the cache-blocked version. Furthermore, we rewrite the C code into assembly and run it in Venus to observe the hit ratio under different conditions.
### Implementation
The test code provided by [Lab7](https://github.com/61c-teach/fa22-lab-starter/tree/main/lab07) is divided into two parts. One is `transpose.c`, in which we have to finish the transpose_blocking function. The other is `test_transpose.c`, which we use to test whether cache blocking speeds up matrix transposition.
#### Matrix Transposition
Swapping the rows and columns of a matrix is called a `transposition`. The transpose of matrix A is often denoted $A^T$. The naive implementation of transposing a matrix would look like the following:

#### Row-Major Order
In this implementation, we create a one-dimensional array to represent the matrix and assume row-major ordering. Therefore, the order of the elements stored in memory is shown below:

:::info
matrix A = $\begin{bmatrix}
0 &1&2 &3& 4 \\
5 &6&7 &8& 9 \\
10 &11&12 &13& 14 \\
15 &16&17 &18& 19 \\
20 &21&22 &23& 24
\end{bmatrix}$
=> array A ={0, 1, 2, 3, 4 ,5 ,6 ,7 ,8 ,9 , 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24}
:::
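A minimal sketch of the index mapping this row-major assumption implies (the helper name `at` is ours, not part of the lab code):
```c=
/* Row-major: element (row, col) of an n-by-n matrix lives at index row*n + col. */
static inline int at(const int *A, int n, int row, int col) {
    return A[row * n + col];
}
```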
#### C code
The naive implementation of transposing a matrix would look like the following:



We access an entire row of `src[]` and an entire column of `dst[]` in one pass of the inner loop: we load each element from the original array and store it into the corresponding place of the target array.
The code shown below is the transpose_naive function.
transpose_naive
```c=
void transpose_naive(int n, int blocksize, int *dst, int *src) {
    for (int x = 0; x < n; x++) {
        for (int y = 0; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
}
```
However, if we check the hit/miss pattern of the cache during the first iterations of the inner loop, it looks like the following:
(Assume we have a fully associative cache with 16 lines and a line size of 16 bytes.)

The hit rate when accessing the source matrix is 75% which is as good as it can be with the given cache configuration. (Every element is 4 bytes, thus every line can store 4 elements.) But the hit rate when accessing the destination matrix is 0.
If we want to improve the performance of our cache usage, we can use a technique called `cache blocking`.
#### Cache Blocking
Cache blocking is a technique that attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. To achieve this, we divide the matrix into several blocks and transpose the matrix one "block" at a time.
We make sure that the number of bytes within the width of a block fits into an entire cache line. The following images show three iterations of transposing the matrix one block at a time.



Let's evaluate how cache blocking would perform with the same cache configuration given earlier (fully associative cache with 16 lines and a line size of 16 bytes).



We can see that cache blocking makes much better use of spatial and temporal locality and therefore improves the hit rate of our cache which improves the performance of our program.
#### C code
Nevertheless, when we referred to prior work (see the references), we found a bug in the code: it assumes that n (the matrix dimension) is a multiple of the block size, so the blocked loops never touch the last `n mod blocksize` rows and columns. If n is not divisible by the block size, part of the destination matrix is never written and the program outputs the error message.
For example, with n = 5000 and blocksize = 23:
```shell
Error!!!! Transpose does not result in correct answer!!
```
:::warning
More explanations required.
:::
Accordingly, we modify the for-loop conditions so the code runs correctly for any n. The revised code is shown below:
transpose_blocking
```c=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int counter = n / blocksize;
    for (int x = 0; x <= counter; ++x) {
        for (int y = 0; y <= counter; ++y) {
            for (int i = 0; i < blocksize; ++i) {
                for (int j = 0; j < blocksize; ++j) {
                    if (x * blocksize + i < n && y * blocksize + j < n) {
                        dst[(x * blocksize + i) * n + (y * blocksize + j)] =
                            src[(y * blocksize + j) * n + (x * blocksize + i)];
                    }
                }
            }
        }
    }
}
```
If we execute the test program again, we can get the following output:
```shell
testing n = 5000, blocksize = 23
naive: 263.546 milliseconds
blocking: 52.044 milliseconds
Speedup sufficient
```
### Calculate execution time
In the first test, we run the test program and check its execution time. As mentioned previously, cache blocking has a higher hit ratio, so it spends less time accessing memory and therefore has a shorter execution time.
The test program is shown below:
:::spoiler test program
```c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
#include "transpose.h"

double benchmark(int *A, int *B, int n, int blocksize,
                 void (*transpose)(int, int, int*, int*), char *description) {
    int i, j;

    /* initialize A,B to random integers */
    srand48( time( NULL ) );
    for( i = 0; i < n*n; i++ ) A[i] = lrand48( );
    for( i = 0; i < n*n; i++ ) B[i] = lrand48( );

    /* measure performance */
    struct timeval start, end;
    gettimeofday( &start, NULL );
    transpose( n, blocksize, B, A );
    gettimeofday( &end, NULL );
    double seconds = (end.tv_sec - start.tv_sec) +
                     1.0e-6 * (end.tv_usec - start.tv_usec);

    /* check correctness */
    for( i = 0; i < n; i++ ) {
        for( j = 0; j < n; j++ ) {
            if( B[j+i*n] != A[i+j*n] ) {
                printf("Error!!!! Transpose does not result in correct answer!!\n");
                exit( -1 );
            }
        }
    }
    return seconds*1e3;
}

int main( int argc, char **argv ) {
    int n = 5000;
    int blocksize = 23;

    /* allocate an n*n block of integers for the matrices */
    int *A = (int*)malloc( n*n*sizeof(int) );
    int *B = (int*)malloc( n*n*sizeof(int) );

    /* run tests */
    double time1 = benchmark(A, B, n, blocksize, transpose_naive, "naive transpose");
    double time2 = benchmark(A, B, n, blocksize, transpose_blocking, "transpose with blocking");

    /* release resources */
    free( A );
    free( B );

    printf("testing n = %d, blocksize = %d\n", n, blocksize);
    printf("naive: %g milliseconds\n", time1);
    printf("blocking: %g milliseconds\n", time2);
    if ((time1 - time2) < 200) {
        printf("insufficient speedup\n");
        return -1;
    }
    printf("Speedup sufficient\n");
    return 0;
}
```
:::
#### Part 1 - Changing Array Sizes
In this part, we fix the blocksize to 20 and run our code with n equal to 100, 1000, 2000, 5000, and 10000, where `n` is the matrix dimension.
block_size = 20
| n | 100 | 1000 | 2000 | 5000 | 10000 |
| -------- | -------- | -------- | -------- | -------- | -------- |
| naive (ms) | 0.004 | 0.808 | 14.658 | 215.845 | 1397.67 |
| blocking (ms) | 0.007 | 1.619 | 4.948 | 42.749 | 179.327 |
<font color="#f00">Red line : naive</font> ; <font color="#00f">Blue line : blocking</font>

#### Task
1. At what point does cache blocked version of transpose become faster than the non-cache blocked version?
>At about n = 2000, the cache-blocked version becomes faster than the non-cache-blocked version.
2. Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
>With a smaller matrix, most of the matrix can probably fit in the cache, so blocking mostly adds overhead (the cache-blocked version has four nested for loops) and doesn't provide a significant increase in the cache hit rate.
>As the matrix gets bigger, parts of it can no longer fit in the cache, leading to more cache misses in the non-blocked code. The blocked code consolidates accesses to a smaller portion of memory at a time, giving more cache hits.
#### Part 2 - Changing Blocksize
In this part, we fix n to 10000 and run our code with blocksize equal to 50, 100, 500, 1000, and 5000.
n = 10000
|block size| 50 | 100 | 500 | 1000 | 5000 |
| -------- | -------- | -------- | -------- | -------- | -------- |
| naive (ms) | 1863.71 | 1892.9 | 1889.23 | 1953.22 | 1912.84 |
| blocking (ms) | 268.094 | 174.333 | 164.624 | 201.054 | 1274.78 |
<font color="#f00">Red line : naive</font> ; <font color="#00f">Blue line : blocking</font>

#### Task
1. How does performance change as blocksize increases? Why is this the case?
>As the blocksize increases, the blocked version first gets faster (from about 268 ms at blocksize 50 down to about 165 ms at blocksize 500) and then gets slower again (201 ms at 1000 and 1275 ms at 5000). Once a block no longer fits in the cache, the accesses within a block start missing just like in the naive version, so the benefit of blocking disappears.
>Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.
### Observe hit ratio
After comparing the execution time of the cache-blocked and non-cache-blocked versions, we want to compare their hit ratios. Therefore, we rewrite the C code into assembly and run the programs in Venus. We assume our cache is fully associative with 16 lines and a line size of 16 bytes. Considering the memory available in Venus, the largest matrix dimension we test is 60.
#### Assembly code
Our test code can be separated into two parts. The first part generates an n*n matrix. The elements in every row are set to {0, 1, 2, ...}, thus if we set `n = 5`, we get the following array.
:::info
matrix A = $\begin{bmatrix}
0 &1&2 &3& 4 \\
0 &1&2 &3& 4 \\
0 &1&2 &3& 4 \\
0 &1&2 &3& 4 \\
0 &1&2 &3& 4
\end{bmatrix}$
=> array A ={0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4}
:::
The second part transposes the matrix. As mentioned previously, we assume the matrix is stored in row-major order. If we transpose the matrix above, we get the following matrix and array.
:::info
matrix A = $\begin{bmatrix}
0 &0&0 &0& 0 \\
1 &1&1 &1& 1 \\
2 &2&2 &2& 2 \\
3 &3&3 &3& 3 \\
4 &4&4 &4& 4
\end{bmatrix}$
=> array A ={0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4}
:::
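For readability, here is roughly what the two assembly parts below compute, sketched in C (the function names are ours, not from the lab):
```c=
/* Sketch of what the assembly below does (row-major n*n matrix of ints). */
void generate(int n, int *A) {
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            A[r * n + c] = c;                 /* every row is {0, 1, ..., n-1} */
}

void transpose(int n, const int *src, int *dst) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];  /* same as transpose_naive */
}
```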
:::spoiler transpose_naive assembly code
```assembly=
.data
arr: .word
length: .word 50 # n * n matrix
.text
lw s0, length # load length into s0
la s1, arr # load the address of array1 into s1
# check address of arr
print:
addi a0, x0, 1 # argument to ecall to execute print integer
addi a1, s1, 0 # argument to ecall, the value to be printed
ecall
#generate matrix
addi t0, s0, 0 # set t0(row) as n
loop1:
addi t1, s0, 0 # set t1(col) as n
addi t2, x0, 0 # set t2(value) as 0
addi t0, t0, -1
loop2:
sw t2, 0(s1) # save value into the address of array1[i] pointed to by s1
addi t2, t2, 1 # value plus 1
addi s1, s1, 4 # s1 points to array1[i+1]
addi t1, t1, -1
bnez t1, loop2
bnez t0, loop1
# set s2 as arry2[0]
addi s2, s1, 4
addi s3, s2, 0 # copy address of array2[0] into s3
#matrix transposition
naive_transpose:
addi t1, x0, 0 # set t1(x) as 0
loop4:
addi t2, x0, 0 # set t2(y) as 0
loop5:
addi s2, s3, 0
la s1, arr
mul t3, t2, s0 # src[x + y * n]
add t3, t3, t1
slli t3, t3, 2
add s1, s1, t3
mul t4, t1, s0 # dst[y + x * n]
add t4, t4, t2
slli t4, t4, 2
add s2, s2, t4
lw t5, 0(s1) # load src[] data
sw t5, 0(s2) # save dst[] data
addi t2, t2, 1 # y++
blt t2, s0, loop5 # y < n or not
addi t1, t1, 1 # x++
blt t1, s0, loop4 # x < n or not
exit:
li a0, 10 # exit
ecall
```
:::
:::spoiler transpose_blocking assembly code
```assembly=
.data
arr: .word
block_size: .word 10
length: .word 50
.text
la s0, arr # load the address of arr into s0
lw s1, length # load length into s1
lw s4, block_size # load block_size into s4
print:
addi a0, x0, 1 # argument to ecall to execute print integer
addi a1, s0, 0 # argument to ecall, the value to be printed
ecall
#generate matrix
addi t4, s1, 0
loop1:
addi t2, x0, 0 # set t2 as 0
addi t3, s1, 0
addi t4, t4, -1
loop2:
sw t2, 0(s0) # save t2 value into the address of arr[i] pointed to by s0
addi t2, t2, 1 # t2 plus 1
addi s0, s0, 4 # s0 point to arr[i+1]
addi t3, t3, -1
bnez t3, loop2
bnez t4, loop1
addi s3, s0, 4 # set s3 as array2[0]
#blocking_matrix transposition
blocking_transpose:
addi t1, s1, 0 # set t1 as n
div s5, t1, s4 # set s5 as counter
addi s5, s5, 1 # counter plus 1
addi s6, x0, 0 # set x as 0
loop6:
addi s7, x0, 0 # set y as 0
loop7:
addi t2, x0, 0 # set i as 0
loop4:
addi t3, x0, 0 # set j as 0
loop5:
mul t4, s4, s6 # x * blocksize + i
add t4, t4, t2
mul t5, s4, s7 # y * blocksize + j
add t5, t5, t3
bge t4, s1, not_transpose
bge t5, s1, not_transpose
addi s2, s3, 0 # set s2 as array2[0]
la s0, arr # set s0 as array1[0]
mul t4, s7, s4 # (y * block_size + j) * n
add t4, t4, t3
mul t4, t4, s1
mul s8, s6, s4 # (x * blocksize + i)
add s8, s8, t2
add t4, t4, s8
slli t4, t4, 2
add s0, s0, t4
mul t5, s6, s4 # (x * block_size + i) * n
add t5, t5, t2
mul t5, t5, s1
mul s8, s7, s4 # (y * blocksize + j)
add s8, s8, t3
add t5, t5, s8
slli t5, t5, 2
add s2, s2, t5
lw t6, 0(s0)
sw t6, 0(s2)
not_transpose:
addi t3, t3, 1 # j++
blt t3, s4, loop5 # j < block_size or not
addi t2, t2, 1 # i++
blt t2, s4, loop4 # i < block_size or not
addi s7, s7, 1 # y++
blt s7, s5, loop7 # y < counter
addi s6, s6, 1 # x++
blt s6, s5, loop6 # x < counter
exit:
li a0, 10 # exit
ecall
```
:::
#### Part 1 - Changing Array Sizes
In this part, we fix the blocksize to 4 and run our code with n equal to 20, 30, 40, 50, and 60, where `n` is the matrix dimension.
block_size = 4
| n | 20 | 30 | 40 | 50 | 60 |
| -------- | -------- | -------- | -------- | -------- | -------- |
| naive (hit ratio) | 0.374 | 0.374 | 0.375 | 0.375 | 0.375 |
| blocking (hit ratio) | 0.730 | 0.670 | 0.740 | 0.677 | 0.743 |
<font color="#f00">Red line : naive</font> ; <font color="#00f">Blue line : blocking</font>

As the results above show, the hit ratio stays around 0.74 when n = 20, 40, 60, but drops to about 0.67 when n = 30, 50. We think the reason is that 30 and 50 are not divisible by 4, so some extra cache misses occur at the block boundaries when we execute the transpose_blocking code.
#### Part 2 - Changing Blocksize
In this part, we fix n to 60 and run our code with blocksize equal to 2, 4, 6, 8, and 10.
n = 60
|block size| 2 | 4 | 6 | 8 | 10 |
| -------- | -------- | -------- | -------- | -------- | -------- |
| naive (hit ratio) | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 |
| blocking (hit ratio) | 0.631 | 0.744 | 0.627 | 0.685 | 0.676 |
<font color="#f00">Red line : naive</font> ; <font color="#00f">Blue line : blocking</font>

As the results above show, the hit ratio is highest when blocksize = 4. We think the reason is the cache line size: the line size is 16 bytes (4 words), so a blocksize of 4 matches one line exactly. If the block size is too small, a block does not make full use of each cache line; if it is too big, the data touched within one block exceeds the cache, so lines are replaced before they are reused, cache misses increase, and the hit ratio decreases correspondingly.
## Reference
[CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/)
[Venus](https://github.com/kvakil/venus)
[Reference1](https://hackmd.io/VQqoaTVXROKX6OPz-TcKfQ?view#Exercise-3---Cache-Blocking-and-Matrix-Transposition)
[Reference2](https://hackmd.io/@r1YLxwFRRPe1xninh0Ma6w/SJRrVgQcj)