# Rework CacheLab
In my term project I rework [Assignment4: Cache](https://hackmd.io/eBIh9LtJQguyBpSP7gYl3w?view) using the method provided in [CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/), and I also reference three homeworks written by [魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w), [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v), and [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD).
## Exercise 1 - A Couple of Memory Access Scenarios
We use the program `cache.s` from the [source code](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07) to observe how the cache behaves in the following three scenarios.
:::spoiler **Pseudocode:**
``` c=
int array[]; //Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) { // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                 // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;  // Option 1: Two cache accesses - read AND write
    }
}
```
:::
### Scenario 1:
Test environment : [Ripes](https://github.com/mortbopet/Ripes)
:::spoiler **Program Parameters**
* Array Size (a0): 128 (bytes)
* Step Size (a1): 8
* Rep Count (a2): 4
* Option (a3): 0
:::
Use the given program parameters to modify the assembly code in `cache.s`
```
main:   li a0, 128      # array size in BYTES (power of 2 < array size)
        li a1, 8        # step size (power of 2 > 0)
        li a2, 4        # rep count (int > 0)
        li a3, 0        # 0 - option 0, 1 - option 1
```
We also modify the exit code in `main`:
```
        li a7, 10       # exit
        ecall
```
The original code loads the system call number into `a0` (the [Venus](https://venus.cs61c.org/) convention), but Ripes follows the standard RISC-V convention:
* `a7` specifies the system call
* `a0` carries the system call parameter
:::spoiler **Cache Parameters**:
* Cache Levels: 1
* Block Size: 8
* Number of Blocks: 4
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::
Set the required cache parameters in Ripes (the listed parameters are written for [Venus](https://venus.cs61c.org/), so the Ripes settings differ slightly).

The cache then looks like this:

In Ripes a cache block is always 4 bytes wide, so to emulate one 8-byte block we use 2 of Ripes' 4-byte blocks and treat each pair of adjacent blocks as a single block.

In the picture above the hit rate is 0 with 16 misses. We can verify this by hand: 128 bytes / (8 ints × 4 bytes per int) = 4 accesses per repetition, and 4 repetitions give 16 accesses in total. Why does every access miss? The cache is direct-mapped and the step size (8 ints = 32 bytes) equals the cache size (4 blocks × 8 bytes = 32 bytes), so every accessed block maps to index 0 and evicts the previous one, while the other cache blocks sit idle.
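To check the index mapping by hand, here is a minimal sketch (the helper `cache_index` is mine, for illustration only):
``` c=
#include <stdio.h>

/* Direct-mapped index = (byte address / block size) mod number of blocks. */
unsigned cache_index(unsigned addr, unsigned block_size, unsigned num_blocks) {
    return (addr / block_size) % num_blocks;
}

int main(void) {
    /* Step size 8 ints = 32 bytes, so the accessed byte addresses are 0, 32, 64, 96. */
    for (unsigned addr = 0; addr < 128; addr += 32)
        printf("addr %3u -> index %u\n", addr, cache_index(addr, 8, 4));
    return 0;  /* prints index 0 for every address */
}
```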

If we change the placement policy to 4-way set associative, the hit rate rises to 0.75: only the first repetition's 4 accesses miss (compulsorily), and the remaining 12 accesses hit.

### Tasks
**Q1.**
What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
**A1.**
Because the cache size in bytes (4 blocks × 8 bytes = 32 bytes) is exactly equal to the step size in bytes (8 ints × 4 bytes = 32 bytes). Since the cache is direct-mapped, every access maps to the same index, so the hit rate is 0 in this case.
**Q2.**
What is our hit rate if we increase Rep Count arbitrarily? Why?
**A2.**
The hit rate does not change as we increase Rep Count: each repetition replays the same all-miss access pattern, so the only thing that changes is the number of misses.
**Q3.**
How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)
**A3.**
We can change the step size to 1. The cache then fills up completely, and each 8-byte block is missed once and then hit once, so the hit rate rises to 0.5.
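A minimal trace of the first few accesses with step size 1 (option 0, one write per int):
```
array[0]: MISS  (loads the 8-byte block holding ints 0-1)
array[1]: HIT
array[2]: MISS  (loads the block holding ints 2-3)
array[3]: HIT
...             -> hit rate 0.5
```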

### Scenario 2:
Test environment : [Ripes](https://github.com/mortbopet/Ripes)
:::spoiler **Program Parameters**
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
:::
Same as scenario 1, use the given program parameters to modify the assembly code in `cache.s`
```
main:   li a0, 256      # array size in BYTES (power of 2 < array size)
        li a1, 2        # step size (power of 2 > 0)
        li a2, 1        # rep count (int > 0)
        li a3, 1        # 0 - option 0, 1 - option 1
```
:::spoiler **Cache Parameters**:
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
:::
Set the required cache parameters in Ripes (again, the listed parameters are written for [Venus](https://venus.cs61c.org/), so the Ripes settings differ slightly).

### Tasks
**Q1.**
How many memory accesses are there per iteration of the inner loop?
**A1.**
Because option is set to 1, each iteration executes `array[index] = array[index] + 1` instead of `array[index] = 0`, i.e. one read followed by one write. So there are 2 memory accesses per iteration of the inner loop.
**Q2.**
What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
**A2.**
Miss-Hit-Hit-Hit.
The first access to each block is a compulsory miss, and the write to the same word then hits. Because a cache block is 4 words wide and the step size is 2, the following read and write (2 words later, still inside the same block) also hit.
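Concretely, for the first 16-byte block:
```
read  array[0]: MISS  (loads the block holding ints 0-3)
write array[0]: HIT
read  array[2]: HIT
write array[2]: HIT
read  array[4]: MISS  (next block) ...
```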
**Q3.**
This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
**A3.**
The hit rate is 0.75 because there are 3 hits in 4 accesses.
**Q4.**
Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
**A4.**
As Rep Count goes to infinity, the hit rate gets arbitrarily close to 1. There is no replacement in this case: the cache (16 blocks × 16 bytes = 256 bytes) holds the entire array, so the number of misses stays at 16 (all compulsory) while the number of hits keeps increasing.
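In numbers, one repetition makes 64 accesses (32 iterations × 2 accesses each), so with Rep Count $R$:

$$\text{hit rate} = \frac{64R - 16}{64R} \longrightarrow 1 \quad \text{as } R \to \infty$$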
Set Rep Count to 100:

**Q5.**
Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep.)
QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn’t matter if a Rep is applied to one element before it is applied to another.
**A5.**
**Method 1** (provided by 江松穎):
Since the array is that huge, we cannot keep every accessed address cached. Instead, we can make every “kind” of array access happen together: for element `arr[i]`, rather than advancing `i` each time we finish task `k`, we apply all the functions to `arr[i]` and only then move on to the next element. This way every array element suffers only 1 miss.
A sample in C:
``` c=
int arr[SIZE];
void (*func_ptr)(int *);

for (int i = 0; i < SIZE; i += stepsize) {
    for (int k = 0; k < rep_count; ++k) {
        next_func(&func_ptr);  // advance func_ptr to the next function that modifies/uses only arr[i]
        (*func_ptr)(&arr[i]);
    }
}
```
**Method 2** (provided by 魏晉成):
We should process the first 64 KB of the array at a time, applying all of the reads/writes to that chunk so we are completely done with it before moving on. That keeps the 64 KB hot in the cache and avoids having to circle back to it later.
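A minimal C sketch of this idea, assuming a 64 KiB cache; `apply_func(k, p)` is a hypothetical helper that applies the k-th function to the element at `p`:
``` c=
#include <stddef.h>

#define CHUNK_ELEMS (64 * 1024 / sizeof(int))  /* elements per 64 KiB chunk (assumed cache size) */

void apply_func(int k, int *p);  /* hypothetical: applies the k-th function to *p */

void process(int *arr, size_t size, int rep_count) {
    for (size_t base = 0; base < size; base += CHUNK_ELEMS)    /* one cache-sized chunk at a time */
        for (int k = 0; k < rep_count; ++k)                    /* apply every function while hot  */
            for (size_t i = base; i < base + CHUNK_ELEMS && i < size; ++i)
                apply_func(k, &arr[i]);
}
```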
### Scenario 3:
Test environment : [Venus](https://venus.cs61c.org/)
:::spoiler **Program Parameters**
* Array Size (a0): 128 (bytes)
* Step Size (a1): 1
* Rep Count (a2): 1
* Option (a3): 0
:::
Same as scenario 1, use the given program parameters to modify the assembly code in `cache.s`
```
main:   li a0, 128      # array size in BYTES (power of 2 < array size)
        li a1, 1        # step size (power of 2 > 0)
        li a2, 1        # rep count (int > 0)
        li a3, 0        # 0 - option 0, 1 - option 1
```
:::spoiler **Cache Parameters**:
* Cache Levels: 2
* L1 Cache Parameters:
    * Block Size: 8
    * Number of Blocks: 8
    * Placement Policy: Direct Mapped
    * Associativity: 1
    * Block Replacement Policy: LRU
* L2 Cache Parameters:
    * Block Size: 8
    * Number of Blocks: 16
    * Placement Policy: Direct Mapped
    * Associativity: 1
    * Block Replacement Policy: LRU
:::
The difference this time is that Ripes does not support multilevel caches, so we use Venus instead.
The L1 and L2 cache settings are shown below.
L1:

L2:

After running the program, the caches look like this:
L1:

L2:

In the result we can see 32 memory accesses to L1: 16 hits and 16 misses. Each L1 miss goes on to L2, but in this case L2 misses every time as well, so those accesses eventually go all the way to memory.
### Tasks
**Q1.**
What is the hit rate of our L1 cache? Our L2 cache? Overall?
**A1.**
The L1 hit rate is 0.5 and the L2 hit rate is 0. Using `hits / total accesses` across both caches, the overall hit rate is 16/(32+16) = 1/3.
(The referenced homeworks give two different answers here: counting only the 32 CPU accesses would give 16/32 = 0.5 instead, but I think 1/3 is the correct answer for this definition.)
**Q2.**
How many accesses do we have to the L1 cache total? How many of them are misses?
**A2.**
According to the results shown in Venus, there are 32 accesses to the L1 cache in total, and 16 of them are misses.
**Q3.**
How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
**A3.**
According to the results shown in Venus, there are 16 accesses to the L2 cache in total. Each one corresponds to an L1 miss: we only look in L2 when the block we want is not found in L1.
**Q4.**
What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
**A4.**
Increasing the Rep Count is an easy way to raise the L2 hit rate. In the first run every L2 miss is a compulsory miss; with more repetitions the blocks already sitting in L2 get reused (L2 is large enough to hold the whole 128-byte array, so from the second repetition on, every L1 miss hits in L2), while the L1 hit rate stays the same.
**Q5.**
What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
**A5.**
Increasing the number of L1 blocks alone does not affect the L1 or L2 hit rates, because all the misses are compulsory. If we also increase the Rep Count, the L1 hit rate rises (the enlarged L1 now holds the whole array), but the number of L1 misses does not grow, so the L2 hit rate stays the same.
**Increase number of blocks:**


**Increase Rep Count:**


Increasing the L1 block size to 16 changes the access pattern from M-H-M-H to M-H-H-H, so it definitely raises the L1 hit rate (from 0.5 to 0.75). The L2 cache, however, still faces the same problem: all of its misses are compulsory.
**Increase block size:**


## Exercise 2 - Loop Ordering and Matrix Multiplication
This exercise compares the performance of the six possible loop orderings of matrix multiplication.
:::spoiler **matrixMultiply.c**
``` c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation
   time. This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could
   unfairly bias results (lower cases in switches must run through more
   case statements on each iteration).
*/

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
                         1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");
    return 0;
}
```
:::
The code contains six orderings: `ijk`, `ikj`, `jik`, `jki`, `kij`, `kji`. The corresponding column and row access relationship looks like this
(figure provided by [林柏維](https://hackmd.io/lYbsD9D2Q2i-otiPCkB6SA?view#Exercise-2---Loop-Ordering-and-Matrix-Multiplication)):

Memory is one-dimensional, so the two-dimensional matrix is flattened into it like this
(figure provided by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)):
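For example, with n = 3 the element in row i, column j of A is stored at index `A[i + j*n]` (column-major, matching the code above), so memory holds the matrix one column after another:
```
A = | a00 a01 a02 |      memory: a00 a10 a20 | a01 a11 a21 | a02 a12 a22
    | a10 a11 a12 |               column 0     column 1      column 2
    | a20 a21 a22 |
```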

Use the following command to build the program (the second line is the `gcc` invocation that `make ex2` runs), then run `./matrixMultiply`:
```
make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
```
Running the test three times gives:
```
ijk: n = 1000, 1.746 Gflop/s
ikj: n = 1000, 0.154 Gflop/s
jik: n = 1000, 1.724 Gflop/s
jki: n = 1000, 10.591 Gflop/s
kij: n = 1000, 0.153 Gflop/s
kji: n = 1000, 9.020 Gflop/s
ijk: n = 1000, 1.840 Gflop/s
ikj: n = 1000, 0.151 Gflop/s
jik: n = 1000, 1.765 Gflop/s
jki: n = 1000, 11.146 Gflop/s
kij: n = 1000, 0.152 Gflop/s
kji: n = 1000, 9.207 Gflop/s
ijk: n = 1000, 1.841 Gflop/s
ikj: n = 1000, 0.149 Gflop/s
jik: n = 1000, 1.746 Gflop/s
jki: n = 1000, 10.151 Gflop/s
kij: n = 1000, 0.151 Gflop/s
kji: n = 1000, 9.475 Gflop/s
```
### Tasks
**Q1.**
Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
**A1.**
Orderings `jki` and `kji` perform best. The matrices are stored in column-major order, and the update is always `C[i+j*n] += A[i+k*n]*B[k+j*n]`; when `i` is the innermost index, the accesses to A and C advance by only one element per iteration instead of by n, so they suffer far fewer cache misses.
**Q2.**
Which ordering(s) perform the worst? Why?
**A2.**
On the other hand, when `j` is the innermost index, the accesses to B and C advance by n elements per iteration, which causes far more cache block replacements. So the worst orderings are `ikj` and `kij`.
**Q3.**
How does the way we stride through the matrices with respect to the innermost loop affect performance?
**A3.**
To get higher performance, the first thing to do is reduce the cache miss rate, so the innermost loop index is the key. Choose as the innermost loop the index that strides through two of the three matrices with unit stride (here `i`, which walks down a column of both A and C), and as the middle loop the index that gives unit stride for the remaining matrix (`k`, for B). That combination, `jki`, gives the lowest miss rate.
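To make the stride argument concrete, here is the `jki` ordering from `multMat4` again, annotated with each array's stride in the innermost loop (the `multMat4_annotated` name and the hoisting of `B[k+j*n]` into a local are my additions, not part of the original code):
``` c=
void multMat4_annotated( int n, float *A, float *B, float *C ) {
    int i,j,k;
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ ) {
            float b = B[k+j*n];           /* fixed while i varies             */
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*b;   /* A and C: stride 1 (column-major) */
        }
}
```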
The figure below (provided by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)) makes the point well:

## Exercise 3 - Cache Blocking and Matrix Transposition
### Exercise Description
In this exercise we need to implement a cache-blocked transpose in [transpose.c](https://github.com/61c-teach/su20-lab-starter/blob/master/lab07/transpose.c), provided by [Lab7](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07).
In this part I compare the code and the performance of the three implementations I referenced, to figure out what makes them different.
:::spoiler Code designed by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v):
``` c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int block_count = n / blocksize;
    for (int i = 0; i < block_count; ++i) {        // i := row block index
        for (int j = 0; j < block_count; ++j) {    // j := column block index
            for (int a = 0; a < blocksize; ++a) {
                for (int b = 0; b < blocksize; ++b) {
                    if (i * blocksize + a < n && j * blocksize + b < n) {
                        dst[(j * blocksize + b) + (i * blocksize + a) * n] =
                            src[(j * blocksize + b) * n + (i * blocksize + a)];
#ifdef EBUG
                        printf("Moving src[%d] ", (j * blocksize + b) * n + (i * blocksize + a));
                        printf("to dst[%d]", (j * blocksize + b) + (i * blocksize + a) * n);
                        putchar('\n');
#endif
                    }
                }
            }
        }
    }
}
```
:::
:::spoiler Performance:
./transpose 100 20
Testing naive transpose: 0.004 milliseconds
Testing transpose with blocking: 0.006 milliseconds
./transpose 1000 20
Testing naive transpose: 1.154 milliseconds
Testing transpose with blocking: 1.267 milliseconds
./transpose 5000 20
Testing naive transpose: 296.269 milliseconds
Testing transpose with blocking: 47.287 milliseconds
./transpose 10000 20
Testing naive transpose: 1434.96 milliseconds
Testing transpose with blocking: 250.862 milliseconds
:::
:::spoiler Code designed by [魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w):
``` c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int row = 0; row < n / blocksize; row++) {
        for (int col = 0; col < n / blocksize; col++) {
            int dst_base_offset = row * blocksize + col * blocksize * n,
                src_base_offset = col * blocksize + row * blocksize * n;
            for (int inner_row = 0; inner_row < blocksize; inner_row++) {
                for (int inner_col = 0; inner_col < blocksize; inner_col++) {
                    *(dst + dst_base_offset + inner_col * n + inner_row) =
                        *(src + src_base_offset + inner_row * n + inner_col);
                }
            }
        }
    }
}
```
:::
:::spoiler Performance:
./transpose 100 20
Testing naive transpose: 0.005 milliseconds
Testing transpose with blocking: 0.006 milliseconds
./transpose 1000 20
Testing naive transpose: 1.55 milliseconds
Testing transpose with blocking: 0.97 milliseconds
./transpose 5000 20
Testing naive transpose: 314.841 milliseconds
Testing transpose with blocking: 40.12 milliseconds
./transpose 10000 20
Testing naive transpose: 1471.05 milliseconds
Testing transpose with blocking: 175.337 milliseconds
:::
:::spoiler Code designed by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD):
``` c=
// Function transpose() transposes a (blocksize x blocksize) block in an (n x n) matrix.
// The start indices of the block must be specified.
void transpose(int IndexX, int IndexY, int blocksize, int n, int *dst, int *src) {
    for (int y = IndexY; y < (IndexY + blocksize); y++) {
        for (int x = IndexX; x < (IndexX + blocksize); x++) {
            if (y < n && x < n) {  // prevent any operation at an out-of-bounds index
                dst[y + x * n] = src[x + y * n];
            }
        }
    }
}

// Function transpose_blocking() iterates through the matrix block by block,
// transposing each block.
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int blockX = 0; blockX < n; blockX += blocksize) {
        for (int blockY = 0; blockY < n; blockY += blocksize) {
            transpose(blockX, blockY, blocksize, n, dst, src);
        }
    }
}
```
:::
:::spoiler Performance:
./transpose 100 20
Testing naive transpose: 0.005 milliseconds
Testing transpose with blocking: 0.012 milliseconds
./transpose 1000 20
Testing naive transpose: 1.105 milliseconds
Testing transpose with blocking: 1.713 milliseconds
./transpose 5000 20
Testing naive transpose: 302.005 milliseconds
Testing transpose with blocking: 66.458 milliseconds
./transpose 10000 20
Testing naive transpose: 1409.76 milliseconds
Testing transpose with blocking: 427.206 milliseconds
:::
All three implementations use the same blocking logic, but their performance differs.
The first code is the second fastest. Compared with the fastest (the second code), it recomputes the full index expressions and evaluates a bounds check (`if`) on every iteration of the innermost loop, whereas the second code hoists the per-block base offsets out of the two inner loops and computes them only once per block; I think that is why the second code performs better (see the sketch after this paragraph).
The third code is the slowest, but it has the clearest structure of the three: it factors the per-block transpose into its own function and calls it from `transpose_blocking`, avoiding a fourth nested loop. That is friendlier to the programmer and more convenient if we want to reuse it later.
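A sketch (not the authors' exact code) of the indexing difference between the first and second implementations:
``` c=
/* first code: full index arithmetic (and a bounds check) on every innermost iteration */
dst[(j * blocksize + b) + (i * blocksize + a) * n] =
    src[(j * blocksize + b) * n + (i * blocksize + a)];

/* second code: base offsets computed once per block; inner loops add small offsets */
int dst_base_offset = row * blocksize + col * blocksize * n;
int src_base_offset = col * blocksize + row * blocksize * n;
dst[dst_base_offset + inner_col * n + inner_row] =
    src[src_base_offset + inner_row * n + inner_col];
```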
### Part 1 - Changing Array Sizes
#### Tasks
**Q1.**
At what point does the cache blocked version of transpose become faster than the non-cache blocked version?
**A1.**
Based on the test results, the fastest implementation (the second) is already faster at `n = 1000`; for the other two, the crossover lies somewhere between `n = 1000` and `n = 5000`.
**Q2.**
Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
**A2.**
When the matrix is small, the naive transpose enjoys the same locality advantages as the blocked version, since the whole working set fits in the cache. As the matrix grows, consecutive accesses to the same block in the naive version get farther and farther apart, so blocks are evicted before they can be reused and the miss rate climbs.
### Part 2 - Changing Blocksize
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000 and record results.
:::spoiler Performance([江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v)):
./transpose 10000 50
Testing naive transpose: 1420.89 milliseconds
Testing transpose with blocking: 171.208 milliseconds
./transpose 10000 100
Testing naive transpose: 1436.73 milliseconds
Testing transpose with blocking: 146.55 milliseconds
./transpose 10000 500
Testing naive transpose: 1427.1 milliseconds
Testing transpose with blocking: 150.841 milliseconds
./transpose 10000 1000
Testing naive transpose: 1448.18 milliseconds
Testing transpose with blocking: 269.06 milliseconds
./transpose 10000 5000
Testing naive transpose: 1431.75 milliseconds
Testing transpose with blocking: 1306.34 milliseconds
:::
:::spoiler Performance([魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w)):
./transpose 10000 50
Testing naive transpose: 1415.84 milliseconds
Testing transpose with blocking: 154.443 milliseconds
./transpose 10000 100
Testing naive transpose: 1434.98 milliseconds
Testing transpose with blocking: 142.22 milliseconds
./transpose 10000 500
Testing naive transpose: 1424.07 milliseconds
Testing transpose with blocking: 184.678 milliseconds
./transpose 10000 1000
Testing naive transpose: 1428.87 milliseconds
Testing transpose with blocking: 238.894 milliseconds
./transpose 10000 5000
Testing naive transpose: 1452.11 milliseconds
Testing transpose with blocking: 1179.47 milliseconds
:::
:::spoiler Performance([呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD)):
./transpose 10000 50
Testing naive transpose: 1452.17 milliseconds
Testing transpose with blocking: 340.687 milliseconds
./transpose 10000 100
Testing naive transpose: 1523.58 milliseconds
Testing transpose with blocking: 263.248 milliseconds
./transpose 10000 500
Testing naive transpose: 1461.54 milliseconds
Testing transpose with blocking: 192.446 milliseconds
./transpose 10000 1000
Testing naive transpose: 1408.23 milliseconds
Testing transpose with blocking: 254.349 milliseconds
./transpose 10000 5000
Testing naive transpose: 1407.35 milliseconds
Testing transpose with blocking: 1122.16 milliseconds
:::
### Tasks
**Q1.**
How does performance change as blocksize increases? Why is this the case?
**A1.**
In all three cases performance first improves as blocksize grows, peaking around blocksize 100-500, and then degrades. When the blocks get too big, a blocksize × blocksize tile no longer fits in the cache (at blocksize 5000, a single tile of ints is already about 100 MB), so the accesses within one tile start evicting each other and the benefit of blocking is lost; at blocksize 5000 the blocked version is barely faster than the naive one.
I think this picture (provided by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v)) makes the point:

## Reference
[江松穎-Homework4](https://hackmd.io/@Uduru0522/rJ-NUxC3v)
[魏晉成-Homework4](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w?view)
[呂紹樺-Homework4](https://hackmd.io/@horseradish1208/HkzkUEAiD)
[CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/)
[Ripes](https://github.com/mortbopet/Ripes)