Assignment 4: Cache
===
[TOC]
### Exercise 1
#### Scenario 1
:::info
Program Parameters: (set these by initializing the a registers in the code)
* Array Size (a0): 128 (bytes)
* Step Size (a1): 8
* Rep Count (a2): 4
* Option (a3): 0
Cache Parameters: (set these in the Cache tab)
* Block Size: 8
* Number of Blocks: 4
* Placement Policy: Direct Mapped
* Block Replacement Policy: LRU
:::

(Fig. 1) Cache Setup
This figure shows the cache configured as direct mapped with a line (block) size of 8 bytes and 4 lines.
Since `one block` in the Ripes cache simulator means one 4-byte word, I set `one line` to 2 blocks to satisfy the 8-byte line requirement.

(Fig. 2) Cold Miss
This figure shows that the first memory access misses because the cache is cold (a compulsory miss).

(Fig. 3) Miss due to same line index
This figure shows that in a direct-mapped cache, two memory accesses that map to the same index with different tags cause a cache miss.

(Fig. 4) Overall miss numbers
This figure shows the overall miss count. Every memory access misses, and there are 128 (bytes) / (8 (words/step) × 4 (bytes/word)) = 4 accesses per repetition with 4 repetitions in total, so 4 × 4 = 16 `Misses`. As for `Writebacks`: with the `Write-allocate` policy, a write miss allocates a line in the cache and marks it dirty instead of writing to memory immediately. The first miss therefore allocates a line without any writeback, while each later miss must first write the currently cached (dirty) line back to memory before loading the new one. The last dirty line is never evicted during the run, so there are `the number of misses - 1` writebacks in total.

(Fig. 5) Overall miss numbers w/o write-allocate
By comparison, if we turn write-allocate off, a write miss sends the data directly to memory without allocating a line, so every miss produces a memory write and the number of writebacks equals the number of misses.
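To sanity-check these counts, here is a minimal throwaway sketch of this scenario as a direct-mapped write-back cache simulator in C (my own code, not part of the assignment; it assumes the array starts at byte address 0 and that option 0 issues one store per touched element, as the write-allocate discussion above implies). Flipping `write_allocate` to `false` reproduces the Fig. 5 case.
```c
#include <stdio.h>
#include <stdbool.h>

#define LINE_BYTES 8            /* 2 words per line */
#define NUM_LINES  4            /* direct mapped    */

int main(void) {
    unsigned tag[NUM_LINES] = {0};
    bool valid[NUM_LINES] = {false}, dirty[NUM_LINES] = {false};
    bool write_allocate = true;            /* set to false for the Fig. 5 case */

    int array_bytes = 128, step_words = 8, reps = 4;
    int accesses = 0, misses = 0, writebacks = 0;

    for (int r = 0; r < reps; r++) {
        for (int addr = 0; addr < array_bytes; addr += step_words * 4) {
            accesses++;
            int idx = (addr / LINE_BYTES) % NUM_LINES;
            unsigned t = (unsigned)(addr / LINE_BYTES / NUM_LINES);
            if (valid[idx] && tag[idx] == t) {
                dirty[idx] = true;                          /* write hit */
            } else if (write_allocate) {
                misses++;
                if (valid[idx] && dirty[idx]) writebacks++; /* evict dirty line */
                valid[idx] = true; tag[idx] = t; dirty[idx] = true;
            } else {
                misses++;
                writebacks++;            /* write goes straight to memory */
            }
        }
    }
    printf("accesses=%d misses=%d writebacks=%d\n", accesses, misses, writebacks);
    /* prints accesses=16 misses=16 writebacks=15 (writebacks=16 without allocate) */
    return 0;
}
```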
:::success
1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
Ans: Because the `cache size in bytes` (4 lines × 8 bytes = 32 bytes) is exactly equal to the `step size in bytes` (8 words × 4 bytes = 32 bytes), every access maps to index 0 of the direct-mapped cache with a different tag, so each access evicts the previously cached block. The hit rate is therefore 0%: n accesses produce n misses.
2. What is our hit rate if we increase Rep Count arbitrarily? Why?
Ans: As mentioned above, this configuration produces `n misses` for `n accesses`, so the hit rate stays at 0% no matter how large the Rep Count becomes.
3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.
Ans: If we change the `step size` to `6` (words), the hit rate rises to 6 hits out of 24 total accesses. There are now ceil(128 / (6 × 4)) = 6 accesses per repetition, mapping to cache indices 0, 3, 2, 1, 0, 3 (see the index sketch after this block). Lines 0 and 3 are shared by two addresses with different tags and keep thrashing, but lines 1 and 2 each hold data from a single address, so after their cold misses in the first repetition they hit on every later repetition, giving `2 * (4 - 1) = 6` hits.

(Fig. 6) Overall hit and miss rate after changing the step size to 6 words.
:::
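As a quick check of the index claim in Q3 above, this tiny loop prints the direct-mapped index that each step-size-6 access maps to (my own snippet; it assumes the array starts at byte address 0).
```c
#include <stdio.h>

int main(void) {
    /* 4 lines of 8 bytes each, direct mapped; a step of 6 words = 24 bytes */
    for (int addr = 0; addr < 128; addr += 6 * 4)
        printf("byte offset %3d -> index %d\n", addr, (addr / 8) % 4);
    /* indices come out as 0, 3, 2, 1, 0, 3: lines 0 and 3 thrash,
       while lines 1 and 2 are reused on every later repetition     */
    return 0;
}
```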
#### Scenario 2
:::info
Program Parameters: (set these by initializing the a registers in the code)
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
Cache Parameters: (set these in the Cache tab)
* Block Size: 16
* Number of Blocks: 16
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
:::

(Fig. 7)
This figure shows the cache configured as 4-way set associative with a line (block) size of 16 bytes and 16 blocks.
Since one block in the Ripes simulator means 4 bytes, I set each line to 4 blocks to satisfy the 16-byte line requirement, and for 4-way set associativity I reduce the index by 2 bits.

(Fig. 8)
This figure shows a compulsory miss occurring.

(Fig. 9)
This figure shows a write hit occurring after the first line of data has been cached.

(Fig. 10)
This figure shows a read hit occurring, since the step size is only 2 words while the line size is 4 words.

(Fig. 11)
This figure shows a write hit occurring after the corresponding memory location got a read hit on the previous memory access.
However, under both the write-back and write-through policies, block 2 of each line in the cache view is not updated until the very end of the simulation. You can observe this in the next figure, for the next step's array access, in which set 0, index 0, block 2 is still 0.
I haven't figured out the mechanism behind this behavior of the cache; it may simply be a defect in Ripes.

(Fig. 12)
A compulsory miss occurs for the next step.
Thus, for each pair of array elements that share a cache line there is one compulsory read miss, one read hit, and two write hits, i.e. a 75% hit rate within each line.
Each line covers 2 element accesses, since consecutive elements are 2 words apart and a line is 4 words. So after 2 (accesses/line) × 4 (sets) = 8 element accesses, we are 16 words = 64 bytes past the first access.
Because there are 2 index bits and 4 block-offset bits (64 bytes per way in total), element accesses 9–16 map back to the same four sets with a new tag and are placed in the next way (shown as index 1) of each set.

(Fig. 13)
After 256 (bytes) / (2 (words/step) × 4 (bytes/word)) = 32 element accesses, the words at offsets 0 and 2 of every line have been touched, all 16 lines of the cache are in use, and every access hits except for the one compulsory miss per line.
Therefore, the overall cache hit rate is 75%.
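Before the questions, here is a minimal sketch I used to confirm the pattern (my own code; it assumes the array starts at address 0 and that option 1 performs a read followed by a write on each element, as described above). Because the 256-byte array exactly fills the 256-byte cache, no line is ever evicted, so associativity and LRU never come into play and a single presence bit per line is enough.
```c
#include <stdio.h>
#include <stdbool.h>

int main(void) {
    /* Scenario 2: 256-byte array, step = 2 words, option 1 (read then write),
       16-byte lines.  The whole array fits in the cache, so we only need a
       presence bit per line: a miss simply marks the line as cached.          */
    bool cached[256 / 16] = {false};
    int hits = 0, misses = 0;

    for (int addr = 0; addr < 256; addr += 2 * 4) {   /* one array element */
        int line = addr / 16;
        for (int op = 0; op < 2; op++) {              /* read, then write  */
            if (cached[line]) hits++;
            else { misses++; cached[line] = true; }   /* allocate on miss  */
        }
    }
    printf("hits=%d misses=%d hit rate=%.0f%%\n",
           hits, misses, 100.0 * hits / (hits + misses));
    /* prints hits=48 misses=16 hit rate=75% -- the Miss-Hit-Hit-Hit pattern */
    return 0;
}
```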
:::success
1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
Ans: The inner loop performs 256 (bytes) / 4 (bytes/word) / 2 (words/step) \* 2 (one read + one write per element) = 64 memory accesses.
2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
Ans: Miss-Hit-Hit-Hit.
The first access to an element is a compulsory read miss, followed by a write hit to the same address. Since the cache line is 4 words wide, the next element (2 words away) is already cached by that first fetch, so its read and write both hit.
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
Ans: The hit rate is 75% because of the Miss-Hit-Hit-Hit pattern: 3 hits out of every 4 memory accesses.
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
Ans: The hit rate approaches 100% as Rep Count goes to infinity, as Figure 14 below shows. The reason is that the 256-byte array fits entirely in the 256-byte cache, so after the first repetition every access in every later repetition is a hit.

(Fig. 14) With the repetition count raised to 1000, the hit rate is 0.9998.
5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following:
Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).
QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.
ANSWER: We should process the array one cache-sized chunk at a time (for example, its `first 64 KiB` if the cache is 64 KiB) and apply every Rep's reads and writes to that chunk before moving on, so we are completely done with it, thereby keeping that chunk hot in the cache and never having to circle back to it later (see the sketch after this block).
:::
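The sketch below shows the restructured loop described in the answer above. The names, the `rep_fn` typedef, and the `chunk` parameter are my own placeholders, not from the assignment; the only point is the loop order: sweep one cache-sized chunk through all Reps before moving on to the next chunk.
```c
typedef void (*rep_fn)(int *);   /* one "function" applied per Rep */

/* Hypothetical restructuring: instead of sweeping the whole array once per
   Rep, sweep one cache-sized chunk through all Reps before moving on.      */
void process_blocked(int *arr, int n, rep_fn *apply, int repcount, int chunk) {
    for (int base = 0; base < n; base += chunk) {
        int end = (base + chunk < n) ? base + chunk : n;
        for (int rep = 0; rep < repcount; rep++)   /* all Reps on this chunk   */
            for (int i = base; i < end; i++)       /* chunk stays hot in cache */
                apply[rep](&arr[i]);
    }
}
```
With `chunk * sizeof(arr[0])` no larger than the cache, each chunk is loaded once and then reused Rep Count times, which is exactly what gave this scenario its high hit rate.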
#### Scenario 3
:::info
Program Parameters: (set these by initializing the a registers in the code)
* Array Size (a0): 128 (bytes)
* Step Size (a1): 1
* Rep Count (a2): 1
* Option (a3): 0
Cache Parameters: (set these in the Cache tab)
* Cache Levels: 2
* L1 Cache Parameters:
* Block Size: 8
* Number of Blocks: 8
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
* L2 Cache Parameters:
* Block Size: 8
* Number of Blocks: 16
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::
Because the Ripes simulator has no L2 cache, I used the Venus simulator provided with the CS61C course for the two-level cache simulation.

(Fig. 15) Setup of L1 Cache

(Fig. 16) Setup of L2 Cache

(Fig. 17) Status of L1 Cache after First Access

(Fig. 18) Status of L2 Cache after First Access
On the first write access, both the L1 and L2 caches suffer a compulsory miss. With the write-allocate policy, both caches allocate the line at index 0 for the missed write.

(Fig. 19) Status of L1 Cache after Second Access

(Fig. 20) Status of L2 Cache after Second Access
On the second write access, L1 hits because the step size is only 1 word and L1 has a line size of 2 words. Since L1 did not miss, L2 is not accessed at all.

(Fig. 21) Status of L1 Cache after Third Access

(Fig. 22) Status of L2 Cache after Third Access
On the third write access, both L1 and L2 suffer a compulsory miss, because a line in either cache only covers 2 steps in this configuration. Both caches therefore take a write miss and allocate the data in the line at index 1.
Since the line size of both L1 and L2 is 2 words, this pattern repeats exactly as the first and second accesses above demonstrate.

(Fig. 23) Status of L1 Cache after All Accesses

(Fig. 24) Status of L2 Cache after All Accesses
After all 128 (bytes) / 4 (bytes/word) / 1 (word/step) = 32 memory accesses, L1 hits on 16 of the 32 accesses thanks to spatial locality, while L2 hits on 0 of the 16 accesses it sees (the L1 misses), because every block that L1 misses on is also being seen by L2 for the first time.
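To double-check these numbers without Venus, here is a minimal sketch of two direct-mapped, write-allocate levels chained together (my own throwaway code; it assumes the array starts at byte address 0 and one access per word for option 0). Wrapping the address loop in an outer Rep Count loop and varying the `l1` parameters is an easy way to replay questions 4 and 5 below.
```c
#include <stdio.h>
#include <stdbool.h>

typedef struct {                 /* one direct-mapped cache level */
    int line_bytes, num_lines;
    unsigned tag[64];
    bool valid[64];
    int hits, misses;
} Cache;

/* Returns true on a hit; on a miss the line is (write-)allocated. */
static bool probe(Cache *c, int addr) {
    int idx = (addr / c->line_bytes) % c->num_lines;
    unsigned t = (unsigned)(addr / c->line_bytes / c->num_lines);
    if (c->valid[idx] && c->tag[idx] == t) { c->hits++; return true; }
    c->misses++; c->valid[idx] = true; c->tag[idx] = t;
    return false;
}

int main(void) {
    Cache l1 = { .line_bytes = 8, .num_lines = 8  };   /* 64 B  */
    Cache l2 = { .line_bytes = 8, .num_lines = 16 };   /* 128 B */
    for (int addr = 0; addr < 128; addr += 4)          /* 32 accesses, step = 1 word */
        if (!probe(&l1, addr))                         /* L2 only sees L1 misses */
            probe(&l2, addr);
    printf("L1: %d/%d hits   L2: %d/%d hits\n",
           l1.hits, l1.hits + l1.misses, l2.hits, l2.hits + l2.misses);
    /* prints  L1: 16/32 hits   L2: 0/16 hits */
    return 0;
}
```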
:::success
1. What is the hit rate of our L1 cache? Our L2 cache? Overall?
Ans: The L1 cache has a 50% hit rate and the L2 cache a 0% hit rate. Overall, 16 of the 32 accesses hit somewhere, so the overall hit rate is also 50%.
2. How many accesses do we have to the L1 cache total? How many of them are misses?
Ans: 32 accesses go to the L1 cache, and 16 of them are misses.
3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
Ans: 16 accesses go to the L2 cache, because the L1 cache misses 16 times; L2 is only accessed when L1 misses.
4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
Ans: Increasing the `a2` argument, the repetition count, raises the L2 hit rate while leaving the L1 hit rate the same. L2 holds twice as many lines as L1, so with the write-allocate policy the first repetition leaves the entire array cached in L2, while L1 can only retain the second half of the array by the end of the repetition. Every later repetition therefore still misses in L1 on the lines it has evicted, but those misses now find their data in L2, so the L2 hit rate rises.
5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
Ans: Increasing the number of L1 blocks to 16 or 32 does not affect the L1 hit rate in the first repetition, because the miss-hit-miss-hit... pattern comes from compulsory misses, which a larger cache cannot avoid. However, if the repetition count is greater than one, the L1 hit rate increases, because all the data accessed in the first repetition now fits in L1 and stays cached; the L2 hit rate stays at zero since L1 produces no further misses.

(Fig. 25) First access of 16-byte-per-block L1 cache

(Fig. 26) Second access of 16-byte-per-block L1 cache

(Fig. 27) Third access of 16-byte-per-block L1 cache

(Fig. 28) Fourth access of 16-byte-per-block L1 cache

(Fig. 29) Fifth access of 16-byte-per-block L1 cache

(Fig. 30) Overall hit rate of 16-byte-per-block L1 cache
With the larger block size, the L1 access pattern in the first repetition becomes miss-hit-hit-hit, which raises the L1 hit rate to 75%. If the repetition count is greater than 1, accesses in the second and later repetitions hit 100% of the time, because the increased overall L1 capacity now holds the whole array.
:::
### Exercise 2
In Exercise 2, the main task is to compare the speed of matrix multiplication under different orderings of the loop indices i, j, and k. The meanings of i, j, and k are explained below.
First of all, we would have a 2-d matrix multiplication as (Fig. 31).

(Fig. 31) 2-d Matrices Multiplication
However, memory is one-dimensional, and with the column-major layout used in this exercise the elements are spread through memory in the order shown below. The letter above every two or three elements is the column index in the corresponding 2-D matrix, and the letter above that is the row index.

(Fig. 32) 2-d Matrices Scattered in 1-d Memory
Finally, we can discuss how the indices i, j, and k drive this multiplication. (Fig. 33) below shows how the two matrices A and B are multiplied to produce matrix C in **i, j, k** order.
First, we compute `C[0] += A[0] * B[0]`.
On the second step, k is incremented by 1 and the operation becomes `C[0] += A[2] * B[1]`.
On the third step, k is incremented again and the operation becomes `C[0] += A[4] * B[2]`.
Since k has reached the end of its dimension, j is incremented by 1 and k is reset, making the operation `C[2] += A[0] * B[3]`.
This stepping of k and j continues until both reach the ends of their dimensions; then i is incremented by 1 and both j and k are reset to 0, making the corresponding operation `C[1] += A[1] * B[0]`.

(Fig. 33) Matrices Multiplication in ijk order
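For reference, the loop nest this walkthrough steps through looks like the sketch below. I am reconstructing the indexing from the element numbers above rather than copying `matrixMultiply.c`, and using square n-by-n matrices with `double` elements (the figures use a small rectangular example for illustration).
```c
/* ijk ordering with column-major storage: element (row r, col c) lives at [r + c*n]. */
void multiply_ijk(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)            /* row of C and A        */
        for (int j = 0; j < n; j++)        /* column of C and B     */
            for (int k = 0; k < n; k++)    /* column of A, row of B */
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```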
With the provided program `matrixMultiply.c` and makefile, the results measured on an Ubuntu virtual machine are as follows.
:::success
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 1000, 2.351 Gflop/s
ikj: n = 1000, 1.032 Gflop/s
jik: n = 1000, 2.015 Gflop/s
jki: n = 1000, 19.703 Gflop/s
kij: n = 1000, 1.631 Gflop/s
kji: n = 1000, 15.266 Gflop/s
:::
:::success
1. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
Ans: The **jki** ordering wins for the 1000-by-1000 multiplication, because it exploits the spatial locality of the column-major layout the most. As (Fig. 34) shows, matrix A is walked with a constant one-element stride in the inner loop, while the entries of B and C in use stay within a contiguous fragment of memory until the outermost loop advances j. This stems from the fact that i is the row index of both A and C and k is the row index of B: with column-major storage, stepping a row index in the inner loops walks contiguous memory, which creates significant spatial locality. This also explains why the **kji** ordering performs second best.

(Fig. 34) Order jki Matrices Multiplication
2. Which ordering(s) perform the worst? Why?
Ans: The **ikj** ordering performs the worst, and the reason follows from the findings in Q1. In ikj the innermost index j is the column index of both B and C, and the middle index k is the column index of A; with column-major storage, stepping a column index jumps a whole column (n elements) at a time, which produces a large number of cache misses and thrashing.
3. How does the way we stride through the matrices with respect to the innermost loop affect performance?
Ans: With column-major storage, the innermost loop index should be the row index of two of the matrices, so that consecutive iterations walk contiguous memory (unit stride), and ideally the middle loop index is the row index of the remaining matrix (as k is for B in jki). With row-major storage it is the reverse: the innermost index should be the column index of two of the matrices. The annotated loop nests after this block make the strides explicit.
:::
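To make the stride argument in Q1–Q3 concrete, here are the best and worst orderings from the measurements above with their inner-loop strides annotated (same reconstructed column-major indexing as the ijk sketch earlier, not copied from `matrixMultiply.c`).
```c
/* jki: fastest ordering here.  The inner loop walks i, the row index, so
   C[i + j*n] and A[i + k*n] advance by one element per iteration (unit
   stride) and B[k + j*n] stays constant -- excellent spatial locality.   */
void multiply_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j * n];           /* invariant in the inner loop */
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * b;
        }
}

/* ikj: slowest ordering here.  The inner loop walks j, the column index, so
   C[i + j*n] and B[k + j*n] jump n elements (one whole column) per
   iteration, touching a different cache line almost every time.          */
void multiply_ikj(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i + k * n];           /* invariant in the inner loop */
            for (int j = 0; j < n; j++)
                C[i + j * n] += a * B[k + j * n];
        }
}
```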
### Exercise 3
Exercise 3 compares a `Blocking Transpose` against the naive transpose in speed. Below is my code for `transpose_blocking`.
```c
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    /* Walk the matrix tile by tile (assumes n is a multiple of blocksize). */
    for (int row = 0; row < n / blocksize; row++) {
        for (int col = 0; col < n / blocksize; col++) {
            /* Top-left corners of the destination and source tiles. */
            int dst_base_offset = row * blocksize + col * blocksize * n,
                src_base_offset = col * blocksize + row * blocksize * n;
            /* Transpose one blocksize-by-blocksize tile. */
            for (int inner_row = 0; inner_row < blocksize; inner_row++) {
                for (int inner_col = 0; inner_col < blocksize; inner_col++) {
                    *(dst + dst_base_offset + inner_col * n + inner_row) =
                        *(src + src_base_offset + inner_row * n + inner_col);
                }
            }
        }
    }
}
```
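Below is a small correctness check I could pair with the blocked code above (the `transpose_naive` reference here is my own reconstruction, not the lab's transpose.c harness, which does the actual timing). It relies on the assumption baked into `transpose_blocking` that n is a multiple of blocksize, which holds for every size tested in the tasks below.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reference transpose: dst[j + i*n] = src[i + j*n]. */
static void transpose_naive(int n, int *dst, int *src) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j + i * n] = src[i + j * n];
}

int main(void) {   /* assumes transpose_blocking above is in the same file */
    int n = 1000, blocksize = 20;
    int *src = malloc(n * n * sizeof *src);
    int *a = malloc(n * n * sizeof *a);
    int *b = malloc(n * n * sizeof *b);
    for (int i = 0; i < n * n; i++) src[i] = i;

    transpose_naive(n, a, src);
    transpose_blocking(n, blocksize, b, src);
    printf("blocked transpose %s the naive one\n",
           memcmp(a, b, n * n * sizeof *a) == 0 ? "matches" : "DIFFERS from");
    free(src); free(a); free(b);
    return 0;
}
```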
#### Task 1
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000 and record results.
:::success
**Result**
(each item: n, blocksize)
1. 100, 20:
Testing **naive** transpose: **0.004 milliseconds**
Testing transpose **with blocking**: **0.004 milliseconds**
2. 1000, 20:
Testing **naive** transpose: **0.848 milliseconds**
Testing transpose **with blocking**: **1.575 milliseconds**
3. 2000, 20:
Testing **naive** transpose: **5.861 milliseconds**
Testing transpose **with blocking**: **6.837 milliseconds**
4. 5000, 20:
Testing **naive** transpose: **194.746 milliseconds**
Testing transpose **with blocking**: **48.854 milliseconds**
5. 10000, 20:
Testing **naive** transpose: **1574.14 milliseconds**
Testing transpose **with blocking**: **206.995 milliseconds**
:::
:::success
**Discussion**
1. At what point does cache blocked version of transpose become faster than the non-cache blocked version?
Ans: From the results above, the blocked version is still slower at n = 1000 and n = 2000; it overtakes the naive transpose somewhere between n = 2000 and n = 5000, and by n = 5000 it is roughly 4× faster.
2. Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
Ans: When the matrix is small it fits (or nearly fits) in the cache, so even the naive transpose's strided column accesses mostly hit and blocking only adds loop overhead. Once the matrix is much larger than the cache, the naive version misses on almost every strided access, while blocking keeps one small tile of the source and destination resident in the cache, so the blocked version outperforms the naive one (a quick size table is sketched after this block).
:::
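To put the crossover point in perspective (the actual cache sizes of the test VM were not measured, so treat the comparison as an assumption), the per-matrix footprints for the tested sizes are easy to compute: the crossover between n = 2000 and n = 5000 lines up with the matrices growing from a few MiB, which can still largely live in a last-level cache, to around 100 MiB, which cannot.
```c
#include <stdio.h>

int main(void) {
    /* footprint of one n-by-n int matrix: n * n * sizeof(int) bytes
       (the transpose touches two of them, src and dst)              */
    long ns[] = {100, 1000, 2000, 5000, 10000};
    for (int i = 0; i < 5; i++) {
        long n = ns[i];
        printf("n = %5ld -> %9.2f MiB per matrix\n",
               n, (double)n * n * sizeof(int) / (1024.0 * 1024.0));
    }
    return 0;
}
/* n=100 -> 0.04 MiB, n=1000 -> 3.81 MiB, n=2000 -> 15.26 MiB,
   n=5000 -> 95.37 MiB, n=10000 -> 381.47 MiB                    */
```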
#### Task 2
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000 and record results.
:::success
**Result**
1. 10000, 50
Testing **naive** transpose: **1267.05 milliseconds**
Testing transpose **with blocking**: **127.405 milliseconds**
2. 10000, 100
Testing **naive** transpose: **1266.66 milliseconds**
Testing transpose **with blocking**: **112.469 milliseconds**
3. 10000, 500
Testing **naive** transpose: **1272.4 milliseconds**
Testing transpose **with blocking**: **91.299 milliseconds**
4. 10000, 1000
Testing **naive** transpose: **1267 milliseconds**
Testing transpose **with blocking**: **118.664 milliseconds**
5. 10000, 5000
Testing **naive** transpose: **1247.8 milliseconds**
Testing transpose **with blocking**: **1007.04 milliseconds**
:::
:::success
**Discussion**
1. How does performance change as blocksize increases? Why is this the case?
From blocksize 50 up to 500, the execution time decreases as the block size increases; from 500 up to 5000, it increases again.
I suppose that when the block size is too small, each fetched cache line is only partially used before it risks being evicted (a form of thrashing) and we pay more loop overhead; that around 500×500 the pair of tiles best fits the cache; and that beyond 500×500 the tiles no longer fit, so the hit rate drops and the behaviour approaches the naive transpose. A rough footprint estimate is sketched after this block.
:::
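A rough back-of-the-envelope for the sweet spot suggested above (my own estimate; the cache sizes of the test machine were not measured): one block transposes a blocksize-by-blocksize source tile into a same-sized destination tile, so its working set is roughly 2 × blocksize² × sizeof(int) bytes.
```c
#include <stdio.h>

int main(void) {
    /* approximate working set of one tile pair: 2 * b * b * sizeof(int) bytes */
    long sizes[] = {50, 100, 500, 1000, 5000};
    for (int i = 0; i < 5; i++) {
        long b = sizes[i];
        printf("blocksize %4ld -> ~%.2f MiB per tile pair\n",
               b, 2.0 * b * b * sizeof(int) / (1024.0 * 1024.0));
    }
    return 0;
}
/* 50 -> ~0.02 MiB, 500 -> ~1.91 MiB, 5000 -> ~190.73 MiB: around 500 the two
   tiles can still fit in a last-level cache of a few MiB; far above that they cannot. */
```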