# Assignment4: Cache
###### tags: `RISC-V`
[Lab7](https://cs61c.org/su20/labs/lab07/)
[source code](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07)
## Exercise 1 - A Couple of Memory Access Scenarios
:::info
`cache.s`
SUMMARY OF REGISTER USE:
a0 = array size in bytes
a1 = step size
a2 = number of times to repeat (rep count)
a3 = 0 (W) / 1 (RW) (option)
s0 = moving array ptr
s1 = array limit (ptr)
:::
```c=
int array[];
for (int i = 0; i < a2; ++i)                          // rep count
    for (int index = 0; index < a0; index += a1) {    // step through the array
        if (a3 == 0)
            array[index] = 0;                         // option 0: write only
        else
            array[index] = array[index] + 1;          // option 1: read AND write
    }
```
### Scenario 1
Test environment : [Ripes](https://github.com/mortbopet/Ripes)
:::spoiler Program Parameters
* Array Size (a0): 128 (bytes)
* Step Size (a1): 8
* Rep Count (a2): 4
* Option (a3): 0
:::
modify `.text`
```c=7
main:   li a0, 128      # array size in BYTES (power of 2 < array size)
        li a1, 8        # step size (power of 2 > 0)
        li a2, 4        # rep count (int > 0)
        li a3, 0        # 0 - option 0, 1 - option 1
```
:::spoiler Cache Parameters
* Cache Levels: 1
* Block Size: 8
* Number of Blocks: 4
* Enable?: Should be green
* Placement Policy: Direct Mapped
* Associativity: 1
* Block Replacement Policy: LRU
:::

|tag|index|offset|
|---|-----|------|
|27 |2 |3 |
> Because the word size in [Ripes](https://github.com/mortbopet/Ripes) is fixed at 4 bytes, an 8-byte block is configured as 2 words per line; the 4 lines give $2^2$ index entries.

#### Tasks
##### Task1
What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)
:::success
The hit rate is $0$ because the step size in bytes ($8 \times 4 = 32$) is exactly equal to the cache size in bytes ($4 \times 8 = 32$). Every access therefore maps to the same index with a different tag, so every access misses.
Bit breakdown (the block size is 1 word in [ripes](https://github.com/mortbopet/Ripes), so I set 2 words per line to simulate scenario 1):
$\therefore$ offset = $\log_2 8 = 3$
The placement is direct mapped.
$\therefore$ index = $\log_2(4/1) = 2$
tag = $32 - 3 - 2 = 27$
:::
##### Task2
What is our hit rate if we increase Rep Count arbitrarily? Why?
:::success
Since the cache is empty at the beginning, the first `sw` to each block is a [compulsory miss](https://en.wikipedia.org/wiki/Cache_performance_measurement_and_metric#Compulsory_misses).

|address | tag | index | H/M |
| -------- | -------- | ----- | --- |
|0x10000000| 0x800000 | 0x0 | M |
|0x10000020| 0x800001 | 0x0 | M |
|0x10000040| 0x800002 | 0x0 | M |
|0x10000060| 0x800003 | 0x0 | M |
To check an access, I first take the **index** to find which line of the cache table the data should be in. For example, if `index = 0`, I look at `line 0` of the cache table. Then I compare tags: if the tag stored in the table matches mine, the access hits; otherwise it misses.
The step size is 8 `words`, i.e. $8 \times 4 = 32$ bytes.
The address increases by `32` in each iteration of `wordloop`, and the loop runs while the address is below the `array size`, so the inner loop has `4` iterations.
After those `4` iterations the inner loop exits, and the outer loop moves the pointer back by `128` to the start of the array for the next rep.
At the first access of the next rep, the **index** and **offset** are the same as before (both map to `Block0` with offset `0`), but the **tag** differs from the one currently cached, so the access misses again. The same happens for every access, so the hit rate stays $0$ no matter how large Rep Count grows.
:::
##### Task3
How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)
:::success
Change the step size (`a1`) to `1`. Consecutive words then fall within the same 8-byte block (2 words), so every second access hits and the hit rate rises to $0.5$.
:::
### Scenario 2
Test environment : [Ripes](https://github.com/mortbopet/Ripes)
:::spoiler Program Parameters
* Array Size (a0): 256 (bytes)
* Step Size (a1): 2
* Rep Count (a2): 1
* Option (a3): 1
:::
modify `.text`
```c=7
main:   li a0, 256      # array size in BYTES (power of 2 < array size)
        li a1, 2        # step size (power of 2 > 0)
        li a2, 1        # rep count (int > 0)
        li a3, 1        # 0 - option 0, 1 - option 1
```
:::spoiler Cache Parameters
* Cache Levels: 1
* Block Size: 16
* Number of Blocks: 16
* Enable?: Should be green
* Placement Policy: N-Way Set Associative
* Associativity: 4
* Block Replacement Policy: LRU
:::

|tag|index|offset|
|---|-----|------|
|26 |2 |4 |
> Because the word size in [Ripes](https://github.com/mortbopet/Ripes) is fixed at 4 bytes, I configure each line as 4 words, so 4 words together act as one 16-byte block.

#### Tasks
##### Task1
How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
:::success
`a3` = 1, i.e. option = 1.
The code goes through `wordloop` and performs both a `lw` and a `sw` on each element.
$\therefore$ memory accesses = `2`
:::
##### Task2
What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
:::success
`a1` = 2, i.e. step size = 2 words, so the address advances by $2 \times 4 = 8$ bytes per element.

| R/W |address | tag | index | H/M |
| --- | -------- | -------- | ----- | --- |
|R |0x10000000| 0x400000 | 0x0 | M |
|W |0x10000000| 0x400000 | 0x0 | H |
|R |0x10000008| 0x400000 | 0x0 | H |
|W |0x10000008| 0x400000 | 0x0 | H |
|R |0x10000010| 0x400000 | 0x1 | M |
|W |0x10000010| 0x400000 | 0x1 | H |
|R |0x10000018| 0x400000 | 0x1 | H |
|W |0x10000018| 0x400000 | 0x1 | H |
$\bullet$ observed at index 0
The `lw` at `0x10000000` is a compulsory miss; it loads the whole 16-byte block and sets the valid bit of set 0. The following `sw` to the same address therefore hits.
Since the block holds 4 words, the `lw`/`sw` pair at `0x10000008` also hits.
$\bullet$ observed at index 1
The situation is the same as at **index 0**, so the pattern `M H H H` repeats every 4 accesses.
:::
##### Task3
This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
:::success
Since only the first of every 4 accesses misses (`M H H H`), the hit rate is $3/4 = 0.75$.
:::
##### Task4
Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
:::success
It approaches `1.0000`. The 256-byte array fits entirely in the cache ($16 \times 16 = 256$ bytes), so once all the compulsory misses of the first rep are done, every later memory access hits.
:::
### Scenario 3
Test environment : [Venus](https://venus.cs61c.org/)
:::spoiler Program Parameters
* Array Size (a0): 128 (bytes)
* Step Size (a1): 1
* Rep Count (a2): 1
* Option (a3): 0
:::
modify `.text`
```c=7
main:   li a0, 128      # array size in BYTES (power of 2 < array size)
        li a1, 1        # step size (power of 2 > 0)
        li a2, 1        # rep count (int > 0)
        li a3, 0        # 0 - option 0, 1 - option 1
```
:::spoiler Cache Parameters
* Cache Levels: 2
* L1
  * Block Size: 8
  * Number of Blocks: 8
  * Enable?: Should be green
  * Placement Policy: Direct Mapped
  * Associativity: 1
  * Block Replacement Policy: LRU
* L2
  * Block Size: 8
  * Number of Blocks: 16
  * Enable?: Should be green
  * Placement Policy: Direct Mapped
  * Associativity: 1
  * Block Replacement Policy: LRU
:::
**L1**

|tag|index|offset|
|---|-----|------|
|26 |3 |3 |

**L2**

|tag|index|offset|
|---|-----|------|
|25 |4 |3 |
#### Tasks
##### Task1
What is the hit rate of our L1 cache? Our L2 cache? Overall?
:::success
The hit rate in `L1` is $0.5$ ($16$ hits out of $32$ accesses), and the hit rate in `L2` is $0$.
The overall hit rate is $16/32 = 0.5$.
:::
##### Task2
How many accesses do we have to the L1 cache total? How many of them are misses?
:::success
There are $32$ accesses in `L1`.
There are $16$ misses in these accesses.
:::
##### Task3
How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
:::success
There are $16$ accesses to `L2`. `L2` is only consulted when `L1` misses, so its $16$ accesses are exactly the $16$ `L1` misses.
:::
##### Task4
What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
:::success
Increase the Rep Count (`a2`). The whole 128-byte array fits in L2 ($16 \times 8 = 128$ bytes), so after the compulsory misses of the first rep every L2 access hits, while L1 ($8 \times 8 = 64$ bytes) is too small to hold the array and keeps the same $0.5$ hit rate on every rep.
:::
##### Task5
What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
:::success
Increasing the number of blocks in L1 does not change the hit rates here: with Rep Count 1 every miss is compulsory, so the extra capacity is never reused. Increasing the L1 block size does help, because a larger block brings in more neighboring words on each miss (e.g. a 16-byte block gives `M H H H`, hit rate $0.75$), so the L1 hit rate rises and fewer accesses fall through to L2.
:::
## Exercise 2 - Loop Ordering and Matrix Multiplication
```c=
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
take n = 2 for example
| ijk | ikj | jik | jki | kij | kji |
| --- | --- | --- | --- | --- | --- |
|||||||
### Tasks
##### Task1
Record your results
:::success

:::
##### Task2
Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
:::success
**jki** (and **kji**)
With `i` as the innermost loop, `C[i+j*n]` and `A[i+k*n]` are accessed with stride 1 while `B[k+j*n]` stays fixed, so successive iterations keep hitting the cache lines that were just loaded.
:::
##### Task3
Which ordering(s) perform the worst? Why?
:::success
**ikj** (and **kij**)
With `j` as the innermost loop, `C[i+j*n]` and `B[k+j*n]` are accessed with stride `n`, touching a new cache line on almost every access, while `A[i+k*n]` stays fixed.
:::
##### Task4
How does the way we stride through the matrices with respect to the innermost loop affect performance?
:::success
The smaller the stride of the innermost loop, the better the spatial locality: unit-stride accesses reuse the cache line that was just fetched, while stride-`n` accesses touch a different line on every iteration and miss almost every time.
:::
## Exercise 3 - Cache Blocking and Matrix Transposition
```cpp=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int blknum = n / blocksize;    /* number of full blocks per dimension */
    int blk_rem = n % blocksize;   /* leftover rows/columns */
    /* Transpose the full blocksize x blocksize blocks. */
    for (int i = 0; i < blknum; i++) {
        for (int j = 0; j < blknum; j++) {
            int blkx = i * blocksize;
            int blky = j * blocksize;
            for (int x = 0; x < blocksize; x++) {
                for (int y = 0; y < blocksize; y++) {
                    int curr_x = blkx + x;
                    int curr_y = blky + y;
                    dst[curr_y + curr_x * n] = src[curr_x + curr_y * n];
                }
            }
        }
    }
    /* Handle the remainder strips on the right and bottom edges. Using
     * edge = blknum * blocksize (== n - blk_rem) also covers the case
     * n < blocksize, where the blocked loops above never run. */
    if (blk_rem != 0) {
        int edge = blknum * blocksize;
        for (int x = edge; x < n; x++)       /* remainder columns of dst */
            for (int y = 0; y < edge; y++)
                dst[y + x * n] = src[x + y * n];
        for (int y = edge; y < n; y++)       /* remainder rows of dst */
            for (int x = 0; x < n; x++)
                dst[y + x * n] = src[x + y * n];
    }
}
```
* demo
```
$ ./transpose 10000 33
Testing naive transpose: 808.673 milliseconds
Testing transpose with blocking: 113.719 milliseconds
```
### Part 1 - Changing Array Sizes
#### Tasks
##### Task1
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
:::success
| | 100 | 1000 | 2000 | 5000 | 10000 |
|------------| --- | ---- | ----- | ----- | ----- |
|naive(ms) |0.004|0.787 |20.322 |176.518|780.35 |
|blocking(ms)|0.003|0.621 | 4.66 |26.689 |120.522|
:::
##### Task2
At what point does cache blocked version of transpose become faster than the non-cache blocked version?
:::success
The blocked version is faster at every size measured here, but the gap only becomes significant once the matrix no longer fits in the cache: from $n = 2000$ on, blocking is roughly $4\times$ to $6\times$ faster, because each block is transposed while it is still resident in the cache.
:::