# Lab4: Cache
###### tags: `2020-Computer-Architecture`

contributed by < `uduru0522` >

----

This lab aims at understanding and visualizing cache operations. The exercises come from [Here](https://cs61c.org/su20/labs/lab07/), and the code used for testing is from [Here](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07).

----

## Exercise 1 - A Couple of Memory Access Scenarios

:::spoiler Code To Run
> ```c
> /* PSEUDOCODE */
> int array[];    // Assume sizeof(int) == 4
> for (k = 0; k < repcount; k++) {    // repeat repcount times
>     /* Step through the selected array segment with the given step size. */
>     for (index = 0; index < arraysize; index += stepsize) {
>         if (option == 0)
>             /* Option 0: One cache access - write */
>             array[index] = 0;
>         else
>             /* Option 1: Two cache accesses - read AND write */
>             array[index] = array[index] + 1;
>     }
> }
> ```

```assembly=
.data
array:  .word 2048

.text
main:
    li a0, 256
    li a1, 2
    li a2, 1
    li a3, 1
    jal accessWords     # lw/sw
    # jal accessBytes   # lb/sb
    li a7, 10
    ecall

# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)

accessWords:
    la s0, array
    add s1, s0, a0
    slli t1, a1, 2
wordLoop:
    beq a3, zero, wordZero
    lw t0, 0(s0)
    addi t0, t0, 1
    sw t0, 0(s0)
    j wordCheck
wordZero:
    sw zero, 0(s0)
wordCheck:
    add s0, s0, t1
    blt s0, s1, wordLoop
    addi a2, a2, -1
    bgtz a2, accessWords
    jr ra

accessBytes:
    la s0, array
    add s1, s0, a0
byteLoop:
    beq a3, zero, byteZero
    lbu t0, 0(s0)
    addi t0, t0, 1
    sb t0, 0(s0)
    j byteCheck
byteZero:
    sb zero, 0(s0)
byteCheck:
    add s0, s0, a1
    blt s0, s1, byteLoop
    addi a2, a2, -1
    bgtz a2, accessBytes
    jr ra
```
:::

### Provided Code Modifications

1. **`ecall` Parameters**
    ```nasm=
    li a7, 10   # exit
    ecall
    ```
    The original code loaded the system-call number into register `a0`, but Ripes requires:
    + `a7` to hold the system-call number
    + `a0`-`a6` to hold the system-call arguments

    So the exit call was **changed to use `a7`**.

:::info
**Reference**: https://github.com/mortbopet/Ripes/wiki/Environment-calls
:::

### Scenario 1

Program Parameters:
+ Array Size (a0): 128 (bytes)
+ Step Size (a1): 8
+ Rep Count (a2): 4
+ Option (a3): 0

Cache Parameters:
+ Cache Levels: 1
+ Block Size: 8
+ Number of Blocks: 4
+ Enable?: Should be green
+ Placement Policy: Direct Mapped
+ Associativity: 1
+ Block Replacement Policy: LRU

![](https://i.imgur.com/07WPfRb.png)
> + In Ripes, a block defaults to 4 bytes. To emulate the 8-byte block size, blocks 0 & 1 are treated as one block.

#### Tasks

**Q1. What combination of parameters is producing the hit rate you observe?**

The hit rate observed in the simulation is 0%.

- First, the cache indexing for the given setting is as follows:
![](https://i.imgur.com/nT687HN.png)
Bits [4:3] of the address determine which block the data is cached in.
- When accessing the array, the address increment per step is
<span style="color:lightgray">(step size)</span> $8\times$ <span style="color:lightgray">(4 bytes per word)</span> $4 = 32 = \text{0b}10\,0000$
- Then, we can find out which addresses are actually accessed by adding the following code to `<wordZero>`:

```nasm=
.data
str:    .string "\n"

wordZero:
    sw zero, 0(s0)      # array[index/4] = 0
    mv t2, a0           # preserve a0 (array size)
    mv a0, s0           # address currently being accessed
    li a7, 35           # print integer in binary
    ecall
    la a0, str          # print a newline
    li a7, 4
    ecall
    mv a0, t2           # restore a0
```

This produces:
0b0001 0000 0000 0000 0000 0000 000<span style="color:red">0 0</span>000 (`0x10000000`)
0b0001 0000 0000 0000 0000 0000 001<span style="color:red">0 0</span>000 (`0x10000020`)
0b0001 0000 0000 0000 0000 0000 010<span style="color:red">0 0</span>000 (`0x10000040`)
0b0001 0000 0000 0000 0000 0000 011<span style="color:red">0 0</span>000 (`0x10000060`), and then repeats

with the red bits marking the block index. All four addresses map to block `00`, so every access is a miss.

**Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?**

With the current settings, increasing the Rep Count will **not** increase the hit rate. As long as the accessed addresses stay the same, the situation remains unchanged: every access maps to the same block and evicts the previous contents.

**Q3. How could we modify one program parameter to increase our hit rate?**

The modification should spread the accesses evenly over all blocks, since the 0% hit rate comes from every access mapping to block index 0. We could change the step size to 1: each 8-byte block is then missed on its first word and hit on its second, a repeating M-H pattern that gives a 50% hit rate. Alternatively, we could shrink the array size to 32 bytes, so that every access in every repetition touches the very same address (LOL); this way the hit rate approaches 100% as the Rep Count grows.

### Scenario 2

Program Parameters:
+ Array Size (a0): 256 (bytes)
+ Step Size (a1): 2
+ Rep Count (a2): 1
+ Option (a3): 1

Cache Parameters:
+ Cache Levels: 1
+ Block Size: 16
+ Number of Blocks: 16
+ Enable?: Should be green
+ Placement Policy: N-Way Set Associative
+ Associativity: 4
+ Block Replacement Policy: LRU

**Q1. How many memory accesses are there per iteration of the inner loop?**

Each iteration of the inner loop executes a single statement:
```c
arr[index] = arr[index] + 1;
```
so there are **2** array accesses per iteration: one read and one write.

**Q2. What is the repeating hit-miss pattern? Why?**

The observed pattern is **M-H-H-H**, repeating. Consider a miss when the array pointer is at (say) `0x100`:
+ The block covering `0x100` to `0x10F` is brought into the cache.
+ `0x100` is then written by the store instruction: HIT.
+ The array pointer advances to `0x108`, which lies in the block loaded in the first step.
+ The load instruction: HIT.
+ The store instruction: HIT.

Then the array pointer advances to `0x110`, which starts a new M-H-H-H run.

**Q3. Explain the hit rate in terms of the hit/miss pattern.**

With the **M-H-H-H** pattern, 3 of every 4 accesses hit, so the hit rate is 75%.

**Q4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity?**

Consider a single repetition first: we touch 32 distinct word addresses, each read and then written, for 64 accesses in total. These words map evenly across the four sets (four blocks per 4-way set), so the whole array fits in the cache at once. The first repetition therefore has 16 compulsory misses (one per block); in every following repetition all 64 accesses hit. Hence the hit rate approaches 100% as the Rep Count goes to infinity.
**Q5. Suppose we have a program that iterates through a very large array (way bigger than the size of the cache), and in each Rep, a different function is mapped to the array. How do we achieve a similar hit rate?**

Since the array is that large, we cannot keep every accessed address cached. Instead, we can make every "kind" of access to the same element happen together: for element `arr[i]`, instead of advancing `i` each time we finish task `k`, we apply all the functions to `arr[i]` and only then move on to the next element. This way every array element suffers only one miss. A sample in C:

```c=
int arr[SIZE];
void (*func_ptr)(int *);

for (int i = 0; i < SIZE; i += stepsize) {
    for (int k = 0; k < rep_count; ++k) {
        next_func(&func_ptr);   // switch to the next function, which modifies/uses only arr[i]
        (*func_ptr)(&arr[i]);
    }
}
```

**This achieves a hit rate of $1-\frac{1}{\text{Rep Count}}$.**

### Scenario 3

Program Parameters: (set these by initializing the a registers in the code)
+ Array Size (a0): 128 (bytes)
+ Step Size (a1): 1
+ Rep Count (a2): 1
+ Option (a3): 0

Cache Parameters: (set these in the Cache tab)
+ Cache Levels: 2
+ L1 Cache:
    - Block Size: 8
    - Number of Blocks: 8
    - Enable?: Should be green
    - Placement Policy: Direct Mapped
    - Associativity: 1
    - Block Replacement Policy: LRU
+ L2 Cache:
    - Block Size: 8
    - Number of Blocks: 16
    - Enable?: Should be green
    - Placement Policy: Direct Mapped
    - Associativity: 1
    - Block Replacement Policy: LRU

> Ripes does not support multilevel caches.
> The tasks are still answerable, though.

#### Tasks

**Q1. What's the hit rate of the L1 cache, the L2 cache, and overall?**

First, we can calculate the cache indexing:
:::info
+ L1:
    - Offset: $\log_2 8 = 3$
    - Index: $\log_2 8 = 3$
    - Tag: $32 - 3 - 3 = 26$
+ L2:
    - Offset: $\log_2 8 = 3$
    - Index: $\log_2 16 = 4$
    - Tag: $32 - 3 - 4 = 25$
:::

Then, list all the array accesses. (For readability, each field of the address is separated by a space, using the L1 split.)

```shell=
0b00010000000000000000000000 000 000    L1:M, L2:M
0b00010000000000000000000000 000 100    L1:H
0b00010000000000000000000000 001 000    L1:M, L2:M
0b00010000000000000000000000 001 100    L1:H
0b00010000000000000000000000 010 000    L1:M, L2:M
0b00010000000000000000000000 010 100    L1:H
0b00010000000000000000000000 011 000    L1:M, L2:M
0b00010000000000000000000000 011 100    L1:H
0b00010000000000000000000000 100 000    L1:M, L2:M
0b00010000000000000000000000 100 100    L1:H
0b00010000000000000000000000 101 000    L1:M, L2:M
0b00010000000000000000000000 101 100    L1:H
0b00010000000000000000000000 110 000    L1:M, L2:M
0b00010000000000000000000000 110 100    L1:H
0b00010000000000000000000000 111 000    L1:M, L2:M
0b00010000000000000000000000 111 100    L1:H
0b00010000000000000000000001 000 000    L1:M, L2:M
0b00010000000000000000000001 000 100    L1:H
0b00010000000000000000000001 001 000    L1:M, L2:M
0b00010000000000000000000001 001 100    L1:H
0b00010000000000000000000001 010 000    L1:M, L2:M
0b00010000000000000000000001 010 100    L1:H
0b00010000000000000000000001 011 000    L1:M, L2:M
0b00010000000000000000000001 011 100    L1:H
0b00010000000000000000000001 100 000    L1:M, L2:M
0b00010000000000000000000001 100 100    L1:H
0b00010000000000000000000001 101 000    L1:M, L2:M
0b00010000000000000000000001 101 100    L1:H
0b00010000000000000000000001 110 000    L1:M, L2:M
0b00010000000000000000000001 110 100    L1:H
0b00010000000000000000000001 111 000    L1:M, L2:M
0b00010000000000000000000001 111 100    L1:H
```

The hit/miss results are annotated above. Every L1 miss is followed by a hit to the same block, and every L1 miss goes to a block that has never been accessed before, so an initially empty L2 cache can never hit. Thus:
+ L1 Cache Hit Rate: $50\%$
+ L2 Cache Hit Rate: $0\%$
+ Overall Hit Rate: $\frac{16}{16 + 32}\approx 33.3\%$
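The same numbers can be reproduced with a minimal two-level direct-mapped cache model in C. This is only a sketch: it assumes the array starts at the Ripes data-segment base `0x10000000` and that stores allocate a block on a miss, like the simulation above.

```c
#include <stdio.h>

#define BLOCK   8    /* block size in bytes, both levels */
#define L1_SETS 8    /* L1: 8 blocks, direct mapped */
#define L2_SETS 16   /* L2: 16 blocks, direct mapped */

int main(void) {
    unsigned long l1_tag[L1_SETS], l2_tag[L2_SETS];
    int l1_valid[L1_SETS] = {0}, l2_valid[L2_SETS] = {0};
    int l1_hit = 0, l1_miss = 0, l2_hit = 0, l2_miss = 0;

    /* Scenario 3: array size 128 B, step size 1 word, rep count 1, option 0 */
    for (unsigned long addr = 0x10000000UL; addr < 0x10000000UL + 128; addr += 4) {
        unsigned long block = addr / BLOCK;
        int s1 = (int)(block % L1_SETS), s2 = (int)(block % L2_SETS);
        if (l1_valid[s1] && l1_tag[s1] == block / L1_SETS) { l1_hit++; continue; }
        l1_miss++;                                  /* L1 miss -> access L2 */
        l1_valid[s1] = 1; l1_tag[s1] = block / L1_SETS;
        if (l2_valid[s2] && l2_tag[s2] == block / L2_SETS) { l2_hit++; }
        else { l2_miss++; l2_valid[s2] = 1; l2_tag[s2] = block / L2_SETS; }
    }
    printf("L1: %d hits / %d accesses\n", l1_hit, l1_hit + l1_miss);
    printf("L2: %d hits / %d accesses\n", l2_hit, l2_hit + l2_miss);
    printf("overall hit rate: %.3f\n",
           (double)(l1_hit + l2_hit) / (l1_hit + l1_miss + l2_hit + l2_miss));
    return 0;
}
```

Running it reports 16/32 L1 hits, 0/16 L2 hits, and an overall hit rate of about 0.333, matching the hand analysis.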
**Q2. How many accesses do we have to the L1 cache total? How many of them are misses?**

32 accesses in total, with 16 hits and 16 misses.

**Q3. Same question as above, but for the L2 cache. How does this relate to the L1 cache?**

16 accesses, all of them misses. The L2 cache is accessed whenever L1 misses, so the number of L1 misses equals the number of L2 accesses.

**Q4. What program parameter would allow us to increase our L2 hit rate, but keep the L1 hit rate the same? Why?**

+ First, consider what is left in L1 after each repetition: the latter 64 bytes of the array, exactly half of it. Therefore, if we simply increase the Rep Count, the L1 hit rate does not change: at the start of every repetition the first half of the array has to be fetched again (none of it is resident), and loading it evicts the second half, which then also misses again.
+ What is left in L2, though, is **the entire array**. Looking at the L2 misses above, the 16 misses map one-to-one onto the 16 L2 blocks. Hence, from the 2nd repetition on, those 16 L2 accesses all become hits.

Accordingly, **increasing the Rep Count improves the L2 hit rate but does not change L1's behaviour**.

## Exercise 2 - Loop Ordering and Matrix Multiplication

### Exercise Description

The following is one way to multiply matrices `A` and `B`:
```c=
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```
However, the order of the loops is interchangeable. Observe and analyze the relationship between the **loop order** (i.e. the strides used to traverse the matrices) and the **execution speed**.

### Program Execution

[From Here](https://github.com/61c-teach/su20-lab-starter/blob/master/lab07/matrixMultiply.c)

Running the test 3 times produced the following results:
```shell
uduru@uduru-GL553VD:~/C-Cpp$ ./matrix-multiply
ijk: n = 1000, 1.811 Gflop/s
ikj: n = 1000, 0.257 Gflop/s
jik: n = 1000, 1.704 Gflop/s
jki: n = 1000, 11.258 Gflop/s
kij: n = 1000, 0.259 Gflop/s
kji: n = 1000, 10.974 Gflop/s
uduru@uduru-GL553VD:~/C-Cpp$ ./matrix-multiply
ijk: n = 1000, 1.796 Gflop/s
ikj: n = 1000, 0.255 Gflop/s
jik: n = 1000, 1.790 Gflop/s
jki: n = 1000, 12.084 Gflop/s
kij: n = 1000, 0.259 Gflop/s
kji: n = 1000, 10.593 Gflop/s
uduru@uduru-GL553VD:~/C-Cpp$ ./matrix-multiply
ijk: n = 1000, 1.810 Gflop/s
ikj: n = 1000, 0.257 Gflop/s
jik: n = 1000, 1.723 Gflop/s
jki: n = 1000, 11.235 Gflop/s
kij: n = 1000, 0.260 Gflop/s
kji: n = 1000, 10.338 Gflop/s
```

### Matrix Striding

Before the tasks, we can first think about how the loop order affects the traversal of the matrices. With the operation `C[i+j*n] += A[i+k*n] * B[k+j*n]`, the indices `i`, `j` and `k` correspond to these directions through the matrices:

![](https://i.imgur.com/B9T7lwm.png)
> Being a human that couldn't distinguish between the meaning of column/row,
> [THIS](https://lambdalisue.hatenablog.com/entry/2013/07/18/134507) helped me a lot.

### Memory Layout of A 2D Matrix

Contiguous memory is laid out with black squares from **left to right**; each run of connected black squares represents a row of the corresponding matrix.

![](https://i.imgur.com/jnS2VZC.png)
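To connect the picture with the kernel's indexing, here is a tiny illustrative example (a hypothetical `n = 4`, not part of the lab code) showing that `i` is the unit-stride direction of `X[i + j*n]`, while each step of `j` jumps a whole `n` elements ahead:

```c
#include <stdio.h>

/* Illustration of the X[i + j*n] indexing used by the multiplication kernel:
 * consecutive values of i land on consecutive memory offsets, while each
 * increment of j jumps n elements ahead. */
int main(void) {
    int n = 4;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++)
            printf("%2d ", i + j * n);   /* linear offset of X[i, j] */
        printf("   <- j = %d\n", j);
    }
    return 0;
}
```

Whether one calls the unit-stride direction a row or a column is only a naming convention; what matters for caching is the stride itself.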
### Tasks

**Q1. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?**

According to the measured results, the order that performs best is `j-k-i`. Between the runs, the only thing that changes significantly is the order of the address accesses, and from Exercise 1, Scenario 2, Task 5 we already know that the access order can have a huge effect on the cache hit rate.

Look at how this index order produces the order of array accesses:
+ `i = 0, j = 0, k = 0`![](https://i.imgur.com/vMLX4wS.png)
+ `i = 1, j = 0, k = 0`![](https://i.imgur.com/0OMozCU.png)
+ `i = 2, j = 0, k = 0`![](https://i.imgur.com/70ODJMq.png)
+ `i = 3, j = 0, k = 0`![](https://i.imgur.com/ceWPxLD.png)
+ `i = 0, j = 0, k = 1`![](https://i.imgur.com/UtTSzIO.png)
+ `i = 1, j = 0, k = 1`![](https://i.imgur.com/iYxsDIB.png)
+ `i = 2, j = 0, k = 1`![](https://i.imgur.com/Az8lVaC.png)
+ `i = 3, j = 0, k = 1`![](https://i.imgur.com/psIaRj7.png)
+ `i = 0, j = 1, k = 0`![](https://i.imgur.com/RAeaF8b.png)

(Remaining iterations skipped)

Consider a cache with:
+ Block count = 3
+ Block size = 4 words

In the first 8 multiply-accumulates (24 array accesses), only 4 cache misses occur:
1. Accessing A[0, 0]
2. Accessing A[1, 0], which also replaces the block loaded for A[0, 0]
3. Accessing B[0, 0]
4. Accessing C[0, 0]

These four fills bring in blocks that make the other 20 accesses hit, so this is the most cache-friendly access order.

:::info
In conclusion, **with the current memory layout, row-major order is the most cache-friendly, and the loop order `j-k-i` accesses each matrix in exactly that order.**
:::

**Q2. Which ordering(s) perform the worst? Why?**

The order `i-k-j`.

Again, we can visualize and trace the accesses:
+ `i = 0, j = 0, k = 0`![](https://i.imgur.com/vMLX4wS.png)
+ `i = 0, j = 1, k = 0`![](https://i.imgur.com/IMYFA91.png)
+ `i = 0, j = 2, k = 0`![](https://i.imgur.com/pqDzZO4.png)
+ `i = 0, j = 3, k = 0`![](https://i.imgur.com/JZ42upy.png)
+ `i = 0, j = 0, k = 1`![](https://i.imgur.com/CQOwEKz.png)
+ `i = 0, j = 1, k = 1`![](https://i.imgur.com/ylGpBer.png)
+ `i = 0, j = 2, k = 1`![](https://i.imgur.com/8S3jtCz.png)
+ `i = 0, j = 3, k = 1`![](https://i.imgur.com/aDvfDvr.png)

In every iteration there are huge gaps between consecutive accessed addresses, so cache misses become extremely likely. In other words, **column-major order works badly for caching**, because the addresses of successive accesses are far, far apart.

**Q3. How does the way we stride through the matrices with respect to the innermost loop affect performance?**

The smaller the stride of the innermost loop, the better the performance, because a small stride increases the hit rate.
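To summarize the effect of the innermost index, here is a small C sketch (derived directly from the index expressions `C[i+j*n]`, `A[i+k*n]`, `B[k+j*n]`; it is an illustration, not part of the lab code) that prints the per-operand stride of the innermost loop for each ordering:

```c
#include <stdio.h>

/* The innermost-loop stride of each operand is simply the coefficient of the
 * innermost index in its offset expression: 1, n, or 0 (operand stays put). */
int main(void) {
    int n = 1000;
    struct { const char *order; char inner; } cases[] = {
        {"ijk", 'k'}, {"ikj", 'j'}, {"jik", 'k'},
        {"jki", 'i'}, {"kij", 'j'}, {"kji", 'i'},
    };
    for (int t = 0; t < 6; t++) {
        char x = cases[t].inner;
        int strideC = (x == 'i') ? 1 : (x == 'j') ? n : 0;  /* C[i + j*n] */
        int strideA = (x == 'i') ? 1 : (x == 'k') ? n : 0;  /* A[i + k*n] */
        int strideB = (x == 'k') ? 1 : (x == 'j') ? n : 0;  /* B[k + j*n] */
        printf("%s: innermost strides  C=%4d  A=%4d  B=%4d\n",
               cases[t].order, strideC, strideA, strideB);
    }
    return 0;
}
```

The fast orderings (`jki`, `kji`) are exactly the ones whose innermost strides are 0 or 1 elements, while the slow ones (`ikj`, `kij`) stride by `n = 1000` elements through two of the three matrices.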
## Exercise 3 - Cache Blocking and Matrix Transposition

### Exercise Description

Implement the blocked version of the transpose in `transpose.c` with locality in mind, and answer the following questions.

### Implemented Code
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // Round up so that partial blocks at the edge are also visited
    // (we don't assume n is a multiple of blocksize).
    int block_count = (n + blocksize - 1) / blocksize;
    for (int i = 0; i < block_count; ++i) {
        for (int j = 0; j < block_count; ++j) {
            // i := row-block index, j := column-block index
            for (int a = 0; a < blocksize; ++a) {
                for (int b = 0; b < blocksize; ++b) {
                    if (i * blocksize + a < n && j * blocksize + b < n) {
                        dst[(j * blocksize + b) + (i * blocksize + a) * n] =
                            src[(j * blocksize + b) * n + (i * blocksize + a)];
#ifdef EBUG
                        printf("Moving src[%d] ", (j * blocksize + b) * n + (i * blocksize + a));
                        printf("to dst[%d]", (j * blocksize + b) + (i * blocksize + a) * n);
                        putchar('\n');
#endif
                    }
                }
            }
        }
    }
}
```

### Part 1. Changing Array Sizes

:::success
Fix the `blocksize` to be 20, and run your code with `n` equal to 100, 1000, 2000, 5000, and 10000.
:::

#### Result

```shell
BLOCKSIZE=20
FILENAME="transpose"
SIZE_ARR=(100 1000 2000 5000 10000)

gcc -o $FILENAME $FILENAME.c -O3
for SIZE in "${SIZE_ARR[@]}"
do
    ./$FILENAME $SIZE $BLOCKSIZE
done
rm $FILENAME
```
```shell
Testing naive transpose: 0.004 milliseconds
Testing transpose with blocking: 0.006 milliseconds
Testing naive transpose: 1.201 milliseconds
Testing transpose with blocking: 2.173 milliseconds
Testing naive transpose: 15.191 milliseconds
Testing transpose with blocking: 5.453 milliseconds
Testing naive transpose: 132.247 milliseconds
Testing transpose with blocking: 33.325 milliseconds
Testing naive transpose: 670.757 milliseconds
Testing transpose with blocking: 192.385 milliseconds
```

Time is plotted as $\ln(\text{milliseconds})$.
![](https://i.imgur.com/mtaDQIF.png)

:::success
**Q1. At what point does the cache-blocked version of transpose become faster than the non-cache-blocked version?**
:::
Between the array sizes of 1000 and 2000.

:::success
**Q2. Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?**
:::
A sufficiently small matrix fits in the cache entirely (e.g., the L1 caches on my desktop total 576 KB, while a 100 × 100 `int` matrix is only about 40 KB), so both methods perform about the same in terms of hit rate. The blocking method, however, performs a boundary check on every element (since we don't assume $n \equiv 0 \pmod{\text{blocksize}}$), which makes it slightly slower until the cache effects start to dominate.

### Part 2. Changing Block Sizes

:::success
Fix `n` to be 10000, and run your code with `blocksize` equal to 50, 100, 500, 1000, 5000.
:::

#### Result

```shell
MATRIXSIZE=10000
FILENAME="transpose"
BLOCKSIZE_ARR=(50 100 500 1000 5000)

gcc -o $FILENAME $FILENAME.c -O3
for SIZE in "${BLOCKSIZE_ARR[@]}"
do
    ./$FILENAME $MATRIXSIZE $SIZE
done
rm $FILENAME
```
```shell
Testing naive transpose: 674.988 milliseconds
Testing transpose with blocking: 143.586 milliseconds
Testing naive transpose: 667.731 milliseconds
Testing transpose with blocking: 130.14 milliseconds
Testing naive transpose: 669.12 milliseconds
Testing transpose with blocking: 139.602 milliseconds
Testing naive transpose: 671.218 milliseconds
Testing transpose with blocking: 172.181 milliseconds
Testing naive transpose: 666.329 milliseconds
Testing transpose with blocking: 610.692 milliseconds
```

Time is plotted as $\ln(\text{milliseconds})$.
![](https://i.imgur.com/Y7p5swu.png)

:::success
**Q1. How does performance change as blocksize increases? Why is this the case?**
:::
The key to the speedup in this problem is **locality**: instead of striding blindly through the whole matrix, we do all the work on one small block of data at a time. When that block becomes too big, however, the elements inside it are again far apart in memory and the block no longer fits in the cache, so the miss rate climbs back up. That is why performance is best around a blocksize of 100 here and degrades sharply for very large block sizes such as 5000.
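A rough way to see where this happens is to estimate how much data one `blocksize × blocksize` tile of `src` plus the matching tile of `dst` touches. The sketch below is illustrative only; the cache sizes of the machine used above were not measured, and whole-cache-line effects are ignored:

```c
#include <stdio.h>

/* Rough per-tile working set of the blocked transpose: one blocksize x blocksize
 * tile of src plus the corresponding tile of dst, 4-byte ints.  Once this no
 * longer fits in the cache, blocking loses its advantage. */
int main(void) {
    long sizes[] = {50, 100, 500, 1000, 5000};
    for (int i = 0; i < 5; i++) {
        long b = sizes[i];
        long kib = 2 * b * b * (long)sizeof(int) / 1024;
        printf("blocksize %4ld -> ~%9ld KiB touched per tile pair\n", b, kib);
    }
    return 0;
}
```

A blocksize of 50 touches roughly 20 KiB per tile pair, which easily fits in a typical L1/L2 cache, whereas a blocksize of 5000 touches close to 200 MiB, at which point blocking behaves essentially like the naive traversal.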