# Assignment4: Cache

###### tags: `Computer Architecture 2020`

:::info
Due: Dec 25, 2020
:::

This is an exercise following the instructions of [CS 61C lab7](https://cs61c.org/su20/labs/lab07/), using [Ripes](https://github.com/mortbopet/Ripes) as the cache visualization tool.

___

## Objectives

* Analyze how memory access patterns determine cache hit rates
* Discover which memory access patterns produce **GOOD** hit rates
* Analyze hit rates for caches and optimize code accesses to produce good hit rates

## Cache.s

Modify cache.s to fit the expectations of Ripes.

```clike=
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
#   int array[];    // Assume sizeof(int) == 4
#   for (k = 0; k < repcount; k++) {          // repeat repcount times
#     /* Step through the selected array segment with the given step size. */
#     for (index = 0; index < arraysize; index += stepsize) {
#       if (option == 0)
#         array[index] = 0;                   // Option 0: One cache access - write
#       else
#         array[index] = array[index] + 1;    // Option 1: Two cache accesses - read AND write
#     }
#   }

.data
array:  .word 2048          # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:
        li a0, 256          # array size in BYTES (power of 2 < array size)
        li a1, 2            # step size (power of 2 > 0)
        li a2, 1            # rep count (int > 0)
        li a3, 1            # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
        jal accessWords     # lw/sw
        #jal accessBytes    # lb/sb
        li a7, 10           # Modified here: the ecall number in Ripes is stored in reg a7
        ecall

# SUMMARY OF REGISTER USE:
#   a0 = array size in bytes
#   a1 = step size
#   a2 = number of times to repeat
#   a3 = 0 (W) / 1 (RW)
#   s0 = moving array ptr
#   s1 = array limit (ptr)

accessWords:
        la s0, array        # ptr to array
        add s1, s0, a0      # hardcode array limit (ptr)
        slli t1, a1, 2      # multiply stepsize by 4 because WORDS
wordLoop:
        beq a3, zero, wordZero
        lw t0, 0(s0)        # array[index/4]++
        addi t0, t0, 1
        sw t0, 0(s0)
        j wordCheck
wordZero:
        sw zero, 0(s0)      # array[index/4] = 0
wordCheck:
        add s0, s0, t1      # increment ptr
        blt s0, s1, wordLoop    # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessWords    # outer loop done?
        jr ra

accessBytes:
        la s0, array        # ptr to array
        add s1, s0, a0      # hardcode array limit (ptr)
byteLoop:
        beq a3, zero, byteZero
        lbu t0, 0(s0)       # array[index]++
        addi t0, t0, 1
        sb t0, 0(s0)
        j byteCheck
byteZero:
        sb zero, 0(s0)      # array[index] = 0
byteCheck:
        add s0, s0, a1      # increment ptr
        blt s0, s1, byteLoop    # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessBytes    # outer loop done?
        jr ra
```

## Exercise 1 - A Couple of Memory Access Scenarios

### Scenario 1

**Program Parameters**: (set these by initializing the a registers in the code)
* **Array Size** (<font color="#05a">a0</font>): 128 (bytes)
* **Step Size** (<font color="#05a">a1</font>): 8
* **Rep Count** (<font color="#05a">a2</font>): 4
* **Option** (<font color="#05a">a3</font>): 0

**Cache Parameters**: (set these in the Cache tab)
* **Cache Levels**: 1
* **Block Size**: 8
* **Number of Blocks**: 4
* **Enable?**: Should be green
* **Placement Policy**: Direct Mapped
* **Associativity**: 1 (Venus won't let you change this, why?)
* **Block Replacement Policy**: LRU

<font size=4px>**Tasks**</font>

1. What combination of parameters is producing the hit rate you observe?

![](https://i.imgur.com/FV3sZJQ.png)

In this case, the hit rate is 0. To observe the result more easily, I added extra instructions to cache.s that print the data address (<font color="#05a">s0</font>) to the console. The following is what I added:

```clike=
.data
str:    .string "\n"
.text
.
.
.
wordZero:
        mv t5, a0
        mv a0, s0
        li a7, 35
        ecall
        la a0, str
        li a7, 4
        ecall
        mv a0, t5
```

![](https://i.imgur.com/AUSOhSl.png)

We can see that all the addresses we access map to the same cache line, because they share the same index bits. Counting the compulsory misses, every access is a write miss, so there are no hits at all.

2. What is our hit rate if we increase Rep Count arbitrarily? Why?

If we increase Rep Count, we get the same result: a 0% hit rate. The step size is 8 and the word size is 4 bytes, so each inner-loop iteration accesses an address 32 bytes past the previous one, which has the same index but a different tag. No matter how many times the outer loop runs, every inner-loop access is a miss.

3. How could we modify one program parameter to increase our hit rate?

In this case, we can increase our hit rate by reducing the step size (<font color="#05a">a1</font>).
With any step size of 2 or greater, we still get a 0% hit rate; however, if we set the step size to 1, we get a 50% hit rate.

![](https://i.imgur.com/PMyvZJp.png)

### Scenario 2

**Program Parameters**: (set these by initializing the a registers in the code)
* **Array Size** (<font color="#05a">a0</font>): 256 (bytes)
* **Step Size** (<font color="#05a">a1</font>): 2
* **Rep Count** (<font color="#05a">a2</font>): 1
* **Option** (<font color="#05a">a3</font>): 1

**Cache Parameters**: (set these in the Cache tab)
* **Cache Levels**: 1
* **Block Size**: 16
* **Number of Blocks**: 16
* **Enable?**: Should be green
* **Placement Policy**: N-Way Set Associative
* **Associativity**: 4
* **Block Replacement Policy**: LRU

<font size=4px>**Tasks**</font>

1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It's not 1.

Since we set the option to 1, each inner-loop iteration executes
```array[index] = array[index] + 1```
so there are 2 memory accesses per iteration: one read and one write.

2. What is the repeating hit/miss pattern? WHY?

![](https://i.imgur.com/BAvzBdP.png)

The graph above shows how the hit rate changes over the cycles: the hit rate rises on a `hit` and falls on a `miss`. We can see that in every 4 accesses there is one miss followed by 3 hits.

* Read miss
![](https://i.imgur.com/OXAQsbQ.png)
* Write hit
![](https://i.imgur.com/9VY15ka.png)
* Read hit
![](https://i.imgur.com/BhYVuXW.png)
* Write hit
![](https://i.imgur.com/BdWgQj7.png)

Within every group of 4 accesses (all to the same block), there are 1 miss and 3 hits. The next 4 accesses touch another block with the same result, so the pattern repeats every 4 accesses.

3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

Every 4 accesses produce 1 miss and 3 hits, so the hit rate dips once, rises three times, and converges to 75%.

4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

![](https://i.imgur.com/YhRSKBd.gif)

In this case, one outer iteration accesses 32 different addresses ($256 \text{ (array size)} \div (2 \text{ (step size)} \times 4 \text{ (word size)}) = 32$), and each pair of consecutive addresses falls in the same block. So, in the first outer iteration, the CPU suffers 16 compulsory misses. In every subsequent outer iteration, we access the same addresses again, and they have not been replaced (they are still in the cache), so there are no misses after the first iteration. No matter how much we increase the Rep Count, there are only ever 16 cache misses. Therefore, as Rep Count goes to infinity, our hit rate approaches 100%.

5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. **IF YOU ARE NOT SURE WHY**, consider the size of the array and compare it to the size of the cache. **Now**, consider the following: Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).

**QUESTION**: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn't matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.
**ANSWER**: We can exchange the inner loop and the outer loop, as in the following pseudocode:

```c=
PSEUDOCODE:
int array[];    // Assume sizeof(int) == 4
/* Step through the selected array segment with the given step size. */
for (index = 0; index < arraysize; index += stepsize) {
    for (k = 0; k < repcount; k++) {          // repeat repcount times
        if (option == 0)
            array[index] = 0;                 // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;  // Option 1: Two cache accesses - read AND write
    }
}
```

Before the revision, each Rep streams through the whole array, which is far bigger than the cache, so every block is evicted before it is reused and the hit rate is poor. After the revision, all repcount updates to an element happen while its block is still resident, so the array accesses achieve a high hit rate like that of **Scenario 2**. By doing so, we access **part** of the array at a time and apply every function to that **part** so we are completely done with it before moving on, thereby keeping that **part** hot in the cache and not having to circle back to it later on!

### Scenario 3

**Cache Levels**: 2 — skipped, since Ripes doesn't support multilevel caches.

## Exercise 2 - Loop Ordering and Matrix Multiplication

This exercise is about understanding the importance of locality: if we reuse blocks already resident in the cache, we get much better performance. Take a glance at `matrixMultiply.c`. We use it to compare the different implementations of matrix multiply.

:::spoiler matrixMultiply.c
```c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
as separate functions and then calling them using function pointers.
The reason for having separate functions that are nearly identical is
to avoid counting any extraneous processing towards the computation
time. This includes I/O accesses (printf) and conditionals (if/switch).
I/O accesses are slow and conditional/branching statements could
unfairly bias results (lower cases in switches must run through more
case statements on each iteration). */

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
            1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");

    return 0;
}
```
:::

<br>

<font size=4px>**Tasks**</font>

* Record your results

```bash
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:    n = 1000, 1.569 Gflop/s
ikj:    n = 1000, 0.117 Gflop/s
jik:    n = 1000, 1.387 Gflop/s
jki:    n = 1000, 10.085 Gflop/s
kij:    n = 1000, 0.115 Gflop/s
kji:    n = 1000, 8.458 Gflop/s
```

* Which ordering(s) perform best for these 1000-by-1000 matrices? **Why**?

`jki` performs the best of the six implementations. From **Exercise 1**, we know that the innermost loop's variable determines how we access memory.
If incrementing the innermost loop variable keeps consecutive memory accesses adjacent (or at the same address), we utilize the cache more efficiently and get better performance. In `jki`, `i` is the innermost loop variable: when `i` increases by 1, the accesses to `A[i+k*n]` and `C[i+j*n]` move to adjacent addresses (stride 1), and `B[k+j*n]` doesn't move at all. So `jki` performs the best.

* Which ordering(s) perform the worst? **Why**?

`ikj` and `kij` perform the worst. In both, `j` is the innermost loop variable: when `j` increases by 1, both `B[k+j*n]` and `C[i+j*n]` jump to an address `n` elements away from the previous one. So `ikj` and `kij` perform the worst.

* How does the way we stride through the matrices with respect to the innermost loop affect performance?

The shorter the stride in the innermost loop, the more efficiently we utilize the cache and the better the performance.

## Exercise 3 - Cache Blocking and Matrix Transposition