# Assignment 4: Cache

###### tags: `arch2020`

## Exercise 1 - A Couple of Memory Access Scenarios

Task: Simulate the following scenarios and record the final cache hit rates with the program in cache.s. Try to reason out what the hit rate will be BEFORE running the code. After running each simulation, make sure you understand WHY you see what you see (the TAs will be asking)!

```
# PSEUDOCODE:
# int array[];  // Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) {  // repeat repcount times
#     /* Step through the selected array segment with the given step size. */
#     for (index = 0; index < arraysize; index += stepsize) {
#         if (option == 0)
#             array[index] = 0;                 // Option 0: One cache access - write
#         else
#             array[index] = array[index] + 1;  // Option 1: Two cache accesses - read AND write
#     }
# }
```

:::info
The cache settings in Ripes use different naming from those in Venus.

Conversion Table

| Venus | Ripes |
| -------- | -------- |
| Block Size (Bytes) = $2^{N+2}$ | $2^N$ Blocks |
| Number of Blocks = $2^N \cdot \mathrm{associativity}$ | $2^N$ Lines |
| Associativity = $2^N$ | $2^N$ Ways |
| Cache Size = $N$ bytes | Data Size = $8N$ bits |
:::

### Scenario 1

```
Array Size (a0): 128 (bytes)
Step Size (a1): 8
Rep Count (a2): 4
Option (a3): 0
```

```
Cache Levels: 1
Block Size: 8
Number of Blocks: 4
Enable?: Should be green
Placement Policy: Direct Mapped
Associativity: 1 (Venus won’t let you change this, why?)
Block Replacement Policy: LRU
```

:::success
Hit rate: 0% (0/16)
:::

1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)

:::success
The step size in bytes (8 ints × 4 bytes/int = 32 bytes) is exactly equal to the cache size (4 blocks × 8 bytes = 32 bytes). Every access therefore maps to the same cache index with a different tag, so each access evicts the block fetched by the previous one and nothing is ever reused.
:::

2. What is our hit rate if we increase Rep Count arbitrarily? Why?

:::success
It will still be 0%: the accessed blocks keep evicting each other, so no repetition can reuse data left behind by the previous one.
:::

3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

:::success
If the array size is set to 32 (bytes), only `array[0]` is ever accessed, so after the first compulsory miss every access hits: 3/4 = 75%. Alternatively, if the step size is set to 1, each 8-byte block is accessed twice (miss then hit), giving 64/128 = 50%.
:::

### Scenario 2

```
Program Parameters: (set these by initializing the a registers in the code)
Array Size (a0): 256 (bytes)
Step Size (a1): 2
Rep Count (a2): 1
Option (a3): 1

Cache Parameters: (set these in the Cache tab)
Cache Levels: 1
Block Size: 16
Number of Blocks: 16
Enable?: Should be green
Placement Policy: N-Way Set Associative
Associativity: 4
Block Replacement Policy: LRU
```

:::success
Hit rate: 75% (48/64)
:::

1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.

:::success
2 (1 read followed by 1 write, since option 1 does `array[index] = array[index] + 1`).
:::

2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).

:::success
`MISS HIT HIT HIT`. Each 16-byte block holds 4 ints, and with a step size of 2 ints we touch every block twice; each touch is a read followed by a write, so only the first read of each block misses.
:::

3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

:::success
```
Miss (read miss, brings in a 16-byte block)
Hit  (write hit, same address as above)
Hit  (read hit, previous address + 8, same block)
Hit  (write hit, same address as above)
```
3 of every 4 accesses hit, so the hit rate is 48/64 = 75%.
:::

4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

:::success
It approaches 100%. The 256-byte array fits exactly in the 256-byte cache, so only the compulsory misses of the first repetition occur; every access in later repetitions hits.
:::

5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following: Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep). QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.

:::success
Process the array in cache-sized chunks: apply all repcount functions to one chunk before moving on to the next, so each chunk is brought into the cache only once (see the sketch below).
:::
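A minimal sketch of that restructuring in C. The function name `apply_reps_blocked`, the chunk size, and the per-Rep callback `func` are illustrative stand-ins, not part of the provided code; the point is only that the Rep loop moves inside a chunk loop, so all Reps touch a chunk while it is still resident in the cache.

```cpp
// Illustrative sketch only: process the array in cache-sized chunks so that every
// Rep reuses a chunk while it is still cached, instead of streaming the whole
// (cache-exceeding) array once per Rep.
void apply_reps_blocked(int *array, int arraysize, int repcount,
                        int (*func)(int rep, int value)) {
    // arraysize is in elements here; 64 ints * 4 B = 256 B, the cache size in this scenario.
    const int chunk = 64;
    for (int base = 0; base < arraysize; base += chunk) {
        int end = base + chunk;
        if (end > arraysize) end = arraysize;
        for (int rep = 0; rep < repcount; rep++)        // apply every Rep to this chunk...
            for (int index = base; index < end; index++)
                array[index] = func(rep, array[index]);
    }                                                   // ...before moving on to the next chunk
}
```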
### Scenario 3

```
Program Parameters: (set these by initializing the a registers in the code)
Array Size (a0): 128 (bytes)
Step Size (a1): 1
Rep Count (a2): 1
Option (a3): 0

Cache Parameters: (set these in the Cache tab)
Cache Levels: 2

NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)
Block Size: 8
Number of Blocks: 8
Enable?: Should be green
Placement Policy: Direct Mapped
Associativity: 1
Block Replacement Policy: LRU

NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)
Block Size: 8
Number of Blocks: 16
Enable?: Should be green
Placement Policy: Direct Mapped
Associativity: 1
Block Replacement Policy: LRU
```

1. What is the hit rate of our L1 cache? Our L2 cache? Overall?

:::success
L1: 16/32 = 50%
L2: 0/16 = 0%
Overall: (16 + 0) hits out of 32 accesses = 50%
:::

2. How many accesses do we have to the L1 cache total? How many of them are misses?

:::success
32 accesses, 16 of which are misses (every other access fetches a new 8-byte block).
:::

3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?

:::success
16 — one for each L1 cache miss, since L2 is only consulted when L1 misses.
:::

4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?

:::success
Rep Count. The 128-byte array fits in the 128-byte L2 cache but not in the 64-byte L1 cache, so L1 keeps missing at the same rate in every repetition while, after the first repetition, every L1 miss hits in L2.
:::

5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?

:::success
Increasing the number of blocks in L1 doesn't affect the hit rates: with a single repetition all L1 misses are compulsory, so extra capacity doesn't help. If the L1 block size is increased instead, the L1 hit rate increases because each miss brings in more adjacent words (fewer fetches are needed), while the L2 hit rate stays at 0%.
:::

## Exercise 2 - Loop Ordering and Matrix Multiplication

Results on an x86_64 machine:

```
ijk: n = 1000, 2.078 Gflop/s
ikj: n = 1000, 0.293 Gflop/s
jik: n = 1000, 2.072 Gflop/s
jki: n = 1000, 14.391 Gflop/s
kij: n = 1000, 0.294 Gflop/s
kji: n = 1000, 13.347 Gflop/s
```
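For reference, these Gflop/s figures are presumably derived from the standard $2n^3$ operation count of an $n \times n$ matrix multiply ($n^3$ multiplications and $n^3$ additions):

$$\text{Gflop/s} = \frac{2n^3}{t \cdot 10^9}, \qquad \text{e.g. the 14.391 Gflop/s } jki \text{ run corresponds to } t \approx \frac{2 \cdot 1000^3}{14.391 \times 10^9} \approx 0.14\ \text{s}.$$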
The source code is modified to run on Ripes; the C code is compiled with clang on Compiler Explorer. Specifically, `float`s are replaced with `int`s since Ripes supports only RV32IM, `n` is set to 100 to speed up the computation, standard library calls are removed, and a few modifications were made to the assembly to ensure the program runs properly.

:::spoiler Modified `matrixMultiply.c`
```cpp
/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation time.
   This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could unfairly
   bias results (lower cases in switches must run through more case
   statements on each iteration). */

#define nmax 100

void multMat1( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, int *A, int *B, int *C ) {
    int i,j,k;
    /* This is kji loop order.
*/ for( k = 0; k < n; k++ ) for( j = 0; j < n; j++ ) for( i = 0; i < n; i++ ) C[i+j*n] += A[i+k*n]*B[k+j*n]; } /* uses timing features from sys/time.h that you haven't seen before */ int main( int argc, char **argv ) { int i; int A[nmax * nmax]; int B[nmax * nmax]; int C[nmax * nmax]; /* fill matrices with random numbers */ for( i = 0; i < nmax*nmax; i++ ) A[i] = 123; for( i = 0; i < nmax*nmax; i++ ) B[i] = 123; for( i = 0; i < nmax*nmax; i++ ) C[i] = 123; multMat1(nmax, A, B, C); // multMat2(nmax, A, B, C); // multMat3(nmax, A, B, C); // multMat4(nmax, A, B, C); // multMat5(nmax, A, B, C); // multMat6(nmax, A, B, C); return 0; } ``` ::: :::spoiler Modified `matrixMultiply.s` ``` j main multMat1(int, int*, int*, int*): # @multMat1(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -28(s0) j .LBB0_1 .LBB0_1: # =>This Loop Header: Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB0_12 j .LBB0_2 .LBB0_2: # in Loop: Header=BB0_1 Depth=1 mv a0, zero sw a0, -32(s0) j .LBB0_3 .LBB0_3: # Parent Loop BB0_1 Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB0_10 j .LBB0_4 .LBB0_4: # in Loop: Header=BB0_3 Depth=2 mv a0, zero sw a0, -36(s0) j .LBB0_5 .LBB0_5: # Parent Loop BB0_1 Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB0_8 j .LBB0_6 .LBB0_6: # in Loop: Header=BB0_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB0_7 .LBB0_7: # in Loop: Header=BB0_5 Depth=3 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB0_5 .LBB0_8: # in Loop: Header=BB0_3 Depth=2 j .LBB0_9 .LBB0_9: # in Loop: Header=BB0_3 Depth=2 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB0_3 .LBB0_10: # in Loop: Header=BB0_1 Depth=1 j .LBB0_11 .LBB0_11: # in Loop: Header=BB0_1 Depth=1 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB0_1 .LBB0_12: lw s0, 40(sp) # 4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret multMat2(int, int*, int*, int*): # @multMat2(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -28(s0) j .LBB1_1 .LBB1_1: # =>This Loop Header: Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB1_12 j .LBB1_2 .LBB1_2: # in Loop: Header=BB1_1 Depth=1 mv a0, zero sw a0, -36(s0) j .LBB1_3 .LBB1_3: # Parent Loop BB1_1 Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB1_10 j .LBB1_4 .LBB1_4: # in Loop: Header=BB1_3 Depth=2 mv a0, zero sw a0, -32(s0) j .LBB1_5 .LBB1_5: # Parent Loop BB1_1 Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB1_8 j .LBB1_6 .LBB1_6: # in Loop: Header=BB1_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB1_7 .LBB1_7: # in Loop: Header=BB1_5 Depth=3 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB1_5 .LBB1_8: # in 
Loop: Header=BB1_3 Depth=2 j .LBB1_9 .LBB1_9: # in Loop: Header=BB1_3 Depth=2 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB1_3 .LBB1_10: # in Loop: Header=BB1_1 Depth=1 j .LBB1_11 .LBB1_11: # in Loop: Header=BB1_1 Depth=1 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB1_1 .LBB1_12: lw s0, 40(sp) # 4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret multMat3(int, int*, int*, int*): # @multMat3(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -32(s0) j .LBB2_1 .LBB2_1: # =>This Loop Header: Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB2_12 j .LBB2_2 .LBB2_2: # in Loop: Header=BB2_1 Depth=1 mv a0, zero sw a0, -28(s0) j .LBB2_3 .LBB2_3: # Parent Loop BB2_1 Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB2_10 j .LBB2_4 .LBB2_4: # in Loop: Header=BB2_3 Depth=2 mv a0, zero sw a0, -36(s0) j .LBB2_5 .LBB2_5: # Parent Loop BB2_1 Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB2_8 j .LBB2_6 .LBB2_6: # in Loop: Header=BB2_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB2_7 .LBB2_7: # in Loop: Header=BB2_5 Depth=3 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB2_5 .LBB2_8: # in Loop: Header=BB2_3 Depth=2 j .LBB2_9 .LBB2_9: # in Loop: Header=BB2_3 Depth=2 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB2_3 .LBB2_10: # in Loop: Header=BB2_1 Depth=1 j .LBB2_11 .LBB2_11: # in Loop: Header=BB2_1 Depth=1 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB2_1 .LBB2_12: lw s0, 40(sp) # 4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret multMat4(int, int*, int*, int*): # @multMat4(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -32(s0) j .LBB3_1 .LBB3_1: # =>This Loop Header: Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB3_12 j .LBB3_2 .LBB3_2: # in Loop: Header=BB3_1 Depth=1 mv a0, zero sw a0, -36(s0) j .LBB3_3 .LBB3_3: # Parent Loop BB3_1 Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB3_10 j .LBB3_4 .LBB3_4: # in Loop: Header=BB3_3 Depth=2 mv a0, zero sw a0, -28(s0) j .LBB3_5 .LBB3_5: # Parent Loop BB3_1 Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB3_8 j .LBB3_6 .LBB3_6: # in Loop: Header=BB3_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB3_7 .LBB3_7: # in Loop: Header=BB3_5 Depth=3 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB3_5 .LBB3_8: # in Loop: Header=BB3_3 Depth=2 j .LBB3_9 .LBB3_9: # in Loop: Header=BB3_3 Depth=2 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB3_3 .LBB3_10: # in Loop: Header=BB3_1 Depth=1 j .LBB3_11 .LBB3_11: # in Loop: Header=BB3_1 Depth=1 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB3_1 .LBB3_12: lw s0, 40(sp) # 
4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret multMat5(int, int*, int*, int*): # @multMat5(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -36(s0) j .LBB4_1 .LBB4_1: # =>This Loop Header: Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB4_12 j .LBB4_2 .LBB4_2: # in Loop: Header=BB4_1 Depth=1 mv a0, zero sw a0, -28(s0) j .LBB4_3 .LBB4_3: # Parent Loop BB4_1 Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB4_10 j .LBB4_4 .LBB4_4: # in Loop: Header=BB4_3 Depth=2 mv a0, zero sw a0, -32(s0) j .LBB4_5 .LBB4_5: # Parent Loop BB4_1 Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB4_8 j .LBB4_6 .LBB4_6: # in Loop: Header=BB4_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB4_7 .LBB4_7: # in Loop: Header=BB4_5 Depth=3 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB4_5 .LBB4_8: # in Loop: Header=BB4_3 Depth=2 j .LBB4_9 .LBB4_9: # in Loop: Header=BB4_3 Depth=2 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB4_3 .LBB4_10: # in Loop: Header=BB4_1 Depth=1 j .LBB4_11 .LBB4_11: # in Loop: Header=BB4_1 Depth=1 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB4_1 .LBB4_12: lw s0, 40(sp) # 4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret multMat6(int, int*, int*, int*): # @multMat6(int, int*, int*, int*) addi sp, sp, -48 sw ra, 44(sp) # 4-byte Folded Spill sw s0, 40(sp) # 4-byte Folded Spill addi s0, sp, 48 sw a0, -12(s0) sw a1, -16(s0) sw a2, -20(s0) sw a3, -24(s0) mv a0, zero sw a0, -36(s0) j .LBB5_1 .LBB5_1: # =>This Loop Header: Depth=1 lw a0, -36(s0) lw a1, -12(s0) bge a0, a1, .LBB5_12 j .LBB5_2 .LBB5_2: # in Loop: Header=BB5_1 Depth=1 mv a0, zero sw a0, -32(s0) j .LBB5_3 .LBB5_3: # Parent Loop BB5_1 Depth=1 lw a0, -32(s0) lw a1, -12(s0) bge a0, a1, .LBB5_10 j .LBB5_4 .LBB5_4: # in Loop: Header=BB5_3 Depth=2 mv a0, zero sw a0, -28(s0) j .LBB5_5 .LBB5_5: # Parent Loop BB5_1 Depth=1 lw a0, -28(s0) lw a1, -12(s0) bge a0, a1, .LBB5_8 j .LBB5_6 .LBB5_6: # in Loop: Header=BB5_5 Depth=3 lw a0, -16(s0) lw a1, -28(s0) lw a4, -36(s0) lw a5, -12(s0) mul a2, a4, a5 add a2, a2, a1 slli a2, a2, 2 add a0, a0, a2 lw a0, 0(a0) lw a2, -20(s0) lw a3, -32(s0) mul a3, a3, a5 add a4, a4, a3 slli a4, a4, 2 add a2, a2, a4 lw a2, 0(a2) mul a2, a0, a2 lw a0, -24(s0) add a1, a1, a3 slli a1, a1, 2 add a1, a1, a0 lw a0, 0(a1) add a0, a0, a2 sw a0, 0(a1) j .LBB5_7 .LBB5_7: # in Loop: Header=BB5_5 Depth=3 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB5_5 .LBB5_8: # in Loop: Header=BB5_3 Depth=2 j .LBB5_9 .LBB5_9: # in Loop: Header=BB5_3 Depth=2 lw a0, -32(s0) addi a0, a0, 1 sw a0, -32(s0) j .LBB5_3 .LBB5_10: # in Loop: Header=BB5_1 Depth=1 j .LBB5_11 .LBB5_11: # in Loop: Header=BB5_1 Depth=1 lw a0, -36(s0) addi a0, a0, 1 sw a0, -36(s0) j .LBB5_1 .LBB5_12: lw s0, 40(sp) # 4-byte Folded Reload lw ra, 44(sp) # 4-byte Folded Reload addi sp, sp, 48 ret main: # @main addi sp, sp, -2032 sw ra, 2028(sp) # 4-byte Folded Spill sw s0, 2024(sp) # 4-byte Folded Spill addi s0, sp, 2032 lui a2, 29 addi a2, a2, -784 sub sp, sp, a2 add a2, zero, a0 mv a0, zero sw a0, -16(s0) sw a2, -20(s0) sw 
a1, -24(s0) sw a0, -28(s0) j .LBB6_1 .LBB6_1: # =>This Inner Loop Header: Depth=1 lw a1, -28(s0) lui a0, 2 addi a0, a0, 1807 blt a0, a1, .LBB6_4 j .LBB6_2 .LBB6_2: # in Loop: Header=BB6_1 Depth=1 lw a0, -28(s0) slli a1, a0, 2 lui a0, 1048566 addi a0, a0, 932 add a0, a0, s0 add a0, zero, a0 add a1, a1, a0 addi a0, zero, 123 sw a0, 0(a1) j .LBB6_3 .LBB6_3: # in Loop: Header=BB6_1 Depth=1 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB6_1 .LBB6_4: mv a0, zero sw a0, -28(s0) j .LBB6_5 .LBB6_5: # =>This Inner Loop Header: Depth=1 lw a1, -28(s0) lui a0, 2 addi a0, a0, 1807 blt a0, a1, .LBB6_8 j .LBB6_6 .LBB6_6: # in Loop: Header=BB6_5 Depth=1 lw a0, -28(s0) slli a1, a0, 2 lui a0, 1048556 addi a0, a0, 1892 add a0, a0, s0 add a0, zero, a0 add a1, a1, a0 addi a0, zero, 123 sw a0, 0(a1) j .LBB6_7 .LBB6_7: # in Loop: Header=BB6_5 Depth=1 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB6_5 .LBB6_8: mv a0, zero sw a0, -28(s0) j .LBB6_9 .LBB6_9: # =>This Inner Loop Header: Depth=1 lw a1, -28(s0) lui a0, 2 addi a0, a0, 1807 blt a0, a1, .LBB6_12 j .LBB6_10 .LBB6_10: # in Loop: Header=BB6_9 Depth=1 lw a0, -28(s0) slli a1, a0, 2 lui a0, 1048547 addi a0, a0, -1244 add a0, a0, s0 add a0, zero, a0 add a1, a1, a0 addi a0, zero, 123 sw a0, 0(a1) j .LBB6_11 .LBB6_11: # in Loop: Header=BB6_9 Depth=1 lw a0, -28(s0) addi a0, a0, 1 sw a0, -28(s0) j .LBB6_9 .LBB6_12: addi a0, zero, 100 lui a1, 1048566 addi a1, a1, 932 add a1, a1, s0 add a1, zero, a1 lui a2, 1048556 addi a2, a2, 1892 add a2, a2, s0 add a2, zero, a2 lui a3, 1048547 addi a3, a3, -1244 add a3, a3, s0 add a3, zero, a3 call multMat1(int, int*, int*, int*) mv a0, zero lui a1, 29 addi a1, a1, -784 add sp, sp, a1 lw s0, 2024(sp) # 4-byte Folded Reload lw ra, 2028(sp) # 4-byte Folded Reload addi sp, sp, 2032 ret end:
```
:::

As Ripes does not simulate cache miss latency, only hit rates are reported below.

### Results

```
Ripes hit rate (n=32, direct-mapped cache preset):
ijk 0.8657
ikj 0.8323
jik 0.8783
jki 0.9099
kij 0.8334
kji 0.9098
```

The orderings with higher hit rates correspond to better performance. jki and kji perform the best: their innermost loop (over i) steps through A, B, C with strides 1, 0, 1, so consecutive accesses fall in the same or adjacent cache blocks (good spatial locality). ikj and kij perform the worst: their innermost loop (over j) steps through A, B, C with strides 0, n, n, so consecutive accesses to B and C are n elements apart and almost every one touches a different cache block.

## Exercise 3 - Cache Blocking and Matrix Transposition

```cpp
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int yl = 0; yl < n; yl += blocksize) {
        int yh = yl + blocksize;
        if (yh > n) yh = n;
        for (int xl = 0; xl < n; xl += blocksize) {
            int xh = xl + blocksize;
            if (xh > n) xh = n;
            for (int x = xl; x < xh; ++x) {
                for (int y = yl; y < yh; ++y) {
                    dst[y + x * n] = src[x + y * n];
                }
            }
        }
    }
}
```

### Part 1 on x86_64:

```
n=100
Testing naive transpose: 0.021 milliseconds
Testing transpose with blocking: 0.04 milliseconds

n=1000
Testing naive transpose: 1.21 milliseconds
Testing transpose with blocking: 1.487 milliseconds

n=2000
Testing naive transpose: 13.385 milliseconds
Testing transpose with blocking: 4.187 milliseconds

n=5000
Testing naive transpose: 115.5 milliseconds
Testing transpose with blocking: 26.022 milliseconds

n=10000
Testing naive transpose: 583.762 milliseconds
Testing transpose with blocking: 115.983 milliseconds
```

When n >= 2000, the blocked version is faster. For smaller n the whole matrix already fits in the caches, so blocking saves no misses and its extra loop overhead makes it slightly slower.
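For comparison, the naive version being timed is presumably just the plain double loop sketched below (the reference implementation isn't reproduced in this write-up). With x as the outer loop, the writes to `dst` are sequential, but the reads from `src` jump n ints between consecutive iterations; once n is large, each row's cache line is evicted long before the next element of that line is needed, so nearly every read misses. Blocking confines those strided accesses to a tile small enough to stay cached.

```cpp
// Hypothetical naive transpose for reference (not the exact code used above).
// dst is written contiguously, but src is read with stride n: for large n each
// read of src touches a different cache line, and that line is evicted before
// the same row is visited again on the next outer iteration.
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];
}
```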
### Part 2 on x86_64:

```
blocksize=50
Testing naive transpose: 592.001 milliseconds
Testing transpose with blocking: 119.767 milliseconds

blocksize=100
Testing naive transpose: 587.153 milliseconds
Testing transpose with blocking: 120.785 milliseconds

blocksize=500
Testing naive transpose: 583.253 milliseconds
Testing transpose with blocking: 113.549 milliseconds

blocksize=1000
Testing naive transpose: 596.56 milliseconds
Testing transpose with blocking: 127.072 milliseconds

blocksize=5000
Testing naive transpose: 585.119 milliseconds
Testing transpose with blocking: 578.12 milliseconds
```

The blocked version (run with the same n = 10000 as above, judging by the naive timings) only degrades at blocksize=5000, and even then it is still marginally faster than the naive version. When the block size is that large, the source and destination tiles touched by one block no longer fit in any level of the cache, so blocking loses its benefit and the access pattern effectively degenerates to the naive one.
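A rough rule of thumb for choosing blocksize (a back-of-the-envelope estimate, ignoring conflict misses and assuming 4-byte ints): the src tile and dst tile touched while processing one block should together fit in the cache level being targeted,

$$2 \cdot \texttt{blocksize}^2 \cdot 4\ \text{bytes} \lesssim \text{cache size}.$$

For blocksize = 5000 this working set is roughly $2 \cdot 5000^2 \cdot 4\ \text{B} \approx 200\ \text{MB}$, far larger than any cache on the machine, which is consistent with blocking losing its advantage at that point.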