# Assignment4: Cache

###### tags: `riscv`

## Cache.s

```c=
.data
array: .word 2048 # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:
    li a0, 256 # array size in BYTES (power of 2 < array size)
    li a1, 2   # step size (power of 2 > 0)
    li a2, 1   # rep count (int > 0)
    li a3, 1   # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
    jal accessWords # lw/sw
    #jal accessBytes # lb/sb
    li a7, 10 # exit # Here I modify the register from a0 to a7 for Ripes.
    ecall

# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)

accessWords:
    la s0, array         # ptr to array
    add s1, s0, a0       # hardcode array limit (ptr)
    slli t1, a1, 2       # multiply stepsize by 4 because WORDS
wordLoop:
    beq a3, zero, wordZero
    lw t0, 0(s0)         # array[index/4]++
    addi t0, t0, 1
    sw t0, 0(s0)
    j wordCheck
wordZero:
    sw zero, 0(s0)       # array[index/4] = 0
wordCheck:
    add s0, s0, t1       # increment ptr
    blt s0, s1, wordLoop # inner loop done?
    addi a2, a2, -1
    bgtz a2, accessWords # outer loop done?
    jr ra
...
# We focus on word access, so accessBytes is omitted.
```

### Brief description of cache.s

```
# PSEUDOCODE:
# int array[];  // Assume sizeof(int) == 4
# for (k = 0; k < repcount; k++) {  // repeat repcount times
#     /* Step through the selected array segment with the given step size. */
#     for (index = 0; index < arraysize; index += stepsize) {
#         if (option == 0)
#             array[index] = 0;                 // Option 0: one cache access - write
#         else
#             array[index] = array[index] + 1;  // Option 1: two cache accesses - read AND write
#     }
# }

# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)
```

1. `accessWords`
    ```
    la s0, array     # ptr to array
    add s1, s0, a0   # hardcode array limit (ptr)
    slli t1, a1, 2   # multiply stepsize by 4 because WORDS
    ```
    `s0` holds the start address of the array and `s1` holds the end address of the array. `t1` holds the number of bytes to advance per step ($4 \times a1$, since each word is 4 bytes).
2. `wordLoop`
    ```
    beq a3, zero, wordZero
    lw t0, 0(s0)     # array[index/4]++
    addi t0, t0, 1
    sw t0, 0(s0)
    j wordCheck
    ```
    Check whether `a3` equals `0`; if it does, jump to `wordZero`. Otherwise, load the data at `s0` (the pointer into the array) into `t0`, add `1`, and store it back (`array[index] = array[index] + 1`). Then jump to `wordCheck`.
3. `wordCheck`
    ```
    add s0, s0, t1       # increment ptr
    blt s0, s1, wordLoop # inner loop done?
    addi a2, a2, -1
    bgtz a2, accessWords # outer loop done?
    jr ra
    ```
    Advance the array pointer (`s0`) to the next position with `add s0, s0, t1`. If `s0 < s1` (`blt`), jump back to `wordLoop`: we are still in the inner loop and have not yet accessed all the array data. `a2` holds the repeat count, so decrement it by `1` and check whether it has reached `0`.
4. `wordZero`
    ```
    sw zero, 0(s0)   # array[index/4] = 0
    ```
    When the option is `0`, set `array[index]` to `0`.
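For reference, here is a minimal C sketch of `accessWords`, mirroring the pseudocode and register summary above (the parameter names are illustrative, not code from the lab):

```c
// Minimal C sketch of accessWords; parameters mirror a0..a3.
// arraysize is in bytes and stepsize is in words, hence the division by 4.
void accessWords(int *array, int arraysize, int stepsize, int repcount, int option) {
    for (int rep = 0; rep < repcount; rep++)                      // outer loop (a2)
        for (int index = 0; index < arraysize / 4; index += stepsize)
            if (option == 0)
                array[index] = 0;        // sw only: one cache access per index
            else
                array[index] += 1;       // lw + sw: two cache accesses per index
}
```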
### Observation in Ripes

- The program parameter setup:
    ```
    # a0 = 8, array size in bytes
    # a1 = 1, step size --> 1 * 4 Bytes = 4 Bytes
    # a2 = 1, number of times to repeat
    # a3 = 1, 0 (W) / 1 (RW)
    ```
    ![](https://i.imgur.com/gMUB68h.png)
- Observation of the memory:
    ![](https://i.imgur.com/tzX73tP.png)
    1. Here we can see the array pointer at address `0x10000000` and the end pointer at address `0x10000008`. The total array size is 8 bytes (2 words).
    2. Since the step size is `1`, each iteration accesses the next word.
    3. Schematic diagram:

    | Array:M | Data | Address |
    | -------- | -------- | ---------- |
    | **End** | - | **0x10000008** |
    | - | 0x0 | 0x10000007 |
    | - | 0x0 | 0x10000006 |
    | - | 0x0 | 0x10000005 |
    | **M[1]** | **0x1** | **0x10000004** |
    | - | 0x0 | 0x10000003 |
    | - | 0x0 | 0x10000002 |
    | - | 0x8 | 0x10000001 |
    | **M[0]** | **0x1** | **0x10000000** |

    > Why is the data at 0x10000001 always 0x8? And why is the address 0x1000001c always accessed?

- Another case:
    ```
    # a0 = 12, array size in bytes
    # a1 = 2, step size --> 2 * 4 Bytes = 8 Bytes
    # a2 = 3, number of times to repeat
    # a3 = 1, 0 (W) / 1 (RW)
    ```
    ![](https://i.imgur.com/hSaY48L.png)
    The step size is `2` words, so the program accesses the addresses `0x10000000`, `0x10000008`, and `0x10000010`; but `0x10000010` is beyond the limit address `0x1000000c`, so no data is written there. The program repeats `3` times, so `0x10000000` and `0x10000008` end up holding the value `3`.

---

## Review - Hit and Miss Policies

- **Hit**
    - Read hits: return the requested data from the cache to the CPU.
    - Write hits:
        1) **Write-Through Policy**: always write data **to the cache and to memory (*through* the cache)**; forces cache and memory to **always be consistent**; **slow**, so a *write buffer* is included to update memory in parallel with the processor.
        2) **Write-Back Policy**: write data **only to the cache**, then update memory when the block is evicted; allows cache and memory to **be inconsistent**; a ***dirty bit*** per cache row **is set if the block was written to (is dirty) and needs to be written back**.
        3) **Write-Around Policy**: in every situation, data **is written to main memory only**; if the block being written is in the cache, its valid bit is cleared. Essentially, there is no such thing as a write hit in a write-around cache; a write "hit" does the same thing as a write miss.
- **Miss**
    - Read misses: stall execution, fetch the block from memory, put it in the cache, send the requested data to the processor, resume.
    - Write misses:
        1) Always have to update the block from memory.
        2) Bring the updated block into the cache or not?
            - **Write Allocate**: bring the block into the cache after a write miss; usually **paired with write-back** (good when the same address is accessed many times).
            - **No Write Allocate**: only change main memory after a write miss; typically **paired with write-through** (good for infrequent/random writes).
- **Replacement policies**
    - **LRU (Least Recently Used)**: when we evict a cache block to make space, we select the block that was used farthest back in time.
    - **Random**: when we evict a cache block to make space, we randomly select one of the blocks in the cache.
    - **MRU (Most Recently Used)**: when we evict a cache block to make space, we select the block that was used most recently.

:::info
The Venus cache simulator currently simulates a **write-through, write-allocate cache**, so we need to choose the same options in Ripes for our experiment.
:::
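To spell out what "write-through + write-allocate" means operationally, here is a hedged sketch of how a store is serviced under that combination (the helper names are illustrative only, not Ripes or Venus APIs):

```c
#include <stdint.h>

typedef struct Line Line;                     // a cache line (illustrative type)
Line *lookup(uint32_t addr);                  // tag + valid-bit check (illustrative)
Line *fill_block(uint32_t addr);              // fetch the enclosing block (illustrative)
void  write_word(Line *line, uint32_t addr, uint32_t data);
void  write_memory(uint32_t addr, uint32_t data);

// Servicing a CPU store under write-through + write-allocate:
void cache_write(uint32_t addr, uint32_t data) {
    Line *line = lookup(addr);
    if (line == NULL)                // write miss
        line = fill_block(addr);     // write-allocate: bring the block into the cache
    write_word(line, addr, data);    // update the cached copy
    write_memory(addr, data);        // write-through: memory is updated on every store
}
```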
---

## Exercise 1 - A Couple of Memory Access Scenarios

**Task:** Simulate the following scenarios and **record the final cache hit rates** with the program in cache.s. Try to reason out **what the hit rate will be BEFORE running the code**.

**THE FOLLOWING** are good questions to ask yourself as you do these exercises (not checkoff questions):

- How big is your cache block?
- How many consecutive accesses (taking into account the step size) fit within a single block?
- How much data fits in the WHOLE cache?
- How far apart in memory are blocks that map to the same set (and could create conflicts)?
- What is your cache’s associativity?
- Where in the cache does a particular block map to?
- When considering why a specific access is a miss or hit: Have you accessed this piece of data before? If so, is it still in the cache or not?

### Scenario 1

**Program Parameters**: (set these by initializing the a registers in the code)
- Array Size (a0): 128 (bytes)
- Step Size (a1): 8
- Rep Count (a2): 4
- Option (a3): 0

**Cache Parameters**: (set these in the Cache tab)
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 4
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
- **Cache Size (Bytes): $32 \space Bytes = 8 \space Bytes \times 4 \space blocks$**

> ![](https://i.imgur.com/gpgRAiS.png) Here we modify the program parameters for **cache.s** in Ripes.
> ![](https://i.imgur.com/iaZRFsx.png) Confirm that the *Data bits* exactly equal 32 bytes (256 bits).
> ![](https://i.imgur.com/kXGVJPl.png) The result after executing the simulator.
> ![](https://i.imgur.com/cFDVNac.png)
>
> Compare with Venus:
>
> ![](https://i.imgur.com/limTdGp.png)

#### Tasks

1. What combination of parameters is producing the hit rate you observe?

    - Consider the sequence of memory address accesses, starting with a cold cache:
        - M[0] = 0x10000000, M[8] = 0x10000020, M[16] = 0x10000040, M[24] = 0x10000060
    - TIO breakdown:
        - $O = \log_2(8) = 3 \space bits$
        - Cache size / block size = $32/8 = 4$, so $I = \log_2(4) = 2 \space bits$
        - The address width is $A = 32 \space bits$, so $T = 32 - 2 - 3 = 27 \space bits$
        - e.g. `0x10000060 = 0b0001_0000_0000_0000_0000_0000_0110_0000` breaks into `(Tag) 000_1000_0000_0000_0000_0000_0011 | (Index) 00 | (Offset) 000`, i.e. tag `0x00800003`.
    - Because the step size (8 words = 32 bytes) strides over exactly one whole cache, the index after the TIO breakdown is always `0` for M[0], M[8], M[16], and M[24]: every access collides in the same line.
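The field extraction can be checked mechanically. A small sketch for this exact configuration (8-byte blocks, so 3 offset bits; 4 direct-mapped blocks, so 2 index bits):

```c
#include <stdio.h>
#include <stdint.h>

// Tag/Index/Offset extraction for Scenario 1's cache:
// block size 8 B -> 3 offset bits; 4 direct-mapped blocks -> 2 index bits.
int main(void) {
    uint32_t addrs[] = {0x10000000, 0x10000020, 0x10000040, 0x10000060};
    for (int i = 0; i < 4; i++) {
        uint32_t a = addrs[i];
        printf("addr=0x%08x  tag=0x%07x  index=%u  offset=%u\n",
               a, a >> 5, (a >> 3) & 0x3, a & 0x7);  // every address maps to index 0
    }
    return 0;
}
```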
For example, stepping the four accesses through the cold cache (each miss evicts the previous block from index 0):

Access M[0]: miss

| Index | V | Tag | Block 0 | Block 1 |
| ------ | - | ---------- | -------- | -------- |
| 0 | 1 | 0x00800000 | M[0] | M[1] |
| 1 | 0 | - | - | - |
| 2 | 0 | - | - | - |
| 3 | 0 | - | - | - |

Access M[8]: miss

| Index | V | Tag | Block 0 | Block 1 |
| ------ | - | ---------- | -------- | -------- |
| 0 | 1 | 0x00800001 | M[8] | M[9] |
| 1 | 0 | - | - | - |
| 2 | 0 | - | - | - |
| 3 | 0 | - | - | - |

Access M[16]: miss

| Index | V | Tag | Block 0 | Block 1 |
| ------ | - | ---------- | -------- | -------- |
| 0 | 1 | 0x00800002 | M[16] | M[17] |
| 1 | 0 | - | - | - |
| 2 | 0 | - | - | - |
| 3 | 0 | - | - | - |

Access M[24]: miss

| Index | V | Tag | Block 0 | Block 1 |
| ------ | - | ---------- | -------- | -------- |
| 0 | 1 | 0x00800003 | M[24] | M[25] |
| 1 | 0 | - | - | - |
| 2 | 0 | - | - | - |
| 3 | 0 | - | - | - |

**Improving the hit rate under direct mapping**

The program repeats `4` times, so **keeping the array values in the cache as long as possible** is a good approach. For example:

- **Increase the block size**, so that one line holds more of the values and produces more hits.
    > Block size $2^4$, with 0.875 hit rate.
    > ![](https://i.imgur.com/jXuo3uD.png)
    >
    > Block size $2^5$, with 0.9375 hit rate.
    > ![](https://i.imgur.com/ixPNWtO.png)
- Because DM is a one-to-one mapping, **increasing the number of lines (more index bits)** reduces collisions between different blocks and can also improve the hit rate.
    > ![](https://i.imgur.com/F1qYq6F.png)
    >
    > $2^4$ lines, with hit rate 0.75.
    > ![](https://i.imgur.com/qDLa03w.png)
    >
    > ![](https://i.imgur.com/kMW8zLt.png)

2. What is our hit rate if we increase Rep Count arbitrarily? Why?

    Ans. This combination of parameters never produces a hit, so increasing the rep count arbitrarily does not improve the hit rate; it only accumulates more misses, since every access misses. However, if the configuration can produce hits at all, increasing the rep count raises the hit rate, because the cold-start misses are amortized over more repetitions. e.g.

    > Compare repeat `4` with repeat `16`: the hit rate increases from 0.75 to 0.9375.
    > ![](https://i.imgur.com/cdtD7p5.png)

3. How could we modify one program parameter to increase our hit rate?

    Ans. We can change the `step size` to `1`. Each miss then brings two array values (one even-indexed and one odd-indexed word) into the cache at once, so every access to an odd index hits. The final hit rate is therefore 0.5.

    > Miss when the CPU accesses an **even index** of the array.
    > ![](https://i.imgur.com/DSjrO0d.png)
    >
    > Hit when the CPU accesses an **odd index** of the array.
    > ![](https://i.imgur.com/cJTdGW1.png)
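Putting these tasks together, the whole scenario can be predicted with a tiny direct-mapped cache model (a sketch under the Scenario 1 parameters; the variable names are illustrative, and changing `step_bytes` or `reps` reproduces the answers above):

```c
#include <stdio.h>
#include <stdint.h>

// Tiny direct-mapped model of Scenario 1: 4 blocks x 8 B, write-allocate.
#define NBLOCKS 4
#define BLOCKSZ 8

int main(void) {
    uint32_t tag[NBLOCKS] = {0};
    int valid[NBLOCKS] = {0}, hits = 0, accesses = 0;
    int arraysize = 128, step_bytes = 8 * 4, reps = 4;     // a0, a1 * 4, a2
    for (int r = 0; r < reps; r++)
        for (uint32_t a = 0x10000000; a < 0x10000000 + arraysize; a += step_bytes) {
            uint32_t idx = (a / BLOCKSZ) % NBLOCKS, t = a / (BLOCKSZ * NBLOCKS);
            accesses++;
            if (valid[idx] && tag[idx] == t) hits++;
            else { valid[idx] = 1; tag[idx] = t; }         // miss: allocate the block
        }
    printf("hits=%d accesses=%d hit rate=%.4f\n",
           hits, accesses, (double)hits / accesses);       // 0/16 -> 0.0000
    return 0;
}
```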
### Scenario 2

**Program Parameters**: (set these by initializing the a registers in the code)
- Array Size (a0): 256 (bytes)
- Step Size (a1): 2
- Rep Count (a2): 1
- Option (a3): 1

**Cache Parameters**: (set these in the Cache tab)
- Cache Levels: 1
- Block Size: 16
- Number of Blocks: 16
- Enable?: Should be green
- Placement Policy: N-Way Set Associative
- Associativity: 4
- Block Replacement Policy: LRU
- **Cache Size (Bytes): $256 \space Bytes = 16 \space Bytes \times 16 \space blocks$**

> Hit rate is 0.75.
> ![](https://i.imgur.com/sdAIPN6.png)
>
> The result in Ripes. The hit rate is also 0.75.
> ![](https://i.imgur.com/7gj4yHQ.png)

#### Tasks

1. How many memory accesses are there per iteration of the inner loop (not the one involving repcount)? It’s not 1.

    - Accesses to the **cache**: 2 per iteration (one read and one write).
    - Accesses to **main memory**: 2 for an index of the form $4n$ (the read miss fetches the block, then the write goes through to memory) and 1 for an index of the form $4n+2$ (only the write-through, since the read hits).
    - Consider the sequence of memory address accesses, starting with a cold cache:
        - M[0] = 0x10000000, M[2] = 0x10000008, M[4] = 0x10000010, M[6] = 0x10000018, M[8] = 0x10000020, M[10], ..., M[60], M[62]
    - TIO breakdown:
        - $O = \log_2(16) = 4 \space bits$ (green + black)
        - $I = \log_2(Cache \space Size / Block \space Size / N\text{-}way) = \log_2(256B / 16B / 4) = 2 \space bits$
        - $T = 32 - 4 - 2 = 26 \space bits$
    - e.g. Note: the option is `1`, so the CPU accesses the cache twice per index (once to read the value, once to write it back).

    Access M[0]: read miss; the CPU fetches the block from main memory.

    | Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
    | ------ | ---- | - | ---------- | -------- | -------- | -------- | -------- |
    | 0 | 0 | 1 | 0x00400000 | M[0] | M[1] | M[2] | M[3] |
    | 0 | 1 | 0 | - | - | - | - | - |
    | 0 | 2 | 0 | - | - | - | - | - |
    | 0 | 3 | 0 | - | - | - | - | - |

    Access M[0]: write hit; the CPU still accesses main memory because of the write-through.
    Access M[2]: read hit; no main-memory access.
    Access M[2]: write hit; the CPU accesses main memory for the write-through.
    Access M[4]: read miss.

    | Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
    | ------ | ---- | - | ---------- | -------- | -------- | -------- | -------- |
    | 1 | 0 | 1 | 0x00400000 | M[4] | M[5] | M[6] | M[7] |
    | 1 | 1 | 0 | - | - | - | - | - |
    | 1 | 2 | 0 | - | - | - | - | - |
    | 1 | 3 | 0 | - | - | - | - | - |

    Access M[4]: write hit.
    Access M[6]: read hit.
    Access M[6]: write hit.
    ...
    Access M[16]: read miss.

    | Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
    | ------ | ---- | - | ---------- | -------- | -------- | -------- | -------- |
    | 0 | 0 | 1 | 0x00400000 | M[0] | M[1] | M[2] | M[3] |
    | 0 | 1 | 1 | 0x00400001 | M[16] | M[17] | M[18] | M[19] |
    | 0 | 2 | 0 | - | - | - | - | - |
    | 0 | 3 | 0 | - | - | - | - | - |

    Access M[16]: write hit.
    Access M[18]: read hit.
    Access M[18]: write hit.
    ...
    Access M[60]: read miss.
    Access M[60]: write hit.
    Access M[62]: read hit.
    Access M[62]: write hit.

2. What is the repeating hit/miss pattern? WHY?

    Because the block size is 16 bytes (4 words), each slot (line) holds 4 array values. So when the CPU takes a miss, the fill also brings in the value that will be accessed on the next iteration (the step size is `2` words). And since the option is `1`, the CPU accesses the cache twice per index: index $4n$ takes a cache read miss followed by a cache write hit, while index $4n+2$ takes a cache read hit followed by a cache write hit. e.g.

    Access M[0]: read miss; the CPU accesses main memory.
    > Hits: 0, Misses: 1, Writebacks: 0
    > ![](https://i.imgur.com/G4SbMrU.png)

    Access M[0]: write hit; the CPU accesses main memory (write-through).
    > Hits: 1, Misses: 1, Writebacks: 1
    > ![](https://i.imgur.com/kxRPFFc.png)

    Access M[2]: read hit; the CPU does not access main memory.
    > Hits: 2, Misses: 1, Writebacks: 1
    > ![](https://i.imgur.com/YIUKpnv.png)

    Access M[2]: write hit; the CPU accesses main memory (write-through).
    > Hits: 3, Misses: 1, Writebacks: 2
    > ![](https://i.imgur.com/LNEHaZ2.png)

    Access M[4]: read miss.
    > ![](https://i.imgur.com/dCrImEC.png)

    ... and the pattern repeats every 4 accesses.
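Counting that repeating pattern directly predicts the final hit rate without running the simulator (a quick sketch under Scenario 2's parameters):

```c
#include <stdio.h>

// Predicting Scenario 2's hit rate from the repeating pattern:
// step = 2 words, block = 4 words, option 1 => one read + one write per index.
int main(void) {
    int hits = 0, accesses = 0;
    for (int word = 0; word < 64; word += 2) {  // 256 B array = 64 words
        accesses += 2;                          // read then write (option 1)
        hits += (word % 4 == 0) ? 1 : 2;        // the read at index 4n misses;
    }                                           // everything else hits
    printf("hit rate = %d/%d = %.2f\n",
           hits, accesses, (double)hits / accesses);  // prints 48/64 = 0.75
    return 0;
}
```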
3. This should follow very straightforwardly from the above question: explain the hit rate in terms of the hit/miss pattern.

    Ans. Each slot sees 4 accesses: 1 cache miss (the read of the index-$4n$ value) and 3 cache hits. So $hit \space rate = 3/4 = 0.75$.

4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

    Ans. The hit rate approaches `1`: the 256-byte array fits entirely in the 256-byte cache, so after the first pass every access hits, and the initial misses are amortized away.

    > rep count = 10240
    > ![](https://i.imgur.com/kl72DI7.png)

### Scenario 3

L1, L2 cache access.

---

## Exercise 2 - Loop Ordering and Matrix Multiplication

If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:

```c
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```

In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…

- moves through B with stride 1
- moves through A with stride n
- moves through C with stride 0

**REMEMBER:** To compute the matrix multiplication correctly, the loop order doesn’t matter. **BUT**, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.
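The same stride check can be applied to any ordering before running it. For example, the jki ordering (which turns out to be among the fastest below) makes the innermost loop stride-1 for two of the three matrices:

```c
// Stride analysis for the jki ordering: the innermost loop increments i,
// and with column-major storage consecutive i values are adjacent in memory.
for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
// moves through C with stride 1, A with stride 1, B with stride 0
```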
### Observation for matrixMultiply.c

:::spoiler matrixMultiply.c
```c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation
   time. This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could
   unfairly bias results (lower cases in switches must run through more
   case statements on each iteration). */

void multMat1( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ijk loop order. */
    for( i = 0; i < n; i++ )
        for( j = 0; j < n; j++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat2( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is ikj loop order. */
    for( i = 0; i < n; i++ )
        for( k = 0; k < n; k++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat3( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jik loop order. */
    for( j = 0; j < n; j++ )
        for( i = 0; i < n; i++ )
            for( k = 0; k < n; k++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat4( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is jki loop order. */
    for( j = 0; j < n; j++ )
        for( k = 0; k < n; k++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat5( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kij loop order. */
    for( k = 0; k < n; k++ )
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

void multMat6( int n, float *A, float *B, float *C ) {
    int i,j,k;
    /* This is kji loop order. */
    for( k = 0; k < n; k++ )
        for( j = 0; j < n; j++ )
            for( i = 0; i < n; i++ )
                C[i+j*n] += A[i+k*n]*B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main( int argc, char **argv ) {
    int nmax = 1000,i;

    void (*orderings[])(int,float *,float *,float *) =
        {&multMat1,&multMat2,&multMat3,&multMat4,&multMat5,&multMat6};
    char *names[] = {"ijk","ikj","jik","jki","kij","kji"};

    float *A = (float *)malloc( nmax*nmax * sizeof(float));
    float *B = (float *)malloc( nmax*nmax * sizeof(float));
    float *C = (float *)malloc( nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for( i = 0; i < nmax*nmax; i++ ) A[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) B[i] = drand48()*2-1;
    for( i = 0; i < nmax*nmax; i++ ) C[i] = drand48()*2-1;

    for( i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday( &start, NULL );
        (*orderings[i])( nmax, A, B, C );
        gettimeofday( &end, NULL );

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
                         1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf( "%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops );
    }

    free( A );
    free( B );
    free( C );

    printf("\n\n");

    return 0;
}
```
:::

```
$ make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:    n = 1000, 2.191 Gflop/s
ikj:    n = 1000, 0.317 Gflop/s
jik:    n = 1000, 2.097 Gflop/s
jki:    n = 1000, 13.640 Gflop/s
kij:    n = 1000, 0.317 Gflop/s
kji:    n = 1000, 13.615 Gflop/s
```

> This will run some matrix multiplications according to the **six different implementations** in the file, and it will tell you the speed at which each implementation executed the operation. The unit “Gflop/s” reads, “giga-floating-point-operations per second.”
> THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

Modification for `multMat1.c`:

1. multMat1.c
    ```c=
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        int n = 2, id;
        int A[n*n];
        int B[n*n];
        int C[n*n];

        /* fill matrices with constant values */
        for( id = 0; id < n*n; id++ ) A[id] = 3;
        for( id = 0; id < n*n; id++ ) B[id] = 2;
        for( id = 0; id < n*n; id++ ) C[id] = 0;

        /* multMat1, ijk */
        int i,j,k;
        /* This is ijk loop order. */
        for( i = 0; i < n; i++ )
            for( j = 0; j < n; j++ )
                for( k = 0; k < n; k++ ) {
                    C[i+j*n] += A[i+k*n]*B[k+j*n];
                }

        return 0;
    }
    ```
    > 1. n = 2, i.e. the matrices are 2 by 2.
    > 2. Changed to static allocation so that we can observe the data in stack memory in Ripes.
    > 3. Changed the data type from float to int.
    > 4. Used constants as the initial values.
2. Translate to RISC-V assembly with Compiler Explorer (choose RISC-V rv32gc clang (trunk)).
    > ![](https://i.imgur.com/YXuKsx9.png)
3. Modify the assembly code generated by Compiler Explorer.
:::spoiler Assembly code that can execute in Ripes
```asm=
main:                                   # @main
        addi sp, sp, -64
        sw ra, 60(sp)                   # 4-byte Folded Spill
        sw s0, 56(sp)                   # 4-byte Folded Spill
        addi s0, sp, 64
        mv a0, zero
        sw a0, -12(s0)
        addi a1, zero, 2
        sw a1, -16(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        add a2, zero, sp
        sw a2, -24(s0)
        slli a2, a1, 2
        addi a2, a2, 15
        andi a3, a2, -16
        add a2, zero, sp
        sub a2, a2, a3
        sw a2, -60(s0)                  # 4-byte Folded Spill
        add sp, zero, a2
        sw a1, -28(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        slli a2, a1, 2
        addi a2, a2, 15
        andi a3, a2, -16
        add a2, zero, sp
        sub a2, a2, a3
        sw a2, -56(s0)                  # 4-byte Folded Spill
        add sp, zero, a2
        sw a1, -32(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        slli a2, a1, 2
        addi a2, a2, 15
        andi a3, a2, -16
        add a2, zero, sp
        sub a2, a2, a3
        sw a2, -52(s0)                  # 4-byte Folded Spill
        add sp, zero, a2
        sw a1, -36(s0)
        sw a0, -20(s0)
        j .LBB0_1
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        lw a0, -20(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        bge a0, a1, .LBB0_4
        j .LBB0_2
.LBB0_2:                                # in Loop: Header=BB0_1 Depth=1
        lw a0, -60(s0)                  # 4-byte Folded Reload
        lw a1, -20(s0)
        slli a1, a1, 2
        add a1, a1, a0
        addi a0, zero, 3
        sw a0, 0(a1)
        j .LBB0_3
.LBB0_3:                                # in Loop: Header=BB0_1 Depth=1
        lw a0, -20(s0)
        addi a0, a0, 1
        sw a0, -20(s0)
        j .LBB0_1
.LBB0_4:
        mv a0, zero
        sw a0, -20(s0)
        j .LBB0_5
.LBB0_5:                                # =>This Inner Loop Header: Depth=1
        lw a0, -20(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        bge a0, a1, .LBB0_8
        j .LBB0_6
.LBB0_6:                                # in Loop: Header=BB0_5 Depth=1
        lw a0, -56(s0)                  # 4-byte Folded Reload
        lw a1, -20(s0)
        slli a1, a1, 2
        add a1, a1, a0
        addi a0, zero, 2
        sw a0, 0(a1)
        j .LBB0_7
.LBB0_7:                                # in Loop: Header=BB0_5 Depth=1
        lw a0, -20(s0)
        addi a0, a0, 1
        sw a0, -20(s0)
        j .LBB0_5
.LBB0_8:
        mv a0, zero
        sw a0, -20(s0)
        j .LBB0_9
.LBB0_9:                                # =>This Inner Loop Header: Depth=1
        lw a0, -20(s0)
        lw a1, -16(s0)
        mul a1, a1, a1
        bge a0, a1, .LBB0_12
        j .LBB0_10
.LBB0_10:                               # in Loop: Header=BB0_9 Depth=1
        lw a0, -52(s0)                  # 4-byte Folded Reload
        lw a1, -20(s0)
        slli a1, a1, 2
        add a1, a1, a0
        mv a0, zero
        sw a0, 0(a1)
        j .LBB0_11
.LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
        lw a0, -20(s0)
        addi a0, a0, 1
        sw a0, -20(s0)
        j .LBB0_9
.LBB0_12:
        mv a0, zero
        sw a0, -40(s0)
        j .LBB0_13
.LBB0_13:                               # =>This Loop Header: Depth=1
        lw a0, -40(s0)
        lw a1, -16(s0)
        bge a0, a1, .LBB0_24
        j .LBB0_14
.LBB0_14:                               # in Loop: Header=BB0_13 Depth=1
        mv a0, zero
        sw a0, -44(s0)
        j .LBB0_15
.LBB0_15:                               # Parent Loop BB0_13 Depth=1
        lw a0, -44(s0)
        lw a1, -16(s0)
        bge a0, a1, .LBB0_22
        j .LBB0_16
.LBB0_16:                               # in Loop: Header=BB0_15 Depth=2
        mv a0, zero
        sw a0, -48(s0)
        j .LBB0_17
.LBB0_17:                               # Parent Loop BB0_13 Depth=1
        lw a0, -48(s0)
        lw a1, -16(s0)
        bge a0, a1, .LBB0_20
        j .LBB0_18
.LBB0_18:                               # in Loop: Header=BB0_17 Depth=3
        lw a0, -52(s0)                  # 4-byte Folded Reload
        lw a4, -56(s0)                  # 4-byte Folded Reload
        lw a2, -60(s0)                  # 4-byte Folded Reload
        lw a1, -40(s0)
        lw a5, -48(s0)
        lw a6, -16(s0)
        mul a3, a5, a6
        add a3, a3, a1
        slli a3, a3, 2
        add a2, a2, a3
        lw a2, 0(a2)
        lw a3, -44(s0)
        mul a3, a3, a6
        add a5, a5, a3
        slli a5, a5, 2
        add a4, a4, a5
        lw a4, 0(a4)
        mul a2, a2, a4
        add a1, a1, a3
        slli a1, a1, 2
        add a1, a1, a0
        lw a0, 0(a1)
        add a0, a0, a2
        sw a0, 0(a1)
        j .LBB0_19
.LBB0_19:                               # in Loop: Header=BB0_17 Depth=3
        lw a0, -48(s0)
        addi a0, a0, 1
        sw a0, -48(s0)
        j .LBB0_17
.LBB0_20:                               # in Loop: Header=BB0_15 Depth=2
        j .LBB0_21
.LBB0_21:                               # in Loop: Header=BB0_15 Depth=2
        lw a0, -44(s0)
        addi a0, a0, 1
        sw a0, -44(s0)
        j .LBB0_15
.LBB0_22:                               # in Loop: Header=BB0_13 Depth=1
        j .LBB0_23
.LBB0_23:                               # in Loop: Header=BB0_13 Depth=1
        lw a0, -40(s0)
        addi a0, a0, 1
        sw a0, -40(s0)
        j .LBB0_13
.LBB0_24:
        mv a0, zero
        sw a0, -12(s0)
        lw a0, -24(s0)
        add sp, zero, a0
        lw a0, -12(s0)
        addi sp, s0, -64
        lw s0, 56(sp)                   # 4-byte Folded Reload
        lw ra, 60(sp)                   # 4-byte Folded Reload
        addi sp, sp, 64
        j .EXIT
.EXIT:
        li a7, 10
        ecall
```
:::

We can do some modification at the end of `main`:

```
...
.LBB0_24:
        mv a0, zero
        sw a0, -12(s0)
        lw a0, -24(s0)
        add sp, zero, a0
        lw a0, -12(s0)
        addi sp, s0, -64
        lw s0, 56(sp)   # 4-byte Folded Reload
        lw ra, 60(sp)   # 4-byte Folded Reload
        addi sp, sp, 64
        # ret           <-- removed
        j .EXIT         # <-- added: jump to an exit stub
.EXIT:                  #     instead of returning
        li a7, 10       #     Ripes exit syscall
        ecall           #
```

4. Check the result.

    Setup (sp = 0x7fffffff0):
    ![](https://i.imgur.com/Ecktzml.png)

    The result in C:
    ```c
    /* printf("i=%d, j=%d, C[i+j*n]=%d \n", i, j, C[i+j*n]); */
    i=0, j=0, C[i+j*n]=6
    i=0, j=0, C[i+j*n]=12
    i=0, j=1, C[i+j*n]=6
    i=0, j=1, C[i+j*n]=12
    i=1, j=0, C[i+j*n]=6
    i=1, j=0, C[i+j*n]=12
    i=1, j=1, C[i+j*n]=6
    i=1, j=1, C[i+j*n]=12
    ```

    Memory in Ripes:
    > ![](https://i.imgur.com/uT8UcQP.png)

5. The access order (small case, n = 2):
    ![](https://i.imgur.com/FsAFVFE.png)
    ![](https://i.imgur.com/apm2NFY.png)

6. Result in Ripes (n = 50):
    > Hit rate is 0.9174.
    > ![](https://i.imgur.com/4paB7E4.png)

### Task

1. Record your results in answers.txt
    ```
    ijk: n = 1000, 2.191 Gflop/s
    ikj: n = 1000, 0.317 Gflop/s
    jik: n = 1000, 2.097 Gflop/s
    jki: n = 1000, 13.640 Gflop/s
    kij: n = 1000, 0.317 Gflop/s
    kji: n = 1000, 13.615 Gflop/s
    ```
2. Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

    Ans 1. jki (multMat4).

    Ans 2. The innermost loop walks A and C with stride 1, so the CPU mostly accesses adjacent words that are already in the cache.

    n = 2:
    ![](https://i.imgur.com/Mm9kJbY.png)
    n = 50, hit rate = 0.966:
    ![](https://i.imgur.com/rwCP5LQ.png)
3. Which ordering(s) perform the worst? Why?

    Ans 1. ikj (multMat2) and kij (multMat5).

    Ans 2. The innermost loop walks C and B with stride n, so consecutive accesses are far apart in memory and frequently miss; the cache blocks are not big enough to cover the distance between accesses.

    ikj (multMat2):
    ![](https://i.imgur.com/xtkhnDf.png)
    n = 50, hit rate = 0.8692:
    ![](https://i.imgur.com/01ANgtO.png)

    kij (multMat5):
    ![](https://i.imgur.com/63TnE0B.png)
    n = 50, hit rate = 0.8702:
    ![](https://i.imgur.com/avRJID0.png)
4. How does the way we stride through the matrices with respect to the innermost loop affect performance?

    Ans. The smaller the stride of the innermost loop, the better the performance. When the innermost loop strides by n (as in ikj and kij), spatial locality is weak for large n; stride-1 innermost loops (jki and kji) give the best locality and the best performance.

---

## Exercise 3 - Cache Blocking and Matrix Transposition

**Matrix Transposition**

Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a “transposition” and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as A^T^.

![](https://i.imgur.com/T60UZ31.png)

**Cache Blocking**

In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.

![](https://i.imgur.com/8X8moGW.png)

**Things to note**: In the above image, we transpose each submatrix A~ij~ of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to transposing the entire matrix. Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix. This increases the degree of temporal (and spatial) locality we exhibit, improves our cache performance, and so makes the program run faster. This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
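For contrast, the non-blocked baseline presumably looks like the following (a hedged sketch; the lab's actual naive transpose may differ in details):

```c
// Hedged sketch of the non-blocked ("naive") transpose used as the baseline;
// the lab's actual implementation may differ in details.
void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x*n] = src[x + y*n];  // src is walked with stride n:
                                          // poor spatial locality for large n
}
```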
**Transpose_blocking**

```c=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    for (int i = 0; i < n; i += blocksize)
        for (int j = 0; j < n; j += blocksize)
            // transpose the block beginning at [i,j]
            for (int k = i; k < i + blocksize && k < n; ++k)
                for (int l = j; l < j + blocksize && l < n; ++l)
                    dst[k + l*n] = src[l + k*n];
}
```

### Part 1 - Changing Array Sizes

#### Task

Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.

- Record your results in answers.txt
    ```
    n = 100,
    Testing naive transpose: 0.004 milliseconds
    Testing transpose with blocking: 0.004 milliseconds

    n = 1000,
    Testing naive transpose: 0.716 milliseconds
    Testing transpose with blocking: 0.562 milliseconds

    n = 2000,
    Testing naive transpose: 12.236 milliseconds
    Testing transpose with blocking: 3.432 milliseconds

    n = 5000,
    Testing naive transpose: 113.673 milliseconds
    Testing transpose with blocking: 23.722 milliseconds

    n = 10000,
    Testing naive transpose: 600.908 milliseconds
    Testing transpose with blocking: 102.394 milliseconds
    ```
- At what point does the cache blocked version of transpose become faster than the non-cache blocked version?

    Ans. n = 1000.
- Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

    Ans. For small n, the whole matrix already fits in the cache, so even the naive version enjoys good locality; only when the matrix grows beyond the cache does blocking's locality improvement pay off.

### Part 2 - Changing Blocksize

#### Task

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.

- Record your results in answers.txt
    ```
    blocksize = 50,
    Testing naive transpose: 574.05 milliseconds
    Testing transpose with blocking: 118.785 milliseconds

    blocksize = 100,
    Testing naive transpose: 562.55 milliseconds
    Testing transpose with blocking: 111.177 milliseconds

    blocksize = 500,
    Testing naive transpose: 595.29 milliseconds
    Testing transpose with blocking: 137.427 milliseconds

    blocksize = 1000,
    Testing naive transpose: 591.831 milliseconds
    Testing transpose with blocking: 163.924 milliseconds

    blocksize = 5000,
    Testing naive transpose: 583.844 milliseconds
    Testing transpose with blocking: 580.742 milliseconds
    ```
- How does performance change as blocksize increases? Why is this the case?

    Ans. Performance decreases as the blocksize grows (beyond about 100 here). Once a blocksize-by-blocksize tile no longer fits in the cache, lines are evicted while the tile is still being transposed, so spatial locality weakens again; in the worst case (blocksize = 5000) performance degenerates to that of the `naive transpose`.

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a **high degree of locality**.

---
