###### tags: `RISC-V`
# Lab7: Cache
## Review Hit and Miss Policies
### 3 hit policies
:::info
* Write-back
* fast
* write to the cache only
    * when a block is evicted and its dirty bit is 1, the block must be written back to memory (sketched below)
* Write-through
* slower than write-back
* write to both cache and main memory
* simple hardware (memory always has the up-to-date data)
* Write-around
* write to main memory only
    * if the written block is present in the cache, its valid bit is set to invalid
:::
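To make the trade-off concrete, here is a toy single-line model (my own sketch, not lab code; the counts are illustrative) comparing how many memory writes each hit policy costs when the same cached word is stored 1000 times:
```c
#include <stdio.h>

/* Toy model of a single cache line: count memory writes when the same
 * cached word is stored n_writes times under each hit policy.
 * Purely illustrative; the numbers are assumptions, not measurements. */
int main(void) {
    int n_writes = 1000;

    /* Write-through: every store updates the cache AND main memory. */
    int mem_writes_through = n_writes;

    /* Write-back: stores update only the cache and set the dirty bit;
     * main memory is written once, when the dirty line is evicted. */
    int dirty = 0, mem_writes_back = 0;
    for (int i = 0; i < n_writes; i++)
        dirty = 1;                  /* each write hit just dirties the line */
    if (dirty)
        mem_writes_back++;          /* write-back happens at eviction */

    printf("write-through: %d memory writes; write-back: %d\n",
           mem_writes_through, mem_writes_back);  /* 1000 vs 1 */
    return 0;
}
```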
### 2 miss policies
:::info
* Write-allocate
* pull the block you missed on into the cache
    * memory is never written to directly (when paired with a write-back cache)
* No write-allocate
* do not pull the block you missed on into the cache
:::
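A similarly minimal sketch (again my own illustration, not lab code) of what the two miss policies mean for the access that follows a write miss:
```c
#include <stdio.h>

/* One-line "cache" holding a single block tag; -1 means empty.
 * Illustrative sketch of what happens after a write miss under each policy. */
int main(void) {
    long cached_tag;
    long b = 7;                       /* block we write, then read back */

    /* Write-allocate: the write miss first pulls block b into the cache. */
    cached_tag = b;
    printf("write-allocate:    next read of b %s\n",
           cached_tag == b ? "hits" : "misses");

    /* No write-allocate: the write goes straight to memory; cache unchanged. */
    cached_tag = -1;
    printf("no-write-allocate: next read of b %s\n",
           cached_tag == b ? "hits" : "misses");
    return 0;
}
```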
### Replacement policies
:::info
* LRU
* Least recently used
* Random
* MRU
* Most recently used
:::
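To make LRU concrete, here is one way the bookkeeping could work for a single 4-way set, using per-way timestamps (an assumption for illustration; real caches and Ripes may use cheaper approximations). MRU is the same code with the comparison flipped, evicting the newest way instead:
```c
#include <stdio.h>

#define WAYS 4

/* LRU for one 4-way set via timestamps: on a miss, evict the way whose
 * last_used stamp is oldest. Illustrative sketch only. */
int main(void) {
    long tag[WAYS], last_used[WAYS];
    for (int w = 0; w < WAYS; w++) { tag[w] = -1; last_used[w] = -1; }

    long accesses[] = {1, 2, 3, 4, 1, 5};  /* tag 5 should evict 2, the LRU */
    for (long now = 0; now < 6; now++) {
        long t = accesses[now];
        int hit = -1, victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (tag[w] == t) hit = w;
            if (last_used[w] < last_used[victim]) victim = w; /* oldest way */
        }
        int w = (hit >= 0) ? hit : victim;
        if (hit < 0) {
            printf("miss on %ld: evict tag %ld\n", t, tag[w]);
            tag[w] = t;
        }
        last_used[w] = now;   /* mark this way most recently used */
    }
    return 0;
}
```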
## Exercise 1 - A Couple of Memory Access Scenarios
**THE FOLLOWING** are good questions to ask yourself as you do these exercises (not checkoff questions):
* How big is your cache block?
* How many consecutive accesses (taking into account the step size) fit within a single block?
* How much data fits in the WHOLE cache?
* How far apart in memory are blocks that map to the same set (and could create conflicts)?
* What is your cache’s associativity?
* Where in the cache does a particular block map to?
* When considering why a specific access is a miss or hit: Have you accessed this piece of data before? If so, is it still in the cache or not?
## Scenario 1:
**Program Parameters**: (set these by initializing the a registers in the code)
* **Array Size (a0)**: 128 (bytes)
* **Step Size (a1)**: 8
* **Rep Count (a2)**: 4
* **Option (a3)**: 0
**Cache Parameters**: (set these in the Cache tab)
* **Cache Levels**: 1
* **Block Size**: 8 (words, i.e. 32 bytes per block; Ripes specifies block size in words)
* **Number of Blocks**: 4
* **Enable?**: Should be green
* **Placement Policy**: Direct Mapped
* **Associativity**: 1
* **Block Replacement Policy**: LRU
### cache.s assembly code
:::spoiler
```
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
#   int array[];    // Assume sizeof(int) == 4
#   for (k = 0; k < repcount; k++) {  // repeat repcount times
#       /* Step through the selected array segment with the given step size. */
#       for (index = 0; index < arraysize; index += stepsize) {
#           if (option == 0)
#               array[index] = 0;                // Option 0: One cache access - write
#           else
#               array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
#       }
#   }

.data
array:  .word 2048              # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:   li a0, 128              # array size in BYTES (power of 2 < array size)
        li a1, 8                # step size (power of 2 > 0)
        li a2, 4                # rep count (int > 0)
        li a3, 0                # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################
        jal accessWords         # lw/sw
        #jal accessBytes        # lb/sb
        li a7, 10               # exit
        ecall

# SUMMARY OF REGISTER USE:
#   a0 = array size in bytes
#   a1 = step size
#   a2 = number of times to repeat
#   a3 = 0 (W) / 1 (RW)
#   s0 = moving array ptr
#   s1 = array limit (ptr)

accessWords:
        la   s0, array          # ptr to array
        add  s1, s0, a0         # hardcode array limit (ptr)
        slli t1, a1, 2          # multiply stepsize by 4 because WORDS
wordLoop:
        beq  a3, zero, wordZero
        lw   t0, 0(s0)          # array[index/4]++
        addi t0, t0, 1
        sw   t0, 0(s0)
        j    wordCheck
wordZero:
        sw   zero, 0(s0)        # array[index/4] = 0
wordCheck:
        add  s0, s0, t1         # increment ptr
        blt  s0, s1, wordLoop   # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessWords    # outer loop done?
        jr   ra

accessBytes:
        la   s0, array          # ptr to array
        add  s1, s0, a0         # hardcode array limit (ptr)
byteLoop:
        beq  a3, zero, byteZero
        lbu  t0, 0(s0)          # array[index]++
        addi t0, t0, 1
        sb   t0, 0(s0)
        j    byteCheck
byteZero:
        sb   zero, 0(s0)        # array[index] = 0
byteCheck:
        add  s0, s0, a1         # increment ptr
        blt  s0, s1, byteLoop   # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessBytes    # outer loop done?
        jr   ra
```
:::
### cache.s pseudocode
:::spoiler
```cpp=
// PSEUDOCODE:
int array[];  // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) {  // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                 // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;  // Option 1: Two cache accesses - read AND write
    }
}
```
:::
### Cache setting in Ripes
:::info

:::
### Running code
1. Set the pointer to the array: s0 = &array.
2. Set the array limit: s1 = s0 + 128.
3. Step size is 8 words; multiplied by 4 (bytes per word): 8 * 4 = 32 bytes per step (t1).
4. Option = 0, which means we only **"write"** 0 to the array.
5. Inner loop:
```
wordZero:
    sw   zero, 0(s0)
wordCheck:
    add  s0, s0, t1         # s0 += 32
    blt  s0, s1, wordLoop   # if s0 < s1 (array base + 128), repeat the inner loop
                            # otherwise fall through to the outer-loop check
```
6. Outer loop:
```
wordCheck:
    ...(inner-loop check)...
    addi a2, a2, -1         # a2's initial value is 4
    bgtz a2, accessWords    # if a2 > 0, go back to step 1
                            # if a2 <= 0, fall through and return
```
7. Done.
### Observation
* The first time we `sw` into the array, the __cache is empty__, so we encounter the first __miss__.
    * That miss loads memory from **0x10000000 to 0x1000001c** (**array\[0]~array\[7]**) to fill the 1st line of the cache.
* The inner loop stores zero at (s0), stepping by 8 words.
    * array[0] -> array[8] -> array[16] -> array[24] -> ~~array[32]~~ -> first inner loop done
* Because the step (32 bytes) equals the block size, every `sw` in the first outer-loop pass lands in a new block and misses.
* After the 1st outer-loop pass the whole array is in the cache, so there are no further misses.
* Loop counts (sanity-checked by the sketch below)
    * outer loop: 4 times (4 -> 3 -> 2 -> 1 -> ~~0~~)
    * inner loop: 4 iterations each (0 -> 32 -> 64 -> 96 -> ~~128~~)
    * misses: 4 (first inner loop)
    * hits: 12 (everything after the first inner loop)
    * hit rate: 3/4; miss rate: 1/4
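These counts can be replayed with a small stand-alone C sketch (my own illustration, not lab code) that runs scenario 1's access pattern against a model of this direct-mapped cache:
```c
#include <stdio.h>

/* Replay scenario 1 against a cache model: 4 direct-mapped lines of
 * 32 bytes (8 words), accesses at 32-byte steps over 128 bytes, 4 reps. */
int main(void) {
    const int lines = 4, block_bytes = 32;
    long tags[4] = {-1, -1, -1, -1};        /* -1 = empty line */
    int hits = 0, misses = 0;

    for (int rep = 0; rep < 4; rep++)                /* rep count (a2) = 4 */
        for (int addr = 0; addr < 128; addr += 32) { /* step 8 words = 32 B */
            long blk = addr / block_bytes;           /* block number */
            int idx = blk % lines;                   /* direct-mapped index */
            if (tags[idx] == blk) hits++;
            else { misses++; tags[idx] = blk; }      /* cold miss, fill line */
        }

    printf("hits=%d misses=%d hit rate=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    /* prints: hits=12 misses=4 hit rate=0.75 */
    return 0;
}
```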
### Tasks
1. What combination of parameters is producing the hit rate you observe?
:::success
**Ans**: The step size (8 words = 32 bytes) exactly matches the block size, so every access in the first repetition touches a new block and misses; the whole 128-byte array fits in the 4-block cache, so the remaining 3 repetitions all hit. That gives 12 hits out of 16 accesses, i.e. the observed 75% hit rate (see the figures above).
:::
2. What is our hit rate if we increase Rep Count arbitrarily? Why?
:::success
**Ans**:
* __The hit rate approaches 100%.__
* As shown above, the whole array sits in the cache after the first repetition, so only the first 4 accesses ever miss: with rep count r the hit rate is (4r - 4)/4r = 1 - 1/r, which goes to 1 as r grows.
:::
3. How could we modify one program parameter to increase our hit rate?
:::success
**Ans**:
* As mentioned in question 2, raising the rep count to 5 (or more) increases the hit rate.
* Lowering the __step size__ to 1 also increases the hit rate: each miss loads 8 words, and with step size 1 all 8 words of a block are used before moving on, so even on the first pass 7 of every 8 accesses hit.
:::
## Scenario 2:
__Program Parameters__: (set these by initializing the a registers in the code)
* __Array Size (a0)__: 256 (bytes)
* __Step Size (a1)__: 2
* __Rep Count (a2)__: 1
* __Option (a3)__: 1
__Cache Parameters__: (set these in the Cache tab)
* __Cache Levels__: 1
* __Block Size__: 16
* __Number of Blocks__: 16
* __Enable?__: Should be green
* __Placement Policy__: N-Way Set Associative
* __Associativity__: 4
* __Block Replacement Policy__: LRU
### Cache setting in Ripes

### Running code
1. Set the pointer to the array: s0 = &array.
2. Set the array limit: s1 = s0 + 256 (the array holds 256/4 = 64 words).
3. Step size is 2 words; multiplied by 4 (bytes per word): 2 * 4 = 8 bytes per step.
4. Option = 1, which means we **"read"** and then **"write"** each visited element.
5. Rep Count is 1, so only the inner loop matters:
```
wordLoop:
    beq  a3, zero, wordZero # not taken, since option (a3) = 1
    lw   t0, 0(s0)          # array[index/4]++
    addi t0, t0, 1
    sw   t0, 0(s0)
wordCheck:
    add  s0, s0, t1         # s0 += 8 (step 2 words * 4 bytes)
    blt  s0, s1, wordLoop   # if s0 < s1 (array base + 256), repeat the inner loop
                            # otherwise go check the rep counter in the outer loop
```
6. Done.
### Observation
* As in scenario 1, the cache starts empty, so the first access to each block is a cold miss.
* In this scenario each cache line holds 16 words, so every miss brings 16 consecutive array elements into the cache.
* There appears to be a display error in the Ripes UI: the 14th word of every line always shows zero.
![](https://i.imgur.com/W7jSVV9.png)
* The 4-way associativity is never exercised here: the cache (16 blocks x 16 words x 4 bytes = 1024 bytes) is four times larger than the 256-byte array, and the array's 4 blocks map to 4 different sets, so no set ever has to evict. A set-associative placement policy is wasted in this scenario.
### Tasks
1. How many memory accesses are there per iteration of the inner loop?
:::success
__Ans__: Each iteration of the inner loop executes one __lw__ and one __sw__, so there are two memory accesses per iteration.
:::
2. What is the repeating hit/miss pattern? WHY?
:::success
__Ans__:
**The pattern repeats every 16 accesses: 1 miss followed by 15 hits.** Each 64-byte block covers 8 visited elements (step = 2 words), and each element is accessed twice (lw then sw), so one block accounts for 16 consecutive accesses; only the first lw that touches the block misses (sanity-checked by the sketch below).
:::
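The pattern can be verified with another small stand-alone sketch (again my own, not lab code). Since the 1024-byte cache never has to evict any of the 256-byte array, a block misses only on its first touch, so the model reduces to a seen-before check:
```c
#include <stdio.h>

/* Replay scenario 2: 256-byte array, step 2 words (8 bytes), option 1
 * (lw then sw per element), 64-byte blocks, no evictions possible. */
int main(void) {
    const int block_bytes = 64;            /* 16 words x 4 bytes */
    int seen[4] = {0};                     /* 256 B / 64 B = 4 blocks */
    int hits = 0, misses = 0;

    for (int addr = 0; addr < 256; addr += 8)        /* one repetition */
        for (int access = 0; access < 2; access++) { /* lw + sw */
            int blk = addr / block_bytes;
            if (seen[blk]) hits++;
            else { misses++; seen[blk] = 1; }        /* first touch: cold miss */
        }

    printf("hits=%d misses=%d hit rate=%.4f\n",
           hits, misses, (double)hits / (hits + misses));
    /* prints: hits=60 misses=4 hit rate=0.9375 (= 15/16) */
    return 0;
}
```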
3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
:::success
__Ans__: Each cold miss loads a 16-word block, of which 8 words are actually visited (step = 2) and each visited word is accessed twice. With 1 miss per 16 accesses, the hit rate is 15/16 = 93.75%.
:::
4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
:::success
__Ans__: After one repetition the whole array is resident in the cache, so later repetitions never miss. With rep count r there are 64r accesses but still only 4 cold misses, so the hit rate (64r - 4)/64r **approaches 100% as Rep Count goes to infinity.**
:::
5. You should have noticed that our hit rate was pretty high for this scenario, and your answer to the previous question should give you a good understanding of why. IF YOU ARE NOT SURE WHY, consider the size of the array and compare it to the size of the cache. Now, consider the following:
__Ans__:
## Exercise 2 - Loop Ordering and Matrix Multiplication
* Efficient matrix multiplication is critical for many applications within the applied sciences.
* The **loop order doesn't matter for correctness** (every (i, j, k) combination is still visited exactly once), but it matters a great deal for performance. Note that the code stores element (i, j) at index i + j*n, i.e. column-major order.
### matrixMultiply.c
::: spoiler
```c=
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* To save you time, we are including all 6 variants of the loop ordering
   as separate functions and then calling them using function pointers.
   The reason for having separate functions that are nearly identical is
   to avoid counting any extraneous processing towards the computation
   time. This includes I/O accesses (printf) and conditionals (if/switch).
   I/O accesses are slow and conditional/branching statements could
   unfairly bias results (lower cases in switches must run through more
   case statements on each iteration). */

void multMat1(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is ijk loop order. */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

void multMat2(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is ikj loop order. */
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

void multMat3(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is jik loop order. */
    for (j = 0; j < n; j++)
        for (i = 0; i < n; i++)
            for (k = 0; k < n; k++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

void multMat4(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is jki loop order. */
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            for (i = 0; i < n; i++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

void multMat5(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is kij loop order. */
    for (k = 0; k < n; k++)
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

void multMat6(int n, float *A, float *B, float *C) {
    int i, j, k;
    /* This is kji loop order. */
    for (k = 0; k < n; k++)
        for (j = 0; j < n; j++)
            for (i = 0; i < n; i++)
                C[i+j*n] += A[i+k*n] * B[k+j*n];
}

/* uses timing features from sys/time.h that you haven't seen before */
int main(int argc, char **argv) {
    int nmax = 1000, i;

    void (*orderings[])(int, float *, float *, float *) =
        {&multMat1, &multMat2, &multMat3, &multMat4, &multMat5, &multMat6};
    char *names[] = {"ijk", "ikj", "jik", "jki", "kij", "kji"};

    float *A = (float *)malloc(nmax*nmax * sizeof(float));
    float *B = (float *)malloc(nmax*nmax * sizeof(float));
    float *C = (float *)malloc(nmax*nmax * sizeof(float));

    struct timeval start, end;

    /* fill matrices with random numbers */
    for (i = 0; i < nmax*nmax; i++) A[i] = drand48()*2 - 1;
    for (i = 0; i < nmax*nmax; i++) B[i] = drand48()*2 - 1;
    for (i = 0; i < nmax*nmax; i++) C[i] = drand48()*2 - 1;

    for (i = 0; i < 6; i++) {
        /* multiply matrices and measure the time */
        gettimeofday(&start, NULL);
        (*orderings[i])(nmax, A, B, C);
        gettimeofday(&end, NULL);

        /* convert time to Gflop/s */
        double seconds = (end.tv_sec - start.tv_sec) +
                         1.0e-6 * (end.tv_usec - start.tv_usec);
        double Gflops = 2e-9*nmax*nmax*nmax/seconds;
        printf("%s:\tn = %d, %.3f Gflop/s\n", names[i], nmax, Gflops);
    }

    free(A);
    free(B);
    free(C);

    printf("\n\n");
    return 0;
}
```
:::
### Observation
|ijk|ikj|jik|jki|kij|kji|
|---|---|---|---|---|---|
|||||||
* The figures above show how each loop ordering walks through memory to reach every array element.
* Because 1000x1000 is too large to visualize, we used a smaller matrix size for these observations.
### Tasks
* Record your results in answers.txt
```shell=
$ make ex2
cc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk: n = 1000, 0.912 Gflop/s
ikj: n = 1000, 0.184 Gflop/s
jik: n = 1000, 0.742 Gflop/s
jki: n = 1000, 3.251 Gflop/s
kij: n = 1000, 0.188 Gflop/s
kji: n = 1000, 2.271 Gflop/s
```
* Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
:::success
* __jki__ ordering performs best.
* In jki order the innermost loop runs over i, and both C[i+j*n] and A[i+k*n] are accessed with stride 1 (B[k+j*n] does not move at all), so consecutive iterations stay within the same cache block.
As the figure above shows, each miss brings a whole block of neighbouring elements into the cache, and the following iterations find their data already cached, so we rarely have to go out to memory.
:::
* Which ordering(s) perform the worst? Why?
:::success
* Both **ikj** and **kij** orderings perform worst.
* Opposite to the previous task: their innermost loop runs over j, so C[i+j*n] and B[k+j*n] jump by a stride of n floats on every iteration. Almost every access lands in a different block, so the required data is rarely in the cache and has to be fetched from memory again and again.
:::
* How does the way we stride through the matrices with respect to the innermost loop affect performance?
:::success
* The smaller the stride of the innermost loop, the better the spatial locality: more accesses are served from the cache instead of memory, and performance improves (see the stand-alone sketch below).
:::
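To see the stride effect in isolation, here is a small stand-alone sketch (an illustration, not part of the lab) that sums the same n x n array once with stride 1 and once with stride n; on a typical machine the stride-1 traversal is several times faster:
```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum the same n x n array twice: row-by-row (stride 1) and
 * column-by-column (stride n). The arithmetic is identical; only the
 * memory access pattern differs. */
int main(void) {
    int n = 4096;
    float *a = malloc((size_t)n * n * sizeof(float));
    for (long i = 0; i < (long)n * n; i++) a[i] = 1.0f;

    volatile float sum = 0;       /* keep the loops from being optimized away */

    clock_t t0 = clock();
    for (int r = 0; r < n; r++)   /* stride 1: walks along cache blocks */
        for (int c = 0; c < n; c++)
            sum += a[(long)r * n + c];
    clock_t t1 = clock();

    for (int c = 0; c < n; c++)   /* stride n: a new block almost every access */
        for (int r = 0; r < n; r++)
            sum += a[(long)r * n + c];
    clock_t t2 = clock();

    printf("stride-1: %.3f s, stride-n: %.3f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}
```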
## Exercise 3 - Cache Blocking and Matrix Transposition
### Matrix Transposition

### Cache Blocking
Cache blocking is a technique that attempts to __reduce the cache miss rate__ by further improving the temporal and/or __spatial locality of memory accesses.__ In the case of matrix transposition we consider performing the transposition one block at a time.
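Before the full implementation in the next section, here is a minimal sketch of the blocking idea, using the same indexing convention as that implementation (dst[x + y*n] = src[x*n + y]); the `&& < n` bounds handle the edge tiles when n is not a multiple of blocksize:
```c
/* Sketch of cache blocking for transpose (a simplified alternative to the
 * submitted implementation below). Each blocksize x blocksize tile is
 * fully transposed before moving on to the next tile. */
void transpose_blocking_sketch(int n, int blocksize, int *dst, int *src) {
    for (int x0 = 0; x0 < n; x0 += blocksize)
        for (int y0 = 0; y0 < n; y0 += blocksize)
            /* transpose one tile; the "&& < n" checks clamp at the edges */
            for (int x = x0; x < x0 + blocksize && x < n; x++)
                for (int y = y0; y < y0 + blocksize && y < n; y++)
                    dst[x + y * n] = src[x * n + y];
}
```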
### implement transpose_blocking()
:::spoiler
```c=
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int block_count = n / blocksize;   // complete blocks per dimension
    int block_left  = n % blocksize;   // leftover rows/columns
    int base_x, base_y;
    int cal_x, cal_y;
    int y, x;
    int inner_y, inner_x;

    // complete blocks
    for (y = 0; y < block_count; y++)
        for (x = 0; x < block_count; x++) {
            base_x = x * blocksize;
            base_y = y * blocksize;
            for (inner_y = 0; inner_y < blocksize; inner_y++)
                for (inner_x = 0; inner_x < blocksize; inner_x++) {
                    cal_x = base_x + inner_x;
                    cal_y = base_y + inner_y;
                    dst[cal_x + cal_y * n] = src[cal_x * n + cal_y];
                }
        }

    // incomplete blocks along the right and bottom edges
    if (block_left != 0) {
        // right strip: all rows, the last (n % blocksize) columns
        base_x = block_count * blocksize;
        for (inner_y = 0; inner_y < n; inner_y++)
            for (inner_x = base_x; inner_x < n; inner_x++)
                dst[inner_x + inner_y * n] = src[inner_x * n + inner_y];

        // bottom strip: the last (n % blocksize) rows, the remaining columns
        base_y = block_count * blocksize;
        for (inner_y = base_y; inner_y < n; inner_y++)
            for (inner_x = 0; inner_x < n - block_left; inner_x++)
                dst[inner_x + inner_y * n] = src[inner_x * n + inner_y];
    }
}
```
:::
### test code
``` shell
$ make ex3
cc -o transpose -ggdb -Wall -pedantic -std=gnu99 -O3 transpose.c
$ ./transpose 10000 33
Testing naive transpose: 935.756 milliseconds
Testing transpose with blocking: 262.243 milliseconds
```
## Part 1 - Changing Array Size
### Task
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
* Record your results in answers.txt
:::success
* fix blocksize to be 20
* Unit: milliseconds
|n|100|1000|2000|5000|10000|
|---|---|---|---|---|---|
| __naive__ |0.012|2.018|21.58|190.73|952.306|
| __blocking__|0.008|1.893|7.934|56.466|304.153|
:::
* At what point does cache blocked version of transpose become faster than the non-cache blocked version?
:::success
* Essentially from the beginning: in our measurements the blocked version is already faster at n = 100, and its advantage grows with n.
:::
* Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
:::success
* We didn't observe a crossover point in our runs, but the usual reasoning is: for small n the whole matrix already fits in the cache, so blocking only adds loop overhead; only once the matrix exceeds the cache size does blocking pay off, and the larger n gets, the more significant the effect becomes.
:::
## Part 2 - Changing Blocksize
### Task
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
* Record your results in answers.txt
:::success
* fix n to be 10000
* Unit: milliseconds
|blocksize|50|100|500|1000|5000|
|---|---|---|---|---|---|
| __naive__ |947.251|945.886|932.057|939.945|943.884|
| __blocking__|245.084|218.47 |260.678|297.151|919.584|
:::
* How does performance change as blocksize increases? Why is this the case?
:::success
* The blocked version is always faster than the naive one here, but the speed peaks at blocksize = 100 and then degrades as blocksize grows, until at blocksize = 5000 it is barely better than naive.
* One reason is that once a blocksize x blocksize tile no longer fits in the cache, accesses within the tile start missing again, so we go back to memory and the execution time rises.
A second reason is that when the blocksize gets close to the matrix size, the blocked code degenerates into the naive traversal and the benefit of blocking is lost.
:::
:::warning
* In exercise 3 we ran the same transpose-blocking program on different servers, and the results differed greatly.
* On one machine the blocked version was always faster than naive.
* On another, naive was always faster than the blocked version.
We were surprised by this; it suggests the underlying hardware (cache sizes and hierarchy) also has a very serious effect on this exercise.
:::