---
###### tags: `Computer Architecture 2022`
---
# Term project: Cache simulation and case study

teammate: 馮柏為、陳品崴、周士翔

## Exercise 1 - A Couple of Memory Access Scenarios

In this exercise, we use the following code to analyze performance under different cache configurations. The code is adapted from [su20-lab-starter](https://github.com/61c-teach/su20-lab-starter/tree/master/lab07).

```c=
/* Assume that each value in array is initialized to 5.
   Declaration and initialization of `array` omitted. */
void accessWords(int array_size, int step_size, int rep_count, int option) {
    for (int k = 0; k < rep_count; k++) {
        for (int index = 0; index < array_size; index += step_size) {
            if (option == 0) { // Option 0: one cache access - write
                array[index] = 0;
            } else {           // Option 1: two cache accesses - read AND write
                array[index] = array[index] + 1;
            }
        }
    }
}
```

### Scenario 1:
#### Environment
- [Ripes](https://github.com/mortbopet/Ripes)

#### Program Parameters
- Array Size: 256
- Step Size: 8
- Rep Count: 4
- Option: 0

The relevant part of the assembly code:
```
main:
	li a0, 256  # array size in BYTES (power of 2 < array size)
	li a1, 8    # step size (power of 2 > 0)
	li a2, 4    # rep count (int > 0)
	li a3, 0    # 0 - option 0, 1 - option 1
```

#### Cache Parameters
- Cache Levels: 1
- Block Size: 4 bytes (Ripes)
- Number of Blocks: 16
- Associativity: Direct Mapped
- Block Replacement: LRU

(A block in [Ripes](https://github.com/mortbopet/Ripes) holds 4 bytes, and we put 4 blocks in one line, so a cache line holds 16 bytes.)

The parameters set in Ripes:
![](https://i.imgur.com/AFFeSFd.png)

The layout of the cache:
![](https://i.imgur.com/mCEvQSE.png)

According to the cache layout, each line holds 4 elements, since the array elements are 4-byte integers.

#### Accessed Elements
##### First accessed element
![](https://i.imgur.com/zHzhGN5.png)
##### Second accessed element
![](https://i.imgur.com/JGRLNAX.png)

#### Results:
![](https://i.imgur.com/avQAyrx.png)

In this result, there are 0 hits and 32 misses (4 reps of 8 accesses each). A step of 8 ints is a 32-byte stride, so every access lands in a cache line that is not currently cached; the first touch of each address is a compulsory miss, and later reps miss again because the line has already been evicted.
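Before answering the questions, we can sanity-check the observed miss count with a tiny stand-alone model. This is a minimal sketch under our reading of the configuration above: 16-byte lines (4 blocks of 4 bytes) and a direct-mapped cache whose total size equals the 32-byte stride, per A1 below. The constants `LINE_BYTES` and `NUM_LINES` and the helper `access_cache` are ours for illustration, not part of Ripes.

```c=
#include <stdio.h>
#include <stdbool.h>

/* Assumed geometry: 2 lines of 16 bytes = 32-byte direct-mapped cache. */
#define LINE_BYTES 16
#define NUM_LINES   2

static unsigned tags[NUM_LINES];
static bool     valid[NUM_LINES];

/* Returns true on a hit; on a miss, fills the line. LRU is trivial for a
   direct-mapped cache: the old block is simply replaced. */
static bool access_cache(unsigned addr)
{
    unsigned line = addr / LINE_BYTES;
    unsigned idx  = line % NUM_LINES;
    unsigned tag  = line / NUM_LINES;
    if (valid[idx] && tags[idx] == tag)
        return true;
    valid[idx] = true;
    tags[idx]  = tag;
    return false;
}

int main(void)
{
    int hits = 0, misses = 0;
    /* Scenario 1: 256-byte array, 32-byte stride, 4 reps, one write each. */
    for (int rep = 0; rep < 4; ++rep)
        for (unsigned addr = 0; addr < 256; addr += 32)
            access_cache(addr) ? ++hits : ++misses;
    printf("hits=%d misses=%d\n", hits, misses); /* prints hits=0 misses=32 */
    return 0;
}
```

Every access maps to the same line index with a fresh tag, which reproduces the 0-hit / 32-miss result above.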
### Tasks
**Q1.** What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: "Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.")

**A1.** Because the step size in bytes is exactly equal to the cache size in bytes, every access maps to the same cache line and evicts the previous block, so the hit rate is 0.

**Q2.** What is our hit rate if we increase Rep Count arbitrarily? Why?

**A2.** The hit rate stays the same. Every `array[index]` access still maps onto a line that currently holds a different block, so each additional rep just repeats the same all-miss sweep.

**Q3.** How could we modify one **program parameter** to increase our hit rate? (Hint: Your answer should be of the form: "Change [parameter] to [value]." Note: we don't care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

**A3.** Change the step size to 4; the hit rate then becomes 0.5.
![](https://i.imgur.com/s52dFWF.png)

---

### Scenario 2:
#### Environment
- [Ripes](https://github.com/mortbopet/Ripes)

#### Program Parameters
- Array Size: 256
- Step Size: 16
- Rep Count: 1
- Option: 1

The relevant part of the assembly code:
```
main:
	li a0, 256  # array size in BYTES (power of 2 < array size)
	li a1, 16   # step size (power of 2 > 0)
	li a2, 1    # rep count (int > 0)
	li a3, 1    # 0 - option 0, 1 - option 1
```

#### Cache Parameters
- Cache Levels: 1
- Block Size: 4 bytes (Ripes)
- Number of Blocks: 16
- Associativity: 4-way Set Associative
- Block Replacement: LRU

(A block in [Ripes](https://github.com/mortbopet/Ripes) holds 4 bytes, and we put 4 blocks in one line.)

The parameters set in Ripes:
![](https://i.imgur.com/jpEiqse.png)

The layout of the cache:
![](https://i.imgur.com/T4e2sXE.png)

#### Accessed Elements
##### First accessed element
![](https://i.imgur.com/XTKLJe8.png)
##### Second accessed element
![](https://i.imgur.com/gNuOM6Q.png)

### Tasks
**Q1.** How many memory accesses are there per iteration of the inner loop?

**A1.** There are two accesses per iteration. Because we set `option=1`, the code executes `array[index] = array[index] + 1`, which reads and then writes the element. (Compare with Scenario 1, where `option=0` executes `array[index] = 0` and memory is accessed only once per iteration.)

**Q2.** What is the repeating hit/miss pattern? WHY?

**A2.** miss, hit, miss, hit, ...
**First access misses (compulsory miss):** a line holds 4 words, and `step_size=16` means each iteration jumps to a line that has never been touched before, so only one element per line is ever accessed.
**Second access hits:** because `option=1`, the same element is accessed twice in the same iteration (read, then write), and the read has already brought the line into the cache.

**Q3.** This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

**A3.** The hit rate is 0.5 because there is 1 hit in every 2 accesses.

**Q4.** Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

**A4.** As `rep_count` goes to infinity, the hit rate approaches 1. All the elements we need are already in the cache after the first rep, so no further misses occur.

Elements in the cache after the first rep:
![](https://i.imgur.com/azKZ41K.png)

With `rep_count=1000`, the cache result is shown below.
![](https://i.imgur.com/OgcviAq.png)
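As a quick cross-check of A4, we can write the expected hit rate in closed form. This is a hedged sketch under our assumptions: the stride of 16 words (64 bytes) touches 256/64 = 4 elements, each costs exactly one compulsory miss, and all 4 lines stay resident after the first rep (they occupy only 4 of the 16 blocks). The function `expected_hit_rate` is ours, purely for illustration.

```c=
#include <stdio.h>

/* Expected hit rate for Scenario 2 as a function of rep_count, assuming
   the 4 touched elements all stay cached after the first rep. */
static double expected_hit_rate(long rep_count)
{
    long elements = 256 / 64;                 /* array bytes / stride bytes */
    long accesses = 2 * elements * rep_count; /* option 1: read AND write   */
    long misses   = elements;                 /* one compulsory miss each   */
    return (double)(accesses - misses) / (double)accesses;
}

int main(void)
{
    printf("%f\n", expected_hit_rate(1));    /* 0.500000, as in A3 */
    printf("%f\n", expected_hit_rate(1000)); /* 0.999500, approaching 1 */
    return 0;
}
```

In general the rate is `1 - 1/(2 * rep_count)`, which goes to 1 as Rep Count goes to infinity, matching the Ripes result above.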
**Q5.** Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) Rep Count times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep.)
**QUESTION:** Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others, AKA it doesn't matter if Rep k is applied to one element before Rep k is applied to another, etc.

**A5.**
**Method:** (provided by 江松穎)
Since the array is that huge, we cannot keep every accessed address cached. Therefore, we make every "kind" of array access to the same element happen together. That is, for element `arr[i]`, instead of incrementing `i` every time we finish task `k`, we apply all functions to `arr[i]` and only then go to the next array member. In this way, every array member incurs only 1 miss. Following is a sample C code:
```c=
int arr[SIZE];
void (*func_ptr)(int *);

for (int i = 0; i < SIZE; i += step_size) {
    for (int k = 0; k < rep_count; ++k) {
        next_func(&func_ptr);  /* load the function for rep k into func_ptr */
        (*func_ptr)(&arr[i]);  /* modifies/uses only arr[i], which is still cached */
    }
}
```

---

## Exercise 2 - Loop Ordering

In this exercise, we analyze loop ordering with a fixed cache configuration, comparing row-major and column-major traversal and observing how each case accesses the cache.

### Environment
- ==[Prior work](https://hackmd.io/VQqoaTVXROKX6OPz-TcKfQ?both) doesn't use [Ripes](https://github.com/mortbopet/Ripes) to simulate the result, so we decided to use Ripes to simulate the hit rates of row-major and column-major traversal in this work.==

#### Cache Parameters
- Cache Levels: 1
- Block Size: 4 bytes (Ripes)
- Number of Blocks: 16
- Associativity: 4-way Set Associative
- Block Replacement: LRU

The parameters set in Ripes:
![](https://i.imgur.com/3RlLno1.png)

The layout of the cache:
![](https://i.imgur.com/Gogwhcv.png)

### Figure of row major
Row-major order means the matrix data is laid out in memory row by row.
![](https://i.imgur.com/Iy788ec.png =70%x)

### Figure of column major
Column-major order means the matrix data is laid out in memory column by column.
![](https://i.imgur.com/pqHxF3R.png =70%x)

### Pseudo code for column major
```c=
int main(int argc, char **argv)
{
    int buf[4096];
    for (int j = 0; j < 1996; j++) { // initialize every element accessed below
        buf[j] = 1;
    }
    int sum = 0;
    for (int j = 0; j < 400; j++)      // column index in the outer loop
        for (int i = 0; i < 400; i++)  // inner loop strides by 4 ints (16 bytes)
            sum += buf[i * 4 + j];
    return 0;
}
```

### Pseudo code for the row major
```c=
int main(int argc, char **argv)
{
    int buf[4096];
    for (int j = 0; j < 1996; j++) { // initialize every element accessed below
        buf[j] = 1;
    }
    int sum = 0;
    for (int i = 0; i < 400; i++)      // row index in the outer loop
        for (int j = 0; j < 400; j++)  // inner loop strides by 1 int (unit stride)
            sum += buf[i * 4 + j];
    return 0;
}
```

### Result of column major
![](https://i.imgur.com/gioc9jY.png)

### Result of row major
![](https://i.imgur.com/yzhKhgc.png)

### 2-way set associative
When consecutive accesses keep touching different lines that map to the same place, a direct-mapped cache must replace that specific line every time, and misses increase. We therefore try a 2-way set-associative cache to reduce the misses.

#### The 2-way set-associative configuration
![](https://i.imgur.com/SnCXmpb.png)
#### 2-way set-associative column-major result
![](https://i.imgur.com/Rsh0Rjc.png)
#### 2-way set-associative row-major result
![](https://i.imgur.com/Zn34Mrr.png)

The 2-way set-associative results in both cases are better than the direct-mapped cache's, meaning that conflict misses are reduced in the new design.
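To see why associativity helps here, it is handy to decompose an address into tag/set/offset by hand, as in the sketch below. It assumes 16-byte lines and 2 sets; the actual set count should be read off the Ripes configuration, so treat the constants as illustrative.

```c=
#include <stdio.h>

#define LINE_BYTES 16 /* assumed: 4 blocks of 4 bytes per line */
#define NUM_SETS    2 /* assumed set count for the 2-way cache */

int main(void)
{
    unsigned addr   = 0x140;              /* example address */
    unsigned offset = addr % LINE_BYTES;  /* byte within the line */
    unsigned set    = (addr / LINE_BYTES) % NUM_SETS;
    unsigned tag    = (addr / LINE_BYTES) / NUM_SETS;
    printf("addr=0x%x -> tag=%u set=%u offset=%u\n", addr, tag, set, offset);
    return 0;
}
```

Two addresses with the same set index but different tags evict each other in a direct-mapped cache, while a 2-way set can hold both lines at once; that is exactly the conflict-miss reduction seen in the results above.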
### Tasks
**Q1.** Which ordering(s) perform best for these 400-by-400 matrices? Why?

**A1.** According to the hit rates above, row major performs better than column major. When the code is written in row-major order, it reads the data sequentially in memory, so consecutive accesses hit adjacent data in the same cache line, which raises the hit rate.

**Q2.** Which ordering(s) perform the worst? Why?

**A2.** According to the hit rates above, column major performs the worst. When the code is written in column-major order, it does not read the data in memory order; consecutive accesses are far apart, so they rarely share a cache line and the hit rate drops.

**Q3.** How does the way we stride through the matrices with respect to the innermost loop affect performance?

**A3.** To get higher performance, we should raise the cache hit rate. From the small experiment above, writing the code in row-major style gives the higher hit rate. In other words, the innermost loop should advance by the smallest possible stride, so that the data read on consecutive iterations are as adjacent as possible and tend to fall in the same cache line.

---

## Exercise 3 - Cache Blocking and Matrix Transposition

==In this exercise we design a blocked matrix transpose, and we compare the code and performance provided by prior works on **Ripes**.==

Code designed by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v):
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    int block_count = n / blocksize;
    for (int i = 0; i < block_count; ++i) {     // i := row block index
        for (int j = 0; j < block_count; ++j) { // j := column block index
            for (int a = 0; a < blocksize; ++a) {
                for (int b = 0; b < blocksize; ++b) {
                    if (i * blocksize + a < n && j * blocksize + b < n) {
                        dst[(j * blocksize + b) + (i * blocksize + a) * n] =
                            src[(j * blocksize + b) * n + (i * blocksize + a)];
#ifdef EBUG /* compile with -DEBUG to trace the moves */
                        printf("Moving src[%d] ", (j * blocksize + b) * n + (i * blocksize + a));
                        printf("to dst[%d]", (j * blocksize + b) + (i * blocksize + a) * n);
                        putchar('\n');
#endif
                    }
                }
            }
        }
    }
}
```
Performance by Ripes:
- `n`=10, `blocksize`=2: ==hit rate: 98.19%==
![](https://i.imgur.com/uKYPYWX.png)
![](https://i.imgur.com/tF0HDo4.png)
- `n`=20, `blocksize`=2: ==hit rate: 96.49%==
![](https://i.imgur.com/VLd0k1z.png)
![](https://i.imgur.com/vUwufL7.png)
- `n`=30, `blocksize`=2: ==hit rate: 95.95%==
![](https://i.imgur.com/Y9HKE03.png)
![](https://i.imgur.com/A9LHnjj.png)

**With the block size fixed, a smaller `n` gives a higher hit rate, because a larger fraction of the smaller matrix fits in the cache at once.**

Code designed by [魏晉成](https://hackmd.io/T_MxFzGNRMqS5b-7oGaz9w):
```c=
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int row = 0; row < n / blocksize; row++) {
        for (int col = 0; col < n / blocksize; col++) {
            int dst_base_offset = row * blocksize + col * blocksize * n,
                src_base_offset = col * blocksize + row * blocksize * n;
            for (int inner_row = 0; inner_row < blocksize; inner_row++) {
                for (int inner_col = 0; inner_col < blocksize; inner_col++) {
                    *(dst + dst_base_offset + inner_col * n + inner_row) =
                        *(src + src_base_offset + inner_row * n + inner_col);
                }
            }
        }
    }
}
```
Performance by Ripes:
- `n`=10, `blocksize`=2: ==hit rate: 97.82%==
![](https://i.imgur.com/CLtFGI4.png)
![](https://i.imgur.com/31IoSU2.png)
- `n`=20, `blocksize`=2: ==hit rate: 95.88%==
![](https://i.imgur.com/cnd43g3.png)
![](https://i.imgur.com/bwhnxUw.png)
- `n`=30, `blocksize`=2: ==hit rate: 95.25%==
![](https://i.imgur.com/vcQiDwZ.png)
![](https://i.imgur.com/eMx13ZP.png)

**Again, with the block size fixed, a smaller `n` gives a higher hit rate, for the same reason: the smaller matrix occupies fewer cache lines, so fewer of them are evicted between uses.**
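Before measuring the next version, we also sanity-check each `transpose_blocking` against a naive transpose. This harness is our own sketch (the `transpose_naive` reference and `check()` are not from the prior work); it is compiled together with whichever `transpose_blocking` listing is being tested.

```c=
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void transpose_blocking(int n, int blocksize, int *dst, int *src);

/* Reference implementation: straightforward element-by-element transpose. */
static void transpose_naive(int n, int *dst, int *src)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

/* Returns 1 if the blocked transpose matches the naive one. */
static int check(int n, int blocksize)
{
    int *src  = malloc(sizeof(int) * n * n);
    int *want = calloc(n * n, sizeof(int));
    int *got  = calloc(n * n, sizeof(int));
    for (int i = 0; i < n * n; ++i)
        src[i] = i;
    transpose_naive(n, want, src);
    transpose_blocking(n, blocksize, got, src);
    int ok = memcmp(want, got, sizeof(int) * n * n) == 0;
    free(src); free(want); free(got);
    return ok;
}

int main(void)
{
    for (int n = 10; n <= 30; n += 10) /* the same sizes measured above */
        printf("n=%d: %s\n", n, check(n, 2) ? "OK" : "MISMATCH");
    return 0;
}
```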
Code designed by [呂紹樺](https://hackmd.io/@horseradish1208/HkzkUEAiD):
```c=
// Function transpose() transposes one (blocksize x blocksize) block of an (n x n) matrix.
// The start indices of the block must be specified.
void transpose(int IndexX, int IndexY, int blocksize, int n, int *dst, int *src) {
    for (int y = IndexY; y < (IndexY + blocksize); y++) {
        for (int x = IndexX; x < (IndexX + blocksize); x++) {
            if (y < n && x < n) { // prevent any operation at an out-of-bounds index
                dst[y + x * n] = src[x + y * n];
            }
        }
    }
}

// Function transpose_blocking() iterates through the matrix block by block
// and transposes each block.
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    for (int blockX = 0; blockX < n; blockX += blocksize) {
        for (int blockY = 0; blockY < n; blockY += blocksize) {
            transpose(blockX, blockY, blocksize, n, dst, src);
        }
    }
}
```
Performance by Ripes:
- `n`=10, `blocksize`=2: ==hit rate: 97.66%==
![](https://i.imgur.com/7RAoPZw.png)
![](https://i.imgur.com/KEV94Kc.png)
- `n`=20, `blocksize`=2: ==hit rate: 95.99%==
![](https://i.imgur.com/CvnhVsv.png)
![](https://i.imgur.com/5U6jgGw.png)
- `n`=30, `blocksize`=2: ==hit rate: 95.42%==
![](https://i.imgur.com/pHCGwNb.png)
![](https://i.imgur.com/y7LMbgP.png)

**The Ripes simulation again shows that a smaller `n` gives a higher hit rate. Across all of these results, the code by [江松穎](https://hackmd.io/@Uduru0522/rJ-NUxC3v) reaches the highest hit rate.**

### Tasks
**Q1.** At what point does the cache-blocked version of transpose become faster than the non-cache-blocked version?

**A1.** By our test results, the blocked version starts to outperform the non-blocked one when the array size is between 1000 and 2000.
![](https://i.imgur.com/BekfuI4.png)

**Q2.** Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

**A2.** A sufficiently small matrix fits in the cache in its entirety, so both methods achieve nearly the same hit rate; blocking only pays off once the matrix is too large to fit in the cache.

---

Reference:
- [CS 61C Lab 7](https://inst.eecs.berkeley.edu/~cs61c/su20/labs/lab07/)
- [Ripes](https://github.com/mortbopet/Ripes)
- [羅紹豪](https://hackmd.io/VQqoaTVXROKX6OPz-TcKfQ?both)
- [江松穎 - Homework 4](https://hackmd.io/@Uduru0522/rJ-NUxC3v)