--- title: 2020q3 Project (cache-lab) tags: sysprog --- # 2020q3: cache-lab contributed by < `TsundereChen` > > GitHub Repo: [TsundereChen/csapp-cache-lab](https://github.com/TsundereChen/csapp-cache-lab) > [README](https://csapp.cs.cmu.edu/3e/README-cachelab) ## Files ``` - cahcelab.[c,h] Lab helper files - csim.c Cache simulator - csim-ref Executable reference cache simulator - driver.py Driver program, run test-csim and test-trans - Makefile - test-csim Tool to test cache simulator - test-trans.c Tool to test transpose function - tracegen.c Helper program for test-trans - trans.c Transpose function - traces/ Trace files used by test-csim.c ``` ## Environment * Valgrind need to be installed in the system ## Part A - Getting the program argument - [getopt.h](https://www.gnu.org/software/libc/manual/html_node/Using-Getopt.html) - Create arguments according to the reference csim ```clike= void parseArg(int argc, char *argv[]) { if (argc == 1) { printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n"); exit(1); }; int opt; while ((opt = getopt(argc, argv, "hvs:E:b:t:")) != -1) { switch (opt) { case 'h': printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n"); printf("Options:\n"); printf(" -h Print this help message.\n"); printf(" -v Optional verbose flag.\n"); printf(" -s <num> Number of set index bits.\n"); printf(" -E <num> Number of lines per set.\n"); printf(" -b <num> Number of block offset bits.\n"); printf(" -t <file> Trace file.\n"); printf("\n"); printf("Examples:\n"); printf(" linux> ./csim -s 4 -E 1 -b 4 -t traces/yi.trace\n"); printf(" linux> ./csim -v -s 8 -E 2 -b 4 -t traces/yi.trace\n"); exit(1); case 'v': verbose = true; break; case 's': setIndexBit = atoi(optarg); break; case 'E': linePerSet = atoi(optarg); break; case 'b': blockOffsetBit = atoi(optarg); break; case 't': traceFile = fopen(optarg, "r"); break; default: printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n"); exit(1); } } } ``` ### Cache - struct - Cache looks like this ![cache_read_cs213](https://i.imgur.com/sJWRc45.png) - Below is the struct of cache line - The part A of this lab doesn't require us to modify the data in the cache, we just need to check if the valid bit is 1 and if the tag matches the input ```clike= typedef struct cacheData { int valid; int lru; uint64_t tag; } cacheData; ``` - Input - Input files are stored under folder `traces/` - All the input has the same format, like ```clike= L 10,1 M 20,1 L 22,1 S 18,1 <INSTRUCTION> <ADDRESS>,<SIZE> ``` - Four instruction - I (instruction load) - Not used in this lab - L (load) - M (modify) - Load + Store - Two steps - S (store) - Use `sscanf` to read the input - We need to extract these information from `<ADDRESS>` - Tag - Cache Set - Offset - Since we already know the size of cache set and the size of offset, we can use bitwise operation to extract these informations ```clike= while(fgets(buffer, sizeof(buffer), traceFile) != NULL){ counter++; if(buffer[0] == 'I') continue; else{ sscanf(buffer, " %c %lx,%d", &operation, &address, &size); unsigned int tag = parseTag(setIndexBit, blockOffsetBit, address); unsigned int cacheset = parseCacheSet(setIndexBit, blockOffsetBit, address); unsigned int offset = parseOffset(setIndexBit, blockOffsetBit, address); switch(operation){ case 'L': checkCache(tag, cacheset, offset, counter); break; case 'M': checkCache(tag, cacheset, offset, counter); checkCache(tag, cacheset, offset, counter); break; case 'S': checkCache(tag, cacheset, offset, counter); break; } } } ``` - Cache Hit/Miss - If the valid bit is 1 and tag matches, it's a match - If there's no match, then it's a miss. - We need to go through the cache and find if there's any invalid cache - If there is, we activate that cache, and enter the tag information - If there's no match, and all the cache line is valid - Then it's an eviction, we need to find the least frequent used cache, then replace it. - By counting the execution line, we can achieve LRU - The smaller the LRU is, the less frequent it's been used ```clike= void checkCache(int tag, int cacheSet, int offset, int counter) { for(int i = 0; i < linePerSet; i++){ if(cache[cacheSet][i].valid == 1 && cache[cacheSet][i].tag == tag){ cache[cacheSet][i].lru = counter; hits++; if(verbose) printf("HIT\n"); return; } } misses++; if(verbose) printf("MISS\n"); int n_lru = 0; for(int i = 0; i < linePerSet; i++){ if(cache[cacheSet][i].valid == 0){ cache[cacheSet][i].valid = 1; cache[cacheSet][i].tag = tag; cache[cacheSet][i].lru = counter; return; } else if(cache[cacheSet][n_lru].lru > cache[cacheSet][i].lru) n_lru = i; } evictions++; if(verbose) printf("EVICTION!\n"); cache[cacheSet][n_lru].valid = 1; cache[cacheSet][n_lru].tag = tag; cache[cacheSet][n_lru].lru = counter; return; }; ``` ## Part B - Need to write a function to perform matrix transpose - Three test cases - 32 x 32 - 64 x 64 - 61 x 67 - s = 5, E = 1, b = 5 - Cache size is 32 * 2 - Cache block size is 32 bytes - Can contain 8 integer at once ### 32 x 32 - Instead of process one integer at a time, because the system will also fetch nearby data at the same time, we can process more integer at a time, utilizing cache and generate faster result. - Because cache block size is 32 bytes, which can contain 8 integer, so we move 8 integer at a time. ```clike= // Case 1 for (int i = 0; i < N; i += 8) { for (int j = 0; j < M; j += 8) { for (int k = i; k < i + 8; k++) { int t0 = A[k][j]; int t1 = A[k][j + 1]; int t2 = A[k][j + 2]; int t3 = A[k][j + 3]; int t4 = A[k][j + 4]; int t5 = A[k][j + 5]; int t6 = A[k][j + 6]; int t7 = A[k][j + 7]; B[j][k] = t0; B[j + 1][k] = t1; B[j + 2][k] = t2; B[j + 3][k] = t3; B[j + 4][k] = t4; B[j + 5][k] = t5; B[j + 6][k] = t6; B[j + 7][k] = t7; } } } ``` :::warning Why the code below doesn't utilize cache? ```clike= for (int i = 0; i < N; i += 8) { for (int j = 0; j < M; j += 8) { for (int k = i; k < i + 8; k++) { B[j + 0][k] = A[k][j]; B[j + 1][k] = A[k][j + 1]; B[j + 2][k] = A[k][j + 2]; B[j + 3][k] = A[k][j + 3]; B[j + 4][k] = A[k][j + 4]; B[j + 5][k] = A[k][j + 5]; B[j + 6][k] = A[k][j + 6]; B[j + 7][k] = A[k][j + 7]; } } } ``` ::: ### 64 x 64 - Refer to blog [深入理解计算机系统 (CS:APP) - 高速缓存实验 Cache Lab 解析](https://billc.io/2019/05/csapp-cachelab/) - Process using 4 blocks. ```clike= int t0, t1, t2, t3, t4, t5, t6, t7; for (int i = 0; i < N; i += 8) { for (int j = 0; j < M; j += 8) { for (int k = i; k < i + 4; k++) { t0 = A[k][j]; t1 = A[k][j + 1]; t2 = A[k][j + 2]; t3 = A[k][j + 3]; t4 = A[k][j + 4]; t5 = A[k][j + 5]; t6 = A[k][j + 6]; t7 = A[k][j + 7]; B[j][k] = t0; B[j + 1][k] = t1; B[j + 2][k] = t2; B[j + 3][k] = t3; B[j + 0][k + 4] = t7; B[j + 1][k + 4] = t6; B[j + 2][k + 4] = t5; B[j + 3][k + 4] = t4; } for (int h = 0; h < 4; h++) { t0 = A[i + 4][j + 3 - h]; t1 = A[i + 5][j + 3 - h]; t2 = A[i + 6][j + 3 - h]; t3 = A[i + 7][j + 3 - h]; t4 = A[i + 4][j + 4 + h]; t5 = A[i + 5][j + 4 + h]; t6 = A[i + 6][j + 4 + h]; t7 = A[i + 7][j + 4 + h]; B[j + 4 + h][i + 0] = B[j + 3 - h][i + 4]; B[j + 4 + h][i + 1] = B[j + 3 - h][i + 5]; B[j + 4 + h][i + 2] = B[j + 3 - h][i + 6]; B[j + 4 + h][i + 3] = B[j + 3 - h][i + 7]; B[j + 3 - h][i + 4] = t0; B[j + 3 - h][i + 5] = t1; B[j + 3 - h][i + 6] = t2; B[j + 3 - h][i + 7] = t3; B[j + 4 + h][i + 4] = t4; B[j + 4 + h][i + 5] = t5; B[j + 4 + h][i + 6] = t6; B[j + 4 + h][i + 7] = t7; } } } ``` ### 61 x 67 - By the method used in `32 x 32`, and changing the stepping, we can find out that processing 16 integer at a time works the best ```clike= for (int i = 0; i < N; i += 16) { for (int j = 0; j < M; j += 16) { for (int k = i; k < i + 16 && k < N; k++) { for (int l = j; l < j + 16 && l < M; l++) { B[l][k] = A[k][l]; } } } } ``` ## RISC-V simulator ### Ripes - [Cache Simulator with RISC-V](https://github.com/mortbopet/Ripes/wiki/Cache-Simulation) - Use single-cycle processor for cache-simulation - Pipelined processor models may stall a stage which is currently reading from memory - Much faster execution rate compared to pipelined processor - Cache configuration preset in Ripes - 32-entry 4-word direct-mapped - 32-entry 4-word fully associative - 32-entry 4-word 2-way set associative - Several things to notice - Although Ripes can compile C program for you, and run the program It will generate lots of assembly code, which will be harder for user to trace the program - Use [Compiler Explorer](https://godbolt.org/) to generate assembly, then add main function assembly code and termation function assembly code to stop the program - Refer to [Adapting Compiler Explorer generated RISC V assembly code](https://github.com/mortbopet/Ripes/wiki/Adapting-Compiler-Explorer-generated-RISC-V-assembly-code) - Since we can't pass the whole lab into Ripes, we need to create a function to perform matrix transpose. - Ripes doesn't support - Function `rand()` - Array in `.data` assembly ### Experiment #### N = 8 - Start with sample code below ```clike= #include <stdlib.h> #define MATRIX_SIZE 8 void transpose(){ int A[MATRIX_SIZE][MATRIX_SIZE], B[MATRIX_SIZE][MATRIX_SIZE]; for(int i = 0; i < MATRIX_SIZE; i++){ for(int j = 0; j < MATRIX_SIZE; j++){ A[i][j] = (i+1) * (j+1); } }; for(int j = 0; j < MATRIX_SIZE; j++){ for(int i = 0; i < MATRIX_SIZE; i++){ int t0 = A[i][j]; B[j][i] = t0; } } return; } ``` - Three cache configuration - 32-entry 4-word direct-mapped Hit rate:96.62% ![n8_direct_mapped](https://i.imgur.com/8BqKC6b.png) - 32-entry 4-word fully associative Hit rate: 97.16% ![n8_fully_associative](https://i.imgur.com/gFDVjkb.png) - 32-entry 4-word 2-way set associative Hit rate: 97.08% ![n8_2-way_set_associative](https://i.imgur.com/DqeDLku.png) - Fetch 4 variables at once ```clike= void transpose(){ int A[MATRIX_SIZE][MATRIX_SIZE], B[MATRIX_SIZE][MATRIX_SIZE]; for(int i = 0; i < MATRIX_SIZE; i++){ for(int j = 0; j < MATRIX_SIZE; j++){ A[i][j] = (i+1) * (j+1); } }; for(int j = 0; j < MATRIX_SIZE; j+=4){ for(int i = 0; i < MATRIX_SIZE; i++){ int t0 = A[i][j]; int t1 = A[i][j+1]; int t2 = A[i][j+2]; int t3 = A[i][j+3]; B[j][i] = t0; B[j+1][i] = t1; B[j+2][i] = t2; B[j+3][i] = t3; } } return; } ``` - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 93.81% ![n8_direct_mapped](https://i.imgur.com/3sD4ENr.png) - 32-entry 4-word fully associative Hit rate: 96.54% ![n8_fully_associative](https://i.imgur.com/4nfIzTi.png) - 32-entry 4-word 2-way set associative Hit rate: 96.72% ![n8_2-way_set_associative](https://i.imgur.com/tyvh9sE.png) - Fetch 8 variables at once - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 94.2% ![n8_direct_mapped](https://i.imgur.com/LSW5j6H.png) - 32-entry 4-word fully associative Hit rate: 95.34% ![n8_fully_associative](https://i.imgur.com/gE4OoW9.png) - 32-entry 4-word 2-way set associative Hit rate: 95.86% ![n8_2-way_set_associative](https://i.imgur.com/nfetRxO.png) | | Direct-mapped | Fully Associative | 2-way Set Associative | | --------- | ------------- | ----------------- | --------------------- | | 1 at once | 96.62% | 97.16% | 97.08% | | 4 at once | 93.81% | 96.54% | 96.72% | | 8 at once | 94.2% | 95.34% | 95.86% | | 4 at once, three for loop | 95.56% | 96.08% | 96.42% | #### N = 32 - Dummy method - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 93.81% ![n32_direct_mapped](https://i.imgur.com/fDf4oV0.png) - 32-entry 4-word fully associative Hit rate: 95.55% ![n32_fully_associative](https://i.imgur.com/EVbWJoJ.png) - 32-entry 4-word 2-way set associative Hit rate: 94.75% ![n32_2-way_set_associative](https://i.imgur.com/t3z4gmq.png) - Fetch 4 variables at once - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 89.87% ![n32_direct_mapped](https://i.imgur.com/v9Q1jU4.png) - 32-entry 4-word fully associative Hit rate: 91.19% ![n32_fully_associative](https://i.imgur.com/WLJR5ps.png) - 32-entry 4-word 2-way set associative Hit rate: 91.19% ![n32_2-way_set_associative](https://i.imgur.com/MCUNEiO.png) - Fetch 8 variables at once - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 93.36% ![n32_direct_mapped](https://i.imgur.com/aXMvre0.png) - 32-entry 4-word fully associative Hit rate: 96.32% ![n32_fully_associative](https://i.imgur.com/DtSDnq6.png) - 32-entry 4-word 2-way set associative Hit rate: 94.86% ![n32_2-way_set_associative](https://i.imgur.com/agCjGC8.png) - Fetch 4 variables at once, three for loop - Three cache configuration - 32-entry 4-word direct-mapped Hit rate: 86.15% ![n32_direct_mapped](https://i.imgur.com/pixiqBd.png) - 32-entry 4-word fully associative Hit rate: 94.36% ![n32_fully_associative](https://i.imgur.com/T0ZjHk5.png) - 32-entry 4-word 2-way set associative Hit rate: 94.03% ![n32_2-way_set_associative](https://i.imgur.com/wSePOLg.png) | | Direct-mapped | Fully Associative | 2-way Set Associative | | --------- | ------------- | ----------------- | --------------------- | | 1 at once | 93.81% | 95.55% | 94.75% | | 4 at once | 89.87% | 91.19% | 91.19% | | 8 at once | 93.36% | 96.32% | 94.86% | | 4 at once, three for loop | 86.15% | 94.36% | 94.03% | | 8 at once, three for loop | 90.11% | 94.03% | 90.11% | #### N = 64 | | Direct-mapped | Fully Associative | 2-way Set Associative | | --------- | ------------- | ----------------- | --------------------- | | 1 at once | 90.88% | 92.17% | 92.13% | | 4 at once | 89.75% | 95.54% | 91.09% | | 8 at once | 89.65% | 95.44% | 90.78% | | 4 at once, three for loop | 88.57% | 94.22% | 89.94% | | 8 at once, three for loop | 90.90% | 94.58% | 92.05% | ## Reference - [CS:APP 3/e cachelab](https://hackmd.io/@pSnFKx_GTlmTWXn4A8lpKw/SkeA19jWN?type=view) - [CSAPP:Lab4-Cache Lab](https://zhuanlan.zhihu.com/p/142942823)