CS:APP Cache Lab 解題筆記

# CS:APP Cache Lab 解題筆記 ###### tags: `cs:app` Cache Lab 對應書本第五章及第六章，目的是讓學生了解快取的機制及程式優化等議題。實驗分為兩個部份 1. Part A: 寫一個小的 C 程式 (200~300 行)，利用 `valgrind` 的 trace 輸出檔，以 LRU 策略模擬快取的行為，計算程式的 cache hit, miss & eviction 2. Part B: 優化矩陣轉置，盡可能降低 cache miss 的發生次數 ## Part A ### **Cache Basic** 當 CPU 執行讀取一個在內存的資料時，它會向 L1 快取發出請求(假設我們只有 L1 快取)，如果能在 L1 快取中找到需求的資料，則稱為 ==L1 快取命中 (cache hit)==，L1 快取將這塊資料讀取出來並傳回 CPU，反之則稱為 ==L1 快取不命中 (cache miss)==，L1 快取向主記憶體發出請求，兩者的傳輸速度可以差到 1~2 個數量級。資料在快取及記憶體之間是以塊 (block) 為單位傳輸，在迭代計算中，我們會希望被存取的資料 1. 放在記憶體中連續的位置，稱為==空間局部性(spatial locality)== ![](https://i.imgur.com/Wi1tjY6.png) 2. 在快取中被多次存取，這稱為==時間局部性(temporal locality)== ![](https://i.imgur.com/v4UBoMN.png) Cache miss 的發生主要有以下三個原因 1. Cold miss: 一開始 Cache 沒有儲存任何資料，因此一定發生 Cache miss 2. Conflict miss: Cache 的容量夠大，但不同資料映射到快取的同一組，導致 Cache line 被反覆替換 3. Capacity miss: 工作集 (working set，程式經常使用的資料) 大小超過快取的大小要寫一個快取的模擬程式，必需先了解快取的結構，一般可以用元組 (S, E, B, m) 來表達 - S: 組別數量 $2^s$ - E: 一組中有幾行(cacje line) - B: 一行儲存幾個字節 $2^b$ - m: 主記憶體的地址位數 P.S. 行跟塊有時被視為同義的詞，區別在於行還包含 `valid` & `tag` 資訊主記憶體被分為三個區塊，分別為 `tag`, `set index` 以及 `block offset`，`set index` 及 `block offset` 各佔用 `s` 及 `b` 個位元，`tag` 則佔用 `m - (b+s)` 位元 - tag: 表示 $S$ 組中是否有我們想要的資料 - set index: 組 $S$ 的索引值 - block offset: 在字節塊中的起始位置快取會先計算==地址的索引值 $S$==，接著去 $S$ 組的==各行 $E$== 確認，是否有效位 $v$ 被設置且 `tag` 值與地址匹配，若有符合兩個條件的 cache line，則為 cache hit，反之則為 cache miss ![](https://i.imgur.com/Q25KSQ6.png) 當 cache miss 發生且組中已經沒有空的行可以存放資料，必需替換其中一行的資料(eviction)，我們必需採取某種策略來選擇被替換的行，本作業要求使用==最久未使用演算法(Least recently used (LRU))==[[3]](https://en.wikipedia.org/wiki/Cache_replacement_policies)，也就是將組中最長時間未被使用的資料剔除 ### **Code Implement** 作業要求程式能接收以下參數，輸入格式為 `./csim [-v] -s <s> -E <E> -b <b> -t <tracefile>`，我們以 `getopt` 實作 - -s: 組別位元數 (必要參數) - -E: 行數 (必要參數) - -b: 塊的位元數 (必要參數) - -t: trace 檔案路徑 (必要參數) ```cpp ... ... int verbos = 0, set_bits = 0, set_num_lines = 0, block_bits = 0; ... ... int read_argv(int argc, char* argv[]){ /* # of args less than expected */ if (argc < 5) return 0; char* endptr; char* file_ext; int cmd_opt, flag_check[4]; while (1) { cmd_opt = getopt(argc, argv, "vs:E:b:t:"); /* has permuted all argv */ if (cmd_opt == -1) break; switch (cmd_opt) { case 's': set_bits = strtol(optarg, &endptr, 10); flag_check[0] = 1; break; case 'E': set_num_lines = strtol(optarg, &endptr, 10); flag_check[1] = 1; break; case 'b': block_bits = strtol(optarg, &endptr, 10); flag_check[2] = 1; break; case 't': strncpy(trace_filename, optarg, MAX_STR_LEN); trace_filename[MAX_STR_LEN - 1] = '\0'; file_ext = strchr(trace_filename, '.'); /* if '.' is not in file name or extension is in wrong format*/ if (file_ext == NULL || strcmp(file_ext, ".trace") != 0) { printf("trace file should be with extension .trace\n"); return 0; } flag_check[3] = 1; break; case 'v': verbos = 1; break; case '?': /* unexpected flag or miss argument for option*/ return 0; } } /* check whether arguments of -s -E -b -t are all correctly entered*/ for (int i = 0; i < 4; i++) { if (flag_check[i] == 0) return 0; } return 1; } int main(int argc, char* argv[]) { /********************/ /* intialization */ /********************/ /* read argv */ int status; status = read_argv(argc, argv); /* failed to read arguments */ if (!status) { fprintf(stderr, "Usage: ./csim [-v] -s <s> -E <E> -b <b> -t <tracefile>\n"); exit(0); ... ... } ``` 接著我們初始化紀錄快取狀態的資料結構，==快取其實可以看作 cache line 的二維陣列 `cache_line[S][E]`==，而一個 cache line 中需要儲存 1. valid bit 2. tag 3. LRU time stamp ```cpp typedef unsigned long int uint64_t; typedef struct { int valid; int stamp; uint64_t tag; } cache_line; typedef cache_line* associate; ... ... int initial_associate(){ int num_sets = 1 << set_bits; cache_track = malloc(sizeof(associate) * num_sets); if (!cache_track) return 0; for (int i = 0; i < num_sets; i++){ cache_track[i]= malloc(sizeof(cache_line) * set_num_lines); if (!cache_track[i]) return 0; for (int j = 0; j < set_num_lines; j++){ cache_track[i]->stamp = 0; cache_track[i]->tag = 0; cache_track[i]->valid = 0; } } return 1; } int main(int argc, char* argv[]){ ... ... /* initialize an array to track cache */ status = initial_associate(); if (!status) { fprintf(stderr, "Failed to allocate memory for cache tracking array\n"); exit(0); } ... ... } ``` 觀察 `valgrind` trace 檔的格式，其中 `I` 代表 instruction，`M` 代表資料修改，`L` 代表資料讀取，`S` 代表資料儲存 ```cpp I <記憶體位置>, <字節大小> M <記憶體位置>, <字節大小> L <記憶體位置>, <字節大小> S <記憶體位置>, <字節大小> ... ... ``` 我們必需一行行讀取資料，從中擷取==操作類型、記憶體位置==來判斷 cache 命中的狀態，作業有特別說明無須考慮 `I` 以及資料跨 block 的狀況，所以 instruction 及字節大小對於 cache hit/miss 的計算並不重要 1. 開啟 trace file 2. 建立字串分離用的陣列 (其實可以簡化使用 `fscanf` 分析字串，這邊是刻意練習 python like 的 `split` 函式實作) 3. 逐行讀取資料 1. 分析字串取得操作類型、記憶體位置以及字節大小(當開啟 verbos 選項時需要) 2. 將記憶體位置送入 `checkcache` 函式，更新 `hits, misses & evictions` 3. 如果==操作類型為 `M`，則無論 hit 或 miss 一定伴隨一個 hit==，所以要將 `hits` 加 1 (修改的動作包含一個讀及一個寫，無論讀取時是 hit or miss，寫入時一定為 hit)， 4. 如果有開啟 verbos 選項則輸出 cache 狀態 ```cpp ... ... associate* cache_track; int hits = 0, misses = 0, evictions = 0; char trace_filename[MAX_STR_LEN]; /* create file name buffer to ensure initial memory allocation */ ... ... int split_add(char* dest_arr[], const char* src_str, int dest_idx, int left, int right){ if (dest_idx >= MAX_COUNT) { printf("index out of bound\n"); return -1; } int idx = 0; char str_buf[MAX_STR_LEN]; for (int i = left; i < right; i++){ str_buf[idx] = src_str[i]; idx++; if (idx >= MAX_STR_LEN) break; } str_buf[idx] = '\0'; strncpy(dest_arr[dest_idx], str_buf, MAX_STR_LEN); return 0; } int split(char* dest_arr[], const char* src_str, const char split_ch){ int str_len = strlen(src_str); int max_count = MAX_COUNT; int i = 0, j = 0; while (max_count-- > 0) { while (i < str_len && src_str[i] == split_ch) i++; if (i == str_len) break; /* reach end of string */ j = i; i++; while (i < str_len && src_str[i] != split_ch && src_str[i] != '\n') i++; split_add(dest_arr, src_str, MAX_COUNT - max_count - 1, j, i); } return 0; } int main(int argc, char* argv[]){ ... ... /* open trace file */ FILE* trace_file = fopen(trace_filename, "r"); if (!trace_file) { fprintf(stderr, "Failed to open trace file: %s\n", trace_filename); exit(0); } /********************/ /* parse trace file */ /********************/ /* allocate space for sub string array */ char* split_str[MAX_COUNT]; char* addr_size[MAX_COUNT]; for (int i = 0; i < MAX_COUNT; i++){ split_str[i] = malloc(sizeof(char) * MAX_STR_LEN); } for (int i = 0; i < MAX_COUNT; i++){ addr_size[i] = malloc(sizeof(char) * MAX_STR_LEN); } /* read trace file line by line */ uint64_t address; int result; char line[MAX_STR_LEN]; char type_ch, *op_type, *data_addr, *op_size,*endptr; while (fgets(line, MAX_STR_LEN, trace_file)){ if (line[0] == 'I') continue; /* Instruction */ /* split line to get operand type, address of data & size of operand data */ /* just for practice purpose of implement split function */ /* could be simplified by using sscanf(line, " %c %lx,%d", &op_type, &data_addr, &op_size); */ split(split_str, line, ' '); op_type = split_str[0]; split(addr_size, split_str[1], ','); data_addr = addr_size[0]; op_size = addr_size[1]; /* compute cache condition of each data manipulation */ address = strtoul(data_addr, &endptr, 16); result = checkcache(address); type_ch = op_type[0]; if (type_ch == 'M') hits++; /* print out parse result if -v flag indicated */ if (verbos){ switch (result){ case 0: printf("%s %s,%s hit\n", op_type, data_addr, op_size); break; case 1: if (type_ch == 'M'){ printf("%s %s,%s miss hit\n", op_type, data_addr, op_size); } else { printf("%s %s,%s miss\n", op_type, data_addr, op_size); } break; case 2: if (type_ch == 'M'){ printf("%s %s,%s miss eviction hit\n", op_type, data_addr, op_size); } else { printf("%s %s,%s miss eviction\n", op_type, data_addr, op_size); } break; } } } ... ... } ``` `checkcache` 為程式計算快取是否命中的主要函式 1. 利用地址配合位元運算，計算組別 $S$ 以及 tag 值 2. 迭代組別中所有行 $E$ 1. 確認 `valid` 位是否被設置，若==尚未設置代表該行為空== 2. 若 `valid` 位被設置，則確認該行的 tag 值是否與輸入地址匹配 - 若==匹配代表快取命中==，`hits` 加 1 並將 LRU time stamp 設置為 0，代表最近有使用過 - 若==不匹配代表快取不命中==，此時 LRU time stamp 必需加 1 ，代表這筆資料沒被使用的時間增加。另外，為了==減少迴圈的次數==，同時判定該行是否為可能被替換的候選人 - 要注意的是 `cache_set[i].stamp >= max_timestamp` ==必需使用 `>=`==，確保使用 `counter` 實作 LRU 的結果與 `queue` 相同 3. `misses` 加 1 4. 替換組的行資料 - 若尚有空行，則填充空行 - 若無空行，則 `eviction` 加 1，並更新 LRU 選定的行資料 5. 不同的回傳值僅代表不同 cache 狀況，供 verbos 使用 ```cpp int checkcache(uint64_t address){ uint64_t set_idx = (address >> block_bits) & ((1 << set_bits) - 1); uint64_t tag_val = address >> (set_bits + block_bits); int max_timestamp = INT_MIN; int empty_line = -1; int evic_line = 0; associate cache_set = cache_track[set_idx]; for (int i = 0; i < set_num_lines; i++){ if (cache_set[i].valid) { if (cache_set[i].tag == tag_val){ hits++; cache_set[i].stamp = 0; return 0; } cache_set[i].stamp++; /* get new index to evict */ /* = is necessary to get all credits that last line will be evicted if tie */ /* to ensure this counter has the identical behavior as queue */ if (cache_set[i].stamp >= max_timestamp){ evic_line = i; max_timestamp = cache_set[i].stamp; } } else { empty_line = i; } } misses++; if (empty_line != -1){ /* update associate array of empty line */ cache_set[empty_line].valid = 1; cache_set[empty_line].stamp = 0; cache_set[empty_line].tag = tag_val; return 1; } else { /* update associate array of eviction line */ evictions++; cache_set[evic_line].valid = 1; cache_set[evic_line].stamp = 0; cache_set[evic_line].tag = tag_val; return 2; } } ``` 最後記得將動態配置的記憶體空間釋放掉 ```cpp int free_associate(){ int num_sets = 1 << set_bits; for (int i = 0; i < num_sets; i++){ free(cache_track[i]); } free(cache_track); return 0; } int main(int argc, char* argv[]){ ... ... /* free memory and close file */ for (int i = 0; i < MAX_COUNT; i++){ free(split_str[i]); free(addr_size[i]); } free_associate(); fclose(trace_file); ... ... } ``` 測試結果與 CMU 提供的參考答案相同 ``` ./test-csim Your simulator Reference simulator Points (s,E,b) Hits Misses Evicts Hits Misses Evicts 3 (1,1,1) 9 8 6 9 8 6 traces/yi2.trace 3 (4,2,4) 4 5 2 4 5 2 traces/yi.trace 3 (2,1,4) 2 3 1 2 3 1 traces/dave.trace 3 (2,1,3) 167 71 67 167 71 67 traces/trans.trace 3 (2,2,3) 201 37 29 201 37 29 traces/trans.trace 3 (2,4,3) 212 26 10 212 26 10 traces/trans.trace 3 (5,1,5) 231 7 0 231 7 0 traces/trans.trace 6 (5,1,5) 265189 21775 21743 265189 21775 21743 traces/long.trace 27 TEST_CSIM_RESULTS=27 ``` 完整程式碼請參考 [Github 連結](https://github.com/Chang-Chia-Chi/CSAPP-Labs/blob/main/Cache-Lab/csim.c) ## Part B ### **Blocking** Blocking 是一種程式優化技術，用來提昇==迴圈計算的空間局部性==，其基本概念是將==資料切成塊狀區域==(block，注意這裡的 block 與快取中的 block 不同)，快取每次都讀取一個或多個塊狀區域的資料，執行所有需要的對應操作，再讀取新的資料塊我們參考 CMU 的講義探討分塊對 cache miss 的影響，假設程式要執行矩陣相乘運算 $C = A× B$，先考慮未優化的簡單實作 ```cpp c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[i*n + j] += a[i*n + k] * b[k*n + j]; } ``` 視覺化呈現如下 - 矩陣的元素為 `double` - cache block 可以存放 8 個 `doublde` - 快取大小 $C << n$ 一次 $k$ 內循環的 cache miss 可以簡單計算 1. 矩陣 A 使用同一行的資料，因此每 8 個元素會 miss 一次，cache miss = $n/8$ 2. 矩陣 B 每次都讀取不同行的資料，所以每次讀取都是 cache miss，cache miss = $n$ 3. 一次 $k$ 內循環(計算 $C$ 矩陣中一個元素)的 cache miss = $n/8 + n = 9n/8$ 其中紅色區域為 $k$ 完成第一次迭代時快取的狀態，可以想像這些資料在==下一次迭代開始時，都會被替換掉==，所以每一次迭代的 cache miss 次數是相同的，完成一次矩陣相乘的總 cache miss 數量為 ==$9n/8 * n^2 = 9n^{3}/8$== ![](https://i.imgur.com/dWodLrW.png) 但假若我們改成一次處理一個區塊的資料，如下方程式碼 `i1~k1` 的部份。假設 cache 的容量可以儲存 3 個 block 的資料 $3B^2 < C$ 1. 每個 block 的 cache miss 為 $B^2/8$(每一列 $B/8$，共有 $B$ 列) 2. 不考慮矩陣 $C$ 的快取，僅考慮矩陣 $A$ & $B$，則 $C$ 的一個 block 共有 $2×n/B$ 次 block 運算 3. 因此一整列 block 的總 cache miss 次數為 $(2n/B)×(B^2/8) = nB/4$ ```cpp c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=B) for (j = 0; j < n; j+=B) for (k = 0; k < n; k+=B) /* B x B mini matrix multiplications */ for (i1 = i; i1 < i+B; i++) for (j1 = j; j1 < j+B; j++) for (k1 = k; k1 < k+B; k++) c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1]; } ``` ![](https://i.imgur.com/cmLmGfm.png) 與簡單實作狀況相同，每次 $k$ 迭代結束後的資料，在下一次迭代開始時都被捨棄，但因為我們採用了 Blocking 的技巧，==迭代次數從 $n^2$ 降為 $(n/B)^2$ 次==，所以完成一次矩陣相乘的總 cache miss 數量為 ==$nB/4 × (n/B)^2 = n^{3}/4B$== ![](https://i.imgur.com/RxmYwl5.png) 對比 non-blocking 及 blocking 的 cache miss - non-blocking: $9n^{3}/8$ - blocking: $n^{3}/4B$ ### **Code Implement** Part B 要我們實作矩陣轉置，並將 cache miss 盡可能降低，Part B 的程式限制如下 - 在 stack 中至多 12 個整數型態的局部變數 - 不得使用 long 或位元操作，將 2 個整數型態變數存在 1 個變數中 - 不得使用遞迴 - 不得修改矩陣 A ，但可以修改矩陣 B - 不得自定義矩陣或使用 `malloc` 對變數動態配置記憶體空間快取的參數 - 快取大小 1KB - 採用直接映射($E = 1$) - Block 大小為 32 Byte($b = 5$) - Set 共 32 組($s = 5$) Eviction 的策略 - 矩陣 A & B 的第一行在 cache 中為同一組 - 對角線元素互相 evict 測試矩陣大小及分數 - 32 x 32: cache miss < 300 滿分 - 64 x 64: cache miss < 1300 滿分 - 61 x 67: cache miss < 2000 滿分 #### **32 x 32** 快取一個塊的大小為 32 Bytes，可放入 8 個整數型別，又整個快取有 32 組，代表快取一次可以存放 32 x 8 = 256 個連續位置的整數。對於 32 x 32 的矩陣來說，等於==每 8 列(256 / 32) 就會發生衝突，因此理想的分塊大小應該為 8 x 8== 另外，因為假設為直接映射，每組都只有一行，等於說==只要發生衝突一定有 eviction==，代表我們必需盡可能降低行替換的次數。作業特別說明對角線元素互相 evict，我們畫圖觀察轉置對角線元素會發生什麼情況，為了簡化以 4 x 4 的狀況來呈現 - T1: 第一次置換，都是 cache miss - T2: 第二次置換，A 是 cache hit，但 B 矩陣第二行不在快取中為 cache miss - T3: 第二次置換，為了將 B 矩陣第二行讀進快取，必需將 A 矩陣第二行替換掉 - T4: 第三次置換，因為 T3 替換了 A 矩陣第二行，在 T4 又必需加載回來 ![](https://i.imgur.com/dGFeSPy.png) 從以上分析可以發現，==快取在 A & B 對角線元素的那一行發生衝突，所以對角線元素的替換會產生 2 次的 miss 及 eviction==。一個比較好的作法是，先將 A 矩陣欲轉置的元素先以局部變數儲存起來，這樣對角線元素置換就不會產生額外的 eviction。32 x 32 的矩陣轉置實作如下 ```cpp /* * trans_1: transpose optimized function for 32 x 32 Matrix */ void trans_1(int M, int N, int A[N][M], int B[M][N]) { int i, j, k; int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8; for (i = 0; i < N; i += 8) { for (j = 0; j < M; j += 8) { for (k = i; k < i + 8; k++) { tmp1 = A[k][j]; tmp2 = A[k][j + 1]; tmp3 = A[k][j + 2]; tmp4 = A[k][j + 3]; tmp5 = A[k][j + 4]; tmp6 = A[k][j + 5]; tmp7 = A[k][j + 6]; tmp8 = A[k][j + 7]; B[j][k] = tmp1; B[j + 1][k] = tmp2; B[j + 2][k] = tmp3; B[j + 3][k] = tmp4; B[j + 4][k] = tmp5; B[j + 5][k] = tmp6; B[j + 6][k] = tmp7; B[j + 7][k] = tmp8; } } } } ``` 測試結果 cache miss = 287，達到滿分的要求 ``` ./test-trans -M 32 -N 32 Function 0 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:1766, misses:287, evictions:255 Function 1 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 1 (Simple row-wise scan transpose): hits:870, misses:1183, evictions:1151 Summary for official submission (func 0): correctness=1 misses=287 TEST_TRANS_RESULTS=1:287 ``` #### **64 x 64** 因為矩陣被放大兩倍，等於==每 4 列就會發生衝突==，我們先嘗試以 4 x 4 分塊優化，cache miss 次數約 1700 次，離滿分的 1300 次仍有段距離 ```cpp /* * trans_2: transpose optimized function for 64 x 64 Matrix */ void trans_2(int M, int N, int A[N][M], int B[M][N]) { int i, j, k; int tmp1, tmp2, tmp3, tmp4; for (i = 0; i < N; i += 4) { for (j = 0; j < M; j += 4) { for (k = i; k < i + 4; k++) { tmp1 = A[k][j]; tmp2 = A[k][j + 1]; tmp3 = A[k][j + 2]; tmp4 = A[k][j + 3]; B[j][k] = tmp1; B[j + 1][k] = tmp2; B[j + 2][k] = tmp3; B[j + 3][k] = tmp4; } } } } ``` ``` ./test-trans -M 64 -N 64 Function 0 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:6498, misses:1699, evictions:1667 Function 1 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 1 (Simple row-wise scan transpose): hits:3474, misses:4723, evictions:4691 Summary for official submission (func 0): correctness=1 misses=1699 TEST_TRANS_RESULTS=1:1699 ``` 這題起初毫無頭緒應該如何修正，參考網路上其他人的筆記才恍然大悟 1. 每 4 行發生衝突，但快取的一行又可以容納 8 個整數型別，那我們可以採用 8 x 8 的分塊策略，然後==在 8 x 8 中再採用 4 x 8 的分塊策略==，一次讀取整列的資料 2. 但如果採用 4 x 8 的分塊，塊中欄索引 4 ~ 7 的資料無法直接寫入矩陣 B。作業特別說明矩陣 B 是可以隨意修改的，這給了我們一個提示，將==部份的資料先放到矩陣 B 中還沒轉置的區域==，等到要將資料寫入正確的位置時，再從 B 讀取出來，提昇讀取矩陣 A 的空間局部性實作 8 x 8 分塊策略的步驟如下： - 上半部 4 x 8 處理 1. 讀取矩陣 A 的一列並暫存於局部變數 2. 將局部變數 `tmp1~tmp4` 存到矩陣 B 欄 $k$，`tmp5~tmp8` 存到矩陣 B 的欄 $k + 4$ 待下半部處理時使用 ![](https://i.imgur.com/I1MGq8q.png) - 下半部 4 x 8 處理 1. 逐列讀取矩陣 A 左下角的資料 `tmp1~tmp4` 2. 逐欄讀取矩陣 B 右上角的資料 `tmp5~tmp8` 3. 逐欄替換矩陣 B 右上角的資料 `tmp1~tmp4` 4. 逐欄替換矩陣 B 左下角的資料 `tmp5~tmp8` ![](https://i.imgur.com/tTUQWEc.png) 5. 最後處理右下角 4 x 4 分塊這麼做的好處是在第 2 & 3 步除了空間局部性外，還有==較好的時間局部性== (我一開始矩陣 A 的讀取以列為單位，而矩陣 B 的讀取及寫入以欄為單位，因為矩陣 B 有兩次不同地址但相同組別的資料操作，造成嚴重的 eviction，cache miss 高達 2659 次，再次說明寫出快取友善程式碼的重要性) ```cpp void trans_2(int M, int N, int A[N][M], int B[M][N]) { int i, j, k; int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8; for (i = 0; i < N; i += 8) { for (j = 0; j < M; j += 8) { for (k = 0; k < 4; k++) { tmp1 = A[i + k][j]; tmp2 = A[i + k][j + 1]; tmp3 = A[i + k][j + 2]; tmp4 = A[i + k][j + 3]; tmp5 = A[i + k][j + 4]; tmp6 = A[i + k][j + 5]; tmp7 = A[i + k][j + 6]; tmp8 = A[i + k][j + 7]; B[j][i + k] = tmp1; B[j + 1][i + k] = tmp2; B[j + 2][i + k] = tmp3; B[j + 3][i + k] = tmp4; B[j][i + k + 4] = tmp5; B[j + 1][i + k + 4] = tmp6; B[j + 2][i + k + 4] = tmp7; B[j + 3][i + k + 4] = tmp8; } for (k = 0; k < 4; k++) { tmp1 = A[i + 4][j + k]; tmp2 = A[i + 5][j + k]; tmp3 = A[i + 6][j + k]; tmp4 = A[i + 7][j + k]; tmp5 = B[j + k][i + 4]; tmp6 = B[j + k][i + 5]; tmp7 = B[j + k][i + 6]; tmp8 = B[j + k][i + 7]; B[j + k][i + 4] = tmp1; B[j + k][i + 5] = tmp2; B[j + k][i + 6] = tmp3; B[j + k][i + 7] = tmp4; B[j + k + 4][i] = tmp5; B[j + k + 4][i + 1] = tmp6; B[j + k + 4][i + 2] = tmp7; B[j + k + 4][i + 3] = tmp8; } for (k = 4; k < 8; k++) { tmp1 = A[i + k][j + 4]; tmp2 = A[i + k][j + 5]; tmp3 = A[i + k][j + 6]; tmp4 = A[i + k][j + 7]; B[j + 4][i + k] = tmp1; B[j + 5][i + k] = tmp2; B[j + 6][i + k] = tmp3; B[j + 7][i + k] = tmp4; } } } } ``` 測試結果 cache miss = 1179，達到滿分的要求 ``` ./test-trans -M 64 -N 64 Function 0 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:9066, misses:1179, evictions:1147 Function 1 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 1 (Simple row-wise scan transpose): hits:3474, misses:4723, evictions:4691 Summary for official submission (func 0): correctness=1 misses=1179 TEST_TRANS_RESULTS=1:1179 ``` #### **61 x 67** 這題的關鍵在於==矩陣不再是 2 的冪次==，如果利用前面的分塊方法，A & B 矩陣很可能會有如下藍色區域尚待處理 ![](https://i.imgur.com/DMtTl6v.png) 然而藍色區域很難再套用分塊技巧，必需使用最簡單的實作，這會造成==矩陣 B 在寫入時與矩陣 A 發生大量的衝突==，所以在分塊階段，我們應該要讓矩陣 A 在 row 方向 & 矩陣 B 在 column 方向先行轉置完成，剩餘元素轉置時==矩陣 B 有比較好的空間局部性==，降低 eviction 發生的頻率 ![](https://i.imgur.com/eOFMhia.png) 與 32 x 32 採用相同的分塊策略，但在矩陣 A 的==列方向不分塊處理==(其實在 32 x 32 的例子中，$i$ 分不分塊都沒有影響，有興趣可以自行實作測試)，同樣使用局部變數避免對角線元素引起的額外 eviction。61 x 67 的矩陣轉置實作如下 ```cpp /* * trans_3: transpose optimized function for 61 x 67 Matrix */ void trans_3(int M, int N, int A[N][M], int B[M][N]) { int i, j; int tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, tmp8; int col_limit = M - (M % 8); for (j = 0; j < col_limit; j += 8) { for (i = 0; i < N; i++) { tmp1 = A[i][j]; tmp2 = A[i][j + 1]; tmp3 = A[i][j + 2]; tmp4 = A[i][j + 3]; tmp5 = A[i][j + 4]; tmp6 = A[i][j + 5]; tmp7 = A[i][j + 6]; tmp8 = A[i][j + 7]; B[j][i] = tmp1; B[j + 1][i] = tmp2; B[j + 2][i] = tmp3; B[j + 3][i] = tmp4; B[j + 4][i] = tmp5; B[j + 5][i] = tmp6; B[j + 6][i] = tmp7; B[j + 7][i] = tmp8; } } /* transpose rest elements */ for (i = 0; i < N; i++) { for (j = col_limit; j < M; j++) { tmp1 = A[i][j]; B[j][i] = tmp1; } } } ``` 測試結果 cache miss = 1761，達到滿分的要求 ``` ./test-trans -M 61 -N 67 Function 0 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 0 (Transpose submission): hits:6418, misses:1761, evictions:1729 Function 1 (2 total) Step 1: Validating and generating memory traces Step 2: Evaluating performance (s=5, E=1, b=5) func 1 (Simple row-wise scan transpose): hits:3756, misses:4423, evictions:4391 Summary for official submission (func 0): correctness=1 misses=1761 TEST_TRANS_RESULTS=1:1761 ``` 完整程式碼請參考 [Github](https://github.com/Chang-Chia-Chi/CSAPP-Labs/blob/main/Cache-Lab/trans.c) 連結 ## Reference 1. [Computer Systems: A Programmer’s Perspective, 3/E (CS:APP3e)](http://csapp.cs.cmu.edu/3e/labs.html) 2. [Cache Lab Implementation and Blocking](http://www.cs.cmu.edu/afs/cs/academic/class/15213-f15/www/recitations/rec07.pdf) 3. [Cache replacement policies](https://en.wikipedia.org/wiki/Cache_replacement_policies) 4. [CS:APP2e Web Aside MEM:BLOCKING: Using Blocking to Increase Temporal Locality∗](http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf)