---
title: 2020q3 Project (cache-lab)
tags: sysprog
---
# 2020q3: cache-lab
contributed by < `TsundereChen` >
> GitHub Repo: [TsundereChen/csapp-cache-lab](https://github.com/TsundereChen/csapp-cache-lab)
> [README](https://csapp.cs.cmu.edu/3e/README-cachelab)
## Files
```
- cahcelab.[c,h] Lab helper files
- csim.c Cache simulator
- csim-ref Executable reference cache simulator
- driver.py Driver program, run test-csim and test-trans
- Makefile
- test-csim Tool to test cache simulator
- test-trans.c Tool to test transpose function
- tracegen.c Helper program for test-trans
- trans.c Transpose function
- traces/ Trace files used by test-csim.c
```
## Environment
* Valgrind need to be installed in the system
## Part A
- Getting the program argument
- [getopt.h](https://www.gnu.org/software/libc/manual/html_node/Using-Getopt.html)
- Create arguments according to the reference csim
```clike=
void parseArg(int argc, char *argv[]) {
if (argc == 1) {
printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n");
exit(1);
};
int opt;
while ((opt = getopt(argc, argv, "hvs:E:b:t:")) != -1) {
switch (opt) {
case 'h':
printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n");
printf("Options:\n");
printf(" -h Print this help message.\n");
printf(" -v Optional verbose flag.\n");
printf(" -s <num> Number of set index bits.\n");
printf(" -E <num> Number of lines per set.\n");
printf(" -b <num> Number of block offset bits.\n");
printf(" -t <file> Trace file.\n");
printf("\n");
printf("Examples:\n");
printf(" linux> ./csim -s 4 -E 1 -b 4 -t traces/yi.trace\n");
printf(" linux> ./csim -v -s 8 -E 2 -b 4 -t traces/yi.trace\n");
exit(1);
case 'v':
verbose = true;
break;
case 's':
setIndexBit = atoi(optarg);
break;
case 'E':
linePerSet = atoi(optarg);
break;
case 'b':
blockOffsetBit = atoi(optarg);
break;
case 't':
traceFile = fopen(optarg, "r");
break;
default:
printf("Usage: ./csim [-hv] -s <num> -E <num> -b <num> -t <file>\n");
exit(1);
}
}
}
```
### Cache
- struct
- Cache looks like this

- Below is the struct of cache line
- The part A of this lab doesn't require us to modify the data in the cache, we just need to check if the valid bit is 1 and if the tag matches the input
```clike=
typedef struct cacheData {
int valid;
int lru;
uint64_t tag;
} cacheData;
```
- Input
- Input files are stored under folder `traces/`
- All the input has the same format, like
```clike=
L 10,1
M 20,1
L 22,1
S 18,1
<INSTRUCTION> <ADDRESS>,<SIZE>
```
- Four instruction
- I (instruction load)
- Not used in this lab
- L (load)
- M (modify)
- Load + Store
- Two steps
- S (store)
- Use `sscanf` to read the input
- We need to extract these information from `<ADDRESS>`
- Tag
- Cache Set
- Offset
- Since we already know the size of cache set and the size of offset, we can use bitwise operation to extract these informations
```clike=
while(fgets(buffer, sizeof(buffer), traceFile) != NULL){
counter++;
if(buffer[0] == 'I')
continue;
else{
sscanf(buffer, " %c %lx,%d", &operation, &address, &size);
unsigned int tag = parseTag(setIndexBit, blockOffsetBit, address);
unsigned int cacheset = parseCacheSet(setIndexBit, blockOffsetBit, address);
unsigned int offset = parseOffset(setIndexBit, blockOffsetBit, address);
switch(operation){
case 'L':
checkCache(tag, cacheset, offset, counter);
break;
case 'M':
checkCache(tag, cacheset, offset, counter);
checkCache(tag, cacheset, offset, counter);
break;
case 'S':
checkCache(tag, cacheset, offset, counter);
break;
}
}
}
```
- Cache Hit/Miss
- If the valid bit is 1 and tag matches, it's a match
- If there's no match, then it's a miss.
- We need to go through the cache and find if there's any invalid cache
- If there is, we activate that cache, and enter the tag information
- If there's no match, and all the cache line is valid
- Then it's an eviction, we need to find the least frequent used cache, then replace it.
- By counting the execution line, we can achieve LRU
- The smaller the LRU is, the less frequent it's been used
```clike=
void checkCache(int tag, int cacheSet, int offset, int counter) {
for(int i = 0; i < linePerSet; i++){
if(cache[cacheSet][i].valid == 1 && cache[cacheSet][i].tag == tag){
cache[cacheSet][i].lru = counter;
hits++;
if(verbose)
printf("HIT\n");
return;
}
}
misses++;
if(verbose)
printf("MISS\n");
int n_lru = 0;
for(int i = 0; i < linePerSet; i++){
if(cache[cacheSet][i].valid == 0){
cache[cacheSet][i].valid = 1;
cache[cacheSet][i].tag = tag;
cache[cacheSet][i].lru = counter;
return;
}
else
if(cache[cacheSet][n_lru].lru > cache[cacheSet][i].lru)
n_lru = i;
}
evictions++;
if(verbose)
printf("EVICTION!\n");
cache[cacheSet][n_lru].valid = 1;
cache[cacheSet][n_lru].tag = tag;
cache[cacheSet][n_lru].lru = counter;
return;
};
```
## Part B
- Need to write a function to perform matrix transpose
- Three test cases
- 32 x 32
- 64 x 64
- 61 x 67
- s = 5, E = 1, b = 5
- Cache size is 32 * 2
- Cache block size is 32 bytes
- Can contain 8 integer at once
### 32 x 32
- Instead of process one integer at a time, because the system will also fetch nearby data at the same time, we can process more integer at a time, utilizing cache and generate faster result.
- Because cache block size is 32 bytes, which can contain 8 integer, so we move 8 integer at a time.
```clike=
// Case 1
for (int i = 0; i < N; i += 8) {
for (int j = 0; j < M; j += 8) {
for (int k = i; k < i + 8; k++) {
int t0 = A[k][j];
int t1 = A[k][j + 1];
int t2 = A[k][j + 2];
int t3 = A[k][j + 3];
int t4 = A[k][j + 4];
int t5 = A[k][j + 5];
int t6 = A[k][j + 6];
int t7 = A[k][j + 7];
B[j][k] = t0;
B[j + 1][k] = t1;
B[j + 2][k] = t2;
B[j + 3][k] = t3;
B[j + 4][k] = t4;
B[j + 5][k] = t5;
B[j + 6][k] = t6;
B[j + 7][k] = t7;
}
}
}
```
:::warning
Why the code below doesn't utilize cache?
```clike=
for (int i = 0; i < N; i += 8) {
for (int j = 0; j < M; j += 8) {
for (int k = i; k < i + 8; k++) {
B[j + 0][k] = A[k][j];
B[j + 1][k] = A[k][j + 1];
B[j + 2][k] = A[k][j + 2];
B[j + 3][k] = A[k][j + 3];
B[j + 4][k] = A[k][j + 4];
B[j + 5][k] = A[k][j + 5];
B[j + 6][k] = A[k][j + 6];
B[j + 7][k] = A[k][j + 7];
}
}
}
```
:::
### 64 x 64
- Refer to blog [深入理解计算机系统 (CS:APP) - 高速缓存实验 Cache Lab 解析](https://billc.io/2019/05/csapp-cachelab/)
- Process using 4 blocks.
```clike=
int t0, t1, t2, t3, t4, t5, t6, t7;
for (int i = 0; i < N; i += 8) {
for (int j = 0; j < M; j += 8) {
for (int k = i; k < i + 4; k++) {
t0 = A[k][j];
t1 = A[k][j + 1];
t2 = A[k][j + 2];
t3 = A[k][j + 3];
t4 = A[k][j + 4];
t5 = A[k][j + 5];
t6 = A[k][j + 6];
t7 = A[k][j + 7];
B[j][k] = t0;
B[j + 1][k] = t1;
B[j + 2][k] = t2;
B[j + 3][k] = t3;
B[j + 0][k + 4] = t7;
B[j + 1][k + 4] = t6;
B[j + 2][k + 4] = t5;
B[j + 3][k + 4] = t4;
}
for (int h = 0; h < 4; h++) {
t0 = A[i + 4][j + 3 - h];
t1 = A[i + 5][j + 3 - h];
t2 = A[i + 6][j + 3 - h];
t3 = A[i + 7][j + 3 - h];
t4 = A[i + 4][j + 4 + h];
t5 = A[i + 5][j + 4 + h];
t6 = A[i + 6][j + 4 + h];
t7 = A[i + 7][j + 4 + h];
B[j + 4 + h][i + 0] = B[j + 3 - h][i + 4];
B[j + 4 + h][i + 1] = B[j + 3 - h][i + 5];
B[j + 4 + h][i + 2] = B[j + 3 - h][i + 6];
B[j + 4 + h][i + 3] = B[j + 3 - h][i + 7];
B[j + 3 - h][i + 4] = t0;
B[j + 3 - h][i + 5] = t1;
B[j + 3 - h][i + 6] = t2;
B[j + 3 - h][i + 7] = t3;
B[j + 4 + h][i + 4] = t4;
B[j + 4 + h][i + 5] = t5;
B[j + 4 + h][i + 6] = t6;
B[j + 4 + h][i + 7] = t7;
}
}
}
```
### 61 x 67
- By the method used in `32 x 32`, and changing the stepping, we can find out that processing 16 integer at a time works the best
```clike=
for (int i = 0; i < N; i += 16) {
for (int j = 0; j < M; j += 16) {
for (int k = i; k < i + 16 && k < N; k++) {
for (int l = j; l < j + 16 && l < M; l++) {
B[l][k] = A[k][l];
}
}
}
}
```
## RISC-V simulator
### Ripes
- [Cache Simulator with RISC-V](https://github.com/mortbopet/Ripes/wiki/Cache-Simulation)
- Use single-cycle processor for cache-simulation
- Pipelined processor models may stall a stage which is currently reading from memory
- Much faster execution rate compared to pipelined processor
- Cache configuration preset in Ripes
- 32-entry 4-word direct-mapped
- 32-entry 4-word fully associative
- 32-entry 4-word 2-way set associative
- Several things to notice
- Although Ripes can compile C program for you, and run the program
It will generate lots of assembly code, which will be harder for user to trace the program
- Use [Compiler Explorer](https://godbolt.org/) to generate assembly, then add main function assembly code and termation function assembly code to stop the program
- Refer to [Adapting Compiler Explorer generated RISC V assembly code](https://github.com/mortbopet/Ripes/wiki/Adapting-Compiler-Explorer-generated-RISC-V-assembly-code)
- Since we can't pass the whole lab into Ripes, we need to create a function to perform matrix transpose.
- Ripes doesn't support
- Function `rand()`
- Array in `.data` assembly
### Experiment
#### N = 8
- Start with sample code below
```clike=
#include <stdlib.h>
#define MATRIX_SIZE 8
void transpose(){
int A[MATRIX_SIZE][MATRIX_SIZE], B[MATRIX_SIZE][MATRIX_SIZE];
for(int i = 0; i < MATRIX_SIZE; i++){
for(int j = 0; j < MATRIX_SIZE; j++){
A[i][j] = (i+1) * (j+1);
}
};
for(int j = 0; j < MATRIX_SIZE; j++){
for(int i = 0; i < MATRIX_SIZE; i++){
int t0 = A[i][j];
B[j][i] = t0;
}
}
return;
}
```
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate:96.62%

- 32-entry 4-word fully associative
Hit rate: 97.16%

- 32-entry 4-word 2-way set associative
Hit rate: 97.08%

- Fetch 4 variables at once
```clike=
void transpose(){
int A[MATRIX_SIZE][MATRIX_SIZE], B[MATRIX_SIZE][MATRIX_SIZE];
for(int i = 0; i < MATRIX_SIZE; i++){
for(int j = 0; j < MATRIX_SIZE; j++){
A[i][j] = (i+1) * (j+1);
}
};
for(int j = 0; j < MATRIX_SIZE; j+=4){
for(int i = 0; i < MATRIX_SIZE; i++){
int t0 = A[i][j];
int t1 = A[i][j+1];
int t2 = A[i][j+2];
int t3 = A[i][j+3];
B[j][i] = t0;
B[j+1][i] = t1;
B[j+2][i] = t2;
B[j+3][i] = t3;
}
}
return;
}
```
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 93.81%

- 32-entry 4-word fully associative
Hit rate: 96.54%

- 32-entry 4-word 2-way set associative
Hit rate: 96.72%

- Fetch 8 variables at once
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 94.2%

- 32-entry 4-word fully associative
Hit rate: 95.34%

- 32-entry 4-word 2-way set associative
Hit rate: 95.86%

| | Direct-mapped | Fully Associative | 2-way Set Associative |
| --------- | ------------- | ----------------- | --------------------- |
| 1 at once | 96.62% | 97.16% | 97.08% |
| 4 at once | 93.81% | 96.54% | 96.72% |
| 8 at once | 94.2% | 95.34% | 95.86% |
| 4 at once, three for loop | 95.56% | 96.08% | 96.42% |
#### N = 32
- Dummy method
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 93.81%

- 32-entry 4-word fully associative
Hit rate: 95.55%

- 32-entry 4-word 2-way set associative
Hit rate: 94.75%

- Fetch 4 variables at once
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 89.87%

- 32-entry 4-word fully associative
Hit rate: 91.19%

- 32-entry 4-word 2-way set associative
Hit rate: 91.19%

- Fetch 8 variables at once
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 93.36%

- 32-entry 4-word fully associative
Hit rate: 96.32%

- 32-entry 4-word 2-way set associative
Hit rate: 94.86%

- Fetch 4 variables at once, three for loop
- Three cache configuration
- 32-entry 4-word direct-mapped
Hit rate: 86.15%

- 32-entry 4-word fully associative
Hit rate: 94.36%

- 32-entry 4-word 2-way set associative
Hit rate: 94.03%

| | Direct-mapped | Fully Associative | 2-way Set Associative |
| --------- | ------------- | ----------------- | --------------------- |
| 1 at once | 93.81% | 95.55% | 94.75% |
| 4 at once | 89.87% | 91.19% | 91.19% |
| 8 at once | 93.36% | 96.32% | 94.86% |
| 4 at once, three for loop | 86.15% | 94.36% | 94.03% |
| 8 at once, three for loop | 90.11% | 94.03% | 90.11% |
#### N = 64
| | Direct-mapped | Fully Associative | 2-way Set Associative |
| --------- | ------------- | ----------------- | --------------------- |
| 1 at once | 90.88% | 92.17% | 92.13% |
| 4 at once | 89.75% | 95.54% | 91.09% |
| 8 at once | 89.65% | 95.44% | 90.78% |
| 4 at once, three for loop | 88.57% | 94.22% | 89.94% |
| 8 at once, three for loop | 90.90% | 94.58% | 92.05% |
## Reference
- [CS:APP 3/e cachelab](https://hackmd.io/@pSnFKx_GTlmTWXn4A8lpKw/SkeA19jWN?type=view)
- [CSAPP:Lab4-Cache Lab](https://zhuanlan.zhihu.com/p/142942823)