# Assignment 4: Cache

contributed by < `WeiCheng14159` >

###### tags: `computer architecture 2020`

## Table of Contents

[TOC]

## Objectives

- Analyze how memory access patterns determine cache hit rates
- Discover which memory access patterns produce good hit rates
- Analyze cache hit rates and optimize code accesses to produce good hit rates

## Experiment setup

The following experiments are done on the Ripes simulator with a single-cycle processor supporting the RV32IM ISA.

## Exercise 1 - A Couple of Memory Access Scenarios

Simulate the following scenarios and record the **final cache hit rates** with the program in `cache.s`.

### cache.c

```c=
int array[];    // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) {    // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                   // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;    // Option 1: Two cache accesses - read AND write
    }
}
```

### cache.s

:::spoiler assembly code
```c=1
# Rewritten by Stephan Kaminsky in RISC-V on 7/30/2018
# This program accesses an array in ways that provide data about the cache parameters.
# Coupled with the Data Cache Simulator and Memory Reference Visualization to help students
# understand how the cache parameters affect cache performance.
#
# PSEUDOCODE:
#   int array[];    // Assume sizeof(int) == 4
#   for (k = 0; k < repcount; k++) {    // repeat repcount times
#       /* Step through the selected array segment with the given step size. */
#       for (index = 0; index < arraysize; index += stepsize) {
#           if (option == 0)
#               array[index] = 0;                   // Option 0: One cache access - write
#           else
#               array[index] = array[index] + 1;    // Option 1: Two cache accesses - read AND write
#       }
#   }

.data
array:  .word 2048      # max array size specified in BYTES (DO NOT CHANGE)

.text
##################################################################################################
# You MAY change the code below this section
main:   li a0, 128      # array size in BYTES (power of 2 < array size)
        li a1, 8        # step size (power of 2 > 0)
        li a2, 4        # rep count (int > 0)
        li a3, 0        # 0 - option 0, 1 - option 1
# You MAY change the code above this section
##################################################################################################

        jal accessWords # lw/sw
        #jal accessBytes # lb/sb

        li a0, 10       # exit
        ecall

# SUMMARY OF REGISTER USE:
# a0 = array size in bytes
# a1 = step size
# a2 = number of times to repeat
# a3 = 0 (W) / 1 (RW)
# s0 = moving array ptr
# s1 = array limit (ptr)

accessWords:
        la s0, array            # ptr to array
        add s1, s0, a0          # hardcode array limit (ptr)
        slli t1, a1, 2          # multiply stepsize by 4 because WORDS
wordLoop:
        beq a3, zero, wordZero
        lw t0, 0(s0)            # array[index/4]++
        addi t0, t0, 1
        sw t0, 0(s0)
        j wordCheck
wordZero:
        sw zero, 0(s0)          # array[index/4] = 0
wordCheck:
        add s0, s0, t1          # increment ptr
        blt s0, s1, wordLoop    # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessWords    # outer loop done?
        jr ra

accessBytes:
        la s0, array            # ptr to array
        add s1, s0, a0          # hardcode array limit (ptr)
byteLoop:
        beq a3, zero, byteZero
        lbu t0, 0(s0)           # array[index]++
        addi t0, t0, 1
        sb t0, 0(s0)
        j byteCheck
byteZero:
        sb zero, 0(s0)          # array[index] = 0
byteCheck:
        add s0, s0, a1          # increment ptr
        blt s0, s1, byteLoop    # inner loop done?
        addi a2, a2, -1
        bgtz a2, accessBytes    # outer loop done?
        jr ra
```
:::
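Before diving into the scenarios, it helps to see exactly which addresses `accessWords` generates. The following small C helper (my own sketch, not part of the assignment files) prints the word-address stream for a given parameter set; `0x10000000` is where the `.data` segment starts in Ripes, matching the address tables below.

```c=
#include <stdint.h>
#include <stdio.h>

/* Print the address stream produced by accessWords (hypothetical helper).
 * a0 = array size in bytes, a1 = step size in elements, a2 = rep count,
 * mirroring the registers used in cache.s. */
int main(void) {
    uint32_t base = 0x10000000;     /* start of .data in Ripes */
    int a0 = 128, a1 = 8, a2 = 4;   /* Scenario 1 parameters */
    for (int rep = 0; rep < a2; rep++)
        for (uint32_t p = base; p < base + a0; p += a1 * 4) /* slli t1, a1, 2 */
            printf("rep %d: access 0x%08x\n", rep, p);
    return 0;
}
```

With the Scenario 1 parameters it prints the four addresses `0x10000000`, `0x10000020`, `0x10000040`, `0x10000060`, repeated four times over.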
### Scenario 1:

#### Program Parameters

Program parameters are set in `cache.s`.

| Array size | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 128 | 8 | 4 | 0 |

#### Cache Parameters

Cache parameters are set in the Ripes simulator.

| Cache Levels | Block Size (words) | # of Blocks | Placement Policy | Associativity | Replacement Policy |
| ------------ | ------------------ | ----------- | ---------------- | ------------- | ------------------ |
| 1* | 8 | 4 | Direct Mapped | 1 | LRU |

`* Ripes simulator only supports a 1-level cache`

Assumptions about cache behavior set in Ripes:

| Cache hit policy | Cache miss policy |
| - | - |
| Write-back | Write-allocate |

#### Results

- Cache setup
    - ![](https://i.imgur.com/d2k99jB.png)
    - tag: [31:7]
    - index: [6:5]
    - block offset: [4:2]
    - byte offset: [1:0]
- Data memory access pattern
    - The program writes to the following addresses in each repetition:

| Address | tag [31:7] | index [6:5] | block offset [4:2] | byte offset [1:0] |
| -- | -- | -- | -- | -- |
| 0x10000000 | 0x200000 | 2'b00 | 3'b000 | 2'b00 |
| 0x10000020 | 0x200000 | 2'b01 | 3'b000 | 2'b00 |
| 0x10000040 | 0x200000 | 2'b10 | 3'b000 | 2'b00 |
| 0x10000060 | 0x200000 | 2'b11 | 3'b000 | 2'b00 |

- Notice that consecutive accesses are **$20_{hex}$ ($32_{dec}$) bytes** apart, since each array element is 4 bytes and the step size is 8 elements.
- The same access pattern repeats **4** times. The first iteration causes **4 compulsory cache misses**; the later iterations contribute **12 cache hits**.
- Thus, the final cache hit rate is **12/16 = 0.75**.
- Data cache

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.75 | 12 | 4 | 0 |

- Instruction cache

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.98 | 163 | 3 | 0 |

#### Task:

- What combination of parameters is producing the hit rate you observe?
    - The **stride** of this program is **32 bytes**, which equals the cache **block size** (8 words = **32 bytes**). Every access therefore lands at the start of its own block, so there is no spatial-locality benefit within a repetition; all hits come from the repetitions themselves.
- What is our hit rate if we increase Rep Count arbitrarily? Why?
    - The hit rate will get **close to 1.0**, since the **effect of the compulsory cache misses is amortized** over more and more hits.
- How could we modify one program parameter to increase our hit rate?
    - We can increase the hit rate by **setting the step size to 1**. With the same setup, the hit rate rises to **0.97**.
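To make the amortization argument concrete: with $A$ accesses per repetition, of which the first repetition incurs $M$ compulsory misses, the hit rate after $R$ repetitions is

$$
\text{hit rate} = \frac{RA - M}{RA} = 1 - \frac{M}{RA},
$$

which tends to $1.0$ as $R \to \infty$. For Scenario 1, $A = 4$, $M = 4$, and $R = 4$ give $1 - \frac{4}{16} = 0.75$, matching the simulation.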
### Scenario 2:

#### Program Parameters

Program parameters are set in `cache.s`.

| Array size | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 256 | 2 | 1 | 1 |

#### Cache Parameters

Cache parameters are set in the Ripes simulator.

| Cache Levels | Block Size (words) | # of Blocks | Placement Policy | Associativity | Replacement Policy |
| -- | -- | -- | -- | -- | -- |
| 1* | 16 | 16** | 4-Way Set Associative | 4 | LRU |

`* Ripes simulator only supports a 1-level cache`
`** This is the number of blocks per way (i.e. the number of sets), NOT the total number of blocks in the cache`

Assumptions about cache behavior set in Ripes:

| Cache hit policy | Cache miss policy |
| - | - |
| Write-back | Write-allocate |

#### Results

- Cache setup
    - ![](https://i.imgur.com/yzZ2jY3.png)
    - tag: [31:10]
    - index: [9:6]
    - block offset: [5:2]
    - byte offset: [1:0]
- Data memory access pattern
    - The program reads and writes the following addresses in each repetition:

| Address | tag [31:10] | index [9:6] | block offset [5:2] | byte offset [1:0] |
| -- | -- | -- | -- | -- |
| 0x10000000 (R/W) | 0x40000 | **4'b0000** | 4'b0000 | 2'b00 |
| 0x10000008 (R/W) | 0x40000 | 4'b0000 | 4'b0010 | 2'b00 |
| 0x10000010 (R/W) | 0x40000 | 4'b0000 | 4'b0100 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x10000040 (R/W) | 0x40000 | **4'b0001** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x10000080 (R/W) | 0x40000 | **4'b0010** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |
| 0x100000C0 (R/W) | 0x40000 | **4'b0011** | 4'b0000 | 2'b00 |
| ... | ... | ... | ... | ... |

- Notice that consecutive accesses are **$8_{hex}$ ($8_{dec}$) bytes** apart, since each array element is 4 bytes and the step size is 2 elements.
- Data cache

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9375 | 60 | 4 | 0 |

- Instruction cache

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.99 | 235 | 3 | 0 |
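The 60 hits and 4 misses above can be reproduced outside Ripes with a minimal LRU set-associative cache model. This is a sanity-check sketch of my own, assuming the Scenario 2 geometry (16 sets x 4 ways x 64-byte blocks) and the `0x10000000` array base:

```c=
#include <stdint.h>
#include <stdio.h>

#define SETS        16
#define WAYS        4
#define BLOCK_BYTES 64

/* One LRU set: a tag, valid bit, and age counter per way. */
typedef struct { uint32_t tag[WAYS]; int valid[WAYS]; int age[WAYS]; } set_t;

static set_t cache[SETS];
static long hits, misses;

static void access_addr(uint32_t addr) {
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t set   = block % SETS;
    uint32_t tag   = block / SETS;
    set_t *s = &cache[set];

    for (int w = 0; w < WAYS; w++)      /* everyone ages on each access ... */
        if (s->valid[w]) s->age[w]++;
    for (int w = 0; w < WAYS; w++)      /* ... except the way we hit */
        if (s->valid[w] && s->tag[w] == tag) { s->age[w] = 0; hits++; return; }

    int victim = 0;                     /* miss: fill an empty way or evict LRU */
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }
        if (s->age[w] > s->age[victim]) victim = w;
    }
    s->valid[victim] = 1; s->tag[victim] = tag; s->age[victim] = 0;
    misses++;
}

int main(void) {
    /* Scenario 2 stream: 256-byte array, step = 2 ints (8 bytes), rep = 1,
     * option 1 => one read followed by one write per element. */
    for (uint32_t a = 0x10000000; a < 0x10000000 + 256; a += 8) {
        access_addr(a);   /* lw */
        access_addr(a);   /* sw */
    }
    printf("hits=%ld misses=%ld hit rate=%.4f\n",
           hits, misses, (double)hits / (hits + misses));
    /* Prints: hits=60 misses=4 hit rate=0.9375, matching Ripes. */
    return 0;
}
```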
#### Task:

- How many memory accesses are there per iteration of the inner loop?
    - There are **two memory accesses per iteration**, marked by `lw t0, 0(s0)` and `sw t0, 0(s0)` in the assembly program.
- What is the repeating hit/miss pattern? WHY?
    - The hit/miss pattern can be observed from the following table (`1m1h` = one miss then one hit, `2h` = two hits):

| Address | index [9:6] | block offset [5:2] | byte offset [1:0] | hits/misses |
| -- | -- | -- | -- | -- |
| 0x10000000 (R/W) | **4'b0000** | 4'b0000 | 2'b00 | 1m1h |
| 0x10000008 (R/W) | 4'b0000 | 4'b0010 | 2'b00 | 2h |
| 0x10000010 (R/W) | 4'b0000 | 4'b0100 | 2'b00 | 2h |
| 0x10000018 (R/W) | 4'b0000 | 4'b0110 | 2'b00 | 2h |
| 0x10000020 (R/W) | 4'b0000 | 4'b1000 | 2'b00 | 2h |
| 0x10000028 (R/W) | 4'b0000 | 4'b1010 | 2'b00 | 2h |
| 0x10000030 (R/W) | 4'b0000 | 4'b1100 | 2'b00 | 2h |
| 0x10000038 (R/W) | 4'b0000 | 4'b1110 | 2'b00 | 2h |
| 0x10000040 (R/W) | **4'b0001** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |
| 0x10000080 (R/W) | **4'b0010** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |
| 0x100000C0 (R/W) | **4'b0011** | 4'b0000 | 2'b00 | 1m1h |
| ... | ... | ... | ... | ... |

- The **bold** index values mark the **first access to each cache block**. That access misses and forces the cache to load the **whole block** from memory, after which the remaining accesses to the same block (8 elements x 2 accesses, minus the one miss) all hit.
- The hit/miss pattern of this program is therefore **1 miss per 16 memory accesses**.
- This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
    - As expected, the hit rate is **$\frac{15}{16} = 0.9375$**.
- Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?
    - The hit rate **approaches 1.0** as the rep count goes to infinity.
    - A single iteration performs **64 memory accesses**; based on the analysis above, the **first iteration** incurs **4 compulsory misses**.
    - As the number of iterations (the rep count) goes to infinity, the **effect of the compulsory misses is amortized to nearly zero**.
- :::spoiler QUESTION
  Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep.)

  QUESTION: Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario?
  :::
- Each function is a **numerical operation on an array element**. The function might be the same across reps (e.g. increment by 1) or differ from rep to rep (e.g. increment by 2k, where k is the rep number).
- Consider the following example: `if Rep Count = 4, we map 4 different functions onto each of the 65536 array elements, one per Rep.` The C code example is as follows:

```c=1
// array size = 65536
// rep count = 4
int array[65536];

void accu() {
    for (int k = 0; k < 4; k++) {            // repeat repcount times
        for (int i = 0; i < 65536; i += 1) {
            array[i] = array[i] + 2*k;       // function depends on k
        }
    }
}
```

- Assume the C program above is our target application and that ALL cache parameters are the same as in the previous question. To provide a solid example for discussion, the setup is as follows:

| Array size (bytes) | Step size | Rep count | Option |
| -- | -- | -- | -- |
| 262144 | 1 | 4 | 1 |

- Compiling the C code on `Compiler Explorer` with the `-O0` flag and running the result in Ripes gives the following data cache results:

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9911 | 1818645 | 16389 | 16325 |

- Given that the **array size is much larger than the cache size**, the program becomes much more **cache friendly** if it is restructured as follows. Notice that the **two for loops are swapped**:

```c=1
int array[65536];

void accu() {
    for (int i = 0; i < 65536; i += 1) {
        for (int k = 0; k < 4; k++) {        // repeat repcount times
            array[i] = array[i] + 2*k;       // function depends on k
        }
    }
}
```

- Compiling this version the same way and running it in Ripes gives:

| Hit rate | Hits | Misses | Writebacks |
| -------- | ---- | ------ | ---------- |
| 0.9981 | 2158596 | 4034 | 4098 |

- **The result shows a 0.7% improvement in cache hit rate** and only **about a quarter as many cache misses** compared to the previous approach! The restructured program performs better because of **temporal locality**: all four updates to an element happen while its block is still resident in the cache.
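As a sanity check on these two runs (assuming the same cache geometry as above, i.e. 64-byte blocks), the array spans

$$
\frac{65536 \times 4\ \text{bytes}}{64\ \text{bytes/block}} = 4096\ \text{blocks}.
$$

Since the array is far larger than the cache, the original loop order must re-fetch every block on each of the 4 reps, predicting roughly $4 \times 4096 = 16384$ misses (16389 observed). After the swap, each block is fetched only once, predicting roughly $4096$ misses (4034 observed).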
### Scenario 3:

Ripes doesn't support a 2-level cache, so this scenario is skipped.

## Exercise 2 - Loop Ordering and Matrix Multiplication

Matrix multiplication source code:

```c=1
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];
```

### Matrix multiplication problem analysis

- Matrix elements are `float`, which is **4 bytes** each
- $O(N^3)$ total operations
- Illustration: ![](https://i.imgur.com/hyFq2WY.png)

### Tasks

- Record your results in answers.txt

```c=1
./matrixMultiply
ijk: n = 1000, 2.048 Gflop/s
ikj: n = 1000, 0.294 Gflop/s
jik: n = 1000, 1.786 Gflop/s
jki: n = 1000, 10.682 Gflop/s
kij: n = 1000, 0.295 Gflop/s
kji: n = 1000, 8.760 Gflop/s
```

- Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
    - Assumptions about the problem:
        - Matrices A, B, and C are stored **column-major**: the indexing `C[i+j*n]` puts consecutive values of `i` next to each other in memory
        - The matrix dimension n is very large, so 1/n approaches zero
        - The cache is **NOT big enough** to hold multiple matrix columns
    - Stride of each matrix access with respect to the innermost loop (a stride of 0 means the access is loop-invariant):

| Ordering (innermost loop) | C | A | B |
| -- | - | - | - |
| ijk (k) | 0 | n | 1 |
| ikj (j) | n | 0 | n |
| jik (k) | 0 | n | 1 |
| jki (i) | 1 | 1 | 0 |
| kij (j) | n | 0 | n |
| kji (i) | 1 | 1 | 0 |

    - The **jki ordering** performs best, with **kji** close behind. In both, the innermost loop walks C and A with stride 1 while the B element is loop-invariant, exactly as the stride analysis predicts. jki has the extra advantage that the same column of C stays hot in the cache across the whole k loop, whereas kji re-walks all of C on every iteration of the outer k loop.
- Which ordering(s) perform the worst? Why?
    - **ikj and kij perform the worst** (about 0.29 Gflop/s) because their innermost loop accesses both C and B with stride n, so nearly every access touches a new cache line.
- How does the way we stride through the matrices with respect to the innermost loop affect performance?
    - The innermost loop dominates the memory access stream, so the loops should be arranged such that **stride-1 memory accesses occur in the innermost loop and stride-n accesses in the outer loops**. This **maximizes both spatial and temporal locality**; a sketch of the best ordering is shown below.
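Here is a minimal sketch of that best-performing jki ordering (the function name `matmul_jki` is mine, not from the lab code). Hoisting the loop-invariant `B` element makes the access pattern explicit:

```c=
/* jki ordering: for column-major storage, the innermost loop walks C and A
 * with stride 1, and the B element is loop-invariant, so it can be hoisted. */
void matmul_jki(int n, float *C, const float *A, const float *B) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            float b = B[k + j * n];           /* constant in the i loop */
            for (int i = 0; i < n; i++)
                C[i + j * n] += A[i + k * n] * b;
        }
}
```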
## Exercise 3 - Cache Blocking and Matrix Transposition

### Matrix transpose with cache blocking

:::spoiler C code
```c=1
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    size_t nn = (size_t)n;
    size_t bs = (size_t)blocksize;
    size_t iter = nn / bs;
    size_t blocked = iter * bs;   /* extent of the fully blocked region */

    /* Transpose the blocked region one blocksize x blocksize tile at a time. */
    for (size_t block_i = 0; block_i < iter; block_i++) {
        for (size_t block_j = 0; block_j < iter; block_j++) {
            size_t block_base_i = block_i * bs;
            size_t block_base_j = block_j * bs;
            for (size_t i = 0; i < bs; i++) {
                for (size_t j = 0; j < bs; j++) {
                    size_t temp_i = block_base_i + i;
                    size_t temp_j = block_base_j + j;
                    dst[temp_j + temp_i * nn] = src[temp_i + temp_j * nn];
                }
            }
        }
    }
    /* Remainder rows below the blocked region (i >= blocked, j < blocked). */
    for (size_t i = blocked; i < nn; i++)
        for (size_t j = 0; j < blocked; j++)
            dst[j + i * nn] = src[i + j * nn];
    /* Remainder columns on the right (j >= blocked, all rows). */
    for (size_t j = blocked; j < nn; j++)
        for (size_t i = 0; i < nn; i++)
            dst[j + i * nn] = src[i + j * nn];
}
```
:::

### Part 1 - Changing Array Sizes

- Fix the `blocksize` to be 20 and run your code with `n` equal to 100, 1000, 2000, 5000, 10000
- Naive

| blocksize = 20 | n=100 | 1000 | 2000 | 5000 | 10000 |
| -------------- | ----- | ----- | ------ | ------- | ------- |
| time (ms) | 0.007 | 1.695 | 13.415 | 116.718 | 647.225 |

- Blocking

| blocksize = 20 | n=100 | 1000 | 2000 | 5000 | 10000 |
| -------------- | ----- | ----- | ------ | ------- | ------- |
| time (ms) | 0.003 | 1.411 | 4.606 | 33.128 | 149.103 |

- ![](https://i.imgur.com/95k9QeG.png)

#### Task

- At what point does the cache blocked version of transpose become faster than the non-cache blocked version?
    - In this setup, the cache blocked version is **always** faster than the non-blocked version.
- Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?
    - I **don't** observe this phenomenon on my computer; perhaps the underlying architecture is different. In general, when the matrix is small enough to fit in the cache entirely, blocking saves no misses and only adds loop overhead, which is why the crossover usually appears once the matrix exceeds the cache size.

### Part 2 - Changing Blocksize

- Fix `n` to be 10000, and run your code with `blocksize` equal to 50, 100, 500, 1000, 5000
- Naive

| n = 10000 | blocksize=50 | 100 | 500 | 1000 | 5000 |
| --------- | ------------ | ------ | ------- | ------- | ------- |
| time (ms) | 615.668 | 598.47 | 595.875 | 611.302 | 599.406 |

- Blocking

| n = 10000 | blocksize=50 | 100 | 500 | 1000 | 5000 |
| --------- | ------------ | ------- | ------- | ------- | ------- |
| time (ms) | 132.042 | 139.508 | 123.631 | 155.441 | 594.837 |

- ![](https://i.imgur.com/tMmTtta.png)

#### Task

- How does performance change as blocksize increases? Why is this the case?
    - As the blocksize grows, the performance of the cache blocked transpose approaches that of the naive version. Once a `blocksize x blocksize` tile of `src` plus the corresponding tile of `dst` no longer fits in the cache, the blocked code misses just as often as the naive code. At `blocksize = 5000`, a single tile is 5000 x 5000 x 4 bytes, about 100 MB, far larger than any cache, and the blocked time (594.837 ms) is essentially equal to the naive time.
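One way to act on this observation is to derive the blocksize from the cache size. The sketch below is a rough heuristic of my own (the 32 KiB L1 size is an assumption, not a measurement of the test machine): choose a blocksize such that one `src` tile plus one `dst` tile of `int`s fit in the L1 data cache.

```c=
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Rough heuristic (mine, not from the assignment): pick blocksize so that
 * two blocksize x blocksize tiles of ints fit in a cache of l1_bytes. */
static int pick_blocksize(size_t l1_bytes) {
    return (int)sqrt((double)l1_bytes / (2 * sizeof(int)));
}

int main(void) {
    /* sqrt(32768 / 8) = 64 ints per tile side for an assumed 32 KiB L1. */
    printf("suggested blocksize for a 32 KiB L1: %d\n",
           pick_blocksize(32 * 1024));
    return 0;
}
```

The same calculation against a larger shared cache level is consistent with the table above, where blocksizes up to 500 still perform well while 5000 collapses to naive performance.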