
Assignment4: Cache

tags: Computer Architecture

Requirement

Introduction to cache

Exercise lab7

Source code of this assignment

Exercise 1 - A Couple of Memory Access Scenarios

PSEUDOCODE:

int array[];    // Assume sizeof(int) == 4
for (k = 0; k < repcount; k++) {    // repeat repcount times
    /* Step through the selected array segment with the given step size. */
    for (index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                   // Option 0: One cache access - write
        else
            array[index] = array[index] + 1;    // Option 1: Two cache accesses - read AND write
    }
}

source: cache.s

Scenario 1

Program Parameter: the following register parameters are set in cache.s

| parameter | register | value |
| --- | --- | --- |
| Array Size | a0 | 128 (bytes) |
| Step Size | a1 | 8 |
| Rep Count | a2 | 4 |
| Option | a3 | 0 |

Cache Parameters: the following cache parameters are set in Ripes simulator

| parameter | value or strategy |
| --- | --- |
| Cache Level | 1 |
| Block Size | 8 |
| Number of blocks | 4 |
| Enable | Should be green |
| Placement Policy | Direct Mapped |
| Associativity | 1 |
| Block Replacement Policy | LRU |

Q1: What combination of parameters is producing the hit rate you observe?

A1:

  • the resulting hit rate is shown in the Ripes cache statistics (screenshot not available)
  • explanation: tracing the code, we can observe that the outer loop iterates 4 times and the inner loop iterates 4 times, 16 accesses in total, with the inner loop stepping through the array with step size 8. On the first round the inner index goes 0 -> 8 -> 16 -> 24, and each of these first touches of the array is a compulsory miss (this happens only during the first iteration of the outer loop; the later repetitions all hit).
  • the categories of cache misses can be checked here

Q2: What is our hit rate if we increase Rep Count arbitrarily? Why?

A2: the hit rate will increase toward 100%. The misses are all compulsory misses that occur during the first repetition, so the number of misses is fixed at 4; every additional repetition adds only hits, giving a hit rate of (4r - 4)/(4r) = 1 - 1/r for Rep Count r.

Q3: How could we modify one program parameter to increase our hit rate?

A3: there are two parameters we can modify to increase the hit rate:

  • Step Size: decreasing the step size increases the hit rate.

    • here we decrease the step size to 2 (Ripes cache-statistics screenshot not available)
  • Rep Count: increasing the Rep Count will also increase the hit rate. As mentioned in Q1, the misses all occur during the first iteration, so increasing the total number of iterations raises the hit ratio by adding only hits (the miss count is fixed). Doubling Rep Count to 8 confirms this (Ripes cache-statistics screenshot not available).

Scenario 2

Program Parameters:

| Array Size | Step Size | Rep Count | Option |
| --- | --- | --- | --- |
| 256 (bytes) | 2 | 1 | 1 |

Cache Parameters:

| Parameter | value or strategy |
| --- | --- |
| Cache Level | 1 |
| Number of blocks | 16 |
| Block Size | 16 |
| Enable | Should be green |
| Placement Policy | N-Way Set Associative |
| Associativity | 4 |
| Block Replacement Policy | LRU |

Q1: How many memory accesses are there per iteration of the inner loop?

A1: each iteration of the inner loop (wordLoop) executes one lw and one sw, so it accesses memory twice per iteration.

Q2: What is the repeating hit/miss pattern? WHY?

A2: the repeating pattern is 1 miss followed by 15 hits. On a compulsory miss the whole 16-word block is loaded into a cache line; with step size 2, eight iterations land in each block, and each iteration performs both a lw and a sw (option 1), so after the missing lw the remaining 15 accesses to that block all hit.

Q3: This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

A3: following the explanation of the previous question, the hit rate is 15/16, that is 0.9375.


Q4: Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why?

A4: as Rep Count goes to infinity, the hit rate approaches 1, because all the data is loaded into the cache during the first repetition, so the following repetitions never miss.

Scenario 3

Ripes only supports a single-level cache at the moment, so this problem should be done with Venus (TODO)

Exercise 2 - Loop Ordering and Matrix Multiplication

source: matrixMultiply.c

multiplication of two matrices

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i+j*n] += A[i+k*n] * B[k+j*n];

To compute the matrix multiplication correctly, the loop order doesn’t matter.

  • although the loop order doesn't affect correctness, the efficiency varies greatly between orderings; one critical point in accelerating matrix multiplication is to exploit the spatial and temporal locality of the cache

Task1: Record results

  • result of running matrixMultiply.c
  • the only difference between these matrix multiplications is the order of the i, j, k loops
make ex2
gcc -o matrixMultiply -ggdb -Wall -pedantic -std=gnu99 -O3 matrixMultiply.c
./matrixMultiply
ijk:    n = 1000, 0.858 Gflop/s
ikj:    n = 1000, 0.155 Gflop/s
jik:    n = 1000, 0.849 Gflop/s
jki:    n = 1000, 1.451 Gflop/s
kij:    n = 1000, 0.478 Gflop/s
kji:    n = 1000, 1.448 Gflop/s
  • note that Gflop/s means the following:

Giga-floating-point-operations per second.” THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!

  • we can record the index accessed in each matrix on every iteration under the different orderings
  • here we experiment with a 3*3 matrix multiplication and record the access strides
    • A_idx: i+k*n
    • B_idx: k+j*n
    • C_idx: i+j*n
ijk
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 0 1 3 1 0
3 0 0 2 6 2 0
4 0 1 0 0 3 3
5 0 1 1 3 4 3
6 0 1 2 6 5 3
7 0 2 0 0 6 6
8 0 2 1 3 7 6
9 0 2 2 6 8 6
10 1 0 0 1 0 1
11 1 0 1 4 1 1
12 1 0 2 7 2 1
13 1 1 0 1 3 4
14 1 1 1 4 4 4
15 1 1 2 7 5 4
16 1 2 0 1 6 7
17 1 2 1 4 7 7
18 1 2 2 7 8 7
19 2 0 0 2 0 2
20 2 0 1 5 1 2
21 2 0 2 8 2 2
22 2 1 0 2 3 5
23 2 1 1 5 4 5
24 2 1 2 8 5 5
25 2 2 0 2 6 8
26 2 2 1 5 7 8
27 2 2 2 8 8 8
A B C
stride 3 1 3
ikj
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 1 0 3 0 3
3 0 2 0 6 0 6
4 0 0 1 0 3 1
5 0 1 1 3 3 4
6 0 2 1 6 3 7
7 0 0 2 0 6 2
8 0 1 2 3 6 5
9 0 2 2 6 6 8
10 1 0 0 1 1 0
11 1 1 0 4 1 3
12 1 2 0 7 1 6
13 1 0 1 1 4 1
14 1 1 1 4 4 4
15 1 2 1 7 4 7
16 1 0 2 1 7 2
17 1 1 2 4 7 5
18 1 2 2 7 7 8
19 2 0 0 2 2 0
20 2 1 0 5 2 3
21 2 2 0 8 2 6
22 2 0 1 2 5 1
23 2 1 1 5 5 4
24 2 2 1 8 5 7
25 2 0 2 2 8 2
26 2 1 2 5 8 5
27 2 2 2 8 8 8
A B C
stride 3 3 3
jik
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 0 1 3 1 0
3 0 0 2 6 2 0
4 1 0 0 1 0 1
5 1 0 1 4 1 1
6 1 0 2 7 2 1
7 2 0 0 2 0 2
8 2 0 1 5 1 2
9 2 0 2 8 2 2
10 0 1 0 0 3 3
11 0 1 1 3 4 3
12 0 1 2 6 5 3
13 1 1 0 1 3 4
14 1 1 1 4 4 4
15 1 1 2 7 5 4
16 2 1 0 2 3 5
17 2 1 1 5 4 5
18 2 1 2 8 5 5
19 0 2 0 0 6 6
20 0 2 1 3 7 6
21 0 2 2 6 8 6
22 1 2 0 1 6 7
23 1 2 1 4 7 7
24 1 2 2 7 8 7
25 2 2 0 2 6 8
26 2 2 1 5 7 8
27 2 2 2 8 8 8
A B C
stride 3 1 1
jki
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 1 0 0 1 0 1
3 2 0 0 2 0 2
4 0 0 1 3 1 0
5 1 0 1 4 1 1
6 2 0 1 5 1 2
7 0 0 2 6 2 0
8 1 0 2 7 2 1
9 2 0 2 8 2 2
10 0 1 0 0 3 3
11 1 1 0 1 3 4
12 2 1 0 2 3 5
13 0 1 1 3 4 3
14 1 1 1 4 4 4
15 2 1 1 5 4 5
16 0 1 2 6 5 3
17 1 1 2 7 5 4
18 2 1 2 8 5 5
19 0 2 0 0 6 6
20 1 2 0 1 6 7
21 2 2 0 2 6 8
22 0 2 1 3 7 6
23 1 2 1 4 7 7
24 2 2 1 5 7 8
25 0 2 2 6 8 6
26 1 2 2 7 8 7
27 2 2 2 8 8 8
A B C
stride 1 1 1
kij
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 0 1 0 0 3 3
3 0 2 0 0 6 6
4 1 0 0 1 0 1
5 1 1 0 1 3 4
6 1 2 0 1 6 7
7 2 0 0 2 0 2
8 2 1 0 2 3 5
9 2 2 0 2 6 8
10 0 0 1 3 1 0
11 0 1 1 3 4 3
12 0 2 1 3 7 6
13 1 0 1 4 1 1
14 1 1 1 4 4 4
15 1 2 1 4 7 7
16 2 0 1 5 1 2
17 2 1 1 5 4 5
18 2 2 1 5 7 8
19 0 0 2 6 2 0
20 0 1 2 6 5 3
21 0 2 2 6 8 6
22 1 0 2 7 2 1
23 1 1 2 7 5 4
24 1 2 2 7 8 7
25 2 0 2 8 2 2
26 2 1 2 8 5 5
27 2 2 2 8 8 8
A B C
stride 1 3 3
kji
iteration i j k A_idx B_idx C_idx
1 0 0 0 0 0 0
2 1 0 0 1 0 1
3 2 0 0 2 0 2
4 0 1 0 0 3 3
5 1 1 0 1 3 4
6 2 1 0 2 3 5
7 0 2 0 0 6 6
8 1 2 0 1 6 7
9 2 2 0 2 6 8
10 0 0 1 3 1 0
11 1 0 1 4 1 1
12 2 0 1 5 1 2
13 0 1 1 3 4 3
14 1 1 1 4 4 4
15 2 1 1 5 4 5
16 0 2 1 3 7 6
17 1 2 1 4 7 7
18 2 2 1 5 7 8
19 0 0 2 6 2 0
20 1 0 2 7 2 1
21 2 0 2 8 2 2
22 0 1 2 6 5 3
23 1 1 2 7 5 4
24 2 1 2 8 5 5
25 0 2 2 6 8 6
26 1 2 2 7 8 7
27 2 2 2 8 8 8
A B C
stride 1 3 1
  • we know that the smaller the stride, the better we can exploit spatial locality.
  • the influence of the different strides is reflected in their running times.

Task2: Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

  • we can observe that the jki ordering runs the best, at 1.451 Gflop/s
  • as mentioned above, this ordering exploits the advantage of locality the most.

Task3: Which ordering(s) perform the worst? Why?

  • the ikj ordering performs the worst, at 0.155 Gflop/s
  • as mentioned above, this ordering exploits locality the least: the CPU has to reload data from memory in almost every iteration (here we always assume the cache isn't large enough to hold the whole matrices).

Task4: How does the way we stride through the matrices with respect to the innermost loop affect performance?

  • in the experiment, we can see that the orderings whose innermost-loop stride is 1 are consistently faster.

Exercise 3 - Cache Blocking and Matrix Transposition

source: transpose.c

C code implementation
/* Implement cache blocking below. You should NOT assume that n is a
 * multiple of the block size. */
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // YOUR CODE HERE
    int blocknum = n / blocksize;
    for (int block_i = 0; block_i < blocknum; block_i++) {
        for (int block_j = 0; block_j < blocknum; block_j++) {
            int itr_i = block_i * blocksize;
            int itr_j = block_j * blocksize;
            for (int inner_i = 0; inner_i < blocksize; inner_i++) {
                for (int inner_j = 0; inner_j < blocksize; inner_j++) {
                    int x = itr_i + inner_i;
                    int y = itr_j + inner_j;
                    dst[y + x * n] = src[x + y * n];
                }
            }
        }
    }
    // transpose residual elements
    int nn = blocknum * blocksize;
    for (int x = 0; x < n; x++) {
        for (int y = nn; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
    for (int x = nn; x < n; x++) {
        for (int y = 0; y < n; y++) {
            dst[y + x * n] = src[x + y * n];
        }
    }
}

Part 1: Changing Array Sizes

  • blocksize is fixed to 20, and n is varied
  • time unit is milliseconds(ms)
| n | 20 | 60 | 70 | 80 | 100 | 1000 | 2000 | 5000 | 10000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| naive | 0.001 | 0.007 | 0.01 | 0.014 | 0.021 | 2.731 | 13.802 | 106.16 | 428.3 |
| blocking | 0.001 | 0.007 | 0.009 | 0.011 | 0.017 | 1.896 | 15.806 | 92.287 | 398.101 |
  • Q1: At what point does cache blocked version of transpose become faster than the non-cache blocked version?

  • A1: from about n > 70 the advantage of the blocked transpose appears clearly, and it generally grows as n gets larger (with an exception at n = 2000 in our measurements, where the naive version was faster).

  • Q2: Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

  • A2: if the matrix is too small, the advantage of spatial/temporal locality cannot show: the whole matrix already fits in the cache, so the naive version misses no more than the blocked one, and blocking only adds loop overhead.

Part 2: Changing Blocksize

  • n is fixed to 10000, and blocksize is varied
  • time unit is milliseconds(ms)
| blocksize | 10 | 20 | 50 | 100 | 500 | 1000 | 5000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| naive | 427.988 | 428.112 | 428.154 | 428.393 | 427.967 | 428.344 | 428.618 |
| blocking | 478.875 | 398.02 | 296.951 | 284.136 | 278.36 | 336.922 | 428.019 |
  • Q: How does performance change as blocksize increases? Why is this the case?
  • A: performance first improves as blocksize increases (better reuse of each loaded block, less per-block overhead), peaking around blocksize 500; beyond that, the advantage of temporal locality is lost and the behavior approaches the naive transpose, because the cache can no longer hold the blocks being worked on.