
Assignment4: Cache

tags: RISC-V

This assignment follows Assignment4 and CS61C-LAB7.

Exercise 1 - A Couple of Memory Access Scenarios

This exercise shows how cache structure affects hit rates under different access patterns. The sample code is given by CS61C.

  • I changed the exit ecall code from a0 to a7, because RIPES reads the ecall number from a7:
li a7,10 # exit

Scenario 1

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 8
  • Rep Count (a2): 4
  • Option (a3): 0
main:	li	a0, 128		# array size in BYTES (power of 2 < array size)
        li	a1, 8		# step size  (power of 2 > 0)
        li	a2, 4		# rep count  (int > 0)
        li	a3, 0		# 0 - option 0, 1 - option 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 8
  • Number of Blocks: 4
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1 (Venus won’t let you change this, why?)
  • Block Replacement Policy: LRU


Explain how to decide the parameters

  1. Direct Mapped: set Ways to $2^0 = 1$.
  2. Lines equals the Number of Blocks, so set it to $2^2 = 4$.
  3. In RIPES, one cell is 4 bytes (one word), so an 8-byte block spans two cells in a row; set Blocks to $2^1 = 2$.

Q1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)

The hit rate is 0. Because the step size in bytes (8 words = 32 bytes) is exactly equal to the cache size in bytes (4 blocks × 8 bytes = 32 bytes), every access maps to the same set with a different tag, so every access misses.
TIO Breakdown:

  • offset: $\log_2 8 = 3$
  • index: $\log_2(\text{Number of Blocks}/\text{ways}) = \log_2(4/1) = 2$
  • tag: $32 - 3 - 2 = 27$
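
As a check, a step of 8 words (32 bytes) makes the accessed byte addresses 0, 32, 64, and 96, and every one of them maps to set 0:

$$\text{index}(a) = \left\lfloor a/8 \right\rfloor \bmod 4 = 0 \quad \text{for } a \in \{0, 32, 64, 96\}$$

Since the tags differ, each access evicts the previous block, so all accesses miss.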

Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?

At first I confused the unit of the step size; it is in words, not bytes. A step of 8 words (32 bytes) means the index bits are 0 on every access, so every access maps to the same set and misses. The value of Rep Count therefore cannot influence the hit rate in this case: repeated passes keep evicting and re-fetching the same conflicting blocks.

Q3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

Change a1 (step size) to 1. After the compulsory miss on the first word of a block, the access to the second word of the same block hits, so the hit rate becomes 0.5.

  • a cache miss (screenshot not preserved)

  • the following cache hit (screenshot not preserved)

Scenario 2

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 256 (bytes)
  • Step Size (a1): 2
  • Rep Count (a2): 1
  • Option (a3): 1
main:	li	a0, 256		# array size in BYTES (power of 2 < array size)
	li	a1, 2		# step size  (power of 2 > 0)
	li	a2, 1		# rep count  (int > 0)
	li	a3, 1		# 0 - option 0, 1 - option 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 16
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: N-Way Set Associative
  • Associativity: 4
  • Block Replacement Policy: LRU


Explain how to decide the parameters

TIO breakdown:

  • offset: $\log_2 16 = 4$
  • index: $\log_2(16/4) = 2$
  • tag: $32 - 4 - 2 = 26$

  1. 4-Way Set Associative: set Ways to $2^2 = 4$.
  2. Lines equals the Number of Blocks divided by 4 (the associativity), so set it to $2^2 = 4$.
  3. In RIPES, one cell is 4 bytes, so a 16-byte block spans four cells in a row; set Blocks to $2^2 = 4$.

Q1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.

There are 2 memory accesses per iteration: a load and a store.

array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
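
To make the two accesses explicit, here is a minimal C sketch of the inner loop for both options (my reconstruction from the lab's comments; the real workload is the provided assembly):

void access_array(int *array, int arraysize, int stepsize, int option) {
    // arraysize and stepsize are in words
    for (int index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                 // option 0: one access (a store)
        else
            array[index] = array[index] + 1; // option 1: two accesses (load, then store)
    }
}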

Q2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).

The pattern is 1 miss followed by 3 hits. With a step size of 2 words (8 bytes) and option 1, each index is read and then written: the read of a new 16-byte block misses, the write to the same address hits, and the next index still falls inside the same block, so both its read and write hit. This yields a 0.75 hit rate.

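Written out, the repeating trace (subscripts are byte offsets into the array; step = 8 bytes, block = 16 bytes):

$$\underbrace{R_{0}\ \text{miss},\ W_{0}\ \text{hit},\ R_{8}\ \text{hit},\ W_{8}\ \text{hit}}_{\text{one 16-byte block}},\quad R_{16}\ \text{miss},\ \dots$$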

Q3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

Due to the 3 hits / 1 miss pattern, the hit rate is 0.75.

Q4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

As Rep Count goes to infinity, the hit rate approaches 100%. Since the array size (256 bytes) equals the cache size (16 blocks × 16 bytes = 256 bytes), the whole array is resident in the cache after the first pass, so every access in later reps hits.
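
A quick count confirms the limit: each rep makes $32 \times 2 = 64$ accesses (32 indices, each read then written), and only the first rep's 16 block loads miss, so over $r$ reps

$$\text{hit rate} = \frac{64r - 16}{64r} = 1 - \frac{1}{4r} \xrightarrow[r \to \infty]{} 1$$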

Q5. Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).

Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.

My idea is not to iterate through the entire array once per Rep. Because each element is modified independently, all the different functions can be applied to an element inside one inner loop, like:

for (index = 0; index < arraysize; index += stepsize) {
    array[index] = func1(array[index]);
    array[index] = func2(array[index]);
    array[index] = func3(array[index]);
    ...
}

This gives great temporal locality: each array element is modified by every function within the same iteration, so it takes only 1 compulsory cache miss on the first function and then hits for all the remaining ones.
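
The sketch above can be made concrete with an array of function pointers; this is a hypothetical rewrite (the names map_all_reps and funcs are mine, not from the lab):

#include <stddef.h>

// Apply every Rep's function to an element while it is still cached:
// one compulsory miss per element, then hits for all remaining reps.
void map_all_reps(int *arr, size_t n, int (*funcs[])(int), size_t repcount) {
    for (size_t i = 0; i < n; i++) {
        int x = arr[i];                      // single compulsory miss
        for (size_t r = 0; r < repcount; r++)
            x = funcs[r](x);                 // element stays cached (or in a register)
        arr[i] = x;
    }
}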

Scenario 3

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 1
  • Rep Count (a2): 1
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 2

NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 8
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

Explain the TIO Breakdown

RIPES doesn't support multi-level caches, so I worked out the two-level behavior by hand.

LEVEL 1:

  • offset: $\log_2 8 = 3$
  • index: $\log_2 8 = 3$
  • tag: $32 - 3 - 3 = 26$

LEVEL 2:

  • offset: $\log_2 8 = 3$
  • index: $\log_2 16 = 4$
  • tag: $32 - 3 - 4 = 25$

Q1. What is the hit rate of our L1 cache? Our L2 cache? Overall?

  • L1 cache: 0.5
  • L2 cache: 0.0
  • Overall: $(32 \times 0.5 + 16 \times 0)/48 = 16/48 \approx 0.33$

Q2. How many accesses do we have to the L1 cache total? How many of them are misses?

There are 32 accesses to the L1 cache in total, and 16 of them are misses.
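
The counts follow from the parameters: the 128-byte array is read one word (4 bytes) at a time with step size 1, and each 8-byte block holds two words, so every other access is a compulsory miss:

$$\text{accesses} = \frac{128}{4} = 32, \qquad \text{misses} = \frac{128}{8} = 16$$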

Q3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?

There are 16 accesses to the L2 cache, one for each L1 miss: L2 is only consulted when L1 misses.

Q4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?

Increase the Rep Count. L2 (16 blocks × 8 bytes = 128 bytes) is big enough to hold the whole array, so from the second rep onward every access that reaches L2 hits. Meanwhile each rep produces the same L1 behavior (the array is twice the L1 size, so each pass misses and hits identically), keeping the L1 hit rate unchanged.

Q5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?

Assume the Rep Count remains 1.

  • Increasing the number of blocks in L1 has no effect: with a single pass, every miss is compulsory, so the hit rate stays 0.5.
  • Increasing the L1 block size has a positive effect: a bigger block turns more of the sequential word accesses into hits (e.g. 16-byte blocks give 1 miss then 3 hits, a 0.75 hit rate).

Exercise 2 - Loop Ordering and Matrix Multiplication

This exercise shows the importance of locality.

Tasks

Record your results in answers.txt

The unit “Gflop/s” reads, “Giga-floating-point-operations per second”.

C[i+j*n] += A[i+k*n]*B[k+j*n];
ijk:	n = 1000, 1.774 Gflop/s
ikj:	n = 1000, 0.253 Gflop/s
jik:	n = 1000, 1.571 Gflop/s
jki:	n = 1000, 11.964 Gflop/s
kij:	n = 1000, 0.227 Gflop/s
kji:	n = 1000, 8.061 Gflop/s

Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

jki ordering performs best.

Analyzing the code: to preserve spatial locality, the innermost loop should have the smallest stride.

  • j multiplies n in two of the index expressions (C and B)
  • k multiplies n in one (A)
  • i multiplies n in none

Thus i has the smallest stride (1) and j the largest (n, in two arrays). Putting i innermost, as jki does, performs best (kji, which also has i innermost, is second best).
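
For reference, here is what the jki ordering looks like in C (a sketch assuming the lab's column-major indexing, where X[row + col*n] is element (row, col)):

void matmul_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j * n];      // invariant in the inner loop
            for (int i = 0; i < n; i++)   // innermost: stride-1 through A and C
                C[i + j * n] += A[i + k * n] * b;
        }
}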

Which ordering(s) perform the worst? Why?

kij ordering performs worst, and ikj is nearly as bad.

As mentioned above, when j is the innermost loop, both C and B are strided by n on every iteration, which destroys spatial locality.

How does the way we stride through the matrices with respect to the innermost loop affect performance?

The stride of the innermost loop determines how much spatial locality we exploit: a smaller stride means more consecutive accesses land in the same cache line, giving better performance.

Exercise 3 - Cache Blocking and Matrix Transposition

This exercise demonstrates the advantage of caring about locality. It compares two approaches: careful locality (cache blocking) and careless locality (the naive method).

  • my transpose_blocking code
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // Walk the matrix one blocksize x blocksize tile at a time.
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < n; j += blocksize) {
            // Transpose a single tile; the "&& k < n" / "&& l < n" checks
            // handle tiles that stick out past the matrix edge.
            for (int k = i; k < blocksize + i && k < n; k++) {
                for (int l = j; l < blocksize + j && l < n; l++) {
                    dst[l + k * n] = src[k + l * n];
                }
            }
        }
    }
}
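
For comparison, the naive version it is measured against is the straightforward double loop (a sketch; the lab's timing harness is omitted):

void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];   // src is read with stride n
}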

Part 1 - Changing Array Sizes

Fix the blocksize to be 20, and run your code with n equal to 50, 100, 1000, 2000, 5000, and 10000.

  • n: 50
Testing naive transpose: 0.01 milliseconds
Testing transpose with blocking: 0.006 milliseconds
  • n: 100
Testing naive transpose: 0.027 milliseconds
Testing transpose with blocking: 0.014 milliseconds
  • n: 1000
Testing naive transpose: 1.651 milliseconds
Testing transpose with blocking: 0.914 milliseconds
  • n: 2000
Testing naive transpose: 16.169 milliseconds
Testing transpose with blocking: 3.793 milliseconds
  • n: 5000
Testing naive transpose: 136.588 milliseconds
Testing transpose with blocking: 24.525 milliseconds
  • n: 10000
Testing naive transpose: 784.778 milliseconds
Testing transpose with blocking: 131.418 milliseconds

At what point does cache blocked version of transpose become faster than the non-cache blocked version?

From n = 100 onward the cache-blocked version is clearly faster. (In these runs it is marginally faster even at n = 50, but timings that small are dominated by noise.)

Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

Both versions suffer compulsory misses either way. When n is small, the whole matrix fits in the cache, so the naive version already enjoys good locality and blocking has nothing to save; the extra loop overhead of the blocked version can even make it slower. Only once the matrix substantially exceeds the cache does blocking pay off.
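
Rough numbers make this concrete (assuming 4-byte ints and a typical 32 KiB L1 data cache, which is an assumption about the test machine):

$$n = 50:\ 2 \times 50^2 \times 4\,\text{B} \approx 20\,\text{KiB (fits in L1)}, \qquad n = 10000:\ 2 \times 10000^2 \times 4\,\text{B} \approx 763\,\text{MiB}$$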

Part 2 - Changing Blocksize

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.

  • blocksize: 50
Testing naive transpose: 709.491 milliseconds
Testing transpose with blocking: 118.873 milliseconds
  • blocksize: 100
Testing naive transpose: 707.79 milliseconds
Testing transpose with blocking: 114.205 milliseconds
  • blocksize: 500
Testing naive transpose: 754.169 milliseconds
Testing transpose with blocking: 105.873 milliseconds
  • blocksize: 1000
Testing naive transpose: 705.629 milliseconds
Testing transpose with blocking: 152.805 milliseconds
  • blocksize: 5000
Testing naive transpose: 783.175 milliseconds
Testing transpose with blocking: 714.039 milliseconds

How does performance change as blocksize increases? Why is this the case?

Performance improves only slightly as blocksize grows from 50 to 500, then degrades. Once a block is too big to exploit locality, i.e. a tile no longer fits in the cache, the result is almost identical to the naive transpose (see blocksize 5000). Choosing an appropriate blocksize is essential.
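
A rough working-set estimate (again assuming 4-byte ints) matches the trend: one tile of src plus one tile of dst occupy about $2b^2 \times 4$ bytes.

$$b = 500:\ 2 \times 500^2 \times 4\,\text{B} \approx 1.9\,\text{MiB}, \qquad b = 5000:\ 2 \times 5000^2 \times 4\,\text{B} \approx 191\,\text{MiB}$$

A 500-wide tile still fits in a typical last-level cache, while a 5000-wide tile cannot, so blocking degenerates toward naive behavior.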