
Assignment4: Cache

tags: RISC-V

This assignment follows Assignment4 and CS61C-LAB7.

Exercise 1 - A Couple of Memory Access Scenarios

This exercise shows how cache structure affects hit rates under different access patterns. The sample code is given by CS61C.

  • I changed the exit ecall code from a0 to a7, because RIPES reads the ecall number from a7:
li a7,10 # exit

Scenario 1

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 8
  • Rep Count (a2): 4
  • Option (a3): 0
main:	li	a0, 128		# array size in BYTES (power of 2 < array size)
        li	a1, 8		# step size  (power of 2 > 0)
        li	a2, 4		# rep count  (int > 0)
        li	a3, 0		# 0 - option 0, 1 - option 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 8
  • Number of Blocks: 4
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1 (Venus won’t let you change this, why?)
  • Block Replacement Policy: LRU


Explain how to decide the parameters

  1. Direct Mapped: set Ways to $2^0 = 1$.
  2. Lines equals the Number of Blocks, so set it to $2^2 = 4$.
  3. In RIPES, one cell is 4 bytes (one word), so an 8-byte block spans two cells in a row; set Blocks to $2^1 = 2$.

Q1. What combination of parameters is producing the hit rate you observe? (Hint: Your answer should be of the form: “Because [parameter A] in bytes is exactly equal to [parameter B] in bytes.”)

The hit rate is 0. Because the step size in bytes (8 words = 32 bytes) is exactly equal to the cache size in bytes (4 blocks × 8 bytes = 32 bytes), every access maps to the same set with a different tag, so every access misses.
TIO Breakdown:

  • offset: $\log_2 8 = 3$
  • index: $\log_2(\text{Number of Blocks}/\text{ways}) = \log_2(4/1) = 2$
  • tag: $32 - 3 - 2 = 27$
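
As a check, a step of 8 words (32 bytes) makes the accessed byte addresses 0, 32, 64, and 96, and every one of them maps to set 0:

$$\text{index}(a) = \left\lfloor a/8 \right\rfloor \bmod 4 = 0 \quad \text{for } a \in \{0, 32, 64, 96\}$$

Since the tags differ, each access evicts the previous block, so all accesses miss.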

Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?

At first I confused the unit of the step size; it is in words, not bytes. A step of 8 words (32 bytes) means the index bits are 0 on every access, so every access maps to the same set and misses. The value of Rep Count therefore cannot influence the hit rate in this case: repeated passes keep evicting and re-fetching the same conflicting blocks.

Q3. How could we modify one program parameter to increase our hit rate? (Hint: Your answer should be of the form: “Change [parameter] to [value].” Note: we don’t care if we access the same array elements. Just give us a program parameter modification that would increase the hit rate.)

Change a1 (step size) to 1. After the compulsory miss on the first word of a block, the access to the second word of the same block hits, so the hit rate becomes 0.5.

  • a cache miss (screenshot not preserved)

  • the following cache hit (screenshot not preserved)

Scenario 2

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 256 (bytes)
  • Step Size (a1): 2
  • Rep Count (a2): 1
  • Option (a3): 1
main:	li	a0, 256		# array size in BYTES (power of 2 < array size)
	li	a1, 2		# step size  (power of 2 > 0)
	li	a2, 1		# rep count  (int > 0)
	li	a3, 1		# 0 - option 0, 1 - option 1

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 1
  • Block Size: 16
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: N-Way Set Associative
  • Associativity: 4
  • Block Replacement Policy: LRU


Explain how to decide the parameters

TIO breakdown:

  • offset: $\log_2 16 = 4$
  • index: $\log_2(16/4) = 2$
  • tag: $32 - 4 - 2 = 26$

  1. 4-Way Set Associative: set Ways to $2^2 = 4$.
  2. Lines equals the Number of Blocks divided by 4 (the associativity), so set it to $2^2 = 4$.
  3. In RIPES, one cell is 4 bytes, so a 16-byte block spans four cells in a row; set Blocks to $2^2 = 4$.

Q1. How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.

There are 2 memory accesses per iteration: a load and a store.

array[index] = array[index] + 1; // Option 1: Two cache accesses - read AND write
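
To make the two accesses explicit, here is a minimal C sketch of the inner loop for both options (my reconstruction from the lab's comments; the real workload is the provided assembly):

void access_array(int *array, int arraysize, int stepsize, int option) {
    // arraysize and stepsize are in words
    for (int index = 0; index < arraysize; index += stepsize) {
        if (option == 0)
            array[index] = 0;                 // option 0: one access (a store)
        else
            array[index] = array[index] + 1; // option 1: two accesses (load, then store)
    }
}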

Q2. What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).

The pattern is 1 miss followed by 3 hits. With a step size of 2 words (8 bytes) and option 1, each index is read and then written: the read of a new 16-byte block misses, the write to the same address hits, and the next index still falls inside the same block, so both its read and write hit. This yields a 0.75 hit rate.

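Written out, the repeating trace (subscripts are byte offsets into the array; step = 8 bytes, block = 16 bytes):

$$\underbrace{R_{0}\ \text{miss},\ W_{0}\ \text{hit},\ R_{8}\ \text{hit},\ W_{8}\ \text{hit}}_{\text{one 16-byte block}},\quad R_{16}\ \text{miss},\ \dots$$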

Q3. This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.

Due to the 3 hits / 1 miss pattern, the hit rate is 0.75.

Q4. Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!

As Rep Count goes to infinity, the hit rate approaches 100%. Since the array size (256 bytes) equals the cache size (16 blocks × 16 bytes = 256 bytes), the whole array is resident in the cache after the first pass, so every access in later reps hits.
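
A quick count confirms the limit: each rep makes $32 \times 2 = 64$ accesses (32 indices, each read then written), and only the first rep's 16 block loads miss, so over $r$ reps

$$\text{hit rate} = \frac{64r - 16}{64r} = 1 - \frac{1}{4r} \xrightarrow[r \to \infty]{} 1$$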

Q5. Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).

Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.

My idea is not to iterate through the entire array once per Rep. Because each element is modified independently, all the different functions can be applied to an element inside one inner loop, like:

for (index = 0; index < arraysize; index += stepsize) {
    array[index] = func1(array[index]);
    array[index] = func2(array[index]);
    array[index] = func3(array[index]);
    ...
}

This gives great temporal locality: each array element is modified by every function within the same iteration, so it takes only 1 compulsory cache miss on the first function and then hits for all the remaining ones.
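
The sketch above can be made concrete with an array of function pointers; this is a hypothetical rewrite (the names map_all_reps and funcs are mine, not from the lab):

#include <stddef.h>

// Apply every Rep's function to an element while it is still cached:
// one compulsory miss per element, then hits for all remaining reps.
void map_all_reps(int *arr, size_t n, int (*funcs[])(int), size_t repcount) {
    for (size_t i = 0; i < n; i++) {
        int x = arr[i];                      // single compulsory miss
        for (size_t r = 0; r < repcount; r++)
            x = funcs[r](x);                 // element stays cached (or in a register)
        arr[i] = x;
    }
}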

Scenario 3

Program Parameters: (set these by initializing the a registers in the code)

  • Array Size (a0): 128 (bytes)
  • Step Size (a1): 1
  • Rep Count (a2): 1
  • Option (a3): 0

Cache Parameters: (set these in the Cache tab)

  • Cache Levels: 2

NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 8
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)

  • Block Size: 8
  • Number of Blocks: 16
  • Enable?: Should be green
  • Placement Policy: Direct Mapped
  • Associativity: 1
  • Block Replacement Policy: LRU

Explain the TIO Breakdown

RIPES doesn't support multi-level caches, so I worked out the two-level behavior by hand.

LEVEL 1:

  • offset: $\log_2 8 = 3$
  • index: $\log_2 8 = 3$
  • tag: $32 - 3 - 3 = 26$

LEVEL 2:

  • offset: $\log_2 8 = 3$
  • index: $\log_2 16 = 4$
  • tag: $32 - 3 - 4 = 25$

Q1. What is the hit rate of our L1 cache? Our L2 cache? Overall?

  • L1 cache: 0.5
  • L2 cache: 0.0
  • Overall: $(32 \times 0.5 + 16 \times 0)/48 = 16/48 \approx 0.33$

Q2. How many accesses do we have to the L1 cache total? How many of them are misses?

There are 32 accesses to the L1 cache in total, and 16 of them are misses.
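
The counts follow from the parameters: the 128-byte array is read one word (4 bytes) at a time with step size 1, and each 8-byte block holds two words, so every other access is a compulsory miss:

$$\text{accesses} = \frac{128}{4} = 32, \qquad \text{misses} = \frac{128}{8} = 16$$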

Q3. How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?

There are 16 accesses to the L2 cache, one for each L1 miss: L2 is only consulted when L1 misses.

Q4. What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?

Increase the Rep Count. L2 (16 blocks × 8 bytes = 128 bytes) is big enough to hold the whole array, so from the second rep onward every access that reaches L2 hits. Meanwhile each rep produces the same L1 behavior (the array is twice the L1 size, so each pass misses and hits identically), keeping the L1 hit rate unchanged.

Q5. What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?

Assume the Rep Count remains 1.

  • Increasing the number of blocks in L1 has no effect: with a single pass, every miss is compulsory, so the hit rate stays 0.5.
  • Increasing the L1 block size has a positive effect: a bigger block turns more of the sequential word accesses into hits (e.g. 16-byte blocks give 1 miss then 3 hits, a 0.75 hit rate).

Exercise 2 - Loop Ordering and Matrix Multiplication

This exercise shows the importance of locality.

Tasks

Record your results in answers.txt

The unit “Gflop/s” reads, “Giga-floating-point-operations per second”.

C[i+j*n] += A[i+k*n]*B[k+j*n];
ijk:	n = 1000, 1.774 Gflop/s
ikj:	n = 1000, 0.253 Gflop/s
jik:	n = 1000, 1.571 Gflop/s
jki:	n = 1000, 11.964 Gflop/s
kij:	n = 1000, 0.227 Gflop/s
kji:	n = 1000, 8.061 Gflop/s

Which ordering(s) perform best for these 1000-by-1000 matrices? Why?

jki ordering performs best.

Analyzing the code: to preserve spatial locality, the innermost loop should have the smallest stride.

  • j multiplies n in two of the index expressions (C and B)
  • k multiplies n in one (A)
  • i multiplies n in none

Thus i has the smallest stride (1) and j the largest (n, in two arrays). Putting i innermost, as jki does, performs best (kji, which also has i innermost, is second best).
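
For reference, here is what the jki ordering looks like in C (a sketch assuming the lab's column-major indexing, where X[row + col*n] is element (row, col)):

void matmul_jki(int n, double *A, double *B, double *C) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double b = B[k + j * n];      // invariant in the inner loop
            for (int i = 0; i < n; i++)   // innermost: stride-1 through A and C
                C[i + j * n] += A[i + k * n] * b;
        }
}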

Which ordering(s) perform the worst? Why?

kij ordering performs worst, and ikj is nearly as bad.

As mentioned above, when j is the innermost loop, both C and B are strided by n on every iteration, which destroys spatial locality.

How does the way we stride through the matrices with respect to the innermost loop affect performance?

The stride of the innermost loop determines how much spatial locality we exploit: a smaller stride means more consecutive accesses land in the same cache line, giving better performance.

Exercise 3 - Cache Blocking and Matrix Transposition

This exercise demonstrates the advantage of caring about locality. It compares two approaches: careful locality (cache blocking) and careless locality (the naive method).

  • my transpose_blocking code
void transpose_blocking(int n, int blocksize, int *dst, int *src) {
    // Walk the matrix one blocksize x blocksize tile at a time.
    for (int i = 0; i < n; i += blocksize) {
        for (int j = 0; j < n; j += blocksize) {
            // Transpose a single tile; the "&& k < n" / "&& l < n" checks
            // handle tiles that stick out past the matrix edge.
            for (int k = i; k < blocksize + i && k < n; k++) {
                for (int l = j; l < blocksize + j && l < n; l++) {
                    dst[l + k * n] = src[k + l * n];
                }
            }
        }
    }
}
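
For comparison, the naive version it is measured against is the straightforward double loop (a sketch; the lab's timing harness is omitted):

void transpose_naive(int n, int *dst, int *src) {
    for (int x = 0; x < n; x++)
        for (int y = 0; y < n; y++)
            dst[y + x * n] = src[x + y * n];   // src is read with stride n
}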

Part 1 - Changing Array Sizes

Fix the blocksize to be 20, and run your code with n equal to 50, 100, 1000, 2000, 5000, and 10000.

  • n: 50
Testing naive transpose: 0.01 milliseconds
Testing transpose with blocking: 0.006 milliseconds
  • n: 100
Testing naive transpose: 0.027 milliseconds
Testing transpose with blocking: 0.014 milliseconds
  • n: 1000
Testing naive transpose: 1.651 milliseconds
Testing transpose with blocking: 0.914 milliseconds
  • n: 2000
Testing naive transpose: 16.169 milliseconds
Testing transpose with blocking: 3.793 milliseconds
  • n: 5000
Testing naive transpose: 136.588 milliseconds
Testing transpose with blocking: 24.525 milliseconds
  • n: 10000
Testing naive transpose: 784.778 milliseconds
Testing transpose with blocking: 131.418 milliseconds

At what point does cache blocked version of transpose become faster than the non-cache blocked version?

From n = 100 onward the cache-blocked version is clearly faster. (In these runs it is marginally faster even at n = 50, but timings that small are dominated by noise.)

Why does cache blocking require the matrix to be a certain size before it outperforms the non-cache blocked code?

Both versions suffer compulsory misses either way. When n is small, the whole matrix fits in the cache, so the naive version already enjoys good locality and blocking has nothing to save; the extra loop overhead of the blocked version can even make it slower. Only once the matrix substantially exceeds the cache does blocking pay off.
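
Rough numbers make this concrete (assuming 4-byte ints and a typical 32 KiB L1 data cache, which is an assumption about the test machine):

$$n = 50:\ 2 \times 50^2 \times 4\,\text{B} \approx 20\,\text{KiB (fits in L1)}, \qquad n = 10000:\ 2 \times 10000^2 \times 4\,\text{B} \approx 763\,\text{MiB}$$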

Part 2 - Changing Blocksize

Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.

  • blocksize: 50
Testing naive transpose: 709.491 milliseconds
Testing transpose with blocking: 118.873 milliseconds
  • blocksize: 100
Testing naive transpose: 707.79 milliseconds
Testing transpose with blocking: 114.205 milliseconds
  • blocksize: 500
Testing naive transpose: 754.169 milliseconds
Testing transpose with blocking: 105.873 milliseconds
  • blocksize: 1000
Testing naive transpose: 705.629 milliseconds
Testing transpose with blocking: 152.805 milliseconds
  • blocksize: 5000
Testing naive transpose: 783.175 milliseconds
Testing transpose with blocking: 714.039 milliseconds

How does performance change as blocksize increases? Why is this the case?

Performance improves only slightly as blocksize grows from 50 to 500, then degrades. Once a block is too big to exploit locality, i.e. a tile no longer fits in the cache, the result is almost identical to the naive transpose (see blocksize 5000). Choosing an appropriate blocksize is essential.
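
A rough working-set estimate (again assuming 4-byte ints) matches the trend: one tile of src plus one tile of dst occupy about $2b^2 \times 4$ bytes.

$$b = 500:\ 2 \times 500^2 \times 4\,\text{B} \approx 1.9\,\text{MiB}, \qquad b = 5000:\ 2 \times 5000^2 \times 4\,\text{B} \approx 191\,\text{MiB}$$

A 500-wide tile still fits in a typical last-level cache, while a 5000-wide tile cannot, so blocking degenerates toward naive behavior.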