Assignment4: Cache
This assignment follows Assignment4 and CS61C-LAB7.
Exercise 1 - A Couple of Memory Access Scenarios
This exercise shows how the cache configuration affects the hit rate. The sample code is provided by CS61C.
- Modified the ecall number register from a0 to a7, because Ripes reads the ecall number from a7 (Venus uses a0).
Scenario 1
Program Parameters: (set these by initializing the a registers in the code)
- Array Size (a0): 128 (bytes)
- Step Size (a1): 8
- Rep Count (a2): 4
- Option (a3): 0
Cache Parameters: (set these in the Cache tab)
- Cache Levels: 1
- Block Size: 8
- Number of Blocks: 4
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1 (Venus won’t let you change this, why?)
- Block Replacement Policy: LRU
Explain how to decide the parameters
- Direct Mapped: Ways is set to 1, and Lines is equal to Number of Blocks, so set it to 4.
- In Ripes, one cell is 4 bytes (one word), so an 8-byte block appears as two cells in a row.
The hit rate is 0.
TIO Breakdown:
- offset: log2(8) = 3
- index: log2(4) = 2
- tag: 32 - 3 - 2 = 27
Q2. What is our hit rate if we increase Rep Count arbitrarily? Why?
At first I was confused about the unit of Step Size; it turns out to be words. A step of 8 words (32 bytes) means the index bits are the same (0) on every access, so each access conflicts in the same line and misses. The value of Rep Count therefore cannot influence the hit rate in this case.
If we change a1 (Step Size) to 1, each miss loads an 8-byte block (two words): the access to a block's first word misses and the access to its second word hits. Thus, the hit rate will be 0.5.
Scenario 2
Program Parameters: (set these by initializing the a registers in the code)
- Array Size (a0): 256 (bytes)
- Step Size (a1): 2
- Rep Count (a2): 1
- Option (a3): 1
Cache Parameters: (set these in the Cache tab)
- Cache Levels: 1
- Block Size: 16
- Number of Blocks: 16
- Enable?: Should be green
- Placement Policy: N-Way Set Associative
- Associativity: 4
- Block Replacement Policy: LRU
Explain how to decide the parameters
TIO breakdown:
- offset: log2(16) = 4
- index: log2(16 / 4) = 2
- tag: 32 - 4 - 2 = 26
- 4-Way Set Associative: Ways is set to 4, and Lines is equal to Number of Blocks divided by 4, so set it to 4.
- In Ripes, one cell is 4 bytes, so a 16-byte block appears as four cells in a row; 4 lines of 4 ways represent all 16 blocks.
Q1.How many memory accesses are there per iteration of the inner loop? (not the one involving repcount). It’s not 1.
2 memory accesses: Option 1 increments each element, so every inner-loop iteration does one load and one store.
Q2.What is the repeating hit/miss pattern? WHY? (Hint: it repeats every 4 accesses).
The pattern is 1 miss followed by 3 hits. With Step Size 2 (words) and a 16-byte block, each block covers two inner-loop iterations, and Option 1 does a load and a store per iteration: the first load of a block misses, the store to the same word hits, and the next iteration's load and store hit in the same block. As a result, we get a 0.75 hit rate.
Q3.This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
Due to the 3 hits / 1 miss pattern, the hit rate is 0.75.
Q4.Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
If Rep Count goes to infinity, the hit rate approaches 100%. Because the array size (256 bytes) equals the cache size (16 blocks × 16 bytes), the whole array is resident in the cache after the first outer loop, and every subsequent access hits.
Q5.Suppose we have a program that iterates through a very large array (AKA way bigger than the size of the cache) repcount times. During each Rep, we map a different function to the elements of our array (e.g. if Rep Count = 1024, we map 1024 different functions onto each of the array elements, one per Rep). (For reference, in this scenario, we just had one function (incrementation) and one Rep).
Given the program described above, how can we restructure its array accesses to achieve a hit rate like that achieved in this scenario? Assume that each array element is to be modified independently of the others AKA it doesn’t matter if Rep k is applied to element arr[i] before Rep k is applied to element arr[i+1], etc.
In my opinion, we should not iterate through the entire array once per Rep. Since each element is modified independently, we can instead apply all of the different functions to one element inside a single inner loop before moving on to the next element.
This gives great temporal locality: each array element is modified by every function while it is still in the cache, so it suffers only the first compulsory miss and then hits for all the remaining functions.
Scenario 3
Program Parameters: (set these by initializing the a registers in the code)
- Array Size (a0): 128 (bytes)
- Step Size (a1): 1
- Rep Count (a2): 1
- Option (a3): 0
Cache Parameters: (set these in the Cache tab)
NOTE: Make sure the following parameters are for the L1 cache! (Select L1 in the dropdown right next to the replacement policy)
- Block Size: 8
- Number of Blocks: 8
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
NOTE: Make sure the following parameters are for the L2 cache! (Select L2 in the dropdown right next to the replacement policy)
- Block Size: 8
- Number of Blocks: 16
- Enable?: Should be green
- Placement Policy: Direct Mapped
- Associativity: 1
- Block Replacement Policy: LRU
Explain the TIO Breakdown
Ripes does not support multi-level caches, so I worked through this scenario's cache behavior by hand.
LEVEL 1:
- offset: log2(8) = 3
- index: log2(8) = 3
- tag: 32 - 3 - 3 = 26
LEVEL 2:
- offset: log2(8) = 3
- index: log2(16) = 4
- tag: 32 - 3 - 4 = 25
Q1.What is the hit rate of our L1 cache? Our L2 cache? Overall?
- L1 cache: 0.5
- L2 cache: 0.0
- Overall: 0.5 (16 of the 32 accesses hit, all of them in L1)
Q2.How many accesses do we have to the L1 cache total? How many of them are misses?
There are 32 accesses to the L1 cache (128 bytes / 4 bytes per word), 16 of which are misses.
Q3.How many accesses do we have to the L2 cache total? How does this relate to the L1 cache (think about what the L1 cache has to do in order to make us access the L2 cache)?
There are 16 accesses to the L2 cache, because L2 is accessed only on an L1 miss; each of the 16 L1 misses produces one L2 access.
Q4.What program parameter would allow us to increase our L2 hit rate, but keep our L1 hit rate the same? Why?
Increase Rep Count. L2 (16 blocks × 8 bytes = 128 bytes) is big enough to hold the whole array, so after the first pass every L2 access hits. The L1 access pattern within each pass is unchanged, so the L1 hit rate stays the same while the L2 hit rate rises.
Q5.What happens to our hit rates for L1 and L2 as we slowly increase the number of blocks in L1? What about L1 block size?
Assume the Rep Count remains 1
- Increasing the number of blocks in L1 has no effect with Rep Count = 1: every miss is a compulsory miss, so the hit rate stays 0.5.
- Increasing the L1 block size helps: each miss brings in more consecutive words, so the hit rate rises (e.g. 16-byte blocks give 1 miss per 4 accesses, a 0.75 hit rate).
Exercise 2 - Loop Ordering and Matrix Multiplication
This exercise shows the importance of locality.
Tasks
Record your results in answers.txt
The unit “Gflops/s” reads, “Giga-floating-point operations per second”.
jki ordering performs best.
Analyzing the code: to preserve spatial locality, the innermost loop index should have the smallest stride.
- j is multiplied by n in two of the three array accesses
- k is multiplied by n in one access
- i is never multiplied by n
Thus i has the smallest stride and j the biggest, which is why jki (with i innermost) performs best.
kij ordering performs worst, and ikj is nearly as bad.
As mentioned above, j destroys spatial locality quickly when it is placed innermost.
The stride of the innermost loop determines how well spatial locality is maintained; the smaller the stride, the better the result.
Exercise 3 - Cache Blocking and Matrix Transposition
This exercise shows the advantage of exploiting locality. It compares two approaches: a locality-aware one (cache blocking) and a locality-oblivious one (the naive method).
- my transpose_blocking code
Part 1 - Changing Array Sizes
Fix the blocksize to be 20, and run your code with n equal to 50, 100, 1000, 2000, 5000, and 10000.
At what point does cache blocked version of transpose become faster than the non-cache blocked version?
When n is 100, the cache blocked version is faster than the other.
Both versions suffer compulsory misses either way. When n is small, the whole matrix fits in the cache, so the cache-blocked version cannot show its locality advantage; instead the extra loop overhead (more nested for-loops) makes it slightly slower.
Part 2 - Changing Blocksize
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
Across these block sizes there is no apparent performance improvement. When the blocksize is too large for a tile to fit in the cache, blocking loses its locality advantage and performs almost identically to the naive transpose. Choosing an appropriate blocksize is essential.