accessWords
- s0 holds the start address of the array, and s1 holds the end address of the array.
- t1 holds how many bytes the pointer must shift on each step.

wordLoop
Check whether a3 is equal to 0; if it is, jump to wordZero. Otherwise, load the word at s0 (the pointer to the current position in the array) into t0, add 1, and store it back (array[index] = array[index] + 1). Then fall through to wordCheck.

wordCheck
Move the array pointer (s0) to the next position with add s0, s0, t1. If s0 < s1 (blt), jump back to wordLoop and keep executing; this means we are still inside the inner loop and have not yet accessed all the array data. a3 holds the repeat count for this program, so decrement a3 by 1 and check whether it has reached 0.

wordZero
When the option is 0, set array[index] to 0.
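For orientation, here is a minimal C sketch of the control flow described above; the names access_words, start, end, step_bytes, option, and repcount are mine, not labels from cache.s:

```c
#include <stdint.h>

/* Sketch of the accessWords loop: walk the array from start to end in
 * steps of step_bytes; option 1 increments each word, option 0 zeroes it.
 * The whole walk is repeated repcount times (the a3 counter). */
void access_words(uint32_t *start, uint32_t *end,
                  int step_bytes, int option, int repcount) {
    for (int r = 0; r < repcount; r++) {              /* a3 countdown   */
        for (uint32_t *p = start; p < end;            /* blt s0, s1     */
             p = (uint32_t *)((char *)p + step_bytes)) {
            if (option == 0)
                *p = 0;                               /* wordZero path  */
            else
                *p = *p + 1;                          /* array[i] += 1  */
        }
    }
}
```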
Program parameter setup
Observation for Memory
The start pointer points to address 0x10000000, and the end pointer to address 0x10000008. The total array size is 8 bytes (2 words). The step size is 1, so each iteration accesses the next word of data.

Array:

M | Data | Address |
---|---|---|
End | - | 0x10000008 |
- | 0x0 | 0x10000007 |
- | 0x0 | 0x10000006 |
- | 0x0 | 0x10000005 |
M[1] | 0x1 | 0x10000004 |
- | 0x0 | 0x10000003 |
- | 0x0 | 0x10000002 |
- | 0x8 | 0x10000001 |
M[0] | 0x1 | 0x10000000 |
Why is the data at 0x10000001 always 0x8? And why is the address 0x1000001c always accessed?
Another case
The step size is 2 words, so the program accesses the addresses 0x10000000, 0x10000008, and 0x10000010; but 0x10000010 is beyond the limit address 0x1000000c, so no data is written at 0x10000010.
The program repeats 3 times, so 0x10000000 and 0x10000008 end up holding the value 3.
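To make the address arithmetic explicit, here is a small sketch of that walk (assuming 4-byte words, as in RV32):

```c
#include <stdio.h>

int main(void) {
    unsigned start = 0x10000000, limit = 0x1000000c;
    unsigned step  = 2 * 4;                    /* 2 words * 4 bytes */
    for (unsigned a = start; a < limit; a += step)
        printf("write at 0x%08x\n", a);        /* 0x10000000, 0x10000008 */
    /* The next candidate, 0x10000010, fails the a < limit check
     * (blt s0, s1) and is therefore never written. */
    return 0;
}
```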
Hit
Miss
Replacement policies
The Venus cache simulator currently simulates a write-through, write-allocate cache.
So we need to choose the same options in Ripes for our experiment.
Task:
Simulate the following scenarios and record the final cache hit rates with the program in cache.s.
Try to reason out what the hit rate will be BEFORE running the code.
THE FOLLOWING are good questions to ask yourself as you do these exercises (not checkoff questions):
Program Parameters: (set these by initializing the a registers in the code)
Cache Parameters: (set these in the Cache tab)
Here we modify the program parameters for cache.s in Ripes.
Confirm that the Data bits equal exactly 32 bytes (256 bits).
The result after executing the simulator.
Compare with Venus
0x10000060 = 0b0001_0000_0000_0000_0000_0000_0110_0000
(Tag) 000_1000_0000_0000_0000_0000_0011 | (Index) 00 | (Offset) 000, so the tag is 0x00800003.
The index is 0 for all of M[0], M[8], M[16], and M[24], which means a collision happens on every access.

Access M[0]: miss

Index | V | Tag | Block 0 | Block 1 |
---|---|---|---|---|
0 | 1 | 0x00800000 | M[0] | M[1] |
1 | 0 | - | - | - |
2 | 0 | - | - | - |
3 | 0 | - | - | - |
Access M[8]: miss

Index | V | Tag | Block 0 | Block 1 |
---|---|---|---|---|
0 | 1 | 0x00800001 | M[8] | M[9] |
1 | 0 | - | - | - |
2 | 0 | - | - | - |
3 | 0 | - | - | - |
Access M[16]: miss

Index | V | Tag | Block 0 | Block 1 |
---|---|---|---|---|
0 | 1 | 0x00800002 | M[16] | M[17] |
1 | 0 | - | - | - |
2 | 0 | - | - | - |
3 | 0 | - | - | - |
Access M[24]: miss

Index | V | Tag | Block 0 | Block 1 |
---|---|---|---|---|
0 | 1 | 0x00800003 | M[24] | M[25] |
1 | 0 | - | - | - |
2 | 0 | - | - | - |
3 | 0 | - | - | - |
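To verify these tag/index/offset values mechanically, here is a small sketch for this cache shape (8-byte blocks, 4 sets); the helper name tio is mine:

```c
#include <stdio.h>

/* Split an address for a cache with 8-byte blocks (3 offset bits)
 * and 4 sets (2 index bits); the remaining 27 bits are the tag. */
static void tio(unsigned addr) {
    unsigned offset = addr & 0x7;
    unsigned index  = (addr >> 3) & 0x3;
    unsigned tag    = addr >> 5;
    printf("0x%08x -> tag 0x%07x, index %u, offset %u\n",
           addr, tag, index, offset);
}

int main(void) {
    /* M[0], M[8], M[16], M[24] from the walkthrough above */
    tio(0x10000000);   /* tag 0x0800000, index 0 */
    tio(0x10000020);   /* tag 0x0800001, index 0 */
    tio(0x10000040);   /* tag 0x0800002, index 0 */
    tio(0x10000060);   /* tag 0x0800003, index 0 */
    return 0;
}
```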
The program repeats 4 times, so keeping the array values in the cache for as long as possible is a good approach. For example:
- …, with a 0.875 hit rate.
- …, with a 0.9375 hit rate.
- …, with a 0.75 hit rate.
What is our hit rate if we increase Rep Count arbitrarily? Why?
Ans.
The above combination of parameters never produces a hit. No matter how far we increase the rep count, the hit rate will not improve; it only increases the number of misses, because every access is a cache miss.
However, if the parameters allow any hits at all, increasing the rep count produces more of them. E.g., comparing a rep count of 4 with a rep count of 16, the hit rate rises from 0.75 to 0.9375.
How could we modify one program parameter to increase our hit rate?
Ans.
We can change the step size to 1 to increase the hit rate, because each miss then brings two array values (one at an even index and one at an odd index) into the cache at the same time. That is why a cache hit always occurs when we access the value at the odd index.
Therefore, we get a hit rate of 0.5 after the program finishes.
Miss when the CPU accesses an even index of the array.
Hit when the CPU accesses an odd index of the array.
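Put as arithmetic (assuming the two-word blocks and the pattern above): each block contributes one miss (the even index) followed by one hit (the odd index), so hit rate = hits / (hits + misses) = 1/2 = 0.5.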
Program Parameters: (set these by initializing the a registers in the code)
Cache Parameters: (set these in the Cache tab)
Hit rate is 0.75.
The result in Ripes. Hit rate is also 0.75.
M[0] = 0x10000000, M[2] = 0x10000008, M[4] = 0x10000010, M[6] = 0x10000018, M[8] = 0x10000020, M[10], …, M[60], M[62] (in general, M[k] sits at 0x10000000 + 4k)
TIO Breakdown:
e.g. Note: the option is 1, so each element accesses the cache twice! (Once to read the data, once to write it back.)
Access M[0]: read miss; the CPU accesses main memory.
Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0x00400000 | M[0] | M[1] | M[2] | M[3] |
0 | 1 | 0 | - | - | - | - | - |
0 | 2 | 0 | - | - | - | - | - |
0 | 3 | 0 | - | - | - | - | - |
Access M[0]: write hit; the CPU accesses main memory because of the write-through policy.
Access M[2]: read hit; the CPU does not access main memory.
Access M[2]: write hit; the CPU accesses main memory because of the write-through policy.
Access M[4]: read miss
Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
---|---|---|---|---|---|---|---|
1 | 0 | 1 | 0x00400000 | M[4] | M[5] | M[6] | M[7] |
1 | 1 | 0 | - | - | - | - | - |
1 | 2 | 0 | - | - | - | - | - |
1 | 3 | 0 | - | - | - | - | - |
Access M[4]: write hit
Access M[6]: read hit
Access M[6]: write hit
…
Access M[16]: read miss
Set | Slot | V | Tag | Block 0 | Block 1 | Block 2 | Block 3 |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0x00400000 | M[0] | M[1] | M[2] | M[3] |
0 | 1 | 1 | 0x00400001 | M[16] | M[17] | M[18] | M[19] |
0 | 2 | 0 | - | - | - | - | - |
0 | 3 | 0 | - | - | - | - | - |
Access M[16]: write hit
Access M[18]: read hit
Access M[18]: write hit
…
Access M[60]: read miss
Access M[60]: write hit
Access M[62]: read hit
Access M[62]: write hit
What is the repeating hit/miss pattern? WHY?
Because the block size is 16 bytes (4 words), each slot (line) holds 4 array values.
Therefore, when the CPU takes a cache miss, the fill also brings in the data that will be accessed on the next iteration, since the step size is 2 words.
And because the option is 1, the CPU accesses the cache twice per index: the first index of each block takes a cache read miss followed by a cache write hit, while the second index in the same block takes a cache read hit followed by a cache write hit.
e.g.
Access M[0]: read miss; the CPU accesses main memory.
Hits: 0, Misses: 1, Writebacks: 0
Access M[0]: write hit; the CPU accesses main memory (write-through).
Hits: 1, Misses: 1, Writebacks: 1
Access M[2]: read hit; the CPU does not access main memory.
Hits: 2, Misses: 1, Writebacks: 1
Access M[2]: write hit; the CPU accesses main memory (write-through).
Hits: 3, Misses: 1, Writebacks: 2
Access M[4]: read miss
… and it repeats every 4 accesses.
This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
Ans.
Each slot sees 4 accesses: 1 cache miss (the read of the first index) and 3 cache hits.
So the hit rate is 3/4 = 0.75.
Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
The hit rate approaches 1.
rep count = 10240
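One way to see this (assuming the whole array fits in the cache after the first pass): if a single pass makes N cache accesses and only the first pass pays M compulsory misses, then after R repetitions the hit rate is (N·R − M)/(N·R), which tends to 1 as R grows.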
L1 and L2 cache accesses.
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:
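A minimal sketch of such a triple loop over column-major arrays (the function name is mine; the lab's actual code may differ in details):

```c
/* Naive n-by-n matrix multiply, C = A * B, with all three matrices stored
 * as one-dimensional column-major arrays: element (i, j) lives at i + j*n. */
void matmul_ijk(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```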
In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…
REMEMBER:
To compute the matrix multiplication correctly, the loop order doesn’t matter.
BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.
This will run some matrix multiplications according to the six different implementations in the file, and it will tell you the speed at which each implementation executed the operation. The unit “Gflops/s” reads, “Giga-floating-point-operations per second.”
THE BIGGER THE NUMBER THE FASTER IT IS RUNNING!
Modifications for multMat1.c
- n = 2, i.e. a 2-by-2 matrix.
- Change to a static call so that we can observe the data state in stack memory in Ripes.
- Change the data type from float to int.
- Use constants as the initial values (a sketch of the result follows this list).
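A hedged sketch of what the modified file might look like after these changes (the exact original multMat1.c differs; the constants and the return-value check here are illustrative):

```c
/* multMat1-style multiply after the modifications above: n = 2,
 * int instead of float, constant initial values, and a static call
 * so the arrays sit in a stack frame Ripes can easily inspect. */
#define N 2

static void multMat1(int n, const int *A, const int *B, int *C) {
    /* ijk order, column-major: element (i, j) is at i + j*n */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}

int main(void) {
    int A[N * N] = {1, 1, 1, 1};   /* constant initial values */
    int B[N * N] = {2, 2, 2, 2};
    int C[N * N] = {0};
    multMat1(N, A, B, C);
    return C[0];                   /* 1*2 + 1*2 = 4 for these inputs */
}
```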
Modify the assembly code generated by Compiler Explorer
We can make a few modifications.
Check the result
Setup (sp = 0x7ffffff0)
The Result in C.
Memory in Ripes.
The access order (small case, n = 2)
Result in Ripes (n = 50)
Hit rate is 0.9174
Record your results in answers.txt
Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
Ans 1. jki (multMat4)
Ans 2.
The CPU mostly accesses adjacent elements within a cache block, because the innermost stride is 1.
n = 2
n = 50, Hit rate = 0.966
Which ordering(s) perform the worst? Why?
Ans 1. ikj (multMat2) and kij (multMat5)
Ans 2.
The CPU's accesses often suffer cache misses, because the stride is n and a cache block is not big enough to hold array elements that far apart.
ikj (multMat2)
n = 50, Hit rate = 0.8692
kij (multMat5)
n = 50, Hit rate = 0.8702
How does the way we stride through the matrices with respect to the innermost loop affect performance?
Ans.
A stride-1 innermost loop walks consecutively through a cache block and gives strong spatial locality. When the innermost loop instead strides by n (as with an index that appears in a j·n term), consecutive accesses land in different blocks, so spatial locality is weak once n is large and performance drops.
Matrix Transposition
Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a “transposition” and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as AT.
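For concreteness, a minimal sketch of a naive transpose over the same column-major layout (the function name is mine):

```c
/* Naive out-of-place transpose: dst = A^T for an n-by-n matrix
 * stored column-major, so element (i, j) lives at i + j*n. */
void transpose_naive(int n, const int *src, int *dst) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j + i * n] = src[i + j * n];
}
```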
Cache Blocking
In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.
Things to note: In the above image, we transpose each submatrix Aij of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to transposing the entire matrix.
Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix, which increases the degree of temporal (and spatial) locality that we exhibit, which makes our cache performance better, which makes our program run faster.
This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
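One possible shape of such a scheme, as a hedged sketch (the lab's required signature may differ; B is the blocksize):

```c
/* Cache-blocked transpose: finish one B-by-B submatrix at a time so the
 * working set stays small enough to remain in cache while we touch it. */
void transpose_blocking(int n, int B, const int *src, int *dst) {
    for (int bi = 0; bi < n; bi += B)
        for (int bj = 0; bj < n; bj += B)
            /* transpose the submatrix whose top-left corner is (bi, bj);
             * the && i < n / && j < n guards handle n not divisible by B */
            for (int i = bi; i < bi + B && i < n; i++)
                for (int j = bj; j < bj + B && j < n; j++)
                    dst[j + i * n] = src[i + j * n];
}
```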
Transpose_blocking
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
naive transpose.

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.