Assignment4: Cache
tags:
riscv
Cache.s
A brief description of cache.s.
accessWords
`s0` holds the start address of the array, and `s1` holds the end address of the array. `t1` holds how many bytes we need to shift by each step (\(4 \times a1\)).
wordLoop
Check whether `a3` equals `0`; if it does, jump to `wordZero`.
Otherwise, load the data at `s0` (the pointer to the current position in the array) into `t0`, add `1`, and store it back. (`array[index] = array[index] + 1`)
Jump to `wordCheck`.
wordCheck
Advance the pointer (`s0`) to the next position with `add s0, s0, t1`.
If `s0 < s1` (`blt`), jump back to `wordLoop` and keep executing: we are still inside the inner loop and have not yet accessed all the array data.
`a2` holds the repeat count for this program, so decrement `a2` by `1` and check whether it equals `0`.
wordZero
When `a3` equals `0`, set `array[index]` to `0`.
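The loop structure above can be sketched in C (a minimal sketch with assumed parameter names; in cache.s these values arrive in the `a` registers):

```c
#include <stddef.h>

/* C equivalent of accessWords (a sketch): option 0 zeroes each touched
 * element (the wordZero path); option 1 increments it (lw/addi/sw). */
static void accessWords(int *array, int sizeWords, int stepWords,
                        int repCount, int option)
{
    for (int r = 0; r < repCount; r++)                   /* outer repeat loop */
        for (int i = 0; i < sizeWords; i += stepWords) { /* wordLoop + wordCheck */
            if (option == 0)
                array[i] = 0;                            /* wordZero */
            else
                array[i] = array[i] + 1;                 /* array[index] += 1 */
        }
}
```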
Observation in Ripes
The program parameters setup
Observation for Memory

The array's start pointer is address `0x10000000`, and the end pointer is address `0x10000008`. The total array size is 8 bytes (2 words). With a step size of `1`, each iteration accesses the next word of data.
Another case
The step size is `2` words, so the program would access addresses `0x10000000`, `0x10000008`, and `0x10000010`; but `0x10000010` is beyond the limit address `0x1000000c`, so no data is written at that address.
And the program repeats `3` times, so `0x10000000` and `0x10000008` end up holding the value `3`.
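The bounds behavior can be checked with a small C sketch (a hypothetical helper; only the byte offsets from the array base matter, so the base address is omitted):

```c
#include <stdint.h>

/* Which word indices does the loop touch? A sketch of the blt bounds
 * check: with a 12-byte array (limit 0x1000000c) and a step of 2 words,
 * offsets 0 and 8 are touched and offset 16 (0x10000010) is skipped. */
static int touchedWordIndices(uint32_t sizeBytes, uint32_t stepWords,
                              uint32_t out[], int max)
{
    int n = 0;
    for (uint32_t off = 0; off < sizeBytes && n < max; off += 4 * stepWords)
        out[n++] = off / 4;  /* record the word index that gets written */
    return n;
}
```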
Review - Hit and Miss Policies
Hit
Essentially, there is no such thing as a write hit in a write-around cache; a write "hit" does the same thing as a write miss.
Miss
Replacement policies
The Venus cache simulator currently simulates a write-through, write-allocate cache.
So we need to choose the same options in Ripes for our experiment.
Exercise 1 - A Couple of Memory Access Scenarios
Task:
Simulate the following scenarios and record the final cache hit rates with the program in cache.s.
Try to reason out what the hit rate will be BEFORE running the code.
THE FOLLOWING are good questions to ask yourself as you do these exercises (not checkoff questions):
Scenario 1
Program Parameters: (set these by initializing the a registers in the code)
Cache Parameters: (set these in the Cache tab)
Here we modify the program parameters for cache.s in Ripes.
Confirm that the Data bits equal exactly 32 bytes (256 bits).
The result after executing the simulator.
Tasks
`0x10000060 = 0b0001_0000_0000_0000_0000_0000_0110_0000`
breaks down to
`(Tag) 000_1000_0000_0000_0000_0000_0011 | (Index) 00 | (Offset) 000`, i.e. tag `0x00800003`.
The index is `0` for M[0], M[8], M[16], and M[24], which means a collision happens on every access.
Access M[0]: miss
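The TIO breakdown above can be expressed as a C sketch (assuming this scenario's cache geometry: 8-byte blocks give 3 offset bits, and 4 direct-mapped blocks give 2 index bits):

```c
#include <stdint.h>

enum { OFFSET_BITS = 3, INDEX_BITS = 2 }; /* 8 B blocks, 4 blocks */

/* Tag/Index/Offset split of a byte address for this cache. */
static uint32_t cacheOffset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static uint32_t cacheIndex(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t cacheTag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

All four accessed addresses (word indices 0, 8, 16, 24) share index 0 but carry different tags, which is why every access collides.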
The program repeats `4` times, so keeping the array values in the cache as long as possible is a good strategy.
What is our hit rate if we increase Rep Count arbitrarily? Why?
Ans.
This combination of parameters does not produce any hits. No matter how much we increase the rep count, the hit rate will not improve; it only increases the number of misses, because every access is a cache miss.
But if the parameters allowed even one hit, increasing the rep count would produce more hits.
How could we modify one program parameter to increase our hit rate?
Ans.
We can change the step size to `1` to increase the hit rate, because then a cache miss brings two array values (an even and an odd index) into the cache at once. That is why a cache hit always occurs when we access the value at the odd index.
Therefore, we get a hit rate of 0.5 after the program finishes.
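To sanity-check these answers, here is a tiny direct-mapped cache model in C (a sketch under the setup assumed from the access pattern above: write-through, write-allocate, 4 blocks of 8 bytes, a 32-word array, and one access per element since the option is 0):

```c
#include <stdint.h>
#include <string.h>

/* Direct-mapped, write-through, write-allocate model: a miss always
 * allocates, so reads and writes count hits the same way here. */
static double hitRate(uint32_t sizeWords, uint32_t stepWords, uint32_t repCount)
{
    enum { BLOCKS = 4, OFFSET_BITS = 3, INDEX_BITS = 2 };
    struct { int valid; uint32_t tag; } line[BLOCKS];
    memset(line, 0, sizeof line);

    uint32_t hits = 0, accesses = 0;
    for (uint32_t r = 0; r < repCount; r++)
        for (uint32_t w = 0; w < sizeWords; w += stepWords) {
            uint32_t addr = 4 * w; /* byte address relative to array start */
            uint32_t idx  = (addr >> OFFSET_BITS) & (BLOCKS - 1);
            uint32_t tag  = addr >> (OFFSET_BITS + INDEX_BITS);
            accesses++;
            if (line[idx].valid && line[idx].tag == tag)
                hits++;
            else { line[idx].valid = 1; line[idx].tag = tag; }
        }
    return (double)hits / accesses;
}
```

With these assumptions, a step size of 8 gives a hit rate of 0 no matter the rep count, while a step size of 1 gives 0.5, matching the answers above.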
Scenario 2
Program Parameters: (set these by initializing the a registers in the code)
Cache Parameters: (set these in the Cache tab)
Tasks
M[0] = 0x10000000, M[2] = 0x10000008, M[4] = 0x10000010, M[6] = 0x10000018, M[8] = 0x10000020, M[10], …, M[60], M[62]
TIO Breakdown:
Note: the option is `1`, so the CPU accesses the cache twice per index! (One access reads the data; the other writes it back.)
e.g.
Access M[0]: read miss, and the CPU will access main memory.
Access M[4]: read miss
…
Access M[16]: read miss
…
Access M[60]: read miss
Access M[60]: write hit
Access M[62]: read hit
Access M[62]: write hit
What is the repeating hit/miss pattern? Why?
Because the block size is 16 bytes (4 words), each cache slot (line) holds 4 array values.
Therefore, when the CPU takes a cache miss, the block it loads also contains the value needed on the next iteration, since the step size is `2` words.
And because the option is `1`, the CPU accesses the cache twice per index: index \(4n\) takes a cache read miss then a cache write hit, while index \(4n+2\) takes a cache read hit then a cache write hit.
e.g.
Access M[0]: read miss, and the CPU accesses main memory.
Access M[0]: write hit, and the CPU still accesses main memory (write-through).
Access M[2]: read hit, and the CPU does not access main memory.
Access M[2]: write hit, and the CPU still accesses main memory (write-through).
Access M[4]: read miss
… and it repeats every 4 accesses.
This should follow very straightforwardly from the above question: Explain the hit rate in terms of the hit/miss pattern.
Ans.
There are 4 accesses per slot: 1 cache miss, caused by reading the index \(4n\) data, and 3 cache hits.
So \(hit \space rate = 3/4 = 0.75\)
Keeping everything else the same, what happens to our hit rate as Rep Count goes to infinity? Why? Try it out by changing the appropriate program parameter and letting the code run!
The hit rate will approach `1`.
Scenario 3
L1 and L2 cache access.
Exercise 2 - Loop Ordering and Matrix Multiplication
If you recall, matrices are 2-dimensional data structures wherein each data element is accessed via two indices. To multiply two matrices, we can simply use 3 nested loops, assuming that matrices A, B, and C are all n-by-n and stored in one-dimensional column-major arrays:
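The loop nest being described looks like this (a sketch; the function and parameter names are assumptions, not taken verbatim from the lab files):

```c
/* C = C + A * B for n-by-n matrices stored as 1-D column-major arrays:
 * element (r, c) of a matrix lives at index r + c * n.
 * Loop order is i, j, k. */
void matrixMultiply(int n, double *A, double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```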
In the above code, note that the loops are ordered i, j, k. If we examine the innermost loop (the one that increments k), we see that it…
REMEMBER:
To compute the matrix multiplication correctly, the loop order doesn’t matter.
BUT, the order in which we choose to access the elements of the matrices can have a large impact on performance. Caches perform better (more cache hits, fewer cache misses) when memory accesses take advantage of spatial and temporal locality, utilizing blocks already contained within our cache. Optimizing a program’s memory access patterns is essential to obtaining good performance from the memory hierarchy.
Observation for matrixMultiply.c
matrixMultiply.c
Modification for
multMat1.c
Modify the assembly code generated by Compiler Explorer
Assembly Code that can execute in Ripes.
We can make some modifications.
Check the result

Setup (sp=0x7fffffff0)
The Result in C.
Memory in Ripes.
The access order (small case, n = 2)

Result in Ripes (n = 50)
Task
Record your results in answers.txt
Which ordering(s) perform best for these 1000-by-1000 matrices? Why?
Ans 1. jki (multMat4)
Ans 2.
The CPU often accesses adjacent values that are already in the cache block, because the stride is 1.
n = 2,

n = 50, Hit rate = 0.966

Which ordering(s) perform the worst? Why?
Ans 1. ikj (multMat2) and kij (multMat5)
Ans 2.
The CPU's accesses often suffer cache misses because of the stride of n: when n is large, a cache block is not big enough to hold array elements that far apart.
ikj (multMat2)

n = 50, Hit rate = 0.8692

kij (multMat5)

n = 50, Hit rate = 0.8702

How does the way we stride through the matrices with respect to the innermost loop affect performance?
Ans.
The innermost loop strides through memory by n elements (through the j*n index term), so spatial locality is weak when n is large.
Exercise 3 - Cache Blocking and Matrix Transposition
Matrix Transposition
Sometimes, we wish to swap the rows and columns of a matrix. This operation is called a "transposition" and an efficient implementation can be quite helpful while performing more complicated linear algebra operations. The transpose of matrix A is often denoted as A^T^.
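As a reference point for the next section, a naive transpose over column-major n-by-n matrices can be sketched as (names assumed):

```c
/* dst = transpose(src): element (i, j) of src becomes element (j, i)
 * of dst. Column-major: (r, c) lives at index r + c * n. */
void transpose_naive(int n, int *dst, int *src)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            dst[j + i * n] = src[i + j * n];
}
```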
Cache Blocking
In the above code for matrix multiplication, note that we are striding across the entire A and B matrices to compute a single value of C. As such, we are constantly accessing new values from memory and obtain very little reuse of cached data! We can improve the amount of data reuse in the caches by implementing a technique called cache blocking. More formally, cache blocking is a technique that attempts to reduce the cache miss rate by further improving the temporal and/or spatial locality of memory accesses. In the case of matrix transposition we consider performing the transposition one block at a time.
Things to note: In the above image, we transpose each submatrix Aij of matrix A into its final location in the output matrix, one submatrix at a time. It is important to note that transposing each individual subsection is equivalent to transposing the entire matrix.
Since we operate on and finish transposing each submatrix one at a time, we consolidate our memory accesses to that smaller chunk of memory when transposing that particular submatrix, which increases the degree of temporal (and spatial) locality that we exhibit, which makes our cache performance better, which makes our program run faster.
This (if implemented correctly) will result in a substantial improvement in performance. For this lab, you will implement a cache blocking scheme for matrix transposition and analyze its performance.
Transpose_blocking
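A cache-blocked transpose can be sketched as follows (the signature is an assumption, not necessarily the lab's exact one; the extra `&& j < n` / `&& i < n` guards handle n not divisible by blocksize):

```c
/* Transpose one blocksize-by-blocksize submatrix at a time so that all
 * accesses stay inside a small chunk of memory (better locality). */
void transpose_blocking(int n, int blocksize, int *dst, int *src)
{
    for (int jb = 0; jb < n; jb += blocksize)         /* block column */
        for (int ib = 0; ib < n; ib += blocksize)     /* block row */
            for (int j = jb; j < jb + blocksize && j < n; j++)
                for (int i = ib; i < ib + blocksize && i < n; i++)
                    dst[j + i * n] = src[i + j * n];
}
```

The outer two loops pick a submatrix; the inner two loops transpose it in place in the output, exactly as the submatrix picture above describes.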
Part 1 - Changing Array Sizes
Task
Fix the blocksize to be 20, and run your code with n equal to 100, 1000, 2000, 5000, and 10000.
Ans.
n = 1000
Ans.
To ensure spatial locality.
Part 2 - Changing Blocksize
Task
Fix n to be 10000, and run your code with blocksize equal to 50, 100, 500, 1000, 5000.
Ans.
The performance decreases as the blocksize increases.
Because the cache has to replace blocks frequently when the blocksize is too big, spatial locality becomes weak, and in the worst case performance can degrade to that of the naive transpose.

Notice that in neither of the last two exercises did we actually know the cache parameters of our machine. We just made the code exhibit a higher degree of locality, and this magically made things go faster! This tells us that caches, regardless of their specific parameters, will always benefit from operating on code which exhibits a high degree of locality.