or
or
By clicking below, you agree to our terms of service.
New to HackMD? Sign up
Syntax | Example | Reference | |
---|---|---|---|
# Header | Header | 基本排版 | |
- Unordered List |
|
||
1. Ordered List |
|
||
- [ ] Todo List |
|
||
> Blockquote | Blockquote |
||
**Bold font** | Bold font | ||
*Italics font* | Italics font | ||
~~Strikethrough~~ | |||
19^th^ | 19th | ||
H~2~O | H2O | ||
++Inserted text++ | Inserted text | ||
==Marked text== | Marked text | ||
[link text](https:// "title") | Link | ||
 | Image | ||
`Code` | Code |
在筆記中貼入程式碼 | |
```javascript var i = 0; ``` |
|
||
:smile: | ![]() |
Emoji list | |
{%youtube youtube_id %} | Externals | ||
$L^aT_eX$ | LaTeX | ||
:::info This is a alert area. ::: |
This is a alert area. |
On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?
Please give us some advice and help us improve HackMD.
Do you want to remove this version name and description?
Syncing
xxxxxxxxxx
Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <
kdnvt
>main
is the straightforward version,unrolling
use the technique of loop unrolling to optimize the code.Palindrome Linked List (LeetCode 234)
Solution
Idea for problem solving
Given the condition of \(O(1)\) space, I think the modification of original linked list is inevitable. My basic idea is to reverse the linked list node one by one until middle node, and compare the first half list, which has already been reversed, with the second half list. If two linked list are the same, the original linked list is palindrome linked list.
C program
First version
Because getting the length of linked list directly is impossible, my first intuition is to traverse the whole linked list, get the length of it, and calculate how many nodes should I reverse.
After that, I can reverse the corresponding number of nodes, and compare the lists.
Second version
After the first version, I was thinking of whether there is a better way to solve this problem. It occurs to me that one of solutions to remove middle node of linked list uses the fast-slow pointers technique.
Third version
Same thought as second version, but a more concise one.
Fourth version
By simplifying the comparing while loop, it could be much more consice.
Assembly
There are 5 test data in this program:
list
: Palindrome list with even number length.list2
: Non-palindrome list with even number length.list3
: Palindrome list with odd number length.list4
: Non-palindrome list with odd number length.list5
: Single node list.(palindrome)experiment results
The 5-stage in-order processor with hazard detection/elimination and forwarding gets correct results on each of test data. However, when using the processor without hazard detection/elimination, the simulator says "Unknown system call in register 'a7':0 ". It obvious that there exists hazard around the "ecall" instruction. Compare the pipeline diagram of two version :
With hazard detection/elimination:
Without hazard detection/elimination:
Finally, I add three nop operations before ecall, so the processor without hazard detection can use the ecall. (two nop aren't enough)
After I fixed the ecall, nevertheless, the processor without hazard detection still got the wrong result. It seems like there are other hazards.
Potential data hazard
When comparing two pipeline diagrams, I can find many hazards which are eliminated by the standard processor.
Standard:
without hazard detection:
As we can see, there is a RAW(Read After Write) data hazard on x5 register.
This diagram is the state before stalling.
beq x5 x0 104 <true>
is in the ID stage, whilelw x5 4 x10
is in the EXE stage.This diagram presents the state when stalling happens. The
lw
instruction read thex5
register, the value is 0x10000008 on the diagram, which is different from the previous value 0x10000018beq
just read fromx5
.This diagram show the next state.
In this state,
lw
is in the WB stage, which means it will write back the new value to registerx5
. As we can see, theWr_En
signal in Registers file is green, which means it is enabled. Aside from writing back, it is also "forwarding" the new value tobeq
in EXE stage. The diagram below shows the forwarding path.beq
in EXE stage compares the two value from registersx5
,x0
, and decides whether branches or not. Two diagrams below shows the data path of register value. The first diagram shows the data path ofx5
from forwardinglw
instruction, while the second diagram shows the data path ofx0
.beq
then compares the two value at the Branch unit which is shown at the bottom part of diagram. In this example,x5
is 0x10000008 andx0
is 0x00000000, so the branch isn't taken. As we can see, theBranch_taken
signal is red.This diagram shows
x5
data path.This diagram shows
x0
data path.This diagram shows another operation
beq
do in EXE stage, calculating the branch address. The upper value 0x00000050 is the instruction address ofbeq
, while the below value 0x00000068 is the immediate field specified in instruction, which is the same to 104(decimal) in thebeq x5 x0 104
.beq
then adds the two value and generate the target address 0x000000b8. In this example, the branch isn't taken.Assembly optimization
Use temporary registers
Because
isPalindrome
doesn't call any other function, it is feasible to replace all callee-saved register with caller-save register, so we don't need to push them on the stack everytime the function being called.Avoid redundant branch
In the comparing while loop of
isPalindrome
, there are several different branch. By referencing the fourth edition of my C program, I obviate the redundant branches and make it more concise.Loop unrolling
I think this one is interesting, because it doesn't reduce the number of instructions in the instruction memory. On the contrary, it increases the number. However, except for the disadvantage on the code size and readability, it may actually improve the performance of the program by reducing branch and utilizing pipelining.
The reason I want to use loop unrolling on this program is that Ripes always takes next instruction as predicted one, which means the loop will take the wrong branch most of the time. Therefore, it is a good opportunity to use loop unrolling, because it reduce the number of branch.
In order to demonstrate the impact of loop unrolling, I have to build a long linked list. I introduce some new function shown as below:
Ripes doesn't support
sbrk
system call, so I usebrk
to implement a similar function. The functiongenerate
uses sbrk to require a block of memory, and use that to generate an arbitrary length linked list. The length is assigned from the first parameter passed in this function in registera0
.I plan to experiment three cases:
All loops includes loop1, loop2, gen_loop.
Takes loop2 for example.
Unrolling 2 times:
Unrolling 4 times:
I try to pass a linked list with 100 nodes to
isPalindrome
, and compare the difference between cases. Here's the results:Without unrolling:
Unrolling two times:
Unrolling four times:
As we can see, for a linked list with 100 nodes, the number of cycles drops from 2098 to 1680 with four times unrolling.
Obviate data hazards by reordering instructions
There are several data hazards in this program, which leads to stalling of pipeline and reduces the performance. Here, I tried to reorder instructions and avoid the hazards. I didn't deal with all of hazards but only those in the loops, because the process runs loops most of the time.
This figure below shows the pipeline diagram of loop2 without reordering:
This figure below shows the pipeline diagram of loop2 with reordering:
As we can see, the one stalling in each loop is eliminated.
I combine instructions reordering with loop unrolling, and measures the performance on linked list with 100 nodes.
Without unrolling:

Unrolling with two times:

Unrolling with four times:

Eliminate redundant branch instructions by strip mining like technique
There are some updates in the program. I remove the
gen_loop
ingenerate
function, and uses a new functionconnect
instead. The function connects the nodes in a continuous memory block. The new loopcon_loop
inconnect
have nearly the same effect as original gen_loop. For the following examples, I will use con_loop instead of gen_loop.To see the newest version of code, please view the following link: https://github.com/kdnvt/Palindrome-linked-list
It contains two different branch:
main
andunrolling
. The main is the standard version, whileunrolling
use the loop unrolling technique.There are some branch instrucions in unrolling con_loop:
When I unroll the loop four times, the number of
bge
instruction increase correspondingly. I think these instrucions is redundant and want to eliminate increased instructions and make each iteration body have only onebge
instruction.The solution I want to use is similar to strip mining in compilers for vector processors, which splits original loop into two new loops. One runs iterations without unrolling, while another runs iterations with unrolling.
Let n be number of iterations, and k be the times of loop unrolling. First execute n%k iterations in original loop, and then execute remaining iterations in k unrolling loop. Process can determine whether to take the branch or not on the head of loop body, and obviate the checking for the rest of body, since the the remaining iterations must be multiple of k.
The optimized code is shown as below:
I measured the number of cycles and instructions retired with linked list with 500 nodes.
Without unrolling:

With unrolling four times:

With unrolling four times and strip mining on con_loop:

It is interesting that unrolling with strip mining got fewer cycles and instrucions retired but a higher CPI than unrolling without strip mining, which is a good example to demonstrate that higher CPI doesn't always means better performance.
After optimize the
con_loop
, I try to use the same technique to optimize loop1 and loop2. Unfortunately, since we don't know the exact length of linked list at the beginning, it is infeasible to apply strip mining on loop1. However, if we count the number of iterations of loop1, we can optimize the loop2.Nonetheless, the effect of this optimization is not very well, because we need to add extra instrucions in loop1 and loop2 to count. Theoretically, the effect would be more obvious with much longer list or much larger number of unrolling.
The optimized code is shown as below:
This optimization with 500 nodes list is shown as below:
