contributed by < dhiptmc
>
Problem
Problem was chosen from the work done by 王昱承 in Assignment 1.
Given an array nums. We define a running sum of an array as runningSum[i] = sum(nums[0]…nums[i]).
Return the running sum of nums.
- Motivation:
In the last assignment, I spent lots of time dealing with arrays.
I would like to examine whether I can make use of the experiences learnt from previous attempts.
Example 1:
Example 2:
Example 3:
Solution
The solution is based on the work done by 王昱承.
Initially, pass the first element, i.e., nums[0], to result[0]. Add the result[i-1] element with nums[i], then pass the value to the result.
Return the result array eventually.
Implementation
C code
Assembly code
.data
nums1: .word 1,2,3,4,5
nums2: .word 1,3,5,7,9,11
nums3: .word 0,2,4,6,8,10,12
res1: .word 0,0,0,0,0,
res2: .word 0,0,0,0,0,0
res3: .word 0,0,0,0,0,0,0
space: .string " "
nl: .string "\n"
.text
main:
la t2, nums1
la t3, res1
li t4, 5
jal ra, runningSum
la t2, nums2
la t3, res2
li t4, 6
jal ra, runningSum
la t2, nums3
la t3, res3
li t4, 7
jal ra, runningSum
j exit
runningSum:
li t5, 1
lw a3, 0(t2)
sw a3, 0(t3)
end.L1:
slli a0, t5, 2
add a1, t3, a0
lw a3, -4(a1)
add a2, t2, a0
lw a2, 0(a2)
add a3, a3, a2
sw a3, 0(a1)
addi t5, t5, 1
blt t5, t4, end.L1
li t5, 0
print:
slli a1, t5, 2
add a1, t3, a1
lw a0, 0(a1)
li a7, 1
ecall
la a0, space
li a7, 4
ecall
addi t5, t5, 1
blt t5, t4, print
la a0, nl
li a7, 4
ecall
jr ra
exit:
li a7, 10
ecall
There are possible ways to use fewer instructions.
Modified version of Assembly code is shown below.
.data
nums1: .word 1,2,3,4,5
nums2: .word 1,3,5,7,9,11
nums3: .word 0,2,4,6,8,10,12
res1: .word 0,0,0,0,0,
res2: .word 0,0,0,0,0,0
res3: .word 0,0,0,0,0,0,0
space: .string " "
nl: .string "\n"
.text
main:
la t2, nums1
la t3, res1
li t4, 5
jal ra, runningSum
la t2, nums2
la t3, res2
li t4, 6
jal ra, runningSum
la t2, nums3
la t3, res3
li t4, 7
jal ra, runningSum
j exit
runningSum:
add t6, t3, zero
li t5, 1
lw a3, 0(t2)
sw a3, 0(t6)
loop:
addi t2, t2, 4
lw a2, 0(t2)
add a3, a3, a2
addi t6, t6, 4
addi t5, t5, 1
sw a3, 0(t6)
blt t5, t4, loop
print:
lw a0, 0(t3)
li a7, 1
ecall
la a0, space
li a7, 4
ecall
addi t3, t3, 4
addi t5, t5, -1
blt zero, t5, print
la a0, nl
li a7, 4
ecall
jr ra
exit:
li a7, 10
ecall
As a result, clock cycle as the figure shown below decreases compared to the original work.
Origin:

After modified:

Environment setup
Following the instructions of Lab2: RISC-V RV32I[MACF] emulator with ELF support,
however, there is an error occured:
Credit to TA(yucheng871011@gmail.com) who help solving this problem for me. That is, some tools should be installed first.
Instructions
- Execute and Statistics of execution
Compare Assembly code
In this section, C code for the problem is compiled with different compiler optimization level (-Os to -Ofast).
The assembly code is lengthy and hard to read. We focus on the main section of the code to better demonstrate the effect of compiler optimization.
Handwritten Assembly
Observation
- LOC(Line of code): 13
- Register used:
- tx registers: t2, t3, t4, t5, t6
- ax registers: a2, a3
- sx registers: none
- others: none
- Stack used(bytes): 0 bytes
- Number of lw and sw: lw=2 sw=2
- Number of branch/jump instructions: 1
-Os
Observation
- LOC(Line of code): 29
- Register used:
- tx registers: none
- ax registers: a0, a1, a2, a3, a4, a5
- sx registers: s0, s1
- others: sp, ra
- Stack used(bytes): 16 bytes
- Number of lw and sw: lw=6 sw=5
- Number of branch/jump instructions: 5
- ELF file header:
- Size:
- Execute and Statistics of execution
-O0
00010184 <runningSum>:
10184: fd010113 addi sp,sp,-48
10188: 02112623 sw ra,44(sp)
1018c: 02812423 sw s0,40(sp)
10190: 03010413 addi s0,sp,48
10194: fca42e23 sw a0,-36(s0)
10198: fcb42c23 sw a1,-40(s0)
1019c: fd842783 lw a5,-40(s0)
101a0: 00279793 slli a5,a5,0x2
101a4: 00078513 mv a0,a5
101a8: 3a8000ef jal ra,10550 <malloc>
101ac: 00050793 mv a5,a0
101b0: fef42423 sw a5,-24(s0)
101b4: fe042623 sw zero,-20(s0)
101b8: 0780006f j 10230 <runningSum+0xac>
101bc: fec42783 lw a5,-20(s0)
101c0: 00079c63 bnez a5,101d8 <runningSum+0x54>
101c4: fdc42783 lw a5,-36(s0)
101c8: 0007a703 lw a4,0(a5)
101cc: fe842783 lw a5,-24(s0)
101d0: 00e7a023 sw a4,0(a5)
101d4: 0500006f j 10224 <runningSum+0xa0>
101d8: fec42703 lw a4,-20(s0)
101dc: 400007b7 lui a5,0x40000
101e0: fff78793 addi a5,a5,-1 # 3fffffff <__BSS_END__+0x3ffdc1cf>
101e4: 00f707b3 add a5,a4,a5
101e8: 00279793 slli a5,a5,0x2
101ec: fe842703 lw a4,-24(s0)
101f0: 00f707b3 add a5,a4,a5
101f4: 0007a683 lw a3,0(a5)
101f8: fec42783 lw a5,-20(s0)
101fc: 00279793 slli a5,a5,0x2
10200: fdc42703 lw a4,-36(s0)
10204: 00f707b3 add a5,a4,a5
10208: 0007a703 lw a4,0(a5)
1020c: fec42783 lw a5,-20(s0)
10210: 00279793 slli a5,a5,0x2
10214: fe842603 lw a2,-24(s0)
10218: 00f607b3 add a5,a2,a5
1021c: 00e68733 add a4,a3,a4
10220: 00e7a023 sw a4,0(a5)
10224: fec42783 lw a5,-20(s0)
10228: 00178793 addi a5,a5,1
1022c: fef42623 sw a5,-20(s0)
10230: fec42703 lw a4,-20(s0)
10234: fd842783 lw a5,-40(s0)
10238: f8f742e3 blt a4,a5,101bc <runningSum+0x38>
1023c: fe842783 lw a5,-24(s0)
10240: 00078513 mv a0,a5
10244: 02c12083 lw ra,44(sp)
10248: 02812403 lw s0,40(sp)
1024c: 03010113 addi sp,sp,48
10250: 00008067 ret
Observation
- LOC(Line of code): 53
- Register used:
- tx registers: none
- ax registers: a0, a1, a2, a3, a4, a5
- sx registers: s0
- others: sp, ra, zero
- Stack used(bytes): 48 bytes
- Number of lw and sw: lw=19 sw=9
- Number of branch/jump instructions: 6
- ELF file header:
- Size:
- Execute and Statistics of execution
-O1
Observation
- LOC(Line of code): 31
- Register used:
- tx registers: none
- ax registers: a0, a1, a2, a3, a4, a5
- sx registers: s0, s1
- others: sp, ra
- Stack used(bytes): 16 bytes
- Number of lw and sw: lw=6 sw=5
- Number of branch/jump instructions: 7
- ELF file header:
- Size:
- Execute and Statistics of execution
-O2
Observation
- LOC(Line of code): 29
- Register used:
- tx registers: none
- ax registers: a0, a1, a2, a3, a4, a5, a6
- sx registers: s0, s1
- others: sp, ra
- Stack used(bytes): 16 bytes
- Number of lw and sw: lw=6 sw=5
- Number of branch/jump instructions: 5
- ELF file header:
- Size:
- Execute and Statistics of execution
-Ofast
Observation
- LOC(Line of code): 30
- Register used:
- tx registers: none
- ax registers: a0, a1, a2, a3, a4, a5, a6
- sx registers: s0, s1
- others: sp, ra
- Stack used(bytes): 16 bytes
- Number of lw and sw: lw=6 sw=5
- Number of branch/jump instructions: 5
- ELF file header:
- Size:
- Execute and Statistics of execution
Summary: Observed Compiler Behavior
- From
-O0
to -O1
- Additional usage of s1 which significantly decreases lw/sw count and stack space.
- From
-O1
to -O2
- Unroll the loop to avoid branch misprediction.
- From
-O2
to -Ofast
|
-Os |
-O0 |
-O1 |
-O2 |
-Ofast |
Line of Code |
29 |
53 |
31 |
29 |
30 |
Register used |
10 |
10 |
10 |
11 |
11 |
Stack used (bytes) |
16 |
48 |
16 |
16 |
16 |
lw/sw count |
11 |
28 |
11 |
11 |
11 |
branch/jump count |
5 |
6 |
7 |
5 |
5 |
CSR cycle count |
16503 |
16889 |
16406 |
16401 |
16395 |