Computer Architecture HW2

--- tags : Computer_Architecture, csapp, cs61c, jserv, RISC-V --- # Computer Architecture HW2 contributed by <[`WangHanChi`](https://github.com/WangHanChi)> ## Question Selection ### Question I chose the question from [鄭至崴 Move Zeroes](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/rk91z0LWi) and the leetcode is [283. Move Zeroes](https://leetcode.com/problems/move-zeroes/) ### Motivation The reason which I chose is that there are many for loop in the question.I think it will be a great practice for me. ## Rewrite Code Because his code does not include testing data, I rewirten the code as following.And I have already checked the output is correct, so I comment the printf function because of it's assembly code will be very large. I also add the volatile to ensure the `Movezeroes` function will be executed. ### Modefied C code ```c #include <stdio.h> #define numsSize0 5 int moveZeroes(int *nums, int numsSize) { volatile int i, j; // force compiler enter the loop for (i = 0, j = 0; i < numsSize; i++) { if (*(nums + i)) { *(nums + j) = *(nums + i); if (i != j) { *(nums + i) = 0; } j++; } } return i; // force compiler enter the loop } int main(int argc, char *argv[]) { int nums0[5] = {1, 0, 0, 2, 5}; // printf("Before move zeroes\n"); for (int i = 0; i < numsSize0; i++) { // printf("%d ", nums0[i]); } // printf("\n"); moveZeroes(nums0, numsSize0); // printf("After move zeroes\n"); for (int i = 0; i < numsSize0; i++) { // printf("%d ", nums0[i]); } // printf("\n"); // printf("\n"); return 0; } ``` ### His assembly code ::: spoiler assembly code ``` .data nums1: .word 1 .word 0 .word 0 .word 2 .word 5 nums2: .word 1 .word 9 .word 0 .word 6 .word 7 nums3: .word 0 .word 0 .word 0 .word 0 .word 4 .text main: la s0 nums1 # load nums base address to s0 addi s1 x0 5 #nums_size = 5 #call function moveZeroes add a0 s0 x0 # a0 which stands for address of nums array is the first argument add a1 s1 x0 # a1 which stands for array size of nums is the second argument jal ra moveZeroes #jump to moveZeroes function #PrintArray jal ra printArray li a7 10 ecall moveZeroes: addi sp sp -12 sw ra 0(sp) sw s0 4(sp) sw s1 8(sp) li s1 0 #next non zero index = 0 addi s2 x0 0 # i = 0 loop: bge s2 a1 exit #i >= array_size exit slli t0 s2 2 #i * 4 add t0 a0 t0 #array + i*4 lw t1 0(t0) #t1 = array[i] beq t1 x0 next_iter slli t2 s1 2 #next_nonzero_index * 4 add t2 t2 a0 #array + next_nonzero_index * 4 sw t1 0(t2) #array[next_nonzero_index] = array[i] beq s1 s2 next_iter_addIndex #if(next_nonzero_index != i) sw x0 0(t0) #store 0 to array[i] next_iter_addIndex: addi s1 s1 1 #next_nonzero_index++ next_iter: addi s2 s2 1 #i++ j loop exit: lw ra 0(sp) lw s0 4(sp) lw s1 8(sp) addi sp sp 12 jr ra printArray: li a7 1 addi sp sp -8 sw ra 0(sp) sw s0 4(sp) add s0 a0 x0 # s1 = pointer to array add t0 x0 x0 # t0 = i in for loop printloop: bge t0 a1 finish_print_loop slli t1 t0 2 # t1 = i * 4 add t1 t1 s0 # t1 = t1 + array address lw a0 0(t1) # a0 = element in nums[i] li a7 1 ecall addi t0 t0 1 j printloop finish_print_loop: lw ra 0(sp) lw s0 4(sp) addi sp sp 8 jr ra ``` ::: And I put his code into [Ripes](https://github.com/mortbopet/Ripes/releases) ![](https://i.imgur.com/lptAvQa.png) This is his code's Cycles in Ripes. **And I modify a little in order to run in rv32emu.** (**before optimized**) ``` .org 0 .global _start /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .data nums1: .word 1, 0, 0, 2, 5 .set arr_size1, .-nums1 nums2: .word 1, 9, 0, 6, 7 .set arr_size2, .-nums2 nums3: .word 0, 0, 0, 0, 4 .set arr_size3, .-nums3 nextline: .ascii "\n" .set str_next_len, .-nextline blank: .ascii " " .set str_blank_len, .-blank .text _start: la s0, nums1 # load nums base address to s0 addi s1, x0, 5 #nums_size = 5 #call function moveZeroes add a0, s0, x0 # a0 which stands for address of nums array is the first argument add t3, s1, x0 # t3 which stands for array size of nums is the second argument jal ra, moveZeroes #jump to moveZeroes function #PrintArray jal ra, printArray # printf("\n"); li a7, SYSWRITE li a0, 1 la a1, nextline li a2, 1 ecall # exit and return 0 li a7, SYSEXIT addi a0, x0, 0 ecall moveZeroes: addi sp, sp, -12 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) li s1, 0 #next non zero index = 0 addi s2, x0, 0 # i = 0 loop: bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex: addi s1,s1, 1 #next_nonzero_index++ next_iter: addi s2, s2, 1 #i++ j loop exit: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) addi sp, sp, 12 jr ra printArray: addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) add s0, a0, x0 # s1 = pointer to array add t0, x0, x0 # t0 = i in for loop la t5, nums1 addi t6, x0, 0 addi s11, x0, arr_size1 srli s11, s11, 2 printloop: bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 j printloop finish_print_loop: lw ra, 0(sp) lw s0, 4(sp) addi sp, sp, 8 jr ra ``` ## Optimized by riscv-none-elf-gcc There are many types optimization of the code. ``` $ riscv-none-elf-gcc -Q --help=optimizers ``` ```makefile hanchi@hanchi:~$ riscv-none-elf-gcc -Q --help=optimizers The following options control optimizations: -O<number> -Ofast -Og -Os -Oz -faggressive-loop-optimizations [enabled] -falign-functions [disabled] -falign-functions= -falign-jumps [disabled] ... ``` We will compare `-O0`, `-O1`, `-O2`, `-O3`, `-Os`, `-Ofast` as following. ==And I comment the printf function in C code, because it made the elf oversize and hard to focus on optimization.== ### -O0 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 Movezeroes.c -o Movezeroes_O0.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_O0.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_O0.elf text data bss dec hex filename 1744 1096 61 2901 b55 Movezeroes_O0.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_O0.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_O0.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100dc Start of program headers: 52 (bytes into file) Start of section headers: 6208 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_O0.elf >dis_objdump_O0.txt ``` * Disassembly code There are **486** lines in O0 optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 00010184 <moveZeroes>: 10184: fd010113 addi sp,sp,-48 10188: 02812623 sw s0,44(sp) 1018c: 03010413 addi s0,sp,48 10190: fca42e23 sw a0,-36(s0) 10194: fcb42c23 sw a1,-40(s0) 10198: fe042623 sw zero,-20(s0) 1019c: fe042423 sw zero,-24(s0) 101a0: 07c0006f j 1021c <moveZeroes+0x98> 101a4: fec42783 lw a5,-20(s0) 101a8: 00279793 slli a5,a5,0x2 101ac: fdc42703 lw a4,-36(s0) 101b0: 00f707b3 add a5,a4,a5 101b4: 0007a783 lw a5,0(a5) 101b8: 04078c63 beqz a5,10210 <moveZeroes+0x8c> 101bc: fec42783 lw a5,-20(s0) 101c0: 00279793 slli a5,a5,0x2 101c4: fdc42703 lw a4,-36(s0) 101c8: 00f70733 add a4,a4,a5 101cc: fe842783 lw a5,-24(s0) 101d0: 00279793 slli a5,a5,0x2 101d4: fdc42683 lw a3,-36(s0) 101d8: 00f687b3 add a5,a3,a5 101dc: 00072703 lw a4,0(a4) 101e0: 00e7a023 sw a4,0(a5) 101e4: fec42703 lw a4,-20(s0) 101e8: fe842783 lw a5,-24(s0) 101ec: 00f70c63 beq a4,a5,10204 <moveZeroes+0x80> 101f0: fec42783 lw a5,-20(s0) 101f4: 00279793 slli a5,a5,0x2 101f8: fdc42703 lw a4,-36(s0) 101fc: 00f707b3 add a5,a4,a5 10200: 0007a023 sw zero,0(a5) 10204: fe842783 lw a5,-24(s0) 10208: 00178793 addi a5,a5,1 1020c: fef42423 sw a5,-24(s0) 10210: fec42783 lw a5,-20(s0) 10214: 00178793 addi a5,a5,1 10218: fef42623 sw a5,-20(s0) 1021c: fec42783 lw a5,-20(s0) 10220: fd842703 lw a4,-40(s0) 10224: f8e7c0e3 blt a5,a4,101a4 <moveZeroes+0x20> 10228: fec42783 lw a5,-20(s0) 1022c: 00078513 mv a0,a5 10230: 02c12403 lw s0,44(sp) 10234: 03010113 addi sp,sp,48 10238: 00008067 ret 0001023c <main>: 1023c: fc010113 addi sp,sp,-64 10240: 02112e23 sw ra,60(sp) 10244: 02812c23 sw s0,56(sp) 10248: 04010413 addi s0,sp,64 1024c: fca42623 sw a0,-52(s0) 10250: fcb42423 sw a1,-56(s0) 10254: 000107b7 lui a5,0x10 10258: 75078793 addi a5,a5,1872 # 10750 <__errno+0x8> 1025c: 0007a583 lw a1,0(a5) 10260: 0047a603 lw a2,4(a5) 10264: 0087a683 lw a3,8(a5) 10268: 00c7a703 lw a4,12(a5) 1026c: 0107a783 lw a5,16(a5) 10270: fcb42a23 sw a1,-44(s0) 10274: fcc42c23 sw a2,-40(s0) 10278: fcd42e23 sw a3,-36(s0) 1027c: fee42023 sw a4,-32(s0) 10280: fef42223 sw a5,-28(s0) 10284: fe042623 sw zero,-20(s0) 10288: 0100006f j 10298 <main+0x5c> 1028c: fec42783 lw a5,-20(s0) 10290: 00178793 addi a5,a5,1 10294: fef42623 sw a5,-20(s0) 10298: fec42703 lw a4,-20(s0) 1029c: 00400793 li a5,4 102a0: fee7d6e3 bge a5,a4,1028c <main+0x50> 102a4: fd440793 addi a5,s0,-44 102a8: 00500593 li a1,5 102ac: 00078513 mv a0,a5 102b0: ed5ff0ef jal ra,10184 <moveZeroes> 102b4: fe042423 sw zero,-24(s0) 102b8: 0100006f j 102c8 <main+0x8c> 102bc: fe842783 lw a5,-24(s0) 102c0: 00178793 addi a5,a5,1 102c4: fef42423 sw a5,-24(s0) 102c8: fe842703 lw a4,-24(s0) 102cc: 00400793 li a5,4 102d0: fee7d6e3 bge a5,a4,102bc <main+0x80> 102d4: 00000793 li a5,0 102d8: 00078513 mv a0,a5 102dc: 03c12083 lw ra,60(sp) 102e0: 03812403 lw s0,56(sp) 102e4: 04010113 addi sp,sp,64 102e8: 00008067 ret ``` * Observation and explain * Optimization Purpose * Do not do any optimization, this is the default compile option * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | | -------- | ------------ | ------------ | | s-type | 1 | 1 | | a-type | 6 | 6 | | t-type | 0 | 0 | | other | ra, sp, zero | ra, sp, zero | | Total | 10 | 10 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 64 | 48 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | | ----------------- | ---- | ---------- | | usage | 24 | 27 | * Branch instructions | Branch instruction | Main | Movezeroes | | ------------------ | ---- | ---------- | | usage | 6 | 5 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 44 | 46 | ### -O1 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O1 Movezeroes.c -o Movezeroes_O1.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_O1.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_O1.elf text data bss dec hex filename 1616 1096 61 2773 ad5 Movezeroes_O1.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_O1.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_O1.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x100dc Start of program headers: 52 (bytes into file) Start of section headers: 6080 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_O1.elf >dis_objdump_O1.txt ``` * Disassembly code There are **454** lines in O1 optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 00010184 <moveZeroes>: 10184: ff010113 addi sp,sp,-16 10188: 00012623 sw zero,12(sp) 1018c: 00012423 sw zero,8(sp) 10190: 00c12783 lw a5,12(sp) 10194: 02b7c863 blt a5,a1,101c4 <moveZeroes+0x40> 10198: 00c12503 lw a0,12(sp) 1019c: 01010113 addi sp,sp,16 101a0: 00008067 ret 101a4: 00812783 lw a5,8(sp) 101a8: 00178793 addi a5,a5,1 101ac: 00f12423 sw a5,8(sp) 101b0: 00c12783 lw a5,12(sp) 101b4: 00178793 addi a5,a5,1 101b8: 00f12623 sw a5,12(sp) 101bc: 00c12783 lw a5,12(sp) 101c0: fcb7dce3 bge a5,a1,10198 <moveZeroes+0x14> 101c4: 00c12783 lw a5,12(sp) 101c8: 00279793 slli a5,a5,0x2 101cc: 00f507b3 add a5,a0,a5 101d0: 0007a783 lw a5,0(a5) 101d4: fc078ee3 beqz a5,101b0 <moveZeroes+0x2c> 101d8: 00c12703 lw a4,12(sp) 101dc: 00812783 lw a5,8(sp) 101e0: 00271713 slli a4,a4,0x2 101e4: 00e50733 add a4,a0,a4 101e8: 00072703 lw a4,0(a4) 101ec: 00279793 slli a5,a5,0x2 101f0: 00f507b3 add a5,a0,a5 101f4: 00e7a023 sw a4,0(a5) 101f8: 00c12703 lw a4,12(sp) 101fc: 00812783 lw a5,8(sp) 10200: faf702e3 beq a4,a5,101a4 <moveZeroes+0x20> 10204: 00c12783 lw a5,12(sp) 10208: 00279793 slli a5,a5,0x2 1020c: 00f507b3 add a5,a0,a5 10210: 0007a023 sw zero,0(a5) 10214: f91ff06f j 101a4 <moveZeroes+0x20> 00010218 <main>: 10218: fd010113 addi sp,sp,-48 1021c: 02112623 sw ra,44(sp) 10220: 000107b7 lui a5,0x10 10224: 6d078793 addi a5,a5,1744 # 106d0 <__errno+0x8> 10228: 0007a583 lw a1,0(a5) 1022c: 0047a603 lw a2,4(a5) 10230: 0087a683 lw a3,8(a5) 10234: 00c7a703 lw a4,12(a5) 10238: 0107a783 lw a5,16(a5) 1023c: 00b12623 sw a1,12(sp) 10240: 00c12823 sw a2,16(sp) 10244: 00d12a23 sw a3,20(sp) 10248: 00e12c23 sw a4,24(sp) 1024c: 00f12e23 sw a5,28(sp) 10250: 00500593 li a1,5 10254: 00c10513 addi a0,sp,12 10258: f2dff0ef jal ra,10184 <moveZeroes> 1025c: 00000513 li a0,0 10260: 02c12083 lw ra,44(sp) 10264: 03010113 addi sp,sp,48 10268: 00008067 ret ``` * Obseration and Explain * Optimization Purpose * Do some compilation optimization for the program. For large functions, the optimized compilation takes up slightly more time and a considerable amount of memory * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | | -------- | ------------ | ------------ | | s-type | 0 | 0 | | a-type | 6 | 6 | | t-type | 0 | 0 | | other | ra, sp, zero | ra, sp, zero | | Total | 9 | 9 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 48 | 16 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | | ----------------- | ---- | ---------- | | usage | 12 | 19 | * Branch and jump instructions | Branch nd jump instruction | Main | Movezeroes | | -------------------------- | ---- | ---------- | | usage | 2 | 6 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 21 | 37 | ### -O2 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 Movezeroes.c -o Movezeroes_O2.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_O2.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_O2.elf text data bss dec hex filename 1612 1096 61 2769 ad1 Movezeroes_O2.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_O2.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_O2.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10130 Start of program headers: 52 (bytes into file) Start of section headers: 6088 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_O2.elf >dis_objdump_O2.txt ``` * Disassembly code There are **453** lines in O2 optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 000100c4 <main>: 100c4: 000107b7 lui a5,0x10 100c8: 6cc78793 addi a5,a5,1740 # 106cc <__errno+0x8> 100cc: 0007a803 lw a6,0(a5) 100d0: 0047a603 lw a2,4(a5) 100d4: 0087a683 lw a3,8(a5) 100d8: 00c7a703 lw a4,12(a5) 100dc: 0107a783 lw a5,16(a5) 100e0: fd010113 addi sp,sp,-48 100e4: 00c10513 addi a0,sp,12 100e8: 00500593 li a1,5 100ec: 02112623 sw ra,44(sp) 100f0: 01012623 sw a6,12(sp) 100f4: 00c12823 sw a2,16(sp) 100f8: 00d12a23 sw a3,20(sp) 100fc: 00e12c23 sw a4,24(sp) 10100: 00f12e23 sw a5,28(sp) 10104: 0d4000ef jal ra,101d8 <moveZeroes> 10108: 02c12083 lw ra,44(sp) 1010c: 00000513 li a0,0 10110: 03010113 addi sp,sp,48 10114: 00008067 ret 000101d8 <moveZeroes>: 101d8: ff010113 addi sp,sp,-16 101dc: 00012423 sw zero,8(sp) 101e0: 00012623 sw zero,12(sp) 101e4: 00812783 lw a5,8(sp) 101e8: 06b7da63 bge a5,a1,1025c <moveZeroes+0x84> 101ec: 00812783 lw a5,8(sp) 101f0: 00279793 slli a5,a5,0x2 101f4: 00f507b3 add a5,a0,a5 101f8: 0007a783 lw a5,0(a5) 101fc: 04078663 beqz a5,10248 <moveZeroes+0x70> 10200: 00812703 lw a4,8(sp) 10204: 00c12783 lw a5,12(sp) 10208: 00812683 lw a3,8(sp) 1020c: 00271713 slli a4,a4,0x2 10210: 00e50733 add a4,a0,a4 10214: 00072603 lw a2,0(a4) 10218: 00279793 slli a5,a5,0x2 1021c: 00c12703 lw a4,12(sp) 10220: 00f507b3 add a5,a0,a5 10224: 00c7a023 sw a2,0(a5) 10228: 00e68a63 beq a3,a4,1023c <moveZeroes+0x64> 1022c: 00812783 lw a5,8(sp) 10230: 00279793 slli a5,a5,0x2 10234: 00f507b3 add a5,a0,a5 10238: 0007a023 sw zero,0(a5) 1023c: 00c12783 lw a5,12(sp) 10240: 00178793 addi a5,a5,1 10244: 00f12623 sw a5,12(sp) 10248: 00812783 lw a5,8(sp) 1024c: 00178793 addi a5,a5,1 10250: 00f12423 sw a5,8(sp) 10254: 00812783 lw a5,8(sp) 10258: f8b7cae3 blt a5,a1,101ec <moveZeroes+0x14> 1025c: 00812503 lw a0,8(sp) 10260: 01010113 addi sp,sp,16 10264: 00008067 ret ``` * Observation and explain * Optimization Purpose * Will try more register-level optimizations as well as instruction-level optimizations, which take up more memory and compile time during compilation. * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | | -------- | ------ | ------------ | | s-type | 0 | 0 | | a-type | 7 | 6 | | t-type | 0 | 0 | | other | ra, sp | ra, sp, zero | | Total | 9 | 9 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 48 | 16 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | | ----------------- | ---- | ---------- | | usage | 12 | 19 | * Branch and jump instructions | Branch nd jump instruction | Main | Movezeroes | | -------------------------- | ---- | ---------- | | usage | 2 | 5 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 21 | 36 | ### -O3 Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 Movezeroes.c -o Movezeroes_O3.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_O3.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_O3.elf text data bss dec hex filename 1612 1096 61 2769 ad1 Movezeroes_O3.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_O3.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_O3.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10130 Start of program headers: 52 (bytes into file) Start of section headers: 6088 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_O3.elf >dis_objdump_O3.txt ``` * Disassembly code There are **453** lines in O3 optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 000100c4 <main>: 100c4: 000107b7 lui a5,0x10 100c8: 6cc78793 addi a5,a5,1740 # 106cc <__errno+0x8> 100cc: 0007a803 lw a6,0(a5) 100d0: 0047a603 lw a2,4(a5) 100d4: 0087a683 lw a3,8(a5) 100d8: 00c7a703 lw a4,12(a5) 100dc: 0107a783 lw a5,16(a5) 100e0: fd010113 addi sp,sp,-48 100e4: 00c10513 addi a0,sp,12 100e8: 00500593 li a1,5 100ec: 02112623 sw ra,44(sp) 100f0: 01012623 sw a6,12(sp) 100f4: 00c12823 sw a2,16(sp) 100f8: 00d12a23 sw a3,20(sp) 100fc: 00e12c23 sw a4,24(sp) 10100: 00f12e23 sw a5,28(sp) 10104: 0d4000ef jal ra,101d8 <moveZeroes> 10108: 02c12083 lw ra,44(sp) 1010c: 00000513 li a0,0 10110: 03010113 addi sp,sp,48 10114: 00008067 ret 000101d8 <moveZeroes>: 101d8: ff010113 addi sp,sp,-16 101dc: 00012423 sw zero,8(sp) 101e0: 00012623 sw zero,12(sp) 101e4: 00812783 lw a5,8(sp) 101e8: 06b7da63 bge a5,a1,1025c <moveZeroes+0x84> 101ec: 00812783 lw a5,8(sp) 101f0: 00279793 slli a5,a5,0x2 101f4: 00f507b3 add a5,a0,a5 101f8: 0007a783 lw a5,0(a5) 101fc: 04078663 beqz a5,10248 <moveZeroes+0x70> 10200: 00812703 lw a4,8(sp) 10204: 00c12783 lw a5,12(sp) 10208: 00812683 lw a3,8(sp) 1020c: 00271713 slli a4,a4,0x2 10210: 00e50733 add a4,a0,a4 10214: 00072603 lw a2,0(a4) 10218: 00279793 slli a5,a5,0x2 1021c: 00c12703 lw a4,12(sp) 10220: 00f507b3 add a5,a0,a5 10224: 00c7a023 sw a2,0(a5) 10228: 00e68a63 beq a3,a4,1023c <moveZeroes+0x64> 1022c: 00812783 lw a5,8(sp) 10230: 00279793 slli a5,a5,0x2 10234: 00f507b3 add a5,a0,a5 10238: 0007a023 sw zero,0(a5) 1023c: 00c12783 lw a5,12(sp) 10240: 00178793 addi a5,a5,1 10244: 00f12623 sw a5,12(sp) 10248: 00812783 lw a5,8(sp) 1024c: 00178793 addi a5,a5,1 10250: 00f12423 sw a5,8(sp) 10254: 00812783 lw a5,8(sp) 10258: f8b7cae3 blt a5,a1,101ec <moveZeroes+0x14> 1025c: 00812503 lw a0,8(sp) 10260: 01010113 addi sp,sp,16 10264: 00008067 ret ``` * Observation and explain * Optimization Purpose * Further optimization than O2 * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | | -------- | ------------ | ------------ | | s-type | 0 | 0 | | a-type | 7 | 6 | | t-type | 0 | 0 | | other | ra, sp | ra, sp, zero | | Total | 9 | 9 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 48 | 16 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | | ----------------- | ---- | ---------- | | usage | 12 | 19 | * Branch and jump instructions | Branch nd jump instruction | Main | Movezeroes | | -------------------------- | ---- | ---------- | | usage | 2 | 5 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 21 | 36 | ### -Os Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Os Movezeroes.c -o Movezeroes_Os.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_Os.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_Os.elf text data bss dec hex filename 2000 1096 61 3157 c55 Movezeroes_Os.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_Os.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_Os.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10114 Start of program headers: 52 (bytes into file) Start of section headers: 6552 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_Os.elf >dis_objdump_Os.txt ``` * Disassembly code There are **552** lines in Os optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 000100c4 <main>: 100c4: fd010113 addi sp,sp,-48 100c8: 000115b7 lui a1,0x11 100cc: 01400613 li a2,20 100d0: 85058593 addi a1,a1,-1968 # 10850 <__errno+0x8> 100d4: 00c10513 addi a0,sp,12 100d8: 02112623 sw ra,44(sp) 100dc: 270000ef jal ra,1034c <memcpy> 100e0: 00c10513 addi a0,sp,12 100e4: 00500593 li a1,5 100e8: 0d4000ef jal ra,101bc <moveZeroes> 100ec: 02c12083 lw ra,44(sp) 100f0: 00000513 li a0,0 100f4: 03010113 addi sp,sp,48 100f8: 00008067 ret 000101bc <moveZeroes>: 101bc: ff010113 addi sp,sp,-16 101c0: 00012423 sw zero,8(sp) 101c4: 00012623 sw zero,12(sp) 101c8: 00812783 lw a5,8(sp) 101cc: 00b7c863 blt a5,a1,101dc <moveZeroes+0x20> 101d0: 00812503 lw a0,8(sp) 101d4: 01010113 addi sp,sp,16 101d8: 00008067 ret 101dc: 00812783 lw a5,8(sp) 101e0: 00279793 slli a5,a5,0x2 101e4: 00f507b3 add a5,a0,a5 101e8: 0007a783 lw a5,0(a5) 101ec: 04078663 beqz a5,10238 <moveZeroes+0x7c> 101f0: 00812703 lw a4,8(sp) 101f4: 00c12783 lw a5,12(sp) 101f8: 00271713 slli a4,a4,0x2 101fc: 00e50733 add a4,a0,a4 10200: 00072703 lw a4,0(a4) 10204: 00279793 slli a5,a5,0x2 10208: 00f507b3 add a5,a0,a5 1020c: 00e7a023 sw a4,0(a5) 10210: 00812703 lw a4,8(sp) 10214: 00c12783 lw a5,12(sp) 10218: 00f70a63 beq a4,a5,1022c <moveZeroes+0x70> 1021c: 00812783 lw a5,8(sp) 10220: 00279793 slli a5,a5,0x2 10224: 00f507b3 add a5,a0,a5 10228: 0007a023 sw zero,0(a5) 1022c: 00c12783 lw a5,12(sp) 10230: 00178793 addi a5,a5,1 10234: 00f12623 sw a5,12(sp) 10238: 00812783 lw a5,8(sp) 1023c: 00178793 addi a5,a5,1 10240: 00f12423 sw a5,8(sp) 10244: f85ff06f j 101c8 <moveZeroes+0xc> ``` * Observation and explain * Optimization Purpose * Mainly to optimize the size of the program * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | |:-------- | ------ | ------------ | | s-type | 0 | 0 | | a-type | 3 | 6 | | t-type | 0 | 0 | | other | ra, sp | ra, sp, zero | | Total | 5 | 9 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 48 | 16 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | |:----------------- | ---- | ---------- | | usage | 2 | 18 | * Branch and jump instructions | Branch nd jump instruction | Main | Movezeroes | | -------------------------- | ---- | ---------- | | usage | 3 | 5 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 14 | 35 | ### -Ofast Optimized Assembly code * Compile ``` $ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -Ofast Movezeroes.c -o Movezeroes_Ofast.elf ``` * Size ``` $ riscv-none-elf-size Movezeroes_Ofast.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-size Movezeroes_Ofast.elf text data bss dec hex filename 1612 1096 61 2769 ad1 Movezeroes_Ofast.elf ``` * ELF header ``` $ riscv-none-elf-readelf -h Movezeroes_Ofast.elf ``` ```makefile hanchi@hanchi:~/computer_architecture/HW02/C_to_elf$ riscv-none-elf-readelf -h Movezeroes_Ofast.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x10130 Start of program headers: 52 (bytes into file) Start of section headers: 6088 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 3 Size of section headers: 40 (bytes) Number of section headers: 15 Section header string table index: 14 ``` * Save txt ``` $ riscv-none-elf-objdump -d Movezeroes_Ofast.elf >dis_objdump_Ofast.txt ``` * Disassembly code There are **453** lines in Os optimization, therefore I chose `moveZeroes`, `main` two sections to compare. ```= 000100c4 <main>: 100c4: 000107b7 lui a5,0x10 100c8: 6cc78793 addi a5,a5,1740 # 106cc <__errno+0x8> 100cc: 0007a803 lw a6,0(a5) 100d0: 0047a603 lw a2,4(a5) 100d4: 0087a683 lw a3,8(a5) 100d8: 00c7a703 lw a4,12(a5) 100dc: 0107a783 lw a5,16(a5) 100e0: fd010113 addi sp,sp,-48 100e4: 00c10513 addi a0,sp,12 100e8: 00500593 li a1,5 100ec: 02112623 sw ra,44(sp) 100f0: 01012623 sw a6,12(sp) 100f4: 00c12823 sw a2,16(sp) 100f8: 00d12a23 sw a3,20(sp) 100fc: 00e12c23 sw a4,24(sp) 10100: 00f12e23 sw a5,28(sp) 10104: 0d4000ef jal ra,101d8 <moveZeroes> 10108: 02c12083 lw ra,44(sp) 1010c: 00000513 li a0,0 10110: 03010113 addi sp,sp,48 10114: 00008067 ret 000101d8 <moveZeroes>: 101d8: ff010113 addi sp,sp,-16 101dc: 00012423 sw zero,8(sp) 101e0: 00012623 sw zero,12(sp) 101e4: 00812783 lw a5,8(sp) 101e8: 06b7da63 bge a5,a1,1025c <moveZeroes+0x84> 101ec: 00812783 lw a5,8(sp) 101f0: 00279793 slli a5,a5,0x2 101f4: 00f507b3 add a5,a0,a5 101f8: 0007a783 lw a5,0(a5) 101fc: 04078663 beqz a5,10248 <moveZeroes+0x70> 10200: 00812703 lw a4,8(sp) 10204: 00c12783 lw a5,12(sp) 10208: 00812683 lw a3,8(sp) 1020c: 00271713 slli a4,a4,0x2 10210: 00e50733 add a4,a0,a4 10214: 00072603 lw a2,0(a4) 10218: 00279793 slli a5,a5,0x2 1021c: 00c12703 lw a4,12(sp) 10220: 00f507b3 add a5,a0,a5 10224: 00c7a023 sw a2,0(a5) 10228: 00e68a63 beq a3,a4,1023c <moveZeroes+0x64> 1022c: 00812783 lw a5,8(sp) 10230: 00279793 slli a5,a5,0x2 10234: 00f507b3 add a5,a0,a5 10238: 0007a023 sw zero,0(a5) 1023c: 00c12783 lw a5,12(sp) 10240: 00178793 addi a5,a5,1 10244: 00f12623 sw a5,12(sp) 10248: 00812783 lw a5,8(sp) 1024c: 00178793 addi a5,a5,1 10250: 00f12423 sw a5,8(sp) 10254: 00812783 lw a5,8(sp) 10258: f8b7cae3 blt a5,a1,101ec <moveZeroes+0x14> 1025c: 00812503 lw a0,8(sp) 10260: 01010113 addi sp,sp,16 10264: 00008067 ret ``` * Observation and explain * Optimization Purpose * Ignore strict standard compilation. Contains all options in O3 * Register Usage (main / Movezeroes) | Register | Main | Movezeroes | |:-------- | ------ | ------------ | | s-type | 0 | 0 | | a-type | 7 | 6 | | t-type | 0 | 0 | | other | ra, sp | ra, sp, zero | | Total | 9 | 9 | * Stack | stack | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 48 | 16 | * lw/sw instructions | lw/sw instruction | Main | Movezeroes | | ----------------- | ---- | ---------- | | usage | 12 | 19 | * Branch and jump instructions | Branch nd jump instruction | Main | Movezeroes | | -------------------------- | ---- | ---------- | | usage | 2 | 5 | * Lines | Lines | Main | Movezeroes | | ----- | ---- | ---------- | | usage | 21 | 36 | ### In conclusion #### CSR | Name | CSR cycle | | ----- | --------- | | O0 | 393 | | O1 | 305 | | O2 | 328 | | O3 | 328 | | Os | 334 | | Ofast | 328 | By looking at the number of CSRs and "compare only main and Movezero" the two sections. We can find that `O2`, `O3` is exactly the same as `Ofast`. #### From O0 to O1 We can find that O0 is basically a line-by-line translation of the C language into a combined language. Therefore, when comparing it with O1, it can be found that not only the number of lines used in the `Movezero` section is more, but also the instructions of sw/lw are far more than O1. This will cause it to execute slower than O1. In addition, we can see that from O0 to O1, the number of rows has been reduced by more than half. #### From O1 to O2 We can find that there is not as much optimization as O0 to O1, and also uses more a-type registers, but reduces some branch instructions(reduce 1) in `Movezero` section. but maybe somewhere is over-optimized, caused it used more CSR cycles in O1. #### From O2 to O3 and Ofast From O2 to O3 is the same as Ofast, which may indicate that the compiler has optimized them to the extreme, and the compiler can no longer propose a better optimization method. #### From O1 to Os In general, Os will generate the smallest elf file. But in this case, it maybe not the smallest in the six ELF files. But we also can figure ot that it only used 5 S-format instructions rather then 15 in O1, in other word, slightly less in everywhere. --- ## Optimize assembly code There are for loop and while loop in my code, sp I will use **loop unrooling** to optimize my code.The cost of loop unrooling is that there will be a lot of repetitive code to replace the loop, so my code size will increase a lot. Loop unrooling : * pros 1. Reduce branch prediction failures 2. Eliminate few loop bounds checks * cons 1. Much larger program 2. Decrease in program readability ``` .org 0 .global _start /* newlib system calls */ .set SYSEXIT, 93 .set SYSWRITE, 64 .data nums1: .word 1, 0, 0, 2, 5 .set arr_size1, .-nums1 nums2: .word 1, 9, 0, 6, 7 .set arr_size2, .-nums2 nums3: .word 0, 0, 0, 0, 4 .set arr_size3, .-nums3 nextline: .ascii "\n" .set str_next_len, .-nextline blank: .ascii " " .set str_blank_len, .-blank .text _start: la s0, nums1 # load nums base address to s0 addi s1, x0, 5 #nums_size = 5 #call function moveZeroes add a0, s0, x0 # a0 which stands for address of nums array is the first argument add t3, s1, x0 # t3 which stands for array size of nums is the second argument jal ra, moveZeroes #jump to moveZeroes function #PrintArray jal ra, printArray # printf("\n"); li a7, SYSWRITE li a0, 1 la a1, nextline li a2, 1 ecall # exit and return 0 li a7, SYSEXIT addi a0, x0, 0 ecall moveZeroes: addi sp, sp, -12 sw ra, 0(sp) sw s0, 4(sp) sw s1, 8(sp) li s1, 0 #next non zero index = 0 addi s2, x0, 0 # i = 0 loop: # loop unrooling bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter0 slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex0 #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex0: addi s1,s1, 1 #next_nonzero_index++ next_iter0: addi s2, s2, 1 #i++ bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter1 slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex1 #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex1: addi s1,s1, 1 #next_nonzero_index++ next_iter1: addi s2, s2, 1 #i++ bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter2 slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex2 #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex2: addi s1,s1, 1 #next_nonzero_index++ next_iter2: addi s2, s2, 1 #i++ bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter3 slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex3 #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex3: addi s1,s1, 1 #next_nonzero_index++ next_iter3: addi s2, s2, 1 #i++ bge s2, t3, exit #i >= array_size exit slli t0, s2, 2 #i * 4 add t0, a0, t0 #array + i*4 lw t1, 0(t0) #t1 = array[i] beq t1, x0, next_iter4 slli t2, s1, 2 #next_nonzero_index * 4 add t2, t2, a0 #array + next_nonzero_index * 4 sw t1, 0(t2) #array[next_nonzero_index] = array[i] beq s1, s2, next_iter_addIndex4 #if(next_nonzero_index != i) sw x0, 0(t0) #store 0 to array[i] next_iter_addIndex4: addi s1,s1, 1 #next_nonzero_index++ next_iter4: addi s2, s2, 1 #i++ exit: lw ra, 0(sp) lw s0, 4(sp) lw s1, 8(sp) addi sp, sp, 12 jr ra printArray: addi sp, sp, -8 sw ra, 0(sp) sw s0, 4(sp) add s0, a0, x0 # s1 = pointer to array add t0, x0, x0 # t0 = i in for loop la t5, nums1 addi t6, x0, 0 addi s11, x0, arr_size1 srli s11, s11, 2 # loop unrooling bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 bge t6, s11, finish_print_loop lw t4, 0(t5) addi t5, t5, 4 mv a1, t4 addi a1, a1, 48 addi sp, sp, -4 sw a1, 0(sp) addi a1, sp, 0 li a7, SYSWRITE li a0, 1 li a2, 4 ecall addi sp,sp,4 li a7, SYSWRITE li a0, 1 la a1, blank li a2, str_blank_len ecall addi t6, t6, 1 finish_print_loop: lw ra, 0(sp) lw s0, 4(sp) addi sp, sp, 8 jr ra ``` I find that there is no change between loop unrooling and do nothing, the CSR cycles is still 108. Although it uses the same CSR cycles, but loop unrooling reduce the branch prediction failure, it means it will use fewer NOP in th pipeline. Therefore, I think it will reduce the execution time in theoretically. * CSR Cycles ```makefile hanchi@hanchi:~/rv32emu/tests/asm-hello$ ../../build/rv32emu --stats hello.elf 1 2 5 0 0 inferior exit code 0 CSR cycle count: 166 ``` * code size ```makefile hanchi@hanchi:~/rv32emu/tests/asm-hello$ riscv-none-elf-size hello.elf text data bss dec hex filename 868 0 0 868 364 hello.elf ``` * ELF header ```makefile hanchi@hanchi:~/rv32emu/tests/asm-hello$ riscv-none-elf-readelf -h hello.elf ELF Header: Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: EXEC (Executable file) Machine: RISC-V Version: 0x1 Entry point address: 0x0 Start of program headers: 52 (bytes into file) Start of section headers: 5916 (bytes into file) Flags: 0x0 Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 2 Size of section headers: 40 (bytes) Number of section headers: 6 Section header string table index: 5 ``` * OBJdump * store to txt, click as follow :::spoiler OBJdump txt file ``` hello.elf: file format elf32-littleriscv Disassembly of section .text: 00000000 <_start>: 0: 00000417 auipc s0,0x0 4: 32440413 addi s0,s0,804 # 324 <nums1> 8: 00500493 li s1,5 c: 00040533 add a0,s0,zero 10: 00048e33 add t3,s1,zero 14: 02c000ef jal ra,40 <moveZeroes> 18: 144000ef jal ra,15c <printArray> 1c: 04000893 li a7,64 20: 00100513 li a0,1 24: 00000597 auipc a1,0x0 28: 33c58593 addi a1,a1,828 # 360 <nextline> 2c: 00100613 li a2,1 30: 00000073 ecall 34: 05d00893 li a7,93 38: 00000513 li a0,0 3c: 00000073 ecall 00000040 <moveZeroes>: 40: ff410113 addi sp,sp,-12 44: 00112023 sw ra,0(sp) 48: 00812223 sw s0,4(sp) 4c: 00912423 sw s1,8(sp) 50: 00000493 li s1,0 54: 00000913 li s2,0 00000058 <loop>: 58: 0fc95863 bge s2,t3,148 <exit> 5c: 00291293 slli t0,s2,0x2 60: 005502b3 add t0,a0,t0 64: 0002a303 lw t1,0(t0) 68: 00030e63 beqz t1,84 <next_iter0> 6c: 00249393 slli t2,s1,0x2 70: 00a383b3 add t2,t2,a0 74: 0063a023 sw t1,0(t2) 78: 01248463 beq s1,s2,80 <next_iter_addIndex0> 7c: 0002a023 sw zero,0(t0) 00000080 <next_iter_addIndex0>: 80: 00148493 addi s1,s1,1 00000084 <next_iter0>: 84: 00190913 addi s2,s2,1 88: 0dc95063 bge s2,t3,148 <exit> 8c: 00291293 slli t0,s2,0x2 90: 005502b3 add t0,a0,t0 94: 0002a303 lw t1,0(t0) 98: 00030e63 beqz t1,b4 <next_iter1> 9c: 00249393 slli t2,s1,0x2 a0: 00a383b3 add t2,t2,a0 a4: 0063a023 sw t1,0(t2) a8: 01248463 beq s1,s2,b0 <next_iter_addIndex1> ac: 0002a023 sw zero,0(t0) 000000b0 <next_iter_addIndex1>: b0: 00148493 addi s1,s1,1 000000b4 <next_iter1>: b4: 00190913 addi s2,s2,1 b8: 09c95863 bge s2,t3,148 <exit> bc: 00291293 slli t0,s2,0x2 c0: 005502b3 add t0,a0,t0 c4: 0002a303 lw t1,0(t0) c8: 00030e63 beqz t1,e4 <next_iter2> cc: 00249393 slli t2,s1,0x2 d0: 00a383b3 add t2,t2,a0 d4: 0063a023 sw t1,0(t2) d8: 01248463 beq s1,s2,e0 <next_iter_addIndex2> dc: 0002a023 sw zero,0(t0) 000000e0 <next_iter_addIndex2>: e0: 00148493 addi s1,s1,1 000000e4 <next_iter2>: e4: 00190913 addi s2,s2,1 e8: 07c95063 bge s2,t3,148 <exit> ec: 00291293 slli t0,s2,0x2 f0: 005502b3 add t0,a0,t0 f4: 0002a303 lw t1,0(t0) f8: 00030e63 beqz t1,114 <next_iter3> fc: 00249393 slli t2,s1,0x2 100: 00a383b3 add t2,t2,a0 104: 0063a023 sw t1,0(t2) 108: 01248463 beq s1,s2,110 <next_iter_addIndex3> 10c: 0002a023 sw zero,0(t0) 00000110 <next_iter_addIndex3>: 110: 00148493 addi s1,s1,1 00000114 <next_iter3>: 114: 00190913 addi s2,s2,1 118: 03c95863 bge s2,t3,148 <exit> 11c: 00291293 slli t0,s2,0x2 120: 005502b3 add t0,a0,t0 124: 0002a303 lw t1,0(t0) 128: 00030e63 beqz t1,144 <next_iter4> 12c: 00249393 slli t2,s1,0x2 130: 00a383b3 add t2,t2,a0 134: 0063a023 sw t1,0(t2) 138: 01248463 beq s1,s2,140 <next_iter_addIndex4> 13c: 0002a023 sw zero,0(t0) 00000140 <next_iter_addIndex4>: 140: 00148493 addi s1,s1,1 00000144 <next_iter4>: 144: 00190913 addi s2,s2,1 00000148 <exit>: 148: 00012083 lw ra,0(sp) 14c: 00412403 lw s0,4(sp) 150: 00812483 lw s1,8(sp) 154: 00c10113 addi sp,sp,12 158: 00008067 ret 0000015c <printArray>: 15c: ff810113 addi sp,sp,-8 160: 00112023 sw ra,0(sp) 164: 00812223 sw s0,4(sp) 168: 00050433 add s0,a0,zero 16c: 000002b3 add t0,zero,zero 170: 00000f17 auipc t5,0x0 174: 1b4f0f13 addi t5,t5,436 # 324 <nums1> 178: 00000f93 li t6,0 17c: 01400d93 li s11,20 180: 002ddd93 srli s11,s11,0x2 184: 19bfd863 bge t6,s11,314 <finish_print_loop> 188: 000f2e83 lw t4,0(t5) 18c: 004f0f13 addi t5,t5,4 190: 000e8593 mv a1,t4 194: 03058593 addi a1,a1,48 198: ffc10113 addi sp,sp,-4 19c: 00b12023 sw a1,0(sp) 1a0: 00010593 mv a1,sp 1a4: 04000893 li a7,64 1a8: 00100513 li a0,1 1ac: 00400613 li a2,4 1b0: 00000073 ecall 1b4: 00410113 addi sp,sp,4 1b8: 04000893 li a7,64 1bc: 00100513 li a0,1 1c0: 00000597 auipc a1,0x0 1c4: 1a158593 addi a1,a1,417 # 361 <blank> 1c8: 00100613 li a2,1 1cc: 00000073 ecall 1d0: 001f8f93 addi t6,t6,1 1d4: 15bfd063 bge t6,s11,314 <finish_print_loop> 1d8: 000f2e83 lw t4,0(t5) 1dc: 004f0f13 addi t5,t5,4 1e0: 000e8593 mv a1,t4 1e4: 03058593 addi a1,a1,48 1e8: ffc10113 addi sp,sp,-4 1ec: 00b12023 sw a1,0(sp) 1f0: 00010593 mv a1,sp 1f4: 04000893 li a7,64 1f8: 00100513 li a0,1 1fc: 00400613 li a2,4 200: 00000073 ecall 204: 00410113 addi sp,sp,4 208: 04000893 li a7,64 20c: 00100513 li a0,1 210: 00000597 auipc a1,0x0 214: 15158593 addi a1,a1,337 # 361 <blank> 218: 00100613 li a2,1 21c: 00000073 ecall 220: 001f8f93 addi t6,t6,1 224: 0fbfd863 bge t6,s11,314 <finish_print_loop> 228: 000f2e83 lw t4,0(t5) 22c: 004f0f13 addi t5,t5,4 230: 000e8593 mv a1,t4 234: 03058593 addi a1,a1,48 238: ffc10113 addi sp,sp,-4 23c: 00b12023 sw a1,0(sp) 240: 00010593 mv a1,sp 244: 04000893 li a7,64 248: 00100513 li a0,1 24c: 00400613 li a2,4 250: 00000073 ecall 254: 00410113 addi sp,sp,4 258: 04000893 li a7,64 25c: 00100513 li a0,1 260: 00000597 auipc a1,0x0 264: 10158593 addi a1,a1,257 # 361 <blank> 268: 00100613 li a2,1 26c: 00000073 ecall 270: 001f8f93 addi t6,t6,1 274: 0bbfd063 bge t6,s11,314 <finish_print_loop> 278: 000f2e83 lw t4,0(t5) 27c: 004f0f13 addi t5,t5,4 280: 000e8593 mv a1,t4 284: 03058593 addi a1,a1,48 288: ffc10113 addi sp,sp,-4 28c: 00b12023 sw a1,0(sp) 290: 00010593 mv a1,sp 294: 04000893 li a7,64 298: 00100513 li a0,1 29c: 00400613 li a2,4 2a0: 00000073 ecall 2a4: 00410113 addi sp,sp,4 2a8: 04000893 li a7,64 2ac: 00100513 li a0,1 2b0: 00000597 auipc a1,0x0 2b4: 0b158593 addi a1,a1,177 # 361 <blank> 2b8: 00100613 li a2,1 2bc: 00000073 ecall 2c0: 001f8f93 addi t6,t6,1 2c4: 05bfd863 bge t6,s11,314 <finish_print_loop> 2c8: 000f2e83 lw t4,0(t5) 2cc: 004f0f13 addi t5,t5,4 2d0: 000e8593 mv a1,t4 2d4: 03058593 addi a1,a1,48 2d8: ffc10113 addi sp,sp,-4 2dc: 00b12023 sw a1,0(sp) 2e0: 00010593 mv a1,sp 2e4: 04000893 li a7,64 2e8: 00100513 li a0,1 2ec: 00400613 li a2,4 2f0: 00000073 ecall 2f4: 00410113 addi sp,sp,4 2f8: 04000893 li a7,64 2fc: 00100513 li a0,1 300: 00000597 auipc a1,0x0 304: 06158593 addi a1,a1,97 # 361 <blank> 308: 00100613 li a2,1 30c: 00000073 ecall 310: 001f8f93 addi t6,t6,1 00000314 <finish_print_loop>: 314: 00012083 lw ra,0(sp) 318: 00412403 lw s0,4(sp) 31c: 00810113 addi sp,sp,8 320: 00008067 ret 00000324 <nums1>: 324: 00000001 .word 0x00000001 ... 330: 00000002 .word 0x00000002 334: 00000005 .word 0x00000005 00000338 <nums2>: 338: 00000001 .word 0x00000001 33c: 00000009 .word 0x00000009 340: 00000000 .word 0x00000000 344: 00000006 .word 0x00000006 348: 00000007 .word 0x00000007 0000034c <nums3>: ... 35c: 00000004 .word 0x00000004 00000360 <nextline>: 360: Address 0x0000000000000360 is out of bounds. 364: 00000361 <blank>: 361: 0020 .short 0x0020 ... ``` ::: --- ## Summry | Name | CSR cycle | | --------------- | ------------- | | origin in Ripes | 185 | | O0 | 393 | | O1 | 305 | | O2 | 328 | | O3 | 328 | | Os | 334 | | Ofast | 328 | | hand-write | 166 :crown: | I have learned a lot of optimization methods for combined languages (loop unrooling, SIMD, SWAR, etc.) in this assignment. We can see that loop unrooling will shorten the execution time, but it will not reduce the number of CSRs used. Maybe in the future I can use more optimized ways to deal with the above program --- ## Referencr Link [Assignment2: RISC-V Toolchain](https://hackmd.io/Cl9XFI5hQ5G1ZuM8dwMG3g?view) [Lab1 RV32I Move Zeroes](https://hackmd.io/@Fo7UsdePRsKPVV4CPYGbpA/rk91z0LWi) [Lab2, RISC-V RV32I emulator with ELF support](https://hackmd.io/@zoanana990/BktYM91wt) [你所不知道的 C 語言：編譯器和最佳化原理篇](https://hackmd.io/@sysprog/c-compiler-optimization?type=view#%E4%BD%A0%E6%89%80%E4%B8%8D%E7%9F%A5%E9%81%93%E7%9A%84-C-%E8%AA%9E%E8%A8%80%EF%BC%9A%E7%B7%A8%E8%AD%AF%E5%99%A8%E5%92%8C%E6%9C%80%E4%BD%B3%E5%8C%96%E5%8E%9F%E7%90%86%E7%AF%87)