
2022 Homework2: RISC-V Toolchain

tags: RISC-V, jserv

Before Start

We need to install the RISC-V toolchain on our virtual machine or computer and add it to the system's PATH variable. The instructions in lab2 do not change PATH permanently, so we would have to source riscv-none-elf-gcc/setenv every time we log in. I find that annoying, so I edited ~/.profile to add the toolchain to the user's PATH automatically at each login.

If you follow the instructions, riscv-none-elf-gcc should be under the "/home/YourUserName" directory, so the following lines should be added to ~/.profile:

if [ -d "/home/YourUserName/riscv-none-elf-gcc/bin" ] ; then
  PATH="/home/YourUserName/riscv-none-elf-gcc/bin:$PATH"
fi

Once added to ~/.profile, these lines check whether the toolchain directory exists and, if it does, prepend it to the user's PATH.

Then you can restart the terminal or use source ~/.profile to update PATH variable.

Check $PATH

echo $PATH

You should be able to see this in your path variable:

/home/YourUserName/riscv-none-elf-gcc/bin

The following is my result:

[Screenshot of the echo $PATH output; image not available]

Rewrite Concatenation of Array

I chose Concatenation of Array from tonych1997.

Motivation: My first homework was practice in reducing the number of elements in an array, and this time I want to practice growing an array in assembly. Also, tonych1997 wrote plenty of comments, so the assembly code is easy to understand.

Before I started rewriting the assembly program, I ran into a problem. In the emulator's system call implementation, syscall_write always outputs the input data byte by byte, so the emulator cannot print a 32-bit integer unless we first convert it into a string.

/* lookup the file descriptor */
map_iter_t it;
map_find(s->fd_map, &it, &fd);
if (!map_at_end(s->fd_map, &it)) {
    /* write out the data */
    size_t written = fwrite(tmp, 1, count, map_iter_value(&it, FILE *));

    /* return number of bytes written */
    rv_set_reg(rv, rv_reg_a0, written);
} else {
    /* error */
    rv_set_reg(rv, rv_reg_a0, -1);
}
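For comparison, the "convert it into a string first" route mentioned above could look roughly like the following sketch. The helper name int32_to_decimal is hypothetical and is not part of rv32emu; it only illustrates the conversion a guest program would need before calling write.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of rv32emu): writes the decimal form of
 * value into buf, NUL-terminates it, and returns the number of characters
 * written (excluding the NUL). buf must hold at least 12 bytes. */
size_t int32_to_decimal(int32_t value, char *buf)
{
    char digits[10];            /* a 32-bit value has at most 10 digits */
    size_t n = 0, len = 0;
    /* work in unsigned so that INT32_MIN does not overflow on negation */
    uint32_t magnitude = value < 0 ? 0u - (uint32_t) value : (uint32_t) value;

    do {                        /* collect digits, least significant first */
        digits[n++] = (char) ('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);

    if (value < 0)
        buf[len++] = '-';
    while (n > 0)               /* reverse the digits into the output */
        buf[len++] = digits[--n];
    buf[len] = '\0';
    return len;
}
```

The returned length is what would be passed as the count argument of write, with buf as the buffer.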

In rv32emu/src/state.h we can see that the files opened by default in this emulator are stdin, stdout, and stderr (file descriptors 0, 1, and 2, respectively), so the sample code in rv32emu/tests/asm-hello/hello.S sets a0 to 1 before the ecall to write to standard output.

static inline state_t *state_new()
{
    state_t *s = malloc(sizeof(state_t));
    s->mem = memory_new();
    s->break_addr = 0;

    s->fd_map = map_init(int, FILE *, map_cmp_int);
    FILE *files[] = {stdin, stdout, stderr};
    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++)
        map_insert(s->fd_map, &i, &files[i]);

    return s;
}
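The descriptor table built in state_new can be pictured as a small array indexed by the guest's file descriptor. The sketch below is a simplified stand-in that uses a plain array instead of the emulator's map type; lookup_fd is a hypothetical name used only for illustration.

```c
#include <stdio.h>

/* Simplified stand-in for s->fd_map: index i of the table is guest fd i. */
FILE *lookup_fd(unsigned int fd)
{
    FILE *files[] = {stdin, stdout, stderr};
    if (fd >= sizeof(files) / sizeof(files[0]))
        return NULL;    /* unknown descriptor: the write would report an error */
    return files[fd];
}
```

With this picture, a descriptor such as 0xfff is not in the table, so any special value of that kind has to be translated to a real descriptor before the lookup.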

The write system call has no spare argument to indicate whether an integer or a string should be printed, and adding one would mean rewriting some structures in emulator.c, which I think is too complicated. So I decided to implement integer output in a very naive way.

I modified syscall.c as follows:

static void syscall_write(struct riscv_t *rv)
{
    state_t *s = rv_userdata(rv); /* access userdata */

    /* _write(fd, buffer, count) */
    riscv_word_t fd = rv_get_reg(rv, rv_reg_a0);
    riscv_word_t buffer = rv_get_reg(rv, rv_reg_a1);
    riscv_word_t count = rv_get_reg(rv, rv_reg_a2);

    /* If a0 is 0xfff, ecall will print an integer */
    int flag = 0;
    if (fd == 0xfff) {
        fd = 1;
        flag = 1;
    }

    /* read the string that we are printing */
    uint8_t *tmp = malloc(count);
    memory_read(s->mem, tmp, buffer, count);

    /* lookup the file descriptor */
    map_iter_t it;
    map_find(s->fd_map, &it, &fd);
    if (!map_at_end(s->fd_map, &it)) {
        size_t written;
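        if (!flag) {
            /* write out the data */
            written = fwrite(tmp, 1, count, map_iter_value(&it, FILE *));
        } else {
            /* Copy up to 4 bytes into a 32-bit integer (needs <string.h>;
             * assumes a little-endian host). Dereferencing tmp directly
             * would print only the lowest byte of the value. */
            int32_t value = 0;
            memcpy(&value, tmp, count < sizeof(value) ? count : sizeof(value));
            written = fprintf(map_iter_value(&it, FILE *), "%d", value);
        }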

        /* return number of bytes written */
        rv_set_reg(rv, rv_reg_a0, written);
    } else {
        /* error */
        rv_set_reg(rv, rv_reg_a0, -1);
    }

    free(tmp);
}

If 0xfff is passed in the a0 register, syscall_write switches to integer mode and uses fprintf rather than fwrite to output the data to stdout. The drawback is that integer mode cannot print to any other file.
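As a minimal sketch of what integer mode has to do with the buffer, the helpers below assemble four bytes into one little-endian 32-bit word and format it as decimal text. The names load_word_le and format_word are hypothetical and do not exist in rv32emu. Note that formatting only the first byte of the buffer gives the right answer only for values from 0 to 255: for 300 (bytes 2c 01 00 00) the first byte alone is 44.

```c
#include <stdint.h>
#include <stdio.h>

/* Assemble a 32-bit value from 4 bytes stored in little-endian order,
 * as the guest's sw instruction leaves them in memory. */
int32_t load_word_le(const uint8_t bytes[4])
{
    uint32_t word = (uint32_t) bytes[0] | ((uint32_t) bytes[1] << 8) |
                    ((uint32_t) bytes[2] << 16) | ((uint32_t) bytes[3] << 24);
    return (int32_t) word;
}

/* Format the word as decimal text into out; returns the character count,
 * which is also what fprintf would report for the same value. */
int format_word(const uint8_t bytes[4], char *out, size_t out_size)
{
    return snprintf(out, out_size, "%d", (int) load_word_le(bytes));
}
```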

After modifying all of the system calls to fit rv32emu's specification, I ran make and got errors (the screenshot of the errors is not available).

The reason is that the toolchain used for rv32emu (riscv-none-elf-as) and Ripes are not fully compatible: the former does not accept some of the syntactic sugar that Ripes allows, so the code can be fixed by simply adding commas between operands.

The system calls were changed to fit rv32emu, and I also changed some single-character strings into chars to reduce memory usage. Here is my modification:

.org 0
.global _start

.set STDOUT, 1
.set SYSEXIT,  93
.set SYSWRITE, 64
.set SYSBRK, 214

.data
arr: .word 1, 2, 3 # a[3] = {1, 2, 3}
len1: .word 3 # array length of a is 3
space: .byte ' ' # space
comma: .byte ','
nline: .byte '\n'

.text
# s1 = arr a base address
# s2 = arr b base address
# s3 = array length of a
# s4 = array length of b
# t0 = i for loopCon1, loopCon2, print

_start:
    la s1, arr          # s1 = address of a
    lw s3, len1         # s3 = length of a
    add s4, s3, s3      # s4 = length of b (a*2)
    add t0, x0, x0      # i = 0
    jal ra, loopCon1
    add t0, x0, x0      # i = 0
    jal ra, loopCon2
    add t0, x0, x0      # i = 0
    jal ra, print
    jal ra, exit

loopCon1:
    add t1, t0, x0          # t1 = i
    add t1, t1, t1          # t1 = t1*2
    add t1, t1, t1          # t1 = t1*2 (t1*4)
    add t1, t1, s1          # address of a[i] (base addr. + 4i)
    lw t2, 0(t1)            # t2 = s1 (content of a[i])
    add t1, t0, x0          # t1 = i
    add t1, t1, t1          # t1 = t1*2
    add t1, t1, t1          # t1 = t1*2 (t1*4)
    add t1, t1, s2          # address of b[i] (base addr. + 4i)
    sw t2, 0(t1)            # b[i] = t2 (content of a[i])
    addi t0, t0, 1          # i++
    blt t0, s3, loopCon1    # if i < length, go to loopCon1
    ret                     # else, return to main

loopCon2:
    add t1, t0, x0          # t1 = i
    add t1, t1, t1            # t1 i*2
    add t1, t1, t1            # t1 i*2 (t1*4)
    add t1, t1, s1            # t1=i*4+base_of_arr
    lw t2, 0(t1)             # t2 = s1 (content of a[i])
    add t1, t0, s3            # t1 = i + length
    add t1, t1, t1            # t1 = t1*2
    add t1, t1, t1            # t1 = t1*2 (t1*4)
    add t1, t1, s2            # t1 = address of b[i+length] (base addr. + 4*(i+length))
    sw t2, 0(t1)             # t2 = content in arr[n+1]
    addi t0, t0, 1            # i++
    blt t0, s3, loopCon2    # if i < length, go to loopCon2
    ret                     # else, return to main

print:
    addi sp, sp, -4
    sw ra, 0(sp)

    lw t2, 0(s2)             # t2 = content of b[i]
    add a1, t2, x0          # load result of array b
    call printInt

    la a1, space            # load string - space
    call printChar
    addi s2, s2, 4          # address move forward
    addi t0, t0, 1          # i++
    blt t0, s4, print
    lw ra, 0(sp)
    addi sp, sp, 4
    ret

exit:
    li a0, 0
    li a7, SYSEXIT           # end
    ecall

# a1 is the value of int
printInt:
    addi sp, sp, -4
    li a0, 0xfff
    sw a1, 0(sp)
    mv a1, sp
    li a2, 4
    li a7, SYSWRITE
    addi sp, sp, 4
    ecall
    ret

# a1 is the address of char
printChar:
    li a0, STDOUT
    li a2, 1
    li a7, SYSWRITE
    ecall
    ret

Compile:

teimeiki@ubuntu:~/rv32emu/hw2$ make
riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o concatenation_of_array.o concatenation_of_array.S
riscv-none-elf-ld -o concatenation_of_array.elf -T concatenation_of_array.ld --oformat=elf32-littleriscv concatenation_of_array.o

Execution and results:

teimeiki@ubuntu:~/rv32emu$ !b
build/rv32emu hw2/concatenation_of_array.elf
1 2 3 1 2 3 inferior exit code 0

The assembly provided here produces the following output on Ripes:

1 2 3 1 2 3 
Program exited with code: 0

Analysis


The CSR cycle count is 225 in the picture, and this implementation has 66 lines of code.

There are some problems in this program:

  • tonych1997 used two functions (loopCon1, loopCon2) to implement what is really an inline for loop in the main function, which costs a lot of time. They stated that the result might be wrong if the two for loops were combined, but after some study I realized that this is because they did not initialize the base address of array 2.

  • Because the register holding that base address is never initialized, array 2 is stored at address 0x00, which is where the code section is! The following shows the instruction memory.


before execution


after execution

  • The former shows the memory contents before execution and the latter after. The stores even change instructions that have not been executed yet. I think that without an instruction cache holding the unmodified instructions, the modified code would be executed and lead to unpredictable behavior.

  • On a modern system this situation is unlikely to happen, because the operating system monitors memory usage and denies invalid memory accesses, causing a segmentation fault.
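The overwrite described in the list above can be reproduced with a toy model. The sketch below is not how Ripes or rv32emu represent memory; it assumes a flat array of words in which the first words stand in for instructions, and the function name store_doubled is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 16

/* Store the doubled array at word index base of a flat memory and return
 * how many words changed. With base == 0 the stores land on the words that
 * stand in for instructions in this toy model. */
size_t store_doubled(uint32_t mem[MEM_WORDS], size_t base,
                     const uint32_t *a, size_t len)
{
    size_t changed = 0;
    for (size_t i = 0; i < len; i++) {
        size_t lo = base + i, hi = base + i + len;
        if (hi >= MEM_WORDS)
            break;              /* stay inside the toy memory */
        if (mem[lo] != a[i])
            changed++;
        if (mem[hi] != a[i])
            changed++;
        mem[lo] = a[i];
        mem[hi] = a[i];
    }
    return changed;
}
```

Filling the memory with a placeholder instruction word and calling store_doubled with base 0 and the array {1, 2, 3} reports six changed words, which matches the idea that the first six instruction words are clobbered when the base register is left at zero.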

Optimization

To solve the problems mentioned above, I combined loopCon1 and loopCon2 into a single inline loop. I also grow the stack in _start to hold the new array.

.org 0
.global _start

.set STDOUT, 1
.set SYSEXIT,  93
.set SYSWRITE, 64
.set SYSBRK, 214

.data
arr: .word 1, 2, 3 # a[3] = {1, 2, 3}
len1: .word 3 # array length of a is 3
space: .byte ' ' # space
comma: .byte ','
nline: .byte '\n'

.text
# s1 = arr a base address
# s2 = arr b base address
# s3 = array length of a
# s4 = array length of b
# t0 = byte offset (4 * i) for for1, forPrint

_start:
    addi sp, sp, -24    # allocate space for b
    la s1, arr
    mv s2, sp           # base address of b
    lw s3, len1
    li t0, 0            # i = 0
    slli t1, s3, 2      # t1 = len(arr) * 4
for1:
    add t2, t0, s1      # t2 = curr address of a
    add t3, t0, s2      # t3 = curr address of b
    add t4, t3, t1      # t4 = curr address of b + len of a
    lw t2, 0(t2)        # t2 = a[i]
    sw t2, 0(t3)        # b[i] = a[i]
    sw t2, 0(t4)        # b[i + len(a)] = a[i]
    addi t0, t0, 4
    blt t0, t1, for1

    li t0, 0
    slli t1, s3, 3      # t1 = len(arr) * 8
forPrint:
    add t2, s2, t0      # t2 = address of b[i]
    lw a1, 0(t2)
    call printInt
    la a1, space
    call printChar
    addi t0, t0, 4
    blt t0, t1, forPrint
    j exit

exit:
    li a0, 0
    li a7, SYSEXIT           # end
    ecall

# a1 is the value of int
printInt:
    addi sp, sp, -4
    li a0, 0xfff
    sw a1, 0(sp)
    mv a1, sp
    li a2, 4
    li a7, SYSWRITE
    addi sp, sp, 4
    ecall
    ret

# a1 is the address of char
printChar:
    li a0, STDOUT
    li a2, 1
    li a7, SYSWRITE
    ecall
    ret


After the optimization, the CSR cycle count is reduced to 142 and the LOC to 41.
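In C terms, the merged for1 loop in the assembly above corresponds roughly to the sketch below. This is an illustration written for this note, not the C source compiled later, and the name concat_twice is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy a into both halves of b; b must have room for 2 * len elements. */
void concat_twice(const int32_t *a, size_t len, int32_t *b)
{
    for (size_t i = 0; i < len; i++) {
        int32_t value = a[i];   /* one load ... */
        b[i] = value;           /* ... feeds both stores */
        b[i + len] = value;
    }
}
```

The point of merging is visible here: each element of a is loaded once and stored twice, instead of being loaded in two separate passes.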

Observation: My implementation of printInt pushes and pops the stack every time it is called, which is unnecessary because no other function is called inside it. Each iteration performs one push and one pop, although only two such operations are necessary. I also want to modify it to avoid the function call overhead of storing the return address.

.org 0
.global _start

.set STDOUT, 1
.set SYSEXIT,  93
.set SYSWRITE, 64
.set SYSBRK, 214

.data
arr1: .word 1, 2, 3 # a[3] = {1, 2, 3}
len1: .word 3 # array length of a is 3
space: .byte ' ' # space
comma: .byte ','
nline: .byte '\n'

.text
# s1 = arr a base address
# s2 = arr b base address
# s3 = array length of a
# s4 = array length of b
# t0 = byte offset (4 * i) for for1, forPrint

_start:
    addi sp, sp, -24    # allocate space for b
    la s1, arr1
    mv s2, sp           # base address of b
    lw s3, len1
    li t0, 0            # i = 0
    slli t1, s3, 2      # t1 = len(arr) * 4
for1:
    add t2, t0, s1      # t2 = curr address of a
    add t3, t0, s2      # t3 = curr address of b
    add t4, t3, t1      # t4 = curr address of b + len of a
    lw t2, 0(t2)        # t2 = a[i]
    sw t2, 0(t3)        # b[i] = a[i]
    sw t2, 0(t4)        # b[i + len(a)] = a[i]
    addi t0, t0, 4
    blt t0, t1, for1

    li t0, 0
    slli t1, s3, 3      # t1 = len(arr) * 8
    li a7, SYSWRITE     # forPrint only issues write system calls, so a7 is set once before the loop
forPrint:
    add t2, s2, t0      # t2 = address of b[i]
    li a0, 0xfff
    mv a1, t2
    li a2, 4
    ecall

    li a0, STDOUT
    la a1, space
    li a2, 1
    ecall
    addi t0, t0, 4
    blt t0, t1, forPrint
    j exit

exit:
    li a0, 0
    li a7, SYSEXIT           # end
    ecall

The cycle count becomes 107 and the LOC becomes 32. I realized that li a7, SYSWRITE can be moved out of the for loop because SYSWRITE is the only system call used there, so there is no need to set it in each iteration. There is also no need to push the value onto the stack, because we already have the address of the value we want to print. After this optimization, the function calls, function returns, stack operations, and a7 setup are eliminated. The instruction count is reduced by 10n - 1, where n is twice the length of the input array.

Compile C Code

To get an executable from the C code, simply type:

riscv-none-elf-gcc concatenation_of_array.c -o concatenation_of_array.elf

Read the header:

teimeiki@ubuntu:~/rv32emu$ riscv-none-elf-readelf -h hw2/concatenation_of_array.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x100c4
  Start of program headers:          52 (bytes into file)
  Start of section headers:          69760 (bytes into file)
  Flags:                             0x1, RVC, soft-float ABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         3
  Size of section headers:           40 (bytes)
  Number of section headers:         15
  Section header string table index: 14

I will explain some of the lines I am interested in.

The first line is the magic number; according to Wikipedia, a magic number is a constant numerical or text value used to identify a file format or protocol. The first byte in this line is 0x7f, followed by the ASCII codes of "ELF" (45 4c 46); the reason for the leading 0x7f is described in this Stack Overflow page. We can see these bytes in the first line of the file's hexadecimal dump with the hexdump command:

teimeiki@ubuntu:~/rv32emu$ hexdump -C mytest/linked_list_cycle.elf | grep 0000000
00000000  7f 45 4c 46 01 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
0000bc60  30 30 30 30 30 30 30 30  30 30 30 30 20 20 20 20  |000000000000    |
0000c510  98 af ff ff 30 30 30 30  30 30 30 30 30 30 30 30  |....000000000000|
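The same check that hexdump lets us do by eye can be written as a few lines of C. This is a minimal sketch; has_elf_magic is a hypothetical helper name, and the caller is assumed to have already read the first bytes of the file into buf.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with the 4-byte ELF magic, 0 otherwise. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};
    return len >= sizeof(magic) && memcmp(buf, magic, sizeof(magic)) == 0;
}
```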

The entry point is the address of _start rather than our main function. The toolchain links a _start function from the C runtime startup code into our program. When execution begins, _start performs the necessary initialization and then invokes main.

For more information, there is a book named ELF Format that details each part of an ELF file.

Read the disassembly file:

 riscv-none-elf-objdump -d hw2/concatenation_of_array.elf
hw2/concatenation_of_array.elf:     file format elf32-littleriscv


Disassembly of section .text:

00010094 <exit>:
   10094:   1141                    addi    sp,sp,-16
   10096:   4581                    li  a1,0
   10098:   c422                    sw  s0,8(sp)
   1009a:   c606                    sw  ra,12(sp)
   1009c:   842a                    mv  s0,a0
   1009e:   4a0020ef            jal ra,1253e <__call_exitprocs>
   100a2:   2281a503            lw  a0,552(gp) # 1da38 <_global_impure_ptr>
   100a6:   5d5c                    lw  a5,60(a0)
   100a8:   c391                    beqz    a5,100ac <exit+0x18>
   100aa:   9782                    jalr    a5
   100ac:   8522                    mv  a0,s0
   100ae:   6cc090ef            jal ra,1977a <_exit>

000100c4 <_start>:
   100c4:   0000d197            auipc   gp,0xd
   100c8:   74c18193            addi    gp,gp,1868 # 1d810 <__global_pointer$>
   100cc:   24418513            addi    a0,gp,580 # 1da54 <completed.1>
   100d0:   57018613            addi    a2,gp,1392 # 1dd80 <__BSS_END__>
   100d4:   8e09                    sub a2,a2,a0
   100d6:   4581                    li  a1,0
   100d8:   22d9                    jal 1029e <memset>
   100da:   00000517            auipc   a0,0x0
   100de:   12250513            addi    a0,a0,290 # 101fc <__libc_fini_array>
   100e2:   2239                    jal 101f0 <atexit>
   100e4:   2a81                    jal 10234 <__libc_init_array>
   100e6:   4502                    lw  a0,0(sp)
   100e8:   004c                    addi    a1,sp,4
   100ea:   4601                    li  a2,0
   100ec:   2891                    jal 10140 <main>
   100ee:   b75d                    j   10094 <exit>
  
00010140 <main>:
   10140:   7139                    addi    sp,sp,-64
   10142:   de06                    sw  ra,60(sp)
   10144:   dc22                    sw  s0,56(sp)
   10146:   0080                    addi    s0,sp,64
   10148:   4785                    li  a5,1
   1014a:   fef42023            sw  a5,-32(s0)
   1014e:   4789                    li  a5,2
   10150:   fef42223            sw  a5,-28(s0)
   10154:   478d                    li  a5,3
   10156:   fef42423            sw  a5,-24(s0)
   1015a:   fe042623            sw  zero,-20(s0)
   1015e:   a0a9                    j   101a8 <main+0x68>
   10160:   fec42783            lw  a5,-20(s0)
   10164:   078a                    slli    a5,a5,0x2
   10166:   17c1                    addi    a5,a5,-16
   10168:   97a2                    add a5,a5,s0
   1016a:   ff07a703            lw  a4,-16(a5)
   1016e:   fec42783            lw  a5,-20(s0)
   10172:   078a                    slli    a5,a5,0x2
   10174:   17c1                    addi    a5,a5,-16
   10176:   97a2                    add a5,a5,s0
   10178:   fce7ac23            sw  a4,-40(a5)
   1017c:   fec42783            lw  a5,-20(s0)
   10180:   00378693            addi    a3,a5,3
   10184:   fec42783            lw  a5,-20(s0)
   10188:   078a                    slli    a5,a5,0x2
   1018a:   17c1                    addi    a5,a5,-16
   1018c:   97a2                    add a5,a5,s0
   1018e:   ff07a703            lw  a4,-16(a5)
   10192:   00269793            slli    a5,a3,0x2
   10196:   17c1                    addi    a5,a5,-16
   10198:   97a2                    add a5,a5,s0
   1019a:   fce7ac23            sw  a4,-40(a5)
   1019e:   fec42783            lw  a5,-20(s0)
   101a2:   0785                    addi    a5,a5,1
   101a4:   fef42623            sw  a5,-20(s0)
   101a8:   fec42703            lw  a4,-20(s0)
   101ac:   4789                    li  a5,2
   101ae:   fae7d9e3            bge a5,a4,10160 <main+0x20>
   101b2:   fe042623            sw  zero,-20(s0)
   101b6:   a015                    j   101da <main+0x9a>
   101b8:   fec42783            lw  a5,-20(s0)
   101bc:   078a                    slli    a5,a5,0x2
   101be:   17c1                    addi    a5,a5,-16
   101c0:   97a2                    add a5,a5,s0
   101c2:   fd87a783            lw  a5,-40(a5)
   101c6:   85be                    mv  a1,a5
   101c8:   67f1                    lui a5,0x1c
   101ca:   8e078513            addi    a0,a5,-1824 # 1b8e0 <__clzsi2+0x74>
   101ce:   2a61                    jal 10366 <printf>
   101d0:   fec42783            lw  a5,-20(s0)
   101d4:   0785                    addi    a5,a5,1
   101d6:   fef42623            sw  a5,-20(s0)
   101da:   fec42703            lw  a4,-20(s0)
   101de:   4795                    li  a5,5
   101e0:   fce7dce3            bge a5,a4,101b8 <main+0x78>
   101e4:   4781                    li  a5,0
   101e6:   853e                    mv  a0,a5
   101e8:   50f2                    lw  ra,60(sp)
   101ea:   5462                    lw  s0,56(sp)
   101ec:   6121                    addi    sp,sp,64
   101ee:   8082                    ret

The disassembly has 17092 lines, which is too long to show, so I only picked out the functions related to our C implementation.

As shown, with this toolchain the compiler automatically uses the compressed extension of RV32 to compile our C code if no flags are specified, but the result is still very large.
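The mix of 2-byte and 4-byte encodings in the dump above can be told apart from the two lowest bits of each instruction: in the RISC-V encoding, 32-bit instructions have those bits set to 11, while compressed instructions use 00, 01, or 10. A minimal sketch (insn_length is a hypothetical name, and longer encodings are not considered):

```c
#include <stdint.h>

/* Length in bytes of the instruction whose first 16-bit parcel is given.
 * Only the 16-bit and 32-bit encodings used by RV32IC are distinguished. */
int insn_length(uint16_t first_parcel)
{
    return (first_parcel & 0x3) == 0x3 ? 4 : 2;
}
```

For example, 0x1141 (addi sp,sp,-16 in the dump) is a 2-byte instruction, while the low half of 0x4a0020ef (the jal to __call_exitprocs) marks a 4-byte one.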

As we can see in the object dump, the toolchain automatically adds a _start function that invokes our main function. When the program is loaded, execution begins at _start, and the address of _start is the same as the entry point address reported by readelf; in this example it is 0x000100c4.

c4 is the offset of the _start function, which we can confirm by printing out the file's bytes. We can also see that RISC-V is little-endian.

0x10000 is the default virtual address at which the program is loaded into RAM, but I could not find the reason this address is used rather than 0x00000000.
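To make the little-endian point concrete, the entry point can be read straight from the header bytes. In an ELF32 header the e_entry field sits at byte offset 24 (after the 16-byte e_ident, 2-byte e_type, 2-byte e_machine, and 4-byte e_version). The sketch below uses a hypothetical helper name, elf32_entry_le, and assumes the header has already been read into memory.

```c
#include <stdint.h>

/* e_ident (16) + e_type (2) + e_machine (2) + e_version (4) */
#define ELF32_E_ENTRY_OFFSET 24

/* Read the 32-bit little-endian entry point from an ELF32 header. */
uint32_t elf32_entry_le(const uint8_t *header)
{
    const uint8_t *p = header + ELF32_E_ENTRY_OFFSET;
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}
```

An entry point of 0x100c4 is therefore stored as the byte sequence c4 00 01 00, with the least significant byte first.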

Comparing Optimization Levels

-march=rv32i, -mabi=ilp32 : use only the base integer instruction set (so every instruction is 4 bytes long) with the ILP32 ABI.
-fdata-sections, -ffunction-sections : place each function and data item in its own section so that unused ones can be separated out.
-Wl,--gc-sections : -Wl tells gcc to pass the arguments after the comma to the linker, and --gc-sections tells the linker not to link unused sections.
References: what is -Wl; how to link used functions only

The pictures show that after removing unused data and functions from the program, the code section shrinks from 74740 to 60120 bytes, the data section from 2816 to 2776 bytes, and the bss section from 812 to 104 bytes.
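To put those sizes in proportion, the relative reduction of each section can be computed as below. The function name reduction_percent is hypothetical; the inputs are the byte counts quoted above.

```c
/* Percentage by which a section shrank, given its size before and after. */
double reduction_percent(long before, long after)
{
    return 100.0 * (double) (before - after) / (double) before;
}
```

With the numbers above this gives roughly 19.6% for the code section, 1.4% for the data section, and 87.2% for the bss section, so most of the relative saving is in bss while most of the absolute saving is in code.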

-O0

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   3b0030ef            jal ra,13458 <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   6800a0ef            jal ra,1a740 <_exit>
   
000100dc <_start>:
   100dc:   0000f197            auipc   gp,0xf
   100e0:   7e418193            addi    gp,gp,2020 # 1f8c0 <__global_pointer$>
   100e4:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   100e8:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   100ec:   40a60633            sub a2,a2,a0
   100f0:   00000593            li  a1,0
   100f4:   28c000ef            jal ra,10380 <memset>
   100f8:   00000517            auipc   a0,0x0
   100fc:   19850513            addi    a0,a0,408 # 10290 <__libc_fini_array>
   10100:   17c000ef            jal ra,1027c <atexit>
   10104:   1e8000ef            jal ra,102ec <__libc_init_array>
   10108:   00012503            lw  a0,0(sp)
   1010c:   00410593            addi    a1,sp,4
   10110:   00000613            li  a2,0
   10114:   070000ef            jal ra,10184 <main>
   10118:   f7dff06f            j   10094 <exit>

00010184 <main>:
   10184:   fc010113            addi    sp,sp,-64
   10188:   02112e23            sw  ra,60(sp)
   1018c:   02812c23            sw  s0,56(sp)
   10190:   04010413            addi    s0,sp,64
   10194:   00100793            li  a5,1
   10198:   fef42023            sw  a5,-32(s0)
   1019c:   00200793            li  a5,2
   101a0:   fef42223            sw  a5,-28(s0)
   101a4:   00300793            li  a5,3
   101a8:   fef42423            sw  a5,-24(s0)
   101ac:   fe042623            sw  zero,-20(s0)
   101b0:   0640006f            j   10214 <main+0x90>
   101b4:   fec42783            lw  a5,-20(s0)
   101b8:   00279793            slli    a5,a5,0x2
   101bc:   ff078793            addi    a5,a5,-16
   101c0:   008787b3            add a5,a5,s0
   101c4:   ff07a703            lw  a4,-16(a5)
   101c8:   fec42783            lw  a5,-20(s0)
   101cc:   00279793            slli    a5,a5,0x2
   101d0:   ff078793            addi    a5,a5,-16
   101d4:   008787b3            add a5,a5,s0
   101d8:   fce7ac23            sw  a4,-40(a5)
   101dc:   fec42783            lw  a5,-20(s0)
   101e0:   00378693            addi    a3,a5,3
   101e4:   fec42783            lw  a5,-20(s0)
   101e8:   00279793            slli    a5,a5,0x2
   101ec:   ff078793            addi    a5,a5,-16
   101f0:   008787b3            add a5,a5,s0
   101f4:   ff07a703            lw  a4,-16(a5)
   101f8:   00269793            slli    a5,a3,0x2
   101fc:   ff078793            addi    a5,a5,-16
   10200:   008787b3            add a5,a5,s0
   10204:   fce7ac23            sw  a4,-40(a5)
   10208:   fec42783            lw  a5,-20(s0)
   1020c:   00178793            addi    a5,a5,1
   10210:   fef42623            sw  a5,-20(s0)
   10214:   fec42703            lw  a4,-20(s0)
   10218:   00200793            li  a5,2
   1021c:   f8e7dce3            bge a5,a4,101b4 <main+0x30>
   10220:   fe042623            sw  zero,-20(s0)
   10224:   0340006f            j   10258 <main+0xd4>
   10228:   fec42783            lw  a5,-20(s0)
   1022c:   00279793            slli    a5,a5,0x2
   10230:   ff078793            addi    a5,a5,-16
   10234:   008787b3            add a5,a5,s0
   10238:   fd87a783            lw  a5,-40(a5)
   1023c:   00078593            mv  a1,a5
   10240:   0001e7b7            lui a5,0x1e
   10244:   2a078513            addi    a0,a5,672 # 1e2a0 <__clzsi2+0x8c>
   10248:   214000ef            jal ra,1045c <printf>
   1024c:   fec42783            lw  a5,-20(s0)
   10250:   00178793            addi    a5,a5,1
   10254:   fef42623            sw  a5,-20(s0)
   10258:   fec42703            lw  a4,-20(s0)
   1025c:   00500793            li  a5,5
   10260:   fce7d4e3            bge a5,a4,10228 <main+0xa4>
   10264:   00000793            li  a5,0
   10268:   00078513            mv  a0,a5
   1026c:   03c12083            lw  ra,60(sp)
   10270:   03812403            lw  s0,56(sp)
   10274:   04010113            addi    sp,sp,64
   10278:   00008067            ret

-O1

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   32c030ef            jal ra,133d4 <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   5fc0a0ef            jal ra,1a6bc <_exit>
   
000100dc <_start>:
   100dc:   0000f197            auipc   gp,0xf
   100e0:   7e418193            addi    gp,gp,2020 # 1f8c0 <__global_pointer$>
   100e4:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   100e8:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   100ec:   40a60633            sub a2,a2,a0
   100f0:   00000593            li  a1,0
   100f4:   208000ef            jal ra,102fc <memset>
   100f8:   00000517            auipc   a0,0x0
   100fc:   11450513            addi    a0,a0,276 # 1020c <__libc_fini_array>
   10100:   0f8000ef            jal ra,101f8 <atexit>
   10104:   164000ef            jal ra,10268 <__libc_init_array>
   10108:   00012503            lw  a0,0(sp)
   1010c:   00410593            addi    a1,sp,4
   10110:   00000613            li  a2,0
   10114:   070000ef            jal ra,10184 <main>
   10118:   f7dff06f            j   10094 <exit>
   
00010184 <main>:
   10184:   fd010113            addi    sp,sp,-48
   10188:   02112623            sw  ra,44(sp)
   1018c:   02812423            sw  s0,40(sp)
   10190:   02912223            sw  s1,36(sp)
   10194:   03212023            sw  s2,32(sp)
   10198:   00100793            li  a5,1
   1019c:   00f12423            sw  a5,8(sp)
   101a0:   00f12a23            sw  a5,20(sp)
   101a4:   00200793            li  a5,2
   101a8:   00f12623            sw  a5,12(sp)
   101ac:   00f12c23            sw  a5,24(sp)
   101b0:   00300793            li  a5,3
   101b4:   00f12823            sw  a5,16(sp)
   101b8:   00f12e23            sw  a5,28(sp)
   101bc:   00810413            addi    s0,sp,8
   101c0:   02010913            addi    s2,sp,32
   101c4:   0001e4b7            lui s1,0x1e
   101c8:   00042583            lw  a1,0(s0)
   101cc:   21848513            addi    a0,s1,536 # 1e218 <__clzsi2+0x88>
   101d0:   208000ef            jal ra,103d8 <printf>
   101d4:   00440413            addi    s0,s0,4
   101d8:   ff2418e3            bne s0,s2,101c8 <main+0x44>
   101dc:   00000513            li  a0,0
   101e0:   02c12083            lw  ra,44(sp)
   101e4:   02812403            lw  s0,40(sp)
   101e8:   02412483            lw  s1,36(sp)
   101ec:   02012903            lw  s2,32(sp)
   101f0:   03010113            addi    sp,sp,48
   101f4:   00008067            ret

-O2

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   32c030ef            jal ra,133d4 <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   5fc0a0ef            jal ra,1a6bc <_exit>

000100c4 <main>:
   100c4:   fd010113            addi    sp,sp,-48
   100c8:   00100693            li  a3,1
   100cc:   00200713            li  a4,2
   100d0:   00300793            li  a5,3
   100d4:   02812423            sw  s0,40(sp)
   100d8:   02912223            sw  s1,36(sp)
   100dc:   03212023            sw  s2,32(sp)
   100e0:   02112623            sw  ra,44(sp)
   100e4:   00d12423            sw  a3,8(sp)
   100e8:   00d12a23            sw  a3,20(sp)
   100ec:   00e12623            sw  a4,12(sp)
   100f0:   00e12c23            sw  a4,24(sp)
   100f4:   00f12823            sw  a5,16(sp)
   100f8:   00f12e23            sw  a5,28(sp)
   100fc:   00810413            addi    s0,sp,8
   10100:   02010913            addi    s2,sp,32
   10104:   0001e4b7            lui s1,0x1e
   10108:   00042583            lw  a1,0(s0)
   1010c:   21848513            addi    a0,s1,536 # 1e218 <__clzsi2+0x88>
   10110:   00440413            addi    s0,s0,4
   10114:   2c4000ef            jal ra,103d8 <printf>
   10118:   ff2418e3            bne s0,s2,10108 <main+0x44>
   1011c:   02c12083            lw  ra,44(sp)
   10120:   02812403            lw  s0,40(sp)
   10124:   02412483            lw  s1,36(sp)
   10128:   02012903            lw  s2,32(sp)
   1012c:   00000513            li  a0,0
   10130:   03010113            addi    sp,sp,48
   10134:   00008067            ret
   
00010150 <_start>:
   10150:   0000f197            auipc   gp,0xf
   10154:   77018193            addi    gp,gp,1904 # 1f8c0 <__global_pointer$>
   10158:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   1015c:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   10160:   40a60633            sub a2,a2,a0
   10164:   00000593            li  a1,0
   10168:   194000ef            jal ra,102fc <memset>
   1016c:   00000517            auipc   a0,0x0
   10170:   0a050513            addi    a0,a0,160 # 1020c <__libc_fini_array>
   10174:   084000ef            jal ra,101f8 <atexit>
   10178:   0f0000ef            jal ra,10268 <__libc_init_array>
   1017c:   00012503            lw  a0,0(sp)
   10180:   00410593            addi    a1,sp,4
   10184:   00000613            li  a2,0
   10188:   f3dff0ef            jal ra,100c4 <main>
   1018c:   f09ff06f            j   10094 <exit>

-O3

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   32c030ef            jal ra,133d4 <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   5fc0a0ef            jal ra,1a6bc <_exit>

000100c4 <main>:
   100c4:   fd010113            addi    sp,sp,-48
   100c8:   00100693            li  a3,1
   100cc:   00200713            li  a4,2
   100d0:   00300793            li  a5,3
   100d4:   02812423            sw  s0,40(sp)
   100d8:   02912223            sw  s1,36(sp)
   100dc:   03212023            sw  s2,32(sp)
   100e0:   02112623            sw  ra,44(sp)
   100e4:   00d12423            sw  a3,8(sp)
   100e8:   00d12a23            sw  a3,20(sp)
   100ec:   00e12623            sw  a4,12(sp)
   100f0:   00e12c23            sw  a4,24(sp)
   100f4:   00f12823            sw  a5,16(sp)
   100f8:   00f12e23            sw  a5,28(sp)
   100fc:   00810413            addi    s0,sp,8
   10100:   02010913            addi    s2,sp,32
   10104:   0001e4b7            lui s1,0x1e
   10108:   00042583            lw  a1,0(s0)
   1010c:   21848513            addi    a0,s1,536 # 1e218 <__clzsi2+0x88>
   10110:   00440413            addi    s0,s0,4
   10114:   2c4000ef            jal ra,103d8 <printf>
   10118:   ff2418e3            bne s0,s2,10108 <main+0x44>
   1011c:   02c12083            lw  ra,44(sp)
   10120:   02812403            lw  s0,40(sp)
   10124:   02412483            lw  s1,36(sp)
   10128:   02012903            lw  s2,32(sp)
   1012c:   00000513            li  a0,0
   10130:   03010113            addi    sp,sp,48
   10134:   00008067            ret

00010150 <_start>:
   10150:   0000f197            auipc   gp,0xf
   10154:   77018193            addi    gp,gp,1904 # 1f8c0 <__global_pointer$>
   10158:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   1015c:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   10160:   40a60633            sub a2,a2,a0
   10164:   00000593            li  a1,0
   10168:   194000ef            jal ra,102fc <memset>
   1016c:   00000517            auipc   a0,0x0
   10170:   0a050513            addi    a0,a0,160 # 1020c <__libc_fini_array>
   10174:   084000ef            jal ra,101f8 <atexit>
   10178:   0f0000ef            jal ra,10268 <__libc_init_array>
   1017c:   00012503            lw  a0,0(sp)
   10180:   00410593            addi    a1,sp,4
   10184:   00000613            li  a2,0
   10188:   f3dff0ef            jal ra,100c4 <main>
   1018c:   f09ff06f            j   10094 <exit>

-Os

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   324030ef            jal ra,133cc <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   5f40a0ef            jal ra,1a6b4 <_exit>

000100c4 <main>:
   100c4:   fd010113            addi    sp,sp,-48
   100c8:   00100793            li  a5,1
   100cc:   00f12423            sw  a5,8(sp)
   100d0:   00f12a23            sw  a5,20(sp)
   100d4:   00200793            li  a5,2
   100d8:   00f12623            sw  a5,12(sp)
   100dc:   00f12c23            sw  a5,24(sp)
   100e0:   00300793            li  a5,3
   100e4:   02812423            sw  s0,40(sp)
   100e8:   02912223            sw  s1,36(sp)
   100ec:   02112623            sw  ra,44(sp)
   100f0:   00f12823            sw  a5,16(sp)
   100f4:   00f12e23            sw  a5,28(sp)
   100f8:   00810413            addi    s0,sp,8
   100fc:   0001e4b7            lui s1,0x1e
   10100:   00042583            lw  a1,0(s0)
   10104:   21048513            addi    a0,s1,528 # 1e210 <__clzsi2+0x88>
   10108:   00440413            addi    s0,s0,4
   1010c:   2c4000ef            jal ra,103d0 <printf>
   10110:   02010793            addi    a5,sp,32
   10114:   fef416e3            bne s0,a5,10100 <main+0x3c>
   10118:   02c12083            lw  ra,44(sp)
   1011c:   02812403            lw  s0,40(sp)
   10120:   02412483            lw  s1,36(sp)
   10124:   00000513            li  a0,0
   10128:   03010113            addi    sp,sp,48
   1012c:   00008067            ret

00010148 <_start>:
   10148:   0000f197            auipc   gp,0xf
   1014c:   77818193            addi    gp,gp,1912 # 1f8c0 <__global_pointer$>
   10150:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   10154:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   10158:   40a60633            sub a2,a2,a0
   1015c:   00000593            li  a1,0
   10160:   194000ef            jal ra,102f4 <memset>
   10164:   00000517            auipc   a0,0x0
   10168:   0a050513            addi    a0,a0,160 # 10204 <__libc_fini_array>
   1016c:   084000ef            jal ra,101f0 <atexit>
   10170:   0f0000ef            jal ra,10260 <__libc_init_array>
   10174:   00012503            lw  a0,0(sp)
   10178:   00410593            addi    a1,sp,4
   1017c:   00000613            li  a2,0
   10180:   f45ff0ef            jal ra,100c4 <main>
   10184:   f11ff06f            j   10094 <exit>

-Ofast

00010094 <exit>:
   10094:   ff010113            addi    sp,sp,-16
   10098:   00000593            li  a1,0
   1009c:   00812423            sw  s0,8(sp)
   100a0:   00112623            sw  ra,12(sp)
   100a4:   00050413            mv  s0,a0
   100a8:   32c030ef            jal ra,133d4 <__call_exitprocs>
   100ac:   2081a503            lw  a0,520(gp) # 1fac8 <_global_impure_ptr>
   100b0:   03c52783            lw  a5,60(a0)
   100b4:   00078463            beqz    a5,100bc <exit+0x28>
   100b8:   000780e7            jalr    a5
   100bc:   00040513            mv  a0,s0
   100c0:   5fc0a0ef            jal ra,1a6bc <_exit>

000100c4 <main>:
   100c4:   fd010113            addi    sp,sp,-48
   100c8:   00100693            li  a3,1
   100cc:   00200713            li  a4,2
   100d0:   00300793            li  a5,3
   100d4:   02812423            sw  s0,40(sp)
   100d8:   02912223            sw  s1,36(sp)
   100dc:   03212023            sw  s2,32(sp)
   100e0:   02112623            sw  ra,44(sp)
   100e4:   00d12423            sw  a3,8(sp)
   100e8:   00d12a23            sw  a3,20(sp)
   100ec:   00e12623            sw  a4,12(sp)
   100f0:   00e12c23            sw  a4,24(sp)
   100f4:   00f12823            sw  a5,16(sp)
   100f8:   00f12e23            sw  a5,28(sp)
   100fc:   00810413            addi    s0,sp,8
   10100:   02010913            addi    s2,sp,32
   10104:   0001e4b7            lui s1,0x1e
   10108:   00042583            lw  a1,0(s0)
   1010c:   21848513            addi    a0,s1,536 # 1e218 <__clzsi2+0x88>
   10110:   00440413            addi    s0,s0,4
   10114:   2c4000ef            jal ra,103d8 <printf>
   10118:   ff2418e3            bne s0,s2,10108 <main+0x44>
   1011c:   02c12083            lw  ra,44(sp)
   10120:   02812403            lw  s0,40(sp)
   10124:   02412483            lw  s1,36(sp)
   10128:   02012903            lw  s2,32(sp)
   1012c:   00000513            li  a0,0
   10130:   03010113            addi    sp,sp,48
   10134:   00008067            ret
   
00010150 <_start>:
   10150:   0000f197            auipc   gp,0xf
   10154:   77018193            addi    gp,gp,1904 # 1f8c0 <__global_pointer$>
   10158:   21c18513            addi    a0,gp,540 # 1fadc <completed.1>
   1015c:   28418613            addi    a2,gp,644 # 1fb44 <__BSS_END__>
   10160:   40a60633            sub a2,a2,a0
   10164:   00000593            li  a1,0
   10168:   194000ef            jal ra,102fc <memset>
   1016c:   00000517            auipc   a0,0x0
   10170:   0a050513            addi    a0,a0,160 # 1020c <__libc_fini_array>
   10174:   084000ef            jal ra,101f8 <atexit>
   10178:   0f0000ef            jal ra,10268 <__libc_init_array>
   1017c:   00012503            lw  a0,0(sp)
   10180:   00410593            addi    a1,sp,4
   10184:   00000613            li  a2,0
   10188:   f3dff0ef            jal ra,100c4 <main>
   1018c:   f09ff06f            j   10094 <exit>

Analysis

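| Level | Text size (bytes) | Cycle count | Address of self-written code | Size of main (bytes) |
|-------|-------------------|-------------|------------------------------|----------------------|
| O0    | 60120             | 4455        | 0x10184                      | 244                  |
| O1    | 59988             | 4104        | 0x10184                      | 112                  |
| O2    | 59988             | 4104        | 0x100c4                      | 112                  |
| O3    | 59988             | 4104        | 0x100c4                      | 112                  |
| Os    | 59988             | 4107        | 0x100c4                      | 104                  |
| Ofast | 59988             | 4104        | 0x100c4                      | 112                  |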
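
The "size of main" column can be checked against the listings with a small helper. This is only a sketch: the function name is my own, and it assumes every instruction is 4 bytes wide, which holds for the dumps above because every encoding shown is 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Every instruction in the listings above is a 32-bit encoding. */
enum { RV32_INSN_BYTES = 4 };

/* Hypothetical helper: size of a function given the addresses of its
 * first and last instruction, counting the last instruction itself. */
uint32_t function_size_bytes(uint32_t first_insn_addr, uint32_t last_insn_addr)
{
    assert(last_insn_addr >= first_insn_addr);
    return last_insn_addr - first_insn_addr + RV32_INSN_BYTES;
}
```

For the -O3 main (0x100c4 up to the ret at 0x10134) this gives 116 bytes, and for the -Os main (ret at 0x1012c) it gives 108 bytes. The 112 and 104 in the table match the plain difference of the two addresses, so they seem to leave out the final ret instruction.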

We can see that main starts at 0x10184 when O0 and O1 are specified. This means the entry point address (0x100c4) does not point to self-written code, and a start routine provided by the library is needed. As I mentioned before, a _start function is usually added to our ELF file, but I first assumed it might be excluded for the sake of optimization. Under that assumption, O2, O3, Os and Ofast would execute main directly without the help of _start, and the size of main seems comparatively large, except with Os. My guess was that although eliminating _start might improve efficiency, main would then have to handle by itself the conditions that could harm the system. I tried to read the assembly code of these main functions, but it is too difficult to understand without the help of the O1 and Os versions.

There is a problem with this reasoning: the entry point may differ from one ELF file to another. After reviewing my homework again, I realized that the _start function exists in every ELF file.
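
Instead of guessing the entry point from the listing, it can be read from the ELF header, where the e_entry field sits at byte offset 24 of a 32-bit header. The following is a minimal sketch that parses it from a buffer holding the first 52 bytes of the file. The function name is hypothetical, and it only handles little-endian ELF32, which is what I assume the rv32 toolchain produces here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    ELF32_EHDR_SIZE = 52,      /* size of the ELF32 file header */
    ELF32_E_ENTRY_OFFSET = 24  /* e_ident[16] + e_type + e_machine + e_version */
};

/* Hypothetical helper: stores e_entry in *entry and returns 0,
 * or returns -1 if the buffer is not a little-endian ELF32 header. */
int elf32_entry(const uint8_t *hdr, size_t len, uint32_t *entry)
{
    assert(hdr != NULL && entry != NULL);
    if (len < ELF32_EHDR_SIZE)
        return -1;
    if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    if (hdr[4] != 1 /* ELFCLASS32 */ || hdr[5] != 1 /* ELFDATA2LSB */)
        return -1;

    /* Assemble the 32-bit little-endian value byte by byte. */
    const uint8_t *p = hdr + ELF32_E_ENTRY_OFFSET;
    *entry = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
             (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    return 0;
}
```

If the entry point is _start, as is usual with the default linker setup, I would expect this to return 0x10150 for the -O3 and -Ofast files and 0x10148 for the -Os file, since that is where _start appears in the listings above.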

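We can also see that from O2 onward the text size and cycle count stay the same (the table shows that O1 already reaches 59988 bytes and 4104 cycles, and Os differs by only 3 cycles), so compiler optimization has its limits here. The code size is also significantly larger than that of the hand-written assembly because of the heavy work done by printf.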
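
As a reading aid for the optimized main listings above, here is roughly what they do, written back in C. This is my own reconstruction from the -O3 dump, not the original source: the six sw instructions store 1, 2, 3, 1, 2, 3 at sp+8 through sp+28, and the loop walks from sp+8 to sp+32 calling printf once per element. The function names are made up, and the format string is an assumption because the dump only shows its address (0x1e218).

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: the real format string is not visible in the dump. */
#define ASSUMED_FMT "%d "

/* Mirrors the six `sw` stores: the compiler has already folded the
 * concatenation, so no copy loop remains in main. */
void store_folded_result(int out[6])
{
    assert(out != NULL);
    out[0] = 1; out[3] = 1;  /* sw a3,8(sp)  / sw a3,20(sp) */
    out[1] = 2; out[4] = 2;  /* sw a4,12(sp) / sw a4,24(sp) */
    out[2] = 3; out[5] = 3;  /* sw a5,16(sp) / sw a5,28(sp) */
}

/* Mirrors the lw / addi / jal printf / bne loop: s0 is the moving
 * pointer and s2 (sp+32) is the end pointer. Returns the call count. */
int print_elements(const int *begin, const int *end)
{
    assert(begin != NULL && begin <= end);
    int calls = 0;
    for (const int *p = begin; p != end; ++p) {
        printf(ASSUMED_FMT, *p);
        ++calls;
    }
    return calls;
}
```

Calling store_folded_result on a six-element array and then print_elements over that array corresponds to the body of the optimized main. The -Os version differs mainly in reusing a5 for all three constants and recomputing the end pointer inside the loop instead of keeping it in s2.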