# **Assignment2: RISC-V Toolchain**
contributed by < [`fan1071221`](https://github.com/fan1071221/Computer_Architecture_2023_hw1) >
Topic: [Indexing of hierachical data structures by CLZ](https://hackmd.io/@scones525/computer_architecture_hw1) by [`scones525`](https://github.com/scones525/Computer-Architecture-2023Fall_NCKU/tree/main/hw1) (林以薰)
## **Indexing of hierachical data structures by CLZ**
> Description: Calculating the Depth using `clz`
```
1
/ \
2 3
/ \ / \
4 5 6 7
```
> The "count leading zeros" function (`clz`) gives the number of leading zeros in the binary representation of a number. In a 64-bit representation, the value 7 is represented as:0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0111
There are 61 leading zeros. Therefore, `clz(7) = 61`.
The depth of the number can be calculated by subtracting the `clz` value from 64 (since it's a 64-bit representation). So,depth = 64 - clz(7) = 3
This means node "7" is at depth 3 in the tree.
>Motivation:
>I haven't encountered questions related to trees for quite some time, so I'm taking this opportunity to review the concepts of trees.Since the computation is direct and doesn't rely on traversals, it offers predictable performance, which can be crucial in real-time systems where consistent response times are necessary.
## **Original Code**
### **C code**
:::spoiler
```c=
#include <stdio.h>
#include <stdint.h>
uint16_t clz(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555 );
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
int main() {
printf("%d\n",clz(0xf1ac00000123));
printf("%d\n",clz(0x1ac12123489));
printf("%d\n",clz(0x100032498));
return 0;
}
```
:::
### **Assembly code**
:::spoiler
```c
.data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x1,0x32498
and1: .word 0x33333333,0x33333333
and2: .word 0x0f0f0f0f,0x0f0f0f0f
and3: .word 0x7f
and4: .word 0x55555555, 0x55555555
and5: .word 0x33333333, 0x33333333
newline: .string "\n"
.text
main:
la a0, num1
jal ra, counting_zero
jal ra, count_ones
jal ra, print_result
la a0, num2
jal ra, counting_zero
jal ra, count_ones
jal ra, print_result
la a0, num3
jal ra, counting_zero
jal ra, count_ones
jal ra, print_result
j exit_program
print_result:
mv a0, t0
li a7, 1
ecall
la a0, newline
li a7, 4
ecall
ret
exit_program:
li a7, 10
ecall
counting_zero:
mv t0, a0
j shift_right_or_equal
jr ra
shift_right_or_equal:
#now data address in a0
#and do shift right and or equal 5 times.
li s0, 1
li s1, 1
li s2, 6
#high 32bit in s3, low 32bit in s4
lw s3, 0(t0)
lw s4, 4(t0)
Loop: #s0 in this loop are : 1,2,4,8,16,32
#t0 : data address
addi s1, s1, 1
li s6, 32
sub s6, s6, s0 # s6 32-n bits, how many bits that mask need to shift
# 31,30,28,24,16,0
sll s7, s3, s6 #upper 32-n bit need to and w/ lower 32 bit
#shift s3, s4
srl s8, s3, s0
srl s9, s4, s0
#add upper32bits shift's bit into lower 32bits
or s9, s9, s7
or s3, s3, s8
or s4, s4, s9
slli s0, s0, 1
ble s1, s2, Loop
mv t0, s3
mv t1, s4
ret
count_ones:
mv s0, t0
mv s1, t1
#x -= ((x >> 1) & 0x5555555555555555 );
slli s7, s0, 31 # s7 store bit that shift to lower 32bits
srli s8, s0, 1 # x >> 1
srli s9, s1, 1
or s9, s9, s7
la t2, and4
lw t3, 0(t2) #0x55555555
lw t4, 4(t2) #0x55555555
and s8, s8, t3 # x and x05555555555555555
and s9, s9, t4
# do sub here s0,s1 - s8,s9
sub s1, s1, s9
sltu a3, s1, s9
sub s0, s0, s8
sub s0, s0, a3
#x = ((x >> 2) & 0x3333333333333333) + (x &0x3333333333333333);
# s2 s3
# x >> 2
slli s7, s0, 30
srli s8, s0, 2
srli s9, s1, 2
or s9, s9, s7
# load 0x3333333333333333
la t2, and5
lw t3, 0(t2)
lw t4, 4(t2)
# and with 0x3333333333333333
and s8, s8, t3
and s9, s9, t4
# x &0x3333333333333333 store in s2, s3
and s2, s0, t3
and s3, s1, t4
# add together they are s8,s9 and s2,s3 respectively
add s1, s9, s3
sltu a3, s1, s3
add s0, s8, s2
add s0, s0, a3
#x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
# x >> 4
slli s7, s0, 28
srli s8, s0, 4
srli s9, s1, 4
or s9, s9, s7
# (x>>4) + x store in s8,s9
add s9, s9, s1
sltu a3, s9, s1
add a4, s8, s0
add s8, a4, a3
# & 0x0f0f0f0f0f0f0f0f
la t2, and2
lw t3, 0(t2)
lw t4, 4(t2)
and s0, s8, t3
and s1, s9, t4
# x += (x >> 8);
slli s7, s0, 24
srli s8, s0, 8
srli s9, s1, 8
or s9, s9, s7
add s1, s9, s1
sltu a3, s1, s9
add s0, s0, s8
add s0, s0, a3
# x += (x >> 16);
slli s7, s0, 16
srli s8, s0, 16
srli s9, s1, 16
or s9, s9, s7
add s1, s9, s1
sltu a3, s1, s9
add s0, s0, s8
add s0, s0, a3
# x += (x >> 32);
mv s8, x0
mv s9, s0
add s1, s9, s1
sltu a3, s1, s9
add s0, s0, s8
add s0, s0, a3
li t0,64
andi t1, s1, 0x7f
sub t0, t0, t1
ret
```
:::
### **Optimization of Leading Zero Count**
In our previous 64-bit version, we utilized numerous shifts and OR operations to compute the number of leading zeros in a 64-bit number.
However, counting the leading zeros in a 64-bit number and a 32-bit number is essentially the same operation.
- **Reason 1**: Regardless of which register we are counting, all we need to do is count the leading zeros either in the upper 32 bits or the lower 32 bits.
- **Reason 2**:
1. If the upper 32 bits are non-zero, the value in the lower 32 bits won't affect the number of leading zeros.
2. If the upper 32 bits are zero, we know that there are 32 leading zeros from this segment, and then we just count the leading zeros in the lower 32 bits.
## **scones525 Optimized code version**
### **Assembly code**
:::spoiler
```c
.data
# num1~num3 are test data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x0,0x32498
and1: .word 0x33333333,0x33333333
and2: .word 0x0f0f0f0f,0x0f0f0f0f
and3: .word 0x7f
and4: .word 0x55555555, 0x55555555
and5: .word 0x33333333, 0x33333333
newline: .string "\n"
.text
main:
la a0, num1
jal ra, counting_zero
jal ra, print_result
la a0, num2
jal ra, counting_zero
jal ra, print_result
la a0, num3
jal ra, counting_zero
jal ra, print_result
j exit_program
print_result:
mv a0, t0
li a7, 1
ecall
la a0, newline
li a7, 4
ecall
ret
exit_program:
li a7, 10
ecall
counting_zero:
mv t0, a0
mv t6, x0 # this is res return value
lw s0, 0(t0)
lw s1, 4(t0)
andi s2, s0, -1
bne x0, s2, start_count
# from here to start_count is when s0 == 0
addi t6, t6, 32
addi s0, s1, 0
start_count:
# count leading zero in s0
srli s1, s0, 1
or s0, s1, s0
srli s1, s0, 2
or s0, s1, s0
srli s1, s0, 4
or s0, s1, s0
srli s1, s0, 8
or s0, s1, s0
srli s1, s0, 16
or s0, s1, s0
# x -= ((x >> 1) & 0x55555555);
la s2, and4 # read 0x55555555 address
lw s3, 0(s2) # load 0x55555555
srli s1, s0, 1
and s1, s1, s3
sub s0, s0, s1
# x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
la s2, and5
lw s3, 0(s2)
srli s1, s0, 2
and s1, s1, s3
and s2, s0, s3
add s0, s1, s2
# x = ((x >> 4) + x) & 0x0f0f0f0f;
la s2, and2
lw s3, 0(s2)
srli s1, s0, 4
add s1, s1, s0
and s0, s1, s3
# x += (x >> 8);
srli s1, s0, 8
add s0, s0, s1
# x += (x >> 16);
srli s1, s0, 16
add s0, s0, s1
li s1, 32
la s2, and3
lw s3, 0(s2)
and s0, s0, s3
add t6, t6, s1
sub a1, t6, s0
mv t0, a1
ret
```
:::
### **scones525 Execution info(Add get_cycle function)**
```c
#include <stdio.h>
#include <stdint.h>
uint32_t get_cycle() {
uint32_t cycle_value;
asm volatile ("rdcycle %0" : "=r"(cycle_value));
return cycle_value;
}
uint16_t clz(uint64_t x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555 );
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
int main() {
uint32_t startcycle = get_cycle();
printf("%d\n",clz(0xf1ac00000123));
printf("%d\n",clz(0x1ac12123489));
printf("%d\n",clz(0x100032498));
uint32_t endcycle = get_cycle();
printf("RDCYCLE value: %u\n", endcycle - startcycle);
return 0;
}
```
```c
16
23
31
RDCYCLE value: 5987
inferior exit code 0
```
<s>

</s>
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
---
## **Manual optimization**
### **Optimization 1: Data Segment Consolidation**
#### **scones525 Version:**
- Utilized 5 distinct data labels (`and1` through `and5`) to store the bit masks.
#### **Optimized Version:**
- Consolidated all bit masks into a single contiguous memory location named `and_mask`.
This consolidation reduces the number of data labels, streamlines the mask retrieval process, and makes the code more concise.
### **Optimization 2: Bit Mask Loading**
#### **scones525 Version:**
- Multiple `la` and `lw` instructions were employed to load each mask individually.
#### **Optimized Version:**
- All masks are loaded at once into consecutive registers (from `t2` to `t5`).
This reduces the number of instructions and enhances the efficiency of mask loading.
### **Manual optimization Assembly code:**
```c
.data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x0,0x32498
and_mask: .word 0x55555555, 0x33333333, 0x0f0f0f0f, 0x7f
newline: .string "\n"
.text
main:
la a0, num1
jal ra, counting_zero
jal ra, print_result
la a0, num2
jal ra, counting_zero
jal ra, print_result
la a0, num3
jal ra, counting_zero
jal ra, print_result
j exit_program
print_result:
mv a0, t0
li a7, 1
ecall
la a0, newline
li a7, 4
ecall
ret
exit_program:
li a7, 10
ecall
counting_zero:
lw s0, 0(a0)
lw s1, 4(a0)
mv t6, x0 # this is res return value
bnez s0, start_count
addi t6, t6, 32
mv s0, s1
start_count:
la t1, and_mask
lw t2, 0(t1)
lw t3, 4(t1)
lw t4, 8(t1)
lw t5, 12(t1)
# count leading zero in s0
srli s1, s0, 1
or s0, s0, s1
srli s1, s0, 2
or s0, s0, s1
srli s1, s0, 4
or s0, s0, s1
srli s1, s0, 8
or s0, s0, s1
srli s1, s0, 16
or s0, s0, s1
srli s1, s0, 1
and s1, s1, t2
sub s0, s0, s1
srli s1, s0, 2
and s1, s1, t3
and s2, s0, t3
add s0, s1, s2
srli s1, s0, 4
add s1, s1, s0
and s0, s1, t4
srli s1, s0, 8
add s0, s0, s1
srli s1, s0, 16
add s0, s0, s1
and s0, s0, t5
addi t6, t6, 32
sub t0, t6, s0
ret
```
### **Manual optimization Execution info**

---
## **Running RISC-V Assembly on rv32emu**
When transitioning from Ripes to rv32emu, developers might encounter some specific challenges.
### **1. Entry Symbol Issue**
In Ripes, the program starts its execution from the main label. However, rv32emu expects an entry symbol named _start.
#### **Error message:**
```
riscv-none-elf-ld: warning: cannot find entry symbol _start; defaulting to 00000000
```
#### **Solution:**
Declare a global _start at the beginning of your file and replace main with _start.
```c
.global _start
_start:
```
### **2. System Calls (Syscalls)**
#### **[rv32emu System Calls](https://github.com/sysprog21/rv32emu/blob/master/docs/syscall.md)**
| System Call Number | Name | Description |
|:------------------:|:----------------:|:-------------------------------------------------------------------------------------:|
| 57 | `close` | Deletes a descriptor from the per-process object reference table. |
| 62 | `lseek` | Repositions the file offset of the open file description to the argument offset according to the directive `whence`.|
| 63 | `read` | Reads specific bytes of data from the object referenced by the descriptor `fildes` into the buffer.|
| 64 | `write` | Prints the buffer as a string to the specified file descriptor. |
| 80 | `fstat` | No effect. |
| 93 | `exit` | Terminates with a status code. |
| 169 | `gettimeofday` | Gets date and time. Current time zone is NOT obtained. |
| 214 | `brk` | Supports updating the program break and returning the current program break. |
| 403 | `clock_gettime` | Retrieves the value used by a clock which is specified by `clock-id`. |
| 1024 | `open` | Opens or creates a file for reading or writing. |
### **3. Handling Numbers Greater than 9**
Since rv32emu can only print single digits (0-9) directly, numbers greater than 9 require special handling. To print numbers with two digits, we can extract the tens and ones place separately.
### **Modify to run on rv32emu**
The following is the code I have modified, which can be executed on rv32emu.
#### **scones525 version**
:::spoiler
```c=
.org 0
.global _start
.set SYSEXIT, 93
.set SYSWRITE, 64
.data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x0,0x32498
and1: .word 0x33333333,0x33333333
and2: .word 0x0f0f0f0f,0x0f0f0f0f
and3: .word 0x7f
and4: .word 0x55555555, 0x55555555
and5: .word 0x33333333, 0x33333333
msg_string: .string "\n"
.set msg_size, .-msg_string
.text
_start:
la a0, num1
jal ra, counting_zero
jal ra, print_result
la a0, num2
jal ra, counting_zero
jal ra, print_result
la a0, num3
jal ra, counting_zero
jal ra, print_result
j exit_program
print_result:
addi sp, sp, -8 # Reserve space on the stack for two digits
# Divide the number by 10 to get the tens place
li t1, 10 # divisor
li t2, 0 # quotient for tens place
divide_tens:
blt t0, t1, done_divide_tens
sub t0, t0, t1
addi t2, t2, 1
j divide_tens
done_divide_tens:
# At this point, t2 contains the quotient for tens place
# Convert the tens place to ASCII and store on the stack
addi t2, t2, 48
sw t2, 0(sp)
# Convert the ones place to ASCII and store on the stack
addi t0, t0, 48
sw t0, 4(sp)
# Print the tens place if it's not 0
lw t2, 0(sp)
bnez t2, print_tens
skip_tens:
# Print the ones place
lw t0, 4(sp)
li a0, 1 # 1 = standard output (stdout)
addi a1, sp, 4 # pass the address of ones place to a1
li a2, 1 # length of value digit
li a7, SYSWRITE # "write" syscall
ecall
j print_newline
print_tens:
li a0, 1
addi a1, sp, 0 # pass the address of tens place to a1
li a2, 1
li a7, SYSWRITE
ecall
j skip_tens
print_newline:
li a7, SYSWRITE
la a1, msg_string # load address of newline string
li a2, msg_size # length of newline string
ecall # invoke syscall to print the string
addi sp, sp, 8 # Restore the stack pointer
ret
exit_program:
li a7, SYSEXIT
ecall
counting_zero:
mv t0, a0
mv t6, x0 # this is res return value
lw s0, 0(t0)
lw s1, 4(t0)
andi s2, s0, -1
bne x0, s2, start_count
addi t6, t6, 32
addi s0, s1, 0
start_count:
srli s1, s0, 1
or s0, s1, s0
srli s1, s0, 2
or s0, s1, s0
srli s1, s0, 4
or s0, s1, s0
srli s1, s0, 8
or s0, s1, s0
srli s1, s0, 16
or s0, s1, s0
la s2, and4
lw s3, 0(s2)
srli s1, s0, 1
and s1, s1, s3
sub s0, s0, s1
la s2, and5
lw s3, 0(s2)
srli s1, s0, 2
and s1, s1, s3
and s2, s0, s3
add s0, s1, s2
la s2, and2
lw s3, 0(s2)
srli s1, s0, 4
add s1, s1, s0
and s0, s1, s3
srli s1, s0, 8
add s0, s0, s1
srli s1, s0, 16
add s0, s0, s1
li s1, 32
la s2, and3
lw s3, 0(s2)
and s0, s0, s3
add t6, t6, s1
sub a1, t6, s0
mv t0, a1
ret
```
:::
#### **Manual optimization**
```c
.org 0
.global _start
.set SYSEXIT, 93
.set SYSWRITE, 64
.data
num1: .word 0xf1ac,0x123
num2: .word 0x1ac,0x12123489
num3: .word 0x0,0x32498
and_mask: .word 0x55555555, 0x33333333, 0x0f0f0f0f, 0x7f
msg_string: .string "\n"
.set msg_size, .-msg_string
.text
_start:
la a0, num1
jal ra, counting_zero
jal ra, print_result
la a0, num2
jal ra, counting_zero
jal ra, print_result
la a0, num3
jal ra, counting_zero
jal ra, print_result
j exit_program
print_result:
addi sp, sp, -8 # Reserve space on the stack for two digits
# Divide the number by 10 to get the tens place
li t1, 10 # divisor
li t2, 0 # quotient for tens place
divide_tens:
blt t0, t1, done_divide_tens
sub t0, t0, t1
addi t2, t2, 1
j divide_tens
done_divide_tens:
# At this point, t2 contains the quotient for tens place
# Convert the tens place to ASCII and store on the stack
addi t2, t2, 48
sw t2, 0(sp)
# Convert the ones place to ASCII and store on the stack
addi t0, t0, 48
sw t0, 4(sp)
# Print the tens place if it's not 0
lw t2, 0(sp)
bnez t2, print_tens
skip_tens:
# Print the ones place
lw t0, 4(sp)
li a0, 1 # 1 = standard output (stdout)
addi a1, sp, 4 # pass the address of ones place to a1
li a2, 1 # length of value digit
li a7, SYSWRITE # "write" syscall
ecall
j print_newline
print_tens:
li a0, 1
addi a1, sp, 0 # pass the address of tens place to a1
li a2, 1
li a7, SYSWRITE
ecall
j skip_tens
print_newline:
li a7, SYSWRITE
la a1, msg_string # load address of newline string
li a2, msg_size # length of newline string
ecall # invoke syscall to print the string
addi sp, sp, 8 # Restore the stack pointer
ret
exit_program:
li a7, SYSEXIT
ecall
counting_zero:
lw s0, 0(a0)
lw s1, 4(a0)
mv t6, x0 # this is res return value
bnez s0, start_count
addi t6, t6, 32
mv s0, s1
start_count:
la t1, and_mask
lw t2, 0(t1)
lw t3, 4(t1)
lw t4, 8(t1)
lw t5, 12(t1)
# count leading zero in s0
srli s1, s0, 1
or s0, s0, s1
srli s1, s0, 2
or s0, s0, s1
srli s1, s0, 4
or s0, s0, s1
srli s1, s0, 8
or s0, s0, s1
srli s1, s0, 16
or s0, s0, s1
srli s1, s0, 1
and s1, s1, t2
sub s0, s0, s1
srli s1, s0, 2
and s1, s1, t3
and s2, s0, t3
add s0, s1, s2
srli s1, s0, 4
add s1, s1, s0
and s0, s1, t4
srli s1, s0, 8
add s0, s0, s1
srli s1, s0, 16
add s0, s0, s1
and s0, s0, t5
addi t6, t6, 32
sub t0, t6, s0
ret
```
## **Optimized by riscv-none-elf-gcc**
I will compare -O0, -O1, -O2, -O3, -Os, -Ofast as following.
### **-O0 Optimized Assembly code**
* **Compile**
```
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2.c -o hw2_O0.elf
```
* **Size**
```
$ riscv-none-elf-size hw2_O0.elf
text data bss dec hex filename
76500 2320 1528 80348 139dc hw2_O0.elf
```
* **Display the ELF file header**
```
$ riscv-none-elf-readelf -h hw2_O0.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94492 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
### **Disassembly code**
:::spoiler
```
00010180 <clz>:
10180: f7010113 add sp,sp,-144
10184: 08812623 sw s0,140(sp)
10188: 09212423 sw s2,136(sp)
1018c: 09312223 sw s3,132(sp)
10190: 09412023 sw s4,128(sp)
10194: 07512e23 sw s5,124(sp)
10198: 07612c23 sw s6,120(sp)
1019c: 07712a23 sw s7,116(sp)
101a0: 07812823 sw s8,112(sp)
101a4: 07912623 sw s9,108(sp)
101a8: 07a12423 sw s10,104(sp)
101ac: 07b12223 sw s11,100(sp)
101b0: 09010413 add s0,sp,144
101b4: fca42423 sw a0,-56(s0)
101b8: fcb42623 sw a1,-52(s0)
101bc: fcc42783 lw a5,-52(s0)
101c0: 01f79713 sll a4,a5,0x1f
101c4: fc842783 lw a5,-56(s0)
101c8: 0017d613 srl a2,a5,0x1
101cc: 00c76633 or a2,a4,a2
101d0: fcc42783 lw a5,-52(s0)
101d4: 0017d693 srl a3,a5,0x1
101d8: fc842783 lw a5,-56(s0)
101dc: 00c7ee33 or t3,a5,a2
101e0: fcc42783 lw a5,-52(s0)
101e4: 00d7eeb3 or t4,a5,a3
101e8: fdc42423 sw t3,-56(s0)
101ec: fdd42623 sw t4,-52(s0)
101f0: fcc42783 lw a5,-52(s0)
101f4: 01e79713 sll a4,a5,0x1e
101f8: fc842783 lw a5,-56(s0)
101fc: 0027d813 srl a6,a5,0x2
10200: 01076833 or a6,a4,a6
10204: fcc42783 lw a5,-52(s0)
10208: 0027d893 srl a7,a5,0x2
1020c: fc842783 lw a5,-56(s0)
10210: 0107e7b3 or a5,a5,a6
10214: f8f42823 sw a5,-112(s0)
10218: fcc42783 lw a5,-52(s0)
1021c: 0117e7b3 or a5,a5,a7
10220: f8f42a23 sw a5,-108(s0)
10224: f9042783 lw a5,-112(s0)
10228: f9442803 lw a6,-108(s0)
1022c: fcf42423 sw a5,-56(s0)
10230: fd042623 sw a6,-52(s0)
10234: fcc42783 lw a5,-52(s0)
10238: 01c79713 sll a4,a5,0x1c
1023c: fc842783 lw a5,-56(s0)
10240: 0047d313 srl t1,a5,0x4
10244: 00676333 or t1,a4,t1
10248: fcc42783 lw a5,-52(s0)
1024c: 0047d393 srl t2,a5,0x4
10250: fc842783 lw a5,-56(s0)
10254: 0067e7b3 or a5,a5,t1
10258: f8f42423 sw a5,-120(s0)
1025c: fcc42783 lw a5,-52(s0)
10260: 0077e7b3 or a5,a5,t2
10264: f8f42623 sw a5,-116(s0)
10268: f8842783 lw a5,-120(s0)
1026c: f8c42803 lw a6,-116(s0)
10270: fcf42423 sw a5,-56(s0)
10274: fd042623 sw a6,-52(s0)
10278: fcc42783 lw a5,-52(s0)
1027c: 01879713 sll a4,a5,0x18
10280: fc842783 lw a5,-56(s0)
10284: 0087d913 srl s2,a5,0x8
10288: 01276933 or s2,a4,s2
1028c: fcc42783 lw a5,-52(s0)
10290: 0087d993 srl s3,a5,0x8
10294: fc842783 lw a5,-56(s0)
10298: 0127e7b3 or a5,a5,s2
1029c: f8f42023 sw a5,-128(s0)
102a0: fcc42783 lw a5,-52(s0)
102a4: 0137e7b3 or a5,a5,s3
102a8: f8f42223 sw a5,-124(s0)
102ac: f8042783 lw a5,-128(s0)
102b0: f8442803 lw a6,-124(s0)
102b4: fcf42423 sw a5,-56(s0)
102b8: fd042623 sw a6,-52(s0)
102bc: fcc42783 lw a5,-52(s0)
102c0: 01079713 sll a4,a5,0x10
102c4: fc842783 lw a5,-56(s0)
102c8: 0107d793 srl a5,a5,0x10
102cc: fcf42023 sw a5,-64(s0)
102d0: fc042783 lw a5,-64(s0)
102d4: 00f767b3 or a5,a4,a5
102d8: fcf42023 sw a5,-64(s0)
102dc: fcc42783 lw a5,-52(s0)
102e0: 0107d793 srl a5,a5,0x10
102e4: fcf42223 sw a5,-60(s0)
102e8: fc842783 lw a5,-56(s0)
102ec: fc042603 lw a2,-64(s0)
102f0: fc442683 lw a3,-60(s0)
102f4: 00060713 mv a4,a2
102f8: 00e7e7b3 or a5,a5,a4
102fc: f6f42c23 sw a5,-136(s0)
10300: fcc42783 lw a5,-52(s0)
10304: 00068713 mv a4,a3
10308: 00e7e7b3 or a5,a5,a4
1030c: f6f42e23 sw a5,-132(s0)
10310: f7842783 lw a5,-136(s0)
10314: f7c42803 lw a6,-132(s0)
10318: fcf42423 sw a5,-56(s0)
1031c: fd042623 sw a6,-52(s0)
10320: fcc42783 lw a5,-52(s0)
10324: 0007d793 srl a5,a5,0x0
10328: faf42c23 sw a5,-72(s0)
1032c: fa042e23 sw zero,-68(s0)
10330: fc842783 lw a5,-56(s0)
10334: fb842603 lw a2,-72(s0)
10338: fbc42683 lw a3,-68(s0)
1033c: 00060713 mv a4,a2
10340: 00e7e7b3 or a5,a5,a4
10344: f6f42823 sw a5,-144(s0)
10348: fcc42783 lw a5,-52(s0)
1034c: 00068713 mv a4,a3
10350: 00e7e7b3 or a5,a5,a4
10354: f6f42a23 sw a5,-140(s0)
10358: f7042783 lw a5,-144(s0)
1035c: f7442803 lw a6,-140(s0)
10360: fcf42423 sw a5,-56(s0)
10364: fd042623 sw a6,-52(s0)
10368: fcc42783 lw a5,-52(s0)
1036c: 01f79793 sll a5,a5,0x1f
10370: fc842703 lw a4,-56(s0)
10374: 00175d13 srl s10,a4,0x1
10378: 01a7ed33 or s10,a5,s10
1037c: fcc42783 lw a5,-52(s0)
10380: 0017dd93 srl s11,a5,0x1
10384: 555557b7 lui a5,0x55555
10388: 55578793 add a5,a5,1365 # 55555555 <__BSS_END__+0x55531645>
1038c: 00fd77b3 and a5,s10,a5
10390: faf42823 sw a5,-80(s0)
10394: 555557b7 lui a5,0x55555
10398: 55578793 add a5,a5,1365 # 55555555 <__BSS_END__+0x55531645>
1039c: 00fdf7b3 and a5,s11,a5
103a0: faf42a23 sw a5,-76(s0)
103a4: fc842603 lw a2,-56(s0)
103a8: fcc42683 lw a3,-52(s0)
103ac: fb042803 lw a6,-80(s0)
103b0: fb442883 lw a7,-76(s0)
103b4: 00080593 mv a1,a6
103b8: 40b60733 sub a4,a2,a1
103bc: 00070593 mv a1,a4
103c0: 00b635b3 sltu a1,a2,a1
103c4: 00088513 mv a0,a7
103c8: 40a687b3 sub a5,a3,a0
103cc: 40b786b3 sub a3,a5,a1
103d0: 00068793 mv a5,a3
103d4: fce42423 sw a4,-56(s0)
103d8: fcf42623 sw a5,-52(s0)
103dc: fcc42783 lw a5,-52(s0)
103e0: 01e79793 sll a5,a5,0x1e
103e4: fc842703 lw a4,-56(s0)
103e8: 00275c13 srl s8,a4,0x2
103ec: 0187ec33 or s8,a5,s8
103f0: fcc42783 lw a5,-52(s0)
103f4: 0027dc93 srl s9,a5,0x2
103f8: 333337b7 lui a5,0x33333
103fc: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3330f423>
10400: 00fc77b3 and a5,s8,a5
10404: faf42423 sw a5,-88(s0)
10408: 333337b7 lui a5,0x33333
1040c: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3330f423>
10410: 00fcf7b3 and a5,s9,a5
10414: faf42623 sw a5,-84(s0)
10418: fc842703 lw a4,-56(s0)
1041c: 333337b7 lui a5,0x33333
10420: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3330f423>
10424: 00f777b3 and a5,a4,a5
10428: faf42023 sw a5,-96(s0)
1042c: fcc42703 lw a4,-52(s0)
10430: 333337b7 lui a5,0x33333
10434: 33378793 add a5,a5,819 # 33333333 <__BSS_END__+0x3330f423>
10438: 00f777b3 and a5,a4,a5
1043c: faf42223 sw a5,-92(s0)
10440: fa842503 lw a0,-88(s0)
10444: fac42583 lw a1,-84(s0)
10448: 00050693 mv a3,a0
1044c: fa042803 lw a6,-96(s0)
10450: fa442883 lw a7,-92(s0)
10454: 00080613 mv a2,a6
10458: 00c68733 add a4,a3,a2
1045c: 00070693 mv a3,a4
10460: 00050613 mv a2,a0
10464: 00c6b6b3 sltu a3,a3,a2
10468: 00058613 mv a2,a1
1046c: 00088593 mv a1,a7
10470: 00b607b3 add a5,a2,a1
10474: 00f686b3 add a3,a3,a5
10478: 00068793 mv a5,a3
1047c: fce42423 sw a4,-56(s0)
10480: fcf42623 sw a5,-52(s0)
10484: fcc42783 lw a5,-52(s0)
10488: 01c79793 sll a5,a5,0x1c
1048c: fc842703 lw a4,-56(s0)
10490: 00475f13 srl t5,a4,0x4
10494: 01e7ef33 or t5,a5,t5
10498: fcc42783 lw a5,-52(s0)
1049c: 0047df93 srl t6,a5,0x4
104a0: fc842603 lw a2,-56(s0)
104a4: fcc42683 lw a3,-52(s0)
104a8: 00cf0733 add a4,t5,a2
104ac: 00070593 mv a1,a4
104b0: 01e5b5b3 sltu a1,a1,t5
104b4: 00df87b3 add a5,t6,a3
104b8: 00f586b3 add a3,a1,a5
104bc: 00068793 mv a5,a3
104c0: 0f0f16b7 lui a3,0xf0f1
104c4: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0ccfff>
104c8: 00d776b3 and a3,a4,a3
104cc: fcd42423 sw a3,-56(s0)
104d0: 0f0f16b7 lui a3,0xf0f1
104d4: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0ccfff>
104d8: 00d7f7b3 and a5,a5,a3
104dc: fcf42623 sw a5,-52(s0)
104e0: fcc42783 lw a5,-52(s0)
104e4: 01879793 sll a5,a5,0x18
104e8: fc842703 lw a4,-56(s0)
104ec: 00875b13 srl s6,a4,0x8
104f0: 0167eb33 or s6,a5,s6
104f4: fcc42783 lw a5,-52(s0)
104f8: 0087db93 srl s7,a5,0x8
104fc: fc842603 lw a2,-56(s0)
10500: fcc42683 lw a3,-52(s0)
10504: 01660733 add a4,a2,s6
10508: 00070593 mv a1,a4
1050c: 00c5b5b3 sltu a1,a1,a2
10510: 017687b3 add a5,a3,s7
10514: 00f586b3 add a3,a1,a5
10518: 00068793 mv a5,a3
1051c: fce42423 sw a4,-56(s0)
10520: fcf42623 sw a5,-52(s0)
10524: fcc42783 lw a5,-52(s0)
10528: 01079793 sll a5,a5,0x10
1052c: fc842703 lw a4,-56(s0)
10530: 01075a13 srl s4,a4,0x10
10534: 0147ea33 or s4,a5,s4
10538: fcc42783 lw a5,-52(s0)
1053c: 0107da93 srl s5,a5,0x10
10540: fc842603 lw a2,-56(s0)
10544: fcc42683 lw a3,-52(s0)
10548: 01460733 add a4,a2,s4
1054c: 00070593 mv a1,a4
10550: 00c5b5b3 sltu a1,a1,a2
10554: 015687b3 add a5,a3,s5
10558: 00f586b3 add a3,a1,a5
1055c: 00068793 mv a5,a3
10560: fce42423 sw a4,-56(s0)
10564: fcf42623 sw a5,-52(s0)
10568: fcc42783 lw a5,-52(s0)
1056c: 0007d793 srl a5,a5,0x0
10570: f8f42c23 sw a5,-104(s0)
10574: f8042e23 sw zero,-100(s0)
10578: fc842603 lw a2,-56(s0)
1057c: fcc42683 lw a3,-52(s0)
10580: f9842803 lw a6,-104(s0)
10584: f9c42883 lw a7,-100(s0)
10588: 00080593 mv a1,a6
1058c: 00b60733 add a4,a2,a1
10590: 00070593 mv a1,a4
10594: 00c5b5b3 sltu a1,a1,a2
10598: 00088513 mv a0,a7
1059c: 00a687b3 add a5,a3,a0
105a0: 00f586b3 add a3,a1,a5
105a4: 00068793 mv a5,a3
105a8: fce42423 sw a4,-56(s0)
105ac: fcf42623 sw a5,-52(s0)
105b0: fc845783 lhu a5,-56(s0)
105b4: 07f7f793 and a5,a5,127
105b8: 01079793 sll a5,a5,0x10
105bc: 0107d793 srl a5,a5,0x10
105c0: 04000713 li a4,64
105c4: 40f707b3 sub a5,a4,a5
105c8: 01079793 sll a5,a5,0x10
105cc: 0107d793 srl a5,a5,0x10
105d0: 00078513 mv a0,a5
105d4: 08c12403 lw s0,140(sp)
105d8: 08812903 lw s2,136(sp)
105dc: 08412983 lw s3,132(sp)
105e0: 08012a03 lw s4,128(sp)
105e4: 07c12a83 lw s5,124(sp)
105e8: 07812b03 lw s6,120(sp)
105ec: 07412b83 lw s7,116(sp)
105f0: 07012c03 lw s8,112(sp)
105f4: 06c12c83 lw s9,108(sp)
105f8: 06812d03 lw s10,104(sp)
105fc: 06412d83 lw s11,100(sp)
10600: 09010113 add sp,sp,144
10604: 00008067 ret
```
:::
### **Assembly Analysis of `-O0` Optimized Code**
#### **1. Extensive Stack Operations**
At the start and end, there are numerous stack operations. These operations are used to save and restore caller-saved registers. With `-O0`, the compiler typically does not attempt to reduce these, even if they might be unnecessary.
#### **2. Redundant Store and Load Operations**
For many operations, we see values being stored to the stack and then immediately being loaded back. These operations would typically be eliminated at higher optimization levels as the compiler would recognize them as unnecessary.
#### **3. Direct Flow of Data**
Instructions execute in the order of the C code, with minimal branching or jumping. This means the compiler has not tried to reorder instructions for better performance.
## **O1 Optimized Assembly code**
* **Compile**
```
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 hw2.c -o hw2_O1.elf
```
* **Text**
```
$ riscv-none-elf-size hw2_O1.elf
text data bss dec hex filename
75632 2320 1528 78480 13678 hw2_O1.elf
```
* **Display the ELF file header**
```
$ riscv-none-elf-readelf -h hw2_O1.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94492 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
### **Disassemble code**
:::spoiler
```
00010180 <clz>:
10180: 01f59713 sll a4,a1,0x1f
10184: 00155793 srl a5,a0,0x1
10188: 00f767b3 or a5,a4,a5
1018c: 0015d713 srl a4,a1,0x1
10190: 00a7e533 or a0,a5,a0
10194: 00b765b3 or a1,a4,a1
10198: 01e59713 sll a4,a1,0x1e
1019c: 00255793 srl a5,a0,0x2
101a0: 00f767b3 or a5,a4,a5
101a4: 0025d613 srl a2,a1,0x2
101a8: 00a7e533 or a0,a5,a0
101ac: 00b66633 or a2,a2,a1
101b0: 01c61713 sll a4,a2,0x1c
101b4: 00455793 srl a5,a0,0x4
101b8: 00f767b3 or a5,a4,a5
101bc: 00465693 srl a3,a2,0x4
101c0: 00a7e733 or a4,a5,a0
101c4: 00c6e6b3 or a3,a3,a2
101c8: 01869613 sll a2,a3,0x18
101cc: 00875793 srl a5,a4,0x8
101d0: 00f667b3 or a5,a2,a5
101d4: 0086d613 srl a2,a3,0x8
101d8: 00e7e7b3 or a5,a5,a4
101dc: 00d66633 or a2,a2,a3
101e0: 01061713 sll a4,a2,0x10
101e4: 0107d693 srl a3,a5,0x10
101e8: 00d766b3 or a3,a4,a3
101ec: 01065713 srl a4,a2,0x10
101f0: 00f6e6b3 or a3,a3,a5
101f4: 00c76733 or a4,a4,a2
101f8: 00d766b3 or a3,a4,a3
101fc: 01f71613 sll a2,a4,0x1f
10200: 0016d793 srl a5,a3,0x1
10204: 00f667b3 or a5,a2,a5
10208: 00175593 srl a1,a4,0x1
1020c: 55555637 lui a2,0x55555
10210: 55560613 add a2,a2,1365 # 55555555 <__BSS_END__+0x55531645>
10214: 00c7f7b3 and a5,a5,a2
10218: 00c5f633 and a2,a1,a2
1021c: 40f687b3 sub a5,a3,a5
10220: 00f6b6b3 sltu a3,a3,a5
10224: 40c70733 sub a4,a4,a2
10228: 40d70733 sub a4,a4,a3
1022c: 01e71613 sll a2,a4,0x1e
10230: 0027d693 srl a3,a5,0x2
10234: 00d666b3 or a3,a2,a3
10238: 00275593 srl a1,a4,0x2
1023c: 33333637 lui a2,0x33333
10240: 33360613 add a2,a2,819 # 33333333 <__BSS_END__+0x3330f423>
10244: 00c6f6b3 and a3,a3,a2
10248: 00c5f5b3 and a1,a1,a2
1024c: 00c7f7b3 and a5,a5,a2
10250: 00c77733 and a4,a4,a2
10254: 00f687b3 add a5,a3,a5
10258: 00d7b6b3 sltu a3,a5,a3
1025c: 00e58733 add a4,a1,a4
10260: 00e686b3 add a3,a3,a4
10264: 01c69613 sll a2,a3,0x1c
10268: 0047d713 srl a4,a5,0x4
1026c: 00e66733 or a4,a2,a4
10270: 0046d613 srl a2,a3,0x4
10274: 00f707b3 add a5,a4,a5
10278: 00e7b733 sltu a4,a5,a4
1027c: 00d606b3 add a3,a2,a3
10280: 00d70733 add a4,a4,a3
10284: 0f0f16b7 lui a3,0xf0f1
10288: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0ccfff>
1028c: 00d7f7b3 and a5,a5,a3
10290: 00d77733 and a4,a4,a3
10294: 01871613 sll a2,a4,0x18
10298: 0087d693 srl a3,a5,0x8
1029c: 00d666b3 or a3,a2,a3
102a0: 00875613 srl a2,a4,0x8
102a4: 00f687b3 add a5,a3,a5
102a8: 00d7b6b3 sltu a3,a5,a3
102ac: 00e60733 add a4,a2,a4
102b0: 00e686b3 add a3,a3,a4
102b4: 01069613 sll a2,a3,0x10
102b8: 0107d713 srl a4,a5,0x10
102bc: 00e66733 or a4,a2,a4
102c0: 0106d613 srl a2,a3,0x10
102c4: 00f707b3 add a5,a4,a5
102c8: 00e7b733 sltu a4,a5,a4
102cc: 00d606b3 add a3,a2,a3
102d0: 00d70733 add a4,a4,a3
102d4: 00f70733 add a4,a4,a5
102d8: 07f77713 and a4,a4,127
102dc: 04000513 li a0,64
102e0: 40e50533 sub a0,a0,a4
102e4: 01051513 sll a0,a0,0x10
102e8: 01055513 srl a0,a0,0x10
102ec: 00008067 ret
```
:::
### **Assembly Analysis of `-O1` Optimized Code**
#### **1. Fewer Redundant Operations**
Compared to the `-O0` optimization level, the `-O1` level has fewer redundant store and load operations. This results in a more concise and efficient assembly output.
#### **2. Register Utilization**
There's more efficient usage of registers to hold intermediate values. This minimizes the need for memory accesses and leads to faster execution.
#### **3. Immediate Values**
The assembly has several instructions utilizing immediate values (`lui` and `add`). This means the compiler optimized certain constants directly into the code.
#### **4. Efficiency Over Clarity**
The `-O1` level optimizes for efficiency without a significant increase in compilation time, so while the assembly might not be as straightforward as the `-O0` version, it's more efficient.
## **O2 Optimized Assembly code**
* **Compile**
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O2 hw2.c -o hw2_O2.elf
```
* **text**
```shell
$ riscv-none-elf-size hw2_O2.elf
text data bss dec hex filename
75632 2320 1528 79480 13678 hw2_O2.elf
```
* **Display the ELF file header**
```shell
$ riscv-none-elf-readelf -h hw2_O2.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10120
Start of program headers: 52 (bytes into file)
Start of section headers: 94508 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
### Disassembly
:::spoiler
```
000101c8 <clz>:
101c8: 01f59713 sll a4,a1,0x1f
101cc: 00155793 srl a5,a0,0x1
101d0: 00f767b3 or a5,a4,a5
101d4: 0015d713 srl a4,a1,0x1
101d8: 00b765b3 or a1,a4,a1
101dc: 00a7e533 or a0,a5,a0
101e0: 01e59713 sll a4,a1,0x1e
101e4: 00255793 srl a5,a0,0x2
101e8: 00f767b3 or a5,a4,a5
101ec: 0025d713 srl a4,a1,0x2
101f0: 00b76733 or a4,a4,a1
101f4: 00a7e533 or a0,a5,a0
101f8: 01c71693 sll a3,a4,0x1c
101fc: 00455793 srl a5,a0,0x4
10200: 00f6e7b3 or a5,a3,a5
10204: 00475693 srl a3,a4,0x4
10208: 00e6e6b3 or a3,a3,a4
1020c: 00a7e733 or a4,a5,a0
10210: 01869613 sll a2,a3,0x18
10214: 00875793 srl a5,a4,0x8
10218: 00f667b3 or a5,a2,a5
1021c: 0086d613 srl a2,a3,0x8
10220: 00d66633 or a2,a2,a3
10224: 00e7e7b3 or a5,a5,a4
10228: 0107d693 srl a3,a5,0x10
1022c: 01061713 sll a4,a2,0x10
10230: 00d766b3 or a3,a4,a3
10234: 01065713 srl a4,a2,0x10
10238: 00c76733 or a4,a4,a2
1023c: 00f6e6b3 or a3,a3,a5
10240: 00d766b3 or a3,a4,a3
10244: 01f71593 sll a1,a4,0x1f
10248: 0016d793 srl a5,a3,0x1
1024c: 55555637 lui a2,0x55555
10250: 55560613 add a2,a2,1365 # 55555555 <__BSS_END__+0x55531645>
10254: 00f5e7b3 or a5,a1,a5
10258: 00c7f7b3 and a5,a5,a2
1025c: 00175593 srl a1,a4,0x1
10260: 40f687b3 sub a5,a3,a5
10264: 00c5f633 and a2,a1,a2
10268: 00f6b6b3 sltu a3,a3,a5
1026c: 40c70733 sub a4,a4,a2
10270: 40d70733 sub a4,a4,a3
10274: 01e71593 sll a1,a4,0x1e
10278: 0027d693 srl a3,a5,0x2
1027c: 33333637 lui a2,0x33333
10280: 33360613 add a2,a2,819 # 33333333 <__BSS_END__+0x3330f423>
10284: 00d5e6b3 or a3,a1,a3
10288: 00c6f6b3 and a3,a3,a2
1028c: 00275593 srl a1,a4,0x2
10290: 00c7f7b3 and a5,a5,a2
10294: 00f687b3 add a5,a3,a5
10298: 00c5f5b3 and a1,a1,a2
1029c: 00c77733 and a4,a4,a2
102a0: 00e58733 add a4,a1,a4
102a4: 00d7b6b3 sltu a3,a5,a3
102a8: 00e686b3 add a3,a3,a4
102ac: 01c69613 sll a2,a3,0x1c
102b0: 0047d713 srl a4,a5,0x4
102b4: 00e66733 or a4,a2,a4
102b8: 00f707b3 add a5,a4,a5
102bc: 0046d613 srl a2,a3,0x4
102c0: 00d60633 add a2,a2,a3
102c4: 00e7b733 sltu a4,a5,a4
102c8: 0f0f16b7 lui a3,0xf0f1
102cc: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0ccfff>
102d0: 00c70733 add a4,a4,a2
102d4: 00d77733 and a4,a4,a3
102d8: 00d7f7b3 and a5,a5,a3
102dc: 01871613 sll a2,a4,0x18
102e0: 0087d693 srl a3,a5,0x8
102e4: 00d666b3 or a3,a2,a3
102e8: 00f687b3 add a5,a3,a5
102ec: 00875613 srl a2,a4,0x8
102f0: 00e60733 add a4,a2,a4
102f4: 00d7b6b3 sltu a3,a5,a3
102f8: 00e686b3 add a3,a3,a4
102fc: 01069613 sll a2,a3,0x10
10300: 0107d713 srl a4,a5,0x10
10304: 00e66733 or a4,a2,a4
10308: 00f707b3 add a5,a4,a5
1030c: 0106d613 srl a2,a3,0x10
10310: 00e7b733 sltu a4,a5,a4
10314: 00d606b3 add a3,a2,a3
10318: 00d70733 add a4,a4,a3
1031c: 00f70733 add a4,a4,a5
10320: 07f77713 and a4,a4,127
10324: 04000513 li a0,64
10328: 40e50533 sub a0,a0,a4
1032c: 01051513 sll a0,a0,0x10
10330: 01055513 srl a0,a0,0x10
10334: 00008067 ret
```
:::
### **Assembly Analysis of `-O2` Optimized Code**
#### **1. More Aggressive Optimizations**
`-O2` is more aggressive than `-O1`. This means there might be more inlining of functions, constant propagation, and other such techniques.
#### **2. More Extensive Use of Registers**
Similar to `-O1`, the `-O2` optimization uses registers efficiently. The value is kept in registers as much as possible to avoid slower memory operations.
#### **3. Return Value Preparation**
As with the `-O1` version, there's a sequence towards the end preparing the return value. This sequence seems slightly different, suggesting some optimization in how the return value is computed.
## **O3 Optimized Assembly code**
* **Compile**
```
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O3 hw2.c -o hw2_O3.elf
```
* **text**
```
$ riscv-none-elf-size hw2_O3.elf
text data bss dec hex filename
75632 2320 1528 79480 136778 hw_O3.elf
```
* **Display the ELF file header**
```
$ riscv-none-elf-readelf -h hw2_O3.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x10120
Start of program headers: 52 (bytes into file)
Start of section headers: 94508 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
### Disassembly
:::spoiler
```
000101c8 <clz>:
101c8: 01f59713 sll a4,a1,0x1f
101cc: 00155793 srl a5,a0,0x1
101d0: 00f767b3 or a5,a4,a5
101d4: 0015d713 srl a4,a1,0x1
101d8: 00b765b3 or a1,a4,a1
101dc: 00a7e533 or a0,a5,a0
101e0: 01e59713 sll a4,a1,0x1e
101e4: 00255793 srl a5,a0,0x2
101e8: 00f767b3 or a5,a4,a5
101ec: 0025d713 srl a4,a1,0x2
101f0: 00b76733 or a4,a4,a1
101f4: 00a7e533 or a0,a5,a0
101f8: 01c71693 sll a3,a4,0x1c
101fc: 00455793 srl a5,a0,0x4
10200: 00f6e7b3 or a5,a3,a5
10204: 00475693 srl a3,a4,0x4
10208: 00e6e6b3 or a3,a3,a4
1020c: 00a7e733 or a4,a5,a0
10210: 01869613 sll a2,a3,0x18
10214: 00875793 srl a5,a4,0x8
10218: 00f667b3 or a5,a2,a5
1021c: 0086d613 srl a2,a3,0x8
10220: 00d66633 or a2,a2,a3
10224: 00e7e7b3 or a5,a5,a4
10228: 0107d693 srl a3,a5,0x10
1022c: 01061713 sll a4,a2,0x10
10230: 00d766b3 or a3,a4,a3
10234: 01065713 srl a4,a2,0x10
10238: 00c76733 or a4,a4,a2
1023c: 00f6e6b3 or a3,a3,a5
10240: 00d766b3 or a3,a4,a3
10244: 01f71593 sll a1,a4,0x1f
10248: 0016d793 srl a5,a3,0x1
1024c: 55555637 lui a2,0x55555
10250: 55560613 add a2,a2,1365 # 55555555 <__BSS_END__+0x55531645>
10254: 00f5e7b3 or a5,a1,a5
10258: 00c7f7b3 and a5,a5,a2
1025c: 00175593 srl a1,a4,0x1
10260: 40f687b3 sub a5,a3,a5
10264: 00c5f633 and a2,a1,a2
10268: 00f6b6b3 sltu a3,a3,a5
1026c: 40c70733 sub a4,a4,a2
10270: 40d70733 sub a4,a4,a3
10274: 01e71593 sll a1,a4,0x1e
10278: 0027d693 srl a3,a5,0x2
1027c: 33333637 lui a2,0x33333
10280: 33360613 add a2,a2,819 # 33333333 <__BSS_END__+0x3330f423>
10284: 00d5e6b3 or a3,a1,a3
10288: 00c6f6b3 and a3,a3,a2
1028c: 00275593 srl a1,a4,0x2
10290: 00c7f7b3 and a5,a5,a2
10294: 00f687b3 add a5,a3,a5
10298: 00c5f5b3 and a1,a1,a2
1029c: 00c77733 and a4,a4,a2
102a0: 00e58733 add a4,a1,a4
102a4: 00d7b6b3 sltu a3,a5,a3
102a8: 00e686b3 add a3,a3,a4
102ac: 01c69613 sll a2,a3,0x1c
102b0: 0047d713 srl a4,a5,0x4
102b4: 00e66733 or a4,a2,a4
102b8: 00f707b3 add a5,a4,a5
102bc: 0046d613 srl a2,a3,0x4
102c0: 00d60633 add a2,a2,a3
102c4: 00e7b733 sltu a4,a5,a4
102c8: 0f0f16b7 lui a3,0xf0f1
102cc: f0f68693 add a3,a3,-241 # f0f0f0f <__BSS_END__+0xf0ccfff>
102d0: 00c70733 add a4,a4,a2
102d4: 00d77733 and a4,a4,a3
102d8: 00d7f7b3 and a5,a5,a3
102dc: 01871613 sll a2,a4,0x18
102e0: 0087d693 srl a3,a5,0x8
102e4: 00d666b3 or a3,a2,a3
102e8: 00f687b3 add a5,a3,a5
102ec: 00875613 srl a2,a4,0x8
102f0: 00e60733 add a4,a2,a4
102f4: 00d7b6b3 sltu a3,a5,a3
102f8: 00e686b3 add a3,a3,a4
102fc: 01069613 sll a2,a3,0x10
10300: 0107d713 srl a4,a5,0x10
10304: 00e66733 or a4,a2,a4
10308: 00f707b3 add a5,a4,a5
1030c: 0106d613 srl a2,a3,0x10
10310: 00e7b733 sltu a4,a5,a4
10314: 00d606b3 add a3,a2,a3
10318: 00d70733 add a4,a4,a3
1031c: 00f70733 add a4,a4,a5
10320: 07f77713 and a4,a4,127
10324: 04000513 li a0,64
10328: 40e50533 sub a0,a0,a4
1032c: 01051513 sll a0,a0,0x10
10330: 01055513 srl a0,a0,0x10
10334: 00008067 ret
```
:::
## **scones525 Optimized Assembly**
* **text**
```
text data bss dec hex filename
456 0 0 456 1c8 hw2.elf
```
* **Display the ELF file header**
```
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 52 (bytes into file)
Start of section headers: 5288 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
```
### Disassembly
:::spoiler
```c
hw2.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <_start>:
0: 00000517 auipc a0,0x0
4: 18850513 add a0,a0,392 # 188 <num1>
8: 0b8000ef jal c0 <counting_zero>
c: 028000ef jal 34 <print_result>
10: 00000517 auipc a0,0x0
14: 18050513 add a0,a0,384 # 190 <num2>
18: 0a8000ef jal c0 <counting_zero>
1c: 018000ef jal 34 <print_result>
20: 00000517 auipc a0,0x0
24: 17850513 add a0,a0,376 # 198 <num3>
28: 098000ef jal c0 <counting_zero>
2c: 008000ef jal 34 <print_result>
30: 0880006f j b8 <exit_program>
00000034 <print_result>:
34: ff810113 add sp,sp,-8
38: 00a00313 li t1,10
3c: 00000393 li t2,0
00000040 <divide_tens>:
40: 0062c863 blt t0,t1,50 <done_divide_tens>
44: 406282b3 sub t0,t0,t1
48: 00138393 add t2,t2,1
4c: ff5ff06f j 40 <divide_tens>
00000050 <done_divide_tens>:
50: 03038393 add t2,t2,48
54: 00712023 sw t2,0(sp)
58: 03028293 add t0,t0,48
5c: 00512223 sw t0,4(sp)
60: 00012383 lw t2,0(sp)
64: 02039063 bnez t2,84 <print_tens>
00000068 <skip_tens>:
68: 00412283 lw t0,4(sp)
6c: 00100513 li a0,1
70: 00410593 add a1,sp,4
74: 00100613 li a2,1
78: 04000893 li a7,64
7c: 00000073 ecall
80: 01c0006f j 9c <print_newline>
00000084 <print_tens>:
84: 00100513 li a0,1
88: 00010593 mv a1,sp
8c: 00100613 li a2,1
90: 04000893 li a7,64
94: 00000073 ecall
98: fd1ff06f j 68 <skip_tens>
0000009c <print_newline>:
9c: 04000893 li a7,64
a0: 00000597 auipc a1,0x0
a4: 12458593 add a1,a1,292 # 1c4 <msg_string>
a8: 00200613 li a2,2
ac: 00000073 ecall
b0: 00810113 add sp,sp,8
b4: 00008067 ret
000000b8 <exit_program>:
b8: 05d00893 li a7,93
bc: 00000073 ecall
000000c0 <counting_zero>:
c0: 00050293 mv t0,a0
c4: 00000f93 li t6,0
c8: 0002a403 lw s0,0(t0)
cc: 0042a483 lw s1,4(t0)
d0: fff47913 and s2,s0,-1
d4: 01201663 bne zero,s2,e0 <start_count>
d8: 020f8f93 add t6,t6,32
dc: 00048413 mv s0,s1
000000e0 <start_count>:
e0: 00145493 srl s1,s0,0x1
e4: 0084e433 or s0,s1,s0
e8: 00245493 srl s1,s0,0x2
ec: 0084e433 or s0,s1,s0
f0: 00445493 srl s1,s0,0x4
f4: 0084e433 or s0,s1,s0
f8: 00845493 srl s1,s0,0x8
fc: 0084e433 or s0,s1,s0
100: 01045493 srl s1,s0,0x10
104: 0084e433 or s0,s1,s0
108: 00000917 auipc s2,0x0
10c: 0ac90913 add s2,s2,172 # 1b4 <and4>
110: 00092983 lw s3,0(s2)
114: 00145493 srl s1,s0,0x1
118: 0134f4b3 and s1,s1,s3
11c: 40940433 sub s0,s0,s1
120: 00000917 auipc s2,0x0
124: 09c90913 add s2,s2,156 # 1bc <and5>
128: 00092983 lw s3,0(s2)
12c: 00245493 srl s1,s0,0x2
130: 0134f4b3 and s1,s1,s3
134: 01347933 and s2,s0,s3
138: 01248433 add s0,s1,s2
13c: 00000917 auipc s2,0x0
140: 06c90913 add s2,s2,108 # 1a8 <and2>
144: 00092983 lw s3,0(s2)
148: 00445493 srl s1,s0,0x4
14c: 008484b3 add s1,s1,s0
150: 0134f433 and s0,s1,s3
154: 00845493 srl s1,s0,0x8
158: 00940433 add s0,s0,s1
15c: 01045493 srl s1,s0,0x10
160: 00940433 add s0,s0,s1
164: 02000493 li s1,32
168: 00000917 auipc s2,0x0
16c: 04890913 add s2,s2,72 # 1b0 <and3>
170: 00092983 lw s3,0(s2)
174: 01347433 and s0,s0,s3
178: 009f8fb3 add t6,t6,s1
17c: 408f85b3 sub a1,t6,s0
180: 00058293 mv t0,a1
184: 00008067 ret
00000188 <num1>:
188: 0000f1ac .word 0x0000f1ac
18c: 00000123 .word 0x00000123
00000190 <num2>:
190: 000001ac .word 0x000001ac
194: 12123489 .word 0x12123489
00000198 <num3>:
198: 00000000 .word 0x00000000
19c: 00032498 .word 0x00032498
000001a0 <and1>:
1a0: 33333333 .word 0x33333333
1a4: 33333333 .word 0x33333333
000001a8 <and2>:
1a8: 0f0f0f0f .word 0x0f0f0f0f
1ac: 0f0f0f0f .word 0x0f0f0f0f
000001b0 <and3>:
1b0: 0000007f .word 0x0000007f
000001b4 <and4>:
1b4: 55555555 .word 0x55555555
1b8: 55555555 .word 0x55555555
000001bc <and5>:
1bc: 33333333 .word 0x33333333
1c0: 33333333 .word 0x33333333
000001c4 <msg_string>:
1c4: 0000000a .word 0x0000000a
```
:::
## **Handwritten Assembly**
* **text**
```
text data bss dec hex filename
396 0 0 396 18c hw2.elf
```
* **Display the ELF file header**
```
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x0
Start of program headers: 52 (bytes into file)
Start of section headers: 5148 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 2
Size of section headers: 40 (bytes)
Number of section headers: 6
Section header string table index: 5
```
### Disassembly
```
hw2.elf: file format elf32-littleriscv
Disassembly of section .text:
00000000 <_start>:
0: 00000517 auipc a0,0x0
4: 16050513 add a0,a0,352 # 160 <num1>
8: 0b8000ef jal c0 <counting_zero>
c: 028000ef jal 34 <print_result>
10: 00000517 auipc a0,0x0
14: 15850513 add a0,a0,344 # 168 <num2>
18: 0a8000ef jal c0 <counting_zero>
1c: 018000ef jal 34 <print_result>
20: 00000517 auipc a0,0x0
24: 15050513 add a0,a0,336 # 170 <num3>
28: 098000ef jal c0 <counting_zero>
2c: 008000ef jal 34 <print_result>
30: 0880006f j b8 <exit_program>
00000034 <print_result>:
34: ff810113 add sp,sp,-8
38: 00a00313 li t1,10
3c: 00000393 li t2,0
00000040 <divide_tens>:
40: 0062c863 blt t0,t1,50 <done_divide_tens>
44: 406282b3 sub t0,t0,t1
48: 00138393 add t2,t2,1
4c: ff5ff06f j 40 <divide_tens>
00000050 <done_divide_tens>:
50: 03038393 add t2,t2,48
54: 00712023 sw t2,0(sp)
58: 03028293 add t0,t0,48
5c: 00512223 sw t0,4(sp)
60: 00012383 lw t2,0(sp)
64: 02039063 bnez t2,84 <print_tens>
00000068 <skip_tens>:
68: 00412283 lw t0,4(sp)
6c: 00100513 li a0,1
70: 00410593 add a1,sp,4
74: 00100613 li a2,1
78: 04000893 li a7,64
7c: 00000073 ecall
80: 01c0006f j 9c <print_newline>
00000084 <print_tens>:
84: 00100513 li a0,1
88: 00010593 mv a1,sp
8c: 00100613 li a2,1
90: 04000893 li a7,64
94: 00000073 ecall
98: fd1ff06f j 68 <skip_tens>
0000009c <print_newline>:
9c: 04000893 li a7,64
a0: 00000597 auipc a1,0x0
a4: 0e858593 add a1,a1,232 # 188 <msg_string>
a8: 00200613 li a2,2
ac: 00000073 ecall
b0: 00810113 add sp,sp,8
b4: 00008067 ret
000000b8 <exit_program>:
b8: 05d00893 li a7,93
bc: 00000073 ecall
000000c0 <counting_zero>:
c0: 00052403 lw s0,0(a0)
c4: 00452483 lw s1,4(a0)
c8: 00000f93 li t6,0
cc: 00041663 bnez s0,d8 <start_count>
d0: 020f8f93 add t6,t6,32
d4: 00048413 mv s0,s1
000000d8 <start_count>:
d8: 00000317 auipc t1,0x0
dc: 0a030313 add t1,t1,160 # 178 <and_mask>
e0: 00032383 lw t2,0(t1)
e4: 00432e03 lw t3,4(t1)
e8: 00832e83 lw t4,8(t1)
ec: 00c32f03 lw t5,12(t1)
f0: 00145493 srl s1,s0,0x1
f4: 00946433 or s0,s0,s1
f8: 00245493 srl s1,s0,0x2
fc: 00946433 or s0,s0,s1
100: 00445493 srl s1,s0,0x4
104: 00946433 or s0,s0,s1
108: 00845493 srl s1,s0,0x8
10c: 00946433 or s0,s0,s1
110: 01045493 srl s1,s0,0x10
114: 00946433 or s0,s0,s1
118: 00145493 srl s1,s0,0x1
11c: 0074f4b3 and s1,s1,t2
120: 40940433 sub s0,s0,s1
124: 00245493 srl s1,s0,0x2
128: 01c4f4b3 and s1,s1,t3
12c: 01c47933 and s2,s0,t3
130: 01248433 add s0,s1,s2
134: 00445493 srl s1,s0,0x4
138: 008484b3 add s1,s1,s0
13c: 01d4f433 and s0,s1,t4
140: 00845493 srl s1,s0,0x8
144: 00940433 add s0,s0,s1
148: 01045493 srl s1,s0,0x10
14c: 00940433 add s0,s0,s1
150: 01e47433 and s0,s0,t5
154: 020f8f93 add t6,t6,32
158: 408f82b3 sub t0,t6,s0
15c: 00008067 ret
00000160 <num1>:
160: 0000f1ac .word 0x0000f1ac
164: 00000123 .word 0x00000123
00000168 <num2>:
168: 000001ac .word 0x000001ac
16c: 12123489 .word 0x12123489
00000170 <num3>:
170: 00000000 .word 0x00000000
174: 00032498 .word 0x00032498
00000178 <and_mask>:
178: 55555555 .word 0x55555555
17c: 33333333 .word 0x33333333
180: 0f0f0f0f .word 0x0f0f0f0f
184: 0000007f .word 0x0000007f
00000188 <msg_string>:
188: 0000000a .word 0x0000000a
```
## **Conclusion in different Optimized flag**
| | O0 | O1 | O2 | O3 | **Handwritten** |
| :----------------------: | :---: | :---: | :---: | :---: | :---: |
| Start of section headers | 94492 | 94492 | 94508 | 94508 | **5148** |
| text | 76500 | 75632 | 75632 | 75632 | **390** |
| data | 2320 | 2320 | 2320 | 2320 | **0** |
| bss | 1528 | 1528 | 1528 | 1528 | **0** |
| dec | 80348 | 78480 | 779480 | 779488 |**396** |
|RDCYCLE value|5987|5093|5093|5093|**254**|
### **`-O0`: No Optimization**
- **Code size** is the largest.
- Retains all **stack operations**, redundant storage and loading operations, and a direct data flow.
### **`-O1`**
- Code is **slightly smaller** compared to `-O0`.
- Reduces some **redundant operations**.
- Makes more **efficient use of registers**.
### **`-O2` and `-O3`**
- Code sizes are **similar**.
- `-O3` should have a **performance advantage** due to more optimization techniques.
- Similar sizes to `-O1` imply the original C code might not have much room for further optimization.
### **`Handwritten Assembly`**
- The **smallest** size.
16
23
46
<s>Cycle count: 1</s>
Cycle count: 254
:::warning
What is the reason behind obtaining the result that indicates a cycle count of 1?
:notes: jserv
> The reason for the cycle count of 1 is that I forgot to call get_cycles at the end; it should be endcycle - startcycle. The issue has been resolved now, and the correct cycle count is 254.
:::