# Assignment 2: GNU Toolchain
> Calculate the Hamming Distance using Counting Leading Zeros
contribute by < [JinYu1225](https://github.com/JinYu1225) >
## Program Selecting
The program I select is the [Calculate the Hamming Distance using Counting Leading Zeros](https://hackmd.io/@O6C2C3zQQBanDM55QRZ7DQ/Lab1_RV32I_assembly) by [唐飴苹](https://github.com/yptang5488).
The reason why I choose this program is because I have seen a problem about [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) before, but I didn't understand the algorithm at that time. Therefore, I choose this target to have a better understanding on Hamming Distance.
Besides, the student of this subject in assignment 1 only use [Count Leading Zeros (CLZ)](https://en.wikipedia.org/wiki/Find_first_set#CLZ) to simplify the algorithm of Hamming Distance. However, I'm interesting in **whether we can use CLZ through all the algorithm**. And I also want to try to implement a **64-bit CLZ structure** on Ripes.
## Modify C code
- [ ] The original work by yptang5488
```c
int HammingDistance(uint64_t x0, uint64_t x1){
int Hdist = 0;
int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
while(max_digit > 0){
uint64_t c1 = x0 & 1;
uint64_t c2 = x1 & 1;
if(c1 != c2) Hdist += 1;
x0 = x0 >> 1;
x1 = x1 >> 1;
max_digit -= 1;
}
return Hdist;
}
```
In the above implementation, CLZ is only used in the first part of the algorithm in this code to find the bit to stop for the comparison of the two 64-bit parameter.
The following code trying to implement CLZ through the whole Hamming distance code.
- [ ] My first take
```c
int HammingDistance(uint64_t x1, uint64_t x2)
{
int Hdist = 0;
int16_t clz_count = 0;
int16_t count = 0;
while(clz_count != 64) // stop when the clz is 64
{
// use clz to find the biggest 1 bit within x1 and x2
clz_count = count_leading_zeros((x1 > x2) ? x1 : x2);
x1 = x1 << clz_count; // shift biggest 1 bit to MSB
x2 = x2 << clz_count;
uint64_t c1 = x1 & 0x8000000000000000; // compare MSB
uint64_t c2 = x2 & 0x8000000000000000;
if(c1 != c2) Hdist += 1;
x1 &= 0x7fffffffffffffff; // mask MSB
x2 &= 0x7fffffffffffffff;
count += 1;
}
return Hdist;
}
```
In my version, CLZ is used in order to find whether there is any 0 bit that can be skip before comparing two vectors in every loop.
My program and yptang5488's program have a significant difference, mainly related to the use of a mask. In each loop of my program, I used the mask `0x7fffffffffffffff`, while yptang5488 used a simpler mask, `0x1`. One way to potentially improve performance is to change the CLZ algorithm to "count following zeros" or to reverse the bits in the code. However, I didn't implement these optimizations in this assignment.
Besides, it is not sure that doing clz will be faster than the original algorithm produce by yptang5488. My code might only be more efficient when the input arguments have several 0 bits.
- [ ] My lastest version
```c
uint64_t c1 = x1 ^ x2;
// stop when there is no different between x1 and x2
while(c1 != 0)
{
// find biggest different bit by clz
clz_count = count_leading_zeros(c1);
c1 = c1 << clz_count + 1;
Hdist += 1;
}
```
Inspired by the assignment of [林以薰](https://hackmd.io/@scones525/SJ2gJgpW6), I adapted my code to a more faster and efficient version.
The main idea is to use **`^` (XOR)** operator to pick out the difference between `x1` and `x2` in the very beginnging. This operation can help me reduce several unnecessary steps in my previous code.
Result of .elf in rv32emu
```shell
$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o HammingDistance.elf HammingDistance.c
$../../build/rv32emu HammingDistance.elf
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
inferior exit code 0
```
## Assembly code
- [ ] [Handwritten version](https://github.com/JinYu1225/2023_Computer_Architecture/blob/main/Lab2/HammingDistance_Ripes.s)
In this version, I use two registers to store a 64-bit data. As the result, some operations that are straightforward in C code will be more complicated in assembly, such as `c1 = c1 << clz_count + 1`.
```c
# c1 = x1 ^ x2
# s1 = upper 32-bit of c1
# s2 = lower 32-bit of c1
# s6: check if shift more than 32
# c1 = c1 << clz_count + 1
addi a0, a0, 1
bge a0, s6, SGE32
SLT32: # shift less than 32-bit
sll s1, s1, a0 # s1 = s1 << a0
sub t0, s6, a0 # t0 = 32 - (clz_count + 1)
srl t1, s2, t0 # t1 = s2 >> t0
or s1, s1, t1 # s1 = s1 & t1
sll s2, s2, a0 # s2 = s2 << a0
j HAMDIS_LOOP
SGE32: # shift greater equ 32-bit
sub t0, a0, s6 # t0 = (clz_count + 1) - 32
sll s1, s2, t0 # s1 = s2 << t0
add s2, x0, x0 # s2 = 0
j HAMDIS_LOOP
```
The test data in this code consists of **`0x0000000000100000`** and **`0x00000000000FFFFF`**, and their Hamming distance should be **`21`**.
While there are two additional inputs in the code, it still lacks the capability to run multiple test data sets in a single execution.
```c
.data
test2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
```
The result in the console block in RIPES will be like following:
```shell
...
The clz result is 0
The clz result is 0
The clz result is 0
The Hamming distance is 21
Program exited with code: 0
```
### Fit in rv32emu
Due to the incompatibility of Ripes and rv32emu, I've made some modifications to the handwritten version to address this issue.
* In the beginning, provide the program a starting address to linker.
```c
# set the program start address
.org 0
# define a global symbol "_start" for the entry of the program
.global _start
...
# where the program start
.text
_start:
...
```
The usage of system call
* I've also made modifications to the print function to separately handle and display the tens and units digits of numbers.
* To simplify the code, I've removed the CLZ print function.
```c
# new system call
.equ SYSSTDOUT, 1
.equ SYSWRITE, 64
.equ SYSEXIT, 93
...
# new system call of print needs size of the buffer
str2: .string "\n"
.set str_size2, .-str2
str3: .string "The Hamming distance is "
.set str_size3, .-str3
...
# print
li a0, SYSSTDOUT
la a1, str3
li a2, str_size3
li a7, SYSWRITE
ecall
...
# done
addi sp, sp, 12
add a0, x0, 0
li a7, SYSEXIT
ecall
```
* compile & result
```shell
$ make
riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o HammingDistance.o HammingDistance.s
riscv-none-elf-ld -o HammingDistance.elf -T HammingDistance.ld --oformat=elf32-littleriscv HammingDistance.o
$ ../../build/rv32emu HammingDistance.elf
The Hamming distance is 21
inferior exit code 0
```
## Optimization
In this assignment, we will compile the code with different method of optimzation, including `O0`, `O1`, `O2`, `O3`, `Os`,and `Ofast`, and subsequently compare the difference between them. You can find more kinds of optimization and information using `$ riscv-none-elf-gcc -Q --help=optimizers`。
I will concentrate on the variations in the "HammingDistance" and "main" components made by each optimization for a better comparison.
You can find [my Makefile](https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/Makefile) in here.
:::warning
Don't make listing for non-critical code. Use hyperlinks instead.
:notes: jserv
:::
### [O0 Optimized Assembly code]((https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/HammingDistance_O0.txt))
* Compile
```shell
$ make O0
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 HammingDistance.c -o HammingDistance_O0.elf
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_O0.elf
text data bss dec hex filename
76804 2372 1528 80704 13b40 HammingDistance_O0.elf
```
* ELF header
```shell
$ riscv-none-elf-readelf -h HammingDistance_O0.elf
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: RISC-V
Version: 0x1
Entry point address: 0x100d8
Start of program headers: 52 (bytes into file)
Start of section headers: 94780 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 15
Section header string table index: 14
```
* Save as txt
```shell
riscv-none-elf-objdump -d HammingDistance_O0.elf >HammingDistance_O0.txt
```
I found that the text is too much, so I plan to import the optimization for my code.
The original C code will also compile the `<stdint.h>` and `<stdio.h>`, so I use `typedef unsinged long long uint64_t` and `typedef unsigned short uint16_t`, instead of directly using `uint64_t` and `uint16_t`
Besides, I remove `printf` and the size reduce as following.
```shell
$ riscv-none-elf-size HammingDistance_O0.elf
text data bss dec hex filename
10072 1424 840 12336 3030 HammingDistance_O0.elf
```
:::warning
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv
:::
To calculate the cycle as a indicator of the program, the following code is add to .c file. You can look at [tick.c](https://github.com/sysprog21/rv32emu/blob/6294955d3708924a194402fa72f5978a63a75313/tests/ticks.c) for the reference.
```c
# include <inttypes.h>
typedef uint64_t ticks;
static inline ticks getticks(void)
{
uint64_t result;
uint32_t l, h, h2;
asm volatile(
"rdcycleh %0\n"
"rdcycle %1\n"
"rdcycleh %2\n"
"sub %0, %0, %2\n"
"seqz %0, %0\n"
"sub %0, zero, %0\n"
"and %1, %1, %0\n"
: "=r"(h), "=r"(l), "=r"(h2));
result = (((uint64_t) h) << 32) | ((uint64_t) l);
return result;
}
int main() {
ticks t0 = getticks();
printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
ticks t1 = getticks();
printf("Cycle Count: %llu\n", t1 - t0);
return 0;
}
```
Then compile .c file again
```c
$ make O0
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 HammingDistance.c -o HammingDistance_O0.elf
riscv-none-elf-objdump -d HammingDistance_O0.elf > HammingDistance_O0.txt
$ ../../build/rv32emu HammingDistance_O0.elf
Hamming Distance = 21
Cycle Count: 9439
inferior exit code 0
```
* The size has change since I put `printf` to the program.
```c
text data bss dec hex filename
76944 2372 1528 80844 13bcc HammingDistance_O0.elf
```
Disassembly
```c
00010688 <HammingDistance>:
10688: fc010113 add sp,sp,-64
1068c: 02112e23 sw ra,60(sp)
10690: 02812c23 sw s0,56(sp)
10694: 03212a23 sw s2,52(sp)
10698: 03312823 sw s3,48(sp)
1069c: 04010413 add s0,sp,64
106a0: fca42423 sw a0,-56(s0)
106a4: fcb42623 sw a1,-52(s0)
106a8: fcc42023 sw a2,-64(s0)
106ac: fcd42223 sw a3,-60(s0)
106b0: fe042623 sw zero,-20(s0)
...
```
**In [O0 optimization](https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/HammingDistance_O0.txt), the compiler transform the C code to assembly without most of the optimization.** It just basically translates all the instructions we do in the C code in a general way.
O0 optimization avoids performing most of the calculations with `a0` and `a1` while they are used for the return value of the function. This enhances the transformation's compatibility. However, the code will also become larger since it needs more intruction to move the data.
### [O1 Optimized Assembly code](https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/HammingDistance_O1.txt)
* Compile
```shell
Hamming Distance = 21
Cycle Count: 5016
inferior exit code 0
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_O1.elf
text data bss dec hex filename
75936 2372 1528 79836 137dc HammingDistance_O1.elf
```
* Disassembly
```c
00010310 <HammingDistance>:
...
10360: 00f49533 sll a0,s1,a5
10364: 00a76533 or a0,a4,a0
10368: 00f417b3 sll a5,s0,a5
...
103ec: f601a503 lw a0,-160(gp) # 23930 <test1_x0>
103f0: f641a583 lw a1,-156(gp) # 23934 <test1_x0+0x4>
103f4: f1dff0ef jal 10310 <HammingDistance>
103f8: 00050593 mv a1,a0
103fc: 00022537 lui a0,0x22
10400: b5050513 add a0,a0,-1200 # 21b50 <__clzsi2+0x88>
10404: 5c4000ef jal 109c8 <printf>
10408: d79ff0ef jal 10180 <getticks>
1040c: 40850633 sub a2,a0,s0
10410: 00c53533 sltu a0,a0,a2
10414: 409585b3 sub a1,a1,s1
10418: 40a586b3 sub a3,a1,a0
1041c: 00022537 lui a0,0x22
10420: b6850513 add a0,a0,-1176 # 21b68 <__clzsi2+0xa0>
...
```
[O1 optimization](https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/HammingDistance_O1.txt) will tries to **decrease the code size and improve the operation time without any optimizations that increase compilation time**.
We can see the difference between O1 and O0 by the above code. In O1 optimization, the code will use `a0` and `a1` to perform some of the calculations. This help the program remove parts of useless code to reduce the code size and improve the cycle counts.
### [O2 Optimized Assembly code](https://github.com/JinYu1225/rv32emu/blob/master/tests/emo/HammingDistance_O2.txt)
* Compile
```shell
Hamming Distance = 21
Cycle Count: 4974
inferior exit code 0
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_O2.elf
text data bss dec hex filename
75948 2372 1528 79848 137e8 HammingDistance_O2.elf
```
* Disassembly
```c
...
00010388 <HammingDistance>:
10388: fe010113 add sp,sp,-32
1038c: 00812c23 sw s0,24(sp)
10390: 00912a23 sw s1,20(sp)
10394: 00112e23 sw ra,28(sp)
10398: 01312623 sw s3,12(sp)
1039c: 00c544b3 xor s1,a0,a2
103a0: 00d5c433 xor s0,a1,a3
103a4: 08c50263 beq a0,a2,10428 <HammingDistance+0xa0>
103a8: 01212823 sw s2,16(sp)
103ac: 00000993 li s3,0
103b0: 01f00913 li s2,31
...
```
**O2 includes most of the optimization from O1, and additionally adds those do not impact the speed and code size.**
In the code provided above, there is no register that hold the same value. O2 optimization allocates a single register for storing identical value arguments.
### [O3 Optimized Assembly code](https://github.com/JinYu1225/2023_Computer_Architecture/blob/main/Lab2/HammingDistance_O3.txt)
* Compile
```shell
Hamming Distance = 21
Cycle Count: 4734
inferior exit code 0
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_O3.elf
text data bss dec hex filename
76700 2372 1528 80600 13ad8 HammingDistance_O3.elf
```
* Disassembly
```c
...
00010560 <HammingDistance>:
10560: 00c548b3 xor a7,a0,a2
10564: 00d5c833 xor a6,a1,a3
10568: 1cc50463 beq a0,a2,10730 <HammingDistance+0x1d0>
1056c: 55555337 lui t1,0x55555
10570: 33333637 lui a2,0x33333
10574: 0f0f15b7 lui a1,0xf0f1
...
```
**O3 optimize even more than O2.** We can see that "HammingDistance" start without storing value, and there is no entrence to "HammingDistance".
If we look back to the "main", we will find that the code barely completes everything in the "main" function instead of calling other functions. Thus, we can assume that the actual text size it consumes should be smaller than what is displayed above.
### [Os Optimized Assembly code](https://github.com/JinYu1225/2023_Computer_Architecture/blob/main/Lab2/HammingDistance_Os.txt)
* Compile
```shell
Hamming Distance = 21
Cycle Count: 5094
inferior exit code 0
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_Os.elf
text data bss dec hex filename
75904 2372 1528 79804 137bc HammingDistance_Os.elf
```
* Disassembly
```c
...
00010388 <HammingDistance>:
10388: fe010113 add sp,sp,-32
1038c: 00812c23 sw s0,24(sp)
10390: 00912a23 sw s1,20(sp)
10394: 00112e23 sw ra,28(sp)
10398: 00c54433 xor s0,a0,a2
1039c: 00d5c5b3 xor a1,a1,a3
103a0: 00000493 li s1,0
103a4: 00b467b3 or a5,s0,a1
...
```
**Os focus on the optimization of code size.**
In the code above, Os use the registers that don't need for that moment to save new arguments. Such as the `a1` in line 1039c.
### Ofast Optimized Assembly code
* Compile
```shell
Hamming Distance = 21
Cycle Count: 4734
inferior exit code 0
```
* Size
```shell
$ riscv-none-elf-size HammingDistance_Ofast.elf
text data bss dec hex filename
76700 2372 1528 80600 13ad8 HammingDistance_Ofast.elf
```
* Disassembly
```c
000100c0 <main>:
100c0: ff010113 add sp,sp,-16
100c4: 00112623 sw ra,12(sp)
100c8: 00812423 sw s0,8(sp)
100cc: 00912223 sw s1,4(sp)
100d0: c80024f3 rdcycleh s1
100d4: c0002473 rdcycle s0
100d8: c80027f3 rdcycleh a5
100dc: 40f484b3 sub s1,s1,a5
100e0: 0014b493 seqz s1,s1
100e4: 409004b3 neg s1,s1
100e8: 00947433 and s0,s0,s1
100ec: f601a703 lw a4,-160(gp) # 23930 <test1_x0>
100f0: f641a583 lw a1,-156(gp) # 23934 <test1_x0+0x4>
100f4: f5c1a603 lw a2,-164(gp) # 2392c <test1_x1+0x4>
100f8: f581a783 lw a5,-168(gp) # 23928 <test1_x1>
100fc: 00c5c6b3 xor a3,a1,a2
10100: 00f74833 xor a6,a4,a5
10104: 22f70063 beq a4,a5,10324 <main+0x264>
10108: 55555e37 lui t3,0x55555
1010c: 333338b7 lui a7,0x33333
10110: 0f0f1337 lui t1,0xf0f1
10114: 00000593 li a1,0
10118: 555e0e13 add t3,t3,1365 # 55555555 <__BSS_END__+0x55531611>
1011c: 33388893 add a7,a7,819 # 33333333 <__BSS_END__+0x3330f3ef>
10120: f0f30313 add t1,t1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
10124: 04000e93 li t4,64
10128: 01f00613 li a2,31
1012c: 01c0006f j 10148 <main+0x88>
10130: 00e816b3 sll a3,a6,a4
10134: 00000793 li a5,0
10138: 00078813 mv a6,a5
1013c: 00d7e7b3 or a5,a5,a3
10140: 00158593 add a1,a1,1
10144: 18078263 beqz a5,102c8 <main+0x208>
10148: 00185513 srl a0,a6,0x1
1014c: 01f69f13 sll t5,a3,0x1f
10150: 0016d713 srl a4,a3,0x1
10154: 00af6f33 or t5,t5,a0
10158: 00d76733 or a4,a4,a3
1015c: 010f6f33 or t5,t5,a6
10160: 01e71f93 sll t6,a4,0x1e
10164: 002f5793 srl a5,t5,0x2
10168: 00ffe7b3 or a5,t6,a5
1016c: 00275f93 srl t6,a4,0x2
10170: 01f76733 or a4,a4,t6
10174: 00ff6f33 or t5,t5,a5
10178: 01c71f93 sll t6,a4,0x1c
1017c: 004f5793 srl a5,t5,0x4
10180: 00ffe7b3 or a5,t6,a5
10184: 00475f93 srl t6,a4,0x4
10188: 01f76733 or a4,a4,t6
1018c: 00ff6f33 or t5,t5,a5
10190: 01871f93 sll t6,a4,0x18
10194: 008f5793 srl a5,t5,0x8
10198: 00ffe7b3 or a5,t6,a5
1019c: 00875f93 srl t6,a4,0x8
101a0: 01f76733 or a4,a4,t6
101a4: 00ff6f33 or t5,t5,a5
101a8: 01071f93 sll t6,a4,0x10
101ac: 010f5793 srl a5,t5,0x10
101b0: 00ffe7b3 or a5,t6,a5
101b4: 01075f93 srl t6,a4,0x10
101b8: 01f76733 or a4,a4,t6
101bc: 00ff6f33 or t5,t5,a5
101c0: 00ef6f33 or t5,t5,a4
101c4: 01f71f93 sll t6,a4,0x1f
101c8: 001f5793 srl a5,t5,0x1
101cc: 00ffe7b3 or a5,t6,a5
101d0: 01c7f7b3 and a5,a5,t3
101d4: 00175f93 srl t6,a4,0x1
101d8: 40ff07b3 sub a5,t5,a5
101dc: 01cfffb3 and t6,t6,t3
101e0: 00ff3f33 sltu t5,t5,a5
101e4: 41f70733 sub a4,a4,t6
101e8: 41e70733 sub a4,a4,t5
101ec: 01e71f93 sll t6,a4,0x1e
101f0: 0027df13 srl t5,a5,0x2
101f4: 01efef33 or t5,t6,t5
101f8: 011f7f33 and t5,t5,a7
101fc: 00275f93 srl t6,a4,0x2
10200: 0117f7b3 and a5,a5,a7
10204: 00ff07b3 add a5,t5,a5
10208: 011fffb3 and t6,t6,a7
1020c: 01177733 and a4,a4,a7
10210: 00ef8733 add a4,t6,a4
10214: 01e7bf33 sltu t5,a5,t5
10218: 00ef0f33 add t5,t5,a4
1021c: 01cf1f93 sll t6,t5,0x1c
10220: 0047d713 srl a4,a5,0x4
10224: 00efe733 or a4,t6,a4
10228: 00f707b3 add a5,a4,a5
1022c: 004f5f93 srl t6,t5,0x4
10230: 01ef8f33 add t5,t6,t5
10234: 00e7b733 sltu a4,a5,a4
10238: 01e70733 add a4,a4,t5
1023c: 0067f7b3 and a5,a5,t1
10240: 00677733 and a4,a4,t1
10244: 01871f93 sll t6,a4,0x18
10248: 0087df13 srl t5,a5,0x8
1024c: 01efef33 or t5,t6,t5
10250: 01e78f33 add t5,a5,t5
10254: 00875f93 srl t6,a4,0x8
10258: 01f70733 add a4,a4,t6
1025c: 00ff37b3 sltu a5,t5,a5
10260: 00e787b3 add a5,a5,a4
10264: 01079f93 sll t6,a5,0x10
10268: 010f5713 srl a4,t5,0x10
1026c: 00efe733 or a4,t6,a4
10270: 00ef0733 add a4,t5,a4
10274: 0107df93 srl t6,a5,0x10
10278: 01e73f33 sltu t5,a4,t5
1027c: 01f787b3 add a5,a5,t6
10280: 00ff0f33 add t5,t5,a5
10284: 01e707b3 add a5,a4,t5
10288: 07f7f793 and a5,a5,127
1028c: 40fe87b3 sub a5,t4,a5
10290: 01079793 sll a5,a5,0x10
10294: 0107d793 srl a5,a5,0x10
10298: fe178713 add a4,a5,-31
1029c: 00178793 add a5,a5,1
102a0: 40f60f33 sub t5,a2,a5
102a4: 00f696b3 sll a3,a3,a5
102a8: 01e55533 srl a0,a0,t5
102ac: e80752e3 bgez a4,10130 <main+0x70>
102b0: 00f817b3 sll a5,a6,a5
102b4: 00d566b3 or a3,a0,a3
102b8: 00078813 mv a6,a5
102bc: 00d7e7b3 or a5,a5,a3
102c0: 00158593 add a1,a1,1
102c4: e80792e3 bnez a5,10148 <main+0x88>
102c8: 00022537 lui a0,0x22
102cc: e5050513 add a0,a0,-432 # 21e50 <__clzsi2+0x8c>
102d0: 1f5000ef jal 10cc4 <printf>
102d4: c80026f3 rdcycleh a3
102d8: c00027f3 rdcycle a5
102dc: c8002773 rdcycleh a4
102e0: 40e686b3 sub a3,a3,a4
102e4: 0016b693 seqz a3,a3
102e8: 40d006b3 neg a3,a3
102ec: 00d7f7b3 and a5,a5,a3
102f0: 40878633 sub a2,a5,s0
102f4: 00c7b7b3 sltu a5,a5,a2
102f8: 409686b3 sub a3,a3,s1
102fc: 00022537 lui a0,0x22
10300: 40f686b3 sub a3,a3,a5
10304: e6850513 add a0,a0,-408 # 21e68 <__clzsi2+0xa4>
10308: 1bd000ef jal 10cc4 <printf>
1030c: 00c12083 lw ra,12(sp)
10310: 00812403 lw s0,8(sp)
10314: 00412483 lw s1,4(sp)
10318: 00000513 li a0,0
1031c: 01010113 add sp,sp,16
10320: 00008067 ret
10324: dec592e3 bne a1,a2,10108 <main+0x48>
10328: 00000593 li a1,0
1032c: f9dff06f j 102c8 <main+0x208>
00010560 <HammingDistance>:
10560: 00c548b3 xor a7,a0,a2
10564: 00d5c833 xor a6,a1,a3
10568: 1cc50463 beq a0,a2,10730 <HammingDistance+0x1d0>
1056c: 55555337 lui t1,0x55555
10570: 33333637 lui a2,0x33333
10574: 0f0f15b7 lui a1,0xf0f1
10578: 00000513 li a0,0
1057c: 55530313 add t1,t1,1365 # 55555555 <__BSS_END__+0x55531611>
10580: 33360613 add a2,a2,819 # 33333333 <__BSS_END__+0x3330f3ef>
10584: f0f58593 add a1,a1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
10588: 04000e93 li t4,64
1058c: 01f00e13 li t3,31
10590: 01c0006f j 105ac <HammingDistance+0x4c>
10594: 00e89833 sll a6,a7,a4
10598: 00000793 li a5,0
1059c: 00078893 mv a7,a5
105a0: 0107e7b3 or a5,a5,a6
105a4: 00150513 add a0,a0,1
105a8: 18078263 beqz a5,1072c <HammingDistance+0x1cc>
105ac: 0018d693 srl a3,a7,0x1
105b0: 01f81f13 sll t5,a6,0x1f
105b4: 00185713 srl a4,a6,0x1
105b8: 00df6f33 or t5,t5,a3
105bc: 01076733 or a4,a4,a6
105c0: 011f6f33 or t5,t5,a7
105c4: 01e71f93 sll t6,a4,0x1e
105c8: 002f5793 srl a5,t5,0x2
105cc: 00ffe7b3 or a5,t6,a5
105d0: 00275f93 srl t6,a4,0x2
105d4: 01f76733 or a4,a4,t6
105d8: 00ff6f33 or t5,t5,a5
105dc: 01c71f93 sll t6,a4,0x1c
105e0: 004f5793 srl a5,t5,0x4
105e4: 00ffe7b3 or a5,t6,a5
105e8: 00475f93 srl t6,a4,0x4
105ec: 01f76733 or a4,a4,t6
105f0: 00ff6f33 or t5,t5,a5
105f4: 01871f93 sll t6,a4,0x18
105f8: 008f5793 srl a5,t5,0x8
105fc: 00ffe7b3 or a5,t6,a5
10600: 00875f93 srl t6,a4,0x8
10604: 01f76733 or a4,a4,t6
10608: 00ff6f33 or t5,t5,a5
1060c: 01071f93 sll t6,a4,0x10
10610: 010f5793 srl a5,t5,0x10
10614: 00ffe7b3 or a5,t6,a5
10618: 01075f93 srl t6,a4,0x10
1061c: 01f76733 or a4,a4,t6
10620: 00ff6f33 or t5,t5,a5
10624: 00ef6f33 or t5,t5,a4
10628: 01f71f93 sll t6,a4,0x1f
1062c: 001f5793 srl a5,t5,0x1
10630: 00ffe7b3 or a5,t6,a5
10634: 0067f7b3 and a5,a5,t1
10638: 00175f93 srl t6,a4,0x1
1063c: 40ff07b3 sub a5,t5,a5
10640: 006fffb3 and t6,t6,t1
10644: 00ff3f33 sltu t5,t5,a5
10648: 41f70733 sub a4,a4,t6
1064c: 41e70733 sub a4,a4,t5
10650: 01e71f93 sll t6,a4,0x1e
10654: 0027df13 srl t5,a5,0x2
10658: 01efef33 or t5,t6,t5
1065c: 00cf7f33 and t5,t5,a2
10660: 00275f93 srl t6,a4,0x2
10664: 00c7f7b3 and a5,a5,a2
10668: 00ff07b3 add a5,t5,a5
1066c: 00cfffb3 and t6,t6,a2
10670: 00c77733 and a4,a4,a2
10674: 00ef8733 add a4,t6,a4
10678: 01e7bf33 sltu t5,a5,t5
1067c: 00ef0f33 add t5,t5,a4
10680: 01cf1f93 sll t6,t5,0x1c
10684: 0047d713 srl a4,a5,0x4
10688: 00efe733 or a4,t6,a4
1068c: 00f707b3 add a5,a4,a5
10690: 004f5f93 srl t6,t5,0x4
10694: 01ef8f33 add t5,t6,t5
10698: 00e7b733 sltu a4,a5,a4
1069c: 01e70733 add a4,a4,t5
106a0: 00b7f7b3 and a5,a5,a1
106a4: 00b77733 and a4,a4,a1
106a8: 01871f93 sll t6,a4,0x18
106ac: 0087df13 srl t5,a5,0x8
106b0: 01efef33 or t5,t6,t5
106b4: 01e78f33 add t5,a5,t5
106b8: 00875f93 srl t6,a4,0x8
106bc: 01f70733 add a4,a4,t6
106c0: 00ff37b3 sltu a5,t5,a5
106c4: 00e787b3 add a5,a5,a4
106c8: 01079f93 sll t6,a5,0x10
106cc: 010f5713 srl a4,t5,0x10
106d0: 00efe733 or a4,t6,a4
106d4: 00ef0733 add a4,t5,a4
106d8: 0107df93 srl t6,a5,0x10
106dc: 01e73f33 sltu t5,a4,t5
106e0: 01f787b3 add a5,a5,t6
106e4: 00ff0f33 add t5,t5,a5
106e8: 01e707b3 add a5,a4,t5
106ec: 07f7f793 and a5,a5,127
106f0: 40fe87b3 sub a5,t4,a5
106f4: 01079793 sll a5,a5,0x10
106f8: 0107d793 srl a5,a5,0x10
106fc: fe178713 add a4,a5,-31
10700: 00178793 add a5,a5,1
10704: 40fe0f33 sub t5,t3,a5
10708: 00f81833 sll a6,a6,a5
1070c: 01e6d6b3 srl a3,a3,t5
10710: e80752e3 bgez a4,10594 <HammingDistance+0x34>
10714: 00f897b3 sll a5,a7,a5
10718: 0106e833 or a6,a3,a6
1071c: 00078893 mv a7,a5
10720: 0107e7b3 or a5,a5,a6
10724: 00150513 add a0,a0,1
10728: e80792e3 bnez a5,105ac <HammingDistance+0x4c>
1072c: 00008067 ret
10730: e2d59ee3 bne a1,a3,1056c <HammingDistance+0xc>
10734: 00000513 li a0,0
10738: 00008067 ret
```
### Summary
| Optimization | Cycle Counts | Code Size (text) |
|:------------:|:------------:|:----------------:|
| **O0** | 9439 | 76944 |
| **O1** | 5016 | 75936 |
| **O2** | 4974 | 75948 |
| **O3** | 4734 | 76700 |
| **Os** | 5094 | 75904 |
| **Ofast** | 4734 | 76700 |