Try   HackMD

Assignment 2: GNU Toolchain

Calculate the Hamming Distance using Counting Leading Zeros

contribute by < JinYu1225 >

Program Selecting

The program I select is the Calculate the Hamming Distance using Counting Leading Zeros by 唐飴苹.

The reason why I choose this program is because I have seen a problem about Hamming Distance before, but I didn't understand the algorithm at that time. Therefore, I choose this target to have a better understanding on Hamming Distance.

Besides, the student of this subject in assignment 1 only use Count Leading Zeros (CLZ) to simplify the algorithm of Hamming Distance. However, I'm interesting in whether we can use CLZ through all the algorithm. And I also want to try to implement a 64-bit CLZ structure on Ripes.

Modify C code

  • The original work by yptang5488
int HammingDistance(uint64_t x0, uint64_t x1){
    int Hdist = 0;
    int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
    while(max_digit > 0){
        uint64_t c1 = x0 & 1;
        uint64_t c2 = x1 & 1;
        if(c1 != c2) Hdist += 1;

        x0 = x0 >> 1;
        x1 = x1 >> 1;
        max_digit -= 1;
    }
    return Hdist;
}

In the above implementation, CLZ is only used in the first part of the algorithm in this code to find the bit to stop for the comparison of the two 64-bit parameter.

The following code trying to implement CLZ through the whole Hamming distance code.

  • My first take
int HammingDistance(uint64_t x1, uint64_t x2)
{
    int Hdist = 0;
    int16_t clz_count = 0;
    int16_t count = 0;
    while(clz_count != 64) // stop when the clz is 64
    {
        // use clz to find the biggest 1 bit within x1 and x2
        clz_count = count_leading_zeros((x1 > x2) ? x1 : x2);   
        x1 = x1 << clz_count;    // shift biggest 1 bit to MSB
        x2 = x2 << clz_count;
        uint64_t c1 = x1 & 0x8000000000000000;  // compare MSB
        uint64_t c2 = x2 & 0x8000000000000000;
        if(c1 != c2) Hdist += 1;
        x1 &= 0x7fffffffffffffff;  // mask MSB
        x2 &= 0x7fffffffffffffff;
        count += 1;
    }
    return Hdist;
}

In my version, CLZ is used in order to find whether there is any 0 bit that can be skip before comparing two vectors in every loop.

My program and yptang5488's program have a significant difference, mainly related to the use of a mask. In each loop of my program, I used the mask 0x7fffffffffffffff, while yptang5488 used a simpler mask, 0x1. One way to potentially improve performance is to change the CLZ algorithm to "count following zeros" or to reverse the bits in the code. However, I didn't implement these optimizations in this assignment.

Besides, it is not sure that doing clz will be faster than the original algorithm produce by yptang5488. My code might only be more efficient when the input arguments have several 0 bits.

  • My lastest version
uint64_t c1 = x1 ^ x2;
// stop when there is no different between x1 and x2
while(c1 != 0) 
{
    // find biggest different bit by clz
    clz_count = count_leading_zeros(c1);
    c1 = c1 << clz_count + 1;
    Hdist += 1;
}

Inspired by the assignment of 林以薰, I adapted my code to a more faster and efficient version.

The main idea is to use ^ (XOR) operator to pick out the difference between x1 and x2 in the very beginnging. This operation can help me reduce several unnecessary steps in my previous code.

Result of .elf in rv32emu

$ riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -o HammingDistance.elf HammingDistance.c
$../../build/rv32emu HammingDistance.elf
Hamming Distance = 21
Hamming Distance = 63
Hamming Distance = 0
inferior exit code 0

Assembly code

In this version, I use two registers to store a 64-bit data. As the result, some operations that are straightforward in C code will be more complicated in assembly, such as c1 = c1 << clz_count + 1.

    # c1 = x1 ^ x2
    # s1 = upper 32-bit of c1
    # s2 = lower 32-bit of c1
    # s6: check if shift more than 32
    # c1 = c1 << clz_count + 1
    addi a0, a0, 1
    bge a0, s6, SGE32
SLT32: # shift less than 32-bit
    sll s1, s1, a0 # s1 = s1 << a0
    sub t0, s6, a0 # t0 = 32 - (clz_count + 1)
    srl t1, s2, t0 # t1 = s2 >> t0
    or s1, s1, t1  # s1 = s1 & t1
    sll s2, s2, a0 # s2 = s2 << a0
    j HAMDIS_LOOP
SGE32: # shift greater equ 32-bit
    sub t0, a0, s6 # t0 = (clz_count + 1) - 32
    sll s1, s2, t0 # s1 = s2 << t0
    add s2, x0, x0 # s2 = 0
    j HAMDIS_LOOP

The test data in this code consists of 0x0000000000100000 and 0x00000000000FFFFF, and their Hamming distance should be 21.

While there are two additional inputs in the code, it still lacks the capability to run multiple test data sets in a single execution.

.data
    test2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
    test3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0

The result in the console block in RIPES will be like following:

...
The clz result is 0
The clz result is 0
The clz result is 0
The Hamming distance is 21
Program exited with code: 0

Fit in rv32emu

Due to the incompatibility of Ripes and rv32emu, I've made some modifications to the handwritten version to address this issue.

  • In the beginning, provide the program a starting address to linker.
# set the program start address
.org 0
# define a global symbol "_start" for the entry of the program
.global _start
...
# where the program start
.text
_start:
...

The usage of system call

  • I've also made modifications to the print function to separately handle and display the tens and units digits of numbers.
  • To simplify the code, I've removed the CLZ print function.
# new system call
.equ SYSSTDOUT, 1
.equ SYSWRITE, 64
.equ SYSEXIT, 93
...
# new system call of print needs size of the buffer
str2: .string "\n"
	  .set str_size2, .-str2
str3: .string "The Hamming distance is "
	  .set str_size3, .-str3
...
# print
li a0, SYSSTDOUT
la a1, str3
li a2, str_size3
li a7, SYSWRITE
ecall
...
# done
addi sp, sp, 12
add a0, x0, 0
li a7, SYSEXIT
ecall
  • compile & result
$ make
riscv-none-elf-as -R -march=rv32i -mabi=ilp32 -o HammingDistance.o HammingDistance.s
riscv-none-elf-ld -o HammingDistance.elf -T HammingDistance.ld --oformat=elf32-littleriscv HammingDistance.o
$ ../../build/rv32emu HammingDistance.elf
The Hamming distance is 21
inferior exit code 0

Optimization

In this assignment, we will compile the code with different method of optimzation, including O0, O1, O2, O3, Os,and Ofast, and subsequently compare the difference between them. You can find more kinds of optimization and information using $ riscv-none-elf-gcc -Q --help=optimizers

I will concentrate on the variations in the "HammingDistance" and "main" components made by each optimization for a better comparison.

You can find my Makefile in here.

Don't make listing for non-critical code. Use hyperlinks instead.
:notes: jserv

O0 Optimized Assembly code

  • Compile
$ make O0
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 HammingDistance.c -o HammingDistance_O0.elf
  • Size
$ riscv-none-elf-size HammingDistance_O0.elf
   text	   data	    bss	    dec	    hex	filename
  76804	   2372	   1528	  80704	  13b40	HammingDistance_O0.elf
  • ELF header
$ riscv-none-elf-readelf -h HammingDistance_O0.elf
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           RISC-V
  Version:                           0x1
  Entry point address:               0x100d8
  Start of program headers:          52 (bytes into file)
  Start of section headers:          94780 (bytes into file)
  Flags:                             0x0
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         3
  Size of section headers:           40 (bytes)
  Number of section headers:         15
  Section header string table index: 14
  • Save as txt
riscv-none-elf-objdump -d HammingDistance_O0.elf >HammingDistance_O0.txt

I found that the text is too much, so I plan to import the optimization for my code.

The original C code will also compile the <stdint.h> and <stdio.h>, so I use typedef unsinged long long uint64_t and typedef unsigned short uint16_t, instead of directly using uint64_t and uint16_t

Besides, I remove printf and the size reduce as following.

$ riscv-none-elf-size HammingDistance_O0.elf
   text	   data	    bss	    dec	    hex	filename
  10072	   1424	    840	  12336	   3030	HammingDistance_O0.elf

You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
:notes: jserv

To calculate the cycle as a indicator of the program, the following code is add to .c file. You can look at tick.c for the reference.

# include <inttypes.h>
	
typedef uint64_t ticks;
static inline ticks getticks(void)
{
    uint64_t result;
    uint32_t l, h, h2;
    asm volatile(
        "rdcycleh %0\n"
        "rdcycle %1\n"
        "rdcycleh %2\n"
        "sub %0, %0, %2\n"
        "seqz %0, %0\n"
        "sub %0, zero, %0\n"
        "and %1, %1, %0\n"
        : "=r"(h), "=r"(l), "=r"(h2));
    result = (((uint64_t) h) << 32) | ((uint64_t) l);
    return result;
}

int main() {
    ticks t0 = getticks();
    printf("Hamming Distance = %d\n", HammingDistance(test1_x0, test1_x1));
    ticks t1 = getticks();
    printf("Cycle Count: %llu\n", t1 - t0);
    return 0;
}

Then compile .c file again

$ make O0
riscv-none-elf-gcc -march=rv32i -mabi=ilp32 -O0 HammingDistance.c -o HammingDistance_O0.elf
riscv-none-elf-objdump -d HammingDistance_O0.elf > HammingDistance_O0.txt
$ ../../build/rv32emu HammingDistance_O0.elf
Hamming Distance = 21
Cycle Count: 9439
inferior exit code 0
  • The size has change since I put printf to the program.
   text	   data	    bss	    dec	    hex	filename
  76944	   2372	   1528	  80844	  13bcc	HammingDistance_O0.elf

Disassembly

00010688 <HammingDistance>:
   10688:	fc010113          	add	sp,sp,-64
   1068c:	02112e23          	sw	ra,60(sp)
   10690:	02812c23          	sw	s0,56(sp)
   10694:	03212a23          	sw	s2,52(sp)
   10698:	03312823          	sw	s3,48(sp)
   1069c:	04010413          	add	s0,sp,64
   106a0:	fca42423          	sw	a0,-56(s0)
   106a4:	fcb42623          	sw	a1,-52(s0)
   106a8:	fcc42023          	sw	a2,-64(s0)
   106ac:	fcd42223          	sw	a3,-60(s0)
   106b0:	fe042623          	sw	zero,-20(s0)
   ...

In O0 optimization, the compiler transform the C code to assembly without most of the optimization. It just basically translates all the instructions we do in the C code in a general way.

O0 optimization avoids performing most of the calculations with a0 and a1 while they are used for the return value of the function. This enhances the transformation's compatibility. However, the code will also become larger since it needs more intruction to move the data.

O1 Optimized Assembly code

  • Compile
Hamming Distance = 21
Cycle Count: 5016
inferior exit code 0
  • Size
$ riscv-none-elf-size HammingDistance_O1.elf
   text	   data	    bss	    dec	    hex	filename
  75936	   2372	   1528	  79836	  137dc	HammingDistance_O1.elf
  • Disassembly
00010310 <HammingDistance>:
   ...
   10360:	00f49533          	sll	a0,s1,a5
   10364:	00a76533          	or	a0,a4,a0
   10368:	00f417b3          	sll	a5,s0,a5
   ...
   103ec:	f601a503          	lw	a0,-160(gp) # 23930 <test1_x0>
   103f0:	f641a583          	lw	a1,-156(gp) # 23934 <test1_x0+0x4>
   103f4:	f1dff0ef          	jal	10310 <HammingDistance>
   103f8:	00050593          	mv	a1,a0
   103fc:	00022537          	lui	a0,0x22
   10400:	b5050513          	add	a0,a0,-1200 # 21b50 <__clzsi2+0x88>
   10404:	5c4000ef          	jal	109c8 <printf>
   10408:	d79ff0ef          	jal	10180 <getticks>
   1040c:	40850633          	sub	a2,a0,s0
   10410:	00c53533          	sltu	a0,a0,a2
   10414:	409585b3          	sub	a1,a1,s1
   10418:	40a586b3          	sub	a3,a1,a0
   1041c:	00022537          	lui	a0,0x22
   10420:	b6850513          	add	a0,a0,-1176 # 21b68 <__clzsi2+0xa0>
   ...

O1 optimization will tries to decrease the code size and improve the operation time without any optimizations that increase compilation time.

We can see the difference between O1 and O0 by the above code. In O1 optimization, the code will use a0 and a1 to perform some of the calculations. This help the program remove parts of useless code to reduce the code size and improve the cycle counts.

O2 Optimized Assembly code

  • Compile
Hamming Distance = 21
Cycle Count: 4974
inferior exit code 0
  • Size
$ riscv-none-elf-size HammingDistance_O2.elf
   text	   data	    bss	    dec	    hex	filename
  75948	   2372	   1528	  79848	  137e8	HammingDistance_O2.elf
  • Disassembly
   ...
00010388 <HammingDistance>:
   10388:	fe010113          	add	sp,sp,-32
   1038c:	00812c23          	sw	s0,24(sp)
   10390:	00912a23          	sw	s1,20(sp)
   10394:	00112e23          	sw	ra,28(sp)
   10398:	01312623          	sw	s3,12(sp)
   1039c:	00c544b3          	xor	s1,a0,a2
   103a0:	00d5c433          	xor	s0,a1,a3
   103a4:	08c50263          	beq	a0,a2,10428 <HammingDistance+0xa0>
   103a8:	01212823          	sw	s2,16(sp)
   103ac:	00000993          	li	s3,0
   103b0:	01f00913          	li	s2,31
   ...

O2 includes most of the optimization from O1, and additionally adds those do not impact the speed and code size.

In the code provided above, there is no register that hold the same value. O2 optimization allocates a single register for storing identical value arguments.

O3 Optimized Assembly code

  • Compile
Hamming Distance = 21
Cycle Count: 4734
inferior exit code 0
  • Size
$ riscv-none-elf-size HammingDistance_O3.elf
   text	   data	    bss	    dec	    hex	filename
  76700	   2372	   1528	  80600	  13ad8	HammingDistance_O3.elf
  • Disassembly
   ...
00010560 <HammingDistance>:
   10560:	00c548b3          	xor	a7,a0,a2
   10564:	00d5c833          	xor	a6,a1,a3
   10568:	1cc50463          	beq	a0,a2,10730 <HammingDistance+0x1d0>
   1056c:	55555337          	lui	t1,0x55555
   10570:	33333637          	lui	a2,0x33333
   10574:	0f0f15b7          	lui	a1,0xf0f1
   ...

O3 optimize even more than O2. We can see that "HammingDistance" start without storing value, and there is no entrence to "HammingDistance".

If we look back to the "main", we will find that the code barely completes everything in the "main" function instead of calling other functions. Thus, we can assume that the actual text size it consumes should be smaller than what is displayed above.

Os Optimized Assembly code

  • Compile
Hamming Distance = 21
Cycle Count: 5094
inferior exit code 0
  • Size
$ riscv-none-elf-size HammingDistance_Os.elf
   text	   data	    bss	    dec	    hex	filename
  75904	   2372	   1528	  79804	  137bc	HammingDistance_Os.elf
  • Disassembly
   ...
00010388 <HammingDistance>:
   10388:	fe010113          	add	sp,sp,-32
   1038c:	00812c23          	sw	s0,24(sp)
   10390:	00912a23          	sw	s1,20(sp)
   10394:	00112e23          	sw	ra,28(sp)
   10398:	00c54433          	xor	s0,a0,a2
   1039c:	00d5c5b3          	xor	a1,a1,a3
   103a0:	00000493          	li	s1,0
   103a4:	00b467b3          	or	a5,s0,a1
   ...

Os focus on the optimization of code size.

In the code above, Os use the registers that don't need for that moment to save new arguments. Such as the a1 in line 1039c.

Ofast Optimized Assembly code

  • Compile
Hamming Distance = 21
Cycle Count: 4734
inferior exit code 0
  • Size
$ riscv-none-elf-size HammingDistance_Ofast.elf
   text	   data	    bss	    dec	    hex	filename
  76700	   2372	   1528	  80600	  13ad8	HammingDistance_Ofast.elf
  • Disassembly
000100c0 <main>:
   100c0:	ff010113          	add	sp,sp,-16
   100c4:	00112623          	sw	ra,12(sp)
   100c8:	00812423          	sw	s0,8(sp)
   100cc:	00912223          	sw	s1,4(sp)
   100d0:	c80024f3          	rdcycleh	s1
   100d4:	c0002473          	rdcycle	s0
   100d8:	c80027f3          	rdcycleh	a5
   100dc:	40f484b3          	sub	s1,s1,a5
   100e0:	0014b493          	seqz	s1,s1
   100e4:	409004b3          	neg	s1,s1
   100e8:	00947433          	and	s0,s0,s1
   100ec:	f601a703          	lw	a4,-160(gp) # 23930 <test1_x0>
   100f0:	f641a583          	lw	a1,-156(gp) # 23934 <test1_x0+0x4>
   100f4:	f5c1a603          	lw	a2,-164(gp) # 2392c <test1_x1+0x4>
   100f8:	f581a783          	lw	a5,-168(gp) # 23928 <test1_x1>
   100fc:	00c5c6b3          	xor	a3,a1,a2
   10100:	00f74833          	xor	a6,a4,a5
   10104:	22f70063          	beq	a4,a5,10324 <main+0x264>
   10108:	55555e37          	lui	t3,0x55555
   1010c:	333338b7          	lui	a7,0x33333
   10110:	0f0f1337          	lui	t1,0xf0f1
   10114:	00000593          	li	a1,0
   10118:	555e0e13          	add	t3,t3,1365 # 55555555 <__BSS_END__+0x55531611>
   1011c:	33388893          	add	a7,a7,819 # 33333333 <__BSS_END__+0x3330f3ef>
   10120:	f0f30313          	add	t1,t1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
   10124:	04000e93          	li	t4,64
   10128:	01f00613          	li	a2,31
   1012c:	01c0006f          	j	10148 <main+0x88>
   10130:	00e816b3          	sll	a3,a6,a4
   10134:	00000793          	li	a5,0
   10138:	00078813          	mv	a6,a5
   1013c:	00d7e7b3          	or	a5,a5,a3
   10140:	00158593          	add	a1,a1,1
   10144:	18078263          	beqz	a5,102c8 <main+0x208>
   10148:	00185513          	srl	a0,a6,0x1
   1014c:	01f69f13          	sll	t5,a3,0x1f
   10150:	0016d713          	srl	a4,a3,0x1
   10154:	00af6f33          	or	t5,t5,a0
   10158:	00d76733          	or	a4,a4,a3
   1015c:	010f6f33          	or	t5,t5,a6
   10160:	01e71f93          	sll	t6,a4,0x1e
   10164:	002f5793          	srl	a5,t5,0x2
   10168:	00ffe7b3          	or	a5,t6,a5
   1016c:	00275f93          	srl	t6,a4,0x2
   10170:	01f76733          	or	a4,a4,t6
   10174:	00ff6f33          	or	t5,t5,a5
   10178:	01c71f93          	sll	t6,a4,0x1c
   1017c:	004f5793          	srl	a5,t5,0x4
   10180:	00ffe7b3          	or	a5,t6,a5
   10184:	00475f93          	srl	t6,a4,0x4
   10188:	01f76733          	or	a4,a4,t6
   1018c:	00ff6f33          	or	t5,t5,a5
   10190:	01871f93          	sll	t6,a4,0x18
   10194:	008f5793          	srl	a5,t5,0x8
   10198:	00ffe7b3          	or	a5,t6,a5
   1019c:	00875f93          	srl	t6,a4,0x8
   101a0:	01f76733          	or	a4,a4,t6
   101a4:	00ff6f33          	or	t5,t5,a5
   101a8:	01071f93          	sll	t6,a4,0x10
   101ac:	010f5793          	srl	a5,t5,0x10
   101b0:	00ffe7b3          	or	a5,t6,a5
   101b4:	01075f93          	srl	t6,a4,0x10
   101b8:	01f76733          	or	a4,a4,t6
   101bc:	00ff6f33          	or	t5,t5,a5
   101c0:	00ef6f33          	or	t5,t5,a4
   101c4:	01f71f93          	sll	t6,a4,0x1f
   101c8:	001f5793          	srl	a5,t5,0x1
   101cc:	00ffe7b3          	or	a5,t6,a5
   101d0:	01c7f7b3          	and	a5,a5,t3
   101d4:	00175f93          	srl	t6,a4,0x1
   101d8:	40ff07b3          	sub	a5,t5,a5
   101dc:	01cfffb3          	and	t6,t6,t3
   101e0:	00ff3f33          	sltu	t5,t5,a5
   101e4:	41f70733          	sub	a4,a4,t6
   101e8:	41e70733          	sub	a4,a4,t5
   101ec:	01e71f93          	sll	t6,a4,0x1e
   101f0:	0027df13          	srl	t5,a5,0x2
   101f4:	01efef33          	or	t5,t6,t5
   101f8:	011f7f33          	and	t5,t5,a7
   101fc:	00275f93          	srl	t6,a4,0x2
   10200:	0117f7b3          	and	a5,a5,a7
   10204:	00ff07b3          	add	a5,t5,a5
   10208:	011fffb3          	and	t6,t6,a7
   1020c:	01177733          	and	a4,a4,a7
   10210:	00ef8733          	add	a4,t6,a4
   10214:	01e7bf33          	sltu	t5,a5,t5
   10218:	00ef0f33          	add	t5,t5,a4
   1021c:	01cf1f93          	sll	t6,t5,0x1c
   10220:	0047d713          	srl	a4,a5,0x4
   10224:	00efe733          	or	a4,t6,a4
   10228:	00f707b3          	add	a5,a4,a5
   1022c:	004f5f93          	srl	t6,t5,0x4
   10230:	01ef8f33          	add	t5,t6,t5
   10234:	00e7b733          	sltu	a4,a5,a4
   10238:	01e70733          	add	a4,a4,t5
   1023c:	0067f7b3          	and	a5,a5,t1
   10240:	00677733          	and	a4,a4,t1
   10244:	01871f93          	sll	t6,a4,0x18
   10248:	0087df13          	srl	t5,a5,0x8
   1024c:	01efef33          	or	t5,t6,t5
   10250:	01e78f33          	add	t5,a5,t5
   10254:	00875f93          	srl	t6,a4,0x8
   10258:	01f70733          	add	a4,a4,t6
   1025c:	00ff37b3          	sltu	a5,t5,a5
   10260:	00e787b3          	add	a5,a5,a4
   10264:	01079f93          	sll	t6,a5,0x10
   10268:	010f5713          	srl	a4,t5,0x10
   1026c:	00efe733          	or	a4,t6,a4
   10270:	00ef0733          	add	a4,t5,a4
   10274:	0107df93          	srl	t6,a5,0x10
   10278:	01e73f33          	sltu	t5,a4,t5
   1027c:	01f787b3          	add	a5,a5,t6
   10280:	00ff0f33          	add	t5,t5,a5
   10284:	01e707b3          	add	a5,a4,t5
   10288:	07f7f793          	and	a5,a5,127
   1028c:	40fe87b3          	sub	a5,t4,a5
   10290:	01079793          	sll	a5,a5,0x10
   10294:	0107d793          	srl	a5,a5,0x10
   10298:	fe178713          	add	a4,a5,-31
   1029c:	00178793          	add	a5,a5,1
   102a0:	40f60f33          	sub	t5,a2,a5
   102a4:	00f696b3          	sll	a3,a3,a5
   102a8:	01e55533          	srl	a0,a0,t5
   102ac:	e80752e3          	bgez	a4,10130 <main+0x70>
   102b0:	00f817b3          	sll	a5,a6,a5
   102b4:	00d566b3          	or	a3,a0,a3
   102b8:	00078813          	mv	a6,a5
   102bc:	00d7e7b3          	or	a5,a5,a3
   102c0:	00158593          	add	a1,a1,1
   102c4:	e80792e3          	bnez	a5,10148 <main+0x88>
   102c8:	00022537          	lui	a0,0x22
   102cc:	e5050513          	add	a0,a0,-432 # 21e50 <__clzsi2+0x8c>
   102d0:	1f5000ef          	jal	10cc4 <printf>
   102d4:	c80026f3          	rdcycleh	a3
   102d8:	c00027f3          	rdcycle	a5
   102dc:	c8002773          	rdcycleh	a4
   102e0:	40e686b3          	sub	a3,a3,a4
   102e4:	0016b693          	seqz	a3,a3
   102e8:	40d006b3          	neg	a3,a3
   102ec:	00d7f7b3          	and	a5,a5,a3
   102f0:	40878633          	sub	a2,a5,s0
   102f4:	00c7b7b3          	sltu	a5,a5,a2
   102f8:	409686b3          	sub	a3,a3,s1
   102fc:	00022537          	lui	a0,0x22
   10300:	40f686b3          	sub	a3,a3,a5
   10304:	e6850513          	add	a0,a0,-408 # 21e68 <__clzsi2+0xa4>
   10308:	1bd000ef          	jal	10cc4 <printf>
   1030c:	00c12083          	lw	ra,12(sp)
   10310:	00812403          	lw	s0,8(sp)
   10314:	00412483          	lw	s1,4(sp)
   10318:	00000513          	li	a0,0
   1031c:	01010113          	add	sp,sp,16
   10320:	00008067          	ret
   10324:	dec592e3          	bne	a1,a2,10108 <main+0x48>
   10328:	00000593          	li	a1,0
   1032c:	f9dff06f          	j	102c8 <main+0x208>
   
00010560 <HammingDistance>:
   10560:	00c548b3          	xor	a7,a0,a2
   10564:	00d5c833          	xor	a6,a1,a3
   10568:	1cc50463          	beq	a0,a2,10730 <HammingDistance+0x1d0>
   1056c:	55555337          	lui	t1,0x55555
   10570:	33333637          	lui	a2,0x33333
   10574:	0f0f15b7          	lui	a1,0xf0f1
   10578:	00000513          	li	a0,0
   1057c:	55530313          	add	t1,t1,1365 # 55555555 <__BSS_END__+0x55531611>
   10580:	33360613          	add	a2,a2,819 # 33333333 <__BSS_END__+0x3330f3ef>
   10584:	f0f58593          	add	a1,a1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
   10588:	04000e93          	li	t4,64
   1058c:	01f00e13          	li	t3,31
   10590:	01c0006f          	j	105ac <HammingDistance+0x4c>
   10594:	00e89833          	sll	a6,a7,a4
   10598:	00000793          	li	a5,0
   1059c:	00078893          	mv	a7,a5
   105a0:	0107e7b3          	or	a5,a5,a6
   105a4:	00150513          	add	a0,a0,1
   105a8:	18078263          	beqz	a5,1072c <HammingDistance+0x1cc>
   105ac:	0018d693          	srl	a3,a7,0x1
   105b0:	01f81f13          	sll	t5,a6,0x1f
   105b4:	00185713          	srl	a4,a6,0x1
   105b8:	00df6f33          	or	t5,t5,a3
   105bc:	01076733          	or	a4,a4,a6
   105c0:	011f6f33          	or	t5,t5,a7
   105c4:	01e71f93          	sll	t6,a4,0x1e
   105c8:	002f5793          	srl	a5,t5,0x2
   105cc:	00ffe7b3          	or	a5,t6,a5
   105d0:	00275f93          	srl	t6,a4,0x2
   105d4:	01f76733          	or	a4,a4,t6
   105d8:	00ff6f33          	or	t5,t5,a5
   105dc:	01c71f93          	sll	t6,a4,0x1c
   105e0:	004f5793          	srl	a5,t5,0x4
   105e4:	00ffe7b3          	or	a5,t6,a5
   105e8:	00475f93          	srl	t6,a4,0x4
   105ec:	01f76733          	or	a4,a4,t6
   105f0:	00ff6f33          	or	t5,t5,a5
   105f4:	01871f93          	sll	t6,a4,0x18
   105f8:	008f5793          	srl	a5,t5,0x8
   105fc:	00ffe7b3          	or	a5,t6,a5
   10600:	00875f93          	srl	t6,a4,0x8
   10604:	01f76733          	or	a4,a4,t6
   10608:	00ff6f33          	or	t5,t5,a5
   1060c:	01071f93          	sll	t6,a4,0x10
   10610:	010f5793          	srl	a5,t5,0x10
   10614:	00ffe7b3          	or	a5,t6,a5
   10618:	01075f93          	srl	t6,a4,0x10
   1061c:	01f76733          	or	a4,a4,t6
   10620:	00ff6f33          	or	t5,t5,a5
   10624:	00ef6f33          	or	t5,t5,a4
   10628:	01f71f93          	sll	t6,a4,0x1f
   1062c:	001f5793          	srl	a5,t5,0x1
   10630:	00ffe7b3          	or	a5,t6,a5
   10634:	0067f7b3          	and	a5,a5,t1
   10638:	00175f93          	srl	t6,a4,0x1
   1063c:	40ff07b3          	sub	a5,t5,a5
   10640:	006fffb3          	and	t6,t6,t1
   10644:	00ff3f33          	sltu	t5,t5,a5
   10648:	41f70733          	sub	a4,a4,t6
   1064c:	41e70733          	sub	a4,a4,t5
   10650:	01e71f93          	sll	t6,a4,0x1e
   10654:	0027df13          	srl	t5,a5,0x2
   10658:	01efef33          	or	t5,t6,t5
   1065c:	00cf7f33          	and	t5,t5,a2
   10660:	00275f93          	srl	t6,a4,0x2
   10664:	00c7f7b3          	and	a5,a5,a2
   10668:	00ff07b3          	add	a5,t5,a5
   1066c:	00cfffb3          	and	t6,t6,a2
   10670:	00c77733          	and	a4,a4,a2
   10674:	00ef8733          	add	a4,t6,a4
   10678:	01e7bf33          	sltu	t5,a5,t5
   1067c:	00ef0f33          	add	t5,t5,a4
   10680:	01cf1f93          	sll	t6,t5,0x1c
   10684:	0047d713          	srl	a4,a5,0x4
   10688:	00efe733          	or	a4,t6,a4
   1068c:	00f707b3          	add	a5,a4,a5
   10690:	004f5f93          	srl	t6,t5,0x4
   10694:	01ef8f33          	add	t5,t6,t5
   10698:	00e7b733          	sltu	a4,a5,a4
   1069c:	01e70733          	add	a4,a4,t5
   106a0:	00b7f7b3          	and	a5,a5,a1
   106a4:	00b77733          	and	a4,a4,a1
   106a8:	01871f93          	sll	t6,a4,0x18
   106ac:	0087df13          	srl	t5,a5,0x8
   106b0:	01efef33          	or	t5,t6,t5
   106b4:	01e78f33          	add	t5,a5,t5
   106b8:	00875f93          	srl	t6,a4,0x8
   106bc:	01f70733          	add	a4,a4,t6
   106c0:	00ff37b3          	sltu	a5,t5,a5
   106c4:	00e787b3          	add	a5,a5,a4
   106c8:	01079f93          	sll	t6,a5,0x10
   106cc:	010f5713          	srl	a4,t5,0x10
   106d0:	00efe733          	or	a4,t6,a4
   106d4:	00ef0733          	add	a4,t5,a4
   106d8:	0107df93          	srl	t6,a5,0x10
   106dc:	01e73f33          	sltu	t5,a4,t5
   106e0:	01f787b3          	add	a5,a5,t6
   106e4:	00ff0f33          	add	t5,t5,a5
   106e8:	01e707b3          	add	a5,a4,t5
   106ec:	07f7f793          	and	a5,a5,127
   106f0:	40fe87b3          	sub	a5,t4,a5
   106f4:	01079793          	sll	a5,a5,0x10
   106f8:	0107d793          	srl	a5,a5,0x10
   106fc:	fe178713          	add	a4,a5,-31
   10700:	00178793          	add	a5,a5,1
   10704:	40fe0f33          	sub	t5,t3,a5
   10708:	00f81833          	sll	a6,a6,a5
   1070c:	01e6d6b3          	srl	a3,a3,t5
   10710:	e80752e3          	bgez	a4,10594 <HammingDistance+0x34>
   10714:	00f897b3          	sll	a5,a7,a5
   10718:	0106e833          	or	a6,a3,a6
   1071c:	00078893          	mv	a7,a5
   10720:	0107e7b3          	or	a5,a5,a6
   10724:	00150513          	add	a0,a0,1
   10728:	e80792e3          	bnez	a5,105ac <HammingDistance+0x4c>
   1072c:	00008067          	ret
   10730:	e2d59ee3          	bne	a1,a3,1056c <HammingDistance+0xc>
   10734:	00000513          	li	a0,0
   10738:	00008067          	ret

Summary

Optimization Cycle Counts Code Size (text)
O0 9439 76944
O1 5016 75936
O2 4974 75948
O3 4734 76700
Os 5094 75904
Ofast 4734 76700