Calculate the Hamming Distance using Counting Leading Zeros
contribute by < JinYu1225 >
Program Selecting
The program I select is the Calculate the Hamming Distance using Counting Leading Zeros by 唐飴苹.
The reason why I choose this program is because I have seen a problem about Hamming Distance before, but I didn't understand the algorithm at that time. Therefore, I choose this target to have a better understanding on Hamming Distance.
Besides, the student of this subject in assignment 1 only use Count Leading Zeros (CLZ) to simplify the algorithm of Hamming Distance. However, I'm interesting in whether we can use CLZ through all the algorithm. And I also want to try to implement a 64-bit CLZ structure on Ripes.
In the above implementation, CLZ is only used in the first part of the algorithm in this code to find the bit to stop for the comparison of the two 64-bit parameter.
The following code trying to implement CLZ through the whole Hamming distance code.
In my version, CLZ is used in order to find whether there is any 0 bit that can be skip before comparing two vectors in every loop.
My program and yptang5488's program have a significant difference, mainly related to the use of a mask. In each loop of my program, I used the mask 0x7fffffffffffffff
, while yptang5488 used a simpler mask, 0x1
. One way to potentially improve performance is to change the CLZ algorithm to "count following zeros" or to reverse the bits in the code. However, I didn't implement these optimizations in this assignment.
Besides, it is not sure that doing clz will be faster than the original algorithm produce by yptang5488. My code might only be more efficient when the input arguments have several 0 bits.
Inspired by the assignment of 林以薰, I adapted my code to a more faster and efficient version.
The main idea is to use ^
(XOR) operator to pick out the difference between x1
and x2
in the very beginnging. This operation can help me reduce several unnecessary steps in my previous code.
Result of .elf in rv32emu
Assembly code
In this version, I use two registers to store a 64-bit datac1 = c1 << clz_count + 1
The test data in this code consists of 0x0000000000100000
and 0x00000000000FFFFF
, and their Hamming distance should be 21
.
While there are two additional inputs in the code, it still lacks the capability to run multiple test data sets in a single execution.
The result in the console block in RIPES will be like following:
Fit in rv32emu
Due to the incompatibility of Ripes and rv32emu, I've made some modifications to the handwritten version to address this issue.
- In the beginning, provide the program a starting address to linker.
The usage of system call
- I've also made modifications to the print function to separately handle and display the tens and units digits of numbers.
- To simplify the code, I've removed the CLZ print function.
Optimization
In this assignment, we will compile the code with different method of optimzation, including O0
, O1
, O2
, O3
, Os
,and Ofast
, and subsequently compare the difference between them. You can find more kinds of optimization and information using $ riscv-none-elf-gcc -Q --help=optimizers
。
I will concentrate on the variations in the "HammingDistance" and "main" components made by each optimization for a better comparison.
You can find my Makefile in here.
Don't make listing for non-critical code. Use hyperlinks instead.
jserv
I found that the text is too much, so I plan to import the optimization for my code.
The original C code will also compile the <stdint.h>
and <stdio.h>
, so I use typedef unsinged long long uint64_t
and typedef unsigned short uint16_t
, instead of directly using uint64_t
and uint16_t
Besides, I remove printf
and the size reduce as following.
You shall use RDCYCLE/RDCYCLEH instruction for the statistics of your program’s execution.
jserv
To calculate the cycle as a indicator of the program, the following code is add to .c file. You can look at tick.c for the reference.
Then compile .c file again
- The size has change since I put
printf
to the program.
Disassembly
In O0 optimization, the compiler transform the C code to assembly without most of the optimization. It just basically translates all the instructions we do in the C code in a general way.
O0 optimization avoids performing most of the calculations with a0
and a1
while they are used for the return value of the function. This enhances the transformation's compatibility. However, the code will also become larger since it needs more intruction to move the data.
O1 optimization will tries to decrease the code size and improve the operation time without any optimizations that increase compilation time.
We can see the difference between O1 and O0 by the above code. In O1 optimization, the code will use a0
and a1
to perform some of the calculations. This help the program remove parts of useless code to reduce the code size and improve the cycle counts.
O2 includes most of the optimization from O1, and additionally adds those do not impact the speed and code size.
In the code provided above, there is no register that hold the same value. O2 optimization allocates a single register for storing identical value arguments.
O3 optimize even more than O2. We can see that "HammingDistance" start without storing value, and there is no entrence to "HammingDistance".
If we look back to the "main", we will find that the code barely completes everything in the "main" function instead of calling other functions. Thus, we can assume that the actual text size it consumes should be smaller than what is displayed above.
Os focus on the optimization of code size.
In the code above, Os use the registers that don't need for that moment to save new arguments. Such as the a1
in line 1039c.
Ofast Optimized Assembly code
000100c0 <main>:
100c0: ff010113 add sp,sp,-16
100c4: 00112623 sw ra,12(sp)
100c8: 00812423 sw s0,8(sp)
100cc: 00912223 sw s1,4(sp)
100d0: c80024f3 rdcycleh s1
100d4: c0002473 rdcycle s0
100d8: c80027f3 rdcycleh a5
100dc: 40f484b3 sub s1,s1,a5
100e0: 0014b493 seqz s1,s1
100e4: 409004b3 neg s1,s1
100e8: 00947433 and s0,s0,s1
100ec: f601a703 lw a4,-160(gp) # 23930 <test1_x0>
100f0: f641a583 lw a1,-156(gp) # 23934 <test1_x0+0x4>
100f4: f5c1a603 lw a2,-164(gp) # 2392c <test1_x1+0x4>
100f8: f581a783 lw a5,-168(gp) # 23928 <test1_x1>
100fc: 00c5c6b3 xor a3,a1,a2
10100: 00f74833 xor a6,a4,a5
10104: 22f70063 beq a4,a5,10324 <main+0x264>
10108: 55555e37 lui t3,0x55555
1010c: 333338b7 lui a7,0x33333
10110: 0f0f1337 lui t1,0xf0f1
10114: 00000593 li a1,0
10118: 555e0e13 add t3,t3,1365 # 55555555 <__BSS_END__+0x55531611>
1011c: 33388893 add a7,a7,819 # 33333333 <__BSS_END__+0x3330f3ef>
10120: f0f30313 add t1,t1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
10124: 04000e93 li t4,64
10128: 01f00613 li a2,31
1012c: 01c0006f j 10148 <main+0x88>
10130: 00e816b3 sll a3,a6,a4
10134: 00000793 li a5,0
10138: 00078813 mv a6,a5
1013c: 00d7e7b3 or a5,a5,a3
10140: 00158593 add a1,a1,1
10144: 18078263 beqz a5,102c8 <main+0x208>
10148: 00185513 srl a0,a6,0x1
1014c: 01f69f13 sll t5,a3,0x1f
10150: 0016d713 srl a4,a3,0x1
10154: 00af6f33 or t5,t5,a0
10158: 00d76733 or a4,a4,a3
1015c: 010f6f33 or t5,t5,a6
10160: 01e71f93 sll t6,a4,0x1e
10164: 002f5793 srl a5,t5,0x2
10168: 00ffe7b3 or a5,t6,a5
1016c: 00275f93 srl t6,a4,0x2
10170: 01f76733 or a4,a4,t6
10174: 00ff6f33 or t5,t5,a5
10178: 01c71f93 sll t6,a4,0x1c
1017c: 004f5793 srl a5,t5,0x4
10180: 00ffe7b3 or a5,t6,a5
10184: 00475f93 srl t6,a4,0x4
10188: 01f76733 or a4,a4,t6
1018c: 00ff6f33 or t5,t5,a5
10190: 01871f93 sll t6,a4,0x18
10194: 008f5793 srl a5,t5,0x8
10198: 00ffe7b3 or a5,t6,a5
1019c: 00875f93 srl t6,a4,0x8
101a0: 01f76733 or a4,a4,t6
101a4: 00ff6f33 or t5,t5,a5
101a8: 01071f93 sll t6,a4,0x10
101ac: 010f5793 srl a5,t5,0x10
101b0: 00ffe7b3 or a5,t6,a5
101b4: 01075f93 srl t6,a4,0x10
101b8: 01f76733 or a4,a4,t6
101bc: 00ff6f33 or t5,t5,a5
101c0: 00ef6f33 or t5,t5,a4
101c4: 01f71f93 sll t6,a4,0x1f
101c8: 001f5793 srl a5,t5,0x1
101cc: 00ffe7b3 or a5,t6,a5
101d0: 01c7f7b3 and a5,a5,t3
101d4: 00175f93 srl t6,a4,0x1
101d8: 40ff07b3 sub a5,t5,a5
101dc: 01cfffb3 and t6,t6,t3
101e0: 00ff3f33 sltu t5,t5,a5
101e4: 41f70733 sub a4,a4,t6
101e8: 41e70733 sub a4,a4,t5
101ec: 01e71f93 sll t6,a4,0x1e
101f0: 0027df13 srl t5,a5,0x2
101f4: 01efef33 or t5,t6,t5
101f8: 011f7f33 and t5,t5,a7
101fc: 00275f93 srl t6,a4,0x2
10200: 0117f7b3 and a5,a5,a7
10204: 00ff07b3 add a5,t5,a5
10208: 011fffb3 and t6,t6,a7
1020c: 01177733 and a4,a4,a7
10210: 00ef8733 add a4,t6,a4
10214: 01e7bf33 sltu t5,a5,t5
10218: 00ef0f33 add t5,t5,a4
1021c: 01cf1f93 sll t6,t5,0x1c
10220: 0047d713 srl a4,a5,0x4
10224: 00efe733 or a4,t6,a4
10228: 00f707b3 add a5,a4,a5
1022c: 004f5f93 srl t6,t5,0x4
10230: 01ef8f33 add t5,t6,t5
10234: 00e7b733 sltu a4,a5,a4
10238: 01e70733 add a4,a4,t5
1023c: 0067f7b3 and a5,a5,t1
10240: 00677733 and a4,a4,t1
10244: 01871f93 sll t6,a4,0x18
10248: 0087df13 srl t5,a5,0x8
1024c: 01efef33 or t5,t6,t5
10250: 01e78f33 add t5,a5,t5
10254: 00875f93 srl t6,a4,0x8
10258: 01f70733 add a4,a4,t6
1025c: 00ff37b3 sltu a5,t5,a5
10260: 00e787b3 add a5,a5,a4
10264: 01079f93 sll t6,a5,0x10
10268: 010f5713 srl a4,t5,0x10
1026c: 00efe733 or a4,t6,a4
10270: 00ef0733 add a4,t5,a4
10274: 0107df93 srl t6,a5,0x10
10278: 01e73f33 sltu t5,a4,t5
1027c: 01f787b3 add a5,a5,t6
10280: 00ff0f33 add t5,t5,a5
10284: 01e707b3 add a5,a4,t5
10288: 07f7f793 and a5,a5,127
1028c: 40fe87b3 sub a5,t4,a5
10290: 01079793 sll a5,a5,0x10
10294: 0107d793 srl a5,a5,0x10
10298: fe178713 add a4,a5,-31
1029c: 00178793 add a5,a5,1
102a0: 40f60f33 sub t5,a2,a5
102a4: 00f696b3 sll a3,a3,a5
102a8: 01e55533 srl a0,a0,t5
102ac: e80752e3 bgez a4,10130 <main+0x70>
102b0: 00f817b3 sll a5,a6,a5
102b4: 00d566b3 or a3,a0,a3
102b8: 00078813 mv a6,a5
102bc: 00d7e7b3 or a5,a5,a3
102c0: 00158593 add a1,a1,1
102c4: e80792e3 bnez a5,10148 <main+0x88>
102c8: 00022537 lui a0,0x22
102cc: e5050513 add a0,a0,-432 # 21e50 <__clzsi2+0x8c>
102d0: 1f5000ef jal 10cc4 <printf>
102d4: c80026f3 rdcycleh a3
102d8: c00027f3 rdcycle a5
102dc: c8002773 rdcycleh a4
102e0: 40e686b3 sub a3,a3,a4
102e4: 0016b693 seqz a3,a3
102e8: 40d006b3 neg a3,a3
102ec: 00d7f7b3 and a5,a5,a3
102f0: 40878633 sub a2,a5,s0
102f4: 00c7b7b3 sltu a5,a5,a2
102f8: 409686b3 sub a3,a3,s1
102fc: 00022537 lui a0,0x22
10300: 40f686b3 sub a3,a3,a5
10304: e6850513 add a0,a0,-408 # 21e68 <__clzsi2+0xa4>
10308: 1bd000ef jal 10cc4 <printf>
1030c: 00c12083 lw ra,12(sp)
10310: 00812403 lw s0,8(sp)
10314: 00412483 lw s1,4(sp)
10318: 00000513 li a0,0
1031c: 01010113 add sp,sp,16
10320: 00008067 ret
10324: dec592e3 bne a1,a2,10108 <main+0x48>
10328: 00000593 li a1,0
1032c: f9dff06f j 102c8 <main+0x208>
00010560 <HammingDistance>:
10560: 00c548b3 xor a7,a0,a2
10564: 00d5c833 xor a6,a1,a3
10568: 1cc50463 beq a0,a2,10730 <HammingDistance+0x1d0>
1056c: 55555337 lui t1,0x55555
10570: 33333637 lui a2,0x33333
10574: 0f0f15b7 lui a1,0xf0f1
10578: 00000513 li a0,0
1057c: 55530313 add t1,t1,1365 # 55555555 <__BSS_END__+0x55531611>
10580: 33360613 add a2,a2,819 # 33333333 <__BSS_END__+0x3330f3ef>
10584: f0f58593 add a1,a1,-241 # f0f0f0f <__BSS_END__+0xf0ccfcb>
10588: 04000e93 li t4,64
1058c: 01f00e13 li t3,31
10590: 01c0006f j 105ac <HammingDistance+0x4c>
10594: 00e89833 sll a6,a7,a4
10598: 00000793 li a5,0
1059c: 00078893 mv a7,a5
105a0: 0107e7b3 or a5,a5,a6
105a4: 00150513 add a0,a0,1
105a8: 18078263 beqz a5,1072c <HammingDistance+0x1cc>
105ac: 0018d693 srl a3,a7,0x1
105b0: 01f81f13 sll t5,a6,0x1f
105b4: 00185713 srl a4,a6,0x1
105b8: 00df6f33 or t5,t5,a3
105bc: 01076733 or a4,a4,a6
105c0: 011f6f33 or t5,t5,a7
105c4: 01e71f93 sll t6,a4,0x1e
105c8: 002f5793 srl a5,t5,0x2
105cc: 00ffe7b3 or a5,t6,a5
105d0: 00275f93 srl t6,a4,0x2
105d4: 01f76733 or a4,a4,t6
105d8: 00ff6f33 or t5,t5,a5
105dc: 01c71f93 sll t6,a4,0x1c
105e0: 004f5793 srl a5,t5,0x4
105e4: 00ffe7b3 or a5,t6,a5
105e8: 00475f93 srl t6,a4,0x4
105ec: 01f76733 or a4,a4,t6
105f0: 00ff6f33 or t5,t5,a5
105f4: 01871f93 sll t6,a4,0x18
105f8: 008f5793 srl a5,t5,0x8
105fc: 00ffe7b3 or a5,t6,a5
10600: 00875f93 srl t6,a4,0x8
10604: 01f76733 or a4,a4,t6
10608: 00ff6f33 or t5,t5,a5
1060c: 01071f93 sll t6,a4,0x10
10610: 010f5793 srl a5,t5,0x10
10614: 00ffe7b3 or a5,t6,a5
10618: 01075f93 srl t6,a4,0x10
1061c: 01f76733 or a4,a4,t6
10620: 00ff6f33 or t5,t5,a5
10624: 00ef6f33 or t5,t5,a4
10628: 01f71f93 sll t6,a4,0x1f
1062c: 001f5793 srl a5,t5,0x1
10630: 00ffe7b3 or a5,t6,a5
10634: 0067f7b3 and a5,a5,t1
10638: 00175f93 srl t6,a4,0x1
1063c: 40ff07b3 sub a5,t5,a5
10640: 006fffb3 and t6,t6,t1
10644: 00ff3f33 sltu t5,t5,a5
10648: 41f70733 sub a4,a4,t6
1064c: 41e70733 sub a4,a4,t5
10650: 01e71f93 sll t6,a4,0x1e
10654: 0027df13 srl t5,a5,0x2
10658: 01efef33 or t5,t6,t5
1065c: 00cf7f33 and t5,t5,a2
10660: 00275f93 srl t6,a4,0x2
10664: 00c7f7b3 and a5,a5,a2
10668: 00ff07b3 add a5,t5,a5
1066c: 00cfffb3 and t6,t6,a2
10670: 00c77733 and a4,a4,a2
10674: 00ef8733 add a4,t6,a4
10678: 01e7bf33 sltu t5,a5,t5
1067c: 00ef0f33 add t5,t5,a4
10680: 01cf1f93 sll t6,t5,0x1c
10684: 0047d713 srl a4,a5,0x4
10688: 00efe733 or a4,t6,a4
1068c: 00f707b3 add a5,a4,a5
10690: 004f5f93 srl t6,t5,0x4
10694: 01ef8f33 add t5,t6,t5
10698: 00e7b733 sltu a4,a5,a4
1069c: 01e70733 add a4,a4,t5
106a0: 00b7f7b3 and a5,a5,a1
106a4: 00b77733 and a4,a4,a1
106a8: 01871f93 sll t6,a4,0x18
106ac: 0087df13 srl t5,a5,0x8
106b0: 01efef33 or t5,t6,t5
106b4: 01e78f33 add t5,a5,t5
106b8: 00875f93 srl t6,a4,0x8
106bc: 01f70733 add a4,a4,t6
106c0: 00ff37b3 sltu a5,t5,a5
106c4: 00e787b3 add a5,a5,a4
106c8: 01079f93 sll t6,a5,0x10
106cc: 010f5713 srl a4,t5,0x10
106d0: 00efe733 or a4,t6,a4
106d4: 00ef0733 add a4,t5,a4
106d8: 0107df93 srl t6,a5,0x10
106dc: 01e73f33 sltu t5,a4,t5
106e0: 01f787b3 add a5,a5,t6
106e4: 00ff0f33 add t5,t5,a5
106e8: 01e707b3 add a5,a4,t5
106ec: 07f7f793 and a5,a5,127
106f0: 40fe87b3 sub a5,t4,a5
106f4: 01079793 sll a5,a5,0x10
106f8: 0107d793 srl a5,a5,0x10
106fc: fe178713 add a4,a5,-31
10700: 00178793 add a5,a5,1
10704: 40fe0f33 sub t5,t3,a5
10708: 00f81833 sll a6,a6,a5
1070c: 01e6d6b3 srl a3,a3,t5
10710: e80752e3 bgez a4,10594 <HammingDistance+0x34>
10714: 00f897b3 sll a5,a7,a5
10718: 0106e833 or a6,a3,a6
1071c: 00078893 mv a7,a5
10720: 0107e7b3 or a5,a5,a6
10724: 00150513 add a0,a0,1
10728: e80792e3 bnez a5,105ac <HammingDistance+0x4c>
1072c: 00008067 ret
10730: e2d59ee3 bne a1,a3,1056c <HammingDistance+0xc>
10734: 00000513 li a0,0
10738: 00008067 ret
Summary
Optimization |
Cycle Counts |
Code Size (text) |
O0 |
9439 |
76944 |
O1 |
5016 |
75936 |
O2 |
4974 |
75948 |
O3 |
4734 |
76700 |
Os |
5094 |
75904 |
Ofast |
4734 |
76700 |