# Total Hamming Distance Using Clz ([Leetcode 477](https://leetcode.com/problems/total-hamming-distance/description/))
## Abstraction
Hamming distance is an elegant mathematic concept for binary world, especially in infornmation and communication technology.
For example, Hamming distance can be used for calculating bit error rate, optimal decoding method, power calculation, and more.
In this homework, I'll discuss some method for calculation of total Hamming distance with a set of integer given, comparing the performance of each case. And the concept of clz mentioned in problem C of quiz one will be applied.
## Quiz1 Problem `C`
### `fabsf` C code
``` c
static inline float fabsf(float x) {
uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer
i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value
x = *(float *)&i; // Write the modified bits back into the float
return x;
}
```
### `fabsf` assembly code
``` =
.data
test_data:
.word 0xBF800000 # -1.0
.word 0x43D80000 # 432.0
.word 0x42180000 # -38.0
verify:
.word 0X3F800000 # 1.0
.word 0x43D80000 # 432.0
.word 0X42180000 # 38.0
wrong_message:
.asciz "wrong\n"
right_message:
.asciz "right\n"
newline:
.asciz "\n"
.text
.global main
main:
la t0, test_data # Load address of `tests` into t0
la t1, verify # Load address of 'verify' into t1 data
addi t2, t2, 3 #3 number of test data
loop:
beq t2,zero,exit # If the t2=0, exit.
addi t2,t2,-1 # t2-1
jal fabsf # Jump and link to `fabsf` function
addi t0,t0,4 # next tested data
addi t1,t1,4 # next verifying data
j loop # jump to loop
fabsf:
lw a0, 0(t0) # Load the word from memory address pointed by t0 into a0
lw a1, 0(t1) # Load the word from memory address pointed by t1 into a1
li a2, 0x7FFFFFFF # Mask
and a0,a0,a2 # Set first bit to zero
bne a0, a1,false # if a0 is not equal to a1, then false message
sw a0, 0(t0) # Store the word from t1 back to memory address pointed by t0
addi t4, ra,0 # store ra in t4
jal true # jump to true message
addi ra,t4,0 # reload ra
# Return
ret
false:
la a0, wrong_message
li a7,4
ecall
lw a0,0(t0)
li a7,1
ecall
lw a0,0(t1)
li a7,1
ecall
ret
true:
la a0, right_message
li a7,4
ecall
la a0, newline
li a7,4
ecall
lw a0,0(t0)
li a7,1
ecall
la a0, newline
li a7,4
ecall
lw a0,0(t1)
li a7,1
ecall
ret
exit:
li a7,10
ecall
```
### `clz` C code
```c
static inline int my_clz(uint32_t x) {
int count = 0;
for (int i = 31; i >= 0; --i) {
if (x & (1U << i))
break;
count++;
}
return count;
}
```
### `clz` assembly
In this case, I remain alarm message only under false condition.
``` =
.data
test_data:
.word 0x1
.word 0x100
.word 0x100000
verify:
.word 31
.word 23
.word 11
wrong_message:
.asciz "wrong"
end_message:
.asciz "\n end"
.text
.global main
main:
la t0, test_data
la t1, verify
li t2, 3 #3 test data
jal test #jump and link to test function
j end
test:
addi t4,ra,0 #store ra in t4
beq t2, zero, exit #end if t2 = 0
lw a0,0(t0) #load first test data
lw a1,0(t1) #load first verifuing data
li a2,0 #use for counting the number of leading zero, set to 0
li a3,32 #total 32bits
jal clz #jump to clz and save position to clz, counting leading zero and save to a3
bne a3 ,a1,false #if a3 != a1, go labeled false
addi ra,t4,0 #recover ra
addi t2,t2,-1 #index -1
addi t0,t0,4 #next data
addi t1,t1,4 #next verying data
j test #back to this loop
clz:
beq a0, zero,exit # if a0 == 0, meaning that shift encough step, end loop
addi a3,a3,-1 #if t1 !=0, leadning zero bits -1
srli a0,a0,1 #right shift 1 bit
j clz #back to loop
exit:
ret
false:
la a0, wrong_message
li a7,4
ecall
j end
end:
la a0,end_message
li a7,4
ecall
li a7,10
ecall
```
## Leetcode: Total Hamming Distance([Leetcode 477](https://leetcode.com/problems/total-hamming-distance/description/))
**Description:**
The Hamming distance between two integers is the number of positions at which the corresponding bits are different.
**Constrains:**
1 <= nums.length <= 104
0 <= nums[i] <= 109
The answer for the given input will fit in a 32-bit integer.
### Solution `A`
**1. Idea**
To calculate the total Hamming distance of all given integers, I propose calculating the Hamming distance of the first element with the remaining elements. Then, calculate the Hamming distance of the second element with the remaining elements, excluding the first element, and so on. Continue this until the last element and the element before it are processed, and sum up all the Hamming distances to obtain the total Hamming distance.
The Hamming distance of each pair will apply $XOR$ and right shift operation.
**2. Mathematical Representation**
$$H_{total}=\displaystyle\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_i,x_j)$$
where $H$ denoted as the Hamming distance function and $x_i$ denoted as the integer .
**3. C code**
Noted that this method can only afford calculation for Hamming distance in few integers due to its complexity $O(n^2)$.
```C
int totalHammingDistance(int* nums, int numsSize) {
int totalDistance = 0;
for (int i = 0; i < numsSize-1;i ++)
{
for (int j = i + 1; j < numsSize;j++)
{
uint32_t w = nums[i]^ nums[j];
for (int t = 0; t < 32;t++)
{
totalDistance += w & 0x1;
w = w >> 1;
}
}
}
return totalDistance;
}
```
**4. Assembly code**
In this case, I found that the allocation of alias is a big issue because I didn't utilize stack to store data when jump to other functions.
Because the total Hamming distance is the summation of Hamming distance of each pair, here I only test one data to check the final result. That is, in order to fit the final result, each computation of Hamming distance generally must be correct.
``` =
.data
test:
.word 5,8,104,233,1,5344,703
size:
.word 7
verify:
.word 100
pair:
.asciz "\n pair HD = "
total:
.asciz "\n total HD = "
fa:
.asciz "\n false"
successful:
.asciz "\n successful"
.text
.global main
main:
la t0,test #loda address of test data to t0
la t1,verify #load address of verifying data to t1
la t2,size #load address of test data size to t2
lw t2,0(t2) #load the value of size
li t3,0 #t3 = i
li t4,0 #t4 = j
li s0,0 #Total Hamming distance, s0
jal loop1
la t0,verify
lw t0,0(t0)
beq s0,t0,end
j false
loop1:
addi a5,ra,0 #store ra of main in a5
bge t3,t2,exit #if t3>=t2 => i<3 ,exit
addi t4,t3,1 #index of inner loop, start from t3+1, j=i+1
lw a3,0(t0) #load the i-th element of data
addi a1,t0,4 #load the j=i+1 th data address
jal loop2 #counting Hamming distance with i fixed
addi t0,t0,4 #shift to next element, i+1
addi ra,a5,0 #recover ra of main
addi t3,t3,1
j loop1 #back to this loop
loop2:
addi a6,ra,0 #store ra of loop1 in a6
bge t4,t2,exit #if j>= 3, exit
lw t5,0(a1) #load jth element
xor t5,t5,a3 #obtain xor result of ith and jth elements
li a0,0 #reset a0 for bit shift, <32
li t6,32 #reset a1 for total bit number
li s1,0 #number of Hammingdis of each pair
jal HammingDis #counting Hamming distance of ith and jth elements
addi ra,a6,0
add s0,s0,s1 #sum up each pair
addi a1,a1,4 #j=j+1 element
addi t4,t4,1
j loop2
HammingDis:
bge a0,t6,exit #shift 31 times
li a2,1 #load for and the least significant bit
and a2,a2,t5 #
add s1,s1,a2 #plus to total Hamming distance
srli t5,t5,1 #right shift 1 bits
addi a0,a0,1 #index +1
j HammingDis #count for next bit
exit:
ret
end:
la a0,total
li a7,4
ecall
addi a0,s0,0
li a7,1
ecall
li a7,10
ecall
false:
la a0,fa
li a7,4
ecall
li a7,10
ecall
```
**5. Performance**

### Solution `B`
**1. Idea**
In this version, before counting Hamming distance, I try to calculate the leading zero of $XOR$ to reduce the total cycle. I guess it might be helpful for the thorough procedures.
**2. C code**
```C
int totalHammingDistance(int* nums, int numsSize) {
int totalDistance = 0;
for (int i = 0; i < numsSize - 1; i++) {
for (int j = i + 1; j < numsSize; j++) {
uint32_t w = nums[i] ^ nums[j];
if (w == 0) {
continue; //avoid undefined case
}
int leading_zeros = __builtin_clz(w);
int bits_to_process = 32 - leading_zeros; //figure out clz to reduce the following cycle
for (int t = 0; t < bits_to_process; t++) {
totalDistance += w & 0x1;
w >>= 1;
}
}
return totalDistance;
}
```
**3. Revised Assembly Code**
In this revised case, I try to allocate alias more clearly, including the utilization of stack pointer. Furthermore, clz method is also applied in this approach with the performance improved significantly and shown at the next subtitle.
```
.data
test:
.word 5,8,104,233,1,5344,703
size:
.word 7
total:
.text
.global main
main:
la s0, test #load address of test data to s0
la s1, size #load address of test data size to s1
lw s1,0(s1) # s1 = numsSize
li s2,0 # s2 = total Hd
li t0, 0 # outer loop index = 1
jal loop1
jal end
loop1:
# to = outer loop index
bge t0, s1, exit
addi t1,t0,1 #inner loop index, start form t0+1
addi t2,s0,4 #load the data address to t2 used for shift in the loop
lw a0, 0(s0) #load first element
li a1, 32 #the clz of first element
addi sp,sp,-8
sw ra, 4(sp) #sp for ra
sw a0, 0(sp) #sp for first element
jal clz #calculate clz of first element
lw a0,0(sp) #load second element
jal loop2 #loop 2 for Hamming distance with reamining elements
lw ra,4(sp) #recover ra
addi sp,sp,8 #recover sp
addi t0,t0,1 #next loop
addi s0,s0,4 #next element
j loop1
loop2:
# a0 = the first element
# a1 = the clz of the first element
# a2 = the clz of the second element
# t1 is the index = i+1
# t2 = address of second element
bge t1, s1, exit
addi sp,sp,-16
sw ra,12(sp)
sw a0,8(sp)
sw a1,4(sp)
sw t1,0(sp)
#count clz of t2
lw a0,0(t2) #load second element to a0
li a1,32 #set clz to 0
jal clz #count clz of second element
addi a2,a1,0 #clz store to a2
lw a1,4(sp) #recover a1
jal reverse
addi a0,a1,-32 # a0=a1-32
sub a0,zero,a0 # a0=0-a0=-a1+32
lw t1,8(sp) #t1 = first element used for Hd_clz
lw t3,0(t2) #t3 = second element for Hd_clz
li a1,0 #index for Hd_clz = 0
xor t1,t1,t3 # xor t1,t2 to t1
jal HammingDis_clz
lw t1,0(sp) #recover data
lw a1,4(sp) #recover data
lw a0,8(sp) #recover data
lw ra,12(sp) #recover data
addi sp,sp,16 #recover sp
addi t2,t2,4 #next element
addi t1,t1,1 #next loop
j loop2
reverse:
bge a2,a1 ,exit #if a1>a2, a0 = 32-a2, else a0 = 32-a1
addi a1,a2,0
ret
clz:
# a0 is the element need to be calculated it's clz
# a1 is the clz
beq a0, zero,exit # if a0 == 0, meaning that shift encough step, end loop
addi a1,a1,-1 #if t1 !=0, leadning zero bits -1
srli a0,a0,1 #right shift 1 bit
j clz #back to loop
HammingDis_clz:
# a0 = less clz value of two elements
# a1 = index
# t1 = xor value
# t3 = check
bge a1, a0, exit
andi t3, t1, 1 # LSB
add s2,s2,t3 # add LSB to total Hd
srli t1,t1,1 # right shift one bit
addi a1,a1,1
j HammingDis_clz #count for next bit
exit:
ret
end:
la a0,total
li a7,4
ecall
addi a0,s2,0
li a7,1
ecall
li a7,10
ecall
```
**4. Comparison of Performances**
The pictures below showed the leading zero concept applied dramatically decreased the cycle number. All parameters are improved significantly.

Fig a. with no leading zero concept

Fig b. with leading zero concept
### Solution `C`
**1. Idea**
However, the complexities of above solutions aren't inproved. I figure out another approach to sovle the complexity issue. Under the condition that we know the total number of integers, we can calculate the total Hamming distance at specific bit position then sum up. The following steps show this idea more clearly.
**a.** Consider in bit position $l$, $$H(x_{il},x_{jl})=\begin{cases}
1,if\ \ \ x_{il}\ \ne x_{jl}\\
0,if\ \ \ x_{il}\ = x_{jl}
\end{cases}$$
Only bit different will cause $H=1$
**b.**
With above steps, denoted $o_l, z_l$ as the number of one and zero at the position $l$, respectively.
$$\begin{align}H_{total}&=\displaystyle\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_i,x_j)=\sum^{32}_{l=1}\sum^{N-1}_{i=1}\sum^N_{j=i+1}H(x_{il},x_{jl})\\&=\sum^{32}_{l=1}o_lz_l\end{align}$$
**c.**
Conclusion: only the multiplication value of $o_l,z_l$ in each bit position is needed!
**2.C Code**
```C
int totalHammingDistance(int* nums, int numsSize) {
int totalDistance = 0; //total Hamming Distance
for (int l = 0; l < 32; l++) //outer loop for bit position
{
int o = 0; //set nubmer of one in o
int z = 0; //set number of zero in z
for (int j = 0; j < numsSize; j++) //check every integer
{
nums[j]&1 ? o++:z++; //check LSB = 1?
nums[j] = nums[j] >> 1; //right shift tested data
}
totalDistance = totalDistance + o * z; //sum up
}
return totalDistance;
}
```
**3.Assembly**
```
.data
test:
.word 5,8,104,233,1,5344,703, 832940,10239,4829,829,111,284920,1042,17,5873,478293,182,4471,23
size:
.word 20
verify:
.word 1314
right_mes:
.asciz "right"
wrong_mes:
.asciz "wrong"
.text
main:
la s0, test #load address of test data to s0
la s1, size #load address of test data size to s1
lw s1,0(s1) # s1 = numsSize
li s2,0 # s2 = total Hd
li t0, 0 # bits_shift index = 1
li t1, 32 # total 32 bit shifts
bits_shift:
bge t0,t1,end # 32bits shift
addi a0,s0,0 # copy the address of data
li a1,0 # the number of zero
li a2,0 # the number of one
li t2,0 # index
addi sp,sp,-4 # store return address on stack
sw ra,0(sp)
jal data_shift # count each zero and one of elements
lw ra,0(sp)
addi sp,sp,4
addi t0,t0,1 # next bit
j bits_shift # loop
data_shift:
# a0 = address of the data
# a1 = number of zero
# a2 = number of one
# t2 = index
# t3 = data value
bge t2,s1,z_times_o # index >= size number, count z times o
lw t3,0(a0) # load element
addi t2,t2,1 # next loop
andi t4,t3,1 # store and result to t4
srli t3,t3,1 # right shift one bit
sw t3,0(a0) # restore to s0 position
beq t4,zero,z_add # if t4=0, LSB=0, zero++
addi a2,a2,1 # else , one++
c:
addi a0,a0,4 # next element
j data_shift # back loop
z_add:
addi a1,a1,1 # add number of zero
j c # back and continue
z_times_o: # result will be stored in s2
# a1 = number of zero
# a2 = number of one
# t2 = temp value for store LSB
beq a1,zero,exit #right shift a1. if zero, return to main loop
li t2,0
andi t2,a1,1 # LSB of a1
beq t2,zero,b # if zero, j to b, next bit
add s2,s2,a2 # add to total Hd
b:
slli a2,a2,1
srli a1,a1,1
j z_times_o # back to this loop
exit:
ret
end:
la t0, verify
lw t1, 0(t0)
bne t1,s2,false
la a0, right_mes
li a7,4
ecall
li a7,10
ecall
false:
la a0, wrong_mes
li a7,4
ecall
li a7,10
ecall
```
**4. Performance**
It seems worse than above cases.

**5. Comparison by using different data**
I'm wondering why it looks worse than above cases, so I tried another data of which with large number of intege. The results are shown below. With the integers seies: $5,8,104,233,1,5344,703, 832940,10239,4829,829,111,284920,1042,17,5873,478293,182,4471,23$
a. Solution `A`

b. Solution `B`
This revised versio seems make the performance much better.

c. Solution `C`
This solution make a significant improvement of cycles!
!
## Assembly Code Improvement Based on Pipeline Diagram (Solution `C`)
### Unrolling func. data_shift
* When processing `data_shift`, here are some data hazard issues that I found. As below showed.When `EX` beq, the IF and ID will be eliminated at the next cycle and IF will will replaced by `addi a1,a1, 1` if `t4==0`. I have no idea how to avoid such banch issue.
* Another problem that I found is if processing either`C:` or`z_add:` the data hazard impacted dramatically of the performance. As picture showed.
To reduce to impact of this problem, I tried to enroll code of this part. The correesponding assembly code revised as:
```
data_shift:
# a0 = address of the data
# a1 = number of zero
# a2 = number of one
# t2 = index
# t3 = data value
bge t2,s1,z_times_o # index >= size number, count z times o
lw t3,0(a0) # load element
addi t2,t2,1 # next loop
andi t4,t3,1 # store and result to t4
beq t4,zero,z_add # if t4=0, LSB=0, zero++
addi a2,a2,1
srli t3,t3,1 # right shift one bit
sw t3,0(a0) # restore to s0 position
addi a0,a0,4 # next element
j data_shift # back loop
z_add:
srli t3,t3,1 # right shift one bit
sw t3,0(a0) # restore to s0 position
addi a1,a1,1 # add number of zero
addi a0,a0,4
j data_shift # back and continue
```
And the performance after revised:

* Conclusion
Currently, I am studing how to avoid nop issues when encounter jump address and jump branch.
## Reference
> [Example RISC-V Assembly Programs](https://marz.utk.edu/my-courses/cosc230/book/example-risc-v-assembly-programs/)
[Leetcode 477 ](https://leetcode.com/problems/total-hamming-distance/description/)
[open AI](https://chatgpt.com/)
[RISC-V Assembly Programmer’s Manual](https://github.com/riscv-non-isa/riscv-asm-manual/blob/main/src/asm-manual.adoc)