# Assignment 1 : Calculate the Hamming Distance using Counting Leading Zeros
contributed by [`yptang5488`](https://github.com/yptang5488)
## Hamming Distance between two integers
The Hamming Distance between two integers is defined as the number of differing bits at the same position when comparing the binary representations of the integers. For example, the Hamming Distance between `1011101` and `1001001` is 2.
In the assignment, I implement the program to calculate the Hamming Distance between the two given 64-bit unsigned integers.
### Motivation
I use a while loop to iterate through two integers, get their least significant bits (LSB), and then compare them to each other. Using the leading-zero counting function makes it possible to obtain the number of bits of the input integers, thus reducing the number of comparisons in the loop.
## Implementation
You can find the source code [here](https://github.com/yptang5488/Computer-Architecture/blob/master/HammingDistance.c).
### C program
In the hamming distance calculating function, I compare the two bit values and choose the larger one to be the number of the comparision operation.
```c=
#include <stdio.h>
#include <stdint.h>
uint64_t test1_x0 = 0x0000000000100000;
uint64_t test1_x1 = 0x00000000000FFFFF;
uint64_t test2_x0 = 0x0000000000000001;
uint64_t test2_x1 = 0x7FFFFFFFFFFFFFFE;
uint64_t test3_x0 = 0x000000028370228F;
uint64_t test3_x1 = 0x000000028370228F;
uint16_t count_leading_zeros(uint64_t x){
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
x |= (x >> 32);
/* count ones (population count) */
x -= ((x >> 1) & 0x5555555555555555);
x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
x = ((x >> 4) + x) & 0x0f0f0f0f0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
x += (x >> 32);
return (64 - (x & 0x7f));
}
int HammingDistance(uint64_t x0, uint64_t x1){
int Hdist = 0;
int16_t max_digit = 64 - (int16_t)count_leading_zeros((x0 > x1)? x0 : x1);
while(max_digit > 0){
uint64_t c1 = x0 & 1;
uint64_t c2 = x1 & 1;
if(c1 != c2) Hdist += 1;
x0 = x0 >> 1;
x1 = x1 >> 1;
max_digit -= 1;
}
return Hdist;
}
int main(){
printf("%d\n", HammingDistance(test1_x0, test1_x1));
printf("%d\n", HammingDistance(test2_x0, test2_x1));
printf("%d\n", HammingDistance(test3_x0, test3_x1));
return 0;
}
```
### Assembly Code
The complete source code can be found [here](https://github.com/yptang5488/Computer-Architecture/blob/master/HammingDistance.s). The following shows only the part of main function.
```s
.data
test_data_1: .dword 0x0000000000100000, 0x00000000000FFFFF # HD(1048576, 1048575) = 21
test_data_2: .dword 0x0000000000000001, 0x7FFFFFFFFFFFFFFE # HD(1, 9223372036854775806) = 63
test_data_3: .dword 0x000000028370228F, 0x000000028370228F # HD(10795098767, 10795098767) = 0
msg_string: .string "\nHamming Distance="
.text
main:
addi sp, sp, -12
# push pointers of test data onto the stack
la t0, test_data_1
sw t0, 0(sp)
la t0, test_data_2
sw t0, 4(sp)
la t0, test_data_3
sw t0, 8(sp)
# initialize main_loop
addi s0, zero, 3 # s0 : number of test case
addi s1, zero, 0 # s1 : test case counter
addi s2, sp, 0 # s2 : points to test_data_1
main_loop:
la a0, msg_string
li a7, 4 # print string
ecall
lw a0, 0(s2) # a0 : pointer to the first data in test_data_1
addi a1, a0, 8 # a1 : pointer to the second data in test_data_1
jal ra, hd_func
# print the result #
li a7, 1 # print integer
ecall # print result of hd_cal (which is in a0)
addi s2, s2, 4 # s2 : points to next test_data
addi s1, s1, 1 # counter++
bne s1, s0, main_loop
addi sp, sp, 12
li a7, 10
ecall
```
It is important to note that the registers provided in RV32I are all 32-bit, and when working with 64-bit numbers, I need to use 2 registers to access the upper and lower halves of the value respectively.
#### 64-bit number operation : shift right
Given a 64-bit number `x`, define `s0` to access the lower half and `s1` to access the other half. Using the **shift right** instruction as an example, `s1` can be shifted normally, but `s0` must take into account the LSB of the higher half.
```s=
# x(s1 s0) >> 1 #
srli t0, s0, 1
slli t1, s1, 31
or s0, t0, t1 # s0 >> 1
srli s1, s1, 1 # s1 >> 1
```
## Result
### test data 1
- `x0` = `0x0000000000100000`(`1048576`)
- `x1` = `0x00000000000FFFFF`(`1048575`)
- Hamming Distance = 21
### test data 2
- `x0` = `0x0000000000000001` (`1`)
- `x1` = `0x7FFFFFFFFFFFFFFE` (`9223372036854775806`)
- Hamming Distance = 63
### test data 3
- `x0` = `0x000000028370228F` (`10795098767`)
- `x1` = `0x000000028370228F` (`10795098767`)
- Hamming Distance = 0
### C program output

### RISC-V Assembly output

## Analysis
I test the code using [Ripes](https://github.com/mortbopet/Ripes) simulator.
### Disassembled Executable Code
The complete translated code can be found [here](https://github.com/yptang5488/Computer-Architecture/blob/master/disassemble.s). The following shows only the part of main function.
```
00000000 <main>:
0: ff410113 addi x2 x2 -12
4: 10000297 auipc x5 0x10000
8: ffc28293 addi x5 x5 -4
c: 00512023 sw x5 0 x2
10: 10000297 auipc x5 0x10000
14: 00028293 addi x5 x5 0
18: 00512223 sw x5 4 x2
1c: 10000297 auipc x5 0x10000
20: 00428293 addi x5 x5 4
24: 00512423 sw x5 8 x2
28: 00300413 addi x8 x0 3
2c: 00000493 addi x9 x0 0
30: 00010913 addi x18 x2 0
00000034 <main_loop>:
34: 10000517 auipc x10 0x10000
38: ffc50513 addi x10 x10 -4
3c: 00400893 addi x17 x0 4
40: 00000073 ecall
44: 00092503 lw x10 0 x18
48: 00850593 addi x11 x10 8
4c: 024000ef jal x1 36 <hd_func>
50: 00100893 addi x17 x0 1
54: 00000073 ecall
58: 00490913 addi x18 x18 4
5c: 00148493 addi x9 x9 1
60: fc849ae3 bne x9 x8 -44 <main_loop>
64: 00c10113 addi x2 x2 12
68: 00a00893 addi x17 x0 10
6c: 00000073 ecall
```
### 5-stage pipelined processor
Ripes provide different processor including single-cycle processor, 5-stage processor, 6-stage dual-issuer processor. 5-stage pipelined processor is chosen to run the program. The block diagram is shown below:

The 5 stages is designed to deal with the instructions in pipeline approach to enhance the execution efficiency. The 5 stages is shown as below:
1. **IF (Instruction Fetch)**
- fetch the instruction from instruction memory
3. **ID (Instruction Decode)**
- decode the instruction and read the value from registers
5. **EX (Execute)**
- use ALU to execute calculations or compute memory addresses
7. **MEM (Memory)**
- access the operands from memory
9. **WB (Write Back)**
- write the result back to the registers
#### I-type format
I take the instruction `srli x5 x10 1` in the psedoinstruction for example and analyze how the processor operates the instruction in different stages.
According to [RISC-V Instruction Set Manual (p.14)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf), the I-type instruction will be accessed in the format below:
> Shifts by a constant are encoded as a specialization of the I-type format. The operand to be shifted is in rs1, and the shift amount is encoded in the lower 5 bits of the I-immediate field. The right shift type is encoded in a high bit of the I-immediate.

#### 1. IF

The instruction address of `srli x5 x10 1` is `0x000001f0`. After the instruction memory get the address from PC, the address will be translated to a 32-bit instruction.
We can find the instruction of `SRLI` and get its function code and opcode from [RV32I Instruction Set table (p.104)](https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) :

Finally, it will be transformed into hexadecimal format as a machine code `0x00155293` for RISC-V CPU.
| imm [11:5] | shamt [4:0] | rs | funct3 | rd | opcode |
|:----------:|:-----------:|:-----:|:------:|:-----:|:------:|
| 0000000 | 00001 (shift) | 01010 (x10) | 101 | 00101 (x5) | 010011 |
In the same time, `PC`(`0x000001f0`) will be added by 4 and get `pc4_out` (`0x000001f4`) for fetching the coming instruction.
#### 2. ID

In this stage, the instruction code `0x00155293` will be decoded into several part :
- `opcode` = `SRLI`
- `R1 idx` = `0x0a` (x10)
- `Wr reg idx` = `0x05` (x5)
- `R2 idx` is not used in I-type format
Then, the register index `R1 idx` will be sent into the registers to get the corresponding value :
- `R1 out` = `0xffffffff` (current value in x10)
The immediate value decoder will get the shamt value and extend it to a 32-bit number for later calculations.
- `imm.` = `0x00000001` (shift bit)
The `PC`(`0x000001f0`) and `pc4_out` (`0x000001f4`) are only transmitted through the stage and not modified or used.
#### 3. EX

ALU is a executable unit for many arithmetic and logical operations. In this stage, the data will be chosen by the 2-level multiplexers (MUXs) :
- choice of `op1` : `r1_out` go through 2 MUXs as the first input of ALU
- choice of `op2` : `imm_out` will be chosen by last MUX as the second input of ALU (`r2_out` is not used)
After that, ALU will obtain the corresponding data for calculation :
- `op1` : `0xffffffff` (x10 value)
- `op2` : `0x00000001` (shift bit)
- `result` : `0x7fffffff` (result)
`wr_reg_idx_out`(`0x05`) and `pc4_out` (`0x000001f4`) just go through the stage.
#### 4. MEM

I-type format doesn't need to read data from memory, so the control signal `wr_en` is `0x0`. `wr_reg_idx_out`(`0x05`), `pc4_out` (`0x000001f4`) and `alu_res_out`(`0x7fffffff`) will go through to the next stage.
#### 5. WB

There is a MUX to determine which data (`pc4_out`, `alu_res_out` or `mem_read_out`) will be saved in the register. In a shift right operation, we need to save the result of the ALU, `alu_res_out`, into the register indexed `wr_reg_idx_out`.

Therefore, `alu_res_out`(`0x7fffffff`) and `wr_reg_idx_out`(`0x05`) will be sent back to the register as shown above. In the same time, the control singal `wr_en` of the register will be set to `0x1` to allow shift right operation of the instruction.
### memory update
The data will be written back to the register only when the instruction `srli x5 x10 1` has finished execution and has been covered by other instructions in the processor as the last picture shows.

