# Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by < [HenryChiang](https://github.com/HenryChaing/Computer_Arch_112) >
###### tags: `RISC-V` `Computer Architecture` `CLZ`
## 1. Introduction to Problem
### 1.1. Normalizing an integer
The problem I chose extends Problem A, "Counting leading zeros." It normalizes an integer into a binary floating-point style representation, expressed in the standard `1.XX * 2^n` form.
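As a quick illustration of the target format, the sketch below normalizes the three test values used later in this note. The helper name `normalize` is hypothetical, and it relies on Python's `int.bit_length` instead of CLZ, so it only shows the expected output, not the assembly approach.

```python
def normalize(x):
    """Return (fraction_bits, exponent) so that x == 1.fraction_bits (binary) * 2**exponent.

    Hypothetical helper for illustration only.
    """
    if x <= 0:
        raise ValueError("only positive integers can be normalized")
    exponent = x.bit_length() - 1      # position of the leading 1
    fraction = bin(x)[3:]              # drop the '0b' prefix and the leading 1
    return fraction or "0", exponent   # "0" when x is a power of two


for value in (6, 9, 25):
    frac, exp = normalize(value)
    print(f"{value} = 1.{frac} * 2^{exp}")
```

For example, 6 is `110` in binary, so the fraction bits are `10` and the exponent is 2.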
### 1.2. Preparing the solution
First, use CLZ to determine the shift amount N, which is exactly the number of leading zeros plus one. Then use the `sll` instruction to perform the shift, and finally print the result to the console with `ecall` (system call).
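The plan above can be modeled in a few lines. This is a minimal sketch assuming 32-bit registers; the names `clz32` and `shift_out_leading_one` are hypothetical, and `bit_length` stands in for the CLZ routine.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def clz32(x):
    """Count leading zeros of a 32-bit unsigned value (reference model)."""
    return 32 - (x & MASK32).bit_length()


def shift_out_leading_one(x):
    """Shift left by N = clz + 1 so the leading 1 drops out and the fraction is left-aligned."""
    n = clz32(x) + 1
    return n, (x << n) & MASK32


for value in (6, 9, 25):
    n, frac = shift_out_leading_one(value)
    print(f"{value}: N = {n}, fraction = {frac:032b}, exponent = {32 - n}")
```

Note that on RV32 `sll` only uses the low five bits of the shift amount, so an input of 1 (N = 32) would need special handling in assembly; this model does not mimic that behavior.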
## 2. Assembly code
### 2.1. Code section (with comments)
```c=
.data
# test array
arr: .word 6,9,25
# strings to output to console
str1: .string "The normalization for number "
str2: .string " is 1."
str3: .string "\n"
str4: .string " * 2^"
str5: .string "please ignore 0b\n"
.text
# main code
main:
la a0, str5 # print str5
li a7, 4
ecall
addi t5,zero,3 # t5 = number of test elements
la t6,arr # t6 = address of the current element
loop:
# load element to a0, run CLZ
lw a0,0(t6)
jal ra, count_leading_zeros
# Print the result to console
mv a1, a0
lw a0, 0(t6)
jal ra, printResult
# test next element
addi t5,t5,-1
addi t6,t6,4
bne t5,zero,loop
# Exit program
li a7, 10
ecall
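# Problem A assembly code, adapted to 32-bit registers
# NOTE: returns the number of significant bits (32 - leading zeros) in a0
# NOTE: the 12-bit andi masks only cover small inputs such as the test values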
count_leading_zeros:
srai t0, a0, 1
or a0, a0, t0
srai t0, a0, 2
or a0, a0, t0
srai t0, a0, 4
or a0, a0, t0
srai t0, a0, 8
or a0, a0, t0
srai t0, a0, 16
or a0, a0, t0
srai t0, a0, 16 # emulate x >> 32 from Problem A as two 16-bit shifts
srai t0, t0, 16
or a0, a0, t0
srai t0, a0, 1
andi t0, t0, 0x555
sub a0, a0, t0
srai t0, a0, 2
andi t0, t0,0x333
andi t1, a0,0x333
add a0,t0,t1
srai t0, a0,4
add t0,a0,t0
andi a0,t0,0xf0f
srai t0,a0,8
add a0,a0,t0
srai t0,a0,16
add a0,a0,t0
srai t0,a0,16
srai t0,t0,16
add a0,a0,t0
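andi t0,a0,0x7f # t0 = population count = number of significant bits
addi t1,zero,64 # unused: left over from the "64 - count" step of Problem A
sub a0,t0,zero # return the bit count itself (not 32 - count)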
jr ra
# --- printResult ---
# a0: Original number
# a1: Number of significant bits (32 - leading zeros) from count_leading_zeros
printResult:
mv t0, a0
mv t1, a1
lw t3, 0(t6)
la a0, str1
li a7, 4
ecall
mv a0, t0
li a7, 1
ecall
la a0, str2
li a7, 4
ecall
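addi t2,zero,33 # shift amount = 33 - bit count = leading zeros + 1
sub t2,t2,t1
sll t3,t3,t2 # drop the leading 1, left-align the fraction bits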
mv a0, t3
li a7, 35
ecall
la a0, str4
li a7, 4
ecall
addi t1,t1,-1 # exponent = bit count - 1
mv a0, t1
li a7, 1
ecall
la a0, str3
li a7, 4
ecall
ret
```
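For reference, the smear-then-popcount idea behind `count_leading_zeros` can be checked with a small model. This is a sketch under assumptions that differ from the listing above: it uses full-width 32-bit masks and logical shifts, and it returns `32 - popcount` (the actual leading-zero count), whereas the listing returns the popcount itself. The function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def clz32_smear_popcount(x):
    """32-bit CLZ via bit smearing followed by a population count."""
    x &= MASK32
    # Smear the leading 1 into every lower bit position.
    for shift in (1, 2, 4, 8, 16):
        x |= x >> shift
    # Population count of the smeared value (parallel bit-summing).
    x -= (x >> 1) & 0x55555555
    x = ((x >> 2) & 0x33333333) + (x & 0x33333333)
    x = ((x >> 4) + x) & 0x0F0F0F0F
    x += x >> 8
    x += x >> 16
    ones = x & 0x3F          # at most 32, so six bits are enough
    return 32 - ones


for value in (6, 9, 25):
    print(value, clz32_smear_popcount(value))
```

After smearing, the value has exactly as many 1 bits as the input has significant bits, which is why a population count is all that remains to be done.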
## 3. 5-stage Pipeline Analysis
### 3.1. The 5-stage graph


This is the program segment that the pipeline is currently executing. It represents the main loop in the program where each piece of data is analyzed and printed. We are now preparing to enter the next iteration of the loop. At this point, the EX stage has determined that there will be a branch, so we are analyzing the status of each stage at this moment.
### 3.2. IF stage

This stage is explained in two parts. First is the address the PC supplies in the IF stage itself (0x44). This is the location in instruction memory where the `ecall` instruction is stored (see the code in Section 2.1). The word read from instruction memory is 0x73, the machine-code encoding of `ecall`, which now waits to enter the ID stage for decoding.
Second is the PC update. Since the EX stage has already decided to branch to the `loop` label, the next PC value will be the address of `loop` (0x1C). The PC+4 value produced by the adder (0x48) is not written to the PC.
### 3.3. ID stage

In the ID stage at this moment, decoding and register fetching are performed. First, the instruction 0x00a00893 is decoded as `addi x17, x0, 10`. Next, the source register value and the immediate constant are extracted so that the EX stage can compute with them. Here x17 is the destination register, the source register x0 reads as 0x00, and the immediate value (imm) is 0x0a. The other value shown at the register-file output, 0x10000031, is not used by `addi`, which has no second source register.
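The decoding step can be reproduced by slicing the I-type fields out of the instruction word. The sketch below follows the standard RV32I I-type layout; the function name `decode_i_type` is hypothetical.

```python
def decode_i_type(word):
    """Split a 32-bit RV32I I-type instruction word into its fields."""
    imm = (word >> 20) & 0xFFF
    if imm & 0x800:              # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {
        "opcode": word & 0x7F,         # bits [6:0]
        "rd": (word >> 7) & 0x1F,      # bits [11:7]
        "funct3": (word >> 12) & 0x7,  # bits [14:12]
        "rs1": (word >> 15) & 0x1F,    # bits [19:15]
        "imm": imm,                    # bits [31:20]
    }


fields = decode_i_type(0x00A00893)
print(fields)
```

With opcode 0x13 (OP-IMM) and funct3 0, the word is an `addi`; rd = 17, rs1 = 0, and imm = 10 give `addi x17, x0, 10`.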
### 3.4. EX stage

Next is the EX stage, which is explained in two parts. First is the ALU (Arithmetic Logic Unit). Since the instruction in the EX stage is `bne`, the ALU computes the branch target address. op1 holds the address of the `bne` instruction, and op2 holds the signed offset from `bne` to the `loop` label. The result, 0x1C, is the address of the `loop` label in instruction memory.

Second is the branch unit. It compares the values of rs1 and rs2; for `bne`, branch taken is set to 1 when they differ and to 0 when they are equal. This signal determines whether the branch is taken.
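The two halves of the EX stage can be summarized in a small model. This is only a sketch: the function name `bne_execute` is hypothetical, and the `bne` address 0x3C is an assumption inferred from the PC value 0x44 quoted for the IF stage (two instructions behind it), which makes the offset 0x1C - 0x3C = -0x20.

```python
def bne_execute(pc, rs1_val, rs2_val, offset):
    """Model the EX stage for bne: ALU target plus branch-unit decision."""
    target = (pc + offset) & 0xFFFFFFFF   # ALU: op1 (pc) + op2 (signed offset)
    taken = rs1_val != rs2_val            # branch unit: bne is taken when operands differ
    next_pc = target if taken else (pc + 4) & 0xFFFFFFFF
    return taken, target, next_pc


# Assumed values: t5 = 2 after the first decrement, compared against zero.
taken, target, next_pc = bne_execute(pc=0x3C, rs1_val=2, rs2_val=0, offset=-0x20)
print(taken, hex(target), hex(next_pc))
```

When t5 finally reaches 0 the comparison fails, so the same call with `rs1_val=0` falls through to `pc + 4`.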
### 3.5. MEM stage

Next is the MEM stage. Because the instruction in this stage is neither `lw` nor `sw`, no read or write is performed on data memory. With write enable set to 0, data memory is not written, and whatever value it outputs is not selected by the multiplexer, so no incorrect data is used either.
### 3.6. WB stage

Finally, the WB (Write Back) stage returns the value obtained from data memory or from the ALU calculation to the register file. The instruction in this stage is `addi t5,t5,-1`, so the result computed in the EX stage by adding -1 to x30 is written back to the x30 register. The value written back in this stage is 0x02.
(Note: t5 changes from 0x03 to 0x02.)

Note that the written-back value comes from the ALU; nothing is written back from data memory.
## 4. Optimize the code
### 4.1. Before optimization


### 4.2. After optimization


### 4.3. Conclusion
Originally each loop iteration required two branch instructions; the optimized version needs only one. This reduces the control hazards caused by branches. Even with only three iterations, the change saves 6 cycles, lowering the CPI (Cycles Per Instruction) and reducing execution time.