# Assignment1: RISC-V Assembly and Instruction Pipeline
> contributed by [tinhanho](https://github.com/tinhanho/Assignment1-RISC-V-Assembly-and-Instruction-Pipeline)
## Optimizing Storage with Count Leading Zero
Assume that we have a 32-bit data. Obviously, using the __int32( i.e. **int** in C language) to save the data is adequate. However, when the leading zero of the data is too much, it is better to regard it as __int16, or__int8. For example, when receiving a data 0x00000010, then we could use **char** to store the data. It will be stored as 0x10 in the data type **char** and it will not miss any bits of the data. The optimizing storage with count leading zero algorithm can be implemented as a simple C language code as following,
## C Program
```c=
#include <stdlib.h>
#include <stdio.h>
char num_p_8 = 0;
short num_p_16 = 0;
int num_p_32 = 0;
char count_leading_zeros(void *num_p){
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
void opt_storage(void *num_p){
char leading_zero = count_leading_zeros(num_p);
if(leading_zero>=24){
num_p_8 = *(char*)num_p;
}
else if(leading_zero>=16){
num_p_16 = *(short*)num_p;
}
else num_p_32 = *(int*)num_p;
}
int main(){
int num = 16;
void *num_p;
for(int i=0; i<=2; i++){
num_p = #
opt_storage(num_p);
num = num << 10;
printf("-------------\n");
printf("char: %d\n", num_p_8);
printf("short: %d\n", num_p_16);
printf("int: %d\n", num_p_32);
}
}
```
```c=8
char count_leading_zeros(void *num_p){
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
/* count ones (population count) */
x -= ((x >> 1) & 0x55555555);
x = ((x >> 2) & 0x33333333) + (x & 0x33333333);
x = ((x >> 4) + x) & 0x0f0f0f0f;
x += (x >> 8);
x += (x >> 16);
return (32 - (x & 0x7f));
}
```
With the function **count_leading_zeros**, we could calculate the number of leading zero. It is useful for determining the size of the data.
```c=25
void opt_storage(void *num_p){
char leading_zero = count_leading_zeros(num_p);
if(leading_zero>=24){
num_p_8 = *(char*)num_p;
}
else if(leading_zero>=16){
num_p_16 = *(short*)num_p;
}
else num_p_32 = *(int*)num_p;
}
```
In the **opt_storage** function part, we use the leading zeros information to choose which is the best data structure to save our data. When leading zero is larger than 24, it means that we could simply use a **char** to save our data because the data is less or equal to 8 bits. In the same way, we could determine **short** data and **int** data with the information of leading zeros.
In this case, we assume that input is 16 initailly and we shift it left by 10 in the for-loop in order to change the size of the data. Moreover, there are 3 global variables, which can store __int8, __int16 or __int32 data. We put all the input in the right data structure.
Input is 16 and shift it left by 10 for 3 loops:

The data range can be stored:
|char|short|int|
|-|-|-|
|-128~127|-32768~32767|-2147483648~2147483647|
## Assembly Program
```mipsasm=
.data
input: .word 16
split: .string "-------------\n"
newline: .string "\n"
char_str: .string "char:"
short_str: .string "short:"
int_str: .string "int:"
.text
# a1 = input number
# a2 = loop counter
# a3 = base address for output
# s0 = pass parameter
# t0 = temporary register
# t1 = temporary register
main:
# a1 for input number
lw a1, input
# for-loop, a2 for loop counter
addi a2, x0, 3
# create space for char, short and int
sb x0, 0(a3)
sh x0, 1(a3)
sw x0, 4(a3)
loop:
jal ra, opt_storage
jal ra, print
slli a1, a1, 10 # shift input left to get larger data
addi a2, a2, -1 # loop counter
bne a2, x0, loop
# Exit
li a7, 10
ecall
opt_storage:
addi sp, sp, -4 # create stack space to save return address
sw ra, 0(sp)
jal ra, count_leading_zeros # calculate leading zeros
# if leading zero >= 24
addi t0, s0, -24
bge t0, x0, char
# if leading zero >= 16
addi t0, s0, -16
bge t0, x0, short
# else
jal x0, int
char:
sb a1, 0(a3)
jal x0, endif
short:
sh a1, 1(a3)
jal x0, endif
int:
sw a1, 4(a3)
jal x0, endif
endif:
lw ra, 0(sp)
addi sp, sp, 4
ret
count_leading_zeros:
add s0, x0, a1 # s0 counts the leading zeros
srli t0, s0, 1 # t0 = x>>1
or s0, s0, t0 # x |= (x >> 1);
srli t0, s0, 2 # x>>2
or s0, s0, t0
srli t0, s0, 4 # x>>4
or s0, s0, t0
srli t0, s0, 8 # x>>8
or s0, s0, t0
srli t0, s0, 16 # x>>16
or s0, s0, t0
li t5, 0x55555555
srli t0, s0, 1 # x>>1
and t0, t0, t5 # (x >> 1) & 0x55555555
sub s0, s0, t0
li t5, 0x33333333
srli t0, s0, 2 # x>>2
and t0, t0, t5 # (x >> 2) & 0x33333333
and t1, s0, t5 # x & 0x33333333
add s0, t0, t1
li t5, 0x0f0f0f0f
srli t0, s0, 4 # x>>4
add t0, s0, t0
and s0, t0, t5
srli t0, s0, 8 # x>>8
add s0, s0, t0
srli t0, s0, 16 # x>>16
add s0, s0, t0
li t5, 0x0000007f
and s0, s0, t5
sub s0, x0, s0
addi s0, s0, 32
ret
print:
# print split symbol
la a0, split
li a7, 4
ecall
# print string "char"
la a0, char_str
li a7, 4
ecall
# print char
lb a0, 0(a3)
li a7, 1
ecall
# newline
la a0, newline
li a7, 4
ecall
# print string "short"
la a0, short_str
li a7, 4
ecall
# print short
lh a0, 1(a3)
li a7, 1
ecall
# newline
la a0, newline
li a7, 4
ecall
# print string "int"
la a0, int_str
li a7, 4
ecall
# print int
lw a0, 4(a3)
li a7, 1
ecall
# newline
la a0, newline
li a7, 4
ecall
ret
```
We expect that assembly code runs the same as C code. In the data section we define the input word and the output strings, such as split symbol, newline, and several strings. In the text section we implement the code. Function name is the same as C code. Therefore, check the C code and the comment could be helpful to understand the meaning of assembly code. Following is the console of the assembly code,

## Pipeline

We use a 5-stage pipeline, which includes IF, ID, EXE, MEM and WB.
```mipsasm=56
count_leading_zeros:
add s0, x0, a1 # s0 counts the leading zeros
```
Take the simple instuction add s0, x0, a1 ( i.e. add x8, x0, x11) for example:
### IF stage

1. Without branch or jump, PC = PC+4. 0x0000007C+0X00000004
2. In the Instr. memory. add x8, x0, x11 = 0000000(func7) 01011(x11) 00000(x0) 000(funct3) 01000(x8) 0110011(R-type op code) = 0x00b00433
|funct7 |rs2 |rs1 |funct3 |rd |opcode|
|-|-|-|-|-|-|
### ID stage

1. R1 is x0(0x00), R2 is x11(0x0b)
2. Enable the write back, passed by pipeline register.
3. R3(0x08) is passed by pipeline register.
### EXE stage

1. Add two registers value. x0 is 0 and x11 is 0x00000010.
2. No branch.
### MEM stage

1. Disable the writing back to data memory.
2. ALU result is passed.
3. Write back register number is passed by pipeline regiser.
### WB stage

1. Write ALU result back to the register x8.
## Clock cycle

▲ Without unrolling loops
We modify our Assembly code with simple loop unrolling,
```mipsasm=9
.text
# a1 = input number
# a2 is cleared
# a3 = base address for output
# s0 = pass parameter
# t0 = temporary register
# t1 = temporary register
main:
# a1 for input number
lw a1, input
# create space for char, short and int
sb x0, 0(a3)
sh x0, 1(a3)
sw x0, 4(a3)
jal ra, opt_storage
slli a1, a1, 10 # shift input left to get larger data
jal ra, print
jal ra, opt_storage
slli a1, a1, 10
jal ra, print
jal ra, opt_storage
slli a1, a1, 10
jal ra, print
```

▲ With Unrolling loops
It reduces a little bit loops with unrolling the loops.
```misp=37
opt_storage:
addi sp, sp, -4 # create stack space to save return address
sw ra, 0(sp)
jal ra, count_leading_zeros # calculate leading zeros
# if leading zero >= 24
addi t0, s0, -24
bge t0, x0, char
# if leading zero >= 16
addi t0, s0, -16
bge t0, x0, short
int:
sw a1, 4(a3)
jal x0, endif
char:
sb a1, 0(a3)
jal x0, endif
short:
sh a1, 1(a3)
jal x0, endif
endif:
lw ra, 0(sp)
addi sp, sp, 4
ret
```
Moreover, we modify the opt_storage part by moving the else part, and it costs less instructions.

It reduces 3 cycles.