# Assignment1: RISC-V Assembly and Instruction Pipeline > contributed by [tinhanho](https://github.com/tinhanho/Assignment1-RISC-V-Assembly-and-Instruction-Pipeline) ## Optimizing Storage with Count Leading Zero Assume that we have a 32-bit data. Obviously, using the __int32( i.e. **int** in C language) to save the data is adequate. However, when the leading zero of the data is too much, it is better to regard it as __int16, or__int8. For example, when receiving a data 0x00000010, then we could use **char** to store the data. It will be stored as 0x10 in the data type **char** and it will not miss any bits of the data. The optimizing storage with count leading zero algorithm can be implemented as a simple C language code as following, ## C Program ```c= #include <stdlib.h> #include <stdio.h> char num_p_8 = 0; short num_p_16 = 0; int num_p_32 = 0; char count_leading_zeros(void *num_p){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } void opt_storage(void *num_p){ char leading_zero = count_leading_zeros(num_p); if(leading_zero>=24){ num_p_8 = *(char*)num_p; } else if(leading_zero>=16){ num_p_16 = *(short*)num_p; } else num_p_32 = *(int*)num_p; } int main(){ int num = 16; void *num_p; for(int i=0; i<=2; i++){ num_p = &num; opt_storage(num_p); num = num << 10; printf("-------------\n"); printf("char: %d\n", num_p_8); printf("short: %d\n", num_p_16); printf("int: %d\n", num_p_32); } } ``` ```c=8 char count_leading_zeros(void *num_p){ x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); /* count ones (population count) */ x -= ((x >> 1) & 0x55555555); x = ((x >> 2) & 0x33333333) + (x & 0x33333333); x = ((x >> 4) + x) & 0x0f0f0f0f; x += (x >> 8); x += (x >> 16); return (32 - (x & 0x7f)); } ``` With the function **count_leading_zeros**, we could calculate the number of leading zero. It is useful for determining the size of the data. ```c=25 void opt_storage(void *num_p){ char leading_zero = count_leading_zeros(num_p); if(leading_zero>=24){ num_p_8 = *(char*)num_p; } else if(leading_zero>=16){ num_p_16 = *(short*)num_p; } else num_p_32 = *(int*)num_p; } ``` In the **opt_storage** function part, we use the leading zeros information to choose which is the best data structure to save our data. When leading zero is larger than 24, it means that we could simply use a **char** to save our data because the data is less or equal to 8 bits. In the same way, we could determine **short** data and **int** data with the information of leading zeros. In this case, we assume that input is 16 initailly and we shift it left by 10 in the for-loop in order to change the size of the data. Moreover, there are 3 global variables, which can store __int8, __int16 or __int32 data. We put all the input in the right data structure. Input is 16 and shift it left by 10 for 3 loops: ![](https://hackmd.io/_uploads/BJMTKmKg6.png) The data range can be stored: |char|short|int| |-|-|-| |-128~127|-32768~32767|-2147483648~2147483647| ## Assembly Program ```mipsasm= .data input: .word 16 split: .string "-------------\n" newline: .string "\n" char_str: .string "char:" short_str: .string "short:" int_str: .string "int:" .text # a1 = input number # a2 = loop counter # a3 = base address for output # s0 = pass parameter # t0 = temporary register # t1 = temporary register main: # a1 for input number lw a1, input # for-loop, a2 for loop counter addi a2, x0, 3 # create space for char, short and int sb x0, 0(a3) sh x0, 1(a3) sw x0, 4(a3) loop: jal ra, opt_storage jal ra, print slli a1, a1, 10 # shift input left to get larger data addi a2, a2, -1 # loop counter bne a2, x0, loop # Exit li a7, 10 ecall opt_storage: addi sp, sp, -4 # create stack space to save return address sw ra, 0(sp) jal ra, count_leading_zeros # calculate leading zeros # if leading zero >= 24 addi t0, s0, -24 bge t0, x0, char # if leading zero >= 16 addi t0, s0, -16 bge t0, x0, short # else jal x0, int char: sb a1, 0(a3) jal x0, endif short: sh a1, 1(a3) jal x0, endif int: sw a1, 4(a3) jal x0, endif endif: lw ra, 0(sp) addi sp, sp, 4 ret count_leading_zeros: add s0, x0, a1 # s0 counts the leading zeros srli t0, s0, 1 # t0 = x>>1 or s0, s0, t0 # x |= (x >> 1); srli t0, s0, 2 # x>>2 or s0, s0, t0 srli t0, s0, 4 # x>>4 or s0, s0, t0 srli t0, s0, 8 # x>>8 or s0, s0, t0 srli t0, s0, 16 # x>>16 or s0, s0, t0 li t5, 0x55555555 srli t0, s0, 1 # x>>1 and t0, t0, t5 # (x >> 1) & 0x55555555 sub s0, s0, t0 li t5, 0x33333333 srli t0, s0, 2 # x>>2 and t0, t0, t5 # (x >> 2) & 0x33333333 and t1, s0, t5 # x & 0x33333333 add s0, t0, t1 li t5, 0x0f0f0f0f srli t0, s0, 4 # x>>4 add t0, s0, t0 and s0, t0, t5 srli t0, s0, 8 # x>>8 add s0, s0, t0 srli t0, s0, 16 # x>>16 add s0, s0, t0 li t5, 0x0000007f and s0, s0, t5 sub s0, x0, s0 addi s0, s0, 32 ret print: # print split symbol la a0, split li a7, 4 ecall # print string "char" la a0, char_str li a7, 4 ecall # print char lb a0, 0(a3) li a7, 1 ecall # newline la a0, newline li a7, 4 ecall # print string "short" la a0, short_str li a7, 4 ecall # print short lh a0, 1(a3) li a7, 1 ecall # newline la a0, newline li a7, 4 ecall # print string "int" la a0, int_str li a7, 4 ecall # print int lw a0, 4(a3) li a7, 1 ecall # newline la a0, newline li a7, 4 ecall ret ``` We expect that assembly code runs the same as C code. In the data section we define the input word and the output strings, such as split symbol, newline, and several strings. In the text section we implement the code. Function name is the same as C code. Therefore, check the C code and the comment could be helpful to understand the meaning of assembly code. Following is the console of the assembly code, ![](https://hackmd.io/_uploads/HyiNsmKga.png) ## Pipeline ![](https://hackmd.io/_uploads/rkGq9XYg6.png) We use a 5-stage pipeline, which includes IF, ID, EXE, MEM and WB. ```mipsasm=56 count_leading_zeros: add s0, x0, a1 # s0 counts the leading zeros ``` Take the simple instuction add s0, x0, a1 ( i.e. add x8, x0, x11) for example: ### IF stage ![](https://hackmd.io/_uploads/BJD72Ete6.png) 1. Without branch or jump, PC = PC+4. 0x0000007C+0X00000004 2. In the Instr. memory. add x8, x0, x11 = 0000000(func7) 01011(x11) 00000(x0) 000(funct3) 01000(x8) 0110011(R-type op code) = 0x00b00433 |funct7 |rs2 |rs1 |funct3 |rd |opcode| |-|-|-|-|-|-| ### ID stage ![](https://hackmd.io/_uploads/B11w34FeT.png) 1. R1 is x0(0x00), R2 is x11(0x0b) 2. Enable the write back, passed by pipeline register. 3. R3(0x08) is passed by pipeline register. ### EXE stage ![](https://hackmd.io/_uploads/H1VfFVKgp.png) 1. Add two registers value. x0 is 0 and x11 is 0x00000010. 2. No branch. ### MEM stage ![](https://hackmd.io/_uploads/BJ38MHKxp.png) 1. Disable the writing back to data memory. 2. ALU result is passed. 3. Write back register number is passed by pipeline regiser. ### WB stage ![](https://hackmd.io/_uploads/HkOHK4te6.png) 1. Write ALU result back to the register x8. ## Clock cycle ![](https://hackmd.io/_uploads/HkX4wSYep.png) ▲ Without unrolling loops We modify our Assembly code with simple loop unrolling, ```mipsasm=9 .text # a1 = input number # a2 is cleared # a3 = base address for output # s0 = pass parameter # t0 = temporary register # t1 = temporary register main: # a1 for input number lw a1, input # create space for char, short and int sb x0, 0(a3) sh x0, 1(a3) sw x0, 4(a3) jal ra, opt_storage slli a1, a1, 10 # shift input left to get larger data jal ra, print jal ra, opt_storage slli a1, a1, 10 jal ra, print jal ra, opt_storage slli a1, a1, 10 jal ra, print ``` ![](https://hackmd.io/_uploads/BJT1ISKxT.png) ▲ With Unrolling loops It reduces a little bit loops with unrolling the loops. ```misp=37 opt_storage: addi sp, sp, -4 # create stack space to save return address sw ra, 0(sp) jal ra, count_leading_zeros # calculate leading zeros # if leading zero >= 24 addi t0, s0, -24 bge t0, x0, char # if leading zero >= 16 addi t0, s0, -16 bge t0, x0, short int: sw a1, 4(a3) jal x0, endif char: sb a1, 0(a3) jal x0, endif short: sh a1, 1(a3) jal x0, endif endif: lw ra, 0(sp) addi sp, sp, 4 ret ``` Moreover, we modify the opt_storage part by moving the else part, and it costs less instructions. ![](https://hackmd.io/_uploads/r1eoq8fWa.png =x180) It reduces 3 cycles.