Assignment1: RISC-V Assembly and Instruction Pipeline
contributed by <huangchenghan> ( Cheng-Han, Huang )
Problem C to RISC-V
Problem C has three functions to be implemented: fabsf
, my_clz
, and fp16_to_fp32
.
fabsf
C code
This function directly operates the bits of floating point numbers to calculate the absolute value. Specifically, it removes the negative sign by converting a floating-point number to integer form, then clearing the sign bit (the highest bit), and finally converting the result back to a floating-point number.
Assembly code
my_clz
C code (original)
The my_clz
function counts the number of leading zeros in a 32-bit unsigned integer x
. It does this by checking each bit from the most significant (bit 31) down to the least significant, incrementing a counter until it finds a 1
. The function then returns the count of leading zeros.
C code (Optimized)
The original my_clz iterates over all 32 bits, making it less efficient, especially for large numbers, because it checks each bit one by one. The optimized version uses a binary search approach, which narrows down the range of bits to check by shifting left in chunks (16, 8, 4, 2, 1), making it faster by reducing the number of comparisons. The binary search method is more efficient as it requires fewer iterations in the worst case. However, the original is simpler and more intuitive, while the optimized version is slightly more complex but significantly quicker. The optimized method takes constant time, whereas the original takes linear time relative to the number of leading zeros.
Assembly code
my_clz:
# t0 is the input parameter x
# t1 is c
# t2 is r
mv t0, a0
# int r = 0, c;
li t2, 0
# c = (x < 0x00010000) << 4;
li t1 0x00010000
sltu t1, t0, t1
slli t1, t1, 4
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x01000000) << 3;
li t1 0x01000000
sltu t1, t0, t1
slli t1, t1, 3
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x10000000) << 2;
li t1 0x10000000
sltu t1, t0, t1
slli t1, t1, 2
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x40000000) << 1;
li t1 0x40000000
sltu t1, t0, t1
slli t1, t1, 1
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = x < 0x80000000;
li t1 0x80000000
sltu t1, t0, t1
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# r += x == 0;
seqz t0, t0
add t2, t2, t0
# return r;
mv a0, t2
ret
fp16_to_fp32
C code
The fp16_to_fp32
function converts a 16-bit floating point number (FP16) into a 32-bit floating point number (FP32). It first shifts the 16-bit input left by 16 bits to align it into a 32-bit space. Then, it extracts the sign bit and uses my_clz
to calculate a renormalization shift for the exponent. The function handles special cases like infinity and NaN using masks and adjusts the value to fit the FP32 format. Finally, it combines the sign, exponent, and mantissa into a 32-bit floating point result.
fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000F) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7C00) is : 0x7f800000
fp16_to_fp32(0xFC00) is : 0xff800000
fp16_to_fp32(0x7CFF) is : 0x7f9fe000
Assembly code
.data
datas: .word 0x0618, 0x000F, 0x0000, 0x8000, 0x7C00, 0xFC00, 0x7CFF
ans: .word 0x38c30000, 0x35700000, 0x0, 0x80000000, 0x7f800000, 0xff800000, 0x7f9fe000
str1: .string "\nfp16_to_fp32("
str2: .string ") is : "
strError: .string "\nthe answer is wrong!!!"
.text
main:
# Load datas reference
la s6, datas
# Load ans reference
la s7, ans
# Load the loop count
li s8, 7
print_numbers:
# print "\nfp16_to_fp32("
la a0, str1
li a7, 4
ecall
# print 'data'
lw a0, 0(s6)
li, a7, 34
ecall
# print ") is : "
la a0, str2
li a7, 4
ecall
# print fp16_to_fp32(data)
lw a0, 0(s6)
jal ra, fp16_to_fp32
li, a7, 34
ecall
validation:
# check if fp16_to_fp32(data) == ans
lw t0, 0(s7)
sub t0, t0, a0
beqz t0, check_loop
# print "\nthe answer is wrong!!!"
la a0, strError
li a7, 4
ecall
check_loop:
# next data, ans
addi s6, s6, 4
addi s7, s7, 4
addi s8, s8, -1
bnez s8, print_numbers
exit:
# Exit the program
li a7, 10 # System call code for exiting the program
ecall # Make the exit system call
ret
fp16_to_fp32:
# a0 is the input parameter h
# t0 is w
# t1 is sign
# t2 is nosign
# const uint32_t w = (uint32_t) h << 16;
slli t0, a0, 16
# const uint32_t sign = w & UINT32_C(0x80000000);
li t1, 0x80000000
and t1, t0, t1
# const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
li t2, 0x7FFFFFFF
and t2, t0, t2
my_clz:
# t3 is the input parameter x
# t4 is c
# t5 is r
mv t3, t2
# int r = 0, c;
li t5, 0
# c = (x < 0x00010000) << 4;
li t4 0x00010000
sltu t4, t3, t4
slli t4, t4, 4
# r += c;
add t5, t5, t4
# x <<= c
sll t3, t3, t4
# c = (x < 0x01000000) << 3;
li t4 0x01000000
sltu t4, t3, t4
slli t4, t4, 3
# r += c;
add t5, t5, t4
# x <<= c
sll t3, t3, t4
# c = (x < 0x10000000) << 2;
li t4 0x10000000
sltu t4, t3, t4
slli t4, t4, 2
# r += c;
add t5, t5, t4
# x <<= c
sll t3, t3, t4
# c = (x < 0x40000000) << 1;
li t4 0x40000000
sltu t4, t3, t4
slli t4, t4, 1
# r += c;
add t5, t5, t4
# x <<= c
sll t3, t3, t4
# c = x < 0x80000000;
li t4 0x80000000
sltu t4, t3, t4
# r += c;
add t5, t5, t4
# x <<= c
sll t3, t3, t4
# r += x == 0;
seqz t3, t3
add t5, t5, t3
back_to_fp16_to_fp32:
# t0, t3, t4 can be reused
# t5 is renorm_shift
# renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
li t0, 5
bgtu t5, t0, renorm_shift_minus_five
j renorm_shift_is_zero
renorm_shift_minus_five:
sub t5, t5, t0
j continue_fp16_to_fp32
renorm_shift_is_zero:
li t5, 0
continue_fp16_to_fp32:
# t3 is inf_nan_mask
# t4 is zero_mask
# t0 is temp1
# t6 is temp2
# const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
li t0, 0x04000000
add t3, t2, t0
srai t3, t3, 8
li t0, 0x7F800000
and t3, t3, t0
# const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
addi t4, t2, -1
srai t4, t4, 31
# return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
# temp1 = (nonsign << renorm_shift >> 3)
sll t0, t2, t5
srli t0, t0, 3
# temp2 = ((0x70 - renorm_shift) << 23)
li t6, 0x70
sub t6, t6, t5
slli t6, t6, 23
# temp1 = temp1 + temp2
add t0, t0, t6
# temp1 = temp1 | inf_nan_mask
or t0, t0, t3
# temp1 = temp1 & ~zero_mask
not t4, t4
and t0, t0, t4
# temp1 = sign | temp1
or t0, t1, t0
# return temp1
mv a0, t0
ret
fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000f) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0000
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7c00) is : 0x7f800000
fp16_to_fp32(0xfc00) is : 0xff800000
fp16_to_fp32(0x7cff) is : 0x7f9fe000
LeetCode Problem: Palindrome Number
Leetcode 9. Palindrome Number
Description
Given an integer x, return true if x is a palindrome, and false otherwise.
The original task was to check if a number is a palindrome in base 10. To better fit the use of clz
, I made a slight modification to the task, changing it to check if a number is a palindrome in base 2.
Example 1 :
Input: x = 101
Output: 1
Explanation: 101 reads as 101 from left to right and from right to left.
Example 2 :
Input: x = 11011
Output: 1
Explanation: 11011 reads as 11011 from left to right and from right to left.
Example 3 :
Input: x = 10
Output: 0
Explanation: Reads 01 from right to left. Therefore it is not a palindrome.
C code
checkPalindrome(0x0F00000F) is : 1
checkPalindrome(0x0E000001) is : 0
checkPalindrome(0x0000001B) is : 1
checkPalindrome(0x00000002) is : 0
checkPalindrome(0x00000001) is : 1
Assembly code
.data
datas: .word 0x0F00000F, 0x0E000001, 0x0000001B, 0x00000002, 0x00000001
ans: .word 0x1, 0x0, 0x1, 0x0, 0x1
str1: .string "\ncheckPalindrome("
str2: .string ") is : "
strError: .string "\nthe answer is wrong!!!"
.text
main:
# Load datas reference
la s6, datas
# Load ans reference
la s7, ans
# Load the loop count
li s8, 5
print_numbers:
# print "\ncheckPalindrome("
la a0, str1
li a7, 4
ecall
# print 'data'
lw a0, 0(s6)
li, a7, 34
ecall
# print ") is : "
la a0, str2
li a7, 4
ecall
# print checkPalindrome(data)
lw a0, 0(s6)
jal ra, checkPalindrome
li, a7, 1
ecall
validation:
# check if fp16_to_fp32(data) == ans
lw t0, 0(s7)
sub t0, t0, a0
beqz t0, check_loop
# print "\nthe answer is wrong!!!"
la a0, strError
li a7, 4
ecall
check_loop:
# next data, ans
addi s6, s6, 4
addi s7, s7, 4
addi s8, s8, -1
bnez s8, print_numbers
exit:
# Exit the program
li a7, 10 # System call code for exiting the program
ecall # Make the exit system call
ret
checkPalindrome:
# t0 is the input parameter N
mv t0, a0
my_clz:
# t0 is the input parameter x
# t1 is c
# t2 is r
# int r = 0, c;
li t2, 0
# c = (x < 0x00010000) << 4;
li t1 0x00010000
sltu t1, t0, t1
slli t1, t1, 4
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x01000000) << 3;
li t1 0x01000000
sltu t1, t0, t1
slli t1, t1, 3
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x10000000) << 2;
li t1 0x10000000
sltu t1, t0, t1
slli t1, t1, 2
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = (x < 0x40000000) << 1;
li t1 0x40000000
sltu t1, t0, t1
slli t1, t1, 1
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# c = x < 0x80000000;
li t1 0x80000000
sltu t1, t0, t1
# r += c;
add t2, t2, t1
# x <<= c
sll t0, t0, t1
# r += x == 0;
seqz t0, t0
add t2, t2, t0
back_to_checkPalindrome:
# t0 is N
# t1 is rev
# t2 is num
# t3 is move
# t4 is additional_move
# t5 is i
mv t0, a0
# uint32_t rev = 0;
li t1, 0
# uint32_t num = 32 - my_clz(N);
li t6, 32
sub t2, t6, t2
# uint32_t move = num >> 1;
srli t3, t2, 1
# uint32_t additional_move = num & 1;
andi t4, t2, 1
li t5, 0
loop_start:
# if (i >= move) break
bge t5, t3, loop_end
# rev = (rev << 1) + (N & 1);
slli t1, t1, 1
andi t6, t0, 1
add t1, t1, t6
# N = N >> 1;
srli t0, t0, 1
# i++
addi t5, t5, 1
# jump to loop start
j loop_start
loop_end:
# N = N >> additional_move;
srl t0, t0, t4
# return N == rev;
sub t6, t0, t1
sltiu a0, t6, 1
ret
checkPalindrome(0xf00000f) is : 1
checkPalindrome(0xe000001) is : 0
checkPalindrome(0x001b) is : 1
checkPalindrome(0x0002) is : 0
checkPalindrome(0x0001) is : 1
Cycle comparison
The execution results of the RISC-V code generated from C code compiled with the RISC-V C compiler and written by myself:
Compiled by RISC-V C compiler |
Written by myself |
 |
 |
5-stage pipelined processor
Description
In a 5-stage pipelined processor in RISC-V, instruction execution is divided into five main stages to improve efficiency by allowing multiple instructions to be processed simultaneously at different stages. These stages are:
- IF (Instruction Fetch): The current instruction is fetched from the instruction memory, and the program counter (PC) is incremented to prepare for fetching the next instruction.
- ID (Instruction Decode): The fetched instruction is decoded, and the source operands are read from the register. This stage may also compute immediate values and identify branch instructions.
- EX (Execute / ALU): In this stage, the Arithmetic Logic Unit (ALU) performs operations specified by the instruction, such as addition, subtraction, logical operations, or shifting. For load/store instructions, memory addresses are calculated.
- MEM (Memory Access): For load or store instructions, this stage accesses the data memory. Load instructions read data from memory, while store instructions write data to memory.
- WB (Write Back): The results from the EX or MEM stage are written back to the register, completing the instruction execution. This is the final step of processing the instruction.

Example
Here I take addi t0, t0, 12
as an example
1. IF (Instruction Fetch)
- PC is
0x00000000
- Fetch instructions
0x00c28293
from address 0x00000000
in Instruction memory
- Add 4 to the value of PC

2. ID (Instruction Decode)
immediate |
rs1 |
func |
rd |
opcode |
000000001100 |
00101 |
000 |
00101 |
0010011 |
- Decode the instruction
0x00c28293
- Get destination register
0x05
- Get the value
0x00000000
from source register 0x05
- Get the immediate value
0x0000000c

3. EX (Execute / ALU)
- ADD
0x00000000
and 0x0000000c
- Get the result
0x0000000c

4. MEM (Memory Access)
- I-type format does not need to access Data memory

5. WB (Write Back)
- Write the result
0x0000000c
back to the destination register 0x05

Reference
Quiz1 of Computer Architecture (2024 Fall)
Branchless count-leading-zeros on 32-bit RISC-V without Zbb extension