Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <huangchenghan> ( Cheng-Han, Huang )

Problem C to RISC-V

Problem C has three functions to be implemented: fabsf, my_clz, and fp16_to_fp32.

fabsf

C code

This function directly operates the bits of floating point numbers to calculate the absolute value. Specifically, it removes the negative sign by converting a floating-point number to integer form, then clearing the sign bit (the highest bit), and finally converting the result back to a floating-point number.






static inline float fabsf(float x) {
    uint32_t i = *(uint32_t *)&x;  // Read the bits of the float into an integer
    i &= 0x7FFFFFFF;               // Clear the sign bit to get the absolute value
    x = *(float *)&i;              // Write the modified bits back into the float
    return x;
}

Assembly code









fabsf:
    # a0 is the input parameter x
    # t0 is i

    # i &= 0x7FFFFFFF; 
    li t0, 0x7FFFFFFF
    and a0, a0, t0
    
    ret

my_clz

C code (original)

The my_clz function counts the number of leading zeros in a 32-bit unsigned integer x. It does this by checking each bit from the most significant (bit 31) down to the least significant, incrementing a counter until it finds a 1. The function then returns the count of leading zeros.









static inline int my_clz(uint32_t x) {
    int count = 0;
    for (int i = 31; i >= 0; --i) {
        if (x & (1U << i))
            break;
        count++;
    }
    return count;
}

C code (Optimized)

The original my_clz iterates over all 32 bits, making it less efficient, especially for large numbers, because it checks each bit one by one. The optimized version uses a binary search approach, which narrows down the range of bits to check by shifting left in chunks (16, 8, 4, 2, 1), making it faster by reducing the number of comparisons. The binary search method is more efficient as it requires fewer iterations in the worst case. However, the original is simpler and more intuitive, while the optimized version is slightly more complex but significantly quicker. The optimized method takes constant time, whereas the original takes linear time relative to the number of leading zeros.




















static inline int my_clz(uint32_t x) {
    int r = 0, c;
    c = (x < 0x00010000) << 4;  //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 16 | r += 0
    x <<= c;                    // x <<= 16 | x <<= 0
    c = (x < 0x01000000) << 3;  //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 8 | r += 0
    x <<= c;                    // x <<= 8 | x <<= 0
    c = (x < 0x10000000) << 2;  //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 4 | r += 0
    x <<= c;                    // x <<= 4 | x <<= 0
    c = (x < 0x40000000) << 1;  //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 2 | r += 0
    x <<= c;                    // x <<= 2 | x <<= 0
    c = x < 0x80000000;         //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 1 | r += 0
    x <<= c;                    // x <<= 1 | x <<= 0
    r += x == 0;
    return r;
}

Assembly code




























































my_clz:
        # t0 is the input parameter x
        # t1 is c
        # t2 is r
        mv t0, a0
        
        # int r = 0, c;    
        li t2, 0
        
        # c = (x < 0x00010000) << 4;
        li t1 0x00010000
        sltu t1, t0, t1
        slli t1, t1, 4
        # r += c;           
        add t2, t2, t1
        # x <<= c       
        sll t0, t0, t1
        
        # c = (x < 0x01000000) << 3;    
        li t1 0x01000000
        sltu t1, t0, t1
        slli t1, t1, 3
        # r += c;         
        add t2, t2, t1
        # x <<= c          
        sll t0, t0, t1
          
        # c = (x < 0x10000000) << 2;   
        li t1 0x10000000
        sltu t1, t0, t1
        slli t1, t1, 2
        # r += c;         
        add t2, t2, t1
        # x <<= c         
        sll t0, t0, t1
        
        # c = (x < 0x40000000) << 1;
        li t1 0x40000000
        sltu t1, t0, t1
        slli t1, t1, 1
        # r += c;         
        add t2, t2, t1
        # x <<= c       
        sll t0, t0, t1
        
        # c = x < 0x80000000;
        li t1 0x80000000
        sltu t1, t0, t1
        # r += c; 
        add t2, t2, t1
        # x <<= c               
        sll t0, t0, t1
        
        # r += x == 0;
        seqz t0, t0
        add t2, t2, t0
        
        # return r;
        mv a0, t2
        ret

fp16_to_fp32

C code

The fp16_to_fp32 function converts a 16-bit floating point number (FP16) into a 32-bit floating point number (FP32). It first shifts the 16-bit input left by 16 bits to align it into a 32-bit space. Then, it extracts the sign bit and uses my_clz to calculate a renormalization shift for the exponent. The function handles special cases like infinity and NaN using masks and adjusts the value to fit the FP32 format. Finally, it combines the sign, exponent, and mantissa into a 32-bit floating point result.






















































#include <stdio.h>
#include <stdint.h>

static inline int my_clz(uint32_t x) {
    int r = 0, c;
    c = (x < 0x00010000) << 4;  //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 16 | r += 0
    x <<= c;                    // x <<= 16 | x <<= 0
    c = (x < 0x01000000) << 3;  //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 8 | r += 0
    x <<= c;                    // x <<= 8 | x <<= 0
    c = (x < 0x10000000) << 2;  //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 4 | r += 0
    x <<= c;                    // x <<= 4 | x <<= 0
    c = (x < 0x40000000) << 1;  //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 2 | r += 0
    x <<= c;                    // x <<= 2 | x <<= 0
    c = x < 0x80000000;         //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 1 | r += 0
    x <<= c;                    // x <<= 1 | x <<= 0
    r += x == 0;
    return r;
}

static inline uint32_t fp16_to_fp32(uint16_t h) {

    const uint32_t w = (uint32_t) h << 16;
    
    const uint32_t sign = w & UINT32_C(0x80000000);
    
    const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);
    
    uint32_t renorm_shift = my_clz(nonsign);
    renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
    
    const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) &
                                 INT32_C(0x7F800000);
    
    const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
    
    return sign | ((((nonsign << renorm_shift >> 3) +
            ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
}

int main()
{
    printf("fp16_to_fp32(0x0618) is : 0x%x\n", fp16_to_fp32(0x0618));  // normalized number
    printf("fp16_to_fp32(0x000F) is : 0x%x\n", fp16_to_fp32(0x000F));  // denormalized number
    printf("fp16_to_fp32(0x0000) is : 0x%x\n", fp16_to_fp32(0x0000));  // positive zero
    printf("fp16_to_fp32(0x8000) is : 0x%x\n", fp16_to_fp32(0x8000));  // negative zero
    printf("fp16_to_fp32(0x7C00) is : 0x%x\n", fp16_to_fp32(0x7C00));  //positive infinity
    printf("fp16_to_fp32(0xFC00) is : 0x%x\n", fp16_to_fp32(0xFC00));  //negative infinity
    printf("fp16_to_fp32(0x7CFF) is : 0x%x\n", fp16_to_fp32(0x7CFF));  // NaN, Not a Number
}

fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000F) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7C00) is : 0x7f800000
fp16_to_fp32(0xFC00) is : 0xff800000
fp16_to_fp32(0x7CFF) is : 0x7f9fe000

Assembly code





























































































































































































.data
    datas: .word 0x0618, 0x000F, 0x0000, 0x8000, 0x7C00, 0xFC00, 0x7CFF
    ans:   .word 0x38c30000, 0x35700000, 0x0, 0x80000000, 0x7f800000, 0xff800000, 0x7f9fe000
    str1:     .string "\nfp16_to_fp32(" 
    str2:     .string ") is : " 
    strError: .string "\nthe answer is wrong!!!"
.text
main:
        # Load datas reference
        la s6, datas
        
        # Load ans reference
        la s7, ans
        
        # Load the loop count
        li s8, 7
        
print_numbers:
        # print "\nfp16_to_fp32(" 
        la a0, str1
        li a7, 4
        ecall
        
        # print 'data'
        lw a0, 0(s6)
        li, a7, 34
        ecall
        
        # print ") is : " 
        la a0, str2
        li a7, 4
        ecall
        
        # print fp16_to_fp32(data)  
        lw a0, 0(s6)
        jal ra, fp16_to_fp32
        li, a7, 34
        ecall
     
validation:
        # check if fp16_to_fp32(data) == ans 
        lw t0, 0(s7)
        sub t0, t0, a0
        beqz t0, check_loop
        
        # print "\nthe answer is wrong!!!"
        la a0, strError
        li a7, 4
        ecall
        
check_loop:
        # next data, ans
        addi s6, s6, 4
        addi s7, s7, 4
        addi s8, s8, -1
        bnez s8, print_numbers
        
exit:
        # Exit the program
        li a7, 10                  # System call code for exiting the program
        ecall                      # Make the exit system call
        ret

fp16_to_fp32:
        # a0 is the input parameter h
        # t0 is w
        # t1 is sign
        # t2 is nosign
        
        # const uint32_t w = (uint32_t) h << 16; 
        slli t0, a0, 16
        # const uint32_t sign = w & UINT32_C(0x80000000);     
        li t1, 0x80000000
        and t1, t0, t1
        # const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF);     
        li t2, 0x7FFFFFFF
        and t2, t0, t2

my_clz:
        # t3 is the input parameter x   
        # t4 is c
        # t5 is r  
        mv t3, t2
        
        # int r = 0, c;    
        li t5, 0
        
        # c = (x < 0x00010000) << 4;
        li t4 0x00010000
        sltu t4, t3, t4
        slli t4, t4, 4
        # r += c;           
        add t5, t5, t4
        # x <<= c       
        sll t3, t3, t4
        
        # c = (x < 0x01000000) << 3;    
        li t4 0x01000000
        sltu t4, t3, t4
        slli t4, t4, 3
        # r += c;         
        add t5, t5, t4
        # x <<= c          
        sll t3, t3, t4
          
        # c = (x < 0x10000000) << 2;   
        li t4 0x10000000
        sltu t4, t3, t4
        slli t4, t4, 2
        # r += c;         
        add t5, t5, t4
        # x <<= c         
        sll t3, t3, t4
        
        # c = (x < 0x40000000) << 1;
        li t4 0x40000000
        sltu t4, t3, t4
        slli t4, t4, 1
        # r += c;         
        add t5, t5, t4
        # x <<= c       
        sll t3, t3, t4
        
        # c = x < 0x80000000;
        li t4 0x80000000
        sltu t4, t3, t4
        # r += c; 
        add t5, t5, t4
        # x <<= c               
        sll t3, t3, t4
        
        # r += x == 0;
        seqz t3, t3
        add t5, t5, t3
        
back_to_fp16_to_fp32:
        # t0, t3, t4 can be reused
        # t5 is renorm_shift
        
        # renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0;
        li t0, 5
        bgtu t5, t0, renorm_shift_minus_five
        j renorm_shift_is_zero

renorm_shift_minus_five:
        sub t5, t5, t0
        j continue_fp16_to_fp32

renorm_shift_is_zero:
        li t5, 0
        
continue_fp16_to_fp32:
        # t3 is inf_nan_mask
        # t4 is zero_mask
        # t0 is temp1
        # t6 is temp2
        
        # const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000);
        li t0, 0x04000000
        add t3, t2, t0
        srai t3, t3, 8
        li t0, 0x7F800000
        and t3, t3, t0
        
        # const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31;
        addi t4, t2, -1
        srai t4, t4, 31
        
        # return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask);
        # temp1 = (nonsign << renorm_shift >> 3)
        sll t0, t2, t5
        srli t0, t0, 3
        # temp2 = ((0x70 - renorm_shift) << 23)
        li t6, 0x70
        sub t6, t6, t5
        slli t6, t6, 23
        # temp1 = temp1 + temp2
        add t0, t0, t6
        # temp1 = temp1 | inf_nan_mask
        or t0, t0, t3
        # temp1 = temp1 & ~zero_mask
        not t4, t4
        and t0, t0, t4
        # temp1 = sign | temp1
        or t0, t1, t0         
        # return temp1
        mv a0, t0
        
        ret

fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000f) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0000
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7c00) is : 0x7f800000
fp16_to_fp32(0xfc00) is : 0xff800000
fp16_to_fp32(0x7cff) is : 0x7f9fe000

LeetCode Problem: Palindrome Number

Leetcode 9. Palindrome Number

Description

Given an integer x, return true if x is a palindrome, and false otherwise.

The original task was to check if a number is a palindrome in base 10. To better fit the use of clz, I made a slight modification to the task, changing it to check if a number is a palindrome in base 2.

Example 1 :

Input: x = 101
Output: 1
Explanation: 101 reads as 101 from left to right and from right to left.

Example 2 :

Input: x = 11011
Output: 1
Explanation: 11011 reads as 11011 from left to right and from right to left.

Example 3 :

Input: x = 10
Output: 0
Explanation: Reads 01 from right to left. Therefore it is not a palindrome.

C code



















































#include <stdio.h>
#include <stdint.h>

uint32_t my_clz(uint32_t x) {
    int r = 0, c;
    c = (x < 0x00010000) << 4;  //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 16 | r += 0
    x <<= c;                    // x <<= 16 | x <<= 0
    c = (x < 0x01000000) << 3;  //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 8 | r += 0
    x <<= c;                    // x <<= 8 | x <<= 0
    c = (x < 0x10000000) << 2;  //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 4 | r += 0
    x <<= c;                    // x <<= 4 | x <<= 0
    c = (x < 0x40000000) << 1;  //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 2 | r += 0
    x <<= c;                    // x <<= 2 | x <<= 0
    c = x < 0x80000000;         //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0
    r += c;                     // r += 1 | r += 0
    x <<= c;                    // x <<= 1 | x <<= 0
    r += x == 0;
    return r;
}

int checkPalindrome(uint32_t N)
{
    // Stores the reverse of N
    uint32_t rev = 0;

    uint32_t num = 32 - my_clz(N);
    uint32_t move = num >> 1;
    uint32_t additional_move = num & 1;
 
    for(int i = 0; i < move; i++)
    {
        rev = (rev << 1) + (N & 1);
        N = N >> 1;
    }
    N = N >> additional_move;
 
    return N == rev;
}

int main()
{
    printf("checkPalindrome(0x0F00000F) is : %d\n", checkPalindrome(0x0F00000F));
    printf("checkPalindrome(0x0E000001) is : %d\n", checkPalindrome(0x0E000001));
    printf("checkPalindrome(0x0000001B) is : %d\n", checkPalindrome(0x0000001B));
    printf("checkPalindrome(0x00000002) is : %d\n", checkPalindrome(0x00000002));
    printf("checkPalindrome(0x00000001) is : %d\n", checkPalindrome(0x00000001));
}

checkPalindrome(0x0F00000F) is : 1
checkPalindrome(0x0E000001) is : 0
checkPalindrome(0x0000001B) is : 1
checkPalindrome(0x00000002) is : 0
checkPalindrome(0x00000001) is : 1

Assembly code
















































































































































































.data
    datas: .word 0x0F00000F, 0x0E000001, 0x0000001B, 0x00000002, 0x00000001
    ans:   .word 0x1, 0x0, 0x1, 0x0, 0x1
    str1:     .string "\ncheckPalindrome(" 
    str2:     .string ") is : " 
    strError: .string "\nthe answer is wrong!!!"
.text
main:
        # Load datas reference
        la s6, datas
        
        # Load ans reference
        la s7, ans
        
        # Load the loop count
        li s8, 5
        
print_numbers:
        # print "\ncheckPalindrome(" 
        la a0, str1
        li a7, 4
        ecall
        
        # print 'data'
        lw a0, 0(s6)
        li, a7, 34
        ecall
        
        # print ") is : " 
        la a0, str2
        li a7, 4
        ecall
        
        # print checkPalindrome(data)  
        lw a0, 0(s6)
        jal ra, checkPalindrome
        li, a7, 1
        ecall
     
validation:
        # check if fp16_to_fp32(data) == ans 
        lw t0, 0(s7)
        sub t0, t0, a0
        beqz t0, check_loop
        
        # print "\nthe answer is wrong!!!"
        la a0, strError
        li a7, 4
        ecall
        
check_loop:
        # next data, ans
        addi s6, s6, 4
        addi s7, s7, 4
        addi s8, s8, -1
        bnez s8, print_numbers
        
exit:
        # Exit the program
        li a7, 10                  # System call code for exiting the program
        ecall                      # Make the exit system call
        ret

checkPalindrome:
        # t0 is the input parameter N
        
        mv t0, a0
        
my_clz:
        # t0 is the input parameter x
        # t1 is c
        # t2 is r
        
        # int r = 0, c;    
        li t2, 0
        
        # c = (x < 0x00010000) << 4;
        li t1 0x00010000
        sltu t1, t0, t1
        slli t1, t1, 4
        # r += c;           
        add t2, t2, t1
        # x <<= c       
        sll t0, t0, t1
        
        # c = (x < 0x01000000) << 3;    
        li t1 0x01000000
        sltu t1, t0, t1
        slli t1, t1, 3
        # r += c;         
        add t2, t2, t1
        # x <<= c          
        sll t0, t0, t1
          
        # c = (x < 0x10000000) << 2;   
        li t1 0x10000000
        sltu t1, t0, t1
        slli t1, t1, 2
        # r += c;         
        add t2, t2, t1
        # x <<= c         
        sll t0, t0, t1
        
        # c = (x < 0x40000000) << 1;
        li t1 0x40000000
        sltu t1, t0, t1
        slli t1, t1, 1
        # r += c;         
        add t2, t2, t1
        # x <<= c       
        sll t0, t0, t1
        
        # c = x < 0x80000000;
        li t1 0x80000000
        sltu t1, t0, t1
        # r += c; 
        add t2, t2, t1
        # x <<= c               
        sll t0, t0, t1
        
        # r += x == 0;
        seqz t0, t0
        add t2, t2, t0
        
        
back_to_checkPalindrome:
        # t0 is N
        # t1 is rev
        # t2 is num
        # t3 is move
        # t4 is additional_move
        # t5 is i
        mv t0, a0
        
        # uint32_t rev = 0;
        li t1, 0
        
        # uint32_t num = 32 - my_clz(N);
        li t6, 32
        sub t2, t6, t2
        
        # uint32_t move = num >> 1;
        srli t3, t2, 1
        
        # uint32_t additional_move = num & 1;
        andi t4, t2, 1
        
        li t5, 0
        
loop_start:
        # if (i >= move) break
        bge t5, t3, loop_end    
        
        # rev = (rev << 1) + (N & 1);
        slli t1, t1, 1
        andi t6, t0, 1
        add t1, t1, t6
        
        # N = N >> 1;
        srli t0, t0, 1
        
        # i++
        addi t5, t5, 1
        
        # jump to loop start   
        j loop_start 

loop_end:
        # N = N >> additional_move;
        srl t0, t0, t4
        
        # return N == rev;
        sub t6, t0, t1
        sltiu a0, t6, 1
        ret

checkPalindrome(0xf00000f) is : 1
checkPalindrome(0xe000001) is : 0
checkPalindrome(0x001b) is : 1
checkPalindrome(0x0002) is : 0
checkPalindrome(0x0001) is : 1

Cycle comparison

The execution results of the RISC-V code generated from C code compiled with the RISC-V C compiler and written by myself:

Compiled by RISC-V C compiler	Written by myself

5-stage pipelined processor

Description

In a 5-stage pipelined processor in RISC-V, instruction execution is divided into five main stages to improve efficiency by allowing multiple instructions to be processed simultaneously at different stages. These stages are:

IF (Instruction Fetch): The current instruction is fetched from the instruction memory, and the program counter (PC) is incremented to prepare for fetching the next instruction.
ID (Instruction Decode): The fetched instruction is decoded, and the source operands are read from the register. This stage may also compute immediate values and identify branch instructions.
EX (Execute / ALU): In this stage, the Arithmetic Logic Unit (ALU) performs operations specified by the instruction, such as addition, subtraction, logical operations, or shifting. For load/store instructions, memory addresses are calculated.
MEM (Memory Access): For load or store instructions, this stage accesses the data memory. Load instructions read data from memory, while store instructions write data to memory.
WB (Write Back): The results from the EX or MEM stage are written back to the register, completing the instruction execution. This is the final step of processing the instruction.