Try   HackMD

Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by <huangchenghan> ( Cheng-Han, Huang )

Problem C to RISC-V

Problem C has three functions to be implemented: fabsf, my_clz, and fp16_to_fp32.

fabsf

C code

This function directly operates the bits of floating point numbers to calculate the absolute value. Specifically, it removes the negative sign by converting a floating-point number to integer form, then clearing the sign bit (the highest bit), and finally converting the result back to a floating-point number.

static inline float fabsf(float x) { uint32_t i = *(uint32_t *)&x; // Read the bits of the float into an integer i &= 0x7FFFFFFF; // Clear the sign bit to get the absolute value x = *(float *)&i; // Write the modified bits back into the float return x; }

Assembly code

fabsf: # a0 is the input parameter x # t0 is i # i &= 0x7FFFFFFF; li t0, 0x7FFFFFFF and a0, a0, t0 ret

my_clz

C code (original)

The my_clz function counts the number of leading zeros in a 32-bit unsigned integer x. It does this by checking each bit from the most significant (bit 31) down to the least significant, incrementing a counter until it finds a 1. The function then returns the count of leading zeros.

static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; }

C code (Optimized)

The original my_clz iterates over all 32 bits, making it less efficient, especially for large numbers, because it checks each bit one by one. The optimized version uses a binary search approach, which narrows down the range of bits to check by shifting left in chunks (16, 8, 4, 2, 1), making it faster by reducing the number of comparisons. The binary search method is more efficient as it requires fewer iterations in the worst case. However, the original is simpler and more intuitive, while the optimized version is slightly more complex but significantly quicker. The optimized method takes constant time, whereas the original takes linear time relative to the number of leading zeros.

static inline int my_clz(uint32_t x) { int r = 0, c; c = (x < 0x00010000) << 4; //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0 r += c; // r += 16 | r += 0 x <<= c; // x <<= 16 | x <<= 0 c = (x < 0x01000000) << 3; //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0 r += c; // r += 8 | r += 0 x <<= c; // x <<= 8 | x <<= 0 c = (x < 0x10000000) << 2; //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0 r += c; // r += 4 | r += 0 x <<= c; // x <<= 4 | x <<= 0 c = (x < 0x40000000) << 1; //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0 r += c; // r += 2 | r += 0 x <<= c; // x <<= 2 | x <<= 0 c = x < 0x80000000; //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0 r += c; // r += 1 | r += 0 x <<= c; // x <<= 1 | x <<= 0 r += x == 0; return r; }

Assembly code

my_clz: # t0 is the input parameter x # t1 is c # t2 is r mv t0, a0 # int r = 0, c; li t2, 0 # c = (x < 0x00010000) << 4; li t1 0x00010000 sltu t1, t0, t1 slli t1, t1, 4 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x01000000) << 3; li t1 0x01000000 sltu t1, t0, t1 slli t1, t1, 3 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x10000000) << 2; li t1 0x10000000 sltu t1, t0, t1 slli t1, t1, 2 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x40000000) << 1; li t1 0x40000000 sltu t1, t0, t1 slli t1, t1, 1 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = x < 0x80000000; li t1 0x80000000 sltu t1, t0, t1 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # r += x == 0; seqz t0, t0 add t2, t2, t0 # return r; mv a0, t2 ret

fp16_to_fp32

C code

The fp16_to_fp32 function converts a 16-bit floating point number (FP16) into a 32-bit floating point number (FP32). It first shifts the 16-bit input left by 16 bits to align it into a 32-bit space. Then, it extracts the sign bit and uses my_clz to calculate a renormalization shift for the exponent. The function handles special cases like infinity and NaN using masks and adjusts the value to fit the FP32 format. Finally, it combines the sign, exponent, and mantissa into a 32-bit floating point result.

#include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int r = 0, c; c = (x < 0x00010000) << 4; //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0 r += c; // r += 16 | r += 0 x <<= c; // x <<= 16 | x <<= 0 c = (x < 0x01000000) << 3; //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0 r += c; // r += 8 | r += 0 x <<= c; // x <<= 8 | x <<= 0 c = (x < 0x10000000) << 2; //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0 r += c; // r += 4 | r += 0 x <<= c; // x <<= 4 | x <<= 0 c = (x < 0x40000000) << 1; //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0 r += c; // r += 2 | r += 0 x <<= c; // x <<= 2 | x <<= 0 c = x < 0x80000000; //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0 r += c; // r += 1 | r += 0 x <<= c; // x <<= 1 | x <<= 0 r += x == 0; return r; } static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } int main() { printf("fp16_to_fp32(0x0618) is : 0x%x\n", fp16_to_fp32(0x0618)); // normalized number printf("fp16_to_fp32(0x000F) is : 0x%x\n", fp16_to_fp32(0x000F)); // denormalized number printf("fp16_to_fp32(0x0000) is : 0x%x\n", fp16_to_fp32(0x0000)); // positive zero printf("fp16_to_fp32(0x8000) is : 0x%x\n", fp16_to_fp32(0x8000)); // negative zero printf("fp16_to_fp32(0x7C00) is : 0x%x\n", fp16_to_fp32(0x7C00)); //positive infinity printf("fp16_to_fp32(0xFC00) is : 0x%x\n", fp16_to_fp32(0xFC00)); //negative infinity printf("fp16_to_fp32(0x7CFF) is : 0x%x\n", fp16_to_fp32(0x7CFF)); // NaN, Not a Number }

fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000F) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7C00) is : 0x7f800000
fp16_to_fp32(0xFC00) is : 0xff800000
fp16_to_fp32(0x7CFF) is : 0x7f9fe000

Assembly code

.data datas: .word 0x0618, 0x000F, 0x0000, 0x8000, 0x7C00, 0xFC00, 0x7CFF ans: .word 0x38c30000, 0x35700000, 0x0, 0x80000000, 0x7f800000, 0xff800000, 0x7f9fe000 str1: .string "\nfp16_to_fp32(" str2: .string ") is : " strError: .string "\nthe answer is wrong!!!" .text main: # Load datas reference la s6, datas # Load ans reference la s7, ans # Load the loop count li s8, 7 print_numbers: # print "\nfp16_to_fp32(" la a0, str1 li a7, 4 ecall # print 'data' lw a0, 0(s6) li, a7, 34 ecall # print ") is : " la a0, str2 li a7, 4 ecall # print fp16_to_fp32(data) lw a0, 0(s6) jal ra, fp16_to_fp32 li, a7, 34 ecall validation: # check if fp16_to_fp32(data) == ans lw t0, 0(s7) sub t0, t0, a0 beqz t0, check_loop # print "\nthe answer is wrong!!!" la a0, strError li a7, 4 ecall check_loop: # next data, ans addi s6, s6, 4 addi s7, s7, 4 addi s8, s8, -1 bnez s8, print_numbers exit: # Exit the program li a7, 10 # System call code for exiting the program ecall # Make the exit system call ret fp16_to_fp32: # a0 is the input parameter h # t0 is w # t1 is sign # t2 is nosign # const uint32_t w = (uint32_t) h << 16; slli t0, a0, 16 # const uint32_t sign = w & UINT32_C(0x80000000); li t1, 0x80000000 and t1, t0, t1 # const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); li t2, 0x7FFFFFFF and t2, t0, t2 my_clz: # t3 is the input parameter x # t4 is c # t5 is r mv t3, t2 # int r = 0, c; li t5, 0 # c = (x < 0x00010000) << 4; li t4 0x00010000 sltu t4, t3, t4 slli t4, t4, 4 # r += c; add t5, t5, t4 # x <<= c sll t3, t3, t4 # c = (x < 0x01000000) << 3; li t4 0x01000000 sltu t4, t3, t4 slli t4, t4, 3 # r += c; add t5, t5, t4 # x <<= c sll t3, t3, t4 # c = (x < 0x10000000) << 2; li t4 0x10000000 sltu t4, t3, t4 slli t4, t4, 2 # r += c; add t5, t5, t4 # x <<= c sll t3, t3, t4 # c = (x < 0x40000000) << 1; li t4 0x40000000 sltu t4, t3, t4 slli t4, t4, 1 # r += c; add t5, t5, t4 # x <<= c sll t3, t3, t4 # c = x < 0x80000000; li t4 0x80000000 sltu t4, t3, t4 # r += c; add t5, t5, t4 # x <<= c sll t3, t3, t4 # r += x == 0; seqz t3, t3 add t5, t5, t3 back_to_fp16_to_fp32: # t0, t3, t4 can be reused # t5 is renorm_shift # renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; li t0, 5 bgtu t5, t0, renorm_shift_minus_five j renorm_shift_is_zero renorm_shift_minus_five: sub t5, t5, t0 j continue_fp16_to_fp32 renorm_shift_is_zero: li t5, 0 continue_fp16_to_fp32: # t3 is inf_nan_mask # t4 is zero_mask # t0 is temp1 # t6 is temp2 # const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); li t0, 0x04000000 add t3, t2, t0 srai t3, t3, 8 li t0, 0x7F800000 and t3, t3, t0 # const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; addi t4, t2, -1 srai t4, t4, 31 # return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); # temp1 = (nonsign << renorm_shift >> 3) sll t0, t2, t5 srli t0, t0, 3 # temp2 = ((0x70 - renorm_shift) << 23) li t6, 0x70 sub t6, t6, t5 slli t6, t6, 23 # temp1 = temp1 + temp2 add t0, t0, t6 # temp1 = temp1 | inf_nan_mask or t0, t0, t3 # temp1 = temp1 & ~zero_mask not t4, t4 and t0, t0, t4 # temp1 = sign | temp1 or t0, t1, t0 # return temp1 mv a0, t0 ret

fp16_to_fp32(0x0618) is : 0x38c30000
fp16_to_fp32(0x000f) is : 0x35700000
fp16_to_fp32(0x0000) is : 0x0000
fp16_to_fp32(0x8000) is : 0x80000000
fp16_to_fp32(0x7c00) is : 0x7f800000
fp16_to_fp32(0xfc00) is : 0xff800000
fp16_to_fp32(0x7cff) is : 0x7f9fe000

LeetCode Problem: Palindrome Number

Leetcode 9. Palindrome Number

Description

Given an integer x, return true if x is a palindrome, and false otherwise.

The original task was to check if a number is a palindrome in base 10. To better fit the use of clz, I made a slight modification to the task, changing it to check if a number is a palindrome in base 2.

Example 1 :

Input: x = 101
Output: 1
Explanation: 101 reads as 101 from left to right and from right to left.

Example 2 :

Input: x = 11011
Output: 1
Explanation: 11011 reads as 11011 from left to right and from right to left.

Example 3 :

Input: x = 10
Output: 0
Explanation: Reads 01 from right to left. Therefore it is not a palindrome.

C code

#include <stdio.h> #include <stdint.h> uint32_t my_clz(uint32_t x) { int r = 0, c; c = (x < 0x00010000) << 4; //If the first 16 bits are all 0, c = 16 | If the first 16 bits are not all 0, c = 0 r += c; // r += 16 | r += 0 x <<= c; // x <<= 16 | x <<= 0 c = (x < 0x01000000) << 3; //If the first 8 bits are all 0, c = 8 | If the first 16 bits are not all 0, c = 0 r += c; // r += 8 | r += 0 x <<= c; // x <<= 8 | x <<= 0 c = (x < 0x10000000) << 2; //If the first 4 bits are all 0, c = 4 | If the first 16 bits are not all 0, c = 0 r += c; // r += 4 | r += 0 x <<= c; // x <<= 4 | x <<= 0 c = (x < 0x40000000) << 1; //If the first 2 bits are all 0, c = 2 | If the first 16 bits are not all 0, c = 0 r += c; // r += 2 | r += 0 x <<= c; // x <<= 2 | x <<= 0 c = x < 0x80000000; //If the first 1 bits are all 0, c = 1 | If the first 16 bits are not all 0, c = 0 r += c; // r += 1 | r += 0 x <<= c; // x <<= 1 | x <<= 0 r += x == 0; return r; } int checkPalindrome(uint32_t N) { // Stores the reverse of N uint32_t rev = 0; uint32_t num = 32 - my_clz(N); uint32_t move = num >> 1; uint32_t additional_move = num & 1; for(int i = 0; i < move; i++) { rev = (rev << 1) + (N & 1); N = N >> 1; } N = N >> additional_move; return N == rev; } int main() { printf("checkPalindrome(0x0F00000F) is : %d\n", checkPalindrome(0x0F00000F)); printf("checkPalindrome(0x0E000001) is : %d\n", checkPalindrome(0x0E000001)); printf("checkPalindrome(0x0000001B) is : %d\n", checkPalindrome(0x0000001B)); printf("checkPalindrome(0x00000002) is : %d\n", checkPalindrome(0x00000002)); printf("checkPalindrome(0x00000001) is : %d\n", checkPalindrome(0x00000001)); }

checkPalindrome(0x0F00000F) is : 1
checkPalindrome(0x0E000001) is : 0
checkPalindrome(0x0000001B) is : 1
checkPalindrome(0x00000002) is : 0
checkPalindrome(0x00000001) is : 1

Assembly code

.data datas: .word 0x0F00000F, 0x0E000001, 0x0000001B, 0x00000002, 0x00000001 ans: .word 0x1, 0x0, 0x1, 0x0, 0x1 str1: .string "\ncheckPalindrome(" str2: .string ") is : " strError: .string "\nthe answer is wrong!!!" .text main: # Load datas reference la s6, datas # Load ans reference la s7, ans # Load the loop count li s8, 5 print_numbers: # print "\ncheckPalindrome(" la a0, str1 li a7, 4 ecall # print 'data' lw a0, 0(s6) li, a7, 34 ecall # print ") is : " la a0, str2 li a7, 4 ecall # print checkPalindrome(data) lw a0, 0(s6) jal ra, checkPalindrome li, a7, 1 ecall validation: # check if fp16_to_fp32(data) == ans lw t0, 0(s7) sub t0, t0, a0 beqz t0, check_loop # print "\nthe answer is wrong!!!" la a0, strError li a7, 4 ecall check_loop: # next data, ans addi s6, s6, 4 addi s7, s7, 4 addi s8, s8, -1 bnez s8, print_numbers exit: # Exit the program li a7, 10 # System call code for exiting the program ecall # Make the exit system call ret checkPalindrome: # t0 is the input parameter N mv t0, a0 my_clz: # t0 is the input parameter x # t1 is c # t2 is r # int r = 0, c; li t2, 0 # c = (x < 0x00010000) << 4; li t1 0x00010000 sltu t1, t0, t1 slli t1, t1, 4 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x01000000) << 3; li t1 0x01000000 sltu t1, t0, t1 slli t1, t1, 3 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x10000000) << 2; li t1 0x10000000 sltu t1, t0, t1 slli t1, t1, 2 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = (x < 0x40000000) << 1; li t1 0x40000000 sltu t1, t0, t1 slli t1, t1, 1 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # c = x < 0x80000000; li t1 0x80000000 sltu t1, t0, t1 # r += c; add t2, t2, t1 # x <<= c sll t0, t0, t1 # r += x == 0; seqz t0, t0 add t2, t2, t0 back_to_checkPalindrome: # t0 is N # t1 is rev # t2 is num # t3 is move # t4 is additional_move # t5 is i mv t0, a0 # uint32_t rev = 0; li t1, 0 # uint32_t num = 32 - my_clz(N); li t6, 32 sub t2, t6, t2 # uint32_t move = num >> 1; srli t3, t2, 1 # uint32_t additional_move = num & 1; andi t4, t2, 1 li t5, 0 loop_start: # if (i >= move) break bge t5, t3, loop_end # rev = (rev << 1) + (N & 1); slli t1, t1, 1 andi t6, t0, 1 add t1, t1, t6 # N = N >> 1; srli t0, t0, 1 # i++ addi t5, t5, 1 # jump to loop start j loop_start loop_end: # N = N >> additional_move; srl t0, t0, t4 # return N == rev; sub t6, t0, t1 sltiu a0, t6, 1 ret

checkPalindrome(0xf00000f) is : 1
checkPalindrome(0xe000001) is : 0
checkPalindrome(0x001b) is : 1
checkPalindrome(0x0002) is : 0
checkPalindrome(0x0001) is : 1

Cycle comparison

The execution results of the RISC-V code generated from C code compiled with the RISC-V C compiler and written by myself:

Compiled by RISC-V C compiler Written by myself
image
image

5-stage pipelined processor

Description

In a 5-stage pipelined processor in RISC-V, instruction execution is divided into five main stages to improve efficiency by allowing multiple instructions to be processed simultaneously at different stages. These stages are:

  1. IF (Instruction Fetch): The current instruction is fetched from the instruction memory, and the program counter (PC) is incremented to prepare for fetching the next instruction.
  2. ID (Instruction Decode): The fetched instruction is decoded, and the source operands are read from the register. This stage may also compute immediate values and identify branch instructions.
  3. EX (Execute / ALU): In this stage, the Arithmetic Logic Unit (ALU) performs operations specified by the instruction, such as addition, subtraction, logical operations, or shifting. For load/store instructions, memory addresses are calculated.
  4. MEM (Memory Access): For load or store instructions, this stage accesses the data memory. Load instructions read data from memory, while store instructions write data to memory.
  5. WB (Write Back): The results from the EX or MEM stage are written back to the register, completing the instruction execution. This is the final step of processing the instruction.

image

Example

Here I take addi t0, t0, 12 as an example

1. IF (Instruction Fetch)

  • PC is 0x00000000
  • Fetch instructions 0x00c28293 from address 0x00000000 in Instruction memory
  • Add 4 to the value of PC
    image

2. ID (Instruction Decode)

immediate rs1 func rd opcode
000000001100 00101 000 00101 0010011
  • Decode the instruction 0x00c28293
  • Get destination register 0x05
  • Get the value 0x00000000 from source register 0x05
  • Get the immediate value 0x0000000c
    image

3. EX (Execute / ALU)

  • ADD 0x00000000 and 0x0000000c
  • Get the result 0x0000000c
    image

4. MEM (Memory Access)

  • I-type format does not need to access Data memory
    image

5. WB (Write Back)

  • Write the result 0x0000000c back to the destination register 0x05
    image

Reference

Quiz1 of Computer Architecture (2024 Fall)
Branchless count-leading-zeros on 32-bit RISC-V without Zbb extension