Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline Contributed by < `Huaxin` > The following is the confirmation checklist: :::success - [x] Choose one question (A, B, or C) from Quiz 1 and convert its C code into a complete RISC-V assembly program, including relevant test data. - [ ] Complete leetcode problem and test. ::: I follow [C programs with Ripes](https://github.com/mortbopet/Ripes/blob/master/docs/c_programming.md) to set the c compiler for Ripes as `riscv64-unknown-elf-gcc`. Then, I selected `Problem C` from [Quiz1 of Computer Architecture (2024 Fall)](https://hackmd.io/@sysprog/arch2024-quiz1-sol#Problem-C) for conversion. Our results are generated from a 5-stage RISC-V processor. The below is C implementation of `__builtin_clz`. The `__builtin_clz` function is a built-in feature provided by GCC that calculates the number of leading zeros in the binary form of an unsigned integer. Starting from the most significant bit (MSB), it counts how many consecutive zero bits appear before encountering the first 1. The function then returns this count as an integer. We convert c code into assembly via ripes and get the result after running. We set compiler arguments as `-O1 -g` and `-O2 -g`, respectly. **C code** ```c= #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } int main() { printf("The number of leading zero is %d\n", my_clz(16)); printf("The number of leading zero is %d\n", my_clz(1)); printf("The number of leading zero is %d\n", my_clz(255)); printf("The number of leading zero is %d\n", my_clz(0xFFFFFFFF)); return 0; } ``` **The result from gcc, using the optimization setting `-O1 -g` and `-O2 -g`** ![image](https://hackmd.io/_uploads/SykL29X1Jl.png) Following, we adere to [The RISC-V Instruction Set Manual](https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf) to complete the risc-v instructions of `my_clz` function. **Assembly code (version 1)** ```assembly= .data input1: .word 16 input2: .word 1 input3: .word 255 input4: .word 0xFFFFFFFF string1: .string "The number of leading zero is " string2: .string "\n" .text main: lw a1 input1 # test 16 jal ra my_clz jal ra print lw a1 input2 # test 1 jal ra my_clz jal ra print lw a1 input3 # test 255 jal ra my_clz jal ra print lw a1 input4 jal ra my_clz jal ra print # test 0xFFFFFFFF li a7 10 # leave the program ecall my_clz: li t0 31 # i = t0 li t1 0 # count = 0 my_clz_loop: li t2 1 # t2 = 1 sll t2 t2 t0 # t2 = 1<<i and t3 a1 t2 # t3 = (x&1<<i) bnez t3 my_clz_End # if(t3!=0) then, go to my_clz_End addi t1 t1 1 # count++ addi t0 t0 -1 # t0 -= 1 bltz t0 my_clz_End j my_clz_loop my_clz_End: mv a2 t1 ret print: la a0 string1 # print string1 li a7 4 ecall mv a0 a2 # print integer li a7 1 ecall la a0, string2 # next line li a7 4 ecall ret ``` **Assembly code (version 2)** I observed that it is not necessary to reset **t2** to 1 in every loop iteration. Therefore, we moved the `li t1 1` instruction from `my_clz_loop` to `my_clz`. ```diff= .data input1: .word 16 input2: .word 1 input3: .word 255 input4: .word 0xFFFFFFFF string1: .string "The number of leading zero is " string2: .string "\n" .text main: lw a1 input1 # test 16 jal ra my_clz jal ra print lw a1 input2 # test 1 jal ra my_clz jal ra print lw a1 input3 # test 255 jal ra my_clz jal ra print lw a1 input4 jal ra my_clz jal ra print # test 0xFFFFFFFF li a7 10 # leave the program ecall my_clz: li t0 31 # i = t0 li t1 0 # count = 0 + li t2 1 # t2 = 1 my_clz_loop: - li t2 1 # t2 = 1 - sll t2 t2 t0 # t2 = 1<<i + sll t3 t2 t0 # t2 = 1<<i and t3 a1 t3 # t3 = (x&1<<i) bnez t3 my_clz_End # if(t3!=0) then, go to my_clz_End addi t1 t1 1 # count++ addi t0 t0 -1 # t0 -= 1 bgez t0 my_clz_loop my_clz_End: mv a2 t1 ret print: la a0 string1 # print string1 li a7 4 ecall mv a0 a2 # print integer li a7 1 ecall la a0, string2 # next line li a7 4 ecall ret ``` **Below are the tests for 16, 1, 255, and 0xFFFFFFFF in sequence.** Binary representation of 16 is `00000000 00000000 00000000 00010000` Binary representation of 1 is `00000000 00000000 00000000 00000001` Binary representation of 255 is `00000000 00000000 00000000 11111111` Binary representation of 0xFFFFFFFF is `11111111 11111111 11111111 11111111` **Results of version 1** In terms of **cycles**, we are **87.4%** faster compared to the results generated by **GCC**, while **Instrs.retired** has been reduced by **87.3%**. **Results of version 2** In terms of **cycles**, we are **89.5%** faster compared to the results generated by **GCC**, while **Instrs.retired** has been reduced by **90.1%**. **Difference between Version 1 and Version 2** **Version 2** is **2.1%** faster in cycles compared to **version 1**, while the number of retired instructions (**Instrs.retired**) in **version 2** has been reduced by **2.8%**. | | Cycles | Instrs.retired | CPI| IPC | | -------- | -------- | -------- | -------- | -------- | | GCC (-O1 -g)(-O2 -g) | 7826 | 5933 | 1.32 | 0.758 | | My code (version 1) | 988 | 754 | 1.31 | 0.763 | | My code (version 2) | 824 | 590 | 1.4 | 0.716 | Following, I will implement the [`fp16_to_fp32`](https://) function, which transforms a 16-bit floating-point number in IEEE half-precision format (bit representation) into a 32-bit floating-point number in IEEE single-precision format (bit representation). This implementation will utilize the previously mentioned `clz` function. Half precision is a binary floating-point format that uses 16 bits for representing numbers. It is designed for storing floating-point values in scenarios where high precision is not critical, such as in image processing and neural networks. The exponent in half precision floating-point format uses 5 bits. Compared to single-precision floating-point numbers, its advantage is that it requires only half the storage space and bandwidth (but at the cost of precision and numerical range). ![image](https://hackmd.io/_uploads/rydg0BYJ1l.png) Single-precision floating-point format, known as float32, is a computer digital format that typically occupies 32 bits in computer memory. The exponent field is an 8-bit unsigned integer ranging from 0 to 255. ![image](https://hackmd.io/_uploads/Hk6QABtkyx.png) [Source1](https://zh.wikipedia.org/zh-tw/%E5%96%AE%E7%B2%BE%E5%BA%A6%E6%B5%AE%E9%BB%9E%E6%95%B8), [Source2](https://zh.wikipedia.org/zh-tw/%E5%8D%8A%E7%B2%BE%E5%BA%A6%E6%B5%AE%E7%82%B9%E6%95%B0) ```c static inline int my_clz(uint32_t x) ... static inline uint32_t fp16_to_fp32(uint16_t h) { /* * Extends the 16-bit half-precision floating-point number to 32 bits * by shifting it to the upper half of a 32-bit word: * +---+-----+------------+-------------------+ * | S |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 31 26-30 16-25 0-15 * * S - sign bit, E - exponent bits, M - mantissa bits, 0 - zero bits. */ const uint32_t w = (uint32_t) h << 16; /* * Isolates the sign bit from the input number, placing it in the most * significant bit of a 32-bit word: * * +---+----------------------------------+ * | S |0000000 00000000 00000000 00000000| * +---+----------------------------------+ * Bits 31 0-31 */ const uint32_t sign = w & UINT32_C(0x80000000); /* * Extracts the mantissa and exponent from the input number, placing * them in bits 0-30 of the 32-bit word: * * +---+-----+------------+-------------------+ * | 0 |EEEEE|MM MMMM MMMM|0000 0000 0000 0000| * +---+-----+------------+-------------------+ * Bits 30 27-31 17-26 0-16 */ const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); /* * The renorm_shift variable indicates how many bits the mantissa * needs to be shifted to normalize the half-precision number. * For normalized numbers, renorm_shift will be 0. For denormalized * numbers, renorm_shift will be greater than 0. Shifting a * denormalized number will move the mantissa into the exponent, * normalizing it. */ uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; /* * If the half-precision number has an exponent of 15, adding a * specific value will cause overflow into bit 31, which converts * the upper 9 bits into ones. Thus: * inf_nan_mask == * 0x7F800000 if the half-precision number is * NaN or infinity (exponent of 15) * 0x00000000 otherwise */ const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); /* * If nonsign equals 0, subtracting 1 will cause overflow, setting * bit 31 to 1. Otherwise, bit 31 will be 0. Shifting this result * propagates bit 31 across all bits in zero_mask. Thus: * zero_mask == * 0xFFFFFFFF if the half-precision number is * zero (+0.0h or -0.0h) * 0x00000000 otherwise */ const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; /* * 1. Shifts nonsign left by renorm_shift to normalize it (for denormal * inputs). * 2. Shifts nonsign right by 3, adjusting the exponent to fit in the * 8-bit exponent field and moving the mantissa into the correct * position within the 23-bit mantissa field of the single-precision * format. * 3. Adds 0x70 to the exponent to account for the difference in bias * between half-precision and single-precision. * 4. Subtracts renorm_shift from the exponent to account for any * renormalization that occurred. * 5. ORs with inf_nan_mask to set the exponent to 0xFF if the input * was NaN or infinity. * 6. ANDs with the inverted zero_mask to set the mantissa and exponent * to zero if the input was zero. * 7. Combines everything with the sign bit of the input number. */ return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } int main() { printf("The float32 of float16 is %p\n", fp16_to_fp32(0x3C00)); printf("The float32 of float16 is %p\n", fp16_to_fp32(0x4000)); printf("The float32 of float16 is %p\n", fp16_to_fp32(0x7C00)); return 0; } ``` we use the following test data: * The positive number 1, represented in half-precision floating-point as 0x3C00. * The positive number 2, represented in half-precision floating-point as 0x4000. * Infinity (inf), represented in half-precision floating-point as 0x7C00. Here are the expected single-precision floating-point values: | float16(hex) | float32(hex) | float32(dec) | | -------- | -------- | -------- | | 0x3C00 | 0x3f800000| 1065353216 | | 0x4000 | 0x40000000| 1073741824 | | 0x7C00 | 0x7f800000| 2139095040 | ```assembly .data input1: .word 0x3C00 # 1 input2: .word 0x4000 # 2 input3: .word 0x7C00 # +inf string1: .string "The float32 of float16 " string2: .string " is " string3: .string "\n" .text main: lw a1 input1 jal ra fp16_to_fp32 jal ra print lw a1 input2 jal ra fp16_to_fp32 jal ra print lw a1 input3 jal ra fp16_to_fp32 jal ra print li a7 10 # leave the program ecall fp16_to_fp32: mv t0 a1 slli t0 t0 16 # t0 is w li t1 0x80000000 and t1 t0 t1 # t1 is sign li t2 0x7FFFFFFF and t2 t0 t2 # t2 is nonsign my_clz: li t3 31 # i li t4 0 # count li t5 1 my_clz_loop: sll t6 t5 t3 # 1<<i and t6 t2 t6 # nonsign & (1<<i) bnez t6 my_clz_End # break addi t4 t4 1 # count++ addi t3 t3 -1 # --i bgez t3 my_clz_loop my_clz_End: li t3 5 # t4 is renorm_shift bgt t4 t3 fp16_to_fp32_2 li t4 0 # renorm_shift = 0 j fp16_to_fp32_3 fp16_to_fp32_2: addi t4 t4 -5 # renorm_shift - 5 j fp16_to_fp32_3 fp16_to_fp32_3: li t3 0x04000000 add t5 t2 t3 # nonsign + 0x04000000 srli t5 t5 8 li t3 0x7F800000 and t5 t5 t3 # t5 is inf_nan_mask addi t6 t2 -1 # nonsign - 1 srli t6 t6 31 # t6 is zero_mask li t3 0xFFFFFFFF xor t6 t6 t3 sll t2 t2 t4 srli t2 t2 3 # nonsign << renorm_shift >> 3 li t0 0x70 sub t4 t0 t4 slli t4 t4 23 # (0x70 - renorm_shift) << 23 add t2 t2 t4 or t2 t2 t5 # | inf_nan_mask and t2 t2 t6 # & ~zero_mask or t2 t1 t2 # sign | ret print: la a0 string1 # print string1 li a7 4 ecall mv a0 a1 # print li a7 1 ecall la a0 string2 li a7 4 ecall mv a0 t2 # print li a7 1 ecall la a0, string3 # next line li a7 4 ecall ret ``` **The result from gcc, using the optimization setting -O2 -g** ![image](https://hackmd.io/_uploads/r1iejUKJye.png) In terms of **cycles**, we are **95.6%** faster compared to the results generated by **GCC**, while **Instrs.retired** has been reduced by **95.8%**. | | Cycles | Instrs.retired | CPI| IPC | | -------- | -------- | -------- | -------- | -------- | | GCC | 6434 | 4925 | 1.31 | 0.765 | | My code | 283 | 203 | 1.39 | 0.717 |