Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [TingSheng](https://github.com/TingSheng10) > ## Quiz1 - Problem C ### Promblem Describe ### We implement `fp16_to_fp32` function (with `clz`), which converts a 16-bit floating-point number in IEEE half-precision format (bit representation) to a 32-bit floating-point number in IEEE single-precision format (bit representation). `clz` function counts the number of leading zero bits in the binary representation of an unsigned integer, starting from the most significant bit (MSB). It returns an integer representing how many zero bits precede the first 1 in the binary representation of the number. If the input is zero, the output is **"undefined"**. ### C code of `fp16_to_fp32` ### ```c static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } ``` ```c static inline uint32_t fp16_to_fp32(uint16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t nonsign = w & UINT32_C(0x7FFFFFFF); uint32_t renorm_shift = my_clz(nonsign); renorm_shift = renorm_shift > 5 ? renorm_shift - 5 : 0; const int32_t inf_nan_mask = ((int32_t)(nonsign + 0x04000000) >> 8) & INT32_C(0x7F800000); const int32_t zero_mask = (int32_t)(nonsign - 1) >> 31; return sign | ((((nonsign << renorm_shift >> 3) + ((0x70 - renorm_shift) << 23)) | inf_nan_mask) & ~zero_mask); } ``` ### Test case with `fp16_to_fp32` ### |number | number in fp16 (input) | output | | -------- | ---------------------- | -------- | |+0 | 0x0000 |0x00000000| |-0 | 0x8000 |0x80000000| |NaN | 0xffff |0xFFFFE000| |+inf | 0x7c00 |0x7F800000| |-inf | 0xf800 |0xFF800000| |normalized number| 0x007f |0x36FE0000| |denormalized number| 0x0300 |0x33800000| ### Assamble language of `fp16_to_fp32` ### ```c .data argument: .word 0x0000 endl: .string "\n" .text la t0, argument lw t0, 0(t0) # store x in t0 slli t0, t0, 16 addi s1, zero, 0x80000000 and s1, s1, t0 addi s2, zero, 0x7FFFFFFF and s2, s2, t0 addi t1, t1, 31 # initialized t1 to 31 addi t2, t2, 1 # t2 = 1 clz_loop: sll t4, t2, t1 # t4 record 1U shift i addi t1, t1, -1 # i = i - 1 and t4, t4, s1 # x & (1U << i) bnez t4, out_clz # break addi t5, t5, 1 # did not enter the if statement and incremented count bge t1, zero, clz_loop # return to the clz_loop out_clz: addi s3 ,s3 , 5 addi t5, t5, -5 bge t5, s3, out add t5, x0, x0 # t5 = renorm_shift out: addi t3, zero, 0x4000000 add t3, s2, t3 srli t3, t3, 8 addi s3, zero, 0x7F800000 and t3, t3, s3 # s3 = inf_nan_mask addi s4, s2, -1 srli s4, s4, 31 # s4 = zero mask sll s2, s2, t5 srli s2, s2, 3 sub t5, zero, t5 addi t5, t5, 0x70 slli t5, t5, 23 # 0x70 - renorm << 23 or t5, t5, s3 xori s4, s4, -1 and t5, t5, s4 add s2, t5, s2 or s1, s1, s2 addi a0, zero, s1 addi a7, zero, 34 ecall ``` ### Assamble language of clz(with loop) ### x is stored in `t0` clz(x) is store in `t4` ```c main: addi t1, zero, 31 # i in loop addi t2, zero, 1 # 1U addi t3, zero, 0 # clz(x) clz_loop: sll t3, t2, t1 # set mask t3 addi t1, t1, -1 # i = i-1 and t3, t0, t3 # using mask bnez t3, Done addi t4, t4, 1 bge t1, zero, clz_loop ``` ### Assamble language of clz(without loop) ### By leveraging the fact that every natural number can be represented in binary, we can implement the solution efficiently. x is stored in `t0` clz(x) is stored in `t1` ```c clz: addi t1, zero, 32 srli t2, t0, 16 beqz t2, shift_8 addi t1, t1, -16 addi t0, zero, t2 shift_8: srli t2, t0, 8 beqz t2, shift_4 addi t1, t1, -8 addi t0, zero, t2 shift_4: srli t2, t0, 4 beqz t2, shift_2 addi t1, t1, -4 addi t0, zero, t2 shift_2: srli t2, t0, 2 beqz t2, shift_1 addi t1, t1, -2 addi t0, zero, t2 shift_1: srli t2, t0, 1 beqz t2, finish addi t1, t1, -1 ``` ## Exponential-Golomb coding ## ### Introduction ### >[!Note] > **The example and code in this article are for Nature number** Golomb code is based on assumption that larger value of an integer will have less probability of occurrences. It has the advantage that they can be simply encoded and decoded algorithmically without the need for look-up tables. They also possess the property that, for any geometric distribution, it is possible to find an encoding that is an optimal prefix code. Heence, adaptive methods have thus been developed and exploited in standards such as the lossless form of JPEG – JPEG-LS. The characteristic of this method of representing integers over many other methods is that it can be **quite efficient at representing small numbers** without imposing a limit on the maximum number that can be represented. ### Approach to Golomb coding ### To encode any nonnegative integer x using the exp-Golomb code: :::info 1. Write down x+1 in binary 2. Count the bits written, subtract one, and write that number of starting zero bits preceding the previous bit string. ::: For example, * we encode 3 * x+1 = 3+1 = 4 * 4 in binary is '100' * '100' has 3 bits, and 3-1 = 2 * Hence add 2 zeros before '100', which is '00100' ### Why does this coding method enable compression? ### Consider the following sequence: ```0, 1, 4, 3``` After encoding, the results are as follows: ```1, 010, 00101, 000100``` When we concatenate these bit strings, we obtain: ```101000101000100``` Assuming our bit buffer operates on 8-bit units, we start with 32 bits before encoding. After encoding, the first 8-bit buffer becomes: ```10100010``` The remaining bits, 1000100, are placed in the next buffer: (filled with 1 additional zero to complete 8 bits) ```10001000``` In the end, the total number of bits used is only 16 bits, in contrast to the original 32 bits. Next, we will explore how to accelerate the decoding of Golomb codes using the clz. ### C code of `Golomb coding` ### :::danger Provide more discussions about the rationale. ::: This program prints the results after transformation.(include the test number) ```c #include <stdio.h> #include <stdint.h> static inline int my_clz(uint32_t x) { int count = 0; for (int i = 31; i >= 0; --i) { if (x & (1U << i)) break; count++; } return count; } void exp_golomb_encode(int value) { uint32_t code_num = value + 1; int leading_zeros = my_clz(code_num); int bit_length = 32 - leading_zeros; for (int i = 0; i < bit_length - 1; ++i) { printf("0"); } for (int i = bit_length - 1; i >= 0; --i) { printf("%d", (code_num >> i) & 1); } printf("\n"); } int main() { int values[] = {0, 1, 2, 3, 4, 5, 7, 15}; int n = sizeof(values) / sizeof(values[0]); printf("Exponential-Golomb(0) Encodings (Using CLZ):\n"); for (int i = 0; i < n; ++i) { printf("Value: %d -> Encoding: ", values[i]); exp_golomb_encode(values[i]); } return 0; } ``` ### Assemble language of `Golomb coding` (with loop) ### ```c .data argument1: .word 1 argument2: .word 5 argument3: .word 8 text1: .string "The Golomb code of " text2: .string " is " endl: .string "\n" .text main: la s0, argument1 # store data address to s0 li s3, 3 # amount of test data golomb_1: lw s1, 0(s0) # store test data to s1 addi s2, s1, 1 addi t1, zero, 31 # i for clz addi t2, zero, 1 # 1U for clz addi t4, zero, 0 # counter for clz clz_loop: sll t3, t2, t1 # mask addi t1, t1, -1 # i -= 1 and t3, s2, t3 # using mask bnez t3, golomb_2 addi t4, t4, 1 # counter += 1 bge t1, zero, clz_loop golomb_2: sub t4, zero, t4 addi t4, t4, 31 # (32 - clz(x)) - 1 mv t5, t4 print: la a0, text1 # print "The Golomb code of " li a7, 4 ecall mv a0, s1 # print test data li a7, 1 ecall la a0, text2 # print " is " li a7, 4 ecall li a7, 1 # for print_prefix mv a0, zero # for print "0" print_prefix: ecall addi t4, t4, -1 bgt t4, zero, print_prefix print_suffix: srl t3, s2, t5 # mask andi a0, t3, 1 # using mask ecall addi t5, t5, -1 bge t5, zero, print_suffix la a0, endl li a7, 4 ecall addi s0, s0, 4 # select next test data addi s3, s3, -1 # num of remainding test data bne s3, zero, golomb_1 li a7, 10 # termination ecall ``` ### Assemble language of `Golomb coding` (without loop) ### >[!Tip] This version includes only the main program. Please add the `.data` and `print:` sections for testing as needed. ```c main: la s0, argument1 # store data address to s0 li s3, 3 # amount of test data golomb_1: lw s1, 0(s0) # store test data to s1 addi s2, s1, 1 mv t0, s2 # t0 = test data for clz addi t4, zero, 31 # counter for clz clz: srli t2, t0, 16 beq t2, zero, shift_8 addi t4, t4, -16 mv t0, t2 shift_8: srli t2, t0, 8 beqz t2, shift_4 addi t4, t4, -8 mv t0, t2 shift_4: srli t2, t0, 4 beqz t2, shift_2 addi t4, t4, -4 mv t0, t2 shift_2: srli t2, t0, 2 beqz t2, shift_1 addi t4, t4, -2 mv t0, t2 shift_1: srli t2, t0, 1 beqz t2, golomb_2 addi t4, t4, -1 golomb_2: sub t4, zero, t4 addi t4, t4, 31 # (32 - clz(x)) - 1 mv t5, t4 ``` ## Result ## Both versions of the Golomb coding (in RISC-V ) have the same test data 1, 5, and 8, and the have the same output: ``` The Golomb code of 1 is 010 The Golomb code of 5 is 00110 The Golomb code of 8 is 0001001 Program exited with code: 0 ``` The following data is their performance in [Ripes](https://github.com/mortbopet/Ripes), using 5-stage pipelined processor. | Execution info | With loop | Without loop | | -------------- | --------- | ------------ | | Cycles | 945 | 286 | | Instrs. retired| 686 | 185 | | CPI | 1.38 | 1.55 | | IPC | 0.726 | 0.647 | | Clock Rate | 0 | 0 | The version without loop has significantly fewer total cycles than the version with loop. Both the instruction count and total cycles indicate a clear efficiency advantage, however, this comes at the cost of reduced code readability. ## Reference ## Exponential-Golomb coding - https://en.wikipedia.org/wiki/Exponential-Golomb_coding Golomb Code -https://www.sciencedirect.com/topics/engineering/golomb-code Ripes - https://github.com/mortbopet/Ripes