Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline :::danger Check the sample pages for consistent naming scheme. ::: contributed by < [`imNCNUwilliam`](https://github.com/imNCNUwilliam) > :::danger Choose one problem (A, B, or C) from [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol), translate it from C code to a complete RISC-V assembly program, and include the relevant test data. ::: :::danger Show the full C source code corresponding to the original problem set. ::: ## Floating-point Formats Conversion > [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) Problem B :::danger Avoid unnecessary bullets! ::: **Motivation**: Some IoT devices used for specific purposes, such as temperature monitoring, pressure sensing, etc., are often equipped with lower-powerful processors having the advantages of lower cost and energy consumption. In the meanwhile, sensors with limitations on measurement precision may cause the collected data do not efficiently use the corresponding data type. Moreover, the collected data transmission often relied on lower-bandwidth wireless network. **Floating Points**: [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) and some companies specify several floating-point format to represent numbers in different ranges and precisions. [Single-Precision floating-point format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) occupies 4-byte for common `float` data type. Besides, [binary16](https://en.wikipedia.org/wiki/Half-precision_floating-point_format), and [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), both occupy 2-byte, and are proper for some scenario, such as Graphic Processing and Machine Learning. Hence, as the source data come in different formats from which the computation needs, some tools are needed for these floating-point formats converisons. **Application**: I think [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) is a proper data type for collecting sensor data for the above mentioned IoT devices, and expect the reduced floating-point format can speed up the computation or increase the amount of data tranmission per packet. I present its conversion with [Single-Precision floating-point format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format). :::danger Avoid overusing `==`, as it is only applicable in specific situations. ::: ### Implementation You can find the source code [here](https://github.com/imNCNUwilliam/arch2024/blob/main/bfloat/). Any comments are appreciated. #### C code C Language standard library (`/usr/include/x86_64-linux-gnu/ieee754.h` in my machine) uses `union` for `ieee754_float` declaration, which include `float` data type, and two `struct` composed of `unsigned int`. This way, the encoded `float` can be decoded as `unsigned int`, and thus allowed for operating these bits with integer operations. I attempt to implement a simplified [FP32 converter](https://www.h-schmidt.net/FloatConverter/IEEE754.html) to examine my understanding, and then complete the C code for `float to bfloat16` converter. ```c #include <stdio.h> typedef unsigned int uint32_t; typedef struct bfloat16 { unsigned short bits; } bf16_t; static inline float bf16_to_fp32(bf16_t h) { union { float f; uint32_t i; } u = {.i = (uint32_t)h.bits << 16}; return u.f; } static inline bf16_t fp32_to_bf16(float s) { bf16_t h; union { float f; uint32_t i; } u = {.f = s}; if ((u.i & 0x7fffffff) > 0x7f800000) { /* NaN */ h.bits = (u.i >> 16) | 64; /* force to quiet */ /* Why bitwise OR 64? Why omitting the content of tailing 6 bits? */ return h; } h.bits = (u.i + (0x7fff + ((u.i >> 0x10) & 1))) >> 0x10; /* Round the mantissa based on not only the 16th bit, * but also the 15th bit? */ return h; } int main(void) { float f[3] = {-0.5, -0.25, -0.125}; bf16_t bf[3]; int j; printf("Conversion Test ... \n"); printf("float to bfloat16 ... \n"); for (j = 0; j < 3; j++) { union { float f; uint32_t i; } u = {.f = f[j]}; bf[j] = fp32_to_bf16(f[j]); printf("%x %x\n", u.i, bf[j].bits); } printf("bfloat16 to float ... \n"); for (j = 0; j < 3; j++) { union { float f; uint32_t i; } u = {.f = bf16_to_fp32(bf[j])}; printf("%x %x\n", bf[j].bits, u.i); } return 0; } ``` **Confused**: The C code is based on [Quiz 1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) solution for Problem B. However, there are some portions I am still confused. <s>Could anyone help explain the meaning of line 25 (bitwise OR), and line 29 (Round consideration). Thanks a lot.</s> :::danger Do it yourself. ::: #### Assembly code Unlike C Language, the memory word or byte are not regarded with certain data type in Assembly. Namely, it is allowed to perform any operations on each memory word or byte regardless of what they are used for in previous execution flows. ```clike .data # initialize the data section with the number to be converted # default convertion from float to bfloat16 (direction == 0) nums: .word 0x3f800000, 0x3fa00000, 0x40800000 nums_bf: .word 0x3f80, 0x3fa0, 0x4080 numsSize: .word 3 direction: .word 0 mask1: .word 0x7fffffff mask2: .word 0x7f800001 large_num: .word 0x7fff .text # Begin the main code in the text section lw s0, direction # load the given direction into register s0 la a0, nums # load the nums array address to register a0 lw a1, numsSize # load the array size to register a1 reset: addi a2, x0, 0 # keep the element index in register a2 beq s0, x0, loop # determine whether to switch the conversion direction la a0, nums_bf # load the nums_bf array address to register a0 loop: lw t1, 0(a0) # load the i-th array element into register t1 # determine the conversion type bne s0, x0, bf16_to_fp32 fp32_to_bf16: lw t2, mask1 # load mask1 into register t2 lw t3, mask2 # load mask2 into register t3 and t4, t1, t2 # t4 = t1 & t2 srli t5, t1, 0x10 # extract the first 16 bits bge t4, t3, quiet # jump to quiet for special case NaN # Rounding related andi t5, t5, 1 # examine whether the 16th bit is odd lw t4 large_num add t5, t5, t4 add t5, t5, t1 # convert to 2-byte srli t5 t5 0x10 j out quiet: ori t5, t5, 64 j out bf16_to_fp32: # convert to 4-byte slli t1 t1 0x10 out: # result hold in register t5 (fp32_to_bf16) or t1 (bf16_to_fp32) addi a2, a2, 1 # increase the loop index addi a0, a0, 4 # get the address of next array element blt a2, a1, loop # determine whether to continue the next conversion bne s0, x0, end # determine whether to end the conversion tests # switch the conversion direction and do the conversion test addi s0, s0, 1 j reset end: nop # No operation, acts as a placeholder ``` ### Discussion Although `float` data type can be converted into short-sized `bfloat16`, the conversion itself is the overhead. In my opinion, if the raw data formats were `float`, the main purposes relied from the floating-point formats conversion are the further massive computations or transmissions. Engineers will compare the conversion time with the computation or transmission time, and decide whether to adopt the solution. Here, I would like to observe the code size comparison between C and Assembly only. I plan to [obtain the Assembly code from C complier automated compilation, and compile both Assembly codes](https://stackoverflow.com/questions/7190050/how-do-i-compile-the-asm-generated-by-gcc) for observation. Lake of the hardware, I will try to [run 32-bit RISC-V Linux on QEMU](https://risc-v-getting-started-guide.readthedocs.io/en/latest/linux-qemu.html) for the observation. ## Pascal's Triangle [Pascal's triangle](https://en.wikipedia.org/wiki/Pascal%27s_triangle) has the pattern that the k-th element of `row i` is the sum of the (k-1)-th element and the k-th element of `row i-1`. Following is the Pascal's Triangle with six rows. ``` row 0: 1 row 1: 1 1 row 2: 1 2 1 row 3: 1 3 3 1 row 4: 1 4 6 4 1 row 5: 1 5 10 10 5 1 ``` It can also be used in [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution). Here, recall the experiments of tossing a coin `n` times. If the random variable `X` is defined as the event with outcomes having `i` heads (`0 <= i <= n`) during the experiments, then `X` will follow `B(n, 0.5)`. For example, there are five events `X = 0, 1, 2, 3, and 4` for tossing a coin four times, which has 16 (2 ^ 4) outcomes. In this scenario, every outcome can be encoded as one unique bitstring with length == 4. ``` coin head encoded as "1" / coil tail encoded as "0" ___________________________________________________________________ event | outcomes | event counter X = 0 | 0000 | 1 X = 1 | 0001, 0010, 0100, 1000 | 4 X = 2 | 0011, 0110, 1010, 0101, 1001, 1100 | 6 X = 3 | 0111, 1110, 1101, 1011 | 4 X = 4 | 1111 | 1 ___________________________________________________________________ which corresponds to row 4 of the above Pascal's Triangle ``` ### Implementation For tossing a coin `n` times experiment, I use the `n-bit` unsigned integer data type to represent each possible outcome. Besides, I prepare `n + 1` counters to record the number of outcomes `0 to n` heads. As the `n-bit` bitstring enumerates from all-zero to all-one, using the population count function with the `n-bit` bitstring yields which event counter it belongs to. You can find the source code [here](https://github.com/imNCNUwilliam/arch2024/blob/main/pascal_triangle/). Any comments are appreciated. #### C code Brute force method to one-time enumerate all outcomes, and update their corresponding event counters. ```c #include <stdio.h> #include <stdlib.h> unsigned char my_popcount(unsigned int val) { unsigned int count = 0; while (val) { if (val & 0x80000000) count++; val <<= 1; } return count; } void pascal_triangle(int n) { unsigned int outcome = 0; unsigned int max = (1 << n) - 1; unsigned int *counter; int i; if (n == 0) { printf("1\n"); return; } if (n > 31) { printf("Support up to 31 only!\n"); exit(-1); } printf("enumerate from %x(%d) to %x(%d)\n", outcome, outcome, max, max); // Prepare outcome counters counter = (unsigned int *)malloc(sizeof(unsigned int) * (n + 1)); if (counter == NULL) { fprintf(stderr, "No memory!\n"); exit(-2); } for (i = 0; i <= n; i++) counter[i] = 0; // start enumeration and update the corresponding outcome counters while (outcome <= max) { counter[my_popcount(outcome)]++; outcome++; // printf("DEBUG: %d %d %d\n", \ outcome - 1, \ my_popcount(outcome - 1), \ counter[my_popcount(outcome - 1)]); } for (i = 0; i <= n; i++) printf("%d ", counter[i]); free(counter); printf("\n"); } int main(void) { int i; for (i = 0; i < 15; i++) pascal_triangle(i); return 0; } ``` #### Assembly code Limited registers cannot hold all event counters. Thus, I implement event counters in memory, and it cannot avoid incurring massive memory accesses, which is less efficient. Additionally, the mapping between event counters to registers is another problem, so I have the compromise to keep event counters in memory. ```clike= .data # initilize the data section with the specied row number of Pascal's Triangle # the row number `i` starts from 0, and there are `i + 1` elements in row `i` row_id: .word 5 counters: .word 0, 0, 0, 0, 0, 0 zero: .word 0 .text # Begin the main code in the text section lw t4, row_id # load the given row id into register t4 la a0, counters # load the counters array address to register a0 # calculate the upper limits for enumeration addi t5, x0, 1 # keep the upper limits for enumeration with register t5 sll t5, t5, t4 addi t6, x0, -1 # enumerate from 0 to `(1 << row_id) - 1` with register t6 lw s3, zero # hold the constant value `0` with register s3 loop: addi t6, t6, 1 beq t6, t5, end # determine whether to end the enumeration addi s0, t6, 0 # local variable stored with register s0 for pop_count addi s1, x0, 0 # local variable stored with register s1 for one-time counter pop_count: beq s0, x0, out # determine whether to finish the pop_count andi s4, s0, 1 beq s4, s3, skip_add # determine whether to increase the one-time counter addi s1, s1, 1 # increase counter by 1 skip_add: srli s0, s0, 1 # right shift one position for pop_count checking the next bit j pop_count out: slli s1, s1, 2 # a word occupies 4 bytes in riscv32 add a1, a0, s1 # load counter `i` address into register a1 lw s2, 0(a1) # load content of counter `i` to register s2 addi s2, s2, 1 # increase counter `i` by 1 sw s2, 0(a1) # store register s2 to counter `i` j loop end: nop # No operation, acts as a placeholder ``` :::danger You should automate the testing procedures as much as possible, meaning there's no need to manually check each program output. Instead, implement internal validation to handle the result checking. ::: ### Analysis Each assembly instruction modifies some states of components in computer architecture. For observing the operations of computer architecture, we use [Ripes](https://github.com/mortbopet/Ripes) simulator to know how RISC-V instructions works with components of RISC-V ISA. #### Single-cycle processor I use single-cycle processor to observe operations of RISC-V ISA as running my RV32I executable. From [Wikipedia, single cycle processor](https://en.wikipedia.org/wiki/Single_cycle_processor) carries out one instruction in a single clock cycle, and thus the states change of RISC-V ISA per clock cycle are done by only one RISC-V instruction. ![arch2024-homework1_Ripes_processor](https://hackmd.io/_uploads/S1UlKIAJ1x.png) #### RISC-V executable My 43-line `Source code` are composed of 23-line RV32I instruction in the .text section, and three lines in the .data section. Thus, for the .text section, each RV32I instruction will be encoded as one 4-byte memory word, so it should occupy 23 * 4 = 92 (0x5c) bytes in `Executable code`. However, it occupies 0x68 (104) bytes, and the excessive 12 bytes come from the `auipc` instruction used for three `lw`, `la` instructions. ``` 0: 10000e97 auipc x29 0x10000 4: 000eae83 lw x29 0 x29 8: 10000517 auipc x10 0x10000 c: ffc50513 addi x10 x10 -4 10: 00100f13 addi x30 x0 1 14: 01df1f33 sll x30 x30 x29 18: fff00f93 addi x31 x0 -1 1c: 10000997 auipc x19 0x10000 20: 0009a983 lw x19 0 x19 00000024 <loop>: 24: 001f8f93 addi x31 x31 1 28: 03ef8e63 beq x31 x30 60 <end> 2c: 000f8413 addi x8 x31 0 30: 00000493 addi x9 x0 0 00000034 <pop_count>: 34: 00040c63 beq x8 x0 24 <out> 38: 00147a13 andi x20 x8 1 3c: 013a0463 beq x20 x19 8 <skip_add> 40: 00148493 addi x9 x9 1 00000044 <skip_add>: 44: 00145413 srli x8 x8 1 48: fedff06f jal x0 -20 <pop_count> 0000004c <out>: 4c: 00249493 slli x9 x9 2 50: 009505b3 add x11 x10 x9 54: 0005a903 lw x18 0 x11 58: 00190913 addi x18 x18 1 5c: 0125a023 sw x18 0 x11 60: fc5ff06f jal x0 -60 <loop> 00000064 <end>: 64: 00000013 addi x0 x0 0 ``` During simulation, 0x68 bytes of .text section are loaded into 0x00000000. ![arch2024-homework1_Ripes_text_section](https://hackmd.io/_uploads/SyaVKL0J1l.png) For the .data section, it occupies 32 (0x20) bytes for that I declare eight words. During simulation, 0x20 bytes of .data section are loaded into 0x10000000. ![arch2024-homework1_Ripes_data_section](https://hackmd.io/_uploads/ByqStIAkJx.png) #### Running the executable **Instruction Fetch (IF)**: After loading the executable into memory, the address of the first instruction of the executable is often set to the program counter (PC). In the simulation, it is the memory address 0x00000000. By default, the program counter will be increased by 4 after each clock cycle in 32-bit RISC-V ISA, unless some instructions update the program counter. This way, the next 4-byte word will be `fetched`. After `fetching`, each 4-byte word will be `decoded` as certain RISC-V instruction format for `execution`. During the `execution`, contents of some registers, program counter will be changed, and sometimes memory accesses are triggered. **Instruction Decode (ID)**: Take the 10th instruction loaded in memory address 0x00000024 as example, it is encoded as `0x001f8f93`. Following are its first time execution. * Instruction `0x001f8f93` is decoded to four parts: * `opcode` = `addi` which is 10-bit I-Type instruction * `Wr idx` = `0x1f` specifies 5-bit destination register index * `R1 idx` = `0x1f` specifies 5-bit source 1 register index * `R2 idx` = `0x01` originally specifies 5-bit source 2 register index, but I-Type instruction does not use source 2 register, so the 5 bits combined with remaining 7 bits specifies a 12-bit immediate value ![arch2024-homework1_Ripes_IFID](https://hackmd.io/_uploads/SyRLKI01Jx.png) **Instruction Execute (IE)**: