Assignment1: RISC-V Assembly and Instruction Pipeline

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < [曾遠哲](https://github.com/vx6122035) > ## Problem A in [Quiz1](https://hackmd.io/@sysprog/arch2024-quiz1-sol) ### Problem analysis Problem A focuses on the following points: 1. **Floating-Point Conversions**: Handling the intricacies of FP16 and FP32 formats, including bit manipulation, exponent bias adjustments, and managing special cases like zeros, infinities, and NaNs. 2. **Bit-Level Operations**: Requires precise bit masking, shifting, and logical operations, which are fundamental in low-level programming and assembly language. 3. **Conditional Logic**: Managing different scenarios based on the input's characteristics (e.g., normalized vs. denormalized numbers). Original C program #### bits to fp32 / fp32 to bits ```c= static inline float bits_to_fp32(uint32_t w) { union { uint32_t as_bits; float as_value; } fp32 = {.as_bits = w}; return fp32.as_value; } static inline uint32_t fp32_to_bits(float f) { union { float as_value; uint32_t as_bits; } fp32 = {.as_value = f}; return fp32.as_bits; } ``` It is challenging to complete the conversion between fp32 and bits is because that there are `no floating-point instructions` and `registers` in the RV32I. This assignment will use the following approaches to implement C to RISC-V assembly conversion 1. **Bit Manipulation**: We'll perform operations on the bit pattern to extract the floating-point components (sign, exponent, mantissa). 2. **Simulate Floating-Point Conversion**: We'll reconstruct the floating-point value using integer arithmetic, approximating its behavior. 3. **Testing with Loops and Branches**: We'll create test cases that include loops and conditional branches to validate our implementation. A 32-bit IEEE 754 floating-point number consists of: 1. **1-bit Sign (S):** Determines if the number is positive or negative. 2. **8-bit Exponent (E):** Biased exponent. 3. **23-bit Mantissa (M):** Fractional part (with an implicit leading 1 for normalized numbers). **Formula to Compute the Floating-Point Value:** For normalized numbers: $$ \begin{aligned} \text{Value} = (-1)^{S} \times 1.M \times 2^{(E - 127)} \end{aligned} $$ ### RISC-V Assembly Implementation (1) #### (bits_to_fp32/fp32_to_bits) 1. **Extract the Sign Bit:** - Shift the bit pattern right by 31 bits to isolate the sign bit. 2. **Extract the Exponent:** - Shift left by 1 to remove the sign bit, then right by 24 bits. 3. **Extract the Mantissa:** - Mask the lower 23 bits. Since it is not possible to perform floating-point arithmetic, this implementation will simulate the computation using integer operations. That is, the implementation will approximate the floating-point value by computing an integer representation. Initial version (naive) ```c # Function: bits_to_fp32 # Input: w (in a0) # Output: Approximated integer value of the float (in a0) # This function approximates the float value by manipulating the bit pattern. .data CONST_7FFFFF: .word 0x007FFFFF CONST_800000: .word 0x00800000 .text bits_to_fp32: # Extract sign bit srli t0, a0, 31 # t0 = Sign bit (0 or 1) # Extract exponent bits slli t1, a0, 1 # Remove sign bit srli t1, t1, 24 # t1 = Exponent bits (8 bits) # Extract mantissa bits la t3, CONST_7FFFFF # Load address of CONST_7FFFFF lw t3, 0(t3) # t3 = 0x007FFFFF and t2, a0, t3 # t2 = Mantissa bits (23 bits) # Bias exponent (E - 127) addi t1, t1, -127 # Adjust exponent # Approximate mantissa (simulate 1.M format) la t4, CONST_800000 # Load address of CONST_800000 lw t4, 0(t4) # t4 = 0x00800000 add t2, t2, t4 # t2 = t2 + 0x00800000 # Compute approximate value: mantissa * 2^(E - 127) # Since we cannot perform actual floating-point operations, we'll approximate using shifts sll t2, t2, t1 # Shift mantissa by (E - 127) # Apply sign bnez t0, negative # If sign bit is 1, branch to negative positive: # Result is positive mv a0, t2 # Move the result to a0 ret negative: # Result is negative sub a0, x0, t2 # a0 = -t2 ret ``` ```c # Function: fp32_to_bits # Input: Integer value approximating the float (in a0) # Output: Reconstructed bit pattern (in a0) # This function approximates the float bit pattern from an integer value. fp32_to_bits: # Check for zero input beq a0, x0, return_zero # If input is zero, return zero # Determine sign bit and absolute value slti t0, a0, 0 # t0 = 1 if a0 < 0 slli t0, t0, 31 # Shift sign bit to bit 31 bgez a0, skip_abs sub a0, x0, a0 # a0 = -a0 skip_abs: # Find position of the highest set bit (leading one) li t1, 0 # t1 = shift count mv t2, a0 # t2 = input value find_msb: srli t2, t2, 1 # Shift right by 1 addi t1, t1, 1 # Increment shift count bnez t2, find_msb # Loop until t2 == 0 # Calculate exponent li t3, 31 # Maximum bit position sub t1, t3, t1 # t1 = 31 - shift count addi t1, t1, 127 # Exponent = (31 - shift count) + 127 # Extract mantissa sll a0, a0, t1 # Normalize mantissa slli a0, a0, 1 # Adjust mantissa alignment srli a0, a0, 9 # Shift to fit into 23 bits # Assemble the bit pattern slli t1, t1, 23 # Shift exponent to bits [30:23] or a0, a0, t1 # Combine exponent and mantissa or a0, a0, t0 # Combine with sign bit ret return_zero: # Input is zero, return zero li a0, 0 ret ``` #### Fewer Instructions It is possible to use the bit manipulation intead of loading the large constants from the memory. ```c # Function: bits_to_fp32 # Input: w (in a0) # Output: Approximated integer value of the float (in a0) # This function approximates the float value by manipulating the bit pattern. .text bits_to_fp32: # Extract sign bit srli t0, a0, 31 # t0 = Sign bit (0 or 1) # Extract exponent bits srli t1, a0, 23 # t1 = Bits [31:23] (Sign bit + exponent) andi t1, t1, 0xFF # t1 = Exponent bits (8 bits) # Extract mantissa bits and add implicit leading 1 slli t2, a0, 9 # Shift left to remove sign and exponent srli t2, t2, 9 # Shift back to align mantissa bits li t3, 1 slli t3, t3, 23 # t3 = 1 << 23 (implicit leading 1) or t2, t2, t3 # t2 = mantissa | implicit leading 1 # Adjust exponent bias (E - 127) addi t1, t1, -127 # t1 = E - 127 # Handle exponent being negative or positive blt t1, zero_exponent # If exponent < 0, branch to zero_exponent # Positive exponent sll t2, t2, t1 # Shift mantissa left by (E - 127) j apply_sign zero_exponent: # Negative or zero exponent sub t1, x0, t1 # t1 = -(E - 127) srl t2, t2, t1 # Shift mantissa right by -(E - 127) apply_sign: # Apply sign bnez t0, negative # If sign bit is 1, number is negative # Positive number mv a0, t2 # Result in a0 ret negative: # Negative number sub a0, x0, t2 # a0 = -t2 ret ``` Since RISC-V uses little endian byte order by default, it is possible to neglect the MSB (most significant bit) searching component to reduce the instruction size. ```c # Function: fp32_to_bits # Input: Integer value approximating the float (in a0) # Output: Reconstructed bit pattern (in a0) # Note: This function provides a simplified reconstruction due to the limitations. fp32_to_bits: # Initialize registers mv t2, a0 # t2 = approximate value # Determine the sign bit slti t0, t2, 0 # t0 = 1 if t2 < 0 slli t0, t0, 31 # Shift sign bit to bit 31 # Take absolute value of t2 bgez t2, skip_abs sub t2, x0, t2 # t2 = -t2 skip_abs: # Approximate exponent and mantissa # Since accurate calculation is complex, we'll set exponent and mantissa to zero li t1, 0 # Exponent bits li t3, 0 # Mantissa bits # Construct the bit pattern slli t1, t1, 23 # Shift exponent to bits [30:23] or t1, t1, t3 # Combine exponent and mantissa or a0, t0, t1 # Combine sign bit with exponent and mantissa ret ``` ### code test ```c .data # Test data: array of 16-bit FP16 bit patterns bit_patterns: .half 0x3C00 # 1.0 .half 0xBC00 # -1.0 .half 0x0000 # 0.0 .half 0x8000 # -0.0 .half 0x7C00 # Infinity .half 0xFC00 # -Infinity .half 0x3555 # Approximately 0.33325 .half 0xC000 # -2.0 array_length: .word 8 # Number of elements in the array # Storage for results (FP32 bit patterns) results: .word 0, 0, 0, 0, 0, 0, 0, 0 # 8 words initialized to zero .text .globl _start _start: # Initialize pointers and counters la t0, bit_patterns # t0 points to the start of bit_patterns la t1, results # t1 points to the start of results lw t2, array_length # t2 = number of elements li t3, 0 # Loop counter i = 0 loop: bge t3, t2, end_loop # if i >= array_length, exit loop # Load the FP16 bit pattern into a0 slli t4, t3, 1 # t4 = i * 2 (offset in bytes for halfwords) add t5, t0, t4 # t5 = address of bit_patterns[i] lh a0, 0(t5) # a0 = bit_patterns[i] (load halfword) # Call bits_to_fp32 function jal ra, bits_to_fp32 # Result will be in a0 # Store the result (FP32 bit pattern) in the results array slli t4, t3, 2 # t4 = i * 4 (offset in bytes for words) add t6, t1, t4 # t6 = address of results[i] sw a0, 0(t6) # results[i] = a0 # Increment loop counter addi t3, t3, 1 # i++ # Jump back to the beginning of the loop j loop end_loop: # Program end: loop indefinitely j end_loop ######################################### # Function: bits_to_fp32 # Input: FP16 bit pattern (in a0) # Output: FP32 bit pattern (in a0) bits_to_fp32: # Extract sign bit srli t0, a0, 15 # t0 = Sign bit (0 or 1) slli t0, t0, 31 # Shift sign to bit 31 # Extract exponent bits srli t1, a0, 10 # t1 = Exponent bits (5 bits) andi t1, t1, 0x1F # Mask to get lower 5 bits # Extract mantissa bits andi t2, a0, 0x03FF # t2 = Mantissa bits (10 bits) # Adjust exponent li t3, 112 # Exponent bias difference (127 - 15) beq t1, zero, zero_or_subnormal beq t1, 0x1F, inf_or_nan # Normalized number add t1, t1, t3 # Adjust exponent slli t1, t1, 23 # Shift exponent to bits 23-30 slli t2, t2, 13 # Shift mantissa to bits 0-22 or t1, t1, t2 # Combine exponent and mantissa or a0, t0, t1 # Combine sign bit ret zero_or_subnormal: # Handle zero and subnormals (set to zero for simplicity) mv a0, t0 # Only the sign bit ret inf_or_nan: # Handle infinity and NaN li t1, 0xFF # Exponent = 255 slli t1, t1, 23 # Shift exponent to bits 23-30 slli t2, t2, 13 # Shift mantissa to bits 0-22 or t1, t1, t2 # Combine exponent and mantissa or a0, t0, t1 # Combine sign bit ret ``` #### fp16 to fp32 ```c= static inline float fp16_to_fp32(fp16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t two_w = w + w; const uint32_t exp_offset = UINT32_C(0xE0) << 23; const float exp_scale = 0x1.0p-112f; const float normalized_value = bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale; const uint32_t mask = UINT32_C(126) << 23; const float magic_bias = 0.5f; const float denormalized_value = bits_to_fp32((two_w >> 17) | mask) - magic_bias; const uint32_t denormalized_cutoff = UINT32_C(1) << 27; const uint32_t result = sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value)); return bits_to_fp32(result); } ``` ##### code test for fp16 to fp32 ```c= #include <stdio.h> #include <stdint.h> #include <stdbool.h> typedef uint16_t fp16_t; static inline float bits_to_fp32(uint32_t w) { union { uint32_t as_bits; float as_value; } fp32 = {.as_bits = w}; return fp32.as_value; } static inline uint32_t fp32_to_bits(float f) { union { float as_value; uint32_t as_bits; } fp32 = {.as_value = f}; return fp32.as_bits; } static inline float fp16_to_fp32(fp16_t h) { const uint32_t w = (uint32_t) h << 16; const uint32_t sign = w & UINT32_C(0x80000000); const uint32_t two_w = w + w; const uint32_t exp_offset = UINT32_C(0xE0) << 23; const float exp_scale = 0x1.0p-112f; const float normalized_value = bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale; const uint32_t mask = UINT32_C(126) << 23; const float magic_bias = 0.5f; const float denormalized_value = bits_to_fp32((two_w >> 17) | mask) - magic_bias; const uint32_t denormalized_cutoff = UINT32_C(1) << 27; const uint32_t result = sign | (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value) : fp32_to_bits(normalized_value)); return bits_to_fp32(result); } int main() { // Test data: array of 16-bit FP16 bit patterns fp16_t bit_patterns[] = { 0x3C00, // 1.0 0xBC00, // -1.0 0x0000, // 0.0 0x8000, // -0.0 0x7C00, // Infinity 0xFC00, // -Infinity 0x3555, // Approximately 0.33325 0xC000, // -2.0 0x7BFF, // Max normal positive number 0x0400, // Smallest positive normal number 0x03FF, // Largest subnormal number 0x0001, // Smallest positive subnormal number }; int array_length = sizeof(bit_patterns) / sizeof(bit_patterns[0]); // Process each bit pattern for (int i = 0; i < array_length; i++) { fp16_t h = bit_patterns[i]; float fp32_value = fp16_to_fp32(h); uint32_t fp32_bits = fp32_to_bits(fp32_value); // Print the input FP16 bit pattern and the resulting FP32 value and bit pattern printf("Input FP16 bit pattern: 0x%04X\n", h); printf("Output FP32 value: %f\n", fp32_value); printf("Output FP32 bit pattern: 0x%08X\n", fp32_bits); printf("-----------------------------\n"); } return 0; } ``` #### fp32 to fp16 ```c= static inline fp16_t fp32_to_fp16(float f) { const float scale_to_inf = 0x1.0p+112f; const float scale_to_zero = 0x1.0p-110f; float base = (fabsf(f) * scale_to_inf) * scale_to_zero; const uint32_t w = fp32_to_bits(f); const uint32_t shl1_w = w + w; const uint32_t sign = w & UINT32_C(0x80000000); uint32_t bias = shl1_w & UINT32_C(0xFF000000); if (bias < UINT32_C(0x71000000)) bias = UINT32_C(0x71000000); base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base; const uint32_t bits = fp32_to_bits(base); const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00); const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF); const uint32_t nonsign = exp_bits + mantissa_bits; return (sign >> 16) | (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign); } ``` ### RISC-V Assembly Implementation (2) #### (fp16_to_fp32/fp32_to_fp16) 1. Adjustments Due to RV32I Limitations: 1. **Floating-Point Constants**: It is not possible to represent `exp_scale` and `magic_bias` directly. Thus, this function approximates these values by omitting these scaling factors. 2. **Multiplication and Division**: Due to the absence of floating-point multiplication and division functions in the RV32I base instruction set (which are available in extensions like M, F, and D), we have adjusted our approach to use integer approximations. 2. **Approximation**: 1. The result will be an approximate conversion from FP16 to FP32. 2. This approximation may not be exact due to the limitations. #### fp16_to_fp32 (RISC-V assembly) ```c .data # Constants exp_offset: .word 0x03800000 # 0xE0 << 23 mask: .word 0x3F000000 # 126 << 23 denorm_cutoff: .word 0x08000000 # 1 << 27 .text .globl fp16_to_fp32 fp16_to_fp32: # Prologue addi sp, sp, -20 # Allocate stack space sw ra, 16(sp) # Save return address sw s0, 12(sp) # Save s0 sw s1, 8(sp) # Save s1 sw s2, 4(sp) # Save s2 sw s3, 0(sp) # Save s3 # Input: h in a0 slli s0, a0, 16 # s0 = w = h << 16 # s1 = sign = w & 0x80000000 li t0, 0x80000000 and s1, s0, t0 # s2 = two_w = w + w add s2, s0, s0 # temp = (two_w >> 4) + exp_offset srli t1, s2, 4 # t1 = two_w >> 4 la t2, exp_offset lw t2, 0(t2) # t2 = exp_offset add s3, t1, t2 # s3 = temp # normalized_value = bits_to_fp32(temp) mv a0, s3 jal ra, bits_to_fp32 # Result in a0 # Store normalized_value in s4 mv s4, a0 # temp = (two_w >> 17) | mask srli t1, s2, 17 # t1 = two_w >> 17 la t2, mask lw t2, 0(t2) # t2 = mask or s3, t1, t2 # s3 = temp # denormalized_value = bits_to_fp32(temp) mv a0, s3 jal ra, bits_to_fp32 # Result in a0 # Store denormalized_value in s5 mv s5, a0 # Compare two_w with denorm_cutoff la t0, denorm_cutoff lw t0, 0(t0) # t0 = denorm_cutoff blt s2, t0, use_denorm # Use normalized_value mv a0, s4 j assemble_result use_denorm: # Use denormalized_value mv a0, s5 assemble_result: # Combine sign with the selected value or a0, a0, s1 # a0 = sign | selected_value # Epilogue lw ra, 16(sp) lw s0, 12(sp) lw s1, 8(sp) lw s2, 4(sp) lw s3, 0(sp) addi sp, sp, 20 ret ``` 1. **Adjustments Due to RV32I Limitations**: 1. **Floating-Point Operations**: Since we cannot perform floating-point multiplication or addition, we approximate base using integer operations. 2. **Approximations**: We proceed with the integer values as-is, understanding that the result will be an approximation. 2. **Key Points**: 1. **Bias Adjustment**: We adjust bias if it is less than 0x71000000. 2. **Extracting Bits**: We extract the necessary bits for the exponent and mantissa. 3. **Final Assembly**: We assemble the FP16 value by combining the sign and the computed bits. #### fp32_to_fp16 ```c .data bias_threshold: .word 0x71000000 bias_offset: .word 0x07800000 max_exponent: .word 0xFF000000 nan_value: .word 0x7E00 .text .globl fp32_to_fp16 fp32_to_fp16: # Prologue addi sp, sp, -28 # Allocate stack space sw ra, 24(sp) # Save return address sw s0, 20(sp) # Save s0 sw s1, 16(sp) # Save s1 sw s2, 12(sp) # Save s2 sw s3, 8(sp) # Save s3 sw s4, 4(sp) # Save s4 sw s5, 0(sp) # Save s5 # Input: f in a0 jal ra, fp32_to_bits # a0 = w mv s0, a0 # s0 = w # s1 = shl1_w = w + w add s1, s0, s0 # s2 = sign = w & 0x80000000 li t0, 0x80000000 and s2, s0, t0 # s3 = bias = shl1_w & 0xFF000000 li t1, 0xFF000000 and s3, s1, t1 # Adjust bias if necessary la t2, bias_threshold lw t2, 0(t2) # t2 = 0x71000000 blt s3, t2, adjust_bias j compute_base adjust_bias: mv s3, t2 # s3 = bias = 0x71000000 compute_base: # s4 = (bias >> 1) + bias_offset srli t3, s3, 1 # t3 = bias >> 1 la t4, bias_offset lw t4, 0(t4) # t4 = 0x07800000 add s4, t3, t4 # s4 = temp # base = bits_to_fp32(s4) mv a0, s4 jal ra, bits_to_fp32 # a0 = base mv s5, a0 # s5 = base # Convert base back to bits jal ra, fp32_to_bits # a0 = bits mv s6, a0 # s6 = bits # Extract exponent and mantissa srli t5, s6, 13 # t5 = bits >> 13 andi t5, t5, 0x7C00 # t5 = exp_bits andi t6, s6, 0x0FFF # t6 = mantissa_bits # nonsign = exp_bits + mantissa_bits add t0, t5, t6 # t0 = nonsign # Check for NaN or Infinity la t1, max_exponent lw t1, 0(t1) # t1 = 0xFF000000 bgt s1, t1, return_nan # Assemble fp16 value srli s2, s2, 16 # s2 = sign >> 16 or a0, s2, t0 # a0 = sign | nonsign j fp32_to_fp16_end return_nan: # Return NaN value srli s2, s2, 16 # s2 = sign >> 16 la t2, nan_value lw t2, 0(t2) # t2 = 0x7E00 or a0, s2, t2 # a0 = sign | 0x7E00 fp32_to_fp16_end: # Epilogue lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret ``` ## Use-Case from Leetcode Relevance to Assignment: 1. Floating-Point Manipulation: 1. The problem requires calculating travel times using divisions and handling fractional hours, which inherently involves floating-point arithmetic. 2. It presents challenges related to precision, rounding, and efficient computation of floating-point operations. 2. Optimization Potential: 1. There is room to optimize the algorithm by reducing floating-point computations, using integer arithmetic where possible, or employing lower-precision floating-point formats. 2. It is a good chance to explore techniques like fixed-point arithmetic, branchless programming, or specialized hardware instructions to improve performance. ### [Leetcode 1870 Minimum Speed to Arrive on Time](https://leetcode.com/problems/minimum-speed-to-arrive-on-time/) You are given a floating-point number `hour`, representing the amount of time you have to reach the office. To commute to the office, you must take `n` trains in sequential order. You are also given an integer array `dist` of length `n`, where `dist[i]` describes the distance (in kilometers) of the i^th^ train ride. Each train can only depart at an integer hour, so you may need to wait in between each train ride. - For example, if the 1^st^ train ride takes `1.5` hours, you must wait for an additional `0.5` hours before you can depart on the 2^nd^ train ride at the `2` hour mark. Return the **minimum positive integer** speed **(in kilometers per hour)** that all the trains must travel at for you to reach the office on time, or `-1` if it is impossible to be on time. Tests are generated such that the answer will not exceed 10^7^ and `hour` will have **at most two digits after the decimal point**. ### Solution for Leetcode 1870 ```c= #include <stdbool.h> bool canReachOnTime(int dist[], int n, int speed, double hour) { double totalTime = 0.0; for (int i = 0; i < n; i++) { double time = (double)dist[i] / speed; if (i < n - 1) { // Using integer arithmetic to compute the ceiling totalTime += (dist[i] + speed - 1) / speed; } else { totalTime += time; } } return totalTime <= hour; } int minSpeedOnTime(int* dist, int distSize, double hour) { int left = 1, right = 10000000; int result = -1; while (left <= right) { int mid = left + (right - left) / 2; if (canReachOnTime(dist, distSize, mid, hour)) { result = mid; right = mid - 1; // Try to find a smaller speed } else { left = mid + 1; // Increase speed } } return result; } ``` RISC-V implementation (RV32I) Initial version (Naive) ```c .data # Input data dist: .word 1, 3, 2 # Example distances distSize: .word 3 hour: .word 270 # hour * 100 (e.g., 2.7 * 100 = 270) # Output result result: .word -1 # Constants MAX_SPEED: .word 10000000 .text .globl _start _start: # Load distSize lw t0, distSize # Initialize left = 1, right = MAX_SPEED li t1, 1 # left lw t2, MAX_SPEED # right li t3, -1 # result binary_search_loop: bgt t1, t2, end_binary_search # mid = left + (right - left) / 2 sub t4, t2, t1 # t4 = right - left srai t4, t4, 1 # t4 = (right - left) / 2 add t5, t1, t4 # t5 = mid # Call canReachOnTime(dist, distSize, mid, hour) mv a0, t5 # a0 = mid (speed) la a1, dist # a1 = &dist[0] lw a2, distSize # a2 = distSize lw a3, hour # a3 = hour * 100 jal ra, canReachOnTime # Result in a0 (0 = false, 1 = true) beq a0, x0, cannot_reach # If can reach, update result and right = mid - 1 mv t3, t5 # result = mid addi t2, t5, -1 # right = mid - 1 j binary_search_loop cannot_reach: # If cannot reach, left = mid + 1 addi t1, t5, 1 # left = mid + 1 j binary_search_loop end_binary_search: # Store result la t0, result sw t3, 0(t0) # End of program j end_program ################################################## # Function: canReachOnTime # Inputs: # a0 = speed # a1 = pointer to dist array # a2 = distSize # a3 = hour * 100 (scaled to integer to avoid floating-point) # Output: # a0 = 1 if can reach on time, 0 otherwise canReachOnTime: # Prologue addi sp, sp, -4 sw s0, 0(sp) # Save s0 # Initialize totalTime = 0 li t0, 0 # t0 = totalTime # Loop over dist array li t1, 0 # t1 = i canReach_loop: bge t1, a2, canReach_end_loop # Load dist[i] slli t2, t1, 2 # t2 = i * 4 (word size) add t3, a1, t2 # t3 = &dist[i] lw t4, 0(t3) # t4 = dist[i] # Compute dist[i] * 100 using shifts and adds slli t5, t4, 6 # t5 = t4 << 6 (dist[i] * 64) slli t6, t4, 5 # t6 = t4 << 5 (dist[i] * 32) add t5, t5, t6 # t5 = dist[i] * 96 slli t6, t4, 2 # t6 = t4 << 2 (dist[i] * 4) add t5, t5, t6 # t5 = dist[i] * 100 # t5 now holds dist[i] * 100 # Compute time = (dist[i] * 100) / speed # Since we cannot use 'div', we implement division in software # Save registers before function call addi sp, sp, -12 sw t1, 0(sp) # Save t1 (i) sw t4, 4(sp) # Save t4 (dist[i]) sw t5, 8(sp) # Save t5 (dist[i] * 100) mv a0, t5 # a0 = dividend (dist[i] * 100) mv a1, a0 # a1 = divisor (speed) jal divu # Call division function mv t6, a0 # t6 = time # Restore registers lw t1, 0(sp) # Restore t1 (i) lw t4, 4(sp) # Restore t4 (dist[i]) lw t5, 8(sp) # Restore t5 (dist[i] * 100) addi sp, sp, 12 # For all but the last train, apply ceiling addi s0, a2, -1 # s0 = distSize - 1 blt t1, s0, apply_ceiling # Last train, add time as is add t0, t0, t6 # totalTime += time j next_iteration apply_ceiling: # time = ((dist[i] * 100) + speed - 1) / speed add t5, t5, a0 # t5 = dist[i] * 100 + speed addi t5, t5, -1 # t5 = dist[i] * 100 + speed - 1 # Divide t5 by speed # Save registers before function call addi sp, sp, -4 sw t5, 0(sp) # Save t5 mv a0, t5 # a0 = dividend mv a1, a0 # a1 = divisor (speed) jal divu # Call division function mv t6, a0 # t6 = time # Restore registers lw t5, 0(sp) # Restore t5 addi sp, sp, 4 # Add time to totalTime add t0, t0, t6 # totalTime += time next_iteration: addi t1, t1, 1 # i++ j canReach_loop canReach_end_loop: # Compare totalTime with hour * 100 ble t0, a3, can_reach # Cannot reach li a0, 0 # Epilogue lw s0, 0(sp) addi sp, sp, 4 ret can_reach: # Can reach li a0, 1 # Epilogue lw s0, 0(sp) addi sp, sp, 4 ret ################################################## # Unsigned Division Function: divu # Divides a0 (dividend) by a1 (divisor) # Result is in a0 (quotient) divu: # Check for division by zero beq a1, zero, divu_zero li t0, 0 # t0 = quotient mv t1, a0 # t1 = dividend mv t2, a1 # t2 = divisor # Find the highest bit set in divisor li t3, 0 # t3 = shift amount divu_shift: sll t4, t2, t3 # t4 = divisor << t3 blt t1, t4, divu_divide addi t3, t3, 1 j divu_shift divu_divide: addi t3, t3, -1 # Adjust shift amount divu_loop: sll t4, t2, t3 # t4 = divisor << t3 blt t1, t4, divu_no_subtract sub t1, t1, t4 # t1 -= t4 slli t0, t0, 1 # t0 <<= 1 addi t0, t0, 1 # t0 |= 1 j divu_next divu_no_subtract: slli t0, t0, 1 # t0 <<= 1 divu_next: addi t3, t3, -1 bgez t3, divu_loop mv a0, t0 # Return quotient in a0 ret divu_zero: li a0, -1 # Return max unsigned value ret ################################################## end_program: # Infinite loop to end the program j end_program ``` Leetcode 1870 RISC-V optimized #### Optimization Strategy 1. **Integer Scaling**: 1. Since `hour` has at most two decimal places, we can scale all time-related values by 100 to convert them into integers. This allows us to avoid floating-point operations. 2. For higher precision, we can scale by 100 (two decimal places) or 1000 (three decimal places). 2. **Integer Arithmetic**: 1. Replace floating-point division and comparisons with integer arithmetic. 2. Use integer operations to compute time and compare with the scaled hour. 3. **Optimize Loops and Branches**: 1. Use efficient loop constructs. 2. Minimize the number of instructions in loops. 3. Use branchless programming techniques where appropriate. 4. **Use Efficient Instructions**: 1. Leverage RISC-V instructions that perform combined operations. 2. Use shifts and adds instead of multiplications and divisions where possible. #### Original version (without using fp16/fp32 conversion) ```c .data MAX_SPEED: .word 10000000 # Maximum speed SCALE_FACTOR: .word 100 # Scaling factor for hour (to two decimal places) .text .globl minSpeedOnTime minSpeedOnTime: # Arguments: # a0: dist pointer # a1: distSize # a2: hour (double, but we will pass it as an integer scaled by 100) # Prologue addi sp, sp, -16 # Allocate stack space sw ra, 12(sp) # Save return address sw s0, 8(sp) # Save s0 sw s1, 4(sp) # Save s1 sw s2, 0(sp) # Save s2 mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv s2, a2 # s2 = hour * SCALE_FACTOR li t0, 1 # left = 1 la t1, MAX_SPEED lw t1, 0(t1) # t1 = right = MAX_SPEED li t2, -1 # result = -1 binary_search_loop: ble t1, t0, end_binary_search # while (left <= right) # mid = left + (right - left) / 2 sub t3, t1, t0 # t3 = right - left srai t3, t3, 1 # t3 = (right - left) / 2 add t4, t0, t3 # t4 = mid # Call canReachOnTime(dist, distSize, mid, scaled_hour) # Arguments: # a0: dist pointer (s0) # a1: distSize (s1) # a2: speed (t4) # a3: scaled_hour (s2) mv a0, s0 mv a1, s1 mv a2, t4 mv a3, s2 jal ra, canReachOnTime # Result in a0 (0 = false, 1 = true) beq a0, x0, cannot_reach # If can reach, update result and right = mid - 1 mv t2, t4 # result = mid addi t1, t4, -1 # right = mid - 1 j binary_search_loop cannot_reach: # If cannot reach, left = mid + 1 addi t0, t4, 1 # left = mid + 1 j binary_search_loop end_binary_search: # Epilogue mv a0, t2 # Return value in a0 lw ra, 12(sp) # Restore return address lw s0, 8(sp) # Restore s0 lw s1, 4(sp) # Restore s1 lw s2, 0(sp) # Restore s2 addi sp, sp, 16 # Deallocate stack space ret ################################################## # Function: canReachOnTime # Arguments: # a0: dist pointer # a1: distSize # a2: speed # a3: scaled_hour (hour * 100) # Returns: # a0: 1 if can reach on time, 0 otherwise canReachOnTime: # Prologue addi sp, sp, -36 # Allocate stack space sw ra, 32(sp) # Save return address sw s0, 28(sp) # Save s0 sw s1, 24(sp) # Save s1 sw s2, 20(sp) # Save s2 sw s3, 16(sp) # Save s3 sw s4, 12(sp) # Save s4 sw s5, 8(sp) # Save s5 sw s6, 4(sp) # Save s6 sw s7, 0(sp) # Save s7 mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv t6, a2 # t6 = speed mv s7, a3 # s7 = scaled_hour li t0, 0 # totalTime = 0 li t1, 0 # i = 0 canReachOnTime_loop: bge t1, s1, canReachOnTime_end_loop # Load dist[i] slli t2, t1, 2 # t2 = i * 4 add t3, s0, t2 # t3 = &dist[i] lw t4, 0(t3) # t4 = dist[i] # Compute s4 = dist[i] * 100 # Since 100 = 64 + 32 + 4, we can compute dist[i] * 100 using shifts and adds slli t5, t4, 6 # t5 = t4 << 6 (t4 * 64) slli t6, t4, 5 # t6 = t4 << 5 (t4 * 32) add s4, t5, t6 # s4 = t4 * 96 slli t5, t4, 2 # t5 = t4 << 2 (t4 * 4) add s4, s4, t5 # s4 = s4 + t5 (t4 * 100) # Compute s5 = s4 / s2 (time) # Implement division in software mv a0, s4 # a0 = dividend mv a1, s2 # a1 = divisor jal ra, divu # Result in a0 mv s5, a0 # s5 = time # For all but the last train addi t2, t1, 1 # t2 = i + 1 blt t2, s1, compute_ceil_time # Last train, add time as is add t0, t0, t5 # totalTime += time j canReachOnTime_next_iteration compute_ceil_time: # Compute s6 = s5 * s2 (s6 = time * speed) mv a0, s5 mv a1, s2 jal ra, mulu # Result in a0 mv s6, a0 # s6 = time * speed # Compute t5 = s4 - s6 (remainder) sub t5, s4, s6 # t5 = s4 - s6 # If t5 != 0, need to increment time bnez t5, needs_ceiling # No remainder, time is as is add t0, t0, s5 # totalTime += time j canReachOnTime_next_iteration needs_ceiling: addi s5, s5, 1 # s5 = s5 + 1 add t0, t0, s5 # totalTime += time canReachOnTime_next_iteration: addi t1, t1, 1 # i++ j canReachOnTime_loop canReachOnTime_end_loop: # Compare totalTime with scaled_hour ble t0, s7, canReachOnTime_can_reach # Cannot reach li a0, 0 # Epilogue lw ra, 32(sp) lw s0, 28(sp) lw s1, 24(sp) lw s2, 20(sp) lw s3, 16(sp) lw s4, 12(sp) lw s5, 8(sp) lw s6, 4(sp) lw s7, 0(sp) addi sp, sp, 36 ret canReachOnTime_can_reach: # Can reach li a0, 1 # Epilogue lw ra, 32(sp) lw s0, 28(sp) lw s1, 24(sp) lw s2, 20(sp) lw s3, 16(sp) lw s4, 12(sp) lw s5, 8(sp) lw s6, 4(sp) lw s7, 0(sp) addi sp, sp, 36 ret # Multiply a0 * a1, result in a0 mulu: li t0, 0 # t0 = product mv t1, a0 # t1 = multiplicand mv t2, a1 # t2 = multiplier mulu_loop: beqz t2, mulu_end andi t3, t2, 1 beqz t3, mulu_skip_add add t0, t0, t1 # t0 += t1 mulu_skip_add: slli t1, t1, 1 # t1 <<= 1 srli t2, t2, 1 # t2 >>= 1 j mulu_loop mulu_end: mv a0, t0 # Return product in a0 ret # Divide a0 / a1, result in a0 divu: beq a1, zero, divu_zero_divisor li t0, 0 # t0 = quotient mv t1, a0 # t1 = dividend mv t2, a1 # t2 = divisor li t3, 0 # t3 = shift count divu_shift: sll t4, t2, t3 # t4 = t2 << t3 blt t1, t4, divu_divide addi t3, t3, 1 j divu_shift divu_divide: addi t3, t3, -1 beqz t3, divu_end divu_loop: sll t5, t2, t3 # t5 = t2 << t3 blt t1, t5, divu_no_subtract sub t1, t1, t5 # t1 -= t5 ori t0, t0, 1 # t0 |= 1 divu_no_subtract: sll t0, t0, 1 # t0 <<= 1 addi t3, t3, -1 bnez t3, divu_loop srl t0, t0, 1 # Adjust quotient divu_end: mv a0, t0 # Return quotient in a0 ret divu_zero_divisor: li a0, -1 # Return max unsigned value ret ``` #### Optimized version (utilize fp16/fp32 conversion) ```c .data MAX_SPEED: .word 10000000 # Maximum speed # No need for SCALE_FACTOR since we're using FP16 representations .text .globl minSpeedOnTime minSpeedOnTime: # Arguments: # a0: dist pointer # a1: distSize # a2: hour (FP16 representation) # Prologue addi sp, sp, -20 # Allocate stack space sw ra, 16(sp) # Save return address sw s0, 12(sp) # Save s0 sw s1, 8(sp) # Save s1 sw s2, 4(sp) # Save s2 sw s3, 0(sp) # Save s3 mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv s2, a2 # s2 = hour (FP16) li t0, 1 # left = 1 la t1, MAX_SPEED lw t1, 0(t1) # t1 = right = MAX_SPEED li s3, -1 # result = -1 binary_search_loop: bgt t0, t1, end_binary_search # while (left <= right) # mid = left + (right - left) / 2 sub t2, t1, t0 # t2 = right - left srai t2, t2, 1 # t2 = (right - left) / 2 add t3, t0, t2 # t3 = mid # Call canReachOnTime(dist, distSize, mid, hour) # Arguments: # a0: dist pointer (s0) # a1: distSize (s1) # a2: speed (t3) # a3: hour (s2) mv a0, s0 mv a1, s1 mv a2, t3 mv a3, s2 jal ra, canReachOnTime # Result in a0 (0 = false, 1 = true) beq a0, x0, cannot_reach # If can reach, update result and right = mid - 1 mv s3, t3 # result = mid addi t1, t3, -1 # right = mid - 1 j binary_search_loop cannot_reach: # If cannot reach, left = mid + 1 addi t0, t3, 1 # left = mid + 1 j binary_search_loop end_binary_search: # Epilogue mv a0, s3 # Return value in a0 lw ra, 16(sp) # Restore return address lw s0, 12(sp) # Restore s0 lw s1, 8(sp) # Restore s1 lw s2, 4(sp) # Restore s2 lw s3, 0(sp) # Restore s3 addi sp, sp, 20 # Deallocate stack space ret ################################################## # Function: canReachOnTime # Arguments: # a0: dist pointer # a1: distSize # a2: speed # a3: hour (FP16 representation) # Returns: # a0: 1 if can reach on time, 0 otherwise canReachOnTime: # Prologue addi sp, sp, -28 # Allocate stack space sw ra, 24(sp) # Save return address sw s0, 20(sp) # Save s0 sw s1, 16(sp) # Save s1 sw s2, 12(sp) # Save s2 sw s3, 8(sp) # Save s3 sw s4, 4(sp) # Save s4 sw s5, 0(sp) # Save s5 mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv s2, a2 # s2 = speed mv s3, a3 # s3 = hour (FP16) li t0, 0 # t0 = totalTime (FP16) li t1, 0 # t1 = i canReachOnTime_loop: bge t1, s1, canReachOnTime_end_loop # Load dist[i] slli t2, t1, 2 # t2 = i * 4 add t3, s0, t2 # t3 = &dist[i] lw t4, 0(t3) # t4 = dist[i] # Compute time = dist[i] / speed (as FP16) # Since we cannot perform division, we'll approximate # Use software division or shift-based approximation # Multiply speed by time to get dist[i] # time = dist[i] / speed # Since speed is an integer, we can represent time in FP16 # Convert dist[i] to FP16 mv a0, t4 # a0 = dist[i] jal ra, fp32_to_fp16 # a0 = dist_fp16 mv s4, a0 # s4 = dist_fp16 # Convert speed to FP16 mv a0, s2 # a0 = speed jal ra, fp32_to_fp16 # a0 = speed_fp16 mv s5, a0 # s5 = speed_fp16 # Compute time_fp16 = dist_fp16 / speed_fp16 # Since we cannot divide, we will approximate using shifts # Alternatively, since speed is integer, and dist is FP16, we can multiply dist by reciprocal of speed # For simplicity, let's proceed to accumulate totalTime approximately # For all but the last train addi t2, t1, 1 # t2 = i + 1 blt t2, s1, compute_ceil_time # Last train, add time as is # totalTime += dist[i] / speed # Since we cannot divide, we can estimate time as follows: # time_fp32 = (dist[i] << 16) / speed slli t5, t4, 16 # t5 = dist[i] << 16 mv a0, t5 mv a1, s2 jal ra, divu # a0 = time (approximate) mv t6, a0 # t6 = time # Convert time to FP16 jal ra, fp32_to_fp16 # a0 = time_fp16 add t0, t0, a0 # totalTime += time_fp16 j canReachOnTime_next_iteration compute_ceil_time: # For other trains, we need to compute ceil(dist[i] / speed) # Compute time = (dist[i] + speed - 1) / speed add t5, t4, s2 # t5 = dist[i] + speed addi t5, t5, -1 # t5 = dist[i] + speed - 1 mv a0, t5 mv a1, s2 jal ra, divu # a0 = time (integer) # Convert time to FP16 jal ra, fp32_to_fp16 # a0 = time_fp16 add t0, t0, a0 # totalTime += time_fp16 canReachOnTime_next_iteration: addi t1, t1, 1 # i++ j canReachOnTime_loop canReachOnTime_end_loop: # Compare totalTime with hour (both in FP16) # totalTime <= hour # Convert totalTime (FP16) to approximated FP32 integer mv a0, t0 jal ra, fp16_to_fp32 # a0 = totalTime_fp32 (approximate integer) # Convert hour (FP16) to approximated FP32 integer mv a1, s3 jal ra, fp16_to_fp32 # a0 = totalTime_fp32, a1 = hour_fp32 mv t1, a0 # t1 = totalTime_fp32 mv t2, a1 # t2 = hour_fp32 # Compare totalTime_fp32 and hour_fp32 ble t1, t2, canReachOnTime_can_reach # Cannot reach li a0, 0 # Epilogue lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret canReachOnTime_can_reach: # Can reach li a0, 1 # Epilogue lw ra, 24(sp) lw s0, 20(sp) lw s1, 16(sp) lw s2, 12(sp) lw s3, 8(sp) lw s4, 4(sp) lw s5, 0(sp) addi sp, sp, 28 ret # Unsigned Division: a0 = a0 / a1 divu: beq a1, zero, divu_zero_divisor li t0, 0 # t0 = quotient mv t1, a0 # t1 = dividend mv t2, a1 # t2 = divisor # Count leading zeros in divisor li t3, 0 # t3 = shift amount divu_shift: sll t4, t2, t3 # t4 = divisor << shift blt t1, t4, divu_divide addi t3, t3, 1 j divu_shift divu_divide: addi t3, t3, -1 # Adjust shift amount divu_loop: sll t4, t2, t3 # t4 = divisor << shift sub t5, t1, t4 # t5 = dividend - (divisor << shift) blt t5, zero, divu_no_subtract mv t1, t5 # t1 = t5 slli t0, t0, 1 # t0 <<= 1 addi t0, t0, 1 # t0 |= 1 j divu_next divu_no_subtract: slli t0, t0, 1 # t0 <<= 1 divu_next: addi t3, t3, -1 bgez t3, divu_loop mv a0, t0 # Return quotient in a0 ret divu_zero_divisor: li a0, -1 # Return max unsigned value ret ``` ## RISC-V operation process in Ripes ![Ripes 5-stage processor](https://hackmd.io/_uploads/rkTmOhc1ke.png) Ripes supports three types of processors: 1. Single-cycle processor 2. 5-stage processor 3. 6-stage dual-issue processor For this assignment, the 5-stage pipelined processor has been selected as the target device because it is the most commonly used architecture. 1. **Instruction Fetch (IF):** 1. The Program Counter (PC) holds the address of the next instruction to be executed. 2. The PC increments sequentially. (`PC+4` since each instruction is 32 bits or 4 bytes in RV32) 3. For branch and jump instructions, the PC is updated to the target address. 4. The instruction at the current PC is fetched from memory and passed to the next stage. 2. **Instruction Decode (ID):** 1. The instruction is decoded to determine its type and required operands. 2. The processor reads the necessary register values from the register file, which contains 32 general-purpose registers. 3. Immediate values are extracted, and control signals are generated based on the instruction. 3. **Execution (EX):** 1. Load and store instructions access the data memory. 2. The processor reads from or writes to memory locations computed in the EX stage. 1. Arithmetic operations 2. branch target calculations 3. load/store address calculations 4. **Memory Access (MEM):** 1. For load and store instructions, the memory access is performed using the calculated address. 2. The computed address from the EX stage is used to read from or write to memory, depending on the instruction type. 3. Other instructions bypass this stage without any action. 5. **Write Back (WB):** 1. The final stage writes the result back to the register file. 2. The value to be written could come from either the memory (for load instructions) or the Arithmetic Logic Unit (ALU) (for arithmetic instructions). 3. This updates the destination register specified by the instruction. ### Optimization in Ripes illustration #### code test ```c .data # Constants for FP16 to FP32 conversion exp_offset: .word 0x03800000 # 0x70 << 23 mask: .word 0x3F000000 # 126 << 23 denorm_cutoff: .word 0x08000000 # 1 << 27 # Constants for FP32 to FP16 conversion bias_threshold: .word 0x71000000 bias_offset: .word 0x07800000 max_exponent: .word 0xFF000000 nan_value: .word 0x7E00 # Maximum speed MAX_SPEED: .word 10000000 # Maximum speed # Test cases # Test case 1: dist = [1,3,2], hour = 6.0 dist1: .word 1, 3, 2 distSize1: .word 3 hour1: .half 0x4600 # 6.0 in FP16 # Test case 2: dist = [1,3,2], hour = 2.7 dist2: .word 1, 3, 2 distSize2: .word 3 hour2: .half 0x4166 # 2.7 in FP16 (approximated) # Test case 3: dist = [1,3,2], hour = 1.9 dist3: .word 1, 3, 2 distSize3: .word 3 hour3: .half 0x3F99 # 1.9 in FP16 (approximated) # Storage for results result1: .word 0 result2: .word 0 result3: .word 0 .text .globl main main: # Test Case 1 la a0, dist1 # Load address of dist array lw a1, distSize1 # Load distSize la t0, hour1 # Load address of hour1 lhu a2, 0(t0) # Load hour (FP16) into a2 jal minSpeedOnTime # Call function # Store result la t0, result1 # Load address of result1 sw a0, 0(t0) # Store result into result1 # Print result # (In Ripes, you can view the 'result1' memory location) # Test Case 2 la a0, dist2 lw a1, distSize2 la t0, hour2 lhu a2, 0(t0) jal minSpeedOnTime la t0, result2 sw a0, 0(t0) # Test Case 3 la a0, dist3 lw a1, distSize3 la t0, hour3 lhu a2, 0(t0) jal minSpeedOnTime la t0, result3 sw a0, 0(t0) # End of program j end_program ################################################## # Function: minSpeedOnTime # Arguments: # a0: dist pointer # a1: distSize # a2: hour (FP16 representation) # Returns: # a0: Minimum speed, or -1 if impossible minSpeedOnTime: # Prologue addi sp, sp, -20 # Allocate stack space sw ra, 16(sp) # Save return address sw s0, 12(sp) # Save s0 sw s1, 8(sp) # Save s1 sw s2, 4(sp) # Save s2 sw s3, 0(sp) # Save s3 mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv s2, a2 # s2 = hour (FP16) li t0, 1 # t0 = left = 1 la t1, MAX_SPEED lw t1, 0(t1) # t1 = right = MAX_SPEED li s3, -1 # s3 = result = -1 binary_search_loop: bgt t0, t1, end_binary_search # while (left <= right) # mid = left + (right - left) / 2 sub t2, t1, t0 # t2 = right - left srai t2, t2, 1 # t2 = (right - left) / 2 add t3, t0, t2 # t3 = mid # Call canReachOnTime(dist, distSize, mid, hour) # Arguments: # a0: dist pointer (s0) # a1: distSize (s1) # a2: speed (t3) # a3: hour (s2) mv a0, s0 mv a1, s1 mv a2, t3 mv a3, s2 jal canReachOnTime # Result in a0 (0 = false, 1 = true) beq a0, zero, cannot_reach # If can reach, update result and right = mid - 1 mv s3, t3 # result = mid addi t1, t3, -1 # right = mid - 1 j binary_search_loop cannot_reach: # If cannot reach, left = mid + 1 addi t0, t3, 1 # left = mid + 1 j binary_search_loop end_binary_search: # Epilogue mv a0, s3 # Return value in a0 lw ra, 16(sp) # Restore return address lw s0, 12(sp) # Restore s0 lw s1, 8(sp) lw s2, 4(sp) lw s3, 0(sp) addi sp, sp, 20 # Deallocate stack space ret ################################################## # Function: canReachOnTime # Arguments: # a0: dist pointer # a1: distSize # a2: speed # a3: hour (FP16 representation) # Returns: # a0: 1 if can reach on time, 0 otherwise canReachOnTime: # Prologue addi sp, sp, -36 # Allocate stack space sw ra, 32(sp) # Save return address sw s0, 28(sp) sw s1, 24(sp) sw s2, 20(sp) sw s3, 16(sp) sw s4, 12(sp) sw s5, 8(sp) sw s6, 4(sp) sw s7, 0(sp) mv s0, a0 # s0 = dist pointer mv s1, a1 # s1 = distSize mv s2, a2 # s2 = speed mv s3, a3 # s3 = hour (FP16) li t0, 0 # t0 = totalTime (FP16) li t1, 0 # t1 = i canReachOnTime_loop: bge t1, s1, canReachOnTime_end_loop # Load dist[i] slli t2, t1, 2 # t2 = i * 4 add t3, s0, t2 # t3 = &dist[i] lw t4, 0(t3) # t4 = dist[i] # Convert dist[i] to FP16 mv a0, t4 # a0 = dist[i] jal fp32_to_fp16 # a0 = dist_fp16 mv s4, a0 # s4 = dist_fp16 # Convert speed to FP16 mv a0, s2 # a0 = speed jal fp32_to_fp16 # a0 = speed_fp16 mv s5, a0 # s5 = speed_fp16 # Compute time_fp16 = dist_fp16 / speed_fp16 (approximate) # Since division isn't available, use a software routine mv a0, s4 # a0 = dist_fp16 mv a1, s5 # a1 = speed_fp16 jal divu # a0 = time_fp16 mv s6, a0 # s6 = time_fp16 # For all but the last train addi t2, t1, 1 # t2 = i + 1 blt t2, s1, compute_ceil_time # Last train, add time as is add t0, t0, s6 # totalTime += time_fp16 j canReachOnTime_next_iteration compute_ceil_time: # Compute if we need to ceil the time mv a0, s4 # a0 = dist_fp16 mv a1, s2 # a1 = speed jal ceil_div_fp16 # a0 = time_fp16 after ceiling mv s6, a0 # s6 = time_fp16 add t0, t0, s6 # totalTime += time_fp16 canReachOnTime_next_iteration: addi t1, t1, 1 # i++ j canReachOnTime_loop canReachOnTime_end_loop: # Compare totalTime with hour (both in FP16) # Convert totalTime to approximated FP32 integer mv a0, t0 jal fp16_to_fp32 # a0 = totalTime_fp32 # Convert hour (FP16) to approximated FP32 integer mv a1, s3 jal fp16_to_fp32 # a1 = hour_fp32 # Compare totalTime_fp32 and hour_fp32 mv t1, a0 # t1 = totalTime_fp32 mv t2, a1 # t2 = hour_fp32 ble t1, t2, canReachOnTime_can_reach # Cannot reach li a0, 0 # Epilogue lw ra, 32(sp) lw s0, 28(sp) lw s1, 24(sp) lw s2, 20(sp) lw s3, 16(sp) lw s4, 12(sp) lw s5, 8(sp) lw s6, 4(sp) lw s7, 0(sp) addi sp, sp, 36 ret canReachOnTime_can_reach: # Can reach li a0, 1 # Epilogue lw ra, 32(sp) lw s0, 28(sp) lw s1, 24(sp) lw s2, 20(sp) lw s3, 16(sp) lw s4, 12(sp) lw s5, 8(sp) lw s6, 4(sp) lw s7, 0(sp) addi sp, sp, 36 ret ################################################## # Function: ceil_div_fp16 # Computes ceil(dividend / divisor) for FP16 values # Inputs: # a0: dividend (FP16) # a1: divisor (integer) # Returns: # a0: Result (FP16) ceil_div_fp16: # Prologue addi sp, sp, -12 # Allocate stack space sw ra, 8(sp) # Save return address sw s0, 4(sp) # Save s0 sw s1, 0(sp) # Save s1 # Convert FP16 dividend to integer jal fp16_to_fp32 # a0 = dividend_fp32 # Save dividend_fp32 mv s0, a0 # s0 = dividend_fp32 # Convert FP16 divisor to integer mv a0, a1 # a0 = divisor (FP16) jal fp16_to_fp32 # a0 = divisor_fp32 # Save divisor_fp32 mv s1, a0 # s1 = divisor_fp32 # Perform (dividend_fp32 + divisor_fp32 - 1) / divisor_fp32 add t0, s0, s1 # t0 = dividend_fp32 + divisor_fp32 addi t0, t0, -1 # t0 = dividend_fp32 + divisor_fp32 - 1 # Use divu to compute quotient mv a0, t0 mv a1, s1 # a1 = divisor_fp32 jal divu # a0 = quotient # Convert result back to FP16 jal fp32_to_fp16 # a0 = result_fp16 # Epilogue lw ra, 8(sp) # Restore return address lw s0, 4(sp) # Restore s0 lw s1, 0(sp) # Restore s1 addi sp, sp, 12 # Deallocate stack space ret ################################################## # Function: divu # Unsigned division: a0 = a0 / a1 # Returns: # a0: quotient divu: beq a1, zero, divu_zero_divisor li t0, 0 # t0 = quotient mv t1, a0 # t1 = dividend mv t2, a1 # t2 = divisor li t3, 0 # t3 = shift amount divu_shift: sll t4, t2, t3 # t4 = divisor << t3 blt t1, t4, divu_divide addi t3, t3, 1 blt t3, 32, divu_shift # Limit shift amount to < 32 j divu_divide # Exit loop if t3 >= 32 divu_divide: addi t3, t3, -1 # Adjust shift amount divu_loop: sll t4, t2, t3 # t4 = divisor << t3 blt t1, t4, divu_no_subtract sub t1, t1, t4 # t1 -= t4 slli t0, t0, 1 # t0 <<= 1 addi t0, t0, 1 # t0 |= 1 j divu_next divu_no_subtract: slli t0, t0, 1 # t0 <<= 1 divu_next: addi t3, t3, -1 bgez t3, divu_loop # Continue loop if t3 >= 0 mv a0, t0 # Return quotient in a0 ret divu_zero_divisor: li a0, -1 # Return max unsigned value ret ################################################## # Function: fp32_to_fp16 # Converts a 32-bit integer to 16-bit FP representation # Input: a0 = integer value # Output: a0 = FP16 value fp32_to_fp16: # Prologue addi sp, sp, -40 sw ra, 36(sp) sw s0, 32(sp) sw s1, 28(sp) sw s2, 24(sp) sw s3, 20(sp) sw s4, 16(sp) sw s5, 12(sp) sw s6, 8(sp) sw s7, 4(sp) sw s8, 0(sp) # Input: f in a0 jal fp32_to_bits # a0 = w mv s0, a0 # s0 = w # s1 = shl1_w = w + w add s1, s0, s0 # s2 = sign = w & 0x80000000 li t0, 0x80000000 and s2, s0, t0 # s3 = bias = shl1_w & 0xFF000000 li t1, 0xFF000000 and s3, s1, t1 # Adjust bias if necessary la t2, bias_threshold lw t2, 0(t2) blt s3, t2, adjust_bias j compute_base adjust_bias: mv s3, t2 # s3 = bias = bias_threshold compute_base: # s4 = (bias >> 1) + bias_offset srli t3, s3, 1 la t4, bias_offset lw t4, 0(t4) add s4, t3, t4 # base = bits_to_fp32(s4) mv a0, s4 jal bits_to_fp32 mv s5, a0 # s5 = base # Convert base back to bits jal fp32_to_bits mv s6, a0 # Extract exponent and mantissa srli t5, s6, 13 # t5 = bits >> 13 # Handle immediate overflow li s7, 0x7C00 and t5, t5, s7 # t5 = exp_bits li s8, 0x0FFF and t6, s6, s8 # t6 = mantissa_bits # nonsign = exp_bits + mantissa_bits add t0, t5, t6 # t0 = nonsign # Check for NaN or Infinity la t1, max_exponent lw t1, 0(t1) bgt s1, t1, return_nan # Assemble fp16 value srli s2, s2, 16 or a0, s2, t0 j fp32_to_fp16_end return_nan: # Return NaN value srli s2, s2, 16 la t2, nan_value lw t2, 0(t2) or a0, s2, t2 fp32_to_fp16_end: # Epilogue lw ra, 36(sp) lw s0, 32(sp) lw s1, 28(sp) lw s2, 24(sp) lw s3, 20(sp) lw s4, 16(sp) lw s5, 12(sp) lw s6, 8(sp) lw s7, 4(sp) lw s8, 0(sp) addi sp, sp, 40 ret ################################################## # Function: fp16_to_fp32 # Converts a 16-bit FP value to a 32-bit integer (approximate) # Input: a0 = FP16 value # Output: a0 = integer value fp16_to_fp32: # Prologue addi sp, sp, -20 sw ra, 16(sp) sw s0, 12(sp) sw s1, 8(sp) sw s2, 4(sp) sw s3, 0(sp) # Input: h in a0 slli s0, a0, 16 # s0 = w = h << 16 # s1 = sign = w & 0x80000000 li t0, 0x80000000 and s1, s0, t0 # s2 = two_w = w + w add s2, s0, s0 # temp = (two_w >> 4) + exp_offset srli t1, s2, 4 la t2, exp_offset lw t2, 0(t2) add s3, t1, t2 # normalized_value = bits_to_fp32(temp) mv a0, s3 jal bits_to_fp32 mv s4, a0 # temp = (two_w >> 17) | mask srli t1, s2, 17 la t2, mask lw t2, 0(t2) or s3, t1, t2 # denormalized_value = bits_to_fp32(s3) mv a0, s3 jal bits_to_fp32 mv s5, a0 # Compare two_w with denorm_cutoff la t0, denorm_cutoff lw t0, 0(t0) blt s2, t0, use_denorm # Use normalized_value mv a0, s4 j assemble_result use_denorm: # Use denormalized_value mv a0, s5 assemble_result: # Combine sign with the selected value or a0, a0, s1 # Epilogue lw ra, 16(sp) lw s0, 12(sp) lw s1, 8(sp) lw s2, 4(sp) lw s3, 0(sp) addi sp, sp, 20 ret ################################################## # Function: fp32_to_bits # Input: Integer value (in a0) # Output: Bit pattern (in a0) fp32_to_bits: # Simplified: Treat the integer as the bit pattern ret ################################################## # Function: bits_to_fp32 # Input: Bit pattern (in a0) # Output: Integer value (approximate) bits_to_fp32: # Simplified: Treat the bit pattern as the integer value ret ################################################## end_program: # Infinite loop to end the program j end_program ``` ![image](https://hackmd.io/_uploads/BkkvpnqJJg.png)