# Assignment1: RISC-V Assembly and Instruction Pipeline

contributed by < En-Hsiang Chang >

# LeetCode Problem 818: Race Car

LeetCode problem 818, **Race Car**, is a pathfinding problem on a number line. Your car starts at position 0 with a speed of +1 on an infinite number line. The car drives according to a sequence of instructions, where each instruction is one of two commands:

- **"A" (Accelerate)**:
    - Increases your position by the current speed.
    - Doubles your current speed.
- **"R" (Reverse)**:
    - Reverses the direction of the car: if the current speed is positive, it becomes -1; if it is negative, it becomes +1.
    - The position does not change when reversing.

Given a target position `target` on the number line, return the length of the shortest sequence of commands (composed of "A" and "R") that reaches that exact position starting from position 0.

**Example 1:**
- **Input**: `target = 3`
- **Output**: `2`
- **Explanation**: The shortest instruction sequence is `"AA"`, which moves the car through positions `0 -> 1 -> 3`.

**Example 2:**
- **Input**: `target = 6`
- **Output**: `5`
- **Explanation**: The shortest instruction sequence is `"AAARA"`, which moves the car through positions `0 -> 1 -> 3 -> 7 -> 7 -> 6`.

**Constraints:**
- `1 <= target <= 10000`

The challenge is to reach the target with the fewest instructions. Two common approaches are breadth-first search (BFS) over the reachable `(position, speed)` states and dynamic programming over the minimal step counts; both must handle reversals between positive and negative speed.
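For intuition about the search space, here is a minimal BFS sketch over `(position, speed)` states. This is not the submitted solution, and the name `racecar_bfs` is only illustrative; it relies on the facts that every reachable speed is a power of two (so a state can be encoded compactly) and that positions outside `[-2*target, 2*target]` can be pruned.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative BFS sketch for Race Car: each BFS level adds one instruction,
 * so the first time the target position is dequeued, `steps` is the answer. */
static int racecar_bfs(int target)
{
    enum { MAX_EXP = 17 };                       /* |speed| stays below 2^17 inside the pruned range */
    int span = 4 * target + 1;                   /* shifted positions: pos + 2*target in [0, span) */
    size_t nstates = (size_t)span * MAX_EXP * 2;
    char *visited = calloc(nstates, 1);          /* (position, log2|speed|, direction) */
    int (*queue)[2] = malloc(nstates * sizeof(int[2]));
    int head = 0, tail = 0, steps = 0;

    queue[tail][0] = 0;                          /* start position */
    queue[tail][1] = 1;                          /* start speed    */
    tail++;
    visited[((size_t)(2 * target) * MAX_EXP) * 2] = 1;

    while (head < tail) {
        int level_end = tail;                    /* one BFS level == one more instruction */
        for (; head < level_end; head++) {
            int pos = queue[head][0], speed = queue[head][1];
            if (pos == target) {
                free(visited); free(queue);
                return steps;
            }
            /* Candidate successors: "A" (accelerate) and "R" (reverse). */
            int next[2][2] = {
                { pos + speed, speed * 2 },
                { pos, speed > 0 ? -1 : 1 }
            };
            for (int k = 0; k < 2; k++) {
                int npos = next[k][0], nspeed = next[k][1];
                if (npos < -2 * target || npos > 2 * target)
                    continue;
                int e = 0;
                for (int s = nspeed < 0 ? -nspeed : nspeed; s > 1; s >>= 1)
                    e++;
                size_t idx = ((size_t)(npos + 2 * target) * MAX_EXP + e) * 2 + (nspeed < 0);
                if (e >= MAX_EXP || visited[idx])
                    continue;
                visited[idx] = 1;
                queue[tail][0] = npos;
                queue[tail][1] = nspeed;
                tail++;
            }
        }
        steps++;
    }
    free(visited); free(queue);
    return -1;                                   /* not reached: every valid target is reachable */
}

int main(void)
{
    printf("%d %d\n", racecar_bfs(3), racecar_bfs(6));  /* expected: 2 5 */
    return 0;
}
```

The submitted solution below takes the dynamic-programming route instead, which avoids the large visited set.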
# C program

```c
#include <stdio.h>
#include <limits.h>
#include <stdint.h>
#include <math.h>

/* dp[t][0] and dp[t][1] hold the DP step counts for position t,
 * stored as FP16 bit patterns to halve the memory footprint. */
uint16_t dp[10001][2] = {{0}};

// Conversion between FP16 and FP32 (Problem A)
static inline float bits_to_fp32(uint32_t w)
{
    union {
        uint32_t as_bits;
        float as_value;
    } fp32 = {.as_bits = w};
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f)
{
    union {
        float as_value;
        uint32_t as_bits;
    } fp32 = {.as_value = f};
    return fp32.as_bits;
}

static inline float fp16_to_fp32(uint16_t h)
{
    const uint32_t w = (uint32_t)h << 16;
    const uint32_t sign = w & UINT32_C(0x80000000);
    const uint32_t two_w = w + w;

    const uint32_t exp_offset = UINT32_C(0xE0) << 23;
    const float exp_scale = 0x1.0p-112f;
    const float normalized_value = bits_to_fp32((two_w >> 4) + exp_offset) * exp_scale;

    const uint32_t mask = UINT32_C(126) << 23;
    const float magic_bias = 0.5f;
    const float denormalized_value = bits_to_fp32((two_w >> 17) | mask) - magic_bias;

    const uint32_t denormalized_cutoff = UINT32_C(1) << 27;
    const uint32_t result = sign |
        (two_w < denormalized_cutoff ? fp32_to_bits(denormalized_value)
                                     : fp32_to_bits(normalized_value));
    return bits_to_fp32(result);
}

static inline uint16_t fp32_to_fp16(float f)
{
    const float scale_to_inf = 0x1.0p+112f;
    const float scale_to_zero = 0x1.0p-110f;
    float base = (fabsf(f) * scale_to_inf) * scale_to_zero;

    const uint32_t w = fp32_to_bits(f);
    const uint32_t shl1_w = w + w;
    const uint32_t sign = w & UINT32_C(0x80000000);
    uint32_t bias = shl1_w & UINT32_C(0xFF000000);
    if (bias < UINT32_C(0x71000000))
        bias = UINT32_C(0x71000000);

    base = bits_to_fp32((bias >> 1) + UINT32_C(0x07800000)) + base;
    const uint32_t bits = fp32_to_bits(base);
    const uint32_t exp_bits = (bits >> 13) & UINT32_C(0x00007C00);
    const uint32_t mantissa_bits = bits & UINT32_C(0x00000FFF);
    const uint32_t nonsign = exp_bits + mantissa_bits;
    return (sign >> 16) |
           (shl1_w > UINT32_C(0xFF000000) ? UINT16_C(0x7E00) : nonsign);
}

int racecar(int target)
{
    for (int t = 0; t <= target; t++) {
        /* Smallest n such that 2^n - 1 >= t. */
        int n = 0;
        while ((1 << n) <= t)
            n++;

        if ((1 << n) - 1 == t) {
            /* t is exactly 2^n - 1: n accelerations reach it directly. */
            dp[t][0] = fp32_to_fp16((float)n);
            dp[t][1] = fp32_to_fp16((float)(n + 1));
            continue;
        }

        /* Overshoot to 2^n - 1, reverse, and cover the remaining distance. */
        int rest = (1 << n) - 1 - t;
        float dp_rest_0 = fp16_to_fp32(dp[rest][0]);
        float dp_rest_1 = fp16_to_fp32(dp[rest][1]);
        dp[t][0] = fp32_to_fp16(n + 1 + (dp_rest_1 < dp_rest_0 + 1 ? dp_rest_1 : dp_rest_0 + 1));
        dp[t][1] = fp32_to_fp16(n + 1 + (dp_rest_0 < dp_rest_1 + 1 ? dp_rest_0 : dp_rest_1 + 1));

        /* Try splitting the distance at every intermediate point i. */
        for (int i = 0; i < t; i++) {
            for (int d = 0; d < 2; d++) {
                float dp_i_0 = fp16_to_fp32(dp[i][0]);
                float dp_i_1 = fp16_to_fp32(dp[i][1]);
                float dp_t_i_d = fp16_to_fp32(dp[t - i][d]);
                float option1 = dp_i_0 + 2 + dp_t_i_d;
                float option2 = dp_i_1 + 1 + dp_t_i_d;

                /* Keep the minimum of the current value and both options. */
                float best = fp16_to_fp32(dp[t][d]);
                if (option1 < best)
                    best = option1;
                if (option2 < best)
                    best = option2;
                dp[t][d] = fp32_to_fp16(best);
            }
        }
    }

    float result_0 = fp16_to_fp32(dp[target][0]);
    float result_1 = fp16_to_fp32(dp[target][1]);
    return result_0 < result_1 ? (int)result_0 : (int)result_1;
}
```

# Test on LeetCode

![1728961896321@2x](https://hackmd.io/_uploads/Bk8BAIsyyx.jpg)
![1728961946176@2x_0](https://hackmd.io/_uploads/rk8HA8i1Jx.jpg)
![1728961977434@2x_0](https://hackmd.io/_uploads/ByLH0UsJyl.jpg)
![1728962006088@2x](https://hackmd.io/_uploads/Bk8HR8sJyx.jpg)

# Why I added the floating-point conversion functions

The main reason for adding the FP16 and FP32 conversion functions is to reduce memory usage while keeping the calculations accurate. Here's how:

1. **Save memory:** FP16 uses half as many bits as FP32, so storing the dynamic-programming (dp) states as FP16 halves the memory the dp array needs. This is particularly helpful because the array holds 10001 × 2 entries, which would take twice the space as 32-bit floating-point numbers.
2. **Accuracy where it counts:** Before doing any arithmetic, the FP16 values are converted to FP32, which has more precision. The results are converted back to FP16 for storage, balancing memory savings against accuracy.
3. **Meeting the assignment requirements:** The assignment asks for a practical use of floating-point conversions. Performing the conversions inside the `racecar` function shows how they can improve efficiency in a real problem.
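As a quick sanity check on the precision argument, the sketch below assumes it is compiled in the same file as the conversion helpers above, so it can call `fp32_to_fp16` and `fp16_to_fp32` directly. It verifies that small integer step counts survive the FP16 round trip and compares the storage cost of the `dp` array in both formats:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumes this main() lives in the same translation unit as the
 * fp32_to_fp16() and fp16_to_fp32() helpers defined above. */
int main(void)
{
    int first_bad = -1;
    /* FP16 has an 11-bit significand, so every integer up to 2048 is exact. */
    for (int v = 0; v <= 2048; v++) {
        uint16_t h = fp32_to_fp16((float)v);
        if ((int)fp16_to_fp32(h) != v) {
            first_bad = v;
            break;
        }
    }
    if (first_bad < 0)
        printf("step counts 0..2048 round-trip through FP16 exactly\n");
    else
        printf("first value lost in the round trip: %d\n", first_bad);

    printf("dp storage: %zu bytes as uint16_t vs %zu bytes as float\n",
           sizeof(uint16_t[10001][2]), sizeof(float[10001][2]));
    return 0;
}
```

Step counts for targets up to 10000 stay far below 2048, so storing them as FP16 loses no precision for this problem.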
# Assembly Code

The assembly below is a simplified RV32I port of the C version: the conversion helpers manipulate raw bit patterns without rounding or special-case handling, the main loop only fills the power-of-two cases of the `dp` table, and the integer `n` is passed to the helper directly rather than being converted to `(float)n` first as in the C code.

```asm
.data
dp: .space 40004             # dp[10001][2] halfwords: 10001 * 2 * 2 bytes

.text
.globl main
```

```asm
# Function to convert FP32 (bit pattern in a0) to FP16 (bit pattern in a1).
# Simplified truncating conversion: no rounding, NaN/Inf or denormal handling.
fp32_to_fp16:
    li   t0, 0x1F            # 5-bit exponent mask
    li   t1, 0x3FF           # 10-bit mantissa mask
    srli t2, a0, 13          # drop the low 13 mantissa bits
    and  t2, t2, t1          # t2 = FP16 mantissa
    srli t3, a0, 23
    andi t3, t3, 0xFF        # t3 = FP32 biased exponent
    addi t3, t3, -127        # remove the FP32 bias
    addi t3, t3, 15          # apply the FP16 bias
    and  t3, t3, t0
    slli t3, t3, 10
    or   a1, t3, t2
    srli t4, a0, 31          # sign bit
    slli t4, t4, 15
    or   a1, a1, t4
    jr   ra
```

```asm
# Function to convert FP16 (bit pattern in a0) to FP32 (bit pattern in a1).
# Handles normal numbers only.
fp16_to_fp32:
    li   t0, 0x7C00          # FP16 exponent mask
    li   t1, 0x03FF          # FP16 mantissa mask
    and  t2, a0, t1
    slli t2, t2, 13          # mantissa into FP32 position
    and  t3, a0, t0
    srli t3, t3, 10          # FP16 biased exponent
    addi t3, t3, 112         # rebias: 127 - 15
    slli t3, t3, 23
    or   a1, t3, t2
    srli t4, a0, 15          # sign bit
    andi t4, t4, 1
    slli t4, t4, 31
    or   a1, a1, t4
    jr   ra
```

```asm
main:
    li   s0, 0               # s0 = t, the position currently being filled
    li   s1, 10001           # loop bound
loop_target:
    bge  s0, s1, end_loop

    # Find the smallest n with (1 << n) > t, i.e. while ((1 << n) <= t) n++.
    li   s2, 0               # s2 = n
    li   s3, 1               # s3 = 1 << n
calc_n:
    bgt  s3, s0, finish_calc_n
    addi s2, s2, 1
    slli s3, s3, 1
    j    calc_n

finish_calc_n:
    addi s4, s3, -1          # s4 = (1 << n) - 1
    beq  s4, s0, store_n
    sub  a0, s4, s0          # rest = (1 << n) - 1 - t
    jal  fp32_to_fp16        # partial port: the converted rest is not used yet

store_n:
    slli t5, s0, 2           # each dp[t] row is two halfwords = 4 bytes
    la   t6, dp
    add  t6, t6, t5          # t6 = &dp[t][0]
    mv   a0, s2
    jal  fp32_to_fp16        # convert n
    sh   a1, 0(t6)           # dp[t][0]
    addi a0, s2, 1
    jal  fp32_to_fp16        # convert n + 1
    sh   a1, 2(t6)           # dp[t][1]

    addi s0, s0, 1
    j    loop_target

end_loop:
    li   a7, 10              # exit system call (RARS/Venus convention)
    ecall
```

# CPU Pipeline

![CPU pipeline](https://hackmd.io/_uploads/rkuvliqykx.jpg)

Stages and functions:

- **IF (Instruction Fetch):** fetches the instruction from memory at the address held in the Program Counter (PC).
- **ID (Instruction Decode):** decodes the fetched instruction and reads the required register values.
- **EX (Execute):** performs the ALU operation or address calculation for the decoded instruction.
- **MEM (Memory Access):** accesses data memory for load and store instructions.
- **WB (Write Back):** writes the result back to the destination register for later use.
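To make the stage flow concrete, the table below traces three consecutive instructions from the `fp16_to_fp32` routine above through the pipeline, assuming an ideal five-stage pipeline that fetches one instruction per cycle and forwards results between stages, so the `t2` dependency between the first two instructions causes no stall:

| Cycle             | 1  | 2  | 3  | 4   | 5   | 6   | 7   |
| ----------------- | -- | -- | -- | --- | --- | --- | --- |
| `and t2, a0, t1`  | IF | ID | EX | MEM | WB  |     |     |
| `slli t2, t2, 13` |    | IF | ID | EX  | MEM | WB  |     |
| `and t3, a0, t0`  |    |    | IF | ID  | EX  | MEM | WB  |

Once the pipeline is full, one instruction completes every cycle, which is the throughput gain over running each instruction's five stages back to back.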