Computer Architecture HW1

# Assignment1: RISC-V Assembly and Instruction Pipeline contributed by < `周姵彣` > ## quiz1 - problem B The problem requires converting a 32-bit floating point number (IEEE 754 float format) into a 16-bit bfloat16 format. ### Purpose - Reduce memory usage: bfloat16 uses 16 bits instead of 32 bits. - Increase computational speed: Faster processing due to reduced data size. - Maintain dynamic range: Exponent bits remain the same, allowing for the same numerical range. - Sacrifice precision: bfloat16 retains only 7 bits for the mantissa, resulting in lower precision, but generally sufficient. ### C Code ```c static inline bf16_t fp32_to_bf16(float s) { union { float f; uint32_t i; } u = {.f = s}; // Use union to handle both float and integer representations // Check for NaN case if ((u.i & 0x7FFFFFFF) > 0x7F800000) { // NaN check bf16_t h; h.bits = (u.i >> 16) | 0x0040; // Force it to be quiet NaN return h; } // Normal case bf16_t h; h.bits = (u.i >> 16) + ((u.i >> 15) & 1); // Perform rounding return h; } ``` :::danger Don't paste code snip without comprehensive discussions. ::: ### Assembly Code ```c .data argument: .word 0x3E800000 # float 0.25 (32-bit IEEE 754) .text _start: j main fp32_to_bf16: la a0, argument # Load the address of the argument lw a1, 0(a0) # Load float 0.25 (32-bit IEEE 754) from memory # NaN Check li a2, 0x7F800000 # NaN threshold li a3, 0x7FFFFFFF # Mask to remove the sign bit and a4, a1, a3 # u.i & 0x7FFFFFFF bgtu a4, a2, non_case # If u.i > 0x7F800000, handle NaN case # Rounding and conversion to bfloat16 li a2, 0x7FFF # Constant 0x7FFF srli a3, a1, 16 # Shift right to extract the top 16 bits of u.i li t0, 0x8000 # Load the 0x8000 value for rounding and a3, a1, t0 # Check the 17th bit for rounding (u.i >> 15) & 1 add a3, a3, a2 # 0x7FFF + (u.i >> 15) & 1 (Rounding) add a3, a3, a1 # u.i + (0x7FFF + (u.i >> 15) & 1) srli a0, a3, 16 # Shift right to get the final bfloat16 value ret # Return the result in a0 non_case: # Handle NaN case srli a1, a1, 16 # Shift right by 16 bits li a4, 0x0040 # Set the quiet NaN flag or a0, a1, a4 # Set the final result as (u.i >> 16) | 0x40 ret print_result: li a7, 1 # syscall number for print integer (assuming RISC-V Linux) ecall # make system call to print the value in a0 ret main: jal ra, fp32_to_bf16 # Call the fp32_to_bf16 function jal ra, print_result # Print the result stored in a0 end: nop ``` ### Result ![image](https://hackmd.io/_uploads/H1a7InIykx.png) ![image](https://hackmd.io/_uploads/S13rIhLkkl.png) ## [LeetCode: 238. Product of Array Except Self](https://leetcode.com/problems/product-of-array-except-self/description/) ### Motivation I chose the Product of Array Except Self problem because I believe it involves numerical operations and array manipulation, which is similar to the process of floating-point conversion (such as FP32 to BF16). Both emphasize data handling, efficiency optimization, and avoiding unnecessary operations. This problem can help me practice how to achieve optimization. ### Description Given an integer array nums, return an array answer such that answer[i] is equal to the product of all the elements of nums except nums[i]. ### Solution This problem can be solved using the "prefix product" and "suffix product" approach: 1. **Forward pass**: First, calculate the product of all elements to the left of each element and store it in the result array. 2. **Backward pass**: Then, iterate from right to left, calculating the product of all elements to the right of each element while updating the result array. This allows solving the problem in O(n) time without using division. ### C Code ```c int* productExceptSelf(int* nums, int numsSize, int* returnSize) { int* answer = (int*)malloc(numsSize * sizeof(int)); *returnSize = numsSize; answer[0] = 1; for (int i = 1; i < numsSize; i++) { answer[i] = answer[i - 1] * nums[i - 1]; } int right = 1; for (int i = numsSize - 1; i >= 0; i--) { answer[i] = answer[i] * right; right *= nums[i]; } return answer; } ``` ## LeetCode: 238. Product of Array Except Self(revised version) Corrections: 1. Since assembly language does not have the concept of dynamic allocation, I have modified it to use static allocation. 2. Changed integer to float. ### C Code ```c #include <stdio.h> float* productExceptSelf(float nums[], int numsSize, float answer[]) { answer[0] = 1.0; for (int i = 1; i < numsSize; i++) { answer[i] = answer[i - 1] * nums[i - 1]; } float right = 1.0; for (int i = numsSize - 1; i >= 0; i--) { answer[i] = answer[i] * right; right *= nums[i]; } return &answer[0]; } int main() { float nums[] = {1.0, 2.0, 3.0, 4.0}; int numsSize = sizeof(nums) / sizeof(nums[0]); float answer[numsSize]; productExceptSelf(nums, numsSize, answer); // Print the answer array printf("Answer array:\n"); for (int i = 0; i < numsSize; i++) { printf("answer[%d] = %f\n", i, answer[i]); } return 0; } ``` ### Result ![messageImage_1728887565506](https://hackmd.io/_uploads/HyI4AN5J1e.jpg) ### Assembly Code ```c .globl productExceptSelf .text productExceptSelf: addi sp, sp, -16 # Adjust stack pointer to allocate space for saving registers sw ra, 12(sp) # Save return address sw s0, 8(sp) # Save register s0 sw s1, 4(sp) # Save register s1 sw s2, 0(sp) # Save register s2 add s0, a0, x0 # s0 = address of nums[] add s1, a1, x0 # s1 = numsSize add s2, a2, x0 # s2 = address of answer[] addi t0, x0, 1 # t0 = left running product, initialize to 1 addi t3, x0, 0 # t3 = index (initialize to 0) # Calculate left product left_product_loop: bge t3, s1, right_product_init # If t3 >= numsSize, exit the loop # Calculate answer[t3] slli t4, t3, 2 # t4 = t3 * 4 (calculate current index offset) add t5, s2, t4 # t5 = &answer[t3] sw t0, 0(t5) # answer[t3] = left running product # Update left running product add t6, s0, t4 # t6 = &nums[t3] lw t4, 0(t6) # t4 = nums[t3] add t0, t0, t5 # Accumulate product into t0 addi t3, t3, 1 # t3++ j left_product_loop # Continue calculating left product right_product_init: addi t1, x0, 1 # Initialize t1 to 1 (right running product) addi t3, s1, -1 # t3 = numsSize - 1 # Calculate right product right_product_loop: blt t3, x0, end_function # If t3 < 0, exit the loop # Calculate answer[t3] = answer[t3] * right slli t4, t3, 2 # t4 = t3 * 4 (calculate current index offset) add t5, s2, t4 # t5 = &answer[t3] lw t6, 0(t5) # t6 = answer[t3] add t6, t6, t1 # Simulate multiplication, add right to t6 sw t6, 0(t5) # Update answer[t3] # Update right as right * nums[t3] add t6, s0, t4 # t6 = &nums[t3] lw t5, 0(t6) # t5 = nums[t3] add t1, t1, t5 # Simulate multiplication, add nums[t3] to right addi t3, t3, -1 # t3-- j right_product_loop # Continue calculating right product end_function: lw ra, 12(sp) # Restore return address lw s0, 8(sp) # Restore register s0 lw s1, 4(sp) # Restore register s1 lw s2, 0(sp) # Restore register s2 addi sp, sp, 16 # Restore stack pointer # Program exit li a7, 10 ecall ``` ### Result ![image](https://hackmd.io/_uploads/Hy4YR4ck1l.png) ### Analysis Pipeline Execution of Instructions: Taking the core computation process of the productExceptSelf function as an example, the program calculates the product of all elements in the array except for the current element. Assuming an input array of length 4, [1, 2, 3, 4], the operations will be carried out in the pipeline stages as follows: #### IF (Instruction Fetch) In the first stage, the Instruction Fetch (IF) stage, the processor reads the instruction from memory. For example, the instruction addi sp, sp, -16 is fetched from memory and placed into the pipeline. The Program Counter (PC) is then incremented, so that the next instruction can be fetched during the following cycle. #### ID (Instruction Decode) In the Instruction Decode (ID) stage, the fetched instruction is decoded. The registers are identified, and immediate values are extracted. For example, in the instruction addi t0, x0, 1, the immediate value 1 is loaded into register t0. The processor also determines whether the instruction involves memory access or ALU operations. #### EX (Execute) In the Execute (EX) stage, arithmetic or memory address calculations are performed. For example, the instruction add t5, s2, t4 adds the base address in s2 and the offset in t4 to calculate the address of answer[t3]. This operation is part of the core computation for determining the location of the element to be updated in the answer[] array. #### MEM (Memory Access) In the Memory Access (MEM) stage, memory read and write operations are performed. For example, the instruction sw t0, 0(t5) stores the value of the left running product from register t0 into the memory location pointed to by t5 (which corresponds to answer[t3]). Similarly, the instruction lw t4, 0(t6) loads the value from the address pointed to by t6 (which corresponds to nums[t3]) into register t4. #### WB (Write Back) In the Write Back (WB) stage, the final result of the instruction is written back to the destination register. For instance, in the instruction add t6, t6, t1, the result of the addition is stored in t6. This ensures that the updated value is correctly retained for subsequent use in the computation. #### Main Loop and Core Computation: The main computation occurs within the loop, where the left and right products are calculated for each element of the array. The process involves two main loops: the left product loop and the right product loop. In the left product loop, the processor calculates the running product of elements from the start of the array up to each element, storing the intermediate results in the answer[] array. In the right product loop, the processor calculates the running product of elements from the end of the array, multiplying it with the value already stored in answer[] to get the final product except self for each element. In both loops, instructions are fetched, decoded, executed, and the results are written to either registers or memory. The left running product is accumulated in t0, while the right running product is accumulated in t1. Each iteration updates the answer[] array, ensuring that every element in answer[] contains the product of all elements in nums[] except the current one.