
# Computer Architecture HW2

contributed by < [`adamboy447`](https://github.com/adamboy447/classify-rv32i) >

## Part A

### Task 1: Relu

```Assembly=
relu_loop:
    bge  t1, a1, relu_done     # If index >= length, exit loop
    slli t3, t1, 2             # Offset = index * 4
    add  t4, t5, t3            # Address of element
    lw   t2, 0(t4)             # Load element
    blt  t2, zero, relu_zero
    j    relu_store
relu_zero:
    li   t2, 0                 # If element < 0, set to 0
relu_store:
    sw   t2, 0(t4)             # Store updated value
    addi t1, t1, 1             # Increment index
    j    relu_loop
```

For each index, load the array element into `t2` and compare it with 0. If `t2` is negative, set it to 0; otherwise, leave it unchanged. The value is then stored back to the same address, so the array is updated in place. The loop ends once the index `t1` reaches the array length `a1`.

### Task 2: ArgMax

```Assembly=
argmax_loop_start:
    beq  t2, a1, argmax_done   # If end of array, exit loop
    slli t3, t2, 2             # Offset for current element
    add  t4, a0, t3            # Address of element
    lw   t5, 0(t4)             # Load element
    bge  t0, t5, argmax_skip   # Skip update unless element > current max
    mv   t0, t5                # Update max value
    mv   t1, t2                # Update max index
argmax_skip:
    addi t2, t2, 1             # Increment loop counter
    j    argmax_loop_start
argmax_done:
    mv   a0, t1                # Return index
    jr   ra
```

Before the loop, `t0` and `t1` are initialized to hold the current maximum value and its index, respectively. Each iteration loads the current element into `t5` and compares it with `t0`. If `t5` is strictly greater than `t0`, update `t0` with `t5` and `t1` with the current index; otherwise, skip the update, so on ties the earliest index is kept. The loop continues until the counter `t2` equals the number of elements `a1`, at which point `t1` holds the index of the maximum value and is returned in `a0`.
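As a cross-check for the two loops above, here is a minimal Python reference model. It is only a sketch: the function names and the list-based arrays are assumptions for illustration, and it assumes that ties in ArgMax resolve to the smallest index.

```python
def relu(values):
    """In-place ReLU: replace every negative element with 0."""
    for i in range(len(values)):
        if values[i] < 0:
            values[i] = 0
    return values


def argmax(values):
    """Return the index of the largest element (earliest index on ties)."""
    best_value = values[0]
    best_index = 0
    for i in range(1, len(values)):
        if values[i] > best_value:  # strict '>' keeps the earliest index
            best_value = values[i]
            best_index = i
    return best_index


print(relu([3, -1, 0, -7, 5]))  # [3, 0, 0, 0, 5]
print(argmax([2, 9, 4, 9]))     # 1
```

Comparing the assembly output against a model like this on arrays that contain negative values and repeated maxima is a quick way to catch off-by-one and tie-handling mistakes.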
### Task 3.1: Dot Product

```Assembly=
    # Prologue: save ra and s0-s8
    addi sp, sp, -40
    sw   ra, 36(sp)
    sw   s0, 32(sp)
    sw   s1, 28(sp)
    sw   s2, 24(sp)
    sw   s3, 20(sp)
    sw   s4, 16(sp)
    sw   s5, 12(sp)
    sw   s6, 8(sp)
    sw   s7, 4(sp)
    sw   s8, 0(sp)
    . . .
    # Epilogue: restore ra and s0-s8
    lw   ra, 36(sp)
    lw   s0, 32(sp)
    lw   s1, 28(sp)
    lw   s2, 24(sp)
    lw   s3, 20(sp)
    lw   s4, 16(sp)
    lw   s5, 12(sp)
    lw   s6, 8(sp)
    lw   s7, 4(sp)
    lw   s8, 0(sp)
    addi sp, sp, 40
```

Because other functions call this function later, it must first save the callee-saved registers it uses (`s0`-`s8`, together with `ra`) on the stack and restore them before returning, so that the caller's values are not overwritten.

```Assembly=
    li   s0, 0          # sum
    li   s1, 0          # counter
    mv   s2, a0         # array1 ptr
    mv   s3, a1         # array2 ptr
    slli s4, a3, 2      # stride1 bytes
    slli s5, a4, 2      # stride2 bytes
    mv   s6, a2         # element count
    . . .
accumulate:
    add  s0, s0, t3     # add to total
    # Update pointers and counter
    add  s2, s2, s4     # next arr1 element
    add  s3, s3, s5     # next arr2 element
    addi s1, s1, 1      # increment counter
    j    loop
    . . .
    # Load values
    lw   t1, 0(s2)      # load arr1 element
    lw   t2, 0(s3)      # load arr2 element
```

Initialize `s0` to 0 for the sum and `s1` to 0 for the counter. The stride values (`a3`, `a4`) are given in elements, so they are shifted left by 2 (multiplied by 4) to obtain byte strides in `s4` and `s5`. In each iteration, the current elements are loaded from the addresses in `s2` and `s3` into `t1` and `t2`. After their product has been accumulated, the byte strides are added to `s2` and `s3` to move to the next elements of the two arrays.

```Assembly=
multiplication:
    li   t3, 0          # product
    mv   t4, t1         # multiplicand
    mv   t5, t2         # multiplier
    li   t6, 0          # sign flag
    # Handle signs
    bgez t4, check_t5
    sub  t4, zero, t4
    xori t6, t6, 1
check_t5:
    bgez t5, mult_loop
    sub  t5, zero, t5
    xori t6, t6, 1
mult_loop:
    andi t0, t5, 1      # check LSB
    beq  t0, zero, shift_nums
    add  t3, t3, t4     # add partial product
shift_nums:
    slli t4, t4, 1      # shift left multiplicand
    srli t5, t5, 1      # shift right multiplier
    bne  t5, zero, mult_loop   # continue if multiplier != 0
    # Handle product sign
    beq  t6, zero, accumulate
    sub  t3, zero, t3   # negate if needed
```

Check the signs of the multiplicand (`t4`) and multiplier (`t5`). If either is negative, negate it and toggle the sign flag (`t6`), so the loop only works on non-negative values.
Then, check the least significant bit (LSB) of the multiplier. If it is 1, add the multiplicand to the product (`t3`). Shift the multiplicand left by 1 and the multiplier right by 1, continuing the loop until the multiplier becomes 0. Finally, if the sign flag is set (exactly one operand was negative), negate the product to obtain the final result.

### Task 3.2: Matrix Multiplication

```Assembly=
    li s0, 0    # outer loop counter
    li s1, 0    # inner loop counter
    mv s2, a6   # incrementing result matrix pointer
    mv s3, a0   # incrementing matrix A pointer, increments during outer loop
    mv s4, a3   # incrementing matrix B pointer, increments during inner loop
```

In the text below, M0, M1, and D refer to matrix A, matrix B, and the result matrix in the code comments.

- `s0`: Outer loop counter, corresponds to the number of rows in M0.
- `s1`: Inner loop counter, corresponds to the number of columns in M1.
- `s2`: Pointer to the current position in the result matrix D.
- `s3`: Pointer to the current row in matrix M0.
- `s4`: Pointer to the current column in matrix M1.

```Assembly=
outer_loop_start:
    # s0 is going to be the loop counter for the rows in A
    li  s1, 0
    mv  s4, a3
    blt s0, a1, inner_loop_start
    j   outer_loop_end
```

Iterate over each row of M0. At the start of every row, reset the inner loop counter `s1` and point `s4` back at the first column of M1. If `s0` is less than the number of rows `a1`, proceed to the inner loop; otherwise, exit the outer loop.

```Assembly=
inner_loop_start:
    beq  s1, a5, inner_loop_end
    addi sp, sp, -24
    sw   a0, 0(sp)
    sw   a1, 4(sp)
    sw   a2, 8(sp)
    sw   a3, 12(sp)
    sw   a4, 16(sp)
    sw   a5, 20(sp)
    mv   a0, s3   # setting pointer for matrix A into the correct argument value
    mv   a1, s4   # setting pointer for matrix B into the correct argument value
    mv   a2, a2   # setting the number of elements to use to the columns of A
    li   a3, 1    # stride for matrix A
    mv   a4, a5   # stride for matrix B
    jal  dot
```

Iterate over each column of M1. Save the current argument registers `a0`-`a5` by adjusting the stack pointer and storing them on the stack, because they are about to be overwritten. Then set up the arguments for the `dot` function to calculate the dot product of the current row of M0 (stride 1) and the current column of M1 (stride equal to the number of columns in M1).
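The argument setup above can be illustrated with a small Python model. This is a sketch under stated assumptions: `memory` is a hypothetical flat list of words standing in for RAM, positions are element indices rather than byte addresses, and both matrices are stored row-major. The parameters of `dot` mirror `a0`-`a4`.

```python
def dot(memory, base0, base1, count, stride0, stride1):
    """Strided dot product over a flat word array (indices, not bytes)."""
    total = 0
    i0, i1 = base0, base1
    for _ in range(count):
        total += memory[i0] * memory[i1]
        i0 += stride0  # next element of the first vector
        i1 += stride1  # next element of the second vector
    return total


def matmul(memory, a_base, a_rows, a_cols, b_base, b_rows, b_cols):
    """Row-major A (a_rows x a_cols) times row-major B (b_rows x b_cols)."""
    assert a_cols == b_rows, "inner dimensions must match"
    result = []
    for row in range(a_rows):          # outer loop: rows of A
        for col in range(b_cols):      # inner loop: columns of B
            result.append(
                dot(memory,
                    a_base + row * a_cols,  # start of the current row of A
                    b_base + col,           # top of the current column of B
                    a_cols,                 # number of elements to multiply
                    1,                      # row elements are adjacent
                    b_cols))                # column elements are b_cols apart
    return result


# A = [[1, 2, 3], [4, 5, 6]] at index 0, B = [[7, 8], [9, 10], [11, 12]] at index 6
memory = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(matmul(memory, 0, 2, 3, 6, 3, 2))  # [58, 64, 139, 154]
```

The stride arguments are what let a single `dot` routine walk a row of one matrix and a column of the other without copying any data.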
```Assembly=
    mv   t0, a0       # storing result of the dot product into t0
    sw   t0, 0(s2)
    addi s2, s2, 4    # incrementing pointer for result matrix
```

Store the result of the dot product (`t0`) into the current position of the result matrix D, then increment the result matrix pointer `s2` by 4 bytes to move to the next element.

```Assembly=
    li   t1, 4
    add  s4, s4, t1   # incrementing the column on matrix B
    addi s1, s1, 1
    j    inner_loop_start
```

Increment the pointer for matrix B by 4 bytes to move to the first element of the next column, and increment the inner loop counter `s1`. Jump back to the start of the inner loop.

```Assembly=
inner_loop_end:
    slli t1, a2, 2    # t1 = a2 << 2 (equivalent to a2 * 4)
    add  s3, s3, t1   # Move A pointer to next row
    # Increment outer loop counter
    addi s0, s0, 1
    # Jump back to outer loop
    j    outer_loop_start
```

After completing the inner loop for all columns of M1, calculate the byte offset to the next row of matrix M0 by shifting its column count `a2` left by 2 bits (multiplying by 4). Move the pointer for matrix M0 to the next row, increment the outer loop counter `s0`, and jump back to the start of the outer loop.

## Part B

### Task 1: Read Matrix

```Assembly=
    li   s1, 0        # Initialize result
    mv   t3, t1       # Copy number of rows to t3
multiply_loop:
    beq  t3, zero, multiply_done
    add  s1, s1, t2   # Add number of columns each iteration
    addi t3, t3, -1   # Decrement row counter
    j    multiply_loop
```

This loop computes the number of elements (rows × columns) by repeated addition instead of `mul`. First, initialize the result `s1` to 0 and copy the number of rows `t1` to `t3` as the loop counter. Check if `t3` is 0; if so, exit the loop. Otherwise, add the number of columns `t2` to the result `s1`, decrement the row counter `t3`, and continue looping.

### Task 2: Write Matrix

Same as above: the number of elements is computed with the same repeated-addition loop.
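The two `mul` replacements used in this report can be compared side by side in Python. This is a minimal sketch: the function names are hypothetical, and Python integers do not wrap at 32 bits, so overflow behaviour is not modelled. Repeated addition takes one iteration per row, while shift-and-add takes one iteration per bit of the multiplier.

```python
def multiply_by_repeated_addition(rows, cols):
    """Mirror of multiply_loop: add cols to the result rows times (rows >= 0)."""
    result = 0
    remaining = rows
    while remaining != 0:
        result += cols
        remaining -= 1
    return result


def multiply_shift_add(a, b):
    """Signed shift-and-add multiply, as in the dot product routine."""
    negative = (a < 0) != (b < 0)      # sign flag: exactly one operand negative
    multiplicand, multiplier = abs(a), abs(b)
    product = 0
    while multiplier != 0:
        if multiplier & 1:             # LSB set: add the partial product
            product += multiplicand
        multiplicand <<= 1             # shift multiplicand left
        multiplier >>= 1               # shift multiplier right
    return -product if negative else product


print(multiply_by_repeated_addition(3, 4))  # 12
print(multiply_shift_add(-6, 7))            # -42
```

For matrix dimensions, which are small and non-negative, the simpler repeated-addition loop is adequate; for arbitrary signed matrix elements, the shift-and-add version bounds the work at 32 iterations on RV32.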
### Task 3: Classification

```Assembly=
    li   a0, 4
    jal  malloc                # malloc 4 bytes for an integer, rows
    beq  a0, x0, error_malloc
    mv   s3, a0                # save m0 rows pointer for later
    li   a0, 4
    jal  malloc                # malloc 4 bytes for an integer, cols
    beq  a0, x0, error_malloc
    mv   s4, a0                # save m0 cols pointer for later
```

Allocate memory to store the matrix's row and column counts by loading 4 into `a0` and calling `malloc`. If `malloc` fails (returns 0), jump to the `error_malloc` label. Otherwise, save the allocated addresses for the row and column counts in registers `s3` and `s4`, respectively.

```Assembly=
    lw   a0, 4(a1)     # set argument 1 for the read_matrix function
    mv   a1, s3        # set argument 2 for the read_matrix function
    mv   a2, s4        # set argument 3 for the read_matrix function
    jal  read_matrix
    mv   s0, a0        # setting s0 to the m0, aka the return value of read_matrix
```

Load the filename from the command-line arguments into `a0` by loading from `4(a1)`. Set `a1` and `a2` to the pointers for rows (`s3`) and columns (`s4`). Call the `read_matrix` function to read the matrix from the file, then store the returned matrix address from `a0` into register `s0`.

```Assembly=
    mv   a0, s0        # move m0 array to first arg
    lw   a1, 0(s3)     # move m0 rows to second arg
    lw   a2, 0(s4)     # move m0 cols to third arg
    mv   a3, s2        # move input array to fourth arg
    lw   a4, 0(s7)     # move input rows to fifth arg
    lw   a5, 0(s8)     # move input cols to sixth arg
    jal  matmul
```

Pass the address of matrix M0 in `a0`, its number of rows in `a1`, and its number of columns in `a2`. Similarly, pass the input array address in `a3`, the input rows in `a4`, and the input columns in `a5`. Call the `matmul` function to perform the matrix multiplication.

```Assembly=
    mv   a0, s9        # move h to the first argument
    jal  relu
```

Move the address of the hidden layer matrix h (stored in `s9`) into `a0` and call the `relu` function to apply the ReLU activation function to the hidden layer.
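The steps up to this point compute the hidden layer h = ReLU(M0 × input). A rough Python sketch of that half of the pipeline follows; nested lists stand in for the matrices, and the helper names and sample values are made up for illustration. It also shows why h has as many rows as M0 and as many columns as the input.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert len(a[0]) == inner, "columns of a must equal rows of b"
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def relu(matrix):
    """Element-wise ReLU: negative entries become 0."""
    return [[max(value, 0) for value in row] for row in matrix]


def hidden_layer(m0, inp):
    """h = ReLU(m0 x inp); h is (rows of m0) x (columns of inp)."""
    return relu(matmul(m0, inp))


m0 = [[1, -2], [3, 4], [-1, -1]]   # 3 x 2 weights (example values)
inp = [[5], [6]]                   # 2 x 1 input column (example values)
h = hidden_layer(m0, inp)
print(h)                           # [[0], [39], [0]]
print(len(h), len(h[0]))           # 3 1
```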
```Assembly=
    mv   a0, s1        # move m1 array to first arg
    lw   a1, 0(s5)     # move m1 rows to second arg
    lw   a2, 0(s6)     # move m1 cols to third arg
    mv   a3, s9        # move h array to fourth arg
    lw   a4, 0(s3)     # move h rows to fifth arg
    lw   a5, 0(s8)     # move h cols to sixth arg
    jal  matmul
```

Pass the address of matrix M1 in `a0`, its number of rows in `a1`, and its number of columns in `a2`. Similarly, pass the address of the hidden layer matrix h in `a3`, its number of rows in `a4` (the rows of M0), and its number of columns in `a5` (the columns of the input). Call the `matmul` function to perform the second matrix multiplication.

```Assembly=
    mv   a0, s10       # load o array into first arg
    jal  argmax
```

Move the address of the output matrix O (stored in `s10`) into `a0` and call the `argmax` function to find the index of the maximum value in O.

When writing individual functions, everything worked fine. However, once they were integrated into the overall program, errors appeared whose cause was initially difficult to identify. After examining the error messages during program execution for a long time, I realized that the issue was related to how the stack pointer (`sp`) was saved and restored and to the improper use of registers. After making the necessary corrections, everything worked correctly.

When writing replacements for the `mul` instruction, I often got the logic wrong, causing the program to fail repeatedly. Additionally, I originally intended to write a separate `mul.s` function, but it kept failing. Although the logic seemed correct, I eventually decided to embed the multiplication logic directly into the main program instead of calling the separate `mul.s` function.

![image](https://hackmd.io/_uploads/HJlComOGJe.png)

![image](https://hackmd.io/_uploads/ByhAi7uzye.png)