# Computer Architecture HW2
contributed by < [`adamboy447`](https://github.com/adamboy447/classify-rv32i) >
## Part A
### Task 1: Relu
```Assembly=
relu_loop:
bge t1, a1, relu_done # If index >= length, exit loop
slli t3, t1, 2 # Offset = index * 4
add t4, t5, t3 # Address of element
lw t2, 0(t4) # Load element
blt t2, zero, relu_zero
j relu_store
relu_zero:
li t2, 0 # If element < 0, set to 0
relu_store:
sw t2, 0(t4) # Store updated value
addi t1, t1, 1 # Increment index
j relu_loop
```
Each iteration loads the current array element into t2 and compares it with 0. If t2 is negative, it is set to 0; otherwise, it is left unchanged. The value is then stored back, so once the loop has visited every element the array has been updated in place.
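As a cross-check, the loop can be modelled in a few lines of Python. This is only a sketch: the function name `relu` and the use of a Python list in place of the memory buffer are assumptions made for illustration, not part of the assembly source.

```python
def relu(values):
    """In-place ReLU over a list of ints, mirroring the assembly loop."""
    index = 0                      # t1: loop index
    while index < len(values):     # bge t1, a1, relu_done
        element = values[index]    # lw t2, 0(t4)
        if element < 0:            # blt t2, zero, relu_zero
            element = 0            # li t2, 0
        values[index] = element    # sw t2, 0(t4)
        index += 1                 # addi t1, t1, 1
    return values


print(relu([-3, 0, 7, -1]))  # [0, 0, 7, 0]
```

Comparing the assembly output against a model like this on a few small arrays is a quick way to catch indexing mistakes.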
### Task 2: ArgMax
```Assembly=
argmax_loop_start:
beq t2, a1, argmax_done # If end of array, exit loop
slli t3, t2, 2 # Offset for current element
add t4, a0, t3 # Address of element
lw t5, 0(t4) # Load element
bge t0, t5, argmax_skip # Skip update unless element > current max
mv t0, t5 # Update max value
mv t1, t2 # Update max index
argmax_skip:
addi t2, t2, 1 # Increment loop counter
j argmax_loop_start
argmax_done:
mv a0, t1 # Return index
jr ra
```
Registers t0 and t1 hold the current maximum value and its index. On each iteration, load the current element into t5 and compare it with t0. If t5 is greater than t0, update t0 with t5 and t1 with the current index t2; otherwise, skip the update. The loop ends when the counter t2 equals the number of elements a1, at which point t1 holds the index of the maximum value and is returned in a0.
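The scan described above can be sketched in Python as follows. The initial state is an assumption, since the snippet does not show it: the model starts with the first element as the maximum and begins the loop at index 1. It updates only on a strictly greater value, so ties keep the earliest index.

```python
def argmax(values):
    """Return the index of the first largest element of a non-empty list."""
    max_value = values[0]        # t0 (assumed initial value: first element)
    max_index = 0                # t1
    i = 1                        # t2 (assumed to start after the first element)
    while i != len(values):      # beq t2, a1, argmax_done
        current = values[i]      # lw t5, 0(t4)
        if current > max_value:  # update only when strictly greater
            max_value = current  # mv t0, t5
            max_index = i        # mv t1, t2
        i += 1                   # addi t2, t2, 1
    return max_index             # mv a0, t1


print(argmax([3, 9, 2, 9]))  # 1
```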
### Task 3.1: Dot Product
```Assembly=
# Prologue: reserve 40 bytes and save ra and s0-s8
addi sp, sp, -40
sw ra, 36(sp)
sw s0, 32(sp)
sw s1, 28(sp)
sw s2, 24(sp)
sw s3, 20(sp)
sw s4, 16(sp)
sw s5, 12(sp)
sw s6, 8(sp)
sw s7, 4(sp)
sw s8, 0(sp)
# ... function body ...
# Epilogue: restore the saved registers and release the frame
lw ra, 36(sp)
lw s0, 32(sp)
lw s1, 28(sp)
lw s2, 24(sp)
lw s3, 20(sp)
lw s4, 16(sp)
lw s5, 12(sp)
lw s6, 8(sp)
lw s7, 4(sp)
lw s8, 0(sp)
addi sp, sp, 40
```
Because other functions call this function later, it must follow the calling convention: it saves ra and the callee-saved registers s0 to s8 on entry and restores them before returning, so the caller's values are not overwritten.
```Assembly=
li s0, 0 # sum
li s1, 0 # counter
mv s2, a0 # array1 ptr
mv s3, a1 # array2 ptr
slli s4, a3, 2 # stride1 bytes
slli s5, a4, 2 # stride2 bytes
mv s6, a2 # element count
.
.
.
accumulate:
add s0, s0, t3 # add to total
# Update pointers and counter
add s2, s2, s4 # next arr1 element
add s3, s3, s5 # next arr2 element
addi s1, s1, 1 # increment counter
j loop
.
.
.
# Load values
lw t1, 0(s2) # load arr1 element
lw t2, 0(s3) # load arr2 element
```
Initialize s0 (the sum) and s1 (the counter) to 0. The strides a3 and a4 are shifted left by 2, which multiplies them by 4 and converts them from element counts to byte offsets (s4, s5). On each iteration, the elements at addresses s2 and s3 are loaded into t1 and t2, and after their product is added to the sum, the pointers advance by s4 and s5 to reach the next elements.
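A minimal Python model of the strided traversal is shown below. It is a sketch under stated assumptions: the parameter names are made up for illustration, list indices stand in for byte addresses (so the shift by 2 disappears), and the built-in `*` replaces the shift-and-add multiplication used in the assembly.

```python
def dot(arr0, arr1, count, stride0, stride1):
    """Strided dot product over `count` element pairs."""
    total = 0                 # s0: running sum
    i0 = 0                    # models s2 as an element index
    i1 = 0                    # models s3 as an element index
    for _ in range(count):    # s1 counts up to s6
        total += arr0[i0] * arr1[i1]
        i0 += stride0         # add s2, s2, s4
        i1 += stride1         # add s3, s3, s5
    return total


print(dot([1, 2, 3], [4, 5, 6], 3, 1, 1))  # 32
```

With a stride of 2 on both arrays, `dot([1, 2, 3, 4, 5, 6], [1, 0, 1, 0, 1, 0], 3, 2, 2)` visits indices 0, 2, and 4 and returns 9.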
```Assembly=
multiplication:
li t3, 0 # product
mv t4, t1 # multiplicand
mv t5, t2 # multiplier
li t6, 0 # sign flag
# Handle signs
bgez t4, check_t5
sub t4, zero, t4
xori t6, t6, 1
check_t5:
bgez t5, mult_loop
sub t5, zero, t5
xori t6, t6, 1
mult_loop:
andi t0, t5, 1 # check LSB
beq t0, zero, shift_nums
add t3, t3, t4 # add partial product
shift_nums:
slli t4, t4, 1 # shift left multiplicand
srli t5, t5, 1 # shift right multiplier
bne t5, zero, mult_loop # continue if multiplier != 0
# Handle product sign
beq t6, zero, accumulate
sub t3, zero, t3 # negate if needed
```
Check the signs of the multiplicand (t4) and multiplier (t5). Each negative operand is replaced by its absolute value and toggles the sign flag (t6), so the flag ends up set only when exactly one operand is negative. Then, check the least significant bit (LSB) of the multiplier. If it is 1, add the multiplicand to the product (t3). Shift the multiplicand left by 1 and the multiplier right by 1, continuing the loop until the multiplier becomes 0. Finally, if the sign flag is set, negate the product to obtain the final result.
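The shift-and-add procedure can be checked against Python's built-in multiplication with a small model. This is a sketch, not the author's code: the names `shift_add_mul` and `to_signed32` are hypothetical, and the final wrap to 32 bits is added because Python integers do not overflow the way registers do.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Wrap a Python int to a signed 32-bit value, as a register would hold it."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def shift_add_mul(a, b):
    product = 0                   # t3
    multiplicand = a              # t4
    multiplier = b                # t5
    negate = 0                    # t6: sign flag
    if multiplicand < 0:
        multiplicand = -multiplicand
        negate ^= 1
    if multiplier < 0:
        multiplier = -multiplier
        negate ^= 1
    while True:                   # the body runs at least once, like mult_loop
        if multiplier & 1:        # andi t0, t5, 1
            product += multiplicand
        multiplicand <<= 1        # slli t4, t4, 1
        multiplier >>= 1          # srli t5, t5, 1
        if multiplier == 0:       # bne t5, zero, mult_loop
            break
    if negate:
        product = -product        # sub t3, zero, t3
    return to_signed32(product)


print(shift_add_mul(-6, 7))  # -42
```

Running the model over mixed-sign pairs and zero operands covers the branches that are easiest to get wrong in the assembly.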
### Task 3.2: Matrix Multiplication
``` Assembly=
li s0, 0 # outer loop counter
li s1, 0 # inner loop counter
mv s2, a6 # incrementing result matrix pointer
mv s3, a0 # incrementing matrix A pointer, increments during outer loop
mv s4, a3 # incrementing matrix B pointer, increments during inner loop
```
s0: Outer loop counter over the rows of M0 (matrix A in the code comments).
s1: Inner loop counter over the columns of M1 (matrix B in the code comments).
s2: Pointer to the current position in the result matrix D.
s3: Pointer to the start of the current row in matrix M0.
s4: Pointer to the top of the current column in matrix M1.
``` Assembly=
outer_loop_start:
#s0 is going to be the loop counter for the rows in A
li s1, 0
mv s4, a3
blt s0, a1, inner_loop_start
j outer_loop_end
```
Iterate over each row of M0. At the top of each outer iteration, reset the inner counter s1 and the M1 pointer s4. If s0 is less than the number of rows a1, proceed to the inner loop; otherwise, exit the outer loop.
``` Assembly=
inner_loop_start:
beq s1, a5, inner_loop_end
addi sp, sp, -24
sw a0, 0(sp)
sw a1, 4(sp)
sw a2, 8(sp)
sw a3, 12(sp)
sw a4, 16(sp)
sw a5, 20(sp)
mv a0, s3 # setting pointer for matrix A into the correct argument value
mv a1, s4 # setting pointer for Matrix B into the correct argument value
mv a2, a2 # number of elements = columns of A (already in a2, so this is a no-op)
li a3, 1 # stride for matrix A
mv a4, a5 # stride for matrix B
jal dot
```
Iterate over each column of M1. Before calling dot, push a0 through a5 onto the stack, since they are about to be overwritten with dot's arguments. Then set up those arguments: the current row of M0 (s3) with stride 1, the current column of M1 (s4) with stride a5 (the number of columns in M1), and a2 elements to multiply.
``` Assembly=
mv t0, a0 # storing result of the dot product into t0
sw t0, 0(s2)
addi s2, s2, 4 # Incrementing pointer for result matrix
```
Store the result of the dot product (t0) into the current position of the result matrix D, then increment the result matrix pointer s2 by 4 bytes to move to the next element.
``` Assembly=
li t1, 4
add s4, s4, t1 # incrementing the column on Matrix B
addi s1, s1, 1
j inner_loop_start
```
Increment the pointer for matrix B by 4 bytes to move to the top of the next column, and increment the inner loop counter s1. Then jump back to the start of the inner loop.
``` Assembly=
inner_loop_end:
slli t1, a2, 2 # t1 = a2 << 2 (equivalent to a2 * 4)
add s3, s3, t1 # Move A pointer to next row
# Increment outer loop counter
addi s0, s0, 1
# Jump back to outer loop
j outer_loop_start
```
After completing the inner loop for all columns of M1, calculate the byte size of one row of M0 by shifting a2 (the number of columns in M0) left by 2 bits, which multiplies it by 4. Add this offset to s3 to move the M0 pointer to the next row, increment the outer loop counter s0, and jump back to the start of the outer loop.
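Putting the two loops together, the pointer arithmetic can be modelled in Python with element indices in place of byte addresses. This is a sketch under assumptions: matrices are flat row-major lists, the helper `dot` takes start indices (a simplification of passing pointers), and the dimension check is added for illustration.

```python
def dot(a, a_start, b, b_start, count, stride_a, stride_b):
    """Strided dot product starting at the given element indices."""
    total = 0
    for k in range(count):
        total += a[a_start + k * stride_a] * b[b_start + k * stride_b]
    return total


def matmul(m0, rows0, cols0, m1, rows1, cols1):
    if cols0 != rows1:                  # illustrative dimension check
        raise ValueError("inner dimensions must match")
    result = []                         # s2 walks this buffer in the assembly
    row_start = 0                       # s3: start of the current row of M0
    for _ in range(rows0):              # outer loop: s0 < a1
        col_start = 0                   # s4 reset to the first column of M1
        for _ in range(cols1):          # inner loop: s1 != a5
            # Row of M0 uses stride 1; column of M1 uses stride cols1.
            result.append(dot(m0, row_start, m1, col_start, cols0, 1, cols1))
            col_start += 1              # next column of M1 (4 bytes in assembly)
        row_start += cols0              # next row of M0 (cols0 * 4 bytes)
    return result


print(matmul([1, 2, 3, 4, 5, 6], 2, 3, [7, 8, 9, 10, 11, 12], 3, 2))
# [58, 64, 139, 154]
```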
## Part B
### Task 1: Read Matrix
``` Assembly=
li s1, 0 # Initialize result
mv t3, t1 # Copy number of rows to t3
multiply_loop:
beq t3, zero, multiply_done
add s1, s1, t2 # Add number of columns each iteration
addi t3, t3, -1 # Decrement row counter
j multiply_loop
```
This loop computes rows × columns, the number of matrix elements, by repeated addition instead of mul. First, initialize the result s1 to 0 and copy the number of rows t1 to t3 as the loop counter. Check if t3 is 0; if so, exit the loop. Otherwise, add the number of columns t2 to the result s1, decrement the row counter t3, and continue looping.
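The same repeated-addition loop, written as a short Python model. The function name `element_count` is hypothetical, and the sketch assumes a non-negative row count, since a negative one would never reach zero here.

```python
def element_count(rows, cols):
    """Compute rows * cols by adding cols once per row."""
    total = 0                # s1: result
    remaining = rows         # t3: copy of the row count
    while remaining != 0:    # beq t3, zero, multiply_done
        total += cols        # add s1, s1, t2
        remaining -= 1       # addi t3, t3, -1
    return total


print(element_count(3, 4))  # 12
```

The loop runs once per row, which is acceptable for small matrix dimensions.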
### Task 2: Write Matrix
Same as above: the element count rows × columns is computed by repeated addition instead of mul.
### Task 3: Classification
``` Assembly=
li a0, 4
jal malloc # malloc 4 bytes for an integer, rows
beq a0, x0, error_malloc
mv s3, a0 # save m0 rows pointer for later
li a0, 4
jal malloc # malloc 4 bytes for an integer, cols
beq a0, x0, error_malloc
mv s4, a0 # save m0 cols pointer for later
```
Allocate memory for M0's row and column counts by loading 4 into a0 and calling malloc, once for each. If malloc fails (returns 0), jump to the error_malloc label. Otherwise, save the two returned addresses in s3 (rows) and s4 (columns).
``` Assembly=
lw a0, 4(a1) # set argument 1 for the read_matrix function
mv a1, s3 # set argument 2 for the read_matrix function
mv a2, s4 # set argument 3 for the read_matrix function
jal read_matrix
mv s0, a0 # setting s0 to the m0, aka the return value of read_matrix
```
Load the filename into a0 from the second entry of the command-line argument array, at offset 4 from a1. Set a1 and a2 to the pointers for rows (s3) and columns (s4). Call the read_matrix function to read the matrix from the file, which also fills in the row and column counts through those pointers. Store the returned matrix address from a0 into register s0.
``` Assembly=
mv a0, s0 # move m0 array to first arg
lw a1, 0(s3) # move m0 rows to second arg
lw a2, 0(s4) # move m0 cols to third arg
mv a3, s2 # move input array to fourth arg
lw a4, 0(s7) # move input rows to fifth arg
lw a5, 0(s8) # move input cols to sixth arg
jal matmul
```
Pass the address of matrix M0 in a0, the number of rows in a1, and the number of columns in a2. Similarly, pass the input array address in a3, input rows in a4, and input columns in a5. Call the matmul function to perform the matrix multiplication.
``` Assembly=
mv a0, s9 # move h to the first argument
jal relu
```
Move the address of the hidden layer matrix h (stored in s9) into a0 and call the relu function to apply the ReLU activation function to the hidden layer.
``` Assembly=
mv a0, s1 # move m1 array to first arg
lw a1, 0(s5) # move m1 rows to second arg
lw a2, 0(s6) # move m1 cols to third arg
mv a3, s9 # move h array to fourth arg
lw a4, 0(s3) # move h rows to fifth arg
lw a5, 0(s8) # move h cols to sixth arg
jal matmul
```
Pass the address of matrix M1 in a0, the number of rows in a1, and the number of columns in a2. Similarly, pass the address of the hidden layer matrix h in a3, the number of rows in a4, and the number of columns in a5. Since h is the product of M0 and the input, it has as many rows as M0 (read through s3) and as many columns as the input (read through s8). Call the matmul function to perform the second matrix multiplication.
``` Assembly=
mv a0, s10 # load o array into first arg
jal argmax
```
Move the address of the output matrix O (stored in s10) into a0 and call the argmax function to find the index of the maximum value in O.
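The whole sequence of calls (matmul, relu, matmul, argmax) can be summarised in a short Python model. This is a sketch with assumed names: `classify` and its parameters are hypothetical, matrices are flat row-major lists with `(rows, cols)` tuples, and file reading and memory allocation are left out.

```python
def matmul(a, a_rows, a_cols, b, b_rows, b_cols):
    """Row-major matrix product of flat lists; returns a flat list."""
    return [
        sum(a[i * a_cols + k] * b[k * b_cols + j] for k in range(a_cols))
        for i in range(a_rows)
        for j in range(b_cols)
    ]


def classify(m0, m0_dims, m1, m1_dims, inp, inp_dims):
    h = matmul(m0, *m0_dims, inp, *inp_dims)        # h = M0 x input
    h = [max(v, 0) for v in h]                      # relu(h)
    h_dims = (m0_dims[0], inp_dims[1])              # rows of M0, cols of input
    o = matmul(m1, *m1_dims, h, *h_dims)            # o = M1 x h
    return max(range(len(o)), key=lambda i: o[i])   # first index of the maximum


m0 = [1, -1, -1, 1]          # 2 x 2
m1 = [1, 0, 0, 1, 1, 1]      # 3 x 2
inp = [2, 5]                 # 2 x 1
print(classify(m0, (2, 2), m1, (3, 2), inp, (2, 1)))  # 1
```

In this example h is [-3, 3] before ReLU and [0, 3] after it, so o is [0, 3, 3] and the first maximum is at index 1.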
Each function worked fine when written on its own. After integrating them into the overall program, however, errors appeared whose cause was hard to identify at first. After carefully examining the error messages during program execution for a long time, I realized that the issue was related to how the stack pointer (sp) was saved and restored and to the improper use of registers. After making the necessary corrections, everything worked correctly.
Replacing the mul instruction was also error-prone: I often got the logic wrong, causing the program to fail repeatedly. I originally intended to write a separate mul.s function, but it kept failing even though the logic seemed correct, so I eventually embedded the multiplication logic directly into the main program instead of calling a separate function.

