In this section, I focused on implementing fundamental matrix operations commonly used in neural networks. Specifically, I developed functions for dot product, matrix multiplication, element-wise ReLU, and ArgMax.
All matrices were represented as 1D vectors in row-major order. This representation required careful attention to memory access patterns, particularly when working with strides to access elements non-contiguously in memory.
Challenge:
Transforming a 2D matrix into a 1D vector was the first hurdle. Working with MNIST-like data, I had to ensure that the flattened representation accurately followed the row-major order.
Solution:
Through precise index calculations, I ensured that each row of the 2D matrix was concatenated correctly. This set a solid foundation for subsequent matrix operations.
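To make the index arithmetic concrete, here is a minimal sketch (not the exact code from my solution; the function name and register choices are my own) of how the address of element (i, j) of a row-major matrix of 4-byte words can be computed:

```assembly
# index2d: compute the address of element (i, j) of a row-major int matrix.
# a0 = base address, a1 = row index i, a2 = column index j, a3 = number of columns
# Returns the element's address in a0.
index2d:
    mul  t0, a1, a3      # t0 = i * num_cols
    add  t0, t0, a2      # t0 = i * num_cols + j   (flattened index)
    slli t0, t0, 2       # multiply by 4 to turn it into a byte offset
    add  a0, a0, t0      # a0 = base + byte offset
    ret
```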
Implementation:
ReLU was implemented by looping through the input array, using the index i to locate and process each value individually. Negative values were replaced with zero.
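A minimal sketch of that loop in RV32 assembly is shown below; the labels and register assignments are illustrative rather than the exact ones from my submission:

```assembly
# relu: replace every negative element of an int array with 0, in place.
# a0 = pointer to the array, a1 = number of elements
relu:
    li   t0, 0               # t0 = i, the loop index
relu_loop:
    bge  t0, a1, relu_done   # stop once i >= length
    slli t1, t0, 2           # byte offset = i * 4
    add  t1, a0, t1          # t1 = address of element i
    lw   t2, 0(t1)           # load the element
    bge  t2, zero, relu_next # non-negative values stay unchanged
    sw   zero, 0(t1)         # negative values become 0
relu_next:
    addi t0, t0, 1           # i++
    j    relu_loop
relu_done:
    ret
```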
Challenge:
The VENUS system call requires a0 as the control register. During static data testing, using a0 for output caused conflicts when performing system calls.
Solution:
To resolve this, I temporarily stored the output in a1 before making system calls, so that a0 remained free to control the system call.
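For illustration, a hypothetical helper under this convention (assuming Venus's print-integer call, with the ID in a0 and the value in a1) could look like this:

```assembly
# print_a0: print the value currently held in a0 without losing it.
# Assumes the Venus ecall convention: a0 = syscall ID (1 = print integer), a1 = argument.
print_a0:
    mv   a1, a0      # move the result out of a0 so a0 can hold the syscall ID
    li   a0, 1       # syscall 1: print integer
    ecall            # Venus prints the value in a1
    mv   a0, a1      # restore the original value to a0
    ret
```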
Implementation:
ArgMax was based on a "max register" function. The algorithm iterated through the array, comparing each element and updating the "reg_kingdom" (max value register) when a new maximum was found. Simultaneously, it tracked the corresponding index of the maximum value.
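Translated into a sketch of RV32 assembly (the register choices are my own; only the "max register" idea comes from the actual implementation), the loop looks roughly like this:

```assembly
# argmax: return the index of the first maximum element of an int array.
# a0 = pointer to the array, a1 = number of elements (assumed >= 1)
# Returns the index in a0.
argmax:
    lw   t0, 0(a0)           # t0 = current max value (the "reg_kingdom")
    li   t1, 0               # t1 = index of the current max
    li   t2, 1               # t2 = i, starting from the second element
argmax_loop:
    bge  t2, a1, argmax_done
    slli t3, t2, 2
    add  t3, a0, t3
    lw   t4, 0(t3)           # t4 = array[i]
    ble  t4, t0, argmax_next # keep the old max on ties (first occurrence wins)
    mv   t0, t4              # new maximum found: update the value...
    mv   t1, t2              # ...and remember its index
argmax_next:
    addi t2, t2, 1
    j    argmax_loop
argmax_done:
    mv   a0, t1              # return the index of the maximum
    ret
```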
Thought Process:
Version 1:
The initial implementation focused on functionality, assuming both arrays had the same stride. Sizes size1 and size2 were independently configurable.
⚠️ Key Issues:
matmul assumes stride1 = 1 and stride2 = matrixB_col.
Version 2 (Improvements):
Added the ability to handle different strides (stride1 and stride2).
🔑 Key Insights:
In matmul, stride1 is always 1, while stride2 typically exceeds 1 (e.g., the number of columns in matrix B).
Additionally, I reduced the number of registers by combining size1 and size2 into a single size register.
Future Improvements:
Develop a version where both stride1 and stride2 can be independently configured, enhancing the function's general utility; a sketch of such a generalized dot product follows below.
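A fully generalized dot product, where both strides arrive as arguments, might look like the following sketch; the argument layout in a0 through a4 is my own assumption rather than the project's official signature:

```assembly
# dot: strided dot product of two int arrays.
# a0 = pointer to v0, a1 = pointer to v1,
# a2 = number of elements to use, a3 = stride of v0, a4 = stride of v1
# Returns the dot product in a0.
dot:
    li   t0, 0               # t0 = running sum
    slli t5, a3, 2           # byte step for v0 = stride1 * 4
    slli t6, a4, 2           # byte step for v1 = stride2 * 4
dot_loop:
    beq  a2, zero, dot_done
    lw   t1, 0(a0)           # current element of v0
    lw   t2, 0(a1)           # current element of v1
    mul  t3, t1, t2
    add  t0, t0, t3          # sum += v0[k] * v1[k]
    add  a0, a0, t5          # advance v0 by its stride
    add  a1, a1, t6          # advance v1 by its stride
    addi a2, a2, -1          # one fewer element to go
    j    dot_loop
dot_done:
    mv   a0, t0
    ret
```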
Implementation:
Matrix multiplication was achieved by calculating the dot product between rows of M1 and columns of M2. This process repeated for each row and column pair, with the total number of dot products equaling M1_row_number × M2_col_number.
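Combined with the generalized strided dot product sketched earlier, the loop structure can be expressed roughly as below; the argument layout and register allocation are my own assumptions, not the official function signature:

```assembly
# matmul: D = M1 x M2 for row-major int matrices, built on the strided dot above.
# a0 = M1, a1 = rows of M1, a2 = cols of M1 (= rows of M2),
# a3 = M2, a4 = cols of M2, a5 = pointer to the output matrix D
matmul:
    addi sp, sp, -40         # prologue: save ra and the s-registers we will use
    sw   ra, 0(sp)
    sw   s0, 4(sp)
    sw   s1, 8(sp)
    sw   s2, 12(sp)
    sw   s3, 16(sp)
    sw   s4, 20(sp)
    sw   s5, 24(sp)
    sw   s6, 28(sp)
    sw   s7, 32(sp)
    mv   s0, a0              # s0 = pointer to the current row of M1
    mv   s1, a1              # s1 = number of rows of M1
    mv   s2, a2              # s2 = cols of M1 (= rows of M2 = dot length)
    mv   s3, a3              # s3 = M2
    mv   s4, a4              # s4 = number of columns of M2
    mv   s5, a5              # s5 = next free slot in the output matrix D
    li   s6, 0               # s6 = i (row index)
matmul_row:
    bge  s6, s1, matmul_done
    li   s7, 0               # s7 = j (column index)
matmul_col:
    bge  s7, s4, matmul_next_row
    mv   a0, s0              # v0 = row i of M1
    slli t0, s7, 2
    add  a1, s3, t0          # v1 = column j of M2 (starts at M2 + j*4)
    mv   a2, s2              # length of each dot product
    li   a3, 1               # row elements are contiguous
    mv   a4, s4              # column elements are one full row apart
    jal  dot                 # a0 = row_i . col_j
    sw   a0, 0(s5)           # store D[i][j]
    addi s5, s5, 4
    addi s7, s7, 1
    j    matmul_col
matmul_next_row:
    slli t0, s2, 2
    add  s0, s0, t0          # advance to the next row of M1
    addi s6, s6, 1
    j    matmul_row
matmul_done:
    lw   ra, 0(sp)           # epilogue: restore saved registers and return
    lw   s0, 4(sp)
    lw   s1, 8(sp)
    lw   s2, 12(sp)
    lw   s3, 16(sp)
    lw   s4, 20(sp)
    lw   s5, 24(sp)
    lw   s6, 28(sp)
    lw   s7, 32(sp)
    addi sp, sp, 40
    ret
```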
Throughout PART A, I learned to define static memory in the data segment for testing each function. This approach allowed for easy verification of function outputs without worrying about dynamic memory issues during initial development.
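As an example, a small test fixture of this kind (the values, labels, and the exit call are illustrative) lets relu run against a known matrix straight from the data segment:

```assembly
.data
test_matrix: .word 1, -2, 3, -4, 5, -6    # a 2x3 matrix stored in row-major order

.text
main:
    la   a0, test_matrix     # a0 = pointer to the flattened matrix
    li   a1, 6               # a1 = number of elements (2 * 3)
    jal  relu                # run the function under test
    ebreak                   # pause here and inspect memory in Venus
    li   a0, 10              # Venus syscall 10: exit
    ecall
```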
Debugging assembly can be daunting, but the VENUS Web Simulator proved invaluable. By stepping through each instruction and observing register values, I could pinpoint errors. Setting breakpoints using EBREAK allowed me to halt execution at critical points, making the debugging process more manageable and systematic.
Implementing these functions solidified my understanding of RISC-V calling conventions. Specifically:
Caller-saved registers (reg_a, reg_t) were used for temporary values.
Callee-saved registers (reg_s, reg_ra) ensured that critical values were preserved across function calls.
Through practice, I became adept at crafting robust prologue and epilogue sections for each function, ensuring that register states were properly saved and restored.
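The prologue/epilogue pattern I kept returning to looks roughly like this generic sketch, with the frame size and the set of saved registers varying from function to function:

```assembly
example_fn:
    addi sp, sp, -8          # prologue: allocate a stack frame
    sw   ra, 0(sp)           # save the return address (this function makes calls)
    sw   s0, 4(sp)           # save a callee-saved register before using it

    # ... function body: s0 and ra survive any jal made here ...

    lw   ra, 0(sp)           # epilogue: restore what the prologue saved
    lw   s0, 4(sp)
    addi sp, sp, 8           # release the stack frame
    ret
```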
This section was a deep dive into low-level programming and assembly concepts. It challenged my understanding of memory management, register handling, and function calling conventions. By the end, I felt more confident in writing efficient, bug-free assembly code, a foundational skill for building more complex systems in the future.