Matrix multiplication is a critical operation in deep learning and scientific computing. This project focuses on developing a high-performance systolic array accelerator using Chisel, enabling efficient computation of General Matrix Multiplication (GEMM). The design is modular and scalable, featuring a systolic array composed of 4x4 Processing Elements (PEs) grouped into 2x2 Tiles.
Advantages of Hardware Accelerator
Higher Throughput
Multiple operations can be computed in parallel, achieving higher operations-per-second than a purely CPU-bound approach
Scalable and Modular
Base 4×4 array can be replicated or tiled to handle arbitrarily large matrices
A single design can address a wide range of matrix sizes through a flexible tiling strategy
Efficient Data Reuse
Leveraging weight-stationary or output-stationary dataflows minimizes redundant memory access
Data Type Flexibility
Supports both int4 and int8 data types for diverse workloads
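To illustrate the weight-stationary dataflow mentioned above, here is a minimal software sketch of a single PE (the class and field names are hypothetical; the real PE is the Chisel module described in this project): each PE holds its weight, accumulates a partial sum, and forwards the activation to its neighbor.

```python
# Software model of one weight-stationary PE (illustrative only;
# the actual hardware is the Chisel PE in this project).
class PE:
    def __init__(self, weight):
        self.weight = weight   # stationary operand, loaded once
        self.partial_sum = 0   # accumulator (a 24-bit register in hardware)

    def step(self, a):
        """Consume one activation, update the partial sum,
        and forward the activation (fwd_a in the hardware)."""
        self.partial_sum += a * self.weight
        return a

pe = PE(weight=3)
for a in [1, 2, 4]:
    pe.step(a)
print(pe.partial_sum)  # 3*(1 + 2 + 4) = 21
```

Because the weight never moves, each weight is read from memory once and reused across the whole activation stream, which is the data-reuse advantage listed above.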
dut.io.in_a.poke(255.U) // Max value for 8-bit
dut.io.in_w.poke(255.U) // Max value for 8-bit
dut.clock.step(1) // First operation: 255 * 255 = 65025
assert(dut.io.outPS.peek().litValue == 65025, "Partial sum should be 65025 after the first operation")
assert(dut.io.fwd_a.peek().litValue == 255, "fwd_a should forward in_a (255)")
assert(dut.io.fwd_w.peek().litValue == 255, "fwd_w should forward in_w (255)")
println(s"Partial sum for overflow test: ${dut.io.outPS.peek().litValue}")
Systolic Array Input Data: Software Solution
Computation completes in 3N - 2 cycles, where N is the size of the systolic array (10 cycles for N = 4)
Verified using ChiselTest
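The software side must skew (stagger) the input matrices before streaming them into the array: row i of A is delayed by i cycles so that operands meet the right partial sums. A minimal sketch of that scheduling step (function name hypothetical):

```python
def skew(matrix):
    """Stagger row i of the input by i cycles for systolic feeding;
    0 is the idle/bubble value before and after each row's data."""
    n = len(matrix)
    total = 2 * n - 1  # cycles to stream one skewed operand in
    sched = [[0] * total for _ in range(n)]
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            sched[i][i + j] = v
    return sched

A = [[1, 2],
     [3, 4]]
for lane in skew(A):
    print(lane)
# [1, 2, 0]
# [0, 3, 4]
```

Streaming in takes 2N - 1 cycles and the last result needs N - 1 more cycles to propagate through the array, which is where the 3N - 2 cycle count above comes from.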
Constraints
Matrix Size Constraints
(4x4) x (4x4) or smaller
Input Datatype
UInt8 for W and A
Output partial sum
24-bit register for the partial sum in each PE
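A quick software sanity check on the 24-bit accumulator width: for a 4x4 UInt8 GEMM, each output element is a dot product of four terms, each at most 255 * 255, so the worst case fits with room to spare.

```python
# Worst-case partial sum for a 4-term UInt8 dot product.
max_term = 255 * 255      # 65025
max_sum = 4 * max_term    # 260100
print(max_sum.bit_length())  # 18 -> fits easily in a 24-bit register
assert max_sum < 2**24
```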
Interface Requirements
Software Verification
Compare hardware results with a software GEMM implementation
Use ChiselTest for verification
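The software GEMM used for comparison can be as simple as a triple loop; a minimal golden-model sketch (function name hypothetical):

```python
def gemm_ref(A, W):
    """Golden-model GEMM for checking the accelerator's outputs.
    A is n x k, W is k x m; entries model UInt8 operands."""
    n, k, m = len(A), len(W), len(W[0])
    assert len(A[0]) == k, "inner dimensions must match"
    return [[sum(A[i][p] * W[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
print(gemm_ref(A, W))  # [[19, 22], [43, 50]]
```

In a ChiselTest testbench, the hardware's streamed-out partial sums are compared element-by-element against this reference.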
Test Cases
Supports n x n GEMM
2x2 Small Matrices
Padding module
4x4 Small-Scale Matrices
UInt8 Vec input
8x8 Larger Matrices or NxN Matrices
Tiling module (TODO)
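The padding and tiling strategies above can be sketched in software (function names hypothetical; the hardware tiling module is still TODO): small matrices are zero-padded up to the 4x4 the array expects, and larger matrices are blocked into 4x4 tiles whose products are accumulated.

```python
def pad_to(M, size, fill=0):
    """Zero-pad a matrix up to size x size
    (e.g. a 2x2 input padded to the 4x4 array width)."""
    rows, cols = len(M), len(M[0])
    return [[M[i][j] if i < rows and j < cols else fill
             for j in range(size)] for i in range(size)]

def tiled_gemm(A, W, t=4):
    """Block an n x n GEMM into t x t tiles and accumulate tile
    products -- the software analogue of the tiling module."""
    n = len(A)
    assert n % t == 0, "pad first so n is a multiple of the tile size"
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, t):          # tile row of C
        for bj in range(0, n, t):      # tile column of C
            for bk in range(0, n, t):  # accumulate over inner tiles
                for i in range(t):
                    for j in range(t):
                        C[bi + i][bj + j] += sum(
                            A[bi + i][bk + p] * W[bk + p][bj + j]
                            for p in range(t))
    return C
```

Each inner t x t tile product is exactly one pass through the 4x4 array, so an 8x8 GEMM decomposes into 8 array passes (2 x 2 x 2 tiles).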
Future Directions
Floating-Point Support: add FP16/FP32 operations for broader workloads
Sparse Matrix Optimization: introduce techniques for handling sparse matrices efficiently
Scalability: extend to NxN systolic arrays for even larger matrix operations