Convolutional Neural Network Accelerator implemented in Chisel
洪翊碩
GitHub
Target Task and Model
Target Task
- Image Classification
- Dataset - CIFAR-10
- Image size: 32 × 32
- 60,000 images
- 10 classes
Model
Current ideas to be explored
- A model with 5 CONV layers and 3 FC layers
- Dyadic quantization within the framework of PTQ (Post-Training Quantization)
    - A type of integer-only quantization in which all scaling factors are restricted to dyadic numbers, defined as $S = \dfrac{b}{2^c}$, where $S$ is a floating-point number and $b$, $c$ are integers.
    - Motivation: dyadic quantization can be implemented with only bit-shift operations, which eliminates the overhead of expensive dequantization and requantization.
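As a concrete illustration, here is a minimal Python sketch of shift-based requantization with a dyadic scale $S = b/2^c$ (function name, rounding scheme, and uint8 clamping are assumptions for illustration, not the project's actual code; for the pure power-of-two scales used in this model, $b = 1$):

```python
# Minimal sketch: requantize a 32-bit accumulator with a dyadic scale
# S = b / 2**c using only integer multiply, add, and shift.
def requantize_dyadic(acc: int, b: int, c: int, zero_point: int = 128) -> int:
    """Multiply by the integer b, then shift right by c with rounding."""
    prod = acc * b
    rounded = (prod + (1 << (c - 1))) >> c         # add half an LSB, then shift
    return max(0, min(255, rounded + zero_point))  # clamp to the uint8 range
```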
Model Architecture
Model loaded from ./weights/cifar10/alexnet/alexnetbn_v2-power2.pt (3.351736 MB)
QuantWrapper(
(quant): Quantize(scale=tensor([0.0156]), zero_point=tensor([128]), dtype=torch.quint8)
(dequant): DeQuantize()
(module): VGG8Bn(
(conv1): Sequential(
(0): QuantizedConvReLU2d(3, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.03125, zero_point=128, padding=(1, 1))
(1): Identity()
(2): Identity()
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(conv2): Sequential(
(0): QuantizedConvReLU2d(64, 192, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
(1): Identity()
(2): Identity()
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(conv3): Sequential(
(0): QuantizedConvReLU2d(192, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
(1): Identity()
(2): Identity()
)
(conv4): Sequential(
(0): QuantizedConvReLU2d(384, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
(1): Identity()
(2): Identity()
)
(conv5): Sequential(
(0): QuantizedConvReLU2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
(1): Identity()
(2): Identity()
(3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(fc6): Sequential(
(0): QuantizedLinearReLU(in_features=4096, out_features=256, scale=0.0078125, zero_point=128, qscheme=torch.per_tensor_affine)
(1): Identity()
(2): QuantizedDropout(p=0.5, inplace=False)
)
(fc7): Sequential(
(0): QuantizedLinearReLU(in_features=256, out_features=128, scale=0.0078125, zero_point=128, qscheme=torch.per_tensor_affine)
(1): Identity()
(2): QuantizedDropout(p=0.5, inplace=False)
)
(fc8): QuantizedLinear(in_features=128, out_features=10, scale=0.0625, zero_point=128, qscheme=torch.per_tensor_affine)
)
)
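Note that the activation scales in the printout above are all powers of two (0.03125 = 2^-5, 0.0078125 = 2^-7, 0.0625 = 2^-4), so every requantization reduces to a right shift. A small Python check (hypothetical helper, for illustration):

```python
import math

def shift_amount(scale: float) -> int:
    """Return c such that scale == 2**-c, asserting the scale is dyadic."""
    c = round(-math.log2(scale))
    assert scale == 2.0 ** -c, f"{scale} is not a power of two"
    return c

for s in (0.03125, 0.0078125, 0.0625):   # scales taken from the printout
    print(f"{s} -> shift right by {shift_amount(s)}")
```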
Hardware Architecture
Systolic Array
- Computes CONV layers and FC layers
- Data processing
    - For CONV layers, im2col transforms the input feature map into a matrix layout compatible with the systolic array's matrix-multiplication structure, converting the spatial convolution into a GEMM.
- Dataflow: Output Stationary (the common alternatives are summarized below; see the loop-nest sketch after this list)
    - Weight Stationary (WS)
        - Maximizes weight reuse by keeping the weights stationary in each PE.
        - Activations are streamed from the global buffer to each PE, and the partial sum is updated (accumulated) each time it passes through a PE.
    - Output Stationary (OS)
        - Aims to reduce the access frequency of partial sums.
        - In a DNN accelerator, the partial sum must be updated every time it passes through a MAC.
        - By keeping the partial sum in registers within the PE, global-buffer accesses for partial sums become unnecessary.
    - Row Stationary (RS)
        - Maps 1-D convolution rows onto PEs to jointly maximize the reuse of weights, activations, and partial sums (the Eyeriss dataflow).
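The output-stationary choice is easiest to see as a loop nest. A plain-Python sketch (illustrative only; not the RTL):

```python
# Loop-nest view of the output-stationary dataflow: each output element
# (i, j) maps to one PE, and its partial sum stays in that PE's register
# until the reduction over the shared dimension is finished.
def os_gemm(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):          # (i, j) is pinned to one PE
            psum = 0                # lives in the PE's 32-bit register
            for t in range(k):      # operands stream through; psum stays put
                psum += A[i][t] * B[t][j]
            C[i][j] = psum          # written back only once, at read-out
    return C
```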
- Spec
- Each PE contains a W8A8 MAC and a 32-bit psum register.
- Number of PEs: $H \times W$
- The hardware's $H$ and $W$ are parameterized.
- $H$ and $W$ are temporarily set to 8, i.e., an 8×8 array.
- SRAM: 16 KB
PE
I/O ports
| Port | I/O | Bitwidth | Description |
| --- | --- | --- | --- |
| clk | I | 1 | |
| rst | I | 1 | active high |
| compute_en | I | 1 | active high |
| read_en | I | 1 | active high |
| compute_mode | I | 1 | 0 for load bias/ipsum, 1 for multiply |
| ifmap_i | I | 8 | valid when `compute_en == 1 && compute_mode == 1` |
| ifmap_o | O | 8 | |
| weight_i | I | 8 | valid when `compute_en == 1 && compute_mode == 1` |
| weight_o | O | 8 | |
| ipsum | I | 32 | valid when `compute_en == 1 && compute_mode == 0` |
| opsum | O | 32 | valid when `read_en == 1` |
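To check the RTL against a reference, a minimal Python golden model of the PE behavior described in the table above could look like this (class and method names are assumptions for illustration, not the project's code):

```python
class PEGoldenModel:
    """Cycle-level golden model of one output-stationary PE.

    Mirrors the port table: compute_mode == 0 loads a bias/ipsum into the
    32-bit psum register; compute_mode == 1 performs one W8A8 MAC and
    forwards ifmap/weight to the neighboring PEs.
    """
    def __init__(self):
        self.psum = 0      # 32-bit partial-sum register
        self.ifmap_o = 0   # registered pass-through to the next PE
        self.weight_o = 0  # registered pass-through to the next PE

    def step(self, compute_en, compute_mode, ifmap_i=0, weight_i=0, ipsum=0):
        if not compute_en:
            return
        if compute_mode == 0:       # load bias / incoming partial sum
            self.psum = ipsum
        else:                       # multiply-accumulate and forward operands
            self.psum += ifmap_i * weight_i
            self.ifmap_o = ifmap_i
            self.weight_o = weight_i

    def read(self, read_en):
        """Return opsum when read_en == 1, matching the port spec."""
        return self.psum if read_en else None
```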
Post Unit
- ReLU
- Max pooling
- Quantization
    - Merges dequantization before computation and requantization after computation.
    - Because of dyadic quantization, the scaling factors are all powers of 2, so requantization can be completed with a shift operation (see the sketch below).
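A behavioral sketch of the post unit in Python (names, ordering, and the uint8 clamp are assumptions for illustration; ReLU and max pooling commute, so folding ReLU into the max is equivalent):

```python
def post_unit(psums, shift, zero_point=128):
    """Post-unit golden sketch: ReLU, 2x2 max-pool, shift requantization.

    `psums` is a 2-D list of int32 partial sums for one output channel.
    """
    h, w = len(psums), len(psums[0])
    out = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            window = [psums[i + di][j + dj] for di in (0, 1) for dj in (0, 1)]
            v = max(0, max(window))                 # ReLU folded into the max
            q = (v + (1 << (shift - 1))) >> shift   # dyadic requantization
            row.append(min(255, q + zero_point))    # clamp to uint8
        out.append(row)
    return out
```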
Testbench

- Use the AXI protocol to connect the chip with the testbench.
- The chip serves as both a master and a slave in the AXI protocol.
    - Master (AR/R/AW/W/B): loads data from DRAM (testbench) to SRAM.
    - Slave (AW/W/B): receives MMIO data from the CPU (testbench).
- Unit Test for PE and Systolic Array
- Model data input test
In this final project, we process one image at a time (N = 1). We use channel-last order, i.e., the NHWC format, as our memory layout.
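For clarity, channel-last addressing works as in this small sketch (hypothetical helper, not the project's code):

```python
# Flat memory offset of element (n, h, w, c) in channel-last (NHWC) layout.
def nhwc_offset(n: int, h: int, w: int, c: int, H: int, W: int, C: int) -> int:
    return ((n * H + h) * W + w) * C + c

# Example: for a 32x32 RGB CIFAR-10 image, the G channel of pixel (0, 1):
print(nhwc_offset(0, 0, 1, 1, H=32, W=32, C=3))   # -> 4
```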
Shape Parameter Definition

Ifmap (C, H, W), Filter (M, C, R, S), Ofmap (M, E, F)
Method - Img2Col
- $\hat{I} \in \mathbb{Z}^{(E \cdot F) \times (C \cdot R \cdot S)}$: row $\hat{I}_i$ represents the $i$-th row of the matrix created by flattening all sliding windows of the input image $I$.
- $E \cdot F$ represents the total number of spatial positions in the output feature map.
- $C \cdot R \cdot S$ represents the total number of elements in each sliding window (filter size).
Filter Expansion
- $\hat{W}_{i,j}$ represents the $i$-th row and $j$-th column of the filter matrix $\hat{W} \in \mathbb{Z}^{(C \cdot R \cdot S) \times M}$, where each filter is flattened into a column.
- $M$ represents the number of filters (output channels).
Output Matrix
- $\hat{O} = \hat{I}\,\hat{W} \in \mathbb{Z}^{(E \cdot F) \times M}$ represents the result of the matrix multiplication after flattening, which corresponds to the output feature map.
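A NumPy sketch of this transformation, matching the shape parameters above (stride 1 and no padding are assumed here for brevity; the model's conv layers actually use padding=1, which would be applied before this step):

```python
import numpy as np

def im2col(ifmap, R, S):
    """Flatten sliding windows: ifmap (C, H, W) -> matrix (E*F, C*R*S)."""
    C, H, W = ifmap.shape
    E, F = H - R + 1, W - S + 1
    rows = []
    for e in range(E):
        for f in range(F):
            rows.append(ifmap[:, e:e + R, f:f + S].reshape(-1))
    return np.stack(rows)              # shape (E*F, C*R*S)

def filters_to_matrix(filt):
    """Filter (M, C, R, S) -> matrix (C*R*S, M): each filter is a column."""
    M = filt.shape[0]
    return filt.reshape(M, -1).T

# ofmap_matrix = im2col(ifmap, R, S) @ filters_to_matrix(filt)  # (E*F, M)
```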
Operator Mapping
- The systolic array can only compute a GEMM of at most $H \times W$ outputs at a time.
- We need to perform tiling on the matrices obtained from the Img2Col method to fit the hardware's computation capability.
Implementation Code
Tiling Strategy
- After img2col, I tile the resulting matrices (see the sketch after this list):
    - Tile the ifmap matrix in the row direction; the tile_size equals systolic_array_height.
    - Tile the filter matrix in the column direction; the tile_size equals systolic_array_width.
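A NumPy sketch of this tiling (function names are assumptions for illustration; edge tiles may be smaller than the array, and zero-padding them to full size, omitted here, keeps the hardware control logic uniform):

```python
import numpy as np

def tile_ifmap_rows(ifmap_mat: np.ndarray, sa_height: int):
    """Split the (E*F, C*R*S) ifmap matrix into row tiles of sa_height."""
    return [ifmap_mat[i:i + sa_height]
            for i in range(0, ifmap_mat.shape[0], sa_height)]

def tile_filter_cols(filter_mat: np.ndarray, sa_width: int):
    """Split the (C*R*S, M) filter matrix into column tiles of sa_width."""
    return [filter_mat[:, j:j + sa_width]
            for j in range(0, filter_mat.shape[1], sa_width)]
```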

Tiled Matrix in Systolic Array
The diagram below illustrates three things (a code sketch follows it):
- How my systolic array calculates the tiled matrices.
- Why the ifmap must be tiled in the row direction to match the systolic array height.
- Why the filter must be tiled in the column direction to match the systolic array width.

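In code form, each (row tile, column tile) pair corresponds to one pass through the array (illustrative NumPy sketch, not the RTL):

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, sa_h: int = 8, sa_w: int = 8):
    """Compute A @ B one (row tile, column tile) pair per array pass."""
    EF, M = A.shape[0], B.shape[1]
    out = np.zeros((EF, M), dtype=np.int64)
    for i in range(0, EF, sa_h):        # row tiles -> array height
        for j in range(0, M, sa_w):     # column tiles -> array width
            # One systolic-array pass; the reduction over the shared
            # dimension happens inside the PEs' psum registers.
            out[i:i + sa_h, j:j + sa_w] = A[i:i + sa_h] @ B[:, j:j + sa_w]
    return out
```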
Chisel Execute
Test PE
Run this command to test the PE

Generate Verilog File
Run this command to generate the Verilog file
Generated File
Verilator Testbench
Load the tiled matrices produced by Python in the C++ testbench; a sketch of the Python-side dump is shown below.
Comments are always written in English.
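On the Python side, the tiles can be dumped as plain text for the C++ testbench to parse (the file format here is an assumption for illustration, not the project's actual interface):

```python
def dump_tiles(tiles, path: str) -> None:
    """Write tiles as whitespace-separated integers, one tile per block,
    so the C++ testbench can read them back with plain >> extraction."""
    with open(path, "w") as f:
        for tile in tiles:
            for row in tile:
                f.write(" ".join(str(int(v)) for v in row) + "\n")
            f.write("\n")   # blank line separates tiles
```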
Result and Analysis
Due to insufficient time, I was unable to finish testing the Systolic Array or to boot the model on my accelerator.
As a result, I have no result statistics to analyze; this is because I had already spent significant effort on the following tasks:
Completed Work
- Training the model and determining the quantization method.
- Processing the .pt file to load input weights and biases, then converting them into text files.
- Using Python to perform img2col and matrix tiling on the img2col results.
- Sending the Python-computed tiled matrix to the Verilator C++ testbench.
- Setting up the Verilator environment.
- Setting up the Chisel environment and learning how to use Chisel.
- Using Mill to build the project and learning how to write build.sc.
- Writing the PE design and Systolic Array design.
- Testing the PE functionality to ensure correctness.
- Completing the Verilator Makefile.
Git History

Ongoing Work
- Writing the Systolic Array testbench using Verilator.
- Using Verilator to send the tiled matrix to the Systolic Array for computation.
- Post-unit design and testing.
- Verifying the correctness of the accelerator's computation results.
Planned evaluation metrics:
- Latency: test the number of clock cycles required to complete a single convolution, a matrix multiplication, or the entire output generation.
- Throughput: measure the number of images that can be processed per second (e.g., images from the CIFAR-10 dataset).
- Hardware Resource Utilization: verify the utilization of computational units (e.g., PEs and SRAM) within each module of the hardware design.
Supplement
Chisel Environment Setting
- Chisel Template
    - A useful Git repo that provides a working `mill` environment (`build.sc` and some included libraries).