Convolutional Neural Network Accelerator implemented in Chisel

洪翊碩

GitHub

Targeting Task and Model

Targeting Task

Image Classification
Dataset - CIFAR-10
- Image size: 32 * 32
- 600000 images
- 10 classes

Model

Current ideas to be explored

A Model With 5 CONV Layers and 3 FC Layers
Use dyadic quantization within the framework of PTQ (Post-Training Quantization)
- A type of integer-only quantization that all of the scaling factors are restricted to be dyadic numbers defined as:
  \[ s = \frac{b}{2^c} \]
  where \(𝑠\) is a floating point number, and \(𝑏\), \(𝑐\) are integers.
- Motivation: Dyadic quantization can be implemented with only bit shift operations, which eliminates overhead of expensive dequantization and requantization.

Model Architecture

Model loaded from ./weights/cifar10/alexnet/alexnetbn_v2-power2.pt (3.351736 MB)
QuantWrapper(
  (quant): Quantize(scale=tensor([0.0156]), zero_point=tensor([128]), dtype=torch.quint8)
  (dequant): DeQuantize()
  (module): VGG8Bn(
    (conv1): Sequential(
      (0): QuantizedConvReLU2d(3, 64, kernel_size=(3, 3), stride=(1, 1), scale=0.03125, zero_point=128, padding=(1, 1))
      (1): Identity()
      (2): Identity()
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (conv2): Sequential(
      (0): QuantizedConvReLU2d(64, 192, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
      (1): Identity()
      (2): Identity()
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (conv3): Sequential(
      (0): QuantizedConvReLU2d(192, 384, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
      (1): Identity()
      (2): Identity()
    )
    (conv4): Sequential(
      (0): QuantizedConvReLU2d(384, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
      (1): Identity()
      (2): Identity()
    )
    (conv5): Sequential(
      (0): QuantizedConvReLU2d(256, 256, kernel_size=(3, 3), stride=(1, 1), scale=0.0078125, zero_point=128, padding=(1, 1))
      (1): Identity()
      (2): Identity()
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (fc6): Sequential(
      (0): QuantizedLinearReLU(in_features=4096, out_features=256, scale=0.0078125, zero_point=128, qscheme=torch.per_tensor_affine)
      (1): Identity()
      (2): QuantizedDropout(p=0.5, inplace=False)
    )
    (fc7): Sequential(
      (0): QuantizedLinearReLU(in_features=256, out_features=128, scale=0.0078125, zero_point=128, qscheme=torch.per_tensor_affine)
      (1): Identity()
      (2): QuantizedDropout(p=0.5, inplace=False)
    )
    (fc8): QuantizedLinear(in_features=128, out_features=10, scale=0.0625, zero_point=128, qscheme=torch.per_tensor_affine)
  )
)

Hardware Architecture

Systolic Array

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

Calculate CNN layers and FC layers
Data Processing
- In CNN Layer, we use im2col to transform the input feature map into a format compatible with the systolic array's matrix multiplication structure, enabling efficient convolution by converting spatial operations into matrix operations.
- Dataflow: Output Stationary
  1. Weight Stationary (WS)
    - Maximize the reuse of weights by keeping them stationary in each PE.
    - Activations are transmitted from the global buffer to each PE, and the partial sum is updated (accumulated) once it passes through a PE.
  2. Output Stationary (OS)
    - OS aims to reduce the access frequency of partial sums.
    - In a DNN accelerator, the partial sum must be updated every time it passes through a MAC.
    - By keeping the partial sum stored in the registers within the PE, access to the global buffer for the partial sum is unnecessary.
  3. Row Stationary (RS)
    - RS is a dataflow used in the Eyeriss neural network accelerator.
    - Its design goal is to maximize the reuse of weights and activations within the PE array.
Spec
- a W8A8 MAC and a 32-bit psum register in a PE
- The number of PEs - H * W
  - The hardware's H, W will be written as parameterized
  - The hardware's height H and width W are temporarily set to 8×8.
- SRAM - 16KB

import chisel3._
import chisel3.util._
import chisel3.stage.ChiselStage

class SystolicArray(val rows: Int, val cols: Int) extends Module {
    val io = IO(new Bundle {
        val compute_en = Input(UInt((rows*cols).W))
        val read_en = Input(UInt((rows*cols).W))
        val compute_mode = Input(UInt((rows*cols).W))
        val ifmap = Input(UInt((8*rows).W))
        val weight = Input(UInt((8*cols).W))
        val ipsum = Input(UInt((32*rows*cols).W))
        val opsum = Output(UInt((32*rows*cols).W))
    })

    val pe_array = Seq.fill(rows, cols)(Module(new PE))

    val opsum_flat = Wire(Vec(rows * cols, UInt(32.W)))

    for (i <- 0 until rows) {
        for (j <- 0 until cols) {
            val idx = i * cols + j
            pe_array(i)(j).io.compute_en := io.compute_en(idx)
            pe_array(i)(j).io.read_en := io.read_en(idx)
            pe_array(i)(j).io.compute_mode := io.compute_mode(idx)
            if (j == 0) {
                pe_array(i)(j).io.ifmap_i := Cat(~io.ifmap(8*(i+1)-1), io.ifmap(8*(i+1)-2,8*i)).asSInt
            } else {
                pe_array(i)(j).io.ifmap_i := pe_array(i)(j-1).io.ifmap_o
            }
            if (i == 0) {
                pe_array(i)(j).io.weight_i := io.weight(8*(j+1)-1, 8*j).asSInt
            } else {
                pe_array(i)(j).io.weight_i := pe_array(i-1)(j).io.weight_o
            }
            pe_array(i)(j).io.ipsum := io.ipsum((idx+1)*32-1, idx*32).asSInt
            opsum_flat(idx) := pe_array(i)(j).io.opsum.asUInt
        }
    }

    io.opsum := opsum_flat.asUInt
}

// Generate the Verilog code
object SystolicArray extends App {
  (new ChiselStage).emitVerilog(new SystolicArray(8, 8))
}

PE

I/O ports

Port	I/O	Bitwidth	Description
clk	I	1
rst	I	1	active high
compute_en	I	1	active high
read_en	I	1	active high
compute_mode	I	1	0 for load bias/ipsum, 1 for multiply
ifmap_i	I	8	valid when `compute_en == 1 && compute_mode == 1`
ifmap_o	O	8
weight_i	I	8	valid when `compute_en == 1 && compute_mode == 1`
weight_o	O	8
ipsum	I	32	valid when `compute_en == 1 && compute_mode == 0`
opsum	O	32	valid when `read_en == 1`

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

import chisel3._
import chisel3.util._
import chisel3.stage.ChiselStage

// Define parameters for ADD and MUL operations
object OperationMode {
  val ADD = 0.U(1.W)
  val MUL = 1.U(1.W)
}

// Define the PE Accumulator module
class PE extends Module {
  val io = IO(new Bundle {
    val compute_en = Input(Bool())
    val read_en = Input(Bool())
    val compute_mode = Input(Bool())
    val ifmap_i = Input(SInt(8.W))
    val ifmap_o = Output(SInt(8.W))
    val weight_i = Input(SInt(8.W))
    val weight_o = Output(SInt(8.W))
    val ipsum = Input(SInt(32.W))
    val opsum = Output(SInt(32.W))
  })

  val ifmap = RegInit(0.S(8.W))
  val weight = RegInit(0.S(8.W))
  val psum = RegInit(0.S(32.W))

  when (reset.asBool) {
    psum := 0.S
  } .elsewhen (io.compute_en) {
    ifmap := io.ifmap_i
    weight := io.weight_i
    when (io.compute_mode === OperationMode.ADD) {
      psum := io.ipsum
    } .elsewhen (io.compute_mode === OperationMode.MUL) {
      psum := psum + (io.ifmap_i * io.weight_i)
    }
  }

  io.ifmap_o := ifmap
  io.weight_o := weight
  io.opsum := Mux(io.read_en, psum, 0.S)
}

Post Unit

ReLU
\[ \text{ReLU}(x) = \max(x, 0) = \begin{cases} x, ~~\text{if} ~ x > 0 \\ 0, ~~\text{otherwise} \end{cases} \]
Maxpooling
Quantization
- Merge dequantion before computation and requantization after computation.
\[ \begin{align} \bar x &= \text{clamp} \left( \left\lfloor \frac{x}{s_x} \right\rceil + 128, 0, 255 \right) \in \mathbb{Z}_\text{uint8}^{\dim(x)} \\ \bar w &= \text{clamp} \left( \left\lfloor \frac{w}{s_w} \right\rceil, -128, 127 \right) \in \mathbb{Z}_\text{int8}^{\dim(w)} \\ \bar b &= \text{clamp} \left( \left\lfloor \frac{b}{s_x s_w} \right\rfloor, -2^{31}, 2^{31}-1 \right) \in \mathbb{Z}_\text{int32}^{\dim(b)} \\ \bar y &= \text{clamp} \left( \left\lfloor \frac{y}{s_y} \right\rceil + 128, 0, 255 \right) \in \mathbb{Z}_\text{uint8}^{\dim(y)} \end{align} \]

\[ y_i = b_i + \sum_j x_j \cdot w_{ji} \]

\[ \begin{align} s_y (\bar y_i - 128) &= s_x s_w \bar b_i + \sum_j s_x (\bar x_j - 128) \cdot s_w \bar w_{ji} \\ &= s_x s_w (\bar b_i + \sum_j (\bar x_j - 128) \cdot \bar w_{ji}) \end{align} \]

\[ \bar y_i = \frac{s_x s_w}{s_y} (\bar b_i + \sum_j (\bar x_j - 128) \cdot \bar w_{ji}) + 128 \]
- Due to dyadic quantization, the scaling factors are all power of 2.
  It means that we can use shift operation to complete it.

Testbench

Use AXI protocol to communicate chip with testbench
- Chip serves as both a Master and a Slave in the AXI protocol.
  - Master: AR/R/AW/W/B
    - Load data from DRAM(testbench) to SRAM
  - Slave: AW/W/B
    - Receieve MMIO data from CPU(testbench)
Unit Test for PE and Systolic Array
Model data input test

Operator Transformation and Mapping

Transform Conv2D to GEMM

In this final project, we process one image at a time (N = 1)
We use channel last, which means the format is NHWC, as our memory order

Shape Parameter Definition

Ifmap (C, H, W)、Filter (M, C, R, S)、Ofmap (M, E, F)

\[ O = I * W \]

\[ O(i, j, m) = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I(c, i+r, j+s) \cdot W(m, c, r, s) \]

Method - Img2Col

\[ O_{i,j} = \sum_{k=1}^{R*S*C} I_{i,k} \cdot W_{k,j} \quad \forall i \in [1, E*F], \, j \in [1, M] \]

Input Matrix Expansion

\(I_{i,k}\) represents the \(i\)-th row of the matrix created by flattening all sliding windows of the input image \(I\).
\(i \in [1, E \times F]: E \times F\) represents the total number of spatial positions in the output feature map.
\(k \in [1, R \times S \times C]: R \times S \times C\) represents the total number of elements in each sliding window (filter size).

Filter Expansion

\(W_{k,j}\) represents the \(k\)-th row and \(j\)-th column of the filter matrix \(W\), where each filter \((m, c, r, s)\) is flattened into a column.
\(j \in [1,M]: M\) represents the number of filters (output channels).

Output Matrix

\(O_{i,j}\) represents the result of the matrix multiplication after flattening, which corresponds to the output feature map \(O\)

Operator Mapping

\(8 * 8\) systolic array can only calculate \([8, N] * [N, 8]\) GEMM at most,
We need to perform tiling on the matrices obtained from the Img2Col method to fit into the hardware's computation capab

Implementation Code

def img2col(ifmap, filter, stride=1, padding=0):
    # Get dimensions
    C, H, W = ifmap.shape
    M, _, R, S = filter.shape
    # Calculate output dimensions
    E = (H + 2 * padding - R) // stride + 1
    F = (W + 2 * padding - S) // stride + 1

    # Apply padding to the input feature map
    ifmap_padded = np.pad(
        ifmap, ((0, 0), (padding, padding), (padding, padding)), mode="constant"
    )

    # Initialize the column matrix
    ifmap_col = np.zeros((E * F, R * S * C))

    # Fill the column matrix
    row_idx = 0
    for y in range(0, H + 2 * padding - R + 1, stride):
        for x in range(0, W + 2 * padding - S + 1, stride):
            patch = ifmap_padded[:, y : y + R, x : x + S]
            # print("patch: \n", patch)
            # print("patch flattened: \n", patch.flatten())
            ifmap_col[row_idx, :] = patch.flatten().reshape(1, -1)
            row_idx += 1

    # Reshape filter
    filter_col = filter.reshape(R * S * C, -1)

    return ifmap_col, filter_col

Tiling Strategy

Afterimg2col, I will tile the result matrix from img2col
Tile ifmap in the row direction,
and the tile_size is equal to systolic_array_height
Tile filter in the column direction,
and the tile_size is equal to systolic_array_width

Implementation Code

def tile_matrix(matrix, tile_dir, tile_size):
    """
    Tile the input matrix into smaller matrices based on tile_dir and tile_size.

    Args:
        matrix (np.ndarray): Input 2D matrix to be tiled.
        tile_dir (str): Direction of tiling ('row' or 'col').
        tile_size (int): Size of each tile along the specified direction.

    Returns:
        np.ndarray: A 3D array where each slice along the third dimension is a tile.
    """
    if not isinstance(matrix, np.ndarray):
        raise ValueError("Input matrix must be a numpy array.")

    if tile_dir not in ('row', 'col'):
        raise ValueError("tile_dir must be either 'row' or 'col'.")

    if not isinstance(tile_size, int) or tile_size <= 0:
        raise ValueError("tile_size must be a positive integer.")

    if tile_dir == 'row':
        if matrix.shape[0] % tile_size != 0:
            raise ValueError("Number of rows is not divisible by tile_size.")

        num_tiles = matrix.shape[0] // tile_size
        tiled = matrix.reshape(num_tiles, tile_size, matrix.shape[1])
        return tiled

    elif tile_dir == 'col':
        if matrix.shape[1] % tile_size != 0:
            raise ValueError("Number of columns is not divisible by tile_size.")
        print("matrix shape", matrix.shape)
        num_tiles = matrix.shape[1] // tile_size
        tiled = matrix.reshape(matrix.shape[0], num_tiles, tile_size).transpose(1, 0, 2)
        return tiled

Tiled Matrix in Systolic Array

The diagram below illustrates 3 things

How my systolic array calculates the tiled matrix?
Why does ifmap need to be tiled in the row direction to match the systolic array height?
Why does the filter need to be tiled in the col direction to match the systolic array width?

Chisel Execute

build.sc

// import Mill dependency
import mill._
import mill.define.Sources
import mill.modules.Util
import mill.scalalib.TestModule.ScalaTest
import scalalib._
// support BSP
import mill.bsp._

object conv_accelerator extends SbtModule { m =>
  override def millSourcePath = os.pwd
  override def scalaVersion = "2.13.8"
  override def scalacOptions = Seq(
    "-language:reflectiveCalls",
    "-deprecation",
    "-feature",
    "-Xcheckinit",
  )
  override def ivyDeps = Agg(
    ivy"edu.berkeley.cs::chisel3:3.5.0",
    ivy"edu.berkeley.cs::chiseltest:0.5.5"
  )

  def scalacPluginIvyDeps = Agg(
    ivy"edu.berkeley.cs:::chisel3-plugin:3.5.5" // Chisel 編譯器插件
  )

  object test extends SbtModuleTests with TestModule.ScalaTest {
    override def ivyDeps = m.ivyDeps() ++ Agg(
      ivy"org.scalatest::scalatest::3.2.15"
    )
  }
}

Test PE

Run this command to test the PE

mill conv_accelerator.test.testOnly TestPE

Generate Verilog File

Run this command to test the PE

mill conv_accelerator.run

Generated File

module PE(
  input         clock,
  input         reset,
  input         io_compute_en,
  input         io_read_en,
  input         io_compute_mode,
  input  [7:0]  io_ifmap_i,
  output [7:0]  io_ifmap_o,
  input  [7:0]  io_weight_i,
  output [7:0]  io_weight_o,
  input  [31:0] io_ipsum,
  output [31:0] io_opsum
);
 // omitted
endmodule

module SystolicArray(
  input           clock,
  input           reset,
  input  [63:0]   io_compute_en,
  input  [63:0]   io_read_en,
  input  [63:0]   io_compute_mode,
  input  [63:0]   io_ifmap,
  input  [63:0]   io_weight,
  input  [2047:0] io_ipsum,
  output [2047:0] io_opsum
);
  // omitted
  PE pe_array_0_0 ( // @[Systolic_array.scala 16:47]
  .clock(pe_array_0_0_clock),
  .reset(pe_array_0_0_reset),
  .io_compute_en(pe_array_0_0_io_compute_en),
  .io_read_en(pe_array_0_0_io_read_en),
  .io_compute_mode(pe_array_0_0_io_compute_mode),
  .io_ifmap_i(pe_array_0_0_io_ifmap_i),
  .io_ifmap_o(pe_array_0_0_io_ifmap_o),
  .io_weight_i(pe_array_0_0_io_weight_i),
  .io_weight_o(pe_array_0_0_io_weight_o),
  .io_ipsum(pe_array_0_0_io_ipsum),
  .io_opsum(pe_array_0_0_io_opsum)
  );
  // omitted
endmodule

Verilator Testbench

Load tiled matrix from Python in C++

Always write comments in English.

template<typename T>
void load_array_from_python(const string& file_name, const string& func_name, const string& arg,
                             vector<vector<vector<T> > >& ifmap, vector<vector<vector<T> > >& weight) {
    // 初始化 Python 解釋器
    Py_Initialize();

    PyRun_SimpleString("import sys");
    PyRun_SimpleString("sys.path.append('../software')");

    // 加載 Python 模組
    PyObject* pName = PyUnicode_DecodeFSDefault(file_name.c_str());
    PyObject* pModule = PyImport_Import(pName);
    Py_DECREF(pName);

    if (pModule) {
        // 獲取 Python 函數
        PyObject* pFunc = PyObject_GetAttrString(pModule, func_name.c_str());

        if (pFunc && PyCallable_Check(pFunc)) {
            // 構造 Python 函數的參數
            PyObject* pArgs = PyTuple_New(1);
            PyObject* pArg = PyUnicode_FromString(arg.c_str());
            PyTuple_SetItem(pArgs, 0, pArg);

            // 調用函數
            PyObject* pValue = PyObject_CallObject(pFunc, pArgs);
            Py_DECREF(pArgs);

            if (pValue && PyTuple_Check(pValue) && PyTuple_Size(pValue) == 2) {
                // 解析第一個 3D array
                PyObject* pyArray1 = PyTuple_GetItem(pValue, 0);
                size_t dim1 = PyList_Size(pyArray1);
                ifmap.resize(dim1);
                for (size_t i = 0; i < dim1; ++i) {
                    PyObject* pSubList1 = PyList_GetItem(pyArray1, i);
                    size_t dim2 = PyList_Size(pSubList1);
                    ifmap[i].resize(dim2);
                    for (size_t j = 0; j < dim2; ++j) {
                        PyObject* pSubList2 = PyList_GetItem(pSubList1, j);
                        size_t dim3 = PyList_Size(pSubList2);
                        ifmap[i][j].resize(dim3);
                        for (size_t k = 0; k < dim3; ++k) {
                            PyObject* pElem = PyList_GetItem(pSubList2, k);
                            ifmap[i][j][k] = static_cast<T>(PyLong_AsLong(pElem));
                        }
                    }
                }

                // 解析第二個 3D array
                PyObject* pyArray2 = PyTuple_GetItem(pValue, 1);
                dim1 = PyList_Size(pyArray2);
                weight.resize(dim1);
                for (size_t i = 0; i < dim1; ++i) {
                    PyObject* pSubList1 = PyList_GetItem(pyArray2, i);
                    size_t dim2 = PyList_Size(pSubList1);
                    weight[i].resize(dim2);
                    for (size_t j = 0; j < dim2; ++j) {
                        PyObject* pSubList2 = PyList_GetItem(pSubList1, j);
                        size_t dim3 = PyList_Size(pSubList2);
                        for (size_t k = 0; k < dim3; ++k) {
                            PyObject* pElem = PyList_GetItem(pSubList2, k);
                            weight[i][j][k] = static_cast<T>(PyLong_AsLong(pElem));
                        }
                    }
                }

                Py_DECREF(pValue);
            } else {
                PyErr_Print();
            }
        } else {
            PyErr_Print();
        }

        Py_XDECREF(pFunc);
        Py_DECREF(pModule);
    } else {
        PyErr_Print();
    }

    // 終止 Python 解釋器
    Py_Finalize();
}

Result and Analysis

Due to insufficient time, I was unable to complete testing of the Systolic Array and boot model on my accelerator.
I can't get the result statistics to analysis, but this was because I had already spent significant effort on the following tasks:

Completed Work

Training the model and determining the quantization method.
Processing the .pt file to load input weights and biases, then converting them into text files.
Using Python to perform img2col and matrix tiling on the img2col results.
Sending the Python-computed tiled matrix to the Verilator C++ testbench.
Setting up the Verilator environment.
Setting up the Chisel environment and learning how to use Chisel.
Using Mill to build the project and learning how to write build.sc.
Writing the PE design and Systolic Array design.
Testing the PE functionality to ensure correctness.
Completing the Verilator Makefile.

Git History

Ongoing Work

Writing the Systolic Array testbench using Verilator.
Using Verilator to send the tiled matrix to the Systolic Array for computation.
Post-unit design and testing.
Verifying the correctness of the accelerator's computation results.

Use Treadle to Evaluate the Performance

Latency:

Test the number of clock cycles required to complete a single convolution, matrix multiplication, or the entire output generation.

Throughput:

Measure the number of images that can be processed per second
(e.g., images from the CIFAR-10 dataset).

Hardware Resource Utilization:

Verify the utilization of computational units (e.g., PEs and SRAM) within each module of the hardware design.

\[ \text{MatMul}(A,B) = A \cdot B \]

Supplement

Chisel Environment Setting

Chisel Template
- It is a useful git repo which provides valid mill environment (build.sc and some include lib)