
Rework Neural Networks with low bit weights on Venus

蘇暐倫

Neural networks (NN) often give people the impression that they consume lots of memory and need powerful processors. But how much memory does an NN model actually consume, and what kind of processor is capable of running it? In this project, we focus on fully-connected neural networks (FCNN), discuss the following topics, and seek cost-effective ways to run NN models.

  • Memory consumption for a given NN model
  • Model compression: Quantization approach
  • Open-source project to leverage Quantization
  • RV32IM implementation of Quantization approach

Memory consumption for a given NN model

[Figure: a fully-connected neural network with three hidden layers between the input and output layers]

Recently, neural network (NN) models adopted in Artificial Intelligence (AI) have amazed the world in many fields, such as AlphaGo in the board game Go. However, no single NN model fits every application, and applications often differ in their characteristics. Therefore, when a company or individual adopts an NN model, the parameters of the selected model need to be adjusted before deployment. On the other hand, the hardware required to run the NN model directly determines the efficiency, cost, and time. Thus, it is important to assess feasibility before selecting an NN model to run on a given hardware platform. Here, we estimate the memory consumption of a simplified NN model.

Neural network (NN) models have a layered structure. As shown in the figure above, there are three hidden layers between the input and output layers. In an FCNN, every layer is composed of nodes, and each node in the former layer connects to every node in the latter layer with an independent weight. The value of each node is the dot product of the former layer's node values and its connected weights, passed through an activation function.
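For illustration, below is a minimal NumPy sketch of one fully-connected layer, assuming ReLU as the activation function and omitting bias terms; the actual activation used by a given model may differ.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense_layer(x, W, activation=relu):
    """One fully-connected layer: each latter-layer node value is the dot
    product of the former-layer node values with its own weights, passed
    through an activation function."""
    return activation(W @ x)

# Example: a flattened 28x28 input (784 nodes) feeding a 64-node hidden layer
x = np.random.rand(784)
W = np.random.randn(64, 784)   # 64 * 784 independent weights
h1 = dense_layer(x, W)         # 64 node values for hidden layer 1
```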

Neural network (NN) models can be viewed as sets of weights that aim to classify any given input into its most likely class. To achieve this goal, data labeling, which assigns each input its class, plays an important role in model training. During training, each input goes through the operations of the layers and produces a probability for each class. The most likely class is taken as the result and compared with the label assigned during data labeling, i.e., the ground truth, and the weights of the hidden layers are adjusted accordingly.

After training, the weights of each layer are typically kept for inference on unknown inputs. Therefore, the memory consumption of an NN model depends on two factors. One is the number of weights it uses, which is determined by the number of nodes in each layer. The other is the size of the data structure used for each weight.

Take the MNIST dataset as an example: each 28x28-pixel handwritten digit is classified by the NN model into one of 10 classes, i.e., 0 to 9. For a three-hidden-layer NN model combined with the input and output layers, there are four sets of weights. Each weight-set is composed of the node connections between a former and a latter layer; namely, the number of weights in each weight-set is the product of the numbers of nodes in the former and latter layers. The input layer comes from flattening one 28x28-pixel image into 784 nodes. The output layer has 10 nodes, one per class. Assume the three hidden layers each have 64 nodes. The four weight-sets of this model, with layer sizes (784, 64, 64, 64, 10), are as follows.

  • Factor 1: Numbers of weights.

| Weight-set | Input / Hidden Layer 1 | Hidden Layer 1 / 2 | Hidden Layer 2 / 3 | Hidden Layer 3 / Output |
| --- | --- | --- | --- | --- |
| Number of weights | 784 * 64 | 64 * 64 | 64 * 64 | 64 * 10 |
  • Factor 2: Data structure size per weight.

    • It is 4 bytes when a float is used to represent each weight.
  • Memory consumption.

| Weight-set | Input / Hidden Layer 1 | Hidden Layer 1 / 2 | Hidden Layer 2 / 3 | Hidden Layer 3 / Output |
| --- | --- | --- | --- | --- |
| Memory consumption | 196 KB | 16 KB | 16 KB | 2.5 KB |

Hence, the minimum memory requirement for inference is the size of the largest weight-set, which is 196 KB. During model training, all weight-sets are likely kept in memory for adjustment, so the maximum memory requirement is the sum of the weight-set sizes, which is 230.5 KB. From this example, we can see that memory consumption grows with the number of nodes per layer and the number of layers. Conventionally, this is expressed as the number of parameters of an NN model.
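The calculation above can be reproduced with a few lines of Python (a sketch assuming 4-byte float weights, matching the two factors above):

```python
layers = [784, 64, 64, 64, 10]   # nodes per layer, from input to output
bytes_per_weight = 4             # one float per weight

weight_set_bytes = [a * b * bytes_per_weight for a, b in zip(layers, layers[1:])]
print([f"{b / 1024} KB" for b in weight_set_bytes])        # ['196.0 KB', '16.0 KB', '16.0 KB', '2.5 KB']
print(f"minimum (largest weight-set): {max(weight_set_bytes) / 1024} KB")   # 196.0 KB
print(f"maximum (all weight-sets):    {sum(weight_set_bytes) / 1024} KB")   # 230.5 KB
```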

TODO: Number of parameters in popular neural network models:

Model compression: Quantization approach

BitNet: Scaling 1-bit Transformers for Large Language Models
https://github.com/microsoft/BitNet

TODO: Why model compression? How many parameters do popular neural network models have? How limited are the hardware resources?

To decrease the memory consumption of NN models, two methodologies can be adopted, i.e., pruning and quantization, especially in edge computing where hardware resources are limited. As mentioned in the previous section, the memory consumption of an NN model depends on two factors, i.e., the number of weights it uses and the size of the data structure used for each weight. These two methodologies address these two factors respectively.

  • Solution 1: Pruning approach.
    Reduce the number of weights used in an NN model to decrease the memory consumption.
  • Solution 2: Quantization approach.
    Accommodate multiple weights within the same data structure to decrease the memory consumption.

In this project, we focus on the Quantization approach for model compression.
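As a rough illustration of the quantization idea, the sketch below packs eight 4-bit weight codes into one 32-bit word, so eight weights share the memory of a single float; the actual encoding and packing order used by BitNet or BitNetMCU may differ.

```python
def pack_4bit(codes):
    """Pack signed 4-bit weight codes (-8..7) into 32-bit chunks,
    8 codes per chunk, lowest nibble first."""
    assert len(codes) % 8 == 0
    chunks = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, c in enumerate(codes[i:i + 8]):
            word |= (c & 0xF) << (4 * j)   # two's-complement nibble
        chunks.append(word)
    return chunks

codes = [3, -2, 7, -8, 0, 1, -1, 5]        # eight quantized weight codes
print(hex(pack_4bit(codes)[0]))            # one 4-byte chunk instead of 8 floats (32 bytes)
```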

TODO: Report of memory consumptions under different bitperweight conditions

Open-source project to leverage Quantization

BitNetMCU is an open-source project that attempts to run machine learning on a low-end RISC-V microcontroller. Leveraging quantization to decrease memory consumption, BitNetMCU has successfully run a handwritten-digit recognition task on the CH32V003, which operates at a 48 MHz clock rate with 2 KB SRAM and 16 KB flash, yet costs merely US$0.10 per piece.

BitNetMCU provides three NN models, i.e., FCMNIST, CNNMNIST, and MAXMNIST, together with several hyperparameters for model training. It also provides a tool script that transforms the sets of weights into quantized ones after training. For porting the NN model to the MCU, an inference C program and the quantized weights are compiled into a dynamic-link library (DLL). During the development phase, datasets of two scales can be used for inference tests. The figure below shows how the BitNetMCU project works.

[Figure: BitNetMCU project workflow; circled numbers mark the steps below]

The following are the steps to try BitNetMCU on a personal computer. The circled numbers in the figure above correspond to the step numbers below.

  • Step 0: Hyperparameters selection for model training.
$ git clone https://github.com/cpldcpu/BitNetMCU && cd BitNetMCU
# use vi or a similar editor
$ vi trainingparamters.yaml
  • Step 1: Model training.
$ python3 training.py
  • Step 2: Quantization of model weights.
# may encounter some problems
# see the next section for workaround patches
$ python3 exportquant.py
  • Step 3: C DLL generation for inference.
$ make
  • Step 4a: Inference with 10-data dataset.
$ gcc BitNetMCU_MNIST_test.c && ./a.out
  • Step 4b: Inference with 10,000-data MNIST dataset.
$ python3 test_inference.py

If you encounter the following error (or similar) in Step 2 or Step 4b, it is likely that the model was trained on a GPU but is being loaded on a CPU before quantization. On a CPU-only machine, you can re-train the model on the CPU from Step 1 to avoid this error.

Traceback (most recent call last):
  File "exportquant.py", line 334, in <module>
    model.load_state_dict(torch.load(f'modeldata/{runname}.pth', weights_only=True))
  File "/home/su/.local/lib/python3.8/site-packages/torch/serialization.py", line 1096, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
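Alternatively, following the hint at the end of the error message, the load call in exportquant.py can map the CUDA-trained tensors onto the CPU. This is a suggested workaround rather than part of the upstream code; `model` and `runname` are assumed to be defined by the script, as in the traceback above.

```python
import torch

# Original call (around line 334 of exportquant.py, per the traceback):
#   model.load_state_dict(torch.load(f'modeldata/{runname}.pth', weights_only=True))

# Suggested workaround on a CPU-only machine: remap CUDA tensors to the CPU.
state_dict = torch.load(f'modeldata/{runname}.pth',
                        map_location=torch.device('cpu'),
                        weights_only=True)
model.load_state_dict(state_dict)
```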

TODO: System Environment and measured memory consumptions under different bitperweight conditions

RV32IM implementation of Quantization approach

Github repository: nn-venus

To run machine learning on low-end MCUs, the memory consumption discussed in the sections above plays an important role. Energy consumption plays another important role, because such devices often rely on a limited power supply. To decrease energy consumption, we have to not only lower the instruction count and cycle count, but also choose the instructions carefully. In this project, we attempt to implement the BitNetMCU Inference Engine with the RV32IM instruction set, aiming to lower the instruction count and cycle count. During development, we validate the implementation on UCB CS61C Venus (RISC-V simulator). To implement the inference, we need both the dataset and the model weights in binary format for the RV32IM inference program.

Binary file conversion

BitNetMCU provides the sample dataset and model weights in C source code, so we need a tool program to convert them into binary files.

  • BitNetMCU uses MNIST as the handwritten-digit recognition dataset, and resizes images from 28x28 pixels to 16x16 pixels. Each image is represented as 256 8-bit grayscale values. We convert each value into a 4-byte memory word, so one input image occupies 1024 bytes.

  • In addition, BitNetMCU adopts the quantization approach and uses a 4-byte data structure as the weight chunk. The bitperweight configuration determines how many weights are accommodated within a 4-byte weight chunk, so the weights of the same NN model structure occupy different numbers of weight chunks depending on the bitperweight configuration, as shown in the table below.

| bitperweight | 32 | 16 | 8 | 4 | 2 | 1 |
| --- | --- | --- | --- | --- | --- | --- |
| Number of weights in a 4-byte weight chunk | 1 | 2 | 4 | 8 | 16 | 32 |
| NN model memory consumption ratio in theory | 100% | 50% | 25% | 12.5% | 6.25% | 3.125% |

Thus, the conversion needs to handle different data types and different numbers of data items, which are specified as arguments to the conversion tool; a sketch of the idea follows the commands below.

$ git clone https://github.com/imNCNUwilliam/nn-venus && cd nn-venus
# Dataset is prepared in "BitNetMCU_MNIST_test_data.h".
# Sets of model weights stored in "BitNetMCU_model.h" are generated
# from the previous section, Step 2: Quantization of model weights.

$ cd BitNetMCU && make check
# generate the input and the four sets of model weights as binary files
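For illustration, the following sketch shows the idea behind the dataset conversion described above: each 8-bit grayscale value is widened to a 4-byte little-endian word and written to a binary file. The file name and word format here are assumptions for this example; the actual tool invoked by make check may differ.

```python
import struct

def write_words(values, out_path):
    """Write each value as a 4-byte little-endian word, as the RV32IM
    program loads word-aligned data from memory."""
    with open(out_path, "wb") as f:
        for v in values:
            f.write(struct.pack("<i", int(v)))

# Example: a 16x16 image is 256 grayscale values (0..255);
# stored as 4-byte words it occupies 256 * 4 = 1024 bytes.
pixels = [0] * 256                       # placeholder image data
write_words(pixels, "mnist_input.bin")   # hypothetical output file name
```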

Different hyperparameters during model training produce different "BitNetMCU_model.h" files for conversion. However, the current BitNetMCU only supports the NN model "FCMNIST" for inference tests. Besides, when exporting the quantized model weights, BitNetMCU.py causes some problems and needs some patches. Please apply the patch in patch/ to fix the problem.

RV32IM inference program

We base the RV32IM inference program on UCB CS61C Fall 2024 Project: Classify, a test and development framework for RV32IM programs on UCB CS61C Venus (RISC-V simulator). After importing the GitHub repository, two major functions have to be finished, i.e., matrix multiplication and ReLU, together with handling of the quantized weights; a reference sketch of the required logic is given below.
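Below is a high-level Python reference of the logic the RV32IM program has to implement for bitperweight = 4, assuming eight weight codes are packed per 32-bit chunk in the order shown in the earlier packing sketch; the exact chunk layout and output normalization in BitNetMCU may differ.

```python
def fc_layer_4bit(activations, weight_chunks, n_out, n_in):
    """Reference of the quantized fully-connected layer to implement in
    RV32IM assembly: unpack 4-bit weight codes from 32-bit chunks,
    accumulate dot products, then apply ReLU.
    Assumes n_in is a multiple of 8 and the lowest-nibble-first layout
    of the packing sketch above."""
    out = []
    chunk_idx = 0
    for _ in range(n_out):
        acc = 0
        for i in range(0, n_in, 8):            # 8 weights per 32-bit chunk
            chunk = weight_chunks[chunk_idx]
            chunk_idx += 1
            for j in range(8):
                nibble = (chunk >> (4 * j)) & 0xF
                w = nibble - 16 if nibble >= 8 else nibble   # sign-extend the 4-bit code
                acc += w * activations[i + j]
        out.append(max(acc, 0))                # ReLU
    return out

# Example: 256 input values, 64 output nodes -> 64 * 256 / 8 = 2048 chunks
outputs = fc_layer_4bit([1] * 256, [0x5f1087e3] * 2048, 64, 256)
```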