蘇暐倫
Neural Networks (NN) often give people the impression that they consume lots of memory and need powerful processors. But how much memory do they actually consume? And what kind of processor is capable of running an NN model? In this project, we focus on fully-connected neural networks (FCNN) to discuss the following topics and to seek cost-effective solutions for running NNs.
Neural network (NN) models have a layered structure. As shown in the figure above, there are three hidden layers between the input and output layers. In an FCNN, every layer is composed of nodes, and each node in the former layer connects to every node in the latter layer with an independent weight. The value of each node is the dot product of the former layer's node values and the connected weights, passed through an activation function.
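As a minimal sketch of this computation (in Python/NumPy, with an arbitrary activation function chosen only for illustration), one fully connected layer can be expressed as follows.

```python
import numpy as np

def fc_layer(node_values, weights, activation=np.tanh):
    # node_values: values of the former layer, shape (n_former,)
    # weights:     one weight per connection, shape (n_former, n_latter)
    # Each latter-layer node is the dot product of the former node values and
    # its weights, passed through an activation function (tanh here, for example).
    return activation(node_values @ weights)

# Example: a 784-node input layer feeding a 64-node hidden layer.
x = np.random.rand(784)
w = np.random.randn(784, 64)
h = fc_layer(x, w)          # h.shape == (64,)
```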
Neural network (NN) models can be viewed as sets of weights, and they aim to classify any given input into its most likely class. To achieve this goal, data labeling, which assigns each input its class, plays an important role in model training. During training, each input goes through the operations of the layers and produces a probability for each class. The most likely class is taken as the result and compared with the class assigned during data labeling, i.e., the ground truth, and the weights of the hidden layers are adjusted accordingly.
After training the model, the weights of each layer are kept, typically for inference on unknown inputs. Therefore, the memory consumption of an NN model depends on two factors. One is the number of weights it uses, which is determined by the number of nodes in each layer. The other is the size of the data structure used for each weight.
Take the MNIST dataset as an example: each 28x28-pixel handwritten digit is classified by the NN model into one of 10 classes, i.e., 0 to 9. For an NN model with three hidden layers, combined with the input and output layers, there are four sets of weights. Each weight set consists of the connections between a former layer and a latter layer, so the number of weights in each set is determined by the numbers of nodes in those two layers. The input layer comes from flattening one 28x28-pixel image into 784 nodes. The output layer has 10 nodes, one per class. Assume the three hidden layers each have 64 nodes. The four weight sets of this model, with layer sizes (784, 64, 64, 64, 10), are listed below.
Weight-set | Input / Hidden Layer 1 | Hidden Layer 1 / 2 | Hidden Layer 2 / 3 | Hidden Layer 3 / Output |
---|---|---|---|---|
Number of weights | 784 * 64 | 64 * 64 | 64 * 64 | 64 * 10 |
Factor 2: Data structure size per weight. Assume a 4-byte `float` is used to represent each weight.

Memory consumption:
Weight-set | Input / Hidden Layer 1 | Hidden Layer 1 / 2 | Hidden Layer 2 / 3 | Hidden Layer 3 / Output |
---|---|---|---|---|
Memory consumption | 196 KB | 16 KB | 16 KB | 2.5 KB |
Hence, the minimum memory requirement is the size of the largest weight set, which is 196 KB. During model training, the weight sets of all layers are typically kept in memory for adjustment, so the maximum memory requirement is the sum of the weight-set sizes, which is 230.5 KB. From this example, we can see that memory consumption grows with the number of nodes per layer and the number of layers. Conventionally, this is expressed as the number of parameters of an NN model.
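These figures can be reproduced with a short calculation. The sketch below assumes 4-byte `float` weights and the (784, 64, 64, 64, 10) layer sizes from above.

```python
layers = [784, 64, 64, 64, 10]        # nodes per layer
bytes_per_weight = 4                  # 4-byte float per weight

weight_counts = [a * b for a, b in zip(layers, layers[1:])]
kib = [n * bytes_per_weight / 1024 for n in weight_counts]

print(weight_counts)   # [50176, 4096, 4096, 640]
print(kib)             # [196.0, 16.0, 16.0, 2.5] -> max 196 KB, sum 230.5 KB
```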
TODO: Number of parameters in popular neural network models:
BitNet: Scaling 1-bit Transformers for Large Language Models
https://github.com/microsoft/BitNet
TODO: Why model compression? How many parameters do popular neural network models have? How limited are the hardware resources?
To decrease the memory consumption of NN models, two methodologies can be adopted, i.e., Pruning and Quantization; they matter especially in edge computing, where hardware resources are limited. As mentioned in the previous section, the memory consumption of an NN model depends on two factors, i.e., the number of weights it uses and the size of the data structure used for each weight. These two methodologies decrease the memory consumption of NN models by targeting these two factors respectively.
In this project, we focus on the Quantization approach for model compression.
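As an illustration of how Quantization attacks the second factor (the data-structure size per weight), the sketch below maps 32-bit float weights to signed 4-bit integers with a single scale per weight set. This is a generic symmetric scheme, not necessarily the exact scheme used by BitNetMCU.

```python
import numpy as np

def quantize_4bit(weights):
    # Symmetric linear quantization: map float weights to integers in [-7, 7],
    # so each weight needs only 4 bits instead of 32.
    scale = np.max(np.abs(weights)) / 7.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    # Approximate reconstruction of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(784, 64).astype(np.float32)
q, s = quantize_4bit(w)
print(np.max(np.abs(w - dequantize_4bit(q, s))))  # quantization error
```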
TODO: Report of memory consumptions under different bitperweight conditions
BitNetMCU is an open-source project that attempts to run machine learning on low-end RISC-V microcontrollers. By leveraging Quantization to decrease memory consumption, BitNetMCU has successfully run a handwritten-digit recognition task on the CH32V003, which operates at a 48 MHz clock rate with 2 KB of SRAM and 16 KB of flash, yet costs merely US$0.10 per piece.
BitNetMCU provides three NN models, i.e., FCMNIST, CNNMNIST, and MAXMNIST, and several hyperparameters for model training. It also provides a tool script that transforms the sets of weights into quantized ones after training. For porting the NN model to the MCU, a C inference program and the quantized weights are generated, and they can also be built as a dynamic-link library (DLL) for tests. During the development phase, datasets at two scales can be used for inference tests. The figure below shows how the BitNetMCU project works.
$ git clone https://github.com/cpldcpu/BitNetMCU && cd BitNetMCU
// Step 1: model training
// (edit the hyperparameters with vi or a similar editor)
$ vi trainingparameters.yaml
$ python3 training.py
// Step 2: quantization of the model weights
// (may encounter some problems; see the next section for workaround patches)
$ python3 exportquant.py
// Step 3: build the inference code
$ make
// Step 4a: inference test with the small sample dataset
$ gcc BitNetMCU_MNIST_test.c && ./a.out
// Step 4b: inference test with the full test dataset
$ python3 test_inference.py
If you encounter the following error (or a similar one) in Step 2 or Step 4b, it is likely that the model was trained on a GPU but is being loaded on a CPU before quantization. On a CPU-only machine, you can re-train the model on the CPU from Step 1 to avoid this error.
Traceback (most recent call last):
File "exportquant.py", line 334, in <module>
model.load_state_dict(torch.load(f'modeldata/{runname}.pth', weights_only=True))
File "/home/su/.local/lib/python3.8/site-packages/torch/serialization.py", line 1096, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
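As an alternative to re-training, the error message above suggests mapping the CUDA-saved tensors to the CPU when loading. A minimal, untested sketch of patching the `torch.load` call in exportquant.py (where `runname` and `model` are defined earlier in that script) could look like this:

```python
# Untested alternative workaround: map CUDA-saved tensors to the CPU when
# loading, as the PyTorch error message suggests.
# `runname` and `model` come from earlier in exportquant.py.
import torch

state_dict = torch.load(f'modeldata/{runname}.pth',
                        map_location=torch.device('cpu'),
                        weights_only=True)
model.load_state_dict(state_dict)
```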
TODO: System Environment and measured memory consumptions under different bitperweight conditions
To run machine learning on low-end MCUs, the memory consumption discussed in the sections above plays an important role. Energy consumption plays another important role, because such devices often rely on a limited power supply. To decrease energy consumption, we have to not only lower the instruction count and cycle count, but also be selective about which instructions are used. In this project, we attempt to implement the BitNetMCU Inference Engine with the RV32IM instruction set, aiming to lower the instruction count and cycle count. During development, we validate the implementation on the UCB CS61C Venus (RISC-V simulator). To implement the inference, we need both the dataset and the model weights in binary format for the RV32IM inference program.
BitNetMCU provides the sample dataset and the model weights as C source code, so we need a tool program to convert them into binary files.
BitNetMCU uses the MNIST dataset for handwritten-digit recognition and resizes the images from 28x28 pixels to 16x16 pixels. Each image is represented as 256 8-bit grayscale values. We convert each value into a 4-byte memory word, so one image occupies 1024 bytes.
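A minimal sketch of this part of the conversion is shown below. It assumes the 256 grayscale values have already been parsed out of the C header into a Python list; the output file name is illustrative, and the actual tool in the repository may differ.

```python
import struct

def write_image_words(pixels, path):
    # Write each 8-bit grayscale value as a 4-byte little-endian word,
    # so one 16x16 image (256 values) occupies 1024 bytes.
    assert len(pixels) == 256
    with open(path, "wb") as f:
        for p in pixels:
            f.write(struct.pack("<I", p & 0xFF))

# Example (pixels parsed from BitNetMCU_MNIST_test_data.h):
# write_image_words(pixels, "input.bin")
```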
In addition, BitNetMCU adopts the Quantization approach and uses a 4-byte data structure as the weight chunk. The `bitperweight` configuration determines how many weights are accommodated within a 4-byte weight chunk, so the weights of the same NN model structure occupy different numbers of weight chunks depending on the `bitperweight` configuration.
bitperweight | 32 | 16 | 8 | 4 | 2 | 1 |
---|---|---|---|---|---|---|
number of weights in a 4-byte weight chunk | 1 | 2 | 4 | 8 | 16 | 32 |
NN model memory consumption ratio in theory | 100% | 50% | 25% | 12.5% | 6.25% | 3.125% |
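As an illustration of how these chunk counts come about, the sketch below packs already-quantized integer weights into 32-bit chunks. The bit ordering inside a chunk is an assumption and may differ from BitNetMCU's actual layout.

```python
def pack_chunks(q_weights, bitperweight):
    # Pack quantized integer weights into 4-byte (32-bit) chunks:
    # 32 // bitperweight weights fit into one chunk.
    per_chunk = 32 // bitperweight
    mask = (1 << bitperweight) - 1
    chunks = []
    for i in range(0, len(q_weights), per_chunk):
        word = 0
        for j, w in enumerate(q_weights[i:i + per_chunk]):
            word |= (w & mask) << (j * bitperweight)  # assumed LSB-first order
        chunks.append(word)
    return chunks

# e.g. 784*64 weights at 4 bits/weight -> 50176 / 8 = 6272 chunks (25088 bytes)
```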
Thus, the conversion has to deal with different data types and different numbers of data items, which are specified as arguments to the conversion tool.
$ git clone https://github.com/imNCNUwilliam/nn-venus && cd nn-venus
// The dataset is prepared in "BitNetMCU_MNIST_test_data.h".
// The sets of model weights stored in "BitNetMCU_model.h" are generated
// in the previous section, Step 2: Quantization of model weights.
$ cd BitNetMCU && make check
// generates the input and the four sets of model weights as binary files
Different hyperparameters in model training generate different "BitNetMCU_model.h" files for conversion. However, the current BitNetMCU only supports the NN model "FCMNIST" for inference tests. Besides, BitNetMCU.py causes some problems while exporting the quantized model weights and needs some patches; please apply the one in the patch/ directory to fix the problem.
We base our RV32IM inference program on the UCB CS61C Fall 2024 Project: Classify, which is a test and development framework for RV32IM programs on the UCB CS61C Venus (RISC-V simulator). After importing the GitHub repository, two major functions, i.e., Matrix Multiplication and ReLU Normalization, have to be finished, together with the handling of Quantization.
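For reference, the behavior these two routines need to implement can be sketched in Python as below, assuming flat row-major arrays of 32-bit integers (mirroring the memory layout the RV32IM code works with); any additional normalization step applied after ReLU is omitted here.

```python
def matmul(a, b, n, m, k):
    # a: n x m matrix, b: m x k matrix, both flat row-major lists of integers.
    out = [0] * (n * k)
    for i in range(n):
        for j in range(k):
            acc = 0
            for t in range(m):
                acc += a[i * m + t] * b[t * k + j]
            out[i * k + j] = acc
    return out

def relu(values):
    # ReLU: negative activations are clamped to zero.
    return [v if v > 0 else 0 for v in values]
```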