# Linux Kernel Project: Accelerating BitNet with System-Level Techniques
> Contributor: Denny0097
[GitHub](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/)
>Video
[Export to Model.h inference](https://www.youtube.com/watch?v=fuphucQnpVM)
[Analysis](?)
### Reviewed by `ginsengAttack`
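After running the command
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes \
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
the reported result lacks detail on the effect of using huge pages. It would be worth checking whether huge pages were actually used, and describing the runtime differences in detail.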
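## Task Overview
As the resource demands of large language models keep growing, Microsoft Research released BitNet b1.58 2B4T, an open-source model trained with 1.58-bit quantization. During inference, BitNet's matrix multiplications reduce to additions, subtractions, and skips, so floating-point multiplication is removed from them entirely. In deployment this lowers latency, lightens the memory access load, and reduces energy consumption. Such a minimal compute flow is also well suited to hardware accelerator design.
BitNet b1.58 substantially simplifies and reworks the Transformer architecture: it removes the bias terms from linear and normalization layers, replaces conventional fully connected layers with BitLinear layers, and uses rotary position embeddings (RoPE), the ReLU² activation function, and subLN normalization.
The BitNet b1.58 2B4T model has about 2 billion parameters and was trained on 4 trillion tokens of natural language (mostly English). It uses the same tokenizer as Llama 3, with a vocabulary size of 128,256, and supports a context length of up to 4096 tokens. Training has three stages: pretraining, supervised fine-tuning (SFT), and preference alignment (DPO), which balance benchmark performance against conversational quality.
BitNet b1.58 runs very efficiently. On a general-purpose CPU such as the Apple M2 it reaches a latency of 29 ms with a memory footprint of only 0.4 GB, far below, for example, the 1.4 GB needed by Gemma 3 1B. In its benchmarks the Microsoft team compared it with full-precision models such as Llama 3.2 1B, Gemma 3 1B, and Qwen 2.5 1.5B, and found that BitNet, even with fewer parameters and lower weight precision, maintains stable performance on tasks such as MMLU, GSM8K, and MATH, and achieves the best score on GSM8K. At scales of 3 billion parameters and above, BitNet's performance approaches that of FP16 models, and at 7 billion parameters it reaches up to 4x faster inference and 7x memory savings, which shows that the technique scales well.
This project uses the performance analysis tools provided by the Linux kernel to locate BitNet's runtime bottlenecks, and applies [Transparent Hugepage Support](https://docs.kernel.org/admin-guide/mm/transhuge.html), [event-driven I/O models](https://hackmd.io/@sysprog/linux-io-model) (such as `io_uring`), and other techniques covered in the course to accelerate BitNet.
> [Alan's experiment](https://hackmd.io/@alanhc/2025q1-term-project)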
## Outline
- TODO
- BitNet Experiment
- Env
- SW
- Model
- Quantization scheme (QAT)
- Code architecture
- Analysis
- Optimization based on analysis
- HW simulation
## TODO
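Following [BitNet](https://github.com/microsoft/BitNet/tree/main) and [BitNetMCU](https://github.com/cpldcpu/BitNetMCU), apply BitNet quantization on Linux to a VGG8 model for the MNIST dataset (model.h), use the performance analysis tools provided by the Linux kernel to find the runtime bottlenecks of training and inference, design CPU and memory usage optimizations based on that analysis, simulate the hardware computation with SystemC, and also attempt hardware acceleration.
Study BitNet [1]:
* Run BitNet b1.58 2B4T on a GNU/Linux system and reproduce the experiments in the paper
Performance improvements:
* Use perf and similar tools to measure the top 20 functions by share of compute resources during inference, and discuss what they do
* Analyze memory usage, in particular statistics such as page faults and TLB misses. Mining programs such as XMRig [2] gain speed by making good use of huge pages (or THP)
* Evaluate T-MAC [3] [5], in particular the benefit of its table lookups when combined with BitNet, and record perf event statistics along the way
* Examine the model-loading mechanism and whether mechanisms such as splice [4] can speed it up
Modify BitNetMCU and export a 2-bit model:
* Bring in the VGG8 originally provided by lab2
* Add padding and maxpool support
* Complete the unfinished 2-bit QuantType (I2_S)
* Complete the unfinished Ternary QuantType (TL1, TL2)
* Generate a model.h that can be deployed to hardware
* Write inference.c to test model inference on the CPU
* Compare the PTQ 8-bit model with the QAT 2w8a model
HW simulation
* C sim or Verilator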
[1] https://github.com/microsoft/BitNet
[2] https://xmrig.com/docs/miner/hugepages
[3] https://github.com/microsoft/T-MAC
[4] https://hackmd.io/@sysprog/linux-zerocopy
[5] BitNet 有 LUT: https://github.com/microsoft/BitNet/tree/main/src
https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html
## BitNet Experiment
### Env
:::spoiler Experiment environment
```shell
(base) denny0097:~/linux2025$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.2 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
(base) denny0097:~/linux2025$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
(base) denny0097:~/linux2025$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ conda info
active environment : bitnet-cpp
active env location : /home/denny0097/miniconda3/envs/bitnet-cpp
shell level : 2
user config file : /home/denny0097/.condarc
populated config files : /home/denny0097/miniconda3/.condarc
conda version : 25.3.1
conda-build version : not installed
python version : 3.13.2.final.0
solver : libmamba (default)
virtual packages : __archspec=1=skylake
__conda=25.3.1=0
__cuda=12.4=0
__glibc=2.39=0
__linux=6.11.0=0
__unix=0=0
base environment : /home/denny0097/miniconda3 (writable)
```
:::
run instruction:
```
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
```
result :
```!
> User: Tell me about the architecture of BitNet.
BitNet is a software communication network that connects devices and provides a common architecture for the Internet, but it's not a place in Barcelona, Spain, although it might seem like that from the name. It's actually a network protocol developed by the University of California, Berkeley in the 197
```
#### Benchmark :
```
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
```
This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256 token prompt, utilizing 4 threads.
#### result :
| model | size | params | backend | threads | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | pp256 | 15.86 ± 0.04 |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | tg200 | 15.75 ± 0.10 |
## SW
### Model
:::spoiler Model summary
| Layer (type) | Output Shape | Param # |
| :--- | :--- | ---: |
| BitConv2d-1 | [-1, 64, 32, 32] | 576 |
| ReLU-2 | [-1, 64, 32, 32] | 0 |
| MaxPool2d-3 | [-1, 64, 16, 16] | 0 |
| BitConv2d-4 | [-1, 192, 16, 16] | 110,592 |
| ReLU-5 | [-1, 192, 16, 16] | 0 |
| MaxPool2d-6 | [-1, 192, 8, 8] | 0 |
| BitConv2d-7 | [-1, 384, 8, 8] | 663,552 |
| ReLU-8 | [-1, 384, 8, 8] | 0 |
| BitConv2d-9 | [-1, 256, 8, 8] | 884,736 |
| ReLU-10 | [-1, 256, 8, 8] | 0 |
| BitConv2d-11 | [-1, 256, 8, 8] | 589,824 |
| ReLU-12 | [-1, 256, 8, 8] | 0 |
| MaxPool2d-13 | [-1, 256, 4, 4] | 0 |
| Flatten-14 | [-1, 4096] | 0 |
| BitLinear-15 | [-1, 256] | 1,048,576 |
| ReLU-16 | [-1, 256] | 0 |
| BitLinear-17 | [-1, 128] | 32,768 |
| ReLU-18 | [-1, 128] | 0 |
| BitLinear-19 | [-1, 10] | 1,280 |

- Total params: 3,331,904
- Trainable params: 3,331,904
- Non-trainable params: 0
- Input size (MB): 0.00
- Forward/backward pass size (MB): 2.91
- Params size (MB): 12.71
- Estimated Total Size (MB): 15.63
:::

:::spoiler trainingparameters.yaml
# Description: Training parameters for the training script
# Model selection
model: 'VGG' # 'FCMNIST' or 'CNNMNIST' This is the class name of the model as defined in models.py.
# Quantization settings
QuantType: 'I2_S' # 'Ternary', 'Binary', 'BinaryBalanced', '2bitsym', '4bit', '4bitsym', '8bit', 'None", 'FP130', 'NF4', 'I2_S'
NormType: 'RMS' # 'RMS', 'Lin', 'BatchNorm'
WScale: 'PerTensor' # 'PerTensor', 'PerOutput'
# Clipping parameters - only used for 2 bit and higher quantization
maxw_algo: 'octav' # 'octav', 'prop' Algorithm used to calculate the clipping parameters (maximum weight)
maxw_update_until_epoch: 50 # Update clipping parameters until this epoch, they are frozen afterwards
maxw_quantscale: 0.25 # Used only for clipping_algo='prop'. Determines the relation between stddev of weights and max_weight
# Learning parameters
num_epochs: 50
batch_size: 32
scheduler: "Cosine" # "StepLR", "Cosine"
learning_rate: 0.001
lr_decay: 0.1 # lr_decay and step size are not used with cosine scheduler
step_size: 10
# halve_lr_epoch: 30 # Epoch at which to halve the learning rate
# Data augmentation
augmentation: True
rotation1: 10 # rotation1 and rotation2 are used for data augmentation
rotation2: 10
# Model parameters
network_width1: 256
network_width2: 128
network_width3: 0
# name
runtag: "octav" # runtag is prefix for runname
:::
[Exported quantized model](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/BitNetMCU_model.h)
`BitConv2d` and `BitLinear` are the layer classes provided by BitNetMCU. They include normalization (RMS) and the QAT forward pass.
```python=
# Excerpt from BitNetMCU.py; BitQuant, nn and F are defined/imported there.
class BitConv2d(nn.Conv2d, BitQuant):
    """
    2D convolution layer with quantization aware training and normalization.
    Configurable quantization and normalization types.

    Normalization Types:
    - RMS : Root Mean Square
    - None : No normalization

    @cpldcpu 2024-June-2
    """
    # def __init__ ...

    def forward(self, x):
        """
        Args:
            x: an input tensor with shape [n, c_in, h, w]
        Returns:
            y: an output tensor with shape [n, c_out, h_out, w_out]
        """
        w = self.weight  # weight tensor with shape [c_out, c_in / groups, kh, kw]
        x_norm = self.Normalize(x)

        if self.QuantType == 'None':
            y = F.conv2d(x_norm, w, stride=self.stride, padding=self.padding, groups=self.groups)
        else:
            # Straight-through estimator: the forward pass sees the quantized values,
            # while the detached difference lets gradients flow as if unquantized.
            x_int, x_scale = self.activation_quant(x_norm)
            x_quant = x_norm + (x_int / x_scale - x_norm).detach()
            w_int, w_scale, _ = self.weight_quant(w)
            w_quant = w + (w_int / w_scale - w).detach()
            y = F.conv2d(x_quant, w_quant, groups=self.groups, stride=self.stride, padding=self.padding, bias=None)
        return y
```
#### Training data set:
- MNIST
- CIFAR10
### Quantization scheme (QAT)
#### Weight
`w_scale = 1.0 / w.abs().mean().clamp_(min=1e-5)`
`w_int = (w * w_scale).round().clamp_(-1, 1)`
$$
\begin{align}
s &= \frac{1}{\max(\mathrm{mean}(|w|),\ 10^{-5})}
\\ q(w) &= \mathrm{clamp}(\mathrm{round}(w \cdot s), -1, 1)
\end{align}
$$
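The absmean rule above can be sketched in NumPy as follows. This is a minimal sketch that mirrors the two PyTorch lines; the function name `ternary_quantize` and the sample weights are illustrative assumptions and are not part of BitNetMCU.

```python
import numpy as np

def ternary_quantize(w, eps=1e-5):
    """Absmean ternary quantization (hypothetical helper).

    Returns (w_int, scale) where w_int holds values in {-1, 0, 1}
    and scale = 1 / max(mean(|w|), eps).
    """
    w = np.asarray(w, dtype=np.float64)
    scale = 1.0 / max(float(np.mean(np.abs(w))), eps)
    w_int = np.clip(np.round(w * scale), -1, 1)
    return w_int.astype(np.int8), scale

# Made-up sample weights, only to show the mapping.
w = np.array([0.8, -0.05, 0.3, -0.6, 0.02, -1.2])
w_int, scale = ternary_quantize(w)
print(w_int, scale)
```

Weights whose magnitude is well below the mean absolute value collapse to 0, while everything else saturates at -1 or +1, which is why the exported layers contain only three distinct values.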
#### Activation
Training:
`scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)`
`y = (x * scale).round().clamp_(-128, 127)`
Inference:
`scale = 127.0 / np.maximum(np.abs(input_data).max(axis=-1, keepdims=True), 1e-5)`
`current_data = np.round(input_data * scale).clip(-128, 127)`
$$
\begin{align}
s &= \frac{127}{\max(\max(|x|),\ 10^{-5})}
\\ q(x) &= \mathrm{clamp}(\mathrm{round}(x \cdot s), -128, 127)
\end{align}
$$
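**HW friendly:**
$$
\begin{align}
s &= \max(|x|) \gg 7
\\ &\text{while } (s > 0):\quad s \leftarrow s \gg 1,\quad shift \leftarrow shift + 1
\\ rounding &= (1 \ll shift) \gg 1
\\ q(x) &= (x + rounding) \gg shift
\end{align}
$$
Compute the maximum absolute value of the input, $\max(|x|)$, and take $2^{shift+7}$, the smallest power of two above it (never less than $2^7$), as the quantization range. Then $x/2^{shift}$ maps the values into the range [-128, 127], but a plain shift always discards the fractional part. For better accuracy, rounding is added: $(x+2^{shift-1})/2^{shift}$. The whole procedure uses neither multiplication nor floating-point arithmetic.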
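The shift-based scheme can be compared against the floating-point absmax rule with a short NumPy sketch. The function names and the sample activations are assumptions made for illustration. The final clamp is also an assumption added here: with rounding, the largest input can land on +128 (for example, 255 with a shift of 1), which does not fit in int8.

```python
import numpy as np

def shift_quantize(x):
    """Quantize integer activations to int8 using only shifts and adds (sketch)."""
    x = np.asarray(x, dtype=np.int32)
    s = int(np.max(np.abs(x))) >> 7
    shift = 0
    while s > 0:              # shift = bit length of (max|x| >> 7)
        s >>= 1
        shift += 1
    rounding = (1 << shift) >> 1
    q = (x + rounding) >> shift   # arithmetic shift on signed values
    # Assumption: clamp, because rounding can push the maximum to +128.
    return np.clip(q, -128, 127).astype(np.int8), shift

def absmax_quantize(x):
    """Floating-point reference: scale = 127 / max(max|x|, 1e-5)."""
    x = np.asarray(x, dtype=np.float64)
    scale = 127.0 / max(float(np.max(np.abs(x))), 1e-5)
    return np.clip(np.round(x * scale), -128, 127).astype(np.int8)

acts = [1650, -1109, -424, 88, 0, 255]   # made-up layer outputs
q, shift = shift_quantize(acts)
print("shift:", shift, "q:", q)
print("absmax reference:", absmax_quantize(acts))
```

The shift version uses a power-of-two range, so it may leave part of the int8 range unused compared with the absmax reference. That is the price of avoiding a division or multiplication per value.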
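#### [BitNet](https://github.com/microsoft/BitNet/tree/main) I2_S:
QAT is performed with the [BitLinear](https://github.com/cpldcpu/BitNetMCU/blob/main/BitNetMCU.py) and [BitConv2d](https://github.com/cpldcpu/BitNetMCU/blob/main/BitNetMCU.py) layers provided by [BitNetMCU](https://github.com/cpldcpu/BitNetMCU), so that the model learns during training to adapt to {-1,0,1} low bit-width inference.
However, BitNetMCU currently does not support padding, max pooling, or BatchNorm (a model trained with BatchNorm shows abnormally low accuracy once exported to a .c file, close to random guessing), and, most importantly, the original export.py does not yet support storing any BitNet format.
I therefore add MaxPool and padding support, train on CIFAR10, and first implement export of an I2_S model (2 bits). This allows VGG8, which has more layers than the models the original code supports (fully connected layers + ReLU only), to be trained in this project and exported as a BitNet model, together with a matching test that runs on the CPU.
:::spoiler Export result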
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ python exportquant.py
Load parameters from file: trainingparameters.yaml
octav_VGG_Aug_BitMnist_I2_S_width256_128_0_epochs10
Loading model...
Inference using the original model...
Accuracy/Test of trained model: 99.67 %
Quantizing model...
0 VGG
1 Sequential
2 BitConv2d
3 ReLU
4 MaxPool2d
5 BitConv2d
6 ReLU
7 MaxPool2d
8 BitConv2d
9 ReLU
10 BitConv2d
11 ReLU
12 BitConv2d
13 ReLU
14 MaxPool2d
15 Flatten
16 BitLinear
17 ReLU
18 BitLinear
19 ReLU
20 BitLinear
Layer: 2, Max: 1.0, Min: -1.0, Mean: -0.11458333333333333, Std: 0.8317041365974641
Values: [-1. 0. 1.]
Percent: [40.97222222 29.51388889 29.51388889]
Entropy: 1.57 bits. Code capacity used: 78.33160789015268 %
Layer: 5, Max: 1.0, Min: -1.0, Mean: -0.2090747974537037, Std: 0.7201310989476462
Values: [-1. 0. 1.]
Percent: [38.5687934 43.76989294 17.66131366]
Entropy: 1.49 bits. Code capacity used: 74.68134472269595 %
Layer: 8, Max: 1.0, Min: -1.0, Mean: -0.14563289689429013, Std: 0.6785549412400004
Values: [-1. 0. 1.]
Percent: [31.36393229 51.83542511 16.8006426 ]
Entropy: 1.45 bits. Code capacity used: 72.42034190610372 %
Layer: 10, Max: 1.0, Min: -1.0, Mean: -0.17259724934895834, Std: 0.6684536886212908
Values: [-1. 0. 1.]
Percent: [32.46086968 52.33798557 15.20114475]
Entropy: 1.43 bits. Code capacity used: 71.44577546801597 %
Layer: 12, Max: 1.0, Min: -1.0, Mean: -0.13642713758680555, Std: 0.6916967437120225
Values: [-1. 0. 1.]
Percent: [31.67419434 50.29432509 18.03148058]
Entropy: 1.47 bits. Code capacity used: 73.48354983372585 %
Layer: 16, Max: 1.0, Min: -1.0, Mean: -0.06987667083740234, Std: 0.6562856511567088
Values: [-1. -0. 1.]
Percent: [25.27351379 56.4406395 18.28584671]
Entropy: 1.42 bits. Code capacity used: 70.77351150852402 %
Layer: 18, Max: 1.0, Min: -1.0, Mean: -0.019989013671875, Std: 0.7926493968240628
Values: [-1. 0. 1.]
Percent: [32.43408203 37.1307373 30.43518066]
Entropy: 1.58 bits. Code capacity used: 78.99523722907314 %
Layer: 20, Max: 1.0, Min: -1.0, Mean: -0.31484375, Std: 0.7746561579732891
Values: [-1. -0. 1.]
Percent: [50.703125 30.078125 19.21875 ]
Entropy: 1.48 bits. Code capacity used: 73.77139839499056 %
Total number of bits: 6663808 (813.453125 kbytes)
inference of quantized model
layer: ('BitConv2d', 2)
layer: ('ReLU', 3)
layer: ('MaxPool2d', 4)
layer: ('BitConv2d', 5)
layer: ('ReLU', 6)
layer: ('MaxPool2d', 7)
layer: ('BitConv2d', 8)
layer: ('ReLU', 9)
layer: ('BitConv2d', 10)
layer: ('ReLU', 11)
layer: ('BitConv2d', 12)
layer: ('ReLU', 13)
layer: ('MaxPool2d', 14)
layer: ('BitLinear', 16)
layer: ('ReLU', 17)
layer: ('BitLinear', 18)
layer: ('ReLU', 19)
:::
### Weight Distribution

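The weight distribution shows relatively high proportions of -1 and 0. A likely reason is that the model has no bias and every conv and FC layer is followed by ReLU, so the inputs to each conv layer are all non-negative, while the conv outputs tend toward a roughly normal distribution (concentrated around 0). As a result, negative weights outnumber positive ones.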
### Code architecture
``` mermaid
graph TD;
QAT_Training-->Exporting-->model.h-->Test_Inference
```
:::spoiler Added [VGG](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/models.py)
```python
class VGG(nn.Module):
    def __init__(self, network_width1=256, network_width2=128, network_width3=0, QuantType='Binary', WScale='PerTensor', NormType='BatchNorm', in_channels=1, in_size=32, num_classes=10):
        super(VGG, self).__init__()
        # Conv1
        self.conv1 = nn.Sequential(
            BitConv2d(in_channels, 64, kernel_size=3, stride=1, padding=1, groups=1, QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 32×32 -> 16×16
        )
        # Conv2
        self.conv2 = nn.Sequential(
            BitConv2d(64, 192, kernel_size=3, stride=1, padding=1, groups=1, QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 16×16 -> 8×8
        )
        # Conv3
        self.conv3 = nn.Sequential(
            BitConv2d(192, 384, kernel_size=3, stride=1, padding=1, groups=1, QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(384),
            nn.ReLU(inplace=True)
            # nn.MaxPool2d(kernel_size=2, stride=2)  # 8×8 -> 4×4
        )
        # Conv4 (1 layer)
        self.conv4 = nn.Sequential(
            BitConv2d(384, 256, kernel_size=3, stride=1, padding=1, groups=1, QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        # Conv5 (1 layer)
        self.conv5 = nn.Sequential(
            BitConv2d(256, 256, kernel_size=3, stride=1, padding=1, groups=1, QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 8×8 -> 4×4
        )
        # Fully Connected Layers
        fmap_size = in_size // 8  # 32 -> 16 -> 8 -> 4 (three MaxPool layers)
        self.fc6 = nn.Sequential(
            nn.Flatten(),
            BitLinear(256 * fmap_size * fmap_size, network_width1, QuantType=QuantType, NormType=NormType, WScale=WScale),  # 256×4×4 = 4096
            nn.ReLU()
        )
        self.fc7 = nn.Sequential(
            # nn.Flatten(),
            BitLinear(network_width1, network_width2, QuantType=QuantType, NormType=NormType, WScale=WScale),
            nn.ReLU()
        )
        # Final classifier
        self.fc8 = BitLinear(network_width2, 10, QuantType=QuantType, NormType=NormType, WScale=WScale)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.fc6(x)
        x = self.fc7(x)
        x = self.fc8(x)
        return x
```
:::
:::spoiler Added [padding](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/BitNetMCU.py)
```python
for layer_info in self.quantized_model[:-1]:  # For all layers except the last one
    print(f'layer: {layer_info["layer_type"], layer_info["layer_order"] }')
    # .....
    elif layer_info['layer_type'] == 'BitConv2d':
        # .....
        padding = layer_info['padding']
        groups = layer_info['groups']
        in_channels = layer_info['in_channels']
        out_channels = layer_info['out_channels']
        # Apply zero padding to the two spatial dimensions only (N, C, H, W layout)
        if padding > 0:
            current_data = np.pad(
                current_data,
                pad_width=((0, 0), (0, 0), (padding, padding), (padding, padding)),
                mode='constant',
                constant_values=(0, 0)
            )
```
:::
::: spoiler Added [maxpool layer](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/BitNetMCU.py)
```python!
    # ...
    elif layer_info['layer_type'] == 'MaxPool2d':
        kernel_size = layer_info['kernel_size']  # Assuming square kernel
        stride = layer_info['stride']
        # Extract input dimensions
        batch_size, channels, height, width = current_data.shape
        out_height = (height - kernel_size) // stride + 1
        out_width = (width - kernel_size) // stride + 1
        # Initialize output
        output = np.zeros((batch_size, channels, out_height, out_width), dtype=current_data.dtype)
        # Perform max pooling: take the maximum of each kernel_size x kernel_size window
        for i in range(out_height):
            for j in range(out_width):
                h_start = i * stride
                h_end = h_start + kernel_size
                w_start = j * stride
                w_end = w_start + kernel_size
                patch = current_data[:, :, h_start:h_end, w_start:w_end]
                output[:, :, i, j] = np.max(patch, axis=(2, 3))
        current_data = output
```
:::
::: spoiler Modified [exportquant](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/exportquant.py)
```python
for layer_info in quantized_model.quantized_model:
layer = f'L{layer_info["layer_order"]}'
layer_type = f'{layer_info["layer_type"]}'
f.write(f'// Layer: {layer}\n')
f.write(f'// Layer type: {layer_type}\n')
if layer_info['layer_type'] == 'BitLinear':
incoming_weights = layer_info['incoming_weights']
outgoing_weights = layer_info['outgoing_weights']
bpw = layer_info['bpw']
weights = np.array(layer_info['quantized_weights'])
quantization_type = layer_info['quantization_type']
if (bpw*incoming_weights%32) != 0:
raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")
print(f'Layer: {layer} Quantization type: <{quantization_type}>, Bits per weight: {bpw}, Num. incoming: {incoming_weights}, Num outgoing: {outgoing_weights}')
data_type = np.uint32
if quantization_type == 'Binary':
encoded_weights = np.where(weights == -1, 0, 1)
QuantID = 1
elif quantization_type == '2bitsym': # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type) # use bitwise operations to encode the weights
QuantID = 2
# I2_S
elif quantization_type == 'I2_S': # encoding -1 -> 00, 0 -> 01, 1 -> 10
encoded_weights = weights.astype(data_type) + 1
QuantID = 2
elif quantization_type == '4bitsym':
encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type) # use bitwise operations to encode the weights
QuantID = 4
elif quantization_type == '4bit':
encoded_weights = np.floor(weights).astype(data_type) & 15 # twos complement encoding
QuantID = 8 + 4
elif quantization_type == 'NF4':
levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
QuantID = 32 + 4
elif quantization_type == '8bit':
encoded_weights = np.floor(weights).astype(data_type) & 255 # twos complement encoding
QuantID = 8
elif quantization_type == 'FP130': # FP1.3.0 encoding (sign * 2^exp)
encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
QuantID = 16 + 4
else:
print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')
# pack bits into 32 bit words
weight_per_word = 32 // bpw
reshaped_array = encoded_weights.reshape(-1, weight_per_word)
# reverse arange to match C language LSB first reading order
bit_positions = np.arange(0, 32, bpw, dtype=data_type)
packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)
# print(f'weights: {weights.shape} {weights.flatten()[0:16]}')
# print(f'Encoded weights: {encoded_weights.shape} {encoded_weights.flatten()[0:16]}')
# print(f'Packed weights: {packed_weights.shape} {", ".join(map(lambda x: hex(x), packed_weights.flatten()[0:4]))}')
# Write layer order, shape, shiftright and weights to the file
f.write(f'// Layer: {layer}\n')
f.write(f'// QuantType: {quantization_type}\n')
f.write(f'#define {layer}_active\n')
f.write(f'#define {layer}_bitperweight {QuantID}\n')
f.write(f'#define {layer}_incoming_weights {incoming_weights}\n')
f.write(f'#define {layer}_outgoing_weights {outgoing_weights}\n')
f.write(f'const uint32_t {layer}_weights[] = {{')
for i,data in enumerate(packed_weights.flatten()):
if i&7 ==0:
f.write('\n\t')
f.write(f'0x{data:08x},')
f.write('\n}; //first channel is topmost bit\n\n')
elif layer_info['layer_type'] == 'BitConv2d':
in_channels = layer_info['in_channels']
out_channels = layer_info['out_channels']
incoming_x = layer_info['incoming_x']
incoming_y = layer_info['incoming_y']
outgoing_x = layer_info['outgoing_x']
outgoing_y = layer_info['outgoing_y']
padding = layer_info['padding']
groups = layer_info['groups']
kernel_size = layer_info['kernel_size'][0] # Assuming square kernel
bpw = layer_info['bpw']
quantization_type = layer_info['quantization_type']
weights = np.array(layer_info['quantized_weights'])
bias = layer_info.get('bias', None)
f.write(f'// Layer: {layer} (Convolutional)\n')
f.write(f'#define {layer}_active\n')
f.write(f'#define {layer}_type BitConv2d\n')
f.write(f'#define {layer}_in_channels {in_channels}\n')
f.write(f'#define {layer}_out_channels {out_channels}\n')
f.write(f'#define {layer}_incoming_x {incoming_x}\n')
f.write(f'#define {layer}_incoming_y {incoming_y}\n')
f.write(f'#define {layer}_outgoing_x {outgoing_x}\n')
f.write(f'#define {layer}_outgoing_y {outgoing_y}\n')
f.write(f'#define {layer}_kernel_size {kernel_size}\n')
f.write(f'#define {layer}_stride 1\n')
f.write(f'#define {layer}_padding {padding}\n')
f.write(f'#define {layer}_groups {groups}\n')
f.write(f'#define {layer}_bitperweight {bpw}\n')
# if (bpw*incoming_weights%32) != 0:
# raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")
data_type = np.uint32
if quantization_type == 'Binary':
encoded_weights = np.where(weights == -1, 0, 1)
QuantID = 1
elif quantization_type == '2bitsym': # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type) # use bitwise operations to encode the weights
QuantID = 2
# I2_S
elif quantization_type == 'I2_S': # encoding -1 -> 00, 0 -> 01, 1 -> 10
encoded_weights = weights.astype(data_type) + 1
QuantID = 2
elif quantization_type == '4bitsym':
encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type) # use bitwise operations to encode the weights
QuantID = 4
elif quantization_type == '4bit':
encoded_weights = np.floor(weights).astype(data_type) & 15 # twos complement encoding
QuantID = 8 + 4
elif quantization_type == 'NF4':
levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
QuantID = 32 + 4
elif quantization_type == '8bit':
encoded_weights = np.floor(weights).astype(data_type) & 255 # twos complement encoding
QuantID = 8
elif quantization_type == 'FP130': # FP1.3.0 encoding (sign * 2^exp)
encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
QuantID = 16 + 4
else:
print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')
# pack bits into 32 bit words
weight_per_word = 32 // bpw
reshaped_array = encoded_weights.reshape(-1, weight_per_word)
# reverse arange to match C language LSB first reading order
bit_positions = np.arange(weight_per_word, dtype=data_type) * bpw
packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)
f.write(f'const uint32_t {layer}_packed_weights[] = {{')
for i, data in enumerate(packed_weights.flatten()):
if i % 32 == 0:
f.write('\n\t')
f.write(f'{data},')
f.write('\n};\n\n')
if 'bias' in layer_info and layer_info['bias'] is not None:
bias = np.array(layer_info['bias']).astype(data_type)
f.write(f'const int32_t {layer}_bias[] = {{')
for i, data in enumerate(bias.flatten()):
if i % 8 == 0:
f.write('\n\t')
f.write(f'{data},')
f.write('\n};\n\n')
else:
f.write(f'// No bias for layer {layer}\n')
print(f'Layer: {layer} Conv2d bpw: {bpw} {in_channels} -> {out_channels} groups:{groups} Kernel: {kernel_size}x{kernel_size}')
```
:::
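The I2_S encoding and 32-bit packing used by the export step can be illustrated with a small round-trip sketch. `pack_i2s` and `unpack_i2s` are hypothetical helper names, not functions from exportquant.py. The sketch also adds the +1 offset before casting to an unsigned type, which is a choice made here so that -1 never has to pass through an unsigned conversion.

```python
import numpy as np

def pack_i2s(weights):
    """Pack ternary weights {-1, 0, 1} into uint32 words.

    Encoding: -1 -> 00, 0 -> 01, 1 -> 10; 16 weights per word, LSB first.
    """
    w = np.asarray(weights)
    if w.size % 16 != 0:
        raise ValueError("number of weights must be a multiple of 16")
    # Offset first, then cast, so negative values are never cast to unsigned.
    codes = (w.astype(np.int64) + 1).astype(np.uint32).reshape(-1, 16)
    bit_positions = np.arange(16, dtype=np.uint32) * 2
    return np.bitwise_or.reduce(codes << bit_positions, axis=1)

def unpack_i2s(words):
    """Inverse of pack_i2s: recover the ternary weights as int8."""
    words = np.asarray(words, dtype=np.uint32).reshape(-1, 1)
    bit_positions = np.arange(16, dtype=np.uint32) * 2
    codes = (words >> bit_positions) & np.uint32(3)
    return codes.astype(np.int8).reshape(-1) - 1

weights = np.array([-1, 0, 1] + [0] * 13)   # made-up example row
packed = pack_i2s(weights)
print([hex(int(v)) for v in packed])
print(unpack_i2s(packed))
```

The C inference code has to read the words in the same LSB-first order, extracting `(word >> (2 * i)) & 3` for the i-th weight.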
::: spoiler Added [test inference for I2_S](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/BitNetMCU_CIFAR10_test.c)
| Func | Desc |
| ---- | ---- |
| precessfc_I2_S | Processes a fully connected layer in a neural network with 2-bit weights. |
| processcvlayer_I2_S | Processes a conv2D layer in a neural network. |
| Maxp | Processes a maxpooling layer. |
| ReLUNorm | Applies ReLU to an array of integers, then normalizes and requantizes the result to 8-bit integers. |
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ gcc BitNetMCU_MNIST_test.c -o mnist_test -std=c99 -lm
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ ./mnist_test
fc3_out: -1076 -749 -462 1650 -1109 -424 -1317 -438 -478 -750
label: 3 predicted: 3
fc3_out: -615 -556 1597 -337 -600 -778 -444 -513 -436 -672
label: 2 predicted: 2
fc3_out: 1255 -731 -646 -759 -801 -594 -397 -831 -626 -436
label: 0 predicted: 0
fc3_out: -761 -948 -753 -724 -102 -707 -1141 -317 -495 1685
label: 9 predicted: 9
fc3_out: 1346 -758 -718 -820 -860 -650 -506 -871 -678 -394
label: 0 predicted: 0
fc3_out: -366 -527 -800 -965 -315 -345 1135 -1150 -538 -847
label: 6 predicted: 6
fc3_out: -635 -701 -549 -599 88 -633 -884 -311 -362 1210
label: 9 predicted: 9
fc3_out: -474 -314 803 -133 -286 -708 -675 250 -339 -416
label: 2 predicted: 2
fc3_out: -526 -532 -134 -344 -628 -730 -1272 1069 -400 -110
label: 7 predicted: 7
fc3_out: -858 -318 -519 -481 -672 -995 -1577 1423 -506 -64
label: 7 predicted: 7
:::
### Analysis
A test inference over 10 samples is used to observe how the quantized model executes.
**Exec time**
Clock readings are recorded before and after the measured code.
(**replacing the multiplications in conv with if/else** vs. **keeping the multiplications in conv**)

branch:
$sum \mathrel{+}=
\begin{cases}
-act, & \text{if } weight = -1\\
+act, & \text{if } weight = 1\\
0, & \text{if } weight = 0
\end{cases}$
mul:
$sum \mathrel{+}= act \cdot weight$
| Layer | Branch (ms) | Mul (ms) |
| :--- | :--- | :--- |
| L2 (Conv) | 16.103 | 14.303 |
| L3 (ReLUNorm) | 0.322 | 0.355 |
| L4 (Maxp) | 0.329 | 0.316 |
| L5 (Conv) | 245.586 | 213.306 |
| L6 (ReLUNorm) | 0.201 | 0.201 |
| L7 (Maxp) | 0.198 | 0.196 |
| L8 (Conv) | 342.022 | 298.320 |
| L9 (ReLUNorm) | 0.105 | 0.105 |
| L10 (Conv) | 456.651 | 401.056 |
| L11 (ReLUNorm) | 0.069 | 0.069 |
| L12 (Conv) | 304.385 | 278.764 |
| L13 (ReLUNorm) | 0.065 | 0.067 |
| L14 (Maxp) | 0.066 | 0.067 |
| L16 (FC) | 5.622 | 4.595 |
| L17 (ReLUNorm) | 0.002 | 0.002 |
| L18 (FC) | 0.172 | 0.129 |
| L19 (ReLUNorm) | 0.001 | 0.001 |
| L20 (FC) | 0.007 | 0.006 |
|total| 1371.706 ms | 1211.458 ms|
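The original hope was to speed up the computation by eliminating multiplications, but choosing between addition and subtraction with if/else introduces more branches and ends up costing more.
Bitwise operations can avoid both the multiplication and the branches:
|Unpacked weight|Packed code|
|-|-|
|-1 |00|
|0 |01|
|1 |10|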
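```c
/* weight holds the 2-bit code: 0x0 -> -1, 0x1 -> 0, 0x2 -> +1.
 * Each comparison yields 0 or 1; negating it gives an all-zeros or all-ones mask. */
int8_t delta = (-((int8_t)(weight == 0x2)) & act) |
               (-((int8_t)(weight == 0x0)) & (-act));
sum += delta;
```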

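In the end, this ran about as fast as the branch version on the CPU. The CPU has a hardware multiplier and the compiler optimizes well, so in this environment the most direct mathematical form, `sum += act * weight;`, lets the compiler get the best performance out of the underlying multiply unit.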
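The three accumulation variants compute the same sum, which can be checked with a short NumPy sketch. The function names and sample data are assumptions for illustration. The sketch widens activations to 32 bits before negating, a choice made here because negating -128 does not fit in int8. It only shows equivalence and says nothing about speed.

```python
import numpy as np

def acc_branch(acts, codes):
    """if/else version: add for code 2 (+1), subtract for code 0 (-1), skip code 1 (0)."""
    total = 0
    for a, c in zip(acts, codes):
        if c == 2:
            total += int(a)
        elif c == 0:
            total -= int(a)
    return total

def acc_mask(acts, codes):
    """Branchless version: all-ones / all-zeros masks select act, -act, or 0."""
    a = acts.astype(np.int32)
    pos = -(codes == 2).astype(np.int32)   # -1 (all ones) where weight = +1
    neg = -(codes == 0).astype(np.int32)   # -1 (all ones) where weight = -1
    return int(np.sum((pos & a) | (neg & -a)))

def acc_mul(acts, codes):
    """Multiply version: decode the weight as code - 1, then multiply."""
    return int(np.sum(acts.astype(np.int32) * (codes.astype(np.int32) - 1)))

acts = np.array([10, -3, 7, -128], dtype=np.int8)   # made-up activations
codes = np.array([2, 0, 1, 0], dtype=np.uint8)      # +1, -1, 0, -1
print(acc_branch(acts, codes), acc_mask(acts, codes), acc_mul(acts, codes))
```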
#### perf
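Use of perf is restricted by `perf_event_paranoid`, where -1 is the least restrictive level. Running a perf command directly usually produces: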
```shell
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
perf_event_paranoid setting is 4:
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
```
:::danger
Mind your wording and be sure to follow the terminology prescribed by the course.
:::
First lower the perf restriction level to -1. The following command grants the permission temporarily:
`sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'`
1. CPU
``` shell
perf record ./mnist_test
```
``` shell
perf report
```


Almost all of the inference time is spent in the convolution layer function `processcvlayer_I2_S` (a candidate for loop unrolling).
2. Mem
| Layer | Entropy(bits) | Code capacity used |
| ------- | ------- | ------- |
| L2 | 1.57 | 78.33160789015268 % |
| L5 | 1.49 | 74.68134472269595 % |
| L8 | 1.45 | 72.42034190610372 % |
| L10| 1.43 | 71.44577546801597 % |
| L12| 1.47 | 73.48354983372585 % |
| L16| 1.42 | 70.77351150852402 % |
| L18| 1.58 | 78.99523722907314 % |
| L20| 1.48 | 73.77139839499056 % |
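Since 2 bits are used to store only the three values {-1, 0, 1}, the capacity utilization is inevitably low: it can be at most $\log_2 3 / 2 \approx 79.2\%$.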
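The entropy and capacity figures in the table can be reproduced from the value percentages in the export log. The sketch below uses the layer 2 percentages from that log; the function name `code_capacity` is an assumption made for illustration.

```python
import math

def code_capacity(percentages, bits_per_weight=2):
    """Return (entropy in bits, percent of the code capacity used)."""
    probs = [p / 100.0 for p in percentages if p > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy, 100.0 * entropy / bits_per_weight

# Layer 2 distribution of [-1, 0, 1] taken from the export log above.
entropy, used = code_capacity([40.97222222, 29.51388889, 29.51388889])
print(f"entropy: {entropy:.2f} bits, capacity used: {used:.2f} %")

# Upper bound for any ternary distribution stored in 2 bits.
print(f"upper bound: {100.0 * math.log2(3) / 2:.2f} %")
```

A distribution close to uniform over the three values, as in L2 and L18, approaches the bound. The layers dominated by zeros fall further below it.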
```shell
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
```
**Result:**
```
Performance counter stats for './cifar10_test':
1,395,104 cache-misses # 23.70% of all cache refs
5,887,284 cache-references
63,527,466,421 cycles
13.750155002 seconds time elapsed
13.748634000 seconds user
0.000000000 seconds sys
```
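The cache miss rate is 23.70%. Huge pages are tried later to see whether this improves.
Applying only `-funroll-loops` can reduce branches and speed up the computation, but the additional instructions may increase cache misses (I-cache). Note that GCC leaves most optimizations disabled when no `-O` level is given, even if individual optimization flags are specified, so this flag alone may have little effect.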
```shell
gcc -funroll-loops -o cifar10_test BitNetMCU_CIFAR10_test.c -lm
```
**Result:**
```
Performance counter stats for './cifar10_test':
1,439,561 cache-misses # 17.66% of all cache refs
8,153,347 cache-references
63,693,742,173 cycles
13.786458539 seconds time elapsed
13.782243000 seconds user
0.000999000 seconds sys
```
**Compiling with gcc optimizations (additional instruction sets, loop unrolling, ...)**
```shell
gcc -O3 -march=native -funroll-loops -o cifar10_test BitNetMCU_CIFAR10_test.c -lm
```
```
Performance counter stats for './cifar10_test':
788,640 cache-misses # 31.30% of all cache refs
2,519,992 cache-references
7,213,769,871 cycles
1.572760010 seconds time elapsed
1.571592000 seconds user
0.001000000 seconds sys
```
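Although the cache miss rate rises (23.70% -> 31.30%), possibly because of loop unrolling, the absolute numbers of misses and references both drop and execution becomes much faster (13.750155002 -> 1.572760010 seconds elapsed, more than 8x), which also reduces the load on conv2.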
#### hugepage
```
sudo sysctl -w vm.nr_hugepages=128
```
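This sets the number of huge pages available in the Linux kernel at runtime. The default Linux page size is 4 KB. Here 128 huge pages of 2 MB each are reserved, 256 MB in total: the kernel sets aside 128 physically contiguous 2 MB pages for programs that allocate from huge pages.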
```shell
cat /proc/meminfo | grep Huge
```
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 262144 kB
This confirms that huge pages are enabled (all 128 pages are still free at this point).
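To see whether a run actually consumes huge pages, the `HugePages_*` counters can be compared before and during execution. The parser below is a minimal sketch; the function names are assumptions, and the sample text is copied from the output above.

```python
def parse_hugepages(meminfo_text):
    """Return the HugePages_* counters from /proc/meminfo-style text as ints."""
    counters = {}
    for line in meminfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().startswith("HugePages_"):
            counters[key.strip()] = int(value.split()[0])
    return counters

def hugepages_in_use(counters):
    """Pages handed out so far: total minus free."""
    return counters["HugePages_Total"] - counters["HugePages_Free"]

sample = """AnonHugePages: 0 kB
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB"""

print(hugepages_in_use(parse_hugepages(sample)))
```

On a live system the text would come from reading `/proc/meminfo` while the benchmark is running. If the in-use count stays at 0 for the whole run, the process is not being backed by the reserved huge pages.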
#### libhugetlbfs
libhugetlbfs is a helper library for huge pages on Linux. It lets a program's `malloc()` obtain memory from huge pages automatically.
Installation:
```
sudo apt update
sudo apt install libhugetlbfs-bin libhugetlbfs-dev
```
Observation:
```shell
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes \
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
```
**Result:**
```
Performance counter stats for './cifar10_test':
1,063,465 cache-misses # 12.56% of all cache refs
8,465,880 cache-references
63,575,493,015 cycles
13.679735573 seconds time elapsed
13.673417000 seconds user
0.002999000 seconds sys
```
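These counters are close to the baseline (about 63.58 vs 63.53 billion cycles, 13.68 s vs 13.75 s elapsed), so it is not obvious that huge pages were used at all. One way to check, sketched here under assumptions, is to sample `/proc/meminfo` while the workload is running: if `HugePages_Free` never drops below `HugePages_Total` and `HugePages_Rsvd` stays 0, the process did not take pages from the pool. `hugepages_in_use` is a hypothetical helper name, not part of libhugetlbfs.

```shell
# hugepages_in_use: read /proc/meminfo-style text on stdin and report how many
# pages of the huge page pool are allocated (Total - Free) or reserved (Rsvd).
hugepages_in_use() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^HugePages_Free:/  { free = $2 }
         /^HugePages_Rsvd:/  { rsvd = $2 }
         END { printf "used=%d reserved=%d\n", total - free, rsvd }'
}

# Sample taken from the idle system shown earlier: nothing is in use yet.
hugepages_in_use <<'EOF'
HugePages_Total: 128
HugePages_Free: 128
HugePages_Rsvd: 0
EOF

# Assumed workflow on the real machine (not executed here):
#   env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes \
#       ./cifar10_test &
#   sleep 1; hugepages_in_use < /proc/meminfo
```

Three points may help when interpreting the result. `HUGETLB_MORECORE` only affects memory obtained through `malloc()`, so data held in static arrays (for example weights compiled in from a header) would stay on 4 KB pages. Huge pages mainly reduce TLB misses, so where the PMU exposes them, events such as `dTLB-load-misses` are a more direct indicator than `cache-misses`. Finally, as far as I know the `__morecore` hook this mechanism relies on is no longer available from glibc 2.34 onward, in which case the preload may silently have no effect.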
#### valgrind
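>Valgrind is a system software framework that provides dynamic analysis of a program's run-time behavior at the user level (user space). It ships with several tools for tracing and analyzing program performance. Its best-known feature is helping users detect memory errors, such as use of uninitialized memory and improper allocation or deallocation of memory, and analyze them.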
ref: [2023 年 Linux 核心設計/實作課程作業-lab0(B) 以 Valgrind 分析記憶體問題](https://hackmd.io/@sysprog/linux2023-lab0/%2F%40sysprog%2Flinux2023-lab0-b)
```shell
valgrind --tool=massif ./cifar10_test
```
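**massif:**
Massif takes periodic snapshots to track how the process's memory allocation changes over time. By default it profiles heap blocks; the `(page allocation syscalls) mmap/mremap/brk` lines in the output below are what it reports when profiling at page granularity (`--pages-as-heap=yes`), so the totals below count every mapped page rather than only `malloc()` blocks. The `stacks(B)` column is 0 throughout because stack profiling was not enabled. The report is rendered with `ms_print`: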
```shell
ms_print massif.out.<pid>
```
**Memory Usage Graph**
```
MB
5.199^ ::::::::::::::::::::::::::::::::::::#
| :::: #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
| @ : #
|@:::::::::::::::::::::::::::::::@ : #
|@ :::@ : #
|@ :::@ : #
|@ :::@ : #
0 +----------------------------------------------------------------------->ki
0 161.6
Number of snapshots: 30
Detailed snapshots: [9, 13, 23, 28 (peak)]
```
This graph is a timeline of memory usage during the program's execution.
**Vertical axis (MB)**: total memory used by the program. The peak is about 5.199 MB (5,451,776 B at snapshot 28; `ms_print` counts 1 MB as 2^20 bytes).
**Horizontal axis (ki)**: "kilo-instructions", the approximate number of instructions executed by the program. The run ends at 165,472 instructions, shown as 161.6 ki.
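The axis units can be sanity-checked against the raw snapshot values. This is a small sketch with hypothetical helper names (`to_mib`, `to_ki`); the inputs are the `total(B)` and `time(i)` values of snapshot 28 in the tables that follow, and the 1024-based scaling is inferred from the fact that it reproduces the graph labels.

```shell
# to_mib BYTES: bytes expressed in units of 2^20 bytes, three decimals
to_mib() { awk -v b="$1" 'BEGIN { printf "%.3f\n", b / (1024 * 1024) }'; }
# to_ki INSTRUCTIONS: instruction count in units of 1024, one decimal
to_ki() { awk -v i="$1" 'BEGIN { printf "%.1f\n", i / 1024 }'; }

to_mib 5451776   # peak total(B), matches the 5.199 label on the vertical axis
to_ki 165472     # final time(i), matches the 161.6 label on the horizontal axis
```

Expressed in decimal megabytes the same peak is about 5.45 MB, so it is worth stating which unit is meant when quoting sizes from the graph and from the tables side by side.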
#### Snapshot (0~9): loading lib(1.13 MB)
```
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 4,096 4,096 0 0
1 0 12,288 12,288 0 0
2 0 847,872 847,872 0 0
3 0 888,832 888,832 0 0
4 0 892,928 892,928 0 0
5 0 1,069,056 1,069,056 0 0
6 0 1,110,016 1,110,016 0 0
7 0 1,126,400 1,126,400 0 0
8 0 1,130,496 1,130,496 0 0
9 0 1,134,592 1,134,592 0 0
100.00% (1,134,592B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->100.00% (1,134,592B) 0x0: ???
```
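`0x0: ???` together with `time(i) = 0` means these pages were already mapped before the program executed its first instruction, so massif has no call stack to attribute them to. They are most likely the executable image and the mappings created while the process was being set up.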
#### Snapshot (10~13): loading lib(1.16 MB)
```
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
10 0 1,146,880 1,146,880 0 0
11 0 1,150,976 1,150,976 0 0
12 0 1,155,072 1,155,072 0 0
13 0 1,155,072 1,155,072 0 0
100.00% (1,155,072B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.65% (1,150,976B) 0x0: ???
|
->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)
```
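The 99.65% entry is the same as above.
>->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)
A very small share of the total (0.35%, a single 4 KB page) comes from one or more small allocations.
Because each of them is below `ms_print`'s default 1% threshold, `ms_print` aggregates them and does not show their individual call stacks.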
#### Snapshot (14~23):
```
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
14 0 1,150,976 1,150,976 0 0
15 0 1,150,976 1,150,976 0 0
16 68,331 1,159,168 1,159,168 0 0
17 69,240 1,179,648 1,179,648 0 0
18 69,317 1,183,744 1,183,744 0 0
19 69,365 1,187,840 1,187,840 0 0
20 69,413 1,196,032 1,196,032 0 0
21 71,697 1,261,568 1,261,568 0 0
22 74,157 3,432,448 3,432,448 0 0
23 74,217 5,038,080 5,038,080 0 0
100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->77.15% (3,887,104B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->77.15% (3,887,104B) 0x4025D2C: mmap (mmap64.c:46)
| ->43.50% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
| | ->43.50% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
| | ->43.50% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
| | ->43.50% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
| | ->43.09% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
| | | ->43.09% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| | | ->43.09% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| | | ->43.09% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
| | | ->43.09% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| | | ->43.09% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
| | | ->43.09% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
| | | ->43.09% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
| | |
| | ->00.41% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->32.20% (1,622,016B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
| | ->32.20% (1,622,016B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
| | ->32.20% (1,622,016B) 0x4009528: _dl_map_object (dl-load.c:2268)
| | ->31.87% (1,605,632B) 0x4002A2C: openaux (dl-deps.c:64)
| | | ->31.87% (1,605,632B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| | | ->31.87% (1,605,632B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| | | ->31.87% (1,605,632B) 0x402241B: dl_main (rtld.c:1965)
| | | ->31.87% (1,605,632B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| | | ->31.87% (1,605,632B) 0x402075D: _dl_start_final (rtld.c:494)
| | | ->31.87% (1,605,632B) 0x402075D: _dl_start (rtld.c:581)
| | | ->31.87% (1,605,632B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
| | |
| | ->00.33% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->01.30% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
| | ->01.30% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
| | ->01.30% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
| | ->01.30% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
| | ->01.30% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| | ->01.30% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| | ->01.30% (65,536B) 0x402241B: dl_main (rtld.c:1965)
| | ->01.30% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| | ->01.30% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
| | ->01.30% (65,536B) 0x402075D: _dl_start (rtld.c:581)
| | ->01.30% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
| |
| ->00.16% (8,192B) in 1+ places, all below ms_print's threshold (01.00%)
|
->22.85% (1,150,976B) 0x0: ???
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
```
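> n=16, time=68,331, total=1,159,168B
About 68 thousand instructions have executed at this point. From this snapshot on, total memory rises slowly (presumably early initialization) until snapshot 22:
>n=22, time=74,157, total=3,432,448B
total memory jumps from 1.26 MB to 3.43 MB, and then
>n=23, time=74,217, total=5,038,080B
total memory jumps again from 3.43 MB to 5.04 MB. The detailed breakdown of this snapshot shows:
> 100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk
All of the memory is accounted for as page-level allocations (`mmap`/`mremap`/`brk`).
>->77.15% (3,887,104B) 0x4025D2C: __mmap64
77.15% (about 3.89 MB) is mapped through `__mmap64`, almost all of it on the call path `_dl_start` -> `dl_main` -> `_dl_map_object_deps` -> `_dl_map_object` inside `ld-linux-x86-64.so.2`, the Linux dynamic linker. This is the path the linker takes at program startup to map the shared libraries the executable depends on, so the growth reflects library loading rather than allocations made by the inference code.
>->22.85% (1,150,976B) 0x0: ???
The remaining 22.85% (about 1.15 MB) is presumably the baseline memory from the start of execution, which stays mapped.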
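The two jumps can be cross-checked against the call tree by computing how much `total(B)` grows between consecutive rows of the `ms_print` table. The sketch below assumes rows in the layout shown above (`n`, `time(i)`, `total(B)`, ...); `snapshot_deltas` is a hypothetical helper name.

```shell
# snapshot_deltas: read ms_print table rows on stdin, strip thousands
# separators, and print the growth of total(B) relative to the previous row.
snapshot_deltas() {
    tr -d ',' | awk 'NF >= 3 && $1 ~ /^[0-9]+$/ {
        if (seen) printf "n=%d time=%d delta=%d\n", $1, $2, $3 - prev
        prev = $3; seen = 1
    }'
}

# Rows 21 to 23 copied from the table above
snapshot_deltas <<'EOF'
 21 71,697 1,261,568 1,261,568 0 0
 22 74,157 3,432,448 3,432,448 0 0
 23 74,217 5,038,080 5,038,080 0 0
EOF
```

The deltas are 2,170,880 B at snapshot 22 and 1,605,632 B at snapshot 23, which are exactly the sizes of the two `openaux` entries in the tree (under `dl-map-segments.h:29` and `dl-map-segments.h:139`). This suggests that each jump corresponds to mappings created by the dynamic linker for shared objects, not to buffers allocated by the inference code.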
#### Snapshot (24~28):
```
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
24 74,265 5,361,664 5,361,664 0 0
25 74,313 5,386,240 5,386,240 0 0
26 74,647 5,439,488 5,439,488 0 0
27 81,700 5,451,776 5,451,776 0 0
28 165,472 5,451,776 5,451,776 0 0
100.00% (5,451,776B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)
| ->40.20% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
| | ->40.20% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
| | ->40.20% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
| | ->40.20% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
| | ->39.82% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
| | | ->39.82% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| | | ->39.82% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| | | ->39.82% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
| | | ->39.82% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| | | ->39.82% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
| | | ->39.82% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
| | | ->39.82% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
| | |
| | ->00.38% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->36.14% (1,970,176B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
| | ->36.14% (1,970,176B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
| | ->36.14% (1,970,176B) 0x4009528: _dl_map_object (dl-load.c:2268)
| | ->35.84% (1,953,792B) 0x4002A2C: openaux (dl-deps.c:64)
| | | ->35.84% (1,953,792B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| | | ->35.84% (1,953,792B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| | | ->35.84% (1,953,792B) 0x402241B: dl_main (rtld.c:1965)
| | | ->35.84% (1,953,792B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| | | ->35.84% (1,953,792B) 0x402075D: _dl_start_final (rtld.c:494)
| | | ->35.84% (1,953,792B) 0x402075D: _dl_start (rtld.c:581)
| | | ->35.84% (1,953,792B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
| | |
| | ->00.30% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
| |
| ->01.35% (73,728B) in 2 places, all below massif's threshold (1.00%)
| |
| ->01.20% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
| ->01.20% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
| ->01.20% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
| ->01.20% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
| ->01.20% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
| ->01.20% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
| ->01.20% (65,536B) 0x402241B: dl_main (rtld.c:1965)
| ->01.20% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
| ->01.20% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
| ->01.20% (65,536B) 0x402075D: _dl_start (rtld.c:581)
| ->01.20% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|
->21.11% (1,150,976B) 0x0: ???
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
29 165,472 5,447,680 5,447,680 0 0
```
>78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)
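The main contributor is still `mmap64` called from the dynamic linker (`ld-linux-x86-64.so.2`). The call stacks again pass through `dl_main` and `_dl_map_object_deps`, which run before `main()`, so these mappings come from loading the program's shared-library dependencies at startup, not from libraries being loaded on demand for convolution or other inference operations. Note also that the whole profile spans only about 165 thousand instructions, far fewer than the roughly 63 billion cycles measured by `perf stat`, so this massif run appears to cover little beyond process startup and is worth re-checking.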
Further reading:
[你所不知道的 C 語言:動態連結器篇](https://hackmd.io/@sysprog/c-dynamic-linkage)
:::danger
Do not present text messages as images!
:::
### Optimization based on analysis
#### T-MAC
// to be continued
## HW simulation