Linux Kernel Term Project: Accelerating BitNet with System-Level Techniques

Contributor: Denny0097
GitHub

Video
Export to Model.h inference
Analysis

Reviewed by ginsengAttack

The result after running the command
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
lacks detail on the impact of huge page usage; it is worth checking whether huge pages were actually used and describing the run-time differences in detail.

Task Overview

Against the backdrop of ever-growing resource demands of large language models, Microsoft Research released BitNet b1.58 2B4T, an open-source model trained with 1.58-bit quantization. During inference, BitNet's matrix multiplications reduce to additions, subtractions, and skips; floating-point multiplication is removed entirely. In deployment this lowers latency, relieves memory-access pressure, and further reduces energy consumption. Such a minimal compute flow is also well suited to hardware accelerator design.

BitNet b1.58 adopts a heavily simplified and redesigned Transformer architecture: biases are removed from the linear and normalization layers, BitLinear layers replace the conventional fully connected layers, and it uses rotary position embeddings (RoPE), the ReLU² activation function, and subLN normalization.
The BitNet b1.58 2B4T model has about 2 billion parameters and is trained on 4 trillion tokens of natural language (mostly English). It uses the same tokenizer as Llama 3, with a vocabulary of 128,256, and supports a context length of up to 4096 tokens. Training proceeds in three stages: pretraining, supervised fine-tuning (SFT), and preference alignment (DPO), which lets the model balance capability and conversational quality.

BitNet b1.58 runs very efficiently: on a general-purpose CPU such as the Apple M2 it reaches 29 ms latency and occupies only 0.4 GB of memory, far less than, for example, the 1.4 GB required by Gemma 3 1B. In Microsoft's benchmarks against full-precision models such as Llama 3.2 1B, Gemma 3 1B, and Qwen 2.5 1.5B, BitNet maintains stable results on MMLU, GSM8K, and MATH despite having fewer parameters and lower weight precision, and achieves the best score on GSM8K. At scales above 3 billion parameters its quality approaches FP16 models, and at the 7-billion-parameter scale it reaches up to 4x faster inference and 7x memory savings, showing strong scaling potential.

This project applies the profiling tools provided by the Linux kernel to locate BitNet's run-time performance bottlenecks, and uses Transparent Hugepage Support, event-driven I/O models (such as io_uring), and other techniques covered in the course to accelerate BitNet.

Alan's experiments

Outline

  • TODO
  • BitNet Experiment
    • Env
  • SW
    • Model
    • Quantization scheme (QAT)
    • Code architecture
    • Analysis
    • Optimization based on analysis
  • HW simulation

TODO

Referring to BitNet and BitNetMCU, build a BitNet-quantized VGG8 for the MNIST dataset (model.h) in a Linux environment, use the profiling tools provided by the Linux kernel to analyze the run-time bottlenecks of training and inference, design CPU and memory-usage optimizations based on that analysis, simulate the hardware computation with SystemC, and additionally attempt hardware acceleration.

Study BitNet [1]:

  • Run BitNet b1.58 2B4T on a GNU/Linux system and reproduce the paper's experiments

Performance improvements:

  • Use perf and related tools to measure the top 20 functions by share of compute during inference, and examine what they do
  • Analyze memory usage, in particular page fault and TLB miss statistics; mining programs such as XMRig [2] gain speedups by making good use of huge pages (or THP)
  • Evaluate T-MAC [3][5], especially the benefit of its lookup tables combined with BitNet, and record perf event statistics along the way
  • Examine the model-loading mechanism and whether it can be sped up with mechanisms such as splice [4]

Modify BitNetMCU and export a 2-bit model:

  • Bring in the VGG8 provided by the original lab2
  • Add padding and maxpool support
  • Complete the unfinished 2-bit QuantType (I2_S)
  • Complete the unfinished ternary QuantTypes (TL1, TL2)
  • Generate a model.h that can be deployed on hardware
  • Design inference.c to test the model's inference on the CPU
  • Compare a PTQ 8-bit model with a QAT 2w8a model

HW simulation

  • C sim or Verilator

[1] https://github.com/microsoft/BitNet
[2] https://xmrig.com/docs/miner/hugepages
[3] https://github.com/microsoft/T-MAC
[4] https://hackmd.io/@sysprog/linux-zerocopy
[5] BitNet LUT kernels: https://github.com/microsoft/BitNet/tree/main/src
Transparent Hugepage Support: https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html

BitNet Experiment

Env

Experiment environment:
(base) denny0097:~/linux2025$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.2 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

(base) denny0097:~/linux2025$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  
(base) denny0097:~/linux2025$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ conda info

     active environment : bitnet-cpp
    active env location : /home/denny0097/miniconda3/envs/bitnet-cpp
            shell level : 2
       user config file : /home/denny0097/.condarc
 populated config files : /home/denny0097/miniconda3/.condarc
          conda version : 25.3.1
    conda-build version : not installed
         python version : 3.13.2.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=skylake
                          __conda=25.3.1=0
                          __cuda=12.4=0
                          __glibc=2.39=0
                          __linux=6.11.0=0
                          __unix=0=0
       base environment : /home/denny0097/miniconda3  (writable)

run instruction:

python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv

Result:

> User: Tell me about the architecture of BitNet.
BitNet is a software communication network that connects devices and provides a common architecture for the Internet, but it's not a place in Barcelona, Spain, although it might seem like that from the name. It's actually a network protocol developed by the University of California, Berkeley in the 197

Benchmark:

python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4  

This command would run the inference benchmark using the model located at /path/to/model, generating 200 tokens from a 256 token prompt, utilizing 4 threads.

Result:

| model | size | params | backend | threads | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | pp256 | 15.86 ± 0.04 |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | tg200 | 15.75 ± 0.10 |

SW

Model

Model summary
​​​​----------------------------------------------------------------
​​​​        Layer (type)               Output Shape         Param #
​​​​================================================================
​​​​         BitConv2d-1           [-1, 64, 32, 32]             576
​​​​              ReLU-2           [-1, 64, 32, 32]               0
​​​​         MaxPool2d-3           [-1, 64, 16, 16]               0
​​​​         BitConv2d-4          [-1, 192, 16, 16]         110,592
​​​​              ReLU-5          [-1, 192, 16, 16]               0
​​​​         MaxPool2d-6            [-1, 192, 8, 8]               0
​​​​         BitConv2d-7            [-1, 384, 8, 8]         663,552
​​​​              ReLU-8            [-1, 384, 8, 8]               0
​​​​         BitConv2d-9            [-1, 256, 8, 8]         884,736
​​​​             ReLU-10            [-1, 256, 8, 8]               0
​​​​        BitConv2d-11            [-1, 256, 8, 8]         589,824
​​​​             ReLU-12            [-1, 256, 8, 8]               0
​​​​        MaxPool2d-13            [-1, 256, 4, 4]               0
​​​​          Flatten-14                 [-1, 4096]               0
​​​​        BitLinear-15                  [-1, 256]       1,048,576
​​​​             ReLU-16                  [-1, 256]               0
​​​​        BitLinear-17                  [-1, 128]          32,768
​​​​             ReLU-18                  [-1, 128]               0
​​​​        BitLinear-19                   [-1, 10]           1,280
​​​​================================================================
​​​​Total params: 3,331,904
​​​​Trainable params: 3,331,904
​​​​Non-trainable params: 0
​​​​----------------------------------------------------------------
​​​​Input size (MB): 0.00
​​​​Forward/backward pass size (MB): 2.91
​​​​Params size (MB): 12.71
​​​​Estimated Total Size (MB): 15.63
​​​​----------------------------------------------------------------


trainingparameters.yaml
​​​​# Description: Training parameters for the training script

​​​​# Model selection
​​​​model: 'VGG' # 'FCMNIST' or 'CNNMNIST' This is the class name of the model as defined in models.py.

​​​​# Quantization settings
​​​​QuantType: 'I2_S' # 'Ternary', 'Binary', 'BinaryBalanced', '2bitsym', '4bit', '4bitsym', '8bit', 'None", 'FP130', 'NF4', 'I2_S'
​​​​NormType: 'RMS' # 'RMS', 'Lin', 'BatchNorm'
​​​​WScale: 'PerTensor' # 'PerTensor', 'PerOutput'

​​​​# Clipping parameters - only used for 2 bit and higher quantization
​​​​maxw_algo: 'octav' # 'octav', 'prop' Algorithm used to calculate the clipping parameters (maximum weight)
​​​​maxw_update_until_epoch: 50 # Update clipping parameters until this epoch, they are frozen afterwards
​​​​maxw_quantscale: 0.25  # Used only for clipping_algo='prop'. Determines the relation between stddev of weights and max_weight

​​​​# Learning parameters
​​​​num_epochs: 50
​​​​batch_size: 32
​​​​scheduler: "Cosine" # "StepLR", "Cosine"
​​​​learning_rate: 0.001
​​​​lr_decay: 0.1     # lr_decay and step size are not used with cosine scheduler
​​​​step_size: 10
​​​​# halve_lr_epoch: 30  # Epoch at which to halve the learning rate

​​​​# Data augmentation
​​​​augmentation: True
​​​​rotation1: 10  # rotation1 and rotation2 are used for data augmentation
​​​​rotation2: 10

​​​​# Model parameters
​​​​network_width1: 256
​​​​network_width2: 128
​​​​network_width3: 0

​​​​# name
​​​​runtag: "octav" # runtag is prefix for runname

Output quantized model

BitConv2d and BitLinear are the layer classes provided by BitNetMCU; they support normalization (RMS) and the QAT forward pass.

class BitConv2d(nn.Conv2d, BitQuant):
    """
    2D convolution layer with quantization aware training and normalization.
    Configurable quantization and normalization types.

    Normalization Types:
    - RMS : Root Mean Square
    - None : No normalization

    @cpldcpu 2024-June-2
    """
    # def __init__ ...

    def forward(self, x):
        """
        Args:
            x: an input tensor with shape [n, d]
        Returns:
            y: an output tensor with shape [n, k]
        """
        w = self.weight  # a weight tensor with shape [d, k]
        x_norm = self.Normalize(x)
        if self.QuantType == 'None':
            y = F.conv2d(x_norm, w, stride=self.stride, padding=self.padding, groups=self.groups)
        else:
            x_int, x_scale = self.activation_quant(x_norm)
            x_quant = x_norm + (x_int / x_scale - x_norm).detach()
            w_int, w_scale, _ = self.weight_quant(w)
            w_quant = w + (w_int / w_scale - w).detach()
            y = F.conv2d(x_quant, w_quant, groups=self.groups, stride=self.stride,
                         padding=self.padding, bias=None)
        return y

Training datasets:

MNIST
CIFAR10

Quantization scheme (QAT)

Weight

​​​​w_scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
​​​​w_int = (w * w_scale ).round().clamp_(-1, 1)

$$s = \frac{1}{\max(\mathrm{mean}(|w|),\ 10^{-5})},\qquad q(w) = \mathrm{clamp}(\mathrm{round}(w \cdot s),\ -1,\ 1)$$

Activation

training

​​​​scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
​​​​y = (x * scale).round().clamp_(-128, 127)

inference

​​​​scale = 127.0 / np.maximum(np.abs(input_data).max(axis=-1, keepdims=True), 1e-5)
​​​​current_data = np.round(input_data * scale).clip(-128, 127)

$$s = \frac{127}{\max(\max(|x|),\ 10^{-5})},\qquad q(x) = \mathrm{clamp}(\mathrm{round}(x \cdot s),\ -128,\ 127)$$

HW friendly:

    s = max(|x|) >> 7
    shift = 0
    while (s > 0) { s >>= 1; shift++ }
    rounding = (1 << shift) >> 1
    q(x) = (x + rounding) >> shift

Compute the maximum absolute value of the input, $\max(|x|)$, and find the power of two $2^{shift+7}$ closest to it; this value is used as the quantization range. Then $x / 2^{shift}$ maps the values into the range [-128, 127], but a plain shift simply discards the fractional part, so for better accuracy a rounding term is added: $(x + 2^{shift-1}) / 2^{shift}$. The whole procedure uses no multiplication and no floating-point arithmetic.
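A minimal C sketch of this shift-based quantization, assuming int32 inputs and int8 outputs (the function name quant_act_shift is illustrative, not part of BitNetMCU):

```c
#include <stdint.h>
#include <stddef.h>

/* Shift-only activation quantization: find the smallest shift such that
 * max(|x|) fits in 8 bits after shifting, then round and shift each value. */
static void quant_act_shift(const int32_t *x, int8_t *q, size_t n)
{
    int32_t max_abs = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t a = x[i] < 0 ? -x[i] : x[i];
        if (a > max_abs)
            max_abs = a;
    }

    int shift = 0;
    for (int32_t s = max_abs >> 7; s > 0; s >>= 1)   /* 2^(shift+7) >= max(|x|) */
        shift++;

    int32_t rounding = (1 << shift) >> 1;            /* 2^(shift-1), 0 when shift == 0 */
    for (size_t i = 0; i < n; i++) {
        int32_t v = (x[i] + rounding) >> shift;
        if (v > 127)  v = 127;                       /* guard the boundary case */
        if (v < -128) v = -128;
        q[i] = (int8_t)v;
    }
}
```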

BitNet I2_S:

QAT is performed with the BitLinear and BitConv2d layers provided by BitNetMCU, so the model learns during training to cope with {-1, 0, 1} low-bit-width inference.

BitNetMCU currently does not support padding, max pooling, or BatchNorm (a model trained with BatchNorm shows abnormally low accuracy, close to random guessing, when exported to a .c file), and, most importantly, the original export script does not yet support any BitNet storage format.

I therefore add MaxPool and padding support, train on CIFAR10, and first implement exporting an I2_S (2-bit) model, so that a VGG8 model with more layers than the (fully connected + ReLU) models the project originally supported can be trained and exported as a BitNet model, together with a corresponding test that runs it on the CPU.

Export result
​​​​(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ python exportquant.py
​​​​Load parameters from file: trainingparameters.yaml
​​​​octav_VGG_Aug_BitMnist_I2_S_width256_128_0_epochs10
​​​​Loading model...
​​​​Inference using the original model...
​​​​Accuracy/Test of trained model: 99.67 %
​​​​Quantizing model...
​​​​0 VGG
​​​​1 Sequential
​​​​2 BitConv2d
​​​​3 ReLU
​​​​4 MaxPool2d
​​​​5 BitConv2d
​​​​6 ReLU
​​​​7 MaxPool2d
​​​​8 BitConv2d
​​​​9 ReLU
​​​​10 BitConv2d
​​​​11 ReLU
​​​​12 BitConv2d
​​​​13 ReLU
​​​​14 MaxPool2d
​​​​15 Flatten
​​​​16 BitLinear
​​​​17 ReLU
​​​​18 BitLinear
​​​​19 ReLU
​​​​20 BitLinear

​​​​Layer: 2, Max: 1.0, Min: -1.0, Mean: -0.11458333333333333, Std: 0.8317041365974641
​​​​Values: [-1.  0.  1.]
​​​​Percent: [40.97222222 29.51388889 29.51388889]
​​​​Entropy: 1.57 bits. Code capacity used: 78.33160789015268 %

​​​​Layer: 5, Max: 1.0, Min: -1.0, Mean: -0.2090747974537037, Std: 0.7201310989476462
​​​​Values: [-1.  0.  1.]
​​​​Percent: [38.5687934  43.76989294 17.66131366]
​​​​Entropy: 1.49 bits. Code capacity used: 74.68134472269595 %

​​​​Layer: 8, Max: 1.0, Min: -1.0, Mean: -0.14563289689429013, Std: 0.6785549412400004
​​​​Values: [-1.  0.  1.]
​​​​Percent: [31.36393229 51.83542511 16.8006426 ]
​​​​Entropy: 1.45 bits. Code capacity used: 72.42034190610372 %

​​​​Layer: 10, Max: 1.0, Min: -1.0, Mean: -0.17259724934895834, Std: 0.6684536886212908
​​​​Values: [-1.  0.  1.]
​​​​Percent: [32.46086968 52.33798557 15.20114475]
​​​​Entropy: 1.43 bits. Code capacity used: 71.44577546801597 %

​​​​Layer: 12, Max: 1.0, Min: -1.0, Mean: -0.13642713758680555, Std: 0.6916967437120225
​​​​Values: [-1.  0.  1.]
​​​​Percent: [31.67419434 50.29432509 18.03148058]
​​​​Entropy: 1.47 bits. Code capacity used: 73.48354983372585 %

​​​​Layer: 16, Max: 1.0, Min: -1.0, Mean: -0.06987667083740234, Std: 0.6562856511567088
​​​​Values: [-1. -0.  1.]
​​​​Percent: [25.27351379 56.4406395  18.28584671]
​​​​Entropy: 1.42 bits. Code capacity used: 70.77351150852402 %

​​​​Layer: 18, Max: 1.0, Min: -1.0, Mean: -0.019989013671875, Std: 0.7926493968240628
​​​​Values: [-1.  0.  1.]
​​​​Percent: [32.43408203 37.1307373  30.43518066]
​​​​Entropy: 1.58 bits. Code capacity used: 78.99523722907314 %

​​​​Layer: 20, Max: 1.0, Min: -1.0, Mean: -0.31484375, Std: 0.7746561579732891
​​​​Values: [-1. -0.  1.]
​​​​Percent: [50.703125 30.078125 19.21875 ]
​​​​Entropy: 1.48 bits. Code capacity used: 73.77139839499056 %
​​​​Total number of bits: 6663808 (813.453125 kbytes)
​​​​inference of quantized model
​​​​layer: ('BitConv2d', 2)
​​​​layer: ('ReLU', 3)
​​​​layer: ('MaxPool2d', 4)
​​​​layer: ('BitConv2d', 5)
​​​​layer: ('ReLU', 6)
​​​​layer: ('MaxPool2d', 7)
​​​​layer: ('BitConv2d', 8)
​​​​layer: ('ReLU', 9)
​​​​layer: ('BitConv2d', 10)
​​​​layer: ('ReLU', 11)
​​​​layer: ('BitConv2d', 12)
​​​​layer: ('ReLU', 13)
​​​​layer: ('MaxPool2d', 14)
​​​​layer: ('BitLinear', 16)
​​​​layer: ('ReLU', 17)
​​​​layer: ('BitLinear', 18)
​​​​layer: ('ReLU', 19)

Weight Distribution

(figure: I2_S weight distribution)

The weight distribution shows that -1 and 0 occur relatively often. A likely reason is that the model has no bias and every conv/FC layer is followed by ReLU, so the conv inputs are all non-negative, while the conv outputs tend toward a roughly normal distribution centered at 0; as a result, negative weights outnumber positive ones.

Code architecture

QAT_Training

Exporting

model.h

Test_Inference

Adding VGG
class VGG(nn.Module):
    def __init__(self,network_width1=256,network_width2=128,network_width3=0,QuantType='Binary',WScale='PerTensor',NormType='BatchNorm', in_channels=1, in_size=32, num_classes=10):
        super(VGG, self).__init__()
        # Conv1
        self.conv1 = nn.Sequential(
            BitConv2d(in_channels, 64, kernel_size=3, stride=1, padding=1, groups=1,QuantType=QuantType,NormType='None', WScale=WScale),
            # nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 32×32 -> 16×16
        )
        
        # Conv2
        self.conv2 = nn.Sequential(
            BitConv2d(64, 192, kernel_size=3, stride=1, padding=1, groups=1,QuantType=QuantType,NormType='None', WScale=WScale),
            # nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 16×16 -> 8×8
        )
        
        # Conv3
        self.conv3 = nn.Sequential(
            BitConv2d(192, 384, kernel_size=3, stride=1, padding=1, groups=1,QuantType=QuantType,NormType='None', WScale=WScale), 
            # nn.BatchNorm2d(384),
            nn.ReLU(inplace=True)
            # nn.MaxPool2d(kernel_size=2, stride=2)  # 8×8 -> 4×4
        )
        
        # Conv4 (Dilated, 1 layer)
        self.conv4 = nn.Sequential(
            BitConv2d(384, 256, kernel_size=3, stride=1, padding=1, groups=1,QuantType=QuantType,NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        
        # Conv5 (1 layer)
        self.conv5 = nn.Sequential(
            BitConv2d(256, 256, kernel_size=3, stride=1, padding=1, groups=1,QuantType=QuantType,NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 8×8 -> 4×4
        )
        
        # Fully Connected Layers
        fmap_size = in_size // 8   # 32 -> 16 -> 8 -> 4 (3 MaxPool stages)
        self.fc6 = nn.Sequential(
            nn.Flatten(),
            BitLinear(256 * fmap_size * fmap_size, network_width1,QuantType=QuantType,NormType=NormType, WScale=WScale), # 256×4×4 = 4096
            nn.ReLU()
        )
        self.fc7 = nn.Sequential(
            # nn.Flatten(),
            BitLinear(network_width1, network_width2,QuantType=QuantType,NormType=NormType, WScale=WScale),
            nn.ReLU()
        )        
       
        # Final classifier
        self.fc8 = BitLinear(network_width2, 10,QuantType=QuantType,NormType=NormType, WScale=WScale)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.fc6(x)
        x = self.fc7(x)
        x = self.fc8(x)
        return x
Adding padding
 for layer_info in self.quantized_model[:-1]:  # For all layers except the last one
            print(f'layer: {layer_info["layer_type"], layer_info["layer_order"] }')
#         .....

            elif layer_info['layer_type'] == 'BitConv2d':
#         .....

                padding = layer_info['padding']
                groups = layer_info['groups']
                in_channels = layer_info['in_channels']
                out_channels = layer_info['out_channels']

                # Apply padding
                if padding > 0:
                    current_data = np.pad(
                        current_data,
                        pad_width=((0, 0), (0, 0), (padding, padding), (padding, padding)),
                        mode='constant',
                        constant_values=(0,0)
                    )
                
Adding a maxpool layer
# ...
            elif layer_info['layer_type'] == 'MaxPool2d':
                kernel_size = layer_info['kernel_size'] # Assuming square kernel
                stride = layer_info['stride']

                # Extract input dimensions
                batch_size, channels, height, width = current_data.shape

                out_height = (height - kernel_size) // stride + 1
                out_width = (width - kernel_size) // stride + 1

                # Initialize output
                output = np.zeros((batch_size, channels, out_height, out_width), dtype=current_data.dtype)

                # Perform max pooling
                for i in range(out_height):
                    for j in range(out_width):
                        h_start = i * stride
                        h_end = h_start + kernel_size
                        w_start = j * stride
                        w_end = w_start + kernel_size

                        patch = current_data[:, :, h_start:h_end, w_start:w_end]
                        output[:, :, i, j] = np.max(patch, axis=(2, 3))

                current_data = output
Fixing exportquant
 for layer_info in quantized_model.quantized_model:
            layer = f'L{layer_info["layer_order"]}'
            layer_type = f'{layer_info["layer_type"]}'
            f.write(f'// Layer: {layer}\n')
            f.write(f'// Layer type: {layer_type}\n')
            if layer_info['layer_type'] == 'BitLinear':

                incoming_weights = layer_info['incoming_weights']
                outgoing_weights = layer_info['outgoing_weights']
                bpw = layer_info['bpw']
                weights = np.array(layer_info['quantized_weights'])
                quantization_type = layer_info['quantization_type']

                if (bpw*incoming_weights%32) != 0:
                    raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")

                print(f'Layer: {layer} Quantization type: <{quantization_type}>, Bits per weight: {bpw}, Num. incoming: {incoming_weights},  Num outgoing: {outgoing_weights}')

                data_type = np.uint32

                if quantization_type == 'Binary':
                    encoded_weights = np.where(weights == -1, 0, 1)
                    QuantID = 1
                elif quantization_type == '2bitsym': # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
                    encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type)  # use bitwise operations to encode the weights
                    QuantID = 2
                # I2_S
                elif quantization_type == 'I2_S': # encoding -1 -> 00, 0 -> 01, 1 -> 10
                    encoded_weights = weights.astype(data_type) + 1
                    QuantID = 2
                elif quantization_type == '4bitsym':
                    encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type)  # use bitwise operations to encode the weights
                    QuantID = 4
                elif quantization_type == '4bit':
                    encoded_weights = np.floor(weights).astype(data_type) & 15  # twos complement encoding
                    QuantID =  8 + 4
                elif quantization_type == 'NF4':
                    levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                                   0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
                    encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
                    QuantID = 32 + 4
                elif quantization_type == '8bit':
                    encoded_weights = np.floor(weights).astype(data_type) & 255  # twos complement encoding
                    QuantID =  8
                elif quantization_type == 'FP130': # FP1.3.0 encoding (sign * 2^exp)
                    encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
                    QuantID = 16 + 4
                else:
                    print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')

                # pack bits into 32 bit words
                weight_per_word = 32 // bpw
                reshaped_array = encoded_weights.reshape(-1, weight_per_word)

                # reverse arange to match C language LSB first reading order
                bit_positions = np.arange(0, 32, bpw, dtype=data_type)
                packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)

                # print(f'weights: {weights.shape} {weights.flatten()[0:16]}')
                # print(f'Encoded weights: {encoded_weights.shape} {encoded_weights.flatten()[0:16]}')
                # print(f'Packed weights: {packed_weights.shape} {", ".join(map(lambda x: hex(x), packed_weights.flatten()[0:4]))}')

                # Write layer order, shape, shiftright and weights to the file
                f.write(f'// Layer: {layer}\n')
                f.write(f'// QuantType: {quantization_type}\n')

                f.write(f'#define {layer}_active\n')
                f.write(f'#define {layer}_bitperweight {QuantID}\n')
                f.write(f'#define {layer}_incoming_weights {incoming_weights}\n')
                f.write(f'#define {layer}_outgoing_weights {outgoing_weights}\n')

                f.write(f'const uint32_t {layer}_weights[] = {{')
                for i,data in enumerate(packed_weights.flatten()):
                    if i&7 ==0:
                        f.write('\n\t')
                    f.write(f'0x{data:08x},')
                f.write('\n}; //first channel is topmost bit\n\n')

            elif layer_info['layer_type'] == 'BitConv2d':
                in_channels = layer_info['in_channels']
                out_channels = layer_info['out_channels']
                incoming_x = layer_info['incoming_x']
                incoming_y = layer_info['incoming_y']
                outgoing_x = layer_info['outgoing_x']
                outgoing_y = layer_info['outgoing_y']
                
                padding = layer_info['padding']
                groups = layer_info['groups']
                kernel_size = layer_info['kernel_size'][0]  # Assuming square kernel
                bpw = layer_info['bpw']
                quantization_type = layer_info['quantization_type']
                weights = np.array(layer_info['quantized_weights'])
                bias = layer_info.get('bias', None)

                f.write(f'// Layer: {layer} (Convolutional)\n')
                f.write(f'#define {layer}_active\n')
                f.write(f'#define {layer}_type BitConv2d\n')
                f.write(f'#define {layer}_in_channels {in_channels}\n')
                f.write(f'#define {layer}_out_channels {out_channels}\n')
                f.write(f'#define {layer}_incoming_x {incoming_x}\n')
                f.write(f'#define {layer}_incoming_y {incoming_y}\n')
                f.write(f'#define {layer}_outgoing_x {outgoing_x}\n')
                f.write(f'#define {layer}_outgoing_y {outgoing_y}\n')
                f.write(f'#define {layer}_kernel_size {kernel_size}\n')
                f.write(f'#define {layer}_stride 1\n')
                f.write(f'#define {layer}_padding {padding}\n')
                f.write(f'#define {layer}_groups {groups}\n')
                f.write(f'#define {layer}_bitperweight {bpw}\n')

                # if (bpw*incoming_weights%32) != 0:
                #     raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")

                data_type = np.uint32

                if quantization_type == 'Binary':
                    encoded_weights = np.where(weights == -1, 0, 1)
                    QuantID = 1
                elif quantization_type == '2bitsym': # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
                    encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type)  # use bitwise operations to encode the weights
                    QuantID = 2
                # I2_S
                elif quantization_type == 'I2_S': # encoding -1 -> 00, 0 -> 01, 1 -> 10
                    encoded_weights = weights.astype(data_type) + 1
                    QuantID = 2
                elif quantization_type == '4bitsym':
                    encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type)  # use bitwise operations to encode the weights
                    QuantID = 4
                elif quantization_type == '4bit':
                    encoded_weights = np.floor(weights).astype(data_type) & 15  # twos complement encoding
                    QuantID =  8 + 4
                elif quantization_type == 'NF4':
                    levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                                   0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
                    encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
                    QuantID = 32 + 4
                elif quantization_type == '8bit':
                    encoded_weights = np.floor(weights).astype(data_type) & 255  # twos complement encoding
                    QuantID =  8
                elif quantization_type == 'FP130': # FP1.3.0 encoding (sign * 2^exp)
                    encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
                    QuantID = 16 + 4
                else:
                    print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')

                # pack bits into 32 bit words
                weight_per_word = 32 // bpw
                reshaped_array = encoded_weights.reshape(-1, weight_per_word)

                # reverse arange to match C language LSB first reading order
                bit_positions = np.arange(weight_per_word, dtype=data_type) * bpw
                packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)

                f.write(f'const uint32_t {layer}_packed_weights[] = {{')
                for i, data in enumerate(packed_weights.flatten()):
                    if i % 32 == 0:
                        f.write('\n\t')
                    f.write(f'{data},')
                f.write('\n};\n\n')

                if 'bias' in layer_info and layer_info['bias'] is not None:
                    bias = np.array(layer_info['bias']).astype(data_type)
                    f.write(f'const int32_t {layer}_bias[] = {{')
                    for i, data in enumerate(bias.flatten()):
                        if i % 8 == 0:
                            f.write('\n\t')
                        f.write(f'{data},')
                    f.write('\n};\n\n')
                else:
                    f.write(f'// No bias for layer {layer}\n')

                print(f'Layer: {layer} Conv2d bpw: {bpw} {in_channels} -> {out_channels} groups:{groups} Kernel: {kernel_size}x{kernel_size}')
Adding test inference for I2_S
| Function | Description |
|---|---|
| precessfc_I2_S | Processes a fully connected layer with 2-bit weights. |
| processcvlayer_I2_S | Processes a conv2D layer. |
| Maxp | Processes a max-pooling layer. |
| ReLUNorm | Applies ReLU to an array of integers and normalizes (re-quantizes) the result to 8-bit integers. |
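For reference, a minimal sketch of how a 2-bit I2_S fully connected layer can be evaluated straight from the packed weights in model.h (names and layout assumptions are illustrative; the project's actual precessfc_I2_S may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* I2_S encoding: 00 -> -1, 01 -> 0, 10 -> +1, 16 weights per uint32_t,
 * packed LSB-first to match the export script's packing order.
 * Assumes in_features is a multiple of 16 (the 32-bit boundary check). */
static void fc_i2s(const int8_t *act, const uint32_t *packed_w,
                   int32_t *out, int in_features, int out_features)
{
    int words_per_row = in_features / 16;

    for (int o = 0; o < out_features; o++) {
        const uint32_t *row = packed_w + (size_t)o * words_per_row;
        int32_t sum = 0;

        for (int i = 0; i < in_features; i++) {
            uint32_t code = (row[i / 16] >> ((i % 16) * 2)) & 0x3;
            sum += act[i] * ((int32_t)code - 1);   /* code - 1 gives -1 / 0 / +1 */
        }
        out[o] = sum;
    }
}
```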
​​​​(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ gcc BitNetMCU_MNIST_test.c  -o mnist_test -std=c99 -lm
​​​​(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ ./mnist_test 
​​​​fc3_out: -1076 -749 -462 1650 -1109 -424 -1317 -438 -478 -750
​​​​label: 3 predicted: 3
​​​​fc3_out: -615 -556 1597 -337 -600 -778 -444 -513 -436 -672
​​​​label: 2 predicted: 2
​​​​fc3_out: 1255 -731 -646 -759 -801 -594 -397 -831 -626 -436
​​​​label: 0 predicted: 0
​​​​fc3_out: -761 -948 -753 -724 -102 -707 -1141 -317 -495 1685
​​​​label: 9 predicted: 9
​​​​fc3_out: 1346 -758 -718 -820 -860 -650 -506 -871 -678 -394
​​​​label: 0 predicted: 0
​​​​fc3_out: -366 -527 -800 -965 -315 -345 1135 -1150 -538 -847
​​​​label: 6 predicted: 6
​​​​fc3_out: -635 -701 -549 -599 88 -633 -884 -311 -362 1210
​​​​label: 9 predicted: 9
​​​​fc3_out: -474 -314 803 -133 -286 -708 -675 250 -339 -416
​​​​label: 2 predicted: 2
​​​​fc3_out: -526 -532 -134 -344 -628 -730 -1272 1069 -400 -110
​​​​label: 7 predicted: 7
​​​​fc3_out: -858 -318 -519 -481 -672 -995 -1577 1423 -506 -64
​​​​label: 7 predicted: 7

Analysis

A test inference over 10 samples is used to observe how the quantized model executes.

Exec time
The clock is recorded before and after the code being measured.
(Replacing the conv multiplication with if/else vs. keeping the multiplication)

branch:
$$\text{sum} \leftarrow \begin{cases}\text{sum} - \text{act}, & \text{if } w = -1\\ \text{sum} + \text{act}, & \text{if } w = +1\end{cases}$$

vs. mul:
$$\text{sum} \leftarrow \text{sum} + \text{act} \cdot w$$

| Layer | Branch (ms) | Mul (ms) |
|---|---|---|
| L2 (Conv) | 16.103 | 14.303 |
| L3 (ReLUNorm) | 0.322 | 0.355 |
| L4 (Maxp) | 0.329 | 0.316 |
| L5 (Conv) | 245.586 | 213.306 |
| L6 (ReLUNorm) | 0.201 | 0.201 |
| L7 (Maxp) | 0.198 | 0.196 |
| L8 (Conv) | 342.022 | 298.320 |
| L9 (ReLUNorm) | 0.105 | 0.105 |
| L10 (Conv) | 456.651 | 401.056 |
| L11 (ReLUNorm) | 0.069 | 0.069 |
| L12 (Conv) | 304.385 | 278.764 |
| L13 (ReLUNorm) | 0.065 | 0.067 |
| L14 (Maxp) | 0.066 | 0.067 |
| L16 (FC) | 5.622 | 4.595 |
| L17 (ReLUNorm) | 0.002 | 0.002 |
| L18 (FC) | 0.172 | 0.129 |
| L19 (ReLUNorm) | 0.001 | 0.001 |
| L20 (FC) | 0.007 | 0.006 |
| total | 1371.706 | 1211.458 |

Although the hope was to speed up the computation by eliminating the multiplication, using if/else to decide between add and subtract introduces more branches and ends up costing more.

Bitwise computation can avoid both the multiplication and the branch:

| Unpacked value | Packed bits |
|---|---|
| -1 | 00 |
| 0 | 01 |
| 1 | 10 |
int8_t delta = (-((int8_t)(weight == 0x2)) & act) |
                 (-((int8_t)(weight == 0x0)) & (-act));
sum += delta;
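For context, a sketch of how this branch-free select could sit inside the multiply-accumulate loop (hypothetical helper, not the project's actual code; weight_code is the 2-bit I2_S code already extracted from the packed word):

```c
/* (w == c) is 0 or 1, so -(w == c) is 0x00 or 0xFF and acts as a mask:
 * code 0x2 selects +act, code 0x0 selects -act, code 0x1 contributes 0. */
static inline int32_t mac_i2s_bitwise(int32_t sum, int8_t act, uint32_t weight_code)
{
    int8_t pos = (int8_t)(-(int8_t)(weight_code == 0x2)) & act;
    int8_t neg = (int8_t)(-(int8_t)(weight_code == 0x0)) & (int8_t)-act;
    return sum + (int8_t)(pos | neg);
}
```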


In the end the bitwise version runs about as fast as the branch version on this CPU: the CPU has a hardware multiplier and the compiler optimizes aggressively. In such an environment, the most direct mathematical form, sum += act * weight;, actually lets the compiler do its best and exploit the underlying hardware multiply unit.

perf

Using perf is restricted by perf_event_paranoid (-1 is the most permissive setting); running the command directly usually prints:

Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
    perf_event_paranoid setting is 4:
      -1: Allow use of (almost) all events by all users
          Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
    >= 0: Disallow raw and ftrace function tracepoint access
    >= 1: Disallow CPU event access
    >= 2: Disallow kernel profiling

Mind your terminology; follow the terms prescribed by the course.

First lower the perf restriction to -1; the following command grants access temporarily:

​​​​sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'
  1. CPU
perf record ./mnist_test
perf report 

(screenshot: perf report output for ./mnist_test)

Model inference time is spent almost entirely in the convolution layers, processcvlayer_I2_S (loop unrolling is a candidate optimization; a sketch follows).
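A sketch of what manual unrolling of that inner loop might look like, assuming 16 two-bit weights per packed 32-bit word (illustrative only):

```c
/* Accumulate one packed word of 16 ternary weights with a fixed trip count,
 * partially unrolled by 4 so the compiler can keep everything in registers. */
static inline int32_t mac16_i2s(const int8_t *act, uint32_t word, int32_t sum)
{
    for (int k = 0; k < 16; k += 4) {
        sum += act[k + 0] * ((int32_t)((word >> ((k + 0) * 2)) & 0x3) - 1);
        sum += act[k + 1] * ((int32_t)((word >> ((k + 1) * 2)) & 0x3) - 1);
        sum += act[k + 2] * ((int32_t)((word >> ((k + 2) * 2)) & 0x3) - 1);
        sum += act[k + 3] * ((int32_t)((word >> ((k + 3) * 2)) & 0x3) - 1);
    }
    return sum;
}
```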

  2. Mem

| Layer | Entropy (bits) | Code capacity used |
|---|---|---|
| L2 | 1.57 | 78.33160789015268 % |
| L5 | 1.49 | 74.68134472269595 % |
| L8 | 1.45 | 72.42034190610372 % |
| L10 | 1.43 | 71.44577546801597 % |
| L12 | 1.47 | 73.48354983372585 % |
| L16 | 1.42 | 70.77351150852402 % |
| L18 | 1.58 | 78.99523722907314 % |
| L20 | 1.48 | 73.77139839499056 % |

Since the 2 bits only store weights in {-1, 0, 1}, the code capacity is understandably not great.
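For reference, the code capacity figure appears to be the Shannon entropy of the ternary weight distribution divided by the 2 bits actually stored; e.g. for layer L2:

$$
H = -\sum_i p_i \log_2 p_i = -\left(0.4097 \log_2 0.4097 + 2 \times 0.2951 \log_2 0.2951\right) \approx 1.57\ \text{bits},\qquad \frac{H}{2\ \text{bits}} \approx 78.3\%
$$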

perf stat -e cache-misses,cache-references,cycles ./cifar10_test

Result:

 Performance counter stats for './cifar10_test':

         1,395,104      cache-misses                     #   23.70% of all cache refs         
         5,887,284      cache-references                                                      
    63,527,466,421      cycles                                                                

      13.750155002 seconds time elapsed

      13.748634000 seconds user
       0.000000000 seconds sys

The cache miss rate is 23.70%; huge pages are tried later to see whether this improves.

Merely enabling loop unrolling reduces branches and speeds up the computation, but the larger instruction footprint raises cache misses (I-cache):

gcc  -funroll-loops -o cifar10_test BitNetMCU_CIFAR10_test.c -lm

Result:

 Performance counter stats for './cifar10_test':

         1,439,561      cache-misses                     #   17.66% of all cache refs         
         8,153,347      cache-references                                                      
    63,693,742,173      cycles                                                                

      13.786458539 seconds time elapsed

      13.782243000 seconds user
       0.000999000 seconds sys

Compiling with gcc optimizations (additional instruction sets, loop unrolling, ...):

gcc -O3 -march=native -funroll-loops -o mnist_test BitNetMCU_MNIST_test.c -lm
 Performance counter stats for './cifar10_test':

           788,640      cache-misses                     #   31.30% of all cache refs         
         2,519,992      cache-references                                                      
     7,213,769,871      cycles                                                                

       1.572760010 seconds time elapsed

       1.571592000 seconds user
       0.001000000 seconds sys

Although loop unrolling raises the miss rate, the execution time improves dramatically (13.750155002 -> 1.572760010 seconds elapsed, more than 8x faster), and the burden of conv2 is also reduced.

hugepage

sudo sysctl -w vm.nr_hugepages=128

This dynamically sets the number of huge pages available in the Linux kernel. The default page size in Linux is 4 KB; here 128 huge pages of 2 MB each (256 MB in total) are allocated, so the kernel reserves 128 contiguous 2 MB physical memory regions for programs that allocate from huge pages.

cat /proc/meminfo | grep Huge
​​​​AnonHugePages:         0 kB
​​​​ShmemHugePages:        0 kB
​​​​FileHugePages:         0 kB
​​​​HugePages_Total:     128
​​​​HugePages_Free:      128
​​​​HugePages_Rsvd:        0
​​​​HugePages_Surp:        0
​​​​Hugepagesize:       2048 kB
​​​​Hugetlb:          262144 kB

This confirms huge pages are enabled.
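Besides the libhugetlbfs approach below, the reserved pool can also be used explicitly from C; a minimal sketch with mmap and MAP_HUGETLB (the buffer size and its use are illustrative):

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_BUF_SIZE (4UL * 2 * 1024 * 1024)   /* 4 x 2 MB huge pages */

int main(void)
{
    /* Request anonymous memory backed by the pre-reserved 2 MB huge pages. */
    void *buf = mmap(NULL, HUGE_BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");             /* fails if the pool is exhausted */
        return 1;
    }

    memset(buf, 0, HUGE_BUF_SIZE);               /* touch the pages to fault them in */
    /* ... place activation / weight buffers here ... */

    munmap(buf, HUGE_BUF_SIZE);
    return 0;
}
```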

libhugetlbfs

libhugetlbfs is a huge-page helper library on Linux that lets a program's malloc() allocate memory from huge pages automatically.
Installation:

sudo apt update
sudo apt install libhugetlbfs-bin libhugetlbfs-dev

Observation:

env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes \
perf stat -e cache-misses,cache-references,cycles ./cifar10_test

Result:

 Performance counter stats for './cifar10_test':

         1,063,465      cache-misses                     #   12.56% of all cache refs         
         8,465,880      cache-references                                                      
    63,575,493,015      cycles                                                                

      13.679735573 seconds time elapsed

      13.673417000 seconds user
       0.002999000 seconds sys
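To follow up on the review comment at the top, whether the run actually used huge pages can be checked directly; a minimal sketch that dumps the huge-page counters for the current process and for the whole system, reading /proc/self/smaps_rollup and /proc/meminfo (both standard interfaces):

```c
#include <stdio.h>
#include <string.h>

/* Print every "Huge" counter, e.g. AnonHugePages (THP) and Private_Hugetlb
 * (hugetlbfs-backed mappings), plus the global HugePages_Free/Rsvd values. */
static void dump_hugepage_counters(void)
{
    const char *files[] = { "/proc/self/smaps_rollup", "/proc/meminfo" };
    char line[256];

    for (int i = 0; i < 2; i++) {
        FILE *fp = fopen(files[i], "r");
        if (!fp)
            continue;
        printf("== %s ==\n", files[i]);
        while (fgets(line, sizeof(line), fp))
            if (strstr(line, "Huge"))
                fputs(line, stdout);
        fclose(fp);
    }
}
```

Calling this right after the model buffers are allocated (or running grep Huge on /proc/<pid>/smaps_rollup from another shell) shows whether AnonHugePages or the Hugetlb counters grow during the run; if they stay at 0 kB, the LD_PRELOAD run did not actually place the heap on huge pages.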

valgrind

Valgrind is a user-space framework for dynamic analysis of a program's run-time behavior. It provides several tools for tracing and profiling program performance; its best-known feature is helping users detect memory errors, such as use of uninitialized memory, improper allocation, or improper deallocation.

ref: 2023 Linux Kernel Design/Implementation course, lab0(B): analyzing memory issues with Valgrind

valgrind --tool=massif ./cifar10_test 

massif:
observes the process's memory allocation over time by taking repeated snapshots (in this run tracked at the page-allocation level: mmap/mremap/brk).

ms_print massif.out.<pid>

Memory Usage Graph

    MB
5.199^                                   ::::::::::::::::::::::::::::::::::::#
     |                                ::::                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |                                @  :                                   #
     |@:::::::::::::::::::::::::::::::@  :                                   #
     |@                            :::@  :                                   #
     |@                            :::@  :                                   #
     |@                            :::@  :                                   #
   0 +----------------------------------------------------------------------->ki
     0                                                                   161.6

Number of snapshots: 30
Detailed snapshots: [9, 13, 23, 28 (peak)]

Peak: 5.199 MB
This gives a timeline of memory usage during the program's execution.

Vertical Axis (MB): the total memory used by the program, in megabytes; the peak observed is about 5.199 MB.

Horizontal Axis (ki): the approximate number of instructions executed by the program, in kilo-instructions.

Snapshot (0~9): loading libraries (1.13 MB)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0            4,096            4,096             0            0
  1              0           12,288           12,288             0            0
  2              0          847,872          847,872             0            0
  3              0          888,832          888,832             0            0
  4              0          892,928          892,928             0            0
  5              0        1,069,056        1,069,056             0            0
  6              0        1,110,016        1,110,016             0            0
  7              0        1,126,400        1,126,400             0            0
  8              0        1,130,496        1,130,496             0            0
  9              0        1,134,592        1,134,592             0            0
100.00% (1,134,592B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->100.00% (1,134,592B) 0x0: ???

`0x0: ???` together with time(i) = 0 indicates memory occupied by libraries/packages.

Snapshot (10~13): loading lib(1.13 MB)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 10              0        1,146,880        1,146,880             0            0
 11              0        1,150,976        1,150,976             0            0
 12              0        1,155,072        1,155,072             0            0
 13              0        1,155,072        1,155,072             0            0
100.00% (1,155,072B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.65% (1,150,976B) 0x0: ???
| 
->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)

The 99.65% entry is the same as above, and

->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)

a very small portion (0.35%) of the total allocated memory comes from one or more small allocations.
Because their individual or cumulative sizes fall below ms_print's default 1% threshold, ms_print does not show the call-site details for them.

Snapshot (14~23):

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 14              0        1,150,976        1,150,976             0            0
 15              0        1,150,976        1,150,976             0            0
 16         68,331        1,159,168        1,159,168             0            0
 17         69,240        1,179,648        1,179,648             0            0
 18         69,317        1,183,744        1,183,744             0            0
 19         69,365        1,187,840        1,187,840             0            0
 20         69,413        1,196,032        1,196,032             0            0
 21         71,697        1,261,568        1,261,568             0            0
 22         74,157        3,432,448        3,432,448             0            0
 23         74,217        5,038,080        5,038,080             0            0
100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->77.15% (3,887,104B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->77.15% (3,887,104B) 0x4025D2C: mmap (mmap64.c:46)
|   ->43.50% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
|   | ->43.50% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
|   |   ->43.50% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
|   |     ->43.50% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |       ->43.09% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       | ->43.09% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |       |   ->43.09% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |       |     ->43.09% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
|   |       |       ->43.09% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |       |         ->43.09% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
|   |       |           ->43.09% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
|   |       |             ->43.09% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |       |               
|   |       ->00.41% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
|   |       
|   ->32.20% (1,622,016B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
|   | ->32.20% (1,622,016B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
|   |   ->32.20% (1,622,016B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |     ->31.87% (1,605,632B) 0x4002A2C: openaux (dl-deps.c:64)
|   |     | ->31.87% (1,605,632B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |     |   ->31.87% (1,605,632B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |     |     ->31.87% (1,605,632B) 0x402241B: dl_main (rtld.c:1965)
|   |     |       ->31.87% (1,605,632B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |     |         ->31.87% (1,605,632B) 0x402075D: _dl_start_final (rtld.c:494)
|   |     |           ->31.87% (1,605,632B) 0x402075D: _dl_start (rtld.c:581)
|   |     |             ->31.87% (1,605,632B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |     |               
|   |     ->00.33% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
|   |     
|   ->01.30% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
|   | ->01.30% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
|   |   ->01.30% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
|   |     ->01.30% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       ->01.30% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |         ->01.30% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |           ->01.30% (65,536B) 0x402241B: dl_main (rtld.c:1965)
|   |             ->01.30% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |               ->01.30% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
|   |                 ->01.30% (65,536B) 0x402075D: _dl_start (rtld.c:581)
|   |                   ->01.30% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |                     
|   ->00.16% (8,192B) in 1+ places, all below ms_print's threshold (01.00%)
|   
->22.85% (1,150,976B) 0x0: ???
| 
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

n=16, time=68,331, total=1,159,168B

This means about 68 k instructions have executed; from this snapshot on, total memory rises slowly, presumably due to variable initialization, until snapshot 22,

n=22, time=74,157, total=3,432,448B

where total memory grows from 1.26 MB to 3.43 MB,

n=23, time=74,217, total=5,038,080B

total memory grows from 3.43 MB to 5.03 MB; from the detailed breakdown,

100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk

all of this memory comes from mmap page allocations.

->77.15% (3,887,104B) 0x4025D2C: __mmap64

77.15% (about 3.88 MB) is allocated through __mmap64, mainly by ld-linux-x86-64.so.2, the Linux dynamic linker; it is probably loading more or larger libraries, or initializing and relocating already-loaded ones, which requires additional memory mappings.

->22.85% (1,150,976B) 0x0: ???

22.85% (about 1.15 MB) is presumably the baseline memory overhead from early in the run that persists.

Snapshot (24~28):

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 24         74,265        5,361,664        5,361,664             0            0
 25         74,313        5,386,240        5,386,240             0            0
 26         74,647        5,439,488        5,439,488             0            0
 27         81,700        5,451,776        5,451,776             0            0
 28        165,472        5,451,776        5,451,776             0            0
100.00% (5,451,776B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)
|   ->40.20% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
|   | ->40.20% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
|   |   ->40.20% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
|   |     ->40.20% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |       ->39.82% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       | ->39.82% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |       |   ->39.82% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |       |     ->39.82% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
|   |       |       ->39.82% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |       |         ->39.82% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
|   |       |           ->39.82% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
|   |       |             ->39.82% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |       |               
|   |       ->00.38% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
|   |       
|   ->36.14% (1,970,176B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
|   | ->36.14% (1,970,176B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
|   |   ->36.14% (1,970,176B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |     ->35.84% (1,953,792B) 0x4002A2C: openaux (dl-deps.c:64)
|   |     | ->35.84% (1,953,792B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |     |   ->35.84% (1,953,792B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |     |     ->35.84% (1,953,792B) 0x402241B: dl_main (rtld.c:1965)
|   |     |       ->35.84% (1,953,792B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |     |         ->35.84% (1,953,792B) 0x402075D: _dl_start_final (rtld.c:494)
|   |     |           ->35.84% (1,953,792B) 0x402075D: _dl_start (rtld.c:581)
|   |     |             ->35.84% (1,953,792B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |     |               
|   |     ->00.30% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
|   |     
|   ->01.35% (73,728B) in 2 places, all below massif's threshold (1.00%)
|   | 
|   ->01.20% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
|     ->01.20% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
|       ->01.20% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
|         ->01.20% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
|           ->01.20% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|             ->01.20% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|               ->01.20% (65,536B) 0x402241B: dl_main (rtld.c:1965)
|                 ->01.20% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|                   ->01.20% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
|                     ->01.20% (65,536B) 0x402075D: _dl_start (rtld.c:581)
|                       ->01.20% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|                         
->21.11% (1,150,976B) 0x0: ???
| 
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 29        165,472        5,447,680        5,447,680             0            0

78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)

The main contributor is still the dynamic linker's (ld-linux-x86-64.so.2) mmap64 calls: even though the program code is already running, the system loads additional libraries needed for the convolution and other inference operations at this point.

Further reading:
你所不知道的 C 語言:動態連結器篇

Do not present textual information as images!

Optimization based on analysis

T-MAC

// to be continued

HW simulation