# Linux Kernel Term Project: Accelerating BitNet with System-Level Techniques

> Contributor: Denny0097 [GitHub](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/)
> Video: [Export to Model.h inference](https://www.youtube.com/watch?v=fuphucQnpVM) [Analysis](?)

### Reviewed by `ginsengAttack`

The result reported after running
`env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes perf stat -e cache-misses,cache-references,cycles ./cifar10_test`
lacks detail on the effect of huge-page usage. It is worth verifying whether huge pages are actually being used, and describing the runtime differences in detail.

## Task Overview

Against the backdrop of ever-growing resource demands from large language models, Microsoft Research released BitNet b1.58 2B4T, an open-source model trained with 1.58-bit quantization. During inference, BitNet's matrix multiplications reduce to additions, subtractions, and skips; floating-point multiplication is removed entirely. In real deployments this lowers latency, eases memory-access pressure, and further reduces energy consumption. Such a minimal compute flow is also well suited to hardware accelerator design.

BitNet b1.58 substantially simplifies and redesigns the Transformer architecture: biases are removed from the linear and normalization layers, traditional fully connected layers are replaced by BitLinear layers, and the model uses rotary position embeddings (RoPE), the ReLU² activation, and subLN normalization.

BitNet b1.58 2B4T contains roughly 2 billion parameters trained on 4 trillion tokens, mostly natural language (primarily English). It uses the same tokenizer as Llama 3, with a vocabulary of 128,256, and supports context lengths up to 4096 tokens. Training proceeds in three stages: pretraining, supervised fine-tuning (SFT), and preference alignment (DPO), balancing raw capability with conversational quality.

BitNet b1.58 is highly efficient at run time. On a general-purpose CPU such as the Apple M2 it reaches 29 ms latency while occupying only 0.4 GB of memory, far less than, e.g., the 1.4 GB required by Gemma 3 1B. In Microsoft's benchmarks against full-precision models such as Llama 3.2 1B, Gemma 3 1B, and Qwen 2.5 1.5B, BitNet remains competitive on MMLU, GSM8K, and MATH despite having fewer parameters and lower weight precision, and takes the best score on GSM8K. At 3 B parameters and above its quality approaches FP16 models, and at the 7 B scale it achieves up to 4x faster inference and 7x memory savings, indicating strong scaling potential.

This project uses the profiling tools provided by the Linux kernel to locate BitNet's runtime performance bottlenecks, then applies [Transparent Hugepage Support](https://docs.kernel.org/admin-guide/mm/transhuge.html), [event-driven I/O models](https://hackmd.io/@sysprog/linux-io-model) (such as `io_uring`), and other techniques covered in the course to accelerate BitNet.

> [Alan's experiments](https://hackmd.io/@alanhc/2025q1-term-project)

## Outline

- TODO
- BitNet Experiment
  - Env
- SW
  - Model
  - Quantization scheme (QAT)
  - Code architecture
  - Analysis
  - Optimization based on analysis
- HW simulation

## TODO

Referencing [BitNet](https://github.com/microsoft/BitNet/tree/main) and [BitNetMCU](https://github.com/cpldcpu/BitNetMCU): in a Linux environment, train a BitNet-quantized VGG8 on the MNIST dataset (exported as model.h), use the kernel's profiling tools to analyze performance bottlenecks during training & inference, design CPU & memory-usage optimizations based on the analysis, simulate the hardware computation with SystemC, and additionally attempt hardware acceleration.

Study BitNet [1]:

* Run BitNet b1.58 2B4T on a GNU/Linux system and reproduce the paper's experiments

Performance improvements:

* With perf and related tools, measure the top-20 functions by share of compute during inference and explain what they do
* Analyze memory usage, in particular page-fault and TLB-miss statistics; mining programs such as XMRig [2] gain measurable speedups from huge pages (or THP)
* Evaluate T-MAC [3] [5], especially the benefit of its lookup tables combined with BitNet, recording perf event statistics along the way
* Examine the model-loading path and whether mechanisms such as splice [4] can accelerate it

Modify BitNetMCU and export a 2-bit model:

* Bring in the VGG8 provided by the original lab2
* Add padding and maxpool support
* Complete the unfinished 2-bit QuantType (I2_S)
* Complete the unfinished ternary QuantTypes (TL1, TL2)
* Generate a model.h deployable to hardware
* Write inference.c to test the model's inference on the CPU
* Compare a PTQ 8-bit model vs a QAT 2w8a model

HW simulation:

* C sim or Verilator

[1] https://github.com/microsoft/BitNet
[2] https://xmrig.com/docs/miner/hugepages
[3] https://github.com/microsoft/T-MAC
[4] https://hackmd.io/@sysprog/linux-zerocopy
[5] BitNet's LUT kernels: https://github.com/microsoft/BitNet/tree/main/src

https://www.kernel.org/doc/html/next/admin-guide/mm/transhuge.html

## BitNet Experiment

### Env

:::spoiler Experiment environment
```shell
(base) denny0097:~/linux2025$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.2 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
(base) denny0097:~/linux2025$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
(base) denny0097:~/linux2025$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ conda info
     active environment : bitnet-cpp
    active env location : /home/denny0097/miniconda3/envs/bitnet-cpp
            shell level : 2
       user config file : /home/denny0097/.condarc
 populated config files : /home/denny0097/miniconda3/.condarc
          conda version : 25.3.1
    conda-build version : not installed
         python version : 3.13.2.final.0
                 solver : libmamba (default)
       virtual packages : __archspec=1=skylake
                          __conda=25.3.1=0
                          __cuda=12.4=0
                          __glibc=2.39=0
                          __linux=6.11.0=0
                          __unix=0=0
       base environment : /home/denny0097/miniconda3  (writable)
```
:::

run instruction:

```
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -cnv
```

result:

```!
> User: Tell me about the architecture of BitNet.
BitNet is a software communication network that connects devices and provides a common architecture for the Internet, but it's not a place in Barcelona, Spain, although it might seem like that from the name. It's actually a network protocol developed by the University of California, Berkeley in the 197
```

#### Benchmark

```
python utils/e2e_benchmark.py -m /path/to/model -n 200 -p 256 -t 4
```

This runs the inference benchmark with the model at /path/to/model, generating 200 tokens from a 256-token prompt using 4 threads.

#### Result

| model | size | params | backend | threads | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | pp256 | 15.86 ± 0.04 |
| bitnet-b1.58 2B I2_S - 2 bpw ternary | 1.71 GiB | 2.74 B | CPU | 4 | 1 | tg200 | 15.75 ± 0.10 |

## SW

### Model

:::spoiler Model summary
```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
         BitConv2d-1           [-1, 64, 32, 32]             576
              ReLU-2           [-1, 64, 32, 32]               0
         MaxPool2d-3           [-1, 64, 16, 16]               0
         BitConv2d-4          [-1, 192, 16, 16]         110,592
              ReLU-5          [-1, 192, 16, 16]               0
         MaxPool2d-6            [-1, 192, 8, 8]               0
         BitConv2d-7            [-1, 384, 8, 8]         663,552
              ReLU-8            [-1, 384, 8, 8]               0
         BitConv2d-9            [-1, 256, 8, 8]         884,736
             ReLU-10            [-1, 256, 8, 8]               0
        BitConv2d-11            [-1, 256, 8, 8]         589,824
             ReLU-12            [-1, 256, 8, 8]               0
        MaxPool2d-13            [-1, 256, 4, 4]               0
          Flatten-14                 [-1, 4096]               0
        BitLinear-15                  [-1, 256]       1,048,576
             ReLU-16                  [-1, 256]               0
        BitLinear-17                  [-1, 128]          32,768
             ReLU-18                  [-1, 128]               0
        BitLinear-19                   [-1, 10]           1,280
================================================================
Total params: 3,331,904
Trainable params: 3,331,904
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 2.91
Params size (MB): 12.71
Estimated Total Size (MB): 15.63
----------------------------------------------------------------
```
:::

![截圖 2025-06-23 上午11.37.28](https://hackmd.io/_uploads/S1SZhHU4eg.png)

:::spoiler trainingparameters.yaml
```yaml
# Description: Training parameters for the training script

# Model selection
model: 'VGG'  # 'FCMNIST' or 'CNNMNIST' This is the class name of the model as defined in models.py.

# Quantization settings
QuantType: 'I2_S'  # 'Ternary', 'Binary', 'BinaryBalanced', '2bitsym', '4bit', '4bitsym', '8bit', 'None', 'FP130', 'NF4', 'I2_S'
NormType: 'RMS'  # 'RMS', 'Lin', 'BatchNorm'
WScale: 'PerTensor'  # 'PerTensor', 'PerOutput'

# Clipping parameters - only used for 2 bit and higher quantization
maxw_algo: 'octav'  # 'octav', 'prop' Algorithm used to calculate the clipping parameters (maximum weight)
maxw_update_until_epoch: 50  # Update clipping parameters until this epoch, they are frozen afterwards
maxw_quantscale: 0.25  # Used only for clipping_algo='prop'. Determines the relation between stddev of weights and max_weight

# Learning parameters
num_epochs: 50
batch_size: 32
scheduler: "Cosine"  # "StepLR", "Cosine"
learning_rate: 0.001
lr_decay: 0.1  # lr_decay and step size are not used with cosine scheduler
step_size: 10
# halve_lr_epoch: 30  # Epoch at which to halve the learning rate

# Data augmentation
augmentation: True
rotation1: 10  # rotation1 and rotation2 are used for data augmentation
rotation2: 10

# Model parameters
network_width1: 256
network_width2: 128
network_width3: 0

# name
runtag: "octav"  # runtag is prefix for runname
```
:::

[Exported quantized model](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/BitNetMCU_model.h)

BitConv2d & BitLinear are the layer framework provided by BitNetMCU; they support normalization (RMS) and the QAT forward pass:

```python=
class BitConv2d(nn.Conv2d, BitQuant):
    """
    2D convolution layer with quantization aware training and normalization.
    Configurable quantization and normalization types.

    Normalization Types:
    - RMS : Root Mean Square
    - None : No normalization

    @cpldcpu 2024-June-2
    """
    # def __init__ ...

    def forward(self, x):
        """
        Args:
            x: an input tensor with shape [n, d]
        Returns:
            y: an output tensor with shape [n, k]
        """
        w = self.weight  # a weight tensor with shape [d, k]
        x_norm = self.Normalize(x)
        if self.QuantType == 'None':
            y = F.conv2d(x_norm, w, stride=self.stride, padding=self.padding, groups=self.groups)
        else:
            x_int, x_scale = self.activation_quant(x_norm)
            x_quant = x_norm + (x_int / x_scale - x_norm).detach()
            w_int, w_scale, _ = self.weight_quant(w)
            w_quant = w + (w_int / w_scale - w).detach()
            y = F.conv2d(x_quant, w_quant, groups=self.groups, stride=self.stride,
                         padding=self.padding, bias=None)
        return y
```

#### Training datasets

MNIST, CIFAR10

### Quantization scheme (QAT)

#### Weight

```python
w_scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
w_int = (w * w_scale).round().clamp_(-1, 1)
```

$$
\begin{aligned}
s &= \frac{1}{\max(\operatorname{mean}(|w|),\ 10^{-5})} \\
q(w) &= \operatorname{clamp}(\operatorname{round}(w \cdot s),\ -1,\ 1)
\end{aligned}
$$

#### Activation

training

```python
scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
y = (x * scale).round().clamp_(-128, 127)
```

inference

```python
scale = 127.0 / np.maximum(np.abs(input_data).max(axis=-1, keepdims=True), 1e-5)
current_data = np.round(input_data * scale).clip(-128, 127)
```

$$
\begin{aligned}
s &= \frac{127}{\max(\max(|x|),\ 10^{-5})} \\
q(x) &= \operatorname{clamp}(\operatorname{round}(x \cdot s),\ -128,\ 127)
\end{aligned}
$$

**HW friendly:**

$$
\begin{aligned}
&s = \max(|x|) \gg 7 \\
&\textbf{while } s > 0:\ s \mathrel{\gg}= 1,\ \textit{shift} \mathrel{+}= 1 \\
&\textit{rounding} = (1 \ll \textit{shift}) \gg 1 \\
&q(x) = (x + \textit{rounding}) \gg \textit{shift}
\end{aligned}
$$

Compute the maximum absolute value of the input, max(|x|), and find the power of two $2^{\text{shift}+7}$ that just covers it; that value serves as the quantization range, so $x / 2^{\text{shift}}$ maps inputs into the [-128, 127] quantized range. A plain shift truncates the fractional part, so for better precision a rounding term is added: $q(x) = (x + 2^{\text{shift}-1}) / 2^{\text{shift}}$. The whole procedure involves neither multiplication nor floating-point arithmetic.
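As a concrete illustration, below is a minimal C sketch of this shift-based activation quantization. The function name `quantize_shift` and the buffer layout are illustrative, not taken from the BitNetMCU sources; it also assumes right-shifting a negative `int32_t` is arithmetic, which holds for the gcc/clang targets used here.

```c
#include <stdint.h>

/* Shift-based quantization of n int32 activations into int8, following the
 * HW-friendly scheme above: find shift so that 2^(shift+7) covers max|x|,
 * then round-and-shift. Illustrative sketch only. */
static void quantize_shift(const int32_t *x, int8_t *q, int n)
{
    int32_t maxabs = 0;
    for (int i = 0; i < n; i++) {
        int32_t a = x[i] < 0 ? -x[i] : x[i];
        if (a > maxabs)
            maxabs = a;
    }

    /* s = max|x| >> 7; while (s > 0) { s >>= 1; shift++; } */
    int shift = 0;
    for (int32_t s = maxabs >> 7; s > 0; s >>= 1)
        shift++;

    int32_t rounding = (1 << shift) >> 1;
    for (int i = 0; i < n; i++) {
        int32_t v = (x[i] + rounding) >> shift;  /* arithmetic shift assumed */
        if (v > 127) v = 127;                    /* clamp to int8 range */
        if (v < -128) v = -128;
        q[i] = (int8_t)v;
    }
}
```

Since the scale is a power of two, dequantization is also just a shift, which is the property that makes this form attractive for MCUs and accelerators without an FPU.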
#### [BitNet](https://github.com/microsoft/BitNet/tree/main) I2_S

Based on the [BitLinear](https://github.com/cpldcpu/BitNetMCU/blob/main/BitNetMCU.py) & [BitConv2d](https://github.com/cpldcpu/BitNetMCU/blob/main/BitNetMCU.py) layers provided by [BitNetMCU](https://github.com/cpldcpu/BitNetMCU), QAT lets the model learn during training to cope with {-1, 0, 1} low-bit-width inference.

BitNetMCU currently supports neither padding, maxpooling, nor BatchNorm (a model trained with BatchNorm shows abnormally low, near-random accuracy once exported to a .c file), and, most importantly, the original export.py does not yet support any BitNet storage format. I therefore add maxpool and padding support, train on CIFAR10, and first implement exporting an I2_S (2-bit) model, so that a VGG8 model with more layers than the originally supported (fully connected + ReLU) networks can be trained in this project and exported as a BitNet model, together with a matching test harness for running inference on the CPU.

:::spoiler Export results
```
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ python exportquant.py
Load parameters from file: trainingparameters.yaml
octav_VGG_Aug_BitMnist_I2_S_width256_128_0_epochs10
Loading model...
Inference using the original model...
Accuracy/Test of trained model: 99.67 %
Quantizing model...
0 VGG
1 Sequential
2 BitConv2d
3 ReLU
4 MaxPool2d
5 BitConv2d
6 ReLU
7 MaxPool2d
8 BitConv2d
9 ReLU
10 BitConv2d
11 ReLU
12 BitConv2d
13 ReLU
14 MaxPool2d
15 Flatten
16 BitLinear
17 ReLU
18 BitLinear
19 ReLU
20 BitLinear
Layer: 2, Max: 1.0, Min: -1.0, Mean: -0.11458333333333333, Std: 0.8317041365974641
Values: [-1.  0.  1.] Percent: [40.97222222 29.51388889 29.51388889]
Entropy: 1.57 bits. Code capacity used: 78.33160789015268 %
Layer: 5, Max: 1.0, Min: -1.0, Mean: -0.2090747974537037, Std: 0.7201310989476462
Values: [-1.  0.  1.] Percent: [38.5687934  43.76989294 17.66131366]
Entropy: 1.49 bits. Code capacity used: 74.68134472269595 %
Layer: 8, Max: 1.0, Min: -1.0, Mean: -0.14563289689429013, Std: 0.6785549412400004
Values: [-1.  0.  1.] Percent: [31.36393229 51.83542511 16.8006426 ]
Entropy: 1.45 bits. Code capacity used: 72.42034190610372 %
Layer: 10, Max: 1.0, Min: -1.0, Mean: -0.17259724934895834, Std: 0.6684536886212908
Values: [-1.  0.  1.] Percent: [32.46086968 52.33798557 15.20114475]
Entropy: 1.43 bits. Code capacity used: 71.44577546801597 %
Layer: 12, Max: 1.0, Min: -1.0, Mean: -0.13642713758680555, Std: 0.6916967437120225
Values: [-1.  0.  1.] Percent: [31.67419434 50.29432509 18.03148058]
Entropy: 1.47 bits. Code capacity used: 73.48354983372585 %
Layer: 16, Max: 1.0, Min: -1.0, Mean: -0.06987667083740234, Std: 0.6562856511567088
Values: [-1. -0.  1.] Percent: [25.27351379 56.4406395  18.28584671]
Entropy: 1.42 bits. Code capacity used: 70.77351150852402 %
Layer: 18, Max: 1.0, Min: -1.0, Mean: -0.019989013671875, Std: 0.7926493968240628
Values: [-1.  0.  1.] Percent: [32.43408203 37.1307373  30.43518066]
Entropy: 1.58 bits. Code capacity used: 78.99523722907314 %
Layer: 20, Max: 1.0, Min: -1.0, Mean: -0.31484375, Std: 0.7746561579732891
Values: [-1. -0.  1.] Percent: [50.703125 30.078125 19.21875 ]
Entropy: 1.48 bits. Code capacity used: 73.77139839499056 %
Total number of bits: 6663808 (813.453125 kbytes)
inference of quantized model
layer: ('BitConv2d', 2)
layer: ('ReLU', 3)
layer: ('MaxPool2d', 4)
layer: ('BitConv2d', 5)
layer: ('ReLU', 6)
layer: ('MaxPool2d', 7)
layer: ('BitConv2d', 8)
layer: ('ReLU', 9)
layer: ('BitConv2d', 10)
layer: ('ReLU', 11)
layer: ('BitConv2d', 12)
layer: ('ReLU', 13)
layer: ('MaxPool2d', 14)
layer: ('BitLinear', 16)
layer: ('ReLU', 17)
layer: ('BitLinear', 18)
layer: ('ReLU', 19)
```
:::
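The "Code capacity used" figures in the export log are simply the Shannon entropy of each layer's ternary weight distribution divided by the 2 bits actually spent per weight. A small standalone C check (a hypothetical helper, not part of exportquant.py) reproduces the Layer 2 numbers; compile with `-lm`:

```c
#include <math.h>
#include <stdio.h>

/* Shannon entropy (in bits) of a ternary distribution p[-1], p[0], p[+1]. */
static double entropy_bits(const double p[3])
{
    double h = 0.0;
    for (int i = 0; i < 3; i++)
        if (p[i] > 0.0)
            h -= p[i] * log2(p[i]);
    return h;
}

int main(void)
{
    /* Layer 2 percentages from the export log above */
    double p[3] = { 0.4097222222, 0.2951388889, 0.2951388889 };
    double h = entropy_bits(p);
    /* prints: Entropy: 1.57 bits. Code capacity used: 78.33 % */
    printf("Entropy: %.2f bits. Code capacity used: %.2f %%\n", h, 100.0 * h / 2.0);
    return 0;
}
```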
### Weight Distribution

![I2_S_distribution](https://hackmd.io/_uploads/rJVAZzEXlg.png)

In the weight distribution, -1 and 0 occur relatively more often. A plausible explanation: the model has no biases and every conv & FC layer is followed by ReLU, so the conv inputs are all non-negative, while the conv outputs tend toward a roughly normal distribution concentrated around 0. As a result negative weights outnumber positive ones.

### Code architecture

```mermaid
graph TD;
QAT_Training-->Exporting-->model.h-->Test_Inference
```

:::spoiler Added [VGG](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/models.py)
```python
class VGG(nn.Module):
    def __init__(self, network_width1=256, network_width2=128, network_width3=0,
                 QuantType='Binary', WScale='PerTensor', NormType='BatchNorm',
                 in_channels=1, in_size=32, num_classes=10):
        super(VGG, self).__init__()
        # Conv1
        self.conv1 = nn.Sequential(
            BitConv2d(in_channels, 64, kernel_size=3, stride=1, padding=1, groups=1,
                      QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 32x32 -> 16x16
        )
        # Conv2
        self.conv2 = nn.Sequential(
            BitConv2d(64, 192, kernel_size=3, stride=1, padding=1, groups=1,
                      QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 16x16 -> 8x8
        )
        # Conv3
        self.conv3 = nn.Sequential(
            BitConv2d(192, 384, kernel_size=3, stride=1, padding=1, groups=1,
                      QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(384),
            nn.ReLU(inplace=True)
        )
        # Conv4 (1 layer)
        self.conv4 = nn.Sequential(
            BitConv2d(384, 256, kernel_size=3, stride=1, padding=1, groups=1,
                      QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        # Conv5 (1 layer)
        self.conv5 = nn.Sequential(
            BitConv2d(256, 256, kernel_size=3, stride=1, padding=1, groups=1,
                      QuantType=QuantType, NormType='None', WScale=WScale),
            # nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)  # 8x8 -> 4x4
        )
        # Fully connected layers
        fmap_size = in_size // 8  # 32 -> 16 -> 8 -> 4 (three MaxPools)
        self.fc6 = nn.Sequential(
            nn.Flatten(),
            BitLinear(256 * fmap_size * fmap_size, network_width1,
                      QuantType=QuantType, NormType=NormType, WScale=WScale),  # 256*4*4 = 4096
            nn.ReLU()
        )
        self.fc7 = nn.Sequential(
            BitLinear(network_width1, network_width2,
                      QuantType=QuantType, NormType=NormType, WScale=WScale),
            nn.ReLU()
        )
        # Final classifier
        self.fc8 = BitLinear(network_width2, 10,
                             QuantType=QuantType, NormType=NormType, WScale=WScale)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = self.fc6(x)
        x = self.fc7(x)
        x = self.fc8(x)
        return x
```
:::

:::spoiler Added [padding](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/BitNetMCU.py)
```python
for layer_info in self.quantized_model[:-1]:  # For all layers except the last one
    print(f'layer: {layer_info["layer_type"], layer_info["layer_order"]}')
    # .....
    elif layer_info['layer_type'] == 'BitConv2d':
        # .....
        padding = layer_info['padding']
        groups = layer_info['groups']
        in_channels = layer_info['in_channels']
        out_channels = layer_info['out_channels']

        # Apply padding
        if padding > 0:
            current_data = np.pad(
                current_data,
                pad_width=((0, 0), (0, 0), (padding, padding), (padding, padding)),
                mode='constant',
                constant_values=(0, 0)
            )
```
:::

:::spoiler Added [maxpool layer](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/BitNetMCU.py)
```python
# ...
elif layer_info['layer_type'] == 'MaxPool2d':
    kernel_size = layer_info['kernel_size']  # Assuming square kernel
    stride = layer_info['stride']

    # Extract input dimensions
    batch_size, channels, height, width = current_data.shape
    out_height = (height - kernel_size) // stride + 1
    out_width = (width - kernel_size) // stride + 1

    # Initialize output
    output = np.zeros((batch_size, channels, out_height, out_width),
                      dtype=current_data.dtype)

    # Perform max pooling
    for i in range(out_height):
        for j in range(out_width):
            h_start = i * stride
            h_end = h_start + kernel_size
            w_start = j * stride
            w_end = w_start + kernel_size
            patch = current_data[:, :, h_start:h_end, w_start:w_end]
            output[:, :, i, j] = np.max(patch, axis=(2, 3))
    current_data = output
```
:::

:::spoiler Fixed [exportquant](https://github.com/Denny0097/BitNetMCU/tree/BitNet_VGG/exportquant.py)
```python
for layer_info in quantized_model.quantized_model:
    layer = f'L{layer_info["layer_order"]}'
    layer_type = f'{layer_info["layer_type"]}'
    f.write(f'// Layer: {layer}\n')
    f.write(f'// Layer type: {layer_type}\n')

    if layer_info['layer_type'] == 'BitLinear':
        incoming_weights = layer_info['incoming_weights']
        outgoing_weights = layer_info['outgoing_weights']
        bpw = layer_info['bpw']
        weights = np.array(layer_info['quantized_weights'])
        quantization_type = layer_info['quantization_type']

        if (bpw * incoming_weights % 32) != 0:
            raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")

        print(f'Layer: {layer} Quantization type: <{quantization_type}>, Bits per weight: {bpw}, Num. incoming: {incoming_weights}, Num outgoing: {outgoing_weights}')

        data_type = np.uint32
        if quantization_type == 'Binary':
            encoded_weights = np.where(weights == -1, 0, 1)
            QuantID = 1
        elif quantization_type == '2bitsym':
            # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
            encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type)
            QuantID = 2
        # I2_S
        elif quantization_type == 'I2_S':
            # encoding -1 -> 00, 0 -> 01, 1 -> 10
            encoded_weights = weights.astype(data_type) + 1
            QuantID = 2
        elif quantization_type == '4bitsym':
            encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type)
            QuantID = 4
        elif quantization_type == '4bit':
            encoded_weights = np.floor(weights).astype(data_type) & 15  # twos complement encoding
            QuantID = 8 + 4
        elif quantization_type == 'NF4':
            levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                               0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
            encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
            QuantID = 32 + 4
        elif quantization_type == '8bit':
            encoded_weights = np.floor(weights).astype(data_type) & 255  # twos complement encoding
            QuantID = 8
        elif quantization_type == 'FP130':
            # FP1.3.0 encoding (sign * 2^exp)
            encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
            QuantID = 16 + 4
        else:
            print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')

        # pack bits into 32 bit words
        weight_per_word = 32 // bpw
        reshaped_array = encoded_weights.reshape(-1, weight_per_word)
        # reverse arange to match C language LSB first reading order
        bit_positions = np.arange(0, 32, bpw, dtype=data_type)
        packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)

        # Write layer order, shape, shiftright and weights to the file
        f.write(f'// Layer: {layer}\n')
        f.write(f'// QuantType: {quantization_type}\n')
        f.write(f'#define {layer}_active\n')
        f.write(f'#define {layer}_bitperweight {QuantID}\n')
        f.write(f'#define {layer}_incoming_weights {incoming_weights}\n')
        f.write(f'#define {layer}_outgoing_weights {outgoing_weights}\n')
        f.write(f'const uint32_t {layer}_weights[] = {{')
        for i, data in enumerate(packed_weights.flatten()):
            if i & 7 == 0:
                f.write('\n\t')
            f.write(f'0x{data:08x},')
        f.write('\n}; //first channel is topmost bit\n\n')

    elif layer_info['layer_type'] == 'BitConv2d':
        in_channels = layer_info['in_channels']
        out_channels = layer_info['out_channels']
        incoming_x = layer_info['incoming_x']
        incoming_y = layer_info['incoming_y']
        outgoing_x = layer_info['outgoing_x']
        outgoing_y = layer_info['outgoing_y']
        padding = layer_info['padding']
        groups = layer_info['groups']
        kernel_size = layer_info['kernel_size'][0]  # Assuming square kernel
        bpw = layer_info['bpw']
        quantization_type = layer_info['quantization_type']
        weights = np.array(layer_info['quantized_weights'])
        bias = layer_info.get('bias', None)

        f.write(f'// Layer: {layer} (Convolutional)\n')
        f.write(f'#define {layer}_active\n')
        f.write(f'#define {layer}_type BitConv2d\n')
        f.write(f'#define {layer}_in_channels {in_channels}\n')
        f.write(f'#define {layer}_out_channels {out_channels}\n')
        f.write(f'#define {layer}_incoming_x {incoming_x}\n')
        f.write(f'#define {layer}_incoming_y {incoming_y}\n')
        f.write(f'#define {layer}_outgoing_x {outgoing_x}\n')
        f.write(f'#define {layer}_outgoing_y {outgoing_y}\n')
        f.write(f'#define {layer}_kernel_size {kernel_size}\n')
        f.write(f'#define {layer}_stride 1\n')
        f.write(f'#define {layer}_padding {padding}\n')
        f.write(f'#define {layer}_groups {groups}\n')
        f.write(f'#define {layer}_bitperweight {bpw}\n')

        # if (bpw*incoming_weights%32) != 0:
        #     raise ValueError(f"Size mismatch: Incoming weights must be packed to 32bit boundary. Incoming weights: {incoming_weights} Bit per weight: {bpw} Total bits: {bpw*incoming_weights}")

        data_type = np.uint32
        if quantization_type == 'Binary':
            encoded_weights = np.where(weights == -1, 0, 1)
            QuantID = 1
        elif quantization_type == '2bitsym':
            # encoding -1.5 -> 11, -0.5 -> 10, 0.5 -> 00, 1.5 -> 01 (one complement with offset)
            encoded_weights = ((weights < 0).astype(data_type) << 1) | (np.floor(np.abs(weights))).astype(data_type)
            QuantID = 2
        # I2_S
        elif quantization_type == 'I2_S':
            # encoding -1 -> 00, 0 -> 01, 1 -> 10
            encoded_weights = weights.astype(data_type) + 1
            QuantID = 2
        elif quantization_type == '4bitsym':
            encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.abs(weights))).astype(data_type)
            QuantID = 4
        elif quantization_type == '4bit':
            encoded_weights = np.floor(weights).astype(data_type) & 15  # twos complement encoding
            QuantID = 8 + 4
        elif quantization_type == 'NF4':
            levels = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                               0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.723, 1.0])
            encoded_weights = np.argmin(np.abs(weights[:, :, np.newaxis] - levels), axis=2)
            QuantID = 32 + 4
        elif quantization_type == '8bit':
            encoded_weights = np.floor(weights).astype(data_type) & 255  # twos complement encoding
            QuantID = 8
        elif quantization_type == 'FP130':
            # FP1.3.0 encoding (sign * 2^exp)
            encoded_weights = ((weights < 0).astype(data_type) << 3) | (np.floor(np.log2(np.abs(weights)))).astype(data_type)
            QuantID = 16 + 4
        else:
            print(f'Skipping layer {layer} with quantization type {quantization_type} and {bpw} bits per weight. Quantization type not supported.')

        # pack bits into 32 bit words
        weight_per_word = 32 // bpw
        reshaped_array = encoded_weights.reshape(-1, weight_per_word)
        # reverse arange to match C language LSB first reading order
        bit_positions = np.arange(weight_per_word, dtype=data_type) * bpw
        packed_weights = np.bitwise_or.reduce(reshaped_array << bit_positions, axis=1).view(data_type)

        f.write(f'const uint32_t {layer}_packed_weights[] = {{')
        for i, data in enumerate(packed_weights.flatten()):
            if i % 32 == 0:
                f.write('\n\t')
            f.write(f'{data},')
        f.write('\n};\n\n')

        if 'bias' in layer_info and layer_info['bias'] is not None:
            bias = np.array(layer_info['bias']).astype(data_type)
            f.write(f'const int32_t {layer}_bias[] = {{')
            for i, data in enumerate(bias.flatten()):
                if i % 8 == 0:
                    f.write('\n\t')
                f.write(f'{data},')
            f.write('\n};\n\n')
        else:
            f.write(f'// No bias for layer {layer}\n')

        print(f'Layer: {layer} Conv2d bpw: {bpw} {in_channels} -> {out_channels} groups:{groups} Kernel: {kernel_size}x{kernel_size}')
```
:::

:::spoiler Added [test inference for I2_S](https://github.com/Denny0097/BitNetMCU/blob/BitNet_VGG/BitNetMCU_CIFAR10_test.c)

| Func | Desc |
| ---- | ---- |
| precessfc_I2_S | Processes a fully connected layer with 2-bit weights. |
| processcvlayer_I2_S | Processes a conv2D layer. |
| Maxp | Processes a maxpooling layer. |
| ReLUNorm | Applies a ReLU activation to an array of integers and normalizes (pre-quantizes) the result to 8-bit integers. |

```
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ gcc BitNetMCU_MNIST_test.c -o mnist_test -std=c99 -lm
(bitnet-cpp) denny0097:~/linux2025/BitNetMCU-main$ ./mnist_test
fc3_out: -1076 -749 -462 1650 -1109 -424 -1317 -438 -478 -750 label: 3 predicted: 3
fc3_out: -615 -556 1597 -337 -600 -778 -444 -513 -436 -672 label: 2 predicted: 2
fc3_out: 1255 -731 -646 -759 -801 -594 -397 -831 -626 -436 label: 0 predicted: 0
fc3_out: -761 -948 -753 -724 -102 -707 -1141 -317 -495 1685 label: 9 predicted: 9
fc3_out: 1346 -758 -718 -820 -860 -650 -506 -871 -678 -394 label: 0 predicted: 0
fc3_out: -366 -527 -800 -965 -315 -345 1135 -1150 -538 -847 label: 6 predicted: 6
fc3_out: -635 -701 -549 -599 88 -633 -884 -311 -362 1210 label: 9 predicted: 9
fc3_out: -474 -314 803 -133 -286 -708 -675 250 -339 -416 label: 2 predicted: 2
fc3_out: -526 -532 -134 -344 -628 -730 -1272 1069 -400 -110 label: 7 predicted: 7
fc3_out: -858 -318 -519 -481 -672 -995 -1577 1423 -506 -64 label: 7 predicted: 7
```
:::
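To make the packed format concrete, here is a minimal C sketch of how an inference kernel can walk one 32-bit word of I2_S weights (-1 → 00, 0 → 01, 1 → 10, packed LSB-first by exportquant.py above). The function name and signature are illustrative, not the actual code in BitNetMCU_CIFAR10_test.c:

```c
#include <stdint.h>

/* Accumulate 16 ternary MACs from one packed 32-bit word of I2_S weights.
 * Encoding: -1 -> 0b00, 0 -> 0b01, 1 -> 0b10, packed LSB-first.
 * Illustrative sketch of the decode step only. */
static int32_t mac16_i2s(uint32_t word, const int8_t *act)
{
    int32_t sum = 0;
    for (int i = 0; i < 16; i++) {
        uint32_t w = (word >> (2 * i)) & 0x3;  /* extract the 2-bit code */
        sum += ((int32_t)w - 1) * act[i];      /* decode to {-1, 0, 1} and MAC */
    }
    return sum;
}
```

Whether decoding with a multiply, a branch, or bit tricks is fastest depends on the compiler and CPU; the timing comparison in the next section measures exactly this.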
### Analysis

A test inference over 10 samples is used to observe the quantized model's execution.

**Exec time**

Wall-clock time is recorded around each layer, comparing **replacing the conv multiply with if/else** against **keeping the conv multiply**.

![vs](https://hackmd.io/_uploads/SJjgfiAXgl.png)

$$
\text{branch: } sum \leftarrow \begin{cases} sum - act & \text{if } weight = -1 \\ sum + act & \text{if } weight = +1 \end{cases}
\qquad
\text{mul: } sum \leftarrow sum + act \cdot weight
$$

| Layer | Branch (ms) | Mul (ms) |
| :--- | :--- | :--- |
| L2 (Conv) | 16.103 | 14.303 |
| L3 (ReLUNorm) | 0.322 | 0.355 |
| L4 (Maxp) | 0.329 | 0.316 |
| L5 (Conv) | 245.586 | 213.306 |
| L6 (ReLUNorm) | 0.201 | 0.201 |
| L7 (Maxp) | 0.198 | 0.196 |
| L8 (Conv) | 342.022 | 298.320 |
| L9 (ReLUNorm) | 0.105 | 0.105 |
| L10 (Conv) | 456.651 | 401.056 |
| L11 (ReLUNorm) | 0.069 | 0.069 |
| L12 (Conv) | 304.385 | 278.764 |
| L13 (ReLUNorm) | 0.065 | 0.067 |
| L14 (Maxp) | 0.066 | 0.067 |
| L16 (FC) | 5.622 | 4.595 |
| L17 (ReLUNorm) | 0.002 | 0.002 |
| L18 (FC) | 0.172 | 0.129 |
| L19 (ReLUNorm) | 0.001 | 0.001 |
| L20 (FC) | 0.007 | 0.006 |
| total | 1371.706 ms | 1211.458 ms |

The hope was that removing the multiply would speed up the computation, but deciding between add and subtract with if/else introduces extra branches, which end up costing more.

A bitwise formulation avoids both the multiply and the branch:

| Unpacked | Packed |
| - | - |
| -1 | 00 |
| 0 | 01 |
| 1 | 10 |

```c
int8_t delta = (-((int8_t)(weight == 0x2)) & act) | (-((int8_t)(weight == 0x0)) & (-act));
sum += delta;
```

![upload_fef315c4d495ba6358adc37dfe581299](https://hackmd.io/_uploads/Sy0uCj1Hxl.png)

In practice this runs about as fast as the branch version. The CPU has a hardware multiplier and the compiler optimizes aggressively, so in this environment the most direct mathematical form, `sum += act * weight;`, lets the compiler exploit the underlying multiply unit best.

#### perf

perf access is restricted; -1 is the most permissive level. Running a perf command directly usually prints:

```shell
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
perf_event_paranoid setting is 4:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
```

:::danger
Mind your terminology; follow the terms specified by the course.
:::

First lower the perf restriction to -1; the following command grants access temporarily:

```shell
sudo sh -c 'echo -1 > /proc/sys/kernel/perf_event_paranoid'
```

1. CPU

```shell
perf record ./mnist_test
```

```shell
perf report
```

![截圖 2025-06-13 上午11.50.44](https://hackmd.io/_uploads/SJmQemFQlx.png)
![image](https://hackmd.io/_uploads/SyENWmFQlx.png)

Model inference time is almost entirely spent in the convolution layer processcvlayer_I2_S (a candidate for loop unrolling).

2. Mem

| Layer | Entropy (bits) | Code capacity used |
| ------- | ------- | ------- |
| L2 | 1.57 | 78.33160789015268 % |
| L5 | 1.49 | 74.68134472269595 % |
| L8 | 1.45 | 72.42034190610372 % |
| L10 | 1.43 | 71.44577546801597 % |
| L12 | 1.47 | 73.48354983372585 % |
| L16 | 1.42 | 70.77351150852402 % |
| L18 | 1.58 | 78.99523722907314 % |
| L20 | 1.48 | 73.77139839499056 % |

Since the 2 bits only store weights in [-1, 1], it is no surprise that the code capacity utilization is poor.

```shell
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
```

**Result:**

```
 Performance counter stats for './cifar10_test':

         1,395,104      cache-misses        #   23.70% of all cache refs
         5,887,284      cache-references
    63,527,466,421      cycles

      13.750155002 seconds time elapsed

      13.748634000 seconds user
       0.000000000 seconds sys
```

The cache miss rate is 23.70%; huge pages are tried later to see whether this improves.

Loop unrolling alone reduces branches and speeds up computation, but the larger instruction footprint can raise cache misses (I-cache):

```shell
gcc -funroll-loops -o cifar10_test BitNetMCU_CIFAR10_test.c -lm
```

**Result:**

```
 Performance counter stats for './cifar10_test':

         1,439,561      cache-misses        #   17.66% of all cache refs
         8,153,347      cache-references
    63,693,742,173      cycles

      13.786458539 seconds time elapsed

      13.782243000 seconds user
       0.000999000 seconds sys
```

**Optimized compilation with gcc (additional instruction sets, loop unrolling, ...):**

```shell
gcc -O3 -march=native -funroll-loops -o mnist_test BitNetMCU_MNIST_test.c -lm
```

```
 Performance counter stats for './cifar10_test':

           788,640      cache-misses        #   31.30% of all cache refs
         2,519,992      cache-references
     7,213,769,871      cycles

       1.572760010 seconds time elapsed

       1.571592000 seconds user
       0.001000000 seconds sys
```

Although loop unrolling raises the miss rate, execution is dramatically faster (13.750155002 → 1.572760010 seconds elapsed, more than 8x), and it also reduces the burden on conv2.
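For reference, the transformation `-funroll-loops` applies to the hot MAC loop looks roughly like the following hand-unrolled sketch (unrolled by 4, with `n` assumed divisible by 4; this is not the actual generated code or the BitNetMCU source):

```c
#include <stdint.h>

/* Hand-unrolled inner product over n activations (n % 4 == 0 assumed).
 * Sketch of what -funroll-loops does mechanically to the hot loop. */
static int32_t dot_unroll4(const int8_t *act, const int8_t *w, int n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += act[i + 0] * w[i + 0];
        s1 += act[i + 1] * w[i + 1];
        s2 += act[i + 2] * w[i + 2];
        s3 += act[i + 3] * w[i + 3];
    }
    /* Four independent accumulators shorten the loop-carried dependency
     * chain and cut branch count, at the cost of more instructions. */
    return s0 + s1 + s2 + s3;
}
```

This makes the tradeoff measured above concrete: fewer branches and more instruction-level parallelism, in exchange for a larger code footprint.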
#### hugepage

```shell
sudo sysctl -w vm.nr_hugepages=128
```

This dynamically sets the number of huge pages available in the Linux kernel. Linux's default page size is 4 KB; allocating 128 huge pages of 2 MB each (256 MB in total) makes the kernel reserve 128 contiguous 2 MB physical memory regions for programs to allocate from.

```shell
cat /proc/meminfo | grep Huge
```

```
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          262144 kB
```

This confirms that huge pages are enabled.

#### libhugetlbfs

libhugetlbfs is a huge-page helper library on Linux that lets a program's malloc() draw memory from huge pages automatically.

Installation:

```
sudo apt update
sudo apt install libhugetlbfs-bin libhugetlbfs-dev
```

Observation:

```shell
env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhugetlbfs.so HUGETLB_MORECORE=yes \
perf stat -e cache-misses,cache-references,cycles ./cifar10_test
```

**Result:**

```
 Performance counter stats for './cifar10_test':

         1,063,465      cache-misses        #   12.56% of all cache refs
         8,465,880      cache-references
    63,575,493,015      cycles

      13.679735573 seconds time elapsed

      13.673417000 seconds user
       0.002999000 seconds sys
```

Compared with the baseline run, the cache-miss rate drops from 23.70% to 12.56%, yet elapsed time barely changes (13.75 s → 13.68 s). As the review at the top points out, it remains to be verified whether the heap actually lands on huge pages (for example by watching HugePages_Free in /proc/meminfo while the test runs) before attributing the difference to them.
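Besides the LD_PRELOAD route, a program can request huge pages explicitly. Below is a minimal sketch (not part of the test program) that backs a buffer with the reserved 2 MB pages via mmap(MAP_HUGETLB), falling back to normal pages plus a madvise(MADV_HUGEPAGE) hint for Transparent Hugepage Support when the reservation is unavailable:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate `size` bytes backed by 2 MB huge pages if possible.
 * Sketch only: error handling and NUMA placement are ignored. */
static void *alloc_hugepage(size_t size)
{
    /* round up to a multiple of 2 MB */
    size_t sz = (size + (2UL << 20) - 1) & ~((2UL << 20) - 1);

    void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;  /* backed by the explicitly reserved huge pages */

    /* Fallback: normal pages, hinting the kernel to use THP */
    p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, sz, MADV_HUGEPAGE);
    return p;
}
```

Note that madvise(MADV_HUGEPAGE) only takes effect when /sys/kernel/mm/transparent_hugepage/enabled is set to always or madvise; its effect can be checked through AnonHugePages in /proc/meminfo and the dTLB-load-misses perf event.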
#### valgrind

> Valgrind is a user-space framework for dynamic analysis of a program's runtime behavior. It provides multiple tools for tracing and profiling program performance, and is best known for helping users detect and analyze memory errors such as use of uninitialized memory, improper memory allocation, or improper deallocation.

ref: [2023 年 Linux 核心設計/實作課程作業-lab0(B) 以 Valgrind 分析記憶體問題](https://hackmd.io/@sysprog/linux2023-lab0/%2F%40sysprog%2Flinux2023-lab0-b)

```shell
valgrind --tool=massif ./cifar10_test
```

**massif:** observes the process's memory allocation over time by taking repeated snapshots.

```shell
ms_print massif.out.<pid>
```

#### Memory Usage Graph

```
    MB
5.199^                                   ::::::::::::::::::::::::::::::::::::#
     |                                :::                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |                              @   :                                    #
     |@:::::::::::::::::::::::::::::::@ :                                    #
     |@                           :::@  :                                    #
     |@                           :::@  :                                    #
     |@                           :::@  :                                    #
   0 +----------------------------------------------------------------------->ki
     0                                                                   161.6

Number of snapshots: 30
 Detailed snapshots: [9, 13, 23, 28 (peak)]
```

Peak: 5.199 MB. This gives a timeline of memory usage during the program's execution.

**Vertical axis (MB)**: total memory used by the program in megabytes; the observed peak is about 5.199 MB.
**Horizontal axis (ki)**: "kiloinstructions", i.e. the approximate number of instructions executed.

#### Snapshots 0–9: loading libraries (1.13 MB)

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0            4,096            4,096             0            0
  1              0           12,288           12,288             0            0
  2              0          847,872          847,872             0            0
  3              0          888,832          888,832             0            0
  4              0          892,928          892,928             0            0
  5              0        1,069,056        1,069,056             0            0
  6              0        1,110,016        1,110,016             0            0
  7              0        1,126,400        1,126,400             0            0
  8              0        1,130,496        1,130,496             0            0
  9              0        1,134,592        1,134,592             0            0
100.00% (1,134,592B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->100.00% (1,134,592B) 0x0: ???
```

`0x0: ???` and `time(i) = 0` indicate that this memory is occupied by libraries and packages.

#### Snapshots 10–13: loading libraries (1.13 MB)

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 10              0        1,146,880        1,146,880             0            0
 11              0        1,150,976        1,150,976             0            0
 12              0        1,155,072        1,155,072             0            0
 13              0        1,155,072        1,155,072             0            0
100.00% (1,155,072B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.65% (1,150,976B) 0x0: ???
|
->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)
```

99.65% is the same as above.

> ->00.35% (4,096B) in 1+ places, all below ms_print's threshold (01.00%)

A very small portion (0.35%) of the program's total allocated memory comes from one or more small allocations. Because their individual or cumulative sizes stay below ms_print's default 1% threshold, ms_print does not show the call sites of these allocations.

#### Snapshots 14–23:

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 14              0        1,150,976        1,150,976             0            0
 15              0        1,150,976        1,150,976             0            0
 16         68,331        1,159,168        1,159,168             0            0
 17         69,240        1,179,648        1,179,648             0            0
 18         69,317        1,183,744        1,183,744             0            0
 19         69,365        1,187,840        1,187,840             0            0
 20         69,413        1,196,032        1,196,032             0            0
 21         71,697        1,261,568        1,261,568             0            0
 22         74,157        3,432,448        3,432,448             0            0
 23         74,217        5,038,080        5,038,080             0            0
100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->77.15% (3,887,104B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->77.15% (3,887,104B) 0x4025D2C: mmap (mmap64.c:46)
|   ->43.50% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
|   | ->43.50% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
|   |   ->43.50% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
|   |     ->43.50% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |       ->43.09% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       | ->43.09% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |       |   ->43.09% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |       |     ->43.09% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
|   |       |       ->43.09% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |       |         ->43.09% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
|   |       |           ->43.09% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
|   |       |             ->43.09% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |       |
|   |       ->00.41% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
|   |
|   ->32.20% (1,622,016B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
|   | ->32.20% (1,622,016B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
|   |   ->32.20% (1,622,016B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |     ->31.87% (1,605,632B) 0x4002A2C: openaux (dl-deps.c:64)
|   |     | ->31.87% (1,605,632B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |     |   ->31.87% (1,605,632B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |     |     ->31.87% (1,605,632B) 0x402241B: dl_main (rtld.c:1965)
|   |     |       ->31.87% (1,605,632B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |     |         ->31.87% (1,605,632B) 0x402075D: _dl_start_final (rtld.c:494)
|   |     |           ->31.87% (1,605,632B) 0x402075D: _dl_start (rtld.c:581)
|   |     |             ->31.87% (1,605,632B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |     |
|   |     ->00.33% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
|   |
|   ->01.30% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
|   | ->01.30% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
|   |   ->01.30% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
|   |     ->01.30% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       ->01.30% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |         ->01.30% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |           ->01.30% (65,536B) 0x402241B: dl_main (rtld.c:1965)
|   |             ->01.30% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |               ->01.30% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
|   |                 ->01.30% (65,536B) 0x402075D: _dl_start (rtld.c:581)
|   |                   ->01.30% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |
|   ->00.16% (8,192B) in 1+ places, all below ms_print's threshold (01.00%)
|
->22.85% (1,150,976B) 0x0: ???
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
```

> n=16, time=68,331, total=1,159,168B

indicates roughly 68 k instructions have executed; from this snapshot onward total memory rises slowly, presumably variable initialization, until snapshot 22:

> n=22, time=74,157, total=3,432,448B

total memory goes 1.26 → 3.43 MB, and

> n=23, time=74,217, total=5,038,080B

total memory goes 3.43 → 5.03 MB. From the details:

> 100.00% (5,038,080B) (page allocation syscalls) mmap/mremap/brk

all of the memory comes from mmap page allocation.

> ->77.15% (3,887,104B) 0x4025D2C: __mmap64

77.15% (about 3.88 MB) is allocated by __mmap64, mainly from `linux-gnu/ld-linux-x86-64.so.2`, i.e. the Linux dynamic linker, likely loading additional or larger libraries, or initializing and relocating already-loaded ones, which requires more memory mappings.

> ->22.85% (1,150,976B) 0x0: ???

22.85% (about 1.15 MB) is presumably the baseline memory overhead from early in execution persisting.

#### Snapshots 24–28:

```
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 24         74,265        5,361,664        5,361,664             0            0
 25         74,313        5,386,240        5,386,240             0            0
 26         74,647        5,439,488        5,439,488             0            0
 27         81,700        5,451,776        5,451,776             0            0
 28        165,472        5,451,776        5,451,776             0            0
100.00% (5,451,776B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
| ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)
|   ->40.20% (2,191,360B) 0x4007E17: _dl_map_segment (dl-map-segments.h:29)
|   | ->40.20% (2,191,360B) 0x4007E17: _dl_map_segments (dl-map-segments.h:101)
|   |   ->40.20% (2,191,360B) 0x4007E17: _dl_map_object_from_fd (dl-load.c:1258)
|   |     ->40.20% (2,191,360B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |       ->39.82% (2,170,880B) 0x4002A2C: openaux (dl-deps.c:64)
|   |       | ->39.82% (2,170,880B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |       |   ->39.82% (2,170,880B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |       |     ->39.82% (2,170,880B) 0x402241B: dl_main (rtld.c:1965)
|   |       |       ->39.82% (2,170,880B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |       |         ->39.82% (2,170,880B) 0x402075D: _dl_start_final (rtld.c:494)
|   |       |           ->39.82% (2,170,880B) 0x402075D: _dl_start (rtld.c:581)
|   |       |             ->39.82% (2,170,880B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |       |
|   |       ->00.38% (20,480B) in 1+ places, all below ms_print's threshold (01.00%)
|   |
|   ->36.14% (1,970,176B) 0x4007F78: _dl_map_segments (dl-map-segments.h:139)
|   | ->36.14% (1,970,176B) 0x4007F78: _dl_map_object_from_fd (dl-load.c:1258)
|   |   ->36.14% (1,970,176B) 0x4009528: _dl_map_object (dl-load.c:2268)
|   |     ->35.84% (1,953,792B) 0x4002A2C: openaux (dl-deps.c:64)
|   |     | ->35.84% (1,953,792B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|   |     |   ->35.84% (1,953,792B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|   |     |     ->35.84% (1,953,792B) 0x402241B: dl_main (rtld.c:1965)
|   |     |       ->35.84% (1,953,792B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|   |     |         ->35.84% (1,953,792B) 0x402075D: _dl_start_final (rtld.c:494)
|   |     |           ->35.84% (1,953,792B) 0x402075D: _dl_start (rtld.c:581)
|   |     |             ->35.84% (1,953,792B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|   |     |
|   |     ->00.30% (16,384B) in 1+ places, all below ms_print's threshold (01.00%)
|   |
|   ->01.35% (73,728B) in 2 places, all below massif's threshold (1.00%)
|   |
|   ->01.20% (65,536B) 0x400C36C: _dl_sysdep_read_whole_file (dl-misc.c:49)
|     ->01.20% (65,536B) 0x4016C27: _dl_load_cache_lookup (dl-cache.c:411)
|       ->01.20% (65,536B) 0x40097CA: _dl_map_object (dl-load.c:2135)
|         ->01.20% (65,536B) 0x4002A2C: openaux (dl-deps.c:64)
|           ->01.20% (65,536B) 0x400151B: _dl_catch_exception (dl-catch.c:237)
|             ->01.20% (65,536B) 0x4002E66: _dl_map_object_deps (dl-deps.c:232)
|               ->01.20% (65,536B) 0x402241B: dl_main (rtld.c:1965)
|                 ->01.20% (65,536B) 0x401EF45: _dl_sysdep_start (dl-sysdep.c:140)
|                   ->01.20% (65,536B) 0x402075D: _dl_start_final (rtld.c:494)
|                     ->01.20% (65,536B) 0x402075D: _dl_start (rtld.c:581)
|                       ->01.20% (65,536B) 0x401F547: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
|
->21.11% (1,150,976B) 0x0: ???
|
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 29        165,472        5,447,680        5,447,680             0            0
```

> 78.89% (4,300,800B) 0x4025D2C: __mmap64 (mmap64.c:58)
> | ->78.89% (4,300,800B) 0x4025D2C: mmap (mmap64.c:46)

The main source is still the dynamic linker's (ld-linux-x86-64.so.2) mmap64 calls: even though the program's own code has already started running, the system loads further new libraries needed for the convolution and other inference operations at this point.

Further reading: [你所不知道的 C 語言:動態連結器篇](https://hackmd.io/@sysprog/c-dynamic-linkage)

:::danger
Do not present textual information as images!
:::

### Optimization based on analysis

#### T-MAC

// to be continued

## HW simulation
