# 2025/01/07 - 2025/01/20
[TOC]
## TODO
- [ ] Use the finished components to wire up BitNet inference and compare bitlinear against the other parts
- [x] Collect papers (undergraduate research project)
    * Write the software
    * Sketch the overall picture
    * Cite plenty of literature
    * [Paper on skipping zeros, i.e. handling cases like 11000000](https://ieeexplore.ieee.org/document/7551378)

Next TODO:
- Fix the bug => use 5 lookups per activation: handle the sign bit separately, then cover the remaining 7 bits in 4 lookups
- How does the DeepGEMM implementation differ from bitnet.cpp's?
- BSS is worth a look if interested
## Questions
1. The activation range in bitnet.cpp is [-128, 127]; I am not sure how to handle the sign, since the current approach splits each 8-bit activation into 2-bit pieces for computation [details](https://hackmd.io/7a2GAGE8ROiJWioHnpQpDA#here)
2. Need to measure the share of time spent in bitlinear versus the other operations in BitNet.
bitnet.cpp is only a framework: it provides three different kernels, each with its own optimizations for low-bit weights. It still downloads a pre-trained model from Hugging Face and runs inference on it; only operations such as the dot product used during inference are replaced by the selected kernel's optimized dot-product function. This seems to make it hard to separate the time spent in bitlinear from everything else?
> BitNet ([1bitLLM/bitnet_b1_58-large](https://huggingface.co/1bitLLM/bitnet_b1_58-large))
> Plan: modify the SIMD intrinsics part of bitnet.cpp; once inference works, add timing
> Current timing idea: use two time variables, one accumulating bitlinear time and the other accumulating the time of all other operations
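The two-accumulator timing idea above could be sketched like this; `g_bitlinear_ms`, `g_other_ms`, and `timed` are placeholder names for illustration, not bitnet.cpp symbols:

```cpp
#include <cassert>
#include <chrono>

// Two global accumulators, as described above: one for time spent in the
// bitlinear dot-product kernel, one for every other operation.
static double g_bitlinear_ms = 0.0;
static double g_other_ms = 0.0;

// Run any callable and add its wall-clock time to the given accumulator.
template <typename F>
void timed(double &acc_ms, F &&f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    acc_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In practice the body of the bitlinear kernel would be wrapped in `timed(g_bitlinear_ms, ...)` and the remaining operations in `timed(g_other_ms, ...)`, then both totals printed at the end of inference.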
## Setting
* Environment
* WSL2, Ubuntu 22.04
* Local machine CPU:
* Architecture: x86_64
* CPU(s): 22
* Model used: 1bitLLM/bitnet_b1_58-large
## bitnet.cpp inference with i2_s kernel
* Command
`python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1`
* Result
```bash=
llama_perf_sampler_print: sampling time = 2.39 ms / 68 runs ( 0.04 ms per token, 28463.79 tokens per second)
llama_perf_context_print: load time = 10525.32 ms
llama_perf_context_print: prompt eval time = 150.87 ms / 6 tokens ( 25.14 ms per token, 39.77 tokens per second)
llama_perf_context_print: eval time = 1577.66 ms / 61 runs ( 25.86 ms per token, 38.66 tokens per second)
llama_perf_context_print: total time = 1744.88 ms / 67 tokens
```
* Response
```
What is your name? What is your home town? What's your favourite city? How would you describe the character of your city? What makes it so special? What makes it so difficult to live here? How do you feel about the city? What do you think it is that makes this city so special? How do you feel about it? How do you feel about your city?
Do you live in a small town? Do you live in a big city? Do you like to walk around the city centre? Do you like to take public transport? Do you like the city's people and its culture? Do you like the city'
```
:::spoiler Entire output
### output
```
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3954 (957b59d2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 266 tensors from models/bitnet_b1_58-large/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bitnet
llama_model_loader: - kv 1: general.name str = bitnet_b1_58-large
llama_model_loader: - kv 2: bitnet.block_count u32 = 24
llama_model_loader: - kv 3: bitnet.context_length u32 = 2048
llama_model_loader: - kv 4: bitnet.embedding_length u32 = 1536
llama_model_loader: - kv 5: bitnet.feed_forward_length u32 = 4096
llama_model_loader: - kv 6: bitnet.attention.head_count u32 = 16
llama_model_loader: - kv 7: bitnet.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: bitnet.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: bitnet.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: bitnet.vocab_size u32 = 32002
llama_model_loader: - kv 12: bitnet.rope.scaling.type str = linear
llama_model_loader: - kv 13: bitnet.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i2_s: 168 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bitnet
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1536
llm_load_print_meta: n_embd_v_gqa = 1536
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4096
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 700M
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 728.84 M
llm_load_print_meta: model size = 256.56 MiB (2.95 BPW)
llm_load_print_meta: general.name = bitnet_b1_58-large
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '</line>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 256.56 MiB
.................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 5.00 MiB
llama_new_context_with_model: graph nodes = 870
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 1
system_info: n_threads = 1 (n_threads_batch = 1) / 22 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 3163594951
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
What is your name? What is your nationality? What is your marital status? Where are you currently located? Are you a first-time visitor to the site or do you have a recent visit?
You can also check out our contact page if you would like to speak to us by phone or e-mail. [end of text]
llama_perf_sampler_print: sampling time = 2.39 ms / 68 runs ( 0.04 ms per token, 28463.79 tokens per second)
llama_perf_context_print: load time = 10525.32 ms
llama_perf_context_print: prompt eval time = 150.87 ms / 6 tokens ( 25.14 ms per token, 39.77 tokens per second)
llama_perf_context_print: eval time = 1577.66 ms / 61 runs ( 25.86 ms per token, 38.66 tokens per second)
llama_perf_context_print: total time = 1744.88 ms / 67 tokens
```
:::
* Run e2e benchmark
```bash=
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -n 200 -p 256 -t 4
| model | size | params | backend | threads | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU | 4 | 1 | pp256 | 35.15 ± 2.96 |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU | 4 | 1 | tg200 | 34.10 ± 1.76 |
build: 957b59d2 (3954)
```
## Steps to set up the environment
```bash=
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
sudo apt install cmake
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large -q i2_s
```
## bitnet.cpp dot products of W2A8 with AVX2
* `xq8_0`, `xq8_1`, `xq8_2`, `xq8_3` (weights): each holds 32 8-bit values, of which only the lowest 2 bits are valid
* `yq8_0`, `yq8_1`, `yq8_2`, `yq8_3` (inputs): each holds 32 8-bit values
```c++
void ggml_vec_dot_i2_i8_s(int n, float * s, size_t bs, const void * vx, size_t bx, const void * vy, size_t by, int nrc) {
const uint8_t * x = (uint8_t *)vx;
const int8_t * y = (int8_t *)vy;
const int nb = n / QK_I2_S;
const int group32_num = nb / 32;
const int la_num = nb % 32;
const int groupla_num = nb % 32 != 0 ? 1 : 0;
#if defined(__AVX2__)
__m256i mask = _mm256_set1_epi8(0x03);
__m256i accu = _mm256_setzero_si256();
for (int i=0; i < group32_num; i++){
__m256i accu32 = _mm256_setzero_si256();
for (int j=0; j < 32; j++) {
// 128 index
__m256i xq8_3 = _mm256_loadu_si256((const __m256i*)(x + i * 32 * 32 + j * 32));
__m256i xq8_2 = _mm256_srli_epi16(xq8_3, 2);
__m256i xq8_1 = _mm256_srli_epi16(xq8_3, 4);
__m256i xq8_0 = _mm256_srli_epi16(xq8_3, 6);
// each 32 index
xq8_3 = _mm256_and_si256(xq8_3, mask);
xq8_2 = _mm256_and_si256(xq8_2, mask);
xq8_1 = _mm256_and_si256(xq8_1, mask);
xq8_0 = _mm256_and_si256(xq8_0, mask);
// each 32 index
__m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 0)); // in bytes, so +32 is actually +256 bits => 32 8-bit values
__m256i yq8_1 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 32));
__m256i yq8_2 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 64));
__m256i yq8_3 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 96));
        // accumulate over the 128 indices of this iteration
        // split into 32 accumulator lanes
        // each lane accumulates 4 products per 128-index iteration
        // each product is at most 256
        // so each lane gains at most 4 * 256 per iteration
        // a lane can accumulate up to 127 * 256 in total
        // hence every group of 32 iterations (128 indices each) must be widened to int32
xq8_0 = _mm256_maddubs_epi16(xq8_0, yq8_0);
xq8_1 = _mm256_maddubs_epi16(xq8_1, yq8_1);
xq8_2 = _mm256_maddubs_epi16(xq8_2, yq8_2);
xq8_3 = _mm256_maddubs_epi16(xq8_3, yq8_3);
accu32 = _mm256_add_epi16(accu32, _mm256_add_epi16(xq8_0, xq8_1));
accu32 = _mm256_add_epi16(accu32, _mm256_add_epi16(xq8_2, xq8_3));
}
accu = _mm256_add_epi32(_mm256_madd_epi16(accu32, _mm256_set1_epi16(1)), accu);
}
for (int i = 0; i < groupla_num; i++){
__m256i accula = _mm256_setzero_si256();
for (int j = 0; j < la_num; j++) {
// 128 index
__m256i xq8_3 = _mm256_loadu_si256((const __m256i*)(x + group32_num * 32 * 32 + j * 32));
__m256i xq8_2 = _mm256_srli_epi16(xq8_3, 2);
__m256i xq8_1 = _mm256_srli_epi16(xq8_3, 4);
__m256i xq8_0 = _mm256_srli_epi16(xq8_3, 6);
// each 32 index
xq8_3 = _mm256_and_si256(xq8_3, mask);
xq8_2 = _mm256_and_si256(xq8_2, mask);
xq8_1 = _mm256_and_si256(xq8_1, mask);
xq8_0 = _mm256_and_si256(xq8_0, mask);
// each 32 index
__m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 0));
__m256i yq8_1 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 32));
__m256i yq8_2 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 64));
__m256i yq8_3 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 96));
        // accumulate over the 128 indices of this iteration
        // split into 32 accumulator lanes
        // each lane accumulates 4 products per 128-index iteration
        // each product is at most 256
        // so each lane gains at most 4 * 256 per iteration
        // a lane can accumulate up to 127 * 256 in total
        // hence every group of 32 iterations (128 indices each) must be widened to int32
xq8_0 = _mm256_maddubs_epi16(xq8_0, yq8_0);
xq8_1 = _mm256_maddubs_epi16(xq8_1, yq8_1);
xq8_2 = _mm256_maddubs_epi16(xq8_2, yq8_2);
xq8_3 = _mm256_maddubs_epi16(xq8_3, yq8_3);
accula = _mm256_add_epi16(accula, _mm256_add_epi16(xq8_0, xq8_1));
accula = _mm256_add_epi16(accula, _mm256_add_epi16(xq8_2, xq8_3));
}
accu = _mm256_add_epi32(accu, _mm256_madd_epi16(accula, _mm256_set1_epi16(1)));
}
    int sumi = hsum_i32_8(accu);
    *s = (float)sumi;
#endif
}
```
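For reference, the AVX2 loop above corresponds to the following scalar computation. This is a sketch assuming the layout implied by the pointer arithmetic: byte `k` of each 32-byte weight block holds, from its high bits down, the 2-bit codes for activation positions `k`, `32+k`, `64+k`, `96+k`; the weight codes are used as unsigned 0..3 exactly as `_mm256_maddubs_epi16` does, with any ternary offset handled elsewhere:

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the inner computation of ggml_vec_dot_i2_i8_s
// (layout reconstructed from the offsets above; n a multiple of 128).
int32_t dot_i2_i8_scalar(const uint8_t *x, const int8_t *y, int n) {
    int32_t acc = 0;
    for (int blk = 0; blk < n / 128; ++blk) {
        const uint8_t *xb = x + blk * 32;   // 32 bytes = 128 packed 2-bit weights
        const int8_t  *yb = y + blk * 128;  // 128 8-bit activations
        for (int k = 0; k < 32; ++k) {
            for (int plane = 0; plane < 4; ++plane) {
                // plane 0 uses bits 7:6 (xq8_0), plane 3 uses bits 1:0 (xq8_3)
                int w = (xb[k] >> (6 - 2 * plane)) & 0x03;
                acc += w * yb[plane * 32 + k];
            }
        }
    }
    return acc;
}
```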
## Modified ggml_vec_dot_i2_i8_s
* Inference output is still wrong (there is still a bug)
* `x`: unsigned weight indices
    * indices 0,1,2 -> values -1,0,1
* `y`: **signed** 8-bit numbers
> When `shift == 6` (the highest 2 bits of the 8-bit number), add the **negated** LUT result
* Verified the ranges of the weight indices and activations:
    * activations: -128 to 127
    * weights: indices 0,1,2 -> -1,0,1
> Suspected cause: an overflow or a wrong sign extension, but I can't debug it QwQ
#### here
* Method for handling the activation range [-128,127]:
    * When I wrote the DeepGEMM multiplication last time, the activation range was [0, 255], but in bitnet.cpp it is [-128, 127]. So during accumulation, the LUT result for the highest 2 bits of each 8-bit activation is added with a **negative** sign.
    * For example, $-7_{dec}=11111001_{bin}=-2^7+2^6+2^5+2^4+2^3+2^0$
    * *A bug I noticed but haven't fixed yet*: the highest 2 bits must not be negated together; only the top bit carries a negative weight. I don't yet know how to handle a negative range, because the 2-bit products are precomputed in the LUT, so the highest and second-highest activation bits cannot be processed separately?
```c++
if(shift == 6){
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
```
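The suspected bug can be confirmed with a small scalar check: negating the whole top 2-bit group computes $-(2b_7+b_6)\cdot 64$ instead of the correct $-b_7\cdot 128 + b_6\cdot 64$, so the result is off by $128\cdot b_6$ whenever bit 6 is set. Splitting the sign bit out (one lookup for bit 7, plus four lookups for the remaining 7 bits) reconstructs every value. Both functions below are illustrative sketches, not kernel code:

```cpp
#include <cassert>
#include <cstdint>

// Current kernel behaviour: sum the 2-bit groups, negating the whole top group.
int decode_negate_top_group(uint8_t u) {
    int acc = 0;
    for (int shift = 0; shift < 8; shift += 2) {
        int term = ((u >> shift) & 0x03) << shift;
        acc += (shift == 6) ? -term : term;
    }
    return acc;
}

// Proposed fix: only bit 7 is negative; the remaining 7 bits are positive.
// 5 lookups total: sign bit, bit 6 alone, and three 2-bit groups.
int decode_split_sign_bit(uint8_t u) {
    int acc = -((u >> 7) & 1) * 128;            // sign bit, one lookup
    acc += ((u >> 6) & 1) ? 64 : 0;             // bit 6 alone
    for (int shift = 0; shift < 6; shift += 2)
        acc += ((u >> shift) & 0x03) << shift;  // three 2-bit groups
    return acc;
}
```

Exhaustively checking all 256 byte values shows the sign-split decode always recovers the signed value, while negating the whole top group is wrong for exactly the 128 values with bit 6 set.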
* **Slower** than the baseline
    * Baseline: SIMD bitnet.cpp (i2_s kernel)
        * Also packs the ternary weights into 2 bits and unpacks them during inference
        * Performs the matrix multiplication in the vanilla multiply-then-add manner
        * **Also uses SIMD instructions for acceleration**
* Command
`python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1`
* Result
```
llama_perf_sampler_print: sampling time = 4.84 ms / 134 runs ( 0.04 ms per token, 27691.67 tokens per second)
llama_perf_context_print: load time = 9613.05 ms
llama_perf_context_print: prompt eval time = 533.84 ms / 6 tokens ( 88.97 ms per token, 11.24 tokens per second)
llama_perf_context_print: eval time = 13220.86 ms / 127 runs ( 104.10 ms per token, 9.61 tokens per second)
llama_perf_context_print: total time = 13780.34 ms / 133 tokens
```
* Prompt
`What is your name?`
* Response
```!
What is your name?tree math poi poihp poi poidetail poi poi poihp poi poi poihbar poihpfol poi poi poihbar honour poi Contexthp poivos poi poifolvos poi poi poiher Desp poi Synhphphp Math poi honour Syn honour poihp poihp poihp poi poivos honourfol poihphp honour Desphp poi poihpwaldwald poi poiwaldhp poifol Context poi poifol poivol poi poidetail poi Syn poihpwehrfolvoshp poihp poi poi poi poihpfol Trib poi poi poi poi poi poi poi Syn poiwaldvos poi poi poi poihphp poihp poihp poihpfolhpfol
```
:::spoiler Entire Output
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3954 (957b59d2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 266 tensors from models/bitnet_b1_58-large/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bitnet
llama_model_loader: - kv 1: general.name str = bitnet_b1_58-large
llama_model_loader: - kv 2: bitnet.block_count u32 = 24
llama_model_loader: - kv 3: bitnet.context_length u32 = 2048
llama_model_loader: - kv 4: bitnet.embedding_length u32 = 1536
llama_model_loader: - kv 5: bitnet.feed_forward_length u32 = 4096
llama_model_loader: - kv 6: bitnet.attention.head_count u32 = 16
llama_model_loader: - kv 7: bitnet.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: bitnet.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: bitnet.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: bitnet.vocab_size u32 = 32002
llama_model_loader: - kv 12: bitnet.rope.scaling.type str = linear
llama_model_loader: - kv 13: bitnet.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i2_s: 168 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bitnet
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1536
llm_load_print_meta: n_embd_v_gqa = 1536
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4096
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 700M
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 728.84 M
llm_load_print_meta: model size = 256.56 MiB (2.95 BPW)
llm_load_print_meta: general.name = bitnet_b1_58-large
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '</line>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 256.56 MiB
.................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 5.00 MiB
llama_new_context_with_model: graph nodes = 870
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 1
system_info: n_threads = 1 (n_threads_batch = 1) / 22 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 135211052
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
What is your name?tree math poi poihp poi poidetail poi poi poihp poi poi poihbar poihpfol poi poi poihbar honour poi Contexthp poivos poi poifolvos poi poi poiher Desp poi Synhphphp Math poi honour Syn honour poihp poihp poihp poi poivos honourfol poihphp honour Desphp poi poihpwaldwald poi poiwaldhp poifol Context poi poifol poivol poi poidetail poi Syn poihpwehrfolvoshp poihp poi poi poi poihpfol Trib poi poi poi poi poi poi poi Syn poiwaldvos poi poi poi poihphp poihp poihp poihpfolhpfol
llama_perf_sampler_print: sampling time = 4.84 ms / 134 runs ( 0.04 ms per token, 27691.67 tokens per second)
llama_perf_context_print: load time = 9613.05 ms
llama_perf_context_print: prompt eval time = 533.84 ms / 6 tokens ( 88.97 ms per token, 11.24 tokens per second)
llama_perf_context_print: eval time = 13220.86 ms / 127 runs ( 104.10 ms per token, 9.61 tokens per second)
llama_perf_context_print: total time = 13780.34 ms / 133 tokens
:::
```c++=
alignas(32) int8_t lut[LUT_SIZE] = {0,-1,-2,-3,0,0,0,0,0,1,2,3,0,2,4,6};
void ggml_vec_dot_i2_i8_s(int n, float * s, size_t bs, const void * vx, size_t bx, const void * vy, size_t by, int nrc) {
const uint8_t * x = (uint8_t *)vx;
const int8_t * y = (int8_t *)vy;
const int nb = n / QK_I2_S;
const int group32_num = nb / 32;
const int la_num = nb % 32;
const int groupla_num = nb % 32 != 0 ? 1 : 0;
#if defined(__AVX2__)
// LUT initialization
// alignas(32) int8_t lut[LUT_SIZE] = {0};
// const int8_t predefined_weights[4] = {-1, 0, 1, 2}; // index only 0,1,2 -> values -1,0,1
// const int8_t predefined_activations[4] = {0, 1, 2, 3};
// generateLUT(lut, predefined_weights, predefined_activations);
// load LUT into AVX2 register and prepare it for 256-bit shuffle
__m256i lut_vec = _mm256_load_si256(reinterpret_cast<const __m256i *>(lut));
lut_vec = _mm256_permute2x128_si256(lut_vec, lut_vec, 0x00);
// accumulator for final result
__m256i accu = _mm256_setzero_si256();
// process full groups of 32 blocks
for (int i = 0; i < group32_num; i++) {
__m256i group_accu = _mm256_setzero_si256();
        for (int j = 0; j < 32; j++) { // each group contains 32 packed blocks
const uint8_t *x_base = x + i * 32 * 32 + j * 32; // from original code
const uint8_t *y_base = y + i * 128 * 32 + j * 128; // from original code
// load activations and weights
// __m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y_base + 0));
__m256i act_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 0)); // 32 activations
__m256i act_vec_2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 32)); // 32 activations
__m256i act_vec_3 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 64)); // 32 activations
__m256i act_vec_4 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 96)); // 32 activations
__m256i wt_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x_base));
__m256i wt_vec_2 = _mm256_srli_epi16(wt_vec_1, 2);
__m256i wt_vec_3 = _mm256_srli_epi16(wt_vec_1, 4);
__m256i wt_vec_4 = _mm256_srli_epi16(wt_vec_1, 6);
// accumulator for partial results of this block
__m256i block_accu = _mm256_setzero_si256();
__m256i block_sub_result_low = _mm256_setzero_si256(); // added, because the multiplication result is 16 bits and we have 32 results in each group
__m256i block_sub_result_high = _mm256_setzero_si256(); // added
__m256i wt_index_1 = _mm256_and_si256(wt_vec_1, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_2 = _mm256_and_si256(wt_vec_2, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_3 = _mm256_and_si256(wt_vec_3, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_4 = _mm256_and_si256(wt_vec_4, _mm256_set1_epi8(0x03)); // 32 weights
// compute using LUT for all 4 shifts (2-bit processing)
for (int shift = 0; shift < 8; shift += 2) { // divide 8-bit activations into 4 2-bit sub-values to calculate
// extract indices for activations and weights
__m256i act_index_1 = _mm256_and_si256(_mm256_srli_epi16(act_vec_1, shift), _mm256_set1_epi8(0x03));
__m256i act_index_2 = _mm256_and_si256(_mm256_srli_epi16(act_vec_2, shift), _mm256_set1_epi8(0x03));
__m256i act_index_3 = _mm256_and_si256(_mm256_srli_epi16(act_vec_3, shift), _mm256_set1_epi8(0x03));
__m256i act_index_4 = _mm256_and_si256(_mm256_srli_epi16(act_vec_4, shift), _mm256_set1_epi8(0x03));
// combine indices for LUT lookup
__m256i combined_index_1 = _mm256_or_si256(act_index_1, _mm256_slli_epi16(wt_index_1, 2));
__m256i combined_index_2 = _mm256_or_si256(act_index_2, _mm256_slli_epi16(wt_index_2, 2));
__m256i combined_index_3 = _mm256_or_si256(act_index_3, _mm256_slli_epi16(wt_index_3, 2));
__m256i combined_index_4 = _mm256_or_si256(act_index_4, _mm256_slli_epi16(wt_index_4, 2));
// LUT lookup
__m256i lut_values_1 = _mm256_shuffle_epi8(lut_vec, combined_index_1);
__m256i lut_values_2 = _mm256_shuffle_epi8(lut_vec, combined_index_2);
__m256i lut_values_3 = _mm256_shuffle_epi8(lut_vec, combined_index_3);
__m256i lut_values_4 = _mm256_shuffle_epi8(lut_vec, combined_index_4);
                // split into low and high 128-bit halves (the 8-bit values must be widened to 16-bit before accumulation)
                __m128i lut_values_low_1 = _mm256_castsi256_si128(lut_values_1); // low 128 bits (low lane)
__m128i lut_values_low_2 = _mm256_castsi256_si128(lut_values_2);
__m128i lut_values_low_3 = _mm256_castsi256_si128(lut_values_3);
__m128i lut_values_low_4 = _mm256_castsi256_si128(lut_values_4);
                __m128i lut_values_high_1 = _mm256_extracti128_si256(lut_values_1, 1); // high 128 bits (high lane)
__m128i lut_values_high_2 = _mm256_extracti128_si256(lut_values_2, 1);
__m128i lut_values_high_3 = _mm256_extracti128_si256(lut_values_3, 1);
__m128i lut_values_high_4 = _mm256_extracti128_si256(lut_values_4, 1);
// extend 8-bit to 16-bit
__m256i lut_values_low_16_1 = _mm256_cvtepi8_epi16(lut_values_low_1);
__m256i lut_values_low_16_2 = _mm256_cvtepi8_epi16(lut_values_low_2);
__m256i lut_values_low_16_3 = _mm256_cvtepi8_epi16(lut_values_low_3);
__m256i lut_values_low_16_4 = _mm256_cvtepi8_epi16(lut_values_low_4);
__m256i lut_values_high_16_1 = _mm256_cvtepi8_epi16(lut_values_high_1);
__m256i lut_values_high_16_2 = _mm256_cvtepi8_epi16(lut_values_high_2);
__m256i lut_values_high_16_3 = _mm256_cvtepi8_epi16(lut_values_high_3);
__m256i lut_values_high_16_4 = _mm256_cvtepi8_epi16(lut_values_high_4);
if(shift == 6){ // highest 2 bits
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
// Add extended result to the corresponding accumulator
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_1, lut_values_low_16_2), shift));
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_3, lut_values_low_16_4), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_1, lut_values_high_16_2), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_3, lut_values_high_16_4), shift));
}
// widen 16-bit partials to 32-bit (_mm256_madd_epi16 with 1) and accumulate into the group
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_low, _mm256_set1_epi16(1)));
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_high, _mm256_set1_epi16(1)));
}
// accumulate group results into global accumulator
accu = _mm256_add_epi32(accu, group_accu);
}
// remaining blocks (less than 32)
if (la_num > 0) {
__m256i group_accu = _mm256_setzero_si256();
for (int j = 0; j < la_num; j++) {
const uint8_t *x_base = x + group32_num * 32 * 32 + j * 32;
const uint8_t *y_base = y + group32_num * 128 * 32 + j * 128;
// load activations and weights
__m256i act_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 0));
__m256i act_vec_2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 32));
__m256i act_vec_3 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 64));
__m256i act_vec_4 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 96));
__m256i wt_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x_base));
__m256i wt_vec_2 = _mm256_srli_epi16(wt_vec_1, 2);
__m256i wt_vec_3 = _mm256_srli_epi16(wt_vec_1, 4);
__m256i wt_vec_4 = _mm256_srli_epi16(wt_vec_1, 6);
__m256i wt_index_1 = _mm256_and_si256(wt_vec_1, _mm256_set1_epi8(0x03));
__m256i wt_index_2 = _mm256_and_si256(wt_vec_2, _mm256_set1_epi8(0x03));
__m256i wt_index_3 = _mm256_and_si256(wt_vec_3, _mm256_set1_epi8(0x03));
__m256i wt_index_4 = _mm256_and_si256(wt_vec_4, _mm256_set1_epi8(0x03));
// accumulators for 16-bit partial results of this block (low/high lanes)
__m256i block_sub_result_low = _mm256_setzero_si256();
__m256i block_sub_result_high = _mm256_setzero_si256();
// compute using LUT for all 4 shifts (2-bit processing)
for (int shift = 0; shift < 8; shift += 2) {
// extract indices for activations and weights
__m256i act_index_1 = _mm256_and_si256(_mm256_srli_epi16(act_vec_1, shift), _mm256_set1_epi8(0x03));
__m256i act_index_2 = _mm256_and_si256(_mm256_srli_epi16(act_vec_2, shift), _mm256_set1_epi8(0x03));
__m256i act_index_3 = _mm256_and_si256(_mm256_srli_epi16(act_vec_3, shift), _mm256_set1_epi8(0x03));
__m256i act_index_4 = _mm256_and_si256(_mm256_srli_epi16(act_vec_4, shift), _mm256_set1_epi8(0x03));
// combine indices for LUT lookup
__m256i combined_index_1 = _mm256_or_si256(act_index_1, _mm256_slli_epi16(wt_index_1, 2));
__m256i combined_index_2 = _mm256_or_si256(act_index_2, _mm256_slli_epi16(wt_index_2, 2));
__m256i combined_index_3 = _mm256_or_si256(act_index_3, _mm256_slli_epi16(wt_index_3, 2));
__m256i combined_index_4 = _mm256_or_si256(act_index_4, _mm256_slli_epi16(wt_index_4, 2));
// LUT lookup
__m256i lut_values_1 = _mm256_shuffle_epi8(lut_vec, combined_index_1);
__m256i lut_values_2 = _mm256_shuffle_epi8(lut_vec, combined_index_2);
__m256i lut_values_3 = _mm256_shuffle_epi8(lut_vec, combined_index_3);
__m256i lut_values_4 = _mm256_shuffle_epi8(lut_vec, combined_index_4);
// split each LUT result into its low and high 128-bit halves
__m128i lut_values_low_1 = _mm256_castsi256_si128(lut_values_1); // lower 128 bits (low lane)
__m128i lut_values_low_2 = _mm256_castsi256_si128(lut_values_2);
__m128i lut_values_low_3 = _mm256_castsi256_si128(lut_values_3);
__m128i lut_values_low_4 = _mm256_castsi256_si128(lut_values_4);
__m128i lut_values_high_1 = _mm256_extracti128_si256(lut_values_1, 1); // upper 128 bits (high lane)
__m128i lut_values_high_2 = _mm256_extracti128_si256(lut_values_2, 1);
__m128i lut_values_high_3 = _mm256_extracti128_si256(lut_values_3, 1);
__m128i lut_values_high_4 = _mm256_extracti128_si256(lut_values_4, 1);
// sign-extend 8-bit LUT values to 16-bit
__m256i lut_values_low_16_1 = _mm256_cvtepi8_epi16(lut_values_low_1);
__m256i lut_values_low_16_2 = _mm256_cvtepi8_epi16(lut_values_low_2);
__m256i lut_values_low_16_3 = _mm256_cvtepi8_epi16(lut_values_low_3);
__m256i lut_values_low_16_4 = _mm256_cvtepi8_epi16(lut_values_low_4);
__m256i lut_values_high_16_1 = _mm256_cvtepi8_epi16(lut_values_high_1);
__m256i lut_values_high_16_2 = _mm256_cvtepi8_epi16(lut_values_high_2);
__m256i lut_values_high_16_3 = _mm256_cvtepi8_epi16(lut_values_high_3);
__m256i lut_values_high_16_4 = _mm256_cvtepi8_epi16(lut_values_high_4);
if(shift == 6){ // highest 2 bits: negate to account for the activation's sign bit
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
// weight each 2-bit pair by its bit position (<< shift) and accumulate into the 16-bit partial sums
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_1, lut_values_low_16_2), shift));
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_3, lut_values_low_16_4), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_1, lut_values_high_16_2), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_3, lut_values_high_16_4), shift));
}
// widen 16-bit partials to 32-bit (_mm256_madd_epi16 with 1) and accumulate into the group
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_low, _mm256_set1_epi16(1)));
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_high, _mm256_set1_epi16(1)));
}
// accumulate group results into global accumulator
accu = _mm256_add_epi32(accu, group_accu);
}
// final horizontal sum of the accumulator
int sumi = hsum_i32_8(accu);
*s = static_cast<float>(sumi);
```