# 2025/01/07 - 2025/01/20
[TOC]
## TODO
- [ ] Use the finished components to wire up BitNet inference and compare bitlinear against the other parts
- [x] Collect papers (undergraduate research project)
    * Write the software
    * Sketch the overall picture
    * Cite plenty of literature
    * [Paper on skipping zeros, i.e. handling cases like 11000000](https://ieeexplore.ieee.org/document/7551378)

Next TODO:
- Fix the bug => use 5 lookups per activation: handle the sign bit separately, then cover the remaining 7 bits in 4 lookups
- How does the DeepGEMM implementation differ from bitnet.cpp's?
- BSS is worth a look if interested
## Questions
1. The activation range in bitnet.cpp is [-128, 127]; I am not sure how to handle the sign, since the current approach splits each 8-bit activation into 2-bit pieces for computation [details](https://hackmd.io/7a2GAGE8ROiJWioHnpQpDA#here)
2. Need to measure the share of time spent in bitlinear versus the other operations in BitNet.
bitnet.cpp is only a framework: it provides three different kernels, each with its own optimizations for low-bit weights. It still downloads a pre-trained model from Hugging Face and runs inference on it; only operations such as the dot product used during inference are replaced by the selected kernel's optimized dot-product function. This seems to make it hard to separate the time spent in bitlinear from everything else?
> BitNet ([1bitLLM/bitnet_b1_58-large](https://huggingface.co/1bitLLM/bitnet_b1_58-large))
> Plan: modify the SIMD intrinsics part of bitnet.cpp; once inference works, add timing
> Current timing idea: use two time variables, one accumulating bitlinear time and the other accumulating the time of all other operations
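The two-accumulator timing idea above could be sketched like this; `g_bitlinear_ms`, `g_other_ms`, and `timed` are placeholder names for illustration, not bitnet.cpp symbols:

```cpp
#include <cassert>
#include <chrono>

// Two global accumulators, as described above: one for time spent in the
// bitlinear dot-product kernel, one for every other operation.
static double g_bitlinear_ms = 0.0;
static double g_other_ms = 0.0;

// Run any callable and add its wall-clock time to the given accumulator.
template <typename F>
void timed(double &acc_ms, F &&f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    acc_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In practice the body of the bitlinear kernel would be wrapped in `timed(g_bitlinear_ms, ...)` and the remaining operations in `timed(g_other_ms, ...)`, then both totals printed at the end of inference.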
## Setting
* Environment
* WSL2, Ubuntu 22.04
* Local machine CPU:
* Architecture: x86_64
* CPU(s): 22
* Model used: 1bitLLM/bitnet_b1_58-large
## bitnet.cpp inference with i2_s kernel
* Command
`python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1`
* Result
```bash=
llama_perf_sampler_print: sampling time = 2.39 ms / 68 runs ( 0.04 ms per token, 28463.79 tokens per second)
llama_perf_context_print: load time = 10525.32 ms
llama_perf_context_print: prompt eval time = 150.87 ms / 6 tokens ( 25.14 ms per token, 39.77 tokens per second)
llama_perf_context_print: eval time = 1577.66 ms / 61 runs ( 25.86 ms per token, 38.66 tokens per second)
llama_perf_context_print: total time = 1744.88 ms / 67 tokens
```
* Response
```
What is your name? What is your home town? What's your favourite city? How would you describe the character of your city? What makes it so special? What makes it so difficult to live here? How do you feel about the city? What do you think it is that makes this city so special? How do you feel about it? How do you feel about your city?
Do you live in a small town? Do you live in a big city? Do you like to walk around the city centre? Do you like to take public transport? Do you like the city's people and its culture? Do you like the city'
```
:::spoiler Entire output
### output
```
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3954 (957b59d2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 266 tensors from models/bitnet_b1_58-large/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bitnet
llama_model_loader: - kv 1: general.name str = bitnet_b1_58-large
llama_model_loader: - kv 2: bitnet.block_count u32 = 24
llama_model_loader: - kv 3: bitnet.context_length u32 = 2048
llama_model_loader: - kv 4: bitnet.embedding_length u32 = 1536
llama_model_loader: - kv 5: bitnet.feed_forward_length u32 = 4096
llama_model_loader: - kv 6: bitnet.attention.head_count u32 = 16
llama_model_loader: - kv 7: bitnet.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: bitnet.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: bitnet.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: bitnet.vocab_size u32 = 32002
llama_model_loader: - kv 12: bitnet.rope.scaling.type str = linear
llama_model_loader: - kv 13: bitnet.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i2_s: 168 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bitnet
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1536
llm_load_print_meta: n_embd_v_gqa = 1536
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4096
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 700M
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 728.84 M
llm_load_print_meta: model size = 256.56 MiB (2.95 BPW)
llm_load_print_meta: general.name = bitnet_b1_58-large
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '</line>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 256.56 MiB
.................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 5.00 MiB
llama_new_context_with_model: graph nodes = 870
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 1
system_info: n_threads = 1 (n_threads_batch = 1) / 22 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 3163594951
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
What is your name? What is your nationality? What is your marital status? Where are you currently located? Are you a first-time visitor to the site or do you have a recent visit?
You can also check out our contact page if you would like to speak to us by phone or e-mail. [end of text]
llama_perf_sampler_print: sampling time = 2.39 ms / 68 runs ( 0.04 ms per token, 28463.79 tokens per second)
llama_perf_context_print: load time = 10525.32 ms
llama_perf_context_print: prompt eval time = 150.87 ms / 6 tokens ( 25.14 ms per token, 39.77 tokens per second)
llama_perf_context_print: eval time = 1577.66 ms / 61 runs ( 25.86 ms per token, 38.66 tokens per second)
llama_perf_context_print: total time = 1744.88 ms / 67 tokens
```
:::
* Run e2e benchmark
```bash=
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -n 200 -p 256 -t 4
| model | size | params | backend | threads | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU | 4 | 1 | pp256 | 35.15 ± 2.96 |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU | 4 | 1 | tg200 | 34.10 ± 1.76 |
build: 957b59d2 (3954)
```
## Steps to set up the environment
```bash=
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
sudo apt install cmake
python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large -q i2_s
```
## bitnet.cpp dot products of W2A8 with AVX2
* `xq8_0`, `xq8_1`, `xq8_2`, `xq8_3` (weights): each holds 32 8-bit values, of which only the lowest 2 bits are valid
* `yq8_0`, `yq8_1`, `yq8_2`, `yq8_3` (inputs): each holds 32 8-bit values
```c++
void ggml_vec_dot_i2_i8_s(int n, float * s, size_t bs, const void * vx, size_t bx, const void * vy, size_t by, int nrc) {
const uint8_t * x = (uint8_t *)vx;
const int8_t * y = (int8_t *)vy;
const int nb = n / QK_I2_S;
const int group32_num = nb / 32;
const int la_num = nb % 32;
const int groupla_num = nb % 32 != 0 ? 1 : 0;
#if defined(__AVX2__)
__m256i mask = _mm256_set1_epi8(0x03);
__m256i accu = _mm256_setzero_si256();
for (int i=0; i < group32_num; i++){
__m256i accu32 = _mm256_setzero_si256();
for (int j=0; j < 32; j++) {
// 128 index
__m256i xq8_3 = _mm256_loadu_si256((const __m256i*)(x + i * 32 * 32 + j * 32));
__m256i xq8_2 = _mm256_srli_epi16(xq8_3, 2);
__m256i xq8_1 = _mm256_srli_epi16(xq8_3, 4);
__m256i xq8_0 = _mm256_srli_epi16(xq8_3, 6);
// each 32 index
xq8_3 = _mm256_and_si256(xq8_3, mask);
xq8_2 = _mm256_and_si256(xq8_2, mask);
xq8_1 = _mm256_and_si256(xq8_1, mask);
xq8_0 = _mm256_and_si256(xq8_0, mask);
// each 32 index
__m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 0)); // in bytes, so +32 is actually +256 bits => 32 8-bit values
__m256i yq8_1 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 32));
__m256i yq8_2 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 64));
__m256i yq8_3 = _mm256_loadu_si256((const __m256i*)(y + i * 128 * 32 + j * 128 + 96));
        // accumulate over the 128 indices of this iteration
        // split into 32 accumulator lanes
        // each lane accumulates 4 products per 128-index iteration
        // each product is at most 256
        // so each lane gains at most 4 * 256 per iteration
        // a lane can accumulate up to 127 * 256 in total
        // hence every group of 32 iterations (128 indices each) must be widened to int32
xq8_0 = _mm256_maddubs_epi16(xq8_0, yq8_0);
xq8_1 = _mm256_maddubs_epi16(xq8_1, yq8_1);
xq8_2 = _mm256_maddubs_epi16(xq8_2, yq8_2);
xq8_3 = _mm256_maddubs_epi16(xq8_3, yq8_3);
accu32 = _mm256_add_epi16(accu32, _mm256_add_epi16(xq8_0, xq8_1));
accu32 = _mm256_add_epi16(accu32, _mm256_add_epi16(xq8_2, xq8_3));
}
accu = _mm256_add_epi32(_mm256_madd_epi16(accu32, _mm256_set1_epi16(1)), accu);
}
for (int i = 0; i < groupla_num; i++){
__m256i accula = _mm256_setzero_si256();
for (int j = 0; j < la_num; j++) {
// 128 index
__m256i xq8_3 = _mm256_loadu_si256((const __m256i*)(x + group32_num * 32 * 32 + j * 32));
__m256i xq8_2 = _mm256_srli_epi16(xq8_3, 2);
__m256i xq8_1 = _mm256_srli_epi16(xq8_3, 4);
__m256i xq8_0 = _mm256_srli_epi16(xq8_3, 6);
// each 32 index
xq8_3 = _mm256_and_si256(xq8_3, mask);
xq8_2 = _mm256_and_si256(xq8_2, mask);
xq8_1 = _mm256_and_si256(xq8_1, mask);
xq8_0 = _mm256_and_si256(xq8_0, mask);
// each 32 index
__m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 0));
__m256i yq8_1 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 32));
__m256i yq8_2 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 64));
__m256i yq8_3 = _mm256_loadu_si256((const __m256i*)(y + group32_num * 128 * 32 + j * 128 + 96));
        // accumulate over the 128 indices of this iteration
        // split into 32 accumulator lanes
        // each lane accumulates 4 products per 128-index iteration
        // each product is at most 256
        // so each lane gains at most 4 * 256 per iteration
        // a lane can accumulate up to 127 * 256 in total
        // hence every group of 32 iterations (128 indices each) must be widened to int32
xq8_0 = _mm256_maddubs_epi16(xq8_0, yq8_0);
xq8_1 = _mm256_maddubs_epi16(xq8_1, yq8_1);
xq8_2 = _mm256_maddubs_epi16(xq8_2, yq8_2);
xq8_3 = _mm256_maddubs_epi16(xq8_3, yq8_3);
accula = _mm256_add_epi16(accula, _mm256_add_epi16(xq8_0, xq8_1));
accula = _mm256_add_epi16(accula, _mm256_add_epi16(xq8_2, xq8_3));
}
accu = _mm256_add_epi32(accu, _mm256_madd_epi16(accula, _mm256_set1_epi16(1)));
}
    int sumi = hsum_i32_8(accu);
    *s = (float)sumi;
#endif
}
```
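For reference, the AVX2 loop above corresponds to the following scalar computation. This is a sketch assuming the layout implied by the pointer arithmetic: byte `k` of each 32-byte weight block holds, from its high bits down, the 2-bit codes for activation positions `k`, `32+k`, `64+k`, `96+k`; the weight codes are used as unsigned 0..3 exactly as `_mm256_maddubs_epi16` does, with any ternary offset handled elsewhere:

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the inner computation of ggml_vec_dot_i2_i8_s
// (layout reconstructed from the offsets above; n a multiple of 128).
int32_t dot_i2_i8_scalar(const uint8_t *x, const int8_t *y, int n) {
    int32_t acc = 0;
    for (int blk = 0; blk < n / 128; ++blk) {
        const uint8_t *xb = x + blk * 32;   // 32 bytes = 128 packed 2-bit weights
        const int8_t  *yb = y + blk * 128;  // 128 8-bit activations
        for (int k = 0; k < 32; ++k) {
            for (int plane = 0; plane < 4; ++plane) {
                // plane 0 uses bits 7:6 (xq8_0), plane 3 uses bits 1:0 (xq8_3)
                int w = (xb[k] >> (6 - 2 * plane)) & 0x03;
                acc += w * yb[plane * 32 + k];
            }
        }
    }
    return acc;
}
```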
## Modified ggml_vec_dot_i2_i8_s
* Inference output is still wrong (there is still a bug)
* `x`: unsigned weight indices
    * indices 0,1,2 -> values -1,0,1
* `y`: **signed** 8-bit numbers
> When `shift == 6` (the highest 2 bits of the 8-bit number), add the **negated** LUT result
* Verified the ranges of the weight indices and activations:
    * activations: -128 to 127
    * weights: indices 0,1,2 -> -1,0,1
> Suspected cause: an overflow or a wrong sign extension, but I can't debug it QwQ
#### here
* Method for handling the activation range [-128,127]:
    * When I wrote the DeepGEMM multiplication last time, the activation range was [0, 255], but in bitnet.cpp it is [-128, 127]. So during accumulation, the LUT result for the highest 2 bits of each 8-bit activation is added with a **negative** sign.
    * For example, $-7_{dec}=11111001_{bin}=-2^7+2^6+2^5+2^4+2^3+2^0$
    * *A bug I noticed but haven't fixed yet*: the highest 2 bits must not be negated together; only the top bit carries a negative weight. I don't yet know how to handle a negative range, because the 2-bit products are precomputed in the LUT, so the highest and second-highest activation bits cannot be processed separately?
```c++
if(shift == 6){
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
```
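The suspected bug can be confirmed with a small scalar check: negating the whole top 2-bit group computes $-(2b_7+b_6)\cdot 64$ instead of the correct $-b_7\cdot 128 + b_6\cdot 64$, so the result is off by $128\cdot b_6$ whenever bit 6 is set. Splitting the sign bit out (one lookup for bit 7, plus four lookups for the remaining 7 bits) reconstructs every value. Both functions below are illustrative sketches, not kernel code:

```cpp
#include <cassert>
#include <cstdint>

// Current kernel behaviour: sum the 2-bit groups, negating the whole top group.
int decode_negate_top_group(uint8_t u) {
    int acc = 0;
    for (int shift = 0; shift < 8; shift += 2) {
        int term = ((u >> shift) & 0x03) << shift;
        acc += (shift == 6) ? -term : term;
    }
    return acc;
}

// Proposed fix: only bit 7 is negative; the remaining 7 bits are positive.
// 5 lookups total: sign bit, bit 6 alone, and three 2-bit groups.
int decode_split_sign_bit(uint8_t u) {
    int acc = -((u >> 7) & 1) * 128;            // sign bit, one lookup
    acc += ((u >> 6) & 1) ? 64 : 0;             // bit 6 alone
    for (int shift = 0; shift < 6; shift += 2)
        acc += ((u >> shift) & 0x03) << shift;  // three 2-bit groups
    return acc;
}
```

Exhaustively checking all 256 byte values shows the sign-split decode always recovers the signed value, while negating the whole top group is wrong for exactly the 128 values with bit 6 set.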
* **Slower** than the baseline
    * Baseline: SIMD bitnet.cpp (i2_s kernel)
        * Also packs the ternary weights into 2 bits and unpacks them during inference
        * Performs the matrix multiplication in the vanilla multiply-then-add manner
        * **Also uses SIMD instructions for acceleration**
* Command
`python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1`
* Result
```
llama_perf_sampler_print: sampling time = 4.84 ms / 134 runs ( 0.04 ms per token, 27691.67 tokens per second)
llama_perf_context_print: load time = 9613.05 ms
llama_perf_context_print: prompt eval time = 533.84 ms / 6 tokens ( 88.97 ms per token, 11.24 tokens per second)
llama_perf_context_print: eval time = 13220.86 ms / 127 runs ( 104.10 ms per token, 9.61 tokens per second)
llama_perf_context_print: total time = 13780.34 ms / 133 tokens
```
* Prompt
`What is your name?`
* Response
```!
What is your name?tree math poi poihp poi poidetail poi poi poihp poi poi poihbar poihpfol poi poi poihbar honour poi Contexthp poivos poi poifolvos poi poi poiher Desp poi Synhphphp Math poi honour Syn honour poihp poihp poihp poi poivos honourfol poihphp honour Desphp poi poihpwaldwald poi poiwaldhp poifol Context poi poifol poivol poi poidetail poi Syn poihpwehrfolvoshp poihp poi poi poi poihpfol Trib poi poi poi poi poi poi poi Syn poiwaldvos poi poi poi poihphp poihp poihp poihpfolhpfol
```
:::spoiler Entire Output
(bitnet-cpp) emily@MSI:/mnt/d/NYCU/CAS Lab/low-bit multiplication/BitNet$ python run_inference.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -p "What is your name?" -t 1
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3954 (957b59d2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 266 tensors from models/bitnet_b1_58-large/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bitnet
llama_model_loader: - kv 1: general.name str = bitnet_b1_58-large
llama_model_loader: - kv 2: bitnet.block_count u32 = 24
llama_model_loader: - kv 3: bitnet.context_length u32 = 2048
llama_model_loader: - kv 4: bitnet.embedding_length u32 = 1536
llama_model_loader: - kv 5: bitnet.feed_forward_length u32 = 4096
llama_model_loader: - kv 6: bitnet.attention.head_count u32 = 16
llama_model_loader: - kv 7: bitnet.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: bitnet.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 9: bitnet.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: bitnet.vocab_size u32 = 32002
llama_model_loader: - kv 12: bitnet.rope.scaling.type str = linear
llama_model_loader: - kv 13: bitnet.rope.scaling.factor f32 = 1.000000
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type i2_s: 168 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 5
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = bitnet
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32002
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 1536
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 1536
llm_load_print_meta: n_embd_v_gqa = 1536
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 4096
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 700M
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 728.84 M
llm_load_print_meta: model size = 256.56 MiB (2.95 BPW)
llm_load_print_meta: general.name = bitnet_b1_58-large
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '</line>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.12 MiB
llm_load_tensors: CPU buffer size = 256.56 MiB
.................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 288.00 MiB
llama_new_context_with_model: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 5.00 MiB
llama_new_context_with_model: graph nodes = 870
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 1
system_info: n_threads = 1 (n_threads_batch = 1) / 22 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 135211052
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
What is your name?tree math poi poihp poi poidetail poi poi poihp poi poi poihbar poihpfol poi poi poihbar honour poi Contexthp poivos poi poifolvos poi poi poiher Desp poi Synhphphp Math poi honour Syn honour poihp poihp poihp poi poivos honourfol poihphp honour Desphp poi poihpwaldwald poi poiwaldhp poifol Context poi poifol poivol poi poidetail poi Syn poihpwehrfolvoshp poihp poi poi poi poihpfol Trib poi poi poi poi poi poi poi Syn poiwaldvos poi poi poi poihphp poihp poihp poihpfolhpfol
llama_perf_sampler_print: sampling time = 4.84 ms / 134 runs ( 0.04 ms per token, 27691.67 tokens per second)
llama_perf_context_print: load time = 9613.05 ms
llama_perf_context_print: prompt eval time = 533.84 ms / 6 tokens ( 88.97 ms per token, 11.24 tokens per second)
llama_perf_context_print: eval time = 13220.86 ms / 127 runs ( 104.10 ms per token, 9.61 tokens per second)
llama_perf_context_print: total time = 13780.34 ms / 133 tokens
:::
```c++=
alignas(32) int8_t lut[LUT_SIZE] = {0,-1,-2,-3,0,0,0,0,0,1,2,3,0,2,4,6};
void ggml_vec_dot_i2_i8_s(int n, float * s, size_t bs, const void * vx, size_t bx, const void * vy, size_t by, int nrc) {
const uint8_t * x = (uint8_t *)vx;
const int8_t * y = (int8_t *)vy;
const int nb = n / QK_I2_S;
const int group32_num = nb / 32;
const int la_num = nb % 32;
const int groupla_num = nb % 32 != 0 ? 1 : 0;
#if defined(__AVX2__)
// LUT initialization
// alignas(32) int8_t lut[LUT_SIZE] = {0};
// const int8_t predefined_weights[4] = {-1, 0, 1, 2}; // index only 0,1,2 -> values -1,0,1
// const int8_t predefined_activations[4] = {0, 1, 2, 3};
// generateLUT(lut, predefined_weights, predefined_activations);
// load LUT into AVX2 register and prepare it for 256-bit shuffle
__m256i lut_vec = _mm256_load_si256(reinterpret_cast<const __m256i *>(lut));
lut_vec = _mm256_permute2x128_si256(lut_vec, lut_vec, 0x00);
// accumulator for final result
__m256i accu = _mm256_setzero_si256();
// process full groups of 32 blocks
for (int i = 0; i < group32_num; i++) {
__m256i group_accu = _mm256_setzero_si256();
        for (int j = 0; j < 32; j++) { // each group contains 32 packed blocks
const uint8_t *x_base = x + i * 32 * 32 + j * 32; // from original code
const uint8_t *y_base = y + i * 128 * 32 + j * 128; // from original code
// load activations and weights
// __m256i yq8_0 = _mm256_loadu_si256((const __m256i*)(y_base + 0));
__m256i act_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 0)); // 32 activations
__m256i act_vec_2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 32)); // 32 activations
__m256i act_vec_3 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 64)); // 32 activations
__m256i act_vec_4 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 96)); // 32 activations
__m256i wt_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x_base));
__m256i wt_vec_2 = _mm256_srli_epi16(wt_vec_1, 2);
__m256i wt_vec_3 = _mm256_srli_epi16(wt_vec_1, 4);
__m256i wt_vec_4 = _mm256_srli_epi16(wt_vec_1, 6);
// accumulator for partial results of this block
__m256i block_accu = _mm256_setzero_si256();
__m256i block_sub_result_low = _mm256_setzero_si256(); // added, because the multiplication result is 16 bits and we have 32 results in each group
__m256i block_sub_result_high = _mm256_setzero_si256(); // added
__m256i wt_index_1 = _mm256_and_si256(wt_vec_1, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_2 = _mm256_and_si256(wt_vec_2, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_3 = _mm256_and_si256(wt_vec_3, _mm256_set1_epi8(0x03)); // 32 weights
__m256i wt_index_4 = _mm256_and_si256(wt_vec_4, _mm256_set1_epi8(0x03)); // 32 weights
// compute using LUT for all 4 shifts (2-bit processing)
for (int shift = 0; shift < 8; shift += 2) { // divide 8-bit activations into 4 2-bit sub-values to calculate
// extract indices for activations and weights
__m256i act_index_1 = _mm256_and_si256(_mm256_srli_epi16(act_vec_1, shift), _mm256_set1_epi8(0x03));
__m256i act_index_2 = _mm256_and_si256(_mm256_srli_epi16(act_vec_2, shift), _mm256_set1_epi8(0x03));
__m256i act_index_3 = _mm256_and_si256(_mm256_srli_epi16(act_vec_3, shift), _mm256_set1_epi8(0x03));
__m256i act_index_4 = _mm256_and_si256(_mm256_srli_epi16(act_vec_4, shift), _mm256_set1_epi8(0x03));
// combine indices for LUT lookup
__m256i combined_index_1 = _mm256_or_si256(act_index_1, _mm256_slli_epi16(wt_index_1, 2));
__m256i combined_index_2 = _mm256_or_si256(act_index_2, _mm256_slli_epi16(wt_index_2, 2));
__m256i combined_index_3 = _mm256_or_si256(act_index_3, _mm256_slli_epi16(wt_index_3, 2));
__m256i combined_index_4 = _mm256_or_si256(act_index_4, _mm256_slli_epi16(wt_index_4, 2));
// LUT lookup
__m256i lut_values_1 = _mm256_shuffle_epi8(lut_vec, combined_index_1);
__m256i lut_values_2 = _mm256_shuffle_epi8(lut_vec, combined_index_2);
__m256i lut_values_3 = _mm256_shuffle_epi8(lut_vec, combined_index_3);
__m256i lut_values_4 = _mm256_shuffle_epi8(lut_vec, combined_index_4);
                // split into low and high 128-bit halves (the 8-bit values must be widened to 16-bit before accumulation)
                __m128i lut_values_low_1 = _mm256_castsi256_si128(lut_values_1); // low 128 bits (low lane)
__m128i lut_values_low_2 = _mm256_castsi256_si128(lut_values_2);
__m128i lut_values_low_3 = _mm256_castsi256_si128(lut_values_3);
__m128i lut_values_low_4 = _mm256_castsi256_si128(lut_values_4);
                __m128i lut_values_high_1 = _mm256_extracti128_si256(lut_values_1, 1); // high 128 bits (high lane)
__m128i lut_values_high_2 = _mm256_extracti128_si256(lut_values_2, 1);
__m128i lut_values_high_3 = _mm256_extracti128_si256(lut_values_3, 1);
__m128i lut_values_high_4 = _mm256_extracti128_si256(lut_values_4, 1);
// extend 8-bit to 16-bit
__m256i lut_values_low_16_1 = _mm256_cvtepi8_epi16(lut_values_low_1);
__m256i lut_values_low_16_2 = _mm256_cvtepi8_epi16(lut_values_low_2);
__m256i lut_values_low_16_3 = _mm256_cvtepi8_epi16(lut_values_low_3);
__m256i lut_values_low_16_4 = _mm256_cvtepi8_epi16(lut_values_low_4);
__m256i lut_values_high_16_1 = _mm256_cvtepi8_epi16(lut_values_high_1);
__m256i lut_values_high_16_2 = _mm256_cvtepi8_epi16(lut_values_high_2);
__m256i lut_values_high_16_3 = _mm256_cvtepi8_epi16(lut_values_high_3);
__m256i lut_values_high_16_4 = _mm256_cvtepi8_epi16(lut_values_high_4);
if(shift == 6){ // highest 2 bits
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
// Add extended result to the corresponding accumulator
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_1, lut_values_low_16_2), shift));
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_3, lut_values_low_16_4), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_1, lut_values_high_16_2), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_3, lut_values_high_16_4), shift));
}
// widen 16-bit partials to 32-bit (_mm256_madd_epi16 with 1) and accumulate into the group
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_low, _mm256_set1_epi16(1)));
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_high, _mm256_set1_epi16(1)));
}
// accumulate group results into global accumulator
accu = _mm256_add_epi32(accu, group_accu);
}
// remaining blocks (less than 32)
if (la_num > 0) {
__m256i group_accu = _mm256_setzero_si256();
for (int j = 0; j < la_num; j++) {
const uint8_t *x_base = x + group32_num * 32 * 32 + j * 32;
const uint8_t *y_base = y + group32_num * 128 * 32 + j * 128;
// load activations and weights
__m256i act_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 0));
__m256i act_vec_2 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 32));
__m256i act_vec_3 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 64));
__m256i act_vec_4 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y_base + 96));
__m256i wt_vec_1 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x_base));
__m256i wt_vec_2 = _mm256_srli_epi16(wt_vec_1, 2);
__m256i wt_vec_3 = _mm256_srli_epi16(wt_vec_1, 4);
__m256i wt_vec_4 = _mm256_srli_epi16(wt_vec_1, 6);
__m256i wt_index_1 = _mm256_and_si256(wt_vec_1, _mm256_set1_epi8(0x03));
__m256i wt_index_2 = _mm256_and_si256(wt_vec_2, _mm256_set1_epi8(0x03));
__m256i wt_index_3 = _mm256_and_si256(wt_vec_3, _mm256_set1_epi8(0x03));
__m256i wt_index_4 = _mm256_and_si256(wt_vec_4, _mm256_set1_epi8(0x03));
// accumulators for 16-bit partial results of this block (low/high lanes)
__m256i block_sub_result_low = _mm256_setzero_si256();
__m256i block_sub_result_high = _mm256_setzero_si256();
// compute using LUT for all 4 shifts (2-bit processing)
for (int shift = 0; shift < 8; shift += 2) {
// extract indices for activations and weights
__m256i act_index_1 = _mm256_and_si256(_mm256_srli_epi16(act_vec_1, shift), _mm256_set1_epi8(0x03));
__m256i act_index_2 = _mm256_and_si256(_mm256_srli_epi16(act_vec_2, shift), _mm256_set1_epi8(0x03));
__m256i act_index_3 = _mm256_and_si256(_mm256_srli_epi16(act_vec_3, shift), _mm256_set1_epi8(0x03));
__m256i act_index_4 = _mm256_and_si256(_mm256_srli_epi16(act_vec_4, shift), _mm256_set1_epi8(0x03));
// combine indices for LUT lookup
__m256i combined_index_1 = _mm256_or_si256(act_index_1, _mm256_slli_epi16(wt_index_1, 2));
__m256i combined_index_2 = _mm256_or_si256(act_index_2, _mm256_slli_epi16(wt_index_2, 2));
__m256i combined_index_3 = _mm256_or_si256(act_index_3, _mm256_slli_epi16(wt_index_3, 2));
__m256i combined_index_4 = _mm256_or_si256(act_index_4, _mm256_slli_epi16(wt_index_4, 2));
// LUT lookup
__m256i lut_values_1 = _mm256_shuffle_epi8(lut_vec, combined_index_1);
__m256i lut_values_2 = _mm256_shuffle_epi8(lut_vec, combined_index_2);
__m256i lut_values_3 = _mm256_shuffle_epi8(lut_vec, combined_index_3);
__m256i lut_values_4 = _mm256_shuffle_epi8(lut_vec, combined_index_4);
// split each LUT result into its low and high 128-bit halves
__m128i lut_values_low_1 = _mm256_castsi256_si128(lut_values_1); // lower 128 bits (low lane)
__m128i lut_values_low_2 = _mm256_castsi256_si128(lut_values_2);
__m128i lut_values_low_3 = _mm256_castsi256_si128(lut_values_3);
__m128i lut_values_low_4 = _mm256_castsi256_si128(lut_values_4);
__m128i lut_values_high_1 = _mm256_extracti128_si256(lut_values_1, 1); // upper 128 bits (high lane)
__m128i lut_values_high_2 = _mm256_extracti128_si256(lut_values_2, 1);
__m128i lut_values_high_3 = _mm256_extracti128_si256(lut_values_3, 1);
__m128i lut_values_high_4 = _mm256_extracti128_si256(lut_values_4, 1);
// sign-extend 8-bit LUT values to 16-bit
__m256i lut_values_low_16_1 = _mm256_cvtepi8_epi16(lut_values_low_1);
__m256i lut_values_low_16_2 = _mm256_cvtepi8_epi16(lut_values_low_2);
__m256i lut_values_low_16_3 = _mm256_cvtepi8_epi16(lut_values_low_3);
__m256i lut_values_low_16_4 = _mm256_cvtepi8_epi16(lut_values_low_4);
__m256i lut_values_high_16_1 = _mm256_cvtepi8_epi16(lut_values_high_1);
__m256i lut_values_high_16_2 = _mm256_cvtepi8_epi16(lut_values_high_2);
__m256i lut_values_high_16_3 = _mm256_cvtepi8_epi16(lut_values_high_3);
__m256i lut_values_high_16_4 = _mm256_cvtepi8_epi16(lut_values_high_4);
if(shift == 6){ // highest 2 bits: negate to account for the activation's sign bit
lut_values_low_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_1);
lut_values_low_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_2);
lut_values_low_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_3);
lut_values_low_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_low_16_4);
lut_values_high_16_1 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_1);
lut_values_high_16_2 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_2);
lut_values_high_16_3 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_3);
lut_values_high_16_4 = _mm256_sub_epi16(_mm256_setzero_si256(), lut_values_high_16_4);
}
// weight each 2-bit pair by its bit position (<< shift) and accumulate into the 16-bit partial sums
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_1, lut_values_low_16_2), shift));
block_sub_result_low = _mm256_add_epi16(block_sub_result_low, _mm256_slli_epi16(_mm256_add_epi16(lut_values_low_16_3, lut_values_low_16_4), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_1, lut_values_high_16_2), shift));
block_sub_result_high = _mm256_add_epi16(block_sub_result_high, _mm256_slli_epi16(_mm256_add_epi16(lut_values_high_16_3, lut_values_high_16_4), shift));
}
// widen 16-bit partials to 32-bit (_mm256_madd_epi16 with 1) and accumulate into the group
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_low, _mm256_set1_epi16(1)));
group_accu = _mm256_add_epi32(group_accu, _mm256_madd_epi16(block_sub_result_high, _mm256_set1_epi16(1)));
}
// accumulate group results into global accumulator
accu = _mm256_add_epi32(accu, group_accu);
}
// final horizontal sum of the accumulator
int sumi = hsum_i32_8(accu);
*s = static_cast<float>(sumi);
```