# Evaluate BitNet inference
> 李懿宸

## Mission
* Evaluate Microsoft's BitNet b1.58[[1]](https://github.com/microsoft/BitNet?tab=readme-ov-file) and conduct experiments with an existing implementation[[2]](https://github.com/catid/bitnet_cpu?tab=readme-ov-file).
* During the process, I may incorporate [[3]](https://github.com/kevin-pek/bitnet.c), [[4]](https://github.com/kaizizzzzzz/Bitnet-C-benchmark), or other open-source projects.
* Evaluate the performance of [2] on modern Intel/AMD processors.
* Propose subsequent improvements in performance and functionality.

## Microsoft's BitNet
bitnet.cpp is the official inference framework for 1-bit LLMs. It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on the CPU.

### Environment setup and Build
Since bitnet.cpp requires Visual Studio 2022 to build on Windows, I installed Visual Studio 2022 with the following options:
* Desktop-development with C++
* C++-CMake Tools for Windows
* Git for Windows
* C++-Clang Compiler for Windows
* MS-Build Support for LLVM-Toolset (clang)

1. Clone BitNet.cpp and install the dependencies:
```shell
$ git clone --recursive https://github.com/microsoft/BitNet.git
$ cd BitNet
$ pip install -r requirements.txt
```
2. Build the project using the bitnet_b1_58-large model:
```shell
$ python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large
```

### Reproduce the experiment
1. Run the inference benchmark of the float32 model:
```shell
$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-f32.gguf -t 4

| model               | size     | params   | backend | threads | n_batch | test  | t/s          |
| ------------------- | -------: | -------: | ------- | ------: | ------: | ----: | -----------: |
| bitnet 700M all F32 | 2.72 GiB | 728.84 M | CPU     |       4 |       1 | pp512 | 18.38 ± 0.23 |
| bitnet 700M all F32 | 2.72 GiB | 728.84 M | CPU     |       4 |       1 | tg128 | 19.14 ± 0.35 |
```
2. Run the inference benchmark of the 2-bit-per-weight ternary model (BitNet b1.58):
```shell
$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -t 4

| model                            | size       | params   | backend | threads | n_batch | test  | t/s           |
| -------------------------------- | ---------: | -------: | ------- | ------: | ------: | ----: | ------------: |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU     |       4 |       1 | pp512 | 112.48 ± 0.39 |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU     |       4 |       1 | tg128 | 136.69 ± 1.41 |
```
The benchmark shows that the BitNet b1.58 ternary model runs much faster on the CPU than the original float32 model: roughly 6x faster for prompt processing (pp512) and about 7x faster for token generation (tg128).

## Bitnet-cpu
This repository tests the acceleration of BitNet operations on the CPU using AVX-512 intrinsics, running benchmarks with random weights at the size of the bitnet_b1_58-3B model.

### Environment setup and Build
1. From the Developer PowerShell for VS, clone bitnet_cpu and use CMake to configure the project and generate the necessary build files:
```shell
$ git clone https://github.com/catid/bitnet_cpu.git
$ cd bitnet_cpu
$ mkdir build
$ cd build
$ cmake ..
```
2. Open the .sln file in the ./build/ folder, select the Release configuration, and build.
![image](https://hackmd.io/_uploads/SyHo4siDJe.png)

### Reproduce the experiment
The following results are from my AMD Ryzen 5 7500F Windows PC.

#### 1. With AVX-512 speedup
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu\build> ./tests/Release/math_test
Running non-AVX-512 version
Test case 1 passed!
Test case 2 passed!
Test case 3 passed!
Test case 4 passed!
Running AVX-512 version
Tests passed!
```
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu\build> ./Release/benchmark_model
Preallocating buffers...
Benchmarking model...
Warmup took 25 milliseconds
Benchmark Results:
Number of Layers: 182
Number of Benchmark Iterations: 100
Average Time per Iteration: 24.82 milliseconds
```
This is about 1000 / 24.82 ≈ 40 tokens per second with AVX-512.

---

#### 2. Without AVX-512 speedup
To avoid using AVX-512 instructions, I had to modify the CMakeLists.txt file in two places and rebuild the project.

from:
```cmake
if(MSVC)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX512 /Ox /fp:fast")
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -O3 -funroll-loops")
endif()
```
to:
```cmake
if(MSVC)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2 /Ox /fp:fast")
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -mno-avx512f -O3 -funroll-loops")
endif()
```
After rebuilding the project, these are the results on the same machine.
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu-noAVX\build> ./tests/Release/math_test
Running non-AVX-512 version
Test case 1 passed!
Test case 2 passed!
Test case 3 passed!
Test case 4 passed!
Tests passed!
```
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu-noAVX\build> ./Release/benchmark_model
Preallocating buffers...
Benchmarking model...
Warmup took 57 milliseconds
Benchmark Results:
Number of Layers: 182
Number of Benchmark Iterations: 100
Average Time per Iteration: 56.88 milliseconds
```
Without AVX-512, performance drops to about 1000 / 56.88 ≈ 18 tokens per second.

## Bitnet-C++-benchmark
This repository provides a single-threaded, end-to-end C++ implementation of BitNet b1.58. The implementation avoids complex optimizations for specific CPU architectures, making it straightforward and adaptable for hardware synthesis and FPGA deployment. I use the bitnet_b1_58-large model to run inference.

### Environment setup and Build
1. Clone the repository:
```shell
$ git clone https://github.com/kaizizzzzzz/Bitnet-C-benchmark.git
$ cd Bitnet-C-benchmark
```
2. Download the processed model file:
```shell
$ wget https://huggingface.co/kaixin123/bitnet-1.58-processed/resolve/main/model.bin
```
3. Set up the environment:
```shell
$ source setup_conda_env.sh
```
This repository requires the following dependencies:
* C++17 compiler
* Python 3.10
* PyTorch
* Numpy
* sentencepiece
* transformers
4. Build the project with make:
```shell
$ make
```

### Reproduce the experiment
1. Encode the prompt into token IDs:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> python encode.py --prompt "Cornell University is"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<?, ?B/s]
C:\Users\ethan\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:140: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\ethan\source\repos\Bitnet-C-benchmark\tokenizer_model\models--1bitLLM--bitnet_b1_58-large. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable.
For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator.
In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer.model: 100%|██████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 8.30MB/s]
added_tokens.json: 100%|████████████████████████████████████████████████████████████| 41.0/41.0 [00:00<00:00, 40.6kB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████| 604/604 [00:00<?, ?B/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 2.52MB/s]
```
2. Run the inference:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> ./inference/inference
Encoded_ID: 1 11655 514 3014 338
Prefill Starts: >>>>>>>>>>>>>>
Encoded_ID now: 1 11655 514 3014 338 385
Inference time for 0th token:49s
Decoding Starts: <<<<<<<<<<<<<<<
Encoded_ID now: 1 11655 514 3014 338 385 12666
Inference time for 1th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411
Inference time for 2th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784
Inference time for 3th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024
Inference time for 4th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691
Inference time for 5th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322
Inference time for 6th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372
Inference time for 7th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756
Inference time for 8th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784
Inference time for 9th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784 4024
Inference time for 10th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784 4024 18857
Inference time for 11th token:10s
Total latency: 159s
Inference Speed: 13 seconds / token
```
3. Decode the IDs to see the model's output:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> python decode.py
Cornell University is an institution with many facets and it has many faculty
```

## Using Bitnet-C++-benchmark to evaluate Bitnet-cpu
To evaluate the performance of bitnet_cpu's optimizations, I had to modify Bitnet-C++-benchmark: I copied the matrix computation method used during inference from the bitnet_cpu project into the Bitnet-C-benchmark project, including its AVX-512 and OpenMP multi-threading optimizations.

### Code modifications
I modified three functions in float_kernel.cpp that perform matrix multiplication. One of them, which multiplies two 2D tensors, is shown below.
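Both the original and the modified version operate on a `Tensor2D` type, which I did not change. A minimal sketch of the assumed definition is shown here; it is inferred from how the kernels index the tensors, not copied from the repository:
```cpp
#include <vector>

// Assumed layout (inferred, not taken verbatim from float_kernel.cpp):
// a row-major nested-vector matrix, so tensor[i] is one contiguous
// std::vector<float> row. This matches expressions such as tensor2[j][k]
// and Tensor2D result(seq_len, std::vector<float>(out_dim, 0.0f)).
using Tensor2D = std::vector<std::vector<float>>;
```
The important property is that each row is contiguous in memory, which is what allows the 16-float AVX-512 loads in the modified kernel below.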
The original function uses a triple-loop approach to perform the 2D matrix multiplication:
```cpp=
Tensor2D GEMM_2D_float(const Tensor2D &tensor1, const Tensor2D &tensor2) {
    // Validate dimensions
    size_t seq_len1 = tensor1.size();
    size_t intermediate_dim1 = tensor1[0].size(); // must match the number of rows of tensor2
    size_t intermediate_dim2 = tensor2.size();
    size_t out_dim = tensor2[0].size();
    if (intermediate_dim1 != intermediate_dim2) {
        throw std::runtime_error("intermediate_dim of tensor1 and tensor2 must match.");
    }
    size_t seq_len = seq_len1;
    size_t intermediate_dim = intermediate_dim1;

    // Initialize result tensor with the correct dimensions
    Tensor2D result(seq_len, std::vector<float>(out_dim, 0.0f));

    // Perform matrix multiplication
    for (size_t i = 0; i < seq_len; ++i) {
        for (size_t j = 0; j < out_dim; ++j) {
            for (size_t k = 0; k < intermediate_dim; ++k) {
                result[i][j] += tensor1[i][k] * tensor2[k][j];
            }
        }
    }
    return result;
}
```
I rewrote the multiplication loops with AVX-512 intrinsics and parallelized the outer loop with OpenMP:
```cpp=
#pragma omp parallel for num_threads(16)
for (size_t i = 0; i < seq_len; ++i) {
    for (size_t j = 0; j < intermediate_dim; ++j) {
        // Broadcast tensor1[i][j] to all 16 lanes.
        __m512 ra = _mm512_set1_ps(tensor1[i][j]);
        size_t k = 0;
        // Process 16 output columns at a time.
        for (; k + 15 < out_dim; k += 16) {
            __m512 rb = _mm512_loadu_ps(&tensor2[j][k]);
            rb = _mm512_mul_ps(ra, rb);
            __m512 rc = _mm512_loadu_ps(&result[i][k]);
            rc = _mm512_add_ps(rb, rc);
            _mm512_storeu_ps(&result[i][k], rc);
        }
        // Scalar tail for the columns left over when out_dim is not a multiple of 16.
        for (; k < out_dim; k++) {
            result[i][k] += tensor1[i][j] * tensor2[j][k];
        }
    }
}
```

### Experiment
I tested inference with the same encoded prompt on the same machine. The output is shown below.
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-avx> ./inference/inference
Encoded_ID: 1 11655 514 3014 338
Prefill Starts: >>>>>>>>>>>>>>
Encoded_ID now: 1 11655 514 3014 338 263
Decoding Starts: <<<<<<<<<<<<<<<
Encoded_ID now: 1 11655 514 3014 338 263 970
Inference time for 1th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925
Inference time for 2th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372
Inference time for 3th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297
Inference time for 4th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306
Inference time for 5th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386
Inference time for 6th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989
Inference time for 7th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892
Inference time for 8th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570
Inference time for 9th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570 3088
Inference time for 10th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570 3088 29889
Inference time for 11th token:3s
Total latency: 49s
Inference Speed: 4 seconds / token
```
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-avx> python decode.py
Cornell University is a public research university in Ithaca, New York.
```
The inference speed improved from 13 seconds per token to 4 seconds per token, and the generated text is also reasonable. This demonstrates that the optimizations used in the bitnet_cpu project are highly effective. However, this is still much slower than Microsoft's bitnet.cpp.
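One further tweak worth trying, which I have not benchmarked: AVX-512 provides a fused multiply-add that combines the multiply and the accumulate into a single instruction with a single rounding step. Below is a minimal sketch of the vectorized loop rewritten this way, assuming the same variables and `Tensor2D` layout as above:
```cpp
#pragma omp parallel for num_threads(16)
for (size_t i = 0; i < seq_len; ++i) {
    for (size_t j = 0; j < intermediate_dim; ++j) {
        __m512 ra = _mm512_set1_ps(tensor1[i][j]);
        size_t k = 0;
        for (; k + 15 < out_dim; k += 16) {
            __m512 rb = _mm512_loadu_ps(&tensor2[j][k]);
            __m512 rc = _mm512_loadu_ps(&result[i][k]);
            // rc = ra * rb + rc with one rounding step instead of two.
            rc = _mm512_fmadd_ps(ra, rb, rc);
            _mm512_storeu_ps(&result[i][k], rc);
        }
        // Scalar tail for out_dim not divisible by 16.
        for (; k < out_dim; k++) {
            result[i][k] += tensor1[i][j] * tensor2[j][k];
        }
    }
}
```
Besides removing one instruction from the dependency chain, the single rounding step can slightly reduce the accumulated floating-point error, which relates to the precision issue discussed in the next section.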
## Performance and functionality improvements
Here are several improvements that could be implemented in bitnet_cpu.

1. Improve computational accuracy: While reproducing the bitnet_cpu experiments, I found that the AVX-512 path of math_test sometimes fails due to minor precision differences. These most likely come from floating-point rounding: the vectorized code accumulates partial sums in a different order than the scalar reference, so the results can differ in the last few bits. Accumulating in higher precision, or handling the floating-point arithmetic more carefully, could significantly improve the computational accuracy.
2. Improve usability: Many functions in the project are limited to input sizes that are multiples of 32 or 64, which is inconvenient for users. Extending the code to support arbitrary input sizes, for example with scalar tail loops like the one in the modified GEMM kernel above, would improve flexibility and usability.

## References
* https://github.com/microsoft/BitNet
* https://github.com/catid/bitnet_cpu
* https://github.com/kaizizzzzzz/Bitnet-C-benchmark
* https://www.cnblogs.com/ThousandPine/p/17026221.html