# Evaluate BitNet inference
> 李懿宸

## Mission
* Evaluate Microsoft's BitNet b1.58[[1]](https://github.com/microsoft/BitNet?tab=readme-ov-file) and conduct experiments with an existing implementation[[2]](https://github.com/catid/bitnet_cpu?tab=readme-ov-file).
* During the process, I may incorporate [[3]](https://github.com/kevin-pek/bitnet.c), [[4]](https://github.com/kaizizzzzzz/Bitnet-C-benchmark), or other open-source projects.
* Evaluate the performance of [2] on modern Intel/AMD processors.
* Propose subsequent improvements in performance and functionality.

## Microsoft's BitNet
bitnet.cpp is the official inference framework for 1-bit LLMs. It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on the CPU.

### Environment setup and Build
Since bitnet.cpp requires Visual Studio 2022 to build on Windows, I installed Visual Studio 2022 with the following options:
* Desktop-development with C++
* C++-CMake Tools for Windows
* Git for Windows
* C++-Clang Compiler for Windows
* MS-Build Support for LLVM-Toolset (clang)

1. Clone BitNet.cpp and install the dependencies:
```shell
$ git clone --recursive https://github.com/microsoft/BitNet.git
$ cd BitNet
$ pip install -r requirements.txt
```
2. Build the project using the bitnet_b1_58-large model:
```shell
$ python setup_env.py --hf-repo 1bitLLM/bitnet_b1_58-large
```

### Reproduce the experiment
1. Run the inference benchmark of the float32 model:
```shell
$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-f32.gguf -t 4

| model               | size     | params   | backend | threads | n_batch | test  | t/s          |
| ------------------- | -------: | -------: | ------- | ------: | ------: | ----: | -----------: |
| bitnet 700M all F32 | 2.72 GiB | 728.84 M | CPU     |       4 |       1 | pp512 | 18.38 ± 0.23 |
| bitnet 700M all F32 | 2.72 GiB | 728.84 M | CPU     |       4 |       1 | tg128 | 19.14 ± 0.35 |
```
2. Run the inference benchmark of the 2-bit-per-weight ternary model (BitNet b1.58):
```shell
$ python utils/e2e_benchmark.py -m models/bitnet_b1_58-large/ggml-model-i2_s.gguf -t 4

| model                            | size       | params   | backend | threads | n_batch | test  | t/s           |
| -------------------------------- | ---------: | -------: | ------- | ------: | ------: | ----: | ------------: |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU     |       4 |       1 | pp512 | 112.48 ± 0.39 |
| bitnet 700M I2_S - 2 bpw ternary | 256.56 MiB | 728.84 M | CPU     |       4 |       1 | tg128 | 136.69 ± 1.41 |
```
The benchmark shows that the BitNet b1.58 ternary model runs much faster on the CPU than the original float32 model: roughly 6x faster for prompt processing (pp512) and about 7x faster for token generation (tg128).

## Bitnet-cpu
This repository tests the acceleration of BitNet operations on the CPU using AVX-512 intrinsics, running benchmarks with random weights at the size of the bitnet_b1_58-3B model.

### Environment setup and Build
1. From the Developer PowerShell for VS, clone bitnet_cpu and use CMake to configure the project and generate the necessary build files:
```shell
$ git clone https://github.com/catid/bitnet_cpu.git
$ cd bitnet_cpu
$ mkdir build
$ cd build
$ cmake ..
```
2. Open the .sln file in the ./build/ folder, select the Release configuration, and build.
![image](https://hackmd.io/_uploads/SyHo4siDJe.png)

### Reproduce the experiment
The following results are from my AMD Ryzen 5 7500F Windows PC.

#### 1. With AVX-512 speedup
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu\build> ./tests/Release/math_test
Running non-AVX-512 version
Test case 1 passed!
Test case 2 passed!
Test case 3 passed!
Test case 4 passed!
Running AVX-512 version
Tests passed!
```
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu\build> ./Release/benchmark_model
Preallocating buffers...
Benchmarking model...
Warmup took 25 milliseconds
Benchmark Results:
Number of Layers: 182
Number of Benchmark Iterations: 100
Average Time per Iteration: 24.82 milliseconds
```
This is about 1000 / 24.82 ≈ 40 tokens per second with AVX-512.

---

#### 2. Without AVX-512 speedup
To avoid using AVX-512 instructions, I had to modify the CMakeLists.txt file in two places and rebuild the project.

from:
```cmake
if(MSVC)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX512 /Ox /fp:fast")
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -O3 -funroll-loops")
endif()
```
to:
```cmake
if(MSVC)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /arch:AVX2 /Ox /fp:fast")
else()
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=native -mno-avx512f -O3 -funroll-loops")
endif()
```
After rebuilding the project, these are the results on the same machine.
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu-noAVX\build> ./tests/Release/math_test
Running non-AVX-512 version
Test case 1 passed!
Test case 2 passed!
Test case 3 passed!
Test case 4 passed!
Tests passed!
```
```shell
PS C:\Users\ethan\source\repos\bitnet_cpu-noAVX\build> ./Release/benchmark_model
Preallocating buffers...
Benchmarking model...
Warmup took 57 milliseconds
Benchmark Results:
Number of Layers: 182
Number of Benchmark Iterations: 100
Average Time per Iteration: 56.88 milliseconds
```
Without AVX-512, performance drops to about 1000 / 56.88 ≈ 18 tokens per second.

## Bitnet-C++-benchmark
This repository provides a single-threaded, end-to-end C++ implementation of BitNet b1.58. The implementation avoids complex optimizations for specific CPU architectures, making it straightforward and adaptable for hardware synthesis and FPGA deployment. I use the bitnet_b1_58-large model to run inference.

### Environment setup and Build
1. Clone the repository:
```shell
$ git clone https://github.com/kaizizzzzzz/Bitnet-C-benchmark.git
$ cd Bitnet-C-benchmark
```
2. Download the processed model file:
```shell
$ wget https://huggingface.co/kaixin123/bitnet-1.58-processed/resolve/main/model.bin
```
3. Set up the environment:
```shell
$ source setup_conda_env.sh
```
This repository requires the following dependencies:
* C++17 compiler
* Python 3.10
* PyTorch
* Numpy
* sentencepiece
* transformers
4. Build the project with make:
```shell
$ make
```

### Reproduce the experiment
1. Encode the prompt into token IDs:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> python encode.py --prompt "Cornell University is"
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<?, ?B/s]
C:\Users\ethan\anaconda3\Lib\site-packages\huggingface_hub\file_download.py:140: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\ethan\source\repos\Bitnet-C-benchmark\tokenizer_model\models--1bitLLM--bitnet_b1_58-large. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable.
For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator.
In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
tokenizer.model: 100%|██████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 8.30MB/s]
added_tokens.json: 100%|████████████████████████████████████████████████████████████| 41.0/41.0 [00:00<00:00, 40.6kB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████| 604/604 [00:00<?, ?B/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 2.52MB/s]
```
2. Run the inference:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> ./inference/inference
Encoded_ID: 1 11655 514 3014 338
Prefill Starts: >>>>>>>>>>>>>>
Encoded_ID now: 1 11655 514 3014 338 385
Inference time for 0th token:49s
Decoding Starts: <<<<<<<<<<<<<<<
Encoded_ID now: 1 11655 514 3014 338 385 12666
Inference time for 1th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411
Inference time for 2th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784
Inference time for 3th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024
Inference time for 4th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691
Inference time for 5th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322
Inference time for 6th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372
Inference time for 7th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756
Inference time for 8th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784
Inference time for 9th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784 4024
Inference time for 10th token:10s
Encoded_ID now: 1 11655 514 3014 338 385 12666 411 1784 4024 1691 322 372 756 1784 4024 18857
Inference time for 11th token:10s
Total latency: 159s
Inference Speed: 13 seconds / token
```
3. Decode the IDs to see the model's output:
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-benchmark> python decode.py
Cornell University is an institution with many facets and it has many faculty
```

## Using Bitnet-C++-benchmark to evaluate Bitnet-cpu
To evaluate the performance of bitnet_cpu's optimizations, I had to modify Bitnet-C++-benchmark: I copied the matrix computation method used during inference from the bitnet_cpu project into the Bitnet-C-benchmark project, including its AVX-512 and OpenMP multi-threading optimizations.

### Code modifications
I modified three functions in float_kernel.cpp that perform matrix multiplication. One of them, which multiplies two 2D tensors, is shown below.
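Both the original and the modified version operate on a `Tensor2D` type, which I did not change. A minimal sketch of the assumed definition is shown here; it is inferred from how the kernels index the tensors, not copied from the repository:
```cpp
#include <vector>

// Assumed layout (inferred, not taken verbatim from float_kernel.cpp):
// a row-major nested-vector matrix, so tensor[i] is one contiguous
// std::vector<float> row. This matches expressions such as tensor2[j][k]
// and Tensor2D result(seq_len, std::vector<float>(out_dim, 0.0f)).
using Tensor2D = std::vector<std::vector<float>>;
```
The important property is that each row is contiguous in memory, which is what allows the 16-float AVX-512 loads in the modified kernel below.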
The original function uses a triple-loop approach to perform the 2D matrix multiplication:
```cpp=
Tensor2D GEMM_2D_float(const Tensor2D &tensor1, const Tensor2D &tensor2) {
    // Validate dimensions
    size_t seq_len1 = tensor1.size();
    size_t intermediate_dim1 = tensor1[0].size(); // must match the number of rows of tensor2
    size_t intermediate_dim2 = tensor2.size();
    size_t out_dim = tensor2[0].size();
    if (intermediate_dim1 != intermediate_dim2) {
        throw std::runtime_error("intermediate_dim of tensor1 and tensor2 must match.");
    }
    size_t seq_len = seq_len1;
    size_t intermediate_dim = intermediate_dim1;

    // Initialize result tensor with the correct dimensions
    Tensor2D result(seq_len, std::vector<float>(out_dim, 0.0f));

    // Perform matrix multiplication
    for (size_t i = 0; i < seq_len; ++i) {
        for (size_t j = 0; j < out_dim; ++j) {
            for (size_t k = 0; k < intermediate_dim; ++k) {
                result[i][j] += tensor1[i][k] * tensor2[k][j];
            }
        }
    }
    return result;
}
```
I rewrote the multiplication loops with AVX-512 intrinsics and parallelized the outer loop with OpenMP:
```cpp=
#pragma omp parallel for num_threads(16)
for (size_t i = 0; i < seq_len; ++i) {
    for (size_t j = 0; j < intermediate_dim; ++j) {
        // Broadcast tensor1[i][j] to all 16 lanes.
        __m512 ra = _mm512_set1_ps(tensor1[i][j]);
        size_t k = 0;
        // Process 16 output columns at a time.
        for (; k + 15 < out_dim; k += 16) {
            __m512 rb = _mm512_loadu_ps(&tensor2[j][k]);
            rb = _mm512_mul_ps(ra, rb);
            __m512 rc = _mm512_loadu_ps(&result[i][k]);
            rc = _mm512_add_ps(rb, rc);
            _mm512_storeu_ps(&result[i][k], rc);
        }
        // Scalar tail for the columns left over when out_dim is not a multiple of 16.
        for (; k < out_dim; k++) {
            result[i][k] += tensor1[i][j] * tensor2[j][k];
        }
    }
}
```

### Experiment
I tested inference with the same encoded prompt on the same machine. The output is shown below.
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-avx> ./inference/inference
Encoded_ID: 1 11655 514 3014 338
Prefill Starts: >>>>>>>>>>>>>>
Encoded_ID now: 1 11655 514 3014 338 263
Decoding Starts: <<<<<<<<<<<<<<<
Encoded_ID now: 1 11655 514 3014 338 263 970
Inference time for 1th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925
Inference time for 2th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372
Inference time for 3th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297
Inference time for 4th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306
Inference time for 5th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386
Inference time for 6th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989
Inference time for 7th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892
Inference time for 8th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570
Inference time for 9th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570 3088
Inference time for 10th token:3s
Encoded_ID now: 1 11655 514 3014 338 263 970 5925 16372 297 306 386 11989 29892 1570 3088 29889
Inference time for 11th token:3s
Total latency: 49s
Inference Speed: 4 seconds / token
```
```shell
PS C:\Users\ethan\source\repos\Bitnet-C-avx> python decode.py
Cornell University is a public research university in Ithaca, New York.
```
The inference speed improved from 13 seconds per token to 4 seconds per token, and the generated text is also reasonable. This demonstrates that the optimizations used in the bitnet_cpu project are highly effective. However, this is still much slower than Microsoft's bitnet.cpp.
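One further tweak worth trying, which I have not benchmarked: AVX-512 provides a fused multiply-add that combines the multiply and the accumulate into a single instruction with a single rounding step. Below is a minimal sketch of the vectorized loop rewritten this way, assuming the same variables and `Tensor2D` layout as above:
```cpp
#pragma omp parallel for num_threads(16)
for (size_t i = 0; i < seq_len; ++i) {
    for (size_t j = 0; j < intermediate_dim; ++j) {
        __m512 ra = _mm512_set1_ps(tensor1[i][j]);
        size_t k = 0;
        for (; k + 15 < out_dim; k += 16) {
            __m512 rb = _mm512_loadu_ps(&tensor2[j][k]);
            __m512 rc = _mm512_loadu_ps(&result[i][k]);
            // rc = ra * rb + rc with one rounding step instead of two.
            rc = _mm512_fmadd_ps(ra, rb, rc);
            _mm512_storeu_ps(&result[i][k], rc);
        }
        // Scalar tail for out_dim not divisible by 16.
        for (; k < out_dim; k++) {
            result[i][k] += tensor1[i][j] * tensor2[j][k];
        }
    }
}
```
Besides removing one instruction from the dependency chain, the single rounding step can slightly reduce the accumulated floating-point error, which relates to the precision issue discussed in the next section.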
## Performance and functionality improvements
Here are several improvements that could be implemented in bitnet_cpu.

1. Improve computational accuracy: While reproducing the bitnet_cpu experiments, I found that the AVX-512 path of math_test sometimes fails due to minor precision differences. These most likely come from floating-point rounding: the vectorized code accumulates partial sums in a different order than the scalar reference, so the results can differ in the last few bits. Accumulating in higher precision, or handling the floating-point arithmetic more carefully, could significantly improve the computational accuracy.
2. Improve usability: Many functions in the project are limited to input sizes that are multiples of 32 or 64, which is inconvenient for users. Extending the code to support arbitrary input sizes, for example with scalar tail loops like the one in the modified GEMM kernel above, would improve flexibility and usability.

## References
* https://github.com/microsoft/BitNet
* https://github.com/catid/bitnet_cpu
* https://github.com/kaizizzzzzz/Bitnet-C-benchmark
* https://www.cnblogs.com/ThousandPine/p/17026221.html