李懿宸
Bitnet.cpp is the official inference framework for 1-bit LLMs. It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU.
Since bitnet.cpp requires Visual Studio 2022 to build and run on Windows, I installed Visual Studio 2022 with the following options.
We can see from the benchmark that a model using the BitNet b1.58 technique runs much faster on CPU than the original FP32-weight model.
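The main reason for this is that BitNet b1.58 constrains every weight to {-1, 0, +1}, so the dominant matrix multiplications need no floating-point multiplies at all, only additions and subtractions. The sketch below illustrates the idea with a hand-written ternary matrix-vector product; it is only an illustration under that assumption, not code from bitnet.cpp, and all names are mine.

```cpp
#include <cstdint>
#include <vector>

// Sketch: dense mat-vec y = W * x with ternary weights w[i][j] in {-1, 0, +1}.
// Because every weight is -1, 0 or +1, each term is just an add, a subtract,
// or a skip -- no floating-point multiply is needed.
std::vector<float> ternary_matvec(const std::vector<int8_t>& w, // rows*cols, values -1/0/+1
                                  const std::vector<float>& x,  // cols
                                  int rows, int cols) {
    std::vector<float> y(rows, 0.0f);
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) {
            int8_t wij = w[i * cols + j];
            if (wij == 1)       acc += x[j];
            else if (wij == -1) acc -= x[j];
            // wij == 0 contributes nothing
        }
        y[i] = acc;
    }
    return y;
}
```

The real kernels pack the ternary weights much more tightly and vectorize the loop, but the multiplication-free structure above is what makes fast CPU inference possible.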
This repository tests the acceleration of BitNet operations on the CPU using AVX-512 intrinsics, running the benchmarks with random weights at the size of the bitnet_b1_58-3B model.
This is on my AMD Ryzen 5 7500F Windows PC.
This is about 1000 / 24.82 ≈ 40 tokens per second with AVX-512.
To avoid using the AVX-512 instruction set, I had to modify the CMakeLists.txt file in two places and rebuild the project.
from:
to:
After rebuilding the project, this is the result on the same machine.
Without AVX-512, performance drops to about 1000 / 56.88 ≈ 18 tokens per second.
This repository provides a single-threaded, end-to-end C++ implementation of BitNet 1.58. The implementation avoids complex optimizations for specific CPU architectures, making it straightforward and adaptable for hardware synthesis and FPGA deployment. I used the bitnet_b1_58-large model to run the inference.
This repository requires the following dependencies:
To evaluate the performance of bitnet-cpu, I had to modify Bitnet-C++-benchmark. I copied the matrix computation method used during inference from the bitnet-cpu project into the Bitnet-C++-benchmark project, including the AVX-512 and OpenMP multi-threading optimizations.
I modified three functions in float_kernel.cpp that are used for matrix multiplication.
Here is one of the functions, which performs matrix multiplication of 2D tensors. The original function uses a triple-loop approach for the 2D matrix multiplication.
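For reference, this kind of triple-loop kernel looks roughly like the sketch below. The signature and names are illustrative only (assuming row-major float matrices stored as flat vectors), not the actual code from float_kernel.cpp.

```cpp
#include <vector>

// Sketch of a naive triple-loop matmul: C[M x N] = A[M x K] * B[K x N],
// with all matrices stored row-major as flat float vectors.
void matmul_naive(const std::vector<float>& A,
                  const std::vector<float>& B,
                  std::vector<float>& C,
                  int M, int K, int N) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];  // one multiply-add per element
            }
            C[i * N + j] = acc;
        }
    }
}
```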
I rewrote the loops of the matrix multiplication using AVX-512 intrinsics and added OpenMP multi-threading.
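The modified kernel follows the pattern sketched below (again illustrative rather than the exact code from the repository, and it assumes row-major layout and a compiler with AVX-512 and OpenMP support): the outer row loop is split across threads with OpenMP, and the column loop is vectorized 16 floats at a time with fused multiply-add intrinsics.

```cpp
#include <immintrin.h>
#include <vector>

// Sketch of the same matmul with OpenMP over rows and AVX-512 over columns.
// Columns left over when N is not a multiple of 16 fall back to scalar code.
void matmul_avx512_omp(const std::vector<float>& A,
                       const std::vector<float>& B,
                       std::vector<float>& C,
                       int M, int K, int N) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < M; ++i) {
        int j = 0;
        // 16 output columns per AVX-512 register
        for (; j + 16 <= N; j += 16) {
            __m512 acc = _mm512_setzero_ps();
            for (int k = 0; k < K; ++k) {
                __m512 a = _mm512_set1_ps(A[i * K + k]);    // broadcast A[i][k]
                __m512 b = _mm512_loadu_ps(&B[k * N + j]);  // 16 floats of row k of B
                acc = _mm512_fmadd_ps(a, b, acc);           // acc += a * b
            }
            _mm512_storeu_ps(&C[i * N + j], acc);
        }
        // scalar tail for the remaining columns
        for (; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}
```

OpenMP distributes the rows across cores, while each iteration of the vectorized loop computes 16 output elements with a single `_mm512_fmadd_ps`; together these two changes account for the speedup reported below.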
I tested the inference with the same encoded tokens on the same machine. The following is the resulting output.
You can see that the inference speed improved from 13 seconds per token to 4 seconds per token, and the output of the inference is also quite good. This demonstrates that the optimization methods used in the bitnet-cpu project are highly effective. However, this is still much slower than Microsoft's bitnet.cpp.
Here are several improvements that could be implemented for bitnet-cpu.
https://github.com/microsoft/BitNet
https://github.com/catid/bitnet_cpu
https://github.com/kaizizzzzzz/Bitnet-C-benchmark
https://www.cnblogs.com/ThousandPine/p/17026221.html