# Oral Discussion Report: BitNet 1.58-bit LLM and RVV Hardware Acceleration

**Date:** January 6, 2026
**Topic:** Computational Simplification of LLMs via Ternary Weights and RISC-V Vector Extensions
**Name:** 陳彥伯
**Student ID:** P76141144

---

## 1. Technical Breakthrough: BitNet (1.58-bit LLM)

Traditional Large Language Models (LLMs) rely on high-precision floating-point formats (e.g., FP16 or BF16), which demand significant memory bandwidth and power during inference. **BitNet b1.58** represents a paradigm shift toward extreme quantization by using **ternary weights**.

### Core Principle: Ternary Weights

BitNet restricts model weights to the set $\{-1, 0, 1\}$. This design fundamentally alters the underlying mathematical operations:

```
Standard GEMV:              BitNet Ternary:
┌───────────────────┐       ┌───────────────────┐
│ y += w[i] * x[i]  │       │ if w[i] == +1:    │
│ (multiply)        │   →   │   y += x[i]       │
│                   │       │ if w[i] == -1:    │
│                   │       │   y -= x[i]       │
└───────────────────┘       └───────────────────┘
HW: Multiplier              HW: Mux + Adder
```

* **Mathematical Transformation**: The standard Matrix-Vector Multiplication ($y = \sum w_i \times x_i$) is simplified into **Accumulate (ACC) operations**.
* **Hardware Efficiency**: During inference, the need for expensive floating-point **Multipliers** is eliminated. Instead, the hardware uses a **Multiplexer (Mux)** to select the sign of the input and an **Adder** to update the accumulator.

---

## 2. Training Mechanism: Quantization-Aware Training (QAT)

Ternary weights are produced through a specialized training process rather than direct initialization (a small sketch of the quantization step follows this list):

* **Shadow Weights**: The system maintains high-precision **FP32 "shadow weights"** during training to accurately accumulate small gradient updates.
* **Quantization Process**: High-precision weights are mapped to $\{-1, 0, 1\}$ using a scaling factor ($\beta$) and a rounding function.
* **Straight-Through Estimator (STE)**: Since quantization is non-differentiable, STE is used during backpropagation to allow gradients to bypass the quantization function and update the shadow weights.
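As a concrete but illustrative version of the quantization step described above, the C sketch below maps one row of FP32 shadow weights to $\{-1, 0, 1\}$. The choice of $\beta$ as the mean absolute value of the row, the epsilon constant, and the function name `ternarize_row` are assumptions made for this sketch rather than details taken from the report.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize one row of FP32 shadow weights to the ternary set {-1, 0, +1}.
 * beta is taken here as the mean absolute value of the row; each weight is
 * divided by beta, rounded to the nearest integer, and clipped to [-1, 1].
 * The FP32 shadow weights themselves are not modified: during QAT they keep
 * receiving gradient updates through the straight-through estimator. */
void ternarize_row(const float *w, int8_t *w_q, size_t n)
{
    /* Scaling factor beta = mean(|w|), with a small epsilon so an
     * all-zero row does not cause a division by zero. */
    float beta = 0.0f;
    for (size_t i = 0; i < n; i++)
        beta += fabsf(w[i]);
    beta = beta / (float)n + 1e-6f;

    for (size_t i = 0; i < n; i++) {
        float r = roundf(w[i] / beta);   /* round to nearest integer  */
        if (r > 1.0f)  r = 1.0f;         /* clip into the ternary set */
        if (r < -1.0f) r = -1.0f;
        w_q[i] = (int8_t)r;
    }
}
```

Only the quantized copy `w_q` is consumed by the inference kernel; gradients continue to flow into the FP32 array, which is what allows the small updates mentioned above to accumulate over training.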
---

## 3. Implementation Strategy: RISC-V Vector Extension (RVV)

To maximize the performance of BitNet on RISC-V architectures, the RISC-V Vector Extension (**RVV**) is used for vectorized execution.

### Numerical Representation (Weight Encoding)

To optimize storage and processing speed, weights are encoded using **2 bits per weight**:

| Encoding (Enc) | Value | Hardware Operation |
| -------------- | ----- | ------------------ |
| `00` | 0 | No contribution (skip) |
| `01` | +1 | Add activation |
| `10` | -1 | Subtract activation |
| `11` | 0 | Reserved (treated as 0) |

### Branchless Vectorization

Traditional `if-else` logic causes branch mispredictions and pipeline stalls in processors. RVV solves this by enabling a **branchless ternary MAC**:

* **Vector Merge/Mask**: RVV uses mask operations to decide which activations to add or subtract in a single parallel step.
* **VLA (Vector Length Agnostic)**: The code automatically adapts to different hardware vector widths, ensuring scalability and portability across various RISC-V cores.

---

## 4. Conclusion: Performance and Energy Gains

1. **Memory Savings**: Storing each weight in 2 bits instead of 16 reduces the weight memory footprint and bandwidth requirements by approximately **8x** compared to FP16.
2. **Energy Efficiency**: Replacing multipliers with adders significantly lowers the energy consumption per operation, making LLMs viable for battery-powered Edge AI devices.
3. **High Throughput**: By leveraging RVV's parallel processing capabilities, BitNet achieves superior inference speeds on RISC-V platforms.

---

## Appendix: Understanding Vectors vs. Matrices in AI

To understand why BitNet and RVV are so effective, it is essential to distinguish between the two fundamental data structures used in AI.

### 1. Vector (1D Entity)

* **Definition**: A vector is an ordered, one-dimensional list of numbers.
* **In LLMs**: A vector typically represents a single data point or a "hidden state" of a word (token) in the model.
* **Hardware Context**: RVV (RISC-V Vector Extension) is designed to process these long lists of numbers simultaneously, performing the same operation (like addition) on every element in the vector at once.

### 2. Matrix (2D Entity)

* **Definition**: A matrix is a two-dimensional grid of numbers arranged in rows and columns.
* **In LLMs**: The "Weights" of a model are stored as matrices. When an input vector passes through a layer, it is multiplied by these weight matrices.
* **The Challenge**: Standard Matrix-Vector Multiplication (GEMV) is computationally expensive because it requires $O(N^2)$ multiplications and additions.

### How BitNet Changes the Interaction

Traditionally, we multiply a **Vector** by a **Matrix** using high-precision floating-point math.

* **BitNet's Approach**: By restricting the **Matrix** elements (weights) to $\{-1, 0, 1\}$, we no longer need to "multiply."
* **The Result**: The **Matrix** acts as a set of instructions (Add, Subtract, or Skip) for the incoming **Vector**. This allows the hardware to treat complex matrix math as a series of high-speed vector accumulations (ACC), as the sketch below illustrates.
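To make this add/subtract/skip view concrete, the scalar C sketch below computes one output element of a ternary GEMV from weights packed with the 2-bit encoding of Section 3. The packing order (lowest two bits hold the first weight), the function name `ternary_dot_row`, and the 8-bit integer activations are assumptions for illustration only; a real RVV kernel would replace the `if`/`else` (the software stand-in for the hardware mux) with vector mask/merge operations so the loop stays branchless and processes many elements per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a ternary GEMV: each weight tells us to add, subtract, or
 * skip the corresponding activation. Weights are packed four per byte
 * using the 2-bit encoding from Section 3:
 *   00 -> 0 (skip), 01 -> +1 (add), 10 -> -1 (subtract), 11 -> reserved (0).
 * The packing order (lowest two bits = first weight) is assumed here. */
int32_t ternary_dot_row(const uint8_t *w_packed, const int8_t *x, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Extract the 2-bit code for weight i. */
        uint8_t code = (w_packed[i / 4] >> ((i % 4) * 2)) & 0x3;

        if (code == 0x1)      acc += x[i];   /* +1: add the activation      */
        else if (code == 0x2) acc -= x[i];   /* -1: subtract the activation */
        /* codes 00 and 11 contribute nothing (skip)                        */
    }
    return acc;   /* the weights never multiply the activations */
}
```

Because the inner loop only selects and accumulates, the same logic maps directly onto the mux-plus-adder datapath described in Section 1.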