IEEE 754 Floating Point

# IEEE 754 Floating Point ## 浮點數標準大名鼎鼎的 [IEEE 754](https://pcv.oss-cn-shanghai.aliyuncs.com/wp-content/upload/2022/12/ieee-std-754-2019.pdf) 是定義 32-bit 浮點數的官方文件，於 1985 年由發布第一個版本。下圖^[1]^ 顯示 Float32 各位元的功能: ![image](https://hackmd.io/_uploads/ryqXvoV51g.png) 位元可以劃分為三部分: - sign bit (藍色): 31^th^ 位，`1` 表示負，共 1 位 - exponent bits (綠色) : 23^th^ - 30^th^ 位，共 8 位 - fraction bits (紅色): 0^th^ - 22^th^ 位，共 23 位計算的方法如下: $$ (-1)^s \times \left(1 + \frac{fraction}{2^{23}} \right) \times 2^{(e-127)} $$ :::warning 浮點數具有幾種特殊值: - exp 為 `0b1111_1111` && fraction 為 `0` && sign 為 `1`: `-INF` - exp 為 `0b1111_1111` && fraction 為 `0` && sign 為 `0`: `INF` - exp 為 `0b1111_1111` && fraction 為非 `0` && sign 為 `0`: `NaN` ::: 要將 10 進位小數轉換成浮點數格式需要兩個步驟 1. 10 進位 → 2 進位 1. 2 進位 → 浮點數表達 (sign, exponent, fraction) :::info 10 進位數值是如何轉換成 2 進制呢? ::: 假設有 $X$ 為一 10 進制數字，可分為整數 $N$ 和小數 $F$ 兩部分: $$ X = N + F $$ $N$ 使用除 2 取餘可得，餘數順序分別是 $b_0, b_1, \cdots$，當被除數小於 2 時停止 $$ N = b_k \cdot 2^k + b_{k-1} \cdot 2^{k-1} + \cdots + b_1 \cdot 2^1 + b_0 \cdot 2^0 $$ $F$ 使用乘 2 取整方式 (乘 2 後看整數是否為 1) 得到，當被乘數等於 0 時停止 $$ F = b_{-1} \cdot 2^{-1} + b_{-2} \cdot 2^{-2} + \cdots $$ 讓我們以 `9.73` 為例 $$ \begin{aligned} N &= 9 = 2^3 + 2^0 = 1001_b \\ F &= 0.73_{10} \\ &\approx 0.101110101110000101001_b \\ X &= 1001_b + 0.101110101110000101001_b \\ & = 1.001\_101110101110000101001_b \cdot 2^3 \end{aligned} $$ 將 $X$ 換成以浮點數表示: - sign: $0$ - exp: $127 + 3 = 130 = 2^7 + 2^1 = 1000\_0010_b$ - fraction: $0011\_0111\_0101\_1100\_0010\_100(1)_b$，最後一位被省略組合起來就是: $$ \begin{array}{c} \underbrace{0}_{\text{sign}} \quad \underbrace{1000\_0010}_{\text{exponent}} \quad \underbrace{0011\_0111\_0101\_1100\_0010\_100}_{\text{fraction}} \end{array} $$ 透過印出 `9.73` float 的 binary format 得到同樣的結果: ```c float f = 9.73; unsigned int* pu32_f = (unsigned int*)&f; unsigned int u32_f = *pu32_f; printf("[%d] ", (u32_f) & (1 << 31), 1, 0); for (int bit = 30; bit >= 0; bit--) { int c = 30 - bit; printf("%d", (u32_f) & (1 << bit) ? 1 : 0); if ((c - 3)% 4 == 0) { printf("_"); } } ``` :::info 浮點數的表達極限在哪裡? ::: 由於不可能透過有限位數表達無限多的實數，因此透過位元表達勢必存在極限，下圖^[3]^很好地說明符點數的分佈會隨著數字增加，而使最小精度單位 (Unit in Last Place, ULP) 以 2 的指數倍下降。例如: $[2^{23}, 2^{24})$ 時 ULP 是 1，而對於 $[2^{24}, 2^{25})$ 則是 2。 ![image](https://hackmd.io/_uploads/ryMjptHq1g.png) 浮點數特別的極限: 1. 當 $X = 2^{23}$ : 由於整數部份已經佔據 23 bits，因此**無任何小數資訊能被保存**，ULP 為 1，若繼續增加則 ULP 會變成 2, 4, ... **範例觀察** 接下來讓我們用程式觀察幾個 ULP 計算範例: ```python import numpy as np num = np.float32(123456789.0) # 轉換為 float32 next_up = np.nextafter(num, float('inf'), dtype=np.float32) # 找比它大的最近浮點數 next_down = np.nextafter(num, float('-inf'), dtype=np.float32) # 找比它小的最近浮點數 print("="*40) print(f" num = {num}") print(f" next_up = {next_up}") print(f" next_down= {next_down}") print(f" ULP 誤差範圍 = {next_up - num}") print("="*40) # 比較不同的數值，顯示結果更清楚 test_values = [ (123456795.99, 123456800), (123456796.0, 123456800), (123456803.99, 123456800), (123456804.0, 123456800), (123456804.1, 123456800), (123456804.0, 123456796.0), ] print("\n浮點數比較結果:") print("-"*40) for val1, val2 in test_values: res = np.float32(val1) == np.float32(val2) print(f" np.float32({val1}) == np.float32({val2}) → {res}") print("-"*40) ``` ```bash ======================================== num = 123456792.0 next_up = 123456800.0 next_down= 123456784.0 ULP 誤差範圍 = 8.0 ======================================== 浮點數比較結果: ---------------------------------------- np.float32(123456795.99) == np.float32(123456800) → False np.float32(123456796.0) == np.float32(123456800) → True np.float32(123456803.99) == np.float32(123456800) → True np.float32(123456804.0) == np.float32(123456800) → True np.float32(123456804.1) == np.float32(123456800) → False np.float32(123456804.0) == np.float32(123456796.0) → True ---------------------------------------- ``` 觀察以下兩組比較: - 針對兩鄰近的浮點數表示 `123456792.0` 和 `123456784.0`，由於 `123456789.0` 更靠近前者，因此會選用前者的表示方式 - 針對兩鄰近的浮點數表示 `123456800.0` 和 `123456808.0`，由於 `123456804.0` 與兩者==同樣接近==，最後選擇了前者，Why? 這跟浮點數的==捨入方法==有關: - 當兩邊==距離相同==時，==捨入至最低有效位 (LSB) 為偶數方== (IEEE754-2019, 4.3.1 roundTiesToEven) - `123456800.0` 的 LSB 是 0 而，`12345808.0` 則是 1，故捨入前者 <br> ```bash 123456800.0 的 IEEE 754 表示: 01001100111010110111100110100100 (0x4CEB79A4) 123456808.0 的 IEEE 754 表示: 01001100111010110111100110100101 (0x4CEB79A5) 123456800.0 的 LSB: 0 123456808.0 的 LSB: 1 ``` - 能正確被浮點數表示的數字稱機器可表示數 (machine-representable number, MRN)，以 MRN 為圓心，半徑為 0.5\*ULP 範圍內的數字都會以 MRN 紀錄，剛好在圓心上的數字會往 LSB 為 0 的 MRN 捨入 ## 問題回顧了解上述內容後，讓我們[回顧問題](https://docs.google.com/presentation/d/1NW62Ht_GuvSZXGJ98aIkzAu0I9PGRD8CSv2SgxB3GnM/edit#slide=id.gbd95434991_0_145)並釐清答案吧! ```c #include <stdio.h> int main() { float sum = 0.0f; for (int i = 0; i < 10000; i++) sum += i + 1; printf("Sum: %f\n", sum); return 0; } ``` :::info 上述[程式](https://godbolt.org/z/Pnzr7n)的輸出居然是 `50002896`，與實際答案 `50005000` 相差了 `2104`，雖然知道浮點數運算會有誤差，但 `2104` 是怎麼來的，你說的出來嗎? ::: 透過 AI 撰寫以下程式分析該過程 ```c /* * cumulative_error_print.c * * This program computes the cumulative sum from 1 to 10000 using both * float (simulating float32 arithmetic) and double (higher precision). * When the float and double cumulative results differ, the program prints * detailed information including: * - Current number (num) * - The current number as float (num_float) * - Exact cumulative sum (exact_sum) computed using int64_t arithmetic * - Cumulative sum computed with float (float32_sum) * - LSB (least significant bit) of the float32 cumulative result * - Current error (error = double_sum - float32_sum) * - Change in error (error_diff compared to previous step) * * Compilation: gcc cumulative_error_print.c -lm * Execution: ./a.out > output.txt */ #include <inttypes.h> #include <math.h> #include <stdint.h> #include <stdio.h> /* Use a union to access the bit-level representation of a float */ typedef union { float f; uint32_t u; } FloatUnion; /** * @brief Extract the least significant bit (LSB) of the mantissa from a float. * * @param value The float value. * @return The LSB of the 23-bit mantissa. */ unsigned int extract_lsb(float value) { FloatUnion fu; fu.f = value; /* Extract the mantissa (lower 23 bits) and then return its LSB */ return (fu.u & 0x007FFFFF) & 1; } int main(void) { const int max_i = 10000; /* Loop from 1 to 10000 */ int i; /* Exact cumulative sum computed with int64_t */ int64_t exact_sum = 0; /* Cumulative sum using float (simulate float32 arithmetic) */ float sum_f = 0.0f; /* Cumulative sum using double (higher precision, serves as theoretical value) */ double sum_d = 0.0; double prev_error = 0.0; double current_error = 0.0; double error_diff = 0.0; /* Print header using tab-separated columns with left alignment */ printf("%-8s\t%-10s\t%-12s\t%-12s\t%-4s\t%-8s\t%-8s\n", "num", "num_float", "exact_sum", "float32_sum", "LSB", "error", "error_diff"); for (i = 1; i <= max_i; i++) { exact_sum += i; sum_d += i; /* Accumulate using double (theoretical value) */ sum_f = sum_f + (float)i; /* Accumulate using float (float32 simulation) */ current_error = sum_d - (double)sum_f; /* Print information when the float and double results differ */ if (fabs(current_error) > 0.0) { if (i == 1) error_diff = 0.0; else error_diff = current_error - prev_error; prev_error = current_error; /* Get the LSB of the float cumulative result */ unsigned int lsb = extract_lsb(sum_f); printf("%-8d\t%-10.0f\t%-12" PRId64 "\t%-12.0f\t%-4d\t%-8.0f\t%-8.0f\n", i, (float)i, exact_sum, sum_f, lsb, current_error, error_diff); } } return 0; } ``` 結合上述得知僅當浮點數超過 $2^{24} 時，其 ULP 才會 > 1.0。因此將 `float32_sum` 依照 $2^{24}$ (), $2^{25}$ (33554432) 進行分界，得知 `num` 分別為 : - 5793: 16776527 → 16782320 - 8192: 33549136 → 33557328 觀察這附近的輸出: ``` num num_float exact_sum float32_sum LSB error error_diff 5793 5793 16782321 16782320 0 1 1 5794 5794 16788115 16788114 1 1 0 5795 5795 16793910 16793908 0 2 1 5796 5796 16799706 16799704 0 2 0 5797 5797 16805503 16805500 0 3 1 5798 5798 16811301 16811298 1 3 0 5799 5799 16817100 16817096 0 4 1 5800 5800 16822900 16822896 0 4 0 ... /* 以下是 2^25 範圍 */ 8192 8192 33558528 33557328 0 1200 0 8193 8193 33566721 33565520 0 1201 1 8194 8194 33574915 33573712 0 1203 2 8195 8195 33583110 33581908 1 1202 -1 8196 8196 33591306 33590104 0 1202 0 8197 8197 33599503 33598300 1 1203 1 8198 8198 33607701 33606496 0 1205 2 8199 8199 33615900 33614696 0 1204 -1 8200 8200 33624100 33622896 0 1204 0 8201 8201 33632301 33631096 0 1205 1 8202 8202 33640503 33639296 0 1207 2 8203 8203 33648706 33647500 1 1206 -1 . . . 9996 9996 49965006 49962904 0 2102 0 9997 9997 49975003 49972900 1 2103 1 9998 9998 49985001 49982896 0 2105 2 9999 9999 49995000 49992896 0 2104 -1 10000 10000 50005000 50002896 0 2104 0 ``` 在前半部分 ($\lbrack 5793, 8192)$)，可以參考這張圖: ![float_error](https://hackmd.io/_uploads/HyBMRkv91g.png) - LSB 0 → LSB 1 的區域定義為 Region A (藍色處) - LSB 1 → LSB 0 的區域定義為 Region A (綠色處) 在藍色區域的中點(含)以左都會被捨入到左側的浮點數 (e.g., 步驟 2)，在綠色區域中點(含)以右都會被捨入到右側的浮點數。誤差可以這樣統計: 1. 對於任意等差數列和，可以找出其誤差差值 (error_diff) 循環規律，例如本題(遞增 1)，在 $\lbrack 5793, 8192)$ 為 `1, 0`，循環長度為 ULP。在 $\lbrack 8192, 11585)$ 為 `0, 1, 2, -1`，在 $\lbrack 11585, 16384)$ 為 `1, 2, 3, 4, -3, -2, -1, 0`。每次循環的和是 ULP / 2。 2. 統計 error_diff: - $\lbrack 5793, 8192)$: 共 1199 組循環==餘 1==，單個循環 error_diff 和是 1，剩下的數字 8191 的 error 是 1，因此本階段共計 1200。 - $\lbrack 8192, 10000)$: 剛好 452 組，一組循環 2，共 904。 - 兩者和為 2104 :::warning :warning: 區間內的等差數值數量不一定會整除 ULP，要計算最後一組剩下的 error_diff 加上前面每一組的 error_diff_sum ::: :::info :question: 目前頂多是暴力解，有數值解嗎?算式如何列? ::: **背後的原理** todo :::success 由於浮點數間隔在一段距離內 ($2^k \to 2^{k+1}-1$) 是 ==均勻分布== 的，又我們的目標數是遞增，所以 - error 循環長度與 ULP 一致 (e.g., 0, 1, 2, -1，對應到 ULP = 4) ::: 至此，我們知道累積的誤差來源正是因為捨入方向。IEEE 754 預設的捨入模式稱為銀行家捨入法 (Banker's rounding, 四捨六入五取偶)，詳見[延伸閱讀: 捨入方法](#延伸閱讀-捨入方法)。 ## 修正誤差 - Kahan summation algorithm 我們可以藉由 [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) ```c int main() { float sum = 0.0f, corr = 0.0f; /* corrective value for rounding error */ for (int i = 0; i < 10000; i++) { float y = (i + 1) - corr; /* add the correction to specific item */ float t = sum + y; /* bits might be lost */ corr = (t - sum) - y; /* recover lost bits */ sum = t; } printf("Sum: %f\n", sum); return 0; } ``` 其原理是... :::info Kahan summation algorithm 的限制? ::: --- ## 延伸閱讀: 捨入方法 1. **銀行家捨入 (Banker's rounding)** 也稱為無偏捨入， 1. ## 延伸閱讀: Float16 V.S. bFloat16 ## 延伸閱讀: 浮點數在 Linux Kernel 的使用 Linux 核心在大部分情況下禁止使用 FP 暫存器，因其上下文切換時造成的 CPU 性能損耗非常大。定點數運算 ### 避免 overflow - 平均 - 乘法 ## Reference 1. [What Is bfloat16, Anyway?](https://www.eejournal.com/article/what-is-bfloat16-anyway/) 2. [2025-02-{18,20} 課堂問答簡記](https://hackmd.io/2jXewm-kQBSDHwWVEZBg2w?view#devarajabc) 3. [Non-uniform Distribution of Floating-Point Numbers](https://www.researchgate.net/figure/Non-uniform-Distribution-of-Floating-Point-Numbers_fig8_265389713)