---
tags: number-system, binary-system
---

# Floating-Point Arithmetic

## Why `0.1 + 0.2 != 0.3`?[^1]

- [Prime Factor](https://www.geeksforgeeks.org/prime-factor/)
    - [Prime factor (質因數)](https://terms.naer.edu.tw/detail/1677487/): if b is a factor of a and b is prime, then b is called a prime factor of a; for example, the prime factors of 60 are 2, 3, and 5.
    - From the article's description:
        **Some interesting fact about Prime Factor:**
        1. There is only one (unique!) set of prime factors for any number.
        3. Prime factorizations can help us with divisibility, simplifying fractions, and finding common denominators for fractions.

        > From these two points, the prime factors of a number are the unique, smallest factors that express it.

Take the base-10 system (the one people normally use) as an example: the prime factors of `10` are `2` and `5`, so any fraction whose denominator is a product of only 2s and 5s can be expressed cleanly, with no unexpected extra digits, e.g. 1/2 (= (1x5)/(2x5) = 5/10 = 5·10^-1^). Conversely, 1/3, 1/6, 1/7 and 1/9 become repeating decimals because their denominators contain the prime factor `3` or `7`.

In binary (base-2), `2` is the only prime factor, so only fractions whose denominator has `2` as its sole prime factor can be expressed cleanly, e.g. 1/2 (= 2^-1^), 1/4 (= 2^-2^), 1/8 (= 2^-3^). So 0.1 and 0.2 (i.e. 1/10 and 1/5) are clean decimals in the more human-readable base-10 representation, but in the base-2 system they become repeating fractions. They must be rounded to fit a finite number of bits, so the arithmetic leaves rounding leftovers in the result.

## Representation range

> A shortened explanation[^2]

:::info
Representation in binary system
:::

IEEE 754 has 3 basic components:
1. **The Sign of Mantissa –** This is as simple as the name. 0 represents a positive number while 1 represents a negative number.
2. **The Biased exponent –** The exponent field needs to represent both positive and negative exponents. A bias is added to the actual exponent in order to get the stored exponent.
    > Further reading: Why There is a Bias 127 in Exponent in IEEE 754 Standard Single Precision

    {%youtube aE2kVS0O0OE %}

    - If the range of the exponent field is split evenly into a negative half and a positive half, both +0 and -0 appear.
    - To merge +0/-0 into a single 0, IEEE 754 uses a bias (2^bits_of_exp-1^ - 1): adding the bias maps the signed exponents we want to represent onto the values the exponent field can actually store. Taking an 8-bit exponent field as an example:
    ```c=
    //   Negative        Positive
    [-127, ...,  -0][ +0,   1, ..., +127]
    =>
    [   0, ..., 127][127, 128, ..., 254, 255]
    ```
    So the representable range is `-127~0, 1~128`. The `128 (mapped to 255)` arises because +0 and -0 both map to the bias (127), so the positive side starts from `1` when mapping onto the remaining field values (128~255); that is, each new value is the original positive value (0~127) + 1. However:
    - `128 (255 in the exp field)` is reserved to signal Infinity or NaN (Not a Number), depending on the value of the Mantissa.
    - `-127 (0)` is reserved to signal exact 0 or a denormalised number.

    In summary:

    | EXPONENT | MANTISSA | VALUE              |
    | -------- | -------- | ------------------ |
    | 0        | 0        | exact 0            |
    | 255      | 0        | Infinity           |
    | 0        | not 0    | denormalised       |
    | 255      | not 0    | Not a number (NaN) |

    ==**Conclusion**: the exponent range actually available for normalised numbers is -126 ~ +127==
3. **The Normalised Mantissa –** The mantissa is part of a number in scientific notation or a floating-point number, consisting of its significant digits. Here we have only 2 digits, i.e. 0 and 1. So a normalised mantissa is one with only one 1 to the left of the decimal.
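
As a rough illustration of the three components and the summary table above, the C sketch below splits a single-precision value into its sign, stored exponent, and mantissa fields. It assumes `float` is an IEEE 754 binary32 on the target platform; the names `f32_fields`, `f32_decode`, `f32_classify`, and `f32_unbiased_exponent` are made up for this note and do not come from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type for this note: the three binary32 fields. */
typedef struct {
    uint32_t sign;     /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent; /* 8 bits: stored (biased) exponent  */
    uint32_t mantissa; /* 23 bits: fraction bits, without the implicit leading 1 */
} f32_fields;

/* Split a float into its fields. Assumption: float is IEEE 754 binary32. */
f32_fields f32_decode(float x)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits); /* copy the raw bit pattern */
    f32_fields f = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return f;
}

/* Classify a value following the EXPONENT / MANTISSA summary table. */
const char *f32_classify(f32_fields f)
{
    if (f.exponent == 0)
        return f.mantissa == 0 ? "exact 0" : "denormalised";
    if (f.exponent == 255)
        return f.mantissa == 0 ? "Infinity" : "NaN";
    return "normalised";
}

/* Remove the bias (127). Only meaningful for normalised values,
 * i.e. stored exponents 1..254, which gives -126..+127. */
int f32_unbiased_exponent(f32_fields f)
{
    return (int)f.exponent - 127;
}
```

For example, `f32_decode(0.1f)` should yield sign `0`, stored exponent `123` (that is, 123 - 127 = -4), and mantissa `0x4CCCCD`. Its bits, `100 1100 1100 1100 1100 1101`, show the repeating `1100` pattern of 1/10 in binary, rounded at the last bit.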
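:::success
:tada: The answer
:::

> How to Calculate Range of Float Variable [Single Precision]

{%youtube YYIeMM8By6Y %}

From the conclusion above on how a floating-point number is represented, the range the **Exponent** can represent is:
```c=
[-126, ..., +127]
```
Returning to the IEEE 754 Floating Point Standard, the single-precision layout is:
![](https://i.imgur.com/zNyFYWM.png)

The largest representable value: +1.111...1 x 2^127^
- **+1.111...1** is approximately `2`
- A calculator gives 2^127^ ≈ 1.7e38
    - [e stands for Exponential and denotes a power of 10](https://zh.wikipedia.org/zh-tw/%E7%A7%91%E5%AD%A6%E8%AE%B0%E6%95%B0%E6%B3%95)
- So the largest value is about 2 x 1.7e38 ≈ 3.4 x 10^38^, which agrees with the ==2.0 x 10^38^== on the slide only in order of magnitude

The smallest (normalised) value is computed from the smallest exponent the Exponent can represent (-126):
:question: 1.0 x 2^-126^ = 2^-126^ ≈ 1.175e-38, which is not ≈ ==2.0 x 10^-38^==

:::info
Another way to compute
- The maximum and minimum of the Exponent:
    - IEEE-754[^3] derives the corresponding formulas (here w is the width of the exponent field):
        emax = bias = 2^(w-1)^ - 1
        emin = 1 - emax = 2 - 2^(w-1)^
        For single precision, w = 8, so emax = 127 and emin = -126.
    - As for **emin**: one number is taken from the negative exponent range to serve as the bias, so the negative range has one fewer number than the positive range and its magnitude is `emax - 1`. Since emin is a negative exponent, it is obtained by swapping the operands: `1 - emax`
- The maximum value[^4]:
    maximum value of (2−2^−23^) × 2^127^ ≈ 3.4028235 × 10^38^

    Continuing from the Denormalized/Normalized maximum and minimum values listed in [^2]:
    ![](https://i.imgur.com/WVjEg2T.png)
    - Denormalized
        Every bit of the *Mantissa* field is used for the fraction, i.e. there is no implicit leading `1`.
        Because the fraction is built from negative powers of 2, the powers of 2 can be summed in the numerator. Taking a 3-bit mantissa as an example:
        max = 0.111~2~ = 1/2 + 1/4 + 1/8 = (4+2+1) / 8 = 7/8
            = 1 - min = 1 - 0.001~2~ = 1 - 2^-3^ = 7/8
        Since the sum of the powers of 2 is 2^w^ - 1 (w: width of the mantissa), the expression for max can be written as
        (2^0^ + 2^1^ + 2^2^) / 2^3^ = (2^3^ - 1) / 2^3^ = (2^w^ - 1) / 2^w^ = ==1 - 2^-w^==
    - Normalized
        max = 1.1···1 (w fraction bits, indexed 0 to w-1)
        min = 1.0
        Because of the extra implicit leading `1`, applying the sum-of-powers formula above means taking the next bit above the leading `1` (that is, 2) and subtracting the value whose only set bit is the LSB:
        2 - 0.0···1 = ==2 - 2^-w^==
:::

[^1]: Different programming languages give similar results: [Floating Point Math](http://0.30000000000000004.com/) in [你所不知道的 C 語言：數值系統篇](https://hackmd.io/@jserv/rJzclA2q-/https%3A%2F%2Fhackmd.io%2Fs%2FBkRKhQGae?type=book)
[^2]: [IEEE Standard 754 Floating Point Numbers](https://www.geeksforgeeks.org/ieee-standard-754-floating-point-numbers/)
[^3]: 3.4 Binary interchange format encodings in [IEEE Std 754™-2008 (Revision of IEEE Std 754-1985), IEEE Standard for Floating-Point Arithmetic](https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf)
[^4]: Wikipedia: [Single-precision floating-point format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)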
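
The exponent and mantissa formulas quoted above from IEEE-754 and Wikipedia (`emax = 2^(w-1) - 1`, `emin = 1 - emax`, maximum `(2 − 2^−23) × 2^127`) can be checked numerically. The C sketch below is one way to do that, assuming `float` is IEEE 754 binary32 so that `FLT_MAX` and `FLT_MIN` from `<float.h>` are the reference values. The function names (`emax_of`, `emin_of`, `pow2`, `max_normalised`, `min_normalised`, `max_denormalised`) and the parameters `w_exp` and `w_man` (exponent and mantissa field widths) are hypothetical, introduced only for this note.

```c
#include <assert.h>
#include <float.h>

/* emax = bias = 2^(w-1) - 1, where w_exp is the exponent field width. */
int emax_of(int w_exp)
{
    assert(w_exp >= 2 && w_exp <= 30); /* keep the shift well-defined */
    return (1 << (w_exp - 1)) - 1;
}

/* emin = 1 - emax */
int emin_of(int w_exp)
{
    return 1 - emax_of(w_exp);
}

/* 2^e by repeated multiplication or division; exact in double for the
 * exponents used here, and avoids a dependency on <math.h>. */
double pow2(int e)
{
    double r = 1.0;
    for (; e > 0; e--) r *= 2.0;
    for (; e < 0; e++) r /= 2.0;
    return r;
}

/* Largest normalised value: (2 - 2^-w_man) x 2^emax */
double max_normalised(int w_exp, int w_man)
{
    return (2.0 - pow2(-w_man)) * pow2(emax_of(w_exp));
}

/* Smallest positive normalised value: 1.0 x 2^emin */
double min_normalised(int w_exp)
{
    return pow2(emin_of(w_exp));
}

/* Largest denormalised value: (1 - 2^-w_man) x 2^emin */
double max_denormalised(int w_exp, int w_man)
{
    return (1.0 - pow2(-w_man)) * pow2(emin_of(w_exp));
}
```

With `w_exp = 8` and `w_man = 23`, `emax_of` and `emin_of` give 127 and -126. On platforms where `float` is binary32, `max_normalised(8, 23)` should equal `FLT_MAX` (about 3.4028235e38) and `min_normalised(8)` should equal `FLT_MIN` (about 1.175e-38).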