---
tags: number-system, binary-system
---
# Floating-Point Arithmetic
## Why `0.1 + 0.2 != 0.3`?[^1]
- [Prime Factor](https://www.geeksforgeeks.org/prime-factor/)
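- [Prime factor (質因數)](https://terms.naer.edu.tw/detail/1677487/): if b is a factor of a and b is prime, then b is called a prime factor of a; for example, the prime factors of 60 are 2, 3, and 5.
- From the article's section **Some interesting fact about Prime Factor:**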
1. There is only one (unique!) set of prime factors for any number.
3. Prime factorizations can help us with divisibility, simplifying fractions, and finding common denominators for fractions.
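> From these two points, the prime factors of a number are its unique decomposition into the smallest possible factors.
Take the base-10 system (the one people normally use) as an example: the prime factors of `10` are `2` and `5`, so any fraction whose denominator is a product of only 2s and 5s has no unexpected extra digits (it can be expressed cleanly), e.g. 1/2 (=(1x5)/(2x5)=5/10=5·10^-1^). In contrast, 1/3, 1/6, 1/7, and 1/9 become repeating decimals because the prime factors of their denominators include `3` or `7`.
Back to binary (base-2): `2` is the only prime factor, so only fractions whose denominator has `2` as its sole prime factor can be expressed cleanly, e.g. 1/2 (=2^-1^), 1/4 (=2^-2^), 1/8 (=2^-3^).
So 0.1 and 0.2 (that is, 1/10 and 1/5) are clean decimals in the more human-readable base-10 representation, but in base-2 both become repeating fractions because their denominators contain the prime factor `5`. They must be rounded to fit a finite number of bits, and the rounding errors are left over after the arithmetic.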
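The prime-factor argument above can be checked with a short sketch. It assumes Python, whose `float` is an IEEE 754 double on common platforms; `terminates_in_base` is a helper name introduced here for illustration, not something from the cited articles.

```python
from fractions import Fraction
from math import gcd


def terminates_in_base(denominator: int, base: int) -> bool:
    """Return True if 1/denominator has a finite expansion in `base`."""
    d = denominator
    g = gcd(d, base)
    # Strip every prime factor that the denominator shares with the base.
    while g > 1:
        while d % g == 0:
            d //= g
        g = gcd(d, base)
    # Anything left over is a prime factor the base does not have.
    return d == 1


for d in (2, 3, 5, 6, 7, 9, 10):
    print(d, terminates_in_base(d, 10), terminates_in_base(d, 2))

print(0.1 + 0.2 == 0.3)  # False
print(Fraction(1, 10) + Fraction(1, 5) == Fraction(3, 10))  # True
print(Fraction(0.1))  # the exact binary value actually stored for 0.1
```

With exact rational arithmetic the sum is 3/10, so the mismatch comes from rounding 1/10 and 1/5 to binary, not from the addition rule itself.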
## Representation range
> A shortened explanation[^2]
:::info
Representation in binary system
:::
IEEE 754 has 3 basic components:
1. **The Sign of Mantissa –**
This is as simple as the name. 0 represents a positive number while 1 represents a negative number.
2. **The Biased exponent –**
The exponent field needs to represent both positive and negative exponents. A bias is added to the actual exponent in order to get the stored exponent.
> Further reading: Why There is a Bias 127 in Exponent in IEEE 754 Standard Single Precision
{%youtube aE2kVS0O0OE %}
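- If the range of the exponent field were simply split evenly into a negative half and a positive half, both +0 and -0 would appear.
- To collapse +0/-0 into a single 0, IEEE 754 uses a bias (2^bits_of_exp-1^-1): adding the bias to the signed exponent maps it onto the unsigned range the exponent field can store. Taking an 8-bit exponent field as an example: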
```c=
// Negative Positive
[-127, ..., -0][+0, 1, ..., +127]
=> [ 0, ..., 127][127, 128, ..., 254, 255]
```
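So the representable exponent range is `-127~0, 1~128`. The value `128` (mapped to 255) appears because +0 and -0 both map to the bias (127), so the positive side is shifted to start from `1` when it is mapped back onto the stored range (128~255); that is, the new values are the original positive values (0~127) plus 1. However:
- `128 (255 in the exp field)` is reserved for Infinity or NaN (Not a Number), depending on the value of the mantissa.
- `-127 (0 in the exp field)` is reserved for exact 0 or denormalised numbers.
In summary: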
| EXPONENT | MANTISSA | VALUE |
| -------- | -------- | ------------------ |
| 0 | 0 | exact 0 |
| 255 | 0 | Infinity |
| 0 | not 0 | denormalised |
| 255 | not 0 | Not a number (NaN) |
==**Conclusion**: the exponent range actually available to normalised numbers is -126 ~ +127==
3. **The Normalised Mantissa –**
The mantissa is part of a number in scientific notation or a floating-point number, consisting of its significant digits. Here we have only 2 digits, i.e. 0 and 1. So a normalised mantissa is one with only one 1 to the left of the binary point.
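The three components and the special-case table can be inspected directly. This is a minimal sketch, assuming Python's standard `struct` module is used to round a value to single precision and expose its 32 bits; `decode_float32` and `classify` are names introduced here for illustration.

```python
import struct


def decode_float32(x: float):
    """Round x to single precision and split its bits into the three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF        # 23 bits, without the implicit leading 1
    return sign, exponent, mantissa


def classify(exponent: int, mantissa: int) -> str:
    """Apply the EXPONENT / MANTISSA table to the stored fields."""
    if exponent == 0:
        return "exact 0" if mantissa == 0 else "denormalised"
    if exponent == 255:
        return "Infinity" if mantissa == 0 else "NaN"
    return f"normalised, actual exponent {exponent - 127}"


for value in (1.0, -2.0, 0.0, 2.0**-149, float("inf"), float("nan")):
    sign, exponent, mantissa = decode_float32(value)
    print(value, sign, exponent, mantissa, classify(exponent, mantissa))
```

Stored exponents 1 to 254 map to actual exponents -126 to +127, which matches the conclusion above.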
:::success
:tada: The answer
:::
> How to Calculate Range of Float Variable [Single Precision]
{%youtube YYIeMM8By6Y %}
From the conclusion above on representing floating-point numbers, the range the **Exponent** can represent is:
```c=
[-126, ..., +127]
```
Back to the single-precision representation in the IEEE 754 Floating Point Standard:

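Largest representable value: +1.111...1 x 2^127^
- **+1.111...1** is approximately `2`
- A calculator gives 2^127^ ≈ 1.7e38
- [`e` stands for exponent and denotes a power of 10](https://zh.wikipedia.org/zh-tw/%E7%A7%91%E5%AD%A6%E8%AE%B0%E6%95%B0%E6%B3%95)
- So the maximum is about 2 x 1.7e38 ≈ 3.4e38, the same order of magnitude as the ==2.0 x 10^38^== given in the slide
The minimum is calculated from the smallest exponent that can be represented (-126):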
:question: 1.0 x 2^-126^ = 2^-126^ = 1.175e-38 !≈ ==2.0 x 10^-38^==
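Both bounds can be checked numerically. The sketch below assumes Python's `struct` module is used to reinterpret a raw 32-bit pattern as a single-precision value; `float32_from_bits` is a helper name introduced here for illustration.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


# Stored exponent 254 (actual +127) with an all-ones mantissa.
largest = float32_from_bits(0x7F7FFFFF)
# Stored exponent 1 (actual -126) with an all-zero mantissa.
smallest_normal = float32_from_bits(0x00800000)

print(largest == (2 - 2**-23) * 2.0**127)  # True
print(smallest_normal == 2.0**-126)        # True
print(f"{largest:.4e}", f"{smallest_normal:.4e}")
```

Under these assumptions the smallest normalised value comes out as roughly 1.175e-38, which agrees with the hand calculation of 2^-126^ rather than with the 2.0 x 10^-38^ figure.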
:::info
Another way to calculate:
- The maximum and minimum of the exponent:
- IEEE 754[^3] derives the corresponding formulas:
emax = bias = 2^(w-1)^ - 1
emin = 1 - emax = 2 - 2^(w-1)^
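- For **emin**: one value from the negative exponent range is taken up by the bias mapping, so the negative range has one fewer value than the positive range and its magnitude is `emax - 1`. Since emin is a negative exponent, it is obtained by swapping the operands: `1 - emax`
- The maximum value[^4]: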
maximum value of (2−2^−23^) × 2^127^ ≈ 3.4028235 × 10^38^
Continuing from [^2], which lists the maximum and minimum of the Denormalized/Normalized values:

- Denormalized
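All bits of the *Mantissa* field represent the fraction, i.e. there is no implicit leading `1`.
Because the fraction is a sum of negative powers of 2, the powers of 2 can be summed in the numerator. Taking a 3-bit mantissa as an example:
max = 0.111~2~ = 1/2 + 1/4 + 1/8 = (4+2+1) / 8 = 7/8 =
1 - min = 1 - 0.001~2~ = 1 - 2^-3^ = 7/8
The sum of the powers of 2 from 2^0^ to 2^w-1^ is 2^w^ - 1, where w is the width of the mantissa.
So max can be written as (2^0^ + 2^1^ + 2^2^) / 2^3^ = (2^3^ - 1) / 2^3^ = (2^w^ - 1) / 2^w^ = ==1 - 2^-w^==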
- Normalized Value
max = 1.1···1~w-1~
min = 1.0
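Because of the extra implicit leading `1`, applying the sum-of-powers formula above means taking the next bit above the leading `1` (the value 2) and subtracting the value whose only set bit is the LSB: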
2 - 0.0···1~w-1~ = ==2 - 2^-w^==
:::
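The formulas in the block above can be verified with exact rational arithmetic. This is a sketch under stated assumptions: the function names are introduced here for illustration, and because the notes use `w` for both the exponent width and the mantissa width, the parameters are named `exp_bits` and `mantissa_bits` to keep the two apart.

```python
from fractions import Fraction


def exponent_limits(exp_bits: int):
    """emax = bias = 2^(w-1) - 1 and emin = 1 - emax, for a w-bit exponent."""
    emax = 2 ** (exp_bits - 1) - 1
    emin = 1 - emax
    return emin, emax


def max_denormalized_fraction(mantissa_bits: int) -> Fraction:
    """Value of 0.11...1 in binary: 1/2 + 1/4 + ... + 1/2^w."""
    return sum(Fraction(1, 2**i) for i in range(1, mantissa_bits + 1))


def max_normalized_significand(mantissa_bits: int) -> Fraction:
    """Value of 1.11...1 in binary: implicit leading 1 plus an all-ones fraction."""
    return 1 + max_denormalized_fraction(mantissa_bits)


print(exponent_limits(8))              # (-126, 127)
print(max_denormalized_fraction(3))    # 7/8
print(max_normalized_significand(23) == 2 - Fraction(1, 2**23))  # True
```

With `mantissa_bits = 23` the normalized maximum is 2 - 2^-23^, the factor that appears in the single-precision maximum value quoted from [^4].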
[^1]: Different programming languages give similar results: [Floating Point Math](http://0.30000000000000004.com/)
in [你所不知道的 C 語言:數值系統篇](https://hackmd.io/@jserv/rJzclA2q-/https%3A%2F%2Fhackmd.io%2Fs%2FBkRKhQGae?type=book)
[^2]: [IEEE Standard 754 Floating Point Numbers](https://www.geeksforgeeks.org/ieee-standard-754-floating-point-numbers/)
[^3]: 3.4 Binary interchange format encodings in [IEEE Std 754™-2008 (Revision of IEEE Std 754-1985), IEEE Standard for Floating-Point Arithmetic](https://irem.univ-reunion.fr/IMG/pdf/ieee-754-2008.pdf)
[^4]: Wikipedia: [Single-precision floating-point format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)