<style>
.red {
color: red;
font-weight: bold;
}
.blue {
color: blue;
font-weight: bold;
}
.green {
color: green;
font-weight: bold;
}
.gb {
background-color: lightgreen;
font-weight: bold;
}
</style>
## Fixed-Point Numbers
### Preliminary
- In a fixed-point(定點數)number system, each number has exactly the same number of digits, and the point is always in the **same place**.
- Ex.
- 0.53, 5.12, 9.11
- Ordinary integers: 2, 53, 120
- 小數點在最後一位,前面有無限多個 0
- $(11.10)_2$, $(01.10)_2$, $(00.11)_2$
- In the computer ***the binary point is not stored anywhere.*** (電腦不會存小數點,他自己知道在哪裡)
### Range and precision in fixed-point numbers
- **Range** of expressible numbers
- The distance between the **largest and smallest numbers**.
- **Precision**
- The distance between **two adjacent numbers** on a **number line**.
- <font class="blue">Error</font>
- **1/2 of the difference** between two adjacent numbers.
:::info
**Example**
- Three digits, the decimal point is placed two digits from right
- Range: [0.00, 9.99]
- Precision: 0.01
- Error: 0.005
:::
### The associative law of algebra does not always hold in computers
- Associative law of algebra: a + (b + c) = (a + b) + c
:::info
**Example**
- Range: [-9, 9]
- (a, b, c) = (7, 4, -3)
- a + (b + c) = 7 + (4 + -3) = 8
- (a + b) + c = (7 + 4) + -3 = 11 + -3 $\rightarrow$ **11 is outside the range**
:::
- To detect ***overflow*** wherever it occurs
## Radix number systems
- The **base** or **radix** of a number system defines the range of possible values that a digit may have.
- <font class="blue">*Most Significant Bit (MSB)* vs. *Least Significant Bit (LSB)*</font>
- **MSB**: the leftmost bit (最左邊那個 bit)
- **LSB**: the rightmost bit (最右邊那個 bit)


### Signed fixed-point numbers
- Four commonly used ***signed*** representations

#### Signed magnitude
- The ***leftmost*** bit is used for the sign (<font class="green">*definition*</font>) (<font class="blue">0 for positive, 1 for negative</font>)
- For a k-bit number
- Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$]
- ex. k = 3, Range: [$-(2^{3-1}-1),\ 2^{3-1}-1$] = [-3, 3]
- $2^k-1$ different numbers can be represented (減 1: 因為 +0 = -0)
#### One's complement (1 補數)
- 0 $\leftrightarrow$ 1 (complementing)
- The leftmost bit is also the sign bit (<font class="green">*conclusion*</font>)(<font class="blue">0 for positive, 1 for negative</font>)
- For a k-bit number
- Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$]
- $2^k-1$ different number can be represented
- Practical application
- ***Checksums*** in the headers of network packets
#### Two's complement (2 補數)
- 0 $\leftrightarrow$ 1 than add 1, **discard the carry** out from the MSB
- For a k-bit number
- Range: [$-2^{k-1}$, $2^{k-1}-1$]
- $2^k$ different numbers can be represented (without -0, so we can get one more negative number than one's complement)
- The most commonly used representation in conventional computers
#### Excess (biased) representation
- Two's complement form plus a ***bias*** (將二補數加上 bias)
- Ex. excess 128
- $(+12)_{10}\ =\ (10001100)_2\ \ \ (-12)_{10}\ =\ (01110100)_2$
- $(-128)_{10}\ =\ (00000000)_2\ \ \ (+127)_{10}\ =\ (11111111)_2$
- $+0\ =\ (10000000)_2\ \ \ (-0)\ =\ (10000000)_2$
- The leftmost bit is the sign bit (<font class="red">1 for positve, 0 for negative</font>)
- <font class="gb">Numerically smaller numbers have smaller bit patterns</font>
(數字大小照著 bit pattern 排序,所以很好判斷大小關係)
- For a k-bit number
- Range: [$-2^{k-1}$, $2^{k-1}-1$]
- $2^k$ different numbers can be represented
### Sign extension
- A number is placed into a **larger container**
- To pad the left side with 0s
- **Positive number: correct**
- **Negative number: incorrent**
:::info
**Example**:
- $+12_{10}$: 8-bit $\rightarrow$ 16-bit
0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**)
$\Rightarrow$ **correct!**
- $-12_{10}$: 8-bit $\rightarrow$ 16-bit
1111 0100 (sign: **-**) $\rightarrow$ 0000 0000 1111 0100 (sign: **+**)
$\Rightarrow$ **incorrect!**
:::
- To copy the sign bit to the left side: ***sign extension***
:::info
**Example**:
- $+12_{10}$: 8-bit $\rightarrow$ 16-bit
0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**)
$\Rightarrow$ **correct!**
- $-12_{10}$: 8-bit $\rightarrow$ 16-bit
1111 0100 (sign: **-**) $\rightarrow$ 1111 1111 1111 0100 (sign: **-**)
$\Rightarrow$ **correct!**
:::
- A number is placed into a smaller container
- To remove bits on the left side
:::info
**Example**:
$-12_{10}$: 16-bit $\rightarrow$ 8-bit
1111 1111 1111 0100 $\rightarrow$ 1111 0100
:::
### 2 補數表示的數字轉 10 進位
$$
\displaystyle{\rm{Value} = \sum^{n - 1}_{i = -m} b_i \times k^i\quad{MSB \times (-1)}}
$$
:::warning
**Example**
- $45 = (0101101)_2 = 2^5 + 2^3 + 2^2 + 2^0$
- $-45 = (1010011)_2 = -2^6 + 2^4 + 2^1 + 2^0$
:::
### 數字轉成 2 補數,用最少 bit 數表示
- 記 range 公式
- 正數直接寫出,左側補 0 (MSB 會為 0)
- 負數由正數轉換 (MSB 會為 1)
### 非整數負數的 2 補數表示法
- $(19.25)_{10} = (010011.01)_2$
- $(-19.26)_{10} = (?)_2$
:::info
**法一**
$$
\begin{split}
(-19.25)_{10} &= (-20)_{10} + (0.75)_{10} \\
&= (101100)_2 + (0.11)_2 \\
&= (101100.11)_2
\end{split}
$$
:::
:::success
**法二**
$$
(19.25)_{10} = (010011.01)_2
$$
0, 1 互換,**bit pattern** + 1
$$
(010011.01)_2 \Rightarrow (101100.10)_2 \Rightarrow (101100.11)_2
$$
:::
## Addition and Subtraction
### Binary addition and subtraction
- Digits are added bit by bit from **right to left**, with carries passed to the next digit to the left
- 減法就是加負數 (利用二補數)
### Overflow
- To occur when the result from an operation cannot be represented with the available hardware
- The carry-out of the leftmost bit is discarded because the number system is ***modular***
#### When overflow **cannot** occur
- **adding positive and negative** operands (異號相加)
- **subtracting** two operands of the **same sign** (同號相減)
#### How to detect overflow?
- <font class="blue">同號數相加卻出現異號數</font>
- Ex. (+) + (+) = (-)
- 異號數相加不可能 overflow
- <font class="blue">負減正卻出現正數 or 正減負卻出現負數</font>
- Ex. (-) - (+) = (+)
- 同號數相減不可能 overflow
#### MIPS detects overflow with an *exception*
- An unscheduled procedure call
- The processor jumps to a **predefined address** to invoke the appropriate routine
- The address caused the exception is saved in **register EPC (Exception Program Counter)**
- In some situations the program can continue after corrective code is executed
- Exception vs. ***Interrupt*** (in MIPS processor)
- Exception refers to any unexpected change in control flow **without distinguishing whether the cause is internal or external**
- **Interrupt is an exception that comes from outside of the processor**
#### *Saturating* operations 飽和控制
- When a calculation overflows, the result is set to the largest positive number or most negative number
- Suitable for media operations
- 例如遙控器控制音量,到最大就維持在最大值了。
## Multiplication
### Unsigned multiplication
- ***Multiplicand*** $\times$ ***Multiplier*** = ***Product*** (被乘數 $\times$ 乘數 = 積)
- The number of digits in the product
- ***n* bit** multiplicand $\times$ ***m* bit** multiplier = **(*n+m*) bit** product (ignore the sign bits)
- **Sequential version** of the multiplication algorithm and hardware
- Multiplier is in the 32 bit **Multiplier** register
- The 64 bit **Product** register is initialized to 0
- A 64 bit **Multiplicand** register is initialized with the 32 bit multiplicand in the right half and zero in the left half
- The **Multiplier** register is shifted right 1 bit each step
- The **Multiplicand** register is shifted left 1 bit each step to align the multiplicand with the sum being accumulated in the 64 bit **Product** register



- **Refined version** of the multiplication hardware
- To perform operations in **parallel**: 1 clock cycle per step
- The multiplier and multiplicand are shifted while the multiplicand is added to the product if the multiplier bit is a 1
- The hardware has to ensure that it tests the LSB of the multiplier and gets the pre-shifted version of the multiplicand
- To halve the width of the adder and registers

### Signed multiplication
- Leaving the signs out of the calculation (<font class="green">都當成正的,算完再判斷正負</font>)
- To convert the multiplier and multiplicand to positive numbers and then remember the original signs
- The algorithms should be run for 31 iterations
- The algorithm in Fig. 3.4 will work for signed numbers
- The shifting steps would need to **extend the sign of the product**
- When the algorithm completes, the **lower word** would have the 32-bit product

### Faster multiplication
Whether the multiplicand is to be added or not is known at the beginning of the multiplication by looking at each of the 32 multiplier bits.
- To provide one 32-bit adder for each bit of the multiplier
- Two input bits
- One input is the multiplicand ANDed with a multiplier
- One input is the output of a prior adder
- To organize these 32 additions
- To connect the outputs of adders on the right to the inputs of adders on the left
- Parallel tree

## Division
### Unsigned Division
- ***Dividend / Divisor = Quotient + Remainder***
***Dividend = Quotient $\times$ Divisor + Remainder***
- Steps
- To see how big a number can be subtracted
- To create a digit of the quotient on each attempt
#### First version of the division hardware
- The 32 bit **Quotient** register is initialized to 0
- The divisor is placed in the **left half** of the 64 bit **Divisor** register
- The **Remainder** register is initialized with the dividend (sign extension)
- The **Divisor** register is **shifted right** 1 bit each step to align with the dividend
- The **Quotient** register is **shifted left** 1 bit each step

- **Step 1.** ***subtract*** the **Divisor** register from the **Remainder** register
- **Step 2a.** if the result is **positive**, we **generate a 1 in the quotient**
- **Step 2b.** if the result is **negative**, we **restore the original value to the remainder** and **generate a 0 in the quotient**
- **Step 3.** the **Divisor** register is **shifted right** 1 bit
- Above steps are repeated


#### An improved version of the division hardware
- To shift the operands and the quotient simultaneously with the subtraction
- To halve the width of the adder and registers
- The **Remainder** register is shifted left
- To combine the **Quotient** register with the right half of the **Remainder** register

### Signed Division
- To remember the signs of the divisor and dividend and then negate the quotient if the signs disagree
- <font class="red">**The dividend and remainder must have the same signs**</font>
- $+7 \div -2 = -3 ... +1$
- $-7 \div -2 = +3 ... -1$
- <font class="green">先當成正數算,再來判斷正負,被除數跟餘數要同號。</font>
### Faster Division: SRT division
- Try to **predict** several quotient bits per step
- Using a table lookup based on the upper bits of the dividend and remainder
- 6 bits from the remainder and 4 bits from the divisor to index a table
- Subsequent steps to correct wrong predictions
- A typical value today is 4 bits
- The accuracy depends on having proper values in the lookup table
## Floating Point
### Preliminary
- Programming languages support numbers with **fractions**
- ***Floating point***
- Computer arithmetic that represent numbers in which the ***decimal point*** (or ***binary point***) is not fixed
- ***Scientific notation***
- Numbers with a single digit to the left of the decimal point
- ***Normalized***
- A number in floating-point notation that has no leading 0s
### Float-point representation
- General form of a floating-number: $(-1)^s \times F \times 2^E$
- <font class="blue">***s***</font> is the **sign** of the floating-point number
- <font class="blue">***F: Fraction***</font> (**mantissa**)
- A value $1.xxx..._2$ to determine the **precision**
- Larger F enhances the precision of the fraction
- <font class="blue">***E: exponent***</font>
- A signed integer to determine the **range**
- Larger E increases the range of numbers that can be represented
- A tradeoff is between precision and range
### IEEE 754 floating-point standard
- **Single precision** (單精度)
- A **floating-point** value represented in a single **32 (1 + 8 + 23) bit word**
- A separate **sign bit**: <font class="red">0 for positive, 1 for negative</font>
- <font class="green">8 bit exponent field</font> (including the sign of the exponent)
- <font class="green">23 bit fraction field</font>
- Range: $2.0_{10} \times 10^{-38} \sim 2.0_{10} \times 10^{38}$

- Overflow
- A positive exponent is too large to be represented in the exponent field
- Underflow
- A negative exponent is too large to fit in the exponent field
- **Double precision**
- A floating-point value represented in two 32-bit words
- <font class="green">Sign bit, 11 bit exponent field, 52 bit fraction field</font>
- Range: $2.0_{10} \times 10^{-308} \sim 2.0_{10} \times 10^{308}$
- Greater precision due to the much larger fraction

- Making the leading 1-bit of normalized binary numbers implicit (hidden bit)
(省略 fraction 的小數點跟最左邊的 1,避免浪費)
- The significand is actually 24-bit long in single precision and 53-bit long in double precision
- Since 0 has no leading 1, it needs a special format to represent (0 是例外情況)
- ==Five formats==
- $00...00_2$ represent **0** (所以 0 佔了兩個 bit pattern, s=0 or 1)
- The representaion of $\pm$**floating-point number**
- $(-1)^s \times (1 + Fraction) \times 2^{(Exponent - Bias)}$
- Using a bias of **127** for single precision and **1023** for double precision
- Range of single precision number
- $\pm1.00000000000000000000000 \times 2^{-126}$ ~ $\pm1.11111111111111111111111 \times 2^{+127}$
- The largest exponent is reserved for $\pm$**infinity** and ***NaN (not a number)***
- $\pm$infinity can represent the result of a divided by 0
- NaN can represent invalid operations, such as 0/0 or subtracting infinity from infinity
- The representation of $\pm$**denormalized number**
- ==Without== the leading 1-bit
- $(-1)^s \times Fraction \times 2^{-126}$ in single precision and $(-1)^s \times Fraction \times 2^{-1022}$ in double precision
- To represent numbers between 0 and the smallest (or largest) positive (or negative) number in the $\pm$floating-point number type

:::info
**Example**. Show the IEEE 754 binary representation of the number $-0.75_{10}$
- $-0.75_{10} = (-3 / 4)_{10} = (-3 / 2^2)_{10} = (-3 \times 2^{-2})_{10} = -0.11_2$
:::

:::success
- a.
- Sign: positive (+) $\rightarrow$ 0
- Exponent: 5 + 127 (bias) = 132 (1000 0100)
- Fraction: 101
- b.
- Sign: negative (-) $\rightarrow$ 1
- Exponent: -126 + 127 = 1 (0000 0001)
- Fraction" 010 1100
- c.
- Sign: positive (+) $\rightarrow$ 0
- Exponent: 127 + 127 = 254 (1111 1110)
- Fraction: 000
:::
:::danger
**必考題**
**Q.** IEEE 754 single precision, 在數線上能表示出來的數字有幾個?
**A.**
法一: (減)
- floating-point + denormalized: 總共有 32 bit, 所以有 $2^{32}$ 種
- 去掉 NaN 跟 $\infty$ 的 case: exponent 固定全為 1, 所以剩下 32 - 8 = 24 個 bit, 共 $2^{24}$ 種。
- 所以總共 $2^{32} - 2^{24} - 1$ 種 (扣掉重複的0: -0)
法二: (加)

- 0: 1 種
- denormalized: 因為 exponent 固定為 0, 然後 fraction 不可為 0 ($2^{23} - 1$), 所以共有 $2 \times 1 \times (2^{23} - 1) = 2^{24} - 2$ 種
- normalized: 因為 exponent 為 1 ~ 254, 所以共有 $2 \times 254 \times 2^{23} = 254 \times 2^{24}$ 種
- 所以總共有
$$
1 + (2^{24} - 2) + (254 \times 2^{24}) \
= 255 \times 2^{24} - 1 \
= (2^8 - 1) \times 2^{24} - 1 \
= 2^{32} - 2^{24} - 1
$$
:::
### Floating-point addition
1. align the decimal point of the number that has the **smaller** exponent
2. add the significands
3. adjust the sum to normalized form
4. round the number
:::info
**Example.** $9.999_{10} \times 10^1 + 1.610_{10} \times 10^{-1}$
Assume we can store 4 decimal digits of the significand and 2 decimal digits of the exponent.
1. $1.610_{10} \times 10^{-1}\ =\ 0.1610_{10} \times 10^0\ =\ 0.01610_{10} \times 10^1$
- we can represent only 4 decimal digits $\Rightarrow 0.016 \times 10^1$
2. $9.999_{10} + 0.016_{10} = 10.015_{10} \times 10^1$
3. $10.015_{10} \times 10^1 = 1.0015_{10} \times 10^2$
4. $1.0015_{10} \times 10^2 \Rightarrow 1.002_{10} \times 10^2$
:::
### Accurate arithmetic
- Floating-point numbers are normally approximations for a number they can’t really represent
- The computer numbers have limited size and hence limited precision
- Getting the floating-point representation close to the actual number
- It requires extra bits in the calculation to round accurately
- If every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round
- IEEE 754 always **keeps two extra bits on the right** during intermediate additions
- ***Guard** bit and **round** bit*
:::info
**Example** $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2$
- Round to the nearest decimal number with 3 significant decimal digits
- With guard and round digits
- $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02\underline{56}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.3656_{10} \times 10^2$
- *We have two digits to round* $\Rightarrow 2.37_{10} \times 10^2$
- 0 ~ 49 to round down, 51~99 to round up, 50 is the tiebreaker
- Without guard and round digits
- To drop two digits from the calculation
- $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02_{10} + 2.34_{10}) \times 10^2 = 2.36_{10} \times 10^2$
:::
---
- IEEE 754 has four rounding modes
- **Always round up** (toward $+\infty$) (無條件進位)
- **Always round down** (toward $-\infty$) (無條件捨去)
- **Truncate** (直接扣掉 ex. 4.6 = 4, -4.6 = -4)
- **Round to nearest even**: <font class="blue">the most commonly used</font>
- tiebreake(進位或退位的誤差相同)的時候找最近的偶數 (ex. 3.5 = 4, 4.5 = 4)
- 二進位: ex. $1.0110\underline{10}_2 \rightarrow 1.0110_2$ (truncate) $\ \ \ 1.0111\underline{10}_2 \rightarrow 1.1000_2$ (add 1)
| | $+\infty$ | $-\infty$ | Truncate | Nearest Even |
|:---:|:---:|:---:|:---:|:---:|
| 4.6 | 5 | 4 | 4 | 5 |
| -4.6 | -4 | -5 | -4 | -5 |
| 3.5 | 4 | 3 | 3 | 4 |
| 4.5 | 5 | 4 | 4 | 4 |
- Sticky bit
- A bit used in rounding in addition to grand and round bits
- Sticky bit is set whenever there are **nonzero bits to the right of the round bit**
:::info
**Example** $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2$
- $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2 = (0.00\underline{50}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.34\underline{50}_{10} \times 10^2$
- The sticky bit is set since there are nonzero bits to the right ($0.0050[1]$)
- With the sticky bit to remember that the number is actually larger than $2.34\underline{50}_{10} \Rightarrow \text{Round to } 2.35_{10} \times 10^2$
- Without the sticky bit $\Rightarrow \text{Round to } 2.34_{10} \times 10^2$
:::
## Fallacies and Pitfalls
- Fallacies
- Just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as an integer division by a power of 2
- Above statement is only true for unsigned integers
- The problem for signed integer
- Solution: to have an arithmetic right shift that extends the sign bit instead of shifting in 0s
- Parallel execution strategies that work for integer data types also work for floating-point data types
- Pitfalls
- Floating-point addition is not associative
- Associativity holds for a sequence of 2’s complement integer additions, even if the computation overflows
- Associativity does not hold for floating-point numbers because
- Floating-point numbers are approximations of real numbers
- Computer arithmetic has limited precision
- Problems occur when adding two large numbers of opposite signs plus a small number
- Example
<!-- TODO -->
- The MIPS instruction add immediate unsigned (addiu) sign-extends its 16-bit immediate field