---
# System prepended metadata

title: 'Ch02: Arithmetic for Computers'
tags: [祭祖]

---

<style>
  .red {
    color: red;
    font-weight: bold;
  }
  
  .blue {
    color: blue;
    font-weight: bold;
  }
  
  .green {
    color: green;
    font-weight: bold;
  }
  
  .gb {
    background-color: lightgreen;
    font-weight: bold;
  }
</style>

## Fixed-Point Numbers

### Preliminary

- In a fixed-point（定點數）number system, each number has exactly the same number of digits, and the point is always in the **same place**.
- Ex.
  - 0.53, 5.12, 9.11
  - Ordinary integers: 2, 53, 120
    - 小數點在最後一位，前面有無限多個 0
  - $(11.10)_2$, $(01.10)_2$, $(00.11)_2$ 
- In the computer ***the binary point is not stored anywhere.*** (電腦不會存小數點，他自己知道在哪裡)

### Range and precision in fixed-point numbers

- **Range** of expressible numbers
  - The distance between the **largest and smallest numbers**.
- **Precision**
  - The distance between **two adjacent numbers** on a **number line**.
- <font class="blue">Error</font>
  - **1/2 of the difference** between two adjacent numbers.

:::info
**Example**
- Three digits, the decimal point is placed two digits from right
  - Range: [0.00, 9.99]
  - Precision: 0.01
  - Error: 0.005
:::

### The associative law of algebra does not always hold in computers

- Associative law of algebra: a + (b + c) = (a + b) + c
:::info
**Example**
- Range: [-9, 9] 
- (a, b, c) = (7, 4, -3)
- a + (b + c) = 7 + (4 + -3) = 8
- (a + b) + c = (7 + 4) + -3 = 11 + -3 $\rightarrow$ **11 is outside the range**
:::

- To detect ***overflow*** wherever it occurs

## Radix number systems

- The **base** or **radix** of a number system defines the range of possible values that a digit may have.
- <font class="blue">*Most Significant Bit (MSB)* vs. *Least Significant Bit (LSB)*</font>
  - **MSB**: the leftmost bit (最左邊那個 bit)
  - **LSB**: the rightmost bit (最右邊那個 bit)

![](https://hackmd.io/_uploads/S1EJFWYlp.png =600x)

![](https://hackmd.io/_uploads/r1FXKWFe6.png =600x)

### Signed fixed-point numbers

- Four commonly used ***signed*** representations
  ![](https://hackmd.io/_uploads/SJKrF-Kep.png =500x)

#### Signed magnitude

- The ***leftmost*** bit is used for the sign (<font class="green">*definition*</font>) (<font class="blue">0 for positive, 1 for negative</font>)
- For a k-bit number
  - Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$]
    - ex. k = 3, Range: [$-(2^{3-1}-1),\ 2^{3-1}-1$] = [-3, 3]
  - $2^k-1$ different numbers can be represented (減 1: 因為 +0 = -0)

#### One's complement (1 補數)

- 0 $\leftrightarrow$ 1 (complementing)
  - The leftmost bit is also the sign bit (<font class="green">*conclusion*</font>)(<font class="blue">0 for positive, 1 for negative</font>)
- For a k-bit number 
  - Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$]
  - $2^k-1$ different number can be represented
- Practical application
  - ***Checksums*** in the headers of network packets

#### Two's complement (2 補數)

- 0 $\leftrightarrow$ 1 than add 1, **discard the carry** out from the MSB
- For a k-bit number
  - Range: [$-2^{k-1}$, $2^{k-1}-1$]
  - $2^k$ different numbers can be represented (without -0, so we can get one more negative number than one's complement)
- The most commonly used representation in conventional computers

#### Excess (biased) representation

- Two's complement form plus a ***bias*** (將二補數加上 bias) 
- Ex. excess 128
  - $(+12)_{10}\ =\ (10001100)_2\ \ \ (-12)_{10}\ =\ (01110100)_2$
  - $(-128)_{10}\ =\ (00000000)_2\ \ \ (+127)_{10}\ =\ (11111111)_2$
  - $+0\ =\ (10000000)_2\ \ \ (-0)\ =\ (10000000)_2$
  - The leftmost bit is the sign bit (<font class="red">1 for positve, 0 for negative</font>)
  - <font class="gb">Numerically smaller numbers have smaller bit patterns</font> 
    (數字大小照著 bit pattern 排序，所以很好判斷大小關係)
- For a k-bit number
  - Range: [$-2^{k-1}$, $2^{k-1}-1$]
  - $2^k$ different numbers can be represented

### Sign extension

- A number is placed into a **larger container**
  - To pad the left side with 0s
    - **Positive number: correct**
    - **Negative number: incorrent**
    :::info
    **Example**:

    - $+12_{10}$: 8-bit $\rightarrow$ 16-bit
      0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**)
      $\Rightarrow$ **correct!**

    - $-12_{10}$: 8-bit $\rightarrow$ 16-bit
      1111 0100 (sign: **-**) $\rightarrow$ 0000 0000 1111 0100 (sign: **+**)
      $\Rightarrow$ **incorrect!**
    :::
  - To copy the sign bit to the left side: ***sign extension***
    :::info
    **Example**:
    - $+12_{10}$: 8-bit $\rightarrow$ 16-bit
      0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**)
      $\Rightarrow$ **correct!**
      
    - $-12_{10}$: 8-bit $\rightarrow$ 16-bit
        1111 0100 (sign: **-**) $\rightarrow$ 1111 1111 1111 0100 (sign: **-**)
        $\Rightarrow$ **correct!**
    :::
- A number is placed into a smaller container
  - To remove bits on the left side
    :::info
    **Example**:
    
    $-12_{10}$: 16-bit $\rightarrow$ 8-bit
    1111 1111 1111 0100 $\rightarrow$ 1111 0100
    :::

### 2 補數表示的數字轉 10 進位

$$
\displaystyle{\rm{Value} = \sum^{n - 1}_{i = -m} b_i \times k^i\quad{MSB \times (-1)}}
$$

:::warning
**Example**
- $45 = (0101101)_2 = 2^5 + 2^3 + 2^2 + 2^0$
- $-45 = (1010011)_2 = -2^6 + 2^4 + 2^1 + 2^0$
:::

### 數字轉成 2 補數，用最少 bit 數表示

- 記 range 公式
- 正數直接寫出，左側補 0 (MSB 會為 0)
- 負數由正數轉換 (MSB 會為 1)

### 非整數負數的 2 補數表示法

- $(19.25)_{10} = (010011.01)_2$
- $(-19.26)_{10} = (?)_2$

:::info
**法一**
$$
\begin{split}
(-19.25)_{10} &= (-20)_{10} + (0.75)_{10} \\
&= (101100)_2 + (0.11)_2 \\
&= (101100.11)_2
\end{split}
$$
:::

:::success
**法二**
$$
(19.25)_{10} = (010011.01)_2
$$
0, 1 互換，**bit pattern** + 1

$$
(010011.01)_2 \Rightarrow (101100.10)_2 \Rightarrow (101100.11)_2
$$
:::

## Addition and Subtraction

### Binary addition and subtraction

- Digits are added bit by bit from **right to left**, with carries passed to the next digit to the left
- 減法就是加負數 (利用二補數)

### Overflow

- To occur when the result from an operation cannot be represented with the available hardware
- The carry-out of the leftmost bit is discarded because the number system is ***modular***

#### When overflow **cannot** occur

- **adding positive and negative** operands (異號相加)
- **subtracting** two operands of the **same sign** (同號相減)

#### How to detect overflow?

- <font class="blue">同號數相加卻出現異號數</font>
  - Ex. (+) + (+) = (-)
- 異號數相加不可能 overflow
- <font class="blue">負減正卻出現正數 or 正減負卻出現負數</font>
  - Ex. (-) - (+) = (+)
- 同號數相減不可能 overflow

#### MIPS detects overflow with an *exception*

- An unscheduled procedure call
  - The processor jumps to a **predefined address** to invoke the appropriate routine
  - The address caused the exception is saved in **register EPC (Exception Program Counter)**
  - In some situations the program can continue after corrective code is executed
- Exception vs. ***Interrupt*** (in MIPS processor)
  - Exception refers to any unexpected change in control flow **without distinguishing whether the cause is internal or external**
  - **Interrupt is an exception that comes from outside of the processor**

#### *Saturating* operations 飽和控制
- When a calculation overflows, the result is set to the largest positive number or most negative number
- Suitable for media operations 
  - 例如遙控器控制音量，到最大就維持在最大值了。

## Multiplication

### Unsigned multiplication

- ***Multiplicand*** $\times$ ***Multiplier*** = ***Product*** (被乘數 $\times$ 乘數 = 積)
- The number of digits in the product
  - ***n* bit** multiplicand $\times$ ***m* bit** multiplier = **(*n+m*) bit** product (ignore the sign bits)
- **Sequential version** of the multiplication algorithm and hardware
  - Multiplier is in the 32 bit **Multiplier** register
  - The 64 bit **Product** register is initialized to 0
  - A 64 bit **Multiplicand** register is initialized with the 32 bit multiplicand in the right half and zero in the left half
  - The **Multiplier** register is shifted right 1 bit each step
  - The **Multiplicand** register is shifted left 1 bit each step to align the multiplicand with the sum being accumulated in the 64 bit **Product** register
  ![](https://hackmd.io/_uploads/ByxS3_cx6.png =400x)
  ![](https://hackmd.io/_uploads/B1yZhuqlT.png =400x)
  ![](https://hackmd.io/_uploads/H1cuVl3xa.png)
- **Refined version** of the multiplication hardware
  - To perform operations in **parallel**: 1 clock cycle per step
    - The multiplier and multiplicand are shifted while the multiplicand is added to the product if the multiplier bit is a 1
    - The hardware has to ensure that it tests the LSB of the multiplier and gets the pre-shifted version of the multiplicand
  - To halve the width of the adder and registers
  ![](https://hackmd.io/_uploads/SJmDFx2ep.png =500x)

### Signed multiplication

- Leaving the signs out of the calculation (<font class="green">都當成正的，算完再判斷正負</font>)
  - To convert the multiplier and multiplicand to positive numbers and then remember the original signs
  - The algorithms should be run for 31 iterations
- The algorithm in Fig. 3.4 will work for signed numbers
  - The shifting steps would need to **extend the sign of the product**
  - When the algorithm completes, the **lower word** would have the 32-bit product

![](https://hackmd.io/_uploads/B1zC3VngT.png)

### Faster multiplication

Whether the multiplicand is to be added or not is known at the beginning of the multiplication by looking at each of the 32 multiplier bits.

- To provide one 32-bit adder for each bit of the multiplier
  - Two input bits
    - One input is the multiplicand ANDed with a multiplier
    - One input is the output of a prior adder
  - To organize these 32 additions
    - To connect the outputs of adders on the right to the inputs of adders on the left
    - Parallel tree

![](https://hackmd.io/_uploads/HJJ0643xp.png =800x)

## Division

### Unsigned Division

- ***Dividend / Divisor = Quotient + Remainder***
  ***Dividend = Quotient $\times$ Divisor + Remainder***
- Steps
  - To see how big a number can be subtracted
  - To create a digit of the quotient on each attempt

#### First version of the division hardware

- The 32 bit **Quotient** register is initialized to 0
- The divisor is placed in the **left half** of the 64 bit **Divisor** register
- The **Remainder** register is initialized with the dividend (sign extension)
- The **Divisor** register is **shifted right** 1 bit each step to align with the dividend
- The **Quotient** register is **shifted left** 1 bit each step

![](https://hackmd.io/_uploads/rJtClY5la.png =600x)

- **Step 1.** ***subtract*** the **Divisor** register from the **Remainder** register
- **Step 2a.** if the result is **positive**, we **generate a 1 in the quotient**
- **Step 2b.** if the result is **negative**, we **restore the original value to the remainder** and **generate a 0 in the quotient**
- **Step 3.**  the **Divisor** register is **shifted right** 1 bit
- Above steps are repeated
![](https://hackmd.io/_uploads/H1t6-Y5xp.png =500x)

![](https://hackmd.io/_uploads/ByqkcLnxa.png)

#### An improved version of the division hardware

- To shift the operands and the quotient simultaneously with the subtraction
- To halve the width of the adder and registers
  - The **Remainder** register is shifted left
  - To combine the **Quotient** register with the right half of the **Remainder** register

![](https://hackmd.io/_uploads/HySnQK5gp.png =800x)

### Signed Division

- To remember the signs of the divisor and dividend and then negate the quotient if the signs disagree
- <font class="red">**The dividend and remainder must have the same signs**</font>
  - $+7 \div -2 = -3 ... +1$
  - $-7 \div -2 = +3 ... -1$
  - <font class="green">先當成正數算，再來判斷正負，被除數跟餘數要同號。</font>

### Faster Division: SRT division

- Try to **predict** several quotient bits per step
  - Using a table lookup based on the upper bits of the dividend and remainder
    - 6 bits from the remainder and 4 bits from the divisor to index a table
  - Subsequent steps to correct wrong predictions
  - A typical value today is 4 bits
- The accuracy depends on having proper values in the lookup table

## Floating Point

### Preliminary

- Programming languages support numbers with **fractions**
- ***Floating point***
  - Computer arithmetic that represent numbers in which the ***decimal point*** (or ***binary point***) is not fixed 
- ***Scientific notation***
  - Numbers with a single digit to the left of the decimal point
- ***Normalized***
  - A number in floating-point notation that has no leading 0s

### Float-point representation

- General form of a floating-number: $(-1)^s \times F \times 2^E$
  - <font class="blue">***s***</font> is the **sign** of the floating-point number
  - <font class="blue">***F: Fraction***</font> (**mantissa**)
    - A value $1.xxx..._2$ to determine the **precision**
    - Larger F enhances the precision of the fraction
  - <font class="blue">***E: exponent***</font>
    - A signed integer to determine the **range**
    - Larger E increases the range of numbers that can be represented 
  - A tradeoff is between precision and range

### IEEE 754 floating-point standard

- **Single precision** (單精度)
  - A **floating-point** value represented in a single **32 (1 + 8 + 23) bit word**
  - A separate **sign bit**: <font class="red">0 for positive, 1 for negative</font>
  - <font class="green">8 bit exponent field</font> (including the sign of the exponent)
  - <font class="green">23 bit fraction field</font>
  - Range: $2.0_{10} \times 10^{-38} \sim 2.0_{10} \times 10^{38}$
  ![](https://hackmd.io/_uploads/HkhbnY5x6.png)
- Overflow
  - A positive exponent is too large to be represented in the exponent field
- Underflow
  - A negative exponent is too large to fit in the exponent field
- **Double precision**
  - A floating-point value represented in two 32-bit words
  - <font class="green">Sign bit, 11 bit exponent field, 52 bit fraction field</font>
  - Range: $2.0_{10} \times 10^{-308} \sim 2.0_{10} \times 10^{308}$
  - Greater precision due to the much larger fraction
  ![](https://hackmd.io/_uploads/rkiYatcgp.png)
- Making the leading 1-bit of normalized binary numbers implicit (hidden bit)
  (省略 fraction 的小數點跟最左邊的 1，避免浪費)
  - The significand is actually 24-bit long in single precision and 53-bit long in double precision
  - Since 0 has no leading 1, it needs a special format to represent (0 是例外情況)
- ==Five formats==
  - $00...00_2$ represent **0** (所以 0 佔了兩個 bit pattern, s=0 or 1)
  - The representaion of $\pm$**floating-point number**
    - $(-1)^s \times (1 + Fraction) \times 2^{(Exponent - Bias)}$
    - Using a bias of **127** for single precision and **1023** for double precision
    - Range of single precision number
      - $\pm1.00000000000000000000000 \times 2^{-126}$ ~ $\pm1.11111111111111111111111 \times 2^{+127}$
  - The largest exponent is reserved for $\pm$**infinity** and ***NaN (not a number)***
    - $\pm$infinity can represent the result of a divided by 0
    - NaN can represent invalid operations, such as 0/0 or subtracting infinity from infinity
  - The representation of $\pm$**denormalized number**
    - ==Without== the leading 1-bit
    - $(-1)^s \times Fraction \times 2^{-126}$ in single precision and $(-1)^s \times Fraction \times 2^{-1022}$ in double precision
    - To represent numbers between 0 and the smallest (or largest) positive (or negative) number in the $\pm$floating-point number type

![](https://hackmd.io/_uploads/SyYiaFcea.png)

:::info
**Example**. Show the IEEE 754 binary representation of the number $-0.75_{10}$

- $-0.75_{10} = (-3 / 4)_{10} = (-3 / 2^2)_{10} = (-3 \times 2^{-2})_{10} = -0.11_2$
:::
  


![](https://hackmd.io/_uploads/HyOsWq9gp.png =800x)

:::success
- a.
  - Sign: positive (+) $\rightarrow$ 0
  - Exponent: 5 + 127 (bias) = 132 (1000 0100)
  - Fraction: 101
- b. 
  - Sign: negative (-) $\rightarrow$ 1
  - Exponent: -126 + 127 = 1 (0000 0001)
  - Fraction" 010 1100
- c.
  - Sign: positive (+) $\rightarrow$ 0
  - Exponent: 127 + 127 = 254 (1111 1110)
  - Fraction: 000
:::

:::danger
**必考題**

**Q.** IEEE 754 single precision, 在數線上能表示出來的數字有幾個?

**A.**
法一: (減)
- floating-point + denormalized: 總共有 32 bit, 所以有 $2^{32}$ 種
- 去掉 NaN 跟 $\infty$ 的 case: exponent 固定全為 1, 所以剩下 32 - 8 = 24 個 bit, 共 $2^{24}$ 種。 
- 所以總共 $2^{32} - 2^{24} - 1$ 種 (扣掉重複的0: -0)

法二: (加)
![](https://hackmd.io/_uploads/Hy1Jy_oWa.png =500x)
- 0: 1 種
- denormalized: 因為 exponent 固定為 0, 然後 fraction 不可為 0 ($2^{23} - 1$), 所以共有 $2 \times 1 \times (2^{23} - 1) = 2^{24} - 2$ 種
- normalized: 因為 exponent 為 1 ~ 254, 所以共有 $2 \times 254 \times 2^{23} = 254 \times 2^{24}$ 種
- 所以總共有 
  $$
  1 + (2^{24} - 2) + (254 \times 2^{24}) \
  = 255 \times 2^{24} - 1 \
  = (2^8 - 1) \times 2^{24} - 1 \
  = 2^{32} - 2^{24} - 1
  $$
:::

### Floating-point addition

1. align the decimal point of the number that has the **smaller** exponent
2. add the significands
3. adjust the sum to normalized form
4. round the number

:::info
**Example.** $9.999_{10} \times 10^1 + 1.610_{10} \times 10^{-1}$
Assume we can store 4 decimal digits of the significand and 2 decimal digits of the exponent. 

1. $1.610_{10} \times 10^{-1}\ =\ 0.1610_{10} \times 10^0\ =\ 0.01610_{10} \times 10^1$
    - we can represent only 4 decimal digits $\Rightarrow 0.016 \times 10^1$
2. $9.999_{10} + 0.016_{10} = 10.015_{10} \times 10^1$
3. $10.015_{10} \times 10^1 = 1.0015_{10} \times 10^2$
4. $1.0015_{10} \times 10^2 \Rightarrow 1.002_{10} \times 10^2$
:::

### Accurate arithmetic

- Floating-point numbers are normally approximations for a number they can’t really represent
  - The computer numbers have limited size and hence limited precision
  - Getting the floating-point representation close to the actual number
- It requires extra bits in the calculation to round accurately
  - If every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round 
  - IEEE 754 always **keeps two extra bits on the right** during intermediate additions
    - ***Guard** bit and **round** bit*

:::info
**Example** $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2$

- Round to the nearest decimal number with 3 significant decimal digits
- With guard and round digits
  - $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02\underline{56}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.3656_{10} \times 10^2$
  - *We have two digits to round*  $\Rightarrow 2.37_{10} \times 10^2$
    - 0 ~ 49 to round down, 51~99 to round up, 50 is the tiebreaker
- Without guard and round digits
  - To drop two digits from the calculation
  - $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02_{10} + 2.34_{10}) \times 10^2 = 2.36_{10} \times 10^2$
:::

---

- IEEE 754 has four rounding modes
  - **Always round up** (toward $+\infty$) (無條件進位)
  - **Always round down** (toward $-\infty$) (無條件捨去)
  - **Truncate** (直接扣掉 ex. 4.6 = 4, -4.6 = -4)
  - **Round to nearest even**: <font class="blue">the most commonly used</font>
    - tiebreake（進位或退位的誤差相同）的時候找最近的偶數 (ex. 3.5 = 4, 4.5 = 4)
    - 二進位: ex. $1.0110\underline{10}_2 \rightarrow 1.0110_2$ (truncate) $\ \ \ 1.0111\underline{10}_2 \rightarrow 1.1000_2$ (add 1)
  
  |  | $+\infty$ | $-\infty$ | Truncate | Nearest Even |
  |:---:|:---:|:---:|:---:|:---:|
  | 4.6 | 5 | 4 | 4 | 5 |
  | -4.6 | -4 | -5 | -4 | -5 |
  | 3.5 | 4 | 3 | 3 | 4 |
  | 4.5 | 5 | 4 | 4 | 4 |
- Sticky bit
  - A bit used in rounding in addition to grand and round bits
  - Sticky bit is set whenever there are **nonzero bits to the right of the round bit**

:::info
**Example** $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2$

- $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2 = (0.00\underline{50}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.34\underline{50}_{10} \times 10^2$
  - The sticky bit is set since there are nonzero bits to the right ($0.0050[1]$)
- With the sticky bit to remember that the number is actually larger than $2.34\underline{50}_{10} \Rightarrow \text{Round to } 2.35_{10} \times 10^2$
- Without the sticky bit $\Rightarrow \text{Round to } 2.34_{10} \times 10^2$
:::

## Fallacies and Pitfalls

- Fallacies
  - Just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as an integer division by a power of 2
    - Above statement is only true for unsigned integers
    - The problem for signed integer
    - Solution: to have an arithmetic right shift that extends the sign bit instead of shifting in 0s
  - Parallel execution strategies that work for integer data types also work for floating-point data types
- Pitfalls
  - Floating-point addition is not associative
    - Associativity holds for a sequence of 2’s complement integer additions, even if the computation overflows
    - Associativity does not hold for floating-point numbers because
      - Floating-point numbers are approximations of real numbers
      - Computer arithmetic has limited precision
    - Problems occur when adding two large numbers of opposite signs plus a small number
    - Example
    <!-- TODO -->
  - The MIPS instruction add immediate unsigned (addiu) sign-extends its 16-bit immediate field
