Ch02: Arithmetic for Computers - HackMD

<style> .red { color: red; font-weight: bold; } .blue { color: blue; font-weight: bold; } .green { color: green; font-weight: bold; } .gb { background-color: lightgreen; font-weight: bold; } </style> ## Fixed-Point Numbers ### Preliminary - In a fixed-point（定點數）number system, each number has exactly the same number of digits, and the point is always in the **same place**. - Ex. - 0.53, 5.12, 9.11 - Ordinary integers: 2, 53, 120 - 小數點在最後一位，前面有無限多個 0 - $(11.10)_2$, $(01.10)_2$, $(00.11)_2$ - In the computer ***the binary point is not stored anywhere.*** (電腦不會存小數點，他自己知道在哪裡) ### Range and precision in fixed-point numbers - **Range** of expressible numbers - The distance between the **largest and smallest numbers**. - **Precision** - The distance between **two adjacent numbers** on a **number line**. - <font class="blue">Error</font> - **1/2 of the difference** between two adjacent numbers. :::info **Example** - Three digits, the decimal point is placed two digits from right - Range: [0.00, 9.99] - Precision: 0.01 - Error: 0.005 ::: ### The associative law of algebra does not always hold in computers - Associative law of algebra: a + (b + c) = (a + b) + c :::info **Example** - Range: [-9, 9] - (a, b, c) = (7, 4, -3) - a + (b + c) = 7 + (4 + -3) = 8 - (a + b) + c = (7 + 4) + -3 = 11 + -3 $\rightarrow$ **11 is outside the range** ::: - To detect ***overflow*** wherever it occurs ## Radix number systems - The **base** or **radix** of a number system defines the range of possible values that a digit may have. - <font class="blue">*Most Significant Bit (MSB)* vs. *Least Significant Bit (LSB)*</font> - **MSB**: the leftmost bit (最左邊那個 bit) - **LSB**: the rightmost bit (最右邊那個 bit) ![](https://hackmd.io/_uploads/S1EJFWYlp.png =600x) ![](https://hackmd.io/_uploads/r1FXKWFe6.png =600x) ### Signed fixed-point numbers - Four commonly used ***signed*** representations ![](https://hackmd.io/_uploads/SJKrF-Kep.png =500x) #### Signed magnitude - The ***leftmost*** bit is used for the sign (<font class="green">*definition*</font>) (<font class="blue">0 for positive, 1 for negative</font>) - For a k-bit number - Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$] - ex. k = 3, Range: [$-(2^{3-1}-1),\ 2^{3-1}-1$] = [-3, 3] - $2^k-1$ different numbers can be represented (減 1: 因為 +0 = -0) #### One's complement (1 補數) - 0 $\leftrightarrow$ 1 (complementing) - The leftmost bit is also the sign bit (<font class="green">*conclusion*</font>)(<font class="blue">0 for positive, 1 for negative</font>) - For a k-bit number - Range: [$-(2^{k-1}-1)$, $2^{k-1}-1$] - $2^k-1$ different number can be represented - Practical application - ***Checksums*** in the headers of network packets #### Two's complement (2 補數) - 0 $\leftrightarrow$ 1 than add 1, **discard the carry** out from the MSB - For a k-bit number - Range: [$-2^{k-1}$, $2^{k-1}-1$] - $2^k$ different numbers can be represented (without -0, so we can get one more negative number than one's complement) - The most commonly used representation in conventional computers #### Excess (biased) representation - Two's complement form plus a ***bias*** (將二補數加上 bias) - Ex. excess 128 - $(+12)_{10}\ =\ (10001100)_2\ \ \ (-12)_{10}\ =\ (01110100)_2$ - $(-128)_{10}\ =\ (00000000)_2\ \ \ (+127)_{10}\ =\ (11111111)_2$ - $+0\ =\ (10000000)_2\ \ \ (-0)\ =\ (10000000)_2$ - The leftmost bit is the sign bit (<font class="red">1 for positve, 0 for negative</font>) - <font class="gb">Numerically smaller numbers have smaller bit patterns</font> (數字大小照著 bit pattern 排序，所以很好判斷大小關係) - For a k-bit number - Range: [$-2^{k-1}$, $2^{k-1}-1$] - $2^k$ different numbers can be represented ### Sign extension - A number is placed into a **larger container** - To pad the left side with 0s - **Positive number: correct** - **Negative number: incorrent** :::info **Example**: - $+12_{10}$: 8-bit $\rightarrow$ 16-bit 0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**) $\Rightarrow$ **correct!** - $-12_{10}$: 8-bit $\rightarrow$ 16-bit 1111 0100 (sign: **-**) $\rightarrow$ 0000 0000 1111 0100 (sign: **+**) $\Rightarrow$ **incorrect!** ::: - To copy the sign bit to the left side: ***sign extension*** :::info **Example**: - $+12_{10}$: 8-bit $\rightarrow$ 16-bit 0000 1100 (sign: **+**) $\rightarrow$ 0000 0000 0000 1100 (sign: **+**) $\Rightarrow$ **correct!** - $-12_{10}$: 8-bit $\rightarrow$ 16-bit 1111 0100 (sign: **-**) $\rightarrow$ 1111 1111 1111 0100 (sign: **-**) $\Rightarrow$ **correct!** ::: - A number is placed into a smaller container - To remove bits on the left side :::info **Example**: $-12_{10}$: 16-bit $\rightarrow$ 8-bit 1111 1111 1111 0100 $\rightarrow$ 1111 0100 ::: ### 2 補數表示的數字轉 10 進位 $$ \displaystyle{\rm{Value} = \sum^{n - 1}_{i = -m} b_i \times k^i\quad{MSB \times (-1)}} $$ :::warning **Example** - $45 = (0101101)_2 = 2^5 + 2^3 + 2^2 + 2^0$ - $-45 = (1010011)_2 = -2^6 + 2^4 + 2^1 + 2^0$ ::: ### 數字轉成 2 補數，用最少 bit 數表示 - 記 range 公式 - 正數直接寫出，左側補 0 (MSB 會為 0) - 負數由正數轉換 (MSB 會為 1) ### 非整數負數的 2 補數表示法 - $(19.25)_{10} = (010011.01)_2$ - $(-19.26)_{10} = (?)_2$ :::info **法一** $$ \begin{split} (-19.25)_{10} &= (-20)_{10} + (0.75)_{10} \\ &= (101100)_2 + (0.11)_2 \\ &= (101100.11)_2 \end{split} $$ ::: :::success **法二** $$ (19.25)_{10} = (010011.01)_2 $$ 0, 1 互換，**bit pattern** + 1 $$ (010011.01)_2 \Rightarrow (101100.10)_2 \Rightarrow (101100.11)_2 $$ ::: ## Addition and Subtraction ### Binary addition and subtraction - Digits are added bit by bit from **right to left**, with carries passed to the next digit to the left - 減法就是加負數 (利用二補數) ### Overflow - To occur when the result from an operation cannot be represented with the available hardware - The carry-out of the leftmost bit is discarded because the number system is ***modular*** #### When overflow **cannot** occur - **adding positive and negative** operands (異號相加) - **subtracting** two operands of the **same sign** (同號相減) #### How to detect overflow? - <font class="blue">同號數相加卻出現異號數</font> - Ex. (+) + (+) = (-) - 異號數相加不可能 overflow - <font class="blue">負減正卻出現正數 or 正減負卻出現負數</font> - Ex. (-) - (+) = (+) - 同號數相減不可能 overflow #### MIPS detects overflow with an *exception* - An unscheduled procedure call - The processor jumps to a **predefined address** to invoke the appropriate routine - The address caused the exception is saved in **register EPC (Exception Program Counter)** - In some situations the program can continue after corrective code is executed - Exception vs. ***Interrupt*** (in MIPS processor) - Exception refers to any unexpected change in control flow **without distinguishing whether the cause is internal or external** - **Interrupt is an exception that comes from outside of the processor** #### *Saturating* operations 飽和控制 - When a calculation overflows, the result is set to the largest positive number or most negative number - Suitable for media operations - 例如遙控器控制音量，到最大就維持在最大值了。 ## Multiplication ### Unsigned multiplication - ***Multiplicand*** $\times$ ***Multiplier*** = ***Product*** (被乘數 $\times$ 乘數 = 積) - The number of digits in the product - ***n* bit** multiplicand $\times$ ***m* bit** multiplier = **(*n+m*) bit** product (ignore the sign bits) - **Sequential version** of the multiplication algorithm and hardware - Multiplier is in the 32 bit **Multiplier** register - The 64 bit **Product** register is initialized to 0 - A 64 bit **Multiplicand** register is initialized with the 32 bit multiplicand in the right half and zero in the left half - The **Multiplier** register is shifted right 1 bit each step - The **Multiplicand** register is shifted left 1 bit each step to align the multiplicand with the sum being accumulated in the 64 bit **Product** register ![](https://hackmd.io/_uploads/ByxS3_cx6.png =400x) ![](https://hackmd.io/_uploads/B1yZhuqlT.png =400x) ![](https://hackmd.io/_uploads/H1cuVl3xa.png) - **Refined version** of the multiplication hardware - To perform operations in **parallel**: 1 clock cycle per step - The multiplier and multiplicand are shifted while the multiplicand is added to the product if the multiplier bit is a 1 - The hardware has to ensure that it tests the LSB of the multiplier and gets the pre-shifted version of the multiplicand - To halve the width of the adder and registers ![](https://hackmd.io/_uploads/SJmDFx2ep.png =500x) ### Signed multiplication - Leaving the signs out of the calculation (<font class="green">都當成正的，算完再判斷正負</font>) - To convert the multiplier and multiplicand to positive numbers and then remember the original signs - The algorithms should be run for 31 iterations - The algorithm in Fig. 3.4 will work for signed numbers - The shifting steps would need to **extend the sign of the product** - When the algorithm completes, the **lower word** would have the 32-bit product ![](https://hackmd.io/_uploads/B1zC3VngT.png) ### Faster multiplication Whether the multiplicand is to be added or not is known at the beginning of the multiplication by looking at each of the 32 multiplier bits. - To provide one 32-bit adder for each bit of the multiplier - Two input bits - One input is the multiplicand ANDed with a multiplier - One input is the output of a prior adder - To organize these 32 additions - To connect the outputs of adders on the right to the inputs of adders on the left - Parallel tree ![](https://hackmd.io/_uploads/HJJ0643xp.png =800x) ## Division ### Unsigned Division - ***Dividend / Divisor = Quotient + Remainder*** ***Dividend = Quotient $\times$ Divisor + Remainder*** - Steps - To see how big a number can be subtracted - To create a digit of the quotient on each attempt #### First version of the division hardware - The 32 bit **Quotient** register is initialized to 0 - The divisor is placed in the **left half** of the 64 bit **Divisor** register - The **Remainder** register is initialized with the dividend (sign extension) - The **Divisor** register is **shifted right** 1 bit each step to align with the dividend - The **Quotient** register is **shifted left** 1 bit each step ![](https://hackmd.io/_uploads/rJtClY5la.png =600x) - **Step 1.** ***subtract*** the **Divisor** register from the **Remainder** register - **Step 2a.** if the result is **positive**, we **generate a 1 in the quotient** - **Step 2b.** if the result is **negative**, we **restore the original value to the remainder** and **generate a 0 in the quotient** - **Step 3.** the **Divisor** register is **shifted right** 1 bit - Above steps are repeated ![](https://hackmd.io/_uploads/H1t6-Y5xp.png =500x) ![](https://hackmd.io/_uploads/ByqkcLnxa.png) #### An improved version of the division hardware - To shift the operands and the quotient simultaneously with the subtraction - To halve the width of the adder and registers - The **Remainder** register is shifted left - To combine the **Quotient** register with the right half of the **Remainder** register ![](https://hackmd.io/_uploads/HySnQK5gp.png =800x) ### Signed Division - To remember the signs of the divisor and dividend and then negate the quotient if the signs disagree - <font class="red">**The dividend and remainder must have the same signs**</font> - $+7 \div -2 = -3 ... +1$ - $-7 \div -2 = +3 ... -1$ - <font class="green">先當成正數算，再來判斷正負，被除數跟餘數要同號。</font> ### Faster Division: SRT division - Try to **predict** several quotient bits per step - Using a table lookup based on the upper bits of the dividend and remainder - 6 bits from the remainder and 4 bits from the divisor to index a table - Subsequent steps to correct wrong predictions - A typical value today is 4 bits - The accuracy depends on having proper values in the lookup table ## Floating Point ### Preliminary - Programming languages support numbers with **fractions** - ***Floating point*** - Computer arithmetic that represent numbers in which the ***decimal point*** (or ***binary point***) is not fixed - ***Scientific notation*** - Numbers with a single digit to the left of the decimal point - ***Normalized*** - A number in floating-point notation that has no leading 0s ### Float-point representation - General form of a floating-number: $(-1)^s \times F \times 2^E$ - <font class="blue">***s***</font> is the **sign** of the floating-point number - <font class="blue">***F: Fraction***</font> (**mantissa**) - A value $1.xxx..._2$ to determine the **precision** - Larger F enhances the precision of the fraction - <font class="blue">***E: exponent***</font> - A signed integer to determine the **range** - Larger E increases the range of numbers that can be represented - A tradeoff is between precision and range ### IEEE 754 floating-point standard - **Single precision** (單精度) - A **floating-point** value represented in a single **32 (1 + 8 + 23) bit word** - A separate **sign bit**: <font class="red">0 for positive, 1 for negative</font> - <font class="green">8 bit exponent field</font> (including the sign of the exponent) - <font class="green">23 bit fraction field</font> - Range: $2.0_{10} \times 10^{-38} \sim 2.0_{10} \times 10^{38}$ ![](https://hackmd.io/_uploads/HkhbnY5x6.png) - Overflow - A positive exponent is too large to be represented in the exponent field - Underflow - A negative exponent is too large to fit in the exponent field - **Double precision** - A floating-point value represented in two 32-bit words - <font class="green">Sign bit, 11 bit exponent field, 52 bit fraction field</font> - Range: $2.0_{10} \times 10^{-308} \sim 2.0_{10} \times 10^{308}$ - Greater precision due to the much larger fraction ![](https://hackmd.io/_uploads/rkiYatcgp.png) - Making the leading 1-bit of normalized binary numbers implicit (hidden bit) (省略 fraction 的小數點跟最左邊的 1，避免浪費) - The significand is actually 24-bit long in single precision and 53-bit long in double precision - Since 0 has no leading 1, it needs a special format to represent (0 是例外情況) - ==Five formats== - $00...00_2$ represent **0** (所以 0 佔了兩個 bit pattern, s=0 or 1) - The representaion of $\pm$**floating-point number** - $(-1)^s \times (1 + Fraction) \times 2^{(Exponent - Bias)}$ - Using a bias of **127** for single precision and **1023** for double precision - Range of single precision number - $\pm1.00000000000000000000000 \times 2^{-126}$ ~ $\pm1.11111111111111111111111 \times 2^{+127}$ - The largest exponent is reserved for $\pm$**infinity** and ***NaN (not a number)*** - $\pm$infinity can represent the result of a divided by 0 - NaN can represent invalid operations, such as 0/0 or subtracting infinity from infinity - The representation of $\pm$**denormalized number** - ==Without== the leading 1-bit - $(-1)^s \times Fraction \times 2^{-126}$ in single precision and $(-1)^s \times Fraction \times 2^{-1022}$ in double precision - To represent numbers between 0 and the smallest (or largest) positive (or negative) number in the $\pm$floating-point number type ![](https://hackmd.io/_uploads/SyYiaFcea.png) :::info **Example**. Show the IEEE 754 binary representation of the number $-0.75_{10}$ - $-0.75_{10} = (-3 / 4)_{10} = (-3 / 2^2)_{10} = (-3 \times 2^{-2})_{10} = -0.11_2$ ::: ![](https://hackmd.io/_uploads/HyOsWq9gp.png =800x) :::success - a. - Sign: positive (+) $\rightarrow$ 0 - Exponent: 5 + 127 (bias) = 132 (1000 0100) - Fraction: 101 - b. - Sign: negative (-) $\rightarrow$ 1 - Exponent: -126 + 127 = 1 (0000 0001) - Fraction" 010 1100 - c. - Sign: positive (+) $\rightarrow$ 0 - Exponent: 127 + 127 = 254 (1111 1110) - Fraction: 000 ::: :::danger **必考題** **Q.** IEEE 754 single precision, 在數線上能表示出來的數字有幾個? **A.** 法一: (減) - floating-point + denormalized: 總共有 32 bit, 所以有 $2^{32}$ 種 - 去掉 NaN 跟 $\infty$ 的 case: exponent 固定全為 1, 所以剩下 32 - 8 = 24 個 bit, 共 $2^{24}$ 種。 - 所以總共 $2^{32} - 2^{24} - 1$ 種 (扣掉重複的0: -0) 法二: (加) ![](https://hackmd.io/_uploads/Hy1Jy_oWa.png =500x) - 0: 1 種 - denormalized: 因為 exponent 固定為 0, 然後 fraction 不可為 0 ($2^{23} - 1$), 所以共有 $2 \times 1 \times (2^{23} - 1) = 2^{24} - 2$ 種 - normalized: 因為 exponent 為 1 ~ 254, 所以共有 $2 \times 254 \times 2^{23} = 254 \times 2^{24}$ 種 - 所以總共有 $$ 1 + (2^{24} - 2) + (254 \times 2^{24}) \ = 255 \times 2^{24} - 1 \ = (2^8 - 1) \times 2^{24} - 1 \ = 2^{32} - 2^{24} - 1 $$ ::: ### Floating-point addition 1. align the decimal point of the number that has the **smaller** exponent 2. add the significands 3. adjust the sum to normalized form 4. round the number :::info **Example.** $9.999_{10} \times 10^1 + 1.610_{10} \times 10^{-1}$ Assume we can store 4 decimal digits of the significand and 2 decimal digits of the exponent. 1. $1.610_{10} \times 10^{-1}\ =\ 0.1610_{10} \times 10^0\ =\ 0.01610_{10} \times 10^1$ - we can represent only 4 decimal digits $\Rightarrow 0.016 \times 10^1$ 2. $9.999_{10} + 0.016_{10} = 10.015_{10} \times 10^1$ 3. $10.015_{10} \times 10^1 = 1.0015_{10} \times 10^2$ 4. $1.0015_{10} \times 10^2 \Rightarrow 1.002_{10} \times 10^2$ ::: ### Accurate arithmetic - Floating-point numbers are normally approximations for a number they can’t really represent - The computer numbers have limited size and hence limited precision - Getting the floating-point representation close to the actual number - It requires extra bits in the calculation to round accurately - If every intermediate result had to be truncated to the exact number of digits, there would be no opportunity to round - IEEE 754 always **keeps two extra bits on the right** during intermediate additions - ***Guard** bit and **round** bit* :::info **Example** $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2$ - Round to the nearest decimal number with 3 significant decimal digits - With guard and round digits - $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02\underline{56}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.3656_{10} \times 10^2$ - *We have two digits to round* $\Rightarrow 2.37_{10} \times 10^2$ - 0 ~ 49 to round down, 51~99 to round up, 50 is the tiebreaker - Without guard and round digits - To drop two digits from the calculation - $2.56_{10} \times 10^0 + 2.34_{10} \times 10^2 = (0.02_{10} + 2.34_{10}) \times 10^2 = 2.36_{10} \times 10^2$ ::: --- - IEEE 754 has four rounding modes - **Always round up** (toward $+\infty$) (無條件進位) - **Always round down** (toward $-\infty$) (無條件捨去) - **Truncate** (直接扣掉 ex. 4.6 = 4, -4.6 = -4) - **Round to nearest even**: <font class="blue">the most commonly used</font> - tiebreake（進位或退位的誤差相同）的時候找最近的偶數 (ex. 3.5 = 4, 4.5 = 4) - 二進位: ex. $1.0110\underline{10}_2 \rightarrow 1.0110_2$ (truncate) $\ \ \ 1.0111\underline{10}_2 \rightarrow 1.1000_2$ (add 1) | | $+\infty$ | $-\infty$ | Truncate | Nearest Even | |:---:|:---:|:---:|:---:|:---:| | 4.6 | 5 | 4 | 4 | 5 | | -4.6 | -4 | -5 | -4 | -5 | | 3.5 | 4 | 3 | 3 | 4 | | 4.5 | 5 | 4 | 4 | 4 | - Sticky bit - A bit used in rounding in addition to grand and round bits - Sticky bit is set whenever there are **nonzero bits to the right of the round bit** :::info **Example** $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2$ - $5.01_{10} \times 10^{-1} + 2.34_{10} \times 10^2 = (0.00\underline{50}_{10} + 2.34\underline{00}_{10}) \times 10^2 = 2.34\underline{50}_{10} \times 10^2$ - The sticky bit is set since there are nonzero bits to the right ($0.0050[1]$) - With the sticky bit to remember that the number is actually larger than $2.34\underline{50}_{10} \Rightarrow \text{Round to } 2.35_{10} \times 10^2$ - Without the sticky bit $\Rightarrow \text{Round to } 2.34_{10} \times 10^2$ ::: ## Fallacies and Pitfalls - Fallacies - Just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as an integer division by a power of 2 - Above statement is only true for unsigned integers - The problem for signed integer - Solution: to have an arithmetic right shift that extends the sign bit instead of shifting in 0s - Parallel execution strategies that work for integer data types also work for floating-point data types - Pitfalls - Floating-point addition is not associative - Associativity holds for a sequence of 2’s complement integer additions, even if the computation overflows - Associativity does not hold for floating-point numbers because - Floating-point numbers are approximations of real numbers - Computer arithmetic has limited precision - Problems occur when adding two large numbers of opposite signs plus a small number - Example  - The MIPS instruction add immediate unsigned (addiu) sign-extends its 16-bit immediate field