---
title: Archi 3
---
# Computer Architectures
NTNU 計算機結構
##### [Back to Note Overview](https://reurl.cc/XXeYaE)
##### [Back to Computer Architectures](https://hackmd.io/@NTNUCSIE112/Archi109-2)
{%hackmd @sophie8909/pink_theme %}
###### tags: `NTNU` `CSIE` `必修` `Computer Architectures` `109-2`
<!-- tag順序 [學校] [系 必選] or [學程 學程名(不含學程的 e.g. 大師創業)] [課程] [開課學期]-->
<!-- 網址名稱 Archi109-2_[] -->
<!-- 打開你的 Toggle linter 把這邊以下的所以紅紅弄掉 -->
## Ch.03 Arithmetic for Computers
### 3.1 Introduction
#### Arithmetic for Computers 計算機運算
- Example: 7 + 6
- Operations on integers
- Addition and subtraction
- Multiplication and division
- Dealing with overflow
- Floating-point real numbers
- Representation and operations
### 3.2 Addition and Subtraction
#### Integer Addition
- Overflow if the result is out of range (see the C sketch after this list)
- Adding a +ve and a -ve operand never overflows
- Adding two +ve operands
- Overflow if the result sign is 1
- Adding two -ve operands
- Overflow if the result sign is 0
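- A minimal C sketch of the sign rules above (the helper name `add_overflows` is hypothetical; the addition is done in unsigned arithmetic because signed overflow is undefined behavior in C):
```c=
#include <stdint.h>
#include <stdbool.h>

// Overflow only when both operands have the same sign and the
// result's sign differs from theirs (hypothetical helper name).
static bool add_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sum = ua + ub;                            // wraps instead of trapping
    bool same_sign    = ((ua ^ ub)  & 0x80000000u) == 0;
    bool sign_flipped = ((ua ^ sum) & 0x80000000u) != 0;
    return same_sign && sign_flipped;
}
```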
#### Integer Subtraction
- Add the negation of the second operand
- Example: 7-6 = 7+(-6)
- Overflow if the result is out of range (the C sketch below covers subtraction too)
- Subtracting two +ve or two -ve operands never overflows
- Subtracting a +ve operand from a -ve operand
- Overflow if the result sign is 0
- Subtracting a -ve operand from a +ve operand
- Overflow if the result sign is 1
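- The same idea for subtraction, again only a sketch with a hypothetical helper name; the rule mirrors the list above (overflow is possible only when the operands have different signs):
```c=
#include <stdint.h>
#include <stdbool.h>

// Subtraction as addition of the two's-complement negation.
static bool sub_overflows(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t diff = ua - ub;                            // a + (-b), wrapping
    bool different_sign = ((ua ^ ub)   & 0x80000000u) != 0;
    bool sign_flipped   = ((ua ^ diff) & 0x80000000u) != 0;
    return different_sign && sign_flipped;
}
```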
#### Deal with Overflow
- Some languages, e.g., C, ignore overflow.
- Use MIPS `addu`, `addiu`, `subu` instructions
- Other languages, e.g., Ada and Fortran, require raising an exception.
- Use MIPS `add`, `addi`, `sub` instructions
- On overflow, invoke an exception handler
- Save PC in an exception program counter (EPC) register
- Jump to a predefined handler address
- `mfc0` instruction
- Move from coprocessor 0 register
- Can retrieve the EPC value, to return after corrective action
#### Arithmetic for Multimedia 多媒體運算
- Graphics and media processing operates on vectors of 8-bit and 16-bit data.
- Use a 64-bit adder with a partitioned carry chain
- Operate on 8 × 8-bit, 4 × 16-bit, or 2 × 32-bit vectors
- SIMD (single-instruction, multiple-data)
- Saturating operations (sketched in C below)
- On overflow, the result is the largest representable value
- Cf. 2's-complement modulo arithmetic
- E.g., clipping in audio, saturation in video
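- A sketch of a saturating add on signed 8-bit samples (plain C, not tied to any particular SIMD instruction set):
```c=
#include <stdint.h>

// On overflow, clamp to the most positive/negative representable value
// instead of wrapping around as 2's-complement arithmetic would.
static int8_t saturating_add8(int8_t a, int8_t b)
{
    int16_t sum = (int16_t)a + (int16_t)b;   // wide enough to never overflow
    if (sum > INT8_MAX) return INT8_MAX;     //  127
    if (sum < INT8_MIN) return INT8_MIN;     // -128
    return (int8_t)sum;
}
```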
### 3.3 Multiplication
#### Multiplication
- Start with a long-multiplication approach

#### Multiplication Hardware

#### Optimized Multiplier
- Perform two steps in parallel: add and shift

- One cycle per partial-product addition
- That's OK if the frequency of multiplications is low. (A C model of the add-and-shift loop follows.)
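- A software model of the optimized multiplier (a sketch, not MIPS code): the product register starts with the multiplier in its lower half, and each of the 32 cycles does one add into the upper half plus one right shift, keeping the adder's carry-out explicitly:
```c=
#include <stdint.h>

// One add-and-shift per cycle; {carry, hi, lo} models the 64-bit product
// register plus the carry-out of the 32-bit adder.
static uint64_t multiply32(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t hi = 0, lo = multiplier;
    for (int cycle = 0; cycle < 32; cycle++) {
        uint32_t carry = 0;
        if (lo & 1) {                                   // test multiplier bit 0
            uint64_t sum = (uint64_t)hi + multiplicand; // 33-bit add
            hi    = (uint32_t)sum;
            carry = (uint32_t)(sum >> 32);
        }
        lo = (lo >> 1) | (hi << 31);                    // shift product right
        hi = (hi >> 1) | (carry << 31);
    }
    return ((uint64_t)hi << 32) | lo;                   // HI:LO
}
```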
#### Faster Multiplier
- Use multiple adders
- Cost/performance tradeoff

- Can be pipelined
- Several multiplications are performed in parallel.
#### MIPS Multiplication
- Two 32-bit registers for the product
- HI: most-significant 32 bits
- LO: least-significant 32 bits
- Instructions
- `mult rs,rt` / `multu rs,rt`
- 64-bit product in HI/LO
- `mfhi rd` / `mflo rd`
- Move from HI/LO to rd
- Can test the HI value to see if the product overflows 32 bits (see the C sketch after this list)
- `mul rd,rs,rt` (not in the textbook, FYI)
- Least-significant 32 bits of product $\to$ `rd`
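- A C sketch of the "test HI" idea for detecting whether a signed 32 × 32 product fits in 32 bits (the helper name is hypothetical; C compilers simply use a 64-bit multiply):
```c=
#include <stdint.h>
#include <stdbool.h>

// The product fits in 32 bits exactly when HI holds nothing more than
// the sign extension of LO.
static bool mul_fits_in_32(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;   // full 64-bit product
    int32_t lo = (int32_t)product;               // LO
    int32_t hi = (int32_t)(product >> 32);       // HI
    return hi == (lo < 0 ? -1 : 0);
}
```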
### 3.4 Division
#### Division

- Check for 0 divisor
- Long-division approach (see the C sketch after this list)
- If the divisor $\leq$ the current dividend bits
- Quotient bit is 1; subtract the divisor
- Otherwise
- Quotient bit is 0; bring down the next dividend bit
- Restoring division
- Do a subtract, and if the remainder < 0, add the divisor back
- Signed division
- Divide using absolute values
- Adjust the signs of the quotient and remainder as required
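- A C sketch of restoring division on unsigned 32-bit operands (the comparison stands in for the subtract-then-restore step; the caller is assumed to have rejected a zero divisor):
```c=
#include <stdint.h>

// Long division, one dividend bit per step, MSB first.
static void divide_restoring(uint32_t dividend, uint32_t divisor,
                             uint32_t *quotient, uint32_t *remainder)
{
    uint32_t rem = 0, quo = 0;
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((dividend >> i) & 1);  // bring down next bit
        if (rem >= divisor) {                      // subtract would stay >= 0
            rem -= divisor;                        // quotient bit 1
            quo |= 1u << i;
        }                                          // else quotient bit 0 (restore)
    }
    *quotient = quo;
    *remainder = rem;
}
```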
#### Division Hardware

#### Optimized Divider
- One cycle per partial-remainder subtraction
- Looks a lot like a multiplier!
- Same hardware can be used for both.

- The left half holds the Remainder; the right half holds the Quotient.
#### Faster Division
- Cannot use parallel hardware as in the multiplier
- Subtraction is conditional on the sign of a remainder.
- Faster dividers (e.g. SRT division) generate multiple quotient bits per step.
- Still require multiple steps
#### MIPS Division
- Use HI/LO registers for the result
- HI: 32-bit remainder
- LO: 32-bit quotient
- Instructions
- `div rs, rt` / `divu rs, rt`
- No overflow or divide-by-0 checking
- Software must perform checks if required.
- Use `mfhi` and `mflo` to access the result (the software checks are sketched in C below)
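- A C sketch of the checks software has to add around `div` (hypothetical helper name); both the zero divisor and the one signed case that overflows 32 bits are rejected:
```c=
#include <stdint.h>
#include <stdbool.h>

// Returns false instead of dividing when the hardware result would be
// meaningless: divide-by-zero, or -2^31 / -1 (quotient 2^31 does not fit).
static bool checked_divide(int32_t dividend, int32_t divisor,
                           int32_t *quotient, int32_t *remainder)
{
    if (divisor == 0)
        return false;
    if (dividend == INT32_MIN && divisor == -1)
        return false;
    *quotient  = dividend / divisor;   // LO
    *remainder = dividend % divisor;   // HI
    return true;
}
```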
### 3.5 Floating Point
#### Floating Point
- Representation for non-integral numbers
- Include very small and very large numbers
- Like scientific notation
- normalized
- $-2.34 \times 10^{56}$
- not normalized
- $+0.002 \times 10^{-4}$
- $+987.02 \times 10^{9}$
- In binary
- $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
- Types `float` and `double` in C
#### Floating Point Standard
- Defined by IEEE Standard 754-1985
- Developed in response to divergence of representations
- Portability issues for scientific code
- Now almost universally adopted
- Two representations
- Single precision (32 bits)
- Double precision (64 bits)
#### IEEE Floating-Point Format

$$
x = (-1)^S \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}
$$
- S: sign bit (0 => non-negative, 1 => negative)
- Normalize significand: 1.0 ≤ |significand| < 2.0
- Always have a leading pre-binary-point 1 bit
- no need to represent it explicitly (hidden bit)
- Significand is Fraction with the “1.” restored
- Exponent: excess representation: actual exponent + Bias
- Ensure exponent is unsigned.
- Single: Bias = 127; Double: Bias = 1023 (see the C decoding sketch after this list)
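- A C sketch that pulls a single-precision value apart into the S, Exponent, and Fraction fields of the formula above (normalized numbers only, Bias = 127):
```c=
#include <stdint.h>
#include <string.h>
#include <stdio.h>

// Reinterpret the 32 bits of a float and print the three IEEE 754 fields.
static void decode_single(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t s        = bits >> 31;           // sign
    uint32_t exponent = (bits >> 23) & 0xFF;  // 8-bit biased exponent
    uint32_t fraction = bits & 0x7FFFFF;      // 23-bit fraction
    printf("S=%u  Exponent=%u (actual %d)  Fraction=0x%06X\n",
           s, exponent, (int)exponent - 127, fraction);
}
```
- E.g., `decode_single(-0.75f)` prints `S=1  Exponent=126 (actual -1)  Fraction=0x400000`, matching the worked example later in this section.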
#### Single-Precision Range
- Exponents $00000000$ and $11111111$ are reserved.
- Smallest value
- Exponent: $00000001$
- $\text{actual exponent} = 1 - 127 = -126$
- Fraction: $000\dots00$
- $\text{significand} = 1.0$
- $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
- Largest value
- Exponent: $11111110$
- $\text{actual exponent} = 254 - 127 = +127$
- Fraction: $111\dots11$
- $\text{significand} \approx 2.0$
- $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$
#### Double-Precision Range
- Exponents $0000\dots00$ and $1111\dots11$ are reserved.
- Smallest value
- Exponent: $00000000001$
- $\text{actual exponent} = 1 - 1023 = -1022$
- Fraction: $000\dots00$
- $\text{significand} = 1.0$
- $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value
- Exponent: $11111111110$
- $\text{actual exponent} = 2046 - 1023 = +1023$
- Fraction: $111\dots11$
- $\text{significand} \approx 2.0$
- $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$
#### Floating-Point Precision
- Relative precision
- All fraction bits are significant.
- Single: approximately $2^{-23}$
- Equivalent to $23 \times \log_{10}2 \approx 23 \times 0.3 \approx 6$ decimal digits of precision
- Double: approximately $2^{-52}$
- Equivalent to $52 \times \log_{10}2 \approx 52 \times 0.3 \approx 16$ decimal digits of precision (see the C example after this list)
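- A tiny check of those precision figures: $2^{24}+1$ needs 25 significand bits, so it is the first integer that single precision cannot represent exactly, while double precision still can:
```c=
#include <stdio.h>

int main(void)
{
    float  f = 16777217.0f;   // 2^24 + 1 rounds to 2^24 in single precision
    double d = 16777217.0;    // still exact in double precision
    printf("float:  %.1f\n", (double)f);  // 16777216.0
    printf("double: %.1f\n", d);          // 16777217.0
    return 0;
}
```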
#### Floating-Point Example
- Represent $-0.75$
- $-0.75 = (-1)^1 \times 1.1_2 \times 2^{-1}$
- S = 1
- Fraction = $1000\dots00_2$
- Exponent = $-1$ + Bias
- Single: $-1 + 127 = 126 = 01111110_2$
- Double: $-1 + 1023 = 1022 = 01111111110_2$
- Single: $1\ 01111110\ 1000\dots00$
- Double: $1\ 01111111110\ 1000\dots00$
- What number is represented by the single-precision float $1\ 10000001\ 01000\dots00$?
- S = 1
- Fraction = $01000\dots00_2$
- Exponent = $10000001_2 = 129$
- $x = (-1)^1 \times (1 + 0.01_2) \times 2^{(129 - 127)} = (-1) \times 1.25 \times 2^2 = -5.0$ (checked by the C snippet below)
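- Building the bit pattern 1 10000001 0100...0 (0xC0A00000) and reinterpreting it as a `float` confirms the second example:
```c=
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint32_t bits = 0xC0A00000u;   // S=1, Exponent=129, Fraction=.01 in binary
    float x;
    memcpy(&x, &bits, sizeof x);   // reinterpret the bits as a float
    printf("%f\n", x);             // prints -5.000000
    return 0;
}
```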
#### Denormal Numbers
#### Infinities and NaN
#### Floating-Point Addition in Decimal
#### Floating-Point Addition in Binary
#### FP Adder Hardware
#### Floating-Point Multiplication in Decimal
#### Floating-Point Multiplication in Binary
#### FP Arithmetic Hardware
#### FP Instructions in MIPS
- FP hardware is coprocessor 1.
- Adjunct processor that extends ISA
- Separate FP registers
- 32 single-precision: `$f0`, `$f1`, ..., `$f31`
- Paired for double precision: `$f0/$f1`, `$f2/$f3`, ...
- Release 2 of the MIPS ISA supports 32 × 64-bit FP registers.
- FP instructions operate only on FP registers.
- Programs generally don’t do integer operations on FP data, or vice versa.
- More registers with minimal code-size impact
- FP load and store instructions.
- `lwc1`, `ldc1`, `swc1`, `sdc1`
- E.g., `ldc1 $f8, 32($sp)`
- Single-precision arithmetic
- `add.s`, `sub.s`, `mul.s`, `div.s`
- E.g., `add.s $f0, $f1, $f6`
- Double-precision arithmetic
- `add.d`, `sub.d`, `mul.d`, `div.d`
- E.g., `mul.d $f4, $f4, $f6`
- Single- and double-precision comparison
- `c.xx.s`, `c.xx.d` (xx is `eq`, `lt`, `le`, ...)
- Set or clear the FP condition-code bit
- E.g., `c.lt.s $f3, $f4`
- Branch on FP condition code true or false
- `bc1t`, `bc1f`
- E.g., `bc1t TargetLabel`
#### FP Example: °F to °C
- C code
```c=
// fahr in $f12, result in $f0
// literals 5.0, 9.0, and 32.0 live in the global memory space
float f2c(float fahr)
{
    return ((5.0 / 9.0) * (fahr - 32.0));
}
```
- Compiled MIPS code
```MIPS=
f2c:
    lwc1  $f16, const5($gp)    # $f16 = 5.0
    lwc1  $f18, const9($gp)    # $f18 = 9.0
    div.s $f16, $f16, $f18     # $f16 = 5.0 / 9.0
    lwc1  $f18, const32($gp)   # $f18 = 32.0
    sub.s $f18, $f12, $f18     # $f18 = fahr - 32.0
    mul.s $f0,  $f16, $f18     # $f0  = (5.0/9.0) * (fahr - 32.0)
    jr    $ra                  # return
```
#### FP Example: Array Multiplication
- X = X + Y × Z
- All are 32 × 32 matrices, with 64-bit double-precision elements
- C code
```c=
// x, y, z in $a0, $a1, $a2
// i, j, k in $s0, $s1, $s2
void mm(double x[][32], double y[][32], double z[][32])
{
    int i, j, k;
    for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
            for (k = 0; k != 32; k = k + 1)
                x[i][j] = x[i][j] + y[i][k] * z[k][j];
}
```
### 3.6 Parallelism and Computer Arithmetic
#### Accurate Arithmetic
#### Subword Parallelism
### 3.7 Real Stuff: Floating Point in the x86
#### x86 FP Architecture
#### x86 FP Instructions
#### Streaming SIMD Extension 2 (SSE2)
### 3.8 Going Faster: Subword Parallelism and Matrix Multiply
#### Matrix Multiply in C
- Unoptimized code
```c=
void dgemm(int n, double* A, double* B, double* C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double cij = C[i + j * n];    /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i + k * n] * B[k + j * n];   /* cij += A[i][k] * B[k][j] */
            C[i + j * n] = cij;           /* C[i][j] = cij */
        }
}
```
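- A small driver for the unoptimized `dgemm` above (my own example, not from the textbook): the matrices are stored column-major in flat arrays, matching the `i + j*n` indexing.
```c=
#include <stdlib.h>

int main(void)
{
    int n = 32;
    double *A = calloc((size_t)(n * n), sizeof *A);
    double *B = calloc((size_t)(n * n), sizeof *B);
    double *C = calloc((size_t)(n * n), sizeof *C);

    A[0] = 1.0;  B[0] = 2.0;   // A[0][0] = 1, B[0][0] = 2
    dgemm(n, A, B, C);         // C += A * B, so C[0][0] becomes 2.0

    free(A); free(B); free(C);
    return 0;
}
```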
#### Matrix Multiply
#### Optimized Matrix Multiply in C
#### Optimized Matrix Multiply
### 3.9 Fallacies and Pitfalls