計算機組織 Computer Organization (上) (CSIE-2)

# 計算機組織 Computer Organization (上) (CSIE-2) ###### tags: `courses meow` `CSIE-2` `Computer Organization` `2021 Spring` :::success :::spoiler click to open TOC [TOC] ::: ## Introduction - What will you learn in this course - Evalute the performance of a system - Capabilities of the main computers of a CPU and a memory system - How the main compunents are inter-connected - Information flow between the components - How to control the information flow - Emerging memory architecture :::success ### 教授 - Week_1 ~ Week_9 - intrustor - 陳弘軒 - Office hour : after class or by appointment - TAs - 郭同益(email:kuotony860810@gmail.com) - 王承凱(email:wang19980531@outlook.com) - 徐永棚(yungpeng1998@gmail.com) ### Material - Textbook - "Computer Organization and Design : The Hardware/Software Interface" by David A. Patterson and John L Hennessy - They are turing award laureates - Slides - Download from new ee-class ### Grading - exams: `50%` - mid 25% 4/22 - final 25% 6/24 - exercise `30%` - - The lowest score is discarded - project `20%` - MIPS programing - Emerging memory system architecture - Late submission penalties : - 1day -> 10% off - 2days -> 20% off - 3days or more -> 0 > 只有 59 會讓你 60 過，58以下就不用再來談了，不要跟我說什麼58跟59只差一分[name=教授] ::: :::info ### 參考資料 - [MIPS相關筆記](https://chi_gitbook.gitbooks.io/personal-note/content/register_operands.html) ::: --- ## Chapter 1 ### The Computer Revolution - Progess in computer technology - Underpinned by Moore's law - Makes novel applications feasible - Computers in automobliies - Cell phones - Human genome project - World Wide Web - Search Engines :::info ### 摩爾定律(Moore's law) > 在積體電路上可容納的電晶體數目，每隔兩年便會增加一倍[name=Intel創辦人戈登·摩爾] - 積體電路上可容納的電晶體數目，約每隔18個月便增加一倍。 - 微處理器的效能每隔18個月提高一倍，或價格下降一半。 - 相同價格所買的電腦，效能每隔18個月增加一倍。 ::: ### Classes of Computers - Personal computers - General purpose, variety of software - Subject to cost/performance tradeoff - server - network based - reliability & cap - super computer - high-end scientific and engineering - Embedded computers - Hidden as compunents of systems - Stringent power/performance/cost constraints - Personal Mobile Device (PMD) - Bettery opersted - ex: smart phones, tablets, smart glasses - Cloud computing - Warehouse Scale Computers(WSC) - Software as a Service(SaaS) (SaaS 是其中一種) - Portion of software run on a PMD and a portion rin in the Cloud - Amazon(AWS) and Google(GCP) ,Mincrosoft(azure) ,linode , ### Components of a Computer ![](https://i.imgur.com/blBidii.png) - Same components for all kinds of computer ### Understanding Performance - algo - determines ==number of operations== executed - Programming language, compiler, architecture - Determines ==number of machine instructions== exec per operation - > 程式語言&compiler 可以用machine code的量來評估效率 - Prossor and memory system - Determine how fast instructions are excuted - > 時間和空間不可兼得 - I/O system (including OS) - Determines how fast I/O operations are executed - 盡量讓你的程式可以從disk讀出來後就直接做運算，以減少cost ### Inside the PR0cess0r (**CPU**) - Datapath:performance operation on data - Control: sequences, datapath, memory.... - Cache memory - small fast SRAM memory for immediate access to data - bla bla bla bla - sth coooool ### Abstractions - Abstration helps us deal with complexity - Hide lower-level detail ### A Safe Place for Data > 揮發性記憶體 vs 非揮發性記憶體 ### Networks - Communication, resource sharing, nonlocal access - Local Area Network (**LAN**): Ethernet - Wide Area Network (**WAN**): the Internet - Wireless network: WiFi, Bluetooth ### 摩爾定律(Moore's law) ![](https://i.imgur.com/7yKe9MY.png) - 在經過12到18個月後積體電路上可容納的電晶體數目便會翻倍blablabla ### Semiconducter Technology - Silicon: semiconductor(半導體) - Add materials to transform properties: - Conductor (導體) - Insulators (絕緣體) - Switch ### Manufacturing ICs ![](https://i.imgur.com/5s2YOd4.png) - High yield is difficult - If 50 per process，and $Y_i = 0.99$，$Y_i^{50} = 0.61$ :::info ### Integrated Circuit Cost - Cost per die = $\frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$ - dies per wafer ≈ $\frac{\text{Wafer area}}{\text{Die area}}$ - Yield = $$ \frac{1}{{1+(\text{Defects per area}\times\frac{\text{Die area}}{2})}^2} $$ - Die costs are nonlinear die area - wafer cost and wafer area are fixed - defect rate determined by manufacturing process - die area deteremined by architecture and circut design > Die: 從矽晶圓切出來一小片的矽晶 > 至少要知道名詞意思，剩下就是小學數學而已惹 > 很多人覺得這堂課很難是因為瑣碎細節很多 [name=教授] ::: ### Response Time and Througnput - ==Response Time== - How long if takes to do a task - Throughput - Total work done per util time > 有的東西可能Response Time高，但是緩衝時間長，所以Throughput低[name=某同學] ### sth - Elapsed time - Total response time ,including all aspects - Processing ,I/O ,OS overhaed ,idle time - CPU time > determined by `user mode` or `kernel mode` - user cpu time - system cpu time - CPU time - Time spent processing a given job - Discounts I/O time, other jobs' shares - Comprises CPU time and system time - Different programs are affected by CPU and system performance ### CPU Clocking - Clock period(reponse time) : duration of a clock cycle - CPU frequency(rate ,throughput) : cycles per second > frequency高不一定代表速度快 ### Instruction count and CPI > Average cycles per instruction 不一定是整數 > 但是每一個 instruction 一定執行整數個 cycle ### Summary ![](https://i.imgur.com/VIXU5tr.png =550x) ### Amdahl's law ![](https://i.imgur.com/gfoKdVL.png) --- ## Chapter 2 - Instructions : Language of the Computer > How do Assembly Language transform to machine code ### Unsigned Binary Integers Given an n-bit number $x = x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + ... + x_12^1 + x_02^0$ Range : $0$ to $2 ^ n - 1$ ### Complement Signed number Given an n-bit number $x = - x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + ... + x_12^1 + x_02^0$ Range : $-2^{n-1}$ to $2^{n-1}$ - Most negative: $1000 0000 0000 ... 0000$ > 只有最左邊的負的 bit 是1，其他都是0 - Most positive: $0111 1111 1111 ... 1111$ > 讓最左邊的負的 bit 是0，其他正的 bit 都是1 ### Signed Negation - Complement and then plus 1 !!! $x + \sim x = 11111...111 = -1$ $\implies -x = \sim x + 1$ ### Signed Extension - Replicate the sign bit to the left - EX : -2 : **1111 1111** 1111 1110 ### Instruction Set > CPU 原生支援的指令集 > 有的CPU原生指令集裡面可能沒有not運算，那它就會把not轉換成它有的指令集進行運算 ### The MIPS - "**M**illion **I**nstruction **P**er **S**econd" - "**M**icroprocessor without **I**nterlocked **P**ipeline **S**tages" ### MIPS > Assembly的指令格式會盡量一樣，這對於未來轉換成machine code以及內部電路在連接的時候有幫助[name=教授] - Register Operands - MIPS 有 32 個 32-bit 的 "general purpose" register - 32-bit (4 bytes) data is called a **word** - Arithmetic Operands - `add` and `sub` use three operands - `add a, b, c` -> a gets b + c - operand can only use register ==(cannot use memory!!!)== - Memory Operands - Main memory used for composite data - Arrays, structures, dynamic data - To apply arithmetic operations - Load : from memory to register - `lw a, b` - Store : from register to memory - `sw a, b` - array in memory ![](https://i.imgur.com/F4r0Q1Z.png) - Each memory address identifies in 8 bytes - **Words** are aligned in memory (address must be a multiple of 4) - MIPS is ==big endian== - Immediate Operands - Constant data specified in an instruction - `addi $s3 $s3 4` - No subtract immediate instruction - Just use a negative constant :::info ### Big and Little Endiiiiiian - Little Endian: The least significant byte occurs at the first (smallest) memory address. - Big Endian: The most significant byte occurs at the first (smallest) memory address. ![](https://i.imgur.com/kwesPI5.png) > Little **End**ian: 數字結尾在記憶體位置較小處 > Big Endian: 數字結尾在記憶體位置較大處 ::: ### Shift Operations - Shift left - Shift left and fill with 0 - sll i bit -> multiples $2^{i}$ - Shift right - shift right and fill with 0(unsigned only) - srl i bit -> divides $2^{i}$ - Example - `a *= 8` in C language - `sll $s0, $s0, 3` in MIPS ### Logical Operations - AND, OR, NOR - NOT operation - `NOR = NOT (a OR b)` - 如果我們能讓 `a OR b = a` 的話就達到目的了 - 只要讓 `a` 去 `OR` 一個都是 `0` 的東東就會是他本身，可以利用 zero register `$zero` :::warning **Note** `$zero` 也可以拿來做 register 之間資料的搬運 ex. `add $s1, $s2, $zero` 會把 `$s2` 的值搬到 `$s1` 裡面 -> 類似mov ::: ### Representing Instructions - Encoded as 32-bit instruction words - Register Number - $t0 \sim t7$ reg 8~15 - $s0 \sim s7$ reg 16~23 - $t8 \sim t9$ reg 24, 25 ![](https://i.imgur.com/mYMIECT.png) - R-format (`add`, `sub`, `and`, ...) - **op(6 bits)**: operation code (000000 in R-format) - **rs(5 bits)**: first source register number (first operand) - **rt(5 bits)**: second source register number (second operand) - **rd(5 bits)**: first destination register number - **shamt(5 bits)**: (only used for shift instructions) shift amount - **funct(6 bits)**: function code (extends opcode) - a lookup table contains all kinds of function code - e.g. `sll` is `0x00(000000)`, `add` is `0x20(100000)` - I-format (`addi`, `subi`, `lw`, `beq`, ...) - 需要用到立即數(一般數字) ,會用 I-Format (Immediate) - **op(6 bits)** - ==Different from R-format(always 0), using op code to determine which instruction should be used== - **rs(5 bits)**: source register number - **rt(5 bits)**: destination register number - **constant or address(16 bits)**: - constant range: $-2^{15}$ to $2^{15}+1$ - address: offset added to base address in `rs` - e.g. if the instruction is `beq` (or other branching instructions) and the address is `3`, the system will move PC (program counter) `3` lines after current PC; more precisely, `3*4=12` bytes after. - J-Format - 通常用於改變程式執行的流向(jump) - **op(6 bits)**: operation code - determine which instruction should be used - **address(26 bits)**: where should it go (destination) :::warning **Note** Why does `rs`, `rt`, `rd` only need 5 bits? Because in MIPS, we have 32 registers, we need $2^5=32$ different numbers to uniquely represent them. ::: ### Conditional Operations - **Branch to a labeled instruction if a condition is true** - `beq rs, rt, L1` (branch equal) - jumps to label `L1` if `rs == rt` - `bne rs, rt, L1` (branch not equal) - jumps to label `L1` if `rs != rt` - `j L1` (direct jump) - jumps to label `L1` - `slt/slti rd, rs, rt` (set on less than, signed comparison) - if (rs < rt) rd = 1, else rd = 0 - `sltu/sltui rd, rs, rt` (set on less than, unsigned comparison) ### Loop Statements - Use a register to record counter number - Use conditional operations to determine if the counter has reached a specified number - If true, then jump to the `end` label - If false, then jump to the `loop` label  :::info **Why 2's complement?** - Why not 1 = ==0==001, -1 = ==1==001 ? - Disadvantage - 可能會有運算處理的問題，需要做額外的處理。 - 出現兩個零，0 = ==0==000 or ==1==000 - -1 + 2會變成-3喔 - ==1==001(-1) + ==0==010(0) = ==1==011(-3) **Review** - addi $s0, $zero, 65536 > 指令有誤，因為 65536 > $2^{16}$  ::: ### Procedure Calling > 後面加ee的，就是被幹嘛的[name=教授] - Caller - Place parameters in register - Transfer control - e.g. ``` add $a0, $s0, 0 add $a1, $s1, 0 // $a0, $a1 are the 2 parameters for addTwo jal addTwo add $s2, $v0, 0 // result was stored in $v0 ``` - Callee - Acquire storage for procedure - Caller might use `$s0, $s1, ...`, need to save them on stack - Save value in stack ``` addi $sp, $sp, -4 sw $s0, 0($sp) // stores $s0 into the address pointed by $sp (*($sp) = $s0) ``` - Load value in stack ``` lw $s0, 0($sp) // loads $s0 from the address pointed by $sp ($s0 = *($sp)) addi $sp, $sp, 4 ``` - Instruction - `jal ProcedureLabel` (jump and link) - address of following instruction put in `$ra` (link return address to ra) - jumps to target address (jump to target) - `jr $ra` (jump register) - copys `$ra` to program counter (PC) > Every instruction must be in the address of multiple of 4 ### Byte / Halfword Operations - Load byte / halfword - `lb / lh rt, offset(rs)` (sign extension) - `lbu / lhu rt, offset(rs)` (zero extension) - Save byte / halfword - `sb / sh rt, offset(rs)` :::info **J-type** - 扣掉OPcode(6 bits)以後還有26 bits，電腦會自動補位2 bits($\times 4$，讓其必定為4的倍數，對其address) - $28$ bits can represent $2^{28}$ addresses $=2^{20} \cdot 2^8=2^8$ MB - 會在$2^8$的範圍內跑來跑去 ::: --- ## Chapter 3 ### ALU Operations | Operation | Signal | |:----------------------:|:------:| | AND | 0000 | | OR | 0001 | | ADD | 0011 | | SLT (SET-ON-LESS-THAN) | 0111 | | - | - | | SUBTRACT | 0110 | | NOR | 1100 | - `AND`, `OR`, `ADD`, `SLT` are the 4 basic operations. - `SUBTRACT` = `A` + `~B` + 1 (by 2's complement) - `NOR` = `~A` AND `~B` - `NAND` = `~A` OR `~B` ### Set-on-less-than Operation - Basic idea - if $a<b$, return $1$; otherwise return $0$ - Implementation - For bit 31 - result is 0 - `SET`: the output of sign(`A` - `B`) (check bit 31) - `SET` will be wired to check overflow(*), and then wired to bit 0 - For bit 1 to 30, result is always 0 - For bit 0 - receive `SET` result and output :::success ### Notes for overflow checking #### Design `SLT` (of bit 0) = `SET` XOR `Overflow` #### Reason if $a-b$ overflowed, the `SLT` output is incorrect. | (a-b) sign bit | overflow? | correct SLT result | | -------- | -------- | -------- | | 1 | False(0) | 1 (✓) | | 0 | False(0) | 0 (✓) | | 1 | True(1) | 0 (✗) | | 0 | True(1) | 1 (✗) | #### Overflow - `Overflow` = `CarryIn[N-1]` XOR `CarryOut[N-1]` - e.g. 1. No overflow (CarryIn $=$ CarryOut) $111+110=101$ 2. Overflow (CarryIn $\neq$ CarryOut) $101+110=011$ (CarryIn (=0) does not equal to CarryOut(=1) of MSB) ::: ### Booth's Algorithm #### Idea $2 \times 6 = 0010 \times 0110=0010\times (1000-0010)$ > 連續的 "1" 可以被拆成高位的加法與低位的減法 > 因此： "10" => 進入連續的一，減去被成乘數（目前在連續 "1" 的最低位） > "00", "11" 不須做其他動作 > "01" => 離開連續的 "1"，加上被乘數（目前在連續 "1" 的最高位） #### Algorithm Loop 乘數的每個 bit (由低到高) : 1. 遇到`0`的話什麼也不做，直到遇到`1` 2. 遇到`1`的話，扣掉被乘數 3. 接下來換成遇到`1`什麼也不做，直到遇到`0` 4. 遇到`0`的話，加上被乘數 5. 回到第 1 步 #### Example $2 \times 6 = 0010 \times 0110$ Loop every bit of $0110$ : 1. bit 0 ($0$) 遇到`0`，啥事也不幹 `result` = `0000` 2. bit 1 ($1$) 遇到第一個`1`，扣掉被乘數 `result` = - `00100` (-4) 3. bit 2 ($1$) 遇到`1`，啥事也不幹 `result` = - `00100` (-4) 4. bit 3 ($0$) 遇到`0`了，加上被乘數 `result` = `0010000`(16) - `00100`(4) = `0001100`(12) ### Divison 後面再補 //==TODO==: // ==FIXME==: ### Signed Division - Use unsigned division first - 被除數和餘數的正負要一樣 (但不是每個語言都這樣做) - e.g. $(-7) \div 2 = (-3) ... (-1)$ ### Floating Point #### The Floating Point Standard - IEEE 754 (IEEE Standard for Floating-Point Arithmetic) - Single precision (32-bit) - Double precision (64-bit) #### Format ![](https://i.imgur.com/4f0xqLC.png) $$ x = (-1)^{\text{sign}} \times (1+\text{fraction})\times 2^{\text{exponent-bias}} $$ - Sign - Sign bit ($0 \to$ nonnegative, $1 \to$ negative) - Fraction (a.k.a. significant, mantissa) - Exponent - Excess representation: `actual exponent` + `bias` - Ensures exponent is unsigned - Bias - We want to compare floating point numbers as if they were integers - Let notation `0000 0000` be the most negative, `1111 1111` be the most positive - For IEEE 754, bias for single-precision FP is $127$, double-precision is $1023$ :::warning **Why Bias?** - If there is no bias: `0 1111 1111 0 ... 0` = $\frac{1}{2}$ `0 0000 0001 0 ... 0` = $2$ $\frac{1}{2}$ will be greater than $2$! - Add bias: `0 0111 1110 0 ... 0` = $\frac{1}{2}$ `0 1000 0000 0 ... 0` = $2$ $\implies 2 > \frac{1}{2}$ ::: - Exponent `0000 0000` and `1111 1111` are reserved (see below) #### Smallest value (single precision) - Exponent: `0000 0001` (actual exponent = 1 - 127 = -126) - Fraction: `0 ... 0` -> significant = `1.0` - Lower bound: $\pm 1.0\times 2^{-126} \approx \pm 1.2\times 10^{-38}$ #### Largest value (single precision) - Exponent: `1111 1110` (actual exponent = 254 - 127 = 127) - Fraction: `1 ... 1` -> significant $\approx$ `2.0` - Upper bound: $\pm 2.0 \times 2^{127} \approx \pm 3.4 \times 10^{38}$ ### Zero and special numbers #### Representation for 0 - Format: `0 0000 0000 0...0` (all 0) #### Gradual Underflow - If exponent is `0000 0000`, then siginificand will be `fraction` itself rather than `1 + fraction` #### Smallest Number - The smallest positive normalized number - `1.0000 ... 0000` $\times 2^{-126}$ - The largest positive **de-normalized** number - `0.1111 ... 1111` $\times 2^{-126}$ - The smallest positive **de-normalized** number - `0.0000 ... 0001` $\times 2^{-126}$ #### Representation for +/- infinity - In FP, divide by zero should produce infinity, not overflow - Format: `0 1111 1111 0...0` #### Representation for NaN (not a number) - Format: `0 1111 1111 0...1` (fraction not 0) --- ## Chapter 4 ### CPU Performance Factors - Instruction count - Determined by ISA and compiler - CPI and Cycle time - Determined by CPU hardware ### Elements - Combinational element - Output is a function of input - e.g. - AND-gate - Adder - Multiplexer - ALU - State element ### Clocking Methodology - Positive edge-triggered - 當 clock 從 0 變成 1 的時候資料才會傳送 --- ## Ch4-2 Pipelining ### Pipeline Analogy 每個 task 的 response time 可能變長，但整體的 throughput 是增加的 ### MIPS Pipeline - IF (instruction fetch) - ID (instruction decode) - EX (execution) - MEM (accessing memory) - WB (write back) ![](https://i.imgur.com/n0iNQHM.png) ### Pipeline Registers - Need registers between stages to hold information produced in previous cycle ### Pipeline Performance ![](https://i.imgur.com/KElq6XR.png) ### Structure Hazard - Conflict for use of a resource ![](https://i.imgur.com/5P5Mb99.png) #### Solution - Making instructions have the same length ![](https://i.imgur.com/UyPeIps.png) ### Some Problems ### Data Hazard (Read after write, RAW) - Suppose we're executing some consecutive instructions: ``` lw $10, 0($3) add $10, $10, $4 sub $11, $10, $5 ``` - The value of `$10` may be wrong while executing `add` and `sub` if we use pipeline ### Handling Data Hazard - **Detect** and **resolve** - Resolve: **Forwarding** - Use result when it is computed - Requires extra connections in the datapath ![](https://i.imgur.com/b5vobLS.png) - Detect: **Forwarding Unit** - Detect whether ++`rd` of the previous instruction++ is used in ++`rs` or `rt` of the current instruction++ - `ID/EX.rs`, `ID/EX.rt`, `EX/MEM.rd` and `MEM/WB.rd` (registers to compare) is wired to **Forwarding Unit** ![](https://i.imgur.com/LUkEC3p.png) - **EX Hazard** - 1a. `EX/MEM.rd` = `ID/EX.rs` $\implies$ ForwardA = 10 - 1b. `EX/MEM.rd` = `ID/EX.rt` $\implies$ ForwardB = 10 - **MEM Hazard** - 2a. `MEM/WB.rd` = `ID/EX.rs` $\implies$ ForwardA = 01 - 2b. `MEM/WB.rd` = `ID/EX.rt` $\implies$ ForwardB = 01 :::warning - **ForwardA** controls whether the ALU should accept `rd` as the first input - 00 - accept original `rs` - 01 - accept `MEM/WB.rd` - 10 - accept `EX/MEM.rd` - **ForwardB** controls whether the ALU should accept `rd` as the second input - 00 - accept original `rt` - 01 - accept `MEM/WB.rd` - 10 - accept `EX/MEM.rd` ::: - **Load-Use Data Hazard** - Forwarding will not work here! (Values are not prepared when needed) - Condition: - `ID/EX.type` = MemRead (lw) - `ID/EX.rt` = `IF/ID.rs` - `ID/EX.rt` = `IF/ID.rt` - Solution: ++**Insert 1 bubble + forwarding**++ - Insert bubble - set all control signals to 0 - prevent update of PC and IF/ID registers - **Reordering to avoid stalls** ![](https://i.imgur.com/Z038rn5.png) ### Control Hazard - Suppose we're executing a `beq` instruction - If we use pipeline, we can't determine whether the next instruction is `PC+4`, or the label given by `beq` ### Handling Control Hazard - We can handle this at `IF/ID` stage - After `ID`, we can know how far the `beq` instruction should jump to - Solution: - Put a comparator at registers to compare if `rs = rt`, the result is wired to the mux before `IMem` ### Extra data hazard for branching instructions - We moved the logic 1 stage before the original one (`EX` -> `ID`) - This means `beq` can and should be executed at the 2nd stage (`ID`) - e.g. `add` -> `beq` - `add` can only get the result after the 3rd stage (`EX`) - Need 1 stall cycle ![](https://i.imgur.com/tccvm9s.png) - e.g. `lw` -> `beq` - `lw` can only get the result after the 4th stage (`MEM`) - Need 2 stall cycles ![](https://i.imgur.com/3drdiEM.png) ### Branch Prediction #### Dynamic Predicton - The hardware will predict the result according to the history - **1-bit predictor** - If predict taken -> guess the next predict will be taken - If not -> guess the next predict will not be taken - **2-bit predictor** - Similar to 1-bit predictor, but only change the predict if there are 2 mispredictions