# 計算機組織 Computer Organization (上) (CSIE-2)
###### tags: `courses meow` `CSIE-2` `Computer Organization` `2021 Spring`
:::success
:::spoiler click to open TOC
[TOC]
:::
## Introduction
- What will you learn in this course
- Evalute the performance of a system
- Capabilities of the main computers of a CPU and a memory system
- How the main compunents are inter-connected
- Information flow between the components
- How to control the information flow
- Emerging memory architecture
:::success
### 教授
- Week_1 ~ Week_9
- intrustor
- 陳弘軒
- Office hour : after class or by appointment
- TAs
- 郭同益(email:kuotony860810@gmail.com)
- 王承凱(email:wang19980531@outlook.com)
- 徐永棚(yungpeng1998@gmail.com)
### Material
- Textbook
- "Computer Organization and Design : The Hardware/Software Interface" by David A. Patterson and John L Hennessy
- They are turing award laureates
- Slides
- Download from new ee-class
### Grading
- exams: `50%`
- mid 25% 4/22
- final 25% 6/24
- exercise `30%`
- - The lowest score is discarded
- project `20%`
- MIPS programing
- Emerging memory system architecture
- Late submission penalties :
- 1day -> 10% off
- 2days -> 20% off
- 3days or more -> 0
> 只有 59 會讓你 60 過,58以下就不用再來談了,不要跟我說什麼58跟59只差一分[name=教授]
:::
:::info
### 參考資料
- [MIPS相關筆記](https://chi_gitbook.gitbooks.io/personal-note/content/register_operands.html)
:::
---
## Chapter 1
### The Computer Revolution
- Progess in computer technology
- Underpinned by Moore's law
- Makes novel applications feasible
- Computers in automobliies
- Cell phones
- Human genome project
- World Wide Web
- Search Engines
:::info
### 摩爾定律(Moore's law)
> 在積體電路上可容納的電晶體數目,每隔兩年便會增加一倍[name=Intel創辦人戈登·摩爾]
- 積體電路上可容納的電晶體數目,約每隔18個月便增加一倍。
- 微處理器的效能每隔18個月提高一倍,或價格下降一半。
- 相同價格所買的電腦,效能每隔18個月增加一倍。
:::
### Classes of Computers
- Personal computers
- General purpose, variety of software
- Subject to cost/performance tradeoff
- server
- network based
- reliability & cap
- super computer
- high-end scientific and engineering
- Embedded computers
- Hidden as compunents of systems
- Stringent power/performance/cost constraints
- Personal Mobile Device (PMD)
- Bettery opersted
- ex: smart phones, tablets, smart glasses
- Cloud computing
- Warehouse Scale Computers(WSC)
- Software as a Service(SaaS) (SaaS 是其中一種)
- Portion of software run on a PMD and a portion rin in the Cloud
- Amazon(AWS) and Google(GCP) ,Mincrosoft(azure) ,linode ,
### Components of a Computer

- Same components for all kinds of computer
### Understanding Performance
- algo
- determines ==number of operations== executed
- Programming language, compiler, architecture
- Determines ==number of machine instructions== exec per operation
- > 程式語言&compiler 可以用machine code的量來評估效率
- Prossor and memory system
- Determine how fast instructions are excuted
- > 時間和空間不可兼得
- I/O system (including OS)
- Determines how fast I/O operations are executed
- 盡量讓你的程式可以從disk讀出來後就直接做運算,以減少cost
### Inside the PR0cess0r (**CPU**)
- Datapath:performance operation on data
- Control: sequences, datapath, memory....
- Cache memory
- small fast SRAM memory for immediate access to data
- bla bla bla bla
- sth coooool
### Abstractions
- Abstration helps us deal with complexity
- Hide lower-level detail
### A Safe Place for Data
> 揮發性記憶體 vs 非揮發性記憶體
### Networks
- Communication, resource sharing, nonlocal access
- Local Area Network (**LAN**): Ethernet
- Wide Area Network (**WAN**): the Internet
- Wireless network: WiFi, Bluetooth
### 摩爾定律(Moore's law)

- 在經過12到18個月後積體電路上可容納的電晶體數目便會翻倍blablabla
### Semiconducter Technology
- Silicon: semiconductor(半導體)
- Add materials to transform properties:
- Conductor (導體)
- Insulators (絕緣體)
- Switch
### Manufacturing ICs

- High yield is difficult
- If 50 per process,and $Y_i = 0.99$,$Y_i^{50} = 0.61$
:::info
### Integrated Circuit Cost
- Cost per die = $\frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$
- dies per wafer ≈ $\frac{\text{Wafer area}}{\text{Die area}}$
- Yield =
$$
\frac{1}{{1+(\text{Defects per area}\times\frac{\text{Die area}}{2})}^2}
$$
- Die costs are nonlinear die area
- wafer cost and wafer area are fixed
- defect rate determined by manufacturing process
- die area deteremined by architecture and circut design
> Die: 從矽晶圓切出來一小片的矽晶
> 至少要知道名詞意思,剩下就是小學數學而已惹
> 很多人覺得這堂課很難是因為瑣碎細節很多
[name=教授]
:::
### Response Time and Througnput
- ==Response Time==
- How long if takes to do a task
- Throughput
- Total work done per util time
> 有的東西可能Response Time高,但是緩衝時間長,所以Throughput低[name=某同學]
### sth
- Elapsed time
- Total response time ,including all aspects
- Processing ,I/O ,OS overhaed ,idle time
- CPU time
> determined by `user mode` or `kernel mode`
- user cpu time
- system cpu time
- CPU time
- Time spent processing a given job
- Discounts I/O time, other jobs' shares
- Comprises CPU time and system time
- Different programs are affected by CPU and system performance
### CPU Clocking
- Clock period(reponse time) : duration of a clock cycle
- CPU frequency(rate ,throughput) : cycles per second
> frequency高不一定代表速度快
### Instruction count and CPI
> Average cycles per instruction 不一定是整數
> 但是每一個 instruction 一定執行整數個 cycle
### Summary

### Amdahl's law

---
## Chapter 2
- Instructions : Language of the Computer
> How do Assembly Language transform to machine code
### Unsigned Binary Integers
Given an n-bit number
$x = x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + ... + x_12^1 + x_02^0$
Range : $0$ to $2 ^ n - 1$
### Complement Signed number
Given an n-bit number
$x = - x_{n-1}2^{n-1} + x_{n-2}2^{n-2} + ... + x_12^1 + x_02^0$
Range : $-2^{n-1}$ to $2^{n-1}$
- Most negative: $1000 0000 0000 ... 0000$
> 只有最左邊的負的 bit 是1,其他都是0
- Most positive: $0111 1111 1111 ... 1111$
> 讓最左邊的負的 bit 是0,其他正的 bit 都是1
### Signed Negation
- Complement and then plus 1 !!!
$x + \sim x = 11111...111 = -1$
$\implies -x = \sim x + 1$
### Signed Extension
- Replicate the sign bit to the left
- EX : -2 : **1111 1111** 1111 1110
### Instruction Set
> CPU 原生支援的指令集
> 有的CPU原生指令集裡面可能沒有not運算,那它就會把not轉換成它有的指令集進行運算
### The MIPS
- "**M**illion **I**nstruction **P**er **S**econd"
- "**M**icroprocessor without **I**nterlocked **P**ipeline **S**tages"
### MIPS
> Assembly的指令格式會盡量一樣,這對於未來轉換成machine code以及內部電路在連接的時候有幫助[name=教授]
- Register Operands
- MIPS 有 32 個 32-bit 的 "general purpose" register
- 32-bit (4 bytes) data is called a **word**
- Arithmetic Operands
- `add` and `sub` use three operands
- `add a, b, c` -> a gets b + c
- operand can only use register ==(cannot use memory!!!)==
- Memory Operands
- Main memory used for composite data
- Arrays, structures, dynamic data
- To apply arithmetic operations
- Load : from memory to register
- `lw a, b`
- Store : from register to memory
- `sw a, b`
- array in memory

- Each memory address identifies in 8 bytes
- **Words** are aligned in memory (address must be a multiple of 4)
- MIPS is ==big endian==
- Immediate Operands
- Constant data specified in an instruction
- `addi $s3 $s3 4`
- No subtract immediate instruction
- Just use a negative constant
:::info
### Big and Little Endiiiiiian
- Little Endian: The least significant byte occurs at the first (smallest) memory address.
- Big Endian: The most significant byte occurs at the first (smallest) memory address.

> Little **End**ian: 數字結尾在記憶體位置較小處
> Big Endian: 數字結尾在記憶體位置較大處
:::
### Shift Operations
- Shift left
- Shift left and fill with 0
- sll i bit -> multiples $2^{i}$
- Shift right
- shift right and fill with 0(unsigned only)
- srl i bit -> divides $2^{i}$
- Example
- `a *= 8` in C language
- `sll $s0, $s0, 3` in MIPS
### Logical Operations
- AND, OR, NOR
- NOT operation
- `NOR = NOT (a OR b)`
- 如果我們能讓 `a OR b = a` 的話就達到目的了
- 只要讓 `a` 去 `OR` 一個都是 `0` 的東東就會是他本身,可以利用 zero register `$zero`
:::warning
**Note**
`$zero` 也可以拿來做 register 之間資料的搬運
ex.
`add $s1, $s2, $zero` 會把 `$s2` 的值搬到 `$s1` 裡面 -> 類似mov
:::
### Representing Instructions
- Encoded as 32-bit instruction words
- Register Number
- $t0 \sim t7$ reg 8~15
- $s0 \sim s7$ reg 16~23
- $t8 \sim t9$ reg 24, 25

- R-format (`add`, `sub`, `and`, ...)
- **op(6 bits)**: operation code (000000 in R-format)
- **rs(5 bits)**: first source register number (first operand)
- **rt(5 bits)**: second source register number (second operand)
- **rd(5 bits)**: first destination register number
- **shamt(5 bits)**: (only used for shift instructions) shift amount
- **funct(6 bits)**: function code (extends opcode)
- a lookup table contains all kinds of function code
- e.g. `sll` is `0x00(000000)`, `add` is `0x20(100000)`
- I-format (`addi`, `subi`, `lw`, `beq`, ...)
- 需要用到立即數(一般數字) ,會用 I-Format (Immediate)
- **op(6 bits)**
- ==Different from R-format(always 0), using op code to determine which instruction should be used==
- **rs(5 bits)**: source register number
- **rt(5 bits)**: destination register number
- **constant or address(16 bits)**:
- constant range: $-2^{15}$ to $2^{15}+1$
- address: offset added to base address in `rs`
- e.g. if the instruction is `beq` (or other branching instructions) and the address is `3`, the system will move PC (program counter) `3` lines after current PC; more precisely, `3*4=12` bytes after.
- J-Format
- 通常用於改變程式執行的流向(jump)
- **op(6 bits)**: operation code
- determine which instruction should be used
- **address(26 bits)**: where should it go (destination)
:::warning
**Note**
Why does `rs`, `rt`, `rd` only need 5 bits?
Because in MIPS, we have 32 registers, we need $2^5=32$ different numbers to uniquely represent them.
:::
### Conditional Operations
- **Branch to a labeled instruction if a condition is true**
- `beq rs, rt, L1` (branch equal)
- jumps to label `L1` if `rs == rt`
- `bne rs, rt, L1` (branch not equal)
- jumps to label `L1` if `rs != rt`
- `j L1` (direct jump)
- jumps to label `L1`
- `slt/slti rd, rs, rt` (set on less than, signed comparison)
- if (rs < rt) rd = 1, else rd = 0
- `sltu/sltui rd, rs, rt` (set on less than, unsigned comparison)
### Loop Statements
- Use a register to record counter number
- Use conditional operations to determine if the counter has reached a specified number
- If true, then jump to the `end` label
- If false, then jump to the `loop` label
<!-- 我找到了一個相關的筆記,丟在最上面的參考資料惹(看起來是學長寫的?) -->
:::info
**Why 2's complement?**
- Why not 1 = ==0==001, -1 = ==1==001 ?
- Disadvantage
- 可能會有運算處理的問題,需要做額外的處理。
- 出現兩個零,0 = ==0==000 or ==1==000
- -1 + 2會變成-3喔
- ==1==001(-1) + ==0==010(0) = ==1==011(-3)
**Review**
- addi $s0, $zero, 65536
> 指令有誤,因為 65536 > $2^{16}$
<!--我前幾天做筆記才剛好查那個的用法XD-->
:::
### Procedure Calling
> 後面加ee的,就是被幹嘛的[name=教授]
- Caller
- Place parameters in register
- Transfer control
- e.g.
```
add $a0, $s0, 0
add $a1, $s1, 0 // $a0, $a1 are the 2 parameters for addTwo
jal addTwo
add $s2, $v0, 0 // result was stored in $v0
```
- Callee
- Acquire storage for procedure
- Caller might use `$s0, $s1, ...`, need to save them on stack
- Save value in stack
```
addi $sp, $sp, -4
sw $s0, 0($sp) // stores $s0 into the address pointed by $sp (*($sp) = $s0)
```
- Load value in stack
```
lw $s0, 0($sp) // loads $s0 from the address pointed by $sp ($s0 = *($sp))
addi $sp, $sp, 4
```
- Instruction
- `jal ProcedureLabel` (jump and link)
- address of following instruction put in `$ra` (link return address to ra)
- jumps to target address (jump to target)
- `jr $ra` (jump register)
- copys `$ra` to program counter (PC)
> Every instruction must be in the address of multiple of 4
### Byte / Halfword Operations
- Load byte / halfword
- `lb / lh rt, offset(rs)` (sign extension)
- `lbu / lhu rt, offset(rs)` (zero extension)
- Save byte / halfword
- `sb / sh rt, offset(rs)`
:::info
**J-type**
- 扣掉OPcode(6 bits)以後還有26 bits,電腦會自動補位2 bits($\times 4$,讓其必定為4的倍數,對其address)
- $28$ bits can represent $2^{28}$ addresses $=2^{20} \cdot 2^8=2^8$ MB
- 會在$2^8$的範圍內跑來跑去
:::
---
## Chapter 3
### ALU Operations
| Operation | Signal |
|:----------------------:|:------:|
| AND | 0000 |
| OR | 0001 |
| ADD | 0011 |
| SLT (SET-ON-LESS-THAN) | 0111 |
| - | - |
| SUBTRACT | 0110 |
| NOR | 1100 |
- `AND`, `OR`, `ADD`, `SLT` are the 4 basic operations.
- `SUBTRACT` = `A` + `~B` + 1 (by 2's complement)
- `NOR` = `~A` AND `~B`
- `NAND` = `~A` OR `~B`
### Set-on-less-than Operation
- Basic idea
- if $a<b$, return $1$; otherwise return $0$
- Implementation
- For bit 31
- result is 0
- `SET`: the output of sign(`A` - `B`) (check bit 31)
- `SET` will be wired to check overflow(*), and then wired to bit 0
- For bit 1 to 30, result is always 0
- For bit 0
- receive `SET` result and output
:::success
### Notes for overflow checking
#### Design
`SLT` (of bit 0) = `SET` XOR `Overflow`
#### Reason
if $a-b$ overflowed, the `SLT` output is incorrect.
| (a-b) sign bit | overflow? | correct SLT result |
| -------- | -------- | -------- |
| 1 | False(0) | 1 (✓) |
| 0 | False(0) | 0 (✓) |
| 1 | True(1) | 0 (✗) |
| 0 | True(1) | 1 (✗) |
#### Overflow
- `Overflow` = `CarryIn[N-1]` XOR `CarryOut[N-1]`
- e.g.
1. No overflow (CarryIn $=$ CarryOut)
$111+110=101$
2. Overflow (CarryIn $\neq$ CarryOut)
$101+110=011$ (CarryIn (=0) does not equal to CarryOut(=1) of MSB)
:::
### Booth's Algorithm
#### Idea
$2 \times 6 = 0010 \times 0110=0010\times (1000-0010)$
> 連續的 "1" 可以被拆成高位的加法與低位的減法
> 因此: "10" => 進入連續的一,減去被成乘數(目前在連續 "1" 的最低位)
> "00", "11" 不須做其他動作
> "01" => 離開連續的 "1",加上被乘數(目前在連續 "1" 的最高位)
#### Algorithm
Loop 乘數的每個 bit (由低到高) :
1. 遇到`0`的話什麼也不做,直到遇到`1`
2. 遇到`1`的話,扣掉被乘數
3. 接下來換成遇到`1`什麼也不做,直到遇到`0`
4. 遇到`0`的話,加上被乘數
5. 回到第 1 步
#### Example
$2 \times 6 = 0010 \times 0110$
Loop every bit of $0110$ :
1. bit 0 ($0$)
遇到`0`,啥事也不幹
`result` = `0000`
2. bit 1 ($1$)
遇到第一個`1`,扣掉被乘數
`result` = - `00100` (-4)
3. bit 2 ($1$)
遇到`1`,啥事也不幹
`result` = - `00100` (-4)
4. bit 3 ($0$)
遇到`0`了,加上被乘數
`result` = `0010000`(16) - `00100`(4) = `0001100`(12)
### Divison
後面再補
//==TODO==:
// ==FIXME==:
### Signed Division
- Use unsigned division first
- 被除數和餘數的正負要一樣 (但不是每個語言都這樣做)
- e.g. $(-7) \div 2 = (-3) ... (-1)$
### Floating Point
#### The Floating Point Standard
- IEEE 754 (IEEE Standard for Floating-Point Arithmetic)
- Single precision (32-bit)
- Double precision (64-bit)
#### Format

$$
x = (-1)^{\text{sign}} \times (1+\text{fraction})\times 2^{\text{exponent-bias}}
$$
- Sign
- Sign bit ($0 \to$ nonnegative, $1 \to$ negative)
- Fraction (a.k.a. significant, mantissa)
- Exponent
- Excess representation: `actual exponent` + `bias`
- Ensures exponent is unsigned
- Bias
- We want to compare floating point numbers as if they were integers
- Let notation `0000 0000` be the most negative, `1111 1111` be the most positive
- For IEEE 754, bias for single-precision FP is $127$, double-precision is $1023$
:::warning
**Why Bias?**
- If there is no bias:
`0 1111 1111 0 ... 0` = $\frac{1}{2}$
`0 0000 0001 0 ... 0` = $2$
$\frac{1}{2}$ will be greater than $2$!
- Add bias:
`0 0111 1110 0 ... 0` = $\frac{1}{2}$
`0 1000 0000 0 ... 0` = $2$
$\implies 2 > \frac{1}{2}$
:::
- Exponent `0000 0000` and `1111 1111` are reserved (see below)
#### Smallest value (single precision)
- Exponent: `0000 0001` (actual exponent = 1 - 127 = -126)
- Fraction: `0 ... 0` -> significant = `1.0`
- Lower bound: $\pm 1.0\times 2^{-126} \approx \pm 1.2\times 10^{-38}$
#### Largest value (single precision)
- Exponent: `1111 1110` (actual exponent = 254 - 127 = 127)
- Fraction: `1 ... 1` -> significant $\approx$ `2.0`
- Upper bound: $\pm 2.0 \times 2^{127} \approx \pm 3.4 \times 10^{38}$
### Zero and special numbers
#### Representation for 0
- Format: `0 0000 0000 0...0` (all 0)
#### Gradual Underflow
- If exponent is `0000 0000`, then siginificand will be `fraction` itself rather than `1 + fraction`
#### Smallest Number
- The smallest positive normalized number
- `1.0000 ... 0000` $\times 2^{-126}$
- The largest positive **de-normalized** number
- `0.1111 ... 1111` $\times 2^{-126}$
- The smallest positive **de-normalized** number
- `0.0000 ... 0001` $\times 2^{-126}$
#### Representation for +/- infinity
- In FP, divide by zero should produce infinity, not overflow
- Format: `0 1111 1111 0...0`
#### Representation for NaN (not a number)
- Format: `0 1111 1111 0...1` (fraction not 0)
---
## Chapter 4
### CPU Performance Factors
- Instruction count
- Determined by ISA and compiler
- CPI and Cycle time
- Determined by CPU hardware
### Elements
- Combinational element
- Output is a function of input
- e.g.
- AND-gate
- Adder
- Multiplexer
- ALU
- State element
### Clocking Methodology
- Positive edge-triggered
- 當 clock 從 0 變成 1 的時候資料才會傳送
---
## Ch4-2 Pipelining
### Pipeline Analogy
每個 task 的 response time 可能變長,但整體的 throughput 是增加的
### MIPS Pipeline
- IF (instruction fetch)
- ID (instruction decode)
- EX (execution)
- MEM (accessing memory)
- WB (write back)

### Pipeline Registers
- Need registers between stages to hold information produced in previous cycle
### Pipeline Performance

### Structure Hazard
- Conflict for use of a resource

#### Solution
- Making instructions have the same length

### Some Problems
### Data Hazard (Read after write, RAW)
- Suppose we're executing some consecutive instructions:
```
lw $10, 0($3)
add $10, $10, $4
sub $11, $10, $5
```
- The value of `$10` may be wrong while executing `add` and `sub` if we use pipeline
### Handling Data Hazard
- **Detect** and **resolve**
- Resolve: **Forwarding**
- Use result when it is computed
- Requires extra connections in the datapath

- Detect: **Forwarding Unit**
- Detect whether ++`rd` of the previous instruction++ is used in ++`rs` or `rt` of the current instruction++
- `ID/EX.rs`, `ID/EX.rt`, `EX/MEM.rd` and `MEM/WB.rd` (registers to compare) is wired to **Forwarding Unit**

- **EX Hazard**
- 1a. `EX/MEM.rd` = `ID/EX.rs`
$\implies$ ForwardA = 10
- 1b. `EX/MEM.rd` = `ID/EX.rt`
$\implies$ ForwardB = 10
- **MEM Hazard**
- 2a. `MEM/WB.rd` = `ID/EX.rs`
$\implies$ ForwardA = 01
- 2b. `MEM/WB.rd` = `ID/EX.rt`
$\implies$ ForwardB = 01
:::warning
- **ForwardA** controls whether the ALU should accept `rd` as the first input
- 00 - accept original `rs`
- 01 - accept `MEM/WB.rd`
- 10 - accept `EX/MEM.rd`
- **ForwardB** controls whether the ALU should accept `rd` as the second input
- 00 - accept original `rt`
- 01 - accept `MEM/WB.rd`
- 10 - accept `EX/MEM.rd`
:::
- **Load-Use Data Hazard**
- Forwarding will not work here! (Values are not prepared when needed)
- Condition:
- `ID/EX.type` = MemRead (lw)
- `ID/EX.rt` = `IF/ID.rs`
- `ID/EX.rt` = `IF/ID.rt`
- Solution: ++**Insert 1 bubble + forwarding**++
- Insert bubble
- set all control signals to 0
- prevent update of PC and IF/ID registers
- **Reordering to avoid stalls**

### Control Hazard
- Suppose we're executing a `beq` instruction
- If we use pipeline, we can't determine whether the next instruction is `PC+4`, or the label given by `beq`
### Handling Control Hazard
- We can handle this at `IF/ID` stage
- After `ID`, we can know how far the `beq` instruction should jump to
- Solution:
- Put a comparator at registers to compare if `rs = rt`, the result is wired to the mux before `IMem`
### Extra data hazard for branching instructions
- We moved the logic 1 stage before the original one (`EX` -> `ID`)
- This means `beq` can and should be executed at the 2nd stage (`ID`)
- e.g. `add` -> `beq`
- `add` can only get the result after the 3rd stage (`EX`)
- Need 1 stall cycle

- e.g. `lw` -> `beq`
- `lw` can only get the result after the 4th stage (`MEM`)
- Need 2 stall cycles

### Branch Prediction
#### Dynamic Predicton
- The hardware will predict the result according to the history
- **1-bit predictor**
- If predict taken -> guess the next predict will be taken
- If not -> guess the next predict will not be taken
- **2-bit predictor**
- Similar to 1-bit predictor, but only change the predict if there are 2 mispredictions