Computer Architecture

Chapter 1 - Computer Abstractions and Technology

1.1 - Introduction

1.2 - Eight Great Ideas in Computer Architecture

1.3 - Below Your Program

1.4 - Under the Covers

1.5 - Technologies for Building Processors and Memory

1.6 - Performance

1.7 - The Power Wall

1.8 - The Sea Change：The Switch from Uniprocessors to Multiprocessors

1.9 - Real Stuff：Benchmarking the Intel Core i7

1.10 - Fallacies and Pitfalls

1.11 - Concluding Remarks

1.12 - Historical Perspective and Further Reading

1.13 - Exercises

Chapter 2 - Instructions：Language of the Computer

2.1 - Introduction

2.2 - Operations of the Computer Hardware

2.3 - Operands of the Computer Hardware

2.4 - Signed and Unsigned Numbers

2.5 - Representing Instructions in the Computer

2.6 - Logical Operations

2.7 - Instructions for Making Decisions

2.8 - Supporting Procedures in Computer Hardware

2.9 - Communicating with People

2.10 - MIPS Addressing for 32-Bit Immediates and Addresses

2.11 - Parallelism and Instructions：Synchronization

2.12 - Translating and Starting a Program

2.13 - A C Sort Example to Put It All Together

2.14 - Arrays versus Pointers

2.15 - Advanced Material：Compiling C and Interpreting Java

2.16 - Real Stuff：ARMv7 (32-bit) Instructions

2.17 - Real Stuff：x86 Instructions

2.18 - Real Stuff：ARMv8 (64-bit) Instructions

2.19 - Fallacies and Pitfalls

2.20 - Concluding Remarks

2.21 - Historical Perspective and Further Reading

2.22 - Exercises

Chapter 3 - Arithmetic for Computers

3.1 Introduction

3.2 Addition and Subtraction

3.3 Multiplication

3.4 Division

3.5 Floating Point

3.6 Parallelism and Computer Arithmetic：Subword Parallelism

3.7 Real Stuff：Streaming SIMD Extensions and Advanced Vector Extensions in x86

The midterm exam covers everything up to this point; the final exam covers what follows.

3.8 Going Faster：Subword Parallelism and Matrix Multiply

3.9 Fallacies and Pitfalls

3.10 Concluding Remarks

3.11 Historical Perspective and Further Reading

3.12 Exercises

Chapter 4 - The Processor

4.1 - Introduction

4.2 - Logic Design Conventions

4.3 - Building a Datapath

4.4 - A Simple Implementation Scheme

4.5 - An Overview of Pipelining

4.6 - Pipelined Datapath and Control

4.7 - Data Hazards：Forwarding v.s. Stalling

4.8 - Control Hazards

4.9 - Exceptions

4.10 - Parallelism and Advanced Instruction Level Parallelism

4.11 - Real Stuff：The ARM Cortex-A8 and Intel Core i7 Pipelines

4.12 - Instruction-Level Parallelism and Matrix Multiply

4.13 - Advanced Topic：An Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations

4.14 - Fallacies and Pitfalls

4.15 - Concluding Remarks

4.16 - Historical Perspective and Further Reading

4.17 - Exercises

Chapter 5 - Large and Fast：Exploiting Memory Hierarchy

5.1 - Introduction

5.2 - Memory Technologies

5.3 - The Basics of Caches

5.4 - Measuring and Improving Cache Performance

5.5 - Dependable Memory Hierarchy

5.6 - Virtual Machine（VM）

5.7 - Virtual Memory

5.8 - A Common Framework for Memory Hierarchy

5.9 - Using a Finite-State Machine to Control a Simple Cache

5.10 - Parallelism and Memory Hierarchies: Cache Coherence

5.11 - Real Stuff：The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

5.12 - Going Faster：Cache Blocking and Matrix Multiply

5.13 - Fallacies and Pitfalls

Chapter 1 - Computer Abstractions and Technology

1.1 - Introduction

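Classes of modern computers:

Personal computers
- General purpose, running a wide variety of software
- Subject to a cost/performance tradeoff
Server computers
- Network based
- High capacity, performance, and reliability
- Range from small servers to building sized
Supercomputers
- Used for high-end scientific and engineering calculations
- Highest performance, but the smallest share of the market
Embedded computers
- Hidden as components of systems
- Stringent power, performance, and cost constraints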

1.2 - Eight Great Ideas in Computer Architecture

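The eight great ideas in computer architecture.

Design for Moore's Law
Architecture design cycles are long. Because of Moore's Law, the components available when a project finishes can be far more capable than those available when it started, so architecture design has to take this into account.
P.S. By 2017, however, many in the industry had begun to feel that Moore's Law was nearing its end.
Use Abstraction to Simplify Design
Design in layers and modules (compare the layered approach and loadable kernel modules in operating systems). Lower-level details are hidden from the upper layers; each layer only needs to expose an interface.
Make the Common Case Fast
Optimization should target the common case: it only pays off once you have found the bottleneck, the part that takes the most time.
Finding the common case requires targeted measurement (see 1.6).
Performance via Parallelism
Performance via Pipelining
A multi-stage pipeline can raise resource utilization, hide latency, and so on.
Performance via Prediction
Instruction prediction and instruction prefetching in the processor.
Hierarchy of Memories
Most visible in caches.
P.S. Mainstream processors today already have cache levels up to L3; an L4 may follow in a few years.
Dependability via Redundancy
Redundant design ensures system reliability and enables disaster-recovery backups.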

1.3 - Below Your Program

Program execution layers and flow:

Application software
- Written in a high-level language (HLL).
System software
- Compiles high-level language code into assembly language.
- The operating system (OS) is responsible for:
  - Handling input / output
  - Managing memory and storage
  - Scheduling tasks & sharing resources
Hardware
- Processor, memory, I/O controllers

1.4 - Under the Covers

1.5 - Technologies for Building Processors and Memory

Evolution of electronic component technology:

Year	Technology	Relative performance per unit cost
1951	Vacuum tube	1
1965	Transistor	35
1975	Integrated circuit (IC)	900
1995	Very large scale IC (VLSI)	2,400,000
2013	Ultra large scale IC	250,000,000,000

CPU chip manufacturing: (figure not available)

Further reading:

1.6 - Performance

$\text{Performance}_{x} = \frac{1}{\text{Execution time}_{x}}$

$\text{Performance}_{x} > \text{Performance}_{y} \iff \frac{1}{\text{Execution time}_{x}} > \frac{1}{\text{Execution time}_{y}} \iff \text{Execution time}_{y} > \text{Execution time}_{x}$

Performance comparison: X is n times as fast as Y

$n = \frac{\text{Performance}_{x}}{\text{Performance}_{y}} = \frac{\text{Execution time}_{y}}{\text{Execution time}_{x}}$
Execution Time:
System performance is measured with the first metric below; CPU performance is measured with the second.
- Elapsed time / Response time / Wall clock time
  The time actually taken to run a task, including I/O, memory accesses, operating system overhead, and so on.
- CPU time / CPU execution time
  The time the CPU spends on a task, i.e. the number of clock cycles it runs for that task; time spent elsewhere (I/O, thread switching, and so on) is excluded.
  
  $\text{CPU time} = \text{CPU clock cycles} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$
  
  $\text{Clock rate} = \frac{1}{\text{Clock cycle time}} = \frac{\text{CPU clock cycles}}{\text{CPU time}}$
  
  $\text{Clock cycles} = \text{Instruction count} \times \text{Clock cycles per instruction (CPI)}$
  
  $\begin{aligned} \text{CPU time} &= \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} \\ &= \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \end{aligned}$

Total instructions	Compute	Memory load	Memory store	Branch
200,000	45%	20%	15%	20%

Machine	Clock rate	Compute CPI	Memory load CPI	Memory store CPI	Branch CPI
P1	1 GHz	1	8	8	2
P2	1.5 GHz	1 (80%), 2 (20%)	10	10	2

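Midterm exam question (2013 - 1):
A program runs on two computers, P1 and P2, with clock rates of 1 GHz and 1.5 GHz respectively, executing 200,000 instructions in total. Of these, 45% are compute instructions, 20% are memory loads, 15% are memory stores, and 20% are branches. On P1, compute instructions have a CPI of 1, loads and stores a CPI of 8, and branches a CPI of 2. On P2, 80% of the compute instructions still have a CPI of 1 while the rest have a CPI of 2; loads and stores have a CPI of 10, and branches again a CPI of 2.
For P1 and P2:
(1) Compute the total execution time.
(2) Compute the CPI.
(3) Compare the performance of the two machines.

Solution:

Total execution time of P1:
$(45\% \times 1 + 20\% \times 8 + 15\% \times 8 + 20\% \times 2) \times (2 \times 10^{5}) \div (1 \times 10^{9}) = 0.00073\ \text{s}$

Total execution time of P2:
$(45\% \times (80\% \times 1 + 20\% \times 2) + 20\% \times 10 + 15\% \times 10 + 20\% \times 2) \times (2 \times 10^{5}) \div (1.5 \times 10^{9}) = 0.000592\ \text{s}$

CPI of P1:
$45\% \times 1 + 20\% \times 8 + 15\% \times 8 + 20\% \times 2 = 3.65$

CPI of P2:
$45\% \times (80\% \times 1 + 20\% \times 2) + 20\% \times 10 + 15\% \times 10 + 20\% \times 2 = 4.44$

Performance ratio n:
$n = \frac{\text{Performance}_{P1}}{\text{Performance}_{P2}} = \frac{\text{Execution time}_{P2}}{\text{Execution time}_{P1}} = \frac{0.000592}{0.00073} \approx 0.8109589$

Since $n < 1$, P2 is the faster machine, by a factor of about $1 / 0.811 \approx 1.23$, even though its CPI is higher.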
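The arithmetic in this solution can be re-checked with a short script. This is only a sketch: the names `weighted_cpi`, `cpu_time`, and `mix` are illustrative and do not come from the course material.

```python
# Sketch: recompute CPI and CPU time for P1 and P2 from the instruction mix.
# All names below are illustrative, not part of the course material.

def weighted_cpi(mix, cpi):
    """Average CPI = sum over instruction classes of (fraction x class CPI)."""
    return sum(mix[k] * cpi[k] for k in mix)

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

mix = {"compute": 0.45, "load": 0.20, "store": 0.15, "branch": 0.20}
cpi_p1 = {"compute": 1.0, "load": 8, "store": 8, "branch": 2}
# On P2, 80% of compute instructions take 1 cycle and 20% take 2 cycles.
cpi_p2 = {"compute": 0.8 * 1 + 0.2 * 2, "load": 10, "store": 10, "branch": 2}

INSTRUCTIONS = 200_000
p1_cpi = weighted_cpi(mix, cpi_p1)
p2_cpi = weighted_cpi(mix, cpi_p2)
p1_time = cpu_time(INSTRUCTIONS, p1_cpi, 1.0e9)
p2_time = cpu_time(INSTRUCTIONS, p2_cpi, 1.5e9)

print(f"P1: CPI = {p1_cpi:.2f}, time = {p1_time:.6f} s")
print(f"P2: CPI = {p2_cpi:.2f}, time = {p2_time:.6f} s")
print(f"n = {p2_time / p1_time:.4f}")
```

Keeping the CPI and the CPU time in separate functions mirrors the formula CPU time = Instruction count × CPI / Clock rate from this section.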

1.7 - The Power Wall

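Power consumption and clock rate rise almost in lockstep: the higher the clock rate, the higher the power. Clock rates climbed quickly in the early years of computing but have recently all but stopped increasing, because power has already reached a very high level; as long as cooling and related technologies cannot keep up, the clock rate has to be capped.

Dynamic power
CPU power consumption is mainly dynamic power, which comes from transistors switching, i.e. flipping between high and low levels (1 and 0). Each flip is essentially a charge/discharge cycle, so this energy cost cannot be avoided.

$\text{Power} = \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency}$
Here Power is the dynamic power and Frequency is the switching frequency.
Static power
Mainly due to transistor leakage current; it is almost unavoidable and can only be reduced through process improvements.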
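As a rough illustration of the dynamic power formula, the sketch below compares a new design against an old one. The 15% reductions are assumed values chosen for the example, and `dynamic_power_ratio` is a hypothetical helper name.

```python
def dynamic_power_ratio(cap_scale, voltage_scale, freq_scale):
    """Ratio P_new / P_old, given the scale factor of each term in
    Power = Capacitive load x Voltage^2 x Frequency."""
    return cap_scale * voltage_scale ** 2 * freq_scale

# Assumed scenario: capacitive load, voltage and frequency each drop to 85%.
ratio = dynamic_power_ratio(0.85, 0.85, 0.85)
print(f"new power / old power = {ratio:.3f}")
```

Because voltage enters squared, lowering the voltage is the most effective of the three levers in this formula.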

1.8 - The Sea Change：The Switch from Uniprocessors to Multiprocessors

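Constrained by process technology, power, and other limits, processor performance can no longer be raised simply by increasing the clock rate. CPU development therefore turned to increasing parallelism, i.e. the shift from single-core to multicore processors.

Previously, performance gains came mainly from process and hardware improvements and had little impact on software. The arrival of multicore processors, however, also pushed the development of parallel algorithms, because only algorithmic changes can fully exploit the advantages of multicore processors.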

1.9 - Real Stuff：Benchmarking the Intel Core i7

1.10 - Fallacies and Pitfalls

Mentioned: Millions of Instructions Per Second (MIPS).

Amdahl's law:
$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{Improvement factor}} + T_{\text{unaffected}}$
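A small sketch of Amdahl's law in use. The 100 s program with 80 s spent in multiplication is an assumed example, and `improved_time` is an illustrative name.

```python
def improved_time(t_affected, t_unaffected, improvement_factor):
    """Amdahl's law: only the affected part shrinks by the improvement factor."""
    return t_affected / improvement_factor + t_unaffected

# Assumed example: a 100 s program that spends 80 s in multiplication.
total = 100.0
t_mult = 80.0
for factor in (2, 4, 16):
    t_new = improved_time(t_mult, total - t_mult, factor)
    print(f"multiply {factor}x faster -> {t_new:.1f} s, overall speedup {total / t_new:.2f}x")
```

However large the factor, the 20 s that are unaffected remain, so the overall speedup in this example stays below 5x.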

1.11 - Concluding Remarks

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$

1.12 - Historical Perspective and Further Reading

PDF

1.13 - Exercises

Solutions

Chapter 2 - Instructions：Language of the Computer

2.1 - Introduction

Chapter 2 introduces computer instructions, using MIPS as the example.

2.2 - Operations of the Computer Hardware

Design Principle 1: Simplicity favors regularity.

2.3 - Operands of the Computer Hardware

Design Principle 2: Smaller is faster.
MIPS has 32 registers, each 32 bits wide.

Design Principle 3: Make the common case fast.

Memory Operands

Example：A[12] = h + A[8];（h in $s2, base address of A in $s3）
```
lw  $t0, 32($s3)    # load word: $t0 = A[8], byte offset 8 * 4 = 32
add $t0, $s2, $t0   # $t0 = h + A[8]
sw  $t0, 48($s3)    # store word: A[12] = $t0, byte offset 12 * 4 = 48
```

Immediate Operands
- Constant data specified in an instruction
```
addi $s3, $s3, 4
```
- No subtract immediate instruction
  （Just use a negative constant）
```
addi $s2, $s1, -1
```
Constant Zero
- Useful for common operations
  ex：move between registers
```
add $t2, $s1, $zero
```

2.4 - Signed and Unsigned Numbers

2s-Complement Signed Integers

$x + \bar{x} = 1111 \ldots 1111_{2} = -1$

$\bar{x} + 1 = -x$

Example：negate +2

$+2 = 0000\ 0000 \ldots 0010_{2}$

$\begin{aligned} -2 &= 1111\ 1111 \ldots 1101_{2} + 1 \\ &= 1111\ 1111 \ldots 1110_{2} \end{aligned}$

Sign Extension

Example：8-bit → 16-bit

$+2: 0000\ 0010_{2} \to 0000\ 0000\ 0000\ 0010_{2}$

$-2: 1111\ 1110_{2} \to 1111\ 1111\ 1111\ 1110_{2}$
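Negation and sign extension can be tried out on bit patterns with a few lines of Python. This is a sketch on plain integers; `negate` and `sign_extend` are illustrative helper names.

```python
def negate(x, bits=32):
    """Two's-complement negation: invert every bit, then add 1 (mod 2^bits)."""
    mask = (1 << bits) - 1
    return ((x ^ mask) + 1) & mask

def sign_extend(x, from_bits, to_bits):
    """Copy the sign bit of a from_bits-wide pattern into the new upper bits."""
    sign = (x >> (from_bits - 1)) & 1
    if sign:
        x |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return x

print(f"{negate(2):032b}")                          # bit pattern of -2 in 32 bits
print(f"{sign_extend(0b00000010, 8, 16):016b}")     # +2 widened to 16 bits
print(f"{sign_extend(0b11111110, 8, 16):016b}")     # -2 widened to 16 bits
```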

2.5 - Representing Instructions in the Computer

Instructions are encoded in binary (so-called machine code).
In MIPS, every instruction is a 32-bit binary word.

MIPS R-format Instructions（R：Register）

op	rs	rt	rd	shamt	funct
6 bits	5 bits	5 bits	5 bits	5 bits	6 bits

op：operation code（opcode）
rs：first source register number
rt：second source register number
rd：destination register number
shamt：shift amount（00000 for now）

funct：function code（extends opcode）

Example：add $t0, $s1, $s2

special	$s1	$s2	$t0	0	add
0	17	18	8	0	32
000000	10001	10010	01000	00000	100000
$000000 10001 10010 01000 00000 100000_{2} = 0000 0010 0011 0010 0100 0000 0010 0000_{2} = 0232 4020_{16}$
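The field packing in this example can be reproduced with shifts and ORs. A minimal sketch, where `encode_r_format` is a hypothetical helper rather than part of any assembler:

```python
def encode_r_format(op, rs, rt, rd, shamt, funct):
    """Pack the six R-format fields (6/5/5/5/5/6 bits) into one 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2  ->  rs = $s1 (17), rt = $s2 (18), rd = $t0 (8), funct = 32
word = encode_r_format(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")
print(f"0x{word:08X}")
```

Note that the destination register rd comes first in the assembly syntax but sits after rs and rt in the encoded word.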

MIPS I-format Instructions（I：Immediate）

op	rs	rt	constant or address
6 bits	5 bits	5 bits	16 bits

Design Principle 4: Good design demands good compromises.

Examples：

Instruction	Format	op	rs	rt	rd	shamt	funct	address
add	R	0	reg	reg	reg	0	$32_{10}$	n.a.
sub (subtract)	R	0	reg	reg	reg	0	$34_{10}$	n.a.
add immediate	I	$8_{10}$	reg	reg	n.a.	n.a.	n.a.	constant
lw (load word)	I	$35_{10}$	reg	reg	n.a.	n.a.	n.a.	address
sw (store word)	I	$43_{10}$	reg	reg	n.a.	n.a.	n.a.	address

2.6 - Logical Operations

Operation	C	Java	MIPS
Shift left	<<	<<	sll
Shift right	>>	>>>	srl
Bitwise AND	&	&	and, andi
Bitwise OR	|	|	or, ori
Bitwise NOT	~	~	nor

AND Operations：$t0 = $t1 & $t2;
```
and $t0, $t1, $t2
```
OR Operations：$t0 = $t1 | $t2;
```
or $t0, $t1, $t2
```

Special in MIPS ─ NOR Operations：a NOR b == NOT（a OR b）
```
nor $t0, $t1, $zero    # NOT: the last operand is always register 0, so $t0 = ~$t1
```

2.7 - Instructions for Making Decisions

if (rs == rt) branch to instruction labeled L1
```
beq rs, rt, L1    # beq = branch if equal
```

if (rs != rt) branch to instruction labeled L1
```
bne rs, rt, L1    # bne = branch if not equal
```
Question：
　　Why don't we use blt, bge, …?
Answer：
　　Hardware for "<, ≥, …" is slower than "=, ≠".
　　On the other hand, we can make beq and bne common case.
unconditional jump to instruction labeled L1
```
j L1
```

if (rs < rt) rd = 1; else rd = 0;
```
slt rd, rs, rt    # slt = set on less than
```
if (rs < constant) rd = 1; else rd = 0;
```
slti rd, rs, constant    # slti = set on less than immediate
```

2.8 - Supporting Procedures in Computer Hardware

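Register number	Name	Usage
0	zero	Constant 0; looks wasteful but is very useful
1	at	Reserved for the assembler
2～3	v0, v1	Function return values
4～7	a0～a3	First few function arguments
8～15	t0～t7	Temporaries; a callee may use them without saving them
16～23	s0～s7	Register variables (saved registers)
24～25	t8, t9	More temporaries, same as 8～15 (t0～t7)
26, 27	k0, k1	Reserved for exception handlers
28	gp	Global pointer: convenient access to global or static variables
29	sp	Stack pointer
30	s8 / fp	Ninth register variable; can also serve as the frame pointer
31	ra	Return address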

Procedure Call Instructions
- Procedure call：jal（Jump And Link）
```
jal ProcedureLabel
```
- Procedure return：jr（Jump Register）
```
jr $ra
```

2.9 - Communicating with People

2.10 - MIPS Addressing for 32-Bit Immediates and Addresses

Byte / Halfword Operations
- Byte：8 bits / Halfword：16 bits (2 bytes)
- MIPS byte / halfword load / store
  - String processing is a common case
  - Sign extend to 32 bits in rt
```
lb rt, offset(rs)
lh rt, offset(rs)
```
  - Zero（Unsigned） extend to 32 bits in rt
```
lbu rt, offset(rs)
lhu rt, offset(rs)
```
  - Store just rightmost byte / halfword
```
sb rt, offset(rs)
sh rt, offset(rs)
```
MIPS J-format Instructions（J：Jump）

op	address
6 bits	26 bits

2.11 - Parallelism and Instructions：Synchronization

Mentioned: atomic read / write memory operations.

Synchronization in MIPS
- Load linked：
```
ll rt, offset(rs)
```
- Store conditional：
```
sc rt, offset(rs)
```
  - Succeed if location not changed since the ll → Returns 1 in rt
  - Fail if location is changed since the ll → Returns 0 in rt

2.12 - Translating and Starting a Program

On UNIX systems, an object file typically contains six distinct pieces that provide the information needed to build a complete program：
1. Header（object file header）：
  describes the contents of the object module
2. Text segment：
  translated instructions（machine language code）
3. Static data segment：
  data allocated for the life of the program
4. Relocation information：
  for contents that depend on the absolute location of the loaded program
5. Symbol table：
  undefined labels, e.g. global definitions and external references
6. Debug information：
  for associating the machine code with the source code

2.13 - A C Sort Example to Put It All Together

2.14 - Arrays versus Pointers

2.15 - Advanced Material：Compiling C and Interpreting Java

2.16 - Real Stuff：ARMv7 (32-bit) Instructions

2.17 - Real Stuff：x86 Instructions

2.18 - Real Stuff：ARMv8 (64-bit) Instructions

2.19 - Fallacies and Pitfalls

2.20 - Concluding Remarks

Design principles
- Simplicity favors regularity
- Smaller is faster
- Make the common case fast
- Good design demands good compromises

Instructions you should know from this chapter

Instruction class	MIPS examples
Arithmetic	add, sub, addi
Data transfer	lw, sw, lb, lbu, lh, lhu, sb, lui
Logical	and, or, nor, andi, ori, sll, srl
Conditional branch	beq, bne, slt, slti, sltiu
Jump	j, jr, jal

2.21 - Historical Perspective and Further Reading

PDF

2.22 - Exercises

Solutions

Chapter 3 - Arithmetic for Computers

3.1 Introduction

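Integers: addition, subtraction, multiplication, division, and overflow handling.
Floating-point numbers: representation and operations.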

3.2 Addition and Subtraction

3.3 Multiplication

Mentioned: the Arithmetic Logic Unit (ALU).
Terms: multiplicand (the number being multiplied) and multiplier (the number it is multiplied by).

Two 32-bit registers hold the product:
- HI：most-significant 32 bits
- LO：least-significant 32 bits
MIPS Multiplication Instructions
- 64-bit product in HI/LO
  (multiplies rs by rt and stores the result in HI/LO)
```
mult rs, rt
multu rs, rt
```
- Move from HI/LO to rd
  Can test HI value to see if product overflows 32 bits
```
mfhi rd
mflo rd
```
- Least-significant 32 bits of product → rd
  (only the low 32 bits of the product are written to rd)
```
mul rd, rs, rt    # not in the textbook, FYI
```
Hardware diagram:
Flowchart:
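How a 64-bit product is split across HI and LO can be modelled on Python integers. This is a behavioural sketch only, not a model of the multiplier hardware; `mult`, `multu`, and `to_signed32` are illustrative names that mirror the instructions.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def mult(rs, rt):
    """Signed multiply: returns (HI, LO) as 32-bit patterns."""
    product = (to_signed32(rs) * to_signed32(rt)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32

def multu(rs, rt):
    """Unsigned multiply: returns (HI, LO) as 32-bit patterns."""
    product = (rs & MASK32) * (rt & MASK32)
    return (product >> 32) & MASK32, product & MASK32

hi, lo = multu(0xFFFFFFFF, 2)
print(f"multu: HI = 0x{hi:08X}, LO = 0x{lo:08X}")
hi, lo = mult(0xFFFFFFFF, 2)   # 0xFFFFFFFF is -1 when read as signed
print(f"mult:  HI = 0x{hi:08X}, LO = 0x{lo:08X}")
```

The same operand bits give different HI values for `mult` and `multu`, which is why testing HI for overflow depends on whether the operands are signed.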

3.4 Division

Two 32-bit registers hold the remainder and the quotient:
- HI：32-bit remainder
- LO：32-bit quotient
MIPS Division Instructions
No overflow or divide-by-0 checking.
Software must perform checks if required.
- access result
```
mfhi rd
mflo rd
```
- divide
```
div rs, rt
divu rs, rt
```
Hardware diagram:
Flowchart:

3.5 Floating Point

$x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$
- S：sign bit（0 => non-negative / 1 => negative）
- Normalize significand：1.0 ≤ | significand | < 2.0
  - Always have a leading pre-binary-point 1 bit, so no need to represent it explicitly（hidden bit）
  - Significand is Fraction with the “1.” restored
- Exponent：excess representation：actual exponent + Bias
  - Ensure exponent is unsigned
- Bias：Single = 127 / Double = 1023
Two representations
- Single precision（32-bit）
  
  S	Exponent	Fraction
  1 bit	8 bits	23 bits
- Double precision（64-bit）
  
  S	Exponent	Fraction
  1 bit	11 bits	52 bits
Single-Precision Range
- Exponent：00000000 and 11111111 are reserved (00000000 for zero and denormalized numbers, 11111111 for infinity and NaN).
- Smallest value
  - Exponent：00000001 → actual exponent = 1 – 127 = –126
  - Fraction：000…0000 → significand = 1.0
  - $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
- Largest value
  - Exponent：11111110 → actual exponent = 254 – 127 = +127
  - Fraction：111…1111 → significand $\approx$ 2.0
  - $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$
Double-Precision Range
- Exponent：00000000000 and 11111111111 are reserved (00000000000 for zero and denormalized numbers, 11111111111 for infinity and NaN).
- Smallest value
  - Exponent：00000000001 → actual exponent = 1 – 1023 = –1022
  - Fraction：00000…00000 → significand = 1.0
  - $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value
  - Exponent：11111111110 → actual exponent = 2046 – 1023 = +1023
  - Fraction：1111…111111 → significand $\approx$ 2.0
  - $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$
Floating-Point Examples：
Represent –0.75：

$-0.75 = (-1)^{1} \times 1.1_{2} \times 2^{-1}$
- S = 1
- Fraction = $1000 \ldots 00_{2}$
- Exponent = –1 + Bias
  Single：–1 + 127 = 126 = $01111110_{2}$ （8 bits）
  Double：–1 + 1023 = 1022 = $01111111110_{2}$ （11 bits）
- Answer =
  Single：$1\ 01111110\ 1000 \ldots 00$
  Double：$1\ 01111111110\ 1000 \ldots 00$
What number is represented by this single-precision float：

$11000000101000 \ldots 00$ （$1\ 10000001\ 01000 \ldots 00$）
- S = 1
- Fraction = $01000 \ldots 00_{2}$ （23 bits）
- Exponent = $10000001_{2}$ （8 bits） = $129_{10}$
- Answer
  
  $x = (-1)^{1} \times (1 + 0.01_{2}) \times 2^{(129 - 127)} = (-1) \times 1.25 \times 2^{2} = -5.0$
Special cases：
- Exponent = 000…000 → hidden bit is 0 (denormalized number)
  
  $x = (-1)^{S} \times (0 + \text{Fraction}) \times 2^{1 - \text{Bias}}$
- Exponent & Fraction are both 000…000
  
  $x = (-1)^{S} \times (0 + 0) \times 2^{1 - \text{Bias}} = \pm 0.0$ （Two representations of 0.0！）
- Exponent = 111…111 / Fraction = 000…000
  ±Infinity
- Exponent = 111…111 / Fraction ≠ 000…000
  Not-a-Number (NaN.)
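The two single-precision examples in this section can be checked with Python's standard `struct` module. A minimal sketch for normalized numbers only; `float_to_fields` and `fields_to_float` are illustrative names.

```python
import struct

def float_to_fields(value):
    """Split a single-precision float into (sign, exponent, fraction) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fields_to_float(sign, exponent, fraction):
    """Evaluate (-1)^S x (1 + Fraction) x 2^(Exponent - 127); normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

s, e, f = float_to_fields(-0.75)
print(s, f"{e:08b}", f"{f:023b}")                     # fields of -0.75
print(fields_to_float(1, 0b10000001, 0b01 << 21))     # fraction bits 0100...0
```

The special cases (zero, denormalized numbers, infinity, NaN) are deliberately left out of `fields_to_float`.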
FP Instructions：
- Load and Store：lwc1、ldc1、swc1、sdc1
  l = "Load", s = "Store" / w = "Word", d = "Double" / c1 = "Coprocessor 1"
  Ex：
```
ldc1 $f8, 32($sp)    # load the double-precision value at 32($sp) into $f8 (coprocessor 1)
```
- Single-precision arithmetic：add.s、sub.s、mul.s、div.s
  Ex：
```
add.s $f0, $f1, $f6
```
- Double-precision arithmetic：add.d、sub.d、mul.d、div.d
  Ex：
```
mul.d $f4, $f4, $f6
```
- Single- and Double-precision comparison：
  c.XX.s、c.XX.d（XX is eq, lt, le, …）
  c = "Compare" / XX = {condition}
  eq = "EQual" / lt = "Less Than" / le = "Less Equal"…
```
c.lt.s $f3, $f4    # Compare $f3 & $f4 then Set condition bit
```
- Branch on FP condition code true or false：bc1t、bc1f
  b = "Branch" / c1 = "Coprocessor 1" / t = "True", f = "False"
  Ex：
```
bc1t TargetLabel    # branch to TargetLabel if the FP condition bit is true
```

3.6 Parallelism and Computer Arithmetic：Subword Parallelism

3.7 Real Stuff：Streaming SIMD Extensions and Advanced Vector Extensions in x86

The midterm exam covers everything up to this point; the final exam covers what follows.

3.8 Going Faster：Subword Parallelism and Matrix Multiply

3.9 Fallacies and Pitfalls

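Right Shift and Division
A right shift by i bits divides by 2^i only for unsigned integers. For negative signed integers, an arithmetic right shift rounds toward negative infinity, whereas division truncates toward zero.
Assumptions of associativity may fail
Floating-point addition is not associative, so reordering operations (for example in parallel code) can change the result.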
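Both pitfalls can be reproduced in a few lines. This is a sketch using Python integers and double-precision floats; the values of `x`, `y`, and `z` are chosen for illustration.

```python
# Pitfall 1: for negative numbers, shifting right is not the same as dividing.
quotient = int(-5 / 4)   # division truncated toward zero
shifted = -5 >> 2        # arithmetic right shift by 2 bits
print(quotient, shifted)

# Pitfall 2: floating-point addition is not associative.
x, y, z = -1.5e38, 1.5e38, 1.0
left = (x + y) + z       # x and y cancel first, then z is added
right = x + (y + z)      # z is lost when added to the much larger y
print(left, right)
```

In the second case `z` is far smaller than the spacing between representable values near `y`, so `y + z` rounds back to `y`.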

3.10 Concluding Remarks

3.11 Historical Perspective and Further Reading

3.12 Exercises

Chapter 4 - The Processor

4.1 - Introduction

Instruction Execution
CPU Overview
Multiplexers
Control

4.2 - Logic Design Conventions

Basics
Combinational Elements
Sequential Elements
Clocking Methodology

4.3 - Building a Datapath

Instruction Fetch

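The basic elements needed to process instructions:
- Instruction memory
  Stores all program instructions, each at its own address.
- Program Counter (PC)
  An address register that holds an instruction address, i.e. it points to the instruction in instruction memory that is currently being executed.
- Adder
  Updates the PC to point to the next instruction so that the program can keep executing.
R-format Instructions
- Typically read the contents of two registers, operate on them in the ALU, and write the result to another register.
- The register file needs 4 inputs (two register numbers to read, one register number to write, and one value to write), 2 outputs (the contents of the two registers read), and a write-enable signal.
- The 3 register-number ports are 5 bits wide (rs, rt, and rd from Chapter 2 are all 5 bits long); the 3 data ports (one write, two read) are each 32 bits wide.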
I-format Instructions
- ```
lw $t1, offset($t2)
```
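  The memory address is the contents of $t2 (a 32-bit base address) plus offset (a 16-bit offset); the word at that address is then written to register $t1.
  This requires an address adder (adding the 16-bit offset, sign-extended, to the 32-bit base) and a data memory.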
- ```
beq $t1, $t2, offset
```
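  First the contents of $t1 and $t2 are read and compared in the ALU; if they are equal, the 16-bit offset (sign-extended and shifted left by 2) is added to PC + 4 to form the address of the next instruction.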
Branch Instructions
R-Type/Load/Store Datapath
Full Datapath

4.4 - A Simple Implementation Scheme

ALU Control
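- Implements a simple subset of instructions:
  - load word（lw）
  - store word（sw）
  - branch equal（beq）
  - Arithmetic-logical operations：add, sub, AND, OR
  - set on less than（slt）
- Supporting these requires only a 2-bit ALUOp:
  - 00 means lw or sw; the ALU only has to add to form the address.
  - 01 means beq; the ALU has to subtract the two registers.
  - 10 means an R-format operation; the 6-bit funct field then selects among the individual cases.
  - The 2-bit ALUOp and the 6-bit funct field together generate the actual 4-bit ALU control code.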
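The mapping from ALUOp and funct to the ALU control code can be written as a small lookup. This is a sketch: the 4-bit encodings and funct values below follow the usual textbook table and are not listed in these notes, so treat them as assumptions; `alu_control` is an illustrative name.

```python
# Assumed 4-bit ALU control encodings (usual textbook values, not from these notes).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# funct field of each supported R-format instruction -> ALU control code
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}

def alu_control(alu_op, funct=0):
    """Return the 4-bit ALU control code for a 2-bit ALUOp and a 6-bit funct."""
    if alu_op == 0b00:      # lw / sw: add base and offset
        return ALU_ADD
    if alu_op == 0b01:      # beq: subtract and test for zero
        return ALU_SUB
    if alu_op == 0b10:      # R-format: funct selects the operation
        return FUNCT_TO_ALU[funct]
    raise ValueError(f"unsupported ALUOp: {alu_op:02b}")

for name, op, funct in [("lw/sw", 0b00, 0), ("beq", 0b01, 0),
                        ("add", 0b10, 0b100000), ("slt", 0b10, 0b101010)]:
    print(f"{name:6s} -> {alu_control(op, funct):04b}")
```

Splitting the decision into two levels keeps the main control unit small: it only has to produce the 2-bit ALUOp.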
Main Control Unit
Datapath With Control
R-Type Instruction
Load Instruction
Branch-on-Equal Instruction
Datapath With Jumps Added

4.5 - An Overview of Pipelining

Pipelining Analogy
MIPS Pipeline
- Five stages：
  1. IF（Instruction Fetch）
    Instruction fetch from memory.
  2. ID（Instruction Decode）
    Instruction decode & register read.
  3. EX（EXecute）
    Execute operation or calculate address.
  4. MEM（MEMory）
    Access memory operand.
  5. WB（Write Back）
    Write result back to register.
Pipeline Performance
- lw：IF + ID + EX + MEM + WB
- sw：IF + ID + EX + MEM
- R-format：IF + ID + EX + WB
- beq：IF + ID + EX
Pipeline Speedup
- $\text{Speedup} = \frac{\text{Time}_{\text{non-pipelined}}}{\text{Time}_{\text{pipelined}}} \approx \text{number of stages}$ (when there are many instructions and the stages are balanced)
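The speedup of a pipelined design over a non-pipelined one can be estimated numerically. The stage latencies below are assumed, illustrative values, and the function names are hypothetical.

```python
# Assumed stage latencies in picoseconds (illustrative values only).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

def non_pipelined_time(n):
    """Each instruction runs through all five stages before the next one starts."""
    return n * sum(STAGE_PS.values())

def pipelined_time(n):
    """The clock period is set by the slowest stage; the first instruction
    fills the pipeline, then one instruction completes per cycle."""
    cycle = max(STAGE_PS.values())
    return (len(STAGE_PS) + n - 1) * cycle

for n in (3, 1_000_000):
    speedup = non_pipelined_time(n) / pipelined_time(n)
    print(f"{n} instructions: speedup = {speedup:.3f}")
```

With these unbalanced stages the speedup approaches 800 / 200 = 4 rather than 5; it equals the number of stages only when every stage takes the same time.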
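Hazards
- Structure hazards
  Occur when two instructions in different stages need the same hardware resource at the same time.
- Data hazards
  An instruction needs data that the previous instruction is still computing.
- Control hazards (branch hazards)
  With an instruction such as beq, the pipeline has to fetch the next instruction while the branch is still being decoded, but the PC of the next instruction is not yet known, hence the conflict. (Stall on Branch)
  - Branch Prediction
    Predict a direction and start executing it; if the prediction turns out wrong, discard the instructions executed so far and switch to the correct ones. When the prediction is right the pipeline still runs at full speed, and when it is wrong the cost is the same as without prediction.
Forwarding (a.k.a. Bypassing)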
- Load-Use Data Hazard

4.6 - Pipelined Datapath and Control

MIPS Pipelined Datapath
Pipeline registers
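To keep the five stages of the datapath from interfering with one another, extra registers are placed between each pair of adjacent stages to pass data along. They hold the result of one stage for the next stage to use, so the earlier stage can move on to the next instruction.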
Pipelined Control

4.7 - Data Hazards：Forwarding v.s. Stalling

Data Hazards in ALU Instructions
Dependencies & Forwarding
Detecting the Need to Forward
Forwarding Paths
Forwarding Conditions
Double Data Hazard
Revised Forwarding Condition
Datapath with Forwarding
Load-Use Data Hazard
A hazard caused by lw (load-use) always requires a stall.
Datapath with Hazard Detection
Stalls and Performance

4.8 - Control Hazards

Branch Hazards
Reducing Branch Delay
Data Hazards for Branches
Dynamic Branch Prediction

4.9 - Exceptions

Exceptions and Interrupts
- Exceptions arise within the CPU
  Ex：undefined opcode, overflow, syscall…
- Interrupts come from an external I/O controller
Handling Exceptions
An Alternate Mechanism
Handler Actions
Exceptions in a Pipeline
Pipeline with Exceptions
Exception Properties
Multiple Exceptions
Imprecise Exceptions

4.10 - Parallelism and Advanced Instruction Level Parallelism

Instruction-Level Parallelism (ILP)
Multiple Issue (starting several instructions in the same clock cycle)
- Static multiple issue
- Dynamic multiple issue / Superscalar
Speculation
Compiler/Hardware Speculation
Speculation and Exceptions
Static Multiple Issue
Scheduling Static Multiple Issue
MIPS with Static Dual Issue
Hazards in the Dual-Issue MIPS
Scheduling Example
Loop Unrolling
Dynamic Multiple Issue
Dynamic Pipeline Scheduling
Dynamically Scheduled CPU
Register Renaming
Speculation
Why Do Dynamic Scheduling?
Does Multiple Issue Work?
Power Efficiency

4.11 - Real Stuff：The ARM Cortex-A8 and Intel Core i7 Pipelines

4.12 - Instruction-Level Parallelism and Matrix Multiply

4.13 - Advanced Topic：An Introduction to Digital Design Using a Hardware Design Language to Describe and Model a Pipeline and More Pipelining Illustrations

PDF

4.14 - Fallacies and Pitfalls

4.15 - Concluding Remarks

4.16 - Historical Perspective and Further Reading

PDF

4.17 - Exercises

Solutions

Chapter 5 - Large and Fast：Exploiting Memory Hierarchy

5.1 - Introduction

Principle of Locality
- Temporal locality
- Spatial locality
Take Advantage of Locality
Memory Hierarchy Level

5.2 - Memory Technologies

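Technology	Access time	Price per GB	Characteristics
SRAM	0.5~2.5 ns	$500~1000	Data is stored in transistors and can be read directly.
DRAM	50~70 ns	$10~20	Data is stored in capacitors, which leak continuously and must be refreshed (recharged) periodically.
FLASH	5k~50k ns	$0.75~1.00	EEPROM (electrically erasable programmable ROM); wears out after a limited number of write cycles.
HDD	5M~20M ns	$0.05~0.10	Magnetic storage, read and written with platters and heads.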

Flash Storage
Flash Type
- NOR flash
- NAND flash
Disk Storage
Disk Sectors and Access
Disk Access Example
Disk Performance Issues

5.3 - The Basics of Caches

Cache Memory
Direct Mapped Cache
Tags and Valid Bits
Cache Example
Address Subdivision
Example：Larger Block Size
Block Size Considerations
Cache Misses
Write-Through
- write buffer
Write-Back
- dirty block
Write Allocation
Example：Intrinsity FastMATH
Main Memory Supporting Caches
Increasing Memory Bandwidth

5.4 - Measuring and Improving Cache Performance

Measure Cache Performance
- $\begin{aligned} \text{Memory stall cycles} &= \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty} \\ &= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty} \end{aligned}$
Cache Performance Example
Average Access Time
- Average memory access time（AMAT）
  
  $\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$
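A short sketch of the two cache formulas in this section. The hit time, miss rate, miss penalty, and access count are assumed values for illustration, and the function names are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

def memory_stall_cycles(memory_accesses, miss_rate, miss_penalty):
    """Stall cycles = memory accesses x miss rate x miss penalty."""
    return memory_accesses * miss_rate * miss_penalty

# Assumed values: 1-cycle hit, 5% miss rate, 20-cycle miss penalty.
print(f"AMAT = {amat(1, 0.05, 20):.2f} cycles")
print(f"stall cycles for 1000 accesses = {memory_stall_cycles(1000, 0.05, 20):.0f}")
```

Even a 5% miss rate doubles the average access time here, which is why the rest of the section focuses on reducing miss rate and miss penalty.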
Performance Summary
Associative Caches
- Direct mapped
- Set associative
- Fully associative
Example of Associative Cache
Spectrum of Associativity
Associativity Example
How Much Associativity
Set Associative Cache Organization
Replacement Policy
Multilevel Caches

5.5 - Dependable Memory Hierarchy

Dependability
Dependability Measures
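- Failure: the machine is interrupted in its normal operation and moves to the corresponding failure-handling state. A failure that cannot be recovered from is permanent; failures can also be intermittent.
- Reliability: a measure of how long the machine keeps delivering normal service from a given starting point.
- Mean Time To Failure (MTTF): the average time of continuous normal operation before a failure occurs.
- Annual Failure Rate (AFR): for a given MTTF, the fraction of devices expected to fail within one year.
- Mean Time To Repair (MTTR): the average time needed to recover once a failure has occurred.
- Mean Time Between Failures (MTBF): the sum of MTTF and MTTR.
- Availability: the fraction of time the machine is working normally over the whole failure-and-repair cycle.
  
  $\text{Availability} = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR}$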
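These dependability measures are easy to compute once MTTF and MTTR are known. In the sketch below, the MTTF of 1,000,000 hours and MTTR of 24 hours are assumed example values, and AFR is taken as hours per year divided by MTTF.

```python
HOURS_PER_YEAR = 8760

def mtbf(mttf, mttr):
    """Mean time between failures = MTTF + MTTR."""
    return mttf + mttr

def availability(mttf, mttr):
    """Fraction of time the system is working: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def annual_failure_rate(mttf_hours):
    """Expected fraction of devices failing per year for a given MTTF (in hours)."""
    return HOURS_PER_YEAR / mttf_hours

# Assumed example values: MTTF = 1,000,000 hours, MTTR = 24 hours.
print(f"MTBF = {mtbf(1_000_000, 24)} hours")
print(f"Availability = {availability(1_000_000, 24):.6f}")
print(f"AFR = {annual_failure_rate(1_000_000):.3%}")
```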
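Three ways to improve MTTF:
1. Avoid faults by design.
2. Rely on redundancy so that the system keeps working even when a fault occurs.
3. Predict situations where faults may occur and correct them in advance.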