---
title: Archi 4
---
# Computer Architectures
NTNU Computer Architectures
##### [Back to Note Overview](https://reurl.cc/XXeYaE)
##### [Back to Computer Architectures](https://hackmd.io/@NTNUCSIE112/Archi109-2)
{%hackmd @sophie8909/pink_theme %}
###### tags: `NTNU` `CSIE` `必修` `Computer Architectures` `109-2`
<!-- tag order: [school] [dept required/elective] or [program, program name (without the word "program", e.g. 大師創業)] [course] [semester] -->
<!-- URL name: Archi109-2_[] -->
:::info
This chapter is almost entirely figures; refer directly to the slides.
:::
## Ch.04 The Processor
### 4.1 Introduction
- CPU performance factors
- Instruction count
- Determined by ISA and compiler
- CPI and cycle time
- Determined by CPU hardware
- We will examine two MIPS implementations
- A simplified version
- A more realistic pipelined version
- Simple subset, shows most aspects
- Memory reference: `lw`,`sw`
- Arithmetic/logical: `add`,`sub`,`and`,`or`,`slt`
- Control transfer: `beq`,`j`
*[ISA]: Instruction set architecture
#### Instruction Execution
- PC → instruction memory, fetch instruction
- Register number → register file, read a register
- Depending on the instruction class
- Use the ALU to calculate …
- Arithmetic result
- Memory address for load/store
- Branch target address
- Access data memory for load/store
- PC ← target address or PC + 4
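The steps above can be sketched as a minimal single-cycle interpreter in Python (the `Inst` and `State` helpers are hypothetical, for illustration only; they model the textbook subset, not real hardware):

```python
from dataclasses import dataclass, field

@dataclass
class Inst:
    op: str                 # "add", "lw", "beq", ... (textbook subset)
    rs: int = 0             # source register numbers
    rt: int = 0
    rd: int = 0
    imm: int = 0            # offset for lw/sw
    target: int = 0         # branch target address

@dataclass
class State:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    imem: dict = field(default_factory=dict)    # address -> Inst
    dmem: dict = field(default_factory=dict)    # address -> word

def step(state):
    inst = state.imem[state.pc]         # PC -> instruction memory, fetch
    a = state.regs[inst.rs]             # register number -> register file
    b = state.regs[inst.rt]
    if inst.op == "add":                # ALU computes the arithmetic result
        state.regs[inst.rd] = a + b
        state.pc += 4
    elif inst.op == "lw":               # ALU computes the memory address
        state.regs[inst.rt] = state.dmem[a + inst.imm]
        state.pc += 4
    elif inst.op == "beq":              # PC <- target address or PC + 4
        state.pc = inst.target if a == b else state.pc + 4
```

For example, executing `Inst("add", rs=1, rt=2, rd=3)` with `regs[1] = 3` and `regs[2] = 4` leaves 7 in `regs[3]` and advances `pc` by 4.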
#### CPU Overview

#### Multiplexer

#### Control

### 4.2 Logic Design Conventions
#### Logic Design Basics
- Information is encoded in binary
- Low voltage = 0, High voltage = 1
- One wire per bit
- Multi-bit data are encoded on multi-wire buses.
- Combinational elements
- Operate on data
- Output is a function of input
- State (sequential) elements
- Store information
#### Combinational Element
- AND-gate
- Y = A & B

- Adder
- Y = A + B

- Multiplexer
- Y = S ? I1 : I0

- Arithmetic/Logic Unit
- Y = F(A,B)
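As a sketch, each of these combinational elements can be modeled as a pure function of its inputs (assuming 32-bit unsigned values; `slt` here ignores signedness for simplicity):

```python
MASK32 = 0xFFFFFFFF   # keep results to 32 bits

def and_gate(a, b):
    return a & b                      # Y = A & B

def adder(a, b):
    return (a + b) & MASK32           # Y = A + B, 32-bit wraparound

def mux(s, i0, i1):
    return i1 if s else i0            # Y = S ? I1 : I0

def alu(f, a, b):
    ops = {"and": a & b, "or": a | b,
           "add": (a + b) & MASK32,
           "sub": (a - b) & MASK32,
           "slt": int(a < b)}         # unsigned compare, a simplification
    return ops[f]                     # Y = F(A, B)
```

Note there is no stored state anywhere: the output depends only on the current inputs, which is exactly what distinguishes combinational from sequential elements.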

#### Sequential Element
- Register stores data in a circuit
- Use a clock signal to determine when to update the stored value
- Edge-triggered: update when Clk changes from 0 to 1
- Register with a write control
- Only update on a clock edge when the write control input is 1
- Used when stored value is required later
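A small Python model of an edge-triggered register with a write control might look like this (a sketch, not a real hardware description):

```python
class Register:
    """Edge-triggered register: the stored value changes only on a
    rising clock edge (Clk 0 -> 1), and only when write control is 1."""

    def __init__(self):
        self.q = 0          # stored value (current output)
        self._clk = 0       # previous clock level, to detect edges

    def tick(self, clk, d, write=1):
        rising = self._clk == 0 and clk == 1
        if rising and write:    # update only on 0 -> 1 edge with write asserted
            self.q = d
        self._clk = clk
        return self.q
```

Between edges the input `d` may change freely without affecting `q`, which is what lets the same register be both read and written in one clock cycle.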
#### Clocking Methodology
- Combinational logic transforms data during a clock cycle
- Between two clock edges
- Input from a state (sequential) element, output to a state element
- Longest delay determines the clock period
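For instance, with made-up delays for each element (all numbers below are illustrative, not from the slides), the clock period is set by the slowest instruction's path through the combinational logic:

```python
# Hypothetical element delays in picoseconds, summed along each
# instruction's path between two clock edges.
paths_ps = {
    "R-type": 200 + 100 + 120 + 100,        # fetch + reg read + ALU + reg write
    "lw":     200 + 100 + 120 + 200 + 100,  # adds a data-memory access
    "beq":    200 + 100 + 120,              # fetch + reg read + ALU compare
}

period_ps = max(paths_ps.values())  # longest delay determines the clock period
print(period_ps)                    # 720 (the lw path)
```

Here the clock could run no faster than 1 / 720 ps ≈ 1.39 GHz, even though R-type and branch instructions finish well before the edge.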

### 4.3 Building a Datapath
#### Building a Datapath
- Datapath
- Elements that process data and address in the CPU
- Registers, ALUs, mux's, memories, ...
- We will build a MIPS datapath incrementally.
- Refine the overview design
#### Instruction Fetch

#### R-Format Instructions
- Read two register operands
- Perform an arithmetic/logical operation
- Write a register result

#### Load/Store Instructions
- Read two register operands
- Calculate an address using the 16-bit offset
- Use the ALU, but with a sign-extended offset
- Load: Read memory and update a register
- Store: Write a register value to memory
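The address calculation can be sketched in Python (the helper names are hypothetical):

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a full-width integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(base, offset16):
    """Address sent to data memory: base register + sign-extended offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF
```

So `lw $t0, -4($t1)` with `$t1 = 0x1000` accesses address `0x0FFC`: the offset bits `0xFFFC` are interpreted as -4, not 65532, because of the sign extension.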

#### Branch Instructions
- Read register operands
- Compare operands
- Use the ALU: subtract and check the Zero output
- Calculate target address
- Sign-extend the displacement
- Shift left 2 places (word displacement)
- Add to PC + 4
- Already calculated by instruction fetch
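Putting the three steps together, the branch target computation can be sketched as:

```python
def branch_target(pc, imm16):
    """Target = (PC + 4) + (sign-extended displacement << 2)."""
    disp = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign-extend
    return (pc + 4 + (disp << 2)) & 0xFFFFFFFF           # word displacement
```

For example, a `beq` at `0x1000` with displacement 3 targets `0x1004 + 12 = 0x1010`; a displacement of `0xFFFF` (i.e. -1) targets `0x1000`, branching back over the instruction itself.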

#### Composing the Elements
- First-cut datapath does an instruction in one clock cycle
- Each datapath element can only do one function at a time
- Hence, we need separate instruction and data memories.
- Use a multiplexer where alternate data sources are used for different instructions
#### R-Type/Load/Store Datapath

#### Full Datapath

### 4.4 A Simple Implementation Scheme
#### ALU Control
- ALU is used for
- Load/Store: F = add
- Branch: F = subtract
- R-type: F depends on the funct field
| ALU control | Function |
| ----------- | ---------------- |
| 0000 | AND |
| 0001 | OR |
| 0010 | add |
| 0110 | subtract |
| 0111 | set-on-less-than |
| 1100 | NOR |
- Assume 2-bit ALUOp is derived from opcode
- Combinational logic derives ALU control.
| opcode | ALUOp | Operation | funct | ALU function | ALU control |
| ------ | ----- | ---------------- | ------ | ---------------- | ----------- |
| `lw` | 00 | load word | XXXXXX | add | 0010 |
| `sw` | 00 | store word | XXXXXX | add | 0010 |
| `beq` | 01 | branch equal | XXXXXX | subtract | 0110 |
| R-type | 10 | add | 100000 | add | 0010 |
| | | subtract | 100010 | subtract | 0110 |
| | | AND | 100100 | AND | 0000 |
| | | OR | 100101 | OR | 0001 |
| | | set-on-less-than | 101010 | set-on-less-than | 0111 |
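The table above translates directly into a small decoder function (a Python sketch of the combinational logic, not a hardware description):

```python
def alu_control(aluop, funct):
    """Derive the 4-bit ALU control from the 2-bit ALUOp and funct field."""
    if aluop == 0b00:                 # lw / sw -> add
        return 0b0010
    if aluop == 0b01:                 # beq -> subtract
        return 0b0110
    # R-type (ALUOp = 10): decode the funct field
    return {0b100000: 0b0010,         # add
            0b100010: 0b0110,         # subtract
            0b100100: 0b0000,         # AND
            0b100101: 0b0001,         # OR
            0b101010: 0b0111}[funct]  # set-on-less-than
```

Note that for loads, stores, and branches the funct bits are don't-cares (the XXXXXX rows): the output is fixed by ALUOp alone.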
#### Main Control Unit
- Control signals derived from instruction

#### Datapath With Control
<!-- 0511 30:30 Slide 51 -->

#### R-Type Instruction
<!-- Slide 56 -->
#### Load Instruction
<!-- Slide 59 -->
### 4.5 An Overview of Pipelining
### 4.6 Pipelined Datapath and Control
#### MIPS Pipelined Datapath
<!-- Slide 106 -->

#### Pipeline registers
- Need registers between stages
- Hold information produced in the previous cycle
#### IF for Load & Store
<!-- Slide 112 -->

#### ID for Load & Store
<!-- Slide 113 -->

#### Multi-Cycle Pipeline Diagram
<!-- Slide 124 TB p.286 -->
- A form that shows resource usage
### 4.7 Data Hazards: Forwarding vs. Stalling
### 4.8 Control Hazards
### 4.9 Exceptions
### 4.10 Parallelism and Advanced Instruction Level Parallelism
#### Dynamically Scheduled CPU
<!-- Slide 222/p.328 -->

#### Register Renaming
- Reservation stations and reorder buffer effectively provide register renaming
- On instruction issue, when the operand is available in the register file or reorder buffer
- Copied to the reservation station
- No longer required in the register; can be overwritten
<!-- incomplete -->
#### Speculation
- Predict branch and continue issuing
- Don't commit until branch outcome determined
- Load speculation
- Avoid load and cache miss delay
- Predict the effective address
- Predict loaded value
- Load before completing
<!-- incomplete -->
#### Why Do Dynamic Scheduling
- Why not just let the compiler schedule code?
- Not all stalls are predictable
- e.g. cache miss
- Cannot always schedule around branches
- Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards
#### Power Efficiency
<!-- Slide 233 p.332 -->
- Complexity of dynamic scheduling and speculation requires power
- Multiple simpler cores may be better

### 4.11 Real Stuff: The ARM Cortex-A8 and Intel Core i7 Pipelines
### 4.12 Going Faster: Instruction-Level Parallelism and Matrix Multiply
### 4.14 Fallacies and Pitfalls
<!-- This is the end of Sophie -->