# 112-1 NTU-CA Note
## Lecture 2 - Performance/Power/Cost
### - Computer Architecture

---
### - Moore's Law (1965)
- "*The density of transistors in an IC will double every year.*"
- In practice: doubles roughly every 18 months
---
### - Technology Node
- a.k.a **Process Node** / **Process Technology**
- A specific semiconductor manufacturing process and its design rules.
---
### - Importance of Evaluation/Analysis
- Why do we care?
- Purchasing perspective
- Best performance / least cost?
- Design perspective
- Best performance / energy-efficiency?
- Measurement methods for ***performance/power/cost***
- **Metric**
- **Benchmark**
---
### - Term Def.
- **Response Time**
- How long it takes to do a task
- **Throughput**
- Total work done per time
- **Performance**
- $=1\ /\ Execution\ Time$
- $n = \frac{Performance_A}{Performance_B} \\\ \ =\frac{Execution\ Time_B}{Execution\ Time_A}$ (A is n times faster than B)
- **Elapsed Time**
- Total response time
- Process, I/O, OS overhead, Idle
- Determines system performance
- **CPU Time**
- Time spent on processing a given job
- **User CPU Time** / **System CPU Time**
- $= CPU\ Clock\ Cycles * Clock\ Cycle\ Time \\ =\frac{CPU\ Clock\ Cycles}{Clock\ Rate}$
(Clock Rate = 1/Clock Cycle Time = Clock Frequency)
- Performance can be improved by:
- Reducing number of clock cycles
- Increasing Clock Rate
(Often a trade-off between clock rate and cycle count)
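
As a minimal sketch of the definitions above, CPU time and relative performance can be computed directly from the formulas. The function names and the two sample machines are illustrative assumptions, not taken from the lecture.

```python
def cpu_time(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


def speedup(exec_time_a: float, exec_time_b: float) -> float:
    """How many times faster A is than B: n = time_B / time_A."""
    return exec_time_b / exec_time_a


# Hypothetical machines running the same job:
# A needs 20e9 cycles at 2 GHz, B needs 24e9 cycles at 4 GHz.
time_a = cpu_time(20e9, 2e9)   # 10.0 s
time_b = cpu_time(24e9, 4e9)   # 6.0 s
n = speedup(time_b, time_a)    # B is n times faster than A
```

Machine B executes more cycles but still finishes sooner because its clock rate is higher, which is the cycle/rate trade-off noted above.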
---
### - CPU Clocking

- **Clock Period**
- duration of a clock cycle
- e.g. $250ps = 0.25ns = 250*10^{-12}s$
- **Clock Frequency (Rate)**
- cycles per second
- e.g. $4.0GHz = 4000MHz = 4.0*10^{9}Hz$
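
The two examples above are reciprocals of each other. A small sketch of the conversion, with hypothetical helper names:

```python
def period_to_rate(clock_period_s: float) -> float:
    """Clock rate in Hz for a given clock period in seconds."""
    return 1.0 / clock_period_s


def rate_to_period(clock_rate_hz: float) -> float:
    """Clock period in seconds for a given clock rate in Hz."""
    return 1.0 / clock_rate_hz


rate = period_to_rate(250e-12)   # a 250 ps period is roughly 4.0e9 Hz (4 GHz)
period = rate_to_period(4.0e9)   # a 4 GHz clock has a period of roughly 250e-12 s
```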
---
### - Instruction Count / CPI (Cycles per Instruction)
- **Instruction Count**
- Determined by program, ISA, compiler
- **CPI**
- Cycles per Instruction
- Average CPI is affected by both hardware and software
- **Clock Cycles**
- $= Instruction\ Count *CPI$
- **CPU Time**
- $=Clock\ Cycles*Clock\ Cycle\ Time\\= Instruction\ Count * CPI * Clock\ Cycle\ Time \\=\frac{Instruction\ Count * CPI}{Clock\ Rate}$
- **Weighted CPI**
- $= \sum_{i=1}^n(CPI_i*\frac{Instruction\ Count_i}{Total\ Instruction\ Count})$
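
The weighted CPI and CPU time formulas above can be sketched as follows. The instruction classes, counts, and clock rate are made-up numbers for illustration only.

```python
def weighted_cpi(cpi_per_class, count_per_class):
    """Average CPI = sum(CPI_i * count_i) / total instruction count.

    This is the same quantity as sum(CPI_i * count_i / total).
    """
    total_count = sum(count_per_class)
    total_cycles = sum(cpi * count for cpi, count in zip(cpi_per_class, count_per_class))
    return total_cycles / total_count


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical instruction classes with CPI 1, 2, and 3.
cpi_classes = [1, 2, 3]
sequence_1 = [2, 1, 2]   # 5 instructions, 10 cycles
sequence_2 = [4, 1, 1]   # 6 instructions, 9 cycles

cpi_1 = weighted_cpi(cpi_classes, sequence_1)   # 2.0
cpi_2 = weighted_cpi(cpi_classes, sequence_2)   # 1.5
time_1 = cpu_time(sum(sequence_1), cpi_1, 1e9)  # on an assumed 1 GHz clock
time_2 = cpu_time(sum(sequence_2), cpi_2, 1e9)
```

Sequence 2 executes more instructions but fewer cycles, so it is faster: instruction count alone does not determine CPU time.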
---
### - Aspect of CPU Performance

- Compare all three components (instruction count, CPI, and clock cycle time) between machines
---
### - Power Trends

- In CMOS IC tech.
- $Power \approx Capacitive\ Load * Voltage^2 * Frequency$
- **Power**
- Watt
- a unit of power
- **Energy**
- Joule ($= Watt * Time$)
- Energy per operation
- **The Power Wall**
- Can't reduce voltage further
- Can't remove more heat
- Fallacy
- Computers draw little power at idle (x): power does not scale down in proportion to load
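
A minimal sketch of the CMOS power relation above, comparing a baseline chip with a hypothetical successor whose capacitive load, voltage, and frequency are each reduced to 85% of the original. The 85% figures are assumptions chosen for illustration.

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Power ~ capacitive load * voltage^2 * frequency."""
    return capacitive_load * voltage ** 2 * frequency


def energy(power_watts: float, time_s: float) -> float:
    """Energy in joules = power in watts * time in seconds."""
    return power_watts * time_s


old_power = dynamic_power(1.0, 1.0, 1.0)       # normalized baseline
new_power = dynamic_power(0.85, 0.85, 0.85)    # every factor scaled to 85%
power_ratio = new_power / old_power            # 0.85^4, roughly 0.52
```

Voltage enters squared, so lowering it was the most effective lever; the power wall arises once voltage cannot be reduced further.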
---
### - Multiprocessors
- **Multicore Multiprocessors**
- **Parallel Programming**
- Hard to do:
- Programming for performance
- Load balancing
- Optimizing communication and synchronization
- **Energy-Efficiency**

---
### - Integrated Circuit (IC) Cost
- **Cost per die**
- $=\frac{Cost\ per\ Wafer}{Dies\ per\ Wafer*Yield}$
- **Dies per Wafer**
- $\approx \frac{Wafer\ Area}{Die\ Area}$
- **Yield**
- $=\frac{1}{(1+\frac{Defects\ per\ Area*Die\ Area}{2})^2}$
- Proportion of working dies per wafer
- IC cost is a nonlinear function of die area and defect rate
- Wafer cost and area are fixed
- Defect rate determined by manufacturing process
- Die area determined by architecture and circuit design
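
The cost formulas above can be combined into a short sketch that shows the nonlinearity. The wafer cost, areas, and defect density below are invented values, and the function names are hypothetical.

```python
def yield_rate(defects_per_area: float, die_area: float) -> float:
    """Yield = 1 / (1 + defects_per_area * die_area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximation: wafer area / die area (ignores edge losses)."""
    return wafer_area / die_area


def cost_per_die(cost_per_wafer, wafer_area, die_area, defects_per_area):
    """Cost per die = cost per wafer / (dies per wafer * yield)."""
    good_dies = dies_per_wafer(wafer_area, die_area) * yield_rate(defects_per_area, die_area)
    return cost_per_wafer / good_dies


# Assumed numbers: $5000 wafer, 700 cm^2 wafer, 0.5 defects per cm^2.
small_die = cost_per_die(5000.0, 700.0, 1.0, 0.5)   # 1 cm^2 die
large_die = cost_per_die(5000.0, 700.0, 2.0, 0.5)   # 2 cm^2 die
```

Doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.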
---
### - SPEC CPU Benchmark
- Measure performance
- **Standard Performance Evaluation Corp** (SPEC)
- **Elapsed time** to execute a selection of programs
- Negligible I/O -> Focus on CPU performance
- Contain both integer and floating point applications
- CINT(integer) and CFP(floating-point)
- **SPECRatio**
- Normalize execution times to reference computer
- $=\frac{time\ on\ ref.\ computer}{time\ on\ computer\ being\ rated}$
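
A small sketch of the SPECRatio definition above, using invented timings. It also illustrates that the ratio of two SPECRatios equals the ratio of the measured execution times, so the choice of reference computer cancels out when comparing two machines.

```python
def spec_ratio(ref_time_s: float, measured_time_s: float) -> float:
    """SPECRatio = time on reference computer / time on computer being rated."""
    return ref_time_s / measured_time_s


# Hypothetical timings for one benchmark program.
ref_time = 9650.0
time_a = 500.0
time_b = 250.0

ratio_a = spec_ratio(ref_time, time_a)
ratio_b = spec_ratio(ref_time, time_b)
relative = ratio_b / ratio_a   # same as time_a / time_b
```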

---
### - SPEC Power Benchmark
- **SPECPower**
- Power consumption of server at different workload levels
- Run SPECJBB2005 (A Java business application)
- Report power consumption of servers at ***different workload levels***, divided into 10% increments
- Performance: $\frac{ssj\_ops}{sec}$
- Power: $Watt$
- Energy efficiency: $\frac{\#\ of\ operation}{Watt}$
- **Overall ssj_ops per Watt**:
- $=\frac{\sum_{i=0}^{10}ssj\_ops_i}{\sum_{i=0}^{10}power_i}$
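
The overall metric above sums over 11 measurement points (100% load down to 0% load). A sketch with made-up readings, where the helper name and all values are assumptions:

```python
def overall_ssj_ops_per_watt(ssj_ops, power_watts):
    """Sum of throughput over all load levels divided by sum of power."""
    if len(ssj_ops) != len(power_watts):
        raise ValueError("need one power reading per load level")
    return sum(ssj_ops) / sum(power_watts)


# Made-up readings for target loads 100%, 90%, ..., 10%, 0% (active idle).
levels = range(10, -1, -1)
ssj_ops = [100_000 * level // 10 for level in levels]   # 100000, 90000, ..., 0
power_watts = [100 + 10 * level for level in levels]    # 200, 190, ..., 100

overall = overall_ssj_ops_per_watt(ssj_ops, power_watts)  # 550000 / 1650
```

In this invented data set the idle server still draws half of its peak power while doing no work, which drags the overall figure below the 500 ssj_ops per watt achieved at full load.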
---
### - Amdahl's Law
- $T_{improved} = \frac{T_{affected}}{improvement\ factor}+T_{unaffected}$
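
A minimal sketch of Amdahl's Law, assuming a hypothetical program that runs for 100 s, of which 80 s is spent in the part being improved:

```python
def improved_time(t_affected: float, t_unaffected: float, improvement_factor: float) -> float:
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected


t_original = 100.0
t_5x = improved_time(80.0, 20.0, 5.0)       # 80/5 + 20 = 36 s
overall_speedup = t_original / t_5x         # less than 3x overall
t_limit = improved_time(80.0, 20.0, 1e12)   # approaches, but never goes below, 20 s
```

Even an enormous improvement factor cannot push the total below the unaffected 20 s, so the overall speedup for this program is capped at 5x.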
---
### - MIPS
- As a performance **metric**
- $=\frac{Instruction\ Count}{Execution\ Time*10^6}\\=\frac{Clock\ Rate}{CPI*10^6}$
- Doesn't account for:
- Difference in ISA between computers
- Difference in complexity between instructions
- CPU's performance can't be represented by a single MIPS value
- Different CPUs can't be compared with MIPS
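
The sketch below illustrates why MIPS can mislead. The two machines, their instruction counts, and their CPIs are hypothetical; both are assumed to run the same program at 4 GHz but with different ISAs.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time * 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_cpi(clock_rate_hz: float, cpi: float) -> float:
    """Equivalent form: MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    return instruction_count * cpi / clock_rate_hz


clock = 4e9
# Machine X: simple instructions, more of them. Machine Y: fewer, more complex ones.
time_x = cpu_time(10e9, 1.0, clock)   # 2.5 s
time_y = cpu_time(6e9, 1.5, clock)    # 2.25 s
mips_x = mips(10e9, time_x)           # about 4000
mips_y = mips(6e9, time_y)            # about 2667
```

Machine Y finishes the program sooner yet reports the lower MIPS rating, because MIPS ignores how much work each instruction does.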
---
## Lecture 3 - RISC-V: Instruction Set Architecture
### - Instruction Set Architecture

- Provides a layer of abstraction to programmers
- Easy to build the hardware and the compiler while maximizing performance and minimizing cost
- Good interface
- **Portability** / **Compatibility**
- **Generality** (Used in many ways)
- **Convenient** functionality to higher levels
- **Efficient** implementation at lower levels
---
### - General Purpose Register ISA
- **Register <-> Memory**
- 2 address
- add R1 A (R1 = R1 + mem[A])
- 3 address
- add R2 R1 A (R2 = R1 + mem[A])
- **Register to Register** (needs load-store -> more instructions)
- add R2 R1 R3 (R2 = R1 + R3)
- load R3 A (R3 = mem[A])
- store R3 A (mem[A] = R3)
---
### - RISC vs. CISC
- **RISC (Reduced Instruction Set Computer)**
- Perform AxB
- LOAD A 2:3
- LOAD B 5:2
- MULTI A B
- STORE 2:3 A
- ex. ARM, MIPS, RISC-V
- **CISC (Complex Instruction Set Computer)**
- Perform AxB
- MULT 2:3 5:2
- ex. Intel x86
---
### - RISC-V
- A new **ISA**
- Standard open architecture for industry implementations
- Similar ISAs have a large share of the embedded core market
- MIPS, ARM ...
- Design Principle:
- Simplicity favors regularity (Arithmetic operations)
- Smaller is faster (Register)
- Make the common case fast (Constant)
- Good design demands good compromise (Instruction)
---
### - Arithmetic Operations
- Each arithmetic operation has **exactly 3 operands**
- ADD **A**(dst) **B**(src1) **C**(src2)
- Operations:
- +, -, x, /
- Example: $f = (g+h) - (i+j)$
- add t0, g, h
- add t1, i, j
- sub f, t0, t1
---
### - Register Operands
- Operands of arithmetic operations must be stored in **registers**
- RISC-V has a **32 x 64-bit register** file
- Used for frequently accessed data
- **doubleword**: 64-bit data
- 32 x 64-bit **general purpose registers x0~x31**
- **word**: 32-bit data
- **RISC-V reg.** :
- x0: const value 0
- x1: ret addr.
- x2: stack pointer
- x3: global pointer
- x4: thread pointer
- x5~x7, x28~x31: temporaries
- x8: frame pointer
- x9, x18~x27: saved reg.
- x10~x11: function args/results
- x12~x17: function args
---
### - Memory Operands
- Data Transfer Instructions
- lw x9, 8(x22) #x9 = mem[8+reg[x22]]
- sw x9, 8(x22) #mem[8+reg[x22]] = x9
- Addressing
- **Byte-Addressing**
- **Offset**
- $=(size\ in \ byte)*(index)$
- Example
- For A[8], where A contains doubleword
- offset $= 8*8=64$
- Therefore, A[12] = h + A[8], (h in x21, A in x22)
- ld x9, 64(x22)
- add x9, x21, x9
- sd x9, 96(x22)
- Byte ordering

- **Big Endian**
- stores the **most-significant** byte at the smallest address (lowest)
- **Little Endian**
- stores the **least-significant** byte at the smallest address (lowest)
- Alignment

- RISC-V doesn't require that objects fall on an address that is a multiple of their size
- A word (4 bytes) is aligned if $address\%4=0$
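
The offset, byte-ordering, and alignment rules above can be tried out on a toy byte-addressed memory. This is only a sketch: the helpers `ld`, `sd`, `offset`, and `is_aligned` are hypothetical Python stand-ins, the base address and data values are invented, and little-endian order (which RISC-V uses) is assumed for the doubleword accesses.

```python
MASK64 = (1 << 64) - 1
memory = bytearray(256)  # toy byte-addressed memory


def sd(value: int, address: int) -> None:
    """Store a doubleword (8 bytes) in little-endian order."""
    memory[address:address + 8] = (value & MASK64).to_bytes(8, "little")


def ld(address: int) -> int:
    """Load a doubleword (8 bytes) in little-endian order, as an unsigned value."""
    return int.from_bytes(memory[address:address + 8], "little")


def offset(index: int, size_in_bytes: int = 8) -> int:
    """Byte offset = size in bytes * index."""
    return size_in_bytes * index


def is_aligned(address: int, size_in_bytes: int) -> bool:
    return address % size_in_bytes == 0


x22 = 0                      # assumed base address of A
x21 = 5                      # h
sd(37, x22 + offset(8))      # pretend A[8] already holds 37

x9 = ld(x22 + 64)            # ld  x9, 64(x22)
x9 = (x21 + x9) & MASK64     # add x9, x21, x9
sd(x9, x22 + 96)             # sd  x9, 96(x22)

# Byte ordering of the 32-bit value 0x0A0B0C0D, lowest address first.
big = (0x0A0B0C0D).to_bytes(4, "big")        # 0A 0B 0C 0D
little = (0x0A0B0C0D).to_bytes(4, "little")  # 0D 0C 0B 0A
```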
---
### - Register vs. Memory
- Access Speed
- Register faster than Memory
- Operating on data in memory requires loads and stores
- More instructions to be executed
- Compiler must use registers for variables as much as possible
- Only access memory for less frequently used variable
---
### - Constant / Immediate Operands
- Small constants are used frequently (~50% of operands)
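- Possible solutions
- Put typical constants in memory and load them (costs an extra load)
- Encode the constant in the instruction itself as an immediate operand (no load needed)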
- Example: Add constant to x22
- RISC-V Instruction:
- addi x22, x22, 4
---
### - Instructions
- Represented with binary in the computer
- Stored-Program concept
- Instructions are represented as numbers
- Hence, programs can be stored in memory to be read and written as numbers
---
### - RISC-V Instructions Format
- **R-format Instructions**
- **I-format Instructions**
- **S-format Instructions**
- Overview
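
As a sketch of how the three formats pack their fields into a 32-bit word, the encoders below follow the field layout of the RISC-V base ISA (an assumption brought in from the specification, since the notes do not list the fields). The function names are hypothetical, and the opcode and funct values for `add`, `addi`, and `sd` are likewise taken from the specification.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-format: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_i(imm, rs1, funct3, rd, opcode):
    """I-format: imm[11:0] in bits 31:20, then rs1, funct3, rd, opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_s(imm, rs2, rs1, funct3, opcode):
    """S-format: imm[11:5] in bits 31:25 and imm[4:0] in bits 11:7."""
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode)


add_word = encode_r(0b0000000, 21, 20, 0b000, 9, 0b0110011)   # add  x9, x20, x21
addi_word = encode_i(4, 22, 0b000, 22, 0b0010011)             # addi x22, x22, 4
sd_word = encode_s(96, 9, 22, 0b011, 0b0100011)               # sd   x9, 96(x22)
```

The S-format splits its immediate so that rs1, rs2, funct3, and opcode stay in the same bit positions as in the R-format.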
---
### - RISC-V Logical Operations
- Instructions for bitwise manipulation
- Shift

- I-type instruction format
- Shift left logical
- shift left and fill 0
- slli by i bits
- multiplies by $2^i$
- Shift right logical
- shift right and fill 0
- srli by i bits
- divides by $2^i$ (unsigned only)
- AND
- Mask bits in a word
- select some bits, clear others to 0
- OR
- Include bits in a word
- set some bits to 1, leave others unchanged
- XOR
- Differencing operation
- result bit is 1 where the two operands differ; XOR with 1 flips a bit, XOR with 0 leaves it unchanged
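
The logical operations above map directly onto integer bit operations. The sketch below models 64-bit registers with a mask; the helper names mirror the RISC-V mnemonics but are ordinary Python functions, and the sample value is arbitrary.

```python
MASK64 = (1 << 64) - 1  # keep results within a 64-bit register


def slli(x: int, i: int) -> int:
    """Shift left logical: fill with 0, multiplies by 2^i (modulo 2^64)."""
    return (x << i) & MASK64


def srli(x: int, i: int) -> int:
    """Shift right logical: fill with 0, divides an unsigned value by 2^i."""
    return (x & MASK64) >> i


value = 0b1011_0110                 # 182

shifted_left = slli(value, 2)       # 182 * 4
shifted_right = srli(value, 2)      # 182 // 4
masked = value & 0b0000_1111        # AND: keep the low 4 bits, clear the rest
included = value | 0b0000_0001      # OR: set bit 0, leave others unchanged
flipped = value ^ 0b1111_1111       # XOR: flip the low 8 bits
difference = value ^ value          # XOR of equal values is 0 (no bits differ)
```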