# 112-1 NTU-CA Note

## Lecture 2 - Performance/Power/Cost

### - Computer Architecture
![](https://hackmd.io/_uploads/SkMSoonCh.png =350x200)

---
### - Moore's Law (1965)
- "*The density of transistors in an IC will double every year.*"
- Reality: doubles about every 18 months

---
### - Technology Node
- a.k.a. **Process Node** / **Process Technology**
- A specific semiconductor manufacturing process and its design rules.

---
### - Importance of Evaluation/Analysis
- Why do we care?
    - Purchasing perspective
        - Best performance / least cost?
    - Design perspective
        - Best performance / energy-efficiency?
- Measurement methods for ***performance/power/cost***
    - **Metric**
    - **Benchmark**

---
### - Term Def.
- **Response Time**
    - How long it takes to do a task
- **Throughput**
    - Total work done per unit time
- **Performance**
    - $=1\ /\ Execution\ Time$
    - $n = \frac{Performance_A}{Performance_B} \\\ \ =\frac{Execution\ Time_B}{Execution\ Time_A}$
      (A is n times faster than B)
- **Elapsed Time**
    - Total response time
        - Processing, I/O, OS overhead, idle time
    - Determines system performance
- **CPU Time**
    - Time spent processing a given job
    - **User CPU Time** / **System CPU Time**
    - $= CPU\ Clock\ Cycles * Clock\ Cycle\ Time \\ =\frac{CPU\ Clock\ Cycles}{Clock\ Rate}$
      (Clock Rate = 1/Clock Cycle Time = Clock Frequency)
    - Performance can be improved by:
        - Reducing the number of clock cycles
        - Increasing the clock rate
        - Often a trade-off between cycle count and clock rate

---
### - CPU Clocking
![](https://hackmd.io/_uploads/rJfAlh3Ah.png)
- **Clock Period**
    - Duration of a clock cycle
    - e.g. $250ps = 0.25ns = 250*10^{-12}s$
- **Clock Frequency (Rate)**
    - Cycles per second
    - e.g. $4.0GHz = 4000MHz = 4.0*10^{9}Hz$

---
### - Instruction Count / CPI (Cycles per Instruction)
- **Instruction Count**
    - Determined by the program, ISA, and compiler
- **CPI**
    - Cycles per Instruction
    - An average, affected by both hardware and software
- **Clock Cycles**
    - $= Instruction\ Count *CPI$
- **CPU Time**
    - $=Clock\ Cycles*Clock\ Cycle\ Time\\= Instruction\ Count * CPI * Clock\ Cycle\ Time \\=\frac{Instruction\ Count * CPI}{Clock\ Rate}$
- **Weighted CPI**
    - $= \sum_{i=1}^n(CPI_i*\frac{Instruction\ Count_i}{Total\ Instruction\ Count})$

---
### - Aspect of CPU Performance
![](https://hackmd.io/_uploads/S1sX8nnA2.png =350x200)
- All 3 components should be compared between machines

---
### - Power Trends
![](https://hackmd.io/_uploads/SykrD3n0n.png =380x220)
- In CMOS IC tech.
    - $Power \approx Capacitive\ Load * Voltage^2 * Frequency$
- **Power**
    - Measured in Watts
- **Energy**
    - Measured in Joules ($= Watt * Time$)
    - Energy per operation
- **The Power Wall**
    - Can't reduce voltage further
    - Can't remove more heat
- Fallacy
    - Computers use little power at idle (x)

---
### - Multiprocessors
- **Multicore Multiprocessors**
- **Parallel Programming**
    - Hard to do:
        - Programming for performance
        - Load balancing
        - Optimizing communication and synchronization
- **Energy-Efficiency**
![](https://hackmd.io/_uploads/Hyc0d22Rn.jpg =470x190)

---
### - Integrated Circuit (IC) Cost
- **Cost per die**
    - $=\frac{Cost\ per\ Wafer}{Dies\ per\ Wafer*Yield}$
- **Dies per Wafer**
    - $\approx \frac{Wafer\ Area}{Die\ Area}$
- **Yield**
    - Proportion of working dies per wafer
    - $=\frac{1}{(1+\frac{Defects\ per\ Area*Die\ Area}{2})^2}$
- IC cost is nonlinear in die area and defect rate
    - Wafer cost and area are fixed
    - Defect rate is determined by the manufacturing process
    - Die area is determined by architecture and circuit design

---
### - SPEC CPU Benchmark
- Measures performance
- **Standard Performance Evaluation Corp** (SPEC)
- **Elapsed time** to execute a selection of programs
    - Negligible I/O -> focus on CPU performance
- Contains both integer and floating-point applications
    - CINT (integer) and CFP (floating-point)
- **SPECRatio**
    - Normalizes execution times to a reference computer
    - $=\frac{time\ on\ ref.\ computer}{time\ on\ computer\ being\ rated}$

![](https://hackmd.io/_uploads/B1ILh2hRh.jpg =400x250)

---
### - SPEC Power Benchmark
- **SPECPower**
    - Power consumption of a server at different workload levels
    - Runs SPECJBB2005 (a Java business application)
    - Reports power consumption of servers at ***different workload levels***, in 10% increments
    - Performance: $\frac{ssj\_ops}{sec}$
    - Power: $Watt$
    - Energy efficiency: $\frac{\#\ of\ operations}{Watt}$
- **Overall ssj_ops per Watt**:
    - $=\frac{\sum_{i=0}^{10}ssj\_ops_i}{\sum_{i=0}^{10}power_i}$

---
### - Amdahl's Law
- $T_{improved} = \frac{T_{affected}}{improvement\ factor}+T_{unaffected}$

---
### - MIPS
- Millions of Instructions Per Second, used as a performance **metric**
- $=\frac{Instruction\ Count}{Execution\ Time*10^6}\\=\frac{Clock\ Rate}{CPI*10^6}$
- Doesn't account for:
    - Differences in ISA between computers
    - Differences in complexity between instructions
- A CPU's performance can't be represented by a single MIPS value
- Different CPUs can't be compared with MIPS

---
## Lecture 3 - RISC-V: Instruction Set Architecture

### - Instruction Set Architecture
![](https://hackmd.io/_uploads/SkMSoonCh.png =350x200)
- Provides a layer of abstraction to programmers
- Makes it easy to build the hardware and the compiler while maximizing performance and minimizing cost
- Good interface
    - **Portability** / **Compatibility**
    - **Generality** (used in many ways)
    - **Convenient** functionality to higher levels
    - **Efficient** implementation at lower levels

---
### - General Purpose Register ISA
- **Register <-> Memory**
    - 2 address
        - add R1 A (R1 = R1 + mem[A])
    - 3 address
        - add R2 R1 A (R2 = R1 + mem[A])
- **Register to Register** (needs load-store -> more instructions)
    - add R2 R1 R3 (R2 = R1 + R3)
    - load R3 A (R3 = mem[A])
    - store R3 A (mem[A] = R3)

---
### - RISC v.s. CISC
- **RISC (Reduced Instruction Set Computer)**
    - Perform AxB
        - LOAD A 2:3
        - LOAD B 5:2
        - MULTI A B
        - STORE 2:3 A
    - ex. ARM, MIPS, RISC-V
- **CISC (Complex Instruction Set Computer)**
    - Perform AxB
        - MULT 2:3 5:2
    - ex. Intel x86

---
### - RISC-V
- A new **ISA**
- Standard open architecture for industry implementations
- Similar ISAs have a large share of the embedded core market
    - MIPS, ARM ...
- Design Principles:
    - Simplicity favors regularity (Arithmetic operations)
    - Smaller is faster (Register)
    - Make the common case fast (Constant)
    - Good design demands good compromises (Instruction)

---
### - Arithmetic Operations
- One operation must have **exactly 3 operands**
    - ADD **A**(dst) **B**(src1) **C**(src2)
- Operations:
    - +, -, x, /
- Example: $f = (g+h) - (i+j)$
    - add t0, g, h
    - add t1, i, j
    - sub f, t0, t1

---
### - Register Operands
- Operands of arithmetic operations must be stored in **registers**
- RISC-V has a **32 x 64-bit register** file
    - Used for frequently accessed data
    - **doubleword**: 64-bit data
        - 32 x 64-bit **general purpose registers x0~x31**
    - **word**: 32-bit data
- **RISC-V reg.**:
    - x0: const value 0
    - x1: ret addr.
    - x2: stack pointer
    - x3: global pointer
    - x4: thread pointer
    - x5~x7, x28~x31: temporaries
    - x8: frame pointer
    - x9, x18~x27: saved reg.
    - x10~x11: function args/results
    - x12~x17: function args

---
### - Memory Operands
- Data Transfer Instructions
    - lw x9, 8(x22) #x9 = mem[8+reg[x22]]
    - sw x9, 8(x22) #mem[8+reg[x22]] = x9
- Addressing
    - **Byte-Addressing**
    - **Offset**
        - $=(size\ in \ bytes)*(index)$
    - Example
        - For A[8], where A contains doublewords
            - offset $= 8*8=64$
        - Therefore, A[12] = h + A[8] (h in x21, base address of A in x22)
            - ld x9, 64(x22)
            - add x9, x21, x9
            - sd x9, 96(x22)
- Byte ordering
![](https://hackmd.io/_uploads/ByOKe5Bkp.png =450x225)
    - **Big Endian**
        - stores the **most-significant** byte at the smallest (lowest) address
    - **Little Endian**
        - stores the **least-significant** byte at the smallest (lowest) address
- Alignment
![](https://hackmd.io/_uploads/HkKKbqBy6.png =250x170)
    - RISC-V doesn't require that objects fall on an address that is a multiple of their size
    - A word (4 bytes) is aligned if $address\%4=0$

---
### - Register v.s. Memory
- Access Speed
    - Registers are faster than memory
- Operating on data in memory requires loads and stores
    - More instructions to be executed
- The compiler must use registers for variables as much as possible
    - Only access memory for less frequently used variables

---
### - Constant / Immediate Operands
- Small constants are used frequently (~50% of operands)
- Possible solution
    - Put typical constants in memory and load them (costs an extra load)
- RISC-V solution: instructions with an immediate operand
    - Example: add the constant 4 to x22
        - addi x22, x22, 4

---
### - Instructions
- Represented in binary in the computer
- Stored-Program concept
    - Instructions are represented as numbers
    - Hence, programs can be stored in memory to be read and written as numbers

---
### - RISC-V Instructions Format
- **R-format Instructions**
    - ![](https://hackmd.io/_uploads/S11mE5Sk6.png =500x300)
    - ![](https://hackmd.io/_uploads/H1UIE9SkT.png =300x200)
- **I-format Instructions**
    - ![](https://hackmd.io/_uploads/BkAkr9ryT.png =500x300)
- **S-format Instructions**
    - ![](https://hackmd.io/_uploads/HyBQr5Byp.png =500x300)
- Overview
    - ![](https://hackmd.io/_uploads/BkKFLcBJp.png =500x300)

---
### - RISC-V Logical Operations
- Instructions for bitwise manipulation
    - ![](https://hackmd.io/_uploads/B1bxDqByT.png =350x200)
- Shift
![](https://hackmd.io/_uploads/BJKUD5Bkp.png =500x60)
    - I-type instruction format
    - Shift left logical
        - shift left and fill with 0
        - slli by i bits
            - multiplies by $2^i$
    - Shift right logical
        - shift right and fill with 0
        - srli by i bits
            - divides by $2^i$ (unsigned only)
- AND
    - Mask bits in a word
        - select some bits, clear others to 0
- OR
    - Include bits in a word
        - set some bits to 1, leave others unchanged
- XOR
    - Differencing operation
        - result bit is 1 where the two operands differ
        - XOR with 1 flips a bit, XOR with 0 leaves it unchanged
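
The shift and bitwise operations above can be sketched in Python. This is only an illustration, assuming a 64-bit register width: `MASK64`, `slli`, and `srli` are hypothetical helper names chosen to mirror the RISC-V mnemonics, not part of any RISC-V toolchain, and the sample values are made up.

```python
# Assumption: registers are 64 bits wide, so results are truncated to 64 bits.
MASK64 = (1 << 64) - 1


def slli(x, i):
    """Shift left logical: low bits are filled with 0 (multiplies by 2**i)."""
    return (x << i) & MASK64


def srli(x, i):
    """Shift right logical: high bits are filled with 0 (unsigned divide by 2**i)."""
    return (x & MASK64) >> i


word = 0xABCD  # illustrative value

masked = word & 0x0FF0    # AND: keep the middle bits, clear the others to 0
included = word | 0x000F  # OR: set the low 4 bits to 1, leave the others unchanged
toggled = word ^ 0x00FF   # XOR: flip the low 8 bits, leave the others unchanged
inverted = word ^ MASK64  # XOR with all 1s behaves like a bitwise NOT

print(slli(0b1011, 3))  # 88, i.e. 11 * 2**3
print(srli(88, 3))      # 11, i.e. 88 / 2**3
print(hex(masked))      # 0xbc0
print(hex(included))    # 0xabcf
print(hex(toggled))     # 0xab32
```

Masking with `MASK64` is what makes bits shifted past bit 63 disappear, as they would in a fixed-width register; Python integers would otherwise keep growing.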