<style> .red { color: red; } .blue { color: blue; } </style> ## Introduction - Computer architecture (計算機結構) - 系統的功能性行為 - ex. data types、支援的功能 - Programmer 需要知道的。 - Computer organization (計算機組織) - 結構間的關係。 - ex. 對外的 interface、clock frequency - Programmer 可能看不到。 ### Classes of Computing Applications - **Personal Computers** (PC) - Designed for **use by an individual** - Usually incorporating a **graphics display**, a **keyboard**, and a **mouse** - Execute **3rd-party software** - ex. 桌機、筆電。 - **Servers** 伺服器 - Much **larger computers** usually are **accessed only via a network** - Used for **running larger programs** for **multiple users** simultaneously - To carry large **workloads**(工作量) - ex. 單一大型複雜工作或是很多小工作。 - Usually based on software from another source - Built from the same basis technology as PCs - The widest range in cost and capability - Low-end server: File Storage, simple web server ... - **Supercomputers**: consist of tens of thousands of processors and many terabytes of memory ![](https://hackmd.io/_uploads/rJ9c4RAR2.png) - **Embedded computers** 嵌入式系統 - A **computer inside another device** - Used for **running one predetermined application** or collection of software - Internet of Things (**IoT**) **物聯網** - ex: 家電, 汽車, 微電腦, 微控制器 - The widest range of application and performance. - Integrated with the hardware and delivered as a single system. - Embedded Applications - Often have unique application requirements - Combine a **minimum performance** with stringent **limitations on cost or power** - Often have lower tolerance for failure ### PostPC era 後個人電腦 - PC $\Rightarrow$ Personal mobile device (**PMD**) - **Small wireless devices** to connect to the Internet - **Like PCs**: users can download software (apps) to run on them. - **Unlike PCs**: keyboard and mouse $\Rightarrow$ touch-sensitive screen or speech input - Server $\Rightarrow$ **Cloud** computing - **Cloud computing** (雲端運算) - **Large collections of servers** that provide services over the Internet - Some providers rent dynamically varying numbers of **servers as a utility** - **Warehouse scale computers** (WSC) - To rely upon **giant datacenters** - Companies **rent portions of WSCs** and **provide software services** to PMDs - ***Without*** having to build WSCs of their own - **Software as a Service** (SaaS) 軟體即服務 - To deliver software and data as a service over the Internet - Only a **thin program** such as a browser **runs on local** client devices ## Seven Great Ideas in Computer Architecture ### 1. Use abstraction to simplify design - 抽象化概念簡化硬體設計 - Use abstractions to represent the design at **different levels of representation** - **Lower-level details are hidden** to offer a simpler model at higher levels ### 2. Make the common case fast - 常用的 case 要有較好的效能。 - Making the common case fast will tend to **enhance performance better** than optimizing the rare case - The common case is often simpler and easier to enhance than rare case ### 3. Performance via parallelism - 平行化提升效能。 - To perform operations in **parallel** to get **more performance** ### 4. Performance via pipelining (Chap 4) - An implementation technique in which multiple **instructions are overlapped in execution** - 重疊部分 instruction 的執行時間 **Non-pipelining** ![](https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/laundry1.gif) **Pipelining** ![](https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/laundry2.gif) ### 5. Performance via prediction - 利用預測提高效能。 - 預先執行一些還沒確定要執行的運算 - To **guess and start working** rather than wait until you know for sure - **Assumptions** - The recovery from a misprediction is not too expensive - The prediction is relatively accurate ### 6. Hierarchy 階層 of memories - 善用記憶體階層設計。 - Programmers want memory to be fast, large, and cheap - Hierarchy of memories - The fastest, smallest, and most expensive memory per bit at the top - The slowest, largest, and cheapest per bit at the bottom - Cache 快取 - Main memory - Disk ![](https://www.cs.csustan.edu/~john/Classes/CS3750/Notes/Chap01/1_06strgDvcHrarch.jpg) ### 7. Dependability via redundancy(冗餘) - Computers need to be **fast and dependable** - Since any physical device can fail, we make systems **dependable** by including redundant components - To **take over** when a **failure occurs** - To help detect failures ## About Moore’s Law 摩爾定律 - **Transistor density** increased by about 35% per year \+ **Die size** increased by about 10%~20% per year - $\Rightarrow$ **Transistor count on a chip** increased by about 40%~50% per year, or **doubling every 18~24 months** - This prediction was accurate for 50 years, but it is no longer accurate today > Die: 裸晶,未被封裝的積體電路 ## Below Your Program ### A simple view of hardware and software ![](https://hackmd.io/_uploads/rkleaJJk1T.png =200x) - **Application software** - To consists of millions of lines of code - To **rely on sophisticated (複雜的) software libraries** - Those libraries **implement complex functions** - **Systems software** - **OS** - Handling **basic input/output** operations - **Allocating** storage and memory (***resources***) - Providing for **protected sharing** of the computer among **multiple applications** using it **simultaneously**. - **Compiler** - To translate a program written in a **high-level languages into instructions** that the hardware can execute. - **Hardware** - Can only execute extremely simple low-level instructions ### From a high-level language to the language of hardware - **High-level programming language** - Composed of words and algebraic notation - To provide for **productivity and portability** - Can be translated by a compiler into assembly language - **Assembly language** - A **symbolic** representation of machine instructions - Can be translated by an **assembler** into the **binary version** - **Machine language** - A binary representation of machine instruction - **Binary digit** (bit) ![](https://images.slideplayer.com/25/7805712/slides/slide_3.jpg) ## Under the Covers ### <font class="red">Five classic components of a computer</font> - **Processor - CPU** (**datapath** + **control**) - **Input device** - **Output device** - **Memory** ![](https://hackmd.io/_uploads/BytsZyJJ6.png =700x) - This organization is independent of hardware technology ### Opening the box - Integrated circuit (chip) - Central processor unit (CPU) - Memory - **Dynamic** Random Access Memory (DRAM) - Cache memory - **Static** Random Access Memory (SRAM) - Instruction set architecture (ISA) - Application binary interface (ABI) ### A safe place for data - Volatile (揮發性) memory vs. Nonvolatile memory - **Volatile memory** - 只有在供電時才能儲存資料 - 斷電時資料消失 - ex. SRAM, DRAM - **Nonvolatile memory** - hard disk, flash memory, DVD, ... - Primary memory vs. Secondary memory - **Primary memory** - **Main memory** (通常指 DRAM) - **Secondary memory** - Storage Device (通常指 disk) ## Communicating with other computers - Networked computers advantages - **Communication** - **Resource sharing** - Nonlocal access (**remote access**) - Local area network (**LAN**) - Designed to carry data within a **geographically confined area** - Typically within a single building - Wide area network (**WAN**) - A network extended over hundreds of kilometers that can span a continent - Wireless network - WiFi, Bluetooth... ## Technologies for Building Processors and Memory - The technologies that have been used over time - Vacuum tube $\rightarrow$ - transistor $\rightarrow$ - integrated circuit (IC) $\rightarrow$ - very large-scale integrated circuit (VLSI) $\rightarrow$ - ultra large-scale integrated circuit (ULSI) ### The manufacture of a chip - Terminologies - Silicon - Semiconductor - Silicon crystal ingot - Wafer - Die (chip) - Defect - Yield - The cost of an integrated circuit - 根據經驗法則推出 - Cost per die = Cost per wafer / (Dies per wafer $\times$ Yield) - Dies per wafer $\approx$ Wafer area / Die area - $$\rm{Yield = \cfrac{1}{(1 + (Defects\ per\ area\times Die\ area / 2))^2}}$$ ## Performance ### Defining performance What is "one computer has ***better performance*** than another"? - The ***faster*** computer is the one that gets the job done ***first***. - The ***faster*** computer is the one that complete the ***most*** jobs. :::danger **總結: 電腦的效能是需要定義的。** ::: #### Response time vs Throughput - **Response time (*execution time*)** - 任務從開始到執行完成的時間。 - Including disk access, memory access, I/O activities, operating system overhead, CPU execution time, and so on... - **Throughput (*bandwidth*)** - 單位時間完成的工作量。 - 替換更快的處理器可使 response time 與 throughput 提升。 - 增加處理器進行平行處理可使 throughput 提升,若處理的需求量與 throughput 相同,則 response time 也會提升。 --- ==(p.29 公式)== 效能是執行時間的倒數 $$ \text{Performance}_x\ =\ \frac{1}{\text{Execution time}_x} $$ :::info **Example**: Computer A runs a program in 10 sec, computer B runs the **same program** in 15 sec. **How much faster is A than B?** $$ \rm\frac{Performance_A}{Performance_B}\ =\ \frac{Execution\ time_B}{Execution\ time_A}\ =\ \frac{15}{10}\ =\ 1.5 $$ So, **A is 1.5 times faster than B**, or **B is 1.5 times slower than A**. ::: ### Measuring performance **Time** is the measure of computer performance: - Wall-clock time (**response time**, elapsed time) - **The total time to complete a task** - 包含 CPU 執行的時間、I/O 的時間、waiting 的時間 ... - CPU execution time (**CPU time**) - The time the CPU spends computing for a task(不包含其他等待的時間) - **User CPU time**: the CPU time spent in a program itself. - **System CPU time**: the CPU time spent in the operating system performing tasks on behalf of the program. - E.g., OS 可能封裝了一些操作 - 例如 load A,邏輯上從 RAM 把 data 載入到 register,但實際上 data 可能在硬碟上 - OS 偷偷把 data 從硬碟拉到 RAM 裡面的時間就是 Sys CPU time - **System performance**: the elapsed time on an unloaded system - **CPU performance**: user CPU time Other metrics: - **Clock cycle** (tick, clock tick, clock period, clock, cycle) - The time for **one clock period** (ps) - **Clock rate** - The **inverse** of the clock period (Hz) ### CPU performance and its factors - ==**CPU execution time** for a program== = <font class="blue">**CPU clock cycles for a program $\times$ Clock cycle time**</font> = <font class="blue">**CPU clock cycles for a program / Clock rate**</font> - The hardware designer can **improve** performance by - Reducing the length of the clock cycle - Reducing the number of clock cycles required for a program - ***Trade-off*** between above two factors :::info **Example**: Improving performance **Q.** - A program runs in 10 seconds on computer A, which has a 2 GHz clock. - We want to build a computer B that will run this program in 6 seconds. - To increase the clock rate is possible, but will require 1.2 times as many clock cycles as computer A. - **What clock rate should computer B have?** **A.** 1. $$ \begin{split} \rm{CPU\ Time_A} &= \rm{CPU\ Clock\ Cycles_A} / {Clock\ Rate_A} \\ 10 &= \rm{CPU\ Clock\ Cycles_A} / (2 \times 10^9) \\ \rm CPU\ Clock\ &\rm Cycles_A = 20 \times 10^9 Cycles \end{split} $$ 2. $$ \begin{split} \rm{CPU\ Time_B} &= 1.2 \times \rm{CPU\ Clock \ Cycles_A} / {Clock\ Rate_B} \\ 6 &= 1.2\times (20 \times 10^9) / \rm{Clock\ Rate_B} \\ &\rm{Clock\ Rate_B} = 4GHz \end{split} $$ - ***To run the program in 6 seconds, B must have twice the clock rate of A***. ::: ### Instruction performance The execution time must depend on the number of instructions in a program: - **CPU clock cycles required for a program** = <font class="blue">**Instructions for a program $\times$ Average clock cycles per instruction**</font> - **Clock cycles per instruction (CPI)** = <font class="blue">**The <font style="background-color: yellow">average</font> number of clock cycles each instruction takes to execute**</font> :::info **Example**: Using the performance equation **Q.** - **Computer A and B have the same ISA** - Computer A: clock cycle time 250 ps, CPI 2.0 for **same** program - Computer B: clock cycle time 500 ps, CPI 1.2 for the **same** program - Which computer is faster for this program and by how much? **A.** - Assume the program has $I$ instruction - CPU clock cycles A = $I$ $\times$ 2.0, CPU clock cycles B = $I$ $\times$ 1.2 - CPU timeA = $I$ $\times$ 2.0 $\times$ 250 = 500 $\times$ $I$ ps **faster !** CPU timeB = $I$ $\times$ 1.2 $\times$ 500 = 600 $\times$ $I$ ps - ***Computer A is 1.2 times as fast as computer B for this program*** ::: ### The classic CPU performance equation - Instruction count - The number of instructions executed by the program - **CPU time** = <font class="blue">**Instruction count $\times$ CPI $\times$ Clock cycle time**</font> = <font class="blue">**Instruction count $\times$ CPI / Clock rate**</font> :::info **Example**: Comparing code segments **Q.** - A compiler designer is trying to decide between two code sequence for a particular computer - Which code sequence executes the most instructions? - Which will be faster? - What is the CPI for each sequence? ![](https://hackmd.io/_uploads/rysoEzOk6.png =x65)![](https://hackmd.io/_uploads/S1H24G_yT.png =x65) **A.** - The number of executed instructions - Sequence 1: 2 + 1 + 2 = 5 - Sequence 2: 4 + 1 + 1 = 6 **more !** - Execution time - CPU clock cycles~1~ = (2 $\times$ 1) + (1 $\times$ 2) + (2 $\times$ 3) = 10 cycles - CPU clock cycles~2~ = (4 $\times$ 1) + (1 $\times$ 2) + (1 $\times$ 3) = 9 cycles **faster !** - CPI - CPI~1~ = CPU clock cycles~1~ / Instruction count~1~ = 10/5 = 2 - CPI~2~ = CPU clock cycles~2~ / Instruction count~2~ = 9/6 = 1.5 ::: To combine all factors: $$ \rm{Time = \frac{Seconds}{Program} = \frac{Instructions}{Program}\times\frac{Clock\ cycles}{Instruction}\times\frac{Seconds}{Clock\ cycle}} $$ - ==The only complete and reliable measure of performance is time== - Changing the instruction set to lower instruction count $\Rightarrow$ An organization with a slower clock cycle time or higher CPI - CPI depends on type of instructions executed $\Rightarrow$ The fewest number of instructions may not be the fastest - How to determine the value of factors - **CPU execution time** - Running the program - **Clock cycle time** - Usually be published for a computer - **Instruction count** - Depend on the architecture but not on the exact implementation - 跟高階語言有關,跟硬體較無關係。 - Using software tools to profile the execution - Simulator of the architecture - Hardware counter - **CPI** - Depend on a wide variety of design details in the computer - Detailed simulation of an implementation - Hardware counter - To look at the different types of instructions and using their individual clock cycle counts $$ \text{CPU clock cycle} = \sum_{i=1}^n{CPI_i \times C_i} $$ ![](https://hackmd.io/_uploads/ryozFGuyT.png) - The danger of using only one factor to access performance - All three components must be considered - CPI varies by **instruction mix** - Major factors to affect CPI - The performance of the **pipeline** - The performance of the **memory system** ## The Power Wall ### Clock rate vs. Power - Both **increased rapidly for decades** and then flattened or **dropped off recently** - The reason they **grew together** - Clock rate and power are correlated - The reason for their recent slowing - 散熱問題 - We have to run into the practical **power limit** for **cooling commodity microprocessors** ### The really critical resource is *energy* in the PostPC era - Battery life can trump performance in the PMD - Architects of WSCs try to reduce the costs of powering and cooling 50,000 servers - The energy metric joules is a better measure than a power rate like watts - **Watts = Joules/Second** ### In CMOS technology - The primary source of energy consumption is **dynamic energy** - Consumed when transistors switch states from 0 to 1 and vice versa - $\rm{Energy \propto Capacitive\ load \times Voltage^2}$ - The energy of a pulse during the logic transition of $0 \rightarrow 1 \rightarrow 0\rm\ or\ 1 \rightarrow 0 \rightarrow 1$ - $\rm{Energy \propto \frac{1}{2}Capacitive\ load \times Voltage^2}$ - The energy of a single transition - The power required per transistor - $\rm{Power \propto Capacitive\ load \times Voltage^2 \times Frequency\ switched}$ - **Frequency switched** is a function of the **clock rate** - **Capacitive load** per transistor is a function of **fanout** and the technology - Clock rates grew by a factor of 1000 while power grew by only a factor of 30 - Energy and thus **power can be reduced by lowering the voltage** - The voltage was reduced about 15% per generation - Voltages have gone from 5V to 1V in 20 years :::info **Relative power** - **Q**: - A new processor with 85% capacitive load, voltage is reduced by 15%, and 15% shrink in frequency - What is the impact on dynamic power? - **Sol**: ![](https://hackmd.io/_uploads/ryLw2fdJa.png) ::: ### Problem today - Further lower the voltage will make the transistors too leaky - 40% of the power consumption in server chips is due to **leakage**.(不使用也會損耗) - If transistors started leaking more, the whole process could become unwieldy. - Power reducing methods - To attach large devices to increase cooling - Too expensive for PCs and even servers, not to mention PMDs - To turn off parts of the chip that are not used in a given clock cycle - Computer designers slammed into a power wall ### Multicore microprocessor - The improving rate in response time of programs has slowed down - Since 2006, microprocessors $\rightarrow$ **multiple processor per chip** - The benefit is often more on **throughput** than on **response time** - Terminologies: Processor vs. Microprocessor - **Processors** = **Cores** - **Microprocessors** are generally called **multicore microprocessors** - Ex. quadcore - For programmers to get significant improvement in response time - **In the past** - To **rely on innovations in hardware**, architecture, and compilers to double performance **every 18 months** - Without having to change a line of code - **Today** - To **rewrite programs** to take advantage of multiple processors (將程式平行化) - To continue to improve performance of their code as the number of cores increases ### It is hard to write explicitly parallel programs - Programming for performance - The program need to be **not only correct but also fast** - To provide a **useful interface** to the people or other programs that invoke it - **Load balancing** - **pros** - To **divide an application** - That **each processor** has roughly **the same amount to do** at the same time - **cons** - The overhead of **scheduling** and coordination - **Communication** and **synchronization** overhead ## Real Stuff: Benchmarking the Intel Core i7 ### SPEC CPU benchmark - **Workload** - A set of programs run on a computer. - Either the actual collection of applications run by a user or constructed from real programs to approximate such a mix. - Typically specify both the programs and the relative frequencies. - **Benchmark** - Programs specifically **chosen to measure performance**. - To predict the performance of the actual workload. - To know accurately “which case is common” - **SPEC** (system performance evaluation corporation) **benchmark** - Standard sets of benchmarks for modern computer systems - Funded and supported by a number of **computer vendors** ### SPEC power benchmark - To report power consumption of servers at different workload levels over a period time - Divided into 10% increments - Performance is measured in throughput - Unit: business operations per second (ssj_ops) - Summary measurement ![](https://hackmd.io/_uploads/S1JLFXdJ6.png) ## Fallacies(謬論)and Pitfalls(陷阱) ### Fallacies - Computers at low utilization use little power - Utilization of servers in Google’s WSC - Mostly operates at 10%~50% of the load - At 100% of the load less than 1% of the time - The specially configured computer in 2020 still uses 33% of the peak power at 10% of the load (the same as in 2012) - To redesign hardware to achieve energy-proportional computing - Designing for performance and designing for energy efficiency are unrelated goals - Energy = power $\times$ time - Hardware or software optimizations that take less time save energy overall even if the optimization takes a bit more energy when it is used. ### Pitfalls - 期望電腦某一方面的改進能夠以與成比例的方式改進電腦的效能。 - Expecting the improvement of one aspect of a computer to increase performance by an amount proportional to the size of the improvement. #### Amdahl's law A rule stating that the performance enhancement possible with a given improvements is limited by the amount that the improved feature is used. --- - Using a subset of the performance equation as a performance metric - Nearly all alternatives to the use of time as the performance metric have led to misleading claims, distorted results, or incorrect interpretations - MIPS (million instructions per second) ![](https://hackmd.io/_uploads/H1Sgsm_1p.png =500x)