# CA Mid 1 (3/21)

<style>
.markdown-body li + li { padding-top: 0 !important; }
</style>

---

:::danger
:page_facing_up: ***With 1 sheet (2 pages) of A4 handwritten notes*** :page_facing_up:
:::

:::info
**Videos**:
- [02/14](https://www.youtube.com/watch?v=GNNdRo7KwgA)
- [02/21](https://www.youtube.com/watch?v=KOjRA34Vc38)
- [03/07](https://www.youtube.com/watch?v=SOSP3xj4WyQ)
- [03/14](https://www.youtube.com/watch?v=Zs6OS4yVPXU)
:::

:::info
**Other notes**
- https://hackmd.io/@sysprog/H1sZHv4R?type=view
- https://cdn.discordapp.com/attachments/756081718124871710/955291400293986314/20220321_102532.jpg
- https://cdn.discordapp.com/attachments/756081718124871710/955291400843427870/20220321_102548.jpg
:::

---

[TOC]

---

## Past Exams 2021

1. No. Executing more instructions within a second does not mean completing the same program (task) faster, because the same program can be compiled into different numbers of instructions on different instruction set architectures.
   $CPUTime = InstructionCount \times CPI \times ClockCycleTime$
2. $FIT_o = 20 \times 1/1000000 + 1/1000000 + 1/500000 + 1/200000 + 1/200000 = (20+1+2+5+5)/10^6 = 33000/10^9$
   $MTTF_o = 10^9 / 33000$
   $FIT_p = 20 \times 1/1000000 + 1/1000000 + 1/(500000 \times 1000) + 1/200000 + 1/200000 = (20+1+0.002+5+5)/10^6 = 31002/10^9$
   $MTTF_p = 10^9 / 31002$
   $\frac{MTTF_p}{MTTF_o} = \frac{10^9 / 31002}{10^9 / 33000} = \frac{33000}{31002} \approx 1.06$
   or $(1-\frac{2}{33} + \frac{2}{33} \times \frac{1}{1000})^{-1} \approx 1.06$
3. The execution times of the programs in a benchmark suite can differ widely, so the longer-running programs are automatically given higher weights and dominate the arithmetic average.
   - For example: on one computer, program A takes 20 ms and program B takes 2 ms. On another computer, A takes 25 ms while B takes only 0.001 ms; by the arithmetic average the second computer is judged slower, and the huge improvement on program B is essentially ignored.
4. Wrong. The fractions in Amdahl's Law should be measured as fractions of execution time, not instruction frequency.
   $Frac_1 = 0.2,\ Frac_2 = 0.4$
   $Speedup_1 = (1-0.2 + 0.2 / 10)^{-1} \approx 1.22$
   $Speedup_2 = (1-0.4 + 0.4 / 2)^{-1} = 1.25$
   $(CPI = 2,\ CPI_1 = 2 - 0.02 \times (20-2) = 1.64,\ CPI_2 = 2 \times 0.2 + 1.5 \times 0.8 = 1.6)$
5. Accessing registers is faster than accessing memory. Registers are more efficient for a compiler to use than other forms of internal storage. Registers can hold variables, so memory traffic is reduced and code density improves. They can also be more efficient thanks to optimizations of operand evaluation order and pipelining.
6. *(Out)*
7. It is better to use a multi-cycle implementation, because it allows a functional unit to be used more than once per instruction, as long as it is used on different cycles, and it allows instructions to take different numbers of clock cycles.
   Because in the MIPS architecture all instructions are 4 bytes long and memory accesses must be aligned, the two least-significant bits of every instruction address are `00`. Accordingly, it is unnecessary to store these two bits when storing the displacement in `Imm`; when the displacement is used, `Imm` is simply shifted left by 2.
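A quick numeric check of Q2 and Q4 above (a minimal Python sketch; all numbers come from the answers themselves):

```python
# Q2: FIT is failures per 10^9 hours; MTTF = 10^9 / FIT.
fit_o = (20 / 1_000_000 + 1 / 1_000_000 + 1 / 500_000
         + 1 / 200_000 + 1 / 200_000) * 1e9            # 33000 FIT
fit_p = (20 / 1_000_000 + 1 / 1_000_000 + 1 / (500_000 * 1000)
         + 1 / 200_000 + 1 / 200_000) * 1e9            # 31002 FIT
print(fit_o / fit_p)        # MTTF_p / MTTF_o = 33000 / 31002 ~ 1.06

# Q4: Amdahl's Law, with fractions measured in execution TIME.
def amdahl(frac_e, speedup_e):
    return 1 / ((1 - frac_e) + frac_e / speedup_e)

print(amdahl(0.2, 10))      # ~1.22
print(amdahl(0.4, 2))       # 1.25
```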
## Past Exams 2020

1. ==TODO== More complex. More
2. *(Answered)*
3. PC-relative addressing is an addressing mode in which a displacement is added to the program counter, which is useful for conditional branches that jump across only a few instructions. It also permits code to run independently of where it is loaded (position independence) and makes good use of instruction encoding space, as in MIPS J-type instructions.
4. *(Similar)* ==TODO==
5. *(Out)*
6. Limits on power, cooling, and available ILP, together with long memory latency, have slowed uniprocessor performance.
7. *(Out)*
8. *(Out)*
9. ==TODO==
10. B, C

## Past Exams 2019

1. *(Out)*
2. *(Similar)* ==TODO==
3. *(Similar)* ==TODO==
4. *(Similar)*
   $Speedup = (1-0.6+0.6 / 3)^{-1} = 1 / 0.6 \approx 1.67$
5. *(Out)*
6. $CPUTime = IC \times CPI \times ClockCycleTime$; $IC$ and $ClockCycleTime$ would not change.
   $CPI = 0.4 \times 2 + 0.3 \times 3 + 0.3 \times 3 = 2.6$
   $CPI_A = 0.4 \times 1 + 0.3 \times 3 + 0.3 \times 3 = 2.2$
   $CPI_B = 0.4 \times 2 + 0.3 \times 3 + 0.3 \times 2 = 2.3$
   Enhancement A is better.

## Past Exams 2018

1. *(Answered)*
2. RISC focuses the attention of designers on instruction-level parallelism and the use of caches. A RISC chip will typically have fewer transistors dedicated to the core logic, which originally allowed designers to increase the size of the register set and increase internal parallelism. Also, RISC has a uniform instruction format (fixed-length, 4 bytes, single word), simple addressing modes, and general-purpose registers, which simplify compiler design.
3. *(Similar)* ==TODO==
4. *(Answered)*
5. *(Similar)* ==TODO==
7. *(Similar)* ==TODO==
8. *(Out)*

## 1. Fundamentals Of Computer Design

### Growth in Processor Performance

- **Moore's Law**: the maximum number of transistors in an IC doubles every two years
    - slowdown due to physical limitations and economic factors
- **Dennard scaling**: as transistors get smaller, their power density stays roughly constant
    - ended around 2006
- Timeline:
    - Before 1986: technology driven
    - After: more advanced architectural and organizational ideas
- Limits of power, available ILP, and long memory latency have slowed uniprocessor performance

### RISC

- **RISC**: Reduced Instruction Set Computer
- **ILP**: Instruction Level Parallelism
    - <-> **TLP**: Thread Level Parallelism
    - <-> **DLP**: Data Level Parallelism
- the use of caches
- RISC chips typically have far fewer transistors dedicated to the core logic
    - allows designers to increase the size of the register set
- Uniform instruction format
- General-purpose registers

### Trends in Technology

- $Power_{dynamic} = 1/2 \times CapacitiveLoad \times Voltage^2 \times FrequencySwitched$ (watts)
- $Energy_{dynamic} = CapacitiveLoad \times Voltage^2$ (joules)
- Slowing the clock rate reduces power, not energy
- The limit of what can be cooled by air
    - Challenges: distributing the power, removing the heat, preventing hot spots
    - Limits of air cooling -> multiple processors on a chip
        - relatively lower power and clock rates
- $ICCost = \frac{DieCost + TestingCost + PackagingCost}{FinalTestYield}$
- $DieCost = \frac{WaferCost}{DiesPerWafer \times DieYield}$
- $DiesPerWafer = \frac{\pi(WaferR)^2}{DieArea} - \frac{2\pi \times WaferR}{\sqrt{2 \times DieArea}} - TestDie$
- $DieYield = WaferYield \times (1 + \frac{DefectDensity \times DieArea}{\alpha})^{-\alpha}$
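A minimal Python sketch of the die-cost formulas above; the example numbers at the bottom are made up for illustration, not taken from the course:

```python
import math

def dies_per_wafer(wafer_radius, die_area, test_dies=0):
    """DiesPerWafer = pi*R^2/DieArea - 2*pi*R/sqrt(2*DieArea) - TestDie"""
    return (math.pi * wafer_radius**2 / die_area
            - 2 * math.pi * wafer_radius / math.sqrt(2 * die_area)
            - test_dies)

def die_yield(wafer_yield, defect_density, die_area, alpha=4.0):
    """DieYield = WaferYield * (1 + DefectDensity*DieArea/alpha)^(-alpha)"""
    return wafer_yield * (1 + defect_density * die_area / alpha) ** -alpha

def die_cost(wafer_cost, wafer_radius, die_area,
             wafer_yield, defect_density, alpha=4.0):
    """DieCost = WaferCost / (DiesPerWafer * DieYield)"""
    n = dies_per_wafer(wafer_radius, die_area)
    y = die_yield(wafer_yield, defect_density, die_area, alpha)
    return wafer_cost / (n * y)

# Hypothetical example: 15 cm wafer radius, 1.5 cm^2 die, $5000 wafer,
# 0.4 defects/cm^2, alpha = 4.
print(die_cost(5000, 15, 1.5, 1.0, 0.4))
```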
### Availability/Reliability

- Module reliability $\approx$ time to failure
- **Service Level Agreements (SLA)**: two service states
    1. Service accomplishment
    2. Service interruption
    - Failure: 1 -> 2
    - Restoration: 2 -> 1
- $MTTF =$ total number of hours the disks worked cumulatively $/$ the number of disks that failed
- $FIT = 10^9 / MTTF$ (failures per billion hours)
- $MTBF = MTTF + MTTR$
- $ModuleAvailability = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR}$
- $MTTF_{pair} = \frac{MTTF / 2}{MTTR / MTTF} = \frac{MTTF^2}{2 \times MTTR}$
- For such a large MTTF to make sense, assume the user keeps replacing failed disks over the system's lifetime.
- $FailedDisks = NumberOfDisks \times TimePeriod / MTTF$

### Performance

- "X is n times faster than Y"
    - $n = \frac{ExTime_Y}{ExTime_X} = \frac{Performance_X}{Performance_Y}$
- Arithmetic average: bad
    - programs taking longer are automatically given higher weights and dominate the arithmetic average
- Weighting so that each program takes equal time on some reference computer: bad
    - results depend on the choice of reference computer
- **SPECRatio**
    - the choice of reference computer is irrelevant (the reference time cancels out)
    - the geometric mean is the proper average
    - unit-less, so the arithmetic mean is meaningless
    - $\frac{SPECRatio_A}{SPECRatio_B} = \frac{ExTime_B}{ExTime_A} = \frac{Performance_A}{Performance_B}$
    - $\frac{GeometricMean_A}{GeometricMean_B} = \frac{\sqrt[n]{\prod_{i=1}^n{SPECRatioA_i}}}{\sqrt[n]{\prod_{i=1}^n{SPECRatioB_i}}} = \sqrt[n]{\prod_{i=1}^n{\frac{PerformanceA_i}{PerformanceB_i}}}$
- Quantitative Principles of Computer Design
    - Take advantage of parallelism
    - Principle of locality
    - Make the common case fast
- **Amdahl's Law**
    - $ExTime_n = ExTime_o \times ((1-Frac_e) + \frac{Frac_e}{SpeedUp_e})$
    - $SpeedUp = \frac{ExTime_o}{ExTime_n} = ((1-Frac_e) + \frac{Frac_e}{SpeedUp_e})^{-1}$
    - can also be applied to reliability
- CPU performance
    - $CPUTime = InstructionCount \times CPI \times ClockCycleTime$
    - Amdahl's Law should be applied to fractions of TIME, not FREQUENCY (when calculating with CPI)
    - Percentage of IC $\ne$ fraction of ExTime
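A minimal sketch tying Amdahl's Law and the CPU-time equation together; the instruction count and cycle time at the bottom are illustrative values, not from the course:

```python
def amdahl_speedup(frac_e, speedup_e):
    """SpeedUp = ((1 - Frac_e) + Frac_e / SpeedUp_e)^-1"""
    return 1 / ((1 - frac_e) + frac_e / speedup_e)

def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPUTime = InstructionCount * CPI * ClockCycleTime"""
    return instruction_count * cpi * clock_cycle_time

# Frac_e must be a fraction of execution TIME, not instruction frequency:
print(amdahl_speedup(0.6, 3))       # 1/0.6 ~ 1.67 (as in 2019 Q4)

# CPI as a frequency-weighted sum, then total CPU time:
cpi = 0.4 * 2 + 0.3 * 3 + 0.3 * 3   # 2.6 (as in 2019 Q6)
print(cpu_time(1e9, cpi, 1e-9))     # 10^9 insts, 1 ns cycle -> 2.6 s
```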
## 2. Instruction Set

### Classifying ISA

- Class
    - **Stack**
        - operands are implicitly on the top of the stack
    - **Accumulator**
        - one operand is implicitly the accumulator, the other is explicit
    - **Register-memory** (80x86)
        - only explicit operands; operations may mix registers and memory
    - **Register-register/load-store** (MIPS)
        - only explicit operands; all data must be loaded into registers before other operations (e.g. ALU)
- **General Purpose Register**
    - Faster than memory
    - More efficient for a compiler to use than other forms of internal storage
    - Memory traffic reduced; code density improves
    - EX: `(A*B)-(B*C)-(A*D)`
        - stack: can be evaluated in only one order
        - GPR: more efficient (choice of operand location, pipelining)
    - Registers are an important resource to the compiler; to use them effectively, register counts have increased
- ALU instruction comparison
  ![](https://i.imgur.com/mPCF9Vq.png =400x)

<!--
- Reg-reg
    - Simple, fixed-length; simple code generation; similar CPI across instructions
    - Higher CPI; lower instruction density -> larger program
- Reg-mem
    - access data without a load; format easy to encode
    - operands not equivalent; CPI varies
- Mem-mem
    - most compact; does not waste registers for temporaries
    - large variation in instruction size and CPI; memory bottleneck
-->

### Memory Addressing

- Byte-addressed
- **Byte** (8b), **half word** (16b), **word** (32b), **double word** (64b)
- Ordering bytes
    - **Little Endian**: `(0000) 78|56|34|12 (0003)`
    - **Big Endian**: `(0000) 12|34|56|78 (0003)`
- An access to an object of size $s$ bytes at byte address $A$ is aligned if $A \bmod s = 0$
- (see the endianness/alignment sketch at the end of this chapter)

### Addressing Mode

![](https://i.imgur.com/8ApVI6l.png =650x)
![](https://i.imgur.com/kOGwPJv.png =650x)

- **Displacement** and **immediate** addressing dominate addressing-mode usage
- Control flow instructions
    - PC-relative: a displacement added to the program counter

### MIPS Instructions

![](https://i.imgur.com/zfv4UBa.png =350x)

- **I-type**: load, store, conditional branch
- **R-type**: ALU
- **J-type**: jump, jump and link

![](https://i.imgur.com/gXVogo6.png =300x)

- There is no typical program: programs can vary significantly
- All architecture design involves trade-offs

### Single/Multi-Cycle Implementation

- **Single-Cycle Implementation**
    - Inefficient: the clock cycle is determined by the longest possible path
        - long clock cycle time
- **Multi-Cycle Implementation**
    - Each step takes 1 clock cycle
    - Allows a functional unit to be used more than once per instruction, as long as it is used on different cycles
    - Allows instructions to take different numbers of clock cycles
    - Registers are added after every major functional unit
        - Instruction register
        - Memory data register
        - A and B registers
        - ALUOut register
    - Every instruction can be implemented in at most 5 clock cycles
        - **IF** (instruction fetch)
        - **ID** (instruction decode/register fetch)
        - **EX** (execution/effective address)
        - **MEM** (memory access/branch completion)
        - **WB** (write-back)

![](https://i.imgur.com/NEf5WRe.png =200x)
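The endianness/alignment sketch referenced in the Memory Addressing section above (a minimal Python 3.8+ illustration; the value `0x12345678` mirrors the byte diagrams):

```python
import struct

# 0x12345678 stored at byte addresses 0000..0003, as in the diagrams above.
value = 0x12345678
print(struct.pack('<I', value).hex('|'))  # little endian: 78|56|34|12
print(struct.pack('>I', value).hex('|'))  # big endian:    12|34|56|78

def aligned(address, size):
    """An access of size s bytes at byte address A is aligned iff A mod s == 0."""
    return address % size == 0

print(aligned(0x1004, 4))  # True: word access on a 4-byte boundary
print(aligned(0x1006, 4))  # False: misaligned word access
```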
## 3. Pipeline

### Pipelining

![](https://i.imgur.com/aV0M8uJ.png =500x)

- 5 steps of the **DLX datapath**
    - **IF** (instruction fetch)
    - **ID** (instruction decode/register fetch)
    - **EX** (execution/effective address)
    - **MEM** (memory access/branch completion)
    - **WB** (write-back)
- 4 pipeline registers
    - **IF/ID**
    - **ID/EX**
    - **EX/MEM**
    - **MEM/WB**
    - Ensure that instructions in different stages do not interfere with one another
    - Carry intermediate results from one stage to the next
- **Steady state**
    - 5 instructions in flight at the same time
- Separate instruction memory & data memory
- Register file
    - write: first half of the cycle
    - read: second half of the cycle

### Basic Performance Issues in Pipelining

- $IdealCPI = 1$
- Hazards -> stalls
    - Structural hazards: the hardware cannot support this combination of instructions
    - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
    - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
- $CPI_p = IdealCPI + AverageStallCyclesPerInst$
- $Speedup = \frac{CPI_o}{CPI_p} \times \frac{CycleTime_o}{CycleTime_p}$
  $= \frac{IdealCPI \times PipelineDepth}{IdealCPI + PipelineStallCPI} \times \frac{CycleTime_o}{CycleTime_p},\ (CPI_o = IdealCPI \times PipelineDepth)$
  $= \frac{PipelineDepth}{1 + PipelineStallCPI} \times \frac{CycleTime_o}{CycleTime_p},\ (IdealCPI = 1)$
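A minimal sketch of the final speedup formula above (the depth and stall-CPI values are illustrative, not from the course):

```python
def pipeline_speedup(depth, stall_cpi, cycle_time_ratio=1.0):
    """Speedup = PipelineDepth / (1 + PipelineStallCPI) * (CycleTime_o / CycleTime_p),
    assuming IdealCPI = 1 and CPI_o = IdealCPI * PipelineDepth."""
    return depth / (1 + stall_cpi) * cycle_time_ratio

print(pipeline_speedup(5, 0.0))   # ideal 5-stage pipeline: 5.0x
print(pipeline_speedup(5, 0.5))   # 0.5 stall cycles per inst: ~3.33x
```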