# CA Fin (6/6)

<style>
.markdown-body li + li { padding-top: 0 !important; }
.markdown-body table th, .markdown-body table td { padding: 2px 10px; }
</style>

---

:::danger
:page_facing_up: ***With 1 sheet (2 pages) of A4 handwritten notes*** :page_facing_up:
R: 39~59, C: ~150
:::

:::info
**Videos**:
- [05/09](https://www.youtube.com/watch?v=VdDC_dO7ne4)
- [05/16](https://www.youtube.com/watch?v=3UQ3RhSFaww)
- ~~[05/23]()~~
- [05/30](https://www.youtube.com/watch?v=AfLUHr7dvcE)
- ~~[06/06]()~~
:::

:::info
**Other notes**
- https://hackmd.io/@sysprog/H1sZHv4R?type=view
:::

---

[TOC]

---

## ==TODO==

- Past
- Collect the slides' exercises

## 7. Branch Prediction

### Exceptions and Interrupts

- Polling & interrupts
    - Polling: the processor manually checks the device
    - Interrupts: a "hiccup" in the normal flow of control
    - Polling can be faster than interrupts:
        - no need to save and restore registers
        - other interrupt overhead is avoided
    - Polling can be slower than interrupts:
        - the overhead of the polling instructions is incurred
        - the device may have to wait a long time for service
    - Which to use?
        - Polling: when the device can be controlled at user level
        - Interrupts: for infrequent, irregular events
- Exceptions & interrupts
    - Exceptions: relevant to the current process
    - Interrupts: caused by asynchronous, outside events
    - Synchronous: related to the instruction stream
    - Asynchronous: unrelated to the instruction stream
- Precise interrupts/exceptions
    - restart execution at the interrupt point and "get the right answer"

## 8. Multi-Processor

### Multi-Processor

- Why multiprocessors are attractive
    - A growth in data-intensive applications
    - A growing interest in server markets
    - Complexity of current microprocessors
        - Collecting several processors is much easier than redesigning one
    - Steady improvement in parallel software (scientific apps, databases, OS)
    - Long-term goal of the field: scale the number of processors to the size of the budget / desired performance
- Categories
    - SISD (Single Instruction stream, Single Data stream)
        - uniprocessor
    - SIMD (Single Instruction stream, Multiple Data streams)
        - data-level parallelism
        - each processor has its own data memory
        - a single instruction memory and control processor
    - MISD (Multiple Instruction streams, Single Data stream)
    - MIMD (Multiple Instruction streams, Multiple Data streams)
        - thread-level parallelism
- Challenges of parallel processing:
    - limited parallelism available in programs
        - running independent tasks
    - high cost of communication
        - threads must communicate to complete the task
        - large latency of remote access
            - in existing shared-memory multiprocessors, communication costs range from 50 clock cycles (multicore) to 10000 clock cycles (large-scale multiprocessors)

### SIMD

- Why SIMD:
    - architectures can exploit significant data-level parallelism (e.g. matrix-oriented computation)
    - more energy-efficient than MIMD
    - allows the programmer to continue to think sequentially
- Vector arch:
    - basic idea:
        - read sets of data elements into "vector regs"
        - operate on those regs
        - regs are controlled by the compiler
            - hide memory latency
            - leverage memory bandwidth
        - a significant number of read and write ports
- VMIPS (vector arch):
    - components:
        - vector regs
        - vector FUs
        - vector load-store unit
        - scalar regs
    - Y = a * X + Y:
        - SAXPY/DAXPY loop (single/double precision a\*X plus Y); see the C sketch below
        - MIPS vs. VMIPS:
            - overhead insts are not present in the VMIPS code
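As a concrete reference for the SAXPY/DAXPY item above, here is a minimal scalar DAXPY loop in C (illustrative only, not taken from the slides). On a vector machine such as VMIPS, this whole loop turns into a handful of vector instructions (vector loads of X and Y, a multiply by the scalar a, a vector add, a vector store), which is why the loop-overhead instructions of the scalar MIPS version disappear.

```c
#include <stddef.h>

/* DAXPY: Y = a*X + Y (double-precision a*X plus Y).
 * A vector architecture executes this loop as a few vector instructions
 * instead of one scalar element per iteration. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```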
- Compiler: the code is vectorized or vectorizable
    - loops that can be vectorized:
        - no loop-carried dependences
        - fewer pipeline load interlocks
    - vector archs call forwarding of element-dependent operations "chaining"
- Vector execution time:
    - 3 factors:
        - length of the operand vectors
        - structural hazards
        - data dependences
    - VMIPS FUs consume one element per clock cycle:
        - execution time is approximately the vector length
- Convoy
    - set of vector insts that could potentially execute together
- Chimes:
    - sequences with RAW dependency hazards can be in the same convoy via chaining
        - Chaining: allows a vector operation to start as soon as the individual elements of its vector source operand become available
    - Chime: unit of time to execute one convoy
        - $m$ convoys execute in $m$ chimes
        - for a vector length of $n$, this requires $m \times n$ clock cycles
- Multiple lanes:
    - parallel pipelines
    - allow for multiple HW lanes
- Vector length reg (VLR):
    - the vector length is not known at compile time
    - use "strip mining" for vectors longer than the maximum vector length (MVL); see the sketch after this list
        - generate code such that each vector operation is done for a size less than or equal to the MVL
- Vector mask regs:
    - handle "if" statements in vector loops
    - used to "disable" elements
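A minimal sketch of strip mining in C (illustrative; `MVL` and `daxpy_chunk` are made-up names, not from the slides): the outer loop breaks a vector of arbitrary length `n` into strips of at most MVL elements, which is what compiler-generated code around a vector operation looks like when it sets the VLR for each strip.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length of the hypothetical machine */

/* Stand-in for "one vector instruction sequence" on at most MVL elements;
 * on a real vector machine this chunk would be done by setting VLR = len
 * and issuing vector loads, a multiply-add, and a vector store. */
static void daxpy_chunk(size_t len, double a, const double *x, double *y)
{
    for (size_t i = 0; i < len; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip mining: split a vector of arbitrary length n into strips that are
 * each <= MVL, so every vector operation fits the hardware vector length. */
void daxpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    for (size_t start = 0; start < n; start += MVL) {
        size_t len = (n - start < MVL) ? (n - start) : MVL;  /* VLR for this strip */
        daxpy_chunk(len, a, x + start, y + start);
    }
}
```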
### MIMD

- Why MIMD:
    - offers flexibility; with the correct HW and SW support it can work as:
        - a single-user multiprocessor focusing on high performance for one application
        - a multi-programmed multiprocessor running many tasks simultaneously
        - or some combination of these functions
    - MIMD can build on the cost-performance advantages of off-the-shelf processors
        - nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers
        - multi-core chips leverage the design investment in a single processor core by replicating it
- Clusters:
    - one popular class of MIMD computers
    - often use standard components & standard network technology
    - Commodity clusters:
        - often assembled by users or computer-center directors, rather than by vendors
        - for applications:
            - focusing on throughput
            - requiring almost no communication among threads
            - e.g. web serving, multiprogramming, transaction-processing applications
        - inexpensive
    - Custom clusters:
        - exploit large amounts of parallelism on a single problem
        - significant amount of communication during the computation
- Multicore:
    - on-chip multiprocessing (single-chip multiprocessing, multi-core)
    - multiple processors on a single die
    - typically share the L2 cache, memory, or I/O bus
    - usually, each processor executes a different process
        - Process: (usually) independent of each other
        - Thread: may share code and address space
- Major MIMD styles:
    - Decentralized memory (memory module with CPU)
        - more memory bandwidth
        - lower memory latency
        - Drawbacks:
            - (major) longer communication latency
            - more complex SW model
        - Models for communication:
            - through a shared address space
                - addressed as one logically shared address space
                - Distributed shared-memory
                    - non-uniform memory access: access time depends on the location of the data
            - through message passing
                - address spaces are logically disjoint and cannot be addressed by a remote processor
    - Centralized shared memory (Symmetric (shared-memory) multiprocessors, Uniform Memory Access, Shared Memory Processor)
        - uniform latency from memory
        - large, multilevel caches reduce the memory demand
        - more than one processor shares the bus to the memory
        - different organizations:
            - processor and cache on an extension board
            - integrated on the main board (most popular)
            - integrated on the same chip
- Private data
    - used by a single processor
- Shared data
    - used by multiple processors
    - provides communication among processors through reads/writes of the shared data
    - Problem: cache coherence
- Cache Coherence:
    - Informally: "any read must return the most recent write"
    - Formally:
        - P writes x and then P reads x: if there are no other writes of x by another processor between the write and the read, the read always returns the value written by P
        - If P1 writes x and P2 reads it, P1's write will be seen by P2 if the read and write are sufficiently far apart and there are no other writes to x between the two accesses
        - Writes to a single location are serialized:
            - seen in one order (the latest write will be seen)
    - Coherence & Consistency:
        - Coherence: defines "what values" can be returned by a read
        - Consistency: defines "when" a written value will be returned by a read
            - exactly when a written value must be seen by a reader is defined by a memory consistency model
    - Solutions:
        - Snooping:
            - no centralized state is kept
            - every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block
            - caches are on a shared-memory bus
            - all cache controllers monitor (snoop on) the bus to determine whether they have a copy of a block that is requested on the bus
            - dominates for small-scale machines (most of the market)
        - Directory-Based Schemes:
            - the sharing status of a block is kept in one location called the directory
            - keeps track of what is being shared in one place (logically centralized, but can be physically distributed)
            - slightly higher implementation overhead than snooping
            - can scale to larger processor counts
    - Basic Snooping Protocols:
        - Write Invalidate Protocol
            - multiple readers, single writer
            - write to shared data:
                - an invalidate is sent to all caches, which snoop and invalidate any copies
            - read miss:
                - write-through: memory is always up-to-date
                - write-back: snoop in the caches to find the most recent copy
        - Write Broadcast Protocol (Write Update Protocol)
            - write to shared data:
                - broadcast on the bus; processors snoop and update any copies
            - read miss:
                - memory is always up-to-date
                - typically write-through
    - E.g. invalidation protocol with a write-back cache (a C state-machine sketch follows this example)
      ![](https://i.imgur.com/2jg7xGS.png =348x)![](https://i.imgur.com/9LidDpR.png =348x)
        - Each block of memory is in one state:
            - clean in all caches and up-to-date in memory (Shared)
            - dirty in exactly one cache (Exclusive)
            - not in any cache
        - Each cache block is in one state (track these):
            - Shared: the block can be read
            - Exclusive: the cache has the only copy; it is writeable and dirty
            - Invalid: the block contains no data
        - Boldface: bus actions generated as part of the state transition
        - Example: ![](https://i.imgur.com/EbDTM1p.png)
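A toy C sketch of the three-state write-back invalidation protocol described above (a simplification of the figure with assumed names, not the slides' exact transition diagram): one function handles the processor side, the other handles what the controller does when it snoops another processor's bus request.

```c
#include <stdio.h>

/* Per-block state in a write-back invalidation (MSI-style) snooping protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef enum { CPU_READ, CPU_WRITE } CpuOp;
typedef enum { BUS_READ_MISS, BUS_WRITE_MISS_OR_INVALIDATE } BusOp;

/* Processor-side transitions: misses and writes to shared blocks go on the bus. */
BlockState on_cpu_access(BlockState s, CpuOp op)
{
    if (op == CPU_READ) {
        if (s == INVALID)
            puts("place read miss on bus");
        return (s == INVALID) ? SHARED : s;    /* read hits keep the current state */
    }
    /* CPU_WRITE */
    if (s == INVALID)
        puts("place write miss on bus");
    else if (s == SHARED)
        puts("place invalidate on bus");
    return EXCLUSIVE;                          /* the writer ends up with the only (dirty) copy */
}

/* Snoop-side transitions: react to another processor's request for this block. */
BlockState on_bus_snoop(BlockState s, BusOp op)
{
    if (s == EXCLUSIVE)
        puts("write back dirty block");        /* supply the most recent copy */
    if (op == BUS_READ_MISS)
        return (s == INVALID) ? INVALID : SHARED;  /* another reader: downgrade to Shared */
    return INVALID;                            /* another writer: invalidate our copy */
}
```

This mirrors the informal rules above: a write invalidates all other copies, and a read miss forces the owner of a dirty block to write it back before the block becomes Shared.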
- Scalable approach: Directories
    - non-uniform memory access
    - every memory block has associated directory information
        - keeps track of copies of cached blocks and their states
        - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
        - in scalable networks, communication with the directory and the copies is through network transactions
    - many alternatives for organizing the directory information
- Directory-based coherence
    - Local node: the requesting node
    - Home node: where the memory location and the directory entry of an address reside
    - Remote node: has a copy of the cache block, whether exclusive or shared
    - E.g. directory-based cache coherence protocol
        - 3 states:
            - Shared: ≥ 1 processors have the data; memory is up-to-date
            - Exclusive: 1 processor (the owner) has the data; memory is out-of-date
            - Uncached: no processor has a copy of the cache block; not valid in any cache
        - in addition to the cache state, must track which processors have the data when in the Shared state (usually a bit vector, 1 if the processor has a copy)
- A popular middle ground
    - two-level "hierarchy"
    - individual nodes are multiprocessors, connected non-hierarchically
    - coherence across nodes is directory-based
    - coherence within a node is snooping or directory-based

## 9. Memory Hierarchy

### Memory Hierarchy

- Why a memory hierarchy:
    - the processor-memory performance gap
- Levels of the memory hierarchy:
  ![](https://i.imgur.com/0PKDyJA.png =600x)
  ![](https://i.imgur.com/HBByMBZ.png =450x)
- Goal: the illusion of large, fast, cheap memory
    - let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access
- Achieving higher memory bandwidth: ![](https://i.imgur.com/8Et0DRL.png)

### Caches

- The principle of locality:
    - programs access a relatively small portion of the address space at any instant of time
    - 2 different types of locality:
        - Temporal locality (locality in time):
            - if an item is referenced, it will tend to be referenced again soon
        - Spatial locality (locality in space):
            - if an item is referenced, items whose addresses are close by tend to be referenced soon
    - for the last 15 years, HW has relied on locality for speed
- Caches:
    - small, fast storage used to improve the average access time to slow memory
    - exploit spatial and temporal locality
- Hit & Miss:
    - Hit: the data appears in some block in this level
        - Hit rate: the fraction of memory accesses found in this level
        - Hit time: time to access the upper level, which consists of
            - RAM access time + time to determine hit/miss
    - Miss: the data needs to be retrieved from a block in the lower level
        - Miss rate: 1 - (hit rate)
        - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
    - Hit time << Miss penalty
- Direct-mapped cache ==TODO==
  ![](https://i.imgur.com/bXr1qxV.png)

### 4 Questions for Memory Hierarchy

- Where can a block be placed in the upper level? (**Block placement**)
    - Fully associative
    - Set associative
        - $$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$
        - $N$-way set associative: $N$ entries for each cache index
            - $N$ direct-mapped caches operate in parallel
    - Direct mapped (1-way set associative)
        - For a $2^N$ byte cache:
            - the uppermost $(32 - N)$ bits are always the Cache Tag
            - the lowermost $M$ bits are the Byte Select (Block size $= 2^M$)
- How is a block found if it is in the upper level? (**Block identification**)
    - the Index identifies the set (see the address-breakdown sketch after this section)
      ![](https://i.imgur.com/IclXcCP.png =400x)
    - increasing associativity shrinks the index and expands the tag
    - $$\text{Cache size} = \text{Associativity} \times 2^{\text{Index size}} \times 2^{\text{Offset size}}$$
- Which block should be replaced on a miss? (**Block replacement**)
    - easy for direct mapped
    - set associative & fully associative:
        - Random
            - easy to implement; differs very little from LRU
        - LRU (Least Recently Used)
            - appealing, but hard to implement for high associativity
- What happens on a write? (**Write strategy**)
  ![](https://i.imgur.com/v29JifN.png =500x)
    - Write-through:
        - the information is written to both the block in the cache and the block in the lower-level memory
        - read misses cannot result in writes
        - Write buffer: ![](https://i.imgur.com/Bw2M7At.png =300x)
            - so the CPU does not have to wait for the lower-level memory
            - Processor: writes data into the cache and the write buffer
            - Memory controller: writes the contents of the buffer to memory
            - the write buffer is just a FIFO:
                - typical number of entries: $4$
                - works fine if: store frequency << (1 / DRAM write cycle)
    - Write-back:
        - the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
        - need to know whether the block is clean or dirty (dirty bit)
        - no repeated writes to the same location
    - Write miss:
        - Write allocate (fetch on write)
            - the block is loaded on a write miss
            - usually, write-back caches use write allocate
        - No write allocate (write around)
            - the block is modified in the lower level and not loaded into the cache
            - usually, write-through caches use no-write allocate
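A small C sketch of the tag/index/offset breakdown described above, for an assumed direct-mapped cache geometry (16 KiB cache, 32-byte blocks; the sizes are illustrative, not from the slides).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical direct-mapped cache: 16 KiB, 32-byte blocks
 * => 512 sets, 5 offset bits, 9 index bits. */
#define BLOCK_SIZE   32u
#define CACHE_SIZE   (16u * 1024u)
#define NUM_SETS     (CACHE_SIZE / BLOCK_SIZE)
#define OFFSET_BITS  5u                       /* log2(BLOCK_SIZE) */
#define INDEX_BITS   9u                       /* log2(NUM_SETS)   */

int main(void)
{
    uint32_t addr = 0x12345678u;

    uint32_t offset = addr & (BLOCK_SIZE - 1);                 /* byte select within the block */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* selects the set / cache line */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared with the stored tag */

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```

With associativity only the split changes: the index selects a set instead of a single block, so fewer index bits and more tag bits are needed, matching the cache-size formula above.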
### Cache Performance

![](https://i.imgur.com/AwBMpar.png =500x)

- E.g. Harvard architecture
    - unified vs. separate instruction & data caches (Harvard)
    - structural hazard stalls
- Impact of the cache (a worked AMAT example follows below):
    - the lower $CPI_{\text{execution}}$, the higher the relative impact of a fixed number of cache-miss clock cycles
    - for CPUs with higher clock rates, each miss costs a larger number of clock cycles, so the memory portion of CPI is higher
- The cache design space:
    - several interacting dimensions:
        - cache size
        - block size
        - associativity
        - replacement policy
        - write-through vs. write-back
    - the optimal choice is a compromise
        - depends on access characteristics
            - workload
            - use (I-cache, D-cache, TLB)
        - depends on technology / cost
    - simplicity often wins
- Improving cache performance: ![](https://i.imgur.com/xoy794n.png =500x)
    - reduce the miss rate
    - reduce the miss penalty
    - reduce the time to hit in the cache
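A worked example of the average-memory-access-time and CPI formulas behind this section (the numbers are made up for illustration):

$$\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$$

- with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty: $\text{AMAT} = 1 + 0.05 \times 100 = 6$ cycles
- folding memory stalls into CPI: $\text{CPI} = CPI_{\text{execution}} + \text{Memory accesses per instruction} \times \text{Miss rate} \times \text{Miss penalty}$
    - e.g. with $CPI_{\text{execution}} = 1.0$, 1.3 memory accesses per instruction, a 2% miss rate, and a 100-cycle penalty: $\text{CPI} = 1.0 + 1.3 \times 0.02 \times 100 = 3.6$, so the memory stalls dominate precisely because $CPI_{\text{execution}}$ is low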
### Reduce the Miss Rate

- Classifying misses:
    - Compulsory
        - the first access to a block cannot be in the cache, so the block must be brought into the cache
        - also called cold-start misses or first-reference misses
        - these are misses even in an infinite cache
    - Capacity
        - if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved
        - misses in a fully associative cache of size X
    - Conflict
        - if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set; also called collision misses or interference misses
        - misses in an N-way associative cache of size X
    - Coherence (not in the normal 3Cs)
        - misses caused by cache coherence
- Classifying cache misses:
    - infinite cache, fully associative
        - compulsory misses
    - finite cache, fully associative
        - compulsory misses + capacity misses
    - finite cache, limited associativity
        - compulsory misses + capacity misses + conflict misses
- Solutions:
    - Change block size: compulsory misses
        - larger block size:
            - compulsory misses DOWN (spatial locality)
            - miss penalty UP (more data transferred per miss)
            - (if the cache is small) conflict misses UP
            - (if the cache is small) capacity misses UP
    - Change associativity: conflict misses
        - 2:1 cache rule:
            - miss rate of a 1-way (direct-mapped) cache of size X $\approx$ miss rate of a 2-way set-associative cache of size X/2
        - associativity UP, clock cycle time (CCT) UP, hit time UP
        - associativity UP does not always improve AMAT, because the hit time goes UP
    - Victim cache
        - combines the fast hit time of direct mapped while still avoiding conflict misses
        - add a small buffer that holds blocks discarded from the cache
    - Pseudo-associativity
        - combines the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache
        - divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
        - alternative hit rate: the fraction of accesses found in the other half of the cache
        - drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
            - better for caches not tied directly to the processor (e.g. L2)
    - HW prefetching of instructions & data, SW prefetching of data
        - stream buffers:
            - instruction stream buffer
            - data stream buffer
        - prefetching relies on having extra memory bandwidth that can be used without penalty
    - Compiler optimizations: 3Cs
        - Merging arrays
            - improve spatial locality
            - separate arrays may interfere with each other in the cache; after merging, each cache block contains the elements that are used together
        - Loop interchange (see the sketch after this section)
            - improve spatial locality
            - make accesses sequential in the order the data is stored (row major vs. column major)
        - Loop fusion
            - improve temporal locality by merging loops that touch the same data
        - Blocking
            - improve temporal locality by working on one sub-block at a time instead of streaming through entire rows/columns
            - $B$: blocking factor
    - $\text{AMAT} = \text{Hit time} + \text{Effective miss rate} \times \text{Miss penalty}$
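A minimal C illustration of the loop-interchange idea above (illustrative code, not from the slides): C arrays are row-major, so putting the column index in the inner loop turns strided accesses into sequential ones and lets each fetched cache block be fully used. Blocking applies the same idea by restricting the loops to $B \times B$ sub-blocks.

```c
#include <stddef.h>

#define N 1024

/* Before: column-order traversal. Consecutive accesses are N doubles apart,
 * so almost every access touches a different cache block. */
void scale_column_order(double x[N][N], double a)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            x[i][j] *= a;
}

/* After loop interchange: row-order traversal. Accesses are stride-1, so
 * every word of each fetched cache block is used (spatial locality). */
void scale_row_order(double x[N][N], double a)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] *= a;
}
```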
### Reduce the Miss Penalty

- Read priority over write on a miss
    - write-through with a write buffer:
        - check the write-buffer contents before a read; if there are no conflicts, let the memory access continue
    - write-back:
        - a read miss replaces a dirty block
        - normal: write the dirty block to memory, then do the read
        - instead: copy the dirty block to a write buffer, then do the read, and then do the write
        - the CPU stalls less since it restarts as soon as the read is done
- Sub-block placement
    - don't have to load the full block on a miss
    - have valid bits per sub-block to indicate validity
- Early restart and critical word first
    - do not wait for the full block to be loaded before restarting the CPU
    - Early restart:
        - as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
    - Critical word first:
        - request the missed word first from memory and send it to the CPU as soon as it arrives
        - let the CPU continue execution while filling the rest of the words in the block
        - also called wrapped fetch and requested word first
    - generally useful only for large blocks
- Non-blocking caches to reduce stalls on misses
    - non-blocking cache (lockup-free cache):
        - allows the data cache to continue to supply cache hits during a miss
        - used with out-of-order execution
        - requires multi-bank memories
    - hit under miss:
        - reduces the effective miss penalty by doing useful work during a miss
    - hit under multiple misses (miss under miss):
        - may further lower the effective miss penalty by overlapping multiple misses
        - complexity UP (multiple outstanding memory accesses)
        - requires multiple memory banks
    - better for FP programs than for integer programs
    - one potential advantage of hit-under-miss is that it does not increase the hit time
- Add a second-level cache
    - Local miss rate:
        - misses in this cache $/$ the total number of memory accesses to this cache
    - Global miss rate (what matters):
        - misses in this cache $/$ the total number of memory accesses generated by the CPU
    - for the lower-level (second-level) cache:
        - higher associativity or pseudo-associativity is worth considering:
            - it has a small impact on the second-level hit time
            - much of the AMAT is due to misses in the second-level cache
        - larger block size
        - multilevel inclusion property

### Reduce Hit Time

- Hit time
    - hit time is critical because it affects the clock cycle time
    - a fast hit time is multiplied in importance beyond the average memory-access-time formula because it helps everything
- Fast hit time via small and simple caches
    - a small cache helps hit time since a smaller memory takes less time to index
    - simple: direct mapping
        - can overlap the tag check with the data transmission
    - in a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
        - possible to assume a hit and continue; recover later if it was a miss

### Virtual Memory

- 4 Questions for virtual memory: the choices are less flexible because the miss penalty is much higher
    - Block placement: fully associative
        - due to the exorbitant miss penalty
    - Block identification: page table
        - Translation Look-aside Buffer (TLB): reduces address-translation time
    - Block replacement: LRU
        - provide a use bit or reference bit
    - Write strategy: write-back
- ![](https://i.imgur.com/D7CJSI6.png =500x)
- Page size is usually large:
    - smaller page table
        - saves memory
    - fewer TLB misses
    - transferring larger pages to/from secondary storage is more efficient
- The limits of physical addressing
    - Physical addressing
        - all programs share one address space: the physical address space
        - machine-language programs must be aware of the machine organization
        - no way to prevent a program from accessing any machine resource
    - Add a layer of indirection
        - user programs run in a standardized virtual address space
        - address-translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
        - hardware supports "modern" OS features: protection, translation, sharing
- 3 advantages of virtual memory:
    - Translation:
        - a program can be given a consistent view of memory, even though physical memory is scrambled
        - makes multithreading reasonable (now used a lot!)
        - only the most important part of the program (the "working set") must be in physical memory
        - contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later
    - Protection (Base <= Address <= Bound):
        - different threads (or processes) are protected from each other
        - different pages can be given special behavior
            - read-only, invisible to user programs, etc.
        - kernel data is protected from user programs
        - very important for protection from malicious programs
    - Sharing:
        - can map the same physical page to multiple users (shared memory)
- Page tables encode virtual address spaces
    - a virtual address space is divided into blocks of memory called pages
    - the page table maps virtual page numbers to physical frames
        - Page Table Entry (PTE)
    - a page table is indexed by a virtual address
    - a valid page table entry codes the physical memory "frame" address for the page
- Treat memory as a cache for disk
- Page tables may not fit in memory!
    - each process needs its own address space!
    - solution: two-level page tables
- Page replacement policy (write-back)
    - architect's role: support setting the dirty and used bits
    - sweep the set of all pages in memory with two pointers:
        - tail pointer: clear the used bit in the page table
        - head pointer: place pages on the free list if the used bit is still clear; schedule pages with the dirty bit set to be written to disk
- MIPS address translation: Translation Look-Aside Buffer (TLB) (a translation sketch follows below)
    - a small fully-associative cache of mappings from virtual to physical addresses
    - the TLB also contains protection bits for the virtual address
    - fast common case: the virtual address is in the TLB and the process has permission to read/write it
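A hedged C sketch of the translation path discussed above: a fully associative TLB lookup backed by a two-level page-table walk. The geometry (32-bit addresses, 4 KiB pages, a 10 + 10 + 12 bit split) and all names are assumptions for illustration; note that a real MIPS TLB miss is handled by a software refill handler, not a hardware walk like this.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed geometry: 32-bit virtual addresses, 4 KiB pages,
 * two-level page table with 10-bit first- and second-level indices. */
#define PAGE_SHIFT 12u
#define PAGE_MASK  0xFFFu
#define L1_INDEX(va) (((va) >> 22) & 0x3FFu)
#define L2_INDEX(va) (((va) >> 12) & 0x3FFu)
#define TLB_ENTRIES 16

typedef struct { uint32_t frame; bool valid; bool dirty; bool used; } PTE;
typedef struct { uint32_t vpn, frame; bool valid; } TLBEntry;

/* Translate a virtual address: the fast path hits in the (fully associative)
 * TLB; the slow path walks the two-level page table and refills one entry. */
bool translate(uint32_t va, TLBEntry tlb[TLB_ENTRIES],
               PTE *l1[1024], uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_SHIFT;

    for (size_t i = 0; i < TLB_ENTRIES; i++)      /* TLB lookup (fast common case) */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].frame << PAGE_SHIFT) | (va & PAGE_MASK);
            return true;
        }

    PTE *l2 = l1[L1_INDEX(va)];                   /* two-level page-table walk */
    if (l2 == NULL || !l2[L2_INDEX(va)].valid)
        return false;                             /* page fault: the OS takes over */

    l2[L2_INDEX(va)].used = true;                 /* reference bit for the replacement policy */
    /* Trivially refill entry 0; a real TLB would choose a victim entry. */
    tlb[0] = (TLBEntry){ .vpn = vpn, .frame = l2[L2_INDEX(va)].frame, .valid = true };
    *pa = (l2[L2_INDEX(va)].frame << PAGE_SHIFT) | (va & PAGE_MASK);
    return true;
}
```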
## 10. Storage

### Magnetic Disks

![](https://i.imgur.com/40gzCSc.png)

- Seek time: time to move the arm over the proper track
- Rotational latency (rotational delay): time for the requested sector to rotate under the head
    - average rotational latency $= \dfrac{0.5 \text{ rotation}}{\text{RPM}}$ (half a revolution on average; e.g. at 10000 RPM this is 3 ms)

### RAID

- Redundant Arrays of (Inexpensive) Disks (RAID):
    - files are "striped" across multiple disks (spreading data over multiple disks)
    - large data files force accesses to go to several disks
    - redundancy yields high data availability:
        - availability: service is still provided to the user, even if some components have failed
        - contents are reconstructed from data redundantly stored in the array
        - capacity penalty to store the redundant info
        - bandwidth penalty to update the redundant info
- RAID 0
    - no redundancy
- RAID 1: disk mirroring/shadowing
    - each disk is fully duplicated onto its "mirror"
    - very high availability can be achieved
    - bandwidth sacrifice on writes:
        - logical write = two physical writes
    - reads may be optimized
    - most expensive solution: 100% capacity overhead
- RAID 2
    - inspired by applying memory-style error-correcting codes to disks
- RAID 3: parity disk
    - striped physical records
    - P contains the sum of the other disks per stripe, mod 2 ("parity"); see the XOR sketch at the end of this section
    - if a disk fails, "subtract" P from the sum of the other disks to find the missing information
    - the sum is computed across the recovery group to protect against hard disk failures and is stored in the P disk
    - logically a single high-capacity, high-transfer-rate disk: good for large transfers
    - wider arrays reduce capacity costs but decrease availability
- RAID 4: high I/O-rate parity
    - works well for small reads (each read touches only one disk)
    - small writes:
        - ~~Option 1: read the other data disks, create the new sum, and write it to the parity disk~~
        - Option 2 (RAID 4): since P holds the old sum, compare the old data to the new data and add the difference to P
    - heavy load on the P disk
- RAID 5: high I/O-rate interleaved parity
    - independent writes are possible because the parity is interleaved across the disks
    - no heavy load on one specific disk
- RAID 6: recovering from 2 failures
    - Network Appliance's row-diagonal parity (RAID-DP)
        - row parity
        - diagonal parity
        - two check blocks per stripe of data
        - if there are $p + 1$ disks in total, $p - 1$ disks hold data
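A small C sketch of the parity arithmetic behind RAID 3/4/5 (array sizes and names are illustrative): "sum mod 2" is just XOR, so a lost disk is rebuilt by XOR-ing the survivors with P, and a small write can update P from the old and new data alone.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_DATA_DISKS 4
#define STRIPE_BYTES   8   /* tiny stripe unit, just for illustration */

/* Parity disk P = XOR (sum mod 2) of all data disks, byte by byte. */
void compute_parity(const uint8_t data[NUM_DATA_DISKS][STRIPE_BYTES],
                    uint8_t parity[STRIPE_BYTES])
{
    for (size_t b = 0; b < STRIPE_BYTES; b++) {
        parity[b] = 0;
        for (size_t d = 0; d < NUM_DATA_DISKS; d++)
            parity[b] ^= data[d][b];
    }
}

/* If one data disk fails, XOR the surviving disks with P to reconstruct
 * the missing stripe unit ("subtract P from the sum of the other disks"). */
void reconstruct(const uint8_t data[NUM_DATA_DISKS][STRIPE_BYTES],
                 const uint8_t parity[STRIPE_BYTES],
                 size_t failed_disk, uint8_t out[STRIPE_BYTES])
{
    for (size_t b = 0; b < STRIPE_BYTES; b++) {
        out[b] = parity[b];
        for (size_t d = 0; d < NUM_DATA_DISKS; d++)
            if (d != failed_disk)
                out[b] ^= data[d][b];
    }
}

/* RAID 4/5 small-write shortcut: new P = old P XOR old data XOR new data. */
uint8_t update_parity_byte(uint8_t old_parity, uint8_t old_data, uint8_t new_data)
{
    return old_parity ^ old_data ^ new_data;
}
```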