###### tags: `計結`
# 1062 - Computer Architecture - 8~10
## (8) Multiprocessors
### Why Multiprocessors?
* A growth in data-intensive applications
* A growing interest in server markets
* Complexity of current microprocessors
* Collecting several existing processors is much easier than redesigning one
* Steady improvement in parallel software (scientific apps, databases, OS)
* Long term goal of the field:
* Scale the number of processors to match the available budget and desired performance
### Categories
```=
S = Single
M = Multiple
I = Instruction stream
D = Data stream
```
* **SISD**
* Uni-processor
* **SIMD**
* **Data level parallelism**
* Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions
* Multimedia, graphics performance
* **MISD**
* No commercial product of this type
* **MIMD**
* Each processor has its own instructions and operates on its own data
* **==Thread== level parallelism**
### Data Level Parallelism: **SIMD**
* Vector architectures
* SIMD extensions
* Graphics Processor Units (GPUs)
* **Advantage**
* SIMD architectures can exploit **significant data level parallelism** for:
* **Matrix-oriented** scientific computing
* **Media-oriented** image and sound processors
* SIMD is **more energy efficient** than MIMD
* Only needs to fetch one instruction per data operation
* Makes SIMD attractive for personal mobile devices
* SIMD allows programmer to continue to think sequentially
### Vector Architectures
* Basic idea
* Read sets of data elements into "**vector registers**"
* Operate on those registers
* Disperse the results back into memory
* Registers are controlled by compiler
* Used to hide memory latency
* Leverage memory bandwidth
* The basic structure for a vector architecture
* 
### Example architecture: **VMIPS**
* **Vector registers**
* Each register holds a 64-element, 64 bits/element vector
* Register file has 16 read ports and 8 write ports
* **Vector functional units**
* Fully pipelined
* Data and control hazards are detected
* **Vector load-store unit**
* Fully pipelined
* One word per clock cycle after initial latency
* **Scalar registers**
* 32 general-purpose registers
* 32 floating-point registers
### VMIPS Instructions
* `ADDVV.D Vd, Vs, Vt`
* add two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
* add scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
* vector load / store starting at the address in `Rs` / `Rd`
* [For more instruction](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)
### **SAXPY** (Scalar Alpha X Plus Y)
* $Y = \alpha X + Y$
* $X, Y$ are vectors, initially resident in memory
* $\alpha$ is a scalar
* For different data type, there are:
* **`SAXPY`**: Single precision
* **`DAXPY`**: Double precision
* `CAXPY`: Complex number with Single precision
* `ZAXPY`: Complex number with Double precision
* This loop is the so-called SAXPY or DAXPY loop; it forms the inner loop of the Linpack benchmark
```c=
for (int i = m; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```
* SAXPY MIPS code
```=
L.D F0, a ; load scalar a
DADDIU R4, Rx, #512 ; last address to load
Loop:
L.D F2, 0(Rx) ; load X[i]
MUL.D F2, F2, F0 ; a*X[i]
L.D F4, 0(Ry) ; load Y[i]
ADD.D F4, F4, F2 ; a * X + Y
S.D F4, 0(Ry) ; store into Y[i]
DADDIU Rx, Rx, #8 ; increment index to X
DADDIU Ry, Ry, #8 ; increment index to Y
DSUBU R20, R4, Rx ; compute bound
BNEZ R20, Loop ; check if done
```
* SAXPY VMIPS code
```=
L.D F0, a ; load scalar a
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV.D V4, V2, V3 ; add
SV Ry, V4 ; store the result
```
* Requires 6 instructions vs. almost 600 for MIPS
### Comparison : MIPS, VMIPS
* The overhead instructions are not present in the VMIPS
* The compiler produces vector instructions for such a sequence: The code is **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
* Vector architects call **forwarding of element dependent operations** "==chaining=="
### Vector Execution Time
* Execution time depends on three factors:
* Length of operand vectors
* Structural hazards
* Data dependencies
* VMIPS functional units consume one element per clock cycle
* Execution time is approximately the vector length
* **Convoy**
* Set of vector instructions that could potentially execute together
### Chimes
* Sequences with **RAW** dependency hazards can be in the same convoy via chaining
* **Chaining**
* Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
* Unit of time to execute one convoy
* $m$ convoys executes in $m$ chimes
* For vector length of $n$ , requires $m$ * $n$ clock cycles
### Example : Total clock cycle with chimes
```=
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV Ry, V4 ; store the sum
```
* Convoys: 3 convoys
```=
LV MULVS.D
LV ADDVV.D
SV
```
* Total clock cycles = 3 chimes × 64 elements per vector = 192 clock cycles
### Multiple Lanes
* Using parallel pipelines
* Allows for multiple hardware lanes
* Example
* 
* Structure of a vector unit containing 4 lanes
* 
### Vector Length Register
* Vector length not known at compile time?
* Use **Vector Length Register** (**VLR**)
* Use ==strip mining== for vectors over the **maximum vector length** (**MVL**)
* 
* For vector length $n$, the first (odd-sized) piece handles $m = n \bmod \text{MVL}$ elements, and every later piece handles exactly MVL elements
* So the length of each vector operation is $\le$ MVL, as in the picture and the C sketch below
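* A C sketch of the strip-mined DAXPY loop under these definitions (the `m = n % MVL` odd-sized piece runs first, then full MVL-length pieces); variable names follow the H&P presentation:
```c=
/* Strip-mined DAXPY: the short piece of length n % MVL is executed first,
   then full pieces of exactly MVL elements each. */
#define MVL 64                         /* maximum vector length of VMIPS */

void daxpy_strip_mined(int n, double a, double *X, double *Y) {
    int low = 0;
    int VL = n % MVL;                  /* length of the odd-sized piece  */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++)   /* runs for length VL     */
            Y[i] = a * X[i] + Y[i];            /* main DAXPY operation   */
        low = low + VL;                /* start of the next piece        */
        VL = MVL;                      /* later pieces use the full MVL  */
    }
}
```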
### Vector Mask Registers
* Consider:
```cpp=
for (i = 0; i < 64; i++) {
if (X[i] != 0) {
X[i] = X[i] – Y[i];
}
}
```
* Use vector mask register to "**disable**" elements:
```=
LV V1, Rx ; load vector X into V1
LV V2, Ry ; load vector Y into V2
L.D F0, #0 ; load FP zero into F0
SNEVS.D V1, F0 ; sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1, V1, V2 ; subtract under vector mask
SV Rx, V1 ; store the result in X
```
### **MIMD**
* MIMD offers **flexibility**
* With correct hardware and software support, MIMDs can function as
* **Single-user** multiprocessors focusing on high performance for **one application**
* **Multi-programmed** multiprocessors running many tasks simultaneously
* Or some combination of these functions
* MIMD can build on the cost-performance advantages of **off-the-shelf** processors
* Nearly all multiprocessors built today **use the same microprocessors found in workstations and single-processor servers**.
* **Multi-core** chips leverage the design investment in a single processor core by **replicating** it
### MIMD - Clusters
* One of the most popular classes of MIMD computers; clusters often use **standard components** and **standard network technology**
* Clusters
* **Commodity clusters**
* Often assembled by **users or computer center directors**, rather than vendors
* For applications focusing on throughput and requiring almost no communication among threads
* **Web serving**, **multiprogramming**, **transaction-processing applications**
* inexpensive
* **Custom clusters**
* A designer customizes the detailed node design or the interconnection network
* For parallel applications that can **exploit large amounts of parallelism on a single problem**
* Such applications require **significant amount of communication during computation**
### MIMD - Multicore
* On-chip multi-processing, single-chip multiprocessing, **multi-core**
* Place multiple processors on a **single die**, typically **share L2 cache**, **memory** or **I/O bus**
* Usually, each processor executes a different process
* **Process**: Independent of each other (usually)
* **Thread**: May share code and address space
### Major MIMD Styles
* **Centralized shared memory**
* Symmetric (shared-memory) multiprocessors
* Uniform Memory Access time
* Shared Memory Processor
* 
* **Decentralized memory: a memory module with each CPU**
* Get more memory bandwidth
* Lower memory latency
* Major ==Drawback==: **Longer communication latency**
* ==Drawback==: Software model more complex
* 
### Models for communication
* **Shared address space**
* The physically separate memories can be **addressed as one logically shared address space**
* Called ==**Distributed shared-memory**== (non-uniform memory access, NUMA)
* Access time depends on the location of the data
* **Message passing**
* Address spaces are logically disjoint and cannot be addressed by a remote processor
### Symmetric Multi-Processors (SMP)
* The **large**, **multilevel caches reduce the memory demand** of a processor
* More than one processor to **share the bus** to the memory
* Different organizations:
* Processor and cache on an extension board
* **Integrated on the main-board** (most popular)
* Integrated on the same chip
* Private data
* Used by single processor
* Shared data
* Used by multiple processors
* Provide communication among processors through read/write the shared data
* The new problem: ==**cache coherence**==
* For a single memory location `X`, R/W by 2 processors `A` and `B`
* 
### Coherency
* Informally:
* **Any read must return the most recent write**
* Formally: A memory system is coherent if
* `P` writes `x` and then `P` reads `x`. If there are **no other writes of `x` by another processor** between the write and the read, the read always returns the value written by `P`.
* If `P1` writes `x` and `P2` reads it, `P1`’s write will be seen by `P2` if the read and write are **sufficiently far apart** and **no other writes to `x`** between the two accesses.
* Writes to a single location are serialized:
* **seen in one order (Latest write will be seen)**
### **Coherence** and **Consistency**
* **Coherence**:
* defines **what values** can be returned by a read
* **Consistency**:
* defines **when** a written value will be returned by a read
* We cannot require that a read of `X` **instantaneously** see the value written for `X` by some other processor
* The issue of **exactly when a written value must be seen by a reader** is defined by a memory consistency model
### Coherence Solutions
* Snooping:
* **No centralized state** is kept.
* **Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block**
* Caches are on a shared-memory bus
* All cache controllers **monitor or snoop on the bus** to determine if they have a copy of a block that is requested on the bus
* Dominates for **small scale** machines (most of the market)
* Directory-Based Schemes
* The sharing status of a block is kept in **one location** called ==**directory**==
* Keep track of what is being shared in one place (**centralized** logically, but can be distributed physically)
* Has slightly higher implementation **overhead** than snooping
* Can **scale to larger processor counts**
### Basic Snooping Protocols
* Write **Invalidate** Protocol:
* Multiple readers, single writer
* Write to shared data: an invalidate is sent to all caches which snoop and **invalidate** any copies
* Read Miss:
* Write-through: memory is always up-to-date
* Write-back: snoop in caches to find the most recent copy
* Example:
* 
* Write **Broadcast** Protocol
* Also called **Write Update Protocol**
* Write to shared data: broadcast on bus, processors snoop, and **update** any copies
* Read miss: memory is always up-to-date
* typically write through
### States of Snooping Protocol
* For **invalidation** protocol, **write-back** cache
* Each **block of memory** is in one state:
* Shared : Clean in all caches and up-to-date in memory
* Exclusive : Dirty in exactly one cache
* Invalid : Not in any caches
* Each **cache block** is in one state (track these):
* Shared : block can be read
* Exclusive : cache has the only copy, it’s writeable, and dirty
* Invalid : block contains no data
* State Diagram
* Cache state transitions based on **requests from CPU**
* 
* Cache state transitions based on **requests from bus**
* 
* Cache Coherence state diagram with the state transitions induced by
* black : the local processor
* gray : the bus activities
* 
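* A minimal C sketch of the three cache-block states and the local-CPU write transitions for an invalidate-based write-back protocol (function and message names are illustrative, not from the slides):
```c=
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;   /* per cache block */

/* Stand-in for placing an invalidate or write-miss message on the shared bus. */
static void bus_broadcast(const char *msg) { printf("bus: %s\n", msg); }

/* State transition when the local CPU writes the block (invalidate protocol). */
BlockState cpu_write(BlockState s) {
    switch (s) {
    case INVALID:     /* write miss: fetch the block and gain ownership      */
        bus_broadcast("write miss");
        return EXCLUSIVE;
    case SHARED:      /* clean copy: invalidate all other copies, then write */
        bus_broadcast("invalidate");
        return EXCLUSIVE;
    case EXCLUSIVE:   /* already the only (dirty) copy: write hit, no bus op */
    default:
        return EXCLUSIVE;
    }
}
```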
### Directory : A Scalable Approach
* Every memory block has associated directory information
* keeps track of copies of cached blocks and their states
* **on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary**
* in scalable networks, communication with directory and copies is through network transactions
* Many alternatives for organizing directory information
* 
### Directory-based coherence
* Local node:
* requesting node
* Home node:
* where the memory location and the directory entry of an address reside
* Remote node:
* a node that has a copy of the cache block, whether exclusive or shared
### States of Directory based Cache Coherence Protocol
* Similar to Snoopy Protocol: 3 states
* Shared : ≥ 1 processors have data, memory up-to-date
* Exclusive : 1 processor (**owner**) has data; memory out-of-date
* Uncached : no processor has a copy of the cache block; not valid in any cache
* In addition to cache state, must track **which processors** have data when in the **shared state** (usually bit vector, 1 if processor has copy)
* State Diagram
* An individual cache block in a directory-based system
* 
* The directory has the same states and structure as the transition diagram for an individual cache block
* 
### Two-level "Hierarchy"
* A popular middle ground
* Individual nodes are multiprocessors, connected nonhierarchically
* Coherence across nodes is directory-based
* Coherence within nodes is snooping or directory
### Challenges of Parallel Processing
* Limited parallelism available in programs
* Running independent tasks
* High cost of communications
* Threads must communicate to complete the task
* Large latency of remote access
* In existing shared-memory multiprocessors, **communication cost**:
* from **50 clock cycles** (multicore) to **1000 clock cycles** (large-scale multiprocessors)
### Examples
* ==**Example 1**==
* 
* ==**Example 2**==
* 
* 
* ==**Example 3**==
* 
* 
* 
* ==**Example 4**==
* 
* 
* 
### Conclusion
* Sharing cached data
* **Coherence** (values returned by a read)
* **Consistency** (when a written value will be returned by a read)
* Snooping cache over shared medium for smaller number of Multi-Processor
* Snooping
* $\rightarrow$ uniform memory access; **bus makes snooping easier because of broadcast**
* **Directory has extra data structure** to keep track of state of all cache blocks
* **Distributing directory**
* $\rightarrow$ scalable shared address multiprocessor
* $\rightarrow$ Non uniform memory access
## (9-1) Memory Hierarchy
Reference: [淺談memory cache](http://opass.logdown.com/tags/%E8%A8%88%E7%AE%97%E6%A9%9F%E7%B5%84%E7%B9%94)
### Why Memory Hierarchy?

### Levels of the Memory Hierarchy

### Achieving higher memory bandwidth

* Suppose
* 1 clock cycle to send the address
* 15 clock cycles for each DRAM access initiated
* 1 clock cycle to send a word of data
* (a) **One-word-wide memory organization**
* If we have a **cache block of 4 words** and a **one-word-wide bank** of DRAMs, the miss penalty would be
* ==1 + 4 * 15 + 4 * 1 = 65 cycles==
* Number of bytes transferred per clock cycle for a single miss is
* ==4(words) * 4(bytes/word) / 65 = 0.246==
* The memory is **one-word-wide** and all the accesses are made **sequentially**.
* (b) **Wide memory organization**
* Increase the bandwidth to memory by **widening the memory and the buses** between the processor and memory
* Allow **parallel access** to all the words of the block
* A 4-word-wide memory, the miss penalty is
* ==1 + (4/4) * 15 + (4/4) * 1 = 17 cycles==
* A 2-word-wide memory, the miss penalty is
* ==1 + (4/2) * 15 + (4/2) * 1 = 33 cycles==
* (c\) **Interleaved memory organization**
* Increase the bandwidth to memory by **widening the memory but not the interconnection bus**
* **Still need to pay a cost to transmit each word**, but we can **avoid paying the cost of access latency more than once**
* **Interleaving**: memory chips organized in banks read/write multiple words in one access time instead of reading/writing single words each time
* ==1 + 1 * 15 + 4 * 1 = 20 cycles==
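* Resulting bandwidth for the same 4-word (16-byte) block, using the miss penalties above:
    * (a) one-word-wide: $16 / 65 \approx 0.25$ bytes per clock cycle
    * (b) 4-word-wide: $16 / 17 \approx 0.94$ bytes per clock cycle
    * (c) interleaved: $16 / 20 = 0.80$ bytes per clock cycle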
### Cache : **The Principle of Locality**
* The Principle of Locality:
* Program access a **relatively small portion of the address space** at any instant of time.
* Two Different Types of Locality:
* **Temporal** Locality (Locality in *Time*): If an item is referenced, it will tend to be referenced again soon
* loops
* reuse
* **Spatial** Locality (Locality in *Space*): If an item is referenced, items whose addresses are close by tend to be referenced soon
* straight-line code
* array access
* For the last 15 years, hardware has relied on locality for speed
### Cache
* Small, fast storage used to improve average access time to slow memory.
* Exploits spatial and temporal locality
* In computer architecture, almost everything is a cache!
* Registers: a cache on variables
* First-level cache: a cache on second-level cache
* Second-level cache: a cache on memory
* Memory: a cache on disk (virtual memory)
* TLB: a cache on page table
* Branch-prediction: a cache on prediction information
* BTB: a cache of branch target
* 
### Review: Terminology
* **Hit**: data appears in some block in this level
* **Hit Rate**: the fraction of memory access found in this level
* **Hit Time**: Time to access the upper level which consists of RAM access time + Time to determine hit/miss
* **Miss**: data needs to be retrieve from a block in the lower level (Block `Y`)
* **Miss Rate** = 1 - Hit Rate
* **Miss Penalty**: Time to replace a block in the upper level + Time to deliver the block to the processor
* ==Hit Time << Miss Penalty==
* 
### Review: The Basics of Caches
* A direct-mapped cache with 8 entries
* 
* A 4KB cache using 1 word block
* 
* 1024 blocks × 4 bytes of data per block (32 bits / 8) = 4 KB
### 4 Questions for Memory Hierarchy
* Q1: Where can a block be placed in the upper level?
* Block placement
* Q2: How is a block found if it is in the upper level?
* Block identification
* Q3: Which block should be replaced on a miss?
* Block replacement
* Q4: What happens on a write?
* Write strategy
### Block Placement
* Where can a block be placed in the upper level?
* Fully Associative
* Set Associative (n-way associative)
* Direct Mapped (1-way associative)
* 
* Cache Size = (number of sets) × (blocks per set, i.e. associativity) × (block size)
* 
* 
* Example
* 
* 
### Block identification
* How is a block found if it is in the upper level?
* Index identifies set
* Increasing associativity shrinks index, expands tag
* 
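* A C sketch of how a 32-bit address splits into tag / index / block offset; the block size and number of sets here are illustrative assumptions:
```c=
#include <stdint.h>

#define BLOCK_SIZE 64u          /* bytes per block (assumed)               */
#define NUM_SETS   256u         /* number of sets = blocks / associativity */

uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }
uint32_t set_index(uint32_t addr)    { return (addr / BLOCK_SIZE) % NUM_SETS; }
uint32_t tag(uint32_t addr)          { return (addr / BLOCK_SIZE) / NUM_SETS; }
```
* Doubling the associativity halves `NUM_SETS`, removing one index bit and adding it to the tag (the "shrinks index, expands tag" effect above)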
### Block replacement
* Which block should be replaced on a miss?
* Easy for Direct Mapped
* Set Associative or Fully Associative:
* Random
* Easy to implement
* **LRU** (**Least Recently Used**)
* Appealing, but hard to implement for high associativity
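* A minimal sketch of LRU victim selection within one set, using per-line timestamps (one of several possible implementations; 4-way associativity is assumed):
```c=
#include <stdint.h>

#define WAYS 4                            /* associativity (assumed)              */

typedef struct {
    uint32_t tag;
    uint64_t last_used;                   /* time of the most recent access       */
    int      valid;
} Line;

/* Pick the victim way in a set: an invalid line if one exists, else the LRU. */
int choose_victim(const Line set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                     /* empty way: nothing is evicted        */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                   /* older timestamp = less recently used */
    }
    return victim;
}
```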
### Write strategy
* What happens on a write?
* **Write through**
* The information is written to **both the block in the cache and to the block in the lower-level memory**.
* **Write back**
* The information is written only to the block in the cache. The modified cache block is written to main memory only **when it is replaced**.
* is block clean or dirty?
* Pros and Cons of each
* WT: read misses cannot result in writes
* WB: no repeated writes to same location
* 
* What about write miss? $\rightarrow$ **write_allocate?**
* **Write allocate** (fetch on write)
* The block is loaded on a write miss
* let writes to an un-cached address allocate a new cache line
* Usually, **WB caches use WA** (hoping subsequent writes to that block will be captured by the cache)
* **No write allocate** (write around)
* The block is modified in the lower level and not loaded into the cache
* **WT caches often use no-WA** (since subsequent writes to that block will still have to go to memory)
* **Write Buffer** for WT cache
* WT always combined with **write buffers** so that we don’t have to wait for lower level memory
* 
* A Write Buffer is needed between the Cache and Memory
* Processor: writes data into the cache and the write buffer
* Memory controller: write contents of the buffer to memory
* Write buffer is just a **FIFO**:
* Typical number of entries: 4
* Works fine if: Store frequency << (1 / DRAM write cycle)
### Cache performance
* Miss-oriented Approach to Memory Access
* 
* Separating out Memory component entirely
* AMAT = Average Memory Access Time
* 
* 
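* The formulas behind these two views (standard definitions):
    * Miss-oriented: $\text{CPU time} = IC \times \left( CPI_{\text{execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$
    * Separating out the memory component: $\text{CPU time} = IC \times \left( \frac{\text{ALU ops}}{\text{Instruction}} \times CPI_{\text{ALU ops}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{AMAT} \right) \times \text{Clock cycle time}$
    * $\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$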
### Impact on Performance
* ==Example 1==
* 
* ==Example 2== : **Harvard** Architecture
* Separate Instruction & Data Memories
* 
* 
### Impact of Cache
* 
* 
### The Cache Design Space
* **Several interacting dimensions**
* 
* **The optimal choice is a compromise**
* 
* ==**Simplicity often wins**==
### Improving Cache Performance

* Reduce the **miss rate**
* Reduce the **miss penalty**
* Reduce the **hit time**
### Reducing Misses
* Classifying Misses: **3C**
* **Compulsory**
* The **first access** to a block is not in the cache, so the block must be brought into the cache.
* Also called **cold start misses** or **first reference misses**.
* ==Misses in **even an Infinite Cache**==
* **Capacity**
* If the cache cannot contain all the blocks needed during execution of a program, **capacity misses will occur due to blocks being discarded and later retrieved**.
* ==Misses in Fully Associative Size X Cache==
* **Conflict**
* If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set.
* Also called **collision misses** or **interference misses**.
* ==Misses in N-way Associative, Size X Cache==
* More recent, **4th C**:
* **Coherence**
* Misses caused by cache coherence.
### Classify Cache Misses, How?
* Infinite cache, fully associative
* Compulsory misses
* Finite cache, fully associative
* Compulsory misses + Capacity misses
* Finite cache, limited associativity
* Compulsory misses + Capacity misses + Conflict misses
### 2:1 Cache Rule
```
[miss rate of 1-way associative cache size X]
~= [miss rate of 2-way associative cache size X/2]
```
* 
### How to Reduce Misses?
* In all cases, assume **total cache size not changed**:
* What happens if
* Change Block Size:
* Which of 3Cs is obviously affected? **Compulsory**
* Change Associativity:
* Which of 3Cs is obviously affected? **Conflict**
* Change Algorithm / Compiler:
* Which of 3Cs is obviously affected? **3Cs**
### 1. Reduce Misses via "**Larger Block Size**"
* 
* 
* Larger Block Size :
* Miss Rate $\downarrow$
* mainly **Compulsory Miss** $\downarrow$
* but also Conflict & Capacity Miss $\downarrow$
* **Miss Penalty** $\uparrow$
* **AMAT** $\downarrow$
* **Hit Time** $\uparrow$
### 2. Reduce Misses via "**Higher Associativity**"
* 2:1 Cache Rule
* Beware: Execution time is the only final measure
* Will Clock Cycle Time (CCT) increase?
* Hill (1988) suggested hit-time increases for 2-way vs. 1-way:
* external cache + 10%
* internal + 2%
* AMAT vs. Miss Rate
* Assume CCT = n times the CCT of a direct-mapped cache
* n = 1.10 for 2-way
* n = 1.12 for 4-way
* n = 1.14 for 8-way
* 
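* A worked comparison under assumed numbers (1-cycle direct-mapped hit time, 25-cycle miss penalty):
    * $\text{AMAT}_{1\text{-way}} = 1 + \text{Miss rate}_{1\text{-way}} \times 25$
    * $\text{AMAT}_{2\text{-way}} = 1.10 + \text{Miss rate}_{2\text{-way}} \times 25$
    * The 2-way cache wins only if it lowers the miss rate by more than $0.10 / 25 = 0.4\%$ (absolute), which is why execution time, not miss rate, is the final measure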
### 3. Reducing Misses via a "**Victim Cache**"
* How to combine **fast hit time** of direct mapped yet **still avoid conflict misses**?
* **Victim Cache**: Add buffer to place data discarded from cache
* Jouppi(1990): 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
* Used in Alpha, HP machines
* 
* From [Wiki](https://zh.wikipedia.org/wiki/CPU%E7%BC%93%E5%AD%98#%E5%8F%97%E5%AE%B3%E8%80%85%E7%BC%93%E5%AD%98)
### 4. Reducing Misses via "**Pseudo-Associativity**"
* How to combine the **fast hit time of a DM cache** with the **lower conflict misses of a 2-way SA cache**?
* Divide cache: on a miss, check other half of cache to see if there, if so have a **pseudo-hit** (slow hit)
* 
* Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
* Better for caches not tied directly to processor (L2)
* Used in the MIPS R10000 L2 cache; similar in UltraSPARC
### 5.6. Reducing Misses by "**Prefetching**"
* Category
* Hardware Prefetching of Instructions & Data
* Software Prefetching Data
* Instruction Prefetching
* Alpha 21064 fetches 2 blocks on a miss
* Extra block placed in **stream buffer**
* Instruction stream buffer
* Data stream buffer
* On miss check stream buffer
* Works with data blocks too:
* Jouppi(1990) 1 data stream buffer got 25% misses from 4KB cache; 4 stream buffers got 43%
* Palacharla & Kessler(1994) for scientific programs for 8 stream buffers got 50% to 70% of misses from 2 64KB, 4-way set associative caches
* Prefetching relies on having **extra** memory bandwidth that can be used without penalty
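* A software-prefetching sketch using the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance of 16 elements is an illustrative assumption:
```c=
/* Prefetch x[i + 16] while consuming x[i]; this only pays off if there is
   spare memory bandwidth, as noted above. */
double sum_with_prefetch(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 0, 1);  /* read access, low temporal locality */
        sum += x[i];
    }
    return sum;
}
```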
### 7. Reducing Misses by Compiler Optimizations
* McFarling (1989) reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, **in software**
* Reorder procedures in memory so as to reduce misses
* Data
* **Merging Arrays**
* improve spatial locality by single array of compound elements vs. 2 arrays
```cpp=
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];
```
* Some programs reference multiple arrays in the same dimension with the same indices. (**these accesses will interfere with each other**)
* Reducing conflicts by combining independent matrices into a single compound array so that **each cache block can contain the desired elements**; improve spatial locality
* **Loop Interchange**
* change nesting of loops to access data in order stored in memory
* **row major or column major**
```cpp=
/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)    // <- change
        for (j = 0; j < 100; j++) // <- change
            x[i][j] = 2 * x[i][j];
```
* Sequential accesses instead of striding through memory every 100 words; improved spatial locality
* **Loop Fusion**
* Combine 2 independent loops that have same looping and some variables overlap
```cpp=
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
```
* 2 misses per access to `a` & `c` vs. one miss per access; improve spatial locality
* **Blocking**
* Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
```cpp=
/* Matrix multiplication */
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1) {
            r = r + y[i][k] * z[k][j];
        }
        x[i][j] = r;
    }
```
* Two Inner Loops:
* Read all N^2 elements of `z[]`
* Read N elements of 1 row of `y[]` repeatedly
* Write N elements of 1 row of `x[]`
* Idea: compute on B^2 submatrix that fits
```cpp=
/* After: iterate over B x B blocks; the bounds use jj+B and kk+B
   so every column and every k index is covered exactly once */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k++) {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = x[i][j] + r;
            }
```
* `B` is called the **Blocking Factor**
### Summary on reducing miss rate

* 3C : Compulsory, Capacity, Conflict
1. Reducing Misses via Larger Block Size
2. Reducing Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
* Remember the danger of concentrating on just one parameter when evaluating performance
### Reducing Cache Miss Penalty
1. Read Priority over Write on Miss
2. Sub-block Placement
3. Early Restart and Critical Word First
4. Non-blocking Caches to reduce stalls on misses
5. Add a second-level cache
### 1. Read Priority over Write on Miss
* **Write through with write buffer**
* If simply wait for write buffer to empty, might increase read miss penalty
* Check write buffer contents before the read; if there are no conflicts, let the memory access continue
* **Write Back**
* Read miss replaces dirty block
* Normal: Write dirty block to memory, then do the read (**slow**)
* **Instead**: Copy the dirty block to a write buffer, then do the read, and then do the write
* CPU stall less since restarts as soon as read is done
### 2. Sub-block Placement
* **Don’t** have to load **full block** on a miss
* Have valid bits per sub-block to indicate valid
* 
### 3. Early Restart and Critical Word First
* Don't **wait** for full block to be loaded before restarting CPU
* **Early restart**
* as soon as requested word of the block arrives, send it to the CPU and let the CPU continue execution
* **Critical word first**
* request the missed word first from memory and send it to the CPU as soon as it arrives
* Let the CPU continue execution while filling the rest of the words in the block
* Also called **wrapped fetch** or **requested word first**
* ==Generally useful only in large blocks==
### 4. Non-blocking Caches to reduce stalls on misses

from [PTT](https://www.ptt.cc/bbs/Grad-ProbAsk/M.1285752708.A.6D0.html)
* **Non-blocking cache** or **lockup-free cache** allow data cache to continue to supply cache hits during a miss
* requires F/E bits on registers or out-of-order execution
* requires multi-bank memories
* "**hit under miss**" reduces the effective miss penalty by working during miss
* "**hit under multiple miss**" or "**miss under miss**" may further lower the effective miss penalty by overlapping multiple misses
* Significantly increases the **complexity** of the cache controller as there can be **multiple outstanding memory accesses**
* Requires multiple memory banks (otherwise it cannot be supported)
* Pentium Pro allows 4 outstanding memory misses
* Example
* 
* 
### 5. Add a second-level cache
* L2 Equations
* 
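* The equations referenced here (the standard two-level AMAT decomposition):
    * $\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \text{Miss penalty}_{L1}$
    * $\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$
    * So $\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \left( \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2} \right)$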
* Definitions:
* **Local miss rate**
* misses in this cache divided by the total number of memory accesses to this cache
* Local miss rate for L2 = Miss rate_L2
* **Global miss rate**
* misses in this cache divided by the total number of memory accesses generated by the CPU
* Global miss rate for L2 = Global miss rate for L1 × Miss Rate_L2 = Miss Rate_L1 × Miss Rate_L2
* ==**Global Miss Rate is what matters**==
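* A quick numeric illustration (assumed numbers): out of 1000 memory references, 40 miss in L1 and 20 also miss in L2; the local L2 miss rate is $20 / 40 = 50\%$, while the global L2 miss rate is $20 / 1000 = 2\%$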
* Example
* Given the data below, what is the impact of second-level cache associativity on the miss penalty?
* 
* 
* Conclusion: **Higher associativity or pseudo-associativity are worth considering** because
* They have small impact on the second-level hit time
* So much of the AMAT is due to misses in the second level cache
* Usually for L2 cache:
* L2 >> L1 (e.g., L1 is 32 KB)
* L2 not tied to CPU clock cycle
* Higher associativity or pseudo associativity
* Larger block size
* Multilevel inclusion property
### Summary on Reducing Miss Penalty

* Techniques
1. Read priority over write on miss
2. Sub-block placement
3. Early Restart and Critical Word First on miss
4. Non-blocking Caches (Hit under Miss, Miss under Miss)
5. Second Level Cache
* Can be applied recursively to Multilevel Caches
* Danger is that time to DRAM will grow with multiple levels in between
### Cache Optimization Summary on miss rate and miss penalty
* 
### Reducing Hit time
* Hit time is critical because it affects the clock cycle time
* A fast hit time is multiplied in importance beyond the AMAT formula because it helps everything
* Fast hit time via **small and simple cache**
### Disadvantage of Set Associative Cache
* N-way Set Associative Cache v. Direct Mapped Cache:
* N comparators vs. 1
* Extra MUX delay for the data
* Data comes AFTER Hit/Miss
* In a DM cache, cache block is available **BEFORE** Hit/Miss:
* Possible to assume a hit and continue. Recover later if miss.
* 
### Fast Hit time via Small and Simple Caches
* Index tag memory and then compare takes time
* **Small** cache can help hit time since smaller memory takes less time to index
* E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
* Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip
* **Simple** $\rightarrow$ direct mapping
* Can overlap tag check with data transmission since no choice
* Access time estimate for 90 nm using CACTI model 4.0
* Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
* 
## (9-2) Virtual Memory
### For Virtual Memory

* 
* **Page size usually large**
* Smaller **page table**, save memory, reduce TLB misses
* Transferring larger pages to/from secondary storage is more efficient
* 
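* A C sketch of splitting a virtual address into a virtual page number and page offset, assuming 4 KB pages (illustrative values, not tied to a particular machine):
```c=
#include <stdint.h>

#define PAGE_SIZE 4096u                        /* assumed 4 KB pages               */

uint32_t page_offset(uint32_t va) { return va % PAGE_SIZE; }
uint32_t vpn(uint32_t va)         { return va / PAGE_SIZE; }  /* page-table index  */

/* Translation: the page table maps a virtual page number to a physical frame;
   the page offset is carried over unchanged. */
uint32_t translate(uint32_t va, const uint32_t *frame_of_vpn) {
    return frame_of_vpn[vpn(va)] * PAGE_SIZE + page_offset(va);
}
```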
### The Limits of Physical Addressing

### Solution: Add a Layer of Indirection

### Three Advantages of Virtual Memory
* **Translation**:
* Program can be given **consistent view** of memory, even though physical memory is scrambled
* Makes **multithreading** reasonable (now used a lot!)
* Only the most important part of program ("**Working Set**") must be in **physical memory**.
* Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later.
* **Protection**:
* Different threads (or processes) protected from each other.
* Different pages can be given special behavior
* **Read Only, Invisible to user programs, etc.**
* Kernel data protected from User programs
* Very important for protection from malicious programs
* **Sharing**:
* Can map same physical page to multiple users ("Shared memory")
### Page tables encode virtual address spaces

### Details of Page Table

### Page tables may not fit in memory!

### VM and Disk: Page replacement policy

### **Translation Look-Aside Buffer** (**TLB**)

### The TLB caches page table entries

## (10) Storage
### Magnetic Disks

* Average rotational latency = 0.5 / RPM (half a revolution on average)
* For 3600 RPM: 0.5 / 3600 min = 8.33 ms
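* The full access-time model (standard): Disk access time = Seek time + Rotational latency + Transfer time + Controller overhead
    * e.g., a 7200 RPM disk has an average rotational latency of $0.5 / 7200$ min $\approx 4.17$ ms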
### Redundant Arrays of (Inexpensive) Disks (**RAID**)
* Files are "striped" across multiple disks
* spreading data over multiple disks
* **Force accesses to several disks if the data files are large**
* Redundancy yields high data availability
* **Availability**: service still provided to user, even if some components failed
* Disks will still fail
* Contents reconstructed from data redundantly stored in the array
* **Capacity penalty** to store redundant info
* **Bandwidth penalty** to update redundant info
* [Wiki (中文)](https://zh.wikipedia.org/wiki/RAID)
### RAID 0
* No redundancy
* Sometimes nicknamed JBOD (Just a Bunch of Disks)
* Used as a **measuring** stick for other RAID levels in terms of cost, performance and dependability.
### RAID 1 : Disk Mirroring / Shadowing

* Each disk is fully duplicated onto its "**mirror**"
* Very high availability can be achieved
* **Bandwidth sacrifice on write**:
* **Logical write = two physical writes**
* Reads may be optimized
* **Most expensive solution: 100% capacity overhead**
### RAID 2
* Inspired by applying memory-style error correcting codes to disks
* RAID 2 not used since other more interesting schemes
### RAID 3 : Parity Disk

* Sum computed across recovery group to protect against hard disk failures, **stored in P disk**
* Logically, a single high capacity, high transfer rate disk
* **good for large transfers**
* **Wider arrays reduce capacity costs, but decrease availability**
* 33% capacity cost for parity if 3 data disks and 1 parity disk
### RAID 4 : High I/O Rate Parity

* **Small Writes**
* RAID 4 works well for small reads
* Small writes (write to one disk):
* Option 1: read other data disks, create new sum and write to Parity Disk
* **Option 2: since P has old sum, compare old data to new data, add the difference to P**
* ==**RAID 4 takes Option 2**==
* In RAID 4, small writes are limited by the parity disk
* Writes to D0 and D5 both also write to the P disk
* 
* 
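* A C sketch of the Option-2 small write: the new parity is the old parity XOR the old data XOR the new data, applied byte by byte:
```c=
#include <stddef.h>
#include <stdint.h>

/* RAID 4/5 small-write parity update (Option 2 above):
   parity' = parity ^ old_data ^ new_data, so only the target data disk
   and the parity disk need to be read and written. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t block_bytes) {
    for (size_t i = 0; i < block_bytes; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```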
### RAID 5 : High I/O Rate Interleaved Parity

### RAID 6 : Recovering from 2 failures
* Why > 1 failure recovery?
* operator accidentally replaces the wrong disk during a failure
* since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing
* increases the chances of a 2nd failure during repair since it takes longer
* reading much more data during reconstruction meant increasing the chance of an uncorrectable media failure, which would result in data loss
* Network Appliance’s **row-diagonal parity** or RAID-DP
* 
* Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
* Since it is protecting against a double failure, it adds **two check blocks per stripe of data**.
* If p+1 disks total, p-1 disks have data; assume p=5
* Row parity disk is just like in RAID 4
* Even parity across the other 4 data blocks in its stripe
* Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
* Example p = 5
* Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
* Since each diagonal misses one disk, and all diagonals miss a different disk
* Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
* Process continues until two failed disks are restored