###### tags: `計結`
# 1062 - Computer Architecture - 8~10

## (8) Multiprocessors

### Why Multiprocessors?
* A growth in data-intensive applications
* A growing interest in server markets
* Complexity of current microprocessors
    * Collecting several is much easier than redesigning one
* Steady improvement in parallel software (scientific apps, databases, OS)
* Long term goal of the field:
    * Scale the number of processors to the size of the budget and the desired performance

### Categories
```=
S = Single
M = Multiple
I = Instruction stream
D = Data stream
```
* **SISD**
    * Uni-processor
* **SIMD**
    * **Data level parallelism**
    * Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions
    * Multimedia, graphics performance
* **MISD**
    * No commercial product of this type
* **MIMD**
    * Each processor has its own instructions and operates on its own data
    * **==Thread== level parallelism**

### Data Level Parallelism: **SIMD**
* Vector architectures
* SIMD extensions
* Graphics Processor Units (GPUs)
* **Advantage**
    * SIMD architectures can exploit **significant data level parallelism** for:
        * **Matrix-oriented** scientific computing
        * **Media-oriented** image and sound processors
    * SIMD is **more energy efficient** than MIMD
        * Only needs to fetch one instruction per data operation
        * Makes SIMD attractive for personal mobile devices
    * SIMD allows the programmer to continue to think sequentially

### Vector Architectures
* Basic idea
    * Read sets of data elements into "**vector registers**"
    * Operate on those registers
    * Disperse the results back into memory
* Registers are controlled by the compiler
    * Used to hide memory latency
    * Leverage memory bandwidth
* The basic structure of a vector architecture
    * ![](https://i.imgur.com/nqbMmW3.png)

### Example architecture: **VMIPS**
* **Vector registers**
    * Each register holds a 64-element, 64 bits/element vector
    * Register file has 16 read ports and 8 write ports
* **Vector functional units**
    * Fully pipelined
    * Data and control hazards are detected
* **Vector load-store unit**
    * Fully pipelined
    * One word per clock cycle after initial latency
* **Scalar registers**
    * 32 general-purpose registers
    * 32 floating-point registers

### VMIPS Instructions
* `ADDVV.D Vd, Vs, Vt`
    * add two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
    * add scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
    * vector load / store from address `R`
* [For more instructions](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)

### **SAXPY** (Scalar Alpha X Plus Y)
* $Y = \alpha X + Y$
    * $X, Y$ are vectors, initially resident in memory
    * $\alpha$ is a scalar
* For different data types, there are:
    * **`SAXPY`**: Single precision
    * **`DAXPY`**: Double precision
    * `CAXPY`: Complex number with single precision
    * `ZAXPY`: Complex number with double precision
* This problem is the so-called SAXPY or DAXPY loop in benchmarks
```c=
for (int i = m; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```
* SAXPY MIPS code
```=
        L.D     F0, a        ; load scalar a
        DADDIU  R4, Rx, #512 ; last address to load
Loop:   L.D     F2, 0(Rx)    ; load X[i]
        MUL.D   F2, F2, F0   ; a*X[i]
        L.D     F4, 0(Ry)    ; load Y[i]
        ADD.D   F4, F4, F2   ; a*X[i] + Y[i]
        S.D     F4, 0(Ry)    ; store into Y[i]
        DADDIU  Rx, Rx, #8   ; increment index to X
        DADDIU  Ry, Ry, #8   ; increment index to Y
        DSUBU   R20, R4, Rx  ; compute bound
        BNEZ    R20, Loop    ; check if done
```
* SAXPY VMIPS code
```=
L.D     F0, a       ; load scalar a
LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV   V4, V2, V3  ; add
SV      Ry, V4      ; store the result
```
* Requires 6 instructions vs. almost 600 for MIPS

### Comparison: MIPS vs. VMIPS
* The overhead instructions are not present in VMIPS
* The compiler produces vector instructions for such a sequence: the code is **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
    * Vector architects call **forwarding of element-dependent operations** "==chaining=="

### Vector Execution Time
* Execution time depends on three factors:
    * Length of operand vectors
    * Structural hazards
    * Data dependencies
* VMIPS functional units consume one element per clock cycle
    * Execution time is approximately the vector length
* **Convoy**
    * Set of vector instructions that could potentially execute together

### Chimes
* Sequences with **RAW** dependency hazards can be in the same convoy via chaining
* **Chaining**
    * Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
    * Unit of time to execute one convoy
    * $m$ convoys execute in $m$ chimes
    * For a vector length of $n$, this requires $m \times n$ clock cycles

### Example: Total clock cycles with chimes
```=
LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add two vectors
SV      Ry, V4      ; store the sum
```
* Convoys: 3 convoys
```=
LV    MULVS.D
LV    ADDVV.D
SV
```
* Total clock cycles = 3 chimes * 64 elements per vector = 192 clock cycles

### Multiple Lanes
* Using parallel pipelines
    * Allows for multiple hardware lanes
* Example
    * ![](https://i.imgur.com/zRyvxg8.png)
* Structure of a vector unit containing 4 lanes
    * ![](https://i.imgur.com/eyWa8N3.png)

### Vector Length Register
* Vector length not known at compile time?
* Use the **Vector Length Register** (**VLR**)
* Use ==strip mining== for vectors longer than the **maximum vector length** (**MVL**)
    * ![](https://i.imgur.com/BeCuwtZ.png)
    * For vector length n, the first strip has length m = n % MVL
    * So the length of each operation <= MVL, as in the picture

### Vector Mask Registers
* Consider:
```cpp=
for (i = 0; i < 64; i++) {
    if (X[i] != 0) {
        X[i] = X[i] - Y[i];
    }
}
```
* Use the vector mask register to "**disable**" elements:
```=
LV      V1, Rx      ; load vector X into V1
LV      V2, Ry      ; load vector Y into V2
L.D     F0, #0      ; load FP zero into F0
SNEVS.D V1, F0      ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1, V1, V2  ; subtract under vector mask
SV      Rx, V1      ; store the result in X
```

### **MIMD**
* MIMD offers **flexibility**
    * With correct hardware and software support, MIMDs can function as
        * **Single-user** multiprocessors focusing on high performance for **one application**
        * **Multi-programmed** multiprocessors running many tasks simultaneously
        * Or some combination of these functions
* MIMD can build on the cost-performance advantages of **off-the-shelf** processors
    * Nearly all multiprocessors built today **use the same microprocessors found in workstations and single-processor servers**.
    * **Multi-core** chips leverage the design investment in a single processor core by **replicating** it

### MIMD - Clusters
* One of the popular classes of MIMD computers, which often use **standard components** and **standard network technology**
* Clusters
    * **Commodity clusters**
        * Often assembled by **users or computer center directors**, rather than vendors
        * For applications focusing on throughput and requiring almost no communication among threads
            * **Web serving**, **multiprogramming**, **transaction-processing applications**
        * Inexpensive
    * **Custom clusters**
        * A designer customizes the detailed node design or the interconnection network
        * For parallel applications that can **exploit large amounts of parallelism on a single problem**
        * Such applications require a **significant amount of communication during computation**

### MIMD - Multicore
* On-chip multi-processing, single-chip multiprocessing, **multi-core**
* Place multiple processors on a **single die**, typically **sharing the L2 cache**, **memory** or **I/O bus**
* Usually, each processor executes a different process
    * **Process**: independent of each other (usually)
    * **Thread**: may share code and address space

### Major MIMD Styles
* **Centralized shared memory**
    * Symmetric (shared-memory) multiprocessors
    * Uniform memory access time
    * Shared Memory Processor
    * ![](https://i.imgur.com/t0yoZLF.png)
* **Decentralized memory** (memory module with CPU)
    * Get more memory bandwidth
    * Lower memory latency
    * Major ==drawback==: **longer communication latency**
    * ==Drawback==: software model more complex
    * ![](https://i.imgur.com/r5k6b6T.png)

### Models for communication
* **Shared address space**
    * The physically separate memories can be **addressed as one logically shared address space**
    * Called ==**distributed shared-memory**==
    * For non-uniform memory access
        * access time depends on the location of the data
* **Message passing**
    * Address spaces are logically disjoint and cannot be addressed by a remote processor

### Symmetric Multi-Processors (SMP)
* The **large, multilevel caches reduce the memory demand** of a processor
    * More than one processor can then **share the bus** to memory
* Different organizations:
    * Processor and cache on an extension board
    * **Integrated on the main board** (most popular)
    * Integrated on the same chip
* Private data
    * Used by a single processor
* Shared data
    * Used by multiple processors
    * Provides communication among processors through reads/writes of the shared data
* The new problem: ==**cache coherence**==
    * For a single memory location `X`, R/W by 2 processors `A` and `B`
    * ![](https://i.imgur.com/SyFGpch.png)

### Coherency
* Informally:
    * **Any read must return the most recent write**
* Formally: a memory system is coherent if
    * `P` writes `x` and `P` reads `x`. If there are **no other writes of `x` by another processor** between the write and the read, the read always returns the value written by `P`.
    * If `P1` writes `x` and `P2` reads it, `P1`'s write will be seen by `P2` if the read and write are **sufficiently far apart** and there are **no other writes to `x`** between the two accesses.
    * Writes to a single location are serialized:
        * **seen in one order (the latest write will be seen)**

### **Coherence** and **Consistency**
* **Coherence**:
    * defines **what values** can be returned by a read
* **Consistency**:
    * defines **when** a written value will be returned by a read
* We cannot require that a read of `X` **instantaneously** sees the value written for `X` by some other processor
    * The issue of **exactly when a written value must be seen by a reader** is defined by a memory consistency model

### Coherence Solutions
* Snooping:
    * **No centralized state** is kept.
    * **Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block**
    * Caches are on a shared-memory bus
    * All cache controllers **monitor or snoop on the bus** to determine if they have a copy of a block that is requested on the bus
    * Dominates for **small scale** machines (most of the market)
* Directory-based schemes
    * The sharing status of a block is kept in **one location** called the ==**directory**==
    * Keep track of what is being shared in one place (**centralized** logically, but can be distributed physically)
    * Has slightly higher implementation **overhead** than snooping
    * Can **scale to larger processor counts**

### Basic Snooping Protocols
* Write **Invalidate** Protocol:
    * Multiple readers, single writer
    * Write to shared data: an invalidate is sent to all caches, which snoop and **invalidate** any copies
    * Read miss:
        * Write-through: memory is always up-to-date
        * Write-back: snoop in caches to find the most recent copy
    * Example:
        * ![](https://i.imgur.com/EecjqOH.png)
* Write **Broadcast** Protocol
    * Also called **Write Update Protocol**
    * Write to shared data: broadcast on bus, processors snoop, and **update** any copies
    * Read miss: memory is always up-to-date
    * Typically write-through

### States of the Snooping Protocol
* For an **invalidation** protocol with a **write-back** cache
* Each **block of memory** is in one state:
    * Shared: clean in all caches and up-to-date in memory
    * Exclusive: dirty in exactly one cache
    * Invalid: not in any caches
* Each **cache block** is in one state (track these):
    * Shared: block can be read
    * Exclusive: cache has the only copy; it's writeable and dirty
    * Invalid: block contains no data
* State diagram
    * Cache state transitions based on **requests from the CPU**
        * ![](https://i.imgur.com/lclazcx.png)
    * Cache state transitions based on **requests from the bus**
        * ![](https://i.imgur.com/ZgxXRbJ.png)
    * Cache coherence state diagram with the state transitions induced by
        * black: the local processor
        * gray: the bus activities
        * ![](https://i.imgur.com/BioTsnj.png)

### Directory: A Scalable Approach
* Every memory block has associated directory information
    * keeps track of copies of cached blocks and their states
    * **on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies if necessary**
    * in scalable networks, communication with the directory and copies is through network transactions
* Many alternatives for organizing directory information
    * ![](https://i.imgur.com/9MftyAa.png)

### Directory-based coherence
* Local node:
    * the requesting node
* Home node:
    * where the memory location and the directory entry of an address reside
* Remote node:
    * has a copy of the cache block, whether exclusive or shared

### States of the Directory-Based Cache Coherence Protocol
* Similar to the snoopy protocol: 3 states
    * Shared: ≥ 1 processors have data, memory up-to-date
    * Exclusive: 1 processor (**owner**) has data; memory out-of-date
    * Uncached: no processor has a copy of the cache block; not valid in any cache
* In addition to cache state, must track **which processors** have data when in the **shared state** (usually a bit vector, 1 if the processor has a copy)
* State diagram
    * An individual cache block in a directory-based system
        * ![](https://i.imgur.com/h30cf72.png)
    * The directory has the same states and structure as the transition diagram for an individual cache
        * ![](https://i.imgur.com/oEAOsiW.png)

### Two-level "Hierarchy"
* A popular middle ground
    * Individual nodes are multiprocessors, connected non-hierarchically
    * Coherence across nodes is directory-based
    * Coherence within nodes is snooping or directory-based

### Challenges of Parallel Processing
* Limited parallelism available in programs
    * Running independent tasks
* High cost of communications
    * Threads must communicate to complete the task
    * Large latency of remote access
* In existing shared-memory multiprocessors, **communication cost**:
    * from **50 clock cycles** (multicore) to **1000 clock cycles** (large-scale multiprocessors)

### Examples
* ==**Example 1**==
    * ![](https://i.imgur.com/Bylu6Bg.png)
* ==**Example 2**==
    * ![](https://i.imgur.com/oRjaPDz.png)
    * ![](https://i.imgur.com/vIq5o4n.png)
* ==**Example 3**==
    * ![](https://i.imgur.com/yjE0oSa.png)
    * ![](https://i.imgur.com/kY8AkI6.png)
    * ![](https://i.imgur.com/wwgBvnv.png)
* ==**Example 4**==
    * ![](https://i.imgur.com/Uo8vtyo.png)
    * ![](https://i.imgur.com/Lfwg31N.png)
    * ![](https://i.imgur.com/voQRDrK.png)

### Conclusion
* Sharing cached data
    * **Coherence** (values returned by a read)
    * **Consistency** (when a written value will be returned by a read)
* Snooping cache over a shared medium for smaller numbers of processors
    * Snooping $\rightarrow$ uniform memory access; **the bus makes snooping easier because of broadcast**
* **Directory has an extra data structure** to keep track of the state of all cache blocks
    * **Distributing the directory**
        * $\rightarrow$ scalable shared-address multiprocessor
        * $\rightarrow$ non-uniform memory access

## (9-1) Memory Hierarchy
Reference: [淺談 memory cache](http://opass.logdown.com/tags/%E8%A8%88%E7%AE%97%E6%A9%9F%E7%B5%84%E7%B9%94)

### Why Memory Hierarchy?
![](https://i.imgur.com/6JGGlY0.png)

### Levels of the Memory Hierarchy
![](https://i.imgur.com/6PkDhS9.png)

### Achieving higher memory bandwidth
![](https://i.imgur.com/LvKVca1.png)
* Suppose
    * 1 clock cycle to send the address
    * 15 clock cycles for each DRAM access initiated
    * 1 clock cycle to send a word of data
* (a) **One-word-wide memory organization**
    * If we have a **cache block of 4 words** and a **one-word-wide bank** of DRAMs, the miss penalty would be
        * ==1 + 4 * 15 + 4 * 1 = 65 cycles==
    * The number of bytes transferred per clock cycle for a single miss is
        * ==4 (words) * 4 (bytes/word) / 65 = 0.246==
    * The memory is **one-word-wide** and all the accesses are made **sequentially**.
* (b) **Wide memory organization**
    * Increase the bandwidth to memory by **widening the memory and the buses** between the processor and memory
    * Allows **parallel access** to all the words of the block
    * For a 4-word-wide memory, the miss penalty is
        * ==1 + (4/4) * 15 + (4/4) * 1 = 17 cycles==
    * For a 2-word-wide memory, the miss penalty is
        * ==1 + (4/2) * 15 + (4/2) * 1 = 33 cycles==
* (c\) **Interleaved memory organization**
    * Increase the bandwidth to memory by **widening the memory but not the interconnection bus**
    * **Still need to pay a cost to transmit each word**, but we can **avoid paying the cost of access latency more than once**
    * **Interleaving**: memory chips organized in banks read/write multiple words in one access time, instead of reading/writing single words each time
    * ==1 + 1 * 15 + 4 * 1 = 20 cycles==

### Cache: **The Principle of Locality**
* The Principle of Locality:
    * A program accesses a **relatively small portion of the address space** at any instant of time.
* Two different types of locality:
    * **Temporal** locality (locality in *time*): if an item is referenced, it will tend to be referenced again soon
        * loops
        * reuse
    * **Spatial** locality (locality in *space*): if an item is referenced, items whose addresses are close by tend to be referenced soon
        * straight-line code
        * array access
* For the last 15 years, HW has relied on locality for speed

### Cache
* Small, fast storage used to improve the average access time to slow memory.
* Exploits spatial and temporal locality
* In computer architecture, almost everything is a cache!
    * Registers: a cache on variables
    * First-level cache: a cache on the second-level cache
    * Second-level cache: a cache on memory
    * Memory: a cache on disk (virtual memory)
    * TLB: a cache on the page table
    * Branch prediction: a cache on prediction information
    * BTB: a cache of branch targets
    * ![](https://i.imgur.com/iy8jsj0.png)

### Review: Terminology
* **Hit**: data appears in some block in this level
    * **Hit Rate**: the fraction of memory accesses found in this level
    * **Hit Time**: time to access the upper level, which consists of RAM access time + time to determine hit/miss
* **Miss**: data needs to be retrieved from a block in the lower level (block `Y`)
    * **Miss Rate** = 1 - Hit Rate
    * **Miss Penalty**: time to replace a block in the upper level + time to deliver the block to the processor
* ==Hit Time << Miss Penalty==
* ![](https://i.imgur.com/PZc7GM9.png)

### Review: The Basics of Caches
* A direct-mapped cache with 8 entries
    * ![](https://i.imgur.com/zi4GGeq.png)
* A 4KB cache using 1-word blocks
    * ![](https://i.imgur.com/LFw556Y.png)
    * 1024 blocks * 4 (32 bits / 8) bytes of data per block = 4KB

### 4 Questions for Memory Hierarchy
* Q1: Where can a block be placed in the upper level?
    * Block placement
* Q2: How is a block found if it is in the upper level?
    * Block identification
* Q3: Which block should be replaced on a miss?
    * Block replacement
* Q4: What happens on a write?
    * Write strategy

### Block Placement
* Where can a block be placed in the upper level?
    * Fully associative
    * Set associative (n-way associative)
    * Direct mapped (1-way associative)
    * ![](https://i.imgur.com/qaUpUWB.png)
* Cache size: ![](https://i.imgur.com/b2qLzih.png)
    * ![](https://i.imgur.com/rFLSd6t.png)
    * ![](https://i.imgur.com/u63evKK.png)
* Example
    * ![](https://i.imgur.com/tQKrsXq.png)
    * ![](https://i.imgur.com/wdEECy7.png)

### Block identification
* How is a block found if it is in the upper level?
* Index identifies the set
* Increasing associativity shrinks the index and expands the tag
* ![](https://i.imgur.com/InQqdR0.png)

### Block replacement
* Which block should be replaced on a miss?
    * Easy for direct mapped
    * Set associative or fully associative:
        * Random
            * Easy to implement
        * **LRU** (**Least Recently Used**)
            * Appealing, but hard to implement for high associativity

### Write strategy
* What happens on a write?
    * **Write through**
        * The information is written to **both the block in the cache and the block in the lower-level memory**.
    * **Write back**
        * The information is written only to the block in the cache. The modified cache block is written to main memory only **when it is replaced**.
        * Is the block clean or dirty?
    * Pros and cons of each
        * WT: read misses cannot result in writes
        * WB: no repeated writes to the same location
        * ![](https://i.imgur.com/fRYPWG4.png)
* What about a write miss? $\rightarrow$ **write allocate?**
    * **Write allocate** (fetch on write)
        * The block is loaded on a write miss
        * lets writes to an un-cached address allocate a new cache line
        * Usually, **WB caches use WA** (hoping subsequent writes to that block will be captured by the cache)
    * **No write allocate** (write around)
        * The block is modified in the lower level and not loaded into the cache
        * **WT caches often use no-WA** (since subsequent writes to that block will still have to go to memory)
* **Write buffer** for WT caches
    * WT is always combined with **write buffers** so that we don't have to wait for the lower-level memory
        * ![](https://i.imgur.com/ZWakuxk.png)
    * A write buffer is needed between the cache and memory
        * Processor: writes data into the cache and the write buffer
        * Memory controller: writes the contents of the buffer to memory
    * The write buffer is just a **FIFO**:
        * Typical number of entries: 4
        * Works fine if: store frequency << (1 / DRAM write cycle)

### Cache performance
* Miss-oriented approach to memory access
    * ![](https://i.imgur.com/A2n7Wa9.png)
* Separating out the memory component entirely
    * AMAT = Average Memory Access Time
    * ![](https://i.imgur.com/eCQcNoS.png)
    * ![](https://i.imgur.com/NbLSXCV.png)

### Impact on Performance
* ==Example 1==
    * ![](https://i.imgur.com/wn45fP6.png)
* ==Example 2==: **Harvard** Architecture
    * Separate instruction & data memories (*right*)
    * ![](https://i.imgur.com/FVYQD9E.png)
    * ![](https://i.imgur.com/uqfMhVr.png)

### Impact of Cache
* ![](https://i.imgur.com/ET2YdW0.png)
* ![](https://i.imgur.com/VnUdQcV.png)

### The Cache Design Space
* **Several interacting dimensions**
    * ![](https://i.imgur.com/OcGhfzi.png)
* **The optimal choice is a compromise**
    * ![](https://i.imgur.com/plM32fE.png)
* ==**Simplicity often wins**==

### Improving Cache Performance
![](https://i.imgur.com/A2n7Wa9.png)
* Reduce the **miss rate**
* Reduce the **miss penalty**
* Reduce the **hit time**

### Reducing Misses
* Classifying misses: the **3 Cs**
    * **Compulsory**
        * The **first access** to a block is not in the cache, so the block must be brought into the cache.
        * Also called **cold start misses** or **first reference misses**.
        * ==Misses in **even an infinite cache**==
    * **Capacity**
        * If the cache cannot contain all the blocks needed during execution of a program, **capacity misses will occur due to blocks being discarded and later retrieved**.
        * ==Misses in a fully associative cache of size X==
    * **Conflict**
        * If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur, because a block can be discarded and later retrieved if too many blocks map to its set.
        * Also called **collision misses** or **interference misses**.
        * ==Misses in an N-way associative cache of size X==
* More recent, a **4th C**:
    * **Coherence**
        * Misses caused by cache coherence.

### Classify Cache Misses, How?
* Infinite cache, fully associative
    * Compulsory misses
* Finite cache, fully associative
    * Compulsory misses + capacity misses
* Finite cache, limited associativity
    * Compulsory misses + capacity misses + conflict misses

### 2:1 Cache Rule
```
[miss rate of 1-way associative cache of size X]
~= [miss rate of 2-way associative cache of size X/2]
```
* ![](https://i.imgur.com/uH4pBOF.png)

### How to Reduce Misses?
* In all cases, assume the **total cache size is not changed**. What happens if we:
    * Change block size:
        * Which of the 3 Cs is obviously affected? **Compulsory**
    * Change associativity:
        * Which of the 3 Cs is obviously affected? **Conflict**
    * Change algorithm / compiler:
        * Which of the 3 Cs is obviously affected? **All 3 Cs**

### 1. Reduce Misses via "**Larger Block Size**"
* ![](https://i.imgur.com/hD77Zhh.png)
* ![](https://i.imgur.com/TNq7zzj.png)
* Larger block size:
    * Miss rate $\downarrow$
        * mainly **compulsory misses** $\downarrow$
        * but also conflict & capacity misses $\downarrow$
    * **Miss penalty** $\uparrow$
    * **AMAT** $\downarrow$
    * **Hit time** $\uparrow$

### 2. Reduce Misses via "**Higher Associativity**"
* 2:1 Cache Rule
* Beware: execution time is the only final measure
    * Will clock cycle time (CCT) increase?
    * Hill (1988) suggested hit time for 2-way vs. 1-way:
        * external cache: +10%
        * internal: +2%
* AMAT vs. miss rate
    * Assume CCT = n times the CCT of direct mapped
        * n = 1.10 for 2-way
        * n = 1.12 for 4-way
        * n = 1.14 for 8-way
    * ![](https://i.imgur.com/IGZlSD3.png)

### 3. Reducing Misses via a "**Victim Cache**"
* How to combine the **fast hit time** of direct mapped yet **still avoid conflict misses**?
* **Victim cache**: add a buffer to hold data discarded from the cache
* Jouppi (1990): a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
* Used in Alpha, HP machines
* ![](https://i.imgur.com/BlEVgsu.png)
    * From [Wiki](https://zh.wikipedia.org/wiki/CPU%E7%BC%93%E5%AD%98#%E5%8F%97%E5%AE%B3%E8%80%85%E7%BC%93%E5%AD%98)

### 4. Reducing Misses via "**Pseudo-Associativity**"
* How to combine the **fast hit time of a DM cache** and the **lower conflict misses of a 2-way SA cache**?
* Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a **pseudo-hit** (slow hit)
    * ![](https://i.imgur.com/NTkvyrh.png)
* Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
    * Better for caches not tied directly to the processor (L2)
    * Used in the MIPS R10000 L2 cache; similar in UltraSPARC

### 5 & 6. Reducing Misses by "**Prefetching**"
* Categories
    * Hardware prefetching of instructions & data
    * Software prefetching of data
* Instruction prefetching
    * Alpha 21064 fetches 2 blocks on a miss
    * The extra block is placed in a **stream buffer**
        * Instruction stream buffer
        * Data stream buffer
    * On a miss, check the stream buffer
* Works with data blocks too:
    * Jouppi (1990): 1 data stream buffer caught 25% of misses from a 4KB cache; 4 stream buffers caught 43%
    * Palacharla & Kessler (1994): for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
* Prefetching relies on having **extra** memory bandwidth that can be used without penalty

### 7. Reducing Misses by Compiler Optimizations
* McFarling (1989) reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, **in software**
    * Reorder procedures in memory so as to reduce misses
* Data
    * **Merging Arrays**
        * improve spatial locality by using a single array of compound elements instead of 2 arrays
```cpp=
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];
```
        * Some programs reference multiple arrays in the same dimension with the same indices (**these accesses will interfere with each other**)
        * Reduce conflicts by combining independent matrices into a single compound array so that **each cache block can contain the desired elements**; improves spatial locality
    * **Loop Interchange**
        * change the nesting of loops to access data in the order it is stored in memory
        * **row major or column major**
```cpp=
/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)    // <- changed
        for (j = 0; j < 100; j++) // <- changed
            x[i][j] = 2 * x[i][j];
```
        * Sequential accesses instead of striding through memory every 100 words; improved spatial locality
    * **Loop Fusion**
        * Combine 2 independent loops that have the same looping structure and some overlapping variables
```cpp=
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
```
        * 2 misses per access to `a` & `c` vs. one miss per access; improves temporal locality
    * **Blocking**
        * Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
```cpp=
/* Matrix multiplication */
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1) {
            r = r + y[i][k] * z[k][j];
        }
        x[i][j] = r;
    }
```
        * Two inner loops:
            * Read all N^2 elements of `z[]`
            * Read N elements of 1 row of `y[]` repeatedly
            * Write N elements of 1 row of `x[]`
        * Idea: compute on a B^2 submatrix that fits in the cache
```cpp=
/* After */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B-1, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B-1, N); k++) {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = x[i][j] + r;
            }
```
        * `B` is called the **Blocking Factor**

### Summary on reducing miss rate
![](https://i.imgur.com/6q29uJx.png)
* 3 Cs: Compulsory, Capacity, Conflict
1. Reducing misses via larger block size
2. Reducing misses via higher associativity
3. Reducing misses via a victim cache
4. Reducing misses via pseudo-associativity
5. Reducing misses by HW prefetching of instructions and data
6. Reducing misses by SW prefetching of data
7. Reducing misses by compiler optimizations
* Remember the danger of concentrating on just one parameter when evaluating performance

### Reducing Cache Miss Penalty
1. Read priority over write on miss
2. Sub-block placement
3. Early restart and critical word first
4. Non-blocking caches to reduce stalls on misses
5. Add a second-level cache

### 1. Read Priority over Write on Miss
* **Write through with write buffer**
    * If we simply wait for the write buffer to empty, we might increase the read miss penalty
    * Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
* **Write back**
    * A read miss replaces a dirty block
    * Normal: write the dirty block to memory, then do the read (**slow**)
    * **Instead**: copy the dirty block to a write buffer, then do the read, and then do the write
    * The CPU stalls less, since it restarts as soon as the read is done

### 2.
Sub-block Placement * **Don’t** have to load **full block** on a miss * Have valid bits per sub-block to indicate valid * ![](https://i.imgur.com/8Bul32h.png) ### 3. Early Restart and Critical Word First * Don't **wait** for full block to be loaded before restarting CPU * **Early restart** * as soon as requested word of the block arrives, send it to the CPU and let the CPU continue execution * **Critical word first** * request the missed word first from memory and send it to the CPU as soon as it arrives * Let the CPU continue execution while filling the rest of the words in the block * Also called **wrapped fetch** or **requested word first** * ==Generally useful only in large blocks== ### 4. Non-blocking Caches to reduce stalls on misses ![](https://i.imgur.com/4bC6H4q.png) from [PTT](https://www.ptt.cc/bbs/Grad-ProbAsk/M.1285752708.A.6D0.html) * **Non-blocking cache** or **lockup-free cache** allow data cache to continue to supply cache hits during a miss * requires F/E bits on registers or out-of-order execution * requires multi-bank memories * "**hit under miss**" reduces the effective miss penalty by working during miss * "**hit under multiple miss**" or "**miss under miss**" may further lower the effective miss penalty by overlapping multiple misses * Significantly increases the **complexity** of the cache controller as there can be **multiple outstanding memory accesses** * Requires muliple memory banks (otherwise cannot support) * Pentium Pro allows 4 outstanding memory misses * Example * ![](https://i.imgur.com/mu6L38C.png) * ![](https://i.imgur.com/ZtAPTtZ.png) ### 5. 
Add a second-level cache * L2 Equations * ![](https://i.imgur.com/o4UmIJc.png) * Definitions: * **Local miss rate** * misses in this cache divided by the total number of memory accesses to this cache * Local miss rate for L2 = Miss rate_L2 * **Global miss rate** * misses in this cache divided by the total number of memory accesses generated by the CPU * global miss rate for L2 = Global miss rate for L1 * Miss Rate_L2 = Miss Rate_L1 * Miss Rate_L2 * ==**Global Miss Rate is what matters**== * Example * Given the data below, what is the impact of second-level cache associativity on the miss penalty? * ![](https://i.imgur.com/UqA2Hn7.png) * ![](https://i.imgur.com/SgmE8EO.png) * Conclusion: **Higher associativity or pseudo-associativity are worth considering** because * They have small impact on the second-level hit time * So much of the AMAT is due to misses in the second level cache * Usually for L2 cache: * L2 >> L1, for 32 KB L1 cache; * L2 not tied to CPU clock cycle * Higher associativity or pseudo associativity * Larger block size * Multilevel inclusion property ### Summary on Reducing Miss Penalty ![](https://i.imgur.com/rPBlIvG.png) * Techniques 1. Read priority over write on miss 2. Sub-block placement 3. Early Restart and Critical Word First on miss 4. Non-blocking Caches (Hit under Miss, Miss under Miss) 5. Second Level Cache * Can be applied recursively to Multilevel Caches * Danger is that time to DRAM will grow with multiple levels in between ### Cache Optimization Summary on miss rate and miss penalty * ![](https://i.imgur.com/ermv9za.png) ### Reducing Hit time * Hit time is critical because it affects the clock cycle time * A fast hit time is multiplied in importance beyond the AMAT formula because it helps everything * Fast hit time via **small and simple cache** ### Disadvantage of Set Associative Cache * N-way Set Associative Cache v. Direct Mapped Cache: * N comparators vs. 
1 * Extra MUX delay for the data * Data comes AFTER Hit/Miss * In a DM cache, cache block is available **BEFORE** Hit/Miss: * Possible to assume a hit and continue. Recover later if miss. * ![](https://i.imgur.com/mJuw7fE.png) ### Fast Hit time via Small and Simple Caches * Index tag memory and then compare takes time * **Small** cache can help hit time since smaller memory takes less time to index * E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron * Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip * **Simple** $\rightarrow$ direct mapping * Can overlap tag check with data transmission since no choice * Access time estimate for 90 nm using CACTI model 4.0 * Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches * ![](https://i.imgur.com/a4nwKVj.png) ## (9-2) Virtual Memory ### For Virtual Memory ![](https://i.imgur.com/k9E6OZi.png) * ![](https://i.imgur.com/in0a19V.png) * **Page size usually large** * Smaller **page table**, save memory, reduce TLB misses * Transferring larger pages to/from secondary storage is more efficient * ![](https://i.imgur.com/ISlwFA7.png) ### The Limits of Physical Addressing ![](https://i.imgur.com/cjLhsoX.png) ### Solution: Add a Layer of Indirection ![](https://i.imgur.com/OEgtdTH.png) ### Three Advantages of Virtual Memory * **Translation**: * Program can be given **consistent view** of memory, even though physical memory is scrambled * Makes **multithreading** reasonable (now used a lot!) * Only the most important part of program ("**Working Set**") must be in **physical memory**. * Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later. * **Protection**: * Different threads (or processes) protected from each other. 
    * Different pages can be given special behavior (**read only, invisible to user programs, etc.**)
    * Kernel data protected from user programs
    * Very important for protection from malicious programs
* **Sharing**:
    * Can map the same physical page to multiple users ("shared memory")

### Page tables encode virtual address spaces

![](https://i.imgur.com/8I2vact.png)

### Details of Page Table

![](https://i.imgur.com/0Dc55of.png)

### Page tables may not fit in memory!

![](https://i.imgur.com/FLtcybb.png)

### VM and Disk: Page replacement policy

![](https://i.imgur.com/vzmzGyC.png)

### **Translation Look-Aside Buffer** (**TLB**)

![](https://i.imgur.com/uN4nxhi.png)

### The TLB caches page table entries

![](https://i.imgur.com/UWCS6IE.png)

## (10) Storage

### Magnetic Disks

![](https://i.imgur.com/FUY16uV.png)

* Average rotation time = 0.5 / (rotation rate), i.e., half a revolution on average
    * for 3600 RPM, average rotation time = 0.5 / 3600 min ≈ 8.33 ms

### Redundant Arrays of (Inexpensive) Disks (**RAID**)

* Files are "striped" across multiple disks
    * spreading data over multiple disks
    * **forces accesses to go to several disks if the data files are large**
* Redundancy yields high data availability
    * **Availability**: service is still provided to the user, even if some components have failed
* Disks will still fail
    * Contents are reconstructed from data redundantly stored in the array
    * **Capacity penalty** to store the redundant info
    * **Bandwidth penalty** to update the redundant info
* [Wiki (Chinese)](https://zh.wikipedia.org/wiki/RAID)

### RAID 0

* No redundancy
* Sometimes nicknamed JBOD (Just a Bunch of Disks)
* Used as a **measuring stick** for the other RAID levels in terms of cost, performance, and dependability
### RAID 1 : Disk Mirroring / Shadowing

![](https://i.imgur.com/mhVSPYc.png)

* Each disk is fully duplicated onto its "**mirror**"
    * Very high availability can be achieved
* **Bandwidth sacrifice on write**:
    * **Logical write = two physical writes**
    * Reads may be optimized
* **Most expensive solution: 100% capacity overhead**

### RAID 2

* Inspired by applying memory-style error-correcting codes to disks
* RAID 2 is not used, since other schemes are more interesting

### RAID 3 : Parity Disk

![](https://i.imgur.com/nKjDCQW.png)

* A parity sum is computed across the recovery group to protect against hard disk failures, and is **stored in the P disk**
* Logically a single high-capacity, high-transfer-rate disk
    * **good for large transfers**
* **Wider arrays reduce capacity costs, but decrease availability**
    * 33% capacity cost for parity with 3 data disks and 1 parity disk

### RAID 4 : High I/O Rate Parity

![](https://i.imgur.com/NpxM2Q1.png)

* **Small Writes**
    * RAID 4 works well for small reads
    * Small writes (write to one disk):
        * Option 1: read the other data disks, create the new sum, and write it to the parity disk
        * **Option 2: since P holds the old sum, compare the old data to the new data and add the difference to P**
        * ==**RAID 4 takes Option 2**==
    * In RAID 4, small writes are limited by the parity disk
        * writes to D0 and D5 must both also write to the P disk
    * ![](https://i.imgur.com/VWHESjo.png)
    * ![](https://i.imgur.com/Dk61xBt.png)

### RAID 5 : High I/O Rate Interleaved Parity

![](https://i.imgur.com/DxCopFU.png)

### RAID 6 : Recovering from 2 failures

* Why > 1 failure recovery?
    * an operator accidentally replaces the wrong disk during a failure
    * since disk bandwidth is growing more slowly than disk capacity, the MTTR (mean time to repair) of a disk in a RAID system is increasing
        * this increases the chance of a 2nd failure during repair, since repair takes longer
    * reading much more data during reconstruction increases the chance of an uncorrectable media failure, which would result in data loss
* Network Appliance's **row-diagonal parity**, or RAID-DP
    * ![](https://i.imgur.com/H5eFPR7.png)
    * Like the standard RAID schemes, it uses redundant space based on a parity calculation per stripe
    * Since it protects against a double failure, it adds **two check blocks per stripe of data**
        * With p+1 disks total, p-1 disks hold data; assume p = 5
        * The row parity disk is just like in RAID 4
            * even parity across the other 4 data blocks in its stripe
        * Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
* Example: p = 5
    * Row-diagonal parity starts by recovering one of the 4 blocks on a failed disk using diagonal parity
        * since each diagonal misses one disk, and all diagonals miss a different disk
    * Once the data for those blocks is recovered, the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
    * The process continues until both failed disks are restored
