---
# System prepended metadata

title: 1062 - Computer Architecture - 8~10
tags: [計結]

---

###### tags: `計結`

# 1062 - Computer Architecture - 8~10

## (8) Multiprocessors

### Why Multiprocessors?
* A growth in data-intensive applications
* A growing interest in server markets
* Complexity of current microprocessors
* Collecting several processors is much easier than redesigning one
* Steady improvement in parallel software (scientific apps, databases, OS)
* Long term goal of the field:
    * Scale the number of processors to the size of the budget and the desired performance

### Categories
```=
S = Single
M = Multiple
I = Instruction stream
D = Data stream
```
* **SISD**
    * Uni-processor
* **SIMD**
    * **Data level parallelism**
    * Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions
    * Multimedia, graphics performance
* **MISD**
    * No commercial product of this type
* **MIMD**
    * Each processor has its own instructions and operates on its own data
    * **==Thread== level parallelism**

### Data Level Parallelism: **SIMD**
* Vector architectures
* SIMD extensions
* Graphics Processor Units (GPUs)
* **Advantage**
    * SIMD architectures can exploit **significant data level parallelism** for:
        * **Matrix-oriented** scientific computing
        * **Media-oriented** image and sound processors
    * SIMD is **more energy efficient** than MIMD
        * Only needs to fetch one instruction per data operation
        * Makes SIMD attractive for personal mobile devices
    * SIMD allows programmer to continue to think sequentially

### Vector Architectures
* Basic idea
    * Read sets of data elements into "**vector registers**"
    * Operate on those registers
    * Disperse the results back into memory
* Registers are controlled by compiler
    * Used to hide memory latency
    * Leverage memory bandwidth
* The basic structure for a vector architecture
    * ![](https://i.imgur.com/nqbMmW3.png)

### Example architecture: **VMIPS**
* **Vector registers**
    * Each register holds a 64-element, 64 bits/element vector
    * Register file has 16 read ports and 8 write ports
* **Vector functional units**
    * Fully pipelined
    * Data and control hazards are detected
* **Vector load-store unit**
    * Fully pipelined
    * One word per clock cycle after initial latency
* **Scalar registers**
    * 32 general-purpose registers
    * 32 floating-point registers

### VMIPS Instructions
* `ADDVV.D Vd, Vs, Vt`
    * add two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
    * add scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
    * vector load from the address in `Rs` / vector store to the address in `Rd`
* [For more instructions](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)

### **SAXPY** (Scalar Alpha X Plus Y)
* $Y = \alpha X + Y$
    * $X, Y$ are vectors, initially resident in memory
    * $\alpha$ is a scalar
* For different data types, there are:
    * **`SAXPY`**: Single precision
    * **`DAXPY`**: Double precision
    * `CAXPY`: Complex number with Single precision
    * `ZAXPY`: Complex number with Double precision
* This is the so-called SAXPY or DAXPY loop that appears in benchmarks
    ```c=
    for (int i = m; i < n; i++) {
       y[i] = a * x[i] + y[i];
    }
    ```
* SAXPY MIPS code
    ```=
    L.D    F0,  a        ; load scalar a
    DADDIU R4,  Rx, #512 ; last address to load
    Loop:
    L.D    F2,  0(Rx)    ; load X[i]
    MUL.D  F2,  F2, F0   ; a * X[i]
    L.D    F4,  0(Ry)    ; load Y[i]
    ADD.D  F4,  F4, F2   ; a * X[i] + Y[i]
    S.D    F4,  0(Ry)    ; store into Y[i]
    DADDIU Rx,  Rx, #8   ; increment index to X
    DADDIU Ry,  Ry, #8   ; increment index to Y
    DSUBU  R20, R4, Rx   ; compute bound
    BNEZ   R20, Loop     ; check if done
    ```
* SAXPY VMIPS code
    ```=
    L.D     F0, a      ; load scalar a
    LV      V1, Rx     ; load vector X
    MULVS.D V2, V1, F0 ; vector-scalar multiply
    LV      V3, Ry     ; load vector Y
    ADDVV.D V4, V2, V3 ; add two vectors
    SV      Ry, V4     ; store the result
    ```
* Requires 6 instructions vs. almost 600 for MIPS

### Comparison : MIPS, VMIPS
* The overhead instructions are not present in the VMIPS
* The compiler produces vector instructions for such a sequence: The code is **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
* Vector architects call **forwarding of element dependent operations** "==chaining=="

### Vector Execution Time
* Execution time depends on three factors:
    * Length of operand vectors
    * Structural hazards
    * Data dependencies
* VMIPS functional units consume one element per clock cycle
    * Execution time is approximately the vector length
* **Convoy**
    * Set of vector instructions that could potentially execute together

### Chimes
* Sequences with **RAW** dependency hazards can be in the same convoy via chaining
* **Chaining**
    * Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
    * Unit of time to execute one convoy
    * $m$ convoys execute in $m$ chimes
    * For a vector length of $n$, requires $m \times n$ clock cycles

### Example : Total clock cycle with chimes
```=
LV      V1, Rx     ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV      V3, Ry     ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV      Ry, V4     ; store the sum
```
* Convoys: 3 convoys
    ```=
    LV    MULVS.D
    LV    ADDVV.D
    SV
    ```
* Total clock cycles = 3 chimes * 64 elements per vector = 192 clock cycles

### Multiple Lanes
* Using parallel pipelines
* Allows for multiple hardware lanes
* Example
    * ![](https://i.imgur.com/zRyvxg8.png)
* Structure of a vector unit containing 4 lanes
    * ![](https://i.imgur.com/eyWa8N3.png)

### Vector Length Register
* Vector length not known at compile time?
* Use **Vector Length Register** (**VLR**)
* Use ==strip mining== for vectors over the **maximum vector length** (**MVL**)
    * ![](https://i.imgur.com/BeCuwtZ.png)
    * For vector length n, the first strip handles m = n % MVL elements; every later strip handles exactly MVL elements
    * So the length of each operation is <= MVL, as the figure shows
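* Strip-mined DAXPY in C, following the standard H&P formulation (variables as in the scalar loop earlier; `MVL` is the machine's maximum vector length):
    ```c=
    low = 0;
    VL = n % MVL;                        /* find the odd-size piece first */
    for (j = 0; j <= n / MVL; j++) {
        for (i = low; i < low + VL; i++) /* runs for length VL */
            Y[i] = a * X[i] + Y[i];      /* main operation */
        low = low + VL;                  /* start of the next strip */
        VL = MVL;                        /* reset to full vector length */
    }
    ```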

### Vector Mask Registers
* Consider:
    ```cpp=
    for (i = 0; i < 64; i++) {
        if (X[i] != 0) {
            X[i] = X[i] - Y[i];
        }
    }
    ```
* Use vector mask register to "**disable**" elements:
    ```=
    LV      V1, Rx     ; load vector X into V1
    LV      V2, Ry     ; load vector Y into V2
    L.D     F0, #0     ; load FP zero into F0
    SNEVS.D V1, F0     ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1, V1, V2 ; subtract under vector mask
    SV      Rx, V1     ; store the result in X
    ```

### **MIMD**
* MIMD offers **flexibility**
* With correct hardware and software support, MIMDs can function as
    * **Single-user** multiprocessors focusing on high performance for **one application**
    * **Multi-programmed** multiprocessors running many tasks simultaneously
    * Or some combination of these functions
* MIMD can build on the cost-performance advantages of **off-the-shelf** processors
    * Nearly all multiprocessors built today **use the same microprocessors found in workstations and single-processor servers**.
    * **Multi-core** chips leverage the design investment in a single processor core by **replicating** it

### MIMD - Clusters
* One of the most popular classes of MIMD computers; clusters often use **standard components** and **standard network technology**
* Clusters
    * **Commodity clusters**
        * Often assembled by **users or computer center directors**, rather than vendors
        * For applications focusing on throughput and requiring almost no communication among threads
        * **Web serving**, **multiprogramming**, **transaction-processing applications**
        * inexpensive
    * **Custom clusters**
        * A designer customizes the detailed node design or the interconnection network
        * For parallel applications that can **exploit large amounts of parallelism on a single problem**
        * Such applications require **significant amount of communication during computation**

### MIMD - Multicore
* On-chip multi-processing, single-chip multiprocessing, **multi-core**
    * Place multiple processors on a **single die**, typically **share L2 cache**, **memory** or **I/O bus**
* Usually, each processor executes a different process
    * **Process**: Independent of each other (usually)
    * **Thread**: May share code and address space

### Major MIMD Styles
* **Centralized shared memory**
    * Symmetric (shared-memory) multiprocessors
    * Uniform Memory Access time
    * Shared Memory Processor
    * ![](https://i.imgur.com/t0yoZLF.png)
* **Decentralized memory** (memory module with CPU)
    * Get more memory bandwidth
    * Lower memory latency
    * Major ==Drawback==: **Longer communication latency**
    * ==Drawback==: Software model more complex
    * ![](https://i.imgur.com/r5k6b6T.png)

### Models for communication
* **Shared address space**
    * The physically separate memories can be **addressed as one logically shared address space**
    * Called ==**Distributed shared-memory**==
    * Also **non-uniform memory access** (**NUMA**)
        * access time depends on the location of the data
* **Message passing**
    * Address spaces are logically disjoint and cannot be addressed by a remote processor

### Symmetric Multi-Processors (SMP)
* The **large**, **multilevel caches reduce the memory demand** of a processor
* More than one processor to **share the bus** to the memory
* Different organizations:
    * Processor and cache on an extension board
    * **Integrated on the main-board** (most popular)
    * Integrated on the same chip
* Private data
    * Used by single processor
* Shared data
    * Used by multiple processors
    * Provides communication among processors through reads/writes of the shared data
* The new problem: ==**cache coherence**==
    * For a single memory location `X`, R/W by 2 processors `A` and `B`
    * ![](https://i.imgur.com/SyFGpch.png)
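For example, the classic write-through scenario (`X` is initially 1 in memory):

| Time | Event | Cache `A` | Cache `B` | Memory `X` |
| ---- | ----- | --------- | --------- | ---------- |
| 0    |       |           |           | 1          |
| 1    | `A` reads `X` | 1 |           | 1          |
| 2    | `B` reads `X` | 1 | 1         | 1          |
| 3    | `A` writes 0 to `X` | 0 | 1 (stale) | 0    |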

### Coherency
* Informally:
    * **Any read must return the most recent write**
* Formally: A memory system is coherent if
    * `P` writes `x` and then `P` reads `x`. If there are **no other writes of `x` by another processor** between the write and the read, the read always returns the value written by `P`.
    * If `P1` writes `x` and `P2` reads it, `P1`’s write will be seen by `P2` if the read and write are **sufficiently far apart** and **no other writes to `x`** between the two accesses.
    * Writes to a single location are serialized:
        * **seen in one order (Latest write will be seen)**

### **Coherence** and **Consistency**
* **Coherence**:
    * defines **what values** can be returned by a read
* **Consistency**:
    * defines **when** a written value will be returned by a read
    * We cannot require that a read of `X` **instantaneously** see the value written for `X` by some other processor
    * The issue of **exactly when a written value must be seen by a reader** is defined by a memory consistency model

### Coherence Solutions
* Snooping:
    * **No centralized state** is kept.
    * **Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block**
    * Caches are on a shared-memory bus
    * All cache controllers **monitor or snoop on the bus** to determine if they have a copy of a block that is requested on the bus
    * Dominates for **small scale** machines (most of the market)
* Directory-Based Schemes
    * The sharing status of a block is kept in **one location** called ==**directory**==
    * Keep track of what is being shared in one place (**centralized** logically, but can be distributed physically)
    * Has slightly higher implementation **overhead** than snooping
    * Can **scale to larger processor counts**

### Basic Snooping Protocols
* Write **Invalidate** Protocol:
    * Multiple readers, single writer
    * Write to shared data: an invalidate is sent to all caches which snoop and **invalidate** any copies
    * Read Miss:
        * Write-through: memory is always up-to-date
        * Write-back: snoop in caches to find the most recent copy
    * Example:
        * ![](https://i.imgur.com/EecjqOH.png)
* Write **Broadcast** Protocol
    * Also called **Write Update Protocol**
    * Write to shared data: broadcast on bus, processors snoop, and **update** any copies
    * Read miss: memory is always up-to-date
    * typically write through

### States of Snooping Protocol
* For **invalidation** protocol, **write-back** cache
* Each **block of memory** is in one state:
    * Shared : Clean in all caches and up-to-date in memory
    * Exclusive : Dirty in exactly one cache
    * Invalid : Not in any caches
* Each **cache block** is in one state (track these):
    * Shared : block can be read
    * Exclusive : cache has the only copy, it’s writeable, and dirty
    * Invalid : block contains no data
* State Diagram
    * Cache state transitions based on **requests from CPU**
        * ![](https://i.imgur.com/lclazcx.png)
    * Cache state transitions based on **requests from bus**
        * ![](https://i.imgur.com/ZgxXRbJ.png)
    * Cache Coherence state diagram with the state transitions induced by
        * black : the local processor
        * gray : the bus activities
        * ![](https://i.imgur.com/BioTsnj.png)
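* A minimal sketch of these transitions in C (names are illustrative; a real controller also places the corresponding transactions on the bus):
    ```c=
    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

    /* Transition of one cache block on a request from the local CPU
     * (write-invalidate protocol, write-back cache). */
    BlockState cpu_request(BlockState s, int is_write) {
        switch (s) {
        case INVALID:
            /* Read miss -> fetch block, become Shared.
             * Write miss -> fetch block and invalidate other copies. */
            return is_write ? EXCLUSIVE : SHARED;
        case SHARED:
            /* A write hit must first broadcast an invalidate. */
            return is_write ? EXCLUSIVE : SHARED;
        case EXCLUSIVE:
            /* Read or write hit: no bus traffic needed. */
            return EXCLUSIVE;
        }
        return s;
    }

    /* Transition when another processor's access is snooped on the bus. */
    BlockState bus_request(BlockState s, int remote_is_write) {
        if (s == EXCLUSIVE)
            /* Write back the dirty block before giving it up. */
            return remote_is_write ? INVALID : SHARED;
        if (s == SHARED && remote_is_write)
            return INVALID;
        return s;
    }
    ```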

### Directory : A Scalable Approach
* Every memory block has associated directory information
    * keeps track of copies of cached blocks and their states
    * **on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary**
    * in scalable networks, communication with directory and copies is through network transactions
* Many alternatives for organizing directory information
* ![](https://i.imgur.com/9MftyAa.png)

### Directory-based coherence
* Local node:
    * requesting node
* Home node:
    * where the memory location and the directory entry of an address reside
* Remote node:
    * have a copy of the cache block, whether exclusive or shared

### States of Directory based Cache Coherence Protocol
* Similar to Snoopy Protocol: 3 states
    * Shared : ≥ 1 processors have data, memory up-to-date
    * Exclusive : 1 processor (**owner**) has data; memory out-of-date
    * Uncached : no processor has a copy of the cache block; not valid in any cache
* In addition to the cache state, must track **which processors** have data when in the **shared state** (usually a bit vector: bit *i* is 1 if processor *i* has a copy; see the sketch after the state diagrams)
* State Diagram
    * An individual cache block in a directory-based system
        * ![](https://i.imgur.com/h30cf72.png)
    * For the directory has the same states and structure as the transition diagram for an individual cache
        * ![](https://i.imgur.com/oEAOsiW.png)
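* A directory entry for this scheme, sketched in C (field names are illustrative; the bit vector assumes at most 64 processors):
    ```c=
    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;    /* Uncached / Shared / Exclusive          */
        uint64_t sharers;  /* bit i set if processor i holds a copy; */
                           /* in Exclusive, the single set bit       */
                           /* identifies the owner                   */
    } DirEntry;
    ```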

### Two-level "Hierarchy"
* A popular middle ground
* Individual nodes are multiprocessors, connected nonhierarchically
* Coherence across nodes is directory-based
* Coherence within nodes is snooping or directory

### Challenges of Parallel Processing
* Limited parallelism available in programs
    * Running independent tasks
* High cost of communications
    * Threads must communicate to complete the task
    * Large latency of remote access
    * In existing shared-memory multiprocessors, **communication cost**:
        * from **50 clock cycles** (multicore) to **1000 clock cycles** (large-scale multiprocessors)
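* A standard way to quantify the limited-parallelism challenge is Amdahl's law: with $N$ processors and a parallelizable fraction $F_{parallel}$ of the work,
    * $\text{Speedup} = \dfrac{1}{\dfrac{F_{parallel}}{N} + (1 - F_{parallel})}$
    * e.g., $F_{parallel} = 0.9$ on $N = 100$ processors gives a speedup of only $1 / (0.009 + 0.1) \approx 9.2$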

### Examples
* ==**Example 1**==
    * ![](https://i.imgur.com/Bylu6Bg.png)
* ==**Example 2**==
    * ![](https://i.imgur.com/oRjaPDz.png)
    * ![](https://i.imgur.com/vIq5o4n.png)
* ==**Example 3**==
    * ![](https://i.imgur.com/yjE0oSa.png)
    * ![](https://i.imgur.com/kY8AkI6.png)
    * ![](https://i.imgur.com/wwgBvnv.png)
* ==**Example 4**==
    * ![](https://i.imgur.com/Uo8vtyo.png)
    * ![](https://i.imgur.com/Lfwg31N.png)
    * ![](https://i.imgur.com/voQRDrK.png)

### Conclusion
* Sharing cached data
    * **Coherence** (values returned by a read)
    * **Consistency** (when a written value will be returned by a read)
* Snooping cache over shared medium for smaller number of Multi-Processor
* Snooping
    * $\rightarrow$ uniform memory access; **bus makes snooping easier because of broadcast**
* **Directory has extra data structure** to keep track of state of all cache blocks
* **Distributing directory**
    * $\rightarrow$ scalable shared address multiprocessor
    * $\rightarrow$ Non uniform memory access

## (9-1) Memory Hierarchy

Reference: [淺談memory cache](http://opass.logdown.com/tags/%E8%A8%88%E7%AE%97%E6%A9%9F%E7%B5%84%E7%B9%94)

### Why Memory Hierarchy?
![](https://i.imgur.com/6JGGlY0.png)

### Levels of the Memory Hierarchy
![](https://i.imgur.com/6PkDhS9.png)

### Achieving higher memory bandwidth
![](https://i.imgur.com/LvKVca1.png)
* Suppose
    * 1 clock cycle to send the address
    * 15 clock cycles for each DRAM access initiated
    * 1 clock cycle to send a word of data
* (a)  **One-word-wide memory organization**
    * If we have a **cache block of 4 words** and a **one-word-wide bank** of DRAMs, the miss penalty would be
        * ==1 + 4 * 15 + 4 * 1 = 65 cycles==
    * Number of bytes transferred per clock cycle for a single miss is
        * ==4(words) * 4(bytes/word) / 65 = 0.246==
    * The memory is **one-word-wide** and all the accesses are made **sequentially**.
* (b)  **Wide memory organization**
    * Increase the bandwidth to memory by **widening the memory and the buses** between the processor and memory
    * Allow **parallel access** to all the words of the block
    * A 4-word-wide memory, the miss penalty is
        * ==1 + (4/4) * 15 + (4/4) * 1 = 17 cycles==
    * A 2-word-wide memory, the miss penalty is
        * ==1 + (4/2) * 15 + (4/2) * 1 = 33 cycles==
* (c\) **Interleaved memory organization**
    * Increase the bandwidth to memory by **widening the memory but not the interconnection bus**
    * **Still need to pay a cost to transmit each word**, but we can **avoid paying the cost of access latency more than once**
    * **Interleaving**: memory chips organized in banks read/write multiple words in one access time instead of reading/writing single words each time
        * ==1 + 1 * 15 + 4 * 1 = 20 cycles==
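* In general, with 1 cycle for the address, 15 cycles per DRAM access, and 1 cycle per word transferred, the miss penalty for a $b$-word block is:
    * One-word-wide: $1 + b \times 15 + b \times 1$ cycles
    * $w$-word-wide: $1 + (b/w) \times 15 + (b/w) \times 1$ cycles
    * Interleaved over $b$ banks: $1 + 15 + b \times 1$ cycles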

### Cache : **The Principle of Locality**
* The Principle of Locality:
    * Program access a **relatively small portion of the address space** at any instant of time.
* Two Different Types of Locality:
    * **Temporal** Locality (Locality in *Time*): If an item is referenced, it will tend to be referenced again soon
        * loops
        * reuse
    * **Spatial** Locality (Locality in *Space*): If an item is referenced, items whose addresses are close by tend to be referenced soon
        * straightline code
        * array access
* For the last 15 years, HW has relied on locality for speed

### Cache
* Small, fast storage used to improve average access time to slow memory.
* Exploits spatial and temporal locality
* In computer architecture, almost everything is a cache!
    * Registers: a cache on variables
    * First-level cache: a cache on second-level cache
    * Second-level cache: a cache on memory
    * Memory: a cache on disk (virtual memory)
    * TLB: a cache on page table
    * Branch-prediction: a cache on prediction information
    * BTB: a cache of branch target
* ![](https://i.imgur.com/iy8jsj0.png)

### Review: Terminology
* **Hit**: data appears in some block in this level
    * **Hit Rate**: the fraction of memory access found in this level
    * **Hit Time**: Time to access the upper level which consists of RAM access time + Time to determine hit/miss
* **Miss**: data needs to be retrieve from a block in the lower level (Block `Y`)
    * **Miss Rate** = 1 - Hit Rate
    * **Miss Penalty**: Time to replace a block in the upper level + Time to deliver the block to the processor
* ==Hit Time << Miss Penalty==
* ![](https://i.imgur.com/PZc7GM9.png)

### Review: The Basics of Caches
* A direct-mapped cache with 8 entries
    * ![](https://i.imgur.com/zi4GGeq.png)
* A 4KB cache using 1 word block
    * ![](https://i.imgur.com/LFw556Y.png)
    * 1024 blocks * 4 bytes of data per block (32 bits / 8) = 4KB

### 4 Questions for Memory Hierarchy
* Q1: Where can a block be placed in the upper level?
	* Block placement
* Q2: How is a block found if it is in the upper level?
	* Block identification
* Q3: Which block should be replaced on a miss?
	* Block replacement
* Q4: What happens on a write?
	* Write strategy

### Block Placement
* Where can a block be placed in the upper level?
    * Fully Associative
    * Set Associative (n-way associative)
    * Direct Mapped (1-way associative)
    * ![](https://i.imgur.com/qaUpUWB.png)
* Cache Size : ![](https://i.imgur.com/b2qLzih.png)
    * ![](https://i.imgur.com/rFLSd6t.png)
    * ![](https://i.imgur.com/u63evKK.png)
* Example
    * ![](https://i.imgur.com/tQKrsXq.png)
    * ![](https://i.imgur.com/wdEECy7.png)

### Block identification
* How is a block found if it is in the upper level?
    * Index identifies set
    * Increasing associativity shrinks index, expands tag
    * ![](https://i.imgur.com/InQqdR0.png)
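* A sketch of the bit-field decomposition in C (`OFFSET_BITS`/`INDEX_BITS` are illustrative choices, assuming power-of-two sizes):
    ```c=
    #include <stdint.h>

    #define OFFSET_BITS 4   /* 16-byte blocks (illustrative) */
    #define INDEX_BITS  10  /* 1024 sets      (illustrative) */

    /* Split a 32-bit address into block offset, set index, and tag. */
    uint32_t block_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
    uint32_t set_index(uint32_t addr)    { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
    uint32_t tag(uint32_t addr)          { return addr >> (OFFSET_BITS + INDEX_BITS); }
    ```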

### Block replacement
* Which block should be replaced on a miss?
    * Easy for Direct Mapped
    * Set Associative or Fully Associative:
        * Random
            * Easy to implement
        * **LRU** (**Least Recently Used**)
            * Appealing, but hard to implement for high associativity

### Write strategy
* What happens on a write?
    * **Write through**
        * The information is written to **both the block in the cache and to the block in the lower-level memory**.
    * **Write back**
        * The information is written only to the block in the cache. The modified cache block is written to main memory only **when it is replaced**.
        * is block clean or dirty?
* Pros and Cons of each
    * WT: read misses cannot result in writes
    * WB: no repeated writes to same location
    * ![](https://i.imgur.com/fRYPWG4.png)
* What about a write miss? $\rightarrow$ **write allocate?**
    * **Write allocate** (fetch on write)
        * The block is loaded on a write miss
        * let writes to an un-cached address allocate a new cache line
        * Usually, **WB cache use WA** (hoping subsequent writes to that block will be captured by the cache)
    * **No write allocate** (write around)
        * The block is modified in the lower level and not loaded into the cache
        * **WT cache often use no-WA** (since subsequent writes to that block will still have to go to memory)
* **Write Buffer** for WT cache
    * WT always combined with **write buffers** so that we don’t have to wait for lower level memory
    * ![](https://i.imgur.com/ZWakuxk.png)
    * A Write Buffer is needed between the Cache and Memory
        * Processor: writes data into the cache and the write buffer
        * Memory controller: write contents of the buffer to memory
    * Write buffer is just a **FIFO**:
		* Typical number of entries: 4
		* Works fine if: Store frequency << (1 / DRAM write cycle)
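* A tiny one-line-cache model in C contrasting the two write policies (an illustrative sketch, not a real cache; `memory` is indexed directly by `addr`):
    ```c=
    #include <stdbool.h>

    typedef struct {
        unsigned tag;    /* which address this line holds     */
        int      data;   /* one-word block                    */
        bool     valid;
        bool     dirty;  /* WB only: modified since the fill? */
    } Line;

    static int memory[1024];

    /* Write-through + no-write-allocate. */
    void wt_write(Line *l, unsigned addr, int v) {
        if (l->valid && l->tag == addr)
            l->data = v;   /* write hit: update the cached copy */
        memory[addr] = v;  /* always write memory (via a write buffer in practice) */
    }

    /* Write-back + write-allocate. */
    void wb_write(Line *l, unsigned addr, int v) {
        if (!(l->valid && l->tag == addr)) {  /* write miss */
            if (l->valid && l->dirty)
                memory[l->tag] = l->data;     /* evict: write back the dirty block */
            l->tag   = addr;                  /* allocate: fetch the block */
            l->data  = memory[addr];
            l->valid = true;
        }
        l->data  = v;                         /* write only the cache... */
        l->dirty = true;                      /* ...memory is updated on eviction */
    }
    ```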

### Cache performance
* Miss-oriented Approach to Memory Access
    * ![](https://i.imgur.com/A2n7Wa9.png)
* Separating out Memory component entirely
    * AMAT = Average Memory Access Time
    * ![](https://i.imgur.com/eCQcNoS.png)
    * ![](https://i.imgur.com/NbLSXCV.png)
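* The figures show the standard formulas; written out:
    * $\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$
    * $\text{CPU time} = \text{IC} \times \left( \text{CPI}_{Execution} + \dfrac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty} \right) \times \text{Cycle Time}$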

### Impact on Performance
* ==Example 1==
    * ![](https://i.imgur.com/wn45fP6.png)
* ==Example 2== : **Harvard** Architecture
    * Separate Instruction & Data Memories (*right*)
        * ![](https://i.imgur.com/FVYQD9E.png)
    * ![](https://i.imgur.com/uqfMhVr.png)

### Impact of Cache
* ![](https://i.imgur.com/ET2YdW0.png)
* ![](https://i.imgur.com/VnUdQcV.png)

### The Cache Design Space
* **Several interacting dimensions**
    * ![](https://i.imgur.com/OcGhfzi.png)
* **The optimal choice is a compromise**
    * ![](https://i.imgur.com/plM32fE.png)
* ==**Simplicity often wins**==

### Improving Cache Performance
![](https://i.imgur.com/A2n7Wa9.png)
* Reduce the **miss rate**
* Reduce the **miss penalty**
* Reduce the **hit time**

### Reducing Misses
* Classifying Misses: **3C**
    * **Compulsory**
        * The **first access** to a block cannot be in the cache, so the block must be brought into the cache.
        * Also called **cold start misses** or **first reference misses**.
        * ==Misses in **even an Infinite Cache**==
    * **Capacity**
        * If the cache cannot contain all the blocks needed during execution of a program, **capacity misses will occur due to blocks being discarded and later retrieved**.
        * ==Misses in Fully Associative Size X Cache==
    * **Conflict**
        * If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set.
        * Also called **collision misses** or **interference misses**.
        * ==Misses in N-way Associative, Size X Cache==
* More recent, **4th C**:
    * **Coherence**
        * Misses caused by cache coherence.

### Classify Cache Misses, How?
* Infinite cache, fully associative
    * Compulsory misses
* Finite cache, fully associative
    * Compulsory misses + Capacity misses
* Finite cache, limited associativity
    * Compulsory misses + Capacity misses + Conflict misses

### 2:1 Cache Rule
```
[miss rate of 1-way associative cache size X]
~= [miss rate of 2-way associative cache size X/2]
```
* ![](https://i.imgur.com/uH4pBOF.png)

### How to Reduce Misses?
* In all cases, assume **total cache size not changed**:
* What happens if
    * Change Block Size:
        * Which of 3Cs is obviously affected? **Compulsory**
    * Change Associativity:
		* Which of 3Cs is obviously affected? **Conflict**
    * Change Algorithm / Compiler:
		* Which of 3Cs is obviously affected? **3Cs**

### 1. Reduce Misses via "**Larger Block Size**"
* ![](https://i.imgur.com/hD77Zhh.png)
* ![](https://i.imgur.com/TNq7zzj.png)
* Larger Block Size :
    * Miss Rate $\downarrow$
        * mainly **Compulsory Miss** $\downarrow$
        * but also Conflict & Capacity Miss $\downarrow$
    * **Miss Penalty** $\uparrow$
    * **AMAT** $\downarrow$
    * **Hit Time** $\uparrow$

### 2. Reduce Misses via "**Higher Associativity**"
* 2:1 Cache Rule
* Beware: Execution time is the only final measure
    * Will Clock Cycle Time (CCT) increase?
    * Hill(1988) suggested hit time for 2-way vs. 1-way
        * external cache + 10%
        * internal + 2%
* AMAT vs. Miss Rate
    * Assume CCT(set associative) = n times CCT(direct mapped), where
        * n = 1.10 for 2-way
        * n = 1.12 for 4-way
        * n = 1.14 for 8-way
    * ![](https://i.imgur.com/IGZlSD3.png)

### 3. Reducing Misses via a "**Victim Cache**"
* How to combine **fast hit time** of direct mapped yet **still avoid conflict misses**?
* **Victim Cache**: Add buffer to place data discarded from cache
* Jouppi(1990): 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
* Used in Alpha, HP machines
* ![](https://i.imgur.com/BlEVgsu.png)
    * From [Wiki](https://zh.wikipedia.org/wiki/CPU%E7%BC%93%E5%AD%98#%E5%8F%97%E5%AE%B3%E8%80%85%E7%BC%93%E5%AD%98)

### 4. Reducing Misses via "**Pseudo-Associativity**"
* How to combine the **fast hit time of a DM cache** with the **lower conflict misses of a 2-way SA cache**?
* Divide cache: on a miss, check other half of cache to see if there, if so have a **pseudo-hit** (slow hit)
    * ![](https://i.imgur.com/NTkvyrh.png)
* Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
    * Better for caches not tied directly to processor (L2)
    * Used in MIPS R10000 L2 cache, similar in UltraSPARC

### 5.6. Reducing Misses by "**Prefetching**"
* Category
    * Hardware Prefetching of Instructions & Data
    * Software Prefetching Data
* Instruction Prefetching
	* Alpha 21064 fetches 2 blocks on a miss
	* Extra block placed in **stream buffer**
		* Instruction stream buffer
		* Data stream buffer
	* On miss check stream buffer
* Works with data blocks too:
    * Jouppi(1990): 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 stream buffers caught 43%
    * Palacharla & Kessler(1994): for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
* Prefetching relies on having **extra** memory bandwidth that can be used without penalty
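* A sketch of software prefetching using GCC/Clang's `__builtin_prefetch` (the prefetch distance of 16 elements is an illustrative guess; in practice it is tuned to memory latency):
    ```c=
    /* Sum an array while prefetching ahead.  __builtin_prefetch(addr, rw,
     * locality): rw = 0 means read, locality = 1 means low temporal reuse. */
    double sum_with_prefetch(const double *x, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[i + 16], 0, 1);  /* fetch a future element's line */
            sum += x[i];
        }
        return sum;
    }
    ```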

### 7. Reducing Misses by Compiler Optimizations
* McFarling(1989) reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks **in software**
* Reorder procedures in memory so as to reduce misses
* Data
	* **Merging Arrays**
	    * improve spatial locality by single array of compound elements vs. 2 arrays
        ```cpp=
        /* Before: 2 sequential arrays */
        int val[SIZE];
        int key[SIZE];
        /* After: 1 array of structures */
        struct merge {
            int val;
            int key;
        } merged_array[SIZE];
        ```
        * Some programs reference multiple arrays in the same dimension with the same indices. (**these accesses will interfere with each other**)
        * Reducing conflicts by combining independent matrices into a single compound array so that **each cache block can contain the desired elements**; improve spatial locality
	* **Loop Interchange**
	    * change nesting of loops to access data in order stored in memory
	    * **row major or column major**
	    ```cpp=
        /* Before */
        for (k = 0; k < 100; k++)
            for (j = 0; j < 100; j++)
                for (i = 0; i < 5000; i++)
                    x[i][j] = 2 * x[i][j];
        /* After */
        for (k = 0; k < 100; k++)
            for (i = 0; i < 5000; i++)    // <- change
                for (j = 0; j < 100; j++) // <- change
                    x[i][j] = 2 * x[i][j];
        ```
        * Sequential accesses instead of striding through memory every 100 words; improved spatial locality
	* **Loop Fusion**
	    * Combine 2 independent loops that have the same looping structure and some overlapping variables
	    ```cpp=
        /* Before */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                a[i][j] = 1/b[i][j] * c[i][j];
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                d[i][j] = a[i][j] + c[i][j];
        /* After */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                a[i][j] = 1/b[i][j] * c[i][j];
                d[i][j] = a[i][j] + c[i][j];
            }
        ```
	    * 2 misses per access to `a` & `c` vs. one miss per access; improve spatial locality
	* **Blocking**
	    * Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
	    ```cpp=
        /* Matrix multiplication */
        /* Before */
        for (i = 0; i < N; i = i+1)
            for (j = 0; j < N; j = j+1) {
                r = 0;
                for (k = 0; k < N; k = k+1) {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = r;
            }
        ```
        * Two Inner Loops:
            * Read all $N^2$ elements of `z[]`
            * Read $N$ elements of 1 row of `y[]` repeatedly
            * Write $N$ elements of 1 row of `x[]`
        * Idea: compute on a $B \times B$ submatrix that fits in the cache
        ```cpp=
        /* After */
        for (jj = 0; jj < N; jj += B)
            for (kk = 0; kk < N; kk += B)
                for (i = 0; i < N; i++)
                for (j = jj; j < min(jj+B, N); j++) {
                        r = 0;
                        for (k = kk; k < min(kk+B, N); k++) {
                            r = r + y[i][k] * z[k][j];
                        }
                        x[i][j] = x[i][j] + r;
                    }
        ```
        * `B` called **Blocking Factor**

### Summary on reducing miss rate
![](https://i.imgur.com/6q29uJx.png)
* 3C : Compulsory, Capacity, Conflict
    1. Reducing Misses via Larger Block Size
    2. Reducing Misses via Higher Associativity
    3. Reducing Misses via Victim Cache
    4. Reducing Misses via Pseudo-Associativity
    5. Reducing Misses by HW Prefetching Instr, Data
    6. Reducing Misses by SW Prefetching Data
    7. Reducing Misses by Compiler Optimizations
* Remember the danger of concentrating on just one parameter when evaluating performance

### Reducing Cache Miss Penalty
1. Read Priority over Write on Miss
2. Sub-block Placement
3. Early Restart and Critical Word First
4. Non-blocking Caches to reduce stalls on misses
5. Add a second-level cache

### 1. Read Priority over Write on Miss
* **Write through with write buffer**
	* If simply wait for write buffer to empty, might increase read miss penalty
	* Check write buffer contents before a read; if there are no conflicts, let the memory access continue
* **Write Back**
	* Read miss replaces dirty block
	* Normal: Write dirty block to memory, then do the read (**slow**)
	* **Instead**: Copy the dirty block to a write buffer, then do the read, and then do the write
	* CPU stall less since restarts as soon as read is done

### 2. Sub-block Placement
* **Don’t** have to load **full block** on a miss
* Have a valid bit per sub-block to indicate validity
* ![](https://i.imgur.com/8Bul32h.png)

### 3. Early Restart and Critical Word First
* Don't **wait** for full block to be loaded before restarting CPU
	* **Early restart**
		* as soon as requested word of the block arrives, send it to the CPU and let the CPU continue execution
	* **Critical word first**
		* request the missed word first from memory and send it to the CPU as soon as it arrives
		* Let the CPU continue execution while filling the rest of the words in the block
		* Also called **wrapped fetch** or **requested word first**
* ==Generally useful only in large blocks==

### 4. Non-blocking Caches to reduce stalls on misses
![](https://i.imgur.com/4bC6H4q.png)
from [PTT](https://www.ptt.cc/bbs/Grad-ProbAsk/M.1285752708.A.6D0.html)
* **Non-blocking cache** or **lockup-free cache** allow data cache to continue to supply cache hits during a miss
	* requires F/E bits on registers or out-of-order execution
	* requires multi-bank memories
* "**hit under miss**" reduces the effective miss penalty by working during miss
* "**hit under multiple miss**" or "**miss under miss**" may further lower the effective miss penalty by overlapping multiple misses
	* Significantly increases the **complexity** of the cache controller as there can be **multiple outstanding memory accesses**
	* Requires multiple memory banks (otherwise it cannot be supported)
	* Pentium Pro allows 4 outstanding memory misses
* Example
    * ![](https://i.imgur.com/mu6L38C.png)
    * ![](https://i.imgur.com/ZtAPTtZ.png)

### 5. Add a second-level cache
* L2 Equations
    * ![](https://i.imgur.com/o4UmIJc.png)
* Definitions:
	* **Local miss rate**
	    * misses in this cache divided by the total number of memory accesses to this cache 
	    * Local miss rate for L2 = Miss rate_L2
	* **Global miss rate**
	    * misses in this cache divided by the total number of memory accesses generated by the CPU
	    * Global miss rate for L2 = Global miss rate for L1 * Miss Rate_L2 = Miss Rate_L1 * Miss Rate_L2
	* ==**Global Miss Rate is what matters**==
* Example
    * Given the data below, what is the impact of second-level cache associativity on the miss penalty?
    * ![](https://i.imgur.com/UqA2Hn7.png)
    * ![](https://i.imgur.com/SgmE8EO.png)
    * Conclusion: **Higher associativity or pseudo-associativity are worth considering** because
        * They have small impact on the second-level hit time
        * So much of the AMAT is due to misses in the second level cache
* Usually for L2 cache:
    * L2 >> L1 (e.g., for a 32 KB L1 cache)
    * L2 not tied to CPU clock cycle
    * Higher associativity or pseudo associativity
    * Larger block size
    * Multilevel inclusion property
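* Written out, the standard two-level equations (as in the L2-equations figure above):
    * $\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$
    * $\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$
    * $\Rightarrow \text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times (\text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2})$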

### Summary on Reducing Miss Penalty
![](https://i.imgur.com/rPBlIvG.png)
* Techniques
    1. Read priority over write on miss
    2. Sub-block placement
    3. Early Restart and Critical Word First on miss
    4. Non-blocking Caches (Hit under Miss, Miss under Miss)
    5. Second Level Cache
* Can be applied recursively to Multilevel Caches
	* Danger is that time to DRAM will grow with multiple levels in between

### Cache Optimization Summary on miss rate and miss penalty
* ![](https://i.imgur.com/ermv9za.png)

### Reducing Hit time
* Hit time is critical because it affects the clock cycle time
* A fast hit time is multiplied in importance beyond the AMAT formula because it helps everything
* Fast hit time via **small and simple cache**

### Disadvantage of Set Associative Cache
* N-way Set Associative Cache v. Direct Mapped Cache:
	* N comparators vs. 1
	* Extra MUX delay for the data
	* Data comes AFTER Hit/Miss
* In a DM cache, cache block is available **BEFORE** Hit/Miss:
	* Possible to assume a hit and continue. Recover later if miss.
* ![](https://i.imgur.com/mJuw7fE.png)

### Fast Hit time via Small and Simple Caches
* Index tag memory and then compare takes time
* **Small** cache can help hit time since smaller memory takes less time to index
	* E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
	* Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip
* **Simple** $\rightarrow$ direct mapping
	* Can overlap tag check with data transmission since no choice
* Access time estimate for 90 nm using CACTI model 4.0
	* Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
	* ![](https://i.imgur.com/a4nwKVj.png)

## (9-2) Virtual Memory

### For Virtual Memory
![](https://i.imgur.com/k9E6OZi.png)
* ![](https://i.imgur.com/in0a19V.png)
* **Page size usually large**
	* Smaller **page table**, save memory, reduce TLB misses
	* Transferring larger pages to/from secondary storage is more efficient
* ![](https://i.imgur.com/ISlwFA7.png)

### The Limits of Physical Addressing
![](https://i.imgur.com/cjLhsoX.png)

### Solution: Add a Layer of Indirection
![](https://i.imgur.com/OEgtdTH.png)

### Three Advantages of Virtual Memory
* **Translation**:
	* Program can be given **consistent view** of memory, even though physical memory is scrambled
	* Makes **multithreading** reasonable (now used a lot!)
	* Only the most important part of program ("**Working Set**") must be in **physical memory**.
	* Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later.
* **Protection**:
	* Different threads (or processes) protected from each other.
	* Different pages can be given special behavior
	    * **Read Only, Invisible to user programs, etc.**
	* Kernel data protected from User programs
	* Very important for protection from malicious programs
* **Sharing**:
	* Can map same physical page to multiple users ("Shared memory")

### Page tables encode virtual address spaces
![](https://i.imgur.com/8I2vact.png)

### Details of Page Table
![](https://i.imgur.com/0Dc55of.png)

### Page tables may not fit in memory!
![](https://i.imgur.com/FLtcybb.png)

### VM and Disk: Page replacement policy
![](https://i.imgur.com/vzmzGyC.png)

### **Translation Look-Aside Buffer** (**TLB**)
![](https://i.imgur.com/uN4nxhi.png)

### The TLB caches page table entries
![](https://i.imgur.com/UWCS6IE.png)
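A sketch in C of the lookup order the figures above describe — a small direct-mapped TLB in front of a single-level page table (all sizes and names are illustrative; page-fault handling is omitted):
```c=
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                  /* 4 KB pages (illustrative)   */
#define TLB_SIZE  64                  /* 64-entry TLB (illustrative) */

typedef struct { bool valid; uint32_t vpn, pfn; } TlbEntry;

static TlbEntry tlb[TLB_SIZE];
static uint32_t page_table[1u << 20]; /* PFN per VPN */

/* Translate a virtual address: try the TLB first; on a TLB miss,
 * walk the page table and refill the TLB entry. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;
    uint32_t off = vaddr & ((1u << PAGE_BITS) - 1);
    TlbEntry *e = &tlb[vpn % TLB_SIZE];
    if (!(e->valid && e->vpn == vpn)) { /* TLB miss: walk the page table */
        e->valid = true;
        e->vpn = vpn;
        e->pfn = page_table[vpn];
    }
    return (e->pfn << PAGE_BITS) | off; /* translated physical address */
}
```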

## (10) Storage

### Magnetic Disks
![](https://i.imgur.com/FUY16uV.png)
* Average rotational latency = 0.5 rotations / rotation speed
    * for 3600 RPM: 0.5 / 3600 min = 8.33 ms

### Redundant Arrays of (Inexpensive) Disks (**RAID**)
* Files are "striped" across multiple disks
    * spreading data over multiple disks
* **Force accesses to several disks if the data files are large**
* Redundancy yields high data availability
	* **Availability**: service still provided to user, even if some components failed
* Disks will still fail
* Contents reconstructed from data redundantly stored in the array
	* **Capacity penalty** to store redundant info
	* **Bandwidth penalty** to update redundant info
* [Wiki (中文)](https://zh.wikipedia.org/wiki/RAID)

### RAID 0
* No redundancy
* Sometimes nicknamed JBOD (Just a Bunch of Disks)
* Used as a **measuring** stick for other RAID levels in terms of cost, performance and dependability.

### RAID 1 : Disk Mirroring / Shadowing
![](https://i.imgur.com/mhVSPYc.png)
* Each disk is fully duplicated onto its "**mirror**"
    * Very high availability can be achieved
* **Bandwidth sacrifice on write**:
    * **Logical write = two physical writes**
* Reads may be optimized
* **Most expensive solution: 100% capacity overhead**

### RAID 2
* Inspired by applying memory-style error correcting codes to disks
* RAID 2 not used since other more interesting schemes

### RAID 3 : Parity Disk
![](https://i.imgur.com/nKjDCQW.png)
* Sum computed across recovery group to protect against hard disk failures, **stored in P disk**
* Logically, a single high capacity, high transfer rate disk
    * **good for large transfers**
* **Wider arrays reduce capacity costs, but decrease availability**
* 33% capacity cost for parity if 3 data disks and 1 parity disk

### RAID 4 : High I/O Rate Parity
![](https://i.imgur.com/NpxM2Q1.png)
* **Small Writes**
    * RAID 4 works well for small reads
	* Small writes (write to one disk):
		* Option 1: read other data disks, create new sum and write to Parity Disk
		* **Option 2: since P has old sum, compare old data to new data, add the difference to P**
		* ==**RAID 4 takes Option 2**== (see the parity formula below)
	* In RAID 4, Small writes are limited by Parity Disk
	    * Write to D0, D5 both also write to P disk
	    * ![](https://i.imgur.com/VWHESjo.png)
    * ![](https://i.imgur.com/Dk61xBt.png)
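* Option 2's parity update, written out: with old data $D_{old}$, new data $D_{new}$, and old parity $P_{old}$,
    * $P_{new} = P_{old} \oplus D_{old} \oplus D_{new}$
    * So a small write costs 2 reads ($D_{old}$, $P_{old}$) and 2 writes ($D_{new}$, $P_{new}$), independent of the number of data disks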

### RAID 5 : High I/O Rate Interleaved Parity
![](https://i.imgur.com/DxCopFU.png)

### RAID 6 : Recovering from 2 failures
* Why > 1 failure recovery?
	* operator accidentally replaces the wrong disk during a failure
	* since disk bandwidth is growing more slowly than disk capacity, the MTTR (mean time to repair) of a disk in a RAID system is increasing
        * increases the chances of a 2nd failure during repair since it takes longer
	* reading much more data during reconstruction meant increasing the chance of an uncorrectable media failure, which would result in data loss
* Network Appliance’s **row-diagonal parity** or RAID-DP
    * ![](https://i.imgur.com/H5eFPR7.png)
* Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
* Since it is protecting against a double failure, it adds **two check blocks per stripe of data**.
    * If p+1 disks total, p-1 disks have data; assume p=5
* Row parity disk is just like in RAID 4
    * Even parity across the other 4 data blocks in its stripe
* Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
* Example p = 5
	* Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
        * Since each diagonal misses one disk, and all diagonals miss a different disk
	* Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
	* Process continues until two failed disks are restored