###### tags: `計結`
# 1062 - Computer Architecture - 8~10
## (8) Multiprocessors
### Why Multiprocessors?
* A growth in data-intensive applications
* A growing interest in server markets
* Complexity of current microprocessors
* Collecting several existing processors is much easier than redesigning one
* Steady improvement in parallel software (scientific apps, databases, OS)
* Long term goal of the field:
* Scale the number of processors to match the available budget and desired performance
### Categories
```=
S = Single
M = Multiple
I = Instruction stream
D = Data stream
```
* **SISD**
* Uni-processor
* **SIMD**
* **Data level parallelism**
* Each processor **has its own data memory**, but there is a **single instruction memory and control processor**, which fetches and dispatches instructions
* Multimedia, graphics performance
* **MISD**
* No commercial product of this type
* **MIMD**
* Each processor has its own instructions and operates on its own data
* **==Thread== level parallelism**
### Data Level Parallelism: **SIMD**
* Vector architectures
* SIMD extensions
* Graphics Processor Units (GPUs)
* **Advantage**
* SIMD architectures can exploit **significant data level parallelism** for:
* **Matrix-oriented** scientific computing
* **Media-oriented** image and sound processors
* SIMD is **more energy efficient** than MIMD
* Only needs to fetch one instruction per data operation
* Makes SIMD attractive for personal mobile devices
* SIMD allows programmer to continue to think sequentially
### Vector Architectures
* Basic idea
* Read sets of data elements into "**vector registers**"
* Operate on those registers
* Disperse the results back into memory
* Registers are controlled by compiler
* Used to hide memory latency
* Leverage memory bandwidth
* The basic structure for a vector architecture
* 
### Example architecture: **VMIPS**
* **Vector registers**
* Each register holds a 64-element, 64 bits/element vector
* Register file has 16 read ports and 8 write ports
* **Vector functional units**
* Fully pipelined
* Data and control hazards are detected
* **Vector load-store unit**
* Fully pipelined
* One word per clock cycle after initial latency
* **Scalar registers**
* 32 general-purpose registers
* 32 floating-point registers
### VMIPS Instructions
* `ADDVV.D Vd, Vs, Vt`
* add two vectors `Vs` and `Vt`, then put each result in `Vd`
* `ADDVS.D Vd, Vs, Ft`
* add scalar `Ft` to **each** element of vector `Vs`, then put each result in `Vd`
* `SUBVV.D Vd, Vs, Vt`
* `SUBVS.D Vd, Vs, Ft`
* `SUBSV.D Vd, Fs, Vt`
* `MULVV.D`, `MULVS.D`, `DIVVV.D`, `DIVVS.D`, `DIVSV.D`
* `LV Vd, (Rs)`, `SV (Rd), Vs`
* vector load / store starting at the address in `Rs` / `Rd`
* [For more instruction](http://www.cburch.com/cs/330/reading/vmips-ref.pdf)
### **SAXPY** (Scalar Alpha X Plus Y)
* $Y = \alpha X + Y$
* $X, Y$ are vectors, initially resident in memory
* $\alpha$ is a scalar
* For different data type, there are:
* **`SAXPY`**: Single precision
* **`DAXPY`**: Double precision
* `CAXPY`: Complex number with Single precision
* `ZAXPY`: Complex number with Double precision
* This loop is the so-called SAXPY or DAXPY loop; it forms the inner loop of the Linpack benchmark
```c=
for (int i = m; i < n; i++) {
    y[i] = a * x[i] + y[i];
}
```
* SAXPY MIPS code
```=
L.D F0, a ; load scalar a
DADDIU R4, Rx, #512 ; last address to load
Loop:
L.D F2, 0(Rx) ; load X[i]
MUL.D F2, F2, F0 ; a*X[i]
L.D F4, 0(Ry) ; load Y[i]
ADD.D F4, F4, F2 ; a * X + Y
S.D F4, 0(Ry) ; store into Y[i]
DADDIU Rx, Rx, #8 ; increment index to X
DADDIU Ry, Ry, #8 ; increment index to Y
DSUBU R20, R4, Rx ; compute bound
BNEZ R20, Loop ; check if done
```
* SAXPY VMIPS code
```=
L.D F0, a ; load scalar a
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV.D V4, V2, V3 ; add
SV Ry, V4 ; store the result
```
* Requires 6 instructions vs. almost 600 for MIPS
### Comparison : MIPS, VMIPS
* The overhead instructions are not present in the VMIPS
* The compiler produces vector instructions for such a sequence: The code is **vectorized** or **vectorizable**
* Loops can be vectorized if there are **no loop-carried dependences**
* VMIPS has fewer **pipeline load interlocks**
* Vector architects call **forwarding of element dependent operations** "==chaining=="
### Vector Execution Time
* Execution time depends on three factors:
* Length of operand vectors
* Structural hazards
* Data dependencies
* VMIPS functional units consume one element per clock cycle
* Execution time is approximately the vector length
* **Convoy**
* Set of vector instructions that could potentially execute together
### Chimes
* Sequences with **RAW** dependency hazards can be in the same convoy via chaining
* **Chaining**
* Allows a vector operation to start as soon as the individual elements of its vector source operand become available
* **Chime**
* Unit of time to execute one convoy
* $m$ convoys executes in $m$ chimes
* For vector length of $n$ , requires $m$ * $n$ clock cycles
### Example : Total clock cycle with chimes
```=
LV V1, Rx ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV Ry, V4 ; store the sum
```
* Convoys: 3 convoys
```=
LV MULVS.D
LV ADDVV.D
SV
```
* Total clock cycles = 3 chimes × 64 elements per vector = 192 clock cycles
### Multiple Lanes
* Using parallel pipelines
* Allows for multiple hardware lanes
* Example
* 
* Structure of a vector unit containing 4 lanes
* 
### Vector Length Register
* Vector length not known at compile time?
* Use **Vector Length Register** (**VLR**)
* Use ==strip mining== for vectors over the **maximum vector length** (**MVL**)
* 
* For vector length $n$, the first (odd-sized) piece handles $m = n \bmod \text{MVL}$ elements, and every later piece handles exactly MVL elements
* So the length of each vector operation is $\le$ MVL, as in the picture and the C sketch below
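* A C sketch of the strip-mined DAXPY loop under these definitions (the `m = n % MVL` odd-sized piece runs first, then full MVL-length pieces); variable names follow the H&P presentation:
```c=
/* Strip-mined DAXPY: the short piece of length n % MVL is executed first,
   then full pieces of exactly MVL elements each. */
#define MVL 64                         /* maximum vector length of VMIPS */

void daxpy_strip_mined(int n, double a, double *X, double *Y) {
    int low = 0;
    int VL = n % MVL;                  /* length of the odd-sized piece  */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++)   /* runs for length VL     */
            Y[i] = a * X[i] + Y[i];            /* main DAXPY operation   */
        low = low + VL;                /* start of the next piece        */
        VL = MVL;                      /* later pieces use the full MVL  */
    }
}
```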
### Vector Mask Registers
* Consider:
```cpp=
for (i = 0; i < 64; i++) {
if (X[i] != 0) {
X[i] = X[i] – Y[i];
}
}
```
* Use vector mask register to "**disable**" elements:
```=
LV V1, Rx ; load vector X into V1
LV V2, Ry ; load vector Y into V2
L.D F0, #0 ; load FP zero into F0
SNEVS.D V1, F0 ; sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1, V1, V2 ; subtract under vector mask
SV Rx, V1 ; store the result in X
```
### **MIMD**
* MIMD offers **flexibility**
* With correct hardware and software support, MIMDs can function as
* **Single-user** multiprocessors focusing on high performance for **one application**
* **Multi-programmed** multiprocessors running many tasks simultaneously
* Or some combination of these functions
* MIMD can build on the cost-performance advantages of **off-the-shelf** processors
* Nearly all multiprocessors built today **use the same microprocessors found in workstations and single-processor servers**.
* **Multi-core** chips leverage the design investment in a single processor core by **replicating** it
### MIMD - Clusters
* One of the most popular classes of MIMD computers; clusters often use **standard components** and **standard network technology**
* Clusters
* **Commodity clusters**
* Often assembled by **users or computer center directors**, rather than vendors
* For applications focusing on throughput and requiring almost no communication among threads
* **Web serving**, **multiprogramming**, **transaction-processing applications**
* inexpensive
* **Custom clusters**
* A designer customizes the detailed node design or the interconnection network
* For parallel applications that can **exploit large amounts of parallelism on a single problem**
* Such applications require **significant amount of communication during computation**
### MIMD - Multicore
* On-chip multi-processing, single-chip multiprocessing, **multi-core**
* Place multiple processors on a **single die**, typically **share L2 cache**, **memory** or **I/O bus**
* Usually, each processor executes a different process
* **Process**: Independent of each other (usually)
* **Thread**: May share code and address space
### Major MIMD Styles
* **Centralized shared memory**
* Symmetric (shared-memory) multiprocessors
* Uniform Memory Access time
* Shared Memory Processor
* 
* **Decentralized memory: a memory module with each CPU**
* Get more memory bandwidth
* Lower memory latency
* Major ==Drawback==: **Longer communication latency**
* ==Drawback==: Software model more complex
* 
### Models for communication
* **Shared address space**
* The physically separate memories can be **addressed as one logically shared address space**
* Called ==**Distributed shared-memory**== (non-uniform memory access, NUMA)
* Access time depends on the location of the data
* **Message passing**
* Address spaces are logically disjoint and cannot be addressed by a remote processor
### Symmetric Multi-Processors (SMP)
* The **large**, **multilevel caches reduce the memory demand** of a processor
* More than one processor to **share the bus** to the memory
* Different organizations:
* Processor and cache on an extension board
* **Integrated on the main-board** (most popular)
* Integrated on the same chip
* Private data
* Used by single processor
* Shared data
* Used by multiple processors
* Provide communication among processors through read/write the shared data
* The new problem: ==**cache coherence**==
* For a single memory location `X`, R/W by 2 processors `A` and `B`
* 
### Coherency
* Informally:
* **Any read must return the most recent write**
* Formally: A memory system is coherent if
* `P` writes `x` and then `P` reads `x`. If there are **no other writes of `x` by another processor** between the write and the read, the read always returns the value written by `P`.
* If `P1` writes `x` and `P2` reads it, `P1`’s write will be seen by `P2` if the read and write are **sufficiently far apart** and **no other writes to `x`** between the two accesses.
* Writes to a single location are serialized:
* **seen in one order (Latest write will be seen)**
### **Coherence** and **Consistency**
* **Coherence**:
* defines **what values** can be returned by a read
* **Consistency**:
* defines **when** a written value will be returned by a read
* We cannot require that a read of `X` **instantaneously** see the value written for `X` by some other processor
* The issue of **exactly when a written value must be seen by a reader** is defined by a memory consistency model
### Coherence Solutions
* Snooping:
* **No centralized state** is kept.
* **Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block**
* Caches are on a shared-memory bus
* All cache controllers **monitor or snoop on the bus** to determine if they have a copy of a block that is requested on the bus
* Dominates for **small scale** machines (most of the market)
* Directory-Based Schemes
* The sharing status of a block is kept in **one location** called ==**directory**==
* Keep track of what is being shared in one place (**centralized** logically, but can be distributed physically)
* Has slightly higher implementation **overhead** than snooping
* Can **scale to larger processor counts**
### Basic Snooping Protocols
* Write **Invalidate** Protocol:
* Multiple readers, single writer
* Write to shared data: an invalidate is sent to all caches which snoop and **invalidate** any copies
* Read Miss:
* Write-through: memory is always up-to-date
* Write-back: snoop in caches to find the most recent copy
* Example:
* 
* Write **Broadcast** Protocol
* Also called **Write Update Protocol**
* Write to shared data: broadcast on bus, processors snoop, and **update** any copies
* Read miss: memory is always up-to-date
* typically write through
### States of Snooping Protocol
* For **invalidation** protocol, **write-back** cache
* Each **block of memory** is in one state:
* Shared : Clean in all caches and up-to-date in memory
* Exclusive : Dirty in exactly one cache
* Invalid : Not in any caches
* Each **cache block** is in one state (track these):
* Shared : block can be read
* Exclusive : cache has the only copy, it’s writeable, and dirty
* Invalid : block contains no data
* State Diagram
* Cache state transitions based on **requests from CPU**
* 
* Cache state transitions based on **requests from bus**
* 
* Cache Coherence state diagram with the state transitions induced by
* black : the local processor
* gray : the bus activities
* 
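* A minimal C sketch of the three cache-block states and the local-CPU write transitions for an invalidate-based write-back protocol (function and message names are illustrative, not from the slides):
```c=
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;   /* per cache block */

/* Stand-in for placing an invalidate or write-miss message on the shared bus. */
static void bus_broadcast(const char *msg) { printf("bus: %s\n", msg); }

/* State transition when the local CPU writes the block (invalidate protocol). */
BlockState cpu_write(BlockState s) {
    switch (s) {
    case INVALID:     /* write miss: fetch the block and gain ownership      */
        bus_broadcast("write miss");
        return EXCLUSIVE;
    case SHARED:      /* clean copy: invalidate all other copies, then write */
        bus_broadcast("invalidate");
        return EXCLUSIVE;
    case EXCLUSIVE:   /* already the only (dirty) copy: write hit, no bus op */
    default:
        return EXCLUSIVE;
    }
}
```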
### Directory : A Scalable Approach
* Every memory block has associated directory information
* keeps track of copies of cached blocks and their states
* **on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary**
* in scalable networks, communication with directory and copies is through network transactions
* Many alternatives for organizing directory information
* 
### Directory-based coherence
* Local node:
* requesting node
* Home node:
* where the memory location and the directory entry of an address reside
* Remote node:
* a node that has a copy of the cache block, whether exclusive or shared
### States of Directory based Cache Coherence Protocol
* Similar to Snoopy Protocol: 3 states
* Shared : ≥ 1 processors have data, memory up-to-date
* Exclusive : 1 processor (**owner**) has data; memory out-of-date
* Uncached : no processor has a copy of the cache block; not valid in any cache
* In addition to cache state, must track **which processors** have data when in the **shared state** (usually bit vector, 1 if processor has copy)
* State Diagram
* An individual cache block in a directory-based system
* 
* The directory has the same states and structure as the transition diagram for an individual cache block
* 
### Two-level "Hierarchy"
* A popular middle ground
* Individual nodes are multiprocessors, connected nonhierarchically
* Coherence across nodes is directory-based
* Coherence within nodes is snooping or directory
### Challenges of Parallel Processing
* Limited parallelism available in programs
* Running independent tasks
* High cost of communications
* Threads must communicate to complete the task
* Large latency of remote access
* In existing shared-memory multiprocessors, **communication cost**:
* from **50 clock cycles** (multicore) to **1000 clock cycles** (large-scale multiprocessors)
### Examples
* ==**Example 1**==
* 
* ==**Example 2**==
* 
* 
* ==**Example 3**==
* 
* 
* 
* ==**Example 4**==
* 
* 
* 
### Conclusion
* Sharing cached data
* **Coherence** (values returned by a read)
* **Consistency** (when a written value will be returned by a read)
* Snooping cache over shared medium for smaller number of Multi-Processor
* Snooping
* $\rightarrow$ uniform memory access; **bus makes snooping easier because of broadcast**
* **Directory has extra data structure** to keep track of state of all cache blocks
* **Distributing directory**
* $\rightarrow$ scalable shared address multiprocessor
* $\rightarrow$ Non uniform memory access
## (9-1) Memory Hierarchy
Reference: [淺談memory cache](http://opass.logdown.com/tags/%E8%A8%88%E7%AE%97%E6%A9%9F%E7%B5%84%E7%B9%94)
### Why Memory Hierarchy?

### Levels of the Memory Hierarchy

### Achieving higher memory bandwidth

* Suppose
* 1 clock cycle to send the address
* 15 clock cycles for each DRAM access initiated
* 1 clock cycle to send a word of data
* (a) **One-word-wide memory organization**
* If we have a **cache block of 4 words** and a **one-word-wide bank** of DRAMs, the miss penalty would be
* ==1 + 4 * 15 + 4 * 1 = 65 cycles==
* Number of bytes transferred per clock cycle for a single miss is
* ==4(words) * 4(bytes/word) / 65 = 0.246==
* The memory is **one-word-wide** and all the accesses are made **sequentially**.
* (b) **Wide memory organization**
* Increase the bandwidth to memory by **widening the memory and the buses** between the processor and memory
* Allow **parallel access** to all the words of the block
* A 4-word-wide memory, the miss penalty is
* ==1 + (4/4) * 15 + (4/4) * 1 = 17 cycles==
* A 2-word-wide memory, the miss penalty is
* ==1 + (4/2) * 15 + (4/2) * 1 = 33 cycles==
* (c\) **Interleaved memory organization**
* Increase the bandwidth to memory by **widening the memory but not the interconnection bus**
* **Still need to pay a cost to transmit each word**, but we can **avoid paying the cost of access latency more than once**
* **Interleaving**: memory chips organized in banks read/write multiple words in one access time instead of reading/writing single words each time
* ==1 + 1 * 15 + 4 * 1 = 20 cycles==
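* Resulting bandwidth for the same 4-word (16-byte) block, using the miss penalties above:
    * (a) one-word-wide: $16 / 65 \approx 0.25$ bytes per clock cycle
    * (b) 4-word-wide: $16 / 17 \approx 0.94$ bytes per clock cycle
    * (c) interleaved: $16 / 20 = 0.80$ bytes per clock cycle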
### Cache : **The Principle of Locality**
* The Principle of Locality:
* Program access a **relatively small portion of the address space** at any instant of time.
* Two Different Types of Locality:
* **Temporal** Locality (Locality in *Time*): If an item is referenced, it will tend to be referenced again soon
* loops
* reuse
* **Spatial** Locality (Locality in *Space*): If an item is referenced, items whose addresses are close by tend to be referenced soon
* straight-line code
* array access
* For the last 15 years, hardware has relied on locality for speed
### Cache
* Small, fast storage used to improve average access time to slow memory.
* Exploits spatial and temporal locality
* In computer architecture, almost everything is a cache!
* Registers: a cache on variables
* First-level cache: a cache on second-level cache
* Second-level cache: a cache on memory
* Memory: a cache on disk (virtual memory)
* TLB: a cache on page table
* Branch-prediction: a cache on prediction information
* BTB: a cache of branch target
* 
### Review: Terminology
* **Hit**: data appears in some block in this level
* **Hit Rate**: the fraction of memory access found in this level
* **Hit Time**: Time to access the upper level which consists of RAM access time + Time to determine hit/miss
* **Miss**: data needs to be retrieve from a block in the lower level (Block `Y`)
* **Miss Rate** = 1 - Hit Rate
* **Miss Penalty**: Time to replace a block in the upper level + Time to deliver the block to the processor
* ==Hit Time << Miss Penalty==
* 
### Review: The Basics of Caches
* A direct-mapped cache with 8 entries
* 
* A 4KB cache using 1 word block
* 
* 1024 blocks × 4 bytes of data per block (32 bits / 8) = 4 KB
### 4 Questions for Memory Hierarchy
* Q1: Where can a block be placed in the upper level?
* Block placement
* Q2: How is a block found if it is in the upper level?
* Block identification
* Q3: Which block should be replaced on a miss?
* Block replacement
* Q4: What happens on a write?
* Write strategy
### Block Placement
* Where can a block be placed in the upper level?
* Fully Associative
* Set Associative (n-way associative)
* Direct Mapped (1-way associative)
* 
* Cache Size = (number of sets) × (blocks per set, i.e. associativity) × (block size)
* 
* 
* Example
* 
* 
### Block identification
* How is a block found if it is in the upper level?
* Index identifies set
* Increasing associativity shrinks index, expands tag
* 
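* A C sketch of how a 32-bit address splits into tag / index / block offset; the block size and number of sets here are illustrative assumptions:
```c=
#include <stdint.h>

#define BLOCK_SIZE 64u          /* bytes per block (assumed)               */
#define NUM_SETS   256u         /* number of sets = blocks / associativity */

uint32_t block_offset(uint32_t addr) { return addr % BLOCK_SIZE; }
uint32_t set_index(uint32_t addr)    { return (addr / BLOCK_SIZE) % NUM_SETS; }
uint32_t tag(uint32_t addr)          { return (addr / BLOCK_SIZE) / NUM_SETS; }
```
* Doubling the associativity halves `NUM_SETS`, removing one index bit and adding it to the tag (the "shrinks index, expands tag" effect above)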
### Block replacement
* Which block should be replaced on a miss?
* Easy for Direct Mapped
* Set Associative or Fully Associative:
* Random
* Easy to implement
* **LRU** (**Least Recently Used**)
* Appealing, but hard to implement for high associativity
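* A minimal sketch of LRU victim selection within one set, using per-line timestamps (one of several possible implementations; 4-way associativity is assumed):
```c=
#include <stdint.h>

#define WAYS 4                            /* associativity (assumed)              */

typedef struct {
    uint32_t tag;
    uint64_t last_used;                   /* time of the most recent access       */
    int      valid;
} Line;

/* Pick the victim way in a set: an invalid line if one exists, else the LRU. */
int choose_victim(const Line set[WAYS]) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                     /* empty way: nothing is evicted        */
        if (set[w].last_used < set[victim].last_used)
            victim = w;                   /* older timestamp = less recently used */
    }
    return victim;
}
```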
### Write strategy
* What happens on a write?
* **Write through**
* The information is written to **both the block in the cache and to the block in the lower-level memory**.
* **Write back**
* The information is written only to the block in the cache. The modified cache block is written to main memory only **when it is replaced**.
* is block clean or dirty?
* Pros and Cons of each
* WT: read misses cannot result in writes
* WB: no repeated writes to same location
* 
* What about write miss? $\rightarrow$ **write_allocate?**
* **Write allocate** (fetch on write)
* The block is loaded on a write miss
* let writes to an un-cached address allocate a new cache line
* Usually, **WB caches use WA** (hoping subsequent writes to that block will be captured by the cache)
* **No write allocate** (write around)
* The block is modified in the lower level and not loaded into the cache
* **WT caches often use no-WA** (since subsequent writes to that block will still have to go to memory)
* **Write Buffer** for WT cache
* WT always combined with **write buffers** so that we don’t have to wait for lower level memory
* 
* A Write Buffer is needed between the Cache and Memory
* Processor: writes data into the cache and the write buffer
* Memory controller: write contents of the buffer to memory
* Write buffer is just a **FIFO**:
* Typical number of entries: 4
* Works fine if: Store frequency << (1 / DRAM write cycle)
### Cache performance
* Miss-oriented Approach to Memory Access
* 
* Separating out Memory component entirely
* AMAT = Average Memory Access Time
* 
* 
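* The formulas behind these two views (standard definitions):
    * Miss-oriented: $\text{CPU time} = IC \times \left( CPI_{\text{execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right) \times \text{Clock cycle time}$
    * Separating out the memory component: $\text{CPU time} = IC \times \left( \frac{\text{ALU ops}}{\text{Instruction}} \times CPI_{\text{ALU ops}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{AMAT} \right) \times \text{Clock cycle time}$
    * $\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}$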
### Impact on Performance
* ==Example 1==
* 
* ==Example 2== : **Harvard** Architecture
* Separate Instruction & Data Memories
* 
* 
### Impact of Cache
* 
* 
### The Cache Design Space
* **Several interacting dimensions**
* 
* **The optimal choice is a compromise**
* 
* ==**Simplicity often wins**==
### Improving Cache Performance

* Reduce the **miss rate**
* Reduce the **miss penalty**
* Reduce the **hit time**
### Reducing Misses
* Classifying Misses: **3C**
* **Compulsory**
* The **first access** to a block is not in the cache, so the block must be brought into the cache.
* Also called **cold start misses** or **first reference misses**.
* ==Misses in **even an Infinite Cache**==
* **Capacity**
* If the cache cannot contain all the blocks needed during execution of a program, **capacity misses will occur due to blocks being discarded and later retrieved**.
* ==Misses in Fully Associative Size X Cache==
* **Conflict**
* If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set.
* Also called **collision misses** or **interference misses**.
* ==Misses in N-way Associative, Size X Cache==
* More recent, **4th C**:
* **Coherence**
* Misses caused by cache coherence.
### Classify Cache Misses, How?
* Infinite cache, fully associative
* Compulsory misses
* Finite cache, fully associative
* Compulsory misses + Capacity misses
* Finite cache, limited associativity
* Compulsory misses + Capacity misses + Conflict misses
### 2:1 Cache Rule
```
[miss rate of 1-way associative cache size X]
~= [miss rate of 2-way associative cache size X/2]
```
* 
### How to Reduce Misses?
* In all cases, assume **total cache size not changed**:
* What happens if
* Change Block Size:
* Which of 3Cs is obviously affected? **Compulsory**
* Change Associativity:
* Which of 3Cs is obviously affected? **Conflict**
* Change Algorithm / Compiler:
* Which of 3Cs is obviously affected? **3Cs**
### 1. Reduce Misses via "**Larger Block Size**"
* 
* 
* Larger Block Size :
* Miss Rate $\downarrow$
* mainly **Compulsory Miss** $\downarrow$
* but also Conflict & Capacity Miss $\downarrow$
* **Miss Penalty** $\uparrow$
* **AMAT** $\downarrow$
* **Hit Time** $\uparrow$
### 2. Reduce Misses via "**Higher Associativity**"
* 2:1 Cache Rule
* Beware: Execution time is the only final measure
* Will Clock Cycle Time (CCT) increase?
* Hill (1988) suggested hit-time increases for 2-way vs. 1-way:
* external cache + 10%
* internal + 2%
* AMAT vs. Miss Rate
* Assume CCT = n times the CCT of a direct-mapped cache
* n = 1.10 for 2-way
* n = 1.12 for 4-way
* n = 1.14 for 8-way
* 
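* A worked comparison under assumed numbers (1-cycle direct-mapped hit time, 25-cycle miss penalty):
    * $\text{AMAT}_{1\text{-way}} = 1 + \text{Miss rate}_{1\text{-way}} \times 25$
    * $\text{AMAT}_{2\text{-way}} = 1.10 + \text{Miss rate}_{2\text{-way}} \times 25$
    * The 2-way cache wins only if it lowers the miss rate by more than $0.10 / 25 = 0.4\%$ (absolute), which is why execution time, not miss rate, is the final measure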
### 3. Reducing Misses via a "**Victim Cache**"
* How to combine **fast hit time** of direct mapped yet **still avoid conflict misses**?
* **Victim Cache**: Add buffer to place data discarded from cache
* Jouppi(1990): 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
* Used in Alpha, HP machines
* 
* From [Wiki](https://zh.wikipedia.org/wiki/CPU%E7%BC%93%E5%AD%98#%E5%8F%97%E5%AE%B3%E8%80%85%E7%BC%93%E5%AD%98)
### 4. Reducing Misses via "**Pseudo-Associativity**"
* How to combine the **fast hit time of a DM cache** with the **lower conflict misses of a 2-way SA cache**?
* Divide cache: on a miss, check other half of cache to see if there, if so have a **pseudo-hit** (slow hit)
* 
* Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
* Better for caches not tied directly to processor (L2)
* Used in the MIPS R10000 L2 cache; similar in UltraSPARC
### 5.6. Reducing Misses by "**Prefetching**"
* Category
* Hardware Prefetching of Instructions & Data
* Software Prefetching Data
* Instruction Prefetching
* Alpha 21064 fetches 2 blocks on a miss
* Extra block placed in **stream buffer**
* Instruction stream buffer
* Data stream buffer
* On miss check stream buffer
* Works with data blocks too:
* Jouppi(1990) 1 data stream buffer got 25% misses from 4KB cache; 4 stream buffers got 43%
* Palacharla & Kessler(1994) for scientific programs for 8 stream buffers got 50% to 70% of misses from 2 64KB, 4-way set associative caches
* Prefetching relies on having **extra** memory bandwidth that can be used without penalty
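* A software-prefetching sketch using the GCC/Clang `__builtin_prefetch` intrinsic; the prefetch distance of 16 elements is an illustrative assumption:
```c=
/* Prefetch x[i + 16] while consuming x[i]; this only pays off if there is
   spare memory bandwidth, as noted above. */
double sum_with_prefetch(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16], 0, 1);  /* read access, low temporal locality */
        sum += x[i];
    }
    return sum;
}
```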
### 7. Reducing Misses by Compiler Optimizations
* McFarling (1989) reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, **in software**
* Reorder procedures in memory so as to reduce misses
* Data
* **Merging Arrays**
* improve spatial locality by single array of compound elements vs. 2 arrays
```cpp=
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
    int val;
    int key;
} merged_array[SIZE];
```
* Some programs reference multiple arrays in the same dimension with the same indices. (**these accesses will interfere with each other**)
* Reducing conflicts by combining independent matrices into a single compound array so that **each cache block can contain the desired elements**; improve spatial locality
* **Loop Interchange**
* change nesting of loops to access data in order stored in memory
* **row major or column major**
```cpp=
/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)    // <- change
        for (j = 0; j < 100; j++) // <- change
            x[i][j] = 2 * x[i][j];
```
* Sequential accesses instead of striding through memory every 100 words; improved spatial locality
* **Loop Fusion**
* Combine 2 independent loops that have same looping and some variables overlap
```cpp=
/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
```
* 2 misses per access to `a` & `c` vs. one miss per access; improve spatial locality
* **Blocking**
* Improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
```cpp=
/* Matrix multiplication */
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1) {
            r = r + y[i][k] * z[k][j];
        }
        x[i][j] = r;
    }
```
* Two Inner Loops:
* Read all N^2 elements of `z[]`
* Read N elements of 1 row of `y[]` repeatedly
* Write N elements of 1 row of `x[]`
* Idea: compute on B^2 submatrix that fits
```cpp=
/* After: iterate over B x B blocks; the bounds use jj+B and kk+B
   so every column and every k index is covered exactly once */
for (jj = 0; jj < N; jj += B)
    for (kk = 0; kk < N; kk += B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k++) {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = x[i][j] + r;
            }
```
* `B` is called the **Blocking Factor**
### Summary on reducing miss rate

* 3C : Compulsory, Capacity, Conflict
1. Reducing Misses via Larger Block Size
2. Reducing Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
* Remember the danger of concentrating on just one parameter when evaluating performance
### Reducing Cache Miss Penalty
1. Read Priority over Write on Miss
2. Sub-block Placement
3. Early Restart and Critical Word First
4. Non-blocking Caches to reduce stalls on misses
5. Add a second-level cache
### 1. Read Priority over Write on Miss
* **Write through with write buffer**
* If simply wait for write buffer to empty, might increase read miss penalty
* Check write buffer contents before the read; if there are no conflicts, let the memory access continue
* **Write Back**
* Read miss replaces dirty block
* Normal: Write dirty block to memory, then do the read (**slow**)
* **Instead**: Copy the dirty block to a write buffer, then do the read, and then do the write
* CPU stall less since restarts as soon as read is done
### 2. Sub-block Placement
* **Don’t** have to load **full block** on a miss
* Have valid bits per sub-block to indicate valid
* 
### 3. Early Restart and Critical Word First
* Don't **wait** for full block to be loaded before restarting CPU
* **Early restart**
* as soon as requested word of the block arrives, send it to the CPU and let the CPU continue execution
* **Critical word first**
* request the missed word first from memory and send it to the CPU as soon as it arrives
* Let the CPU continue execution while filling the rest of the words in the block
* Also called **wrapped fetch** or **requested word first**
* ==Generally useful only in large blocks==
### 4. Non-blocking Caches to reduce stalls on misses

from [PTT](https://www.ptt.cc/bbs/Grad-ProbAsk/M.1285752708.A.6D0.html)
* **Non-blocking cache** or **lockup-free cache** allow data cache to continue to supply cache hits during a miss
* requires F/E bits on registers or out-of-order execution
* requires multi-bank memories
* "**hit under miss**" reduces the effective miss penalty by working during miss
* "**hit under multiple miss**" or "**miss under miss**" may further lower the effective miss penalty by overlapping multiple misses
* Significantly increases the **complexity** of the cache controller as there can be **multiple outstanding memory accesses**
* Requires multiple memory banks (otherwise it cannot be supported)
* Pentium Pro allows 4 outstanding memory misses
* Example
* 
* 
### 5. Add a second-level cache
* L2 Equations
* 
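* The equations referenced here (the standard two-level AMAT decomposition):
    * $\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \text{Miss penalty}_{L1}$
    * $\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2}$
    * So $\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times \left( \text{Hit time}_{L2} + \text{Miss rate}_{L2} \times \text{Miss penalty}_{L2} \right)$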
* Definitions:
* **Local miss rate**
* misses in this cache divided by the total number of memory accesses to this cache
* Local miss rate for L2 = Miss rate_L2
* **Global miss rate**
* misses in this cache divided by the total number of memory accesses generated by the CPU
* Global miss rate for L2 = Global miss rate for L1 × Miss Rate_L2 = Miss Rate_L1 × Miss Rate_L2
* ==**Global Miss Rate is what matters**==
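* A quick numeric illustration (assumed numbers): out of 1000 memory references, 40 miss in L1 and 20 also miss in L2; the local L2 miss rate is $20 / 40 = 50\%$, while the global L2 miss rate is $20 / 1000 = 2\%$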
* Example
* Given the data below, what is the impact of second-level cache associativity on the miss penalty?
* 
* 
* Conclusion: **Higher associativity or pseudo-associativity are worth considering** because
* They have small impact on the second-level hit time
* So much of the AMAT is due to misses in the second level cache
* Usually for L2 cache:
* L2 >> L1 (e.g., L1 is 32 KB)
* L2 not tied to CPU clock cycle
* Higher associativity or pseudo associativity
* Larger block size
* Multilevel inclusion property
### Summary on Reducing Miss Penalty

* Techniques
1. Read priority over write on miss
2. Sub-block placement
3. Early Restart and Critical Word First on miss
4. Non-blocking Caches (Hit under Miss, Miss under Miss)
5. Second Level Cache
* Can be applied recursively to Multilevel Caches
* Danger is that time to DRAM will grow with multiple levels in between
### Cache Optimization Summary on miss rate and miss penalty
* 
### Reducing Hit time
* Hit time is critical because it affects the clock cycle time
* A fast hit time is multiplied in importance beyond the AMAT formula because it helps everything
* Fast hit time via **small and simple cache**
### Disadvantage of Set Associative Cache
* N-way Set Associative Cache v. Direct Mapped Cache:
* N comparators vs. 1
* Extra MUX delay for the data
* Data comes AFTER Hit/Miss
* In a DM cache, cache block is available **BEFORE** Hit/Miss:
* Possible to assume a hit and continue. Recover later if miss.
* 
### Fast Hit time via Small and Simple Caches
* Index tag memory and then compare takes time
* **Small** cache can help hit time since smaller memory takes less time to index
* E.g., L1 caches same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
* Also L2 cache small enough to fit on chip with the processor avoids time penalty of going off chip
* **Simple** $\rightarrow$ direct mapping
* Can overlap tag check with data transmission since no choice
* Access time estimate for 90 nm using CACTI model 4.0
* Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
* 
## (9-2) Virtual Memory
### For Virtual Memory

* 
* **Page size usually large**
* Smaller **page table**, save memory, reduce TLB misses
* Transferring larger pages to/from secondary storage is more efficient
* 
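* A C sketch of splitting a virtual address into a virtual page number and page offset, assuming 4 KB pages (illustrative values, not tied to a particular machine):
```c=
#include <stdint.h>

#define PAGE_SIZE 4096u                        /* assumed 4 KB pages               */

uint32_t page_offset(uint32_t va) { return va % PAGE_SIZE; }
uint32_t vpn(uint32_t va)         { return va / PAGE_SIZE; }  /* page-table index  */

/* Translation: the page table maps a virtual page number to a physical frame;
   the page offset is carried over unchanged. */
uint32_t translate(uint32_t va, const uint32_t *frame_of_vpn) {
    return frame_of_vpn[vpn(va)] * PAGE_SIZE + page_offset(va);
}
```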
### The Limits of Physical Addressing

### Solution: Add a Layer of Indirection

### Three Advantages of Virtual Memory
* **Translation**:
* Program can be given **consistent view** of memory, even though physical memory is scrambled
* Makes **multithreading** reasonable (now used a lot!)
* Only the most important part of program ("**Working Set**") must be in **physical memory**.
* Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later.
* **Protection**:
* Different threads (or processes) protected from each other.
* Different pages can be given special behavior
* **Read Only, Invisible to user programs, etc.**
* Kernel data protected from User programs
* Very important for protection from malicious programs
* **Sharing**:
* Can map same physical page to multiple users ("Shared memory")
### Page tables encode virtual address spaces

### Details of Page Table

### Page tables may not fit in memory!

### VM and Disk: Page replacement policy

### **Translation Look-Aside Buffer** (**TLB**)

### The TLB caches page table entries

## (10) Storage
### Magnetic Disks

* Average rotational latency = 0.5 / RPM (half a revolution on average)
* For 3600 RPM: 0.5 / 3600 min = 8.33 ms
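* The full access-time model (standard): Disk access time = Seek time + Rotational latency + Transfer time + Controller overhead
    * e.g., a 7200 RPM disk has an average rotational latency of $0.5 / 7200$ min $\approx 4.17$ ms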
### Redundant Arrays of (Inexpensive) Disks (**RAID**)
* Files are "striped" across multiple disks
* spreading data over multiple disks
* **Force accesses to several disks if the data files are large**
* Redundancy yields high data availability
* **Availability**: service still provided to user, even if some components failed
* Disks will still fail
* Contents reconstructed from data redundantly stored in the array
* **Capacity penalty** to store redundant info
* **Bandwidth penalty** to update redundant info
* [Wiki (中文)](https://zh.wikipedia.org/wiki/RAID)
### RAID 0
* No redundancy
* Sometimes nicknamed JBOD (Just a Bunch of Disks)
* Used as a **measuring** stick for other RAID levels in terms of cost, performance and dependability.
### RAID 1 : Disk Mirroring / Shadowing

* Each disk is fully duplicated onto its "**mirror**"
* Very high availability can be achieved
* **Bandwidth sacrifice on write**:
* **Logical write = two physical writes**
* Reads may be optimized
* **Most expensive solution: 100% capacity overhead**
### RAID 2
* Inspired by applying memory-style error correcting codes to disks
* RAID 2 not used since other more interesting schemes
### RAID 3 : Parity Disk

* Sum computed across recovery group to protect against hard disk failures, **stored in P disk**
* Logically, a single high capacity, high transfer rate disk
* **good for large transfers**
* **Wider arrays reduce capacity costs, but decrease availability**
* 33% capacity cost for parity if 3 data disks and 1 parity disk
### RAID 4 : High I/O Rate Parity

* **Small Writes**
* RAID 4 works well for small reads
* Small writes (write to one disk):
* Option 1: read other data disks, create new sum and write to Parity Disk
* **Option 2: since P has old sum, compare old data to new data, add the difference to P**
* ==**RAID 4 takes Option 2**==
* In RAID 4, small writes are limited by the parity disk
* Writes to D0 and D5 both also write to the P disk
* 
* 
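* A C sketch of the Option-2 small write: the new parity is the old parity XOR the old data XOR the new data, applied byte by byte:
```c=
#include <stddef.h>
#include <stdint.h>

/* RAID 4/5 small-write parity update (Option 2 above):
   parity' = parity ^ old_data ^ new_data, so only the target data disk
   and the parity disk need to be read and written. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t block_bytes) {
    for (size_t i = 0; i < block_bytes; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```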
### RAID 5 : High I/O Rate Interleaved Parity

### RAID 6 : Recovering from 2 failures
* Why > 1 failure recovery?
* operator accidentally replaces the wrong disk during a failure
* since disk bandwidth is growing more slowly than disk capacity, the mean time to repair (MTTR) a disk in a RAID system is increasing
* increases the chances of a 2nd failure during repair since it takes longer
* reading much more data during reconstruction meant increasing the chance of an uncorrectable media failure, which would result in data loss
* Network Appliance’s **row-diagonal parity** or RAID-DP
* 
* Like the standard RAID schemes, it uses redundant space based on parity calculation per stripe
* Since it is protecting against a double failure, it adds **two check blocks per stripe of data**.
* If p+1 disks total, p-1 disks have data; assume p=5
* Row parity disk is just like in RAID 4
* Even parity across the other 4 data blocks in its stripe
* Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal
* Example p = 5
* Row diagonal parity starts by recovering one of the 4 blocks on the failed disk using diagonal parity
* Since each diagonal misses one disk, and all diagonals miss a different disk
* Once the data for those blocks is recovered, then the standard RAID recovery scheme can be used to recover two more blocks in the standard RAID 4 stripes
* Process continues until two failed disks are restored