# CA Fin (6/6)
<style>
.markdown-body li + li
{
padding-top: 0 !important;
}
.markdown-body table th, .markdown-body table td
{
padding: 2px 10px;
}
</style>
---
:::danger
:page_facing_up: ***With 1 sheet (2 pages) of A4 handwritten notes*** :page_facing_up:
R: 39~59, C: ~150
:::
:::info
**Videos**:
- [05/09](https://www.youtube.com/watch?v=VdDC_dO7ne4)
- [05/16](https://www.youtube.com/watch?v=3UQ3RhSFaww)
- ~~[05/23]()~~
- [05/30](https://www.youtube.com/watch?v=AfLUHr7dvcE)
- ~~[06/06]()~~
:::
:::info
**Other notes**
- https://hackmd.io/@sysprog/H1sZHv4R?type=view
:::
---
[TOC]
---
## ==TODO==
- Past
- Collect the slides' exercises
## 7. Branch Prediction
### Exceptions and Interrupts
- Polling & interrupts
- Polling: manually check
- Interrupts: the device interrupts ("hiccups") the CPU when it needs service
- Polling is faster than interrupts
- no need to save and restore regs
- other interrupt overhead avoided
- Polling is slower than interrupts
- Overhead of polling instructions is incurred
- the device may have to wait a long time for service
- Use which?
- polling: device can be controlled at user level
- interrupt: infrequent irregular event
- Exceptions & interrupts
- Exceptions: relevant to the current process
- Interrupts: caused by asynchronous, outside event
- Synchronous: means related to the instruction stream
- Asynchronous: means unrelated to the instruction stream
- Precise interrupts/exceptions
- restart execution at the interrupt point and “get the right answer”
## 8. Multi-Processor
### Multi-Processor
- Why multiprocessor good
- A growth in data-intensive applications
- A growing interest in server markets
- Complexity of current microprocessors
- Combining several existing processors is much easier than redesigning one
- Steady improvement in parallel software (scientific apps, databases, OS)
- Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance
- Categories
- SISD (Single Instruction stream Single Data stream)
- uniprocessor
- SIMD (Single Instruction stream Multiple Data stream)
- Data-level parallelism
- each processor has its own data memory
- single inst memory and control processor
- MISD (Multiple Instruction stream Single Data stream)
- MIMD (Multiple Instruction stream Multiple Data stream)
- thread-level parallelism
- Challenges of parallel processing:
- limited parallelism available in programs:
- running independent tasks
- high cost of communications
- thread must communicate to complete the task
- large latency of remote access
- in existing shared-memory multiprocessors, communication cost: from 50 clock cycles (multicore) to 10000 clock cycles (large-scale multiprocessors)
### SIMD
- Why SIMD:
- architectures can exploit significant data level parallelism for:
- matrix
- is more energy efficient than MIMD
- allows programmer to continue to think sequentially
- Vector arch:
- basic idea:
- read sets of data elements into "vector regs"
- operate on those regs
- regs are controlled by compiler
- hide memory latency
- leverage memory bandwidth
- a significant number of read and write ports
- VMIPS (Vector arch):
- basic idea:
- Vector regs
- vector FUs
- vector load-store unit
- scalar regs
- Y= a * X + Y:
- SAXPY/DAXPY loop (single/double precision a\*X plus Y); a plain-C sketch of this loop follows this section's list
- MIPS vs. VMIPS:
- overhead insts are not present in the VMIPS code
- Compiler: the code is vectorized or vectorizable
- loops can be vectorized:
- no loop-carried dependences
- fewer pipeline load interlocks
- vector archs call forwarding of element-dependent operations "chaining"
- Vector execution time:
- 3 factors:
- length of operand vectors
- structural hazards
- data dependences
- VMIPS FUs consume one element per clock cycle:
- execution time is approx the vector length
- Convoy
- set of vector insts that could potentially execute together
- Chimes:
- Sequences with RAW dependency hazards can be in the same convoy via chaining
- Chaining: allows a vector operation to start as soon as the individual elements of its vector source operand become available
- Chime: unit of time to execute one convoy
- $m$ convoys execute in $m$ chimes
- For vector length of $n$, requires $m \times n$ clock cycles
- Multiple lanes:
- parallel pipelines
- allow for multiple hw lanes
- Vector length reg (VLR):
- vector length not known at compile time
- use "strip mining" for vectors over the max vector length (MVL)
- generate code such that each vector operation is done for a size less than or equal to the MVL (see the strip-mined DAXPY in the sketch after this list)
- Vector mask regs:
- handling "if" statements in vectors loops
- to "disable" elements
### MIMD
- Why MIMD:
- offers flexibility
- with correct HW and SW support:
- Single-user multiprocessors focusing on high performance for one application
- Multi-programmed multiprocessors running many tasks simultaneously
- Or some combination of these functions
- MIMD can build on the cost-performance advantages of off-the-shelf processors
- Nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers
- Multi-core chips leverage the design investment in a single processor core by replicating it
- Clusters:
- one popular class of MIMD computers
- often use standard components & standard network technology
- Commodity clusters:
- Often assembled by users or computer center directors, rather than vendors
- For applications:
- focusing on throughput
- requiring almost no communication among threads
- Web serving, multiprogramming, transaction-processing applications
- Inexpensive
- Custom clusters:
- exploit large amounts of parallelism on a single problem
- significant amount of communication during computation
- Multicore:
- On-chip multi-processing (single-chip multiprocessing, multi-core)
- multiple processors on a single die
- typically share L2 cache, memory or I/O bus
- usually, each processor executes a different process
- Process: (usually) independent of each other
- Thread: may share code and address space
- Major MIMD styles:
- Decentralized memory (memory module with CPU)
- get more memory bandwidth
- lower memory latency
- Drawback:
- (Major) longer communication latency
- SW model more complex
- Models for communication:
- through a shared address space
- addressed as one logically shared address space
- Distributed shared-memory
- Non-Uniform Memory Access (NUMA): access time depends on the location of the data
- through message passing
- address spaces are logically disjoint and cannot be addressed by a remote processor
- Centralized shared memory (Symmetric (shared-memory) multiprocessor, Uniform Memory Access, Shared-Memory Processor)
- uniform latency from memory
- large, multilevel caches reduce the memory demand
- more than one processor to share the bus to the memory
- different organizations:
- Processor and cache on an extension board
- Integrated on the main-board (most popular)
- Integrated on the same chip
- Private data
- Used by single processor
- Shared data
- Used by multiple processors
- Provides communication among processors through reads/writes of the shared data
- Problem: Cache Coherence
- Cache Coherence:
- Informally: "Any read must return the most recent write"
- Formally:
- If P writes x and then P reads x, with no other writes of x by another processor between the write and the read, the read always returns the value written by P.
- If P1 writes x and P2 reads it, P1’s write will be seen by P2 if the read and write are sufficiently far apart and no other writes to x between the two accesses
- Writes to a single location are serialized:
- seen in one order (Latest write will be seen)
- Coherence & Consistency:
- Coherence: defines "what values" can be returned by a read
- Consistency: defines "when" a written value will be returned by a read
- exactly when a written value must be seen by a reader is defined by a memory consistency model
- Solutions:
- Snooping:
- No centralized state is kept
- Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block
- Caches are on a shared-memory bus
- All cache controllers monitor or snoop on the bus to determine if they have a copy of a block that is requested on the bus
- Dominates for small scale machines (most of the market)
- Directory-Based Schemes:
- The sharing status of a block is kept in one location called the directory
- Keep track of what is being shared in one place (centralized logically, but can be distributed physically)
- Has slightly higher implementation overhead than snooping
- Can scale to larger processor counts
- Basic Snooping Protocols:
- Write Invalidate protocol
- Multiple readers, single writer
- Write to shared data:
- an invalidate is sent to all caches which snoop and invalidate any copies
- Read miss:
- Write-through: memory is always up-to-date
- Write-back: snoop in caches to find the most recent copy
- Write Broadcast Protocol (Write Update Protocol)
- Write to shared data:
- broadcast on bus, processors snoop, and update any copies
- Read miss:
- memory is always up-to-date
- typically write through
- E.g. Invalidation protocol, write-back cache (a toy state-transition sketch in C follows this list)

- Each block of memory is in one state:
- Clean in all caches and up-to-date in memory (Shared)
- Dirty in exactly one cache (Exclusive)
- Not in any caches
- Each cache block is in one state (track these):
- Shared: block can be read
- Exclusive: cache has the only copy, it’s writeable, and dirty
- Invalid: block contains no data
- Boldface (in the state-transition diagram):
- bus actions generated as part of the state transition
- Example:

- Scalable approach: Directories
- Non uniform memory access
- Every memory block has associated directory information
- keeps track of copies of cached blocks and their states
- on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
- in scalable networks, communication with the directory and copies is through network transactions
- Many alternatives for organizing directory information
- Directory-based coherence
- Local node: requesting node
- Home node: where the memory location and the directory entry of an address reside
- Remote node: has a copy of the cache block, whether exclusive or shared
- E.g. Directory based Cache Coherence Protocol
- 3 states:
- Shared: ≥ 1 processors have data, memory up-to-date
- Exclusive: 1 processor (owner) has data; memory out-of-date
- Uncached (no processor has a copy of the cache block; not valid in any cache)
- In addition to cache state, must track which processors have data when in the shared state (usually bit vector, 1 if processor has copy)
- A popular middle ground
- Two level "hierarchy"
- Individual nodes are multiprocessors, connected non-hierarchically
- Coherence across nodes is directory-based
- Coherence within nodes is snooping or directory
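
To make the write-invalidate protocol above concrete, here is a toy C sketch of the per-block state machine with the three block states used in these notes (Invalid / Shared / Exclusive-dirty). All type and function names are invented for illustration; the bus, the memory, and the other caches are reduced to the bus action each transition would generate.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;   /* per cache block */

/* Bus actions a controller can place on the shared bus (the boldface
 * actions in the state-transition diagram). */
typedef enum { BUS_NONE, BUS_READ_MISS, BUS_WRITE_MISS,
               BUS_INVALIDATE, BUS_WRITE_BACK } BusAction;

/* Local processor reads this block. */
BusAction cpu_read(BlockState *s) {
    if (*s == INVALID) {          /* read miss: fetch block, others may share it */
        *s = SHARED;
        return BUS_READ_MISS;
    }
    return BUS_NONE;              /* read hit in SHARED or EXCLUSIVE */
}

/* Local processor writes this block. */
BusAction cpu_write(BlockState *s) {
    if (*s == INVALID) {          /* write miss: fetch with intent to modify */
        *s = EXCLUSIVE;
        return BUS_WRITE_MISS;
    }
    if (*s == SHARED) {           /* write hit on shared data: invalidate other copies */
        *s = EXCLUSIVE;
        return BUS_INVALIDATE;
    }
    return BUS_NONE;              /* already EXCLUSIVE: write hit, no bus traffic */
}

/* This cache snoops another processor's bus action for the same block. */
BusAction snoop(BlockState *s, BusAction seen) {
    if (*s == EXCLUSIVE && (seen == BUS_READ_MISS || seen == BUS_WRITE_MISS)) {
        *s = (seen == BUS_READ_MISS) ? SHARED : INVALID;
        return BUS_WRITE_BACK;    /* we hold the only dirty copy: supply it */
    }
    if (*s == SHARED && (seen == BUS_WRITE_MISS || seen == BUS_INVALIDATE))
        *s = INVALID;             /* someone else is writing: drop our clean copy */
    return BUS_NONE;
}

int main(void) {
    BlockState p1 = INVALID, p2 = INVALID;
    cpu_read(&p1);                 /* P1 reads  -> P1: SHARED                  */
    snoop(&p2, cpu_write(&p1));    /* P1 writes -> P1: EXCLUSIVE               */
    snoop(&p1, cpu_read(&p2));     /* P2 reads  -> P1 writes back; both SHARED */
    printf("P1=%d P2=%d (1 = SHARED)\n", p1, p2);
    return 0;
}
```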
## 9. Memory Hierarchy
### Memory Hierarchy
- Why Memory Hierarchy:
- Processor-Memory Performance Gap
- Levels of the Memory Hierarchy:


- Goal: illusion of large, fast, cheap memory
- Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access
- Achieving higher memory bandwidth:

### Caches
- The principle of locality:
- Program access a relatively small portion of the address space at any instant of time
- 2 different types of locality:
- Temporal Locality (Locality in Time):
- If an item is referenced, it will tend to be referenced again soon
- Spatial Locality (Locality in Space):
- If an item is referenced, items whose addresses are close by tend to be referenced soon
- For the last 15 years, HW has relied on locality for speed
- Caches:
- Small, fast storage used to improve average access time to slow memory
- Exploits spatial and temporal locality
- Hit & Miss:
- Hit: data appears in some block in this level
- Hit Rate: the fraction of memory access found in this level
- Hit Time: Time to access the upper level which consists of
- RAM access time + Time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level
- Miss Rate: 1 - (Hit Rate)
- Miss penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
- Hit Time << Miss Penalty
- Direct-mapped cache ==TODO==

### 4 Questions for Memory Hierarchy
- Where can a block be placed in the upper level?
(**Block placement**)
- Fully Associative
- Set Associative
- $$2^{\text{Index}} = \frac{\text{Cache size}}{\text{Block size} \times \text{Set associativity}}$$
- $N$-way set associative: N entries for each Cache Index
- N direct mapped caches operate in parallel
- Direct Mapped (1-way set associative)
- For a $2^N$ byte cache:
- The uppermost $(32-N)$ bits are always the Cache Tag
- The lowest $M$ bits are the Byte Select (Block Size $= 2^M$)
- How is a block found if it is in the upper level?
(**Block identification**)
- Index identifies the set (a C sketch of the tag/index/offset breakdown follows this list)

- increasing associativity shrinks index, expands tag
- $$\text{Cache size} = \text{Associativity} \times 2^{\text{Index size}} \times 2^{\text{Offset size}}$$
- Which block should be replaced on a miss?
(**Block replacement**)
- Easy for Direct Mapped
- Set associative & Fully associative:
- Random
- Easy to implement, has very small differences to LRU
- LRU (Least Recently Used)
- Appealing, but hard to implement for high associativity
- What happens on a write
(**Write strategy**)

- Write through:
- The information is written to both the block in the cache and to the block in the lower level memory
- read misses cannot result in writes
- Write buffer:

- without it, the CPU would have to wait for the lower-level memory
- Processor: writes data into the cache and the write buffer
- Memory Controller: write content of the buffer to memory
- Write buffer is just a FIFO:
- Typical number of entries: $4$
- Works fine if: Store frequency << (1 / DRAM write cycle)
- Write back
- The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced
- is block clean or dirty
- no repeated writes to same location
- Write miss:
- Write allocate (fetch on write)
- The block is loaded on a write miss
- Usually, write-back caches use write allocate
- No write allocate (write around)
- The block is modified in the lower level and not loaded into the cache
- Usually, write-through caches use no-write allocate
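
A small C sketch of the tag / index / byte-offset breakdown described above. The concrete sizes (32-bit addresses, 16 KB cache, 4-way, 32-byte blocks) are only an example; they follow $2^{\text{Index}} = \text{Cache size} / (\text{Block size} \times \text{Set associativity})$.

```c
#include <stdint.h>
#include <stdio.h>

enum {
    CACHE_SIZE  = 16 * 1024,  /* example: 16 KB                          */
    BLOCK_SIZE  = 32,         /* example: 32-byte blocks                 */
    ASSOC       = 4,          /* example: 4-way set associative          */
    NUM_SETS    = CACHE_SIZE / (BLOCK_SIZE * ASSOC),  /* = 128 sets      */
    OFFSET_BITS = 5,          /* log2(BLOCK_SIZE)                        */
    INDEX_BITS  = 7           /* log2(NUM_SETS)                          */
};

int main(void) {
    uint32_t addr   = 0x1234ABCD;
    uint32_t offset = addr & (BLOCK_SIZE - 1);                /* byte select          */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* picks the set        */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);     /* compared on a lookup */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```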
### Cache Performance

- E.g. Harvard architecture
- Unified vs. Separate Instruction & Data (Harvard)
- Structural hazard stall
- Impact of Cache:
- The lower the $CPI_{\text{execution}}$, the higher the relative impact of a fixed number of cache-miss clock cycles
- For a CPU with a higher clock rate, each miss costs a larger number of clock cycles, hence the memory portion of CPI is higher (worked example after this list)
- The Cache Design Space:
- Several interacting dimensions:
- cache size
- block size
- associativity
- replacement policy
- write-through vs. write-back
- The optimal choice is a compromise
- depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
- depends on technologies / cost
- Simplicity often wins
- Improving Cache Performance:

- Reduce the miss rate
- Reduce the miss penalty
- Reduce the time to hit in the cache
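
A worked example of the cache impact on CPI, with numbers chosen only for illustration:
- Assume $CPI_{\text{execution}} = 1.0$, $1.3$ memory accesses per instruction, miss rate $= 2\%$, miss penalty $= 100$ cycles
- Memory-stall cycles per instruction $= 1.3 \times 0.02 \times 100 = 2.6$
- $CPI = 1.0 + 2.6 = 3.6$, so the machine runs $3.6\times$ slower than with a perfect cache
- With $CPI_{\text{execution}} = 0.5$ instead, $CPI = 0.5 + 2.6 = 3.1$ and memory stalls are now $\approx 84\%$ of CPI, illustrating the "Impact of Cache" bullets above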
### Reduce the Miss Rate
- Classifying Misses:
- Compulsory
- On the first access to a block, the block is not in the cache, so it must be brought into the cache
- Also called cold start misses or first reference misses.
- Misses in even an Infinite Cache
- Capacity
- If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
- Misses in Fully Associative Size X Cache
- Conflict
- If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
- Misses in N-way Associative, Size X Cache
- Coherence (not in the normal 3Cs)
- Misses caused by cache coherence
- Classify cache misses
- Infinite cache, fully associative
- Compulsory misses
- Finite cache, fully associative
- Compulsory misses + Capacity misses
- Finite cache, limited associativity
- Compulsory misses + Capacity misses + Conflict misses
- Solutions:
- Change Block Size: Compulsory
- Larger block size:
- Compulsory Miss DOWN
- Miss Penalty UP
- (If small cache) conflict miss UP
- (If small cache) capacity miss UP
- Change Associativity: Conflict
- 2:1 Cache rule:
- miss rate 1-way associative cache size X $\approx$ miss rate 2-way associative cache size X/2
- Associativity UP, Clock cycle time (CCT) UP, Hit time UP
- Associativity UP does not necessarily lower AMAT, because Hit time UP
- Victim Cache
- Combine fast hit time of direct mapped yet still avoid conflict misses
- Add buffer to place data discarded from cache
- Pseudo-Associativity
- Combine fast hit time of Direct Mapped and have the lower conflict misses of 2-way SA cache
- Divide cache:
- on a miss, check other half of cache to see if there, if so have a pseudo-hit (slow hit)
- Alternative (slow) hit: block found in the other half
- Drawback: CPU pipeline is hard if hit takes 1 or 2 cycles
- Better for caches not tied directly to processor (e.g. L2)
- HW Prefetching of Instructions & Data; SW Prefetching of Data
- stream buffer:
- Instruction stream buffer
- Data stream buffer
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Compiler Optimizations: 3Cs (loop-interchange and blocking sketches follow this list)
- Merging Arrays
- improve spatial locality
- separate arrays accessed with the same index can interfere with each other in the cache
- each cache block can contain the desired elements
- Loop interchange
- improve spatial locality
- Sequential accesses: Row major or Column major
- Loop fusion
- improve spatial locality
- merge loops
- Blocking
- improve spatial locality
- $B$: Blocking factor
- Complete one block (tile) at a time instead of walking all the way down rows/columns
- $\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$ (worked example below)
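
Worked example with illustrative numbers (not from the slides): hit time $= 1$ cycle, miss rate $= 5\%$, miss penalty $= 100$ cycles gives $\text{AMAT} = 1 + 0.05 \times 100 = 6$ cycles; halving the miss rate to $2.5\%$ drops AMAT to $1 + 0.025 \times 100 = 3.5$ cycles, which is why the techniques above attack the miss rate.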
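
Hand-written C sketches of two of the compiler optimizations listed above (loop interchange and blocking) on row-major arrays; the array size and blocking factor are illustrative assumptions.

```c
#include <stddef.h>

#define N 1024
#define B 32   /* blocking (tile) factor, chosen so BxB tiles fit in the cache */

/* Loop interchange: with row-major storage, iterating j in the inner loop
 * gives stride-1 (spatially local) accesses; the interchanged order
 * (i innermost) would touch a new cache block on almost every iteration. */
void scale(double x[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

/* Blocking: compute x = x + y*z one BxB tile at a time so the pieces of
 * y, z, and x being touched fit in the cache, instead of streaming whole
 * rows and columns past it (x is assumed zero-initialized beforehand). */
void blocked_matmul(double x[N][N], double y[N][N], double z[N][N]) {
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate this tile's partial products */
                }
}
```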
### Reduce the miss penalty
- Read Priority over Write on Miss
- Write through with write buffer
- Check write buffer contents before a read; if no conflicts, let the memory access continue
- Write Back
- Read miss replaces dirty block
- Normal: Write dirty block to memory, then do the read
- Instead: Copy the dirty block to a write buffer, then do the read, and then do the write
- CPU stall less since restarts as soon as read is done
- Sub-block Placement
- Don’t have to load full block on a miss
- Have valid bits per sub-block to indicate valid
- Early Restart and Critical Word First
- Do not wait for full block to be loaded before restarting CPU
- Early restart:
- as soon as requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical word first:
- request the missed word first from memory and send it to the CPU as soon as it arrives
- Let the CPU continue execution while filling the rest of the words in the block
- Also called wrapped fetch and requested word first
- Generally useful only in large blocks
- Non-blocking Caches to reduce stalls on misses
- Non-blocking cache (lockup-free cache):
- allow data cache to continue to supply cache hits during a miss
- out-of-order execution
- requires multi-bank memories
- hit under miss:
- reduces the effective miss penalty by working during miss
- hit under multiple miss (miss under miss):
- may further lower the effective miss penalty by overlapping multiple misses
- complexity UP (multiple outstanding memory accesses)
- Requires multiple memory banks
- Better for FP programs than for Integer programs
- One potential advantage of hit-under-miss is that it won’t increase hit time
- Add a second-level cache
- Local miss rate:
- misses in this cache $/$ the total number of memory accesses to this cache
- Global miss rate (What matters):
- misses in this cache $/$ the total number of memory accesses generated by the CPU (worked example after this list)
- For lower level cache:
- Higher associativity or pseudo associativity are worth considering:
- They have small impact on the second-level hit time
- Much of the AMAT is due to misses in the second level cache
- Larger block size
- Multilevel inclusion property
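
A worked example with made-up numbers: if the L1 miss rate is $4\%$ and half of those misses also miss in L2 (local L2 miss rate $= 50\%$), the global L2 miss rate is $0.04 \times 0.5 = 2\%$ of all CPU memory accesses, and $\text{AMAT} = \text{Hit time}_{L1} + \text{Miss rate}_{L1} \times (\text{Hit time}_{L2} + \text{Miss rate}_{L2,\text{local}} \times \text{Miss penalty}_{L2})$.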
### Reduce Hit Time
- Hit time
- Hit time is critical because it affects the clock cycle time
- A fast hit time is multiplied in importance beyond the average memory access time formula because it helps everything
- Fast hit time via small and simple cache
- In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
- Possible to assume a hit and continue. Recover later if miss.
- Fast Hit time via Small and Simple Caches:
- Small cache can help hit time since smaller memory takes less time to index
- Simple: direct mapping
- Can overlap tag check with data transmission
### Virtual Memory
- 4 Questions for Virtual Memory: the answers are less flexible because of the much higher miss penalty
- Block placement: Fully Associative
- Due to the exorbitant miss penalty
- Block identification: Page Table
- Translation look-aside buffer: reduce address translation time
- Block replacement: LRU
- provide use bit or reference bit
- Write strategy: Write back
- Page size usually large:
- Smaller page table
- Save memory
- Reduce TLB misses
- Transferring larger pages to/from secondary storage is more efficient
- The Limits of Physical Addressing
- Physical Addressing
- All programs share one address space: The physical address space
- Machine language programs must be aware of the machine organization
- No way to prevent a program from accessing any machine resource
- Add a Layer of Indirection
- User programs run in a standardized virtual address space
- Address Translation hardware managed by the operating system (OS) maps virtual address to physical memory
- Hardware supports “modern” OS features: Protection, Translation, Sharing
- 3 advantages of virtual memory:
- Translation:
- Program can be given consistent view of memory, even though physical memory is scrambled
- Makes multithreading reasonable (now used a lot!)
- Only the most important part of program (“Working Set”) must be in physical memory.
- Contiguous structures (like stacks) use only as much physical memory as necessary yet still grow later.
- Protection (Base <= Address <= Bound):
- Different threads (or processes) protected from each other.
- Different pages can be given special behavior
- Read Only, Invisible to user programs, etc
- Kernel data protected from User programs
- Very important for protection from malicious programs
- Sharing:
- Can map same physical page to multiple users (Shared memory)
- Page tables encode virtual address spaces
- A virtual address space is divided into blocks of memory called pages
- Page table maps virtual page numbers to physical frames
- Page Table Entry (PTE)
- A page table is indexed by a virtual address
- A valid page table entry codes physical memory “frame” address for the page
- Treat memory as cache for disk
- Page tables may not fit in memory!
- Each process needs its own address space!
- Two-level Page Tables (a lookup sketch follows this list)
- Page replacement policy (write-back)
- Architect’s role: support setting dirty and used bits
- Set of all pages in Memory:
- Tail pointer: Clear the used bit in the page table
- Head pointer: Place pages on free list if used bit is still clear. Schedule pages with dirty bit set to be written to disk
- MIPS Address Translation: Translation Look-Aside Buffer (TLB)
- A small fully-associative cache of mappings from virtual to physical addresses
- TLB also contains protection bits for virtual address
- Fast common case: Virtual address is in TLB, process has permission to read/write it.
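
A minimal C sketch of the two-level page-table lookup mentioned above, assuming a 32-bit virtual address with 4 KB pages split as 10 + 10 + 12 bits. All structure and field names are invented for illustration; on a real machine the TLB short-circuits this walk in the common case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t frame;   /* physical frame number                    */
    bool     valid;   /* page present in physical memory?         */
    bool     dirty;   /* set on write, used by write-back paging  */
} PTE;

typedef struct {
    PTE *tables[1024];   /* level-1 table: pointers to level-2 tables */
} PageDirectory;

/* Translate a virtual address; returns false on a page fault. */
bool translate(const PageDirectory *pd, uint32_t va, uint32_t *pa) {
    uint32_t l1  = (va >> 22) & 0x3FF;    /* top 10 bits        */
    uint32_t l2  = (va >> 12) & 0x3FF;    /* next 10 bits       */
    uint32_t off = va & 0xFFF;            /* 12-bit page offset */

    PTE *table = pd->tables[l1];
    if (table == NULL || !table[l2].valid)
        return false;                     /* page fault: OS must bring the page in */

    *pa = (table[l2].frame << 12) | off;  /* frame number + offset */
    return true;
}
```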
## 10. Storage
### Magnetic Disks

- Seek time: time to move the arm over the proper track
- Rotation Latency (Rotational Delay): Time for the requested sector to rotate under the head
- Average rotational latency = 0.5 rotation / rotation speed (e.g., at 10,000 RPM: $0.5 / (10{,}000/60)$ s $= 3$ ms); a worked access-time example follows below
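
Worked example with illustrative numbers (not from the slides): average seek $= 5$ ms, $10{,}000$ RPM, $40$ MB/s transfer rate, $0.1$ ms controller overhead, $4$ KB read:
- Average rotational latency $= 0.5 / (10{,}000/60)$ s $= 3$ ms
- Transfer time $= 4\,\text{KB} / 40\,\text{MB/s} = 0.1$ ms
- Average access time $\approx 5 + 3 + 0.1 + 0.1 = 8.2$ ms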
### RAID
- Redundant Arrays of (Inexpensive) Disks (RAIDs):
- Files are "striped" across multiple disks (spreading data over multiple disks)
- Force accesses to several disks if the data files are large
- Redundancy yields high data availability:
- Availability: service still provided to user, even if some components failed
- Contents reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
- RAID 0
- No redundancy
- RAID 1: Disk Mirroring/Shadowing
- Each disk is fully duplicated onto its “mirror”
- Very high availability can be achieved
- Bandwidth sacrifice on write:
- Logical write = two physical writes
- Reads may be optimized
- Most expensive solution: 100% capacity overhead
- RAID 2
- Inspired by applying memory-style error correcting codes to disks
- RAID 3: Parity Disk
- Striped physical records
- P contains sum of other disks per stripe mod 2 (“parity”)
- If disk fails, subtract P from sum of other disks to find missing information
- Sum computed across recovery group to protect against hard disk failures, stored in P disk
- Logically, a single high capacity, high transfer rate disk: good for large transfers
- Wider arrays reduce capacity costs, but decrease availability
- RAID 4: High I/O Rate Parity
- works well for small reads; a small write (to one data disk) updates parity as follows:
- ~~Option 1: read other data disks, create new sum and write to Parity Disk~~
- Option 2 (RAID 4): since P has the old sum, compare old data to new data, and add the difference to P (see the parity-update sketch at the end of this section)
- Heavy load on P
- RAID 5: High I/O Rate Interleaved Parity
- Independent writes possible because of interleaved parity
- no heavy load on a specific disk
- RAID 6: Recovering from 2 failures
- Network Appliance’s row-diagonal parity (RAID-DP)
- Row parity
- Diagonal parity
- two check blocks per stripe of data
- If p+1 disks total, p-1 disks have data
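
A byte-wise C sketch of the RAID 4/5 small-write parity update referred to above ("option 2"): only the old data, the new data, and the old parity are needed, so a small write touches two disks instead of the whole stripe. The function name and byte-level granularity are illustrative; a real array works on whole sectors or stripes.

```c
#include <stddef.h>
#include <stdint.h>

/* P_new = P_old XOR D_old XOR D_new, applied element-wise over the
 * updated region. */
void update_parity(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t n) {
    for (size_t i = 0; i < n; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```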