<style>
.red {
color: #e23e30;
font-weight: bold;
}
.green {
color: #269647;
font-weight: bold;
}
.blue {
color: #2c7bed;
font-weight: bold;
}
</style>
# Large and Fast: Exploiting Memory Hierarchy
## Introduction
- A library example
### Principle of locality
- Programs **access a relatively small portion** of their address space at any instant of time
- It is **impossible** to make most memory **accesses fast and still have large memory** in computers
- **Temporal locality** (locality in time)
- If an item is referenced, it will tend to be **referenced again soon**
- Ex: loops
- **Spatial locality** (locality in space)
- If an item is referenced, items **whose addresses are close** by will tend to be **referenced soon**
- Ex: program (instructions) and array data are normally sequentially accessed
### Memory hierarchy
- The basic structure
- Multiple levels of memory with **different speeds and sizes**
- The faster memories are more expensive per bit than the slower memories and thus are smaller
- **The faster memory is close to the processor** and the slower, less expensive memory is below it
- To present the user with as much memory as is available in the cheapest technology
- To provide access at the speed offered by the fastest memory
- The **data** is similarly hierarchical
- All the data is stored at the lowest level
- **Inclusion property**
- A level closer to the processor is generally a subset of any level further away

- Data is ==**copied between only two adjacent levels**== at a time
- The upper level is smaller, faster, and more expensive per bit
- **Block (line)**
- The **minimum unit** of information that can be either present or not present in a cache
- **Hit vs. Miss**
- ***Hit***: the data requested by the processor **appears in some block** in the upper level
- ***Miss***: the requested data is **not found** in the upper level
- **Hit rate (hit ratio) vs. Miss rate**
- The fraction of memory accesses found (or not found) in the upper level
- <font class='red'>Hit rate + Miss rate = 1</font>
- **Hit time**
- The time to access the upper level of the memory hierarchy
- **Including the time needed to determine** whether the access is a hit or a miss
- **Miss penalty** (Miss time)
- The time to replace a block in the upper level
- **Including the time to deliver** this block to the processor
- Hit time will be **much smaller** than the time to access the next level in the hierarchy

### Memory systems are critical to performance
- All programs **spend much of their time accessing memory**
- The memory system is necessarily a major factor in determining performance
### Memory hierarchy vs. Locality
- Programs exhibit both temporal locality and spatial locality
- Temporal locality
- To **keep more recently accessed data items closer** to the processor
- Accesses that **hit in the highest level** of the hierarchy can be processed quickly
- Spatial locality
- To move blocks consisting of **multiple contiguous words** in memory to upper levels of the hierarchy
- In most systems, ==**data cannot be present in level `i` unless it is also present in level `i + 1`**==

## The Basics of Caches
### A simple cache
- A **safe place** to store things that we need to examine
- The name to represent the level of the memory hierarchy **between the processor and main memory**
- Two questions to answer
- How do we know if a data item is in the cache?
- If it is, how do we find it?

- **Direct-mapped cache**
- Each memory location is mapped to **exactly one** location in the cache
- To assign the cache location based on the memory address of the word
- (block address) **modulo** (number of cache blocks in the cache)
- It is convenient for the number of entries in the cache to be a **power of 2**
- **Different memory locations will be mapped to the same cache location**
- **Tags**
- To identify whether a word in the cache **corresponds to the requested word**
- Only contains the upper portion of the address, which is ==**not used as an index**== into the cache
- **Valid bit**
- To indicate whether an entry **contains a valid address**

### Accessing a Cache
- An example: 8-block direct-mapped cache
- Each block contains one word
- The low-order 3 bits of the block address give the cache index








- A referenced address is divided into the following fields (a short sketch after this list shows the split)
- A ***cache index***
- Used to select the cache block
- The total number of entries in a direct-mapped cache must be a power of 2
- A ***tag field***
- Used to compare with the value of the tag field of the cache
- The least significant 2 bits are ignored when selecting a word in the block
- Words are aligned to multiples of 4 bytes in the MIPS architecture
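To make the decomposition above concrete, here is a minimal Python sketch (not from the text; the 256-block, 4-word-block geometry is an assumed example) that splits a 32-bit byte address into tag, cache index, block offset, and byte offset for a direct-mapped cache.

```python
# Assumed example geometry: 2^8 = 256 blocks, 2^2 = 4 words per block.
NUM_BLOCKS = 256          # n = 8 index bits
WORDS_PER_BLOCK = 4       # m = 2 block-offset bits
BYTES_PER_WORD = 4        # MIPS words are aligned to 4 bytes -> 2 byte-offset bits

def split_address(addr: int):
    """Split a 32-bit byte address into (tag, index, block offset, byte offset)."""
    byte_offset = addr % BYTES_PER_WORD
    block_offset = (addr // BYTES_PER_WORD) % WORDS_PER_BLOCK
    block_address = addr // (WORDS_PER_BLOCK * BYTES_PER_WORD)
    index = block_address % NUM_BLOCKS        # (block address) modulo (number of cache blocks)
    tag = block_address // NUM_BLOCKS         # upper bits, not used as an index
    return tag, index, block_offset, byte_offset

print(split_address(0x000012F8))              # (1, 47, 2, 0)
```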

- An example cache: the Intrinsity FastMATH processor
- An embedded microprocessor that uses the MIPS architecture and a simple cache implementation
- Processor
- 12-stage pipeline
- When operating at peak speed, both an instruction word and a data word can be requested on every clock
- Separate instruction and data caches
- Each is 16 KiB (4096 words) with 16-word blocks
- Separate control signals to read and write each cache
- A **block index** field is used to select the requested word from the 16 words in the indexed block
- To offer both **write-through** and **write-back** strategies
- One-entry write buffer


- The total number of bits needed for a cache
- Example
- A direct-mapped cache
- 32-bit byte address, $2^n$ blocks with $2^m$-word block
- $\Rightarrow$ Tag size: $32 - (n + m + 2)$ bits
- **Total number of bits**: $2^n \times \text{(block size + tag size + valid field size)}$
- $= 2^n \times (2^m \times 32 + (32 - n - m - 2) + 1) = 2^n \times (2^m \times 32 + 31 - n - m)$
- By convention, the cache size usually counts only the size of the data

- Example: Bits in a cache
- Problem
- Assuming a 32-bit address
- How many total bits are required for a direct-mapped cache with 16KiB of data and 4-word blocks?
- Solution
- 16 KiB data, 4-word block $\Rightarrow 2^{10}$ blocks
- Tag size: $32 - 10 - 2 - 2 = 18$ bits
- Each block has $4 \times 32 = 128$ bits
- Total cache size = $2^{10} \times (128 + 18 + 1) = 2^{10} \times 147 = 147$ Kibits (about 18.4 KiB)
- Total size: 18.4 KiB, Data size: 16 KiB $\Rightarrow$ 1.15 times
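The bit-count arithmetic above can be re-checked with a rough sketch (the function and parameter names are ours; the 16 KiB / 4-word-block figures are the example's assumptions):

```python
def cache_total_bits(n: int, m: int, addr_bits: int = 32) -> int:
    """Total bits in a direct-mapped cache with 2^n blocks of 2^m words each."""
    data_bits = (2 ** m) * 32            # block size in bits
    tag_bits = addr_bits - n - m - 2     # 2 bits of byte offset
    valid_bits = 1
    return (2 ** n) * (data_bits + tag_bits + valid_bits)

total = cache_total_bits(n=10, m=2)       # 16 KiB of data in 4-word blocks
print(total, round(total / 8 / 1024, 1))  # 150528 bits, about 18.4 KiB
```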
- **Example: Mapping an address to a multiword cache block**
- Problem
- Consider a cache with 64 blocks and a block size of 16 bytes
- What block number does byte address 1200 map to?
- Solution
- The block is given by: (Block address) **modulo** (Number of cache blocks)
- The address of the block: $\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor$
- Block $\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor$ contains all addresses between
$\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor \times \text{Bytes per block}$
and
$\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor \times \text{Bytes per block} + (\text{Bytes per block} - 1)$
- Byte address 1200 is block address $\lfloor\cfrac{1200}{16}\rfloor = 75$ which maps to cache block number (75 **modulo** 64) = 11
- This block holds all byte addresses between 1200 and 1215, as the sketch below re-computes
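A minimal sketch of the same computation (the 1200-byte address, 16-byte blocks, and 64 cache blocks come from the example above):

```python
BYTES_PER_BLOCK = 16
NUM_CACHE_BLOCKS = 64

byte_address = 1200
block_address = byte_address // BYTES_PER_BLOCK          # floor(1200 / 16) = 75
cache_block = block_address % NUM_CACHE_BLOCKS           # 75 modulo 64 = 11
first_byte = block_address * BYTES_PER_BLOCK             # 1200
last_byte = first_byte + BYTES_PER_BLOCK - 1             # 1215
print(cache_block, first_byte, last_byte)                # 11 1200 1215
```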
- When the cache block size becomes larger
- Larger blocks exploit **spatial locality** to lower miss rates
- The miss rate may go up if the block size becomes too large, because the number of blocks that can be held in the cache will become small
- Increasing the block size also **increases the miss penalty**
- Miss penalty: the latency to the first word + the transfer time for the rest of the block
- The transfer time will increase as the block size grows
- $\Rightarrow$ To increase the block size and still obtain good performance
- Design the memory system to **transfer larger blocks more efficiently** (so the miss penalty does not grow too large)

### Handling cache misses
- When a cache miss occurs
- Done with the processor control unit and a separate controller
- To **stall** the processor and wait for memory
- To freeze the contents of the temporary and programmer-visible registers
- More sophisticated out-of-order processors can allow execution of instructions while waiting for a cache miss
- To initiate the memory access and **refill** the cache
- How instruction misses are handled for pipelined datapath
- Send the original **PC** (current **PC** - 4) to the memory
- Instruct memory to perform a read and wait for the memory to complete its access
- Write the cache entry
- Putting the data in the data portion of the entry
- Writing the upper bits of the address into the tag field
- Turning the valid bit on
- Restart the instruction execution at the first step (instruction refetch)
- The control of the cache on a data access is essentially identical
### Handling Writes
- The cache and memory may be **inconsistent**
- Data is written into only the data cache without changing memory
- **Write-through**
- Always write the data into both the memory and the cache
- Simple but not efficient
- **Write buffer**
- To store the data while it is waiting to be written to memory
- The processor can continue execution after writing the data into the write buffer
- A write buffer entry is freed when a write to main memory completes
- If the write buffer is full when the processor reaches a write
- The processor must stall until there is an empty position in the write buffer
- Using deeper write buffer to reduce the occurrence of such stalls
- **Write-back**
- Data is written only to the block in the cache
- The modified block is written to the memory when it is replaced
- Complex but efficient
### Elaboration
- **Write allocate** cache vs. **No write allocate** cache
- For **write-through** cache
- Write allocate cache
- The missed block is fetched from memory and **a block is then allocated in the cache**
- The appropriate portion of the block in the cache is overwritten
- Fetch the block from memory into the cache, modify the cache, then write the data to memory
- No write allocate cache
- To update the portion of the block in memory but **not put it in the cache**
- Modify memory directly
- Combined cache vs. **Split cache**
- Combined cache
- A single cache that stores both instructions and data
- **Better hit rate**
- Split cache
- Two independent caches that operate in parallel, with one handling instructions and one handling data
- Slightly increased miss rate
- Doubling the cache **bandwidth**
- Almost all **processors today use split caches** to match what modern pipelines expect (this applies only to the L1 cache; L2 and below are combined caches, since the larger sizes at lower levels already keep the miss rate low)
## Measuring and Improving Cache Performance
### To measure and analyze cache performance
- To compute the CPU time
- **CPU time = (CPU execution clock cycles + <font class='red'>Memory-stall clock cycles</font>) $\times$ Clock cycle time**
- The costs of cache hits are part of the normal CPU execution cycles
- <font class='red'>Memory-stall clock cycles</font> = <font class='blue'>Read-stall cycles</font> + <font class='green'>Write-stall cycles</font>
- <font class='blue'>Read-stall cycles</font> = $\frac{Reads}{Program}$ $\times$ Read miss rate $\times$ Read miss penalty
- <font class='green'>Write-stall cycles</font> = ($\frac{Writes}{Program}$ $\times$ Write miss rate $\times$ Write miss penalty) + Write buffer stalls (write-through)
- **Write-through**: with a reasonable write buffer depth, write buffer stalls will be small and can be safely ignored
- **Write-back**: additional stalls arise from the need to write a cache block back to memory when the block is replaced
- <font class='red'>Memory-stall clock cycles</font> = $\frac{\text{Memory accesses}}{Program}$ $\times$ Miss rate $\times$ Miss penalty
= $\frac{Instructions}{Program}$ $\times$ $\frac{Misses}{Instruction}$ $\times$ Miss penalty
- The **read and write miss penalties are the same** in most write-through cache organizations
- The write buffer stalls are **negligible**
- **Example: Calculating cache performance**
- **Problem**
- Instruction cache miss rate: 2%
- Data cache miss rate: 4%
- A processor has a CPI of 2 without any memory stalls
- Miss penalty: 100 cycles for all misses
- Assume the frequency of all loads and stores is 36%
- How much faster a processor would run with a perfect cache that never missed?
- **Solution**
- Assume there are total $I$ instructions
- Instruction miss cycles = $I \times 2\% \times 100 = 2 \times I$
- Data miss cycles = $I \times 36\% \times 4\% \times 100 = 1.44 \times I$
- Total number of memory-stall cycles: $(2 + 1.44) \times I = 3.44 \times I$
- CPI with memory stalls: 2 + 3.44 = 5.44
- The ratio of the CPU execution times is:
$\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{I \times CPI_{stall} \times \text{Clock cycle}}{I \times CPI_{perfect} \times \text{Clock cycle}} = \frac{CPI_{stall}}{CPI_{perfect}} = \frac{5.44}{2} = 2.72$
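As a quick re-check of the example above, a sketch that plugs in the same assumed numbers (2% and 4% miss rates, 36% loads/stores, 100-cycle penalty, base CPI of 2):

```python
base_cpi = 2.0
miss_penalty = 100            # cycles
icache_miss_rate = 0.02
dcache_miss_rate = 0.04
load_store_fraction = 0.36

# Memory-stall cycles per instruction: instruction fetches plus data accesses
stalls = icache_miss_rate * miss_penalty + load_store_fraction * dcache_miss_rate * miss_penalty
cpi_stall = base_cpi + stalls
print(round(stalls, 2), round(cpi_stall, 2), round(cpi_stall / base_cpi, 2))   # 3.44 5.44 2.72
```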
- If the processor is made faster but the memory system is not
- The amount of time spent on memory stalls will take up an increasing fraction of the execution time
- In previous example, suppose the CPI is reduced from 2 to 1
- CPI of the system with cache miss: 1 + 3.44 = 4.44
- The system with the perfect cache will be 4.44 (4.44 / 1) times faster
- Execution time spent on memory stalls: 63%(CPI = 2) vs. 77%(CPI = 1)
- Increasing the clock rate without changing the memory system also increases the performance lost due to cache misses
- A **larger cache** could have a **longer access time** (the increase in hit time)
- An increase in the time to access a word from the memory system
- An increase in the processor cycle time
- Likely adds another stage to the pipeline
- It may take **multiple cycles for a cache hit**
- The increase in hit time for a larger cache could dominate the improvement in hit rate, leading to a decrease in processor performance
- ***Average memory access time (AMAT)***
- <font class='red'>$AMAT = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}$</font>
- The average time to access memory considering both hits and misses and the frequency of different accesses
- Example
- **Problem**
- A processor with a 1 ns clock cycle time
- Miss penalty: 20 clock cycles
- Miss rate: 0.05 misses per instruction
- A cache access time (including hit detection): 1 clock cycle
- Find the AMAT for this processor
- **Solution**
$$
\begin{align}
AMAT &= \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty} \\
&= 1 + 0.05 \times 20 \\
&= \text{2 clock cycles} \\
&= 2ns
\end{align}
$$
### Reducing cache misses by more flexible placement of blocks
- Different placement schemes
- ***Direct mapped***
- A block can go in exactly one place in the cache
- ***Position***: (Block number) **modulo** (Number of cache blocks)
- <font class='blue'>*Fully associative*</font>
- A block can be placed in **any location** in the cache
- **All tags** of all the blocks in the cache **must be searched** (in parallel)
- Comparators significantly increase the hardware cost
- Only practical for caches with **small numbers of blocks**
- <font class='green'>*Set associative*</font>
- Each block can be placed in a fixed number of locations
- <font class='green'>n-way set-associative</font> cache
- A cache with **n locations for a block**
- A cache consists of a number of ***sets***, each of which consists of n blocks
- Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set
- All tags of all the elements of the set must be searched
- ***Position***: (Block number) **modulo** (Number of sets in the cache)

- To increase the degree of associativity
- A **direct-mapped** cache is simply a **one-way** set-associative cache
- A fully associative cache with *m* entries
- An m-way set-associative cache: a single set with m blocks
- **Advantage**: decreases the miss rate
- **Disadvantage**: increases in the hit time

- **Example: Misses and associativity in caches**
- ***Problem***
- Three small caches, each consisting of four one-word blocks
- Direct mapped, two-way set associative, fully associative
- Given a sequence of block addresses: 0, 8, 0, 6, 8
- Find the number of misses for each cache organization
- ***Solution***
- Direct-mapped: 5 misses

- Two-way set associative: 4 misses
- Using the ***least recently used (LRU)*** replacement rule

- Fully-associative: 3 misses
- The best we can do
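The three miss counts above can be re-checked with a small LRU cache-set simulator (a sketch we added, not from the text; `count_misses` and its parameters are our own names):

```python
def count_misses(block_addresses, num_blocks=4, ways=1):
    """Count misses in a cache of `num_blocks` one-word blocks with `ways`-way sets (LRU)."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]       # each set keeps block addresses, LRU first
    misses = 0
    for block in block_addresses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                    # hit: re-appended below as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                       # evict the least recently used block
        s.append(block)
    return misses

stream = [0, 8, 0, 6, 8]
print(count_misses(stream, ways=1),            # direct mapped: 5 misses
      count_misses(stream, ways=2),            # two-way set associative: 4 misses
      count_misses(stream, ways=4))            # fully associative: 3 misses
```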

- Reduction in the miss rate vs. associativity
- 64 KiB data cache, 16-word block
- Going from one-way to two-way associativity decreases the miss rate by about 15%
- Little further improvement in going to higher associativity

### Locating a block in the cache
- Address decomposition
- **Index**: used to select the set containing the address of interest (n bits)
- **Tag**: all the tags in the selected set are searched in parallel
- **Block offset**: block offset (m bits) + byte offset (2 bits)
- Costs of an associative cache
- Extra comparators and an n-to-1 multiplexor
- Any delay imposed by having to do the compare and select from among the elements of the set
- The choice among three placement schemes
- Depending on the cost of a miss versus the cost of implementing associativity
- Both in time and in extra hardware


---
- **Example: Size of tags versus set associativity**
- ***Problem***
- A cache of 4096 blocks, 4-word block size, 32-bit address
- Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way set associative, four-way set associative, and fully associative
- ***Solution***
- 32-bit address: 4 bits for the block offset (including the byte offset), 28 bits for the index and tag
- ***Direct mapped***: 4096 sets
- *Index*: $\log_2(4096)$ = 12 bits
- *Tag*: (28 - 12) $\times$ 4096 = 64 Ki bits ≈ 66 K bits
- ***Two-way set associative***: 2048 sets
- *Index*: $\log_2(2048)$ = 11 bits
- *Tag*: (28 - 11) $\times$ 2 $\times$ 2048 = 68 Ki bits ≈ 70 K bits
- ***Four-way set associative***: 1024 sets
- *Index*: $\log_2(1024)$ = 10 bits
- *Tag*: (28 - 10) $\times$ 4 $\times$ 1024 = 72 Ki bits ≈ 74 K bits
- ***Fully associative***: 1 set
- *Index*: $\log_2(1)$ = 0 bits
- *Tag*: (28 - 0) $\times$ 4096 $\times$ 1 = 112 Ki bits ≈ 115 K bits
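The set counts and total tag bits above can be re-derived with a short sketch (4096 blocks, 4-word blocks, 32-bit addresses, as in the example; `tag_storage_bits` is our own helper name):

```python
import math

NUM_BLOCKS = 4096
ADDR_BITS = 32
BLOCK_OFFSET_BITS = 4                            # 4-word blocks, including the 2-bit byte offset

def tag_storage_bits(ways: int) -> tuple:
    """Return (number of sets, total tag bits) for the given associativity."""
    num_sets = NUM_BLOCKS // ways
    index_bits = int(math.log2(num_sets))
    tag_bits = ADDR_BITS - BLOCK_OFFSET_BITS - index_bits
    return num_sets, tag_bits * ways * num_sets  # tag storage over the whole cache

for ways in (1, 2, 4, NUM_BLOCKS):               # direct mapped ... fully associative
    print(ways, tag_storage_bits(ways))
# 1 (4096, 65536)   2 (2048, 69632)   4 (1024, 73728)   4096 (1, 114688)
```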
### Choosing which block to replace
- To choose a block among the blocks in the selected set to replace
- <font class='blue'>*Least recently used (LRU)*</font>
- The block replaced is the one that has been unused for the longest time
- Implemented by keeping track of when each element in a set was used relative to the other elements in the set
- As associativity increases, implementing LRU gets harder
### Reducing the miss penalty using multilevel caches
- ***Multilevel cache***
- A memory hierarchy with multiple levels of caches
- To close the gap further between the fast clock rates of modern processors and the increasingly long time required to access DRAMs
- Using a second-level cache
- Normally on the same chip as the processor and primary cache
- To be accessed when a miss occurs in the primary cache
- The access time is much less than the access time of main memory
- If neither the **primary** nor **secondary** cache contains the data, a main memory access is required, incurring a larger miss penalty
- **Example: Performance of multilevel caches**
- ***Problem***
- A processor with a base CPI of 1.0 and clock rate of 4 GHz
- Main memory access time: 100 ns
- Secondary cache access time: 5 ns
- Miss rate per instruction at the primary cache: 2%
- After adding a secondary cache, the miss rate to main memory is 0.5%
- How much faster will the processor be with a secondary cache than without one?
- ***Solution 1***
- Miss penalty to main memory: 100 ns $\Rightarrow$ 400 clock cycles
- Miss penalty to secondary caches: 5 ns $\Rightarrow$ 20 clock cycles
- For the processor with one level of caching
- Total CPI = Base CPI + Memory-stall cycles per instruction
- Total CPI = 1.0 + 2% $\times$ 400 = 9
- For the processor with two levels of caching
- Total CPI = Base CPI + Primary stalls per instruction + Secondary stalls per instruction
- Total CPI = 1 + 2% $\times$ 20 + 0.5% $\times$ 400 = 3.4
- *Relative performance*: 9 / 3.4 = 2.6
- ***Solution 2***
- For the processor with two levels of caching
- Total CPI = 1 + (2% - 0.5%) $\times$ 20 + 0.5% $\times$ (20 + 400) = 3.4
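A quick sketch re-computing the two CPIs from the example's assumptions (4 GHz clock, 100 ns main memory, 5 ns L2, 2% primary and 0.5% global miss rates):

```python
clock_rate = 4e9                                       # 4 GHz
main_memory_penalty = round(100e-9 * clock_rate)       # 400 cycles
l2_penalty = round(5e-9 * clock_rate)                  # 20 cycles

cpi_one_level = 1.0 + 0.02 * main_memory_penalty                        # 1 + 8 = 9.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_memory_penalty   # 1 + 0.4 + 2 = 3.4
print(cpi_one_level, round(cpi_two_level, 1),
      round(cpi_one_level / cpi_two_level, 1))                          # 9.0 3.4 2.6
```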
- Design considerations for a primary and secondary cache
- **Primary cache**
- To **focus on minimizing hit time** to yield a shorter clock cycle or fewer pipeline stages
- <font class='red'>Smaller cache size, higher miss rate</font>
- <font class='blue'>Smaller block size to reduce the miss penalty</font>
- **Secondary cache**
- To **focus on miss rate** to reduce the penalty of long memory access times
- Larger cache size, the access time is less critical
- Larger block size
- Higher associativity to reduce miss rates
### Elaboration
- ***Global miss rate***
- The fraction of references that miss in all levels of a multilevel cache
- ***Local miss rate***
- The fraction of references to one level of a cache that miss; used in multilevel hierarchies
- "Miss rate" usually refers to the global miss rate
## Virtual Memory
### Preliminary
- ***Virtual memory***
- A technique that uses main memory as a **cache** for secondary storage
- **Two major motivations** for virtual memory
- To allow efficient and safe sharing of memory among multiple programs
- Such as for the memory needed by multiple virtual machines for cloud computing
- To remove the programming burdens of a small, limited amount of main memory
- The first motivation
- Consider multiple virtual machines running at once on a computer
- The total required memory size may be much larger than the available size
- Main memory need **contain only the active portions** of the many virtual machines
- Only a fraction of total required memory is actively being used at any point in time
- Must be able to protect the virtual machines from each other
- To ensure that a **program can only access the portions of main memory that have been assigned to it**
- The virtual machines sharing the memory change dynamically while the virtual machines are running
- To compile each program into its own **address space**
- Virtual memory implements the translation of a program's address space to **physical addresses**
- The translation process enforces **protection** of a program's address space from other virtual machines
- The second motivation
- If a program became too large for memory, it was up to the programmer to make it fit
- Programmers divided programs into mutually exclusive pieces
- **Overlays** were loaded or unloaded under user program control during execution
- The program never tried to access an overlay that was not loaded
- The overlays loaded **never exceeded the total size of the memory**
- Overlays were traditionally organized as modules, each containing both code and data
- Virtual memory automatically manages the main memory (***physical memory***) and secondary storage
### The structure of virtual memory
- Terminologies in virtual memory system
- **Page**: a virtual memory block
- **Page fault**: a virtual memory miss
- **Virtual address**
- An address that corresponds to a location in virtual space
- To be translated by address mapping to a **physical address** when memory is accessed
- **Address translation (address mapping)**
- The process by which a virtual address is mapped to an address used to access memory

---
- **Relocation**
- To map the virtual addresses to different physical addresses before the addresses are used to access memory
- Program can be loaded to anywhere in main memory
- All virtual memory systems relocate the program as a set of pages
- To eliminate the need to find a contiguous block of memory to allocate a program
- The operating system only needs to find a sufficient number of pages in main memory
- Virtual address format
- ***Page offset***
- Not changed during address translation
- **To determine the page size**
- ***Virtual page number***
- To be translated to a **physical page number** during address translation
- The number of pages addressable with virtual addresses need not match the number addressable with physical addresses
- The number of virtual pages is larger
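A minimal sketch (our own, assuming 4 KiB pages) of how a virtual address splits into a virtual page number and a page offset, and how translation replaces only the page number while the offset is unchanged:

```python
PAGE_OFFSET_BITS = 12                            # 4 KiB pages

def translate(virtual_address: int, page_table: dict) -> int:
    vpn = virtual_address >> PAGE_OFFSET_BITS                   # virtual page number
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)    # unchanged by translation
    ppn = page_table[vpn]                                       # physical page number
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x12345: 0x00ABC}                  # virtual page 0x12345 -> physical page 0x00ABC
print(hex(translate(0x12345678, page_table)))    # 0xabc678
```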

- A **page fault will take millions of clock cycles** to process
- Dominated by the time to get the first word for typical page sizes
- Key decisions in designing virtual memory systems
- Pages should be large enough to amortize the high access time
- Typical size: 4 KiB ~ 16 KiB
- New desktop and server system: to support 32 KiB and 64 KiB
- New embedded system: 1KiB
- Organizations that reduce the page fault rate are attractive
- To allow ==**fully associative**== placement (lowest miss rate)
- Page faults can be handled in software
- The overhead will be small compared to the disk access time
- To use **clever algorithms** for choosing how to place pages
- Write-through will not work for virtual memory
- Writes take too long
- Virtual memory system uses ==**write-back**==
### Placing a page and finding it again
- Using **fully associative** placement of pages
- To **reduce page fault** frequency
- To use a clever and flexible replacement scheme
- To use a sophisticated algorithm and complex data structure to track page usage
- To try to choose a page that will not be needed for a long time
- <font class='red'>Page table</font>
- The table containing the virtual to physical address translations
- Resides in memory; indexing it avoids a full search
- **Indexed with the page number** from the virtual address to discover the corresponding physical page number
- **Each program has its own page table**
- **Different programs use the same virtual addresses**
- The page table may contain entries for pages not present in memory
- A <font class='blue'>valid bit</font> is used in each page table entry
- A value of 1 means the page is already in memory and the entry holds the physical page number; 0 means the page is still on disk and the entry records how to find it there
- **No tags** are required
- The page table contains a mapping for every possible virtual page
- Each page number corresponds to exactly one page table entry, so no tag is needed
- ***Page table register***
- A register that points to the start of the page table

### Page faults
- A page fault occurs if the valid bit for a virtual page is off
- The operating system gets control, done with the exception mechanism
- To find the page in the next level of the hierarchy and decide where to place the requested page
- Must keep track of the location on disk of each page in virtual address space

- When a process is created, the operating system creates
- A **swap space** on flash memory or disk
- The space reserved for the full virtual memory space of a process
- A data structure to record where each virtual page is stored on disk
- Part of the page table
- An auxiliary data structure indexed in the same way as the page table
- The operating system tracks which processes and which virtual addresses use each physical page
- When a page fault occurs and all pages in memory are in use
- Usually follow the ***least recently used (LRU)*** replacement scheme
- The replaced pages are written to swap space on the disk
- ==**Approximate LRU**== replacement scheme
- To keep track of which pages have and which pages have not been recently used
- <font class='green'>Use bit (reference bit)</font>
- To be **set whenever a page is accessed**
- <font class='green'>The operating system periodically clears the reference bits</font>
- To determine which pages were touched during a particular time period
- The operating system will replace a page whose reference bit is off
- **Example: Compute the total page table size**
- **Problem**
- 32-bit virtual address, 4 KiB pages, 4 bytes per page table entry
- Compute the total page table size
- **Solution**
- Number of page table entries = $2^{32} / 2^{12} = 2^{20}$
- Size of page table = $2^{20} \times 4$ = 4 MiB
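The same arithmetic as a small sketch (32-bit virtual addresses, 4 KiB pages, 4-byte entries, per the example):

```python
virtual_address_bits = 32
page_offset_bits = 12                       # 4 KiB pages
bytes_per_entry = 4

num_entries = 2 ** (virtual_address_bits - page_offset_bits)   # 2^20 entries
table_bytes = num_entries * bytes_per_entry
print(num_entries, table_bytes // 2**20, "MiB")                # 1048576 4 MiB
```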
### What about writes?
- Virtual memory must use write-back scheme
- Write to the disk can take millions of processor clock cycles
- Using write-through scheme (even with a write buffer) is impractical
- <font class='blue'>Dirty bit</font>
- Write-back operation is still costly
- A dirty bit is associated with each page in the page table
- <font class='blue'>To be set if any word in a page is written</font>
- To indicate whether the page needs to be written out when it is replaced
### Making address translation fast: the TLB
- Every memory access by a program can take at least twice as long
- One memory access to the page table to obtain the physical address
- A second access to get the data
- <font class='red'>Translation lookaside buffer (TLB)</font>
- A special cache that keeps track of recently used translations
- **Tag entry**: virtual page number
- **Data entry**: physical page number
- **Valid bit**, **dirty bit**, **reference bit**

- On every reference, we access the TLB instead of the page table
- To look up the virtual page number in the TLB
- **TLB hit**
- The physical page number is used to form the address
- The corresponding reference bit is turned on
- The dirty bit is turned on if the processor is performing a write
- **TLB miss**
- Mere TLB miss
- The page exists in memory, only the translation is missing in the TLB
- To load the translation from the page table and replace a TLB entry
- To copy the reference and dirty bits back to the page table entry when a TLB entry is replaced (write-back scheme)
- **True page fault**
- The page is not present in memory
- To invoke the operating system using an exception
- Mere TLB misses will be much more frequent than true page faults
- The TLB miss rate is expected to be very small
- TLB misses can be handled either in hardware or in software
- With care, there can be little performance difference between the two approaches
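To illustrate the flow just described, here is a toy, fully associative TLB sketch (our own structure and names, not the FastMATH design): look up the virtual page number, fall back to the page table on a mere TLB miss, and treat a missing page table entry as a true page fault.

```python
PAGE_OFFSET_BITS = 12                                  # assumed 4 KiB pages

class TinyTLB:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = {}                              # vpn -> ppn; insertion order tracks recency

    def translate(self, vaddr, page_table):
        vpn = vaddr >> PAGE_OFFSET_BITS
        offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
        if vpn in self.entries:                        # TLB hit
            ppn = self.entries.pop(vpn)
        elif vpn in page_table:                        # mere TLB miss: load from the page table
            ppn = page_table[vpn]
            if len(self.entries) >= self.capacity:     # replace an entry (here: the oldest)
                self.entries.pop(next(iter(self.entries)))
        else:                                          # true page fault: the OS must bring the page in
            raise RuntimeError("page fault")
        self.entries[vpn] = ppn                        # (re)insert as most recently used
        return (ppn << PAGE_OFFSET_BITS) | offset

tlb = TinyTLB()
print(hex(tlb.translate(0x12345678, {0x12345: 0xABC})))   # 0xabc678 (miss, translation loaded)
print(hex(tlb.translate(0x12345004, {0x12345: 0xABC})))   # 0xabc004 (TLB hit)
```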
- Typical values for a TLB
- TLB size: 16 ~ 512 entries
- Block size: 1 ~ 2 page table entries (typically 4 ~ 8 bytes each)
- Hit time: 0.5 ~ 1 clock cycle
- Miss penalty: 10 ~ 100 clock cycles
- Miss rate: 0.01% ~ 1%
- Associativities in TLBs
- Small and fully associative TLBs
- **Lower miss rate**
- Since the TLB is small, the cost of a **fully associative mapping** is not too high
- Support for **randomly choosing** an entry to replace, since implementing a hardware LRU scheme is too expensive
- Larger TLBs with small associativity
- Example: The Intrinsity FastMATH TLB
- 4 KiB pages, 32-bit address space $\Rightarrow$ The virtual page number is 20 bits long
- The physical address is the same size as the virtual address
- The TLB contains 16 entries, fully associative
- Each TLB entry is 64 bits wide
- Valid(1) + Dirty(1) + Tag(20) + Physical page number(20) + Others


### Integrating virtual memory, TLBs, and caches
- Data cannot be in the cache unless it is present in main memory
- The operating system helps maintain the memory hierarchy
- To flush the contents of any page from the cache when it decides to migrate that page to disk
- To modify the page tables and TLB
- To generate a page fault when the processor tries to access any data on the migrated page
- **Example: Overall operation of a memory hierarchy**
- In a memory hierarchy that includes a TLB and a cache
- Three different types of misses: **TLB miss**, **page fault**, **cache miss**
- Consider all the combinations of these three events with one or more occurring
- State whether this event can actually occur and under what circumstances

### Implementing protection with virtual memory
- To allow sharing of a single main memory by multiple processes and provide memory protection among these processes and the operating system
- To ensure one renegade process cannot write into the address space of another user process or into the operating system
- The **write access bit** in the TLB can protect a page from being written
- Basic capabilities must be provided by the operating system
- At least two modes
- **User mode**: the running process is a user process
- **Kernel (supervisor) mode**: the running process is an operating system process
- A portion of the processor state that a user process can read but not write
- Including the user/supervisor mode bit, the page table pointer, and the TLB
- The operating system uses special instructions that are only available in supervisor mode to write these elements
- Alternating between user and supervisor modes
- User mode $\rightarrow$ supervisor mode
- Using a special **system call** instruction
- The **PC** from the point of the system call is saved in the **EPC**
- Supervisor mode $\rightarrow$ user mode
- Using the **return from exception (ERET)** instruction
- To prevent a process from reading the data of another process
- Each process has its own virtual address space
- The operating system keeps the page tables organized and lets the independent virtual pages map to disjoint physical pages
- One process will not be able to access another’s data
- To place the page tables in the protected address space
- The operating system can assure safety if it prevents the user process from modifying its own page table
- The operating system must be able to modify the page tables
- When processes want to share information in a limited way
- The operating system must assist them
- It requires changing the page table
- The write access bit can be used to restrict the sharing to just read sharing
- To allow process P1 to read a page owned by process P2
- P2 would ask the operating system to let a virtual page in P1’s address space point to the same physical page that P2 wants to share
- The write **protection bit** can be used to prevent P1 from writing the data
- Any bits that determine the access rights must be included in both the page table and the TLB
### Handling TLB misses and page faults
- Two possibilities of TLB miss
- Mere TLB miss
- The page is present in memory
- Only need to create the missing TLB entry
- Page fault
- The page is not present in memory
- Need to transfer control to the operating system
- MIPS traditionally handles a TLB miss in software
- To bring in the page table entry from memory and re-execute the instruction
- To generate a page fault exception if necessary
- Using the exception mechanism to handle a TLB miss
- To interrupt the active process $\rightarrow$ to transfer control to the OS $\rightarrow$ to resume execution of the interrupted process
- The **PC** of the instruction that caused the TLB miss must be saved in the **EPC**
- The exception must be asserted by the end of the same clock cycle that the memory access occurs
- A page fault will be recognized during the clock cycle used to access memory
- The next clock cycle will begin exception processing
- Once the operating system knows the virtual address that caused the page fault (for instruction access)
- Find the location of the referenced page on disk
- Using the virtual address to look up the page table entry
- Choose a physical page to replace
- If the chosen page is dirty, it must be written out to disk
- Start a read to bring the referenced page from disk into the chosen physical page
- It may take millions of processor clock cycles
- The operating system will usually select another process to execute until the disk access completes
- Re-execute the instruction that faulted
- Restore the PC and state of the process that caused the page fault
- Reset the processor from kernel to user mode
- Access the requested page successfully and continue execution
- Page fault exceptions for data accesses are difficult to implement
- They occur in the middle of instructions
- The instruction cannot be completed before handling the exception
- The instruction must be restarted after handling the exception
- Making instructions restartable
- The exception can be handled and the instruction later continued
- For an architecture like the MIPS: **easy**
- Each instruction writes only one data item and this write occurs at the end of the instruction cycle
- Simply prevent the instruction from completing and restart it at the beginning
- For processors with complex instructions: **hard**
- Each instruction may touch many memory locations and write many data items
$\Rightarrow$ To generate a number of page faults
- The instruction often cannot be restarted from the beginning
$\Rightarrow$ To interrupt and later continue midstream in its execution
- To resume an instruction in the middle of its execution, it requires saving some special state during the execution
- Careful and detailed coordination between the exception-handling code in the operating system and the hardware
- Example: How MIPS architecture works when a TLB miss occurs
- Self-study
- If page faults occur too frequently
- A program routinely accesses more virtual memory than it has physical memory
- **Thrashing**
- A program continuously swaps pages between memory and disk
- The program will run very slowly
- Solutions
- Using a computer with more memory
- To reexamine the algorithm and data structure to change the locality
## A Common Framework for Memory Hierarchies
- Four questions apply between any two levels of memory hierarchy
- Where can a block be placed?
- Direct mapped, set associative, fully associative
- How is a block found?
- Indexing, full search, limited search, a separate lookup table
- Which block should be replaced on a cache miss?
- Random, LRU, approximated LRU
- What happens on a write?
- Write-through, write-back

:::info
**Question 1: Where can a block be placed?**
- The advantage of **higher degree of associativity**: the **lower miss rate**
- Reducing misses that compete for the same location
- The largest gains are obtained in going from one-way to two-way associative
- As cache sizes grow, the relative improvement increases only slightly
- The overall miss rate of a larger cache is lower
- The potential disadvantages of associativity
- Increased cost, **slower access time**
- The choice among three block placement schemes
- Depending on the **cost of a miss versus the cost of implementing associativity**
- Both in time and in extra hardware
- Including the **L2 cache** on the chip enables much **higher associativity**
- The **hit times are not as critical**
- Fully associative caches are prohibitive except for small sizes
- The cost of the comparators is not overwhelming
- The absolute miss rate improvements are greatest

:::
:::info
**Question 2: How is a block found?**
- **Indexing**: used in direct-mapped cache
- A few systems have used it because of its **advantages in access time and simplicity**
- Finding the requested block does not depend on a comparison
- **Limited search**: used in set-associative cache
- Often **used for caches and TLBs**
- The access combines indexing and the search of a small set
- **Full search**: used in a fully associative cache
- **A separate lookup table**
- Example: Page table for the **virtual memory** system
- **Extra storage and extra memory access** is required
- The motivation for choosing fully-associative placement and the extra table
- Fully associativity is beneficial since **misses are very expensive**
- Fully associativity allows the software to use sophisticated replacement schemes that are designed to reduce the miss rate
- The full map can be easily indexed with no extra hardware and no searching required

:::
:::info
**Question 3: Which block should be replaced on a cache miss?**
- Primary replacement strategies in set-associative or fully associative caches
- Random
- Candidate blocks are randomly selected
- Simple to build in hardware
- **Least recently used (LRU)**
- To replace the block that has been unused for the longest time
- Too costly to implement for caches with more than a small degree of associativity
- LRU is often **approximated**
- **Used for four-way or larger set-associative caches**
- Tracking which pair of blocks is LRU
- Tracking which block in each pair is LRU
- For larger associativity, either approximated LRU or random replacement can be used
- **Random can sometimes be better** than the simple LRU approximations
- As the cache becomes larger
- The miss rate for both replacement strategies falls
- The absolute difference becomes small
- In **virtual memory**
- **A tiny reduction in the miss rate can be important**
- Approximated LRU implemented in software is acceptable
- Misses are so expensive and relatively infrequent
:::
:::info
**Question 4: What happens on a write?**
- Write-through vs. Write-back
- **Only write-back policy is practical in virtual memory systems**
- Today, the lowest-level caches typically use write-back
- Advantages of both policies
- Write-through
- **Misses are simpler and cheaper**
- Easier to implement although a write buffer is required
- Write-back
- Individual words can be **written at the rate that the cache can accept** them
- Multiple writes within a block **require only one write to the lower level** in the hierarchy
- To make effective use of a high-bandwidth transfer since the entire block is written
- The block is written back to memory only once, when it is replaced
:::
### The three Cs: An intuitive model for understanding the behavior of memory hierarchies
- **Three Cs model**
- A qualitative cache model in which all cache misses are classified into one of three categories
- <font class='red'>Compulsory misses (cold-start misses)</font>
- Caused by the **first access to a block** that has never been in the cache
- <font class='red'>Capacity misses</font>
- Caused when the **cache cannot contain all the blocks** needed during execution
- To occur when blocks are replaced and then later retrieved
- Usually the largest fraction of the three miss types
- <font class='red'>Conflict misses (collision misses)</font>
- Occur in set-associative or direct-mapped caches when **multiple blocks compete for the same set**
- Can be **eliminated in a fully associative** cache of the same size (a fully associative cache can place a block wherever there is room, so it has no conflict misses)

- Methods to attack three categories of misses
- **Compulsory misses**
- To **increase the block size**
- Blocks that are too large lead to too high a miss penalty
- **Capacity misses**
- To **enlarge the cache**
- The increased access time could lead to lower overall performance
- Second-level caches have been growing steadily larger for many years
- First-level caches have been growing slowly, if at all
- **Conflict misses**
- To **increase associativity**
- May slow the access time and lower overall performance
