<style>
.red { color: #e23e30; font-weight: bold; }
.green { color: #269647; font-weight: bold; }
.blue { color: #2c7bed; font-weight: bold; }
</style>

# Large and Fast: Exploiting Memory Hierarchy

## Introduction
- A library example

### Principle of locality
- Programs **access a relatively small portion** of their address space at any instant of time
- It is **impossible** to make most memory **accesses fast and still have large memory** in computers
- **Temporal locality** (locality in time)
  - If an item is referenced, it will tend to be **referenced again soon**
  - Ex: loops
- **Spatial locality** (locality in space)
  - If an item is referenced, items **whose addresses are close by** will tend to be **referenced soon**
  - Ex: program (instructions) and array data are normally accessed sequentially

### Memory hierarchy
- The basic structure
  - Multiple levels of memory with **different speeds and sizes**
  - The faster memories are more expensive per bit than the slower memories and are thus smaller
  - **The faster memory is close to the processor**, and the slower, less expensive memory is below it
  - To present the user with as much memory as is available in the cheapest technology
  - To provide access at the speed offered by the fastest memory
- The **data** is similarly hierarchical
  - All the data is stored at the lowest level
  - **Inclusion property**
    - A level closer to the processor is generally a subset of any level further away
![image](https://hackmd.io/_uploads/HJqL-oTB6.png)
- Data is ==**copied between only two adjacent levels**== at a time
  - The upper level is smaller, faster, and more expensive
- **Block (line)**
  - The **minimum unit** of information that can be either present or not present in a cache
- **Hit vs. Miss**
  - ***Hit***: the data requested by the processor **appears in some block** in the upper level
  - ***Miss***: the requested data is **not found** in the upper level
- **Hit rate (hit ratio) vs. Miss rate**
  - The fraction of memory accesses found (or not found) in the upper level
  - <font class='red'>Hit rate + Miss rate = 1</font>
- **Hit time**
  - The time to access the upper level of the memory hierarchy
  - **Including the time needed to determine** whether the access is a hit or a miss
- **Miss penalty** (Miss time)
  - The time to replace a block in the upper level
  - **Including the time to deliver** this block to the processor
  - Hit time will be **much smaller** than the time to access the next level in the hierarchy
![image](https://hackmd.io/_uploads/Hk9rGspSp.png)

### Memory systems are critical to performance
- All programs **spend much of their time accessing memory**
- The memory system is necessarily a major factor in determining performance
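A minimal Python sketch of the two kinds of locality described above; the matrix size and traversal order are made-up values for illustration, not something from the notes:

```python
# Illustrative only: the same data read with good vs. poor spatial locality.
N = 512
matrix = [[i * N + j for j in range(N)] for i in range(N)]

total = 0
for i in range(N):              # temporal locality: total, i, j are reused on every iteration
    for j in range(N):
        total += matrix[i][j]   # spatial locality: matrix[i][0..N-1] are touched sequentially,
                                # so neighbours brought in as part of a block are used soon

# Traversing column-by-column (matrix[j][i]) touches addresses far apart on
# consecutive iterations, so it exhibits much less spatial locality.
```

In a compiled language, the column-order traversal is typically slower for large matrices precisely because each block fetched from memory contributes only one useful word before it is evicted.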
### Memory hierarchy vs. Locality
- Programs exhibit both temporal locality and spatial locality
- Temporal locality
  - To **keep more recently accessed data items closer** to the processor
  - Accesses that **hit in the highest level** of the hierarchy can be processed quickly
- Spatial locality
  - To move blocks consisting of **multiple contiguous words** in memory to upper levels of the hierarchy
- In most systems, ==**data cannot be present in level `i` unless it is also present in level `i + 1`**==
![image](https://hackmd.io/_uploads/BkdCPxRST.png)

## The Basics of Caches

### A simple cache
- A **safe place** to store things that we need to examine
- The name given to the level of the memory hierarchy **between the processor and main memory**
- Two questions to answer
  - How do we know if a data item is in the cache?
  - If it is, how do we find it?
![image](https://hackmd.io/_uploads/SkoqOl0Hp.png)
- **Direct-mapped cache**
  - Each memory location is mapped to **exactly one** location in the cache
  - To assign the cache location based on the memory address of the word
    - (block address) **modulo** (number of cache blocks in the cache)
  - It is convenient for the number of entries in the cache to be a **power of 2**
  - **Different memory locations will be mapped to the same cache location**
- **Tags**
  - To identify whether a word in the cache **corresponds to the requested word**
  - Only contains the upper portion of the address, which is ==**not used as an index**== into the cache
- **Valid bit**
  - To indicate whether an entry **contains a valid address**
![image](https://hackmd.io/_uploads/rySKAgRHp.png)
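A minimal sketch of the direct-mapped placement rule and tag just described; the 8-block cache size and the example block addresses are assumptions for illustration:

```python
NUM_BLOCKS = 8                           # assumed number of cache blocks (a power of 2)

def direct_map(block_address):
    index = block_address % NUM_BLOCKS   # (block address) modulo (number of cache blocks)
    tag = block_address // NUM_BLOCKS    # upper address bits, stored to identify the block
    return index, tag

# Two different memory blocks that map to the same cache location:
print(direct_map(22))   # (6, 2)
print(direct_map(14))   # (6, 1)  -> same index, distinguished only by the tag
```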
### Accessing a Cache
- An example: 8-block direct-mapped cache
  - Each block contains one word
  - The low-order 3 bits of an address give the block number
![image](https://hackmd.io/_uploads/HJXmN39Lp.png)
![image](https://hackmd.io/_uploads/SJxVEsIOT.png)
![image](https://hackmd.io/_uploads/S1fL42q8p.png)
![image](https://hackmd.io/_uploads/ryg6UVh5U6.png)
![image](https://hackmd.io/_uploads/HyLDV3cUa.png)
![image](https://hackmd.io/_uploads/SJCwN3cIT.png)
![image](https://hackmd.io/_uploads/BJLO42qLa.png)
![image](https://hackmd.io/_uploads/Sy6d42c8p.png)
- A referenced address is divided into
  - A ***cache index***
    - Used to select the cache block
    - The total number of entries in a direct-mapped cache must be a power of 2
  - A ***tag field***
    - Used to compare with the value of the tag field of the cache
  - The least significant 2 bits are ignored when selecting a word in the block
    - Words are aligned to multiples of 4 bytes in the MIPS architecture
![image](https://hackmd.io/_uploads/HkNMLpIuT.png)
- An example cache: the Intrinsity FastMATH processor
  - An embedded microprocessor that uses the MIPS architecture and a simple cache implementation
  - Processor
    - 12-stage pipeline
    - When operating at peak speed, both an instruction word and a data word can be requested on every clock
  - Separate instruction and data caches
    - Each is 16 KiB (4096 words) with 16-word blocks
    - Separate control signals to read and write each cache
    - A **block offset** field is used to select the requested word from the 16 words in the indexed block
    - To offer both **write-through** and **write-back** strategies
    - One-entry write buffer
![image](https://hackmd.io/_uploads/S1TgO39Ip.png)
![image](https://hackmd.io/_uploads/rJnZO25Ia.png)
- The total number of bits needed for a cache
  - Example
    - A direct-mapped cache
    - 32-bit byte address, $2^n$ blocks with $2^m$-word blocks
    - $\Rightarrow$ Tag size: $32 - (n + m + 2)$ bits
    - **Total number of bits**: $2^n \times \text{(block size + tag size + valid field size)}$
      - $= 2^n \times (2^m \times 32 + (32 - n - m - 2) + 1) = 2^n \times (2^m \times 32 + 31 - n - m)$
    - Usually only the size of the data is counted
![image](https://hackmd.io/_uploads/Hy9n5h9Up.png)
- Example: Bits in a cache
  - Problem
    - Assuming a 32-bit address
    - How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks?
  - Solution
    - 16 KiB data, 4-word blocks $\Rightarrow 2^{10}$ blocks
    - Tag size: $32 - 10 - 2 - 2 = 18$ bits
    - Each block has $4 \times 32 = 128$ bits
    - Total cache size = $2^{10} \times (128 + 18 + 1) = 2^{10} \times 147$ bits $= 147$ Kibits (about 18.4 KiB)
    - Total size: 18.4 KiB, Data size: 16 KiB $\Rightarrow$ 1.15 times
- **Example: Mapping an address to a multiword cache block**
  - Problem
    - Consider a cache with 64 blocks and a block size of 16 bytes
    - What block number does byte address 1200 map to?
  - Solution
    - The cache block is given by: (Block address) **modulo** (Number of cache blocks)
    - The address of the block: $\frac{\text{Byte address}}{\text{Bytes per block}}$
      - Block $\frac{\text{Byte address}}{\text{Bytes per block}}$ contains all addresses between $\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor \times \text{Bytes per block}$ and $\lfloor\frac{\text{Byte address}}{\text{Bytes per block}}\rfloor \times \text{Bytes per block} + (\text{Bytes per block} - 1)$
    - Byte address 1200 is block address $\lfloor\cfrac{1200}{16}\rfloor = 75$, which maps to cache block number (75 **modulo** 64) = 11
    - This block contains all byte addresses between 1200 and 1215
- When the cache block size becomes larger
  - Larger blocks exploit **spatial locality** to lower miss rates
  - The miss rate may go up if the block size becomes too large, because the number of blocks that can be held in the cache becomes small
  - Increasing the block size will **increase the miss penalty**
    - Miss penalty: the latency to the first word + the transfer time for the rest of the block
    - The transfer time will increase as the block size grows
  - $\Rightarrow$ To increase the block size and still obtain good performance
    - Design the memory system to **transfer larger blocks more efficiently**, so that the miss penalty does not grow too large
![image](https://hackmd.io/_uploads/SkRSfT5UT.png)
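The two worked examples above can be checked with a short sketch; the function names and parameters are mine, not part of the notes:

```python
# Total bits in a direct-mapped cache with 2^n blocks, 2^m words per block, 32-bit addresses.
def cache_bits(n, m, addr_bits=32):
    tag_bits = addr_bits - n - m - 2                 # 2 bits of byte offset
    bits_per_block = (2 ** m) * 32 + tag_bits + 1    # data + tag + valid bit
    return (2 ** n) * bits_per_block

# "Bits in a cache": 16 KiB of data, 4-word blocks -> n = 10, m = 2
total = cache_bits(10, 2)
print(total, total / 1024)               # 150528 bits = 147 Kibits (about 18.4 KiB)

# "Mapping an address to a multiword cache block": 64 blocks, 16-byte blocks
block_address = 1200 // 16               # 75
print(block_address % 64)                # cache block 11
```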
### Handling cache misses
- When a cache miss occurs
  - Done with the processor control unit and a separate controller
  - To **stall** the processor and wait for memory
    - To freeze the contents of the temporary and programmer-visible registers
    - More sophisticated out-of-order processors can allow execution of instructions while waiting for a cache miss
  - To initiate the memory access and **refill** the cache
- How instruction misses are handled for the pipelined datapath
  - Send the original **PC** (current **PC** - 4) to the memory
  - Instruct memory to perform a read and wait for the memory to complete its access
  - Write the cache entry
    - Putting the data in the data portion of the entry
    - Writing the upper bits of the address into the tag field
    - Turning the valid bit on
  - Restart the instruction execution at the first step (instruction refetch)
- The control of the cache on a data access is essentially identical

### Handling Writes
- The cache and memory may be **inconsistent**
  - Data is written into only the data cache without changing memory
- **Write-through**
  - Always write the data into both the memory and the cache
  - Simple but not efficient
  - **Write buffer**
    - To store the data while it is waiting to be written to memory
    - The processor can continue execution after writing the data into the write buffer
    - A write buffer entry is freed when a write to main memory completes
    - If the write buffer is full when the processor reaches a write
      - The processor must stall until there is an empty position in the write buffer
      - Using a deeper write buffer reduces the occurrence of such stalls
- **Write-back**
  - Data is written only to the block in the cache
  - The modified block is written to the memory when it is replaced
  - Complex but efficient

### Elaboration
- **Write allocate** cache vs. **No write allocate** cache
  - For a **write-through** cache
  - Write allocate cache
    - The missed block is fetched from memory and **a block is then allocated in the cache**
    - The appropriate portion of the block in the cache is overwritten
    - Fetch the block from memory into the cache → modify the cache → write the data to memory
  - No write allocate cache
    - To update the portion of the block in memory but **not put it in the cache**
    - Modify memory directly
- Combined cache vs. **Split cache**
  - Combined cache
    - A single cache stores both instructions and data
    - **Better hit rate**
  - Split cache
    - Two independent caches that operate in parallel, with one handling instructions and one handling data
    - Slightly increased miss rate
    - Doubling the cache **bandwidth**
  - Almost all **processors today use split caches** to match what modern pipelines expect (only the L1 cache is split; L2 and below are combined, because the larger sizes at lower levels already keep the miss rate low)

## Measuring and Improving Cache Performance

### To measure and analyze cache performance
- To compute the CPU time
  - **CPU time = (CPU execution clock cycles + <font class='red'>Memory-stall clock cycles</font>) $\times$ Clock cycle time**
  - The costs of cache hits are part of the normal CPU execution cycles
- <font class='red'>Memory-stall clock cycles</font> = <font class='blue'>Read-stall cycles</font> + <font class='green'>Write-stall cycles</font>
  - <font class='blue'>Read-stall cycles</font> = $\frac{\text{Reads}}{\text{Program}}$ $\times$ Read miss rate $\times$ Read miss penalty
  - <font class='green'>Write-stall cycles</font> = ($\frac{\text{Writes}}{\text{Program}}$ $\times$ Write miss rate $\times$ Write miss penalty) + Write buffer stalls (write-through)
    - **Write-through**: with a reasonable write buffer depth, write buffer stalls will be small and can be safely ignored
    - **Write-back**: additional stalls arise from the need to write a cache block back to memory when the block is replaced
- <font class='red'>Memory-stall clock cycles</font> = $\frac{\text{Memory accesses}}{\text{Program}}$ $\times$ Miss rate $\times$ Miss penalty = $\frac{\text{Instructions}}{\text{Program}}$ $\times$ $\frac{\text{Misses}}{\text{Instruction}}$ $\times$ Miss penalty
  - The **read and write miss penalties are the same** in most write-through cache organizations
  - The write buffer stalls are **negligible**
- **Example: Calculating cache performance**
  - **Problem**
    - Instruction cache miss rate: 2%
    - Data cache miss rate: 4%
    - A processor has a CPI of 2 without any memory stalls
    - Miss penalty: 100 cycles for all misses
    - Assume the frequency of all loads and stores is 36%
    - How much faster would the processor run with a perfect cache that never misses?
  - **Solution**
    - Assume there are $I$ instructions in total
    - Instruction miss cycles = $I \times 2\% \times 100 = 2 \times I$
    - Data miss cycles = $I \times 36\% \times 4\% \times 100 = 1.44 \times I$
    - Total number of memory-stall cycles: $(2 + 1.44) \times I = 3.44 \times I$
    - CPI with memory stalls: 2 + 3.44 = 5.44
    - The ratio of the CPU execution times is: $\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{I \times CPI_{stall} \times \text{Clock cycle}}{I \times CPI_{perfect} \times \text{Clock cycle}} = \frac{CPI_{stall}}{CPI_{perfect}} = \frac{5.44}{2} = 2.72$
- If the processor is made faster but the memory system is not
  - The amount of time spent on memory stalls will take up an increasing fraction of the execution time
  - In the previous example, suppose the CPI is reduced from 2 to 1
    - CPI of the system with cache misses: 1 + 3.44 = 4.44
    - The system with the perfect cache would be 4.44 (4.44 / 1) times faster
    - Execution time spent on memory stalls: 63% (CPI = 2) vs. 77% (CPI = 1)
  - Increasing the clock rate without changing the memory system also increases the performance lost due to cache misses
- A **larger cache** could have a **longer access time** (an increase in hit time)
  - A longer time to access a word from the memory system
  - An increase in the processor cycle time
  - Likely adds another stage to the pipeline
  - It may take **multiple cycles for a cache hit**
  - The increase in hit time for a larger cache could dominate the improvement in hit rate, leading to a decrease in processor performance
- ***Average memory access time (AMAT)***
  - <font class='red'>$AMAT = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}$</font>
  - The average time to access memory considering both hits and misses and the frequency of different accesses
  - Example
    - **Problem**
      - A processor with a 1 ns clock cycle time
      - Miss penalty: 20 clock cycles
      - Miss rate: 0.05 misses per instruction
      - Cache access time (including hit detection): 1 clock cycle
      - Find the AMAT for this processor
    - **Solution**

$$
\begin{align}
AMAT &= \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty} \\
&= 1 + 0.05 \times 20 \\
&= \text{2 clock cycles} \\
&= 2 \text{ ns}
\end{align}
$$
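A small sketch reproducing the stall-cycle and AMAT arithmetic from the two examples above; the helper name and parameter names are hypothetical:

```python
# CPI with memory stalls, for the "Calculating cache performance" example.
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, loads_stores_frac, miss_penalty):
    i_stalls = i_miss_rate * miss_penalty                      # stall cycles per instruction
    d_stalls = loads_stores_frac * d_miss_rate * miss_penalty  # stall cycles per instruction
    return base_cpi + i_stalls + d_stalls

cpi = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)
print(cpi, cpi / 2)          # 5.44, and 2.72x slower than with a perfect cache

# AMAT for the second example: 1-cycle hit, 0.05 miss rate, 20-cycle miss penalty.
print(1 + 0.05 * 20)         # 2 clock cycles = 2 ns at a 1 ns clock cycle time
```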
### Reducing cache misses by more flexible placement of blocks
- Different placement schemes
  - ***Direct mapped***
    - A block can go in exactly one place in the cache
    - ***Position***: (Block number) **modulo** (Number of cache blocks)
  - <font class='blue'>*Fully associative*</font>
    - A block can be placed in **any location** in the cache
    - **All tags** of all the blocks in the cache **must be searched** (in parallel)
    - Comparators significantly increase the hardware cost
    - Only practical for caches with **small numbers of blocks**
  - <font class='green'>*Set associative*</font>
    - Each block can be placed in a fixed number of locations
    - <font class='green'>n-way set-associative</font> cache
      - A cache with **n locations for a block**
      - A cache consists of a number of ***sets***, each of which consists of n blocks
      - Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set
      - All tags of all the elements of the set must be searched
    - ***Position***: (Block number) **modulo** (Number of sets in the cache)
![image](https://hackmd.io/_uploads/rkT4AwIup.png)
- To increase the degree of associativity
  - A **direct-mapped** cache is simply a **one-way** set-associative cache
  - A fully associative cache with *m* entries is an m-way set-associative cache with only one set of m blocks
  - **Advantage**: decreases the miss rate
  - **Disadvantage**: increases the hit time
![image](https://hackmd.io/_uploads/rJoOzuLua.png)
- **Example: Misses and associativity in caches**
  - ***Problem***
    - Three small caches, each consisting of four one-word blocks
    - Direct mapped, two-way set associative, fully associative
    - Given a sequence of block addresses: 0, 8, 0, 6, 8
    - Find the number of misses for each cache organization
  - ***Solution*** (a simulation sketch at the end of this section reproduces these counts)
    - Direct-mapped: 5 misses
![image](https://hackmd.io/_uploads/B1Hm4O8u6.png)
    - Two-way set associative: 4 misses
      - Using the ***least recently used (LRU)*** replacement rule
![image](https://hackmd.io/_uploads/ByU5VuL_6.png)
    - Fully associative: 3 misses
      - The best we can do
![image](https://hackmd.io/_uploads/rJO_Hd8_T.png)
- Reduction in the miss rate vs. associativity
  - 64 KiB data cache, 16-word blocks
  - Going from one-way to two-way associativity decreases the miss rate by about 15%
  - Little further improvement in going to higher associativity
![image](https://hackmd.io/_uploads/ryUWj_UOa.png)
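A tiny simulator for the example above (the reference sequence 0, 8, 0, 6, 8 on a four-block cache), using LRU replacement within each set; this is a sketch of the bookkeeping, not of real hardware:

```python
def count_misses(num_blocks, ways, block_refs):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set lists resident blocks, LRU first
    misses = 0
    for block in block_refs:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)                # hit: re-appended below as most recently used
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict the least recently used block in the set
        s.append(block)
    return misses

refs = [0, 8, 0, 6, 8]
print(count_misses(4, 1, refs))   # direct mapped       -> 5 misses
print(count_misses(4, 2, refs))   # two-way set assoc.  -> 4 misses
print(count_misses(4, 4, refs))   # fully associative   -> 3 misses
```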
### Locating a block in the cache
- Address decomposition
  - **Index**: used to select the set containing the address of interest (n bits)
  - **Tag**: all the tags in the selected set are searched in parallel
  - **Block offset**: block offset (m bits) + byte offset (2 bits)
- Costs of an associative cache
  - Extra comparators and an n-to-1 multiplexor
  - Any delay imposed by having to do the compare and select from among the elements of the set
- The choice among the three placement schemes
  - Depends on the cost of a miss versus the cost of implementing associativity
    - Both in time and in extra hardware
![image](https://hackmd.io/_uploads/rJt2pOUup.png)
![image](https://hackmd.io/_uploads/BkUZRuLOT.png)

---

- **Example: Size of tags versus set associativity**
  - ***Problem***
    - A cache of 4096 blocks, 4-word block size, 32-bit address
    - Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way set associative, four-way set associative, and fully associative
  - ***Solution***
    - 32-bit address: 4 bits for the block offset (including the byte offset), 28 bits for index and tag
    - ***Direct mapped***: 4096 sets
      - *Index*: $\log_2(4096)$ = 12 bits
      - *Tag*: (28 - 12) $\times$ 4096 = 64 Kibits ≈ 66 Kbits
    - ***Two-way set associative***: 2048 sets
      - *Index*: $\log_2(2048)$ = 11 bits
      - *Tag*: (28 - 11) $\times$ 2 $\times$ 2048 = 68 Kibits ≈ 70 Kbits
    - ***Four-way set associative***: 1024 sets
      - *Index*: $\log_2(1024)$ = 10 bits
      - *Tag*: (28 - 10) $\times$ 4 $\times$ 1024 = 72 Kibits ≈ 74 Kbits
    - ***Fully associative***: 1 set
      - *Index*: $\log_2(1)$ = 0 bits
      - *Tag*: (28 - 0) $\times$ 4096 $\times$ 1 = 112 Kibits ≈ 115 Kbits

### Choosing which block to replace
- To choose a block among the blocks in the selected set to replace
- <font class='blue'>*Least recently used (LRU)*</font>
  - The block replaced is the one that has been unused for the longest time
  - Implemented by keeping track of when each element in a set was used relative to the other elements in the set
  - As associativity increases, implementing LRU gets harder
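The tag-storage figures in the example above can be reproduced with a short sketch (the function and parameter names are assumptions):

```python
# Total tag storage for a 4096-block cache with 4-word blocks and 32-bit addresses.
def tag_storage(total_blocks, ways, addr_bits=32, offset_bits=4):
    sets = total_blocks // ways
    index_bits = sets.bit_length() - 1           # log2(sets); sets is a power of 2
    tag_bits = addr_bits - offset_bits - index_bits
    return sets, tag_bits * ways * sets          # (number of sets, tag bits in total)

for ways in (1, 2, 4, 4096):                     # direct mapped ... fully associative
    sets, bits = tag_storage(4096, ways)
    print(f"{ways:4d}-way: {sets:4d} sets, {bits // 1024} Kibits of tags")
# prints 64, 68, 72, and 112 Kibits respectively
```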
### Reducing the miss penalty using multilevel caches
- ***Multilevel cache***
  - A memory hierarchy with multiple levels of caches
  - To close the gap further between the fast clock rates of modern processors and the increasingly long time required to access DRAMs
- Using a second-level cache
  - Normally on the same chip as the processor and primary cache
  - To be accessed when a miss occurs in the primary cache
  - Its access time is much less than the access time of main memory
  - If neither the **primary** nor the **secondary** cache contains the data, a main memory access is required, incurring a larger miss penalty
- **Example: Performance of multilevel caches**
  - ***Problem***
    - A processor with a base CPI of 1.0 and a clock rate of 4 GHz
    - Main memory access time: 100 ns
    - Secondary cache access time: 5 ns
    - Miss rate per instruction at the primary cache: 2%
    - After adding a secondary cache, the miss rate to main memory is 0.5%
    - How much faster will the processor be with a secondary cache than without one?
  - ***Solution 1***
    - Miss penalty to main memory: 100 ns $\Rightarrow$ 400 clock cycles
    - Miss penalty to the secondary cache: 5 ns $\Rightarrow$ 20 clock cycles
    - For the processor with one level of caching
      - Total CPI = Base CPI + Memory-stall cycles per instruction
      - Total CPI = 1.0 + 2% $\times$ 400 = 9
    - For the processor with two levels of caching
      - Total CPI = Base CPI + Primary stalls per instruction + Secondary stalls per instruction
      - Total CPI = 1 + 2% $\times$ 20 + 0.5% $\times$ 400 = 3.4
    - *Relative performance*: 9 / 3.4 = 2.6
  - ***Solution 2***
    - For the processor with two levels of caching
      - Total CPI = 1 + (2% - 0.5%) $\times$ 20 + 0.5% $\times$ (20 + 400) = 3.4
- Design considerations for a primary and a secondary cache
  - **Primary cache**
    - To **focus on minimizing hit time** to yield a shorter clock cycle or fewer pipeline stages
    - <font class='red'>Smaller cache size, higher miss rate</font>
    - <font class='blue'>Smaller block size to reduce the miss penalty</font>
  - **Secondary cache**
    - To **focus on miss rate** to reduce the penalty of long memory access times
    - Larger cache size; the access time is less critical
    - Larger block size
    - Higher associativity to reduce miss rates

### Elaboration
- ***Global miss rate***
  - The fraction of references that miss in all levels of a multilevel cache
- ***Local miss rate***
  - The fraction of references to one level of a cache that miss; used in multilevel hierarchies
- The term *miss rate* usually refers to the global miss rate
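A quick check of the multilevel-cache example above; the clock rate and penalties come from the problem statement, while the variable names are mine:

```python
cycle_ns = 1 / 4                        # 4 GHz clock -> 0.25 ns per cycle
main_penalty = round(100 / cycle_ns)    # 400 cycles to main memory
l2_penalty = round(5 / cycle_ns)        # 20 cycles to the secondary cache

cpi_one_level = 1.0 + 0.02 * main_penalty                        # 9.0
cpi_two_levels = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty  # 3.4
print(cpi_one_level / cpi_two_levels)                            # about 2.6
```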
## Virtual Memory

### Preliminary
- ***Virtual memory***
  - A technique that uses main memory as a **cache** for secondary storage
- **Two major motivations** for virtual memory
  - To allow efficient and safe sharing of memory among multiple programs
    - Such as the memory needed by multiple virtual machines for cloud computing
  - To remove the programming burden of a small, limited amount of main memory
- The first motivation
  - Consider multiple virtual machines running at once on a computer
    - The total required memory size may be much larger than the available size
    - Main memory need **contain only the active portions** of the many virtual machines
      - Only a fraction of the total required memory is actively being used at any point in time
  - Must be able to protect the virtual machines from each other
    - To ensure that a **program can only access the portions of main memory that have been assigned to it**
    - The virtual machines sharing the memory change dynamically while the virtual machines are running
  - To compile each program into its own **address space**
    - Virtual memory implements the translation of a program's address space to **physical addresses**
    - The translation process enforces **protection** of a program's address space from other virtual machines
- The second motivation
  - If a program became too large for memory, it was up to the programmer to make it fit
    - Programmers divided programs into mutually exclusive pieces
    - **Overlays** were loaded or unloaded under user program control during execution
      - The program never tried to access an overlay that was not loaded
      - The overlays loaded **never exceeded the total size of the memory**
      - Overlays were traditionally organized as modules, each containing both code and data
  - Virtual memory automatically manages the main memory (***physical memory***) and secondary storage

### The structure of virtual memory
- Terminology in virtual memory systems
  - **Page**: a virtual memory block
  - **Page fault**: a virtual memory miss
  - **Virtual address**
    - An address that corresponds to a location in virtual space
    - To be translated by address mapping to a **physical address** when memory is accessed
  - **Address translation (address mapping)**
    - The process by which a virtual address is mapped to an address used to access memory
![image](https://hackmd.io/_uploads/r1YhK2Uda.png =400x)

---

- **Relocation**
  - To map the virtual addresses to different physical addresses before the addresses are used to access memory
  - A program can be loaded anywhere in main memory
  - All virtual memory systems relocate the program as a set of pages
    - To eliminate the need to find a contiguous block of memory to allocate to a program
    - The operating system only needs to find a sufficient number of pages in main memory
- Virtual address format
  - ***Page offset***
    - Not changed during address translation
    - **Determines the page size**
  - ***Virtual page number***
    - To be translated to a **physical page number** during address translation
  - The number of pages addressable with the virtual addresses need not match the number addressable with the physical addresses
    - The number of virtual pages is larger
![image](https://hackmd.io/_uploads/ryicdCUup.png =400x)
- A **page fault will take millions of clock cycles** to process
  - Dominated by the time to get the first word for typical page sizes
- Key decisions in designing virtual memory systems
  - Pages should be large enough to amortize the high access time
    - Typical size: 4 KiB ~ 16 KiB
    - New desktop and server systems: support for 32 KiB and 64 KiB
    - New embedded systems: 1 KiB
  - Organizations that reduce the page fault rate are attractive
    - To allow ==**fully associative**== placement (the lowest miss rate)
  - Page faults can be handled in software
    - The overhead will be small compared to the disk access time
    - To use **clever algorithms** for choosing how to place pages
  - Write-through will not work for virtual memory
    - Writes take too long
    - Virtual memory systems use ==**write-back**==

### Placing a page and finding it again
- Using **fully associative** placement of pages
  - To **reduce page fault** frequency
  - To use a clever and flexible replacement scheme
    - To use a sophisticated algorithm and complex data structures to track page usage
    - To try to choose a page that will not be needed for a long time
- <font class='red'>Page table</font>
  - The table containing the virtual to physical address translations
  - Resides in memory, to avoid a full search of memory
  - **Indexed with the page number** from the virtual address to discover the corresponding physical page number
  - **Each program has its own page table**
    - **Different programs use the same virtual addresses**
  - The page table may contain entries for pages not present in memory
    - A <font class='blue'>valid bit</font> is used in each page table entry
    - If the bit is 1, the page is already in memory and the entry stores the physical address; if it is 0, the page is still on disk and the entry records how to find that page on disk
  - **No tags** are required
    - The page table contains a mapping for every possible virtual page
    - Each virtual page number corresponds to exactly one page table entry, so no tag is needed
- ***Page table register***
  - A register that points to the start of the page table
![image](https://hackmd.io/_uploads/HkYszkD_T.png)
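A minimal sketch of the translation path just described, with a dictionary standing in for the page table; the page size of 4 KiB and the table contents are assumptions:

```python
PAGE_BITS = 12                                       # 4 KiB pages
page_table = {0x00000: 0x12345, 0x00001: 0x0BEEF}    # virtual page number -> physical page number

def translate(virtual_address):
    vpn = virtual_address >> PAGE_BITS                  # virtual page number indexes the table
    offset = virtual_address & ((1 << PAGE_BITS) - 1)   # page offset is copied unchanged
    if vpn not in page_table:                           # an invalid entry would trigger a page fault
        raise LookupError("page fault")
    return (page_table[vpn] << PAGE_BITS) | offset

print(hex(translate(0x00001ABC)))   # 0xbeefabc: VPN 1 -> PPN 0x0BEEF, offset 0xABC unchanged
```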
### Page faults
- A page fault occurs if the valid bit for a virtual page is off
  - The operating system gets control, via the exception mechanism
  - To find the page in the next level of the hierarchy and decide where to place the requested page
  - Must keep track of the location on disk of each page in the virtual address space
![image](https://hackmd.io/_uploads/r1Zs7yw_a.png =400x)
- When a process is created, the operating system creates
  - A **swap space** on flash memory or disk
    - The space reserved for the full virtual memory space of a process
  - A data structure to record where each virtual page is stored on disk
    - Part of the page table
    - Or an auxiliary data structure indexed in the same way as the page table
  - The operating system tracks which processes and which virtual addresses use each physical page
- When a page fault occurs and all pages in memory are in use
  - Usually follows the ***least recently used (LRU)*** replacement scheme
  - The replaced pages are written to swap space on the disk
- ==**Approximate LRU**== replacement scheme
  - To keep track of which pages have and which pages have not been recently used
  - <font class='green'>Use bit (reference bit)</font>
    - To be **set whenever a page is accessed**
  - <font class='green'>The operating system periodically clears the reference bits</font>
    - To determine which pages were touched during a particular time period
  - The operating system will replace a page whose reference bit is off
- **Example: Compute the total page table size**
  - **Problem**
    - 32-bit virtual address, 4 KiB pages, 4 bytes per page table entry
    - Compute the total page table size
  - **Solution**
    - Number of page table entries = $2^{32} / 2^{12} = 2^{20}$
    - Size of the page table = $2^{20} \times 4$ bytes = 4 MiB
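The page-table-size example above as a short sketch (the helper name is mine):

```python
def page_table_bytes(virtual_addr_bits, page_bytes, pte_bytes):
    entries = 2 ** virtual_addr_bits // page_bytes   # one entry per virtual page
    return entries * pte_bytes

print(page_table_bytes(32, 4096, 4) // 2 ** 20, "MiB")   # 4 MiB
```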
### What about writes?
- Virtual memory must use a write-back scheme
  - Writes to the disk can take millions of processor clock cycles
  - Using a write-through scheme (even with a write buffer) is impractical
- <font class='blue'>Dirty bit</font>
  - The write-back operation is still costly
  - A dirty bit is associated with each page in the page table
  - <font class='blue'>To be set if any word in a page is written</font>
  - To indicate whether the page needs to be written out when it is replaced

### Making address translation fast: the TLB
- Every memory access by a program can take at least twice as long
  - One memory access to the page table to obtain the physical address
  - A second access to get the data
- <font class='red'>Translation lookaside buffer (TLB)</font>
  - A special cache that keeps track of recently used translations
  - **Tag entry**: virtual page number
  - **Data entry**: physical page number
  - **Valid bit**, **dirty bit**, **reference bit**
![image](https://hackmd.io/_uploads/rk8F-lvOp.png)
- On every reference, we access the TLB instead of the page table
  - To look up the virtual page number in the TLB
  - **TLB hit**
    - The physical page number is used to form the address
    - The corresponding reference bit is turned on
    - The dirty bit is turned on if the processor is performing a write
  - **TLB miss**
    - A mere TLB miss
      - The page exists in memory; only the translation is missing from the TLB
      - To load the translation from the page table and replace a TLB entry
      - To copy the reference and dirty bits back to the page table entry when a TLB entry is replaced (write-back scheme)
    - A **true page fault**
      - The page is not present in memory
      - To invoke the operating system using an exception
    - Mere TLB misses will be much more frequent than true page faults
      - The TLB miss rate is expected to be very small
  - TLB misses can be handled either in hardware or in software
    - With care there can be little performance difference between the two approaches
- Typical values for a TLB
  - TLB size: 16 ~ 512 entries
  - Block size: 1 ~ 2 page table entries (typically 4 ~ 8 bytes each)
  - Hit time: 0.5 ~ 1 clock cycle
  - Miss penalty: 10 ~ 100 clock cycles
  - Miss rate: 0.01% ~ 1%
- Associativities in TLBs
  - Small and fully associative TLBs
    - **Lower miss rate**
    - Since the TLB is small, the cost of a **fully associative mapping** is not too high
    - To support **randomly choosing** an entry to replace, since implementing a hardware LRU scheme is too expensive
  - Larger TLBs with small associativity
- Example: The Intrinsity FastMATH TLB
  - 4 KiB pages, 32-bit address space $\Rightarrow$ the virtual page number is 20 bits long
  - The physical address is the same size as the virtual address
  - The TLB contains 16 entries and is fully associative
  - Each TLB entry is 64 bits wide
    - Valid (1) + Dirty (1) + Tag (20) + Physical page number (20) + Others
![image](https://hackmd.io/_uploads/rkPmLbwua.png)
![image](https://hackmd.io/_uploads/H16iU-D_a.png)

### Integrating virtual memory, TLBs, and caches
- Data cannot be in the cache unless it is present in main memory
  - The operating system helps maintain the memory hierarchy
    - To flush the contents of any page from the cache when it decides to migrate that page to disk
    - To modify the page tables and TLB
      - To generate a page fault when the processor tries to access any data on the migrated page
- **Example: Overall operation of a memory hierarchy**
  - In a memory hierarchy that includes a TLB and a cache
  - Three different types of misses: **TLB miss**, **page fault**, **cache miss**
  - Consider all the combinations of these three events with one or more occurring
  - State whether each combination can actually occur and under what circumstances
![image](https://hackmd.io/_uploads/SygQOWvua.png)
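A sketch of the TLB lookup flow described above, with dictionaries standing in for the hardware structures; the sizes and the simple eviction choice are assumptions:

```python
PAGE_BITS = 12
page_table = {i: 0x40000 + i for i in range(64)}   # hypothetical VPN -> PPN mapping
tlb = {}                                            # recently used translations
TLB_ENTRIES = 4

def access(virtual_address):
    vpn = virtual_address >> PAGE_BITS
    if vpn in tlb:                          # TLB hit: use the cached translation
        ppn = tlb[vpn]
    elif vpn in page_table:                 # mere TLB miss: fetch the translation from the page table
        ppn = page_table[vpn]
        if len(tlb) == TLB_ENTRIES:
            tlb.pop(next(iter(tlb)))        # evict some entry (real TLBs often choose randomly)
        tlb[vpn] = ppn
    else:                                   # true page fault: the operating system takes over
        raise LookupError("page fault")
    return (ppn << PAGE_BITS) | (virtual_address & ((1 << PAGE_BITS) - 1))

print(hex(access(0x3ABC)))   # first access: TLB miss, page table walk
print(hex(access(0x3ABC)))   # second access: TLB hit
```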
### Implementing protection with virtual memory
- To allow sharing of a single main memory by multiple processes and provide memory protection among these processes and the operating system
  - To ensure one renegade process cannot write into the address space of another user process or into the operating system
  - The **write access bit** in the TLB can protect a page from being written
- Basic capabilities must be provided by the operating system
  - At least two modes
    - **User mode**: the running process is a user process
    - **Kernel (supervisor) mode**: the running process is an operating system process
  - A portion of the processor state that a user process can read but not write
    - Including the user/supervisor mode bit, the page table pointer, and the TLB
    - The operating system uses special instructions that are only available in supervisor mode to write these elements
  - Alternating between user and supervisor modes
    - User mode $\rightarrow$ supervisor mode
      - Using a special **system call** instruction
      - The **PC** from the point of the system call is saved in the **EPC**
    - Supervisor mode $\rightarrow$ user mode
      - Using the **return from exception (ERET)** instruction
- To prevent a process from reading the data of another process
  - Each process has its own virtual address space
  - The operating system keeps the page tables organized so that the independent virtual pages map to disjoint physical pages
    - One process will not be able to access another’s data
  - To place the page tables in the protected address space
    - The operating system can assure safety if it prevents the user process from modifying its own page table
    - The operating system itself must still be able to modify the page tables
- When processes want to share information in a limited way
  - The operating system must assist them
    - It requires changing the page table
  - The write access bit can be used to restrict the sharing to just read sharing
  - To allow process P1 to read a page owned by process P2
    - P2 would ask the operating system to let a virtual page in P1’s address space point to the same physical page that P2 wants to share
    - The write **protection bit** can be used to prevent P1 from writing the data
  - Any bits that determine the access rights must be included in both the page table and the TLB

### Handling TLB misses and page faults
- Two possibilities for a TLB miss
  - A mere TLB miss
    - The page is present in memory
    - Only need to create the missing TLB entry
  - A page fault
    - The page is not present in memory
    - Need to transfer control to the operating system
- MIPS traditionally handles a TLB miss in software
  - To bring in the page table entry from memory and re-execute the instruction
  - To generate a page fault exception if necessary
- Using the exception mechanism to handle a TLB miss
  - To interrupt the active process $\rightarrow$ to transfer control to the OS $\rightarrow$ to resume execution of the interrupted process
  - The **PC** of the instruction that caused the TLB miss must be saved in the **EPC**
  - The exception must be asserted by the end of the same clock cycle in which the memory access occurs
    - A page fault will be recognized during the clock cycle used to access memory
    - The next clock cycle will begin exception processing
- Once the operating system knows the virtual address that caused the page fault (for an instruction access), the steps below are followed (sketched in code after this section)
  - Find the location of the referenced page on disk
    - Using the virtual address to look up the page table entry
  - Choose a physical page to replace
    - If the chosen page is dirty, it must be written out to disk
  - Start a read to bring the referenced page from disk into the chosen physical page
    - It may take millions of processor clock cycles
    - The operating system will usually select another process to execute until the disk access completes
  - Re-execute the instruction that faulted
    - Restore the PC and state of the process that caused the page fault
    - Reset the processor from kernel to user mode
    - Access the requested page successfully and continue execution
- Page fault exceptions for data accesses are difficult to implement
  - They occur in the middle of instructions
  - The instruction cannot be completed before handling the exception
  - The instruction must be restarted after handling the exception
- Making instructions restartable
  - The exception can be handled and the instruction later continued
  - For an architecture like MIPS: **easy**
    - Each instruction writes only one data item, and this write occurs at the end of the instruction cycle
    - Simply prevent the instruction from completing and restart the instruction at the beginning
  - For processors with complex instructions: **hard**
    - Each instruction may touch many memory locations and write many data items $\Rightarrow$ it may generate a number of page faults
    - The instruction often cannot be restarted from the beginning $\Rightarrow$ it must be interrupted and later continued midstream in its execution
    - Resuming an instruction in the middle of its execution requires saving some special state during the execution
    - Careful and detailed coordination between the exception-handling code in the operating system and the hardware is needed
- Example: How the MIPS architecture works when a TLB miss occurs
  - Self-study
- If page faults occur too frequently
  - A program routinely accesses more virtual memory than it has physical memory
  - **Thrashing**
    - A program continuously swaps pages between memory and disk
    - The program will run very slowly
  - Solutions
    - Use a computer with more memory
    - Re-examine the algorithm and data structures to improve the locality
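The page-fault steps listed above, condensed into a sketch; every structure and name here is hypothetical, and the actual disk transfers are elided:

```python
NUM_FRAMES = 4
page_table = {}   # resident virtual page -> physical frame
lru = []          # resident virtual pages, least recently used first
dirty = set()     # resident pages that have been written

def handle_page_fault(vpn):
    if len(lru) < NUM_FRAMES:
        frame = len(lru)                 # a physical frame is still free
    else:
        victim = lru.pop(0)              # choose a page to replace (LRU)
        frame = page_table.pop(victim)
        if victim in dirty:              # a dirty victim must first be written to swap space
            dirty.discard(victim)        # (the disk write itself is elided)
    # read the missing page from swap space into 'frame' (disk read elided),
    # update the page table, and let the faulting instruction be re-executed
    page_table[vpn] = frame
    lru.append(vpn)
    return frame
```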
## A Common Framework for Memory Hierarchies
- Four questions apply between any two levels of a memory hierarchy
  - Where can a block be placed?
    - Direct mapped, set associative, fully associative
  - How is a block found?
    - Indexing, limited search, full search, a separate lookup table
  - Which block should be replaced on a cache miss?
    - Random, LRU, approximate LRU
  - What happens on a write?
    - Write-through, write-back
![image](https://hackmd.io/_uploads/B1e4TbDuT.png)

:::info
**Question 1: Where can a block be placed?**
- The advantage of a **higher degree of associativity**: a **lower miss rate**
  - Reducing misses that compete for the same location
  - The largest gains are obtained in going from one-way to two-way associativity
  - As cache sizes grow, the relative improvement increases only slightly
    - The overall miss rate of a larger cache is lower
- The potential disadvantages of associativity
  - Increased cost, **slower access time**
- The choice among the three block placement schemes
  - Depends on the **cost of a miss versus the cost of implementing associativity**
    - Both in time and in extra hardware
  - Including the **L2 cache** on the chip enables much **higher associativity**
    - The **hit times are not as critical**
  - Fully associative caches are prohibitive except for small sizes
    - The cost of the comparators is not overwhelming
    - The absolute miss rate improvements are greatest
![image](https://hackmd.io/_uploads/BywAp-wup.png)
:::

:::info
**Question 2: How is a block found?**
- **Indexing**: used in a direct-mapped cache
  - A few systems have used it because of its **advantages in access time and simplicity**
  - Finding the requested block does not depend on a comparison
- **Limited search**: used in a set-associative cache
  - Often **used for caches and TLBs**
  - The access combines indexing and the search of a small set
- **Full search**: used in a fully associative cache
- **A separate lookup table**
  - Example: the page table of the **virtual memory** system
  - **Extra storage and an extra memory access** are required
- The motivation for choosing fully associative placement and the extra table
  - Full associativity is beneficial since **misses are very expensive**
  - Full associativity allows the software to use sophisticated replacement schemes that are designed to reduce the miss rate
  - The full map can be easily indexed, with no extra hardware and no searching required
![image](https://hackmd.io/_uploads/HyN9p-wdT.png)
:::

:::info
**Question 3: Which block should be replaced on a cache miss?**
- Primary replacement strategies in set-associative or fully associative caches
  - Random
    - Candidate blocks are randomly selected
    - Simple to build in hardware
  - **Least recently used (LRU)**
    - To replace the block that has been unused for the longest time
    - Too costly to implement for caches with more than a small degree of associativity
- LRU is often **approximated**
  - **Used for four-way or larger set-associative caches**
    - Tracking which pair of blocks is LRU
    - Tracking which block in each pair is LRU
  - For larger associativity, either approximate LRU or random replacement can be used
    - **Random can sometimes be better** than the simple LRU approximations
- As the cache becomes larger
  - The miss rate for both replacement strategies falls
  - The absolute difference becomes small
- In **virtual memory**
  - **A tiny reduction in the miss rate can be important**
  - Approximate LRU implemented in software is acceptable
    - Misses are so expensive and relatively infrequent
:::
:::info
**Question 4: What happens on a write?**
- Write-through vs. Write-back
  - **Only the write-back policy is practical in virtual memory systems**
  - Today, the lowest-level caches typically use write-back
- Advantages of both policies
  - Write-through
    - **Misses are simpler and cheaper**
    - Easier to implement, although a write buffer is required
  - Write-back
    - Individual words can be **written at the rate that the cache can accept** them
    - Multiple writes within a block **require only one write to the lower level** in the hierarchy
    - To make effective use of a high-bandwidth transfer, since the entire block is written
    - The block is written back to memory only once, when it is replaced
:::

### The three Cs: An intuitive model for understanding the behavior of memory hierarchies
- **Three Cs model**
  - A qualitative cache model in which all cache misses are classified into one of three categories
  - <font class='red'>Compulsory misses (cold-start misses)</font>
    - Caused by the **first access to a block** that has never been in the cache
  - <font class='red'>Capacity misses</font>
    - Caused when the **cache cannot contain all the blocks** needed during execution
    - Occur when blocks are replaced and then later retrieved
    - Typically the most common of the three categories
  - <font class='red'>Conflict misses (collision misses)</font>
    - Occur in set-associative or direct-mapped caches when **multiple blocks compete for the same set**
    - Can be **eliminated in a fully associative** cache of the same size (a fully associative cache can place a block in any free entry, so conflict misses cannot occur)
![image](https://hackmd.io/_uploads/SJLRkGPu6.png)
- Methods to attack the three categories of misses
  - **Compulsory misses**
    - **Increase the block size**
    - Too large a block will lead to too high a miss penalty
  - **Capacity misses**
    - **Enlarge the cache**
    - The increased access time could lead to lower overall performance
    - Second-level caches have been growing steadily larger for many years
    - First-level caches have been growing slowly, if at all
  - **Conflict misses**
    - **Increase associativity**
    - May slow the access time and lower overall performance
![image](https://hackmd.io/_uploads/SJiJxzw_p.png)
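One common way to attribute misses to the three categories is to replay the reference stream against the real cache, an infinite cache, and a fully associative cache of the same size. The sketch below does this for the small example used earlier in the chapter; the attribution rule and all names are my own simplification, not something defined in these notes.

```python
def miss_trace(num_blocks, ways, refs):
    """True at each position where the access misses (LRU within each set)."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]
    trace = []
    for b in refs:
        s = sets[b % num_sets]
        hit = b in s
        if hit:
            s.remove(b)
        elif len(s) == ways:
            s.pop(0)
        s.append(b)
        trace.append(not hit)
    return trace

def three_cs(num_blocks, ways, refs):
    seen = set()                                      # blocks ever referenced (infinite cache)
    real = miss_trace(num_blocks, ways, refs)         # the cache being analysed
    full = miss_trace(num_blocks, num_blocks, refs)   # fully associative cache of the same size
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for b, real_miss, full_miss in zip(refs, real, full):
        if real_miss:
            if b not in seen:
                counts["compulsory"] += 1             # first access to the block
            elif full_miss:
                counts["capacity"] += 1               # would miss even with full associativity
            else:
                counts["conflict"] += 1               # blocks competing for the same set
        seen.add(b)
    return counts

print(three_cs(4, 1, [0, 8, 0, 6, 8]))   # {'compulsory': 3, 'capacity': 0, 'conflict': 2}
```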