# Computer Organization Chap5 Report

###### tags: `ComputerOrganization` `NSYSU` `CSE260`

## [PPT](https://docs.google.com/presentation/d/1HEcQoKy0B1ug0tcEUyflwwEHrsbrjwjwE5n0K2pjVw0/edit?usp=sharing)

## Requirement
![](https://i.imgur.com/VCuVH78.png =200x200)

## Team Member
* B084020005 蔡昀燁 ( @B084020005 )
* B083040008 黃宸洋 ( @kevin-NSYSU )
* B083040012 陳柏翰 ( @B083040012 )

## ToDo
- [x] 5.3 Basics of Caches
- [x] 5.4 Measuring and Improving Cache Performance
- [x] 5.5 Dependable Memory Hierarchy
- [x] 5.6 Virtual Machines
- [x] 5.7 Virtual Memory
- [x] 5.8 Common Framework for Memory Hierarchy
- [x] 5.9 Cache Control by FSM
- [x] 5.10 Cache Coherence

## 5.3 Basics of Caches

### Memory Hierarchy

#### Concept:
![](https://i.imgur.com/5Uomxxc.jpg)

#### Locality (How the hierarchy works):
1. Principle: programs access a **relatively small portion** of the address space at any instant
2. Two types of locality:
    (a) Temporal locality: an item that has just been referenced tends to be referenced again soon. EX: the counter of a for loop
    (b) Spatial locality: when an item is referenced, nearby addresses tend to be referenced soon as well. EX: A[i] and A[i+1]
3. Advantage: with a memory hierarchy, **we can copy recently accessed/nearby items from disk into the smaller DRAM memory**

#### Memory Hierarchy Levels:
1. Hit: data is **present** in the upper level
2. Miss: data is **absent** in the upper level --> it must be retrieved from a block in the lower level

### Terminology
1. Hit rate: $$Hit \ Rate=\frac{hits}{accesses}$$
2. Hit time: time to access the upper level
3. Miss rate: $$Miss \ Rate=1-(Hit \ Rate)$$
4. Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor $$Miss \ Penalty=Time \ to \ determine \ hit/miss+Mem \ access \ time$$
5. Hit time << miss penalty --> it is important to reduce the miss rate
6. Exercise (PPT p.11)

### Direct-Mapped Cache
1. A cache structure in which each memory location is mapped to **exactly one location** in the cache.
2. Referenced address: ![](https://i.imgur.com/RGMGeaR.jpg =200x100)
    * Tag field: compared with the value of the tag field of the cache entry
    * Index: used to select the block
3. Fields in a cache entry:
    * Valid bit: indicates that the associated block in the hierarchy contains valid data (at start-up, valid is 0).
    * Tag: contains the address information required to identify whether the associated block in the hierarchy corresponds to the requested word.
4. Direct mapped --> only one choice
    ```
    address in cache=(block address) mod (# of blocks in cache)
    ```
    ![](https://i.imgur.com/D7YodTB.png =1000x150)
5. Accessing a direct-mapped cache: ![](https://i.imgur.com/ZVfWpQz.png =450x300)
    * Step-by-step accessing: Textbook p.386~387
6. Total cache size calculation (a small code sketch after this list works through the same field split):
    * 32-bit address
    * A direct-mapped cache
    * A word takes 4 bytes
    * The cache size is $2^n$ blocks, so n bits are used for the index
    * Block size is $2^m$ words ($2^{m+2}$ bytes), so m bits are used for the word within the block and 2 bits are used for the byte part of the address
    * So the size of the tag field is $32-(n+(m+2))$
    * Total number of bits in the direct-mapped cache is $$2^n \times (block \ size+tag \ size+valid \ field \ size)$$
7. Exercise (PPT p.23、Textbook p.390~391)
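To make the tag/index/offset split and the total-size formula concrete, here is a minimal C sketch (our own illustration, not from the slides or textbook; the parameter values are assumptions) that decomposes a 32-bit byte address for a direct-mapped cache with $2^n$ blocks of $2^m$ words and computes the total number of bits in the cache.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical parameters: 2^N_BITS blocks, 2^M_BITS words per block. */
#define N_BITS 10   /* index bits  -> 1024 blocks             */
#define M_BITS 2    /* word bits   -> 4 words (16 B) per block */

int main(void) {
    uint32_t addr = 0x1234ABCD;                          /* example byte address */

    uint32_t byte_off =  addr       & 0x3;                      /* 2 bits: byte within word */
    uint32_t word_off = (addr >> 2) & ((1u << M_BITS) - 1);     /* word within block        */
    uint32_t index    = (addr >> (2 + M_BITS)) & ((1u << N_BITS) - 1);
    uint32_t tag      =  addr >> (2 + M_BITS + N_BITS);

    /* Total bits = 2^n blocks x (data bits + tag bits + valid bit) */
    unsigned data_bits = (1u << M_BITS) * 32;
    unsigned tag_bits  = 32 - (N_BITS + M_BITS + 2);
    unsigned total     = (1u << N_BITS) * (data_bits + tag_bits + 1);

    printf("tag=%u index=%u word=%u byte=%u\n", tag, index, word_off, byte_off);
    printf("total cache bits = %u (%.1f KiB)\n", total, total / 8.0 / 1024);
    return 0;
}
```

With these assumed parameters (16 KiB of data), the program reports roughly 18.4 KiB of total storage once tags and valid bits are counted, which matches the textbook-style calculation above.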
### Block Size Consideration
1. Larger blocks should reduce the miss rate thanks to spatial locality
2. But for a fixed-size cache:
    * Larger blocks --> fewer blocks --> more addresses compete for each block --> increased miss rate
    * Larger blocks --> more data in a block are never used --> **pollution** (too much unneeded data is brought in)
    * Larger blocks --> larger miss penalty
        * It takes more time to move a block
        * This can outweigh the benefit of the reduced miss rate
3. ![](https://i.imgur.com/uZ30yWM.png =400x200)
    * For a fixed cache size, the miss rate rises again when the block size gets too large, for the reasons above
    * For a fixed block size, a larger cache gives a lower miss rate because it has more blocks (less competition between memory blocks for the same cache block)

### Handling cache miss
1. On a cache hit, the CPU proceeds normally
2. On a cache miss:
    * Stall the CPU pipeline and fetch the block from the next level of the hierarchy
    * Instruction cache miss --> restart the instruction fetch
    * Data cache miss --> complete the data access
    * Steps taken on an instruction cache miss:
        1. Send the original PC value (**current PC-4**) to the memory.
        2. Instruct main memory to perform a read and wait for the memory to complete its access.
        3. Write the data from memory into the cache entry, writing the upper bits of the address (from the ALU) into the tag field and turning the valid bit on.
        4. Restart the instruction execution.

### Handling Writes
* On a data-write hit, if only the block in the cache is updated --> cache and memory become "**inconsistent**"
* Two ways to keep main memory and the cache consistent (a small code sketch contrasting them follows this section):
    1. Write-through ![](https://i.imgur.com/soivGSF.png =100x150)
        * **Update both cache and memory**
        * **Makes writes take longer because they must wait for the memory operation**
            * EX: CPI (PPT p.29)
        * Use a **write buffer** to improve performance
            * **The write buffer holds the data that is waiting to be written to memory**
            * The CPU continues immediately (it only stalls on a write **if the write buffer is already full**)
            * This leaves the cache free to hold the data that loads need
    2. Write-back ![](https://i.imgur.com/phANHVv.png =100x150)
        * **On a data-write hit, just update the block in the cache**
        * Keep track of whether each block is dirty (**needs a dirty bit**)
            * When the dirty bit is 1, the block has already been modified
        * When a dirty block is replaced
            * Write it back to memory
            * Can use a write-back buffer so that the replacing block **can be read first**
* Two options on a write miss (writing a block that is not in the cache):
    1. Write allocate ![](https://i.imgur.com/0Li7nxy.png =100x150)
        * Allocate the block in the cache (both cache and memory hold the block), and write both cache and memory
        * A write miss is handled like a read miss
    2. No-write allocate ![](https://i.imgur.com/5aunqiN.png =100x150)
        * Do not allocate the block in the cache, **write the block in memory only**.
        * The block stays in memory until it is read.
        * More cache space is left for reads (reduces the miss rate)
        * Useful for **initialization** (data not used again soon, so no point caching it)
* Write-through/back vs. write/no-write allocate

| | Write back | Write through |
| -------- | -------- | -------- |
| **Write allocate** | Subsequent writes/reads go to the same location, which is now cached | - |
| **No-write allocate** | - | Subsequent writes must still be written to the lower memory --> no advantage, but cache space is saved for reads |
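The following minimal C sketch (our own illustration; the cache-line layout, the `memory_write` stand-in, and the function names are hypothetical) contrasts the two write-hit policies above: write-through updates memory immediately, while write-back only sets the dirty bit and defers the memory write until the block is evicted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;     /* only meaningful for write-back */
    uint32_t tag;
    uint32_t data;      /* one word per block, for simplicity */
} CacheLine;

/* Toy stand-in for the lower level of the hierarchy. */
static uint32_t fake_memory[1024];
static void memory_write(uint32_t addr, uint32_t value) {
    fake_memory[(addr / 4) % 1024] = value;
}

/* Write hit with a WRITE-THROUGH policy: cache and memory stay consistent. */
void write_hit_through(CacheLine *line, uint32_t addr, uint32_t value) {
    line->data = value;
    memory_write(addr, value);       /* in hardware this may go through a write buffer */
}

/* Write hit with a WRITE-BACK policy: memory is updated only on eviction. */
void write_hit_back(CacheLine *line, uint32_t value) {
    line->data  = value;
    line->dirty = true;              /* remember that the block was modified */
}

/* On replacement, a dirty block must be written back before its slot is reused. */
void evict(CacheLine *line, uint32_t old_addr) {
    if (line->valid && line->dirty)
        memory_write(old_addr, line->data);
    line->valid = false;
    line->dirty = false;
}
```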
### Intrinsity FastMATH
* Accessing the cache: ![](https://i.imgur.com/XACuiZU.png)

### Main Memory Supporting Caches
* Use DRAM for main memory
    * Fixed width (1 word).
    * Connected by a fixed-width clocked bus (the bus clock is typically slower than the CPU clock).
* For a cache block read:
    * 1 bus cycle for the address transfer
    * 15 bus cycles per DRAM access (read)
    * 1 bus cycle per data transfer
* Transferring a 4-word block on a 1-word-wide DRAM:
    * Miss penalty = 1 + 4x15 + 4x1 = 65 bus cycles
    * Bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes/cycle
* Increasing memory bandwidth
    1. Wider memory and bus
        * **Access 4 words and transfer 4 words at a time**.
        * Miss penalty = 1 + **15** + **1** = 17 bus cycles
        * Bandwidth = 16 bytes / 17 cycles ≈ 0.94 bytes/cycle
        * --> Increases bandwidth, but **memory and bus cost is high**.
    2. Interleaved memory organization
        * Bus width stays the same
        * The address is sent to 4 banks at the same time --> **the 4 memory banks can be accessed at the same time** --> their access times overlap
        * Miss penalty = 1 + **15** + 4x1 = 20 bus cycles
        * Bandwidth = 16 bytes / 20 cycles = 0.8 bytes/cycle

## 5.4 Measuring and Improving Cache Performance

### Measuring Cache Performance

#### Program execution cycles (CPU clock cycles) vs. memory stall cycles:
* The former is the normal cycle count of the instructions; the latter is the extra cycles of instructions that are **blocked on memory accesses**, which pay the miss penalty.
* $$CPU \ Time = (CPU \ clock \ cycles + Memory \ stall \ cycles) \times Cycle \ Time$$
* Memory stall cycles $$=\frac{Memory \ accesses}{Program} \times Miss \ rate \times Miss \ penalty$$ $$=\frac{Instructions}{Program}\times\frac{Misses}{Instruction} \times Miss \ penalty$$

#### Example
I-cache miss rate = 2%, D-cache miss rate = 4%
Miss penalty = 100 cycles
Base CPI (ideal CPI) = 2
Loads & stores are 36% of instructions
Find the actual CPI and compare it to the ideal CPI.

I-cache miss cycles per instruction = 0.02 × 100 = 2
D-cache miss cycles per instruction = 0.36 × 0.04 × 100 = 1.44
Actual CPI = 2 + 2 + 1.44 = 5.44
Actual CPI / ideal CPI = 5.44 / 2 = 2.72
The actual CPI is 2.72 times the ideal CPI.

#### Average Memory Access Time (AMAT)
Hit time also matters for performance, so we use AMAT to estimate the average time to access data, counting both hits and misses:

AMAT $= Hit \ time + Miss \ rate \times Miss \ penalty$

#### Example
CPU with 1 ns cycle time, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%. Find the AMAT.

AMAT = 1 + 0.05 $\times$ 20 = 2 cycles = 2 ns per access.
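As a sanity check on the two worked examples above, here is a small, self-contained C sketch (our own illustration) that plugs the same numbers into the memory-stall-cycle and AMAT formulas.

```c
#include <stdio.h>

int main(void) {
    /* --- CPI example: numbers taken from the example above --- */
    double base_cpi     = 2.0;
    double icache_miss  = 0.02, dcache_miss = 0.04;
    double ld_st_frac   = 0.36;
    double miss_penalty = 100.0;

    double stall_per_instr = icache_miss * miss_penalty               /* every instruction is fetched */
                           + ld_st_frac * dcache_miss * miss_penalty; /* only loads/stores use the D-cache */
    double actual_cpi = base_cpi + stall_per_instr;
    printf("actual CPI = %.2f (%.2fx the ideal CPI)\n",
           actual_cpi, actual_cpi / base_cpi);              /* prints 5.44 and 2.72x */

    /* --- AMAT example: hit time 1 cycle, miss penalty 20 cycles, miss rate 5% --- */
    double amat_cycles = 1.0 + 0.05 * 20.0;
    printf("AMAT = %.1f cycles = %.1f ns with a 1 ns clock\n", amat_cycles, amat_cycles * 1.0);
    return 0;
}
```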
### Different Cache Architectures

#### Direct mapped (discussed before)
![](https://i.imgur.com/KGu47XX.png)

#### Fully associative: each block can go anywhere in the cache
* Allows a given block to go into any cache entry
* All cache entries are searched (in parallel) to locate a block, so every entry needs a comparator (the hardware cost is high)

#### N-way set associative:
* Each memory block can be placed in one set of cache locations (any location within that set; each set contains N entries)
* cache set number = (block number) modulo (number of sets in the cache)
* All cache entries in the set are searched (in parallel) to locate the block, so we need N comparators (the hardware cost is lower)
![](https://i.imgur.com/iluC49D.png)

E.g. Where does the memory block with address 12 go in a 2-way cache with 8 entries?
Total number of entries = 8 and each set has 2 blocks => 4 sets
Set address = (12 mod 4) = 0
Block number 12 is put into set 0

#### Different Associativity
For a cache with 8 entries, four different cache organizations:
![](https://i.imgur.com/v6jGAdQ.png)

#### example
...

#### Why Set Associative Cache?
* Direct mapped may introduce too many conflict misses
    * Conflict miss: two or more memory blocks contend for the same cache line
* Fully associative minimizes the number of conflict misses, but the hardware cost is high
* Set associative aims to strike a balance between the two

#### How much associativity?
* Increased associativity decreases the miss rate, but with diminishing returns
* Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000; miss rates shown below:
    * 1-way: 10.3%
    * 2-way: 8.6%
    * 4-way: 8.3%
    * 8-way: 8.1%

#### Replacement policy
* Direct mapped: no choice but to drop the block loaded there
* Set associative: prefer an unused (invalid) entry if there is one, otherwise drop the least-recently used block
* Fully associative: random

### Multilevel Caches
Use multilevel caches to reduce the miss penalty
![](https://i.imgur.com/U5m08Rg.png)
* Primary cache (Level-1) attached to the CPU => small, but fast
* Level-2 cache services misses from the primary cache => larger, slower, but still faster than main memory
* The L-2 cache reduces reads/writes on main memory => reduces the miss penalty
* Some high-end systems include an L-3

#### multilevel cache example

#### Multilevel Cache Consideration
![](https://i.imgur.com/U5m08Rg.png)
* Primary cache
    => Focus on minimal hit time
* L-2 cache
    => Focus on a low miss rate to avoid main memory accesses
    => Hit time has less overall impact
* Results
    * The L-1 cache is usually smaller than a single-level cache would be
    * The L-1 block size is smaller than the L-2 block size

<style>
.red {
  color: red;
}
</style>

## 5.5 Dependable Memory Hierarchy

### Dependability - MTTF and availability
![](https://i.imgur.com/BBZ2IHJ.png)
* Mean time to failure (MTTF): the expected time to failure for an item.
* Mean time to repair (MTTR): the average time required to repair a failed item.
* Availability: the fraction of time the item is operating normally.
* $Availability = MTTF / (MTTF+MTTR)$
![](https://i.imgur.com/YGWtVFk.png)

### Kinds of failure
* Hardware failures: failures due to hardware components (CPU, RAM, disk, fan), severed networks, power outages.
* Software failures: software bugs, security breaches, system patches.

## 5.6 Virtual Machines

### What is a virtual machine?
* The host computer emulates the guest operating system and machine resources.
    * Improved isolation of multiple guests.
    * Avoids security and reliability problems.
    * Aids sharing of resources.
* Virtualization has some performance impact.
    * Feasible with modern high-performance computers.

### Virtual machine monitor
* Maps virtual resources to physical resources.
    * Memory, I/O devices, CPUs.
* Guest code runs on the native machine in user mode.
    * Traps to the VMM on privileged instructions and accesses to protected resources.
* The guest OS may be different from the host OS.
* The VMM handles the real I/O devices.
    * Emulates generic virtual I/O devices for the guests.

### Timer Virtualization
* If a VM requires timer interrupts:
    * The VMM emulates a virtual timer.
    * It emulates an interrupt for the VM when a physical timer interrupt occurs.

## 5.7 Virtual Memory

### What is virtual memory?
![](https://i.imgur.com/N7rlXkI.png)
* Use main memory as a "cache" for secondary (disk) storage.
    * Managed jointly by the CPU hardware and the operating system (OS).
* Programs share main memory.
    * Each gets a private virtual address space holding its frequently used code and data.
* The CPU and OS translate virtual addresses to physical addresses (see the address-split sketch below).
* There are two goals:
    * Allow many programs to run at once.
    * Protect each program from the others.
* A VM "block" is called a page.
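Before looking at page tables, here is a minimal C sketch (our own illustration; the 4 KiB page size and the physical page number are assumptions, not from the slides) showing how a virtual address splits into a virtual page number and a page offset; translation replaces only the page-number part, while the offset passes through unchanged.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   4096u                 /* assumed 4 KiB pages -> 12 offset bits */
#define OFFSET_BITS 12u

int main(void) {
    uint64_t vaddr = 0x0000000000403A7Cull;      /* example virtual address */

    uint64_t vpn    = vaddr >> OFFSET_BITS;      /* virtual page number (looked up in the page table) */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* unchanged by translation */

    /* Suppose the page table says virtual page `vpn` lives in physical frame 0x1F3. */
    uint64_t ppn   = 0x1F3;                      /* hypothetical physical page number */
    uint64_t paddr = (ppn << OFFSET_BITS) | offset;

    printf("VPN = 0x%llx, offset = 0x%llx, physical addr = 0x%llx\n",
           (unsigned long long)vpn, (unsigned long long)offset,
           (unsigned long long)paddr);
    return 0;
}
```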
### The relationship between virtual addresses and physical addresses
![](https://i.imgur.com/bUXpy5Y.png)
* Frame size and page size have to be the same.
* The virtual address space can be larger than the physical address space.
* We need a way to translate a virtual address into a physical one.
* A VM translation "miss" is called a "page fault".
    * On a page fault, we must fetch the page from disk.
    * This takes millions of clock cycles.
    * It is handled by OS code.
* Several ways to minimize the page fault rate:
    * Fully associative placement
    * Smart replacement algorithms

### Page table
![](https://i.imgur.com/HLgjIDS.png)
* The page table stores the <span class = red>"placement information"</span>.
    * It is an array of page table entries (PTEs), indexed by virtual page number.
    * The page table register in the CPU points to the page table in <span class = red>"physical memory"</span>.
* If the page is present in memory:
    * The PTE stores the physical page number plus other status bits (referenced, dirty, ...).
* If the page is absent:
    * The PTE refers to the page's location in swap space on disk.
* The page table maps each page in virtual memory either to a page in main memory or to a page stored on disk.

### Translation with a page table
![](https://i.imgur.com/LkwEAVY.png)

### How to deal with a page fault?
* A page fault means the data is not in main memory, therefore the hardware must trap to the OS to obtain the required data.
    * We may have to choose a page to discard if main memory has no free frame.
    * Load the requested page in from disk.
    * Update the page table and resume the program.
* To reduce the page fault rate, we prefer LRU (least-recently used) replacement.
    * The reference bit (use bit) in the page table is set to 1 on an access to the page.
    * The reference bits are periodically cleared to 0 by the OS.

### TLB (Translation Lookaside Buffer)
* There is a problem: the page table is also in main memory, so every memory access by a program takes at least <span class = red>"two memory accesses"</span>:
    * One to obtain the physical address.
    * One to get the data.
    * This is time-consuming.
* We can use a <span class = red>TLB</span> to speed up the address translation (a toy lookup sketch follows this section).
![](https://i.imgur.com/QMN245c.png)
* The TLB is a "cache" of the page table: page table entries that are accessed frequently are put into the TLB.
* Typically it is hardware inside the processor.
* TLBs are typically small, often fewer than 128~256 entries.

### How the TLB works
![](https://i.imgur.com/M36BiEp.png)
* TLB hit on a read:
    * The physical address is obtained directly.
* TLB hit on a write:
    * Toggle the dirty bit (<span class = red>written back</span> to the page table on replacement).
![](https://i.imgur.com/QMIU2Mu.png)
* TLB miss and page table hit:
    * Load the page table entry into the TLB, then restart the instruction.
![](https://i.imgur.com/VsxeWEK.png)
* TLB miss and page table miss (page fault):
    * The OS handles fetching the page and updating the page table, then the faulting instruction is restarted.
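To tie the page-table and TLB pieces together, here is a toy C sketch (our own illustration; the structures, sizes, and function names are hypothetical, and the OS handler is a stub) of the lookup order described above: try the TLB first, fall back to the page table on a TLB miss, and page-fault to the OS if the page is not in memory.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 12u            /* assumed 4 KiB pages        */
#define TLB_ENTRIES 64u            /* toy fully-associative TLB  */
#define NUM_PAGES   1024u          /* toy virtual address space  */

typedef struct { bool valid; uint32_t vpn, ppn; } TlbEntry;
typedef struct { bool present; uint32_t ppn;    } Pte;

static TlbEntry tlb[TLB_ENTRIES];
static Pte      page_table[NUM_PAGES];

/* Toy stand-in for the OS page-fault handler: "loads" the page and returns a frame. */
static uint32_t os_handle_page_fault(uint32_t vpn) { return vpn; }

/* Assumes vaddr addresses one of the NUM_PAGES toy pages. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

    /* 1. TLB lookup: search all entries (real hardware does this in parallel). */
    for (uint32_t i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << OFFSET_BITS) | offset;          /* TLB hit */

    /* 2. TLB miss: walk the page table (this is the extra memory access). */
    if (!page_table[vpn].present)                                 /* page fault */
        page_table[vpn] = (Pte){ true, os_handle_page_fault(vpn) };

    /* 3. Refill the TLB (trivial replacement choice here) and finish the translation. */
    tlb[vpn % TLB_ENTRIES] = (TlbEntry){ true, vpn, page_table[vpn].ppn };
    return (page_table[vpn].ppn << OFFSET_BITS) | offset;
}
```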
## 5.8 Common Framework for Memory Hierarchy

### Block Placement
* Determined by associativity
    1. Direct mapped --> one choice for placement
    2. n-way set associative --> n choices per set
    3. Fully associative --> any location is fine
![](https://i.imgur.com/bs96ZNL.png)
* Associativity vs. miss rate ![](https://i.imgur.com/z72W3bj.png =300x280)
    * Increasing the associativity reduces the misses that compete for the same location.
    * There is **less improvement** in going from a lower-way to a higher-way associative cache.
    * Higher associativity also increases complexity, cost, and access time.
    * Eventually, the miss rate of a highly associative cache approaches that of a fully associative one.
* **Smaller caches obtain a significantly larger absolute benefit from associativity**:
    * The base miss rate of a small cache is larger.
    * The overall miss rate of a larger cache is lower, so the absolute improvement in miss rate from associativity shrinks.

### Finding a block
* Determined by associativity
![](https://i.imgur.com/7dhb1J6.png)
* For caches:
    * Cost of a higher miss rate vs. cost of implementing associativity.
    * **Reduce comparisons** to reduce cost.
    * Normally set-associative.
* For virtual memory:
    * **Reduce the miss rate** to reduce cost.
    * The page table provides fully associative placement:
        1. More beneficial, since a miss is very expensive.
        2. Allows software to use sophisticated replacement schemes to reduce the miss rate.
        3. Easily indexed, with no extra hardware and no searching required.

### Replacement
* Choose an entry to replace on a miss
    1. Least recently used (LRU)
        * The block replaced is the one that has been unused for the longest time.
        * Complex and costly hardware for high associativity.
    2. Random
        * Blocks are randomly selected, possibly using some hardware assistance.
        * Easier to implement.
* Virtual memory
    * LRU approximation with hardware support.

### Write Policy
* Write-through
    * Misses are simpler and cheaper
    * Needs a write buffer
* Write-back
    * Needs to keep more state
    * Needs only one write to the lower level even when there are multiple writes within a block
* Either write-through or write-back is possible between a cache and main memory
* Virtual memory:
    * Only the write-back policy is practical, because of the long latency of a write to the lower level (disk).

## 5.9 Cache Control by FSM

### A Simple Cache

#### Characteristics of the cache
* Direct-mapped cache
* Write-back using write-allocate
* Block size is 4 words (16 bytes)
* Cache size is 16 KiB (1024 blocks)
* 32-bit addresses
* A valid bit & dirty bit per block

#### Derived from the characteristics
* Cache index is 10 bits
* Block offset is 4 bits
* Tag size is 32-10-4 = 18 bits

#### Signals between processor and cache
* 1-bit Read or Write signal
* 1-bit Valid signal, saying whether there is a cache operation or not
* 32-bit address
* 32-bit data from processor to cache
* 32-bit data from cache to processor
* 1-bit Ready signal, saying the cache operation is complete

#### Signals between cache and memory
* The interface between the memory and the cache has the same fields as between the processor and the cache, except that the data fields are now 128 bits wide.

### Finite state machine
* Here it is a Mealy machine:
    * (Next state, Output) = Next-state function (Current state, Input)
    * The next-state function is implemented by a combinational logic block (a C skeleton of this logic follows below)

### A finite state machine structure of the cache controller
![](https://i.imgur.com/7jbDPcV.png)
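The four states described in the next subsection map naturally onto a switch-based next-state function. The following minimal C sketch (our own illustration; the names and simplified transition conditions are hypothetical, and the output signals are omitted) shows that skeleton.

```c
#include <stdbool.h>

typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } CacheState;

typedef struct {
    bool cpu_request;   /* Valid signal from the processor        */
    bool hit;           /* tag match && valid bit set             */
    bool victim_dirty;  /* dirty bit of the block being replaced  */
    bool mem_ready;     /* Ready signal from memory               */
} CacheInputs;

/* Next-state function of the cache controller. */
CacheState next_state(CacheState s, CacheInputs in) {
    switch (s) {
    case IDLE:
        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)              return IDLE;        /* read/write done, Ready asserted     */
        return in.victim_dirty ? WRITE_BACK          /* old block is dirty: write it first  */
                               : ALLOCATE;           /* old block is clean: fetch new block */
    case WRITE_BACK:
        return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:
        return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}
```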
### A four-state simple controller
![](https://i.imgur.com/KnAvDtO.png)

#### Idle:
* This state waits for a valid read or write request from the processor, which moves the FSM to the Compare Tag state.

#### Compare Tag:
* This state checks whether the requested block is in the cache. If the address matches the tag and the entry is valid, it is a hit; otherwise it is a miss.
* On a hit, the data is either read from the selected word (for a load) or written to the selected word (for a store). The Cache Ready signal is then set. If it is a write, the dirty bit is set to 1. Note that a write hit also sets the valid bit and the tag field; while this seems unnecessary, it is included because the tag, valid, and dirty bits are kept in a single memory, so to change the dirty bit we also have to rewrite the valid and tag fields. If it is a hit and the block is valid, the FSM returns to the Idle state.
* A miss first updates the cache tag and then goes either to the Write-Back state, if the block at this location has a dirty bit value of 1, or to the Allocate state if it is 0.

#### Write-Back:
* This state writes the 128-bit block to memory, using the address composed from the tag and cache index.
* We remain in this state waiting for the Ready signal from memory. When the memory write is complete, the FSM goes to the Allocate state.

#### Allocate:
* The new block is fetched from memory. We remain in this state waiting for the Ready signal from memory. When the memory read is complete, the FSM goes to the Compare Tag state.
* Although we could have gone to a new state to complete the operation instead of reusing the Compare Tag state, there is a good deal of overlap, including the update of the appropriate word in the block if the access was a write.

## 5.10 Cache Coherence

### A memory system is coherent if
![](https://i.imgur.com/h526vEV.png)
1. A read by a processor P to a location X that follows a write by P to X, with no writes to X by another processor occurring between the write and the read by P, always returns the value written by P. Thus, in the figure, if CPU A were to read X after time step 3, it should see the value 1.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. Thus, in the figure, we need a mechanism so that the value 0 in the cache of CPU B is replaced by the value 1 after CPU A stores 1 into memory at address X in time step 3.
3. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if CPU B stores 2 into memory at address X after time step 3, processors can never read the value at location X as 2 and then later read it as 1.

### Two ways caches support coherent sharing:
* **Migration**: a data item can be moved to a local cache and used there in a transparent fashion. Migration reduces both the latency to access a shared data item that is allocated remotely and the bandwidth demand on the shared memory.
* **Replication**: when shared data are being read simultaneously, the caches make a copy of the data item in the local cache. Replication reduces both the latency of access and the contention for a read-shared data item.

### A coherence problem of read after write
![](https://i.imgur.com/h526vEV.png)
* CPU B will read the old value instead of the new value.

### Snooping protocols
* This style of protocol is called a write-invalidate protocol because it invalidates copies in other caches on a write. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated (a toy sketch of this idea follows below).
![](https://i.imgur.com/VwEt5yf.png)
* CPU B will read the updated value, because the local cache content of CPU B has been invalidated; when CPU B then wants to read the data, it misses and is forced to load the updated data from memory.
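To make the write-invalidate idea concrete, here is a toy C sketch (our own illustration, not the textbook's protocol implementation; the one-line "caches", the snoop loop standing in for the bus, and the write-through simplification are all assumptions). A write updates the writer's cache and memory and invalidates every other copy, so the next read by another CPU misses and fetches the new value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CPUS 2

typedef struct { bool valid; uint32_t addr, data; } OneLineCache;   /* toy 1-line cache */

static OneLineCache cache[NUM_CPUS];
static uint32_t     memory[16];

/* A write by `cpu`: update its own cache and memory, invalidate all other copies. */
static void write_invalidate(int cpu, uint32_t addr, uint32_t value) {
    cache[cpu] = (OneLineCache){ true, addr, value };
    memory[addr] = value;                               /* write-through, for simplicity */
    for (int other = 0; other < NUM_CPUS; other++)      /* other caches "snoop" the bus  */
        if (other != cpu && cache[other].valid && cache[other].addr == addr)
            cache[other].valid = false;                 /* invalidate the stale copy     */
}

/* A read by `cpu`: hit if its copy is valid, otherwise miss and refill from memory. */
static uint32_t cache_read(int cpu, uint32_t addr) {
    if (!(cache[cpu].valid && cache[cpu].addr == addr))
        cache[cpu] = (OneLineCache){ true, addr, memory[addr] };     /* miss -> refill */
    return cache[cpu].data;
}

int main(void) {
    uint32_t X = 3;
    cache_read(0, X); cache_read(1, X);        /* both CPUs cache the old value 0       */
    write_invalidate(0, X, 1);                 /* CPU A writes 1, CPU B is invalidated  */
    printf("CPU B now reads %u\n", cache_read(1, X));   /* prints 1, not the stale 0    */
    return 0;
}
```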
### False sharing:
* Two unrelated shared variables are located in the same cache block, and the full block is exchanged between processors even though the processors are accessing different variables.
* Solution: **cache padding**: pad with extra, unused bytes so that each variable occupies an entire cache line (usually 64 bytes); then different CPUs no longer slow each other down with coherence traffic when they each update their own variable (see the sketch below).
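A minimal C sketch of the padding idea (our own illustration; the 64-byte line size, the struct layout, and the GCC/Clang alignment attribute are assumptions): each per-thread counter is padded and aligned to its own cache line, so updates by different threads no longer invalidate each other's cached copies.

```c
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* Without padding: adjacent counters can share one cache line -> false sharing
 * when thread 0 updates element 0 and thread 1 updates element 1. */
struct counter_bad {
    uint64_t count;
};

/* With padding: each counter occupies an entire cache line, so the two
 * threads never contend for the same block. */
struct counter_padded {
    uint64_t count;
    char pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));   /* GCC/Clang; C11 offers _Alignas as an alternative */

/* One padded counter per thread, e.g. incremented in a hot loop by that thread. */
struct counter_padded per_thread[2];

void hot_loop(int tid, long iters) {
    for (long i = 0; i < iters; i++)
        per_thread[tid].count++;          /* no coherence ping-pong between the two threads */
}
```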