# Translation Lookaside Buffer
## Procedure
The TLB is a cache inside the processor and part of the memory management unit (MMU). It remembers the VPN-to-PPN translations the current process uses most often. If the TLB holds no translation for a given VPN, an exception is raised; the processor (or OS) finds the PPN in the page table, fills the translation into the TLB, and then re-executes the instruction that raised the exception, which this time results in a TLB hit.
```c
VPN = (VirtualAddress & VPN_MASK) >> SHIFT
(Success, TlbEntry) = TLB_Lookup(VPN)
if (Success == True)    // TLB Hit
    if (CanAccess(TlbEntry.ProtectBits) == True)
        Offset   = VirtualAddress & OFFSET_MASK
        PhysAddr = (TlbEntry.PFN << SHIFT) | Offset
        Register = AccessMemory(PhysAddr)
    else
        RaiseException(PROTECTION_FAULT)
else                    // TLB Miss
    PTEAddr = PTBR + (VPN * sizeof(PTE))
    PTE = AccessMemory(PTEAddr)
    if (PTE.Valid == False)
        RaiseException(SEGMENTATION_FAULT)
    else if (CanAccess(PTE.ProtectBits) == False)
        RaiseException(PROTECTION_FAULT)
    else
        TLB_Insert(VPN, PTE.PFN, PTE.ProtectBits)
        RetryInstruction()
```
## Example
Array access order interacts with the TLB through spatial locality: the larger the page, the more consecutive array elements share a single translation, so more accesses hit. The trace below is consistent with 4-byte `int`s and 16-byte pages, with the array starting partway into its first page so that ten elements span three pages; the first touch of each page misses and the rest hit (a loop producing this pattern is sketched after the list):
- `a[0]`: TLB Miss, Fill TLB entry
- `a[1]`: TLB Hit
- `a[2]`: TLB Hit
- `a[3]`: TLB Miss, Fill TLB entry
- `a[4]`: TLB Hit
- `a[5]`: TLB Hit
- `a[6]`: TLB Hit
- `a[7]`: TLB Miss, Fill TLB entry
- `a[8]`: TLB Hit
- `a[9]`: TLB Hit
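A loop like this minimal sketch produces the trace above (assuming `a` is laid out across three pages as described):

```c
int a[10] = {0}; /* assumed to span three pages, as described above */
int sum = 0;
for (int i = 0; i < 10; i++)
    sum += a[i]; /* the first touch of each new page misses in the TLB */
```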

## TLB Update Mechanism
How a TLB miss is handled differs between CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer) designs. On CISC machines such as x86, the hardware handles the miss itself: it locates the page table through the page-table base register, finds the matching PPN, and fills it into the TLB. On RISC machines the miss is handled in software: the miss raises a trap, and a trap handler provided by the OS looks up the PPN and inserts it into the TLB. A TLB-miss trap also differs from an ordinary trap: on return-from-trap it must re-execute the instruction that caused the trap, whereas an ordinary trap resumes at the instruction after the one that trapped.
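To make the software-managed path concrete, here is a minimal user-level simulation; the tiny direct-mapped TLB, the table sizes, and all names are assumptions of this sketch, not any real kernel's API:

```c
#include <stdio.h>
#include <stdbool.h>

#define TLB_ENTRIES 4
#define NUM_PAGES   16

typedef struct { bool valid; unsigned vpn, pfn; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];     /* tiny direct-mapped "TLB" */
static unsigned page_table[NUM_PAGES];   /* VPN -> PFN */

/* The "trap handler": walk the page table and fill the TLB entry. */
static void tlb_miss_handler(unsigned vpn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true;
    e->vpn = vpn;
    e->pfn = page_table[vpn];
    /* On real hardware, return-from-trap would now re-execute the
       instruction that missed, instead of the next instruction. */
}

static unsigned translate(unsigned vpn) {
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)       /* TLB hit */
        return e->pfn;
    tlb_miss_handler(vpn);               /* TLB miss: "trap" to software */
    return translate(vpn);               /* retry; this time it hits */
}

int main(void) {
    for (unsigned v = 0; v < NUM_PAGES; v++)
        page_table[v] = v + 100;         /* arbitrary mapping for the demo */
    printf("VPN 3 -> PFN %u\n", translate(3));   /* miss, fill, retry */
    printf("VPN 3 -> PFN %u\n", translate(3));   /* hit */
    return 0;
}
```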
## Context Switch
Without extra flags in its entries, the TLB is only valid for the current process, so a context switch requires a TLB flush; otherwise the new process could use the old process's cached translations and access the wrong PPNs. To avoid the cost of flushing, each entry can be tagged with an ASID (Address Space Identifier): even when two processes use the same VPN, their entries are kept apart because a lookup must also match the current ASID.
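A minimal sketch of what an ASID-tagged entry and its match rule could look like (field names are illustrative):

```c
#include <stdbool.h>

/* Hypothetical ASID-tagged TLB entry: two processes can cache the same
   VPN at the same time, distinguished by the ASID field. */
struct tlb_entry {
    bool     valid;
    unsigned vpn;
    unsigned pfn;
    unsigned asid;   /* which address space this entry belongs to */
    unsigned prot;   /* protection bits */
};

/* A lookup hits only when both the VPN and the current process's ASID
   match, so no flush is needed on a context switch. */
static bool tlb_match(const struct tlb_entry *e,
                      unsigned vpn, unsigned cur_asid) {
    return e->valid && e->vpn == vpn && e->asid == cur_asid;
}
```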

## Shared Pages
Processes can share a page simply by having a VPN in each process map to the same PPN. For example, a shared code page might appear at VPN 10 in process P1 and VPN 50 in process P2, with both entries translating to PPN 101, so only one physical copy exists.
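As a concrete (POSIX) illustration, this minimal sketch maps one file-backed page at two different virtual addresses within a single process; sharing between processes works the same way, with each process's page-table entry pointing at the same PPN. The 4096-byte page size is an assumption of this sketch:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* One page of file-backed memory, mapped at two different VPNs. */
    FILE *f = tmpfile();
    if (f == NULL)
        return 1;
    int fd = fileno(f);
    if (ftruncate(fd, 4096) != 0)
        return 1;
    char *m1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *m2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (m1 == MAP_FAILED || m2 == MAP_FAILED)
        return 1;
    m1[0] = 'X';                     /* write through one mapping...     */
    printf("m2[0] = %c\n", m2[0]);   /* ...and read it through the other */
    return 0;
}
```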

## Homework
In this homework, you are to measure the size and cost of accessing a TLB. The idea is based on work by Saavedra-Barrera [SB92], who developed a simple but beautiful method to measure numerous aspects of cache hierarchies, all with a very simple user-level program. Read his work for more details.
The basic idea is to access some number of pages within a large data structure (e.g., an array) and to time those accesses. For example, let’s say the TLB size of a machine happens to be `4` (which would be very small, but useful for the purposes of this discussion). If you write a program that touches `4` or fewer pages, each access should be a TLB hit, and thus relatively fast. However, once you touch 5 pages or more, repeatedly in a loop, each access will suddenly jump in cost, to that of a TLB miss. The basic code to loop through an array once should look like this:
```c
int jump = PAGESIZE / sizeof(int);
for (i = 0; i < NUMPAGES * jump; i += jump)
    a[i] += 1;   // touch one int on each page
```
In this loop, one integer per page of the array `a` is updated, up to the number of pages specified by `NUMPAGES`. By timing such a loop repeatedly (say, a few hundred million times in another loop around this one, or however many loops are needed to run for a few seconds), you can time how long each access takes (on average). By looking for jumps in cost as `NUMPAGES` increases, you can roughly determine how big the first-level TLB is, determine whether a second-level TLB exists (and how big it is if it does), and in general get a good sense of how TLB hits and misses can affect performance.
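A minimal sketch of such a measurement program (question 2 below calls it `tlb.c`), assuming the POSIX `gettimeofday()` and `sysconf()` calls; the argument handling and averaging are just one way to do it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s numpages trials\n", argv[0]);
        return 1;
    }
    long numpages = atol(argv[1]);
    long trials   = atol(argv[2]);
    long jump     = sysconf(_SC_PAGESIZE) / sizeof(int);
    int *a = calloc(numpages * jump, sizeof(int));
    if (a == NULL)
        return 1;

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (long t = 0; t < trials; t++)   /* see question 5: this loop must */
        for (long i = 0; i < numpages * jump; i += jump)  /* stay alive   */
            a[i] += 1;                  /* touch one int per page */
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf("%.2f ns per access\n",
           usec * 1000.0 / ((double)trials * (double)numpages));
    free(a);
    return 0;
}
```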

Figure 19.5 shows the average time per access as the number of pages accessed in the loop is increased. As you can see in the graph, when just a few pages are accessed (`8` or fewer), the average access time is roughly `5` nanoseconds. When 16 or more pages are accessed, there is a sudden jump to about `20` nanoseconds per access. A final jump in cost occurs at around `1024` pages, at which point each access takes around `70` nanoseconds. From this data, we can conclude that there is a two-level TLB hierarchy; the first is quite small (probably holding between `8` and `16` entries); the second is larger but slower (holding roughly `512` entries). The overall difference between hits in the first-level TLB and misses is quite large, roughly a factor of fourteen. TLB performance matters!
1. For timing, you’ll need to use a timer (e.g., `gettimeofday()`). How precise is such a timer? How long does an operation have to take in order for you to time it precisely? (this will help determine how many times, in a loop, you’ll have to repeat a page access in order to time it successfully) A timer-resolution sketch appears after this list.
2. Write the program, called `tlb.c`, that can roughly measure the cost of accessing each page. Inputs to the program should be: the number of pages to touch and the number of trials.
3. Now write a script in your favorite scripting language (bash?) to run this program, while varying the number of pages accessed from `1` up to a few thousand, perhaps incrementing by a factor of two per iteration. Run the script on different machines and gather some data. How many trials are needed to get reliable measurements?
4. Next, graph the results, making a graph that looks similar to the one above. Use a good tool like ploticus or even zplot. Visualization usually makes the data much easier to digest; why do you think that is?
5. One thing to watch out for is compiler optimization. Compilers do all sorts of clever things, including removing loops which increment values that no other part of the program subsequently uses. How can you ensure the compiler does not remove the main loop above from your TLB size estimator? One counter-measure is sketched after this list.
6. Another thing to watch out for is the fact that most systems today ship with multiple CPUs, and each CPU, of course, has its own TLB hierarchy. To really get good measurements, you have to run your code on just one CPU, instead of letting the scheduler bounce it from one CPU to the next. How can you do that? (hint: look up “pinning a thread” on Google for some clues) What will happen if you don’t do this, and the code moves from one CPU to the other? A Linux pinning sketch appears after this list.
7. Another issue that might arise relates to initialization. If you don’t initialize the array `a` above before accessing it, the first time you access it will be very expensive, due to initial access costs such as demand zeroing. Will this affect your code and its timing? What can you do to counterbalance these potential costs? A warm-up sketch appears after this list.
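For question 1, one way to probe the resolution of `gettimeofday()` is to call it back-to-back and record the smallest nonzero difference; a minimal sketch (the loop count is arbitrary):

```c
#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval t0, t1;
    long min_usec = -1;
    for (int i = 0; i < 1000000; i++) {
        gettimeofday(&t0, NULL);
        gettimeofday(&t1, NULL);
        long d = (t1.tv_sec - t0.tv_sec) * 1000000L
               + (t1.tv_usec - t0.tv_usec);
        if (d > 0 && (min_usec < 0 || d < min_usec))
            min_usec = d;   /* smallest step the timer can report */
    }
    printf("smallest observed nonzero step: %ld usec\n", min_usec);
    return 0;
}
```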
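For question 5, one common counter-measure (among several) is to make the timed accesses observable so the compiler cannot prove them dead, e.g., by going through a `volatile` pointer; a fragment reusing `a`, `numpages`, and `jump` from the sketch above:

```c
volatile int *va = a;   /* volatile view: these stores cannot be elided */
for (long i = 0; i < numpages * jump; i += jump)
    va[i] += 1;
```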
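For question 6, on Linux a thread can be pinned to one CPU with `sched_setaffinity()`; a minimal sketch (other operating systems have their own equivalents):

```c
#define _GNU_SOURCE             /* must come before any #include */
#include <sched.h>
#include <stdio.h>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");
}
```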
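For question 7, a simple counterbalance is an untimed warm-up pass over the same pages, so demand-zeroing and other first-touch costs are paid before measurement begins; a fragment reusing the names above:

```c
for (long i = 0; i < numpages * jump; i += jump)
    a[i] = 0;   /* fault every page in before the timed loops run */
```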