Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

# Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems ###### tags: `PIM` `Onur Mutlu` `ISCA '16` ###### papers: [link](https://people.inf.ethz.ch/omutlu/pub/TOM-programmer-transparent-GPU-near-data-processing_isca16.pdf) ###### slides: [link](https://people.inf.ethz.ch/omutlu/pub/TOM-programmer-transparent-GPU-near-data-processing_kevinhsieh_isca16-talk.pptx) # 0. ABSTRACT * 要如何在 programmer 不介入的情況下快速的將 computation 跟 data mapping 到 3D-stacked memories，讓所有 programmer 都受益？ * 此論文提出兩個方法 * compiler-based technique，透過 cost-benifit 分析將相關的 code offload * software/hardware 合作機制，預測哪些 memory pages 會被 offloaded code access，並將那些 page 放在 memory stack 附近 # 1. INTRODUCTION * NDP GPU 架構概述： * 多個 3D-stacked memory，各自包含一或多個 streaming multiprocessors (SMs) 在 logic layer * main GPU 會將 computation offload 到對應的 memory stack，以此減少 off-chip data movement ![](https://i.imgur.com/L81bs2R.jpg) ![](https://i.imgur.com/o06Hwvd.png) * 若要 programmer transparent，兩個主要挑戰因此出現： * 哪些 operation 該在 main GPU，哪些該在 memory stack 執行？ * memory intensive 適合在 memory stack * compute intensive 適合在 main GPU * 但因為 input 或 program phase behavior 不同，讓 programmer 決定可能很難精準的判斷如何分配 * 決定 data 該如何分配在不同的 memory stack * 理想上要最大化 code/data co-location for NDP operations * 對於在 main GPU 上執行的 code，最大化 bandwidth utilization * 挑戰一解決方法： * 提出 offload candidate selection mechanism，在 static compiler 端分析哪些 code block 可能需要的最多 memory bandwidth，以此來預測若放在 NDP 能省下多少 memory bandwith * 再透過系統端根據系統的狀態，如 SM 或 bandwith 的使用情形，動態決定那些被選定的 code block 是否該真的 offload 到 SM 中 * 挑戰二解決方法： * 提出 programmer-transparent data mapping mechanism，將 data 放在跟會需要那些 data 的 offloaded code 的同樣的 memory stack * 根據實驗，code block 約有 85％的重複 memory access * 使用 main GPU 在 run time 時評估幾個不同的 memory mapping option，預測那些 offload code block 會 access 哪些 memory pages，然後將那些 pages 放置在靠近 code 的區塊 # 2. MOTIVATION * 圖二展示了 ideal 情況下 (不算 offloading code blocks overhead 且假設 offloaded code and data all co-located) 套用 offload candidate identification mechanism 能產生的 improvement * 有許多動態的 component ( host-side driver、GPU runtime、memory controller )，會影響 mapping of memory to different stack，現今許多方法([Sura](https://researcher.watson.ibm.com/researcher/files/us-zsura/amc-cf2015.pdf)、[Top-pim](https://www.eecis.udel.edu/~lxu/resources/TOP-PIM.pdf))都是提供 API 讓 programmer 自行決定 memory map，無法做到 programmer-transparent * 圖三展示了 ideal mapping，使用連續兩個 address bits 來 map memory pages to memory stacks，可以比先今的 GPU memory mapping policy 來的進步 13％ ![](https://i.imgur.com/5Ky0aWj.jpg) ![](https://i.imgur.com/ckxXM4M.jpg) # 3. MECHANISM ## 3.1 Identification of Offloading Candidates ### 3.1.1 計算 memory bandwidth cost-benefit * Load instruction: 由 TX 傳送 address，RX 接收 data * Store instruction: 由 TX 傳送 address 跟 data，RX 接收 ack message * 假定 address、data、register 的 size 是 ACK message 的 4 倍，如果 Load & Store 由獨立的 thread 執行，可以得到以下公式： ![](https://i.imgur.com/p3bHMCT.jpg) * REG$^T$$^X$ 跟 REG$^R$$^X$ 代表 transmit 跟 receive 的 register 數量，這個值代表著 offload block 的 bandwidth cost * N$^L$$^D$ 跟 N$^S$$^T$ 代表在 block 中執行的 load & store 數量，這個值代表 offloading 會得到的 bandwidth benefit * (1) 假定 TX 下，LOAD 傳送 an address，STORE 傳送 an address and a data * (2) 假定 RX 下，LOAD 得到 data，STORE 得到 ACK Message($1/4$ size) * 實際上在 GPU，thread 會在 lock-step warp 中執行，所以 hardware 在 offload 時會以 warp 為單位而非 thread * 實際上 load and store 會合併成 load-store unit and cache，對於 load，address and data 的 size 不相同，因為 data 是以 cache line 的粒度來獲得 * 以下是由 warp granularity 所做 bandwidth 公式改變： ![](https://i.imgur.com/NcfgYFp.jpg) * S$^W$ 是 warp 的 size (e.g. 32) * S$^C$ 是 cache line size / address size (e.g. 32 for 128B cache lines and 4B addresses) * Coal$^L$$^D$ and Coal$^S$$^T$ 分別代表平均上 load and store coalescing ratio (e.g. 所有 warp 中的 load 能夠合併成 2 個 cache line，則 Coal$^L$$^D$ = 2) * Miss$^L$$^D$ 代表 cache miss rate，且是 N$^L$$^D$ 的係數 * Compiler 可以知道 REG$^T$$^X$ 跟 REG$^R$$^X$ 等 static 的參數，但如 Coal$^L$$^D$、Coal$^S$$^T$、Miss$^L$$^D$ 等，只能透過預測假設所有 warp 中的 memory instruction 都可以完美合併，所以 Coal$^L$$^D$、Coal$^S$$^T$ 都是 1，而因為 GPU cache miss rate 通常很高，所以估計為 50％ ### 3.1.2 Offloading candidate block identification * 如果 Equations (3)(4) 是負值，代表所省下的 bandwidth 大於 offload 產生的 cost，compiler 會使用 2-bits 標示出這些 candidate，並標出是 TX 還是 RX 有省到 ### 3.1.3 Loops and conditional offloading candidates * 迴圈裡頭的 candidate block，offload cost 會固定，但 bandwidth benifit 會隨著 iterations 執行數量而改變 * compiler 面對迴圈有三種應對方式： 1. 假設 iteration 數量是 statically，那 compiler 算出整體的 benifit 並做出決定 2. 如果進入迴圈前才能知道，那 compiler 標示成 "conditional offloading candidate"，之後 hardware 看到這個 mark 會另外處理 3. 如果在 execution time 才能知道，compiler 就假設只有 1 次，並且計算一次下的 cost and benifit，以此來決定 > [name=johnnycck] 迴圈內的變數若實際在執行第二次開始應該會一直 cache hit? 公式以 cache miss rate 50% 來算會不會不公平? ### 3.1.4 Offloading candidate block limitations * 對於 candidate blocks 有三個限制： 1. candidate blocks 不能有會 access GPU's on-chip share memory 的，因為若是 offload 到 memory stack 來做，變成要走 off-chip memory bandwidth 2. candiate code 如果有分歧的 thread，要確保在 execution 結束前會 converge，因為 GPU 使用 Single instruction, multiple threads (SIMT) execution model，warp 中的 threads 可能分散成多個 control flow instructions，若是允許執行後的 offloaded code 的 thread 還是分散的，會讓管理上非常困難，因此 compiler 會確保 control flow instructions 的 destination 會在 candidate block 中 3. 在 candidate block 中並不支援 memory barrier、synchronization、atomic instructions，因為 GPU and logic layer SM 間並沒有支援 synchronization primitives ### 3.1.5 Offloading candidate block example * 範例 GPU workload: LIBOR Monte Carlo * 在這 workload 中，有兩個 loop 是 conditional offloading candidate block，每個 loop 有 5 個 input value ( REG$^T$$^X$, marked as red )，one load and one store ( both circled ) * compiler 將這兩個 loop mark 為 conditional offloading candiate，考量到這個 loop 若 iteration number >= 4，就能開始節省 memory bandwidth，在 iternation 4 次的情況，bandwidth 變成：BW$^T$$^X$ + BW$^R$$^X$ = -39，但若是 1 次，bandwidth 會是 BW$^T$$^X$ + BW$^R$$^X$ = +110.25 ![](https://i.imgur.com/Csqlgjc.jpg) ## 3.2 Programmer-transparent Data Mapping ### 3.2.1 Memory access pattern analysis * 透過觀察可以發現有很大一部分的 accesses，他們之間會有 fixed offset，這讓access 間可以透過 offset 預測，並 mapping 到同一塊 memory stack * 舉例來說，上圖的 L_b[n] & L[n]，這兩個 array 間的差距是一段 2 的指數的 offset，這是因為 page size 幾乎都是 2 的指數，所以這兩個 array 的 LSB 都會一樣，而他們間的 offset 就是 base address 的差距 * 透過實驗了解到兩個現象 * 85％的 offloading candidate，memory access 間可能存在 fixed offset，這樣可以幫助增加 data locality * 6 個 workload 都有 always fixed offset 的情況，因此若是用簡單的 mapping 就能將他們放入同一個 memory stack，改善很大程度的 access ![](https://i.imgur.com/wsWQjsv.jpg) * 假設 fixed offset address 是 2$^M$，cache line size 是 2$^N$，最佳的 mapping 到別的 memory stack 的標準是 bits [M-1:N] * 在實驗中，將 bits [16:7] (cache line size: 128B) 來做比較並使用 2 個連續 bits mapping 到 4 個 memory stack * 定義 compute/data colocation 為 accessing one memory stack in an offloading candidate instance 的機率 * 實驗結果顯示比起 baseline data mapping 高了 38-75％ ### 3.2.2 Predictability of the best memory mapping * 在 data 放到 memory 前就使用 predict the best memory mapping mechanism，減少高 overhead 的 stack 間 data movement * 圖六顯示觀察 0.1%、0.5%、1% 的 offloading candidate instances 能夠 access one memory stack 的機率，0.1% 就能有 72% 的機率 * 因為 GPU access memory 經由 index of threads and thread blocks，讓 access pattern 非常 predictable and consistent among threads ![](https://i.imgur.com/AvJchCm.png) ### 3.2.3 Programmer-transparent data mapping * GPU kernel execution 區分為 short initial learning phase 跟 regular execution * 在 initial learning phase 所有的 GPU kernel 都在 main SMs 上執行且相關的 data 都在 CPU memory * 在 initial learning phase 結束後，得到如何 mapping，再將 data 從 CPU memory copy 到 GPU memory * Our mechanism essentially delays the memory copy from the CPU to the GPU in the typical GPU execution flow so there is no extra data remapping overhead. ## 3.3 Dynamic Offloading Aggressiveness Control * 隨意的將 compiler 認定的 offloading candidate 都交由 SMs 來做會有兩種可能變慢: * 要做的 offloading request 太多了，反倒讓 memory stack SMs 變慢 * 每個 SM 都有同時執行 warp 的硬體限制，memory stacks 的 SMs 達到硬體限制後不能對 offloaded block 生產更多的 warp，因此不能執行，但同時 GPU 會等待 SM 回傳做好的 offloaded block，因此卡住 * RX and TX 的 bandwidth saving 可能會產生誤解，雖然 RX 可以省很多，但若是 TX 因此增加 traffic，甚至成了 application 的 bottleneck，那 offload 這樣的 block 就會放大此問題 * 藉由提出 dynamic offloading aggressiveness control，使用 runtime information 決定到底要不要 offload candidate * 由兩種方法 override compiler 標示的 offloading recommendation * GPU track 每個送到 memory stack 且目前 pend 住的 offload request 數目，若是達到硬體上限，就不再送往 SM * 透過 2-bit 標示的 TX and RX bandwidth saving，GPU monitors TX and RX 的 bandwidth 使用量，若達到 threshold utilization rate 則不再 offload # 4. IMPLEMENTATION ## 4.1 Hardware Block Diagram * 加了三個 component: * Offload Controller，用來做最終的 offloading decisions * Channel Busy Monitor，監控 off-chip channel 的使用量 * Memory Map Analyzer，支援 programmer-transparent data mapping ![](https://i.imgur.com/x8CSrcM.png) * **以下簡述每個 component** * Offload Contoller: * Work with the main GPU pipeline * 決定 candidate block 是否真的要 offload，包含 runtime 才能決定的 conditional offloading candidate，以及 dynamic offloading aggressiveness control * 把 offloading information 包起來並且傳送給對應的 memory stack SMs * 當他從 memory stack SMs 收到 ack packet 後會繼續原先的 offloaded warp * Channel Busy Monitor: * 監控 off-chip TX/RX channel 的使用量，如果 utilization rate 達到 pre-defined threshold，回報 channel 處於 busy 狀態 * Memory Map Analyzer: * 在 GPU Runtime set up，並提供每個 offloading candidate instance 所有潛在的不同 mapping 方式的 memory stack 的 number， * 只有在 learning phase 才會 enable 這個 unit ## 4.2 Design of NDP Offloading * **Interface between the compiler and the hardware** * compiler 使用兩種方法提供 offloading candidate 的資訊給硬體 * 在 ISA 中新增一個 instruction，表示 offloading candidate block 的開頭 * compiler 提供 offloading metadata table，每個 table 的 entry 都連結 offloading candidate，並提供 * PC 的 begin/end address * live-in/live-out registers * 2-bit tags 表達 TX/RX channel saving * condition for conditional offloading candidates * 這些資訊由 compiler allocated 並放在 on-chip Shared Memory $(4)$ * **Offloading candidate block detection** * 在 pipeline 中，每個 instruction 會被 decoded 且放在 Instruction Buffer $(5)$ * 當 Instruction Buffer 發現這個 instruction 是 offload candidate block 的開始，將這個 warp 標示成 not ready，並詢問 Offload Controller $(1)$，確認是否要 offload * offload controller 從 Shared Memory $(4)$ fetch 該 offload metadata，並以此來做最終的 offloading decision * **Dynamic offloading decision** * 分成三個步驟： * Offload Controller 從 Operand Collector $(6)$ 得到 corresponding register value，以此來判斷 condition for offloading 是否為 true * 如果 Channel Busy Monitor $(2)$ 顯示某個 TX/RX channel 其中一個 channel 處於 busy，而 offload 該 block (透過 2-bits) 會造成更多 memory traffic，那就不會 offload * Offload Controller 會透過第一個 instruction 會 access 的 memory stack 作為 offload 的 destination，並檢查是否已經達到該 memory stack 的 warp limitation，如果達到上限，則不 offload * 如果第一個 instruction 不會做到 memory access，那就給 GPU 執行直到第一個 memory access 的 instruction * 在做 offloading decision 的時候 pipeline 可以 schedule 並執行其他的 warp * **Sending offloading requests** * decision 決定好後，在等待中的 warp 變成 ready * 假如決定不 offload，Issue unit $(7)$ 把 instruction 放到 pipeline 中執行 * 若決定 offload，將 instruction 送到 Offload Controller，Offload Controller 會將 live-in registers、begin/end PCs、active masks as an offloading request，並送到 memory stack * 實驗中假設 pipeline offload task 的 latency 是 10 cycles，這跟每個 memory access 所要花的 200 cycles 比起來非常小 * 在等待 offloaded tasks 的時候 GPU pipeline 會執行其他的 warps * **Receiving offload acknowledgement** * offload block 完成後，memory stack SM 傳送 offload ack pachet to main GPU，包含： * live-out registers * 需要 invalid 的 chache lines * Offload Controller 會 request cache invalidation 且更新 register * 並重新開始執行 corresponding warp from the next instruction of the end of the block ## 4.3 Design of Programmer-transparent Data Mapping * 硬體上，在 GPU 加入 memory mapping analyzer $(3)$ * 軟體上，修改 GPU host-side driver，會在 CPU and GPU runtime 執行 * 這些機制包含以下步驟： 1. 在 launching GPU kernel 前，GPU application 需要經由 GPU driver allocate memory，並從 CPU memory 複製相對應的 location 到 GPU memory，透過 programmer-transparent data mapping，GPU driver 仍然 allocate memory 在 GPU virtual memory，但在 initial phase 先延遲了 copy，並先把 GPU virtual memory map 到 CPU memory，GPU driver 會把每個 memory allocation 記錄在 memory allocation table，以做日後用途 2. 當 application request GPU kernel launch 的時候，GPU driver 基於兩個 table：(1) memory allocation table、(2) offloading metadata table，並利用 GPU runtime 來 set up memory mapping analyzer $(3)$，因為在 initial phase GPU access memory 是 access 到 CPU memory，所以這段時間 memory 會走 GPU-to-CPU link (PCI-E) 3. memory mapping analyzer 監控 GPU threads 的 execution 跟他們的 memory access，他會計算潛在可能的 stack mapping(e.g, bits 7:8, 8:9, ..., 16:17 in a system with four memory stacks)，並計算每個 offloading candidate instance 要放在幾個 memory stack 上，對於每個被 offloading candidate access 在 application-allocated memory range 中的部分 analyzer 會在 memory allocation table 設一個 bit，並建議要放 offloaded blocks 4. 當 memory mapping analyzer 看了足夠多的 offloading candidate instance 後，會 issue interrupt to the GPU runtime，GPU runtime 會停下所有執行中的 SMs，並使用 best memory mapping，將最會 access 的放到同一個 stack 5. GPU runtime request GPU driver 到 CPU，並開始 memory copy，一般來說這件事會在 kernel launch 前發生，但因為要經過 initial phase，所以會在找到 best found mapping 後才會放到 memory stacks，其他的 memory 則依照原先的 default mapping ## 4.4 Design Consideration ### 4.4.1 Virtual address translation * GPU 以 hardware TLB 跟 MMU 來將 virtual address 轉成 physical address * 假設 SMs 中也有相同的 TLB 跟 MMU，而根據實驗，這些硬體只需要很小的體積，約 1-2K flip-flop 跟小數量的 logic，約不到整個 memory stack SM 的 2% * 有兩個問題要探討 1. page table 需要做 address translation 的那個部分可能不在該 memory stack 上 * 使用 cross-stack links 來完成 remote data access 2. 如果 GPU SMs 更新了 page table，TLB shootdown 會需要 maintain correctness * 但因為是在完成 memory copy 跟 page table update 後才會把 candidate block offload，所以 memory stack 不需要 TLB shootdown ### 4.4.2 Cache coherence * GPU 跟 memory stack 都有各自的 cache，若單純使用現今的 cache coherence 方法，會造成過多的 off-chip data movement * 當 GPU 叫 memory stack SM 執行 offload block，memory stack SM 會是唯一執行 Cooperative Thread Array (CTA) 的 SMs，因為 GPU 不需要保證 CTAs 間的 order，所以 SMs 間不需要保證 cache coherence，programmer 要自己在 CTA 內的 threads 間下 memory barriers 等保證順序的指令 * 以下三種方法保證 cache coherence： 1. GPU 的 SMs 會把所有還未寫入 memory 的 update 寫入，並且送到 offload block 的那個 memory stack 2. memory stack 要更新 private cache，確保資料都是最新的 3. memory stack SM 會記錄 update 哪些 cache line address，並在執行後將這些 address 送回 main GPU # 5. METHODOLOGY ## 5.1 System Modeling and Configuration ![](https://i.imgur.com/lRSFdP9.png) ## 5.2 Offloading Candidate Selection Tool * 使用 tool 分析由 CUDA compiler 產生的 PTX file ## 5.3 Evaluated Applications and Metrics ![](https://i.imgur.com/7nD6DjR.png)