# Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

###### tags: `GPUs` `Bandwidth` `Memory management` `Mathematical model` `3D displays` `Runtime` `Message systems`
###### paper origin: IEEE ISCA 2016
###### papers: [link](https://ieeexplore.ieee.org/document/7551394)
###### slides: [link](https://people.inf.ethz.ch/omutlu/pub/TOM-programmer-transparent-GPU-near-data-processing_kevinhsieh_isca16-talk.pptx)

# 1. INTRODUCTION

## Research Problems
* How to enable **computation offloading** and **data mapping** to multiple 3D-stacked memories without burdening the programmer, so that any application can transparently benefit from near-data processing (NDP) capabilities in the logic layer.

## NDP GPU system
NDP enables very wide, energy-efficient memory interfaces to the compute units placed in the logic layer of the memory stacks.
![](https://i.imgur.com/tFn5vyC.png)
* Challenges
    1. Which operations to offload?
        * Programmers would need to identify the operations to offload and reason about their runtime behavior.
        ![](https://i.imgur.com/2uUH1lR.png)
        * Solution: an offload candidate selection mechanism implemented as a static compiler analysis pass, requiring no programmer intervention.
    2. How to map data across multiple memory stacks?
        * Programmers would need to map all the operands of each offloaded operation to the same memory stack.
        ![](https://i.imgur.com/gKdcyfx.png)
        * Solution: a programmer-transparent data mapping mechanism that places data in the same memory stack as the offloaded code that accesses it.

## Proposed Solutions
* Compiler-based technique
    * Automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis.
* Software/hardware cooperative mechanism
    * Predicts which memory pages will be accessed by offloaded code and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption.

## Results
* Performance: improved by 30% on average (up to 76%)
* Memory traffic: reduced by 38% on average (up to 99%)
* Energy: reduced by 11% on average (up to 37%)
* Area overhead: 0.11 mm^2 at 40 nm (0.018%)

# 2. Mechanism

## Identification of Offloading Candidates
* A block is offloaded only if doing so saves more memory bandwidth during offloaded execution than it costs in the additional data transfers needed to initiate and complete the offload.
* The memory bandwidth cost-benefit is estimated at compile time (see the sketch after the data-mapping discussion below):
    * Load instruction
        * TX: address
        * RX: data
    * Store instruction
        * TX: address, data
        * RX: ACK message
        * The ACK message is 1/4 the size of the data.
![](https://i.imgur.com/xzb3yJa.png)
The estimate also takes the load/store unit and caches into account:
![](https://i.imgur.com/DRrxyfe.png)
* Each offloading candidate block is tagged with 2 bits (None, TX, RX) indicating the TX/RX channel bandwidth savings.
* Loops
    * Statically, the trip count is unknown before entering the loop, so the compiler assumes a count of one and marks the block as a **conditional offloading candidate**.
    * The offloading condition is then evaluated during execution.

## Programmer-transparent Data Mapping
* Memory access pattern analysis:
![](https://i.imgur.com/T6TIPKH.png)
* 85% of all offloading candidates access memory with a **fixed offset**.
![](https://i.imgur.com/mGvcHkP.png)
* Several workloads have offloading candidates that *always* access memory with a fixed offset.

## Predictability of the best memory mapping
![](https://i.imgur.com/M8fQV6E.png)
* Learning phase
    * The GPU initially accesses data from CPU memory while the mechanism learns the best mapping.
    * The memory copy to GPU memory is delayed until the learning phase is over.
    * Produces a **memory allocation table** for the next phase.
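The offload candidate selection above boils down to a per-block bandwidth comparison. Below is a minimal sketch of that compile-time estimate; the transfer sizes (`ADDR_BYTES`, `DATA_BYTES`, `REG_BYTES`) and the live-in/live-out accounting are illustrative assumptions, and only the 1/4-size store ACK is taken from the description above.

```python
# Minimal sketch of the compile-time bandwidth cost-benefit check for one
# candidate block. Sizes are assumed, not the paper's exact model.

ADDR_BYTES = 8                # assumed address size per off-chip request
DATA_BYTES = 32               # assumed data size per load/store
ACK_BYTES = DATA_BYTES // 4   # store ACK is ~1/4 of the data size
REG_BYTES = 4                 # assumed size of one live-in/live-out register

def offload_benefit(num_loads, num_stores, live_ins, live_outs, loop_trip_count=1):
    """Return (tx_saving, rx_saving) in bytes if the block is offloaded."""
    # Off-chip traffic the block would generate if executed on the main GPU.
    tx_kept_on_stack = loop_trip_count * (num_loads * ADDR_BYTES +
                                          num_stores * (ADDR_BYTES + DATA_BYTES))
    rx_kept_on_stack = loop_trip_count * (num_loads * DATA_BYTES +
                                          num_stores * ACK_BYTES)
    # Extra traffic needed to initiate and complete the offload
    # (live-in registers sent out, live-out registers / ack sent back).
    tx_offload_cost = live_ins * REG_BYTES
    rx_offload_cost = live_outs * REG_BYTES
    return (tx_kept_on_stack - tx_offload_cost,
            rx_kept_on_stack - rx_offload_cost)

tx, rx = offload_benefit(num_loads=4, num_stores=2, live_ins=3, live_outs=1)
print("offload" if tx > 0 and rx > 0 else "keep on main GPU", tx, rx)
```

A block whose trip count is unknown at compile time would be evaluated with `loop_trip_count=1` and marked as a conditional candidate, with the real condition checked at run time.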
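For the learning phase just described, the sketch below illustrates one way the best mapping could be picked: try a few candidate stack-interleaving bit positions and keep the one under which the accesses of an offloaded block most often fall in a single stack. The stack count, bit positions, and toy fixed-offset trace are assumptions for illustration, not the paper's actual mapping search.

```python
# Minimal sketch of the learning-phase mapping choice. All constants are
# illustrative assumptions.

NUM_STACKS = 4

def stack_of(addr, shift):
    """Memory stack selected when two address bits starting at `shift` index the stack."""
    return (addr >> shift) & (NUM_STACKS - 1)

def best_mapping(offload_traces, candidate_shifts):
    """offload_traces: list of address lists, one list per offloaded block instance."""
    def colocated_fraction(shift):
        hits = sum(1 for trace in offload_traces
                   if len({stack_of(a, shift) for a in trace}) == 1)
        return hits / len(offload_traces)
    return max(candidate_shifts, key=colocated_fraction)

# Toy trace: each offloaded block touches two arrays at a fixed 32 KiB offset,
# mimicking the fixed-offset access pattern observed for most candidates.
traces = [[base, base + 32 * 1024] for base in range(0, 1 << 22, 4096)]
print("best stack-index bit position:", best_mapping(traces, candidate_shifts=[12, 15, 17]))
```

Once the best mapping is known, the delayed memory copy can place the data accordingly, which is what the memory allocation table is used for in the next phase.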
## Dynamic Offloading Aggressiveness Control
* Issues
    * The memory stack SMs could become the new performance bottleneck if they receive more offloading requests than they can handle.
    * There can be a discrepancy between the bandwidth savings on the RX and TX off-chip links.
* Mechanism
    * Stop further offloading when the memory stack SMs do not have enough free warp slots.
    * Track the utilization of the TX and RX channels against a threshold, and stop offloading blocks that would add traffic to a channel above that threshold.

# 3. Implementation

## Hardware Block Diagram
![](https://i.imgur.com/oiGFz0r.png)
* Offload Controller
    * Makes the final offloading decision (based on the **compiler** tags, **conditional candidates**, and **dynamic aggressiveness control**).
    * Sends offloading requests to the memory stacks: packs the offloading information and sends it to the memory stack SMs.
    * Resumes the offloaded warp when it receives the corresponding acknowledgment packet from the memory stack SMs.
* Channel Busy Monitor
    * Monitors the utilization of the off-chip channels (TX and RX) and reports it to the Offload Controller.
* Memory Map Analyzer (used only in the learning phase)
    * Supports programmer-transparent data mapping.

## Design of NDP Offloading
* Interface between the compiler and the hardware
    * A **new instruction** indicates the beginning of an offloading candidate block.
    * The compiler allocates an **offloading metadata table** in on-chip shared memory, containing:
        * Begin/end PC addresses
        * Live-in/live-out registers
        * 2-bit tags indicating the TX/RX channel savings
        * The condition of conditional offloading candidates
* Offloading candidate block detection
    * When the Instruction Buffer detects the instruction marking the beginning of an offloading candidate block, it marks the warp as not ready and consults the Offload Controller for an offloading decision.
* Dynamic offloading decision (see the sketch at the end of this section)
    * For a conditional offloading candidate, the Offload Controller checks whether the offloading condition is true.
    * If the Channel Busy Monitor signals one of the TX/RX channels as busy and the 2-bit tag for this block indicates that it would add memory traffic to that busy channel, the Offload Controller does not offload it.
    * The Offload Controller determines the offloading destination based on the memory stack that will be accessed by the first instruction of the block.
* Sending offloading requests
    * The Offload Controller packs the live-in registers, begin/end PCs, and active mask into an offloading request and sends it to the destination memory stack.
* Receiving offload acknowledgment
    * When the offloaded block completes, the memory stack SM sends an acknowledgment packet back to the main GPU.

## Design of Programmer-transparent Data Mapping
* Hardware
    * Memory mapping analyzer in the GPU.
* Software
    * Modified GPU host-side driver (running on the CPU) and GPU runtime (running on the GPU).
* Mechanism
    * With programmer-transparent data mapping, the GPU driver still allocates memory in the GPU virtual memory space, but delays the copy by initially mapping the GPU virtual memory to CPU memory. During the initial learning phase, the GPU driver records each memory allocation in a memory allocation table for later reference.
    * At GPU kernel launch, the GPU driver sets up the memory mapping analyzer based on two tables: the memory allocation table (from the driver) and the offloading metadata table (from the compiler).
    * The memory mapping analyzer monitors execution and updates the memory allocation table.
    * After the memory mapping analyzer has observed a pre-determined number of offloading candidate instances, it raises an interrupt to the GPU, which then switches to the best memory mapping found.
    * The GPU driver performs the memory copy based on the best mapping found during the learning phase.
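Tying together the dynamic aggressiveness control and the Offload Controller's dynamic offloading decision described above, here is a minimal sketch of that decision logic. The thresholds, warp-slot limit, field names, and the placeholder address-to-stack mapping are assumptions for illustration, not the hardware's exact design.

```python
# Minimal sketch of the Offload Controller's runtime decision. Constants and
# field names are illustrative assumptions.

from dataclasses import dataclass

NUM_STACKS = 4
BUSY_THRESHOLD = 0.9          # assumed utilization threshold for a "busy" channel
MAX_INFLIGHT_WARPS = 64       # assumed warp capacity of the SMs in one stack

@dataclass
class CandidateBlock:
    adds_tx_traffic: bool     # from the compiler's 2-bit TX/RX tag
    adds_rx_traffic: bool
    conditional: bool         # conditional offloading candidate (e.g. a loop)
    condition_true: bool      # runtime evaluation of the offloading condition
    first_access_addr: int    # address accessed by the block's first instruction

def should_offload(block, tx_util, rx_util, inflight_warps_per_stack):
    """Return the destination stack id, or None to run on the main GPU."""
    if block.conditional and not block.condition_true:
        return None
    # Aggressiveness control: don't add traffic to an already-busy channel.
    if block.adds_tx_traffic and tx_util > BUSY_THRESHOLD:
        return None
    if block.adds_rx_traffic and rx_util > BUSY_THRESHOLD:
        return None
    # Destination = stack that the block's first memory instruction will access
    # (a simple address-interleaved mapping is used here as a placeholder).
    dest = (block.first_access_addr >> 12) & (NUM_STACKS - 1)
    # Back off if the destination stack's SMs have no free warp slots.
    if inflight_warps_per_stack[dest] >= MAX_INFLIGHT_WARPS:
        return None
    return dest

blk = CandidateBlock(adds_tx_traffic=False, adds_rx_traffic=True,
                     conditional=False, condition_true=True,
                     first_access_addr=0x8000)
print(should_offload(blk, tx_util=0.5, rx_util=0.4,
                     inflight_warps_per_stack=[10, 3, 0, 7]))
```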
<!--
* Design Considerations
    * Virtual address translation in memory stacks
        * The page table needed for address translation may not be located in the same memory stack as the requesting SM
        * If the GPU SMs were to update the page table, a TLB shootdown may be needed to maintain the correctness of address translation
    * Cache coherence between the SMs in the main GPU and the SMs in the memory stacks
-->

# 4. Result
* TOM significantly improves performance.
* TOM provides better performance than adding the same number of extra SMs to the main GPU.
* BFS benefits the least: it is a very irregular workload, so the mapping chosen by the mechanism from the initial 0.1% of offloading candidate instances is not the best one for all offloading candidate instances.
![](https://i.imgur.com/D3tiwEJ.png)
* Effect on Memory Traffic
* Effect on Energy Consumption
![](https://i.imgur.com/O3k7S5S.png)

# 5. Thoughts
* Introduces a learning-phase mechanism and backs it with supporting experimental results.
* Uses the GPU driver to control the GPU at runtime in software.