# TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators
###### paper origin: ASPLOS 2023
###### Link: [Paper](https://dl.acm.org/doi/10.1145/3567955.3567961)
## 1. Introduction
- problem
- memory allocation problem
- 2D bin-packing problem where each rectangle's width (lifetime), height (size), and arrival time are known

- two existing solutions
- **heuristics**: exploit domain-specific knowledge for local decisions; fast, but cannot solve workloads with complex allocation behavior that are close to the memory limit
- **solver-based approaches**: reason globally and can handle complex workloads that heuristics can't solve, but are often too slow to run online
- solution: run heuristics and a solver in parallel
- at every buffer-allocation step
1. pick a buffer using one of a set of possible heuristics
2. update the solver with the decision
3. use the solver's output to guide future decisions
## 2. Background
- platforms
- Pixel 6 Tensor SoC: manages the local memory of each PE (processing element)
- requires on-the-fly, on-device compilation due to the diversity of device hardware configurations and other runtime parameters (e.g., library settings)
- goal is to address the long tail of compilation times and reduce the number of failing models to as close to zero as possible
- TPUv4: manage CMEM, a global 128 MB on-chip memory
- integrated into XLA, which repacks buffers many times in its inner loop to maximize their utility
- TelaMalloc packs more buffers into the same memory within the time limit

- memory allocation problem
- difference from the mapping problem
- mapping problem: concerned with determining which level of a memory hierarchy to map each buffer to
- memory allocation problem: selects buffer locations within addressable scratchpad memories that are shared between multiple buffers with overlapping live ranges
- two reasons why prior work on the mapping problem can ignore the allocation problem
- non-overlapping operators: mapping one operator at a time
- partitioned or banked memory: architectures that allow operators to overlap without sharing memory
:::warning
these assumptions don't hold in many production systems, since overlapping and cross-layer fusion of operators are crucial for maximizing performance
:::
## 3. Problem Formulation & Baselines
- a set of buffers $B$, where each buffer is a triple $(\mathit{start}, \mathit{end}, \mathit{size}) \in \mathbb{N}^3$, and a memory limit $M$
- the allocator produces a mapping $B \rightarrow \mathit{address}$, where the address is the lowest address occupied by the buffer
- memory allocation needs to be fast since it can account for a significant portion of compilation time

- heuristic baseline
- consider the buffers with the highest contention first, and place each at the next available location according to the skyline
- divide time into slots; the contention of a time slot is the total size of all buffers live at that time
- the contention of a buffer is the maximum contention over the time slots in which it is alive
- doesn't work for all cases, and cannot recover from a wrong decision (minimal sketch below)
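
Below is a minimal sketch of this contention-then-skyline baseline, assuming a simple `Buffer(start, end, size)` record; it illustrates the idea and is not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    start: int  # first time slot the buffer is live
    end: int    # last time slot the buffer is live (inclusive)
    size: int   # bytes

def contention_order(buffers):
    """Sort buffers by contention, highest first."""
    horizon = max(b.end for b in buffers) + 1
    # Contention of a time slot = total size of all buffers live in that slot.
    slot = [0] * horizon
    for b in buffers:
        for t in range(b.start, b.end + 1):
            slot[t] += b.size
    # Contention of a buffer = maximum slot contention while it is alive.
    return sorted(buffers,
                  key=lambda b: max(slot[t] for t in range(b.start, b.end + 1)),
                  reverse=True)

def greedy_place(buffers, memory_limit):
    """Place highest-contention buffers first, each at the lowest offset that
    does not overlap an already-placed buffer with an overlapping live range."""
    placed = []  # list of (buffer, offset)
    for b in contention_order(buffers):
        overlapping = [(p, o) for p, o in placed
                       if not (p.end < b.start or b.end < p.start)]
        for off in sorted({0} | {o + p.size for p, o in overlapping}):
            fits = off + b.size <= memory_limit
            clash = any(off < o + p.size and o < off + b.size
                        for p, o in overlapping)
            if fits and not clash:
                placed.append((b, off))
                break
        else:
            return None  # the heuristic cannot recover from a bad earlier choice
    return placed
```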

- solver-based approaches
- encode each buffer's lower end (its lowest address) as an integer variable and pairwise non-overlap constraints via boolean variables, then search for a feasible allocation (sketched below)
- as the number of overlapping buffers increases, the search space grows, and the solver cannot exploit domain-specific knowledge to speed things up
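
A hedged sketch of such an encoding with OR-Tools CP-SAT (the paper's exact formulation and solver configuration may differ): each buffer gets an integer variable for its lowest address, and every pair of buffers with overlapping live ranges gets a boolean ordering variable.

```python
from ortools.sat.python import cp_model

def encode_allocation(buffers, memory_limit):
    """buffers: list of (start, end, size) tuples. Returns (model, position vars)."""
    model = cp_model.CpModel()
    pos = [model.NewIntVar(0, memory_limit - size, f"pos_{i}")
           for i, (_, _, size) in enumerate(buffers)]
    for i, (s_i, e_i, sz_i) in enumerate(buffers):
        for j, (s_j, e_j, sz_j) in enumerate(buffers):
            if j <= i or e_i < s_j or e_j < s_i:
                continue  # only pairs with overlapping live ranges need constraints
            below = model.NewBoolVar(f"below_{i}_{j}")  # is i entirely below j?
            model.Add(pos[i] + sz_i <= pos[j]).OnlyEnforceIf(below)
            model.Add(pos[j] + sz_j <= pos[i]).OnlyEnforceIf(below.Not())
    return model, pos

# The number of pairwise variables grows quadratically with the number of
# overlapping buffers, which is what makes the search expensive.
buffers = [(0, 3, 64), (1, 4, 32), (2, 5, 32)]
model, pos = encode_allocation(buffers, memory_limit=128)
solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(p) for p in pos])
```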

## 4. TelaMalloc Overview

- steps
1. use a heuristic to decide which buffer gets placed next
- considers a number of heuristics
- queries constraints from the CP solver to reduce the search space
2. choose one buffer and its location
3. ensure this decision doesn't make the problem unsolvable
- update the CP solver, which checks whether the decision renders the problem unsolvable
- backtrack when the solver indicates the problem is unsolvable
- this lets TelaMalloc exploit domain-specific information and identify unsolvable partial placements early (rough sketch of the loop below)
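
A rough sketch of this per-buffer loop; `pick_candidates` and `solver_check` are hypothetical hooks standing in for the heuristic selection and the CP-solver feasibility query described above.

```python
def allocate(unplaced, placed, memory_limit, pick_candidates, solver_check):
    """Depth-first allocation: a heuristic proposes candidate (buffer, offset)
    placements, the CP solver vets each one, and we backtrack when no candidate
    keeps the problem solvable."""
    if not unplaced:
        return placed                                  # every buffer is placed
    for buf, offset in pick_candidates(unplaced, placed):
        if offset + buf.size > memory_limit:
            continue
        trial = placed + [(buf, offset)]
        if not solver_check(trial):                    # solver: this is a dead end
            continue
        rest = [b for b in unplaced if b is not buf]
        result = allocate(rest, trial, memory_limit, pick_candidates, solver_check)
        if result is not None:
            return result
        # otherwise: backtrack and try the next candidate at this decision point
    return None                                        # backtrack to the caller
```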
## 5. Design & Implementation
### Problem Encoding
- change the ILP formulation's encoding from logical OR to logical XOR, reducing the number of constraints

- combine this with a policy that exploits domain-specific knowledge
- model the search as a tree; each node represents a state (the set of already-placed blocks and their locations)

- choose among a set of heuristics, each of which picks a block (candidate) to place
- three heuristics tried in sequence: longest lifetime, largest size, largest size × lifetime (see the sketch after this list)
- on a backtrack, try the next heuristic and follow that part of the search tree
- keeps a skyline and places each block on the skyline
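
The three orderings can be expressed directly (assuming the `Buffer` fields from the earlier sketch); a backtrack advances to the next heuristic in the sequence:

```python
# Candidate orderings tried in sequence; a backtrack advances to the next one.
HEURISTICS = [
    lambda b: b.end - b.start,              # longest lifetime first
    lambda b: b.size,                       # largest size first
    lambda b: b.size * (b.end - b.start),   # largest size * lifetime first
]

def order_candidates(unplaced, heuristic_index):
    key = HEURISTICS[heuristic_index % len(HEURISTICS)]
    return sorted(unplaced, key=key, reverse=True)
```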

- still doesn't solve many problems, since the heuristics often get stuck in local optima
### Solver-Guided Placement
- ask the solver for the lowest valid location to place the buffer, instead of using the skyline
- necessary so the solver does not get stuck in a local optimum
- improves search time but is still insufficient for the most complex models
- example: [Figure 1, right](#1-Introduction): after blocks 8, 2, 7, 4 are placed, blocks 5 and 6 should still be placeable under block 7 (sketched below)
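
A hedged sketch of asking the solver for the lowest valid address of the next buffer, with already-committed placements frozen (reusing the `encode_allocation` sketch above; not the paper's exact query):

```python
from ortools.sat.python import cp_model

def lowest_valid_offset(buffers, placements, next_idx, memory_limit):
    """placements: {buffer index: committed offset}. Returns the lowest offset at
    which buffer next_idx can be placed while the problem stays solvable, or None."""
    model, pos = encode_allocation(buffers, memory_limit)
    for i, offset in placements.items():
        model.Add(pos[i] == offset)       # freeze decisions already taken
    model.Minimize(pos[next_idx])         # ask for the lowest feasible address
    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        return solver.Value(pos[next_idx])
    return None                           # placing this buffer now is infeasible
```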

### Contention-Based Grouping
- place blocks in the high-contention phase first
- choose blocks from the same phase first; pick blocks from other phases only if that fails
- steps
1. identify all time steps such that blocks live before and blocks live after the step don't overlap
- this divides the problem into independent subproblems
2. use the algorithm in Figure 9 to divide blocks into phases (step 1 is sketched below)
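
A small sketch of step 1, splitting the problem at time steps that no buffer spans (assuming the `Buffer` fields from earlier); the phase-grouping algorithm of Figure 9 itself is not reproduced in these notes:

```python
def split_points(buffers):
    """Time steps t such that no buffer is live both before and at/after t."""
    horizon = max(b.end for b in buffers) + 1
    return [t for t in range(1, horizon)
            if not any(b.start < t <= b.end for b in buffers)]

def split_subproblems(buffers):
    """Partition buffers into independent subproblems at the split points."""
    cuts = split_points(buffers) + [max(b.end for b in buffers) + 1]
    groups, lo = [], 0
    for hi in cuts:
        group = [b for b in buffers if lo <= b.start < hi]
        if group:
            groups.append(group)
        lo = hi
    return groups
```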

### Smart Backtracking
- minor backtracking
- backtrack one step and try the next candidate when placing a block makes the problem unsolvable
- major backtracking
- backtrack when all candidates at the decision point have been tried once; the solver decides how far to backtrack
- when a backtrack occurs
- the CP solver reports the conflicting variable assignments
- this tells which block placements caused the problem
- on a major backtrack, backtrack to the second-to-last conflicting placement
- prepend the set of candidates at the current decision point to the candidate set of the point being backtracked to
- only on major backtracks
- these candidates should be considered earlier than last time
- the number of candidates is limited to prevent the set from growing unboundedly
- record the number of backtracks that occur within the subtree rooted at each backtrack point
- backtrack to that point when the count exceeds a constant, to avoid getting stuck in one part of the search space (rough sketch below)
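
A rough sketch of this bookkeeping; the constants and the exact escalation rule are assumptions, not values from the paper:

```python
MAX_SUBTREE_BACKTRACKS = 64   # assumed constant; the paper only says "a constant"
MAX_CANDIDATE_SET = 16        # assumed cap on the merged candidate set

def choose_backtrack_target(conflict_levels, subtree_backtracks):
    """conflict_levels: decision levels of the placements the CP solver reports as
    conflicting, in placement order; subtree_backtracks: {level: backtracks seen
    within the subtree rooted at that level}."""
    # Major backtrack: go to the second-to-last conflicting placement.
    target = conflict_levels[-2] if len(conflict_levels) >= 2 else conflict_levels[-1]
    # Escape hatch: if a shallower point has accumulated too many backtracks in
    # its subtree, jump there instead of staying stuck in this part of the space.
    for level, count in sorted(subtree_backtracks.items()):
        if level < target and count > MAX_SUBTREE_BACKTRACKS:
            return level
    return target

def merge_candidates(current_candidates, target_candidates):
    """On a major backtrack, prepend the current decision point's candidates so
    they are considered earlier next time, keeping the set bounded."""
    merged = current_candidates + [c for c in target_candidates
                                   if c not in current_candidates]
    return merged[:MAX_CANDIDATE_SET]
```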
### Encoding Alignment
- extend the CP encoding so that the alignment of some buffers can be constrained
- example: a fixed alignment $A(X) \ge 1$ for buffer $X$ (sketched below)
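
A hedged sketch of how such an alignment constraint could be attached to a buffer's position variable in the CP encoding above (the paper's exact encoding is not given in these notes):

```python
def add_alignment(model, pos_var, alignment, memory_limit):
    """Constrain a buffer's start address to be a multiple of `alignment`."""
    if alignment <= 1:
        return                                   # A(X) = 1 imposes no constraint
    k = model.NewIntVar(0, memory_limit // alignment, "align_multiple")
    model.Add(pos_var == alignment * k)          # address must be aligned
```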

### Compiler Integration
- Pixel 6
- integrated into the production compiler, replacing the ILP approach (which was triggered when the heuristic fails)
- TPUv4
- integrated into the XLA compiler framework; called repeatedly by the memory packer to pack buffers more densely
## 6. Learning Allocation Strategies
- forward-looking research on enabling TelaMalloc to handle the small number of models that none of the allocators can handle
### Challenges of Machine Learning
- challenges
- from a performance perspective, running a model at every step to select a set of candidates is infeasible
- most models don't benefit from the ML approach
- the memory allocator needs to behave consistently after it has shipped
- solutions
- only use ML during major backtracks, to predict where to backtrack to
- use offline learning: learn a single static backtracking model and bake it into TelaMalloc
### Learning Approach
- problem
- difficult to train a model that can reliably select among thousands of possible targets
- the selection needs to be precise, otherwise it may make the problem unsolvable
- it also needs to be cheap
- solution
- only choose among a small set of candidates
- the set contains the decisions that caused major backtracks
- the last backtrack target is excluded, otherwise the result is the same as a minor backtrack
- add decision-level ranges 0-4, 5-8, 9-16 (ranges grow exponentially); if a range contains no backtrack target, add the top of that range to the set
- prevents getting stuck when all targets are in the same part of the tree
- use the ILP approach's solution as the label and train an imitation-learning model
### Producing Labels for Imitation Learning
- minimum and best backtrack points
- use the ILP solver to find the deepest point on the current path that is still solvable
- encode the problem as an ILP and fix all position variables of blocks that have already been placed
- the problem cannot be solved without backtracking at least to this point, yet it might not be the best point to backtrack to
- the best backtrack point is the deepest point on the intersection of the current path and the solution path (sketch of the minimum point below)
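
A rough sketch of computing the minimum backtrack point by fixing prefixes of the current path and checking solvability; it reuses the CP encoding sketched earlier as a stand-in for the paper's ILP formulation.

```python
from ortools.sat.python import cp_model

def minimum_backtrack_point(buffers, path, memory_limit):
    """path: (buffer index, offset) decisions in the order they were made.
    Returns the deepest prefix length whose placements can still be extended
    to a complete solution."""
    def solvable(prefix_len):
        model, pos = encode_allocation(buffers, memory_limit)
        for idx, offset in path[:prefix_len]:
            model.Add(pos[idx] == offset)   # fix the already-placed blocks
        solver = cp_model.CpSolver()
        return solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE)

    # Linear scan from the deepest point upward; a binary search would also work.
    for depth in range(len(path), -1, -1):
        if solvable(depth):
            return depth
    return 0
```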

### Model Design & Feature Engineering
- use a pointwise ranking model
- produce a score for each candidate and train a gradient boosted tree model to predict the score

- train the model on these features
- size, lifetime, and contention of the block
- the decision level at which the block was placed
- how many times the block appeared in the backtracking reason of a major backtrack
- how many times the search backtracked to this point
- how many times the search backtracked anywhere within the subtree rooted at this point
- whether the point is in the same region as the point being backtracked from
- the total number of backtracks
### Implementation & Training Strategy
- built on gradient boosted trees, configured to learn a regression from features to score
- at every major backtrack, use the regular backtracking approach or the ILP approach with 50% probability each, to add randomness to the training data

- implementation
- during a major backtrack, collect features for each backtrack target and pass them to the model
- the scores produced by the model are weighted according to the targets' depth; pick the highest-scoring target whose score is above a certain threshold
- if no score is above the threshold, try all unplaced buffers until one fits, and perform a minor backtrack otherwise (sketched below)
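
A hedged sketch of the pointwise ranking setup with a gradient boosted tree regressor; the feature layout, the 0/1 imitation labels, the depth weighting, and the threshold are assumptions based on the description above, not the paper's exact choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_scorer(feature_rows, labels):
    """feature_rows: one feature vector per candidate backtrack target (size,
    lifetime, contention, decision level, conflict counts, ...); labels: 1.0 for
    the ILP-derived best backtrack point, 0.0 for every other candidate."""
    model = GradientBoostingRegressor()
    model.fit(np.asarray(feature_rows), np.asarray(labels))
    return model

def pick_backtrack_target(model, candidates, feature_rows, depths, threshold=0.5):
    """Score every candidate, weight the scores by depth (assumed: deeper, i.e.
    cheaper, targets are preferred), and return None if nothing clears the
    threshold so the caller falls back to the regular strategy."""
    scores = model.predict(np.asarray(feature_rows))
    d = np.asarray(depths, dtype=float)
    weighted = scores * (d + 1.0) / (d.max() + 1.0)
    best = int(np.argmax(weighted))
    return candidates[best] if weighted[best] >= threshold else None
```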
## 7. Evaluation
- compare with ILP

- a workload with no overlapping buffers compared against one where all buffers overlap (the latter won't occur in practice)
- to test search efficiency

- compare with other approaches in this paper

- heuristic baseline's running time (each model given 110% of its minimum required memory)
- faster than TelaMalloc, so the heuristic is still tried first

- compare with other strategies
- a run is counted as failed after 500,000 steps

- execution time of the ML approach
- the model runs in about 2 µs per candidate

- compare with the version without ML

- how the model makes decisions

- compare with best-fit in XLA on TPUv4
- achieves up to 7% better performance than best-fit

## 8. Discussion
- Workload Analysis
- visualized memory allocation pattern
- the first phase is similar to [Figure 1](#1-Introduction); TelaMalloc exploits domain-specific knowledge to solve it
- the later part is handled by contention-based grouping

- Generalization of Ideas
- extends to other areas
- search-space exploration problems such as hardware design space exploration and floor planning, which are expressed as ILP programs
- the learning approach can be extended beyond backtracking
- keep a single, shallow decision tree that executes at every step of the search and decides whether to run a more expensive model that considers different blocks, or a more expensive heuristic
## 9. Related Work

## 10. Conclusion
- a new method for solving the memory allocation problem on machine learning accelerators
- combines heuristics with a solver-based approach to explore a complex search space more efficiently
- up to two orders of magnitude speedup for several important models, enabling on-the-fly compilation of models that would otherwise have been impossible
- ships in the Pixel 6 and in the TPUv4 compiler