# TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators
###### paper origin: ASPLOS 2023
###### Link: [Paper](https://dl.acm.org/doi/10.1145/3567955.3567961)
## 1. Introduction
- problem
- memory allocation problem
- 2D bin-packing problem where each rectangle's width (lifetime), height (size), and arrival time are known

- two existing solutions
- **heuristics**: exploit domain-specific knowledge for local decisions; fast, but cannot solve workloads with complex allocation behavior that are close to the memory limit
- **solver-based approaches**: reason globally and can handle complex workloads that heuristics can't solve, but are often too slow to run online
- solution: run heuristics and a solver in parallel
- at every buffer-allocation step
1. pick a buffer using one of a set of possible heuristics
2. update the solver with the decision
3. use the solver's output to guide future decisions
## 2. Background
- platforms
- Pixel 6 Tensor SoC: manages the local memory of each PE (processing element)
- requires on-the-fly, on-device compilation due to the diversity of device hardware configurations and other runtime parameters (e.g., library settings)
- goal is to address the long tail of compilation times and reduce the number of failing models to as close to zero as possible
- TPUv4: manage CMEM, a global 128 MB on-chip memory
- integrated into XLA, which repacks buffers many times in its inner loop to maximize their utility
- TelaMalloc packs more buffers into the same memory within the time limit

- memory allocation problem
- difference from the mapping problem
- mapping problem: concerned with determining which level of a memory hierarchy to map each buffer to
- memory allocation problem: selects buffer locations within addressable scratchpad memories that are shared between multiple buffers with overlapping live ranges
- two reasons why prior work on the mapping problem can ignore the allocation problem
- non-overlapping operators: mapping one operator at a time
- partitioned or banked memory: architectures that allow operators to overlap without sharing memory
:::warning
these assumptions don't hold in many production systems, since overlapping and cross-layer fusion of operators are crucial for maximizing performance
:::
## 3. Problem Formulation & Baselines
- a set of buffers $B$, where each buffer is a triple $(\mathit{start}, \mathit{end}, \mathit{size}) \in \mathbb{N}^3$, and a memory limit $M$
- the allocator produces a mapping $B \rightarrow \mathit{address}$, where the address is the lowest address occupied by the buffer
- memory allocation needs to be fast since it can account for a significant portion of compilation time

- heuristic baseline
- consider the buffers with the highest contention first, and place each at the next available location according to the skyline
- divide time into slots; the contention of a time slot is the total size of all buffers live at that time
- the contention of a buffer is the maximum contention over the time slots in which it is alive
- doesn't work for all cases, and cannot recover from a wrong decision (minimal sketch below)
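
Below is a minimal sketch of this contention-then-skyline baseline, assuming a simple `Buffer(start, end, size)` record; it illustrates the idea and is not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Buffer:
    start: int  # first time slot the buffer is live
    end: int    # last time slot the buffer is live (inclusive)
    size: int   # bytes

def contention_order(buffers):
    """Sort buffers by contention, highest first."""
    horizon = max(b.end for b in buffers) + 1
    # Contention of a time slot = total size of all buffers live in that slot.
    slot = [0] * horizon
    for b in buffers:
        for t in range(b.start, b.end + 1):
            slot[t] += b.size
    # Contention of a buffer = maximum slot contention while it is alive.
    return sorted(buffers,
                  key=lambda b: max(slot[t] for t in range(b.start, b.end + 1)),
                  reverse=True)

def greedy_place(buffers, memory_limit):
    """Place highest-contention buffers first, each at the lowest offset that
    does not overlap an already-placed buffer with an overlapping live range."""
    placed = []  # list of (buffer, offset)
    for b in contention_order(buffers):
        overlapping = [(p, o) for p, o in placed
                       if not (p.end < b.start or b.end < p.start)]
        for off in sorted({0} | {o + p.size for p, o in overlapping}):
            fits = off + b.size <= memory_limit
            clash = any(off < o + p.size and o < off + b.size
                        for p, o in overlapping)
            if fits and not clash:
                placed.append((b, off))
                break
        else:
            return None  # the heuristic cannot recover from a bad earlier choice
    return placed
```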

- solver-based approaches
- encode each buffer's lower end (its lowest address) as an integer variable and pairwise non-overlap constraints via boolean variables, then search for a feasible allocation (sketched below)
- as the number of overlapping buffers increases, the search space grows, and the solver cannot exploit domain-specific knowledge to speed things up
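
A hedged sketch of such an encoding with OR-Tools CP-SAT (the paper's exact formulation and solver configuration may differ): each buffer gets an integer variable for its lowest address, and every pair of buffers with overlapping live ranges gets a boolean ordering variable.

```python
from ortools.sat.python import cp_model

def encode_allocation(buffers, memory_limit):
    """buffers: list of (start, end, size) tuples. Returns (model, position vars)."""
    model = cp_model.CpModel()
    pos = [model.NewIntVar(0, memory_limit - size, f"pos_{i}")
           for i, (_, _, size) in enumerate(buffers)]
    for i, (s_i, e_i, sz_i) in enumerate(buffers):
        for j, (s_j, e_j, sz_j) in enumerate(buffers):
            if j <= i or e_i < s_j or e_j < s_i:
                continue  # only pairs with overlapping live ranges need constraints
            below = model.NewBoolVar(f"below_{i}_{j}")  # is i entirely below j?
            model.Add(pos[i] + sz_i <= pos[j]).OnlyEnforceIf(below)
            model.Add(pos[j] + sz_j <= pos[i]).OnlyEnforceIf(below.Not())
    return model, pos

# The number of pairwise variables grows quadratically with the number of
# overlapping buffers, which is what makes the search expensive.
buffers = [(0, 3, 64), (1, 4, 32), (2, 5, 32)]
model, pos = encode_allocation(buffers, memory_limit=128)
solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(p) for p in pos])
```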

## 4. TelaMalloc Overview

- steps
1. use a heuristic to decide which buffer gets placed next
- considers a number of heuristics
- queries constraints from the CP solver to reduce the search space
2. choose one buffer and its location
3. ensure this decision doesn't make the problem unsolvable
- update the CP solver, which checks whether the decision renders the problem unsolvable
- backtrack when the solver indicates the problem is unsolvable
- this lets TelaMalloc exploit domain-specific information and identify unsolvable partial placements early (rough sketch of the loop below)
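
A rough sketch of this per-buffer loop; `pick_candidates` and `solver_check` are hypothetical hooks standing in for the heuristic selection and the CP-solver feasibility query described above.

```python
def allocate(unplaced, placed, memory_limit, pick_candidates, solver_check):
    """Depth-first allocation: a heuristic proposes candidate (buffer, offset)
    placements, the CP solver vets each one, and we backtrack when no candidate
    keeps the problem solvable."""
    if not unplaced:
        return placed                                  # every buffer is placed
    for buf, offset in pick_candidates(unplaced, placed):
        if offset + buf.size > memory_limit:
            continue
        trial = placed + [(buf, offset)]
        if not solver_check(trial):                    # solver: this is a dead end
            continue
        rest = [b for b in unplaced if b is not buf]
        result = allocate(rest, trial, memory_limit, pick_candidates, solver_check)
        if result is not None:
            return result
        # otherwise: backtrack and try the next candidate at this decision point
    return None                                        # backtrack to the caller
```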
## 5. Design & Implementation
### Problem Encoding
- change the ILP formulation's encoding from logical OR to logical XOR, reducing the number of constraints

- combine this with a policy that exploits domain-specific knowledge
- model the search as a tree; each node represents a state (the set of already-placed blocks and their locations)

- choose among a set of heuristics, each of which picks a block (candidate) to place
- three heuristics tried in sequence: longest lifetime, largest size, largest size × lifetime (see the sketch after this list)
- on a backtrack, try the next heuristic and follow that part of the search tree
- keeps a skyline and places each block on the skyline
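
The three orderings can be expressed directly (assuming the `Buffer` fields from the earlier sketch); a backtrack advances to the next heuristic in the sequence:

```python
# Candidate orderings tried in sequence; a backtrack advances to the next one.
HEURISTICS = [
    lambda b: b.end - b.start,              # longest lifetime first
    lambda b: b.size,                       # largest size first
    lambda b: b.size * (b.end - b.start),   # largest size * lifetime first
]

def order_candidates(unplaced, heuristic_index):
    key = HEURISTICS[heuristic_index % len(HEURISTICS)]
    return sorted(unplaced, key=key, reverse=True)
```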

- still doesn't solve many problems, since the heuristics often get stuck in local optima
### Solver-Guided Placement
- ask the solver for the lowest valid location to place the buffer, instead of using the skyline
- necessary so the solver does not get stuck in a local optimum
- improves search time but is still insufficient for the most complex models
- example: [Figure 1, right](#1-Introduction): after blocks 8, 2, 7, 4 are placed, blocks 5 and 6 should still be placeable under block 7 (sketched below)
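
A hedged sketch of asking the solver for the lowest valid address of the next buffer, with already-committed placements frozen (reusing the `encode_allocation` sketch above; not the paper's exact query):

```python
from ortools.sat.python import cp_model

def lowest_valid_offset(buffers, placements, next_idx, memory_limit):
    """placements: {buffer index: committed offset}. Returns the lowest offset at
    which buffer next_idx can be placed while the problem stays solvable, or None."""
    model, pos = encode_allocation(buffers, memory_limit)
    for i, offset in placements.items():
        model.Add(pos[i] == offset)       # freeze decisions already taken
    model.Minimize(pos[next_idx])         # ask for the lowest feasible address
    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        return solver.Value(pos[next_idx])
    return None                           # placing this buffer now is infeasible
```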

### Contention-Based Grouping
- place blocks in the high-contention phase first
- choose blocks from the same phase first; pick blocks from other phases only if that fails
- steps
1. identify all time steps such that blocks live before and blocks live after the step don't overlap
- this divides the problem into independent subproblems
2. use the algorithm in Figure 9 to divide blocks into phases (step 1 is sketched below)
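
A small sketch of step 1, splitting the problem at time steps that no buffer spans (assuming the `Buffer` fields from earlier); the phase-grouping algorithm of Figure 9 itself is not reproduced in these notes:

```python
def split_points(buffers):
    """Time steps t such that no buffer is live both before and at/after t."""
    horizon = max(b.end for b in buffers) + 1
    return [t for t in range(1, horizon)
            if not any(b.start < t <= b.end for b in buffers)]

def split_subproblems(buffers):
    """Partition buffers into independent subproblems at the split points."""
    cuts = split_points(buffers) + [max(b.end for b in buffers) + 1]
    groups, lo = [], 0
    for hi in cuts:
        group = [b for b in buffers if lo <= b.start < hi]
        if group:
            groups.append(group)
        lo = hi
    return groups
```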

### Smart Backtracking
- minor backtracking
- backtrack one step and try the next candidate when placing a block makes the problem unsolvable
- major backtracking
- backtrack when all candidates at the decision point have been tried once; the solver decides how far to backtrack
- when a backtrack occurs
- the CP solver reports the conflicting variable assignments
- this tells which block placements caused the problem
- on a major backtrack, backtrack to the second-to-last conflicting placement
- prepend the set of candidates at the current decision point to the candidate set of the point being backtracked to
- only on major backtracks
- these candidates should be considered earlier than last time
- the number of candidates is limited to prevent the set from growing unboundedly
- record the number of backtracks that occur within the subtree rooted at each backtrack point
- backtrack to that point when the count exceeds a constant, to avoid getting stuck in one part of the search space (rough sketch below)
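
A rough sketch of this bookkeeping; the constants and the exact escalation rule are assumptions, not values from the paper:

```python
MAX_SUBTREE_BACKTRACKS = 64   # assumed constant; the paper only says "a constant"
MAX_CANDIDATE_SET = 16        # assumed cap on the merged candidate set

def choose_backtrack_target(conflict_levels, subtree_backtracks):
    """conflict_levels: decision levels of the placements the CP solver reports as
    conflicting, in placement order; subtree_backtracks: {level: backtracks seen
    within the subtree rooted at that level}."""
    # Major backtrack: go to the second-to-last conflicting placement.
    target = conflict_levels[-2] if len(conflict_levels) >= 2 else conflict_levels[-1]
    # Escape hatch: if a shallower point has accumulated too many backtracks in
    # its subtree, jump there instead of staying stuck in this part of the space.
    for level, count in sorted(subtree_backtracks.items()):
        if level < target and count > MAX_SUBTREE_BACKTRACKS:
            return level
    return target

def merge_candidates(current_candidates, target_candidates):
    """On a major backtrack, prepend the current decision point's candidates so
    they are considered earlier next time, keeping the set bounded."""
    merged = current_candidates + [c for c in target_candidates
                                   if c not in current_candidates]
    return merged[:MAX_CANDIDATE_SET]
```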
### Encoding Alignment
- extend the CP encoding so that the alignment of some buffers can be constrained
- example: a fixed alignment $A(X) \ge 1$ for buffer $X$ (sketched below)
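
A hedged sketch of how such an alignment constraint could be attached to a buffer's position variable in the CP encoding above (the paper's exact encoding is not given in these notes):

```python
def add_alignment(model, pos_var, alignment, memory_limit):
    """Constrain a buffer's start address to be a multiple of `alignment`."""
    if alignment <= 1:
        return                                   # A(X) = 1 imposes no constraint
    k = model.NewIntVar(0, memory_limit // alignment, "align_multiple")
    model.Add(pos_var == alignment * k)          # address must be aligned
```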

### Compiler Integration
- Pixel 6
- integrated into the production compiler, replacing the ILP approach (which was triggered when the heuristic fails)
- TPUv4
- integrated into the XLA compiler framework; called repeatedly by the memory packer to pack buffers more densely
## 6. Learning Allocation Strategies
- forward-looking research on enabling TelaMalloc to handle the small number of models that none of the allocators can handle
### Challenges of Machine Learning
- challenges
- from a performance perspective, running a model at every step to select a set of candidates is infeasible
- most models don't benefit from the ML approach
- the memory allocator needs to behave consistently after it has shipped
- solutions
- only use ML during major backtracks, to predict where to backtrack to
- use offline learning: learn a single static backtracking model and bake it into TelaMalloc
### Learning Approach
- problem
- difficult to train a model that can reliably select among thousands of possible targets
- the selection needs to be precise, otherwise it may make the problem unsolvable
- it also needs to be cheap
- solution
- only choose among a small set of candidates
- the set contains the decisions that caused major backtracks
- the last backtrack target is excluded, otherwise the result is the same as a minor backtrack
- add decision-level ranges 0-4, 5-8, 9-16 (ranges grow exponentially); if a range contains no backtrack target, add the top of that range to the set
- prevents getting stuck when all targets are in the same part of the tree
- use the ILP approach's solution as the label and train an imitation-learning model
### Producing Labels for Imitation Learning
- minimum and best backtrack points
- use the ILP solver to find the deepest point on the current path that is still solvable
- encode the problem as an ILP and fix all position variables of blocks that have already been placed
- the problem cannot be solved without backtracking at least to this point, yet it might not be the best point to backtrack to
- the best backtrack point is the deepest point on the intersection of the current path and the solution path (sketch of the minimum point below)
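
A rough sketch of computing the minimum backtrack point by fixing prefixes of the current path and checking solvability; it reuses the CP encoding sketched earlier as a stand-in for the paper's ILP formulation.

```python
from ortools.sat.python import cp_model

def minimum_backtrack_point(buffers, path, memory_limit):
    """path: (buffer index, offset) decisions in the order they were made.
    Returns the deepest prefix length whose placements can still be extended
    to a complete solution."""
    def solvable(prefix_len):
        model, pos = encode_allocation(buffers, memory_limit)
        for idx, offset in path[:prefix_len]:
            model.Add(pos[idx] == offset)   # fix the already-placed blocks
        solver = cp_model.CpSolver()
        return solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE)

    # Linear scan from the deepest point upward; a binary search would also work.
    for depth in range(len(path), -1, -1):
        if solvable(depth):
            return depth
    return 0
```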

### Model Design & Feature Engineering
- use a pointwise ranking model
- produce a score for each candidate and train a gradient boosted tree model to predict the score

- train the model on these features
- size, lifetime, and contention of the block
- the decision level at which the block was placed
- how many times the block appeared in the backtracking reason of a major backtrack
- how many times the search backtracked to this point
- how many times the search backtracked anywhere within the subtree rooted at this point
- whether the point is in the same region as the point being backtracked from
- the total number of backtracks
### Implementation & Training Strategy
- built on gradient boosted trees, configured to learn a regression from features to score
- at every major backtrack, use the regular backtracking approach or the ILP approach with 50% probability each, to add randomness to the training data

- implementation
- during a major backtrack, collect features for each backtrack target and pass them to the model
- the scores produced by the model are weighted according to the targets' depth; pick the highest-scoring target whose score is above a certain threshold
- if no score is above the threshold, try all unplaced buffers until one fits, and perform a minor backtrack otherwise (sketched below)
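
A hedged sketch of the pointwise ranking setup with a gradient boosted tree regressor; the feature layout, the 0/1 imitation labels, the depth weighting, and the threshold are assumptions based on the description above, not the paper's exact choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def train_scorer(feature_rows, labels):
    """feature_rows: one feature vector per candidate backtrack target (size,
    lifetime, contention, decision level, conflict counts, ...); labels: 1.0 for
    the ILP-derived best backtrack point, 0.0 for every other candidate."""
    model = GradientBoostingRegressor()
    model.fit(np.asarray(feature_rows), np.asarray(labels))
    return model

def pick_backtrack_target(model, candidates, feature_rows, depths, threshold=0.5):
    """Score every candidate, weight the scores by depth (assumed: deeper, i.e.
    cheaper, targets are preferred), and return None if nothing clears the
    threshold so the caller falls back to the regular strategy."""
    scores = model.predict(np.asarray(feature_rows))
    d = np.asarray(depths, dtype=float)
    weighted = scores * (d + 1.0) / (d.max() + 1.0)
    best = int(np.argmax(weighted))
    return candidates[best] if weighted[best] >= threshold else None
```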
## 7. Evaluation
- compare with ILP

- a workload with no overlapping buffers compared against one where all buffers overlap (the latter won't occur in practice)
- to test search efficiency

- compare with other approaches in this paper

- heuristic baseline's running time (each model given 110% of its minimum required memory)
- faster than TelaMalloc, so the heuristic is still tried first

- compare with other strategies
- a run is counted as failed after 500,000 steps

- execution time of the ML approach
- the model runs in about 2 µs per candidate

- compare with the version without ML

- how the model makes decisions

- compare with best-fit in XLA on TPUv4
- achieves up to 7% better performance than best-fit

## 8. Discussion
- Workload Analysis
- visualized memory allocation pattern
- the first phase is similar to [Figure 1](#1-Introduction); TelaMalloc exploits domain-specific knowledge to solve it
- the later part is handled by contention-based grouping

- Generalization of Ideas
- extends to other areas
- search-space exploration problems such as hardware design space exploration and floor planning, which are expressed as ILP programs
- the learning approach can be extended beyond backtracking
- keep a single, shallow decision tree that executes at every step of the search and decides whether to run a more expensive model that considers different blocks, or a more expensive heuristic
## 9. Related Work

## 10. Conclusion
- a new method for solving the memory allocation problem on machine learning accelerators
- combines heuristics with a solver-based approach to explore a complex search space more efficiently
- up to two orders of magnitude speedup for several important models, enabling on-the-fly compilation of models that would otherwise have been impossible
- ships in the Pixel 6 and in the TPUv4 compiler