# HAT: ++H++ardware-++A++ware ++T++ransformers for Efficient Natural Language Processing
###### paper origin: ACL 2020
###### paper: [link](https://arxiv.org/pdf/2005.14187.pdf)
###### slides: [link](https://hanlab.mit.edu/projects/hat/assets/ACL20_HAT_HanruiWang.pdf)
###### github: [link](https://github.com/mit-han-lab/hardware-aware-transformers.git)
###### tags:`TinyML`
# 1. INTRODUCTION
- Motivation
To enable low-latency inference on resource-constrained hardware platforms, they propose to design Hardware-Aware Transformers (HAT) with neural architecture search.
There are two common pitfalls when evaluating the efficiency of a Transformer:
- 1. FLOPs does not reflect the measured latency.
- 2. Different hardware prefers different Transformer architectures.


So they propose to search for Hardware-Aware Transformers (HAT) by directly involving the latency feedback in the design loop. In this way, they do not use FLOPs as a latency proxy and can search specialized models for various hardware.
- Framework

1. Train a SuperTransformer that contains numerous sub-networks.
2. Evolutionary search with hardware latency feedback to find one **specialized** SubTransformer for each hardware platform.
- Contribution
1. Hardware-Aware and Specialization
2. Low-cost Neural Architecture Search with a Large Design Space.
- Arbitrary encoder-decoder attention to break the information bottleneck.
- Heterogeneous layers to let different layers alter their capacity.
- A weight-shared SuperTransformer is trained to search for efficient models at a low cost.
3. Design Insights
- Attending multiple encoder layers is beneficial for the decoder
- GPU prefers shallow and wide models while ARM CPU prefers deep and thin ones
# 2. APPROACHES
- Design Space
- Arbitrary Encoder-Decoder Attention

- To break the information bottleneck, each decoder layer can choose *multiple* encoder layers to attend to.
- The *key and value* vectors from the chosen encoder layers are concatenated along the *sentence length* dimension and fed to the encoder-decoder cross-attention module -> efficient, and the latency overhead is negligible (a minimal sketch follows the design comparison table below).
- Heterogeneous Transformer Layers
- Different layers are *heterogeneous* -> Different numbers of heads, hidden dim, and embedding dim
- *Elastic* number of attention heads, because many heads are redundant (Voita et al. (2019)); the embedding dims of the encoder and decoder are elastic as well.
- In the FFN layer, the input features are cast to a higher dimension (hidden dim), followed by an activation layer. Traditionally, the hidden dim is set as 2x or 4x of the embedding dim, but this is not optimal because different layers need different capacities depending on the feature extraction difficulty. -> Make the hidden dim elastic.
- Make the number of encoder and decoder layers elastic to learn the proper level of feature encoding and decoding.
- There are some other design choices left for future work
- Traditional Transformer Design v.s. HAT Design
| Traditional Transformer Design | HAT Design |
| -------- | -------- |
| All decoder layers only attend to the last encoder layer -> Information bottleneck | Arbitrary Encoder-Decoder Attention|
| All layers are identical| Heterogeneous layers|
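To make the key/value concatenation concrete, here is a minimal PyTorch sketch of the arbitrary encoder-decoder attention idea; the layer choice, tensor shapes, and module names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: a decoder layer attends to several encoder layers at once by
# concatenating their outputs along the sentence-length dimension before the
# cross-attention. Shapes and the three-layer setup are illustrative only.
import torch
import torch.nn as nn

embed_dim, num_heads, src_len, tgt_len, batch = 512, 8, 20, 15, 2

# Pretend these are the outputs of the last three encoder layers
# (each of shape [src_len, batch, embed_dim], fairseq-style layout).
encoder_layer_outputs = [torch.randn(src_len, batch, embed_dim) for _ in range(3)]

# A SubTransformer may choose to attend to, e.g., the last two encoder layers.
attended = encoder_layer_outputs[-2:]

# Concatenate along the sentence-length dimension: keys/values simply get
# longer, so the cross-attention module itself is unchanged.
memory = torch.cat(attended, dim=0)             # [2 * src_len, batch, embed_dim]

cross_attn = nn.MultiheadAttention(embed_dim, num_heads)
decoder_states = torch.randn(tgt_len, batch, embed_dim)

out, _ = cross_attn(query=decoder_states, key=memory, value=memory)
print(out.shape)                                # torch.Size([15, 2, 512])
```

Because only the key/value length grows, the cross-attention module is untouched, which is why the latency overhead stays small.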
- Steps

1. Train a weight-shared SuperTransformer.

- The SuperTransformer is the *largest model* in the search space with *weight sharing*
- Every model in the search space is a part of the SuperTransformer.
- All SubTransformers *share* the weights of their common parts.
- Elastic layer numbers let all SubTransformers share the first several layers.
- In the SuperTransformer training, all possible SubTransformers are *uniformly sampled*, and the corresponding weights are updated. In practice, the SuperTransformer only needs to be trained for the same steps as a baseline Transformer model, which is fast and low-cost.
- After training, we can get a performance proxy for sampled models in the design space by evaluating the corresponding SubTransformers on the validation set *without training*.
2. Collect data pairs on the target hardware.
- Including SubTransformer architecture, measured latency
- They test the latency of the models by measuring translation from a source sentence to a target sentence with the same length.
3. Train a latency predictor for each hardware.
- A dataset of 2000 (SubTransformer architecture, measured latency) samples for each hardware is collected and split into train:valid:test = 8:1:1. They normalize the features and latency, and train a three-layer MLP with 400 hidden dim and ReLU activation. They choose three layers because this is more accurate than a one-layer model, and more than three layers do not improve accuracy further (a combined sketch of Steps 3 and 4 follows this list).
- There are two ways to evaluate the hardware latency
1. Online - Measure the models during the search process. A single sampled SubTransformer requires hundreds of inferences to get an accurate latency measurement, which takes minutes and slows down the search.
2. Offline - Train a *latency predictor* to provide the latency. Encode the architecture of a SubTransformer into a feature vector, and predict its latency instantly with an MLP. Trained with thousands of *real* latency data points, the predictor yields high accuracy.

**They apply the offline method here because it is *fast and accurate***
4. Evolutionary search with a hardware latency constraint to find a SubTransformer.
- With the predictor, they conduct an evolutionary search for 30 iterations in the SuperTransformer, with population 125, parents population 25, mutation population 50 with 0.3 probability and crossover population 50.
- The search engine queries the latency predictor for the SubTransformer latency, and evaluates its loss on the validation set.
- Evolutionary Search v.s. NAS
| Evolutionary Search | NAS |
|- |- |
| Pay the SuperTransformer training cost only *once* | Pay the training cost for every search |
| Can evaluate *all* the models in the design space with it| |
5. Train the SubTransformer from scratch
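As a combined sketch of Steps 3 and 4, the snippet below pairs a three-layer latency-predictor MLP (400 hidden units, ReLU) with a simple evolutionary loop (population 125, 25 parents, 50 mutations with probability 0.3, 50 crossovers, 30 iterations) under a latency constraint. The architecture encoding, the toy fitness function, and the fallback when no candidate meets the constraint are placeholders for illustration, not the authors' exact implementation.

```python
# Sketch of the offline latency predictor plus evolutionary search.
import random
import torch
import torch.nn as nn

FEAT_DIM = 6  # assumed size of the flattened SubTransformer feature vector

class LatencyPredictor(nn.Module):
    """Regresses normalized latency from an architecture feature vector."""
    def __init__(self, in_dim=FEAT_DIM, hidden_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

predictor = LatencyPredictor()  # in practice: trained on ~2000 measured pairs

def sample_arch():
    """Uniformly sample a (toy) architecture feature vector from the design space."""
    return torch.rand(FEAT_DIM)

def validation_loss(arch):
    """Placeholder for evaluating a SubTransformer on the validation set
    with weights inherited from the SuperTransformer (no training)."""
    return float(arch.sum())  # stand-in fitness; lower is better

def mutate(arch, prob=0.3):
    mask = torch.rand(FEAT_DIM) < prob
    return torch.where(mask, torch.rand(FEAT_DIM), arch)

def crossover(a, b):
    mask = torch.rand(FEAT_DIM) < 0.5
    return torch.where(mask, a, b)

def evolutionary_search(latency_limit, iters=30, pop=125, parents=25,
                        mut_pop=50, cross_pop=50):
    population = [sample_arch() for _ in range(pop)]
    for _ in range(iters):
        # Keep only candidates that satisfy the hardware latency constraint.
        feasible = [a for a in population
                    if predictor(a.unsqueeze(0)).item() < latency_limit]
        feasible.sort(key=validation_loss)
        top = feasible[:parents] or [sample_arch()]
        population = (top
                      + [mutate(random.choice(top)) for _ in range(mut_pop)]
                      + [crossover(random.choice(top), random.choice(top))
                         for _ in range(cross_pop)])
    return min(population, key=validation_loss)

best = evolutionary_search(latency_limit=1.0)
```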
# 3. RESULTS
- Baselines
- Transformer
- Levenshtein transformer
- Evolved transformer
- Lite Transformer
- HAT Performance Comparisons


- HAT is smaller and faster than other baseline models

- HAT has lower model size, latency, and cloud computing cost

- HAT has the lowest latency with the highest BLEU
- Analysis
- Design Insights

- The largest model may not be the best.
- Ablation Study

- Evolutionary search can find models with lower losses than random search
- SubTransformer Performance Proxy

- All SubTransformers inside the SuperTransformer are *uniformly sampled* and thus *equally trained*, so the performance order is well-preserved during training.
- Low Search Cost
- Evolved Transformer v.s. HAT
| Evolved Transformer | HAT |
|- |- |
| Train all individual models and sort their final performance to pick top ones | Train all models together inside the SuperTransformer and sort their performance proxies to pick top ones |
- Finetuning Inherited SubTransformers

- Directly finetune the SubTransformers with inherited weights from the SuperTransformer to further reduce the training cost.
- The training cost for a model under a new hardware constraint can be further reduced by 4X.
- Quantization Friendly

- They apply K-means quantization to HAT.
- Knowledge Distillation Friendly
- HAT and Knowledge Distillation (KD)
| KD | HAT |
| - | - |
| Focus on better training a given architecture | Focus on searching for an efficient architecture |
- They combine KD with HAT by distilling token-level knowledge (top-5 soft labels) from a high-performance SubTransformer to a low-performance SubTransformer on the WMT'14 En-De task (a sketch follows the table below).
| | Teacher model | Student model |
| - | - | - |
| parameters | 49M | 30M |
| BLEU | 28.5 | 25.8 -> 26.1(KD)|
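A minimal sketch of the token-level distillation above, assuming teacher and student logits over the same vocabulary; restricting the softmax to the teacher's top-5 tokens and the temperature are illustrative choices rather than the paper's exact recipe.

```python
# Token-level KD with top-k soft labels: for each target token the student
# matches the teacher's distribution over the teacher's top-5 candidates.
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_logits, k=5, T=1.0):
    """
    student_logits, teacher_logits: [batch, seq_len, vocab_size]
    Returns a scalar KD loss computed only over the teacher's top-k tokens.
    """
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)      # [B, L, k]
    teacher_probs = F.softmax(topk_vals / T, dim=-1)           # renormalized over top-k
    student_topk = student_logits.gather(-1, topk_idx)         # student scores on the same tokens
    student_logp = F.log_softmax(student_topk / T, dim=-1)
    return -(teacher_probs * student_logp).sum(-1).mean()

# Usage with dummy logits; in practice these come from the teacher (49M) and
# student (30M) SubTransformers on the same batch.
B, L, V = 2, 7, 1000
loss = topk_distillation_loss(torch.randn(B, L, V), torch.randn(B, L, V))
print(loss.item())
```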
# 4. CONCLUSION
- It is critical to have a large design space in order to find high-performance models.
- HAT can also be combined with other compression methods, such as quantization and knowledge distillation, to further reduce model size.