# Sparse is Enough in Scaling Transformers
by Peng Xu
### What problem does it solve?
Leverage sparsity to make large Transformer models scale efficiently. Specifically, the goal is to perform inference faster than the standard Transformer as the model size scales up, while retaining empirical performance on real tasks.
### Why is this important?
The Transformer architecture has achieved huge success in natural language processing in recent years, and it has lately gained popularity in other fields as well. At the same time, Transformer models keep growing larger, and so do the costs such models incur. As a result, it is increasingly important to make them scale efficiently.
### The approach taken:
This paper addresses the problem by proposing *Scaling Transformers*, which use a separate sparse mechanism for the query, key, value and output layers (Sparse QKV for short) and combine it with sparse feedforward blocks (Sparse FF for short) to obtain a fully sparse Transformer architecture.
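
To make the Sparse FF idea concrete, below is a minimal NumPy sketch of the inference-time behavior, assuming a controller that activates one unit per block of the feedforward dimension so only a `1/block_size` fraction of the layer is computed. The function and variable names (`sparse_ff_inference`, `C`, `block_size`) are illustrative, not the paper's code, and the dense controller matrix here stands in for the low-rank controller described in the paper.

```python
import numpy as np

def sparse_ff_inference(x, W1, b1, W2, b2, C, block_size):
    """Sketch of a sparse feedforward block at inference time.

    x:  (d_model,)        activation for a single token
    W1: (d_model, d_ff)   first feedforward weight matrix
    b1: (d_ff,)           first bias
    W2: (d_ff, d_model)   second feedforward weight matrix
    b2: (d_model,)        second bias
    C:  (d_model, d_ff)   controller weights (illustrative; a dense
                          matrix is used only to keep the sketch short)
    """
    d_ff = W1.shape[1]
    n_blocks = d_ff // block_size

    # Controller scores, one row of logits per block of d_ff units.
    scores = (x @ C).reshape(n_blocks, block_size)
    # At inference, pick the argmax unit within each block.
    active = scores.argmax(axis=-1)                   # (n_blocks,)
    cols = np.arange(n_blocks) * block_size + active  # flat column indices

    # Compute only the selected columns of the first layer (ReLU),
    # then project back using only the matching rows of W2.
    h = np.maximum(x @ W1[:, cols] + b1[cols], 0.0)
    return h @ W2[cols, :] + b2
```

During training the hard argmax selection is not differentiable, so the paper trains the controller with a Gumbel-softmax straight-through estimator; the sketch above only shows the sparse computation at inference time.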
### Results:

Scaling Transformers decode substantially faster than an equivalent dense baseline as the model size scales up, and they also yield competitive results on challenging real-world tasks, such as summarizing arXiv articles, compared with state-of-the-art approaches.