# Apache MXNet Performance Evolution

*Authors: Bartłomiej Gawrych, Bartosz Kuncer*

## Introduction

Improving hardware performance by writing better and better software is a common strategy nowadays. It is not surprising that it has been widely adopted in deep learning, as performance is usually one of the key indicators of whether models and frameworks are useful. This article is about the software improvements introduced to the Apache MXNet framework between version 1.8.0 and the not-yet-released version 2.0. The benchmark results show that, with good knowledge of the model and the hardware, well-written software can do wonders for performance.

One of the most demanding deep learning problems from a performance standpoint (and also a very popular one) is natural language processing. NLP models are crucial for many top-notch online services. To deliver results to customers promptly, the server must be able to handle a large number of requests in real time. As demand rises, there are two common ways to increase server capabilities. The first is to add more components or swap the currently used ones for more powerful counterparts, which can be very expensive. The other is to improve hardware utilization and overall model throughput with well-designed software. This article focuses on the latter. The performance analysis has been performed for three common NLP models: BERT Base, BERT Large and GPT-2. All of them require enormous computational power, as they rely on costly FullyConnected and matrix multiplication operations.

## Apache MXNet and oneDNN

MXNet is one of the most popular deep learning frameworks. Since the release of version 0.8 in 2016, there have been many further versions of MXNet, improving the framework both by broadening its capabilities and by boosting the performance of already existing functions. In version 1.2, oneDNN (previously known as MKL-DNN) was introduced to MXNet [[1]](#References). Since version 1.7, the oneDNN library has been enabled in MXNet by default [[2]](#References). Both MXNet and oneDNN have undergone many performance-improving changes in recent years. In addition to the switch to a NumPy-like API in MXNet v2.0 [[3]](#References), the end user gets all the benefits of oneDNN support.

## CPU Performance in NLP models

As mentioned before, NLP models are among the most demanding workloads nowadays, which makes them a good candidate for measuring the performance of a given system configuration. In MXNet, with the help of other packages from its ecosystem, such as GluonNLP, users can easily reuse the models available there for their own purposes, e.g. benchmarking. Below is a performance comparison of three NLP models (BERT Base, BERT Large and GPT-2), which are available in GluonNLP for both the old and the new MXNet API, between versions 1.8, 1.9.1 and 2.0 of MXNet:

| ![float-results.png](https://i.imgur.com/B0yXCEz.png) |
|:--:|
| *Fig. 1. Relative performance results of models with full precision in different versions of MXNet* |

The chart above shows the relative speedup across consecutive versions of MXNet, with version 1.8 used as the baseline. It is clearly visible that the MXNet version in use has a significant impact on the achieved throughput. In every examined case, MXNet v2.0 delivers the best performance. Using MXNet v2.0 can bring more than two times shorter execution time than v1.8 on the same hardware configuration, which is a huge improvement achieved purely through software optimization and high performance computing techniques.
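The numbers in Fig. 1 were obtained with the GluonNLP question-answering and text-generation scripts listed in the benchmark environment section at the end of this article. For readers who want to reproduce a comparable measurement on their own hardware, the sketch below shows only the general timing methodology (warm-up, synchronization of MXNet's asynchronous engine, samples per second). It uses a toy two-layer stand-in model and the MXNet 1.x Gluon API, so treat it as an illustration of the approach rather than the actual benchmark code:

```python
import time
import mxnet as mx
from mxnet.gluon import nn

# Toy stand-in for a transformer block: two FullyConnected layers,
# the operator family that dominates BERT/GPT-2 inference cost.
net = nn.HybridSequential()
net.add(nn.Dense(3072, activation='relu'), nn.Dense(768))
net.initialize()
net.hybridize(static_alloc=True, static_shape=True)

batch_size, seq_len, hidden = 32, 384, 768
data = mx.nd.random.uniform(shape=(batch_size * seq_len, hidden))

# Warm-up iterations let oneDNN build its JIT kernels and fill caches.
for _ in range(10):
    net(data).wait_to_read()

n_iters = 100
start = time.time()
for _ in range(n_iters):
    net(data)
mx.nd.waitall()  # MXNet executes asynchronously; synchronize before stopping the timer
elapsed = time.time() - start

print(f"throughput: {n_iters * batch_size / elapsed:.1f} samples/s")
```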
One of the common software optimizations used in deep learning frameworks is quantization - the process of reducing the precision of computation by converting a wide data type (e.g. float32) into a narrower one (e.g. int8). A quantization API is already available in MXNet 1.x, but in MXNet v2.0 it was refactored to be as easy to use as possible. In addition, more quantized operators and more features of the newer oneDNN version are now available, allowing even better performance in the newest version of the framework with only a slight accuracy loss (0.5-1.5 percentage points vs. the float model). More about using quantization can be found in another article on Medium [[4]](#References).

| ![](https://i.imgur.com/iqQIgVH.png) |
|:--:|
| *Fig. 2. Relative performance results of quantized models in MXNet v1.9.1 and MXNet v2.0* |

The chart shows that the MXNet v1.x ecosystem is inefficient for quantized BERT-family models with a small batch size. In the new MXNet, the performance of a quantized model with a small batch size can be expected to be ~2-3.5x better! Even for a larger batch size (32 in this case), the newer quantized model shows almost 2x higher throughput.

## Optimization examples in MXNet v2.0

The following paragraphs show examples of optimizations implemented to increase the inference speed of demanding workloads. They give a rough overview of how the results presented in the charts above were achieved and build an understanding of how much software optimizations can affect performance.

### Fuses

Fusing is a branch of deep learning software optimization where common sequences of operations are executed as a single operator. The main benefit of fusing patterns is the reduction of IO operations (reads/writes to memory), which is often one of the biggest bottlenecks on the way to good performance.

#### Self-Attention

The self-attention mechanism is behind the success of most NLP models, but due to its complexity it is also the main contributor to their high computational cost. In the MXNet implementation, the query, key, and value tensors are created using a single FullyConnected layer, whose output is then divided into the three tensors mentioned above using a split operation. These tensors are then transformed accordingly using reshape and transpose operations, to finally perform the multiplication from the formula:

$$ Attention = {Softmax({QK^T \over \sqrt{d_k}})V} $$

Thanks to the oneDNN library, most of the operations that prepare the tensors for multiplication could be omitted. This was made possible by constructing an appropriate oneDNN data descriptor that includes information on the tensor's dimensions and strides. This descriptor, once passed to the library, allows it to create the proper kernel for the multiplication.

| ![](https://i.imgur.com/7Qw80eH.png) |
|:--:|
| *Fig. 3. Self-attention fused in MXNet v2.0* |

#### GeLU activation

There are many different activation functions used in neural networks - the most common one in convolutional neural networks is ReLU, but in NLP models it is the GeLU function. There are two approximations of GeLU:

- using the ERF function,
- using the hyperbolic tangent (tanh) function.

While the first one was implemented a long time ago, the second one was added only in MXNet v2.0. Before, this activation had to be composed from a set of separate operators, as shown in the picture (Fig. 4.) and sketched below.
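For illustration, here is a minimal sketch of the unfused variant: the standard tanh approximation of GeLU written as a chain of elementary MXNet operators. The formula is the well-known approximation; the code is an illustration of the pattern, not the exact operator graph from Fig. 4.

```python
import math
import mxnet as mx

def gelu_tanh_unfused(x):
    """GeLU via the tanh approximation, built from elementary operators.

    Every intermediate result below is a separate tensor that has to be
    written to and read back from memory - exactly the IO overhead that
    the fused GeLU operator in MXNet v2.0 avoids.
    """
    c = math.sqrt(2.0 / math.pi)
    inner = c * (x + 0.044715 * (x ** 3))        # power, mul, add, mul
    return 0.5 * x * (1.0 + mx.nd.tanh(inner))   # tanh, add, mul, mul

x = mx.nd.random.uniform(-3, 3, shape=(8, 768))
y = gelu_tanh_unfused(x)
```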
This optimization reduced execution time by getting rid of many IO operations: each operator needs to read its input tensor from cache/RAM and then write the results back to an output tensor, while in the optimized approach there is only one read and one write, and all other operations can be executed on values stored in CPU registers.

| ![](https://i.imgur.com/BQxJeZv.png) |
|:--:|
| *Fig. 4. GeLU activation with tanh approximation - implemented by hand vs. single operation* |

#### Cast removal

During examination of the models, it was found that in some places a tensor was being cast to the same data type before and after operations which did not change its type. As these casts were clearly redundant, a simple pass has been introduced to MXNet that looks for such duplicated casts and replaces them with a single operation. This fuse mostly benefited the BERT family of models.

### Operators

Adding new functionality to the framework is also, in a way, an optimization. It can introduce a completely new approach to solving complex problems, usually in a much simpler and faster way.

#### MaskedSoftmax

In MXNet v2.0, MaskedSoftmax has replaced the softmax with the length parameter used in the v1.x versions. It was impossible to rewrite softmax with the length parameter using oneDNN, thus its performance was much worse than that of a normal softmax. Using MaskedSoftmax in the new Gluon implementation of NLP models solved this issue, as it is also implemented with oneDNN kernels, which utilize vectorization and the available ISA (e.g. AVX-512) very efficiently. According to PR-20853 [[5]](#References), using MaskedSoftmax implemented with oneDNN can bring ~60% more samples per second in the BERT Base model.

### Overall improvements

Performance bottlenecks can also occur in the preparation time before executing a specific operation. This can be hard to track down, as many factors often influence the performance of an operation and each should be taken into account individually to get the best possible performance in any combination.

#### FullyConnected weights caching

NLP models do not have a predefined sequence length and work with dynamically changing input shapes, which makes it hard to identify whether a specific layer has already been prepared and used before and thus whether its descriptor can be reused. With every shape change, all preparations have to be repeated, which in some cases can be very expensive - especially for the FullyConnected operator in the oneDNN implementation, where weights have to be reordered to a special format (to fully utilize vectorization and the underlying architecture). For this reason, when the oneDNN backend is used, a weight caching mechanism was added which significantly reduces preparation time, as it stores the weights in the destination format under a key that can be easily identified for individual layers. It can be turned off by setting the environment flag MXNET_ONEDNN_DISABLE_FC_CACHE.

#### Threshold for oneDNN

After thorough examination, it turned out that creating oneDNN's Just-In-Time kernels for operations on small tensors is counterproductive from a performance point of view, and it is faster to use an already compiled solution. Therefore, tensor size thresholds have been introduced to operators like slice and add (see the sketch below). With this simple change, the performance of some of the operations used in the GPT-2 model improved by ~20-60%.
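The idea behind the change can be summarized as a threshold-based dispatch. The sketch below is a hypothetical Python illustration only: the threshold value, the function names, and both execution paths are stand-ins, while the real logic lives in MXNet's operator code.

```python
import numpy as np

# Hypothetical size threshold - the real values in MXNet are tuned per operator.
ONEDNN_MIN_TENSOR_SIZE = 4096

def fallback_add(a, b):
    # Stand-in for the already compiled, generic kernel.
    return np.add(a, b)

def onednn_add(a, b):
    # Stand-in for the JIT-compiled, ISA-specific oneDNN primitive.
    return np.add(a, b)

def add(a, b):
    """Threshold-based dispatch, mimicking the idea behind the MXNet change.

    For tiny tensors the cost of creating (or even looking up) a oneDNN
    JIT kernel outweighs the computation itself, so a precompiled fallback
    is used; large tensors still go to the vectorized oneDNN path.
    """
    if a.size < ONEDNN_MIN_TENSOR_SIZE:
        return fallback_add(a, b)
    return onednn_add(a, b)
```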
## Conclusion

The key to fully utilizing hardware capabilities is to write optimal software. This article has shown how much the performance of deep learning workloads has improved in MXNet on given hardware thanks to proper software optimization. Software has once again proven to be a powerful tool in the performance-boosting toolbox, as it allowed even more than 3 times better performance on the same hardware. The examples mentioned in this article are not the only optimizations in MXNet v2.0, but they were described as a case study for users who can make use of this knowledge in their own work.

### Benchmark environment

- Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz (Amazon EC2 c6i.12xlarge instance)
- MXNet versions:
  - [v1.8.0](https://pypi.org/project/mxnet/1.8.0.post0/)
  - [v1.9.1](https://pypi.org/project/mxnet/1.9.1/)
  - [v2.0](https://repo.mxnet.io/dist/python/cpu/mxnet-2.0.0b20220810-py3-none-manylinux2014_x86_64.whl)
- GluonNLP versions:
  - v0.10.x - commit sha: b4d7c0ffe358cdb52ad34d902f1cb2d43fb5a1c0
  - master - commit sha: TODO: (waiting for merge...)
- Environment configuration:
  - OMP_NUM_THREADS=24
  - OMP_PROC_BIND=TRUE
  - OMP_PLACES=sockets
- BERT config for MXNet v1.x [[6]](#References) and for MXNet 2.0 [[7]](#References):
  - maximum sequence length: 384
  - round samples to: 32
- GPT-2 (124M) config for MXNet v1.x [[8]](#References) and for MXNet 2.0 [[9]](#References):
  - sequence length: 128
  - topk: 400

### References

[1] [Apache* MXNet* v1.2.0 optimized with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)](https://www.intel.com/content/www/us/en/developer/articles/technical/apache-mxnet-v120-released-with-intel-optimized-cpu-backend.html)

[2] [Getting Started with Intel® Optimization for MXNet*](https://www.intel.com/content/www/us/en/developer/articles/guide/getting-started-with-intel-optimization-for-mxnet.html)

[3] [A New NumPy Interface for Apache MXNet (Incubating)](https://medium.com/apache-mxnet/a-new-numpy-interface-for-apache-mxnet-incubating-dbb4a4096f9f)

[4] [Optimizing inference on CPU in the Upcoming Apache MXNet 2.0](https://medium.com/apache-mxnet/optimizing-inference-on-cpu-in-mxnet-2-0-1852ff9729b4)

[5] [Improve MaskedSoftmax by oneDNN](https://github.com/apache/incubator-mxnet/pull/20853)

[6] [Question Answering - MXNet v1.x](https://github.com/dmlc/gluon-nlp/tree/v0.10.x/scripts/bert)

[7] [Question Answering - MXNet v2.0](https://github.com/dmlc/gluon-nlp/tree/master/scripts/question_answering)

[8] [Text Generation - MXNet v1.x](https://github.com/dmlc/gluon-nlp/tree/v0.10.x/scripts/text_generation)

[9] [Text Generation - MXNet v2.0](https://github.com/dmlc/gluon-nlp/tree/master/scripts/generation)

### Notices and Disclaimers

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.