# WACV 2023 Tutorial (9am-12pm, Jan 3, 2023)

## Advances in Design and Implementation of End-to-End Learned Image and Video Compression

### Speakers

<img src=https://i.imgur.com/foZiZJQ.png alt="drawing" width="150"/>

**Prof. Wen-Hsiao Peng, National Yang Ming Chiao Tung University, Taiwan** <br>

<img src=https://i.imgur.com/t2vpOMN.png alt="drawing" width="150"/>

**Prof. Heming Sun, Waseda University, Japan**

### Description

The arrival of deep learning has recently spurred a new wave of developments in end-to-end learned image and video compression. This fast-growing research area has attracted more than 100 publications in the literature, with state-of-the-art end-to-end learned image compression showing compression performance comparable to H.266/Versatile Video Coding (VVC) intra coding in terms of Peak Signal-to-Noise Ratio in RGB (PSNR-RGB) and much better Multi-Scale Structural Similarity (MS-SSIM) results. End-to-end learned video coding is also catching up quickly. Some preliminary studies report PSNR-RGB results comparable to H.265/High-Efficiency Video Coding (HEVC) or even H.266/VVC under the low-delay setting. These interesting results have led to intense activity in international standards organizations, e.g., JPEG AI, and various challenges, e.g., the Challenge on Learned Image Compression (CLIC) at the IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) and the Grand Challenge on Neural Network-based Video Coding at the IEEE International Symposium on Circuits and Systems (ISCAS).

There are, however, hidden aspects of end-to-end learned image and video compression that have not yet been given enough attention. For example, the excessive peak memory or memory bandwidth requirements of learned codecs are often ignored, while much effort is put into demonstrating their full compression potential.
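To make the peak-memory point concrete, the following back-of-the-envelope sketch estimates the fp32 feature-map footprint of a four-stage, stride-2 convolutional encoder. The channel counts and strides here are illustrative assumptions in the spirit of typical hyperprior-style encoders, not the parameters of any specific published codec.

```python
# Rough peak-memory estimate for the feature maps of a hypothetical
# 4-stage convolutional encoder (channel counts and strides are
# illustrative, not taken from any specific codec).

def feature_map_bytes(h, w, c, bytes_per_elem=4):
    """Memory needed to hold one h x w x c feature map in fp32."""
    return h * w * c * bytes_per_elem

def encoder_activation_memory(h, w, stages=((128, 2), (128, 2), (128, 2), (192, 2))):
    """Return per-stage feature-map sizes (bytes) and their maximum.

    Each stage halves the spatial resolution (stride 2) and produces
    `channels` output channels, mirroring a typical downsampling encoder.
    """
    sizes = []
    for channels, stride in stages:
        h, w = h // stride, w // stride
        sizes.append(feature_map_bytes(h, w, channels))
    return sizes, max(sizes)

sizes, peak = encoder_activation_memory(1080, 1920)
# For 1080p input, the first stage alone holds 540*960*128 fp32 values
# (~265 MB) -- far larger than the final latent, which is the kind of
# hidden cost that compression-performance numbers alone do not reveal.
```

Even this crude estimate ignores weights, gradients, and intermediate buffers, so real peak memory and bandwidth are higher still, which is precisely why these costs deserve attention alongside rate-distortion results.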
Moreover, learned codecs typically achieve variable-rate compression with separate network models, which is prohibitively expensive in real-world applications. Another prominent issue is cross-platform encoding and decoding with learned codecs, which calls for special care due to potentially inconsistent arithmetic precision. It is also noted that the domain shift between training and test data often leads to sub-optimal coding performance and/or poor generalization to individual test images/sequences.

This tutorial aims to provide an overview of recent advances in the design and implementation of end-to-end learned image and video compression, in an effort to invite more contributions from the computer vision community. In the first part of this tutorial, we shall (1) briefly summarize the progress of this topic in the past three or so years, including an overview of the JPEG AI Call-for-Proposals on Learning-based Image Coding (which was concluded in July 2022). In the second part, we shall (2) address the basics of learned image compression built upon variational autoencoders and/or flow models. Recent methods to accelerate the most time-consuming part (i.e., entropy coding) will be introduced. We will also analyze the complexity of the whole system and explore some recent low-complexity algorithms and architectures. In the third part, we shall switch gears to (3) explore learned video compression. In particular, we will cover an emerging school of thought that leverages conditional generative models for more efficient inter-frame coding, in addition to the traditional residual-based coding framework. This part will be concluded with a complexity analysis of several state-of-the-art methods in terms of algorithm-intrinsic metrics. In the last part, we provide an outlook on future developments.

### Tutorial Outline

#### Part I – Overview of End-to-End Learned Image and Video Compression by Prof. Wen-Hsiao Peng (20 minutes)

1. Introduction to end-to-end learned image and video compression
2. Rate-distortion performance of learned image/video compression
3. Recent developments in CLIC and JPEG AI Call-for-Proposals

#### [Learning outcomes]

At the end of this module, the attendees will be able to

* Tell the algorithmic differences between learned and traditional image/video coding methods.
* Tell the rationale and benefits of using neural networks for image/video compression.
* Indicate how state-of-the-art learned image/video coding methods perform as compared to traditional ones in terms of compression performance.
* Describe the activities in CLIC and JPEG AI.

#### Part II – End-to-End Learned Image Compression by Prof. Heming Sun (70 minutes)

4. Elements of end-to-end learned image compression
5. Review of a few notable systems
6. Review of fast and efficient entropy coding methods
7. Complexity analysis of learned image compression
8. Real-time implementation of learned image compression

#### [Learning outcomes]

At the end of this module, the attendees will be able to

* List the common elements in end-to-end learned image compression systems.
* Identify key prior works in this area.
* Express the trade-off between coding gain and complexity.
* Describe the implementation challenges and some possible solutions.

#### Part III – End-to-End Learned Video Compression by Prof. Wen-Hsiao Peng (60 minutes)

9. Elements of end-to-end learned video compression
10. Review of some notable residual-based learned video compression methods
11. Review of some notable conditional video compression methods
12. Complexity characterization of learned video compression
13. Low-complexity implementation of learned video compression

#### [Learning outcomes]

At the end of this module, the attendees will be able to

* Describe the conceptual differences between residual-based and conditional video compression methods.
* Identify key elements in these video compression methods.
* Indicate the design challenges from the perspective of complexity-performance trade-offs.

#### Part IV – Outlook for Future Developments by Prof. Wen-Hsiao Peng (10 minutes)

14. Open issues and concluding remarks

### Short Bios

#### Wen-Hsiao Peng

Computer Science Dept., National Yang Ming Chiao Tung University, Taiwan <br>
Email: wpeng@cs.nctu.edu.tw <br>
Web: https://sites.google.com/g2.nctu.edu.tw/wpeng

Dr. Wen-Hsiao Peng (M'09-SM'13) received his Ph.D. degree from National Chiao Tung University (NCTU), Taiwan, in 2005. He was with the Intel Microprocessor Research Laboratory, USA, from 2000 to 2001, where he was involved in the development of ISO/IEC MPEG-4 fine granularity scalability. Since 2003, he has actively participated in the ISO/IEC and ITU-T video coding standardization process and contributed to the development of the SVC, HEVC, and SCC standards. He was a Visiting Scholar with the IBM Thomas J. Watson Research Center, USA, from 2015 to 2016. He is currently a Professor with the Computer Science Department, National Yang Ming Chiao Tung University, Taiwan. He has authored over 75 journal/conference papers and over 60 ISO/IEC and ITU-T standards contributions. His research interests include learning-based video/image compression, deep/machine learning, multimedia analytics, and computer vision.

Dr. Peng was Chair of the IEEE Circuits and Systems Society (CASS) Visual Signal Processing and Communications (VSPC) Technical Committee from 2020 to 2022. He was Technical Program Co-chair for 2021 IEEE VCIP, 2011 IEEE VCIP, 2017 IEEE ISPACS, and 2018 APSIPA ASC; Publication Chair for 2019 IEEE ICIP; Area Chair/Session Chair/Tutorial Speaker/Special Session Organizer for IEEE ICME, IEEE VCIP, and APSIPA ASC; and Track/Session Chair and Review Committee Member for IEEE ISCAS. He served as Associate Editor-in-Chief for Digital Communications for IEEE JETCAS and Associate Editor for IEEE TCSVT. He was Lead Guest Editor, Guest Editor, and SEB Member for IEEE JETCAS, and Guest Editor for IEEE TCAS-II.
He was a Distinguished Lecturer of APSIPA and the IEEE CASS. Dr. Peng is also a Fellow of the Higher Education Academy (FHEA).

#### Heming Sun

Waseda Research Institute for Science and Engineering, Waseda University, Japan <br>
Email: hemingsun@aoni.waseda.jp <br>
Web: https://scholar.google.com/citations?user=LtkiCFcAAAAJ&hl=en

Dr. Heming Sun received the B.E. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 2011, and the M.E. degrees from Waseda University and Shanghai Jiao Tong University, in 2012 and 2014, respectively, through a double-degree program. In 2017, he earned his Ph.D. degree from Waseda University through the embodiment informatics program supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and he is currently a Junior Researcher there. He was a researcher at NEC Central Research Laboratories from 2017 to 2018. He was selected as a Japan Science and Technology Agency (JST) PRESTO Researcher for 2019 to 2023. His interests are in algorithms and VLSI architectures for image/video processing and neural networks.

He participated in the 8K HEVC decoder chip design that won the ISSCC 2016 Takuo Sugano Award for Outstanding Far-East Paper. He has also received several awards, including the Best Paper Award at VCIP 2020, a Top-10 Best Paper at PCS 2021, and the IEEE Computer Society Japan Chapter Young Author Award 2021. He has published over 80 peer-reviewed journal and conference papers (e.g., in TMM, JSSC, TCAS-I, ISSCC, CVPR, VCIP, and ISCAS). He held a special session on "Neural Network Technology in Future Image/Video Coding" at the Picture Coding Symposium (PCS) 2019 and co-organized the special session "Towards Practical Learning-based Image and Video Coding" at PCS 2022. He was invited by the Information Processing Society of Japan to give a talk on "Deep Learning Method for Image Compression."
He has also served as a reviewer for many flagship CAS Society journals, such as TCSVT, TCAS-I, and TCAS-II.