---
tags: PM, worklog
---

# 2021 Year-end Summary

<!--CSS styles header here-->
<style>
.date { color: #579978; }
.emph { color: tomato; }
</style>

# Table of Contents

[TOC]

# Papers

### Transformer

* ==How to train ViT?== (Google Brain + Research, 2021/06)
  [How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers](https://arxiv.org/abs/2106.10270)
    * Studies the effects of augmentation & regularization on ViT
    * Which dataset should we use to pretrain our ViT?
* ==Optimizing the performance–accuracy tradeoff : **LeViT**== (Meta AI 2021/04)
  [LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136)
    * **Based on DeiT**
    * Better inference speed on GPU & CPU, even on ARM mobile devices
    * Convolution as a *patch descriptor* to shrink the number of features
    * Uses attention as a downsampling mechanism
* ==Disjoint blocks + cross-block aggregation : **NesT**== (Google Cloud AI 2021/05)
  [Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding](https://arxiv.org/abs/2105.12723)
    * Tree-like hierarchical structure to build local attention
    * Uses max-pooling to aggregate blocks
* ==Using shifted windows to create multi-scale feature maps : **Swin**== (Microsoft AI 2021/03)
  [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
    * Swin is basically PvT with shifted windows
    * Shifted windows = bigger stride + cyclic sliding windows
    * SOTA in object detection, instance segmentation & semantic segmentation
* Make ViT more friendly : **Visformer**
    * [Visformer: The Vision-friendly Transformer](https://arxiv.org/abs/2104.12533)
* MBConv with attention : **CoAtNet**
    * [CoAtNet: Marrying Convolution and Attention for All Data Sizes](https://arxiv.org/abs/2106.04803)
* Reduce the amount of data required via distillation : **DeiT**
    * [Training data-efficient image transformers & distillation through
      attention](https://arxiv.org/abs/2012.12877)
* Convolution to replace projection : **CvT**
    * [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808)
* Change the attention formula to stack more layers : **DeepViT**
    * [DeepViT: Towards Deeper Vision Transformer](https://arxiv.org/abs/2103.11886)
* Dynamic residual to stack more layers : **CaiT**
    * [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239)
* Additive operation in attention : **Fastformer**
    * [Fastformer: Additive Attention Can Be All You Need](https://arxiv.org/abs/2108.09084)
* Multi-scale feature maps for dense prediction : **PvT**
    * [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122)
* Proof that attention can express convolution
    * [Can Vision Transformers Perform Convolution?](https://arxiv.org/abs/2111.01353)
      <!--MHSA with enough heads can express like convolution-->

Total Read : **13**
Read entire paper : **4**

### Detection & Segmentation

* ==Using a transformer as the detector : **DETR**== (Facebook AI 2020/05)
  [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872)
    * mAP & latency are not as good as Faster R-CNN, but it provides a simple way to deal with the object detection problem
    * Converts the detection problem into bipartite matching
* ==Cascade prediction heads + Faster R-CNN : **Cascade R-CNN**== (UC San Diego 2017/12)
  [Cascade R-CNN: Delving into High Quality Object Detection](https://arxiv.org/abs/1712.00726)
* [Hybrid Task Cascade for Instance Segmentation](https://arxiv.org/abs/1901.07518)
* [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)
* [Rich feature hierarchies for accurate object detection and semantic segmentation](https://arxiv.org/abs/1311.2524)
* [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
* *[Simple Copy-Paste is a Strong Data Augmentation Method for Instance
  Segmentation](https://openaccess.thecvf.com/content/CVPR2021/papers/GhiasiSimpleCopy-PasteIsaStrongDataAugmentationMethodforInstanceCVPR2021paper.pdf)*

Total Read : **7**
Read entire paper : **2**

### Meta Learning

* ==Use an RNN to optimize a gradient-descent-based network== (Google DeepMind 2016/06)
  [Learning to learn by gradient descent by gradient descent](https://arxiv.org/abs/1606.04474)
    * In the experiments, an LSTM is used (instead of a plain RNN) to reduce computational difficulty
    * Can replace optimizers like RMSprop or Adam
* ==A more generic neural network optimizer : **MAML**== (UC Berkeley 2017/03)
  [Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/abs/1703.03400)
    * Constructs an optimizer for different types of models (model-agnostic)
    * Conv + Linear instead of LSTM
* [Matching Networks for One Shot Learning](https://arxiv.org/abs/1606.04080)

Total Read : **3**
Read entire paper : **2**

### Others

* ==Introducing **re-parameterization** to VGG== (BNRist, MEGVII Technology 2021/01)
  [RepVGG: Making VGG-style ConvNets Great Again](https://arxiv.org/abs/2101.03697)
* Simple **patch embedding** on **separable conv**
    * [Patches Are All You Need?](https://openreview.net/pdf?id=TVHS5Y4dNvM)
* ==ResNet with sufficient pretraining achieves SOTA accuracy on ImageNet : **BiT**== (Google Research, Brain 2019/12)
  [Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370)
    * Uses JFT-300M for pretraining
    * The pretrained model can perform few-shot classification
    * The pretrained ResNet is good on general tasks like VTAB
* Uses 3 billion images to explore the limits of pretraining
    * [Exploring the Limits of Weakly Supervised Pretraining](https://arxiv.org/abs/1805.00932)
* Soft labels + mixing images improve generalization : **MixUp**
    * [mixup: Beyond Empirical Risk Minimization](https://arxiv.org/abs/1710.09412)
* A common automated augmentation in ImageNet training : **RandAug**
    * [RandAugment: Practical automated data augmentation
      with a reduced search space](https://arxiv.org/abs/1909.13719)
* <span class="emph">Applying recent training methods to ResNet : **Summary of Training Procedures**</span>
    * [ResNet strikes back: An improved training procedure in timm](https://arxiv.org/abs/2110.00476)
* MCTS (Monte Carlo Tree Search) + CNN in Go : AlphaGo
    * [Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961)
* MCTS + ResNet without supervision : AlphaGo Zero
    * [Mastering the game of Go without human knowledge](https://www.nature.com/articles/nature24270)
* Generic MCTS + ResNet for other board games : AlphaZero
    * [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)

Total Read : **10**
Read entire paper : **2**

# Programming

1. PyTorch & TensorFlow
    * load custom data
    * build models

# Review

Total Read : 33
Entire Read : 10

* Too few experiments: I spent too much time reading papers, so I didn't run any experiments on model structures or new ideas. Although I now have a broader view of the transformer & object detection fields, I still have little practical research experience this semester. After choosing a target paper, I should start applying adjustments to that model.
* Too few papers: My paper-reading speed improved a lot. However, while the number of papers I read looks large, spread over 6 months it comes to only about 5 a month, which is quite low.

# Schedule
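As a footnote to the MixUp entry under Others (soft labels + mixing images): a minimal sketch in plain Python with no framework dependencies. The flat-list image format and the `mixup` function name are illustrative, not from the paper's code; `alpha` is the Beta-distribution hyperparameter from the paper.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two (image, one-hot label) samples with lambda ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    # Convex combination of both the inputs and the labels,
    # so the target becomes a soft label between the two classes.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

In practice one lambda is drawn per pair (or per batch) and training proceeds with the usual cross-entropy against the mixed soft labels.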