---
tags: PM, worklog
---
# 2021 Year-end Summary
<!--CSS styles header here-->
<style>
.date {
color: #579978;
}
.emph {
color: tomato;
}
</style>
# Table of Contents
[TOC]
# Papers
### Transformer
* ==How to train ViT?==
(Google Brain + Research, 2021/06)
[How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers](https://arxiv.org/abs/2106.10270)
* Studies the effects of augmentation & regularization on ViT
* Which dataset should we use to pretrain ViT?
* ==Optimizing the speed–accuracy tradeoff : **LeViT**==
(Meta AI 2021/03)
[LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference](https://arxiv.org/abs/2104.01136)
* **Based on DeiT**
* Achieves better inference speed on GPU & CPU, even on ARM mobile devices
* Convolution as *patch descriptor* to shrink the number of features
* Use attention as a downsampling mechanism
* ==Disjoint blocks + cross-block aggregation : **NesT**==
(Google Cloud AI 2021/05)
[Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding](https://arxiv.org/abs/2105.12723)
* Tree-like hierarchical structure to build local attention
* Uses max-pooling to aggregate blocks
* ==Using shifted windows to create multi-scale feature maps : **Swin**==
(Microsoft AI 2021/03)
[Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030)
* Swin is basically PVT with shifted windows
* Shifted windows = bigger stride + cyclic sliding windows
* SOTA in Object detection, instance segmentation & semantic segmentation
* Makes ViT more vision-friendly : **Visformer**
* [Visformer: The Vision-friendly Transformer](https://arxiv.org/abs/2104.12533)
* MBConv with attention : **CoAtNet**
* [CoAtNet: Marrying Convolution and Attention for All Data Sizes](https://arxiv.org/abs/2106.04803)
* Reduces data requirements via distillation : **DeiT**
* [Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)
* Convolutions replace linear projections : **CvT**
* [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808)
* Modified attention formula (Re-attention) to stack more layers : **DeepViT**
* [DeepViT: Towards Deeper Vision Transformer](https://arxiv.org/abs/2103.11886)
* Learnable residual scaling (LayerScale) to stack more layers : **CaiT**
* [Going deeper with Image Transformers](https://arxiv.org/abs/2103.17239)
* Additive operation in attention : **Fastformer**
* [Fastformer: Additive Attention Can Be All You Need](https://arxiv.org/abs/2108.09084)
* Multi-scale feature maps for dense prediction : **PVT**
* [Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions](https://arxiv.org/abs/2102.12122)
* Proof that attention can express convolutions
* [Can Vision Transformers Perform Convolution?](https://arxiv.org/abs/2111.01353) <!--MHSA with enough heads can express like convolution-->
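The "cyclic sliding windows" trick noted under Swin can be sketched in a few lines: instead of padding an offset window grid, the feature map is cyclically rolled so the same cheap non-overlapping partition yields shifted windows. A minimal NumPy sketch (the 8×8 map and window size 4 are made-up toy values):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping (ws, ws, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

H = W = 8
ws, shift = 4, 2  # window size and half-window shift (toy values)
x = np.arange(H * W, dtype=float).reshape(H, W, 1)

regular = window_partition(x, ws)
# Cyclic shift: roll the map, then reuse the same partition at no extra cost.
shifted = window_partition(np.roll(x, (-shift, -shift), axis=(0, 1)), ws)

print(regular.shape, shifted.shape)  # (4, 4, 4, 1) (4, 4, 4, 1)
```

The rolled partition groups different pixels than the regular one, which is what lets information flow across window boundaries between consecutive layers.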
Total Read : **13**
Read entire paper: **4**
### Detection & Segmentation
* ==Using transformer as detector==
(Facebook AI 2020/05)
[End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872)
* mAP & latency are not as good as Faster R-CNN, but it provides a simple end-to-end way to deal with the object detection problem
* Converts detection into a bipartite matching problem
* ==Cascaded prediction heads + Faster R-CNN : **Cascade R-CNN**==
(UC San Diego 2017/12)
[Cascade R-CNN: Delving into High Quality Object Detection](https://arxiv.org/abs/1712.00726)
* [Hybrid Task Cascade for Instance Segmentation](https://arxiv.org/abs/1901.07518)
* [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640)
* [Rich feature hierarchies for accurate object detection and semantic segmentation](https://arxiv.org/abs/1311.2524)
* [Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks](https://arxiv.org/abs/1506.01497)
* [Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation](https://openaccess.thecvf.com/content/CVPR2021/papers/GhiasiSimpleCopy-PasteIsaStrongDataAugmentationMethodforInstanceCVPR2021paper.pdf)
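DETR's bipartite pairing mentioned above reduces to the Hungarian algorithm over a prediction-vs-ground-truth cost matrix. A minimal sketch using SciPy's `linear_sum_assignment`; the cost values are made up (in DETR the cost combines class probability with L1 + GIoU box terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy costs: rows = 4 predicted boxes, cols = 2 ground-truth boxes (made up).
cost = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
])
pred_idx, gt_idx = linear_sum_assignment(cost)  # minimizes total matching cost
# Each ground-truth box gets exactly one prediction; the leftover
# predictions are trained toward the "no object" class.
print(sorted(zip(pred_idx.tolist(), gt_idx.tolist())))  # [(0, 1), (1, 0)]
```

Because the matching is one-to-one, DETR needs no NMS or anchor heuristics, which is where the "simple way to deal with object detection" claim comes from.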
Total Read : **7**
Read entire paper: **2**
### Meta Learning
* ==Use RNN for optimizing gradient descent based network==
(Google DeepMind 2016/06)
[Learning to learn by gradient descent by gradient descent](https://arxiv.org/abs/1606.04474)
* In experiments, an LSTM is used as the learned optimizer (instead of a vanilla RNN) to reduce computational difficulty
* Can replace optimizers like RMSprop or Adam
* ==A more generic neural network optimizer : **MAML**==
(Stanford University 2017/05)
[Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https://arxiv.org/abs/1703.03400)
* Constructs a meta-learning procedure that works for any gradient-trained model (model-agnostic)
* Conv+Linear instead of LSTM
* [Matching Networks for One Shot Learning](https://arxiv.org/abs/1606.04080)
Total Read : **3**
Read entire paper: **2**
### Others
* ==Introducing **re-parameterization** to VGG==
(BNRist, MEGVII Technology 2021/01)
[RepVGG: Making VGG-style ConvNets Great Again](https://arxiv.org/abs/2101.03697)
* Simple **patch embedding** on top of **separable convolutions**
* [Patches Are All You Need?](https://openreview.net/pdf?id=TVHS5Y4dNvM)
* ==ResNet with sufficient pretraining achieves SOTA accuracy on ImageNet : **BiT**==
(Google Research, Brain 2019/12)
[Big Transfer (BiT): General Visual Representation Learning](https://arxiv.org/abs/1912.11370)
* Using JFT-300M to pretrain
* Pretrained model can perform few-shot classification
* Pretrained ResNet is good on general benchmarks like VTAB
* Uses billions of weakly-labeled images to explore the limits of pretraining
* [Exploring the Limits of Weakly Supervised Pretraining](https://arxiv.org/abs/1805.00932)
* Soft labels + mixed images improve generalization : **MixUp**
* [mixup: Beyond Empirical Risk Minimization](https://arxiv.org/abs/1710.09412)
* A common automated augmentation in ImageNet training: **RandAug**
* [RandAugment: Practical automated data augmentation with a reduced search space](https://arxiv.org/abs/1909.13719)
* <span class="emph">Applying recent training methods to ResNet : **Summary of Training Procedures**</span>
* [ResNet strikes back: An improved training procedure in timm](https://arxiv.org/abs/2110.00476)
* MCTS (Monte Carlo Tree Search) + CNN for Go : **AlphaGo**
* [Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961)
* MCTS + ResNet without supervised learning : **AlphaGo Zero**
* [Mastering the game of Go without human knowledge](https://www.nature.com/articles/nature24270)
* Generic MCTS + ResNet for other board games : **AlphaZero**
* [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)
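The MixUp recipe in the list above is two lines of arithmetic: sample λ from Beta(α, α), then blend both images and their one-hot labels with it. A toy NumPy sketch (the 4×4 "images" and α = 0.2 are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two (image, one-hot label) pairs with a Beta-sampled weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x1, x2 = np.ones((4, 4)), np.zeros((4, 4))           # toy "images"
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
x_mix, y_mix = mixup(x1, y1, x2, y2)
print(float(y_mix.sum()))  # mixed label still sums to ~1: a soft label
```

Training on these convex combinations is what the "soft label + mixing images" note refers to; with small α, λ is usually near 0 or 1, so most mixed samples stay close to one of the originals.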
Total Read : **10**
Read entire paper: **2**
# Programming
1. PyTorch & TensorFlow
* load custom data
* build model
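For the "load custom data" item, the PyTorch recipe boils down to a map-style dataset: any class implementing `__len__` and `__getitem__`, which `torch.utils.data.DataLoader` can then batch. A framework-free sketch of that protocol (the toy samples and class name are made up):

```python
class ToyDataset:
    """Minimal map-style dataset: the protocol torch.utils.data.Dataset expects."""

    def __init__(self, samples, labels):
        assert len(samples) == len(labels)
        self.samples, self.labels = samples, labels

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # A real dataset would load and transform here (read image, augment, ...)
        return self.samples[idx], self.labels[idx]

ds = ToyDataset([[0.1, 0.2], [0.3, 0.4]], [0, 1])
print(len(ds), ds[1])  # 2 ([0.3, 0.4], 1)
```

In actual PyTorch code the class would subclass `torch.utils.data.Dataset` and return tensors; the two-method shape is identical.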
# Review
Total Read : **33**
Read entire paper: **10**
* Too few experiments:
I spent too much time reading papers, so I didn't run any experiments on model structures or new ideas. Although I now have a broader view of the transformer & object detection fields, I still gained little practical research experience this semester. After choosing a target paper, I should start experimenting with adjustments to that model.
* Too few papers:
My paper-reading speed improved a lot. However, although the total looks large, spread over 6 months it comes to only about 5 papers a month, which is quite low.
# Schedule