# Self-training with Noisy Student improves ImageNet classification

Submitted on 11 Nov 2019 (v1), last revised 19 Jun 2020
link: https://arxiv.org/abs/1911.04252

## Algorithm

1. The teacher model is first trained on the given labeled dataset.
2. The teacher model labels additional unlabeled images, expanding the dataset.
3. A student model is then trained on the expanded dataset.
4. Repeat from step 2 several times.

* Some points worth mentioning (a minimal code sketch of this loop is given at the end of this note):
    * The student model is not smaller than the teacher model, so that it can outperform the latter.
    * To help the student generalize, input noise and model noise are added to its training process. For input noise, data augmentation with RandAugment is used; for model noise, dropout and stochastic depth are used.
    * The additional dataset is several times larger than the original dataset.

![](https://i.imgur.com/GaNE9TG.png)

## Training Details

EfficientNet is chosen as the base model because it is a top-performing CNN that scales well to large sizes. The teacher produces high-quality pseudo labels by reading clean images, while the student is required to reproduce those labels with augmented images as input.

The high-quality pseudo labels are obtained with two extra tricks, data filtering and balancing:
* Images on which the teacher model has low confidence are dropped, since they are usually out-of-domain.
* Images are duplicated in classes whose size is too small.

The authors also find that soft pseudo labels perform better than hard pseudo labels, although they emphasize that either type works with Noisy Student training.

## Results

* This method beats the previous SOTA on the ImageNet 2012 ILSVRC challenge.
* Further, Noisy Student training significantly improves the base model on the robustness benchmarks ImageNet-A, ImageNet-C, and ImageNet-P.
* The method also improves robustness to adversarial attacks (about 75% accuracy).

## Ablation Study

Detailed effects of iterative training:

| Iteration | Unlabeled:Labeled Batch Size Ratio | Top-1 Acc. |
| - | - | - |
| 1 | 14:1 | 87.6% |
| 2 | 14:1 | 88.1% |
| 3 | 28:1 | 88.4% |

Other interesting findings:
* A large amount of unlabeled data is necessary for better performance.
* Data balancing is useful for small models.
* Joint training on labeled and unlabeled data outperforms the pipeline that first pretrains on unlabeled data and then fine-tunes on labeled data.
* Using a large ratio between the unlabeled and labeled batch sizes lets the model train longer on unlabeled data and reach higher accuracy.
* Training the student from scratch is sometimes better than initializing the student with the teacher.
* A student initialized with the teacher still requires a large number of training epochs to perform well.

## My thoughts

The authors propose an innovative learning algorithm and study it thoroughly with well-designed experiments, quantifying the contribution of each component. This iterative teacher-student training method is similar in spirit to data augmentation techniques. It is remarkable that the method achieves such good results, given that the knowledge comes from only a relatively small labeled dataset (compared to the performance it yields). Noisy Student self-training lets us train very large models, so extraordinary performance becomes possible; the iterative noisy training process and the progressively larger student models help the student generalize. Like GAN-based data augmentation, this kind of method makes self-learning and lifelong learning possible.
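To make the algorithm above concrete, here is a minimal PyTorch-style sketch of the Noisy Student loop, written by me rather than taken from the paper's released code. The function names (`pseudo_label`, `train`, `noisy_student`, `make_model`), the confidence threshold of 0.3, and the assumption that the unlabeled loader yields plain image batches are illustrative choices; RandAugment, dropout, and stochastic depth are assumed to live inside the data loaders and the model, as noted in the comments.

```python
# Hypothetical sketch of Noisy Student training (not the authors' code).
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, conf_threshold=0.3, device="cpu"):
    """The (un-noised) teacher reads clean images and emits soft pseudo labels.
    Low-confidence images are dropped, mirroring the paper's data filtering."""
    teacher.eval()  # no dropout / stochastic depth when producing pseudo labels
    kept_images, kept_probs = [], []
    for x in unlabeled_loader:  # assumed to yield plain image batches
        probs = F.softmax(teacher(x.to(device)), dim=1)
        keep = probs.max(dim=1).values > conf_threshold
        kept_images.append(x[keep.cpu()])
        kept_probs.append(probs[keep].cpu())
    return DataLoader(TensorDataset(torch.cat(kept_images), torch.cat(kept_probs)),
                      batch_size=unlabeled_loader.batch_size, shuffle=True)


def train(model, labeled_loader, pseudo_loader=None, epochs=1, lr=0.1, device="cpu"):
    """Train on labeled data (hard labels); if pseudo-labeled data is given,
    train jointly on it with soft targets. Input noise (RandAugment) is assumed
    to be in the loaders' transforms; model noise (dropout, stochastic depth)
    is assumed to be inside `model` and active in train() mode."""
    model.train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        pseudo_iter = iter(pseudo_loader) if pseudo_loader is not None else None
        for x_l, y_l in labeled_loader:
            loss = F.cross_entropy(model(x_l.to(device)), y_l.to(device))
            if pseudo_iter is not None:
                try:
                    x_u, q_u = next(pseudo_iter)
                except StopIteration:
                    pseudo_iter = iter(pseudo_loader)
                    x_u, q_u = next(pseudo_iter)
                # Soft pseudo labels: match the teacher's output distribution.
                loss = loss + F.kl_div(F.log_softmax(model(x_u.to(device)), dim=1),
                                       q_u.to(device), reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


def noisy_student(make_model, labeled_loader, unlabeled_loader, iterations=3):
    """make_model(i) should return the model for iteration i,
    equal to or larger than the previous one."""
    teacher = train(make_model(0), labeled_loader)              # step 1: train teacher
    for i in range(1, iterations + 1):
        pseudo = pseudo_label(teacher, unlabeled_loader)        # step 2: label extra images
        student = train(make_model(i), labeled_loader, pseudo)  # step 3: train noised student
        teacher = student                                       # step 4: iterate
    return teacher
```

For simplicity this sketch draws one pseudo-labeled batch per labeled batch; the paper instead uses a much larger unlabeled-to-labeled batch size ratio (14:1 or 28:1 in the ablation above), an equal-or-larger EfficientNet as the student, and class balancing of the pseudo-labeled data, none of which are shown here.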