# 2nd PM Semester 3 (26.10.2022)
## Read and learn: papers and books
* Book: Getting Started with Google BERT: Build and train state-of-the-art natural language processing models using BERT
https://github.com/PacktPublishing/Getting-Started-with-Google-BERT
* Vision Transformers
https://cameronrwolfe.substack.com/p/vision-transformers
* Self-Attention
https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
## Paper
### 1. MsIFT: Multi-Source Image Fusion Transformer ###
https://www.mdpi.com/2072-4292/14/16/4062
The contributions of this paper are as follows:
1. A multi-source image fusion method with a global receptive field is proposed. The non-locality of the transformer helps overcome the feature semantic bias caused by semantic misalignment between multi-source images.
2. Different feature extractor and task predictor networks are proposed and unified for three classification-based downstream tasks, and the MsIFT can be uniformly used for pixel-wise classification, image-wise classification and semantic segmentation.
3. The proposed MsIFT improved the classification performance through fusion and achieved state-of-the-art (SOTA) performance on the VAIS dataset [9], the SpaceNet 6 dataset [10] and the GRSS-DFC-2013 dataset.

**MsIFT Architecture**
As shown in Figure 2, the MsIFT consists of a CNN feature extractor, a feature fusion transformer and a task predictor.
**CNN feature extractor**

**Feature fusion transformer**
**The task predictor**
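
The three components are listed but not detailed in these notes; below is a minimal, hypothetical PyTorch sketch of how they might fit together for image-wise classification. The channel counts, the two-branch (optical/SAR) setup and the use of `nn.MultiheadAttention` as the fusion step are my own assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MsIFTSketch(nn.Module):
    """Hypothetical two-branch pipeline: CNN extractors -> cross-attention fusion -> classifier."""
    def __init__(self, num_classes=6, dim=256):
        super().__init__()
        # CNN feature extractor (one small conv stack per source, e.g. optical and SAR)
        def backbone(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            )
        self.opt_backbone = backbone(3)   # optical branch (assumed 3 channels)
        self.sar_backbone = backbone(1)   # SAR branch (assumed 1 channel)
        # Feature fusion transformer: optical features query SAR features (cross-attention)
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Task predictor: global pooling + linear head for image-wise classification
        self.head = nn.Linear(dim, num_classes)

    def forward(self, opt_img, sar_img):
        f_opt = self.opt_backbone(opt_img).flatten(2).transpose(1, 2)  # (B, N, dim)
        f_sar = self.sar_backbone(sar_img).flatten(2).transpose(1, 2)  # (B, M, dim)
        # Each optical token queries all SAR tokens -> global receptive field across sources
        fused, _ = self.cross_attn(query=f_opt, key=f_sar, value=f_sar)
        fused = self.norm(fused + f_opt)        # residual connection
        return self.head(fused.mean(dim=1))     # pooled logits

# Example: optical 3x64x64 and SAR 1x64x64 inputs
logits = MsIFTSketch()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 6])
```

The point of the sketch is that every optical token can attend to every SAR token, which is the global receptive field claimed in contribution 1.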

### **Result**

Figure 5. Attention maps of the picked points. (a–d) are the attention maps from classification and semantic segmentation, respectively. (a,c) are multi-source cross-attention maps. (b,d) are single-source self-attention maps. The figures indicated by the arrow show the global attention maps of the picked query point. The red points represent the background queries, and the yellow points are the foreground queries.

Figure 6. Results comparison of multi-source semantic segmentation, (a) ground truth, (b) MsIFT, (c) OPT and (d) SAR. Different rows represent the results in different scenarios. The red dotted rectangle areas are enlarged to show the results more clearly.
### **Conclusion**
* The MsIFT integrates three downstream classification tasks: pixel-wise classification, image-wise classification and semantic segmentation.
* Different task-specific networks are designed for local feature extraction and prediction, respectively. Three tasks share the multi-source features fusion module within the MsIFT.
* A feature fusion transformer (FFT) with encoder–decoder style is proposed for multi-source feature-level fusion; the global attention mechanism is beneficial for alleviating semantic biases caused by inaccurate registration.
* The FFT allows features to perform global queries, inspiring each query feature to aggregate global features similar to their semantic information.
* Extensive experiments demonstrate that the MsIFT achieved state-of-the-art performances on VAIS, GRSS-DFC-2013 and SpaceNet 6, which validates the superiority and versatility of the proposed method.
### 2. Pre-Trained Image Processing Transformer ###
https://openaccess.thecvf.com/content/CVPR2021/papers/Chen_Pre-Trained_Image_Processing_Transformer_CVPR_2021_paper.pdf
The contributions of this paper are as follows:
1. Develop a pre-trained model for image processing using the transformer architecture, namely, the Image Processing Transformer (IPT). As the pre-trained model needs to be compatible with different image processing tasks, including super-resolution, denoising, and deraining, the entire network is composed of multiple pairs of heads and tails corresponding to the different tasks and a single shared body. Since the potential of the transformer can only be fully exploited with a large-scale dataset, a great number of images with considerable diversity must be prepared for training the IPT model.
2. Select the ImageNet benchmark, which contains varied high-resolution images from 1,000 categories. For each image in ImageNet, multiple corrupted counterparts are generated using several carefully designed operations to serve the different tasks. For example, training samples for the super-resolution task are generated by downsampling the original images. The entire dataset used for training the IPT contains over 10 million images.
3. The transformer architecture is then trained on this huge dataset as follows. The training images are fed to the task-specific head, and the generated features are cropped into patches (i.e., "words") and flattened into sequences. The transformer body processes the flattened features, with position and task embeddings used for the encoder and decoder, respectively. In addition, the tails predict the original images with different output sizes according to the specific task. Moreover, a contrastive loss on the relationship between patches of different inputs is introduced so that the model adapts well to different image processing tasks. The proposed image processing transformer is learned in an end-to-end manner. Experimental results on several benchmarks show that the pre-trained IPT model can surpass most existing methods on their own tasks by a significant margin after fine-tuning. A rough sketch of this head/body/tail layout follows the list.
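
A rough, hypothetical sketch of the multi-head / shared-body / multi-tail layout described above (the dimensions, patch handling and task embedding are simplified guesses; the contrastive loss and the upsampling tail needed for super-resolution are omitted):

```python
import torch
import torch.nn as nn

class IPTSketch(nn.Module):
    """Multi-head / single shared body / multi-tail layout, roughly following the IPT description."""
    def __init__(self, tasks=("sr", "denoise", "derain"), dim=64, patch=4):
        super().__init__()
        self.patch = patch
        # One lightweight head per task maps the corrupted image to features
        self.heads = nn.ModuleDict({t: nn.Conv2d(3, dim, 3, padding=1) for t in tasks})
        # Shared transformer body operates on flattened patch "words"
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim * patch * patch, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Learned task embedding added to the patch sequence (stand-in for the decoder-side embedding)
        self.task_embed = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(1, 1, dim * patch * patch)) for t in tasks}
        )
        # One tail per task reconstructs an image from features
        self.tails = nn.ModuleDict({t: nn.Conv2d(dim, 3, 3, padding=1) for t in tasks})

    def forward(self, x, task):
        b, _, h, w = x.shape
        feat = self.heads[task](x)                                  # (B, dim, H, W)
        p = self.patch
        seq = feat.unfold(2, p, p).unfold(3, p, p)                  # crop into p x p patches
        seq = seq.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, feat.size(1) * p * p)
        seq = self.body(seq + self.task_embed[task])                # shared body + task embedding
        feat = seq.reshape(b, h // p, w // p, -1, p, p).permute(0, 3, 1, 4, 2, 5)
        feat = feat.reshape(b, -1, h, w)
        return self.tails[task](feat)                               # task-specific reconstruction

out = IPTSketch()(torch.randn(1, 3, 32, 32), task="denoise")
print(out.shape)  # torch.Size([1, 3, 32, 32])
```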

Experiments:
**Super-resolution**
The model is compared with several state-of-the-art CNN-based SR methods, and results are reported for the 4× scale on the Urban100 dataset.
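
A minimal sketch of how a low-/high-resolution training pair can be generated by downsampling, as described in the pre-training setup above (the bicubic interpolation mode and the 4× scale are my assumptions):

```python
import torch
import torch.nn.functional as F

def make_sr_pair(hr_image, scale=4):
    """Build a (low-res, high-res) training pair by bicubic downsampling of a clean image."""
    lr_image = F.interpolate(hr_image, scale_factor=1 / scale,
                             mode="bicubic", align_corners=False).clamp(0.0, 1.0)
    return lr_image, hr_image

hr = torch.rand(1, 3, 256, 256)   # stand-in for a clean high-resolution crop
lr, hr = make_sr_pair(hr, scale=4)
print(lr.shape, hr.shape)          # torch.Size([1, 3, 64, 64]) torch.Size([1, 3, 256, 256])
```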

**Denoising**
The denoising results are compared with various state-of-the-art models.


**Deraining**
For the image deraining task, the model is evaluated on the synthesized Rain100L dataset [70], which consists of 100 rainy images.


**Generalization Ability**

**Ablation Study**

**Conclusion**
Experimental results demonstrate that the IPT can outperform state-of-the-art methods using only one pre-trained model after quick fine-tuning.
### 3. Olive Disease Classification Based on Vision Transformer and CNN Models
https://doi.org/10.1155/2022/3998193
The main contributions of this paper are as follows:
1. To improve the quality of the olive images, they use a median noise filtering algorithm that removes and reduces noise after the data augmentation process (see the sketch after this list).
2. They propose a hybrid deep learning-based architecture that combines the convolutional neural network (CNN) model and the vision transformer model to extract the most relevant features from olive images.
3. For the image classification process, they use a pooling layer and dropout to avoid the overfitting problem before applying a softmax function.
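
A minimal sketch of the median noise filtering step from contribution 1, assuming OpenCV is used; the 3×3 kernel size and the file names are hypothetical:

```python
import cv2

# Median filter: replaces each pixel with the median of its 3x3 neighbourhood,
# which suppresses salt-and-pepper style noise while keeping leaf edges reasonably sharp.
image = cv2.imread("olive_leaf.jpg")          # hypothetical input path
denoised = cv2.medianBlur(image, 3)           # kernel size must be an odd integer
cv2.imwrite("olive_leaf_denoised.jpg", denoised)
```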

The architecture of the approach can be divided into
three steps.
1. Preprocessing of the dataset is performed to remove noise and improve image quality. This is done using a noise filtering algorithm.
2. After that, a data augmentation procedure is performed. The preprocessed dataset is then fitted with the hybrid model described above, which is composed of CNN models and vision transformers, to extract features from it. It is important to note that the primary goal of using such a variety of models is to run a series of experiments and determine which combination gives the most favorable results.
3. The third and final step is image classification, using a pooling layer and dropout to avoid the problem of overfitting before applying a softmax function (a hypothetical sketch of this pipeline follows the list).
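
A hypothetical sketch of such a hybrid CNN + transformer classifier with pooling, dropout and softmax; the layer sizes are illustrative and do not reproduce the paper's VGG-16 + ViT configuration:

```python
import torch
import torch.nn as nn

class HybridOliveClassifier(nn.Module):
    """Hypothetical hybrid model: CNN feature maps -> transformer encoder -> pooling -> dropout -> softmax."""
    def __init__(self, num_classes=3, dim=128):
        super().__init__()
        # CNN part extracts local texture features from the leaf image
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # ViT-style part: treat each spatial location as a token and apply self-attention
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
        )
        self.dropout = nn.Dropout(0.5)          # dropout before the classifier to curb overfitting
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.cnn(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens)
        pooled = tokens.mean(dim=1)                        # global average pooling over tokens
        return torch.softmax(self.fc(self.dropout(pooled)), dim=-1)

probs = HybridOliveClassifier()(torch.randn(2, 3, 64, 64))
print(probs.shape)  # torch.Size([2, 3]): healthy / Aculus olearius / peacock spot
```

For training, the softmax would normally be dropped in favor of `nn.CrossEntropyLoss`, which expects raw logits.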

**Result**
Dataset Description and Evaluation Metrics.
A dataset collected over the spring and summer was used to evaluate the proposed deep-learning-based model. With the assistance of an agricultural engineer who is highly knowledgeable in the subject matter, 3,400 olive leaves were separated into three distinct categories: healthy leaves, leaves infected with Aculus olearius, and leaves infected with olive peacock spot. The dataset also included uninfected olive leaves.





**Conclusion**
Compared with other deep learning models, including VGG-16, VGG-19, and the vision transformer alone, they found that the accuracy of the hybrid deep learning models was significantly higher.
In binary classification, the most effective model, a combination of the ViT model and the VGG-16 model, achieved an accuracy of 97 percent.
They intend to adapt the most effective deep learning model to other plant collections in the future and to collect more photos of olive diseases.
### 4. Transformers in computational visual media: A survey
https://link.springer.com/content/pdf/10.1007/s41095-021-0247-3.pdf
## Learn Programming ##
* Install PyTorch
https://learn.microsoft.com/id-id/windows/ai/windows-ml/tutorials/pytorch-installation
* Install TensorFlow and Keras
https://warstek.com/tensorflowgpu/
https://github.com/Wayan123/Tensorflow-GPU
https://softscients.com/2020/03/25/cara-install-tensorflow-dan-keras/
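
Assuming the installs above succeeded, a quick sanity check that both frameworks import and can see the GPU:

```python
# Verify the PyTorch and TensorFlow installs and check GPU visibility.
import torch
import tensorflow as tf

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("TensorFlow:", tf.__version__, "| GPUs:", tf.config.list_physical_devices("GPU"))
```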
### Data Augmentation techniques & Pre-processing
We can apply various changes to the initial data. For example, for images we can use: (https://neptune.ai/blog/data-augmentation-in-python)
* **Geometric transformations** – you can randomly flip, crop, rotate or translate images, and that is just the tip of the iceberg
* **Color space transformations** – change RGB color channels, intensify any color
* **Kernel filters** – sharpen or blur an image
* **Random Erasing** – delete a part of the initial image
* **Mixing images** – basically, mix images with one another. Might be counterintuitive but it works
* **Background removal** – remove the image background
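
An illustrative augmentation pipeline covering the geometric, color-space, kernel-filter and random-erasing categories above, using torchvision (the parameters are arbitrary choices; image mixing such as mixup needs batch-level logic and is not shown):

```python
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),               # geometric: crop + resize
    transforms.RandomHorizontalFlip(p=0.5),          # geometric: flip
    transforms.RandomRotation(degrees=15),           # geometric: rotate
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color space
    transforms.GaussianBlur(kernel_size=3),          # kernel filter: blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                # random erasing (applied on the tensor)
])

augmented = augment(Image.open("sample.jpg").convert("RGB"))  # hypothetical input image
print(augmented.shape)  # torch.Size([3, 224, 224])
```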





