# **Vision Transformer (ViT) Tutorial - Part 3: Pretraining, Transfer Learning & Real-World Applications**
**#VisionTransformer #TransferLearning #HuggingFace #ImageNet #FineTuning #AI #DeepLearning #ComputerVision #Transformers #ModelZoo**
---
## **Table of Contents**
1. [Recap of Part 2](#recap-of-part-2)
2. [Why Pretraining Matters: The Power of Scale](#why-pretraining-matters-the-power-of-scale)
3. [Pretrained ViT Models: ViT-Base, ViT-Large, ViT-Huge](#pretrained-vit-models-vit-base-vit-large-vit-huge)
4. [Using Hugging Face Transformers for ViT](#using-hugging-face-transformers-for-vit)
5. [Loading Pretrained ViT from Model Zoo](#loading-pretrained-vit-from-model-zoo)
6. [Transfer Learning: Adapting ViT to Custom Datasets](#transfer-learning-adapting-vit-to-custom-datasets)
7. [Fine-Tuning Strategies: Full, Partial, and Feature Extraction](#fine-tuning-strategies-full-partial-and-feature-extraction)
8. [Case Study: Fine-Tuning ViT on CIFAR-100](#case-study-fine-tuning-vit-on-cifar-100)
9. [Visualizing Attention Rollout & Token Merging](#visualizing-attention-rollout--token-merging)
10. [Comparing ViT, DeiT, and Hybrid Models](#comparing-vit-deit-and-hybrid-models)
11. [Optimizing ViT for Inference Speed](#optimizing-vit-for-inference-speed)
12. [Common Pitfalls in Transfer Learning](#common-pitfalls-in-transfer-learning)
13. [Visualizing Transfer Learning Pipeline (Diagram)](#visualizing-transfer-learning-pipeline-diagram)
14. [Summary & What's Next in Part 4](#summary--whats-next-in-part-4)
---
## **1. Recap of Part 2**
In **Part 2**, we:
- Built a **Vision Transformer from scratch** in PyTorch.
- Implemented **patch embedding**, **multi-head attention**, and **transformer blocks**.
- Trained a small ViT on **CIFAR-10**.
- Learned that **ViT underperforms CNNs on small datasets** without pretraining.
- Visualized training dynamics and debugged common issues.
Now, in **Part 3**, we unlock ViT's true potential: **pretraining at scale** and **transfer learning**.
You'll learn how to:
- Use **pretrained ViT models** from Hugging Face.
- **Fine-tune** ViT on custom datasets.
- Visualize **attention rollout**.
- Optimize for **speed and efficiency**.
Let's go!
---
## **2. Why Pretraining Matters: The Power of Scale**
In **Part 2**, our ViT only reached ~75% accuracy on CIFAR-10, far below ResNet's ~95%.
But in the original ViT paper, **ViT-Huge pretrained on JFT-300M** (300 million images) reached **88.55% top-1 accuracy on ImageNet**, **outperforming the best CNNs** of the time.
> **Key Insight**:
> **ViT needs large-scale pretraining to unlock its capacity.**
### Scaling Laws: Data vs Performance

*(Image: ViT scales better with data than CNNs β performance grows linearly with dataset size)*
> ViT is **data-hungry** but **highly scalable**.
This is why **transfer learning** is essential.
---
## **3. Pretrained ViT Models: ViT-Base, ViT-Large, ViT-Huge**
Google released several ViT variants pretrained on **ImageNet-21k** and **ImageNet-1k**.
| Model | Patch Size | Image Size | Params | Top-1 Acc (ImageNet) |
|------|-----------|------------|--------|------------------------|
| **ViT-Base/16** | 16x16 | 224x224 | 86M | 77.9% |
| **ViT-Large/16** | 16x16 | 224x224 | 307M | 76.5% |
| **ViT-Huge/14** | 14x14 | 224x224 | 632M | 78.5% |
> **ViT-Base/16** is the most commonly used variant.
They are available via:
- **Hugging Face Hub**
- **Google Research GitHub**
- **TorchVision (newer versions)**
---
## **4. Using Hugging Face Transformers for ViT**
[Hugging Face](https://huggingface.co) provides a unified API for ViT.
### Install
```bash
pip install transformers torch torchvision
```
### Load Pretrained ViT
```python
from transformers import ViTImageProcessor, ViTForImageClassification
import torch
# Load processor (handles preprocessing)
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
# Load model
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
```
> Automatically downloads the weights and config.
---
### Inference on a Single Image
```python
from PIL import Image
import requests
# Load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
# Preprocess
inputs = processor(images=image, return_tensors="pt")
# Predict
with torch.no_grad():
    logits = model(**inputs).logits
# Get predicted class
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
> Output: `"Egyptian cat"` (one of the ImageNet cat classes)
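If you want more than the single best label, the same `logits` can be turned into a ranked list. A small extension of the snippet above, reusing the already loaded `model` and computed `logits`:

```python
# Show the top-5 predictions with their probabilities
probs = logits.softmax(dim=-1)
top5 = probs.topk(5, dim=-1)
for prob, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{model.config.id2label[idx.item()]}: {prob.item():.3f}")
```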
---
## **5. Loading Pretrained ViT from Model Zoo**
You can also use **TorchVision** (if available):
```python
import torchvision.models as models

# Requires TorchVision 0.13+ (weights enum API)
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
```
Or use **timm**, which ships weights ported from Google's official checkpoints:
```python
# Using timm (another popular library)
import timm
model = timm.create_model('vit_base_patch16_224', pretrained=True)
```
> `timm` supports 100+ ViT variants.
Install with:
```bash
pip install timm
```
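To see which variants your installed `timm` version actually ships, `timm.list_models` accepts a wildcard filter. A quick sketch (the exact model names returned depend on your `timm` version):

```python
import timm

# List ViT variants that have pretrained weights available
vit_names = timm.list_models("vit_*", pretrained=True)
print(len(vit_names), vit_names[:5])
```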
---
## **6. Transfer Learning: Adapting ViT to Custom Datasets**
Transfer learning means:
1. Start with a **pretrained ViT** (trained on ImageNet).
2. Replace the final classification head.
3. **Fine-tune** on your dataset.
### Use Case: Medical Image Classification
You have 5,000 X-ray images (Pneumonia vs Normal).
You don't have enough data to train ViT from scratch, but you can **fine-tune a pretrained ViT** (see the sketch below).
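As a minimal sketch of steps 1-3 for this hypothetical X-ray task (the dataset itself is assumed; only the model setup is shown, and the label names are illustrative):

```python
from transformers import ViTForImageClassification

# Replace the 1000-class ImageNet head with a new 2-class head
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,                          # Pneumonia vs Normal
    id2label={0: "NORMAL", 1: "PNEUMONIA"},
    label2id={"NORMAL": 0, "PNEUMONIA": 1},
    ignore_mismatched_sizes=True,          # the new head is randomly initialized
)
```

From here, fine-tuning proceeds exactly as in the CIFAR-100 case study below.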
---
## **7. Fine-Tuning Strategies: Full, Partial, and Feature Extraction**
### **Strategy 1: Full Fine-Tuning**
Update **all layers**.
```python
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=100,  # e.g., CIFAR-100
    ignore_mismatched_sizes=True
)

# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True
```
> Best performance, but slow and data-hungry.
---
### **Strategy 2: Partial Fine-Tuning**
Only fine-tune the **last few layers**.
```python
# Freeze everything
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last two transformer blocks + the classification head
for param in model.vit.encoder.layer[-2:].parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True
```
> Faster, with less risk of overfitting.
---
### **Strategy 3: Feature Extraction**
Use ViT as a **fixed feature extractor**.
```python
import torch.nn as nn

# Replace the classifier with an identity so the model outputs raw features
model.classifier = nn.Identity()

# Forward pass to extract features
with torch.no_grad():
    features = model(**inputs).logits  # shape: (1, 768) for ViT-Base

# Train a small linear classifier on top of the frozen features
clf = nn.Linear(768, num_classes)  # num_classes = size of your label set
```
> Fastest, but usually lower accuracy.
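A minimal sketch of training that small classifier on pre-extracted features. Here `train_features` (shape `(num_samples, 768)`) and `train_labels` are hypothetical tensors you would build by running the extraction loop above over your dataset:

```python
import torch
import torch.nn as nn

clf = nn.Linear(768, num_classes)                  # num_classes is your label count
optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    logits = clf(train_features)                   # frozen ViT features in, class scores out
    loss = loss_fn(logits, train_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only a single linear layer is trained, this runs in seconds even on CPU.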
---
## **8. Case Study: Fine-Tuning ViT on CIFAR-100**
Let's fine-tune **ViT-Base** on **CIFAR-100** (100 classes, 32x32 images).
### Problem: ViT expects 224x224
We must **resize images**.
```python
from torchvision import transforms
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')

# Custom transform: resize to 224x224 and use the processor's normalization stats
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=processor.image_mean, std=processor.image_std),
])
```
---
### Load CIFAR-100
```python
import torchvision
from torch.utils.data import DataLoader

trainset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.CIFAR100(root='./data', train=False, download=True, transform=transform)

trainloader = DataLoader(trainset, batch_size=16, shuffle=True)  # small batch size due to memory
testloader = DataLoader(testset, batch_size=16)
```
---
### Initialize Model
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224',
    num_labels=100,
    ignore_mismatched_sizes=True
).to(device)

# Freeze the backbone, then fine-tune only the last 4 encoder layers
# (the new classification head stays trainable by default)
for param in model.vit.parameters():
    param.requires_grad = False
for param in model.vit.encoder.layer[-4:].parameters():
    param.requires_grad = True
```
---
### Training Loop (Simplified)
```python
optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=5e-5)

for epoch in range(10):
    model.train()
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(pixel_values=inputs, labels=labels)  # loss is computed internally
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")
```
> After 10 epochs: ~85% accuracy (vs ~50% when training from scratch).
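To check that accuracy on your own run, a simple evaluation loop over the test set (reusing `model`, `testloader`, and `device` from above):

```python
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(pixel_values=inputs).logits
        preds = logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {100 * correct / total:.2f}%")
```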
---
## **9. Visualizing Attention Rollout & Token Merging**
### **Attention Rollout**
Shows how attention spreads across the image through layers.
It builds on the idea:
> "If token A attends to token B, it 'inherits' B's attention."
The rollout multiplies the attention matrices of all layers, with an identity term added to account for the residual connections:
$$
R = \prod_{l=1}^{L} \left( \tfrac{1}{2} A^l + \tfrac{1}{2} I \right)
$$
where $A^l$ is the head-averaged attention matrix at layer $l$ (each factor is row-normalized before multiplying).
---
### Code (Simplified)
```python
def rollout(attentions, head_fusion="mean"):
    """attentions: list of per-layer attention tensors, each (batch, heads, tokens, tokens)."""
    num_tokens = attentions[0].size(-1)
    result = torch.eye(num_tokens)
    with torch.no_grad():
        for attn in attentions:
            # Fuse the heads into a single (batch, tokens, tokens) map
            if head_fusion == "mean":
                fused = attn.mean(dim=1)
            else:
                fused = attn.max(dim=1).values
            # Add the residual connection and re-normalize rows
            fused = 0.5 * fused + 0.5 * torch.eye(num_tokens)
            fused = fused / fused.sum(dim=-1, keepdim=True)
            result = torch.matmul(fused, result)
    return result
```
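The `attentions` list comes directly from the Hugging Face model when you ask for it. A short sketch reusing the `model`, `processor`, and `image` from Section 4 (the 14x14 reshape assumes ViT-Base/16 at 224x224, i.e., 196 patch tokens plus the CLS token):

```python
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Tuple with one attention tensor per layer, each (batch, heads, tokens, tokens)
rollout_map = rollout(list(outputs.attentions))

# CLS-token attention to the 196 patches, reshaped into a 14x14 heatmap
cls_heatmap = rollout_map[0, 0, 1:].reshape(14, 14)
```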
---
### Attention Rollout Example

*(Image: Heatmap showing attention focused on object regions)*
> The model learns to **attend to relevant parts** like eyes, wheels, or wings.
---
## **10. Comparing ViT, DeiT, and Hybrid Models**
| Model | Key Idea | Advantage | Use Case |
|------|---------|----------|---------|
| **ViT** | Pure transformer | Global context | Large datasets |
| **DeiT** | **D**ata-**e**fficient **I**mage **T**ransformer | Trains on ImageNet without extra data | Medium datasets |
| **Hybrid (e.g., BoTNet)** | CNN + Transformer | Local + global | Object detection |
| **MobileViT** | Lightweight ViT | Fast on mobile | Edge devices |
| **Twins-SVT** | Spatially separable self-attention | Faster inference | Real-time apps |
> **DeiT** adds a **distillation token** that learns from a CNN teacher during training.
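DeiT checkpoints can be loaded with the same Hugging Face API as ViT. A sketch, assuming the distilled DeiT-Base checkpoint name `facebook/deit-base-distilled-patch16-224` on the Hub (double-check the exact name before relying on it):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = AutoModelForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224")
```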
---
### Performance Comparison (ImageNet)

*(Image: DeiT matches ViT with less data)*
---
## **11. Optimizing ViT for Inference Speed**
ViT is **computationally heavy** due to self-attention:
$$
\text{Complexity} = O(N^2 \cdot D)
$$
where $N$ = number of patches, $D$ = embedding size.
### Optimization Techniques
| Technique | How It Helps |
|---------|-------------|
| **Model Pruning** | Remove unimportant attention heads |
| **Quantization** | Convert weights to FP16 or INT8 |
| **Knowledge Distillation** | Train small student from large teacher |
| **Patch Merging** | Reduce $N$ in deeper layers |
| **Efficient Attention** | Use Linformer, Performer, or FlashAttention |
---
### Example: FP16 Inference
```python
model.half()                                   # convert weights to float16 (most useful on GPU)
pixel_values = inputs["pixel_values"].half()

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits
```
> Roughly 2x faster with about half the memory (on GPUs with native FP16 support).
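For CPU deployment, dynamic INT8 quantization of the linear layers (which hold most of ViT's weights) is another low-effort option. A minimal sketch using PyTorch's built-in API; results and supported ops vary by PyTorch version, so treat it as a starting point:

```python
import torch
import torch.nn as nn

# Quantize the nn.Linear layers (attention/MLP projections) to INT8; runs on CPU
quantized_model = torch.quantization.quantize_dynamic(
    model.float().cpu(), {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    logits = quantized_model(pixel_values=inputs["pixel_values"].float()).logits
```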
---
## **12. Common Pitfalls in Transfer Learning**
### **Pitfall 1: Not Resizing Images**
Pretrained ViT-Base/16 expects **224x224** inputs. Feeding raw 32x32 images either breaks the learned position embeddings (wrong number of patches) or yields poor performance.
**Fix**: Always **resize or crop** to the pretraining resolution.
---
### **Pitfall 2: Using the Wrong Normalization**
Different checkpoints expect different normalization stats: classic ImageNet CNNs use `mean=[0.485, 0.456, 0.406]`, `std=[0.229, 0.224, 0.225]`, while `google/vit-base-patch16-224` normalizes with `mean=std=[0.5, 0.5, 0.5]`. Using mismatched stats (e.g., CIFAR stats) hurts convergence.
**Fix**: Use the checkpoint's `ViTImageProcessor` to get the correct values.
---
### **Pitfall 3: Too High a Learning Rate**
Pretrained weights are easily destroyed by large updates.
**Fix**: Use a **low learning rate** (1e-5 to 5e-5), ideally with warmup, as sketched below.
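A typical fine-tuning setup pairs a low learning rate with a short warmup. A sketch using the scheduler helper from `transformers` (`num_epochs` is an assumed variable, and the 10% warmup fraction is just a reasonable default):

```python
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

num_training_steps = len(trainloader) * num_epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)

# In the training loop, call scheduler.step() after each optimizer.step()
```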
---
### **Pitfall 4: Not Freezing Early Layers**
Fine-tuning all layers on a small dataset invites overfitting.
**Fix**: Freeze the early layers and fine-tune only the last few (see Strategy 2).
---
## **13. Visualizing Transfer Learning Pipeline (Diagram)**

```
Pretrained ViT (ImageNet)
        ↓
Remove Classifier Head
        ↓
Add New Head (e.g., 100 classes)
        ↓
Freeze Early Layers
        ↓
Fine-Tune on Custom Dataset
        ↓
Optimized for Inference
```
> This is how ViT powers real-world applications.
---
## **14. Summary & What's Next in Part 4**
### **What You've Learned in Part 3**
- Why **pretraining** is essential for ViT.
- How to load **pretrained ViT** from Hugging Face.
- **Transfer learning** strategies: full, partial, feature extraction.
- Fine-tuned ViT on **CIFAR-100** with resizing.
- Visualized **attention rollout**.
- Compared **ViT, DeiT, and hybrid models**.
- Optimized for **speed and efficiency**.
---
### **What's Coming in Part 4: Vision Transformers for Object Detection, Segmentation & Video**
In the next part, we'll explore:
- **DETR**: Transformer for **object detection**.
- **Segmenter**: ViT for **semantic segmentation**.
- **Video Swin Transformer**: For **video classification**.
- **MAE (Masked Autoencoder)**: Self-supervised pretraining.
- **Multimodal Models**: CLIP, Flamingo.
- **Training ViT from Scratch with MAE**.
> **#DETR #Segmenter #VideoTransformer #MAE #SelfSupervised #Multimodal**
---
## Final Words
You've now mastered **real-world Vision Transformer applications**.
> **"Pretraining is not a shortcut; it's a paradigm shift. ViT learns general visual understanding, then specializes."**
In **Part 4**, we'll go beyond classification and explore how Transformers are revolutionizing **detection, segmentation, and video**.
---
**Pro Tip**: Always check the **Hugging Face Model Hub** before training from scratch.
**Share this guide** to help others leverage **pretrained vision models**.
---
**You're now ready for Part 4!**
We're entering the world of **Transformers beyond classification**.
#VisionTransformer #TransferLearning #HuggingFace #FineTuning #DeepLearning #AI #ComputerVision #Transformers #ModelZoo #AttentionIsAllYouNeed