# 🌟 **65+ Vision Transformer (ViT) Multiple Choice Questions (MCQs) with Answers**
**#VisionTransformer #ViT #DeepLearning #ComputerVision #Transformers #AI #MachineLearning #MCQ #InterviewPrep**
---
## 🔹 **Table of Contents**
1. [Basic Concepts (Q1–Q15)](#basic-concepts-q1–q15)
2. [Architecture & Components (Q16–Q30)](#architecture--components-q16–q30)
3. [Attention & Transformers (Q31–Q45)](#attention--transformers-q31–q45)
4. [Training & Optimization (Q46–Q55)](#training--optimization-q46–q55)
5. [Advanced & Real-World Applications (Q56–Q65)](#advanced--real-world-applications-q56–q65)
6. [Answer Key & Explanations](#answer-key--explanations)
---
## 🔹 **Basic Concepts (Q1–Q15)**
---
### **Q1: Who introduced the Vision Transformer (ViT)?**
A) Facebook AI
B) OpenAI
C) Google Research
D) DeepMind
**Answer: C) Google Research**
✅ The ViT paper was published by Alexey Dosovitskiy et al. at Google Research in 2020.
---
### **Q2: What is the core idea behind ViT?**
A) Use CNNs for feature extraction
B) Treat image patches as tokens like words in NLP
C) Use RNNs for spatial modeling
D) Apply GANs to image generation
**Answer: B) Treat image patches as tokens like words in NLP**
✅ ViT splits an image into patches and processes them as a sequence using a Transformer.
---
### **Q3: Which NLP model inspired ViT?**
A) BERT
B) GPT
C) Transformer
D) LSTM
**Answer: C) Transformer**
✅ ViT is based on the **Transformer architecture** from *"Attention Is All You Need"*.
---
### **Q4: What is the purpose of the [CLS] token in ViT?**
A) To mark the end of the sequence
B) To store the final class prediction
C) To store global image representation for classification
D) To separate patches
**Answer: C) To store global image representation for classification**
✅ The [CLS] token aggregates information from all patches and is used for classification.
---
### **Q5: What does "An Image is Worth 16x16 Words" refer to?**
A) Image compression
B) Splitting an image into 16x16 patches
C) Using 16x16 filters in CNNs
D) Text-to-image generation
**Answer: B) Splitting an image into 16x16 patches**
✅ It means each 16×16 patch is treated as a "word" in the sequence.
---
### **Q6: Which of the following is NOT a component of ViT?**
A) Patch Embedding
B) Convolutional Layers
C) Transformer Encoder
D) Positional Encoding
**Answer: B) Convolutional Layers**
✅ ViT is **purely attention-based** — no convolutions in the original architecture.
---
### **Q7: What is the role of positional encoding in ViT?**
A) To classify the image
B) To provide spatial information to patches
C) To reduce model size
D) To increase patch size
**Answer: B) To provide spatial information to patches**
✅ Since Transformers are permutation-equivariant, positional encoding adds location info.
---
### **Q8: What is the input to the Transformer encoder in ViT?**
A) Raw pixel values
B) Flattened patches + positional encoding
C) CNN feature maps
D) Frequency domain data
**Answer: B) Flattened patches + positional encoding**
✅ Patches are flattened, embedded, and combined with positional encodings.
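The whole input pipeline fits in a few lines of PyTorch. Below is a minimal sketch (the `PatchEmbedding` class name and the unfold-based patching are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, flatten, project, add [CLS] + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2           # 14 * 14 = 196
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)   # 768 -> D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 197 positions

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        B, C, _, _ = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, 14, 14, 16, 16)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, 196, 768)
        x = self.proj(x)                                       # (B, 196, D)
        cls = self.cls_token.expand(B, -1, -1)                 # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                         # (B, 197, D)
        return x + self.pos_embed                              # add positional encoding
```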
---
### **Q9: What is the typical patch size used in ViT-Base?**
A) 8x8
B) 16x16
C) 32x32
D) 64x64
**Answer: B) 16x16**
✅ ViT-Base/16 uses 16×16 patches.
---
### **Q10: What is the output of the ViT model?**
A) Reconstructed image
B) Class logits
C) Bounding boxes
D) Semantic mask
**Answer: B) Class logits**
✅ The final output is logits for classification after the MLP head.
---
### **Q11: Which dataset was ViT pretrained on to achieve SOTA results?**
A) CIFAR-10
B) ImageNet
C) JFT-300M
D) MNIST
**Answer: C) JFT-300M**
✅ Pretraining on **JFT-300M** (300M images) enabled ViT to outperform CNNs.
---
### **Q12: What is the main advantage of ViT over CNNs?**
A) Faster inference
B) Global context modeling
C) Lower memory usage
D) Simpler architecture
**Answer: B) Global context modeling**
✅ Self-attention sees all patches at once, unlike CNNs with limited receptive fields.
---
### **Q13: What is the main limitation of ViT on small datasets?**
A) Too slow
B) Overfitting due to lack of inductive bias
C) Cannot handle color images
D) Requires GPU
**Answer: B) Overfitting due to lack of inductive bias**
✅ ViT lacks CNN’s built-in locality bias, so it needs large data to generalize.
---
### **Q14: Which of the following is a hybrid model combining CNN and Transformer?**
A) ResNet
B) MobileViT
C) EfficientNet
D) AlexNet
**Answer: B) MobileViT**
✅ MobileViT uses CNNs for local features and Transformers for global context.
---
### **Q15: What is the purpose of the MLP head in ViT?**
A) To extract patches
B) To add positional encoding
C) To classify the [CLS] token
D) To reduce patch size
**Answer: C) To classify the [CLS] token**
✅ The MLP head takes the final [CLS] token and outputs class probabilities.
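A minimal sketch of that last step, using a stand-in tensor for the encoder output:

```python
import torch
import torch.nn as nn

# Sketch: the classification head reads only the final [CLS] token (position 0).
tokens = torch.randn(2, 197, 768)     # stand-in for the encoder output, shape (B, N+1, D)
mlp_head = nn.Linear(768, 1000)       # D -> num_classes
logits = mlp_head(tokens[:, 0])       # (B, 1000) class logits
```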
---
## 🔹 **Architecture & Components (Q16–Q30)**
---
### **Q16: How is a 224x224 RGB image split into 16x16 patches? How many patches are created?**
A) 14x14 = 196 patches
B) 16x16 = 256 patches
C) 8x8 = 64 patches
D) 32x32 = 1024 patches
**Answer: A) 14x14 = 196 patches**
✅ (224/16)² = 14² = 196 patches.
---
### **Q17: What is the dimension of each flattened 16x16x3 patch?**
A) 768
B) 256
C) 512
D) 1024
**Answer: A) 768**
✅ 16×16×3 = 768 dimensions.
---
### **Q18: What is the role of the linear projection layer in patch embedding?**
A) To classify the patch
B) To reduce the dimension of the flattened patch
C) To add noise
D) To increase patch size
**Answer: B) To reduce the dimension of the flattened patch**
✅ Maps the flattened 768-dim patch to the model's embedding dimension D (e.g., 768 → 512 in a smaller variant; ViT-Base keeps D = 768).
---
### **Q19: Which type of positional encoding is used in the original ViT?**
A) Sinusoidal
B) Learned
C) Random
D) None
**Answer: B) Learned**
✅ ViT uses **learned positional embeddings**, not fixed sinusoidal.
---
### **Q20: What is the total sequence length after adding the [CLS] token to 196 patches?**
A) 196
B) 197
C) 195
D) 200
**Answer: B) 197**
✅ 196 patches + 1 [CLS] token = 197.
---
### **Q21: Which component enables ViT to model long-range dependencies?**
A) Pooling
B) Self-Attention
C) Convolution
D) Dropout
**Answer: B) Self-Attention**
✅ Self-attention allows any patch to attend to any other patch.
---
### **Q22: What is the purpose of Layer Normalization in ViT?**
A) To classify images
B) To stabilize training by normalizing activations
C) To reduce image size
D) To add positional info
**Answer: B) To stabilize training by normalizing activations**
✅ Improves training stability and convergence.
---
### **Q23: Which of the following is NOT part of a Transformer encoder block?**
A) Multi-Head Attention
B) Feed-Forward Network
C) Batch Normalization
D) Residual Connection
**Answer: C) Batch Normalization**
✅ ViT uses **LayerNorm**, not BatchNorm.
---
### **Q24: What is the function of the feed-forward network in a Transformer block?**
A) To compute attention weights
B) To apply non-linear transformations to each token
C) To add positional encoding
D) To reduce patch size
**Answer: B) To apply non-linear transformations to each token**
✅ Typically a two-layer MLP with GELU activation.
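A minimal PyTorch sketch of this block (the `TransformerMLP` name is illustrative; 3072 = 4×768 is the standard expansion ratio for ViT-Base):

```python
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Two-layer feed-forward network applied independently to every token."""
    def __init__(self, dim=768, hidden_dim=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),   # expand: 768 -> 3072
            nn.GELU(),                    # non-linearity used in ViT
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),   # project back: 3072 -> 768
            nn.Dropout(dropout),
        )

    def forward(self, x):                 # x: (B, N, D)
        return self.net(x)
```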
---
### **Q25: How many Transformer encoder blocks are in ViT-Base?**
A) 6
B) 8
C) 12
D) 24
**Answer: C) 12**
✅ ViT-Base has 12 layers.
---
### **Q26: What is the embedding dimension (D) in ViT-Base?**
A) 256
B) 512
C) 768
D) 1024
**Answer: C) 768**
✅ ViT-Base uses D = 768.
---
### **Q27: Which of the following is used to prevent overfitting in ViT?**
A) Max Pooling
B) Dropout
C) Stride
D) Zero Padding
**Answer: B) Dropout**
✅ Applied in attention and MLP layers.
---
### **Q28: What is the role of the [SEP] token in ViT?**
A) To separate patches
B) To mark the end of sequence
C) ViT does not use [SEP] token
D) To classify the image
**Answer: C) ViT does not use [SEP] token**
✅ [SEP] is from BERT; ViT only uses [CLS].
---
### **Q29: What is the purpose of residual connections in ViT?**
A) To reduce model size
B) To allow gradients to flow easily through deep networks
C) To add positional encoding
D) To classify images
**Answer: B) To allow gradients to flow easily through deep networks**
✅ Helps with training deep models.
---
### **Q30: Which of the following best describes ViT's inductive bias?**
A) Strong locality and translation invariance
B) Weak — learns from data
C) Fixed receptive field
D) Hierarchical feature extraction
**Answer: B) Weak — learns from data**
✅ Unlike CNNs, ViT has minimal built-in bias.
---
## 🔹 **Attention & Transformers (Q31–Q45)**
---
### **Q31: What are the three matrices used in self-attention?**
A) Input, Output, Hidden
B) Query, Key, Value
C) Weight, Bias, Gradient
D) Patch, Token, Embedding
**Answer: B) Query, Key, Value**
✅ Q, K, and V are linear projections of the input tokens.
---
### **Q32: What is the formula for scaled dot-product attention?**
A) softmax(QK^T) V
B) softmax(QK^T / √d_k) V
C) softmax(Q + K) V
D) QK^T V
**Answer: B) softmax(QK^T / √d_k) V**
✅ Scaling by √d_k keeps the dot products from growing too large and saturating the softmax.
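The formula translates almost line-for-line into PyTorch. A minimal sketch (real implementations add masking, dropout, and fused kernels):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V for tensors of shape (B, N, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, N, N)
    weights = scores.softmax(dim=-1)                     # each row sums to 1
    return weights @ v                                   # (B, N, d_k)
```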
---
### **Q33: What is the purpose of multi-head attention?**
A) To reduce model size
B) To allow the model to attend to information from different representation subspaces
C) To add positional encoding
D) To classify images
**Answer: B) To allow the model to attend to information from different representation subspaces**
✅ Each head learns different attention patterns.
---
### **Q34: What is the computational complexity of self-attention with respect to sequence length N?**
A) O(N)
B) O(N log N)
C) O(N²)
D) O(1)
**Answer: C) O(N²)**
✅ Computing QK^T produces an N×N attention matrix, so the cost grows quadratically with sequence length.
---
### **Q35: Which operation is used to combine multi-head outputs?**
A) Addition
B) Concatenation followed by linear projection
C) Averaging
D) Max pooling
**Answer: B) Concatenation followed by linear projection**
✅ Heads are concatenated and projected back to D.
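A minimal sketch of the split-attend-concatenate-project flow (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Split tokens into heads, attend per head, concatenate, project back to D."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads                  # 768 / 12 = 64
        self.qkv = nn.Linear(dim, dim * 3)
        self.out_proj = nn.Linear(dim, dim)               # final linear projection

    def forward(self, x):                                 # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = scores.softmax(dim=-1) @ v                  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, D)        # concatenate heads
        return self.out_proj(out)                         # project back to D
```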
---
### **Q36: What is the role of the key (K) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports
D) What the model outputs
**Answer: B) What the model contains**
✅ Keys represent content; queries represent what to look for.
---
### **Q37: What is the role of the query (Q) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports
D) What the model outputs
**Answer: A) What the model is looking for**
✅ Queries determine attention focus.
---
### **Q38: What is the role of the value (V) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports when attended to
D) What the model outputs
**Answer: C) What the model reports when attended to**
✅ Values are aggregated based on attention weights.
---
### **Q39: What is the output dimension of multi-head attention?**
A) Number of heads × head dimension
B) Embedding dimension D
C) Sequence length N
D) Patch size P
**Answer: B) Embedding dimension D**
✅ Output is projected back to D.
---
### **Q40: Which activation function is commonly used in the MLP of ViT?**
A) ReLU
B) Sigmoid
C) GELU
D) Tanh
**Answer: C) GELU**
✅ GELU (Gaussian Error Linear Unit) is used in ViT.
---
### **Q41: What is the purpose of the softmax in attention?**
A) To classify
B) To normalize attention weights to sum to 1
C) To reduce dimension
D) To add noise
**Answer: B) To normalize attention weights to sum to 1**
✅ Creates a probability distribution over tokens.
---
### **Q42: Which of the following is NOT a benefit of self-attention?**
A) Global context
B) Parallel processing
C) Fixed receptive field
D) Long-range dependency modeling
**Answer: C) Fixed receptive field**
✅ Self-attention has a **global** receptive field (every token can attend to all others), not a fixed one.
---
### **Q43: What is the main drawback of self-attention in ViT?**
A) Slow on GPUs
B) O(N²) complexity
C) Cannot handle color
D) Requires labels
**Answer: B) O(N²) complexity**
✅ The quadratic cost in the number of patches limits scaling to high-resolution images.
---
### **Q44: How does ViT handle variable-sized images?**
A) Uses padding only
B) Resizes images to fixed size
C) Uses adaptive pooling
D) Cannot handle variable sizes
**Answer: B) Resizes images to fixed size**
✅ Standard practice: resize to 224x224.
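A typical preprocessing pipeline with torchvision (the normalization statistics shown are one common choice; check the preprocessing config of the specific checkpoint you use):

```python
from torchvision import transforms

# Resize every image to the fixed training resolution, then normalize.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # common ImageNet statistics
                         std=[0.229, 0.224, 0.225]),   # (checkpoint-dependent)
])
```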
---
### **Q45: Which of the following models uses self-attention for object detection?**
A) YOLO
B) Faster R-CNN
C) DETR
D) SSD
**Answer: C) DETR**
✅ DETR uses Transformers for end-to-end detection.
---
## 🔹 **Training & Optimization (Q46–Q55)**
---
### **Q46: What is the recommended way to fine-tune ViT on small datasets?**
A) Train from scratch
B) Full fine-tuning
C) Feature extraction or partial fine-tuning
D) Use only MLP head
**Answer: C) Feature extraction or partial fine-tuning**
✅ Freeze early layers, fine-tune last few.
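A sketch of partial fine-tuning with the Hugging Face ViT implementation (the checkpoint name and the choice of unfreezing the last two blocks are illustrative):

```python
from transformers import ViTForImageClassification

# Freeze the backbone, keep the last few encoder blocks and the new head trainable.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10
)

for param in model.vit.parameters():          # freeze the whole encoder
    param.requires_grad = False

for block in model.vit.encoder.layer[-2:]:    # unfreeze the last 2 blocks
    for param in block.parameters():
        param.requires_grad = True
# model.classifier (the newly initialized head) stays trainable by default.
```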
---
### **Q47: Which library is commonly used to load pretrained ViT models?**
A) scikit-learn
B) Hugging Face Transformers
C) OpenCV
D) Matplotlib
**Answer: B) Hugging Face Transformers**
✅ `from transformers import ViTForImageClassification`
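For example, loading a pretrained checkpoint and classifying a single image (the image path is a placeholder):

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

# Load a public ViT checkpoint fine-tuned on ImageNet-1k and classify one image.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                        # placeholder path
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits                      # (1, 1000) class logits
print(model.config.id2label[logits.argmax(-1).item()])
```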
---
### **Q48: What is knowledge distillation in the context of ViT?**
A) Removing patches
B) Training a small student model from a large teacher
C) Adding noise
D) Reducing image size
**Answer: B) Training a small student model from a large teacher**
✅ Used in TinyViT.
---
### **Q49: Which technique reduces ViT model size by converting weights to 8-bit integers?**
A) Pruning
B) Quantization
C) Distillation
D) Clustering
**Answer: B) Quantization**
✅ INT8 quantization reduces size and speeds up inference.
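A minimal post-training dynamic-quantization sketch with PyTorch, which stores the Linear weights (the bulk of a ViT) as INT8 and dequantizes them on the fly at inference:

```python
import torch
import torch.nn as nn
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Replace nn.Linear modules with dynamically quantized INT8 versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```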
---
### **Q50: What is the purpose of patch masking in MAE?**
A) To classify patches
B) To reconstruct masked patches for self-supervised learning
C) To add noise
D) To reduce image size
**Answer: B) To reconstruct masked patches for self-supervised learning**
✅ MAE masks about 75% of the patches and trains the model to reconstruct them.
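A sketch of the random masking step, assuming the patches are already embedded as a `(B, N, D)` tensor (function and variable names are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patch tokens, MAE-style; the rest are masked out."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]         # indices of visible patches
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    visible = torch.gather(patches, 1, keep_idx)          # (B, num_keep, D)
    return visible
```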
---
### **Q51: Which model uses masked autoencoders for ViT pretraining?**
A) CLIP
B) MAE
C) DETR
D) MobileViT
**Answer: B) MAE**
✅ "Masked Autoencoders Are Scalable Vision Learners"
---
### **Q52: What is the benefit of using MobileViT?**
A) Higher accuracy than ViT-Base
B) Lightweight and mobile-friendly
C) Uses no attention
D) Requires no pretraining
**Answer: B) Lightweight and mobile-friendly**
✅ Hybrid CNN-Transformer for efficiency.
---
### **Q53: Which format is used to export ViT for cross-platform deployment?**
A) JSON
B) CSV
C) ONNX
D) XML
**Answer: C) ONNX**
✅ Open Neural Network Exchange.
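A minimal export sketch (the opset version and dynamic axes are illustrative and may need adjusting for your runtime):

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224").eval()
model.config.return_dict = False          # export tuple outputs instead of a ModelOutput

dummy = torch.randn(1, 3, 224, 224)       # dummy input for tracing
torch.onnx.export(
    model, dummy, "vit.onnx",
    input_names=["pixel_values"], output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=14,
)
```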
---
### **Q54: Which tool can accelerate ViT inference on NVIDIA GPUs?**
A) TensorFlow.js
B) TensorRT
C) WebNN
D) Core ML
**Answer: B) TensorRT**
✅ Optimizes models for NVIDIA hardware.
---
### **Q55: What is the main goal of MLOps in ViT deployment?**
A) To reduce image size
B) To apply DevOps principles to ML systems
C) To remove attention
D) To increase patch size
**Answer: B) To apply DevOps principles to ML systems**
✅ Includes CI/CD, monitoring, and rollback.
---
## 🔹 **Advanced & Real-World Applications (Q56–Q65)**
---
### **Q56: Which model combines vision and language for zero-shot classification?**
A) DETR
B) CLIP
C) MAE
D) MobileViT
**Answer: B) CLIP**
✅ CLIP classifies images using text prompts.
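A zero-shot classification sketch with the Hugging Face CLIP implementation (checkpoint name, image path, and prompts are illustrative):

```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Score an image against arbitrary text prompts -- no task-specific training needed.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("photo.jpg")                         # placeholder path
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```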
---
### **Q57: Which model enables few-shot visual reasoning with images and text?**
A) Flamingo
B) PaLM-E
C) TimeSformer
D) Segmenter
**Answer: A) Flamingo**
✅ Can answer questions about images with few examples.
---
### **Q58: Which model connects vision, language, and robotic action?**
A) CLIP
B) Flamingo
C) PaLM-E
D) ViT
**Answer: C) PaLM-E**
✅ Embodied AI model by Google.
---
### **Q59: Which architecture extends ViT to video understanding?**
A) DETR
B) TimeSformer
C) MobileViT
D) MAE
**Answer: B) TimeSformer**
✅ Applies attention across space and time.
---
### **Q60: Which model uses ViT for semantic segmentation?**
A) U-Net
B) Segmenter
C) YOLO
D) ResNet
**Answer: B) Segmenter**
✅ Uses ViT + mask transformer decoder.
---
### **Q61: Which next-gen architecture replaces attention with state space models?**
A) RetNet
B) Mamba
C) Transformer-XL
D) Performer
**Answer: B) Mamba**
✅ Mamba uses selective state space models, scaling as O(N) in sequence length.
---
### **Q62: In which medical application is ViT used for cancer detection in tissue samples?**
A) Radiology
B) Pathology
C) Surgery
D) Cardiology
**Answer: B) Pathology**
✅ Whole slide image analysis.
---
### **Q63: Which metric is critical for monitoring ViT in production?**
A) Image resolution
B) Prediction latency
C) Patch size
D) Number of heads
**Answer: B) Prediction latency**
✅ Must be low for real-time apps.
---
### **Q64: What is the purpose of A/B testing in ViT deployment?**
A) To compare different patch sizes
B) To compare old and new models on live traffic
C) To reduce model size
D) To increase image size
**Answer: B) To compare old and new models on live traffic**
✅ Verifies that the new model actually performs better before a full rollout.
---
### **Q65: Which tool is used for model tracking and experiment management?**
A) Git
B) MLflow
C) Docker
D) Kubernetes
**Answer: B) MLflow**
✅ Tracks parameters, metrics, models.
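A minimal tracking sketch (the run name, parameters, and logged values are placeholders):

```python
import mlflow

# Log hyperparameters and per-epoch metrics for a ViT fine-tuning run.
with mlflow.start_run(run_name="vit-base-finetune"):
    mlflow.log_param("patch_size", 16)
    mlflow.log_param("learning_rate", 3e-4)
    for epoch, acc in enumerate([0.81, 0.85, 0.88]):    # placeholder metric values
        mlflow.log_metric("val_accuracy", acc, step=epoch)
```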
---
## ✅ **Answer Key & Explanations**
| Q | Answer | Explanation |
|----|--------|-------------|
| 1 | C | Google Research introduced ViT |
| 2 | B | Patches as tokens |
| 3 | C | Based on Transformer |
| 4 | C | [CLS] stores global representation |
| 5 | B | 16x16 patches |
| 6 | B | No convolutions |
| 7 | B | Adds spatial info |
| 8 | B | Patches + pos encoding |
| 9 | B | 16x16 patch size |
| 10 | B | Class logits |
| 11 | C | JFT-300M for pretraining |
| 12 | B | Global context |
| 13 | B | Needs large data |
| 14 | B | MobileViT is hybrid |
| 15 | C | Classifies [CLS] |
| 16 | A | 14x14=196 |
| 17 | A | 16*16*3=768 |
| 18 | B | Dimension reduction |
| 19 | B | Learned embeddings |
| 20 | B | 196+1=197 |
| 21 | B | Self-attention |
| 22 | B | Stabilizes training |
| 23 | C | Uses LayerNorm |
| 24 | B | Non-linear transform |
| 25 | C | 12 layers |
| 26 | C | D=768 |
| 27 | B | Dropout prevents overfitting |
| 28 | C | No [SEP] in ViT |
| 29 | B | Helps gradient flow |
| 30 | B | Weak inductive bias |
| 31 | B | Q, K, V |
| 32 | B | Scaled dot-product |
| 33 | B | Multiple attention heads |
| 34 | C | O(N²) complexity |
| 35 | B | Concat + project |
| 36 | B | Key = content |
| 37 | A | Query = what to look for |
| 38 | C | Value = what to report |
| 39 | B | Output dim = D |
| 40 | C | GELU activation |
| 41 | B | Normalizes weights |
| 42 | C | Self-attention has global context |
| 43 | B | O(N²) is costly |
| 44 | B | Resize to fixed size |
| 45 | C | DETR uses Transformers |
| 46 | C | Partial fine-tuning |
| 47 | B | Hugging Face |
| 48 | B | Student from teacher |
| 49 | B | INT8 quantization |
| 50 | B | Reconstruct masked patches |
| 51 | B | MAE for pretraining |
| 52 | B | Mobile-friendly |
| 53 | C | ONNX for export |
| 54 | B | TensorRT for NVIDIA |
| 55 | B | MLOps = ML + DevOps |
| 56 | B | CLIP for zero-shot |
| 57 | A | Flamingo for few-shot |
| 58 | C | PaLM-E for robotics |
| 59 | B | TimeSformer for video |
| 60 | B | Segmenter for segmentation |
| 61 | B | Mamba replaces attention |
| 62 | B | Pathology for cancer |
| 63 | B | Latency critical |
| 64 | B | Compare models |
| 65 | B | MLflow for tracking |
---
✅ **You're now fully prepared** for any **Vision Transformer interview or exam**.
#ViT #MCQ #VisionTransformer #DeepLearning #AI #ComputerVision #InterviewQuestions #MachineLearning