# 🌟 **65+ Vision Transformer (ViT) Multiple Choice Questions (MCQs) with Answers**
**#VisionTransformer #ViT #DeepLearning #ComputerVision #Transformers #AI #MachineLearning #MCQ #InterviewPrep**
---
## 🔹 **Table of Contents**
1. [Basic Concepts (Q1–Q15)](#basic-concepts-q1–q15)
2. [Architecture & Components (Q16–Q30)](#architecture--components-q16–q30)
3. [Attention & Transformers (Q31–Q45)](#attention--transformers-q31–q45)
4. [Training & Optimization (Q46–Q55)](#training--optimization-q46–q55)
5. [Advanced & Real-World Applications (Q56–Q65)](#advanced--real-world-applications-q56–q65)
6. [Answer Key & Explanations](#answer-key--explanations)
---
## 🔹 **Basic Concepts (Q1–Q15)**
---
### **Q1: Who introduced the Vision Transformer (ViT)?**
A) Facebook AI
B) OpenAI
C) Google Research
D) DeepMind
**Answer: C) Google Research**
✅ The ViT paper was published by Alexey Dosovitskiy et al. at Google Research in 2020.
---
### **Q2: What is the core idea behind ViT?**
A) Use CNNs for feature extraction
B) Treat image patches as tokens like words in NLP
C) Use RNNs for spatial modeling
D) Apply GANs to image generation
**Answer: B) Treat image patches as tokens like words in NLP**
✅ ViT splits an image into patches and processes them as a sequence using a Transformer.
---
### **Q3: Which NLP model inspired ViT?**
A) BERT
B) GPT
C) Transformer
D) LSTM
**Answer: C) Transformer**
✅ ViT is based on the **Transformer architecture** from *"Attention Is All You Need"*.
---
### **Q4: What is the purpose of the [CLS] token in ViT?**
A) To mark the end of the sequence
B) To store the final class prediction
C) To store global image representation for classification
D) To separate patches
**Answer: C) To store global image representation for classification**
✅ The [CLS] token aggregates information from all patches and is used for classification.
---
### **Q5: What does "An Image is Worth 16x16 Words" refer to?**
A) Image compression
B) Splitting an image into 16x16 patches
C) Using 16x16 filters in CNNs
D) Text-to-image generation
**Answer: B) Splitting an image into 16x16 patches**
✅ It means each 16×16 patch is treated as a "word" in the sequence.
---
### **Q6: Which of the following is NOT a component of ViT?**
A) Patch Embedding
B) Convolutional Layers
C) Transformer Encoder
D) Positional Encoding
**Answer: B) Convolutional Layers**
✅ ViT is **purely attention-based** — no convolutions in the original architecture.
---
### **Q7: What is the role of positional encoding in ViT?**
A) To classify the image
B) To provide spatial information to patches
C) To reduce model size
D) To increase patch size
**Answer: B) To provide spatial information to patches**
✅ Since Transformers are permutation-equivariant, positional encoding adds location info.
---
### **Q8: What is the input to the Transformer encoder in ViT?**
A) Raw pixel values
B) Flattened patches + positional encoding
C) CNN feature maps
D) Frequency domain data
**Answer: B) Flattened patches + positional encoding**
✅ Patches are flattened, embedded, and combined with positional encodings.
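The whole input pipeline fits in a few lines of PyTorch. Below is a minimal sketch (the `PatchEmbedding` class name and the unfold-based patching are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, flatten, project, add [CLS] + positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2           # 14 * 14 = 196
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)   # 768 -> D
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 197 positions

    def forward(self, x):                                      # x: (B, 3, 224, 224)
        B, C, _, _ = x.shape
        p = self.patch_size
        x = x.unfold(2, p, p).unfold(3, p, p)                  # (B, C, 14, 14, 16, 16)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)  # (B, 196, 768)
        x = self.proj(x)                                       # (B, 196, D)
        cls = self.cls_token.expand(B, -1, -1)                 # (B, 1, D)
        x = torch.cat([cls, x], dim=1)                         # (B, 197, D)
        return x + self.pos_embed                              # add positional encoding
```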
---
### **Q9: What is the typical patch size used in ViT-Base?**
A) 8x8
B) 16x16
C) 32x32
D) 64x64
**Answer: B) 16x16**
✅ ViT-Base/16 uses 16×16 patches.
---
### **Q10: What is the output of the ViT model?**
A) Reconstructed image
B) Class logits
C) Bounding boxes
D) Semantic mask
**Answer: B) Class logits**
✅ The final output is logits for classification after the MLP head.
---
### **Q11: Which dataset was ViT pretrained on to achieve SOTA results?**
A) CIFAR-10
B) ImageNet
C) JFT-300M
D) MNIST
**Answer: C) JFT-300M**
✅ Pretraining on **JFT-300M** (300M images) enabled ViT to outperform CNNs.
---
### **Q12: What is the main advantage of ViT over CNNs?**
A) Faster inference
B) Global context modeling
C) Lower memory usage
D) Simpler architecture
**Answer: B) Global context modeling**
✅ Self-attention sees all patches at once, unlike CNNs with limited receptive fields.
---
### **Q13: What is the main limitation of ViT on small datasets?**
A) Too slow
B) Overfitting due to lack of inductive bias
C) Cannot handle color images
D) Requires GPU
**Answer: B) Overfitting due to lack of inductive bias**
✅ ViT lacks CNN’s built-in locality bias, so it needs large data to generalize.
---
### **Q14: Which of the following is a hybrid model combining CNN and Transformer?**
A) ResNet
B) MobileViT
C) EfficientNet
D) AlexNet
**Answer: B) MobileViT**
✅ MobileViT uses CNNs for local features and Transformers for global context.
---
### **Q15: What is the purpose of the MLP head in ViT?**
A) To extract patches
B) To add positional encoding
C) To classify the [CLS] token
D) To reduce patch size
**Answer: C) To classify the [CLS] token**
✅ The MLP head takes the final [CLS] token and outputs class probabilities.
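A minimal sketch of that last step, using a stand-in tensor for the encoder output:

```python
import torch
import torch.nn as nn

# Sketch: the classification head reads only the final [CLS] token (position 0).
tokens = torch.randn(2, 197, 768)     # stand-in for the encoder output, shape (B, N+1, D)
mlp_head = nn.Linear(768, 1000)       # D -> num_classes
logits = mlp_head(tokens[:, 0])       # (B, 1000) class logits
```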
---
## 🔹 **Architecture & Components (Q16–Q30)**
---
### **Q16: How is a 224x224 RGB image split into 16x16 patches? How many patches are created?**
A) 14x14 = 196 patches
B) 16x16 = 256 patches
C) 8x8 = 64 patches
D) 32x32 = 1024 patches
**Answer: A) 14x14 = 196 patches**
✅ (224/16)² = 14² = 196 patches.
---
### **Q17: What is the dimension of each flattened 16x16x3 patch?**
A) 768
B) 256
C) 512
D) 1024
**Answer: A) 768**
✅ 16×16×3 = 768 dimensions.
---
### **Q18: What is the role of the linear projection layer in patch embedding?**
A) To classify the patch
B) To reduce the dimension of the flattened patch
C) To add noise
D) To increase patch size
**Answer: B) To reduce the dimension of the flattened patch**
✅ Maps the flattened 768-dim patch to the model's embedding dimension D (e.g., 768 → 512 in a smaller variant; ViT-Base keeps D = 768).
---
### **Q19: Which type of positional encoding is used in the original ViT?**
A) Sinusoidal
B) Learned
C) Random
D) None
**Answer: B) Learned**
✅ ViT uses **learned positional embeddings**, not fixed sinusoidal.
---
### **Q20: What is the total sequence length after adding the [CLS] token to 196 patches?**
A) 196
B) 197
C) 195
D) 200
**Answer: B) 197**
✅ 196 patches + 1 [CLS] token = 197.
---
### **Q21: Which component enables ViT to model long-range dependencies?**
A) Pooling
B) Self-Attention
C) Convolution
D) Dropout
**Answer: B) Self-Attention**
✅ Self-attention allows any patch to attend to any other patch.
---
### **Q22: What is the purpose of Layer Normalization in ViT?**
A) To classify images
B) To stabilize training by normalizing activations
C) To reduce image size
D) To add positional info
**Answer: B) To stabilize training by normalizing activations**
✅ Improves training stability and convergence.
---
### **Q23: Which of the following is NOT part of a Transformer encoder block?**
A) Multi-Head Attention
B) Feed-Forward Network
C) Batch Normalization
D) Residual Connection
**Answer: C) Batch Normalization**
✅ ViT uses **LayerNorm**, not BatchNorm.
---
### **Q24: What is the function of the feed-forward network in a Transformer block?**
A) To compute attention weights
B) To apply non-linear transformations to each token
C) To add positional encoding
D) To reduce patch size
**Answer: B) To apply non-linear transformations to each token**
✅ Typically a two-layer MLP with GELU activation.
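A minimal PyTorch sketch of this block (the `TransformerMLP` name is illustrative; 3072 = 4×768 is the standard expansion ratio for ViT-Base):

```python
import torch.nn as nn

class TransformerMLP(nn.Module):
    """Two-layer feed-forward network applied independently to every token."""
    def __init__(self, dim=768, hidden_dim=3072, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),   # expand: 768 -> 3072
            nn.GELU(),                    # non-linearity used in ViT
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),   # project back: 3072 -> 768
            nn.Dropout(dropout),
        )

    def forward(self, x):                 # x: (B, N, D)
        return self.net(x)
```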
---
### **Q25: How many Transformer encoder blocks are in ViT-Base?**
A) 6
B) 8
C) 12
D) 24
**Answer: C) 12**
✅ ViT-Base has 12 layers.
---
### **Q26: What is the embedding dimension (D) in ViT-Base?**
A) 256
B) 512
C) 768
D) 1024
**Answer: C) 768**
✅ ViT-Base uses D = 768.
---
### **Q27: Which of the following is used to prevent overfitting in ViT?**
A) Max Pooling
B) Dropout
C) Stride
D) Zero Padding
**Answer: B) Dropout**
✅ Applied in attention and MLP layers.
---
### **Q28: What is the role of the [SEP] token in ViT?**
A) To separate patches
B) To mark the end of sequence
C) ViT does not use [SEP] token
D) To classify the image
**Answer: C) ViT does not use [SEP] token**
✅ [SEP] is from BERT; ViT only uses [CLS].
---
### **Q29: What is the purpose of residual connections in ViT?**
A) To reduce model size
B) To allow gradients to flow easily through deep networks
C) To add positional encoding
D) To classify images
**Answer: B) To allow gradients to flow easily through deep networks**
✅ Helps with training deep models.
---
### **Q30: Which of the following best describes ViT's inductive bias?**
A) Strong locality and translation invariance
B) Weak — learns from data
C) Fixed receptive field
D) Hierarchical feature extraction
**Answer: B) Weak — learns from data**
✅ Unlike CNNs, ViT has minimal built-in bias.
---
## 🔹 **Attention & Transformers (Q31–Q45)**
---
### **Q31: What are the three matrices used in self-attention?**
A) Input, Output, Hidden
B) Query, Key, Value
C) Weight, Bias, Gradient
D) Patch, Token, Embedding
**Answer: B) Query, Key, Value**
✅ Q, K, and V are linear projections of the input tokens.
---
### **Q32: What is the formula for scaled dot-product attention?**
A) softmax(QK^T) V
B) softmax(QK^T / √d_k) V
C) softmax(Q + K) V
D) QK^T V
**Answer: B) softmax(QK^T / √d_k) V**
✅ Scaling by √d_k keeps the dot products from growing too large and saturating the softmax.
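The formula translates almost line-for-line into PyTorch. A minimal sketch (real implementations add masking, dropout, and fused kernels):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V for tensors of shape (B, N, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (B, N, N)
    weights = scores.softmax(dim=-1)                     # each row sums to 1
    return weights @ v                                   # (B, N, d_k)
```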
---
### **Q33: What is the purpose of multi-head attention?**
A) To reduce model size
B) To allow the model to attend to information from different representation subspaces
C) To add positional encoding
D) To classify images
**Answer: B) To allow the model to attend to information from different representation subspaces**
✅ Each head learns different attention patterns.
---
### **Q34: What is the computational complexity of self-attention with respect to sequence length N?**
A) O(N)
B) O(N log N)
C) O(N²)
D) O(1)
**Answer: C) O(N²)**
✅ Computing QK^T produces an N×N attention matrix, so the cost grows quadratically with sequence length.
---
### **Q35: Which operation is used to combine multi-head outputs?**
A) Addition
B) Concatenation followed by linear projection
C) Averaging
D) Max pooling
**Answer: B) Concatenation followed by linear projection**
✅ Heads are concatenated and projected back to D.
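A minimal sketch of the split-attend-concatenate-project flow (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Split tokens into heads, attend per head, concatenate, project back to D."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads                  # 768 / 12 = 64
        self.qkv = nn.Linear(dim, dim * 3)
        self.out_proj = nn.Linear(dim, dim)               # final linear projection

    def forward(self, x):                                 # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = scores.softmax(dim=-1) @ v                  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, D)        # concatenate heads
        return self.out_proj(out)                         # project back to D
```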
---
### **Q36: What is the role of the key (K) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports
D) What the model outputs
**Answer: B) What the model contains**
✅ Keys represent content; queries represent what to look for.
---
### **Q37: What is the role of the query (Q) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports
D) What the model outputs
**Answer: A) What the model is looking for**
✅ Queries determine attention focus.
---
### **Q38: What is the role of the value (V) in attention?**
A) What the model is looking for
B) What the model contains
C) What the model reports when attended to
D) What the model outputs
**Answer: C) What the model reports when attended to**
✅ Values are aggregated based on attention weights.
---
### **Q39: What is the output dimension of multi-head attention?**
A) Number of heads × head dimension
B) Embedding dimension D
C) Sequence length N
D) Patch size P
**Answer: B) Embedding dimension D**
✅ Output is projected back to D.
---
### **Q40: Which activation function is commonly used in the MLP of ViT?**
A) ReLU
B) Sigmoid
C) GELU
D) Tanh
**Answer: C) GELU**
✅ GELU (Gaussian Error Linear Unit) is used in ViT.
---
### **Q41: What is the purpose of the softmax in attention?**
A) To classify
B) To normalize attention weights to sum to 1
C) To reduce dimension
D) To add noise
**Answer: B) To normalize attention weights to sum to 1**
✅ Creates a probability distribution over tokens.
---
### **Q42: Which of the following is NOT a benefit of self-attention?**
A) Global context
B) Parallel processing
C) Fixed receptive field
D) Long-range dependency modeling
**Answer: C) Fixed receptive field**
✅ Self-attention has a **global** receptive field (every token can attend to all others), not a fixed one.
---
### **Q43: What is the main drawback of self-attention in ViT?**
A) Slow on GPUs
B) O(N²) complexity
C) Cannot handle color
D) Requires labels
**Answer: B) O(N²) complexity**
✅ The quadratic cost in the number of patches limits scaling to high-resolution images.
---
### **Q44: How does ViT handle variable-sized images?**
A) Uses padding only
B) Resizes images to fixed size
C) Uses adaptive pooling
D) Cannot handle variable sizes
**Answer: B) Resizes images to fixed size**
✅ Standard practice: resize to 224x224.
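A typical preprocessing pipeline with torchvision (the normalization statistics shown are one common choice; check the preprocessing config of the specific checkpoint you use):

```python
from torchvision import transforms

# Resize every image to the fixed training resolution, then normalize.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # common ImageNet statistics
                         std=[0.229, 0.224, 0.225]),   # (checkpoint-dependent)
])
```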
---
### **Q45: Which of the following models uses self-attention for object detection?**
A) YOLO
B) Faster R-CNN
C) DETR
D) SSD
**Answer: C) DETR**
✅ DETR uses Transformers for end-to-end detection.
---
## 🔹 **Training & Optimization (Q46–Q55)**
---
### **Q46: What is the recommended way to fine-tune ViT on small datasets?**
A) Train from scratch
B) Full fine-tuning
C) Feature extraction or partial fine-tuning
D) Use only MLP head
**Answer: C) Feature extraction or partial fine-tuning**
✅ Freeze early layers, fine-tune last few.
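A sketch of partial fine-tuning with the Hugging Face ViT implementation (the checkpoint name and the choice of unfreezing the last two blocks are illustrative):

```python
from transformers import ViTForImageClassification

# Freeze the backbone, keep the last few encoder blocks and the new head trainable.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=10
)

for param in model.vit.parameters():          # freeze the whole encoder
    param.requires_grad = False

for block in model.vit.encoder.layer[-2:]:    # unfreeze the last 2 blocks
    for param in block.parameters():
        param.requires_grad = True
# model.classifier (the newly initialized head) stays trainable by default.
```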
---
### **Q47: Which library is commonly used to load pretrained ViT models?**
A) scikit-learn
B) Hugging Face Transformers
C) OpenCV
D) Matplotlib
**Answer: B) Hugging Face Transformers**
✅ `from transformers import ViTForImageClassification`
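For example, loading a pretrained checkpoint and classifying a single image (the image path is a placeholder):

```python
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

# Load a public ViT checkpoint fine-tuned on ImageNet-1k and classify one image.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("cat.jpg")                        # placeholder path
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits                      # (1, 1000) class logits
print(model.config.id2label[logits.argmax(-1).item()])
```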
---
### **Q48: What is knowledge distillation in the context of ViT?**
A) Removing patches
B) Training a small student model from a large teacher
C) Adding noise
D) Reducing image size
**Answer: B) Training a small student model from a large teacher**
✅ Used in TinyViT.
---
### **Q49: Which technique reduces ViT model size by converting weights to 8-bit integers?**
A) Pruning
B) Quantization
C) Distillation
D) Clustering
**Answer: B) Quantization**
✅ INT8 quantization reduces size and speeds up inference.
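A minimal post-training dynamic-quantization sketch with PyTorch, which stores the Linear weights (the bulk of a ViT) as INT8 and dequantizes them on the fly at inference:

```python
import torch
import torch.nn as nn
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Replace nn.Linear modules with dynamically quantized INT8 versions.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```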
---
### **Q50: What is the purpose of patch masking in MAE?**
A) To classify patches
B) To reconstruct masked patches for self-supervised learning
C) To add noise
D) To reduce image size
**Answer: B) To reconstruct masked patches for self-supervised learning**
✅ MAE masks about 75% of the patches and trains the model to reconstruct them.
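A sketch of the random masking step, assuming the patches are already embedded as a `(B, N, D)` tensor (function and variable names are illustrative):

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patch tokens, MAE-style; the rest are masked out."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                              # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]         # indices of visible patches
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
    visible = torch.gather(patches, 1, keep_idx)          # (B, num_keep, D)
    return visible
```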
---
### **Q51: Which model uses masked autoencoders for ViT pretraining?**
A) CLIP
B) MAE
C) DETR
D) MobileViT
**Answer: B) MAE**
✅ "Masked Autoencoders Are Scalable Vision Learners"
---
### **Q52: What is the benefit of using MobileViT?**
A) Higher accuracy than ViT-Base
B) Lightweight and mobile-friendly
C) Uses no attention
D) Requires no pretraining
**Answer: B) Lightweight and mobile-friendly**
✅ Hybrid CNN-Transformer for efficiency.
---
### **Q53: Which format is used to export ViT for cross-platform deployment?**
A) JSON
B) CSV
C) ONNX
D) XML
**Answer: C) ONNX**
✅ Open Neural Network Exchange.
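A minimal export sketch (the opset version and dynamic axes are illustrative and may need adjusting for your runtime):

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224").eval()
model.config.return_dict = False          # export tuple outputs instead of a ModelOutput

dummy = torch.randn(1, 3, 224, 224)       # dummy input for tracing
torch.onnx.export(
    model, dummy, "vit.onnx",
    input_names=["pixel_values"], output_names=["logits"],
    dynamic_axes={"pixel_values": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=14,
)
```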
---
### **Q54: Which tool can accelerate ViT inference on NVIDIA GPUs?**
A) TensorFlow.js
B) TensorRT
C) WebNN
D) Core ML
**Answer: B) TensorRT**
✅ Optimizes models for NVIDIA hardware.
---
### **Q55: What is the main goal of MLOps in ViT deployment?**
A) To reduce image size
B) To apply DevOps principles to ML systems
C) To remove attention
D) To increase patch size
**Answer: B) To apply DevOps principles to ML systems**
✅ Includes CI/CD, monitoring, and rollback.
---
## 🔹 **Advanced & Real-World Applications (Q56–Q65)**
---
### **Q56: Which model combines vision and language for zero-shot classification?**
A) DETR
B) CLIP
C) MAE
D) MobileViT
**Answer: B) CLIP**
✅ CLIP classifies images using text prompts.
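A zero-shot classification sketch with the Hugging Face CLIP implementation (checkpoint name, image path, and prompts are illustrative):

```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

# Score an image against arbitrary text prompts -- no task-specific training needed.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("photo.jpg")                         # placeholder path
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```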
---
### **Q57: Which model enables few-shot visual reasoning with images and text?**
A) Flamingo
B) PaLM-E
C) TimeSformer
D) Segmenter
**Answer: A) Flamingo**
✅ Can answer questions about images with few examples.
---
### **Q58: Which model connects vision, language, and robotic action?**
A) CLIP
B) Flamingo
C) PaLM-E
D) ViT
**Answer: C) PaLM-E**
✅ Embodied AI model by Google.
---
### **Q59: Which architecture extends ViT to video understanding?**
A) DETR
B) TimeSformer
C) MobileViT
D) MAE
**Answer: B) TimeSformer**
✅ Applies attention across space and time.
---
### **Q60: Which model uses ViT for semantic segmentation?**
A) U-Net
B) Segmenter
C) YOLO
D) ResNet
**Answer: B) Segmenter**
✅ Uses ViT + mask transformer decoder.
---
### **Q61: Which next-gen architecture replaces attention with state space models?**
A) RetNet
B) Mamba
C) Transformer-XL
D) Performer
**Answer: B) Mamba**
✅ Mamba uses selective state space models, scaling as O(N) in sequence length.
---
### **Q62: In which medical application is ViT used for cancer detection in tissue samples?**
A) Radiology
B) Pathology
C) Surgery
D) Cardiology
**Answer: B) Pathology**
✅ Whole slide image analysis.
---
### **Q63: Which metric is critical for monitoring ViT in production?**
A) Image resolution
B) Prediction latency
C) Patch size
D) Number of heads
**Answer: B) Prediction latency**
✅ Must be low for real-time apps.
---
### **Q64: What is the purpose of A/B testing in ViT deployment?**
A) To compare different patch sizes
B) To compare old and new models on live traffic
C) To reduce model size
D) To increase image size
**Answer: B) To compare old and new models on live traffic**
✅ Verifies that the new model actually performs better before a full rollout.
---
### **Q65: Which tool is used for model tracking and experiment management?**
A) Git
B) MLflow
C) Docker
D) Kubernetes
**Answer: B) MLflow**
✅ Tracks parameters, metrics, models.
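A minimal tracking sketch (the run name, parameters, and logged values are placeholders):

```python
import mlflow

# Log hyperparameters and per-epoch metrics for a ViT fine-tuning run.
with mlflow.start_run(run_name="vit-base-finetune"):
    mlflow.log_param("patch_size", 16)
    mlflow.log_param("learning_rate", 3e-4)
    for epoch, acc in enumerate([0.81, 0.85, 0.88]):    # placeholder metric values
        mlflow.log_metric("val_accuracy", acc, step=epoch)
```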
---
## ✅ **Answer Key & Explanations**
| Q | Answer | Explanation |
|----|--------|-------------|
| 1 | C | Google Research introduced ViT |
| 2 | B | Patches as tokens |
| 3 | C | Based on Transformer |
| 4 | C | [CLS] stores global representation |
| 5 | B | 16x16 patches |
| 6 | B | No convolutions |
| 7 | B | Adds spatial info |
| 8 | B | Patches + pos encoding |
| 9 | B | 16x16 patch size |
| 10 | B | Class logits |
| 11 | C | JFT-300M for pretraining |
| 12 | B | Global context |
| 13 | B | Needs large data |
| 14 | B | MobileViT is hybrid |
| 15 | C | Classifies [CLS] |
| 16 | A | 14x14=196 |
| 17 | A | 16*16*3=768 |
| 18 | B | Dimension reduction |
| 19 | B | Learned embeddings |
| 20 | B | 196+1=197 |
| 21 | B | Self-attention |
| 22 | B | Stabilizes training |
| 23 | C | Uses LayerNorm |
| 24 | B | Non-linear transform |
| 25 | C | 12 layers |
| 26 | C | D=768 |
| 27 | B | Dropout prevents overfitting |
| 28 | C | No [SEP] in ViT |
| 29 | B | Helps gradient flow |
| 30 | B | Weak inductive bias |
| 31 | B | Q, K, V |
| 32 | B | Scaled dot-product |
| 33 | B | Multiple attention heads |
| 34 | C | O(N²) complexity |
| 35 | B | Concat + project |
| 36 | B | Key = content |
| 37 | A | Query = what to look for |
| 38 | C | Value = what to report |
| 39 | B | Output dim = D |
| 40 | C | GELU activation |
| 41 | B | Normalizes weights |
| 42 | C | Self-attention has global context |
| 43 | B | O(N²) is costly |
| 44 | B | Resize to fixed size |
| 45 | C | DETR uses Transformers |
| 46 | C | Partial fine-tuning |
| 47 | B | Hugging Face |
| 48 | B | Student from teacher |
| 49 | B | INT8 quantization |
| 50 | B | Reconstruct masked patches |
| 51 | B | MAE for pretraining |
| 52 | B | Mobile-friendly |
| 53 | C | ONNX for export |
| 54 | B | TensorRT for NVIDIA |
| 55 | B | MLOps = ML + DevOps |
| 56 | B | CLIP for zero-shot |
| 57 | A | Flamingo for few-shot |
| 58 | C | PaLM-E for robotics |
| 59 | B | TimeSformer for video |
| 60 | B | Segmenter for segmentation |
| 61 | B | Mamba replaces attention |
| 62 | B | Pathology for cancer |
| 63 | B | Latency critical |
| 64 | B | Compare models |
| 65 | B | MLflow for tracking |
---
✅ **You're now fully prepared** for any **Vision Transformer interview or exam**.
#ViT #MCQ #VisionTransformer #DeepLearning #AI #ComputerVision #InterviewQuestions #MachineLearning