---
title: Is incorporating attention rollout beneficial for pooling in Vision Transformers?
---

# Is incorporating attention rollout beneficial for pooling in Vision Transformers?

In this work, I trained two Vision Transformer (ViT) models from scratch on the CIFAR-100 dataset. The first is a baseline ViT using standard self-attention, and the second is a modified ViT that integrates attention rollout into its forward pass, as described by Abnar and Zuidema (2020).

As per my understanding, Vision Transformers treat an image as a sequence of patch embeddings and use a special classification token ([CLS]) to aggregate information from all patches via the self-attention mechanism. The output embedding of this [CLS] token is passed to a classifier to predict the image’s class. In the modified model, I incorporated attention rollout, which multiplies the attention matrices of all transformer layers (after adding identity matrices to account for residual connections) to compute the overall influence of each input patch on the final output. This rolled-out attention is used to derive the image representation instead of relying solely on the [CLS] token.

### Key Objectives and Steps for the first leg of this work:

* **Model Definitions** – Define a baseline ViT architecture and a modified ViT that uses attention rollout in its forward pass.
* **Training Setup** – Prepare CIFAR-100 data and train both models from scratch on all 100 classes, using a lightweight configuration suitable for limited hardware (I used the free version of Google Colab with a T4 GPU).
* **Optimization** – Use standard PyTorch training loops with appropriate optimizations (smaller model and batch size, and optional mixed precision) to accommodate hardware constraints.
* **Evaluation** – Provide a script to load the trained models and compute cosine similarities between the learned representations of sample test images from four classes (boy, man, table, chair), demonstrating how to extract and compare feature embeddings.

<!--
# Preliminary Implementation & Results

## Setup and Data Preparation

First, I set up the environment and loaded the CIFAR-100 dataset using torchvision. I used PyTorch for model implementation and training, and moved the models and data to the Colab free-tier T4 GPU. For preprocessing, I converted images to tensors and normalized them with CIFAR-100’s per-channel mean and standard deviation. I also applied light augmentation (random cropping and horizontal flips), which helps generalization on CIFAR-100, but otherwise kept the pipeline simple for clarity.
The code below shows the setup:

```
import torch
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),    # random crop with padding (augmentation)
    transforms.RandomHorizontalFlip(),       # random horizontal flip (augmentation)
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5071, 0.4865, 0.4409),   # CIFAR-100 mean and std
                         std=(0.2673, 0.2564, 0.2762))
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5071, 0.4865, 0.4409),
                         std=(0.2673, 0.2564, 0.2762))
])

train_dataset = torchvision.datasets.CIFAR100(root="./data", train=True, download=True, transform=train_transform)
test_dataset = torchvision.datasets.CIFAR100(root="./data", train=False, download=True, transform=test_transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)
```

**Note**: I chose a batch size of 64 to fit in limited GPU memory, since I am on the Google Colab free tier. I can adjust this (e.g., to 32 or 128) based on hardware capacity later (once I get access to the university GPU server). The normalization values are the per-channel mean and std for CIFAR-100 images. I applied simple augmentations (random crop/flip) on the training data to help the models generalize.

## Baseline ViT Model Definition

The baseline model follows the standard ViT architecture. Each image is split into patches, which are linearly projected into an embedding space. A learned [CLS] token embedding is prepended to the sequence of patch embeddings at each forward pass. I added learned positional embeddings to retain patch position information. The model consists of a stack of Transformer encoder blocks (self-attention + feed-forward layers with residual connections). The [CLS] token attends to all patch embeddings through these layers, effectively learning a global representation of the image. After the last encoder block, the output embedding of the [CLS] token is fed into a linear classification head to predict the class label.

Below is the implementation of a minimal ViT block and the baseline ViT model. I used a smaller configuration suitable for CIFAR-100 and limited hardware: patch size 4 (so 8 × 8 = 64 patches for a 32 × 32 image), embedding dimension 128, 4 attention heads, and 6 transformer layers. LayerNorm is applied before each attention and MLP (Pre-LN architecture), and I also included dropout for regularization. This model is intentionally kept small to ensure training is feasible on a free Colab T4 GPU.
``` import torch.nn as nn import math class ViTBlock(nn.Module): """Transformer encoder block: LayerNorm -> Multi-head Self-Attention -> Add & Norm -> MLP -> Add & Norm.""" def __init__(self, embed_dim, num_heads, mlp_ratio=4.0, dropout=0.1): super().__init__() self.norm1 = nn.LayerNorm(embed_dim) self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True) self.drop1 = nn.Dropout(dropout) self.norm2 = nn.LayerNorm(embed_dim) # MLP consists of two linear layers with GELU non-linearity hidden_dim = int(embed_dim * mlp_ratio) self.mlp = nn.Sequential( nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Dropout(dropout), nn.Linear(hidden_dim, embed_dim), nn.Dropout(dropout), ) def forward(self, x, return_attn=False): # Self-attention with pre-LayerNorm x_norm = self.norm1(x) # Query, Key, Value are all x_norm (self-attention) attn_out, attn_weights = self.attn(x_norm, x_norm, x_norm, need_weights=return_attn, average_attn_weights=not return_attn) x = x + self.drop1(attn_out) # Feed-forward network with pre-LayerNorm x_norm2 = self.norm2(x) mlp_out = self.mlp(x_norm2) x = x + mlp_out if return_attn: # attn_weights shape: (batch, num_heads, seq_len, seq_len) return x, attn_weights return x class ViTBaseline(nn.Module): def __init__(self, image_size=32, patch_size=4, num_classes=100, embed_dim=128, depth=6, num_heads=4, dropout=0.1): super().__init__() assert image_size % patch_size == 0, "Image size must be divisible by patch size" self.patch_size = patch_size num_patches = (image_size // patch_size) ** 2 # e.g., (32/4)^2 = 64 patches self.embed_dim = embed_dim # Patch embedding: project each patch to embed_dim (using a Conv layer for efficiency) self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size) # Class token and positional embedding self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) self.pos_drop = nn.Dropout(dropout) # Transformer encoder blocks self.blocks = nn.ModuleList([ViTBlock(embed_dim, num_heads, mlp_ratio=4.0, dropout=dropout) for _ in range(depth)]) self.norm = nn.LayerNorm(embed_dim) # Classification head self.head = nn.Linear(embed_dim, num_classes) # Initialize weights nn.init.trunc_normal_(self.pos_embed, std=0.02) nn.init.trunc_normal_(self.cls_token, std=0.02) def forward(self, x): B = x.shape[0] # Patch embedding: flatten image to patches and project to embed_dim patch_embeddings = self.patch_embed(x) # Rearrange to (B, num_patches, embed_dim) patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2) # Prepare class token and add to patch embeddings cls_tokens = self.cls_token.expand(B, -1, -1) x_seq = torch.cat([cls_tokens, patch_embeddings], dim=1) # Add positional embeddings x_seq = x_seq + self.pos_embed[:, :x_seq.size(1), :] x_seq = self.pos_drop(x_seq) # Pass through Transformer encoder blocks for block in self.blocks: x_seq = block(x_seq, return_attn=False) # Apply final LayerNorm x_seq = self.norm(x_seq) # Class token output cls_out = x_seq[:, 0] logits = self.head(cls_out) return logits def forward_features(self, x): """Utility method to get the feature vector (CLS embedding) before the classification head.""" # Similar to forward, but returns the CLS embedding instead of class scores B = x.shape[0] patch_embeddings = self.patch_embed(x).flatten(2).transpose(1, 2) cls_tokens = self.cls_token.expand(B, -1, -1) x_seq = torch.cat([cls_tokens, patch_embeddings], dim=1) x_seq = x_seq + self.pos_embed[:, :x_seq.size(1), :] 
x_seq = self.pos_drop(x_seq) for block in self.blocks: x_seq = block(x_seq, return_attn=False) x_seq = self.norm(x_seq) cls_emb = x_seq[:, 0] # final CLS embedding return cls_emb ``` In this code, ViTBlock implements one transformer encoder layer with multi-head self-attention and a two-layer MLP. I used nn.MultiheadAttention for convenience, with batch_first=True so that inputs are of shape (batch, seq_len, embed_dim). The block returns attention weights only if return_attn=True (this will be used in the rollout model). The ViTBaseline class builds the patch embedding layer (using a convolution to split and project patches), prepends the cls_token, adds positional encodings, and then applies a sequence of ViTBlocks. The forward method returns class logits, while forward_features returns the final [CLS] token embedding (useful for extracting features without the classification layer, e.g., for similarity computations later). ## Modified ViT model with Attention Rollout The modified model, ViTRollout, uses the same building blocks (patch embedding, [CLS] token, transformer layers) but integrates the attention rollout computation into its forward pass. Attention rollout (Abnar & Zuidema, 2020) is a post-hoc technique to quantify the influence of each input token on the output by propagating attention through the layers (Check - https://jacobgil.github.io/deeplearning/vision-transformer-explainability). In practice, we compute the rollout by: (a) extracting the attention weight matrices from each transformer block, (b) adding an identity matrix to each to account for residual connections (so each token retains some of its own information), \(c) averaging across multiple attention heads at each layer, and (d) multiplying these modified attention matrices together, layer by layer. The result is an attention flow matrix from the input tokens to the output tokens. We normalize the attention at each step so that each token’s attention weights sum to 1, maintaining it as a proper probability distribution through the layers. Based on my reading about vision transformers (https://paperswithcode.com/method/vision-transformer), in a ViT with a [CLS] token, the first row of the final rollout matrix gives the overall attention weight that the final [CLS] output receives from each initial token (the [CLS] itself and all patches). For our modified model’s output, we ignore the contribution of the initial [CLS] token (which carries no image information) and focus on the patch contributions. Essentially, we obtain a weight for each image patch indicating its importance to the final classification. We then compute a weighted sum of the final layer patch embeddings using these rollout weights to produce a representation vector for the image, and feed that to the classifier. This replaces the direct use of the final [CLS] embedding. By doing so, the model’s prediction is explicitly based on an aggregation of patch features weighted by the multi-layer attention flow originating from each patch. This integration provides an interpretable mechanism, since the weights indicate which patches (image regions) were most influential, and it slightly alters how the model learns to pool information from patches. Below is the implementation of ViTRollout. 
It reuses the same transformer blocks but collects attention weights at each layer to perform the rollout computation: ``` class ViTRollout(nn.Module): def __init__(self, image_size=32, patch_size=4, num_classes=100, embed_dim=128, depth=6, num_heads=4, dropout=0.1): super().__init__() assert image_size % patch_size == 0 num_patches = (image_size // patch_size) ** 2 # Patch embedding, class token, pos embedding (same as baseline) self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size) self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim)) self.pos_drop = nn.Dropout(dropout) # Transformer blocks (will retrieve attention weights) self.blocks = nn.ModuleList([ViTBlock(embed_dim, num_heads, mlp_ratio=4.0, dropout=dropout) for _ in range(depth)]) self.norm = nn.LayerNorm(embed_dim) self.head = nn.Linear(embed_dim, num_classes) # Initialize weights nn.init.trunc_normal_(self.pos_embed, std=0.02) nn.init.trunc_normal_(self.cls_token, std=0.02) def forward(self, x): B = x.shape[0] # Embedding + CLS token + positions (same as baseline) patch_embeddings = self.patch_embed(x).flatten(2).transpose(1, 2) # (B, num_patches, embed_dim) cls_tokens = self.cls_token.expand(B, 1, -1) # (B, 1, embed_dim) x_seq = torch.cat([cls_tokens, patch_embeddings], dim=1) # (B, 1+num_patches, embed_dim) x_seq = x_seq + self.pos_embed[:, :x_seq.size(1), :] x_seq = self.pos_drop(x_seq) # Forward through transformer layers, collect attention from each attn_mats = [] # to store attention matrices for rollout for block in self.blocks: x_seq, attn_weights = block(x_seq, return_attn=True) # get attention weights per layer attn_mats.append(attn_weights) # Apply final LayerNorm to outputs x_seq = self.norm(x_seq) # shape: (B, 1+num_patches, embed_dim) # Compute Attention Rollout # Start with identity matrix for shape (seq_len x seq_len) seq_len = x_seq.size(1) # this is 1 + num_patches I = torch.eye(seq_len, device=x_seq.device).expand(B, seq_len, seq_len) # (B, seq_len, seq_len) # Initialize rollout matrix as identity rollout = I for attn_weights in attn_mats: # attn_weights shape: (B, num_heads, seq_len, seq_len) # Average heads A = attn_weights.mean(dim=1) # (B, seq_len, seq_len) A_hat = A + I # add identity for residual connection # Normalize each row of A_hat row_sum = A_hat.sum(dim=2, keepdim=True) # sum over source tokens A_hat = A_hat / row_sum # Multiply into rollout matrix rollout = torch.bmm(A_hat, rollout) # batch matrix multiplication # Now rollout (B, seq_len, seq_len) contains overall influence of each initial token on each final token. # Use rollout for weighted patch aggregation: # We take the first row (final CLS token) attention distribution over initial tokens: # rollout[:, 0, :] is attention from each initial token to the final CLS. 
patch_weights = rollout[:, 0, 1:] # ignore index 0 (initial CLS itself), shape: (B, num_patches) # Normalize the patch weights to sum to 1 (excluding CLS token) patch_weights = patch_weights / patch_weights.sum(dim=1, keepdim=True) # Compute weighted sum of **final-layer patch embeddings** using these weights patch_embs = x_seq[:, 1:, :] # final normalized embeddings of patches, shape: (B, num_patches, embed_dim) # Weight the patches by patch_weights and sum: rep = (patch_weights.unsqueeze(-1) * patch_embs).sum(dim=1) # shape: (B, embed_dim) # Classification head on the representation logits = self.head(rep) # (B, num_classes) return logits def forward_features(self, x): # Similar to forward, but returns the representation vector (rep) before classification B = x.shape[0] patch_embeddings = self.patch_embed(x).flatten(2).transpose(1, 2) cls_tokens = self.cls_token.expand(B, 1, -1) x_seq = torch.cat([cls_tokens, patch_embeddings], dim=1) x_seq = x_seq + self.pos_embed[:, :x_seq.size(1), :] x_seq = self.pos_drop(x_seq) attn_mats = [] for block in self.blocks: x_seq, attn_weights = block(x_seq, return_attn=True) attn_mats.append(attn_weights) x_seq = self.norm(x_seq) seq_len = x_seq.size(1) I = torch.eye(seq_len, device=x_seq.device).expand(B, seq_len, seq_len) rollout = I for attn_weights in attn_mats: A = attn_weights.mean(dim=1) A_hat = A + I A_hat = A_hat / A_hat.sum(dim=2, keepdim=True) rollout = torch.bmm(A_hat, rollout) patch_weights = rollout[:, 0, 1:] patch_weights = patch_weights / patch_weights.sum(dim=1, keepdim=True) patch_embs = x_seq[:, 1:, :] rep = (patch_weights.unsqueeze(-1) * patch_embs).sum(dim=1) # final representation vector return rep ``` Below is the break down the key parts of ViTRollout.forward: * I created the patch embeddings and prepend the class token just like in the baseline. Then, we forward through each ViTBlock with return_attn=True to collect the attention weight matrices from each layer. Each attn_weights has shape (B, heads, seq_len, seq_len), where seq_len = 1 + num_patches (including the class token). We store these in attn_mats. * After obtaining all layers’ attention, we compute the rollout matrix. We start with an identity matrix of shape (seq_len, seq_len) representing initial direct influence (each token 100% influences itself). For each layer’s attention matrix A (averaged over heads), we add identity to get $A^{\prime} = A + I$, normalize each row of $A^{\prime}$, and then multiply it with the current rollout matrix. This effectively composes the attentions across layers: after processing all layers, rollout gives the cumulative attention from any initial token $j$ to any final token $i$. In particular, rollout[:, 0, :] is a vector of how much each initial token contributes to the final class token (index 0 in the output). * We exclude the [CLS] token’s self-contribution (rollout[:,0,0]) and take the rest rollout[:,0,1:] as the attention distribution over the image patches. These patch weights are normalized to sum to 1. Then we obtain the final patch embeddings (patch_embs) from x_seq[:,1:,:] (after the last LayerNorm, i.e., the final state of each patch token). We compute a weighted sum of these patch embeddings using the attention rollout weights. The resulting rep vector (dimension 128) is our attention-rollout aggregated representation of the image. This is passed to the linear head to produce class logits. The forward_features method similarly returns the representation vector rep before the classification layer. 
With this design, the modified model’s predictions are based on an explicit aggregation of patch features, with weights determined by the model’s entire attention structure. This gives an interpretable output (you could inspect patch_weights to see which patches were most important for a given prediction). ## Training the models from scratch I trained both models on CIFAR-100, and used cross-entropy loss for the 100-class classification and the Adam optimizer (with a moderate weight decay, since Transformers often benefit from AdamW to regularize weights). I trained for 50 epochs. ``` import torch.nn.functional as F baseline_model = ViTBaseline().to(device) rollout_model = ViTRollout().to(device) criterion = nn.CrossEntropyLoss() optim_baseline = torch.optim.AdamW(baseline_model.parameters(), lr=1e-3, weight_decay=1e-4) optim_rollout = torch.optim.AdamW(rollout_model.parameters(), lr=1e-3, weight_decay=1e-4) num_epochs = 50 def train_one_model(model, optimizer, name="Model"): model.train() for epoch in range(1, num_epochs+1): running_loss = 0.0 for images, labels in train_loader: images, labels = images.to(device), labels.to(device) optimizer.zero_grad() outputs = model(images) loss = criterion(outputs, labels) loss.backward() optimizer.step() running_loss += loss.item() avg_loss = running_loss / len(train_loader) if epoch % 10 == 0: print(f"{name} Epoch [{epoch}/{num_epochs}] - Average Loss: {avg_loss:.4f}") print("Training Baseline ViT...") train_one_model(baseline_model, optim_baseline, name="BaselineViT") torch.save(baseline_model.state_dict(), "vit_baseline_cifar100.pth") print("Training Rollout ViT...") train_one_model(rollout_model, optim_rollout, name="RolloutViT") torch.save(rollout_model.state_dict(), "vit_rollout_cifar100.pth") ``` ``` Training Baseline ViT... BaselineViT Epoch [10/50] - Average Loss: 2.6091 BaselineViT Epoch [20/50] - Average Loss: 2.0749 BaselineViT Epoch [30/50] - Average Loss: 1.6643 BaselineViT Epoch [40/50] - Average Loss: 1.3389 BaselineViT Epoch [50/50] - Average Loss: 1.1102 Training Rollout ViT... RolloutViT Epoch [10/50] - Average Loss: 2.3617 RolloutViT Epoch [20/50] - Average Loss: 1.7240 RolloutViT Epoch [30/50] - Average Loss: 1.2512 RolloutViT Epoch [40/50] - Average Loss: 0.9200 RolloutViT Epoch [50/50] - Average Loss: 0.6847 ``` This script trains each model for 50 epochs (printing the average loss every 10 epochs as a simple progress indicator). On a single GPU, 50 epochs with these small models and CIFAR-100 (50k training images) is reasonable. ## Computing the Cosine Similarity The below code loads two ViT models (baseline and rollout), extracts and normalizes feature embeddings for the “man,” “boy,” “table,” and “chair” test images, computes their pairwise cosine similarity matrices, and prints them. 
``` baseline_model = ViTBaseline().to(device) baseline_model.load_state_dict(torch.load("vit_baseline_cifar100.pth", map_location=device)) baseline_model.eval() rollout_model = ViTRollout().to(device) rollout_model.load_state_dict(torch.load("vit_rollout_cifar100.pth", map_location=device)) rollout_model.eval() classes_of_interest = ["man", "boy", "table", "chair"] class_to_idx = test_dataset.class_to_idx indices = [class_to_idx[c] for c in classes_of_interest] test_examples = [] example_labels = [] for img, label in test_dataset: if label in indices: if label not in example_labels: test_examples.append(img) example_labels.append(label) if len(test_examples) == len(indices): break ordered_examples = [] for cls_label in indices: idx = example_labels.index(cls_label) ordered_examples.append(test_examples[idx]) ordered_examples = torch.stack(ordered_examples).to(device) with torch.no_grad(): feats_base = baseline_model.forward_features(ordered_examples) feats_roll = rollout_model.forward_features(ordered_examples) # shape (4, embed_dim) # Normalize the feature vectors feats_base = F.normalize(feats_base, p=2, dim=1) feats_roll = F.normalize(feats_roll, p=2, dim=1) # Compute cosine similarity matrix for each model cos_sim_base = feats_base @ feats_base.T # 4x4 matrix of cosine sims cos_sim_roll = feats_roll @ feats_roll.T # 4x4 matrix for rollout model # Print cosine similarities for each pair of classes print("Cosine similarity (Baseline ViT) for classes:") for i, ci in enumerate(classes_of_interest): for j, cj in enumerate(classes_of_interest): print(f" {ci:6s} vs {cj:6s}: {cos_sim_base[i,j].item():.3f}") print("\nCosine similarity (Rollout ViT) for classes:") for i, ci in enumerate(classes_of_interest): for j, cj in enumerate(classes_of_interest): print(f" {ci:6s} vs {cj:6s}: {cos_sim_roll[i,j].item():.3f}") ``` ## Understanding the results Below is the result of cosine similarities I got- ``` Cosine similarity (Baseline ViT) for classes: man vs man : 1.000 man vs boy : 0.335 man vs table : 0.016 man vs chair : -0.141 boy vs man : 0.335 boy vs boy : 1.000 boy vs table : -0.053 boy vs chair : 0.086 table vs man : 0.016 table vs boy : -0.053 table vs table : 1.000 table vs chair : 0.357 chair vs man : -0.141 chair vs boy : 0.086 chair vs table : 0.357 chair vs chair : 1.000 Cosine similarity (Rollout ViT) for classes: man vs man : 1.000 man vs boy : 0.333 man vs table : -0.031 man vs chair : -0.175 boy vs man : 0.333 boy vs boy : 1.000 boy vs table : -0.020 boy vs chair : -0.042 table vs man : -0.031 table vs boy : -0.020 table vs table : 1.000 table vs chair : 0.325 chair vs man : -0.175 chair vs boy : -0.042 chair vs table : 0.325 chair vs chair : 1.000 ``` Based on my analysis, attention‐rollout does nudge down the off-diagonal (inter-class) cosine similarities. 
For the six unique off-diagonal pairs (man–boy, man–table, man–chair, boy–table, boy–chair, table–chair), here’s a quick comparison:

| Pair | Baseline ViT | Rollout ViT | Δ (Rollout – Baseline) |
| -------------- | ------------ | ----------- | ---------------------- |
| man vs boy | 0.335 | 0.333 | –0.002 |
| man vs table | 0.016 | –0.031 | –0.047 |
| man vs chair | –0.141 | –0.175 | –0.034 |
| boy vs table | –0.053 | –0.020 | +0.033 |
| boy vs chair | 0.086 | –0.042 | –0.128 |
| table vs chair | 0.357 | 0.325 | –0.032 |

- **Average absolute off-diagonal similarity**
  - **Baseline ViT:** \[(0.335 + 0.016 + 0.141 + 0.053 + 0.086 + 0.357) / 6\] ≈ **0.165**
  - **Rollout ViT:** \[(0.333 + 0.031 + 0.175 + 0.020 + 0.042 + 0.325) / 6\] ≈ **0.154**

That’s roughly a **6% relative reduction** in average inter-class similarity.

### What this means

- **Slight decorrelation:** Rollout tends to push different classes a bit further apart (lower cosine), which is what we want for better class discrimination.
- **Bottom line:** attention-rollout does “help” in the sense of slightly reducing inter-class overlap, but it’s not a dramatic improvement here, which can be attributed to the limitations I faced while training the models.

## Things to Note

1) Colab - https://colab.research.google.com/drive/1Cwfg9MpXGsdZARgO4gUWU3yiTSY2JCOf?usp=sharing
2) Since this was a basic ViT architecture trained with limited hardware and a short schedule, the models did not reach state-of-the-art accuracy, but I believe this is a good starting point.
-->

# Iteration 2

The previous implementation was very preliminary, given my limited experience with vision transformers. Its noteworthy limitations were:

1. Extremely small evaluation slice (4 classes × 1 image) → not statistically meaningful.
2. Single seed, small model (128-d, 6 layers), and short schedule (50 epochs, no LR decay) constrain performance.
3. Cosine similarities on a handful of points can be unstable and sensitive to sampling.

So in this iteration, I went for a much better, more productionised implementation, available at https://github.com/shreejeetsahay/attention-rollout-work. Let’s walk through what I did.

## Goal

As I mentioned, the previous implementation was a toy probe. This time I wanted to move from that toy probe to a scalable, reproducible pipeline that cleanly compares ViT (CLS pooling) vs ViT with attention-rollout pooling, and evaluates the embeddings with class-level metrics.

## Model Design

As you can see in the vit_rollout.py file, this time I implemented a single ViT class with a mode switch:

* baseline: representation = final [CLS] token.
* rollout: representation = attention-rollout–weighted sum of final-layer patch embeddings.

Rollout is computed per layer by averaging heads, adding the identity (residual path), row-normalising, and then multiplying across layers to get an influence matrix. I drop the initial [CLS] column, re-normalise the patch weights, and pool the patches accordingly (a code sketch of this readout appears at the end of this write-up). The backbone is kept identical across modes (patch=4, embed=128, depth=6, heads=4, Pre-LN, dropout=0.1) to isolate the effect of pooling.

## Data & loaders

* I used CIFAR-100 with standard augmentation: RandomCrop(32, padding=4) + RandomHorizontalFlip, plus per-channel normalisation.
* get_loaders() returns the train/test loaders and the raw test_ds; I use test_ds to build class-balanced subsets for evaluation.

## Training regimen

* Optimiser/loss: AdamW with cross-entropy.
* Schedule: linear warm-up → cosine decay (LambdaLR); defaults: epochs=200, lr_max=5e-4, lr_min=5e-6, warm-up = 5 epochs.
* AMP on CUDA via torch.cuda.amp to improve speed and reduce VRAM usage.
* Checkpoints are saved and reused as baseline.pth and rollout.pth (retraining is skipped unless --retrain is set).

## Evaluation setup

* collect_feats() builds a class-balanced subset (configurable via --classes and --per_class), forwards it in batches, and concatenates the features.
* I L2-normalise features before the distance-based metrics to make scales comparable.
* Metrics (a code sketch of these appears just before the cross-dataset comparison):
  - **k-NN Accuracy (k = 5, cosine)**
    - Classifies each embedding by its 5 nearest neighbours using cosine distance.
    - **Higher is better** → local neighbourhoods align with labels.
  - **Intra-class Distance (centroid-based)**
    - On **L2-normalised** features, compute the mean Euclidean distance from samples to their class centroid; average over classes.
    - **Lower is better** → tighter clusters.
  - **Inter-class Distance (centroid-based, cosine)**
    - Compute pairwise centroid distances as 1 − cos(c_i, c_j), averaged over all class pairs.
    - **Higher is better** → class means are farther apart.
  - **Inter / Intra Ratio**
    - ratio = inter / intra.
    - **Higher is better** → separation outweighs within-class spread.
  - **Silhouette Score (Euclidean)**
    - For each point, (b − a) / max(a, b), where a = mean distance to points in the same class and b = mean distance to points in the nearest other class.
    - Ranges from **−1 to 1**; **higher is better**. Near **0** → weak/overlapping clusters.
* All metrics are cast from torch floats to Python floats and written to results.json.

## Visualization

* For a chosen test image, I compute the rollout patch weights, upsample the 8×8 grid to 32×32, and overlay it on the de-normalised image (see the sketch after the CIFAR-10 heatmap comparison below).
* I save a side-by-side figure (heatmap_compare.png) for Baseline (rollout visualised for reference only) and Rollout (weights actually used for pooling).

## Why I went for these choices

The simple reason: I am new to this topic, so I took suggestions from Deep Research and incorporated them, knowing my initial work had gaps. From what I understand:

1. The mode switch ensures the only change between models is the pooling mechanism.
2. Warm-up + cosine decay stabilises early updates and improves late-stage convergence versus a flat LR.
3. L2-normalisation avoids misleading Euclidean scales and aligns with cosine-based retrieval.
4. Balanced sampling and multiple metrics provide a more meaningful picture than a 4-image cosine matrix.

## Understanding the results

### Training

![tlossc100](https://hackmd.io/_uploads/S1FpvQRwxg.png)

As the figure above shows, rollout converges faster and to a lower loss (CE ≈ 0.009 vs 0.072). Lower loss doesn’t always translate 1-to-1 into k-NN gains, but it shows the model is using its capacity well.

## Evaluation

### 10-class probe, 100 images/class

First, let’s look at the results for the 10-class probe with 100 images per class:

| metric (10-class probe, 100 images / class) | **Baseline** | **Rollout** | What it means |
| ------------------------------------------- | ------------ | ----------- | ------------- |
| k-NN @ k = 5 | **0.877** | 0.862 | Baseline wins by ≈ 1.5 pp – a small gap that could be noise. |
| Inter-class distance (cosine) | 0.869 | **0.982** | Rollout pushes class centroids farther apart (good). |
| Intra-class distance | 0.855 | 0.876 | Slightly looser clusters for rollout (expected: patch-weighted pooling injects variance). |
| Ratio (inter / intra) | 1.02 | **1.12** | Net separation improves despite the looser clusters. |
| Silhouette | 0.067 | 0.062 | Virtually unchanged – both embeddings form weak but visible clusters. |

The key takeaway here is that rollout now yields better centroid separation but a tiny drop in retrieval accuracy.

#### Heatmap comparison

Both maps below highlight the central fruit, but rollout’s saliency is more concentrated and symmetric, especially on the orange. That matches the higher inter-class distance: the model is weighting the truly class-defining region a bit more heavily.

![heatmap_compare](https://hackmd.io/_uploads/rksbOXRvlg.png)

### 100-class probe, 100 images/class

After the above probe, I computed the same metrics for all 100 classes with 100 images per class, i.e., the full 10,000-image test split. Here is what we got:

| metric (10,000 test imgs) | **Baseline** | **Rollout** | Δ (roll – base) | Interpretation |
| ------------------------- | ------------ | ----------- | --------------- | -------------- |
| k-NN @ k = 5 | **0.616** | 0.588 | –2.8 pp | Retrieval accuracy dropped—the CLS embedding still works slightly better when the label space is large. |
| Intra-class dist | 0.859 | 0.883 | ↑ 0.024 | Rollout clusters became a bit looser (patch weighting introduces variation). |
| Inter-class dist | 0.861 | **0.956** | ↑ 0.095 | Centroids are farther apart (good). |
| Inter / Intra ratio | 1.00 | **1.08** | ↑ 0.08 | Net separation improves despite looser clusters. |
| Silhouette | 0.003 | ≈ 0 | ~0 | 100 very small clusters in 128-D space → silhouette is near zero for both; the change is negligible. |

#### Interpretation on CIFAR-100

* Strategic view:
  * Roll-out weighting pushes class means apart (good for a linear head or centroid-based classifiers) but adds variance within each class, which hurts sample-level retrieval like k-NN.
* Why the variance?
  * From my understanding, the patch-weighted vector changes more from image to image—if different patches dominate in different examples of the same class, the embeddings scatter.
* Is it “beneficial”?
  * For global separability and lower classification loss, yes.
  * For neighbour-based retrieval, or for embeddings that must stay tight, not really—we have not achieved that here.

# Iteration 3

In this iteration, we went ahead and tested the above on the standard CIFAR-10 and SVHN datasets. The code has been updated accordingly in the repo: https://github.com/shreejeetsahay/attention-rollout-work/tree/main

## CIFAR-10

### Training

Training on CIFAR-10 gave the training loss curve shown below:

![cifar10losscurve](https://hackmd.io/_uploads/SyQdpm0vll.png)

| Phase | Observation | What it suggests |
| ----- | ----------- | ---------------- |
| **Epoch 1-10 (warm-up)** | Roll-out drops from 1.84 → 0.91, baseline 1.87 → 1.00. Roll-out is ≈ 9 % lower at epoch 10. | Patch-weighted pooling helps the optimiser take larger effective steps from the start. |
| **Epoch 10-40** | Both curves fall in parallel; the absolute gap widens (≈ 0.05 CE at E40). | Advantage persists—no sign that rollout slows gradient flow. |
| **Epoch 40-120** | Gap is stable: rollout remains 10-20 % lower (e.g., 0.185 vs 0.230 at E80). | Gains are not just early luck; they last through mid-training while the LR decays. |
| **Epoch 120-200 (tail)** | Final losses: **0.0145** (roll-out) vs **0.0335** (baseline) → ≈ −56 %. | Roll-out still extracts extra margin even in the low-LR regime. |

So, as on CIFAR-100, roll-out converges faster and to a markedly lower loss on CIFAR-10—roughly halving the final cross-entropy. The improvement is consistent (no crossover), indicating that attention-weighted patch pooling gives the classifier head a cleaner, more linearly separable feature space right from the start. Whether that lower loss translates into better downstream metrics (accuracy, retrieval) still depends on the intra-class variance trade-off, but purely as an optimiser target, roll-out is decisively easier to fit on this dataset.

### Metrics (10-class probe, 100 images/class)

| metric | Baseline | Roll-out | Δ |
| ------ | -------- | -------- | - |
| **Intra-class dist** | 0.7167 | **0.6937** | ↓ 0.023 (tighter clusters) |
| **Inter-class dist** | 1.0003 | **1.0628** | ↑ 0.062 (centroids farther apart) |
| **Inter / Intra ratio** | 1.40 | **1.53** | ↑ 0.13 |
| **k-NN @ 5** | 0.845 | **0.850** | +0.5 pp |
| **Silhouette** | **0.143** | 0.134 | –0.009 |

Now, let’s interpret the table:

1. **Stronger separation overall.** Roll-out simultaneously shrinks within-class spread and pushes class centroids apart, lifting the inter / intra ratio from 1.40 → 1.53.
2. **Retrieval matches the improvement.** k-NN accuracy ticks up (+0.5 pp), confirming that the tighter, better-separated space helps nearest-neighbour classification.
3. **The slight silhouette dip is not a concern.** Silhouette drops a hair because it weighs both clusters and their nearest neighbours; with tighter clusters, the nearest-cluster distance sometimes decreases too. The magnitude (−0.009) is minor compared with the gains in the other metrics.

Hence, on CIFAR-10, attention-rollout not only converges faster during training but also yields quantitatively stronger embeddings—better class compactness and better separation—without the variance penalty we saw on CIFAR-100.

### Heatmap Comparison

![heatmap_compare_cifar10](https://hackmd.io/_uploads/HkfIZVCDlg.png)

| | Baseline | Roll-out |
| --- | -------- | -------- |
| **Saliency focus** | High-weight patches are spread across the sky and water as well as the airplane fuselage. | Heat concentrates on the airplane body and wing; background sky / water patches cool down. |
| **Background leakage** | Several top-row (sky) and bottom-row (water) cells are warm → background influences the CLS output. | Those same cells turn blue/green → background contribution reduced. |
| **Sharpness** | Warm areas bleed into neighbours. | Crisper hot region around the airplane; colder elsewhere. |

Roll-out pooling makes the model rely more on the actual airplane pixels and less on the surrounding background, which is consistent with the quantitative gains (tighter intra-class, higher inter-class, +0.5 pp k-NN).
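The overlays above are produced roughly as described in the Visualization section: reshape the 64 rollout patch weights into an 8×8 grid, upsample to 32×32, and blend with the de-normalised image. Below is a minimal sketch of that step; the names img, patch_weights, and overlay_rollout are illustrative and not the repo’s exact API.

```
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def overlay_rollout(img, patch_weights, mean, std, grid=8, out_path="heatmap.png"):
    """Overlay a (grid x grid) rollout weight map on a normalised image tensor of shape (3, H, W)."""
    # De-normalise the image back to [0, 1] for display.
    mean = torch.tensor(mean).view(3, 1, 1)
    std = torch.tensor(std).view(3, 1, 1)
    img = (img.cpu() * std + mean).clamp(0, 1)

    # Reshape the patch weights to a (grid x grid) map and upsample to the image size.
    heat = patch_weights.detach().cpu().view(1, 1, grid, grid)
    heat = F.interpolate(heat, size=img.shape[-2:], mode="bilinear", align_corners=False).squeeze()
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # rescale to [0, 1]

    fig, ax = plt.subplots(figsize=(3, 3))
    ax.imshow(img.permute(1, 2, 0).numpy())         # base image
    ax.imshow(heat.numpy(), cmap="jet", alpha=0.5)  # rollout heatmap on top
    ax.axis("off")
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)

# Example call (CIFAR-100 statistics; swap in the dataset's own mean/std for CIFAR-10 or SVHN):
# overlay_rollout(img, patch_weights,
#                 mean=(0.5071, 0.4865, 0.4409), std=(0.2673, 0.2564, 0.2762))
```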
## SVHN Dataset

### Training

For SVHN, the training loss curve is shown below:

![output](https://hackmd.io/_uploads/SJnTXEAwex.png)

| Epoch range | Baseline CE (→) | Roll-out CE (→) | Gap (roll − base) at end | Read |
| ----------- | --------------- | --------------- | ------------------------ | ---- |
| **1–10 (warm-up)** | 2.22 → **0.6729** | 2.21 → **0.6447** | **−0.028** | Roll-out gets a small early lead. |
| **10–40** | 1.0002 → **0.2595** | 0.9147 → **0.2684** | **+0.009** | Curves nearly overlap; baseline a hair better by E40. |
| **40–80** | 0.2595 → **0.1551** | 0.2684 → **0.1574** | **+0.002** | Essentially tied through mid-training. |
| **80–120** | 0.1551 → **0.0815** | 0.1574 → **0.0772** | **−0.004** | Roll-out nudges ahead again. |
| **120–200 (tail)** | 0.0815 → **0.0296** | 0.0772 → **0.0234** | **−0.006** | Roll-out finishes lower; the absolute gap is modest. |

On SVHN the two models train almost identically; roll-out is slightly faster early and ends a bit lower, but the differences are small.

### Metrics (10-class probe, 2,600 images)

| metric | Baseline | Roll-out | Δ |
| ------ | -------- | -------- | - |
| **Intra-class dist** | **0.562** | 0.602 | ↑ 0.041 (clusters looser) |
| **Inter-class dist** | 1.003 | **1.025** | ↑ 0.022 (centroids a bit farther) |
| **Inter / Intra ratio** | **1.79** | 1.70 | ↓ 0.09 |
| **k-NN @ 5 (cosine)** | **0.967** | 0.965 | –0.2 pp |
| **Silhouette** | **0.350** | 0.310 | ↓ 0.040 |

1. **Separation vs. compactness.** Roll-out nudges centroid distances up, but inflates within-class scatter even more, so the inter / intra ratio drops from 1.79 → 1.70.
2. **Retrieval essentially unchanged.** k-NN@5 slips by only 0.2 pp—a statistical tie. The added variance doesn’t hurt neighbour consistency because SVHN digits are already highly separable.

On SVHN, attention-rollout provides no clear advantage. The background is plain and the digits are centred, so the CLS token already captures the salient region; patch-weighting adds variance without lifting useful separation.

### Heatmap

![heatmap_compare_svhn](https://hackmd.io/_uploads/rkTBpVAPxe.png)

As you can see above, rollout changes the attention distribution only subtly: it smooths out extreme spikes and lights up the entire digit outline. Because SVHN digits are centred and isolated, this refinement neither helps nor hurts the downstream metrics—exactly what we saw in the tables above.
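Before moving to the cross-dataset comparison, here is a minimal sketch of how the centroid-based metrics and the cosine k-NN accuracy reported in these tables could be computed from L2-normalised features. The names feats, labels, and the two functions are illustrative; the repo’s exact implementation may differ (e.g., the silhouette score can come from sklearn.metrics.silhouette_score).

```
import torch
import torch.nn.functional as F

def centroid_metrics(feats, labels):
    """Intra-/inter-class centroid distances and their ratio on L2-normalised features."""
    feats = F.normalize(feats, dim=1)
    classes = labels.unique()
    centroids = torch.stack([feats[labels == c].mean(0) for c in classes])

    # Intra-class: mean Euclidean distance from each sample to its class centroid, averaged over classes.
    intra = torch.stack([
        (feats[labels == c] - centroids[i]).norm(dim=1).mean()
        for i, c in enumerate(classes)
    ]).mean()

    # Inter-class: mean pairwise cosine distance (1 - cos) between class centroids.
    c_norm = F.normalize(centroids, dim=1)
    cos = c_norm @ c_norm.T
    off_diag = ~torch.eye(len(classes), dtype=torch.bool)
    inter = (1 - cos[off_diag]).mean()

    return {"intra": intra.item(), "inter": inter.item(), "ratio": (inter / intra).item()}

def knn_accuracy(feats, labels, k=5):
    """Leave-one-out k-NN accuracy with cosine similarity and majority vote."""
    feats = F.normalize(feats, dim=1)
    sims = feats @ feats.T
    sims.fill_diagonal_(-float("inf"))         # never count a sample as its own neighbour
    nn_idx = sims.topk(k, dim=1).indices       # indices of the k most similar samples
    preds = labels[nn_idx].mode(dim=1).values  # majority label among the neighbours
    return (preds == labels).float().mean().item()
```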
## Iteration 2 and 3 interpretation

| dataset | visual scene | what roll-out changes | net result |
| ------- | ------------ | --------------------- | ---------- |
| **CIFAR-100 – 100 fine-grained natural classes** | cluttered; class-defining cues vary from picture to picture | pushes centroids farther apart **and** inflates the spread inside each class | linear head benefits (lower CE, higher inter-class distance) but k-NN drops because the extra scatter hurts local neighbourhoods |
| **CIFAR-10 – 10 coarse natural classes** | single object, limited backgrounds | pushes centroids apart **and** makes clusters slightly tighter | everything improves: lower CE, higher inter / intra ratio, small but noticeable lift in k-NN |
| **SVHN – centred digits, uniform background** | very simple | almost no change beyond a tiny increase in variance | training loss difference is negligible; k-NN and centroid metrics are effectively unchanged |

As for the heatmaps, we see that where the background is visually prominent (fruit on a table, an airplane in the sky), roll-out concentrates weight on the object and cools the background. In SVHN, where the digit already dominates the frame, both pooling methods highlight the same central region, so the saliency maps—and the metrics—are almost identical.

### Conclusion from iterations 2 and 3

If the data contain varied scenes in which the class-defining region can be crowded or partially occluded by background (CIFAR-100, CIFAR-10), attention-rollout is worth enabling: it converges faster and gives better linear separability, and on medium-granularity problems like CIFAR-10 it even helps neighbour retrieval. If the object is always centred and the background is uniform (SVHN-style data), the ordinary [CLS] token already captures everything important, so roll-out adds complexity without delivering a measurable benefit.

# Explanation of rollout mechanism used here

The figure below contrasts the baseline and rollout mechanisms; an explanation of rollout follows.

![IMG_3778](https://hackmd.io/_uploads/HyQbv6cOel.jpg)

Before explaining the rollout mechanism, it is important to know what tensors we get from each layer:

1. Each transformer block can optionally return its raw attention weights **w** with shape (B, H, S, S), where B = batch size, H = heads, and S = sequence length (1 CLS token + N patch tokens).
2. When mode == 'rollout' is selected, forward_features asks every block for w and stores the results in a list **attn**.

If you look at the code of the "_rollout" method in the ViT class, we implement the rollout mechanism from the Abnar and Zuidema (2020) paper: the per-layer attentions are turned into an “influence” matrix as follows:

1. Head average: for each layer ℓ, average across heads: A_ℓ = mean_over_heads(w_ℓ) → shape (B, S, S).
2. Add the residual path (identity): A_ℓ ← A_ℓ + I. This models the residual connection: a token can also “keep” its own content.
3. Row-normalize: A_ℓ[i] = A_ℓ[i] / sum_row(A_ℓ[i]). Now each row is a probability distribution over the tokens it draws information from.
4. Multiply across layers (rollout): initialize P = I; for each layer in order, P = A_ℓ · P. After all layers, P (B, S, S) captures indirect influence paths (e.g., CLS→patch via multiple hops).
This is the “rollout” part: it composes the routing over the whole stack.

Now, how is this rollout computation used in forward_features? First, we turn the CLS→patch influence computed above into pooling weights:

- Take the CLS row of the final influence matrix and drop the CLS column: W = P[:, 0, 1:]  # (B, N_patches)
- Normalize across patches: W = W / W.sum(-1, keepdim=True)

Once we have these pooling weights, we produce the representation used by the classifier:

- Get the final-layer patch embeddings: patches = x[:, 1:, :] (after the last block + LayerNorm).
- Take a weighted sum with the rollout weights: rep = (W.unsqueeze(-1) * patches).sum(1)  # shape = (B, embed_dim)
- Feed rep to the classifier instead of the final [CLS] token embedding used in the baseline (i.e., we use rep rather than x[:, 0]).

One important implementation detail: in the forward_features method, attentions are collected with attn.append(w.detach()). That means the pooling weights W do not receive gradients directly (they are computed from the current attentions but treated as constants in backprop). Gradients still flow through the patch embeddings being pooled, so training does differ between the two modes (CLS vs. rollout pooling), but there is no extra loss or gradient pushing the attentions to change specifically for rollout. It is a pure readout change, not an auxiliary objective (no auxiliary loss function). Detaching keeps rollout strictly readout-only, i.e., the model still learns via cross-entropy on the pooled representation, but we avoid long gradient paths through the product of per-layer attention matrices. If .detach() were removed, gradients would flow through that chain, implicitly pressuring the attention patterns, which can destabilize training and muddy a clean apples-to-apples comparison with the CLS baseline.
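To make this concrete, here is a minimal, self-contained sketch of the readout described above (not the repo’s exact code; rollout_pool, attn, and x are illustrative names):

```
import torch

def rollout_pool(attn, x):
    """attn: list of per-layer attention weights (B, H, S, S); x: final token embeddings (B, S, D)."""
    B, S, _ = x.shape
    I = torch.eye(S, device=x.device).expand(B, S, S)
    P = I.clone()                                 # start from the identity (each token influences itself)
    for w in attn:
        A = w.detach().mean(dim=1)                # average heads; detach keeps rollout a pure readout
        A = A + I                                 # add the residual path
        A = A / A.sum(dim=-1, keepdim=True)       # row-normalise to a distribution
        P = torch.bmm(A, P)                       # compose routing across layers
    W = P[:, 0, 1:]                               # CLS row of the influence matrix, CLS column dropped
    W = W / W.sum(dim=-1, keepdim=True)           # re-normalise over patches
    rep = (W.unsqueeze(-1) * x[:, 1:, :]).sum(dim=1)  # weighted sum of final-layer patch embeddings
    return rep                                    # (B, D), fed to the classification head
```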