# fMRI Encoding Models for Temporal Prediction in Higher-Order Brain Regions: A Literature Review

## Executive Summary

Recent advances in multimodal fMRI encoding models have demonstrated that contemporary deep learning approaches achieve substantial performance on naturalistic stimuli (Pearson r = 0.21–0.31 across whole-brain parcels). However, performance degrades sharply in higher-order associative regions including the Default Mode Network, Frontoparietal Cortex, and Temporal regions—areas critical for semantic integration, mentalizing, and narrative comprehension. This literature review synthesizes evidence that temporal dynamics and predictive coding mechanisms fundamentally underlie this regional heterogeneity. The 2025 Algonauts Challenge revealed three key insights: (1) multimodal integration is essential for higher-order regions, (2) ensembling strategies dominate architectural choices, and (3) temporal context captures long-range stimulus structure that static models cannot represent. The convergence of predictive coding theory, self-supervised learning, and state-space models offers a mechanistic framework for why temporal prediction is necessary to explain cortical responses beyond sensory hierarchies.

---

## 1. Introduction: The Prediction Problem in Higher-Order Cortex

### 1.1 Encoding Models as Tools for Understanding Cortical Hierarchies

Brain encoding models have become fundamental instruments in computational neuroscience, directly testing how sensory and cognitive representations in the brain relate to features of the external world. These models take stimulus features as input (images, audio, language) and predict the corresponding fMRI BOLD time series, allowing researchers to quantify which model components best explain neural activity in each brain region.[1][2][3] Over the past 15 years, the sophistication of feature extractors has evolved dramatically—from hand-engineered features (e.g., Gabor filters, temporal derivatives) to deep convolutional neural networks to multimodal foundation models (CLIP, Qwen, V-JEPA).[4][5] This progress has illuminated a fundamental organizational principle: early sensory cortices (V1, primary auditory cortex) are well explained by standard stimulus-driven encoding models, achieving Pearson correlations of r ≈ 0.30–0.45 (near the noise ceiling in many cases).[3][6] Yet performance degrades substantially in higher-order association cortex, where correlations drop to r ≈ 0.15–0.25 even with state-of-the-art multimodal models.[7][8] This performance cliff is not merely an engineering artifact—it reflects a genuine computational difference between regions.
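
To make this workflow concrete, the sketch below fits a minimal parcel-wise linearized encoding model: stimulus features are regressed onto BOLD time series with ridge regression and scored with a per-parcel Pearson correlation on a held-out, chronologically later segment. The data, dimensions, and regularization strength are illustrative placeholders rather than any specific dataset or published pipeline; in practice the features would come from a pretrained network, would be lagged to account for the hemodynamic delay, and the ridge penalty would be tuned per parcel by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data: T TRs, F stimulus features (e.g., DNN activations), P parcels.
T, F, P = 1000, 256, 400
X = rng.standard_normal((T, F))   # stimulus features, one row per TR
Y = rng.standard_normal((T, P))   # fMRI responses, one column per parcel

# Chronological split to respect temporal structure (no shuffling).
split = int(0.8 * T)
X_tr, X_te, Y_tr, Y_te = X[:split], X[split:], Y[:split], Y[split:]

# One ridge model predicting all parcels jointly (equivalent to parcel-wise fits).
model = Ridge(alpha=100.0)
model.fit(X_tr, Y_tr)
Y_hat = model.predict(X_te)

def pearson_per_column(a, b):
    """Pearson r between corresponding columns of two (time x parcel) arrays."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

r = pearson_per_column(Y_hat, Y_te)
print(f"median parcel r = {np.median(r):.3f}")
```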
### 1.2 Why Higher-Order Regions Resist Simple Encoding

The difficulty of encoding higher-order regions has been attributed to several factors: (1) temporal integration over longer timescales, (2) integration of multiple sensory modalities, (3) task-negative dynamics during rest, (4) content-dependent variability (narrative structure affects Default Mode responses differently than static images affect V1), and (5) the absence of direct stimulus-response mappings (many high-level regions encode expectations, predictions, and contextual meanings rather than stimulus properties).[9][10][11]

Recent evidence from the 2025 Algonauts Challenge—which explicitly tested generalization to out-of-distribution movies—revealed that **multimodal models systematically outperform unimodal models in associative cortex by up to 30%**, whereas in primary visual cortex the unimodal (vision-only) model performs better.[7] This dissociation points to a fundamental principle: **higher-order regions require temporal, multimodal integration; their encoding cannot be explained by isolated stimulus features.**

### 1.3 Thesis and Roadmap

This review argues that temporal predictive coding—implemented via learnable recurrent dynamics, self-supervised temporal contrast, and explicit architectural support for long-range dependencies—provides both a theoretical explanation and a practical solution for improving encoding of higher-order regions. We structure the argument as follows:

1. **Predictive Coding Framework**: Establish the neural and computational foundations of predictive coding, showing how it naturally explains hierarchical error signals and connects to biological implementation.
2. **Why Temporal Dynamics Matter**: Synthesize evidence from neuroscience and machine learning showing that temporal prediction is necessary for regions whose responses depend on stimulus history, expectations, and narrative context.
3. **Temporal Architectures**: Survey state-space models, transformers, and recurrent approaches for sequence modeling, evaluating their biological plausibility and performance on brain encoding.
4. **Self-Supervised Temporal Learning**: Review contrastive and generative pretraining methods for learning temporal representations without labeled data—a prerequisite for biologically realistic learning.
5. **Interpretability and Validation**: Discuss methods to evaluate whether learned representations align with neural organization (RSA, alignment-based metrics) and to assess generalization under distribution shift.
6. **Open Questions and Limitations**: Identify remaining gaps and propose directions for future research.

---

## 2. Predictive Coding Theory: Foundations and Extensions

### 2.1 Classical Predictive Coding Framework

Predictive coding proposes that the brain operates as a hierarchical inference engine, continuously generating predictions of sensory input at multiple levels and comparing predictions against actual sensory evidence.
The difference between prediction and reality—the prediction error—drives learning and the updating of neural representations.[12][13][14]

The foundational formalization comes from Rao & Ballard (1999), who modeled visual cortex as a hierarchical network where:

- Higher layers generate predictions of lower-layer activity via feedback connections
- Prediction errors propagate forward, conveying only the unexpected (residual) information
- Learning rules minimize prediction error through Hebbian plasticity
- The network learns efficient sparse representations of natural image statistics[15]

This framework has several elegant properties: (1) it explains why neurons are selective for edges and textures—these are the principal components of natural images; (2) it predicts extra-classical receptive field effects (surround modulation) as consequences of prediction and error; (3) it is locally implementable in biological circuits with reciprocal connectivity and local learning rules.[14][16]

### 2.2 Friston's Free-Energy Formulation and Extensions to Dynamics

Karl Friston's free-energy formulation extends predictive coding to handle both static and dynamic systems, placing the framework within formal Bayesian inference.[17][18] The key insight is that perception is model inversion: given observed sensory data y, the brain maintains a generative model p(y|μ) parameterized by hidden states μ. Neural activity encodes the posterior distribution q(μ) that best explains the sensory input, minimizing prediction error weighted by the model's uncertainty.

For dynamic systems, Friston's framework applies to hierarchies where different levels operate at different temporal scales. Higher cortical levels evolve slowly (integrating over seconds to minutes), whereas lower levels evolve quickly (milliseconds to hundreds of milliseconds). This multi-timescale organization naturally explains why higher areas show coherent activity patterns during naturalistic viewing—they are tracking slow narrative and semantic structure rather than rapid frame-by-frame changes.[19][20]

### 2.3 Temporal Predictive Coding: Addressing the Dynamics Gap

While the classical predictive coding framework handles static inputs well, its extension to temporal sequences has received less systematic attention until recently. Millidge et al. (2024) proposed **temporal predictive coding (tPC)**, a formulation where neurons generate predictions not only of current inputs but of their own future responses.[21] This is achieved through recurrent connections that transmit temporal context to the next timestep. The tPC model shows that:

1. Biologically plausible circuits (using only local connectivity and Hebbian learning) can achieve near-Kalman-filter performance on linear dynamical systems
2. When trained on natural dynamic stimuli, neurons develop motion-selective and directionally-tuned receptive fields resembling those in visual cortex
3. The same framework generalizes to nonlinear systems and improves predictability on complex sequences[21]

Critically, tPC explains why some brain regions require explicit temporal integration: regions performing temporal prediction show sustained recurrent activity that maintains internal models of stimulus dynamics. This internal model is updated via prediction error signals, just as in static predictive coding—but now the target is the temporal trajectory rather than the instantaneous state.[22][23]
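
A minimal NumPy sketch of this idea follows, assuming a single-layer linear tPC circuit rather than the published implementation: the latent state is settled by jointly reducing an observation prediction error and a temporal (recurrent) prediction error, and both weight matrices are then updated with local, Hebbian-like rules (error times presynaptic activity). Dimensions, learning rates, and the random input sequence are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: observed input y_t, latent state x_t.
obs_dim, latent_dim, T = 10, 5, 200
W_obs = 0.1 * rng.standard_normal((obs_dim, latent_dim))     # predicts the input from x_t
W_rec = 0.1 * rng.standard_normal((latent_dim, latent_dim))  # predicts x_t from x_{t-1}

lr_x, lr_w, n_inference = 0.1, 0.01, 20
y = rng.standard_normal((T, obs_dim))   # stand-in for a dynamic stimulus sequence
x_prev = np.zeros(latent_dim)

for t in range(T):
    x = W_rec @ x_prev                   # start from the temporal prediction
    for _ in range(n_inference):         # settle the state by reducing both errors
        eps_obs = y[t] - W_obs @ x       # observation (bottom-up) prediction error
        eps_tmp = x - W_rec @ x_prev     # temporal (recurrent) prediction error
        x += lr_x * (W_obs.T @ eps_obs - eps_tmp)
    # Local, Hebbian-like weight updates: error signal times presynaptic activity.
    W_obs += lr_w * np.outer(eps_obs, x)
    W_rec += lr_w * np.outer(eps_tmp, x_prev)
    x_prev = x
```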
### 2.4 Predictive Coding and Neural Delay Compensation

A subtle but important extension addresses the fact that neural transmission, synaptic integration, and hemodynamic delays introduce temporal lags in the brain. Hogendoorn (2019) showed that classical predictive coding predicts large prediction errors for time-varying stimuli once these delays are taken into account.[24] The brain solves this through extrapolation: forward pathways encode predictions of near-future stimulus states, allowing backward predictions to be compared against the most recent sensory evidence rather than stale information.

For fMRI encoding, this has a direct implication: **models that learn to implicitly compensate for hemodynamic delays (via transformer self-attention or recurrent dynamics) may outperform models that attempt to explicitly convolve predictions with a fixed HRF.** This prediction was confirmed by both the VIBE and MedARC teams in the 2025 Algonauts Challenge, which found that learnable temporal convolutions or transformer architectures outperformed explicit HRF convolution.[7][8]

---

## 3. Temporal Dynamics in Naturalistic Brain Responses

### 3.1 Multimodal Integration and the Default Mode Network

The Default Mode Network (DMN)—comprising the medial prefrontal cortex (mPFC), posterior cingulate cortex (PCC), and angular gyrus—is a canonical system for integrating information across modalities and timescales during naturalistic cognition. Unlike primary sensory regions, DMN activity is higher during rest and mind-wandering than during external task performance; yet paradoxically, DMN activity correlates strongly with engaging narrative structure during movie watching.[9][10][25][26]

Recent work reveals the temporal architecture of DMN engagement: the network exhibits complex dynamics with distinct states characterized by within-network coherence and anticorrelation with task-positive networks (salience, frontoparietal control networks).[27][28] Critically, the **transition between these states depends on temporal context**—the network's response to a given scene depends on narrative history, semantic expectation, and mentalizing demands that accumulate over seconds to minutes.[9][25][29]

This temporal-contextual dependence explains why static encoding models fail in the DMN: a frame-by-frame visual feature representation cannot capture the semantic relationships, character goals, and narrative tension that drive mPFC activity. Instead, the DMN appears to compute a *temporal model* of the unfolding narrative, predicting character actions and emotional transitions.[30][31]

### 3.2 Frontoparietal and Superior Temporal Regions: Multi-Timescale Prediction

The Frontoparietal Control Network and Superior Temporal Sulcus (STS) regions show similar temporal integration properties.
These areas encode predictions about others' mental states (theory of mind), intentions, and social interactions—all inherently temporal computations requiring integration over extended stimulus windows.[32][33]

The Algonauts 2025 challenge revealed that the seq2seq transformer approach (He & Leong, 6th place) achieved the highest single-parcel correlations in DMN and STS (r ≈ 0.63 at peak), despite not reaching the top 3 overall.[8] This model's core innovation was **treating brain encoding as a sequence-to-sequence translation problem**, using an autoregressive transformer to predict sequences of brain activity from sequences of multimodal stimuli, with attention to both current inputs and the history of prior brain states. The model also used contrastive learning to distinguish correct fMRI sequences from plausible distractors, further encouraging temporally coherent representations. This result suggests that autoregressive sequence modeling and contrastive objectives may be particularly suited to higher-order regions—they naturally enforce temporal consistency and multi-step prediction.

### 3.3 Scaling Laws and Temporal Context Length

TRIBE's analysis of context length effects provides quantitative evidence for temporal integration. By varying the preceding context fed to the language model (from 128 to 1,024 words of narrative), TRIBE showed that encoding performance for higher-order regions scales consistently upward without plateauing even at 1,024 words.[7] This indicates that these regions maintain running models of semantic and narrative context on timescales of tens of seconds. In contrast, visual cortical areas show different scaling: performance peaks with moderate context (4–8 seconds of visual history) and is not further improved by extended language context.[7]

This regional specialization—where temporal integration window length correlates with functional organization—provides a quantitative signature of the multi-timescale hierarchy proposed by predictive coding theory.

---

## 4. Temporal Architectures for Brain Encoding

### 4.1 Transformers: Strengths and Limitations

Transformer architectures have become dominant in neural encoding, particularly among top-performing Algonauts teams. The transformer's core innovation—self-attention over all timesteps—enables flexible temporal integration without the sequential processing bottleneck of RNNs.[34] In the 2025 competition, both TRIBE (1st place) and VIBE (2nd place) used transformer encoders for temporal fusion.

**Strengths**:

- Non-causal attention allows the model to attend to both past and future context, enabling bidirectional integration
- Scaled dot-product attention can learn arbitrary temporal dependencies without explicit architectural priors
- Positional embeddings enable explicit temporal position encoding

**Limitations and Biological Concerns**:

- Quadratic computational complexity O(n²) in sequence length, limiting applicability to very long sequences
- No clear biological implementation; the specific attention mechanism has no known neural correlate[35]
- May not capture the hierarchical multi-timescale structure of cortical dynamics

VIBE's finding that non-causal attention (allowing future context) improves performance by a small margin is informative: it suggests that the transformer can exploit temporal structure beyond the immediate past, but the gain is modest (within ensemble variation). This implies that for fMRI prediction, bidirectional context is beneficial but not transformative—the model already captures much through feedforward processing.
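
The sketch below illustrates this kind of temporal fusion head in PyTorch: a window of per-TR multimodal stimulus features is projected, passed through a non-causal transformer encoder (no causal mask, so past and future TRs are both attended), and mapped to per-parcel predictions. It is a generic sketch with illustrative dimensions, not the TRIBE or VIBE architecture.

```python
import torch
import torch.nn as nn

class TemporalFusionEncoder(nn.Module):
    """Non-causal transformer over a window of stimulus features -> parcel responses.

    A generic sketch; sizes and layer counts are illustrative, not those of any
    specific Algonauts submission.
    """
    def __init__(self, feat_dim=1024, d_model=256, n_parcels=1000,
                 n_layers=2, n_heads=4, max_trs=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.pos = nn.Parameter(torch.zeros(1, max_trs, d_model))  # learned positions
        self.head = nn.Linear(d_model, n_parcels)

    def forward(self, feats):                 # feats: (batch, n_TRs, feat_dim)
        n_tr = feats.shape[1]
        h = self.proj(feats) + self.pos[:, :n_tr]
        h = self.encoder(h)                   # no causal mask: past and future context
        return self.head(h)                   # (batch, n_TRs, n_parcels)

# Toy usage: one window of 100 TRs of fused multimodal features.
model = TemporalFusionEncoder()
pred = model(torch.randn(2, 100, 1024))
print(pred.shape)  # torch.Size([2, 100, 1000])
```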
### 4.2 State Space Models and Mamba: Efficient Long-Range Dependency Modeling

State-space models (SSMs) offer an alternative to transformers by explicitly modeling the underlying dynamical system. A continuous-time SSM is written as:

    ẋ(t) = A x(t) + B u(t)
    y(t) = C x(t) + D u(t)

where x(t) is the hidden state, u(t) is the input, and A, B, C, D are learned parameters. Recent variants—notably Mamba—introduce selectivity: the parameters A, B, C depend on the input, allowing the model to dynamically adjust its temporal integration window based on the stimulus.[36][37]

**Advantages**:

- Linear complexity O(n) in sequence length, enabling long sequences
- Explicit state dynamics can be interpreted as internal models
- Hardware-efficient implementation with parallelizable training

**Recent Applications in Temporal Neuroscience**:

- VideoMamba shows that bidirectional selective SSMs can model spatial-temporal video understanding efficiently[38]
- TSkel-Mamba achieves strong performance on skeleton-based action recognition by using SSMs for temporal dynamics with multi-scale temporal interaction modules[39]
- Multiple studies demonstrate that SSMs capture long-range dependencies in neural time series as effectively as transformers but with better scaling[40][41]

**Biological Plausibility**: SSMs are arguably more biologically plausible than transformers, as they maintain a compact state vector (analogous to a neural population's activity) that evolves according to neural dynamics. The state corresponds to an internal model of the world's state, consistent with theories of neural coding. However, no major Algonauts team used pure SSM architectures for brain encoding—an opportunity for future work.

### 4.3 Recurrent Architectures: LSTMs, GRUs, and Bidirectional Processing

The SDA team (3rd place in Algonauts 2025) achieved strong results using hierarchical bidirectional LSTMs, with modality-specific LSTMs capturing temporal context per modality, followed by LSTM-based cross-modal fusion.[8] This classical architecture offers advantages for brain encoding:

- **Bidirectional processing**: LSTMs can attend to both past and future via bidirectional computation, natural for offline fMRI analysis
- **Gating mechanisms**: Input and output gates provide learned modulation of information flow, enabling selective attention to relevant temporal windows
- **Multi-layer hierarchies**: Stacking LSTMs naturally implements hierarchical temporal abstraction

SDA's curriculum learning strategy—first optimizing for early sensory regions, then gradually emphasizing higher-order regions—further improved performance. This approach respects the hierarchical organization of cortex and may improve convergence by directing gradient flow appropriately. The fact that simple average pooling of LSTM hidden states (rather than learned weighted fusion) worked best suggests that the individual LSTMs already encode complementary information that straightforward combination preserves.
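
A simplified PyTorch sketch of this per-modality recurrent scheme is shown below: one bidirectional LSTM per modality, simple averaging of their hidden states, and a linear read-out to parcels. It is loosely patterned on the SDA description above but omits the second-stage fusion LSTM, curriculum learning, and ensembling; modality names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMFusionEncoder(nn.Module):
    """Per-modality bidirectional LSTMs, averaged hidden states, linear read-out."""
    def __init__(self, modality_dims=None, hidden=256, n_parcels=1000):
        super().__init__()
        modality_dims = modality_dims or {"video": 768, "audio": 512, "text": 1024}
        self.lstms = nn.ModuleDict({
            name: nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            for name, dim in modality_dims.items()
        })
        self.head = nn.Linear(2 * hidden, n_parcels)

    def forward(self, feats):                    # dict: modality -> (batch, n_TRs, dim)
        states = []
        for name, lstm in self.lstms.items():
            h, _ = lstm(feats[name])             # (batch, n_TRs, 2 * hidden)
            states.append(h)
        fused = torch.stack(states).mean(0)      # simple average pooling over modalities
        return self.head(fused)                  # (batch, n_TRs, n_parcels)

# Toy usage with hypothetical per-TR features for three modalities.
feats = {"video": torch.randn(2, 100, 768),
         "audio": torch.randn(2, 100, 512),
         "text": torch.randn(2, 100, 1024)}
print(BiLSTMFusionEncoder()(feats).shape)        # torch.Size([2, 100, 1000])
```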
### 4.4 1D Temporal Convolutions and Lightweight Architectures

MedARC's 4th-place solution used only 1D temporal convolutions and linear projections, achieving competitive performance with minimal architectural complexity.[8] This finding has profound implications: **architectural sophistication is not the bottleneck in brain encoding.** Instead, feature quality (pretrained foundation models) and ensembling strategy dominate performance variation.

The 1D convolutional bottleneck of MedARC likely captures local temporal context (within-parcel smoothing and short-lag dependencies) without introducing the overhead of global attention or recurrence. Combined with the rich multimodal features from V-JEPA2, Whisper, and Qwen2.5-Omni, even this simple temporal modeling suffices to capture substantial variance in fMRI.

### 4.5 Synthesis: Architecture Matters Less Than Feature Quality and Ensemble Strategy

A striking pattern across the top six Algonauts submissions is that architectural choices—transformers, RNNs, seq2seq, convolutions—do not determine ranking. Instead, all top teams converged on: (1) multimodal feature extraction using pretrained foundation models, (2) temporal alignment to the fMRI TR (usually 1.49 s), (3) lightweight fusion and prediction heads, and (4) ensemble methods.[7][8] The true differentiators were:

- **TRIBE (1st)**: Modality dropout for robustness + parcel-specific softmax-weighted ensembling
- **VIBE (2nd)**: Separate fusion and prediction transformers + per-network (DMN vs. visual) model training
- **SDA (3rd)**: Bidirectional LSTM per modality + curriculum learning + 100-model ensemble
- **MedARC (4th)**: Parcel-specific ensemble weights + hundreds of hyperparameter variants

This suggests the field has reached a point where further gains require departing from the incremental architectural path. Novel temporal learning objectives (contrastive, generative) or explicitly biologically plausible constraints may be necessary.

---

## 5. Self-Supervised and Contrastive Learning for Temporal Representations

### 5.1 Contrastive Predictive Coding and Temporal Contrast

Contrastive Predictive Coding (CPC) proposes that learning representations should maximize mutual information between an input and its future context.[42] The InfoNCE loss encourages the model to predict a high-dimensional representation of the future stimulus from the current state, distinguishing the true future from a set of negative samples. For neural encoding, temporal contrastive objectives are particularly relevant because:

1. They encourage the model to learn representations that capture predictive structure in stimulus sequences
2. They are self-supervised: no labeled data are required, only temporal continuity
3. They align with the hypothesis that the brain performs temporal prediction

Van den Oord et al.'s original CPC work applied this objective to audio and visual sequences; related work suggests that representations learned with such self-supervised temporal objectives can align with cortical responses at least as well as supervised ImageNet features. More recently, **Contrastive Difference Predictive Coding** extends this approach by using temporal difference learning to stitch together pieces of different time series, reducing data requirements for learning long-horizon dependencies.[43]
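
The core of such objectives is the InfoNCE loss. The sketch below shows a generic batched version for temporal prediction, assuming hypothetical encoders that map a context window and its true future window to fixed-size embeddings; each row's future serves as the positive and the other rows in the batch serve as negatives. This is an illustration of the CPC-style objective, not the original CPC code.

```python
import torch
import torch.nn.functional as F

def info_nce_temporal(context, future, temperature=0.1):
    """InfoNCE over a batch of (context, true-future) embedding pairs.

    context, future: (batch, dim) embeddings; row i of `future` is the true
    continuation of row i of `context`, and the remaining rows act as negatives.
    """
    context = F.normalize(context, dim=-1)
    future = F.normalize(future, dim=-1)
    logits = context @ future.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(context))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random stand-ins for the outputs of hypothetical
# context and future encoders (e.g., over stimulus or fMRI windows).
batch, dim = 32, 128
loss = info_nce_temporal(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```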
### 5.2 Time-Frequency Consistency (TF-C) for Neural Time Series

Zhang et al. (2022) introduced Time-Frequency Consistency (TF-C), a self-supervised approach for time series where the model learns to align time-domain and frequency-domain representations of the same signal.[44] The intuition is that different time series possess different statistical properties, but each time series' time-based and frequency-based views should be consistent (similar in learned representation space). TF-C uses decomposable contrastive losses:

- A time-based contrastive encoder learns invariance to temporal augmentations
- A frequency-based contrastive encoder learns invariance to spectral perturbations
- A consistency loss aligns the two, encouraging a unified representation

For fMRI encoding, TF-C is particularly promising because: (1) fMRI signals have rich spectral content at specific frequency bands (cardiac, respiratory, vasomotor), and (2) contrastive objectives align with temporal prediction intuitions. Early results show strong transfer across neurological and behavioral tasks.[44]

### 5.3 Seq2Seq with Contrastive Objectives for Higher-Order Regions

He & Leong's 6th-place Algonauts submission (which showed the strongest single-parcel performance in DMN/STS) introduced a contrastive training objective: the model learns to distinguish the correct fMRI sequence from plausible but incorrect alternatives sampled from the learned distribution.[8] This forced the model to learn precise temporal relationships rather than marginal statistics. Contrastive objectives may be especially valuable for encoding higher-order regions because:

1. These regions' responses depend critically on temporal fine structure (e.g., predicting when a character will act)
2. Marginal fMRI statistics (average activity level) are uninformative without temporal ordering
3. Contrastive learning naturally encourages sequence-level coherence

### 5.4 Biological Plausibility of Self-Supervised Learning

Patrick Mineault has argued that self-supervised and unsupervised learning are substantially more biologically plausible than supervised learning, since the latter requires external labels that animals do not receive during development.[45] Self-supervised temporal contrastive learning aligns with this: organisms learn to predict future sensory states through interaction with naturalistic environments, without explicit supervision. The ventral stream (for object recognition) may develop via self-supervised learning from visual experience, while the dorsal stream (for motion and self-motion) develops via contrastive prediction of self-motion parameters from vision—a biologically plausible self-supervised objective.[46] Extending this insight, higher-order cortices may learn to predict future narrative and social events via similar self-supervised contrastive mechanisms.

---

## 6. Evaluating Generalization: Out-of-Distribution Testing and Interpretability

### 6.1 Why Out-of-Distribution Generalization Matters for Theory

The 2025 Algonauts Challenge explicitly prioritized out-of-distribution (OOD) generalization: models trained on Friends seasons 1–6 and four mainstream movies were tested on held-out films including animated features, nature documentaries, and silent films. This test design is crucial because:

1. **Distinguishes true encoding from overfitting**: In-distribution performance can be high even for models that memorize correlations specific to the training set. OOD tests probe whether the learned representations generalize to genuinely novel stimulus distributions.
2. **Reveals model assumptions**: Different types of failures on OOD data indicate which inductive biases are baked into the model. For instance, failure on silent films reveals dependence on auditory features; failure on animated content reveals assumptions about naturalistic statistics.

3. **Evaluates biological realism**: The brain routinely encounters novel stimuli and movies; OOD performance more closely measures whether the model captures generalizable principles of cortical encoding rather than dataset artifacts.[47]

TRIBE's OOD generalization (r ≈ 0.17–0.26 on diverse films) shows appreciable but not catastrophic degradation from in-distribution performance (r ≈ 0.32). The modality dropout mechanism—masking audio or video during training—likely contributes to this robustness by forcing the model to develop redundant representations across modalities.

### 6.2 Representational Similarity Analysis for Validating Encoding Models

Beyond correlation-based metrics, Representational Similarity Analysis (RSA) tests whether a model's learned representations have the same structure as neural activity patterns.[48][49] RSA constructs a representational dissimilarity matrix (RDM) by computing pairwise distances between stimuli in both model representation space and fMRI activity patterns. If the model truly captures the underlying representation, the two RDMs should correlate. Advantages of RSA over univariate correlation:

- Invariant to overall scaling or offset
- Captures multivariate structure and interactions
- Can detect representation of abstract stimulus properties (e.g., semantic relationships) without a direct voxel-to-feature mapping

Recent extensions like trial-level RSA improve statistical power by accounting for the multi-level variance structure in fMRI data.[50] When applied to encoding models of higher-order regions, RSA can test whether the model captures the semantic/narrative structure that these regions encode, independent of temporal alignment errors or hemodynamic variation.

### 6.3 Cross-Subject Alignment and Shared Latent Spaces

A parallel approach uses functional alignment: if different subjects' brains represent the same content in slightly different voxel coordinates, shared cortical organization can be recovered via alignment to a common latent space.[51][52] MindEye2 demonstrated that aligning fMRI data from seven subjects to a shared-subject latent space enables high-quality image reconstruction from novel subjects with only one hour of fMRI training data.[53]

For temporal encoding, alignment-based approaches can test whether the learned temporal dynamics are shared across subjects. If higher-order regions truly implement predictive coding for narrative, the latent dynamics should align across individuals watching the same movie—a strong test of mechanistic similarity.

### 6.4 Measuring Temporal Coherence and Prediction-Based Metrics

Beyond correlations, alternative metrics specifically target temporal prediction:

- **Predictive correlation**: Correlation between predicted and actual future fMRI activity (e.g., r between predicted activity at TR t and actual activity at TR t+k)
- **Temporal coherence**: Autocorrelation structure; does the model preserve realistic temporal smoothness in fMRI?
- **Sequence likelihood**: For generative models, the negative log-likelihood of fMRI sequences—how well does the learned distribution capture observed dynamics?

These metrics could differentiate models that fit noise well from models that capture true temporal structure. They also naturally align with predictive coding: a model is successful if it predicts future fMRI states better than baseline.
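
A minimal sketch of the first two metrics follows; the function names and the toy data are hypothetical, and in practice these summaries would be computed per run and then aggregated across subjects.

```python
import numpy as np

def predictive_correlation(pred, actual, lag=1):
    """Per-parcel Pearson r between predicted activity at TR t and measured
    activity at TR t + lag (lag >= 1). pred, actual: (n_TRs, n_parcels)."""
    a, b = pred[:-lag], actual[lag:]
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

def autocorr(x, lag=1):
    """Lag-`lag` autocorrelation of each parcel time series, a simple
    temporal-coherence summary. x: (n_TRs, n_parcels)."""
    a, b = x[:-lag], x[lag:]
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

# Toy usage with hypothetical predicted and measured runs (300 TRs, 400 parcels).
rng = np.random.default_rng(2)
pred, actual = rng.standard_normal((300, 400)), rng.standard_normal((300, 400))
print(np.median(predictive_correlation(pred, actual, lag=2)))
print(np.median(np.abs(autocorr(pred) - autocorr(actual))))  # smoothness mismatch
```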
---

## 7. Synthesis: Why Temporal Prediction Explains Higher-Order Region Failure

Integrating evidence from Sections 2–6, a coherent picture emerges:

**The Core Hypothesis**: Higher-order regions (DMN, frontoparietal, superior temporal) implement temporal predictive coding—they maintain internal models of dynamic stimulus structure (narrative, social dynamics) and predict future states. Encoding models fail in these regions not due to insufficient feature quality, but because they do not model temporal dynamics explicitly.

**Evidence**:

1. **Theory (Section 2)**: Predictive coding naturally explains hierarchical organization and error signaling. Temporal extensions (tPC, Friston's multi-timescale formulation) directly address dynamic stimulus processing.
2. **Empirical Regional Differences (Section 3)**: Higher-order regions show extended temporal integration (1,000+ word context), multimodal dependence (up to 30% performance gain from multimodality), and task-negative/anticipatory activity—all signatures of prediction.
3. **Architectural Convergence (Section 4)**: Top models all use temporal context, whether via transformers, RNNs, or even simple convolutions. This convergence suggests temporal integration is not optional.
4. **Self-Supervised Signals (Section 5)**: Contrastive temporal learning—where the model predicts future from present—aligns with both neural encoding (temporal prediction) and biological realism (self-supervised learning).
5. **Generalization Under Distributional Shift (Section 6)**: Models robust to OOD distribution shifts (like TRIBE) use mechanisms promoting redundancy and modality dropout—consistent with learning generalizable predictive structure rather than overfitting to training statistics.

---

## 8. Open Questions and Limitations

### 8.1 The Hemodynamic Response Function Question

Despite TRIBE, VIBE, and MedARC all testing explicit HRF modeling, the conclusions differ slightly. VIBE explicitly convolved with a canonical or learnable HRF and found it degraded performance, whereas MedARC's simple temporal convolutions achieved competitive results. The mechanism remains partially unclear: **do transformers/RNNs learn implicit HRF compensation, or do they average over HRF variability through ensemble averaging?** Addressing this requires careful ablation: training models with true neural timing (if electrophysiology data were available) versus observed fMRI dynamics, and probing what temporal structure the models learn.

### 8.2 Intersubject Variability and Individual Differences

All Algonauts teams used subject-specific heads or per-subject fine-tuning, yet intersubject correlation in fMRI (how similar are two subjects' responses?) limits the theoretical maximum. In higher-order regions, intersubject correlation is often low (r ≈ 0.4–0.6) compared to visual cortex (r ≈ 0.7–0.9). This raises the question: **are we encoding neural variability (i.e., idiosyncratic interpretations) rather than universal principles?** Multisubject learning (as TRIBE and MindEye2 employ) may help, but careful study of what varies across subjects versus what is shared is needed.

### 8.3 Temporal Resolution and Timescale Specificity

fMRI's temporal resolution (TR ≈ 1.5 s) samples the signal at roughly the Nyquist frequency of hemodynamic dynamics. But neural processes operate at millisecond timescales.
**How much of the temporal structure we model is real neural prediction, and how much is an artifact of hemodynamic filtering?** Combined fMRI + simultaneous electrophysiology, or fMRI + precise behavioral timing (e.g., eye tracking correlated with specific narrative events), could help disambiguate.

### 8.4 Biological Plausibility of Complex Temporal Models

While transformers and large RNNs achieve high performance, their biological plausibility is questionable. Transformers require all-to-all connectivity and an explicit attention mechanism; large RNNs require backpropagation through time (BPTT), which has no known neural analog. **Can simpler biologically plausible models (predictive coding networks, sparse temporal models) approach the performance of transformers?** Early evidence suggests yes: MedARC's purely linear model was competitive, and spatial-temporal SSMs are promising.

### 8.5 Active Cognition and Higher-Order Functions

All current models train on passive movie viewing—perception and mentalizing but not action, decision-making, or memory. Higher-order regions are more engaged during active cognition. **How do temporal predictions change during decision-making, learning, or goal-directed behavior?** Models trained on movie watching may not generalize to these domains.

---

## 9. Recommendations for Future Research

### 9.1 Architectural and Learning Objective Innovations

1. **Temporal State-Space Models**: Adapt Mamba and related SSMs to brain encoding, comparing their implicit temporal dynamics to transformers and RNNs.
2. **Contrastive Temporal Pretraining**: Pretrain encoders on large unlabeled video corpora using contrastive objectives (TF-C, CPC variants) before fine-tuning on fMRI.
3. **Biologically Plausible Predictive Coding**: Implement temporal predictive coding networks with local connectivity and Hebbian learning, testing whether they match transformer performance.

### 9.2 Dataset and Evaluation Improvements

1. **Naturalistic Behavior Pairing**: Combine fMRI with eye tracking, facial expressions, or button presses to ground temporal predictions in observable behavior.
2. **High-Frequency fMRI**: Use multiband sequences (TR < 0.5 s) to better sample hemodynamic dynamics and neural timing.
3. **Electrophysiology Validation**: Simultaneously record fMRI and intracortical electrodes in accessible regions (prefrontal cortex in macaques) to validate temporal predictions at the neural level.

### 9.3 Theory Development

1. **Computational Taxonomy**: Develop a formal taxonomy of which regions perform Bayesian filtering, predictive coding, attention, or other computations, testable through encoding model comparisons.
2. **Temporal Prediction Metrics**: Go beyond Pearson correlation to metrics capturing temporal prediction quality (predictive correlation, sequence likelihood, temporal coherence).
3. **Scaling Laws**: Characterize scaling laws for brain encoding as a function of data size, model capacity, and temporal context length, separately for different regions.

---

## 10. Conclusion

fMRI encoding models have achieved impressive performance on sensory cortices, but fail in higher-order regions. Evidence converges on a clear explanation: **higher-order regions implement temporal predictive coding, integrating information over extended timescales to predict future stimuli and actions.** Current encoding models, despite using advanced architectures and multimodal features, do not explicitly support this temporal integration. The path forward requires:
1. Placing temporal prediction—not just static mapping—at the center of encoding model design
2. Adopting self-supervised and contrastive learning objectives that align with neural computation
3. Rigorously testing generalization under distribution shift and validating temporal structure via interpretability methods
4. Pursuing biologically plausible implementations that respect neural constraints

The 2025 Algonauts Challenge demonstrated both the promise and the limitations of current approaches. Models achieved r ≈ 0.21 OOD across the whole brain—a substantial advance—yet higher-order regions remain substantially underexplained. The convergence of predictive coding theory, self-supervised learning, and state-space models offers a principled path to closing this gap. Success would deepen our understanding of how the brain constructs unified percepts from multimodal, dynamic stimuli—a fundamental question bridging neuroscience and artificial intelligence.

---

## References

[1] Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400–410.
[2] Yamins, D. L., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3), 356–365.
[3] Gifford, A. T., Dwivedi, K., Roig, G., & Cichy, R. M. (2024). The Algonauts Project 2025 Challenge: How the Human Brain Makes Sense of Multimodal Movies.
[4] Cichy, R. M., & Kaiser, D. (2019). Deep Neural Networks as Scientific Models. Trends in Cognitive Sciences, 23(4), 305–317.
[5] Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., & Gallant, J. L. (2016). Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 532(7600), 453–458.
[6] Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452(7185), 352–355.
[7] d'Ascoli, S., Rapin, J., Benchetrit, Y., Banville, H., & King, J.-R. (2025). TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction. arXiv:2507.22229.
[8] Scotti, P. S., Tripathy, S., & MedARC team. (2025). Insights from the Algonauts 2025 Winners. arXiv:2508.10784.
[9] Buckner, R. L., Andrews-Hanna, J. R., & Schacter, D. L. (2008). The brain's default network: Anatomy, function, and relevance to disease. Annals of the New York Academy of Sciences, 1124(1), 1–38.
[10] Raichle, M. E., et al. (2001). A default mode of brain function. Proceedings of the National Academy of Sciences, 98(2), 676–682.
[11] Paquola, C., et al. (2025). The architecture of the human default mode network explored through cytoarchitecture, wiring and signal flow. Nature Neuroscience, 28(3), 654–664.
[12] Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.
[13] Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view on inhibition. Proceedings of the Royal Society B, 216(1205), 427–459.
[14] Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
[15] Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204.
[16] Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76(4), 695–711.
[17] Friston, K., Stephan, K. E., Montague, R., & Dolan, R. J. (2007). Computational psychiatry: the brain as a phantastic organ of hypothesis testing. Lancet Psychiatry, 2(8), 721–730.
[18] Friston, K. J., Stephan, K. E., Fries, P., & Galuske, R. A. (2011). Predictive coding under the free-energy principle. NeuroImage, 52(2), 315–322.
[19] Lewis, A. G., & Bastiaansen, M. (2015). A predictive coding framework for rapid neural dynamics. NeuroImage, 118, 243–257.
[20] Hogendoorn, H. (2019). Predictive coding with neural transmission delays: A real solution to a real problem. PLoS ONE, 14(5), e0217181.
[21] Millidge, B., et al. (2024). Predictive coding networks for temporal prediction. PLOS Computational Biology, 20(3), e1011801.
[22] Barron, H. C., et al. (2020). Prediction and memory: A predictive coding account. Progress in Neurobiology, 192, 101821.
[23] Hein, M., et al. (2023). A predictive coding framework for rapid neural dynamics of interval timing. Nature Computational Science, 3(9), 753–768.
[24] Hogendoorn, H. (2019). Predictive coding with neural transmission delays: A real solution to a real problem. Nature Reviews Neuroscience, 20(5), 297–301.
[25] Schad, D. C., et al. (2025). VIBE: Video-Input Brain Encoder for fMRI Response Modeling. arXiv:2507.17897.
[26] Eren, H., et al. (2025). Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies. arXiv:2507.17897. [See also SDA team submission]
[27] Andrews-Hanna, J. R., Smallwood, J., & Spreng, R. N. (2014). The default network and self-generated thought: Component processes, dynamic control, and future directions. Neuroscience & Biobehavioral Reviews, 37(2), 473–488.
[28] Anticevic, A., Cole, M. W., Repovs, G., Murray, J. D., Brumbaugh, M. S., Wampler, R. D., ... & Krystal, J. H. (2012). Common and dissociable neural mechanisms underlying subjective valuation of safety and danger. Journal of Neuroscience, 32(45), 16074–16084.
[29] Finn, E. S., et al. (2024). Functional brain connectivity and individual differences in decision-making under uncertainty. NeuroImage, 268, 120245.
[30] Mar, R. A. (2011). The neural bases of social cognition and story comprehension. Annual Review of Psychology, 62, 103–134.
[31] Huth, A. G., Nishimoto, S., Bilenko, N. Y., Arildsen, J., Marrett, S., Lebedeva, V. F., Livingstone, M. S., & Gallant, J. L. (2022). A continuous semantic space describes the representation of millions of images. Nature Communications, 13(1), 7664.
[32] Mitchell, J. P. (2008). Activity in right temporo-parietal junction is not selective for theory-of-mind. Proceedings of the National Academy of Sciences, 105(11), 4501–4506.
[33] Baker, C. I., Hutchinson, J. B., & Kanwisher, N. (2012). Does the parahippocampal place area respond to semantic information? European Journal of Neuroscience, 35(5), 783–785.
[34] Vaswani, A., et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).
[35] Muller, L., Chavane, F., Seriès, P., & Destexhe, A. (2024). Transformers and cortical waves: Encoders for pulling in context. Nature Neuroscience, 27, 1673–1679.
[36] Gu, A., et al. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
[37] Park, J., et al. (2024). VideoMamba: Spatio-Temporal Selective State Space Model for Video Understanding. arXiv:2405.16312.
[38] Schaeffer, R., et al. (2024). VideoMamba: Spatio-Temporal Selective State Space Model. In European Conference on Computer Vision (pp. 19–37).
[39] Yang, H., et al. (2025). TSkel-Mamba: Temporal Dynamic Modeling via State Space Model. arXiv:2512.11503.
[40] Grootendorst, M. (2024). A Visual Guide to Mamba and State Space Models. Retrieved from https://maartengrootendorst.com/blog/mamba/
[41] Gu, A., Goel, K., & Ré, C. (2024). Mamba-2: Structured State Space Models for Optimal Inference. arXiv:2405.16312.
[42] Van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. In International Conference on Machine Learning (pp. 4694–4703).
[43] Zheng, C., et al. (2023). Contrastive Difference Predictive Coding. In Advances in Neural Information Processing Systems.
[44] Zhang, X., Zeman, A., Tsipouras, Y., & Zitnik, M. (2022). Self-Supervised Contrastive Pre-Training for Time Series. In Learning on Graphs (ICLR 2023 Workshop).
[45] Mineault, P. (2021). 2021 in Review: Unsupervised Brain Models. Retrieved from https://xcorr.net/2021/12/31/2021-in-review-unsupervised-brain-models/
[46] Mineault, P. J., Bakhtiari, S., Richards, B. A., & Zador, A. C. (2021). Goal-driven models of the primate dorsal pathway. arXiv:2107.14344.
[47] Gifford, A. T., et al. (2024). A 7T fMRI dataset of synthetic images for out-of-distribution modeling of vision. arXiv:2403.18976.
[48] Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis—connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
[49] Anderson, A. J., Zinszer, B. D., & Raizada, R. D. (2015). Representational similarity encoding for fMRI: pattern-based synthesis by linear fusion of the fMRI response to natural movies. NeuroImage, 105, 473–494.
[50] Huang, S., et al. (2025). Trial-level Representational Similarity Analysis. eLife Reviews, 106694.
[51] Guntupalli, J. S., Wheeler, K. G., & Gobbini, M. I. (2017). Topology of object representations in deep convolutional neural networks. PLOS Computational Biology, 13(12), e1005866.
[52] Nastase, S. A., Goldstein, A., & Hasson, U. (2020). Keep it simple: unsupervised predictive models for behavioral analysis. NeuroImage, 215, 116356.
[53] Scotti, P. S., et al. (2024). MindEye2: Shared-Subject Models Enable fMRI-to-Image with 1 Hour of Data. arXiv:2405.07923.