## List of Revisions
The lines refer to those in the submitted manuscript, where 'left' or 'right' indicates the left or right column, respectively.
- **Line 013 (left)**: “unkown” → “unknown”
- **Line 023**: Replaced "DISCO" with a LaTeX macro and ensured it is followed by a space when necessary (this change also applies to Lines 064, 081 (left), etc.).
- **Line 074 (left)**: Updated to:
“These methods often preserve key structural properties of physics, such as continuous-time evolution and translation equivariance (as defined in Mallat, 1999), where a spatial translation of the initial condition results in the same translation of the solution in the absence of boundary conditions. In contrast, transformers do not naturally inherit these properties.”
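For readability, the quoted property can be summarized in one line; the notation $T_\tau$ for translation by $\tau$ is ours, introduced for illustration and not taken from the manuscript:
```latex
% Translation equivariance of the solution map f (our notation, for illustration):
% translating the initial condition by \tau translates the solution by the same \tau.
f(T_\tau u_0) = T_\tau\, f(u_0), \qquad \text{where } (T_\tau u)(x) := u(x - \tau).
```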
- **Line 095 (right)**: Now includes additional references:
“While classical PDE solvers remain the state-of-the-art for achieving high precision, neural network-based surrogate solvers (Long et al., 2018; Hsieh et al., 2019; Li et al., 2020; Gelbrecht et al., 2021; Kovachki et al., 2023; Verma et al., 2024; Liu-Schiaffini et al., 2024) ...”
- **Line 156 (left)**: Now includes additional references:
“Without explicitly integrating an operator, transformer models trained to predict the next frame from a context of previous frames via attention have been applied to multi-physics-agnostic prediction (Yang et al., 2023; Liu et al., 2023; Yang & Osher, 2024; McCabe et al., 2024; Hao et al., 2024; Cao et al., 2024; Serrano et al., 2024).”
- **Line 165 (right)**: Updated to:
“where $\alpha \in \mathbb{R}^{d_2}$ are learnable parameters, while $\theta$ are parameters predicted by $\psi$, with $d_1$ and $d_2$ being the sizes of the operator network and hypernetwork, respectively.”
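Spelled out, the quoted sentence amounts to the following; the context argument $u_{t-k:t}$ of $\psi$ is our shorthand, chosen for illustration:
```latex
% Operator parameters \theta are predicted by the hypernetwork \psi_\alpha from the
% context of past frames; only the hypernetwork parameters \alpha are trained directly.
\theta = \psi_\alpha\!\left(u_{t-k:t}\right) \in \mathbb{R}^{d_1},
\qquad \alpha \in \mathbb{R}^{d_2}.
```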
- **Line 184 (right)**: Added a paragraph titled “Operator network architecture $f_\theta$” after the “Numerical integration” paragraph; it starts with:
“The operator network $f_\theta$ in our DISCO model has to be fast (because it will be integrated) and small (to enforce a bottleneck).”
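To make the “fast and small” requirement concrete, here is a minimal sketch of what such an operator network can look like: a two-layer convolutional network whose weights are not stored in a module but supplied as a flat vector $\theta$ (to be predicted by the hypernetwork). The channel count, kernel size, activation, and parameter layout are illustrative assumptions, not the exact operator from the manuscript.
```python
# Minimal sketch (not the authors' exact architecture): a small convolutional operator
# f_theta whose weights are supplied as a flat vector theta. Sizes are illustrative.
import torch
import torch.nn.functional as F

HIDDEN, KSIZE = 16, 3          # illustrative hidden-channel count and kernel size

def theta_size(channels: int) -> int:
    """Number of parameters of the two-layer operator below."""
    w1 = HIDDEN * channels * KSIZE * KSIZE
    w2 = channels * HIDDEN * KSIZE * KSIZE
    return w1 + HIDDEN + w2 + channels

def f_theta(u: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Right-hand side du/dt = f_theta(u) for a state u of shape (batch, channels, H, W)."""
    c = u.shape[1]
    i = 0
    w1 = theta[i:i + HIDDEN * c * KSIZE * KSIZE].view(HIDDEN, c, KSIZE, KSIZE); i += w1.numel()
    b1 = theta[i:i + HIDDEN]; i += HIDDEN
    w2 = theta[i:i + c * HIDDEN * KSIZE * KSIZE].view(c, HIDDEN, KSIZE, KSIZE); i += w2.numel()
    b2 = theta[i:i + c]
    h = F.gelu(F.conv2d(u, w1, b1, padding=KSIZE // 2))
    return F.conv2d(h, w2, b2, padding=KSIZE // 2)
```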
- **Line 184 (right)**: The paragraph “Hypernetwork design” was merged with the section “Hypernetwork architecture $\psi$”. It has been trimmed, and the details about the encoder kernel sizes, number of attention blocks, and nonlinearity have been moved to the appendix “C.1.2. Hypernetwork”. We added the sentence:
“We refer the reader to App. C.1 for details on the architecture.”
- **Line 186 (left)**: Updated to:
“For more complex PDEs involving nonlinear dynamics, additional layers with nonlinear activation functions are necessary to capture the underlying effects (Long et al., 2018; Kochkov et al., 2021).”
- **Line 200 (right)**: The superscript index “$\ell$” has been replaced with “$k$” to avoid confusion with a loss.
- **Line 258 (left)**: “resulting in a single token” → “resulting in a single token of dimension 384.”
- **Line 207 (right)**: Added a paragraph titled “End-to-end training”, which explains how our hypernetwork is optimized, to better separate this part of the section. It starts with:
“The two equations in Eq. 4 can be learned jointly in an end-to-end manner by solving the optimization problem …”
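To illustrate what this joint optimization can look like in practice, here is a minimal sketch that reuses the hypothetical `f_theta` and `theta_size` helpers from the earlier sketch. The encoder architecture standing in for the hypernetwork, the explicit Euler integrator, the step sizes, and the dummy tensor shapes are all simplifying assumptions for illustration, not the manuscript's implementation.
```python
# Minimal end-to-end training step: the hypernetwork psi maps a context of past frames
# to the flat operator parameters theta, the state is advanced by numerical integration
# of du/dt = f_theta(u), and the loss on the next frame is backpropagated through the
# integration into psi. All sizes below are illustrative.
import torch

class HyperNet(torch.nn.Module):
    """Maps a context of past frames to the flat operator parameters theta."""
    def __init__(self, context_frames: int, channels: int, d_theta: int):
        super().__init__()
        self.encode = torch.nn.Sequential(
            torch.nn.Conv2d(context_frames * channels, 64, 3, padding=1),
            torch.nn.GELU(),
            torch.nn.AdaptiveAvgPool2d(1),   # global pooling: one vector per sample
            torch.nn.Flatten(),
            torch.nn.Linear(64, d_theta),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = context.shape
        return self.encode(context.reshape(b, t * c, h, w))

def integrate(u0: torch.Tensor, theta: torch.Tensor, dt: float = 0.1, n_steps: int = 8):
    """Explicit Euler rollout of du/dt = f_theta(u); a higher-order scheme could be swapped in."""
    u = u0
    for _ in range(n_steps):
        u = u + dt * f_theta(u, theta)
    return u

# One optimization step on dummy data (batch of one trajectory for simplicity).
d_theta = theta_size(1)                          # small bottleneck vs. the 4*64*64 input
psi = HyperNet(context_frames=4, channels=1, d_theta=d_theta)
opt = torch.optim.Adam(psi.parameters(), lr=1e-3)
context = torch.randn(1, 4, 1, 64, 64)           # (batch, time, channels, H, W)
target = torch.randn(1, 1, 64, 64)               # next frame

theta = psi(context)[0]                          # flat parameter vector for this sample
pred = integrate(context[:, -1], theta)          # integrate forward from the last frame
loss = torch.nn.functional.mse_loss(pred, target)
opt.zero_grad(); loss.backward(); opt.step()     # gradients flow through the integrator
```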
- **Line 211 (right)**: Following your suggestion, Section 4 “Numerical experiments” now only contains the following paragraphs:
- “Multiple physics datasets” explains and motivates the two collections of datasets used in our paper,
- “Multiple physics training” details our training strategy on multiple physics,
- “Baselines” presents the three main baselines used in our paper,
- “Next step prediction performances” includes the presentation and discussion of Tables 2, 3, and 5 and Figure 2,
- “Information bottleneck and operator space” presents the advantages of having a bottleneck in the operator space, starting with: “Our DISCO model employs a low-dimensional operator $f_\theta$ (see Eq. 4) to predict the next state through time integration, compared to the typical input sizes.”
- “Finetuning on unseen physics”,
- “Ablation on model size”.
- **Line 233 (left)**: " $128=8.2^4$ " → " $128=8\cdot2^4$ ".
- **Line 237 (right)**: “is learned is learned” → “is learned”.
- **Line 238 (right)**: "(1D,2D, and 3D)" → "(1D, 2D, and 3D)"
- **Line 291 (right)**: “the to” → “to”.
- **Line 309 (right)**: Added:
“As we can see, our DISCO model remains competitive on predicting future steps on PDE data. Note that improving the rollout performance of an autoregressive model is an active area of research (McCabe et al., 2023), and one could complement our model with techniques such as noise injection (Hao et al., 2024), among others.”
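As an illustration of the kind of technique referred to here (not the manuscript's implementation), noise injection typically amounts to perturbing the training inputs so the model tolerates the imperfect states it produces during autoregressive rollout; the helper below is hypothetical.
```python
# Illustrative sketch of noise injection during training (hypothetical helper, not from the paper):
# small Gaussian perturbations of the context make the model more robust to the
# accumulated errors it will see in its own autoregressive rollouts.
import torch

def noisy_context(context: torch.Tensor, sigma: float = 1e-2) -> torch.Tensor:
    """Add i.i.d. Gaussian noise to the context frames (training time only)."""
    return context + sigma * torch.randn_like(context)
```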
- **Line 422 (right)**: Updated to:
“In such cases, finite volume or finite element schemes can be implemented using graph neural networks instead (Pfaff et al., 2020; Zhou et al., 2022; Brandstetter et al., 2022; Zhou et al., 2023). In particular, several papers, such as (Lino et al., 2022; Cao et al., 2023), propose U-Net-like graph neural network architectures, which are natural candidates for our operator class.”
- **Line 770**: This Appendix section has been renamed "Additional details on DISCO architecture" and now contains two subsections, "Operator network" and "Hypernetwork".
- **Line 986**: Figure 6 had incorrect legend entries “AViT” and “IC-NPDE (ours)”; these have been replaced with “MPP” and “DISCO (ours)”.
- **Line 1099**: Figure 8 was too large and has been resized to fit the page.