# ICML Post-Rebuttal Discussion -- HAMLET
## Reviewer KUbF
➡️ **Thanks for addressing the comments and clarify on the irregular grid experiments using airfoil.**
We're grateful for your insightful feedback.
## Reviewer NTpA
➡️ **Thanks for the supplementary experiments. What's the computational complexity of HAMLET? How about the scaling behavior?**
---
# ICML Rebuttal -- HAMLET
## Reviewer KUbF
➡️ **Does not demonstrate simulation on meshes with irregular grids. Why the output is on a regular grid?**
Thank you for your insightful comment. To clarify, our model handles both regular and irregular grid inputs. Specifically, we evaluate on irregular grids through the Airfoil dataset (Pfaff et al., 2021), which models the aerodynamics around the cross-section of an airfoil wing. Recognising the need for an explicit indication of where irregular grids are used, we will replace "non-uniform grid" with "irregular grid" in Section 4.1 to better reflect our setting. This adjustment, along with a clarifying note on the grid types, will be added to Appendix C (where the datasets are described in detail).
➡️ **Why does Hamlet have better performance under less data?**
Thanks for the comment. HAMLET performs better under data-scarce conditions because it uses a graph-based approach, which requires fewer data points than traditional methods to model complex relationships effectively. The graph perspective allows the neural operator to infer and generalise from less training data by leveraging the intrinsic geometrical and topological information contained in the graphs, which represent the physical domains or conditions of the PDEs. This enables more efficient learning, which is particularly advantageous for the limited datasets commonly encountered in PDE scenarios. We will add a clarifying note on this.
➡️ **Why OFormer also converge to the correct field profile at future time steps?**
The unique characteristics of the Diffusion Reaction system, particularly the non-linear reaction terms, lead to complex patterns emerging from the initial conditions. Thanks to its graph perspective, HAMLET's architecture is better suited to capturing the intricate dynamics of this system early on, as it models complex spatial relationships and non-linear interactions more effectively than OFormer, especially under the rapid changes and high-frequency information present at the initial time steps. This advantage is particularly evident in the early stages, where the interplay of diffusion and reaction is most challenging to model accurately. We therefore intentionally chose to showcase these results to highlight the advantages of our approach.
➡️ **It'd be nice to have results on irregular meshes**
We thank the reviewer for the suggestion. We now provide the results of GNOT and EAGLE on Airfoil, which has highly irregular meshes, and also update the result of HAMLET:
| Method | Airfoil, Relative L2 |
|----------|----------------------|
| OFormer | 3.486E-02 |
| GNOT | 4.310E-02 |
| EAGLE | 1.192E-01 |
| **HAMLET** | 3.030E-02 |
We also provide the results of GNOT on Shallow Water 2D and Diffusion Reaction 2D (nRMSE):
| Dataset | U-Net | FNO | DeepONet | OFormer | GeoFNO | GNOT | **HAMLET** |
|------------------|----------|----------|----------|----------|----------|----------|----------|
| Shallow Water | 8.30E-02 | 4.40E-03 | 2.35E-03 | 2.90E-03 | 6.70E-03 | 4.16E-03 | 2.04E-03 |
| Diffusion Reaction| 8.40E-01 | 1.20E-01 | 8.42E-01 | 3.28E+00 | 7.72E+00 | 8.22E-01 | 9.02E-02 |
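For reference, a minimal sketch of how the two error metrics reported above can be computed is given below. This follows common conventions (e.g., PDEBench-style normalisation); the exact per-sample and per-channel averaging used in the paper may differ slightly.

```python
import numpy as np

def relative_l2(pred, target):
    """Relative L2 error: ||pred - target||_2 / ||target||_2 over the full field."""
    return np.linalg.norm(pred - target) / np.linalg.norm(target)

def nrmse(pred, target):
    """Normalised RMSE: RMSE of the prediction divided by the RMS magnitude of the target.
    (Equivalent to the relative L2 error when computed over the flattened field.)"""
    return np.sqrt(np.mean((pred - target) ** 2)) / np.sqrt(np.mean(target ** 2))
```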
## Reviewer NTpA
➡️ **Comparison against GNOT[1] and GINO[2]**
We thank the reviewer for this valuable suggestion. In our work, we address both regular and irregular grid inputs, with a particular emphasis on highly irregular grids, as exemplified by our use of the Airfoil dataset (Pfaff et al., 2021). We acknowledge the relevance of comparing our model to GNOT [1], given its transformer structure and focus on similar challenges. Although GINO [2] also contributes to this field, its lack of publicly available code prevents a direct comparison; we therefore added EAGLE instead. We have now included a performance comparison with GNOT and EAGLE, which substantiates the strengths of our framework in handling complex grid scenarios, with HAMLET reporting the best results:
| Method | Airfoil, Relative L2 |
|----------|----------------------|
| OFormer | 3.486E-02 |
| GNOT | 4.310E-02 |
| EAGLE | 1.192E-01 |
| **HAMLET** | 3.030E-02 |
We also provide the results of GNOT on Shallow Water 2D and Diffusion Reaction 2D (nRMSE):
| Dataset | U-Net | FNO | DeepONet | OFormer | GeoFNO | GNOT | **HAMLET** |
|------------------|----------|----------|----------|----------|----------|----------|----------|
| Shallow Water | 8.30E-02 | 4.40E-03 | 2.35E-03 | 2.90E-03 | 6.70E-03 | 4.16E-03 | 2.04E-03 |
| Diffusion Reaction| 8.40E-01 | 1.20E-01 | 8.42E-01 | 3.28E+00 | 7.72E+00 | 8.22E-01 | 9.02E-02 |
➡️ **The introduction does not clearly explain the necessity or advantage of introducing a graph. Most of the datasets used in experiments are based on regular geometries, with too few examples of irregular grids. Additionally, the choice of baselines for comparison is not competitive enough, making the results less convincing.**
We appreciate the reviewer’s suggestions and address them in two parts.
We wish to emphasise the advantage of our graph-based architecture. The choice is deliberate: it captures complex interactions and dynamics more effectively than standard transformers because it models intricate spatial relationships and non-linear interactions efficiently. For instance, we specifically highlight the highly irregular Airfoil dataset (Pfaff et al., 2021) in Table 4, where our graph perspective demonstrates superiority over OFormer, particularly at the initial time steps characterised by rapid changes and high-frequency information. Additionally, our approach allows the neural operator to infer and generalise from less training data by leveraging the intrinsic geometrical and topological information of the graphs, which is beneficial for PDE scenarios with limited data availability, as shown in Table 3.
### Additional Reply
Thanks for the valuable advice. The complexity depends on the datasets since different model architectures are used for different datasets (See Tab. 6 of Appendix D).
Specifically, the number of parameters (\#Param) and inference time (T) are as follows. \#Param=3.16M, T=0.306s on Darcy Flow 2D, \#Param=4.29M, T=0.313s on Shallow Water 2D, \#Param=4.30M, T=0.378s on Diffusion Reaction 2D, and \#Param=5.16M, T=0.094s on Airfoil.
Inference time is measured as an average of 50 runs, with a batch size of 1, on an NVIDIA RTX3090.
We use inference time to indicate the computational complexity, since FLOPs are not constant for our proposed model, as they partly depend on the number of edges.
We have presented results for HAMLET on a wide range of datasets, some of which are larger and more complex, e.g., the larger spatial and temporal resolution of Diffusion Reaction 2D. We found no scalability issues on these 2D datasets. The scalability to more complex 3D problems is an interesting topic deserving future exploration.
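To make the timing methodology above transparent, here is a minimal sketch of how such a measurement can be performed; `model` and `sample` are placeholders and the actual benchmarking script may differ in detail.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(model, sample, n_runs=50, device="cuda"):
    """Average single-sample (batch size 1) inference time over n_runs forward passes."""
    model = model.to(device).eval()
    sample = sample.to(device)
    # Warm-up passes to exclude one-off CUDA initialisation costs.
    for _ in range(5):
        model(sample)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(sample)
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs
```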
## Reviewer Lw9X
➡️ **Lack of Novelty. Claim on the first graph Transformers used in [1] and [2]**
In addressing the reviewer's concerns about novelty, we emphasise that HAMLET represents a novel integration of graph transformers into neural operators for solving PDEs. The architecture combines graph transformers with modular input encoders, offering enhanced control over parameter correspondence and adaptability across various PDE geometries and input formats. We further emphasise HAMLET's ability to model complex spatial relationships and non-linear interactions more effectively than other techniques, such as standard transformers like OFormer (see experiments). This capability is especially crucial for handling the rapid changes and high-frequency information at initial time steps, as demonstrated by our experimental results. Our technique is therefore far from trivial; it is meticulously designed to address current challenges in the field and marks a significant advance over existing approaches.
Regarding [1] and [2] -- thank you for pointing out these references. We clarify that [1] introduces a mesh-based transformer optimised for large distances through pre-generated coarse meshes, combining a GNN with self-attention for dynamic data. HAMLET contrasts with it through its graph-transformer encoder, a CrossFormer for parameter integration, and MLP-based time propagation, favouring recurrent over autoregressive model updates. [2], while parallel in aim, differs in topology and integration methods, using a GNN backbone and recurrent updates with continuous integration, unlike HAMLET's single CrossFormer approach.
➡️ **Validation in the architecture choice-- there is no justification of using an RNN on top of the encoder features, and it is unclear how the latent state is kept as input of the RNN over time.**
The decision to use an RNN on top of the encoder features was driven by the need to capture temporal dependencies and propagate information across time steps, which is critical for the accurate simulation of PDEs over time. The latent state is carried forward as the input to the RNN at each step, maintaining a continuous and coherent evolution of the system's state and ensuring that predictions at each time step are informed by the accumulated knowledge of prior states.
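To make the role of the latent state concrete, below is a minimal sketch of recurrent latent-state propagation over time; the module names, cell type, and dimensions are illustrative and do not reproduce HAMLET's exact implementation.

```python
import torch
import torch.nn as nn

class LatentPropagator(nn.Module):
    """Propagates an encoded latent state over time with a recurrent cell."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(latent_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, latent_dim)

    def forward(self, z0, n_steps):
        # z0: (num_nodes, latent_dim) latent state produced by the encoder.
        h = torch.zeros_like(z0)      # recurrent hidden state acting as memory of prior steps
        z, outputs = z0, []
        for _ in range(n_steps):
            h = self.cell(z, h)       # update the memory from the current latent state
            z = self.decoder(h)       # next latent state, fed back at the following step
            outputs.append(z)
        return torch.stack(outputs, dim=0)  # (n_steps, num_nodes, latent_dim)
```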
➡️ **I am not sure about the proof in Proposition 3.1, how can you absorb the second residual connection and normalisation into the sum ?**
Thank you for highlighting the need for clarity. We now explicitly mention the residual block of the graph transformer; it reads: "The residual block of the graph transformer layer, as proposed above, can be seen as a special case of the integration kernel of the neural operator." This adjustment addresses your concern by explicitly acknowledging the fundamental role of residual connections and normalisation, in line with the principle of residual-attention mechanisms. Also, for clarity, we remove the absorption statement and keep "We use ReLU as the activation function $\sigma$."
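For intuition only, the correspondence can be written schematically as below; this is a sketch under standard assumptions about the attention update and the kernel-integral form, not a restatement of the proof of Proposition 3.1.

```latex
% Schematic comparison (not the proof): a residual attention update at node i
% versus the kernel-integral form of a neural operator layer.
\[
  h_i' \;=\; h_i \;+\; \sigma\!\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W h_j\Big),
  \qquad
  (\mathcal{K} v)(x) \;=\; v(x) \;+\; \sigma\!\Big(\int_{\Omega} \kappa(x, y)\, v(y)\, \mathrm{d}y\Big).
\]
% Identifying \kappa(x_i, x_j) with \alpha_{ij} W, supported only on the graph
% neighbourhood \mathcal{N}(i), exhibits the residual attention block as a
% discretised, sparsely supported instance of the kernel integral operator,
% with ReLU as the shared activation \sigma.
```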
➡️ **Overall the notations are very difficult to read, especially in section 3.2 and section 3.3.**
We appreciate your feedback on the readability of our notation. We will revise Sections 3.2 and 3.3 to ensure that the notation is clear and consistent throughout the paper.
➡️ **There is no graph baselines on dynamical systems. It would have been interesting to understand if a graph-transformer is better than a graph-based model, and in which setting. AND There is a lack of evaluation on irregular geometries.**
The results of different methods on an irregular geometry (Airfoil) are presented below; HAMLET achieves the best performance. For additional graph-based baselines on dynamical systems, please also see our reply to the graph-baseline question below.
| Method | Airfoil, Relative L2 |
|----------|----------------------|
| OFormer | 3.486E-02 |
| GNOT | 4.310E-02 |
| EAGLE | 1.192E-01 |
| **HAMLET** | 3.030E-02 |
➡️ **Why do you use GeoFNO on Darcy? What geometry varies in the Darcy equation?**
The use of GeoFNO on the Darcy Flow dataset does not imply any variation of geometry within the Darcy equation itself; the Darcy Flow problem here involves a fixed geometry. Instead, we apply GeoFNO to leverage its capacity for learning over complex geometries and to assess the robustness of this baseline even when the underlying PDE does not exhibit geometric variability. Its inclusion therefore demonstrates how such models handle geometric complexity should it arise in other PDEs, rather than indicating a varying geometry within the Darcy Flow problem.
➡️ **How did you select the hyperparameters of the baselines?**
The dataset-specific hyperparameters follow the PDEBench setting, while the model-specific hyperparameters follow the default setting of baseline methods suggested by the code repositories or their papers.
➡️ **How many parameters do the baselines have, for instance FNO?**
We now provide the number of parameters of some baseline methods. For instance, on the Shallow Water 2D dataset: U-Net 7.77M, FNO 0.46M, DeepONet 0.56M, OFormer 0.22M, GNOT 0.9M, and HAMLET 4.29M.
➡️ **Which UNet implementation did you use to perform your comparison?**
We now clarify that we used the U-Net implementation that follows the PDEBench setting.
(Takamoto, Makoto, et al. "PDEBench: An Extensive Benchmark for Scientific Machine Learning" Advances in Neural Information Processing Systems 35 (2022)).
➡️ **Why does Hamlet fail for $\beta=0.01$ compared to OFormer and Magnet?**
It is essential to note that all models have specific design characteristics that may favour certain data distributions or problem settings. HAMLET is optimised for robust performance across a spectrum of scenarios, including those with highly irregular grids and complex geometries, as highlighted in our experimental results. In the case of $\beta=0.01$, the data exhibits properties that are well captured by OFormer and MAgNet. However, HAMLET demonstrates superior performance in the other settings, which we attribute to its graph-based approach and its strength in handling a wide variety of PDEs.
➡️ **Why do you use an RNN to unroll the dynamics over time?**
We use an RNN to unroll the dynamics over time because it's well-suited for capturing sequential data and temporal dependencies. This approach enables the model to effectively learn and predict the evolution of the system's state across successive time steps, leveraging the RNN's ability to maintain a hidden state that acts as a memory of previous computations. This is particularly valuable in simulating PDEs where the future state depends on the past and present states.
➡️ **How is HAMLET trained? Is it auto-regressive with teacher forcing (i.e. only next step prediction ) or do you unroll the dynamics for the 90 timestamps during training?**
We train HAMLET directly by unrolling the dynamics over all 90 timestamps.
➡️ **How are the other baselines trained for the dynamical setting?**
U-Net, FNO, GeoFNO, and MAgNet are trained auto-regressively.
DeepONet is trained directly on the full set of timestamps.
OFormer is trained by unrolling over the full set of timestamps.
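To make the distinction between the two training regimes concrete, a minimal sketch is given below; `model`, its call signature, and the loss are placeholders, and the actual training loops of HAMLET and the baselines differ in detail.

```python
import torch
import torch.nn.functional as F

def unrolled_training_step(model, u0, targets, optimiser):
    """Unrolled training: predict the full trajectory from the initial state u0
    and back-propagate through all time steps at once."""
    preds = model(u0, n_steps=targets.shape[0])            # (T, ...) predicted trajectory
    loss = F.mse_loss(preds, targets)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

def autoregressive_training_step(model, traj, optimiser):
    """Auto-regressive training with teacher forcing: each step predicts only the
    next state, conditioned on the ground-truth current state."""
    loss = 0.0
    for t in range(traj.shape[0] - 1):
        pred_next = model(traj[t], n_steps=1).squeeze(0)   # single-step prediction
        loss = loss + F.mse_loss(pred_next, traj[t + 1])
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```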
➡️ **Why is Magnet not used for dynamical systems?**
We did use MAgNet for dynamical systems in Tab. 3, but its training on Shallow Water and Diffusion Reaction failed to converge. That is why the results of MAgNet are not reported in Tab. 3.
➡️ **Why is there no graph baseline except magnet?**
Following your suggestion, we have included other graph baselines, e.g., EAGLE and GNOT, in the tables below. Our HAMLET outperforms GNOT on both the regular-grid datasets (Diffusion Reaction and Shallow Water) and the highly irregular-grid dataset (Airfoil). It also surpasses EAGLE on Airfoil.
| Method | Airfoil, Relative L2 |
|----------|----------------------|
| EAGLE | 1.192E-01 |
| **HAMLET** | 3.030E-02 |

| Dataset | GNOT | **HAMLET** |
|-------------------|----------|----------|
| Shallow Water | 4.16E-03 | 2.04E-03 |
| Diffusion Reaction| 8.22E-01 | 9.02E-02 |
| Airfoil | 4.310E-02| 3.030E-02|
➡️ **Why did you use DeepOnet as a baseline?**
We chose DeepONet as a baseline due to its proven effectiveness in learning operators that map between function spaces, which aligns closely with the challenges posed by PDEs. Its architecture, which leverages deep learning to approximate operators, provides a strong point of comparison for evaluating HAMLET's performance.
➡️ **How does the model size vary for the comparison with the dataset training size in Table 3?**
Thank you for the comment. The models have been designed to maintain a consistent size regardless of the training dataset size, ensuring a fair comparison of learning capabilities across different volumes of data. However, the number of parameters (#Param) is tailored to the characteristics of each dataset to optimise performance. Specifically, the models have #Param=3.16M for Darcy Flow 2D, #Param=4.29M for Shallow Water 2D, #Param=4.30M for Diffusion Reaction 2D, and #Param=5.16M for Airfoil. This tailored approach ensures that each model is suitably complex for the specific demands of the dataset it is trained on.
➡️ **There is no limitation section.**
We appreciate your observation regarding the limitations section. Indeed, our method involves graph construction time, which is a common aspect in graph-based approaches. However, we view this not as a major limitation but as an inherent step that enables our model's robust performance. The construction of the graph is a crucial phase where HAMLET captures the complex dependencies within the data. We believe the benefits gained in accuracy and adaptability to diverse PDEs far outweigh the computational time required during this stage.
<!-- Tips for Angelica: Some useful information here. -->
<!-- 1. The construction of graph takes time, is not efficient.
2. Can be more flexible. We need a unified structure to accept different kinds of input (irregular data, multiple input...) -->
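As an illustration of the graph-construction cost mentioned above, the sketch below builds a k-nearest-neighbour graph from mesh node coordinates; HAMLET's actual connectivity rule may differ, so this is indicative only.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_knn_graph(coords, k=8):
    """Build directed edges by connecting each mesh node to its k nearest neighbours.

    coords: (num_nodes, dim) array of node positions.
    Returns a (2, num_nodes * k) array of (source, target) edge indices.
    """
    tree = cKDTree(coords)
    # Query k+1 neighbours because the nearest neighbour of each point is itself.
    _, idx = tree.query(coords, k=k + 1)
    src = np.repeat(np.arange(coords.shape[0]), k)
    dst = idx[:, 1:].reshape(-1)
    return np.stack([src, dst], axis=0)
```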
## Reviewer N32v
➡️ **Computational complexity of the proposed model, particularly concerning the scalability of larger and more complex problems**
Thanks for the valuable advice. The complexity depends on the datasets since different model architectures are used for different datasets (See Tab. 6 of Appendix D).
Specifically, the number of parameters (\#Param) and inference time (T) are as follows. \#Param=3.16M, T=0.306s on Darcy Flow 2D, \#Param=4.29M, T=0.313s on Shallow Water 2D, \#Param=4.30M, T=0.378s on Diffusion Reaction 2D, and \#Param=5.16M, T=0.094s on Airfoil.
Inference time is measured as an average of 50 runs, with a batch size of 1, on an NVIDIA RTX3090.
We use inference time to indicate the computational complexity, since FLOPs are not constant for our proposed model, as they partly depend on the number of edges.
We have presented results for HAMLET on a wide range of datasets, some of which are larger and more complex, e.g., the larger spatial and temporal resolution of Diffusion Reaction 2D. We found no scalability issues on these 2D datasets. The scalability to more complex 3D problems is an interesting topic deserving future exploration.
➡️ **Application of HAMLET to higher-dimensional PDEs (e.g., 3D problems)**
We thank the reviewer for the suggestion. Extending HAMLET to handle higher-dimensional PDEs such as 3D problems is indeed a logical and exciting next step for our research. Addressing the complexity of 3D PDEs will require a dedicated exploration to refine the model architecture and comparison with existing approaches, as comprehensively addressing these challenges goes beyond the scope of the current work. We plan to undertake this as a separate problem, akin to the thorough methodology and analysis presented in the work of Li et al. on large-scale 3D PDEs (Li, Zongyi, et al. "Geometry-informed neural operator for large-scale 3d pdes." Advances in Neural Information Processing Systems 36 (2024)).
➡️ **Why choose graph transformers over other graph neural network architectures? What specific advantages do graph transformers offer in the context of solving PDEs?**
Graph transformers have been shown to excel at processing graph-structured data (Dwivedi & Bresson, 2020), allowing them to learn robust representations of both individual nodes and the entire graph. This owes to the attention mechanism in graph transformers, which implicitly leverages the patterns within the inputs and the relationship between **arbitrary query locations and inputs** (Li et al., 2023). In contrast, a graph neural network relies on the **local grid** structure at which the functions’ values are sampled, and is thus less effective.
The advantages of graph transformers for solving PDEs are two-fold: 1) they can incorporate differential-equation information into the solution process; and 2) they make our method adaptable to irregular meshes, allowing it to solve PDEs in a discretisation-invariant manner.
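As an illustration of the difference described above, the sketch below cross-attends arbitrary query coordinates to encoded input nodes, which is what allows an attention-based operator to evaluate the solution away from the input grid; the module and its dimensions are illustrative and not HAMLET's exact decoder.

```python
import torch
import torch.nn as nn

class QueryCrossAttention(nn.Module):
    """Cross-attention from arbitrary query locations to encoded input nodes."""
    def __init__(self, coord_dim=2, latent_dim=128, num_heads=4):
        super().__init__()
        self.query_embed = nn.Linear(coord_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.out = nn.Linear(latent_dim, 1)

    def forward(self, query_coords, node_features):
        # query_coords: (batch, num_queries, coord_dim) -- any locations, not tied to the input grid.
        # node_features: (batch, num_nodes, latent_dim) -- encoder output on the input mesh.
        q = self.query_embed(query_coords)
        attended, _ = self.attn(q, node_features, node_features)
        return self.out(attended)  # predicted field values at the query locations
```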