# R1 additional response
We thank the reviewer for the positive comments. We would like to draw your attention to the additional modifications we have now included. Specifically, we have performed the following additional simulations:
* Comparison of the baselines with StriderNet run for a longer duration (App. M),
* Parametric study on the learning rate of Adam (App. N) and $\alpha$ of StriderNet (App. O),
* Energy and stress distribution of the atoms in the structures obtained by StriderNet (App. K).
The revised version of the manuscript with these details is available at the anonymous link (https://anonymous.4open.science/r/StriderNET-F64D/StriderNET.pdf).
In addition, we attempted to train StriderNet using an equivariant GNN (NequIP; see ref.: Batzner, S., et al., 2022. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1), p.2453.). The details, including the hyperparameters, architecture, and training/validation performance, are included in https://anonymous.4open.science/r/StriderNET-F64D/Eqv_Stridernet.pdf. Considering the limited time available, it is not possible to provide a conclusive comment on how an equivariant GNN may affect the performance of StriderNet. However, we see value in the reviewers' suggestion and wish to pursue it in the future.
***Appeal to the reviewer:*** With these additional experiments and clarifications, we hope we have addressed all the concerns raised by the reviewer. If the reviewer feels satisfied with our response, we humbly appeal to increase the score.
# R2 rebuttal response
We thank the reviewer for the detailed comments. Please find a point-by-point response. Our updated revised version is available at the same anonymous link (https://anonymous.4open.science/r/StriderNET-F64D/StriderNET.pdf).
**i) I disagree with the author's comment, “....better configuration is of paramount importance irrespective of the computational cost.”**
**Energy computation is computationally expensive, and the cost grows with the number of atoms and the complexity of the structure. There is an upper computational cost budget; otherwise, running a molecular simulation and keeping track of the best sample would be a reasonable algorithm. In applications like drug development, the high cost will slow down the validation of the candidate drug step in the drug design process. Several recent ML methods for MD simulations reduce the simulation cost by using coarse-graining.**
*Response:* We agree with the reviewer that computational cost does indeed matter. We intended to state that, in these problems, reaching optimal structures is of paramount importance even if the computational cost is high; but we agree that it should be within reasonable time limits. Indeed, traditional MD simulations are computationally prohibitive for sampling low-energy configurations, even though in principle they are able to. Accelerating MD simulations with ML can indeed help in this context.
*To address this comment, the following additional sentence is included in the Limitations section.*
>"Finally, an iteration of StriderNet is computationally more expensive compared to non-neural baselines, the acceleration of which also is an open challenge."
**From the runtime comparison, it appears comparison to the baseline is not fair. Since non-neural approaches are faster, they should be run for the same time as the proposed method. It is important to validate whether running the less costly model for the same compute budget could lead to a better-optimised structure. I request the authors to please investigate this case.**
*Response:* We agree with the reviewer that the overall running time of StriderNet is significantly more than that of the baselines. However, it should be noted that, for both the non-neural methods and StriderNet, an energy-based cutoff was used to terminate the runs; that is, the simulations were run until they converged. Thus, running the baselines for a similar time as StriderNet should not make any difference in the results. We demonstrate this below by performing additional baseline runs.
For a fair comparison in terms of running time, we allowed the baseline methods to run for the same duration (~3 hours) as StriderNET (model adaptation). Specifically, we ran all three non-neural baselines for ~3 hours on 10 random configurations. Indeed, the baseline methods saturate beyond a certain number of steps and do not lead to more optimized structures than StriderNET (see the table below).
*To address this comment, a new section has been added in the Appendix of the main manuscript (App. M), along with the comparison curves in Fig. 9.*
**Table:** Minimum and mean energies per atom obtained when the baselines are run for the same duration (~3 hours) as StriderNET.
| System Size |Metric|StriderNET | Gradient Descent | Adam | FIRE|
| -------- |-------- | -------- |----|----|----|
| 100 |Min |**-8.156** |-8.113|-8.082|-8.114|
| |Mean |**-8.119** |-8.041|-8.019|-8.061|
**ii) I still find the related work section weak. The machine learning methods for molecular simulations are popular and have led to several developments in reducing the cost of stable atomic structures or finding structures with the desired property. In my view, authors should clearly explain this research direction and explicitly state in 2-3 sentences where StriderNet sits among other existing work (i.e., the benefits of RL formulation). Perhaps after the second paragraph of the introduction, refer to MD simulations work and then introduce StriderNet?**
*Response:* We thank the reviewer for this suggestion.
*To address this, we have added the following new sentences at the end of the second paragraph as suggested by the reviewer.*
>*"Another approach toward finding minima is to combine classical molecular simulation (MS) with machine learning. These approaches focus on accelerated MS using machine-learned force fields or coarse graining (Noe et al., 2020; Park et al.,2021; Li et al., 2022)"*
**iii) RL tends to be sensitive to hyperparameters. This is why I questioned the choice of Adam and its learning rate. A simple ablation to investigate the learning rate here would make it more attractive.**
*Response:* We agree with the reviewer that training the RL model is sensitive to hyperparameters. To address this comment, we trained StriderNET at different learning rates and plotted the validation curves of the average reduction in energy, $\Delta E$, over 20 steps. We observe that training is unstable at learning rates above $10^{-3}$ and very slow at rates below $10^{-6}$. The learning rate chosen in the present work is indeed the optimal one.
*To address the comment, we have added a new section on the effect of learning rate on the training of StriderNET (App. N), along with the validation curves in Fig. 10.*
**iv) Thanks for the explanation of alpha in multivariate Gaussian. Perhaps this should be added in the paper with a demonstration of deviation in delta E.**
*Response:* To demonstrate the effect of $\alpha$ in the multivariate Gaussian, we calculate $\Delta E$ first using the predicted displacements directly, and then using displacements sampled from the multivariate Gaussian with the given $\alpha$. The percentage deviation between the two values of $\Delta E$ is reported below (added to App. O); a sketch of this computation follows the table.
**Table:** Variation in percentage deviation in $\Delta E$ with $\alpha$ factor of multivariate Gaussian.
| $\alpha$ | % Deviation in $\Delta E$|
| -------- | --------------- |
| 1e-1 | 2.591e6 |
| 1e-2 | 2.378e3 |
| 1e-3 | 4.087e2 |
| 1e-4 | 4.576e1 |
| 1e-5 | 4.640 |
| 1e-6 | 0.470 |
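For concreteness, the computation behind the table can be sketched as follows (all names here, `energy_fn`, `positions`, `mu`, are illustrative placeholders rather than our actual implementation):

```python
import jax
import jax.numpy as jnp

def delta_e_deviation(energy_fn, positions, mu, alpha, key):
    """Percentage deviation in dE between predicted and sampled displacements."""
    e0 = energy_fn(positions)
    # Deterministic step: apply the mean displacement predicted by the policy.
    de_pred = energy_fn(positions + mu) - e0
    # Stochastic step: sample displacements from N(mu, alpha^2 I), as the policy does.
    sampled = mu + alpha * jax.random.normal(key, mu.shape)
    de_samp = energy_fn(positions + sampled) - e0
    return 100.0 * jnp.abs((de_samp - de_pred) / de_pred)
```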
**v) My suggestion on the details of MDP was to make the paper more clear. Sorry I didn’t ask for a general definition of MDP. My suggestion was in 3.2, write an MDP tuple and then explain each part in the context of StriderNet, i.e. what are the states in atomic structure, likewise transition and so on. In an MDP discount factor gamma is also in a tuple.**
*Response:* To address this concern, we have modified Sec. 3.2 of the manuscript with a tuple-based definition of the MDP.
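For reference, a minimal sketch of such a tuple-based definition, assembled from the quantities discussed in our responses (the exact wording in Sec. 3.2 may differ):

```latex
% Sketch of the MDP tuple in the context of StriderNet (illustrative).
\begin{equation*}
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)
\end{equation*}
\begin{itemize}
  \item $s^t \in \mathcal{S}$: graph representation of the current atomic configuration;
  \item $a^t \in \mathcal{A} \subseteq \mathbb{R}^{|\mathcal{V}| \times d}$: per-atom displacements;
  \item $P(s^{t+1} \mid s^t, a^t)$: transition obtained by applying the displacements and rebuilding the graph;
  \item $R(s^t, a^t)$: baseline-relative energy reward (Eq.~9);
  \item $\gamma$: discount factor.
\end{itemize}
```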
**vi) In the without bond length case, why is energy positive in the table? This result looks a bit odd to me. I am wondering if GNN was adapted correctly here to operate on graphs without edge features.**
*Response:* The table reports the validation score, which is the average reduction in energy over 20 steps. A positive validation score signifies that the model is unable to reduce the energy of the system; note that the absolute energy itself is still negative. The GNN employed is the same as the one used in StriderNET, with only the bond-length vector removed from the edge features; the other edge feature, $\|x_{vu}^{equi}\|-\|x_{vu}\|$, is still present.
**vii) One of the points I raised in weaknesses is not answered. I would appreciate it if the authors could provide their comments on the following:**
**“ The energy scale of each atomic structure is different. Using a difference in initial and final energy for evaluation is reasonable. But it is still hard to grasp for which atomic structures StriderNET makes a significant difference. In some cases, the reduction could be less significant due to the complexity of the structure. I am keen to know what other evaluation criteria can be used for this purpose. Perhaps authors can use molecular modelling to determine the stable energy configuration and use that as a benchmark. “**
*Response:* We thank the reviewer for pointing out the missing response. One other evaluation criterion that we have used to evaluate the performance of our model is the reward the model obtains during training. These results are presented in the Appendix (App. C). The reward signifies how well the model reduces energy in comparison to a baseline method. We have used the FIRE algorithm as the *baseline* in our reward function (Eq. 9). We observe in Figure 4 that, as the model trains, it outperforms the baseline method (indicated by positive rewards).
The baseline can be any optimization method, e.g., Adam, FIRE, molecular modelling (as suggested by the reviewer), or even the trained StriderNET model itself. The performance of the baseline thus gives a reasonable estimate of how complex the task is. For instance, the reduction in energy achieved by the baselines during minimization will be high for high-energy structures and low for low-energy structures. This change in energy $\Delta E$ during the minimization of a structure by the baseline can therefore be used as a proxy for how difficult the task is.
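For clarity on the sign convention, a minimal sketch of this baseline-relative reward (names are placeholders; Eq. 9 gives the exact form):

```python
def reward(energy_drop_model, energy_drop_baseline):
    # Both arguments are the energy reductions achieved from the same starting
    # configuration by StriderNET and by the baseline (e.g., FIRE).
    # A positive reward means the model reduced the energy more than the baseline.
    return energy_drop_model - energy_drop_baseline
```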
**viii) I also agree with the other reviewer that using SE(3) would be more reasonable as it considers the symmetries of the atomic structure. It does add an additional cost, but that's marginal. The search space is significantly reduced here.**
*Response:* We thank the reviewer for the suggestion. We agree that SE(3) GNNs can potentially improve StriderNet. To address this comment, we attempted to train StriderNet using an equivariant GNN (NequIP; see ref.: Batzner, S., et al., 2022. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1), p.2453.). The details, including the hyperparameters, architecture, and training/validation performance, are included in https://anonymous.4open.science/r/StriderNET-F64D/Eqv_Stridernet.pdf. Considering the limited time available, it is not possible to provide a conclusive comment on this point at this stage. However, we see value in the reviewers' suggestion and wish to pursue it in the future.
**At this point, I will stay with my original score. Although the paper is improved, I don’t see enough evidence that justifies a description associated with a higher score.**
*Response:* In addition to these experiments, we have now included the energy and stress distributions of the atoms corresponding to different configurations obtained from StriderNet (see App. K, Fig. 7, and Fig. 8), as suggested by the reviewer earlier. We observe that the energy and stress distributions become narrower as the system moves toward more stable configurations with lower energies. Altogether, we believe we have extensively addressed all the comments by the reviewer, which has significantly improved the manuscript.
***Appeal to the reviewer:*** With these additional experiments and clarifications, we hope we have addressed all the concerns raised by the reviewer. We would be happy to engage further if there are any outstanding concerns. If the reviewer feels satisfied with our response, we humbly appeal to increase the score.
---
**General comment to all the reviewers**
We thank the reviewers for their comments and suggestions. Please find a point-by-point response to the comments raised by the reviewers below. We have also updated the main manuscript and the appendix to address these comments. The changes made in the main manuscript are highlighted in **blue color** and the updated manuscript is available in the anonymized link https://anonymous.4open.science/r/StriderNET-F64D/StriderNET.pdf. The **major additional experiments** carried out and included in the updated manuscript are listed below.
**1. Ablation study on StriderNet features:** To understand the role of different features in the performance of StriderNet, we performed additional ablation experiments. Specifically, we trained StriderNet after removing each of the node and edge features one by one and analyzed the performance (see App. F, Fig. 6).
**2. Baselines with different learning rates:** To show the effect of learning rates on the non-neural baselines, additional experiments are performed with different learning rates (see App. J, Tab. 11).
**3. System size:** To show the performance of StriderNet on larger system sizes, we trained StriderNet on a 250-atom LJ system in addition to the original 100-atom system. StriderNet, when trained on the larger system size, performs better (see App. H, Tab. 8 and Tab. 9).
**4. Training and inference time:** We included additional discussions on the training and inference time of StriderNet in comparison to the non-neural baselines (see App. G and App. I, Tab. 7 and Tab. 10).
# R1
We thank the reviewer for the comments. Please find a point-by-point response to the comments below. Further, the revised manuscript with the changes highlighted in blue is available at the anonymized link: https://anonymous.4open.science/r/StriderNET-F64D/StriderNET.pdf
**It is not clear to me what is the physical meaning of the outputted action vector $a^t$. If it is the direction vector along which the atom will move to lower the system energy, then it should be a 3-dimensional vector; however, it is a $d$-dimensional vector as described in the paper (line 243 right, end of Section 3.3), and the authors do not explain what $d$ is and what is the actual physical meaning of $a^t$. It would be better if authors can elaborate these details.**
*Response:* The action vector $a^t$ is $|\mathcal{V}| \times d$ dimensional (see: page 4 col. 2 line 194, page 2 col. 2 line 70, and Table 3 in the Appendix), with $d$ taking the value 2 or 3 depending on the dimensionality of the system. For instance, for a 2-dimensional simulation $d$ will be 2, whereas for 3-dimensional simulations (as in the present work) $d$ will be 3. We have included additional text (page 4 col. 2 line 194) to clarify this point in the revised manuscript.
**It seems that the designed policy network is not SE(3)-equivariant. When the input atom systems rotate in 3D space, the output moving direction of atoms are supposed to be rotated accordingly. But the graph neural network model is a GAT network, not a SE(3)-equivariant model.**
*Response:* We agree that the present work does not employ an SE(3)-equivariant GNN. This is a specific architectural choice. In fact, even without using an SE(3)-equivariant GNN (EGNN), we already demonstrate that StriderNet outperforms all the baselines; the inclusion of an equivariant GNN may only enhance the performance. Further, it should be noted that EGNNs are computationally much more expensive than the GNN used in the present work, so incorporating them would come at a significant computational overhead. Accordingly, we believe that not incorporating an EGNN is not a weakness of the present work, as StriderNet already outperforms all the baselines.
To address this comment, the following new text has been added to the main manuscript in the Limitations and future work section (see Sec. 5 of the revised manuscript).
*The GNN employed in StriderNet is not SE(3)-equivariant. While it is known that SE(3)-equivariant GNNs have better expressivity, they are computationally expensive. It would be interesting to explore the effect of such architectures on the performance of StriderNET.*
Further, it should be noted that the GNN in StriderNet is not a GAT but a custom GNN (see Sec. 3.3 line 231). Indeed, we performed an ablation study to identify the best GNN that is efficient while being expressive (Sec. 4.4). We found that our GNN outperforms GAT and FGN architectures (Sec. 4.4, Figure 3).
**In addition to reach lower system energies, does STRIDERNET have other advantages over baseline methods, e.g., lower time cost or optimization steps?**
*Response:* It should be noted that obtaining stable structures corresponding to low-energy minima is an extremely challenging problem with wide applications in areas such as condensed matter physics, astronomy, mechanics, materials design, and drug discovery, as also acknowledged by Reviewer Kcce. In the present work, we show that StriderNet outperforms all the standard minimization algorithms in achieving better minima. This indeed comes at a higher computational cost than the classical non-neural baselines (see the response to Reviewer Kcce, and App. I and Tab. 10 of the revised manuscript). Note that the goal in such tasks (for instance, drug discovery) is to obtain a better minimum irrespective of the computational cost.
There are several approaches that can be employed to make StriderNet faster. Our graph update function is currently not JIT compiled, whereas the baseline optimizers are JIT compiled; an efficient JIT-compiled implementation of the graph update function can potentially make StriderNet significantly faster. Further, training is currently performed on CPUs; migrating to GPUs will significantly improve the running time.
Further, the additional results on the performance of StriderNet trained on a larger system size of 250 atoms (see the response to Reviewer Kcce, and App. H, Tab. 8 and 9 of the revised manuscript) demonstrate that StriderNet can indeed perform better on larger system sizes. This suggests that StriderNet can be used on complex systems to obtain stable structures that are hitherto unachievable by any of the state-of-the-art methods.
**Appeal to the reviewer:** With the additional results and explanations, we hope we have addressed the concerns raised. If there are any outstanding concerns, please do share them so that we can address those. Otherwise, we request the reviewer to raise the score.
# R2
We thank the reviewer for the positive comments. Please find the point-by-point response below. Further, the revised manuscript with the changes highlighted in blue are available at the anonymized link: https://anonymous.4open.science/r/StriderNET-F64D/StriderNET.pdf.
**When both nodes' features and structure information (as a graph) are available, it’s reasonable to ask what information is leveraged by GNNs. Perhaps a simple ablation to run the same experiments without neighbourhood potential. Likewise, an ablation on edge features.**
*Response:* We thank the reviewer for raising this interesting point. To address this comment, we performed additional ablation studies by removing the node and edge features one by one. The validation score of each of the models is given in the table below. First, we observe that the edge features play a crucial role, as the model without edge features exhibits poor performance. We also note that StriderNet with all the features exhibits the best performance among all the models. Thus, although the edge features play the major role in model performance, the node features enhance the performance when used in conjunction with the edge features. We have also added the validation curves in Fig. 6 in Appendix F.
| Model | Val. Score |
|-------|---------|
| w/o neighbor PE (node feature)|-254.72 |
| w/o node PE (node feature)| -252.73|
| w/o any PE feature (node features) |-254.09 |
| w/o bond length $\|x_{vu}\|$ (edge feature) | 0.31 |
| w/o $\|x_{vu}^{equi}\|-\|x_{vu}\|$ (edge feature)|-20.55|
|StriderNet|-259.40|
To address this comment, a new section has been added to the Appendix of the main manuscript (App. F, Fig. 6), which is referenced from the main manuscript (Sec. 4.2, page 7 col. 1 line 382).
**From Table 2, it appears on increasing the number of atoms, the performance of StriderNET is comparable to baseline methods. I am sceptic of its applicability to larger complex systems.**
*Response:* To address this comment, we performed additional experiments. Specifically, we trained StriderNet on a larger LJ system of 250 atoms, and the adaptation was performed on varying system sizes of 250, 500, and 1000 atoms for 10 random configurations (same methodology as employed in Sec. 4.6). The results of these models are compared with the ones obtained from the original StriderNet trained on 100 atoms (see the tables below). Interestingly, we observe that the model trained on 250 atoms outperforms StriderNet trained on 100 atoms. This suggests that the model performance improves when trained on a larger number of atoms, and that StriderNet can indeed be used for larger, more complex systems with potentially better performance.
**Table:** Performance of StriderNet when trained on 100- and 250-atom LJ systems. The table shows the minimum and mean energies obtained for different system sizes.
| System Size |Metric| Training on N = 100 | Training on N = 250 |
| -------- |------- |-------- | -------- |
| 250 |Min |-2037.06 | **-2038.45** |
| |Mean |-2033.25 | **-2035.74** |
| 500 |Min |-4080.65 | **-4086.12** |
| |Mean |-4069.59 | **-4071.64** |
| 1000 |Min |-8126.29 | **-8139.19** |
| |Mean |-8121.44 | **-8122.94** |
**Table:** For easy comparison across system sizes, this table shows the minimum and mean energies (normalized by the number of atoms) for each system.
| System Size |Metric|Training on N = 100 | Training on N = 250 | Gradient Descent | Adam | FIRE|
| -------- |------- |-------- | -------- |----|----|----|
| 250 |Min |-8.148 | **-8.154** |-8.025|-8.152|-8.153|
| |Mean |-8.133 | **-8.143** |-7.978|-8.103|-8.109|
| 500 |Min |-8.161 | **-8.172** |-8.016|-8.144|-8.136|
| |Mean |-8.139 | **-8.142** |-7.990|-8.118|-8.120|
| 1000 |Min |-8.126 | **-8.139** |-8.003|-8.135|-8.139|
| |Mean |-8.121 | **-8.123** |-7.984|-8.119|-8.124|
To address this comment, a new section has been added in the Appendix (App. H, Tables 8 and 9) and referenced in the revised manuscript (Sec. 4.6, page 8 col. 2 line 414).
**A key motivation, as set in the introduction, is applicability to areas such as proteins, drug discovery, materials etc. The atomic structures involved in such applications are generally much larger than those used in the paper. Moreover, the energy landscape is also more complex. This complexity demands a question of how reasonable StriderNET can be for such applications as the main limitation (also noted by authors) is scalability to larger atoms. Understandably, the paper proposes a new approach to a complex problem. However, the steps involved are mostly applications of existing ideas. So from the novelty perspective, I am keen to learn the “practical applicability” of StriderNET. What are the scalability issues, and perhaps a discussion to outline them?**
*Response:* Indeed, striding the energy landscape toward identifying stable configurations is a challenging problem. While LJ systems are archetypal glass formers of fundamental importance in fields such as condensed matter physics, practical problems will involve much larger numbers of atoms with complex interactions. As demonstrated in the previous comment, the additional experiments show that StriderNet performs better when trained on larger system sizes, confirming its applicability to realistic systems. In addition, we also evaluated the training time of StriderNet as a function of the number of atoms.
For the binary LJ system, the variation of training time with system size is shown below.
| System Size | Training time for 100 epochs (hr) |
| ----------- | -------- |
| 10 | 0.92 |
| 25 | 1.25 |
| 50 | 1.75 |
| 100 | 2.11 |
| 250 | 2.50 |
| 500 | 3.83 |
| 1000 | 8.89 |
To address this comment, a new section has been added in the Appendix of the revised manuscript (see App. G, Table 7).
**Several papers exist on general machine-learning solutions [1, 2, 3] for simulating molecular dynamics. A reference in a related work section and a sentence outlining the difference between those methods will improve the paper.
[1] Noé, Frank, et al. "Machine learning for molecular simulation." Annual review of physical chemistry 71 (2020): 361-390.
[2] Park, Cheol Woo, et al. "Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture." npj Computational Materials 7.1 (2021): 73.
[3] Li, Zijie, et al. "Graph neural networks accelerated molecular dynamics." The Journal of Chemical Physics 156.14 (2022): 144103**
*Response:* We thank the reviewer for the comment. We have now included the three additional references suggested by the reviewer along with new text in the Introduction section of the revised manuscript.
We would like to point out that, while there are several works on GNNs for molecular simulations where ML potentials are used, our work is different in nature: we aim to discover the minima of a structure where the interaction potential is known, for instance, an LJ system, or even computed using DFT. Indeed, it would be interesting to combine both approaches to develop a framework for optimizing systems where the potential is predicted through ML.
**The energy scale of each atomic structure is different. Using a difference in initial and final energy for evaluation is reasonable. But it is still hard to grasp on which atomic structures StriderNET makes a significant difference. In some cases, the reduction could be less significant due to the complexity of the structure. I am keen to know what other evaluation criteria can be used for this purpose. Perhaps authors can use molecular modelling to determine the stable energy configuration and use that as a benchmark.**
*Response:* We thank the reviewer for the comment. One other evaluation criterion that we have used to evaluate the performance of our model is the reward the model obtains during training. These results are presented in the Appendix (App. C). The reward signifies how well the model reduces energy in comparison to a baseline method. We have used the FIRE algorithm as the *baseline* in our reward function (Eq. 9). We observe in Figure 4 that, as the model trains, it outperforms the baseline method (indicated by positive rewards).
The baseline can be any optimization method, e.g., Adam, FIRE, molecular modelling (as suggested by the reviewer), or even the trained StriderNET model itself. The performance of the baseline thus gives a reasonable estimate of how complex the task is. For instance, the reduction in energy achieved by the baselines during minimization will be high for high-energy structures and low for low-energy structures. This change in energy $\Delta E$ during the minimization of a structure by the baseline can therefore be used as a proxy for how difficult the task is.
**Bayesian optimisation methods using local search, evolutionary approaches, etc., have been used for finding stable atomic configurations in other applications such as drug design. I think there is a scope to strengthen a related work section.
Gonzalez, J., Longworth, J., James, D. C., & Lawrence, N. D. (2015). Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627.**
*Response:* We thank the reviewer for the comment. We have strengthened the related works section by including the reference suggested by the reviewer and an additional reference on evolutionary algorithm (see below) in the Introduction of the revised manuscript.
>Daven, D.M., Tit, N., Morris, J.R. and Ho, K.M., 1996. Structural optimization of Lennard-Jones clusters by a genetic algorithm. Chemical physics letters, 256(1-2), pp.195-200.
**In addition to epochs, could authors also report the runtime of all optimisers? I suspect the chosen baselines will be faster and would like to understand where this added cost comes from.**
*Response:* The following are the results of runtime analysis.
| Method | Runtime for 100 steps (s) |
| ------------------------- | ------------------------- |
| StriderNET Adaptation | 1170 (for 100 epochs)|
| StriderNET Forward Pass | 54.3 |
| Gradient Descent | 0.72 |
| Adam | 0.76 |
| FIRE | 1.02 |
The baseline (classical) optimizers are faster than StriderNET. The major cost in StriderNet comes from the creation of the graph-based representation at each step; the other methods do not require the graph representation.
There are several approaches that can be employed to make StriderNet faster. Our graph update function is currently not JIT compiled, whereas the baseline optimizers are JIT compiled; an efficient JIT-compiled implementation of the graph update function can potentially make StriderNet significantly faster. Further, training is currently performed on CPUs; migrating to GPUs will significantly improve the running time.
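To make this concrete, a minimal dense sketch of a jittable graph construction in JAX (illustrative only; our actual graph update function is more involved):

```python
import jax
import jax.numpy as jnp

@jax.jit
def build_graph(positions, cutoff):
    """Dense O(N^2) neighbour graph; a neighbour list would be needed at scale."""
    n = positions.shape[0]
    # Pairwise displacement vectors and distances.
    disp = positions[:, None, :] - positions[None, :, :]
    dist = jnp.sqrt(jnp.sum(disp ** 2, axis=-1) + 1e-12)
    # Edges within the cutoff, excluding self-loops; the boolean mask keeps
    # array shapes static so jax.jit compiles the function only once.
    adj = (dist < cutoff) & ~jnp.eye(n, dtype=bool)
    return disp, dist, adj
```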
That said, we would like to re-emphasize that the goal of StriderNet is to find stable, lower-energy configurations on a complex landscape. In such cases, finding a better configuration (for instance, in the case of drug discovery, as acknowledged by the reviewer) is of paramount importance irrespective of the computational cost. StriderNet is aimed at such applications.
To address this comment, a new section is added to the Appendix (see App. I, Tab. 10).
**Beyond the convergence curve, I think it would be more reasonable to visualise the energy of individual atoms in an atomic structure at different timesteps to understand the optimisation process of StriderNET better.**
*Response:* We thank the reviewer for the suggestion. Indeed, we analyzed the energy of individual atoms at different steps of StriderNet. We observed that the distribution of energy becomes increasingly narrower with most atoms having comparable energy. This is demonstrated in Fig. 1 of the manuscript. We promise to include additional figures showing the evolution of energy distribution in the Appendix.
**How exactly were baseline methods trained? Do they use GNN on top of the graph representation with the same node attributes? A section may be in an appendix to elucidate the details that will be helpful.**
*Response:* The baselines (Gradient Descent, Adam, and FIRE) are non-neural methods and do not require training. The details of these methods, including the hyperparameters, are given in App. E. Additionally, the run times of all the optimizers are given in App. I, and the effect of the learning rate of these optimizers is discussed in App. J.
If the question is regarding Section 4.3 (effect of baseline and additional components): here, the baseline FIRE refers to the baseline in the reward function. We define the reward as the decrease in energy achieved by the model minus the decrease in energy achieved by the FIRE algorithm. The details regarding the baseline are provided in Sec. 3.4 and Eq. (10).
If the question is regarding Section 4.4 (effect of graph architecture): here, we evaluate the effect of different GNN architectures on the performance of StriderNet. We have added the details of the architectures in Appendix Sec. L.
**What is the effect of the optimiser used in the training of StriderNET? Any specific reason why Adam was used here?**
*Response:* The optimizer used for training will affect the learning of the policy. We chose the Adam optimizer because of its superior performance and popularity in training deep neural networks. Several studies have compared Adam with other optimizers, e.g.:
S. J. Reddi, S. Kale, and S. Kumar, "On the Convergence of Adam and Beyond," arXiv:1904.09237 [cs.LG], Apr. 2019.
Y. Luo, Y. Tao, and D. Wen, "Adaptive Learning Rate Optimization Methods for Neural Networks," Neural Networks, vol. 121, pp. 464-480, Jan. 2020.
**MDP is specified in the text. For clarity, I would suggest that authors could introduce it formally as a tuple and then specify all parts. The full approach could be stated as an algorithm in an appendix.**
*Response:* Thank you for the suggestion. We have added details of MDP in section K in the Appendix. Additionally, we promise to add our approach as an algorithm.
**Why multivariate Gaussian constant factor alpha was chosen 1e-5? Isn’t that too small? I suspect the uncertainty doesn’t help. Is it based on existing work, or was it just an empirical choice by default? Also learning rate of baselines is different from the learning rate of Adam in StriderNET. Did the authors investigate the effect of the different learning rates for baselines? I think this is vital, and seeing it in a rebuttal would be helpful.**
*Response:* The **multivariate Gaussian constant factor ($\alpha$)** controls how far the sampled displacements are from the mean displacement predicted by StriderNET. If $\alpha$ is kept large, the predicted and sampled displacements will deviate substantially, and this will be reflected in the corresponding changes in energy ($\Delta E$). If the deviation in $\Delta E$ is too large, the reward obtained will not correspond to the action the model has taken. Thus, it is a trade-off between exploration and exploitation. We empirically chose this value such that the deviation in $\Delta E$ due to the sampled displacements is no more than 5-10%.
Regarding the **learning rates of the non-neural baselines (Adam, Gradient Descent, and FIRE)**, the values were chosen empirically: for each method, the value that gave the best performance was selected. To demonstrate this, we investigated the effect for all the baselines (dt_start in the case of the FIRE algorithm) by performing additional experiments with different learning rates. The following tables show the minimum energies obtained at different learning rates in 2000 steps; here, "Unstable" means that the algorithm increases the energy instead of decreasing it. We chose the learning rate that gives the minimum energy (a minimal sketch of such a sweep follows the tables below).
**Table:** The performance of baselines with different learning rates for 100 atom LJ-system.
| Learning rate | Gradient Descent | Adam | FIRE |
| ------------- | ---------------- | -------- | -------- |
| 1e-1 | Unstable | Unstable | Unstable |
| 5e-2 | Unstable | -808.6 | Unstable |
| 1e-2 | Unstable | -807.4 | -813.7 |
| 5e-3 | Unstable | -806.9 | -811.4 |
| 1e-3 | Unstable | -803.8 | -806.3 |
| 5e-4 | -799.5 | -793.5 | -803.4 |
| 1e-4 | -791.3 | -761.8 | -787.4 |
| 5e-5 | -786.6 | -730.2 | -770.3 |
| 1e-5 | -762.9 | -626.2 | -693.6 |
**Table:** The same comparison with an energy tolerance of 1e-5 (runs terminated at convergence).
| Learning rate | Gradient Descent | Adam | FIRE |
| ------------- | ---------------- | -------- | -------- |
| 1e-1 | Unstable | Unstable | Unstable |
| 5e-2 | Unstable | -808.6 | Unstable |
| 1e-2 | Unstable | -807.3 | **-813.7** |
| 5e-3 | Unstable | -806.8 | -811.4 |
| 1e-3 | Unstable | -806.8 | -811.3 |
| 5e-4 | -811.3 | -805.4 | -803.4 |
| 1e-4 | -804.1 | -806.6 | -787.8 |
| 5e-5 | -804.1 | -806.6 | -779.7 |
| 1e-5 | -802.5 | -813.5 | -732.3 |
| 1e-6 | - | -806.7 | - |
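The sweep itself can be sketched as below for the gradient-based baselines (shown with optax's Adam; `energy_fn` is a placeholder for the LJ energy function, and FIRE sweeps dt_start instead of a learning rate):

```python
import jax
import jax.numpy as jnp
import optax

def minimize(energy_fn, positions, lr, max_steps=2000, tol=1e-5):
    opt = optax.adam(lr)
    opt_state = opt.init(positions)
    grad_fn = jax.jit(jax.grad(energy_fn))
    prev_e = e = energy_fn(positions)
    for _ in range(max_steps):
        updates, opt_state = opt.update(grad_fn(positions), opt_state)
        positions = optax.apply_updates(positions, updates)
        e = energy_fn(positions)
        if jnp.abs(prev_e - e) < tol:  # energy-based termination cutoff
            break
        prev_e = e
    return positions, e

# Sweep over learning rates and keep the one reaching the lowest energy, e.g.:
# for lr in [1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5]:
#     _, e_min = minimize(energy_fn, init_positions, lr)
```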
To address this comment, a new section is added in the Appendix of the revised manuscript (App. J, Table 11).
**Should energy be the sole criterion for designing atomic structure? For instance, in drug discovery, molecules have different properties at some point, stability of the structure is not the sole criterion, and other properties are relevant to look at. I am trying to understand how practical it is to extend StriderNET to multiple objective criteria.**
*Response:* Energy is one of the most important factors in determining the stability and behavior of atomic structures, especially clusters and bulk systems. However, in other cases, such as drug discovery, there may be other relevant criteria as well, such as the reactivity, solubility, and functionality of the structure. StriderNET can be extended to address these situations by including multi-objective criteria through its reward function definition; the reward can be defined based on the desired properties (see the illustrative sketch below).
To address this comment, new text has been added in the Limitations and future work section of the revised manuscript.
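For illustration, one hypothetical form such a multi-objective reward could take (the weights and property scores here are assumptions, not part of the current StriderNET implementation):

```python
def multi_objective_reward(energy_drop_model, energy_drop_baseline,
                           property_scores, weights):
    # energy_drop_*: energy reductions, as in the current reward (Eq. 9).
    # property_scores: additional normalized objectives, e.g.
    #   {"solubility": 0.7, "reactivity": 0.4} (hypothetical values).
    # weights: relative importance of each property in the overall reward.
    r = energy_drop_model - energy_drop_baseline
    for name, w in weights.items():
        r += w * property_scores[name]
    return r
```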
**Ideally, figures should be self-explanatory. I suggest the authors add a detailed caption for Figure 2.**
*Response:* We have updated the caption for Figure 2.
**My main concern is its applicability to complex systems where structures have a larger number of atoms. Also, I find baselines and ablations are weak and need more investigation to validate the benefits of StriderNET. Further comments are in the strengths and weaknesses section.**
*Response:* With the additional experiments on larger system sizes, the ablations on the node and edge features, and the additional analysis of the baselines, we hope the concerns raised by the reviewer have been addressed.
**Appeal to the reviewer:** Thank you for the insightful review. We hope we have addressed all the concerns raised by you. If there are any additional concerns please do let us know and we would be glad to engage in a discussion on those. Otherwise, we request the reviewer to raise the score for the work.
**Minor Comments:**
**Line 73 on the right side, “Have been focuses -> have focused.”**
*Response:* We thank the reviewer for pointing this out. We have corrected it.
**“an actual material -> actual material” (material is an uncountable noun)**
*Response*: Thank you. We have modified it.
**Missing citation line 701 in an appendix.**
*Response:* Thanks. We have fixed the citation.
**Please change “gaussian” -> “Gaussian” in the appendix**
*Response:* Thanks for pointing this out. We have corrected it.
**List of software packages and hardware configurations could go in the appendix**
*Response:* Thanks for the suggestion. We have incorporated the change. Please refer App. E in the revised manuscript.
**I would appreciate it if the unit of energy is stated in Tables and Figures wherever applicable.**
*Response:* Thank you. Please note that the LJ system is unitless (quantities are expressed in reduced LJ units in terms of $\lambda$ and $\beta$; see Eq. 2), and hence the corresponding quantities are also unitless. For the others, units are mentioned.