## Response to reviewer LqM5 (Part I)

We are grateful for the detailed review and the constructive feedback. This will be very helpful for improving this work. We recognize that we have not fully succeeded in conveying our results and their significance in a satisfactory manner. We will strive to rectify this in a revised version.

*The reviewer's follow-up comment:*

> I thank the authors for engaging with my review and for their thorough responses to my questions. It is clear from the responses that the authors' clarity is far greater than what is reflected in the manuscript. I think we are on the same page in terms of the revisions that are needed to construct a clearer, better argued, and more effective paper. My enduring concern is that once the revisions are implemented the paper will really be a whole other paper (even though the results will remain the same), and it is not clear to me that the manuscript should not go through another round of reviews at that point. For now I am raising my score slightly pending discussion with the Area Chair on this point. Re: point 13, what I meant to suggest is that it would be ideal to offer some explanation of why the MII fails to work as well on the double pendulum as on the Hénon-Heiles problem. Failing that, it would be good to explain why that problem is too hard to solve in this paper and to give a sense of what may need to be done in order to solve it. Thank you again.

We would again like to thank the reviewer for taking the time to engage with this paper, giving us the chance to learn and improve it. We agree that the differing performance of MII on the two systems calls for a deeper analysis. We have continued to study the proof of Theorem 4.2 and reached a deeper understanding of what is going on. This analysis will certainly be included (and extended beyond what is shown here) in the revision, resulting in a more complete paper.

Consider the proof of Theorem 4.2, specifically from line 827. Let
$$
\begin{align}
\alpha_2 &:= (b^Tv)^2 + (b^T(\mathbb{1} - v))^2 \\
\alpha_1 &:= b^T(\mathbb{1} - 2v)
\end{align}
$$
We then have that
$$ \begin{align} \text{Var} \bigg [ \tilde y_{n+1} - \tilde y_n - h\sum_{i=1}^s b_i k_i \bigg] \approx& \sigma^2 \bigg [2I +hb^T(\mathbb{1} - 2v)\big(f'(y_n) +f'(y_n)^T\big) +h^2\bigg ((b^Tv)^2 + (b^T(\mathbb{1} - v))^2\bigg) f'(y_n)f'(y_n)^T \bigg ]\\\\ =& \sigma^2 \bigg [2I +h\alpha_1 \big(f'(y_n) +f'(y_n)^T\big) +h^2\alpha_2 f'(y_n)f'(y_n)^T \bigg ] \end{align} $$
Since the matrices involved here are symmetric, their spectral radius coincides with the Euclidean operator norm, $\rho(A) = \|A\|_2$. Thus we find that
$$ \begin{align} \rho\bigg(\text{Var} \bigg [ \tilde y_{n+1} - \tilde y_n - h\sum_{i=1}^s b_i k_i \bigg]\bigg ) &\approx \sigma^2 \bigg \| 2I +h\alpha_1 \big(f'(y_n) +f'(y_n)^T\big) +h^2\alpha_2 f'(y_n)f'(y_n)^T \bigg \|_2 \\\\ &\leq \sigma^2 \bigg (2 +h|\alpha_1| \big \| f'(y_n) +f'(y_n)^T\big\|_2 +h^2|\alpha_2| \big \| f'(y_n)f'(y_n)^T \big\|_2 \bigg )\\\\ &= \sigma^2 \bigg (2 +h|\alpha_1| \rho\big(f'(y_n) +f'(y_n)^T\big) +h^2|\alpha_2| \rho\big ( f'(y_n)f'(y_n)^T\big) \bigg )\\\\ \end{align} $$
This allows us to separate the sensitivity into four different quantities.
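As a concrete worked example (ours, under the assumption that the implicit midpoint rule corresponds to $b = (1)$ and $v = (1/2)$ in the MIRK parametrization used above), we get
$$
\alpha_1 = b^T(\mathbb{1} - 2v) = 1\cdot(1 - 1) = 0, \qquad \alpha_2 = (b^Tv)^2 + (b^T(\mathbb{1}-v))^2 = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2},
$$
so for the midpoint method the bound reduces to $\sigma^2\big(2 + \tfrac{h^2}{2}\,\rho\big(f'(y_n)f'(y_n)^T\big)\big)$: the first-order term in $h$ vanishes and only the quadratic term remains.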
The MIRK method determines $\alpha_1$ and $\alpha_2$, while the true vector field determines the size of the largest eigenvalues of the symmetric Jacobian terms:
$$
\begin{align}
F_1(y_n) := \rho\big(f'(y_n) +f'(y_n)^T\big) \quad \text{and} \quad F_2(y_n) := \rho\big ( f'(y_n)f'(y_n)^T\big)
\end{align}
$$
For a somewhat arbitrarily chosen trajectory, we find for the Hénon-Heiles system
$$ F_1(y_n)\approx 0.51, \quad F_2(y_n)\approx 1.55, $$
and for the double pendulum
$$ F_1(y_n)\approx 1.58, \quad F_2(y_n)\approx 1.8. $$
This provides insight into why MII works better for HH than for DP. Since for a Hamiltonian system we have
$$ f'(y) = J\nabla^2H(y), $$
where $\nabla^2H(y)$ is symmetric, it might be possible to refine these estimates even further. In any case, it is clear that the differing performance can be understood by analyzing the eigenvalues of the Jacobian for the different Hamiltonian systems. This is easy to do numerically in a given domain, and could even be done on the approximated vector field, i.e. on $f'_{\theta}(y)$, to determine whether MII is the proper tool to choose.
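To illustrate how such a numerical check could be carried out, here is a minimal sketch (ours, not code from the paper) that estimates $F_1$ and $F_2$ over sample points in a domain. The function name `sensitivity_factors`, the sampling domain, and the use of the mean over sample points are our own choices; `jac` could wrap either an analytic Jacobian or an automatic-differentiation Jacobian of $f_{\theta}$:

```python
import numpy as np

def sensitivity_factors(jac, points):
    """Estimate F1 = rho(f' + f'^T) and F2 = rho(f' f'^T) over sample points.

    jac    : callable returning the Jacobian f'(y) as an (n, n) array
    points : iterable of state vectors y sampled from the domain of interest
    """
    f1_vals, f2_vals = [], []
    for y in points:
        J = np.asarray(jac(y))
        # Both matrices are symmetric, so the spectral radius is the
        # largest eigenvalue in absolute value.
        f1_vals.append(np.max(np.abs(np.linalg.eigvalsh(J + J.T))))
        f2_vals.append(np.max(np.linalg.eigvalsh(J @ J.T)))  # PSD: largest eigenvalue
    # Averaging over the sampled points is one (arbitrary) choice of summary.
    return np.mean(f1_vals), np.mean(f2_vals)

# Example: the Hénon-Heiles Jacobian (cf. the matrix displayed below),
# evaluated at random points in the (arbitrarily chosen) box [-0.5, 0.5]^4.
def henon_heiles_jac(y):
    q1, q2, p1, p2 = y
    return np.array([[0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0],
                     [-2 * q2 - 1, -2 * q1, 0.0, 0.0],
                     [-2 * q1, 2 * q2 - 1, 0.0, 0.0]])

rng = np.random.default_rng(0)
F1, F2 = sensitivity_factors(henon_heiles_jac, rng.uniform(-0.5, 0.5, size=(100, 4)))
print(F1, F2)
```

The same routine applied to a Jacobian of the learned vector field $f_{\theta}$ (obtained by automatic differentiation) would give the data-driven variant suggested above.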
For reference, the Jacobian of the Hénon-Heiles vector field is
$$ f'(y) = \begin{bmatrix}0 & 0 & 1 & 0\\0 & 0 & 0 & 1\\- 2 q_{2} - 1 & - 2 q_{1} & 0 & 0\\- 2 q_{1} & 2 q_{2} - 1 & 0 & 0 \end{bmatrix} $$
while the Jacobian of the double pendulum vector field (showing only the first and last columns) is
$$ \begin{align} f'(y) = \begin{bmatrix}\frac{p_{2} \sin{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} - \frac{2 \left(p_{1} - p_{2} \cos{\left(q_{1} - q_{2} \right)}\right) \sin{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} & \dots & \dots & - \frac{\cos{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1}\\\\ \frac{p_{1} \sin{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} - \frac{2 \left(- p_{1} \cos{\left(q_{1} - q_{2} \right)} + 2 p_{2}\right) \sin{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} &\dots &\dots & \frac{2}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1}\\\\ -\frac{p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} + \frac{4 p_{1} p_{2} \sin^{2}{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} - \frac{2 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \sin^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} + \frac{2 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \cos^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} - \frac{8 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \sin^{2}{\left(q_{1} - q_{2} \right)} \cos^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{3}} - 2 \cos{\left(q_{1} \right)} & \dots&\dots & - \frac{p_{1} \sin{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} + \frac{2 \left(- p_{1} \cos{\left(q_{1} - q_{2} \right)} + 2 p_{2}\right) \sin{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}}\\\\ \frac{p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} - \frac{4 p_{1} p_{2} \sin^{2}{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} + \frac{2 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \sin^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} - \frac{2 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \cos^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} + \frac{8 \left(- p_{1} p_{2} \cos{\left(q_{1} - q_{2} \right)} + \left(\frac{p_{1}^{2}}{2} + p_{2}^{2}\right)\right) \sin^{2}{\left(q_{1} - q_{2} \right)} \cos^{2}{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{3}} & \dots &\dots & \frac{p_{1} \sin{\left(q_{1} - q_{2} \right)}}{\sin^{2}{\left(q_{1} - q_{2} \right)} + 1} - \frac{2 \left(- p_{1} \cos{\left(q_{1} - q_{2} \right)} + 2 p_{2}\right) \sin{\left(q_{1} - q_{2} \right)} \cos{\left(q_{1} - q_{2} \right)}}{\left(\sin^{2}{\left(q_{1} - q_{2} \right)} + 1\right)^{2}} \end{bmatrix} \end{align} $$

Here is our response to your questions and remarks:

**(1) Introduction.** The reviewer has a point in that the motivation behind learning an approximation to the vector field $f$ is not highlighted sufficiently. We believe this problem is becoming increasingly important, as more and more data is collected from systems in, e.g., engineering, and this data can be used to obtain models of the system that are more accurate than analytic models based on simplified assumptions. One specific setting where this could be useful is control problems, where it is crucial to know the form of the dynamical system governing the vehicle one aims to control. Deriving an analytical expression for the dynamical system could be time-consuming, and data-driven modelling with appropriate physics-informed priors (e.g. a Hamiltonian structure) could yield improved results. We will include this motivating example in the introduction in the revised version.

**(1) Late introduction of main contribution.** This paper is not only about learning from noisy data. It discusses, more generally, the problem of which numerical integrator to use when training the vector field. It first presents how the MIRK methods offer computationally efficient training and beneficial properties such as high order or symmetry. Using MIRK methods for inverse problems is a novel idea. Then, it considers the mean inverse integrator as a way to improve training on noisy data, which takes advantage of the inverse explicit property of the MIRK methods. We will amend the introduction to make this clearer.

**(2) Lemma 3.1.** This is a simple but novel result. While it is well known that MIRK methods are implicit only in $y_{n+1}$ (and not in the stage values $k_i$), their use when approximating vector fields from known solution trajectories has not been considered before.

**(3) Theorem 3.2.** Following up on Lemma 3.1, it is of general interest to study the properties of the class of methods we have identified as apt for learning dynamical systems. The use of symplectic integrators in the time integration of Hamiltonian systems is generally recognized as advantageous, especially when integrating over long times. For this reason, it might be natural to use symplectic methods also for discovering Hamiltonian dynamics, as other authors have pointed out; see for example [2,3].
Indeed, in [3] it is argued that symplectic methods have benefits when used to retrieve Hamiltonian vector fields from noisy data. Hence, it is relevant and interesting in this paper to discuss symplectic MIRK methods. Moreover, the story we are trying to tell in the paper is that although symplectic MIRK methods have maximum order 2, non-symplectic but symmetric higher-order MIRK methods exist and show good performance.

**(4) Integrators outside RK-class.** We limit our study to RK methods. Although this is a very large and important class, we think we would be remiss if we did not mention that alternatives beyond this class exist and discuss which properties they have.

**(5) Symplectic and non-separable.** Agree. We will include this in the revision, e.g.:

- "When applying a symplectic method to a Hamiltonian ODE, we effectively compute the solution of a nearby Hamiltonian system, with a different but 'close' Hamiltonian function which is exactly preserved by the numerical flow over long times [1, Ch. IX.8]."
- "A Hamiltonian that can be written in the form $H(q,p) = H_1(q) + H_2(p)$ is said to be separable. The standard example is the case where the energy is the sum of kinetic and potential energy."

**(6) Notation on the mean inverse integrator.** The reviewer correctly points out the unclear use of notation here. There is indeed a discrepancy between Figure 2 and Eq. 16. If the MIRK method in use is not symmetric, we recommend using the adjoint (in the sense of [1, Ch. II.3]) when time-stepping from the right, so that we can reuse evaluations and keep a low computational cost. This is what is illustrated in Figure 2, but we should have written this out to make it clear, and will do that in the revision. Then we will also amend Eq. 16 to use $\Psi_{2,3}$ and not $\Psi_{3,2}$, in addition to specifying that we use the adjoint (denoted by the superscript $\cdot ^*$) from the right.
$$ \begin{align} \overline y_{2} &= \frac{1}{3} \big ( \Phi_{h,f}(\tilde y_{1}) + \Phi_{h,f} \circ \Phi_{h,f}(\tilde y_{0}) + \Phi^*_{-h,f} (\tilde y_{3}) \big) \\ &\approx \frac{1}{3} \big ( \tilde y_1 + \tilde y_3 + \Phi_{h,f}(\tilde y_0) + h( 2\Psi_{1,2} - \Psi_{2,3}) \big) \\ &\approx \frac{1}{3} \big (\tilde y_0 +\tilde y_1 +\tilde y_3 + h( \Psi_{0,1} + 2 \Psi_{1,2} - \Psi_{2,3} ) \big). \end{align} $$

**(7) Notation for $Y$.** We thank the reviewer for catching this error, which we will fix in the revision.

**(8) Notation for $Y$ in optimization objective.** Again, to be consistent, it should be $\tilde Y$ in Eq. 18. The whole point of the mean inverse integrator is to cancel noise by averaging trajectories, hence the notation for noisy samples should be used. The main point is not to compute $\bar Y$, but to obtain an optimization objective that is less sensitive to noise. Averaging different trajectories, i.e. computing $\bar Y$, achieves this. When $\bar Y_{\theta}$ is computed, we use Eq. 18 as a loss function and use the same optimization algorithm (L-BFGS) as for a one-step method. One major difference is that we train the network with a one-step method for the first half of the epochs, and then use the MII scheme ($\bar Y_{\theta}$) for the remaining epochs. We arrived at this approach after testing different ways to do training with MII, and found that this yielded improved results over using MII directly; a sketch of the two-phase procedure is given below.
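For concreteness, here is a minimal sketch (ours, not the exact implementation) of that two-phase training loop, assuming hypothetical helpers `one_step_loss` and `mii_loss` that evaluate the one-step objective and the MII objective (Eq. 18) for a given set of noisy samples:

```python
import torch

def train_two_phase(f_theta, y_noisy, h, epochs, one_step_loss, mii_loss):
    """Pre-train with a one-step objective, then switch to the MII objective.

    f_theta       : neural network approximating the vector field
    y_noisy       : tensor of noisy trajectory samples
    h             : step size
    one_step_loss : callable (f_theta, y_noisy, h) -> scalar loss  (hypothetical)
    mii_loss      : callable (f_theta, y_noisy, h) -> scalar loss  (hypothetical)
    """
    optimizer = torch.optim.LBFGS(f_theta.parameters())
    for epoch in range(epochs):
        # First half of the epochs: one-step objective; second half: MII objective.
        loss_fn = one_step_loss if epoch < epochs // 2 else mii_loss

        def closure():
            optimizer.zero_grad()
            loss = loss_fn(f_theta, y_noisy, h)
            loss.backward()
            return loss

        optimizer.step(closure)
    return f_theta
```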
**(9) Noise level.** This is an important question, and it is not immediately clear what to measure in order to say whether $\sigma$ is relatively large or not. We believe the most relevant metric is the average distance between two sequential points in the (unperturbed) training data:
$$
\frac{1}{N}\sum_{n=0}^{N-1}\|y(t_n) - y(t_{n+1}) \|
$$
For the two systems and the three different step sizes $h$, this is given by:

*Average distance in training data*

|**System** |$h=0.4$ |$h=0.2$ |$h=0.1$ |
|---|---|---|---|
|Double pendulum |$0.278$ |$0.141$ |$0.071$ |
|Hénon-Heiles |$0.197$ |$0.099$ |$0.050$ |

This tells us that the noise level of $\sigma = 0.05$ goes from a relative magnitude of approximately $25\%$ up to $100\%$ of the average distance between sequential points. This is clearly illustrated in the plots linked below, which display the training data for decreasing step sizes $h$ and constant noise $\sigma = 0.05$. These plots will be included in the revised version (in the Appendix if necessary because of the page limit) to give an idea of the scales of the noise and step size.

*Link to figure:* [Plot of the training data and exact flow for the double pendulum problem.](https://gcdnb.pbrd.co/images/bKRaaSi9d604.png)

**(10) Number of neurons.** After testing different architectures, we found that $100$ neurons in each layer was sufficient, and did not see much improvement when increasing the size further. Other works on training Hamiltonian neural networks tend to use neural networks with $2$ or $3$ layers and typically between $200$ and $300$ neurons per layer. We include a slightly reduced variant of the experiment displayed in Figure 4, only omitting MIRK$3$ and $(h,N) = (0.1,16)$ to decrease the time to run the experiment. Comparing the figure below with Figure 4, we do not observe large differences. The main difference is that the result for the double pendulum without noise is closer to $0.002$, whereas the flow error in Figure 4 is approximately $0.003$. The computational time should not be considered in this figure, as multiple experiments were run simultaneously. The tendency of the MII to yield improved results when more data is available and the step size $h$ is smaller still holds.

*Link to figure:* [Plot of error training neural networks with $200$ neurons per layer.](https://gcdnb.pbrd.co/images/O3EIAIjzVsoN.png)

**(11) Dimensionality of $y_i$.** As given in Appendix B, it is $n = 4$ (or $d=2$) for both problems. This should have been stated in Section 5, and will be in the revised version.

**(12) Transitions.** The reviewer is right and we are happy to comply. Here, the neural network is trained in the same way as is done with a one-step method. However, there is a critical difference that should be mentioned. The neural network is initially trained for half the number of epochs using a one-step method. Then, for the rest of the epochs, it is trained using MII. This procedure was chosen after testing different ways to train the networks. The initial training using a one-step method could be seen as a sort of pre-training. We will add these details in the revised version.

**(13) When does the MII yield improved results?** This is obviously a fundamental question. The sensitivity analysis (Theorem 4.2) attempts to address it. By examining Figure 3, it appears that using MII to learn the double pendulum problem requires a smaller step size to achieve reduced standard deviation compared to the Hénon-Heiles system. However, this analysis is not properly linked to the experimental results observed in Figure 4. We have tried to address this in a systematic way, without managing to get to the bottom of it.
Connecting the relative performance of MII (vs. a one-step method) to some measure of the stiffness of the system (ideally of the neural network approximation) would be desirable, for instance the magnitude of the Jacobian $\|f'_{\theta}(y)\|$ for $y$ in some domain $\Omega$, or some upper bound on the Lipschitz constant. We have tried to investigate this through numerical experiments, without success. This is something that should be mentioned in the limitations of this work. The reviewer writes "*This could be done either by actually studying the performance of the mean inverse integrator on the double pendulum here (best case scenario)*", but we are unable to fully understand what is meant here. If this could be clarified, we will address it further.

**(14) Notation for $\Phi$ and $y\in \mathbb{R}^{?}$.** In most of the numerical integration literature, one writes $\Phi_{h,f}(y_n)$, even though in many cases (all RK methods that are not explicit, for instance) the integrator also takes $y_{n+1}$ as an argument. We would not object if this is called a systematic abuse of notation, even though it is the convention within this field. We chose to include both arguments explicitly in lines 243 and 317 for greater clarity. This could have been pointed out in the text to avoid confusion, and it will be done in the revision. For Hamiltonian systems, the dimension has to be even ($n = 2d$). However, much of the discussion in this paper is more general, and in that case it is not necessary to restrict to $n = 2d$. But yes, this leads to confusing notation, and we will clean it up in the revised version.

**References.**

[1]: Hairer, E., Lubich, C., and Wanner, G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations; 2nd ed. Springer, Dordrecht, 2006. doi: 10.1007/3-540-30666-8.

[2]: Offen, C. and Ober-Blöbaum, S. Symplectic integration of learned Hamiltonian systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 32(1):013122, 2022.

[3]: Chen, Z. et al. Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334, 2019.

## Response to reviewer cbt9

We would like to thank the reviewer for relevant questions and valuable input. Several of the issues mentioned will be addressed in a revised version. Our responses are found below:

**Line 192, methods employed by other authors.** This paper identifies the class of MIRK methods as being inverse explicit. The references are examples where methods belonging to this class have been used. However, these other works have not identified the general class of MIRK methods, which contains all inverse explicit Runge--Kutta methods, and we consider this a substantial contribution. The identification of the full class that we provide here makes it possible to analyze the general properties and limitations of these methods, and to derive optimal methods for specific uses. This is demonstrated in the paper by Theorem 3.2 and by the higher-order MIRK methods that have not previously been suggested for learning dynamical systems from data.

**Two irrelevant parts.** The reviewer correctly points to the narrative challenge of this paper. It tries simultaneously to be general where possible (any one-step method could be used in the mean inverse integrator) and to argue that MIRK methods are superior (due to efficiency, high order, and symmetry).
This movement from the general (any integrator, any ODE) to the specific (MIRK methods, Hamiltonian ODEs) is challenging, and it is possible that we should have been less general and more specific to get a more straightforward narrative. However, the advantage of being general is that the results found in this paper could readily be applied in various other settings. In our revised version we will point to this in the introduction, which will hopefully improve the readability.

**Relevance of Theorem 4.2.** The theorem does not prove the advantages of the MII algorithm directly. It shows that the MII reduces the variance of the optimization objective when the neural network is replaced with the exact vector field. However, one gets very similar results when carrying out the same analysis with the neural network vector field instead. The connection between lower variance in the optimization objective and lower error in the trained vector field still has to be made, and we were not able to do that for this paper. A complete analysis involving the specific neural network used is a big undertaking that we are prepared to face in the coming years. We still believe that Theorem 4.2 is a major contribution, in that it points out a direction for establishing the above-mentioned connection more rigorously.

### Questions:

**1. Definition of inverse explicit.** This is not an existing definition. We agree that it should be defined more rigorously. Here is how we will do that in the revision:

***

Let $S_N = \{y(t_n)\}_{n=0}^N$ be pointwise observations of $y(t)$, where $\frac{d}{dt}y(t) = f(y(t))$. Let $y_{n+1} = \Phi_{h,f}(y_n,y_{n+1})$ be a numerical one-step method. Let the *inverse injection* for the method $\Phi$ be defined as the substitution $(y_n,y_{n+1}) \rightarrow (y(t_n),y(t_{n+1}))$ such that
$$ \hat y_{n+1} = \Phi_{h,f}(y(t_n),y(t_{n+1})). $$
A numerical one-step method is called *inverse explicit* if it is explicit under the inverse injection.

***

**2. Mentioned works in line 192.** They all set up the optimization problem in a similar way, albeit with slightly different formulations. There are differences in which integrators they use. Mehats et al. focus on symplectic methods: symplectic Euler (implicit for initial value problems) and the implicit midpoint method. Eidnes et al. consider pseudo-Hamiltonian systems, which are not symplectic. They consider two explicit methods (explicit Euler and RK4) and two inverse explicit methods (the implicit midpoint rule and a novel symmetric method of order $p=4$). Finally, Celledoni et al. consider unconstrained Hamiltonian systems whose exact solutions are constrained to a manifold. They compare Lie group integrators, which are specifically designed to preserve this geometry, to standard Runge--Kutta methods such as RK4.

**3. Difference between the variance estimates.** The reviewer correctly observes that under the assumptions $h\rightarrow 0$ and $N\rightarrow \infty$, the variance of the optimization objective approaches $\sigma$ for MII and $2\sigma$ for a one-step method. As seen in the proof, the approximation is due to a first-order Taylor expansion (linearization) in the direction of the perturbation $\delta_n$, meaning that terms of order $\mathcal O(\|\delta_n \|^2)$ are not considered. We then compute the variance of the truncated expansion. In the revision, this will be pointed out in the text preceding Theorem 4.2.

**4. Influence of neural network in Theorem 4.2.** No, this is not considered.
However, in the process of developing this theory, we found similar behaviour when replacing $f$ with $f_{\theta}$ in the sensitivity analysis shown in Figure 3. Using $f$ instead of $f_{\theta}$ could be seen as a sort of *a priori* sensitivity analysis, in that one assumes that the exact solution ($f$ in this case) is known and describes the error (or sensitivity) of a numerical algorithm based on this. Using the exact solution in error analysis is quite common in numerical analysis.

**5. $\approx$ in Eq. 16.** The full derivation of Eq. 16 was omitted due to the page limit. Here is the complete derivation of how taking two steps with step size $h$ from $\tilde y_0$ to $y_2$ can be approximated. The known data points $\tilde y_i$ for $i = 0,\dots,3$ are repeatedly injected to obtain an explicit procedure:
$$ \begin{align*} y_2^{(2)} &= \Phi_{h,f} \circ \Phi_{h,f}(\tilde y_{0}) \\ &= \Phi_{h,f}(\tilde y_{0}) + hf\bigg (\frac{ \Phi_{h,f}(\tilde y_{0}) + y_2^{(2)}}{2}\bigg )\\ &\approx \Phi_{h,f}(\tilde y_{0}) + hf\bigg ( \frac{\tilde y_1 + \tilde y_2}{2}\bigg )\\ &= \tilde y_0 + hf\bigg ( \frac{\tilde y_0 + \Phi_{h,f}(\tilde y_{0})}{2} \bigg ) + h\Psi_{1,2}\\ &\approx \tilde y_0 + hf\bigg ( \frac{\tilde y_0 + \tilde y_1 }{2} \bigg ) + h\Psi_{1,2}\\ &= \tilde y_0 + h\Psi_{0,1} + h\Psi_{1,2}.\\ \end{align*} $$
Here, the superscript $\cdot ^{(2)}$ denotes taking two steps in the positive direction. The same approximations are made for $y_2^{(1)}$ and $y_2^{(-1)}$, and finally we find
$$ \begin{align*} \overline y_2 = \frac{1}{3}(y_2^{(1)} + y_2^{(2)} + y_2^{(-1)}). \end{align*} $$
We will include this explanation in the final paper. The use of $\approx$ in Theorem 4.2 is explained in pt. 3 above.

**6. MIRK & MII.** Yes, any one-step method could be used. However, inverse explicit methods (such as the class of MIRK methods) allow for significant computational savings compared to inverse implicit methods, such as the Gauss-Legendre methods of order 4 and 6. The derivation under pt. 5 above would not be possible for a general implicit method, and using the MII with such a method would thus require a higher number of vector field evaluations. Additionally, one would have to solve non-linear systems, resulting in a dramatic increase in computational cost when using MII. We argue that the MIRK methods are generally attractive for inverse problems and should hence also be attractive in tandem with MII. RK4 is inverse explicit since it is explicit in the traditional sense. However, it is more expensive (it has $s=4$ stages, compared to MIRK4 which has $s=3$) and is not symmetric. Hence, MIRK$4$ would in many cases be a better choice.

## Response to reviewer vtZ8

We thank the reviewer for a thorough reading of our paper and very useful feedback. We recognize that we have not done a sufficiently good job of conveying what our main contributions are, which we regard as significant, and will do our best to amend this in a revised version. The reviewer has raised some important remarks and questions, and we have strived to provide a detailed response:

**Novelty of Lemma 3.1 and Theorem 3.2.** The contents of Section 2 are known to the general field. The contents of Section 3 are either reformulations of established results in the new setting of inverse problems or new contributions. Regarding Lemma 3.1, other recent works have exploited the same trick of injecting the data (typically for the implicit midpoint method).
However, identifying that the class of MIRK methods encompasses the RK methods for which this is possible is a novelty, formalized for the first time here. It is a simple but powerful result, as it opens up a whole class of methods to be studied and applied in a new setting. Theorem 3.2 is, to the best of our knowledge, also a novel result, and less trivial than Lemma 3.1. Hence we agree with the reviewer that the novelty should be pointed out more explicitly.

**Assumption on Gaussian noise and Figure 3.** Assuming that the noise is Gaussian with diagonal covariance is convenient for keeping the analysis (the proof of Theorem 4.2) clean. The only place where the distribution is used and makes a difference is when we compute the variance in lines 830 and 871. Changing the Gaussian/diagonal covariance assumption simply means changing the variance in those lines. To what degree this assumption is realistic in scenarios where data is collected from sensors in physical systems is a major and significant question, and would naturally depend on the specific case. We will include a brief discussion of this in the revised paper. The blue lines show the standard deviation (sd) of the optimization objective when using a one-step method (OS), meaning we train the network one step at a time. We wish the MII to have lower sd than OS. The dotted lines represent the sample (Monte Carlo) approximations, and we would like to see correspondence between the dotted lines and the solid lines of the same colour, as the solid lines are the theoretical/analytical approximations of the sd. We agree that this should be explained in greater detail, and it will be in the revision.

**Relegation of proofs to Appendix.** We regretfully made the choice to relegate these results to the appendix to comply with the page limit. If given the chance to submit a camera-ready version, with one more page available, we will include a sketch of both proofs.

**Brief discussion of experiments.** The discussion is very brief, and we might consider shortening some of the fundamental introduction to ODEs and numerical integration in Section 2 to allow for a more extensive discussion. 1) Due to the high number of different methods being used in the experiments, we found that splitting the results into two figures enhanced readability. 2) *Number of samples in error computation.* These choices are somewhat arbitrary, and we value the input of the reviewer on this topic. There is no problem in increasing the number of points where the error is evaluated, and perhaps a better number is closer to 50. 3) The low number of re-runs of the experiments is due to computational costs and limited access to GPUs. However, it should be possible to at least double the number of runs to 10 or more.

**Lack of comparison to other invertible methods to learn ODEs.** This paper focuses on the role of numerical integrators when learning vector fields, and we have thus focused on testing methods relying on numerical integration in the training. The reviewer correctly assumes that incompatibility of the problem setup makes it a challenge to compare directly to related work on inverse ODE learning. Admittedly, we have not sufficiently discussed this in the paper, and we will amend this in the revision. To that end, we have constructed new experiments using an approach similar to that in [1]: taking multiple steps with an integrator before computing the loss. This is a natural method to test against the MII; a sketch of the idea is given below.
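As a rough illustration of what we mean by taking multiple steps before computing the loss, here is a minimal sketch under our own naming conventions (not the exact implementation; whether the loss compares against every intermediate sample or only the endpoint is a design choice):

```python
import torch

def rk4_step(f, y, h):
    """One classical RK4 step for y' = f(y)."""
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def recurrent_loss(f_theta, y_traj, h, n_steps):
    """Roll the learned vector field forward n_steps from the first sample
    and compare against the (noisy) samples along the trajectory.

    y_traj : tensor of shape (N+1, d) with samples y_0, ..., y_N
    """
    y_pred = y_traj[0]
    loss = 0.0
    for n in range(1, n_steps + 1):
        y_pred = rk4_step(f_theta, y_pred, h)
        loss = loss + torch.mean((y_pred - y_traj[n]) ** 2)
    return loss / n_steps
```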
In this experiment, we benchmark MIRK$4$ against RK$4$, using the methods 1) as one-step schemes, 2) with multiple steps (denoted by *rec* in the plots), and 3) within the MII method. The results are interesting: MII is superior in the noisy setting, where $\sigma = 0.05$, while recurrent training (multiple steps) is superior for the Hénon--Heiles problem without noise. For the double pendulum, however, recurrent training has the lowest performance. These results should be verified by re-running the experiments multiple times to compute the standard deviation of the errors. It is likely that the poor performance of the recurrent training approach is due to the highly non-linear (and chaotic) nature of the double pendulum problem. Even so, more analysis is needed to understand this fully, and we will perform that analysis and report the results in the revised version.

*Link to figure:* [Plot of the error when training neural networks with one-step, recurrent and MII integration.](https://gcdnb.pbrd.co/images/AHAIxFeHR5mH.png)

*This is in part the same as what is written for the last reviewer (1V9V).*

**Failure of one of the methods.** In the paper we write "*Furthermore, sufficiently high order might be necessary, as illustrated by the failure of the midpoint method to learn the Hénon–Heiles vector field without noise in Figure 4.*" In Figure 4, it is straightforward to identify the error of the midpoint method used in training on the Hénon--Heiles problem. So we do not understand what issue the reviewer is pointing to in this case.

**Computational cost.** The computational cost reported is the time used to train the neural networks. (We will clarify this in the revision.) When the learned vector field is used to obtain a numerical flow, the same integrator (DOP853) is used for all training approaches, so that the differences in the reported errors stem only from the training method.

**Limitations.** We agree that the limitations are not addressed to a great extent, and we will include a broader discussion in the revision. The main limitation of the paper, as we see it, is the following: while Theorem 4.2 provides an idea of how the MII method reduces sensitivity in training, it does not offer a direct connection to the performance of the learned vector field.

**Explicit contribution.** To the best of our knowledge, Lemma 3.1 and Theorem 3.2 are novel. The MIRK methods of order 4 and 6 have not previously been suggested for the inverse problem. In general, employing a wide range of MIRK methods to train neural networks to approximate vector fields has not been done before, nor has employing these methods in the way that is done in the MII framework. These facts should have been stated more clearly in the paper.

**Numbering of equations.** Agree, we are happy to do so.

## Response to reviewer 1V9V

We thank the reviewer for the constructive feedback. This review is on point and addresses some important remarks that will help us improve the quality of this paper. Here are our responses:

**Lemma 3.1 and Theorem 3.2.** To our knowledge, these results were not previously known. If the reviewer could point us to the appropriate reference(s), we would cite those and not state these results as a lemma and a theorem.

**Flow error.** The flow error (line 378) is computed over the time $T = h\cdot N_1$, meaning the learned vector field is integrated over the same time interval as the one where there is test data.
Note, however, that the initial values used in testing are different from the initial values and trajectories used in training, even if the length of the time interval is the same. We will make this clear in the revised paper.

**Testing against explicit methods taking several steps.** This is a good suggestion. We have run new experiments that we would like to include in the final paper. The approach of taking several steps is used in [1] with the Störmer--Verlet method on separable Hamiltonian systems, allowing for explicit time-stepping. We have included the results from this experiment below, without standard deviations, since we only ran the experiments once due to time limitations. In this experiment, we benchmark MIRK$4$ against RK$4$, using the methods 1) as one-step schemes, 2) with multiple steps (denoted by *rec* in the plots), and 3) within the MII method. Taking multiple steps with RK$4$ is straightforward, since it is explicit. With MIRK$4$ we use fixed-point iterations to solve the implicit equations in every step (the inverse explicit property is no longer helpful in the multi-step setting). The results are interesting:

1. MII is superior in the noisy setting, $\sigma = 0.05$.
2. Recurrent training (multiple steps) is superior for the Hénon--Heiles problem without noise. However, for the double pendulum, recurrent training has the lowest performance.

These results should be verified by re-running the experiments multiple times to compute the standard deviation of the errors. It is likely that the poor performance of the recurrent training approach is due to the highly non-linear (and chaotic) nature of the double pendulum problem.

*Link to figure:* [Plot of the error when training neural networks with one-step, recurrent and MII integration.](https://gcdnb.pbrd.co/images/AHAIxFeHR5mH.png)

**MIRK after training.** MIRK is implicit when solving an IVP, yes. The flow error is computed using states predicted by DOP853, so the learned model can be used with any integrator after training.

**Limited to Hamiltonian systems.** The methodology is not limited to Hamiltonian systems; however, such systems are the current focus of the paper, building on the recent trend of modelling such systems with neural networks. We will make this point clearer in the introduction.

[1]: Chen, Z. et al. Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334, 2019.

***

## The most difficult issues to address:

1. The connection between Theorem 4.2 and the experimental results. [Read through the section and check figures.]
2. Explain $\approx$; use a reference to justify this method.
3. Explaining/understanding the difference in the error between the double pendulum and Hénon-Heiles. [Consider studying different levels of noise.]
4. Improving the narrative, in particular highlighting our contributions, in addition to making the two parts of the paper fit together more naturally.
5. Explaining certain choices in the set-up of the numerical experiments (number of initial values, number of runs, number of nodes in the neural networks, level of noise $\sigma$).
6. Explain the lack of comparison to other methods. [Consider SRNN; implement Störmer--Verlet and take multiple steps (see the sketch below). No ISO. We consider a more fundamental problem.]
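Regarding item 6, a minimal sketch (ours) of one Störmer--Verlet step for a separable Hamiltonian $H(q,p) = T(p) + V(q)$, assuming callables `grad_T` and `grad_V` (which could, for instance, be obtained from a separable Hamiltonian neural network via automatic differentiation); taking several such steps before computing the loss gives the recurrent training strategy of [1]:

```python
def stormer_verlet_step(grad_T, grad_V, q, p, h):
    """One Störmer-Verlet step for a separable Hamiltonian H(q, p) = T(p) + V(q)."""
    p_half = p - 0.5 * h * grad_V(q)       # half kick
    q_next = q + h * grad_T(p_half)        # drift
    p_next = p_half - 0.5 * h * grad_V(q_next)  # half kick
    return q_next, p_next
```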