# PINNsNTK: a reproduction blogpost
Authors: L. Guo, J. Schaap, M. Zhang
Reproduction code: https://github.com/llguo95/PINNsNTK
## Introduction
This blogpost is a product of the project component of the Deep Learning course CS4240, taught in Q3 of the 2022-23 academic year at TU Delft.
The focus of the project is to reproduce and augment a number of results presented in the 2022 article: [*When and why PINNs fail to train: A neural tangent kernel perspective*](https://www.sciencedirect.com/science/article/abs/pii/S002199912100663X) by Wang, Yu and Perdikaris.
In the highlighted paper, the authors analyze the training procedure of a physics-informed neural network (PINN). In short, a PINN is a fully connected neural network combined with automatic differentiation to solve partial differential equations (PDEs). The authors find and prove mathematically the connection between the optimization of a PINN and neural tangent kernels (NTK), and provide an open-source TensorFlow 1 implementation [(PINNsNTK)](https://github.com/PredictiveIntelligenceLab/PINNsNTK) of their work.
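To make this concrete, the following is a minimal sketch of how a PINN loss can be assembled with automatic differentiation for a generic 1D Poisson problem $u_{xx}(x) = f(x)$ with Dirichlet boundary data. It is written in TensorFlow 2 style purely for illustration; the authors' repository (and our fork) uses TensorFlow 1, and the function names here are our own.

```python
import tensorflow as tf

def make_pinn(width=512, activation=tf.tanh):
    # Fully connected network with a single hidden layer, mirroring the
    # single-hidden-layer setting studied in the paper.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(width, activation=activation),
        tf.keras.layers.Dense(1),
    ])

def pinn_loss(model, x_r, f_r, x_b, u_b):
    """Two-term PINN loss for u_xx(x) = f(x) with Dirichlet boundary data.

    x_r, f_r: collocation points and the corresponding right-hand side f(x).
    x_b, u_b: boundary points and the exact boundary values.
    """
    # Residual term: nested gradient tapes give the second derivative u_xx.
    with tf.GradientTape() as outer:
        outer.watch(x_r)
        with tf.GradientTape() as inner:
            inner.watch(x_r)
            u = model(x_r)
        u_x = inner.gradient(u, x_r)
    u_xx = outer.gradient(u_x, x_r)
    loss_r = tf.reduce_mean(tf.square(u_xx - f_r))

    # Boundary term: mean squared error on the boundary points.
    loss_u = tf.reduce_mean(tf.square(model(x_b) - u_b))
    return loss_u + loss_r
```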
The key takeaways from the PINNsNTK manuscript can be summarized by the following points:
- The NTK characterizes the gradient descent training dynamics of a PINN, under the following conditions:
  - The learning rate of the gradient descent optimizer of the PINN is sufficiently small;
  - The hidden layer is sufficiently wide.
- This characterization exposes the spectral bias of PINNs through the eigenvalues of the NTK: *"components of the target function that correspond to kernel eigenvectors with larger eigenvalues will be learned faster"* (a sketch of how these eigenvalues can be computed follows this list).
- The 1D Poisson equation and 2D wave equation were solved with the PINN architecture to showcase the developed theory, as well as a novel algorithm to adaptively weigh the residual and boundary condition loss components of the PINN.
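The NTK itself can be estimated numerically as $K = J J^{\top}$, where $J$ is the Jacobian of the network outputs with respect to all trainable parameters; its eigenvalues then come from a standard symmetric eigensolver. The sketch below is again a TensorFlow 2-style illustration of the idea rather than the authors' TensorFlow 1 implementation; for the residual block $K_{rr}$, the same construction applies with the PDE residual in place of the raw network output.

```python
import tensorflow as tf

def empirical_ntk(model, x):
    """Empirical NTK K = J J^T at the inputs x, with J the Jacobian of the
    model outputs with respect to all trainable parameters."""
    with tf.GradientTape() as tape:
        y = model(x)                                   # shape (N, 1)
    jac = tape.jacobian(y, model.trainable_variables)  # one block per variable
    n = x.shape[0]
    # Flatten each block to (N, n_params_i) and concatenate along parameters.
    j = tf.concat([tf.reshape(block, (n, -1)) for block in jac], axis=1)
    return j @ tf.transpose(j)                         # shape (N, N)

# Eigenvalues in descending order, as plotted throughout this post:
# K = empirical_ntk(model, x_batch)
# eigvals = tf.sort(tf.linalg.eigvalsh(K), direction='DESCENDING')
```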
In light of the results achieved by Wang et al., there were a number of possible extensions that we intended to investigate in this blogpost. A few of these studies are based on the remarks in Section 8 (Discussion) of the PINNsNTK paper:
- More than one hidden layer. The authors only considered the case of a single hidden layer.
- Non-linear PDEs. The Poisson and wave equations studied by the authors are linear PDEs.
- Optimizers with an adaptive learning rate. All trained PINN results generated by the authors utilized the vanilla gradient descent algorithm with a constant learning rate.
In addition to these explorations, we also provide some results regarding the following points which were not mentioned by the authors:
- Different activation functions. The authors only considered the hyperbolic tangent activation function in the PINN.
- Loss regularization. The authors did not use loss regularization to keep network weights in check.
- Optimizers with a momentum term. This is related to the point on adaptive learning rates: vanilla gradient descent uses neither a momentum term nor an adaptive learning rate.
- Learning rate decay. Decay schemes that steadily decrease the learning rate were not used by Wang et al., yet have the potential to be useful in combination with the gradient flow formulation of PINN training.
These investigations are summarized in the hyperparameter studies overview below.
## Hyperparameter studies overview
In this blog, we present and discuss the results of the following hyperparameter studies.
| Activation function | Number of Layers | Loss optimizer | Learning rate decay scheme | Loss regularization | PDE |
| ----------------------------------- | ---------------- | - | ------------------------------------------------------ | ------------------- | ---------------------------------------- |
| tanh, sigmoid, softmax, ReLU, ReLU6 | 1 through 5 | Gradient descent (GD), GD with momentum, AdaGrad, Adam | No decay, exponential learning rate decay | No regularization, $L^2$ regularization | 1D PDE, 2D linear PDE, 2D non-linear PDE |
## Results
The results shown in this section are generated using a forked version of the original PINNsNTK repository, which can be found [here](https://github.com/llguo95/PINNsNTK).
### Alternative activation functions
Wang et al. mention that their technique assumes the activation function is smooth. Therefore, it is valuable to see what happens to the PINN if the activation function is not smooth, say with a Rectified Linear Unit (ReLU).
To test this, we trained the PINN using multiple activation functions: the original activation function (tanh), two other smooth activation functions (sigmoid and softmax), and two non-smooth activation functions (ReLU and ReLU6). We tested these configurations on the problem of solving the one-dimensional Poisson equation. The initial eigenvalues corresponding to the different activation functions are shown in Figure J1.
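For reference, the five configurations correspond to the following standard TensorFlow activation callables; how they are wired into the network in our fork differs in the details, so treat this as a hedged sketch rather than the exact code.

```python
import tensorflow as tf

# The five activation functions compared in Figures J1 and J2.
activations = {
    'tanh': tf.math.tanh,        # original choice, smooth
    'sigmoid': tf.math.sigmoid,  # smooth
    'softmax': tf.nn.softmax,    # smooth
    'relu': tf.nn.relu,          # non-smooth at 0
    'relu6': tf.nn.relu6,        # non-smooth at 0 and at 6
}

# Illustrative usage with the make_pinn sketch from the introduction:
# models = {name: make_pinn(activation=act) for name, act in activations.items()}
```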
||
|:------------------------------------:|
| *Figure J1: The eigenvalues of $K$, $K_{uu}$ and $K_{rr}$ at initialization in descending order for different activation functions.*|
Figure J1 shows that the eigenvalues for $K$ and $K_{rr}$ are substantially smaller for the non-smooth activation functions (the ReLU and ReLU6), suggesting that the non-smooth activation functions will fail to train, as seen in Figure J2.
||
|:------------------------------------:|
|*Figure J2: The resulting estimations (left) and the point-wise errors (right) for each tested activation function.*|
However, Figure J2 also shows that the new smooth activation functions fail to obtain as good a result as the tanh function. This is surprising since these activation functions are smooth, and the initially calculated eigenvalues seemed similar to those of the original tanh function.
To conclude, of all the activation functions tested, the original tanh activation function substantially outperforms the others, with the non-smooth activation functions resulting in the worst performance. Further research could explore whether these results generalize to more PDEs and why the other smooth activation functions failed to train correctly.
### Number of layers
The article mentions that the theory only applies to networks with a single hidden layer. However, Wang et al. did try out their PINN with a three-layer network. When analyzing this PINN, they focus on the relative change in the network parameters. While this gives some insight into how the network trains, it is also valuable to see how the error changes during training.
To evaluate the training process of the PINN from the paper further, we experimented with varying the number of layers while letting the size of each layer remain the same, and we measured the loss.
|<p float="left"> <img src="https://i.imgur.com/DN9odMl.png" width="32%" /> <img src="https://i.imgur.com/y5R5kFl.png" width="32%" /> <img src="https://i.imgur.com/trqAfKU.png" width="32%" /></p>|
|:------------------------------------:|
|*Figure J3: The loss during training for a network with 1 (left), 2 (middle) and 3 (right) layers.*|
|<p float="left"> <img src="https://i.imgur.com/lVuAHB1.png" width="32%" /> <img src="https://i.imgur.com/BYz7NKM.png" width="32%" /></p>|
|:------------------------------------:|
|*Figure J4: The loss during training for a network with 4 (left) and 5 (right) layers.*|
Figure J3 indicates that a network with a few layers converges smoothly. However, when we add more layers, the loss fluctuates substantially over consecutive iterations, as seen in Figure J4. We conjecture that the optimizer causes these rapid fluctuations by overshooting the optimal values of the parameters. We also note that the random initialization and the randomly chosen training points likely have a heavy influence on this effect, since two consecutive experiments often led to substantially different plots with fewer or more fluctuations. Nevertheless, networks with more layers generally exhibited more fluctuations than shallower ones.
The following two paragraphs discuss two potential fixes for these fluctuations and show some results when experimenting with these techniques. We will start by experimenting with different batch sizes; hereafter, we will discuss the effect of an exponentially decaying learning rate.
#### Increasing the batch size
The first method we tried to reduce this problem was to increase the batch size of each iteration. This would result in the gradient of each step being averaged over a greater set of training samples, which should smooth the loss, as mentioned by [*Smith et al.*](https://arxiv.org/abs/1711.00489)
We therefore experimented with four different batch sizes (including the original size of 128) on the original PINN with three layers. Figure J5 shows the corresponding losses of the respective networks during training. Figure J5 displays no clear connection between increasing the batch size and the loss fluctuations during training, since even for a five-layer network the fluctuations occurred quite sporadically. We suspect this is caused by the inherent randomness in choosing the training points and by the random initialization, as mentioned in the previous paragraph. Further research is necessary to determine the exact effect of the batch size on the training loss.
|<p float="left"> <img src="https://i.imgur.com/vHSXDA1.png" width=24%> <img src="https://i.imgur.com/ClmjXqO.png" width=24%> <img src="https://i.imgur.com/ubeUrHS.png" width=24%> <img src="https://i.imgur.com/QYF76aq.png" width=24%></p>|
|:------------------------------------:|
|*Figure J5: The loss during training of a three-layer PINN, with a batch size of 64, 128, 256 and 512 (left to right).* |
#### Exponential decay of the learning rate
The final adjustment we implemented to reduce the fluctuations of the training loss was an exponential decay of the learning rate over time. We suspected this might help since Figure J4 shows that most of the fluctuations happen later during training and tend to grow in size over the iterations.
To determine the effect that adding the exponential decay would have on the training loss, we trained a five-layer network again with the exponential decaying learning rate. Figure J6 shows the results of the experiment.
||
|:--:|
|*Figure J6: The training loss of the five-layer PINN with an exponentially decaying learning rate.*|
Figure J6 still shows substantial fluctuations in the loss; however, the spikes seem less sharp and less pronounced than those in Figure J4. Thus, we speculate that introducing an exponentially decaying learning rate reduces these fluctuations, which should have a positive impact on the predictions produced by the network.
### Optimizers: adaptive learning rate & momentum
Based on numerical stability analysis, the article mentions a maximum learning rate of $2/\lambda_{\text{max}}$, where $\lambda_{\text{max}}$ is the largest eigenvalue of the NTK. In the 1D Poisson problem, the largest recorded NTK eigenvalue equals approximately 1.2 million (1204266.8), which means that $2/\lambda_{\text{max}} \approx 1.66\cdot10^{-6}$ is the theoretical limit for the learning rate for gradient descent.
Since the article claims that the NTK for this particular problem is independent of the training procedure, we can take this as a general learning rate threshold.
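As a quick sanity check, the bound can be computed directly from the recorded eigenvalue; a minimal NumPy sketch using the value quoted above:

```python
import numpy as np

# Largest NTK eigenvalue recorded for the 1D Poisson problem
# (for a computed kernel K, this would be np.linalg.eigvalsh(K).max()).
lambda_max = 1204266.8

# Stability bound on the gradient descent learning rate from the article.
lr_max = 2.0 / lambda_max
print(f"maximum stable learning rate ~ {lr_max:.2e}")  # ~ 1.66e-06
```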
The article uses a vanilla (full-batch) gradient descent optimizer with a constant learning rate and without any learning rate decay scheme. This gives rise to two questions (a sketch of the optimizer variants we swap in follows this list):
- How does adding a momentum term to the optimizing algorithm affect the predictive capabilities and NTK eigenvalues?
- The same question, but with a decay scheme for the learning rate.
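The optimizer variants we benchmark below only differ in the optimizer object plugged into the training loop. A TensorFlow 2-style sketch of the four variants is shown here; our fork is written in TensorFlow 1, which exposes the equivalents under `tf.train`, and both the learning rate and the momentum coefficient below are illustrative placeholders rather than the exact run settings.

```python
import tensorflow as tf

lr = 1.66e-6  # illustrative, near the 2 / lambda_max bound discussed above

# The four optimizer variants benchmarked in Figure L1; momentum = 0.9 is an
# illustrative choice, not necessarily the setting used in our runs.
optimizers = {
    'GD': tf.keras.optimizers.SGD(learning_rate=lr),
    'GD + momentum': tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9),
    'AdaGrad': tf.keras.optimizers.Adagrad(learning_rate=lr),
    'Adam': tf.keras.optimizers.Adam(learning_rate=lr),
}
```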
#### Without learning rate decay scheme
We first take a look at an array of different weight optimizers without learning rate decay.
|  |
|:--------------------------------------------------------------------------------:|
| *Figure L1: Benchmark of optimizers without additional learning rate decay. (Left) predictive result for the solution for the 1D Poisson problem. (Right) point-wise error of the prediction compared to the exact PDE solution.* |
From Figure L1, it can be seen that AdaGrad does not optimize the weights in the PINN architecture well. This could be because the adaptive learning rate becomes too large for the loss landscape.
It can also be seen that the Adam optimizer performs worse than the gradient descent and momentum optimizers, yet is able to optimize the weights better than AdaGrad. We conjecture that adding a momentum term acts as a stabilizer for optimizer schemes with an adaptive learning rate.
|  |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| *Figure L2: (Left) the eigenvalues of the NTK $K$ at the final optimization step. (Middle) the corresponding eigenvalues of the block component related to the boundary condition $K_{uu}$ and (Right) residue $K_{rr}$.* |
As can be seen from Figure L2, the eigenvalues of the NTK are not greatly affected by the choice of weight optimizer. But what about the values of the kernel itself?
|  |
|:-------------------------------------------------------------------------------------------------------------------:|
| *Figure L3: The relative change in magnitude of the NTK values compared to the kernel before the training procedure.* |
From Figure L3, the inability of AdaGrad to optimize the network parameters is reflected in the (lack of) evolution of the NTK: its values do not change over the course of the training procedure.
Moreover, whereas the GD and momentum optimizers yield a decreasing rate of change in the NTK values, the Adam optimizer changes the NTK more and more as the number of training iterations increases.
#### With learning rate decay scheme
The same optimizers are subsequently compared with an exponential decay scheme for the learning rate. The decay rate for the scheme is selected to be 0.9, applied every 1000 training iterations.
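A sketch of this schedule using the TensorFlow 2 `ExponentialDecay` helper is shown below; the initial learning rate is an illustrative placeholder, and in TensorFlow 1, which our fork is based on, the analogous helper is `tf.train.exponential_decay`.

```python
import tensorflow as tf

# Multiply the learning rate by 0.9 every 1000 iterations (staircase decay).
# The initial learning rate here is illustrative, not the exact run setting.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-5,
    decay_steps=1000,
    decay_rate=0.9,
    staircase=True,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```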
|  |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| *Figure L4: Benchmark of optimizers with exponential decay of the learning parameter. (Left) predictive result for the solution for the 1D Poisson problem. (Right) point-wise error of the prediction compared to the exact PDE solution.* |
If we compare Figures L4 and L1, we can see that Adam shows significantly improved performance, which is also reflected in its relative $L^2$-error: 0.73 without decay versus 0.28 with learning rate decay. Combined with the similar NTK eigenvalues shown in Figure L5 (compare with Figure L2), we speculate that the additional decrease in the learning rate yields faster training, as explained in Section 5 of the article.
|  |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| *Figure L5: (Left) the eigenvalues of the NTK $K$ at the final optimization step. (Middle) the corresponding eigenvalues of the block component related to the boundary condition $K_{uu}$ and (Right) residue $K_{rr}$.* |
Interestingly, the behavior of the NTK values across the training procedure does not seem to be affected compared to the case with no learning rate decay: see Figure L6.
|  |
|:--------------------------------------:|
| *Figure L6: The relative change in magnitude of the NTK values compared to the kernel before the training procedure.* |
### Loss regularization
<!-- What happens to the predictive quality (`r2_score(u_star, u_pred)`) if we use implement (L2) regularization? -->
A loss regularizer was not used in the original implementation. To see what implications this has, we add an $L^2$ regularization term and compare the result against the unregularized PINN from the PINNsNTK paper. In particular, we consider the solution of the 1D Poisson equation and adhere to the original training scheme: gradient descent with a constant learning rate.
Loss regularization is expected to keep the network parameters more stable at a fixed hidden layer width.
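In practice, this amounts to adding a weighted sum of squared weights to the PINN loss; a minimal sketch, with the regularization coefficient as an illustrative placeholder rather than the value used in our runs:

```python
import tensorflow as tf

def l2_regularized_loss(model, base_loss, reg_coeff=1e-4):
    """Add an L2 penalty on all trainable weights to the PINN loss.

    reg_coeff is an illustrative placeholder, not the coefficient of our runs.
    """
    penalty = tf.add_n([tf.nn.l2_loss(w) for w in model.trainable_variables])
    return base_loss + reg_coeff * penalty
```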
|  |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| *Figure L7: Comparison between the presence and absence of a $L^2$ regularization term during the training of the PINN. (Left) predictive result for the solution for the 1D Poisson problem. (Right) point-wise error of the prediction compared to the exact PDE solution.* |
|  |
|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| *Figure L8: (Left) the eigenvalues of the NTK $K$ at the final optimization step. (Middle) the corresponding eigenvalues of the block component related to the boundary condition $K_{uu}$ and (Right) residue $K_{rr}$.* |
As we can see from Figures L7 and L8, adding an $L^2$ regularization term affects neither the predictive quality of the PINN nor the eigenvalues of the NTK.
However, the NTK itself *does* evolve in a different way when regularization is present, as demonstrated in Figure L9.
|  |
|:------------------------------------------------------------------------------------------------------------------- |
| *Figure L9: The relative change in magnitude of the NTK values compared to the kernel before the training procedure.* |
In Figure L9, we see that the kernel initially evolves away from its initial values, similarly to the NTK without regularization. However, after 10,000 iterations this behavior reverts, and the kernel becomes more similar to the initial kernel again as the iterations progress.
### PDE generalizability
The experiments conducted in the PINNsNTK paper only cover the 1D Poisson equation and the 2D wave equation with a single hidden layer. Wang et al. did not conduct experiments on other types of PDEs, and a single experiment cannot validate these choices on more complex PDEs: for example, to solve a non-linear or higher-dimensional PDE with a PINN, a fully connected network with multiple hidden layers may be required.
Given this situation, we would like to do the following:
1. Check whether the choices made in the PINNsNTK manuscript generalize to solving other 1D PDEs.
2. Check the NTK on more complex PINNs by solving 2D PDE (both linear and non-linear) using a multi-layer PINN.
We log the eigenvalues of the NTK and the decrease of the loss components when training the PINN: $L_u$, the MSE on the solution values at the initial and boundary conditions, and $L_r$, the MSE of the PDE residual (which involves the first and second derivatives of the network output).
The purpose of this study is to investigate whether the eigenvalues of the corresponding NTK blocks also show a gap for other types of PDEs, leading to different convergence rates of the two loss components.
We consider the following PDEs:
| PDE | Exact solution | Properties |
| -------------------- | -------------------------------- |:-------------- |
| $u_{xx} - 6ax = 0$ | $ax^3$, where $a$ is a constant | 1D, linear |
| $u_t - cu_{xx} = 0$ | $\cos(a\pi x)\,e^{-c(a\pi)^2 t}$ | 2D, linear |
| $u_t u_{tt} - u u_x = 0$ | $\sin(x+t)$ | 2D, non-linear |
Here, $u_x, u_t, u_{xx}, u_{tt}$ denote the first and second partial derivatives of the solution $u$ with respect to $x$ or $t$, $L_u$ and $L_r$ denote the boundary loss and residual loss of the PINN, and $K_u$ and $K_r$ denote the corresponding NTK blocks.
For the 2D problems, we sample three sets of data: collocation points with $x, t \in [0,1]$; boundary points with $x = 0$ or $x = 1$ and $t \in [0,1]$; and initial points with $t = 0$ and $x \in [0,1]$.
In the case of the 1D PDE, we only sample $x$ uniformly at random in $[0,1]$, together with repeated samples at $x = 0$ and $x = 1$ to represent the boundary conditions.
We calculate the boundary loss $L_u$ as the MSE between the predicted values and the exact solution, and we use automatic differentiation to compute the derivatives of the network output and calculate the residual loss $L_r$ as the MSE of the exact PDE residual.
We then log the decrease of $L_u$ and $L_r$, together with the eigenvalues of the NTK blocks $K_u$ and $K_r$.
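As an illustration, for the first PDE in the table ($u_{xx} - 6ax = 0$ with exact solution $u(x) = ax^3$) the two loss terms can be assembled as in the sketch below (TensorFlow 2 style, with the constant $a$ and the sampled tensors as placeholders).

```python
import tensorflow as tf

a = 1.0  # illustrative constant in the exact solution u(x) = a * x**3

def losses_1d(model, x_r, x_b):
    """Boundary loss L_u and residual loss L_r for u_xx - 6 a x = 0."""
    # L_u: MSE between the network output and the exact solution on the boundary.
    loss_u = tf.reduce_mean(tf.square(model(x_b) - a * x_b ** 3))

    # L_r: MSE of the PDE residual u_xx - 6 a x on the collocation points,
    # with u_xx obtained from nested automatic differentiation.
    with tf.GradientTape() as outer:
        outer.watch(x_r)
        with tf.GradientTape() as inner:
            inner.watch(x_r)
            u = model(x_r)
        u_x = inner.gradient(u, x_r)
    u_xx = outer.gradient(u_x, x_r)
    loss_r = tf.reduce_mean(tf.square(u_xx - 6.0 * a * x_r))
    return loss_u, loss_r
```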
The PINN used to solve the 1D PDE has a single hidden layer of 512 neurons, whereas for the 2D PDEs we use a PINN with three hidden layers of 500 neurons each, in line with the authors' experiments. Due to memory limitations, however, we shrink the NTK kernel size for the 2D PDEs from 300 to 120, which reduces the batch size used to compute the NTK at each gradient descent step.
#### 1D PDE
##### Ground truth, prediction and pointwise error
||
|:-----------------------------------:|
| *Figure M1: predicted value (left) and error (right)* |
##### NTK eigenvalues
||
|:-----------------------------------:|
|*Figure M2: Eigenvalue decline of the NTK. Left: full kernel $K$, middle: boundary block $K_{uu}$, right: residual block $K_{rr}$.*|
#### 2D PDE
##### Result and error
||
|:-----------------------------------:|
|*Figure M3: Model output and ground truth*|
##### NTK eigenvalues
||
|:-----------------------------------:|
|*Figure M4: Eigenvalue decline of NTK. Left: $L_{uu}$, middle: $L_{ut}$, right: $L_{r}$*|
##### Loss
||
|:----------------------------------:|
|*Figure M5: Loss curves*|
#### Non-linear PDE
##### Loss
|  |
|:-----------------------------------------:|
|*Figure M6: Loss curves for the non-linear PDE*|
##### NTK eigenvalues
|  |
|:--------------------------------------------------------------------------------------:|
| *Figure M7: Eigenvalue decline of NTK. Left: $L_{uu}$, middle: $L_{ut}$, right: $L_{r}$* |
##### Ground truth, prediction and pointwise error
|  |
|:----------------------------------------:|
|*Figure M8: Model output and ground truth*|
Observing the NTK eigenvalues for the 1D and the linear 2D PDE, we find that the paper's conclusions generalize to other linear PDEs:
- The discrepancy between the eigenvalues of $K_{u}$ and $K_{r}$ causes the PINN to fail.
- The rates at which $L_r$ and $L_u$ decrease are unbalanced, in line with the eigenvalues of the corresponding NTK blocks.

However, this is not the case for the non-linear PDE. From Figures M1, M3 and M8, we see that the PINN performs much worse on the non-linear PDE than on its linear counterparts, while the eigenvalue gap between $K_{u}$ and $K_{r}$ is not very large.
Furthermore, the following speculations can be made from our results:
- From Figures M5 and M6, we find that $L_r$ drops quickly at first, but subsequently increases and converges towards $L_u$. This phenomenon may reflect the effect of the different optimization speeds on the overall loss.
- Since the solution $u$ itself appears in the residual of the non-linear equation, this might explain the smaller eigenvalue discrepancy between the $K_u$ and $K_r$ components of the NTK compared to the linear equations.
- However, we still observe that at the beginning of training the residual loss decreases much faster than $L_u$ before the two converge, which may reflect the spectral bias of the PINN mentioned by Wang et al.
## Conclusion
Here, we summarize the highlights of our reproduction results and propose a number of venues for possible research that could be realized in future work.
### Summary of findings
- The PINN suggested by the authors only functioned well when the activation function used within the network was smooth.
- Furthermore, the alternative smooth activation functions performed substantially worse than the original tanh function used in the article.
- The network still trains well with multiple additional layers; however, it often fails to converge smoothly when the layers become numerous.
- Increasing the batch size did not have a notable effect on smoothening the loss during training.
- Using an exponentially decaying learning rate seemed to have a profound effect on the loss curve during training, making the PINN converge substantially smoother.
- The results presented by Wang et al. can be generalized to linear PDEs and PINNs with more than one layer.
- When non-linear PDEs are solved by PINNs, the eigenvalues of the residual and boundary components of the NTK show a smaller discrepancy, and the performance is much poorer.
- We conjecture that the presence of a momentum term in the PINN's loss optimization scheme might lead to a stabilizing effect when learning rates are adaptively calculated.
- Similarly, the presence of an exponential learning rate decay scheme in the training procedure of the PINN causes the Adam optimizer to achieve better results.
- Adding an $L^2$ loss regularizer does not significantly alter the NTK eigenvalues. This is reflected in the similarity of predictive power between the regularized and unregularized PINNs.
### Suggested future work
- A more in-depth analysis of why the tanh activation function outperforms the other smooth activation functions by such a considerable margin.
- To better quantify the effect of increasing the batch size and using a decaying learning rate, future work could run the network multiple times and measure how much the loss fluctuates on average.
- The performance of PINNs with CNN / residual network / attention mechanism architecture, as also mentioned by Wang et al.
- Whether other machine learning techniques, such as SVMs, perform well at solving PDEs.
- Noisy training data. The PINNs trained in the PINNsNTK manuscript and this reproduction study used noiseless, analytic training data. In real-life applications, this can rarely be the case.