---
title: 11-5. Physics-informed neural networks
tags: Final reports
---
**Reference**
1. Physics-informed neural networks, M. Raissi, P. Perdikaris, and G. E. Karniadakis, Journal of Computational Physics **378** (2019) 686–707, [10.1016/j.jcp.2018.10.045](https://doi.org/10.1016/j.jcp.2018.10.045)
2. Automatic Differentiation in Machine Learning: a Survey, A. Baydin, B. Pearlmutter, A. Radul, and J. Siskind, Journal of Machine Learning Research **18** (2018) 1–43.
---
# (張尚文,Part I)
### Neural network
A neural network is a function of form
$$
f_{W_1, b_1,...,W_n,b_n}=\sigma_n\circ A_n\circ \sigma_{n-1}\circ A_{n-1}\circ...\circ \sigma_1\circ A_1
$$
where $\sigma_i$ are nonlinear "activation" functions and $A_i$ are affine functions $A_i(x)=W_ix+b_i$. Neural networks aim to approximate other functions and serve as their surrogates, much like interpolants in spectral methods, but using function composition instead of linear combination.
Classically in machine learning, we are given training data $\{x_j,y_j\}_{j=1}^N$, assume that there exists a correct function $f:\begin{array}{l}\mathcal{X}\to\mathcal{Y}\\ x_j\mapsto y_j\end{array}$, and approximate it using $\tilde{f}_{W_1,b_1,...,W_n,b_n}:=\bigcirc_{i=1}^n (\sigma_i\circ A_i)$.
To compute the approximation, we consider a loss functional of the form
$$
\text{Loss}: \begin{array}{l} C(\mathcal{X},\mathbb{R})\to \mathbb{R} \\ \tilde{f}\qquad\quad\mapsto \sum_{j=1}^N |||\tilde{f}(x_j)-y_j|||\end{array}
$$
where $|||\cdot |||$ is some discrepancy measure. This turns finding the optimal $\tilde{f}_{W_1,b_1,...,W_n,b_n}$ into a finite-dimensional optimization problem over the parameters $W_i, b_i$.
The Universal Approximation Theorem states that the set of such "neurally-defined" functions is dense in the space of continuous functions on compact subsets of $\mathbb{R}^d$, yet the rate of convergence (with respect to the number of parameters, etc.) is an active area of research.
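As a concrete illustration, here is a minimal numpy sketch of such a composition (the layer widths, the tanh activation, and the random parameters are arbitrary choices, not tied to any example in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def network(x, params, sigma=np.tanh):
    """Evaluate sigma_n ∘ A_n ∘ ... ∘ sigma_1 ∘ A_1 at x, with A_i(x) = W_i x + b_i."""
    for W, b in params:
        x = sigma(W @ x + b)
    return x

# a 1 -> 20 -> 1 "neurally-defined" function with random parameters
params = [(rng.standard_normal((20, 1)), rng.standard_normal(20)),
          (rng.standard_normal((1, 20)), rng.standard_normal(1))]
print(network(np.array([0.5]), params))
```

Training then means tuning the entries of the $W_i, b_i$ so that the loss functional above is minimized.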
### Automatic differentiation
Currently, there are four ways for a computer to understand the concept of "differentiation"
(1) Symbolic differentiation - given a function explicitly defined by symbols, output another symbol using a priori formulae; only works for functions defined by symbols.
(2) Finite difference - given evaluations of a function on a finite grid, output the difference quotient as approximation to differentiation; works for arbitrary functions.
(3) Spectral method - given finitely many evaluations of a function, use the evaluations to determine an interpolant and output the derivative of that interpolant; works for arbitrary functions.
(4) Automatic differentiation - given a function defined by compositions of elementary functions (such as `+,-,×,÷, sin, cos, exp, ln`), evaluate the derivative using the chain rule and a priori formulae. However, instead of working with long symbolic expressions, automatic differentiation organizes the composition into a computational graph and only applies the chain rule to the variables that are needed.
Example:
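For instance, here is a small illustrative sketch using PyTorch's reverse-mode automatic differentiation (the function $f(x)=x\,e^{\sin x}$ is just an example):

```python
import torch

# f is built from elementary operations, so autodiff can apply the chain
# rule node by node on the computational graph.
def f(x):
    return torch.exp(torch.sin(x)) * x

x = torch.tensor(1.3, requires_grad=True)
y = f(x)
y.backward()                      # reverse-mode sweep through the graph

# compare with the hand-derived formula f'(x) = e^{sin x}(1 + x cos x)
exact = torch.exp(torch.sin(x)) * (1 + x * torch.cos(x))
print(x.grad.item(), exact.item())
```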


Nevertheless, the border between symbolic and automatic differentiation is blurry.
The key point is that, unlike finite difference and spectral methods, symbolic and automatic differentiation do not work for general, unknown functions.
But with the help of neural network approximation, we can extend these closed-form-based differentiation methods to arbitrary functions: first approximate the function by a neural network, then apply automatic differentiation to the tuned network, which serves as a surrogate.
The accuracy of Automatic Differentiation on a Neural Surrogate (let's coin a new term - ADNS!) therefore hinges on how well the neural network approximates the function, and is thus subject to the fineness of the evaluations, just like spectral methods and finite difference.
### Approximation theory of ADNS
The quality of the derivative hinges on how well the neural surrogate approximates the original function, and the corresponding approximation theory is still developing, as in "Deep Neural Network Approximation Theory" by P. Grohs, D. Perekrestenko, D. Elbrächter, H. Bölcskei (https://arxiv.org/abs/1901.02220).
I wrote some code to compare finite difference, Chebyshev, and ADNS:
Given 10, 20, or 100 points, finite difference and Chebyshev use these points in the traditional way, while we train a neural network whose only goal is to fit the sample-generating function; a minimal sketch of the idea follows. Below the sketch are the results of the "neural interpolation" and the derivative calculated with automatic differentiation.
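The sketch below (PyTorch) illustrates the ADNS part of the experiment; the test function $\sin$, the network size, and the training loop are arbitrary choices rather than the original code:

```python
import math
import torch

# N samples of an (assumed unknown) function, here sin on [-pi, pi]
N = 20
x = torch.linspace(-math.pi, math.pi, N).reshape(-1, 1)
y = torch.sin(x)

# neural surrogate: two hidden tanh layers
net = torch.nn.Sequential(
    torch.nn.Linear(1, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(5000):                     # "neural interpolation" of the samples
    opt.zero_grad()
    loss = torch.mean((net(x) - y) ** 2)
    loss.backward()
    opt.step()

# ADNS: the surrogate can be differentiated anywhere via autodiff
xs = torch.linspace(-math.pi, math.pi, 200).reshape(-1, 1).requires_grad_(True)
us = net(xs)
dus = torch.autograd.grad(us, xs, torch.ones_like(us))[0]
print("max |ADNS derivative - cos|:", (dus - torch.cos(xs)).abs().max().item())
```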
N = 10

N = 20

N = 100

Next, we compare Chebyshev, finite difference and ADNS.
Notice that ADNS is a full-fledged function that can be evaluated at any given point, while Chebyshev and finite difference only have information at the fixed grid points.

Thoughts - since the class of "neural functions" is so vast, serious aliasing might occur. Notice that the "neural interpolation" fits almost perfectly, but since there are many interpolants that fit those discrete points perfectly, there is no obvious way to decide which one is the "best" interpolation.
However, in PINNs we add constraints coming from the differential equation, so the aliasing problem may be alleviated.
### Physics-Informed Neural Networks (PINNs)
The basic idea of the paper is, given finite observations of the phase-space variables $\{x_i, t_i\}_{i=1}^{N_f}\subset \text{int}\,\Omega\times (0,T)$ and $\{x_i, t_i, u_i\}_{i=1}^{N_b}\subset \partial(\Omega\times [0,T])$, to approximate $u$ by
$$
\tilde{u}:=\bigcirc_{i=1}^n (\sigma_i\circ A_i)
$$
whose values at $\{x_i, t_i\}_{i=1}^{N_b}$ are close to $\{u_i\}_{i=1}^{N_b}$ and whose derivatives at $\{x_i, t_i\}_{i=1}^{N_f}$ satisfy the PDE.
This is made possible since we can calculate $\tilde{u}_x(x_0)$ efficiently and accurately using automatic differentiation.
For example,
> Given (1) a PDE $u_t+\mathcal{N}[u]=0, x\in\Omega, t\in[0,T]$, where $\mathcal{N}$ is a non-linear differential operator w.r.t. $x$, and (2) discrete evaluations of the initial and boundary conditions and collocation points $\{x_i', t_i'\}_{i=1}^N$, use a neural network
> $$
> \tilde{u}:=\bigcirc_{i=1}^n (\sigma_i\circ A_i)
> $$ s.t.
>
>$\tilde{u}\approx u$ on the given initial/boundary data (e.g. $\tilde{u}(0,x)\approx u(0,x)$) and
>$\tilde{u}_t+\mathcal{N}[\tilde{u}]\approx 0$ on the collocation points
>
where "$\approx$", in the neural-network sense, means designing a loss functional that penalizes the deviation.
This paper shows that we can consider loss functionals that combine two aspects of a boundary value problem:
(1) satisfying boundary conditions on $\{x_i, t_i, u_i\}_{i=1}^{N_b}\subset \partial(\Omega\times [0,T])$:
$$
MSE_b(\tilde{u}) = \sum_{i=1}^{N_b} \big(\tilde{u}(x_i,t_i)-u_i\big)^2
$$
(2) satisfying the differential equation $u_t+\mathcal{N}[u]=0$ on $\{x_i', t_i'\}_{i=1}^N$
$$
MSE_u(\tilde{u}) = \sum_{i=1}^{N} \big(\tilde{u}_t(x_i',t_i')+\mathcal{N}[\tilde{u}](x_i',t_i')\big)^2
$$
The second part $MSE_u$ can be thought of as a "regularization term" that softly restricts the search to those $u$ satisfying the equation.
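To make the two terms concrete, here is a minimal PyTorch-style sketch (the operator choice $\mathcal{N}[u]=-\nu u_{xx}$ and all sizes are illustrative assumptions, not the paper's specific example; the collocation inputs must be created with `requires_grad=True`):

```python
import torch

# surrogate u~(t, x): a small fully connected tanh network (sizes are arbitrary)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 1),
)

def pde_residual(t, x, nu=0.01):
    """u_t + N[u] with the illustrative choice N[u] = -nu * u_xx."""
    u = net(torch.cat([t, x], dim=1))
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - nu * u_xx

def loss(t_b, x_b, u_b, t_f, x_f):
    """MSE_b on the boundary/initial data plus MSE_u on the collocation points."""
    mse_b = torch.mean((net(torch.cat([t_b, x_b], dim=1)) - u_b) ** 2)
    mse_u = torch.mean(pde_residual(t_f, x_f) ** 2)
    return mse_b + mse_u
```

Here means are used instead of the sums written above, which only rescales the two terms.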
---
# (簡銘瑜)
In general form, a nonlinear partial differential equation can be expressed as
$$
u_t+\mathscr{N}[u]=0 , x \in \Omega, t \in [0,T]
$$
where $u(t,x)$ denotes the latent (hidden) solution, $\mathscr{N}[u]$ is a nonlinear differential operator, and $\Omega$ is a subset of $\mathbb{R}^d$.
### Continuous time models
Problem: the nonlinear Schrödinger equation along with periodic boundary conditions\
$ih_t+0.5h_{xx}+|h|^2h=0,\ x\in [-5,5],\ t\in[0,\pi/2]$
Initial condition:\
$h(0,x)=2\operatorname{sech}(x)$
Boundary conditions:\
$h(t,-5)=h(t,5)$
$h_x(t,-5)=h_x(t,5)$
where $h(t,x)$ is the complex-valued solution and $h=u+iv$,
which implies
$i(u_t+iv_t)+0.5(u_{xx}+iv_{xx})+|u+iv|^2(u+iv)=0$.
The key point is that the equation can be split into real and imaginary parts, free of complex numbers, as follows:
$u_t+0.5v_{xx}+(u^2+v^2)v=0$ and $-v_t+0.5u_{xx}+(u^2+v^2)u=0$ $...(1)$
with initial conditions:
$u(0,x)=2\operatorname{sech}(x)$ $...(2)$
$v(0,x)=0$ $...(3)$
and the periodic boundary conditions $...(4)$
Using $(1)+(2)+(3)+(4)$, we can apply chebfun.



In addition, if your RAM cannot handle this problem, try exporting to an m-file and changing 1e-5 to 1e-4, as follows:

### Define the cost function
Let us define $f(t,x)$ to be given by\
$f:= ih_t+0.5h_{xx}+|h|^2h$
Because we want to use a neural network to solve this problem, we need to define a cost function.\
$MSE=MSE_0+MSE_b+MSE_f$\
where\
$MSE_0=\frac{1}{N_0}\sum^{N_0}_{i=1}|h(0,x^i_0)-h^i_0|^2$
$MSE_b=\frac{1}{N_b}\sum^{N_b}_{i=1}\left(|h^i(t^i_b,-5)-h^i(t^i_b,5)|^2+|h^i_x(t^i_b,-5)-h^i_x(t^i_b,5)|^2\right)$
$MSE_f=\frac{1}{N_f}\sum^{N_f}_{i=1}|f(t^i_f,x^i_f)|^2$
\
Here, $\{x^i_0,h^i_0\}^{N_0}_{i=1}$ denotes the initial data, $\{t^i_b\}^{N_b}_{i=1}$ denotes the boundary points, and $\{t^i_f,x^i_f\}^{N_f}_{i=1}$ denotes the interior collocation points.
In this example, we set $N_0=50$, $N_b=50$, $N_f=20000$.
We use a 5-layer deep neural network with 100 neurons per layer and hyperbolic tangent activation functions. The network has two outputs, predicting the real part and the imaginary part, and all loss terms are minimized together with L-BFGS.
Code: https://github.com/jhssyb/PINN-master-Rassi
$\tanh(x)=(e^x-e^{-x})/(e^x+e^{-x})$
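For illustration, here is a PyTorch-style sketch of the two-output network and the residuals of equation (1) (the linked code uses TensorFlow; the layer layout is illustrative, and `t`, `x` must be created with `requires_grad=True`):

```python
import torch

# two outputs: u = Re(h), v = Im(h); tanh activations, 100 neurons per hidden layer
net = torch.nn.Sequential(
    torch.nn.Linear(2, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 100), torch.nn.Tanh(),
    torch.nn.Linear(100, 2),
)

def nls_residual(t, x):
    """Imaginary and real parts of f = i h_t + 0.5 h_xx + |h|^2 h."""
    uv = net(torch.cat([t, x], dim=1))
    u, v = uv[:, :1], uv[:, 1:]
    ones = torch.ones_like(u)
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    v_t = torch.autograd.grad(v, t, ones, create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    v_x = torch.autograd.grad(v, x, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, ones, create_graph=True)[0]
    v_xx = torch.autograd.grad(v_x, x, ones, create_graph=True)[0]
    f_imag = u_t + 0.5 * v_xx + (u**2 + v**2) * v    # imaginary part of f
    f_real = -v_t + 0.5 * u_xx + (u**2 + v**2) * u   # real part of f
    return f_imag, f_real
```

$MSE_f$ is then the mean of `f_imag**2 + f_real**2` over the collocation points, while $MSE_0$ and $MSE_b$ only need forward evaluations of `net`.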

Result:
The resulting prediction is validated against the test data for this problem
and compared with the results Raissi published on GitHub.
In the paper, the reported error is $\mathrm{Error}\ h: 1.97\times 10^{-3}$,
with $|h(t,x)|=\sqrt{u(t,x)^2+v(t,x)^2}$.

---
# (林暐軒)
With physics-informed neural networks, we can compute data-driven solutions of partial differential equations. This means that, given fixed model parameters $\lambda$, we can find the unknown hidden state $u(t,x)$ of the system.
For example: Allen–Cahn equation along with periodic boundary conditions
$u_t-0.0001u_{xx}+5u^3-5u=0, x\in[-1,1],t\in[0,1]$
$u(0,x)=x^2\cos(\pi x)$
$u(t,-1)=u(t,1)$
$u_x(t,-1)=u_x(t,1)$
## Discrete time models
Let us recall the Runge-Kutta method
Consider $y'(t)=f(t)$; then
$y(t_0+h)=y(t_0)+\int_0^hf(t_0+s)ds\approx y(t_0)+h\sum_{l=1}^sb_lf(t_0+c_lh)$
For $y_t=f(t,y)$ we then have
$y(t_0+h)\approx y(t_0)+h\sum_{l=1}^sb_lf(t_0+c_lh,y(t_0+c_lh))$
An $s$-stage (possibly implicit) Runge-Kutta scheme reads
$k_1=f(t_n+c_1h,y_n+h\sum_{j=1}^sa_{1j}k_j)$
$k_2=f(t_n+c_2h,y_n+h\sum_{j=1}^sa_{2j}k_j)$
$\vdots$
$k_s=f(t_n+c_sh,y_n+h\sum_{j=1}^sa_{sj}k_j)$
$y_{n+1}=y_n+h\sum_{l=1}^sb_lk_l$
For detailed steps on how to solve the resulting nonlinear equations with implicit Runge-Kutta methods, see the reference below.
**Reference**
[Implicit Runge-Kutta methods](https://www.epfl.ch/labs/anchp/wp-content/uploads/2018/05/part2-1.pdf)
Let us apply the general form of Runge–Kutta methods with q stages to the equation above and obtain
$u^{n+c_i}=u^n-\Delta t \sum_{j=1}^q a_{ij}N[u^{n+c_j}]$
$u^{n+1}=u^n-\Delta t \sum_{l=1}^q b_{l}N[u^{n+c_l}]$
where
$u^{n+c_j}=u(t^n+c_j\Delta t,x)$, for $j=1,\dots,q$
$N[u^{n+c_j}]=-0.0001\,u^{n+c_j}_{xx}+5(u^{n+c_j})^3-5u^{n+c_j}$
Comparing with the implicit Runge-Kutta method above, $u^{n+c_i}$ here corresponds to $k_l$ and $u^{n+1}$ corresponds to $y_{n+1}$.
The implicit Runge-Kutta weights {$a,b,c$} are known coefficients here.
The above equations can be equivalently expressed as
$u^n=u_i^n$, $\ i=1,...,q$
$u^n=u_{q+1}^n$
where
$u_i^n=u^{n+c_i}+\Delta t \sum_{j=1}^q a_{ij}N[u^{n+c_j}]$
$u_{q+1}^{n}=u^{n+1}+\Delta t \sum_{j=1}^q b_jN[u^{n+c_j}]$
Here every $u^n$ denotes the value at time $t^n$; that is, $u_i^n$ and $u_{q+1}^{n}$ should all be the same value.
Given an input x, it is transformed by the weights and biases of each layer, fed into the activation functions, and propagated through to the final output.
$$
u_{W_1, b_1,...,W_n,b_n}=\sigma_n\circ A_n\circ \sigma_{n-1}\circ A_{n-1}\circ...\circ \sigma_1\circ A_1
$$
where $\sigma_i$ are nonlinear "activation" functions and $A_i$ are affine functions $A_i(x)=W_ix+b_i$.
When some position x is plugged in, the u represented above gives the values of u one time step after the initial time; that is, u predicts the value at position x after one time step.
Simply put, the goal is to find the weights and biases of every layer.
What is known is the value of u at each position at the initial time. The implicit Runge-Kutta method normally steps forward in time to predict the value at the next time level; rewritten for the PINN, it goes backwards: we first guess the values at the later times, obtaining $[u^{n+c_1},u^{n+c_2},..,u^{n+1}]$, and then substitute them into
$u^{n+c_i}+\Delta t \sum_{j=1}^q a_{ij}N[u^{n+c_j}]\approx u_i^n$
$u^{n+1}+\Delta t \sum_{j=1}^q b_jN[u^{n+c_j}] \approx u_{q+1}^{n}$
Here $u^n$ denotes the value at time $t^n$, so $u_i^n$ and $u_{q+1}^{n}$ should all be the same value, equal to $u^n$. In this setting $t^n$ is in fact the initial time and $u^n$ is the initial value.
By going back to check whether the predictions match the initial values, the weights and biases are corrected.
The physics-informed neural network takes x as an input and outputs $[u_1^n,u_2^n,..,u_{q+1}^n]$.
The output here is not the value at the next time level; rather, it is an approximation of the initial values.
The desired value at the next time level is then obtained by plugging x from the domain into the network with the weights and biases found by the PINN.
Here is the cost function of our example
$SSE=SSE_n+SSE_b$
Input data: ${\{x^{n,i},u^{n,i}=u(t^n,x^{n,i})\}}_{i=1}^{N_n}$
$SSE_n=\sum_{j=1}^{q+1}\sum_{i=1}^{N_n}|u_j^n(x^{n,i})-u^{n,i}|^2$
$SSE_n$ requires the initial-state approximations obtained from the PINN to match the true initial state.
$SSE_b=\sum_{i=1}^q|u^{n+c_i}(-1)-u^{n+c_i}(1)|^2+|u^{n+1}(-1)-u^{n+1}(1)|^2\\ +\sum_{i=1}^q|u_x^{n+c_i}(-1)-u_x^{n+c_i}(1)|^2+|u_x^{n+1}(-1)-u_x^{n+1}(1)|^2$
$SSE_b$ enforces the boundary conditions at every stage time.
In short, we want $SSE_n$ to fit the input data and $SSE_b$ to enforce the boundary conditions.
In this example,
the training data set consists of $N_n=200$ initial data points randomly sub-sampled from the exact solution at time $t=0.1$.
The time step is $\Delta t=0.8$.
The neural network has 4 hidden layers and 200 neurons per layer.
The number of Runge–Kutta stages is $q=100$.
Our goal is to predict the solution at time $t=0.9$ using a single time step.
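The following is a minimal PyTorch-style sketch of how the stage outputs are mapped back to the initial time with the implicit Runge–Kutta weights (the actual code uses TensorFlow; the random `a`, `b` below are placeholders for the real Butcher tableau, and `x` must be created with `requires_grad=True`):

```python
import torch

q, dt = 100, 0.8                          # RK stages and time step, as above

# a: (q, q) IRK matrix, b: (q,) IRK weights; random placeholders here,
# the real values come from the Butcher tableau of the q-stage IRK method
a = torch.rand(q, q)
b = torch.rand(q)

# network: x -> [u^{n+c_1}(x), ..., u^{n+c_q}(x), u^{n+1}(x)]
net = torch.nn.Sequential(
    torch.nn.Linear(1, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, q + 1),
)

def initial_time_prediction(x):
    """u_i^n = u^{n+c_i} + dt*sum_j a_ij N[u^{n+c_j}], and the analogue with b."""
    u = net(x)                            # shape (N, q+1)
    stages = u[:, :q]                     # u^{n+c_j}, j = 1..q
    # per-stage u_xx via autodiff (a simple loop; x needs requires_grad=True)
    uxx_cols = []
    for j in range(q):
        uj = stages[:, j:j+1]
        uj_x = torch.autograd.grad(uj, x, torch.ones_like(uj), create_graph=True)[0]
        uj_xx = torch.autograd.grad(uj_x, x, torch.ones_like(uj_x), create_graph=True)[0]
        uxx_cols.append(uj_xx)
    u_xx = torch.cat(uxx_cols, dim=1)     # shape (N, q)
    N_u = -0.0001 * u_xx + 5 * stages**3 - 5 * stages   # Allen-Cahn N[u]
    u_i_n = stages + dt * N_u @ a.T                      # (N, q)
    u_q1_n = u[:, q:] + dt * N_u @ b.reshape(q, 1)       # (N, 1)
    return torch.cat([u_i_n, u_q1_n], dim=1)             # compare with u(t^n, x)
```

$SSE_n$ is then the sum of squared differences between `initial_time_prediction(x)` at the $N_n$ sampled points and the known initial data $u^{n,i}$.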
### Result


Our relative $L_2$ error is defined as $\frac{||u_{predict}-u_{exact}||_2}{||u_{exact}||_2}$
In my experiment, the error is 6.484016e-02.
In the paper, the error is 6.99e-03.
Some of the L-BFGS optimizer code did not work on my computer, so I switched to GradientDescentOptimizer, which was the simplest way to fix the problem.
#### Some advantages of this model:
We only need sample points x at the initial time; without information from any other time level, we can still obtain the value of u at the time level we want.
The key parameters controlling the performance of our discrete time algorithm are the total number of Runge–Kutta stages q and the time-step size $\Delta t$.
Low-order methods cannot retain their predictive accuracy for large time steps, thus mandating a solution strategy with multiple time steps of small size.
On the other hand, the ability to push the number of Runge–Kutta stages higher allows us to take very large time steps and effectively resolve the solution in a single step without sacrificing the accuracy of our predictions.
---
# (張尚文,Part 2)
### Data-driven discovery of PDE
In this part, we consider a parametrized family of PDEs
$$
u_t+\mathcal{N}[u;\vec{\lambda}]=0
$$
and given data $\{\vec x_i,t_i, \vec u_i\}_{i=1}^N$, we would like to find the $\vec \lambda$ that best explains the data.
The general method is to let $\vec{\lambda}$ be a trainable parameter alongside $W_1,b_1,...,W_n,b_n$. It does not enter the neural approximation $\tilde{u}:=\bigcirc_{i=1}^n (\sigma_i\circ A_i)$ itself, but it does enter the "regularization term" of the loss functional that forces $\tilde{u}$ to satisfy the differential equation.
### Continuous model - Navier Stokes
> note: partial derivatives of the neural approximation, e.g. $\tilde{\psi}_x$, are given by automatic differentiation
The 2D Navier–Stokes equations are
$$
\bigg\{\begin{array}{l} u_t+\lambda_1(uu_x+vu_y)=-p_x+\lambda_2(u_{xx}+u_{yy})\\v_t+\lambda_1(uv_x+vv_y)=-p_y+\lambda_2(v_{xx}+v_{yy})\end{array}
$$
and the data given is $\{x_i,y_i,t_i,u_i,v_i\}_{i=1}^N$
where we restrict our interest to divergence-free vector fields, which (being closed and hence exact, by Poincaré's lemma) can be written via a stream function $\psi$: $\bigg\{\begin{array}{l}u=\psi_y\\ v=-\psi_x\end{array}$
We set up the neural approximation scheme :
$$
\begin{pmatrix}\tilde{\psi}\\\tilde{p}\end{pmatrix}=\bigcirc_{i=1}^n (\sigma_i\circ A_i)
$$
and write
$$
\bigg\{\begin{array}{l}\tilde{u}:=\tilde{\psi}_y\\\tilde{v}:=-\tilde{\psi}_x\end{array}, \quad\bigg\{\begin{array}{l}\tilde{f}:=\tilde{u}_{t}+\lambda_1(\tilde{u}\tilde{u}_x+\tilde{v}\tilde{u}_y)+\tilde{p}_x-\lambda_2(\tilde{u}_{xx}+\tilde{u}_{yy})\\\tilde{g}:=\tilde{v}_t+\lambda_1(\tilde{u}\tilde{v}_x+\tilde{v}\tilde{v}_y)+\tilde{p}_y-\lambda_2(\tilde{v}_{xx}+\tilde{v}_{yy})\end{array}
$$
with loss functional
$$
MSE=\frac 1N\sum_{j=1}^N\big(|\tilde{u}(x_j,y_j,t_j)-u_j|^2+|\tilde{v}(x_j,y_j,t_j)-v_j|^2\big)+\frac 1N \sum_{j=1}^N\big(|\tilde{f}(x_j,y_j,t_j)|^2+|\tilde{g}(x_j,y_j,t_j)|^2\big)
$$
where
(1) the first part measures how well $(\tilde{\psi}_y, -\tilde{\psi}_x)$ approximates $\{u_j, v_j\}_{j=1}^N$
(2) the second part measures how well the Navier–Stokes equations are satisfied.
Notice that $MSE$ is a function of $\lambda_1, \lambda_2$ through $\tilde{f},\tilde{g}$. As `tensorflow` minimizes $MSE$, $\lambda_1, \lambda_2$ are "discovered" by finding the values that best minimize part (2) of the $MSE$.
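A minimal PyTorch-style sketch of this setup (the paper's implementation uses TensorFlow; the layer sizes are arbitrary, and `x, y, t` must be tensors created with `requires_grad=True`):

```python
import torch

# network: (x, y, t) -> (psi, p)
net = torch.nn.Sequential(
    torch.nn.Linear(3, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 2),
)
lam1 = torch.nn.Parameter(torch.tensor(0.0))   # trained jointly with the weights
lam2 = torch.nn.Parameter(torch.tensor(0.0))

def grad(out, inp):
    return torch.autograd.grad(out, inp, torch.ones_like(out), create_graph=True)[0]

def residuals(x, y, t):
    """u = psi_y, v = -psi_x, and the Navier-Stokes residuals f, g."""
    psi_p = net(torch.cat([x, y, t], dim=1))
    psi, p = psi_p[:, :1], psi_p[:, 1:]
    u, v = grad(psi, y), -grad(psi, x)
    f = (grad(u, t) + lam1 * (u * grad(u, x) + v * grad(u, y))
         + grad(p, x) - lam2 * (grad(grad(u, x), x) + grad(grad(u, y), y)))
    g = (grad(v, t) + lam1 * (u * grad(v, x) + v * grad(v, y))
         + grad(p, y) - lam2 * (grad(grad(v, x), x) + grad(grad(v, y), y)))
    return u, v, f, g

opt = torch.optim.Adam(list(net.parameters()) + [lam1, lam2], lr=1e-3)
```

Minimizing the $MSE$ above over `net.parameters()` together with `lam1`, `lam2` is what "discovers" the parameters.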
Result:

Interestingly, without any data about the pressure $p$, the model still manages to approximate $\tilde{p}$. This is reasonable since the model is compelled to make $\tilde{f},\tilde{g}$ zero, and these depend on the value of $\tilde{p}$.