Introduction
tags:
SciML-Lecture
Famous Recent Examples of Scientific Machine Learning
With scientific machine learning becoming an ever larger mainstay at machine learning conferences, and ever more venues and research centres appearing at the intersection of machine learning and the natural sciences / engineering, there exist ever more impressive examples of algorithms which connect the very best of machine learning with deep scientific insight into the respective underlying problem to advance the field.
Below are a few prime examples of recent flagship algorithms in scientific machine learning, every single one of which exemplifies the very best algorithmic approaches we have available to us today.
AlphaFold - predicts 3D protein structure given its sequence:
GNS - capable of simulating the motion of water particles:
Codex - translating natural language to code:
Geometric Deep Learning
Geometric deep learning aims to generalize neural network models to non-Euclidean domains such as graphs and manifolds. Good examples of this line of research include:
SFCNN - steerable rotation equivariant CNN, e.g. for image segmentation
SEGNN - molecular property prediction algorithm
Stable Diffusion - generating images from natural text descriptions
Definition
Machine learning at the intersection of engineering, physics, chemistry, computational biology etc. and core machine learning to improve existing scientific workflows, derive new scientific insight, or bridge the gap between scientific data and our current state of knowledge.
Important to recall here is the difference in approach between engineering & physics on the one side, and machine learning on the other:
Engineering & Physics
Models are derived from conservation laws, observations, and established physical principles.
Machine Learning
Models are derived from data, with priors imprinted on the model space either through the data itself or through the design of the machine learning algorithm.
Supervised vs Unsupervised
There exist three main types of modern-day machine learning:
Supervised Learning
In supervised learning we have a mapping \(x \longrightarrow y\), where the inputs \(x\) are also called features, covariates, or predictors. The outputs \(y\) are often also called the labels, targets, or responses. The correct mapping is then learned from a labeled training set
\[\mathcal{D}_{n} = \left\{ \left( x_{i}, y_{i} \right) \right\}_{i=1}^{n}\]
with \(n\) the number of observations. Depending on the type of the response vector \(y\), we can then perform either regression or classification.
Regression
In regression the target \(y\) is real-valued, i.e. \(y \in \mathbb{R}\).
(Source: Murphy)
The figure shows the example of a response surface being fitted to a number of data points in three dimensions, where the \(x\)- and \(y\)-axes span a two-dimensional space, and the \(z\)-axis is the temperature over that space.
Classification
In classification the labels \(y\) are categorical, i.e. \(y \in \mathcal{C}\), where \(\mathcal{C}\) defines a set of classes.
(Source: Murphy)
The figure shows the example of flower classification, where we aim to find the decision boundaries which sort each individual sample into its respective class.
Unsupervised Learning
In unsupervised learning we only receive a dataset of inputs
\[\mathcal{D} = \{x_{n}: n = 1:N\}\]
without the respective outputs \(y_{n}\), i.e. we only have unlabelled data.
Two famous examples of unsupervised learning are clustering and principal component analysis, the latter of which is especially common in engineering and scientific applications.
Clustering of Principal Components
(Source: Data-Driven Science and Engineering)
The figure combines clustering with principal component analysis to show which samples have cancer in the first three principal-component coordinates.
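To make this workflow concrete, below is a minimal sketch of the combination on synthetic data, assuming only numpy: the principal components are computed from the SVD of the mean-centred data matrix, and a plain k-means loop clusters the samples in the reduced coordinates. The data and all variable names are purely illustrative.

```python
import numpy as np

# Synthetic, unlabelled data matrix X with n samples and d features
# (stand-in for e.g. gene-expression measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Principal component analysis via the SVD of the mean-centred data.
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
Z = X_centred @ Vt[:3].T          # coordinates in the first three principal components

# Simple k-means clustering (k=2) in the reduced coordinates
# (assumes no cluster goes empty on this toy data).
k = 2
centres = Z[rng.choice(len(Z), size=k, replace=False)]
for _ in range(50):
    labels = np.argmin(np.linalg.norm(Z[:, None] - centres[None], axis=-1), axis=1)
    centres = np.stack([Z[labels == j].mean(axis=0) for j in range(k)])

print(labels[:10])  # cluster assignment of the first ten samples
```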
Supervised vs Unsupervised, the tl;dr in Probabilistic Terms
The difference can furthermore be expressed in probabilistic terms, i.e., in supervised learning we are fitting a model over the outputs conditioned on the inputs \(p(y|x)\), whereas in unsupervised learning we are fitting an unconditional model \(p(x)\).
Reinforcement Learning
In reinforcement learning one sequentially interacts with an unknown environment to obtain an interaction trajectory \(T\), or a batch thereof. Reinforcement learning then seeks to optimize the way the agent interacts with the environment through its actions \(a_{t}\) so as to maximize a (cumulative) reward function and obtain an optimal strategy.
(Source: lilianweng)
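As a rough illustration of this interaction loop, the sketch below uses a hypothetical 3-armed bandit as the unknown environment and an epsilon-greedy agent; the environment, reward model, and all parameter values are assumptions chosen for illustration, not part of the lecture material.

```python
import numpy as np

# Minimal sketch of the reinforcement learning interaction loop on a
# hypothetical 3-armed bandit "environment"; all names are illustrative.
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.8])   # unknown to the agent

q_estimates = np.zeros(3)                # estimated value of each action
counts = np.zeros(3)
epsilon = 0.1                            # exploration probability

for t in range(1000):
    # epsilon-greedy selection of the action a_t
    if rng.random() < epsilon:
        a = rng.integers(3)
    else:
        a = int(np.argmax(q_estimates))
    # the environment returns a noisy reward for the chosen action
    r = rng.normal(true_means[a], 1.0)
    # incremental update of the running mean reward estimate
    counts[a] += 1
    q_estimates[a] += (r - q_estimates[a]) / counts[a]

print(q_estimates)  # should approach the true means, with the best arm identified
```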
Polynomial Curve Fitting
Let's presume we have a simple regression problem, e.g.
(Source: Murphy)
where we have a number of observations \({\bf{x}} = (x_{1}, \ldots, x_{N})\) and targets \({\bf{y}} = (y_{1}, \ldots, y_{N})\). The tool we have probably seen before in the mechanical engineering curriculum is the simple approach of fitting a polynomial function
\[y(x, w) = \omega_{0} + \omega_{1}x + \omega_{2} x^{2} + \ldots + \omega_{M}x^{M} = \sum_{j=0}^{M}\omega_{j}x^{j}\]
A crucial choice is then the degree \(M\) of the polynomial.
We can then construct an error function with the sum of squares approach in which we are computing the distance of every target data point to our polynomial
\[E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_{n}, w) - y_{n} \}^{2}\]
which we then minimize with respect to \(w\).
(Source: Murphy)
To minimize this we then have to take the derivative with respect to the coefficients \(\omega_{i}\), i.e.
\[\frac{\partial E(w)}{\partial \omega_{i}}=\sum_{n=1}^{N}\{ y(x_{n}, w) - y_{n} \}x_{n}^{i}=\sum_{n=1}^{N}\{ \sum_{j=0}^{M}\omega_{j} x_{n}^{j} - y_{n} \}x_{n}^{i}\]
Setting this derivative to zero, we can then find the minimum
\[\sum_{n=1}^{N}\sum_{j=0}^{M}\omega_{j}x_{n}^{i}x_{n}^{j}=\sum_{n=1}^{N}y_{n}x_{n}^{i}\]
This linear system can be solved by trusty old Gaussian elimination. A general problem with this approach is that the degree of the polynomial is a decisive factor which often leads to over-fitting, hence making this a less desirable approach. Gaussian elimination, or a matrix-inversion approach when implemented on a computer, can also be a highly expensive computational operation for large datasets.
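As a small illustration, the sketch below assembles the normal equations above for synthetic data and solves them with numpy (which performs Gaussian elimination via an LU factorization under the hood); the data-generating function, noise level, and degree \(M\) are assumed purely for illustration.

```python
import numpy as np

# Polynomial least-squares fit by solving the normal equations
# sum_n sum_j w_j x_n^{i+j} = sum_n y_n x_n^i  (see above) on synthetic data.
rng = np.random.default_rng(0)
N, M = 20, 3                                              # number of points, polynomial degree
x = np.linspace(0.0, 1.0, N)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=N)    # noisy targets

Phi = np.vander(x, M + 1, increasing=True)     # design matrix with columns x^0 ... x^M
A = Phi.T @ Phi                                # left-hand side of the normal equations
b = Phi.T @ y                                  # right-hand side
w = np.linalg.solve(A, b)                      # Gaussian elimination under the hood

y_fit = Phi @ w
print(w)                                       # fitted coefficients w_0 ... w_M
```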
Bayesian Curve Fitting
Recap: Bayes Theorem
\[\mathbb{P}(A|B) = \frac{\mathbb{P}(B|A)\mathbb{P}(A)}{\mathbb{P}(B)}\]
If we now seek to reformulate the curve fitting in probabilistic terms, then we have to begin by expressing our uncertainty over the target \(y\) with a probability distribution. For this we presume a Gaussian distribution over each target whose mean is the point value we previously considered, i.e.
\[p(y|x, w, \beta)=\mathcal{N}(y|y(x, w), \beta^{-1})\]
\(\beta\) corresponds to the inverse variance of the distribution \(\mathcal{N}\). We can then apply the maximum likelihood principle to find the optimal parameter \(w\) with our new likelihood function
\[p(y|x, w, \beta)=\prod^{N}_{n=1}\mathcal{N}(y_{n}|y(x_{n},w), \beta^{-1})\].
(Source: Bishop)
Taking the log likelihood we are then able to find the definitions of the optimal parameters
\[\text{ln } p(y|x, w, \beta) = - \frac{\beta}{2} \sum^{N}_{n=1} \{ y(x_{n}, w) - y_{n} \}^{2} + \frac{N}{2} \text{ln } \beta - \frac{N}{2} \text{ln }(2 \pi)\]
which we can then optimize for \(w\).
The optimal maximum likelihood parameters \(w_{ML}\) and \(\beta_{ML}\) obtained this way can then be resubstituted to obtain the predictive distribution for the targets \(y\):
\[p(y|x, w_{ML}, \beta_{ML})=\mathcal{N}(y|y(x, w_{ML}),\beta_{ML}^{-1})\]
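For the Gaussian noise model this means \(w_{ML}\) coincides with the sum-of-squares solution from before, and \(1/\beta_{ML}\) is the mean squared residual. A minimal sketch, again on assumed synthetic data:

```python
import numpy as np

# Maximum likelihood under the Gaussian noise model: w_ML coincides with the
# least-squares solution, and 1/beta_ML is the mean squared residual.
rng = np.random.default_rng(0)
N, M = 20, 3
x = np.linspace(0.0, 1.0, N)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # maximises the log-likelihood in w
residuals = Phi @ w_ml - y
beta_ml = 1.0 / np.mean(residuals ** 2)          # 1/beta_ML = mean squared residual

# Predictive distribution p(y|x, w_ML, beta_ML) at a new input x_new
x_new = 0.5
phi_new = np.array([x_new ** i for i in range(M + 1)])
mean, var = phi_new @ w_ml, 1.0 / beta_ml
print(mean, var)
```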
To arrive at the full Bayesian curve-fitting approach we now have to apply the sum and product rules of probability
Recap: Sum Rule of Probability
\[\mathbb{P}(A) = \sum_{B} \mathbb{P}(A, B)\]
Recap: Product Rule of Probability
\[\mathbb{P}(A, B) = \mathbb{P}(A|B) \cdot \mathbb{P}(B)\]
The Bayesian curve fitting formula is hence
\[p(y|x, {\bf{x}}, {\bf{y}}) = \int p(y|x, w)p(w|{\bf{x}}, {\bf{y}})dw\]
with the dependence on \(\beta\) omitted for legibility reasons. This integral can be solved analytically, giving the following predictive distribution
\[p(y|x, {\bf{x}}, {\bf{y}})=\mathcal{N}(y|m(x), s^{2}(x))\]
with mean and variance
\[m(x) = \beta \phi(x)^{T} {\bf{S}} \sum_{n=1}^{N}\phi(x_{n})y_{n},\]
\[s^{2}(x) = \beta^{-1} + \phi(x)^{T} {\bf{S}} \phi(x),\]
and \({\bf{S}}\) defined as
\[{\bf{S}}^{-1} = \alpha {\bf{I}} + \beta \sum_{n=1}^{N} \phi(x_{n}) \phi(x_{n})^{T},\]
and \({\bf{I}}\) the unit matrix, while \(\phi(x)\) is defined by \(\phi_{i}(x) = x^{i}\). Examining the variance \(s^{2}(x)\), the benefits of the Bayesian approach become readily apparent: the first term \(\beta^{-1}\) represents the noise in the target data, while the second term \(\phi(x)^{T} {\bf{S}} \phi(x)\) expresses the remaining uncertainty in the parameters \(w\), a direct consequence of the Bayesian treatment.
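A minimal sketch of evaluating \(m(x)\) and \(s^{2}(x)\) from these formulas is given below; the training data, the polynomial degree, and the values of \(\alpha\) and \(\beta\) are assumptions chosen purely for illustration.

```python
import numpy as np

# Evaluate the Bayesian predictive mean m(x) and variance s^2(x) from the
# formulas above; alpha and beta are assumed example values.
rng = np.random.default_rng(0)
N, M = 20, 3
alpha, beta = 5e-3, 100.0
x_train = np.linspace(0.0, 1.0, N)
y_train = np.sin(2.0 * np.pi * x_train) + 0.1 * rng.normal(size=N)

def phi(x):
    """Polynomial basis vector phi_i(x) = x^i for i = 0..M."""
    return np.array([x ** i for i in range(M + 1)])

Phi = np.stack([phi(xn) for xn in x_train])            # N x (M+1) design matrix
S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi     # S^{-1} = alpha I + beta sum phi phi^T
S = np.linalg.inv(S_inv)

def predictive(x):
    m = beta * phi(x) @ S @ Phi.T @ y_train            # m(x) = beta phi^T S sum phi(x_n) y_n
    s2 = 1.0 / beta + phi(x) @ S @ phi(x)              # s^2(x) = 1/beta + phi^T S phi
    return m, s2

print(predictive(0.5))
```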
Maximum Likelihood, Information Theory & Log-Likelihood
Maximum Likelihood & Log-Likelihood
Formalizing the principle of maximum likelihood estimation, we act in a similar fashion to the extremum calculations known from high school: we differentiate our likelihood function and find its maximum by setting the derivative equal to zero.
\[\hat{\theta} = \underset{\theta \in \Theta}{\arg \max} \mathcal{L}_{n}(\theta; y)\]
where \(\mathcal{L}\) is the likelihood function. In the derivation of the Bayesian curve fitting approach we have already utilized this principle by exploiting the independence between data points and then taking the log of the likelihood to make use of the often nicer properties of the log-likelihood
\[l(\theta;y) = \ln \mathcal{L}_{n}(\theta;y).\]
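As a small sanity check of the principle, the sketch below evaluates the Gaussian log-likelihood of some synthetic samples over a grid of candidate means and confirms that the grid maximiser agrees, up to the grid resolution, with the analytic maximum likelihood estimate, the sample mean; the data and grid are assumptions for illustration.

```python
import numpy as np

# Illustrative maximum likelihood estimation of the mean of i.i.d. Gaussian
# samples: the analytic maximiser of the log-likelihood is the sample mean.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)

def log_likelihood(theta, y, sigma=1.0):
    # l(theta; y) = ln L_n(theta; y) for a Gaussian with known sigma
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - 0.5 * ((y - theta) / sigma) ** 2)

thetas = np.linspace(0.0, 4.0, 401)
theta_hat = thetas[np.argmax([log_likelihood(t, y) for t in thetas])]
print(theta_hat, y.mean())   # grid argmax agrees with the sample mean up to grid spacing
```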
Information Theory
The core concept of information theory to consider here is that not every data point is equally valuable to us! If an individual data point lies outside of the previously estimated dynamics model, then it is a much more valuable data point than one which lies, e.g., directly on the response surface. Such information content can be measured with a function which is the logarithm of the probability \(p(x)\) of observing a specific data point
\[h(x) = - \text{log}_{2} p(x)\]
i.e. the minus sign ensures that the conveyed information value is either positive or zero. This can be further formalized with the concept of information entropy, which is the expectation of \(h(x)\)
\[H[x] = - \sum_{x} p(x) \text{log}_{2} p(x).\]
Using statistical mechanics one can then derive the definitions of the entropy for continuous, as well as discrete variables.
Entropy of Discrete Variables
\[H[p]= - \sum_{i} p(x_{i}) \text{ln } p(x_{i})\]
Entropy of Continuous Variables
\[H[x] = - \int p(x) \text{ln }p(x) dx\]
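A short sketch of the discrete entropy is given below, comparing an assumed uniform distribution with an assumed peaked one; it uses the natural logarithm, i.e. entropy in nats, matching the definitions above.

```python
import numpy as np

# Discrete entropy H[p] = -sum_i p(x_i) ln p(x_i) for two example distributions:
# the uniform distribution maximises the entropy, a peaked one carries less.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * ln 0 = 0
    return -np.sum(p * np.log(p))

uniform = np.full(8, 1.0 / 8.0)
peaked = np.array([0.9, 0.05, 0.02, 0.01, 0.01, 0.005, 0.0025, 0.0025])

print(entropy(uniform))   # ln 8 ≈ 2.079 nats
print(entropy(peaked))    # smaller: the distribution is more predictable
```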
Recap
Further References