## Kalman Filtering - an introduction
<!-- Put the link to this slide here so people can follow -->
[Slides](https://hackmd.io/Ahihs6CfQ-SQIxmvpNS21w?both)
[Colab](https://colab.research.google.com/drive/1Ajrjhgu9YkvmcFZMXvYSw-B665bnFcZj?usp=sharing#scrollTo=Tzq-wWT5nRbB)
---
#### The need for information fusion

---
#### Bayes helps us with his famous theorem
$$
\mathrm{posterior} = \dfrac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}} \\
p(\theta | X) = \dfrac{p(X | \theta)p(\theta)}{p(X)} \\
\theta - \mathrm{model\ parameters} \\
X = \left(x_1, x_2, \dots, x_N\right)- \mathrm{data}
$$
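A tiny numerical sanity check of the formula; the coin, its two candidate biases, and the flip counts below are all made-up illustrative values:

```python
import numpy as np

# Two candidate parameter values (theta) for a coin's heads-probability.
thetas = np.array([0.5, 0.8])
prior = np.array([0.5, 0.5])          # p(theta): uniform prior

# Observed data X: 3 heads out of 4 i.i.d. flips.
heads, flips = 3, 4
likelihood = thetas**heads * (1 - thetas)**(flips - heads)  # p(X | theta)

evidence = np.sum(likelihood * prior)        # p(X), the normalizer
posterior = likelihood * prior / evidence    # Bayes' theorem
```

After 3 heads in 4 flips, the posterior shifts toward the biased coin.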
---
#### How to get a recursive filter
**Motto:** yesterday's posterior is today's prior, namely (for $N=2$)
$$
p(\theta | x_1, x_2) = \dfrac{p(x_1, x_2 | \theta)p(\theta)}{p(x_1, x_2)}{\Bigr|_{(x_i \sim i.i.d. \ \ \forall i)}} \\
= \dfrac{\color{red}{p(x_1| \theta)}p(x_2| \theta)\color{red}{p(\theta)}}{\color{red}{p(x_1)}p(x_2 | x_1)}\\= \dfrac{p(x_2| \theta)\color{red}{p(\theta | x_1)}}{p(x_2| x_1)}
$$
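The motto can be verified numerically; a sketch on a discrete grid over $\theta$ (grid size and the two observations are arbitrary choices):

```python
import numpy as np

# Discrete grid over the coin bias theta, with a uniform prior.
thetas = np.linspace(0.01, 0.99, 99)
prior = np.full_like(thetas, 1.0 / len(thetas))

def bernoulli_lik(x):
    return thetas if x == 1 else 1.0 - thetas   # p(x | theta)

def update(prior, x):
    post = bernoulli_lik(x) * prior             # numerator of Bayes' rule
    return post / post.sum()                    # normalize by the evidence

x1, x2 = 1, 0   # two i.i.d. observations

# Batch posterior: p(theta | x1, x2) in one step.
batch = bernoulli_lik(x1) * bernoulli_lik(x2) * prior
batch /= batch.sum()

# Recursive posterior: yesterday's posterior is today's prior.
recursive = update(update(prior, x1), x2)
```

The two posteriors agree exactly, which is what makes a recursive filter possible.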
---
#### Reading graphical models I/II

- Most general form: $p\left(x_1,\dots, x_n \right) = \prod_{i=1}^{n}p\left(x_i | x_1, \dots, x_{i-1}\right)$
- With relationships: $p\left(x_1,\dots, x_n \right) = \prod_{i=1}^{n}p\left(x_i | \mathrm{Parents}(x_i)\right)$
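Both factorizations can be checked numerically on a small joint table; a sketch with three binary variables (the joint distribution is random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# A random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

# Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)
p1 = p.sum(axis=(1, 2))                          # p(x1)
p2_given_1 = p.sum(axis=2) / p1[:, None]         # p(x2 | x1)
p3_given_12 = p / p.sum(axis=2, keepdims=True)   # p(x3 | x1, x2)

reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
```

The product of the conditionals reconstructs the joint exactly; a graphical model simply prunes each conditional down to the parents.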
---
#### Reading graphical models II/II

I. & II. $X \perp Y | Z$ (X: Monday, Z: Tuesday, Y: Wednesday); the second structure works the same way
III. $X \perp Y | Z$ (X: You love Túró Rudi., Z: :flag-hu: ., Y: You know János Arany.)
IV. $X \not\perp Y | Z$ (X: broken leg, Z: hospitalization, Y: TBC); conditioning on a common effect couples its causes ("explaining away")
---
#### The Kalman Filter (KF)
- Does information fusion from multiple measurements (=likelihood)
- Uses a model (=prior)
- Difference compared with Hidden Markov Models (HMMs):
    - HMM: discrete states and observations
    - KF: continuous states and observations
---
#### Assumptions, assumptions everywhere

- Markov property: $x_{t+1} \perp x_{t-1} | x_t$
- Observations: $y_t \perp x_1, \dots, x_{\color{red}{t-1}}, x_{\color{red}{t+1}}, \dots | x_t$
$$p\left(x_1,\dots, x_N, y_1,\dots, y_N \right) = \prod_{i=1}^{N}p\left(x_i | x_{i-1}\right)p\left(y_i | x_{i}\right)$$
---
#### How to capture uncertainty
- With covariances!
- $P_t = \mathbb{E}\left\{(x_t - \hat{x}_t)(x_t - \hat{x}_t)^T\right\}$ denotes the state covariance matrix $(\in \mathbb{R}^{n\times n} )$
- $Q_t$ denotes the model covariance matrix $(\in \mathbb{R}^{n\times n} )$
- $R_t$ denotes the measurement covariance matrix $(\in \mathbb{R}^{m\times m})$
---
#### The linear-Gaussian model (states)
- $x_t$ is the state vector $(\in \mathbb{R}^n)$
- $A$ is the state transition matrix $(\in \mathbb{R}^{n\times n})$
- $u_t$ is the input vector $(\in \mathbb{R}^p)$
- $B$ is the control matrix $(\in \mathbb{R}^{n\times p})$
$$x_{t+1} = A x_t + B u_t + \epsilon_t\\
x_{t+1} \sim \mathcal{N}\left( A x_t + B u_t, Q_t\right)
$$
**Remarks:** $\epsilon_t \sim \mathcal{N}(0, Q_t)$ and i.i.d. Moreover, $u_t$ can be left out of the probabilistic treatment, as the input is deterministic (more on the importance of this later).
---
#### The linear-Gaussian model (observations, a.k.a measurements)
- $y_t$ is the observation vector $(\in \mathbb{R}^m)$
- $C$ is the observation matrix $(\in \mathbb{R}^{m\times n})$
$$
y_t = C x_t + \eta_t \\
y_t \sim \mathcal{N}\left(C x_t, R_t\right)
$$
**Remarks:** $\eta_t \sim \mathcal{N}(0, R_t)$ and i.i.d.
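The state and observation equations above can be simulated directly; a minimal sketch in which all concrete matrices are illustrative choices (a 1D constant-velocity state with an acceleration input and position-only measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])  # state transition (position, velocity)
B = np.array([[0.0], [dt]])            # control matrix
C = np.array([[1.0, 0.0]])             # we only observe the position
Q = 1e-3 * np.eye(2)                   # model covariance Q_t
R = np.array([[0.05]])                 # measurement covariance R_t
u = np.array([1.0])                    # deterministic input (constant acceleration)

x = np.zeros(2)
xs, ys = [], []
for t in range(100):
    eps = rng.multivariate_normal(np.zeros(2), Q)  # eps_t ~ N(0, Q_t), i.i.d.
    x = A @ x + B @ u + eps                        # x_{t+1} = A x_t + B u_t + eps_t
    eta = rng.multivariate_normal(np.zeros(1), R)  # eta_t ~ N(0, R_t), i.i.d.
    y = C @ x + eta                                # y_t = C x_t + eta_t
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)
```

`xs` holds the hidden states and `ys` the noisy measurements; the filter's job will be to recover the former from the latter.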
---
#### Why do we need a filter? I/II

- We can do inference: "What is the value of $x_t$?"(like the reconstruction model in a VAE)
- We can do learning: "What are the values of $A, B, C$?" (like what SGD does in a VAE)
---
#### Why do we need a filter? II/II
- Notation: $y_{1:N} = (y_1, \dots, y_N)$
- Smoothing: $p\left(x_t | y_{1:N}\right)$
- Full inference: $p\left(x_{1:N} |y_{1:N}\right)$ (not done with the KF)
---
#### Filtering
- Filtering: $p\left(x_t | y_{1:t}\right) = \dfrac{p(y_t|x_t)\color{red}{p(x_t|y_{1:t-1})}}{p(y_t|y_{1:t-1})}$
- $p(y_t|x_t)$ : observation model
- $p(y_t|y_{1:t-1})$: as everything is Gaussian, this normalization constant is not needed
- $\color{red}{p(x_t|y_{1:t-1})} = \int p(x_t|x_{t-1})p(x_{t-1}|y_{1:t-1}) d x_{t-1}$
- $p(x_t|x_{t-1})$: state transition model
- $p(x_{t-1}|y_{1:t-1})$ : result of the previous filtering step
---
#### What is the optimization goal?
- Standard KF approach: MSE minimization
$$ \min_{\hat{x}_t} \mathbb{E}_{p(x_{1:t}, y_{1:t})}\left\{||x_t - \hat{x}_t ||^2\right\} $$
- Problem: $x_t$ is often hidden ($\sim$ Hidden Markov Model, HMM)
- Alternative formulation (like the VAE, $\hat{x}_t$ encodes observation, which is decoded by $C$):
$$ \min_{\hat{x}_t} \mathbb{E}_{p(x_{1:t}, y_{1:t})}\left\{||\color{red} {y_t} - \color{red} C \hat{x}_t ||^2\right\} $$
---
#### A Maximum Likelihood (ML) approach
- Reminder: the model is linear, noise is Gaussian and i.i.d
$$ \max_{\hat{x}_{1:N}} \prod_{t=1}^{N} p(y_t | \hat{x}_t) \\
p(y_t | \hat{x}_t) \sim \mathcal{N}\left(C \hat{x}_t, R_t\right)
$$
- Maximizing the likelihood results in $y_t \approx C \hat{x}_t,$ i.e. $\min_{\hat{x}_t} \mathbb{E}\left\{||y_t - C \hat{x}_t ||^2\right\}$
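A quick numerical check of this equivalence for a single scalar observation (all numbers are illustrative):

```python
import numpy as np

# Illustrative scalar setting: one observation y_t, model y_t = C * x + noise.
C, R = 0.5, 0.2
y_t = 1.3
xhat = np.linspace(-10, 10, 4001)  # candidate estimates on a grid

# Gaussian log-likelihood log N(y_t; C*xhat, R), up to an additive constant:
loglik = -0.5 * (y_t - C * xhat) ** 2 / R
sq_err = (y_t - C * xhat) ** 2

ml_est = xhat[np.argmax(loglik)]  # maximum-likelihood estimate
ls_est = xhat[np.argmin(sq_err)]  # least-squares estimate
```

Both pick the same grid point (here $y_t / C$): the Gaussian log-likelihood is just a negatively scaled squared error plus a constant.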
---
#### How the filter works
1. Time update (should be called **model update**): use the prior model to predict the "future"
2. Measurement update: extract information from the measurements and correct the prediction with it

**Remark:** we need both steps: the model will be imperfect, e.g. due to simplifications or unmodeled effects (this is sometimes called _epistemic uncertainty_), and the measurements will be noisy as well (_aleatoric uncertainty_, from the Latin _alea_, dice, as in _Alea iacta est_).
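The two alternating steps can be sketched end to end; the concrete matrices below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])  # state transition (position, velocity)
C = np.array([[1.0, 0.0]])              # we only measure the position
Q = 1e-3 * np.eye(2)                    # model covariance
R = np.array([[0.1]])                   # measurement covariance

# Simulate a hidden trajectory and its noisy measurements.
x_true = np.zeros(2)
ys = []
for _ in range(200):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(C @ x_true + rng.multivariate_normal(np.zeros(1), R))

# The filter: alternate the two steps for every measurement.
x_post, P_post = np.zeros(2), np.eye(2)
for y in ys:
    # 1. Time (model) update: predict the "future" with the prior model.
    x_pred = A @ x_post
    P_pred = A @ P_post @ A.T + Q
    # 2. Measurement update: correct the prediction with the new data y.
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_post = x_pred + K @ (y - C @ x_pred)
    P_post = (np.eye(2) - K @ C) @ P_pred
```

After the loop, `x_post` is the filtered estimate of the hidden state and `P_post` quantifies its remaining uncertainty.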
---
#### Model update
1. Use the model to predict the next state: $\hat{x}_t = A x_{t-1} + B u_{t-1}$
2. Calculate the covariance of the predicted, transformed variable $A x_{t-1}$: $\hat{P}_t = A P_{t-1} A^T + Q_t$ (the covariance grows by the model uncertainty; as $u_t$ is deterministic, it does not change the covariance).
---
#### Measurement update I/III
1. Now we observe/measure some new data $y_t$.
2. This is predicted by the transformed variable $C \hat{x}_{t}$ that has the covariance: $S_t = C \hat{P}_t C^T + R_t$ (often referred to as innovation covariance)
3. To minimize $\mathbb{E}\left\{||y_t - C \hat{x}_t ||^2\right\}$ over $\hat{x}_t$ (the difference $y_t - C \hat{x}_t$ is called the innovation), an optimal weighting is required between $\hat{x}_t$ and $y_t - C \hat{x}_t$.
---
#### Measurement update II/III
4. The weighting is done by the Kalman gain $K_t$. Let's first calculate $\hat{K}_t$, the gain for predicting $C x_t$.
5. The gain weights each component inversely proportionally to its covariance, i.e. (in scalar form) $\hat{K}_t = \dfrac{C \hat{P}_t C^T}{C \hat{P}_t C^T + R_t}$
6. As $\hat{K}_t = C K_t$, $K_t$ is almost the same: $K_t = \dfrac{\hat{P}_t C^T}{C \hat{P}_t C^T + R_t}$, which for $R_t \rightarrow 0$ tends to $\dfrac{1}{C}$ (i.e. $C^{-1}$ in the matrix case)
---
#### Measurement update III/III
7. The posterior state is $x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right) = \left(I - K_t C\right)\hat{x}_t + K_t y_t$
8. The posterior covariance combines the covariances of the two independent terms $\left(I - K_t C\right)\hat{x}_t$ and $K_t y_t$, namely: $P_t = \left(I - K_t C\right)\hat{P}_t\left(I - K_t C\right)^T + K_t R_t K_t^T$ (the Joseph form)
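This posterior covariance formula is known as the Joseph form; for the optimal gain it can be cross-checked numerically against the shorter expression $\left(I - K_t C\right)\hat{P}_t$ (the matrices below are random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# An arbitrary symmetric positive-definite predicted covariance.
M = rng.normal(size=(2, 2))
P_pred = M @ M.T + np.eye(2)
C = np.array([[1.0, 0.0]])
R = np.array([[0.3]])

S = C @ P_pred @ C.T + R                 # innovation covariance
K = P_pred @ C.T @ np.linalg.inv(S)      # optimal Kalman gain

I = np.eye(2)
joseph = (I - K @ C) @ P_pred @ (I - K @ C).T + K @ R @ K.T
simple = (I - K @ C) @ P_pred            # valid only for the optimal K
```

The two agree for the optimal gain; the Joseph form is preferred in practice because it stays symmetric and positive definite even when $K_t$ is slightly off.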
---
#### How does $K_t$ affect the posterior? I/II
- If the model is not accurate (and we know that, i.e. $Q_t$ is "big"), then $\hat{P}_t$ will be "big", thus $K_t \approx \dfrac{\infty}{C\infty + R_t} = 1/C,$ so
$$x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right)_{\Bigr|_{(R_t=0 \ \mathrm{and}\ Q_t \rightarrow \infty)}} \\
= \hat{x}_t + \frac{1}{C}\left(y_t - C \hat{x}_t\right) = \frac{1}{C}y_t$$
---
#### How does $K_t$ affect the posterior? II/II
- On the other hand, if the observations are noisy (and we know that, i.e. $R_t$ is "big"), then $K_t \approx \dfrac{\hat{P}_t C^T}{C \hat{P}_t C^T + \infty} = 0,$ so
$$x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right)_{\Bigr|_{(R_t \rightarrow\infty)}} \\
= \hat{x}_t + 0 \cdot \left(y_t - C \hat{x}_t\right) = \hat{x}_t$$
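Both limiting cases (plus the $R_t \rightarrow 0$ one from the gain slide) can be checked with a scalar gain; the numbers are illustrative:

```python
import numpy as np

def gain(P_pred, C, R):
    """Scalar Kalman gain: P_pred * C / (C * P_pred * C + R)."""
    return P_pred * C / (C * P_pred * C + R)

C = 2.0
k_noise_free = gain(1.0, C, 1e-12)   # R_t -> 0: trust the measurement
k_bad_model = gain(1e12, C, 0.5)     # Q_t (hence P_pred) -> inf: same limit
k_noisy_obs = gain(1.0, C, 1e12)     # R_t -> inf: ignore the measurement
```

The first two approach $1/C$ (the posterior becomes $y_t/C$), the last approaches $0$ (the posterior stays $\hat{x}_t$).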
---
#### Open questions
- **Q:** What if we do not know $Q_t$ and $R_t$? **A:** Estimate them from the data with the empirical covariance (and maybe smooth it with an exponential filter).
- **Q:** What if the model is nonlinear? **A:** Use the UKF (Unscented Kalman Filter); avoid the Extended Kalman Filter, as it offers no gain over the UKF but can mess things up with highly nonlinear systems.
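A sketch of the first answer under strong simplifying assumptions: zero-mean residuals are already available, and `alpha` is an arbitrary smoothing factor (a tuning knob, not a standard value):

```python
import numpy as np

rng = np.random.default_rng(3)
true_R = 0.25
# Stand-in residuals; in practice these would come from the filter's innovations.
residuals = rng.normal(0.0, np.sqrt(true_R), size=5000)

# Empirical variance, smoothed with an exponential filter.
alpha, R_hat = 0.005, 1.0
for r in residuals:
    R_hat = (1 - alpha) * R_hat + alpha * r**2
```

`R_hat` drifts toward the true measurement variance (0.25 here) while remaining robust to short noise bursts.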
---
#### Useful resources
- [Kalman filter demystified](https://arxiv.org/pdf/1811.11618.pdf)
- [Tutorial: The Kalman Filter](http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf)
- [Tutorial: The Kalman Filter (different one)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.5594&rep=rep1&type=pdf)
- [Another tutorial](https://web.archive.org/web/20130512192346/http://www.tristanfletcher.co.uk/LDS.pdf)
---
