## Kalman Filtering - an introduction
<!-- Put the link to this slide here so people can follow -->
[Slides](https://hackmd.io/Ahihs6CfQ-SQIxmvpNS21w?both)
[Colab](https://colab.research.google.com/drive/1Ajrjhgu9YkvmcFZMXvYSw-B665bnFcZj?usp=sharing#scrollTo=Tzq-wWT5nRbB)
---
#### The need for information fusion

---
#### Bayes helps us with his famous theorem
$$
\mathrm{posterior} = \dfrac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}} \\
p(\theta | X) = \dfrac{p(X | \theta)p(\theta)}{p(X)} \\
\theta - \mathrm{model\ parameters} \\
X = \left(x_1, x_2, \dots, x_N\right)- \mathrm{data}
$$
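A tiny numerical sanity check of the formula; the coin, its two candidate biases, and the flip counts below are all made-up illustrative values:

```python
import numpy as np

# Two candidate parameter values (theta) for a coin's heads-probability.
thetas = np.array([0.5, 0.8])
prior = np.array([0.5, 0.5])          # p(theta): uniform prior

# Observed data X: 3 heads out of 4 i.i.d. flips.
heads, flips = 3, 4
likelihood = thetas**heads * (1 - thetas)**(flips - heads)  # p(X | theta)

evidence = np.sum(likelihood * prior)        # p(X), the normalizer
posterior = likelihood * prior / evidence    # Bayes' theorem
```

After 3 heads in 4 flips, the posterior shifts toward the biased coin.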
---
#### How to get a recursive filter
**Motto:** yesterday's posterior is today's prior, namely (for $N=2$)
$$
p(\theta | x_1, x_2) = \dfrac{p(x_1, x_2 | \theta)p(\theta)}{p(x_1, x_2)}{\Bigr|_{(x_i \sim i.i.d. \ \ \forall i)}} \\
= \dfrac{\color{red}{p(x_1| \theta)}p(x_2| \theta)\color{red}{p(\theta)}}{\color{red}{p(x_1)}p(x_2 | x_1)}\\= \dfrac{p(x_2| \theta)\color{red}{p(\theta | x_1)}}{p(x_2| x_1)}
$$
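The motto can be verified numerically; a sketch on a discrete grid over $\theta$ (grid size and the two observations are arbitrary choices):

```python
import numpy as np

# Discrete grid over the coin bias theta, with a uniform prior.
thetas = np.linspace(0.01, 0.99, 99)
prior = np.full_like(thetas, 1.0 / len(thetas))

def bernoulli_lik(x):
    return thetas if x == 1 else 1.0 - thetas   # p(x | theta)

def update(prior, x):
    post = bernoulli_lik(x) * prior             # numerator of Bayes' rule
    return post / post.sum()                    # normalize by the evidence

x1, x2 = 1, 0   # two i.i.d. observations

# Batch posterior: p(theta | x1, x2) in one step.
batch = bernoulli_lik(x1) * bernoulli_lik(x2) * prior
batch /= batch.sum()

# Recursive posterior: yesterday's posterior is today's prior.
recursive = update(update(prior, x1), x2)
```

The two posteriors agree exactly, which is what makes a recursive filter possible.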
---
#### Reading graphical models I/II

- Most general form: $p\left(x_1,\dots, x_n \right) = \prod_{i=1}^{n}p\left(x_i | x_1, \dots, x_{i-1}\right)$
- With relationships: $p\left(x_1,\dots, x_n \right) = \prod_{i=1}^{n}p\left(x_i | \mathrm{Parents}(x_i)\right)$
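Both factorizations can be checked numerically on a small joint table; a sketch with three binary variables (the joint distribution is random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
# A random joint distribution p(x1, x2, x3) over three binary variables.
p = rng.random((2, 2, 2))
p /= p.sum()

# Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2)
p1 = p.sum(axis=(1, 2))                          # p(x1)
p2_given_1 = p.sum(axis=2) / p1[:, None]         # p(x2 | x1)
p3_given_12 = p / p.sum(axis=2, keepdims=True)   # p(x3 | x1, x2)

reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
```

The product of the conditionals reconstructs the joint exactly; a graphical model simply prunes each conditional down to the parents.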
---
#### Reading graphical models II/II

I. & II. $X \perp Y | Z$ (X: Monday, Z: Tuesday, Y: Wednesday); the second structure works the same way
III. $X \perp Y | Z$ (X: You love Túró Rudi., Z: :flag-hu: ., Y: You know János Arany.)
IV. $X \not\perp Y | Z$ (X: broken leg, Z: hospitalization, Y: TBC); conditioning on a common effect couples its causes ("explaining away")
---
#### The Kalman Filter (KF)
- Does information fusion from multiple measurements (=likelihood)
- Uses a model (=prior)
- Difference compared with Hidden Markov Models (HMMs):
    - HMM: discrete states and observations
    - KF: continuous states and observations
---
#### Assumptions, assumptions everywhere

- Markov property: $x_{t+1} \perp x_{t-1} | x_t$
- Observations: $y_t \perp x_1, \dots, x_{\color{red}{t-1}}, x_{\color{red}{t+1}}, \dots | x_t$
$$p\left(x_1,\dots, x_N, y_1,\dots, y_N \right) = \prod_{i=1}^{N}p\left(x_i | x_{i-1}\right)p\left(y_i | x_{i}\right)$$
---
#### How to capture uncertainty
- With covariances!
- $P_t = \mathbb{E}\left\{(x_t - \hat{x}_t)(x_t - \hat{x}_t)^T\right\}$ denotes the state covariance matrix $(\in \mathbb{R}^{n\times n} )$
- $Q_t$ denotes the model covariance matrix $(\in \mathbb{R}^{n\times n} )$
- $R_t$ denotes the measurement covariance matrix $(\in \mathbb{R}^{m\times m})$
---
#### The linear-Gaussian model (states)
- $x_t$ is the state vector $(\in \mathbb{R}^n)$
- $A$ is the state transition matrix $(\in \mathbb{R}^{n\times n})$
- $u_t$ is the input vector $(\in \mathbb{R}^p)$
- $B$ is the control matrix $(\in \mathbb{R}^{n\times p})$
$$x_{t+1} = A x_t + B u_t + \epsilon_t\\
x_{t+1} \sim \mathcal{N}\left( A x_t + B u_t, Q_t\right)
$$
**Remarks:** $\epsilon_t \sim \mathcal{N}(0, Q_t)$ and i.i.d. Moreover, $u_t$ can be left out of the probabilistic treatment, as the input is deterministic (more on the importance of this later).
---
#### The linear-Gaussian model (observations, a.k.a measurements)
- $y_t$ is the observation vector $(\in \mathbb{R}^m)$
- $C$ is the observation matrix $(\in \mathbb{R}^{m\times n})$
$$
y_t = C x_t + \eta_t \\
y_t \sim \mathcal{N}\left(C x_t, R_t\right)
$$
**Remarks:** $\eta_t \sim \mathcal{N}(0, R_t)$ and i.i.d.
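The state and observation equations above can be simulated directly; a minimal sketch in which all concrete matrices are illustrative choices (a 1D constant-velocity state with an acceleration input and position-only measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])  # state transition (position, velocity)
B = np.array([[0.0], [dt]])            # control matrix
C = np.array([[1.0, 0.0]])             # we only observe the position
Q = 1e-3 * np.eye(2)                   # model covariance Q_t
R = np.array([[0.05]])                 # measurement covariance R_t
u = np.array([1.0])                    # deterministic input (constant acceleration)

x = np.zeros(2)
xs, ys = [], []
for t in range(100):
    eps = rng.multivariate_normal(np.zeros(2), Q)  # eps_t ~ N(0, Q_t), i.i.d.
    x = A @ x + B @ u + eps                        # x_{t+1} = A x_t + B u_t + eps_t
    eta = rng.multivariate_normal(np.zeros(1), R)  # eta_t ~ N(0, R_t), i.i.d.
    y = C @ x + eta                                # y_t = C x_t + eta_t
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)
```

`xs` holds the hidden states and `ys` the noisy measurements; the filter's job will be to recover the former from the latter.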
---
#### Why do we need a filter? I/II

- We can do inference: "What is the value of $x_t$?"(like the reconstruction model in a VAE)
- We can do learning: "What are the values of $A, B, C$?" (like what SGD does in a VAE)
---
#### Why do we need a filter? II/II
- Notation: $y_{1:N} = (y_1, \dots, y_N)$
- Smoothing: $p\left(x_t | y_{1:N}\right)$
- Full inference: $p\left(x_{1:N} |y_{1:N}\right)$ (not done with the KF)
---
#### Filtering
- Filtering: $p\left(x_t | y_{1:t}\right) = \dfrac{p(y_t|x_t)\color{red}{p(x_t|y_{1:t-1})}}{p(y_t|y_{1:t-1})}$
- $p(y_t|x_t)$ : observation model
- $p(y_t|y_{1:t-1})$: as everything is Gaussian, this normalization constant is not needed
- $\color{red}{p(x_t|y_{1:t-1})} = \int p(x_t|x_{t-1})p(x_{t-1}|y_{1:t-1}) d x_{t-1}$
- $p(x_t|x_{t-1})$: state transition model
- $p(x_{t-1}|y_{1:t-1})$ : result of the previous filtering step
---
#### What is the optimization goal?
- Standard KF approach: MSE minimization
$$ \min_{\hat{x}_t} \mathbb{E}_{p(x_{1:t}, y_{1:t})}\left\{||x_t - \hat{x}_t ||^2\right\} $$
- Problem: $x_t$ is often hidden ($\sim$ Hidden Markov Model, HMM)
- Alternative formulation (like the VAE, $\hat{x}_t$ encodes observation, which is decoded by $C$):
$$ \min_{\hat{x}_t} \mathbb{E}_{p(x_{1:t}, y_{1:t})}\left\{||\color{red} {y_t} - \color{red} C \hat{x}_t ||^2\right\} $$
---
#### A Maximum Likelihood (ML) approach
- Reminder: the model is linear, noise is Gaussian and i.i.d
$$ \max_{\hat{x}_{1:N}} \prod_{t=1}^{N} p(y_t | \hat{x}_t) \\
p(y_t | \hat{x}_t) \sim \mathcal{N}\left(C \hat{x}_t, R_t\right)
$$
- Maximizing the likelihood results in $y_t \approx C \hat{x}_t,$ i.e. $\min_{\hat{x}_t} \mathbb{E}\left\{||y_t - C \hat{x}_t ||^2\right\}$
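A quick numerical check of this equivalence for a single scalar observation (all numbers are illustrative):

```python
import numpy as np

# Illustrative scalar setting: one observation y_t, model y_t = C * x + noise.
C, R = 0.5, 0.2
y_t = 1.3
xhat = np.linspace(-10, 10, 4001)  # candidate estimates on a grid

# Gaussian log-likelihood log N(y_t; C*xhat, R), up to an additive constant:
loglik = -0.5 * (y_t - C * xhat) ** 2 / R
sq_err = (y_t - C * xhat) ** 2

ml_est = xhat[np.argmax(loglik)]  # maximum-likelihood estimate
ls_est = xhat[np.argmin(sq_err)]  # least-squares estimate
```

Both pick the same grid point (here $y_t / C$): the Gaussian log-likelihood is just a negatively scaled squared error plus a constant.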
---
#### How the filter works
1. Time update (should be called **model update**): use the prior model to predict the "future"
2. Measurement update: extract information from the measurements and correct the prediction with it

**Remark:** we need both steps: the model will be imperfect, e.g. due to simplifications or unmodeled effects (this is sometimes called _epistemic uncertainty_), and the measurements will be noisy as well (_aleatoric uncertainty_, from the Latin _alea_, dice, as in _Alea iacta est_).
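The two alternating steps can be sketched end to end; the concrete matrices below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])  # state transition (position, velocity)
C = np.array([[1.0, 0.0]])              # we only measure the position
Q = 1e-3 * np.eye(2)                    # model covariance
R = np.array([[0.1]])                   # measurement covariance

# Simulate a hidden trajectory and its noisy measurements.
x_true = np.zeros(2)
ys = []
for _ in range(200):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(C @ x_true + rng.multivariate_normal(np.zeros(1), R))

# The filter: alternate the two steps for every measurement.
x_post, P_post = np.zeros(2), np.eye(2)
for y in ys:
    # 1. Time (model) update: predict the "future" with the prior model.
    x_pred = A @ x_post
    P_pred = A @ P_post @ A.T + Q
    # 2. Measurement update: correct the prediction with the new data y.
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_post = x_pred + K @ (y - C @ x_pred)
    P_post = (np.eye(2) - K @ C) @ P_pred
```

After the loop, `x_post` is the filtered estimate of the hidden state and `P_post` quantifies its remaining uncertainty.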
---
#### Model update
1. Use the model to predict the next state: $\hat{x}_t = A x_{t-1} + B u_{t-1}$
2. Calculate the covariance of the predicted, transformed variable $A x_{t-1}$: $\hat{P}_t = A P_{t-1} A^T + Q_t$ (the covariance grows by the model uncertainty; as $u_t$ is deterministic, it does not change the covariance).
---
#### Measurement update I/III
1. Now we observe/measure some new data $y_t$.
2. This is predicted by the transformed variable $C \hat{x}_{t}$ that has the covariance: $S_t = C \hat{P}_t C^T + R_t$ (often referred to as innovation covariance)
3. To minimize $\mathbb{E}\left\{||y_t - C \hat{x}_t ||^2\right\}$ over $\hat{x}_t$ (the difference $y_t - C \hat{x}_t$ is called the innovation), an optimal weighting is required between $\hat{x}_t$ and $y_t - C \hat{x}_t$.
---
#### Measurement update II/III
4. The weighting is done by the Kalman gain $K_t$. Let's first calculate $\hat{K}_t$, the gain for predicting $C x_t$.
5. The gain weights each component inversely proportionally to its covariance, i.e. (in scalar form) $\hat{K}_t = \dfrac{C \hat{P}_t C^T}{C \hat{P}_t C^T + R_t}$
6. As $\hat{K}_t = C K_t$, $K_t$ is almost the same: $K_t = \dfrac{\hat{P}_t C^T}{C \hat{P}_t C^T + R_t}$, which for $R_t \rightarrow 0$ tends to $\dfrac{1}{C}$ (i.e. $C^{-1}$ in the matrix case)
---
#### Measurement update III/III
7. The posterior state is $x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right) = \left(I - K_t C\right)\hat{x}_t + K_t y_t$
8. The posterior covariance combines the covariances of the two independent terms $\left(I - K_t C\right)\hat{x}_t$ and $K_t y_t$, namely: $P_t = \left(I - K_t C\right)\hat{P}_t\left(I - K_t C\right)^T + K_t R_t K_t^T$ (the Joseph form)
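This posterior covariance formula is known as the Joseph form; for the optimal gain it can be cross-checked numerically against the shorter expression $\left(I - K_t C\right)\hat{P}_t$ (the matrices below are random, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# An arbitrary symmetric positive-definite predicted covariance.
M = rng.normal(size=(2, 2))
P_pred = M @ M.T + np.eye(2)
C = np.array([[1.0, 0.0]])
R = np.array([[0.3]])

S = C @ P_pred @ C.T + R                 # innovation covariance
K = P_pred @ C.T @ np.linalg.inv(S)      # optimal Kalman gain

I = np.eye(2)
joseph = (I - K @ C) @ P_pred @ (I - K @ C).T + K @ R @ K.T
simple = (I - K @ C) @ P_pred            # valid only for the optimal K
```

The two agree for the optimal gain; the Joseph form is preferred in practice because it stays symmetric and positive definite even when $K_t$ is slightly off.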
---
#### How does $K_t$ affect the posterior? I/II
- If the model is not accurate (and we know that, i.e. $Q_t$ is "big"), then $\hat{P}_t$ will be "big", thus $K_t \approx \dfrac{\infty}{C\infty + R_t} = 1/C,$ so
$$x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right)_{\Bigr|_{(R_t=0 \ \mathrm{and}\ Q_t \rightarrow \infty)}} \\
= \hat{x}_t + \frac{1}{C}\left(y_t - C \hat{x}_t\right) = \frac{1}{C}y_t$$
---
#### How does $K_t$ affect the posterior? II/II
- On the other hand, if the observations are noisy (and we know that, i.e. $R_t$ is "big"), then $K_t \approx \dfrac{\hat{P}_t C^T}{C \hat{P}_t C^T + \infty} = 0,$ so
$$x_t = \hat{x}_t + K_t\left(y_t - C \hat{x}_t\right)_{\Bigr|_{(R_t \rightarrow\infty)}} \\
= \hat{x}_t + 0 \cdot \left(y_t - C \hat{x}_t\right) = \hat{x}_t$$
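Both limiting cases (plus the $R_t \rightarrow 0$ one from the gain slide) can be checked with a scalar gain; the numbers are illustrative:

```python
import numpy as np

def gain(P_pred, C, R):
    """Scalar Kalman gain: P_pred * C / (C * P_pred * C + R)."""
    return P_pred * C / (C * P_pred * C + R)

C = 2.0
k_noise_free = gain(1.0, C, 1e-12)   # R_t -> 0: trust the measurement
k_bad_model = gain(1e12, C, 0.5)     # Q_t (hence P_pred) -> inf: same limit
k_noisy_obs = gain(1.0, C, 1e12)     # R_t -> inf: ignore the measurement
```

The first two approach $1/C$ (the posterior becomes $y_t/C$), the last approaches $0$ (the posterior stays $\hat{x}_t$).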
---
#### Open questions
- **Q:** What if we do not know $Q_t$ and $R_t$? **A:** Estimate them from the data with the empirical covariance (and maybe smooth it with an exponential filter).
- **Q:** What if the model is nonlinear? **A:** Use the UKF (Unscented Kalman Filter); avoid the Extended Kalman Filter, as it offers no gain over the UKF but can mess things up with highly nonlinear systems.
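A sketch of the first answer under strong simplifying assumptions: zero-mean residuals are already available, and `alpha` is an arbitrary smoothing factor (a tuning knob, not a standard value):

```python
import numpy as np

rng = np.random.default_rng(3)
true_R = 0.25
# Stand-in residuals; in practice these would come from the filter's innovations.
residuals = rng.normal(0.0, np.sqrt(true_R), size=5000)

# Empirical variance, smoothed with an exponential filter.
alpha, R_hat = 0.005, 1.0
for r in residuals:
    R_hat = (1 - alpha) * R_hat + alpha * r**2
```

`R_hat` drifts toward the true measurement variance (0.25 here) while remaining robust to short noise bursts.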
---
#### Useful resources
- [Kalman filter demystified](https://arxiv.org/pdf/1811.11618.pdf)
- [Tutorial: The Kalman Filter](http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf)
- [Tutorial: The Kalman Filter (different one)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.408.5594&rep=rep1&type=pdf)
- [Another tutorial](https://web.archive.org/web/20130512192346/http://www.tristanfletcher.co.uk/LDS.pdf)
---
