# Efficient memory based world modelling for improved sample efficiency in RL

## Abstract

In this work, we design a latent world model of interactive environments by incorporating memory into the framework. We use recent progress in continuous-time state space models to design an efficient world model that maintains a memory of past observations and behaviour in order to predict the future. Our world model trains in parallel and can learn from very long state-action sequences. The learnt world model can be used to imagine trajectories in the latent space to augment a downstream Reinforcement Learning or control algorithm. The concept of optimal memory becomes more important when observations are pixels or otherwise high dimensional and the environment is partially observable, because deriving a state vector from the last few observations, or applying a Markovian assumption to the derived states, is then incorrect.

## Theory

### Background

#### High-Order Polynomial Projection

The High-order Polynomial Projection Operators (HiPPO) framework [(Gu et al.)](https://arxiv.org/pdf/2008.07669.pdf)[(Blog)](https://hazyresearch.stanford.edu/blog/2020-12-05-hippo) provides a theoretical basis for deriving a memory representation $c(t) \in \mathbb{R}^N$ of the history of a signal $f(t)$, by making $c(t)$ the coefficient vector of the optimal polynomial approximation of that history. At every time $t$, $c(t)$ holds an optimal set of parameters from which the past can be reconstructed up to a certain error. Once we choose a measure, that is, how much we care about each time in the past, we obtain an orthonormal polynomial basis onto which the function can be projected. For a given measure and basis, this optimal memory representation $c(t)$ has a closed form as a function of $f(t)$. Furthermore, $c(t)$ also follows a differential equation:

$$c'(t) = Ac(t) + Bf(t)$$

where $A$ and $B$ can be derived for a particular choice of measure, basis function and memory budget ($N$).
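As a concrete instance of "deriving $A$ and $B$ for a choice of measure", the sketch below builds the matrices usually quoted for the scaled Legendre (LegS) measure, with the sign folded into $A$ so that they fit the form $c'(t) = Ac(t) + Bf(t)$. Two caveats: the original HiPPO-LegS ODE carries an extra $1/t$ scaling that this time-invariant form omits, and the function name `hippo_legs` is a hypothetical helper introduced here for illustration, not something taken from the cited papers' code.

```python
import math


def hippo_legs(n_state):
    """Return (A, B) for the LegS measure as plain Python lists.

    A[n][k] = -sqrt(2n+1) * sqrt(2k+1)  if n > k
            = -(n + 1)                  if n == k
            = 0                         if n < k
    B[n]    = sqrt(2n+1)
    """
    A = [[0.0] * n_state for _ in range(n_state)]
    B = [math.sqrt(2 * n + 1) for n in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                A[n][k] = -(n + 1.0)
    return A, B


# Memory budget N = 4 (an arbitrary illustrative choice).
A, B = hippo_legs(4)
```

Because this $A$ is lower triangular, its eigenvalues are its diagonal entries $-(n+1)$, all of which are negative.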

Given this ODE form, one can derive a closed form of the memory, or of any linear projection of the memory, using existing ideas from state space models.

#### Continuous State Space Models

State Space Models (SSMs) are linear time-invariant systems that can be represented either as a linear ODE

$$\begin{aligned} x'(t) &= Ax(t) + Bu(t) \\ y(t) &= Cx(t) \end{aligned}$$

or as a convolution

$$\begin{aligned} k(t) &= Ce^{tA}B \\ y(t) &= (k * u)(t) \end{aligned}$$

where $u(t)$ and $y(t)$ are the input and output signals respectively. The state matrix $A \in \mathbb{C}^{N \times N}$ and the matrices $B \in \mathbb{C}^{N \times 1}$ and $C \in \mathbb{C}^{1 \times N}$ are the parameters of the model. Note that the convolution kernel can be interpreted as a linear combination of basis kernels $K(t) = e^{tA}B \in \mathbb{C}^{N \times 1}$, with $k(t) = \sum_{i = 1}^{N} C_{i}K_{i}(t)$. Intuitively, convolving the input signal with the basis kernels yields the state representation at every time up to a desired end time.

#### Connection to linear dynamical systems with control

The above state space model is a linear dynamical system with a linear control input $u(t)$. For a given $A$ and $B$, we can perform an eigendecomposition of $A$ and use the eigenvalues to analyse the stability of the system. [(More on this in the other hackMD)](https://hackmd.io/@RaS8zPG1SD6SK_7mxQVZog/HJogH2w_o/edit)

#### Discretization

We can discretize a state space model using a generalized bilinear transform, parameterized by $\alpha \in [0, 1]$. We integrate the ODE over one step and approximate the integral by a weighted combination of the two endpoint states:

$$\begin{aligned} x(t_k) - x(t_k - \Delta t_k) &= \int_{t_k - \Delta t_k}^{t_k}\big(Ax(t) + Bu(t)\big)\,dt \\ &\approx \Delta t_k A\big[(1-\alpha)x(t_k - \Delta t_k) + \alpha x(t_k)\big] + \Delta t_k B u(t_k) \end{aligned}$$

For brevity, we now replace $x(t_k)$ by $x_k$, $x(t_k-\Delta t_k)$ by $x_{k-1}$ and $\Delta t_k$ by $\Delta$. Solving for $x_k$ gives

$$x_k = \bar{A}x_{k-1} + \bar{B}u_k$$

where $\bar{A} = (I-\Delta\alpha A)^{-1}(I+\Delta(1-\alpha) A)$ and $\bar{B}=(I-\Delta\alpha A)^{-1}\Delta B$. We get the bilinear form with $\alpha=\frac{1}{2}$, forward Euler with $\alpha=0$ and backward Euler with $\alpha=1$.
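
The discretization rule can be checked numerically with a small sketch. To keep it free of dependencies, it assumes a diagonal $A$, stored as a list of its diagonal entries, so that the matrix inverse becomes an elementwise division; a general $A$ would need a linear solve instead. It also illustrates that unrolling the recurrence from a zero initial state is the same as a causal convolution of the input with the sequence $C\bar{A}^{l}\bar{B}$. All function names and toy values below are hypothetical and chosen only for illustration.

```python
import math


def discretize_diag(a, b, delta, alpha):
    """Discretize x' = A x + B u for diagonal A (list of diagonal entries).

    alpha = 0 is forward Euler, alpha = 1 backward Euler, alpha = 0.5 bilinear.
    """
    a_bar = [(1 + delta * (1 - alpha) * ai) / (1 - delta * alpha * ai) for ai in a]
    b_bar = [delta * bi / (1 - delta * alpha * ai) for ai, bi in zip(a, b)]
    return a_bar, b_bar


def run_recurrence(a_bar, b_bar, c, u):
    """Step x_k = A_bar x_{k-1} + B_bar u_k from x = 0 and return y_k = C x_k."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, bb, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys


def ssm_kernel(a_bar, b_bar, c, length):
    """Entries C A_bar^l B_bar for l = 0..length-1 (elementwise powers)."""
    return [sum(ci * ab ** l * bb for ci, ab, bb in zip(c, a_bar, b_bar))
            for l in range(length)]


def causal_conv(ker, u):
    """y_k = sum_{j=0}^{k} ker[j] * u[k-j]."""
    return [sum(ker[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


# Hypothetical toy system with N = 2 states.
a = [-0.5 + 1.0j, -2.0]
b = [1.0, 1.0]
c = [1.0, 0.5]
u = [1.0, 0.0, 0.5, -1.0, 2.0]
delta = 0.1

a_bar, b_bar = discretize_diag(a, b, delta, alpha=0.5)
y_rec = run_recurrence(a_bar, b_bar, c, u)
y_conv = causal_conv(ssm_kernel(a_bar, b_bar, c, len(u)), u)
max_gap = max(abs(p - q) for p, q in zip(y_rec, y_conv))

# Compare each rule's one-step factor with the exact exp(delta * a) for a = -2.
exact = math.exp(delta * -2.0)
errors = {alpha: abs(discretize_diag([-2.0], [1.0], delta, alpha)[0][0] - exact)
          for alpha in (0.0, 0.5, 1.0)}
```

In this toy setting `max_gap` is at the level of floating-point round-off, and the bilinear rule ($\alpha = \frac{1}{2}$) gives the smallest one-step error of the three.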

The equivalent convolution kernel for a sequence of length $L$ in discrete time becomes

$$Ker = [C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}]$$

#### Efficient implementation using a diagonal A

State space models of this form have previously been explored for long-range modelling, and made efficient, in the context of sequence modelling. This is achieved by:

1. Initializing $A$ with a class of matrices called HiPPO-LegS matrices, which give the model long-range modelling abilities. [(Gu et al.)](https://proceedings.neurips.cc/paper/2020/hash/102f0bb6efb3a6128a3c750dd16729be-Abstract.html)
2. Computing the kernel $Ker$ efficiently by making the state matrix $A$ diagonal or diagonal plus low rank. [(Gu et al. 2021)](https://arxiv.org/abs/2111.00396)[(Gu et al. 2022)](https://arxiv.org/abs/2206.11893)

### Method

### Optimal memory state for World Modelling
