# Project: Model-based RL

Goal of the project: to implement and benchmark several models for model-based RL.

## Possible models

- GP regression (a minimal baseline is sketched at the end of this document)
- [Variational GP-SSMs](http://papers.nips.cc/paper/5375-variational-gaussian-process-state-space-models) by Roger Frigola and others
- [NN Dynamics](https://arxiv.org/abs/1708.02596) by A. Nagabandi and others
- [Neural model in "Deep RL in a handful of trials"](https://arxiv.org/abs/1805.12114) by Kurtland Chua and others
- Recurrent NN model in [World Models](https://arxiv.org/abs/1803.10122) by David Ha & Jürgen Schmidhuber
- [Neural Ordinary Differential Equations](https://arxiv.org/abs/1806.07366) by Tian Qi Chen and others

## Possible benchmarks

Even though the model is supposed to abstract away the actions, in practice I've found that following a _policy_ similar to the one the model was trained with is useful. We should therefore try these metrics both with and without the policy that was used to generate the training data (see the metrics sketch at the end of this document).

- Prediction-based metrics
  - Likelihood of the trajectory
  - Root mean squared error (RMSE) of future states
  - Likelihood of the predicted reward
  - RMSE of the predicted reward
- Performance-based metrics
  - Total reward achieved (in combination with policy optimisation)
  - Data efficiency: learning progression, or the number of episodes until the final reward is reached

## Possible implementation starting point

My implementation of PILCO in Python, built on [OpenAI Gym](https://gym.openai.com/), [Bullet](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA/edit), [TensorFlow](https://tensorflow.org/) and [GPFlow](https://github.com/GPflow/GPflow).

Features:

- Wrapped Gym environments (with an alternative, Gaussian-integrable cost function; a sketch of such a wrapper appears below)
- Gaussian distribution class
- Facilities for moment matching
- Decently tested
- Uses Bullet (free but slightly worse) instead of MuJoCo
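## Sketches

As a starting point for the GP-regression model above, here is a minimal sketch, assuming the GPflow 2.x API: one GP per state dimension (GPflow's `GPR` broadcasts over output columns), trained to predict the one-step *change* in state from concatenated (state, action) inputs. The function names are illustrative, not part of any library.

```python
import numpy as np
import gpflow

def fit_gp_dynamics(states, actions, next_states):
    """Fit a GP mapping (state, action) -> state delta."""
    X = np.hstack([states, actions])   # inputs: concatenated states and actions
    Y = next_states - states           # targets: one-step state differences
    model = gpflow.models.GPR(
        data=(X, Y), kernel=gpflow.kernels.SquaredExponential()
    )
    # Maximise the marginal likelihood with the built-in SciPy optimiser.
    gpflow.optimizers.Scipy().minimize(
        model.training_loss, model.trainable_variables
    )
    return model

def predict_next_state(model, state, action):
    """Predictive mean and variance of the next state for a single input."""
    x = np.hstack([state, action])[None, :]
    delta_mean, delta_var = model.predict_f(x)
    return state + delta_mean.numpy()[0], delta_var.numpy()[0]
```

Predicting state deltas rather than raw next states is the usual choice for GP dynamics models, since the residuals are closer to zero-mean and the GP prior fits better.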
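The prediction-based and data-efficiency metrics could look roughly like the following sketch, assuming the model returns per-step Gaussian predictions (means and variances) over a rollout; all array and function names here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def state_rmse(true_states, pred_means):
    """RMSE of predicted future states, for arrays of shape (T, state_dim)."""
    return np.sqrt(np.mean((true_states - pred_means) ** 2))

def trajectory_log_likelihood(true_states, pred_means, pred_vars):
    """Log-likelihood of the true trajectory under factorised Gaussian
    per-step predictions with the given means and variances."""
    return norm.logpdf(
        true_states, loc=pred_means, scale=np.sqrt(pred_vars)
    ).sum()

def episodes_to_reward(episode_rewards, threshold):
    """Data-efficiency proxy: number of episodes until the total reward
    first reaches `threshold`; None if it never does."""
    for i, total in enumerate(episode_rewards, start=1):
        if total >= threshold:
            return i
    return None
```

The same two prediction metrics apply unchanged to predicted rewards by passing reward arrays instead of state arrays.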
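Finally, a hypothetical wrapper along the lines of the "Gaussian-integrable" cost in the feature list: a PILCO-style saturating cost, 1 − exp(−‖s − target‖² / (2ℓ²)), whose expectation under a Gaussian state distribution is available in closed form. This sketch assumes the classic 4-tuple `gym` step API; `target` and `lengthscale` are illustrative parameters, not taken from the actual implementation.

```python
import numpy as np
import gym

class SaturatingCostWrapper(gym.Wrapper):
    """Replace the env reward with the negative of a saturating cost
    centred on a goal state."""

    def __init__(self, env, target, lengthscale=1.0):
        super().__init__(env)
        self.target = np.asarray(target, dtype=np.float64)
        self.lengthscale = lengthscale

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        sq_dist = np.sum((obs - self.target) ** 2)
        cost = 1.0 - np.exp(-0.5 * sq_dist / self.lengthscale ** 2)
        return obs, -cost, done, info  # reward = negative saturating cost
```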