# BeNeRL-2022 workshop
https://rlg.liacs.nl/benerl-2022
## Ann Nowe, "RL learning the optimal policy"
* safe fuzzy control
* robust control
* inference mechanism in fuzzy control
* fuzzy set --> piecewise linear interpolation
* multi-agent RL based on **learning automata**, robust to noise and delayed signals (see the update sketch after this list)
* multi-type ants
* Pareto Q-learning
* interpretability: rule distillation, action influence graphs
* RL with Formal Guarantees
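
The learning-automata updates mentioned above are simple enough to sketch. Below is a minimal, hypothetical single-automaton example of the classic linear reward-inaction ($L_{R-I}$) scheme; the environment stub, names, and learning rate are my own illustration, not from the talk (the talk concerned the multi-agent setting).

```python
import numpy as np

def linear_reward_inaction(probs, action, reward, lr=0.1):
    """L_{R-I} update: move probability mass toward the chosen action,
    proportionally to a reward in [0, 1]; do nothing when the reward is 0."""
    probs = probs.copy()
    probs[action] += lr * reward * (1.0 - probs[action])
    other = np.arange(len(probs)) != action
    probs[other] -= lr * reward * probs[other]
    return probs / probs.sum()  # guard against numerical drift

# toy usage with a hypothetical 3-action stochastic environment
rng = np.random.default_rng(0)
success_prob = np.array([0.2, 0.8, 0.5])   # unknown to the automaton
p = np.ones(3) / 3
for _ in range(2000):
    a = rng.choice(3, p=p)
    r = float(rng.random() < success_prob[a])   # binary reward
    p = linear_reward_inaction(p, a, r)
print(p)  # probability mass should concentrate on action 1
```
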
## Vincent François-Lavet, "learn useful state abstractions in sequential decision making"
+ challenges in RL
+ computational constraints
+ safety constraints
+ due to limited exogenous states
+ abstract representations
+ compressed states
+ include enough information
+ allow reasoning
+ [model-based + model-free] built on abstract representations
+ encoder: maps inputs to hidden states
+ the encoder's output is the abstract/latent representation
+ **Information bottleneck**
+ a framing for learning the state representation
+ simple labyrinth
+ abstract representation of states in a labyrinth task
+ combined reinforcement learning via abstract representations
+ model and value function are learned from the abstract representation
+ learning the model: minimize a loss that trains the weights of both the encoder and the model-based components (see the sketch after this list)
+ entropy loss
+ interpretability loss
+ Planning
+ depth-$d$ estimation of the Q-value function
+ Generalization
+ meta-learning score on a distribution of labyrinths
+ Exploration
+ undirected exploration: `ε-greedy`
+ reward bonus for novelty via K nearest neighbors
+ Transfer learning
+ the agent is trained on a distribution of MDPs and evaluated on a new domain
+ component transfer learning: just fine-tune the encoder
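
A rough sketch of the "learn a model and values on top of an encoder" idea, plus a K-nearest-neighbor novelty bonus in abstract space. This is my own minimal PyTorch illustration under assumed shapes and loss weights, not the speaker's actual architecture; all names (`encoder`, `transition`, `reward_head`, `novelty_bonus`) and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

obs_dim, abstract_dim, n_actions = 16, 3, 4   # hypothetical sizes

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, abstract_dim))
transition = nn.Sequential(nn.Linear(abstract_dim + n_actions, 64), nn.Tanh(),
                           nn.Linear(64, abstract_dim))
reward_head = nn.Linear(abstract_dim + n_actions, 1)

def model_loss(obs, action_onehot, reward, next_obs):
    """Jointly trains the encoder and the model-based components:
    predict the next abstract state and the reward from (z, a)."""
    z, z_next = encoder(obs), encoder(next_obs)
    za = torch.cat([z, action_onehot], dim=-1)
    pred_next = z + transition(za)                    # residual transition in abstract space
    loss_transition = ((pred_next - z_next) ** 2).mean()
    loss_reward = ((reward_head(za).squeeze(-1) - reward) ** 2).mean()
    # entropy-like regularizer: push random pairs of abstract states apart
    # so the representation does not collapse to a single point
    shuffled = z_next[torch.randperm(z_next.shape[0])]
    loss_spread = torch.exp(-5.0 * (z_next - shuffled).norm(dim=-1)).mean()
    return loss_transition + loss_reward + loss_spread

def novelty_bonus(z_new, z_memory, k=5):
    """Exploration bonus: mean distance to the K nearest previously seen abstract states."""
    dists = torch.cdist(z_new, z_memory)              # (batch, memory)
    knn, _ = dists.topk(k, dim=-1, largest=False)
    return knn.mean(dim=-1)
```

An optimizer over the parameters of `encoder`, `transition`, and `reward_head` would minimize `model_loss` on replayed transitions; a Q-head on the abstract state can be trained model-free in parallel, and `novelty_bonus` can be added to the reward for directed exploration.
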
## Hado van Hasselt, "Credit Assignment"
https://hadovanhasselt.com/publications/
TD
TD($\lambda$)
ET($\lambda$)
Backward planning
Plan using replay, non-parametric model
Plan using a model
**How can we best use a learned model?**
Backward/Forward planning for credit assignment
Forward planning for behavior (promising)
+ Model-free credit assignment
+ propagate credit
+ ET($\lambda$) algorithm, simple weight update
Expected traces
+ use a function to estimate the expected trace
* TD(0)
* TD($\lambda$)
* ET($\lambda$): takes expectations compared to TD($\lambda$); lower variance, converges similarly to TD($\lambda$)
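
For reference, the standard accumulating-trace TD($\lambda$) update (my recollection of the textbook form, not copied from the slides):

$$
\delta_t = r_{t+1} + \gamma v_w(s_{t+1}) - v_w(s_t), \qquad
e_t = \gamma \lambda e_{t-1} + \nabla_w v_w(s_t), \qquad
w_{t+1} = w_t + \alpha \, \delta_t \, e_t .
$$

ET($\lambda$) replaces the sampled trace $e_t$ by a learned estimate $z_\theta(s_t) \approx \mathbb{E}[e_t \mid s_t]$, which is what lowers the variance.
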
Learning to assign credit
+ assume we see just one transition
+ TD: update the previous state
+ generalization: update related states
+ traces: update past states
+ planning: update more states
What is the best update to the whole value function?
+ **meta-gradients**
+ $w_{t+1} = w_t + \Delta w_t(\eta)$
+ the update $\Delta w_t$ is a function of the meta-parameters $\eta$; how do we learn $\eta$?
+ compute the best $\eta$
+ update can be fully parameterized
+ meta gradients vs MAML
+ Bootstrapped meta-learning
+ Bootstrap the learning itself
+ learning general updates
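
A compact way to write the meta-gradient idea (my sketch of the standard formulation, with notation assumed rather than taken from the slides):

$$
w_{t+1} = w_t + \Delta w_t(\eta_t), \qquad
\eta_{t+1} = \eta_t - \beta \, \nabla_\eta J\big(w_{t+1}(\eta_t)\big),
$$

i.e. the meta-parameters $\eta$ (e.g. $\gamma$, $\lambda$, or a fully parameterized update rule) are improved by differentiating a downstream objective $J$ through the inner update.
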
## Hendrik Baier, "A vision for explainable search and some first steps towards it"
Why **Explainable Search (ES)**, and how?
How can a search-based agent (one that uses search for learning) explain its behavior?
Explainable search: exploration of possible futures, evaluations and relationships and available choices
explaining a single myopic action is not enough
entire policies are not sufficient either
Challenges of ES?
+ Explanations as conversations: a two-way street, long-term interactions, integrated explanations of search and deep learning
+ Explanation-aware search
First steps
+ Explainable MCTS toolset
Post hoc explanations (after the search)
Tree simplification
Subtree summarization
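
As a toy illustration of post-hoc tree simplification and summarization (not the toolset from the talk; the node structure and the visit-count threshold are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Minimal MCTS node: action taken to reach it, visit count, mean value, children."""
    action: str
    visits: int
    value: float
    children: list["Node"] = field(default_factory=list)

def simplify(node: Node, min_visit_fraction: float = 0.1) -> Node:
    """Tree simplification: drop rarely visited subtrees so that only the
    lines of play that actually drove the decision remain in the explanation."""
    total = sum(c.visits for c in node.children) or 1
    kept = [c for c in node.children if c.visits / total >= min_visit_fraction]
    return Node(node.action, node.visits, node.value, [simplify(c) for c in kept])

def summarize(node: Node, depth: int = 0) -> None:
    """Subtree summarization: one line per kept node (action, visits, value)."""
    print("  " * depth + f"{node.action}: visits={node.visits}, value={node.value:.2f}")
    for child in node.children:
        summarize(child, depth + 1)
```
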
+ References
https://ir.cwi.nl/pub/30849
## Gergely Neu "Lifting the information ratio, an information-theoretic analysis of Thompson sampling"
+ Contents
+ Contextual bandits
+ Thompson sampling and the information ratio
+ Go contextual
+ **Thompson Sampling**
+ the first bandit algorithm, proposed by Thompson in 1933
+ a Bayesian algorithm: play each action according to its **posterior probability of being optimal** (a minimal sketch follows after this list)
+ the talk proves that the **Bayesian regret is bounded**
+ **Information ratio**: $\Gamma_t = \dfrac{\big(\mathbb{E}_t[\text{regret}_t]\big)^2}{\text{information gain}_t}$
+ if the information ratio is small, then whenever the regret is large the information gain is also large
+ K-armed logistic bandits
+ What did we learn?
+ the information ratio can be adapted to contextual bandits
+ References
+ Information-Theoretic Generalization Bounds for Stochastic Gradient Descent.
+ Online learning in MDPs with linear function approximation and bandit feedback.
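
A minimal Bernoulli Thompson sampling sketch with Beta posteriors; the arm means, horizon, and function name are made up for illustration.

```python
import numpy as np

def thompson_bernoulli(true_means, horizon, seed=0):
    """Thompson sampling for a Bernoulli bandit: keep a Beta(successes+1, failures+1)
    posterior per arm, sample one mean from each posterior, play the argmax."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    rewards = []
    for _ in range(horizon):
        sampled_means = rng.beta(successes + 1, failures + 1)  # one posterior sample per arm
        arm = int(np.argmax(sampled_means))                    # = sampling by prob. of optimality
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        rewards.append(reward)
    return np.array(rewards)

# toy run: average reward should approach the best arm's mean
rewards = thompson_bernoulli(true_means=[0.3, 0.5, 0.7], horizon=5000)
print("average reward:", rewards.mean(), "(best arm mean is 0.7)")
```
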
## Elise van der Pol, "Symmetry and Structure in Deep Reinforcement Learning"
https://www.elisevanderpol.nl/
+ MDP Homomorphism (a definition sketch follows at the end of this list)
+ MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning https://arxiv.org/abs/2006.16908
+ 'Multi-Agent MDP Homomorphic Networks': https://arxiv.org/pdf/2110.04495.pdf
+ requires fewer interactions!
+ How to automatically find the symmetries?
+ Challenges
+ symmetries in factored MDPs, partial observability
+ Learn the symmetries
+ Exploit the approximate symmetries, and partial symmetries
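
For reference, the usual MDP homomorphism definition (recalled from the Ravindran & Barto line of work; treat the exact notation as my reconstruction): a map $h = (\sigma, \{\alpha_s\})$ from $(\mathcal{S}, \mathcal{A})$ to an abstract MDP $(\bar{\mathcal{S}}, \bar{\mathcal{A}})$ that commutes with rewards and dynamics,

$$
\bar{R}\big(\sigma(s), \alpha_s(a)\big) = R(s, a), \qquad
\bar{T}\big(\sigma(s') \mid \sigma(s), \alpha_s(a)\big) = \sum_{s'' \in \sigma^{-1}(\sigma(s'))} T(s'' \mid s, a).
$$

Group symmetries of the MDP are a special case, and MDP homomorphic networks build equivariance to them directly into the policy/value network.
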
## People and institutes
+ Giacomo Spigler: sim2real
http://spigler.net/giacomo/publications.html
+ RL group in Leiden https://rlg.liacs.nl/home
+ Ann's group in Brussels: