# BeNeRL-2022 workshop

https://rlg.liacs.nl/benerl-2022

## Ann Nowe, "RL learning the optimal policy"

* safe fuzzy control
* robust control
* inference mechanism in fuzzy control
* fuzzy set --> piecewise linear interpolation
* multi-agent RL based on **learning automata**, robust to noise and delayed signals
* multi-type ants
* Pareto Q-learning
* Interpretability: rule distillation, action influence graphs
* RL with formal guarantees

## Vincent François-Lavet, "Learn useful state abstractions in sequential decision making"

+ Challenges in RL
  + computational constraints
  + safety constraints
  + limited exogenous states
+ Abstract representations
  + compressed states
  + include enough information
  + allow reasoning
+ [model-based + model-free] built on abstract representations
  + encoder maps inputs to hidden states
  + the abstract/latent representation is the output of the encoder
+ **Information bottleneck** view of learning the state representation
+ Simple labyrinth
  + abstract representation of states in a labyrinth task
+ Combined reinforcement learning via abstract representations
  + the model and the value function are both learned on the abstract representation
  + learning the model minimizes a loss that trains the weights of both the encoder and the model-based components
  + entropy loss
  + interpretability loss
+ Planning
  + depth-d estimation of the Q-value function
+ Generalization
  + meta-learning score on a distribution of labyrinths
+ Exploration
  + undirected exploration: `e-greedy`
  + novelty reward based on K nearest neighbors
+ Transfer learning
  + the agent is trained on a distribution of MDPs and evaluated on a new domain
  + component-wise transfer: fine-tune only the encoder

## Hado van Hasselt, "Credit Assignment"

https://hadovanhasselt.com/publications/

+ TD, TD($\lambda$), ET($\lambda$)
+ Backward planning
+ Plan using replay (non-parametric model)
+ Plan using a learned model

**How can we best use a learned model?**
+ backward/forward planning for credit assignment
+ forward planning for behavior (promising)

Model-free credit assignment
+ propagate credit
+ ET($\lambda$) algorithm: a simple weight update

Expected traces
+ use a function to estimate the expectation

* TD(0)
* TD($\lambda$)
* ET($\lambda$): takes expectations compared to TD($\lambda$); lower variance, converges similarly to TD($\lambda$) (see the sketch at the end of this section)

Learning to assign credit
+ assume we only see one transition
+ TD: update the previous state
+ generalization: update related states
+ traces: update past states
+ planning: update even more states

What is the best update to the whole value function?
+ **Meta-gradients**
  + $w_{t+1} = w_t + \Delta w_t(\eta)$
  + the update $\Delta w_t$ is a function of the meta-parameters $\eta$; how do we learn $\eta$?
  + compute the best $\eta$
  + the update can be fully parameterized
  + meta-gradients vs. MAML
  + Bootstrapped meta-learning: bootstrap the learning process itself
  + learning general update rules
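To make the "simple weight update" of the trace-based methods concrete, here is a minimal sketch of tabular TD($\lambda$) with accumulating eligibility traces. This is my own illustration, not code from the talk; the `env.reset()`/`env.step()` interface and the `policy` function are assumptions.

```python
import numpy as np

def td_lambda(env, policy, n_states, num_episodes=500,
              alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
    and policy(state) -> action.
    """
    v = np.zeros(n_states)           # value estimates
    for _ in range(num_episodes):
        e = np.zeros(n_states)       # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            # TD error for this transition
            target = r + (0.0 if done else gamma * v[s_next])
            delta = target - v[s]
            # bump the trace of the current state
            e[s] += 1.0
            # credit the TD error to all recently visited states, then decay traces
            v += alpha * delta * e
            e *= gamma * lam
            s = s_next
    return v
```

ET($\lambda$) would replace the sampled trace vector `e` with a learned expectation of it, which is what gives the lower variance mentioned above.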
## Hendrik Baier, "A vision for explainable search and some first steps towards it"

Why **Explainable Search (ES)**, and how? How can a search-based agent (one that uses search for learning and acting) explain its behavior?

Explainable search: explaining the exploration of possible futures, their evaluations, and their relation to the available choices.
+ explaining a single (myopic) action is not enough
+ explaining entire policies is not sufficient either

Challenges of ES
+ explanations as conversations: a two-way street, long-term interactions, integrated explanations of search and deep learning
+ explanation-aware search

First steps
+ Explainable MCTS toolset
  + post-hoc explanations (after the search)
  + tree simplification
  + subtree summarization
+ References: https://ir.cwi.nl/pub/30849

## Gergely Neu, "Lifting the information ratio: an information-theoretic analysis of Thompson sampling"

+ Contents
  + contextual bandits
  + Thompson sampling and the information ratio
  + going contextual
+ **Thompson sampling**
  + one of the first bandit algorithms, proposed by Thompson in 1933
  + a Bayesian algorithm: play each action with its **posterior probability of being optimal**
  + prove that the **Bayesian regret is bounded**
+ **Information ratio** $= (\text{expected regret})^2 / (\text{information gain})$
  + if the information ratio is small, then large regret implies large information gain
+ K-armed logistic bandits
+ What did we learn?
  + the information-ratio analysis can be adapted to contextual bandits
+ References
  + Information-Theoretic Generalization Bounds for Stochastic Gradient Descent
  + Online Learning in MDPs with Linear Function Approximation and Bandit Feedback

(A small Thompson sampling sketch is appended at the end of these notes.)

## Elise van der Pol, "Symmetry and Structure in Deep Reinforcement Learning"

https://www.elisevanderpol.nl/

+ MDP homomorphism

![](https://i.imgur.com/PJ212fU.png)
![](https://i.imgur.com/Pk1yMzQ.png)

+ MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning, https://arxiv.org/abs/2006.16908
+ Multi-Agent MDP Homomorphic Networks, https://arxiv.org/pdf/2110.04495.pdf
+ requires fewer interactions!
+ How to automatically find the symmetries?
+ Challenges
  + symmetries in factored MDPs, partial observability
  + learning the symmetries
  + exploiting approximate and partial symmetries

## People and institutes

+ Giacomo Spigler: sim2real, http://spigler.net/giacomo/publications.html
+ RL group in Leiden: https://rlg.liacs.nl/home
+ Ann's group in Brussels:
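Appended sketch for the Thompson sampling idea from Gergely Neu's talk (playing each action with its posterior probability of being optimal). This is my own illustration, not code from the talk; the Bernoulli reward model, Beta(1, 1) priors, and the `true_means` in the usage example are assumptions.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon=10_000, seed=0):
    """Thompson sampling for a K-armed Bernoulli bandit with Beta(1, 1) priors.

    Each round: sample a mean from every arm's posterior, pull the argmax,
    observe a 0/1 reward, and update that arm's Beta posterior.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta "alpha" parameters
    failures = np.ones(k)    # Beta "beta" parameters
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(successes, failures)   # one posterior sample per arm
        arm = int(np.argmax(theta))             # arm that looks optimal under the sample
        reward = rng.binomial(1, true_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# illustrative usage with made-up arm means
if __name__ == "__main__":
    print(thompson_sampling_bernoulli([0.3, 0.5, 0.7]))
```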