# Strength Through Diversity: Robust Behavior Learning via Mixture Policies

Link: https://openreview.net/forum?id=OE0YPHOL8O

Authors: Tim Seyde, Wilko Schwarting, Igor Gilitschenski, Markus Wulfmeier, Daniela Rus.

Written by: Pronoma Banerjee

---

### Abstract and Introduction

The paper discusses the following:

1. The authors propose the Hyperparameter Mixture Policies (HMP) algorithm, featuring diverse sub-policies covering different distribution types and parameterizations, as a new training method for learning a hierarchical policy over a mixture of low-level controllers. Advantages:
    * The method relies on a single trial instead of the conventional sequential and parallel repetition of experiments, by enabling agents to select and combine designs conditioned on the task.
    * The diverse sub-policies reduce the impact of individual design choices and unlock synergies between low-level components.
    * The hierarchical strategy improves robustness and reduces the need for hyperparameter tuning and the associated data requirements.
2. The authors show that HMP yields robust, data-efficient learning by demonstrating strong performance on the DeepMind Control Suite, Meta-World tasks, and a simulated ANYmal robot.

The paper is an extension of previous work on hierarchical reinforcement learning. The approach enables learning robots to:

* optimize a set of diverse controller designs concurrently for increased data-efficiency,
* select suitable controllers conditioned on the task themselves, reducing human parameter tuning,
* compose behaviors from multiple controller designs to exploit emergent synergies.

---

### Preliminaries

The controller optimization is formulated as a Markov Decision Process (MDP). An MDP is defined by the tuple {$S, A, T, R, γ$}, where:

* $S$ denotes the state space, which describes the current situation of the agent.
* $A$ denotes the action space, the set of decisions available to the agent at each time step.
* $T(s_{t+1}|s_t, a_t)$ is the transition function, giving the probability of reaching a particular next state from the current state under a given action.
* $R : S × A → \mathbb{R}$ is the reward mapping, giving the reward received at each step for a particular state and action.
* $γ ∈ [0, 1)$ is the discount factor, which limits the contribution of future rewards to the total return, reflecting the uncertainty about future rewards.

$s_t$ and $a_t$ denote the state and action at time $t$, respectively. A policy is a probability distribution over actions $a$ when the agent is at state $s$, with parameters $θ$:

$π_θ(a|s) = P[A_t = a \mid S_t = s]$

The discounted infinite-horizon return is defined by:

$G_t = \sum_{t'=t}^{\infty} γ^{t'-t} R(s_{t'}, a_{t'})$, where $s_{t+1} ∼ T(s_{t+1}|s_t, a_t)$ and $a_t ∼ π_θ(a_t|s_t)$.

The goal is to learn the optimal policy maximizing $G_t$ under unknown dynamics and reward mappings. This is normally achieved by modeling $π_θ(a_t|s_t)$ as a Gaussian distribution, with a neural network predicting the mean and variance from $s_t$. In this work, a more diverse class of policy distributions is considered.

---

### Hyperparameter Mixture Policies

The paper introduces Hyperparameter Mixture Policies (HMP) as an algorithm to train a hierarchical policy over a mixture of low-level controllers with distinct hyperparameters and distribution characteristics. The resulting policy is given by:

$π_θ(a|s) = \sum_{i=1}^{M} π_θ^H(i|s) \, π_θ^L(a|s,i)$,

where $π_θ^H(i|s)$ is the weight and $π_θ^L(a|s,i)$ the probability density of component $i$. Hence, $π^H$ is a high-level selector and the $π^L$ are sub-policies drawn from a diverse set of controller designs. The agent self-selects the most suitable controller for individual phases of a task at various stages of its learning process.
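As a concrete illustration of this mixture, the following minimal sketch (not the authors' implementation) evaluates $π_θ(a|s)$ at a single state for two heterogeneous components, a narrow Gaussian head and a two-bin (bang-bang) head; all numerical values and helper names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(a, weights, components):
    """pi(a|s) = sum_i pi_H(i|s) * pi_L(a|s, i), evaluated at a fixed state s.

    weights:    high-level selector probabilities pi_H(i|s), shape (M,)
    components: list of M callables returning the (pseudo-)density pi_L(a|s, i)
    """
    return sum(w * comp(a) for w, comp in zip(weights, components))

# Illustrative sub-policies for a 1-D action in [-1, 1] (values are made up):
gaussian_head = lambda a: norm.pdf(a, loc=0.2, scale=0.3)   # narrow Gaussian component
bang_bang_head = lambda a: 0.6 * (abs(a - 1.0) < 0.1) + 0.4 * (abs(a + 1.0) < 0.1)  # 2-bin component as indicator bumps

pi_H = np.array([0.7, 0.3])   # high-level selector output for the current state
print(mixture_density(0.25, pi_H, [gaussian_head, bang_bang_head]))
```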
**Policy improvement**

The authors build on an MPO-style actor-critic algorithm. The policy improvement relies on two stages:

1. A non-parametric policy $q(a|s)$ is optimized on samples from the state-action value function, under the constraint of remaining close in expectation to the current parametric policy $π_θ$.
2. The parametric policy (the actor) is then updated to better approximate the non-parametric target.

Hence, starting from unlearned parameters, both the non-parametric target and the parametric policy are refined toward convergence at each iteration. Advantages of using this algorithm:

1. It bypasses the need for gradient estimation via the likelihood-ratio or reparametrization trick.
2. It enables optimization of mixture distributions without continuous relaxation.

**Step 1: Fitting the non-parametric policy**

Since there is no access to the ground-truth state-action value function $Q$ (otherwise we wouldn't need to learn the policy in the first place), it is substituted by a learned approximation $Q_\phi$, parameterized by $\phi$. The objective at iteration $k$ is formulated as:

$\max_q J(q) = E_{s \sim D,\, a \sim q}[Q_\phi(s,a)]$ such that $E_{s \sim D}[\mathrm{KL}(q(a|s) \,\|\, \pi_{\theta_k}(a|s))] \leq \epsilon$,

where $\epsilon$ defines a trust region around the current parametric policy $\pi_{\theta_k}$, keeping the update from moving too far from the current policy in a single step. The above problem admits the closed-form solution:

$q_k(a|s) \propto \pi_{\theta_k}(a|s) \exp\left(\frac{Q_\phi(s,a)}{\eta}\right)$,

where $\eta$ is computed by minimizing the dual function:

$g(\eta) = \eta \epsilon + \eta \, E_{s \sim D}\left[\log \int \pi_{\theta_k}(a|s) \exp\left(\frac{Q_\phi(s,a)}{\eta}\right) da\right]$ (5)

**Step 2: Updating the parametric policy**

The parametric policy $\pi_\theta$ is optimized to approximate the non-parametric policy by minimizing their KL divergence. The update proceeds via a Lagrangian relaxation of the KL constraints, enabling gradient-based optimization. Bounding the KL divergence between the updated policy and the current policy $\pi_{\theta_k}$ ensures that the learned policy does not move too far in a single update. The KL constraints are further decoupled and defined independently for each distributional parameter $p$; to enable the training of diverse distributions with different constraints, they are enforced per component. Finally, the updated parameters $\theta_{k+1}$ are obtained as the solution of:

$\max_\theta \min_{\lambda_p > 0} L(\theta, \lambda_p) = E_{s \sim D,\, a \sim \pi_{\theta_k}}\left[\exp\left(\frac{Q_\phi(s,a)}{\eta}\right) \log \pi_\theta(a|s)\right] + \sum_p \lambda_p \left(\epsilon_p - E_{s \sim D}\left[\mathrm{KL}_p(\pi_\theta(a|s) \,\|\, \pi_{\theta_k}(a|s))\right]\right)$ (8)

where the first term corresponds to minimizing the KL divergence between the parametric policy and the non-parametric target (a weighted maximum-likelihood objective), and the second term is a sum over the decoupled constraints, each varying only along its respective distributional parameter and defined in terms of the component-specific Lagrangian multiplier $\lambda_p$ and KL bound $\epsilon_p$.
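To make the two stages concrete, here is a minimal numerical sketch under simplifying assumptions (a single state, a handful of sampled actions, illustrative critic values, and a grid search in place of a proper dual minimization); it is not the authors' implementation.

```python
import numpy as np

# Critic estimates Q_phi(s, a_j) for N_s actions a_j sampled from pi_{theta_k}
# at a single state s (values are illustrative).
q_values = np.array([1.2, 0.4, -0.3, 0.9, 0.1])
eps = 0.1  # trust-region bound epsilon

def dual(eta):
    """Sample-based estimate of g(eta) = eta*eps + eta*log E_pi[exp(Q/eta)] (eq. 5)."""
    return eta * eps + eta * np.log(np.mean(np.exp(q_values / eta)))

# Step 1 (non-parametric policy): pick eta minimizing the dual, then weight the
# sampled actions by exp(Q/eta); the weights define q_k up to normalization.
etas = np.linspace(0.05, 10.0, 2000)
eta = etas[np.argmin([dual(e) for e in etas])]
weights = np.exp(q_values / eta)
weights /= weights.sum()

# Step 2 (parametric policy): the actor update in (8) maximizes the weighted
# log-likelihood sum_j weights_j * log pi_theta(a_j | s), with per-parameter
# KL terms (Lagrangians lambda_p) keeping pi_theta close to pi_{theta_k}.
print(f"eta = {eta:.3f}, weights = {np.round(weights, 3)}")
```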
**Policy evaluation**

The authors use the Retrace algorithm to stabilize the learning of $Q_\phi$. The final optimization objective is therefore:

$\min_\phi L(\phi) = E_{\tau \sim D}\left[\left(Q_t^{ret} - Q_\phi(s_t, a_t)\right)^2\right]$ (10)

The infinite series in the expression for $Q_t^{ret}$ is truncated after 10 steps and bootstrapped from the target action-value network. To increase efficiency, two-step transitions are considered by squashing consecutive timesteps before adding them to memory.

**Combining continuous and discrete distributions**

For an action $a$ and a discrete mixture component $i$ with finite support $C_i$, the discrete policy is approximated by a piece-wise constant pseudo-density:

$\pi_\theta^L(a|s,i) = \sum_{\tilde{a} \in C_i} p_i(\tilde{a}|s) \cdot \mathbf{1}_{B_\delta(\tilde{a})}(a)$,

where $p_i(\tilde{a}|s)$ is the probability of $\tilde{a}$ under the original discrete distribution, $\mathbf{1}$ is the indicator function, and $B_\delta(\tilde{a})$ is a ball of radius $\delta$ around $\tilde{a}$. This improves the sharing of gradient information between continuous and discrete policies and enables discrete components to train on samples generated by continuous components, accelerating learning.
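A minimal sketch of this pseudo-density, with illustrative bin centers, probabilities, and radius $\delta$, could look as follows.

```python
import numpy as np

def discrete_pseudo_density(a, bin_centers, bin_probs, delta):
    """pi_L(a|s,i) = sum over support points a_tilde of p_i(a_tilde|s) * 1[a in B_delta(a_tilde)].

    bin_centers: finite support C_i of the discrete component, shape (n_bin,)
    bin_probs:   probabilities p_i(a_tilde|s) predicted for the current state, shape (n_bin,)
    delta:       radius of the ball around each support point
    """
    inside = np.abs(a - bin_centers) <= delta   # indicator of B_delta(a_tilde)
    return float(np.sum(bin_probs * inside))

# Example: a 2-bin (bang-bang) component over a 1-D action in [-1, 1].
centers = np.array([-1.0, 1.0])
probs = np.array([0.3, 0.7])
print(discrete_pseudo_density(0.95, centers, probs, delta=0.1))  # -> 0.7
print(discrete_pseudo_density(0.00, centers, probs, delta=0.1))  # -> 0.0
```

In the mixture, this pseudo-density plays the role of $\pi_\theta^L(a|s,i)$ for the discrete components, so continuous and discrete heads can be evaluated on the same sampled actions.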
#### The algorithm

To summarize, the algorithm proceeds as follows:

1. Initialize the step and target-update counts $N_{step}$ and $N_{target}$, the number of action samples per state $N_s$, and the KL bounds $\epsilon$.
2. While $k \leq N_{step}$ (outer loop over iterations $k$):
3. Nested for loop from 1 to $N_{target}$:
4. Sample a batch of trajectories $\tau$ from memory $B$.
5. Estimate expectations using $N_s$ action samples from $\pi_{\theta_k}$.
6. Compute gradients over the batch:
7. the gradient for $\pi$ from equation (8),
8. the gradient for $\eta$ from equation (5),
9. the gradient for the Lagrangian multipliers $\lambda_p$ from equation (8),
10. the gradient for the action-value function $Q_\phi$ from equation (10).
11. Apply the gradients to update $\pi_{\theta_{k+1}}, \eta, \lambda_p, Q_\phi$. Update $k \leftarrow k+1$. Exit the for loop.
12. Update the target networks $\pi_\theta' = \pi_\theta$, $Q_\phi' = Q_\phi$. Exit the while loop.

#### Game theoretic aspects and multi-agent optimization

All agents are provided with a common mixture of policies built from a diverse set of low-level controllers that differ in their parameterizations; i.e., at each time step, every agent optimizes its choice from a common pool of low-level sub-policies, guided by a high-level policy for the particular task. Each task-specific policy can be viewed as a Markov strategy. The entire training process can then be seen as a two-player game between the non-parametric and parametric policies, each composed of a combination of controller designs: the non-parametric policy is optimized under the constraint of remaining close in expectation to the parametric policy $\pi_\theta$, and the parametric policy is then updated to better approximate the non-parametric target. The algorithm iterates until the agent reaches a Markov perfect equilibrium with respect to its policies.

---

### Experiments

The experiments and inferences are summarized as follows:

* **Component specialization within a diverse policy:** The agent combines a narrow Gaussian policy ($σ_{ini}$ = 0.3) with a coarse Categorical ($n_{bin}$ = 2) policy. The agent leverages bang-bang control for movements and continuous control for stabilization and intricate contact phases following an action. Applying t-SNE dimensionality reduction yields consistent clustering across trajectories, indicating consistent component specialization. This aligns with human intuition and highlights the promise of composition, further motivating HMPs.
* **Evaluating synergies arising from heterogeneous mixtures:** A narrow Gaussian policy ($σ_{ini}$ = 0.3) is combined with a Categorical policy ($n_{bin} ∈ \{2, 9\}$). Pairing the narrow Gaussian with the bang-bang ($n_{bin}$ = 2) policy significantly improves performance and guards against component failures induced by the bang-bang control. Coarse categorical control drives exploration, whereas fine categorical control enables accurate tracking.
* **Analysis of diverse distributions:** The agent's performance is evaluated with a Gaussian N ($σ_{ini}$ = 1.0), a Kumaraswamy K ($c_{ini}$ = 1.0), a Categorical C ($n_{bin}$ = 5), and a Discrete Gaussian D ($n_{bin}$ = 5), as well as their combination into an HMP (referred to as NKCD). The combined HMP solves all tasks and also guards against the component failures induced by K and the premature convergence induced by D. This underlines the robust performance that diverse HMPs provide by evaluating multiple controller designs jointly, reducing environment-specific tuning and guarding against component failure. NKCD outperforms the DDPG baseline and performs competitively with RHPO (which leverages a homogeneous mixture policy consisting of 4 MPO-type Gaussians with $σ_{ini}$ = 1.0). This underlines HMP's ability to enable data-efficient learning by training multiple policy designs in parallel and transferring problem-specific controller selection to the agent.
* **Random sub-policy parameterizations:** The sub-policies are sampled with random hyperparameters, varying the distribution type, initialization, architecture, and activations of each component. The randomized HMP performs competitively with an optimized RHPO agent.
* **HMP versus HOOF:** HOOF (TNPG) and HOOF (A2C) agents, which use gradient-free hyperparameter optimization, are compared with the NKCD HMP agent. The diverse NKCD mixture displays competitive performance on these benchmarks without any domain-specific fine-tuning.
* **Hyperparameter variations** (see the configuration sketch after this list):
    1. **Initialization:** The initial standard deviation of the Gaussian policy heads is varied over $σ_{ini} ∈ \{0.3, 1.0, 3.0\}$ (standard value 1.0), since it can significantly impact the rate of convergence. While a large variance drives exploration, a lower variance helps avoid instability. Generally, the diverse policy improves performance over the weaker component, yielding a robust controller that can prevent individual failure modes induced by high variance.
    2. **Architecture:** The layer structure is varied with $π_l ∈ \{[20], [200], [200, 200]\}$, with standard value $π_l^s = [200]$. Performance is observed to be robust to the architecture, with increased capacity slightly improving performance.
    3. **Activations:** The policy activations are varied with $π_a ∈$ {ELU, Leaky ReLU, Tanh}, the standard activation being ELU. The heterogeneous mixture improves performance over both the homogeneous mixture and the individual components on Control Suite tasks, whereas homogeneous ELU activations are better suited to ANYmal. This is the only instance where the heterogeneous mixture reduced performance.
    4. **Learning rate:** Policy learning rates are varied over $α_{LR} ∈ 2×\{10^{−5}, 10^{−4}, 10^{−3}\}$, with $α_{LR}^s = 2×10^{−4}$. Smaller learning rates are shown to reduce efficiency. However, pairing a fast head with a slow head yields competitive performance, significantly improving over the individual slow component.
* **Robustness of the High-level Controller:** The robustness of the high-level module to loss of state information and to adversarial components is evaluated. The high-level controller is able to adapt its component selection accordingly to yield robust converged performance. This underlines HMP's ability to efficiently select and focus on low-level controllers that are well-suited for a given task, while guarding against potential failure modes of individual components.
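To make the varied design axes concrete, a hypothetical specification of a heterogeneous sub-policy set might look like the sketch below; the numerical values are those quoted above, while the structure and field names are purely illustrative.

```python
# Hypothetical specification of a heterogeneous sub-policy set, spanning the
# design axes varied in the experiments (distribution type, initialization,
# architecture, activation, learning rate). Field names are illustrative.
sub_policies = [
    {"dist": "gaussian",          "sigma_init": 0.3, "layers": [200],      "activation": "elu",        "lr": 2e-4},
    {"dist": "gaussian",          "sigma_init": 1.0, "layers": [200, 200], "activation": "leaky_relu", "lr": 2e-4},
    {"dist": "kumaraswamy",       "c_init": 1.0,     "layers": [200],      "activation": "tanh",       "lr": 2e-4},
    {"dist": "categorical",       "n_bins": 5,       "layers": [20],       "activation": "elu",        "lr": 2e-3},
    {"dist": "discrete_gaussian", "n_bins": 5,       "layers": [200],      "activation": "elu",        "lr": 2e-5},
]
```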
---

### Conclusion

The paper proposes the use of diverse mixture policies, which combine different distributions and parameterizations to train a hierarchical policy in a single trial, effectively mitigating the computational costs incurred by parallel and sequential repetition of experiments. HMPs induce diversity that supports component specialization during different phases of a task and can also accelerate learning. This approach increases robustness and helps accelerate research in reinforcement learning for complex dynamic robots, in particular when extensive computational resources are not available.