## General Response

Thanks for all the valuable comments. We address each concern and question separately. Overall, we have revised our title and abstract so that the manuscript better aligns with our research aim. Our new title is **"Symmetric replay training: enhancing sample efficiency in deep reinforcement learning for combinatorial optimization."** Please kindly check our revised manuscript as well.

---

## Reviewer zPsa (Rating: 6)

**W1: the novelty of this work is not very significant.**

As you commented, our algorithm is both straightforward and effective. Our key novelty, beyond this simplicity, lies in proposing a new replay training method that can be applied to both off-policy and on-policy reinforcement learning in combinatorial optimization. Until now, prior replay training methods have primarily been designed for off-policy reinforcement learning. Please take the significance of this contribution into consideration.

**W2, Q1: why does applying SRT on top of such symmetric architectures have additional benefits?**

Solution symmetries, defined as situations where multiple trajectories lead to identical solutions, are challenging to incorporate directly into architectural structures. Therefore, loss optimization is needed to induce solution symmetries. In DevFormer, an additional loss term is introduced to account for solution symmetries; however, these symmetries are not induced in a structured way. On the other hand, GFlowNet introduces a new kind of loss function that accounts for solution symmetries by modeling the sequential decision-making in CO on a directed acyclic graph (DAG) and treating flows on the DAG as a policy. Our two-step learning procedure can be used along with these DRL methods, and SRT can further induce solution symmetries effectively.

It is noteworthy that our algorithm can also easily cooperate with symmetric architectures like the SE(3) Transformer [1] and EGNN [2], which exhibit architectural symmetries in state-level representations (e.g., a rotation of an image yields the same representation). Since these models (including permutation-invariant NNs) do not consider solution symmetries related to action trajectories, further learning processes like SRT are required to induce solution symmetries.

**Answer for W3 and Q2: how simultaneous updates were performed.**

In our two-step training, the model is updated to minimize the policy gradient loss in Step A and then updated to minimize the imitation loss in Step B; these updates are conducted alternately. In the ablation, by contrast, after calculating the RL loss we collect the high-reward samples using a greedy rollout without updating the model, compute $L_{SRT}$ on these samples, and then update the model once with $L = L_{RL}+L_{SRT}$. We revised the manuscript to clarify this (see the sketch below).

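To make the distinction concrete, here is a minimal sketch of the two update schemes under a toy categorical policy; `pg_loss` and `srt_loss` are stand-ins for the Step-A policy-gradient loss and the Step-B imitation loss on symmetric high-reward samples, not our actual implementation.

```python
# Minimal sketch (not the paper's code): two-step SRT training vs. the
# single combined-loss update used in the ablation.
import torch

policy = torch.nn.Linear(4, 3)              # toy policy: state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def pg_loss(states, actions, rewards):
    """Stand-in REINFORCE loss: -E[R * log pi(a|s)] (baseline omitted)."""
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(rewards * logp_a).mean()

def srt_loss(states, sym_actions):
    """Stand-in imitation loss: NLL of symmetric high-reward actions."""
    logp = torch.log_softmax(policy(states), dim=-1)
    return -logp.gather(1, sym_actions.unsqueeze(1)).squeeze(1).mean()

def two_step_update(rl_batch, sym_batch):
    # Step A: policy-gradient update.
    opt.zero_grad()
    pg_loss(*rl_batch).backward()
    opt.step()
    # Step B: separate imitation (replay) update on symmetric trajectories.
    opt.zero_grad()
    srt_loss(*sym_batch).backward()
    opt.step()

def simultaneous_update(rl_batch, sym_batch):
    # Ablation: one update on the combined loss L = L_RL + L_SRT.
    opt.zero_grad()
    (pg_loss(*rl_batch) + srt_loss(*sym_batch)).backward()
    opt.step()

rl_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)), torch.randn(8))
sym_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)))
two_step_update(rl_batch, sym_batch)
simultaneous_update(rl_batch, sym_batch)
```
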
**Answer for Q3: overemphasizing black-box nature (SRT is positioned as a general-purpose technique)**

We agree that SRT provides general-purpose replay training for CO. Our intention was to emphasize that sample efficiency matters for real-world applications of CO, such as hardware optimization and molecular optimization. We revised the manuscript based on your suggestion.

#### Reference

[1] Fuchs, Fabian, et al. "SE(3)-Transformers: 3D roto-translation equivariant attention networks." NeurIPS, 2020.
[2] Satorras, Víctor Garcia, Emiel Hoogeboom, and Max Welling. "E(n) equivariant graph neural networks." ICML, 2021.

------

## Reviewer t4AA (rating: 6)

**Answer for W1: symmetries in the chemical literature**

We agree that there are prior works on symmetries in both the RL and chemical discovery communities. Compared to these works, the major novelty of ours is that we reuse existing (expensive) samples from the RL training phase and exploit prior knowledge of symmetries to obtain "free" samples, which are used for replay training with imitation learning to make RL training more sample efficient. In particular, the paper "All SMILES Variational Autoencoder for Molecular Property Prediction and Optimization" focuses on how to make VAE representations invariant to symmetric transformations, whereas we focus on an architecture-agnostic method to enhance the sample efficiency of RL training.

**Answer for W2: Experimental results for molecules may not be as impressive compared to the recent research**

We agree. On the scientific discovery side, there can be better candidate algorithms, such as genetic algorithms. On the other hand, RL is a general tool that can cope with general-purpose combinatorial discovery and optimization without the need for specific domain knowledge. In addition, we revised the caption of Table 3 to avoid being misleading.

**Answer for why RL is used**

It is challenging to sample solutions in CO, as they are high-dimensional. To sample a solution effectively, the joint distribution can be factorized as follows:

$$x\sim p(x) = \prod_{t=1}^T p(x_t|s_t)$$

Sampling a solution from the factorized distribution can be regarded as sequential decision-making, which RL models effectively. Our research focuses on auto-regressive RL, which constructs solutions sequentially, as this is advantageous for avoiding infeasible solutions. Since CO contains (hard) constraints, generating feasible solutions is important. We employ RL-based constructive policies to satisfy feasibility while maintaining computational tractability in high-dimensional spaces. This leads to an inevitable ordering bias and, thus, solution symmetries. Instead of suppressing these symmetries, we use them to systematically explore the action-trajectory space by replaying free symmetric trajectories. As we do not restrict the architecture to induce symmetries, our method flexibly works with various DRL methods. In summary, constructive RL policies are advantageous for solving CO problems effectively, and in this study we propose a generic method that leverages the symmetric trajectories induced by an artificial order to enhance sample efficiency.

**Questions**

- **Q1: hardness of TSP.** Euclidean TSP is the special case of TSP in which the distance is the Euclidean distance. In general, the cost function of TSP is defined as $\sum_{i,j} d_{ij}x_{ij}$, and regardless of how $d_{ij}$ is defined, TSP is NP-hard (a small sketch of this cost and its symmetric trajectories is given below).

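As a concrete (toy) illustration of the points in Q1 and Q2, the following sketch, which is our own example rather than the paper's code, evaluates the cost $\sum_{i,j} d_{ij}x_{ij}$ of a Euclidean tour and enumerates its $2N$ symmetric action trajectories (cyclic shifts and reversals), which all share the same cost; the coordinates and the uniform choice at the end are illustrative assumptions.

```python
# Toy example (not from the paper): the 2N symmetric action trajectories of
# one TSP tour (N rotations x 2 directions) all encode the same Hamiltonian
# cycle and therefore have identical cost sum_{i,j} d_ij x_ij.
import math, random

coords = [(0.0, 0.0), (1.0, 0.2), (0.8, 1.1), (0.1, 0.9), (0.5, 0.5)]  # assumed instance

def tour_cost(order):
    """Sum of Euclidean edge lengths of the closed tour visiting `order`."""
    n = len(order)
    return sum(math.dist(coords[order[i]], coords[order[(i + 1) % n]])
               for i in range(n))

def symmetric_trajectories(order):
    """All 2N trajectories: every cyclic shift, in both directions."""
    n = len(order)
    rotations = [order[i:] + order[:i] for i in range(n)]
    return rotations + [list(reversed(r)) for r in rotations]

tour = [0, 1, 2, 3, 4]
trajs = symmetric_trajectories(tour)
costs = [tour_cost(t) for t in trajs]
assert len(trajs) == 2 * len(tour)
assert max(costs) - min(costs) < 1e-9   # identical cost for all 2N trajectories

# Theorem 1 motivates replaying a trajectory drawn uniformly over all 2N
# symmetric variants rather than a fixed transformation (e.g., flip only).
replayed = random.choice(trajs)
print(replayed, round(tour_cost(replayed), 4))
```
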
- **Q2: Unclear explanation of Theorem 1.** $p(x|s)$ is the distribution over solutions. Since we have multiple action trajectories that give the same solution, $p(x|s_1) = \sum_{\vec{a} \in \vec{A}_x} \pi_\theta(\vec{a}|s_1)$ holds. Here, $\vec{A}_x$ denotes the space of trajectories associated with the solution $x$, i.e., $C(\vec{a}) = x$ for all $\vec{a} \in \vec{A}_x$. This theorem gives the theoretical background for why employing a randomized symmetric transformation policy is advantageous. For example, in TSP there are $2N$ symmetric trajectories. If we use a fixed symmetric transformation, e.g., flip only, the entropy of the policy is bounded. Therefore, to maximize the entropy of the symmetric transformation policy, we sample uniformly over all $2N$ possible symmetric trajectories. The experimental results below also support this, and we revised the manuscript for better readability.

| Method | K=100K | K=1M | K=2M |
| --- | --- | --- | --- |
| A2C | 6.630 ± 0.037 | 6.514 ± 0.049 | 6.115 ± 0.009 |
| + SRT (fixed) | 6.775 ± 0.079 | 6.198 ± 0.024 | 6.069 ± 0.014 |
| + SRT (ours) | 6.560 ± 0.051 | 6.171 ± 0.025 | 6.037 ± 0.005 |

- **Q3:** We revised the typo.

----

## Reviewer j1D8 (rating: 3)

***Summary:** Our research goal is to enhance the sample efficiency of **deep RL for Combinatorial Optimization**, which is crucial since practical CO problems often involve computationally expensive objective functions, e.g., black-box functions. While various methods have been suggested to address black-box optimization effectively, applying them directly to combinatorial optimization poses difficulties (details are discussed below). On the other hand, RL-based methods have also shown promising performance in CO; however, they require numerous samples, which is unaffordable when function evaluation is expensive. Specifically, **our research focuses on how to discover and induce an effective inductive bias based on the nature of CO**. In the following responses, we further discuss why we mainly use on-policy RL in CO and address each concern in detail.*

----

**W1:** The paper seems to have a **scattered focus**.
**W4:** It seems slightly **unclear which community the paper could contribute to**.

**Answer**
We agree that there were some misalignments in the manuscript. We believe our work contributes to the field of **deep RL for Combinatorial Optimization** where sample efficiency is required. For better alignment, we have revised the title and abstract.

**W2:** the baselines are not for black-box CO and there is no comparison with methods for black-box optimization.

**Answer**
Thanks for the insightful comments and suggestions. We have looked through the literature you recommended and added a literature review on black-box combinatorial optimization based on the suggested research in Appendix E. First of all, we additionally provide the results of a BO-based method (GPBO) and a genetic algorithm (Graph GA) according to your suggestion ([1,2,4] - Bayesian optimization, [3] - evolutionary computation). The table shows the average AUC over three independent seeds; detailed results are provided in Appendix D.3.

| | GPBO | Graph GA | REINVENT | REINVENT + SRT |
| --- | --- | --- | --- | --- |
| Avg. AUC | 0.609 | 0.619 | 0.613 | **0.633** |

Here, we would like to discuss why we solve CO problems using reinforcement learning, especially when function evaluation is expensive.

**Challenges of applying evolutionary computation algorithms to CO.** The paper [3] introduces evolutionary computation (EC) algorithms, including genetic algorithms (GA). While EC algorithms have a well-established reputation for delivering powerful performance across diverse tasks, they exhibit certain limitations. As mentioned in [3], *designing novel operators, which greatly affects EC algorithms' performance, requires domain knowledge.* This implies that careful algorithm design is required whenever the task changes. For example, though Graph GA achieves promising performance, its performance on other CO problems will vary depending on how well the crossover and mutation operators are designed.

**Challenges of applying Bayesian optimization algorithms to CO.** As indicated in [1,2,4], Bayesian optimization (BO) is a promising method for solving black-box optimization problems. The key idea is to employ a surrogate model, usually a Gaussian process, and optimize an acquisition function built upon the surrogate model to draw samples. BO has demonstrated powerful performance in the low-budget regime, especially in continuous spaces. However, in combinatorial spaces, constructing an accurate surrogate model poses a substantial challenge due to non-smoothness. Furthermore, the acquisition optimization is modeled as a quadratic integer programming problem, which is NP-hard [5]. Strategies to address this NP-hardness involve variational auto-encoders [6] or continuous relaxation [7]; yet, these approximations can result in significant performance degradation.

**Sampling in high dimensions is hard.** It is challenging to sample solutions in CO, as they are high-dimensional. To sample a solution effectively, the joint distribution can be factorized as follows:

$$x\sim p(x) = \prod_{t=1}^T p(x_t|s_t)$$

Sampling a solution from the factorized distribution can be regarded as sequential decision-making, which RL models effectively.

**Benefits of solving CO with RL.** In RL for CO, solutions are constructed by sequentially deciding on $x_t$. This constructive policy is advantageous for satisfying the constraints of CO, since decisions that lead to infeasible solutions can be restricted at each step; the fulfillment of constraints is a key property in CO. Another benefit of this approach is that we can construct a general solver that addresses various target problem instances instead of exclusively seeking optimal solutions for individual target instances, i.e.,

$$x \sim p(x|c) = \prod_{t=1}^T p(x_t|s_t, c)$$

In this research, we train a general solver for the travelling salesman problem (TSP) and the decap placement problem (DPP) in Sections 4.1 and 4.2.

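The sketch below is our own toy illustration (not the paper's implementation) of this constructive factorization for TSP: a solution is built one decision at a time and infeasible choices (already-visited cities) are masked out, so every completed trajectory is feasible; the random `logits` stand in for a learned policy.

```python
# Toy constructive (autoregressive) sampling: x ~ p(x) = prod_t p(x_t | s_t),
# with a feasibility mask so the sampled trajectory is always a valid tour.
import numpy as np

rng = np.random.default_rng(0)
num_cities = 6
logits = rng.normal(size=(num_cities, num_cities))   # stand-in policy scores

def construct_tour():
    """Sample a feasible visiting order city by city, masking visited cities."""
    visited, current = [0], 0                 # start from an arbitrary city
    while len(visited) < num_cities:
        scores = logits[current].copy()
        scores[visited] = -np.inf             # feasibility mask
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        current = int(rng.choice(num_cities, p=probs))   # x_t ~ p(. | s_t)
        visited.append(current)
    return visited                            # the edge back to city 0 is implicit

print(construct_tour())
```
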
**W3:** It remains unclear why on-policy RL is necessary for solving expensive black-box optimization problems.

**Answer**
Due to the combinatorial nature, 1) rewards are episodic and 2) value function estimation is notoriously hard. Here, an episodic reward means we cannot evaluate the objective function (i.e., the reward) in the middle of constructing a solution. Therefore, Monte Carlo-based methods, like REINFORCE, have been employed in the DRL-for-CO literature. **When we use Monte Carlo estimation, importance sampling is required for off-policy methods, leading to high variance.** As evidence, the empirical results show that REINVENT, an on-policy method, outperforms off-policy methods such as GFlowNets and DQN on the PMO benchmark. Though on-policy methods are known to give superior performance in CO, **we also verified our method with an off-policy method, GFlowNet.** We added supplementary explanations for GFlowNet in the manuscript.

----

**W4: comparison with data augmentation and experience replay**

**Answer**

**(Augmentation)** We agree that data augmentation like cropping [8, 9], noise injection [8, 10], and Mixup [10, 11] can improve sample efficiency. However, these methods focus on visual RL scenarios (continuous spaces) and are hence not applicable to CO. For TSP in Euclidean space, POMO and Sym-NCO propose augmenting graph node features by rotating the coordinate values. As illustrated in Appendix D.4, SRT improves the sample efficiency of Sym-NCO. Note that the augmentations in Sym-NCO are done in input space, while ours utilizes symmetries in output (action-trajectory) space.

**(Experience Replay)** We have conducted additional experiments with A2C on TSP to compare our method to experience replay (ER). We set the buffer size to 10,000, which is the same as the epoch size. Note that when we apply experience replay to an on-policy method, importance sampling is required, as follows:

$$\mathbb{E}_{\vec{a} \sim p(\cdot|s_1)}[R(x)] = \mathbb{E}_{\vec{a} \sim q(\cdot|s_1)}\left[ \frac{p(\vec{a}|s_1)}{q(\vec{a}|s_1)}R(x) \right] = \mathbb{E}_{\vec{a} \sim q(\cdot|s_1)} \left[\prod_{t=1}^{T}\frac{p(a_t|s_t)}{q(a_t|s_t)}R(x)\right],$$

where $p(\vec{a})$ is the training policy, $q(\vec{a})$ is the behavior policy that generated the experiences in the replay buffer, and $x=C(\vec{a})$. The importance weight term $\prod_{t=1}^{T}\frac{p(a_t|s_t)}{q(a_t|s_t)}$ makes the objective estimate high-variance, which leads to poor replay-training performance, as shown in the following table. The detailed experimental settings and results are also provided in Appendix D.2.

| Method | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 6.630 ± 0.037 | 6.115 ± 0.009 |
| A2C + ER (size = 20) | 7.085 ± 0.326 | 6.156 ± 0.033 |
| A2C + ER (size = 100) | 7.073 ± 0.293 | 6.139 ± 0.010 |
| A2C + SRT (ours) | *6.560 ± 0.051* | *6.037 ± 0.005* |

The results show that experience replay can degrade the DRL method due to the high variance introduced by importance sampling. In contrast, our method does not suffer from importance sampling because SRT updates the policy to maximize the likelihood of symmetric samples. Additionally, we provide results with GFlowNet, an off-policy RL method, on the molecular optimization task. Here, experience replay improves the sample efficiency of GFlowNet, but our method shows more effective results. The average AUC over three independent seeds is reported.

| GFlowNet | + Experience Replay | + Reward-prioritized ER | + SRT |
| -------- | -------- | -------- | -------- |
| 0.436 | 0.491 | 0.497 | *0.511* |

*Note that on-policy RL methods with Monte Carlo estimation have shown notable performance in the CO domain. Though augmentation and experience replay are well-known general approaches to improving sample efficiency, directly applying them to DRL for CO can yield insufficient performance gains and even lead to performance degradation.*

**W4: In addition, I have concerns about whether all expensive black-box optimizations can identify appropriate symmetries.**

**In general, finding appropriate symmetries can be tricky in black-box optimization. However, when a constructive policy is employed for CO, there are always symmetries we can utilize.** From a reinforcement learning perspective, a policy sequentially selects actions to generate a solution. In combinatorial optimization, solutions are represented as sets, while an RL policy generates a sequence (or a graph). Due to the relationship between sets (combinations) and sequences (permutations), there must be symmetries. For instance, there are 4! ways to create a set like [A, B, C, D] in sequence. By utilizing this relationship, we can easily find symmetric trajectories whether the objective function is black-box or not. Note that the generation process of RL can be canonicalized to a particular order (e.g., from left to right), but this introduces a significant bias. Hence, when tackling combinatorial optimization problems, it is crucial to consider symmetries in an appropriate way.

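As a toy illustration of the last point, and of why SRT needs no importance weights (cf. Q1 below), the sketch below, which is our own example under an assumed uniform constructive policy, enumerates the 4! symmetric construction orders of the set {A, B, C, D} and evaluates the imitation (replay) objective directly as a negative log-likelihood, with no ratio over a behavior policy.

```python
# Toy example (not the paper's code): every ordering of a set is a "free"
# symmetric trajectory for the same solution, and the SRT objective is just
# the NLL of those trajectories -- no importance ratios are involved.
import itertools, math

solution = ("A", "B", "C", "D")
sym_trajectories = list(itertools.permutations(solution))
assert len(sym_trajectories) == math.factorial(len(solution))   # 4! = 24

def log_prob(trajectory):
    """Assumed toy policy: pick one of the remaining items uniformly at each step."""
    n = len(trajectory)
    return sum(-math.log(n - t) for t in range(n))    # log(1/4 * 1/3 * 1/2 * 1)

# Imitation (replay) objective on the symmetric trajectories: average NLL.
srt_loss = -sum(log_prob(t) for t in sym_trajectories) / len(sym_trajectories)
print(round(srt_loss, 4))   # log(24) ≈ 3.1781 for this uniform toy policy
```
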
**Q1: Considering the symmetric loss, should we factor in importance sampling because methods like PPO and A2C are on-policy RL?**

We do not need to consider importance sampling because we use imitation learning, which is an off-policy method, as the replay training. We believe this is the significance of our approach over typical replay training methods for on-policy RL.

- **Q2:** pseudo-code
- **Answer:** We have added pseudo-code in Appendix B.1.
- **Q3:** How optimal are the results in Table 1 for TSP?
- **Answer:** In the synthetic setting, we assume that the number of reward calls is limited (although, in fact, the reward can be computed). Therefore, we can obtain the optimal solutions with off-the-shelf solvers like Concorde. The average optimal cost is 5.696.
- **Q4:** Why could the results in Table 3 differ from those in the original paper?
- **Answer:** We may use a different set of seeds from the original paper; we use [0, 1, 2, 3, 4] regardless of the oracle, and PMO did not report which seeds were used for its main results. We also use a different version of PyTDC (Therapeutics Data Commons [12]), which directly affects the Oracle score. We use the updated version (v0.4.0) since the previous version (v0.3.6) gives invalid scores for some tasks.
- **Q5:** Why do the results of PPO show high variance?
- **Answer:** As described in Appendix B.2, we searched (2x3x3) hyper-parameter combinations for tuning PPO. Note that PPO is not a popular choice in the field of neural combinatorial optimization because PPO requires critic estimation for high performance, and estimating the value function for combinatorial optimization is challenging. Additionally, the variance in Table 1 is not that large compared to others, e.g., PG-Rollout with 100K. Most importantly, we use the same hyper-parameters for PPO + SRT.

---

#### Reference

[5] Baptista, Ricardo, and Matthias Poloczek. "Bayesian optimization of combinatorial structures." ICML, 2018.
[6] Gomez-Bombarelli et al. "Automatic Chemical Design using a Data-driven Continuous Representation of Molecules." ACS Central Science, 2018.
[7] Oh, Changyong, et al. "Combinatorial Bayesian optimization using the graph Cartesian product." NeurIPS, 2019.
[8] Laskin et al. "Reinforcement learning with augmented data." NeurIPS, 2020.
[9] Yarats, Denis, Ilya Kostrikov, and Rob Fergus. "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels." ICLR, 2020.
[10] Fan et al. "SECANT: Self-expert cloning for zero-shot generalization of visual policies." ICLR, 2021.
[11] Zhang et al. "Mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412, 2017.
[12] Huang et al. "Therapeutics Data Commons: Machine learning datasets and tasks for therapeutics." NeurIPS Datasets and Benchmarks Track, 2021. (https://tdcommons.ai)

----

## Reviewer vaeD (rating: 6)

**not clear how this method compares to other combinatorial BBO**

**Answer**

*First of all, our research **focuses on constructive RL methods for combinatorial optimization** where function evaluation is expensive. Based on the suggested papers, we clarify our research scope and why we focus on constructive RL methods.*

Combinatorial Bayesian optimization solves a bi-level optimization in which the upper problem is surrogate regression and the lower problem is acquisition optimization. In combinatorial spaces, the acquisition optimization is modeled as a quadratic integer programming problem, which is NP-hard [A, C].
Note that the surrogate regression has cubic complexity when using a Gaussian process; thus, we need to solve problems with cubic complexity and NP-hardness at every iteration. To address the NP-hardness of acquisition optimization, various techniques have been proposed, including continuous relaxation [B], sampling with simulated annealing [C], genetic algorithms (BOSS [1] and GPBO [2]), and a random walk explorer (ChemBO [3]). While these methods have demonstrated competitive performance in lower-dimensional tasks like neural architecture search (NAS), they often demand substantial computation time, particularly in molecular optimization tasks [4]. In Appendix D.3, we have included the results of GPBO for comparison with our approach. Note that GPBO requires the design of domain-specific heuristic algorithms, which limits its extensibility to other tasks.

As pointed out in [C], an alternative approach is to utilize a variational auto-encoder (VAE) to map the high-dimensional combinatorial space into a "compact" continuous latent space and apply BO there, as in [5]. Though this approach has shown successful performance in molecular optimization, it also introduces variational error since a VAE maximizes a lower bound of the likelihood, known as the evidence lower bound (ELBO), rather than the likelihood itself.

*In addition, based on the suggested research, we added a literature review on black-box combinatorial optimization in Appendix E and the experimental results of Bayesian optimization on the molecular optimization tasks in Appendix D.3.*

**Grey-box problems rather than black-box**

**Answer**
We agree with your comment that our target problems are grey-box rather than black-box. Our target problems are CO problems whose objective function evaluation is expensive; one example is a CO problem with a black-box objective function and white-box constraints. The title and manuscript have been revised, as mentioned in the general response.

- **Minor suggestion:** We adjusted the y-scale of the plots in Figure 3 and also in Figure 6. Thanks for the suggestion.
- **Q1: Is it possible to utilize symmetries in other methods for BBO?**
- **Answer:** In reinforcement learning, a policy constructs a new solution by sequentially selecting actions, starting from an empty solution. Consider the case of determining a set with three components, where a solution is constructed as follows: [ ] -> [a] -> [a, b] -> [a, b, c]. This sequential construction gives rise to multiple trajectories (e.g., [ ] -> [b] -> [b, c] -> [b, c, a]), and our approach leverages this symmetry. On the other hand, in Bayesian optimization, new solutions are suggested by exploring the solution space starting from an initial solution, e.g., by Gibbs sampling within a 1-Hamming ball on a discrete space: [a, b, c] -> [d, b, c] -> [d, b, a]. There, no such symmetries as the ones we utilize in RL are available.
- **Q2:** Symmetries in RL-based BBO
- **Answer:** In DevFormer, an additional loss term is introduced to consider solution symmetries; however, these symmetries are not induced in a structured way. On the other hand, GFlowNet introduces a new kind of loss function that accounts for solution symmetries by modeling the sequential decision-making in CO on a directed acyclic graph (DAG) and treating flows on the DAG as a policy. Our two-step learning procedure can be used along with these DRL methods, and SRT can further induce solution symmetries effectively. It is noteworthy that our algorithm can also easily cooperate with symmetric architectures like the SE(3) Transformer [6] and EGNN [7], which exhibit architectural symmetries in state-level representations (e.g., a rotation of an image yields the same representation). Since these models do not consider solution symmetries related to action trajectories, further learning processes like SRT are required to induce solution symmetries.
- **Q3:** What happens if we use intermediate rewards to train the RL policy in TSP?
- **Answer:** First of all, a TSP solution is a Hamiltonian cycle covering all cities. The cost of the solution can vary based on the last segment, which involves returning to the starting point, and the last segment is automatically determined by the previous sequence. In other words, the reward is episodic. Training the model with intermediate rewards could introduce bias into the return value, so the performance is insufficient. Even among heuristic algorithms for TSP, farthest insertion, which forms a new cycle by inserting the farthest city, outperforms other heuristics. Therefore, the RL policy should be trained using the Monte Carlo method with episodic rewards to ensure unbiased learning.
- **Q4, Q5: about symmetric trajectories in TSP**
- **Answer:** In the TSP, the decision variable $x_{ij}$ takes the value of 1 if city $j$ is visited immediately after city $i$. For instance, when considering four cities, the sequences 1-2-3-4-1 and 2-3-4-1-2 represent the same solution (i.e., $x$ is the same). This implies an identical cycle and, consequently, the same decision variables in both sequences (a small check is sketched below). This is what we intended with the example on page 3. Note that a TSP solution (there is no specific starting point) is a Hamiltonian cycle. We have made revisions to the example in the manuscript.

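The following is a small check of the Q4/Q5 point, using our own toy code rather than the paper's: two rotated visiting sequences of the same tour map to identical decision variables $x_{ij}$ (the same directed edge set).

```python
# Toy check (not from the paper): rotated visiting sequences encode the same
# Hamiltonian cycle, i.e., identical decision variables x_ij
# (x_ij = 1 iff city j is visited immediately after city i).
def decision_variables(order):
    """Directed edge set {(i, j)} of the closed tour with visiting order `order`."""
    n = len(order)
    return {(order[t], order[(t + 1) % n]) for t in range(n)}

# 1-2-3-4-1 and 2-3-4-1-2 are different action trajectories, same solution.
assert decision_variables([1, 2, 3, 4]) == decision_variables([2, 3, 4, 1])
```
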
- **Q6: where the 20+% improvement for PPO+SRT over PPO is coming from**
- **Answer:** There were some typos; we revised the sentence as follows: "The most substantial improvement facilitated by SRT is observed in the case of GFlowNet, where a cost reduction of 3.76% is achieved when the sample size K=100K."
- **Q7:** When does the performance degradation happen?
- **Answer:** Based on the experimental results in Figure 5 (where the number of replay loops is increased), the model can suffer from overfitting when there are not enough symmetries to utilize. For example, unlike TSP, we cannot explicitly obtain symmetric solutions in the PMO benchmark. The symmetric transformation highly depends on the API; thus, it is hard to explore the symmetric space effectively (the maximum entropy is achieved with the uniform symmetric transformation policy according to Theorem 1), resulting in performance degradation due to early convergence in some cases. This can be mitigated by designing a more precise symmetric transformation policy or by adjusting the frequency of conducting SRT.

---

#### Reference

[1] Moss et al. "BOSS: Bayesian optimization over string spaces." NeurIPS, 2020.
[2] Tripp, Austin, Gregor N. C. Simm, and José Miguel Hernández-Lobato. "A fresh look at de novo molecular design benchmarks." NeurIPS 2021 AI for Science Workshop, 2021.
[3] Korovina et al. "ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations." AISTATS, 2020.
[4] Gao et al. "Sample efficiency matters: a benchmark for practical molecular optimization." NeurIPS, 2022.
[5] Gomez-Bombarelli et al. "Automatic Chemical Design using a Data-driven Continuous Representation of Molecules." ACS Central Science, 2018.
[6] Fuchs, Fabian, et al. "SE(3)-Transformers: 3D roto-translation equivariant attention networks." NeurIPS, 2020.
[7] Satorras, Víctor Garcia, Emiel Hoogeboom, and Max Welling. "E(n) equivariant graph neural networks." ICML, 2021.

We agree that SRT also requires domain knowledge.

Lastly, we found that the results of A2C (+ ART) had been reported using an earlier version of the results. (All experiments had been re-conducted when the code was merged; each task has different dependencies, so we use the lowest version.) That is the only change in the numbers. Sorry for forgetting to mark it in blue.

Related work in the field: Further related works on deep reinforcement learning (DRL) for combinatorial optimization have been added in Appendix E.

Potential Future Impact of This Work:

1. **Expandability for future studies.** One of the key strengths of this work is its expandability. The proposed method, replay training with imitation learning using symmetric action trajectories, applies to both on-policy and off-policy reinforcement learning. Furthermore, it exhibits additional improvements on DRL methods that already consider symmetries, such as Sym-NCO. Our work has impact because our method can easily be applied to various DRL methods for combinatorial optimization in the future.
2. **Meaningful take-home message for future research.** In combinatorial optimization, sample efficiency has been studied less than other challenges, such as distributional shift or scalability. This work brings up the importance of sample efficiency. The proposed method shows that properly exploiting solution symmetries, often regarded as a cause of the inevitably extended search space, can improve sample efficiency.

We believe our study can inspire future work to consider sample efficiency in DRL for CO for better practicality and to encourage properly utilizing symmetries in CO. Please consider our potential impact on future studies in this field, as this is the first work considering sample efficiency in DRL-based combinatorial optimization.
