## General Response

Thanks for all the valuable comments. We address each concern and question separately. Overall, we have revised our title and abstract so that the manuscript better aligns with our research aim. Our new title is **"Symmetric replay training: enhancing sample efficiency in deep reinforcement learning for combinatorial optimization."** Please kindly check our revised manuscript as well.

---

## Reviewer zPsa (Rating: 6)

**W1: the novelty of this work is not very significant.**

As you commented, our algorithm is both straightforward and effective. Our key novelty, beyond this simplicity, lies in proposing a new replay training method that can be applied to both off-policy and on-policy reinforcement learning in combinatorial optimization. Until now, prior replay training methods have primarily been designed for off-policy reinforcement learning. Please take the significance of this contribution into consideration.

**W2, Q1: why does applying SRT on top of such symmetric architectures have additional benefits?**

Solution symmetries, defined as situations where multiple trajectories lead to identical solutions, are challenging to incorporate directly into architectural structures. Therefore, loss optimization is needed to induce solution symmetries. In DevFormer, an additional loss term is introduced to account for solution symmetries; however, these symmetries are not induced in a structured way. On the other hand, GFlowNet introduces a new kind of loss function that accounts for solution symmetries by modeling the sequential decision-making in CO on a directed acyclic graph (DAG) and treating flows on the DAG as a policy. Our two-step learning procedure can be used along with these DRL methods, and SRT can further induce solution symmetries effectively.

It is noteworthy that our algorithm can also easily cooperate with symmetric architectures like the SE(3) Transformer [1] and EGNN [2], which exhibit architectural symmetries in state-level representations (e.g., a rotation of an image yields the same representation). Since these models (including permutation-invariant NNs) do not consider solution symmetries related to action trajectories, further learning processes like SRT are required to induce solution symmetries.

**Answer for W3 and Q2: how simultaneous updates were performed.**

In our two-step training, the model is updated to minimize the policy gradient loss in Step A and then updated to minimize the imitation loss in Step B; these updates are conducted alternately. In the ablation, by contrast, after calculating the RL loss we collect the high-reward samples using a greedy rollout without updating the model, compute $L_{SRT}$ on these samples, and then update the model once with $L = L_{RL}+L_{SRT}$. We revised the manuscript to clarify this (see the sketch below).

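To make the distinction concrete, here is a minimal sketch of the two update schemes under a toy categorical policy; `pg_loss` and `srt_loss` are stand-ins for the Step-A policy-gradient loss and the Step-B imitation loss on symmetric high-reward samples, not our actual implementation.

```python
# Minimal sketch (not the paper's code): two-step SRT training vs. the
# single combined-loss update used in the ablation.
import torch

policy = torch.nn.Linear(4, 3)              # toy policy: state -> action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def pg_loss(states, actions, rewards):
    """Stand-in REINFORCE loss: -E[R * log pi(a|s)] (baseline omitted)."""
    logp = torch.log_softmax(policy(states), dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(rewards * logp_a).mean()

def srt_loss(states, sym_actions):
    """Stand-in imitation loss: NLL of symmetric high-reward actions."""
    logp = torch.log_softmax(policy(states), dim=-1)
    return -logp.gather(1, sym_actions.unsqueeze(1)).squeeze(1).mean()

def two_step_update(rl_batch, sym_batch):
    # Step A: policy-gradient update.
    opt.zero_grad()
    pg_loss(*rl_batch).backward()
    opt.step()
    # Step B: separate imitation (replay) update on symmetric trajectories.
    opt.zero_grad()
    srt_loss(*sym_batch).backward()
    opt.step()

def simultaneous_update(rl_batch, sym_batch):
    # Ablation: one update on the combined loss L = L_RL + L_SRT.
    opt.zero_grad()
    (pg_loss(*rl_batch) + srt_loss(*sym_batch)).backward()
    opt.step()

rl_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)), torch.randn(8))
sym_batch = (torch.randn(8, 4), torch.randint(0, 3, (8,)))
two_step_update(rl_batch, sym_batch)
simultaneous_update(rl_batch, sym_batch)
```
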
**Answer for Q3: overemphasizing black-box nature (SRT is positioned as a general-purpose technique)**

We agree that SRT provides general-purpose replay training for CO. Our intention was to emphasize that sample efficiency matters for real-world applications of CO, such as hardware optimization and molecular optimization. We revised the manuscript based on your suggestion.

#### Reference

[1] Fuchs, Fabian, et al. "SE(3)-Transformers: 3D roto-translation equivariant attention networks." NeurIPS, 2020.
[2] Satorras, Víctor Garcia, Emiel Hoogeboom, and Max Welling. "E(n) equivariant graph neural networks." ICML, 2021.

------

## Reviewer t4AA (rating: 6)

**Answer for W1: symmetries in the chemical literature**

We agree that there are prior works on symmetries in both the RL and chemical discovery communities. Compared to these works, the major novelty of ours is that we reuse existing (expensive) samples from the RL training phase and exploit prior knowledge of symmetries to obtain "free" samples, which are used for replay training with imitation learning to make RL training more sample efficient. In particular, the paper "All SMILES Variational Autoencoder for Molecular Property Prediction and Optimization" focuses on how to make VAE representations invariant to symmetric transformations, whereas we focus on an architecture-agnostic method to enhance the sample efficiency of RL training.

**Answer for W2: Experimental results for molecules may not be as impressive compared to the recent research**

We agree. On the scientific discovery side, there can be better candidate algorithms, such as genetic algorithms. On the other hand, RL is a general tool that can cope with general-purpose combinatorial discovery and optimization without the need for specific domain knowledge. In addition, we revised the caption of Table 3 to avoid being misleading.

**Answer for why RL is used**

It is challenging to sample solutions in CO, as they are high-dimensional. To sample a solution effectively, the joint distribution can be factorized as follows:

$$x\sim p(x) = \prod_{t=1}^T p(x_t|s_t)$$

Sampling a solution from the factorized distribution can be regarded as sequential decision-making, which RL models effectively. Our research focuses on auto-regressive RL, which constructs solutions sequentially, as this is advantageous for avoiding infeasible solutions. Since CO contains (hard) constraints, generating feasible solutions is important. We employ RL-based constructive policies to satisfy feasibility while maintaining computational tractability in high-dimensional spaces. This leads to an inevitable ordering bias and, thus, solution symmetries. Instead of suppressing these symmetries, we use them to systematically explore the action-trajectory space by replaying free symmetric trajectories. As we do not restrict the architecture to induce symmetries, our method flexibly works with various DRL methods. In summary, constructive RL policies are advantageous for solving CO problems effectively, and in this study we propose a generic method that leverages the symmetric trajectories induced by an artificial order to enhance sample efficiency.

**Questions**

- **Q1: hardness of TSP.** Euclidean TSP is the special case of TSP in which the distance is the Euclidean distance. In general, the cost function of TSP is defined as $\sum_{i,j} d_{ij}x_{ij}$, and regardless of how $d_{ij}$ is defined, TSP is NP-hard (a small sketch of this cost and its symmetric trajectories is given below).

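As a concrete (toy) illustration of the points in Q1 and Q2, the following sketch, which is our own example rather than the paper's code, evaluates the cost $\sum_{i,j} d_{ij}x_{ij}$ of a Euclidean tour and enumerates its $2N$ symmetric action trajectories (cyclic shifts and reversals), which all share the same cost; the coordinates and the uniform choice at the end are illustrative assumptions.

```python
# Toy example (not from the paper): the 2N symmetric action trajectories of
# one TSP tour (N rotations x 2 directions) all encode the same Hamiltonian
# cycle and therefore have identical cost sum_{i,j} d_ij x_ij.
import math, random

coords = [(0.0, 0.0), (1.0, 0.2), (0.8, 1.1), (0.1, 0.9), (0.5, 0.5)]  # assumed instance

def tour_cost(order):
    """Sum of Euclidean edge lengths of the closed tour visiting `order`."""
    n = len(order)
    return sum(math.dist(coords[order[i]], coords[order[(i + 1) % n]])
               for i in range(n))

def symmetric_trajectories(order):
    """All 2N trajectories: every cyclic shift, in both directions."""
    n = len(order)
    rotations = [order[i:] + order[:i] for i in range(n)]
    return rotations + [list(reversed(r)) for r in rotations]

tour = [0, 1, 2, 3, 4]
trajs = symmetric_trajectories(tour)
costs = [tour_cost(t) for t in trajs]
assert len(trajs) == 2 * len(tour)
assert max(costs) - min(costs) < 1e-9   # identical cost for all 2N trajectories

# Theorem 1 motivates replaying a trajectory drawn uniformly over all 2N
# symmetric variants rather than a fixed transformation (e.g., flip only).
replayed = random.choice(trajs)
print(replayed, round(tour_cost(replayed), 4))
```
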
- **Q2: Unclear explanation of Theorem 1.** $p(x|s)$ is the distribution over solutions. Since we have multiple action trajectories that give the same solution, $p(x|s_1) = \sum_{\vec{a} \in \vec{A}_x} \pi_\theta(\vec{a}|s_1)$ holds. Here, $\vec{A}_x$ denotes the space of trajectories associated with the solution $x$, i.e., $C(\vec{a}) = x$ for all $\vec{a} \in \vec{A}_x$. This theorem gives the theoretical background for why employing a randomized symmetric transformation policy is advantageous. For example, in TSP there are $2N$ symmetric trajectories. If we use a fixed symmetric transformation, e.g., flip only, the entropy of the policy is bounded. Therefore, to maximize the entropy of the symmetric transformation policy, we sample uniformly over all $2N$ possible symmetric trajectories. The experimental results below also support this, and we revised the manuscript for better readability.

| Method | K=100K | K=1M | K=2M |
| --- | --- | --- | --- |
| A2C | 6.630 ± 0.037 | 6.514 ± 0.049 | 6.115 ± 0.009 |
| + SRT (fixed) | 6.775 ± 0.079 | 6.198 ± 0.024 | 6.069 ± 0.014 |
| + SRT (ours) | 6.560 ± 0.051 | 6.171 ± 0.025 | 6.037 ± 0.005 |

- **Q3:** We revised the typo.

----

## Reviewer j1D8 (rating: 3)

***Summary:** Our research goal is to enhance the sample efficiency of **deep RL for Combinatorial Optimization**, which is crucial since practical CO problems often involve computationally expensive objective functions, e.g., black-box functions. While various methods have been suggested to address black-box optimization effectively, applying them directly to combinatorial optimization poses difficulties (details are discussed below). On the other hand, RL-based methods have also shown promising performance in CO; however, they require numerous samples, which is unaffordable when function evaluation is expensive. Specifically, **our research focuses on how to discover and induce an effective inductive bias based on the nature of CO**. In the following responses, we further discuss why we mainly use on-policy RL in CO and address each concern in detail.*

----

**W1:** The paper seems to have a **scattered focus**.
**W4:** It seems slightly **unclear which community the paper could contribute to**.

**Answer**
We agree that there were some misalignments in the manuscript. We believe our work contributes to the field of **deep RL for Combinatorial Optimization** where sample efficiency is required. For better alignment, we have revised the title and abstract.

**W2:** the baselines are not for black-box CO and there is no comparison with methods for black-box optimization.

**Answer**
Thanks for the insightful comments and suggestions. We have looked through the literature you recommended and added a literature review on black-box combinatorial optimization based on the suggested research in Appendix E. First of all, we additionally provide the results of a BO-based method (GPBO) and a genetic algorithm (Graph GA) according to your suggestion ([1,2,4] - Bayesian optimization, [3] - evolutionary computation). The table shows the average AUC over three independent seeds; detailed results are provided in Appendix D.3.

| | GPBO | Graph GA | REINVENT | REINVENT + SRT |
| --- | --- | --- | --- | --- |
| Avg. AUC | 0.609 | 0.619 | 0.613 | **0.633** |

Here, we would like to discuss why we solve CO problems using reinforcement learning, especially when function evaluation is expensive.

**Challenges of applying evolutionary computation algorithms to CO.** The paper [3] introduces evolutionary computation (EC) algorithms, including genetic algorithms (GA). While EC algorithms have a well-established reputation for delivering powerful performance across diverse tasks, they exhibit certain limitations. As mentioned in [3], *designing novel operators, which greatly affects EC algorithms' performance, requires domain knowledge.* This implies that careful algorithm design is required whenever the task changes. For example, though Graph GA achieves promising performance, its performance on other CO problems will vary depending on how well the crossover and mutation operators are designed.

**Challenges of applying Bayesian optimization algorithms to CO.** As indicated in [1,2,4], Bayesian optimization (BO) is a promising method for solving black-box optimization problems. The key idea is to employ a surrogate model, usually a Gaussian process, and optimize an acquisition function built upon the surrogate model to draw samples. BO has demonstrated powerful performance in the low-budget regime, especially in continuous spaces. However, in combinatorial spaces, constructing an accurate surrogate model poses a substantial challenge due to non-smoothness. Furthermore, the acquisition optimization is modeled as a quadratic integer programming problem, which is NP-hard [5]. Strategies to address this NP-hardness involve variational auto-encoders [6] or continuous relaxation [7]; yet, these approximations can result in significant performance degradation.

**Sampling in high dimensions is hard.** It is challenging to sample solutions in CO, as they are high-dimensional. To sample a solution effectively, the joint distribution can be factorized as follows:

$$x\sim p(x) = \prod_{t=1}^T p(x_t|s_t)$$

Sampling a solution from the factorized distribution can be regarded as sequential decision-making, which RL models effectively.

**Benefits of solving CO with RL.** In RL for CO, solutions are constructed by sequentially deciding on $x_t$. This constructive policy is advantageous for satisfying the constraints of CO, since decisions that lead to infeasible solutions can be restricted at each step; the fulfillment of constraints is a key property in CO. Another benefit of this approach is that we can construct a general solver that addresses various target problem instances instead of exclusively seeking optimal solutions for individual target instances, i.e.,

$$x \sim p(x|c) = \prod_{t=1}^T p(x_t|s_t, c)$$

In this research, we train a general solver for the travelling salesman problem (TSP) and the decap placement problem (DPP) in Sections 4.1 and 4.2.

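The sketch below is our own toy illustration (not the paper's implementation) of this constructive factorization for TSP: a solution is built one decision at a time and infeasible choices (already-visited cities) are masked out, so every completed trajectory is feasible; the random `logits` stand in for a learned policy.

```python
# Toy constructive (autoregressive) sampling: x ~ p(x) = prod_t p(x_t | s_t),
# with a feasibility mask so the sampled trajectory is always a valid tour.
import numpy as np

rng = np.random.default_rng(0)
num_cities = 6
logits = rng.normal(size=(num_cities, num_cities))   # stand-in policy scores

def construct_tour():
    """Sample a feasible visiting order city by city, masking visited cities."""
    visited, current = [0], 0                 # start from an arbitrary city
    while len(visited) < num_cities:
        scores = logits[current].copy()
        scores[visited] = -np.inf             # feasibility mask
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        current = int(rng.choice(num_cities, p=probs))   # x_t ~ p(. | s_t)
        visited.append(current)
    return visited                            # the edge back to city 0 is implicit

print(construct_tour())
```
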
**W3:** It remains unclear why on-policy RL is necessary for solving expensive black-box optimization problems.

**Answer**
Due to the combinatorial nature, 1) rewards are episodic and 2) value function estimation is notoriously hard. Here, an episodic reward means we cannot evaluate the objective function (i.e., the reward) in the middle of constructing a solution. Therefore, Monte Carlo-based methods, like REINFORCE, have been employed in the DRL-for-CO literature. **When we use Monte Carlo estimation, importance sampling is required for off-policy methods, leading to high variance.** As evidence, the empirical results show that REINVENT, an on-policy method, outperforms off-policy methods such as GFlowNets and DQN on the PMO benchmark. Though on-policy methods are known to give superior performance in CO, **we also verified our method with an off-policy method, GFlowNet.** We added supplementary explanations for GFlowNet in the manuscript.

----

**W4: comparison with data augmentation and experience replay**

**Answer**

**(Augmentation)** We agree that data augmentation like cropping [8, 9], noise injection [8, 10], and Mixup [10, 11] can improve sample efficiency. However, these methods focus on visual RL scenarios (continuous spaces) and are hence not applicable to CO. For TSP in Euclidean space, POMO and Sym-NCO propose augmenting graph node features by rotating the coordinate values. As illustrated in Appendix D.4, SRT improves the sample efficiency of Sym-NCO. Note that the augmentations in Sym-NCO are done in input space, while ours utilizes symmetries in output (action-trajectory) space.

**(Experience Replay)** We have conducted additional experiments with A2C on TSP to compare our method to experience replay (ER). We set the buffer size to 10,000, which is the same as the epoch size. Note that when we apply experience replay to an on-policy method, importance sampling is required, as follows:

$$\mathbb{E}_{\vec{a} \sim p(\cdot|s_1)}[R(x)] = \mathbb{E}_{\vec{a} \sim q(\cdot|s_1)}\left[ \frac{p(\vec{a}|s_1)}{q(\vec{a}|s_1)}R(x) \right] = \mathbb{E}_{\vec{a} \sim q(\cdot|s_1)} \left[\prod_{t=1}^{T}\frac{p(a_t|s_t)}{q(a_t|s_t)}R(x)\right],$$

where $p(\vec{a})$ is the training policy, $q(\vec{a})$ is the behavior policy that generated the experiences in the replay buffer, and $x=C(\vec{a})$. The importance weight term $\prod_{t=1}^{T}\frac{p(a_t|s_t)}{q(a_t|s_t)}$ makes the objective estimate high-variance, which leads to poor replay-training performance, as shown in the following table. The detailed experimental settings and results are also provided in Appendix D.2.

| Method | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 6.630 ± 0.037 | 6.115 ± 0.009 |
| A2C + ER (size = 20) | 7.085 ± 0.326 | 6.156 ± 0.033 |
| A2C + ER (size = 100) | 7.073 ± 0.293 | 6.139 ± 0.010 |
| A2C + SRT (ours) | *6.560 ± 0.051* | *6.037 ± 0.005* |

The results show that experience replay can degrade the DRL method due to the high variance introduced by importance sampling. In contrast, our method does not suffer from importance sampling because SRT updates the policy to maximize the likelihood of symmetric samples. Additionally, we provide results with GFlowNet, an off-policy RL method, on the molecular optimization task. Here, experience replay improves the sample efficiency of GFlowNet, but our method shows more effective results. The average AUC over three independent seeds is reported.

| GFlowNet | + Experience Replay | + Reward-prioritized ER | + SRT |
| -------- | -------- | -------- | -------- |
| 0.436 | 0.491 | 0.497 | *0.511* |

*Note that on-policy RL methods with Monte Carlo estimation have shown notable performance in the CO domain. Though augmentation and experience replay are well-known general approaches to improving sample efficiency, directly applying them to DRL for CO can yield insufficient performance gains and even lead to performance degradation.*

**W4: In addition, I have concerns about whether all expensive black-box optimizations can identify appropriate symmetries.**

**In general, finding appropriate symmetries can be tricky in black-box optimization. However, when a constructive policy is employed for CO, there are always symmetries we can utilize.** From a reinforcement learning perspective, a policy sequentially selects actions to generate a solution. In combinatorial optimization, solutions are represented as sets, while an RL policy generates a sequence (or a graph). Due to the relationship between sets (combinations) and sequences (permutations), there must be symmetries. For instance, there are 4! ways to create a set like [A, B, C, D] in sequence. By utilizing this relationship, we can easily find symmetric trajectories whether the objective function is black-box or not. Note that the generation process of RL can be canonicalized to a particular order (e.g., from left to right), but this introduces a significant bias. Hence, when tackling combinatorial optimization problems, it is crucial to consider symmetries in an appropriate way.

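As a toy illustration of the last point, and of why SRT needs no importance weights (cf. Q1 below), the sketch below, which is our own example under an assumed uniform constructive policy, enumerates the 4! symmetric construction orders of the set {A, B, C, D} and evaluates the imitation (replay) objective directly as a negative log-likelihood, with no ratio over a behavior policy.

```python
# Toy example (not the paper's code): every ordering of a set is a "free"
# symmetric trajectory for the same solution, and the SRT objective is just
# the NLL of those trajectories -- no importance ratios are involved.
import itertools, math

solution = ("A", "B", "C", "D")
sym_trajectories = list(itertools.permutations(solution))
assert len(sym_trajectories) == math.factorial(len(solution))   # 4! = 24

def log_prob(trajectory):
    """Assumed toy policy: pick one of the remaining items uniformly at each step."""
    n = len(trajectory)
    return sum(-math.log(n - t) for t in range(n))    # log(1/4 * 1/3 * 1/2 * 1)

# Imitation (replay) objective on the symmetric trajectories: average NLL.
srt_loss = -sum(log_prob(t) for t in sym_trajectories) / len(sym_trajectories)
print(round(srt_loss, 4))   # log(24) ≈ 3.1781 for this uniform toy policy
```
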
**Q1: Considering the symmetric loss, should we factor in importance sampling because methods like PPO and A2C are on-policy RL?**

We do not need to consider importance sampling because we use imitation learning, which is an off-policy method, as the replay training. We believe this is the significance of our approach over typical replay training methods for on-policy RL.

- **Q2:** pseudo-code
- **Answer:** We have added pseudo-code in Appendix B.1.
- **Q3:** How optimal are the results in Table 1 for TSP?
- **Answer:** In the synthetic setting, we assume that the number of reward calls is limited (although, in fact, the reward can be computed). Therefore, we can obtain the optimal solutions with off-the-shelf solvers like Concorde. The average optimal cost is 5.696.
- **Q4:** Why could the results in Table 3 differ from those in the original paper?
- **Answer:** We may use a different set of seeds from the original paper; we use [0, 1, 2, 3, 4] regardless of the oracle, and PMO did not report which seeds were used for its main results. We also use a different version of PyTDC (Therapeutics Data Commons [12]), which directly affects the Oracle score. We use the updated version (v0.4.0) since the previous version (v0.3.6) gives invalid scores for some tasks.
- **Q5:** Why do the results of PPO show high variance?
- **Answer:** As described in Appendix B.2, we searched (2x3x3) hyper-parameter combinations for tuning PPO. Note that PPO is not a popular choice in the field of neural combinatorial optimization because PPO requires critic estimation for high performance, and estimating the value function for combinatorial optimization is challenging. Additionally, the variance in Table 1 is not that large compared to others, e.g., PG-Rollout with 100K. Most importantly, we use the same hyper-parameters for PPO + SRT.

---

#### Reference

[5] Baptista, Ricardo, and Matthias Poloczek. "Bayesian optimization of combinatorial structures." ICML, 2018.
[6] Gomez-Bombarelli et al. "Automatic Chemical Design using a Data-driven Continuous Representation of Molecules." ACS Central Science, 2018.
[7] Oh, Changyong, et al. "Combinatorial Bayesian optimization using the graph Cartesian product." NeurIPS, 2019.
[8] Laskin et al. "Reinforcement learning with augmented data." NeurIPS, 2020.
[9] Yarats, Denis, Ilya Kostrikov, and Rob Fergus. "Image augmentation is all you need: Regularizing deep reinforcement learning from pixels." ICLR, 2020.
[10] Fan et al. "SECANT: Self-expert cloning for zero-shot generalization of visual policies." ICLR, 2021.
[11] Zhang et al. "Mixup: Beyond empirical risk minimization." arXiv preprint arXiv:1710.09412, 2017.
[12] Huang et al. "Therapeutics Data Commons: Machine learning datasets and tasks for therapeutics." NeurIPS Datasets and Benchmarks Track, 2021. (https://tdcommons.ai)

----

## Reviewer vaeD (rating: 6)

**not clear how this method compares to other combinatorial BBO**

**Answer**

*First of all, our research **focuses on constructive RL methods for combinatorial optimization** where function evaluation is expensive. Based on the suggested papers, we clarify our research scope and why we focus on constructive RL methods.*

Combinatorial Bayesian optimization solves a bi-level optimization in which the upper problem is surrogate regression and the lower problem is acquisition optimization. In combinatorial spaces, the acquisition optimization is modeled as a quadratic integer programming problem, which is NP-hard [A, C].
Note that the surrogate regression has cubic complexity when using a Gaussian process; thus, we need to solve problems with cubic complexity and NP-hardness at every iteration. To address the NP-hardness of acquisition optimization, various techniques have been proposed, including continuous relaxation [B], sampling with simulated annealing [C], genetic algorithms (BOSS [1] and GPBO [2]), and a random walk explorer (ChemBO [3]). While these methods have demonstrated competitive performance in lower-dimensional tasks like neural architecture search (NAS), they often demand substantial computation time, particularly in molecular optimization tasks [4]. In Appendix D.3, we have included the results of GPBO for comparison with our approach. Note that GPBO requires the design of domain-specific heuristic algorithms, which limits its extensibility to other tasks.

As pointed out in [C], an alternative approach is to utilize a variational auto-encoder (VAE) to map the high-dimensional combinatorial space into a "compact" continuous latent space and apply BO there, as in [5]. Though this approach has shown successful performance in molecular optimization, it also introduces variational error since a VAE maximizes a lower bound of the likelihood, known as the evidence lower bound (ELBO), rather than the likelihood itself.

*In addition, based on the suggested research, we added a literature review on black-box combinatorial optimization in Appendix E and the experimental results of Bayesian optimization on the molecular optimization tasks in Appendix D.3.*

**Grey-box problems rather than black-box**

**Answer**
We agree with your comment that our target problems are grey-box rather than black-box. Our target problems are CO problems whose objective function evaluation is expensive; one example is a CO problem with a black-box objective function and white-box constraints. The title and manuscript have been revised, as mentioned in the general response.

- **Minor suggestion:** We adjusted the y-scale of the plots in Figure 3 and also in Figure 6. Thanks for the suggestion.
- **Q1: Is it possible to utilize symmetries in other methods for BBO?**
- **Answer:** In reinforcement learning, a policy constructs a new solution by sequentially selecting actions, starting from an empty solution. Consider the case of determining a set with three components, where a solution is constructed as follows: [ ] -> [a] -> [a, b] -> [a, b, c]. This sequential construction gives rise to multiple trajectories (e.g., [ ] -> [b] -> [b, c] -> [b, c, a]), and our approach leverages this symmetry. On the other hand, in Bayesian optimization, new solutions are suggested by exploring the solution space starting from an initial solution, e.g., by Gibbs sampling within a 1-Hamming ball on a discrete space: [a, b, c] -> [d, b, c] -> [d, b, a]. There, no such symmetries as the ones we utilize in RL are available.
- **Q2:** Symmetries in RL-based BBO
- **Answer:** In DevFormer, an additional loss term is introduced to consider solution symmetries; however, these symmetries are not induced in a structured way. On the other hand, GFlowNet introduces a new kind of loss function that accounts for solution symmetries by modeling the sequential decision-making in CO on a directed acyclic graph (DAG) and treating flows on the DAG as a policy. Our two-step learning procedure can be used along with these DRL methods, and SRT can further induce solution symmetries effectively. It is noteworthy that our algorithm can also easily cooperate with symmetric architectures like the SE(3) Transformer [6] and EGNN [7], which exhibit architectural symmetries in state-level representations (e.g., a rotation of an image yields the same representation). Since these models do not consider solution symmetries related to action trajectories, further learning processes like SRT are required to induce solution symmetries.
- **Q3:** What happens if we use intermediate rewards to train the RL policy in TSP?
- **Answer:** First of all, a TSP solution is a Hamiltonian cycle covering all cities. The cost of the solution can vary based on the last segment, which involves returning to the starting point, and the last segment is automatically determined by the previous sequence. In other words, the reward is episodic. Training the model with intermediate rewards could introduce bias into the return value, so the performance is insufficient. Even among heuristic algorithms for TSP, farthest insertion, which forms a new cycle by inserting the farthest city, outperforms other heuristics. Therefore, the RL policy should be trained using the Monte Carlo method with episodic rewards to ensure unbiased learning.
- **Q4, Q5: about symmetric trajectories in TSP**
- **Answer:** In the TSP, the decision variable $x_{ij}$ takes the value of 1 if city $j$ is visited immediately after city $i$. For instance, when considering four cities, the sequences 1-2-3-4-1 and 2-3-4-1-2 represent the same solution (i.e., $x$ is the same). This implies an identical cycle and, consequently, the same decision variables in both sequences (a small check is sketched below). This is what we intended with the example on page 3. Note that a TSP solution (there is no specific starting point) is a Hamiltonian cycle. We have made revisions to the example in the manuscript.

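The following is a small check of the Q4/Q5 point, using our own toy code rather than the paper's: two rotated visiting sequences of the same tour map to identical decision variables $x_{ij}$ (the same directed edge set).

```python
# Toy check (not from the paper): rotated visiting sequences encode the same
# Hamiltonian cycle, i.e., identical decision variables x_ij
# (x_ij = 1 iff city j is visited immediately after city i).
def decision_variables(order):
    """Directed edge set {(i, j)} of the closed tour with visiting order `order`."""
    n = len(order)
    return {(order[t], order[(t + 1) % n]) for t in range(n)}

# 1-2-3-4-1 and 2-3-4-1-2 are different action trajectories, same solution.
assert decision_variables([1, 2, 3, 4]) == decision_variables([2, 3, 4, 1])
```
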
- **Q6: where the 20+% improvement for PPO+SRT over PPO is coming from**
- **Answer:** There were some typos; we revised the sentence as follows: "The most substantial improvement facilitated by SRT is observed in the case of GFlowNet, where a cost reduction of 3.76% is achieved when the sample size K=100K."
- **Q7:** When does the performance degradation happen?
- **Answer:** Based on the experimental results in Figure 5 (where the number of replay loops is increased), the model can suffer from overfitting when there are not enough symmetries to utilize. For example, unlike TSP, we cannot explicitly obtain symmetric solutions in the PMO benchmark. The symmetric transformation highly depends on the API; thus, it is hard to explore the symmetric space effectively (the maximum entropy is achieved with the uniform symmetric transformation policy according to Theorem 1), resulting in performance degradation due to early convergence in some cases. This can be mitigated by designing a more precise symmetric transformation policy or by adjusting the frequency of conducting SRT.

---

#### Reference

[1] Moss et al. "BOSS: Bayesian optimization over string spaces." NeurIPS, 2020.
[2] Tripp, Austin, Gregor N. C. Simm, and José Miguel Hernández-Lobato. "A fresh look at de novo molecular design benchmarks." NeurIPS 2021 AI for Science Workshop, 2021.
[3] Korovina et al. "ChemBO: Bayesian optimization of small organic molecules with synthesizable recommendations." AISTATS, 2020.
[4] Gao et al. "Sample efficiency matters: a benchmark for practical molecular optimization." NeurIPS, 2022.
[5] Gomez-Bombarelli et al. "Automatic Chemical Design using a Data-driven Continuous Representation of Molecules." ACS Central Science, 2018.
[6] Fuchs, Fabian, et al. "SE(3)-Transformers: 3D roto-translation equivariant attention networks." NeurIPS, 2020.
[7] Satorras, Víctor Garcia, Emiel Hoogeboom, and Max Welling. "E(n) equivariant graph neural networks." ICML, 2021.

We agree that SRT also requires domain knowledge.

Lastly, we found that the results of A2C (+ ART) had been reported using an earlier version of the results. (All experiments had been re-conducted when the code was merged; each task has different dependencies, so we use the lowest version.) That is the only change in the numbers. Sorry for forgetting to mark it in blue.

Related work in the field: Further related works on deep reinforcement learning (DRL) for combinatorial optimization have been added in Appendix E.

Potential Future Impact of This Work:

1. **Expandability for future studies.** One of the key strengths of this work is its expandability. The proposed method, replay training with imitation learning using symmetric action trajectories, applies to both on-policy and off-policy reinforcement learning. Furthermore, it exhibits additional improvements on DRL methods that already consider symmetries, such as Sym-NCO. Our work has impact because our method can easily be applied to various DRL methods for combinatorial optimization in the future.
2. **Meaningful take-home message for future research.** In combinatorial optimization, sample efficiency has been studied less than other challenges, such as distributional shift or scalability. This work brings up the importance of sample efficiency. The proposed method shows that properly exploiting solution symmetries, often regarded as a cause of the inevitably extended search space, can improve sample efficiency.

We believe our study can inspire future work to consider sample efficiency in DRL for CO for better practicality and to encourage properly utilizing symmetries in CO. Please consider our potential impact on future studies in this field, as this is the first work considering sample efficiency in DRL-based combinatorial optimization.
