# [ICML'24 Rebuttal] SRT

# General responses

We are sincerely grateful to the reviewers for their valuable feedback on our manuscript. We are pleased to hear that the reviewers found our paper **well-written** (o2Ga, LobV, PGea) and supported by **extensive literature surveys and experiments** (o2Ga, PGea). The recognition of the **practical importance** of our research (orUM, LobV) is greatly appreciated, and we are encouraged by the acknowledgment of our **novelty, intuitiveness, and intrigue** (orUM, o2Ga, LobV).

In response to the concerns raised by the reviewers in common:

- **Does sample efficiency matter in CO? Yes.** In many practical cases, the objective functions are hard to define in analytical form or involve additional computational processes.
- **Is SRT generalizable? Yes.** We utilize the fact that a constructive DRL policy sequentially constructs solutions in the form of sequences (i.e., permutations), while the target solution is a set (i.e., combination). Therefore, there are multiple ways to construct the same set; based on this relationship, we can easily find symmetric trajectories. Furthermore, the proposed method consistently enhances various DRL methods, including both on-policy and off-policy methods.

Please see below for our detailed responses. We have also considered all concerns and questions raised and provided additional experiments and explanations to address them thoroughly; please see the responses to each review in detail.

### Sample efficiency does matter in CO

As the reviewers mentioned, combinatorial optimization (CO) problems have often been modeled with linear objective functions in the literature. However, it is crucial to recognize that this does not imply that all CO problems have linear objective functions. **In practice, many cases exist where the objective functions (i.e., reward functions) cannot be defined as linear functions or even in a mathematically closed form.**

- (Case 1) Objective functions cannot be defined analytically, often necessitating computationally intensive simulations for evaluation (e.g., black-box optimization).
- (Case 2) Objective functions are analytically defined, yet their evaluation requires additional computation, such as solving another optimization problem (bi-level optimization) or making predictions from data (e.g., predict-then-optimize, stochastic optimization).

For instance, in hardware design, the decap placement problem necessitates a black-box simulator that relies on the electrical and magnetic simulation of the fabricated product or lab-level electrical measurements [1-3]. Similarly, in molecular optimization, evaluating designed molecules in real-world scenarios entails costly assessments, such as wet-lab experiments. This motivation led to the publication of an open-source benchmark [4].
Note that improving sample efficiency under the assumption of expensive reward evaluations is a general research approach (e.g., Bayesian optimization, active learning, and model-based RL), even though such works often verify the proposed method with synthetic data to enable better analysis and extensive experiments.

#### Literature evidence

Here are a few more examples; there are more in various domains beyond those listed. Note that these examples have been studied as essential sub-areas.

- Routing in real-world environments: dynamic routing with traffic [5], routing with collision-free path-finding [6, 7], and others [8-11].
- Real-world machine scheduling: scheduling with controllable processing times (SCPT) [12] and deteriorating job-scheduling problems, which assume time-dependent processing times that are not cheap to compute [13, 14].
- Biological engineering: de novo molecular design [4], biological sequence design [15], and protein structure design [16].

This study covers hardware design and drug discovery in the experiments (Sections 5.2 and 5.3). Moreover, we also include scheduling problems and various routing problems in the appendix. Lastly, we introduce some survey papers dealing with various real-world applications and solution approaches for optimization with expensive reward computation [17-18].

#### Experimental evidence

We additionally include the overall runtime of the hardware design tasks. Note that the practical molecular optimization benchmark includes early termination rules for early convergence, so it is not appropriate for comparing overall runtime.

Chip-package PDN

| | Avg. Runtime | Avg. Reward |
| --- | --- | --- |
| A2C | 4,204.50 (± 342.90) | 9.772 (± 0.823) |
| A2C + SRT | 4,265.25 (± 425.38) | 12.757 (± 0.267) |
| PG-Rollout | 3,747.00 (± 516.97) | 10.240 (± 0.955) |
| PG-Rollout + SRT | 3,795.25 (± 803.31) | 12.601 (± 0.467) |
| PPO | 3,884.25 (± 81.15) | 9.821 (± 0.411) |
| PPO + SRT | 4,700.00 (± 58.26) | 11.279 (± 0.511) |
| GFlowNet | 3,528.75 (± 26.66) | 9.740 (± 0.245) |
| GFlowNet + SRT | 3,528.25 (± 17.80) | 12.784 (± 0.186) |

HBM PDN

| | Avg. Runtime | Avg. Reward |
| --- | --- | --- |
| A2C | 33,289.75 (± 682.85) | 25.945 (± 0.177) |
| A2C + SRT | 33,294.75 (± 1,722.59) | 26.449 (± 0.094) |
| PG-Rollout | 34,251.25 (± 573.18) | 25.714 (± 0.122) |
| PG-Rollout + SRT | 30,351.25 (± 1,150.44) | 26.355 (± 0.013) |
| PPO | 32,148.75 (± 623.47) | 25.907 (± 0.068) |
| PPO + SRT | 32,479.25 (± 168.71) | 26.322 (± 0.141) |
| GFlowNet | 35,968.00 (± 3,942.48) | 25.845 (± 0.103) |
| GFlowNet + SRT | 35,867.75 (± 5,307.87) | 26.469 (± 0.057) |

The results show that SRT effectively enhances performance by achieving significantly higher rewards while maintaining similar or slightly increased runtime, which is negligible compared to the simulation time. We will include this result in the manuscript.

### Is SRT Generalizable?
#### Finding symmetric trajectories is generalizable in CO

First, it is noteworthy that our research focuses on constructive DRL policies, where a solution is sequentially built from an empty solution; see the second paragraph in Section 1. In combinatorial optimization, a solution is represented as a set, whereas the DRL policy sequentially constructs a sample in the form of a sequence. Therefore, there are diverse ways of constructing the same solution. For instance, there are 4! ways to create a set like {A, B, C, D} in sequence. This study actively exploits such symmetricity. Note that **symmetries always exist due to the relationship between sets (combination) and sequences (permutation).**

#### Iterative Update Steps with Log-Likelihood Maximization Loss in Eq (1)

We **do not introduce additional restrictions on training or alter the architecture** of base DRL methods (in Step A) since our method is conducted separately (in Step B). Furthermore, the proposed method is straightforward to implement. Lastly, **SRT does not introduce importance weights even when on-policy training is used**; previous literature has shown the superiority of on-policy methods, especially REINFORCE, in DRL for CO. Experimental results demonstrate that SRT consistently improves both on-policy and off-policy training.

#### References

[1] Zhang et al. (2021) Fast PDN Impedance Prediction Using Deep Learning. arXiv preprint arXiv:2106.10693.

[2] Zhang et al. (2022) Efficient DC and AC impedance calculation for arbitrary-shape and multilayer PDN using boundary integration. IEEE Transactions on Signal and Power Integrity, 1, 1-11.

[3] Kim et al. (2023) DevFormer: A symmetric transformer for context-aware device placement. In International Conference on Machine Learning (pp. 16541-16566). PMLR.

[4] Gao et al. (2022) Sample efficiency matters: a benchmark for practical molecular optimization. Advances in Neural Information Processing Systems.

[5] Kim et al. (2016) Solving the dynamic vehicle routing problem under traffic congestion. IEEE Transactions on Intelligent Transportation Systems, 17(8), 2367-2380.

[6] Liu et al. (2019) Task and path planning for multi-agent pickup and delivery. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS).

[7] Hönig et al. (2018) Conflict-based search with optimal task assignment. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems.

[8] Legillon, Liefooghe, & Talbi (2012) COBRA: a cooperative coevolutionary algorithm for bi-level optimization. In IEEE Congress on Evolutionary Computation (pp. 1-8).

[9] Chaabani, Bechikh, & Said (2018) A new co-evolutionary decomposition-based algorithm for bi-level combinatorial optimization. Applied Intelligence, 48, 2847-2872.

[10] Calvete, Galé, & Oliveros (2011) Bilevel model for production-distribution planning solved by using ant colony optimization. Computers & Operations Research, 38, 320-327.

[11] Wang et al. (2021) A bi-level framework for learning to solve combinatorial optimization on graphs. Advances in Neural Information Processing Systems, 34, 21453-21466.
[12] Shioura et al. (2018) Preemptive models of scheduling with controllable processing times and of scheduling with imprecise computation: A review of solution approaches. European Journal of Operational Research, 266(3), 795-818.

[13] Gupta, J. N., & Gupta, S. K. (1988) Single facility scheduling with nonlinear processing times. Computers & Industrial Engineering, 14(4), 387-393.

[14] Wu & Lee (2008) Single-machine group-scheduling problems with deteriorating setup times and job-processing times. International Journal of Production Economics, 115(1), 128-133.

[15] Jain et al. (2022) Biological sequence design with GFlowNets. In International Conference on Machine Learning (pp. 9786-9801). PMLR.

[16] Brookes, Park, & Listgarten (2019) Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning (pp. 773-782). PMLR.

[17] Rios et al. (2021) Recent dynamic vehicle routing problems: A survey. Computers & Industrial Engineering, 160, 107604.

[18] Li, Zhan, & Zhang (2022) Evolutionary computation for expensive optimization: A survey. Machine Intelligence Research, 19(1), 3-23.

# Reviewer PGea (4 with confidence 3)

### Why do the reward calls need to be minimized?

As mentioned, CO problems are often modeled with linear objective functions that require O(n) computation. However, this does not necessarily mean a CO problem has a linear objective function. Especially when CO problems come from practical settings, the objective functions can be challenging even to define mathematically or can involve NP-hard optimization problems; please refer to our general comments for more examples.

One of the key contributions of the proposed method, SRT, is that it can further train the model without necessitating additional reward calls. This capability leads to improved performance per reward call, thereby enhancing sample efficiency. Such enhancement is particularly advantageous in scenarios where reward evaluations serve as computational bottlenecks.

For example, in our experiment, the simulation for HBM PDN (one of the hardware design tasks in Section 5.2) takes approximately 108 seconds on average (Step A). On the other hand, **the additional computation time incurred by SRT amounts to approximately 0.09 seconds** (Step B). Note that SRT can be executed efficiently through GPU batch processing.

Note: the runtimes were measured using Xeon(R) Gold 5317 CPUs (48 cores) and a single RTX 3090 GPU, with a batch size of 100.

### Wall-clock comparisons

We will include the total running time in the hardware design experiments as below.

| | Chip-package PDN | HBM PDN |
| --- | --- | --- |
| A2C | 4,204.50 (± 342.90) | 33,289.75 (± 682.85) |
| A2C + SRT | 4,265.25 (± 425.38) | 33,294.75 (± 1,722.59) |
| PG-Rollout | 3,747.00 (± 516.97) | 34,251.25 (± 573.18) |
| PG-Rollout + SRT | 3,795.25 (± 803.31) | 30,351.25 (± 1,150.44) |
| PPO | 3,884.25 (± 81.15) | 32,148.75 (± 623.47) |
| PPO + SRT | 4,700.00 (± 58.26) | 32,479.25 (± 168.71) |
| GFlowNet | 3,528.75 (± 26.66) | 35,968.00 (± 3,942.48) |
| GFlowNet + SRT | 3,528.25 (± 17.80) | 35,867.75 (± 5,307.87) |

The results show that the increment from SRT is negligible compared to the simulation time. Since we measure the validation average cost per epoch, the data points have different x-values when we plot the results over runtime. So, instead of plotting the results over time, we will add the runtime results for the hardware design optimization tasks.
Note that in our experiments, TSP utilizes synthetic data and includes metrics with additional computation for analysis. On the other hand, practical molecular optimization terminates early when it fails to find new candidates. Consequently, comparing performance over runtime may not be appropriate in these cases.

### Applications with expensive reward evaluations

Here are a few examples (more in the general comments); beyond the presented examples, we can find practical CO problems in various domains.

- Logistics: Routing problems with expensive reward evaluations occur on a daily basis. For example, the cost function is not static in practice, so it may require simulation with consideration of time-varying traffic (this has been studied as the dynamic routing problem).
- Smart factory: machine scheduling problems with time-dependent processing times, material handling in warehouses (routing with multi-agent collision-free pathfinding)
- Semiconductor industry: decap placement problems, PCB routing problems
- Bio and pharmaceutical industry: drug discovery, protein design, RNA sequencing
- Neural architecture search

### Minor Questions

Thanks for the detailed questions.

- A sudden cost drop, specifically at 1M reward calls: This occurs because of the MultiStepLR scheduler. The learning rate is decayed with a multiplicative factor of 0.1 at 1M and 1.5M reward calls. MultiStepLR has been used in other DRL-for-CO literature [1, 2]. We will include a description of the learning rate scheduler in the experimental setting.
- Where the TSP data comes from: Following Kool et al. [3], we generate TSP problems when every epoch starts (they are newly generated at each epoch). The 2-D coordinate data are sampled from the uniform distribution U(0, 1).

#### References

[1] Kwon et al. (2020) POMO: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems, 33, 21188-21198.

[2] Kim et al. (2022) Sym-NCO: Leveraging symmetricity for neural combinatorial optimization. Advances in Neural Information Processing Systems, 35, 1936-1949.

[3] Kool, van Hoof, & Welling (2019) Attention, learn to solve routing problems! In International Conference on Learning Representations.

# Reviewer orUM (4 with confidence 4)

### Definition and existence of symmetric trajectories in CO

First, it is noteworthy that our research focuses on constructive DRL policies, where a solution is sequentially built from an empty solution; see the second paragraph in Section 1. In combinatorial optimization, a solution is represented as a set, whereas the DRL policy sequentially constructs a solution in the form of a sequence (the action trajectory is a permutation). Therefore, there are diverse ways of constructing the same solution; for instance, there are 4! ways to create a set like {A, B, C, D} in sequence, as illustrated in the sketch below. This study actively exploits such symmetricity. Note that **symmetries always exist due to the relationship between sets (combination) and sequences (permutation).**
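To make this combination-permutation relationship concrete, here is a minimal, self-contained sketch (illustrative only; the item names are hypothetical, and the mapping `C` mirrors the function defined formally below):

```python
from itertools import permutations

# A combinatorial solution is a *set* of selected items (e.g., decap positions),
# while a constructive policy emits an *ordered* action trajectory.
solution = frozenset({"A", "B", "C", "D"})

def C(trajectory):
    """Maps an action trajectory (a sequence) to the solution (a set) it constructs."""
    return frozenset(trajectory)

# Every ordering of the same four actions is a symmetric trajectory: 4! = 24 of them.
symmetric_trajectories = [tau for tau in permutations(solution) if C(tau) == solution]
print(len(symmetric_trajectories))  # 24
```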
Precisely, to define symmetric action trajectories, we introduce a non-injective function $C$ that maps an action trajectory $\tau = (a_1, \ldots, a_T)$ to its corresponding solution $x$ given the initial state.

**Definition (Symmetric action trajectories).** A pair of action trajectories $\tau$ and $\tau'$ given the initial state are symmetric if they induce the same solution $x \in \mathcal{X}$, i.e., if $C(\tau) = C(\tau') = x$.

For example, in TSP, a solution is a set of edges, and an action trajectory is a sequence of selected edges. In the case of a tour (1-2-3-4-1), both action trajectories ((1, 2)->(2, 3)->(3, 4)->(4, 1)) and ((2, 3)->(3, 4)->(4, 1)->(1, 2)) give the same edge set, i.e., the same solution, ((1, 2), (2, 3), (3, 4), (4, 1)).

**Using the relationship between sets and sequences to find symmetries is applicable across problems, and the definition is not problem-specific.** In a similar way, we can find symmetric action trajectories in the decap placement problems. To find symmetries, an understanding of the MDP for each problem may be required, but not a deeper understanding of its mathematical properties. Compared to designing a genetic algorithm for TSP [1] or formulating it as integer programming with better polyhedral studies [2], **finding symmetries requires much less problem-specific knowledge**.

### Answer for "Its competing counterpart does not utilize problem domain knowledge. Such comparison is not fundamentally fair."

Our research goal is to propose effective methodologies for integrating this relatively easily accessible domain knowledge into learning processes. Many studies, both in optimization [1-4] and deep learning [5-7], have explored the identification and utilization of structural properties within problems. We do not consider introducing or utilizing such domain knowledge unfair. Furthermore, some baselines also utilize a similar or deeper level of domain knowledge (e.g., GFlowNets, Genetic Algorithms, Bayesian Optimization, Sym-NCO, and MatNet).

### Computation for finding symmetric trajectories

We can obtain symmetric action trajectories by permuting the given action trajectories (sampled from the policy). Permuting sequences is a relatively cheap computation compared to simulations or solving lower-level optimization problems. As discussed in the general response, when the reward evaluation is expensive, **the additional cost incurred by SRT is substantially smaller than that of additional reward evaluations**.

For example, in the hardware design optimization, the simulation for the HBM PDN task requires about 108 seconds on average (Step A), while SRT (Step B) requires 0.09 seconds with a batch size of 100. We also provide the total runtime for the hardware design optimization tasks as follows.
| | Chip-package PDN | HBM PDN |
| --- | --- | --- |
| A2C | 4,204.50 (± 342.90) | 33,289.75 (± 682.85) |
| A2C + SRT | 4,265.25 (± 425.38) | 33,294.75 (± 1,722.59) |
| PG-Rollout | 3,747.00 (± 516.97) | 34,251.25 (± 573.18) |
| PG-Rollout + SRT | 3,795.25 (± 803.31) | 30,351.25 (± 1,150.44) |
| PPO | 3,884.25 (± 81.15) | 32,148.75 (± 623.47) |
| PPO + SRT | 4,700.00 (± 58.26) | 32,479.25 (± 168.71) |
| GFlowNet | 3,528.75 (± 26.66) | 35,968.00 (± 3,942.48) |
| GFlowNet + SRT | 3,528.25 (± 17.80) | 35,867.75 (± 5,307.87) |

The results show that SRT effectively enhances performance by achieving significantly lower costs while maintaining similar or slightly increased runtime, which is negligible compared to the simulation time. We will include this result in the manuscript.

### How can SRT exploration improve sample efficiency? (further analysis of symmetric exploration)

First, we need to clarify that **exploration of a constructive policy is conducted over the action trajectory space**, not the solution space. For example, the policy regards the action trajectories (A->B->C->D) and (B->C->D->A) as different, even though both give the same combinatorial solution {A, B, C, D}. This means the symmetrically transformed trajectories might be unlikely to be sampled from the current policy $\pi$ (i.e., samples unlikely to have been seen before). In SRT, we use the fact that symmetric trajectories are mapped to the same solution; the rewards for symmetric trajectories are obtained without additional computation. Since we can utilize more trajectories without additional reward calls, sample efficiency (better performance with the same number of reward calls) is enhanced.

For further analysis of symmetric exploration, we provide an empirical analysis by measuring the log-likelihood difference among symmetric trajectories. In the ideal scenario where the policy learns symmetries with broad exploration over the action trajectory space, the log-likelihood difference becomes zero (since symmetric trajectories give the same solution). The results of A2C on TSP50 are as follows.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 433.57 (± 19.50) | 463.87 (± 31.96) |
| A2C + SRT (MaxEnt) | 40.45 (± 7.43) | 22.21 (± 0.70) |

The results indicate that **SRT effectively reduces the log-likelihood difference without the need for additional reward evaluations.** We will include these results in the manuscript.

### Comparison with state-of-the-art (non-DRL-based) methods

We provide comparative results with state-of-the-art (non-DRL-based) methods in practical molecular optimization (PMO), since it is an official benchmark that considers sample efficiency and provides a well-defined guideline for assessing various approaches.
| REINVENT (SELFIES) + SRT | REINVENT (SELFIES) | Graph GA | GPBO |
| -------- | -------- | -------- | -- |
| 0.632 | 0.615 | 0.616 | 0.609 |

| Hill Climbing | STONED | GFlowNet + SRT | GFlowNet-AL | GFlowNet |
| -------- | -------- | -------- | -- | -- |
| 0.591 | 0.588 | 0.489 | 0.432 | 0.431 |

Gaussian process Bayesian optimization (GPBO) [3] and the graph-based genetic algorithm (Graph GA) [4] are regarded as state-of-the-art methods apart from DRL-based methods. The full results of GPBO and Graph GA are provided in Table 9 in Appendix D.4; the rest of the results, which are not in the appendix, will also be included.

#### References

[1] Vidal, T. (2022) Hybrid genetic search for the CVRP: Open-source implementation and SWAP* neighborhood. Computers & Operations Research, 140, 105643.

[2] Abeledo, H., Fukasawa, R., Pessoa, A., & Uchoa, E. (2013) The time dependent traveling salesman problem: polyhedra and algorithm. Mathematical Programming Computation, 5(1), 27-55.

[3] Tripp, A., Simm, G. N., & Hernández-Lobato, J. M. (2021) A fresh look at de novo molecular design benchmarks. In NeurIPS 2021 AI for Science Workshop.

[4] Jensen, J. H. (2019) A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chemical Science, 10(12), 3567-3572.

[5] Bengio, Y., Lahlou, S., Deleu, T., Hu, E. J., Tiwari, M., & Bengio, E. (2023) GFlowNet foundations. Journal of Machine Learning Research, 24(210), 1-55.

[6] Kim, M., Park, J., & Park, J. (2022) Sym-NCO: Leveraging symmetricity for neural combinatorial optimization. Advances in Neural Information Processing Systems, 35, 1936-1949.

[7] Kwon, Y. D., Choo, J., Kim, B., Yoon, I., Gwon, Y., & Min, S. (2020) POMO: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems, 33, 21188-21198.

# Reviewer o2Ga (5 with confidence 2)

### Why is SRT generalizable?

We assert the generalizability of SRT for the following reasons.

**Finding symmetric trajectories is generalizable in CO.** First, it is noteworthy that our research focuses on constructive DRL policies, where a solution is sequentially built from an empty solution; see the second paragraph in Section 1. In combinatorial optimization, a solution is represented as a combination, whereas the DRL policy sequentially constructs a solution in the form of a sequence (the action trajectory is a permutation). Thus, symmetries always exist due to the relationship between combinations and permutations. For example, in TSP, a solution is a set of edges, and an action trajectory is a sequence of selected edges. In the case of a tour (1-2-3-4-1), both action trajectories ((1, 2)->(2, 3)->(3, 4)->(4, 1)) and ((2, 3)->(3, 4)->(4, 1)->(1, 2)) give the same edge set, i.e., the same solution, ((1, 2), (2, 3), (3, 4), (4, 1)). Using the relationship between sets and sequences to find symmetries is applicable across problems.

**Iterative Update Steps with Log-Likelihood Maximization Loss in Eq (1).** We do not introduce additional restrictions on training or alter the architecture of base DRL methods (in Step A) since our method is conducted separately (in Step B). Furthermore, the proposed method is straightforward to implement. Lastly, SRT does not introduce importance weights even when on-policy training is used; previous literature has shown the superiority of on-policy methods, especially REINFORCE, in DRL for CO.
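For concreteness, a minimal sketch of the two update steps is given below. All helper callables (`rollout`, `evaluate_reward`, `select_high_reward`, `symmetric_transform`) are hypothetical placeholders, and the losses are simplified; this is only an illustration of the training flow, not our exact implementation.

```python
def training_iteration(policy, optimizer, rollout, evaluate_reward,
                       select_high_reward, symmetric_transform, num_replay=1):
    # Step A: ordinary policy-gradient update of the base DRL method;
    # the expensive reward calls happen only here.
    trajectories, log_probs = rollout(policy)
    rewards = evaluate_reward(trajectories)                  # e.g., black-box simulator
    advantage = rewards - rewards.mean()                     # simple shared baseline
    rl_loss = -(advantage.detach() * log_probs).mean()
    optimizer.zero_grad(); rl_loss.backward(); optimizer.step()

    # Step B: symmetric replay training; no additional reward evaluations.
    high_reward = select_high_reward(trajectories, rewards)  # greedy / >average / Top-K
    for _ in range(num_replay):
        sym = symmetric_transform(high_reward)               # permuted trajectories, same solutions
        srt_loss = -policy.log_prob(sym).mean()              # log-likelihood maximization as in Eq. (1)
        optimizer.zero_grad(); srt_loss.backward(); optimizer.step()
```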
Experimental results demonstrate that SRT consistently improves both on-policy and off-policy training.

### CO problems with expensive reward evaluation

As we discussed in the general comments, CO problems arising from practical settings often feature objective functions that are challenging to define in mathematically closed forms. Here are a few examples (more in the general comments). Beyond the examples presented, a diverse array of practical CO problems can be found across various domains.

- Logistics: Routing problems with expensive reward evaluations occur on a daily basis. For example, the cost function is not static in practice, so it may require simulation with consideration of time-varying traffic (this has been studied as the dynamic routing problem).
- Smart factory: machine scheduling problems with time-dependent processing times, material handling in warehouses (routing with multi-agent collision-free pathfinding)
- Semiconductor industry: decap placement problems, PCB routing problems
- Bio and pharmaceutical industry: drug discovery, protein design, RNA sequencing
- Neural architecture search

### Additional cost of maximum likelihood estimation in the symmetric replay training step

We express that **the samples are "free" in the sense that they do not necessitate reward evaluations**. As pointed out, SRT is not entirely computation-free. For instance, SRT requires $O(1)$ computation for sampling symmetric trajectories and $O(n^3)$ computation for maximum likelihood estimation when solving TSP with an attention model [1] and utilizing a MaxEnt transformation policy. (The $O(n^3)$ term follows from the $O(n^2 d)$ complexity of self-attention [2], where the token dimension is $d = n$ in TSP.)

However, it is worth emphasizing that **SRT is beneficial when the reward evaluation cost is more expensive** than the additional cost incurred by SRT (see the examples above). Furthermore, SRT allows GPU **batch processing, resulting in much smaller wall-clock runtime** compared to reward evaluation with simulation. To verify the benefit of SRT, we also report the overall runtime (for 2M evaluations) of the hardware design optimization tasks as follows.
| | Chip-package PDN | HBM PDN |
| --- | --- | --- |
| A2C | 4,204.50 (± 342.90) | 33,289.75 (± 682.85) |
| A2C + SRT | 4,265.25 (± 425.38) | 33,294.75 (± 1,722.59) |
| PG-Rollout | 3,747.00 (± 516.97) | 34,251.25 (± 573.18) |
| PG-Rollout + SRT | 3,795.25 (± 803.31) | 30,351.25 (± 1,150.44) |
| PPO | 3,884.25 (± 81.15) | 32,148.75 (± 623.47) |
| PPO + SRT | 4,700.00 (± 58.26) | 32,479.25 (± 168.71) |
| GFlowNet | 3,528.75 (± 26.66) | 35,968.00 (± 3,942.48) |
| GFlowNet + SRT | 3,528.25 (± 17.80) | 35,867.75 (± 5,307.87) |

The results show that SRT effectively enhances performance by achieving significantly lower costs while maintaining similar or slightly increased runtime, which is negligible compared to the simulation time. We will include this result in the manuscript.

#### References

[1] Kool, van Hoof, & Welling (2019) Attention, learn to solve routing problems! In International Conference on Learning Representations.

[2] Keles, Wijewardena, & Hegde (2023) On the computational complexity of self-attention. In International Conference on Algorithmic Learning Theory (pp. 597-619). PMLR.

### The detailed definition of high-rewarded trajectories

The strategies for collecting high-reward samples are flexibly employed according to the task or DRL method:

1. Samples whose rewards are higher than the average reward of the sampled batch
2. Greedy rollout
3. Top-K

In TSP, though we employ the greedy rollout strategy to obtain high-reward solutions, experiments that instead collect samples with higher-than-average rewards give similar results, as follows.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 6.780 (± 0.208) | 6.129 (± 0.021) |
| A2C + SRT (Greedy) | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C + SRT (higher-than-average) | 6.533 (± 0.031) | 6.041 (± 0.006) |

Currently, we utilize different strategies according to the task: greedy rollout for TSP, higher-than-average for hardware design optimization, and Top-K for molecular optimization. Specifically, in the case of molecular optimization, the benchmark assesses the Top-100 samples among all generated samples (10K) during training. Therefore, we gather Top-K samples from the entire pool of samples generated up to that point in training. Note that the value of K is set to 24, consistent with the online experience replay size originally implemented in REINVENT; please refer to Appendix C.3 for further details. While these strategies are outlined in the experimental setting and the appendix, we recognize the need for more explicit exposition, as mentioned in the comment; we intend to explain them in Section 4 of the manuscript.

### Benefits of using more symmetric samples

The experimental results for using more symmetric trajectories are provided in Appendix D.1. The findings indicate that increasing the number of symmetric trajectories (L) per replay loop does not yield significant benefits, although it does not worsen the results either.

### Scheduling SRT

Thanks for the interesting idea. We agree that the samples collected in the early stage of training can be noisy. Thus, we conducted experiments with the following SRT scheduling:

- 0 ~ 500K: No SRT
- 500K ~ 1.5M: SRT loop = 1
- 1.5M ~ 2M: SRT loop = 2

The average SRT loop is calculated to be 1. We then compared these results with those obtained from training with a fixed loop of 1.
| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 6.780 (± 0.208) | 6.129 (± 0.021) |
| A2C + SRT | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C + Scheduled SRT | 6.676 (± 0.025) | 6.072 (± 0.003) |

In the case of TSP50, we observed a rapid decrease in costs during the early stages (e.g., A2C costs dropped from 10.35 at 10K to 7.27 at 50K). Consequently, additional training proved beneficial even when sample quality was relatively low. However, scheduling SRT could be advantageous if the initial policy performs poorly.

### Minimizing log-likelihood of the lower-reward samples' symmetric trajectories

Thanks for the interesting idea. We also conducted experiments with the suggested idea and obtained the following results.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 6.780 (± 0.208) | 6.129 (± 0.021) |
| A2C + SRT | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C + Reverse SRT | 13.571 (± 2.096) | 14.442 (± 2.632) |

For further analysis, we measured the log-likelihood differences among symmetric trajectories according to the equation in Section 5.1.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 433.57 (± 19.50) | 463.87 (± 31.96) |
| A2C + SRT | 40.45 (± 7.43) | 22.21 (± 0.70) |
| A2C + Reverse SRT | 163.27 (± 10.02) | 111.74 (± 20.72) |

Based on the results, we conjecture that minimizing the log-likelihood of low-reward symmetric trajectories flattens the likelihood landscape. Since the trajectory space is vast, training with high-reward samples is more effective. Yet, if the goal is to learn over the whole space, the suggested idea can also be effective in regularizing the policy to have symmetries.

### Typos

Thanks for finding them. We have revised them.

# Reviewer LobV (6 with confidence 3)

## Weaknesses

### Definition and further explanation of symmetric trajectories

First, it is noteworthy that our research focuses on constructive DRL policies, where a solution is sequentially built from an empty solution; see the second paragraph in Section 1. The constructive policy is highly beneficial for generating feasible solutions using a masking scheme.
In combinatorial optimization, a solution is represented as a set, whereas the DRL policy sequentially constructs a solution in the form of a sequence (the action trajectory is a permutation). Therefore, there are diverse ways of constructing the same solution. For instance, there are 4! ways to create a set like {A, B, C, D} in sequence. This study actively exploits such symmetricity. Note that symmetries always exist due to the relationship between sets (combination) and sequences (permutation).

Precisely, to define symmetric action trajectories, we introduce a non-injective function $C$ that maps an action trajectory $\tau = (a_1, \ldots, a_T)$ to its corresponding solution $x$ given the initial state.

**Definition (Symmetric action trajectories).** A pair of action trajectories $\tau$ and $\tau'$ given the initial state are symmetric if they induce the same solution $x \in \mathcal{X}$, i.e., if $C(\tau) = C(\tau') = x$.

For example, in TSP, a solution is a set of edges, and an action trajectory is a sequence of selected edges. In the case of a tour (1-2-3-4-1), both action trajectories ((1, 2)->(2, 3)->(3, 4)->(4, 1)) and ((2, 3)->(3, 4)->(4, 1)->(1, 2)) give the same edge set, i.e., the same solution, ((1, 2), (2, 3), (3, 4), (4, 1)). Using the relationship between sets and sequences to find symmetries is applicable across problems, and the definition is not problem-specific. In a similar way, we can find symmetric action trajectories in the decap placement problems.

### More baselines

The comparison with Sym-NCO and POMO on TSP50 is included in Appendix D.5.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C + SRT | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C | 6.780 (± 0.208) | 6.129 (± 0.021) |
| Sym-NCO | 7.035 (± 0.209) | 6.334 (± 0.045) |
| POMO | 7.910 (± 0.055) | 7.074 (± 0.010) |

We also provide results from more baselines in molecular optimization. The average AUC10 over 23 tasks, i.e., 23 oracle functions, is reported with 5 independent runs as follows.

| REINVENT (SELFIES) + SRT | REINVENT (SELFIES) | Graph GA | GPBO |
| -------- | -------- | -------- | -- |
| 0.632 | 0.615 | 0.616 | 0.609 |

| Hill Climbing | STONED | GFlowNet + SRT | GFlowNet-AL | GFlowNet |
| -------- | -------- | -------- | -- | -- |
| 0.591 | 0.588 | 0.489 | 0.432 | 0.431 |

The baselines include Bayesian optimization (GPBO), genetic algorithms (Graph GA, STONED), hill climbing, and model-based learning (GFlowNet-AL). Note that Graph GA and GPBO are already included in Appendix D.4, and the others will be included with full results, as in Table 9.

### More detailed analysis

We provide more analysis as follows:

- Examining log-likelihood differences between symmetric trajectories with A2C in TSP50 (see the sketch and results below)
- Assessing the robustness of the high-reward sample collecting strategies
- Performing self-ablations (provided in Appendix D)
- Conducting statistical analysis for molecular optimization; please see the response to the corresponding question
- Evaluating the effectiveness of SRT when incorporated with symmetry-induced DRL methods; please see the response to the corresponding question

These additional analyses will be integrated into the manuscript to enrich our findings and provide a more comprehensive understanding of our approach.
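For reference, the log-likelihood-difference analysis (first bullet above) can be computed roughly as in the sketch below. The helper interfaces (`policy.log_prob`, `sample_symmetric_trajectory`) are hypothetical, and the exact metric follows the equation in Section 5.1 of the manuscript; this sketch only conveys that the gap is zero when the policy assigns equal probability to symmetric trajectories.

```python
def log_likelihood_difference(policy, solutions, sample_symmetric_trajectory):
    """Mean absolute log-likelihood gap between two symmetric trajectories per solution.

    Hypothetical sketch: `sample_symmetric_trajectory(x)` returns a random action
    trajectory that constructs solution x; `policy.log_prob(tau)` returns log pi(tau).
    A perfectly symmetry-aware policy yields a value of zero.
    """
    gaps = []
    for x in solutions:
        tau = sample_symmetric_trajectory(x)
        tau_prime = sample_symmetric_trajectory(x)
        gaps.append(abs(policy.log_prob(tau) - policy.log_prob(tau_prime)))
    return sum(gaps) / len(gaps)
```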
#### Log-likelihood differences

The log-likelihood differences for A2C are reported below; note that A2C (on-policy) achieves the highest performance.

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C | 433.57 (± 19.50) | 463.87 (± 31.96) |
| A2C + SRT | 40.45 (± 7.43) | 22.21 (± 0.70) |

In the ideal scenario where the policy learns symmetries with broad exploration over the action trajectory space, the log-likelihood difference becomes zero (since symmetric trajectories give the same solution). The results indicate that SRT effectively reduces the log-likelihood difference without the need for additional reward evaluations.

#### Robustness of the high-reward sample collecting strategies

We conducted experiments with different high-reward sample collection strategies in TSP50.

Validation costs

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C + SRT (Greedy) | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C + SRT (higher-than-average) | 6.533 (± 0.031) | 6.041 (± 0.006) |

Log-likelihood differences

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C + SRT (Greedy) | 40.45 (± 7.43) | 22.21 (± 0.70) |
| A2C + SRT (higher-than-average) | 32.34 (± 3.12) | 21.97 (± 1.69) |

The results show that SRT is effective with both strategies, which means we can flexibly employ high-reward sample collection strategies according to the task setup. We will add these results to the appendix.

In addition, self-ablation studies are included; please see Appendix D.

- The multiply sampled symmetric trajectories (Appendix D.1)
- Comparison with experience buffer (Appendix D.2)
- Reward-maximizing loss with symmetric trajectories (Appendix D.3)

Lastly, the effectiveness of SRT with DRL methods that consider symmetries in other tasks is included in Appendix D.5; we discuss this further in the following response.

### Symmetric nature and its generalizability

We have already discussed the generalizability in the first response, "Definition and further explanation of symmetric trajectories."

## Questions

### Reward calls and timesteps

We include the runtime for the hardware design optimization tasks. The simulation for reward evaluation serves as the computational bottleneck, so the number of reward calls is proportional to the number of timesteps.

| | Chip-package PDN | HBM PDN |
| --- | --- | --- |
| A2C | 4,204.50 (± 342.90) | 33,289.75 (± 682.85) |
| A2C + SRT | 4,265.25 (± 425.38) | 33,294.75 (± 1,722.59) |
| PG-Rollout | 3,747.00 (± 516.97) | 34,251.25 (± 573.18) |
| PG-Rollout + SRT | 3,795.25 (± 803.31) | 30,351.25 (± 1,150.44) |
| PPO | 3,884.25 (± 81.15) | 32,148.75 (± 623.47) |
| PPO + SRT | 4,700.00 (± 58.26) | 32,479.25 (± 168.71) |
| GFlowNet | 3,528.75 (± 26.66) | 35,968.00 (± 3,942.48) |
| GFlowNet + SRT | 3,528.25 (± 17.80) | 35,867.75 (± 5,307.87) |

Since we measure the validation average cost per epoch, the data points have varying x-values when plotted against runtime. Thus, instead of plotting the results over time, we will provide the runtime results for the hardware design optimization tasks. Note that the experiments on TSP50 use synthetic data and entail measuring time for log-likelihood differences, which is not computationally cheap. In the molecular optimization tasks, early termination rules are implemented when the policy repeatedly fails to generate new candidates. Consequently, comparing performance over runtime may not be appropriate in these cases.

### Statistical analysis for Table 3

We perform a t-test to compare the average AUC10 scores, setting the null hypothesis as $\mu_{Base} = \mu_{SRT}$. The resulting **p-values are 0.0045 and 0.0015** for REINVENT and GFlowNet, respectively. Thus, the statistical analysis indicates that SRT enhances sample efficiency significantly for both REINVENT and GFlowNet.

| | REINVENT | REINVENT + SRT | GFlowNet | GFlowNet + SRT |
| -- | -- | -- | -- | -- |
| Mean | 0.615 | 0.633 | 0.431 | 0.489 |
| Variance | 1.32E-05 | 7.36E-05 | 8.98E-06 | |
| Observations | 5 | 5 | 5 | 5 |
| Hypothesized Mean Diff. | 0 | | 0 | |
| t Stat | -4.1332 | | -6.4356 | |
| P(T<=t) one-tail | 0.0045 | | 0.0015 | |
| t Critical one-tail | 2.0150 | | 2.1318 | |
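For reproducibility, a one-tailed t-test of this form can be computed as in the sketch below. The per-seed AUC10 arrays are illustrative placeholders (not the actual scores), and the unequal-variance (Welch) variant is an assumption.

```python
from scipy import stats

# Illustrative placeholder AUC10 scores over 5 seeds (not the actual values).
base_auc = [0.611, 0.613, 0.615, 0.617, 0.619]  # e.g., REINVENT
srt_auc = [0.625, 0.629, 0.633, 0.636, 0.642]   # e.g., REINVENT + SRT

# One-tailed test of H0: mu_Base = mu_SRT against H1: mu_Base < mu_SRT.
t_stat, p_value = stats.ttest_ind(base_auc, srt_auc, equal_var=False, alternative="less")
print(f"t = {t_stat:.4f}, one-tailed p = {p_value:.4f}")
```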
We will include these results in the manuscript.

### Comparison with other baselines (data augmentation and replay-ratio scaling approaches)

**(Data Augmentation)** We agree that data augmentation such as cropping [8, 9], noise injection [8, 10], and Mixup [10, 11] can improve sample efficiency. However, these methods focus on visual RL scenarios (continuous spaces) and hence are not directly applicable to CO. Moreover, in CO, even slight changes to solutions can result in drastically different rewards due to the combinatorial nature of the problem. Therefore, general data augmentation approaches necessitate reward evaluations for the augmented samples.

Another contribution of the proposed method is the use of a log-likelihood maximization loss and separated training steps. They make SRT applicable to on-policy DRL methods without introducing importance weights, which often lead to large variances. The experiments in Appendix D.3 demonstrate that SRT is more effective than reward-maximization training with symmetrically augmented samples.

Validation costs

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C + SRT | 6.586 (± 0.043) | 6.038 (± 0.005) |
| A2C + Symmetric Augmentation | 16.851 (± 11.162) | 6.179 (± 0.003) |

Log-likelihood differences

| | 100K | 2M |
| -------- | -------- | -------- |
| A2C + SRT (Greedy rollout) | 40.45 (± 7.43) | 22.21 (± 0.70) |
| A2C + Symmetric Augmentation | 23.65 (± 22.33) | 445.04 (± 19.05) |

The results are in Table 8 and Figure 10 in Appendix D.3. Similar to general experience replay training with data augmentation, using a reward-maximizing loss for symmetrically augmented trajectories can diminish performance, leading to increased validation costs. Noticeably, the log-likelihood difference dramatically increases when the validation costs drop; please see Figure 10.

**(Replay Ratio Scaling)** Most replay-ratio scaling approaches assume replay training, and thus off-policy training, and propose how to further increase the replay ratio by preventing overfitting or primacy bias. On the other hand, we propose a new replay method for both off-policy and on-policy training in CO. Though SRT demonstrates that it can sustain a high replay ratio while suffering less from overfitting, other replay-ratio scaling approaches, such as periodically resetting policy parameters or sophisticated experience buffer designs, can be incorporated to further enhance sample efficiency. In summary, SRT does not compete with other replay-ratio scaling techniques.

### When the base DRL methods already consider symmetries

As mentioned, if the RL training in Step A considers all trajectories, the benefits from SRT might diminish. However, the trajectory space is **exponentially large**, so exploring all trajectories is impossible, even when the reward is not expensive in CO. For example, in TSP with $n$ cities, there are $2^{n \times n}$ possible solutions, and each solution has $2n$ symmetric trajectories.

Second, we empirically show that **SRT can enhance sample efficiency when it works with methods proposed to consider symmetries in CO**. In the main experiments, SRT consistently improves GFlowNets, which consider symmetries by learning a Markovian flow on a directed acyclic graph. We observed that the log-likelihood differences decrease more rapidly in TSP50 when SRT is applied, as follows.
Validation cost

| | 10K | 50K | 100K | 1M |
| -------- | -------- | -------- | -- | -- |
| GFlowNet | 16.26 (± 0.060) | 7.74 (± 0.846) | 6.86 (± 0.073) | 6.32 (± 0.003) |
| GFlowNet + SRT | 14.58 (± 0.373) | 6.94 (± 0.074) | 6.63 (± 0.473) | 6.23 (± 0.018) |

Log-likelihood differences

| | 10K | 50K | 100K | 1M |
| -------- | -------- | -------- | -- | -- |
| GFlowNet | 165.53 (± 33.45) | 100.31 (± 13.58) | 31.25 (± 6.43) | 6.48 (± 0.25) |
| GFlowNet + SRT | 49.58 (± 21.07) | 13.10 (± 0.61) | 10.07 (± 0.57) | 6.81 (± 0.16) |

(We re-conducted the experiments with two seeds and a maximum of 1M reward calls.)

Furthermore, we also employ Sym-NCO (CVRP) and MatNet (asymmetric TSP and flow-shop scheduling) as base DRL methods. The results provided in Appendix D.4 and D.5 demonstrate that SRT consistently enhances sample efficiency. Lastly, our replay loss in Eq (1) and separated update steps can be beneficial because **SRT does not introduce importance weights even when applied with on-policy methods**.

### How the proposed method indeed improves exploration in under-explored regions

Throughout our experiments, we have measured the log-likelihood difference among symmetric trajectories. The observed reduction in the log-likelihood difference and the improved validation rewards suggest that SRT effectively trains the policy over the symmetric action trajectory space without necessitating additional reward calls. For clarity, we will refine the language in our manuscript to state that SRT is beneficial for training the policy over the symmetric trajectory space rather than stating that SRT improves exploration in under-explored regions.