# [Rebuttal] Meta-SAGE
## Opening
Thank you for the valuable comments; they will be extremely helpful in improving our paper. We provide detailed responses to each reviewer below, as well as a general response regarding our research target.
---
The main goal of this study is not to create a state-of-the-art solver for specific CO tasks. The primary claim is that our new meta-learning algorithm can mitigate the scale shift of pre-trained DRL models across various CO tasks. We propose a meta-learning based test-time adaptation method, not an NCO solver itself. To clarify our motivation and research direction, we state our research objective below (it will also be added to the main paper).
**Research Objective**
Our main research objective is to mitigate the scale distributional shift of existing DRL-NCO models by leveraging our novel meta-learning and scheduled test-time adaptation techniques. Our method is supportive rather than standalone: it can be applied on top of various 'existing' DRL-NCO methods to improve them.
Similar to our work, efficient active search (EAS) and simulation-guided beam search (SGBS) were proposed as NCO test-time adaptation methods. However, they focus on problems of relatively small size ($N \leq 200$), so they show limited performance on large-scale problems ($N > 200$). Note that the distribution shift grows as the test size increases, since pre-trained models with $N = 100$ are employed in all cases. Therefore, we suggest an improved version of EAS by introducing meta-learning and temperature scheduling to achieve high performance even under larger distribution shifts (e.g., pre-trained on $N=100$, tested on $N=1000$).
## Reply to Reviewer w7eR
**Important Note.** Although the rebuttal period for this emergency review is less than 24 hours, please don't hesitate to bring up any outstanding concerns during the upcoming reviewer-author discussion period (March 20th to March 26th). We appreciate your input and look forward to addressing any remaining issues then.
### Research scope and baseline choices
Thanks for the insightful comments. We agree that there are remarkable algorithms based on mathematical programming, and these algorithms have theoretical guarantees. We categorize approaches in neural combinatorial optimization (which may not match the traditional categories used in operations research) based on their algorithmic structure as follows:
- Heuristic-Based Methods (mainly described in the paper)
- **Constructive heuristics** (Auto-regressive)
- Improvement heuristics
- Mathematical Programming-Based Methods
- Learning to configure algorithms: deep learning models support decisions related to configurations (i.e., parameters) of algorithms (e.g., weighting rules for branching score [1], scheduling of primal heuristics in branch-and-bound [2])
  - ML-augmented: algorithmic components that are heuristic or computationally expensive are replaced by deep learning models (e.g., selection of branching nodes [3,4] or cutting planes [5]).
In this paper, we **concentrate on** improving deep learning **constructive heuristics** because of the following reasons.
1. In mathematical programming-based methods, the main algorithms are not trained from data; thus, the main algorithms do not suffer from distribution shifts.
2. Mathematical programming-based methods require a relatively long time to obtain a solution. Therefore, test-time adaptation schemes, which solve a given problem several times to train models, are unsuitable.
3. Improvement heuristics already contain instance-specific processing, such as iterative evaluation of the objective function value (which is defined only once the instance is specified).
*We will revise the paper to state our research scope more clearly.*
**Motivation of (Euclidean) DRL-NCO, including our method.** The primary goal of deep reinforcement learning-based neural combinatorial optimization (DRL-NCO) is not to outperform or replace existing operations research (OR) methods but rather to improve the learning efficiency and generalization capability of DRL methods on combinatorial space. DRL-NCO has a unique advantage in its applicability to black-box combinatorial optimization problems that cannot be solved through traditional mathematical programming. For example, DRL-NCO can be used for protein optimization [10] and hardware routing [11]. The Traveling Salesman Problem (TSP) and its variants are excellent benchmarks for measuring the exploration capability and learning efficiency of DRL methods, which is the main reason for their use as benchmarks.
**Motivation of our adaptation method and baseline choice.** In this research, we aim to address the issue of **distributional shift at scale**, which is a crucial topic in machine learning, particularly for deep reinforcement learning (DRL). We focus on exploring the effectiveness of adaptation methods to mitigate distribution shifts in the combinatorial action space and episodic Markov decision process (MDP), which has received less attention in the ML community.
We compare our proposed approach with existing adaptation methods such as active search, efficient active search (EAS), and simulation-guided beam search (SGBS), which aim to mitigate distribution shifts in combinatorial action spaces. The primary evaluation metric for our study is the efficiency of these adaptation methods in mitigating distribution shifts:
1. Our first objective is to minimize the cost while keeping the number of adaptation steps fixed at $K$ (which is proportional to the time spent on test-time adaptation). This can be seen in Table 1 and Table 3, where we compare the performance of different adaptation methods in terms of cost.
2. Our second objective is to minimize the number of adaptation steps required to achieve a specific cost goal. This is demonstrated in Figure 5 and Figure 8, where we analyze how many adaptation steps each method needs to reach a given cost. Our goal is to identify the adaptation method that requires the fewest steps ($K$) to achieve the desired cost.
The OR baselines are not our primary competitors but are important measures for objectives 1 and 2. We conduct comparative experiments with EAS and SGBS to verify the proposed meta-learning and bias scheduling approach. The results demonstrate that our method is effective in mitigating (scale) distribution shifts in both fixed adaptation scenarios (i.e., objective 1) and achieving specific cost goals (i.e., objective 2). Furthermore, our findings demonstrate that our method outperforms EAS and SGBS in terms of both objectives 1 and 2, providing further evidence of its effectiveness in handling distribution shifts.
### Applicability of the proposed locality bias
Thanks for bringing this issue to our attention. We appreciate your feedback and will address it in the main paper's discussion section.
The locality bias can be applied beyond Euclidean optimization problems. For instance, in non-Euclidean routing problems such as the Asymmetric TSP, edge attributes are defined as distances between nodes, and the graph is fully connected. In this scenario, our Meta-SAGE model with locality bias can be directly applied. A recent study [12] has demonstrated the effectiveness of a non-Euclidean NCO solver based on an AM-like architecture, which can be further enhanced by incorporating Meta-SAGE for test-time adaptation. Additionally, the locality bias can also be leveraged in the field of 3D drug discovery, where the relative distance (i.e., 3D Euclidean distance) between atoms of a molecule plays a crucial role as an inductive bias [13].
**Important remark 1.** Our locality bias is not a hard heuristic rule but rather a soft bias designed to aid training. While it is true that the locality bias may not work for all graphs, we would like to clarify that the underlying AM architecture is a learnable neural architecture that includes a global attention mechanism between non-local nodes. Additionally, the level of locality bias gradually decreases over time. While we understand your concern, we believe that our approach effectively balances the benefits of locality bias with the need for a global attention mechanism.
**Important remark 2.** Our approach goes beyond simply suggesting a specific inductive bias; we also provide a way to control the bias level based on the adaptation level. Therefore, our method is not limited to addressing locality bias; it can be used to control other meaningful inductive biases as well. In our paper, we argue that while inductive bias can be helpful in the early stages of adaptation, it should be gradually reduced as adaptation progresses. Our key insight is that inductive bias is only useful when dealing with significant distributional shifts. This can be compared to how parents hold onto their child's bicycle during the early stages of learning but eventually let go once the child can ride on their own. We believe this novel perspective can lead to a better understanding and application of inductive bias in machine learning.
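To make this remark concrete, below is a minimal sketch of how a soft, distance-based bias can be injected into the decoder's selection logits and then annealed as adaptation proceeds. This is not the exact formulation in our paper: the additive form `logits - alpha * dist`, the dummy inputs, and the per-step multiplicative decay are illustrative assumptions only.

```python
import numpy as np

def biased_logits(logits, dist, alpha):
    """Soft locality bias: closer nodes get higher logits while alpha is large.
    As alpha -> 0, the original (global-attention) logits are recovered."""
    return logits - alpha * dist  # illustrative additive penalty on distant nodes

logits = np.zeros(5)                          # dummy decoder logits for 5 candidate nodes
dist = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # distances from the current node

# Scheduled decay of the bias level over K adaptation steps (gamma_1 = decay rate).
alpha, gamma_1, K = 1.0, 0.995, 200
for k in range(K):
    biased = biased_logits(logits, dist, alpha)  # would feed the decoder's softmax at step k
    alpha *= gamma_1                             # the bias fades as the model adapts to the instance
```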
### Instance generation and evaluation metric
**Instance generation.** We generate training and test instances following Kool et al. [6], like many works in NCO [7-9]; a minimal generation sketch is given after the list below. *We will add these details to the appendix*.
- TSP: The $(x, y)$ coordinates of each node are sampled uniformly at random from $[0,1]$.
- CVRP: The $(x, y)$ coordinates of each node are uniformly distributed within $[0,1]$, while the demand of each node is sampled from the uniform distribution over $[1, 9]$.
- PCTSP: The $(x, y)$ coordinates and the penalty of each node are sampled from the uniform distribution over $[0,1]$.
- OP: The $(x, y)$ coordinates of each node are sampled uniformly at random from $[0,1]$. Furthermore, each node's prize ($\rho_i$) is determined based on its distance to the depot ($\rho_i = 1 +\lfloor 99 \cdot \frac{d_{0i}}{\max_{j=1}^{n}d_{0j}}\rfloor$), such that it increases with distance.
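The NumPy sketch below illustrates these generators. The function and field names are our own, and the uniform depot sampling for CVRP/PCTSP/OP is an assumption following Kool et al. [6] rather than something stated in the list above.

```python
import numpy as np

def generate_instance(problem, n, seed=0):
    """Sample one random instance in the unit square, following Kool et al. [6]."""
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0.0, 1.0, size=(n, 2))          # (x, y) of each node
    if problem == "tsp":
        return {"coords": coords}
    depot = rng.uniform(0.0, 1.0, size=2)                 # depot location (assumed uniform)
    if problem == "cvrp":
        demand = rng.uniform(1.0, 9.0, size=n)            # node demands in [1, 9]
        return {"depot": depot, "coords": coords, "demand": demand}
    if problem == "pctsp":
        penalty = rng.uniform(0.0, 1.0, size=n)           # node penalties in [0, 1]
        return {"depot": depot, "coords": coords, "penalty": penalty}
    if problem == "op":
        d = np.linalg.norm(coords - depot, axis=1)        # d_{0i}: distance to the depot
        prize = 1 + np.floor(99 * d / d.max())            # rho_i = 1 + floor(99 * d_0i / max_j d_0j)
        return {"depot": depot, "coords": coords, "prize": prize}
    raise ValueError(f"unknown problem: {problem}")

instance = generate_instance("cvrp", n=500)
```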
**Evaluation metric.** We measure the performance using baseline solvers for each task. As commented, this is not an optimality gap since these baselines are not exact methods (except Concorde, which is based on the cutting plane method); thus, *we changed our terminology in the paper*. Precisely, the (performance) gap is computed as
$$ \text{Performance Gap} = \frac{obj - obj_B}{obj_B} \times 100 (\%),$$
where $obj$ is the objective value of the DRL method, and $obj_B$ is the objective value from the baseline solver (Concorde, LKH3, OR-Tools, or Compass).
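For completeness, a small sketch of how the reported gap is computed per instance and then averaged over the test set (variable names and the numbers are ours, for illustration only):

```python
import numpy as np

def performance_gap(obj, obj_b):
    """Gap (%) of a DRL method's objective against the baseline solver on the same instances."""
    return (obj - obj_b) / obj_b * 100.0

obj = np.array([16.90, 17.10])     # DRL objective values (illustrative numbers)
obj_b = np.array([16.55, 16.80])   # baseline solver objective values (illustrative numbers)
print(performance_gap(obj, obj_b).mean())  # averaged gap reported in the tables
```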
**Solver setting.**
We set the solver parameters following AM [6] for CVRP, PCTSP, and OP.
- Concorde for TSP: We implemented the code in Julia and used the default settings of Concorde.jl with a scaling factor of 10,000 (since Concorde supports integer coefficients only); see https://github.com/chkwon/Concorde.jl for details.
- LKH3 for CVRP: We conduct 10 runs, and each run is performed with a maximum of 10,000 trials. We use the 'SPECIAL' parameters as specified in the CVRP run script (run CVRP in http://akira.ruc.dk/~keld/research/LKH-3/BENCHMARKS/CVRP.tgz).
- OR-Tools for PCTSP: We set the parameters the same as AM except for the local search time, which we set to 60 seconds while AM used 10 seconds.
- Compass for OP: We used the default parameters; see https://github.com/bcamath-ds/compass for details.
**Results in Table 1 (synthetic benchmarks of random instances).** Consistent with other literature on NCO [6-9], we present our experimental findings as the average score and performance gap on random instances in Table 1. This approach allows meaningful comparisons between neural methods, since the instance distribution is uniform within $[0,1] \times [0,1]$ and the optimal score is similar across instances. Empirically, neural methods that give consistently high average scores outperform those that give low average scores. However, we acknowledge that Table 1 alone is insufficient to claim that ML methods outperform OR methods, as a uniform distribution may not provide a comprehensive assessment of solver performance.
**Additional result on the real-world CVRPLIB benchmark.** In contrast to synthetic benchmarks, real-world benchmarks contain instances from various distributions that are not normalized to $[0,1] \times [0,1]$. To address this, we compute the optimality gap with respect to the best-known objective value and average it over instances. As suggested, we present our findings for CVRPLIB, which show that our method outperforms other neural adaptation methods:
| Range | LKH3 | Sym-NCO EAS | Sym-NCO Ours |
|----------------|-------------|-------------|--------------|
| 100 ≤ N < 200 | 0.003±0.002 | 0.017±0.010 | 0.016±0.008 |
| 200 ≤ N < 300 | 0.009±0.005 | 0.025±0.009 | 0.021±0.009 |
| 300 ≤ N < 400 | 0.010±0.004 | 0.034±0.016 | 0.024±0.010 |
| 400 ≤ N < 500 | 0.013±0.006 | 0.045±0.017 | 0.027±0.011 |
| 500 ≤ N < 600 | 0.010±0.006 | 0.050±0.019 | 0.032±0.008 |
| 600 ≤ N < 700 | 0.013±0.006 | 0.058±0.018 | 0.036±0.009 |
| 700 ≤ N < 800 | 0.018±0.006 | 0.056±0.013 | 0.031±0.008 |
| 800 ≤ N < 900 | 0.017±0.017 | 0.084±0.038 | 0.049±0.018 |
| 900 ≤ N ≤ 1000 | 0.015±0.008 | 0.073±0.029 | 0.042±0.015 |
**Note**: Here, we compute the 'optimality gap' based on the best-known objective value (reported as the upper bound on the CVRPLIB website) as follows:
$$\text{Optimality Gap} = \frac{obj - best}{best} \times 100.$$
### References
[1] Balcan, M.-F., Dick, T., Sandholm, T., and Vitercik, E. (2018). Learning to branch. In International conference on machine learning, pages 344–353. PMLR.
[2] Chmiela, A., Khalil, E., Gleixner, A., Lodi, A., and Pokutta, S. (2021). Learning to schedule heuristics in branch and bound. Advances in Neural Information Processing Systems, 34.
[3] Gasse, M., Chételat, D., Ferroni, N., Charlin, L., & Lodi, A. (2019). Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32.
[4] Gupta, P., Gasse, M., Khalil, E., Mudigonda, P., Lodi, A., & Bengio, Y. (2020). Hybrid models for learning to branch. Advances in neural information processing systems, 33.
[5] Tang, Y., Agrawal, S., & Faenza, Y. (2020). Reinforcement learning for integer programming: Learning to cut. In International conference on machine learning (pp. 9367-9376). PMLR.
[6] Kool, W., van Hoof, H., & Welling, M. Attention, Learn to Solve Routing Problems!. In International Conference on Learning Representations.
[7] Kwon, Y. D., Choo, J., Kim, B., Yoon, I., Gwon, Y., & Min, S. (2020). Pomo: Policy optimization with multiple optima for reinforcement learning. Advances in Neural Information Processing Systems, 33, 21188-21198.
[8] Kim, M., Park, J., & Park, J. Sym-NCO: Leveraging Symmetricity for Neural Combinatorial Optimization. In Advances in Neural Information Processing Systems.
[9] Hottung, A., Kwon, Y. D., & Tierney, K. Efficient Active Search for Combinatorial Optimization Problems. In International Conference on Learning Representations.
[10] Angermueller, C., Dohan, D., Belanger, D., Deshpande, R., Murphy, K., & Colwell, L. (2020). Model-based reinforcement learning for biological sequence design.
[11] Kim, H., Kim, M., Kim, J., & Park, J. Collaborative symmetricity exploitation for offline learning of hardware design solver. In 3rd Offline RL Workshop: Offline RL as a 'Launchpad'.
[12] Kwon, Y. D., Choo, J., Yoon, I., Park, M., Park, D., & Gwon, Y. (2021). Matrix encoding networks for neural combinatorial optimization. Advances in Neural Information Processing Systems, 34, 5138-5149.
[13] Liu, Y., Wang, L., Liu, M., Lin, Y., Zhang, X., Oztekin, B., & Ji, S. (2022, January). Spherical message passing for 3d molecular graphs. In International Conference on Learning Representations (ICLR).
### Specific comments.
Thanks for the detailed comments. We revised our manuscript based on the comments.
Q1. I don't understand the main analogy at the end of the introduction.
A1. We agree that this analogy leads to some misunderstanding. We will remove it.
Q2. Definition of state is not clear in the statement on line 88.
$\boldsymbol{x}$ is a vector, but the space/components of this vector are not well defined. The reader is left wondering whether they are binaries, real numbers, or points in space? I understand that there are $N$ components to this vector but beyond this it is not clear. This ambiguity is resolved in the example given in Figure 1.
A2. $\boldsymbol{x}$ is a vector of 2D coordinates, so $\boldsymbol{x} \in \mathbb{R}^{2N}$. Usually, $\boldsymbol{x}$ denotes the decision variables in MIP, but in this paper $\boldsymbol{x}$ denotes the node information of a problem instance.
Q3. Line 114 what is "in-differentiable"? I believe you mean "non-differentiable".
A3. Yes, "non-differentiable" is correct; we have revised it. Thanks.
Q4. Line 120 "In previous literature..." is used but no citation is given.
A4. We added the citation.
Q5. Avoid using the notation $h^T$ for target embedding as this looks like "transpose".
A5. We revised it as $h^A$ (A means 'Adapted').
Q6. The parameter $K$ is introduced only in the introduction and not at all in the main math making it very easy to lose track of.
A6. Thanks for pointing it out. We revised the paper to explain the meaning of $K$ in both the main text and the experiments section to improve readability.
Q7. The introduction of the four main CO problems is very informal and requires knowledge of these problems in order to understand their graphical nature. While TSP is well known to a broad audience, CVRP, PCTSP, and OP are somewhat more specific, so a more formal introduction may be warranted.
A7. We agree. We will provide additional explanations of CVRP, PCTSP, and OP in the main text.
Q8. The statement of the goal of PCTSP is not very clear. The objective is to minimize the penalty subject to achieving a minimum reward.
A8. Thanks for the comment. The cost function is the sum of the total travel distance and the penalties of unvisited nodes, subject to collecting at least the given minimum prize. We will revise the main paper to state this more clearly.
Q9. Mention the Euclidean nature of the CO problems in order to justify the locality bias.
A9. We have already included a discussion of the Euclidean nature of the CO problems we focus on.
---
## Reply to Reviewer zgFH
### Hyperparameters
The following are mentioned as hyperparameters in the review:
1. the number of test time adaptation steps ($K$)
2. balance between RL and IL objective ($\lambda$)
3. locality bias coefficient ($\alpha$)
4. scheduling of locality bias ($\gamma_1$)
5. temperature scheduling ($\gamma_2$)
However, $K$ and $\lambda$ come from EAS, not our method; thus, we follow the choices of EAS for these hyperparameters. Note that $K$ is a controllable parameter depending on the computation budget. The performance depends on $K$, but we set $K = 200$ for both ours and EAS. The locality bias coefficient $\alpha$ is not a hyperparameter since it is initialized to one and updated by the scheduling operation (Eq. 5), like the temperature. We agree that $\alpha$ may look like a hyperparameter, which is misleading; we will revise the related expressions to improve readability.
Therefore, only two hyperparameters, $\gamma_1$ and $\gamma_2$, are newly introduced, and we set $\gamma_1 = \gamma_2$ in practice. *We provide a sensitivity analysis for these hyperparameters below.*
| | Meta-SAGE | Meta-SAGE | Meta-SAGE | EAS |
| -------- | --------- | --------- | --------- | --- |
| $(\gamma_1,\gamma_2)$| (0.99, 0.99) | (0.995, 0.995) | (0.999, 0.999) | (1.00,1.00) |
| TSP500 | 17.20 | 17.22 | 17.45 | 18.63 |
| CVRP500 | 62.74 | 62.69 | 62.89 | 63.97 |
We conduct experiments on TSP and CVRP with 500 customers. The parameters are set to $0.99 \leq \gamma_1 = \gamma_2 < 1.0$, since they are exponential decay rates. The results show that our method is robust to changes in $\gamma_1$ and $\gamma_2$ and consistently outperforms EAS (the baseline method).
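For intuition about this range, the following rough calculation (assuming the schedule multiplies the current level by $\gamma$ once per adaptation step) shows how much of the initial bias/temperature level remains after $K = 200$ steps:

```python
K = 200
for gamma in (0.99, 0.995, 0.999):
    print(f"gamma = {gamma}: {gamma ** K:.3f} of the initial level remains after {K} steps")
# gamma = 0.99: 0.134, gamma = 0.995: 0.367, gamma = 0.999: 0.819
```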
**Important note:** We set identical hyperparameters for all 12 tasks: three scale benchmarks ($N=200, 500, 1000$) for four combinatorial optimization problems (TSP, CVRP, PCTSP, and OP). The hyperparameter choices are not critical to the performance.
### Effectiveness of SML
Although our temperature scheduling gives an effective improvement over EAS, our SML gives a larger improvement. The table below shows that EAS + SML (row 2) performs better than EAS + scheduled temperature (row 3). When every algorithmic component is applied, the best performance is achieved.
| SML | Sche. Temperature | Sche. Locality Bias | TSP500 | CVRP500 |
| --- | ----------------- | ------------------- | ----- | ----- |
| | | | 18.63 | 63.97 |
| :heavy_check_mark: | | | 17.70 | 63.17 |
| | :heavy_check_mark: | | 17.73 | 63.26 |
| :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | **17.22** | **62.69** |
### Standard deviation
Thanks for the constructive suggestion. *We updated Table 3 in the paper with the standard deviation over multiple instances* as follows:
**Table 3**
| Range | LKH3 | Sym-NCO EAS | Sym-NCO Ours |
|----------------|-------------|-------------|--------------|
| 100 ≤ N < 200 | 0.003±0.002 | 0.017±0.010 | 0.016±0.008 |
| 200 ≤ N < 300 | 0.009±0.005 | 0.025±0.009 | 0.021±0.009 |
| 300 ≤ N < 400 | 0.010±0.004 | 0.034±0.016 | 0.024±0.010 |
| 400 ≤ N < 500 | 0.013±0.006 | 0.045±0.017 | 0.027±0.011 |
| 500 ≤ N < 600 | 0.010±0.006 | 0.050±0.019 | 0.032±0.008 |
| 600 ≤ N < 700 | 0.013±0.006 | 0.058±0.018 | 0.036±0.009 |
| 700 ≤ N < 800 | 0.018±0.006 | 0.056±0.013 | 0.031±0.008 |
| 800 ≤ N < 900 | 0.017±0.017 | 0.084±0.038 | 0.049±0.018 |
| 900 ≤ N ≤ 1000 | 0.015±0.008 | 0.073±0.029 | 0.042±0.015 |
### Grammar comments
Thanks for the helpful grammar comments. *We revised every grammar mistake based on the comments*. Note that we changed the notation $h^{T}$ to $h^{A}$ (A means 'Adapted').
## Reply to Reviewer 7eg7
### Readability of paper in the introduction
Thanks for the valuable comments. We agree that the motivation and the objective of our research are not clearly stated; *we added the motivation section and problem target subsection in the introduction to make them clear*.
Our primary **motivation is to mitigate the distributional shift** (precisely, the scale shift in this work) of pre-trained NCO models. To this end, we suggest a new meta-learning-style method built upon a recently proposed transfer learning algorithm (i.e., EAS). We improve EAS to achieve faster adaptation and handle larger distributional shifts (i.e., solving problems with $N > 200$ using a model pre-trained with $N=100$).
Our research focuses on how to mitigate the distributional shift of existing pre-trained DRL models. To explain the differences clearly, we review the extensive prior literature on NCO in the introduction to give a historical perspective on recent research directions. *We also revised the introduction for better readability overall*.
### Experiment setting
The main goal of this study is not to create a state-of-the-art solver for specific CO tasks. Instead, the primary claim is that our new meta-learning algorithm can mitigate the scale shift of pre-trained DRL models applicable to solving various CO tasks. Since the proposed method is a test-time adaptation method, not an NCO solver itself, we employ other adaptation methods, such as SGBS and EAS, for baselines.
The experimental results demonstrate the effectiveness of Meta-SAGE in mitigating scale shifts, even for much larger scales beyond the range of existing NCO transfer learning methods. Moreover, the findings show that our method, as well as SGBS and EAS, can be applied to existing pre-trained models to improve their performance without limiting the generalizability of DRL solvers. It is worth noting that these test-time adaptation methods can also be integrated into existing large-scale CVRP models.
**Note:** Our method outperforms EAS and SGBS at $N=200$, one of their main benchmarks.
### Comparisons with other large-scale algorithms
We acknowledge that many promising neural methods for large-scale CVRP can provide notably high performance at high speed [1, 3]. However, these algorithms usually contain a hierarchical structure and employ problem-specific heuristic solvers like LKH (Lin-Kernighan-Helsgaun).
For example, learning to delegate (L2D) has shown promising performance on large-scale CVRP. L2D first assigns customers to several clusters using a supervised learning model, then completes each tour via LKH. This approach requires expert labels and a SOTA heuristic solver to complete solutions; thus, its application to other CO tasks is limited. To be specific, there are two conditions for applying this approach. One is that the problem must be decomposable (e.g., CVRP contains multiple tours, so it can be decomposed based on tours). The other is that a solver exists that can give high-quality sub-solutions (e.g., there are high-quality solvers for TSP, but this is not guaranteed in general).
DRL solvers like L2D are directly trained to find high-quality solutions to the target problems, whereas Meta-SAGE is trained to relieve the distributional shift of the base DRL solvers. Since **our method is a general test-time adaptation method**, it **can be embedded into existing large-scale CVRP models** as well. Thus, as mentioned in the experimental setting, we conducted comparative experiments with other transfer learning algorithms for NCO, not with neural methods for large-scale problems.
### The use of the terminologies ('large-scale' vs. 'larger-scale')
The comment about the 'larger-scale' and 'large-scale' terminologies is sensible. Thanks for the helpful comment. As mentioned, our main research goal is mitigating the scale distributional shift of existing NCO methods, not proposing a standalone high-performance model for a specific 'large-scale' task. Thus, we focus on distribution shifts to a 'larger' scale, since DRL models are difficult to train directly when the problem size increases, which means the trained models are in practice obtained at 'smaller' scales. *We revised all related expressions and will change our title to* **"Meta-SAGE: Scale Meta-Learning Scheduled Adaptation with Guided Exploration for Mitigating Scale Shift on Combinatorial Optimization"**.
## References
[1] Li, S., Yan, Z., & Wu, C. (2021). Learning to delegate for large-scale vehicle routing. Advances in Neural Information Processing Systems, 34.
[2] Arnold, F., Gendreau, M., & Sörensen, K. (2019). Efficiently solving very large-scale routing problems. Computers & operations research, 107, 32-42.
[3] Zong, Z., Wang, H., Wang, J., Zheng, M., & Li, Y. (2022). RBG: Hierarchically Solving Large-Scale Routing Problems in Logistic Systems via Reinforcement Learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.