# ICLR Review
## Comment to all reviewers
We thank all the reviewers for their comments, which helped us clarify the presentation of the paper. We now discuss all the missing related work that the reviewers brought to our attention. We would like to restate the objective of our paper, which is to demonstrate that generalization in constructive deep RL can be achieved with a combination of simple techniques. Although some of these ideas were proposed before and evaluated in different contexts, our proposed method still includes, in our opinion, several novelties:
- systematic exploitation of equivariance, notably with repeated preprocessing
- simplified neural network architecture thanks to equivariance
- intuitive motivation for smoothed policy gradient and interleaving policy update and local search
- novel stochastic curriculum learning technique
Based on all the feedback, we have updated our submission. All the changes are written in orange in the new pdf for ease of inspection. We would like to summarize what we have added and modified below:
1. We added a more detailed derivation for the smoothed policy gradient formula in the appendix.
2. We added eMAGIC(G) and changed some notation: eMAGIC(G) greedily chooses the next city according to the probabilities given by the attention mechanism, without any sampling; we also renamed eMAGIC to eMAGIC(s1) and eMAGIC(s) to eMAGIC(s10). Since the results for eMAGIC(s1) and eMAGIC(s10) may be less interesting than those of eMAGIC(G), we moved them to the appendix to save space.
3. We added the variance analysis for eMAGIC(G), eMAGIC(S), eMAGIC(s1) and eMAGIC(s10) in the appendix.
4. We additionally tested GPN [1] on TSPLib, and tested Furthest Insertion and OR-Tools on problems with sizes ranging from 400 to 1002. The results are included in the appendix.
5. We updated the related work section to add some discussion on the missing previous work.
6. We added the explanation of $^*$ in the footnotes of the tables, which we forgot to add in the previous version of the paper. We are sorry for the potential confusion. For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2, and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods.
7. We added GCN(BS) and GCN(BST) in Tables 1 and 2 to have more baselines.
8. We tested eMAGIC(G) and eMAGIC(S) on random TSP10000 instances. We compared our results with LKH3, [2], [3], and [3] with a limited time budget. The details are included in the appendix. In summary, our approach achieves good results with a much shorter runtime than LKH3 and [3], and better results than [3] under a comparable time budget. These new experimental results further show that our method offers a better trade-off between performance and runtime: it can generate relatively good results in much less time.
9. We added one more ablation study in the appendix where our combined local search is removed both in training and testing.
10. We fixed all the typos and minor issues raised by the reviewers.
[1] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
<!-- <span style="color:red">Todo: Add these into the paper.</span> <span style="color:green">(1) added. (2) (3) added. (4): GPN not added; (5) novelty claim not modified. (6) (7) (8) (9) (10) added by wenbin</span> -->
## Reviewer n7h8 (score - 5)
Thank you for your review and for noting the quality of our paper. We answer your questions below:
**Response to question 1:**
<!-- <span style="color:red">Todo: Add this into the paper.</span> <span style="color:green">Added into the paper by yisenw.</span> -->
<!--<span style="color:red">ISSUE: the minus sign of the gradient formula</span>-->
Recall that $J(\boldsymbol\theta) = -\mathbb E[L_\sigma(\boldsymbol X)]$ is the standard objective used in most deep RL methods applied to TSP, where $L_\sigma(\boldsymbol X)$ is the length of the tour $\sigma$ output by the RL policy. Instead, we optimize $J^+(\boldsymbol\theta)= -\mathbb E[L_{\sigma_+}(\boldsymbol X)]$, where $L_{\sigma_+}(\boldsymbol X)$ is the length of the tour $\sigma_+$ obtained by applying local search to $\sigma$. This helps better integrate RL and local search by smoothing the value landscape and training the RL agent to output tours that can be improved by local search.
This new objective function can be rewritten as:
$$
\begin{aligned}
J^+(\boldsymbol\theta) &= -\mathbb E_{\sigma \sim \pi_\boldsymbol\theta,\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X)]\\
&= -\mathbb E_{\sigma \sim \pi_\boldsymbol\theta}[\mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma]]
\end{aligned}
$$
where $\rho(\sigma)$ denotes the distribution over tours induced by the application of the stochastic local search on $\sigma$.
Taking the gradient of this new objective:
$$
\begin{aligned}
\nabla_{\boldsymbol{\theta}} J^+(\boldsymbol{\theta})&= - \mathbb E_\tau \left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}_{t} | \boldsymbol{s}_{t})\right) \mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma] \right]\\
&\approx - \hat{\mathbb E}_B\left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}^{(b)}_{t} | \boldsymbol{s}^{(b)}_{t})\right)
\left(L_{\sigma^{(b)}_+}(\boldsymbol X^{(b)})\right)\right]
\end{aligned}
$$
where $\tau = (\boldsymbol{s}_1, a_1, \ldots)$ denotes a trajectory and $\sigma$ its associated tour.
We simply approximate the conditional expectation over $\rho(\sigma)$ with a single sample. Therefore, our gradient estimate is an unbiased estimator of the gradient of our new objective $J^+(\boldsymbol\theta)$.
Using our policy rollout baseline introduces some bias into the estimation of the smoothed policy gradient; however, the variance reduction it provides leads to better performance, as we observed in our experiments.
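To make the estimator concrete, below is a minimal PyTorch-style sketch of the loss whose gradient corresponds to the smoothed policy gradient with a rollout baseline. The helper names (`tour_length`, `stochastic_two_opt`, `smoothed_pg_loss`) are ours, and the simple stochastic 2-opt only stands in for the combined local search of the paper; treat this as an illustration under these assumptions rather than our actual implementation:

```python
import torch

def tour_length(tour, coords):
    # Length of the closed tour given as a permutation of city indices.
    pts = coords[tour]
    return (pts - pts.roll(-1, dims=0)).norm(dim=1).sum()

def stochastic_two_opt(tour, coords, n_moves=20):
    # Illustrative stand-in for the stochastic local search rho(sigma):
    # try a few random 2-opt reversals and keep the improving ones.
    best, best_len = tour.clone(), tour_length(tour, coords)
    n = tour.numel()
    for _ in range(n_moves):
        i, j = sorted(torch.randint(0, n, (2,)).tolist())
        if j - i < 2:
            continue
        cand = best.clone()
        cand[i:j] = torch.flip(cand[i:j], dims=[0])
        cand_len = tour_length(cand, coords)
        if cand_len < best_len:
            best, best_len = cand, cand_len
    return best_len  # a single sample of L_{sigma_+}(X)

def smoothed_pg_loss(log_probs, tours, baseline_lens, coords):
    # log_probs:     list of (N,) tensors, log pi(a_t | s_t) along each sampled tour
    # tours:         list of (N,) index tensors sampled from the policy
    # baseline_lens: (B,) tensor of rollout-baseline tour lengths
    # coords:        (B, N, 2) tensor of city coordinates
    losses = []
    for b, (lp, tour) in enumerate(zip(log_probs, tours)):
        improved_len = stochastic_two_opt(tour, coords[b])      # L_{sigma_+^{(b)}}
        advantage = (improved_len - baseline_lens[b]).detach()  # only the returns change
        losses.append(advantage * lp.sum())                     # REINFORCE term
    return torch.stack(losses).mean()  # minimizing this follows -grad J^+
```

Note that only the returns are replaced by the post-local-search tour lengths; the log probabilities remain those of the tours sampled by the policy.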
Thank you for this question, we added this discussion in the final version, which indeed clarifies the derivation of the smoothed policy gradient and its properties.
**Response to question 2:**
<!-- <span style="color:red">Todo: Add this into the appendix of the paper.</span> <span style="color:green">Added into the paper by yisenw.</span> -->
When reporting our results, we followed the practice of previous work [1, 2, 3, 4], which does not report this information. In our experiments, we repeated all experiments (training + testing) with 3 random seeds. The variances are shown below:
| Model | TSP20 | TSP50 | TSP100| TSP200 | TSP500 | TSP1000 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
|eMAGIC(G) |0.043| 0.049| 0.050| 0.064| 0.056 |0.066|
| eMAGIC(s1) | 0.0413 | 0.0407 | 0.0509 | 0.0510 | 0.0553 | 0.0655 |
| eMAGIC(s10) | 0.0446 | 0.0486 | 0.0434 | 0.0679 | 0.0652 | 0.0713 |
| eMAGIC(S) | 0.0478 | 0.0582 | 0.0680 | 0.1207 | 0.1288 | 0.1742 |
The variances are quite low, showing that our method gives relatively good and stable results. Note that the variance increases with the number of samples (s1, s10, and S) for larger TSP instances, since there is more room for improvement.
We added this table in the appendix.
**Response to question 3:**
For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2 and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods.
Moreover, we also noticed that reproducing some reported results was sometimes challenging despite our best efforts. For instance, the source code of GPN does not exactly correspond to its description in their paper, and the source code of Fu et al. uses a very old version of CUDA. Having said that, we reported the results of our own run of GPN in Tables 1 and 2 in order to give an idea of its runtime.
For TSPLib, most previous work does not evaluate their approaches on these realistic instances. As additional baselines, we ran GPN, Furthest Insertion, and OR-Tools on TSPLib for further comparison; the results are as follows:
| Range | GPN (gap) |
| ----- | ----- |
| 50-199 | 56.46% |
| 200-399 | 147.10% |

| Range | Furthest Insertion (gap) | OR-Tools (gap) |
| ----- | ----- | ----- |
| 400-1002 | 10.11% | 3.57% |
The detailed results can be found in Tables 14-16 in the appendix. Overall, they show that eMAGIC performs much better than GPN and Furthest Insertion, and we still beat OR-Tools on the larger instances. The runtime of our method is below 5 minutes for the larger instances and below 2 minutes for the smaller ones.
[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[4] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. CoRR, 2019a.
## Reviewer eA3s (score - 5)
Thank you for your time and for noting the excellent performance of our proposition. We address your criticisms and answer your questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: Add this to the paper, change some novelty claims in the paper; change CL into stochastic CL when we refer to novelty.</span> <span style="color:green">Added into the paper by yisenw. </span> -->
Thank you for sharing those references. We now discuss them in the final version.
Compared to Lisicki et al., we propose a stochastic curriculum learning method. Moreover, we deal with much larger TSP instances.
Our idea of removing visited cities is indeed similar to Peng et al.'s; however, we explore this technique in the context of generalization to large instances. Moreover, our proposition integrates several other novel ideas, e.g., iterative preprocessing and the smoothed policy gradient.
**Response to Weakness 2**
Our goal in this paper is to formulate a general framework (equivariant preprocessing + smoothed policy gradient + interleaving of policy updates and local search + stochastic curriculum learning) that can train an RL solver on small instances and generalize to larger ones. Hence, we identify equivariance as a general property for preprocessing.
When instantiating our approach on TSP, equivariance arises on several occasions, notably:
- independence with respect to city permutations, as already noted;
- scaling of the city positions, which requires a corresponding transformation of the rewards (see the sketch after this list);
- removal of visited cities, which requires a corresponding transformation of the output of the RL policy, and another of the returns.
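To illustrate the second bullet point, here is a minimal sketch of a scale-equivariant preprocessing step. The function and its normalization choice (mapping the remaining cities into the unit square) are assumptions made for illustration, not necessarily the exact preprocessing used in the paper:

```python
import numpy as np

def rescale_remaining_cities(coords, remaining):
    # coords:    (n, 2) array with the coordinates of all cities
    # remaining: indices of the not-yet-visited cities
    pts = coords[remaining]
    low = pts.min(axis=0)
    scale = max(float((pts.max(axis=0) - low).max()), 1e-9)  # avoid division by zero
    normalized = (pts - low) / scale  # remaining cities now lie in [0, 1]^2
    return normalized, scale

# Equivariance in practice: if the policy sees the normalized coordinates,
# any tour length (reward/return) computed in the normalized space must be
# multiplied back by `scale` to be consistent with the original instance.
```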
**Response to Weakness 3**
<!-- <span style="color:red">Todo: add more explaination if the overall answers are short.</span>
-->
Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and their variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.
**Response to Question 1**
<!-- <span style="color:red">Todo: Further check</span> -->
Your interpretation is correct: we only change the returns, not the log probabilities. We propose an interpretation of this new objective (i.e., smoothing of the value landscape) in the paragraph before (8).
**Response to Question 2**
Good point! As mentioned in Appendix D, we set the number of epochs to 200. Since we focus on generalization, we noticed that training on instances of diverse sizes helps avoid overfitting to instances of a single size.
**Response to Question 3**
The training time does increase when we add local search, update the encodings of all nodes, or apply preprocessing. We found that local search contributes the most to the increase in training time, while the other procedures have little impact. However, training time is not our main concern, because our goal is to obtain a fast and generalizable TSP model.
For your information, we provide below the increase in training time when adding local search: for an epoch of TSP20 instances, the training time increases from 40 seconds to 1 minute; for an epoch of TSP40 instances, it increases from 2 minutes to 4 minutes (with the hyperparameter settings indicated in the paper).
<!-- <span style="color:red">Todo: check this paragraph</span> -->
**Response to Question 4**
<!-- <span style="color:red">Todo: add big cases data and also if the space allows, add this two lines in the table 1, 2.</span> <span style="color:green">Add into the paper by yisenw. </span> -->
We used the results reported in Fu et al. We did not focus much on comparing with the different versions of GCN because (1) we wanted to compare mainly with other RL methods similar to ours (RL used as a constructive heuristic) and (2) GCN does not generalize well, as shown in Table 2. However, we can definitely add such comparisons in the final version:
| Model | TSP20 Len | TSP20 Gap | TSP20 Time | TSP50 Len | TSP50 Gap | TSP50 Time | TSP100 Len | TSP100 Gap | TSP100 Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 3.835 | 0.12% | 21m | 5.707 | 0.29% | 35m | 7.876 | 1.48% | 32m |
| GCN(BST) | 3.831 | 0.01% | 22m | 5.692 | 0.03% | 38m | 7.872 | 1.43% | 1.2h |
| eMAGIC(G) | 3.841 | 0.29% | 2.8s | 5.732 | 0.74% | 16s | 7.923 | 2.09% | 1.4m |
| eMAGIC(S) | 3.830 | 0.00% | 38s | 5.691 | 0.01% | 3.5m | 7.762 | 0.02% | 15m |

| Model | TSP200 Len | TSP200 Gap | TSP200 Time | TSP500 Len | TSP500 Gap | TSP500 Time | TSP1000 Len | TSP1000 Gap | TSP1000 Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 16.19 | 51.02% | 4.6m | 30.37 | 83.55% | 38m | 51.26 | 122% | 52m |
| GCN(BST) | 16.21 | 51.21% | 4.0m | 30.43 | 83.89% | 31m | 51.10 | 121% | 3.2h |
| eMAGIC(G) | 11.14 | 3.89% | 36s | 17.52 | 5.89% | 2.0m | 24.70 | 6.85% | 4.9m |
| eMAGIC(S) | 10.77 | 0.50% | 2.4m | 17.03 | 2.92% | 9.7m | 24.13 | 4.36% | 27m |
where BS corresponds to beam search and BST to beam search with the shortest-tour heuristic.
As can be seen, although GCN(BS) performs slightly better than our greedy model on the small instances of Table 1, its runtime is much longer than ours. With sampling, our method beats GCN(BS) in terms of both runtime and performance.
**Response to Question 5**
<!-- <span style="color:red">Todo: Add POMO into the final version of the papers (related work). </span> <span style="color:green">Todo: Added into related work by yisenw. I just add a citation (Kwon et al. 2021), maybe it is not enough. But it seems to me that this paper is not that important and does not worth so much attention in the related work. </span> -->
Thank you for pointing out the POMO model [2]! We added a reference to it in our final version. Note that their work does not focus on generalization, and the largest instances their approach is applied to are TSP100.
The performances of the baselines mainly come from previously reported results. We believe that they clearly suggest that our RL agent, which is only trained on TSP10-50, is competitive and generalizes to both large TSP instances and realistic instances.
For the question of why the baselines differ between Tables 1-2 and Table 3, please refer to our response to question 3 of reviewer n7h8.
Concorde and LKH generally give optimal solutions to almost all TSPLib instances, but with very long runtimes. That is why we did not include them.
<!-- <span style="color:red">Todo: Check this</span> -->
**Response to Question 6**
<!-- <span style="color:red">Todo: We did not add the footnotes of the * in the previous version of the paper. I think we need to talk about that (maybe say sorry for that confusion.)</span> -->
GPN$^{*}$ represents the results reported in the GPN paper [3] (* means the results come from previous papers), and GPN represents the results that we obtained using their published code. We did our best to run their algorithm, but could not reach the good performance reported in their paper. We contacted the authors about this issue but did not receive a response.
We reported our experimental results of GPN to give an idea of its running time.
**Response to Question 7**
<!-- <span style="color:red">oywb</span>
<span style="color:red">Todo: it's hard to provide all the runtime of those algorithms</span> -->
Our goal for testing on TSPLib was to show that our method can generalize from training on small random TSP instances to realistic TSP instances. We believe that in terms of performance, our method is very competitive among learning-based methods. The runtime for our method is below 5 minutes for the larger instances and below 2 minutes for smaller instances.
We also ran additional experiments with Furthest Insertion and OR-Tools on TSPLib instances with sizes ranging from 400 to 1002:
| Range | eMAGIC(S) (gap) | Furthest Insertion (gap) | OR-Tools (gap) |
| ----- | ----- | ----- | ----- |
| 400-1002 | 3.40% | 10.11% | 3.57% |
The detailed results are added in the appendix. These results show good generalization: we perform better on these large instances than Furthest Insertion and OR-Tools. The other baselines exhibit larger gaps as problem sizes grow. Notice that our average gap on instances with sizes from 400 to 1002 is smaller than their average gaps even on smaller problems; since their gaps worsen with size, we can expect our average gap to remain smaller than theirs in the 400-1002 range. This further suggests the stronger generalization ability of our proposition.
<!-- <span style="color:red">TODO: Check above</span> -->
**Response to Additional Feedback**
<!-- <span style="color:red">Todo: better to say what have we fixed.</span> <span style="color:green">Fixed by yisenw. </span> -->
<!-- <span style="color:red">maybe provide some explanation for why this is no need to change: 1. Sec 3: “this RL problem corresponds to a repeated N-horizon sequential decision-making problem.” —> repeated (N-t)-horizon ? <!-- 2. Sec 4.2 1. Should $\tilde{x}_{\sigma(t-1)}$ be replaced by $\tilde{x}_{\sigma(t)}$ in the definition of the vector at $\sigma(n)$? (As in the definition for $\sigma(1)$) --> <!-- 2. As a GNN, it is invariant with respect to .. —> equivariant --> </span> -->
Thank you for reviewing our paper so carefully! We corrected all the issues in the final version.
<!-- <span style="color:red">Todo: check above.</span> -->
<!-- For the second feedback of Sec 4.2, since $\tilde{x}_{\sigma(t-1)}$ represents the last visited city, $t-1$ is correct for the notation here. -->
<!--[1] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. CoRR, 2019.-->
[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[2] Kwon et al, POMO: Policy Optimization with Multiple Optima for Reinforcement Learning, NeurIPS 2020.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
## Reviewer 13UN (score - 5)
Thank you for your constructive feedback and for noting the generalizability of our method, notably thanks to equivariance. We address your criticisms and questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: Recall the expectation of iclr in SOTA results </span> -->
Our main goal was to propose several techniques that improve the generalization of deep RL-based solvers. We demonstrate our proposition within the class of deep RL methods that learn constructive heuristics for TSP. As discussed in our paper, Fu et al.'s method does not belong to this class: it performs an adaptive search on the instance to be solved in addition to learning on other TSP instances, whereas deep RL constructive heuristics only learn on other TSP instances, without learning on the current one. A possible avenue to further improve our method would be to also adapt to the instance being solved (e.g., by replacing the local search with MCTS), as we mentioned in our conclusion.
Moreover, [1] may have a potential memory issue. To test $M$ TSP instances of size $n$ simultaneously, [1] has an $O(M \cdot n^2)$ space complexity, since it needs to store the pairwise heatmap, while ours is only $O(M \cdot n)$. For testing on 16 instances of TSP10000, [1] needs 13.4 GB of memory to store the heatmap.
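As a rough sanity check of this order of magnitude (assuming one double-precision entry per city pair, which may differ from the actual implementation of [1]):
$$
16 \times 10000^2 \times 8~\text{bytes} \approx 12.8~\text{GB},
$$
which is consistent with the roughly 13.4 GB we observed.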
Also, we would like to recall that not achieving state-of-the-art performance should not be grounds for rejection, as mentioned in ICLR's guide for reviewers. We believe that demonstrating that our proposed techniques can boost generalization in constructive RL methods is an interesting achievement. Having said that, we now provide a further comparison with Fu et al.'s method: their performance may not scale well under a limited time budget. When given a similar time budget, our proposition provides better solutions than Fu et al.'s method. See the discussion below and Tables 1 and 2 of our updated submission.
Compared to LKH, our method offers a better trade-off between performance and runtime: it can generate relatively good results in much less time. To show this, we modified the hyperparameters of the combined local search to $I=2$ and $\beta=1.4$, and tested on 16 TSP10000 instances. We modified the hyperparameters in this way because we found that, for large TSP instances, running too many local search iterations is not efficient. We compared the results with [1], [2], and LKH3. We also tested [1] with a limited time budget, denoted AGCRN+M$^{lim}$: since we could only reproduce [1] successfully on CPU, we decreased their hyperparameter $T$ (the MCTS runs for at most $T$ seconds) from $0.04n$ to $0.04n \times \frac{28/60}{1.8} \approx 0.010n$, where 1.8 h and 28 min are the runtimes of [1] and eMAGIC(G) respectively, testing one instance at a time on CPU. With this setting, we expect [1] and eMAGIC(G) to have similar running times. Note that with their original setting $T=0.04n$, we obtain results similar to their reported ones.
The results on TSP10000 are provided in the following table:
| Method | Type | Len | Gap (to LKH3) | Time |
| ----- | ----- | ----- | ----- | ----- |
| LKH3$^*$ | H | 70.78 | 0.00% | 8.8h |
| AM$^*$ | RL(S) | 431.6 | 501% | 13m |
| AM$^*$ | RL(G) | 141.7 | 97.4% | 6.0m |
| AM$^*$ | RL(BS) | 129.4 | 80.3% | 1.8h |
| AGCRN+M$^*$ | SL+M | 74.92 | 4.4% | 1.8h |
| AGCRN+M$^{lim}$ | SL+M | 80.11 | 13.2% | - |
| eMAGIC(G) | RL(LS) | 79.28 | 10.5% | 28m |
| eMAGIC(S) | RL(LS) | 78.79 | 9.76% | 1.0h |
For eMAGIC(G), we greedily pick the next city without any sampling. We can observe that our models outperform [2] by a large margin. Compared to [1] and LKH3, our algorithm runs much faster without a big gap in performance, and with a similar runtime, eMAGIC(G) performs better than [1]. Also, Fu et al.'s MCTS is written in C++; we could achieve an even shorter runtime if our code were also written in C++ instead of Python. We added a discussion of these results in the appendix of the new version of the paper.
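For clarity, the difference between the greedy and sampling decoding modes amounts to the following one-liner; `attention_probs` is a placeholder for the per-step distribution produced by our attention decoder:

```python
import torch

def pick_next_city(attention_probs, greedy=True):
    # attention_probs: (n_remaining,) probabilities over the not-yet-visited cities
    if greedy:  # eMAGIC(G): always take the most probable city
        return int(torch.argmax(attention_probs))
    return int(torch.multinomial(attention_probs, num_samples=1))  # eMAGIC(S): sample
```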
<!-- <span style="color:red">Todo: check above</span> -->
**Response to Weakness 2**
We should have been more precise in our writing. We meant that, among deep RL-based constructive heuristics, our method reaches state-of-the-art performance, as shown in Tables 1-3. We recall that our model is trained on TSP instances of sizes 10-50 and can then be used to solve much larger random or realistic TSP instances.
<!-- <span style="color:red">Todo: Results of TSP10000</span> -->
[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
## Reviewer XHWq (score - 6)
Thank you for your review and for noting the competitiveness of our proposition. We address your criticisms and answer your questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: check the difference</span> -->
Thank you for pointing out the work of Cai et al. After checking their paper, we agree that they also train an RL model to initialize a heuristic optimizer. However, there are several differences with our approach:
- their RL model learns a perturbative heuristic, while ours learns a constructive heuristic;
- they do not consider the generalization issue as we do;
- they focus on bin packing, while we focus on TSP;
- we provide an intuitive explanation of why this approach can work, via the derivation of the smoothed policy gradient.
**Response to Weakness 2**
Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and their variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.
**Response to Minor Issue 1**
We performed the additional ablation study in which the combined local search is removed from both training and testing. The results are as follows:
| TSP Size | Full (Len) | Full (Gap) | w/o LS (Len) | w/o LS (Gap) |
| ----- | ----- | ----- | ----- | ----- |
| TSP20 | 3.844 | 0.37% | 3.874 | 1.14% |
| TSP50 | 5.763 | 1.27% | 6.172 | 8.46% |
| TSP100 | 7.964 | 2.61% | 8.688 | 11.95% |
| TSP200 | 11.14 | 3.89% | 12.19 | 13.74% |
| TSP500 | 17.54 | 6.01% | 19.31 | 16.69% |
| TSP1000 | 24.75 | 7.05% | 27.10 | 17.24% |
We added it in the final version of the paper.
<!-- <span style="color:red">Todo: check this</span> -->
<!-- <span style="color:red">Todo: ADD this to the paper.</span> <span style="color:green">Done by yisenw.</span> -->
**Response to Minor Issue 2**
Thank you for pointing this issue out. We corrected it in the final version of the paper!
<!-- <span style="color:red">Todo: Correct this in the paper.</span> </span> <span style="color:green">Done by yisenw.</span> -->
[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.