# ICLR Review
## Comment to all reviewers
We thank all the reviewers for their comments, which helped us clarify the presentation of the paper. We now discuss all the missing related work that the reviewers brought to our attention. We would like to restate the objective of our paper, which is to demonstrate that generalization in constructive deep RL can be achieved with a combination of simple techniques. Although some of these ideas were proposed before and evaluated in different contexts, our proposed method still includes, in our opinion, several novelties:
- systematic exploitation of equivariance, notably with repeated preprocessing
- simplified neural network architecture thanks to equivariance
- intuitive motivation for smoothed policy gradient and interleaving policy update and local search
- novel stochastic curriculum learning technique
Based on all the feedback, we have updated our submission. All the changes are written in orange in the new pdf for ease of inspection. We would like to summarize what we have added and modified below:
1. We added a more detailed derivation for the smoothed policy gradient formula in the appendix.
2. We added eMAGIC(G) and changed some notation: eMAGIC(G) greedily chooses the next city according to the probabilities given by the attention mechanism, without any sampling; we also renamed eMAGIC to eMAGIC(s1) and eMAGIC(s) to eMAGIC(s10). Since the results for eMAGIC(s1) and eMAGIC(s10) may be less interesting than those of eMAGIC(G), we moved them to the appendix to save space.
3. We added the variance analysis for eMAGIC(G), eMAGIC(S), eMAGIC(s1) and eMAGIC(s10) in the appendix.
4. We additionally tested GPN [1] on TSPLib, and tested Furthest Insertion and OR-Tools on problems with sizes ranging from 400 to 1002. The results are included in the appendix.
5. We updated the related work section to add some discussion on the missing previous work.
6. We added the explanation of $^*$ in the footnotes of the tables, which we forgot to add in the previous version of the paper. We are sorry for the potential confusion. For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2, and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods.
7. We added GCN(BS) and GCN(BST) in Tables 1 and 2 to have more baselines.
8. We tested eMAGIC(G) and eMAGIC(S) on random TSP10000 instances. We compared our results with LKH3, [2], [3], and [3] with a limited time budget. The details are included in the appendix. In summary, our approach achieves good results with a much shorter runtime than LKH3 and [3], and better results than [3] under a comparable time budget. These new experimental results further show that our method offers a better trade-off between performance and runtime: it can generate relatively good results in much less time.
9. We added one more ablation study in the appendix where our combined local search is removed both in training and testing.
10. We fixed all the typos and minor issues raised by the reviewers.
[1] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
<!-- <span style="color:red">Todo: Add these into the paper.</span> <span style="color:green">(1) added. (2) (3) added. (4): GPN not added; (5) novelty claim not modified. (6) (7) (8) (9) (10) added by wenbin</span> -->
## Reviewer n7h8 (score - 5)
Thank you for your review and for noting the quality of our paper. We answer your questions below:
**Response to question 1:**
<!-- <span style="color:red">Todo: Add this into the paper.</span> <span style="color:green">Added into the paper by yisenw.</span> -->
<!--<span style="color:red">ISSUE: the minus sign of the gradient formula</span>-->
Recall that $J(\boldsymbol\theta) = -\mathbb E[L_\sigma(\boldsymbol X)]$ is the standard objective used in most deep RL methods applied to TSP, where $L_\sigma(\boldsymbol X)$ is the length of the tour $\sigma$ output by the RL policy. Instead, we optimize $J^+(\boldsymbol\theta)= -\mathbb E[L_{\sigma_+}(\boldsymbol X)]$, where $L_{\sigma_+}(\boldsymbol X)$ is the length of the tour $\sigma_+$ obtained by applying local search to $\sigma$. This helps better integrate RL and local search by smoothing the value landscape and training the RL agent to output tours that can be improved by local search.
This new objective function can be rewritten as:
$$
\begin{aligned}
J^+(\boldsymbol\theta) &= -\mathbb E_{\sigma \sim \pi_\boldsymbol\theta,\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X)]\\
&= -\mathbb E_{\sigma \sim \pi_\boldsymbol\theta}[\mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma]]
\end{aligned}
$$
where $\rho(\sigma)$ denotes the distribution over tours induced by the application of the stochastic local search on $\sigma$.
Taking the gradient of this new objective:
$$
\begin{aligned}
\nabla_{\boldsymbol{\theta}} J^+(\boldsymbol{\theta})&= - \mathbb E_\tau \left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}_{t} | \boldsymbol{s}_{t})\right) \mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma] \right]\\
&\approx - \hat{\mathbb E}_B\left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}^{(b)}_{t} | \boldsymbol{s}^{(b)}_{t})\right)
\left(L_{\sigma^{(b)}_+}(\boldsymbol X^{(b)})\right)\right]
\end{aligned}
$$
where $\tau = (\boldsymbol{s}_1, a_1, \ldots)$ denotes a trajectory and $\sigma$ its associated tour.
We simply approximate the conditional expectation over $\rho(\sigma)$ with a single sample. Therefore, our gradient estimate is an unbiased estimator of the gradient of our new objective $J^+(\boldsymbol\theta)$.
Using our policy rollout baseline introduces some bias into the estimation of the smoothed policy gradient; however, the variance reduction it provides leads to better performance, as we observed in our experiments.
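To make the estimator concrete, below is a minimal PyTorch-style sketch of the loss whose gradient corresponds to the smoothed policy gradient with a rollout baseline. The helper names (`tour_length`, `stochastic_two_opt`, `smoothed_pg_loss`) are ours, and the simple stochastic 2-opt only stands in for the combined local search of the paper; treat this as an illustration under these assumptions rather than our actual implementation:

```python
import torch

def tour_length(tour, coords):
    # Length of the closed tour given as a permutation of city indices.
    pts = coords[tour]
    return (pts - pts.roll(-1, dims=0)).norm(dim=1).sum()

def stochastic_two_opt(tour, coords, n_moves=20):
    # Illustrative stand-in for the stochastic local search rho(sigma):
    # try a few random 2-opt reversals and keep the improving ones.
    best, best_len = tour.clone(), tour_length(tour, coords)
    n = tour.numel()
    for _ in range(n_moves):
        i, j = sorted(torch.randint(0, n, (2,)).tolist())
        if j - i < 2:
            continue
        cand = best.clone()
        cand[i:j] = torch.flip(cand[i:j], dims=[0])
        cand_len = tour_length(cand, coords)
        if cand_len < best_len:
            best, best_len = cand, cand_len
    return best_len  # a single sample of L_{sigma_+}(X)

def smoothed_pg_loss(log_probs, tours, baseline_lens, coords):
    # log_probs:     list of (N,) tensors, log pi(a_t | s_t) along each sampled tour
    # tours:         list of (N,) index tensors sampled from the policy
    # baseline_lens: (B,) tensor of rollout-baseline tour lengths
    # coords:        (B, N, 2) tensor of city coordinates
    losses = []
    for b, (lp, tour) in enumerate(zip(log_probs, tours)):
        improved_len = stochastic_two_opt(tour, coords[b])      # L_{sigma_+^{(b)}}
        advantage = (improved_len - baseline_lens[b]).detach()  # only the returns change
        losses.append(advantage * lp.sum())                     # REINFORCE term
    return torch.stack(losses).mean()  # minimizing this follows -grad J^+
```

Note that only the returns are replaced by the post-local-search tour lengths; the log probabilities remain those of the tours sampled by the policy.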
Thank you for this question, we added this discussion in the final version, which indeed clarifies the derivation of the smoothed policy gradient and its properties.
**Response to question 2:**
<!-- <span style="color:red">Todo: Add this into the appendix of the paper.</span> <span style="color:green">Added into the paper by yisenw.</span> -->
When reporting our results, we followed the practice of previous work [1, 2, 3, 4], which does not report this information. In our experiments, we repeated all experiments (training + testing) with 3 random seeds. The variances are shown below:
| Model | TSP20 | TSP50 | TSP100| TSP200 | TSP500 | TSP1000 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
|eMAGIC(G) |0.043| 0.049| 0.050| 0.064| 0.056 |0.066|
| eMAGIC(s1) | 0.0413 | 0.0407 | 0.0509 | 0.0510 | 0.0553 | 0.0655 |
| eMAGIC(s10) | 0.0446 | 0.0486 | 0.0434 | 0.0679 | 0.0652 | 0.0713 |
| eMAGIC(S) | 0.0478 | 0.0582 | 0.0680 | 0.1207 | 0.1288 | 0.1742 |
The variances are quite low, showing that our method gives relatively good and stable results. Note that the variance increases with the number of samples (s1, s10, and S) for larger TSP instances, since there is more room for improvement.
We added this table in the appendix.
**Response to question 3:**
For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2 and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods.
Moreover, we also noticed that reproducing some reported results was sometimes challenging despite our best efforts. For instance, the source code of GPN does not exactly correspond to its description in their paper, and the source code of Fu et al. uses a very old version of CUDA. Having said that, we reported the results of our own run of GPN in Tables 1 and 2 in order to give an idea of its runtime.
For TSPLib, most previous work does not evaluate their approaches on these realistic instances. As additional baselines, we ran GPN, Furthest Insertion, and OR-Tools on TSPLib for further comparison; the results are as follows:
| Range | GPN (gap) |
| ----- | ----- |
| 50-199 | 56.46% |
| 200-399 | 147.10% |

| Range | Furthest Insertion (gap) | OR-Tools (gap) |
| ----- | ----- | ----- |
| 400-1002 | 10.11% | 3.57% |
The detailed results can be found in Tables 14-16 in the appendix. Overall, they show that eMAGIC performs much better than GPN and Furthest Insertion, and we still beat OR-Tools on the larger instances. The runtime of our method is below 5 minutes for the larger instances and below 2 minutes for the smaller ones.
[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[4] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. CoRR, 2019a.
## Reviewer eA3s (score - 5)
Thank you for your time and for noting the excellent performance of our proposition. We address your criticisms and answer your questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: Add this to the paper, change some novelty claims in the paper; change CL into stochastic CL when we refer to novelty.</span> <span style="color:green">Added into the paper by yisenw. </span> -->
Thank you for sharing those references. We now discuss them in the final version.
Compared to Lisicki et al., we propose a stochastic curriculum learning method. Moreover, we deal with much larger TSP instances.
Our idea of removing visited cities is indeed similar to Peng et al.'s; however, we explore this technique in the context of generalization to large instances. Moreover, our proposition integrates several other novel ideas, e.g., iterative preprocessing and the smoothed policy gradient.
**Response to Weakness 2**
Our goal in this paper is to formulate a general framework (equivariant preprocessing + smoothed policy gradient + interleaving of policy updates and local search + stochastic curriculum learning) that can train an RL solver on small instances and generalize to larger ones. Hence, we identify equivariance as a general property for preprocessing.
When instantiating our approach on TSP, equivariance arises on several occasions, notably:
- independence with respect to city permutations, as already noted;
- scaling of the city positions, which requires a corresponding transformation of the rewards (see the sketch after this list);
- removal of visited cities, which requires a corresponding transformation of the output of the RL policy, and another of the returns.
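To illustrate the second bullet point, here is a minimal sketch of a scale-equivariant preprocessing step. The function and its normalization choice (mapping the remaining cities into the unit square) are assumptions made for illustration, not necessarily the exact preprocessing used in the paper:

```python
import numpy as np

def rescale_remaining_cities(coords, remaining):
    # coords:    (n, 2) array with the coordinates of all cities
    # remaining: indices of the not-yet-visited cities
    pts = coords[remaining]
    low = pts.min(axis=0)
    scale = max(float((pts.max(axis=0) - low).max()), 1e-9)  # avoid division by zero
    normalized = (pts - low) / scale  # remaining cities now lie in [0, 1]^2
    return normalized, scale

# Equivariance in practice: if the policy sees the normalized coordinates,
# any tour length (reward/return) computed in the normalized space must be
# multiplied back by `scale` to be consistent with the original instance.
```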
**Response to Weakness 3**
<!-- <span style="color:red">Todo: add more explaination if the overall answers are short.</span>
-->
Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and their variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.
**Response to Question 1**
<!-- <span style="color:red">Todo: Further check</span> -->
Your interpretation is correct: we only change the returns, not the log probabilities. We propose an interpretation of this new objective (i.e., smoothing of the value landscape) in the paragraph before (8).
**Response to Question 2**
Good point! As mentioned in Appendix D, we set the number of epochs to 200. Since we focus on generalization, we noticed that training on instances of diverse sizes helps avoid overfitting to instances of a single size.
**Response to Question 3**
The training time does increase when we add local search, update the encodings of all nodes, or apply preprocessing. We found that local search contributes the most to the increase in training time, while the other procedures have little impact. However, training time is not our main concern, because our goal is to obtain a fast and generalizable TSP model.
For your information, we provide below the increase in training time when adding local search: for an epoch of TSP20 instances, the training time increases from 40 seconds to 1 minute; for an epoch of TSP40 instances, it increases from 2 minutes to 4 minutes (with the hyperparameter settings indicated in the paper).
<!-- <span style="color:red">Todo: check this paragraph</span> -->
**Response to Question 4**
<!-- <span style="color:red">Todo: add big cases data and also if the space allows, add this two lines in the table 1, 2.</span> <span style="color:green">Add into the paper by yisenw. </span> -->
We used the results reported in Fu et al. We did not focus much on comparing with the different versions of GCN because (1) we wanted to compare mainly with other RL methods similar to ours (RL used as a constructive heuristic) and (2) GCN does not generalize well, as shown in Table 2. However, we can definitely add such comparisons in the final version:
| Model | TSP20 Len | TSP20 Gap | TSP20 Time | TSP50 Len | TSP50 Gap | TSP50 Time | TSP100 Len | TSP100 Gap | TSP100 Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 3.835 | 0.12% | 21m | 5.707 | 0.29% | 35m | 7.876 | 1.48% | 32m |
| GCN(BST) | 3.831 | 0.01% | 22m | 5.692 | 0.03% | 38m | 7.872 | 1.43% | 1.2h |
| eMAGIC(G) | 3.841 | 0.29% | 2.8s | 5.732 | 0.74% | 16s | 7.923 | 2.09% | 1.4m |
| eMAGIC(S) | 3.830 | 0.00% | 38s | 5.691 | 0.01% | 3.5m | 7.762 | 0.02% | 15m |

| Model | TSP200 Len | TSP200 Gap | TSP200 Time | TSP500 Len | TSP500 Gap | TSP500 Time | TSP1000 Len | TSP1000 Gap | TSP1000 Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 16.19 | 51.02% | 4.6m | 30.37 | 83.55% | 38m | 51.26 | 122% | 52m |
| GCN(BST) | 16.21 | 51.21% | 4.0m | 30.43 | 83.89% | 31m | 51.10 | 121% | 3.2h |
| eMAGIC(G) | 11.14 | 3.89% | 36s | 17.52 | 5.89% | 2.0m | 24.70 | 6.85% | 4.9m |
| eMAGIC(S) | 10.77 | 0.50% | 2.4m | 17.03 | 2.92% | 9.7m | 24.13 | 4.36% | 27m |
where BS corresponds to beam search and BST to beam search with the shortest-tour heuristic.
As can be seen, although GCN(BS) performs slightly better than our greedy model on the small instances of Table 1, its runtime is much longer than ours. With sampling, our method beats GCN(BS) in terms of both runtime and performance.
**Response to Question 5**
<!-- <span style="color:red">Todo: Add POMO into the final version of the papers (related work). </span> <span style="color:green">Todo: Added into related work by yisenw. I just add a citation (Kwon et al. 2021), maybe it is not enough. But it seems to me that this paper is not that important and does not worth so much attention in the related work. </span> -->
Thank you for pointing out the POMO model [2]! We added a reference to it in our final version. Note that their work does not focus on generalization, and the largest instances their approach is applied to are TSP100.
The performances of the baselines mainly come from previously reported results. We believe that they clearly suggest that our RL agent, which is only trained on TSP10-50, is competitive and generalizes to both large TSP instances and realistic instances.
For the question of why the baselines differ between Tables 1-2 and Table 3, please refer to our response to question 3 of reviewer n7h8.
Concorde and LKH generally give optimal solutions to almost all TSPLib instances, but with very long runtimes. That is why we did not include them.
<!-- <span style="color:red">Todo: Check this</span> -->
**Response to Question 6**
<!-- <span style="color:red">Todo: We did not add the footnotes of the * in the previous version of the paper. I think we need to talk about that (maybe say sorry for that confusion.)</span> -->
GPN$^{*}$ represents the results reported in the GPN paper [3] (* means the results come from previous papers), and GPN represents the results that we obtained using their published code. We did our best to run their algorithm, but could not reach the good performance reported in their paper. We contacted the authors about this issue but did not receive a response.
We reported our experimental results of GPN to give an idea of its running time.
**Response to Question 7**
<!-- <span style="color:red">oywb</span>
<span style="color:red">Todo: it's hard to provide all the runtime of those algorithms</span> -->
Our goal for testing on TSPLib was to show that our method can generalize from training on small random TSP instances to realistic TSP instances. We believe that in terms of performance, our method is very competitive among learning-based methods. The runtime for our method is below 5 minutes for the larger instances and below 2 minutes for smaller instances.
We also ran additional experiments with Furthest Insertion and OR-Tools on TSPLib instances with sizes ranging from 400 to 1002:
| Range | eMAGIC(S) (gap) | Furthest Insertion (gap) | OR-Tools (gap) |
| ----- | ----- | ----- | ----- |
| 400-1002 | 3.40% | 10.11% | 3.57% |
The detailed results are added in the appendix. These results show good generalization: we perform better on these large instances than Furthest Insertion and OR-Tools. The other baselines exhibit larger gaps as problem sizes grow. Notice that our average gap on instances with sizes from 400 to 1002 is smaller than their average gaps even on smaller problems; since their gaps worsen with size, we can expect our average gap to remain smaller than theirs in the 400-1002 range. This further suggests the stronger generalization ability of our proposition.
<!-- <span style="color:red">TODO: Check above</span> -->
**Response to Additional Feedback**
<!-- <span style="color:red">Todo: better to say what have we fixed.</span> <span style="color:green">Fixed by yisenw. </span> -->
<!-- <span style="color:red">maybe provide some explanation for why this is no need to change: 1. Sec 3: “this RL problem corresponds to a repeated N-horizon sequential decision-making problem.” —> repeated (N-t)-horizon ? <!-- 2. Sec 4.2 1. Should $\tilde{x}_{\sigma(t-1)}$ be replaced by $\tilde{x}_{\sigma(t)}$ in the definition of the vector at $\sigma(n)$? (As in the definition for $\sigma(1)$) --> <!-- 2. As a GNN, it is invariant with respect to .. —> equivariant --> </span> -->
Thank you for reviewing our paper so carefully! We corrected all the issues in the final version.
<!-- <span style="color:red">Todo: check above.</span> -->
<!-- For the second feedback of Sec 4.2, since $\tilde{x}_{\sigma(t-1)}$ represents the last visited city, $t-1$ is correct for the notation here. -->
<!--[1] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. CoRR, 2019.-->
[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[2] Kwon et al, POMO: Policy Optimization with Multiple Optima for Reinforcement Learning, NeurIPS 2020.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
## Reviewer 13UN (score - 5)
Thank you for your constructive feedback and for noting the generalizability of our method, notably thanks to equivariance. We address your criticisms and questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: Recall the expectation of iclr in SOTA results </span> -->
Our main goal was to propose several techniques that improve the generalization of deep RL-based solvers. We demonstrate our proposition within the class of deep RL methods that learn constructive heuristics for TSP. As discussed in our paper, Fu et al.'s method does not belong to this class: it performs an adaptive search on the instance to be solved in addition to learning on other TSP instances, whereas deep RL constructive heuristics only learn on other TSP instances, without learning on the current one. A possible avenue to further improve our method would be to also adapt to the instance being solved (e.g., by replacing the local search with MCTS), as we mentioned in our conclusion.
Moreover, [1] may have a potential memory issue. To test $M$ TSP instances of size $n$ simultaneously, [1] has an $O(M \cdot n^2)$ space complexity, since it needs to store the pairwise heatmap, while ours is only $O(M \cdot n)$. For testing on 16 instances of TSP10000, [1] needs 13.4 GB of memory to store the heatmap.
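As a rough sanity check of this order of magnitude (assuming one double-precision entry per city pair, which may differ from the actual implementation of [1]):
$$
16 \times 10000^2 \times 8~\text{bytes} \approx 12.8~\text{GB},
$$
which is consistent with the roughly 13.4 GB we observed.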
Also, we would like to recall that not achieving state-of-the-art performance should not be grounds for rejection, as mentioned in ICLR's guide for reviewers. We believe that demonstrating that our proposed techniques can boost generalization in constructive RL methods is an interesting achievement. Having said that, we now provide a further comparison with Fu et al.'s method: their performance may not scale well under a limited time budget. When given a similar time budget, our proposition provides better solutions than Fu et al.'s method. See the discussion below and Tables 1 and 2 of our updated submission.
Compared to LKH, our method offers a better trade-off between performance and runtime: it can generate relatively good results in much less time. To show this, we modified the hyperparameters of the combined local search to $I=2$ and $\beta=1.4$, and tested on 16 TSP10000 instances. We modified the hyperparameters in this way because we found that, for large TSP instances, running too many local search iterations is not efficient. We compared the results with [1], [2], and LKH3. We also tested [1] with a limited time budget, denoted AGCRN+M$^{lim}$: since we could only reproduce [1] successfully on CPU, we decreased their hyperparameter $T$ (the MCTS runs for at most $T$ seconds) from $0.04n$ to $0.04n \times \frac{28/60}{1.8} \approx 0.010n$, where 1.8 h and 28 min are the runtimes of [1] and eMAGIC(G) respectively, testing one instance at a time on CPU. With this setting, we expect [1] and eMAGIC(G) to have similar running times. Note that with their original setting $T=0.04n$, we obtain results similar to their reported ones.
The results on TSP10000 are provided in the following table:
| Method | Type | Len | Gap (to LKH3) | Time |
| ----- | ----- | ----- | ----- | ----- |
| LKH3$^*$ | H | 70.78 | 0.00% | 8.8h |
| AM$^*$ | RL(S) | 431.6 | 501% | 13m |
| AM$^*$ | RL(G) | 141.7 | 97.4% | 6.0m |
| AM$^*$ | RL(BS) | 129.4 | 80.3% | 1.8h |
| AGCRN+M$^*$ | SL+M | 74.92 | 4.4% | 1.8h |
| AGCRN+M$^{lim}$ | SL+M | 80.11 | 13.2% | - |
| eMAGIC(G) | RL(LS) | 79.28 | 10.5% | 28m |
| eMAGIC(S) | RL(LS) | 78.79 | 9.76% | 1.0h |
For eMAGIC(G), we greedily pick the next city without any sampling. We can observe that our models outperform [2] by a large margin. Compared to [1] and LKH3, our algorithm runs much faster without a big gap in performance, and with a similar runtime, eMAGIC(G) performs better than [1]. Also, Fu et al.'s MCTS is written in C++; we could achieve an even shorter runtime if our code were also written in C++ instead of Python. We added a discussion of these results in the appendix of the new version of the paper.
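For clarity, the difference between the greedy and sampling decoding modes amounts to the following one-liner; `attention_probs` is a placeholder for the per-step distribution produced by our attention decoder:

```python
import torch

def pick_next_city(attention_probs, greedy=True):
    # attention_probs: (n_remaining,) probabilities over the not-yet-visited cities
    if greedy:  # eMAGIC(G): always take the most probable city
        return int(torch.argmax(attention_probs))
    return int(torch.multinomial(attention_probs, num_samples=1))  # eMAGIC(S): sample
```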
<!-- <span style="color:red">Todo: check above</span> -->
**Response to Weakness 2**
We should have been more precise in our writing. We meant that, among deep RL-based constructive heuristics, our method reaches state-of-the-art performance, as shown in Tables 1-3. We recall that our model is trained on TSP instances of sizes 10-50 and can then be used to solve much larger random or realistic TSP instances.
<!-- <span style="color:red">Todo: Results of TSP10000</span> -->
[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
## Reviewer XHWq (score - 6)
Thank you for your review and for noting the competitiveness of our proposition. We address your criticisms and answer your questions below:
**Response to Weakness 1**
<!-- <span style="color:red">Todo: check the difference</span> -->
Thank you for pointing out the work of Cai et al. After checking their paper, we agree that they also train an RL model to initialize a heuristic optimizer. However, there are several differences with our approach:
- their RL model learns a perturbative heuristic, while ours learns a constructive heuristic;
- they do not consider the generalization issue as we do;
- they focus on bin packing, while we focus on TSP;
- we provide an intuitive explanation of why this approach can work, via the derivation of the smoothed policy gradient.
**Response to Weakness 2**
Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and their variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.
**Response to Minor Issue 1**
We performed the additional ablation study in which the combined local search is removed from both training and testing. The results are as follows:
| TSP Size | Full (Len) | Full (Gap) | w/o LS (Len) | w/o LS (Gap) |
| ----- | ----- | ----- | ----- | ----- |
| TSP20 | 3.844 | 0.37% | 3.874 | 1.14% |
| TSP50 | 5.763 | 1.27% | 6.172 | 8.46% |
| TSP100 | 7.964 | 2.61% | 8.688 | 11.95% |
| TSP200 | 11.14 | 3.89% | 12.19 | 13.74% |
| TSP500 | 17.54 | 6.01% | 19.31 | 16.69% |
| TSP1000 | 24.75 | 7.05% | 27.10 | 17.24% |
We added it in the final version of the paper.
<!-- <span style="color:red">Todo: check this</span> -->
<!-- <span style="color:red">Todo: ADD this to the paper.</span> <span style="color:green">Done by yisenw.</span> -->
**Response to Minor Issue 2**
Thank you for pointing this issue out. We corrected it in the final version of the paper!
<!-- <span style="color:red">Todo: Correct this in the paper.</span> </span> <span style="color:green">Done by yisenw.</span> -->
[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.