# ICLR Review

## Comment to all reviewers

We thank all the reviewers for their comments, which helped us clarify the presentation of the paper. We now discuss all the missing related work that the reviewers brought to our attention. We would like to restate the objective of our paper, which was to demonstrate that generalization in constructive deep RL can be achieved by using a combination of simple techniques. Although some ideas were proposed before and evaluated in different contexts, our proposed method still includes several novelties in our opinion:

- systematic exploitation of equivariance, notably with repeated preprocessing
- simplified neural network architecture thanks to equivariance
- intuitive motivation for the smoothed policy gradient and for interleaving policy updates and local search
- a novel stochastic curriculum learning technique

Based on all the feedback, we have updated our submission. All the changes are written in orange in the new PDF for ease of inspection. We summarize what we have added and modified below:

1. We added a more detailed derivation of the smoothed policy gradient formula in the appendix.
2. We added eMAGIC(G) and changed some notations: eMAGIC(G) greedily chooses the next city from the probabilities given by the attention mechanism, without any sampling; eMAGIC is renamed eMAGIC(s1) and eMAGIC(s) is renamed eMAGIC(s10). Since the results for eMAGIC(s1) and eMAGIC(s10) may not be as interesting as those for eMAGIC(G), we moved them to the appendix to save space.
3. We added the variance analysis for eMAGIC(G), eMAGIC(S), eMAGIC(s1), and eMAGIC(s10) in the appendix.
4. We additionally tested GPN [1] on TSPLib, and tested Furthest Insertion and OR-Tools on problems with sizes ranging from 400 to 1002. The results are included in the appendix.
5. We updated the related work section to discuss the missing previous work.
6. We added the explanation of $^*$ in the footnotes of the tables, which we forgot to add in the previous version of the paper. We are sorry for the potential confusion. For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2, and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods.
7. We added GCN(BS) and GCN(BST) in Tables 1 and 2 to have more baselines.
8. We tested eMAGIC(G) and eMAGIC(S) on random TSP10000. We compared our results with LKH3, [2], [3], and [3] with a limited time budget. The details are included in the appendix. In summary, our approach achieves good results with a much shorter runtime than LKH3 and [3], and we achieve better results than [3] with a comparable time budget. These new experimental results further show that our method offers a better trade-off between performance and runtime: it can generate relatively good results in much less time.
9. We added one more ablation study in the appendix where our combined local search is removed both in training and testing.
10. We fixed all the typos and minor issues raised by the reviewers.

[1] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
## Reviewer n7h8 (score - 5)

Thank you for your review and for noting the quality of our paper. We answer your questions below.

**Response to question 1:**

Recall that $J(\boldsymbol\theta) = -\mathbb E[L_\sigma(\boldsymbol X)]$ is the standard objective used in most deep RL methods applied to TSP, where $L_\sigma(\boldsymbol X)$ is the length of the tour $\sigma$ output by the RL policy. Instead, we optimize $J^+(\boldsymbol\theta) = -\mathbb E[L_{\sigma_+}(\boldsymbol X)]$, where $L_{\sigma_+}(\boldsymbol X)$ is the tour length of $\sigma$ after applying local search. This helps better integrate RL and local search by smoothing the value landscape and training an RL agent to output a tour that can be improved by local search. This new objective function can be rewritten as:
$$
\begin{aligned}
J^+(\boldsymbol\theta) &= -\mathbb E_{\sigma \sim \pi_{\boldsymbol\theta},\,\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X)]\\
&= -\mathbb E_{\sigma \sim \pi_{\boldsymbol\theta}}\big[\mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma]\big]
\end{aligned}
$$
where $\rho(\sigma)$ denotes the distribution over tours induced by applying the stochastic local search to $\sigma$. Taking the gradient of this new objective:
$$
\begin{aligned}
\nabla_{\boldsymbol{\theta}} J^+(\boldsymbol{\theta}) &= - \mathbb E_\tau \left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}_{t} | \boldsymbol{s}_{t})\right) \mathbb E_{\sigma_+\sim\rho(\sigma)}[L_{\sigma_+}(\boldsymbol X) | \sigma] \right]\\
&\approx - \hat{\mathbb E}_B\left[\left(\sum_{t=1}^{N} \nabla_{\boldsymbol\theta} \log \pi_{\boldsymbol\theta}({a}^{(b)}_{t} | \boldsymbol{s}^{(b)}_{t})\right) \left(L_{\sigma^{(b)}_+}(\boldsymbol X^{(b)})\right)\right]
\end{aligned}
$$
where $\tau = (\boldsymbol{s}_1, a_1, \ldots)$ and $\sigma$ is its associated tour. We simply approximate the conditional expectation over $\rho(\sigma)$ by a single sample. Therefore, our gradient estimate is an unbiased estimator of the gradient of our new objective $J^+(\boldsymbol\theta)$.
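For concreteness, here is a minimal NumPy sketch of this estimator. The function and argument names are ours (not from the paper), and the per-step log-probability gradients are assumed to be supplied by the policy network's autodiff machinery:

```python
import numpy as np

def smoothed_policy_gradient(logprob_grads, improved_lengths, baselines=None):
    """Batch estimate of grad J^+(theta).

    logprob_grads[b]    : (N, P) array; row t is grad_theta log pi(a_t | s_t)
    improved_lengths[b] : tour length L_{sigma_+} AFTER the stochastic local search
    baselines[b]        : optional rollout baseline subtracted from the smoothed return
    """
    grad = np.zeros_like(logprob_grads[0][0])
    for b in range(len(logprob_grads)):
        ret = improved_lengths[b] - (baselines[b] if baselines is not None else 0.0)
        # minus sign: J^+ is the negated expected (locally improved) tour length
        grad -= logprob_grads[b].sum(axis=0) * ret
    return grad / len(logprob_grads)
```

The final division implements the batch average $\hat{\mathbb E}_B$ in the formula above.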
Using our policy rollout baseline introduces some bias into the estimation of the smoothed policy gradient; however, the variance reduction helps achieve better performance, as we observed in our experiments. Thank you for this question; we added this discussion in the final version, which indeed clarifies the derivation of the smoothed policy gradient and its properties.

**Response to question 2:**

When reporting our results, we followed previous practice [1, 2, 3, 4], which does not provide this information. In our experiments, we repeated all our experiments (training + testing) with 3 random seeds. The variances are shown below:

| Model | TSP20 | TSP50 | TSP100 | TSP200 | TSP500 | TSP1000 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| eMAGIC(G) | 0.043 | 0.049 | 0.050 | 0.064 | 0.056 | 0.066 |
| eMAGIC(s1) | 0.0413 | 0.0407 | 0.0509 | 0.0510 | 0.0553 | 0.0655 |
| eMAGIC(s10) | 0.0446 | 0.0486 | 0.0434 | 0.0679 | 0.0652 | 0.0713 |
| eMAGIC(S) | 0.0478 | 0.0582 | 0.0680 | 0.1207 | 0.1288 | 0.1742 |

The variances are quite low, showing that our method gives relatively good and stable results. Note that the variances increase with the amount of sampling (s1, s10, and S) for larger TSP instances, since there is more room for improvement. We added this table in the appendix.
**Response to question 3:**

For fairness, we chose to use the reported results for previous methods (marked with a star in Tables 1, 2, and 3) when we could, since our experimental set-up is the same and their respective authors have normally already tried to achieve the best possible results for their methods. Moreover, we also noticed that reproducing some reported results was sometimes challenging despite our best efforts. For instance, the source code of GPN does not exactly correspond to its description in their paper, and the source code of Fu et al. uses a very old version of CUDA. Having said that, we reported the results of our own run of GPN in Tables 1 and 2 in order to give an idea of its runtime.

For TSPLib, most previous work does not evaluate their approaches on those realistic instances. As additional baselines, we ran GPN, Furthest Insertion, and OR-Tools on TSPLib for more comparison; the results are as follows:

| Range | GPN |
| ----- | ----- |
| 50-199 | 56.46% |
| 200-399 | 147.10% |

| Range | Furthest Insertion | OR-Tools |
| ----- | ----- | ----- |
| 400-1002 | 10.11% | 3.57% |

The detailed results can be found in Tables 14-16 in the appendix. Overall, they show that eMAGIC performs much better than GPN and Furthest Insertion, and we still beat OR-Tools even when the problem size is larger. The runtime of our method is below 5 minutes for the larger instances and below 2 minutes for the smaller ones.
[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.
[4] Chaitanya K. Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem. CoRR, 2019a.

## Reviewer eA3s (score - 5)

Thank you for your time and for noting the excellent performance of our proposition. We address your criticisms and answer your questions below.

**Response to Weakness 1**

Thank you for sharing those references; we now discuss them in the final version. Compared to Lisicki et al., we propose a stochastic curriculum learning method. Moreover, we deal with much larger TSP instances. Our idea of removing visited cities is indeed similar to Peng et al.; however, we explore this technique in the context of generalization to large instances. Moreover, our proposition integrates several other novel ideas, e.g., iterative preprocessing and the smoothed policy gradient.

**Response to Weakness 2**

Our goal in this paper is to formulate a general framework (equivariant preprocessing + smoothed policy gradient + interleaving policy updates and local search + stochastic curriculum learning) that can train an RL solver on small instances and generalize to larger instances. Hence, we identify equivariance as a general property for preprocessing. When instantiating our approach for TSP, equivariance arises on several occasions, notably:

- independence to city permutation, as already noted
- scaling of city positions requires a corresponding transformation of the rewards (see the sketch below)
- removing visited cities requires a corresponding transformation of the output of the RL policy, and another of the returns
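To make the coordinate-scaling point concrete, here is a small NumPy check. The normalization scheme below (translating and scaling the cities into the unit square) is an illustrative choice of ours, not necessarily the exact preprocessing used in the paper:

```python
import numpy as np

def normalize_instance(cities):
    """Translate and scale the cities into the unit square; return the scale factor."""
    lo, hi = cities.min(axis=0), cities.max(axis=0)
    scale = (hi - lo).max()
    return (cities - lo) / scale, scale

def tour_length(cities, tour):
    """Length of the closed tour visiting the cities in the given order."""
    return np.linalg.norm(cities[tour] - cities[np.roll(tour, -1)], axis=1).sum()

rng = np.random.default_rng(0)
cities = rng.uniform(0.0, 100.0, size=(20, 2))
tour = rng.permutation(20)
norm_cities, scale = normalize_instance(cities)
# Tour lengths (and hence rewards/returns) transform by the same factor as the
# coordinates, so the policy can work on normalized instances and the returns
# can be rescaled afterwards.
assert np.isclose(tour_length(cities, tour), scale * tour_length(norm_cities, tour))
```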
**Response to Weakness 3**

Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as in the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.

**Response to Question 1**

Your interpretation is correct. We only change the returns, not the log probabilities. We propose an interpretation of this new objective (i.e., smoothing of the value landscape) in the paragraph before (8).

**Response to Question 2**

Good point! As mentioned in Appendix D, we set the number of epochs to 200. Since we focus on generalization, we noticed that training on instances of diverse sizes helps avoid overfitting to instances of a single size.

**Response to Question 3**

The training time does increase when we add local search, update the encodings of all nodes, or apply preprocessing. We found that local search contributes the most to the increase in training time, while all the other procedures have little impact. However, training time is not our main concern because our goal is to obtain a fast and generalizable TSP model. For your information, we provide below the training time increases when adding local search: for an epoch with TSP20 instances, the training time increases from 40s to 1 minute; for an epoch with TSP40 instances, the training time increases from 2 minutes to 4 minutes (with the hyperparameter settings indicated in the paper).

**Response to Question 4**

We used the results reported in Fu et al. We did not focus much on comparing with the different versions of GCN because (1) we wanted to compare mainly with other RL methods similar to ours (RL used as a constructive heuristic) and (2) GCN does not perform very well in terms of generalization, as shown in Table 2.
However, we can definitely add such comparisons in the final version:

| Model | TSP20 Len | Gap | Time | TSP50 Len | Gap | Time | TSP100 Len | Gap | Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 3.835 | 0.12% | 21m | 5.707 | 0.29% | 35m | 7.876 | 1.48% | 32m |
| GCN(BST) | 3.831 | 0.01% | 22m | 5.692 | 0.03% | 38m | 7.872 | 1.43% | 1.2h |
| eMAGIC(G) | 3.841 | 0.29% | 2.8s | 5.732 | 0.74% | 16s | 7.923 | 2.09% | 1.4m |
| eMAGIC(S) | 3.830 | 0.00% | 38s | 5.691 | 0.01% | 3.5m | 7.762 | 0.02% | 15m |

| Model | TSP200 Len | Gap | Time | TSP500 Len | Gap | Time | TSP1000 Len | Gap | Time |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| GCN(BS) | 16.19 | 51.02% | 4.6m | 30.37 | 83.55% | 38m | 51.26 | 122% | 52m |
| GCN(BST) | 16.21 | 51.21% | 4.0m | 30.43 | 83.89% | 31m | 51.10 | 121% | 3.2h |
| eMAGIC(G) | 11.14 | 3.89% | 36s | 17.52 | 5.89% | 2.0m | 24.70 | 6.85% | 4.9m |
| eMAGIC(S) | 10.77 | 0.50% | 2.4m | 17.03 | 2.92% | 9.7m | 24.13 | 4.36% | 27m |

where BS corresponds to beam search and BST to beam search with the shortest-tour heuristic. As can be seen, although GCN(BS) performs a little better than our model in Table 1 (small instances), its runtime is much longer than ours. If we use sampling in our model, our method beats GCN(BS) both in terms of runtime and performance.

**Response to Question 5**

Thank you for pointing out the POMO model [2]! We added a reference to it in our final version. Note that their work does not focus on generalization, and the largest TSP instances their approach is applied to are TSP100.
The performances of the baselines mainly come from previously reported results. We believe that they clearly suggest that our RL agent, which is only trained on TSP10-50, is competitive and generalizes to both large TSP instances and realistic instances. For the question of why the baselines are not the same in Tables 1-2 and Table 3, please refer to our response to question 3 of reviewer n7h8. As for Concorde and LKH, they generally give optimal solutions for almost all the TSPLib instances, but with a very large runtime; that is why we did not include them.

**Response to Question 6**

GPN$^{*}$ represents the results reported in the GPN paper [3] (* means the results come from previous papers), while GPN represents the results that we obtained using their published code. We did our best to run their algorithm, but could not achieve the good performance reported in their paper. We contacted the authors about this issue but did not obtain any response. We reported our own experimental results for GPN to give an idea of its running time.

**Response to Question 7**

Our goal for testing on TSPLib was to show that our method can generalize from training on small random TSP instances to realistic TSP instances. We believe that, in terms of performance, our method is very competitive among learning-based methods. The runtime of our method is below 5 minutes for the larger instances and below 2 minutes for the smaller ones. We also ran extra experiments with Furthest Insertion and OR-Tools on the TSPLib instances with sizes from 400 to 1002:

| Range | eMAGIC(S) | Furthest Insertion | OR-Tools |
| ----- | ----- | ----- | ----- |
| 400-1002 | 3.40% | 10.11% | 3.57% |

The detailed results are added in the appendix. They show good generalization ability: we perform better on large instances than Furthest Insertion and OR-Tools. The other baselines have worse gaps as the problem sizes grow. Notice that our average gap for the instances with sizes from 400 to 1002 is smaller than their average gaps on smaller problems, indicating that we can expect our average gap to remain smaller in the 400-1002 range. This further suggests the greater generalization ability of our proposition.
**Response to Additional Feedbacks**

Thank you for reviewing our paper so carefully! We corrected all the issues in the final version.

[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
[2] Kwon et al. POMO: Policy Optimization with Multiple Optima for Reinforcement Learning. NeurIPS 2020.
[3] Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. CoRR, 2019.

## Reviewer 13UN (score - 5)

Thank you for your constructive feedback and for noting the generalizability of our method, notably through equivariance. We address your criticisms and questions below.

**Response to Weakness 1**

Our main goal was to propose several techniques that can improve the generalization of deep RL-based solvers. We demonstrate our proposition in the class of deep RL methods that learn constructive heuristics for TSP. As discussed in our paper, Fu et al.'s method does not belong to that class: their method performs adaptive search on the instance to be solved in addition to learning on other TSP instances, while deep RL constructive heuristics only learn on other TSP instances, without learning on the current instance. A possible avenue to further improve our method would be to also adapt to the instance to be solved (e.g., by replacing local search with MCTS), as we mentioned in our conclusion.

Moreover, [1] may have a potential memory issue. For testing M TSP instances of size n simultaneously, [1] has an $O(M \cdot n^2)$ space complexity, since it needs to store the pairwise heatmap, while ours is only $O(M \cdot n)$. For testing on 16 instances of TSP10000, [1] needs 13.4 GB of space to store the heatmap (a back-of-the-envelope estimate is sketched below).
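The following is our own rough arithmetic, assuming the heatmap is stored as dense float64 values; it is only meant to illustrate the gap between the two space complexities at this scale:

```python
M, n = 16, 10_000           # 16 TSP10000 instances tested simultaneously
bytes_per_value = 8         # assumption: dense float64 storage
heatmap_gb = M * n * n * bytes_per_value / 1e9   # O(M * n^2) pairwise heatmap
per_city_gb = M * n * bytes_per_value / 1e9      # O(M * n) per-city quantities
print(f"heatmap: {heatmap_gb:.1f} GB, per-city: {per_city_gb:.4f} GB")
# heatmap: 12.8 GB, per-city: 0.0013 GB -- the same order as the 13.4 GB reported above
```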
Also, we would like to recall that not achieving state-of-the-art performance should not be grounds for rejection, as mentioned in ICLR's guide for reviewers. We believe that demonstrating that our proposed techniques can boost generalization in constructive RL methods is an interesting achievement. Having said that, we now provide a further comparison with Fu et al.'s method: their performance may not scale well with a limited time budget. When given a similar time budget, our proposition provides better solutions than Fu et al.'s method. See the discussion below and Tables 1 and 2 of our updated submission.

Compared to LKH, our method offers a better trade-off in terms of performance vs runtime: it can generate relatively good results in much less time. To show that, we modified the hyperparameters of the combined local search such that $I=2$ and $\beta=1.4$, and tested on 16 TSP10000 instances. We modified the hyperparameters in this way because we find that, for large TSP instances, doing too many iterations of local search is not efficient. We compared the results with [1], [2], and LKH3. We also tested [1] with a limited time budget, denoted AGCRN+M$^{lim}$: since we only reproduced [1] successfully on CPU, we decreased their hyperparameter $T$ (the MCTS ends after no more than $T$ seconds) from $0.04n$ to $0.04n / 1.8 \times (28/60) = 0.010n$ (1.8h and 28m are the runtimes of [1] and eMAGIC(G)) in order to test one instance at a time on CPU. With this setting, we expect [1] and eMAGIC(G) to have a similar running time. Note that with their original setting $T=0.04n$, we obtain results similar to their reported ones. The results on TSP10000 are provided in the following table:

| Method | Type | Len | Gap (to LKH3) | Time |
| ------ | ---- | ----- | ------------- | ---- |
| LKH3$^{*}$ | H | 70.78 | 0.00% | 8.8h |
| AM$^{*}$ | RL(S) | 431.6 | 501% | 13m |
| AM$^{*}$ | RL(G) | 141.7 | 97.4% | 6.0m |
| AM$^{*}$ | RL(BS) | 129.4 | 80.3% | 1.8h |
| AGCRN+M$^{*}$ | SL+M | 74.92 | 4.4% | 1.8h |
| AGCRN+M$^{lim}$ | SL+M | 80.11 | 13.2% | - |
| eMAGIC(G) | RL(LS) | 79.28 | 10.5% | 28m |
| eMAGIC(S) | RL(LS) | 78.79 | 9.76% | 1.0h |

For eMAGIC(G), we greedily pick the next city without any sampling. We can observe that our models outperform [2] by a large margin. Compared to [1] and LKH3, our algorithm runs much faster without a big gap in performance. With a similar runtime, eMAGIC(G)'s performance is better than [1]. Also, Fu et al.'s MCTS is written in C++; we could have an even shorter runtime if we had also written our code in C++ instead of Python. We added a discussion of these results in the appendix of the new version of the paper.

**Response to Weakness 2**

We should have been more precise in our writing. We meant that, among the deep RL-based constructive heuristics, our method reaches state-of-the-art performance, as shown in Tables 1-3. We recall that our model is trained on TSP instances of sizes 10-50, and can then be used to solve much larger random or realistic TSP instances.

[1] Zhang-Hua Fu, Kai-Bin Qiu, and Hongyuan Zha. Generalize a small pre-trained model to arbitrarily large TSP instances. In AAAI, 2021.
[2] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.

## Reviewer XHWq (score - 6)

Thank you for your review and for noting the competitiveness of our proposition. We address your criticisms and answer your questions below.

**Response to Weakness 1**

Thank you for pointing out the work of Cai et al. After checking their paper, they indeed also train an RL model to initialize a heuristic optimizer. However, there are several differences with our approach:

- their RL model learns a perturbative heuristic, while ours learns a constructive heuristic.
- they do not consider the generalization issue as we do.
- they focus on bin packing while we focus on TSP.
- we provide an intuitive explanation of why this approach can work, via the derivation of the smoothed policy gradient.

**Response to Weakness 2**

Although we demonstrate our approach on TSP, we believe that our general framework can be applied to other combinatorial optimization problems. In particular, for routing problems (e.g., TSP/VRP and variants), our equivariant preprocessing could still be applied and our architecture can be adapted in a similar way as in the attention model [1]. For the local search component, any combination of stochastic heuristics designed for the particular combinatorial optimization problem could be used.

**Response to Minor Issue 1**

We did the extra ablation study where the local search is removed both in training and testing. The results are as follows:

| TSP Size | Full Len | Full Gap | w/o LS* Len | w/o LS* Gap |
| ----- | ----- | ----- | ----- | ----- |
| TSP20 | 3.844 | 0.37% | 3.873806477 | 1.14% |
| TSP50 | 5.763 | 1.27% | 6.172138214 | 8.46% |
| TSP100 | 7.964 | 2.61% | 8.688361168 | 11.95% |
| TSP200 | 11.14 | 3.89% | 12.19170316 | 13.74% |
| TSP500 | 17.54 | 6.01% | 19.30734062 | 16.69% |
| TSP1000 | 24.75 | 7.05% | 27.10354869 | 17.24% |

We added it in the final version of the paper.

**Response to Minor Issue 2**

Thank you for pointing this issue out. We corrected it in the final version of the paper!

[1] Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems!, 2019.
