# ICLR 2024 LS-GFN rebuttal

## Internal discussion between authors

**Minsu: I updated Appendix B.7 and B.8; please take a look!**

**Note: For the paper revision, we are working on the shared Overleaf file, revision.tex**

*Minsu: I believe we don't need to make significant changes to the main pages, as there haven't been any comments regarding their structure or formatting. In response to Reviewer wUkH's suggestion to replace "destroy" with "backtrack," I've already made that revision. I've also renamed the "Conclusion" section to "Discussion" and updated the "Limitations and Further Work" section to align with the suggestions provided by Reviewer xEGb.*

*I think our task is simply to update the appendix to incorporate the additional experiments we conducted during this rebuttal.*

**Clarification for the experiments:**

*Minsu: All experiments accompanied by tables are newly conducted and will be populated within a day. In contrast, experiments with references to the manuscript (e.g., Table 1, Appendix xxx) are based on results that were included in the initial submission.*

---

## Overall Responses

We would like to express our sincere gratitude to all the reviewers for their valuable comments and feedback. In the following sections, we provide a summary of our revisions and responses to each reviewer's feedback.

**Paper revision**

- In response to feedback from Reviewer wUkH, we have replaced the term "destroy" with "backtrack."
- We have incorporated Reviewer xEGb's suggestion and Reviewer wUkH's feedback by updating the "Limitations and Further Work" section, which is now located in Section 5.
- In line with Reviewer wUkH's feedback, we have revised Appendix B.7, plotting the accuracy of our sampling method so that accuracy curves extending beyond 100% are visible.
- We have updated Appendix B.8 to incorporate Reviewer wUkH's suggestion to include additional baselines involving resampling and a filtering strategy.

We have highlighted every revised section in purple.

**Summary of Additional Experiments**

We have conducted the following additional experiments in response to the concerns raised.

- **Computational overhead (raised by all reviewers)**: We measured the computational overhead of LS-GFN and found that it does not introduce any additional computational load compared to the GFN baselines.
- **Additional baselines of resampling and a filtering strategy (suggested by Reviewer wUkH)**: LS-GFN demonstrates superior performance compared to these baselines. This is also reflected in the updated Appendix B.8.
- **Swapping lines 7 and 8 in the pseudocode (suggested by Reviewer wUkH)**: We compared the original LS-GFN algorithm with a swapped version, and the original algorithm performed better.
- **Comparison with a replay-trained off-policy RL baseline (suggested by Reviewer xEGb)**: We conducted replay training for soft Q-learning and demonstrated that LS-GFN significantly outperforms this baseline.

---

## Reviewer wUkH (Score: 6)

Thanks for your valuable comments.

**W1-1: While the method itself is simple and well presented, it remains a little unclear the relative importance of different components. For example, the exposition focuses on the trajectory balanced objective — does this mean LS-GFNs don’t work with different training objectives?**

LS-GFN is a versatile technique that can be effectively employed for a wide range of GFN objectives.
We have already successfully applied LS-GFN to various GFN objectives, including sub-trajectory balance (SubTB), detailed balance (DB), and maximum entropy GFN (MaxEnt), as demonstrated in Table 1 and Table 2 of our original manuscript.

**W1-2: The paper also mentions building on top of Shen et al., 2023 which introduces prioritized replay training (PRT) to GFNs, but this method doesn’t appear as a baseline — so how much of the improvements are due to PRT and how much is due to the local search?**

Shen et al., 2023 introduced three distinct techniques: "PRT" (prioritized replay training, pertaining to dataset sampling), "SSR" (concerning GFN parameterization), and "GTB" (the guided trajectory balance objective). For a fair comparison, we consistently use PRT for training and SSR for parameterization in all GFN baselines, so the reported improvements of LS-GFN over these baselines are not attributable to PRT. Additionally, for a direct comparison, we evaluate our approach against Shen et al., 2023, denoted as "GTB" in Figure 5 and Figure 6.

**W2: In addition to the hyper-parameter (number of revisions with local searches), I wished the number of backtracking steps and the acceptance rate were also studied — how much tuning did they require in order to get these good results?**

We have already conducted experiments on the hyperparameter $I$ in Appendix B.3 and explored the acceptance rate in Appendix B.6. It is worth noting that across all hyperparameter candidates, namely $I \in \{1, 3, 7, 15, 31\}$, LS-GFN ($I > 0$) consistently outperforms GFN ($I = 0$). This suggests that choosing an appropriate hyperparameter $I$ for LS-GFN is relatively straightforward, as it consistently yields superior results.

**W3: Wall-clock time is never mentioned — how much overhead does the refining process of LS-GFNs incur? How about in terms of sampled states (instead of training rounds)?**

There is negligible additional wall clock time overhead. Below are the wall clock times for each training round in the QM9 task. The wall clock time is computed as the mean over three independent seeds; to remove warmup bias, we measure it after 10 initial training rounds. There is no meaningful wall clock time difference between the two methods:

| | $I$ | $M$ | Wall clock time per round |
| -------- | -------- | -------- | -------- |
| TB | 0 | 32 | 1.93 ± 0.13 seconds |
| TB + LS-GFN | 7 | 4 | 1.91 ± 0.07 seconds |

The wall clock time is evaluated with a single RTX 3090 GPU, an Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz, and 256 GB RAM.

The wall clock time remains consistent because we always use the same number of samples per training round ($M \times (I+1) = 32$, where $M$ is the batch size and $I$ is the number of local search iterations) for every method across all experiments. This choice is particularly important because evaluating the reward per sample is a significant bottleneck (in some applications, a single reward computation can take a day), especially in real-world scenarios such as biological and chemical discovery tasks. By keeping the sample complexity uniform, we ensure that the computational overhead remains stable.

<!--
Here is the sample complexity analysis:

| | $T$ | $I$ | $M$ | Sampling Complexity |
| -------- | -------- | -------- | -------- | -------- |
| GFN | 5000 | 0 | 32 | $5000 \times 1 \times 32 = 160,000$ |
| LS-GFN | 5000 | 7 | 4 | $5000 \times (7+1) \times 4 = 160,000$ |

where $T$ represents the number of training rounds, $I$ stands for the number of local search iterations, and $M$ denotes the batch size per training round.
-->
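To make the per-round budget matching explicit, here is a minimal sketch; the function and argument names are ours for illustration only and do not refer to identifiers in our code.

```python
# Minimal sketch: GFN and LS-GFN evaluate the same number of rewards per round.

def samples_per_round(M: int, I: int) -> int:
    """Reward evaluations per training round with batch size M and I local-search iterations."""
    return M * (I + 1)

assert samples_per_round(M=32, I=0) == 32  # TB baseline
assert samples_per_round(M=4, I=7) == 32   # TB + LS-GFN
```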
**W4: One baseline I wished were included: given a first trajectory, resample the last K steps and only keeping the best candidate. This is akin to beam search in LLMs or go-explore in RL. Other nice-to-have baselines include top-p and top-k sampling.**

Thank you for recommending baselines that can effectively showcase the LS-GFN method. We have added a new section, Appendix B.8, containing the new experiments and an analysis of the results. Here is a summarized version of Appendix B.8.

We conducted new experiments based on your recommendations. We compared LS-GFN against the suggested baselines using the Trajectory Balance (TB) objective on the QM9 and L14_RNA1 tasks. For the resampling baseline, we initiate a trajectory and then resample its last $K$ steps four times; we evaluate these resampled trajectories and keep the best one, using batches of size $M = 8$, so the total sampling complexity amounts to $8 \times 4 = 32$. For the Top K filtering baseline, we generate $M = 32$ trajectories and filter them down to the Top K trajectories (Top 2 among 32), which are then used for training.

<!--
The results indicate that resampling and filtering effectively enhance exploitation by concentrating on high-reward regions. However, the sampler tends to generate duplicated samples, resulting in a lower unique fraction compared to LS-GFN methods. Among the baselines, LS-GFN exhibits the best performance across various metrics, including the number of modes, Top K rewards, and the unique fraction of sampling.
-->

**< QM9 >**

| | Number of modes | Top K Reward |
| -------- | -------- | -------- |
| TB | 699 ± 14 | 0.66 ± 0.00 |
| TB + Resampling | 778 ± 6 | **0.67 ± 0.00** |
| TB + Top K filtering | 768 ± 6 | **0.67 ± 0.00** |
| TB + LS-GFN | **793 ± 4** | **0.67 ± 0.00** |

**< L14_RNA1 >**

| | Number of modes | Top K Reward |
| -------- | -------- | -------- |
| TB | 6 ± 1 | 0.87 ± 0.01 |
| TB + Resampling | 4 ± 0 | 0.87 ± 0.01 |
| TB + Top K filtering | 4 ± 2 | 0.87 ± 0.01 |
| TB + LS-GFN | **18 ± 0** | **0.97 ± 0.00** |

In the QM9 task, LS-GFN outperforms the baselines, achieving the best performance in terms of the number of modes. In the RNA task, which has a much larger search space than QM9, effective search for reward improvement is crucial. The results show that simple resampling and filtering are not sufficient: they focus too heavily on exploiting existing trajectories and fail to find novel modes effectively. In contrast, LS-GFN harnesses a backward policy for backtracking, producing a diverse set of backtracked states that are subsequently reconstructed by the forward policy. This proves notably more effective at navigating the extensive RNA search space than the resampling method, which merely resamples the last $K$ steps without taking advantage of a backward policy.

**Q1: Why call it “destroy” since the original trajectory isn’t always discarded? A more intuitive name could be “backtrack”, “rewind”, or anything that doesn’t suggest destruction.**

We agree. We have updated our manuscript to use the term "backtrack" instead of "destroy."
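To make the backtrack-and-reconstruct step concrete, here is a minimal sketch of a single local-search refinement with a greedy acceptance rule; `sample_backward`, `sample_forward`, and `reward` are placeholder callables for illustration and do not correspond to identifiers in our code.

```python
def refine_once(x_old, K, sample_backward, sample_forward, reward):
    """One illustrative local-search refinement step.

    x_old: current terminal object (the end of a complete trajectory).
    K: number of backtracking steps taken with the backward policy P_B.
    sample_backward(x, K): backtracks K steps from x and returns the
        resulting intermediate (partial) state.
    sample_forward(s): completes a trajectory from state s with the
        forward policy P_F and returns the new terminal object.
    reward(x): black-box reward of a terminal object.
    """
    s = sample_backward(x_old, K)   # backtrack with the backward policy
    x_new = sample_forward(s)       # reconstruct with the forward policy
    # Greedy acceptance for illustration: keep the higher-reward candidate.
    # (Other acceptance rules, e.g., stochastic ones, are possible.)
    accepted = reward(x_new) >= reward(x_old)
    return (x_new if accepted else x_old), accepted
```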
**The top plots in Figure 5 are strange: it looks like some curves go beyond 100% accuracy. Could you either fix them so we can still see the curves on the plot, or explain what is happening?**

Please note that Figure 5 is plotted accurately, following the baseline from Shen et al., 2023: we define the accuracy metric according to their guidelines as
\begin{equation*}
\text{Acc}\left(p\left(x;\theta\right)\right) = 100 \times \min\left(\frac{\mathbb{E}_{p(x;\theta)}[R\left(x\right)]}{\mathbb{E}_{p^{*}(x)}[R\left(x\right)]}, 1\right),
\end{equation*}
which is bounded above by 100% accuracy. In response to your suggestion, we have included a plot illustrating accuracy beyond the 100% mark in Appendix B.7.

**How can LS-GFNs recover from a biased backward policy? In other words, assume the forward policy is fine but the backward policy always backtracks to states which yield the same (high reward) candidate objects — how can LS-GFNs overcome this lack of exploration?**

Biased backward policies can indeed pose challenges. In such scenarios, we can employ an $\epsilon$-greedy method to introduce exploratory back-sampling into the biased backward policy. It is worth noting that LS-GFN performs exceptionally well with MaxEnt GFN, even when the backward policy is set to a uniform distribution (i.e., $\epsilon = 1$).

Building upon your valuable feedback, one potential avenue for future research is fine-tuning the backward policy to maximize the local search acceptance rate. This would encourage the backward policy to identify diverse intermediate states that lead to improved rewards, without being biased towards yielding the same candidate objects repeatedly. We appreciate your insightful comment.

**Please confirm that lines 7 and 8 in Algorithm 1 aren’t swapped. If they aren’t (which partially addresses my question above), wouldn’t swapping them and extending with further improve exploitation? Maybe this should be added as an ablation as well.**

No, lines 7 and 8 are not swapped. We add every sample to the dataset, irrespective of whether it is accepted or not. The rationale is that every sample whose reward has already been computed is valuable. During the refinement phase, our primary objective is to explore the local region to discover high-reward samples, not to decide whether a found sample should be used in training. During training, we employ PRT to sample trajectories from the dataset in proportion to their rewards, so accepted samples have a higher likelihood of being included in training.

Swapping lines 7 and 8, as you suggest, would entail updating the dataset only with accepted samples. As you pointed out, while this could enhance exploitation, it might excessively hinder exploration. To shed more light on this, here are the new results of our ablation study:

| | Number of modes | Top K Reward |
| -------- | -------- | -------- |
| Swapped version | 754 ± 6 | 0.66 ± 0.00 |
| Original LS-GFN | 793 ± 4 | 0.67 ± 0.00 |

Based on the above results, the original LS-GFN outperforms the swapped version.
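To clarify the reward-proportional replay sampling mentioned in the answer above, here is a simplified sketch. It illustrates the idea only and is not the exact PRT procedure of Shen et al., 2023; `buffer` and `sample_replay_batch` are placeholder names.

```python
import random

def sample_replay_batch(buffer, batch_size, rng=random):
    """Simplified sketch: draw training trajectories from the replay buffer
    with probability proportional to their (non-negative) rewards.

    buffer: list of (trajectory, reward) pairs. Every generated sample is
    stored here, whether or not the local search accepted it, so accepted
    high-reward candidates are simply revisited more often during training.
    """
    weights = [reward for _, reward in buffer]
    return rng.choices(buffer, weights=weights, k=batch_size)
```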
**References**

Max W Shen, Emmanuel Bengio, Ehsan Hajiramezanali, Andreas Loukas, Kyunghyun Cho, and Tommaso Biancalani. Towards understanding and improving GFlowNet training. In International Conference on Machine Learning, pp. 30956–30975. PMLR, 2023.

---

## Reviewer xEGb (Score: 5)

Thanks for the feedback.

**W1, Q2: About fairness of sampling complexity**

Thank you for highlighting this for clarification. All our experiments were conducted under a fair setting, as we use exactly the same number of samples across all experiments; this is already outlined in Section 5.3. The sampling complexity is calculated as $(I+1) \times M$, where $I$ is the number of local search iterations and $M$ is the batch size per training round.

| | $I$ | $M$ | Sampling Complexity per Training Round |
| -------- | -------- | -------- | -------- |
| GFN | 0 | 32 | $1 \times 32 = 32$ |
| LS-GFN | 7 | 4 | $(7+1) \times 4 = 32$ |

**W2: Limitation of LS-GFN**

A limitation of LS-GFN is that the quality of the backward policy can affect its performance, particularly when the acceptance rate of the local search becomes excessively low. One immediate remedy is to introduce an exploratory element into the backward policy, using techniques such as $\epsilon$-greedy sampling or even a uniform distribution to foster exploration within the local search. A promising avenue for future research is fine-tuning $P_B$ to increase the acceptance rate of the local search. We plan to incorporate this topic into the paper's "Limitations and Further Work" section once we have the additional page for the final version (for now, it is in the appendix), and we appreciate your valuable suggestion.
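A minimal sketch of the exploratory backward policy described above, assuming hypothetical helpers `backward_policy` (samples a backtracking action from the learned $P_B$) and `parent_actions` (enumerates the valid backtracking actions of a state); setting `eps = 1` recovers the uniform backward policy used by the MaxEnt GFN variant.

```python
import random

def exploratory_backward_step(state, backward_policy, parent_actions, eps=0.1, rng=random):
    """Epsilon-greedy back-sampling: with probability eps take a uniformly
    random valid backtracking action, otherwise sample from the learned
    backward policy P_B. This injects exploration when P_B is biased."""
    if rng.random() < eps:
        return rng.choice(parent_actions(state))
    return backward_policy(state)
```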
**Q2: About hyperparameter $I$**

We have conducted an ablation study on the hyperparameter $I$, as detailed in Appendix B.3. Across all hyperparameter candidates, namely $I \in \{1, 3, 7, 15, 31\}$, LS-GFN ($I > 0$) consistently outperforms GFN ($I = 0$). This suggests that choosing an appropriate hyperparameter $I$ for LS-GFN is relatively straightforward, as it consistently yields superior results.

**Q3: Comparison with replay-trained RL methods**

Taking your feedback into consideration, we implemented reward-prioritized experience replay training for the off-policy RL baseline, SQL. Note that PPO and A2C are on-policy RL methods for which replay training is not directly applicable. Here are the new experimental results comparing replay-trained SQL and LS-GFN:

| | Number of Modes | Top K Reward |
| -------- | -------- | -------- |
| SQL | 232 ± 8 | 0.56 ± 0.01 |
| SQL (replay trained) | 235 ± 6 | 0.56 ± 0.01 |
| TB + LS-GFN | 793 ± 4 | 0.67 ± 0.00 |

---

## Reviewer 4fxw (Score: 8)

Thanks for providing valuable feedback.

**W1: Computational overhead**

LS-GFN does not introduce any additional computational overhead. Below are the wall clock times for each training round in the QM9 task.

| | $I$ | $M$ | Wall clock time per round |
| -------- | -------- | -------- | -------- |
| TB | 0 | 32 | 1.93 ± 0.13 seconds |
| TB + LS-GFN | 7 | 4 | 1.91 ± 0.07 seconds |

The wall clock time is evaluated with a single RTX 3090 GPU, an Intel(R) Xeon(R) Gold 5317 CPU @ 3.00GHz, and 256 GB RAM.

We restrict the per-round sample budget of LS-GFN so that its sample complexity remains consistent with that of the baseline GFN methods. The sampling complexity is calculated as $(I+1) \times M$, where $I$ is the number of local search iterations and $M$ is the batch size per training round.

| | $I$ | $M$ | Sampling Complexity per Training Round |
| -------- | -------- | -------- | -------- |
| GFN | 0 | 32 | $1 \times 32 = 32$ |
| LS-GFN | 7 | 4 | $(7+1) \times 4 = 32$ |

**Q1: How is the number of steps K picked for backtracking? If it's fixed, is there a way to pick it automatically?**

We used $K = \lfloor (L+1) / 2 \rfloor$, where $L$ is the length of the trajectory. We also provided a hyperparameter analysis in Appendix B.5 of the original manuscript. As you mentioned, picking $K$ automatically to adaptively balance the breadth of the local search region could be a great avenue for future work.
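For concreteness, the fixed rule above written out as code (an illustrative snippet; the name `backtrack_steps` is ours):

```python
def backtrack_steps(L: int) -> int:
    """Backtracking depth used in our experiments: K = floor((L + 1) / 2)."""
    return (L + 1) // 2

assert backtrack_steps(8) == 4
assert backtrack_steps(7) == 4
```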