## General Response
We are sincerely grateful to the reviewers for their valuable feedback on our manuscript. We are pleased to hear that the reviewers found our paper **well-written** (4tbn, gNyQ, kTKB, FCer) and supported by **extensive experiments with state-of-the-art performance** (4tbn, gNyQ, kTKB, FCer). We appreciate the recognition of the **usefulness** (4tbn, gNyQ), **simplicity** (4tbn, kTKB), **novelty** (4tbn), and **rationale** (gNyQ) of our method. Additionally, we are encouraged by the acknowledgment of our research as **well-motivated** and **promising with potential real-world impact** (4tbn, gNyQ).
### Answers to common concerns and feedback
- Further explanation of Graph GA: We adopt the original implementation of Graph GA [1]. To make the manuscript self-contained, we will include detailed explanations of Graph GA along with figures; please see the summary of Graph GA below and Fig. 2 in the supplementary material.
- Provided link does not work: We apologize for the inconvenience. Our code is available at https://anonymous.4open.science/r/genetic_gfn. (directories: PMO `pmo/main/genetic_gfn`, multi-objective `multi_objective/genetic_gfn`, SARS-CoV-2 `sars_cov2/genetic_gfn`)
- Analysis of generated molecules: We will include more detailed explanations along with visual results (Fig. 3 and 4 in the supplementary material).
Additionally, in our supplementary material, we provide a schematic overview of pretraining and fine-tuning with Genetic GFN (Fig. 1), examples of Graph GA operations (Fig. 2), visual results for valsartan_SMARTS (oracle ID: #22) in Fig. 3, and visual results of SARS-CoV-2 inhibitor designs in Fig. 4.
***Summary of Graph GA:*** Graph GA is implemented using predefined SMARTS patterns for its operations. During crossover, the algorithm randomly selects either `non_ring_crossover` or `ring_crossover` with equal probability. `non_ring_crossover` cuts an arbitrary non-ring edge in each of the two parent molecules and combines the resulting subgraphs, while `ring_crossover` cuts two edges within a ring and combines the subgraphs from the different parents. For mutations, the algorithm randomly applies one of seven modifications: `atom_deletion`, `atom_addition`, `atom_insertion`, `atom_type_change`, `ring_bond_deletion`, `ring_bond_addition`, and `bond_order_change`. Invalid molecules resulting from mutation are discarded, and the mutation is re-applied.
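For illustration, below is a minimal runnable sketch in the spirit of `non_ring_crossover` using RDKit. All function and variable names here are ours, and the original Graph GA implementation differs in details (e.g., fragment-size constraints and the exact SMARTS patterns):

```python
import random
from rdkit import Chem
from rdkit.Chem import AllChem

def cut_random_nonring_bond(mol):
    """Cut one random non-ring single bond; return the two fragments."""
    bonds = [b.GetIdx() for b in mol.GetBonds()
             if not b.IsInRing() and b.GetBondType() == Chem.BondType.SINGLE]
    if not bonds:
        return None
    frags = Chem.FragmentOnBonds(mol, [random.choice(bonds)], addDummies=True)
    pieces = Chem.GetMolFrags(frags, asMols=True)
    return pieces if len(pieces) == 2 else None

def non_ring_crossover(smiles_a, smiles_b):
    fa = cut_random_nonring_bond(Chem.MolFromSmiles(smiles_a))
    fb = cut_random_nonring_bond(Chem.MolFromSmiles(smiles_b))
    if fa is None or fb is None:
        return None
    # join one fragment from each parent at the dummy (attachment) atoms
    join = AllChem.ReactionFromSmarts("[*:1]-[#0].[#0]-[*:2]>>[*:1]-[*:2]")
    for (child,) in join.RunReactants((random.choice(fa), random.choice(fb))):
        try:
            Chem.SanitizeMol(child)  # keep only chemically valid offspring
            return Chem.MolToSmiles(child)
        except Exception:
            continue
    return None

print(non_ring_crossover("CCOCC", "NCCCN"))  # e.g. 'CCOCCCN' (varies with the cut)
```

The sanitization step after joining is what ensures that only valid molecules enter the offspring pool.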
## Reviewer FCer (5)
Thank you for recognizing the importance of the scientific problem, the organization and readability of the paper, and the comprehensiveness and strength of our experimental results. We address your concerns below.
> The novelty is somewhat limited, as the proposed Genetic GFN simply combines some existing techniques, that have been studied independently.
This method is novel because it is **the first to combine 1D sequence generation using GFlowNets with 2D molecular graph search via genetic algorithms**. This approach allows the policy to generate 1D sequences, which are easier to train, while utilizing a genetic algorithm to explore 2D molecular graphs, reaching regions that might be inaccessible to the 1D policy alone. Thanks to the off-policy nature of GFlowNets, we can leverage the insights from the genetic algorithm's 2D molecular graph search to enhance the training of the 1D sequence policy. Our experimental results empirically support these claims; please see Tables 2 and 3.
Combining existing methods in innovative ways is both important and novel, as it creates synergies that significantly enhance performance and efficiency, achieving breakthroughs that neither method could accomplish independently.
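To make the integration concrete, here is a deliberately toy, runnable sketch of the loop structure (all names, such as `policy_sample`, `graph_ga`, and `oracle`, are placeholders, not our implementation; the actual code is in the repository linked above):

```python
import random

def train_genetic_gfn(policy_sample, policy_update, graph_ga, oracle,
                      steps=10, batch=8, population=4):
    """Toy skeleton: the string policy proposes, the graph GA refines,
    and replayed samples train the policy off-policy."""
    buffer = {}  # SMILES -> oracle score
    for _ in range(steps):
        for s in policy_sample(batch):                  # 1D string generation
            buffer.setdefault(s, oracle(s))
        parents = sorted(buffer, key=buffer.get, reverse=True)[:population]
        for s in graph_ga(parents):                     # 2D graph-space search
            buffer.setdefault(s, oracle(s))
        replay = random.sample(list(buffer.items()), min(batch, len(buffer)))
        policy_update(replay)         # off-policy (trajectory balance) update
    return buffer

# toy stand-ins so the skeleton runs end-to-end; a real run plugs in the
# GFlowNet policy, Graph GA, and a PMO oracle
result = train_genetic_gfn(
    policy_sample=lambda n: ["C" * random.randint(1, 6) + "O" for _ in range(n)],
    policy_update=lambda replay: None,
    graph_ga=lambda parents: [p + "C" for p in parents],
    oracle=len,  # placeholder "reward": longer string = higher score
)
print(max(result, key=result.get))
```

The key point is that GA offspring enter the same replay buffer as on-policy samples, which is valid precisely because GFlowNet objectives are off-policy.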
> In Table 4, Genetic GFN is worse than Graph GA and MARS on GSK3 + JNK3. The reason should be discussed.
| | GSK3b + JNK3 | GSK3b + JNK3 + QED + SA |
| -------- | -------- | -------- |
| Graph GA | **0.368 ± 0.020** | 0.335 ± 0.021 |
| MARS | **0.418 ± 0.095** | 0.273 ± 0.020 |
| HN-GFN | 0.669 ± 0.061 | 0.416 ± 0.023 |
| Genetic GFN | 0.718 ± 0.138 | 0.642 ± 0.053 |
Thank you for pointing this out. The results were taken directly from Zhu et al. (2023) [1], which contained some mistakes in the numbers reported for Graph GA and MARS on the GSK3b + JNK3 task. The table above shows the corrected numbers. Please also refer to Fig. 3 in the HN-GFN paper (https://arxiv.org/pdf/2302.04040).
[1] Zhu, Yiheng, et al. "Sample-efficient multi-objective molecular optimization with gflownets." Advances in Neural Information Processing Systems 36 (2024).
(The numbers are from https://openreview.net/forum?id=ztgT8Iok130)
> I am more curious about the difference between increasing KL term in loss function and directly using the logP of the reference distribution as part of the reward (Please see equation (4) in DPO [1]). In this situation, what's the different between GFlowNet and PPO.
Thank you for raising this engaging discussion on an important topic, which has broad relevance across many domains (e.g., KL-constrained optimization in RLHF).
There are indeed several ways to incorporate KL constraints during optimization. One effective approach is to include the log-probability of the prior within the trajectory balance loss, treating it as a soft reward, as you suggested. The following table shows that both variants achieve strong performance (AUC Top-10). Note that the results are obtained from three independent runs and that our hyperparameters were tuned for Genetic GFN (KL).
| Genetic GFN (KL) | Genetic GFN (logP prior) |
| --- | --- |
| 16.088 ± 0.025 | 15.777 ± 0.018 |
When using explicit KL loss terms, it is important to note the difference between PPO and GFlowNets. PPO aims for reward maximization, whereas GFlowNets aim for reward matching, generating samples proportional to the reward. Even with KL constraints, PPO seeks a unimodal reward maximum within the KL-constrained region, while GFlowNets sample diverse, high-reward modes within it.
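For concreteness, the connection can be summarized with the standard closed form of the KL-regularized objective (our notation; $p_{\mathrm{ref}}$ denotes the prior policy and $\beta$ the regularization strength):

$$\max_{\pi}\;\mathbb{E}_{x\sim\pi}\big[r(x)\big]\;-\;\beta\,D_{\mathrm{KL}}\big(\pi\,\|\,p_{\mathrm{ref}}\big)\quad\Longrightarrow\quad\pi^{*}(x)\;\propto\;p_{\mathrm{ref}}(x)\,\exp\big(r(x)/\beta\big)$$

A GFlowNet trained with the soft reward $\tilde{R}(x)=p_{\mathrm{ref}}(x)\exp\big(r(x)/\beta\big)$ is trained to sample in proportion to exactly this $\pi^{*}$ (reward matching), whereas a reward-maximizing method returns points concentrated near its mode.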
## Reviewer kTKB (5)
***Thank you for the valuable comments***, and for recognizing the simplicity of our method, the effort put into designing the experiments, and the description of necessary background and technical details. We address the concerns as follows.
> One weakness is the gap between "domain-specific knowledge" and "GA method" mentioned in the introduction. When I read the introduction, I am expecting the authors will propose some new genetic operators... It turns out that the authors are still using the standard genetic operators such as crossover and mutation.
We identify two distinct approaches for integrating genetic algorithms with deep learning: (1) automating genetic algorithms (GAs) using deep learning and (2) incorporating the search capabilities of domain-specific GAs into deep learning as an inductive bias. Our research focuses on the latter, leveraging powerful GAs designed with chemical domain knowledge to enhance deep neural network policies for more sample-efficient molecular optimization. We also acknowledge that the first approach, automating genetic algorithms with deep learning, could be a valuable direction for future research. We will revise our manuscript to articulate these points clearly.
> Are there any special care taken that I may have missed? for example, the crossover fragments are not purely random but are collected from motifs specific to tasks. If no, I feel the statement of "GA method can encode domain-specific knowledge" is very vague, or at least it not being discussed comprehensively if previous works have explored.
Crossover and mutation operations are performed according to predefined patterns. For instance, when altering the bond order (one type of mutation), the possible transformations are specified to ensure valid changes that adhere to chemical rules. Defining these valid operations requires domain knowledge and careful design of the logic.
Graph GA employs two crossover operations and seven mutation operations based on SMARTS patterns (e.g., inserting an atom into a double bond: `[*;!H0:1]~[*:2]>>[*:1]=X-[*:2]`). Creating and utilizing these patterns requires expertise in chemistry to ensure accurate representation and manipulation of molecular structures. Additionally, validation and sanitization steps ensure that only chemically plausible and stable molecules are considered. Further details about genetic operations can be found in the original Graph GA paper.
To make our manuscript self-contained, **we will include a detailed explanation of how the crossover and mutation operations are designed** to guarantee molecular validity in the Appendix, accompanied by **Fig. 2 in our supplementary material**.
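As a minimal illustration of how such a SMARTS-based mutation is applied and validated, here is a runnable sketch with RDKit; the concrete reaction (carbon insertion into a single bond) is our illustrative stand-in for the seven rules of the original implementation:

```python
import random
from rdkit import Chem
from rdkit.Chem import AllChem

# illustrative stand-in for one of the seven mutation rules:
# insert a carbon atom into a randomly chosen single bond
INSERT_ATOM = AllChem.ReactionFromSmarts("[*:1]-[*:2]>>[*:1]-C-[*:2]")

def mutate(smiles):
    mol = Chem.MolFromSmiles(smiles)
    products = [p[0] for p in INSERT_ATOM.RunReactants((mol,))]
    random.shuffle(products)
    for prod in products:
        try:
            Chem.SanitizeMol(prod)  # invalid offspring are discarded here
            return Chem.MolToSmiles(prod)
        except Exception:
            continue
    return None  # no valid product; Graph GA would re-apply a mutation

print(mutate("CCO"))  # 'CCCO'
```

Sanitization is where chemical validity is enforced; as described in the general response, invalid products are discarded and the mutation is re-applied.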
> Another thing missing in the paper is enough qualitative results. I can only see Fig 4 & 5 contain some final generated molecules for two targets. It will be great if the authors can include more visual results, especially the trajectory of the sampled molecules, with highlights on what fragments have been changed during the process. Some analysis on why certain fragments are favored (if any) or remain at the final optimized structure is also very helpful for the readers to verify the effectiveness of the proposed method.
We provide **additional visual results in Fig. 4 of our supplementary material**. Due to space constraints, only the top three samples for 50, 100, 500, and 1000 steps are reported.
Our observations from the generated candidates for both targets are as follows.
- Many molecules include heterocyclic rings, which contain nitrogen or other non-carbon atoms. These structures may play a role in the molecules' interactions with the target protein.
- Benzene rings with various substituents (e.g., methyl, amide, hydroxyl) are frequently observed. These substitutions could provide diverse interaction points with the target protein, although their exact contribution to binding affinity needs further investigation.
- There seems to be a trend of increasing molecular complexity and functional diversity over iterations, which might be aimed at enhancing the binding affinity and specificity of the inhibitors for the target protein. For example, at step 100, more complex substituents on aromatic rings are introduced compared to the generated candidates at step 50. After 1000 steps, we observe the addition of bulkier groups, such as tert-butyl and sulfone groups.
We plan to provide more visual results in the revised manuscript.
> Another interesting study to show is how the proposed pipeline can be incorporated with grammar-based representations, such as STONED, SELFIES, and many more if search "molecular grammar", rather than SMILES. Since these manually designed grammar consists of more "domain-specific knowledge" compared to pure atom-based SMILES string, I would expect the experimental results will be better. It would be great to include such analysis in the paper to provide more evidence on the key motivation of the paper.
We have included the results of Genetic GFN with SELFIES in Appendix F.5. As pointed out in the work of Gao et al. [1], despite the clear benefits of SELFIES, SMILES often shows competitive or better performance in sample-efficient molecular optimization tasks. For instance, SMILES-REINVENT outperforms SELFIES-REINVENT, and SMILES-LSTM-HC (hill climbing) outperforms SELFIES-LSTM-HC; please see Table 3 and the analysis in Section 3.2 in the work of Gao et al. [1].
We additionally provide experiments incorporating STONED (a GA with SELFIES) instead of Graph GA as the exploration strategy to guide GFN training (the policy still generates SMILES); see the table below. Note that STONED only utilizes mutations, as designing valid crossovers on string representations is difficult.
| | AUC1 | AUC10 | AUC100 |
| -------- | -------- | -------- | -------- |
| Genetic GFN | 16.527 ± 0.043 | 16.213 ± 0.042 | 15.516 ± 0.041 |
| Genetic GFN (STONED) |15.806 ± 0.037 | 15.439 ± 0.037 | 14.870 ± 0.036 |
| Mol GA | 16.001 ± 0.027 |15.686 ± 0.025 | 15.021 ± 0.025 |
| SMILES-REINVENT | 15.686 ± 0.035 | 15.185 ± 0.035 | 14.306 ± 0.033 |
> line 55, please remove "expectional"
Thanks for the suggestion. We removed the term in the revised manuscript.
> line 118, no references to "previous works"
Thank you for bringing this to our attention. The "previous works" refer to string-based generation approaches that usually rank highly in the benchmark, such as REINVENT, SMILES-LSTM hill climbing, and GEGL. We have added these references in the revised version.
### References
[1] Gao et al. "Sample efficiency matters: a benchmark for practical molecular optimization." Advances in neural information processing systems 35 (2022): 21342-21357.
## Reviewer gNyQ (6 -> 5)
Thank you for finding the paper pedagogical and clear in explaining the proposed method and its rationale, and for highlighting the extensive experiments, the promising direction with potential real-world impact, and the controllability of the score-diversity trade-off. We address your comments below.
> The proposed method is limited to molecular optimization and is not readily generalizable to other domains.
Our method focuses on integrating strong domain-specific search heuristics into deep neural network policies through the off-policy nature of GFlowNets for sample-efficient optimization. This approach is adaptable to any task where a powerful domain-specific search heuristic is available. For example, in jailbreaking tasks on LLMs, one of the state-of-the-art methods is a genetic algorithm [1]; we could use it to train GFlowNet-based jailbreaking policies. We agree that developing automated genetic algorithm methods that can be applied across general domains, leveraging deep learning, is a promising direction for future work.
[1] Liu et al. "Autodan: Generating stealthy jailbreak prompts on aligned large language models." arXiv preprint arXiv:2310.04451 (2023).
> While the graph GA is based on prior work, the paper would benefit from being more self-contained by including an algorithm or visualization of the mutation and crossover steps. For example, how does the GA ensures the validity of the molecules.
Thanks for the helpful suggestion. We will include a new figure exemplifying the GA operations, which is provided in the attached additional material (Fig. 2), along with a more detailed explanation of Graph GA.
> A schematic diagram illustrating the entire training process, including both pretraining and GFlowNet training, would improve clarity.
Thanks for the helpful suggestion. We will include the new diagram, which is **provided in the attached additional material (Fig. 1)**.
> The anonymous link to code provided is not accessible, limiting reproducibility.
We apologize for the inconvenience caused. There was an error in the link; please refer to this link (https://anonymous.4open.science/r/genetic_gfn).
> Lack of efficiency/runtime analysis and comparison with the baseline methods
In sample-efficient molecular optimization, the main computational bottleneck is evaluating the oracle functions, so it is common to compare efficiency based on the number of samples. Also, note that the main metric, the area under the curve (AUC), is defined to measure sample efficiency: a higher AUC score indicates higher sample efficiency.
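As a reference for how we read the metric, here is our own minimal illustration (not the benchmark code) of the AUC Top-K computation from the PMO benchmark: the running mean of the top-K oracle scores is recorded after every call, and this curve is averaged over the call budget, so methods that reach high scores with fewer calls obtain higher AUC:

```python
import numpy as np

def auc_top_k(scores, k=10, budget=10000):
    """Area under the (running top-k mean vs. oracle calls) curve, averaged."""
    scores = list(scores)[:budget]
    top, curve = [], []
    for s in scores:
        top = sorted(top + [s], reverse=True)[:k]
        curve.append(float(np.mean(top)))
    curve += [curve[-1]] * (budget - len(curve))  # early stop: hold the value
    return float(np.mean(curve))

# a method that finds high scores early earns a larger AUC
print(auc_top_k([0.9] * 100 + [0.1] * 100, k=10, budget=200))  # 0.9
print(auc_top_k([0.1] * 100 + [0.9] * 100, k=10, budget=200))  # ~0.48
```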
Our average runtimes (in seconds) are as follows. Please note that although we tested all algorithms in similar computational environments, we did not rigorously control the computational resources.
| | Avg | Min | Max |
| --- | --- | --- | --- |
| Genetic GFN | 827.50 | 658.43 | 1040.47 |
| SMILES-REINVENT | 252.88 | 193.15 | 318.06 |
| Mol GA | 9803.26 | 3317.99 | 23348.23 |
| Graph GA | 165.00 | 115.35 | 234.07 |
| GPBO | 1519.65 | 1107.98 | 2236.69 |
- The runtime of Mol GA is significantly longer than that of Graph GA, despite both utilizing the same crossover and mutation operations. This difference arises because Mol GA has a much smaller offspring size, and we observed a tendency to repeatedly generate already-discovered molecules (i.e., early convergence before reaching the maximum number of calls).
- Compared to REINVENT, our runtime is longer, but not significantly longer than that of the other baselines.
- Note that the methods have different early termination rules, complicating direct comparisons.
Questions:
> Is the GA implemented on CPU or GPU? If it’s on CPU, how slow it is?
The GA is implemented on the CPU (we adopt the implementation of Graph GA without parallel computation). As shown in the runtime table above, Graph GA has a relatively short runtime. The runtime increase compared to REINVENT comes from replay training, which takes roughly twice as long as the genetic search.
> Would the model's performance improve if combined with modern RNN architectures like Mamba or xLSTM?
Thank you for the suggestion. We believe that utilizing modern RNN architectures can indeed enhance performance; recent work [2] has demonstrated that Mamba outperforms vanilla RNNs in their methodology. Note that our research primarily focuses on proposing a new training scheme, and Genetic GFN can also incorporate modern RNN architectures to achieve better results.
[2] Guo & Schwaller (2024). Saturn: Sample-efficient Generative Molecular Design using Memory Manipulation. arXiv preprint arXiv:2405.17066.
> How sensitive is the graph genetic algorithm to the way molecules are fragmented and the resolution of fragmentation?
First of all, we adopt the original implementation of Graph GA. The Graph GA crossover divides each parent molecule into two fragments, either by cutting within a ring or arbitrarily along non-ring edges, and mutations are applied mostly at the atom level. Although studying the sensitivity of the graph genetic algorithm to fragmentation resolution is an intriguing topic, it is beyond the scope of our current research.
> How might this approach be adapted to other discrete optimization problems beyond molecular design? (such as TSP in combinatorial optimization. A quick intuition would be sufficient)
We can utilize a constructive RL policy, such as the Attention Model (AM) [3], which sequentially adds elements to a partial solution until it is complete. Notably, prior work has trained AM with a GFN [4]. To guide GFN training, we can utilize domain-inspired genetic algorithms, such as edge assembly crossover [5] or hybrid genetic search [6].
[3] Kool et al. (2019). Attention, learn to solve routing problems! ICLR
[4] Kim et al. (2023). Symmetric Replay Training: Enhancing Sample Efficiency in Deep Reinforcement Learning for Combinatorial Optimization. ICML
[5] Nagata & Kobayashi (2013). A powerful genetic algorithm using edge assembly crossover for the traveling salesman problem. INFORMS Journal on Computing
[6] Vidal et al. (2012). A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Operations Research
> How does the method's performance scale with the size and complexity of the molecular space being explored?
The maximum SMILES length is set to 140, consistent with REINVENT, corresponding to approximately 70-100 atoms. For JNK3 (#10), which consists of relatively large molecules, SMILES lengths range from 50 to 130, with 35 to 100 atoms. In contrast, molecules for isomers_c7h8n2o2 (#8) typically contain about 10-15 atoms.
> Is there potential for incorporating human feedback or preferences into the optimization process?
Our method follows the (unsupervised) pretraining and fine-tuning framework, similar to approaches like RL with human feedback (RLHF) and direct preference optimization (DPO). One possible approach is incorporating the reward model used in RLHF as our oracle function and fine-tuning the policy with Genetic GFN.
[ref] Ziegler et al. "Fine-tuning language models from human preferences." arXiv preprint arXiv:1909.08593 (2019).
[ref] Rafailov et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural Information Processing Systems 36 (2024).
## Reviewer 4tbn (6)
Thank you for recognizing our results outperforming SOTA and our well-motivated strategy. We address your comments below.
> **(in Strengths)** Nonetheless, I believe that some of the other methods shown in Figure 3 also have that feature (e.g., REINVENT4 also has a controllable temperature parameter), which many be misleading in the current text/plot that suggests only the genetic GFN has it
Thanks for the suggestion; we agree with your feedback. We will include an explanation of the temperature-like parameter in REINVENT4 [ref] and adjust the text/plot accordingly.
[ref] Loeffler, Hannes H., et al. "Reinvent 4: Modern AI–driven generative molecule design." Journal of Cheminformatics 16.1 (2024): 20.
> How sensitive is the genetic GFN to the hyperparameters? I saw that many defaults were chosen to compare directly to REINVENT, did these also happen to be good options for GGFN or could even better results have been obtained? Some discussion around hyperparameter sensitivity would be interesting, perhaps I missed it.
In the manuscript, we searched only the newly introduced hyperparameters; the sensitivity results are provided in Appendix F.6.
As suggested, we further conducted a hyperparameter sensitivity analysis (batch size, learning rate, and the number of layers). Since we have not searched all hyperparameters, there may be a better combination of hyperparameters.
| batch 64, lr 0.0005 (main) | batch 128, lr 0.0005 | batch 64, lr 0.001 |
| --- | --- | --- |
| 16.088 ± 0.025 | 15.801 ± 0.016 | 15.900 ± 0.035 |
Varying the number of layers:
| 2 layers | 3 layers (main) | 4 layers |
| --- | --- | --- |
| 15.628 ± 0.021 | 16.088 ± 0.025 | 16.012 ± 0.030 |
These results are obtained with 3 independent runs.
In addition, the results in Appendix F.6 show that our method consistently achieves competitive performance under different hyperparameter setups (GA parameters and the number of replay training steps).
> Similarly, a discussion around the computational complexity of the model would be interesting.
Unfortunately, rigorous analysis of computational complexity presents significant challenges for the following reasons:
- The number of iterations is performance-dependent, terminating either when the maximum number of reward (oracle) calls is reached (repeated samples do not require additional calls) or when early-termination conditions are met.
- Graph GA involves RDKit API calls, e.g., converting molecules to SMILES, removing certain components, and kekulization, whose runtime depends on the input molecules.
> In Table 1, the authors demonstrate that their approach outperforms MolGA and REINVENT on most tasks from the PMO benchmark when using the AUC top-10 metric. Do these results still hold for the AUC top-1 (Table 14 suggests they do), and if so, is it by similar amounts or are the gaps wider between the top 3 models?
Here, we provide the results of the top-3 models (Genetic GFN, Mol GA, and REINVENT). The performance gaps are similar across all metrics.
| | AUC Top1 | AUC Top10 | AUC Top100 |
| -------- | -------- | -------- | -------- |
| Genetic GFN | 16.527 ± 0.043 | 16.213 ± 0.042 | 15.516 ± 0.041 |
| Mol GA | 16.001 ± 0.027 |15.686 ± 0.025 | 15.021 ± 0.025 |
| SMILES-REINVENT | 15.686 ± 0.035 | 15.185 ± 0.035 | 14.306 ± 0.033 |

| | Avg. Top1 | Avg. Top10 | Avg. Top100 |
| -------- | -------- | -------- | -------- |
| Genetic GFN | 17.924 ± 0.054 | 17.760 ± 0.054 | 17.481 ± 0.054 |
| Mol GA | 17.252 ± 0.032 | 17.116 ± 0.030 | 16.816 ± 0.029 |
| SMILES-REINVENT | 17.345 ± 0.040 | 17.149 ± 0.042 | 16.763 ± 0.043 |
> How does the model scale with the number of atoms? For instance, can it scale well to other modalities or macromolecules?
The maximum length of SMILES is set to 140, consistent with the setting used in REINVENT. This length corresponds to approximately 70-100 atoms, depending on the molecules. For JNK3 (#10), which consists of relatively large molecules, the generated SMILES lengths range from 50 to 130, with the number of atoms varying from approximately 35 to 100. In contrast, for the isomers_c7h8n2o2 (#8), the generated molecules typically contain about 10-15 atoms.
> It is interesting to see that for task #22 (valsartan_smarts), only the genetic GFN and REINVENT perform somewhat well on this. Do the authors have any idea why the genetic GFN and REINVENT are the only models which do not completely fail on this task? I think it would be interesting to dig more deeply into some of the specific oracles/tasks and look at the molecules being generated by the genetic GFN and REINVENT - for instance, do they find the same solutions here?
Thanks for the insightful comments. The valsartan_smarts oracle targets molecules containing a SMARTS pattern related to valsartan while exhibiting physicochemical properties corresponding to the sitagliptin molecule [1]. It computes the arithmetic mean of several scores: (1) a binary score for whether the molecule contains the required SMARTS structure, (2) LogP, (3) TPSA, and (4) the Bertz score. Since we utilize a TDC oracle function for evaluations (see the usage sketch after the list below), we provide our empirical observations here.
- Difficulty of the task: Due to the binary score (1 only if the required SMARTS structure is included), many runs terminate with scores of 0. Especially with a limited number of oracle calls, generating molecules containing a specific substructure is notoriously hard. Other literature shows that other methods achieve high scores with more oracle calls [2]. Even with 10K calls, REINVENT and Genetic GFN succeed in finding non-zero-score molecules in only one out of five independent runs.
- Another observation is that the methods achieving non-zero scores (REINVENT, Genetic GFN, and GEGL) all generate SMILES with RNN-based models. Thus, we conjecture that SMILES generation is effective for producing a specific SMARTS pattern.
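For reference, the usage sketch mentioned above: scores come from the PyTDC oracle, which can be queried as follows (a minimal example, assuming the PyTDC package is installed):

```python
from tdc import Oracle

# the binary SMARTS gate makes most molecules score exactly 0
oracle = Oracle(name="valsartan_smarts")
print(oracle("CCO"))  # 0.0: ethanol lacks the required SMARTS substructure
```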
We provide examples of generated molecules with non-zero valsartan_smarts scores. Note that the other four seeds failed. Each run generates similar molecules (see the Top-1/10/100 samples in Fig. 3 in the additional material), but the samples from the two runs have different structures (the molecular distance between the two Top-1 samples is 0.854).
[1] Brown et al. (2019). GuacaMol: benchmarking models for de novo molecular design. Journal of chemical information and modeling.
[2] Hu et al. (2024). De novo drug design using reinforcement learning with multiple gpt agents. Advances in Neural Information Processing Systems, 36.
> Despite the excellent results, the authors fail to provide any accompanying code, which is a shame (the link in the paper points to a non-existing anonymized repo).
We apologize for the inconvenience caused. There was an error in the provided link; please refer to this link (https://anonymous.4open.science/r/genetic_gfn).