# General responses

We are sincerely grateful to the reviewers for their valuable feedback on our manuscript. We are pleased that the reviewers found our paper **easy to follow** (gE92, D17P) and supported by **sufficient experiments** (uUFH, 8pj5). The recognition of the **practical importance** of our research (uUFH, gE92) is greatly appreciated, and we are encouraged by the acknowledgment of our **originality** (uUFH, 8pj5), **simplicity** (D17P), and **performance** (gE92).

In summary, the reviewers share concerns about the practical molecular optimization (PMO) benchmark. We therefore address these concerns in this general response and provide detailed responses to each review.

### "Practical Molecular Optimization (PMO)" is an official benchmark published in the NeurIPS 2022 Datasets & Benchmarks Track

Our method has been rigorously evaluated on the PMO benchmark, which was accepted in the NeurIPS 2022 Datasets & Benchmarks Track.

It is indeed challenging to compare new algorithms on a fair and practical benchmark. However, the NeurIPS Datasets & Benchmarks Track serves the crucial role of providing **a unified benchmark against which new algorithms can be fairly tested**, alongside hyperparameter-tuned baselines, to validate their efficacy. The PMO benchmark evaluates more than 20 baseline algorithms, including reinforcement learning (both model-free and model-based), Bayesian optimization, genetic algorithms, and active learning. In addition, the benchmark provides pharmaceutically relevant oracle functions, including multi-objective ones. Furthermore, **this benchmark has already been accepted by the community, and we believe it deserves respect and recognition.**
It is important to highlight that **our method demonstrates clear state-of-the-art performance on this official benchmark**, whose primary objective is to maximize the area under the curve (AUC) within 10,000 oracle calls. Our method consistently outperforms existing approaches under the official protocol. While there may be questions about specific details of the experimental setup, such as the maximum number of oracle calls or the choice of metrics, it is essential to note that **the experimental design has undergone rigorous peer review and was carefully constructed to ensure fair comparisons and practicality**.

# Reviewer uUFH

> This is already confusing, since GFlowNet is usually not trained to maximise a function, but rather to sample from it.

We acknowledge that GFlowNets are an amortized inference method that samples with probability proportional to the reward. We used the term because GFlowNets can also be applied to optimization tasks (i.e., maximizing an objective function), as demonstrated by Zhang et al. (2023) [1]. We intended to leverage GFlowNets to maximize the objective function, but we understand that the statement can be confusing. Following the feedback, we will revise "Subsequently, the model is tuned to maximize a specific oracle function value." into "Subsequently, the model is refined by utilizing GFlowNets to sample with probabilities proportional to the value of a specific oracle function. This approach enhances the likelihood of discovering high-value samples due to its diverse exploration capabilities."

[1] Zhang, Dinghuai, et al. "Let the flows tell: Solving graph combinatorial problems with GFlowNets." Advances in Neural Information Processing Systems 36 (2023).

> What is oracle-directed fine-tuning? This does not seem to be a commonly used term but is used here matter-of-factly. And which data set? In the first sentence, a data set is mentioned to carry out the pre-training phase. It is logical to think this would refer to the same data set. However, if my understanding is correctly, this is not the case.

We employ the term "oracle-directed fine-tuning" to describe the fine-tuning phase, which involves training by querying the oracle to assess generated molecules. The dataset comprises exploration samples generated by the GFlowNet agent (i.e., de novo designed samples), and samples are selected in proportion to their rank. It is important to clarify that this data is distinct from the "pre-training data," as it contains oracle values obtained through oracle-directed fine-tuning.

> The concept of GFlowNet fine-tuning should also be explained. And how are genetic algorithms integrated into the GFlownet?

Thank you for bringing up this clarification question. We explain the concept of GFlowNet fine-tuning in the above response.

To elaborate on how genetic algorithms are integrated into the GFlowNet, we collect samples using both the training policy and the genetic algorithm, and we train the policy with GFlowNets, which is an off-policy training method. Note that all samples are kept in the same replay buffer; the genetic algorithm finds improved samples starting from samples in the buffer and adds them back to the buffer. In short, we employ the genetic algorithm as one of the behavior policies. We will revise this section for better clarification.
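To make this interplay concrete, a minimal sketch of the fine-tuning loop is given below. It is illustrative rather than our exact implementation: `sample_policy`, `graph_ga_offspring`, `oracle`, and `train_gfn_tb` are placeholders for the GFlowNet sampler, the Graph GA operators, the property oracle, and one trajectory-balance update, and details such as rank-based buffer sampling and the KL penalty are omitted.

```python
import random
from typing import Callable, List, Tuple

def finetune_with_genetic_search(
    sample_policy: Callable[[int], List[str]],                 # GFlowNet policy: draws SMILES strings
    graph_ga_offspring: Callable[[List[str]], List[str]],      # Graph GA crossover/mutation on buffer samples
    oracle: Callable[[str], float],                            # property oracle (each call consumes budget)
    train_gfn_tb: Callable[[List[Tuple[str, float]]], None],   # one off-policy TB update on (SMILES, reward) pairs
    oracle_budget: int = 10_000,
    batch_size: int = 64,
    n_offspring: int = 8,
    n_generations: int = 2,
) -> List[Tuple[str, float]]:
    """Sketch: the policy and the genetic algorithm both act as behavior policies
    feeding one shared replay buffer, and the GFlowNet is trained off-policy."""
    buffer: List[Tuple[str, float]] = []
    n_calls = 0
    while n_calls < oracle_budget:
        # 1) on-policy exploration: sample molecules from the current GFlowNet policy
        candidates = sample_policy(batch_size)
        # 2) genetic exploration: evolve molecules already stored in the buffer
        for _ in range(n_generations):
            parents = [s for s, _ in buffer] + candidates
            candidates += graph_ga_offspring(parents)[:n_offspring]
        # 3) score every new candidate with the oracle (counts toward the budget)
        scored = []
        for smiles in set(candidates):
            if n_calls >= oracle_budget:
                break
            scored.append((smiles, oracle(smiles)))
            n_calls += 1
        buffer.extend(scored)
        # 4) off-policy GFlowNet update on samples drawn from the shared buffer
        train_gfn_tb(random.sample(buffer, min(len(buffer), batch_size)))
    return buffer
```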
> From this description, it is hard to infer the basic ideas of the training procedure that is proposed. To my understanding, the basic idea is that the training batches used to minimise the GFlowNet loss are constructed by first sampling trajectories from the GFlowNet policy, which are then evolved through genetic search. However, this is not stated in similar terms anywhere in the description of the method.

Thank you for confirming. We acknowledge that this aspect is not well presented in the main text and the introduction. We will revise it accordingly, referring to the comments provided above. In addition, Algorithm 1 will be provided in Section 4.

> Regarding the pre-training phase, it is also not fully clear what the exact procedure is. My understanding is that the policy \pi is a GFlowNet that is trained off-policy with backward trajectories from a data set and the maximum likelihood loss. However, this is also not mentioned explicitly. Instead, Section 3.1 talks about graph-based alternatives that are not relevant to the description of the method. The description of the proposed method is rather limited to sentence "our policy is pre-trained to maximize the likelihood of valid molecules on existing chemical datasets" and Equation 2.

During the pre-training phase, we use simple maximum likelihood estimation (MLE) over valid molecules to pre-train the sequential policy $\pi$; the loss function is presented in Eq. (2). It is important to note that GFlowNets are not used during this pre-training phase. Following this, we fine-tune $\pi$ using the GFlowNet training algorithm, where the genetic algorithm guides exploration for the off-policy training of GFlowNets.
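For completeness, a minimal sketch of this pre-training objective is given below, assuming an autoregressive token-level SMILES model with the architecture described in our responses (three GRU layers, 128-dimensional embeddings, 512-dimensional hidden state); the tokenizer and class names here are illustrative, not our exact code.

```python
import torch
import torch.nn as nn

class SmilesPolicy(nn.Module):
    """Illustrative autoregressive SMILES model: token embedding -> GRU -> vocabulary logits."""
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden_dim: int = 512, n_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.gru(self.embed(tokens))
        return self.head(hidden)  # next-token logits at every position

def mle_pretrain_step(model: SmilesPolicy, tokens: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One maximum-likelihood step on a batch of tokenized valid molecules.
    No oracle scores are involved; the loss is the sequence negative log-likelihood."""
    logits = model(tokens[:, :-1])                      # predict token t+1 from tokens <= t
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```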
> The clarity of the presentation of results (Section 5), could also be improved, in my humble opinion. First of all, the main results are presented in a full-page table (Table 1) containing more than 230 AUC results with their corresponding standard deviations. It is not possible for an average human to make any sense from such an amount of information. I strongly encourage the authors to consider presenting these results visually so that one can understand the overall performance of a method across the entire range of tasks at a glance, without having to perform multiple pairwise comparisons of floating point numbers with their standard deviations. The same comment would apply for Table 2.

We agree that the presentation can be difficult to digest. We originally followed the presentation of the benchmark paper to reflect the benchmark's intention. We will change the presentation following your comments.

### Clarity

We agree with your comments on our experiments regarding the ablation study and statistical analysis.

### Statistical analysis

The resulting p-values of the average AUC10 are as follows.

| | p-values |
| --- | --- |
| Genetic GFN vs. Mol GA | 0.0011 |
| Genetic GFN vs. REINVENT | 0.0025 |
| Genetic GFN vs. GEGL | 0.00003 |
| Genetic Search Ablation | 0.0109 |
| KL Penalty Ablation | 0.1357 |

### Conclusion related to LLMs

Our discussion can offer insights for future research. As evidenced by recent studies, such as the excellent example in [1], genetic algorithms have been used with LLMs. We believe that our integration of GFlowNets and genetic algorithms can be applied to LLMs as well, and we hope this serves as inspiration for future researchers. By orienting our discussion towards future applications, we aim to provide readers with forward-looking perspectives.

#### Reference

[1] Liu, Xiaogeng, et al. "AutoDAN: Generating stealthy jailbreak prompts on aligned large language models." arXiv preprint arXiv:2310.04451 (2023).

### Scientific rigor

> Related to scientific rigour too, I would note that Figure 3 highlights results on 4 tasks from the set of more than 20 tasks, but I have not found a justification of why these four tasks are more relevant than others. Therefore, I am inclined to think that they may have been cherry-picked.

We follow the official PMO benchmark, which was accepted in the NeurIPS 2022 Datasets & Benchmarks Track. We did not cherry-pick tasks: two tasks (isomers_c9h10n2o2pf2cl and celecoxib_rediscovery) are included in Fig. 3 following the PMO benchmark's choice, and the other two are chosen because they are multi-property oracles. We present four tasks simply because presenting all results (more than 20 methods with 23 oracles) is too complex (as commented under Clarity). Please note that all results for the 23 oracles are included in Table 1.

> Figure 4, in turn, plots the diversity and average top-10 scores of various models. The authors indicate that the proposed algorithm defines the Pareto front. While the results of the proposed model are indeed positive, I think it would be important to extend the Pareto front in the plot to include Graph GA, which achieves the maximum diversity from the whole set of methods.

We agree that both Genetic GFN and Graph GA lie on the Pareto frontier. We will revise the wording accordingly for clarity.

### Revision for the claims or statements

> pre-training is inevitable since training the generative policy from scratch is excessively sample-inefficient": is there really any evidence that pre-training is inevitable?

We agree. We will revise the statement to "Pre-training is helpful since training the generative policy from scratch might be sample-inefficient without guidance from a valid chemical dataset."

In addition, we performed experiments without pre-training, conducting three independent runs. However, for certain oracles such as scaffold_hop, the policy failed to generate new valid molecules before reaching the 10,000 calls. In such cases, the AUC score cannot be measured fairly, so we report the total and average AUC10 scores across the remaining 21 oracles.

| | Total AUC10 | Avg. AUC10 |
| -------- | -------- | -------- |
| Genetic GFN | 14.877 | 0.647 |
| Genetic GFN w/o pre-training | 6.111 | 0.266 |
> The paper claims to introduce replay training in several occasions, while replay training in GFlowNets have been used in various publications before.

We agree with this feedback. We will revise the statement by replacing the term 'introduce' with 'use' for replay training.

> Similarly, the authors also claim to introduce an inverse temperature term in the reward function, while this has been repeatedly used in the GFlowNet literature, including in the original paper.

We agree with this feedback. Note that we did not intend to claim that we introduce a new inverse-temperature parameter. We will revise the statements to say that we 'use' the inverse temperature.

> "The genetic search explores high-reward regions, which are hard to find using the current generative policy only.": can this be proven empirically or theoretically? I understand that this is the motivation to introduce genetic search, but I would perhaps tone down the statement.

Thanks for the insightful comment. This description serves as motivation rather than a formal claim. We will tone it down accordingly.

### Questions

> In the ablation study, one of the models is referred to as "Genetic GFN without genetic search". From the paper, it is not clear to me how the GFlowNet in this variant is trained. The original model is trained purely off-policy with data from a data set constructed via genetic search. If no genetic search is performed, is the GFlowNet then trained on-policy? Or off-policy with the pre-existing data set as is?

We agree that the explanation is confusing. To clarify, our algorithm comprises the following components: (1) unsupervised pre-training, (2) policy parameterization for string generation (not graph generation), and (3) GFlowNet training of the pre-trained policy with (4) genetic-algorithm-based exploration. In "Genetic GFN without genetic search," we remove component (4) and conduct on-policy exploration instead.

> In the ablation study, does "w/o KL include" the maximum likelihood pre-training. My understanding is that it is included, since it is not mentioned otherwise. In this case, I think it would be a good idea to also include in the ablation study a variant without pre-training, in order to assess the contribution of this component.

Thanks for the suggestion. We performed experiments without pre-training, conducting three independent runs. However, for certain oracles such as scaffold_hop, the policy failed to generate new valid molecules before reaching the 10,000 calls. In such cases, the AUC score cannot be measured fairly, so we report the total and average AUC10 scores across the remaining 21 oracles.

| | Total AUC10 | Avg. AUC10 |
| -------- | -------- | -------- |
| Genetic GFN | 14.877 | 0.647 |
| Genetic GFN w/o pre-training | 6.111 | 0.266 |

Please note that pre-training on an open-source dataset (without oracle scores) is a common practice in de novo molecular optimization.

> Local search GFN is another GFlowNet method designed to achieve the same goal of Genetic GFN. Have the authors compared their results with LS-GFN?

We conducted experiments to compare our approach with LS-GFN, which was not originally designed for de novo molecular optimization with SMILES. In our experiment, LS-GFN is implemented to partially remove SMILES sequences from the back and reconstruct them from the partial SMILES.
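For the reviewer's reference, the adaptation can be sketched as follows. This is an illustrative simplification (operating on raw characters rather than tokens, with greedy acceptance); `complete_from_prefix` stands for a policy rollout that finishes a partial SMILES, and `oracle` is the property function whose calls count toward the budget.

```python
import random
from typing import Callable

def smiles_local_search(
    smiles: str,
    complete_from_prefix: Callable[[str], str],  # policy rollout extending a partial SMILES to a full one
    oracle: Callable[[str], float],
    n_rounds: int = 4,
    max_cut: int = 10,
) -> str:
    """Back-and-forth local search adapted to SMILES strings: cut a suffix,
    let the policy regenerate it, and keep the better-scoring molecule."""
    best, best_score = smiles, oracle(smiles)
    for _ in range(n_rounds):
        cut = random.randint(1, min(max_cut, max(1, len(best) - 1)))
        candidate = complete_from_prefix(best[:-cut])   # reconstruct from the partial SMILES
        score = oracle(candidate)
        if score > best_score:                          # greedy acceptance for simplicity
            best, best_score = candidate, score
    return best
```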
The total AUC10 and diversity metrics are reported based on five independent runs.

| | Total AUC10 | Avg. AUC10 |
| -------- | -------- | -------- |
| Genetic GFN | 16.213 | 0.705 |
| LS-GFN | 15.230 | 0.662 |

We also conducted a t-test on the average AUC10:

| | Genetic GFN | LS-GFN |
| -- | -- | -- |
| Mean | 0.705 | 0.662 |
| Variance | 7.10E-05 | 7.50E-05 |
| Observations | 5 | 5 |
| Hypothesized Mean Diff. | 0 | |
| t Stat | 7.9065 | |
| P(T<=t) one-tail | 2.37594E-05 | |
| t Critical one-tail | 1.8595 | |

The p-value of the t-test is 2.37594E-05.

It is important to note that local search may be less effective when the DAG is a tree and when the dimensionality is high. In contrast, we employ Graph GA, which enables genetic search to operate within the molecular graph space. Additionally, the operations within Graph GA are carefully designed to facilitate exploration of the extensive molecular space.

# Reviewer 8pj5

### Weakness

> the problem falls under the category of active learning in ML. Several methods have been attempted this problem, and the authors not only need to discuss them but also highlight their differences. Specifically, what the current method offers compared to existing methods is lacking.

Thanks for the valuable suggestion. We agree that a discussion of active learning is warranted.

**Active learning baselines are already included.** The PMO benchmark provides several active learning baselines, such as MolPAL [1] (a model-based screening method), GP BO [2] (the active learning version of Graph GA with a proxy model), and GFlowNet-AL [3] (the active learning version of GFlowNets). Note that GP BO and GFlowNet-AL are already included in the reproduced results. Based on the experimental results, the benchmark paper [4] provides an interesting discussion of model-based methods (i.e., active learning using a proxy model at the outer level and training a generative model at the inner level) versus model-free methods (i.e., not using active learning): "Model-based methods are potentially more efficient but need careful design."

**Genetic GFN can be incorporated with active learning.** It is worth noting that our method can also utilize a proxy model and be structured as an active-learning-style algorithm. Nonetheless, our algorithm already outperforms the existing algorithms that employ a proxy model within an active-learning-style approach.

[1] Graff, David E., Eugene I. Shakhnovich, and Connor W. Coley. "Accelerating high-throughput virtual screening through molecular pool-based active learning." Chemical Science 12.22 (2021): 7866-7881.

[2] Tripp, Austin, Gregor N. C. Simm, and José Miguel Hernández-Lobato. "A fresh look at de novo molecular design benchmarks." NeurIPS 2021 AI for Science Workshop. 2021.

[3] Jain, Moksh, et al. "Biological sequence design with GFlowNets." International Conference on Machine Learning. PMLR, 2022.

[4] Gao, Wenhao, et al. "Sample efficiency matters: a benchmark for practical molecular optimization." Advances in Neural Information Processing Systems 35 (2022): 21342-21357.
> Even GFlowNet-based baselines are omitted. For instance, GFlowNet-AL and MOGFN-AL can address the same problem. Although the authors include GFlowNet-AL in the supplement, they fail to specify how many rounds they ran the method and if the comparison is fair. Additionally, MOGFN-AL intuitively has GA capabilities to some extent.

It is important to emphasize that **our goal is not to establish superiority over other GFlowNets or to achieve Pareto performance for multi-objective optimization**. The results of GFlowNet and GFlowNet-AL are provided in Appendix D.3; we followed the provided hyperparameters of each method, which were searched according to the benchmark's fair tuning guidelines. However, we agree that detailed descriptions of the baseline setups are missing. We will therefore include the implementation details of the important baselines to address this concern.

> Discussion around GFNSeqEditor would be useful.

GFNSeqEditor is a novel algorithm that utilizes GFlowNets to guide sequence editing, which could potentially serve as an effective evolutionary algorithm across various optimization fields. In contrast, our Genetic GFN takes an orthogonal direction by leveraging existing evolutionary-style algorithms for exploration and aiming to amortize them. We believe that Genetic GFN could potentially benefit from leveraging GFNSeqEditor to enhance exploration further. We appreciate the suggestion of the GFNSeqEditor paper and will incorporate a discussion of its relevance into our manuscript.

> LaMBO is another method that demonstrates it could outperform GA, which is missing in the baselines. There might be many other methods with which I'm not familiar.

Our selection of baselines is rooted in the official PMO benchmark, which includes over 20 baselines and three representative Bayesian optimization algorithms. LaMBO is specifically designed for optimizing biological sequences, and it may not directly translate to chemical structure design without further algorithmic development and engineering. Therefore, we believe it does not serve as a direct baseline for our work. Nonetheless, we will include a discussion of how LaMBO works and its potential implications for our research.

> Offline and online-based methods can also be considered as baselines.

Firstly, the problem setting is not **offline**: a limited number of reward calls is allowed for **online learning**. Secondly, the PMO benchmark already includes various baseline methods, including MCTS, genetic algorithms, hill climbing, VAEs, gradient ascent, score-based modeling, and Bayesian optimization. In particular, we re-evaluated the top-8 methods by strictly following the provided setup.

### Method

> 10K calls to the oracle are not practical.

While we are constrained by a maximum of 10,000 oracle calls, it is essential to note that performance is evaluated with the area under the curve (AUC), a metric inherently indicative of sample efficiency. A higher AUC indicates that a method achieves higher scores with fewer oracle calls, such as 1K, compared to methods with lower AUC. In addition, as discussed above, we follow the experimental setup of the officially published benchmark, which has been thoroughly vetted from various perspectives.
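To clarify why the AUC metric rewards sample efficiency, below is a simplified sketch of the top-10 AUC computation. The official PMO implementation differs in details such as the logging interval and the handling of early termination, so this is illustrative only.

```python
from typing import List

def top10_auc(scores_in_call_order: List[float], budget: int = 10_000) -> float:
    """Simplified top-10 AUC: after every oracle call, record the mean of the 10 best
    scores seen so far; the metric averages that curve over the full budget, so finding
    good molecules *early* is rewarded even if the final top-10 scores are similar."""
    top10: List[float] = []
    curve: List[float] = []
    for score in scores_in_call_order[:budget]:
        top10.append(score)
        top10 = sorted(top10, reverse=True)[:10]
        curve.append(sum(top10) / len(top10))
    # if fewer than `budget` calls were made, carry the last value forward
    if curve:
        curve += [curve[-1]] * (budget - len(curve))
    return sum(curve) / budget if curve else 0.0
```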
> It is unclear what the benefit of pre-training and using the oracle during fine-tuning or continuous training is.

The molecular space is notoriously vast and contains unrealistic molecules. Thus, it is challenging to train a policy from scratch, especially with limited oracle calls. The most widely used approach is to pre-train on a real-world molecule dataset. It is important to note that this dataset does not include oracle scores; the policy is therefore trained to generate valid molecules, not to maximize a certain property. The pre-trained policy is then fine-tuned by interactively querying the oracle functions.

We performed experiments without pre-training, conducting three independent runs. However, for certain oracles such as scaffold_hop, the policy failed to generate new valid molecules before reaching the 10,000 calls. In such cases, the AUC score cannot be measured fairly, so we report the total and average AUC10 scores across the remaining 21 oracles.

| | Total AUC10 | Avg. AUC10 |
| -------- | -------- | -------- |
| Genetic GFN | 14.877 | 0.647 |
| Genetic GFN w/o pre-training | 6.111 | 0.266 |

> it is unclear if the method can achieve the best of both worlds when combining the two methods. Also, when compared with other methods, diversity and property improvement might be entirely different. A metric that considers both simultaneously is needed because one can control one metric and try to improve the other.

As previously discussed, our primary goal, consistent with the benchmark's objective, is to maximize the molecule score within the limited oracle calls. We introduced diversity from the perspective that high-ranked methods often achieve optimization ability at the expense of diversity, as evident in the clear trade-off observed in the PMO benchmark. Our motivation stems from the belief that GFlowNets can attain high optimization ability by incorporating the high-reward-seeking behavior of other search methods, such as genetic algorithms. This enables the GFlowNet to focus its learning on the high-reward regions of the target distribution. In summary, our proposed method empirically demonstrates not only improved optimization ability but also controllability over the score-diversity trade-off.

> The novelty metric is missing.

In GFlowNet-AL [1], novelty is assessed by measuring the distance from an initial offline dataset, particularly in the context of generating novel biological sequences from that dataset. This is vital in offline model-based optimization (MBO), where generating novel data points that yield high rewards is crucial, and GFlowNet-AL effectively tackles this challenge by leveraging the offline dataset. Our methodology takes a different direction: the PMO tasks do not start with a predefined offline dataset of labeled data points. Instead, we utilize an unlabeled chemical dataset as a reference set to enrich the chemical realism of our models. Our objective is not to generate highly novel, out-of-distribution chemical structures, but to produce realistic chemical structures that resemble the reference set while possessing the desired properties, thereby ensuring high rewards.

[1] Jain, Moksh, et al. "Biological sequence design with GFlowNets." International Conference on Machine Learning. PMLR, 2022.

### Questions

> It is not clear whether the authors, who are focusing on sequences, why plot molecules in their main figure.

We employ Graph GA, whose operations are conducted on molecular graphs. We could use other string-based GAs such as SMILES GA or STONED, but they do not include crossover operations and are outperformed by Graph GA. Note that SMILES strings can easily be converted to graphs and back.
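Since the question concerns strings versus graphs: the two views are interchangeable via RDKit, which is what lets Graph GA's graph-level operators act on molecules written by the string policy. A minimal example (assuming the `rdkit` package is installed; the SMILES shown is simply aspirin for illustration):

```python
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # a SMILES string, as produced by the policy
mol = Chem.MolFromSmiles(smiles)         # string -> molecular graph (None if the SMILES is invalid)
if mol is not None:
    # graph-level operations (e.g., Graph GA crossover/mutation) act on `mol` here ...
    canonical = Chem.MolToSmiles(mol)    # graph -> canonical SMILES, handed back to the string policy
    print(canonical)
```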
> Implementation details are lacking, making it impossible to reproduce the results.

Following the responses to all reviews, we will revise our manuscript. Please note that we will move the pseudo-code (currently in Appendix B) into the main manuscript and make the code publicly available for reproducibility.

> Statistics of the data, the size of the models, and their complexity are missing.

We agree that a statistical analysis and some details are missing. We will revise the manuscript to include more details and analysis. The resulting p-values of the t-tests are as follows.

| | p-values |
| --- | --- |
| Genetic GFN vs. Mol GA | 0.0011 |
| Genetic GFN vs. REINVENT | 0.0025 |
| Genetic GFN vs. GEGL | 0.00003 |
| Genetic Search Ablation | 0.0109 |
| KL Penalty Ablation | 0.1357 |

Our model consists of three GRU layers with a 128-dimensional token embedding and a 512-dimensional hidden state, as described in Appendix A.2. Although we employ the same model architecture as REINVENT, we will add a more detailed description to our manuscript.

> It is not obvious how important pre-training is in this case. If the model pre-training was not effective, what would happen? Furthermore, implementation details for pre-training are also missing.

During the pre-training phase, we train the model to maximize the likelihood of valid molecules from the ZINC250K dataset (without oracle scores) using Eq. (2). Note that GFlowNets are not utilized in this pre-training process. We follow the unified pre-training guidelines of the PMO benchmark. In addition, results with and without pre-training are provided in the previous response. We will revise our manuscript to include the guidelines and the results.

> It is unclear how the data has been split and how performance has been evaluated. It seems that the authors evaluated the method on the same oracle they used in GA.

As previously discussed, no initial data (with oracle scores) is given, so there is no data split to discuss. In addition, the GA is not used to pre-collect data. In our approach, genetic search occurs concurrently with training, and the oracle calls made during genetic search are also counted.

# Reviewer gE92

> GFlowNets have been widely used for molecular optimization, more recent related work should be included and discussed. For example, GFlowNets have been extended to multi-objective molecular optimization [1,2]

Thanks for the suggestion. The manuscript discusses recent advances, including forward-looking GFlowNets and LS-GFN, in Section 4.2. Even though our research scope does not include multi-objective optimization, it is worth discussing. We will revise our manuscript to discuss the suggested papers.

> The proposed Genetic GFN simply combines some existing techniques that have been shown useful in previous work. For example, the pretrain-then-finetune paradigm from REINVENT, the genetic algorithm from Graph GA, the ranking-based sampling from weighted retraining.

Our contribution lies in the novel and careful combination of existing approaches that have previously been studied independently.
> In additional analysis, the authors compare the proposed Genetic GFN with vanilla graph-based GFlowNets. However, these baselines have been improved in terms of different aspects, including reply buffer and training objectives. It is interesting to include more advanced implementations of GFlowNet.

It is important to emphasize that our goal is not to establish superiority over other GFlowNets. However, we sincerely agree that including a more advanced GFlowNet would be interesting. Therefore, we provide experimental results comparing the proposed method with Local Search GFN (LS-GFN):

| | Total AUC10 | Avg. AUC10 |
| -------- | -------- | -------- |
| Genetic GFN | 16.213 | 0.705 |
| LS-GFN | 15.230 | 0.662 |

We also conducted a t-test on the average AUC10:

| | Genetic GFN | LS-GFN |
| -- | -- | -- |
| Mean | 0.705 | 0.662 |
| Variance | 7.10E-05 | 7.50E-05 |
| Observations | 5 | 5 |
| Hypothesized Mean Diff. | 0 | |
| t Stat | 7.9065 | |
| P(T<=t) one-tail | 2.37594E-05 | |
| t Critical one-tail | 1.8595 | |

It is worth noting that LS-GFN might be less effective in tree-structured MDPs, where the backward policy probability is always one, and in high-dimensional settings, because LS-GFN employs a back-and-forth search strategy that heavily depends on the robustness of the backward policy. In contrast, we employ Graph GA, which enables genetic search to operate within the molecular graph space, and its operations are carefully designed to facilitate exploration of the extensive molecular space.

> Wrong citation. Why (Jain et al., 2022) is cited for using GNNs to generate molecules.

> Fix the typo. 'SELFIES-REIVENT' in line 262.

Thanks for finding our mistakes; we have revised them.

> The training objective of genetic GFN is TB loss with KL penalty. Does the global minima of this objective lead to a consistent flow network? The authors should discuss this situation because it's the main purpose of the framework of GFlowNet.

We anticipate that the global minimum is reached when the TB loss reduces to zero over the entire support, which corresponds to a consistent flow. With a sufficiently small KL penalty (the coefficient is set to 0.01), achieving this also means that the GFlowNet's distribution remains closely aligned with the reference distribution.

> GFlowNet is known for its ability to generate more diverse objects. However, the diversity of Genetic GFN is lower than other baselines (including other genetic algorithms), in Table 1.

By adjusting the inverse temperature (beta), we can achieve a higher AUC10 score with higher diversity, as stated in Sections 1 and 5.3.

| | Total AUC10 | Diversity |
| -------- | -------- | -------- |
| Genetic GFN (beta = 30) | 15.815 | 0.528 |
| Mol GA | 15.686 | 0.465 |
| REINVENT | 15.185 | 0.468 |

Our algorithm is specifically designed for sample-efficient molecular optimization, properly controlling the diverse exploration capabilities offered by GFlowNets. As a result, our emphasis lies in applying GFlowNets to optimization tasks, setting our approach apart from conventional GFlowNets research that often prioritizes metrics such as the L1 probabilistic gap or the number of modes.
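For concreteness, a minimal sketch of the reward shaping we refer to is below. Whether the shaping is implemented as $s(x)^\beta$ or $\exp(\beta\, s(x))$ is an implementation detail; the sketch uses the former, and beta = 30 corresponds to the higher-diversity configuration in the table above.

```python
def shaped_reward(oracle_score: float, beta: float = 30.0) -> float:
    """Inverse-temperature reward shaping: the GFlowNet is trained to sample
    proportionally to R(x) = s(x) ** beta. Larger beta concentrates the target
    distribution on high-scoring molecules (higher AUC10, lower diversity);
    smaller beta flattens it (more diversity)."""
    return oracle_score ** beta
```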
# Reviewer D17P

> The method's novelty is fairly limited. The pre-training phase is equivalent to learning the reward via maximum likelihood --- as described in Section 3 of "Unifying Generative Models with GFlowNets and Beyond" by Zhang et al. [1]. Since the underlying pointed DAG of Genetic GFN is tree-structured, the distribution over terminal states resumes to the probability of a path.

Our maximum likelihood estimation (MLE)-based pre-training deviates from the MLE-GFN method proposed by Zhang et al. [1]. Unlike MLE-GFN, our MLE training operates in an **unsupervised manner**, focusing solely on the sequence policy **without leveraging reward information**. Following the pre-training phase, we transition to GFlowNet training, where the reference distribution originates from MLE and is fine-tuned under the KL penalty within GFlowNets. To the best of our knowledge, this approach has not been extensively investigated within the context of GFlowNets.

[1] Zhang, Dinghuai, et al. "Generative flow networks for discrete probabilistic modeling." International Conference on Machine Learning. PMLR, 2022.

> The hyperparameter choices for Genetic GFN seem arbitrary, which may play a significant role in the results. The results change significantly (in average) by varying GA hyperparameters as shown in Appendix D.6 --- n.b. standard deviations are missing. I believe changing GFlowNet hyper-parameters (e.g., the temperature and could have even more stern consequences

We provide the standard deviations of AUC-Top10 for the results in Tables 11 and 12 below. Please note that we denote the GA-related hyperparameters as (offspring size x generations).

*Table. Average and standard deviation of results with different generations (GA iterations)*

| | Average | Std. |
| -- | -- | -- |
| 8 x 1 | 15.615 | 0.107 |
| **8 x 2 (main)** | **16.213** | **0.173** |
| 8 x 3 | 15.735 | 0.128 |

*Table. Average and standard deviation of results with different offspring sizes*

| | Average | Std. |
| -- | -- | -- |
| 4 x 2 | 15.846 | 0.107 |
| **8 x 2 (main)** | **16.213** | **0.173** |
| 16 x 2 | 15.669 | 0.174 |
| 32 x 2 | 15.621 | 0.200 |

The results do not provide evidence that varying the GA hyperparameters incurs larger changes than varying the temperature or the rank coefficient (the results in Tables 3 and 4). The maximum AUC10 is the same in all tables (16.213). The minimum values observed when varying the number of GA generations and the offspring size are 15.615 and 15.669, respectively, whereas the minimum values in Tables 3 and 4 are 14.597 and 14.652, respectively, i.e., larger changes.

> It seems that one O is consulted, its return is thrown away --- which seems sample inefficient. I would be natural to, e.g., use a neural network surrogate to train the GFlowNet.

We employ a replay buffer to enhance sample efficiency by reusing samples, with aims similar to algorithms that utilize surrogate or proxy models within active learning frameworks, such as Bayesian optimization and GFlowNet-AL. The integration of a (prioritized) replay buffer with an off-policy algorithm like GFlowNets is a natural strategy, following common practice in the reinforcement learning community. While incorporating a proxy model holds promise for our algorithm, observations from the benchmark paper suggest that adding a proxy model and active learning may not always be advantageous, as discussed in Gao et al. [1] with the example of GFlowNet-AL. Conversely, the empirical analyses of sample-efficient GFlowNet training by Shen et al. [2] emphasize that a prioritized replay buffer is both a straightforward and effective method for sample reuse.

[1] Gao, Wenhao, et al. "Sample efficiency matters: a benchmark for practical molecular optimization." Advances in Neural Information Processing Systems 35 (2022): 21342-21357.

[2] Shen, Max W., et al. "Towards understanding and improving GFlowNet training." International Conference on Machine Learning. PMLR, 2023.
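To make the replay mechanism concrete, here is a minimal sketch of rank-based sampling from the buffer, in the spirit of weighted retraining; `kappa` plays the role of the rank coefficient, and the exact weighting scheme in our implementation may differ.

```python
import random
from typing import List, Tuple

def rank_based_sample(buffer: List[Tuple[str, float]], k: int,
                      kappa: float = 0.01) -> List[Tuple[str, float]]:
    """Sample k (SMILES, reward) pairs with probability decreasing in reward rank,
    so high-reward molecules are replayed more often. Smaller kappa sharpens the priority."""
    if not buffer:
        return []
    ranked = sorted(buffer, key=lambda x: x[1], reverse=True)
    n = len(ranked)
    weights = [1.0 / (kappa * n + rank) for rank in range(1, n + 1)]
    return random.choices(ranked, weights=weights, k=min(k, n))
```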
Advances in neural information processing systems 35 (2022): 21342-21357. [2] Shen, Max W., et al. "Towards understanding and improving gflownet training." International Conference on Machine Learning. PMLR, 2023. <!-- ??? We do not throw away any samples. We utilize all samples generated by the policy and genetic search in GFlowNet training. One of the key messages from the original paper of the PMO benchmark is that "model-based methods are potentiall y more effiecien but need careful design." In addition, the paper says "GFlowNet (model-based variant: GFlowNet-AL) indicate that simply adding a predictive model might not necessarily be helpful." in the last paragraph in page 8. --> > Following up on the previous points, there are three natural missing in this work: 1) A GFlowNet trained directly to sample from using queries; 2) A GFlowNet trained on a neural surrogate of O; 3) Multi-objective GFlowNets. The first one seems to mean GFlowNet training without genetic search, which means the policy is solely generate samples and trained with that samples. We inculde the results in our self-ablation study (denoted as 'w/o GS' in Table 2). The second one is discussed in the previous comment. Lastly, the PMO benchmark is basically a single-objective optimization task. > From Algorithm 1, I am inferring the GFlowNet is trained based solely on a fixed set of terminal states (which does not cover some with probability 1). If this is the case, it is impossible to achieve balance and, in practice, GFlowNet might be sampling from arbitrary targets --- e.g., sampling arbitrary nodes. Do the authors have some way to filter those out? For instance, do we need to check the oracle when drawing from the trained model to avoid hallucinations? <!-- **Please use chatgpt to revise this paragraph** GFlowNets can samples with distribution proportional to reward using Algorithm 1 because given temrinal state $x$, the trajectory is determined uniquely $\tau = (s_0,...,s_n = x)$ where $s_t$ be the subsequence of $s_{t+1}$ by eleminating last element of the seuquence. Than we can use trajectory balance over $\tau$ so that it will learns $p(x) = P_F(\tau_{\rightarrow x})/P_B(\tau_{\rightarrow}|x) = P_F(\tau_{\rightarrow}) = R(x)/Z$ where $Z$ be partition function by minimizing loss function: --> GFlowNets can sample distributions proportional to reward using Algorithm 1 by leveraging the uniqueness of trajectories. Given a terminal state $x$, the trajectory is uniquely determined as $\tau = (s_0,...,s_n = x)$. Here, $s_{t-1}$ is obtained by obtained by eliminating the last element of the sequence in $s_t$. By employing trajectory balancing over $\tau$, the model learns $p(x) = P_F(\tau_{\rightarrow x})/P_B(\tau_{\rightarrow}|x) = P_F(\tau_{\rightarrow}) = R(x)/Z$, where $Z$ is the partition function minimized through the loss function as follows: $$ L(\tau;\theta) = \log (\frac{P_F(\tau;\theta)Z_{\theta}}{R(x)})^2$$ Here, if $L(\tau;\theta) = \log (\frac{P_F(\tau;\theta)Z_{\theta}}{R(x)})^2 =0$, then $\frac{P_F(\tau;\theta)Z_{\theta}}{R(x)} = 1$. Therefore, $p(x) = P_F(\tau_{\rightarrow x})/P_B(\tau_{\rightarrow}|x) = P_F(\tau) = R(x)/Z$. Note tht $P_B$ is set to 1 for tree-based MDP. The $\tau_{\rightarrow x}$ stands for trajectory $\tau$ that terminate to $x$; thus, $P_F(\tau) = \prod_{t=1}^n P_F(s_t|s_{t-1})$. <!-- ???? We assess all samples drawn by GFlowNet and searched by genetic search. In the early stage, the policy draw samples arbitrarly, but is trained to samples from the target distribution. 
> It is hard to understand what exactly counts within the queries budget. How exactly does this account for each of the model's training steps? (see also the bullet above)

It is important to note that the rule for counting oracle calls is shared across all methods in the PMO benchmark. First, the GFlowNet policy generates SMILES with a batch size of 64, and duplicated or invalid samples are removed; validity can be checked by converting the SMILES into molecules, which is a simple computation. The oracle is then queried only for newly generated samples, and the PMO benchmark provides an online buffer to prevent duplicate queries of samples that have already been computed. Following this, we collect the initial population and generate 8 offspring, then query the oracle to assess them; this is repeated over two generations (8 x 2 = 16 offspring in total). Without considering duplication, each training iteration therefore consumes at most 64 + 16 oracle calls. Hyperparameters such as the batch size and offspring size are explained in Section 5.1.
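As an illustration of this accounting rule, a hedged sketch of an oracle wrapper is below. The PMO benchmark's own oracle handles this bookkeeping internally; the class and names here are ours, for exposition only, and assume the `rdkit` package.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from rdkit import Chem

class BudgetedOracle:
    """Wraps a property oracle so that only *new, valid* molecules consume budget,
    mirroring the accounting described above (duplicates hit the cache instead)."""
    def __init__(self, oracle: Callable[[str], float], budget: int = 10_000):
        self.oracle, self.budget = oracle, budget
        self.cache: Dict[str, float] = {}

    def __call__(self, smiles_batch: Iterable[str]) -> List[Tuple[str, float]]:
        scored = []
        for smi in smiles_batch:
            mol = Chem.MolFromSmiles(smi)           # validity check is cheap; no oracle call needed
            if mol is None:
                continue
            key = Chem.MolToSmiles(mol)             # canonicalize so duplicates are detected
            if key not in self.cache:
                if len(self.cache) >= self.budget:  # budget exhausted
                    break
                self.cache[key] = self.oracle(key)  # the only place an oracle call is counted
            scored.append((key, self.cache[key]))
        return scored
```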