## General Response

We are sincerely grateful to the reviewers for their valuable feedback on our manuscript. We are pleased that the reviewers found our paper **well-written** (AHCz, WGsn), supported by **comprehensive experiments** (t88c, WGsn, 6RU1), and a **novel, good contribution** (AHCz, 6RU1) that addresses the key challenge of balancing novelty with conservativeness (t88c, WGsn).

### General concern

Why we did not compare against improved-exploration methods for GFlowNets (in particular, LS-GFN and Thompson-sampling GFN, each mentioned by two reviewers):
- Experiment: our noising + denoising vs. back-and-forth search. To adapt back-and-forth search to our setting, noise the last $\delta \times L$ tokens instead of Bernoulli noising with probability $\delta$.
- The conservative-search idea can be combined with these improved-exploration studies; moreover, the motivation itself is different.

### Reviewer AHCz: 8 (3)

#### Weakness

> In the adaptive method for setting δ, why does setting $\delta \approx \delta_{const.} - 1/L$ not set $\lambda \sigma \approx 1/L$ to a constant that does not depend on the uncertainty? Given this, I find the name "adaptive" and the exposition slightly misleading. The adaptivity seems to stem from the sequence length L and not the uncertainty about the reward proxy $\sigma$. How are the experimental results affected if $\lambda$ was chosen a constant?

We acknowledge that our initial presentation may have caused confusion. Our method adapts to each datapoint $x$ (a given sequence to edit) based on the prediction uncertainty for $x$, leveraging $\sigma(x)$ to adjust $\delta$. The confusion arose from our handwavy expression $\lambda \sigma \approx 1/L$. We have therefore revised it in the manuscript to $\lambda \mathbb{E}_{P_{\mathcal{D}_0}(x)}[\sigma(x)] \approx \frac{1}{L}$, which means we set the scaling hyperparameter $\lambda$ based on the average value of $\sigma(x)$ during the first round. Consequently, $\delta \approx \delta_{\text{const.}} - 1/L$ does not hold because: 1) $\sigma(x)$ varies across datapoints $x$, and 2) even for the same $x$, $\sigma(x)$ evolves throughout the active learning rounds.
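To make the calibration concrete, here is a minimal sketch in NumPy (the uncertainty values, function name, and clipping are illustrative assumptions, not our implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

L = 14                                      # sequence length (an RNA-scale example)
delta_const = 0.5                           # base conservativeness for short tasks
sigma_D0 = rng.uniform(0.005, 0.015, 100)   # toy stand-in for first-round sigma(x)

# Calibrate the scaling factor so that lambda * E_{D0}[sigma(x)] ~= 1/L.
lam = 1.0 / (L * sigma_D0.mean())

def adaptive_delta(sigma_x):
    # delta(x) = delta_const - lambda * sigma(x), clipped to a valid probability.
    return float(np.clip(delta_const - lam * sigma_x, 0.0, 1.0))

# Higher proxy uncertainty => smaller delta => more conservative search.
assert adaptive_delta(0.020) < adaptive_delta(0.005)
```

The key point of the sketch is that $\lambda$ is fixed once from the first-round uncertainty scale, while $\delta(x)$ still varies per datapoint through $\sigma(x)$.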
<!-- Noising with probability δ -> the number of noised tokens already depends on L. $\lambda$ is a constant scaling factor (since the $\sigma$ values are on the order of 0.001). $\lambda$ rescales things so that changes in $\sigma$ shift the number of noised tokens by about 1 (it acts as a scaling factor rather than a hyperparameter) => this did not come across. Re-explain + manuscript revision needed
- manuscript update
- "Constant": add examples (RNA $L=14$, $\delta=0.5$; AAV $L=90$, $\delta=0.05$) / chose 0.5 and 0.05 without extensive hyperparameter tuning
- "Adaptive"
  - $\lambda$ is a fixed scaling factor since the scale of $\sigma$ is too small, e.g., 0.01. This prevents $\delta_{\text{const}}$ from dominating.
  - We provided a guideline for setting the value of $\lambda$ (add explanation: $\lambda$ is set relative to $\mathbb{E}_{x \sim \mathcal{D}_0}[\sigma(x)]$ computed with the initial proxy; this too is not highly optimized, but a reasonable value chosen by inspecting the scale: 0.1, 1, 5, etc.)
  - Why adaptive? The proposed function adjusts $\delta$ according to the uncertainty for each data point $x$. The higher the uncertainty, the more conservative (lower $\delta$) the search becomes.
-->

> It would be nice to see the evaluation of various from Appendix B.1 extended to multiple datasets. It seems that the choice of δ is critical for the achieved performance and it would be interesting what ranges of δ-values outperform the strongest baseline in each of the tasks to develop an understanding for the robustness of δ-CS.

Thank you for your suggestion. We conducted additional experiments with different $\delta$ values to validate the robustness of $\delta$-CS. The summarized results are shown in the tables below, and we include more comprehensive results in Appendix B.1 of the updated manuscript.
**RNA-A**

| | $\delta = 0.1$ | $\delta = 0.25$ | $\delta = 0.5$ | $\delta = 0.75$ | GFN-AL |
| - | - | - | - | - | - |
| Max | 1.040 ± 0.014 | 1.031 ± 0.026 | **1.055** ± 0.000 | 0.925 ± 0.055 | 1.030 ± 0.024 |
| Median | 0.918 ± 0.034 | 0.916 ± 0.028 | **0.939** ± 0.008 | 0.776 ± 0.018 | 0.838 ± 0.013 |

**GFP**

| | $\delta = 0.01$ | $\delta = 0.025$ | $\delta = 0.05$ | $\delta = 0.075$ | GFN-AL |
| - | - | - | - | - | - |
| Max | 3.593 ± 0.003 | **3.594** ± 0.005 | 3.592 ± 0.003 | 3.592 ± 0.005 | 3.578 ± 0.003 |
| Median | **3.578** ± 0.002 | 3.574 ± 0.003 | 3.567 ± 0.003 | 3.563 ± 0.004 | 3.511 ± 0.006 |

> Can the authors elaborate how $\delta$ is chosen in the experiments? Is this a choice that can be reproduced in practice, i.e., does it require running $\delta$-CS with multiple (all?) values of $\delta$, or does it require determining $\delta$ from a (small) validation experiment, or something else?

We set $\delta = 0.5$ for short-sequence tasks (e.g., DNA, RNA) and $\delta = 0.05$ for long-sequence tasks (e.g., GFP, AAV). This hyperparameter was only roughly tuned with small validation experiments: we ran tests with a proxy model trained on just part of the initial dataset and a validation model trained on the entire initial dataset as the oracle. These experiments were done only for the short DNA sequence task and the long protein sequence task. The tuning could be further improved by tuning task-by-task with such validation experiments.

<!-- As you mentioned, $\delta$ can be tuned with a small validation; one way is to use a proxy function trained on the initial data. -->

<!-- The expected number of noised tokens is $\delta L$; we set $\delta$ so that this is roughly 4-15. In our experiments, RNA/DNA sequences are short (L = 8, 14) with small vocabularies, so we use $\delta = 0.5$; for the protein tasks (L = 238, 90) we use 0.05. The experiments above show robust results within a reasonable range. -->

<!-- To tune $\delta$, as in Hard TF-Bind8, one can train a proxy on the initial data and, on a small held-out dataset, check how the correlation degrades with distance from the training data to decide $\delta$.
-->

### Reviewer t88c: 3 (1)

#### Weakness

> Novelty: This paper draws from evolutionary search and existing GFlowNet frameworks, combining aspects of both rather than developing a unique algorithmic structure. Specifically, δ-Conservative Search extends GFlowNets by adding a conservativeness parameter, δ, to manage exploration around known data. While useful, balancing exploration with reliability is a recurring theme in off-policy RL, particularly with techniques like uncertainty-based exploration in Bayesian optimization and upper confidence bounds (UCB).

We believe that properly combining search methods ($\delta$-CS) with learning methods (GFNs) is non-trivial: many works in biological sequence design that use exploration-exploitation balancing techniques such as UCB, or evolutionary-search-based methods such as GFNSeqEditor, did not achieve satisfactory performance compared to our method. This shows that our approach of controlling the conservativeness parameter via the level of the Hamming ball in local sequence space was more effective than the alternatives, and that the integration with GFN-AL and uncertainty adaptation was critical to performance. A **novel combination** of techniques is explicitly recognized as a form of novelty in conference review guidelines.

<!-- *How* we combined these two streams is also part of the novelty. + Compared with GFNSeqEditor, which similarly combines the two perspectives, we obtain Pareto-optimal results. ***Comment (Sanghyeok): the question seems to be that the philosophy of $\delta$-CS (conservatism), rather than the method itself (Bernoulli masking), is not novel. Then how about answering that the philosophy is shared, but our proposed method itself is novel & effective? (Reviewer 6RU1 said our method is novel.)*** -->

> (W2) Evaluating a sequence's performance is critical in the design of biological sequences. A fast, low-cost, and relatively accurate method can often significantly reduce the complexity of the problem. In this paper, the reward information provided to the model is Oracle f.
> Since the reward signal provided by f determines the model's learning, the accuracy and generalizability of f are crucial. I doubt the evaluation and experiments are testing out-of-distribution sequences, which rely on the accuracy of f.

> (Q3) It seems that model f is fixed in training, and it provides reward information for δ-CS; how can we ensure f can provide safe and correct reward information?

In many active learning settings, the **oracle** function $f$, which represents the **ground truth**, can only be queried a limited number of times. Researchers therefore introduce a proxy model $f_{\theta}$ that learns to imitate the oracle using the available dataset. Thus, the setting need not consider the out-of-distribution (OOD) capability of the oracle $f$; instead, we must consider the OOD generalization of the proxy $f_{\theta}$, which is very difficult. This motivates us to search conservatively when the proxy yields high uncertainty. The proxy's limited accuracy therefore further supports our motivation: the $\delta$-CS off-policy method relies less on proxy performance than existing methods do, which is part of our method's novelty.

<!-- **TODO: re-explain the setting so it is understood + why there is novelty**
Sketch (Sanghyeok):
1) **$f$ is not a model to learn, but a fixed, external, ground-truth oracle (i.e., there are no OOD issues for $f$)**.
2) Active learning in design optimization tries to find a good design via multiple rounds of mini-batch queries, i.e., at each round, we need to make a batch of candidates to query. We can't call $f$ to obtain the batch. Thus, researchers often opt for using the surrogate $f_\theta \approx f$, fit on the initial data and the data queried during the rounds, to obtain better candidates, e.g., via UCB-like methods.
3) Why we introduce $\delta$-CS?
Because the surrogate model $f_\theta$ is often unreliable outside of the training data, the candidate generation algorithm could produce overly radical candidates. These candidates can waste oracle calls. $\delta$-CS softly restricts the search space to small Hamming balls and thus can generate more reliable, and therefore better, candidates. -->

#### Questions

> Regarding the choice of δ, although the authors reported a control experiment on δ in the appendix, are all the experiments in Table 1 based on the same δ?

For the short-sequence tasks (e.g., DNA, RNA) we set $\delta = 0.5$, and for the long-sequence tasks (AAV, GFP) we set $\delta = 0.05$. These values were roughly tuned with small validation experiments: we ran tests with a proxy model trained on a portion of the initial dataset and a validation model trained on the entire initial dataset.

<!-- **Todo: Hyeonah**
(Sanghyeok) We use $\delta_{\text{const}}=0.5$ for RNA and DNA and $\delta_{\text{const}}=0.05$ for protein tasks. Note that these values are not optimized with an extensive search, but selected via a simple intuition:
- For RNA and DNA, the sequence length is short (L is 8 and 14, respectively), so the search space is small. In this case, even with the larger $\delta$, the expected number of tokens to be re-sampled is still small (if $\delta = 0.5$, this is 4 and 7 for RNA and DNA, respectively).
- For protein tasks, however, the sequence length is much larger (L is 238 for GFP and 90 for AAV). In this case, we need a lower $\delta$ so that the expected number of tokens to be re-sampled is reasonably small ($238 \times 0.05 \approx 12$, and $90 \times 0.05 \approx 5$), such that the intended conservatism is imposed.
- We want to remark that our algorithm is not very sensitive to the value of $\delta$ if it is in a reasonable range that can be set based on the intuition above.
In addition to the sensitivity experiments in the Hard TF-Bind8 task, we conducted the same experiments in three other benchmarks. The results are below, and the manuscript is updated accordingly (Appendix B). In our experiments, RNA/DNA are short (L = 8, 14) with small vocabularies, so we use $\delta = 0.5$; for the protein tasks (L = 238, 90) we use 0.05. Appendix B.1 and the additional experiments show robust results within a reasonable range. -->

<!-- RNA-A (0.5 used in the main experiments)
| $\delta$ | 0.1 | 0.25 | 0.5 | 0.75 |
| - | - | - | - | - |
| Max | 1.040 ± 0.014 | 1.031 ± 0.026 | 1.055 ± 0.000 | 0.925 ± 0.055 |
| Median | 0.918 ± 0.034 | 0.916 ± 0.028 | 0.939 ± 0.008 | 0.776 ± 0.018 |

GFP (0.05 used in the main experiments)
| $\delta$ | 0.01 | 0.025 | 0.05 | 0.075 |
| - | - | - | - | - |
| Max | 3.593 ± 0.003 | 3.594 ± 0.005 | 3.592 ± 0.003 | 3.592 ± 0.005 |
| Median | 3.578 ± 0.002 | 3.574 ± 0.003 | 3.567 ± 0.003 | 3.563 ± 0.004 |

AAV (0.05 used in the main experiments)
| $\delta$ | 0.01 | 0.025 | 0.05 | 0.075 |
| - | - | - | - | - |
| Max | 0.703 ± 0.011 | 0.737 ± 0.009 | 0.708 ± 0.0117 | 0.645 ± 0.031 |
| Median | 0.671 ± 0.011 | 0.706 ± 0.011 | 0.663 ± 0.007 | 0.594 ± 0.025 |

(+) Pareto curves in Appendix -->

> What is the initial dataset $D_0$? How is it initialized? What is the cost to acquire the reward for all sequences in $D_t$ in line 3 of Alg.1?

As noted in the experimental section, for TF-Bind and RNA we follow existing benchmarks [1, 2] for the initial dataset $D_0$. We collect $D_0$ for AAV and GFP based on [3], by applying evolutionary search to the wild-type sequences provided in [3]. No cost is required to acquire the rewards for the sequences in $D_t$ in line 3, because the rewards are already precomputed for each data point and included in the dataset.

[1] Kim, Minsu, et al. "Bootstrapped training of score-conditioned generator for offline design of biological sequences." Neural Information Processing Systems (NeurIPS), 2024.

[2] Trabucco, Brandon, et al. "Design-bench: Benchmarks for data-driven offline model-based optimization." International Conference on Machine Learning. PMLR, 2022.
[3] Sinai, Sam, et al. "AdaLead: A simple and robust adaptive greedy search algorithm for sequence design." arXiv preprint arXiv:2010.02141 (2020).

### Reviewer WGsn: 3 (4)

> W1. proxy model misspecification. the hypothesis (*) is not quantitatively validated well enough when the search space is scaled up

> Q1. Can we use $\delta$-Conservative Search with any other proxy model?

Thank you for pointing this out. To validate our hypothesis on larger-scale tasks, we conducted additional experiments on the AAV and GFP tasks (see the tables below), which we have included in the revised manuscript.

**AAV with CNN**

| | $D_{0}$ | $D_{heldout, \leq 1}$ | $D_{heldout, \leq 2}$ | $D_{heldout}$ |
| - | - | - | - | - |
| Spearman $\rho$ | 0.943 | 0.872 | 0.777 | 0.407 |

**GFP with CNN**

| | $D_{0}$ | $D_{heldout, \leq 1}$ | $D_{heldout, \leq 2}$ | $D_{heldout}$ |
| - | - | - | - | - |
| Spearman $\rho$ | 0.551 | 0.293 | 0.098 | -0.352 |

> W2. Another major concern is the lack of proxy model ablation

Thank you for pointing that out. To verify our hypothesis across proxy-model types, we conducted additional experiments using a different proxy, MuFacNet [1], on the AAV, GFP, and RNA tasks (the default proxy model is based on a 1D CNN). With this alternative proxy, the trends are consistent, validating our hypothesis. See the results below (we have also added them to the revised manuscript).

[1] Ren, Zhizhou, et al. "Proximal exploration for model-guided protein sequence design." International Conference on Machine Learning. PMLR, 2022.
**AAV with MuFacNet**

| | $D_{0}$ | $D_{heldout, \leq 1}$ | $D_{heldout, \leq 2}$ | $D_{heldout}$ |
| - | - | - | - | - |
| Spearman $\rho$ | 0.932 | 0.893 | 0.827 | 0.371 |

**GFP with MuFacNet**

| | $D_{0}$ | $D_{heldout, \leq 1}$ | $D_{heldout, \leq 2}$ | $D_{heldout}$ |
| - | - | - | - | - |
| Spearman $\rho$ | 0.456 | 0.404 | -0.076 | -0.551 |

**Maximum scores**

| | RNA-A | GFP | AAV |
| - | - | - | - |
| CNN | 1.055 ± 0.000 | 3.592 ± 0.003 | 0.708 ± 0.010 |
| MuFacNet | 1.050 ± 0.003 | 3.592 ± 0.005 | 0.699 ± 0.017 |

> W3. how does δ-CS differ from existing off-policy search methods for GFlowNets such as [7,8]? (Especially, LS-GFN [7])

As you mentioned, there are many off-policy methods for GFlowNets (we have added them to the related-work section). Among them, the most relevant is Local Search GFN (LS-GFN) [7], since both methods focus on restricted search in local regions for reward exploitation (the other suggested method, Thompson sampling [8], targets diverse exploration rather than conservatism). We made a direct comparison with LS-GFN. LS-GFN performs worse than our method because it relies on back-and-forth search with backward and forward policies; for unidirectional sequence generation, its search space is limited to the leaf part of the sequence. In contrast, our local neighborhood is a Hamming ball in sequence space: we distribute the search equally over all token positions (from head to leaf), not only the leaf part, which makes the local search more flexible. More results are provided in the updated manuscript (Appendix B.5).
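As a toy illustration of the difference between the two noising schemes (a sketch only: the uniform resampling stands in for denoising, which in our method is done by the GFlowNet policy, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, L, DELTA = 4, 14, 0.5   # RNA-like alphabet and length

def bernoulli_noise(seq, delta):
    """delta-CS style: every position is independently resampled w.p. delta."""
    mask = rng.random(len(seq)) < delta
    noised = seq.copy()
    noised[mask] = rng.integers(0, VOCAB, size=int(mask.sum()))
    return noised, mask

def suffix_noise(seq, delta):
    """Back-and-forth style: only the last round(delta * L) tokens are resampled."""
    k = round(delta * len(seq))
    noised = seq.copy()
    noised[len(seq) - k:] = rng.integers(0, VOCAB, size=k)
    return noised

seq = rng.integers(0, VOCAB, size=L)

# The number of positions touched by Bernoulli noising follows Binomial(L, delta),
# so its average is close to L * delta (= 7 here), while suffix noising always
# touches exactly the last round(delta * L) positions and never the head.
touched = [int(bernoulli_noise(seq, DELTA)[1].sum()) for _ in range(20000)]
print(sum(touched) / len(touched))  # close to 7.0
```

Both schemes perturb the same expected number of tokens, but only the Bernoulli version spreads the edits over the whole sequence, which is the flexibility discussed above.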
Results (Maximum)

| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| back-and-forth | 0.613 ± 0.009 | 0.572 ± 0.003 | 0.722 ± 0.009 | 0.977 ± 0.008 | 3.592 ± 0.001 | 0.549 ± 0.005 |
| Ours | 1.055 ± 0.000 | 1.014 ± 0.001 | 1.094 ± 0.045 | 0.981 ± 0.002 | 3.592 ± 0.003 | 0.708 ± 0.010 |

> W4. More BO baselines (especially with trust region)

Thank you for your suggestion. We agree that a comparison with state-of-the-art trust-region-based BO methods provides stronger evidence of the effectiveness of our method. We conducted experiments with TuRBO [1], a widely used trust-region-based BO method, in our setting. As shown in the tables, while TuRBO generally scores higher than classical BO, our method surpasses TuRBO across tasks, demonstrating the benefit of our $\delta$-CS constraints.

Maximum

| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| BO | 0.722 ± 0.025 | 0.720 ± 0.032 | 0.506 ± 0.003 | 0.977 ± 0.008 | 3.572 ± 0.000 | 0.500 ± 0.000 |
| TuRBO | 0.935 ± 0.034 | 0.921 ± 0.052 | 0.912 ± 0.036 | 0.974 ± 0.019 | 3.586 ± 0.000 | 0.500 ± 0.000 |
| Ours | **1.055** ± 0.000 | **1.014** ± 0.001 | **1.094** ± 0.045 | **0.981** ± 0.002 | **3.592** ± 0.003 | **0.708** ± 0.010 |

Median

| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| BO | 0.510 ± 0.008 | 0.502 ± 0.013 | 0.506 ± 0.003 | 0.806 ± 0.007 | 3.378 ± 0.000 | 0.478 ± 0.000 |
| TuRBO | 0.622 ± 0.046 | 0.629 ± 0.030 | 0.541 ± 0.068 | **0.974** ± 0.019 | **3.583** ± 0.003 | 0.500 ± 0.000 |
| Ours | **0.939** ± 0.008 | **0.929** ± 0.004 | **0.972** ± 0.043 | 0.971 ± 0.006 | 3.567 ± 0.003 | **0.663** ± 0.007 |

[1] David Eriksson, Michael Pearce, Jacob Gardner, Ryan D Turner, and Matthias Poloczek.
Scalable Global Optimization via Local Bayesian Optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pages 5496-5507, 2019.

> W5. A quantitative analysis showing how the level of conservativeness/noise injection amount differs throughout the active learning rounds

> Q6. How does noise injection/level of conservativeness differ throughout the active learning rounds? Have you analyzed the behavior with respect to the proxy uncertainty?

The level of conservativeness varies with the data point. As active learning progresses through more rounds, it continues to seek new data points with high predictive uncertainty; as new points are discovered, the conservativeness level associated with them is likewise dynamic. We found no empirical distributional trend with respect to the number of active rounds, so the model must adapt in every active learning round and for each data point $x$.

> W6. A part that requires a bit of clear discussion is that, if the proxy model is not of good quality then exploration helps to recover from getting stuck at local optima and collect diverse points, whereas over-exploration yields unreliable results as explained for using GFlowNets case. I think this trade-off could have been argued more clearly while motivating the method

Thank you for opening this insightful discussion. Diverse exploration can help recover from bad local optima, yet it can produce unreliable results if taken too far. We therefore need to balance diversity so that the search focuses on reliable regions without being so local that it cannot escape bad local optima. Our $\delta$ factor can be interpreted as exactly this balancing parameter.

> W7. batch size ablation

Thank you for your suggestion. We conducted additional experiments with query batch sizes of 32 and 512 (the default is 128). The same ablation is also applied to AdaLead as a baseline.
We report the results in the tables below, and we also added **Figures 15 and 16 in Appendix B.6** to the manuscript. The results demonstrate that our method outperforms AdaLead across batch sizes, exhibiting robustness.

Maximum (bs=32)

| | RNA-A (L=14) | GFP (L=238) | AAV (L=90) |
| - | - | - | - |
| AdaLead | 0.866 ± 0.049 | 3.572 ± 0.000 | 0.508 ± 0.006 |
| GFN-AL | 0.979 ± 0.007 | 3.577 ± 0.005 | 0.533 ± 0.004 |
| Ours | 1.021 ± 0.033 | 3.583 ± 0.006 | 0.685 ± 0.019 |

Maximum (bs=512)

| | RNA-A (L=14) | GFP (L=238) | AAV (L=90) |
| - | - | - | - |
| AdaLead | 1.008 ± 0.034 | 3.584 ± 0.003 | 0.656 ± 0.015 |
| GFN-AL | 1.041 ± 0.016 | 3.593 ± 0.002 | 0.579 ± 0.007 |
| Ours | 1.049 ± 0.005 | 3.597 ± 0.002 | 0.707 ± 0.012 |

<!-- median (bs=32)
| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| AdaLead | 0.509 ± 0.065 | 0.551 ± 0.066 | 0.553 ± 0.048 | 0.761 ± 0.015 | 3.503 ± 0.003 | 0.480 ± 0.001 |
| Ours | | | | | | |
-->

<!-- maximum (bs=512)
| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| AdaLead | 1.008 ± 0.034 | 1.004 ± 0.013 | 0.993 ± 0.054 | 1.000 ± 0.000 | 3.584 ± 0.003 | 0.656 ± 0.015 |
| GFN-AL | 1.041 ± 0.016 | | | | | |
| Ours | 1.049 ± 0.005 | | | | | |
-->

<!-- median (bs=512)
| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| AdaLead | 0.888 ± 0.029 | 0.885 ± 0.020 | 0.842 ± 0.068 | 0.973 ± 0.001 | 3.567 ± 0.002 | 0.602 ± 0.016 |
| Ours | | | | | | |
-->

> W8. theoretical analysis

Thank you for pointing this out.
We provide a theoretical analysis of the exploration range and sample complexity:

1. **Exploration range**: Let $\delta \in [0,1]$ be the exploration parameter serving as the success probability in a Bernoulli distribution for each token in a sequence of length $L$. Each token independently has probability $\delta$ of being flipped (changed) and $1 - \delta$ of remaining the same. Define the random variable $H$ as the Hamming distance, i.e., the number of differing tokens between the two sequences. The probability distribution of $H$ is
$$ P(H = k) = \binom{L}{k} \delta^k (1 - \delta)^{L - k}, \quad k = 0, 1, \dots, L, $$
a binomial distribution with parameters $n = L$ and $p = \delta$. Its expectation and variance are
$$ \mathbb{E}[H] = L\delta, \quad \operatorname{Var}(H) = L\delta(1 - \delta). $$
A higher $\delta$ increases $\mathbb{E}[H]$, expanding the exploration region in sequence space and promoting diversity.

2. **Sample complexity**: The noising process introduces no meaningful additional sample complexity, since it samples from a tractable, trivial distribution (Bernoulli). The major bottleneck is sampling sequences proportionally to the reward, which is intractable; the GFlowNet (GFN) amortizes this process with a neural network, reducing the sampling cost to a forward pass with token-by-token $O(L)$ complexity, where $L$ is the sequence length. This is consistent with every sequence-based decoding method in deep learning.

3. **Convergence**: Proving the convergence of deep learning models is notoriously difficult without substantial assumptions, even on fixed datasets. A convergence analysis for active learning with an inner deep-learning training loop is harder still. Instead, this paper demonstrates convergence empirically across six tasks.

> W9.
> There is no clear discussion on the limitations and potential drawbacks of the method.

Thank you for your feedback. We added a discussion section (Section 7), including notes on the limitations, to the updated manuscript.

> Q2. For the baseline methods that require surrogate/proxy (e.g. BO), is the same proxy model of $\delta$-CS used, with the same initial dataset?

We use the same 1D-CNN-based proxy (from Sinai et al., 2020) for all algorithms except GFN-AL. For GFN-AL, we implement two versions and report the better of the two: the original implementation (Jain et al., 2022) with an MLP-based proxy, and our own implementation with the 1D-CNN-based proxy. These are described under **Implementation details** in Section 6 and Appendix A.3.

> Q3. How diversity/novelty is calculated? Is it the diversity of the last top-k batches? Presenting this metric across active learning rounds would have enhanced clarity and provided a more comprehensive understanding of the diversity achieved by $\delta$-CS.

Following GFN-AL (Jain et al., 2022), diversity and novelty are calculated over the top-128 at the final round. We added the definitions of these metrics in Appendix B.8. As you suggested, we also report the progress of diversity and novelty across active learning rounds in Appendix B.8.

> Q4. In Figure 5, AAV task, there is a substantial difference in diversity between the proposed method and baselines. Given that the search space is medium-sized—somewhere between RNA-A and GFP—what might be the reason for this significant disparity?

We believe you mean Figure 3. The low diversity and novelty of the baselines arise because: 1) the initial dataset is not very diverse (the diversity of the top-128 samples in the initial dataset is around 5); and 2) the candidates queried by GFN-AL or GFNSeqEditor have (relatively) low oracle scores, so the top-128 batch could not be populated by newly queried diverse and novel candidates.
Compared to these baselines, our method generated diverse and novel high-scoring candidates throughout the active learning rounds.

> Q5. How does performance vary with respect to the initial dataset size, particularly in domains with large search spaces? The initial dataset sizes currently employed are very large (e.g. 50% of the whole search space for TF-Bind-8, or the value set for GFP is again too much) compared to what is used for active learning approaches in the literature.

Thank you for your suggestion. The initial dataset size is crucial for the performance of active learning approaches, especially when the search space is exponentially large. We conducted ablation studies on the initial dataset size in the high-dimensional tasks AAV and GFP, varying the initial dataset size (e.g., $\vert\mathcal{D}\vert=1000$). As shown in the tables, our method outperforms AdaLead and GFN-AL even with a small initial dataset. We have also discussed TF-Bind-8 and introduced its hard version, which has a much smaller and locally biased initial dataset, in the manuscript.

$\vert\mathcal{D}\vert=1000$

**Maximum scores**

| | GFP (L=238) | AAV (L=90) |
| - | - | - |
| AdaLead | 3.568 ± 0.005 | 0.557 ± 0.023 |
| Ours | 3.591 ± 0.007 | 0.704 ± 0.024 |

**Median scores**

| | GFP (L=238) | AAV (L=90) |
| - | - | - |
| AdaLead | 3.529 ± 0.006 | 0.494 ± 0.010 |
| Ours | 3.570 ± 0.008 | 0.666 ± 0.018 |

$\vert\mathcal{D}\vert=5000$

**Maximum scores**

| | AAV (L=90) |
| - | - |
| AdaLead | 0.564 ± 0.029 |
| Ours | |

**Median scores**

| | AAV (L=90) |
| - | - |
| AdaLead | 0.500 ± 0.021 |
| Ours | |

$\vert\mathcal{D}\vert=10000$

**Maximum scores**

| | AAV (L=90) |
| - | - |
| AdaLead | 0.564 ± 0.037 |
| Ours | |

**Median scores**

| | AAV (L=90) |
| - | - |
| AdaLead | 0.497 ± 0.009 |
| Ours | |

> Q7.
> Line 478, Shouldn't the phrase in line 478 be "prone to generating low rewards"?

You are right; we have corrected the manuscript. Thank you for pointing this out!

> Q8. What is the main source of benefit of using an RL method that utilizes a proxy model (as a reward function) and trains a generative policy instead of using some other active learning frameworks that directly utilize inexpensive proxy/surrogate models that can work under very limited training data? Why train a reward model + policy network instead of directly using the proxy model for querying?

Since these sequences live in a high-dimensional combinatorial space, directly using the proxy model to generate query candidates is intractable: it would require additional sampling procedures such as MCMC, which are slow and suffer from slow exploration and poor mode mixing. As a result, they may fail to query highly informative samples (i.e., a diverse set of queries with high prediction uncertainty). In contrast, RL methods such as GFlowNets amortize the search process and enable fast mode mixing to generate such candidates efficiently.

#### Minor comments

1. That sentence means "$20^{238}$, which is larger than $10^{309}$".
2. We revised our manuscript so that "trajectory" is not used before it is defined.
3. The typo has been corrected, and the manuscript updated. Thank you for the detailed feedback!

### Reviewer 6RU1: 5 (3)

> Complexity of implementation

We want to emphasize that the hyperparameters are not excessively tuned. First, the reweighting coefficient $k$ is usually set to 0.01 or 0.001 [1, 2]; the results below (and more in Appendix B.8) show that both settings outperform the baselines. The adaptation is not critical for $\delta$-CS, as performance is decent with just a constant hyperparameter and is robust to its selection. $\lambda$ (which is only required for the adaptive version of $\delta$-CS) is also easy to tune because it is just a scaling factor.
We can use a tuned $\lambda$ for similar-scale tasks (e.g., we use an identical $\lambda$ for short-sequence tasks such as DNA and RNA, and for long-sequence tasks such as AAV and GFP), and it can be adjusted with a simple validation experiment. To demonstrate hyperparameter robustness, our sensitivity analysis in Appendix B.2 and the tables below show that our method performs well across various values of these parameters.

[1] Kim, Minsu, et al. "Bootstrapped training of score-conditioned generator for offline design of biological sequences." Advances in Neural Information Processing Systems 36 (2024).

[2] Kim, Hyeonah, et al. "Genetic-guided GFlowNets for sample efficient molecular optimization." The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
**Ablation for $k$**

| | $k=0.01$ (default) | $k=0.001$ |
| - | - | - |
| RNA-A (Max) | 1.055 ± 0.000 | 1.036 ± 0.023 |
| GFP (Max) | 3.592 ± 0.003 | 3.598 ± 0.003 |

**RNA-A**

| | $\delta = 0.1$ | $\delta = 0.25$ | $\delta = 0.5$ | $\delta = 0.75$ | GFN-AL |
| - | - | - | - | - | - |
| Max | 1.040 ± 0.014 | 1.031 ± 0.026 | **1.055** ± 0.000 | 0.925 ± 0.055 | 1.030 ± 0.024 |
| Median | 0.918 ± 0.034 | 0.916 ± 0.028 | **0.939** ± 0.008 | 0.776 ± 0.018 | 0.838 ± 0.013 |

**GFP**

| | $\delta = 0.01$ | $\delta = 0.025$ | $\delta = 0.05$ | $\delta = 0.075$ | GFN-AL |
| - | - | - | - | - | - |
| Max | 3.593 ± 0.003 | **3.594** ± 0.005 | 3.592 ± 0.003 | 3.592 ± 0.005 | 3.578 ± 0.003 |
| Median | **3.578** ± 0.002 | 3.574 ± 0.003 | 3.567 ± 0.003 | 3.563 ± 0.004 | 3.511 ± 0.006 |

It is noteworthy that we did not search these parameters excessively, and the results are not cherry-picked. As evidence, better combinations for individual tasks appear in our additional results above, e.g., $\delta=0.025$ or $k=0.001$ for GFP.

> Uncertainty Estimate

Good point. Theoretically, uncertainty can be measured even when prediction accuracy is low, as the two have different characteristics. The proxy's low prediction accuracy arises from **out-of-distribution problems**: data points not seen during training. Uncertainty, in contrast, measures how such a new data point affects the uncertainty of the proxy function, which can be quantified through posterior inference over the proxy model's parameters given the current **in-distribution training dataset**. Although theoretically valid, such posterior inference is difficult in practice; we use ensemble methods to estimate uncertainty, which are sub-optimal but practical and widely used. Developing better Bayesian posterior inference methods for active learning would be valuable future work. Our method is orthogonal to this, as $\delta$-CS can be used with any such improved uncertainty estimation method.

> Choice of GFlowNets

1.
Performance gain of GFN-AL + $\delta$-CS over GFN-AL. We argue that the performance improvement achieved by $\delta$-CS is significant, as is particularly evident in the large margins observed for RNA-C, GFP, and AAV. Notably, as shown in Table 2, while GFN-AL underperforms the RL baseline (DyNA PPO) in terms of average score, our approach (with EI) outperforms it on *all metrics*: mean and max score, diversity, and novelty.

2. About [1,4,5]

We agree with this suggestion. Such off-policy advancements in the GFlowNet community can potentially improve active learning performance. While prior methods design off-policy exploration [1,4,5] under the assumption that the reward model is accurate, our $\delta$-CS method is built on a different assumption and purpose: conducting a conservative search to mitigate the risks of proxy misspecification. Among these prior methods, local search GFlowNets (LS-GFN) [1] appear closely related to our method and can also be applied to conservative search. We directly compare our method with LS-GFN as follows:

Results (maximum scores)

| | RNA-A (L=14) | RNA-B (L=14) | RNA-C (L=14) | TFBind8 (L=8) | GFP (L=238) | AAV (L=90) |
| - | - | - | - | - | - | - |
| Back-and-forth (LS-GFN) | 0.613 ± 0.009 | 0.572 ± 0.003 | 0.722 ± 0.009 | 0.977 ± 0.008 | 3.592 ± 0.001 | 0.549 ± 0.005 |
| Ours | 1.055 ± 0.000 | 1.014 ± 0.001 | 1.094 ± 0.045 | 0.981 ± 0.002 | 3.592 ± 0.003 | 0.708 ± 0.010 |

As shown in the table, our method clearly outperforms LS-GFN. We suspect this is because $\delta$-CS searches more flexibly than LS-GFN. The LS-GFN algorithm is based on back-and-forth local search, which uses a backward policy to partially destroy a solution and a forward policy to reconstruct a new one. However, in the autoregressive sequence generation setting, such a backward policy must be unidirectional; their local search region is therefore bounded to the leaf portion of the sequences.
On the other hand, $\delta$-CS destroys tokens randomly and independently, regardless of whether a token is located at the head or the leaf of the sequence, allowing it to search more flexibly in sequence space. We have included these experimental results in Appendix XXX and have added prior training methods of GFlowNets to the related works section.

3. About [2,3]

Regarding the other related works, references [2] and [3] have clearly different purposes and are **orthogonal** to ours: they focus on improving the training scheme of GFlowNets, whereas we focus on improving exploration, which provides the experiences for training. Specifically, [2] suggests a new loss function for better credit assignment, while [3] proposes a new method for multi-objective settings. We could utilize [2] with $\delta$-CS when the trajectory length is large and better credit assignment is needed; similarly, we could employ [3] with $\delta$-CS when multi-objective active learning is required. These could be exciting directions for future work, and we have added this discussion to our manuscript. Thank you for pointing this out; it has helped us greatly improve the manuscript.

----

Thank you for the feedback. We are pleased to know that your concerns, except the one regarding uncertainty, have been addressed. We would like to emphasize that considering uncertainty in $\delta$ is an additional component: the adaptive version of $\delta$. Furthermore, in the adaptive version, the uncertainty is measured on observed data points, where the proxy is relatively reliable compared to OOD data points, with a properly chosen scaling factor $\lambda$. Our main message is that a conservative search using $\delta$ is itself beneficial for active learning in off-policy reinforcement learning for biological sequence design, as it effectively limits the search space of the generative policy.
The subsequent experimental results demonstrate that even a constant $\delta$ without uncertainty measurement gives competitive results, outperforming the other baselines.

**Maximum scores**

| | RNA-A | RNA-B | RNA-C |
| - | - | - | - |
| AdaLead | 0.968 ± 0.070 | 0.965 ± 0.033 | 0.867 ± 0.081 |
| GFN-AL | 1.030 ± 0.024 | 1.001 ± 0.016 | 0.951 ± 0.034 |
| Ours (Adaptive) | 1.055 ± 0.000 | 1.014 ± 0.001 | 1.094 ± 0.045 |
| Ours (Constant) | 1.041 ± 0.023 | 1.014 ± 0.001 | 1.102 ± 0.024 |

**Median scores**

| | RNA-A | RNA-B | RNA-C |
| - | - | - | - |
| AdaLead | 0.808 ± 0.049 | 0.817 ± 0.036 | 0.723 ± 0.057 |
| GFN-AL | 0.838 ± 0.013 | 0.858 ± 0.004 | 0.774 ± 0.004 |
| Ours (Adaptive) | 0.939 ± 0.008 | 0.929 ± 0.004 | 0.972 ± 0.043 |
| Ours (Constant) | 0.914 ± 0.016 | 0.914 ± 0.009 | 0.958 ± 0.033 |
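As a toy illustration of the search-flexibility argument made in the LS-GFN comparison above, the two destruction schemes can be sketched as follows. This is a simplified sketch, not our actual implementation: we assume an RNA alphabet, and uniform random resampling stands in for the learned forward/backward policies.

```python
import random

ALPHABET = "ACGU"  # assumed RNA vocabulary for this toy example

def delta_cs_noise(seq, delta, rng=random):
    """delta-CS-style noising: each token is independently resampled with
    probability delta, regardless of its position (head or leaf)."""
    return "".join(rng.choice(ALPHABET) if rng.random() < delta else t
                   for t in seq)

def back_and_forth_noise(seq, delta, rng=random):
    """LS-GFN-style back-and-forth for autoregressive sequences: the backward
    policy is unidirectional, so only a trailing suffix can be destroyed
    and regenerated; the prefix is always kept intact."""
    k = max(1, round(delta * len(seq)))  # number of trailing tokens to redo
    return seq[:-k] + "".join(rng.choice(ALPHABET) for _ in range(k))
```

With `delta = 0.5` on an $L=14$ sequence, `back_and_forth_noise` always preserves the first 7 tokens, while `delta_cs_noise` may edit any position, which is the flexibility gap discussed above.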