We sincerely thank the reviewers for their insightful comments and for highlighting these important related works. We greatly appreciate the opportunity to address these concerns directly, as they have helped us refine our contributions and strengthen the paper. In our revision, we have incorporated detailed discussions of these baselines, including empirical comparisons against all suggested methods (SVDD [1], DAS [2], TreeG [3], FK-Steering [4], T-SCEND [5], MCTD [6], and Diffusion Rejection Sampling [7]). These additions clarify the novelty of our work and demonstrate where our approaches provide meaningful advantages, particularly in scalability, efficiency, and robustness under equal compute budgets.

----

> W1 & Q1: Novelty and advantage over prior search-based methods
>
> W2: Local Langevin step partly overlaps with prior art

To directly answer your question: our gains stem from (1) the unification of local and global search, which lets compute scale flexibly across both (a dimension not explored in prior works); (2) the adaptive budget allocation in our BFS methods, which dynamically adjusts branching and pruning to avoid wasteful duplication; and (3) the dual-verifier mechanism, which mitigates reward hacking more robustly in guided generation. These gains persist under equal compute, as evidenced by the head-to-head experiments below. Together, the components of our unified framework push the compute-quality Pareto frontier further and offer a more versatile toolkit for practitioners.

We elaborate on our key contributions below, with supporting experiments against the baselines you suggested. For the DFS-specific ablation on hyperparameter sensitivity, please see our response to Q2.

1. **Scaling Local Search via Langevin MCMC (Novelty over DAS [2])**

We appreciate your point that injecting verifier gradients during the reverse process overlaps with DAS [2]. Both approaches indeed use gradient guidance and resampling, but our method diverges in key ways that yield both theoretical and practical advances.

* **Theoretical novelty**: DAS [2] and similar methods approximate the conditional score $\nabla \log p(x|c)$ by injecting gradients directly, which can introduce bias from the first-order approximation error (e.g., assuming $\mathbb{E}[f(x_0) \mid x_t] = f(\mathbb{E}[x_0 \mid x_t])$). We address this by deriving an asymptotically unbiased sampler via annealed Langevin MCMC, providing a stronger theoretical foundation and more reliable sampling without the implicit assumptions in DAS [2].
* **Empirical advantage**: A core innovation is scaling compute in local search through additional local-search steps, a dimension not emphasized in DAS [2] or other prior work, which inject the gradient only once with no recurrence. To demonstrate, we compared our Best-of-N with scaled local search (BoN-i, where i is the number of local steps) against DAS [2], TreeG [3], SVDD [1], and FK-Steering [4] on the challenging PointMaze Ultra task, matching compute budgets (NFEs) for a fair comparison; a minimal sketch of the local-search step is given below, before Table 1.
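To make the local-search step concrete, here is a minimal sketch of one refinement block at a fixed noise level, assuming a pretrained score network and a differentiable verifier. The function and parameter names (`score_model`, `verifier`, `n_steps`, `step_size`, `guidance_scale`) are illustrative placeholders, not our exact implementation.

```python
import torch

def langevin_local_search(x_t, t, score_model, verifier,
                          n_steps=6, step_size=1e-3, guidance_scale=1.0):
    """Sketch: a few Langevin steps at a fixed noise level t (one annealing level).

    Targets a distribution proportional to p_t(x) * exp(guidance_scale * r(x)),
    combining the pretrained score (prior term) with the verifier gradient
    (reward term). Assumes score_model(x, t) returns grad log p_t(x) and
    verifier(x, t) returns one scalar reward per sample.
    """
    x = x_t
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        # Per-sample reward gradients (samples are independent, so summing is fine).
        reward_grad = torch.autograd.grad(verifier(x, t).sum(), x)[0]
        with torch.no_grad():
            drift = score_model(x, t) + guidance_scale * reward_grad
            # Unadjusted Langevin update: x <- x + eta * drift + sqrt(2 * eta) * noise
            x = x + step_size * drift + (2.0 * step_size) ** 0.5 * torch.randn_like(x)
    return x.detach()
```

Each extra step costs a fixed number of NFEs, which is the local-search axis along which the BoN-i variants in Table 1 scale.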
**Table 1: PointMaze, scaling local-search recurrence (success rate % ↑)**

|Compute (NFEs)|BoN-0|BoN-1|BoN-2|BoN-6 (our baseline)|BoN-8|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|-|-|-|
|128|6.875±2.1|3.125±2.72|5.6±2.07|**19.2±2.2**|17.5±2.5|4.375±2.1|5.0±3.1|5.0±3.1|5.0±3.1|
|256|6.25±1.2|7.5±3.1|13.75±2.8|**27.0±3.2**|20.6±1.1|9.375±2.1|8.75±1.25|6.9±2.1|6.875±2.1|
|512|7.5±1.8|10.0±3.1|26.25±2.2|36.7±1.8|**38.1±1.1**|10.0±1.77|15.0±3.25|10.0±1.8|12.5±1.8|
|1024|15.6±2.7|20.6±1.1|41.25±2.2|**58.3±3.2**|50.6±3.2|19.375±3.2|17.625±2.1|26.9±3.2|29.4±3.25|
|2048|25±2.5|35±2.5|53.75±2.8|**81.0±2.5**|71.9±1.1|22.5±2.5|20.675±3.2|37.25±2.2|35.0±1.1|

These results confirm that scaling local search via our Langevin MCMC approach significantly outperforms prior methods at matched compute, establishing a new Pareto frontier. We believe this highlights an underexplored scaling direction that could benefit the broader diffusion community beyond our framework.

2. **BFS-Resampling and BFS-Pruning (Novelty over TreeG [3], SVDD [1], FK [4], DAS [2])**

To your point on overlaps, our BFS methods build on the particle-based ideas in these works but introduce adaptive mechanisms that overcome their fixed-allocation inefficiencies.

* **Adaptive branching**: We dynamically adjust fan-out based on real-time particle scores and the remaining compute, unlike the static strategies in prior works, which can lead to suboptimal resource use.
* **Duplication avoidance via pruning**: We explicitly prune low-scoring or redundant particles, reducing waste. This is especially important for deterministic samplers such as DDIM or DPM-Solver, where duplicates are common in methods like TreeG [3] or SVDD [1].

We isolated these benefits by comparing our BFS variants (using the same local-search kernel) against the baselines at equal NFEs on PointMaze.

**Table 2: PointMaze, BFS variants (success rate % ↑)**

|Compute (NFEs)|BFS-Pruning|BFS-Resampling|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|
|192|**32.5±1.1**|20.0±2.5|18.75±2.8|16.25±3.7|17.5±2.5|18.1±1.1|
|384|**48.1±1.1**|41.25±2.7|38.1±1.1|15.0±2.5|32.5±3.1|36.9±1.1|
|768|**71.25±2.2**|57.75±3.2|52.5±1.8|17.5±4.3|45.75±2.8|50±2.5|

Our adaptive designs yield clear advantages at the same budgets, underscoring their value in making tree search more efficient and scalable.

3. **Double-Verifier Mechanism (Novelty in Guided Generation)**

Existing methods such as [1-6] are vulnerable to reward hacking, where a single verifier can be exploited. Our double-verifier mechanism uses separate verifiers for local and global search, providing a simple yet effective defense. We validated this on ImageNet, comparing our BoN-Double against the priors at equal NFEs.

**Table 3: ImageNet, double verifier (FID ↓ / Acc % ↑)**

|Compute (NFEs)|BoN-Single|FK|DAS|TreeG|SVDD|BoN-Double (our baseline)|
|-|-|-|-|-|-|-|
|200|188.5±0.1/25.6±0.5|181.4±0.2/26.7±0.3|180.2±0.1/27.8±0.6|182.3±0.1/26.2±0.2|190.2±0.2/22±0.2|**178.4±0.2/29.7±0.3**|
|400|171.5±0.3/31.8±0.3|152.8±0.1/37.5±0.2|156.0±0.1/36.0±0.1|157.2±0.3/34.8±0.5|178.3±0.2/29.5±0.1|**151.2±0.1/37.5±0.2**|
|800|155.7±0.2/35.8±0.3|132.1±0.1/46.5±0.1|133.1±0.2/46.2±0.1|134.2±0.1/45.8±0.2|167.5±0.3/32.8±0.5|**127.8±0.2/49.2±0.3**|

The double verifier consistently outperforms the alternatives, especially at higher compute, confirming its role in the robustness of our gains.

4. **Unified Framework for Joint Local and Global Search Scaling**

Unlike prior works that treat local guidance and global search separately, our unification allows compute to be allocated seamlessly across both, enabling better overall scaling; a minimal sketch of the joint loop is given below.
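As a rough illustration of how the pieces compose, the following sketch interleaves global particle branching/resampling with local Langevin refinement along the reverse trajectory. It assumes the `langevin_local_search` block sketched earlier (passed as `local_refine`), a reverse-diffusion step `denoise_step`, and a `global_verifier`; all names and arguments are illustrative, not our exact implementation.

```python
import torch

def unified_search(x_T, timesteps, denoise_step, local_refine,
                   global_verifier, n_particles=4, local_steps=2):
    """Sketch: joint global (branch/resample) and local (Langevin) search.

    denoise_step(x, t) performs one reverse-diffusion update on a batch of
    particles, local_refine(x, t, n_steps) applies the local-search block
    sketched above, and global_verifier(x, t) scores each particle.
    """
    # x_T is assumed to be a single initial-noise sample with batch dim 1;
    # replicate it into n_particles independent particles.
    particles = x_T.repeat(n_particles, *([1] * (x_T.dim() - 1)))
    for t in timesteps:
        # Global search: denoise every particle, then resample in proportion to
        # verifier scores so compute shifts toward promising branches and
        # low-scoring (or duplicated) particles are pruned.
        particles = denoise_step(particles, t)
        weights = torch.softmax(global_verifier(particles, t), dim=0)
        idx = torch.multinomial(weights, n_particles, replacement=True)
        particles = particles[idx]
        # Local search: a few Langevin refinement steps on the survivors.
        particles = local_refine(particles, t, n_steps=local_steps)
    # Return the highest-scoring particle at the final noise level.
    return particles[global_verifier(particles, timesteps[-1]).argmax()]
```

Using a different verifier inside `local_refine` than for `global_verifier` is where the double-verifier variant plugs in, and the per-step branching and pruning budget is what our adaptive BFS variants tune.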
This joint scaling, combined with the components above, forms the core of the "unification" contribution you asked about.

----

> Q2: $\delta_t$ sensitivity in DFS

Thank you for raising this important concern about the hand-crafted $\delta_t$ threshold in DFS and its potential sensitivity; we agree that robustness evidence is crucial for real-world reliability. To clarify, we designed $\delta_t$ not as a brittle hyperparameter requiring extensive tuning, but as an intuitive knob for controlling the compute-quality tradeoff: higher values encourage more backtracking on tougher instances, adapting compute dynamically without per-task optimization.

In response, we conducted the requested $\delta_t$ sweep on the text-to-image task, reporting performance across a practical range ($\delta_t = 0.5, 0.7, 0.9$). As shown in Table 4, DFS delivers consistent gains over BoN and priors such as FK [4], DAS [2], TreeG [3], and SVDD [1] across thresholds and compute levels, with minimal variance, demonstrating robustness rather than fragility. For instance, even at lower $\delta_t$ (less backtracking), gains persist, while higher values scale quality further at modest compute cost.

**Table 4: Text-to-image, DFS threshold sweep (average score ↑)**

|Compute (NFEs)|BoN|DFS-0.5|DFS-0.7|DFS-0.9|FK|DAS|TreeG|SVDD|
|-|-|-|-|-|-|-|-|-|
|100|0.631±0.001|0.682±0.002|**0.693±0.001**|-|0.635±0.001|0.637±0.002|0.633±0.003|0.630±0.003|
|150|0.675±0.002|0.721±0.001|**0.725±0.002**|0.708±0.001|0.678±0.002|0.683±0.001|0.676±0.003|0.653±0.001|
|200|0.704±0.002|-|**0.737±0.003**|0.733±0.002|0.710±0.002|0.711±0.001|0.709±0.002|0.667±0.001|
|250|0.719±0.002|-|-|**0.748±0.001**|0.721±0.002|0.723±0.002|0.722±0.001|0.679±0.001|

Regarding your comparison to Diffusion Rejection Sampling [7]: while both involve rejection, DFS is distinct in that it uses an off-the-shelf verifier (no extra training needed) and focuses on adaptive scaling for inference efficiency, whereas [7] relies on a trained time-dependent discriminator. We believe a key insight of DFS is that it enables **compute-efficient inference by adapting to instance difficulty and quality requirements**, which we view as broadly valuable for the community.

----

[1] Li, Xiner, et al. "Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding." arXiv preprint arXiv:2408.08252 (2024).

[2] Kim, Sunwoo, Minkyu Kim, and Dongmin Park. "Alignment without Over-optimization: Training-Free Solution for Diffusion Models." arXiv preprint arXiv:2501.05803 (2025).

[3] Guo, Yingqing, et al. "Training-free guidance beyond differentiability: Scalable path steering with tree search in diffusion and flow models." arXiv preprint arXiv:2502.11420 (2025).

[4] Singhal, Raghav, et al. "A general framework for inference-time scaling and steering of diffusion models." arXiv preprint arXiv:2501.06848 (2025).

[5] Zhang, Tao, et al. "T-SCEND: Test-time Scalable MCTS-enhanced Diffusion Model." arXiv preprint arXiv:2502.01989 (2025).

[6] Yoon, Jaesik, et al. "Monte Carlo Tree Diffusion for System 2 Planning." arXiv preprint arXiv:2502.07202 (2025).

[7] Na, Byeonghu, et al. "Diffusion rejection sampling." arXiv preprint arXiv:2405.17880 (2024).