Thank you for your thoughtful feedback. We are encouraged that you share our excitement for this research direction. We strongly agree that a theoretically grounded local search can unlock new scaling possibilities for generative models, and extending these ideas to broader domains such as reinforcement learning is a promising future direction. In the detailed response below, we address your concerns with theoretical clarifications, new empirical results, and direct comparisons against the suggested baselines. We are confident these additions better highlight the novelty and strengths of our work.

> W1: Limited novelty.

We highlight the novelty of our contributions as follows:

* A novel and scalable local search algorithm.
* An improved and unified BFS.
* An adaptive and efficient DFS.
* A unified framework for jointly scaling local and global search.
* Task-specific designs across multiple domains.

We believe a more detailed, point-by-point elaboration of our methods, supplemented with new experiments, will address your primary concerns.

> W1.1: Local search is similar to classifier guidance, and could possibly hinder the efficiency.

We respectfully argue that our local search with Langevin MCMC differs significantly from classifier guidance, both theoretically and empirically.

* **Theoretical Distinction from Classifier Guidance**: In classifier guidance [1], a trained time-conditional classifier $f_t(x_t) = \mathbb{E}[f(x_0) \mid x_t]$ is required to compute conditional scores for guided diffusion sampling, incurring significant additional training cost. To overcome this limitation, we introduce a theoretically grounded Langevin MCMC–based recurrence method that enables **training-free, asymptotically exact sampling** from a properly tempered posterior. Unlike classifier guidance, our approach is based on Langevin MCMC rather than conditional diffusion sampling, a fundamental conceptual and methodological distinction. Our local search scales test-time compute by increasing the number of recurrence steps, which greatly improves efficiency, as shown in **Table 1** below (an illustrative sketch of the recurrence update follows the table).
* **Empirical Efficiency and Justification for Recurrence Cost**: We recognize that recurrence introduces additional sampling steps; however, its cost is fully accounted for in our compute budgets. Importantly, we benchmark best-of-N sampling with recurrence (BoN-$i$, where $i$ is the number of recurrence steps) against strong baselines under identical total compute budgets.

**Table 1: PointMaze – recurrence (success rate %, ↑)**

|Compute (NFEs)|BoN-0|BoN-1|BoN-2|BoN-6 (our baseline)|BoN-8|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|-|-|-|
|128|6.875±2.1|3.125±2.72|5.6±2.07|**19.2±2.2**|17.5±2.5|4.375±2.1|5.0±3.1|5.0±3.1|5.0±3.1|
|256|6.25±1.2|7.5±3.1|13.75±2.8|**27.0±3.2**|20.6±1.1|9.375±2.1|8.75±1.25|6.9±2.1|6.875±2.1|
|512|7.5±1.8|10.0±3.1|26.25±2.2|36.7±1.8|**38.1±1.1**|10.0±1.77|15.0±3.25|10.0±1.8|12.5±1.8|
|1024|15.6±2.7|20.6±1.1|41.25±2.2|**58.3±3.2**|50.6±3.2|19.375±3.2|17.625±2.1|26.9±3.2|29.4±3.25|
|2048|25±2.5|35±2.5|53.75±2.8|**81.0±2.5**|71.9±1.1|22.5±2.5|20.675±3.2|37.25±2.2|35.0±1.1|

Overall, recurrence offers **substantial gains** that more than justify its modest cost, making it an effective strategy for inference scaling.
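For concreteness, the following minimal PyTorch sketch illustrates the recurrence-style local search described above: at a given diffusion time, a few Langevin MCMC steps follow the score of the tempered posterior (model score plus a scaled verifier-reward gradient). This is an illustration only; `eps_model`, `reward`, `alpha_bar_t`, and all hyperparameters are placeholder names rather than our actual implementation, and practical refinements such as annealed step sizes are omitted.

```python
import torch

def langevin_local_search(x_t, t, eps_model, reward, alpha_bar_t,
                          n_steps=6, step_size=1e-3, temperature=1.0):
    """Illustrative recurrence-style local search at diffusion time t.

    Runs `n_steps` Langevin MCMC updates on x_t targeting the tempered
    posterior p_t(x_t) * exp(temperature * reward(x0_hat)), where x0_hat
    is the model's prediction of the clean sample.
    """
    sqrt_ab = alpha_bar_t ** 0.5
    sqrt_1m_ab = (1.0 - alpha_bar_t) ** 0.5

    x = x_t.detach()
    for _ in range(n_steps):
        x = x.requires_grad_(True)
        eps = eps_model(x, t)                      # noise prediction (reused below)
        score = -eps / sqrt_1m_ab                  # approx. grad_x log p_t(x)
        x0_hat = (x - sqrt_1m_ab * eps) / sqrt_ab  # predicted clean sample x_{0|t}
        grad_r = torch.autograd.grad(reward(x0_hat).sum(), x)[0]

        with torch.no_grad():                      # Langevin step on the tempered posterior
            drift = score + temperature * grad_r
            x = x + 0.5 * step_size * drift + step_size ** 0.5 * torch.randn_like(x)
    return x.detach()
```

Each call to `eps_model` inside the loop counts toward the NFE budgets reported in Table 1, so the comparison above already charges the recurrence for its extra model evaluations.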
> W1.2: The introduced global search has been widely studied in previous diffusion-aligned papers, which is more commonly known as particle sampling. The introduced BFS search methods can be viewed as variants of particle sampling.

We appreciate this observation and agree that our global search shares high-level similarities with particle-based methods. However, we emphasize that our BFS-Resampling and BFS-Pruning strategies include several key innovations that advance beyond existing approaches:

* **Adaptive Branching**: Unlike prior methods such as SVDD, which allocate a fixed particle fan-out, we introduce dynamic branching based on real-time scores and budget constraints. This allows a more efficient exploration–exploitation tradeoff.
* **Duplication Avoidance**: Existing methods often suffer from redundant sampling when drawing multiple children from the same parent, especially with deterministic samplers. Our pruning strategy explicitly removes low-scoring particles, improving compute efficiency.

**Table 2: PointMaze Ultra – BFS Variants vs. Particle Sampling (success rate %, ↑)**

|Compute (NFEs)|BFS-Pruning|BFS-Resampling|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|
|192|**32.5±1.1**|20.0±2.5|18.75±2.8|16.25±3.7|17.5±2.5|18.1±1.1|
|384|**48.1±1.1**|41.25±2.7|38.1±1.1|15.0±2.5|32.5±3.1|36.9±1.1|
|768|**71.25±2.2**|57.75±3.2|52.5±1.8|17.5±4.3|45.75±2.8|50.0±2.5|

These improvements yield consistent performance gains, showing that our BFS variants are principled extensions that outperform prior particle-sampling methods.

> W1.3: The DFS search method can be viewed as a variant of SoP [4].

We respectfully argue that our DFS method is fundamentally different from SoP [4]:

1. **Global Backtracking vs. Local Variation**: DFS uses strong noise injection ($\Delta_\sigma > 20$) for global exploration, whereas SoP performs only mild perturbation ($\Delta_\sigma = 0.78$), limited to local neighborhoods.
2. **Adaptive Scheduling**: SoP applies perturbation and branching on a fixed schedule, whereas DFS employs score-aware, dynamic scheduling tailored to both verifier feedback and resource constraints. This allows DFS to allocate compute more efficiently.
3. **Empirical Superiority**: SoP has been shown to underperform best-of-N sampling in several scenarios ([4], Table 2). In contrast, our DFS strategy consistently improves both quality and efficiency, as shown below.

**Table 3: Text-to-Image – DFS vs. Baselines (average score ↑)**

|Compute (NFEs)|BoN|DFS|FK|DAS|TreeG|SVDD|
|-|-|-|-|-|-|-|
|100|0.631±0.001|**0.693±0.001**|0.635±0.001|0.637±0.002|0.633±0.003|0.630±0.003|
|150|0.675±0.002|**0.725±0.002**|0.708±0.001|0.678±0.002|0.683±0.001|0.676±0.003|
|200|0.704±0.002|**0.737±0.003**|0.710±0.002|0.711±0.001|0.709±0.002|0.667±0.001|
|250|0.719±0.002|**0.748±0.001**|0.721±0.002|0.723±0.002|0.722±0.001|0.679±0.001|

> W2: Limited baselines. Only best-of-N is compared in the main experiments. I encourage the authors to compare the proposed method with beam search or particle sampling [3].

Thank you for the helpful suggestion. Our updated experiments include the strong baselines FK [3], DAS [2], SVDD, and TreeG. These results (Tables 1–3) show that our methods consistently outperform these baselines under comparable compute budgets.

----

> Q1: I observed that the verifier takes the clean $x_0$ as input. Thereby, what is the computational cost of obtaining rewards at each denoising step? Does this rewarding mechanism require denoising $x_t\rightarrow x_0$ at each denoising step?

The verifier operates on $x_{0|t}$, the predicted clean sample at each step, which the diffusion model already computes during denoising; no explicit reverse process or extra denoising is needed, as sketched below.
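To illustrate why no extra denoising is required, here is a minimal sketch under a standard DDPM/VP parameterization; `eps_model`, `verifier`, and `alpha_bar_t` are placeholder names rather than our actual interface, and in practice the noise prediction is simply reused from the sampler's current step rather than recomputed.

```python
import torch

@torch.no_grad()
def score_intermediate(x_t, t, eps_model, verifier, alpha_bar_t):
    """Score a partially denoised sample without any extra denoising.

    x_{0|t} is recovered in closed form from the same noise prediction the
    sampler already computes at step t; the only added cost is the verifier
    forward pass.
    """
    eps = eps_model(x_t, t)                                              # already computed by the sampler
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5
    return verifier(x0_pred)                                             # reward / score on x_{0|t}
```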
In addition, our verifiers are lightweight, so their compute is negligible relative to that of the diffusion model.

> Q2: For the results in Table 1, the paper claims that the pre-trained diffusion policy and ground-truth Q-function from QGPO are leveraged for test-time scaling. However, why are the results of TTS lower than QGPO?

The key distinction lies in QGPO's use of an **additional learned, noise-dependent energy function**, trained via contrastive learning alongside the pretrained diffusion policy and the ground-truth Q-function:

$$\varepsilon_t(s, a_t) = \log \mathbb{E}_{\mu_{0t}(a_0 \mid a_t, s)}\left[e^{\beta Q(s, a_0)}\right].$$

This introduces substantial extra training complexity. In contrast, our test-time scaling method uses **only the pretrained Q-function**, which is training-free and allows composition of various off-the-shelf foundation models. This offers a more practical and modular approach, especially valuable in low-data settings such as robotics.

> Q3: Why can double verifiers mitigate reward hacking and improve computational efficiency? Will it increase computational cost? Does the paper include OOD experiments?

The double-verifier strategy addresses a key issue: adversarial reward hacking, as discussed in [5]. Training-free guidance can produce samples that exploit the weaknesses of a single verifier, but such adversarial examples often fail to transfer between independently designed verifiers. We therefore introduce a separate global verifier for cross-verification. This simple yet effective mechanism reduces reward hacking and improves robustness.

Regarding efficiency, the global verifier is applied only sparsely (e.g., 4 times over 100 steps on ImageNet), adding minimal compute overhead.

We evaluate OOD behavior using the MSP score [6], which measures the maximum predicted class probability:

$$\text{MSP}(x)=\max_c p(c \mid x).$$

|Compute|BoN-single (MSP ↑)|BoN-double (MSP ↑)|
|-|-|-|
|400|0.161|0.164|
|800|0.165|0.184|

Higher MSP scores for BoN-double indicate improved robustness and less OOD generation, confirming the benefit of double verifiers.

----

[1] Dhariwal, Prafulla, and Alexander Nichol. "Diffusion Models Beat GANs on Image Synthesis." arXiv preprint arXiv:2105.05233 (2021).

[2] Kim, Sunwoo, Minkyu Kim, and Dongmin Park. "Alignment without Over-optimization: Training-Free Solution for Diffusion Models." arXiv preprint arXiv:2501.05803 (2025).

[3] Singhal, Raghav, et al. "A General Framework for Inference-Time Scaling and Steering of Diffusion Models." arXiv preprint arXiv:2501.06848 (2025).

[4] Ma, Nanye, et al. "Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps." arXiv preprint arXiv:2501.09732 (2025).

[5] Shen, Yifei, et al. "Understanding and Improving Training-free Loss-based Diffusion Guidance." arXiv preprint arXiv:2403.12404 (2024).

[6] Hendrycks, Dan, and Kevin Gimpel. "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks." arXiv preprint arXiv:1610.02136 (2016).