Thank you for your constructive feedback. We agree that a theoretically grounded Langevin MCMC is a powerful approach for improving training-free guidance. Prompted by your feedback, we now include a more rigorous convergence bound and new ablations on the number of recurrence steps, which substantiate our method's efficacy. We also share your excitement about integrating classical search to improve efficiency.
Indeed, we agree that **test-time scaling for diffusion models in decision-making domains is underexplored**. We consider our application to reinforcement learning (RL) a key contribution of this paper, one that we hope will inspire future research and serve as a strong baseline.
We recognize that our initial presentation lacked clarity. Below, we clarify our contributions and support them with new theoretical results and ablations that better demonstrate the advantages of our approach.
> W1: The clarity of the paper is not sufficient for a conference like NeurIPS, and some of the notation is a little loose/inconsistent (and the proofs in C.3 come across a little too informal).
Thank you; we agree. We have revised the paper, adding a dedicated notation section and the formal convergence bound below, which makes rigorous the core theoretical insight: the equivalence between our recurrence and Langevin MCMC.
**Theorem**: Denote $\tilde{q}\_t$ as our target distribution at diffusion timestep $t$. Suppose $\tilde{q}\_t$ has bounded support, is $\alpha$-strongly log-concave and $L$-log-smooth, and $-\nabla^2\log\tilde{q}\_t$ is $M$-Lipschitz. Denote $x\_t^N$ as the sample after $N$ recurrent steps. When the step size $e^{-\lambda\_t}-e^{-\lambda\_{t-1}}<\frac{2}{\alpha+L}$, we can bound the Wasserstein distance between $p(x\_t^N)$ and $\tilde{q}\_t$ as:
$$W\_2(p(x\_t^N),\tilde{q}\_t)\leq\mathcal{O}\left(\sqrt{\lambda\_{t-1}-\lambda\_t}+e^{-\lambda\_t}-e^{-\lambda\_{t-1}}+(1-e^{-\lambda\_t}+e^{-\lambda\_{t-1}})^{N/2}\right),$$
where $\lambda\_t=\log\frac{\alpha\_t^2}{\sigma\_t^2}$ is the log-SNR at timestep $t$.
**Proof sketch**: The recurrence is equivalent to the unadjusted Langevin algorithm (ULA) on a tempered distribution $p^{\text{tempered}}\propto (\tilde{q}\_t)^{1+\mathcal{O}(\lambda\_{t-1}-\lambda\_t)}$. Following Lemma 1 and Lemma 2 in [1], we can bound the ULA discretization error as:
$$W\_2(p(x\_t^N),p^{\text{tempered}})\leq \mathcal{O}\left(e^{-\lambda\_t}-e^{-\lambda\_{t-1}}+(1-e^{-\lambda\_t}+e^{-\lambda\_{t-1}})^{N/2}\right).$$ Since $\tilde{q}\_t$ has bounded support, Proposition 7.10 in [2] gives:
$$W\_2(p^{\text{tempered}},\tilde{q}\_t)\leq\mathcal{O}\left(\sqrt{\text{TV}(p^{\text{tempered}},\tilde{q}\_t)}\right)=\mathcal{O}\left(\sqrt{\lambda\_{t-1}-\lambda\_t}\right).$$
Combining the two bounds via the triangle inequality for $W\_2$,
$$W\_2(p(x\_t^N),\tilde{q}\_t)\leq W\_2(p(x\_t^N),p^{\text{tempered}})+W\_2(p^{\text{tempered}},\tilde{q}\_t),$$
we obtain the desired bound, providing a rigorous foundation for the recurrence strategy grounded in Langevin dynamics.
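For intuition, below is a minimal sketch of $N$ recurrent steps written directly in their equivalent ULA form, with the step size $\eta=e^{-\lambda\_t}-e^{-\lambda\_{t-1}}$ from the theorem; `score_fn` (the score of the guided target $\tilde{q}\_t$) and the function signature are illustrative placeholders rather than our exact implementation:

```python
import math
import torch

def recurrence_as_ula(x_t, score_fn, t, lambda_t, lambda_prev, n_steps):
    """N recurrent steps at diffusion timestep t, in their equivalent
    unadjusted-Langevin (ULA) form on the guided target q~_t.

    lambda_t, lambda_prev: log-SNR at timesteps t and t-1 (lambda_prev > lambda_t).
    The theorem requires eta < 2/(alpha + L) for the strong-log-concavity
    and smoothness constants of log q~_t.
    """
    eta = math.exp(-lambda_t) - math.exp(-lambda_prev)  # step size in the theorem
    for _ in range(n_steps):
        noise = torch.randn_like(x_t)
        # ULA update: x <- x + eta * grad log q~_t(x) + sqrt(2 * eta) * xi
        x_t = x_t + eta * score_fn(x_t, t) + math.sqrt(2.0 * eta) * noise
    return x_t

# The last term of the bound is exactly the geometric rate (1 - eta)^{N/2}
# at which these iterates approach the tempered target.
```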
> W2: Experiments do not stack up against claimed contributions: only the RL experiments have baseline comparisons, and key comparisons to other SOTA guided image generation methods are missing.
Thank you. We agree SOTA comparisons are essential and have now included comparisons against both training-free guidance (TFG [3]) and classifier guidance [4]. We generated 1000 samples per class and computed FID w.r.t. the real images of that class (a sketch of this computation follows the table).
|Method|Compute (NFEs)|FID↓|Acc%↑|
|-|-|-|-|
|Inference-scaling (ours)|1100|97|**65.5**|
|Classifier-guidance|100 (+ classifier training)|**53.5**|59.5|
|TFG|100|142|22|
These results show that our method achieves performance competitive with classifier guidance without any training overhead, and significantly outperforms TFG, the strongest available training-free alternative.
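For concreteness, here is a minimal sketch of the per-class FID computation described above, using `torchmetrics` as one possible implementation (the preprocessing and exact pipeline here are illustrative assumptions):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def per_class_fid(real_images: torch.Tensor, generated_images: torch.Tensor) -> float:
    """FID for one class: generated samples vs. that class's real images.

    Both tensors are uint8 with shape [N, 3, H, W], values in [0, 255].
    """
    fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pool features
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    return fid.compute().item()
```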
> W3: No experiments have error bars/confidence apart from the RL experiment, and so it is hard to judge how effective some of the design choices are (e.g. number of local search recurrent steps)
We agree confidence intervals are crucial and have added them to all key experiments by reporting standard deviations over multiple trials (**4 trials** for Maze with 40 seeds each, **3 trials** for image generation). **Table 1** below now shows that increasing recurrence to 6 or 8 steps consistently improves performance, demonstrating the efficacy of our local search.
> W4.1: A controlled ablation study to empirically demonstrate the efficacy of the recurrent method
Excellent suggestion. We conducted a controlled ablation comparing best-of-N with different numbers of recurrence steps (BoN-i) against prior methods under matched compute on PointMaze and ImageNet (see the budget accounting sketch after Table 2).
**Table 1: PointMaze - recurrence**
|Compute (NFEs)|BoN-0 (success rate % ↑)|BoN-1|BoN-2|BoN-6 (our baseline)|BoN-8|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|-|-|-|
|128|6.875±2.1|3.125±2.72|5.6±2.07|**19.2±2.2**|17.5±2.5|4.375±2.1|5.0±3.1|5.0±3.1|5.0±3.1|
|256|6.25±1.2|7.5±3.1|13.75±2.8|**27.0±3.2**|20.6±1.1|9.375±2.1|8.75±1.25|6.9±2.1|6.875±2.1|
|512|7.5±1.8|10.0±3.1|26.25±2.2|36.7±1.8|**38.1±1.1**|10.0±1.77|15.0±3.25|10.0±1.8|12.5±1.8|
|1024|15.6±2.7|20.6±1.1|41.25±2.2|**58.3±3.2**|50.6±3.2|19.375±3.2|17.625±2.1|26.9±3.2|29.4±3.25|
|2048|25±2.5|35±2.5|53.75±2.8|**81.0±2.5**|71.9±1.1|22.5±2.5|20.675±3.2|37.25±2.2|35.0±1.1|
**Table 2: ImageNet - recurrence**
|Compute (NFEs)|BoN-1 (FID ↓ / Acc % ↑)|BoN-2|FK|DAS|TreeG|SVDD|
|-|-|-|-|-|-|-|
|200|**178.4±0.2/29.7±0.3**|180±0.3/28±0.1|181.4±0.2/26.7±0.3|180.2±0.1/27.8±0.6|182.3±0.1/26.2±0.2|190.2±0.2/22±0.2|
|400|151.2±0.1/37.5±0.2|**146±0.5/43±0.3**|152.8±0.1/37.5±0.2|156.0±0.1/36.0±0.1|157.2±0.3/34.8±0.5|178.3±0.2/29.5±0.1|
|800|127.8±0.2/49.2±0.3|**118±0.1/52.3±0.2**|132.1±0.1/46.5±0.1|133.1±0.2/46.2±0.1|134.2±0.1/45.8±0.2|167.5±0.3/32.8±0.5|
These results show that recurrence alone greatly improves performance compared to prior methods, validating its standalone contribution.
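To make the budget-matching precise, the sketch below gives one plausible NFE accounting for the BoN-i variants; it assumes each recurrence step adds one extra network evaluation per denoising step, which is an illustrative simplification rather than our exact bookkeeping:

```python
def bon_candidates(budget_nfe: int, denoise_steps: int, recurrence_steps: int) -> int:
    # Budget-matched best-of-N: each candidate costs denoise_steps *
    # (1 + recurrence_steps) network evaluations, so a fixed NFE budget
    # supports this many parallel candidates.
    cost_per_candidate = denoise_steps * (1 + recurrence_steps)
    return max(1, budget_nfe // cost_per_candidate)

# Example: with a 1024-NFE budget and 16 denoising steps,
# BoN-0 runs 64 candidates while BoN-6 runs 9.
```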
> W4.2: Clarify the performance of the global search methods independently of the local search methods, where possible
This is an important point. To empirically disentangle the benefits of our global search strategy from the local search, we conducted new ablations evaluating BFS and DFS while holding the local search fixed or disabling it entirely.
**Table 3: BFS in PointMaze (6 recurrent steps fixed)**
|Compute (NFEs)|BFS-Pruning (success rate % ↑)|BFS-Resampling|TreeG|SVDD|FK|DAS|
|-|-|-|-|-|-|-|
|192|**32.5±1.1**|20.0±2.5|18.75±2.8|16.25±3.7|17.5±2.5|18.1±1.1|
|384|**48.1±1.1**|41.25±2.7|38.1±1.1|15.0±2.5|32.5±3.1|36.9±1.1|
|768|**71.25±2.2**|57.75±3.2|52.5±1.8|17.5±4.3|45.75±2.8|50±2.5|
**Table 4: DFS in text-to-image generation (no local search)**
|Compute (NFEs)|BoN (average score ↑)|DFS-0.5|DFS-0.7|DFS-0.9|FK|DAS|TreeG|SVDD|
|-|-|-|-|-|-|-|-|-|
|100|0.631±0.001|0.682±0.002|**0.693±0.001**|-|0.635±0.001|0.637±0.002|0.633±0.003|0.630±0.003|
|150|0.675±0.002|0.721±0.001|**0.725±0.002**|0.708±0.001|0.678±0.002|0.683±0.001|0.676±0.003|0.653±0.001|
|200|0.704±0.002|-|**0.737±0.003**|0.733±0.002|0.710±0.002|0.711±0.001|0.709±0.002|0.667±0.001|
|250|0.719±0.002|-|-|**0.748±0.001**|0.721±0.002|0.723±0.002|0.722±0.001|0.679±0.001|
**Table 5: ImageNet with classifier guidance**
|Compute (NFEs)|BoN (FID ↓ / Acc % ↑)|BFS|DFS|
|-|-|-|-|
|200|58.6±0.2/80.5±0.4|56.6±0.1/81±0.2|**56.2±0.2/90±0.3**|
|400|54.5±0.3/88.5±0.5|**54.2±0.2/92.2±0.3**|-|
These results confirm that BFS and DFS are effective even when used independently of the local search.
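To make the two global strategies concrete, below is a heavily simplified sketch of BFS-with-pruning and threshold-based DFS backtracking; the branching behavior, scoring rule, and retry schedule are illustrative assumptions, not our exact configuration:

```python
import torch

def bfs_pruning_level(candidates, denoise_step, verifier, t, keep_k):
    """One BFS level: advance every candidate by one reverse-diffusion
    step, then prune the frontier to the keep_k best under the verifier."""
    children = denoise_step(candidates, t)           # [K, ...] -> [K, ...]
    scores = verifier(children)                      # [K]
    return children[scores.topk(keep_k).indices]

def dfs_sampling(x, denoise_step, renoise, verifier, num_steps, tau, max_retries=4):
    """Schematic DFS-tau: follow a single trajectory and, whenever the
    verifier score falls below the threshold tau, back up by re-noising
    the sample and retrying the step."""
    for t in reversed(range(num_steps)):
        x_next = denoise_step(x, t)
        retries = 0
        while float(verifier(x_next)) < tau and retries < max_retries:
            x_next = denoise_step(renoise(x_next, t), t)  # backtrack one level
            retries += 1
        x = x_next
    return x
```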
> W5: The related works section seems to only mention literature of tangential interest, and not the most closely related diffusion works, which are discussed in the appendix.
This is an excellent point. We have substantially revised the Related Work section in the main paper to prioritize and properly contextualize the most relevant literature. The discussion now focuses on closely related training-free methods (e.g., TFG) and search-based approaches [5–8].
-----
> Q1: What is the computational cost of the local search compared to competing training free, and trained guidance methods? And do you think any costs compared to non-recurrent methods are “paid back” by e.g. more efficient sampling, etc?
This is a great question. As shown in **Table 1**, our method with 6 recurrence steps outperforms other training-free methods under the same total inference-time compute (NFE).
Compared to classifier-guidance methods, our framework requires no classifier training. Moreover, classifier guidance requires generating target-class samples from the base model, which often adds substantial inference cost.
> Q2: Would it be possible to separately test the global search methods with trained guidance diffusion models on appropriate tasks, e.g. guided image generation?
Excellent suggestion. As shown in our new results in **Table 5**, both BFS and DFS improve classifier-guidance performance. This demonstrates our framework's modularity and its ability to enhance existing state-of-the-art trained guidance methods.
----
[1] Wibisono, Andre. "Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem." Conference on Learning Theory. PMLR, 2018.
[2] Villani, Cédric. Topics in Optimal Transportation. Vol. 58. American Mathematical Society, 2021.
[3] Ye, Haotian, et al. "TFG: Unified training-free guidance for diffusion models." NeurIPS 2024.
[4] Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat GANs on image synthesis." NeurIPS 2021.
[5] Kim, Sunwoo, Minkyu Kim, and Dongmin Park. "Alignment without Over-optimization: Training-Free Solution for Diffusion Models." arXiv preprint arXiv:2501.05803 (2025).
[6] Li, Xiner, et al. "Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding." arXiv preprint arXiv:2408.08252 (2024).
[7] Guo, Yingqing, et al. "Training-free guidance beyond differentiability: Scalable path steering with tree search in diffusion and flow models." arXiv preprint arXiv:2502.11420 (2025).
[8] Singhal, Raghav, et al. "A general framework for inference-time scaling and steering of diffusion models." arXiv preprint arXiv:2501.06848 (2025).