# Summary of NeurIPS Response

# Response to Reviewer Z6e3

We sincerely thank the reviewer for their detailed and constructive feedback. Your thoughtful questions and insightful critiques have greatly helped us clarify the scope, strengthen the technical rigor, and improve the interpretability of our work. Below, we provide point-by-point responses to each of your questions and identified weaknesses.

## Question 1. Clarify the scope of applicability beyond node regression tasks

The proposed method is evaluated exclusively on node regression problems. Could the authors comment on whether QpiGNN can be extended to other tasks common in GNNs, such as node classification, link prediction, or graph-level regression? If not, what are the main limitations preventing such extensions? Clarifying this could improve the significance and generalizability of the work.

> **Weakness 1.** Additionally, the method focuses on node regression but does not extend to other graph tasks (e.g., graph-level prediction or link prediction), which may limit its broader applicability.

> **Weakness 3.** This work addresses a practically important problem (i.e., uncertainty quantification in GNNs) without relying on quantiles, bootstrapping, or calibration, which are limitations in prior work. The results show consistent improvements in both interval coverage and compactness, which has clear implications for risk-sensitive applications. However, its impact is somewhat constrained to node regression tasks, and it would be valuable to see whether the approach generalizes to classification tasks or other GNN applications, which are common in the field.

### Response to Question 1 (Weaknesses 1 & 3): On Broader Applicability Beyond Node-level Regression

We appreciate the reviewer's question on the broader applicability of QpiGNN. Although our current study focuses on node-level regression to ensure clarity and depth of evaluation, the proposed framework—especially the dual-head architecture and joint loss—is readily extensible to other GNN tasks involving continuous-valued outputs.

For graph-level regression, QpiGNN can be applied to pooled graph representations such as $\mathbf{h}_G = \rho(\text{GNN}(\mathbf{X}, E))$, where $\rho$ denotes a pooling function. The dual-head outputs are computed as $\hat{y}_G = \mathbf{w}_{\text{pred}}^\top \mathbf{h}_G$ and $\hat{d}_G = \text{Softplus}(\mathbf{w}_{\text{diff}}^\top \mathbf{h}_G)$, yielding the interval $[\hat{y}_G - \hat{d}_G,\ \hat{y}_G + \hat{d}_G]$ without altering the architecture or loss. For link prediction, we construct edge embeddings, e.g., $\mathbf{h}_{uv} = \phi([\mathbf{h}_u \parallel \mathbf{h}_v \parallel \mathbf{h}_u \odot \mathbf{h}_v])$, and apply the same dual-head mechanism to obtain calibrated predictions $\hat{y}_{uv} \pm \hat{d}_{uv}$.

While not directly applicable to classification tasks, the quantile-free, dual-head paradigm could be adapted to classification (e.g., via predictive sets or confidence margins), which we consider a promising future direction. We will clarify these extensions in the revised manuscript to emphasize that QpiGNN is not limited to node-level regression, and its design remains broadly applicable to structured prediction tasks.
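To make the graph-level extension concrete, here is a minimal PyTorch sketch of the dual-head readout described above, assuming mean pooling for $\rho$ and a generic GNN encoder; the class and argument names are illustrative, not our actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadGraphRegressor(nn.Module):
    """Dual-head readout on a pooled graph representation h_G (sketch)."""

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone                 # any GNN encoder -> [N, hidden_dim]
        self.w_pred = nn.Linear(hidden_dim, 1)   # interval center  y_hat
        self.w_diff = nn.Linear(hidden_dim, 1)   # half-width       d_hat

    def forward(self, x, edge_index, batch):
        h = self.backbone(x, edge_index)         # node embeddings
        num_graphs = int(batch.max()) + 1
        # rho: mean pooling of node embeddings per graph
        h_g = torch.zeros(num_graphs, h.size(1), device=h.device)
        h_g.index_add_(0, batch, h)
        h_g = h_g / torch.bincount(batch, minlength=num_graphs).unsqueeze(1)
        y_hat = self.w_pred(h_g).squeeze(-1)              # point prediction
        d_hat = F.softplus(self.w_diff(h_g)).squeeze(-1)  # positive half-width
        return y_hat - d_hat, y_hat + d_hat               # [lower, upper]
```

The same two heads applied to the edge embedding $\mathbf{h}_{uv}$ would give the link-level intervals.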
## Question 2. Provide diagnostic analysis to explain empirical gains

While QpiGNN shows strong empirical results, the paper lacks an analysis of why it outperforms baselines in practice. Are there specific characteristics of the datasets, model architectures, or graph structures that favor QpiGNN's approach? A deeper analysis of where and why the method succeeds or fails would strengthen the reader's confidence in the empirical results and improve the paper's technical depth.

> **Weakness 1.** The technical quality of the paper is generally strong. The proposed method is well-motivated and theoretically grounded, and the experimental evaluation is comprehensive, spanning 19 datasets with strong results against multiple baselines. However, one potential weakness is that the paper does not thoroughly explore why QpiGNN outperforms prior methods in specific cases; there is limited diagnostic analysis of failure modes or sensitivity to hyperparameters.

### Response to Question 2 (Weakness 1): When and Why QpiGNN Works Well

We thank the reviewer for highlighting this important point. Identifying when and why QpiGNN performs well is essential for interpretability and technical clarity. Across 19 datasets, we find that QpiGNN consistently outperforms baselines under three key conditions:

1. In real-world graphs with heteroskedastic or structured noise (e.g., Anaheim, Chicago, Education), QpiGNN adapts interval widths to local uncertainty. Competing methods like SQR or MC Dropout fail to capture such variation, resulting in under-coverage or over-smoothed intervals.
2. In sparse or irregular graphs (e.g., Twitch), QpiGNN produces tighter and better-calibrated intervals due to its robustness against noisy neighbor aggregation. The dual-head architecture helps decouple uncertainty from unstable message passing.
3. On synthetic graphs with high intrinsic variance (e.g., Tree, Power), QpiGNN maintains high PICP (≥ 0.90) and compact intervals (MPIW ≈ 1.0) under moderate $\lambda$, showing strong performance even in stochastic settings.

Ablation studies (Tab. 2, Fig. 8) support these findings: removing the dual-head or joint loss significantly degrades the calibration–width trade-off. Coverage-only models yield overly wide intervals, while width-only objectives collapse coverage. This confirms that both components are essential to QpiGNN's effectiveness.

We also examined sensitivity and robustness. As shown in Fig. 5, QpiGNN remains stable across various $\lambda$ values and coverage targets. Although coverage drops beyond $\lambda = 0.57$, the trade-off is predictable. Under perturbations like edge dropout, feature corruption, and target noise (Fig. 6), QpiGNN retains high calibration (PICP ≥ 0.97) and low MPIW, while other methods degrade quickly. This highlights QpiGNN's resilience to structural and label noise.

We will include a new subsection summarizing dataset-specific trends and extend the discussion in Section 4.2 and Appendix D. These additions, prompted by the reviewer's suggestion, will improve the technical depth and diagnostic clarity of the paper.

## Question 3. Discuss hyperparameter sensitivity and training stability

The proposed coverage-width loss involves balancing coverage and interval compactness. Could the authors elaborate on how sensitive the performance is to this balance, and whether any tuning is required for different datasets? Providing insights into robustness and practical usability (e.g., default settings or guidelines) would strengthen the quality and reproducibility of the work.

### Response to Question 3: Sensitivity to $\lambda$ and Robustness of the Coverage-Width Loss

We appreciate the reviewer's thoughtful question regarding the sensitivity and robustness of our proposed coverage–width loss, particularly the balance controlled by the penalty coefficient $\lambda_{\text{width}}$.
**(1) Sensitivity and Trade-off Behavior:** As shown in Fig. 4(b), the performance of QpiGNN is indeed sensitive to $\lambda_{\text{width}}$, which directly governs the trade-off between calibration and interval compactness. Specifically, lower values (e.g., $\lambda = 0.1$) lead to nearly perfect coverage (PICP $\geq$ 0.98) but result in overly wide intervals (higher MPIW), while higher values (e.g., $\lambda > 0.57$) cause a sharp collapse in coverage—indicating a tipping point in the calibration–sharpness trade-off. We emphasize, however, that this sensitivity follows a predictable and interpretable pattern, making it feasible to tune or even set heuristically in practice.

**(2) Default Setting and Practical Usability:** In our main experiments, we used a default value of $\lambda_{\text{width}} = 0.5$ across all 19 datasets (unless otherwise noted). This setting achieved competitive PICP–MPIW trade-offs on 16 out of 19 datasets, and it strikes a reasonable balance for general use. Thus, while some tuning can further improve performance, QpiGNN remains practically usable without exhaustive hyperparameter search.

**(3) Dataset-specific Tuning with Bayesian Optimization:** To further investigate the effect of $\lambda$ across tasks, we conducted dataset-wise tuning via Bayesian optimization (Appendix I; Tab. 8, Fig. 12). The optimal values range between 0.2 and 0.5, depending on structural and statistical characteristics of the dataset. For instance: (1) synthetic datasets (e.g., *Tree*, *Edge*, *Grid*) favor stronger regularization ($\lambda \geq 0.4$) to prevent overly conservative intervals; (2) real-world datasets vary more: high-noise or heterophilic graphs (e.g., *Chameleon*, *Squirrel*) perform better with smaller values (~0.22–0.28), while more structured or low-noise graphs (e.g., *Crocodile*) prefer higher $\lambda$. This analysis shows that no single $\lambda$ fits all, but the optimal range is relatively narrow and consistent, enabling efficient tuning with minimal overhead.

**(4) Robustness to Training Perturbations:** Finally, our robustness analysis (Fig. 6) demonstrates that QpiGNN maintains stable calibration (PICP ≥ 0.96) and compact intervals under various training perturbations—feature noise, edge dropout, and target noise—highlighting the model's resilience to real-world uncertainties, regardless of precise $\lambda$ settings.

We acknowledge that $\lambda_{\text{width}}$ is key to balancing calibration and sharpness. Our results show that QpiGNN behaves predictably with respect to $\lambda$, performs well with a default setting (e.g., 0.5), and can be efficiently tuned via Bayesian optimization. Moreover, it remains robust under noise even without fine-tuning. We will clarify these findings and practical guidelines in the revised manuscript.
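For readers who want to reproduce the dataset-wise tuning in point (3), the sketch below shows one way to run it with scikit-optimize's Gaussian-process search; `train_and_eval` is a hypothetical helper assumed to train QpiGNN with a given width penalty and return (PICP, MPIW), and the penalty weighting in the objective is illustrative.

```python
from skopt import gp_minimize
from skopt.space import Real

TARGET = 0.90  # desired coverage level (1 - alpha)

def objective(params):
    lam = params[0]
    picp, mpiw = train_and_eval(lambda_width=lam)  # hypothetical training helper
    # Penalize under-coverage first, then prefer narrow intervals.
    return 10.0 * max(0.0, TARGET - picp) + mpiw

# Search the narrow range observed in Appendix I.
result = gp_minimize(objective, [Real(0.1, 0.6, name="lambda_width")], n_calls=20)
print("best lambda_width:", result.x[0])
```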
## Question 4. Improve theoretical clarity and provide more intuition

The theoretical properties of asymptotic coverage and width optimality are important but may be hard to follow for non-experts. Could the authors improve clarity by summarizing the intuition behind the theory or adding a visual illustration of how the loss function behaves during training? This would enhance accessibility and clarity for a broader audience.

> **Weakness 2.** The paper is mostly well-written and easy to follow, especially in its explanation of the dual-head architecture and the coverage-width loss. That said, some parts of the theoretical analysis could benefit from clearer intuition and more visual aids to guide readers unfamiliar with uncertainty estimation. Also, while the architecture and loss design are described clearly, details around implementation (e.g., architecture-specific choices, how CWL is balanced during training) could be more explicitly discussed.

### Response to Question 4 (Weakness 2): Clarifying Theoretical Motivation

We will revise Section 3.3 to include an intuitive summary of our theoretical guarantees. Specifically, we will clarify that the proposed loss function acts as a Lagrangian relaxation of a constrained optimization problem, aiming to satisfy the coverage target while minimizing interval width. This allows the model to initially prioritize calibration and then gradually focus on tightening intervals—a behavior we visualize in Fig. 3 as a sequential convergence trajectory.

We also agree that non-expert readers may benefit from additional visual aids. In the revised version, we will incorporate a schematic diagram that illustrates how the two loss components (coverage vs. width) interact during training, along with a brief explanation of each design choice (e.g., softplus activation for positivity, dual-head structure for decoupling). These revisions will improve theoretical intuition and practical interpretability without changing the core results.
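One way to make the Lagrangian view explicit, in the notation used elsewhere in this response (a sketch, not the manuscript's exact statement):

$$
\min_{\theta}\ \mathbb{E}\big[\hat{y}^{\text{up}}_v - \hat{y}^{\text{low}}_v\big]
\quad \text{s.t.} \quad
\mathbb{E}\big[\ell_{\text{viol}}(y_v;\ \hat{y}^{\text{low}}_v, \hat{y}^{\text{up}}_v)\big] = 0,
$$

whose relaxation with multiplier $\lambda_{\text{width}}$ (after rescaling) recovers the joint objective $\ell_{\text{viol}} + \lambda_{\text{width}} \cdot \mathbb{E}\big[\hat{y}^{\text{up}}_v - \hat{y}^{\text{low}}_v\big]$: while violations dominate, the gradient pushes the bounds outward toward calibration, and once violations vanish the width term takes over and tightens the intervals.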
## Question 5. Justify architectural choices in the dual-head design

The dual-head architecture is a central part of QpiGNN, but the paper does not fully explain why this design was chosen over alternatives, such as modeling the mean and variance jointly or using existing uncertainty layers. Could the authors clarify whether they explored other architectural variations and how the current design contributes to the observed performance? This would help assess the originality and robustness of the proposed approach.

> **Weakness 4.** The combination of a quantile-free dual-head architecture with a joint coverage-width loss represents a novel contribution in the context of GNNs. While similar ideas have appeared in non-graph settings (e.g., in deep ensembles or conformal prediction), applying and adapting these to graph-based uncertainty estimation is a clear innovation. Still, the core novelty lies in the loss design and architecture adaptation, rather than in fundamentally new theoretical tools or GNN structures.

### Response to Question 5 (Weakness 4): Motivation for the Dual-head Design

We designed QpiGNN with a dual-head architecture to address a key limitation of single-head models in GNNs—namely, the entanglement of prediction and uncertainty estimation. This entanglement, exacerbated by message passing, often leads to unstable or poorly calibrated intervals, especially under node-level heteroscedasticity. By separating the prediction head ($\hat{y}$) and uncertainty head ($\hat{d}$), our architecture enables independent learning of accuracy and interval width. This structural decoupling improves calibration and robustness, particularly in noisy or irregular graphs.

We evaluated several alternatives—quantile regression (e.g., RQR-W), variance-based heads, and shared-output models with post-hoc calibration—but found them inadequate due to interval overlap, calibration collapse, poor compactness, or the need for external supervision. Ablation studies across nine synthetic datasets (Tab. 2, Fig. 8) confirm that the dual-head design consistently yields better coverage–width trade-offs. Our approach aligns with recent findings in heteroscedastic and Bayesian deep learning, where structural decoupling enhances both interpretability and performance. We will clarify this design rationale in the revised manuscript to emphasize its necessity and empirical advantage.

## Paper Formatting Concerns

The paper does not appear to include a clearly labeled "Limitations" section, which is encouraged by NeurIPS to improve transparency and ethical reflection.

### Response to Formatting Concerns: Missing "Limitations" Section

We acknowledge the formatting concern. While limitations are already discussed under the heading "Limitations and Practical Considerations" at the end of Section 3 (p. 6), we will revise the paper to present this content in a clearly marked, standalone "Limitations" section. This will improve clarity and align with NeurIPS formatting guidelines.

# Response to Reviewer reFF

We thank the reviewer for the thoughtful and constructive feedback. Your comments have helped us better articulate the theoretical assumptions, clarify design motivations, and improve the transparency of our methodology and evaluation. Below, we provide structured responses addressing each weakness and question in order. We will incorporate all relevant clarifications and improvements in the revised version of the manuscript. (See also our response to Reviewer 94Lh, Major Weakness 2, which further clarifies Theorem 1.)

## Response to Weaknesses

### Weakness 1

While the theoretical analysis provides useful insights, the coverage consistency proof (Theorem 1) relies on the assumption of "bounded and independent noise across nodes," which may not hold in graph settings where message passing inherently creates dependencies between node predictions.

#### Response to Weakness 1: Independence Assumption in Theoretical Analysis

We thank the reviewer for raising an important concern regarding the independence assumption used in Theorem 1. While the proof of asymptotic coverage consistency assumes bounded and independent noise across nodes, we clarify below why this assumption remains practically valid and how our method remains robust under the mild dependencies introduced by message passing in GNNs.

First, the independence assumption in Theorem 1 applies specifically to the target noise $\varepsilon_v = y_v - f(x_v)$, rather than to the predictions themselves. In practice, label noise—such as measurement error or labeling uncertainty—is often naturally independent across nodes, even though GNNs may introduce dependencies through neighborhood aggregation during representation learning.

Importantly, Theorem 1 is not intended as a novel theoretical contribution, but as a principled explanation for the empirically observed stability in coverage. It leverages the WLLN to show that the empirical coverage $\hat{c}$ converges to the target level $1 - \alpha$ under mild regularity and convergence conditions. While the theorem mentions nodewise independence, this assumption is better understood as bounded and weak dependence, which is commonly sufficient for applying the WLLN and concentration inequalities in graph-based models. For instance, in sparse graphs such as ER graphs where the average degree is $\mathcal{O}(\log N)$, the influence of a single node on distant nodes diminishes rapidly. In GraphSAGE with mean aggregation, each neighbor contributes at most $\mathcal{O}(1/\deg(v))$ per layer, leading to a total influence of $\mathcal{O}(1/N)$. Even after $k$ message-passing layers, the cumulative effect remains bounded by $\mathcal{O}(k/N)$—supporting the validity of the weak-dependence approximation in many practical settings.
Second, as detailed in Appendix B.5, the bounded-difference condition required for applying McDiarmid's inequality approximately holds in the presence of localized message passing. Specifically, modifying a single node's input or label changes the global coverage $\hat{c}$ by at most $\frac{1}{N} + \delta_G$, where $\delta_G$ is a small constant depending on the graph structure. This ensures that finite-sample concentration inequalities such as McDiarmid or Hoeffding bounds remain applicable with minor adjustments.

Third, our convergence results are grounded in classical stochastic approximation theory and the WLLN, both of which tolerate weak dependencies as long as they decay with distance. Empirically, our model maintains tight coverage and compact intervals across 19 diverse datasets, including graphs with structural noise and heterogeneity—demonstrating robustness even under non-ideal conditions.

Finally, while our theoretical coverage guarantee assumes idealized conditions (e.g., symmetric distributions, bounded noise), we empirically demonstrate that QpiGNN maintains validity and compactness even under skewed or heavy-tailed targets—highlighting its robustness beyond theoretical assumptions. Extending the coverage theory to formally incorporate dependency structures (e.g., via graph mixing or dependency graphs) is an exciting direction we plan to pursue in future work.

### Weakness 2

The paper would benefit from deeper theoretical or empirical justification for why this specific design is optimal for graph settings compared to standard heteroscedastic regression approaches.

#### Response to Weakness 2: Justification for Dual-Head Architecture

We thank the reviewer for this valuable observation. Our choice of a dual-head design in QpiGNN is motivated by the unique challenges of uncertainty quantification in graph-structured data, where standard heteroscedastic regression approaches may fall short.

**(1) Limitations of Standard Approaches:** Conventional heteroscedastic regression models often predict both the mean and variance from a shared representation, assuming i.i.d. inputs. However, in graph settings, node features are often correlated due to message passing, and uncertainty varies with structural properties (e.g., node degree, homophily, community boundaries).

**(2) Motivation for the Dual-Head Design:** QpiGNN separates the prediction of the central tendency (mean) and the uncertainty (interval width), allowing each head to capture distinct aspects of the data. The interval width head can be more sensitive to local graph variability, while the mean head focuses on prediction accuracy. This separation enhances calibration and sharpness by optimizing each component with appropriate losses.

**(3) Empirical Justification:** As shown in Table 1 and Figure 6, QpiGNN consistently outperforms baseline heteroscedastic methods (e.g., Deep Ensemble, MC Dropout) across 19 datasets, achieving tighter prediction intervals and better coverage even under structural noise or perturbations.

**(4) Future Directions:** While our current justification is empirical and architectural, we acknowledge the importance of formal theoretical analysis. We plan to extend the theoretical framework to further ground the effectiveness of this design in graph-based UQ tasks.

The dual-head design in QpiGNN is a deliberate choice tailored to the graph domain, offering practical advantages in both prediction accuracy and uncertainty estimation robustness.
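To spell out the contrast in point (1): the standard heteroscedastic objective is the Gaussian negative log-likelihood over a shared representation,

$$
\mathcal{L}_{\text{hetero}} = \frac{1}{N} \sum_{v} \left[ \frac{\big(y_v - \hat{\mu}_v\big)^2}{2\hat{\sigma}_v^2} + \frac{1}{2}\log \hat{\sigma}_v^2 \right],
$$

in which the gradient of the mean is reweighted by $1/\hat{\sigma}_v^2$, so errors in the variance head directly distort the fit of the mean. The dual-head coverage–width loss avoids this coupling by training $\hat{y}_v$ and $\hat{d}_v$ with separate, additively combined terms. (This textbook form is given for illustration; it is not a formula from our paper.)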
### Weakness 3

Some experimental results raise questions about baseline implementations. For instance, Table 1 shows that several baseline methods achieve coverage rates substantially different from the target 0.90 (BayesianNN consistently achieves 1.00, while SQR often undercovers), which may indicate implementation issues or suboptimal hyperparameter settings.

#### Response to Weakness 3: Baseline Implementation Fidelity

We thank the reviewer for raising this important point. We have carefully reviewed our baseline implementations and confirm that all methods, including BayesianNN and SQR, are faithfully implemented according to standard practices and prior work.

**(1) BayesianNN (Coverage ≈ 1.00):** Our implementation uses a stochastic forward pass through a Bayesian linear layer (`BayesianLinear` in `model.py`). Prediction intervals are constructed using the empirical mean and standard deviation over `num_samples=100`, with a Gaussian quantile multiplier (e.g., 1.645 for 90% coverage). The consistently high coverage (close to 1.00) is not an implementation issue, but rather reflects overestimated epistemic uncertainty, which is common in sparse or noisy graphs.

**(2) SQR (Often Under-Covering):** During training, we randomly sample quantile levels $\tau \in [0, 1]$, while at inference we fix the interval bounds (e.g., 0.05 and 0.95). This mismatch between training and evaluation quantiles can lead to under-coverage. Future work may address this through quantile-aware training curricula or post-hoc calibration strategies.

**(3) Advantages of Our Method (GQNN):** Our model learns prediction intervals directly through a global coverage–width loss, without relying on sampling or fixed quantile assumptions. This makes our method more robust to implementation details and consistently delivers valid coverage across datasets.

We will clarify these implementation aspects and empirical observations in the final version of the paper to improve transparency and reproducibility.
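For completeness, the BayesianNN interval construction described in (1) amounts to the following sketch (the model is assumed to resample weights on every forward pass, as our `BayesianLinear` layers do; function and argument names are illustrative):

```python
import torch

@torch.no_grad()
def mc_interval(bayesian_model, x, edge_index, num_samples=100, z=1.645):
    """Empirical mean/std over stochastic forward passes, Gaussian multiplier."""
    preds = torch.stack([bayesian_model(x, edge_index)
                         for _ in range(num_samples)])  # [num_samples, N]
    mu, sigma = preds.mean(dim=0), preds.std(dim=0)
    return mu - z * sigma, mu + z * sigma               # 90% interval for z=1.645
```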
## Response to Questions

### Question 1

The empirical violation loss could potentially be expressed more compactly. The current formulation $\ell_{\text{viol}} = \mathbb{E}\big[|y - y_{\text{low}}| \cdot \mathbb{I}[y < y_{\text{low}}] + |y - y_{\text{up}}| \cdot \mathbb{I}[y > y_{\text{up}}]\big]$ might be simplified to $\ell_{\text{viol}} = \mathbb{E}\big[\max(0, (y - y_{\text{low}})(y - y_{\text{up}}))\big]$, which captures the same penalty structure more elegantly.

#### Response to Question 1: Violation Loss Formulation

We appreciate the reviewer's thoughtful suggestion regarding the formulation of the violation loss. Indeed, the alternative expression $\ell_{\text{viol}} = \mathbb{E}\left[\max\left(0,\; (y - y_{\text{low}})(y - y_{\text{up}})\right)\right]$ vanishes exactly where our original definition $\ell_{\text{viol}} = \mathbb{E}\left[|y - y_{\text{low}}| \cdot \mathbb{I}[y < y_{\text{low}}] + |y - y_{\text{up}}| \cdot \mathbb{I}[y > y_{\text{up}}]\right]$ does, penalizing the same violation region. While the compact form is elegant, we chose the original formulation for its practical advantages. It explicitly separates the contributions of lower and upper violations, which improves interpretability and makes it easier to analyze asymmetries in miscoverage. Additionally, we observed slightly more stable gradient behavior during training using this decomposition. Nonetheless, we agree that the simplified expression may offer notational clarity, and we will consider including it in the appendix or a footnote of the revised manuscript to highlight the relationship.

### Question 2

The method's dependence on $\lambda_{\text{width}}$ selection through Bayesian optimization limits practical applicability. Figure 5 demonstrates steep performance degradation beyond certain parameter values, suggesting the need for more robust parameter selection strategies.

#### Response to Question 2: On Tuning $\lambda_{\text{width}}$

We agree that the selection of $\lambda_{\text{width}}$ plays an important role in balancing coverage and compactness. However, our empirical analysis (Fig. 5, Appendix I) shows that performance remains stable within a practical range of $\lambda$ (typically 0.2–0.5), and the degradation occurs only beyond this interpretable boundary. To enhance usability, we provide a simple default value ($\lambda = 0.5$) that performs competitively across 19 datasets without tuning. Moreover, while Bayesian optimization was used in our ablation for deeper analysis, it is not essential in practice. We will revise the manuscript to clarify this point and to include guidelines for default and dataset-specific settings.

### Question 3

While not strictly missing from the discussion, the following paper might be interesting to consider, which presents a complementary approach combining Conformal Quantile Regression (CQR) with GNNs for edge weight prediction with coverage guarantees.

> [1] Luo, Rui, and Nicolo Colombo. "Conformal load prediction with transductive graph autoencoders." Machine Learning 114, no. 3 (2025): 1-22.

#### Response to Question 3: Relation to Luo and Colombo (2025)

Thank you for pointing us to the recent work by Luo and Colombo (2025), which integrates Conformal Quantile Regression (CQR) with transductive graph autoencoders for edge-level load prediction. While our focus is on node-level regression under full supervision, both approaches aim to provide calibrated uncertainty in graph settings. CQR requires quantile estimates and exchangeability assumptions, whereas QpiGNN directly optimizes coverage without calibration sets or explicit quantiles. We see these methods as complementary—CQR is well-suited to edge-level tasks with calibration data, while our method emphasizes scalable, end-to-end training for node regression. We will include this work in the related literature section to better position our contributions.

# Response to Reviewer hNk7

We thank the reviewer for the insightful and constructive comments. Your questions have allowed us to clarify key concepts and better contextualize both the theoretical and empirical contributions of our work. In what follows, we provide structured responses to each question, addressing related concerns also raised under Weaknesses 1–3. We will incorporate all relevant clarifications into the revised version of the manuscript to improve precision, transparency, and accessibility.

## Question 1

The issue of "quantile-level inputs" was mentioned multiple times in the introduction, but it is unclear what that means exactly. Can you explain that?

### Response to Question 1: Terminology and Model Complexity

We appreciate the reviewer's question regarding the meaning of "quantile-level inputs" as mentioned in our introduction. This term refers to methods that rely on explicitly specifying quantile thresholds—typically lower and upper quantiles (e.g., $\tau = 0.05$, $\tau = 0.95$)—to generate prediction intervals.
In standard quantile regression, the model is trained to estimate the conditional quantile function $f(x_v; \tau)$ for a given quantile level $\tau$, and the prediction interval is constructed using two such estimates: $[\hat{y}_v^{\text{low}}, \hat{y}_v^{\text{up}}] = [f(x_v; \tau_{\text{low}}), f(x_v; \tau_{\text{up}})]$. This requires either passing $\tau$ as an additional input to the model or training separate networks for each quantile level.

However, this approach has notable limitations. It increases model complexity, requires careful tuning of quantile levels, and suffers from the issue of quantile crossing, where the estimated lower bound may exceed the upper bound—leading to invalid intervals. Moreover, it does not provide a direct mechanism for optimizing interval calibration.

In contrast, our proposed method, QpiGNN, eliminates the need for quantile-level inputs entirely. Instead of estimating conditional quantiles, QpiGNN directly predicts the conditional mean $\hat{y}_v$ and an associated symmetric interval half-width $\hat{d}_v$, constructing the interval as $[\hat{y}_v - \hat{d}_v,\; \hat{y}_v + \hat{d}_v]$. This interval is optimized end-to-end using a coverage–width loss function that explicitly enforces both calibration (i.e., achieving the desired coverage level) and compactness (i.e., minimizing interval width). As a result, QpiGNN avoids quantile-level specification and produces reliable uncertainty intervals in a principled and scalable manner. We will clarify this terminology more precisely in the revised manuscript.
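To make "quantile-level inputs" concrete, the pinball loss below is the standard objective such methods optimize: the quantile level $\tau$ is an explicit argument, and an interval needs two separate estimates. (A minimal illustration with stand-in predictions, not code from our repository.)

```python
import torch

def pinball_loss(y: torch.Tensor, y_hat: torch.Tensor, tau: float) -> torch.Tensor:
    """Standard pinball (quantile) loss for a fixed quantile level tau."""
    diff = y - y_hat
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))

y = torch.randn(100)
# Stand-ins for f(x_v; 0.05) and f(x_v; 0.95): one estimate per quantile level.
y_hat_low, y_hat_up = y - 0.5, y + 0.5
interval_loss = pinball_loss(y, y_hat_low, 0.05) + pinball_loss(y, y_hat_up, 0.95)
```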
## Question 2

The theory assumes iid data, which is acknowledged as impractical, and is also the issue identified in the introduction for other methods. Such a discrepancy is undesirable. I understand that it's difficult to prove theory under dependence, but this feels like overclaiming -- the authors blame other methods for requiring exchangeability, but the current theory not only needs iid but also convergence to truth. Are all these realistic? I would urge the authors to reframe the presentation to address this issue.

> **Weakness 1.** There are several places where the advantages of the proposed method are exaggerated. For example, the authors blame other works for requiring iid/exchangeability, yet their theory relies on even more stringent and unrealistic conditions. It is also unclear whether such a requirement is addressed in the experiments.

### Response to Question 2 (Weakness 1): Theoretical Assumptions vs. Graph Dependencies

We appreciate the reviewer's concern about the gap between our theoretical assumptions and the practical settings in which GNNs operate. We acknowledge that the coverage consistency theorem (Theorem 1) assumes i.i.d. node samples with bounded noise — a simplification that does not fully reflect the dependencies introduced by message passing in GNNs. However, we would like to clarify and reframe our contributions in light of this limitation.

The i.i.d. assumption applies to the noise variables $\varepsilon_v = y_v - f(x_v)$, not to the predictions themselves. In practice, label noise or measurement error is often uncorrelated across nodes, even when the predictive model (e.g., a GNN) induces dependencies via neighborhood aggregation.

Moreover, in sparse graphs such as ER graphs where the average degree is $\mathcal{O}(\log N)$, the influence of a single node diminishes rapidly with distance. For example, in GraphSAGE with mean aggregation, each neighbor contributes at most $\mathcal{O}(1/\deg(v))$ per layer, yielding a total influence of $\mathcal{O}(1/N)$, and this remains bounded as $\mathcal{O}(k/N)$ over $k$ message-passing layers.

As discussed in Appendix B.5, we analyze the robustness of the coverage guarantee under realistic GNN behavior. In particular, we show that the bounded-difference condition required for concentration inequalities (e.g., McDiarmid's inequality) still holds approximately in the presence of localized message passing. That is, the influence of perturbing a single node $(x_v, y_v)$ on the empirical coverage $\hat{c}$ is bounded by

$$
\sup_{X_1, \ldots, X_N, X_v'} \left| \hat{c}(X_1, \ldots, X_v, \ldots, X_N) - \hat{c}(X_1, \ldots, X_v', \ldots, X_N) \right| \leq \frac{1}{N} + \delta_G,
$$

where $\delta_G$ is a small graph-dependent residual. This supports the validity of our finite-sample concentration bounds and helps explain the empirical calibration observed in practice.

Our method achieves strong coverage and compactness across 19 datasets, including graphs with varying degrees of dependency, structure, and noise (Section 4). These results provide empirical backing that the method's performance generalizes well beyond the idealized theoretical assumptions. We agree that our current framing may overstate the theoretical claims. In the revision, we will clarify that our contribution is a practical and effective framework grounded in theory, not a complete solution for uncertainty estimation under general graph-dependent noise.

## Question 3

Line 184, it's claimed that theoretically the minimal-length prediction set with marginal coverage $\hat{c} \geq 1-\alpha$ is achieved by the true conditional quantiles. However, this does not seem to be true if the $y_v$'s are heteroskedastic or if the density is not unimodal.

> **Weakness 2.** The theoretical results do not appear all correct to me. Please see the "Questions" section for further details.

### Response to Question 3 (Weakness 2): Symmetric Interval Optimality Conditions

We thank the reviewer for pointing out this important nuance. We agree that the statement in Line 184 should be interpreted under specific conditions, and we clarify below.

Our theoretical result on width optimality assumes that the conditional distribution $P(y_v \mid x_v)$ is symmetric and unimodal, as noted in Appendix B.2. Under this assumption, the minimal-width interval that satisfies marginal coverage $\hat{c} \geq 1-\alpha$ is indeed achieved by the symmetric interval around the conditional mean using the appropriate quantile $\hat{d}_v^* = F_v^{-1}(1 - \alpha / 2)$.

However, as the reviewer correctly points out, in heteroskedastic settings or under non-unimodal distributions, the symmetric interval may no longer be optimal. In such cases, the true minimal-length set achieving the desired coverage may not be symmetric, and may require asymmetric quantiles or data-driven calibration. We acknowledge this in Appendix B.5, where we empirically observe that QpiGNN remains robust even under moderate deviations from these assumptions (e.g., skewed or heavy-tailed distributions). While our model optimizes for symmetric intervals for simplicity and interpretability, the joint loss (coverage and width) implicitly adapts to the data distribution during training. We have revised the main text to make the assumptions clearer, and softened the language to avoid overclaiming.
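For intuition on why symmetry and unimodality are exactly what is needed here: if the conditional density is unimodal and symmetric about $\mu_v$, then among all sets with mass $1-\alpha$ the highest-density one is an interval centered at $\mu_v$ (a sketch under the stated assumptions),

$$
[\mu_v - d_v^*,\ \mu_v + d_v^*], \qquad F_v(d_v^*) = 1 - \tfrac{\alpha}{2},
$$

where $F_v$ is the CDF of the centered residual $y_v - \mu_v$; shifting the interval off-center or splitting it can only move mass into lower-density regions, forcing a longer set for the same coverage. Both properties fail in precisely the heteroskedastic or multimodal cases the reviewer raises.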
## Question 4

How are the labeled and unlabeled data chosen in the experiments? If they are determined by a random split, then this does not necessarily break the exchangeability needed for conformal prediction (see e.g., Huang et al., CF-GNN).

### Response to Question 4: Exchangeability Assumptions in Graph Splits

We thank the reviewer for raising this important point regarding exchangeability in the labeled/unlabeled node split. In our experiments, the labeled and unlabeled nodes are selected using a random split, which, in principle, preserves exchangeability under ideal conditions (i.i.d. samples). However, we argue that exchangeability is often violated in practical graph settings, even with random splits, due to the inherent structural dependencies introduced by message passing and local connectivity biases.

Specifically, in real-world graphs, node features and labels are not identically distributed (e.g., due to community structure or degree heterogeneity), so exchangeability fails even with random node sampling. Message passing in GNNs leads to information leakage across labeled and unlabeled nodes. For instance, in homophilic graphs, neighboring labels are highly correlated, breaking conditional independence between train/test nodes. These structural dependencies are particularly problematic for conformal prediction methods that rely on i.i.d. residuals.

While conformal GNNs such as CF-GNN attempt to mitigate this with leave-one-out strategies, they still assume approximate exchangeability, which is not guaranteed in graph settings with non-uniform label distributions or overlapping neighborhoods. In contrast, our method (QpiGNN) avoids this issue entirely by not requiring exchangeability assumptions—we optimize coverage directly via a global loss, without relying on calibration residuals or held-out scores. We will clarify the data split strategy and its implications for exchangeability in the revised manuscript, and more clearly distinguish our theoretical foundations from existing conformal approaches.

### Official Comment 1

Thank you for the detailed response! The fact that the experiments are under exchangeability despite the claims for non-exchangeability is still a bit concerning, so I will maintain my current score.

### Response to Official Comment 1

We appreciate the reviewer's continued engagement. While we understand the concern that random splits may preserve exchangeability in theory, our empirical analysis strongly suggests that this assumption breaks down in practice for many real-world graphs. To validate this, we computed the Pearson correlation between each node's label and the mean label of its neighbors—both before and after a 50/50 random split. If the labeled and unlabeled nodes were exchangeable, we would expect this correlation to drop significantly after the split.
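The diagnostic itself is only a few lines; here is a sketch of the computation (using networkx for illustration; the graph object and label mapping are placeholders, not our experiment code):

```python
import numpy as np
import networkx as nx

def neighbor_label_correlation(G: nx.Graph, labels: dict) -> float:
    """Pearson correlation between each node's label and its neighbors' mean label."""
    own, neigh = [], []
    for v in G.nodes:
        nbrs = list(G.neighbors(v))
        if not nbrs:
            continue  # isolated nodes have no neighbor mean
        own.append(labels[v])
        neigh.append(np.mean([labels[u] for u in nbrs]))
    return float(np.corrcoef(own, neigh)[0, 1])
```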
However, as shown below, many datasets retain a high degree of label correlation even after random partitioning:

| Dataset | Full-Graph Corr. | Post-Split Corr. |
|-|-|-|
| Education | 0.762 | 0.542 |
| Election | 0.857 | 0.720 |
| Income | 0.843 | 0.683 |
| Unemploy. | 0.902 | 0.817 |
| Twitch | -0.442 | -0.455 |
| Chameleon | 0.218 | 0.147 |
| Crocodile | 0.118 | 0.199 |
| Squirrel | 0.148 | 0.138 |
| Anaheim | 0.365 | 0.213 |
| Chicago | 0.616 | 0.594 |

These consistently high post-split correlations—particularly in Education, Election, Income, and Unemployment—strongly suggest that labeled and unlabeled nodes are not independent in practice, even when selected randomly. This is largely due to structural dependencies, label homophily, and feature correlations inherent in graph data, and it violates the assumption of exchangeability needed for conformal prediction. These findings reinforce our argument that exchangeability is not just a theoretical assumption that can be "approximately true"—rather, it is systematically violated in many real-world scenarios. We will include these empirical results and their visualizations in the revised manuscript to better support our claims.

### Official Comment 2

I appreciate the follow-up results. Exchangeability doesn't necessarily connect to the correlation between neighboring nodes (or did I misunderstand the definition of "neighbors" in your setup?), but to the overall distribution of calibration and test data. As such, I don't think the evaluated correlation is supportive of the claim. I think the most convincing result would be evaluating the method under non-random splits, if that's possible.

### Response to Official Comment 2

Following the reviewer's suggestion, we evaluated QpiGNN under non-random splits; each cell reports PICP/MPIW:

| Split Type | Education | Election | Income | Unemploy. | Twitch | Chameleon | Crocodile | Squirrel | Anaheim | Chicago |
|-|-|-|-|-|-|-|-|-|-|-|
| Random (on paper) | 0.99/0.90 | 1.00/0.97 | 1.00/0.72 | 1.00/0.93 | 0.98/0.54 | 0.98/0.40 | 1.00/0.54 | 0.99/0.47 | 0.99/0.74 | 0.99/0.60 |
| Degree | 0.99/0.75 | 0.99/0.87 | 0.99/0.44 | 0.99/0.93 | 0.94/0.68 | 0.97/0.57 | 1.00/0.72 | 0.97/0.45 | 0.95/0.58 | 0.96/0.46 |
| Community | 0.99/0.54 | 0.99/0.86 | 1.00/0.46 | 0.97/0.84 | 0.99/0.49 | 0.95/0.58 | 0.98/0.74 | 0.97/0.47 | 0.97/0.60 | 1.00/0.53 |
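The "Degree" and "Community" rows assume splits of the following kind (a sketch of one plausible construction; the actual split code may differ):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def degree_split(G: nx.Graph):
    """Degree split: high-degree half labeled, low-degree half unlabeled."""
    nodes = sorted(G.nodes, key=G.degree, reverse=True)
    half = len(nodes) // 2
    return nodes[:half], nodes[half:]

def community_split(G: nx.Graph):
    """Community split: alternate whole communities between the two sets."""
    comms = list(greedy_modularity_communities(G))
    labeled = [v for c in comms[::2] for v in c]
    unlabeled = [v for c in comms[1::2] for v in c]
    return labeled, unlabeled
```

Because whole regions of the graph (hubs, entire communities) land on one side of the split, the calibration and test distributions differ by construction, which is precisely the non-exchangeable regime the reviewer asked about.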
## Question 5

In Table 1, the coverage can sometimes be low with $\lambda_{\text{width}} = 0.5$. How should one navigate this choice while maintaining validity in practice?

> **Weakness 3.** The numerical result seems a bit sensitive to the choice of the width penalty.

### Response to Question 5 (Weakness 3): Sensitivity to $\lambda_{\text{width}}$

We appreciate the reviewer's observation regarding the sensitivity of QpiGNN's performance to the choice of the width penalty coefficient $\lambda_{\text{width}}$.

**(1) Interpretable and Predictable Trade-off:** As illustrated in Fig. 4(b), $\lambda_{\text{width}}$ governs a clear trade-off between calibration and interval compactness. Smaller values (e.g., $\lambda = 0.1$) yield nearly perfect coverage but overly wide intervals, while larger values (e.g., $\lambda > 0.57$) can cause sharp drops in coverage. Despite this, the behavior is smooth and interpretable, making it manageable in practice.

**(2) Effective Default Setting:** We used a fixed value of $\lambda_{\text{width}} = 0.5$ across all 19 datasets in our main experiments. This setting consistently delivers strong performance on 16 out of 19 datasets, striking a reasonable balance without dataset-specific tuning.

**(3) Efficient Dataset-Specific Tuning:** Appendix I (Tab. 8, Fig. 12) shows that the optimal $\lambda$ values lie within a narrow and stable range (0.2–0.5), which reflects structural and statistical characteristics of each graph. Bayesian optimization enables efficient tuning without incurring high computational cost.

**(4) Robustness to Perturbations:** As reported in Fig. 6, QpiGNN maintains stable performance (PICP ≥ 0.96) even under noisy features, edge dropout, and label corruption. This indicates that the model's reliability is preserved across various $\lambda$ settings.

We will revise the manuscript to offer clear guidelines and emphasize that $\lambda$ is interpretable, tunable, and practically stable.

# Response to Reviewer 94Lh

Thank you for the thoughtful review. Your feedback has been valuable in improving this work and guiding future directions. Below, we provide structured responses addressing the major and minor weaknesses, as well as specific questions. All clarifications and improvements will be incorporated into the revised manuscript.

## Response to Major Weaknesses

### (Major) Weakness 1

The main weakness concerns the novel contribution compared to previous work, in particular RQR. While the text leaves it somewhat unclear how RQR is adapted to graphs, my interpretation is that the backbone GNN directly outputs the two interval bounds, i.e. there is a final round of message passing for both the lower and upper interval boundary.

> - The difference of QpiGNN amounts to omitting the message passing by using an additional linear layer and having fewer dependencies between the interval boundaries of adjacent nodes this way. I see this as an incremental improvement that is not too fundamentally different from the RQR baseline itself.
> - The empirical success of QpiGNN over RQR seems less convincing when considering: i) that QpiGNN ($\lambda^{\text{opt}}$) uses hyperparameter tuning to find a good trade-off between coverage and interval size while RQR uses $\lambda = 1$; ii) it appears that on real data, RQR performs well in terms of interval size, often outperforming all QpiGNN variants at fixed $\lambda$ and even those with tuned $\lambda$ sometimes, while still having very high coverage. The coverage is usually only slightly below 90%, which is why it is not underlined in red. By either choosing a marginally more lenient coverage threshold (e.g. 85%) or slightly changing lambda to increase coverage a bit, I would expect that the performance of RQR and QpiGNN does not clearly favor QpiGNN.
> - In light of this, the architectural novelty of QpiGNN compared to RQR seems even less meaningful.

#### Response to Weakness 1: Novelty and Comparison with RQR

We appreciate the reviewer's critique regarding the comparison with RQR. Our method offers meaningful architectural and empirical improvements in graph-based settings. We clarify these distinctions and the limitations of RQR in more detail below.

RQR was originally developed for MLPs, not GNNs. To enable a fair comparison, we adapted RQR to GNNs by modifying the backbone to predict lower and upper bounds per node and applying the RQR-W loss. Unlike our dual-head design, which separately estimates the interval center and width, this direct adaptation lacks architectural separation, limiting its expressiveness in graph-based settings.

During this adaptation, we frequently observed quantile crossing due to representation entanglement caused by message passing. To mitigate this, we introduced an ordering penalty into the RQR loss (which is not part of the original formulation) to enforce valid intervals.
This modification stabilized RQR's outputs and improved its performance in GNNs, making our adapted version a fairer and more competitive baseline. Even when adapted to GNNs, RQR tends to produce overly smooth and wide intervals due to the averaging effect of message passing, which obscures node-specific uncertainty. In contrast, QpiGNN's dual-head design decouples prediction and uncertainty estimation, enabling better modeling of heteroskedasticity in sparse or noisy graphs. Although QpiGNN benefits from $\lambda$ tuning, it already performs well with the default setting. RQR was evaluated at $\lambda = 1$ to align with prior work; results for tuned $\lambda$ can be found in our response to Question 4.

### (Major) Weakness 2

Theorem 1 seems to not be very impactful, as it makes very strong assumptions that are unrealistic, especially on graphs. Noise is typically not independently distributed across nodes; that is precisely why UQ for GNNs is challenging. Also, assuming that the mean and interval width converge to the targets is basically equivalent to assuming consistency. There is no reason to assume that when using a black-box GNN to predict them. From these very strong assumptions, the consistency claim basically follows directly from the law of large numbers. I do not see sufficient justification to explicitly state "if the errors are independent and the intervals approach their correct distribution, the intervals are consistent" as the main Theorem of the paper. Note that some of these strong assumptions are also relevant to Theorems 3 and 4 in the Appendix. They, too, constitute more or less just applying well-known bounds for such random variables and do not contribute new insights.

#### Response to Weakness 2: Clarifying Theorem 1

**(Original version)**

We appreciate the reviewer's concern about the strong assumptions in Theorem 1. While we acknowledge that conditions such as independent noise across nodes and convergence of predicted bounds are idealized—especially in GNNs where message passing induces dependencies—we argue that the theorem remains both theoretically meaningful and practically useful.

First, Theorem 1 formalizes the desirable property of asymptotic coverage consistency: if the predicted lower and upper bounds converge in probability to their true values and noise is bounded and independent, then the empirical coverage $\hat{c}$ converges in probability to the target level $1 - \alpha$, following the Weak Law of Large Numbers.

Second, we provide finite-sample guarantees (Theorems 3 & 4), showing that the deviation between $\hat{c}$ and its expectation is bounded with high probability, e.g., $|\hat{c} - \mathbb{E}[\hat{c}]| \leq \sqrt{\log(2/\delta)/(2N)}$. These results offer additional reliability even for moderate sample sizes.

Third, in Section B.5, we show that although full independence may not hold in GNNs, the bounded-difference condition required by McDiarmid's inequality still holds approximately. The influence of perturbing one node is bounded by $1/N + \delta_G$, where $\delta_G$ is a small graph-dependent term, allowing concentration results to remain valid under weak dependencies.

---

**(Revised version)**

We appreciate the reviewer's thoughtful feedback. While we acknowledge that assumptions such as independent noise across nodes and convergence of predicted bounds may appear idealized—particularly in GNNs where message passing induces dependencies—we argue that Theorem 1 remains both theoretically grounded and practically meaningful.
Below, we offer a more realistic interpretation of the assumptions, supported by prior theoretical work and empirical evidence.

Theorem 1 is not intended as a novel theoretical contribution, but as a principled explanation for the empirically observed coverage stability. It uses the WLLN to show that empirical coverage $\hat{c}$ approaches the target level $1 - \alpha$ under mild convergence and regularity conditions. The assumption of nodewise independence should be more accurately interpreted as "bounded and weak dependence," which is often sufficient for concentration-based arguments in graph-based models. In sparse graphs such as ER graphs, where the average degree is $\mathcal{O}(\log N)$, the influence of a single node on distant nodes diminishes rapidly. Specifically, in GraphSAGE with mean aggregation, each neighbor contributes at most $\mathcal{O}(1/\deg(v))$ per layer, resulting in a total influence of $\mathcal{O}(1/N)$. Even over $k$ message-passing layers, the cumulative effect remains bounded as $\mathcal{O}(k/N)$. This interpretation is consistent with prior work—for example, [1] shows that the WLLN can hold for stationary graph signals, providing theoretical precedent for applying the WLLN in graph domains without requiring full independence.

Moreover, our paper offers finite-sample theoretical guarantees. Specifically, Theorems 3 and 4 apply Hoeffding's and McDiarmid's inequalities to bound the deviation $\left|\hat{c} - \mathbb{E}[\hat{c}]\right|$ with high probability, even under weak dependencies. In Appendix B.5, we show that perturbing a single node changes the output by at most $1/N + \delta_G$, where $\delta_G$ is a small graph-dependent constant. This satisfies the bounded-difference condition, enabling valid concentration arguments even in realistic GNN settings.

Empirically, QpiGNN maintains high coverage and compact intervals under various perturbations and generalizes well across graph types (see Figures 6 and 7), supporting that Theorem 1's assumptions hold approximately in practice. We will revise the theorem to assume bounded, weakly dependent noise and clarify that convergence is to stable reference targets—not true quantiles.

#### References

[1] Gama, Fernando, and Alejandro Ribeiro. "Ergodicity in stationary graph processes: A weak law of large numbers." IEEE TSP 67.10 (2019): 2761-2774.

### (Major) Weakness 3

The Generalisation Analysis (Figure 7) seems out of place. While an interesting property of the architecture, it does not align with the red thread of the paper that deals with UQ. Neither is UQ typically mainly assessed by testing the accuracy (in this case coverage) on OOD, nor does the architecture motivate why there should be improved generalization for this model. There is also no comparison to how well the baselines handle this task, so there is no context for the performance visualized here.

#### Response to Weakness 3: About the Generalization Analysis

This experiment goes beyond testing generalization—it evaluates whether coverage consistency holds under structural shifts in graph topology. In real-world settings, graphs often change over time or are partially observed, making robustness to structural changes essential for practical UQ. Yet most prior work has focused on feature distribution shifts, overlooking the significant impact of structural changes on both prediction and uncertainty in GNNs. To evaluate coverage under structural shifts, we conduct a train-test transfer experiment across graphs with different structures.
This approach aligns with prior work [1, 2] on robustness to conditional distribution shifts and is critical for validating GNN-based UQ in realistic settings (Fig. 7). We acknowledge the reviewer's point about missing baselines and will include comparable structure-shift experiments and coverage comparisons in the revised version to improve the credibility of our results.

#### References

[1] Huang, Kexin, et al. "Uncertainty quantification over graph with conformalized graph neural networks." NeurIPS (2023): 26699-26721.

[2] Akansha, S. "Conditional shift-robust conformal prediction for graph neural network." arXiv preprint arXiv:2405.11968 (2024).

## Response to Minor Weaknesses

### (Minor) Weakness 1

The ablation in Table 2 and Figure 8 would be more impactful if conducted on real data instead of synthetic graphs.

#### Response to Weakness 1: Real-Data Ablation Study

We conducted ablation studies on real-world datasets, which confirmed the effectiveness of the dual-head architecture and full loss formulation, consistently yielding better calibration–sharpness trade-offs.

| Variant | PICP | MPIW |
|-|-|-|
| Dual Head & Full Loss | 0.87 | 0.35 |
| Fixed Margin | 0.64 | 0.10 |
| Single Head | 0.51 | 0.09 |
| Coverage Loss Only | 1.00 | 1.10 |
| Width Loss Only | 0.07 | 0.00 |

(Excluded) Simpler variants underperformed, and tuning the width penalty $\lambda$ was beneficial on some datasets (e.g., Twitch). We will clarify this in the revised manuscript.

### (Minor) Weakness 2

The Related Work section mostly restates the introduction. This redundancy could be removed in favor of deepening the background on node regression, UQ in regression, or UQ on graphs (where node classification and graph classification have been gaining a lot of attention in recent years).

#### Response to Weakness 2: Improving Related Work

We agree that the Related Work section overlaps with the Introduction. In the revision, we will reduce redundancy and expand the discussion on node regression and UQ in both general and graph-based settings to better contextualize our contributions.

## Response to Questions

### Question 1

Why is the joint loss used for QpiGNN in Equation 4 different from the joint loss of RQR in Equation 3? What is the justification for these changes?

#### Response to Question 1: Difference in Loss Functions

We thank the reviewer for highlighting the difference in loss formulations. RQR uses a max-based loss that penalizes only when the true value falls outside the predicted interval. In contrast, QpiGNN uses an asymmetric penalty that separately handles under- and over-coverage:

$$
\mathbb{E}\big[|y - \hat{y}^{\text{low}}| \cdot \mathbb{I}(y < \hat{y}^{\text{low}}) + |y - \hat{y}^{\text{up}}| \cdot \mathbb{I}(y > \hat{y}^{\text{up}})\big] + \lambda_{\text{width}} \cdot (\hat{y}^{\text{up}} - \hat{y}^{\text{low}}).
$$

This improves interpretability, stabilizes convergence, and makes $\lambda$ tuning more intuitive—crucial for GNNs where oversmoothing and interval imbalance are prevalent.
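In code, the objective above amounts to two one-sided hinge terms plus a width penalty (a minimal sketch of the formula as written, not our repository's implementation):

```python
import torch

def coverage_width_loss(y, y_low, y_up, lambda_width=0.5):
    """Asymmetric violation penalties for each bound, plus a width penalty."""
    under = torch.clamp(y_low - y, min=0.0)  # |y - y_low| where y < y_low, else 0
    over = torch.clamp(y - y_up, min=0.0)    # |y - y_up|  where y > y_up,  else 0
    width = y_up - y_low
    return (under + over + lambda_width * width).mean()
```

Because the two violation terms are separate, under- and over-coverage can be monitored (or weighted) independently, which is the asymmetry referred to above.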
### Question 2

What is the performance when using RQR with this new objective? Is it sufficient to alleviate oversmoothing as claimed by the authors?

#### Response to Question 2: Evaluating RQR with Revised Objective

As detailed in Major Weakness 1, points (1) and (2), we adapted RQR to GNNs with an added ordering penalty to reduce quantile crossing. While this improves interval validity, RQR still suffers from oversmoothing and wide intervals due to architectural limitations. In contrast, QpiGNN's dual-head design and coverage–width loss jointly address these issues. Thus, applying the QpiGNN loss alone to RQR is insufficient; architectural changes are key to mitigating oversmoothing.

### Question 3

Can the authors include some baselines in Figure 6? Without this, it is hard to contextualize how well QpiGNN handles noise compared to its competitors (i.e., is the performance of the competitors under feature/edge noise constant as well?).

#### Response to Question 3: Baselines in Robustness Analysis

We thank the reviewer for the suggestion. In response, we have included additional robustness experiments comparing QpiGNN with two representative baselines, SQR and RQR, under three types of noise: feature noise, edge dropout, and target noise. The results are summarized in the table below; visualizations will be included in the revised version.

|Perturbation Type|Level|QpiGNN (PICP/MPIW)|SQR (PICP/MPIW)|RQR (PICP/MPIW)|
|-|-|-|-|-|
|Gaussian Noise|0.1|0.89/0.66|0.76/0.78|0.85/0.78|
||0.2|0.90/0.80|0.65/0.69|0.85/0.78|
||0.3|0.92/0.83|0.71/0.81|0.84/0.76|
|Edge Dropout|0.2|0.84/0.21|0.53/0.44|0.86/0.80|
||0.4|0.88/0.23|0.60/0.44|0.89/0.84|
||0.6|0.91/0.21|0.82/0.50|0.89/0.84|
|Target Noise|0.1|0.96/0.44|0.49/0.52|0.95/0.87|
||0.2|0.99/0.71|0.95/0.84|0.97/1.05|
||0.3|1.00/1.05|0.98/0.99|0.98/1.32|

### Question 4

Can the authors provide experiments where $\lambda$ is tuned for RQR as well, or at least try a $\lambda$ that enables a slightly higher coverage, and report a comparison with QpiGNN?

#### Response to Question 4: Tuning $\lambda$ for RQR

To assess RQR's best-case performance, we conducted additional experiments with $\lambda$ tuning across both synthetic and real-world datasets. While tuned RQR can yield narrower intervals than QpiGNN, this relies on our modified version with an added ordering constraint; without it, RQR often produces invalid intervals and shows poor calibration. Furthermore, despite having tighter intervals, RQR frequently fails to achieve the target coverage.

|Model|Syn. (PICP/MPIW)|US. (PICP/MPIW)|Twitch (PICP/MPIW)|Wiki (PICP/MPIW)|Trans. (PICP/MPIW)|
|-|-|-|-|-|-|
|RQR ($\lambda$=1)|0.86/0.66|0.89/0.44|0.91/0.42|0.87/0.13|0.87/0.40|
|RQR ($\lambda$=0.5)|0.86/0.64|0.88/0.42|0.91/0.41|0.88/0.07|0.86/0.38|
|RQR ($\lambda$=0.1)|0.86/0.66|0.88/0.44|0.91/0.42|0.88/0.13|0.87/0.40|
|QpiGNN ($\lambda$=0.5)|0.94/0.50|0.99/0.62|0.59/0.08|0.72/0.06|0.95/0.37|
|QpiGNN ($\lambda$=0.1)|0.99/0.88|0.99/0.88|0.98/0.54|0.99/0.47|0.99/0.67|

In contrast, QpiGNN reliably achieves the target coverage, even under noise and structural shifts, showing a better coverage–sharpness trade-off and stronger generalization. We will clarify this comparison in the revision.
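For reference, the PICP/MPIW values reported in the tables above follow the standard definitions of these metrics; a minimal NumPy sketch is below (we assume an unnormalized MPIW here, since whether widths are divided by the target range is not specified in this response):

```python
import numpy as np

def picp_mpiw(y, y_low, y_up):
    # PICP: fraction of targets falling inside their predicted interval.
    inside = (y >= y_low) & (y <= y_up)
    # MPIW: mean prediction interval width (unnormalized; some protocols
    # divide by the range of the targets).
    return inside.mean(), (y_up - y_low).mean()

# Example with three nodes: the third target falls outside its interval.
y = np.array([0.2, 0.5, 0.9])
lo = np.array([0.0, 0.4, 1.0])
hi = np.array([0.5, 0.6, 1.2])
print(picp_mpiw(y, lo, hi))  # PICP = 2/3, MPIW = 0.3
```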
