# KDD'25 Rebuttal Submission

[OpenReview link](https://openreview.net/group?id=KDD.org/2025/Research_Track_February/Authors)

---

Title: **Successfully Completed Author Rebuttal**

Dear AC, SAC, and PC,

Thank you for your time and effort in coordinating the review process. Although our original submission already included extensive comparisons against 12 strong baselines, some reviewers requested the inclusion of additional methods. During the rebuttal, we conducted further experiments with four recent models (TopoGCL, AutoST, iTransformer, and TimeMixer), bringing the total to **16 baseline comparisons**. Notably, 4 out of 5 reviewers were satisfied with our clarifications and updates, thereby **raising their scores** or **maintaining positive assessments**. Overall, we are pleased that the majority of reviewers acknowledged the strengths of our work and responded positively.

Sincerely,
Authors

---

**Title: Resolved Most of Reviewers' Concerns**

Dear AC,

We sincerely appreciate your time and effort in evaluating our submission. We believe the performance comparison has been sufficiently addressed in both the main paper and the rebuttal. Several reviewers acknowledged our efforts and updated their scores accordingly. As one reviewer has not responded to the additional results, we kindly ask the AC to consider this context when making a judgment.

Specifically, the main paper already includes comparisons with **12 strong baselines**, and in response to reviewer feedback, we additionally conducted and reported experiments with **4 recent SOTA methods (iTransformer, TimeMixer, AutoST, and TopoGCL)**. Reviewers **wtfq** and **3BYJ** explicitly acknowledged the additional comparisons and reflected the improvements in their updated evaluations. However, Reviewer **QEL2** has not responded since the additional experiments were shared. Once again, we kindly ask the AC to consider that the concern has been positively resolved by multiple reviewers, and hope this context helps inform your judgment.

Thank you again for your thoughtful consideration.

---

## Reviewer1 - QEL2, Novelty(2), Technical Quality(3), Confidence(3) → Novelty(3)

### 2244 characters

> [W1] Baseline selections: Comparisons focus on single-modality models but lack inclusion of recent multi-modal graph methods (see next).

> [W2] Limited novelty/insight: Within the context of “STG contrastive learning with node and edge features”, there have been many works [1-4]. While the approach used by BINTS to “extend” a model from single modality to bi-modality is different from previous works, it’s unclear how the design choices fare with its alternatives (not the baselines in the paper but works like [1-4]).

> [Q2] The paper states that BINTS extends an existing single-modality approach to bi-modality. Could the authors discuss how this extension differs from the design choices in other papers such as [1–4]? What new insights or challenges arose during this extension, and how were they tackled?

`W1, W2, Q2 Response`

While contrastive learning for STGs is well-studied, works [1–4] differ fundamentally from our setting. BINTS introduces a **3D bi-modal contrastive framework**, where node time series, edge flows, and graph topology form **independent** axes. This 3D structure is key for mobility systems, where node and edge signals evolve separately. In contrast, prior methods operate in a **2D contrastive** space that covers only two of our three axes (see Table A).
None of the prior works consider such explicit **bi-modal contrastive learning across 3D axes**. We also include five strong graph-based baselines for fair comparison.

`Submission W1, W2, Q2 Response`

While contrastive learning for STGs is well-studied, previous works [1–4] differ fundamentally from our setting. BINTS introduces a **3D bi-modal contrastive framework**, where node time series, edge flows, and graph topology form **independent** axes. Fully leveraging the 3D axes is key for mobility systems, where node and edge signals evolve separately. However, prior methods operate in a 2D contrastive space involving only two axes (e.g., Node×Time), whereas ours explicitly contrasts along all three—Node, Edge, and Time (see Table A). Also, we already included five strong graph-based baselines for fair comparison.

***Table A. Modality Coverage Comparison***

|Method|Node TS|Edge TS|Graph|Contrast Axis|
|--|--|--|--|--|
|**BINTS**|✔|✔|✔|**Node×Edge×Time (3D)**|
|[1] TF-GCL|✔|✘|✔|Node×Time (2D)|
|[2] STGCL|✔|✘|✔|Node×Time (2D)|
|[3] TGCL4SR|✔ (local)|✔ (seq)|✔ (TITG)|Item×Time (2D)|
|[4] DySubC|✔|✘|✔|Node×Temp. Subgraph (2D)|

> [W3] Potential Domain-Specific Data analysis: Given that the datasets are confined to transportation and epidemic domains, while the standard forecasting metrics are thorough, discussing how BINTS’ performance translates into domain-specific impacts could strengthen practical relevance. In particular, it would be interesting to see how the difference in temporal patterns (Figures 2,3) correlates with model performances.

> [Q3] Given the paper’s emphasis on both transportation and epidemic datasets, could the authors provide further analysis connecting the different temporal patterns (Figures 2 and 3) to BINTS’ performance? For instance, how do the morning/evening commute peaks or sudden surges in NYC COVID cases impact model accuracy?

`W3, Q3 Response`

Figures 2 and 3 show temporal patterns where node and edge modalities behave asynchronously or even inversely (commute peaks, infection surges). BINTS captures these dynamics via soft contrastive alignment, improving performance in such periods. We selected the transportation and epidemic domains as representative bi-modal settings, using diverse cities. BINTS is domain-agnostic and can generalize to other co-evolving systems (energy, supply chains).

---

`[W3, Q3 Full Response]`

We thank the reviewer for the insightful point. Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

As briefly noted in the rebuttal response, Figures 2 and 3 show that node and edge modalities often behave asynchronously or even inversely; for example, node density increases while flows drop during infection surges. These patterns challenge conventional single-modality models that assume aligned signals. BINTS is designed to handle these challenging cases via **soft contrastive alignment**, which flexibly learns cross-modal relationships without assuming fixed correlation patterns.
In the transportation data (Busan, Seoul Metro), BINTS improves forecasting during commute peaks by aligning lagged OD and density signals. In the epidemic data (NYC COVID), where mobility collapses as infections spike, BINTS captures the inverse trend to maintain robustness. The transportation and epidemic domains were selected because they naturally reflect real-world **bi-modal networked time series**: independently evolving node- and edge-level signals over a shared topology. To ensure generality, we include diverse cities (Seoul, Busan, NYC) with varied urban structures and mobility dynamics. Beyond the transportation and epidemic domains, BINTS is **domain-agnostic** and can generalize to other scenarios such as power grids, logistics, or network traffic, where node and edge signals evolve independently.

We will clarify your point in the revision and include additional visualizations showing time-aligned prediction errors during key domain events (rush hours, lockdowns) to highlight model behavior under cross-modal disruption.

---

> [W4] Complexity analysis: The paper presents running times but may benefit from additional analysis regarding scaling to large graphs or higher-frequency data. A detailed complexity analysis both in memory and time (asymptotic, in addition to Table 5) would clarify feasibility.

> [Q4] The paper presents runtime comparisons in Table 5 but does not include a theoretical (asymptotic) complexity analysis. Can the authors provide more detailed estimates of the memory/time complexity and discuss how BINTS might scale to much larger graphs or higher-frequency data? Are there any recommended strategies (sampling, approximate methods) to mitigate potential bottlenecks?

`W4, Q4 Response`

***Theoretical complexity***: BINTS computes spatial and temporal contrastive losses using representations $h_{TCN}^{r_i,t}$ and $h_{GCN}^{r_j,t} \in \mathbb{R}^{d_h}$. Naively computing similarity for all node pairs costs $O(N^2 d_h)$; in practice, the computation is restricted to within-node and local temporal windows, yielding a tractable complexity of $O(N w d_h)$, where $w$ is the time sequence length (a toy sketch of the windowed computation is given below).

***Scalability***: BINTS handles datasets with up to 54,755 nodes and 65,000+ time steps on a **single GPU** with practical runtime. For larger cases, we suggest temporal pooling, subgraph sampling, and GPU-parallel encoders, which are commonly used to keep deep learning workloads tractable.
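To make the cost argument concrete, a minimal PyTorch sketch follows (illustrative only; the tensor names, toy sizes, window length `w`, and plain dot-product similarity are assumptions for this note, not the actual BINTS implementation). It contrasts all-pairs similarity with a within-node, local-window computation.

```python
import torch

N, T, d_h = 500, 64, 32          # nodes, time steps, hidden dim (toy sizes)
w = 8                            # local temporal window length (assumed)

h_tcn = torch.randn(N, T, d_h)   # node-level representations (illustrative)
h_gcn = torch.randn(N, T, d_h)   # edge-/graph-level representations (illustrative)

# Naive variant: similarity between every pair of nodes at each time step,
# i.e. an (N x N) matrix per step -> O(N^2 * d_h) work.
naive_sim = torch.einsum('itd,jtd->tij', h_tcn, h_gcn)        # (T, N, N)

# Restricted variant: each node is contrasted only with itself inside a local
# temporal window of length ~w -> O(N * w * d_h) work per anchor step.
t = 10                                                         # anchor step
lo, hi = max(0, t - w), min(T, t + w + 1)
window_sim = torch.einsum('nd,nkd->nk', h_tcn[:, t], h_gcn[:, lo:hi])  # (N, window)

print(naive_sim.shape, window_sim.shape)
```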
> [Q1] Are the datasets preprocessed from existing ones or from newly collected ones?

`Q1 Response`

The datasets are newly curated from multiple public sources. We construct a benchmark suite for bi-modal forecasting by merging and aligning mobility and epidemic data across four cities in two countries, with a unified spatio-temporal resolution. Our benchmark was recently published as an independent benchmark paper, further supporting its contribution.

---

To more directly address your remaining concern regarding the performance comparison, we have added new results in **Table XI**, comparing BINTS against recent strong baselines including **TopoGCL**, **AutoST**, **iTransformer**, and **TimeMixer**. These models cover a diverse range of model categories. Empirically, ***BINTS*** consistently outperforms all of these baselines across all datasets, confirming that handling temporally-evolving bi-modal signals through **3D soft contrastive learning** offers substantial benefits in real-world tasks.

Thank you once again for your insightful feedback, which helped us better position our contributions relative to the latest methods and strengthened the presentation of our core strengths.

---

## Reviewer2 - wtfq, Novelty(2), Technical Quality(3), Confidence(4) → Novelty(3)

### 2480 characters

> [W1] The novelty of proposed BINTS is limited. There are many papers have done similar works regarding similarity matrix construction, and spatio-temporal contrastive learning (see, for example, Chen et al TopoGCL, Zhang et al, Automated Spatio-Temporal Graph Contrastive Learning etc)

`W1 Response`

As shown in Table B, prior methods differ in modalities and contrastive structure. TopoGCL targets static graphs and contrasts topological summaries without temporal or modal separation. AutoST uses fused region-level inputs without disentangling node and edge time series. In contrast, BINTS **explicitly separates node and edge dynamics** and aligns them via **3D contrastive learning** (Node×Edge×Time), a setting not addressed in prior work. We will clarify this distinction with the comparison table in the revision.

***Table B. Modality Coverage Comparison***

|Method|Node TS|Edge TS|Graph|Contrast Axis|Contrast Level|
|--|--|--|--|--|--|
|**BINTS**|✔|✔|✔|Node×Edge×Time (3D)|Node-level (soft)|
|TopoGCL|✘|✘|✔|Topological Shape (via PH)|Topo–Topo (topology-level CL)|
|AutoST|✔|✔|✔|Multi-view Node×Time (2D)|Region-level, Graph-level|

> [W2] Why using cosine similarity to build similarity matrix?

`W2 Response`

We use cosine similarity as it is a standard choice in contrastive learning, including in models like **CLIP** [a], due to its scale-invariance and interpretability. It aligns node and edge representations consistently with prior work.

---

`W2 Full Response`

Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

Cosine similarity is widely used in representation alignment tasks, including in foundational models such as CLIP [a], where it serves as the standard metric for comparing image–text embeddings. In our case, it provides a scale-invariant and interpretable measure for aligning node and edge representations across modalities. We follow this well-established practice to maintain consistency with prior contrastive learning works.

[a] Radford et al., Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
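As a small illustration of the scale-invariance point above (a toy example with assumed embedding shapes, not tied to the BINTS code), rescaling one modality's embeddings leaves cosine similarity unchanged while a raw dot product changes:

```python
import torch
import torch.nn.functional as F

node_emb = torch.randn(8, 32)   # toy node representations (illustrative)
edge_emb = torch.randn(8, 32)   # toy edge representations (illustrative)

cos = F.cosine_similarity(node_emb, edge_emb, dim=-1)
dot = (node_emb * edge_emb).sum(-1)

# Rescaling one modality (e.g., different units or magnitudes) does not affect
# cosine similarity, but it does change the raw dot product.
scaled = 10.0 * edge_emb
print(torch.allclose(cos, F.cosine_similarity(node_emb, scaled, dim=-1)))  # expected: True
print(torch.allclose(dot, (node_emb * scaled).sum(-1)))                    # expected: False
```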
---

> [W3] Why the authors set different loss functions to be equal?

`W3 Response`

If we understand the question correctly, the concern is why the spatial and temporal contrastive losses are weighted equally. We chose equal weights to maintain balance between modalities. As shown in Table C (also Table 4 in the main paper), our hyperparameter analysis confirms that performance is highest when the weights are equal.

***Table C. Sensitivity analysis (shortened ver.)***

|$w^{\text{spatial}}$|$w^{\text{temporal}}$|MAE|MSE|
|--|--|--|--|
|0.1|0.8|0.461|0.453|
|0.2|0.1|0.422|0.413|
|0.25|0.25|**0.417**|**0.404**|
|0.7|0.2|0.466|0.456|
|0.8|0.1|0.454|0.444|

> [W4] Can the authors report other evaluation metrics such as RMSE, MAPE, etc?

`W4 Response`

Due to space limits, we report results only on the Busan 7-day setting in Table D. BINTS achieves the best RMSE and MAPE. The full results of this additional evaluation will be included in the final version.

***Table D. RMSE and MAPE (shortened ver.)***

|Model|RMSE|MAPE(%)|
|--|--|--|
|DLinear|0.728|82.425|
|TimesNet|0.734|86.050|
|BINTS|**0.633**|**81.603**|

> [W5] Can the authors provide more explanations and insights about the achieved results?

`W5 Response`

We will surely add more explanations and insights in the final version. In short, the gains are attributed to two key factors. (1) Bi-modality: BINTS separately encodes node-level and edge-level time series. (2) Soft contrastive alignment: our spatial and temporal contrast objectives align modality-specific patterns, which improves generalization under temporal shifts and cross-modal gaps.

### Reviewer2 additional comments

> Thanks for the response. However, without proper numerical comparison with SOTAs for spatio-temporal data such as TopoGCL and AutoST, I am not convinced in gains of BINTS.

Thank you for giving us another opportunity to highlight the strengths of our work. To further address your concerns, we conducted additional experiments using AutoST’s official pipeline, adapting our bi-modal datasets into single-modality form in line with AutoST’s contrastive learning and traffic prediction setup. As shown in Table X, BINTS consistently outperforms AutoST across all datasets. Notably, we observe that BINTS achieves even greater performance margins on the **epidemic datasets**, where **cross-modal patterns are more asynchronous**. This observation suggests that BINTS offers greater robustness across domains, especially in scenarios requiring richer temporal reasoning. We are also working to integrate TopoGCL into our benchmark and will share the results as soon as they are available. We sincerely hope these efforts help demonstrate the generality and effectiveness of our approach.

**Table X. Additional Performance Comparison Result**

|Models|Prediction Lengths|Daegu MAE|Daegu MSE|Busan MAE|Busan MSE|Seoul MAE|Seoul MSE|NYC MAE|NYC MSE|COVID MAE|COVID MSE|NYC_COVID MAE|NYC_COVID MSE|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|TopoGCL|7d|0.7078|1.076|0.7859|1.2289|0.6881|2.2832|0.739|0.8537|0.8797|1.3818|0.6981|0.928|
| |14d|0.7083|1.0827|0.7781|1.1538|0.6525|1.0854|0.7194|0.9864|0.8854|1.4558|0.6994|0.9483|
| |30d|0.7100|1.2193|0.7740|1.1473|0.6555|1.0899|0.7155|0.9525|0.8578|1.3163|0.7559|1.1755|
| |**Avg.**|0.7087|1.126|0.7793|1.1767|0.6654|1.4862|0.7246|0.9309|0.8743|1.3846|0.7178|1.0173|
|AutoST|7d|0.4866|0.9060|0.6904|2.7226|0.4026|0.4744|0.4598|0.5835|0.7036|0.8743|0.5866|0.7360|
| |14d|0.4928|0.8996|0.6832|2.6161|0.4129|0.4849|0.4598|0.5792|0.7080|0.8803|0.5667|0.7091|
| |30d|0.4951|0.9131|0.6722|2.5182|0.4107|0.4813|0.4679|0.5743|0.9767|1.5884|0.6338|0.8214|
| |**Avg.**|0.4915|0.9062|0.6819|2.6190|0.4087|0.4802|0.4625|0.5790|0.7961|1.1143|0.5957|0.7555|
|iTransformer|7d|0.5154|0.9308|0.5985|1.4457|0.5281|0.7249|0.4942|0.6516|0.5101|0.5633|0.4392|0.6471|
| |14d|0.6439|1.1214|0.7264|1.8659|0.5149|0.7028|0.5416|0.6447|0.7182|0.9394|0.4628|0.6957|
| |30d|0.6492|1.1360|0.7300|1.9286|0.5327|0.7364|0.5638|0.6868|0.7517|1.0129|0.4561|0.6983|
| |**Avg.**|0.6028|1.0627|0.6849|1.7467|0.5252|0.7214|0.5332|0.6611|0.6600|0.8386|0.4527|0.6803|
|TimeMixer|7d|0.4876|0.8642|0.5671|1.4420|0.4500|0.5620|0.4205|0.5187|0.4938|0.4965|0.4178|0.6025|
| |14d|0.4903|0.8697|0.5746|1.5024|0.4561|0.5779|0.4208|0.5200|0.5165|0.5373|0.4171|0.6078|
| |30d|0.4965|0.8801|0.5849|1.5754|0.4587|0.5845|0.4286|0.5357|0.5588|0.6181|0.4297|0.6279|
| |**Avg.**|0.4915|0.8713|0.5755|1.5066|0.4549|0.5748|0.4233|0.5248|0.5230|0.5506|0.4216|0.6127|
|**BINTS**|7d|0.4106|0.4295|0.4130|0.4011|0.3912|0.4434|0.4049|0.4431|0.3278|0.4877|0.3981|0.4258|
| |14d|0.4171|0.4355|0.4169|0.4041|0.3962|0.4570|0.4121|0.4576|0.3394|0.5183|0.4069|0.4381|
| |30d|0.4233|0.4444|0.4260|0.4120|0.4078|0.4800|0.4172|0.4643|0.3559|0.5616|0.4045|0.4393|
| |**Avg.**|**0.4170**|**0.4365**|**0.4186**|**0.4057**|**0.3984**|**0.4601**|**0.4114**|**0.4550**|**0.3410**|**0.5225**|**0.4032**|**0.4344**|

---

To further validate BINTS, we also included **TopoGCL** as a baseline in the *updated* Table X (see the above comment). TopoGCL performs well in static graph tasks by contrasting topological summaries via persistence diagrams and structural features across graph augmentations, but it differs fundamentally from BINTS in design and goal. BINTS performs **3D contrastive learning across node, edge, and time axes**, which is critical for learning representations from modalities that are inherently complementary. Empirically, BINTS consistently outperforms TopoGCL across all datasets, confirming that handling temporally-evolving bi-modal signals through soft contrastive learning offers substantial benefits in real-world tasks.

Thank you once again for your insightful comment, which allowed us to clearly position our contributions against recent contrastive learning works. Your feedback has meaningfully contributed to highlighting the unique strengths of our approach.

---

## Reviewer3 - 3BYJ, Novelty(2), Technical Quality(2), Confidence(4) → Novelty(3), Technical Quality(3)

### 2769 characters

> [W1] The terms "node features" and "node interactions" should be replaced with more appropriate expressions, as they appear to describe node characteristics and edge features in a graph, which hardly reflects the paper's innovation.

`W1 Response`

Thank you for the suggestion to improve terminology clarity. We used "node features" and "node interactions" to emphasize bi-modal time-varying signals, but we understand these may be misread as static attributes. To avoid confusion, we will revise them to clearer terms such as “node-/edge-originated time series” in the revised version.

> [W2] The authors do not clearly summarize this paper's research challenges and contributions in the Introduction, which increases reading difficulty.

`W2 Response`

We are glad that our rebuttal has successfully addressed most of your concerns. Once again, we sincerely thank you for your time, thoughtful review, and invaluable suggestions, which have undoubtedly contributed to improving and refining this work.

---

`[W2 Full Response]`

Thank you for the comment. Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

We believe that the Introduction already outlines the key challenge: how to jointly model and align time-varying node and edge signals, which often behave asynchronously. As summarized in Table X, prior works typically consider only node-level time series (TF-GCL, STGCL, DySubC) or fuse information without explicit alignment (AutoST). In contrast, BINTS addresses the joint modeling of time-varying node features and time-varying edge interactions over a shared graph, and proposes a novel framework that performs **3D contrastive learning (Node × Edge × Time)**. The central challenge is how to leverage the synergy between node-originated and edge-originated time series, which often exhibit asynchronous or complementary patterns.
We motivate the need to disentangle and align these modalities for accurate forecasting.

**Table X. Modality Coverage Comparison**

|Method|Node TS|Edge TS|Graph|Contrast Axis|
|---|---|---|---|---|
|**Ours (BINTS)**|✔|✔|✔|Node×Edge×Time (3D)|
|TF-GCL|✔|✘|✔|Node×Time (2D)|
|STGCL|✔|✘|✔|Node×Time (2D)|
|TGCL4SR|✔ (local)|✔ (seq)|✔ (TITG)|Item×Time (2D)|
|DySubC|✔|✘|✔|Node×Temp. Subgraph (2D)|
|TopoGCL|✘|✘|✔|Topological Shape (via PH)|
|AutoST|✔|✔|✔|Multi-view Node×Time (2D)|

---

> [W3] There is no significant conceptual innovation in modeling the second mode (i.e., node interactions). How to statistic the node interaction information and the input and output of the node interaction modeling need to be further clarified.

`W3 Response`

Modeling OD flows alone is not novel. Our contribution lies in **jointly modeling independently evolving node and edge time series**, fused via 3D contrastive learning (Node×Edge×Time).

---

`W3 Full Response`

Thank you for the comment. Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

We acknowledge that modeling edge-level interactions (OD flows) alone is not novel; many prior works model OD flows in isolation. The key conceptual innovation of BINTS is in **jointly modeling and aligning two time-varying modalities**: node-level and edge-level time series. For instance, in a subway network, passenger densities (node-level) and inter-station flows (edge-level) have traditionally been treated as independent modalities. BINTS introduces a framework that fuses these signals to capture their complementary dynamics, using **3D contrastive learning (Node × Edge × Time)**. As summarized in Table X, prior works either use one modality or do not contrast them explicitly.

---

> [W4] The intuition of the method in temporal contrastive learning is not clear enough. The overall design of 3-dimensional(3D) soft contrastive learning lacks further research and has limited contribution.

> [Q1] How do the authors relate the research challenges to the proposed techniques? The challenges should be summarized concisely, and the corresponding solutions should be explained in the techniques section.

`W4, Q1 Response`

Temporal contrastive learning assumes temporally close patterns are more semantically similar, a widely used idea [a–c]. BINTS extends this principle to a **3D contrastive framework** across node, edge, and time.

---

`W4, Q1 Full Response`

We extend the temporal contrastive learning concept to 3D contrastive learning (Node × Edge × Time). BINTS aligns separate yet interdependent signals, such as densities and flows, via soft contrastive objectives. For instance, in mobility data, rising passenger densities can lead to increased flows; BINTS captures such dependencies via soft alignment of node and edge time series, which matches our core challenge: learning from separate but interdependent modalities.

[a] Duan, Jufang, et al. MF-CLR: Multi-Frequency Contrastive Learning Representation for Time Series. ICML 2024.
[b] Lee, Seunghan, Taeyoung Park, and Kibok Lee. Soft Contrastive Learning for Time Series. ICLR 2024.
[c] Chen, Minghao, et al. Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning. CVPR 2022.

---

> [W5] The experimental evaluation needs expansion in two points. First, the performance comparison primarily includes outdated methods rather than state-of-the-art approaches from 2024.
> Second, the ablation study's scope is restricted to the Deagu dataset, leaving its generalizability across all datasets unverified.

`W5 Response`

**(1)** PDFormer is flow-specific and incompatible with our setup; we plan to include it later. Instead, we added **iTransformer** and **TimeMixer**, top-performing recent models compatible with our setting (see Table E). **(2)** Due to space limits, we showed the ablation only on Daegu, but other datasets exhibit similar trends—removing spatial or temporal losses degrades performance. We will include the full results in the appendix of the final version.

---

`W5 Full Response`

Due to the tight character limit, here we provide the results of the additional experiments with more recent methods, iTransformer and TimeMixer.

***Table E. Comparison with Recent Works (average across all prediction lengths)***

|Models|Prediction Lengths|Daegu MAE|Daegu MSE|Busan MAE|Busan MSE|Seoul MAE|Seoul MSE|NYC MAE|NYC MSE|COVID MAE|COVID MSE|NYC_COVID MAE|NYC_COVID MSE|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|iTransformer|7d|0.5154|0.9308|0.5985|1.4457|0.5281|0.7249|0.4942|0.6516|0.5101|0.5633|0.4392|0.6471|
| |14d|0.6439|1.1214|0.7264|1.8659|0.5149|0.7028|0.5416|0.6447|0.7182|0.9394|0.4628|0.6957|
| |30d|0.6492|1.1360|0.7300|1.9286|0.5327|0.7364|0.5638|0.6868|0.7517|1.0129|0.4561|0.6983|
| |Avg.|0.6028|1.0627|0.6849|1.7467|0.5252|0.7214|0.5332|0.6611|0.6600|0.8386|0.4527|0.6803|
|TimeMixer|7d|0.4876|0.8642|0.5671|1.4420|0.4500|0.5620|0.4205|0.5187|0.4938|0.4965|0.4178|0.6025|
| |14d|0.4903|0.8697|0.5746|1.5024|0.4561|0.5779|0.4208|0.5200|0.5165|0.5373|0.4171|0.6078|
| |30d|0.4965|0.8801|0.5849|1.5754|0.4587|0.5845|0.4286|0.5357|0.5588|0.6181|0.4297|0.6279|
| |Avg.|0.4915|0.8713|0.5755|1.5066|0.4549|0.5748|0.4233|0.5248|0.5230|0.5506|0.4216|0.6127|

---

> [Q2] What is the purpose of introducing the prior Gaussian distribution in the temporal contrastive learning process? The use of ResNet for modeling the similarity matrix also seems to lack clear justification.

`Q2 Response`

The prior Gaussian encodes temporal proximity, aligning recent steps more strongly. ResNet1D is used to match the output shape; other simple encoders (e.g., CNN1D, MLP) are also applicable.

---

`Q2 Full Response`

The prior Gaussian distribution in temporal contrastive learning is used to encode temporal proximity, making temporally close representations more similar and distant ones less so. Intuitively speaking, recent patterns (e.g., 3 PM vs. 4 PM) are more semantically related than those far apart (e.g., 3 PM vs. 10 AM). ResNet1D is used lightly to match the output shape for forecasting. Any simple encoder (e.g., CNN1D, MLP, linear layers) could be used instead; we chose ResNet for its simplicity and stable training behavior.
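To illustrate the temporal-proximity intuition above, here is a minimal sketch (illustrative only; the bandwidth `sigma`, the softmax similarity, and the soft-target construction are assumptions for this note, not the exact BINTS formulation) in which a Gaussian prior over temporal distance supplies soft alignment targets:

```python
import torch

T, d_h = 24, 16
z = torch.randn(T, d_h)                        # toy per-time-step representations
sigma = 3.0                                    # assumed temporal bandwidth

# Gaussian prior over temporal distance: closer time steps get larger target weight.
steps = torch.arange(T, dtype=torch.float32)
dist = (steps[:, None] - steps[None, :]).abs()
prior = torch.exp(-dist**2 / (2 * sigma**2))   # (T, T) weights in (0, 1]

# Predicted similarities between time steps, row-normalized.
sim = torch.softmax(z @ z.T / d_h**0.5, dim=-1)

# Soft contrastive objective: cross-entropy between predicted similarities and
# the row-normalized Gaussian prior, so nearby steps are pulled together more.
target = prior / prior.sum(dim=-1, keepdim=True)
loss = -(target * torch.log(sim + 1e-8)).sum(dim=-1).mean()
print(loss.item())
```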
---

> [Q3] In the comparative experiments, DLinear and TimesNet perform remarkablely. Why can single-modality methods achieve results close to bi-modality approaches? Are there any other new baselines for comparison, such as PDformer, STPGNN, PatchFormer, and iTransformer?

`Q3 Response`

**(1)** Single-modality models perform well when one signal dominates, but their results vary across datasets. BINTS achieves consistent performance by combining modalities. **(2)** PDFormer, STPGNN, and PatchFormer focus on flow/image tasks and are not directly compatible with our bi-modal setting. Instead, we added recent top-3 time-series models (with TimesNet already included; see Table E).

> [Q4] The change in hyperparameter k does not seem to bring significant performance variation, especially in the MAE metric. Does this imply that a fully connected graph could also achieve good results?

`Q4 Response`

We clarify that the $k$-hop setting has nothing to do with a fully connected graph: $k$ controls how many hops away node interactions are considered in the contrastive learning process, i.e., it bounds the interaction depth. Even though MAE is stable across $k$, subtle variations in MSE and learning dynamics justify using a bounded $k$-hop structure, which balances performance and computational efficiency.

---

`Reminder`

Dear Reviewer 3BYJ,

Thank you for your thoughtful and constructive feedback. We greatly appreciate your recognition of the core strengths of our work, including the integration of temporal flow characteristics and OD flow patterns for networked time series prediction, the well-designed case studies that illustrate the dynamic interplay between the two modalities, and the contribution of new benchmark datasets in this domain. Your positive recognition of these aspects truly encourages us.

At the same time, we carefully addressed the concerns you raised. In our rebuttal, we provided detailed clarifications, revised terminology, improved the explanation of the 3D contrastive mechanism, and expanded the evaluation by incorporating new baselines. We greatly appreciate your feedback, which has helped us substantially in clarifying these points.

Since only two days remain in the discussion period, please feel free to share any further feedback. We’ll do our best to respond within the available time. Thank you once again for your time and consideration.

Best regards,
Authors

---

We are glad that our rebuttal has effectively addressed most of your concerns. Once again, we sincerely thank you for your time, thoughtful review, and invaluable suggestions, which have undoubtedly contributed to improving and refining this work. Regarding your suggestion, we fully agree that the paper would benefit from further clarity on the core challenges, key innovations, and the specific roles of each module. We will make sure to revise the final version with a clearer problem setting and explicitly highlight how each component—bi-modal encoders, spatial and temporal contrastive losses, and the 3D alignment strategy—contributes to addressing the challenges we identify.

---

## Reviewer4 - YyC8, Novelty(3), Technical Quality(2), Confidence(4) → Technical Quality(3)

### 2442 characters

> [W1] Insufficient Discussion of Existing Methods: The authors assert the necessity of bi-modal learning for both population density and population flow in the context of spatiotemporal learning; however, they provide insufficient discussion on the advantages and disadvantages of existing state-of-the-art methods, which undermines the motivation presented.

`W1 Response`

We summarize the distinction in Tables A and B (see the QEL2 and wtfq responses). While prior works focus on either node- or edge-level sequences, BINTS jointly models both over a shared topology using 3D contrastive learning (Node×Edge×Time), uniquely aligning independently evolving modalities for more robust forecasting.
> [W2] Limited Review of Related Works: Currently, node-level spatiotemporal forecasting encompasses population density, population inflow and outflow, and traffic flow, each involving numerous methodologies. Interaction-level spatiotemporal forecasting or generation often pertains to origin-destination flow prediction, which is closely related to population density [1, 2]. Furthermore, many advanced GNNs consider edge (interaction) features, including EdgeGNN, MPNN, and Graph Transformers, which have not yet been addressed.

`W2 Response`

While our main focus has been on bi-modal time-varying modeling (as shown in Table F), we agree that EdgeGNN, MPNN, and Graph Transformers represent an important family of GNN architectures. However, these methods focus on static graphs and do not handle time-varying signals. Similarly, models like GODDAG and DeepFlowGen incorporate edge-level information or OD reasoning but operate in **single-modality or static contexts**. In contrast, BINTS models both node- and edge-level time series and aligns them contrastively. We will revise the related work section to clarify our **unique** formulation.

---

`[W2 Full Response]`

Thank you for pointing this out. Due to the tight character limit, here we provide the additional comparison table.

***Table F. Modality Coverage Comparison***

|Method|Node-level TS|Edge-level TS|Graph Topology|
|--|--|--|--|
|**Ours (BINTS)**|✔|✔|✔|
|[1] GODDAG|✘|✔ (static OD)|✔|
|[2] DeepFlowGen|✘|✔ (via POI dynamics)|✔|
|EdgeGNN|✘|✘|✔|
|MPNN|✘|✘|✔|
|Graph Transformer|✘|✘|✔|

---

> [W3] Complex Methodology: Firstly, the design of the node feature encoder and node interaction encoder primarily utilizes classical TCN and GCN without discussing the advantages of more advanced spatiotemporal forecasting methods. Secondly, while the spatial contrastive loss incorporates nested hinge loss, the justification for the Gaussian constraint in the temporal contrastive loss is lacking. Additionally, the total loss in Equation (11) consists of three hyperparameters for the three loss components; however, it could be simplified to two hyperparameters by dividing $w^{prediction}$. This unnecessary complexity challenges the credibility of the proposed BINTS. Lastly, the overall computational complexity of bi-modal learning is not adequately discussed.

> [Q2] Complexity Considerations: Why is nested hinge loss applied to both $D_{neg}$ and $L^{spatial}$? Is this design essential, and what are its experimental effects? What is the complexity of the proposed BINTS, and how does it compare to baseline methods?

`W3, Q2 Response`

First, we used simple encoders (TCN, GCN) to isolate the effect of bi-modal contrastive learning; our modular design allows easy upgrades. Second, the Gaussian prior reflects temporal locality, as seen in prior works [a–c]. We also clarify that the "hinge loss" in Line 493 was a typo—it is the standard margin ranking loss [d, e]. Regarding Equation (11), we agree that the formulation can be simplified by dividing out $w^{prediction}$; we will adopt a more compact notation in the final version—thank you for the suggestion. Table 5 already compares runtime vs. accuracy, showing that BINTS is practical despite its bi-modal design.
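For concreteness, assuming Equation (11) takes the usual weighted-sum form (this is only a sketch of the algebraic point; the exact formulation is the one in the paper), dividing by $w^{\text{prediction}}$ leaves two effective hyperparameters:

$$
\mathcal{L} = w^{\text{spatial}}\mathcal{L}^{\text{spatial}} + w^{\text{temporal}}\mathcal{L}^{\text{temporal}} + w^{\text{prediction}}\mathcal{L}^{\text{prediction}}
\;\propto\;
\tilde{w}^{\text{spatial}}\mathcal{L}^{\text{spatial}} + \tilde{w}^{\text{temporal}}\mathcal{L}^{\text{temporal}} + \mathcal{L}^{\text{prediction}},
\qquad
\tilde{w}^{(\cdot)} = \frac{w^{(\cdot)}}{w^{\text{prediction}}}.
$$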
---

`[W3, Q2 Full Response]`

We appreciate the reviewer's detailed feedback on the methodology design. Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

We chose classical architectures (TCN for nodes, GCN for edges) to emphasize the contribution of the **bi-modal contrastive learning mechanism**, not the backbone architecture. Since the two modalities are orthogonal in nature, our design separates their encoders, and the framework is modular; that is, more advanced temporal or spatial encoders can be plugged in.

Regarding the temporal contrastive loss, we employ a Gaussian prior to softly enforce temporal locality: patterns closer in time should be more similar than distant ones. This idea is well-supported in the literature [a–c], especially in time-series contrastive learning for structured data such as video or sensor signals.

For the hinge loss, the loss in Eq. (6) is based on a margin ranking loss, not a "nested hinge loss" as mistakenly described in Line 493. We sincerely appreciate your careful reading and will correct it in the final version. To clarify, our formulation uses the hinge function once in the computation of $D_{neg}$, and the final loss in Eq. (6) follows the standard margin ranking loss design, which is widely used in contrastive learning [d, e] (the generic form is sketched after the references below).

<!--As for computational complexity, we have already provided a comparison with single-modality baselines in Table 5, where BINTS demonstrates favorable trade-offs between accuracy and runtime, showing its practicality despite its bi-modal design.-->

**References**
[a] Duan et al., MF-CLR: Multi-Frequency Contrastive Learning for Time Series, ICML 2024.
[b] Lee et al., Soft Contrastive Learning for Time Series, ICLR 2024.
[c] Chen et al., Frame-wise Action Representations via Sequence Contrastive Learning, CVPR 2022.
[d] Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006.
[e] Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss, NeurIPS 2016.
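For reference, the generic margin ranking form alluded to above (a standard statement of the design in [d, e], not a restatement of Eq. (6) itself) compares a positive similarity $s^{+}$ with a negative similarity $s^{-}$ under a margin $m > 0$:

$$
\mathcal{L}_{\text{rank}} = \max\bigl(0,\; m - s^{+} + s^{-}\bigr),
$$

which vanishes once positive pairs outscore negative pairs by at least the margin.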
---

> [W4] Overstated Performance Claims: The authors state in the introduction that BINTS can surpass existing methods by up to 69.32% in terms of MAE; however, BINTS does not exceed the performance of the second-best baseline by this margin, as shown in Table 2. Furthermore, BINTS underperforms compared to baseline methods in the Busan, NYC Taxi, COVID, and NYC COVID datasets. The unsatisfactory performance of BINTS contradicts the motivation outlined in this paper.

> [Q3] Performance Assessment: Do the authors assert that BINTS outperforms the worst baseline by up to 69.32% in terms of MAE? However, DLinear and TimesNet can also achieve this level of improvement. Typos:

`W4, Q3 Response`

We reported MAE gains over the worst baselines, not the second-best models. To clarify, we now compare against strong single-modality baselines. BINTS improves over **DLinear** by −6% to 34.5% (MAE) and 2.4% to 45.9% (MSE), and over **TimesNet** by −13.5% to 10.1% (MAE) and −6.2% to 40.8% (MSE). While gaps vary by dataset, BINTS consistently performs better in most cases.

---

`W4, Q3 Full Response`

Thank you for pointing this out. Due to the tight character limit, we briefly addressed this point earlier. Here, we provide additional clarification and discussion.

In the experiments, we reported the performance range (8.19–69.32%) based on **average MAE improvements across all datasets** relative to the worst-performing baselines, not just the second-best. We agree that this phrasing may be misleading if not contextualized properly. In the final version, we will clearly report detailed improvement margins as shown in Table G (which reports the average performance gap across all datasets) and avoid overstated claims in the evaluation.

***Table G. Performance Gain***

|Baseline|MAE Gain (%)|MSE Gain (%)|
|--|--|--|
|DLinear|14.87|25.47|
|NLinear|19.67|34.27|
|SegRNN|46.44|61.98|
|Informer|45.05|57.19|
|Reformer|44.46|59.05|
|PatchTST|17.98|28.78|
|TimesNet|8.19|18.98|
|STGCN|34.00|61.95|
|MPNNLSTM|49.51|71.34|
|STSSL|69.32|76.05|
|GWNet|21.81|30.68|
|DSTAGNN|29.98|41.81|

---

> [Q1] Methodological Clarifications: What does $M$ represent in $\mathbb{R}^{(M+N)}$ in Section 4.1? What distinguishes the proposed method from self-supervised methods in graph learning? Why have advanced encoders not been utilized for node and interaction representations?

`Q1 Response`

Thank you for your questions. The symbol $M$ is the **node feature dimensionality**, as defined in **Lines 307-309** of the main paper. Unlike general graph SSL methods, BINTS aligns bi-modal signals via soft contrastive learning. We intentionally used common encoders to highlight our bi-modal framework; other, more advanced encoders can surely be incorporated.

> [Q4] In the abstract, the last sentence should conclude with a period.

`Q4 Response`

We will fix the missing period in the final version.

---

## Reviewer5 - cAGv, Novelty(3), Technical Quality(3), Confidence(3)

### 2369 characters

> [W1] Lack of Definition for Networked Time Series. The paper does not provide a clear definition of the Networked Time Series mentioned in the Title. What distinguishes Networked Time Series from other related concepts such as temporal-spatial graphs or traditional time series? A formal clarification would improve the conceptual clarity of the paper.

`W1 Response`

Thank you for the thoughtful comment. While our problem setting is described in Section 4, we agree that the term "Networked Time Series" needs a clearer definition. In BINTS, it refers to a setting where nodes and edges on a static graph emit independent, time-varying signals (densities and OD flows). Unlike traditional time series (unstructured) or spatio-temporal graphs (typically single-modality), our setup explicitly models and aligns bi-modal signals. We will define this setup more clearly in the final paper.

> [W2] Handling of Modal Discrepancy. As illustrated in Figures 2 and 3, node features and node interactions can exhibit completely different correlation patterns (e.g., positive in Figure 2 vs. negative in Figure 3). How does the proposed model account for these contrasting correlations between modalities? What impact do these correlations have on the learning process?

`W2 Response`

Indeed, as shown in Figures 2 and 3, node and edge modalities can exhibit either positively or negatively correlated patterns across time. The key is that some form of correlation exists, whether aligned or inverse, which our model is designed to capture. BINTS uses soft contrastive alignment that does not assume any fixed correlation type. It learns to align these modalities flexibly, capturing useful patterns even when the relationship is inverse (rising infections vs. reduced mobility).

> [W3] Lack of Qualitative Analysis. Although the paper includes extensive quantitative experiments, the inclusion of qualitative analysis—such as a case study or visualization of similarity matrices—could greatly enhance the understanding of how the model works in practice.

`W3 Response`

We appreciate your suggestion.
Appendix D already includes prediction visualizations, and we will expand it with more case studies. While Appendix A shows curated data patterns, we agree that including model-derived similarity-matrix visualizations would aid understanding, and we will add them in the final version.

> [W4] Absence of Theoretical Analysis. The paper would be strengthened by a theoretical explanation of why bi-modal learning leads to performance gains. A formal analysis could help justify the design choices and further validate the effectiveness of the approach.

`W4 Response (references in comments)`

Thank you for your insightful suggestion. Theoretical works [a, b] show that multi-modal learning improves generalization by **reducing representation uncertainty**. In our case, we can formalize this via Rademacher complexity [c, d]: if $\mathcal{H}_n$ and $\mathcal{H}_e$ denote the hypothesis spaces for the node and edge modalities, respectively, we can model the joint space as $\mathcal{H}_{\text{joint}} \subseteq \mathcal{H}_n + \mathcal{H}_e$ with $\mathcal{R}_n + \mathcal{R}_e \geq \mathcal{R}_{\text{joint}}$, where $\mathcal{R}$ denotes the Rademacher complexity. This implies that bi-modal fusion reduces model complexity. We will discuss the theory in detail in the final version.

`final`

Thank you for your insightful suggestion. Theoretical works [a, b] show that multi-modal learning improves generalization by **reducing representation uncertainty**. We formalize multi-modality via Rademacher complexity [c, d]. Let $\mathcal{H}_n$ and $\mathcal{H}_e$ denote the hypothesis spaces for the node and edge modalities, respectively. Then the joint hypothesis space satisfies $\mathcal{H}_{\text{joint}} \subseteq \mathcal{H}_n + \mathcal{H}_e$ with corresponding complexity $\mathcal{R}_n + \mathcal{R}_e \geq \mathcal{R}_{\text{joint}}$, where $\mathcal{R}$ denotes the Rademacher complexity. The inequality implies that bi-modal fusion leads to a reduced hypothesis space and lower model complexity. A detailed discussion will be included in the final version.

---

**References**
[a] Lu, Zhou. On the Computational Benefit of Multimodal Learning. International Conference on Algorithmic Learning Theory, 2024.
[b] Huang, Yu, et al. What Makes Multi-modal Learning Better than Single (Provably). NeurIPS, 2021.
[c] Bartlett, Peter L., and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3, 463–482, 2002.
[d] Gnecco, Giorgio, and Marcello Sanguineti. Approximation Error Bounds via Rademacher's Complexity. Applied Mathematical Sciences, Vol. 2, no. 4, 153–176, 2008.

---

> [W5] No Discussion of Limitations. A discussion of such limitations would provide a more balanced and rigorous evaluation.

`W5 Response`

Thank you. We will include limitations such as higher data collection cost and potential computational overhead. However, as shown in **Section 5.5**, BINTS maintains fast inference (0.36–3.45 ms/instance), showing it remains practical for real-time use.

---

`Reminder`

Dear Reviewer cAGv,

Thank you for your thoughtful and constructive feedback. We sincerely appreciate your recognition of the key strengths of our work, including the problem setting, the experimental results, and the well-written and well-organized paper. Your positive assessment of these aspects is deeply encouraging to us.

At the same time, we carefully addressed the concerns you raised.
In our rebuttal, we provided detailed clarifications on your concerns: defining “Networked Time Series”, clarifying cross-modal discrepancies, adding qualitative analysis, providing theoretical justification via Rademacher complexity, and discussing limitations. Your feedback has greatly helped us strengthen the presentation and technical clarity of our work.

Since only two days remain in the discussion period, please don’t hesitate to let us know if any further questions or concerns remain. We’ll do our best to respond within the available time. Thank you once again for your time and consideration.

Best regards,
Authors

---

Thank you for your thoughtful and constructive feedback. We have made every effort to address your concerns and are pleased to hear that the rebuttal has resolved them. In our rebuttal, we provided detailed clarifications: defining “Networked Time Series”, clarifying cross-modal discrepancies, providing theoretical justification via Rademacher complexity, and discussing limitations. Your feedback has been helpful in strengthening the presentation and technical clarity of our work. If any concerns remain, we are happy to provide further clarification and improvements within the remaining discussion period, and we would greatly appreciate it if these efforts could be reflected positively in your final assessment.
