# AAAI'25 Rebuttal Submission
## Common comment
We sincerely appreciate the reviewers' valuable comments. In general, we are very pleased that the reviewers have concurred with the three main contributions of this study: **(1) novelty**---"new framework for bi-modal learning from time-varying node features and interactions within a graph topology" (Reviewer Xuui); **(2) practicality**---"curating a comprehensive human-mobility dataset, and the process of curating the data has also been released", "many real-world case analyses (Figure 2-3) to simultaneously motivate the study of node features and interactions", and "code and datasets are made available, and detailed descriptions of the experimental setups, architectures, and metrics are included" (Reviewers AaoK and fdLX) including "performance improvements over baselines" (Reviewer fdLX).
We also hope that our clarifications can further address some concerns raised by the reviewers, which can be summarized as follows.
* **Improving the presentation of the motivation** (Reviewers mtGU, fdLX, and Xuui): In response to the reviewers' valuable feedback, we have comprehensively revised the paper to clearly express the unique contributions of our work. Specifically, we have provided a thorough justification for the proposed bi-modal learning framework.
* **Clarifying the technical contribution** (Reviewers mtGU and Xuui): We would like to emphasize our technical contributions in two main aspects. First, we have explained the importance of spatial soft contrastive learning, which aligns representations of the same region across different modalities, enhancing spatial consistency. Second, we have detailed the role of temporal soft contrastive learning, which ensures temporal coherence by aligning 3D cube representations along the time axis. These clarifications are intended to provide a sufficient understanding of the unique technical contributions of our paper.
* **Enriching the hyperparameter experiment results** (Reviewers NKfj and gMPs): We have added results for relevant graph-based baselines and analyses of the effects of key hyperparameters.
We will make sure that all suggestions and clarifications made in this response are reflected in the final version of the paper.
---
## Summary
We summarize and categorize all reviews as follows:
* **Novelty**
  * Why is this problem unique, and why does this particular problem require graph-based components to represent dependencies between time series?
  * How do the authors contribute to bi-modal learning in this scenario?
  * Demonstrate the difference from previous spatio-temporal GNNs: why are previous methods suboptimal for this domain, and what is the critical challenge?
* **Technique/Methods**
  * The proposed method does not feel particularly novel.
  * Why use 5 hops? It makes the computation considerably more expensive.
  * Explanation of the weights in the loss function.
  * What is the edge index in Eq. (2)? By this point, it has not been introduced.
  * What is the difference between using the exponential and the linear decay function?
  * In the inference stage, where does Eq. (9) use the embeddings $\mathbf{h}_{\text{GCN}}^{r_j, t}, \mathbf{h}_{\text{TCN}}^{r_j, t}$ from the model?
  * Why is the posed problem technically challenging?
  * Why is contrastive learning innovative, and why is it needed to solve this problem?
* **Experiments**
  * GNN-based baselines (STGCN, MPNNLSTM, ST-SSL)
  * Transportation-related approaches
  * $k$-hop adjacency matrix sensitivity
  * Hyperparameter experiments: gamma for the hinge loss in spatial contrastive learning, the margin ($m$) in the ranking loss, and the weights ($w_{spatial}$, $w_{temporal}$, $w_{prediction}$) for the total loss function
---
## Reviewer1 - mtGU, Rating(4), Confidence(4)
> *Q1. Although the paper frames it as a new approach, in fact any method were there is a series of time series, and then we can build a pairwise relationship between the time series, will fit into the proposed problem formulation. It does not require the pairwise relationship to be origin-destination, and this is only relevant as the paper may have transportation/tap-on, tap-off data. One thing useful would be to provide more explanation what makes this problem, of time-series (based on transportation?) with origin-destination, unique and requires another method that uses graph based components to represent dependencies/relationships between time-series.*
Thank you for your thoughtful comments and for highlighting the need for a more robust justification. We will enhance the discussion on our contributions to clarify the unique aspects of our research.
```Q1 Response```
Our approach is applicable to the transportation domain even when tap-on/tap-off data is not available, and **can be extended to fields** such as epidemiology. For example, in the epidemic domain, **infection counts** exist independently and are incorporated with **origin-destination** flows and the spatial **network** within the framework. Therefore, what makes our work unique is the **integration of node interactions, node features, and spatial networks** collected from diverse raw sources. In other words, our framework combines human mobility data (spatial networks, time-varying node features, and interactions) from disparate sources to facilitate a unified interpretation, thereby enhancing the understanding of **human behavior** and supporting future predictions.
```Q1 Short ver.```
Our approach applies to the transportation domain, even without tap-on/tap-off data, and can **extend to fields** like epidemiology. For instance, infection counts in epidemiology integrate with origin-destination data and spatial networks within our framework. Our framework uniquely combines human mobility data (spatial networks, time-varying node features, and interactions) from **diverse sources**, enhancing understanding of **human behavior** and enabling future predictions.
> *Q2. The proposed method is also not that novel. It proposes standard time series representation learning, combined with constrastive loss, which is really building a spatial and temporal similarity measure/kernel for the representation. The idea to use multiple hop relations to, supposedly enrich the network and longer range dependencies is reasonable, although using 5 hops as the default seems on the extreme side of things. Typically there is little influence for nodes that are 5 hops away, particularly when we describing aggregate origin-destination flows. It also make the network much more dense, and computationally much more expensive. There is also a minor improvement to the weights for the loss function, as it appears the weights add to 1, so we don't need 3 (just need 2, the third one can be derived from the first 2 weights).*
```Q2 Response```
The proposed method goes beyond standard time-series representation learning. We propose the **effective integration of graph representations and time-series representations** to capture meaningful patterns across regions. This fusion is conceptually similar to CLIP [a], which combines image and text modalities. In our case, due to the time-varying nature of human mobility data, we construct a **3D similarity matrix** to facilitate temporal contrastive learning as well.
[a] Radford, Alec, et al. "Learning transferable visual models from natural language supervision.", ICML, 2021.
We have conducted additional experiments on $k$-hop sensitivity. In its basic form, the adjacency matrix consists of binary values (0 or 1), indicating the presence or absence of direct connections between nodes. However, using this initial adjacency matrix yielded suboptimal performance, highlighting the need for **a broader spatial context**. Although a $k$-hop configuration increases computational cost, it provides significant performance gains that justify this choice. Thus, finding an optimal $k$ is essential for balancing performance improvements against computational efficiency.
<!-- As the k value increases, computational cost rises considerably; conversely, if k is too small, the model fails to capture sufficient spatial relationships. -->
Due to the space limit, we report additional $k$-hop hyperparameter sensitivity experiments using the Busan dataset with a prediction length of 14 days. We also report the running time to build the $k$-hop adjacency matrix for each $k$; a sketch of how this weighted adjacency can be constructed follows the table.
<span style="background-color:#fff5b1"> + k-hop experiment table </span>
| k | MAE | MSE | Running Time (s) |
|----|-------|-------|------------------|
| 1 | 0.4904 | 0.5412 | 0.0010 |
| 2 | 0.4235 | 0.4265 | 0.0061 |
| 3 | 0.4231 | 0.4225 | 0.0747 |
| 4 | 0.4180 | 0.4211 | 0.0794 |
| 5 | **0.4169** | **0.4041** | 0.0325 |
| 6 | 0.4178 | 0.4179 | 0.0468 |
| 7 | 0.4201 | 0.4233 | 0.3933 |
| 8 | 0.4201 | 0.4240 | 0.7061 |
| 9 | 0.4198 | 0.4233 | 1.3855 |
| 10 | 0.4195 | 0.4207 | 0.6125 |
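Below is a minimal sketch of how such a decay-weighted $k$-hop adjacency can be constructed (the function name and the use of NumPy are illustrative, not our exact implementation):

```python
import numpy as np

def khop_decay_adjacency(adj, k_max=5):
    """Sketch: weight each node pair by exp(-(k-1)), where k is its hop distance (k <= k_max).

    adj: binary (0/1) adjacency matrix of shape (N, N).
    """
    n = adj.shape[0]
    hop = np.full((n, n), np.inf)      # shortest hop distance per pair
    reach = np.eye(n, dtype=bool)      # pairs already reached
    frontier = np.eye(n, dtype=bool)
    for k in range(1, k_max + 1):
        # pairs first reached at exactly k hops
        frontier = ((frontier.astype(int) @ (adj > 0).astype(int)) > 0) & ~reach
        hop[frontier] = k
        reach |= frontier
    # g^(k) = exp(-(k-1)) for 1 <= k <= k_max, 0 otherwise
    return np.where(np.isfinite(hop) & (hop >= 1), np.exp(-(hop - 1)), 0.0)
```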
Each weight in the loss function has a distinct role, influencing model updates through the spatial contrastive loss, the temporal contrastive loss, and the prediction loss. Consequently, the weights for these three losses are essential for effectively learning the representations and maintaining balance across the components of our framework.
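For concreteness, a minimal sketch of this convex combination (the loss arguments are placeholders; the example weights match one setting from the hyperparameter table in our response to Reviewer Xuui):

```python
def total_loss(l_spatial, l_temporal, l_prediction,
               w_spatial=0.25, w_temporal=0.25, w_prediction=0.5):
    # The three weights sum to 1; each term steers a different aspect of training:
    # spatial alignment, temporal alignment, and the downstream prediction error.
    assert abs(w_spatial + w_temporal + w_prediction - 1.0) < 1e-8
    return (w_spatial * l_spatial
            + w_temporal * l_temporal
            + w_prediction * l_prediction)
```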
```Q2 Short ver.```
Unlike standard time-series representation learning, our method integrates graph and time-series representations to capture patterns across regions. This fusion, conceptually similar to CLIP [a] (combining image and text), employs a **3D similarity matrix** for temporal contrastive learning, reflecting the time-varying nature of mobility data.
Additional experiments on $k$-hop sensitivity show that a basic adjacency matrix (binary connections) offers limited performance. Using a $k$-hop configuration improves performance, justifying the added computational cost. Optimal $k$ selection is thus crucial for balancing performance with efficiency. Due to space limits, we report $k$-hop hyperparameter sensitivity with a 14-day prediction on the Busan dataset, including the runtime for each $k$.
Each weight in the loss function serves a specific purpose, guiding model updates through spatial/temporal contrastive loss, and prediction loss. These weights are crucial for balanced representation learning across all components in our framework.
> *3. Experiments appear reasonable, with a number of transport related datasets (are they all publicly available?), and a number of baselines. It would be ideal to include more GNN based approaches, given these are most similar to proposed works. Also related work in transport prediction would be useful to compare against. The presented results show improvement over baselines, but it would be ideal to have more baselines related to transporation related approaches.*
```Q3 Response```
<span style="background-color:#fff5b1"> + GNN-based baseline MAE table </span>
| **Model** | **Days** | **Daegu** | **Busan** | **Seoul** | **NYC** | **COVID** | **NYC COVID** |
|---|---|---|---|---|---|---|---|
| GWNet | 7 | 0.453 | 0.455 | 0.459 | 0.414 | 0.698 | 0.556 |
| | 14 | 0.452 | 0.454 | 0.439 | 0.413 | 0.722 | 0.549 |
| | 30 | 0.455 | 0.457 | 0.445 | 0.421 | 0.752 | 0.575 |
| | Avg | 0.453 | 0.455 | 0.448 | 0.416 | 0.724 | 0.560 |
| DSTAGNN | 7 | 0.454 | 0.446 | 0.476 | 0.446 | 0.881 | 0.691 |
| | 14 | 0.463 | 0.438 | 0.469 | 0.448 | 0.888 | 0.697 |
| | 30 | 0.447 | 0.438 | 0.491 | 0.452 | 0.906 | 0.707 |
| | Avg | 0.455 | 0.441 | 0.479 | 0.449 | 0.892 | 0.698 |
| ***BINTS*** | 7 | 0.411 | 0.413 | 0.391 | 0.405 | 0.328 | 0.398 |
| | 14 | 0.417 | 0.417 | 0.396 | 0.412 | 0.339 | 0.407 |
| | 30 | 0.423 | 0.426 | 0.408 | 0.417 | 0.356 | 0.405 |
| | Avg | **0.417** | **0.419** | **0.398** | **0.411** | **0.341** | **0.403** |
For reproducibility, we uploaded our implementation code and all datasets to an anonymous GitHub repository. We also conducted additional experiments on GWNet [b] and DSTAGNN [c], GNN-based baselines for spatio-temporal traffic forecasting. The results show that our approach outperforms the three existing GNN-based baselines and the two additional ones across different forecasting horizons, using a 4-day look-back window and the MAE evaluation metric.
```Q3 Short ver.```
For reproducibility, we uploaded our code and datasets to an anonymous GitHub. Additional experiments on GNN-based baselines, GWNet [b] and DSTAGNN [c], showed that ***BINTS*** outperformed three existing GNN-based baselines and two additional ones across different forecasting horizons, using a 4-day look-back window and MAE metric.
> *Q4. The paper is relatively easy to follow. However, in the introduction, it could motivate more about a) why is it technically challenging for problem posed (current motivation is not convincing, why is analysing different modalities difficult, when lots of prior work have done this - perhaps not for transporation, but definitely in other areas); b) why is the contrastive loss learning innovative and why is it needed to solve this problem.*
```Q4 Response```
<!-- Thank you for the feedback. We will improve the paper to better highlight the technical challenges and innovations addressed by our work. -->
a) Technical Challenges
Our research addresses the difficulty of integrating multiple modalities, including node features, node interactions, and spatial networks, each often originating from a different source. Prior work on modality integration has generally not focused on the transportation domain, where data presents unique complexities such as time-varying human mobility patterns and spatial dependencies. Our approach is distinct in that it leverages both graph-based and temporal components to represent these dependencies, effectively integrating the disparate bi-modal inputs to provide a unified view of mobility behavior.
b) Necessity and Novelty of Contrastive Loss
As we mentioned above, the use of contrastive loss for spatial and temporal learning is crucial in this context, as it enables the model to capture **semantic similarities across regions and time**. The BINTS design is conceptually similar to the CLIP framework, allowing our model to learn both spatial and temporal patterns through contrastive mechanisms. Unlike prior methods, our framework constructs a **3D similarity matrix** for temporal contrastive learning, which accommodates the dynamic nature of human mobility data. This innovation is essential for enhancing predictive performance, enabling the model to better represent relationships between nodes over time.
```Q4 Short ver.```
a) Our research addresses the challenge of integrating multiple modalities (node features, interactions, and spatial networks), often from disparate sources. Unlike typical approaches, we focus on the transportation domain, which presents unique complexities such as time-varying mobility patterns and spatial dependencies. Our approach combines graph-based and temporal components to unify these bi-modal data sources, offering a cohesive view.
b) The use of contrastive loss for spatial and temporal learning is key, as it captures semantic similarities across regions and time. Our design, inspired by the CLIP framework, learns spatial and temporal patterns through contrastive mechanisms. Unlike previous methods, it constructs a 3D similarity matrix for temporal contrastive learning, which effectively models the dynamic relationships between nodes, enhancing predictive performance.
[a] Radford, Alec, et al. "Learning transferable visual models from natural language supervision.", ICML, 2021.
[b] Wu, Zonghan, et al. "Graph wavenet for deep spatial-temporal graph modeling.", IJCAI, 2019.
[c] Lan, Shiyong, et al. "Dstagnn: dynamic spatial-temporal aware graph neural network for traffic flow forecasting.", ICML, 2022.
---
## Reviewer2 - fdLX, Rating(5), Confidence(3)
> *Q1. How the authors contribute to the bi-modal learning in this scenario? It seems that the paper adopt separate encoders for interactions and features, and use temporal and spatial contrastive loss. It is necessary to demonstrate the difference between this paper and previous spatial-temporal GNNs, to highlight the novelty.*
```Q1 Response```
Our contribution to bi-modal learning in this scenario is centered on how we integrate two distinct modalities: node features and node interactions. Unlike traditional methods that may use concatenation or simple merging, our framework employs separate encoders for each modality and combines them through both temporal and spatial contrastive learning. This approach allows us to capture the relationships within each modality while also understanding how they **interact over time and space**.
In contrast to previous spatio-temporal GNNs, which often focus on a single modality or straightforward integration, our method provides a **structured way** to learn from the network structure, node features, and node interactions **simultaneously**. By preserving the distinct properties of each modality with distinct encoders, our approach more effectively represents the complexities of human mobility dynamics.
```Q1 short ver.```
Our contribution to bi-modal learning integrates two distinct modalities: node features and interactions. Instead of simple merging, our framework uses separate encoders for each modality, combining them through temporal/spatial contrastive learning to capture intra-modality relationships and interactions over time and space.
Unlike prior GNNs that often focus on a single modality or basic integration, our method simultaneously learns from network structure, node features, and interactions, allowing us to more effectively represent human mobility dynamics.
> *Q2. Better to incorporate temporal or dynamic GNN baselines to have sufficient evaluations.*
```Q2 Response```
<span style="background-color:#fff5b1"> + GNN-based baseline MAE table </span>
| **Model** | **Days** | **Daegu** | **Busan** | **Seoul** | **NYC** | **COVID** | **NYC COVID** |
|---|---|---|---|---|---|---|---|
| GWNet | 7 | 0.453 | 0.455 | 0.459 | 0.414 | 0.698 | 0.556 |
| | 14 | 0.452 | 0.454 | 0.439 | 0.413 | 0.722 | 0.549 |
| | 30 | 0.455 | 0.457 | 0.445 | 0.421 | 0.752 | 0.575 |
| | Avg | 0.453 | 0.455 | 0.448 | 0.416 | 0.724 | 0.560 |
| DSTAGNN | 7 | 0.454 | 0.446 | 0.476 | 0.446 | 0.881 | 0.691 |
| | 14 | 0.463 | 0.438 | 0.469 | 0.448 | 0.888 | 0.697 |
| | 30 | 0.447 | 0.438 | 0.491 | 0.452 | 0.906 | 0.707 |
| | Avg | 0.455 | 0.441 | 0.479 | 0.449 | 0.892 | 0.698 |
| BINTS | 7 | 0.411 | 0.413 | 0.391 | 0.405 | 0.328 | 0.398 |
| | 14 | 0.417 | 0.417 | 0.396 | 0.412 | 0.339 | 0.407 |
| | 30 | 0.423 | 0.426 | 0.408 | 0.417 | 0.356 | 0.405 |
| | Avg | **0.417** | **0.419** | **0.398** | **0.411** | **0.341** | **0.403** |
This table shows the MAE performance of GWNet [a] and DSTAGNN [b], two additional GNN-based baselines for spatio-temporal prediction, with a 4-day look-back window. Both additional approaches capture temporal and spatial aspects on a static spatial topology. Meanwhile, STGCN, one of the existing GNN baselines, also captures temporal and spatial aspects, and MPNNLSTM, another existing GNN baseline, handles dynamic graph structures. Therefore, we provide sufficient GNN baseline evaluations. As a result, BINTS outperforms the three existing GNN baselines and the two additional ones.
```Q2 short ver.```
This table shows MAE performances of two additional GNN-based baselines, GWNet[a] and DSTAGNN[b], for spatio-temporal prediction with a 4-day look-back window across prediction lengths (7, 14, and 30 days). Additionally, STGCN (existing GNN baseline) captures temporal and spatial aspects, while MPNNLSTM handles dynamic graph structures. This comprehensive comparison shows that BINTS outperforms three existing and two additional GNN baselines.
<!-- We'll also show the performance of GWNet[a] and DSTAGNN[b]. The provided GNN-based baselines with MAE are all transportation-based models, so we included additional baselines to further validate our approach. We demonstrated the results with additional experiments.-->
> *Q3. The authors are expected to show what the critical challenge is for this domain, and why previous methods are suboptimal for this domain.*
```Q3 Response```
The primary challenge in this domain is accurately capturing the complex, dynamic relationships across **both spatial and temporal dimensions within the static networks**. Effective modeling requires an approach that can address both the evolving attributes of individual nodes and node interactions between regions as these features change over time.
Previous methods often fall short because they typically focus on a single aspect—either spatial structure or temporal dynamics—**without deeply integrating both**. Many approaches treat graph structures as static [c] or apply sequential methods [a, d] that overlook the interdependent nature of locations and time-varying movement patterns. This limitation makes it difficult to forecast mobility patterns effectively, where both individual node attributes and the flow of interactions across the network are crucial.
Our approach addresses these challenges by employing dedicated encoders for each **modality** and integrating them through spatial and temporal contrastive learning. This framework captures the complex dependencies both within and between modalities, resulting in a comprehensive representation that significantly enhances predictive accuracy in networked environments.
```Q3 short ver.```
The main challenge in this domain is capturing complex, dynamic relationships across **spatial and temporal dimensions within static networks**. Effective modeling requires addressing both evolving node attributes and regional interactions over time. Previous methods lack integration, focusing on either spatial structure or temporal dynamics, limiting accurate mobility forecasting. Many treat graph structures as static[c] or use sequential methods[a, d] that overlook interdependent, time-varying movement patterns. Our approach addresses these issues with dedicated encoders for each modality, integrated through spatial and temporal contrastive learning, enhancing predictive accuracy in networked environments.
> *Q4. Better to provide captions for the framework.*
```Q4 Response```
BINTS uses supervisory signals within paired node features and node interactions to learn perceptual representations. By aligning a 3D similarity cube (constructed from 2D spatial similarity matrices stacked along the temporal axis) with the static network, our framework captures spatial and temporal dependencies through joint training of a time-series encoder and a graph encoder for accurate prediction of future target attributes.
```Q4 short ver.```
BINTS uses supervisory signals within paired node features and interactions to learn perceptual representations. By aligning a 3D similarity cube (constructed from 2D spatial similarity matrices stacked along the temporal axis) with the static network, our framework jointly trains a time-series encoder and a graph encoder to capture spatial and temporal dependencies for accurate prediction of future target attributes.
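A minimal sketch of how the 3D similarity cube can be assembled from the two encoders' outputs (the tensor shapes and the use of cosine similarity are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def similarity_cube(h_gcn, h_tcn):
    """Sketch: stack per-timestep region-by-region similarity matrices into a cube.

    h_gcn, h_tcn: embeddings of shape (T, N, D) from the graph encoder and the
                  time-series encoder (T timesteps, N regions, D dimensions).
    Returns a (T, N, N) cube of cross-modal cosine similarities.
    """
    h_gcn = F.normalize(h_gcn, dim=-1)
    h_tcn = F.normalize(h_tcn, dim=-1)
    return torch.einsum('tnd,tmd->tnm', h_gcn, h_tcn)
```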
[a] Wu, Zonghan, et al. "Graph wavenet for deep spatial-temporal graph modeling.", IJCAI, 2019.
[b] Lan, Shiyong, et al. "Dstagnn: Dynamic spatial-temporal aware graph neural network for traffic flow forecasting.", ICML, 2022.
[c] Yu, Bing, et al. "Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting.", IJCAI, 2018.
[d] Panagopoulos, George et al. "Transfer graph neural networks for pandemic forecasting.", AAAI, 2021.
---
## Reviewer3 - AaoK, Rating(6), Confidence(3)
> *Q1. In Figure 3, it is unclear whether there is some positive correlation between the total flows and infection cases. For example, the spike in infection cases on Dec 21 does not align with the spike in total flow.*
```Q1 Response```
The intention in Figure 3 was to show that surges in infection cases are often accompanied by significant increases in mobility over certain periods, particularly from the early months of 2020 to 2022. Following a rise in infection cases, total flow tends to decrease due to government-recommended self-isolation periods.
In the period you mentioned, December 2021, while infection cases spiked, we can observe that total flows subsequently decreased during January and February 2022, which aligns with this pattern.
```Q1 short ver.```
Figure 3 illustrates that infection surges are often accompanied by significant increases in mobility, especially from early 2020 to 2022. After a spike in cases, total flow typically decreases due to government-recommended self-isolation. For instance, after infections rose in December 2021, total flows decreased in January and February 2022, following this pattern.
> *Q2. In Figure 2, What is the OD flow in the y-axis?*
```Q2 Response```
As explained in Section 3, OD flow means the origin-destination (OD) flow, i.e., node interactions between nodes (administrative regions). In Figure 2, the OD flow on the y-axis is the cumulative number of people moving between different regions, summed over all locations and time periods.
> *Q3. What is the edge index in Eq.(2)? By this time, it has not been introduced.*
```Q3 Response```
Thank you for pointing out this detail in our paper. Here, the edge index refers to the raw adjacency matrix, represented with binary values (0 or 1) to indicate node connections. **Since both the time series and OD data share a static network structure**, we assigned the connection information between nodes directly in the graph model.
Edge weights are not used directly in the graph-based encoder. Instead, they function as a metric in the spatial soft contrastive learning process, guiding how close specific regions should become in feature space. Specifically, $g^{(k)}_{3D}[i, j]$ is applied as a weight on the positive loss. This encourages closer similarity among neighboring nodes, depending on their proximity. Thus, the edge weight informs the model of the relative spatial closeness required between regions, shaping the feature representation according to geographic and network-based proximity.
```Q3 short ver.```
Thank you for highlighting this detail. Here, the edge index represents the raw adjacency matrix, with binary values (0 or 1) indicating node connections. Since both the time series and OD data share a static network structure, node connections are directly assigned in the graph model.
Edge weights are not directly used in the graph-based encoder but serve as metrics in spatial soft contrastive learning, **guiding feature space proximity** between regions. Specifically, $g^{(k)}_{3D}[i, j]$ assigns a positive loss to promote closer similarity among neighboring nodes based on proximity. This helps the model understand spatial closeness, shaping feature representations by geographic and network-based proximity.
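As an illustration only (the exact loss is defined in the paper; the hinge form and tensor shapes below are assumptions), the decay weight can gate how strongly each pair is pulled together:

```python
import torch

def spatial_soft_positive_term(sim, g3d, gamma=0.8):
    """Sketch: k-hop decay weights scale a hinge-style positive term.

    sim:   (N, N) cross-modal similarity matrix for one timestep.
    g3d:   (N, N) decay weights g^{(k)}_{3D}[i, j], zero beyond k hops.
    gamma: hinge threshold for how similar connected regions should be.
    """
    # Pairs within k hops (g3d > 0) are penalized when their similarity falls
    # below gamma, in proportion to their spatial proximity.
    return (g3d * torch.clamp(gamma - sim, min=0.0)).mean()
```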
<!-- We did not use OD flow as an edge weight because the node features themselves already represent OD data, and adding OD as an edge weight would provide redundant information. -->
> *Q4. What is the difference between using the exponential or using the linear decay function? Based on my understanding, the only place the model uses it is Eq. (4) when we decide the positive pairs. However, we only select positive ones if the value in the adjacency matrix is larger than zero, which cannot be decided by the decay function (as long as it is bigger than 3/5 hop, the value would be 0).*
```Q4 Response```
For a 5-hop example, a linear decay function decreases rapidly after the first hop and, without clipping, can reach negative values. In contrast, an exponential decay function always remains above zero [a], so positive weights extend over a broader range of distances, reaching more distant regions. For comparison, we use $g_{ij}^{(k)} = \exp\left(-(k-1)\right)$ if $1 \leq k \leq 5$ and $g_{ij}^{(k)} = 0$ if $k > 5$ for the exponential decay function, and $g_{ij}^{(k)} = -\frac{k-1}{n} + 1$ with $n = 5$ if $1 \leq k \leq 5$ and $g_{ij}^{(k)} = 0$ if $k > 5$ for the 5-hop linear decay function.
```Q4 short ver.```
In a 5-hop example, a linear decay function drops quickly after the first hop and, without clipping, can reach negative values. In contrast, an exponential decay function **remains positive** [a], covering a broader range of distances with positive values and extending influence to distant regions. For exponential decay, we use $g_{ij}^{(k)} = \exp\left(-(k-1)\right)$; for linear decay, we use $g_{ij}^{(k)} = -\frac{k-1}{n} + 1$ with $n = 5$, both for $1 \leq k \leq 5$ (and $g_{ij}^{(k)} = 0$ if $k > 5$).
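A quick numerical check of the two forms (a minimal sketch following the definitions above; the unclipped linear value is printed to show where it turns negative):

```python
import math

n = 5  # hop limit in the 5-hop example

for k in range(1, 8):
    g_exp = math.exp(-(k - 1)) if 1 <= k <= n else 0.0
    g_lin = -(k - 1) / n + 1 if 1 <= k <= n else 0.0
    unclipped_lin = -(k - 1) / n + 1   # negative once k > n + 1
    print(f"k={k}: exp={g_exp:.3f}, linear={g_lin:.3f}, "
          f"unclipped linear={unclipped_lin:+.3f}")
```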

<!-- Figure attachment not possible; for reference only. -->
> *Q5. In the inference stage, where does Eq. (9) use the embedding $\mathbf{h}_{\text{GCN}}^{r_j, t}, \mathbf{h}_{\text{TCN}}^{r_j, t}$ from the model?*
```Q5 Response```
We regret that we could not include a more detailed explanation in the main text, and we plan to add this in the final revision.
In the inference stage, raw input data is passed through the trained GCN and TCN models to obtain the embeddings, $\mathbf{h}_{\text{GCN}}^{r_j, t}$ and $\mathbf{h}_{\text{TCN}}^{r_j, t}$. These embeddings are then used to calculate the similarity score between modalities, as shown in Eq. (3). The computed similarity $\mathcal{S}^{(r_i, r_j)}$ is projected into $\mathcal{Z}^{(r_i, r_j)}$ via the learned projection, and this output is passed through the prediction model to complete the final forecasting.
```Q5 short ver.```
During inference, raw input data passes through the trained GCN and TCN models, producing embeddings $h_{{GCN}}^{r_j, t}$ and $h_{{TCN}}^{r_j, t}$. These embeddings are used to compute the similarity score between modalities, as in Eq.(3). The similarity $S^{(r_i, r_j)}$ is projected into $Z^{(r_i, r_j)}$ via the learned projection, and this is input to the prediction model for final forecasting.
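A minimal sketch of this inference flow (module names, tensor shapes, and the cosine form of Eq. (3) are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def infer(graph_inputs, ts_inputs, gcn, tcn, projection, predictor):
    """Sketch: encoders -> cross-modal similarity -> projection -> prediction."""
    h_gcn = gcn(graph_inputs)     # h_GCN^{r_j, t}, shape (T, N, D)
    h_tcn = tcn(ts_inputs)        # h_TCN^{r_j, t}, shape (T, N, D)
    h_gcn = F.normalize(h_gcn, dim=-1)
    h_tcn = F.normalize(h_tcn, dim=-1)
    sim = torch.einsum('tnd,tmd->tnm', h_gcn, h_tcn)   # S^{(r_i, r_j)}, Eq. (3)
    z = projection(sim)           # Z^{(r_i, r_j)} via the learned projection
    return predictor(z)           # final forecast, Eq. (9)
```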
[a] Hobbie, Russell K., et al. "Exponential growth and decay." Intermediate physics for medicine and biology (2007): 31-47.
---
## Reviewer4 - Xuui, Rating(5), Confidence(4)
> *Q1. The paper introduces the use of a k-hop adjacency matrix to account for connections spanning multiple hops. However, several concerns arise from this module:*
> Choice of K: The paper mentions setting k to 5 but does not justify this choice. Choosing an appropriate k is crucial as it directly impacts the sparsity and representational power of the adjacency matrix. A thorough sensitivity analysis or an automated approach to select k based on data characteristics would be helpful.
> Scalability: The number of edges in the graph grows exponentially with k. Setting k=5 can lead to an extremely dense adjacency matrix, especially for large graphs, which could significantly increase computational costs and potentially make the model intractable for real-world applications. The paper does not discuss any strategies to mitigate this issue, such as pruning edges based on their relevance or importance.
```Q1 Response```
Thank you for these valuable points. We understand that the choice of $k$ and its impact on computational cost and matrix density are crucial considerations.
Due to the space limit, we report additional $k$-hop hyperparameter sensitivity experiments using the Busan dataset with a prediction length of 14 days. We also report the running time to build the $k$-hop adjacency matrix for each $k$.
- Choice of $k$: We selected $k=5$ based on preliminary experiments, which showed a balance between performance and computational cost. Since the optimal $k$ can vary depending on the dataset and domain, we treat $k$ as a hyperparameter that can be adjusted accordingly. Using a smaller $k$ resulted in limited spatial context, while a larger $k$ increased computational demands without proportionate performance gains. We agree that a more comprehensive sensitivity analysis of $k$ could further clarify its impact, and we plan to include additional experiments in the final version. Exploring automated methods to adjust $k$ dynamically based on data characteristics is a valuable suggestion for improvement.
- Scalability: As you noted, a higher $k$ significantly increases edge density, which can create scalability issues in larger graphs. Specifically, computing the $k$-hop neighborhoods grows at $O(N_k + E_k)$ as $k$ increases (where $N_k$ is the number of nodes within $k$ hops and $E_k$ is the number of edges among these nodes). Through preliminary experiments on our benchmark domains, we confirmed that our choice of $k$ maintains computational feasibility in practical scenarios for our datasets. By setting $k$ as a hyperparameter, we allow flexibility and scalability across domains while keeping computational costs manageable.
```Q1 short ver.```
We recognize the importance of careful $k$ selection due to its impact on computational cost and matrix density. We conducted $k$-hop sensitivity experiments on the Busan dataset (14-day prediction), including the running time for each $k$-hop adjacency matrix.
- Choice of $k$: We chose $k=5$ as it balanced performance and cost. Optimal $k$ values vary by dataset, so we treat it as an adjustable hyperparameter. Smaller $k$ reduced spatial context, while larger $k$ increased computational costs with minimal gains.
- Scalability: Higher $k$ values increase edge density, impacting scalability, as the $k$-hop neighborhood computation grows at $O(N_k + E_k)$, where $N_k$ is the number of nodes within $k$ hops and $E_k$ the number of edges among them. Our chosen $k$ remains feasible for our benchmarks, ensuring flexibility and scalability across datasets while controlling costs.
| k | MAE | MSE | Running Time (s) |
|:----:|:-------:|:-------:|:------------------:|
| 1 | 0.4904 | 0.5412 | 0.0010 |
| 2 | 0.4235 | 0.4265 | 0.0061 |
| 3 | 0.4231 | 0.4225 | 0.0747 |
| 4 | 0.4180 | 0.4211 | 0.0794 |
| 5 | **0.4169** | **0.4041** | 0.0325 |
| 6 | 0.4178 | 0.4179 | 0.0468 |
| 7 | 0.4201 | 0.4233 | 0.3933 |
| 8 | 0.4201 | 0.4240 | 0.7061 |
| 9 | 0.4198 | 0.4233 | 1.3855 |
| 10 | 0.4195 | 0.4207 | 0.6125 |
> *Q2. The motivation behind performing spatial contrastive learning on the representations learned by GCN and TCN is unclear. The paper briefly mentions enforcing spatial consistency but does not provide a detailed reason for why this is necessary or how it contributes to the overall framework.*
```Q2 Response```
The motivation for applying spatial contrastive learning within our framework is to enhance **spatial consistency** by aligning representations learned through GCN and TCN. This approach allows our model to capture meaningful spatial relationships across regions, which is critical for accurately modeling dynamic, networked data like human mobility patterns.
Our method is conceptually similar to CLIP [a], which combines image and text modalities, in that it integrates graph and time-series representations. Due to the time-varying nature of the data, we construct a **3D similarity matrix**, where spatial contrastive learning enforces similarity between physically proximate nodes. This consistency ensures that nodes with spatial proximity have similar representations, enhancing the model's ability to learn spatial dependencies while preserving the temporal dynamics of the data along the time axis.
Incorporating spatial contrastive learning ultimately improves the robustness of our framework, allowing it to better capture spatially meaningful patterns and enhance predictive accuracy across the network.
```Q2 short ver.```
We use spatial contrastive learning to enhance spatial consistency by aligning GCN and TCN representations, enabling our model to capture key spatial relationships in dynamic networked data. Similar to CLIP[a] (integrating image and text), our approach combines graph and time-series. To handle time-varying data, we construct **a 3D similarity matrix** where spatial contrastive learning enforces similarity between proximate nodes, helping the model learn spatial dependencies while preserving temporal dynamics, ultimately enhancing robustness and predictive accuracy across the network.
[a] Radford, Alec, et al. "Learning transferable visual models from natural language supervision.", ICML, 2021.
> *Q3. The paper mentions several hyperparameters, including gamma for the hinge loss in spatial contrastive learning, the margin (m) in the ranking loss, and the weights (w_spatial, w_temporal, w_prediction) for the total loss function. However, there are no detailed parameter experiments.*
```Q3 Response```
<span style="background-color:#fff5b1"> + hyper-parameter experiment table </span>
| margin | gamma | MAE | MSE |
|--------|-------|-------|-------|
| 0.1 | 0.2 | 0.423 | 0.415 |
| | 0.5 | 0.422 | 0.414 |
| | 0.8 | 0.418 | 0.408 |
| 0.2 | 0.2 | 0.417 | 0.404 |
| | 0.5 | 0.421 | 0.415 |
| | 0.8 | 0.430 | 0.425 |
| 0.5 | 0.2 | 0.425 | 0.418 |
| | 0.5 | 0.422 | 0.413 |
| | 0.8 | 0.421 | 0.411 |
| 0.8 | 0.2 | 0.420 | 0.411 |
| | 0.5 | 0.420 | 0.413 |
| | 0.8 | 0.420 | 0.409 |
| w_prediction | w_spatial | w_temporal | MAE | MSE |
|--------------|-----------|------------|-------|-------|
| 0.8 | 0.1 | 0.1 | 0.422 | 0.413 |
| 0.7 | 0.1 | 0.2 | 0.429 | 0.424 |
| | 0.2 | 0.1 | 0.422 | 0.413 |
| 0.5 | 0.2 | 0.3 | 0.420 | 0.411 |
| | 0.25 | 0.25 | 0.417 | 0.404 |
| | 0.3 | 0.2 | 0.419 | 0.412 |
| 0.3 | 0.2 | 0.5 | 0.420 | 0.413 |
| | 0.5 | 0.2 | 0.417 | 0.407 |
| 0.1 | 0.1 | 0.8 | 0.461 | 0.453 |
| | 0.2 | 0.7 | 0.470 | 0.460 |
| | 0.7 | 0.2 | 0.466 | 0.456 |
| | 0.8 | 0.1 | 0.454 | 0.444 |
Thank you for pointing this out. We conducted several experiments to explore the effects of key hyperparameters, including gamma, the margin $m$, and the weights $w_{spatial}, w_{temporal}$, and $w_{prediction}$.
In our main experiments, we selected hyperparameter values that offered consistent performance across validation sets, prioritizing predictive accuracy.
```Q3 short ver.```
Thank you for pointing this out. We conducted experiments to examine the effects of key hyperparameters: gamma, margin $m$, and weights $w_{spatial}$, $w_{temporal}$, and $w_{prediction}$.
In our main experiments, we chose hyperparameter values that provided consistent performance across validation sets, prioritizing predictive accuracy.
```Shortened tables```
|k|MAE|MSE|Time(s)|
|--|--|--|--|
|1|0.490|0.541|0.001|
|3|0.423|0.423|0.075|
|5|0.417|0.404|0.033|
|7|0.420|0.423|0.393|
|9|0.420|0.423|1.386|
|10|0.420|0.420|0.613|
|Model|Daegu|Busan|Seoul|NYC|COVID|NYC COVID|
|--|--|--|--|--|--|--|
|GWNet|0.453|0.455|0.448|0.416|0.724|0.560|
|DSTAGNN|0.455|0.441|0.479|0.449|0.892|0.698|
|BINTS|**0.417**|**0.419**|**0.398**|**0.411**|**0.341**|**0.403**|
|margin|gamma|MAE|MSE|
|--|--|--|--|
|0.1|0.2|0.423|0.415|
||0.5|0.422|0.414|
||0.8|0.418|0.408|
|0.2|0.2|0.417|0.404|
||0.5|0.421|0.415|
||0.8|0.430|0.425|
|0.5|0.2|0.425|0.418|
||0.5|0.422|0.413|
||0.8|0.421|0.411|
|0.8|0.2|0.420|0.411|
||0.5|0.420|0.413|
||0.8|0.420|0.409|
|w_prediction|w_spatial|w_temporal|MAE|MSE|
|--|--|--|--|--|
|0.8|0.1|0.1|0.422|0.413|
|0.7|0.1|0.2|0.429|0.424|
||0.2|0.1|0.422|0.413|
|0.5|0.2|0.3|0.420|0.411|
||0.25|0.25|0.417|0.404|
||0.3|0.2|0.419|0.412|
|0.3|0.2|0.5|0.420|0.413|
||0.5|0.2|0.417|0.407|
|0.1|0.1|0.8|0.461|0.453|
||0.2|0.7|0.470|0.460|
||0.7|0.2|0.466|0.456|
||0.8|0.1|0.454|0.444|