# The ICLR 6021 Rebuttal
## QvEM
### Weakness:
1. It is essential to provide a more detailed explanation of how each component contributes to addressing the proposed issue, e.g., clarify the necessity of graph neural network and the specific problem it aims to solve.
2. The readability and visual appeal of the process diagram for RDA in Figure 2 could be improved.
3. The experimental setting of dividing source and target domains is unrealistic.
### Questions:
1. After constructing the emotional graph, did the author only perform a single convolution layer by multiplying an adjacency matrix with node embeddings? Did the author consider using multiple convolution layers or using the more expressive graph neural network?
2. Due to the necessity of employing Bayesian graph inference to construct multiple graphs in the author's method, I am concerned about whether the performance benefits outweigh the increased computational burden.
3. In Table 1, VBH-GNN achieves optimal accuracy, but the F1 scores consistently demonstrate poor performance. Does this observation imply the class imbalance issue in the predicted results of the method?
4. The authors used leave-one-subject-out paradigm to divide the source and target domains is unrealistic. In this case, the target domain is actually a validation set, which is not the real domain adaptation setting.
5. The heterogeneity (e.g., EEG, ECG) in this paper can be regarded as multi-variant time series data. Please clarify the difference between the heterogeneity and multi-variant time series. The experiments should also include the baseline methods for learning multi-variant time series (e.g., learning a graph structure to represent the spatio relationships in multi-variant time series).
### Response to Reviewer QvEM:
We sincerely thank the reviewer for the insightful and valuable comments! We have addressed each point in detail below and updated our paper accordingly. Our response is somewhat late because we conducted additional experiments during the rebuttal period; please accept our apologies. Please do not hesitate to post further comments and questions; we will be more than happy to address them.
#### Answer to Weakness 1:
If we understand correctly, the reviewer means that our explanation of several essential components (e.g., BGI, EGT) in RDA is not detailed enough, which causes difficulties in understanding. Thank you for pointing out this weakness, and **we have updated our paper with more detailed descriptions for each module of RDA**. Here, we explain its core components.
**The Bayesian Graph Inference (BGI) ensures that the model can find the latent relationship distribution of multi-modal physiological signals shared by the source subjects and the target subject**. As we analyzed in Section 4.3:
> BGI loss determines whether the model converges or not or whether the model can learn the spatio-temporal relationship distribution of modalities.
When the BGI is removed, the model fails to converge due to its inability to learn this latent relationship distribution.
**The Emotional Graph Transform (EGT) ensures that the model can distinguish between the latent relationship distribution of multi-modal physiological signals in different emotions**. In Section 4.3:
> For EGT loss, its effect on the model is to determine the degree of convergence.
When the EGT is removed, there is a significant decrease in classification accuracy, and thus, it determines how well the model converges on the ER task.
**We have also added a flow chart of our method VBH-GNN to illustrate the different modules.** It graphically shows Wav-to-Node, Spatial RDA, Temporal RDA, and the Classifier, and we have updated their functional descriptions in Section 3.1.
#### Answer to Weakness 2:
We have redrawn Figure 3 (i.e., Figure 2 in the original version) to improve the readability of the RDA module. Specifically, we use different background colors to denote different sub-modules and thus make the figure details more transparent. We have also added a global flow chart in Figure 2 to illustrate the role of each module in the overall process. Please check Figure 2 and Figure 3 in our revised PDF version.
#### Answer to Weakness 3 and Question 4:
We used the leave-one-subject-out (LOSO) paradigm to divide source and target domains. If we understand correctly, the reviewer means that the validation and testing sets should originate from different domains (subjects) and should not overlap. We explain why we use LOSO as follows.
**First, there is a difference between domain adaptation (DA) and domain generalization (DG).** Our paper addresses the DA problem in a cross-subject emotion recognition (ER) task. Since DA is similar to DG, we suspect this similarity may have caused the misunderstanding.
The main difference between DG and DA is whether the model can use data from the target domain during training. In DG, the subjects in the testing set are regarded as the target domain and must be unseen to the trained model; to evaluate such a model, the validation set should also come from the target domain but use a different subject from the testing set. In this case, using LOSO would indeed be unrealistic. In DA, however, the training set typically contains data from both the source and target domains: without access to target-domain data, the model cannot adapt to the new, unknown domain during training. In other words, DA implies using data from the source subjects and the target subject for ER, so LOSO can be used to select the target subject.
**Second, the LOSO is a commonly used experiment setup for domain adaptation (DA) in EEG-based cross-subject ER**. To support our point, we cite several typical papers and quote their relevant contents as follows:
1. Mentioned in ref [1]:
> Specifically, only one experiment from each subject is involved in the **leave-one-subject-out** cross strategy to study the inter-subject variability.
2. Mentioned in ref [2]:
> Based on the above EEG dataset, we adopt the **leave-one-out-cross-validation** method in the following experiments to evaluate the performance of the models in cross-subject scenarios.
3. Mentioned in ref [3]:
> On the control experiments, we employ two transfer paradigms, i.e., 'one-to-one' and **'multi-to-one'**. ... . In the latter paradigm, when **one subject serves as the target, all the remaining subjects form the source**.
Despite their different paradigm names, they all randomly select one subject as the target domain and the rest of the subjects as the source domain. We follow their settings in our experiments.
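The LOSO selection described above can be sketched in a few lines (this is an illustration, not our actual code; the subject IDs and function name are hypothetical):

```python
# Illustrative sketch of the leave-one-subject-out (LOSO) split used for
# domain adaptation: one subject serves as the target domain, and all the
# remaining subjects form the source domain.

def loso_split(subject_ids, target_subject):
    """Return (source_subjects, target_subject) for one LOSO fold."""
    if target_subject not in subject_ids:
        raise ValueError(f"unknown subject: {target_subject}")
    source = [s for s in subject_ids if s != target_subject]
    return source, target_subject

# One fold per subject: every subject serves as the target domain once.
subjects = [f"S{i:02d}" for i in range(1, 6)]
folds = [loso_split(subjects, t) for t in subjects]
```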
**Third, the division of the training, validation, and testing sets in our experiments is also common, and there is no data leakage**. We illustrate this with the setting of a published baseline.
Mentioned in the baseline MMDA-VAE [4]:
> **The training set consisted of the source data and the labelled target data**, i.e., all the samples from the source session and samples from the first three or four trials (one trial per class) in the other target session.
This means that the training set contains source and target domain data.
> We used **samples from the second three or four trials in the target session as the validation set**.
This indicates that the validation set comprises part of the target domain data.
> The samples from **the rest of the twelve or sixteen trials in the target session were used to evaluate classification accuracy**.
This suggests that the testing set is also derived from the target domain but does not intersect with the training and validation sets. This experiment setting is the same as ours, where part of the target domain is included in the training set, while the training, validation, and testing sets do not overlap.
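To make this split concrete, here is a hypothetical sketch (the function name and trial counts are ours, not from MMDA-VAE's code) of the trial-level partition quoted above: the target subject's session is divided into disjoint train/validation/test trials, and the final training set additionally contains all source-domain data.

```python
# Sketch of the trial-level split of the target session: the three
# returned lists never overlap, so no test trial leaks into training.

def split_target_trials(trials, n_train, n_val):
    """Partition an ordered list of target-domain trials into three
    non-overlapping parts: train, validation, and test."""
    train = trials[:n_train]
    val = trials[n_train:n_train + n_val]
    test = trials[n_train + n_val:]
    return train, val, test

# e.g. a 24-trial target session: a few labelled trials join the training
# set, a few are used for validation, and the rest are held out for testing.
tr, va, te = split_target_trials(list(range(24)), 4, 4)
```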
**In summary, our experiment setting (i.e., LOSO paradigm) is commonly used in the cross-subject ER field and does not result in data leakage**. At the same time, we agree with the experiment setting that the reviewer suggested, and we will take the relatively more challenging DG task as our further research direction. We look forward to discussing this with you and thank you for noting the details of the experiment setup.
[1] Zhao, Li-Ming, Xu Yan, and Bao-Liang Lu. "Plug-and-play domain adaptation for cross-subject EEG-based ER." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 1. 2021.
[2] Gu, Rong-Fei, et al. "Cross-Subject Decision Confidence Estimation from EEG Signals Using Spectral-Spatial-Temporal Adaptive GCN with Domain Adaptation." 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023.
[3] Peng, Yong, et al. "Joint feature adaptation and graph adaptive label propagation for cross-subject ER from EEG signals." IEEE Transactions on Affective Computing 13.4 (2022): 1941-1958.
[4] Wang, Yixin, et al. "Multi-modal domain adaptation variational autoencoder for EEG-based ER." IEEE/CAA Journal of Automatica Sinica 9.9 (2022): 1612-1626.
#### Answer to Question 1:
If we understand the reviewer correctly, the question is whether we considered using a more expressive network layer to replace the RDA's Graph Attention (GA) module. We have tried using Graph Convolutional Network (GCN) layers instead of GA but found it unnecessary, for the following reasons.
**The output of EGT (EmoG) already adequately represents the relationships between nodes, so a more expressive network (such as a GCN) is not needed to represent these relationships again**. In our experiments, using GCNs did not improve performance compared to GA. The core idea of a GCN is to infer the relationships between nodes from the input adjacency matrix and node embeddings, and then update the node embeddings based on these relationships. However, inferring the relationships between nodes is precisely what BGI and EGT accomplish: the EmoG already adequately represents the relationships between nodes under specific emotions, and the GA operation is more efficient than convolution layers or full graph neural networks. Therefore, we simplify the GCN to a GA module that retains only the function of updating node embeddings.
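As an illustration only (this is not the paper's code; `ga_update` and its arguments are names we introduce here), the idea behind the simplification can be sketched as follows: the relation weights are read from the emotional graph produced by BGI/EGT instead of being re-inferred, so the layer only has to update node embeddings.

```python
import numpy as np

# Minimal sketch of the GA idea: given relation weights from EmoG, GA just
# normalizes them and uses them to aggregate and project node features,
# rather than re-inferring the graph structure as a full GCN layer would.

def ga_update(node_emb, emo_graph, weight):
    """node_emb: (N, D) embeddings; emo_graph: (N, N) edge weights from
    EmoG; weight: (D, D) learned projection. Returns updated (N, D)."""
    # Row-wise softmax over the given relation weights ...
    attn = np.exp(emo_graph - emo_graph.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # ... then aggregate neighbor features and apply a nonlinearity.
    return np.tanh(attn @ node_emb @ weight)

rng = np.random.default_rng(0)
h = ga_update(rng.normal(size=(4, 8)), rng.normal(size=(4, 4)),
              rng.normal(size=(8, 8)) * 0.1)
```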
#### Answer to Question 2:
If we understand the reviewer correctly, the concern is that the step in BGI that couples the $n$ relationships between multi-modal signals constructs multiple graphs and thus incurs a substantial computational burden. However, this step does not construct multiple graphs and therefore does not increase the computational burden.
**The BGI does not directly align the graphs between domains, but rather the edge existence probability distributions, so it doesn't construct multiple graphs**. In the paper, we mentioned:
> From this, we define the prior HetG edge distribution from the source domain as follows:
> $$P(\text{HetG}|E_s) \sim \text{BIN}(n, p_s)$$
where $p_s \in \mathbb{R} ^{B \times N_{e} \times 1}$ is computed by the network and mathematically represents the probability of the existence of each of the $N_e$ edges. It is much smaller than the edge embedding matrix $E_s\in \mathbb{R}^{B \times N_{e} \times D_e}$.
To align domains by such probability distributions, we design the BGI Loss (essentially Kullback-Leibler Divergence (KLD)), which is used to minimize the divergence between the two probability distributions. However, as the reviewer is concerned, the direct computation of this KLD imposes a substantial computational burden due to the presence of $n$. To make this computation possible in the network, we propose Theorem 2, which mathematically computes an upper bound for this KLD:
> $$\mu_{lt} \log \frac{\mu_{lt}+\epsilon}{p_s+\epsilon} +(1-\mu_{lt}) \log \frac{1-\mu_{lt}+{\mu_{lt}}^2 / 2+\epsilon}{1-p_s+{p_s}^2 / 2+\epsilon}$$
which does not directly involve $n$. The computational complexity of this equation is $O(B \cdot N_e)$, so it does not impose a significant computational burden.
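As a sketch (the function and variable names are ours, not from our implementation), the upper bound is evaluated elementwise over the edge probabilities, so its cost scales with $B \cdot N_e$ and is independent of $n$:

```python
import numpy as np

# Elementwise evaluation of the closed-form KLD upper bound from Theorem 2
# over the (B, N_e, 1) edge-existence probabilities of the two domains.

def bgi_loss_upper_bound(mu_lt, p_s, eps=1e-8):
    """mu_lt, p_s: arrays of shape (B, N_e, 1) holding edge-existence
    probabilities for the target and source domains, respectively."""
    term1 = mu_lt * np.log((mu_lt + eps) / (p_s + eps))
    term2 = (1 - mu_lt) * np.log(
        (1 - mu_lt + mu_lt**2 / 2 + eps) / (1 - p_s + p_s**2 / 2 + eps))
    # Reduce over batch and edges: cost is O(B * N_e), independent of n.
    return (term1 + term2).mean()
```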
#### Answer to Question 3:
Yes, class imbalance is a common and challenging problem in cross-subject EEG ER, and a detailed investigation of this issue will be part of our future work. Note, however, that our method achieves the second-best F1 score among all methods in most cases. For better illustration, we have updated Table 1 and highlighted the top-3 performances to show the advantages of our method.
#### Answer to Question 4:
Please see our answer to Weakness 3 above.
#### Answer to Question 5:
Thank you for bringing this to our attention! We agree that the heterogeneity (e.g., EEG, ECG) mentioned in our paper can be regarded as multi-variant time series data.
In our paper, "heterogeneous data" emphasizes the diversity of physiological signals captured by different sensors; the term "heterogeneity" is commonly used in multi-modal or EEG ER [5][6][7]. Multi-variant time series data are sequential observations (e.g., physiological signals and vital signs) that may be irregularly sampled by sensors [9]. Therefore, heterogeneous data and multi-variant time series data **have little difference in data format**.
As suggested by the reviewer, baseline methods for learning multi-variant time series can also be applied to the ER task. We adopt two graph-structure-based methods [8][9] as new baselines. Their preliminary experimental results are as follows:
| Method | DEAP Arousal Acc. | DEAP Arousal F1 | DEAP Valence Acc. | DEAP Valence F1 | DREAMER Arousal Acc. | DREAMER Arousal F1 | DREAMER Valence Acc. | DREAMER Valence F1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MTGNN [8] | $67.46 \pm 11.51$ | $63.03 \pm 12.19$ | $64.77 \pm 7.98$ | $67.24 \pm 8.33$ | $66.66 \pm 9.54$ | $66.24 \pm 11.5$ | $63.35 \pm 6.29$ | $64.01 \pm 9.39$ |
| RAINDROP [9] | $66.06 \pm 10.11$ | $63.7 \pm 12.43$ | $65.59 \pm 7.38$ | $64.29 \pm 7.98$ | $65.74 \pm 8.99$ | $62.17 \pm 10.82$ | $65.85 \pm 7.61$ | $62.44 \pm 8.07$ |
| Our VBH-GNN | **73.5 $\pm$ 7.22** | **71.53 $\pm$ 10.86** | **71.21 $\pm$ 6.41** | **71.85 $\pm$ 7.38** | **70.64 $\pm$ 7.74** | **69.66 $\pm$ 9.51** | **73.38 $\pm$ 4.21** | **69.08 $\pm$ 6.98** |
We observe that **the baselines designed for multi-variant time series data do not perform well when applied to emotion recognition on our heterogeneous data, because the two kinds of data differ in characteristics such as data sparsity and sampling rate.**
[5] Ziyu Jia, et al. "HetEmotionNet: two-stream heterogeneous graph recurrent neural network for multi-modal ER." ACM MM, 2021.
[6] Linlin Gong, et al. "Emotion recognition from multiple physiological signals using intra-and inter-modality attention fusion network." Digital Signal Processing, 2024.
[7] Wei Li, et al. "Can emotion be transferred?—A review on transfer learning for EEG-Based Emotion Recognition." IEEE TCDS, 2021.
[8] Zonghan Wu, et al. "Connecting the dots: Multivariate time series forecasting with graph neural networks." ACM SIGKDD, 2020.
[9] Xiang Zhang, et al. "Graph-Guided Network for Irregularly Sampled Multivariate Time Series." ICLR, 2022.
We hope that our answers clarify your concerns. Thank you for your time and valuable feedback! We are glad to have any further discussion.