## To Area Chair

Dear Area Chair,

We are writing to kindly remind you that we have diligently addressed all the concerns and suggestions raised by **Reviewers MgUp, EQ6w, and 7KMF**. In particular, as suggested by **Reviewer MgUp**, we have compared the proposed method with a recent baseline, and demonstrated an interpretable solution utilizing disentangled representation learning as a feature encoder for the image classification task on the MNIST dataset. Regarding the comments from **Reviewer EQ6w**, we have elaborated on the specific advantage of our method in computational complexity over existing solutions, and have further bolstered our findings by conducting experiments on two additional datasets. Lastly, addressing the remarks of **Reviewer 7KMF**, we have highlighted the novelty and technical contributions of the proposed method, as well as its notable enhancement in accuracy and computation cost compared to existing works.

However, we have yet to receive any further communication from these three reviewers. We kindly request your assistance in determining whether they have any remaining concerns or questions regarding our responses. We would appreciate it if you could encourage the reviewers to follow up. Thank you very much for your help!

## To Reviewer WfDu

<!-- Thank you for your constructive comments! Below we carefully address your questions. -->

Reminder to reviewer
--
Dear Reviewer WfDu,

We would like to thank you for taking the time to review our paper and rebuttal, and most importantly for raising your score. We are grateful for the insightful comments you made, and truly appreciate this opportunity to improve our work. Please kindly let us know if you have any additional questions or concerns. Thank you very much!

Q1: The motivation behind incorporating TaylorNet. Its distinctions from existing GAMs.
--
**A1:** The motivation behind incorporating TaylorNet is that it can explicitly learn the mapping function between the input and output using polynomials, without the need for activation functions. In addition, the developed TaylorNet based on tensor decomposition can significantly reduce the number of model parameters compared to existing GAMs. By combining the concept encoder and the TaylorNet, our method can better explain the prediction results in a manner that humans can easily understand.

==(length 608)==

<!-- Our significant contribution lies in the development of an interpretable neural architecture termed TaylorNet. This novel framework has the capability to learn the non-linear mapping function between input and output exclusively through polynomials, without relying on traditional activation functions. Consequently, our proposed method not only improves model performance but also enhances interpretability through the utilization of polynomials. Furthermore, as highlighted by Reviewer Bn6s, our approach has the added benefit of reducing the label cost associated with human annotations. -->

<!-- The developed TaylorNet differs from existing GAMs in several ways. Firstly, unlike Neural Additive Models (NAM) and its variant NODE-GAM, which learn feature-wise shape functions using a separate DNN or an ensemble of decision trees respectively, and then feed them into a linear model, our TaylorNet learns the relationship between concepts and outputs using polynomials. This approach results in higher accuracy and fewer model parameters. Additionally, while Neural Basis Models (NBM) utilize a single DNN to learn shared bases and then input these bases into one linear model for each feature to learn its shape function, NBM may suffer from low accuracy if the basis functions are inaccurate. We have elaborated on the distinctions between TaylorNet and existing GAMs in the Related Work section of our manuscript. For further details, please refer to that section. -->
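For concreteness, below is a minimal sketch of such a polynomial layer (illustrative PyTorch, not our released implementation; the class name `LowRankTaylorLayer`, the default rank, and the rank-R factorization of the second-order coefficient matrix are simplifications standing in for the Tucker decomposition used in the paper):

```python
import torch
import torch.nn as nn

class LowRankTaylorLayer(nn.Module):
    """Sketch of a second-order Taylor-style head: bias + linear term +
    a quadratic term whose d x d coefficient matrix is factorized with
    rank R, so parameters grow as O(d * R) rather than O(d^2).
    No activation functions are involved."""

    def __init__(self, num_concepts: int, num_outputs: int, rank: int = 16):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_outputs))
        self.linear = nn.Linear(num_concepts, num_outputs, bias=False)
        # W2[o] is approximated by sum_r outer(U[o, r], V[o, r]).
        self.U = nn.Parameter(0.01 * torch.randn(num_outputs, rank, num_concepts))
        self.V = nn.Parameter(0.01 * torch.randn(num_outputs, rank, num_concepts))

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # c: (batch, num_concepts), the concept activations.
        first = self.linear(c)
        u = torch.einsum('ord,bd->bor', self.U, c)
        v = torch.einsum('ord,bd->bor', self.V, c)
        second = (u * v).sum(dim=-1)
        return self.bias + first + second
```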
Q2: Only do experiments on tabular data.
--
**A2:** We have done extra experiments on the MNIST dataset. The core idea is to employ disentangled representation learning [1] as a concept encoder to uncover the high-level latent factors from the image data, which are subsequently fed into the TaylorNet. Please see the table in **A1** in the response to **Reviewer Bn6s**. Our method outperforms the baselines in both accuracy and computational costs.

[1] Shao et al. Rethinking controllable VAE. CVPR, 2022.

<!-- The experimental results are detailed in Table 1 below. Notably, our method still outperforms the baselines in both accuracy and parameter count. Specifically, the second-order CAT outperforms all baselines except for its third-order counterpart while utilizing significantly fewer parameters. -->

<!-- Table 1: Evaluation on MNIST.

|MNIST|Acc↑|Macro-Prec↑|Macro-F1↑|Number of Paras.↓|Throughput (samples/sec)↑|
|:---:|:---:|:---:|:---:|:---:|:---:|
|EBM|0.9885|0.9423|0.9422|61,400|N/A|
|NAM|0.9723|0.8635|0.8599|40,326|46,684|
|NBM|0.9770|0.8878|0.8843|64,720|51,247|
|SPAM(Order 2)|0.9860|0.9318|0.9300|83,862|58,973|
|SPAM(Order 3)|0.9883|0.9426|0.9414|128,998|30,585|
|CAT(Order 2)|0.9892|0.9469|0.9458|**880**|**67,097**|
|CAT(Order 3)|**0.9902**|**0.9517**|**0.9510**|5,200|62,981|
-->

==(length 544)==
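A rough sketch of this pipeline is shown below (illustrative code reusing the `LowRankTaylorLayer` sketch above; the encoder architecture and sizes are hypothetical, and only the inference path of the VAE-style encoder is shown, since the disentanglement training objective of [1] is beyond the scope of this snippet):

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Illustrative VAE-style encoder: 28x28 image -> a few latent concepts.
    In the experiments the encoder is trained with a disentanglement
    objective (e.g., a beta-VAE); only inference is sketched here."""
    def __init__(self, num_concepts: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, 2 * num_concepts),  # mean and log-variance
        )

    def forward(self, x):
        mu, _logvar = self.net(x).chunk(2, dim=-1)
        return mu  # use the posterior mean as the concept vector

# Hypothetical end-to-end classifier: encoder concepts -> polynomial head.
encoder = ConceptEncoder(num_concepts=10)
head = LowRankTaylorLayer(num_concepts=10, num_outputs=10, rank=8)
logits = head(encoder(torch.randn(32, 1, 28, 28)))  # (32, 10)
```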
Q3: Performance is not significant.
--
**A3:** Please note that our method substantially reduces the number of model parameters, while still slightly outperforming SPAM in terms of accuracy. Moreover, despite having a similar number of model parameters compared to NBM, our method achieves higher accuracy. Following a comprehensive comparison in both accuracy and computational costs, we conclude that our method achieves a better trade-off between them.

==(length 487)==

Q4: Typos.
--
**A4:** We will fix the typos.

==(length 80)==

Q5: Empirical computation cost.
--
**A5:** We have compared the empirical computation cost to the baselines using Training Throughput (samples/sec) in Table 4 in our submission. Our method exhibits much higher training throughput than the baselines on all three datasets.

==(length 319)==

<!--
| | Number of Parameters↓ | | | Training Throughput (examples/sec)↑ | | |
|:------------------:|:---------------------:|:---------:|:----------:|:-----------------------------------:|:----------:|:----------:|
| **Models** | Diabetes | AirBnb | Activities | Diabetes | AirBnb | Activities |
| **EBM** | **23,411** | **1,874** | 844,068 | N/A | N/A | N/A |
| **NAM** | 1,142,750 | 822,780 | 2,398,500 | 6,793 | 9,553 | 1,667 |
| **NBM** | 82,071 | 76,897 | **124,077** | 17,308 | 24,792 | 3,917 |
| **SPAM (Order 2)** | 2,338,102 | 1,719,219 | 7,657,734 | 3,428 | 4,927 | 949 |
| **SPAM (Order 3)** | 3,617,729 | 2,604,292 | 11,601,687 | 2,271 | 3,259 | 646 |
| **CAT (Order 2)** | _51,310_ | _48,742_ | _161,138_ | **59,958** | **90,333** | **7,598** |
| **CAT (Order 3)** | 51,646 | 52,990 | 227,634 | _57,519_ | _62,672_ | _7,336_ |
-->

Q6: About the ablation of feature encoding.
--
**A6:** We conducted an ablation study to explore the impact of feature encoding (FE) on model performance using the UCI-HAR dataset. We can see from the following table that the accuracy of CAT drops while the number of parameters increases drastically without feature encoding. This underscores the necessity of incorporating it into our model.

|UCI-HAR|Acc↑|Macro-F1↑|# of Paras↓|
|-|-|-|-|
|No FE(Ord. 2)|0.9723|0.9141|194,256|
|No FE(Ord. 3)|0.9728|0.9171|1,190,656|
|CAT(Ord. 2)|0.9814|0.9431|**161,138**|
|CAT(Ord. 3)|**0.9829**|**0.9480**|227,634|

<!--
|UCI-HAR|Acc↑|Macro-Prec↑|Macro-F1↑|# of Paras↓|Throughput (sam./sec)↑|
|-|-|-|-|-|-|
|No FE(Order 2)|0.9723|0.9157|0.9141|194,256|7,413|
|No FE(Order 3)|0.9728|0.9191|0.9171|1,190,656|6,549|
|CAT(Order 2)|0.9814|0.9451|0.9431|**161,138**|**7,598**|
|CAT(Order 3)|**0.9829**|**0.9504**|**0.9480**|227,634|7,336|
-->

==(length 621)==
==**(total 2676)**==

<!-- -------------------------------- -->

## To Reviewer u6b5

<!-- Thank you for your constructive comments. We have addressed all the comments and suggestions made by the reviewer. -->

Reply to reviewer
--
Dear Reviewer u6b5,

Thank you for your prompt response to our rebuttal, and for raising your scores. By the way, what are the issues that you still have regarding the presentation of our paper? We will incorporate your feedback and suggestions into our final version.

Dear Reviewer u6b5,

We greatly appreciate your prompt response to our rebuttal and for revising your scores accordingly. Could you please specify any remaining concerns you have regarding the presentation of our paper? Your feedback and suggestions will be invaluable as we work on our final version. Again, thank you very much!

Reminder to reviewer
--
Dear Reviewer u6b5,

We would like to thank you for taking the time to review our paper and for the astute comments. We have addressed all the comments and suggestions that you made. In particular, we have **discussed the suggested related works, conducted new experiments on another commonly used COMPAS benchmark, and provided additional visual results on the interpretability of the proposed method**. Please kindly let us know if you have any further concerns. We truly appreciate this opportunity to improve our work, and will be most grateful for any feedback you give. If no further question remains, we would be grateful if you could consider raising our score. Thank you very much!

Q1: Compared to work [1]
--
**A1:** We requested the source code from the authors of [1], but we have not received a reply. Moreover, we believe the approach outlined in [1] complements our CAT, as it can automatically cluster concepts using an SVM classifier. Following clustering, these clustered concepts can be fed into TaylorNet.

==(length 350)==

<!-- As part of future work, we plan to integrate this approach into our framework to reduce the labor cost associated with human annotations. -->
<!-- Unlike [1], which constructs a causal graph, the goal of this work is to develop new Taylor additive models to enhance both the accuracy and interpretability of predictions. -->

Q2: Figure for learned shape function
--
**A2:** We included another figure to show the shape functions of concepts here (https://anonymous.4open.science/r/CAT-KDD-2024-AC7F). Fig. 1 elucidates how values of certain concepts, such as Location and Property, influence the listing price. For example, Location values in [-0.58,-0.59] have a positive impact on the listing price. Consequently, users can discern the original features related to these concepts, enabling more detailed explanations.

<!-- Also, we offer the shape functions of three second-order concept interactions in Fig. 2. -->

==(length 499)==
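For reference, shape-function curves of this kind can be produced with a recipe like the one below (an illustrative sketch continuing the snippets above, not the exact script behind Fig. 1): sweep one concept over a grid of values while holding the others at zero, and record the change in the model output.

```python
import torch

@torch.no_grad()
def shape_function(head, concept_idx, grid, num_concepts):
    """Illustrative first-order shape function for one concept:
    vary concept `concept_idx` along `grid`, keep all others at 0."""
    c = torch.zeros(len(grid), num_concepts)
    c[:, concept_idx] = torch.as_tensor(grid)
    baseline = head(torch.zeros(1, num_concepts))  # bias-only output
    return head(c) - baseline                      # per-value contribution

# Example: contribution curve of concept 3 over [-1, 1].
contrib = shape_function(head, concept_idx=3,
                         grid=torch.linspace(-1, 1, 101), num_concepts=10)
```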
Q3: Do experiments on COMPAS
--
**A3:** We assess our method on COMPAS in Table 1. It outperforms the baselines in both accuracy and computational costs.

Table 1: Evaluation on COMPAS.

|COMPAS|Acc↑|Macro-F1↑|# of Paras↓|
|-|-|-|-|
|NAM|0.6699|0.6623|40,326|
|NBM|0.6742|0.6708|64,664|
|SPAM(Order 2)|0.6659|0.6569|82,254|
|SPAM(Order 3)|0.6688|0.6608|124,982|
|CAT(Order 2)|0.6772|0.6710|14,354|
|CAT(Order 3)|**0.6793**|**0.6726**|18,514|

==(length 447)==

<!-- However, we did not do experiments on MIMIC since it requires expert knowledge for grouping medical-related features using limited metadata. In contrast, our method does not rely heavily on human experts and can learn arbitrary values of encoded concepts. -->

<!-- Table 2: Comparison of different methods using MNIST.

| MNIST | Acc↑ | Macro-Prec↑ | Macro-F1↑ | Number of Parameters↓ | Throughput (examples/sec)↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|
| EBM | 0.9885 | 0.9423 | 0.9422 | 61,400 | N/A |
| NAM | 0.9723 | 0.8635 | 0.8599 | 40,326 | 46,684 |
| NBM | 0.9770 | 0.8878 | 0.8843 | 64,720 | 51,247 |
| SPAM (Order 2) | 0.9860 | 0.9318 | 0.9300 | 83,862 | 58,973 |
| SPAM (Order 3) | 0.9883 | 0.9426 | 0.9414 | 128,998 | 30,585 |
| CAT (Order 2) | 0.9892 | 0.9469 | 0.9458 | **880** | **67,097** |
| CAT (Order 3) | **0.9902** | **0.9517** | **0.9510** | 5,200 | 62,981 |
-->

Q4: Missing related work [2,3,4].
--
**A4:** [2] mainly addressed the leakage issue in concept bottleneck models (CBM). Additionally, [3] tried to improve the robustness of CBM under malicious attacks via adversarial training. [4] illustrated that current joint training of CBM might not adequately explain model predictions.

<!-- These discussions will be incorporated into our revised version. -->

==(length 324)==

<!-- really explain model predictions. We will add these discussions in our revised version. -->

Q5: A motivation figure.
--
**A5:** Will do.

==(length 42)==

Q6: Line 259: "so that we can manually ... representing concepts", provide grounded literature on this assumption.
--
**A6:** We use the COMPAS dataset as an example to explain this assumption. Its input features include Age, Sex, Race, Priors Count, Charge Degree, and Custody Length. Based on their semantics, we can divide these features into two groups: Demographic (first three) and Criminal History (last three). Alternatively, feature grouping can be done empirically by clustering methods, as studied in prior works [5,6]. A sketch of both options follows below.

[5] Approaches for the clustering of geographic metadata. ISPRS, 2022.
[6] Review of feature selection approaches based on grouping of features. PeerJ, 2023.

==(length 668)==
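The sketch below is illustrative only: the column names follow the COMPAS example above, the data matrix is a random stand-in, and the correlation-based clustering is one generic choice rather than the exact procedure of [5,6].

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Option 1: manual, semantics-based grouping (the COMPAS example above).
concept_groups = {
    "Demographic": ["Age", "Sex", "Race"],
    "Criminal History": ["Priors Count", "Charge Degree", "Custody Length"],
}

# Option 2: empirical grouping by clustering features on the similarity
# of their columns (correlation distance on a hypothetical data matrix X).
X = np.random.randn(500, 6)            # stand-in for the real dataset
dist = 1 - np.abs(np.corrcoef(X.T))    # feature-to-feature distances
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"  # `affinity` on older scikit-learn
).fit_predict(dist)
```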
Q7: About tuning hyperparameters.
--
**A7:** We used grid search to tune the hyperparameters of TaylorNet, as detailed in Appendix A1. A compact illustration of the search loop is given below.

<!-- The insight for tuning TaylorNet is as follows. To reduce the parameter count, opting for a smaller rank such as 10 is advisable. However, this method may compromise performance. Conversely, for enhanced accuracy, increasing the rank is recommended, but not excessively, to mitigate overfitting. -->
<!-- Hence, parameter tuning based on this insight is relatively simple. -->

==(length 123)==
==**(total 2947)**==

<!-- If we want to reduce the number of parameters and mitigate overfitting, we can choose a small rank like 10. However, it could hurt the model performance. Conversely, if we would like to improve accuracy, we will increase the rank, but not too much, to address overfitting. Thus, it is not hard to tune the parameters based on this insight. -->
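The grids below are hypothetical; the actual search spaces are listed in Appendix A1, and `train_and_eval` is a stub standing in for a single training run that returns validation accuracy.

```python
from itertools import product

def train_and_eval(rank, lr):
    # Stub: train one TaylorNet configuration, return validation accuracy.
    return 0.0

grid = {"rank": [8, 16, 32], "lr": [1e-3, 3e-4]}
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=lambda cfg: train_and_eval(**cfg),
)
```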
<!-- ----------------------------- -->

## To Reviewer Bn6s

==(please reply to his comments later)==

## Follow-up response to reviewer

Thank you for your careful examination of our experimental results.

Q1: I have a major concern about your Fig. 3 provided in the anonymous link. I notice that most salient second-order concepts involve _BackgroundBrightness_. This may imply potential overfitting on this dataset. Also, this result brings a new question to me: why does _BackgroundBrightness_ itself contribute little to the classification, while _BackgroundBrightness x BackgroundBrightness_ shows a strong contribution?
--
**A1:** Per your suggestion, we have investigated the potential overfitting concern regarding our model. However, we discovered that the second-order CAT model exhibited symptoms of underfitting when applied to the learned concepts from the CelebA dataset. This underfitting stemmed from our initial choice of rank 8 for the tensor decomposition within TaylorNet, which is too low to effectively capture the relationship between the second-order concepts and the target variable. By increasing the rank to 16, we have improved the prediction accuracy of the second-order CAT, as reported in Table 1 below, and revised Figure 3 for interpreting its classification of genders on the anonymous website (https://anonymous.4open.science/r/CAT-KDD-2024-AC7F).

In particular, the adjustment in rank has led to a notable change: the second-order concept _Background Brightness x Background Brightness_ no longer dominates the second-order contributions or shows a stronger contribution than _Background Brightness_ itself, as observed in Figure 3. However, _Background Brightness_ remains integral to several significant second-order concepts, such as _Background Brightness x Hair Color_. The reason for this is the model's nuanced perception of Hair Color, which dynamically shifts based on _Background Brightness_, similarly to human visual perception. Additionally, we can observe the increased contributions of _Hair Length_ and _Hair Fringe_ to the classification of genders, underscoring how our refined model now better mirrors gender identification in humans.

Table 1: Evaluation on CelebA averaged over 3 random seeds.

| CelebA | Acc↑ | Macro-F1↑ | Number of Parameters↓ |
|-|-|-|-|
| NAM | 0.7441 | 0.7226 | 67,210 |
| NBM | 0.7455 | 0.7231 | 65,076 |
| SPAM (Order 2) | 0.7468 | 0.7129 | 136,822 |
| SPAM (Order 3) | 0.7642 | 0.7385 | 207,634 |
| CAT (Order 2)* | 0.7609 | 0.7446 | **4,896** |
| CAT (Order 3) | **0.7728** | **0.7579** | 11,760 |

## Follow-up response to reviewer

Thank you for your prompt response to our rebuttal and your insightful suggestions.

Q1: MNIST is a toy dataset which cannot be used to justify the strength of this method on images.
--
**A1:** We have conducted additional experiments to further justify the applicability of our proposed method for image processing, using the CelebA dataset for gender classification. As shown in Table 1, our method outperforms all interpretable baselines while utilizing fewer parameters.

Table 1: Evaluation on CelebA averaged over 3 random seeds.

| CelebA | Acc↑ | Macro-F1↑ | Number of Parameters↓ |
|:---:|:---:|:---:|:---:|
| NAM | 0.7441 | 0.7226 | 67,210 |
| NBM | 0.7455 | 0.7231 | 65,076 |
| SPAM (Order 2) | 0.7468 | 0.7129 | 136,822 |
| SPAM (Order 3) | 0.7642 | 0.7385 | 207,634 |
| CAT (Order 2) | 0.7599 | 0.7435 | **1,440** |
| CAT (Order 3) | **0.7728** | **0.7579** | 11,760 |

Q2: The authors can focus more on interpretability rather than speed. For online applications, we actually do not need it to be explainable, as the evidence cannot be processed by humans in time.
--
**A2:** Per your suggestion, we have showcased the gender prediction contributions of 9 high-level concepts acquired through disentangled representation learning on the CelebA dataset in Figure 3, available on the anonymous website (https://anonymous.4open.science/r/CAT-KDD-2024-AC7F). From this figure, it is evident that certain concepts such as Skin Tone, Hair Azimuth, and Hair Length exhibit a significant influence on the model's gender predictions, aligning closely with human reasoning at a high level. Furthermore, we included Figure 4 to delineate the shape functions of these concepts, providing clarity on how variations in specific concept values, such as Skin Tone and Hair Length, impact predictions of female gender. For instance, individuals with a darker Skin Tone are more likely to be classified as male, whereas those with a longer Hair Length tend to be classified as female. Consequently, by employing the outlined concept-based explanation method, users can interpret the model's decision-making process in time with relative ease.

Reminder to reviewer
--
Dear Reviewer Bn6s,

We would like to thank you for taking the time to review our paper and for the insightful comments. We have addressed all the comments and suggestions that you made. In particular, as suggested, we have **conducted new experiments on the MNIST dataset to demonstrate that our proposed method is not only suitable for image data, but can also improve prediction accuracy and computation cost**. Please kindly let us know if you have any further concerns. We are most grateful for any feedback you give, as it is valuable for improving our work. If no further question remains, we would be grateful if you could consider raising our score. Thank you very much!

Q1: This method is problematic for images.
--
**A1:** We have conducted additional experiments to demonstrate the applicability of our method to image classification on MNIST using disentangled representations. Based on prior work [1, 2], disentangled representation learning can extract high-level latent factors like style and shape from MNIST. Then, the disentangled factors can be fed into the TaylorNet. As shown in Table 1, our method outperforms all interpretable baselines while utilizing fewer parameters.

Table 1: Evaluation on MNIST.

|MNIST|Acc↑|Macro-Prec↑|Macro-F1↑|Number of Paras.↓|Throughput (samples/sec)↑|
|-|-|-|-|-|-|
|EBM|0.9885|0.9423|0.9422|61,400|N/A|
|NAM|0.9723|0.8635|0.8599|40,326|46,684|
|NBM|0.9770|0.8878|0.8843|64,720|51,247|
|SPAM(Order 2)|0.9860|0.9318|0.9300|83,862|58,973|
|SPAM(Order 3)|0.9883|0.9426|0.9414|128,998|30,585|
|CAT(Order 2)|0.9892|0.9469|0.9458|**880**|**67,097**|
|CAT(Order 3)|**0.9902**|**0.9517**|**0.9510**|5,200|62,981|

[1] Mathieu et al. Disentangling disentanglement in variational autoencoders. ICML, 2019.
[2] Shao et al. Rethinking controllable variational autoencoders. CVPR, 2022.
==(length 1188)==

Q2: Do we need this fast training?
--
**A2:** Yes. Fast training holds significant importance in real-world scenarios like autonomous driving and stock price prediction. Not only does it enhance model training and inference speed, but it also contributes to minimizing energy consumption, a critical factor in today's context.

==(length 346)==

Q3: Typos: line 22: simply the process -> simplify the process.
--
**A3:** Thanks. We will proofread the whole manuscript in our revision.

==(length 156)==
==(total 1690)==

## To Reviewer MgUp

Second reminder to reviewer
--
Dear Reviewer MgUp,

We are writing to kindly remind you that we have diligently addressed all the concerns and suggestions you have made. Specifically, we have **compared the proposed method with a recent baseline, GPNAM, and further demonstrated the strength of our method on the image classification task with the MNIST dataset**. Should you have any additional concerns or suggestions, please do not hesitate to share them with us. Your feedback is invaluable to us and plays a crucial role in refining our work. Additionally, we would like to inquire whether you would consider improving our paper's score given the revisions we have made. Once again, we appreciate your time and effort in reviewing our paper and providing constructive feedback.

Reminder to reviewer
--
Dear Reviewer MgUp,

We would like to thank you for taking the time to review our paper and for the insightful comments. We have addressed all the comments and suggestions that you made. In particular, as suggested, we have **compared the proposed method with a recent baseline, GPNAM, and demonstrated an interpretable solution utilizing disentangled representation learning as a feature encoder for the image classification task on the MNIST dataset**. Please kindly let us know if you have any further concerns. We are most grateful for any feedback you give, as it is valuable for improving our work. If you do not have further questions, we would be grateful if you could consider raising our score. Thank you very much!

Q1: No baselines after 2023 were considered.
--
**A1:** Per your suggestion, we have compared our method with a recent baseline, GPNAM [1]. We can observe from the following tables that our method improves prediction accuracy on all three datasets.

Table 1: Evaluation of GPNAM using Airbnb.

|Airbnb|RMSE↓|Number of Parameters↓|
|-|-|-|
|GPNAM|0.6664|**12,601**|
|CAT(Order 2)|0.5486|48,742|
|CAT(Order 3)|**0.5461**|52,990|

Table 2: Evaluation of GPNAM using Diabetes.

|Diabetes|Acc↑|Macro-F1↑|
|-|-|-|
|GPNAM|0.8052|0.6707|
|CAT(Order 2)|0.8286|0.7269|
|CAT(Order 3)|**0.8295**|**0.7270**|

Table 3: Evaluation of GPNAM using UCI-HAR.

|UCI-HAR|Acc↑|Macro-Prec↑|Macro-F1↑|
|-|-|-|-|
|GPNAM|0.9792|0.9386|0.9366|
|CAT(Order 2)|0.9814|0.9451|0.9431|
|CAT(Order 3)|**0.9829**|**0.9504**|**0.9480**|

[1] Zhang, W., Barr, B., & Paisley, J. Gaussian Process Neural Additive Models. AAAI, 2024.

==(length 855)==

Q2: Some more complex tasks and networks need to be considered.
--
**A2:** Based on your suggestion, we have conducted experiments on the MNIST dataset. The core idea is to employ disentangled representation learning [1,2] as a concept encoder to uncover the high-level latent factors from the image data, which are subsequently fed into the TaylorNet. The results are presented in Table 2. Notably, our method outperforms the baselines in both accuracy and parameter count.
Specifically, both the second-order and third-order CAT outperform all baselines while utilizing significantly fewer parameters.

Table 2: Evaluation on MNIST.

|MNIST|Acc↑|Macro-Prec↑|Macro-F1↑|Number of Paras.↓|Throughput (samples/sec)↑|
|:---:|:---:|:---:|:---:|:---:|:---:|
|EBM|0.9885|0.9423|0.9422|61,400|N/A|
|NAM|0.9723|0.8635|0.8599|40,326|46,684|
|NBM|0.9770|0.8878|0.8843|64,720|51,247|
|SPAM(Order 2)|0.9860|0.9318|0.9300|83,862|58,973|
|SPAM(Order 3)|0.9883|0.9426|0.9414|128,998|30,585|
|CAT(Order 2)|0.9892|0.9469|0.9458|**880**|**67,097**|
|CAT(Order 3)|**0.9902**|**0.9517**|**0.9510**|5,200|62,981|

[1] Mathieu, Emile, et al. "Disentangling disentanglement in variational autoencoders." ICML, 2019.
[2] Shao et al. "Rethinking controllable variational autoencoders." CVPR, 2022.

==(length 867)==
==(total 1722)==

## To Reviewer EQ6w

<!-- Thank you very much for your insightful comments. Below, -->

Second reminder to reviewer
--
Dear Reviewer EQ6w,

We are writing to kindly remind you that we have diligently addressed all the concerns and suggestions you have made. In particular, we have **discussed the specific advantage of the proposed method in computational complexity over existing works, and conducted experiments on two additional datasets, COMPAS and MNIST, to showcase the diverse applications of our method**. Should you have any further concerns or suggestions, please do not hesitate to share them with us. Your feedback is important to us, and we remain committed to further improving our work based on your insights. Additionally, we are curious whether you would be willing to consider revising our paper's score. Once again, we sincerely appreciate your time and expertise in reviewing our paper and providing valuable feedback.

Reminder to reviewer
--
Dear Reviewer EQ6w,

We would like to thank you for taking the time to review our paper and for the insightful comments. We have addressed all the comments and suggestions that you made. In particular, we have **discussed the specific advantage of the proposed method in computational complexity over existing solutions, and conducted experiments on two additional datasets, COMPAS and MNIST, to demonstrate its variety of applications**. Please kindly let us know if you have any further concerns. We are most grateful for any feedback you give, as it is valuable for improving our work. If you do not have further questions, we would be grateful if you could consider raising our score. Thank you very much!

Q1: Partitioning the features into concepts is not end-to-end learning.
--
**A1:** The prior work [1] has illustrated that end-to-end training, namely joint training, might encounter challenges in interpreting prediction outcomes, while independent training can enhance interpretability. Nevertheless, we will discuss this limitation in the revised version.

**References:**

[1] Do concept bottleneck models learn as intended? ICLR, 2021.

==(length 453)==

<!-- [2] Concept-level model interpretation from the causal aspect, TKDE, 2023 -->
<!-- For future work, we will overcome this issue by employing the automatic clustering method in [2]. -->
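To illustrate what independent (two-stage) training looks like in this setting, here is a schematic sketch reusing the encoder and polynomial head defined earlier; the data loop is stubbed with random tensors and the optimizer settings are arbitrary:

```python
import torch

# Stage 1's encoder is fit separately and then frozen, so the polynomial
# head (stage 2) is learned on fixed, human-inspectable concepts, in
# contrast to joint end-to-end training.
for p in encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Stand-in data loader: one random batch of "images" and labels.
for images, labels in [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))]:
    with torch.no_grad():
        concepts = encoder(images)   # fixed concept activations
    loss = loss_fn(head(concepts), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```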
Q2: Lack of demonstration of experimental randomness.
--
**A2:** Please note that we have already compared our method with the baselines using 3 random seeds in our work. Please refer to the captions of Tables 3 and 6 in our manuscript.

==(length 257)==

Q3: Better explanation of improvement scenarios would add value.
--
**A3:** We would like to note that our method substantially reduces the number of model parameters, while slightly outperforming SPAM in terms of performance. In addition to the results on the UCI-HAR dataset in Table 3 and Table 4 of the submission, our experimental results on the MNIST dataset (see **A1** in the response to **Reviewer Bn6s**) further illustrate that our method scales much better than the baselines, especially for a multi-class classification problem, despite the small improvement in prediction accuracy.

==(length 584)==

Q4: Evaluation on a metric quantitatively.
--
**A4:** Note that current generalized additive models do not have a quantitative metric for evaluating interpretability, so it is a good idea to develop a new one in the future.

==(length 239)==

Q5: Do experiments on more datasets beyond 3.
--
**A5:** We have done additional experiments on two more datasets: COMPAS and MNIST. Please see the tables in **A3** and **A1** in the responses to **Reviewer u6b5** and **Bn6s**, respectively. Our method outperforms the baselines in both accuracy and computational costs.

==(length 335)==
==(total 1868)==

<!-- Table 1: Comparison of our method and the baselines using COMPAS.
| CAT (Order 3) | **0.6793** | **0.6726** | 18,514 | 5,273 |

Table 2: Comparison of our method and the baselines using MNIST.

| MNIST | Acc↑ | Macro-Prec↑ | Macro-F1↑ | Number of Parameters↓ | Throughput (examples/sec)↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|
| EBM | 0.9885 | 0.9423 | 0.9422 | 61,400 | N/A |
| NAM | 0.9723 | 0.8635 | 0.8599 | 40,326 | 46,684 |
| NBM | 0.9770 | 0.8878 | 0.8843 | 64,720 | 51,247 |
| SPAM (Order 2) | 0.9860 | 0.9318 | 0.9300 | 83,862 | 58,973 |
| SPAM (Order 3) | 0.9883 | 0.9426 | 0.9414 | 128,998 | 30,585 |
| CAT (Order 2) | 0.9892 | 0.9469 | 0.9458 | **880** | **67,097** |
| CAT (Order 3) | **0.9902** | **0.9517** | **0.9510** | 5,200 | 62,981 |
-->

<!-- ---------reviewer ----------- -->

## Response to Reviewer 7KMF

Second reminder to reviewer
--
Dear Reviewer 7KMF,

We are writing to kindly remind you that we have diligently addressed all the concerns and suggestions you have made. In particular, as suggested, we have **provided comprehensive insights into the novelty and technical contributions of the proposed method, and highlighted its advancements in both accuracy and computation cost compared to existing works**. If there are any additional concerns or suggestions you wish to discuss, please do not hesitate to share them with us. Your feedback is immensely valuable to us, and we are committed to incorporating any further improvements you deem necessary. Additionally, we are curious whether you would be willing to consider revising our paper's score. Once again, we sincerely appreciate your thoughtful evaluation and the opportunity to refine our work based on your insights.

Reminder to reviewer
--
Dear Reviewer 7KMF,

We would like to thank you for taking the time to review our paper and for the insightful comments. We have addressed all the comments and suggestions that you made. In particular, we have **discussed at length the novelty and technical contributions of the proposed method, as well as its improvement in accuracy and computation cost compared to existing works**. Please kindly let us know if you have any further concerns.
We truly appreciate this opportunity to improve our work and shall be most grateful for any feedback you could give us. If you do not have further questions, we would be grateful if you could consider raising our score. Thank you very much!

Q1: The idea of using concept learning and tensor decomposition are not novel.
--
**A1:** We would like to highlight the contributions of our work below. Our concept-based learning is different from existing works, which rely heavily on expert annotations; in contrast, our approach can reduce the label cost of human annotations. As noted by **Reviewer Bn6s**, "it can alleviate the labeling effort for concept models, which is an important direction." Furthermore, we introduce a pioneering TaylorNet that explicitly learns the mapping function between input and output through polynomials, eliminating the necessity for activation functions. This unique feature empowers our approach to provide more interpretable predictions using polynomials. To the best of our knowledge, we are the first to employ Tucker tensor decomposition to reduce the model parameters of Taylor polynomials.

==(length 916)==
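As a back-of-the-envelope illustration of why a Tucker factorization saves parameters (hypothetical sizes, not the paper's actual configuration): a dense order-3 coefficient tensor over d concepts has d^3 entries, while a Tucker factorization stores three d x r factor matrices plus an r x r x r core.

```python
import numpy as np

# Hypothetical sizes: d concepts, Tucker rank r for the order-3 term.
d, r = 50, 8
full_params = d ** 3                    # dense coefficient tensor: 125,000
tucker_params = 3 * d * r + r ** 3      # factors + core: 1,712

# Reconstructing the dense tensor from a Tucker factorization:
core = np.random.randn(r, r, r)
U = [np.random.randn(d, r) for _ in range(3)]
W3 = np.einsum('abc,ia,jb,kc->ijk', core, U[0], U[1], U[2])
assert W3.shape == (d, d, d)
```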
Q2: The proposed method is a bit straightforward.
--
**A2:** Our CAT is a simple yet effective method for interpretable machine learning. Thus, it can be readily deployed in real-world applications like finance and healthcare.

==(length 255)==

Q3: Experiments do not exhibit significant improvements over SOTA.
--
**A3:** Please note that our method can significantly reduce the number of model parameters while performing slightly better than SPAM. In addition, though our method has a similar number of model parameters compared to NBM, it achieves higher accuracy than NBM. After a comprehensive comparison in both accuracy and computational costs, we conclude that our method achieves a better trade-off between them.

==(length 513)==

<!-- While our method achieves slightly higher accuracy than the baselines, it significantly reduces the number of model parameters. -->

Q4: About contributions of this paper
--
**A4:** Our significant contribution lies in the development of an interpretable neural architecture termed TaylorNet. This novel framework has the capability to learn the non-linear mapping function between input and output exclusively through polynomials, without relying on traditional activation functions. Consequently, our proposed method not only improves model performance but also enhances interpretability through the utilization of polynomials. Furthermore, as highlighted by **Reviewer Bn6s**, our approach has the added benefit of reducing the label cost associated with human annotations.

Q5: About Source Code
--
**A5:** We will make the source code publicly available upon publication.

==(length 670)==
==(total 2354)==

-------------------

## Follow-up questions from Reviewer MgUp

We thank the reviewer for the further discussion. Below, we address all the comments and concerns.

Q1: I observe that the results of GPNAM are not better than the old baselines, so why are you using GPNAM? I look forward to strong baselines rather than new baselines. I think at least two or three baselines after 2023 should be considered.
--
**A1:** In our effort to compare our method with recent research beyond 2023, we identified only two studies [1] [2] related to our work. Given the time constraint of the rebuttal period, we opted to evaluate our approach against the latest GPNAM model featured in AAAI 2024. It is worth noting that GPNAM strives to minimize the model parameters associated with NAM while maintaining good performance. However, GPNAM did not surpass the other baselines due to its smaller number of parameters, as evident in Tables 1 and 2 below. For the second method [2], considering the time constraint, we will compare against it in the revised version.

<!-- However, our proposed method can achieve a better trade-off between accuracy and computational cost compared to GPNAM and other baselines. ==there was an issue with parameter count for EBM in the submission, we may discuss that== -->

Table 1: Evaluation on the Airbnb dataset.

|Airbnb|RMSE↓|Number of Parameters↓|
|-|-|-|
|EBM|0.6344|28,110|
|NAM|0.6681|822,780|
|NBM|0.6637|76,897|
|SPAM(Order 2)|0.5664|1,719,219|
|SPAM(Order 3)|0.5560|2,604,292|
|GPNAM|0.6664|**12,601**|
|CAT(Order 2)|0.5486|48,742|
|CAT(Order 3)|**0.5461**|52,990|

Table 2: Evaluation on the Diabetes dataset.

|Diabetes|Acc↑|Macro-F1↑|Number of Parameters↓|
|-|-|-|-|
|EBM|0.8269|0.7031|351,165|
|NAM|0.8242|0.7199|1,142,750|
|NBM|0.8257|0.7167|82,071|
|SPAM(Order 2)|0.8230|0.7242|2,338,102|
|SPAM(Order 3)|0.8272|0.7188|3,617,729|
|GPNAM|0.8052|0.6707|**35,002**|
|CAT(Order 2)|0.8286|0.7269|51,310|
|CAT(Order 3)|**0.8295**|**0.7270**|51,646|

[1] Zhang et al. Gaussian Process Neural Additive Models. AAAI, 2024.
[2] Ibrahim et al. GRAND-SLAMIN' Interpretable Additive Modeling with Structural Constraints. NeurIPS, 2023.

Q2: The best results for most datasets are achieved by EBM published in 2015. What is the reason? Did you tune these baselines carefully? And why not use stronger baselines?
--
**A2:** We would like to note that some other baseline models in our submission achieve lower RMSE or higher accuracy/F1 compared to EBM on the Airbnb and Diabetes datasets, respectively. Please see the results in Tables 1 and 2 in **A1**. Moreover, the superior accuracy of EBM on the MNIST dataset can be attributed to its larger number of model parameters, as highlighted in the following Table 3. However, when we experimentally reduce the model parameters of EBM (referred to as EBM*), its performance diminishes in comparison to SPAM.

Table 3: Evaluation on the MNIST dataset.

|MNIST|Acc↑|Macro-F1↑|Number of Parameters↓|
|:---:|:---:|:---:|:---:|
|EBM|0.9885|0.9422|122,880|
|EBM*|0.9808|0.9030|86,800|
|NAM|0.9723|0.8599|40,326|
|NBM|0.9770|0.8843|64,720|
|SPAM(Order 2)|0.9860|0.9300|83,862|
|SPAM(Order 3)|0.9883|0.9414|128,998|
|CAT(Order 2)|0.9892|0.9458|**880**|
|CAT(Order 3)|**0.9902**|**0.9510**|5,200|

Q3: The details of Q2 are not clear.
--
**A3:** We have outlined our experimental process on the image dataset MNIST as follows. First, we utilized VAE-based disentangled representation learning methods to extract the high-level concepts from the image data. Take the MNIST dataset as an example: we can disentangle latent concepts such as style, shape, and thickness from the input images. In our experiments, we employed a disentangled representation learning method as a feature encoder to learn the high-level concepts from the MNIST (or CelebA) images. Then, we fed these high-level concepts into the TaylorNet and the baselines for classification. The evaluation results demonstrated that our method outperforms the baselines.
<!-- Table 3: Evaluation on the UCI-HAR dataset.

|UCI-HAR|Acc↑|Macro-F1↑|Number of Parameters↓|
|-|-|-|-|
|EBM|**0.9867**|**0.9593**|12,661,020|
|NAM|0.9785|0.9346|2,398,500|
|NBM|0.9792|0.9367|**124,077**|
|SPAM (Order 2)|0.9809|0.9414|7,657,734|
|SPAM (Order 3)|0.9801|0.9388|11,601,687|
|GPNAM|0.9792|0.9366|336,060|
|CAT (Order 2)|0.9814|0.9431|161,138|
|CAT (Order 3)|0.9829|0.9480|227,634|
-->