## Global response
Dear Reviewers,
First, we would like to express our gratitude for your time and the effort put into thoroughly reviewing our manuscript.
We are pleased to see that you found our manuscript well-written and clear. Your recognition of the novelty of our ClimDetect dataset and its relevance to the critical issue of climate change detection and attribution is greatly appreciated. We also value your positive feedback on our comprehensive presentation of the background domain knowledge.
We want to emphasize that the primary aim of our ClimDetect dataset is to serve as a benchmark for standardizing climate change detection and attribution tasks, thereby facilitating objective comparisons across various models. While some reviewers have questioned the validity of our Vision Transformer baselines and GradCAM interpretations, we wish to clarify that we are not advocating these as superior models. Instead, we are presenting them as potential tools, worthy of further exploration, that have yet to be widely adopted in the climate science community.
Below, we have addressed each of your comments point-by-point. Should there be any parts that remain unclear, or if we have misunderstood any of your comments, please let us know. We are committed to promptly addressing all concerns during the author-reviewer discussion period.
Sincerely,
ClimDetect authors
## Reviewer ksut (R1)
We sincerely thank you for taking the time to review our manuscript and for your valuable suggestions. We are so happy to hear that you found our ClimDetect dataset to be of high quality, the topic engaging, and the content informative. We have provided detailed responses to your comments and suggestions below. Please feel free to reach out if any part of our response is unclear or if new questions arise.
> Explicitly Explain Domain-Specific Terms: Proper explanations should be provided when terms first appear. For example, terms like SSP2-4.5, SSP3-7.0, CMIP6, IPCC, SSP, historical, and scenarioMIP should be clearly explained.
We have added 'Appendix A: Glossary' to the manuscript, which details domain-specific terms such as SSP2-4.5, SSP3-7.0, CMIP6, IPCC, SSP, historical, and ScenarioMIP. Each term is now briefly defined at its first use in the text, with a pointer to the Glossary for a more detailed explanation.
> Additionally, clarify whether SSP2-4.5 (line 132) and SSP245 (line 147) refer to the same concept.
They are identical. In the updated manuscript, we have standardized the notation by replacing "SSP245" with "SSP2-4.5" and "SSP370" with "SSP3-7.0" across all instances to avoid confusion and maintain consistency.
> Explain Details of Feature Selection: More meteorological features could affect global temperature, such as pressure and radiation. The feature selection criteria should be explained, and an ablation study is needed to justify the choices.
The rationale behind our feature selection is detailed in lines 134-137 of our manuscript, where we selected temperature, precipitation, and humidity as the three primary variables for our dataset. These variables are central to climate response analysis and have been extensively studied in previous detection and attribution (D&A) research, ensuring that our dataset aligns with well-established scientific methodologies and facilitates comparable and replicable research outcomes.
Additionally, please note that any climate variables could also be used as input features. The primary goal of ClimDetect is not merely to predict the target variable using any available data but to detect climate change signals within input fields, with a target variable serving as a proxy for climate change. Given this aim, an ablation study to justify the choice of specific features is not necessary by design.
> Clarify the Concept of “Fingerprints”: The authors claim that ClimDetect can identify specific “fingerprints” in climate response variables.
We agree that our submitted manuscript did not explicitly clarify the concept of "fingerprints". To address this, we have included a precise definition in the newly created Appendix A Glossary in the updated manuscript: "Fingerprint: A distinctive spatial pattern in climate response variables attributable to external forcing mechanisms, such as greenhouse gases or aerosols. Fingerprints are derived from models or analyses that distinguish these specific patterns (signals) from natural internal variability (noise)."
> However, the use of Grad-CAM alone may not sufficiently define “fingerprints” in meteorology.
We acknowledge that Grad-CAM alone may not fully capture the complexities of “fingerprints” in meteorology. While we use Grad-CAM to visualize fingerprints from Vision Transformer (ViT) models, we do not claim it as the definitive tool for defining fingerprints in the climate sciences. Instead, our ClimDetect dataset is intended as a standardized and versatile platform for exploring a variety of advanced deep learning techniques in climate change detection and attribution tasks. We maintain neutrality regarding the choice of model architecture or interpretation method.
We recognize the limitations of Grad-CAM for interpreting ViT models, as specifically discussed in lines 295-301 and 307-311 of our manuscript. These sections highlight the need for further exploration of diverse interpretative tools for climate change D&A problems.
> Further explanation is needed on distinguishing human-induced climate signals from natural variability.
We believe that Section 2.1 (L65-74), together with the newly added "fingerprint" entry in the Glossary, provides the necessary background on human-induced climate signals and natural variability. Additionally, we have revised the Introduction’s first paragraph to further clarify the challenge of distinguishing subtle, long-term human-induced signals from the more volatile patterns of natural climate variability.
> Proofread and Correct Typos: Address minor typos, such as “designed designed” (line 37), “]” (figure 2 caption), and “SSP3-7.0. scenarios.” (line 132).
Thank you for your detailed reading. We have corrected typos throughout the manuscript, including the ones you pointed out.
> The paper did not explain limitations well. I suggest adding the following discussions:
(1) What are the challenges in data source selection and feature selection?
(2) What are the challenges in differentiating human-induced climate signals from natural variability?
(3) What are the challenges in promoting benchmarks to encourage broader collaboration?
We have addressed points (1) and (2) throughout the revised manuscript, for instance:
- The updated Introduction’s first paragraph highlights the ClimDetect task's core challenge: distinguishing the subtly emerging, long-term signals of human-induced warming from the more volatile, transient patterns of natural climate variability.
- In the updated Section 3, we have clarified that any climate variable can serve as an input feature for ClimDetect, which is designed not just to predict the target variable, but to detect climate change signals in these inputs.
- In the updated Section 5, we have acknowledged the dataset’s limitations, noting it does not encompass the entire CMIP6 dataset due to technical and practical constraints.
Regarding point (3), the challenges of promoting benchmarks for broader collaboration are universal and not unique to our dataset. Thus, we focus on specific implications for our dataset.
> In post-processing, what is the reason for the step of “Removal of the Climatological Daily Seasonal Cycle?” Since the seasonal effects of the input have been removed, how can the model accurately predict the absolute value of the output?
The 'Removal of the Climatological Daily Seasonal Cycle' from the ClimDetect dataset is essential for emphasizing long-term climate trends over seasonal variations, which can mask more subtle interannual changes in the annual global mean temperature (AGMT). By normalizing against these strong seasonal fluctuations, the model better detects significant long-term trends, enhancing its predictive accuracy for AGMT. Removing the climatological daily cycle is a standard practice aligned with established methods in climate change detection and attribution studies. We have outlined this rationale in L178-179 of the original submission and provided additional detail in the updated manuscript.
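For illustration, the sketch below shows one common way to remove a daily climatological seasonal cycle and apply z-score standardization; the array names and synthetic data are hypothetical and do not represent the actual ClimDetect preprocessing code.

```python
import numpy as np

# Synthetic daily snapshots: (time, lat, lon); `doy` is a day-of-year index.
n_days, n_lat, n_lon = 3650, 64, 128
fields = np.random.rand(n_days, n_lat, n_lon)
doy = np.arange(n_days) % 365

# Climatological mean and standard deviation for each calendar day at every grid point.
clim_mean = np.stack([fields[doy == d].mean(axis=0) for d in range(365)])
clim_std = np.stack([fields[doy == d].std(axis=0) for d in range(365)])

# Remove the daily seasonal cycle and standardize (z-score) each grid point.
anomalies = (fields - clim_mean[doy]) / clim_std[doy]
```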
> Is there an ablation study to evaluate each post-processing step?
We have not conducted an ablation study for each post-processing step, as these methods are well-established practices in climate studies. Recognizing the potential for improved methodologies, in the updated Section 3, we have updated our manuscript to acknowledge that alternative normalization schemes may offer advantages over z-score standardization, and we encourage users to explore these options.
> In Table 2, mean removed data (tas-huss-pr_mr or tas_mr) always performs worse than the mean not removed data (tas-huss-pr or tas_only) on RMSE value. Does this indicate that the step of “Removal of the Climatological Daily Seasonal Cycle” is unnecessary?
No; these are two distinct operations. In the "mean removed" experiments, we remove the _spatial_ mean of each input climate field snapshot. Removing the climatological seasonal cycle, by contrast, eliminates the _temporal_ mean for each calendar day at every grid point and does not alter the spatial characteristics.
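To make the distinction concrete, the hedged sketch below applies the spatial-mean removal used in the "mean removed" experiments; the array name and shape are illustrative (compare with the temporal, day-of-year climatology removal sketched above).

```python
import numpy as np

fields = np.random.rand(3650, 64, 128)  # synthetic daily snapshots: (time, lat, lon)

# "Mean removed" variants: subtract each snapshot's own spatial mean so that
# only that day's spatial pattern remains (area weighting omitted for brevity).
spatial_mean_removed = fields - fields.mean(axis=(1, 2), keepdims=True)
```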
> Is there any plan to extend the benchmark to include more collaborations?
Yes, ClimDetect is a fully open-source project. We actively encourage participation through our community forum, available under the "community" tab on our Hugging Face repository. We welcome potential collaborators to engage with us on data sharing, method development, validation, testing, and further cross-disciplinary research between climate scientists and computer scientists. We hope this collaborative approach will continually enhance and extend the benchmark.
## Reviewer Fqp1 (R2)
Thank you very much for your review and feedback on our manuscript. We appreciate your acknowledgment of our dataset as a significant effort in advancing detection and attribution for human-caused climate change, and that you find the concept novel and worth exploring. We hope our responses clarify your questions and provide a better understanding of our work. Please let us know if you need any further clarification.
> Why such high resolution CMIP6 data used? Was downscaling considered?
Retaining sufficiently fine resolution is essential for ClimDetect. Finer resolutions provide more actionable information for regional climate policy decision-making. Additionally, they introduce a more challenging machine learning problem due to increased spatial variability, which is crucial for advancing ML approaches. The resolution of 2.8° x 2.8° in our dataset, although on the coarser end for CMIP6 models, was chosen to balance manageable data volume with the necessary challenge and practical utility for robust analysis.
> Not sure if the problem has enough input signals to substantially define the relationship between predictands and target (e.g. precip is noisy and might not be the right variable to use as an input).
There appears to be a misunderstanding about the objectives of ClimDetect. Our primary aim is not to accurately predict target variables but to detect climate change signals within input fields, using a target variable as a proxy for these changes. We have carefully selected these variables based on their prominence in recent, peer-reviewed climate detection and attribution (D&A) studies. We acknowledge that the noisiness of precipitation fields poses a challenge, as documented in lines 267-270 of our manuscript. However, this complexity is precisely what makes the ClimDetect dataset valuable, as it necessitates the application of advanced deep learning approaches. Furthermore, we highlight that using daily precipitation fields for climate fingerprinting represents a cutting-edge topic in contemporary research, exemplified by recent studies such as Ham et al. 2023 in Nature, referenced in our manuscript.
> Binning leads to expected values and may not represent a true value for large bin sizes for AGTM. Were lower bins considered?
We have not experimented with varying bin sizes for the Grad-CAM fingerprint maps. However, based on the additional Grad-CAM figures presented in Supplemental Information Section 4.3, finer bins are not expected to significantly alter our findings. For a given model, the Grad-CAM figures for all three bins with positive AGMT values show consistent spatial patterns across these bins. The figures for the zero-value bin ([-0.5, 0.5]) show different spatial patterns, likely because this bin does not contain signals indicative of climate warming.
> What are the biases in CMIP6 data that might affect the attribution?
Yes, biases in CMIP6 models could potentially impact the attribution performance of the ClimDetect model. However, these models represent the best available method for generating spatiotemporal data on future climates. To mitigate these biases, we carefully constructed our dataset using multiple models, incorporating a range of climate sensitivities. This approach is detailed in lines 200-206 of the manuscript, emphasizing our efforts to minimize model-specific biases.
## Reviewer u15x (R3)
We extend our heartfelt appreciation for your meticulous and constructive feedback. We are encouraged by your commendation of our novel dataset and experiments, and by your recognition of the work’s potential impact in the ML community. We have carefully considered your suggestions for improvement and made revisions to strengthen the paper. If any of our responses need further clarification, or if additional questions arise, please feel free to reach out. Thank you again for your valuable feedback.
> The paper could be tailored much better to an ML audience. It currently features several (~5) pages of background on climate science, which is certainly useful for the typical ML community member who is generally not trained in this field, but a lot of this could be summarized more succinctly in the main text and described in more detail in the Supplement. There are other aspects that would be more useful to emphasize in the main text, including:\
(1) A clear description of the structure of the dataset. What are the inputs and the outputs? What are their dimensions? What is the prediction task? Are there design decisions in formulating the task that users will have freedom to make (e.g. temporal horizon), or is it fixed? These are important to clearly describe as it will really help ML people understand what methods could be appropriate.\
(2) The challenges of the dataset. What makes the prediction task hard for existing ML approaches to solve? How will it induce the development of novel ML approaches? I think explainability is a key area with this dataset that is underemphasized in the paper.\
(3) An explicit, clear comparison to other ML for climate datasets. The notable datasets are discussed in related work, but comparisons of this dataset to those is not made clear, which is important for the ML community when deciding which dataset to work with. This is related to the point above about structure and task.
We appreciate your suggestions and have made several changes to our manuscript to address the first three points:
- We recognize that the description of the dataset’s structure in Section 3 "ClimDetect" (L132-155) needed to be more accessible. We have restructured this section to succinctly detail the dataset’s structure, inputs, outputs, their dimensions, and the prediction tasks, in order to help ML researchers quickly understand its framework and potential applications.
- We have revised the Introduction to better highlight the specific challenges associated with the climate change detection problem that our dataset addresses.
- We have enhanced Section 2.2, "Related Work - Climate Datasets for ML," to provide clearer comparisons between our dataset and existing ones, emphasizing its unique contributions and relevance.
> (4) What good performance looks like on this dataset. Is there any way to measure this beyond just “better than the baseline”? Worth at least a discussion point.
Regarding the last point, it is indeed an important question. First, we would like to explain our reasoning behind choosing RMSE as the main metric. The following points are well-established in the climate detection and attribution literature:
- RMSE has been shown to be a reliable proxy for detection sensitivity,
- The positive test for climate detection is the time of emergence for climate change signals, identified when model predictions for a given test period consistently exceed the reference (pre-global warming) period's confidence interval,
- Improved RMSE for model prediction correlates with improved sensitivity by tightening the distribution of both the reference and the test periods.
ML practitioners can evaluate model detection accuracy by measuring RMSE, as we have demonstrated. While we are open to developing a target-specific metric based on the concept of "emergence" mentioned earlier, we are cautious about its potential to complicate rather than help. We would be grateful if you could share more of your thoughts on this matter.
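For illustration only, here is a minimal sketch of the RMSE metric and the emergence-style detection test described above, under the assumption that one has model-predicted AGMT values for a pre-warming reference period and a later test period (function and variable names are hypothetical):

```python
import numpy as np

def rmse(pred, true):
    """Root-mean-square error between predicted and true AGMT values."""
    pred, true = np.asarray(pred), np.asarray(true)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def signal_has_emerged(ref_pred, test_pred, upper_quantile=97.5):
    """True when test-period predictions consistently exceed the upper bound
    of the reference (pre-warming) period's confidence interval."""
    upper = np.percentile(ref_pred, upper_quantile)
    return bool(np.all(np.asarray(test_pred) > upper))
```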
> The “Model” column of Table 2 should be split into “Architecture” and “Initialization” to make it clear which are ViTs, what are their settings (model parameter size, patch size) and how each was pre-trained.
We have updated Table 2 and its caption in our manuscript to indicate that the first four models are ViTs finetuned from their respective pretrained checkpoints. Detailed specifications, including model size and patch size, are now provided in Section 4.1 "Baseline Models and Training Details."
> As this data is differently distributed than natural images, it would be interesting (and should be straightforward) to test the impact of pre-training vs. random initialization. Does it actually help to transfer from weights learned from natural images?
We investigated this by pretraining a ViT-b/16 model from randomly initialized weights with masked image modeling (MIM) on the ClimDetect training set and then finetuning it for the regression task in the "tas-huss-pr" experiment. Our findings indicate no significant performance difference between the model finetuned from natural-image weights (Hugging Face checkpoint) and the one trained from scratch. Specifically, the R2 values were 0.98472 for the pretrained model and 0.98492 for the model trained from scratch. Based on these results, we opted to continue finetuning from the pretrained weights, as doing so did not compromise performance while simplifying the training process.
> There are better interpretability approaches than GradCAM specifically designed for transformers. Check out: https://arxiv.org/abs/2012.09838v1 Given interpretability is a key part of this task, doing this well seems important.
Thank you for sharing newer interpretability approaches specifically designed for transformers. In this study, the use of GradCAM was an initial foray into the interpretability of Vision Transformers within the ClimDetect models. Moving forward, we plan to explore a broader range of interpretative tools to probe deeper into ViT fingerprinting. We have revised Section 4.3 "Physical Interpretation" to clarify that while GradCAM is employed here, it is merely one of several possible approaches.
> They acknowledge one limitation about the use of GradCAM compared to using coefficients of the linear model, but no other limitations are mentioned. Here are a few initial suggestions (I encourage the authors to think of others that could be added):\
(1) Difficult evaluating/groundtruthing the interpretability.\
(2) Limited SSPs.\
(3) Only three variables (minor limitation given these are the primary factors for the task, but I think worth stating still).
We agree that groundtruthing the interpretability is indeed a difficult problem, so we have added a paragraph acknowledging these difficulties in Section 4.3 "Physical Interpretation".
However, we respectfully suggest that the last two points may not represent limitations of our study.
- We strategically selected SSP2 and SSP3 from the five available SSP scenarios due to their broad AGMT ranges. SSP1 was excluded because of its limited range (up to ~2°C), while SSP5, with ranges extending to ~6°C, is reserved for future generalizability tests. SSP4 was omitted due to its overlapping AGMT range with SSP2 and SSP3 and limited model outputs. This strategy prevents bias towards specific AGMT ranges, ensuring a balanced representation of climate states.
- As detailed in lines 134-137, we selected the three feature variables because they are essential climate variables representing climate states and are also extensively used in prior detection and attribution research. While other climate variables could be used as input features, we limited our selection to manage the dataset size effectively. We have updated our manuscript to further clarify our selection of the three input variables in Section 3.
> No potential negative societal impacts are discussed. While I believe the dataset is likely strictly beneficial for society, it is worthwhile for the authors to think about potential negative impacts and include at least a sentence on this (even just stating they don't envision any negative impacts).
Thank you for highlighting this oversight. We have addressed this point by including the following statement in the Conclusion section of our updated manuscript: "Although no significant negative impacts are foreseen, we commit to ongoing monitoring to ensure its responsible use."
> The dataset seems constructed in a sound way, the evaluation designed and performed correctly, and claims made correctly, barring one point in the Supplement: "Error bars related to random seed variations were not reported as our vision transformer models were fine-tuned from pretrained weights from pretrained weights, and their training did not necessitate any random initializations." The regression head is randomly initialized so this is not true.
Thank you for your thorough review and for pointing out the oversight concerning the random initialization of the regression head. Although our main model weights were pretrained, the regression head was indeed initialized randomly. We have corrected this statement and acknowledged the random initialization in Supplemental Information Section 2, "Model Uncertainty and Resampling Analysis."
> The paper is well-written and easy to understand. As stated above, I think it could benefit substantially from a rework to tailor more to an ML audience.
Based on your earlier comments, we have updated our manuscript in several places to make it more intuitive and better suited for an ML audience.
> The most relevant prior work is discussed but a clear comparison of the proposed dataset to previous datasets is lacking.
We have updated Section 2.2, 'Related Work - Climate Datasets for ML,' to explicitly highlight the distinctions between our dataset and existing climate ML datasets, emphasizing unique features and the specific advantages our dataset offers.
> Why were only two SSPs selected? Generally more data is better, and would suspect that to be true for climate detection/attribution specifically, but curious if the authors feel that is not the case in this particular setting.
As we clarified in our response to your first comment, our decision to focus on SSP2 and SSP3 was intentional, as these two scenarios cover broad AGMT ranges.
> Lines 236-240: It’s uncommon to reference specific implementations instead of architectures. I appreciate being very specific, but when code is released I don’t think it’s helpful to do so in the paper (the names are a bit hard to read).
In the updated manuscript, we have revised the text to list only the model names for improved readability, while the specific checkpoint names have been moved to a footnote.
## Reviewer 7ghr (R4)
We genuinely appreciate your review and insights. Thank you for acknowledging the importance of our work and the positive remarks on our dataset’s comprehensiveness and the paper's readability. We also appreciate the opportunity to clarify our results, as some of your concerns may stem from misunderstandings. We hope our responses address these and clarify our intentions with the ClimDetect dataset. Please reach out if you have any further questions.
> Simple MLP and ridge regression seem to perform similarly to the much more complex architectures including the vision transformer (ViT), limiting claims about leveraging deep learning for identifying useful fingerprints. Additionally, ridge regression has a much clearer model interpretation than ViT + GradCAM.
The ClimDetect dataset is designed as a benchmark to standardize climate change detection and attribution tasks, facilitating objective model comparisons. We do not specifically advocate for complex architectures like ViTs or particular interpretation tools such as GradCAM. Instead, our inclusion of ViT and the associated GradCAM analysis serves as a prompt to explore these advanced AI models, which have not yet been widely adopted in climate research. We have updated Section 4.3 "Physical Interpretation" to better communicate this point.
> It’s also not clear how robust these results are since there are no error bars (Table 2).
In our initial submission, error bars (95% confidence intervals estimated by bootstrapping) for the results in Table 2 were already included in Supplemental Information Section 2. To improve visibility and ensure this important detail is not overlooked, we have now explicitly referenced it in the caption of Table 2 in our revised manuscript.
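As a rough sketch of how such error bars can be obtained (not necessarily the exact procedure used in the Supplement), a percentile bootstrap over test samples might look like this:

```python
import numpy as np

def bootstrap_rmse_ci(pred, true, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for RMSE."""
    pred, true = np.asarray(pred), np.asarray(true)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(pred), len(pred))  # resample with replacement
        stats.append(np.sqrt(np.mean((pred[idx] - true[idx]) ** 2)))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```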
> The data seems to range from 1850-2014, stopping short a full decade before the present year, and based on the provided datasheet, there are no plans to update this dataset, limiting its usefulness.
It appears there may have been a misunderstanding regarding the temporal scope of the ClimDetect dataset. As detailed in our manuscript (e.g., lines 78-84 and 158-160), the dataset not only includes historical simulations covering 1850-2014 but also extends to the year 2100 through SSP simulations that start from 2015.
> There seems to be very little utility in the explainability AI (XAI) experimental results. Figure 3 attributes influence to very different sets of pixels depending on the model; the similar AGMT scores but widely varying influence scores across different models limits the usefulness of these data-attributions since it is not clear or consistent about what features are actually contributing to climate change. As mentioned earlier, the simplest model performs almost as well as ViT, limiting any claims about leveraging deep learning for identifying useful fingerprints.
As previously noted, our primary aim is to provide a standardized dataset that facilitates exploration of various methodologies in climate detection and analysis, rather than advocating for specific models or XAI methods such as ViT or GradCAM. Our dataset serves as a sandbox, allowing researchers to experiment with a range of approaches, from simple to complex. We have revised our manuscript to more clearly articulate this objective, to ensure that the utility of the dataset extends beyond the performance of any single method.
> At a high-level, It’s not clear what the advantages of these interpretations are via the three channels. Can the authors better describe the ideal scenario of how XAI techniques like GradCam can advance the understanding of climate science and provide actionable insights via the proposed benchmark dataset given the data with the proposed three channel types. For example, how does higher influenced surface-air temperature in a certain region provide actionable insight? I recommend experiments that combine XAI with domain experts to validate the output of these XAI techniques generate useful information.
XAI techniques, including GradCAM, are crucial for enhancing the understanding and interpretation of deep learning models in climate change detection. Saliency maps from these techniques serve as "climate change fingerprint maps," which visually represent distinct spatial patterns associated with climate change. By clarifying how regional changes are related to global climate trends, this analysis aids in identifying global warming patterns and assessing regional impacts. This not only deepens scientific understanding but also informs policy decisions by pinpointing areas requiring targeted adaptation strategies. Thank you for your insightful suggestion. We plan to further explore various XAI techniques and their applications in collaboration with fellow climate and ML researchers.
> What is the point of the “mean-removed” dataset variants? Since predictive performance on these variants are generally worse than the non mean-removed variants, what is the benefit of keeping them, do they provide better interpretability?
The "mean-removed" experiments, detailed in lines 223-230 of our submitted manuscript, involve subtracting the spatial mean from each input climate field snapshot. This methodology, informed by the work of Santer et al. (2018) and Sippel et al. (2020), is designed to enhance the detection of spatial patterns in climate change signals. While these variants display lower predictive performance, they offer increased interpretability by focusing on spatial anomalies.Additionally,by adding further value to our benchmark dataset, this approach creates a more challenging target task, potentially necessitating advanced ML approaches.
> I'm not sure I agree the post-processing steps (removing daily seasonal cycles and standardizing anomalies) is necessary. This extra information seems like it could be beneficial, depending on the analyses.
We emphasize the importance of removing the 'Climatological Daily Seasonal Cycle' from the ClimDetect dataset, as it is crucial for focusing on long-term climate trends rather than seasonal variations. These seasonal fluctuations, which are generally much larger than the slowly evolving climate change signals, can obscure the more subtle interannual or long-term variability of the annual global mean temperature (AGMT). Normalizing against these fluctuations positions the model to better detect subtle changes in AGMT, thus enhancing its predictive accuracy, a practice aligned with established climate data analysis techniques.
L178-179 of the original submission outlines the rationale for removing the seasonal cycle. In the updated manuscript, we have expanded on this explanation. Additionally, we have added statements acknowledging that alternative normalization schemes may offer advantages over z-score standardization and encouraging users to explore these options.
> The claims related to the explainability AI (XAI) experiments and results seem overstated.
We respectfully disagree with the assertion that the claims related to XAI experiments are overstated. In our manuscript, we have explicitly acknowledged the limitations of GradCAM for interpreting Vision Transformer models, particularly in lines 295-301 and 307-311. These sections emphasize the ongoing need for further exploration and development of diverse model interpretation tools to enhance the interpretative accuracy of complex models in climate science. Additionally, we have updated Section 4.3 "Physical Interpretation" to further clarify these points.
> Prior work is discussed, however, it's not entirely clear what the main differences and contributions are between this and prior work. Perhaps providing a bulleted list of contributions and discussing the specific differences from previous work would make it easier for readers to understand this paper's contributions.
We have enhanced Section 2.2, 'Related Work - Climate Datasets for ML,' to better differentiate our dataset from existing ones in the field, further clarifying its distinct contributions and relevance. [TODO]
> Why are there 816K examples in the dataset? I thought each instance represents daily input features (air temperature, humidity, precipitation rate) from 1850-2014. What does each instance represent?
As stated in L141-145 of our manuscript, each example in the ClimDetect dataset represents a daily snapshot of global climate field data across three variables (channels). Specifically, each instance is a three-dimensional array of shape 64 x 128 x 3, where 64 is the number of latitude points, 128 the number of longitude points, and the three channels correspond to the three input variables (temperature, humidity, and precipitation).
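Schematically, a single example therefore has the following shape (a hypothetical placeholder, not the dataset's actual loading API):

```python
import numpy as np

x = np.zeros((64, 128, 3), dtype=np.float32)  # lat x lon x [tas, huss, pr]
y = np.float32(0.0)                           # scalar target: AGMT anomaly (placeholder value)
# The full dataset stacks roughly 816K such (x, y) pairs across models, scenarios, and days.
```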
> Why are there only three channel types: surface-air temperature, precipitation, humidity? Are these the only factors one should consider when attempting to predict AGMT?
No, other climate variables could also be used as inputs. The primary aim of ClimDetect is not merely to predict the target variable using any available data but to detect climate change signals within input fields, with a target variable serving as a proxy for climate change. However, we selected temperature, precipitation, and humidity as the three input variables for our dataset because these are climate variables of broad interest that have been extensively studied in previous detection and attribution research, as detailed in L134-137 of our manuscript. This selection was made to ensure that our dataset aligns with well-established scientific approaches and facilitates comparable research outcomes.
> What do the numbers in Figure 3 represent? Do higher numbers represent more influence? Are the pink regions significantly more influential than non-pink regions? Can the authors quantify how much more influential something needs to be for it to be meaningful?
Yes, higher numbers in Figure 3's GradCAM heatmaps indicate greater influence on the model's output, with pink regions denoting areas of higher influence, as shown by the colorbar. Unlike regression coefficients, which are expressed in interpretable physical units (e.g., K/K), the units of the GradCAM output are less straightforward to interpret quantitatively. Additionally, the heatmaps are normalized using a min-max scheme, where values are transformed by (X−min(X))/(max(X)−min(X)), to facilitate comparison. We recognize that this was not adequately explained in the figure caption and have since updated it to include details about the unit of measure and normalization method used.
However, determining a specific threshold at which influence becomes meaningful is beyond the scope of our exploratory analysis. The primary aim of using GradCAM is to visualize global patterns of influence, rather than to provide precise quantitative assessments.
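For reference, the min-max normalization mentioned above amounts to the following (the function name is illustrative):

```python
import numpy as np

def minmax_normalize(heatmap):
    """Rescale a Grad-CAM saliency map to [0, 1] via (X - min(X)) / (max(X) - min(X))."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    return (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
```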
> Consider adding empirical runtime results between the simple models (MLP and ridge regression), and the more complex ViT models.
The training times for finetuning ViT models are detailed in lines 244-246 of our manuscript, where we note that it took an average of 4.75 hours per model on eight Nvidia A6000 GPUs. We did not include training times for simpler MLP and ridge regression models, nor the inference times for any models, as these are significantly shorter. Given their minimal impact on overall computational costs in practical settings, we deemed it unnecessary to report their runtime stats.