# Tools & Metadata

- Take an existing paper (from Aiden, Sara or Dominic?) and analyze what is missing regarding FAIR (tools & metadata). ReproHack-style analysis.
  - Humanitarian need drives multilateral disaster aid, https://www.pnas.org/content/118/4/e2018293118
  - Equilibrium climate sensitivity above 5 °C plausible due to state-dependent cloud feedback, https://www.nature.com/articles/s41561-020-00649-1
  - (Aiden:) I will work on a ReproHack-style analysis of the FAIRness of a study, but I guess I'll have to do that on my own.
- CMIP6 papers:
  - "Causes of Higher Climate Sensitivity in CMIP6 Models", https://doi.org/10.1029/2019GL085782
  - Bias in CMIP6 models as compared to observed regional dimming and brightening, https://acp.copernicus.org/articles/20/16023/2020/
- CMIP6 data request: migrate the CMIP request data into a database. (Anne, Naoe, Jean, Klaus, Kirsten, Yanchun)
  - https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html

## A day in the life of a researcher (starting from a paper)

### Humanitarian need drives multilateral disaster aid

- Humanitarian need drives multilateral disaster aid, https://www.pnas.org/content/118/4/e2018293118

#### FAIR score with f-uji.net

- Not bad (70%), but the question is: is this score helpful for the researcher? (A sketch of running this check programmatically appears below.)

![](https://i.imgur.com/ix9a0ng.png)

Anne opens the aforementioned paper: https://www.pnas.org/content/118/4/e2018293118.

- When working online (with the HTML version), she follows the Data Availability link provided by the publisher, which in this case leads to a Harvard Dataverse deployment listing 80+ datasets (they can also be browsed in a folder view using the Tree tab). The information in the metadata tabs seems rather sparse and is probably not used much. This is a very specific case, of course.
- While there is an option at the top to download the entire dataset, it does not seem to work. We apparently have to download the files one by one, which is very cumbersome (and we lose the directory structure). Also, the download counter increases every time we click the button, even though nothing is actually downloaded. (A scripted workaround is sketched below.)
- We first read README.txt, expecting some information on where to start:

```
### REPLICATION DO-FILES, SCRIPTS AND DATA ###
This dataverse contains the do-files scripts and some of the data used to conduct the social and meteorological analyses in the article.
The directory for the meteorological analysis is divided into two parts:
Part 1: running the regression analyses of UN aid using the do-file "undisasteraid_final.do" and dataset "undisasteraid_final_forrep.dta"
Part 2: running the out of sample predictions using the do-file "OOS_evaluations_final.do" and dataset "undisasteraid_final_forrep.dta"
The directory for the meteorological analysis is divided into three parts:
Part 1: obtaining the meteorological variables from ERA-Interim ("Pt1_meteorological_vars")
Part 2: validating EMDAT extremes using the meteorological variables from ERA-Interim data ("Pt2_validation")
Part 3: calculating the yearly hazard severity index ("Pt3_yearly_hazard_severity")
```

(Note that the README says "the meteorological analysis" twice; judging from the file names, the first block actually describes the social analysis.)

- So we started from `undisasteraid_final.do`, which is available online. Stata has evidently been used for part of the analysis; Stata is not open source, which makes it difficult to reproduce or reuse the script provided.
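As a side note on the F-UJI score reported above, the same assessment can be run programmatically. A minimal sketch, assuming the public demo instance exposes the `/fuji/api/v1/evaluate` endpoint and the demo Basic-Auth credentials from the F-UJI documentation (both are assumptions here, to be checked against the current docs):

```python
import requests

# Assumed endpoint of the public F-UJI demo instance.
FUJI_API = "https://www.f-uji.net/fuji/api/v1/evaluate"

payload = {
    # DOI of the Dataverse replication dataset from the Data Availability link.
    "object_identifier": "doi:10.7910/DVN/VWQ5AY",
    "test_debug": True,
    "use_datacite": True,
}

# The demo credentials below are an assumption taken from the F-UJI docs.
resp = requests.post(FUJI_API, json=payload, auth=("marvel", "wonderland"))
resp.raise_for_status()

# The exact response structure is an assumption; inspect the JSON to find
# the per-principle scores and the overall percentage (~70% in the web UI).
print(resp.json().get("summary"))
```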
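Since the bulk download is broken, retrieval can instead be scripted through the Dataverse native API, which also recovers the real filenames and the directory structure lost in the web UI. A minimal sketch, assuming the standard Dataverse endpoints (untested against this particular deployment):

```python
import os
import requests

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/VWQ5AY"  # replication dataset from the Data Availability link

# List all files of the latest dataset version via the native API.
meta = requests.get(
    f"{BASE}/api/datasets/:persistentId/", params={"persistentId": DOI}
).json()

for f in meta["data"]["latestVersion"]["files"]:
    file_id = f["dataFile"]["id"]
    # "directoryLabel" (when present) restores the folder structure.
    path = os.path.join(f.get("directoryLabel", ""), f["label"])
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    data = requests.get(f"{BASE}/api/access/datafile/{file_id}")
    with open(path, "wb") as out:
        out.write(data.content)
    print("downloaded", path)
```

Note that for tabular files that Dataverse has ingested (e.g. the `.dta`), the access endpoint returns a derived tab-separated version by default; adding `format=original` to the request should retrieve the originally uploaded file.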
- Fetching a file by hand is also awkward: the download URL contains only a numeric ID, while the real filename appears only in the page text:

![](https://i.imgur.com/ewrGY7j.png)

So to download a file, we need to supply the output filename manually:

```
wget https://dataverse.harvard.edu/api/access/datafile/4288994 -O accum_precip_calc.py
```

- The data come from ERA-Interim (not ERA5, so we guess the analysis was done a while ago). Forecast fields are used (rather than analysis fields) because some variables are not available in the analysis stream (this is not mentioned anywhere). However:
  - We could not find precise information on the input datasets (only ERA-Interim can be downloaded).
  - Even though a lot of information is available online at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VWQ5AY, the paper and the data are disconnected from each other (no cross-references), so it is often difficult to understand how the figures were produced.
- Many of the files are Python or R programs, yet the license is CC0 ("Public Domain Dedication"), which is not well suited to software; a software license such as MIT would be preferable (we were not sure whether this can be changed when uploading data). See https://choosealicense.com/.
- Stata was used for part of the analysis. Stata is not open source, so we cannot reproduce that work easily.
- More generally, there is no description of the software used for the analysis, and for the Python and R programs the list of required packages/libraries is missing (one environment per programming language would suffice).

### Causes of Higher Climate Sensitivity in CMIP6 Models

An example of bad practice: there is no clear link to the actual datasets used, nor to the software used for pre-processing them (computing anomalies). The aforementioned paper, https://doi.org/10.1029/2019GL085782, is highly cited; check section 2 on Data and Methodology.

Wish list (currently Oscar's point of view, so feel free to add things):

- Instead of a data dump, it would be nice to **state explicitly the "query" used to access these datasets online in another system** (e.g., ESGF, ES-DOC). This would be similar to a systematic literature review, where the author describes the keywords used to find the papers in Web of Science, Scopus or Google Scholar.
  - It is a bit better (though not perfect yet) in https://acp.copernicus.org/articles/20/16023/2020/, which provides clear keywords for the experiments and an Excel file with the analysis. It would be better still with the actual queries to ESGF (or another database), or the actual URIs/DOIs/PIDs of each sub-dataset.
  - The best option would be a real query like those that can be run at https://esgf-data.dkrz.de/projects/esgf-dkrz/ (e.g., as [here](http://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=10&type=Dataset&replica=false&latest=true&experiment_id=1pctCO2-cdr&mip_era=CMIP6&activity_id%21=input4MIPs&facets=mip_era%2Cactivity_id%2Cmodel_cohort%2Cproduct%2Csource_id%2Cinstitution_id%2Csource_type%2Cnominal_resolution%2Cexperiment_id%2Csub_experiment_id%2Cvariant_label%2Cgrid_label%2Ctable_id%2Cfrequency%2Crealm%2Cvariable_id%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson)) using this API: https://esgf.github.io/esg-search/ESGF_Search_RESTful_API.html. A sketch of such a query follows after this list.
  - (***BEWARE: this would require a strong cultural change, although it is not impossible to achieve in the medium term!***)
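To illustrate what such a stored query could look like, here is a minimal sketch against the ESGF Search RESTful API, reusing facet values from the example URL above (the facet names are the documented CMIP6 ones):

```python
import requests

# Query the ESGF search API at DKRZ with the same facets as the example
# URL above; format=application/solr+json returns a Solr-style response.
SEARCH = "https://esgf-data.dkrz.de/esg-search/search"

params = {
    "type": "Dataset",
    "replica": "false",
    "latest": "true",
    "mip_era": "CMIP6",
    "experiment_id": "1pctCO2-cdr",
    "limit": 10,
    "format": "application/solr+json",
}

resp = requests.get(SEARCH, params=params)
resp.raise_for_status()
result = resp.json()["response"]

print(result["numFound"], "datasets found")
for doc in result["docs"]:
    # Each record carries a persistent dataset id that could be cited in a
    # paper instead of (or alongside) a data dump.
    print(doc["id"])
```

Publishing the `params` dictionary (or the equivalent URL) alongside a paper would make the dataset selection reproducible in exactly the sense described above.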
- Do something similar to what is being done in the machine learning domain with https://paperswithcode.com/ (not a climate-community effort), where authors publishing there try to make their papers reproducible, and users can comment on papers and star them.
- ESMValTool: provenance metadata is available (in the work folder), with handles for the inputs as well as for the recipes. https://www.esmvaltool.org/

## 2 concrete tasks for today

1. Write a best-practice report from existing papers.
   - Show an example of how good it could be in an example paper, for example using: Equilibrium climate sensitivity above 5 °C plausible due to state-dependent cloud feedback, https://www.nature.com/articles/s41561-020-00649-1
2. Rework the ontology on ES-DOC into OWL (see the sketch below).
   - Climate data ontology roadmap: a roadmap on how to get there.
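For task 2, a minimal sketch of what the OWL rework could start from, using `rdflib`; the namespace, class and property names below are illustrative placeholders, not actual ES-DOC vocabulary:

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Hypothetical namespace for the climate-data ontology; replace with the
# real ES-DOC namespace once the roadmap settles on one.
ESDOC = Namespace("https://example.org/es-doc/owl#")

g = Graph()
g.bind("esdoc", ESDOC)
g.bind("owl", OWL)

# Two illustrative classes and one relation between them.
for cls in (ESDOC.Experiment, ESDOC.Model):
    g.add((cls, RDF.type, OWL.Class))

g.add((ESDOC.usesModel, RDF.type, OWL.ObjectProperty))
g.add((ESDOC.usesModel, RDFS.domain, ESDOC.Experiment))
g.add((ESDOC.usesModel, RDFS.range, ESDOC.Model))

print(g.serialize(format="turtle"))
```

Expressing the ES-DOC concepts this way would let the climate data ontology be validated, queried with SPARQL, and linked to the dataset PIDs discussed above.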