# Notes from day-1

# NICEST2 hackathon on FAIR climate data

This document is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents.

Attendees are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/

_______________________________________________________________________________________________________

## Icebreaker

*Write down your greatest hope for this hackathon and then your greatest fear*

- hope: learn about providing and consuming FAIR open climate data; fear: getting lost 😅 +1
- hope: learn about practicals around the FAIRification process of research data (fear: same as above)
- hope: to learn about FAIR research data in general and about climate data (fear: as above)
- hope: to understand a bit more about the type of metadata that you are working with and how you use it (no fears at all ;-))
- hope: get help to make our data FAIR!
- I hear about FAIR all the time and now I really want to know what is behind it, what is missing for my data to become FAIR, whether it is worth doing, ...
- hope: FAIR makes sense, but it feels like a lot of work to get it right. I hope to learn how to make it work efficiently. fear: wasting time
- hope: to get/cement best practices to keep my research FAIR, and to learn how to make it easier to keep things FAIR. fear: I don't have the competence to develop the tools (but I won't learn unless I get started!)

## Welcome

### Code of Conduct

- https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html

#### How to report COC issues to the CoC committee?
- Email/contact: Naoe Tatara (naoe.tatara@ub.uio.no)

### Logistics

- [Zoom mechanics and controls](https://coderefinery.github.io/manuals/zoom-mechanics/)
- [HackMD mechanics and controls](https://coderefinery.github.io/manuals/hackmd-mechanics/)
- **Presentations will be recorded** (turn off your video if you do not want to appear in it)

## Presentation

Please sign in so we can record your attendance.

*Name, Institution, relevant project, Email & Twitter (optional)*

- Hamish Struthers, National Supercomputing Center, Linköping, Sweden (climate modeling support), NICEST2, EOSC-Nordic projects
- Matúš Kalaš, University of Bergen, Norway, ELIXIR, matus.kalas@uib.no, [@matuskalas](https://twitter.com/matuskalas)
- Naoe Tatara, University of Oslo Library, Digital Scholarship Center, naoe.tatara@ub.uio.no, [@NaoeTatara](https://twitter.com/NaoeTatara)
- Abdelkader Mezghani, Norwegian Meteorological Institute, Model and Climate Analyses group, Oslo, Norway, abdelkaderm@met.no
- Tyge Løvset, NORCE, NICEST2, INES (NorESM), Klima i Norge 2100, tylo@norceresearch.no
- Prashanth Dwarakanath, National Supercomputing Center, Linköping, Sweden, NICEST2, IS-ENES2, Copernicus projects, pchengi@nsc.liu.se
- Oskar Landgren, Norwegian Meteorological Institute, Oslo, Norway. Part of WP2 in NICEST2, WP4 in national climate assessment Klima i Norge 2100. oskar.landgren@met.no
- Anne Fouilloux, University of Oslo, NICEST2, RELIANCE, EOSC-NORDIC, annefou@geo.uio.no, @AnneFouilloux
- Lesetja Lekoloane, South African Weather Service, South Africa, Lesetja.Lekoloane@weathersa.co.za, @Leeseetja
- Aiden Jönsson, Department of Meteorology at Stockholm University, Bolin Centre modeling at NSC, aiden.jonsson@misu.su.se
- Klaus Zimmermann, Swedish Meteorological and Hydrological Institute (SMHI), Norrköping, Sweden
- Nohlahla Dlamini, Global Change Institute, University of the Witwatersrand, South Africa, Nohlahla.Dlamini1@students.wits.ac.za
- Jean Iaquinta, University of Oslo
- Heiko Goelzer, NORCE, INES, NorESM, heig@norceresearch.no
- Dominic Heslin-Rees, Stockholm University (ACES), PhD student
- Adil Hasan, Uninett/Sigma2, EOSC-Nordic project
- Atef Ben Nasser, Institut Pierre Simon Laplace, Paris, France, Research Engineer, abennasser@ipsl.fr
- Joakim Philipson, Stockholm University Library, EOSC-Nordic T5.3.2, joakim.philipson@su.se

## What is your experience with Open Science & FAIRification?

- No formal experience.
- A bit of working with open bio and geo data, more with open source software.
- Have attended a FAIR data workshop. +1 +1 +1
- I thought I WAS DOING SCIENCE "IN AN OPEN WAY" but clearly not as people now define it, but I am willing to learn and improve.
  - 👍🏽👍🏽👍🏽
- Author, reviewer and editor for open access journals; published data in public archives; shared code through GitHub.
  - Very nice!
- Some git experience; I am making it standard practice to keep all my research with git backups and hosting in order to refer to it easily upon publication. Not much otherwise, as I'm a PhD student, but I want to make it standard practice.
  - Awesome, excellent start!

## What is your experience with Climate data?

- ca. 15 years climate modeling research.
- ca. 5-6 years wind data, 1.5 years climate models.
- 0 (+1)
- 8 years experience publishing climate data.
- ca. 15 years of climate and ice sheet modelling.
- ca. 2 years of climate data analysis (CERES, OSI, ERA) and ca. 1 year of climate modeling.
- A few courses and workshops (<2 years).

Agenda for day-1: https://nordicesmhub.github.io/nicest2-fair-hackathon/day1-agenda/

## Write your questions below

### Questions: Introduction to Open Science and FAIR data

- A reflection I had is the distinction between repeating an experiment and coming up with the same *results* versus coming to the same *conclusions*. Are these two different questions? I'm thinking about the difficulty of reproducing results in chaotic systems (like GCMs) — the results should of course converge to the same outcome.
  - Exactly 👍🏽
  - More at https://the-turing-way.netlify.app/reproducible-research/overview/overview-definitions.html

### Data management planning, data life cycle - perspective of a data manager

- This was a creative idea to start discussing how to make data more FAIR: https://www.researchdata.uni-jena.de/en/news/\_fairest+dataset_+wettbewerb
- Maybe one more idea here: related identifiers are very important, to link your paper, code, data…
  - Yes, I think it is very important! Not always easy to do in practice.
  - Absolutely crucial!
- Aren't contact points (names) in metadata GDPR-problematic?
  - I honestly don't know! The main issue is that usually we may not be aware that names are added.
  - Ideally, these contacts would have to explicitly agree. We're trying to do that at least when listing contributors on a GitHub project. But it is completely hopeless when submitting abstracts on EasyChair/_etc._ and being asked for the email addresses of all co-authors 🤦🏽‍♀️
- What does "enable machine access" mean in practice?
  - for instance: information filled in a DMP could be used to allocate storage (and/or computing resources?)
  - what has that got to do with me (who produced the data)? How can I have any impact on machines finding/using my data? Which languages do machines speak?
  - True, well, partially. If you want to apply for storage for your data, then at the moment you need to fill in an application form. If the DMP could be read by other services, then you could imagine filling into the DMP that you want X amount of storage at location Y, and that being picked up by the storage reservation service without you having to fill in more forms.
- Questions to researchers: what would be useful to have in a DMP?
  - Perhaps it's useful to ask: what do you wish was in the DMP of projects whose data you have used? +1
  - Where to start: I usually get a bulk of data but never know which one I should look at first.
  - This is very useful. So, are you saying that the data needs some docs on the order in which to read the data? This would come under the documentation part of the DMP. Very useful. It would be interesting to hear more things.

### Climate researchers perspective

- Reproducibility for climate models: usually not possible, and is it important?
  - I completely agree with Klaus' earlier comment. I'd just like to add that bit-reproducibility is most likely going to be even less achievable in the future, considering for example the move towards heterogeneous computing with accelerators and GPUs using single precision. Therefore I guess my question is: is the "R" in FAIR practically unreachable for climate modelling? Is this a problem?
  - Bit-reproducibility is not really needed, but if the differences can be verified to be within a certain error threshold, it would still be very useful. If reproduced data are wildly different from the original, one should question the accuracy of the original data as well, even if they were created in a different environment.
  - For example, is it enough that statistics and conclusions are reproducible, but not the actual data?
- It is probably linked to the conclusions: if they are reproducible, then it may be sufficient (in climate, we are not interested in a "fully" reproducible state).
- Speaking for my example model NorESM, and probably for any other large-infrastructure ESM: I would argue that without access to the specific HPC architecture that my experiments are run on, it would be near impossible for anyone to even start thinking about trying to reproduce the experiments. This is not an argument against trying to produce FAIR data/code/workflows, but it puts questions about bit-wise reproducibility into perspective.
- One can try one of the scientific research data hosts, e.g. https://researchdata.4tu.nl/en/about/organisation/, but one has to pay for it.
  - Is this data repository used for depositing climate data?
- Thanks a million for the point about bit-reproducibility, important! Bit-reproducibility is not what's required by "reproducible science" in general.
- One issue with this is that climate simulations can be long. Historical simulations (which at least can be verified against observations) can be ~200 yrs, decadal predictions ~20 yrs, scenarios ~100 yrs into the future. Other types like spin-ups can be even several thousand years long. The fact that their individual accuracy is indeed in question is part of the reason for having ensembles and keeping a veritable zoo of models around.
- A climate researcher might know a lot of theory about FAIR data, but at the same time have no idea what to do with their own data, where to start, how to apply the "FAIR theory".
- In CMIP, a lot of the data management is defined and common practices are followed at all levels of the data/metadata publishing process. But:
  - What about non-CMIP data?
  - Could CMIP conventions that are not necessarily specific to CMIP (_e.g._ naming of variables and units) be made into general conventions?
  - I (Hamish) think [CF](https://cfconventions.org) + [ACDD](https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3) + [ES-DOC](https://es-doc.org/) gets us a long way towards describing research (modeling) data.
  - As a bit of background information on CMIP (probably known to people here): [The CMIP6 Data Request](http://clipc-services.ceda.ac.uk/dreq/index.html) standardizes, among other things, [variables](http://clipc-services.ceda.ac.uk/dreq/index/var.html).
  - What about observations?
- FAIR seems to me like an ideal. My practical approach as a researcher is to try to improve the FAIRness of my data/publication practice rather than being FAIR altogether.
  - I fully agree with you. I think that this is what we will see happening in research. I suspect that what FAIR means will change.

# Introduction to the hackathon aims, general discussion

- Do we only focus on climate data produced by climate models?
- Are we only interested in FAIR for machines or "FAIR" for researchers?
- At least the published results in a peer-reviewed journal are reproducible. Is that something worth discussing?

# Technical material

- Cannot see the document in the home dir. in JupyterLab? I reloaded, but it did not help.
- It is located in ~/shared-ns1000k/dataporten-home/87e7b7d1-2d512f-2d4713-2d9c15-2d3a2d54fd9ca0
  - Awesome, I found it :+1:

### ES-DOC: Introduction

- How granular is the search on ES-DOC, i.e. what is the smallest unit that a query result refers to?
  - The smallest queryable units are: project, institute, model, MIP, document-type, document-version. Note that we do not have full semantic search, but we plan to enhance search capabilities via Elasticsearch in Q4 2021.
- Is there any file-level info about data published on ESGF that's available on ES-DOC?
  - Yes, the Further Info URL links back to relevant files available on the ESGF nodes via the 'Dataset Documentation' section against the item 'ESGF Search'. Also, some documentation sections for given workflow components contain 'Data link' information, which links back to the data and often the file (though perhaps elsewhere than ESGF).
- Can't see NorESM?
  - Good question! It turns out that NorESM is referred to by the acronym 'NCC' (**N**orESM **C**limate modeling **C**onsortium, consisting of CICERO), so when searching for NorESM-related documentation, look for 'NCC' instead of 'NorESM'.
  - That does not show up in [the explorer](https://explore.es-doc.org/)
- What is the link between ESGF and ES-DOC? One can publish data in ESGF and not add info in ES-DOC.
  - In short, I think it could be summarised that ESGF manages and archives the *datasets*, whereas ES-DOC manages and archives the *documentation*. ESGF has an ESG Publisher service which captures the relevant information about any CMIP6 data that is published in ESGF and passes it through to the ES-DOC ecosystem. If you add CMIP6 data via ESGF, it will eventually be incorporated into the documentation available on ES-DOC (site and Further Info URL).
- ES-DOC on GitHub: https://github.com/ES-DOC
- cf-python package: https://ncas-cms.github.io/cf-python/
- ES-DOC ocean description metadata for CMIP6: https://specializations.es-doc.org/static/index.html?target=cmip6.ocean&client=esdoc-url-rewrite

### Errata

- How can you verify that the errata are genuine, i.e. not something malicious?
  - At the moment only the modellers themselves can create errata entries, so the likelihood of that being malicious is rather low.
- Is there a link to errata from ESGF? Or, as a researcher, should we always first search for errata before downloading/checking data?
  - I second this question, but I am also wondering whether it is more useful when the data are, e.g., sitting on a research group's server? Data straight from ESGF should already be the latest/most secure, right?
  - The link between the errata and ESGF for the time being remains within the PID handles only. We have proposed an integration with the ESGF front-end, some sort of binary signal in front of the data to indicate the existence of errata, but for the time being it is not yet implemented. We already have the API available, so it depends on the ESGF devs. I think researchers need to adopt a reflex of first using the errata to check the sanity of the data used.
  - Quick follow-up: the reason why data available on ESGF are not necessarily the latest/safest is also that an erratum can be about something really trivial (e.g. a metadata issue) that doesn't necessarily harm the content of the files; older versions are also kept available for transparency and openness as well as for reproducibility concerns.

### ES-DOC tooling and internals

- Is it easy to extend ES-DOC services to new projects?

### HackMD question: Expected hackathon outcomes

Are there specific, concrete outcomes that you can suggest for next week's working groups?
- Low-hanging fruit
  - Climate model output (research): an esm2cmor-like (model-specific) tool for generating **metadata** only from research simulations (CF + ACDD? + ...); push to a *simulation register*, a research simulation metadata database.
  - Integrating DMPs with (storage) services.
- High-value tasks
  - Pushing data from DMPs to the workflow -> metadata generation (so people are not repeatedly inputting the same information in multiple places).
  - Reuse of ESGF for research simulations. What pieces are missing/required?
  - Replacing MIP tables with ??
  - Errata
  - ES-DOC?
  - Metadata beyond model output? Metadata objects.
- Other
  - https://c6dreq.dkrz.de/ data requests -> database
  - ocean grids

### Planning for practical sessions

Day 2 (03/16, 13:00-17:00 CET) and day 3 (03/17, 13:00-17:00 CET) are scheduled for practical work. We hope this provides an opportunity for those interested/motivated to dive deeper into some of the questions raised today in a breakout/working group setting.

Some ideas/suggestions for breakout themes (but we want to hear your ideas too!):

- **Tools**: Is it possible to identify one (or more) tools that are needed for metadata capture? If yes, outline the software specification.
- **Metadata**: What are the relevant metadata sources (ACDD, CF, ES-DOC, …)? What, if anything, is missing for describing research simulations/workflows?
- **Workflow/DMP integration**: How, in practice, to integrate DMPs and researchers' workflows to ensure information (metadata) is propagated? Are there things missing in DMPs that are important for climate? How to make DMPs relevant & useful for researchers?
- **Big picture/roadmap (the ponderers club)**: Roadmap to FAIR climate data; collaborations: who can/should be involved (Nordic, international), governance ...
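As a concrete starting point for the **Tools**/**Metadata** themes, checking a dataset's global attributes against the ACDD recommendations is about the smallest useful metadata tool one could build. Below is a minimal Python sketch, assuming the global attributes are already available as a plain dictionary (e.g. read from a netCDF file); the attribute names are ACDD 1.3's "highly recommended" set, and the `check_acdd` helper is hypothetical, not an existing tool:

```python
# Sketch: check a dataset's global attributes against the ACDD 1.3
# "highly recommended" attributes. The dict below stands in for e.g.
# netCDF global attributes; check_acdd is a hypothetical helper.

ACDD_HIGHLY_RECOMMENDED = ["title", "summary", "keywords", "Conventions"]

def check_acdd(attrs):
    """Return the ACDD highly recommended attributes that are
    missing or empty in `attrs`."""
    return [name for name in ACDD_HIGHLY_RECOMMENDED
            if not str(attrs.get(name, "")).strip()]

# Example: global attributes of a hypothetical research simulation
attrs = {
    "title": "NorESM2 sensitivity run, 1850 control",
    "summary": "Control simulation for ...",
    "Conventions": "CF-1.8, ACDD-1.3",
}

print(check_acdd(attrs))  # ['keywords'] -> still to fill in before publishing
```

The same pattern would extend naturally to the recommended/suggested ACDD tiers, or to project-specific attribute lists for a simulation register.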
## Thoughts & inputs about hackathon

- Matúš: apply knowledge from bioinformatics to climate (input as an "outsider" to climate); interested in options/efforts towards motivating people & lowering the barrier.
- Abdelkader: interested in working on lowering the barrier for the scientists. There have been various developments in recent years in terms of data formats and tools (e.g. NetCDF, CF compliance) to make sure climate data are standardized and well documented. Would it be possible to come up with automated FAIRness checks extending the metadata, so as to increase our confidence in (re)using these datasets?
- Oskar: motivate why we "should" share data. Regional modelling to catch up with the global. Publication of research climate data to ESGF should be considered, to benefit from existing work.
- Jean: see on a practical example how to "make something FAIR" +1 +1
- Adil: DMP integration with other services (for instance in Norway with Sigma2 MAS), ES-DOC. Also interested in working on making data FAIR (practical example).
- Klaus: identify existing projects/tools to extend/reuse (https://c6dreq.dkrz.de/); understand what kinds of grids are used (CMIP6, etc.); make a registry of the different grids (which can be reused to re-grid). ES-DOC ocean description metadata for CMIP6: https://specializations.es-doc.org/static/index.html?target=cmip6.ocean&client=esdoc-url-rewrite
- Aiden: see what tools could be reused (in our own research group) for our own data; consider using Aiden *et al.*'s data as a practical example.
- Joakim: librarian; machine-actionable DMPs as a tool to evaluate the FAIRness of data. ES-DOC and Errata are interesting too.
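On the DMP-integration inputs above (Adil, Joakim): the "machine-actionable" idea from today's discussion, a storage service reading the DMP instead of a separate application form, can be illustrated with a small sketch. This minimal Python example assumes the DMP is exported as JSON roughly along the lines of the RDA DMP Common Standard; the field names (`dmp`, `dataset`, `distribution`, `byte_size`, `host`) follow that standard as far as I know, but treat the exact structure, and the NIRD host and sizes, as illustrative assumptions:

```python
import json

# Sketch: a storage-allocation service reading a machine-actionable DMP
# instead of a separate application form. The JSON layout loosely follows
# the RDA DMP Common Standard; field names and values are assumptions.

dmp_json = """
{
  "dmp": {
    "title": "NICEST2 example DMP",
    "dataset": [
      {
        "title": "Historical simulation output",
        "distribution": [
          {"byte_size": 2000000000000, "host": {"title": "NIRD"}}
        ]
      },
      {
        "title": "Spin-up restart files",
        "distribution": [
          {"byte_size": 500000000000, "host": {"title": "NIRD"}}
        ]
      }
    ]
  }
}
"""

def storage_requests(dmp):
    """Sum the requested bytes per storage host listed in the DMP."""
    totals = {}
    for ds in dmp["dmp"].get("dataset", []):
        for dist in ds.get("distribution", []):
            host = dist.get("host", {}).get("title", "unknown")
            totals[host] = totals.get(host, 0) + dist.get("byte_size", 0)
    return totals

print(storage_requests(json.loads(dmp_json)))
# {'NIRD': 2500000000000} -> 2.5 TB, which a storage reservation
# service could pick up without extra forms
```

The point is not this particular code but the direction: once the DMP is structured data rather than free text, services like Sigma2 MAS could consume it directly.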