# FAIRPoints- Things you need to know about Data Access Statements

:::danger
**Date**: April 21st 2022

**Event Summary:** [FAIRPoints_Point_4](https://www.fairpoints.org/fairpoints_resources/)
:::

**Speaker bio:** [Juliane](https://www.rd-alliance.org/users/juliane) & [Chris](https://www.rd-alliance.org/users/libcce)

**Intro slides:** https://shiny.link/dvu6zT

**Juliane's slides:** https://docs.google.com/presentation/d/11xt8rAxp-CsiMqujy6pyUO9SHkIxHN-PoFYtf5W8HqU/edit#slide=id.p

**Chris's slides:** https://docs.google.com/presentation/d/16gFz8nxshyhk_jwY5zevLqkM1yI2HCzmdf47nIkzwNE/edit#slide=id.g12225ab0835_0_1139

https://youtu.be/rpx57qnCGx8

**Keynote slides:**

Participants #: 25

### Agenda:

| Time        | Agenda                                | Speaker         |
| ----------- | ------------------------------------- | --------------- |
| 15:00-15:10 | Welcome, Housekeeping & introductions | Sara            |
| 15:10-15:50 | Keynote + Q&A                         | Juliane & Chris |
| 15:50-15:55 | Takehome messages                     | All             |
| 15:55-16:00 | Wrap-up                               | Sara            |

### Links:

- **Monthly Keynote events** May 30th 2022: FAIRPoints-Enhancing Sample Provenance and Experimental Reproducibility **Register** 👉 https://www.lyyti.fi/reg/fairpoints_samples
- **Slack:** [shiny.link/F71wE](https://shiny.link/F71wE)
- **Monthly Community Discussions** April 29th 2022 👉 https://shiny.link/Jl6nuV
    - FAIR for beginners
    - Schemas for training events-FAIRPoints
- Sign-up to **event series** 👉 https://bit.ly/3BEQ06X
- To get in touch with speakers-Twitter: @libcce & @JulianeS
- FAIR4beginners: https://hackmd.io/@selgebali/fair_4_beginners/edit

----

## Code of Conduct reminder

* Be respectful, honest, inclusive, accommodating, appreciative, and open to learning from everyone else.
* Do not attack, demean, disrupt, harass, or threaten others or encourage such behavior.
* Be patient, allow others to speak, and use the zoom reactions & chat if you would like to voice something.
* See also our [participation guidelines](https://www.fairpoints.org/participation_guides/).

# Rollcall:

🗣 Name / 🐸 pronouns / 📣 Social media handle

# Q&A:

:::success
❓ *Please add any questions you might have during the course of the session here:*
:::

* I am just curious how journal editors or publishers can implement or enforce not only having availability statements but also improving data quality to ensure data are reusable, so that it is not "garbage in, garbage out"
    * Chris: Peer review is the first step, and getting on the same page with peer reviewers takes time and iterations. Another step is our staff: publications specialists need information in a checklist format and need resources/education on how to handle data/software. Editors are also scanning submissions to see that they meet requirements. The more challenging questions come to me, the data steward, and I work through the scenario with them.
* Could you please expand on how the Data Help Desk works in AGU, what you do and what you do not do?
    * Chris: I answered a bit above: I work with challenging scenarios, and I develop guidance/training for staff, authors, and editors. I also work on streamlining these workflows, working with community stakeholders. I also work directly with authors if they reach out earlier in the publications process. Publications staff and peer reviewers work with many of the papers; I get a subset, and I help our pubs team streamline steps further. The expanded help desk would involve a lead steward role or roles, where we would feed questions to a wider group in a help desk solution/forum. This allows us to get different perspectives and a more complete answer.
* How are some of the templates built around audience impact from sharing methods? I wish there were flexible brainstorming templates on this part prior to DMPs, where the research expert shapes some of that content (separate effort from sponsor policy & standard journal policy)
* Chris, thanks for the intro to Data Availability Statements, about which prior to listening I knew nothing (!), so please excuse a possibly naive question. It struck me when listening that they have a lot in common with DMPs. DMPs are prepared before or at the start of the research process, and Data Availability Statements at the end, when the predicted data uses in the DMP have actually happened, or not happened, or changed. Do you see synergies between the two? Or another way of looking at it is that the DASs make up for DMPs not being able to evolve. Comments?

## Notes:

Slides from today's talk:

* https://docs.google.com/presentation/d/16gFz8nxshyhk_jwY5zevLqkM1yI2HCzmdf47nIkzwNE/edit#slide=id.g12225ab0835_0_1139
* https://docs.google.com/presentation/d/11xt8rAxp-CsiMqujy6pyUO9SHkIxHN-PoFYtf5W8HqU/edit#slide=id.p

* AGU is improving guidance on sharing data and software; its position statement on sharing data goes back to 1997
* Recently received support from NSF to improve sharing of information with regards to data, later working on software
* See "Data & Software for Authors" site on AGU for more info
* Context is helpful for people who want to access data and software and understand related research: working with authors on how they provide this info, especially availability statements (lots more to share than data: software, workflows, notebooks...)
* Availability statement: metadata that goes along with the citation; key to this is that an in-text citation in References can provide further impact/value
* See Availability Statement Templates
* Bracketed description (common to APA) signals to others in the process that the citation is something different (dataset, software, notebook...); tags come from the DataCite schema, which references different description types
* Force11 implementation group is releasing guidance about improving the citation process (e.g., indexed properly by CrossRef)
* Example of Availability Statement: 10.1029/2021EA001675
* Help Desk Challenges: Government sites: repositories know about DOIs, but groups outside the repository community have limited knowledge of these types of sharing mechanisms, so this is still a challenge
    * Also data with national concerns, e.g., seismic data (could be used to understand where nuclear tests are happening)
    * Firewalls are also a challenge; many systems still have firewalls
    * Friction with curatorial issues: authors may have to wait months for data to be released
    * Challenge with lack of info on how data or software should be cited
    * "Citation Nothingness": authors cite another publication where data should be available, but it is not actually available
    * Many data links: how do you properly cite?
    * Preserving large data: with the increase in computation/ML, AGU is seeing an explosion of large datasets; authors may not be able to share all of the data (repos can't handle TBs of data)
* Takeaways from AGU lessons learned: want to do this as a community and share the help desk with others in the community to develop guidance and instructions collectively, not just within AGU (see ESIP data help desk)
* How do you work with authors earlier in the process? Reach out as they write their paper: curvenote.com, Jupyter notebooks elevated as a new form of publication, called "Notebooks Now"
* Cookiecutter Data Science: template in a Git repo that you can fork to your own repo/org and use as a guide for how to store and organize code; could be used for data as well?
* Show researchers the value of doing this work; see PLOS as an example, where they created a data badge that allows filtering of papers that have data/software available; researchers have responded positively to this
* Community needs to work together: researchers are faced with multiple discipline repositories; institutions should find a way to streamline information to avoid overwhelming researchers
* Lots of pieces to data availability statements
* Discrepancy between all of the data and what is used in the publication
* SAGE: asks for lots of metadata from researchers when describing a dataset (Synapse data portal); metadata is annotated to the data files contained within that dataset
* See slides for Data Avail. Statement example: description of research, location, description of what the repository is; conditions for access and use with link to instructions; DOI to landing page
* Problem: not telling you exactly what subset of files is being used in the study, if indeed an actual subset
* New solution to address this: called "dataset", a collection of data files that are used in a study plus metadata
    * "dataset" given a DOI
    * Pros: direct link to data, able to choose data from current studies and disparate ones
    * Cons: must be in Synapse platform, access issues if linking out to other repos, no automatic access (system doesn't know who you are/access level if clicking on DOI), versioning and provenance not immediately evident or machine actionable
* Versioning and provenance are current topics of discussion: as data is getting reused/reprocessed, versioning is becoming important; synchronizing specimen IDs and tracking reuse is still challenging, and coordination of data for reuse is tricky
* Authors still making up own availability statements! So hard to enforce still

:::info
# FAIRPoints- What is your take home message from today's session?

*✏️ Silent documenting of learning outcomes+ share outs, add +1.*
:::

# Thank you for joining! 🎉

## 5 ways to stay involved

* Sign-up to event series: [https://bit.ly/3BEQ06X](https://bit.ly/3BEQ06X)
* Website: https://www.fairpoints.org/
* Twitter: [@FAIR_Points](https://twitter.com/FAIR_Points)
* Slack: [shiny.link/F71wE](https://shiny.link/F71wE)
* Email: [fairpoints@protonmail.com](mailto:fairpoints@protonmail.com)