
# sn|nat workshop on "Open Data and Data Management"

Monday 29 October 2018, Papiermühlestrasse 15, Bern

Program: [naturalsciences.ch](https://naturalsciences.ch/service/events/105080-open-data-and-data-management---issues-and-challenges)

## Executive summary

There seemed to be a consensus at the Open Access (to publications) event at UNIL last Friday (-> "we need to reach this goal as an ecosystem of actors, let’s explore multiple strategies and the best solution will eventually survive"). The Open Data debate, by contrast, is still very confused. In the conversation, the issue is not framed properly and the goals are unclear (what are we trying to achieve with Open Research Data?). Most arguments against sharing are not based on an actual cost-benefit analysis, or even on an analysis of what qualifies as valuable data in the various disciplines. Basically, most researcher speakers showed how difficult / costly / stupid it is to share all of the research data that is produced, and presented that as a good reason not to share. It is hard to see any common ground emerge. The realisation that disciplines are different is there, but not the realisation that there are things they can learn from each other.

## The future of science is open ([Christophe Rossel](https://researcher.watson.ibm.com/researcher/view.php?person=zurich-rsl), EU Open Science Policy Platform, sn|nat, IBM)

The rationale behind open science

- expand from a closed system, with a variety of new actors and tools
- affects how research is performed, knowledge is shared, results are evaluated, research is funded, researchers are rewarded
- disruptive change

The benefits

- Better ROI of the R&I investment
- Faster circulation of new ideas
- More transparency
- Allowing transdisciplinarity

**2016** - 8 [holistic policy priorities](http://euagendas.weebly.com/data.html) (now called [Comparative Agendas Project](https://www.comparativeagendas.net/pages/About)):

- 4 with regard to the use and management of research: open and FAIR data, science cloud, altmetrics, future of scholarly communication (100% OA)
- 4 with regard to research actors: rewards, research integrity, education and skills, citizen science

Open Science policy now:

1. Mandatory OA to publications (2014 -> 2018 new platform)
2. [OA to research data](https://www.openaire.eu/what-is-the-open-research-data-pilot) (2017 pilot, initially mandatory, now revised recommendations -> ORD by default, with opt-out, however DMP and FAIR data mandatory)
3. [European Open Science Cloud](https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud) (today! launch of the 1st phase) - see European Research Data pilot in [‘European Cloud Initiative – Building a competitive data and knowledge economy in Europe’](http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15266) (PDF)
4. [Open Science Policy Platform](https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-policy-platform) (OSPP, delivers recommendations on the 8 policies above) is composed of 25 members from 8 stakeholder groups.

General recommendations are: appoint national coordinators and create a task force; ensure scholarly communication infrastructure; develop a culture of OS; foster OS literacy and train skills; etc.

See also:

- Overview of the [current funded projects in Open Science](https://ec.europa.eu/research/openscience/pdf/project_overview_os_responsible_ri_2018.pdf#view=fit&pagemode=none) from Horizon 2020 FY2015+ (PDF)

Data Package (measures to facilitate the creation of a common data space)

EOSC policy milestones: June 2017: 1st summit; October 2017: declaration; roadmap; 2nd summit; governance structure (EOSC Board -> review and decide, Executive Board -> elaborate and propose, Stakeholder Forum -> ?)

Survey on Open Science and career development for researchers 2018 by the European Physical Society ([survey linked here](https://www.eps.org/blogpost/751263/292467/Survey-on-Open-Science--Career-Development-for-Researchers-2017-2018?hhSearchTerms=%22career+and+survey%22&terms=) but results not yet public?)

- Connecting and building communities remains a central problem
- Main issues:
  1. personal data protection & confidentiality
  2. legal restrictions
  3. time and effort
  4. lack of skills
  5. costs

## Views from the SNSF (Matthias Egger)

Cites Amsterdam Call for Action (holistic definition of Open Science)

Opens with ["Topics"](https://naturalsciences.ch/topics) proposed by sn|nat and how important Open Science is to them.

Animal research (3R) benefits from reproducibility enabled by data sharing.

[A Framework for Improving the Quality of Research in the Biological Sciences](https://mbio.asm.org/content/7/4/e01256-16) (Arturo Casadevall et al.) - standard operating procedures throughout the experimental process

Upcoming SCNAT event: [Beyond Impact Factor](https://naturwissenschaften.ch/service/events/103587-beyond-impact-factor-h-index-and-university-rankings-evaluate-science-in-more-meaningful-ways) (21.11.2018) - how to evaluate science in more meaningful ways

Cites the "credibility crisis" and how it generated communities around open and reproducible science, adopting state-of-the-art methodologies to evaluate the rate of reproducibility. Cites the article [Estimating the reproducibility of psychological science](http://science.sciencemag.org/content/349/6251/aac4716), Matt Todd's work (e.g. [Open Science is a research accelerator](https://www.nature.com/articles/nchem.1149)), and CMS open data and its reuse by MIT researchers (see the news [here](http://news.mit.edu/2017/first-open-access-data-large-collider-subatomic-particle-patterns-0929) and the article [Exposing the QCD Splitting Function with CMS Open Data](https://dspace.mit.edu/handle/1721.1/111834)), cited as a great example of citizen science collaboration.

Marcel Tanner (who unfortunately is not here - see [WHO interview](http://www.who.int/malaria/mpac/bio-marcel-tanner/en/)) did a lot of research on schistosomiasis, an area of intense collaboration between hospitals and scientists and "a fantastic example of open collaboration for the common good".

In biodiversity, access to free and open satellite data is key. *(Side note: this is something I'm working on with Uni Bern right now! -oleg .. check the article shown later)*

The potential impact of Open Research Data:

- Accelerate scientific discovery
- Improve quality
- Foster reproducibility
- Speed up innovation
- Improve citizen engagement
- Foster collaborative, transnational research and access in developing countries

SNSF is not the first agency to make open data a priority. The US NSF made sharing data between researchers a grant guideline in 2005 and has required a Data Management Plan since 2011. In the UK, DMPs have been mandated since 2003. In Switzerland we are part of the European research ecosystem and it makes sense to go forward.

The SNSF open research data policy is here: http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx

A DMP is a formal requirement and is shared at the end of the project; data needs to be deposited in a repository (commercial ("open") repositories are allowed, but SNSF won't pay for upload or preparation) http://www.snf.ch/SiteCollectionDocuments/DMP_content_mySNF-form_en.pdf

The FAIR Data principles https://www.force11.org/group/fairgroup/fairprinciples apply to choosing an appropriate repository for the research data.

You can adapt/change/develop the strategy of your DMP until the project is completed. DMPs are very individual.

"For research that does not produce re-usable data, only part of the form has to be completed" `•͡㇁•͡`

Next steps:

- introduce DMP to all instruments
- in-depth evaluation of DMPs of proposals funded by Div I-III
- collaboration with Science Europe, swissuniversities, etc.
- concept on SNSF actions to support repositories

## Joel Mesot, Director of PSI, future ETH President

He asks: what are we talking about?

- open access to published data?
- open access to all data, enabling an independent reproduction of the published data by third parties?
- open access to not yet analysed and published data (he calls it "live access")?

PSI scientists publish each year 1000 papers x 4-10 figures = a lot of data

Some arguments in favour:

- researchers have an interest in sharing data

But:

- Taxpayers may first expect a national exploitation of the published data -> needs explanation

Mentions a successful example: PDB (data stream important: raw data from MB to TB -> density data MB -> model kB to MB)

Access to raw data should help promote the subject with younger scientists testing the waters.

The transparency law for public institutions applies to every document we produce.

Important to meet, discuss and clarify a few points.

Complexity as a barrier to Openness? Lack of reciprocity as a barrier to Openness?

- Open access to published data -> yes, desirable and already
- Open access to all data -> hard in practice

Conclusion: Implement suitable measures for suitable data to allow efficient use.

## Open Research Data and data sharing, a scientist perspective (Florian Altermatt)

https://twitter.com/altermatt_lab

Pragmatic approach to data sharing in Ecology

His group works using a variety of approaches, from modeling (causality) to field studies (realism)

Motivations/reasons for data sharing -> paywalls are morally wrong; making data available can be to the benefit of individual scientists.

The top motivation in the group is the spirit of "openness"? (open science, reproducibility, visibility and credibility), second is "internal incentives" (role model for collaborators, teaching), least important is "external incentives" (journal and funder requirements)

How do we do data sharing?

e.g. video too big to be shared -> "Impossible or impractical to keep all this raw data" .. "at least in our field, there are no repositories to hold this" .. "you have to trust that our code and analysis that we release does it right" -> *(Note: how about using samples?)*

Instead share simple matrices or data frames of processed data (specifically: looking at video of single-celled organisms, extracting movement speed and other key parameters)

Code, protocols, sequence data -> easier, widely accepted, helpful to ensure quality control

Typical workflow: bioRxiv (preprints), Dryad (data), GitHub (code/development) - open research publications and code go hand in hand.

Outsourcing long-term storage of data is a big problem/stretch goal - even as the amount and costs of data sharing vary a lot. *(Note: this is such a universal problem, why are we not solving it jointly?)*

Advantages: as a PI -> forces lab members to structure research (datasets, code), visibility, track record and credit for data use, availability for reuse, outsourcing storage and archiving

Time investment: from 1-2 hours for uploading to endless for exhaustive data management.

Cost investment: 0-120.-

Challenges (technical):

- Big Data (video, etc.)
- Time series, incremental, where to publish?
- How to handle "re-analyses"?

Problems with radical openness:

- Researchers worry about being scooped
- Long-term datasets -> When to release? How to get credit for significant efforts?
- Misuse?

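*(Note: the "share processed data instead of raw video" approach from this talk could look roughly like the Python sketch below. This is only an illustration, not the lab's actual pipeline: the track format `(time_s, x_um, y_um)`, the function names and the column names are all assumptions made up for the example.)*

```python
import csv
import io
import math
from statistics import mean


def mean_speed(track):
    """Mean speed of one tracked organism.

    `track` is a list of (time_s, x_um, y_um) tuples sorted by time
    (hypothetical format, assumed for this sketch).
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicated or out-of-order frames
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
    return mean(speeds) if speeds else float("nan")


def summarise(tracks):
    """Reduce raw tracks to a small table with one row per organism."""
    return [
        {
            "id": organism_id,
            "n_frames": len(track),
            "mean_speed_um_s": mean_speed(track),
        }
        for organism_id, track in sorted(tracks.items())
    ]


def to_csv(rows):
    """Serialise the summary table as CSV text, ready to deposit."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["id", "n_frames", "mean_speed_um_s"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy tracks standing in for positions extracted from video.
tracks = {
    "cell_1": [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0), (2.0, 6.0, 8.0)],
    "cell_2": [(0.0, 1.0, 1.0), (1.0, 1.0, 2.0)],
}
table = summarise(tracks)
csv_text = to_csv(table)
print(csv_text)
```

*(The resulting table is a few lines per organism rather than gigabytes of video, which is what makes it easy to deposit alongside the code. The trade-off named in the talk remains: readers have to trust the extraction step, unless sample videos are shared as well.)*
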
## Open Data and Data Management: issues and challenges (Marcel Mayor, Dpt Chemistry, Uni Basel)

- group of individuals creating a lot of data that others can benefit from (climate research, high-energy particle physics)
  - Big science -> benefit from large infrastructure, team up, lobbying for large investment
- individuals having ideas, enabling optimisation and realisation of an individual idea or concept, not ready to share data before she/he can get recognition
  - Small science -> small infrastructure, easy to copy -> secretive behaviour of chemists is necessary due to the nature of their work

Makes the following statement: "Achievements and findings are IP of the PI"

[Ends with a Calvin & Hobbes cartoon](http://www.chemistry-blog.com/wp-content/uploads/2010/05/OChemAthiesm.jpg)

## Open Data Management (Robert Jones, IT Department, CERN)

New era in Fundamental Science?

Data sharing is supported by the Worldwide LHC Computing Grid, a federated model (as proposed for the EOSC)

We need to preserve all of this, but not all of it is open, or only after "suitable" embargo periods.

CERN launched an open data portal in 2014 (which the community was always [excited about](https://twitter.com/loleg/status/537722938259030016)) and will continue supporting it.

"It takes time, effort and money to get data to a state where it can be reused".

A project (CAPREANA) to support reusable analyses https://github.com/cernanalysispreservation

Working on standards, not specific to physics, together with the life science community

ISO 16363 for trusted digital repositories -> scope: scientific data and CERN's digital memory; timescale: complete prior to 2020

Hybrid Cloud Model: massive rise over the past decade of commercial services for data processing & storage - CERN wants to take advantage of this and integrate new possibilities in the platform.

[Helix Nebula Science Cloud](https://www.hnscicloud.eu), joint pre-commercial procurement.

Mutualisation of resources with a diversity of organisations -> public session at CERN on 29 Nov. 2018 (live streamed via website?)

## Space and Open data - How far do you go? (aka "How deep is your pocket?") (Nick Thomas, Space Sciences, University of Bern)

"Open data is a standard thing that we do" [in space science]

Started at least in 1986 with the [Giotto mission](https://en.wikipedia.org/wiki/Giotto_(spacecraft)) whose data on Halley's comet was [published on the Internet](https://pdssbn.astro.umd.edu/data_sb/missions/giotto/index.shtml)

CaSSIS - a current example by Uni Bern, a camera currently orbiting Mars.

Requirements from space missions: right at the beginning, a science management plan; sharing no later than 6 months after reception and distribution of the data by the MOC, on the [Planetary Data System](https://pds.jpl.nasa.gov/) (and other archives); all of this signed 3 years before launch.

You have to write your data according to a standard, something we haven't really talked about today *(O RLY? Or maybe when people say 'standard' they mean 'the world as I see it')*. <- please keep these comments in ;) I'm going nuts.. *(sure thing :)*

Why do all this?

It is pointless archiving data that cannot be used by anyone but the producers. *Isn't it pointless to produce data that only the producers can use?*

Keep in mind that data producers themselves have 'data mortuary' issues, being unable to access their own results after a decade or so.

But this (sharing data) is not enough.. Calibration of instruments, and how it affects the data, is a big issue in physics (in fact in all science, and in every kind of data that comes from observation of the world) - *(are we also talking about experimental bias here?)* - the data producer should consider calibrating the data for the community, not just for their own analytical purpose

Data processing -> argues it is not a circular process, but a linear process (with a beginning and an end)

"Don't talk to me about Python!"
- high-level languages are essential, and originally there was a requirement from PDS to publish results in C / Fortran (which are open / free as opposed to MATLAB etc.)

Archiving costs - 15% of the current spending of CaSSIS (1 FTE) goes to dealing with the requirements to open the mission's data (*so what?*)

But this has value! Archiving data for community use is highly valuable if done properly. Doing it properly costs manpower. Do not underfund this.

"Data vultures": a buzzword this speaker did not have time for. You mean this type: [Research Parasites](http://researchparasite.com)

## Data Management Landscape (Ana Sesartic, Digital Curation Office ETHZ/Center for Climate Systems Modeling, ETH Zurich)

Research Data Management is organizing your data in a way:

- which not only allows you to track back what you did some time ago
- but also facilitates sharing and publishing relevant data
- ...

*Volume does not correlate with complexity.. no need to show there is more data..*

Motivations from young scientists (esp. MSCA and Ambizione): do the right thing, perform better than their own PhD advisors

*(This was a nice introductory lecture, but I'm not getting much out of it personally \O - Same L)*

Audience comments:

#### Are there rules to decide what data gets accepted or not?

Small labs are worried about sharing data. Big labs are so keen to get to the results that they don't spend the time to do it properly. We cannot go into collaborations because of this problem; a central problem is Intellectual Property / patent generation based on data.

SNF: happy to discuss IP issues.

#### Did you have a discussion at SNF about the DMP? As a scientist, I would have appreciated being involved

*These people don't get that this is exactly an excuse to have the discussion ;)*

SNF: The research data plan is evolutionary and can develop with the needs of the community.

#### Will the Swiss community be able to join the EOSC or do Swiss universities need to build their own infrastructure?

It's called "open" for a reason! The European Open Science platform is open to all.

#### Does data sharing policy accelerate research?

For years a participant has apparently been trying to share data on the topic of dark energy, and has been ignored / stonewalled by the community. *(Sounds like the gentleman is not very optimistic, putting it mildly, about the fundamental promises of "openness" - IMHO the qualms of people in the field are something the community will need to bitterly work through until a new ethic is prevalent and mutual respect is restored. See also: [flaming](https://www.urbandictionary.com/define.php?term=flaming). ^ol)*

#### Curation of data management plans is expensive

Sharing data needs repositories, not mortuaries. Costs will be non-negligible (resources, HR, etc.) and the work should not be done by PhD students.

#### There seems to be a dichotomy between the moral imperative of funders and the pragmatic, real hidden costs to the project (question from a chemist)

Nick Thomas: In space, it's really demanded because we're a very "public" field. In other communities / disciplines, stakeholders have to sit and decide for themselves. "I couldn't in a million years analyze all the data sitting on my laptop hard drive." That's a powerful incentive to open it up.

#### Should the level of openness needed, the level of standardization required, get more input bottom-up from the community - and how could it be facilitated?

"Communities getting together: that's what's happening here!"

Nick Thomas: A huge amount of effort is going into setting standards, but individual communities should be getting together to review and give feedback.

#### Matthias Egger does not understand how a chemist can make the statement "We don't publish negative results"

From the viewpoint of the clinician, it's very important to make it clear that this doesn't work.
- a negative result in medicine and biology is very different.
- [we could spend all day on this, but we won't]

#### Confusion between "sharing data" and "open data"?

Stating that "Sharing has value, open has a cost".

*I think the confusion is exactly that people do not understand that the conditions for data to be open are what provides the basis for value in sharing - You agree with this statement?*

## Panel

Moderation: Gerd Folkers, President Swiss Science Council

- Gregor Häfliger, State Secretariat for Education, Research and Innovation SERI
- Christophe Rossel, EU Open Science Policy Platform/SCNAT
- Patrick Furrer, swissuniversities
- Angelika Kalt, Swiss National Science Foundation SNSF

What is new? -> mentions Diderot and Humboldt

Is the taxpayer actually asking for open science? Is there an economic incentive we're hiding? Millions in the data (according to McKinsey anyway)

- AK: What has changed? -> Diderot and Humboldt weren't "publicly funded"
- GH: What has changed? -> well, technology has changed
- CR: What has changed? -> the Solvay conference was a small community; now the research community is huge and disciplinary silos are crumbling into interdisciplinarity