_Reproducibility in Science: Helmholtz ReproHack_

![](https://github.com/reprohack/reprohack-hq/blob/master/assets/reprohack-banner.png?raw=true)

_Reproducibility in Science: Helmholtz Incubator Summer Academy ReproHack 2023_
===

<img src="https://ords-mv.github.io/assets/images/ordsmv-hex-sticker.png" alt="drawing" width="200"/>

###### tags: `Reprohack` `ORDS-MV` `Helmholtz Incubator Summer Academy 2023`

:::info
- :calendar: **21st September 2023 CEST**
- :watch: **9:30--15:30**
- :earth_africa: **Remote:** via Gathertown
- :purple_heart: **Code of Conduct:** https://github.com/reprohack/reprohack-hq/blob/master/CODE_OF_CONDUCT.md
- :arrow_forward: **Slides:** https://hackmd.io/@ords/rkBJt7C03
:::

# Agenda

**9:30 - Opening and virtual come together**
* Introduction ORDS-MV (https://ords-mv.github.io/)
* Icebreaker
* Example: Article including Code and Data
* Team formation

**10:00 - 1st part of the workshop**
* Hands-on analysis in teams

**11:30 - Rejoin and tell**

**12:30 - Lunch break :pizza: :stew: :strawberry:**

**13:30 - 2nd part of the workshop**
* Continue working in your team
* Prepare feedback to the group and for the authors

**15:00 - Evaluation and Goodbye**

### **Participants**

***Please add yourself (Affiliation / Twitter / GitHub)***

* Frank Krüger (University of Applied Sciences Wismar / [@\_frank_k\_](https://twitter.com/_frank_k_) / [f-krueger](https://github.com/f-krueger)) **Frank is ill, unfortunately, and cannot participate today**
* Anja Eggert (FBN Dummerstorf / [@AnjaEggert42](https://twitter.com/AnjaEggert42) / [AnjaEggert](https://github.com/AnjaEggert))
* Max Schröder (University of Rostock / [@m6121](https://twitter.com/m6121) / [m6121](https://github.com/m6121))
* Patrick Draheim (DLR-VE)
* Riccardo Massei (UFZ - Leipzig / rmassei)
* Naiara Fernandez (GFZ-Potsdam)
* Stelian Curceac (KIT-IMK/IFU/@curceac)
* Lautaro Hickmann (DLR-KI)
* Elisabeth Schiessler (Hereon / no twitter / ElisabethJS)
* Florian Heinrich (DLR-KI)
* Feng Lin (FZJ- Jülich/Ddorf)
* Kshema Shaju (AWI - Universität
Wuppertal)

# Icebreaker

Form teams and try to answer the following questions in breakout rooms.

- Why are you here?
- Have you ever been unable to reproduce something (someone else's work or your own)?

**Name your group!**

- What do you have in common?

# :recycle: ReproHacking - Plan of Action

Select an article from the ReproHack repository: https://www.reprohack.org/paper/

# :computer: Form Teams

Feel free to join one of the predefined teams, *Beginners* (Room 1) or *More Advanced* (Room 2), create your own team (Rooms 3 or 4), or work individually on the paper.

## Beginners R - Room 1

The paper is analysed with respect to its published resources, and the original analysis is re-run to see whether it generates the same results.

_Participants are expected to have some basic knowledge of R_

Participants:
* none

### Feedback

## More advanced Docker - Room 2

A Docker image is created in order to re-run the original analysis in a well-specified environment.

_Participants are expected to have some basic knowledge of R and Docker_

Participants:
* Stelian
* Max

### Feedback

#### Initial Feedback for rejoin

* Paper and code are available at the ReproHack website (both under CC-BY 4.0)
* A guide on how to repeat the analysis is in the README.md of the code repository
* The code repository is missing a LICENSE file; however, the license is stated in README.md
* README.md contains a list of packages including version numbers, but the repository does not capture a computational environment

#### Final Feedback

##### Access

* How easy was it to gain access to the materials?
    * All materials are accessible; no problem downloading code/paper/data incl. submodules, all available under the open access license CC-BY 4.0
    * Code/data repositories are missing a LICENSE file; however, the license is stated in README.md
* Did you manage to download all the files you needed?
    * Yes, no problems

##### Installation

* How easy / automated was installation?
    * (Docker was a little difficult to install and get up and running)
    * README contains information on the software stack incl. versions, but no further guidance/automation/package management
* Did you have any problems?
    * How to get the same computing environment
* How did you solve them?
    * We created an renv environment inside a Docker image, which worked quite well

##### Data

* Were data clearly separated from code and other items?
    * A really nice literate programming file, `analysis.qmd`, intertwines documentation, code and results
    * Data is kept separately in a git submodule
* Were large data files deposited in a trustworthy data repository and referred to using a persistent identifier?
    * For the analysis, the data is included from a versioned git repository
    * Furthermore, a Zenodo archive as well as a SPARQL endpoint exist: https://data.gesis.org/somesci/
* Were data documented ...somehow...
    * Yes, quite well documented in the SPARQL endpoint and in the git repo as well

##### Analysis

* Were you able to fully reproduce the paper?
    * The authors did a really good job and most plots could be reproduced completely identically
    * Some very tiny differences compared to the paper could be found:
        * Fig S1 (R) compared to Figure 3 (article) has a different order on the x-axis as well as a different label
        * Fig S7 (R) says the article figure number is 7, which seems to be Figure 8; and so on for the other figures
* How automated was the process of reproducing the paper?
    * Totally automated by just clicking on "render", great!
* If the analysis was not fully reproducible:
    * Were there missing dependencies?
    * Was the computational environment not adequately described / captured?
    * Were there bugs in the code?
    * Did code run but results (e.g. model outputs, tables, figures) differ to those published? By how much?

##### Documentation

Was there adequate documentation describing:

* how to install necessary software including non-standard dependencies?
    * The required software incl.
versions were stated, but there was no further guidance on how to get there and no package management system
* how to use materials to reproduce the paper?
    * Yes, a short reproduction guide was given
* how to cite the materials, ideally in a form that can be copy and pasted?
    * No, there was no information on how to properly cite the code/data; no cff-file was provided

##### Transparency

* How easy was it to navigate the materials?
    * The materials were easy to find and the low number of files was helpful; however, the submodule contains quite a lot of files, which have not been looked at in this review
* How easy was it to link analysis code to:
    * the plots it generates
        * This was great due to the literate programming approach
    * sections in the manuscript in which it is described and results reported
        * A reference was given in the plots. However, the reference points to the wrong figure number

## Room 3

https://www.reprohack.org/paper/75/

Paper: Measuring the impact of COVID-19 vaccine misinformation on vaccination intent in the UK and USA

Python/Jupyter Notebooks

Participants:
* Naiara
* Kshema Shaju

### Feedback

## Room 4

Same (originally proposed) R paper, but try to make it possible in Python

https://www.reprohack.org/paper/86/

Participants:
* Florian Heinrich
* Lautaro Hickmann
* Elisabeth Schiessler
* Patrick
* Anja

### Feedback

General comments:

* (analyse.qmd lines 60-62) software type 'book' is changed to 'software_article' without comment
* We only managed part of the paper due to the time limit (up until Fig.
3)
* CC-BY-License is not suitable for code; also, there is no designated license file in the GitHub repo
* Translating from R to Python requires really understanding each step of the code, which was good, but it also requires good knowledge of both languages

Encountered errors:

* Images were missing; the R file needs to be re-rendered to generate them
* Grouping df_citation_types by software_citation_types (contained in df_tmp) yields different counts (analyse.qmd lines 98-100); the sum is correct, however
* The subsequent multinomial confidence interval estimator yields quite different results (lower and upper bounds are different even when using the counts from the original R file) - might be a version conflict or the use of a different method in the background ('goodman' instead of 'sison-glaz')
* Figure 3 axis labels are different; the plotted data differ as well due to the above error

## Room 0

Paper: The viewing angle in AGN SED models, a data-driven analysis (https://www.reprohack.org/paper/51/)

Jupyter Notebooks/Binder

Participants:
* Feng (Room 0)

### Feedback

* Grid search cross-validation & 10-fold cross-validation of the model evaluation keep crashing... but the other notebooks (10/12) worked well
* A better internet connection is needed, otherwise the connection to the Binder server is easily lost...
* Jupyter Notebook itself is not very good for running a time-consuming ML project using imaging datasets (personal feeling)
* Final figures in the notebooks I could run are all the same as in the original paper
    * based on pre-saved models (so results could differ if I had fully reproduced the 2 notebooks)
* Question: all notebooks seem independent of each other...
    * pros: errors during reproduction can be located well; easy for others to check step by step
    * cons: authors saved intermediate output already, which might cause confusion during reproduction...
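
Regarding Room 4's note on the multinomial confidence interval estimator: when porting between R and Python, it can help to compute one method explicitly so that the method in use is unambiguous. The following is a minimal sketch, not the authors' code. It implements Goodman's simultaneous intervals using only the Python standard library; the function name `goodman_intervals` and the example counts are hypothetical and not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def goodman_intervals(counts, alpha=0.05):
    """Goodman simultaneous confidence intervals for multinomial proportions.

    Hypothetical helper for cross-checking results; returns one
    (lower, upper) tuple per category.
    """
    k = len(counts)
    n = sum(counts)
    # Chi-square quantile (1 degree of freedom) at level 1 - alpha/k,
    # computed as the square of the matching standard normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    a = z * z
    intervals = []
    for c in counts:
        half_width = sqrt(a * (a + 4 * c * (n - c) / n))
        lower = (a + 2 * c - half_width) / (2 * (n + a))
        upper = (a + 2 * c + half_width) / (2 * (n + a))
        intervals.append((lower, upper))
    return intervals


# Hypothetical counts, for illustration only.
example_counts = [10, 20, 70]
total = sum(example_counts)
for count, (lower, upper) in zip(example_counts, goodman_intervals(example_counts)):
    print(f"p = {count / total:.2f}, interval = [{lower:.3f}, {upper:.3f}]")
```

Feeding the same counts into both the R and the Python pipeline and comparing against such an explicit calculation would show whether a discrepancy comes from the method (e.g. Sison-Glaz versus Goodman) or from the counts themselves.
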

---

* Fill in the author feedback form, documenting your experiences reproducing your chosen group

---

# Feedback tips

Here are some tips on more specific aspects of the materials to focus on:

## Access

* How easy was it to gain access to the materials?
    * We would like to propose adding the images to the repository or rendering the `analyses.qmd` file with the additional `self-contained: true` property. Otherwise, one has to have R and rendering tools installed to see the results.
* Did you manage to download all the files you needed?

## Installation

* How easy / automated was installation?
* Did you have any problems?
* How did you solve them?

## Data

* Were data clearly separated from code and other items?
* Were large data files deposited in a trustworthy data repository and referred to using a persistent identifier?
* Were data documented ...somehow...

## Analysis

* Were you able to fully reproduce the paper?
* How automated was the process of reproducing the paper?
* If the analysis was not fully reproducible:
    * Were there missing dependencies?
    * Was the computational environment not adequately described / captured?
    * Were there bugs in the code?
    * Did code run but results (e.g. model outputs, tables, figures) differ to those published? By how much?

## Documentation

Was there adequate documentation describing:

* how to install necessary software including non-standard dependencies?
* how to use materials to reproduce the paper?
* how to cite the materials, ideally in a form that can be copy and pasted?

## Transparency

* How easy was it to navigate the materials?
* How easy was it to link analysis code to:
    * the plots it generates
    * sections in the manuscript in which it is described and results reported

---

# :raised_hand_with_fingers_splayed: 5-finger Feedback for this event

## :point_up: One thing that you enjoyed

_(put your comments here)_

* mixing R and Python and having to understand what is going on in the code
* learning about ReproHack Hub
* getting the chance to try Binder
* sticking to the schedule

## :point_up: One thing that can be improved

_(put your comments here)_

* evaluating 2-3 papers that are at similar levels of reproduction could be nice
* the list of possible papers is a bit much to handle; it is not clear what is feasible without lots of extra software
* Prepare a software solution that enables better collaboration, e.g. a Jupyter notebook where everyone can write code.
* provide a choice of a few papers so that groups of 2-3 people are created
* [maybe Frank Krüger had even prepared something, but in general it would be good if the moderator (if available) looks at the task beforehand and, if necessary, at possibly problematic places, so that the group does not fail because of small things (e.g. R->Python methods)]
* Maybe a few more references to the first part of the lecture would have been nice (e.g.
like a checklist)

## :point_up: One thing that you did not like

_(put your comments here)_

* lack of information beforehand about what software is required (installation could have been done already)
* getting stuck with compiler installation
* not knowing the approximate hardware requirements/running time of the software

## :point_up: One thing that you would like to keep

* the possibility to choose our own topics / papers
* the gathertown / hackpad combination works well
* structure of the workshop works very well, with mid-point all-hands discussions

## :point_up: One thing that came up short

_(put your comments here)_

* understanding what the main point of the paper is (we did not manage to go through everything/no time for reading about the background)
