![](https://github.com/reprohack/reprohack-hq/blob/master/assets/reprohack-banner.png?raw=true)
_Reproducibility in Science: Helmholtz Incubator Summer Academy ReproHack 2023_
===
<img src="https://ords-mv.github.io/assets/images/ordsmv-hex-sticker.png" alt="drawing" width="200"/>
###### tags: `Reprohack` `ORDS-MV` `Helmholtz Incubator Summer Academy 2023`
:::info
- :calendar: **21st September 2023 CEST**
- :watch: **9:30--15:30**
- :earth_africa: **Remote:** via Gathertown
- :purple_heart: **Code of Conduct:** https://github.com/reprohack/reprohack-hq/blob/master/CODE_OF_CONDUCT.md
- :arrow_forward: **Slides:** https://hackmd.io/@ords/rkBJt7C03
:::
# Agenda
**9:30 - Opening and virtual come together**
* Introduction ORDS-MV (https://ords-mv.github.io/)
* Icebreaker
* Example: Article including Code and Data
* Team formation
**10:00 - 1st part of the workshop**
* Hands-on analysis in teams
**11:30 - Rejoin and tell**
**12:30 - Lunch break :pizza: :stew: :strawberry:**
**13:30 - 2nd part of the workshop**
* Continue working in your team
* Prepare feedback to the group and for the authors
**15:00 - Evaluation and Goodbye**
### **Participants**
***Please add yourself (Affiliation / Twitter / GitHub)***
* Frank Krüger (University of Applied Sciences Wismar / [@\_frank_k\_](https://twitter.com/_frank_k_) / [f-krueger](https://github.com/f-krueger)) **Frank is ill, unfortunately, and cannot participate today**
* Anja Eggert (FBN Dummerstorf / [@AnjaEggert42](https://twitter.com/AnjaEggert42) / [AnjaEggert](https://github.com/AnjaEggert))
* Max Schröder (University of Rostock / [@m6121](https://twitter.com/m6121) / [m6121](https://github.com/m6121))
* Patrick Draheim (DLR-VE)
* Riccardo Massei (UFZ - Leipzig / rmassei)
* Naiara Fernandez (GFZ-Potsdam)
* Stelian Curceac (KIT-IMK/IFU/@curceac)
* Lautaro Hickmann (DLR-KI)
* Elisabeth Schiessler (Hereon / no twitter / ElisabethJS)
* Florian Heinrich (DLR-KI)
* Feng Lin (FZJ - Jülich/Ddorf)
* Kshema Shaju (AWI - Universität Wuppertal)
# Icebreaker
Form teams and try to answer the following questions in breakout rooms.
- Why are you here?
- Have you ever been unable to reproduce something (others' work or your own)?
**Name your group!**
- What do you have in common?
# :recycle: ReproHacking - Plan of Action
Select an article from the ReproHack repository:
https://www.reprohack.org/paper/
# :computer: Form Teams
Feel free to join one of the predefined teams, *Beginners* (Room 1) or *More Advanced* (Room 2), create your own team (Rooms 3 or 4), or work individually on the paper.
## Beginners R - Room 1
The paper is analysed with respect to its published resources, and the original analysis is re-run to see whether the same results are generated.
_Participants are expected to have some basic knowledge of R_
Participants:
* none
### Feedback
## More advanced Docker - Room 2
A docker image is created in order to re-run the original analysis in a well-specified environment.
_Participants are expected to have some basic knowledge of R and Docker_
Participants:
* Stelian
* Max
### Feedback
#### Initial Feedback for rejoin
* Paper and Code available at the ReproHack Website (both available via CC-BY 4.0)
* Guide on how to repeat the analysis in the README.md in the code repository
* The code repository lacks a LICENSE file; however, the license is stated in README.md
* README.md contains a list of packages including version numbers, but the repository provides no computational environment
#### Final Feedback
##### Access
* How easy was it to gain access to the materials?
* All materials are accessible; code, paper and data (incl. submodules) could be downloaded without problems and are available under the open-access license CC-BY 4.0
* The code/data repositories lack a LICENSE file; however, the license is stated in README.md
* Did you manage to download all the files you needed?
* Yes, no problems
##### Installation
* How easy / automated was installation?
* (Docker was a little difficult to install and get up and running)
* README contains information on software stack incl. versions, but no further guidance/automation/package management
* Did you have any problems?
* It was unclear how to obtain the same computing environment
* How did you solve them?
* We created an renv environment inside a Docker image, which worked quite well
##### Data
* Were data clearly separated from code and other items?
* A really nice literate programming file, `analysis.qmd`, intertwines documentation, code and results
* The data is kept separately in a git submodule
* Were large data files deposited in a trustworthy data repository and referred to using a persistent identifier?
* For the analysis the data is included from a versioned git repository
* Furthermore, a Zenodo archive as well as a SPARQL endpoint exist: https://data.gesis.org/somesci/
* Were data documented ...somehow...
* Yes, quite well documented, both at the SPARQL endpoint and in the git repo
##### Analysis
* Were you able to fully reproduce the paper?
* The authors did a really good job and most plots could be reproduced completely identically
* Some very tiny differences compared to the paper could be found:
* Fig S1 (R) compared to Figure 3 (article) has a different order on the x-axis as well as a different label
* Fig S7 (R) states that the corresponding article figure is Figure 7, but it appears to be Figure 8; likewise for the other figures
* How automated was the process of reproducing the paper?
* Totally automated by just clicking on "render", great!
* If the analysis was not fully reproducible:
* Were there missing dependencies?
* Was the computational environment not adequately described / captured?
* Were there bugs in the code?
* Did code run but results (e.g. model outputs, tables, figures) differ to those published? By how much?
##### Documentation
Was there adequate documentation describing:
* how to install necessary software including non-standard dependencies?
* The required software incl. versions was stated, but there was no further guidance on how to set it up and no package management system
* how to use materials to reproduce the paper?
* Yes, a short reproduction guide was given
* how to cite the materials, ideally in a form that can be copy and pasted?
* No, there was no information on how to properly cite the code/data; no cff-file was provided
##### Transparency
* How easy was it to navigate the materials?
* The materials were easy to find and the small number of files was helpful; however, the submodule contains quite a lot of files that were not looked at in this review
* How easy was it to link analysis code to:
* the plots it generates
* This was great due to the literate programming approach
* sections in the manuscript in which it is described and results reported
* A reference was given in the plots. However, the reference points to the wrong figure number
## Room 3
https://www.reprohack.org/paper/75/
Paper: Measuring the impact of COVID-19 vaccine misinformation on vaccination intent in the UK and USA
Python/Jupyter Notebooks
Participants:
* Naiara
* Kshema Shaju
### Feedback
## Room 4
Same (originally proposed) R paper, but attempting to reproduce it in Python
https://www.reprohack.org/paper/86/
Participants:
* Florian Heinrich
* Lautaro Hickmann
* Elisabeth Schiessler
* Patrick
* Anja
### Feedback
General comments:
* (analyse.qmd lines 60-62) software type 'book' is changed to 'software_article' without comment
* We only managed part of the paper due to the time limit (up until Fig. 3)
* The CC-BY license is not suitable for code; also, there is no dedicated license file in the GitHub repo
* Translating from R to Python requires really understanding each step of the code, which was good, but it also requires good knowledge of both languages
Encountered errors:
* Images were missing; the R file needs to be re-rendered to generate them
* Grouping df_citation_types by software_citation_types (contained in df_tmp) yields different counts (analyse.qmd lines 98-100); the sum is correct, however
* The subsequent multinomial confidence interval estimator yields quite different results (lower and upper bounds differ even when using the counts from the original R file); this might be due to a version conflict or a different method being used in the background ('goodman' instead of 'sison-glaz')
* Figure 3 axis labels are different; so is the plotted data, due to the above error
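
One way to test the suspicion about the interval method is to compute one of the two methods by hand and compare. The sketch below is an illustration only, not the code written in Room 4: it implements Goodman's simultaneous confidence intervals with the Python standard library, and the counts are hypothetical placeholders rather than values from the paper. If the Python bounds match this output while the R bounds do not, the two sides are probably using different methods.

```python
from math import sqrt
from statistics import NormalDist


def goodman_intervals(counts, alpha=0.05):
    """Goodman (1965) simultaneous confidence intervals for multinomial proportions.

    Returns one (lower, upper) tuple per category.
    """
    k = len(counts)   # number of categories
    n = sum(counts)   # total number of observations
    # Chi-square quantile with 1 degree of freedom at level 1 - alpha/k,
    # obtained as the square of the matching standard normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    chi2 = z * z
    intervals = []
    for c in counts:
        delta = chi2 * (chi2 + 4 * c * (n - c) / n)
        lower = (chi2 + 2 * c - sqrt(delta)) / (2 * (n + chi2))
        upper = (chi2 + 2 * c + sqrt(delta)) / (2 * (n + chi2))
        intervals.append((lower, upper))
    return intervals


# Hypothetical counts, for illustration only (not taken from the paper).
example_counts = [120, 45, 30, 5]
for count, (low, high) in zip(example_counts, goodman_intervals(example_counts)):
    print(f"count={count:4d}  lower={low:.4f}  upper={high:.4f}")
```

Assuming the Python translation relied on statsmodels, note that `multinomial_proportions_confint` defaults to `method='goodman'`; `method='sison-glaz'` has to be passed explicitly to get the other estimator.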
## Room 0
Paper: The viewing angle in AGN SED models, a data-driven analysis (https://www.reprohack.org/paper/51/) Jupyter Notebooks/Binder
Participants:
* Feng (Room 0)
### Feedback
* Grid search cross-validation & 10-fold cross-validation of the model evaluation keep crashing, but the other notebooks (10/12) worked well
* A better internet connection is needed, or else the connection to the Binder server is easily lost
* Jupyter Notebook itself is not very good for running a time-consuming ML project using imaging datasets (personal feeling)
* Final figures in the notebooks I could run are all the same as in the original paper
* based on pre-saved models (so the result could differ if I had fully reproduced the 2 remaining notebooks)
* Question: all notebooks seem independent of each other...
* pros: errors can be located well during reproduction, easy for others to check step by step
* cons: the authors saved intermediate output already, which might cause confusion during reproduction...
---
* Fill in the author feedback form, documenting your experiences reproducing your chosen paper
---
# Feedback tips
Here are some tips on more specific aspects of the materials to focus on:
## Access
* How easy was it to gain access to the materials?
* We would like to propose adding the images to the repository or rendering the `analyses.qmd` file with the additional `self-contained: true` property. Otherwise one has to have R and rendering tools installed to see the results.
* Did you manage to download all the files you needed?
## Installation
* How easy / automated was installation?
* Did you have any problems?
* How did you solve them?
## Data
* Were data clearly separated from code and other items?
* Were large data files deposited in a trustworthy data repository and referred to using a persistent identifier?
* Were data documented ...somehow...
## Analysis
* Were you able to fully reproduce the paper?
* How automated was the process of reproducing the paper?
* If the analysis was not fully reproducible:
* Were there missing dependencies?
* Was the computational environment not adequately described / captured?
* Were there bugs in the code?
* Did code run but results (e.g. model outputs, tables, figures) differ to those published? By how much?
## Documentation
Was there adequate documentation describing:
* how to install necessary software including non-standard dependencies?
* how to use materials to reproduce the paper?
* how to cite the materials, ideally in a form that can be copy and pasted?
## Transparency
* How easy was it to navigate the materials?
* How easy was it to link analysis code to:
* the plots it generates
* sections in the manuscript in which it is described and results reported
---
# :raised_hand_with_fingers_splayed: 5-finger Feedback for this event
## :point_up: One thing that you enjoyed
_(put your comments here)_
* mixing R and Python and having to understand what is going on in the code
* learning about ReproHack Hub
* getting the chance to try Binder
* sticking to the schedule
## :point_up: One thing that can be improved
_(put your comments here)_
* evaluating 2-3 papers that are at a similar level of reproducibility could be nice
* the list of possible papers is a bit much to handle, not clear what is feasible without lots of extra software
* Prepare a software solution that enables better collaboration, e.g. a Jupyter notebook where everyone can write code.
* provide a choice of a few papers so that groups of 2-3 people are created
* [maybe Frank Krüger had even prepared something, but in general it would be good if the moderator (if available) looked at the task and at potentially problematic spots beforehand, so that the group does not fail because of small things (e.g. R->Python methods)]
* Maybe a few more references to the first part of the lecture would have been nice (e.g. like a checklist)
## :point_up: One thing that you did not like
_(put your comments here)_
* lack of information beforehand about which software is required (installation could then have been done in advance)
* getting stuck with the compiler installation
* not knowing the approximate hardware requirements/running time of the software
## :point_up: One thing that you would like to keep
* the possibility to choose our own topics / papers
* the gathertown / hackpad combination works well
* structure of the workshop works very well, with mid-point all-hands discussions
## :point_up: One thing that came up short
_(put your comments here)_
* understanding what the main point of the paper is (we did not manage to go through everything/no time for reading about the background)