SORTEE Code Club Hackathon: Creating a Code Standard

# SORTEE Code Club Hackathon: Creating a Code Standard

###### tags: `sortee` `open-code` `reproducibility`

This is the collaborative notebook for **SORTEE's Code Club Hackathon: Creating a Code Standard** for the Society for Open, Reliable, and Transparent Ecology & Evolution (SORTEE).

>**Session #1:** Wed Oct 16, 06:00-07:55 UTC +00:00 - [slides](https://docs.google.com/presentation/d/1fSY_UCjT8Wz---Ultba62r_sItDC2qKkmnwcY78LEuY/edit?usp=sharing)
>**Session #2:** Tue Nov 19, 09:00-10:30 UTC +00:00 - [slides](https://docs.google.com/presentation/d/1LQDrWR2ATOEtmPBzyquWwC2W8kSbjo41gzU32qD_MEY/edit?usp=sharing)

**:information_source: Links to more information:**
> - [2024 SORTEE Conference](https://www.sortee.org/upcoming/)
> - [Become a SORTEE member](https://sortee.org/join)
> - [Code Club Agenda](https://docs.google.com/spreadsheets/d/1rOOOE7ghPduwtFftG0DJJf0DXVigAdcmQ0xdEwbKQXo/edit?usp=sharing)
> - [Debriefs of previous Code Clubs](https://www.sortee.org/tags/code-club)

:::warning
**:link: Link to Hackathon Github page**: [CodeStandard](https://github.com/SORTEE/CodeStandard)
:::

:::danger
:exclamation: Make sure you are familiar with [SORTEE's Code of Conduct](https://www.sortee.org/codeofconduct/) so this can be a safe and fun place to learn and discuss.
:::

:::info
:bulb: **Hint:** HackMD is great for sharing information in this kind of online set up, as the code formatting is nice & easy with [MarkDown](https://www.markdownguide.org/basic-syntax/)! Just add 3 ticks (``` ` ```) for the ``` code blocks ```. Otherwise, it's like a Google doc: it allows simultaneous editing.
There's a section for practice down there ⬇️
:::

**Table of Contents:**

[ToC]

## Hackathon outline

**Audience = Anyone with a Github account and experience in coding for scientific analysis is welcome to participate!**

:::info
🔗 You will receive the zoom link to participate in the Hackathon on [SORTEE's Slack](https://sortee.org/join)
:::

**Why a Code Standard?**
Publishing our code and data is an important Open, Reliable, and Transparent (ORT) practice to ensure the reproducibility of research. A field-specific Code Standard can facilitate the production and reviewing of code by offering an accessible, example-based way to implement ORT practices in your own coding.

**Which code will we work on?**
The selected code is for a simple ecology/evolution analysis, coming from the published paper: [Van Dis et al. (2023). Phenological mismatch affects individual fitness and population growth in the winter moth, Proc Roy Soc B, 290(2005)](https://doi.org/10.1098/rspb.2023.0414)

**How will we work?**
We will collaboratively edit a piece of R code in [this Github repository](https://github.com/SORTEE/CodeStandard). We will split up into six groups, with each group having a specific goal for code editing:
>1. Reported
>2. Run
>3. Reliable
>4. Reproducible
>5. Organisation & Structure
>6. Other (opinionated) considerations for public code sharing (following the [17-step code review checklist for Ecology and Evolution](https://forms.gle/7bmDwTg7DiFGZqPt5))

**How to get started?**
Join the main breakout room for your group on zoom (find your group [here](https://docs.google.com/spreadsheets/d/1U3LnAbkklFMbEmkUIWbzjq7RAtb_xPedyc2VvixNRDE/edit?usp=sharing)). Together, discuss what would constitute the "perfect" ORT piece of code considering your group's focus area (see above). Below you will find suggestions of tasks for each group to get started. Discuss, edit and supplement these tasks. When you agree on a task, self-organise who will work on this task.
The task owners can go to a side breakout room to start rewriting the code.

:::danger
:exclamation: When code editing, make sure to use **git** and frequently ```commit``` and ```push/pull``` to [the Github repo]() to keep track of your changes. **Make sure you are on the ```hackathon2``` branch**.
:::

## ICE BREAKER (practice HackMD editing)

Let's learn how to use this HackMD document by answering an ice breaker question! Somewhere on the screen (probably at the top), you should see three icons :pencil2: :desktop_computer: :eye: (a pencil, a two-column window, and an eye). You can add your answer by clicking the edit button :pencil2: (the pencil).

**Q: What is the word for _to code_ in your first/second/third language?**

_Answers:_
* coderen (Dutch)
* programar (Portuguese)
* codgozari (Persian/Farsi)
* coder (French)

## Collaboratively writing the "perfect" Open, Reliable, and Transparent (ORT) piece of code

We have breakout rooms standing ready in zoom: one main room for each group plus additional side breakout rooms for task completion. **Join the main breakout room for your group and start discussing!**

:::info
:bulb: **COPY-PASTE ME:**
- [ ] **TaskX: [description]**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_
:::

### Group 1: Reported

- [x] **Task1: Check if the code matches the methods as described in the corresponding article, metadata, and/or documentation. If not, edit**
    - _Considerations:_ Read the methods section of the paper and check along the code that the variables are correct and that the analysis is done as reported.
    - _Example code:_
- [x] **Task2: Add comments to code, if necessary**
    - _Considerations:_ Add comments to the code that make it easier to link to the reported methods/results in the paper.
    - _Example code:_
- [x] **Task3: Consistency in naming of variables**
    - _Considerations:_
    - _Ideas:_ Keep consistent names for variables from paper to code
    - _Example code:_ `glm1 <- glmer(Event ~ (MismTreat1 + MismTreat2)*PhotoTreat + (1|TubeID), family=binomial, data=d_surv, na.action="na.fail", control=glmerControl(calc.derivs=F))` uses `TubeID` instead of `MotherID`, despite the paper saying that "mother ID was included as a random effect"

### Group 2: Run

- [x] **Task1: Does the code run in its entirety and without issue? If not, edit**
    - _Considerations:_ Even if code matches the methods reported in the paper, this does not mean that the code is executable. Programmatic and syntactic errors can make the code fail to run.
    - _Ideas:_
    - _Example code:_
    - _Comment:_ `rmisc` was used in line 77, but not loaded beforehand. This caused an error if the user didn't have the package installed.
- [x] **Task2: Add an R project file to repo**
    - _Considerations:_ After download, the repo needs to be a standalone project, with no need to manually change the working directory
    - _Ideas:_
    - _Example code:_
    - _Comment:_ Added R Project at the root directory.
- [x] **Task3: Discuss/research and implement an easy way for a user to recreate the analysis R environment**
    - _Considerations:_ Make it as easy as possible for someone to run the analysis code (preferably with as few manual/user-specific steps as possible after downloading the repo)
    - _Ideas:_ write a script with ```if``` statements for installing packages with the right versions; or find out if there is an automatic way to replicate the environment when e.g. using ```renv``` package lock files
    - _Example code:_
    - _Comment_: ```renv``` is currently one of the best approaches to allow others to reproduce the analyses using the same R environment and requires no additional manual steps, unless necessary system dependencies are missing. Therefore we implemented this approach.
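
As a rough illustration of the first idea under Task3 (checking package versions with `if`-style logic), the sketch below compares the versions an analysis expects with the versions available locally. The helper name `check_package_versions()` and the versions shown are hypothetical and not taken from the repository. With `renv`, the same job is normally handled by `renv::snapshot()` (record versions in `renv.lock`) and `renv::restore()` (reinstall them).

```r
# Hypothetical helper (not part of the repository): compare the package
# versions an analysis expects with the versions available on this machine.
check_package_versions <- function(required, installed = NULL) {
  # `required`: named character vector, names = packages, values = versions.
  if (is.null(installed)) {
    # Default: versions of all packages found in the current library paths.
    installed <- utils::installed.packages()[, "Version"]
  }
  found <- installed[names(required)]  # NA where a package is not installed
  status <- ifelse(
    is.na(found), "missing",
    ifelse(found == required, "ok", "version differs")
  )
  data.frame(
    package = names(required),
    required = unname(required),
    installed = unname(found),
    status = unname(status),
    stringsAsFactors = FALSE
  )
}

# Illustrative versions only; the real ones belong in renv.lock.
required <- c(lme4 = "1.1-35", Rmisc = "1.5.1")
fake_library <- c(lme4 = "1.1-35", ggplot2 = "3.5.0")
print(check_package_versions(required, installed = fake_library))
```

Any row whose status is not `"ok"` flags a package that would need to be installed or changed before the analysis is run in the recorded environment.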
- [ ] **Task4: ~~Add data~~ Alternative: write script to download the data** ==> :exclamation:to be discussed
    - _Considerations:_ Dataset should be included alongside code if possible (?) <- BUT data management and technical considerations to *not* store data on git
    - _Ideas:_ Write R script using R package rdryad to automatically download and store the data + all data files listed in `.gitignore` file
    - _Example code:_
    - _Comment:_ We added the data file into the `_data` folder.
- [x] **Task 5: Keep all objects in the environment**
    - _Considerations:_ When objects are created and later deleted using `rm` in the same script, it can be confusing for users: it makes it hard to go back and look at objects from a previous step. Created objects should be preserved within a single script.
    - _Comment_: We removed the `rm` calls and made sure those variables weren't reused in a way that might cause any issues.

### Group 3: Reliable

- [ ] **Task1: Check if the columns of the data are correctly selected. If not, edit**
    - _Considerations:_ It is better to select variables explicitly (by name) than by index.
    - _Ideas:_
    - _Example code:_
        - good: `selected_data <- data[, c("size", "color")]`
        - bad: `selected_data <- data[, c(3, 7)]`
- [ ] **Task2: Avoid overwriting columns/data objects**
    - _Considerations:_ Avoid overwriting columns with the same name, especially in the case of `factor()` and the `labels` argument -> keep the data.frame as is after read-in and make a new, wrangled data.frame, so that the two can be compared and used for unit checks
    - _Ideas:_ Use a slightly different name and keep original data column names consistent, to ensure that the name pairings have not been incorrectly overwritten
    - _Example code:_ `mutate(Treatment=factor(Treatment, levels=c("ChangDay-4", "ChangDay-3", "ChangDay-2", "ChangDay-1", "ChangDay0", "ChangDay+1", "ChangDay+2", "ChangDay+3", "ChangDay+4", "ChangDay+5", "ConstDay-4", "ConstDay-2", "ConstDay0", "ConstDay+2", "ConstDay+4")), ...` could use `Treatment2` instead -> more descriptive name needed! e.g. `Treatment_relevelled`
- [ ] **Task3: Check that main decisions are easy to find. If not, add**
    - _Considerations:_ Important configuration values/parameters are stored in variables and these important variables appear at the top of the code/section
    - _Ideas:_ see "How-to guide for code sharing in biology paper" under Useful links below for an example
    - _Example code:_
- [ ] **Task4: Does the code include "unit tests" or other checks that verify the code is working as intended? If not, add**
    - _Considerations:_ For example, following a bit of data wrangling or transformation, is there code that checks whether the transformation has been accurately implemented?
    - _Ideas:_ If there is an error/mismatch in a unit test, break/stop the code + throw an error (should be a useful error message)
    - _Example code:_ Check if the variables have the right structure:
      ```
      # Rules describing the expected type of each column
      validation <- validate::validator(
        is.character(id),
        is.integer(X),
        is.factor(group)
      )
      # Apply the rules to the data
      data_check <- validate::confront(data, validation, raise = "all")
      validation_summary <- summary(data_check)
      # Stop the script if any rule fails or could not be evaluated
      if (any(validation_summary$fails > 0 | validation_summary$error > 0)) {
        stop("Error message")
      }
      ```
- [ ] **Task5: Add comments on what the desired output is of the code**
    - _Considerations:_ Also link to Reported? e.g. output=Table 1
    - _Ideas:_
    - _Example code:_
- [ ] **Task6: Implement html report?**
    - _Considerations:_ Having an HTML output or something equivalent (using e.g. Rmarkdown, Quarto) may be the simplest solution: that way you can directly see what you are supposed to obtain (including graphs, tables etc) --> Could make reliability checks easier e.g. for reviewer
    - _Ideas:_
    - _Example code:_ `rmarkdown::render()`
- [ ] **Task7: Add comments about what values etc to check for statistical analysis**
- [ ]

### Group 4: _**R**eproducible_

- [ ] **Task 1: Test and fine-tune `renv` implementation**
    - _Considerations:_
        - 1. Make sure `renv` installs the correct package versions!
          _How-to:_ Install packages with right version (code does not need to be included in repo?) and reinitialise `renv` / update `renv.lock` file. The package versions used in the original analysis are in the `_scr` folder
        - ~~2. Check that all necessary files have been included in the repo to easily recreate the analysis environment with `renv` after repo cloning (and make sure all *not* necessary `renv` files are in `.gitignore` file )~~
          _How-to:_ Clone repo and run the code
    - _Question_: how to work in another R version than the one installed on your machine?
      answer: using newer versions should not be a problem with renv; it might be a problem when using an older version than the one used to produce the code.
- [ ] **Task2: Are the results and conclusions reproducible from the code as provided?**
    - _Considerations:_ Run the code and check that the results (and figures) match the ones reported in the paper.
    - _Ideas:_ It will be useful if the code has comments linking the code to the paper results (figures, sections for example) (task 3)
    - _Example code:_
- [ ] **Task3: Does the code include clear documentation that details the rationale behind it? If not, add**
    - _Considerations:_ e.g., rationale behind data wrangling decisions and analytical approaches should be documented
    - _Ideas:_
    - _Example code:_
- [ ] **Task4: Is the whole workflow code/script-based? If not, write code for the manual manipulation parts**
    - _Considerations:_ The workflow should be self-contained (e.g., the code does not involve steps outside the script or pipeline, such as manual manipulation in other software like Excel).
    - **Hint**: Have a look at the Results section 3(a) in the paper, final paragraph: can you find the code to reproduce the percentages of fitness loss mentioned there?
    - _Ideas:_
    - _Example code:_
- [ ] **Task5: Use of multiple R versions:**
    - _Considerations:_ [rig](https://github.com/r-lib/rig) (the R Installation Manager) can work with multiple R versions. We can check whether it is worth implementing it for this code.
    - _Ideas:_ https://github.com/r-lib/rig
    - _Example code:_
- [ ] **Task6: Check the need to set a seed**
    - _Considerations:_ Check if the analysis has any random processes that need a seed to be set for reproducibility.
    - _Ideas:_
    - _Example code:_
- [ ] **Task7: Check the possibility of using Docker**
    - _Considerations:_ To have a fully reproducible environment we can use Docker, which captures the R version and packages but also the operating system. How to implement this for R still needs to be checked.
    - _Ideas:_
    - _Example code:_

### Group 5: _Organisation & Structure_

- [ ] **Task1: Edit Repository structure**
    - _Considerations:_ Is the repository well organized? If not, edit
    - _Ideas:_
    - _Example code:_
- [ ] **Task2: Should the script be split into multiple scripts, one per main task?**
    - _Considerations:_ Discuss if it is better to have a single script (long) with all the steps or if it is better to have multiple scripts where each one produces an output that is used in the next step.
    - _Ideas:_ If the script is split, then it could have the data preparation (`01_data_prep.R`), modeling (`02_modeling.R`) and summarizing results (`03_summarizing_results.R`) separately
    - _Example code:_
- [ ] **Task3: Add a consistent code style**
    - _Considerations:_ Good readability of code is very important in enabling effective code review. Change the code to have an easy-to-read and consistent style. As a suggestion, you can use the [tidyverse style](https://style.tidyverse.org/syntax.html).
    - _Ideas:_ Keep to one naming convention: snake_case vs. camelCase
- [ ] **Task4: Improve code readability**
    - _Considerations:_
    - _Ideas:_ Improve readability by sticking to right-hand line margins when possible and using line breaks. Additional line breaks allow more opportunities for comments/notes.
    - _Example code:_
      ```
      # assumes dplyr is loaded and `df` exists
      df |>
        mutate(
          name = "var",   # one argument per line leaves room for notes
          name2 = "var"
        )
      ```
- [ ] **Task5: Does the code use informative names for objects/variables?
If not, edit**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_
- [ ] **Task6: [description]**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_

### Group 6: _Other (opinionated) considerations for public code sharing_

- [ ] **Task1: Rewrite the code to increase efficiency**
    - _Considerations:_ For example, you can implement functional programming (e.g., use generalised, custom functions or for loops to repeat processes)
    - _Ideas:_
    - _Example code:_
- [ ] **Task2: Edit the code so that when calling functions, it explicitly calls the package namespace (i.e., `package::function()`)**
    - _Considerations:_ More transparent and clearer which functions came from which package
    - _Ideas:_
    - _Example code:_
- [ ] **Task3: Make sure the code contains no hard-coding (i.e. the code includes no sections that assign fixed values or data directly rather than using variables)**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_
- [ ] **Task4: Make sure documentation (metadata, README) includes clear citations to related materials like data and the article**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_
- [ ] **Task5: [description]**
    - _Considerations:_
    - _Ideas:_
    - _Example code:_

## Useful links

Add here any links to resources you think could be useful for the Hackathon.
- [4 R’s of Code Review paper](https://doi.org/10.1111/jeb.14230)
- [17-step code review checklist for Ecology and Evolution](https://forms.gle/7bmDwTg7DiFGZqPt5)
- [SORTEE Library of Code Mistakes](https://docs.google.com/presentation/d/12QN3WUc5v1Df7OArEox2U7l_N_qnHHuwzjCYiI4idC8/edit?usp=sharing)
- [Minimum standards for EE data & code paper](https://doi.org/10.1002/ece3.9961)
- [How-to guide for code sharing in biology paper](https://doi.org/10.1371/journal.pbio.3002815)
- [BES guide to reproducible code in ecology and evolution](https://www.britishecologicalsociety.org/wp-content/uploads/2019/06/BES-Guide-Reproducible-Code-2019.pdf)
- [SORTEE Open Code Primer](https://docs.google.com/document/d/1qXXNwCuGihsXOIcNxhTyeK8nqQivOoK40pHt4fdMflo/edit?usp=sharing) :exclamation:NB: work in progress
- [2024 SORTEE Conference Workshop "Reproducible research and collaboration in git + Github"](https://osf.io/5dzyg/)
- ....
- ....
- ....

## Other TO-DOs

:::success
**:sparkles: Become SORTEE Code Club Leader 2025 :sparkles:**
Are you interested in Open Science practices related to code and code review? Would you like to learn more? We are looking for you to lead Code Club in 2025! No prior experience needed, just a willingness to learn and invest some of your time.
>**Perks**: meeting lots of nice and like-minded people, the chance to develop your leadership skills, and planning Code Club sessions on topics that you’d like to learn more about.
>**Tasks**: scheduling Code Club meetings, planning topics and potential speakers, and writing debriefs.

:warning: Applications to volunteer for SORTEE in 2025 are closed, but please feel free to [contact the chairs of the committees](https://www.sortee.org/people/) you're interested in directly if you'd still like to apply!
:::

:::warning
**:raised_hands: Add to SORTEE's Library of Code Mistakes**
The [SORTEE Library of Code Mistakes](https://docs.google.com/presentation/d/12QN3WUc5v1Df7OArEox2U7l_N_qnHHuwzjCYiI4idC8/edit?usp=sharing) is open for editing!
>If you feel comfortable, please add your coding mistakes (and if possible how to recognize them). This way we can build a resource of (common) code mistakes that people can use during coding and code review. _NB: You can make your mistake as anonymous as you like._
:::

:::info
**:first_place_medal: Want to contribute to Code Peer Review in the future?**
Add your details to the [Find an ORTEE Code Reviewer list](https://docs.google.com/spreadsheets/d/1eHdU8o0psUj6Y4dPxqA-uW8Fc8SQVwzY1BnEbXM5k54/edit?usp=sharing)
:::

## Give feedback

Any feedback on this Hackathon or Code Club in general is welcome! Things you liked, things that could be improved, topics you would like to see in upcoming Code Clubs etc.

**Feedback Session #1:**
- Nicely organised hackathon! It substantially improved my familiarity with code review
- I did not know about HackMD and will start using it in my teaching. Thank you!
- The time for discussion at the end was a bit short, but looking forward for the code club meeting.
- it was a really nice idea for a hackathon :smile:

**Feedback Session #2:**
- ....
- ....