# Discussion notes on Reproducibility lesson

## 2024-03-18

- after Enrico's container lesson updates: highlight some words for easier following -> SW
- Enrico to update last exercise in container (ex 1 stays, R example, add Singularity tab)
- make interaction explicit! Ask questions, mention name!

## 2024-03-15

- 2x 5 and one 10 min break? :check:
- Some examples on project setup: https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project

Actions:

- Samantha to update snakemake lesson, time a walk-through of typing the snakemake exercise
- Enrico to rethink Container exercise for demo format
- Enrico to check and update dependencies lesson
- All: check https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project and add suitable links to project setup :check:
- All: think about collaborative doc questions for the small breaks

### Plan for workflow episode (last parts can be left out if time runs out)

Use https://github.com/rkdarst/prompt-log/ !

Go directly to word count repository: https://github.com/coderefinery/word-count

Check the README, show the Python code briefly

Clone the repository

Activate coderefinery conda environment

Run the Python code for one book

Start running it for a second book, then abort and use the "run_all_loop.sh" script, which loops through all books

- Collaborative document: advantages over manual run? reproducible? Still good when more inputs/books?

Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that", we describe dependencies and let the tool figure out the series of steps to produce results.

-> Example workflow tool: Snakemake

Snakemake is one of many tools to create reproducible and scalable data analysis workflows. Workflows are described via a human readable, Python based language. Snakemake workflows scale seamlessly from laptop to cluster or cloud, without the need to modify the workflow definition.
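
Possible aid for the imperative vs declarative point above: a minimal Python sketch, not taken from the word-count repository. The script paths (`code/count.py`, `code/plot.py`), their options, and the `statistics/` directory are assumptions for illustration; only `data/{book}.txt` and `plot/` appear in these notes.

```python
def imperative_commands(books):
    """Imperative style: spell out every command, in order, for every book."""
    commands = []
    for book in books:
        # Assumed command lines; the real scripts may take different options.
        commands.append(
            f"python code/count.py data/{book}.txt > statistics/{book}.data"
        )
        commands.append(
            f"python code/plot.py --data-file statistics/{book}.data "
            f"--plot-file plot/{book}.png"
        )
    return commands


# Declarative style: only state which output pattern depends on which inputs.
# The order of steps is not written down anywhere; a tool has to derive it.
DECLARATIVE_RULES = {
    "statistics/{book}.data": ["data/{book}.txt"],
    "plot/{book}.png": ["statistics/{book}.data"],
}


def expand(pattern, book):
    """Fill the {book} wildcard in a pattern with a concrete book title."""
    return pattern.replace("{book}", book)


if __name__ == "__main__":
    for command in imperative_commands(["isles", "sierra"]):
        print(command)
    for output, inputs in DECLARATIVE_RULES.items():
        print(expand(output, "isles"), "<-", [expand(i, "isles") for i in inputs])
```

The imperative list always contains every step for every book, which ties in with the collaborative document question about what happens when more inputs are added.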
-> Look at snakefile in repo

`DATA = glob_wildcards('data/{book}.txt').book`

-> snakemake function; finds all book titles, DATA will be the list of values for the wildcard {book}.

We can see that Snakemake uses **declarative style**: Snakefiles contain rules that relate targets (`output`) to dependencies (`input`) and commands (`shell`).

-> How does it know what to run first?

Start with "all" (imagine it as a bag that collects all results; things go "in"(put) the bag) and look at what it depends on. Now search for rules that have these as output. Look at their inputs and search for where they are produced. In other words, search backwards and build a graph of dependencies. This is what Snakemake does.

Let's first check if the snakefile is set up correctly: `snakemake --lint`. INDENTATIONS! input, output, shell

Let's run `snakemake --delete-all-output --jobs 1`

Look at what it says. Discuss.

Run again `snakemake --jobs 1`

Look at what it says. Discuss.

-> It can see that outputs are newer than inputs. It will only regenerate outputs if they are not there or if the inputs or scripts have changed.

Change something in plot.py (e.g. size of plot)

Run again `snakemake --jobs 1`

-> Observe how it only runs those steps again -> Power of workflow tools!

Check result with `feh plot/sierra.png`.

(leave this out if no time) Add another book (e.g. `vi data/new.txt` with `hello this is a new book. there is a lot of text here`)

Run again `snakemake --jobs 1`

-> Observe how it only runs all steps for that new file again -> Power of workflow tools! AWESOME!

It only generates steps and outputs that are missing or outdated. **The workflow does not run everything every time.** In other words, if you notice a problem or update information "half way" in the analysis, it will only re-run what needs to be re-run. Nothing more, nothing less.
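
If useful for the discussion, the backward search and the "only re-run what is outdated" idea can be sketched in a few lines of Python. This is a simplified illustration, not Snakemake's implementation: the dependency table `DEPS` and the `statistics/` file names are hypothetical, and the check only compares file modification times, while Snakemake also reacts to changed scripts.

```python
from pathlib import Path


def build_plan(target, deps, plan=None):
    """Search backwards from `target`; return targets in a valid build order."""
    if plan is None:
        plan = []
    for dep in deps.get(target, []):
        build_plan(dep, deps, plan)  # resolve what the target depends on first
    if target in deps and target not in plan:
        plan.append(target)  # plain source files have no rule and are skipped
    return plan


def is_outdated(target, inputs):
    """True if `target` is missing or older than at least one of its inputs."""
    target = Path(target)
    if not target.exists():
        return True
    built = target.stat().st_mtime
    return any(Path(src).stat().st_mtime > built for src in inputs)


# Hypothetical dependency table for two books; "all" is the bag of results.
DEPS = {
    "all": ["plot/isles.png", "plot/sierra.png"],
    "plot/isles.png": ["statistics/isles.data"],
    "plot/sierra.png": ["statistics/sierra.data"],
    "statistics/isles.data": ["data/isles.txt"],
    "statistics/sierra.data": ["data/sierra.txt"],
}

if __name__ == "__main__":
    for step in build_plan("all", DEPS):
        print(step)
```

A workflow tool would walk the plan and rebuild only the targets for which a check like `is_outdated` returns true, which is why touching one input re-runs only the steps downstream of it.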
Another advantage is that it can distribute tasks to multiple cores, off-load work to supercomputers, offer more fine-grained control over environments, and more.

(leave this out if no time) Let's visualize the workflow: `snakemake -j 1 --dag | dot -Tpng > dag.png`

-> `feh dag.png` (leave this out if no time)

To lesson page: https://coderefinery.github.io/reproducible-research/workflow-management/#why-snakemake

-> Tools like Snakemake help us with **reproducibility** by supporting us with **automation**, **scalability** and **portability** of our workflows.

Last thing: Other tools: https://coderefinery.github.io/reproducible-research/workflow-management/#similar-tools

## 2024-02

Notes:

- only change compared to last year: no hands-on
- more time for wrapup
- exercises as demos
- bring back the 4 different ways of doing the same thing: clicking, jupyter, bash without for loop, then with snakemake
- materials update: we need to test new snakemake
- install instructions: currently pins 7.22, updated to 8.x
- we have more time to talk about Snakemake, rules,
- container exercise could be more showing what you can use container?
  (Container-2 extended)

Actions:

- Samantha to update snakemake to new version, Teemu helped :heavy_check_mark:
- Enrico to rethink Container exercise for demo format
- Both to check if leading same sections as last time is ok

---

## Schedule

- 09:50 - 10:00 Soft start and icebreaker question
  - Page: collaborative notes document
  - Give more space to the icebreaker and see what people are writing and talk about our own experiences
- 10:00 - 10:03 Enrico: Collab document intro (EG screenshare)
- 10:03 - 10:05 Enrico to do learning outcomes: https://coderefinery.github.io/reproducible-research/
  - ask Samantha how it all connects
- 10:05 - 10:10 Overview of CR and how it all fits together
  - Lead: Samantha
  - Page: https://coderefinery.github.io/reproducible-research/intro
    - Heidi's image
    - Learning outcomes from index
- 10:10 - 10:20 Reproducible research, Motivation
  - Lead: Enrico (he still has the screen share)
  - Exercise in notes doc with the discussions in bottom of motivation page (SW to put in collab doc during the session)
  - Page: https://coderefinery.github.io/reproducible-research/motivation/
- 10:20 - 10:30 Organizing your projects
  - Lead: Enrico
  - Copy the discussion on the notes and if we have time we can highlight some answers
  - Page: https://coderefinery.github.io/reproducible-research/organizing-projects/
- 10:30 - 10:35 ask in collab document and discuss
  - https://coderefinery.github.io/reproducible-research/organizing-projects/#discussion-on-reproducibility
  - Are you using version control for academic papers?
    - ...
    - ...
  - How do you handle collaborative issues e.g. conflicting changes?
    - ...
    - ...
- 10:35 - 10:55 Recording computational steps
  - (SW to screenshare)
  - Lead: Samantha
  - Page: https://coderefinery.github.io/reproducible-research/workflow-management/
  - Start on wordcount repo
  - Ask Enrico on scripted version: Is this reproducible? What about adding more books/steps?
- 10:55 - 11:05 Real break
- 11:05 - 11:25 Recording dependencies (EG to screenshare)
  - Lead: Enrico
  - (important to tell that they were already doing it by setting up the coderefinery env) -> not applicable anymore, but many might already have come across it
  - https://coderefinery.github.io/reproducible-research/dependencies/#exercises
    - ask first one in collab doc and discuss on stream
    - show difference between created env from env file vs exported env file on stream
- 11:25 - 11:30 ask in collaborative document
  - Are you using any dependency and/or environment management tool in your work?
    - No: o
      - why not?
        - ..
        - ..
    - Yes: o
      - which?
        - ..
        - ..
  - Have you heard about or been in contact with containers (docker, singularity, podman) in your work? How did you come across them?
    - No: o
    - Yes:
      - ..
      - ..
      - ..
- 11:30 - 11:50 Recording environments (SW to screenshare)
  - Lead: Samantha
  - The first contact with containers is often: Take this and run this command and then when you need to share/build.
  - Discuss setup issues, permissions if docker wants root, bandwidth, etc
  - Pros and cons of containers
  - Ask Enrico: Have you used containers? leading over to below.
  - first look at and discuss the definition file
  - build lolcow example
  - RStudio
  - Enrico can lead the demo of two pre-made containers e.g. expand the RStudio optional exercise? (EG to screenshare)
- 11:50 - 12:00 Wrapup
  - Lead: Enrico
  - where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short-term changes of code, long-term changes over years of OSs, libraries).
  - Bring your code session advertisement
  - Material + recording available
- 12:00 - long break starts

## 2023-09

Timeline from: https://github.com/coderefinery/reproducible-research/issues/236

Enrico is on wifi so maybe no screen sharing.
Samantha will be on wired connection at CSC

- 08:50 - 09:00 Soft start and icebreaker question
  - Page: collaborative notes document
  - Let's see if we want a guest, but not needed; we could give more space to the icebreaker, see what people are writing, and talk about our own experiences
- 09:00 - 09:10 Overview of CR and how it all fits together
  - Lead: Samantha
  - Page: https://coderefinery.github.io/reproducible-research/intro -> Heidi's graphic and how CR lessons fit in there (WIP)
    - Heidi's image
    - Learning outcomes from index
- 09:10 - 09:20 Reproducible research, Motivation
  - Lead: Enrico
  - Exercise in notes doc with the discussions in bottom of motivation page (SW to put in collab doc during the session)
  - Page: https://coderefinery.github.io/reproducible-research/motivation/
- 09:20 - 09:27 Organizing your projects
  - Lead: Samantha
  - Enrico can copy the discussion on the notes and if we have time we can highlight some answers
  - Page: https://coderefinery.github.io/reproducible-research/organizing-projects/
- 09:27 - 09:35 Recording computational steps
  - Lead: Enrico
  - generic motivation on the time-scales of reproducibility (let's check if we had some generic text on the computational steps). Also snakemake and a bash/python/R/your_fav_language script could be mentioned before everything else as a generic overview of what is coming (this needs to be added to the doc and kept short)
  - kitchen analogy can be moved before the actual example
  - And then pass the lead to Samantha for the word count example. The discussion can be between us
  - Page: https://coderefinery.github.io/reproducible-research/workflow-management/
  - Intro to exercise: we do it in the stream. Led by Samantha; show the exercise preparation together with the learners in the stream. Then discuss the 2 workflows and what should be done.
- 09:35 - 10:00 Snakemake exercise (25 min)
  - Page: https://coderefinery.github.io/reproducible-research/workflow-management/#exercise
- 10:00 - 10:10 Break
- 10:10 - 10:15 Summary of workflows and the exercise
  - Lead: Samantha
  - Why use snakemake section
  - If there is time the viz
- 10:15 - 10:30 Recording dependencies
  - Lead: Enrico
  - Todo: Check if some content from the kitchen analogy should also be moved to the previous section
  - important to tell that they were already doing it by setting up the coderefinery env
  - Exercise is for those who want to check materials later. If there is time Samantha can take the lead on this and Enrico can be the person who is trying to pick the answers
  - If there is no time we say that it is homework
- 10:30 - 10:40 Recording environments
  - Lead: Samantha
  - Enrico can ask if they already had contact with containers in notes doc
  - The first contact with containers is often: Take this and run this command and then when you need to share/build. PR to add this before the definition files.
  - Before the exercises it is important to mention why we don't actually build a full container (setup issues, permissions if docker wants root, bandwidth, etc)
  - Pros and cons of containers
  - Enrico can lead the exercise intro, mention already that this is the last bit of the first part and that later we have this and that.
- 10:40 - 11:00 Container-1 exercise (20 min)
  - maybe instead of the exercise we can demo two pre-made containers e.g. expand the RStudio optional exercise
- 11:00 - 11:0x Wrapup
  - Samantha on the post exercise + comments
  - Enrico can lead the wrap up "where to go from here"
  - where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short-term changes of code, long-term changes over years of OSs, libraries).
- 11:05 - long break starts