Discussion notes on Reproducibility lesson
March 25 edition
§§Schedule
Enrico to keep screenshare during whole episode
-
09:50 - 10:00 Soft start and icebreaker question
- Page: collaborative notes document
- Give more space to the icebreaker and see what people are writing and talk about our own experiences
- Icebreaker:
- Heard/read any good April Fools stories today (or in other years)?
- Have you ever heard or said "It works on my computer"? What does this mean in practice? How did you solve it?
-
10:00 - 10:03 Collab document intro
-
10:03 - 10:05 Learning outcomes: https://coderefinery.github.io/reproducible-research/
- Lead: Enrico
- ask Samantha how it all connects
-
10:05 - 10:10 Overview of CR and how it all fits together
-
10:10 - 10:20 Reproducible research, Motivation
- Page: https://coderefinery.github.io/reproducible-research/motivation/
- Lead: Enrico
- Discussion in notes:
- What are your experiences re-running or adjusting a script or a figure you created a few months ago?
- Have you continued working from a previous/another student's script/code/plot/notebook? What were the biggest challenges?
- ..
- ..
-
10:20 - 10:35 Organizing your projects
-
10:35 - 10:55 Recording computational steps - (Enrico keep screenshare)
-
10:55 - 11:05 Real break
-
11:05 - 11:25 Recording dependencies
-
11:25 - 11:30 ask in collaborative document
- Are you using any dependency and/or environment management tool in your work?
- Have you heard about or been in contact with containers (docker, singularity, podman) in your work? How did you come across them?
-
11:30 - 11:50 Recording environments
- Lead: Samantha
- The first contact with containers is often "take this and run this command"; building and sharing come later, when you need them.
- Discuss setup issues, permissions if docker wants root, bandwidth, etc
- Pros and cons of containers
- Ask Enrico: Have you used containers? leading over to below.
- first look at and discuss the definition file
- build lolcow example
- RStudio
- Enrico can lead the demo of two pre-made containers, e.g. expand the RStudio optional exercise?
-
11:50 - 12:00 Wrapup
- Lead: Enrico
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short-term changes of code, long-term changes over years of OSes and libraries).
- Bring your code session advertisement (Wed, Apr 16, 9:00 - 11:00 CEST)
- Material + recording available
-
12:00 - long break starts
2024-03-18
- after Enrico's container lesson updates: highlight some words to make them easier to follow -> SW
- Enrico to update last exercise in container (ex 1 stays, R example, add Singularity tab)
- make interaction explicit! Ask questions, mention name!
2024-03-15
Actions:
- Samantha to update the Snakemake lesson and time a walk-through of typing out the Snakemake exercise
- Enrico to rethink Container exercise for demo format
- Enrico to check and update dependencies lesson
- All: check https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project and add suitable links to project setup :check:
- All: think about collaborative doc questions for the small breaks
Plan for workflow episode (last parts can be left out if time runs out)
Use https://github.com/rkdarst/prompt-log/ !
Go directly to word count repository: https://github.com/coderefinery/word-count
Check the README, show the Python scripts briefly
Clone the repository
Activate coderefinery conda environment
Run the Python scripts for one book
Start running them for a second book, abort, and use the "run_all_loop.sh" script, which loops through all books
- Collaborative document: advantages over manual run? reproducible? Still good when more inputs/books?
Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that" we describe dependencies but we let the tool figure out the series of steps to produce results.
-> Example workflow tool: Snakemake
Snakemake is one of many tools to create reproducible and scalable data analysis workflows.
Workflows are described via a human readable, Python based language.
Snakemake workflows scale seamlessly from laptop to cluster or cloud, without the need to modify the workflow definition.
-> Look at snakefile in repo
DATA = glob_wildcards('data/{book}.txt').book
-> Snakemake function; finds all book titles. DATA will be the list of values for the wildcard {book}.
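A rough idea of what glob_wildcards does for this pattern, as a plain-Python sketch. This is not Snakemake's implementation; find_books and demo_find_books are made-up names, and the file names in the demo are placeholders.

```python
import re
import tempfile
from pathlib import Path


def find_books(data_dir):
    """Return the {book} part of every file named {book}.txt in data_dir."""
    pattern = re.compile(r"^(?P<book>.+)\.txt$")
    books = []
    for path in sorted(Path(data_dir).iterdir()):
        match = pattern.match(path.name)
        if path.is_file() and match:
            books.append(match.group("book"))
    return books


def demo_find_books():
    """Create a throwaway data directory (placeholder files) and list its books."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("abyss.txt", "isles.txt", "notes.md"):
            (Path(tmp) / name).write_text("some text\n")
        return find_books(tmp)


if __name__ == "__main__":
    print(demo_find_books())
```

Files that do not match the pattern (here notes.md) are simply ignored, which is why adding a new .txt file later is enough to extend the workflow.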
We can see that Snakemake uses declarative style:
Snakefiles contain rules that relate targets (output) to dependencies (input) and commands (shell).
-> How does it know what to run first? Start with "all" (imagine it as a bag that collects all results: things go "in"(put) the bag) and look at what it depends on. Now search for rules that have these as output. Look at their inputs and search for where they are produced. In other words, search backwards and build a graph of dependencies. This is what Snakemake does.
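The backward search can be sketched in plain Python. This is a toy illustration, not how Snakemake is implemented: the rule names (count_words, make_plot) and file paths are assumptions for the example, and "all" is modelled with a pseudo-output so the same lookup works for it.

```python
# Hypothetical rules: each one maps inputs (dependencies) to outputs (targets).
EXAMPLE_RULES = {
    "all": {"input": ["plot/isles.png"], "output": ["all"]},
    "make_plot": {"input": ["statistics/isles.data"], "output": ["plot/isles.png"]},
    "count_words": {"input": ["data/isles.txt"], "output": ["statistics/isles.data"]},
}


def build_order(target, rules):
    """Walk backwards from target and return rule names in the order to run them."""
    # Which rule produces which file?
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    order = []
    visited = set()

    def visit(item):
        name = producers.get(item)
        if name is None or name in visited:
            # Either a source file that no rule produces, or already handled.
            return
        visited.add(name)
        for dependency in rules[name]["input"]:
            visit(dependency)
        # All dependencies are scheduled, so this rule can come next.
        order.append(name)

    visit(target)
    return order


if __name__ == "__main__":
    print(build_order("all", EXAMPLE_RULES))
```

The search starts at "all" but the resulting run order is reversed: the rule closest to the raw data runs first.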
Let's first check if the Snakefile is set up correctly: snakemake --lint. INDENTATIONS! input, output, shell
Let's run snakemake --delete-all-output --jobs 1
Look at what it says. Discuss.
Run again snakemake --jobs 1
Look at what it says. Discuss.
-> It can see that outputs are newer than inputs. It will only regenerate
outputs if they are not there or if the inputs or scripts have changed.
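A minimal sketch of the timestamp part of this idea, assuming only modification times matter. Snakemake's actual rerun logic considers more than timestamps; needs_rerun and the demo file names are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def needs_rerun(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    outputs = [Path(p) for p in outputs]
    if any(not p.exists() for p in outputs):
        return True
    newest_input = max(Path(p).stat().st_mtime for p in inputs)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return newest_input > oldest_output


def demo_needs_rerun():
    """Check three situations: output missing, output up to date, input changed."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "plot.py"
        figure = Path(tmp) / "figure.png"
        script.write_text("# plotting code\n")
        missing = needs_rerun([script], [figure])

        figure.write_text("fake image\n")
        # Set explicit timestamps so the demo does not depend on clock resolution.
        os.utime(script, (1000, 1000))
        os.utime(figure, (2000, 2000))
        fresh = needs_rerun([script], [figure])

        os.utime(script, (3000, 3000))  # pretend the script was edited later
        stale = needs_rerun([script], [figure])
        return missing, fresh, stale


if __name__ == "__main__":
    print(demo_needs_rerun())
```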
Change something in plot.py (e.g. size of plot)
Run again snakemake --jobs 1
-> Observe how it only re-runs the affected steps -> Power of workflow tools!
Check result with feh plot/sierra.png.
(leave this out if no time)
Add another book (e.g. vi data/new.txt with the text "hello this is a new book. there is a lot of text here")
Run again snakemake --jobs 1
-> Observe how it runs all steps only for that new file -> Power of workflow tools!
AWESOME!
It only generates steps and outputs that are missing or outdated.
**The workflow does not run everything every time.**
In other words if you notice a problem or update information "half way" in the analysis, it will only re-run what needs to be re-run. Nothing more, nothing less.
Another advantage is that it can distribute tasks to multiple cores, off-load work to supercomputers, offer more fine-grained control over environments, and more.
(leave this out if no time)
Let's visualize the workflow: snakemake -j 1 --dag | dot -Tpng > dag.png
-> feh dag.png
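What gets piped into dot is just text describing nodes and edges. A small sketch of producing such DOT text from a rule table; the rule names and paths are hypothetical, and the output is not meant to match what snakemake --dag prints.

```python
# Hypothetical rules, same shape as a very simplified Snakefile.
EXAMPLE_RULES = {
    "all": {"input": ["plot/isles.png"], "output": ["all"]},
    "make_plot": {"input": ["statistics/isles.data"], "output": ["plot/isles.png"]},
    "count_words": {"input": ["data/isles.txt"], "output": ["statistics/isles.data"]},
}


def to_dot(rules):
    """Render rule dependencies as Graphviz DOT text (producer -> consumer)."""
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    lines = ["digraph workflow {"]
    for name in rules:
        lines.append(f'    "{name}";')
    for name, rule in rules.items():
        for dependency in rule["input"]:
            upstream = producers.get(dependency)
            if upstream is not None:  # skip raw inputs that no rule produces
                lines.append(f'    "{upstream}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(EXAMPLE_RULES))
```

Saving the printed text to a file and running it through dot -Tpng gives a picture of the same kind as dag.png.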
(leave this out if no time )
To lesson page: https://coderefinery.github.io/reproducible-research/workflow-management/#why-snakemake
-> Tools like Snakemake help us with reproducibility by supporting us with automation, scalability and portability of our workflows.
Last thing:
Other tools: https://coderefinery.github.io/reproducible-research/workflow-management/#similar-tools
2024-02
Notes:
- only change compared to last year: no hands-on
- more time for wrapup
- exercises as demos
- bring back the 4 different ways of doing the same thing: clicking, jupyter, bash without for loop, then with snakemake
- materials update: we need to test new snakemake
- install instructions: currently pins 7.22, updated to 8.x
- we have more time to talk about Snakemake, rules,
- container exercise could be more about showing what you can use containers for? (Container-2 extended)
Actions:
- Samantha to update snakemake to new version, Teemu helped
- Enrico to rethink Container exercise for demo format
- Both to check if leading same sections as last time is ok
§§Schedule
-
09:50 - 10:00 Soft start and icebreaker question
- Page: collaborative notes document
- Give more space to the icebreaker and see what people are writing and talk about our own experiences
-
10:00 - 10:03 Enrico: Collab document intro (EG screenshare)
-
10:03 - 10:05 Enrico to do learning outcomes: https://coderefinery.github.io/reproducible-research/
- ask Samantha how it all connects
-
10:05 - 10:10 Overview of CR and how it all fits together
-
10:10 - 10:20 Reproducible research, Motivation
-
10:20 - 10:30 Organizing your projects
-
10:30 - 10:35 ask in collab document and discuss
-
10:35 - 10:55 Recording computational steps - (SW to screenshare)
-
10:55 - 11:05 Real break
-
11:05 - 11:25 Recording dependencies (EG to screenshare)
-
11:25 - 11:30 ask in collaborative document
- Are you using any dependency and/or environment management tool in your work?
- Have you heard about or been in contact with containers (docker, singularity, podman) in your work? How did you come across them?
-
11:30 - 11:50 Recording environments (SW to screenshare)
- Lead: Samantha
- The first contact with containers is often "take this and run this command"; building and sharing come later, when you need them.
- Discuss setup issues, permissions if docker wants root, bandwidth, etc
- Pros and cons of containers
- Ask Enrico: Have you used containers? leading over to below.
- first look at and discuss the definition file
- build lolcow example
- RStudio
- Enrico can lead the demo of two pre-made containers, e.g. expand the RStudio optional exercise? (EG to screenshare)
-
11:50 - 12:00 Wrapup
- Lead: Enrico
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short-term changes of code, long-term changes over years of OSes and libraries).
- Bring your code session advertisement
- Material + recording available
-
12:00 - long break starts
2023-09
Timeline from: https://github.com/coderefinery/reproducible-research/issues/236
Enrico is on wifi so maybe no screen sharing.
Samantha will be on wired connection at CSC
- 08:50 - 09:00 Soft start and icebreaker question
- Page: collaborative notes document
- Let's see if we want a guest but not needed, we could give more space to the icebreaker and see what people are writing and talk about our own experiences
- 09:00 - 09:10 Overview of CR and how it all fits together
- 09:10 - 09:20 Reproducible research, Motivation
- 09:20 - 09:27 Organizing your projects
- 09:27 - 09:35 Recording computational steps -
- Lead: Enrico
- generic motivation on the time-scales of reproducibility (let's check if we had some generic text on the computational steps). Also, Snakemake and bash/python/R/your_fav_language scripts could be mentioned before the actual content as a generic overview of what is coming (this needs to be added to the doc and kept short)
- kitchen analogy can be moved before the actual example
- And then pass the lead to Samantha for the word count example. The discussion can be between us
- Page: https://coderefinery.github.io/reproducible-research/workflow-management/
- Intro to exercise: we do it in the stream. Led by Samantha, who shows the exercise preparation together with the learners in the stream. Then discuss the 2 workflows and what should be done.
-
09:35 - 10:00 Snakemake exercise (25 min)
- 10:00 - 10:10 Break
- 10:10 - 10:15 Summary of workflows and the exercise
- Lead: Samantha
- Why use snakemake section
- If there is time the viz
- 10:15 - 10:30 Recording dependencies
- Lead: Enrico
- Todo: Check if some content from the kitchen analogy should be moved also to previous
- important to mention that they were already doing it by setting up the coderefinery env
- Exercise is for those who want to check materials later. If there is time Samantha can take the lead on this and Enrico can be the person who is trying to pick the answers
- If there is no time we say that it is a homework
- 10:30 - 10:40 Recording environments
- Lead: Samantha
- Enrico can ask in the notes doc if they have already had contact with containers
- The first contact with containers is often "take this and run this command"; building and sharing come later, when you need them. PR to add this before the definition files.
- Before the exercises it is important to mention why we don't actually build a full container (setup issues, permissions if docker wants root, bandwidth, etc)
- Pros and cons of containers
- Enrico can lead the exercise intro, mention already that this is the last bit of the first part and that later we have this and that.
- 10:40 - 11:00 Container-1 exercise (20 min)
- maybe instead of the exercise we can demo two pre-made containers, e.g. expand the RStudio optional exercise
- 11:00 - 11:0x Wrapup
- Samantha on the post exercise + comments
- Enrico can lead the wrap up "where to go from here"
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short-term changes of code, long-term changes over years of OSes and libraries).
- 11:05 - long break starts