Discussion notes on Reproducibility lesson

March 25 edition

§§Schedule

Enrico to keep screenshare during whole episode

09:50 - 10:00 Soft start and icebreaker question
- Page: collaborative notes document
- Give more space to the icebreaker and see what people are writing and talk about our own experiences
- Icebreaker:
  - Heard/read any good April fools stories today (or in other years)?
    - ..
  - Have you ever heard or said "It works on my computer"? What does this mean in practice? How did you solve it?
    - ..
    - ..
10:00 - 10:03 Collab document intro
- Lead: Enrico
10:03 - 10:05 Learning outcomes: https://coderefinery.github.io/reproducible-research/
- Lead: Enrico
- Ask Samantha how it all connects
10:05 - 10:10 Overview of CR and how it all fits together
- Lead: Samantha
- Page: https://coderefinery.github.io/reproducible-research/intro
10:10 - 10:20 Reproducible research, Motivation
- Page: https://coderefinery.github.io/reproducible-research/motivation/
- Lead: Enrico
- Discussion in notes:
  - What are your experiences re-running or adjusting a script or a figure you created a few months ago?
    - …
    - …
  - Have you continued working from a previous/another student's script/code/plot/notebook? What were the biggest challenges?
    - ..
    - ..
10:20 - 10:35 Organizing your projects
- Page: https://coderefinery.github.io/reproducible-research/organizing-projects/
- Lead: Enrico
- Discussion in notes doc (SW to put in collab doc during the session)
  - How do you collaborate on writing academic papers?
  - …
  - …
  - How do you handle collaborative issues e.g. conflicting changes?
  - …
  - …
10:35 - 10:55 Recording computational steps - (Enrico keep screenshare)
- Page: https://coderefinery.github.io/reproducible-research/workflow-management/
- Lead: Samantha
  - Start on wordcount repo
  - Ask Enrico on scripted version: Is this reproducible? What about adding more books/steps?
10:55 - 11:05 Real break
11:05 - 11:25 Recording dependencies
- Lead: Enrico
  - https://coderefinery.github.io/reproducible-research/dependencies/#exercises
    - ask https://coderefinery.github.io/reproducible-research/dependencies/#demo in notes and discuss
      - Which version do you expect to be easiest to re-run? Why?
      - What problems do you anticipate in each solution?
        
        A:
        
        …
        
        ..
        
        B:
        
        ..
        
        ..
        
        C:
        
        …
        
        ..
        
        D:
        
        ..
        
        ..
        
        E:
        
        ..
        
        ..
    - show difference between created env from env file vs exported env file on stream
11:25 - 11:30 ask in collaborative document
- Are you using any dependency and/or environment management tool in your work?
  - No: o
    - why not?
      - ..
      - ..
  - Yes: o
    - which?
      - ..
      - ..
- Have you heard about or been in contact with containers (docker, singularity, podman) in your work? How did you come across them?
  - No: o
  - Yes:
    - ..
    - ..
    - ..
11:30 - 11:50 Recording environments
- Lead: Samantha
- The first contact with containers is often: Take this and run this command and then when you need to share/build.
- Discuss setup issues, permissions if docker wants root, bandwidth, etc
- Pros and cons of containers
- Ask Enrico: Have you used containers? leading over to below.
  - first look at and discuss the definition file
  - build lolcow example
  - Rstudio
- Enrico can lead the demo of two pre-made containers e.g. expand the R studio optional exercise?
11:50 - 12:00 Wrapup
- Lead: Enrico
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short time changes of code, long time years changes of OS-s, libraries).
- Bring your code session advertisement (Wed, Apr 16, 9:00 - 11:00 CEST)
- Material + recording available
12:00 - long break starts

2024-03-18

After Enrico's container lesson updates: highlight some words for easier following -> SW
Enrico to update last exercise in container (ex 1 stays, R example, add Singularity tab)
make interaction explicit! Ask questions, mention name!

2024-03-15

2x 5 and one 10 min break? :check:
Some examples on project setup: https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project

Actions:

Samantha to update snakemake lesson, time walk through of typing snakemake exercise
Enrico to rethink Container exercise for demo format
Enrico to check and update dependencies lesson
All: check https://heidiseibold.ck.page/posts/setting-up-a-fair-and-reproducible-project and add suitable links to project setup :check:
All: think about collaborative doc questions for the small breaks

Plan for workflow episode (last parts can be left out if time runs out)

Use https://github.com/rkdarst/prompt-log/ !

Go directly to word count repository: https://github.com/coderefinery/word-count

Check the readme, show the python codes briefly

Clone the repository
Activate coderefinery conda environment
Run python codes for one book

Start running it for second, abort and use "run_all_loop.sh" script, looping through all books

Collaborative document: advantages over manual run? reproducible? Still good when more inputs/books?

Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that" we describe dependencies but we let the tool figure out the series of steps to produce results.

-> Example workflow tool: Snakemake

Snakemake is one of many tools to create reproducible and scalable data analysis workflows.
Workflows are described via a human readable, Python based language.
Snakemake workflows scale seamlessly from laptop to cluster or cloud, without the need to modify the workflow definition.

-> Look at snakefile in repo

DATA = glob_wildcards('data/{book}.txt').book -> Snakemake function; finds all book titles. DATA will be the list of values for the wildcard {book}.
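
To make the wildcard idea concrete on stream, here is a rough plain-Python imitation of what that line collects. This is only a sketch and not how Snakemake implements `glob_wildcards`: the helper name `wildcard_values`, the temporary directory and the file names in it are made up for illustration, and the result is sorted for readability.

```python
import tempfile
from pathlib import Path


def wildcard_values(directory, suffix=".txt"):
    """Collect the part of each file name that a {book} wildcard would match."""
    return sorted(
        path.name[: -len(suffix)] for path in Path(directory).glob(f"*{suffix}")
    )


# Demo with made-up files in a temporary directory (not the real data/ folder).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("isles.txt", "sierra.txt", "notes.md"):
        (Path(tmp) / name).write_text("placeholder text\n")
    books = wildcard_values(tmp)

print(books)  # ['isles', 'sierra']; notes.md does not match the pattern
```

The point to make: the list of books is derived from the files that exist, so adding a new text file extends the workflow without editing the Snakefile.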

We can see that Snakemake uses declarative style:
Snakefiles contain rules that relate targets (output) to dependencies
(input) and commands (shell).

-> How does it know what to run first? Start with "all" (imagine as a bag that collects all results (things go "in"(put) the bag)) and look what it depends on. Now search for rules that
have these as output. Look for their inputs and search where they
are produced. In other words, search backwards and build a graph of
dependencies. This is what Snakemake does.
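
The backwards search described above can be sketched in a few lines of Python. This is a toy illustration, not Snakemake's actual algorithm: the `RULES` table, the `plan` function and the intermediate file name `stats/sierra.data` are hypothetical, and the sketch assumes the rules contain no cycles.

```python
# Hypothetical rule table: each output maps to the inputs it is built from.
RULES = {
    "all": ["plot/sierra.png"],
    "plot/sierra.png": ["stats/sierra.data"],
    "stats/sierra.data": ["data/sierra.txt"],
}


def plan(target, rules, order=None):
    """Return the targets to build, dependencies first (assumes no cycles)."""
    if order is None:
        order = []
    if target not in rules or target in order:
        # Either a source file (no rule produces it) or already planned.
        return order
    for dependency in rules[target]:
        plan(dependency, rules, order)  # resolve what this target needs first
    order.append(target)
    return order


steps = plan("all", RULES)
print(steps)  # ['stats/sierra.data', 'plot/sierra.png', 'all']
```

Starting from "all" and walking towards the inputs gives the execution order for free: a step is only scheduled once everything it depends on has been scheduled.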

Let's first check if the snakefile is set up correctly: snakemake --lint. INDENTATIONS! input, output, shell

Let's run snakemake --delete-all-output --jobs 1
Look at what it says. Discuss.

Run again snakemake --jobs 1
Look at what it says. Discuss.
-> It can see that outputs are newer than inputs. It will only regenerate
outputs if they are not there or if the inputs or scripts have changed.
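
The "newer than inputs" check can be illustrated with file modification times. This is a simplified sketch under the assumption that only timestamps matter; Snakemake's real bookkeeping covers more than that (for example changed scripts, as noted above). The function name `needs_rebuild` and the file names are made up.

```python
import os
import tempfile
from pathlib import Path


def needs_rebuild(output, inputs):
    """True if `output` is missing or older than any of its `inputs`."""
    output = Path(output)
    if not output.exists():
        return True
    output_time = output.stat().st_mtime
    return any(Path(item).stat().st_mtime > output_time for item in inputs)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "book.txt"
    result = Path(tmp) / "book.data"
    source.write_text("some text\n")

    missing = needs_rebuild(result, [source])  # output does not exist yet

    result.write_text("counts\n")
    os.utime(source, (1_000, 1_000))  # pretend the input is old
    os.utime(result, (2_000, 2_000))  # and the output was written later
    fresh = needs_rebuild(result, [source])

    os.utime(source, (3_000, 3_000))  # input edited after the output was made
    stale = needs_rebuild(result, [source])

print(missing, fresh, stale)  # True False True
```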

Change something in plot.py (eg size of plot)
Run again snakemake --jobs 1
-> Observe how it only runs those steps again -> Power of workflow tools!
Check result with feh plot/sierra.png.

(leave this out if no time)
Add another book (eg vi data/new.txt with hello this is a new book. there is a lot of text here )
Run again snakemake --jobs 1
-> Observe how it only runs all steps for that new file again -> Power of workflow tools!

AWESOME!
It only generates steps and outputs that are missing or outdated.
**The workflow does not run everything every time.**

In other words if you notice a problem or update information "half way" in the analysis, it will only re-run what needs to be re-run. Nothing more, nothing less.
Other advantages: it can distribute tasks to multiple cores, off-load work to supercomputers, offer more fine-grained control over environments, and more.

(leave this out if no time)
Let's visualize the workflow: snakemake -j 1 --dag | dot -Tpng > dag.png -> feh dag.png

(leave this out if no time )
To lesson page: https://coderefinery.github.io/reproducible-research/workflow-management/#why-snakemake

-> Tools like Snakemake help us with reproducibility by supporting us with automation, scalability and portability of our workflows.

Last thing:
Other tools: https://coderefinery.github.io/reproducible-research/workflow-management/#similar-tools

2024-02

Notes:

only change compared to last year: no hands-on
more time for wrapup
exercises as demos
bring back the 4 different ways of doing same things: clicking, jupyter, bash without for loop, then with snakemake
- materials update: we need to test new snakemake
- install instructions: currently pins 7.22, updated to 8.x
- we have more time to talk about snakemake, rules,
container exercise could be more about showing what you can use containers for? (Container-2 extended)

Actions:

Samantha to update snakemake to new version, Teemu helped
Enrico to rethink Container exercise for demo format
Both to check if leading same sections as last time is ok

§§Schedule

09:50 - 10:00 Soft start and icebreaker question
- Page: collaborative notes document
- Give more space to the icebreaker and see what people are writing and talk about our own experiences
10:00 - 10:03 Enrico: Collab document intro (EG screenshare)
10:03 - 10:05 Enrico to do learning outcomes: https://coderefinery.github.io/reproducible-research/
- Ask Samantha how it all connects
10:05 - 10:10 Overview of CR and how it all fits together
- Lead: Samantha
- Page: https://coderefinery.github.io/reproducible-research/intro
- Heidi's image
- Learning outcomes from index
10:10 - 10:20 Reproducible research, Motivation
- Lead: Enrico (he still has the screen share)
- Exercise in notes doc with the discussions in bottom of motivation page (SW to put in collab doc during the session)
- Page: https://coderefinery.github.io/reproducible-research/motivation/
10:20 - 10:30 Organizing your projects
- Lead: Enrico
- Copy the discussion on the notes and if we have time we can highlight some answers
- Page: https://coderefinery.github.io/reproducible-research/organizing-projects/
10:30 - 10:35 ask in collab document and discuss
- https://coderefinery.github.io/reproducible-research/organizing-projects/#discussion-on-reproducibility
  - Are you using version control for academic papers?
    - …
    - …
  - How do you handle collaborative issues e.g. conflicting changes?
    - …
    - …
10:35 - 10:55 Recording computational steps - (SW to screenshare)
- Lead: Samantha
  - Page: https://coderefinery.github.io/reproducible-research/workflow-management/
  - Start on wordcount repo
  - Ask Enrico on scripted version: Is this reproducible? What about adding more books/steps?
10:55 - 11:05 Real break
11:05 - 11:25 Recording dependencies (EG to screenshare)
- Lead: Enrico
  - (important to tell that they were already doing it by setting up the coderefinery env) -> not applicable anymore, but many might already have come across it
  - https://coderefinery.github.io/reproducible-research/dependencies/#exercises
    - ask first one in collab doc and discuss on stream
    - show difference between created env from env file vs exported env file on stream
11:25 - 11:30 ask in collaborative document
- Are you using any dependency and/or environment management tool in your work?
  - No: o
    - why not?
      - ..
      - ..
  - Yes: o
    - which?
      - ..
      - ..
- Have you heard about or been in contact with containers (docker, singularity, podman) in your work? How did you come across them?
  - No: o
  - Yes:
    - ..
    - ..
    - ..
11:30 - 11:50 Recording environments (SW to screenshare)
- Lead: Samantha
- The first contact with containers is often: Take this and run this command and then when you need to share/build.
- Discuss setup issues, permissions if docker wants root, bandwidth, etc
- Pros and cons of containers
- Ask Enrico: Have you used containers? leading over to below.
  - first look at and discuss the definition file
  - build lolcow example
  - Rstudio
- Enrico can lead the demo of two pre-made containers e.g. expand the R studio optional exercise? (EG to screenshare)
11:50 - 12:00 Wrapup
- Lead: Enrico
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short time changes of code, long time years changes of OS-s, libraries).
- Bring your code session advertisement
- Material + recording available
12:00 - long break starts

2023-09

Timeline from: https://github.com/coderefinery/reproducible-research/issues/236

Enrico is on wifi so maybe no screen sharing.
Samantha will be on wired connection at CSC

08:50 - 09:00 Soft start and icebreaker question
- Page: collaborative notes document
- Let's see if we want a guest but not needed, we could give more space to the icebreaker and see what people are writing and talk about our own experiences
09:00 - 09:10 Overview of CR and how it all fits together
- Lead: Samantha
- Page: https://coderefinery.github.io/reproducible-research/intro -> Heidi's graphic and how CR lessons fit in there (WIP)
- Heidi's image
- Learning outcomes from index
09:10 - 09:20 Reproducible research, Motivation
- Lead: Enrico
- Exercise in notes doc with the discussions in bottom of motivation page (SW to put in collab doc during the session)
- Page: https://coderefinery.github.io/reproducible-research/motivation/
09:20 - 09:27 Organizing your projects
- Lead: Samantha
- Enrico can copy the discussion on the notes and if we have time we can highlight some answers
- Page: https://coderefinery.github.io/reproducible-research/organizing-projects/
09:27 - 09:35 Recording computational steps -
- Lead: Enrico
  - generic motivation on the time-scales of reproducibility (let's check if we had some generic text on the computational steps). Also, snakemake and bash/python/R/your_fav_language scripts could be mentioned before everything else as a generic overview of what is coming (this needs to be added to the doc and kept short)
  - kitchen analogy can be moved before the actual example
  - And then pass the lead to Samantha for the word count example. The discussion can be between us
- Page: https://coderefinery.github.io/reproducible-research/workflow-management/
- Intro to exercise: we do it in the stream. Lead by Samantha and show the exercise preparation together with the learners in the stream. Then discuss the 2 workflows and what should be done.
09:35 - 10:00 Snakemake exercise (25 min)
- Page: https://coderefinery.github.io/reproducible-research/workflow-management/#exercise
10:00 - 10:10 Break
10:10 - 10:15 Summary of workflows and the exercise
- Lead: Samantha
- Why use snakemake section
- If there is time the viz
10:15 - 10:30 Recording dependencies
- Lead: Enrico
  - Todo: Check if some content from the kitchen analogy should be moved also to previous
  - important to tell that they were already doing it by setting up the coderefinery env
- Exercise is for those who want to check materials later. If there is time Samantha can take the lead on this and Enrico can be the person who is trying to pick the answers
- If there is no time we say that it is a homework
10:30 - 10:40 Recording environments
- Lead: Samantha
- Enrico can ask if they already had contact with containers in notes doc
- The first contact with containers is often: Take this and run this command and then when you need to share/build. PR to add this before the definition files.
- Before the exercises it is important to mention why we don't actually build a full container (setup issues, permissions if docker wants root, bandwidth, etc)
- Pros and cons of containers
- Enrico can lead the exercise intro, mention already that this is the last bit of the first part and that later we have this and that.
10:40 - 11:00 Container-1 exercise (20 min)
- maybe instead of the exercise we can demo two pre-made containers e.g. expand the R studio optional exercise
11:00 - 11:0x Wrapup
- Samantha on the post exercise + comments
- Enrico can lead the wrap up "where to go from here"
- where to go from here: idea would be to give it more practical focus: what to do with these tools? Project level reproducibility. Time-scales of what changes (short time changes of code, long time years changes of OS-s, libraries).
11:05 - long break starts
