Go directly to the word-count repository: https://github.com/coderefinery/word-count

- Check the README and briefly show the Python scripts
- Clone the repository
- Activate the coderefinery conda environment
- Run the Python scripts for one book
- Start running them for a second book, abort, and use the "run_all_loop.sh" script instead, which loops through all books
- Collaborative document: What are the advantages over running manually? Is it reproducible? Is it still good with more inputs/books?

Sometimes it may be helpful to go from imperative to declarative style. Rather than saying "do this and then that", we describe dependencies and let the tool figure out the series of steps needed to produce the results.

-> Example workflow tool: Snakemake

Snakemake is one of many tools to create reproducible and scalable data analysis workflows. Workflows are described via a human-readable, Python-based language. Snakemake workflows scale seamlessly from laptop to cluster or cloud, without the need to modify the workflow definition.

-> Look at the Snakefile in the repo

`DATA = glob_wildcards('data/{book}.txt').book`

-> `glob_wildcards` is a Snakemake function; it finds all book titles, so DATA will be the list of values for the wildcard {book}.

We can see that Snakemake uses **declarative style**: Snakefiles contain rules that relate targets (`output`) to dependencies (`input`) and commands (`shell`).

-> How does it know what to run first?

Start with "all" (imagine it as a bag that collects all results: things go "in"(put) the bag) and look at what it depends on. Now search for rules that have these as output. Look at their inputs and search for where they are produced. In other words, search backwards and build a graph of dependencies. This is what Snakemake does.

Let's first check whether the Snakefile is set up correctly: `snakemake --lint`. Mind the INDENTATION of `input`, `output`, `shell`!

Let's run `snakemake --delete-all-output --jobs 1`. Look at what it says. Discuss.

Run again: `snakemake --jobs 1`. Look at what it says. Discuss.

-> It can see that outputs are newer than inputs.
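(optional aside) The backward search from "all" described above can be illustrated with a small toy in plain Python. This is only a sketch of the idea and not how Snakemake is implemented; the `RULES` table, the `statistics/` paths, and the function name `execution_order` are hypothetical and chosen just for illustration.

```python
# Toy rule table (hypothetical paths): each target maps to the inputs it depends on.
RULES = {
    "all": ["plot/isles.png", "plot/sierra.png"],
    "plot/isles.png": ["statistics/isles.data"],
    "plot/sierra.png": ["statistics/sierra.data"],
    "statistics/isles.data": ["data/isles.txt"],
    "statistics/sierra.data": ["data/sierra.txt"],
}


def execution_order(target, rules, seen=None):
    """Walk backwards from `target` and return the targets in build order."""
    if seen is None:
        seen = set()
    if target in seen:
        return []
    seen.add(target)

    order = []
    # First resolve everything this target depends on ...
    for dependency in rules.get(target, []):
        order.extend(execution_order(dependency, rules, seen))
    # ... then the target itself, but only if some rule produces it.
    # Plain source files such as data/isles.txt have no rule and are skipped.
    if target in rules:
        order.append(target)
    return order


order = execution_order("all", RULES)
for step in order:
    print(step)
```

Starting from "all", the function visits dependencies first, so the word counts come before the plots and "all" comes last.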
Snakemake will only regenerate outputs if they are missing or if the inputs or scripts have changed.

- Change something in plot.py (e.g. the size of the plot)
- Run again: `snakemake --jobs 1`

-> Observe how it only runs the affected steps again -> Power of workflow tools!

- Check the result with `feh plot/sierra.png`. (leave this out if no time)
- Add another book (e.g. `vi data/new.txt` with `hello this is a new book. there is a lot of text here`)
- Run again: `snakemake --jobs 1`

-> Observe how it runs all steps, but only for that new file -> Power of workflow tools! AWESOME!

It only generates steps and outputs that are missing or outdated. **The workflow does not run everything every time.** In other words, if you notice a problem or update information "half way" through the analysis, it will only re-run what needs to be re-run. Nothing more, nothing less.

Another advantage is that it can distribute tasks to multiple cores, off-load work to supercomputers, offer more fine-grained control over environments, and more.

(leave this out if no time) Let's visualize the workflow: `snakemake -j 1 --dag | dot -Tpng > dag.png` -> `feh dag.png`

(leave this out if no time) To lesson page: https://coderefinery.github.io/reproducible-research/workflow-management/#why-snakemake

-> Tools like Snakemake help us with **reproducibility** by supporting us with **automation**, **scalability** and **portability** of our workflows.
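(optional aside) For the discussion of why only missing or outdated outputs are regenerated, the following Python sketch shows a simplified timestamp check. It assumes that "outdated" means "older than at least one input", which is a simplification and not a description of Snakemake's actual decision logic. The helper name `needs_rebuild` and the temporary files are hypothetical.

```python
import os
import tempfile


def needs_rebuild(output, inputs):
    """Return True if `output` is missing or older than any of its inputs."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > output_mtime for path in inputs)


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "plot.py")
    result = os.path.join(tmp, "sierra.png")

    # Only the script exists: the output is missing and must be built.
    open(script, "w").close()
    missing = needs_rebuild(result, [script])

    # Output exists and is newer than the script: nothing to do.
    open(result, "w").close()
    os.utime(script, (100, 100))  # set explicit timestamps for a deterministic demo
    os.utime(result, (200, 200))
    fresh = needs_rebuild(result, [script])

    # The script was "edited" after the output was made: rebuild.
    os.utime(script, (300, 300))
    stale = needs_rebuild(result, [script])

print(missing, fresh, stale)
```

The three cases mirror the demo: a first run, a second run with nothing to do, and a run after editing plot.py.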