owned this note
owned this note
Published
Linked with GitHub
---
tags: ggg, ggg2020,ggg298
---
[toc]
# GGG 298, March 2020 - Week 9, integrating all the things
March 4, 2020
Titus Brown
Learning goals:
* review the previous 8 lessons by building something that puts them all together!
## Today's plan!
Today we're going to try to integrate across many of the things you've seen this quarter. This is the first time I'm trying anything like this, so please be kind :).
What we're going to do is try to start a _new_ project from scratch, knowing just the commands to run. And then we're going to install stuff with conda, develop a snakemake workflow, make an rmarkdown report, and push it all to github.
What I'd like to try is breaking things down into 10-20 minute chunks during which you'll put something together. After each chunk, I'll walk through a solution, post it for you all, and then move on to the next chunk.
Importantly, you absolutely don't have to complete each chunk! I'd like you to try, but having a set of questions and thoughts is perfectly fine! And you can definitely work with others; you might try pair- or group-programming.
## The workflow!
Scenario:
> Sandra has received a collection of bacterial genomes and wants to see how similar they are at a nucleotide level. To do this, she decides to use the [sourmash tool](http://sourmash.rtfd.io/) to compare them and create a figure showing their similarity.
Briefly, sourmash is a tool that compares genomes at a k-mer level. First, it breaks genomes down into k-mers, then it calculates the Jaccard similarity between the genomes, and then it plots the similarities in various ways.
### A. Computing signatures
Starting from five genome files,
```
ls -lh *.fa.gz
```
>-r--r--r--@ 1 t staff 1.6M Mar 3 17:02 1.fa.gz
-r--r--r--@ 1 t staff 566K Mar 3 17:02 2.fa.gz
-r--r--r--@ 1 t staff 548K Mar 3 17:02 3.fa.gz
-r--r--r--@ 1 t staff 565K Mar 3 17:02 4.fa.gz
-r--r--r--@ 1 t staff 1.6M Mar 3 17:02 5.fa.gz
we first compute genome signatures for each of them, using a k-mer size of 31. Each of these commands takes as input a `.fa.gz` file, and outputs a `.fa.gz.sig` file.
```
sourmash compute -k 31 1.fa.gz
sourmash compute -k 31 2.fa.gz
sourmash compute -k 31 3.fa.gz
sourmash compute -k 31 4.fa.gz
sourmash compute -k 31 5.fa.gz
```
### B. Comparing the signatures
Next we want to compare all five of these signature files.
The following command takes as input multiple `.sig` files and outputs a comparison results file named `all.cmp`. We can optionally produce a CSV matrix file using `--csv all.csv`.
```
sourmash compare *.sig -o all.cmp
```
### C. Plotting the comparison
Finally, we want to plot the comparison.
This final command takes as input the files `all.cmp` and `all.cmp.labels.txt`, and outputs three files,`all.cmp.hist.png`, `all.cmp.dendro.png`, and `all.cmp.matrix.png`. Use the `--labels` option to put labels on the matrix figure.
```
sourmash plot --labels all.cmp
```
This produces the following figure showing that genomes 1 and 5 are similar, genomes 3 and 4 are similar, and genome 2 is not similar to either of the other sets.
![figure](https://github.com/ngs-docs/2020-GGG298/raw/master/Week9-Sourmash_project/all.cmp.matrix.png)
## The tasks!
Please log into farm first :)
### 1. Conda for software installation
#### 1.1 First, use conda to install sourmash and snakemake.
We'll need to have both `sourmash` and `snakemake-minimal` installed. Let's use conda to do it!
**Task:** Create a conda environment containing the latest version of sourmash and snakemake-minimal, installed from bioconda.
:::info
[Pointer to the relevant section from the conda lesson in week 3](https://github.com/ngs-docs/2020-GGG298/tree/master/Week3-conda_for_software_installation#installation)
:::
**Check**: `conda env list` should show that your new environment exists.
**One solution:**
Here's a command to create an environment named `smash` with the packages `sourmash` and `snakemake-minimal`:
```
conda create -y -n smash sourmash snakemake-minimal
```
The `-y` is for "just say yes".
Some commands in the tutorial may have `-c` present, that is needed to specify the "channels" which elsewhere in this class you've added using `conda channel add`.
#### 1.2 Second, activate your conda environment
**Task:** Activate your conda environment.
**Check:** Running `sourmash` should show something other than "command not found".
**One solution:**
```
conda activate smash
```
Questions:
Q: if I already have snakemake in an environment and I want to add sourmash, how would I do that?
A: activate the environment, and run `conda install -y sourmash` to install sourmash in the current environment. You can also run `conda install -n smash -y snakemake` to install snakemake in a non-active environment.
#### Stretch: Create a conda environment.yml that you can use `conda create -f` with
:::info
[See link to section in conda tutorial](https://github.com/ngs-docs/2020-GGG298/tree/master/Week3-conda_for_software_installation#making-and-using-environment-files)
:::
Solution:
* use `conda env export > export.yml` in your environment to create a file named export.yml
* edit out all of the dependency lines in that file that aren't sourmash or snakemake
* then you can use `conda env create -f export.yml -n <new name>` to create a new environment from that file.
### 2. Create a github repository to hold your files.
#### 2.1 Create a repository on github.com/
Let's create a repository on github!
:::info
[Link to relevant section from git/github lesson](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#create-a-git-repository-on-github)
:::
**Task:** Create a repository on github.com/.
**Check:** There is a repository with 'sourmash' in the name at `http://github.com/<your account>/`
**One solution:**
* go to github
* click the little plus on upper right
* select "new repository"
* enter some name (2020-ggg298-class9)
* check "create README" box
* hit "create"
* voila, https://github.com/ctb/2020-ggg298-class9
#### 2.2 Clone your repository from github onto farm.
Let's make a local copy (clone) of the repository from github on farm! This will be the directory we work in.
:::info
[See the relevant section from git/github lesson](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#clone-the-repository)
:::
**Task:** Clone the repository to a local directory on farm and change into that directory.
**Check:** On farm, there is a directory named the same thing as your repository on github.com.
**One solution:**
Run
```
git clone http://github.com/<account>/<repo name>
cd <repo name>
```
### Alternative/weird situations
If you want to check if you are in a git repo, you can do
```
ls -ld .git
```
in the top level directory of what _should_ be the repo.
If you created the directory without cloning it from github, you can try:
```
cd <not-a-git-repo>/
git init
git remote add origin http://github.com/<account>/<repo name>
git pull origin master
```
a less weird approach would be:
```
mv <not-a-git-repo> <somedir>.bak #here, somedir.bak is a renamed directory of the not-a-git-repo
git clone http://github.com/<account>/<repo name>
cd <repo name>
cp ../somedir.bak/Snakefile .
```
### 3. Create an automated workflow!
Let's automate this set of commands using snakemake!
:::info
Remember to use `nano -ET4 Snakefile` when editing the Snakefile! This will set tabs to be four spaces, instead. And <kbd>ctrl-x y ENTER</kbd> saves files!
:::
Reminder: rules look like this,
```python=
rule rulename:
input: "input_file1", "input_file2"
output: "output_file1", "output_file2"
shell: "shell_command goes here"
```
And you can use `{input}` and `{output}` in the shell command.
#### 3.1 Create snakemake rules to download the genome files.
Create a rule to run the following commands to download the data [from this OSF](https://osf.io/ke2tu/files/)
```
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
```
:::info
[Link to relevant snakemake tutorial section](https://github.com/ngs-docs/2020-GGG298/tree/master/Week4-snakemake-for-workflows#add-a-variable-substitution)
:::
**Task:** Create one or more rules to download the files.
**Check:** `snakemake 1.fa.gz` should run the appropriate `wget` command and produce the right file.
**Hint:** you can run multiple shell commands by doing this:
```
shell:
"""command1
command2
command3"""
```
**One solution:**
You can put the following block in your (otherwise empty) Snakefile:
```
rule download_files:
output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
shell: """
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
"""
```
and then run it like so:
```
snakemake
```
-- since it's the only rule, it'll run that rule by default; otherwise, snakemake runs the _first_ rule in the file.
If you want to see a more elegant solution (from a more civilized world?) please see [the first example in this blog post](http://ivory.idyll.org/blog/2020-snakemake-hacks-collections-files.html).
#### 3.2 Create snakemake rules to run `sourmash compute` on each of the files.
**Task:** Create a snakemake rule to run sourmash compute on each genome.
**Check:** `snakemake 1.fa.gz.sig` should run `sourmash compute` on `1.fa.gz` to produce `1.fa.gz.sig`.
**One solution:**
```
rule sourmash_compute:
input: "{sample}.fa.gz"
output: "{sample}.fa.gz.sig"
shell: "sourmash compute -k 31 {input}"
```
To test this rule, ask snakemake to produce `1.fa.gz.sig` and `2.fa.gz.sig`:
```
snakemake 1.fa.gz.sig
snakemake 2.fa.gz.sig
```
Stretch goal: build a generic rule using wildcards ([link to relevant lesson section](https://github.com/ngs-docs/2020-GGG298/tree/master/Week4-snakemake-for-workflows#add-wildcards)).
#### 3.3 Create a snakemake rule to run `sourmash compare`
Note: you can have multiple input files by using "file1", "file2", "file3"
**Task:** Create a snakemake rule that runs `sourmash compute` on all of the signature files.
**Check:** `snakemake <cmp file>` should run `sourmash compare`.
**One solution:**
```
rule sourmash_compare:
input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
"5.fa.gz.sig"
output: "all.cmp"
shell: "sourmash compare {input} -o all.cmp"
```
A few notes:
* if you don't list the inputs explicitly then snakemake won't make sure they're there before running this
* you want to change `*.sig` to `{input}` so that you avoid comparing all of the signatures in the directory, and just use those five. (This doesn't matter in this situation, but computers like precision.)
* if you don't like typing in lots of input: filenames, you can use expand like so: `input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])` - here, expand makes 5 concrete targets (filenames not containing wildcards) out of one wildcard expression with five values to fill in.
* more on that vein - you can't have `input: "{sample}.fa.gz.sig"` because snakemake won't know how to fill in {sample} for input based on the output: block, which just contains `all.cmp`.
#### 3.4 Create a snakemake rule to run `sourmash plot.`
**Task:** Create a snakemake rule that runs `sourmash plot` to produce PNG files.
**Check:** `snakemake <cmp file>.matrix.png` should run `sourmash plot`.
**One solution:**
```
rule sourmash_plot:
input: "all.cmp", "all.cmp.labels.txt"
output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
shell: "sourmash plot --labels all.cmp"
rule sourmash_compare:
input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
#input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
#input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
# "5.fa.gz.sig"
output: "all.cmp", "all.cmp.labels.txt"
shell: "sourmash compare {input} -o all.cmp"
```
**note* that you have to change the output: block of sourmash_compare to include `all.cmp.labels.txt`.
Stretch goal: build a generic rule using wildcards, so that any `.cmp` file can be turned into a set of plots.
#### 3.5 Create a default rule that runs the entire workflow
:::info
[Link to tutorial section on default rules](https://hackmd.io/cGYzxz07SseGxH0y2gjYJw?view#Create-a-good-%E2%80%9Cdefault%E2%80%9D-rule)
:::
**Task:** Create a default rule that asks for the plots to be created.
**Check:** Running just `snakemake` produces the plots.
**One solution:**
Add a default rule at the very tippy top of the file:
```
# default rule, b/c it's the first rule in the Snakefile
rule all:
input: "all.cmp.matrix.png"
```
The entire Snakefile should look like this now:
```
# default rule, b/c it's the first rule in the Snakefile
rule all:
input: "all.cmp.matrix.png"
rule sourmash_plot:
input: "all.cmp", "all.cmp.labels.txt"
output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
shell: "sourmash plot --labels all.cmp"
rule sourmash_compare:
input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
#input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
#input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
# "5.fa.gz.sig"
output: "all.cmp", "all.cmp.labels.txt"
shell: "sourmash compare {input} -o all.cmp"
rule sourmash_compute:
input: "{sample}.fa.gz"
output: "{sample}.fa.gz.sig"
shell: "sourmash compute -k 31 {input}"
rule download_files:
output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
shell: """
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
"""
```
#### 3.6 Stretch Goal: use a conda environment file for those commands that run sourmash
:::info
[Relevant section in snakemake tutorial](https://github.com/ngs-docs/2020-GGG298/tree/master/Week4-snakemake-for-workflows#rule-specific-conda-environments-with-conda-and---use-conda)
:::
**Task:** Create an environment file and name it in a `conda:` block for each snakemake rule that uses sourmash.
**Check:** `snakemake --use-conda` should create a new environment and run the commands in that environment.
**One solution:**
(TBA)
### Stretch goal: use an R script to produce `.png` output
* You'll probably need the `r-essentials` and `r-base` packages installed in your environment.
* Produce a `.csv` file from `sourmash compare` using `--csv <output file>`
You can use the following code to get a plot going (ht Taylor Reiter and Shannon Joslin) --
```r
#read in data
sourmash_comp_matrix <- read.csv("all.csv")
# Label the rows
rownames(sourmash_comp_matrix) <- colnames(sourmash_comp_matrix)
# Transform for plotting
sourmash_comp_matrix <- as.matrix(sourmash_comp_matrix)
# Make an unclustered heatmap
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')
# save the heatmap!
png("sourmash_heatmap.png")
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')
dev.off()
```
### Stretch goal: create an Rmarkdown file to produce a matrix figure in R
:::info
Relevant section in Rmarkdown tutorial for ++[syntax](https://github.com/ngs-docs/2020-GGG298/tree/master/Week8-Rmarkdown_for_reports_documentation_and_beyond#anatomy-of-rmarkdown-documents)++ and ++[running from the command line](https://github.com/ngs-docs/2020-GGG298/tree/master/Week8-Rmarkdown_for_reports_documentation_and_beyond#rendering-from-the-command-line)++
:::
(TBD)
### 4. Add your files to git and push to github
Now let's back all of our code up somewhere that will track our changes and (potentially) let us share it with others!
#### 4.1 Add Snakefile to git
:::info
[Relevant section from git/github lesson](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#create-some-files)
:::
**Task:** Use `git add` to add the Snakefile to git version control management.
**Check:** `git status` should show `Snakefile` under "Changes to be committed".
**One solution:**
```
git add Snakefile
```
#### 4.2 Commit Snakefile to git
:::info
[Relevant section (below the git add)](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#create-some-files)
:::
**Task:** Use `git commit` to commit the Snakefile to git version control.
**Check:** `git status` should no longer list Snakefile under "Changes to be committed".
**One solution:**
```
git commit -am "added Snakefile to run sourmash plot"
```
#### 4.3 Push to github
:::info
[Relevant section (below the git add/git commit)](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#create-some-files)
:::
**Task:** Run `git push`.
**Check:** Your Snakefile should be visible under your github.com/ account, in the repository you created.
**One solution:**
```
git push
```
This will result in everything being available on github, e.g. [github.com/ctb/2020-ggg298-class9/](https://github.com/ctb/2020-ggg298-class9/).
#### Stretch goal: create a `.gitignore` to ignore automatically produced files
:::info
[Relevant section from Github lesson](https://github.com/ngs-docs/2020-GGG298/tree/master/Week6-Git_and_GitHub_for_file_tracking_and_sharing#using-gitignore-to-ignore-files)
:::
**Task:** Add `.snakemake` and other files to `.gitignore`
**Check:** `git status` should show fewer (or no) untracked files.
**One solution:**
(TBA)
### Stretch: add 6 more genomes!
There are 6 more files [in the OSF project](https://osf.io/ke2tu/files/). Add them into your Snakefile!
(Note: you can get their download URLs by either guessing, or by copying the URL in the <kbd>download</kbd> button that you get from clicking on each file.)
### 5. Create a slurm batch file to run this workflow on farm
#### 5.1 Create a slurm batch file
:::info
[Use this batch file as a template](https://github.com/ngs-docs/2020-ggg-201b-assembly/blob/master/farm-queue.sh)
:::
**Task:** Create a new file that will run your sourmash workflow.
**Check:** You have a file that (1) changes to the right directory, (2) activates the right conda environment, and (3) runs snakemake.
**One solution:**
(TBA)
#### 5.2 Test the slurm batch file
**Task:** Run the slurm batch file at the command line to make sure it works.
**Check:** Running `bash <scriptname>` runs your snakemake workflow, with the first error coming from something _below_ your snakemake command. (Note: be certain to use the `-n` flag with snakemake when testing your script!!)
**One solution:**
(TBA)
#### 5.3 Submit the slurm batch file to the queue
**Task:** Run `sbatch <scriptname>`.
**Check:** A `slurm-*.out` file is created!
**One solution:**
(TBA)
### Voila, done!
Questions?
## A checklist for new projects!
- [ ] You are working in a project directory!
- [ ] The project's directory is under version control! (Git, or something else)
- [ ] You have listed the conda packages you need somewhere, and created an environment with them!
- [ ] You have created a snakemake workflow that runs some, or all, of your workflow!
- [ ] The latest version of your Snakefile is visible on github (or somewhere else you are saving your work)!
- [ ] You have scripted your figure creation, too!