March 4, 2020
Titus Brown
Learning goals:
Today we're going to try to integrate across many of the things you've seen this quarter. This is the first time I'm trying anything like this, so please be kind :).
What we're going to do is try to start a new project from scratch, knowing just the commands to run. And then we're going to install stuff with conda, develop a snakemake workflow, make an rmarkdown report, and push it all to github.
What I'd like to try is breaking things down into 10-20 minute chunks during which you'll put something together. After each chunk, I'll walk through a solution, post it for you all, and then move on to the next chunk.
Importantly, you absolutely don't have to complete each chunk! I'd like you to try, but having a set of questions and thoughts is perfectly fine! And you can definitely work with others; you might try pair- or group-programming.
Scenario:
Sandra has received a collection of bacterial genomes and wants to see how similar they are at a nucleotide level. To do this, she decides to use the sourmash tool to compare them and create a figure showing their similarity.
Briefly, sourmash is a tool that compares genomes at a k-mer level. First, it breaks genomes down into k-mers, then it calculates the Jaccard similarity between the genomes, and then it plots the similarities in various ways.
Starting from five genome files,
ls -lh *.fa.gz
-r–r–r–@ 1 t staff 1.6M Mar 3 17:02 1.fa.gz
-r–r–r–@ 1 t staff 566K Mar 3 17:02 2.fa.gz
-r–r–r–@ 1 t staff 548K Mar 3 17:02 3.fa.gz
-r–r–r–@ 1 t staff 565K Mar 3 17:02 4.fa.gz
-r–r–r–@ 1 t staff 1.6M Mar 3 17:02 5.fa.gz
we first compute genome signatures for each of them, using a k-mer size of 31. Each of these commands takes as input a .fa.gz
file, and outputs a .fa.gz.sig
file.
sourmash compute -k 31 1.fa.gz
sourmash compute -k 31 2.fa.gz
sourmash compute -k 31 3.fa.gz
sourmash compute -k 31 4.fa.gz
sourmash compute -k 31 5.fa.gz
Next we want to compare all five of these signature files.
The following command takes as input multiple .sig
files and outputs a comparison results file named all.cmp
. We can optionally produce a CSV matrix file using --csv all.csv
.
sourmash compare *.sig -o all.cmp
Finally, we want to plot the comparison.
This final command takes as input the files all.cmp
and all.cmp.labels.txt
, and outputs three files,all.cmp.hist.png
, all.cmp.dendro.png
, and all.cmp.matrix.png
. Use the --labels
option to put labels on the matrix figure.
sourmash plot --labels all.cmp
This produces the following figure showing that genomes 1 and 5 are similar, genomes 3 and 4 are similar, and genome 2 is not similar to either of the other sets.
Please log into farm first :)
We'll need to have both sourmash
and snakemake-minimal
installed. Let's use conda to do it!
Task: Create a conda environment containing the latest version of sourmash and snakemake-minimal, installed from bioconda.
Check: conda env list
should show that your new environment exists.
One solution:
Here's a command to create an environment named smash
with the packages sourmash
and snakemake-minimal
:
conda create -y -n smash sourmash snakemake-minimal
The -y
is for "just say yes".
Some commands in the tutorial may have -c
present, that is needed to specify the "channels" which elsewhere in this class you've added using conda channel add
.
Task: Activate your conda environment.
Check: Running sourmash
should show something other than "command not found".
One solution:
conda activate smash
Questions:
Q: if I already have snakemake in an environment and I want to add sourmash, how would I do that?
A: activate the environment, and run conda install -y sourmash
to install sourmash in the current environment. You can also run conda install -n smash -y snakemake
to install snakemake in a non-active environment.
conda create -f
withSolution:
conda env export > export.yml
in your environment to create a file named export.ymlconda env create -f export.yml -n <new name>
to create a new environment from that file.Let's create a repository on github!
Task: Create a repository on github.com/.
Check: There is a repository with 'sourmash' in the name at http://github.com/<your account>/
One solution:
Let's make a local copy (clone) of the repository from github on farm! This will be the directory we work in.
Task: Clone the repository to a local directory on farm and change into that directory.
Check: On farm, there is a directory named the same thing as your repository on github.com.
One solution:
Run
git clone http://github.com/<account>/<repo name>
cd <repo name>
If you want to check if you are in a git repo, you can do
ls -ld .git
in the top level directory of what should be the repo.
If you created the directory without cloning it from github, you can try:
cd <not-a-git-repo>/
git init
git remote add origin http://github.com/<account>/<repo name>
git pull origin master
a less weird approach would be:
mv <not-a-git-repo> <somedir>.bak #here, somedir.bak is a renamed directory of the not-a-git-repo
git clone http://github.com/<account>/<repo name>
cd <repo name>
cp ../somedir.bak/Snakefile .
Let's automate this set of commands using snakemake!
Remember to use nano -ET4 Snakefile
when editing the Snakefile! This will set tabs to be four spaces, instead. And ctrl-x y ENTER saves files!
Reminder: rules look like this,
rule rulename:
input: "input_file1", "input_file2"
output: "output_file1", "output_file2"
shell: "shell_command goes here"
And you can use {input}
and {output}
in the shell command.
Create a rule to run the following commands to download the data from this OSF
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
Task: Create one or more rules to download the files.
Check: snakemake 1.fa.gz
should run the appropriate wget
command and produce the right file.
Hint: you can run multiple shell commands by doing this:
shell:
"""command1
command2
command3"""
One solution:
You can put the following block in your (otherwise empty) Snakefile:
rule download_files:
output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
shell: """
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
"""
and then run it like so:
snakemake
– since it's the only rule, it'll run that rule by default; otherwise, snakemake runs the first rule in the file.
If you want to see a more elegant solution (from a more civilized world?) please see the first example in this blog post.
sourmash compute
on each of the files.Task: Create a snakemake rule to run sourmash compute on each genome.
Check: snakemake 1.fa.gz.sig
should run sourmash compute
on 1.fa.gz
to produce 1.fa.gz.sig
.
One solution:
rule sourmash_compute:
input: "{sample}.fa.gz"
output: "{sample}.fa.gz.sig"
shell: "sourmash compute -k 31 {input}"
To test this rule, ask snakemake to produce 1.fa.gz.sig
and 2.fa.gz.sig
:
snakemake 1.fa.gz.sig
snakemake 2.fa.gz.sig
Stretch goal: build a generic rule using wildcards (link to relevant lesson section).
sourmash compare
Note: you can have multiple input files by using "file1", "file2", "file3"
Task: Create a snakemake rule that runs sourmash compute
on all of the signature files.
Check: snakemake <cmp file>
should run sourmash compare
.
One solution:
rule sourmash_compare:
input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
"5.fa.gz.sig"
output: "all.cmp"
shell: "sourmash compare {input} -o all.cmp"
A few notes:
*.sig
to {input}
so that you avoid comparing all of the signatures in the directory, and just use those five. (This doesn't matter in this situation, but computers like precision.)input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
- here, expand makes 5 concrete targets (filenames not containing wildcards) out of one wildcard expression with five values to fill in.input: "{sample}.fa.gz.sig"
because snakemake won't know how to fill in {sample} for input based on the output: block, which just contains all.cmp
.sourmash plot.
Task: Create a snakemake rule that runs sourmash plot
to produce PNG files.
Check: snakemake <cmp file>.matrix.png
should run sourmash plot
.
One solution:
rule sourmash_plot:
input: "all.cmp", "all.cmp.labels.txt"
output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
shell: "sourmash plot --labels all.cmp"
rule sourmash_compare:
input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
#input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
#input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
# "5.fa.gz.sig"
output: "all.cmp", "all.cmp.labels.txt"
shell: "sourmash compare {input} -o all.cmp"
*note that you have to change the output: block of sourmash_compare to include all.cmp.labels.txt
.
Stretch goal: build a generic rule using wildcards, so that any .cmp
file can be turned into a set of plots.
Task: Create a default rule that asks for the plots to be created.
Check: Running just snakemake
produces the plots.
One solution:
Add a default rule at the very tippy top of the file:
# default rule, b/c it's the first rule in the Snakefile
rule all:
input: "all.cmp.matrix.png"
The entire Snakefile should look like this now:
# default rule, b/c it's the first rule in the Snakefile
rule all:
input: "all.cmp.matrix.png"
rule sourmash_plot:
input: "all.cmp", "all.cmp.labels.txt"
output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
shell: "sourmash plot --labels all.cmp"
rule sourmash_compare:
input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
#input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
#input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
# "5.fa.gz.sig"
output: "all.cmp", "all.cmp.labels.txt"
shell: "sourmash compare {input} -o all.cmp"
rule sourmash_compute:
input: "{sample}.fa.gz"
output: "{sample}.fa.gz.sig"
shell: "sourmash compute -k 31 {input}"
rule download_files:
output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
shell: """
wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz
"""
Task: Create an environment file and name it in a conda:
block for each snakemake rule that uses sourmash.
Check: snakemake --use-conda
should create a new environment and run the commands in that environment.
One solution:
(TBA)
.png
outputr-essentials
and r-base
packages installed in your environment..csv
file from sourmash compare
using --csv <output file>
You can use the following code to get a plot going (ht Taylor Reiter and Shannon Joslin) –
#read in data
sourmash_comp_matrix <- read.csv("all.csv")
# Label the rows
rownames(sourmash_comp_matrix) <- colnames(sourmash_comp_matrix)
# Transform for plotting
sourmash_comp_matrix <- as.matrix(sourmash_comp_matrix)
# Make an unclustered heatmap
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')
# save the heatmap!
png("sourmash_heatmap.png")
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')
dev.off()
Relevant section in Rmarkdown tutorial for syntax and running from the command line
(TBD)
Now let's back all of our code up somewhere that will track our changes and (potentially) let us share it with others!
Task: Use git add
to add the Snakefile to git version control management.
Check: git status
should show Snakefile
under "Changes to be committed".
One solution:
git add Snakefile
Task: Use git commit
to commit the Snakefile to git version control.
Check: git status
should no longer list Snakefile under "Changes to be committed".
One solution:
git commit -am "added Snakefile to run sourmash plot"
Task: Run git push
.
Check: Your Snakefile should be visible under your github.com/ account, in the repository you created.
One solution:
git push
This will result in everything being available on github, e.g. github.com/ctb/2020-ggg298-class9/.
.gitignore
to ignore automatically produced filesTask: Add .snakemake
and other files to .gitignore
Check: git status
should show fewer (or no) untracked files.
One solution:
(TBA)
There are 6 more files in the OSF project. Add them into your Snakefile!
(Note: you can get their download URLs by either guessing, or by copying the URL in the download button that you get from clicking on each file.)
Task: Create a new file that will run your sourmash workflow.
Check: You have a file that (1) changes to the right directory, (2) activates the right conda environment, and (3) runs snakemake.
One solution:
(TBA)
Task: Run the slurm batch file at the command line to make sure it works.
Check: Running bash <scriptname>
runs your snakemake workflow, with the first error coming from something below your snakemake command. (Note: be certain to use the -n
flag with snakemake when testing your script!!)
One solution:
(TBA)
Task: Run sbatch <scriptname>
.
Check: A slurm-*.out
file is created!
One solution:
(TBA)
Questions?