GGG 298, March 2020 - Week 9, integrating all the things

GGG 298, March 2020 - Week 9, integrating all the things

March 4, 2020
Titus Brown

Learning goals:

review the previous 8 lessons by building something that puts them all together!

Today's plan!

Today we're going to try to integrate across many of the things you've seen this quarter. This is the first time I'm trying anything like this, so please be kind :).

What we're going to do is try to start a new project from scratch, knowing just the commands to run. And then we're going to install stuff with conda, develop a snakemake workflow, make an rmarkdown report, and push it all to github.

What I'd like to try is breaking things down into 10-20 minute chunks during which you'll put something together. After each chunk, I'll walk through a solution, post it for you all, and then move on to the next chunk.

Importantly, you absolutely don't have to complete each chunk! I'd like you to try, but having a set of questions and thoughts is perfectly fine! And you can definitely work with others; you might try pair- or group-programming.

The workflow!

Scenario:

Sandra has received a collection of bacterial genomes and wants to see how similar they are at a nucleotide level. To do this, she decides to use the sourmash tool to compare them and create a figure showing their similarity.

Briefly, sourmash is a tool that compares genomes at a k-mer level. First, it breaks genomes down into k-mers, then it calculates the Jaccard similarity between the genomes, and then it plots the similarities in various ways.

A. Computing signatures

Starting from five genome files,

ls -lh *.fa.gz

-r–r–r–@ 1 t staff 1.6M Mar 3 17:02 1.fa.gz
-r–r–r–@ 1 t staff 566K Mar 3 17:02 2.fa.gz
-r–r–r–@ 1 t staff 548K Mar 3 17:02 3.fa.gz
-r–r–r–@ 1 t staff 565K Mar 3 17:02 4.fa.gz
-r–r–r–@ 1 t staff 1.6M Mar 3 17:02 5.fa.gz

we first compute genome signatures for each of them, using a k-mer size of 31. Each of these commands takes as input a .fa.gz file, and outputs a .fa.gz.sig file.

sourmash compute -k 31 1.fa.gz
sourmash compute -k 31 2.fa.gz
sourmash compute -k 31 3.fa.gz
sourmash compute -k 31 4.fa.gz
sourmash compute -k 31 5.fa.gz

B. Comparing the signatures

Next we want to compare all five of these signature files.
The following command takes as input multiple .sig files and outputs a comparison results file named all.cmp. We can optionally produce a CSV matrix file using --csv all.csv.

sourmash compare *.sig -o all.cmp

C. Plotting the comparison

Finally, we want to plot the comparison.
This final command takes as input the files all.cmp and all.cmp.labels.txt, and outputs three files,all.cmp.hist.png, all.cmp.dendro.png, and all.cmp.matrix.png. Use the --labels option to put labels on the matrix figure.

sourmash plot --labels all.cmp

This produces the following figure showing that genomes 1 and 5 are similar, genomes 3 and 4 are similar, and genome 2 is not similar to either of the other sets.

The tasks!

Please log into farm first :)

1. Conda for software installation

1.1 First, use conda to install sourmash and snakemake.

We'll need to have both sourmash and snakemake-minimal installed. Let's use conda to do it!

Task: Create a conda environment containing the latest version of sourmash and snakemake-minimal, installed from bioconda.

Pointer to the relevant section from the conda lesson in week 3

Check: conda env list should show that your new environment exists.

One solution:

Here's a command to create an environment named smash with the packages sourmash and snakemake-minimal:

conda create -y -n smash sourmash snakemake-minimal

The -y is for "just say yes".

Some commands in the tutorial may have -c present, that is needed to specify the "channels" which elsewhere in this class you've added using conda channel add.

1.2 Second, activate your conda environment

Task: Activate your conda environment.

Check: Running sourmash should show something other than "command not found".

One solution:

conda activate smash

Questions:

Q: if I already have snakemake in an environment and I want to add sourmash, how would I do that?
A: activate the environment, and run conda install -y sourmash to install sourmash in the current environment. You can also run conda install -n smash -y snakemake to install snakemake in a non-active environment.

Stretch: Create a conda environment.yml that you can use `conda create -f` with

See link to section in conda tutorial

Solution:

use conda env export > export.yml in your environment to create a file named export.yml
edit out all of the dependency lines in that file that aren't sourmash or snakemake
then you can use conda env create -f export.yml -n <new name> to create a new environment from that file.

2. Create a github repository to hold your files.

2.1 Create a repository on github.com/

Let's create a repository on github!

Link to relevant section from git/github lesson

Task: Create a repository on github.com/.

Check: There is a repository with 'sourmash' in the name at http://github.com/<your account>/

One solution:

go to github
click the little plus on upper right
select "new repository"
enter some name (2020-ggg298-class9)
check "create README" box
hit "create"
voila, https://github.com/ctb/2020-ggg298-class9

2.2 Clone your repository from github onto farm.

Let's make a local copy (clone) of the repository from github on farm! This will be the directory we work in.

See the relevant section from git/github lesson

Task: Clone the repository to a local directory on farm and change into that directory.

Check: On farm, there is a directory named the same thing as your repository on github.com.

One solution:

Run

git clone http://github.com/<account>/<repo name>
cd <repo name>

Alternative/weird situations

If you want to check if you are in a git repo, you can do

ls -ld .git

in the top level directory of what should be the repo.

If you created the directory without cloning it from github, you can try:

cd <not-a-git-repo>/
git init
git remote add origin http://github.com/<account>/<repo name>
git pull origin master

a less weird approach would be:

mv <not-a-git-repo> <somedir>.bak #here, somedir.bak is a renamed directory of the not-a-git-repo
git clone http://github.com/<account>/<repo name>
cd <repo name>
cp ../somedir.bak/Snakefile .

3. Create an automated workflow!

Let's automate this set of commands using snakemake!

Remember to use nano -ET4 Snakefile when editing the Snakefile! This will set tabs to be four spaces, instead. And ctrl-x y ENTER saves files!

Reminder: rules look like this,




rule rulename:
    input: "input_file1", "input_file2"
    output: "output_file1", "output_file2"
    shell: "shell_command goes here"

And you can use {input} and {output} in the shell command.

3.1 Create snakemake rules to download the genome files.

Create a rule to run the following commands to download the data from this OSF

wget https://osf.io/t5bu6/download -O 1.fa.gz
wget https://osf.io/ztqx3/download -O 2.fa.gz
wget https://osf.io/w4ber/download -O 3.fa.gz
wget https://osf.io/dnyzp/download -O 4.fa.gz
wget https://osf.io/ajvqk/download -O 5.fa.gz

Link to relevant snakemake tutorial section

Task: Create one or more rules to download the files.

Check: snakemake 1.fa.gz should run the appropriate wget command and produce the right file.

Hint: you can run multiple shell commands by doing this:

shell:
   """command1
   command2
   command3"""

One solution:

You can put the following block in your (otherwise empty) Snakefile:

rule download_files:
    output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
    shell: """
        wget https://osf.io/t5bu6/download -O 1.fa.gz
        wget https://osf.io/ztqx3/download -O 2.fa.gz
        wget https://osf.io/w4ber/download -O 3.fa.gz
        wget https://osf.io/dnyzp/download -O 4.fa.gz
        wget https://osf.io/ajvqk/download -O 5.fa.gz
    """

and then run it like so:

snakemake

– since it's the only rule, it'll run that rule by default; otherwise, snakemake runs the first rule in the file.

If you want to see a more elegant solution (from a more civilized world?) please see the first example in this blog post.

3.2 Create snakemake rules to run `sourmash compute` on each of the files.

Task: Create a snakemake rule to run sourmash compute on each genome.

Check: snakemake 1.fa.gz.sig should run sourmash compute on 1.fa.gz to produce 1.fa.gz.sig.

One solution:

rule sourmash_compute:
    input: "{sample}.fa.gz"
    output: "{sample}.fa.gz.sig"
    shell: "sourmash compute -k 31 {input}"

To test this rule, ask snakemake to produce 1.fa.gz.sig and 2.fa.gz.sig:

snakemake 1.fa.gz.sig
snakemake 2.fa.gz.sig

Stretch goal: build a generic rule using wildcards (link to relevant lesson section).

3.3 Create a snakemake rule to run `sourmash compare`

Note: you can have multiple input files by using "file1", "file2", "file3"

Task: Create a snakemake rule that runs sourmash compute on all of the signature files.

Check: snakemake <cmp file> should run sourmash compare.

One solution:

rule sourmash_compare:
    input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
        "5.fa.gz.sig"
    output: "all.cmp"
    shell: "sourmash compare {input} -o all.cmp"

A few notes:

if you don't list the inputs explicitly then snakemake won't make sure they're there before running this
you want to change *.sig to {input} so that you avoid comparing all of the signatures in the directory, and just use those five. (This doesn't matter in this situation, but computers like precision.)
if you don't like typing in lots of input: filenames, you can use expand like so: input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5]) - here, expand makes 5 concrete targets (filenames not containing wildcards) out of one wildcard expression with five values to fill in.
more on that vein - you can't have input: "{sample}.fa.gz.sig" because snakemake won't know how to fill in {sample} for input based on the output: block, which just contains all.cmp.

3.4 Create a snakemake rule to run `sourmash plot.`

Task: Create a snakemake rule that runs sourmash plot to produce PNG files.

Check: snakemake <cmp file>.matrix.png should run sourmash plot.

One solution:

rule sourmash_plot:
    input: "all.cmp", "all.cmp.labels.txt"
    output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
    shell: "sourmash plot --labels all.cmp"

rule sourmash_compare:
    input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
    #input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
    #input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
    #    "5.fa.gz.sig"
    output: "all.cmp", "all.cmp.labels.txt"
    shell: "sourmash compare {input} -o all.cmp"

*note that you have to change the output: block of sourmash_compare to include all.cmp.labels.txt.

Stretch goal: build a generic rule using wildcards, so that any .cmp file can be turned into a set of plots.

3.5 Create a default rule that runs the entire workflow

Link to tutorial section on default rules

Task: Create a default rule that asks for the plots to be created.

Check: Running just snakemake produces the plots.

One solution:

Add a default rule at the very tippy top of the file:

# default rule, b/c it's the first rule in the Snakefile
rule all:
    input: "all.cmp.matrix.png"

The entire Snakefile should look like this now:

# default rule, b/c it's the first rule in the Snakefile
rule all:
    input: "all.cmp.matrix.png"

rule sourmash_plot:
    input: "all.cmp", "all.cmp.labels.txt"
    output: "all.cmp.hist.png", "all.cmp.dendro.png", "all.cmp.matrix.png"
    shell: "sourmash plot --labels all.cmp"

rule sourmash_compare:
    input: expand("{sample}.fa.gz.sig", sample=range(1, 6))
    #input: expand("{sample}.fa.gz.sig", sample=[1,2,3,4,5])
    #input: "1.fa.gz.sig", "2.fa.gz.sig", "3.fa.gz.sig", "4.fa.gz.sig",
    #    "5.fa.gz.sig"
    output: "all.cmp", "all.cmp.labels.txt"
    shell: "sourmash compare {input} -o all.cmp"

rule sourmash_compute:
    input: "{sample}.fa.gz"
    output: "{sample}.fa.gz.sig"
    shell: "sourmash compute -k 31 {input}"
    
rule download_files:
    output: "1.fa.gz", "2.fa.gz", "3.fa.gz", "4.fa.gz", "5.fa.gz"
    shell: """
        wget https://osf.io/t5bu6/download -O 1.fa.gz
        wget https://osf.io/ztqx3/download -O 2.fa.gz
        wget https://osf.io/w4ber/download -O 3.fa.gz
        wget https://osf.io/dnyzp/download -O 4.fa.gz
        wget https://osf.io/ajvqk/download -O 5.fa.gz
    """

3.6 Stretch Goal: use a conda environment file for those commands that run sourmash

Relevant section in snakemake tutorial

Task: Create an environment file and name it in a conda: block for each snakemake rule that uses sourmash.

Check: snakemake --use-conda should create a new environment and run the commands in that environment.

One solution:

(TBA)

Stretch goal: use an R script to produce `.png` output

You'll probably need the r-essentials and r-base packages installed in your environment.
Produce a .csv file from sourmash compare using --csv <output file>

You can use the following code to get a plot going (ht Taylor Reiter and Shannon Joslin) –

#read in data
sourmash_comp_matrix <- read.csv("all.csv")

# Label the rows
rownames(sourmash_comp_matrix) <- colnames(sourmash_comp_matrix)

# Transform for plotting
sourmash_comp_matrix <- as.matrix(sourmash_comp_matrix)

# Make an unclustered heatmap
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')

# save the heatmap!
png("sourmash_heatmap.png")
heatmap(sourmash_comp_matrix, Colv=F, Rowv=F, scale='none')
dev.off()

Stretch goal: create an Rmarkdown file to produce a matrix figure in R

Relevant section in Rmarkdown tutorial for syntax and running from the command line

(TBD)

4. Add your files to git and push to github

Now let's back all of our code up somewhere that will track our changes and (potentially) let us share it with others!

4.1 Add Snakefile to git

Relevant section from git/github lesson

Task: Use git add to add the Snakefile to git version control management.

Check: git status should show Snakefile under "Changes to be committed".

One solution:

git add Snakefile

4.2 Commit Snakefile to git

Relevant section (below the git add)

Task: Use git commit to commit the Snakefile to git version control.

Check: git status should no longer list Snakefile under "Changes to be committed".

One solution:

git commit -am "added Snakefile to run sourmash plot"

4.3 Push to github

Relevant section (below the git add/git commit)

Task: Run git push.

Check: Your Snakefile should be visible under your github.com/ account, in the repository you created.

One solution:

git push

This will result in everything being available on github, e.g. github.com/ctb/2020-ggg298-class9/.

Stretch goal: create a `.gitignore` to ignore automatically produced files

Relevant section from Github lesson

Task: Add .snakemake and other files to .gitignore

Check: git status should show fewer (or no) untracked files.

One solution:

(TBA)

Stretch: add 6 more genomes!

There are 6 more files in the OSF project. Add them into your Snakefile!

(Note: you can get their download URLs by either guessing, or by copying the URL in the download button that you get from clicking on each file.)

5. Create a slurm batch file to run this workflow on farm

5.1 Create a slurm batch file

Use this batch file as a template

Task: Create a new file that will run your sourmash workflow.

Check: You have a file that (1) changes to the right directory, (2) activates the right conda environment, and (3) runs snakemake.

One solution:

(TBA)

5.2 Test the slurm batch file

Task: Run the slurm batch file at the command line to make sure it works.

Check: Running bash <scriptname> runs your snakemake workflow, with the first error coming from something below your snakemake command. (Note: be certain to use the -n flag with snakemake when testing your script!!)

One solution:

(TBA)

5.3 Submit the slurm batch file to the queue

Task: Run sbatch <scriptname>.

Check: A slurm-*.out file is created!

One solution:

(TBA)

Voila, done!

Questions?

A checklist for new projects!

You are working in a project directory!
- The project's directory is under version control! (Git, or something else)
You have listed the conda packages you need somewhere, and created an environment with them!
You have created a snakemake workflow that runs some, or all, of your workflow!
The latest version of your Snakefile is visible on github (or somewhere else you are saving your work)!
You have scripted your figure creation, too!

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.

GGG 298, March 2020 - Week 9, integrating all the things

Today's plan!

The workflow!

A. Computing signatures

B. Comparing the signatures

C. Plotting the comparison

The tasks!

1. Conda for software installation

1.1 First, use conda to install sourmash and snakemake.

1.2 Second, activate your conda environment

Stretch: Create a conda environment.yml that you can use conda create -f with

2. Create a github repository to hold your files.

2.1 Create a repository on github.com/

2.2 Clone your repository from github onto farm.

Alternative/weird situations

3. Create an automated workflow!

3.1 Create snakemake rules to download the genome files.

3.2 Create snakemake rules to run sourmash compute on each of the files.

3.3 Create a snakemake rule to run sourmash compare

3.4 Create a snakemake rule to run sourmash plot.

3.5 Create a default rule that runs the entire workflow

3.6 Stretch Goal: use a conda environment file for those commands that run sourmash

Stretch goal: use an R script to produce .png output

Stretch goal: create an Rmarkdown file to produce a matrix figure in R

4. Add your files to git and push to github

4.1 Add Snakefile to git

4.2 Commit Snakefile to git

4.3 Push to github

Stretch goal: create a .gitignore to ignore automatically produced files

Stretch: add 6 more genomes!

5. Create a slurm batch file to run this workflow on farm

5.1 Create a slurm batch file

5.2 Test the slurm batch file

5.3 Submit the slurm batch file to the queue

Voila, done!

A checklist for new projects!

Stretch: Create a conda environment.yml that you can use `conda create -f` with

3.2 Create snakemake rules to run `sourmash compute` on each of the files.

3.3 Create a snakemake rule to run `sourmash compare`

3.4 Create a snakemake rule to run `sourmash plot.`

Stretch goal: use an R script to produce `.png` output

Stretch goal: create a `.gitignore` to ignore automatically produced files