I'm available for office hours today, in CCAH 251 or 253.
Shannon Office Hours: Friday 11am-12pm in the DataLab (3rd Floor Library)
By the end of this lecture, students will:
Many things in bioinformatics are workflows. Data goes in, data comes out! Today we're going to talk about ways of automating this, using snakemake.
Generally the end result of a workflow is something you load into R and process (e.g. RMarkdown).
Realistically workflows get much more complicated when you're doing real things, too!
Every computational workflow consists of multiple steps, taking previous outputs and/or data/information in and executing upon them and outputting something.
raw data -> workflow -> summary data for plotting and statistics
See also Nature Toolbox, "Workflow systems turn raw data into scientific knowledge"
snakemake is an awesome workflow system created by Johannes Koester and others (see 2012 publication). Other workflow systems work similarly but have different syntax; while we use snakemake, we can also recommend nextflow, Common Workflow Language, and Workflow Definition Language. The latter two are more for workflow engineers than the first two.
snakemake works by looking at a file, called a Snakefile, that contains recipes for creating files. Then it uses the recipes and command line options to figure out what needs to be created and how to do it, and then …runs it!
So, it's (mostly) in the Snakefile, folks…
The name 'snakemake' comes from the fact that it's written in (and can be extended by) the Python programming language. (Confusingly, Python is actually named after Monty Python, not the reptiles, but whatever.)
Go ahead and log into the farm.
Create a working directory:
mkdir -p ~/298class4
cd ~/298class4
Download some data:
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
Configure bioconda channels, just in case this didn't happen the first time through:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
(rerunning these commands is harmless!)
and, finally, activate our previous conda environment that contains fastqc. Then install salmon:
conda activate fqc
conda install -y -n fqc salmon
(if you don't have a fqc environment from week 3, you can create this with conda create -y -n fqc -c bioconda fastqc salmon
.)
We are now set!
We have a file "ERR458493.fastq.gz" and we want to make "ERR458493_fastqc.html" and "ERR458493_fastqc.zip" from it.
We could run fastqc ERR458493.fastq.gz
at the command line, but let's use this task as the start of our snakemake workflow first!
Run:
nano -ET4 Snakefile
and copy/paste the following in;
rule all:
input:
"ERR458493_fastqc.html",
"ERR458493_fastqc.zip"
rule make_fastqc:
input:
"ERR458493.fastq.gz"
output:
"ERR458493_fastqc.html",
"ERR458493_fastqc.zip"
shell:
"fastqc ERR458493.fastq.gz"
hit CTRL+X, y, ENTER to save the changes.
now run:
snakemake -p
and you will see snakemake run fastqc for you, as specified!
The logic above is this:
all
) is the rule run by default, so we are asking snakemake to create the two target files ERR458493_fastqc.html
and ERR458493_fastqc.zip
.make_fastqc
rule says, "if this input exists, you can run the provided shell command to make that output".Here, the "input:" in the rule all
has to match the "output" in the rule make_fastqc
or else Snakefile wouldn't know what to make.
Meta-notes:
nano -ET4
to edit!If you run snakemake -p
again, it won't do anything. That's because all of the input files for the first rule already exist!
However, if you remove a file and run snakemake:
rm ERR458493_fastqc.html
snakemake -p
then snakemake will run fastqc again, because now you don't have one of the files you're asking it to make!
This ability to selectively figure out whether or not to run a command is one of the most convenient features of snakemake.
Let's make the make_fastqc
rule a little more generic. Use nano -ET4 Snakefile
to edit the file and make the rule look like this:
rule make_fastqc:
input:
"ERR458493.fastq.gz"
output:
"ERR458493_fastqc.html",
"ERR458493_fastqc.zip"
shell:
"fastqc {input}"
What does this do? It replaces the {input}
with whatever is in the "input:" line, above.
TODO: Add a new rule, called make_fastqc2
, that runs fastqc on ERR458501.fastq.gz
Does it run?
Reminder: remember to add the desired output file to the "all" rule as an input, too!
You should now have two rules, make_fastqc
and make_fastqc2
, that have the same shell: command. But they have different inputs and outputs: one has "ERR458493.fastq.gz" as an input, and "ERR458493_fastqc.html" and "ERR458493_fastqc.zip" as outputs, while the other has "ERR458501.fastq.gz" as an input, and "ERR458501_fastqc.html" and "ERR458501_fastqc.zip" as outputs. If you line these up, you'll notice something interesting:
ERR458493.fastq.gz
ERR458493_fastqc.html
ERR458493_fastqc.zip
^^^^^^^^^
ERR458501.fastq.gz
ERR458501_fastqc.html
ERR458501_fastqc.zip
^^^^^^^^^
Notice how the top three and the bottom three have the same prefix (ERR458), and how the suffixes are the same between the two samples?
Use nano -ET4 Snakefile
and change the make_fastqc
rule so:
"{sample}.fastq.gz"
"{sample}_fastqc.html", "{sample}_fastqc.zip"
make_fastqc2
rule (you can use CTRL-K to delete an entire line if your cursor is on that line).Your complete Snakefile should look like this:
rule all:
input:
"ERR458493_fastqc.html",
"ERR458493_fastqc.zip",
"ERR458501_fastqc.html",
"ERR458501_fastqc.zip"
rule make_fastqc:
input:
"{sample}.fastq.gz"
output:
"{sample}_fastqc.html",
"{sample}_fastqc.zip"
shell:
"fastqc {input}"
This tells snakemake that any time it is asked for a file that ends with _fastqc.html
or _fastqc.zip
, it should look for a similarly named file that ends with .fastq.gz
and, if it finds one, it can run fastqc
on that file to produce those outputs.
Try it!
rm *.html
snakemake -p
Note that wildcards print out in the snakemake output for each task, so you can see exactly what is being substituted!
Final note: wildcards operate entirely within a single rule, not across rules.
Update the Snakefile so that it runs fastqc on "ERR458494.fastq.gz" and "ERR458500.fastq.gz" too.
Now let's add some more rules, at the bottom, so that we can do the quantification.
We need to do three things: (1) download a reference transcriptome, (2) index the reference transcriptome, and (3) quantify the reference genes based on the reads (using salmon).
Let's add the first two as new rules. Here are the shell commands to run; can you make two new rules from them? one called download_reference
and the other called index_reference
?
the download_reference
shell command is:
curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
the index_reference
shell command is:
salmon index --index yeast_orfs --transcripts orf_coding.fasta.gz
(note that you can always just run these commands at the prompt if you want to see if they work!)
Suggestion:
shell:
line. To do this, copy the format of a previous rule! Note that you can put in empty input: and output: lines if you like.output:
line to download_reference
. What file does it produce?download_reference
doesn't take any input files, because it's downloading the data from the Internet!)input:
line to index_reference
. What does it take in?index_reference
, replace the filename it takes in the shell line with {input}
, too.snakemake -p
(one) Solution
Add these to the end of your Snakefile
rule download_reference:
output:
"orf_coding.fasta.gz"
shell:
"curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz"
rule index_reference:
input:
"orf_coding.fasta.gz"
output:
"yeast_orfs"
shell:
"salmon index --index {output} --transcripts {input}"
BUT if you try to run snakemake -p
then it won't run… we have to specify the rule to run:
snakemake -p download_reference
snakemake -p index_reference
If you'd like to have snakemake -p
run these, then add their output to the input: of rule all:
Alternatively, you can ask for the filenames: snakemake -p yeast_orfs orf_coding.fasta.gz
.
The next command we want to snakemake-ify is this one:
for i in *.fastq.gz
do
salmon quant -i yeast_orfs --libType U -r $i -o $i.quant --seqBias --gcBias
done
we could do this by copying the command above into a "shell:"" block, like so:
rule salmon_quant:
...
shell:
"""for i in *.fastq.gz
do
salmon quant -i yeast_orfs --libType U -r $i -o $i.quant --seqBias --gcBias
done"""
but this is not ideal, for a few reasons.
The two main reasons are these:
salmon quant
commands fails, the entire rule fails, so all four need to be re-run.a third reason is that, as written, the command cannot easily take advantage of snakemake wildcards (like {sample}
or {input}
).
(These are all "bad smells", i.e. if you find your Snakefile is behaving like this, it's time to reconsider your approach!)
These are all related to the feature that snakemake specifies workflows declaratively, as recipes, and can then choose how to run them based on the resources its given.
Analogy time: if you have a kitchen, and a chef, you can either give them a recipe procedure and ask them to follow it; or you can give them a recipe book, and then tell them what dishes you want for dinner. In the former case, they are stuck doing things in the order you specify. In the latter case, they have the freedom to do things in the order that is convenient or efficient. And, if you hire more chefs, you may be able to produce meals faster and more efficiently!
(We'll discuss how to provide more chefs to snakemake below, in more detail.)
With more wildcards, like we did in the fastqc example, above!
First, let's start with a test case by writing a single sample rule:
rule salmon_quant:
shell:
"salmon quant -i yeast_orfs --libType U -r ERR458493.fastq.gz -o ERR458493.fastq.gz.quant --seqBias --gcBias"
and make sure that works: snakemake -p salmon_quant
Now, add input:
and output:
…
rule salmon_quant:
input: "ERR458493.fastq.gz"
output: "ERR458493.fastq.gz.quant"
shell:
"salmon quant -i yeast_orfs --libType U -r ERR458493.fastq.gz -o ERR458493.fastq.gz.quant --seqBias --gcBias"
and check that for syntax: snakemake -p salmon_quant
and finally replace the prefixes with the {sample}
{input}
and {output}
wildcards we learned before:
rule salmon_quant:
input: "{sample}"
output: "{sample}.quant"
shell:
"salmon quant -i yeast_orfs --libType U -r {input} -o {output} --seqBias --gcBias"
Now, you can no longer run salmon_quant
directly – you have to ask snakemake for a particular file. Try:
rm -fr ERR458493.fastq.gz.quant
snakemake -p ERR458493.fastq.gz.quant
The reason for this is that snakemake doesn't automatically look at all the files in the directory and figure out which ones it can apply rules to - you have to ask it more specifically, by asking for the specific files you want.
Last but not least, we need to add this info into the default rule.
snakemake
run with no arguments for all four salmon quant
commands.-n
If you give snakemake a -n
parameter, it will tell you what it thinks it should run but won't actually run it. This is useful for situations where you don't know what needs to be run and want to find out without actually running it.
Rules that only mention specific files (no wildcards!) are concrete rules. You have to provide snakemake with at least one specific file to make, or it will tell you that there's a WorkflowError
:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
You can provide a filename on the command line, or you can put one in a rule and run that rule, but you have to provide at least one concrete request.
In contrast, rules containing wildcards are abstract rules, and snakemake will use them as needed to build the concrete files requested.
To debug WorkflowErrors as above,
There are many advanced features to snakemake, and we'll touch on a few of them here.
conda:
and --use-conda
If you specify a conda environment file, in an conda:
block in a rule, and run conda with --use-conda
, it will always run that rule in that software environment.
This is useful when you want to version-pin a specific action, and/or have conflicting software in different rules.
You can tell snakemake to run things in parallel by doing
snakemake -j 2
This will tell snakemake that it can run up to two jobs at a time. (It automatically figures out which jobs can be run at the same time by looking at the workflow graph.)
We'll show you how to do this on a larger scale on a cluster in a few weeks :)
R integration
Python integration
General advice:
Pick a small, linear workflow, and then:
alternatively, if you have a complex workflow that would take a lot of time and energy to convert,
what do (snakemake) workflows do for me?
A laundry list:
For me, the main reason to use snakemake is that it lets be sure that my workflow completed properly. snakemake tracks which commands fails, and will stop the workflow in its tracks! This is not something that you usually do in shell scripts.
(It turns out that I've spent a lot of my life being a bit paranoid about whether my commands actually ran correctly!)
Workflows can get really complicated; here, for example, is one for our most recent paper. But it's all just using the building blocks that I showed you above!
If you want to see some good examples of how to build nice, clean, simple looking workflows, check out this RNAseq example.
google is your friend!
The first three 201(b) class materials are a fairly gentle introduction: https://hackmd.io/YaM6z84wQeK619cSeLJ2tg#Schedule-of-lab-topics
here's another tutorial I wrote… link
a free book! – the Snakemake book
angus 2019 material – Workflow Management using Snakemake
Workflows can get really complicated; here, for example, is one for our most recent paper. But it's all just using the building blocks that I showed you above!
If you want to see some good examples of how to build nice, clean, simple looking workflows, check out this RNAseq example.
If you get the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
then what's happening is you're trying to directly run a rule with a wildcard in it (i.e. an abstract rule), and snakemake can't figure out what the wildcard should be; instead, ask snakemake to build a specific file.
For example, with the rule immediately above that adds wildcards to map_reads
,
snakemake map_reads
will complain about the target rule containing wildcards. You should instead run
snakemake SRR2584857_1.sam
which snakemake can use to figure out what the wildcards should be.
An alternative to specifying the file on the command line is to put it in the default rule, e.g. rule all:
(see the section on default rules) and then you can run snakemake
.
We can use the -n
gives a dry run of the Snakefile. For example snakemake -p -n
snakemake <filename>
output:
?