owned this note
owned this note
Published
Linked with GitHub
# notes from snakemake live stream
C. Titus Brown
Jan 20, 2021
ctbrown@ucdavis.edu
[toc]
## Recording
A [recording of the livestream is available.](https://video.ucdavis.edu/media/1_ygjgu1ma)
## Getting started w/RStudio in a binder
I'll be using [a binder](mybinder.org/) to do everything on an anonymous remote cloud computer.
[![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=rstudio)
To configure my terminal in the RStudio terminal window on the binder, I'll run:
```
conda init
echo "PS1='\w $ '" >> .bashrc
```
and then restart the terminal.
Then I'll add two channels to conda so that I can install necessary software,
```
conda config --add channels conda-forge
conda config --add channels bioconda
```
## The commands for variant calling
I'm going to start with the following minimap2-based variant pipeline that ends by producing a VCF file.
```
# download some data
wget https://osf.io/4rdza/download -O SRR2584857_1.fastq.gz
# download a reference genome
wget https://osf.io/8sm92/download -O ecoli-rel606.fa.gz
# run minimap2
minimap2 -ax sr ecoli-rel606.fa.gz SRR2584857_1.fastq.gz > SRR2584857_1.ecoli-rel606.sam
# convert SAM to BAM
samtools view -b -F 4 SRR2584857_1.ecoli-rel606.sam > SRR2584857_1.ecoli-rel606.bam
# sort the BAM by position
samtools sort SRR2584857_1.ecoli-rel606.bam > SRR2584857_1.ecoli-rel606.bam.sorted
# unzip genome for bcftools
gunzip -c ecoli-rel606.fa.gz > ecoli-rel606.fa
# run mpileup
bcftools mpileup -Ou -f ecoli-rel606.fa SRR2584857_1.ecoli-rel606.bam.sorted > SRR2584857_1.ecoli-rel606.pileup
# call variants
bcftools call -mv -Ob SRR2584857_1.ecoli-rel606.pileup -o SRR2584857_1.ecoli-rel606.bcf
# turn the BCF into a VCF
bcftools view SRR2584857_1.ecoli-rel606.bcf > SRR2584857_1.ecoli-rel606.vcf
# and done!
```
## First stopping point Snakefile
env-variant-calling.yml:
```yaml
channels:
- bioconda
- conda-forge
- defaults
- https://conda.anaconda.org/conda-forge
dependencies:
- bcftools=1.9=h68d8f2e_9
- minimap2=2.17=hed695b0_3
- samtools=1.9=h10a08f8_12
```
Snakefile:
```python
# default rule - first one in the file
rule all:
input:
"SRR2584857_1.ecoli-rel606.vcf"
rule download_data:
output: "SRR2584857_1.fastq.gz"
shell: """
wget https://osf.io/4rdza/download -O {output}
"""
rule download_reference:
output: "ecoli-rel606.fa.gz"
shell: """
wget https://osf.io/8sm92/download -O {output}
"""
rule run_minimap:
input:
reads="SRR2584857_1.fastq.gz",
genome="ecoli-rel606.fa.gz"
output: "SRR2584857_1.ecoli-rel606.sam"
conda: "env-variant-calling.yml"
shell: """
minimap2 -ax sr {input.genome} {input.reads} > {output}
"""
rule sam_to_bam:
input:
rules.run_minimap.output
#"SRR2584857_1.ecoli-rel606.sam"
output:
"SRR2584857_1.ecoli-rel606.bam"
conda: "env-variant-calling.yml"
shell: """
samtools view -b -F 4 {input} > {output}
"""
rule sort_bam:
output: "SRR2584857_1.ecoli-rel606.bam.sorted"
input: "SRR2584857_1.ecoli-rel606.bam"
conda: "env-variant-calling.yml"
shell: """
samtools sort {input} > {output}
"""
rule call_variants:
input:
genome = "ecoli-rel606.fa.gz",
sorted_bam="SRR2584857_1.ecoli-rel606.bam.sorted"
output:
ungz_genome = "ecoli-rel606.fa",
pileup = "SRR2584857_1.ecoli-rel606.pileup",
bcf = "SRR2584857_1.ecoli-rel606.bcf",
vcf = "SRR2584857_1.ecoli-rel606.vcf"
conda: "env-variant-calling.yml"
shell: """
gunzip -c {input.genome} > {output.ungz_genome}
bcftools mpileup -Ou -f {output.ungz_genome} {input.sorted_bam} > {output.pileup}
bcftools call -mv -Ob {output.pileup} -o {output.bcf}
bcftools view {output.bcf} > {output.vcf}
"""
```
## Stopping point 2: workflow with full wildcards and sample dictionary
```python
samples_dict = {
'SRR2584857_1': 'https://osf.io/aksmc/download',
'SRR2584403_1': 'https://osf.io/6jx7z/download',
'SRR2584404_1': 'https://osf.io/s24ky/download',
'SRR2584405_1': 'https://osf.io/7qek6/download',
}
print('rule all contains', expand("{s}.ecoli-rel606.vcf", s=list(samples_dict)))
# default rule - first one in the file
rule all:
input:
expand("{s}.ecoli-rel606.vcf", s=list(samples_dict))
## all of the rules below here are now generic for ANY sample,
## as long as {sample}.fastq.gz exists, you can produce any of the
## rule outputs below.
def lookup_url(wildcards):
return samples_dict[wildcards.sample]
rule download_data:
output: "{sample}.fastq.gz"
params:
# use params instead of input for this, because
# params do not need to be file names that exist ;)
download_link = lookup_url
shell: """
wget {params.download_link} -O {output}
"""
rule download_reference:
output: "ecoli-rel606.fa.gz"
shell: """
wget https://osf.io/8sm92/download -O {output}
"""
rule run_minimap:
input:
reads="{sample}.fastq.gz",
genome="ecoli-rel606.fa.gz"
output: "{sample}.ecoli-rel606.sam"
conda: "env-variant-calling.yml"
shell: """
minimap2 -ax sr {input.genome} {input.reads} > {output}
"""
rule sam_to_bam:
input:
rules.run_minimap.output
output:
"{sample}.ecoli-rel606.bam"
conda: "env-variant-calling.yml"
shell: """
samtools view -b -F 4 {input} > {output}
"""
rule sort_bam:
output: "{sample}.ecoli-rel606.bam.sorted"
input: "{sample}.ecoli-rel606.bam"
conda: "env-variant-calling.yml"
shell: """
samtools sort {input} > {output}
"""
rule ungz_genome:
input: "ecoli-rel606.fa.gz",
output: "ecoli-rel606.fa"
shell: "gunzip -c {input} > {output}"
rule call_variants:
input:
sorted_bam="{sample}.ecoli-rel606.bam.sorted",
ungz_genome = "ecoli-rel606.fa",
output:
pileup = "{sample}.ecoli-rel606.pileup",
bcf = "{sample}.ecoli-rel606.bcf",
vcf = "{sample}.ecoli-rel606.vcf"
conda: "env-variant-calling.yml"
shell: """
bcftools mpileup -Ou -f {input.ungz_genome} {input.sorted_bam} > {output.pileup}
bcftools call -mv -Ob {output.pileup} -o {output.bcf}
bcftools view {output.bcf} > {output.vcf}
"""
```
## Stopping point 3: loading samples from a CSV file instead
Put this at the top of the Snakefile:
```python
import csv
# loading the sample names and URLs from 'samples.csv'
with open('samples.csv', 'rt') as fp:
r = csv.DictReader(fp)
samples_dict = {}
for row in r:
name = row['sample']
url = row['url']
samples_dict[name] = url
```
and write a `samples.csv` file like so:
```csvpreview=
sample,url
SRR2584857_1,https://osf.io/aksmc/download
SRR2584403_1,https://osf.io/6jx7z/download
SRR2584404_1,https://osf.io/s24ky/download
SRR2584405_1,https://osf.io/7qek6/download
```
## Other tutorials and links
http://ivory.idyll.org/blog/2020-snakemake-hacks-collections-files.html
https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/Snakemake/