notes from snakemake live stream

# notes from snakemake live stream C. Titus Brown Jan 20, 2021 ctbrown@ucdavis.edu [toc] ## Recording A [recording of the livestream is available.](https://video.ucdavis.edu/media/1_ygjgu1ma) ## Getting started w/RStudio in a binder I'll be using [a binder](mybinder.org/) to do everything on an anonymous remote cloud computer. [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/binder-examples/r-conda/master?urlpath=rstudio) To configure my terminal in the RStudio terminal window on the binder, I'll run: ``` conda init echo "PS1='\w $ '" >> .bashrc ``` and then restart the terminal. Then I'll add two channels to conda so that I can install necessary software, ``` conda config --add channels conda-forge conda config --add channels bioconda ``` ## The commands for variant calling I'm going to start with the following minimap2-based variant pipeline that ends by producing a VCF file. ``` # download some data wget https://osf.io/4rdza/download -O SRR2584857_1.fastq.gz # download a reference genome wget https://osf.io/8sm92/download -O ecoli-rel606.fa.gz # run minimap2 minimap2 -ax sr ecoli-rel606.fa.gz SRR2584857_1.fastq.gz > SRR2584857_1.ecoli-rel606.sam # convert SAM to BAM samtools view -b -F 4 SRR2584857_1.ecoli-rel606.sam > SRR2584857_1.ecoli-rel606.bam # sort the BAM by position samtools sort SRR2584857_1.ecoli-rel606.bam > SRR2584857_1.ecoli-rel606.bam.sorted # unzip genome for bcftools gunzip -c ecoli-rel606.fa.gz > ecoli-rel606.fa # run mpileup bcftools mpileup -Ou -f ecoli-rel606.fa SRR2584857_1.ecoli-rel606.bam.sorted > SRR2584857_1.ecoli-rel606.pileup # call variants bcftools call -mv -Ob SRR2584857_1.ecoli-rel606.pileup -o SRR2584857_1.ecoli-rel606.bcf # turn the BCF into a VCF bcftools view SRR2584857_1.ecoli-rel606.bcf > SRR2584857_1.ecoli-rel606.vcf # and done! ``` ## First stopping point Snakefile env-variant-calling.yml: ```yaml channels: - bioconda - conda-forge - defaults - https://conda.anaconda.org/conda-forge dependencies: - bcftools=1.9=h68d8f2e_9 - minimap2=2.17=hed695b0_3 - samtools=1.9=h10a08f8_12 ``` Snakefile: ```python # default rule - first one in the file rule all: input: "SRR2584857_1.ecoli-rel606.vcf" rule download_data: output: "SRR2584857_1.fastq.gz" shell: """ wget https://osf.io/4rdza/download -O {output} """ rule download_reference: output: "ecoli-rel606.fa.gz" shell: """ wget https://osf.io/8sm92/download -O {output} """ rule run_minimap: input: reads="SRR2584857_1.fastq.gz", genome="ecoli-rel606.fa.gz" output: "SRR2584857_1.ecoli-rel606.sam" conda: "env-variant-calling.yml" shell: """ minimap2 -ax sr {input.genome} {input.reads} > {output} """ rule sam_to_bam: input: rules.run_minimap.output #"SRR2584857_1.ecoli-rel606.sam" output: "SRR2584857_1.ecoli-rel606.bam" conda: "env-variant-calling.yml" shell: """ samtools view -b -F 4 {input} > {output} """ rule sort_bam: output: "SRR2584857_1.ecoli-rel606.bam.sorted" input: "SRR2584857_1.ecoli-rel606.bam" conda: "env-variant-calling.yml" shell: """ samtools sort {input} > {output} """ rule call_variants: input: genome = "ecoli-rel606.fa.gz", sorted_bam="SRR2584857_1.ecoli-rel606.bam.sorted" output: ungz_genome = "ecoli-rel606.fa", pileup = "SRR2584857_1.ecoli-rel606.pileup", bcf = "SRR2584857_1.ecoli-rel606.bcf", vcf = "SRR2584857_1.ecoli-rel606.vcf" conda: "env-variant-calling.yml" shell: """ gunzip -c {input.genome} > {output.ungz_genome} bcftools mpileup -Ou -f {output.ungz_genome} {input.sorted_bam} > {output.pileup} bcftools call -mv -Ob {output.pileup} -o {output.bcf} bcftools view {output.bcf} > {output.vcf} """ ``` ## Stopping point 2: workflow with full wildcards and sample dictionary ```python samples_dict = { 'SRR2584857_1': 'https://osf.io/aksmc/download', 'SRR2584403_1': 'https://osf.io/6jx7z/download', 'SRR2584404_1': 'https://osf.io/s24ky/download', 'SRR2584405_1': 'https://osf.io/7qek6/download', } print('rule all contains', expand("{s}.ecoli-rel606.vcf", s=list(samples_dict))) # default rule - first one in the file rule all: input: expand("{s}.ecoli-rel606.vcf", s=list(samples_dict)) ## all of the rules below here are now generic for ANY sample, ## as long as {sample}.fastq.gz exists, you can produce any of the ## rule outputs below. def lookup_url(wildcards): return samples_dict[wildcards.sample] rule download_data: output: "{sample}.fastq.gz" params: # use params instead of input for this, because # params do not need to be file names that exist ;) download_link = lookup_url shell: """ wget {params.download_link} -O {output} """ rule download_reference: output: "ecoli-rel606.fa.gz" shell: """ wget https://osf.io/8sm92/download -O {output} """ rule run_minimap: input: reads="{sample}.fastq.gz", genome="ecoli-rel606.fa.gz" output: "{sample}.ecoli-rel606.sam" conda: "env-variant-calling.yml" shell: """ minimap2 -ax sr {input.genome} {input.reads} > {output} """ rule sam_to_bam: input: rules.run_minimap.output output: "{sample}.ecoli-rel606.bam" conda: "env-variant-calling.yml" shell: """ samtools view -b -F 4 {input} > {output} """ rule sort_bam: output: "{sample}.ecoli-rel606.bam.sorted" input: "{sample}.ecoli-rel606.bam" conda: "env-variant-calling.yml" shell: """ samtools sort {input} > {output} """ rule ungz_genome: input: "ecoli-rel606.fa.gz", output: "ecoli-rel606.fa" shell: "gunzip -c {input} > {output}" rule call_variants: input: sorted_bam="{sample}.ecoli-rel606.bam.sorted", ungz_genome = "ecoli-rel606.fa", output: pileup = "{sample}.ecoli-rel606.pileup", bcf = "{sample}.ecoli-rel606.bcf", vcf = "{sample}.ecoli-rel606.vcf" conda: "env-variant-calling.yml" shell: """ bcftools mpileup -Ou -f {input.ungz_genome} {input.sorted_bam} > {output.pileup} bcftools call -mv -Ob {output.pileup} -o {output.bcf} bcftools view {output.bcf} > {output.vcf} """ ``` ## Stopping point 3: loading samples from a CSV file instead Put this at the top of the Snakefile: ```python import csv # loading the sample names and URLs from 'samples.csv' with open('samples.csv', 'rt') as fp: r = csv.DictReader(fp) samples_dict = {} for row in r: name = row['sample'] url = row['url'] samples_dict[name] = url ``` and write a `samples.csv` file like so: ```csvpreview= sample,url SRR2584857_1,https://osf.io/aksmc/download SRR2584403_1,https://osf.io/6jx7z/download SRR2584404_1,https://osf.io/s24ky/download SRR2584405_1,https://osf.io/7qek6/download ``` ## Other tutorials and links http://ivory.idyll.org/blog/2020-snakemake-hacks-collections-files.html https://training.nih-cfde.org/en/latest/Bioinformatics-Skills/Snakemake/

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.