NESI Workflow Seminar

# NESI Workflow Seminar ## Usefull resources - [Melbourne tutorial](https://www.melbournebioinformatics.org.au/tutorials/tutorials/cwl/media/#1) - [eResearch slides](https://jdeligt.github.io/presentations/2020_eResearch_BOF_Workflows.html#1) - [protocol](https://link.springer.com/protocol/10.1007/978-1-4939-9074-0_24) --- I'll go through the slidedeck but here's the gist of it: ![Future self](https://www.explainxkcd.com/wiki/images/3/3b/future_self.png) *source [xkcd.com]* --- # So what can we do to help our future self? --- ## code DRY (“Don’t Repeat Yourself”) *we'll leave the wet stuff for in the lab* A valuable lesson from the [how not to code](https://blog.usejournal.com/how-to-write-bad-code-the-definitive-guide-part-i-15b944e4cd3e) guidebook. --- In your own code this means embracing functions: ![Functions](https://learncodingwithgo.com/wp-content/uploads/2018/09/Screen-Shot-2018-09-03-at-8.16.42-PM.png) *source [https://learncodingwithgo.com/go-functions/]* In pipelines it means embracing workflow languages: ![workflow example](https://i.imgur.com/iyU8VIZ.png) *source [https://bcbio-nextgen.readthedocs.io/en/latest/contents/cwl.html]* --- ## It's about defining what needs to happen and having tools or pipelines that can do that and can easily be swapped for other/newer tools. --- Queue [eResearch slides](https://jdeligt.github.io/presentations/2020_eResearch_BOF_Workflows.html#1) --- # Demo time --- ### Prep I did ```bash cd /scale_wlg_persistent/filesets/project/nesi02659/workflow_workshop/ # Data wget https://www.melbournebioinformatics.org.au/tutorials/tutorials/cwl/media/data/data.tar.gz tar -zxvf data.tar.gz # Tool definitions git clone https://github.com/common-workflow-language/workflows # A genomics workflow git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter.git ``` --- Prep NeSI environment ```bash module load cwltool/3.0.20200317203547-gimkl-2020a-Python-3.8.2 module load Clustal-Omega/1.2.2-gimkl-2017a module load snakemake/5.10.0-gimkl-2020a-Python-3.8.2 ``` --- ## Using pre defined tools ### BWA-MEM ```cwl #!/usr/bin/env cwl-runner cwlVersion: v1.0 class: CommandLineTool requirements: DockerRequirement: dockerPull: "quay.io/biocontainers/bwa:0.7.17--ha92aebf_3" inputs: InputFile: type: File[] format: http://edamontology.org/format_1930 # FASTA inputBinding: position: 201 Index: type: File inputBinding: position: 200 secondaryFiles: - .fai - .amb - .ann - .bwt - .pac - .sa #Optional arguments Threads: type: int? inputBinding: prefix: "-t" MinSeedLen: type: int? inputBinding: prefix: "-k" ... ``` ```bash # Running it cwltool --singularity workflows/tools/bwa-mem.cwl \ --reads core/mutant_R1.fastq \ --reads core/mutant_R2.fastq \ --reference core/wildtype.fna \ --output_filename mutant.sam ``` Output ```bash [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 24960 sequences (3744000 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 12080, 0, 0) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (372, 398, 425) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (266, 531) [M::mem_pestat] mean and std.dev: (398.41, 39.71) [M::mem_pestat] low and high boundaries for proper pairs: (213, 584) [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] skip orientation RR as there are not enough pairs [M::mem_process_seqs] Processed 24960 reads in 1.281 CPU sec, 1.281 real sec [main] Version: 0.7.12-r1039 [main] CMD: bwa mem /var/lib/cwl/stgeafa3970-3893-4462-80a8-b3d02c449898/wildtype.fna /var/lib/cwl/stg69a63b06-1720-4bb9-8860-33393993ad0d/mutant_R1.fastq /var/lib/cwl/stg65485d7b-a5db-475e-8335-1ebcbe42cb76/mutant_R2.fastq [main] Real time: 1.447 sec; CPU: 1.355 sec INFO [job bwa-mem.cwl] completed success { "output": { "location": "file:///scale_wlg_persistent/filesets/project/nesi02659/workflow_workshop/mutant.bam", "basename": "mutant.bam", "class": "File", "checksum": "sha1$ea94193808e79eb3eb296b7fd3280652104db8cc", "size": 10035037, "path": "/scale_wlg_persistent/filesets/project/nesi02659/workflow_workshop/mutant.bam" } } ``` --- ## Using a pre defined workflow [source](https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter/blob/master/Snakemake/Snakefile) ```python TARGETS = list(map(lambda n: "../data/cluster%05d/results0-3.txt" % n, range(1, 73))) rule all: input: expand("{cwd}/{target}", cwd=os.getcwd(), target=TARGETS) rule clustal: input: "{cluster}/aa.fa" output: guidetree = "{cluster}/aa.ph", align = "{cluster}/aa.aln" shell: "clustalo -i {input} --guidetree-out={output.guidetree} > {output.align}" rule pal2nal: input: "{cluster}/aa.aln" output: "{cluster}/alignment.phy" shell: "pal2nal.pl -output paml {input} {wildcards.cluster}/nt.fa > {output}" rule codeml: input: "{cluster}/alignment.phy" output: "{cluster}/results0-3.txt" shell: "cd {wildcards.cluster}; echo | codeml ../paml0-3.ctl" ``` ### Running it snakemake ```bash cd scalability-reproducibility-chapter/Snakemake snakemake ``` ### Running it CWL ```bash cd scalability-reproducibility-chapter/ cwltool --singularity CWL/workflow.cwl --clusters data ``` ----- # Unlocking the potential - [bio-tools](https://github.com/common-workflow-library/bio-cwl-tools) - [workflow viewer](https://view.commonwl.org/workflows) - [human genomics specific workflows](https://github.com/genome/analysis-workflows) - many many more ### Standing on the shoulders of really tall people & awesome groups ![Demo](https://s20709.pcdn.co/wp-content/uploads/2017/07/2017-04-28-2.gif) *source SevenBridges*