# POOH: Quick View on Automating Options

###### tags: `POOH` `automation` `bash` `batch` `snakemake`

[TOC]

Automating your tasks saves time: one command runs after another without you having to wait at the computer and retype each command on the command line. It also reduces the errors that come from repeatedly typing out commands, and it helps document your process. An automation file is a prepackaged set of instructions describing how each sample was analyzed, and it can even record the version of the software used. These files can be shared so that others can more easily reproduce your analysis.

## Bash Scripts

A Bash script or shell script is a plain text file which contains a series of commands. These commands are a mixture of commands we would normally type ourselves on the command line (such as `gzip` or `echo`, for example) and commands we could type on the command line but generally wouldn't. An important point to remember, though, is: anything you can run normally on the command line can be put into a script and it will do exactly the same thing. Similarly, anything you can put into a script can also be run normally on the command line and it will do exactly the same thing. ([RyansTutorials.net](https://ryanstutorials.net/bash-scripting-tutorial/bash-script.php))

### Creating a Bash Script

A Bash script is most useful for executing repetitive tasks without having to continuously type the commands on the command line. To create a Bash script, use a text editor such as `nano` or `vim` to create a text file with the `.sh` file extension. For example:

> bash_script.sh

In the text file, add `#!/bin/bash` at the top of the file, then type your commands below it:

> #!/bin/bash
>
> echo Welcome to Pooh!


Then use `bash bash_script.sh` to run your Bash script.

Another example would be:

> #!/bin/bash
>
> md5sum *.fastq.gz

:::info
`md5sum` is used to check whether a file has been changed or corrupted. It computes a checksum from the contents of each file, and that checksum changes when the file is edited or altered in any other way.
:::

* Bash scripts are also great for `for` loops:

>#!/bin/bash
>for i in *.fastq.gz
>do sourmash sketch dna "$i" -p abund -o "$i.sig"
>done

:::info
This loop, and commands like it, are useful for doing repetitive tasks on multiple samples. If there are any errors in your output, it will be easy to go back and find the error in the Bash script.
:::

**This is not an exhaustive list of things you can do with a Bash script. For more information on Bash scripts, see: [Automating your analyses and executing long-running analyses on remote computers](https://ngs-docs.github.io/2021-august-remote-computing/automating-your-analyses-and-executing-long-running-analyses-on-remote-computers.html#what-is-a-script)**

## sBatch Scripts

A batch file is a script file that stores commands to be executed in serial order. It helps automate routine tasks without requiring user input or intervention. Some common applications of batch files include loading programs, running multiple processes, or performing repetitive actions in a sequence in the system. Also known as a batch job, a batch file is a text file created in Notepad or some other text editor. A batch file bundles or packages a set of commands into a single file in serial order. Without a batch file, these commands would have to be presented one at a time to the system from a keyboard. Usually, a batch file is created for command sequences when a user has a repetitive need. A command-line interpreter takes the file as an input and executes the commands in the given order. A batch file eliminates the need to retype commands, which saves the user time and helps to avoid mistakes.
It is also useful for simplifying complex processes. ([TechTarget](https://www.techtarget.com/searchwindowsserver/definition/batch-file))

### Creating an sBatch Script

sBatch scripts are very similar to Bash scripts, except that they are submitted to the Slurm workload manager using `sbatch`. You create an sbatch script the same way you would a Bash script, beginning with `#!/bin/bash`: use a text editor to create a new script with the `.sh` file extension, for example `batch_script.sh`.

```
#!/bin/bash
#
#SBATCH --mail-user=<email>@ucdavis.edu  # YOUR EMAIL ADDRESS
#SBATCH --mail-type=ALL                  # NOTIFICATIONS OF SLURM JOB STATUS - ALL, NONE, BEGIN, END, FAIL, REQUEUE
#SBATCH -J Pooh                          # JOB NAME
#SBATCH -e Pooh.j%j.err                  # STANDARD ERROR FILE TO WRITE TO
#SBATCH -o Pooh.j%j.out                  # STANDARD OUTPUT FILE TO WRITE TO
#SBATCH -c 1                             # NUMBER OF PROCESSORS PER TASK
#SBATCH --mem=1Gb                        # MEMORY POOL TO ALL CORES
#SBATCH --time=00-00:05:00               # REQUESTED WALL TIME
#SBATCH -p high2                         # PARTITION TO SUBMIT TO

cp *.fastq.gz ~/test/double
```

:::info
Be sure to add your email address to the batch file and change the parameters as necessary.
:::

To run your sbatch file, use:

>sbatch batch_script.sh

You will then receive a submitted batch job number. Once the job is done, you will also have a `.err` and a `.out` file.

If I wanted to run the `sourmash sketch` command above, I would use the script below. Slurm would run the job in the background once the requested resources are available on the cluster.

```
#!/bin/bash
#
#SBATCH --mail-user=<email>@ucdavis.edu  # YOUR EMAIL ADDRESS
#SBATCH --mail-type=ALL                  # NOTIFICATIONS OF SLURM JOB STATUS - ALL, NONE, BEGIN, END, FAIL, REQUEUE
#SBATCH -c 1                             # NUMBER OF PROCESSORS PER TASK
#SBATCH --mem=120Gb                      # MEMORY POOL TO ALL CORES
#SBATCH --time=00-24:05:00               # REQUESTED WALL TIME
#SBATCH -p med2                          # PARTITION TO SUBMIT TO

# Sketch every gzipped FASTQ file in the current directory
for i in *.fastq.gz
do
    sourmash sketch dna "$i" -p abund -o "$i.sig"
done
```

**This is not an exhaustive list of things you can do with a batch script.
For more information on batch scripts, see: [Executing large analyses on HPC clusters with slurm](https://ngs-docs.github.io/2021-august-remote-computing/executing-large-analyses-on-hpc-clusters-with-slurm.html#or-submit-batch-scripts-with-sbatch)**

## Snakemake Files

A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (for example, the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names. The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either on the same line or indented and consisting of multiple lines. The resulting syntax resembles that of original Python constructs. ([snakemake.readthedocs](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html))

Example of a Snakemake file:

```
rule download_data:
    conda: "env-wget.yml"
    shell: """
        wget https://osf.io/4rdza/download -O SRR2584857_1.fastq.gz
    """

rule download_genome:
    conda: "env-wget.yml"
    shell:
        "wget https://osf.io/8sm92/download -O ecoli-rel606.fa.gz"

rule map_reads:
    conda: "env-minimap.yml"
    shell: """
        minimap2 -ax sr ecoli-rel606.fa.gz SRR2584857_1.fastq.gz > SRR2584857_1.ecoli-rel606.sam
    """

rule sam_to_bam:
    conda: "env-minimap.yml"
    shell: """
        samtools view -b -F 4 SRR2584857_1.ecoli-rel606.sam > SRR2584857_1.ecoli-rel606.bam
    """

rule sort_bam:
    conda: "env-minimap.yml"
    shell: """
        samtools sort SRR2584857_1.ecoli-rel606.bam > SRR2584857_1.ecoli-rel606.bam.sorted
    """

rule call_variants:
    conda: "env-bcftools.yml"
    shell: """
        gunzip -c ecoli-rel606.fa.gz > ecoli-rel606.fa
        bcftools mpileup -Ou -f ecoli-rel606.fa SRR2584857_1.ecoli-rel606.bam.sorted > SRR2584857_1.ecoli-rel606.pileup
        bcftools call -mv -Ob SRR2584857_1.ecoli-rel606.pileup -o SRR2584857_1.ecoli-rel606.bcf
        bcftools view SRR2584857_1.ecoli-rel606.bcf > SRR2584857_1.ecoli-rel606.vcf
    """
```

Each rule in the Snakefile can be run on its own by naming it on the command line, as in the commands shown below.

> snakemake -p -j 1 --use-conda map_reads
> snakemake -p -j 1 --use-conda sam_to_bam
> snakemake -p -j 1 --use-conda sort_bam
> snakemake -p -j 1 --use-conda call_variants

[Things to keep in mind:](https://hackmd.io/D1J_6CCDQluLktknHb8yFg?view)

* rules are just shell commands, with a bit of “decoration”. You could run them yourself if you wanted, outside of snakemake.
* the order of the rules in the Snakefile doesn’t matter
* rules can have one or more shell commands, one after the other on their own lines
* `snakemake -p` prints the command that is being executed
* `-j 1` says “use only one CPU”
* `--use-conda` says to install software as specified in the `conda:` block.
* snakemake prints things out in red if it fails.
* it’s all case sensitive
* tabs and spacing matter.
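
The example rules above do not declare any files, so Snakemake has no file names to match when working out dependencies between them. As a minimal sketch, assuming the same file names and conda environment files as the example, `input:` and `output:` blocks could be added to the first three rules as shown here. The `{input...}` and `{output}` placeholders are standard Snakemake syntax; the key names `ref` and `reads` are labels chosen purely for illustration and are not part of the original Snakefile.

```
rule download_data:
    conda: "env-wget.yml"
    output: "SRR2584857_1.fastq.gz"
    shell: """
        wget https://osf.io/4rdza/download -O {output}
    """

rule download_genome:
    conda: "env-wget.yml"
    output: "ecoli-rel606.fa.gz"
    shell: """
        wget https://osf.io/8sm92/download -O {output}
    """

# "ref" and "reads" are illustrative key names (an assumption, not from the original)
rule map_reads:
    conda: "env-minimap.yml"
    input:
        ref="ecoli-rel606.fa.gz",
        reads="SRR2584857_1.fastq.gz"
    output: "SRR2584857_1.ecoli-rel606.sam"
    shell: """
        minimap2 -ax sr {input.ref} {input.reads} > {output}
    """
```

With declarations like these, asking for the output file by name, for example `snakemake -p -j 1 --use-conda SRR2584857_1.ecoli-rel606.sam`, should lead Snakemake to run both download rules first when their outputs are missing, because their `output:` names match the `input:` names of `map_reads`.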