POOH: Quick View on Automating Options

tags: POOH automation bash batch snakemake

Automating your tasks saves time: one command runs after the other without you having to wait at the computer and retype each command on the command line. It also reduces the errors that come from repeatedly typing out commands, and it helps document your process. An automation file offers prepackaged instructions showing how each sample was analyzed and can even include the version of the software used. These files can be shared so that others can more easily replicate your analysis.

Bash Scripts

A Bash script, or shell script, is a plain text file that contains a series of commands. These commands are a mixture of commands we would normally type ourselves on the command line (such as gzip or echo) and commands we could type on the command line but generally wouldn't. An important point to remember, though, is:
Anything you can run normally on the command line can be put into a script and it will do exactly the same thing. Similarly, anything you can put into a script can also be run normally on the command line and it will do exactly the same thing. (RyanTutorials.net)

Creating a Bash Script

A Bash script is most useful for executing repetitive tasks without having to continuously type the commands on the command line.

To create a Bash script, create a file with a text editor and give it the .sh file extension, for example bash_script.sh.

Use nano or vim to create the text file, for example:

bash_script.sh

In the text file, add #!/bin/bash as the first line, before typing in any commands below it.

#!/bin/bash

echo Welcome to Pooh!

Then use bash bash_script.sh to run your bash script
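As an alternative to calling bash explicitly, a script can be marked executable and run by its path. The block below is a minimal sketch of that idea: it writes the example script into a scratch directory (the mktemp directory and the variable names workdir and output are assumptions made for illustration), adds execute permission, and runs it.

```shell
#!/bin/bash
# Sketch: create the example script, make it executable, and run it directly.
workdir=$(mktemp -d)                    # scratch directory so nothing is overwritten

# Write the two-line example script; the quoted 'EOF' keeps the text literal.
cat > "$workdir/bash_script.sh" <<'EOF'
#!/bin/bash
echo "Welcome to Pooh!"
EOF

chmod +x "$workdir/bash_script.sh"      # add execute permission
output=$("$workdir/bash_script.sh")     # the #!/bin/bash line selects the interpreter
echo "$output"
```

Running the script by path relies on the #!/bin/bash line, which is why it belongs at the top of the file.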

Another example would be:

#!/bin/bash

md5sum *.fastq.gz

md5sum is used to check whether a file has been changed or corrupted. It computes a checksum (a fixed-length string of hexadecimal digits) from each file's contents, and that checksum changes when the file is edited or altered in any other way.
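One common way to use md5sum is to save the checksums once and verify the files against them later. The sketch below assumes GNU md5sum (as found on most Linux systems) and uses two tiny stand-in files in a scratch directory; the file names and the checksums.md5 name are made up for illustration.

```shell
#!/bin/bash
# Sketch: record checksums once, then verify the files later.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small stand-in files (real data would be your own .fastq.gz files).
printf 'ACGT\n' | gzip > sample1.fastq.gz
printf 'TTGA\n' | gzip > sample2.fastq.gz

md5sum *.fastq.gz > checksums.md5      # one "checksum  filename" line per file
md5sum -c checksums.md5                # reports OK for each unchanged file
check_status=$?                        # 0 means every file matched

printf 'extra\n' >> sample1.fastq.gz   # simulate a change or corruption
md5sum -c checksums.md5                # now reports FAILED for sample1.fastq.gz
changed_status=$?                      # non-zero because one file no longer matches
```

Keeping the checksum file alongside the data lets anyone who receives the files confirm that they arrived intact.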

  • Bash scripts are also great for running for loops

#!/bin/bash
for i in *.fastq.gz
do sourmash sketch dna "$i" -p abund -o "$i.sig"
done

This command, and others like it, is useful for doing repetitive tasks on multiple samples. If you have any errors in your output, it will be easy to go back and find the error in the Bash script.
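The loop pattern can be made a little more defensive. The sketch below is one possible variant, not the only way to write it: gzip -t stands in for the real per-sample command, the sample files are created in a scratch directory, and the names sample, processed, and the .log outputs are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch of a more defensive loop; gzip -t is a placeholder for the real command.
shopt -s nullglob                       # if no file matches, the loop body is skipped

workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'ACGT\n' | gzip > sampleA.fastq.gz
printf 'TTGA\n' | gzip > sampleB.fastq.gz

processed=0
for i in *.fastq.gz
do
    sample="${i%.fastq.gz}"             # strip the extension: sampleA, sampleB
    gzip -t "$i"                        # placeholder step: test that the file is intact
    echo "checked $i" > "$sample.log"   # one output file per sample
    processed=$((processed + 1))
done
echo "processed $processed files"
```

Quoting "$i" keeps file names with spaces intact, and deriving the output name from the sample name avoids outputs that end in .fastq.gz.log.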

This is not an exhaustive list of things you can do with a bash script. For more information on bash scripts see: Automating your analyses and executing long-running analyses on remote computers

sBatch Scripts

A batch file is a script file that stores commands to be executed in a serial order. It helps automate routine tasks without requiring user input or intervention. Some common applications of batch files include loading programs, running multiple processes or performing repetitive actions in a sequence in the system.

Also known as a batch job, a batch file is a text file created in Notepad or some other text editor. A batch file bundles or packages a set of commands into a single file in serial order. Without a batch file these commands would have to be presented one at a time to the system from a keyboard.

Usually, a batch file is created for command sequences when a user has a repetitive need. A command-line interpreter takes the file as an input and executes the commands in the given order. A batch file eliminates the need to retype commands, which saves the user time and helps to avoid mistakes. It is also useful to simplify complex processes. (TechTarget)

Creating an sbatch Script

sbatch scripts are very similar to Bash scripts, except that they are submitted to the Slurm workload manager using the sbatch command.

You create an sbatch script the same way you would a Bash script: use a text editor to create a new script with the .sh file extension, for example batch_script.sh, and begin it with #! /bin/bash. The #SBATCH lines that follow tell Slurm which resources the job needs.

#! /bin/bash
#
#SBATCH --mail-user=<email>@ucdavis.edu         # YOUR EMAIL ADDRESS
#SBATCH --mail-type=ALL                         # NOTIFICATIONS OF SLURM JOB STATUS - ALL, NONE, BEGIN, END, FAIL, REQUEUE
#SBATCH -J Pooh                           # JOB NAME
#SBATCH -e Pooh.j%j.err                   # STANDARD ERROR FILE TO WRITE TO
#SBATCH -o Pooh.j%j.out                   # STANDARD OUTPUT FILE TO WRITE TO
#SBATCH -c 1                                    # NUMBER OF PROCESSORS PER TASK
#SBATCH --mem=1Gb                               # MEMORY POOL TO ALL CORES
#SBATCH --time=00-00:05:00                      # REQUESTED WALL TIME
#SBATCH -p high2                                # PARTITION TO SUBMIT TO

cp *.fastq.gz ~/test/double

Be sure to add your email address to the batch file and change the parameters as necessary.

To run your sbatch file use

sbatch batch_script.sh

You will then receive a submitted batch job number. After the job is done, you will also find a .err file and a .out file holding the job's standard error and standard output.
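Because the #SBATCH lines are ordinary comments to Bash, the body of an sbatch script can be tried out with plain bash before it is submitted. The sketch below illustrates this under a few assumptions: the job name Pooh_test and the variable names are hypothetical, and the fallback values after :- apply only when the script runs outside Slurm.

```shell
#!/bin/bash
#SBATCH -J Pooh_test                 # job name (hypothetical)
#SBATCH -c 1                         # number of processors per task
#SBATCH --time=00-00:05:00           # requested wall time

# Slurm sets these variables inside a job; the defaults apply under plain bash.
job_id="${SLURM_JOB_ID:-local}"
cpus="${SLURM_CPUS_PER_TASK:-1}"

# Report where and how the script is running.
summary="job=$job_id cpus=$cpus host=$(uname -n)"
echo "$summary"
```

Run with bash, this prints job=local; submitted with sbatch, the same line would show the job number assigned by Slurm.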

If I wanted to run the sourmash sketch command above, I would use the script below. Slurm queues the job and runs it in the background once the requested memory and processors are available on the cluster.

#! /bin/bash
#
#SBATCH --mail-user=<email>@ucdavis.edu         # YOUR EMAIL ADDRESS
#SBATCH --mail-type=ALL                         # NOTIFICATIONS OF SLURM JOB STATUS - ALL, NONE, BEGIN, END, FAIL, REQUEUE
#SBATCH -c 1                                    # NUMBER OF PROCESSORS PER TASK
#SBATCH --mem=120Gb                               # MEMORY POOL TO ALL CORES
#SBATCH --time=00-24:05:00                      # REQUESTED WALL TIME
#SBATCH -p med2

for i in *.fastq.gz
do sourmash sketch dna "$i" -p abund -o "$i.sig"
done

This is not an exhaustive list of things you can do with an sbatch script. For more information on sbatch scripts see: Executing large analyses on HPC clusters with slurm

Snakemake Files

A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (for example, the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.

The Snakemake language extends the Python language, adding syntactic structures for rule definition and additional controls. All added syntactic structures begin with a keyword followed by a code block that is either in the same line or indented and consisting of multiple lines. The resulting syntax resembles that of original Python constructs. (snakemake.readthedocs)

Example of a Snakemake File:

rule download_data:
    conda: "env-wget.yml"
    shell: """
        wget https://osf.io/4rdza/download -O SRR2584857_1.fastq.gz
    """

rule download_genome:
    conda: "env-wget.yml"
    shell:
        "wget https://osf.io/8sm92/download -O ecoli-rel606.fa.gz"

rule map_reads:
    conda: "env-minimap.yml"
    shell: """
        minimap2 -ax sr ecoli-rel606.fa.gz SRR2584857_1.fastq.gz > SRR2584857_1.ecoli-rel606.sam
    """

rule sam_to_bam:
    conda: "env-minimap.yml"
    shell: """
        samtools view -b -F 4 SRR2584857_1.ecoli-rel606.sam > SRR2584857_1.ecoli-rel606.bam
     """

rule sort_bam:
    conda: "env-minimap.yml"
    shell: """
        samtools sort SRR2584857_1.ecoli-rel606.bam > SRR2584857_1.ecoli-rel606.bam.sorted
    """

rule call_variants:
    conda: "env-bcftools.yml"
    shell: """
        gunzip -c ecoli-rel606.fa.gz > ecoli-rel606.fa
        bcftools mpileup -Ou -f ecoli-rel606.fa SRR2584857_1.ecoli-rel606.bam.sorted > SRR2584857_1.ecoli-rel606.pileup
        bcftools call -mv -Ob SRR2584857_1.ecoli-rel606.pileup -o SRR2584857_1.ecoli-rel606.bcf
        bcftools view SRR2584857_1.ecoli-rel606.bcf > SRR2584857_1.ecoli-rel606.vcf
    """

Individual rules of the Snakemake file can be run one at a time by using one of the commands shown below.

snakemake -p -j 1 --use-conda map_reads
snakemake -p -j 1 --use-conda sam_to_bam
snakemake -p -j 1 --use-conda sort_bam
snakemake -p -j 1 --use-conda call_variants

Things to keep in mind:

  • the shell: blocks are just shell commands, with a bit of “decoration”. You could run them yourself if you wanted, outside of snakemake.
  • the order of the rules in the Snakefile doesn’t matter
  • rules can have one or more shell commands, one after the other on their own lines
  • snakemake -p prints the command that is being executed
  • -j 1 says “use only one CPU”
  • --use-conda says to install software as specified in the conda: block.
  • snakemake prints things out in red if it fails.
  • it’s all case sensitive
  • tabs and spacing matter.
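The rules in the example Snakefile list only shell commands. The file-name matching described at the start of this section relies on input: and output: declarations, and the sketch below shows one way these might look. It is an illustration rather than part of the original workflow: the rule names compress and checksum, the file reads.txt, and the scratch directory are all assumptions, and the dry run is attempted only if snakemake is installed.

```shell
#!/bin/bash
# Sketch: write a two-rule Snakefile whose rules are linked by file names.
workdir=$(mktemp -d)
printf 'ACGT\n' > "$workdir/reads.txt"      # tiny stand-in input file

cat > "$workdir/Snakefile" <<'EOF'
rule compress:
    input: "reads.txt"
    output: "reads.txt.gz"
    shell: "gzip -c {input} > {output}"

rule checksum:
    input: "reads.txt.gz"
    output: "reads.txt.gz.md5"
    shell: "md5sum {input} > {output}"
EOF

# Dry run (-n): asking for the checksum file should also schedule "compress",
# because the output of "compress" matches the input of "checksum".
if command -v snakemake >/dev/null 2>&1; then
    snakemake -n -p -j 1 -s "$workdir/Snakefile" -d "$workdir" reads.txt.gz.md5
else
    echo "snakemake not installed; Snakefile written to $workdir/Snakefile"
fi
```

With input: and output: declared, a target file can be requested by name and Snakemake works out which rules have to run to produce it.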