HybPiper 2 Workflow

--- title: HybPiper 2 Workflow breaks: false tags: HybSeq --- > This document is a guide for running HybPiper 2 on Fabronia (or quest??). HybPiper's github and documentation can be found ++[here](https://github.com/mossmatters/HybPiper/wiki)++. See ++[this note](https://hackmd.io/emTOFHJaQNmkjsIwJWxZ6Q)++ for running HybPiper 1. >[color=#907bf7] ## Installing HybPiper 2 > More details ++[here](https://github.com/mossmatters/HybPiper/wiki/Installation)++ > [color=#907bf7] Installing HybPiper on Fabronia should only require creating a conda environment (shouldn't need to add channels). In Quest you need to add bioconda and conda-forge chanels as follows. If you need help setting up/managing your conda environment feel free to reach out to me. ```bash= conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge conda config --set channel_priority strict ``` ```bash= #Create a conda environment called "hybpiper" with hybpiper installed conda create --name hybpiper -c chrisjackson-pellicle hybpiper ``` Answer yes to prompts. To test if the installation worked, activate the environment and test to see if hybpiper commands work. ```bash= #Activate the conda environment. The (base) in front of your username #should change to (hybpiper) conda activate hybpiper #Then run test commands and see if hybpiper prompts appear hybpiper --help #or hybpiper --version #or hybpiper check_dependencies ``` ## File Preparation Most files coming straight from the sequencing facility have a few extra characters in the names. Clean up the file names that have something like samplename_S79_R1_001.fastq.gz: ``` #to remove the S79: for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done #to remove the _001_: for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done #do note that if you are trying to clean up something like .1. in a file name, in sed, "."" means the first character in the file name and that will be what is edited. ``` ### Trimmomatic Before sending your raw read files into HybPiper, it's a good idea to pass them through software like ++[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)++ to filter out low quality reads, remove adapter sequences, etc. Trimmomatic Arguments - `PE -phred33` refers to the quality scoring scheme that is used in the raw reads - `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that trimmomatic acts on - `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files - The remaining arguments informs trimmomatic's decision in keeping/removing sequences ```bash! #To run trimmomatic on one sample, cd into the folder containing your raw reads and run: java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 input_R1.fastq.gz input_R2.fastq.gz output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 ## #For loop to run on folder of samples for _R1 in *_R1*; do R2=${_R1/_R1.fastq.gz/_R2.fastq.gz} R1p=${_R1/\.fastq.gz/_paired.fastq.gz} R1u=${_R1/\.fastq.gz/_unpaired.fastq.gz} R2p=${R2/\.fastq.gz/_paired.fastq.gz} R2u=${R2/\.fastq.gz/_unpaired.fastq.gz} echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1 $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh done ``` You will now have a shell script that you can bash in order to run trimmomatic on the files. ``` #make the shell script executable: chmod +x trimmomatic.sh #bash the shell script ./trimmomatic.sh ``` ##### Note: for single end reads, the trimmomatic command should be modified: ``` java -jar trimmomatic-0.35.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 ``` A sample batch script: ``` #!/bin/bash #SBATCH --account=pXXXXX #SBATCH --partition=normal #SBATCH --time=8:00:00 #SBATCH --mem=5G #SBATCH --nodes=1 #SBATCH --ntasks-per-node=5 #SBATCH --job-name=trimmomatic #SBATCH --output=trimmomatic.log #SBATCH --error=error.trimmomatic.log #SBATCH --mail-type=ALL #SBATCH --mail-user=youremail@email.com #conda activate hybpiper #or #module load java ./trimmomatic.sh ``` If you want to include the unpaired reads into HybPiper's assembly, you need to concatenate the two files together (combine them into one file). ```bash #To concatenate the unpaired files for each sample: for file in *R1_unpaired* do file2=${file//_R1_/_R2_} file3=${file//R1_unpaired/unpaired} cat $file $file2 > $file3 done ``` Take a moment to organize your files. Make a new directory for your ./raw_reads and within that, a new directory for your ./unconcatenated_unpaired_reads. You only need R1_paired R2_paired and the concatenated _unpaired files for the next steps, but you need to keep the raw read files. ## Running HybPiper 2 ### Assembly HybPiper's first command to run is called `assemble` and is for sequence assembly and extraction. This process is resource intensive and will take a long time. In a folder containing the quality filtered raw reads, run these commands. NOTES ABOUT ASSEMBLY - The syntax below will need to change depending on the naming scheme of your files - It may be a good idea to back up your quality filtered raw reads somewhere (if storage space is not an issue to worry about) - Here I've redirected the output of this for loop into a text file/shell script. That way, you can check to see if all the variables were correctly stored before running this step. To run the shell script, type bash assemble - Arguments: - -t_dna: is the bait file - -r: the paired read files - --unpaired: the concatenated unpaired read file - --bwa: using _nucleotide_ target files - --run_intronerate: collect intron regions as well (you may not want this depending on your goals) - See github for additional arguements/details ```bash! #Here we make a for loop to run `hybpiper assemble` on all samples within a folder. for i in *R1_paired.fastq.gz do f2=${i//_R1_/_R2_} f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz} f4=${i//_R1_paired.fastq.gz/} echo hybpiper assemble -t_dna Malinae_optimized_353.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh done #If you are running Hybpiper 2.1.6 or newer versions, Intronerate is run by default and you have to remove --run_intronerate from the script. #For loop is outputed to a script called assemble.sh... Check the script before running the below line bash assemble.sh ``` ### Summary Statistics HybPiper also includes a number of commands that produce statistics about the assmebled sequences. This is separate from retrieving the sequences (described below). In the directory containing HybPiper's outputs and type these commands to generate summary statistics. make a namefile.txt ``` for i in *R1_paired.fastq.gz do name=${i//_R1_paired.fastq.gz/} echo $name >> namefile.txt done ``` NOTES ABOUT STATISTICS - Arguments - -t_dna: The nucleotide bait file - gene: "gene" if you want to generate statistics about the gene sequence, "supercontig" if you want supercontig statistics (requires intronerate) - namefile.txt: a text file containing individual names, one on each line ```bash=! #This line generates some tsv's containing statistics about each individual's assemblies hybpiper stats -t_dna Malinae_optimized_353.fasta gene namefile.txt #This line generates a heatmap based off of the summary statistics in the previous line hybpiper recovery_heatmap seq_lengths.tsv ``` ### Retrieve Sequences HybPiper offers the ability to retrieve the assembled sequences from each individual for each of the targets. Alternatively, Other downstream analyses often do this as well, so this step may be skipped. ```bash=! hybpiper retrieve_sequences dna -t_dna Malinae_optimized_353.fasta --sample_names namefile.txt ```