---
Title: Moth processing
tags: HybSeq
---
## Trimmomatic
> Most of this section taken from Johnathan's [Hybpiper 2 Workflow page](https://hackmd.io/zLFynXJvQQ-y0-rj6nqolQ).
[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)
Most files coming straight from the sequencing facility have a few extra characters in the names. Clean up the file names that have something like samplename_S79_R1_001.fastq.gz:
```bash=
#to remove the S79:
for filename in *fastq.gz;
do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001_:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```
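If you want to preview the substitutions before renaming anything, you can run the same `sed` expressions on an example name first (a cautious sketch; `samplename` is a placeholder):

```bash=
# Dry run: apply both substitutions to an example file name and print the result.
example=samplename_S79_R1_001.fastq.gz
echo "$example" | sed -e 's/_S[0-9]*_/_/' -e 's/_001\.fastq\.gz/.fastq.gz/'
# → samplename_R1.fastq.gz
```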
Trimmomatic filters out low-quality reads, removes adapter sequences, etc. The Trimmomatic conda environment needs to be activated before running.
```bash=
#check environments
conda info --envs
# JF needs to activate genomics
conda activate genomics
conda activate trimmomatic
```
:::spoiler Trimmomatic Arguments
- `PE -phred33` refers to the quality scoring scheme that is used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files
- The remaining arguments inform trimmomatic's decisions about keeping/removing sequences
:::
```bash=
#Use nano to create a shell script and name it -> trimmomatic.sh
#test script to ensure works
java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 BC1023_24_R1.fastq.gz BC1023_24_R2.fastq.gz BC1023_24_R1_paired.fastq.gz BC1023_24_R1_unpaired.fastq.gz BC1023_24_R2_paired.fastq.gz BC1023_24_R2_unpaired.fastq.gz ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40
#For loop to run on folder of samples
for _R1_ in *_R1*; do
R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
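Since the loop only `echo`es commands into `trimmomatic.sh`, it's worth sanity-checking the generated script before submitting it. A minimal sketch (here simulating the generated file with two placeholder lines):

```bash=
# Simulate a generated trimmomatic.sh with two samples (placeholder content).
printf 'java -jar trimmomatic-0.39.jar PE -phred33 A_R1.fastq.gz A_R2.fastq.gz ...\n' >  trimmomatic_demo.sh
printf 'java -jar trimmomatic-0.39.jar PE -phred33 B_R1.fastq.gz B_R2.fastq.gz ...\n' >> trimmomatic_demo.sh
# Each sample should contribute exactly one java command line;
# the count should match your number of R1 files.
grep -c '^java -jar' trimmomatic_demo.sh   # → 2
```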
This should be run as a SLURM job. Create a SLURM submission script using nano and name it `slurm_trimmomatic.sh`
```bash=
nano slurm_trimmomatic.sh
# this will pull up a text editor; paste the text below, then save as "slurm_trimmomatic.sh"
```
```
#!/bin/bash
#SBATCH --account=p31927
#SBATCH --partition=normal
#SBATCH --time=15:00:00
#SBATCH --mem=1GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=slurm_trimmomatic
#SBATCH --output=slurm_outlog.log
#SBATCH --error=slurm_errors.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jbf420@u.northwestern.edu
module load trimmomatic/0.39
bash ./trimmomatic.sh
```
To run, use the `sbatch` command
```bash=
sbatch slurm_trimmomatic.sh
```
Trimmomatic cannot utilize more than 1GB of memory, so increasing it does not make the run faster. I had ~250 files and ~168GB of data, which took a little over 14 hours to run.
After trimmomatic runs you will have the original files, the R1 and R2 paired files, and the R1 and R2 unpaired files.
For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file.
To concatenate the two unpaired files into one
```bash=
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
```
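Because gzip streams can be concatenated directly, `cat` on the `.gz` files works as-is and `zcat` reads the combined stream. A tiny self-contained sketch (placeholder `toy_` file names, fake reads) confirming no lines are lost:

```bash=
# Create two tiny gzipped placeholder files and concatenate them.
printf 'read1\nread2\n' | gzip > toy_R1_unpaired.fastq.gz
printf 'read3\n'        | gzip > toy_R2_unpaired.fastq.gz
cat toy_R1_unpaired.fastq.gz toy_R2_unpaired.fastq.gz > toy_unpaired.fastq.gz
# zcat decompresses both gzip members; all three lines survive.
zcat toy_unpaired.fastq.gz | wc -l   # → 3
```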
Take a moment to organize your files. Make a new directory for your raw reads (`./raw_reads`) and, within that, a new directory for the unconcatenated unpaired reads (`./raw_reads/unconcatenated_unpaired_reads`). You only need the R1_paired, R2_paired, and concatenated _unpaired files for the next steps, but you need to keep the raw read files.
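A sketch of that layout, using a placeholder sample `BC1023_24` inside a throwaway `demo/` directory so nothing real is touched:

```bash=
# Build a demo directory with one sample's seven files (placeholder names).
mkdir -p demo/raw_reads/unconcatenated_unpaired_reads
touch demo/BC1023_24_R1.fastq.gz demo/BC1023_24_R2.fastq.gz \
      demo/BC1023_24_R1_paired.fastq.gz demo/BC1023_24_R2_paired.fastq.gz \
      demo/BC1023_24_R1_unpaired.fastq.gz demo/BC1023_24_R2_unpaired.fastq.gz \
      demo/BC1023_24_unpaired.fastq.gz
# Keep the original raw reads, but move them out of the working directory.
mv demo/BC1023_24_R1.fastq.gz demo/BC1023_24_R2.fastq.gz demo/raw_reads/
# The per-direction unpaired files are no longer needed after concatenation.
mv demo/*R1_unpaired* demo/*R2_unpaired* demo/raw_reads/unconcatenated_unpaired_reads/
ls demo   # → *_paired files, the concatenated *_unpaired file, and raw_reads/
```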
## Running HybPiper 2
### Assembly
The first HybPiper command to run is `assemble`, which performs sequence assembly and extraction. This process is resource intensive and will take a long time. In a folder containing the quality-filtered reads, run these commands.
:::spoiler NOTES ABOUT ASSEMBLY
- The syntax below will need to change depending on the naming scheme of your files
- It may be a good idea to back up your quality filtered raw reads somewhere (if storage space is not an issue to worry about)
- Here I've redirected the output of this for loop into a text file/shell script. That way, you can check that all the variables were correctly stored before running this step. To run the shell script, type `bash assemble.sh`
- Arguments:
- -t_dna: is the bait file
- -r: the paired read files
- --unpaired: the concatenated unpaired read file
- --bwa: using _nucleotide_ target files
- --run_intronerate: collect intron regions as well (you may not want this depending on your goals)
- See the GitHub page for additional arguments/details
:::
```bash!
#Here we make a for loop to run `hybpiper assemble` on all samples within a folder.
for i in *R1_paired.fastq.gz
do f2=${i//_R1_/_R2_}
f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
f4=${i//_R1_paired.fastq.gz/}
echo hybpiper assemble -t_dna micromoth.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
done
#If you are running HybPiper 2.1.6 or newer, Intronerate runs by default and --run_intronerate should be removed from the script.
#The for loop writes to a script called assemble.sh... Check the script before running it
```
This should be run as a SLURM job. Create a SLURM submission script using nano and name it `slurm_hybpiper.sh`
```bash=
nano slurm_hybpiper.sh
```
This will pull up a text editor; paste the text below, then save as "slurm_hybpiper.sh"
```
#!/bin/bash
#SBATCH --account=p31927
#SBATCH --partition=normal
#SBATCH --time=15:00:00
#SBATCH --mem=1GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=slurm_hybpiper
#SBATCH --output=slurm_outlog.log
#SBATCH --error=slurm_errors.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jbf420@u.northwestern.edu
module load hybpiper
bash assemble.sh
```
Run using sbatch
```bash=
sbatch slurm_hybpiper.sh
```
### Summary Statistics
HybPiper also includes a number of commands that produce statistics about the assembled sequences. This is separate from retrieving the sequences (described below). `cd` to the parent folder containing HybPiper's outputs and run these commands to generate summary statistics.
:::spoiler NOTES ABOUT STATISTICS
- Arguments
- -t_dna: The nucleotide bait file
- gene: "gene" if you want to generate statistics about the gene sequence, "supercontig" if you want supercontig statistics (requires intronerate)
- namefile.txt: a text file containing individual names, one on each line
:::
```bash=!
#This line generates some tsv's containing statistics about each individual's assemblies
hybpiper stats -t_dna micromoth.fasta gene namefile.txt
#This line generates a heatmap based off of the summary statistics in the previous line
hybpiper recovery_heatmap seq_lengths.tsv
```
### Retrieve Sequences
HybPiper can retrieve the assembled sequences from each individual for each of the targets. Other downstream analyses often do this as well, so this step may be skipped.
```bash=!
hybpiper retrieve_sequences dna -t_dna micromoth.fasta --sample_names namefile.txt
```