---
title: Encephalartos
tags: HybSeq
---

Luciana Piniely

---

This is the workflow of everything I am running for this project.

# Installation

Install HybPiper 2

```bash=
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

# Create a conda env called "hybpiper" with hybpiper installed
conda create --name hybpiper -c chrisjackson-pellicle hybpiper
# Answer yes to prompts

# Activate the conda env; the (base) in front of your username should change to (hybpiper)
conda activate hybpiper

# Then run test commands and check that the hybpiper help/version output appears
hybpiper --help
# or
hybpiper --version
# or
hybpiper check_dependencies
```

## File Preparation

Rename the files from SampleID_S##_R#_001.fastq.gz to SampleID_R#.fastq.gz

```bash=
# To remove the S##:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done

# To remove the _001:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done

# Note: in sed an unescaped "." matches any single character, so if you are cleaning up
# something like ".1." in a file name, escape the dots (\.1\.) to match the literal characters.
```

### Trimming

We use Trimmomatic, which filters out low-quality reads and removes adapter sequences.
- **[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)**

Activate the trimmomatic conda environment

```bash=
conda activate trimmomatic
```

:::spoiler Trimmomatic Arguments
- `PE -phred33` refers to the quality-scoring scheme used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the two input files (the raw read files) that Trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the four output files that are produced: two paired-read files and two unpaired-read files
- The remaining arguments control how Trimmomatic decides which sequences to keep or remove
:::

```bash=
# For loop to build trimmomatic.sh for a folder of samples
for _R1_ in *_R1*; do
  R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
  R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
  R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
  R2p=${R2//.fastq.gz/_paired.fastq.gz}
  R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
  echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done

# Command to run on a single sample
java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
  input_R1.fastq.gz input_R2.fastq.gz \
  output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
  output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true \
  LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40

# Make a slurm script to run trimmomatic on the cluster, called slurmtrim.sh
# View and edit the slurm script
nano slurmtrim.sh

# Example batch script (slurmtrim.sh):
#!/bin/bash
#SBATCH --account=p32835
#SBATCH --partition=normal
#SBATCH --time=15:00:00
#SBATCH --mem=5G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --job-name=trimmomatic
#SBATCH --output=trimmomatic.log
#SBATCH --error=error.trimmomatic.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lucianapiniely2026@u.northwestern.edu

#module load hybpiper
# or
#module load java

./trimmomatic.sh

# Make the shell script executable:
chmod +x trimmomatic.sh

# Run the script
sbatch slurmtrim.sh
```

After running Trimmomatic you should have 4 files for each sample: R1_paired and unpaired, R2_paired and unpaired. For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file.
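Before moving the trimmed files into their own directory, a quick check like the one below (not part of the original pipeline, just a minimal sketch) can confirm that every sample produced all four trimmed files. It assumes the `*_paired`/`*_unpaired` naming produced by the loop above.

```bash=
# Minimal sanity check (optional): count trimmed output files per sample; expect 4 each
for r1 in *_R1_paired.fastq.gz; do
    sample=${r1%_R1_paired.fastq.gz}
    count=$(ls "${sample}"_R[12]_*paired.fastq.gz 2>/dev/null | wc -l)
    echo "${sample}: ${count} of 4 trimmed files found"
done
```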
```bash=
# Create two directories
mkdir raw_reads
mkdir trimmed_raw_data
mv *R1.fastq.gz *R2.fastq.gz raw_reads/
mv *paired.fastq.gz *unpaired.fastq.gz trimmed_raw_data

# If you want to include the unpaired reads in HybPiper's assembly, you need to
# concatenate the two unpaired files together (combine them into one file).
# To concatenate the unpaired files for each sample (run this inside trimmed_raw_data):
for file in *R1_unpaired*
do
  file2=${file//_R1_/_R2_}
  file3=${file//R1_unpaired/unpaired}
  cat $file $file2 > $file3
done

# After concatenating, remove the separate R1 and R2 unpaired files
rm *R1_unpaired.fastq.gz *R2_unpaired.fastq.gz

# Now you should only have these three files per sample:
# HE1_R1_paired.fastq.gz, HE1_R2_paired.fastq.gz and HE1_unpaired.fastq.gz
```

#### FILES NEEDED

You will need a text file with the names of all individuals as they appear in the file names, and a file with the GoFlag bait sequences.

```bash=
# I named this namefile.txt
ls *.fastq.gz | cut -d'_' -f1 | sort -u > namefile.txt

# GoFlag bait file: to create this custom reference file, save the commands below
# as a script (using nano) named extract_sequences.sh

# Symlink the combined target file into the working directory (example paths from my runs)
ln -s /projects/p32835/combinedTarget.fasta /projects/p32835/Zamia_Dioon/cycads_move/trimmed_reads/combinedTarget.fasta
ln -s /projects/p32835/combinedTarget.fasta /projects/p32835/2nd_batch/trimmed_raw_reads

# For Encephalartos I used:
awk '/Encephalartos_barteri/ {print; getline; print}' combinedTarget.fasta > ref_seq.fasta
sed -E "s/>L([0-9]*)_.*_.*_(.*_.*)_[1-2]__REF/>\2-\1/g" ref_seq.fasta > target.fa
rm ref_seq.fasta

# For Zamia and Dioon I used:
awk '/Cycadales/ {print; getline; print}' combinedTarget.fasta > ref_seq.fasta
sed -E "s/>L([0-9]*)_.*_.*_(.*_.*)_[1-2]__REF/>\2-\1/g" ref_seq.fasta > target.fa
rm ref_seq.fasta

# This creates a file called target.fa that is used as the bait (target) file in the HybPiper pipeline.
```

##### ASSEMBLY

Load HybPiper and run this loop. Note: if you are running HybPiper 2.1.6 or a newer version, Intronerate is run by default and you have to remove `--run_intronerate` from the script.

```bash=
module load hybpiper
```

```bash=
for i in *R1_paired.fastq.gz
do
  f2=${i//_R1_/_R2_}
  f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
  f4=${i//_R1_paired.fastq.gz/}
  echo hybpiper assemble -t_dna target.fa -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done

# Create a slurm script to run assemble.sh; call it "slurmAssemble.sh" and keep it
# in the same directory as assemble.sh

#!/bin/bash
#SBATCH --account=p32835
#SBATCH --partition=normal
#SBATCH --time=24:00:00
#SBATCH --mem=25G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --job-name=assemble
#SBATCH --output=assemble.log
#SBATCH --error=error.assemble.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lucianapiniely2026@u.northwestern.edu

module load hybpiper
./assemble.sh

# Make the shell script executable:
chmod +x assemble.sh

# Run the script
sbatch slurmAssemble.sh
```
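As a check on the loop above, each line appended to assemble.sh should be a complete `hybpiper assemble` command. For a sample named HE1 (the example name used in the trimming step), the generated line would look roughly like the sketch below; the exact file names depend on how your samples are named.

```bash=
# Illustrative only: expected line in assemble.sh for a hypothetical sample HE1
hybpiper assemble -t_dna target.fa -r HE1_R1_paired.fastq.gz HE1_R2_paired.fastq.gz --unpaired HE1_unpaired.fastq.gz --prefix HE1 --bwa --run_intronerate
# On HybPiper 2.1.6 or newer, drop --run_intronerate, as noted above.
```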