---
title: Encephalartos
tags: HybSeq
---
Luciana Piniely
---
This is the workflow for everything I am running for this project.
# Installation
Install HybPiper 2
```bash=
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
# Create a conda env called "hybpiper" with hybpiper installed
conda create --name hybpiper -c chrisjackson-pellicle hybpiper
#answer yes to prompts
# Activate the conda env; the (base) in front of your username should change to (hybpiper)
conda activate hybpiper
#Then run test commands and see if hybpiper prompts appear
hybpiper --help #or
hybpiper --version #or
hybpiper check_dependencies
```
## File Preparation
Change the file names from SampleID_S##_R#_001.fastq.gz to SampleID_R#.fastq.gz
```bash=
#to remove the S##:
for filename in *fastq.gz;
do mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"; done
#to remove the _001_:
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
#note: in a sed pattern an unescaped "." matches any single character, so if you are cleaning up something like ".1." in a file name, escape the dots (e.g. "\.1\.") or the substitution may hit the wrong part of the name.
```
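For example, the escaping caveat above (a throwaway illustration with a hypothetical filename, not one of the project's samples):
```bash=
# Unescaped "." matches ANY single character, so the match can land in the wrong place
echo "HE1.1.fastq.gz" | sed 's/.1./_1_/'    # -> H_1_1.fastq.gz
# Escaping the dots matches the literal ".1." only
echo "HE1.1.fastq.gz" | sed 's/\.1\./_1_/'  # -> HE1_1_fastq.gz
```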
### Trimming
We use Trimmomatic, which filters out low-quality reads and removes adapter sequences.
- **[Trimmomatic GitHub](https://github.com/usadellab/Trimmomatic/blob/main/README.md)**
Activate the trimmomatic conda environment:
```bash=
conda activate trimmomatic
```
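If the trimmomatic environment does not already exist, one way to create it (a sketch assuming the bioconda `trimmomatic` package; on the cluster you can instead rely on the module/jar path used below):
```bash=
# Only needed if no trimmomatic environment is set up yet
conda create --name trimmomatic -c bioconda trimmomatic
conda activate trimmomatic
```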
:::spoiler Trimmomatic Arguments
- `PE` runs Trimmomatic in paired-end mode; `-phred33` refers to the quality scoring scheme used in the raw reads
- `input_R1.fastq.gz` and `input_R2.fastq.gz` are the 2 input files (the raw read files) that trimmomatic acts on
- `output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz` are the 4 output files that are produced, 2 paired read files and 2 unpaired read files
- The remaining arguments inform Trimmomatic's decisions about keeping/removing sequences
:::
```bash=
#For loop to run on folder of samples
for _R1_ in *_R1*; do
R2=${_R1_//_R1.fastq.gz/_R2.fastq.gz}
R1p=${_R1_//.fastq.gz/_paired.fastq.gz}
R1u=${_R1_//.fastq.gz/_unpaired.fastq.gz}
R2p=${R2//.fastq.gz/_paired.fastq.gz}
R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
echo java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 $_R1_ $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
# Command to run trimmomatic on a single sample
java -jar /software/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
input_R1.fastq.gz input_R2.fastq.gz \
output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10:2:true \
LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40
#make a slurm script to run trimmomatic on the cluster, called slurmtrim.sh
#view and edit the slurm script
nano slurmtrim.sh
#A sample batch script:
#!/bin/bash
#SBATCH --account=p32835
#SBATCH --partition=normal
#SBATCH --time=15:00:00
#SBATCH --mem=5G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --job-name=trimmomatic
#SBATCH --output=trimmomatic.log
#SBATCH --error=error.trimmomatic.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lucianapiniely2026@u.northwestern.edu
#uncomment one of the following so java is available to run the jar:
#module load hybpiper
#or
#module load java
./trimmomatic.sh
#make the shell script executable:
chmod +x trimmomatic.sh
#run the script
sbatch slurmtrim.sh
```
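Before submitting, a quick sanity check (not part of the original commands) that the generated trimmomatic.sh contains one command per sample:
```bash=
# Number of commands written by the for loop
wc -l trimmomatic.sh
# Should match the number of R1 input files
ls *_R1.fastq.gz | wc -l
# Inspect the first command to confirm the file names were substituted correctly
head -n 1 trimmomatic.sh
```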
After running Trimmomatic, you should have 4 files for each sample: R1_paired, R1_unpaired, R2_paired, and R2_unpaired.
For HybPiper you only need the R1_paired.fastq.gz, R2_paired.fastq.gz, and a concatenated unpaired file.
```bash=
#create two directories
mkdir raw_reads
mkdir trimmed_raw_data
mv *R1.fastq.gz *R2.fastq.gz raw_reads/
mv *paired.fastq.gz *unpaired.fastq.gz trimmed_raw_data/
#If you want to include the unpaired reads in HybPiper's assembly, you need to concatenate the two unpaired files into one file per sample.
#To concatenate the unpaired files for each sample:
for file in *R1_unpaired*
do
file2=${file//_R1_/_R2_}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
#after concatenating, remove the separate R1 and R2 unpaired files
rm *R1_unpaired.fastq.gz *R2_unpaired.fastq.gz
#Now you should only have these three files per sample:
#HE1_R1_paired.fastq.gz, HE1_R2_paired.fastq.gz and HE1_unpaired.fastq.gz
```
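A quick way to confirm each sample now has exactly three files (assuming the SampleID_* naming used throughout):
```bash=
# Each sample prefix should report 3 files: R1_paired, R2_paired, unpaired
for sample in $(ls *.fastq.gz | cut -d'_' -f1 | sort -u); do
echo "$sample: $(ls ${sample}_*.fastq.gz | wc -l) files"
done
```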
#### FILES NEEDED
You will need a text file with the names of all individuals as they appear in the file names, and a file with the GoFLAG bait (target) sequences.
```bash=
# I named this namefile.txt
ls *.fastq.gz | cut -d'_' -f1 | sort -u > namefile.txt
#GoFLAG bait file: to create this custom reference file, save the commands below in a script (made with nano) named extract_sequences.sh, or run them directly
#symlink the combined target file into the current directory
ln -s /projects/p32835/combinedTarget.fasta /projects/p32835/Zamia_Dioon/cycads_move/trimmed_reads/combinedTarget.fasta
ln -s /projects/p32835/combinedTarget.fasta /projects/p32835/2nd_batch/trimmed_raw_reads
#For Encephalartos I used:
awk '/Encephalartos_barteri/ {print; getline; print}' combinedTarget.fasta > ref_seq.fasta
sed -E "s/>L([0-9]*)_.*_.*_(.*_.*)_[1-2]__REF/>\2-\1/g" ref_seq.fasta > target.fa
rm ref_seq.fasta
#For Zamia and Dioon I used
awk '/Cycadales/ {print; getline; print}' combinedTarget.fasta > ref_seq.fasta
sed -E "s/>L([0-9]*)_.*_.*_(.*_.*)_[1-2]__REF/>\2-\1/g" ref_seq.fasta > target.fa
rm ref_seq.fasta
#This will create a file called target.fa that will be used as the bait file in the HybPiper pipeline.
```
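Before moving on, it can help to confirm target.fa was built as expected (a quick check, not in the original workflow):
```bash=
# Count the reference sequences extracted from combinedTarget.fasta
grep -c '>' target.fa
# Inspect a few headers to confirm the sed renaming produced the expected format
grep '>' target.fa | head -n 5
```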
##### ASSEMBLY
Load HybPiper and run this loop. Note: if you are running HybPiper 2.1.6 or newer, Intronerate runs by default and you should remove `--run_intronerate` from the command.
```bash=
module load hybpiper
```
```bash=
for i in *R1_paired.fastq.gz
do f2=${i//_R1_/_R2_}
f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
f4=${i//_R1_paired.fastq.gz/}
echo hybpiper assemble -t_dna target.fa -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --run_intronerate >> assemble.sh
done
#create a slurm script to run assemble.sh; call it "slurmAssemble.sh" and save it in the same directory as assemble.sh
#!/bin/bash
#SBATCH --account=p32835
#SBATCH --partition=normal
#SBATCH --time=24:00:00
#SBATCH --mem=25G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --job-name=assemble
#SBATCH --output=assemble.log
#SBATCH --error=error.assemble.log
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lucianapiniely2026@u.northwestern.edu
module load hybpiper
./assemble.sh
#make the shell script executable:
chmod +x assemble.sh
#run the script
sbatch slurmAssemble.sh
```
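As with the trimming step, it is worth checking the generated assemble.sh before the sbatch step above (a quick sanity check, assuming one hybpiper command per sample was written by the loop):
```bash=
# One hybpiper assemble command per sample
wc -l assemble.sh
ls *R1_paired.fastq.gz | wc -l
# Inspect the first command to confirm the read files and prefix
head -n 1 assemble.sh
```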