---
title: "IMLS Hybridization Project: Juglans, Oaks, Magnolia"
breaks: False
tags: 'HybSeq'
author: Dylan Cohen
---
---
Project goal: To detect whether hybridization is occurring between common and rare taxa at botanic gardens
---
We are using three different genera to ask this question.
1. Magnolia
2. Juglans
3. Quercus (oaks)
---
### General Workflow
1. Clean raw data with **Trimmomatic**
2. Clean up poly-G tails with **Cutadapt**
3. Make a custom baits file for oaks/Juglans (Magnolia had its own baits file)
4. Run **HybPiper**
5. Run **HybPhaser**
    a. Align genes with **MUSCLE**
6. Use the intronerated (supercontig) sequences from HybPhaser
7. Align gene sequences with **MUSCLE**
8. Trim sequences with **ClipKit**
9. Check alignments with **AMAS**
10. Make gene trees with **IQ-TREE**
11. Run **ASTRAL**
---
### Trying to make a VCF file from gene alignments
#### Starting point
1. Start with the intronerated sequences from HybPhaser
2. Remove outgroups
3. Align with MUSCLE
4. Check alignment stats with AMAS (a sketch of the AMAS call follows this list)
5. Remove poorly aligned samples?? (provisional step)
6. Run ClipKit with `kpic-smart-gap`: smart-gap trimming, keeping just the parsimony-informative and constant sites
```
clipkit filtered.fasta -m kpic-smart-gap -o parsimony.fasta
```
7. Convert to VCF with **snp-sites**
```
snp-sites -v -o test.vcf parsimony.fasta
```
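For step 4, a minimal sketch of the AMAS call (assuming AMAS is installed so that `AMAS.py` is on the PATH):
```
# per-alignment summary stats (length, missing data, variable/parsimony-informative sites)
AMAS.py summary -f fasta -d dna -i *.fasta
```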
### Filtering the VCF for popgen/hybridization analyses
There are many ways to filter a VCF, and I suggest trying different filters; below is what I am trying.
---
## NOTE: this filter did not work for me
8. Filter the VCF file with vcftools
- Filter for biallelic sites
```
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --recode --out filtered
```
9. How I convert a VCF to BED for ADMIXTURE. Note: I am also using this [wrapper](https://github.com/dportik/admixture-wrapper) to run ADMIXTURE, and it needs the input recoded in 12 format.
```
plink --vcf alleles.recode.vcf --make-bed --out initial_plink --allow-extra-chr --double-id
plink --bfile initial_plink --recode12 --out admixture_input --allow-extra-chr --double-id
```
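If you want to run ADMIXTURE directly rather than through the wrapper, the standard pattern from the ADMIXTURE manual is a loop over K values with cross-validation; a sketch (the K range here is an arbitrary choice):
```
# run ADMIXTURE with cross-validation for a range of K, keeping each log
for K in 1 2 3 4 5; do
    admixture --cv admixture_input.ped $K | tee log${K}.out
done
# compare cross-validation errors across K to pick the best value
grep -h CV log*.out
```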
---
Raw data files are located on the JBOD.
For Magnolia, most of the work has been done on Quest.
Because Magnolia contains polyploids, I have had to go another route to extract gene sequences (notes on this to come).
For oaks and Juglans I will try to use Suzy's servers, so the following code is written for those.
---
# Cleaning raw data using Trimmomatic and Cutadapt
Step 1 involves cleaning up the raw data file names, using Trimmomatic to filter reads as a first pass, and then using Cutadapt to remove artificial poly-G tails.
Remove `_001` from the file names:
```
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```
Here is a loop to build a script that runs on the entire folder:
```
for R1 in *_R1*; do
    # Derive the R2 mate and the paired/unpaired output names from the R1 filename
    R2=${R1//_R1.fastq.gz/_R2.fastq.gz}
    R1p=${R1//.fastq.gz/_paired.fastq.gz}
    R1u=${R1//.fastq.gz/_unpaired.fastq.gz}
    R2p=${R2//.fastq.gz/_paired.fastq.gz}
    R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
    # Append one trimmomatic command per sample to the script
    echo trimmomatic PE -phred33 $R1 $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
NOTE: I installed Trimmomatic with conda, and I had to install my own conda environment.
See [here](https://docs.anaconda.com/miniconda/) for the Linux install. I had to add the following to the end of my `.bashrc`:
```
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/srs57/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/srs57/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/srs57/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/srs57/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
```
To run Trimmomatic, also make sure to grab the `TruSeq3-PE.fa` adapter file from the Trimmomatic GitHub (I just copied and pasted the contents into a text document on the server using nano).
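You may be able to pull the adapter file straight from GitHub instead; the raw URL below is an assumption about the current repo layout, so verify the path first:
```
# fetch the TruSeq3 paired-end adapter file from the Trimmomatic repo (path assumed)
wget https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-PE.fa
```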
Add ```#!/bin/bash``` to the top of the trimmomatic script and make it executable with ``` chmod +x trimmomatic.sh```
To run the script: ``` nohup ./trimmomatic.sh > trimmomatic_log.txt 2>&1 & ```
- `nohup` -> prevents the process from being terminated when the user logs off
- `trimmomatic_log.txt` -> the file standard output (the normal output of the script) is redirected to
- `2>&1` -> redirects standard error (error messages) to the same log file
- `&` -> runs the command in the background, allowing you to log off without stopping the process
### Note: HybPiper input files are the paired-end 1 and paired-end 2 files, plus a concatenated unpaired-reads file - - - run Cutadapt on all files prior to concatenating (a sketch of the concatenation step follows this section)
Before running Cutadapt, I would remove the raw `R1.fastq.gz` and `R2.fastq.gz` files since they are backed up in another location. KEEP the paired and unpaired files for Cutadapt processing.
Script for cutadapt
```
for i in *.fastq.gz; do
# Get the base name of the file (without path and extension)
base_name=$(basename "$i" .fastq.gz)
# Run Cutadapt on the file
echo cutadapt --nextseq-trim=20 -o "${base_name}_trimmed.fastq" "$i" >> cutadapt.sh
done
```
Add these to the header of the script
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate cutadapt
```
To remove the duplicated sample number that starts with S##:
```
for filename in *trimmed.fastq; do
    mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"
done
```
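The concatenation of the unpaired reads isn't shown anywhere above, so here is a minimal sketch, assuming the filename patterns produced by the loops in this section:
```
# merge each sample's R1 and R2 unpaired reads into the single unpaired file HybPiper expects
for i in *_R1_unpaired_trimmed.fastq; do
    base=${i%_R1_unpaired_trimmed.fastq}
    cat "$i" "${base}_R2_unpaired_trimmed.fastq" > "${base}_unpaired_trimmed.fastq"
done
```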
---
# Hybpiper
I made a custom bait file; see here: https://github.com/chrisjackson-pellicle/NewTargets. The idea is that the closer the bait sequences are to your target group, the better the gene recovery. Note: after making a custom bait file, make sure to check it with ``` hybpiper check_targetfile ``` and, if necessary, fix it using ```hybpiper fix_targetfile``` (a hedged sketch is below). Initially, since Juglans and the oaks are both in the same order (Fagales), I only included baits from that order, but I have expanded it to include the sister order Fabales and might expand it again. This was due to low-complexity gene regions found by check_targetfile.
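A hedged sketch of the check step (see the HybPiper wiki for the full options; `jugoak.fasta` is the custom target file used in the loop below):
```
# report problems (low-complexity regions, frame issues, etc.) in the custom target file;
# check_targetfile also writes a control file that hybpiper fix_targetfile can consume
# if fixes are needed (exact fix_targetfile invocation per the HybPiper wiki)
hybpiper check_targetfile --targetfile_dna jugoak.fasta
```
Loop to build the `hybpiper assemble` commands, one per sample: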
```
for i in *R1_paired_trimmed.fastq
do f2=${i//R1/R2}
f3=${i//R1_paired_trimmed.fastq/unpaired_trimmed.fastq}
f4=${i//_R1_paired_trimmed.fastq/}
echo hybpiper assemble -t_dna /home/dcl9541/Juglans_working/ztest/jugoak.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --cpu 6 >> assemble.sh
done
```
---
# Assembling plastomes from HybSeq data
- Install GetOrganelle using conda (https://github.com/Kinggerm/GetOrganelle)
Loop to run on files that were trimmed with Trimmomatic and cleaned up with Cutadapt. Note I am using the unpaired reads and just the regular plastid seed library; it might be better to design your own plastid library if you have sequences available.
```
for i in *R1.paired_trimmed.fastq; do
# Substitute R1 with R2 for paired reads
f2="${i/R1.paired_trimmed.fastq/R2.paired_trimmed.fastq}"
# Substitute R1 paired with unpaired for the unpaired reads
f3="${i/R1.paired_trimmed.fastq/unpaired_trimmed.fastq}"
# Use the base name (without R1.paired_trimmed.fastq) as the output directory and prefix, then remove any trailing '.'
f4="${i/R1.paired_trimmed.fastq/}"
f4="${f4%.}"
# Generate the command and append to reorgan.sh
echo "get_organelle_from_reads.py -1 $i -2 $f2 -u $f3 -o $f4 --prefix $f4 -t 4 -s Mag.cp.seed.fasta -F embplant_pt -R 15 -k 21,45,65,85,105 --reduce-reads-for-coverage inf --max-reads inf" >> reorgan2.sh
done
```
To run the script: ``` nohup ./reorgan2.sh > get_log 2>&1 & ```
---
# Aligning Sequences
- Installed MAFFT to align plastid sequences
  - conda install: ```conda install bioconda::mafft```
- Installed MUSCLE to align the gene sequences from Magnolia
- Combine all fastas into one file:
``` cat *.fasta > magnolia_plastid.fasta```
Bash script to align the plastid sequences:
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate mafft
mafft --auto --thread 4 magnolia_plastid.fasta > magnoliap.align.fasta
```
and a script to align genes:
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate mafft
for i in *.fasta; do
mafft --auto --thread 4 ${i} > ${i%.*}_mafft.fasta;
done
```
OR: I think MUSCLE is better suited for longer alignments (change threads as you wish)
```
for i in *.fasta; do
echo muscle -super5 ${i} -output ${i%.*}_mus.fasta -threads 5;
done > muscle.sh
# split muscle.sh into 21-line chunks (muscle0.sh, muscle1.sh, ...) so they can run in parallel
split -a 1 -d -l 21 muscle.sh muscle --additional-suffix=.sh
```
To run a script: ``` nohup ./alignment.sh > align.log 2>&1 &``` (same pattern for each split muscle script)
---
# Trimming Aligned Sequences
- Added trimAl to the trimmomatic environment
  - used conda to install
- Also installed ClipKit for trimming Magnolia
The code below removes alignment columns that are gaps in more than 50% of sequences (`-gt 0.5`):
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate trim
for i in *.fasta; do
trimal -in $i -out ${i//mafft/trim} -gt 0.5
done
```
ClipKit loop below:
```
for i in *.fasta; do
clipkit $i -o ${i//mafft/clip}
done
```
---
# Maximum likelihood reconstruction
- Made a new environment for IQ-TREE
- conda install iqtree
Code to estimate each gene's model of sequence evolution, construct a tree, and estimate bootstrap support.
A loop to create a script to run on the independent genes:
```
for i in *.fasta; do
echo iqtree2 -s $i -T 10 -m TEST -B 1000
done > full_iqtree.sh
```
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate iqtree
iqtree -s magnoliap.align.fasta -ntmax 4 -m TEST -B 1000
```
After running IQ-TREE on all the genes, collapse branches with low support. Newick Utilities is installed under the iqtree environment on Curie; you do not need a job for this, it runs fast.
```
for i in *.treefile; do
nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```
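The collapsed gene trees then get pooled and handed to ASTRAL (step 11 of the workflow above). A minimal sketch; the jar filename/version is an assumption, so match it to your install:
```
# pool the collapsed gene trees into a single input file
cat *_BS10.treefile > gene_trees.tre
# estimate the species tree from the gene trees (jar name assumed)
java -jar astral.5.7.8.jar -i gene_trees.tre -o species_tree.tre 2> astral.log
```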
---
# Insert Notes on Pate HERE
---
# Disco
[DISCO](https://github.com/JSdoubleL/DISCO) Github
About (from the GitHub):
Decomposition Into Single-COpy gene trees (DISCO) is a method for decomposing multi-copy gene-family trees while attempting to preserve orthologs and discard paralogs.
I have multi-copy gene trees with diploids and tetra/hexaploids, and need to get the best copy of each gene for further downstream analyses.
Needed to start: ML gene trees.
```
python3 disco.py -i MKT362594.1.4_APOV_supercontig_trim.fasta.treefile -o MKT362594_single.tree -d __ --keep-labels --remove_in_paralogs --single_tree
```
The Python script pro_disco.py will run the above command on a directory of treefiles and output single-tree files in a new directory (a bash sketch of the same idea is below).
The Python script new2.py will use the allele copies located in the DISCO output to extract them from the multi-copy alignment files and place single copies of each species + gene in a new directory.
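I don't reproduce pro_disco.py here, but the equivalent bash loop would look roughly like this sketch (output directory name is an assumption):
```
# run DISCO on every ML gene tree, writing single-copy trees to a new directory
mkdir -p single_trees
for t in *.treefile; do
    python3 disco.py -i "$t" -o single_trees/"${t%.treefile}"_single.tree \
        -d __ --keep-labels --remove_in_paralogs --single_tree
done
```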
---
# For Hybridization Analyses Check out [here](https://hackmd.io/85GwrgNORuaKeG-jP8MVuQ)
---
# Extraction of SNPs from HybSeq Data
Attempting this protocol - https://github.com/lindsawi/HybSeq-SNP-Extraction
1. For each species, determine which sample has the longest supercontig; this is used as the reference. Note I did not filter or align this supercontig.
Code to figure out which supercontig is the longest:
## Below is important if you want to compare SNPs across multiple species
*If you are comparing across species like me, then I would create a consensus sequence across species to use as the reference, to avoid biasing SNPs toward any one species. To do this I chose species references based on low missing data (less than 40% and most below 15%), while also maximizing the number of positions (~800000). A hedged sketch of building the consensus follows the script below.*
```
from Bio import SeqIO
import os

# Directory containing your FASTA files
fasta_directory = "path/to/fasta/files"

# Dictionary to store file names and their site counts
site_counts = {}

# Loop through all FASTA files in the directory
for filename in os.listdir(fasta_directory):
    if filename.endswith(".fasta"):
        filepath = os.path.join(fasta_directory, filename)
        total_sites = 0
        # Process each sequence in the FASTA file
        for record in SeqIO.parse(filepath, "fasta"):
            total_sites += len(record.seq)
        # Store the total number of sites for the file
        site_counts[filename] = total_sites

# Display the results
for filename, sites in site_counts.items():
    print(f"{filename}: {sites} sites")
```
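For the cross-species consensus reference mentioned above, one route is to align the chosen per-species references and collapse the alignment into a single consensus sequence. A sketch only; the EMBOSS `cons` call is my assumption, not the tool confirmed in these notes:
```
# align the chosen per-species reference supercontigs (MAFFT, as used elsewhere in these notes)
mafft --auto --thread 4 chosen_references.fasta > chosen_references.aln.fasta
# collapse the alignment into one consensus sequence with EMBOSS cons (assumed tool)
cons -sequence chosen_references.aln.fasta -outseq consensus_reference.fasta -name consensus
```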
2. Get rid of any gaps in the supercontig by replacing them with Ns:
```sed 's/-/N/g' reference_with_gaps.fasta > reference_without_gaps.fasta```
3. Make a folder for each species and place the supercontig and the paired R1 and R2 Trimmomatic files in the folder.
Save this script to the folder:
```
#!/bin/bash
set -eo pipefail
eval "$(conda shell.bash hook)"
conda activate hybpiper
module load gatk/4.1.0
# Set variables
reference=$1
prefix=$2
read1fn="$prefix.R1.paired.fastq.gz"
read2fn="$prefix.R2.paired.fastq.gz"
# Create and move into the prefix directory
mkdir $prefix
cd $prefix
# Create sequence dictionary if it doesn't exist
if [ ! -f ../${reference%.*}.dict ]; then
    gatk CreateSequenceDictionary -R ../$reference
fi
# Index the reference (parent directory paths used)
bwa index ../$reference
samtools faidx ../$reference
# Align read files to reference sequence and map
bwa mem ../$reference ../$read1fn ../$read2fn | samtools view -bS - | samtools sort -o "$prefix.sorted.bam"
# Convert FASTQ to unmapped BAM
gatk FastqToSam -F1 ../$read1fn -F2 ../$read2fn -O $prefix.unmapped.bam -SM $prefix
# Replace read groups for mapped and unmapped BAM files
gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
# Combine mapped and unmapped BAM files
gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R ../$reference
# Mark duplicates
gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam
# Create GVCF
gatk HaplotypeCaller -I $prefix.marked.bam -O $prefix-g.vcf -ERC GVCF -R ../$reference
# Remove intermediate files
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam
```
The loop below is used to create a control file to run all samples in a folder:
```
#!/bin/bash
# Loop over all .fasta files in the current directory
# (the reference supercontig, e.g., x_proctoriana4333_A_S224NOGAP.fasta)
for fasta_file in *.fasta; do
    fasta="${fasta_file}"
    # Loop over the corresponding .fastq.gz read files
    for fastq_file in *.R1.paired.fastq.gz; do
        # Extract the sample name prefix by removing .R1.paired.fastq.gz from the end
        fastq_base="${fastq_file%.R1.paired.fastq.gz}"
        # Output the snpcaller command
        echo "bash snpcaller3.sh $fasta $fastq_base" >> capture3.sh
    done
done
```
Example line:
```bash snpcaller.sh x_proctoriana4333_A_S224NOGAP.fasta proctoriana4406_B_S161```
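To run the control file, the same background pattern as the other scripts should work: ``` nohup bash capture3.sh > capture3.log 2>&1 & ```
The linked protocol then continues with joint genotyping across the per-sample GVCFs; a minimal sketch of the standard GATK steps I would expect next (sample names and paths are assumptions):
```
# combine the per-sample GVCFs written by the snpcaller script (one -V per sample)
gatk CombineGVCFs -R reference_without_gaps.fasta \
    -V sampleA/sampleA-g.vcf -V sampleB/sampleB-g.vcf \
    -O combined.g.vcf
# joint-genotype the combined GVCF into a multi-sample VCF
gatk GenotypeGVCFs -R reference_without_gaps.fasta -V combined.g.vcf -O all_samples.vcf
```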