---
title: "IMLS Hybridization Project: Juglans, Oaks, Magnolia"
breaks: False
tags: 'HybSeq'
author: Dylan Cohen
---
---
Project goal: To detect whether hybridization is occurring between common and rare taxa at botanic gardens
---
We are using three different genera to ask this question.
1. Magnolia
2. Juglans
3. Quercus (oaks)
---
### General Workflow
1. Clean raw data with **Trimmomatic**
2. Clean up poly-G tails with **Cutadapt**
3. Make a custom baits file for oaks/Juglans (Magnolia had its own baits file)
4. Run **HybPiper**
5. Run **HybPhaser**
    a. Align genes with **MUSCLE**
6. Use the intronerated (supercontig) sequences from HybPhaser
7. Align gene sequences with **MUSCLE**
8. Trim sequences with **ClipKit**
9. Check alignments with **AMAS**
10. Make gene trees with **IQ-TREE**
11. Run **ASTRAL**
---
### Trying to make a VCF file from gene alignments
#### Starting point
1. Start with the intronerated sequences from HybPhaser
2. Remove outgroups
3. Align with MUSCLE
4. Check alignment stats with AMAS (a sketch of the AMAS call follows this list)
5. Remove poorly aligned samples?? (provisional step)
6. Run ClipKit with `kpic-smart-gap`: smart-gap trimming, keeping just the parsimony-informative and constant sites
```
clipkit filtered.fasta -m kpic-smart-gap -o parsimony.fasta
```
7. Convert to VCF with **snp-sites**
```
snp-sites -v -o test.vcf parsimony.fasta
```
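For step 4, a minimal sketch of the AMAS call (assuming AMAS is installed so that `AMAS.py` is on the PATH):
```
# per-alignment summary stats (length, missing data, variable/parsimony-informative sites)
AMAS.py summary -f fasta -d dna -i *.fasta
```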
### Filtering the VCF for popgen/hybridization analyses
There are many ways to filter a VCF, and I suggest trying different filters; below is what I am trying.
---
## NOTE: this filter did not work for me
8. Filter the VCF file with vcftools
- Filter for biallelic sites
```
vcftools --vcf input.vcf --min-alleles 2 --max-alleles 2 --recode --out filtered
```
9. How I convert a VCF to BED for ADMIXTURE. Note: I am also using this [wrapper](https://github.com/dportik/admixture-wrapper) to run ADMIXTURE, and it needs the input recoded in 12 format.
```
plink --vcf alleles.recode.vcf --make-bed --out initial_plink --allow-extra-chr --double-id
plink --bfile initial_plink --recode12 --out admixture_input --allow-extra-chr --double-id
```
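If you want to run ADMIXTURE directly rather than through the wrapper, the standard pattern from the ADMIXTURE manual is a loop over K values with cross-validation; a sketch (the K range here is an arbitrary choice):
```
# run ADMIXTURE with cross-validation for a range of K, keeping each log
for K in 1 2 3 4 5; do
    admixture --cv admixture_input.ped $K | tee log${K}.out
done
# compare cross-validation errors across K to pick the best value
grep -h CV log*.out
```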
---
Raw data files are located on the JBOD.
For Magnolia, most of the work has been done on Quest.
Because Magnolia contains polyploids, I have had to go another route to extract gene sequences (notes on this to come).
For oaks and Juglans I will try to use Suzy's servers, so the following code is written for those.
---
# Cleaning raw data using Trimmomatic and Cutadapt
Step 1 involves cleaning up the raw data file names, using Trimmomatic to filter reads as a first pass, and then using Cutadapt to remove artificial poly-G tails.
Remove `_001` from the file names:
```
for filename in *fastq.gz; do mv "$filename" "$(echo "$filename" | sed 's/_001.fastq.gz/.fastq.gz/')"; done
```
Here is a loop to build a script that runs on the entire folder:
```
for R1 in *_R1*; do
    # Derive the R2 mate and the paired/unpaired output names from the R1 filename
    R2=${R1//_R1.fastq.gz/_R2.fastq.gz}
    R1p=${R1//.fastq.gz/_paired.fastq.gz}
    R1u=${R1//.fastq.gz/_unpaired.fastq.gz}
    R2p=${R2//.fastq.gz/_paired.fastq.gz}
    R2u=${R2//.fastq.gz/_unpaired.fastq.gz}
    # Append one trimmomatic command per sample to the script
    echo trimmomatic PE -phred33 $R1 $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
```
NOTE: I installed Trimmomatic with conda, and I had to install my own conda environment.
See [here](https://docs.anaconda.com/miniconda/) for the Linux install. I had to add the following to the end of my `.bashrc`:
```
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/srs57/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/srs57/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/home/srs57/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/srs57/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
```
To run Trimmomatic, also make sure to grab the `TruSeq3-PE.fa` adapter file from the Trimmomatic GitHub (I just copied and pasted the contents into a text document on the server using nano).
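You may be able to pull the adapter file straight from GitHub instead; the raw URL below is an assumption about the current repo layout, so verify the path first:
```
# fetch the TruSeq3 paired-end adapter file from the Trimmomatic repo (path assumed)
wget https://raw.githubusercontent.com/usadellab/Trimmomatic/main/adapters/TruSeq3-PE.fa
```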
Add ```#!/bin/bash``` to the top of the trimmomatic script and make it executable with ``` chmod +x trimmomatic.sh```
To run the script: ``` nohup ./trimmomatic.sh > trimmomatic_log.txt 2>&1 & ```
- `nohup` -> prevents the process from being terminated when the user logs off
- `trimmomatic_log.txt` -> the file standard output (the normal output of the script) is redirected to
- `2>&1` -> redirects standard error (error messages) to the same log file
- `&` -> runs the command in the background, allowing you to log off without stopping the process
### Note: HybPiper input files are the paired-end 1 and paired-end 2 files, plus a concatenated unpaired-reads file - - - run Cutadapt on all files prior to concatenating (a sketch of the concatenation step follows this section)
Before running Cutadapt, I would remove the raw `R1.fastq.gz` and `R2.fastq.gz` files since they are backed up in another location. KEEP the paired and unpaired files for Cutadapt processing.
Script for cutadapt
```
for i in *.fastq.gz; do
# Get the base name of the file (without path and extension)
base_name=$(basename "$i" .fastq.gz)
# Run Cutadapt on the file
echo cutadapt --nextseq-trim=20 -o "${base_name}_trimmed.fastq" "$i" >> cutadapt.sh
done
```
Add these to the header of the script
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate cutadapt
```
To remove the duplicated sample number that starts with S##:
```
for filename in *trimmed.fastq; do
    mv "$filename" "$(echo "$filename" | sed 's/_S[0-9]*_/_/')"
done
```
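The concatenation of the unpaired reads isn't shown anywhere above, so here is a minimal sketch, assuming the filename patterns produced by the loops in this section:
```
# merge each sample's R1 and R2 unpaired reads into the single unpaired file HybPiper expects
for i in *_R1_unpaired_trimmed.fastq; do
    base=${i%_R1_unpaired_trimmed.fastq}
    cat "$i" "${base}_R2_unpaired_trimmed.fastq" > "${base}_unpaired_trimmed.fastq"
done
```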
---
# Hybpiper
I made a custom bait file; see here: https://github.com/chrisjackson-pellicle/NewTargets. The idea is that the closer the bait sequences are to your target group, the better the gene recovery. Note: after making a custom bait file, make sure to check it with ``` hybpiper check_targetfile ``` and, if necessary, fix it using ```hybpiper fix_targetfile``` (a hedged sketch is below). Initially, since Juglans and the oaks are both in the same order (Fagales), I only included baits from that order, but I have expanded it to include the sister order Fabales and might expand it again. This was due to low-complexity gene regions found by check_targetfile.
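A hedged sketch of the check step (see the HybPiper wiki for the full options; `jugoak.fasta` is the custom target file used in the loop below):
```
# report problems (low-complexity regions, frame issues, etc.) in the custom target file;
# check_targetfile also writes a control file that hybpiper fix_targetfile can consume
# if fixes are needed (exact fix_targetfile invocation per the HybPiper wiki)
hybpiper check_targetfile --targetfile_dna jugoak.fasta
```
Loop to build the `hybpiper assemble` commands, one per sample: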
```
for i in *R1_paired_trimmed.fastq
do f2=${i//R1/R2}
f3=${i//R1_paired_trimmed.fastq/unpaired_trimmed.fastq}
f4=${i//_R1_paired_trimmed.fastq/}
echo hybpiper assemble -t_dna /home/dcl9541/Juglans_working/ztest/jugoak.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa --cpu 6 >> assemble.sh
done
```
---
# Assembling plastomes from HybSeq data
- Install GetOrganelle using conda (https://github.com/Kinggerm/GetOrganelle)
Loop to run on files that were trimmed with Trimmomatic and cleaned up with Cutadapt. Note I am using the unpaired reads and just the regular plastid seed library; it might be better to design your own plastid library if you have sequences available.
```
for i in *R1.paired_trimmed.fastq; do
# Substitute R1 with R2 for paired reads
f2="${i/R1.paired_trimmed.fastq/R2.paired_trimmed.fastq}"
# Substitute R1 paired with unpaired for the unpaired reads
f3="${i/R1.paired_trimmed.fastq/unpaired_trimmed.fastq}"
# Use the base name (without R1.paired_trimmed.fastq) as the output directory and prefix, then remove any trailing '.'
f4="${i/R1.paired_trimmed.fastq/}"
f4="${f4%.}"
# Generate the command and append to reorgan.sh
echo "get_organelle_from_reads.py -1 $i -2 $f2 -u $f3 -o $f4 --prefix $f4 -t 4 -s Mag.cp.seed.fasta -F embplant_pt -R 15 -k 21,45,65,85,105 --reduce-reads-for-coverage inf --max-reads inf" >> reorgan2.sh
done
```
To run the script: ``` nohup ./reorgan2.sh > get_log 2>&1 & ```
---
# Aligning Sequences
- Installed MAFFT to align plastid sequences
  - conda install: ```conda install bioconda::mafft```
- Installed MUSCLE to align the gene sequences from Magnolia
- Combine all fastas into one file:
``` cat *.fasta > magnolia_plastid.fasta```
Bash script to align the plastid sequences:
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate mafft
mafft --auto --thread 4 magnolia_plastid.fasta > magnoliap.align.fasta
```
and a script to align genes:
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate mafft
for i in *.fasta; do
mafft --auto --thread 4 ${i} > ${i%.*}_mafft.fasta;
done
```
OR: I think MUSCLE is better suited for longer alignments (change threads as you wish)
```
for i in *.fasta; do
echo muscle -super5 ${i} -output ${i%.*}_mus.fasta -threads 5;
done > muscle.sh
# split muscle.sh into 21-line chunks (muscle0.sh, muscle1.sh, ...) so they can run in parallel
split -a 1 -d -l 21 muscle.sh muscle --additional-suffix=.sh
```
To run a script: ``` nohup ./alignment.sh > align.log 2>&1 &``` (same pattern for each split muscle script)
---
# Trimming Aligned Sequences
- Added trimAl to the trimmomatic environment
  - used conda to install
- Also installed ClipKit for trimming Magnolia
The code below removes alignment columns that are gaps in more than 50% of sequences (`-gt 0.5`):
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate trim
for i in *.fasta; do
trimal -in $i -out ${i//mafft/trim} -gt 0.5
done
```
ClipKit loop below:
```
for i in *.fasta; do
clipkit $i -o ${i//mafft/clip}
done
```
---
# Maximum likelihood reconstruction
- Made a new environment for IQ-TREE
- conda install iqtree
Code to estimate each gene's model of sequence evolution, construct a tree, and estimate bootstrap support.
A loop to create a script to run on the independent genes:
```
for i in *.fasta; do
echo iqtree2 -s $i -T 10 -m TEST -B 1000
done > full_iqtree.sh
```
```
#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate iqtree
iqtree -s magnoliap.align.fasta -ntmax 4 -m TEST -B 1000
```
After running IQ-TREE on all the genes, collapse branches with low support. Newick Utilities is installed under the iqtree environment on Curie; you do not need a job for this, it runs fast.
```
for i in *.treefile; do
nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```
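The collapsed gene trees then get pooled and handed to ASTRAL (step 11 of the workflow above). A minimal sketch; the jar filename/version is an assumption, so match it to your install:
```
# pool the collapsed gene trees into a single input file
cat *_BS10.treefile > gene_trees.tre
# estimate the species tree from the gene trees (jar name assumed)
java -jar astral.5.7.8.jar -i gene_trees.tre -o species_tree.tre 2> astral.log
```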
---
# Insert Notes on Pate HERE
---
# Disco
[DISCO](https://github.com/JSdoubleL/DISCO) Github
About (from the GitHub):
Decomposition Into Single-COpy gene trees (DISCO) is a method for decomposing multi-copy gene-family trees while attempting to preserve orthologs and discard paralogs.
I have multi-copy gene trees with diploids and tetra/hexaploids, and need to get the best copy of each gene for further downstream analyses.
Needed to start: ML gene trees.
```
python3 disco.py -i MKT362594.1.4_APOV_supercontig_trim.fasta.treefile -o MKT362594_single.tree -d __ --keep-labels --remove_in_paralogs --single_tree
```
The Python script pro_disco.py will run the above command on a directory of treefiles and output single-tree files in a new directory (a bash sketch of the same idea is below).
The Python script new2.py will use the allele copies located in the DISCO output to extract them from the multi-copy alignment files and place single copies of each species + gene in a new directory.
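I don't reproduce pro_disco.py here, but the equivalent bash loop would look roughly like this sketch (output directory name is an assumption):
```
# run DISCO on every ML gene tree, writing single-copy trees to a new directory
mkdir -p single_trees
for t in *.treefile; do
    python3 disco.py -i "$t" -o single_trees/"${t%.treefile}"_single.tree \
        -d __ --keep-labels --remove_in_paralogs --single_tree
done
```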
---
# For Hybridization Analyses Check out [here](https://hackmd.io/85GwrgNORuaKeG-jP8MVuQ)
---
# Extraction of SNPs from HybSeq Data
Attempting this protocol - https://github.com/lindsawi/HybSeq-SNP-Extraction
1. For each species, determine which sample has the longest supercontig; this is used as the reference. Note I did not filter or align this supercontig.
Code to figure out which supercontig is the longest:
## Below is important if you want to compare SNPs across multiple species
*If you are comparing across species like me, then I would create a consensus sequence across species to use as the reference, to avoid biasing SNPs toward any one species. To do this I chose species references based on low missing data (less than 40% and most below 15%), while also maximizing the number of positions (~800000). A hedged sketch of building the consensus follows the script below.*
```
from Bio import SeqIO
import os

# Directory containing your FASTA files
fasta_directory = "path/to/fasta/files"

# Dictionary to store file names and their site counts
site_counts = {}

# Loop through all FASTA files in the directory
for filename in os.listdir(fasta_directory):
    if filename.endswith(".fasta"):
        filepath = os.path.join(fasta_directory, filename)
        total_sites = 0
        # Process each sequence in the FASTA file
        for record in SeqIO.parse(filepath, "fasta"):
            total_sites += len(record.seq)
        # Store the total number of sites for the file
        site_counts[filename] = total_sites

# Display the results
for filename, sites in site_counts.items():
    print(f"{filename}: {sites} sites")
```
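For the cross-species consensus reference mentioned above, one route is to align the chosen per-species references and collapse the alignment into a single consensus sequence. A sketch only; the EMBOSS `cons` call is my assumption, not the tool confirmed in these notes:
```
# align the chosen per-species reference supercontigs (MAFFT, as used elsewhere in these notes)
mafft --auto --thread 4 chosen_references.fasta > chosen_references.aln.fasta
# collapse the alignment into one consensus sequence with EMBOSS cons (assumed tool)
cons -sequence chosen_references.aln.fasta -outseq consensus_reference.fasta -name consensus
```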
2. Get rid of any gaps in the supercontig by replacing them with Ns:
```sed 's/-/N/g' reference_with_gaps.fasta > reference_without_gaps.fasta```
3. Make a folder for each species and place the supercontig and the paired R1 and R2 Trimmomatic files in the folder.
Save this script to the folder:
```
#!/bin/bash
set -eo pipefail
eval "$(conda shell.bash hook)"
conda activate hybpiper
module load gatk/4.1.0
# Set variables
reference=$1
prefix=$2
read1fn="$prefix.R1.paired.fastq.gz"
read2fn="$prefix.R2.paired.fastq.gz"
# Create and move into the prefix directory
mkdir $prefix
cd $prefix
# Create sequence dictionary if it doesn't exist
if [ ! -f ../${reference%.*}.dict ]; then
    gatk CreateSequenceDictionary -R ../$reference
fi
# Index the reference (parent directory paths used)
bwa index ../$reference
samtools faidx ../$reference
# Align read files to reference sequence and map
bwa mem ../$reference ../$read1fn ../$read2fn | samtools view -bS - | samtools sort -o "$prefix.sorted.bam"
# Convert FASTQ to unmapped BAM
gatk FastqToSam -F1 ../$read1fn -F2 ../$read2fn -O $prefix.unmapped.bam -SM $prefix
# Replace read groups for mapped and unmapped BAM files
gatk AddOrReplaceReadGroups -I $prefix.sorted.bam -O $prefix.sorted-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
gatk AddOrReplaceReadGroups -I $prefix.unmapped.bam -O $prefix.unmapped-RG.bam -RGID 2 -RGLB lib1 -RGPL illumina -RGPU unit1 -RGSM $prefix
# Combine mapped and unmapped BAM files
gatk MergeBamAlignment --ALIGNED_BAM $prefix.sorted-RG.bam --UNMAPPED_BAM $prefix.unmapped-RG.bam -O $prefix.merged.bam -R ../$reference
# Mark duplicates
gatk MarkDuplicates -I $prefix.merged.bam -O $prefix.marked.bam -M $prefix.metrics.txt
samtools index $prefix.marked.bam
# Create GVCF
gatk HaplotypeCaller -I $prefix.marked.bam -O $prefix-g.vcf -ERC GVCF -R ../$reference
# Remove intermediate files
rm $prefix.sorted.bam $prefix.unmapped.bam $prefix.merged.bam $prefix.unmapped-RG.bam $prefix.sorted-RG.bam
```
The loop below is used to create a control file to run all samples in a folder:
```
#!/bin/bash
# Loop over all .fasta files in the current directory
# (the reference supercontig, e.g., x_proctoriana4333_A_S224NOGAP.fasta)
for fasta_file in *.fasta; do
    fasta="${fasta_file}"
    # Loop over the corresponding .fastq.gz read files
    for fastq_file in *.R1.paired.fastq.gz; do
        # Extract the sample name prefix by removing .R1.paired.fastq.gz from the end
        fastq_base="${fastq_file%.R1.paired.fastq.gz}"
        # Output the snpcaller command
        echo "bash snpcaller3.sh $fasta $fastq_base" >> capture3.sh
    done
done
```
Example line:
```bash snpcaller.sh x_proctoriana4333_A_S224NOGAP.fasta proctoriana4406_B_S161```
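To run the control file, the same background pattern as the other scripts should work: ``` nohup bash capture3.sh > capture3.log 2>&1 & ```
The linked protocol then continues with joint genotyping across the per-sample GVCFs; a minimal sketch of the standard GATK steps I would expect next (sample names and paths are assumptions):
```
# combine the per-sample GVCFs written by the snpcaller script (one -V per sample)
gatk CombineGVCFs -R reference_without_gaps.fasta \
    -V sampleA/sampleA-g.vcf -V sampleB/sampleB-g.vcf \
    -O combined.g.vcf
# joint-genotype the combined GVCF into a multi-sample VCF
gatk GenotypeGVCFs -R reference_without_gaps.fasta -V combined.g.vcf -O all_samples.vcf
```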