---
title: Lobeliad Phylogenomics Workflow
breaks: false
tags: HybSeq, John
---

# Lobeliad Workflow

> *The workflow for performing phylogenomic analyses on Lobeliad HybSeq data. Analyses were done on Northwestern's HPC, Quest. Software includes: Trimmomatic, HybPiper, MAFFT, trimAl, IQ-TREE, and ASTRAL.*

## 1. Assembly Preparations

### 1.1 De-duplicating Reads

The Lobeliad dataset had a unique problem (discovered during later downstream analyses) in that some of the raw-read files contained duplicate records (i.e. the exact same sequence, same ID, same name, etc.). To get rid of these, I used BBMap's `dedupe.sh` script. **This should not be necessary for most datasets.**

(*Note that this was done before moving files onto Quest, so this may be outdated & incorrect.*)

```bash=!
## Deduplicate identical reads in raw read files.
$ for file in *.fastq*;
$ do file2="${file//paired.fastq/paired_dd.fastq}"
$ echo "/home/jz/Desktop/bbmap/dedupe.sh in=$file out=$file2 sort=name ascending=t rmn=t ac=f -da -Xmx8g"
$ echo "rm $file"
$ done > dedupe.sh

#Then run the dedupe script with `bash dedupe.sh`
```

### 1.2 Trimmomatic

When beginning work with HybSeq data, typically the first step is to filter the raw reads. I filter out low-quality reads, short reads, and any lingering adapter sequence on the ends of the Lobeliad paired-end reads with [**Trimmomatic**](http://www.usadellab.org/cms/?page=trimmomatic).

To run Trimmomatic on an entire folder of raw reads, take advantage of a for loop to make a shell script `trimmomatic.sh`:

```bash!
##For loop to utilize trimmomatic on quest
$ for R1 in *R1.fastq*; do
$ echo "trimmomatic PE -phred33 $R1 ${R1//R1.fastq/R2.fastq} ${R1//R1.fastq/R1_paired.fastq} ${R1//R1.fastq/R1_unpaired.fastq} ${R1//R1.fastq/R2_paired.fastq} ${R1//R1.fastq/R2_unpaired.fastq} ILLUMINACLIP:/software/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:20 MINLEN:36"
$ done > trimmomatic.sh
```

:::danger
**Trimmomatic on Quest** is hard to call and kind of finicky... I just installed Trimmomatic myself in a conda environment and called it that way.
:::

Then run the shell script `trimmomatic.sh` on Quest as a job:

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --job-name=lob_stats
#SBATCH --error=lob_stats.log

eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics

bash ./trimmomatic.sh
```

### 1.3 Unpaired Reads

We use HybPiper to map & assemble these reads, but HybPiper only accepts one file for unpaired reads. We combine the two unpaired read files (generated for each sample) into one by running the following code in the folder containing the Trimmomatic outputs:

```bash!
##Concatenate the 2 unpaired read files into one for each sample
$ for file in *R1_unpaired*; do
$ cat $file ${file//R1/R2} > ${file//R1_unpaired/unpaired}
$ done
```

### 1.4 Bait File

> **IMPORTANT UPDATE!** The Mega-353 baits had very poor recovery across almost all individuals, so instead of using the "combined bait set", **we move forward with just the modified Campanula baits**. Ignore anything in this section about the filtered Mega-353.

A combination of two bait sets was used for the Lobeliad phylogenomics: (1) a modified baitset derived from the ***Campanula*** genus and (2) a filtered **Mega-Angiosperms353** baitset.

#### Modified Campanula Baits

The workflow for generating the modified Campanula baitset is as follows:

1. Run the Lobeliad dataset through HybPiper using the original Campanula baitset.
2. Identify Campanula loci that are paralogous using HybPiper's paralog analysis scripts (see the sketch after this list).
3. Create a new bait file including all versions of loci retrieved by HybPiper (meaning if a locus was flagged as paralogous, include the multiple versions of that locus in this new bait file, but as individual/separate loci).
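For step 2, HybPiper 2's `paralog_retriever` subcommand collects the putative paralog sequences across samples and writes a per-locus paralog report. A minimal sketch of that step, assuming it is run in the folder containing the HybPiper sample output folders (the baitset and namelist filenames below are placeholders, not the exact files used originally):

```bash
## Sketch: retrieve paralog sequences/report for every sample listed in namelist.txt,
## using the original Campanula baitset (filenames are assumptions).
hybpiper paralog_retriever namelist.txt -t_dna campanula_baits.fasta
```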
#### Filtered Mega-353

Included on the **[Mega-353 github page](https://github.com/chrisjackson-pellicle/NewTargets)** is a script for filtering the huge baitset. For the Lobeliad data, we filter this baitset to only include targets in the *Asterales* order but **excluding** the *Asteraceae* family. The select file below contains the codes used to generate the filtered Mega-353 baitset.

```bash!
[Target_name]
GIOY
IZLO
IHPC
HUQC
IXVJ
AUIP
FXGI
[Group]
[Order]
[Family]
[Species]
```

Running the `filter_mega353.py` script with `select_file.txt` generates the reduced Mega-353 file.

#### Final baitset

To generate the final baitset, I simply combine the two files together to generate `camp_mega_baits.fasta`.

> Again, Mega-353 targets had very low recovery, so **we move forward with just the modified Campanula baits**.

## 2. Assembly with HybPiper 2

**[HybPiper 2](https://github.com/mossmatters/HybPiper)** is used to assemble our raw Lobeliad data into nuclear loci. HybPiper requires the outputs of Trimmomatic (filtered read files), a namelist file (containing the names of all samples, one per line), and a bait file.

### 2.1 Assembling with a for loop

After making a separate directory for HybPiper's outputs, I run the following code to generate a script that runs HybPiper on all samples:

```bash!
##For loop where raw reads are in a different folder,
# variable ${x##*/} represents just sample name
$ for i in ~/p31911/lobeliad_rawreads/*R1_paired.fastq; do
$ x=${i%_R1_paired.fastq};
$ echo "hybpiper assemble -t_dna supercontig_baits.fasta -r $i ${i//R1/R2} --unpaired ${i//R1_paired.fastq/unpaired.fastq} --prefix ${x##*/} --bwa";
$ done > assemble.sh
```

Now, the HybPiper folder should contain the `assemble.sh` script just generated, a namelist.txt (not used for assembly, but later on), and the bait file. The HybPiper script can be posted as a job on Quest:

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --time=40:00:00
#SBATCH --mem=75G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --job-name=lob_assemble
#SBATCH --error=assemble.log

module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/hybpiper

###Run the script we made above
bash assemble.sh
```
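The namelist.txt itself is just one sample name per line. If you don't already have one, a minimal sketch for building it from the trimmed read files (assuming the same read folder and `_R1_paired.fastq` naming used in the loop above; adjust to match your files):

```bash
## Sketch: write one sample name per line to namelist.txt,
## derived from the trimmed R1 read filenames (assumed naming).
for i in ~/p31911/lobeliad_rawreads/*R1_paired.fastq; do
    x=${i%_R1_paired.fastq}   # strip the read suffix
    echo "${x##*/}"           # keep just the sample name
done > namelist.txt
```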
### 2.2 Assembly Statistics and Heatmap

HybPiper offers a number of scripts that generate summary statistics about the assembly and a heatmap to visualize how successful recovery was. [**Here**](https://drive.google.com/file/d/1LGC4ow0d_s4ZGXT_YS0XjB7GlMVtp90d/view?usp=sharing) is a link to the heatmap generated, where each row is a sample and each column is the recovery of a locus, with darker being better. The summary statistic files and heatmap were generated via this job submission script, executed in the HybPiper output folder.

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --time=01:30:00
#SBATCH --mem=8G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=lob_stats
#SBATCH --error=stats.log

module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/hybpiper

hybpiper stats -t_dna supercontig_baits.fasta gene namelist.txt
hybpiper recovery_heatmap seq_lengths.tsv
```

### 2.3 Dealing with missing data

There are a lot of white rows in this heatmap, which represent **individuals that had low recovery across all genes**. These individuals can weaken support or introduce error in phylogenies, so we should remove them from the dataset. The Python code below finds individuals for which fewer than 25% of loci were recovered (i.e. it allows up to 75% missing data) and reports them. Reach out if you need help.

```python=
import pandas as pd
import os

### set working directory (hybpiper output folder)
wd = "/projects/p31911/lobeliad_hybpiper"

###Read in dataframe, with minor modifications
seq_lengths_file = os.path.join(wd, 'seq_lengths.tsv')  #the seq_lengths file
df = pd.read_table(seq_lengths_file, sep="\t")          #turn into pandas df
temp_meanlen = df.iloc[0]                               #temporarily store mean length row
df.drop(index=df.index[0], axis=0, inplace=True)        #get rid of MeanLength row (the first row)
#df.shape                                               #see dimensions of df

###Figure out what rows (aka individuals) to drop
all_taxa = df['Species'].tolist()   #list of all taxa
drop_indexs = []                    #empty list to store index labels of taxa to drop
drop_taxa = []                      #empty list to store taxa to drop
updated_taxa_list = []              #empty list to store taxa to keep

for i in range(len(df.iloc[:, 0])):                #for each row
    #print(df.iloc[i,0], ":", (df.iloc[i,:]==0).sum())  #print species name: number of zeroes
    if (df.iloc[i, :] == 0).sum() >= 302:          #if num of zeroes in row is at least 302 (out of 403 loci, ~75%)
        drop_indexs.append(df.index[i])            #add index label of row to list
        drop_taxa.append(df.iloc[i, 0])            #add taxa name to list

#updated_df = df.drop(drop_indexs)       # make new dataframe without rows in drop list
#updated_df.loc[0] = temp_meanlen        # add mean length row from original df back
#updated_df = updated_df.sort_index()    # sorting by index

df = df.drop(drop_indexs)   # update dataframe by removing rows in the drop list
df.loc[0] = temp_meanlen    # add mean length row from original df back
df = df.sort_index()        # sorting updated df by index

for taxa in all_taxa:       #make list of kept taxa for updated namelist.txt file
    if taxa not in drop_taxa:
        updated_taxa_list.append(taxa)

###Write some new files
removed_taxa_file = os.path.join(wd, 'removed_taxa.txt')
with open(removed_taxa_file, 'w') as f:   #a file that lists which taxa were removed
    f.write("A list of individuals that were removed from namelist file due to having poor gene recovery\n\n")
    for taxa in drop_taxa:
        f.write("%s\n" % taxa)
    f.write("\nTotal removed: " + str(len(drop_taxa)) + "\n")

updated_namelist_file = os.path.join(wd, "namelist_filtered.txt")   #an updated namelist containing only kept taxa
with open(updated_namelist_file, 'w') as f:
    for taxa in updated_taxa_list:
        f.write("%s\n" % taxa)

filtered_seqlen = os.path.join(wd, 'seq_lengths_filtered.tsv')
df.to_csv(filtered_seqlen, sep="\t")
```

Using the two files generated, we can look at `removed_taxa.txt` to see which (and how many) individuals were filtered out, and move these individuals' folders to a temporary folder. The `namelist_filtered.txt` file is the original namelist file minus the removed taxa.
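One way to set those sample folders aside (a minimal sketch, run in the HybPiper output folder; the `removed_samples/` directory name is arbitrary, and the `-d` test skips the header and "Total removed" lines of `removed_taxa.txt`):

```bash
## Sketch: move each removed individual's HybPiper folder into removed_samples/
mkdir -p removed_samples
while read -r taxon; do
    # only lines that name an existing sample directory get moved
    [ -d "$taxon" ] && mv "$taxon" removed_samples/
done < removed_taxa.txt
```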
Now, **when we run downstream commands that use the filtered namelist**, we only use the kept taxa.

### 2.4 Adding Outgroup/Downloading Data from NCBI

Originally we didn't have an outgroup for this dataset, so we later added one recommended to us by Andy. We use four taxa closely related to *Brighamia*: _C. rhodensis_, _C. lycica_, _C. simulans_, and _C. erinus_. Jeremie gave me the raw reads of two of these species, and Andy gave us the accession numbers for the other two: __SRR5205482__ and __SRR5205483__. We use the `sratoolkit` module and the `fasterq-dump` command to download the sequences directly onto Quest. To use sratoolkit, you first have to [configure it](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration) for your own user on Quest, and then use the [fasterq-dump](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) command. Once all the outgroup files were obtained, they were assembled with HybPiper as well.

### 2.5 Retrieve Sequences

Once we've both reduced missing data and added the outgroup to our assemblies, we can retrieve the sequences assembled for each individual at each locus using HybPiper's `retrieve_sequences` command:

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=04:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_seqs
#SBATCH --error=lob_seqs.log

eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/hybpiper

###Make folder and retrieve sequences
cd /projects/p31911/lobeliad_hybpiper/
mkdir seqs
hybpiper retrieve_sequences dna -t_dna supercontig_baits.fasta --sample_names namelist_filtered_25.txt --fasta_dir ./seqs
```

This puts a file for each locus in the folder `seqs`; each file contains every individual's sequence for that locus, if recovered.

## 3. Generating Coalescence Phylogeny

Now we have a file of all individuals' sequences for each locus, but there are still many steps before we can use this data to infer a coalescent species tree. Coalescence methods generate an evolutionary history of the taxa for each locus independently (a gene tree), then use all of the gene trees together to infer a species tree. **Here is the workflow for doing so**:

1) Align and trim gene fastas using MAFFT and trimAl
2) Use IQ-TREE to generate gene trees from aligned gene fastas
3) Use Newick Utilities to modify/filter gene trees
4) Feed gene trees into ASTRAL and infer a species tree

### 3.1 Aligning and trimming gene fastas

The length of sequence recovered at a locus differs from individual to individual. These sequences need to be aligned before software like IQ-TREE can generate gene trees. I use [**MAFFT**](https://mafft.cbrc.jp/alignment/software/) to align each locus's fasta first, using the `--auto` option. In a folder containing the retrieved sequences, use a for loop to run MAFFT on all loci:

```bash!
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --time=14:00:00
#SBATCH --mem=48G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --job-name=coal_align
#SBATCH --error=align.log

###Load module
module load mafft

###For loop to run MAFFT on unaligned loci fasta
for i in *.fasta; do
    mafft --auto --thread 12 ${i} > ${i%.*}_aligned.fasta;
done
```

When sequences are aligned, a lot of gaps are introduced, especially towards the beginnings/ends of sequences. This is because all sequences must end up being the same length, which is at minimum the length of the longest sequence in the file.
We can get rid of base-pair positions that are mostly gaps using [**trimAl**](http://trimal.cgenomics.org/trimal). Use `-gt 0.5` to remove base-pair positions that are more than 50% gaps. Note that I installed trimAl in my conda environment instead of using modules on Quest.

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log

eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics

for i in *_aligned.fasta; do
    trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```

### 3.2 Generating gene trees from aligned loci files

Once we have aligned gene sequences, we use [**IQ-TREE**](http://www.iqtree.org/doc/) to generate gene trees that represent the evolutionary history of each gene. I've installed IQ-TREE in my 'genomics' conda environment.

In our approach, we let IQ-TREE decide which substitution model is best and perform ultrafast bootstrapping to generate support values for branches. This procedure takes a very long time, so we will submit a job to Quest **as an array**. Running an array of jobs allows us to run multiple jobs of the same type simultaneously. In essence, instead of running one IQ-TREE job at a time, *we can run up to 9 simultaneously*.

The "base" IQ-TREE command will look something like this:

```bash
iqtree -s <input file> -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
```

- `-nt AUTO` and `-ntmax 4`: decide the optimal number of threads to use, with a maximum of 4
- `-m TEST`: search for the best substitution model before generating the gene tree
- `-bb 1000`: ultrafast bootstrap with 1000 replicates
- `-wbtl`: write bootstrap trees with branch lengths
- `-czb`: collapse near-zero branches
- see more about arguments [here](http://www.iqtree.org/doc/Command-Reference)

To generate a list of IQ-TREE commands to run on all aligned gene files, we use a for loop again:

```bash
for i in *.fasta; do
    echo iqtree -s $i -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
done > full_iqtree.sh
```

Because we want to run this job as an array, we need to split this `full_iqtree.sh` file into multiple files, each containing a subset of the IQ-TREE commands. Use the split command:

```bash
split -a 1 -d -l 45 full_iqtree.sh iqtree --additional-suffix=.sh
```

:::spoiler Arguments for split
- `-a 1`: label split files with a **one**-character suffix appended to the base name
- `-d`: label split files with numbers instead of letters, starting from 0
- `-l 45`: split the original file into smaller files, each with at most **45 lines**
- `full_iqtree.sh`: the file to split
- `iqtree`: base name of the split files
- `--additional-suffix=.sh`: add a ".sh" to the end of the base name + numeric suffix
:::

> If my original file has 403 iqtree commands, this ends up generating 9 subfiles of up to 45 lines each (the last holds the remaining 43), called `iqtree0.sh iqtree1.sh iqtree2.sh ... iqtree8.sh`.

Next we create a simple text file called `input_args.txt` that contains the names of the **subfiles**, so that it looks like this:

```bash
iqtree0.sh
iqtree1.sh
iqtree2.sh
iqtree3.sh
iqtree4.sh
iqtree5.sh
iqtree6.sh
iqtree7.sh
iqtree8.sh
```

Now we can write a job submission script for Quest. Note that we need to include an `#SBATCH --array=0-8` line to tell Quest we're running an array of 9 jobs.
Here is what the submission script will look like:

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --array=0-8
#SBATCH --time=30:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log

module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics

IFS=$'\n' read -d '' -r -a input_args < input_args.txt

bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```

The `IFS` line reads in the `input_args.txt` file and generates a bash array, where each element is one of the `iqtreeX.sh` subfiles. Each array task then runs one of the subfiles with `bash`, so all 9 sub-jobs run simultaneously.

:::info
**Note** that the `#SBATCH` resource flags apply to each array task individually: each sub-job here gets 4 CPUs and 4 GB of memory, and no sub-job can run for more than 30 hours.
:::

#### IQ-TREE Outputs

For each gene fasta, IQ-TREE will generate many files. Ultimately, the file we are interested in is the `.treefile` file. I'd recommend generating a new folder and moving all the gene tree files into it, so you don't have to work in a directory containing thousands of files:

```bash
mkdir ../genetrees
mv *.treefile ../genetrees
```

### 3.3 Removing branches with low support

**Newick Utilities** is a collection of tools for working with phylogenetic trees. We use the `nw_ed` tool to collapse branches in each gene tree that have very low support (<=10%) with the following command:

```bash
nw_ed <input genetree> 'i & b<=10' o > <output genetree>
```

To run it over all of the gene tree files, use the following loop. This does not need to be submitted via Slurm on Quest because it runs quickly:

```bash
for i in *.treefile; do
    nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```

### 3.4 Using ASTRAL to infer a species tree

[**ASTRAL**](https://github.com/smirarab/ASTRAL) infers a species tree from a set of gene trees. First, all gene trees filtered through Newick Utilities are concatenated into a single file:

```bash
cat *_BS10.treefile > lobeliad_coal25_genetrees.tre
```

We can call ASTRAL using the following syntax (with `-t 2` to record quartet support) in a Slurm job:

```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_lob
#SBATCH --error=astral.log

module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics

java -jar ~/tools/astral/astral.5.7.8.jar -i lobeliad_coal25_genetrees.tre -t 2 -o lobeliad_coal25_sptre.tre 2>astral.log
```

## 4. Generating ML Phylogeny