---
title: Cyanea HybPiper 2 Workflow
breaks: false
tags: HybSeq; Lobelioids
---
> Checking some of the Cyanea data (3/11/2024, for Adam), following work by Johnathan Zhang (@Jzhang).
[Workflow used by Johnathan](https://hackmd.io/@CBG-genetics-team/SJdno0Hqs/)
Folder on Quest: `/projects/p31927/Cyanea/`
### 2.2 Assembly Statistics and Heatmap
HybPiper offers a number of scripts that generate summary statistics about the assembly and a heatmap to visualize how successful recovery was. [**Here**](https://drive.google.com/file/d/1LGC4ow0d_s4ZGXT_YS0XjB7GlMVtp90d/view?usp=sharing) is a link to the heatmap generated, where each row is a sample and each column is a locus; darker cells indicate more complete recovery.
The summary statistic files and heatmap were generated via this job submission script, executed in the HybPiper output folder.
```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --time=01:30:00
#SBATCH --mem=8G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=lob_stats
#SBATCH --error=stats.log
module load hybpiper
###Generate per-sample recovery statistics (writes hybpiper_stats.tsv and seq_lengths.tsv)
hybpiper stats -t_dna supercontig_baits.fasta gene namelist_cy.txt
###Plot the recovery heatmap from the sequence lengths table
hybpiper recovery_heatmap seq_lengths.tsv
```
### 2.3 Dealing with missing data
There are a lot of white rows in this heatmap, which represent **individuals that had low recovery across all genes**. These individuals can weaken support or introduce error in phylogenies, so we should remove them from the dataset. The Python code below finds individuals for which fewer than 25% of loci were recovered (i.e., it allows up to 75% missing data) and reports them. Reach out if you need help.
```python=
import pandas as pd
import os
### set working directory (hybpiper output folder)
wd = "/projects/p31911/lobeliad_hybpiper"
###Read in dataframe, with minor modifications
seq_lengths_file = os.path.join(wd, 'seq_lengths.tsv') #the seq_lengths file
df = pd.read_table(seq_lengths_file, sep="\t") #turn into pandas df
temp_meanlen = df.iloc[0] #temporarily store mean length row (ignore)
df.drop(index=df.index[0], axis=0, inplace=True) #get rid of MeanLength row (the first row)
#df.shape #see dimensions of df
###Figure out what rows (aka individuals) to drop
all_taxa = df['Species'].tolist() #list of all taxa
drop_indexs = [] #empty list to store index of taxa to drop
drop_taxa = [] #empty list to store taxa to drop
updated_taxa_list=[] #empty list to store taxa to keep
for i in range(len(df)): #for each row
    #print(df.iloc[i,0], ":", (df.iloc[i,:]==0).sum()) #print species name: number of zeroes
    if (df.iloc[i,:]==0).sum() >= 302: #if num of zeroes in row is at least 302 (out of 403 loci, ~75% missing)
        drop_indexs.append(df.index[i]) #store the row's index label (not its position) so df.drop() removes the right row
        drop_taxa.append(df.iloc[i,0]) #add taxa name to list
df = df.drop(drop_indexs) # update dataframe by removing rows in index drop list
df.loc[0] = temp_meanlen # add mean length row from original df back
df = df.sort_index() # sort updated df by index so the MeanLength row is first again
for taxa in all_taxa: #make list of kept taxa for updated namelist.txt file
    if taxa not in drop_taxa:
        updated_taxa_list.append(taxa)
###Write some new files
removed_taxa_file = os.path.join(wd, 'removed_taxa.txt')
with open(removed_taxa_file, 'w') as f: #a file that contains which taxa were removed
    f.write("A list of individuals that were removed from namelist file due to having poor gene recovery\n\n")
    for taxa in drop_taxa:
        f.write("%s\n" % taxa)
    f.write("\nTotal removed: " + str(len(drop_taxa)) + "\n")
updated_namelist_file = os.path.join(wd, "namelist_filtered.txt") #an updated namelist file containing only kept taxa
with open(updated_namelist_file, 'w') as f:
    for taxa in updated_taxa_list:
        f.write("%s\n" % taxa)
filtered_seqlen = os.path.join(wd, 'seq_lengths_filtered.tsv')
df.to_csv(filtered_seqlen, sep="\t", index=False) #write filtered table without the pandas index, keeping the original format
```
Using the two files generated, we can check `removed_taxa.txt` to see which (and how many) individuals were filtered out, and move these individuals' folders to a temporary folder, as sketched below. The `namelist_filtered.txt` file is the original namelist minus the removed taxa. Now, **when we run downstream commands that use the filtered namelist**, we only use the kept taxa.
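For example, a minimal sketch for moving the removed samples out of the way, assuming it is run in the HybPiper output folder (where each sample's assembly lives in a folder named after it) and that `removed_taxa.txt` has the header and summary lines written by the script above:
```bash
mkdir -p removed_samples
###Skip the header (lines 1-2), blank lines, and the trailing "Total removed" summary
tail -n +3 removed_taxa.txt | grep -v -e '^$' -e '^Total removed' | while read -r taxon; do
mv "$taxon" removed_samples/
done
```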
### 2.4 Adding an Outgroup / Downloading Data from NCBI
Originally we didn't have an outgroup for this dataset, so we later added one recommended to us by Andy. We use 4 taxa closely related to *Brighamia*: _C. rhodensis_, _C. lycica_, _C. simulans_, and _C. erinus_. Jeremie gave me the raw reads of two of these species, and Andy gave us the accession numbers for the other two: __SRR5205482__ and __SRR5205483__. We use the `sratoolkit` module and the `fasterq-dump` command to download the sequences directly onto Quest. To use sratoolkit, first [configure it](https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration) for your own user on Quest, then use the [fasterq-dump](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) command. Once all the outgroup read files were obtained, they were assembled with HybPiper as well.
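For reference, a minimal download sketch (assuming sratoolkit has already been configured; the thread count and output folder here are arbitrary):
```bash
module load sratoolkit
mkdir -p outgroup_reads
###Download each accession as paired fastq files
for acc in SRR5205482 SRR5205483; do
fasterq-dump --split-files --threads 4 --outdir ./outgroup_reads "$acc"
done
```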
### 2.5 Retrieve Sequences
Once we've both reduced missing data and added the outgroup to our assemblies, we can retrieve the assembled sequences for each locus using HybPiper's `retrieve_sequences` command, pointing `--sample_names` at a namelist containing the kept taxa plus the outgroups:
```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=04:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_seqs
#SBATCH --error=lob_seqs.log
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/hybpiper
###Make folder and retrieve sequences
cd /projects/p31911/lobeliad_hybpiper/
mkdir seqs
hybpiper retrieve_sequences dna -t_dna supercontig_baits.fasta --sample_names namelist_filtered_25.txt --fasta_dir ./seqs
```
This puts one file per locus in the `seqs` folder; each file contains the sequence of every individual for which that locus was recovered.
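A couple of quick sanity checks on the retrieved files (assuming HybPiper names the nucleotide files `<locus>.FNA`, as it does for the `dna` sequence type):
```bash
ls seqs/*.FNA | wc -l #number of loci retrieved
for i in seqs/*.FNA; do
echo "$i: $(grep -c '>' $i) individuals" #number of sequences per locus file
done
```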
## 3. Generating Coalescent Phylogeny
Now we have files of every individual's sequence for each locus, but there are still many steps before we can use these data to infer a coalescent species tree. Coalescent methods generate an evolutionary history for each locus independently (a gene tree), then use all of the gene trees together to infer a species tree. **Here is the workflow for doing so**:
1) Align and trim gene fastas using MAFFT and trimAl
2) Use IQTree to generate gene trees from the aligned gene fastas
3) Use Newick Utilities to modify/filter the gene trees
4) Feed gene trees into Astral and infer a species tree
### 3.1 Aligning and trimming gene fastas
The length of the sequence recovered at a given locus will differ among individuals. These sequences need to be aligned before software like IQTree can generate gene trees. I use [**MAFFT**](https://mafft.cbrc.jp/alignment/software/) to align each locus's fasta, using the `--auto` option.
In the folder containing the retrieved sequences, use a for loop to run MAFFT on all loci:
```bash!
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --time=14:00:00
#SBATCH --mem=48G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=12
#SBATCH --job-name=coal_align
#SBATCH --error=align.log
###Load module
module load mafft
###For loop to run MAFFT on unaligned loci fasta
for i in *.fasta; do
mafft --auto --thread 12 ${i} > ${i%.*}_aligned.fasta;
done
```
When sequences are aligned, a lot of gaps are introduced, especially toward the beginnings/ends of sequences. This is because all sequences must end up the same length, which is at minimum the length of the longest sequence in the file. We can remove alignment columns that are mostly gaps using [**trimAl**](http://trimal.cgenomics.org/trimal); `-gt 0.5` removes columns with gaps in more than 50% of sequences. Note that I installed trimAl in my conda environment instead of using modules on Quest.
```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --mem=2G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=lob_trim
#SBATCH --error=trim.log
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics
###Trim only the aligned fastas (the raw, unaligned fastas sit in the same folder)
for i in *_aligned.fasta; do
trimal -in $i -out ${i//aligned/trim} -gt 0.5
done
```
### 3.2 Generating gene trees from aligned loci files
Once we have aligned gene sequences, we use [**IQTree**](http://www.iqtree.org/doc/) to generate gene trees that represent the evolutionary history of each gene. I've installed IQTree in my 'genomics' conda environment. In our approach, we let IQTree decide which substitution model is best and perform ultrafast bootstrapping to generate support values for branches. This procedure takes a very long time, so we will submit the job to Quest **as an array**. Running an array of jobs allows us to run multiple jobs of the same type simultaneously: in essence, instead of running one IQTree job at a time, *we can run up to 9 simultaneously*. The "base" IQTree command will look something like this:
```bash
iqtree -s <input file> -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
```
- `-nt AUTO` and `-ntmax 4`: decide the optimal number of threads to use, with a maximum of 4
- `-m TEST`: searches for best substitution model before generating gene tree
- `-bb 1000`: ultrafast bootstrap with 1000 replicates
- `-wbtl`: write branch lengths
- `-czb`: collapse near-zero branches
- see more about the arguments [here](http://www.iqtree.org/doc/Command-Reference)
To generate a list of IQTree commands to run on all trimmed alignment files, we use a for loop again:
```bash
for i in *_trim.fasta; do
echo iqtree -s $i -nt AUTO -ntmax 4 -m TEST -bb 1000 -wbtl -czb
done > full_iqtree.sh
```
Because we want to run this job as an array, we need to split `full_iqtree.sh` into multiple files, each containing a subset of the IQTree commands. Use the `split` command:
```bash
split -a 1 -d -l 45 full_iqtree.sh iqtree --additional-suffix=.sh
```
:::spoiler Arguments for split
- `-a 1`: use suffixes of length **one** on the base name (here digits, because of `-d`)
- `-d`: label split files with numbers instead of letters, starting from 0
- `-l 45`: split the original file into smaller files, each with **45 lines**
- `full_iqtree.sh`: the file to split
- `iqtree`: base name of the split files
- `--additional-suffix=.sh`: add a ".sh" to the end of base name + one letter extension
:::
Since the original `full_iqtree.sh` file has 403 IQTree commands, this generates 9 subfiles named `iqtree0.sh iqtree1.sh iqtree2.sh ... iqtree8.sh` (eight with 45 commands each and a ninth with the remaining 43). Next we create a simple text file called `input_args.txt` that contains the names of the **subfiles**, so that it looks like this:
```bash
iqtree0.sh
iqtree1.sh
iqtree2.sh
iqtree3.sh
iqtree4.sh
iqtree5.sh
iqtree6.sh
iqtree7.sh
iqtree8.sh
```
Now we can write a job submission script for Quest. Note that we need to include a `#SBATCH --array=0-8` line to tell Quest we're running an array of 9 jobs. Here is what the submission script will look like:
```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=normal
#SBATCH --array=0-8
#SBATCH --time=30:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=iqtree_array
#SBATCH --error=iqtree.log
module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics
IFS=$'\n' read -d '' -r -a input_args < input_args.txt
bash ${input_args[$SLURM_ARRAY_TASK_ID]}
```
The `IFS` line reads `input_args.txt` into a bash array, where each element is one of the `iqtreeX.sh` subfiles. Each array task then runs `bash` on the element indexed by its `$SLURM_ARRAY_TASK_ID`, so all 9 sub-jobs run simultaneously.
::: info
**Note** that the `#SBATCH` resource flags apply to each sub-job of the array independently: each sub-job here gets 4 CPUs and 4 GB of memory, and no sub-job can run for more than 30 hours.
:::
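To keep an eye on the array while it runs, `squeue` can filter on the job name (a minimal sketch; adjust the user and job name as needed):
```bash
squeue -u $USER --name=iqtree_array
```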
#### IQTree Outputs
For each gene fasta, IQTree will generate many files. Ultimately, the file we are interested in is the `.treefile`. I'd recommend creating a new folder and moving all the gene tree files into it, so you don't have to work in a directory containing thousands of files:
```bash
mkdir ../genetrees
mv *.treefile ../genetrees
```
### 3.3 Removing branches with low support
**Newick Utilities** is a collection of tools for working with phylogenetic trees. We use the `nw_ed` tool to collapse branches in each gene tree that have very low support (≤10%) with the following command:
```bash
nw_ed <input genetree> 'i & b<=10' o > <output genetree>
```
To run it over all of the gene tree files, use the following loop. This does not need to be submitted via SLURM on Quest because it is relatively lightweight:
```bash
for i in *.treefile; do
nw_ed $i 'i & b<=10' o > ${i//.treefile/_BS10.treefile}
done
```
### 3.4 Using Astral to infer a species tree
[**Astral**](https://github.com/smirarab/ASTRAL) is a program for inferring a species tree from gene trees. First, all of the gene trees filtered through Newick Utilities (the `_BS10` files) are concatenated into a single file.
```bash
cat *_BS10.treefile > lobeliad_coal25_genetrees.tre
```
We can call Astral using the following syntax (with `-t 2` to record quartet support) in a SLURM job:
```bash
#!/bin/bash
#SBATCH --account=p31911
#SBATCH --partition=short
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=astral_lob
#SBATCH --error=astral_job.log
module purge
eval "$(conda shell.bash hook)"
conda activate /home/jzy0986/.conda/envs/genomics
java -jar ~/tools/astral/astral.5.7.8.jar -i lobeliad_coal25_genetrees.tre -t 2 -o lobeliad_coal25_sptre.tre 2>astral.log
```
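As a quick sanity check on the Astral output, Newick Utilities (used above) can count the leaves in the species tree; this should equal the number of individuals in the filtered namelist plus the outgroups:
```bash
nw_labels -I lobeliad_coal25_sptre.tre | wc -l #-I skips internal node labels
```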
## 4. Generating ML Phylogeny