---
title: John Paralog Project
breaks: false
tags: HybSeq, John
---
> This document is a log of work on my Master's project, in which I compare three recently developed methods for identifying paralogs in HybSeq data.
> Raw sequencing reads are assembled using **[HybPiper](https://bioone.org/journals/applications-in-plant-sciences/volume-4/issue-7/apps.1600016/HybPiper--Extracting-Coding-Sequence-and-Introns-for-Phylogenetics-from/10.3732/apps.1600016.full)**, github found **[here](https://github.com/mossmatters/HybPiper)**.
>
> Email: johnathanzhang2022@u.northwestern.edu
> [color=#907bf7]
# HybPiper
## Trimming raw reads
I use Trimmomatic with the following flags to trim the raw sequences downloaded from Kew.
```bash!
#for one sample...
java -jar /home/jz/Tools/trimmomatic/trimmomatic-0.39.jar PE -phred33 acaena_tenera_R1.fastq.gz acaena_tenera_R2.fastq.gz acaena_tenera_R1_paired.fastq.gz acaena_tenera_R1_unpaired.fastq.gz \
    acaena_tenera_R2_paired.fastq.gz acaena_tenera_R2_unpaired.fastq.gz ILLUMINACLIP:/home/jz/Tools/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40
#for loop
for R1 in *R1*
do
R2=${R1//R1.fastq.gz/R2.fastq.gz}
R1p=${R1//.fastq/_paired.fastq}
R1u=${R1//.fastq/_unpaired.fastq}
R2p=${R2//.fastq/_paired.fastq}
R2u=${R2//.fastq/_unpaired.fastq}
echo java -jar /home/jz/Tools/trimmomatic/trimmomatic-0.39.jar PE -phred33 $R1 $R2 $R1p $R1u $R2p $R2u ILLUMINACLIP:/home/jz/Tools/trimmomatic/adapters/TruSeq3-PE.fa:2:30:10:2:true LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20 MINLEN:40 >> trimmomatic.sh
done
#to concatenate the unpaired files for each sample
for file in *R1_unpaired*
do
file2=${file//R1/R2}
file3=${file//R1_unpaired/unpaired}
cat $file $file2 > $file3
done
```
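The `${var//pattern/replacement}` expansions do all the filename work in the loop above; they can be sanity-checked on a throwaway name without any real read files (the sample name below is just the one from the single-sample example):
```bash
#!/usr/bin/env bash
# Sanity-check the filename derivations used in the Trimmomatic loop
# on a dummy name - no real files are needed.
R1="acaena_tenera_R1.fastq.gz"
R2=${R1//R1.fastq.gz/R2.fastq.gz}   # mate file
R1p=${R1//.fastq/_paired.fastq}     # reads whose mate survived trimming
R1u=${R1//.fastq/_unpaired.fastq}   # reads whose mate was dropped
echo "$R2"    # acaena_tenera_R2.fastq.gz
echo "$R1p"   # acaena_tenera_R1_paired.fastq.gz
echo "$R1u"   # acaena_tenera_R1_unpaired.fastq.gz
```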
## Running HybPiper 1.3
Potentilla reads were contained in 2 large fastq files. Files must be gunzipped first. To run HybPiper on them:
```bash=!
python ~/Tools/HybPiper/reads_first.py -b kew_probes_Malus_exons_concat.fasta -r PAFTOL_015295_R1.fastq PAFTOL_015295_R2.fastq --prefix PAFTOL_015295 --bwa
#to run it as a loop
for file in *R1_paired*
do
file2=${file//R1/R2}
file3=${file//R1_paired_dd.fastq/unpaired.fastq}
file4=${file//_R1_paired_dd.fastq/}
echo python /home/jz/Desktop/HybPiper/reads_first.py -r $file $file2 --unpaired $file3 -b supercontig_baits.fasta --prefix $file4 --bwa >> reads_first.sh
done
```
#### Retrieve sequences
First make a new folder, called `gene_fastas`, to house the FNA (fasta nucleotide) files. Then run:
```bash!
python ~/Tools/hybpiper1/retrieve_sequences.py ~/.../Malinae_optimized_353.fasta ~/.../gene_fastas dna
```
### Investigating paralogs
#### Using paralog_investigator.py
To run `paralog_investigator` on all the samples in namelist.txt
```bash=!
while read i
do
echo $i
python ../paralog_investigator.py $i
done < namelist.txt
```
#### Using paralog_retriever.py
To run `paralog_retriever.py` on the genes outputted from `paralog_investigator.py`:
```bash=!
#one gene at a time
python ../paralog_retriever.py namelist.txt gene002 > gene002.paralogs.fasta
#use parallel for all at once
parallel "python paralog_retriever.py namelist.txt {} > {}.paralogs.fasta" ::: gene002 gene006 gene030
```
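If GNU parallel isn't available, the same per-gene redirection can be generated with a plain loop, echoing the commands into a script first (the same pattern used elsewhere in this log; the gene list is just the example above):
```bash
#!/usr/bin/env bash
# Plain-loop alternative to the parallel call: write one
# paralog_retriever command per gene into a script, then run it.
for gene in gene002 gene006 gene030
do
    echo "python ../paralog_retriever.py namelist.txt $gene > $gene.paralogs.fasta"
done > retriever_loop.sh
# bash retriever_loop.sh   #run once the commands look right
```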
## Running HybPiper 2.0
#### Assembly
```bash!
for i in *R1_paired.fastq.gz
do
    f2=${i//R1/R2}
    f3=${i//R1_paired.fastq.gz/unpaired.fastq.gz}
    f4=${i//_R1_paired.fastq.gz/}
    #double quotes (not single quotes) so the variables expand
    echo "hybpiper assemble -t_dna Malinae_optimized_353.fasta -r $i $f2 --unpaired $f3 --prefix $f4 --bwa"
done > assemble.sh
#Alternatively,
while read name; do
    echo "hybpiper assemble -t_dna camp_mega_baits.fasta -r ../lobeliad_rawreads/$name*_paired.fastq --unpaired ../lobeliad_rawreads/$name*_unpaired.fastq --prefix $name --bwa" >> hybpiper.sh
done < namelist.txt
bash assemble.sh #or hybpiper.sh for the second loop
```
#### Summary statistics & heatmap
```bash!
hybpiper stats -t_dna Malinae_optimized_353.fasta gene namefile.txt
hybpiper recovery_heatmap seq_lengths.tsv
hybpiper retrieve_sequences dna -t_dna Malinae_optimized_353.fasta --sample_names namefile.txt
```
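The seq_lengths.tsv produced by `hybpiper stats` can be mined for a quick per-sample recovery count with awk. This is a sketch assuming a tab-separated layout of a header row of gene names followed by one row per sample; it runs on a toy file here so nothing real is touched:
```bash
#!/usr/bin/env bash
# Count zero-length (unrecovered) genes per sample in a seq_lengths-style
# table. The toy file stands in for the real seq_lengths.tsv.
printf 'Species\tg1\tg2\tg3\nsampleA\t0\t300\t0\nsampleB\t250\t0\t120\n' > seq_lengths_toy.tsv
awk -F'\t' 'NR > 1 {zeros = 0; for (i = 2; i <= NF; i++) if ($i == 0) zeros++; print $1, zeros}' seq_lengths_toy.tsv
# prints: sampleA 2, then sampleB 1
```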
# HybPhaser
```bash=!
cd ~/Data/potentilla_hybpiper
bash ~/Tools/HybPhaser/1_generate_consensus_sequences.sh -n name_file.txt -p ~/Data/potentilla_hybpiper -o ~/Data/potentilla_hybphaser
cd ~/Tools/HybPhaser
Rscript 1a_count_snps.R
Rscript 1b_assess_dataset.R
Rscript 1c_generate_sequence_lists.R
```
### Preliminary results
I test-ran HybPhaser using PAFTOL (potentilla) data and the test dataset Crataegus_remotilobata. I used HybPiper's paralog_investigator and HybPhaser's summary table to assess paralog counts.
- For PAFTOL: HybPiper reported 0 paralogous loci; HybPhaser reported 0 paralog warnings
- For Crataegus: HybPiper reported 5 paralogous loci; HybPhaser reported 0 paralog warnings
:::danger
This is outdated now: an updated version of 1c_generate_sequence_lists.R has been made available on the github. Ignore below.
**1c_generate_sequence_lists.R**: Parts of the script expect R objects to exist (like failed_samples and failed_loci), but I couldn't find where they are supposed to be created, and these two did not exist for me. This is an issue when the script calls on these objects like so:
```r= !
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
```
This line returns the error `object 'failed_loci' not found`. I'm not sure whether my fix still produces the script's intended outcome, but I added a step to check for the existence of the object; if it doesn't exist, I create an empty vector for it:
```r= !
if(exists("failed_loci")) {
loci_to_remove <- c(names(failed_loci), outloci_missing, outloci_para_all)
} else {
loci_to_remove <- c()
failed_loci <- c() #this is added because the script still calls on failed loci downstream by checking if the length of it is > 0
}
```
This solution was applied everywhere the R script expected the existence of an object that did not exist for me (specifically the objects `failed_samples` and `failed_loci`).
:::
HybPhaser automatically deletes loci flagged as paralogous, so to retrieve their consensus sequences I first ran Rscript 1b with "remove_loci_for_all_samples_with_more_than_this_proportion_SNPs" set to "outliers". This generates a report showing which loci are outliers. I then move that output folder aside and re-run Rscript 1b with the same argument set to "none" instead of "outliers". That way, when I run the next script (1c), it doesn't get rid of the paralogous loci when generating consensus sequences.
# Putative Paralog Detection (PPD)
**Step 1:** Run HybPiper, then run intronerate.py after reads_first.py on each sample
```bash=
#to run intronerate on single sample (post hybpiper)
python ~/Tools/HybPiper/intronerate.py --prefix PAFTOL_015295
#to run intronerate on all samples (post hybpiper)
while read name; do
python ~/Tools/hybpiper1/intronerate.py --prefix $name;
done < namefile.txt
#cleanup.py loop
while read name; do
python ~/Tools/hybpiper1/cleanup.py $name;
done < namefile.txt
```
:::danger
To correctly generate files for Step 2, intronerate must be run and the supercontig files must be used downstream; only supercontig files contain both the sample name AND the gene name for each sequence. Having only sample names causes many problems downstream. This works on HybPiper 1 only.
:::
**Step 2:** Run Step1.sh of ppd:
```bash=!
~/Data/rosaceae_ppd$ bash ~/Tools/putative_paralog/Step1.sh ./ ./name_file.txt
```
**Step 3:** Run Step2.sh of PPD. For this step I had to rename R1_paired to just R1 because the script asks for QC'd raw reads. It seems to only utilize the paired reads, though.
```bash!
~/Tools/putative_paralog$ bash Step2.sh 100 2 ~/Data/rosaceae_ppd/supercontig/ ~/Data/rosaceae_ppd/rosaceae_qc_reads ~/Data/rosaceae_ppd/step2/ ~/Data/rosaceae_ppd/name_file.txt
```
:::danger
Use absolute paths when running Step 2.
Also, go into the shell script and change the naming structure of the quality-controlled raw reads to match your own naming scheme.
Also, change how Picard is called (`java -jar $PICARD` as opposed to just `picard`).
:::
**Step 4:** Run ppd.py using default parameters. For the rosaceae data, we used *Docynia indica* as the outgroup.
```bash!
python ~/Tools/ppd2/PPD_2.0.1/ppd.py -ifa ~/Data/rosaceae_ppd2/step2/ -ina ~/Data/rosaceae_hybpiper/namefile.txt -iref ~/Data/rosaceae_hybpiper/Malinae_optimized_353.fasta -io ~/Data/rosaceae_ppd2/outgroup.txt -o ~/Data/rosaceae_ppd2/ -t exon -th 8 -mi 15 -he 0.2
```
:::danger
**no longer an issue:** Step2.sh creates folders alongside the degenerated fasta files. These folders need to be moved out for ppd.py to work (os.listdir includes these folders when we only want the files).
**step 4 of ppd.py**: changes all "n"s to "-"s, which is a problem because this includes the "n"s in species/genus names, not just the sequences. Change the sed line from
`do sed -i -e 's/n/-/g' $f` to
`do sed -i -e '/>/!s/n/-/g' $f`
**extra flags**: I had to change the default values for the sliding window and the max % of heterozygous sites per sequence due to the deep phylogenetic depth (lots of variance to begin with)
:::
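The header-safe sed line can be verified on a toy FASTA: the `/>/!` address makes the substitution skip header lines, so "n"s in names survive while "n"s in sequences are replaced (toy record below, not real data):
```bash
#!/usr/bin/env bash
# Demonstrate the '/>/!' fix: substitution only touches non-header lines.
printf '>Docynia_indica_4471\nacntnga\n' > toy.fasta
sed -i -e '/>/!s/n/-/g' toy.fasta
cat toy.fasta
# >Docynia_indica_4471   (header "n"s untouched)
# ac-t-ga                (sequence "n"s replaced)
```
(`sed -i` as written assumes GNU sed, as used elsewhere in this log.)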
# Paralog Wizard (PW)
I have no idea how this pipeline works; I should probably figure that out. Of note is that there are **2 probe sets required for PW**. One probe set has the bait sequences separated out into exons, and the other has the bait sequence exons concatenated together. Most probe sets are the latter, and the authors of PW have also written a script to separate bait files out into exons: **[2ex](https://github.com/rufimov/2ex)**. This is a lot of work.
### Installation
I install Paralog Wizard onto a local machine. To download the PW pipeline from their **[github page](https://github.com/rufimov/ParalogWizard)**, use command line:
```bash=!
git clone https://github.com/rufimov/ParalogWizard
```
##### Dependencies
PW has many dependencies, some of which are the same as HybPiper's (PW incorporates HybPiper as its first step). To **create a new conda environment** for paralog wizard and the necessary dependencies:
```bash=!
conda create --name myenv python=3.8 biopython numpy scikit-learn matplotlib pandas gffutils
```
- This above list might be incorrect or outdated; always check the GitHub page for the proper dependencies.

The other dependencies that are necessary are
- **BLAST command line tools**: instructions can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK52640/)
    - Remember to export path variables
- **BLAT**: download all files from [here](https://hgdownload.cse.ucsc.edu/admin/exe/)
### Usage
> The folder structure for PW is described in detail on their github, and must be mimicked.
> I also have very little understanding of what's going on during this pipeline.
> [color=#907bf7]
**Step 1:** The first step of PW is a HybPiper mirror, which assembles the raw sequencing reads. Make sure the naming scheme of the files and the names in sample_list.txt loosely match the structure in the provided test_dataset (Genus_species_sample.R1.fastq, e.g. PAFTOL-015295_1.R1.fastq). In every step, -nc sets the number of cores.
```bash=!
python ParalogWizard.py cast_assemble -d test_dataset -pr kew_probes_Malus_exons_concat.fasta -nc 10
```
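Since PW is picky about file names, a quick pre-flight check against the expected pattern can save a failed run. The regex below is my own reading of the test_dataset naming (Genus_species_sample.R1.fastq / PAFTOL-015295_1.R1.fastq), not something PW documents:
```bash
#!/usr/bin/env bash
# Check read-file names against the naming scheme PW's test_dataset uses.
# The regex is an assumption inferred from the example names above.
check_name() {
    [[ $1 =~ ^[A-Za-z0-9-]+(_[A-Za-z0-9-]+)+\.R[12]\.fastq$ ]]
}
check_name "PAFTOL-015295_1.R1.fastq" && echo "name ok"
check_name "badname.fastq" || echo "name will likely break PW"
```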
**Step 2:** Matches assembled contigs to the bait exons
```bash=!
python ParalogWizard.py cast_retrieve -d test_dataset/ -pe kew_probes_Malus_exons.fasta -c -nc 16
```
**Step 3:** Calculates divergences between assembled contigs for each sample
```bash=!
python ParalogWizard.py cast_analyze -d test_dataset/ -nc 8
```
**Step 4:** Detects paralogs based on the divergence values. `-mi` and `-ma` must be chosen carefully
```bash=!
python ParalogWizard.py cast_detect -d test_dataset/ -pe kew_probes_Malus_exons.fasta -p -mi 4 -ma 10
```
- For the rosaceae data, I use a divergence range of (5, 14)
**Step 5:** No description for this step
```bash=!
python ParalogWizard.py cast_separate -d test_dataset/ -pc ./test_dataset/41detected_par/customized_reference_div_5.04_13.98.fas -i 90
```
# 2ex
**Step 1:** Downloaded malus_genome and malus_genome_annotation from [here](https://www.ncbi.nlm.nih.gov/genome?LinkName=nuccore_genome&from_uid=1562670041)
**Step 2:** Build annotation database using first script
```bash=
python ~/Tools/2ex/2ex_extract.py build malus_genome_annotation.gff malus_genome.fna
# This creates the file malus_genome.annotation.gf
```
**Step 3:** Create exon files using the same script's extract command
```bash=
python ~/Tools/2ex/2ex_extract.py extract malus_genome_annotation.gff malus_genome.fna
# This creates the files exons.fasta and concatenated_exons.fasta
```
:::danger
**NOTE**: Input files must not be gzipped - gunzip them first!
**KeyError: 'orig_transcript_id'**: went into 2ex_extract.py and replaced each lookup of this key with a `continue`
:::
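A cheap guard against the gzip problem above is to check for the two gzip magic bytes (0x1f 0x8b) before handing a file to 2ex. The helper below is my own, not part of 2ex:
```bash
#!/usr/bin/env bash
# gzip files start with bytes 0x1f 0x8b; refuse them before running 2ex.
is_gzipped() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' ')" = "1f8b" ]
}
printf '\x1f\x8b\x08\x00' > looks_gzipped.bin   # fake gzip header for the demo
if is_gzipped looks_gzipped.bin; then
    echo "gzipped - gunzip before running 2ex"
fi
```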
**Step 4:** Split the probe set into concatenated and per-exon files
```bash=
python ~/Tools/2ex/2ex_split.py concatenated_exons.fasta exons.fasta Malinae-optimized_Angiosperms353.fasta
# This creates a lot of files, I end up using
# "probes_separated_to_exons.fasta" as the potential exon file
# Also, the Malinae-optimized_Angiosperms353.fasta file was
# provided in 2ex
```
# IQ Tree
### Standardizing the output files
:::danger
Only use this to rename/fix files maybe, not mafft/trimal
:::
Before running IQ tree on gene fastas, I reorganize and rename the fasta files from each pipeline using a python script. I also run MAFFT and trimAl where necessary. The python script was used as follows:
```bash!
python ~/Tools/dev/align_rename/align_rename.py --hp ./hybpiper1 --hph ./hybphaser --ppd ./ppd --pw ./pw --align hybpiper hybphaser --output ./genetrees
```
- MAFFT was used to align the fasta files from hybpiper and hybphaser (used -auto)
- trimAl was run on all files, though it seems ineffective on PW due to the inclusion of missing data (used -gt 0.7)
- Files were also renamed to Gene.fasta (e.g. 4471.fasta) across all pipelines.
- PPD accidentally removes "n"s from the files, so the script replaces the n's in sample names where missing
- PW sample names are also modified to only include genus_species
:::info
Just do this manually
:::
Here are some manual ways of looping over a folder:
```bash=
for file in *;
do
file2=${file//.fasta/_aligned.fasta}
echo "mafft --auto --thread 8 $file > $file2" >> mafft_loop.sh
echo "mv $file2 aligned" >> mafft_loop.sh #move aligned files to a new folder called aligned
done
for file in *; do
file2=${file//_aligned.fasta/_trimmed.fasta}
echo "trimal -in $file -out $file2 -gt 0.5" >> trimAl_loop.sh
echo "mv $file2 trimmed" >> trimAl_loop.sh
done
```
### Preparing datasets for IQ Tree
Each pipeline is split into 3 datasets: (1) the default outputs of each pipeline, (2) an output with all sequences flagged as potential paralogs for a gene removed, and (3) an output that includes the sequences flagged as paralogous. Here are the notes for generating each of these outputs for the pipelines.
#### HybPiper 2.0
The three datasets for HybPiper were generated using the paralog_report.tsv, created with HybPiper's statistics output.
```bash!
###To generate hybpiper paralog & w/o paralog data
python paralog_mover.py --report paralog_report.tsv --action remove --fastas_path ./retrieved_sequences --output ./retrieved_sequences/no_paralogs/
```
#### Paralog Wizard
The `prepare_pw.py` script is used to prepare PW's datasets. PW places paralog sequences in a separate file for each gene. Based on whether this file exists, the script either removes the sequences that have paralogous partners, or adds those partner sequences to the original file with a '_para_' suffix. The script simultaneously fixes some naming/syntax issues within the files.
```bash=!
###Generate PW data with/without paralog data
#this is a guess; I don't remember what I actually wrote
python prepare_pw.py --pw_folder ./70concatenated_exon_alignments/fastas/ --rm_para --output ./pw_nopara
```
#### HybPhaser
For the three HybPhaser datasets (and PPD), different approaches are used. The no-paralog dataset removes entire loci from the dataset, as opposed to HybPiper, which removes just the samples within a locus that had paralogous sequences. For adding paralogs, HybPhaser only flags loci as paralogous but doesn't provide the paralogous sequences. To get paralogous sequences for a locus, say locus 4471, I use the 4471 consensus sequence file generated by HybPhaser as the bait file for a second run of HybPiper. After running another assembly with the 4471 consensus sequences as the bait file, I extract the paralogous sequences generated by HybPiper's paralog_retriever and replace HybPhaser's 4471 consensus sequence file with this new file. Note that this approach is as accurate as assembling individuals separately, where the bait file includes only that individual's consensus sequence. I don't remember the specifics, but to first generate the consensus sequences that act as the new bait, I think HybPhaser's remove-paralogs setting needs to be set to "none" instead of "outliers". When set to "outliers", though, that's when the paralog report is generated (so you know which loci to re-run the HybPiper assembly for).
### IQ Tree
In each pipeline folder, I ran
```bash!
for i in *; do iqtree -s $i -T AUTO -ntmax 4 -m TEST -B 1000 -wbtl -czb; done
```
- -T AUTO determines the optimum # of threads to use (capped at 4 by -ntmax)
- -m TEST is for model searching: jModelTest/ProtTest
- -B 1000 is ultrafast bootstrapping, 1000 replicates
- -wbtl writes the bootstrap trees with branch lengths
- -czb collapses near-zero branches
# Astral
Many different forms/versions of Astral are maintained together as 'ASTER' on github, including Astral-PRO. However, before running species tree reconstruction we must concatenate the gene tree files into one and run Newick Utilities to get rid of branches with low support.
### Astral Preparation
```bash=!
###First, concatenate all gene trees into one file###
cat * > hp_def_trim.genetree
###Run Newick Utilities to get rid of branches with low support###
#Install Newick
conda install -c bioconda newick_utils
#Run newick utilities
nw_ed hp_def_trim.genetree 'i & b<=10' o > hp_def_trim_BS10.genetree
nw_reroot 4471.FNA docynia_indica > 4471_root.fasta
#Run on folder of genetree files
for file in *.genetree; do
file2=${file//.genetree/_BS10.genetree}
echo "nw_ed $file 'i & b<=10' o > $file2" >> nw_util.sh
done
```
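One caveat with `cat * > hp_def_trim.genetree`: if the concatenation is ever re-run in the same directory, the old output file now matches `*` and gets folded into the new one. Globbing only the gene tree files avoids this (the .treefile extension is IQ-TREE's default; adjust it to whatever the folder actually contains):
```bash
#!/usr/bin/env bash
# Concatenate only the per-gene tree files so a re-run never swallows
# the previous combined file. Toy trees stand in for real IQ-TREE output.
printf '(a,b);\n' > g1.treefile
printf '(a,c);\n' > g2.treefile
cat ./*.treefile > all.genetree
cat ./*.treefile > all.genetree   # safe to re-run: all.genetree is not *.treefile
wc -l < all.genetree              # still 2 lines
```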
### Running astral and astral-PRO
To run Astral and create species trees from the default and no-paralog genetrees, run Astral as shown below. For the genetrees with paralogous sequences, I first generated a species tree file using Astral. Using this sptree file, I create a mapping file where each species' version labels (e.g. species_A.main, species_A.0, species_A.1) get folded into one branch. This mapping file is fed into Astral-PRO as below.
:::spoiler Args
- -i input file
- -o output file
- -t 2 (-u 2 for A-Pro)
- 2> redirect output to log file
- -a mapping file
:::
```bash
###Run Astral
#download ASTER (contains astral and astral-pro)
git clone https://github.com/chaoszhang/ASTER.git
#Run Astral
java -jar ~/Tools/astral/astral.5.7.8.jar -i hp_def_trim.genetree -t 2 -o hp_def_trim.sptre 2>hp_def_trim.sptre.log
#Run Astral-PRO
~/Tools/ASTER/bin/astral-pro -a hp_para_trim.map -u 2 -i hp_para_trim_BS10.genetree -o hp_para_trim_apro.sptree 2>hp_para_trim_apro.sptreelog
```
### Log
> The escape character doesn't work so treat semicolons ";" as commas ",".
> [color=#907bf7]
```csvpreview {header="true"}
Date (mm/dd/yy), Entry
02/01/22, Downloaded Paralog Wizard and tried to install all dependencies
02/08/22, Caved in and finally started setting up conda environments; created conda environment for PW
02/10/22, Tried running through test_dataset; getting errors. Something wrong with cast_separate or cast_detect
02/17/22, Narrowed down to cast_detect; for some reason during the refining process of the new probe set it's creating; some sequences aren't being removed and cast_separate (the downstream step) looks for sequences that aren't there
03/01/22, Couldn't figure out what the issue with cast_detect was; luckily the authors updated (and largely revamped) how PW works on the experimental branch of the github. I replaced the original PW's cast_analyze script with the experimental cast_analyze and was able to run through the test_dataset successfully
03/16/22, Going to try running PW on potentilla data over spring break
04/08/22, Updated version of Paralog Wizard was released; seemed to fix a number of bugs I was having. Ran through their test dataset with relatively few issues and made it through
04/10/22, Ran potentilla data through paralog wizard
05/08/22, Ran potentilla through HybPiper and did paralog_investigator/paralog_retriever; no paralogs found???
06/24/22, Ran potentilla data through HybPhaser alongside Crataegus_remotilobata because of issues in HybPhaser with using only one sample (I think dataframe issues in R)
06/29/22, Ran through first couple of steps of PPD for potentilla data & test kew_probes_malus bait set (includes running intronerate on HybPiper output)
07/01/22, Ran through Step2.sh of PPD for potentilla using test kew_probes_malus
07/08/22, Got 2ex to work but the created probes don't match 100% with the test_dataset probes; they match pretty closely though. Used this full probe set to run the potentilla data through HybPiper
07/14/22, Ran potentilla data through HybPiper and ParalogWizard and HybPhaser using full Malinae-optimized-353 probe set
08/03/22, Collected HybSeq data (online) for 15 species in the Rosaceae family for assembly and paralog detection. Ran HybPiper 2.0 on the data.
08/15/22, Ran Rosaceae data through paralog wizard- used % divergence range of (2.195, 11.335) to flag paralogous genes
08/16/22, Ran Rosaceae data through HybPhaser
09/18/22, Ran Rosaceae data through PPD; begin working on manuscript
```
Notes:
- .bashrc set up as follows for PW:
    - FastTree needs to be renamed to fasttree (if getting a fasttree-not-found error)
    - for some reason cast_assemble's -d test_dataset argument can't have a trailing /