# Bi278 Lab Num 4 Blast+ ### By Lee Ferenc 10/04/2022 ## Exercise 1. Generate an MSA ### Find the ids of all *tssB* sequences in my P.hayleyella bhqs69 genome that you annotated with prokka ``` #First I attempted using the file in my path but that didn't work grep tssB /home2/enfere24/colbyhome/Genomics/lab_03/bhq69/PROKKA*.tsv #So instead I copied over the file and ran it cp /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA*.tsv /home2/enfere24/colbyhome/Genomics/lab_04 grep tssB /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.tsv ``` Output: >BH69_01414 CDS 543 tssB-5_1 type VI secretion system protein TssB-5 BH69_02561 CDS 507 tssB-5_2 type VI secretion system protein TssB-5 BH69_03639 CDS 540 tssB-3 type VI secretion system protein TssB-3 (T6SS-4) #### Get fasta format ##### BUT we need to infex the fasta file with samtools ``` #need to copy over oops cp /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA*.faa /home2/enfere24/colbyhome/Genomics/lab_04 samtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa ``` ##### Next, we’ll get the fasta format sequences of the genes that were identified as tssB. ``` amtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 ``` Output (not that helpful): >BH69_01414 MADDGSVAPKERVNIVYKSETGGAKEDVELPLKQLVLGDFTLREDSTPLDQRKTVSVDKN NFNEVLRGQGLTLDVAVPNRLAGTPEPGAEEELLNVHLAFDNIRDFEPDAIVEQVPELQQ LVLLREALKALKGPLGNLPEFRRRLQDLVKDEGTRTRLLAELGASHEGDGKDSSNEEDKP >BH69_02561 MAKITGTVAPKERINITYVPATGGQHAEIELPLKLLVIGDFKGHEEETALEDRPVVRIDK DNFNEVLSEADVALKASVPLRLGEARPDDTLSVELEFKNIKDFGPDAVARQVPELRKLLE LREALVAVKGPMGNVPAFRKQLQALLGDEVSRSKLAKELSVALDGTAS >BH69_03639 MSASSSQKFIARNRAPRVQIEYDVEVYGSEKKVELPFVMGVLTDLSGKHPLEALPAVSER RFLEIDIDNFDERMKAIQPRVAFAVSNTLTGEGQVMVDMTFESMEDFSPAAIARKVDALR QLLEARTQLANLQTYMDGKSGAEALVTQLLQDPALLKSLASAPKPEPHDEGVAEPGEVN ##### Open file with ">" ``` samtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bH69_tssB.faa ``` ##### Concatenate tssB with downladed from T6SS ``` cp /courses/bi278/Course_Materials/lab_04/t6ss_db.faa /home2/enfere24/colbyhome/Genomics/lab_04/ #because im tired of tabing cd /home2/enfere24/colbyhome/Genomics/lab_04 cat t6ss_db.faa bH69_tssB.faa > tssB_input.faa #Note I accidentally capitalized the H in bH69_tssB.faa #and was then confused why it bh69_tssB.faa didn't exist lol ``` ##### Align all the sequences with MUSCLE ``` muscle -align tssB_input.faa -output tssB_muscle.afa ``` ##### Muscle is weird/ Changes order of sequence in alignement so to correct we compare order of sequences by header ``` grep ">" tssB_input.faa | head ``` Output: >Pagri_1 Bmall_1 Bmall_2 Bpseu_1 Bpseu_2 Bpseu_3 Pcale_1 Pfung_1 Pmega_1 Pphem_1 ``` grep ">" tssB_muscle.afa | head ``` >Phayl_3 Pphem_2 Pphym_2 Pspre_4 Pphym_1 Psart_2 i5_Vibrio_coralliilyticus_OCN008 i3_Pantoea_ananatis_LMG2665 Bpseu_2 i4a_Pseudomonas_syringae_Shaanxi_M228 ##### Use python to re-order sequences (FINALLY) ``` #copy over the python file (I accidentally did it to lab_03 first oops) cp /courses/bi278/Course_Materials/lab_04/stable_py3.py /home2/enfere24/colbyhome/Genomics/lab_04 python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head ``` ##### Fix order and save as new file (NOTICE: afa to faa in file name) ``` python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa ``` ##### Align with MAFFT ``` mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa ``` ###### Other stuff: So you can align nucleotide sequences and codon sequnecs. Note the flag -codontable 11 is for bacteria. ###### To make a codoon alignment form the protein alignment (nucleotide for t6SS isn't available right now) you can do: ``` perl pal2nal.pl **protein_alignment.faa** **nucleotide.fna** -output **fasta** -codontable 11 ``` ## Exercise 2. Compare MSAs ##### First I opened the website for multple Sequence ALignment Viewer from NCBI: (https://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog) Then I clicked "Text" and uploaded the output from below and clicked upload and close ``` cat tssB_mafft.faa ``` Output not included because it's huge Opened a second window where I put in the output from ``` cat tssB_muscle.faa #NOT THE .AFA BECAUSE WRONG ORDER ``` #### Question 1: Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments? First, all the proteins seems to have the same length with start being 1 and end being 182, 192, 141, etc (though I didn't compare everyone but everyone I choose seemed to be the same). Second, there is the hallmark in the middle of the majority of proteins of gap, small no-gap, gap. The size of the non-gap is smaller on MUSCLE and it's shifted to the right. The MUSCLE is also at 115-123 while MAFFT is between 100 and 110. I think this shifting is throughout all the proteins because the darker colors are shiftened (I used the BLOSUM45 coloring) to the right. Therefore I think the gaps are happening in the same place. #### Question 2: Describe at least 3 major differences between your MUSCLE vs. MAFFT alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences. First, as stated in my answer in question 2 the gap, gap, small no-gap, gap is 10-15 bp to the right in the MUSCLE than MAFFT. There are many more gaps in the MUSCLE at the begnining. In Pagri 1 there are 5 in bp 1 to 30 compared to the 1. The end of the MUSCLE has bigger, larger gaps at the end as well. Both have gaps though. *But* This appears to be due to MUSCLE trying to have most proteins start and end at the same spot (244) while MAFFT doesn't try to force that. **TTDLR** MAFFT doesn't try to force the proteins to remain the same length throughout the alignment compared to MUSCLE, leading to less gaps throughout (especially at the begining and end) #### Q3. (optional) How would you decide which alignment is better? There are a few options to consider. You may read the papers for each software and decide you prefer the philosophy of one over the other, or find its comparative performance results more convincing. You may have a dataset that you know the answer to, and run it to test how well each aligner works. I skimed the papers/documentation b/c that's a *lot* of math and reading to do. But I noticed ***MAFFT doesn't try to include all the residues in the input sequences like the other sequencers***. I'm not really answering the question but I follow the philopshy of never using one alignment sequence. ## Exercise 3. Generate a gene tree from the MSA ###### Manual: http://www.microbesonline.org/fasttree/ ##### First generate from phyolgenies from the alignments ``` FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre ``` I then went to: (https://[itol.embl.de]) and went to the top, clicked upload the outputs below, uploaded, clicked circular in the top right and ignored the branch length. ### MAFFT ``` cat tssB_mafft_ft.tre ``` ![](https://i.imgur.com/TYNx7H5.png) ### MUSCLE ``` cat tssB_muscle_ft.tre ``` ![](https://i.imgur.com/mUhXOuz.png) I noticed they have some differences in protein places on the trees like a. citrulli AAC00-1 But they are a similar (or exact) shape.