# BI278 Lab #4 – Multiple Sequence Alignment Quick note: Crtl C for if your cursor ever gets "stuck" **Exercise 1 - Generate an MSA** 1. First, we have to find the IDs of all *tssB* sequences in your new genome that you annotated with prokka ast week. ``` grep tssB ~PATH/PROKKA*.tsv grep tssB ~/lab_04/PROKKA*.tsv ``` Result: ![](https://i.imgur.com/nZeapJo.png) 2. Now, we need to get the fasta format sequences for these genes. For protein coding genes, you want to use amino acid sequences rather than DNA sequences. A reason for this could be that amino acid sequences highlight coding regions of the DNA sequence while the DNA sequence includes both coding and non-coding regions. First step is to index the fasta file so that you can find individual genes w/in it using *samtools* software. ``` samtools faidx PATH/PROKKA*.faa samtools faidx ~/lab_04/PROKKA*.faa ``` 3. Next, we get fasta format sequences of the genes that were identified as tssB. ``` samtools faidx PATH/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 samtools faidx ~/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 ``` Result: ![](https://i.imgur.com/cXtDXhD.png) 4. ^We want to send the result into a file rather than show it on the screen. We use ">" operator to do that. ``` samtools faidx PATH/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bh69_tssB.faa samtools faidx ~/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bh69_tssB.faa ``` 5. Concatenate your tssB sequences w/ the ones downloaded from the T6SS database. ``` cat t6ss_db.faa bh69_tssB.faa > tssB_input.faa ``` 6. Align sequences w/ MUSCLE ``` muscle -align tssB_input.faa -output tssB_muscle.afa ``` 7. MUSCLE has weird behavior where it changes the order of the sequences in its alignment. We want to correct this behavior. First, check the behavior. Compare the order of sequences by their headers: ``` grep ">" tssB_input.faa | head grep ">" tssB_muscle.afa | head ``` 8. You can use python script to re-order sequences! Check that it works first: ``` python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head ``` 9. Now, fix the order & save it as a new file. (You can change the name if you want since afa to faa is very subtle difference but not necessary) ``` python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa ``` 10. Next, you have to align w/ MAFFT. We'll do a basic alignment w/ just a few refining iterations to save time. Key note: MAAFT doesn't have issue we corrected for MUSCLE above so output file is good as is. ``` mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa ``` 10. NOTE: You can align nucleotide seqs instead of protein seqs as well. Although, this is only in rare cases that would give you more informative results for protein coding genes. Codon alignments can still be useful. To do this, you can reverse-translate protein alignments into their codon nucleotide alignments. MAKE SURE you have BOTH protein alignment & nucleotide (unaligned) multifasta w/ the seqs in same order as the protein alignment file. Remember: flag - odontable 11 is for Bacteria To make a codon alignment from your protein alignment: ``` perl pal2nal.pl protein_alignment.faa nucleotide.fna -output fasta -codontable 11 ``` ^Can't use it for this exercise Why? B/c we don't have nucleotide fasta for the T6SS genes. Just keep it in mind b/c you can use it for future projects. **Exercise 2 - Compare MSAs** Purpose: Now that we have 2 alignments, we can compare them using a MSA (Multiple Sequence Alignment) Viewer available from NCBI Link: http://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog 11. When you click on the link, you go to a pop-up window. CLick on the [Text] option as data sources & copy-paste your alignment into the window to the right. ``` cat tssB_mafft.faa #shows you the contents of the file cat tssB_muscle.faa ``` 12. Then, click on [Upload] and once data has been uploaded, click on [Close]. 13. Open a second window & load other alignment. Results: MAFFT: ![](https://i.imgur.com/L5q5z0k.png) Muscle: ![](https://i.imgur.com/sRRMSsD.png) **Q1. Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments?** 1) Bases in the 30-100bp positions align similarly in both alignments. 2) Bases in the 110-180 positions align similarly in both alignments. **Q2. Describe at least 3 major differences between your MUSCLE vs. MAFFT alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences.** 1) MAFFT was able to identify more alignments ihe base positions of approxmiately 100-110. There is a gap in Muscle for this portion. 2) There was more alignment in MAFFT in the start of the sequence (1-30) than in Muscle. 3) Muscle contained more alignment at the very end of the sequence (234-244) than the MAFFT. However, MAFFT had better alignment leading up to the end of the sequence (205-215) **Exercise 3 - Generate a Gene Tree from your MSA** Purpose: Remember that alignments are usually NOT the last step in an analysis. Typically, you can use an alignment to generate a phylogeny. In this exercise, we will make a quick phylogeny using approx Max Likelihood w/ FastTree. Manual for FastTree: http://www.microbesonline.org/fasttree/ 14) Generate phylogenies from the alignments: ``` FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre ``` 15) Now, you can take a look at these trees by importing them into an online tree viewer. Go to: itol.embl.de Click on top bar [Upload] & copy-paste your tree file content into the [Tree text] window. Then click [Upload]. ``` cat tssB_mafft_ft.tre cat tssB_muscle_ft.tre ``` 16) Look at circiular form of tree & ignore branch lengths to make it easier to understand what you're looking at. MAFFT tree: ![](https://i.imgur.com/XDV60IE.jpg) ![](https://i.imgur.com/wmPDwUy.png) Muscle tree: ![](https://i.imgur.com/dwN7Nyb.jpg) ![](https://i.imgur.com/sUTMcEp.png) CHECK YOUR WORK: - Click around tree to make sure you can find your sequences that you added to the larger database file. - A good reality check of process= check whether your sequences group w/ other sequences from a close relative. - ex: bhqs69 seqs all group w/ Phayl (same species) seqs