# BI278 Lab #4

### Exercise 1. Generate an MSA

1. Find the ids of all tssB sequences in your new genome that you annotated with prokka last week.

```
# run from last week's directory
grep tssB PROKKA_09272022.tsv
```

There were no matches in my own file, so I used the files from the course materials:

```
cp /courses/bi278/Course_Materials/lab_04/bbqs433/PROKKA_09242022.faa lab_04
cp /courses/bi278/Course_Materials/lab_04/bbqs433/PROKKA_09242022.tsv lab_04
cp /courses/bi278/Course_Materials/lab_04/bbqs433/proteins.faa lab_04
```

Then I searched the .tsv file:

```
grep tssB PROKKA_09242022.tsv
```

The result I got is

>BB433_00131
>BB433_01843

2. First, we need to index the fasta file so that you can find individual genes within it using a new software named samtools.

```
samtools faidx PROKKA_09242022.faa
```

3. Next, we’ll get the fasta format sequences of the genes that were identified as tssB.

```
samtools faidx PROKKA_09242022.faa BB433_00131 BB433_01843
```

The result I got is

>BB433_00131
MSESFQNEVPKSRVNIKLDLHTGGAQKKVELPLRLLALGDYSHGQDDRPLAARAKVDINK
NNFDAVLAEFHPKARIAVENTLTGDGSELPVTLEFQSIRDFNPEVVASRVPELRALMAAR
NLLRDLKSNLLDNATFRRELEKILRDPALSGRLHADLKSIGGSTPLLR
>BB433_01843
MADNGSVAPKERVNIVYKSETGGAKEEVELPLKQLVLGDFTQREDSTPVDQRKTVSVDKT
NFNEVLRAHGLTLDLAVPNRLAGAPEAGAEEELMNVHLEFDNIRAFEPDAIVDQVPELQQ
LVLLREALKALKGPLGNLPEFRKRLQDLVKDEGTRARLLAELGASHEDDGGSSGEGGKGG
HEEDQGNQGDKS

4. We want to send this into a file, rather than have it show up on your screen. We can use the “>” operator to do this.

```
samtools faidx PROKKA_09242022.faa BB433_00131 BB433_01843 > bb433_tssB.faa
```

5. Now let’s concatenate your tssB sequences with the ones downloaded from the T6SS database.

```
cat t6ss_db.faa bb433_tssB.faa > tssB_input.faa
```

6. Let’s first align all the sequences with MUSCLE.

```
muscle -align tssB_input.faa -output tssB_muscle.afa
```

8. Unfortunately, MUSCLE has a weird behavior: it changes the order of the sequences in its alignment. For various reasons, we want to correct this behavior. Let’s first check the behavior itself by comparing the order of the sequence headers in the input and in the alignment.

```
grep ">" tssB_input.faa | head
```

>Pagri_1
>Bmall_1
>Bmall_2
>Bpseu_1
>Bpseu_2
>Bpseu_3
>Pcale_1
>Pfung_1
>Pmega_1
>Pphem_1

```
grep ">" tssB_muscle.afa | head
```

>Bpseu_6
>i2_Escherichia_coli_CFT073
>Pmega_1
>Phayl_1
>Pcale_1
>Pphyt_2
>Pbonn_1
>Pphem_1
>Bmall_1

10. We can use a python script to re-order the sequences. You can check that it works first.

```
python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head
```

>Pagri_1
>Bmall_1
>Bmall_2
>Bpseu_1
>Bpseu_2
>Bpseu_3
>Pcale_1
>Pfung_1
>Pmega_1
>Pphem_1

12. Now let’s fix the order and save it as a new file. Note the very subtle difference in file name (afa to faa). If these are too similar, go ahead and change the file name so it’s easier for you to tell the difference.

```
python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa
```

14. Next, let’s align with MAFFT. We’ll do a basic alignment with only a few refining iterations to save time. MAFFT does not have the issue we corrected for MUSCLE above, so our output file is good as-is.

```
mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa
```

### Exercise 2. Compare MSAs

I used the Multiple Sequence Alignment Viewer available from NCBI to generate two sets of analyses. Links to my analyses:

MUSCLE: https://www.ncbi.nlm.nih.gov/projects/msaviewer/?key=GqmAcAar2XJ1hZd1VpShi_E1tP_rjuWL6Y3Bm9Wfx7FWvgWhBJsLGzknDO6Inp7jz_uS74zL186Q1ITZgt-OxLztgeOt34c,8ENqmuxBM5ifb32fvH5LYRvfXgoBew9-A3grbj9qLUS8S-9U7m6Faegq3QGHLsRTlUvIX9Z7jX7KZN5p2G_UdOZd21P3b90

MAFFT: https://www.ncbi.nlm.nih.gov/projects/msaviewer/?key=iToT45U4SuHmFgTmxQcyGGKmIaV-1HDRfNdUwUDFUuvD5JD5Q8OOJCjgHfUQPa5A_1iiTLxo522gd7R6sny-Z4xOsUCdfLc,VOfOPkjllzw7y9k7GNrvxb97_HajB60CoQSJEp0WjzgeN00qnhC08dHi5MgzruDTscvs3_L7qf7u5Prp_O_w9MLd_9PT7_k

#### Q1. Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments?

Both the MUSCLE and MAFFT alignments have regions of alignment columns with gaps at the beginning and near the end. Both also have a region of alignment columns without gaps in the middle.

#### Q2. Describe at least 3 major differences between your MUSCLE vs. MAFFT alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences.

One difference is that the alignment columns in the MAFFT alignment are of different lengths, whereas in the MUSCLE alignment the lengths are the same. Another difference is that the conserved regions in the MUSCLE and MAFFT alignments are not exactly the same: the two conservation graphs look similar overall, but there is some slight variation between them. I also notice that there are more gaps in the MUSCLE alignment than in the MAFFT alignment. The similarity graphs differ as well: there is a grey region at the front of the MAFFT alignment but not in the MUSCLE alignment.

### Exercise 3. Generate a gene tree from your MSA

I generated the phylogeny using approximate Maximum Likelihood with FastTree.

My tree: https://itol.embl.de/tree/137146126226306101664908297
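
As a side note on the re-ordering step in Exercise 1: the idea of restoring the input order of an alignment can be sketched in a few lines of Python. This is not the course's `stable_py3.py` (I have not looked inside it). The function names `read_fasta`, `reorder_alignment`, and `to_fasta` and the toy sequences `seqA`, `seqB`, `seqC` are made up for illustration, and the sketch assumes every header is unique and appears unchanged in both files.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def reorder_alignment(input_records, aligned_records):
    """Return the aligned records in the order their headers appear in the input."""
    aligned_by_header = dict(aligned_records)
    missing = [h for h, _ in input_records if h not in aligned_by_header]
    if missing:
        raise KeyError(f"headers missing from alignment: {missing}")
    return [(h, aligned_by_header[h]) for h, _ in input_records]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text, wrapping sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy data (hypothetical, not from the lab): the aligner shuffled the order.
toy_input = ">seqA\nMSES\n>seqB\nMAD\n>seqC\nMSD\n"
toy_aligned = ">seqC\nMS-D\n>seqA\nMSES\n>seqB\nMA-D\n"

reordered = reorder_alignment(read_fasta(toy_input), read_fasta(toy_aligned))
print(to_fasta(reordered), end="")
```

Only the order of the records changes; the aligned sequences, including their gap characters, are passed through untouched, which is why the re-ordered file is still a valid alignment.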