# Lab 04

## Exercise 1. Generate an MSA

### Comparing sequences of a conserved component gene *tssB* from new genomes from last week

Find the IDs of all *tssB* sequences in the Prokka-annotated genome:

```
cd /colbyhome/BI278/lab_03/bbqs
grep tssB ./PROKKA*.tsv
```

The results are:

BB433_00131
BB433_01843

**For protein-coding genes, it is almost always best to use amino acid sequences rather than DNA sequences. Why?**

It is better to use amino acid sequences because multiple codons can code for the same amino acid. Two codons sharing no nucleotides could both code for the same amino acid, so two DNA sequences can look different even when the proteins they encode are identical.

The commands below retrieve the FASTA-format protein sequences of the genes that were identified as tssB.

```
# index the protein FASTA file
samtools faidx ./bbqs433/PROKKA*.faa
# extract the two tssB records by ID
samtools faidx ./bbqs433/PROKKA*.faa BB433_00131 BB433_01843
```

The commands above produce the following output:

>BB433_00131
MSESFQNEVPKSRVNIKLDLHTGGAQKKVELPLRLLALGDYSHGQDDRPLAARAKVDINK
NNFDAVLAEFHPKARIAVENTLTGDGSELPVTLEFQSIRDFNPEVVASRVPELRALMAAR
NLLRDLKSNLLDNATFRRELEKILRDPALSGRLHADLKSIGGSTPLLR
>BB433_01843
MADNGSVAPKERVNIVYKSETGGAKEEVELPLKQLVLGDFTQREDSTPVDQRKTVSVDKT
NFNEVLRAHGLTLDLAVPNRLAGAPEAGAEEELMNVHLEFDNIRAFEPDAIVDQVPELQQ
LVLLREALKALKGPLGNLPEFRKRLQDLVKDEGTRARLLAELGASHEDDGGSSGEGGKGG
HEEDQGNQGDKS

The command below writes that output to a file:

```
samtools faidx ./bbqs433/PROKKA*.faa BB433_00131 BB433_01843 > bb433_tssB.faa
```

Combine these tssB sequences with the ones downloaded from the T6SS database:

```
cp /courses/bi278/Course_Materials/lab_04/t6ss_db.faa ./
cat t6ss_db.faa bb433_tssB.faa > tssB_input.faa
```

MUSCLE is a popular alignment program that uses progressive multiple sequence alignment. We will align the concatenated file, which contains our tssB sequences together with the ones downloaded from the T6SS database.

```
muscle -align tssB_input.faa -output tssB_muscle.afa
```

MUSCLE changes the order of the sequences that are aligned.
We can use grep to check how the order changed by comparing the headers of the concatenated input file (faa) with those of the MUSCLE output file (afa):

```
grep ">" tssB_input.faa | head
grep ">" tssB_muscle.afa | head
```

The first sequence in the input file is Pagri_1, while the first sequence in the MUSCLE output file is Bpseu_6.

To restore the input order, we use a Python script:

```
py3 stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa
```

Next we align with MAFFT, a slightly more complex MSA program. We will only run a basic alignment.

```
mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa
```

If a nucleotide alignment is desired instead of a protein alignment, we can convert the protein alignment back into codons using the corresponding nucleotide sequences.

```
perl pal2nal.pl protein_alignment.faa nucleotide.fna -output fasta -codontable 11
```

## Exercise 2: Compare MSAs

To compare the MUSCLE and MAFFT alignments, we can view them in the multiple sequence alignment viewer from NCBI. Open two separate NCBI MSA viewer windows and paste in the contents of the MUSCLE and MAFFT files (both faa files) as text.

*The image button on HackMD was not working, but please email me if you would like to view the images of the MUSCLE and MAFFT alignments.*

### Q1. Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments?

The length and order of the protein sequences are the same across both alignments. The same amino acids appear at the same locations in many parts of both alignments. The end sequences for almost all of the proteins are identical across both alignments.

### Q2. Describe at least 3 major differences between your MUSCLE vs. MAFFT alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences.

There are many more gaps in the MUSCLE alignment than in the MAFFT alignment.
The total length of the MUSCLE alignment is also longer than that of the MAFFT alignment, even though both contain the same protein sequences (236 vs. 227 positions). The MUSCLE alignment lines up the sequence ends in a single vertical line, while the MAFFT alignment shows the differing lengths by ending the sequences at different positions. The MUSCLE alignment also has about twice as many gaps interrupting the residues. The gaps in the MUSCLE alignment were introduced near the beginning of the protein sequences. All of the initial methionines (the translated start codons) are lined up at the first position in the MUSCLE alignment, while the MAFFT alignment has gaps before the initial methionine.
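
The length and gap comparisons in Q2 were made by eye in the NCBI viewer. As a rough cross-check, they could also be computed directly from the two aligned FASTA files. The sketch below is not part of the lab materials: the function names `parse_fasta` and `alignment_stats` are made up for illustration, and it assumes gaps are written as `-`, which is the usual convention in MUSCLE and MAFFT FASTA output.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence id: sequence}.

    The id is the first whitespace-separated word after '>'.
    Sequence lines that wrap across several lines are joined together.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def alignment_stats(records):
    """Summarize an alignment: its length and how gaps ('-') are distributed."""
    seqs = list(records.values())
    if not seqs:
        raise ValueError("no sequences found")
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1:
        # Every row of a valid alignment has the same number of columns.
        raise ValueError("aligned sequences must all have the same length")
    return {
        "n_sequences": len(seqs),
        "alignment_length": lengths.pop(),
        "total_gaps": sum(s.count("-") for s in seqs),
        # Rows that begin or end with a gap show how the starts and
        # ends of the sequences were treated by the aligner.
        "seqs_with_leading_gap": sum(s.startswith("-") for s in seqs),
        "seqs_with_trailing_gap": sum(s.endswith("-") for s in seqs),
    }
```

To use it on the files from Exercise 1, read each file and pass the text through both functions, for example `alignment_stats(parse_fasta(open("tssB_muscle.faa").read()))`, then repeat for `tssB_mafft.faa` and compare the two dictionaries.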