Bi278 Lab 4 v2 Multiple Sequence Alignment

# Bi278 Lab 4 v2 Multiple Sequence Alignment ### By Lee Ferenc 10/03/2023 Note, this lab is late by a few hours. Sorry, I had a doctors appointment that I emailed about. ## Introduction In this lab we are aigning multiple sequences together. The base or residue may be identical, different (due to mutation), or absent (due to insertion/ deletion) at each position. An algorithm applies a scoring system to see how similar. Two popular progressive multiple sequence alignment (MSA) softwares: 1. `muscle` 2. `mafft` Progressive alignment works by generating a guide tree so that the algorithm can start aligning very similar pairs of sequences and then moves to more divergence ones. After it's generated, it's used to generate a new and better tree. Then, iterative refinement and it keeps realigning partitions in the dataset and if the score imporoves then it replaces the old one. This is done until the maximum score is reached. ## 1. Generate an MSA #### 1. Finding gene IDs of tssB sequences of the annotated genome from `prokka`: `grep tssB PATH/PROKKA_*.tsv` This returns the gene ids in the first column. There may be duplicates. To get only unique gene ids, you may use `uniq`. My code was: `grep tssB /home2/enfere24/lab_03/PROKKA_*.tsv ` It output: ``` BHQS69_01414 CDS 543 tssB-5_1 type VI secretion system protein TssB-5 BHQS69_02561 CDS 507 tssB-5_2 type VI secretion system protein TssB-5 BHQS69_03639 CDS 540 tssB-3 type VI secretion system protein TssB-3 (T6SS-4) ``` #### 2. Get fasta sequences for the genes. Tip: For protein coding genes it's usually better to use amino acid sequences. (This is probably better because of synonymous mutations that don't change the AA sequence but the nucleotide sequence and it's some unnecesary information) Index the appropate fasta file conitaing AA sequences with `samtools`: `samtools faidx PATH/PROKKA_*faa` Example: `samtools faidx /home2/enfere24/lab_03/PROKKA_*.faa` #### 3. Call up fasta format sequenes based on their newly indexed gene ids `samtools faidx PATH/PROKKA_*.faa GENEID1 GENEID2 ...` This outputs a fasta file. Example. Using two genes from my previous steps: `samtools faidx /home2/enfere24/lab_03/PROKKA_*.faa BHQS69_02561 BHQS69_03639` I get the output: ``` >BHQS69_02561 (SEQUENCE) >BHQS69_03639 (SEQUENCE) ``` #### 4. Save sequences into file `samtools faidx PATH/PROKKA_*faa GENEID1 GENEID2 ... > FILENAME` I did: `samtools faidx /home2/enfere24/lab_03/PROKKA_*.faa BHQS69_02561 BHQS69_03639 > phalytssbalighment` (I'm surprised with no file suffix it worked when I checked the head. I wasn't sure what the suffix was) #### 5. Concatenate tssB sequence with ones downloaded from the T6SS `cat /courses/bi278/Course_Materials/lab_04/t6ss_db.faa FILENAME > tssB_input.faa ` I did: `cat /courses/bi278/Course_Materials/lab_04/t6ss_db.faa phalytssbalighment > tssB_input.faa` This output the file `tssB_intput.faa` #### 6. Alight all the sequences with `muscle` `muscle -align tssB_input.faa -output tssB_muscle.afa` This outputs the file `tssB_muscle.afa` #### 7. Check behavior of of `muscle` Muscle tends to changethe order of sequences in alignment. So to check this, compare the order of sequences by their headers ``` grep ">" tssB_input.faa | head grep ">" tssB_muscle.afa | head ``` The output was very out of order. #### 8. Use python to re-order sequences `python3 /courses/bi278/Course_Materials/lab_04/stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head` (I noticed in step 8 of the lab there was `python` instead of `python3` which caused it too not work. I hope I did the right fix) #### 9. Fix order and save to new file. Note: `.afa` goes to `.faa` `python3 /courses/bi278/Course_Materials/lab_04/stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa` #### 10. Align with `mafft` `mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa` #### Side note: PAL2NAL For codon alignments, you reverse translate protein alignments using PAL2NAL. `perl pal2nal.pl protein_alignment.faa nucleotide.fna -output fasta -codontable 11` The flag `-codontable1 11` is for bacteria ## 2. Compare MSAs We can compare aligns with the Multiple Sequence Alignment Viewer from NCBI A link to it is [here](https://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog). #### 1. Copy and paste the text of the alignment file Make sure to click "text" in the pop-up. After you copy and paste the text from the alightment files (`_mafft.faa` or `_muscle.faa` files), click "Upload". After it has uploaded, click "Close". #### Question 1: Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments? First, the sequence's have same length between MAFFT and MUSCLE. For example Pagri 1 has 182 bps in both. (Pretty obvious but oh well) Second, there is the hallmark in the middle of the majority of proteins of gap, small no-gap, gap. The size of the non-gap is smaller on MUSCLE and it's shifted to the right. The MUSCLE is also at 115-123 while MAFFT is between 100 and 110. I think this shifting is throughout all the proteins because the darker colors are shiftened (I used the BLOSUM45 coloring) to the right. Therefore I think the gaps are happening in the same place. #### Question 2: Describe at least 2 major differences between your alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences. First, as stated in my answer in question 2 the gap, gap, small no-gap, gap is 10-15 bp to the right in the MUSCLE than MAFFT. There are many more gaps in the MUSCLE at the begnining. In Pagri 1 there are 5 in bp 1 to 30 compared to the 1. The end of the MUSCLE has bigger, larger gaps at the end as well. Both have gaps though. *But* This appears to be due to MUSCLE trying to have most proteins start and end at the same spot (244) while MAFFT doesn't try to force that. **TTDLR** MAFFT doesn't try to force the proteins to remain the same length throughout the alignment compared to MUSCLE, leading to less gaps throughout (especially at the begining and end) #### Question 3: How would you decide which alignment is better? I skimed the papers/documentation b/c that's a *lot* of math and reading to do. But I noticed ***MAFFT doesn't try to include all the residues in the input sequences like the other sequencers***. I'm not really answering the question but I follow the philopshy of never using just one alignment sequence. I feel like looking at both could reveal much more. I personally prefer MAFFT due to less gaps but I feel like it depends on what you need. ## 3. Generate a gene tree from MSA Alignments can be used to create phylogeny. We can make one using approximate Maximum Likelihood with `FastTree`. #### 1. Generate phyologenies from alignments using `FastTree` ``` FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre ``` #### 2. Import into online tree viewer Interactive Tree of Life can be found [here](https://itol.embl.de/). Click the top bar "Upload" and copy-paste the tree files into the text window. (Remember `cat`) #### 3. Change tree look It's sometimes easier to look at the tree when it's circular and ignoring branch lengths. Especially at first. But things like brach lengths can be helpful! Tyically, in maximum likelihood trees, the branch legnth represents "he number of substitutions per site" When I ran both alignments the shapes were quite simlar. For example, Flavobacterium johnsoniae UW101 and Bacteroides fragilis NCTC9343 have a really long branch and then diverge towards the end. But the legnth in MAFFT is longer than in MUSCLE (.84 to .79)