# BI278 Lab 4: Multiple Sequence Alignment
Files in
```
/courses/bi278/Course_Materials/lab_04/
```
Generate an MSA
1. Compare sequeces of component gene tssB of the baterial T6SS
```
grep tssB ~/lab_04/PROKKA*.tsv
```
Result:

2. To get the fasta format for these genes (better to use amino acid sequences rather than DNA sequences): index the fasta so that you can find identical genes within it using a software called samtools.
```
samtools faidx ~/lab_04/PROKKA*.faa
```
3. Get the fasta format sequnce of the genes that were identified as tssB.
```
samtools faidx ~/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639
```
Results:

4. We want to send this into a file rather than show up on the screen
```
samtools faidx ~/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bh69_tssB.faa
```
5. Concatenate your tssB sequences with the ones downloaded from the T6SS database
```
cat t6ss_db.faa bh69_tssB.faa > tssB_input.faa
```
6. align all sequences with MUSCLE
```
muscle -align tssB_input.faa -output tssB_muscle.afa
```
7. MUSCLE has a weird behavior and changes the order of sequences in its alignment, which we want to correct. So, first we check the behavior itself and compare the order of sequences by their headers:
```
grep ">" tssB_input.faa | head
grep ">" tssB_muscle.afa | head
```
8. We can use a phython script to re-order the sequences.
```
python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head
```
9. Now we can fix the order and save it as a new file (note the difference between afa and faa in filename)
```
python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa
```
10. Align with MAFFT.
```
mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa
```
NOTE: you can also align nucleotide sequences instead of protein sequences (but only in rare cases would this give you a more informative result for protein coding genes). Make sure you have both protein alignment and nucleotide (nualigned) multifasta with the sequences in the same order as the protein alignment file.
--> note the flag: codontable11 is for bacteria
*To use in future projects*
```
perl pal2nal.pl protein_alignment.faa nucleotide.fna -output
fasta -codontable 11
```
Compare MSAs
Use two alignments: compare using Multiple Sequence Alignment Viewer (NCBI) https://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog
11. When you get the pop-up window, click [text] and copy-paste alignment into window on right
```
#in terminal
cat tssB_muscle.faa #the resulting sequence is what is copy-pasted into the window
```
12. Then click [upload]. When the window tells you that your data has been uploaded, click [close]
13. Open second window and load up the other alignment.
Note: window on left is mafft, window on right is muscle
Results:
MAFFT
MUSCLE
Questions:
Q1: Two major similarities
- Both alignments have the same sequence titles and sequence number (64).
- The central regions are very similar between the MUSCLE and MAFFT alignments (on the immediate right and left of the gap in the middle)
Q2: Three major differences
- The MAFFT alignment has 227 proteins, while the MUSCLE alignment has 244.
- There are more frequent and larger gaps in the sequences in the MUSCLE alignment than in the MAFFT.
- The beginnings and endings of the alignments are different (more gaps in MUSCLE, different patterns)
- The gap in the middle is different between the two: a bigger gap for MUSCLE, two smaller gaps in the MAFFT alignment.
Generate a gene tree from your MSA
*Alignments are usually not the last step in an analysis. One thing that you can use an alignment for is to create a phylogeny. Using FastTree you can make quick phylogeny using Maximum Liklihood*
14. Generate phylogenies from your aligments
```
FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre
FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre
```
15. Import into online tree viewer (itol.embl.de). Click [upload] and copy-paste tree file content into the [Tree text] window. Then click on [upload] below.
```
#in terminal
cat tssB_muscle_ft.tre
cat tssB_mafft_ft.tre #paste into window
```
16. Use circular function and ignore branch lengths
To check: see whether sequences group (BH69 sequences) group with Phayl (same species)
MAFFT tree 
MUSCLE tree 
Unrelated note: control C gets you back if you get stuck!