# bi278 lab #4: multiple sequence alignment
## Goal
- Take a gene of interest and generate a multiple sequence alignment with homologous sequences
- Compare your alignments with an alignment viewer
- Generate and view a gene tree from your alignment
Files for this Lab: /courses/bi278/Course_Materials/lab_04
Also use files from last week
Overviuew of allignment
An algorithm will compare multiple sequences and thenn score each position along the sequences to see if they are homologous. There are three possiblilities, "The base or residue may be identical, different (due to mutation), or absent (due to insertion/ deletion) at each position."
Allignments are hypotheses and thus may have multiple outcomes we'll look at two different softwares to create these hypotheses, muscle and mafft. Both of these softwares are progressive multiple sequence alignment (MSA) softwares. And as porfessor Noh says, "They are pretty similar overall, but mafft has more options if you have more complex datasets to work with."
Link to a Guide on the Muscle software: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-113
Link to a guide on the Maaft Software:
https://hackmd.io/@snoh-bi278/rJabIvvy6
## Generating an MSA
We are comparing a TssB sequence of the bacterial Type 6 Secretion System (T6SS) from our new genomes.
Before we get started I need to sign on and copy the folder from this week over
`mkdir lab_04
cp -r /courses/bi278/Course_Materials/lab_04/ ~/lab_04/
`
For Some reason this made a new directory within the old one so my path is ~/lab_04/lab_04/*
Find the gene ids of all tssB sequenecs in one of the genomes from the lab_04 folder I used bhqs69
`grep tssB ~/lab_04/lab_04/bhqs69/PROKKA_*.ts`
I ended up with 
as my result
if I dont have all unique genes then I should follow this advice from professor Noh "The first column contains your gene ids. Because of the way the tsv file is formatted, you probably have duplicate lines for gene and CDS. Just get the unique gene ids here. You can experiment with uniq to figure out how to get these using code (instead of copy-pasting)."
Now that we have the genes we need to get the sequences of these genes. For protein coding seuqneces its almost always better to use the amino acid sequence possibly because the amino acid sequence is a third of the size of the gene's nucleotide sequence.
To get a fasta sequence you use a software called Samtools which helps index the approirate fasta file that contains amino acid sequences for the genes predicted in my genome.
`samtools faidx ~/lab_04/lab_04/bhqs69/PROKKA_*faa` this is the command that does this.
Next we use the same command to pull up the fasta format sequenes based on their newly indexed gene ids.
`samtools faidx ~/lab_04/lab_04/bhqs69/PROKKA_*faa BH69_01414 BH69_02561 BH69_03639`
Notice this is just the previous command followed by the gene IDs identified earlier. Here are what your results should look like.

Since we are using these sequences to generate an allignment we should save them to a file.
`samtools faidx ~/lab_04/lab_04/bhqs69/PROKKA_*faa BH69_01414 BH69_02561 BH69_03639 > tssB_sequence`
the `> FILENAME` creates a new file that the sequences are written into.
Next we want to add "concatenate" the tssB sequences we found with the ones from the data base do this by running the following
`cat /courses/bi278/Course_Materials/lab_04/t6ss_db.faa tssB_sequence > tssB_input.faa`
Now that they are Concatenated we can run muscle to allign them
#### Running Muscle
Use the following command to run Muscle
`muscle -align tssB_input.faa -output tssB_muscle.afa`
As prof Noh notes " muscle has the weird behavior of changing the order of sequences in its alignment. For various reasons, we want to correct this behavior. Let’s check the behavior itself by comparing the order of sequences by their headers."
`grep ">" tssB_input.faa | head`
Followed by
`grep ">" tssB_input.faa | head`
Your results should look like this:

Since the orders are switched around we need to use a python script to reorder them
`python3 /courses/bi278/Course_Materials/lab_04/stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">"
| head`
Now we can fix the order and save them to a new file using the following command:
`python3 /courses/bi278/Course_Materials/lab_04/stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa` so the new file is `tssB_muscle.faa`
#### Running mafft
This is easier as there isn't the reordering issue encountered above.
use the following command to run mafft on our desired file
`mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa`
If youy wanted to allign nucleotide sequences instead of protien sequences use PAL2NAL to reverse translate the protiens into codons.
example of how you would run this `perl pal2nal.pl protein_alignment.faa nucleotide.fna -output fasta -codontable 11`
Professor Noh's notes on PAL2NAL "Just make sure you have both the protein alignment and nucleotide (unaligned) multi-fasta with the sequences in the same order as the protein alignment file. Note the flag -codontable 11 is for Bacteria."
Link to further resources:
## Compare MSAs
You can use the multi sequence allignment viewer from NCBI
Link: https://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog
when you arrive here select text and input the text of one of your allignment files into the window.
To get to the file that you need use the command to print the file `cat tssB_muscle.faa`
I used my muscle file but you can use either Fasta file.
When your data tells you it's been uploaded, click close.
I got this result (You should play around with settings and coloring to make it more clear)

Next I did it with the mafft allignment as well
so I used `cat tssB_mafft.faa`
and I got this result

##### Q1. Describe at least 2 major similarities between your muscle vs. mafft alignments. What would you assume about the regions that are similar across your alignments?
I noticed that towards the center of the gene sequences there is alot more similarties between the two allignment than differenceces. Both of theme also show that there is less allignment at the start and end of the sequences than in the middle. Adtionally, when looking at conservation both allignments show incredibly similar results. However, they are slightly shifted from one another. I'm assuming that becasue these regions are more highly conserved, they are also the regions that show the most homology. (see below for the conservation results.)


##### Q2. Describe at least 2 major differences between your alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences.
As I mentioned above, the two allignment results are to shifted slightly from one another. This shift is very moticeable when looking at the gaps, and where the allignments start. The gap in the middle is a clear indicator of this shift as one allignment software inserted it slightly after the other. Each allignment could have chosen a slightly different reading frame thus leading to this shift.
In adition to this, the ends of the allignments have more varied amino acid results than the rest of the sequences, which is supported by the lack of conservation seen in those regions.
##### Q3. How would you decide which alignment is better? There are a few options to consider. You may read the papers for each software and decide you prefer the philosophy of one over the other, or find its comparative performance results more convincing. You may have a dataset that you know the answer to, and run it to test how well each aligner works. (Alternatively you could also tweak the alignment parameters to work better for you.) Lastly, none of this may matter and you can choose to manually adjust the alignment based on what you know about your gene of interest. In this case, the software-based alignment is acting as a convenient way to get started
I would first read the manuals of b oth of these softwares as I am unfamiliar with them and i think it is highly valuable to understand them prior to choosing one over the other. However, I would also look towards datasets in this situation. As professor Noh mentioned, "T6SS are known to be important for virulence in many different kinds of pathogenic bacteria." Because of this, I am assuming that they are highly researched and there are multiple datasets that we know the answer to that we can use to test our allignments in this scenario.
### Generating a Gene Tree from the MSA.
After you make allignments you tend to want to make a phylogeny. We can make a quick phylogeny using approximate Maximum Likelihood with `FastTree`.
Use this link to find more resources on FastTree
http://www.microbesonline.org/fasttree/
To run fast tree use the following commands as example
`FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre
FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre`
FastTree Syntax:
`command < inputfile > outputfile
command < inputfile > outputfile`
where `< and >` is used to indicate to what to read in and what to write out.
In order to look at these tree we need to import these documents into an online tree viewer
go to: https://itol.embl.de
This is a tree viewer where you can upload your tree files.
In order to upload you need to copy and paste the file contetents so you'll need to use `cat` again.
`cat tssB_muscle_ft.tre`
`cat tssB_mafft_ft.tre`
Once you upload the data you'll be able to view your tree.
This is what it should initially look like.

Professor Noh suggests looking at the trees in cirucular mode to start as that is easier to digest. You can quickly swithc this in the control panel by clicking circular.
This is a good example of the final tree with the Bh69 gene at the top.
