# Bi278 Lab Num 4 Blast+
### By Lee Ferenc 10/04/2022
## Exercise 1. Generate an MSA
### Find the ids of all *tssB* sequences in my P.hayleyella bhqs69 genome that you annotated with prokka
```
#First I attempted using the file in my path but that didn't work
grep tssB /home2/enfere24/colbyhome/Genomics/lab_03/bhq69/PROKKA*.tsv
#So instead I copied over the file and ran it
cp /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA*.tsv /home2/enfere24/colbyhome/Genomics/lab_04
grep tssB /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.tsv
```
Output:
>BH69_01414 CDS 543 tssB-5_1 type VI secretion system protein TssB-5
BH69_02561 CDS 507 tssB-5_2 type VI secretion system protein TssB-5
BH69_03639 CDS 540 tssB-3 type VI secretion system protein TssB-3 (T6SS-4)
#### Get fasta format
##### BUT we need to infex the fasta file with samtools
```
#need to copy over oops
cp /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA*.faa /home2/enfere24/colbyhome/Genomics/lab_04
samtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa
```
##### Next, we’ll get the fasta format sequences of the genes that were identified as tssB.
```
amtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639
```
Output (not that helpful):
>BH69_01414
MADDGSVAPKERVNIVYKSETGGAKEDVELPLKQLVLGDFTLREDSTPLDQRKTVSVDKN
NFNEVLRGQGLTLDVAVPNRLAGTPEPGAEEELLNVHLAFDNIRDFEPDAIVEQVPELQQ
LVLLREALKALKGPLGNLPEFRRRLQDLVKDEGTRTRLLAELGASHEGDGKDSSNEEDKP
>BH69_02561
MAKITGTVAPKERINITYVPATGGQHAEIELPLKLLVIGDFKGHEEETALEDRPVVRIDK
DNFNEVLSEADVALKASVPLRLGEARPDDTLSVELEFKNIKDFGPDAVARQVPELRKLLE
LREALVAVKGPMGNVPAFRKQLQALLGDEVSRSKLAKELSVALDGTAS
>BH69_03639
MSASSSQKFIARNRAPRVQIEYDVEVYGSEKKVELPFVMGVLTDLSGKHPLEALPAVSER
RFLEIDIDNFDERMKAIQPRVAFAVSNTLTGEGQVMVDMTFESMEDFSPAAIARKVDALR
QLLEARTQLANLQTYMDGKSGAEALVTQLLQDPALLKSLASAPKPEPHDEGVAEPGEVN
##### Open file with ">"
```
samtools faidx /home2/enfere24/colbyhome/Genomics/lab_04/PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bH69_tssB.faa
```
##### Concatenate tssB with downladed from T6SS
```
cp /courses/bi278/Course_Materials/lab_04/t6ss_db.faa /home2/enfere24/colbyhome/Genomics/lab_04/
#because im tired of tabing
cd /home2/enfere24/colbyhome/Genomics/lab_04
cat t6ss_db.faa bH69_tssB.faa > tssB_input.faa
#Note I accidentally capitalized the H in bH69_tssB.faa
#and was then confused why it bh69_tssB.faa didn't exist lol
```
##### Align all the sequences with MUSCLE
```
muscle -align tssB_input.faa -output tssB_muscle.afa
```
##### Muscle is weird/ Changes order of sequence in alignement so to correct we compare order of sequences by header
```
grep ">" tssB_input.faa | head
```
Output:
>Pagri_1
Bmall_1
Bmall_2
Bpseu_1
Bpseu_2
Bpseu_3
Pcale_1
Pfung_1
Pmega_1
Pphem_1
```
grep ">" tssB_muscle.afa | head
```
>Phayl_3
Pphem_2
Pphym_2
Pspre_4
Pphym_1
Psart_2
i5_Vibrio_coralliilyticus_OCN008
i3_Pantoea_ananatis_LMG2665
Bpseu_2
i4a_Pseudomonas_syringae_Shaanxi_M228
##### Use python to re-order sequences (FINALLY)
```
#copy over the python file (I accidentally did it to lab_03 first oops)
cp /courses/bi278/Course_Materials/lab_04/stable_py3.py /home2/enfere24/colbyhome/Genomics/lab_04
python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head
```
##### Fix order and save as new file (NOTICE: afa to faa in file name)
```
python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa
```
##### Align with MAFFT
```
mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa
```
###### Other stuff: So you can align nucleotide sequences and codon sequnecs. Note the flag -codontable 11 is for bacteria.
###### To make a codoon alignment form the protein alignment (nucleotide for t6SS isn't available right now) you can do:
```
perl pal2nal.pl **protein_alignment.faa** **nucleotide.fna** -output
**fasta** -codontable 11
```
## Exercise 2. Compare MSAs
##### First I opened the website for multple Sequence ALignment Viewer from NCBI:
(https://www.ncbi.nlm.nih.gov/projects/msaviewer/?appname=ncbi_msav&openuploaddialog)
Then I clicked "Text" and uploaded the output from below and clicked upload and close
```
cat tssB_mafft.faa
```
Output not included because it's huge
Opened a second window where I put in the output from
```
cat tssB_muscle.faa
#NOT THE .AFA BECAUSE WRONG ORDER
```
#### Question 1: Describe at least 2 major similarities between your MUSCLE vs. MAFFT alignments. What would you assume about the regions that are similar across your alignments?
First, all the proteins seems to have the same length with start being 1 and end being 182, 192, 141, etc (though I didn't compare everyone but everyone I choose seemed to be the same).
Second, there is the hallmark in the middle of the majority of proteins of gap, small no-gap, gap. The size of the non-gap is smaller on MUSCLE and it's shifted to the right. The MUSCLE is also at 115-123 while MAFFT is between 100 and 110. I think this shifting is throughout all the proteins because the darker colors are shiftened (I used the BLOSUM45 coloring) to the right. Therefore I think the gaps are happening in the same place.
#### Question 2: Describe at least 3 major differences between your MUSCLE vs. MAFFT alignments. Focus on how the starts and ends of your sequences are treated, and where gaps are introduced to make the alignment work across all your sequences.
First, as stated in my answer in question 2 the gap, gap, small no-gap, gap is 10-15 bp to the right in the MUSCLE than MAFFT.
There are many more gaps in the MUSCLE at the begnining. In Pagri 1 there are 5 in bp 1 to 30 compared to the 1.
The end of the MUSCLE has bigger, larger gaps at the end as well. Both have gaps though. *But* This appears to be due to MUSCLE trying to have most proteins start and end at the same spot (244) while MAFFT doesn't try to force that.
**TTDLR** MAFFT doesn't try to force the proteins to remain the same length throughout the alignment compared to MUSCLE, leading to less gaps throughout (especially at the begining and end)
#### Q3. (optional) How would you decide which alignment is better? There are a few options to consider. You may read the papers for each software and decide you prefer the philosophy of one over the other, or find its comparative performance results more convincing. You may have a dataset that you know the answer to, and run it to test how well each aligner works.
I skimed the papers/documentation b/c that's a *lot* of math and reading to do. But I noticed ***MAFFT doesn't try to include all the residues in the input sequences like the other sequencers***. I'm not really answering the question but I follow the philopshy of never using one alignment sequence.
## Exercise 3. Generate a gene tree from the MSA
###### Manual: http://www.microbesonline.org/fasttree/
##### First generate from phyolgenies from the alignments
```
FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre
FastTree -lg < tssB_muscle.faa > tssB_muscle_ft.tre
```
I then went to: (https://[itol.embl.de]) and went to the top, clicked upload the outputs below, uploaded, clicked circular in the top right and ignored the branch length.
### MAFFT
```
cat tssB_mafft_ft.tre
```

### MUSCLE
```
cat tssB_muscle_ft.tre
```

I noticed they have some differences in protein places on the trees like a. citrulli AAC00-1 But they are a similar (or exact) shape.