# **Lab week 4**
`ssh` onto bi278
# Exercise 1: Generating a Multiple Sequence Alignment
- `mkdir labwk4` to make new directory for this week
- `cd /courses/bi278/Course_Materials/lab_04` to enter folder for this week
- `ls` and then `cp PROKKA_09242022.tsv ~/labwk4` to copy into labwk4 folder
- `cd ~` `cd labwk4`
- `grep tssB ./PROKKA*.tsv` to find the IDs of the tssB sequences in the PROKKA annotated genome. (Mine from last week wasn't working, so I copied the one in the lab_04 folder.)
tssB is a conserved component gene of the bacterial type VI secretion system (T6SS)
after `grep tssB ./PROKKA*.tsv` I got
BH69_01414
BH69_02561
BH69_03639
which matched the results I should have gotten
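
A rough Python sketch of the same lookup, in case I ever want the locus tags inside a script instead of reading them off `grep` output. The column names (`locus_tag`, `gene`) are my assumption about the PROKKA `.tsv` layout, and the rows and `DEMO_` tags are made up for illustration, not taken from the real annotation.

```python
import csv
import io

# Made-up miniature of a PROKKA-style .tsv (column layout is an assumption).
DEMO_TSV = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "DEMO_00001\tCDS\t1350\tdnaA\t\t\thypothetical demo row\n"
    "DEMO_00002\tCDS\t540\ttssB_1\t\t\thypothetical demo row\n"
    "DEMO_00003\tCDS\t900\t\t\t\thypothetical protein\n"
    "DEMO_00004\tCDS\t510\ttssB_2\t\t\thypothetical demo row\n"
)


def locus_tags_for_gene(handle, gene):
    """Return locus tags whose gene name is `gene`, ignoring a _1/_2 suffix."""
    reader = csv.DictReader(handle, delimiter="\t")
    tags = []
    for row in reader:
        name = (row.get("gene") or "").split("_")[0]
        if name == gene:
            tags.append(row["locus_tag"])
    return tags


tags = locus_tags_for_gene(io.StringIO(DEMO_TSV), "tssB")
print(tags)
```

Unlike `grep tssB`, which matches the text anywhere on the line, this only looks at the gene column, so a product description that happens to mention tssB would not be picked up.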
to get the fasta format sequences for these genes
- index the fasta file with a new piece of software, `samtools`, so that individual genes within the fasta file can be found: `samtools faidx ./PROKKA*.faa`
- to get the fasta format sequences of the genes identified as tssB: `samtools faidx ./PROKKA*.faa BH69_01414 BH69_02561 BH69_03639`
my results:
>BH69_01414
MADDGSVAPKERVNIVYKSETGGAKEDVELPLKQLVLGDFTLREDSTPLDQRKTVSVDKN
NFNEVLRGQGLTLDVAVPNRLAGTPEPGAEEELLNVHLAFDNIRDFEPDAIVEQVPELQQ
LVLLREALKALKGPLGNLPEFRRRLQDLVKDEGTRTRLLAELGASHEGDGKDSSNEEDKP
>BH69_02561
MAKITGTVAPKERINITYVPATGGQHAEIELPLKLLVIGDFKGHEEETALEDRPVVRIDK
DNFNEVLSEADVALKASVPLRLGEARPDDTLSVELEFKNIKDFGPDAVARQVPELRKLLE
LREALVAVKGPMGNVPAFRKQLQALLGDEVSRSKLAKELSVALDGTAS
>BH69_03639
MSASSSQKFIARNRAPRVQIEYDVEVYGSEKKVELPFVMGVLTDLSGKHPLEALPAVSER
RFLEIDIDNFDERMKAIQPRVAFAVSNTLTGEGQVMVDMTFESMEDFSPAAIARKVDALR
QLLEARTQLANLQTYMDGKSGAEALVTQLLQDPALLKSLASAPKPEPHDEGVAEPGEVN
- Next, to send this to a file instead of just to my screen, I used `samtools faidx ./PROKKA*.faa BH69_01414 BH69_02561 BH69_03639 > bh69_tssB.faa`
`ls` to check that the two new files `bh69_tssB.faa` and `PROKKA_09242022.faa.fai` were there
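
To remind myself what "fetch records by ID from a multi-fasta" actually means, here is a minimal pure-Python sketch of the idea. It is not how `samtools faidx` is implemented (samtools uses the `.fai` index to jump straight to each record, while this just reads everything into memory), and the sequence names and residues below are toy values.

```python
# Toy multi-fasta; names and sequences are invented for illustration.
DEMO_FASTA = """>seqA first toy protein
MADD
GSVA
>seqB
MAKIT
>seqC
MSASS
"""


def read_fasta(text):
    """Parse fasta text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def fetch(records, ids):
    """Return the requested records as fasta text, in the order requested."""
    missing = [i for i in ids if i not in records]
    if missing:
        raise KeyError(f"not found in fasta: {missing}")
    return "".join(f">{i}\n{records[i]}\n" for i in ids)


records = read_fasta(DEMO_FASTA)
subset = fetch(records, ["seqC", "seqA"])
print(subset, end="")
```

The sketch writes each sequence on one line rather than wrapping it the way the samtools output above is wrapped.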
- So then we wanted to combine the tssB sequences with the ones downloaded from the T6SS database, used `cat t6ss_db.faa bh69_tssB.faa > tssB_input.faa`
- Then we want to align the sequences using MUSCLE: `muscle -align tssB_input.faa -output tssB_muscle.afa`
MUSCLE will sometimes change the order of the sequences in the alignment, so to check this behavior we can compare the order of the sequences using their headers and the commands
`grep ">" tssB_input.faa | head`
>Pagri_1
>Bmall_1
>Bmall_2
>Bpseu_1
>Bpseu_2
>Bpseu_3
>Pcale_1
>Pfung_1
>Pmega_1
>Pphem_1
`grep ">" tssB_muscle.afa | head`
>Phayl_3
>Pphem_2
>Pphym_2
>Pspre_4
>Pphym_1
>Psart_2
>i5_Vibrio_coralliilyticus_OCN008
>i3_Pantoea_ananatis_LMG2665
>Bpseu_2
>i4a_Pseudomonas_syringae_Shaanxi_M228
- to reorder the sequences you can use a python command like `python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head`
this wasn't working originally because I first had to copy the `stable_py3.py` file into my labwk4 folder
`cd /courses/bi278/Course_Materials/lab_04` just so I could make sure the python file was in there
`cp stable_py3.py ~/labwk4` and then `cd` back to `labwk4`
- try again `python stable_py3.py tssB_input.faa tssB_muscle.afa | grep ">" | head`
- and then to save it as a new file `python stable_py3.py tssB_input.faa tssB_muscle.afa > tssB_muscle.faa`
yay it worked
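
I haven't read the inside of `stable_py3.py`, so this is only my guess at the idea behind it: take the aligned sequences and write them back out in the order the headers appear in the input file. The sequences are toy values and the function names are my own.

```python
# Toy input (unaligned) and aligned output in a shuffled order.
DEMO_INPUT = ">s1\nACD\n>s2\nAD\n>s3\nACDE\n"
DEMO_ALIGNED = ">s3\nACDE\n>s1\nACD-\n>s2\nA-D-\n"


def parse_fasta(text):
    """Return a list of (name, sequence) tuples in file order."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].split()[0], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def restore_input_order(input_text, aligned_text):
    """Reorder aligned records so they follow the header order of the input."""
    aligned = dict(parse_fasta(aligned_text))
    return [(name, aligned[name]) for name, _ in parse_fasta(input_text)]


reordered = restore_input_order(DEMO_INPUT, DEMO_ALIGNED)
for name, seq in reordered:
    print(f">{name}\n{seq}")
```

This assumes every header is unique and that the aligner kept the header names unchanged; a duplicated name would silently overwrite an earlier record in the `dict`.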
Now we wanted to align using MAFFT. Note that MAFFT does not have the same sequence-reordering issue as MUSCLE, so the output is good to go without any additional commands
- `mafft --maxiterate 3 tssB_input.faa > tssB_mafft.faa`
Note for future assignments: you can align nucleotide sequences instead of protein sequences, but this is rarely going to be more informative. However, alignments of the codons can be useful. You can convert a protein alignment back into its codon alignment using the protein alignment and the unaligned nucleotide multi-fasta file, with the sequences in the same order as in the protein alignment file. Also, note the flag `-codontable 11` for bacteria
the code would look something like this `perl pal2nal.pl protein_alignment.faa nucleotide.fna -output fasta -codontable 11`
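
To make the codon-alignment idea concrete for myself, here is a toy sketch of the core step for one sequence: walk along the aligned protein and drop in one codon per residue and `---` per gap. This is not pal2nal itself, just the threading idea; the function name and the sequences are invented, and the sketch does no translation, so it never needs a codon table.

```python
def thread_codons(aligned_protein, cds):
    """Map an unaligned coding sequence onto a gapped protein alignment row."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = len(aligned_protein.replace("-", ""))
    # Tolerate one extra codon at the end (assumed to be the stop codon).
    if len(codons) == n_residues + 1:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("codon count does not match residue count")
    codon_iter = iter(codons)
    out = []
    for ch in aligned_protein:
        out.append("---" if ch == "-" else next(codon_iter))
    return "".join(out)


row = thread_codons("M-KV", "ATGAAAGTT")
print(row)  # ATG---AAAGTT
```

A real tool would also want to confirm that each codon actually translates to the residue it is paired with; this sketch only checks that the counts line up.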
# Exercise 2: Comparing MSAs
Now that there are two alignments available, we can compare them using the MSA Viewer available from NCBI
`cat tssB_muscle.faa`
`cat tssB_mafft.faa` to print both of the alignments to the screen so I can copy and paste them into two separate windows of the NCBI Viewer
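
Besides eyeballing the two alignments in the viewer, a few summary numbers can be put side by side. This is a sketch I put together, not part of the lab instructions: it reports the alignment length, the fraction of cells that are gaps, and the number of gap-free columns where every sequence has the same residue. The two toy alignments are invented and stand in for the MUSCLE and MAFFT outputs.

```python
def summarize(aligned_seqs):
    """Return simple summary statistics for a list of aligned sequences."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_cols = lengths.pop()
    total_cells = n_cols * len(aligned_seqs)
    gap_cells = sum(s.count("-") for s in aligned_seqs)
    # A column counts as identical if it has no gaps and a single residue.
    identical = sum(
        1 for col in zip(*aligned_seqs)
        if "-" not in col and len(set(col)) == 1
    )
    return {
        "columns": n_cols,
        "gap_fraction": gap_cells / total_cells,
        "identical_columns": identical,
    }


# Toy stand-ins for two alignments of the same three sequences.
demo_a = ["MKV-LA", "MKVQLA", "MR--LA"]
demo_b = ["MKV-LA-", "MKVQLA-", "M-R-LA-"]

stats_a = summarize(demo_a)
stats_b = summarize(demo_b)
print(stats_a)
print(stats_b)
```

These numbers only describe the alignments; they do not say which one is more biologically correct.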
**Questions**
1. *2 Major similarities*
- One major similarity is that both alignments have pretty much the same large sections, especially when looking with the conservation lens
- Another major similarity is that the total length of the alignments looks about the same, although they start at different points
2. *3 Major differences*
- One major difference is the location of the gaps. One of the alignments has a large gap in the middle, while the other alignment has a smaller middle gap and a gap all the way to the right that the first alignment does not have
- Although similar, the amount conserved by each alignment is slightly different
- The alignment with the smaller middle gap has a smaller gap at the beginning but a longer gap at the end, whereas the other alignment has a slightly larger gap at the beginning but has some data at the end
3. How would you decide which is better?
You could decide which is better by
# Exercise 3: Generate a Gene Tree From the MSA
- Generate phylogenies from the alignments using the commands `FastTree -lg < tssB_muscle.faa > tssB_mucsle.ft.tre`
`FastTree -lg < tssB_mafft.faa > tssB_mafft_ft.tre`
- and then `ls` to see the files and then
`cat tssB_mafft_ft.tre`
`cat tssB_mucsle.ft.tre` (the file name says mucsle instead of muscle because I misspelled the output name in the FastTree command)
- Then looked at the trees using the online tree viewer
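
Before comparing the two trees in the viewer, it can help to confirm that both contain the same set of sequences. A small sketch, assuming the `.tre` files are plain Newick text with unquoted names; the two tree strings below are toy examples, not my FastTree output.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string (unquoted names only)."""
    # Leaf labels come right after "(" or ","; labels that follow ")" are
    # internal-node labels (for example support values) and are skipped.
    return [m.strip() for m in re.findall(r"(?<=[(,])[^(),:;]+", newick)]


# Toy trees with the same leaves arranged differently.
demo_tree_1 = "((A:0.1,B:0.2)0.95:0.3,C:0.4);"
demo_tree_2 = "(C:0.4,(B:0.2,A:0.1)0.80:0.3);"

leaves_1 = leaf_names(demo_tree_1)
leaves_2 = leaf_names(demo_tree_2)
print(sorted(leaves_1) == sorted(leaves_2))
```

For the real files the strings would be read from disk first; names containing quotes, commas, or parentheses would need a proper Newick parser instead of this regular expression.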