# CompBio Research '22-'23 (#2)
## Results from Preliminary Analysis (12/08/22)
The tree comparisons between the component trees and the species tree are [here](https://docs.google.com/presentation/d/1rhi58CRbOwfFIj8Bv5sgdVt1PKG8nyu7Vx2dIPosLjM/edit?usp=sharing).
## Creating General Species Tree (12/01/22, 12/07/22, 12/08/22)
- gather whole genomes of top ten species and sample species
- find in research area by using grep for species names in GCA files
- do an alignment-free tree
- use AAF (kmer counts and align)
- ASTRAL-III --> species tree from gene tree
- label gene trees with phylums and other higher orders
- research using API Mia used
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6
#gather species names of all species with top ten hits
cat ./iglB_tssC/*matches.ranked_topten.txt > ss6_all.matches.txt
cat ./tssB/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssD/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssE/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssF/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssG/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssJ/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssK/*matches.ranked_topten.txt >> ss6_all.matches.txt
cat ./tssL/*matches.ranked_topten.txt >> ss6_all.matches.txt
sort ss6_all.matches.txt | uniq > ss6_all.matches.uniq.txt
#gather whole genome sequences from species
cd /research/snoh/GEBA-0/ncbi_dataset/data
###matching species names with directory
grep -f /export/groups/snoh/ksdixo/secretion_systems/all/ss6/ss6_all.matches.uniq.txt ./*/*.fna
###make input directory for aaf
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6
mkdir input
cd input
mkdir Beutenbergia_cavernae
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000023105.1/GCA*genomic.fna ./Beutenbergia_cavernae
mkdir Kangiella_koreensis
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000024085.1/GCA*genomic.fna ./Kangiella_koreensis
mkdir Haliangium_ochraceum
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000024805.1/GCA*genomic.fna ./Haliangium_ochraceum
mkdir Rhodothermus_marinus
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000024845.1/GCA*genomic.fna ./Rhodothermus_marinus
mkdir Pirellula_staleyi
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000025185.1/GCA*genomic.fna ./Pirellula_staleyi
mkdir Arcobacter_nitrofigilis
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000092245.1/GCA*genomic.fna ./Arcobacter_nitrofigilis
mkdir Rubinisphaera_brasiliensis
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000165715.3/GCA*genomic.fna ./Rubinisphaera_brasiliensis
mkdir Paludibacter_propionicigenes
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000183135.1/GCA*genomic.fna ./Paludibacter_propionicigenes
mkdir Muricauda_ruestringensis
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000224085.1/GCA*genomic.fna ./Muricauda_ruestringensis
mkdir Frateuria_aurantia
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000242255.3/GCA*genomic.fna ./Frateuria_aurantia
mkdir Singulisphaera_acidiphila
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000242455.3/GCA*genomic.fna ./Singulisphaera_acidiphila
mkdir Thermocrinis_ruber
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000512735.1/GCA*genomic.fna ./Thermocrinis_ruber
mkdir Escherichia_coli
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000690815.1/GCA*genomic.fna ./Escherichia_coli
###creating tree using AAF
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -o k25 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k25.dat.gz -o k25 -f k25.wc
mv k25.tre /personal/ksdixo23/NohLabResearch/hgt/
#add burkholderia genomes to input directory and rerun tree making
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input
mkdir Burkholderia_agricolaris
cp /export/groups/snoh/ksdixo/burk_genomes/baqs159.prokka.v3.faa ./Burkholderia_agricolaris
mkdir Burkholderia_bonniea
cp /export/groups/snoh/ksdixo/burk_genomes/bbqs859.prokka.v3.faa ./Burkholderia_bonniea
mkdir Burkholderia_haylella
cp /export/groups/snoh/ksdixo/burk_genomes/bhqs11.prokka.v3.faa ./Burkholderia_haylella
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6/output
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -o k25 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
####currently getting an index out of range error when running this now
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k25.dat.gz -t 4 -o k25 -f k25.wc
mv k25.tre /personal/ksdixo23/NohLabResearch/hgt/
#add correct files for burkholderia genomes to input directoty and rerun tree making
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input
cp /export/groups/snoh/ksdixo/burk_genomes/baqs159.fna ./Burkholderia_agricolaris
cp /export/groups/snoh/ksdixo/burk_genomes/bbqs859.fna ./Burkholderia_bonniea
cp /export/groups/snoh/ksdixo/burk_genomes/bhqs11.fna ./Burkholderia_haylella
###25 kmer length
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -o k25 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k25.dat.gz -t 4 -o k25 -f k25.wc
mv k25.tre /personal/ksdixo23/NohLabResearch/hgt/
###38 kmer length
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -k 38 -o k38 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k38.dat.gz -t 4 -o k38 -f k38.wc
mv k38.tre /personal/ksdixo23/NohLabResearch/hgt/
#forgot to include three genomes so added to input directory and reran tree making
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input
mkdir Holophaga_foetida
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000242615.3/*scaf.fna ./Holophaga_foetida
mkdir Mucilaginibacter_paludis
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000166195.3/chr.fna ./Mucilaginibacter_paludis
mkdir Thiothrix_nivea
cp /research/snoh/GEBA-0/ncbi_dataset/data/GCA_000260135.1/*scaf.fna ./Thiothrix_nivea
cd /export/groups/snoh/ksdixo/secretion_systems/all/ss6/output
###25 kmer length
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -o k25 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k25.dat.gz -t 4 -o k25 -f k25.wc
mv k25.tre /personal/ksdixo23/NohLabResearch/hgt/
###38 kmer length
python /usr/local/src/AAF-20171218/BetaVersion/aaf_phylokmer.py -k 38 -o k38 -d /export/groups/snoh/ksdixo/secretion_systems/all/ss6/input/
python /usr/local/src/AAF-20171218/BetaVersion/aaf_distance.py -i k38.dat.gz -t 4 -o k38 -f k38.wc
mv k38.tre /personal/ksdixo23/NohLabResearch/hgt/
```
## Gene Alignment and Gene Trees (11/10/22, 11/30/22, & 12/01/22)
Before this momement, I was using all genes that related to the T6SS to find similar genes in other species, align all of those genes from other species and the species I was interested in and create trees based on those alignments. Now, I must filter these results by gene (one gene for a specific component in T6SS) and use these filtered results to realign and create gene trees.
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/
#examining different gene and corresponding component
grep T6SS ../baqs159/baqs159.ss_components.*txt
grep T6SS ../bbqs859/bbqs859.ss_components.*txt
grep T6SS ../bhqs11/bhqs11.ss_components.*txt
#creating gene specific query files combining the three species
mkdir all
cd all
grep tssB ../baqs159/baqs159.ss_components.*txt | cut -f 4 > ss6_tssB_geneIDs.txt
grep tssB ../bbqs859/bbqs859.ss_components.*txt | cut -f 4 >> ss6_tssB_geneIDs.txt
grep tssB ../bhqs11/bhqs11.ss_components.*txt | cut -f 4 >> ss6_tssB_geneIDs.txt
(repeat for iglB/tssC, tssD, tssE, tssF, tssG, tssJ, tssK, tssL; made a quick shell file for this but combined iglB and tssC files --> genetogeneIDs.sh component_name secreation_system_name)
#aligniment and creating trees
###run diamond blast
diamond blastp -b8 -c1 -d /research/snoh/diamonddb/GEBA_prot.dmnd -q ss6_tssB.faa -o ss6_tssB.matches.tsv --no-self-hits -e 1e-6 -k 100 -f 6 qseqid sseqid staxids sscinames sskingdoms skingdoms sphylums pident length mismatch gapopen qstart qend sstart send full_sseq evalue bitscore
###ranking on number of hits for each species (top ten)
cut -f 4 ss6_tssB.matches.tsv > ss6_tssB.matches.sscinames.txt
sort ss6_tssB.matches.sscinames.txt | grep -E -h "[a-z,A-Z] [a-z,A-Z]" | grep -v [0-9] | uniq -c | sort -n -r > ss6_tssB.matches.ranked.txt
head -10 ss6_tssB.matches.ranked.txt | sed -r "s/.{8}//" > ss6_tssB.matches.ranked_topten.txt
###making faa files for matches from top ten species
cut -f 2,4,16 ss6_tssB.matches.tsv > ss6_tssB.matches.seqs.txt
grep -f ss6_tssB.matches.ranked_topten.txt ss6_tssB.matches.seqs.txt | grep WP_ | sort | uniq > ss6_tssB.matches.ranked_topten.seqs.txt
sed "s/^/>/" ss6_tssB.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" | sed -r "s/ /_/g" > ss6_tssB.matches.faa
###combining faa files from original samples and matches
sed " /AGRI/ s/$/|Burkholderia_agricolaris/g" ss6_tssB.faa > temp.faa
sed " /BONN/ s/$/|Burkholderia_bonniea/g" temp.faa > temp2.faa
sed " /HAYL/ s/$/|Burkholderia_haylella/g" temp2.faa > temp3.faa
cat ss6_tssB.matches.faa temp3.faa > ss6_tssB.all.faa
###run alignment
linsi ss6_tssB.all.faa > ss6_tssB.mafft_LINSi.faa
ginsi ss6_tssB.all.faa > ss6_tssB.mafft_GINSi.faa
###create tree and move for access in R
FastTree ss6_tssB.mafft_LINSi.faa > ss6_tssB.tree_LINSi.tree
FastTree ss6_tssB.mafft_GINSi.faa > ss6_tssB.tree_GINSi.tree
mv ss6_tssB.tree_LINSi.tree /personal/ksdixo23/NohLabResearch/hgt/
mv ss6_tssB.tree_GINSi.tree /personal/ksdixo23/NohLabResearch/hgt/
###remove unnecessary files
rm temp*.faa
(repeat for other components using shell file --> alignAndTree.sh component_name secretion_system_name)
#tssG has an illegal character U in place 122 in sequence 7 so couldn't do alignment --> adding names to faa files messed up the files so need to change how we are doing this part
###combining faa files from original samples and matches
sed " /AGRI/ s/$/|Burkholderia_agricolaris/g" ss6_tssG.faa > temp.faa
sed -i "s/[a-zA-Z]|Burkholderia_agricolaris$//g" temp.faa
sed " /BONN/ s/$/|Burkholderia_bonniea/g" temp.faa > temp2.faa
sed -i "s/[a-zA-Z]|Burkholderia_bonniea$//g" temp2.faa
sed " /HAYL/ s/$/|Burkholderia_haylella/g" temp2.faa > temp3.faa
sed -i "s/[a-zA-Z]|Burkholderia_haylella$//g" temp3.faa
cat ss6_tssB.matches.faa temp3.faa > ss6_tssB.all.faa
```
## Visualizing Trees (11/02/22 & 11/03/22)
Most of this is done in Rstudio at working directory ```/personal/ksdixo23/NohLabResearch/hgt``` in hgt_ss6.R. The code down below is things ran on commandline in nscc.
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/baqs159/
#the labels that are stored cutoff part of the species name
#change creation of faa files to avoid this
sed "s/^/>/" baqs159.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" | sed -r "s/ /_/g" > baqs159.ss6.matches.faa
#need to add sequences of ss6 from sample to alignment and tree creation
sed " /^>/ s/$/|Burkholderia_agricolaris/g" baqs159.ss6.faa > temp.faa
cat baqs159.ss6.matches.faa temp.faa > baqs159.ss6.all.faa
rm baqs159.ss6.seqs.faa
rm temp.faa
#rerun eveything after this to reflect changes
linsi baqs159.ss6.all.faa > baqs159.ss6.mafft_LINSi.faa
ginsi baqs159.ss6.all.faa > baqs159.ss6.mafft_GINSi.faa
FastTree baqs159.ss6.mafft_LINSi.faa > baqs.tree_LINSi.tree
FastTree baqs159.ss6.mafft_GINSi.faa > baqs.tree_GINSi.tree
mv baqs.tree_LINSi.tree /personal/ksdixo23/NohLabResearch/hgt/baqs.tree_LINSi.tree
mv baqs.tree_GINSi.tree /personal/ksdixo23/NohLabResearch/hgt/baqs.tree_GINSi.tree
```
Nest Steps --> Find a good way to compare trees, label trees in a way were differences are clear
## MAFFT Alignment and Tree Making (10/27/22, 11/02/22)
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/baqs159/
mafft --auto --inputorder baqs159.ss6.seqs.faa > baqs159.ss6.mafft.faa
mv baqs159.ss6.mafft.faa > baqs159.ss6.mafft_LINSi.faa
ginsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft_GINSi.faa
###best model was selected by the program, the program used L-INS-i, assumes only one alignable domain between sequences (loacl alignment)
####linsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft.faa
###G-INS-i, assumes entire sequence can be aligned (gloabl alignment)
####ginsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft.faa
###E-INS-i, assumes there are long unalignable sequences
####einsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft.faa
FastTree baqs159.ss6.mafft_LINSi.faa > tree_LINSi.tree #didn't work because headers were not unique
#reran some file prep so headers are sequence ids instead of scientific names
cut -f 2,4,16 baqs159.ss6.matches.tsv > baqs159.ss6.matches.seqs.txt
grep -f baqs159.ss6.matches.ranked_topten.input.txt baqs159.ss6.matches.seqs.txt | grep WP_ > baqs159.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" baqs159.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" > baqs159.ss6.seqs.faa
#reran alignment and FastTree
linsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft_LINSi.faa
ginsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft_GINSi.faa
FastTree baqs159.ss6.mafft_LINSi.faa > tree_LINSi.tree ##didn't work again because headers were not unique, realized there was a dublicate sequence, must remove it, may use uniq?
#reran some file prep so no replicated headers and sequences
grep -f baqs159.ss6.matches.ranked_topten.input.txt baqs159.ss6.matches.seqs.txt | grep WP_ | sort | uniq > baqs159.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" baqs159.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" > baqs159.ss6.seqs.faa
#reran alignment and FastTree
linsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft_LINSi.faa
ginsi baqs159.ss6.seqs.faa > baqs159.ss6.mafft_GINSi.faa
FastTree baqs159.ss6.mafft_LINSi.faa > tree_LINSi.tree
FastTree baqs159.ss6.mafft_GINSi.faa > tree_GINSi.tree
mv tree_LINSi.tree baqs.tree_LINSi.tree
mv tree_GINSi.tree baqs.tree_GINSi.tree
#rerun file prep for bbqs859 and bhqs11, run alignment, and FastTree
cd /export/groups/snoh/ksdixo/secretion_systems/bbqs859/
cut -f 2,4,16 bbqs859.ss6.matches.tsv > bbqs859.ss6.matches.seqs.txt
grep -f bbqs859.ss6.matches.ranked_topten.input.txt bbqs859.ss6.matches.seqs.txt | grep WP_ | sort | uniq > bbqs859.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" bbqs859.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" > bbqs859.ss6.seqs.faa
linsi bbqs859.ss6.seqs.faa > bbqs859.ss6.mafft_LINSi.faa
ginsi bbqs859.ss6.seqs.faa > bbqs859.ss6.mafft_GINSi.faa
FastTree bbqs859.ss6.mafft_LINSi.faa > bbqs.tree_LINSi.tree
FastTree bbqs859.ss6.mafft_GINSi.faa > bbqs.tree_GINSi.tree
cd /export/groups/snoh/ksdixo/secretion_systems/bhqs11/
cut -f 2,4,16 bhqs11.ss6.matches.tsv > bhqs11.ss6.matches.seqs.txt
grep -f bhqs11.ss6.matches.ranked_topten.input.txt bhqs11.ss6.matches.seqs.txt | grep WP_ | sort | uniq > bhqs11.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" bhqs11.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/|/" | sed -r "s/\t/\n/g" > bhqs11.ss6.seqs.faa
linsi bhqs11.ss6.seqs.faa > bhqs11.ss6.mafft_LINSi.faa
ginsi bhqs11.ss6.seqs.faa > bhqs11.ss6.mafft_GINSi.faa
FastTree bhqs11.ss6.mafft_LINSi.faa > bhqs.tree_LINSi.tree
FastTree bhqs11.ss6.mafft_GINSi.faa > bhqs.tree_GINSi.tree
#copy files so accesible in Rstudio
cd ~
cd /colbyhome/NohLabResearch/hgt
cp /export/groups/snoh/ksdixo/secretion_systems/baqs159/*.tree .
cp /export/groups/snoh/ksdixo/secretion_systems/bhqs11/*.tree .
cp /export/groups/snoh/ksdixo/secretion_systems/bbqs859/*.tree .
```
## File Prep for Alignment (10/20/22)
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/
#rerunning blast to include output taxon IDs, scientific names, super kingdoms, kingdoms, and phylums
diamond blastp -b8 -c1 -d /research/snoh/diamonddb/GEBA_prot.dmnd -q baqs159.ss6.faa -o baqs159.ss6.matches.tsv --no-self-hits -e 1e-6 -k 100 -f 6 qseqid sseqid staxids sscinames sskingdoms skingdoms sphylums pident length mismatch gapopen qstart qend sstart send full_sseq evalue bitscore
#exploring phylums of subject species with hits for all three genomes
cut -f 7 baqs159.ss6.matches.tsv > baqs159.ss6.matches.sphylums.txt
sort baqs159.ss6.matches.sphylums.txt | uniq -c | sort -n -r > baqs159.ss6.matches.ranked.phylums.txt
cut -f 7 bbqs859.ss6.matches.tsv > bbqs859.ss6.matches.sphylums.txt
sort bbqs859.ss6.matches.sphylums.txt | uniq -c | sort -n -r > bbqs859.ss6.matches.ranked.phylums.txt
cut -f 7 bhqs11.ss6.matches.tsv > bhqs11.ss6.matches.sphylums.txt
sort bhqs11.ss6.matches.sphylums.txt | uniq -c | sort -n -r > bhqs11.ss6.matches.ranked.phylums.txt
###no species of of phylum pseudomonadota (level higher than Betaproteobacteria) so may not have to worry about removing betaproteobacteria?
#clean up directory
mkdir baqs159
mkdir bbqs859
mkdir bhqs11
mv ./baqs159.* ./baqs159
mv ./bbqs859.* ./bbqs859
mv ./bhqs11.* ./bhqs11
#ran reranking and file prep for bbqs8859 and bhqs11
cut -f 4 bbqs859.ss6.matches.tsv > bbqs859.ss6.matches.sscinames.txt
sort bbqs859.ss6.matches.sscinames.txt | uniq -c | sort -n -r > bbqs859.ss6.matches.ranked.txt
sort bbqs859.ss6.matches.sscinames.txt | grep -E -h "[a-z,A-Z] [a-z,A-Z]" | grep -v [0-9] | uniq -c | sort -n -r > bbqs859.ss6.matches.ranked.txt
head -10 bbqs859.ss6.matches.ranked.txt > bbqs859.ss6.matches.ranked_topten.txt
sed -r "s/.{8}//" bbqs859.ss6.matches.ranked_topten.txt > bbqs859.ss6.matches.ranked_topten.input.txt
cut -f 4,16 bbqs859.ss6.matches.tsv > bbqs859.ss6.matches.seqs.txt
grep -f bbqs859.ss6.matches.ranked_topten.input.txt bbqs859.ss6.matches.seqs.txt | grep -v [0-9] > bbqs859.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" bbqs859.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/\n/g" > bbqs859.ss6.seqs.faa
cut -f 4 bhqs11.ss6.matches.tsv > bhqs11.ss6.matches.sscinames.txt
sort bhqs11.ss6.matches.sscinames.txt | uniq -c | sort -n -r > bhqs11.ss6.matches.ranked.txt
sort bhqs11.ss6.matches.sscinames.txt | grep -E -h "[a-z,A-Z] [a-z,A-Z]" | grep -v [0-9] | uniq -c | sort -n -r > bhqs11.ss6.matches.ranked.txt
head -10 bhqs11.ss6.matches.ranked.txt > bhqs11.ss6.matches.ranked_topten.txt
sed -r "s/.{8}//" bhqs11.ss6.matches.ranked_topten.txt > bhqs11.ss6.matches.ranked_topten.input.txt
cut -f 4,16 bhqs11.ss6.matches.tsv > bhqs11.ss6.matches.seqs.txt
grep -f bhqs11.ss6.matches.ranked_topten.input.txt bhqs11.ss6.matches.seqs.txt | grep -v [0-9] > bhqs11.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" bhqs11.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/\n/g" > bhqs11.ss6.seqs.faa
```
## Reranking (10/13/22, 10/19/22, & 10/20/22)
- Wang and Wu paper
- aligned protein sequences of genes from top ten species using MAFFT
- aligned columns were trimmed using ZORRO (probability of 0.4)
- trees were constructured using RAxML
- Hamilton's protocol
- use blastdbcmd to get full protein sequences
- aligned multiple protein sequences using MUSCLE
- contruct tree using fasttree
- visulaize tree usig RStuido
- Possible Approcahes
1. Gather subject ids using ranking text files (all ids that match top ten species names)
2. Rerun diamond blastp search and include subject sequence OR use blastdbcmd to get full sequences using subject ids
3. alignment of protein sequences using MAFFT OR MUSCLE
4. clean up alignments???
5. construct tree using fasttree OR RAxML
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/
#rerunning blast to include output scientific names
###diamond makedb uses the flags for --taxonmap --taxonnames and --taxonnodes to get scientific names, kingdoms, super kingdoms, and phylumms
diamond blastp -b8 -c1 -d /research/snoh/diamonddb/GEBA_prot.dmnd -q baqs159.ss6.faa -o baqs159.ss6.matches.tsv --no-self-hits -e 1e-6 -k 100 -f 6 qseqid sseqid sscinames pident length mismatch gapopen qstart qend sstart send full_sseq evalue bitscore
#redid ranking based on scientific names
cut -f 3 baqs159.ss6.matches.tsv > baqs159.ss6.matches.sscinames.txt
sort baqs159.ss6.matches.sscinames.txt | uniq -c | sort -n -r > baqs159.ss6.matches.ranked.txt
#cleaned up directory
rm *.ranked_better.txt
rm *.stitles.txt
rm *.topten.*
rm *.stitles.*
rm bbqs859.ss6.matches.ranked.txt
rm bbqs859.ss6.matches.tsv
rm bhqs11.ss6.matches.ranked.txt
rm bhqs11.ss6.matches.tsv
#examined if duplicate sscinames have the same protein sequences
cat baqs159.ss6.matches.ranked.txt
grep -F -w "Singulisphaera acidiphila" baqs159.ss6.matches.tsv | cut -f 2,3,12
grep -F -w "Arcobacter nitrofigilis" baqs159.ss6.matches.tsv | cut -f 2,3,12
grep -F -w Rubinisphaera baqs159.ss6.matches.tsv | cut -f 2,3,12
###based on my observations, the duplicate names have duplicate protein sequences so remove sscinames with DSM or just genus
sort baqs159.ss6.matches.sscinames.txt | grep -E -h "[a-z,A-Z] [a-z,A-Z]" | grep -v [0-9] | uniq -c | sort -n -r > baqs159.ss6.matches.ranked.txt
head -10 baqs159.ss6.matches.ranked.txt > baqs159.ss6.matches.ranked_topten.txt
###counts are making it impossible to get the sequences from the blast output, must remove those counts
sed -r "s/.{8}//" baqs159.ss6.matches.ranked_topten.txt > baqs159.ss6.matches.ranked_topten.input.txt
#based on top ten matches, create faa with protein sequences
cut -f 3,12 baqs159.ss6.matches.tsv > baqs159.ss6.matches.seqs.txt
grep -f baqs159.ss6.matches.ranked_topten.input.txt baqs159.ss6.matches.seqs.txt | grep -v [0-9] > baqs159.ss6.matches.ranked_topten.seqs.txt
sed "s/^/>/" baqs159.ss6.matches.ranked_topten.seqs.txt | sed -r "s/\t/\n/g" > baqs159.ss6.seqs.faa
```
## Ranking (10/12/22)
- use grep to determine the amount of times different species are there
- rank based on the above metric
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/secretion_systems/
#rerunning blast to include output subject titles
diamond blastp -b8 -c1 -d /export/groups/snoh/shared/diamonddb/GEBA_prot.dmnd -q baqs159.ss6.faa -o baqs159.ss6.matches.tsv -f 6 qseqid sseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore
cut -f 3 baqs159.ss6.matches.tsv > baqs159.ss6.matches.stitles.txt
grep -oP '\[.*?\]' baqs159.ss6.matches.stitles.txt > baqs159.ss6.matches.stitles.sub.txt
sort baqs159.ss6.matches.stitles.sub.txt | uniq -c | sort -n -r > baqs159.ss6.matches.ranked.txt
#should I combine the regular one with the DSM ones? should I get rid of the DSM?
grep -v DSM baqs159.ss6.matches.ranked.txt > baqs159.ss6.matches.ranked_better.txt
#show top ten best hits
head baqs159.ss6.matches.ranked_better.txt
#repeat for other two genomes (bbqs859, bhqs11)
```
## BLAST search (10/06/22)
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/
cd secretion_systems/
rm baqs159.ss_genes.txt
rm baqs159.ss.faa
grep -F -w T6SS baqs159.ss_components.manual.txt > baqs159.ss6_components.manual.txt
cut -f 4 baqs159.ss6_components.manual.txt > baqs159.ss6_geneIDs.txt
nano
###shell script baqs159.geneIDtofaa.sh
#!/bin/bash
file = $1
while IFS= read -r line
do
samtools faidx /export/groups/snoh/ksdixo/burk_genomes/baqs159.prokka.v3.faa "$line" >> $2
done < "$file"
###
sh baqs159.geneIDtofaa.sh baqs159.ss6_geneIDs.txt baqs159.ss6.faa
diamond blastp -b8 -c1 -d /export/groups/snoh/shared/diamonddb/GEBA_prot.dmnd -q baqs159.ss6.faa -o baqs159.ss6.matches.tsv
#repeat for other two genomes (bbqs859, bhqs11)
```
## BLAST search (10/05/22)
```
ssh ksdixo23@nscc
ssh n26
cd /export/groups/snoh/ksdixo/
cd secretion_systems/
#creates txt file with geneids (type 6 to start)
cut -f 4 baqs159.ss_components.manual.txt > baqs159.ss_genes.txt
#stuck here, samtools does not read basic text file
#any way to print out contents of text file?
#just copy and past contents of text file?
#make a shell script?(>>)
#use blast -entry_batch command instead?
samtools faidx /export/groups/snoh/ksdixo/burk_genomes/baqs159.prokka.v3.faa baqs159.ss_genes.txt > baqs159.ss.faa
samtools faidx /export/groups/snoh/ksdixo/burk_genomes/baqs159.prokka.v3.faa -r baqs159.ss_genes.txt > baqs159.ss.faa
samtools faidx /export/groups/snoh/ksdixo/burk_genomes/baqs159.prokka.v3.faa --region-file baqs159.ss_genes.txt > baqs159.ss.faa
```
## Planning Pipeline (10/03/22)
- BLAST search
- Use HGT list to do a BLASTp search using diamond on GEBA
1. take query id and besthit id from HGT list
2. use thoses to find exact sequences
3. take exact sequences to do BLASTp search on GEBA database
4. results in text file of hits that must be filtered (only non alpha-proteobacteria, no self-hits, and evalue 1e-7 or lower)
- Use full genomes to do BLAST search on GEBA
```
cut -f 2 'species.darkhorse2_intact_hgt.txt' > 'species_result_hits.txt'
/usr/local/src/ncbi-blast-2.10.1+/bin/blastdbcmd -db 'NCBI_NR_db' -entry_batch 'species_result_hits.txt' > 'species_intact_HGT.faa'
./diamond prepdb -d 'GEBA_db'
./diamond blastp -d 'GEBA_db' -q 'species_intact_HGT.faa' -o 'species_out.tsv'
```
- Functional Annotation
1. Use HMMer3 to search protein sequences of HGT hits (https://www.ebi.ac.uk/Tools/pfa/hmmer3_hmmscan/)(still not sure how to use this)
- Ranking
- Phylogentic Analysis
- Species Tree
## Starting Off (09/29/22)
- using BLAST (BLASTp and NCBI BLAST) to identify potenial HGTs
- work on planning on pipeline
- Methods
- BLASTp search aganist 2461 bacterial genomes exculding self hits to *Rickettsiales*
- choose query sequences that have non alpha-proteobacterial best hits as candidate genes for LGTs
- COG functional annotation using hidden markov model HMMer3
- ranked the non alpha-proteobacterial species by frequency as best hits
- using top ten species, phylogentic analysis
- compare to species tree
- look for conjugative T4SS or phage genes mechanistic evidence of how genes were shared
## Plans for Research (09/21/22)
- aspects of the geneome suggesting HGT
- little conservation of gene order --> reorganization
- presence of conjugative DNA transfer in genome --> possibile to accept and transfer DNA material
- homologs with other closely related organisms
- general steps for finding evidence for HGT and amoeba "melting pot" theory
- annotations of genomes of interest
- see above for characteristics to look for
- tools: FastA (reciprocal best matches); pseudoFinder; selfIF
- use functional analysis tools to understand the prevalance of certain genes for certain processess
- tools: BLAST; COG (hidden markov model)
- alignments of genomes of interest?
- tools: MUSCLE
- phylogentic analysis of genomes of interest
- focus on the presence of the secreation system?
- comparisons of geneomes also found to live in amoebae and the geneome of the amoeba itself
```
#place where copies of genome, hgt gene lists, and secretion systmem gene lists
/export/groups/snoh/ksdixo/
```
### notes from Suegene
#### 9/29/22
* NCBI representative prokaryotic genomes (a.k.a [prokaryotic refseq genomes](https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/)) may contain 200,000+ genomes. This is too much, so I found a potential alternative source in a project called [GEBA](https://www.nature.com/articles/nbt.3886). Genomic Encyclopedia of Bacteria and Archaea (GEBA) was designed to broadly represent the prokaryotic tree of life. In the last release, it contained 1000+ genomes in various stages of completion. These are available through NCBI [PRJNA30815](https://www.ncbi.nlm.nih.gov/bioproject/30815).
* Between the GEBA paper's supplementary table and the specific BioProject, it looks like we should be able to download these genomes but we need [a new NCBI software](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/) so I have requested for this to be installed. Will update you soon.
#### 10/3/22
* Promised update: I made a `diamond` database with predicted gene aa sequences from the 374 GEBA genomes. You can run `diamond blastp` against this database as follows:
```
diamond blastp -b8 -c1 -d /export/groups/snoh/shared/diamonddb/GEBA_prot.dmnd -q bbqs859_tssH.fasta
```
* To run blast, you will need an input fasta sequence. All the other blast tricks probably apply to you will want to review them.
* This week's lab tutorial will show you how to get a sequence based on its gene_id from an faa file. You can get the gene_ids from the files in:
`/export/groups/snoh/ksdixo/secretion_systems/`
And you can get the aa sequences from these files using the `samtools faidx` command:
`/export/groups/snoh/ksdixo/burk_genomes/*faa`