Lab 4 notebook (10/19)

###### tags: `BI278` # Lab 4 notebook (10/19) As always, I start with the same commands to set up and import materials from the course folder. ``` ssh lmdrep23@bi278 mkdir lab_04 cd lab_04 cp /courses/bi278/Course_Materials/lab_04/* . ``` 1.1. DOWNLOADING GENOME FILES FROM NCBI This step shows how to download genome files from NCBI. We don't need to do this since the provided course materials are the needed genome files, but refer back to this section of the lab for working on the project. 1.2. MAKING CUSTOM BLAST DATABASES I made a BLASTdatabases directory in the lab folder and then made protein and nucleotide databases in this folder with the following commands, modifying for each species: ``` mkdir BLASTdatabases /usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in P.terrae.ncbi.protein.faa -dbtype prot -title pterrae_prot -out //home/lmdrep23/lab_04/BLASTdatabases/pterrae_prot -parse_seqids /usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in P.terrae.ncbi.cds_from_genomic.fna -dbtype nucl -title pterrae_nucl -out //home/lmdrep23/lab_04/BLASTdatabases/pterrae_nucl -parse_seqids ``` I regularly used the commands ```cd BLASTdatabases``` ```ls``` and ```cd ..``` to go back and forth between the lab_04 and BLAST directories to make sure all was working. I then had to add $BLASTDB to my directory to be able to search these directories. The following command shows blast where to look for databases ```export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home/lmdrep23/lab_04/BLASTdatabases"``` I also created nucleotide and protein databases for the baqs70 genome ``` /usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in /home/lmdrep23/lab_03/baqs70/PROKKA_09282020.fna -dbtype nucl -title baqs70_nucl -out //home/lmdrep23/lab_04/BLASTdatabases/baqs70_nucl -parse_seqids``` ```/usr/local/src/ncbi-blast-2.10.1+/bin/makeblastdb -in /home/lmdrep23/lab_03/baqs70/PROKKA_09282020.faa -dbtype prot -title baqs70_prot -out //home/lmdrep23/lab_04/BLASTdatabases/baqs70_prot -parse_seqids ``` 1.3. RECIPROCAL (BI-DIRECTIONAL) BLAST The first step has me create a subset of gene protein sequences from last week. I used the following command to do so, increasing the rand probability from 0.002 to 0.005 to get around 15 proteins listed in the file. ``` awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print ">"$0}' ORS="" PROKKA_09282020.faa > try2_lab4_baqs.faa``` The command ```less try2_lab4_baqs.faa``` while in the baqs70 directory of lab_03 showed how this new command did give more proteins. Changing the directory back to lab_04 and using the following command moved this protein subset to the lab_04 folder ``` cp /home/lmdrep23/lab_03/baqs70/try2_lab4_baqs.faa .``` I then used BLASTp with these proteins from baqs70 as the query and the pterrae_prot database. The major difference from this BLAST command compared to the last lab is the max_target_hits being changed from 5 to 1 so that only the best hit is found. ``` /usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast -num_threads 4 -query try2_lab4_baqs.faa -db pterrae_prot -outfmt "6 std stitle sscinames" -max_target_seqs 1 -evalue 1e-6 > newbaqs70subset_pterraedb.txt``` With this specified output format, the sseqid of the gene in the pterrae_prot database that is the best hit to a protein from baqs70 is the second column of output. Instead of using the less command to see all of the output, we can look at just that column with the following command ```cut -f 2 newbaqs70subset_pterraedb.txt > sseqids_newbaqs70subset_pterraedb.txt ``` The output in that file is ```WP_042306374.1 WP_042306323.1 WP_042305410.1 WP_042308327.1 WP_042308474.1 WP_042313198.1 WP_007577083.1 WP_081920713.1 WP_042306617.1 WP_042304500.1 WP_042315187.1 WP_042311772.1 WP_042308120.1 WP_042316064.1 ``` , thus these are the best hits in the pterrae_prot database of the proteins in the subset. The following command then runs these sseqids (belonging to p terrae genome) in ncbi to obtain the sequence for each of them and put these in an output file. ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastdbcmd -db pterrae_prot -entry_batch sseqids_newbaqs70subset_pterraedb.txt > pterraesubjectgenes.faa``` We now switch to treating these proteins as the query, and BLAST them against the protein database for baqs70 to see if we find reciprocal best hits. This is done with the following command ```/usr/local/src/ncbi-blast-2.10.1+/bin/blastp -task blastp-fast -num_threads 4 -query pterraesubjectgenes.faa -db baqs70_prot -outfmt "6 std stitle sscinames" -max_target_seqs 1 -evalue 1e-6 > pterraesubset_baqs70db.txt``` To compare these hits to the hits found from baqs70, I found the sseqid list from the command ``` cut -f 2 pterraesubset_baqs70db.txt``` which gave the output : PAG70_00283 PAG70_00337 PAG70_01307 PAG70_01354 PAG70_01530 PAG70_03278 PAG70_04048 PAG70_04246 PAG70_04589 PAG70_05469 PAG70_05568 PAG70_01228 PAG70_04346 PAG70_07634 ...and compared this to the proteins present in the baqs70 subset file that was initially used as the query against the p terrae database. The command ```less try2_lab4_baqs.faa > sseqids_pterrasubset_baqs70db.txt``` allowed me to find the ids from the proteins in that initial query: PAG70_00283 PAG70_00337 PAG70_01223 PAG70_01307 PAG70_01354 PAG70_01530 PAG70_01762 PAG70_03278 PAG70_04048 PAG70_04246 PAG70_04589 PAG70_05469 PAG70_05568 PAG70_06083 PAG70_06235 PAG70_06470 PAG70_07634 Going down these two lists, once can find the RBHs to be PAG70_00283, PAG70_00337, PAG70_01307, PAG70_01354, PAG70_01530, PAG70_03278, PAG70_04048, PAG70_04246 PAG70_04589, PAG70_05469, PAG70_05568, and PAG70_07634. I then compared my baqs70 subset to the p. xenovorans and B.pseudomallei databases. The comparison of the files sseqids_newbaqs70subset_bpseudodb.txt and sseqids_bpseudosubset_baqs70db.txt identify reciprocal best hits for the baqs70 subset and B.pseudomallei genome. The same information can be found for the p. xenovorans genome with the files and sseqids_newbaqs70subset_pxenodb.txt and sseqids_pxenosubsett_baqs70db.txt. However, these are in different genomes, so it is more helpful to compare to the list of proteins in the baqs70 subset compiled above. Q1: The comaparison of sseqids_bpseudosubset_baqs70db.txt, sseqids_pxenosubsett_baqs70db.txt, and sseqids_pterrasubset_baqs70db.txt all reveal reciprocal best hits to the list of Q2: According to the tree provided in this lab, P. xenovorans is most closely related to baq70. I found 12 RBH between this genome and the baq70 subset. P. terrae is the second most closely related, for which I found 12 RBH with the baq70 subset. B. pseudomallei is the most distantly related, being listed as the outgroup, and I found 8 RBH for this genome with the baq70 subset. These findings generally support the idea that there are more RBH with more closely related species but does not provide overwhelmingly strong evidence. For the next step, I know that PAG70_00283 has an RBH in each genome, so I run the lines ```head newbaqs70subset_bpseudodb.txt```, ```head newbaqs70subset_pxenodb.txt```, and ```head newbaqs70subset_pterrae.txt``` to find the name of the corresponding gene to PAG70_00283 in each of the genomes. B. Pseudo: WP_004556694.1 P. Xeno: WP_011486644.1 P. Terrae: WP_042306374.1 I then looked into the files names __subjectgenes.faa which were retrieved from NCBI that have the sequences of the proteins, finding the sequence with the id above. I copy and pasted these sequences (including the original one from daq70) into a new faa file, which i created with the ```nano``` command. The following command then aligned these RBHs: ``` muscle -in four_rbh_seqs.faa -out gene_muscle_aln.faa``` The following command builds a cursory phylogeny out of these alignments. ```FastTree gene_muscle_aln.faa > gene_faa.tree``` Lastly, I can see the text version of these tree with the command ```cat gene_faa.tree```. This gives the following output: (P.Xeno:0.04683,DAQ70:0.05412,(BPseudo:0.28277,P.Terrae:0.17270)1.000:0.10473);