# Lab 6
'export BLASTDB=" /courses/bi278/Course_Materials/blastdb"'
'echo $BLASTDB' --> establish where the system should look for the BLAST database
'rpsblast -query lab_04/PROKKA_09242022.faa -db /courses/bi278/Course_Materials/blastdb/Cog -evalue 0.01 -max_target_seqs 25 -outfmt "6 qseqid sseqid evalue qcovs stitle" > PROKKA_09242022.faa.rpsblast.table' (needed Course_Materials/blastdb/ because error = cannot retrieve path to RPS database)
'awk -F' [\t,]' '!x[$1]++ && $4>=70 {print $1, $5}' OFS="\t" PROKKA_09242022.faa.rpsblast.table > lab_06/PROKKA_09242022.faa.cog.table'
Keep only the best COG match for each query sequence that have query coverage of at least 70. This means at least 70% of the query sequence is covered by the subject that matched.
This command:
ü Takes the format 6 table and splits it up by both tabs and commas. The comma
thing is because the last column includes the COG number along with a lot of
other additional descriptors separated by commas. We want to just grab the
COG number from this column for the next step.
ü It then looks at all the rows that match the same query ($1) as long as each has
query coverage of at least 70
ü It keeps the first one (best match) and prints two columns (=fields). You should
be able to figure out what these fields are based on the context above
awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($2 in a){print $1, $2, a[$2]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/cognames2003-2014.tab lab_06/PROKKA_09242022.faa.cog.table > temp
This command takes the first file cognames2003-2014.tab and makes a look-up
table that can access the categories by COG number. It then takes the second file
lab_06/PROKKA_09242022.faa.cog.table and goes through it row by row and looks up
what category the COG number in that row belongs to. It then writes the row along with
the COG category that it looked up and stores it in a new file.
'awk -F "\t" '{if ( length($3)>1 ) { $3 = substr($3, 0, 1) } else { $3 = $3 }; print}' OFS="\t" temp > temp2' use this to keep the primary COG category for each COG number
'awk -F "\t" 'NR==FNR {a[$1]=$2;next} {if ($3 in a){print $0, a[$3]} else {print $0}}' OFS="\t" /courses/bi278/Course_Materials/blastdb/fun2003-2014.tab temp2 > PROKKA_09242022.faa.cog.categorized'
do this to see the full description of the functional category. Can't visualize it here but allows you to using cat or head
first make directory blastdb = mkdir blastdb
Then need to add new databse to blastdb 'export BLASTDB="/courses/bi278/Course_Materials/blastdb:/home2/dkalde22/myblastdb"'
'makeblastdb -in lab_04/PROKKA_09242022.faa -dbtype prot -title genomename_prot -out myblastdb/genomename_prot -parse_seqids' to make protein database
'makeblastdb -in /courses/bi278/Course_Materials/lab_04/bhqs69/PROKKA_09242022.ffn -dbtype nucl -title genomename_nucl -out myblastdb/genomename_nucl -parse_seqids' to make nucleotide database
'blastdb_aliastool -dbtype prot -title 4burk_prot -out
myblastdb/4burk_prot -dblist "bb395_prot bb433_prot bh155_prot
bh69_prot "' (We were told not to run this.) use this to combine created files and search within them.
AFter faa files were collected into file lab_066 then used command 'orthofinder -f lab_066/ -X' to run orthofinder on the files
'cat Comparative_Genomics_Statistics/Statistics_Overall.tsv' to visualize file summary of ortholog percentage etc
cat Species_Tree/SpeciesTree_rooted_node_labels_txt to visualize resulting phylogeny
head Phylogenetic_Hierarchical_Orthogroups/N0.tsv to look at files describing orthogroups of genes shared by more than one genome
'awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!=""
{print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv' to find genes shared between specified genomes
head Orthogroups/Orthogroups_UnassignedGenes.tsv to visualize genes unique to genomes