# Lab 5: BLAST+
Kirsten Pastore
```
#ssh into bi278
ssh klpast23@bi278
#make new directory (in ~)
mkdir lab_05
```
set an environmental variale (tell BLAST+ where to look for databases)
```
echo $BLASTDB
#returns /courses/bi278/Course_Materials/blastdb
```
1.1 Run Command Line BLASTP
.ffn = nucleotide
.faa = amino acid
read the file with record seperator ">" and match a pattern, when finds pattern prints symbol ">" followed by the record ($0)
```
awk 'BEGIN {RS = ">"} /hypothetical protein/ {print ">"$0}'
ORS=" " PROKKA_09292022.faa
```
1. Write your hypothetical proteins to their own fasta file. Do this just for the protein sequences
```
awk 'BEGIN {RS = ">"}/hyothetical protein/ {print ">$0"}' ORS="" PROKKA_09272022.faa > hyp_protein.faa
```
2. take a subset of these proteins (3 or less total) and send them to their own fasta file
```
awk 'BEGIN {RS=">"; srand()} (rand()<=.002) {print">"$0}' ORS="" PROKKA_09272022.faa > hyp_protein02.faa
```
3. Run a protein blast (blastp)
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot - evalue 1e-6 -query hyp_protein02.faa
```
4. modify search
```
blastp -task blastp-fast -num_threads 2 -db refseq_select_prot -evalue 1e-6 -outfmt "6 std ppos qcovs stitle sscinames staxid" -max_target_seqs 10 -query hyp_protein02.faa > subset_blast_fmt6.txt
```
5. take the subject sequence ids from the second column ($2) and save them as a list
```
awk '$1=="BB395_02192" {print $2}' example_blastp_outfmt6.txt | sort | uniq > example_blastp_BB395_02192.sseqids.txt
```
6. use the blastdbmd to find the entries from my search target database that match the subject sequence ids
```
blastdbcmd -db refseq_select_prot -entry_batch example_blastp_BB395_02192.sseqids.txt
```
1.2 Run Command Line BLASTN
this is the basic nucleotide BLAST search
```
blastn -task megablast -num_threads 2 -query PROKKA_09272022.ffn -db nt -evalue 1e-6
```
Want these columns to end up in my reorganized final output file:
1. query sequence id
2. subject sequence id
3. full name of subject sequence
4. scientific name of subject sequence
5. percent positive scoring matches
6. query coverage
7. e-value
8. Run a command to retrieve these fields only:
```
awk -F "\t" '{print $1, $2, $15, $16, $13, $14, $11}' OFS="\t" subset_blast02_fmt6.txt
```
9. Parse your results from the various BLAST searches you have been running to figure out your answers to the following questions. Note that you will have at most 10 results from each query sequence
Q2. What are the default values of the options matrix, evalue, max_target_seqs, num_descriptions, num_alignments for each type of BLAST? Under what circustances would you use those defaults vs. when would you consider changing them?
-matrix: Scoring matrix name (normally BLOSUM62 for both)
-evalue: Expectation value (E) threshold for saving hits, Default = '10' for both
-max_target_seqs: maximum number of aligned sequences to keep (value of 5 or more is recommended), Default = '500' for both
-num_descriptions: show one-line descriptions for this number of database sequences
-num_alignments: number of database sequences to show alignments for, Default - '250' for both
You would change the scoring matrix BLOSUM62 when you want to change the scoring method to determine sequence alignment and discern evolutionary distance. The evalue threshold determines the stringency of your search, and using th default value would not yield a very stringent search. We used 1e-6, whch means that good e-values for BLAST search are considered to be those that are 1e-6 or smaller. We changed num_alignments to 10 for this example, but if we wanted a larger number of alignments we would use the defaults. For num_descriptions the default is 500, for this example, we decreased the size of num_descriptions. For a project, we may increase the number of descriptions or keep the default. For max_target_seqs, we changed the value to 10 in this example, but we would increase the value if we wanted data for a project.
```
blastdb_aliastool -dbtype prot -title 4burk_prot -out myblastdb/4burk_prot -dblist "bbqs395 bbqs433_prot bh155_prot bh69_prot "
#returns: Created BLAST (alias) database myblastdb/4burk_prot with 10586 sequences
```
13. Make sure this worked by searching against one of your new custom databases
```
blastp -query tssH_bpseudo.fasta -db 4burk_prot -outfmt 6
```
## Exercise 3. Orthofinder and its output
Orthofinder is an orthogroup detection software that finds homologous groups of genes that are descended from a single ancestor gene
14. Collect all predicted protein sequences (prokka .faa equivalent) into a directory. Run orthofinder within that folder.
```
# in ~/lab_06/
mkdir prokka
mv *prokka* lab_06/prokka/
```
-f specifies where protein fasta files are and -X specifies that don't want Orthofinder to add species names to my gene ids
```
orthofinder -f ./ -X
```
get rid of empty directory for convenience
```
mv OrthoFinder/Results_Oct22/
rmdir OrthoFinder/
cd Results_Oct22
```
15. Check the overall statistics
```
cat Comparative_Genomics_Statistics/Statistics_Overall.tsv
```
16. Check out the species tree.
```
cat Species_Tree/SpeciesTree_rooted_node_labels_txt
```
Copy and paste this output into iToL
17. Now, check out the file that describes the orthogroups of genes shared by more than one genome
```
head Phylogenetic_Hierarchical_orthogroups/N0.tsv
```
this file lists out all genes that belong to each detected orthogroup
18. To find genes that are shared between genome 1 and genomes 4 and 5 find rows that have an entry in fields $4, $7, $8 but not $5, $6
```
awk -F "\t" '$4!="" && $5=="" && $6=="" && $7!="" && $8!="" {print}' OFS="\t" Phylogenetic_Hierarchical_Orthogroups/N0.tsv
```
19. Let's see which genes are unique to individual genomes
```
head Orthogroups/Orthogroups_UnassignedGenes.tsv
```