# VIRify database hmmer comparison
## Databases
```bash
# different versions of ViPhOGs
DB_VIPHOGS=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/vpHMM_database
DB_VIPHOGS_V2=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/v2/vpHMM_database_v2
DB_VIPHOGS_V3=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/v3/vpHMM_database_v3
# additional databases for annotation visualization
DB_RVDB=/hps/nobackup2/production/metagenomics/mhoelzer/databases/rvdb/U-RVDBv17.0-prot.hmm
DB_VOGDB=/hps/nobackup2/production/metagenomics/mhoelzer/databases/vogdb/vogdb.hmm
DB_PVOGS=/hps/nobackup2/production/metagenomics/mhoelzer/databases/pvogs/pvogs.hmm
DB_VPS=/hps/nobackup2/production/metagenomics/mhoelzer/databases/vps/vps.hmm
```
### ViPhOG v1
* original database
* no additional bitscore cutoffs applied
* hits only filtered by evalue
hmmscan command:
```bash
hmmscan --cpu ${task.cpus} --noali -E "0.001" --domtblout \
${set_name}_${params.db}_hmmscan.tbl ${db}/${db}.hmm ${faa}
```
For v2 and v3 Guillermo calculated bitscore thresholds:
> I hmmerd all the models against the complete UniProtKB database (Feb 2019 release)
Then I went through all the results and identified all the models that I could assign to a single viral taxon, based on the hmmer scores obtained. Some of them reported hits for a single viral taxon (these are the ones that have ``nan`` on the second column), and others reported hits against multiple taxons, but the score range of the best hit could be separated from the ranges obtained for further taxa.
Example cutoffs:
```bash
#MODEL TAXA RANK SCORE1 SCORE2
ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7
ViPhOG17845 Gokushovirinae Subfamily 57.1 nan
ViPhOG12433 Siphoviridae Family 33.5 33.3
```
We have bitscore thresholds for 22014 of the 31150 ViPhOGs.
### ViPhOG v2
For all models with available cutoffs:
* add ``SCORE1`` as sequence-specific GA and TC cutoff
* add ``SCORE1`` minus 3 bits as domain-specific GA and TC cutoff
* add ``SCORE2`` as sequence- and domain-specific NC cutoff, leave out if ``nan``
* if no bitscore cutoff is available, added _dummy_ fields for GA and TC with ``0.0``
* for cases where ``SCORE1 - 3 < SCORE2`` (see ViPhOG12433) the threshold was not modified
* 716/31150 models where the diff between bit score 1 and 2 is smaller 3
Example v2:
```
NAME ViPhOG12699.faa
GA 266.5 263.5;
TC 266.5 263.5;
NC 257.7 257.7;
NAME ViPhOG17845.faa
GA 57.1 54.1;
TC 57.1 54.1;
NAME ViPhOG12433.faa
GA 33.5 33.5;
TC 33.5 33.5;
NC 33.3 33.3;
```
hmmscan command:
```bash
hmmscan --cpu ${task.cpus} --noali --cut_ga --domtblout \
${set_name}_${params.db}_hmmscan_cutga.tbl ${db}/${db}.hmm ${faa}
# in addition filter evalue because some models dont have any bitscore cutoff
awk '{if(\$1 ~ /^#/){print \$0}else{if(\$7<0.001){print \$0}}}' \
${set_name}_${params.db}_hmmscan_cutga.tbl > ${set_name}_${params.db}_hmmscan.tbl
```
The same command is used for v3.
### ViPhOG v3
v3 is less restrictive than v2 when ``SCORE2`` is ``nan``:
* if we only have one bit score (so the second field is ``nan``), we just reduce also the sequence-specific GA by 3 bits. E.g. for such scores:
```bash
ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7
ViPhOG17845 Gokushovirinae Subfamily 57.1 nan
```
```bash
# same as in v2
GA 266.5 263.5
TC 266.5 263.5
NC 257.7 257.7
```
```bash
# v3
GA 54.1 54.1
TC 54.1 54.1
# v2
GA 57.1 54.1;
TC 57.1 54.1;
```
## Test data sets
The pipeline was run with [v1, v2, v3] on these data sets:
* Kleiner _et al_. 2015, co-assembly, phage community study (20 MB)
* Neto _et al_. 2015, rndm selected one sample SRR3458562, communities mostly comprised of eukaryotic viruses (21 MB)
* Neto _et al_. 2015, co-assembly, communities mostly comprised of eukaryotic viruses 100 MB)
* Kallies _et al_. 2019, co-assembly, viruses enriched from groundwater (96 MB)
* Zheng _et al_. 2019, collection of 3k phages from NCBI (217 MB)
## Results
I only list hits with an assigend taxonomy here and some examples so we can look into it.
### Kleiner _et al._ 2015
Only taxonomies assigend for High confidence (HC) and putative prophage (PP) sets. No real difference for HC. PP might be reduced by one false positive prediction by using v2/v3.
#### v1
High confidence:
```bash
contig_ID genus subfamily family order
contig66 P22likevirus 0.029411764705882353 Podoviridae Caudovirales
contig68 T7likevirus Autographivirinae Podoviridae Caudovirales
contig69 T7likevirus Autographivirinae Podoviridae Caudovirales
contig97 0.25 0.25 Siphoviridae Caudovirales
contig124 Microvirus Microviridae
```
Prophage
```bash
contig_ID genus subfamily family order
contig1 0.07142857142857142 0.07142857142857142 0.35714285714285715 Caudovirales
contig32 0.4444444444444444 Peduovirinae Myoviridae Caudovirales
```
#### v2
High confidence:
```bash
contig_ID genus subfamily family order
contig66 P22likevirus 0.03225806451612903 Podoviridae Caudovirales
contig68 T7likevirus Autographivirinae Podoviridae Caudovirales
contig69 T7likevirus Autographivirinae Podoviridae Caudovirales
contig97 0.25 0.25 Siphoviridae Caudovirales
contig124 Microvirus Microviridae
```
Prophage
```bash
contig_ID genus subfamily family order
contig32 0.46153846153846156 Peduovirinae Myoviridae Caudovirales
```
#### v3
High confidence:
```bash
contig_ID genus subfamily family order
contig66 P22likevirus 0.03225806451612903 Podoviridae Caudovirales
contig68 T7likevirus Autographivirinae Podoviridae Caudovirales
contig69 T7likevirus Autographivirinae Podoviridae Caudovirales
contig97 0.25 0.25 Siphoviridae Caudovirales
contig124 Microvirus Microviridae
```
Prophage
```bash
contig_ID genus subfamily family order
contig32 0.46153846153846156 Peduovirinae Myoviridae Caudovirales
```
### Neto _et al_. 2015 (SRR3458562)
No differences between v1, v2, and v3. But also only few annotations in general for the single rndm selected read set SRR3458562.
### Neto _et al._ 2015 co-assembly
Finally, the co-assembly worked with the newest SPAdes v3.14. Hurray
#### v1
```bash
#HC
contig6 Viunalikevirus 0.2018348623853211 Myoviridae Caudovirales
contig64 Viunalikevirus 0.18421052631578946 Myoviridae Caudovirales
contig63 Viunalikevirus 0.18421052631578946 Myoviridae Caudovirales
contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales
contig155 P22likevirus Podoviridae Caudovirales
contig193 0.25 0.5 Caudovirales
contig334 P22likevirus Podoviridae Caudovirales
contig339 Microvirus Microviridae
contig539 P22likevirus Podoviridae Caudovirales
contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales
#LC
contig165 0.5 0.5 0.5 Caudovirales
contig226 0.5 Mimiviridae
contig1046 0.5 Mimiviridae
#Prophage
contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales
```
#### v2
```bash
#HC
contig6 0.5526315789473685 0.23684210526315788 Myoviridae Caudovirales
contig63 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales
contig64 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales
contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales
contig155 P22likevirus Podoviridae Caudovirales
contig334 P22likevirus Podoviridae Caudovirales
contig339 Microvirus Microviridae
contig539 P22likevirus Podoviridae Caudovirales
contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales
#LC
contig165 0.5 0.5 0.5 Caudovirales
contig226 Mimivirus Mimiviridae
contig1046 0.5 Mimiviridae
#Prophage
contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales
```
#### v3
```bash
#HC
contig6 0.5584415584415584 0.24675324675324675 Myoviridae Caudovirales
contig63 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales
contig64 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales
contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales
contig155 P22likevirus Podoviridae Caudovirales
contig334 P22likevirus Podoviridae Caudovirales
contig339 Microvirus Microviridae
contig539 P22likevirus Podoviridae Caudovirales
contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales
#LC
contig165 0.5 0.5 0.5 Caudovirales
contig226 Mimivirus Mimiviridae
contig1046 0.5 Mimiviridae
#Prophage
contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales
```
### Kallies _et al_ 2019
For this _real_ data set we actually see that v2 and v3 reduce the amount of taxonomic assignments. The question might be if it is good to remove hits because they are false positives. I will check this exemplarily, see below.
#### v1
```bash
# HC
contig_ID genus subfamily family order
contig105 0.5 Peduovirinae Myoviridae Caudovirales
contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales
contig1719 Skunalikevirus Siphoviridae Caudovirales
contig336 P22likevirus Podoviridae Caudovirales
contig675 0.3333333333333333 Siphoviridae Caudovirales
contig337 0.5 0.25 Podoviridae Caudovirales
contig1402 0.3333333333333333 0.3333333333333333 0.3333333333333333
contig2479 0.3333333333333333 0.3333333333333333 Caudovirales
contig1134 0.3333333333333333 0.3333333333333333 Caudovirales
contig868 Chlorovirus Phycodnaviridae
contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales
contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig1696 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333
contig576 0.25 0.25 0.25 0.25
contig2413 Gokushovirinae Microviridae
contig2203 Gokushovirinae Microviridae
contig5423 Gokushovirinae Microviridae
contig2418 0.5 Microviridae
contig1356 Chordopoxvirinae Poxviridae
contig1938 Phietalikevirus Siphoviridae Caudovirales
contig3967 Autographivirinae Podoviridae Caudovirales
contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig3333 0.25 Autographivirinae Podoviridae Caudovirales
contig15296 Skunalikevirus Siphoviridae Caudovirales
#LC
contig334 0.2 0.2 0.2 Caudovirales
contig1104 0.25 Spounavirinae Herelleviridae Caudovirales
contig633 0.2 0.2 0.2 Caudovirales
contig2199 0.3333333333333333 0.3333333333333333 Caudovirales
contig1699 0.3333333333333333 0.3333333333333333 0.3333333333333333
contig939 0.3333333333333333 Polyomaviridae 0.3333333333333333
contig3764 0.3333333333333333 0.3333333333333333 Caudovirales
contig1117 Papillomaviridae
contig3113 0.3333333333333333 0.3333333333333333 Caudovirales
contig2610 0.3333333333333333 Autographivirinae Podoviridae Caudovirales
#Prophage
contig612 0.2857142857142857 0.2857142857142857 Phycodnaviridae
contig306 0.3333333333333333 Siphoviridae Caudovirales
```
#### v2
```bash
#HC
contig_ID genus subfamily family order
contig105 0.5 Peduovirinae Myoviridae Caudovirales
contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales
contig1719 Skunalikevirus Siphoviridae Caudovirales
contig1134 0.3333333333333333 0.3333333333333333 Caudovirales
contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales
contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig576 0.25 0.25 0.25 0.25
contig2203 0.2 Gokushovirinae Microviridae
contig2413 Gokushovirinae Microviridae
contig5423 Gokushovirinae Microviridae
contig1356 Chordopoxvirinae Poxviridae
contig2418 0.25 0.5 Microviridae
contig3967 Autographivirinae Podoviridae Caudovirales
contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig15296 Skunalikevirus Siphoviridae Caudovirales
#LC
contig334 0.25 0.25 Caudovirales
contig633 0.25 0.25 0.25 Caudovirales
contig1104 Spounavirinae Herelleviridae Caudovirales
contig2199 Caudovirales
contig3113 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales
contig1117 Papillomaviridae
#Prophage
contig612 0.16666666666666666 0.3333333333333333 Phycodnaviridae
```
#### v3
```bash
#HC
contig_ID genus subfamily family order
contig105 0.5 Peduovirinae Myoviridae Caudovirales
contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales
contig1719 Skunalikevirus Siphoviridae Caudovirales
contig1134 0.3333333333333333 0.3333333333333333 Caudovirales
contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales
contig576 0.25 0.25 0.25 0.25
contig5423 Gokushovirinae Microviridae
contig2203 0.2 Gokushovirinae Microviridae
contig2413 Gokushovirinae Microviridae
contig2418 0.2 Gokushovirinae Microviridae
contig1356 Chordopoxvirinae Poxviridae
contig3967 Autographivirinae Podoviridae Caudovirales
contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales
contig15296 Skunalikevirus Siphoviridae Caudovirales
#LC
contig334 0.25 0.25 Caudovirales
contig633 0.25 0.25 0.25 Caudovirales
contig1104 Spounavirinae Herelleviridae Caudovirales
contig2199 Caudovirales
contig3113 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales
contig1117 Papillomaviridae
#Prophage
contig612 0.16666666666666666 0.3333333333333333 Phycodnaviridae
```
#### These contigs have now no taxonomic assignment in v2/v3 but had one in v1 ...
... can I find homology to viruses although the bitscore cutoffs removed them?
```bash
#HC
contig336 --> no virus blast hit
contig675 --> no virus blast hit
contig337 --> no virus blast hit
contig1402 --> short (4% of query) hit against Flavobacterium phage
contig2479 --> short (14% of query) hit Psychrobacter phage pOW20
contig868 --> short (1% of query) hit Enterococcus phage IME_EF3
contig1696 --> somewhat good (41% query cov) hit Arthrobacter phage Isolde
contig1938 --> Caudovirales unclassi, Staphylococcus phage... only 70% seq ident
contig3333 --> full length hit in bakki genome Acinetobacter johnsonii
#LC
contig1699 --> no virus blast hit
contig939 --> no virus blast hit
contig3764 --> no virus blast hit
contig2610 --> full length hit against bakki genome Brevundimonas sp. SGAir0440
#Prophage
contig306 --> partly Caudovirales unclassified
```
So most hits that were filtered out in v2/v3 look fine. There are some edge cases and especially:
contig1938 (Phietalikevirus | Siphoviridae | Caudovirales) seems to be somewhat good hit (but only 70% seq ident) but was filtered in v2/v3: __why?__
The models that are included in v1 but were filtered in v2/v3 are
```bash
ViPhOG1205 Slashvirus Genus 307.3 173.4
ViPhOG15973 Bastillevirinae Subfamily 395.4 281.6
ViPhOG23202 Pagevirus Genus 322.4 313.8
ViPhOG902 Picovirinae Subfamily 275.1 46.4
ViPhOG24518 Pagevirus Genus 199.3 nan
ViPhOG4714 Arquatrovirinae Subfamily 31.4 nan
```
so quite diverse hits and also the models that are still in v2/v3 for this contig are quite diverse. So IMHO it is fine that finally no taxonomy was assigned for this contig. It is some kind of virus (or part of) and that's why it is in the VirSorter HC set but based on the protein models we can not clearly assign a taxonomy.
As a positive control, tested also some that have a taxonomic assignment, also still in v2/v3
```bash
contig355 Peduovirinae | Myoviridae | Caudovirales
-->
hit against Alicycliphilus, a genus in the phylum Proteobacteria (Bacteria)
Likely that the phage was included in the assembly of Alicycliphilus
Also some hits against phages
contig1719 Skunalikevirus | Siphoviridae | Caudovirales
-->
Top hit against Lactococcus phage that belongs to Skunavirus. nice.
contig1134 Caudovirales
-->
here we could only state that this might be a Caudovirales member
Croceibacter phage P2559Y partial blast hit, belongs to Caudovirales. so its fine
```
### Zheng _et al._ 2019
This data set only consists of phage genomes from NCBI. Again, v2 and v3 reduce the amount of taxonomic assignments. The difference between v2 and v3 is marginal.
#### v1
* 1654 HC taxonomic assignments
* 607 LC
* 343 Prophage
#### v2
* __HC still running... TODO__
* 534 LC
* 293 Prophage
#### v3
* 1452 HC taxonomic assignments
* 535 LC
* 293 Prophage
# Misc testing Martin (the important stuff is above)
## Small test set
```bash
DIR=/homes/mhoelzer/data/calc/2020-02-04_hmmer_virify
FAA=${DIR}/PRJNA530103_high_confidence_prodigal.faa
mkdir -p ${DIR}/viphogs ${DIR}/viphogs_v2 ${DIR}/rvdb ${DIR}/vogdb
hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/viphogs/$(basename $FAA .faa)_hmmscan.tbl $DB_VIPHOGS ${FAA}
hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/rvdb/$(basename $FAA .faa)_hmmscan.tbl $DB_RVDB ${FAA}
hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/vogdb/$(basename $FAA .faa)_hmmscan.tbl $DB_VOGDB ${FAA}
hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/pvogs/$(basename $FAA .faa)_hmmscan.tbl $DB_PVOGS ${FAA}
hmmscan --cpu 8 --cut_ga --noali --domtblout $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl.no_e_cutoff $DB_VIPHOGS_V2 ${FAA}
# add e-value filter step because we have models without GA bitscore cutoff in the database
awk '{if($1 ~ /^#/){print $0}else{if($7<0.001){print $0}}}' $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl.no_e_cutoff > $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl
```
### Results
#### VIPHOG
```bash
cat $DIR/viphogs/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}'
```
```bash
ViPhOG12699.faa contig1 61.8 2 90
ViPhOG17845.faa contig2 56.6 1 87
ViPhOG12700.faa contig3 75.3 64 133
ViPhOG8852.faa contig4 532.7 16 470
```
Guillermos cutoffs for these models are (none for 12700):
```bash
ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7
ViPhOG17845 Gokushovirinae Subfamily 57.1 nan
ViPhOG8852 Gokushovirinae Subfamily 519.5 413.9
```
Thus, 12699 and 17845 should be discarded by the v2 of the database...
#### VIPHOG V2
... and that's what happen now:
```bash
cat $DIR/viphogs_v2/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}'
```
```bash
ViPhOG12700.faa contig3 75.3 64 133
ViPhOG8852.faa contig4 532.7 16 470
```
Based on the current implementation and cutoffs this is correct. 12700 is going through because the e-value is good enough although no bit score threshold exist. 8852 is going through because the bit score (532.7) is larger than the sequence-specific GA score (519.5).
#### RVDB
```bash
cat rvdb/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}'
```
```bash
FAM002364 contig1 70.1 2 90
FAM002363 contig4 613.9 1 472
FAM015148 contig4 299.4 201 437
FAM016066 contig4 289.6 19 471
```
#### VOGDB
```bash
cat vogdb/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}'
```
```bash
VOG00849 contig1 21.5 3 76
VOG10944 contig2 60.2 4 81
VOG10943 contig3 101.3 8 138
VOG01001 contig4 531.7 2 472
```
#### PVOGS
```bash
cat pvogs/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}'
```
```bash
VOG2233 contig1 63.5 2 90
VOG3382 contig2 49.2 4 81
VOG2235 contig3 110.2 4 148
VOG2234 contig4 567.4 2 472
VOG0176 contig4 98.0 19 413
```
## Issue: we have HMMs that will not have a bitscore threshold applied
* using the ``--cut_ga`` option we need to remove the ``-E "0.001"`` option
* but for models that dont have a _GA_ added we want to keep the E-value filter I guess
* so I think we have two options here:
* 1) add a switch that either executes ``hmmscan`` with the ``--cut_ga`` option if the _GA_ field is available or w/o the ``--cut_ga`` when it is not available
* but this would meen we have two databases: the models with cutoff and those without
* 2) for all models where we do not have a bitscore threshold just add _dummy GA/TC/NC_ fields with a 0.0 score
* so we can use the same database and command with the ``--cut_ga`` option
* drawback: we do not filter the models by e-value directly...
* ...but we can filter them with some ``awk`` command after the hmmscan step
* __I currently go for this__
## PRJNA530103_high_confidence_prodigal
Viral DNA metagenome from three pristine limestone aquifer assemblages of the Hainich Critical Zone Exploratory, co-assembled
Fast ``wc -l`` (so also includes duplicates and some header lines)
```
1763 viphogs/PRJNA530103_high_confidence_prodigal_hmmscan.tbl
1090 viphogs_v2/PRJNA530103_high_confidence_prodigal_hmmscan.tbl
1201 pvogs/PRJNA530103_high_confidence_prodigal_hmmscan.tbl
566 rvdb/PRJNA530103_high_confidence_prodigal_hmmscan.tbl
1765 vogdb/PRJNA530103_high_confidence_prodigal_hmmscan.tbl
```
### Which hits got lost from v1 to v2 of the ViPhOG db?
```bash
cd viphogs
awk '{if($1 ~ /^#/){}else{print $1":"$4}}' PRJNA530103_high_confidence_prodigal_hmmscan.tbl > PRJNA530103_high_confidence_prodigal_hmmscan.tbl.hits
cd viphogs_v2
awk '{if($1 ~ /^#/){}else{print $1":"$4}}' PRJNA530103_high_confidence_prodigal_hmmscan.tbl >
```

362 __ViPhOG:Protein__ combinations were discarded based on bit score thresholds in the v2.
[https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only.txt](https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only.txt)
These hits include 273 proteins that were discarded.
[https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only_contigs.txt](https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only_contigs.txt)
Do we still have hits against the contigs these discarded proteins belong to? So that them would still a taxonomy be assigned in v2?
__We have ViPhOGs that hitted proteins of 111 different contigs in v2.__
__We have ViPhOGs that hitted proteins of 116 different contigs in v1.__
By applying the additional bit score thresholds in v2 we get no ViPhOG hits against these 5 contigs:
* NODE_1181_length_7064
* NODE_12413_length_1409
* NODE_3070_length_3948
* NODE_7798_length_2061
* NODE_9764_length_1729
## Execute the pipeline with the v2 and new hmmscan call
I use now the models with GA set like discused with Rob. Might be to strict. Lets see. I also filter by evalue additionally so that models that just pass because they dont have a GA are also filtered at least by evalue.
```bash
cd /homes/mhoelzer/data/calc/2020-01-31_virify/virify/v2
nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kleiner_2015.fasta --workdir work_kleiner -profile ebi --output results/
# DONE
nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kallies_2019.fasta --workdir work_kallies -profile ebi --output results/
```
## v3 database
We might need a even more relaxed bit score cutoff database.
Suggestion: if we only have one bit score (so the second field is ``nan``), we just reduce also the sequence-specific GA by 3 bits. E.g. for such scores:
```bash
ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7
ViPhOG17845 Gokushovirinae Subfamily 57.1 nan
```
```bash
GA 266.5 263.5
TC 266.5 263.5
NC 257.7 257.7
```
```bash
GA 54.1 54.1
TC 54.1 54.1
```
What happens here? (716 models where the diff between bit score 1 and 2 is smaller 3)
ATM I just use this as it is without reducing bits.
```bash
ViPhOG13782 Rogunavirus Genus 122.6 122.2
```
```bash
cd /homes/mhoelzer/data/calc/2020-01-31_virify/virify/v3
nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kleiner_2015.fasta --workdir work_kleiner -profile ebi --output results/
```
## Running on YODA
The nextflow is running stable as it seems on YODA. The idea is now to test the 3 different databases on various sets:
* Kleiner et al. 2015, co-assembly
* Neto et al. 2015, rndm selected SRR3458562
* Neto et al. 2015, co-assembly
* Kallies et al. 2019, co-assembly
* Zheng et al. 2019, collection of phages from NCBI
* Gregory et al. 2019, selected sample
```bash
#module load singularity/3.5.0
session2
nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/renamed/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify
# started... and renamed files before to maybe avoid parsing problems with the current script
nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/v2 --version v2
nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/v3 --version v3
# ok this worked. Now we do another test and if this runs through we make a relaese of virify v0.1 to run all files for the manuscript
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf \
--fasta '/homes/mhoelzer/data/assemblies/*.fasta' \
--workdir $OUT/work \
--virsorter $DB/virsorter/virsorter-data \
--viphog $DB/vpHMM_database_v3 \
--rvdb $DB/rvdb \
--pvogs $DB/pvogs \
--vogdb $DB/vogdb \
--vpf $DB/vpf \
--ncbi $DB/ncbi/ete3_ncbi_tax.sqlite \
--imgvr $DB/imgvr \
--version v3 \
-profile yoda_conda
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf \
--fasta '/homes/mhoelzer/data/assemblies/*.fasta' \
--workdir $OUT/work \
--virsorter $DB/virsorter/virsorter-data \
--viphog $DB/vpHMM_database_v3 \
--rvdb $DB/rvdb \
--pvogs $DB/pvogs \
--vogdb $DB/vogdb \
--vpf $DB/vpf \
--ncbi $DB/ncbi/ete3_ncbi_tax.sqlite \
--imgvr $DB/imgvr \
--version v3 \
--virome \
-profile yoda_conda
## run the full neto sample w/o any preselection of viral contigs. Do we find more?
## switched to branch no-preselection
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_2015_all
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/neto_2015.fasta --workdir $OUT/work --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda
## okay, indeed we find more hits
## check if virfinder with other model works better
## switched to branch virfinder
## yes worked better for euk viruses
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_2015_virfinder_model
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/neto_2015.fasta --workdir $OUT/work --output $OUT --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda
## run the ocean sample, this is not exclusively virome so run w/o virome option
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/gregory_2019.fasta --workdir $OUT/work --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda
## run the hybrid assembly from Overholt paper, no-virome
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/overholt_2019.fasta --workdir $OUT/work --output $OUT/results/ --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda
# done!
# run the neto and kleiner genomes w/ virome option
OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_kleiner_genomes
DB=/hps/nobackup2/metagenomics/mhoelzer/databases
nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/Neto_Kleiner_genomes.fna --workdir $OUT/work --output $OUT --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 --virome -profile yoda_conda
```
## Minimap or blast our sourmash vs RVDB?
```bash
wget https://rvdb.dbi.udel.edu/download/C-RVDBv17.0.fasta.gz
## MINIMAP
minimap2 -ax asm5 -t 8 -o high_confidence_putative_viral_contigs.sam C-RVDBv17.0.fasta high_confidence_putative_viral_contigs.fna
# only one contig mapped??? contig2841 vs acc|GENBANK|MH057965.1|Homo
# this is some integrated HepB... strange hit
## SOURMASH
sourmash compute --scaled 100 -k 21 --singleton --seed 42 -p 8 -o C-RVDBv17.0.sig C-RVDBv17.0.fasta
sourmash index C-RVDBv17.0 C-RVDBv17.0.sig
tar czf C-RVDBv17.0.sbt.json.tar.gz C-RVDBv17.0.sbt.json .sbt.phages
sourmash compute --scaled 100 -k 21 --singleton --seed 42 -p 8 -o high_confidence_putative_viral_contigs.sig high_confidence_putative_viral_contigs.fna
sourmash search -k 21 high_confidence_putative_viral_contigs.sig C-RVDBv17.0.sbt.json -o high_confidence_putative_viral_contigs.sig.temporary
value=\$(grep -v "similarity,name,filename,md5" \${tempfile} | awk '\$1>=0.5'|wc -l) # filtering criteria
## BLAST
makeblastdb -in C-RVDBv17.0.fasta -dbtype nucl
blastn -task megablast -num_threads 8 -query high_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blast
cat high_confidence_putative_viral_contigs.blast
contig168 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 98.361 122 2 0 9902 10023 18665 1 122 4.62e-52 215 1337
contig1355 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 94.201 845 16 32 5432 6273 6514 523 1337 0.0 1258 1337
contig1355 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 95.676 370 9 7 4518 4882 6514 116 483 8.22e-165 588 1337
contig2841 acc|GENBANK|MH057965.1|Homo 97.431 506 10 3 1379 1882 4157 613 109 0.0 859 627
blastn -task blastn -num_threads 8 -query high_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blastn
awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn
# no hits even with lower aln length
blastn -task megablast -num_threads 4 -query low_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blast
awk '{if($4>0.8*$9){print $0}}' low_confidence_putative_viral_contigs.blast
# ok I get some hits for the LC set but mostly unclassified viruses in some metagenomes and hits against EVEs in human, Pan, Gorilla, ... using RVDB. so not really helpful to extend out taxonomic assignments
blastn -task megablast -num_threads 4 -query low_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blastn
awk '{if($4>0.8*$9){print $0}}' low_confidence_putative_viral_contigs.blastn
## does the IMG/VR work better as a reference???
https://img.jgi.doe.gov/vr/
DB=IMG_VR/IMG_VR_2018-07-01_4/IMGVR_all_nucleotides.fna
blastn -task megablast -num_threads 4 -query high_confidence_putative_viral_contigs.fna -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blast
awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn
# super partial hits, have to go down to 0.1! e.g. then i find
contig2203 acc|REFSEQ|NC_027642.1|Gokushovirinae 70.423 781 210 12 877 1648 4790 811 1579 4.14e-88 333 4624
contig2413 acc|REFSEQ|NC_028993.1|Gokushovirinae 66.790 813 232 13 1174 1970 4520 823 1613 1.46e-55 224 4468
And this we already know from VIRify!
blastn -task blastn -num_threads 8 -query low_confidence_putative_viral_contigs.fna -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blast
awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn
cat high_confidence_putative_viral_contigs.fna low_confidence_putative_viral_contigs.fna putative_prophages.fna > HC_LC_PP.fna
ALL=HC_LC_PP.fna
blastn -task blastn -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.blast
awk '{if($4>0.8*$9){print $0}}' $ALL.blast
blastn -task megablast -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.megablast
awk '{if($4>0.8*$9){print $0}}' $ALL.megablast
```