# VIRify database hmmer comparison ## Databases ```bash # different versions of ViPhOGs DB_VIPHOGS=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/vpHMM_database DB_VIPHOGS_V2=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/v2/vpHMM_database_v2 DB_VIPHOGS_V3=/hps/nobackup2/production/metagenomics/garp/Data/hmmFiles/v3/vpHMM_database_v3 # additional databases for annotation visualization DB_RVDB=/hps/nobackup2/production/metagenomics/mhoelzer/databases/rvdb/U-RVDBv17.0-prot.hmm DB_VOGDB=/hps/nobackup2/production/metagenomics/mhoelzer/databases/vogdb/vogdb.hmm DB_PVOGS=/hps/nobackup2/production/metagenomics/mhoelzer/databases/pvogs/pvogs.hmm DB_VPS=/hps/nobackup2/production/metagenomics/mhoelzer/databases/vps/vps.hmm ``` ### ViPhOG v1 * original database * no additional bitscore cutoffs applied * hits only filtered by evalue hmmscan command: ```bash hmmscan --cpu ${task.cpus} --noali -E "0.001" --domtblout \ ${set_name}_${params.db}_hmmscan.tbl ${db}/${db}.hmm ${faa} ``` For v2 and v3 Guillermo calculated bitscore thresholds: > I hmmerd all the models against the complete UniProtKB database (Feb 2019 release) Then I went through all the results and identified all the models that I could assign to a single viral taxon, based on the hmmer scores obtained. Some of them reported hits for a single viral taxon (these are the ones that have ``nan`` on the second column), and others reported hits against multiple taxons, but the score range of the best hit could be separated from the ranges obtained for further taxa. Example cutoffs: ```bash #MODEL TAXA RANK SCORE1 SCORE2 ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7 ViPhOG17845 Gokushovirinae Subfamily 57.1 nan ViPhOG12433 Siphoviridae Family 33.5 33.3 ``` We have bitscore thresholds for 22014 of the 31150 ViPhOGs. ### ViPhOG v2 For all models with available cutoffs: * add ``SCORE1`` as sequence-specific GA and TC cutoff * add ``SCORE1`` minus 3 bits as domain-specific GA and TC cutoff * add ``SCORE2`` as sequence- and domain-specific NC cutoff, leave out if ``nan`` * if no bitscore cutoff is available, added _dummy_ fields for GA and TC with ``0.0`` * for cases where ``SCORE1 - 3 < SCORE2`` (see ViPhOG12433) the threshold was not modified * 716/31150 models where the diff between bit score 1 and 2 is smaller 3 Example v2: ``` NAME ViPhOG12699.faa GA 266.5 263.5; TC 266.5 263.5; NC 257.7 257.7; NAME ViPhOG17845.faa GA 57.1 54.1; TC 57.1 54.1; NAME ViPhOG12433.faa GA 33.5 33.5; TC 33.5 33.5; NC 33.3 33.3; ``` hmmscan command: ```bash hmmscan --cpu ${task.cpus} --noali --cut_ga --domtblout \ ${set_name}_${params.db}_hmmscan_cutga.tbl ${db}/${db}.hmm ${faa} # in addition filter evalue because some models dont have any bitscore cutoff awk '{if(\$1 ~ /^#/){print \$0}else{if(\$7<0.001){print \$0}}}' \ ${set_name}_${params.db}_hmmscan_cutga.tbl > ${set_name}_${params.db}_hmmscan.tbl ``` The same command is used for v3. ### ViPhOG v3 v3 is less restrictive than v2 when ``SCORE2`` is ``nan``: * if we only have one bit score (so the second field is ``nan``), we just reduce also the sequence-specific GA by 3 bits. E.g. for such scores: ```bash ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7 ViPhOG17845 Gokushovirinae Subfamily 57.1 nan ``` ```bash # same as in v2 GA 266.5 263.5 TC 266.5 263.5 NC 257.7 257.7 ``` ```bash # v3 GA 54.1 54.1 TC 54.1 54.1 # v2 GA 57.1 54.1; TC 57.1 54.1; ``` ## Test data sets The pipeline was run with [v1, v2, v3] on these data sets: * Kleiner _et al_. 2015, co-assembly, phage community study (20 MB) * Neto _et al_. 2015, rndm selected one sample SRR3458562, communities mostly comprised of eukaryotic viruses (21 MB) * Neto _et al_. 2015, co-assembly, communities mostly comprised of eukaryotic viruses 100 MB) * Kallies _et al_. 2019, co-assembly, viruses enriched from groundwater (96 MB) * Zheng _et al_. 2019, collection of 3k phages from NCBI (217 MB) ## Results I only list hits with an assigend taxonomy here and some examples so we can look into it. ### Kleiner _et al._ 2015 Only taxonomies assigend for High confidence (HC) and putative prophage (PP) sets. No real difference for HC. PP might be reduced by one false positive prediction by using v2/v3. #### v1 High confidence: ```bash contig_ID genus subfamily family order contig66 P22likevirus 0.029411764705882353 Podoviridae Caudovirales contig68 T7likevirus Autographivirinae Podoviridae Caudovirales contig69 T7likevirus Autographivirinae Podoviridae Caudovirales contig97 0.25 0.25 Siphoviridae Caudovirales contig124 Microvirus Microviridae ``` Prophage ```bash contig_ID genus subfamily family order contig1 0.07142857142857142 0.07142857142857142 0.35714285714285715 Caudovirales contig32 0.4444444444444444 Peduovirinae Myoviridae Caudovirales ``` #### v2 High confidence: ```bash contig_ID genus subfamily family order contig66 P22likevirus 0.03225806451612903 Podoviridae Caudovirales contig68 T7likevirus Autographivirinae Podoviridae Caudovirales contig69 T7likevirus Autographivirinae Podoviridae Caudovirales contig97 0.25 0.25 Siphoviridae Caudovirales contig124 Microvirus Microviridae ``` Prophage ```bash contig_ID genus subfamily family order contig32 0.46153846153846156 Peduovirinae Myoviridae Caudovirales ``` #### v3 High confidence: ```bash contig_ID genus subfamily family order contig66 P22likevirus 0.03225806451612903 Podoviridae Caudovirales contig68 T7likevirus Autographivirinae Podoviridae Caudovirales contig69 T7likevirus Autographivirinae Podoviridae Caudovirales contig97 0.25 0.25 Siphoviridae Caudovirales contig124 Microvirus Microviridae ``` Prophage ```bash contig_ID genus subfamily family order contig32 0.46153846153846156 Peduovirinae Myoviridae Caudovirales ``` ### Neto _et al_. 2015 (SRR3458562) No differences between v1, v2, and v3. But also only few annotations in general for the single rndm selected read set SRR3458562. ### Neto _et al._ 2015 co-assembly Finally, the co-assembly worked with the newest SPAdes v3.14. Hurray #### v1 ```bash #HC contig6 Viunalikevirus 0.2018348623853211 Myoviridae Caudovirales contig64 Viunalikevirus 0.18421052631578946 Myoviridae Caudovirales contig63 Viunalikevirus 0.18421052631578946 Myoviridae Caudovirales contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales contig155 P22likevirus Podoviridae Caudovirales contig193 0.25 0.5 Caudovirales contig334 P22likevirus Podoviridae Caudovirales contig339 Microvirus Microviridae contig539 P22likevirus Podoviridae Caudovirales contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales #LC contig165 0.5 0.5 0.5 Caudovirales contig226 0.5 Mimiviridae contig1046 0.5 Mimiviridae #Prophage contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales ``` #### v2 ```bash #HC contig6 0.5526315789473685 0.23684210526315788 Myoviridae Caudovirales contig63 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales contig64 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales contig155 P22likevirus Podoviridae Caudovirales contig334 P22likevirus Podoviridae Caudovirales contig339 Microvirus Microviridae contig539 P22likevirus Podoviridae Caudovirales contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales #LC contig165 0.5 0.5 0.5 Caudovirales contig226 Mimivirus Mimiviridae contig1046 0.5 Mimiviridae #Prophage contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales ``` #### v3 ```bash #HC contig6 0.5584415584415584 0.24675324675324675 Myoviridae Caudovirales contig63 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales contig64 Viunalikevirus 0.25925925925925924 Myoviridae Caudovirales contig128 0.5555555555555556 0.5555555555555556 Myoviridae Caudovirales contig155 P22likevirus Podoviridae Caudovirales contig334 P22likevirus Podoviridae Caudovirales contig339 Microvirus Microviridae contig539 P22likevirus Podoviridae Caudovirales contig1539 P2likevirus Peduovirinae Myoviridae Caudovirales #LC contig165 0.5 0.5 0.5 Caudovirales contig226 Mimivirus Mimiviridae contig1046 0.5 Mimiviridae #Prophage contig165 P22likevirus 0.3333333333333333 Podoviridae Caudovirales ``` ### Kallies _et al_ 2019 For this _real_ data set we actually see that v2 and v3 reduce the amount of taxonomic assignments. The question might be if it is good to remove hits because they are false positives. I will check this exemplarily, see below. #### v1 ```bash # HC contig_ID genus subfamily family order contig105 0.5 Peduovirinae Myoviridae Caudovirales contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales contig1719 Skunalikevirus Siphoviridae Caudovirales contig336 P22likevirus Podoviridae Caudovirales contig675 0.3333333333333333 Siphoviridae Caudovirales contig337 0.5 0.25 Podoviridae Caudovirales contig1402 0.3333333333333333 0.3333333333333333 0.3333333333333333 contig2479 0.3333333333333333 0.3333333333333333 Caudovirales contig1134 0.3333333333333333 0.3333333333333333 Caudovirales contig868 Chlorovirus Phycodnaviridae contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig1696 0.3333333333333333 0.3333333333333333 0.3333333333333333 0.3333333333333333 contig576 0.25 0.25 0.25 0.25 contig2413 Gokushovirinae Microviridae contig2203 Gokushovirinae Microviridae contig5423 Gokushovirinae Microviridae contig2418 0.5 Microviridae contig1356 Chordopoxvirinae Poxviridae contig1938 Phietalikevirus Siphoviridae Caudovirales contig3967 Autographivirinae Podoviridae Caudovirales contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig3333 0.25 Autographivirinae Podoviridae Caudovirales contig15296 Skunalikevirus Siphoviridae Caudovirales #LC contig334 0.2 0.2 0.2 Caudovirales contig1104 0.25 Spounavirinae Herelleviridae Caudovirales contig633 0.2 0.2 0.2 Caudovirales contig2199 0.3333333333333333 0.3333333333333333 Caudovirales contig1699 0.3333333333333333 0.3333333333333333 0.3333333333333333 contig939 0.3333333333333333 Polyomaviridae 0.3333333333333333 contig3764 0.3333333333333333 0.3333333333333333 Caudovirales contig1117 Papillomaviridae contig3113 0.3333333333333333 0.3333333333333333 Caudovirales contig2610 0.3333333333333333 Autographivirinae Podoviridae Caudovirales #Prophage contig612 0.2857142857142857 0.2857142857142857 Phycodnaviridae contig306 0.3333333333333333 Siphoviridae Caudovirales ``` #### v2 ```bash #HC contig_ID genus subfamily family order contig105 0.5 Peduovirinae Myoviridae Caudovirales contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales contig1719 Skunalikevirus Siphoviridae Caudovirales contig1134 0.3333333333333333 0.3333333333333333 Caudovirales contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig576 0.25 0.25 0.25 0.25 contig2203 0.2 Gokushovirinae Microviridae contig2413 Gokushovirinae Microviridae contig5423 Gokushovirinae Microviridae contig1356 Chordopoxvirinae Poxviridae contig2418 0.25 0.5 Microviridae contig3967 Autographivirinae Podoviridae Caudovirales contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig15296 Skunalikevirus Siphoviridae Caudovirales #LC contig334 0.25 0.25 Caudovirales contig633 0.25 0.25 0.25 Caudovirales contig1104 Spounavirinae Herelleviridae Caudovirales contig2199 Caudovirales contig3113 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales contig1117 Papillomaviridae #Prophage contig612 0.16666666666666666 0.3333333333333333 Phycodnaviridae ``` #### v3 ```bash #HC contig_ID genus subfamily family order contig105 0.5 Peduovirinae Myoviridae Caudovirales contig355 0.35714285714285715 Peduovirinae Myoviridae Caudovirales contig1719 Skunalikevirus Siphoviridae Caudovirales contig1134 0.3333333333333333 0.3333333333333333 Caudovirales contig2535 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig985 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales contig576 0.25 0.25 0.25 0.25 contig5423 Gokushovirinae Microviridae contig2203 0.2 Gokushovirinae Microviridae contig2413 Gokushovirinae Microviridae contig2418 0.2 Gokushovirinae Microviridae contig1356 Chordopoxvirinae Poxviridae contig3967 Autographivirinae Podoviridae Caudovirales contig5639 0.3333333333333333 0.3333333333333333 Myoviridae Caudovirales contig15296 Skunalikevirus Siphoviridae Caudovirales #LC contig334 0.25 0.25 Caudovirales contig633 0.25 0.25 0.25 Caudovirales contig1104 Spounavirinae Herelleviridae Caudovirales contig2199 Caudovirales contig3113 0.3333333333333333 0.3333333333333333 0.3333333333333333 Caudovirales contig1117 Papillomaviridae #Prophage contig612 0.16666666666666666 0.3333333333333333 Phycodnaviridae ``` #### These contigs have now no taxonomic assignment in v2/v3 but had one in v1 ... ... can I find homology to viruses although the bitscore cutoffs removed them? ```bash #HC contig336 --> no virus blast hit contig675 --> no virus blast hit contig337 --> no virus blast hit contig1402 --> short (4% of query) hit against Flavobacterium phage contig2479 --> short (14% of query) hit Psychrobacter phage pOW20 contig868 --> short (1% of query) hit Enterococcus phage IME_EF3 contig1696 --> somewhat good (41% query cov) hit Arthrobacter phage Isolde contig1938 --> Caudovirales unclassi, Staphylococcus phage... only 70% seq ident contig3333 --> full length hit in bakki genome Acinetobacter johnsonii #LC contig1699 --> no virus blast hit contig939 --> no virus blast hit contig3764 --> no virus blast hit contig2610 --> full length hit against bakki genome Brevundimonas sp. SGAir0440 #Prophage contig306 --> partly Caudovirales unclassified ``` So most hits that were filtered out in v2/v3 look fine. There are some edge cases and especially: contig1938 (Phietalikevirus | Siphoviridae | Caudovirales) seems to be somewhat good hit (but only 70% seq ident) but was filtered in v2/v3: __why?__ The models that are included in v1 but were filtered in v2/v3 are ```bash ViPhOG1205 Slashvirus Genus 307.3 173.4 ViPhOG15973 Bastillevirinae Subfamily 395.4 281.6 ViPhOG23202 Pagevirus Genus 322.4 313.8 ViPhOG902 Picovirinae Subfamily 275.1 46.4 ViPhOG24518 Pagevirus Genus 199.3 nan ViPhOG4714 Arquatrovirinae Subfamily 31.4 nan ``` so quite diverse hits and also the models that are still in v2/v3 for this contig are quite diverse. So IMHO it is fine that finally no taxonomy was assigned for this contig. It is some kind of virus (or part of) and that's why it is in the VirSorter HC set but based on the protein models we can not clearly assign a taxonomy. As a positive control, tested also some that have a taxonomic assignment, also still in v2/v3 ```bash contig355 Peduovirinae | Myoviridae | Caudovirales --> hit against Alicycliphilus, a genus in the phylum Proteobacteria (Bacteria) Likely that the phage was included in the assembly of Alicycliphilus Also some hits against phages contig1719 Skunalikevirus | Siphoviridae | Caudovirales --> Top hit against Lactococcus phage that belongs to Skunavirus. nice. contig1134 Caudovirales --> here we could only state that this might be a Caudovirales member Croceibacter phage P2559Y partial blast hit, belongs to Caudovirales. so its fine ``` ### Zheng _et al._ 2019 This data set only consists of phage genomes from NCBI. Again, v2 and v3 reduce the amount of taxonomic assignments. The difference between v2 and v3 is marginal. #### v1 * 1654 HC taxonomic assignments * 607 LC * 343 Prophage #### v2 * __HC still running... TODO__ * 534 LC * 293 Prophage #### v3 * 1452 HC taxonomic assignments * 535 LC * 293 Prophage # Misc testing Martin (the important stuff is above) ## Small test set ```bash DIR=/homes/mhoelzer/data/calc/2020-02-04_hmmer_virify FAA=${DIR}/PRJNA530103_high_confidence_prodigal.faa mkdir -p ${DIR}/viphogs ${DIR}/viphogs_v2 ${DIR}/rvdb ${DIR}/vogdb hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/viphogs/$(basename $FAA .faa)_hmmscan.tbl $DB_VIPHOGS ${FAA} hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/rvdb/$(basename $FAA .faa)_hmmscan.tbl $DB_RVDB ${FAA} hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/vogdb/$(basename $FAA .faa)_hmmscan.tbl $DB_VOGDB ${FAA} hmmscan --cpu 8 --noali -E "0.001" --domtblout $DIR/pvogs/$(basename $FAA .faa)_hmmscan.tbl $DB_PVOGS ${FAA} hmmscan --cpu 8 --cut_ga --noali --domtblout $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl.no_e_cutoff $DB_VIPHOGS_V2 ${FAA} # add e-value filter step because we have models without GA bitscore cutoff in the database awk '{if($1 ~ /^#/){print $0}else{if($7<0.001){print $0}}}' $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl.no_e_cutoff > $DIR/viphogs_v2/$(basename $FAA .faa)_hmmscan.tbl ``` ### Results #### VIPHOG ```bash cat $DIR/viphogs/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}' ``` ```bash ViPhOG12699.faa contig1 61.8 2 90 ViPhOG17845.faa contig2 56.6 1 87 ViPhOG12700.faa contig3 75.3 64 133 ViPhOG8852.faa contig4 532.7 16 470 ``` Guillermos cutoffs for these models are (none for 12700): ```bash ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7 ViPhOG17845 Gokushovirinae Subfamily 57.1 nan ViPhOG8852 Gokushovirinae Subfamily 519.5 413.9 ``` Thus, 12699 and 17845 should be discarded by the v2 of the database... #### VIPHOG V2 ... and that's what happen now: ```bash cat $DIR/viphogs_v2/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}' ``` ```bash ViPhOG12700.faa contig3 75.3 64 133 ViPhOG8852.faa contig4 532.7 16 470 ``` Based on the current implementation and cutoffs this is correct. 12700 is going through because the e-value is good enough although no bit score threshold exist. 8852 is going through because the bit score (532.7) is larger than the sequence-specific GA score (519.5). #### RVDB ```bash cat rvdb/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}' ``` ```bash FAM002364 contig1 70.1 2 90 FAM002363 contig4 613.9 1 472 FAM015148 contig4 299.4 201 437 FAM016066 contig4 289.6 19 471 ``` #### VOGDB ```bash cat vogdb/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}' ``` ```bash VOG00849 contig1 21.5 3 76 VOG10944 contig2 60.2 4 81 VOG10943 contig3 101.3 8 138 VOG01001 contig4 531.7 2 472 ``` #### PVOGS ```bash cat pvogs/small_hmmscan.tbl | grep contig | awk '{print $1"\t"$4"\t"$8"\t"$18"\t"$19}' ``` ```bash VOG2233 contig1 63.5 2 90 VOG3382 contig2 49.2 4 81 VOG2235 contig3 110.2 4 148 VOG2234 contig4 567.4 2 472 VOG0176 contig4 98.0 19 413 ``` ## Issue: we have HMMs that will not have a bitscore threshold applied * using the ``--cut_ga`` option we need to remove the ``-E "0.001"`` option * but for models that dont have a _GA_ added we want to keep the E-value filter I guess * so I think we have two options here: * 1) add a switch that either executes ``hmmscan`` with the ``--cut_ga`` option if the _GA_ field is available or w/o the ``--cut_ga`` when it is not available * but this would meen we have two databases: the models with cutoff and those without * 2) for all models where we do not have a bitscore threshold just add _dummy GA/TC/NC_ fields with a 0.0 score * so we can use the same database and command with the ``--cut_ga`` option * drawback: we do not filter the models by e-value directly... * ...but we can filter them with some ``awk`` command after the hmmscan step * __I currently go for this__ ## PRJNA530103_high_confidence_prodigal Viral DNA metagenome from three pristine limestone aquifer assemblages of the Hainich Critical Zone Exploratory, co-assembled Fast ``wc -l`` (so also includes duplicates and some header lines) ``` 1763 viphogs/PRJNA530103_high_confidence_prodigal_hmmscan.tbl 1090 viphogs_v2/PRJNA530103_high_confidence_prodigal_hmmscan.tbl 1201 pvogs/PRJNA530103_high_confidence_prodigal_hmmscan.tbl 566 rvdb/PRJNA530103_high_confidence_prodigal_hmmscan.tbl 1765 vogdb/PRJNA530103_high_confidence_prodigal_hmmscan.tbl ``` ### Which hits got lost from v1 to v2 of the ViPhOG db? ```bash cd viphogs awk '{if($1 ~ /^#/){}else{print $1":"$4}}' PRJNA530103_high_confidence_prodigal_hmmscan.tbl > PRJNA530103_high_confidence_prodigal_hmmscan.tbl.hits cd viphogs_v2 awk '{if($1 ~ /^#/){}else{print $1":"$4}}' PRJNA530103_high_confidence_prodigal_hmmscan.tbl > ``` ![](https://i.imgur.com/oGxhMRf.png) 362 __ViPhOG:Protein__ combinations were discarded based on bit score thresholds in the v2. [https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only.txt](https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only.txt) These hits include 273 proteins that were discarded. [https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only_contigs.txt](https://www.rna.uni-jena.de/supplements/martin/viphogs_v1_only_contigs.txt) Do we still have hits against the contigs these discarded proteins belong to? So that them would still a taxonomy be assigned in v2? __We have ViPhOGs that hitted proteins of 111 different contigs in v2.__ __We have ViPhOGs that hitted proteins of 116 different contigs in v1.__ By applying the additional bit score thresholds in v2 we get no ViPhOG hits against these 5 contigs: * NODE_1181_length_7064 * NODE_12413_length_1409 * NODE_3070_length_3948 * NODE_7798_length_2061 * NODE_9764_length_1729 ## Execute the pipeline with the v2 and new hmmscan call I use now the models with GA set like discused with Rob. Might be to strict. Lets see. I also filter by evalue additionally so that models that just pass because they dont have a GA are also filtered at least by evalue. ```bash cd /homes/mhoelzer/data/calc/2020-01-31_virify/virify/v2 nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kleiner_2015.fasta --workdir work_kleiner -profile ebi --output results/ # DONE nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kallies_2019.fasta --workdir work_kallies -profile ebi --output results/ ``` ## v3 database We might need a even more relaxed bit score cutoff database. Suggestion: if we only have one bit score (so the second field is ``nan``), we just reduce also the sequence-specific GA by 3 bits. E.g. for such scores: ```bash ViPhOG12699 Gokushovirinae Subfamily 266.5 257.7 ViPhOG17845 Gokushovirinae Subfamily 57.1 nan ``` ```bash GA 266.5 263.5 TC 266.5 263.5 NC 257.7 257.7 ``` ```bash GA 54.1 54.1 TC 54.1 54.1 ``` What happens here? (716 models where the diff between bit score 1 and 2 is smaller 3) ATM I just use this as it is without reducing bits. ```bash ViPhOG13782 Rogunavirus Genus 122.6 122.2 ``` ```bash cd /homes/mhoelzer/data/calc/2020-01-31_virify/virify/v3 nextflow run /homes/mhoelzer/backuped/git/virify/virify.nf --fasta ../../assemblies/kleiner_2015.fasta --workdir work_kleiner -profile ebi --output results/ ``` ## Running on YODA The nextflow is running stable as it seems on YODA. The idea is now to test the 3 different databases on various sets: * Kleiner et al. 2015, co-assembly * Neto et al. 2015, rndm selected SRR3458562 * Neto et al. 2015, co-assembly * Kallies et al. 2019, co-assembly * Zheng et al. 2019, collection of phages from NCBI * Gregory et al. 2019, selected sample ```bash #module load singularity/3.5.0 session2 nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/renamed/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify # started... and renamed files before to maybe avoid parsing problems with the current script nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/v2 --version v2 nextflow run virify.nf --fasta '/homes/mhoelzer/data/assemblies/*.fasta' -profile yoda_conda --output /hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/v3 --version v3 # ok this worked. Now we do another test and if this runs through we make a relaese of virify v0.1 to run all files for the manuscript OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf \ --fasta '/homes/mhoelzer/data/assemblies/*.fasta' \ --workdir $OUT/work \ --virsorter $DB/virsorter/virsorter-data \ --viphog $DB/vpHMM_database_v3 \ --rvdb $DB/rvdb \ --pvogs $DB/pvogs \ --vogdb $DB/vogdb \ --vpf $DB/vpf \ --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite \ --imgvr $DB/imgvr \ --version v3 \ -profile yoda_conda OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf \ --fasta '/homes/mhoelzer/data/assemblies/*.fasta' \ --workdir $OUT/work \ --virsorter $DB/virsorter/virsorter-data \ --viphog $DB/vpHMM_database_v3 \ --rvdb $DB/rvdb \ --pvogs $DB/pvogs \ --vogdb $DB/vogdb \ --vpf $DB/vpf \ --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite \ --imgvr $DB/imgvr \ --version v3 \ --virome \ -profile yoda_conda ## run the full neto sample w/o any preselection of viral contigs. Do we find more? ## switched to branch no-preselection OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_2015_all DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/neto_2015.fasta --workdir $OUT/work --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda ## okay, indeed we find more hits ## check if virfinder with other model works better ## switched to branch virfinder ## yes worked better for euk viruses OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_2015_virfinder_model DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/neto_2015.fasta --workdir $OUT/work --output $OUT --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda ## run the ocean sample, this is not exclusively virome so run w/o virome option OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/gregory_2019.fasta --workdir $OUT/work --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda ## run the hybrid assembly from Overholt paper, no-virome OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/no_virome DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/overholt_2019.fasta --workdir $OUT/work --output $OUT/results/ --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 -profile yoda_conda # done! # run the neto and kleiner genomes w/ virome option OUT=/hps/nobackup2/metagenomics/mhoelzer/nextflow-results/virify/2020-02-24_test_if_ready_for_release/virome/results/neto_kleiner_genomes DB=/hps/nobackup2/metagenomics/mhoelzer/databases nextflow run /homes/mhoelzer/git/virify/virify.nf --fasta /homes/mhoelzer/data/assemblies/Neto_Kleiner_genomes.fna --workdir $OUT/work --output $OUT --virsorter $DB/virsorter/virsorter-data --viphog $DB/vpHMM_database_v3 --rvdb $DB/rvdb --pvogs $DB/pvogs --vogdb $DB/vogdb --vpf $DB/vpf --ncbi $DB/ncbi/ete3_ncbi_tax.sqlite --imgvr $DB/imgvr --version v3 --virome -profile yoda_conda ``` ## Minimap or blast our sourmash vs RVDB? ```bash wget https://rvdb.dbi.udel.edu/download/C-RVDBv17.0.fasta.gz ## MINIMAP minimap2 -ax asm5 -t 8 -o high_confidence_putative_viral_contigs.sam C-RVDBv17.0.fasta high_confidence_putative_viral_contigs.fna # only one contig mapped??? contig2841 vs acc|GENBANK|MH057965.1|Homo # this is some integrated HepB... strange hit ## SOURMASH sourmash compute --scaled 100 -k 21 --singleton --seed 42 -p 8 -o C-RVDBv17.0.sig C-RVDBv17.0.fasta sourmash index C-RVDBv17.0 C-RVDBv17.0.sig tar czf C-RVDBv17.0.sbt.json.tar.gz C-RVDBv17.0.sbt.json .sbt.phages sourmash compute --scaled 100 -k 21 --singleton --seed 42 -p 8 -o high_confidence_putative_viral_contigs.sig high_confidence_putative_viral_contigs.fna sourmash search -k 21 high_confidence_putative_viral_contigs.sig C-RVDBv17.0.sbt.json -o high_confidence_putative_viral_contigs.sig.temporary value=\$(grep -v "similarity,name,filename,md5" \${tempfile} | awk '\$1>=0.5'|wc -l) # filtering criteria ## BLAST makeblastdb -in C-RVDBv17.0.fasta -dbtype nucl blastn -task megablast -num_threads 8 -query high_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blast cat high_confidence_putative_viral_contigs.blast contig168 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 98.361 122 2 0 9902 10023 18665 1 122 4.62e-52 215 1337 contig1355 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 94.201 845 16 32 5432 6273 6514 523 1337 0.0 1258 1337 contig1355 acc|GENBANK|AF059603.1|UNVERIFIED_ORG: 95.676 370 9 7 4518 4882 6514 116 483 8.22e-165 588 1337 contig2841 acc|GENBANK|MH057965.1|Homo 97.431 506 10 3 1379 1882 4157 613 109 0.0 859 627 blastn -task blastn -num_threads 8 -query high_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blastn awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn # no hits even with lower aln length blastn -task megablast -num_threads 4 -query low_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blast awk '{if($4>0.8*$9){print $0}}' low_confidence_putative_viral_contigs.blast # ok I get some hits for the LC set but mostly unclassified viruses in some metagenomes and hits against EVEs in human, Pan, Gorilla, ... using RVDB. so not really helpful to extend out taxonomic assignments blastn -task megablast -num_threads 4 -query low_confidence_putative_viral_contigs.fna -db C-RVDBv17.0.fasta -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blastn awk '{if($4>0.8*$9){print $0}}' low_confidence_putative_viral_contigs.blastn ## does the IMG/VR work better as a reference??? https://img.jgi.doe.gov/vr/ DB=IMG_VR/IMG_VR_2018-07-01_4/IMGVR_all_nucleotides.fna blastn -task megablast -num_threads 4 -query high_confidence_putative_viral_contigs.fna -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > high_confidence_putative_viral_contigs.blast awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn # super partial hits, have to go down to 0.1! e.g. then i find contig2203 acc|REFSEQ|NC_027642.1|Gokushovirinae 70.423 781 210 12 877 1648 4790 811 1579 4.14e-88 333 4624 contig2413 acc|REFSEQ|NC_028993.1|Gokushovirinae 66.790 813 232 13 1174 1970 4520 823 1613 1.46e-55 224 4468 And this we already know from VIRify! blastn -task blastn -num_threads 8 -query low_confidence_putative_viral_contigs.fna -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > low_confidence_putative_viral_contigs.blast awk '{if($4>0.8*$9){print $0}}' high_confidence_putative_viral_contigs.blastn cat high_confidence_putative_viral_contigs.fna low_confidence_putative_viral_contigs.fna putative_prophages.fna > HC_LC_PP.fna ALL=HC_LC_PP.fna blastn -task blastn -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.blast awk '{if($4>0.8*$9){print $0}}' $ALL.blast blastn -task megablast -num_threads 8 -query $ALL -db $DB -evalue 1e-10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" > $ALL.megablast awk '{if($4>0.8*$9){print $0}}' $ALL.megablast ```