---
tags: hannah, virus, phage, 2023
---
phage affix 2023
===
## Tuning sgcs parameters to pull out most complete phage genomes
Question: Are my spacegraphcats search parameters pulling out complete (or as complete as possible) phage genomes?
### Planned steps:
- pull out phage genome from query metagenome/virome using diff params
- k=31, 21
- r=1, 2, 3, 5
- later:
- protein search with reference proteome (k10, r=??)?
- Assess recovered neighborhood:
- k-mer containment (ref genome vs recovered nbhd)
- Map to reference?
- Assemble + CheckV completeness
Does completeness / k-mer containment increase when we use less stringent /more inclusive search params (e.g. k21 vs k31, radius 5 vs radius 1)?
- start with metagenomes, try with viromes too
- start with k=21 atlases
-----
### Executed steps
#### Make the atlases 🔴
Took one virome & one metagenome, queried one phage reference genome
- virome: SM_CTTKX
- metagenome: SRR5962882
- reference viral genome: uvig_409809
- k sizes: 21, 15, 11
> note: I tried k=9, but threw an error `oops bcalm called with minSize(10)> kmerSize(9)` in the `SRR5962882_k9 bcalm.log.txt`
```
srun -p bmm -J sgc4 -t 4:59:00 --mem=70gb -c 2 --pty bash
```
#### plan to map the nbhds
From [Soil pH influences the structure of virus communities at local and global scales](https://paperpile.com/app/p/75e43283-73fb-0a47-a53b-0c403800589f), (in the supplementary materials, _Virus prediction and community profiling_)they say
> To determine the relative abundance of vOTUs, BBMap (Bushnell et al., 2014) was used to align reads from
> **each metagenome and virome**
> **back to indexed vOTUs**
> with the detection threshold of ≥75% of the contig length, covered ≥1x by reads recruited at ≥90% average nucleotide identity (Roux et al., 2019).
#### map the nbhds 🔴
made conda env `bbmap`
[BBmap docs](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/)
> "BBMap must index a reference before mapping to it"
To index and map at the same time:
`bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa` ->
```
#mvx, 11
# mvx, 15
bbmap.sh \
in=../make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \
out=mapped/SM_CTTKX_k15_r1_uvig_409809_mapped.sam \
ref=../../data/ref_genomes/uvig_409809.fa > out.txt
mv ref/ ref_mvx_k15_r1_SM_CTTKX/
```
```
# mvx, 21
bbmap.sh \
in=../make_atlases/mvx_k21_r1_SM_CTTKX/SM_CTTKX_k21_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \
out=mapped/SM_CTTKX_k21_r1_uvig_409809_mapped.sam \
ref=../../data/ref_genomes/uvig_409809.fa
mv ref/ ref_mvx_k21_r1_SM_CTTKX/
```
```
# mgx, 21
bbmap.sh \
in= ../make_atlases/mgx_k15_r1_SRR5962882/SRR5962882_k21_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \
out=mapped/SM_CTTKX_k21_r1_uvig_409809_mapped.sam \
ref=../../data/ref_genomes/uvig_409809.fa
mv ref/ ref_mvx_k21_r1_SM_CTTKX/
```
:::danger
go back and fix mgx k15, to be not k21
:::
### what to do with sam files: https://www.metagenomics.wiki/tools/samtools/number-of-reads-in-bam-file , this can be used in terminal
### what doe mapping mean
how good is the viral nbhd as a pile of reads?
mapping is to give a better visual pic of what is mapping, as opposed to sourmash which will be more stringent bc of exact matching
compare to sourmash containment
### Main Question: how do we convince ourselves that a nbhd is pulling out a viral genome?
- does it contain something else,
- how complete is it?
- complete genome? check v
- Does it contain things that look like bacteria? sourmash gather, do a bunch of reads go to something else?
map
### next step -
- assemble the nbhd and run check v
more notes: minimap2 - map long reads or contigs to assemblies
- using reads or contigs, use the appropriate input parameters
new thing: miniprot - takes protein fasta from reference, index and map
prodigal helps make better than 6 frame translating,
---
this week:
protein
### check v notes:
[tutorial](https://bitbucket.org/berkeleylab/checkv/src/master/)
made a `checkv` conda env,
ran `checkv download_database ./` in `/group/ctbrowngrp/virus-references`
ran `export CHECKVDB=/group/ctbrowngrp/virus-references/checkv-db-v1.5`
> there is an option to update the database using own complete genomes:
> `checkv update_database /path/to/checkv-db /path/to/updated-checkv-db genomes.fna`
> I didnt do this though
syntax: `checkv end_to_end input_file.fna output_directory -t 16`
`checkv end_to_end ../make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.reads.gz ./ -t 16` (didnt work lol)
but it seems like you have to run `export CHECKVDB=/group/ctbrowngrp/virus-references/checkv-db-v1.5` everytime
```
checkv contamination make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.reads.gz checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5
checkv completeness input_file.fna output_directory -t 16
checkv complete_genomes input_file.fna output_directory
checkv quality_summary input_file.fna output_directory
```
checkv completeness input_file.fna output_directory -t 16
```
checkv completeness ../data/ref_genomes/uvig_409809.fa checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5
```
```
checkv end_to_end ../data/ref_genomes/uvig_409809.fa checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5
```
-----
# old notes dump:
ISMB - open source computational tracks
- write an abstract for this thing
- MICROBIOME
Build a fake data set of phage to see if method works
..
why use sourmash for phage? out performs kracken and bracken- assign signle reads
but sourmash looks at whole genomes
..
Big goal: work towards a thesis:
actual goals:
classify and get phage abundance - care more about specific genomes
-----
---
how do you know you have a phage genome, and not a piece of a phage genome in your metagenome?
you can figure out how much of the phage genome is present by mapping
phage ref -> phage neighborhood -> map back to the reference genome a number of the reads will map lol. then I'll just get a percent mapping or something
to do: what is a good phage aligner: bwa, bowtie2.
What parameters do I need to pull out a phage genome? is there an optimal set of spacegraph cats perameters?
protein vs dna: dna,k=21,r=1 do you get a genome?
if I search with a protein k=10, r=1
you can get a bigger neighborhood by changing parameters, how do you know this is still accurate?
In addition to mapping, you cna run check v
OK I like
bowtie: read mapping (assumes that reference database has basically everything)
binning: (I do care if things dont map) then you assemble?
reads -> contigs,
contigs -> assign to organisms (this is binning)
you can use
blast..:? map contigs
checkv: compares to complete viral genomes - gives a completeness value
k21 radius of 5: if it gives better completeness we could care about hosts
so
joannes papers: they assemble bin map/assign/make new genomes all the time
use the tools they're using
maybe ask annie or jess. jess tried to do sgcs anyways
average abund, f match (like... 10? ). try some small ones too
probab
make list of programs Im going to run