phage affix 2023

--- tags: hannah, virus, phage, 2023 --- phage affix 2023 === ## Tuning sgcs parameters to pull out most complete phage genomes Question: Are my spacegraphcats search parameters pulling out complete (or as complete as possible) phage genomes? ### Planned steps: - pull out phage genome from query metagenome/virome using diff params - k=31, 21 - r=1, 2, 3, 5 - later: - protein search with reference proteome (k10, r=??)? - Assess recovered neighborhood: - k-mer containment (ref genome vs recovered nbhd) - Map to reference? - Assemble + CheckV completeness Does completeness / k-mer containment increase when we use less stringent /more inclusive search params (e.g. k21 vs k31, radius 5 vs radius 1)? - start with metagenomes, try with viromes too - start with k=21 atlases ----- ### Executed steps #### Make the atlases 🔴 Took one virome & one metagenome, queried one phage reference genome - virome: SM_CTTKX - metagenome: SRR5962882 - reference viral genome: uvig_409809 - k sizes: 21, 15, 11 > note: I tried k=9, but threw an error `oops bcalm called with minSize(10)> kmerSize(9)` in the `SRR5962882_k9 bcalm.log.txt` ``` srun -p bmm -J sgc4 -t 4:59:00 --mem=70gb -c 2 --pty bash ``` #### plan to map the nbhds From [Soil pH influences the structure of virus communities at local and global scales](https://paperpile.com/app/p/75e43283-73fb-0a47-a53b-0c403800589f), (in the supplementary materials, _Virus prediction and community profiling_)they say > To determine the relative abundance of vOTUs, BBMap (Bushnell et al., 2014) was used to align reads from > **each metagenome and virome** > **back to indexed vOTUs** > with the detection threshold of ≥75% of the contig length, covered ≥1x by reads recruited at ≥90% average nucleotide identity (Roux et al., 2019). #### map the nbhds 🔴 made conda env `bbmap` [BBmap docs](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/) > "BBMap must index a reference before mapping to it" To index and map at the same time: `bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa` -> ``` #mvx, 11 # mvx, 15 bbmap.sh \ in=../make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \ out=mapped/SM_CTTKX_k15_r1_uvig_409809_mapped.sam \ ref=../../data/ref_genomes/uvig_409809.fa > out.txt mv ref/ ref_mvx_k15_r1_SM_CTTKX/ ``` ``` # mvx, 21 bbmap.sh \ in=../make_atlases/mvx_k21_r1_SM_CTTKX/SM_CTTKX_k21_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \ out=mapped/SM_CTTKX_k21_r1_uvig_409809_mapped.sam \ ref=../../data/ref_genomes/uvig_409809.fa mv ref/ ref_mvx_k21_r1_SM_CTTKX/ ``` ``` # mgx, 21 bbmap.sh \ in= ../make_atlases/mgx_k15_r1_SRR5962882/SRR5962882_k21_r1_search_oh0/uvig_409809.fa.cdbg_ids.contigs.fa.gz \ out=mapped/SM_CTTKX_k21_r1_uvig_409809_mapped.sam \ ref=../../data/ref_genomes/uvig_409809.fa mv ref/ ref_mvx_k21_r1_SM_CTTKX/ ``` :::danger go back and fix mgx k15, to be not k21 ::: ### what to do with sam files: https://www.metagenomics.wiki/tools/samtools/number-of-reads-in-bam-file , this can be used in terminal ### what doe mapping mean how good is the viral nbhd as a pile of reads? mapping is to give a better visual pic of what is mapping, as opposed to sourmash which will be more stringent bc of exact matching compare to sourmash containment ### Main Question: how do we convince ourselves that a nbhd is pulling out a viral genome? - does it contain something else, - how complete is it? - complete genome? check v - Does it contain things that look like bacteria? sourmash gather, do a bunch of reads go to something else? map ### next step - - assemble the nbhd and run check v more notes: minimap2 - map long reads or contigs to assemblies - using reads or contigs, use the appropriate input parameters new thing: miniprot - takes protein fasta from reference, index and map prodigal helps make better than 6 frame translating, --- this week: protein ### check v notes: [tutorial](https://bitbucket.org/berkeleylab/checkv/src/master/) made a `checkv` conda env, ran `checkv download_database ./` in `/group/ctbrowngrp/virus-references` ran `export CHECKVDB=/group/ctbrowngrp/virus-references/checkv-db-v1.5` > there is an option to update the database using own complete genomes: > `checkv update_database /path/to/checkv-db /path/to/updated-checkv-db genomes.fna` > I didnt do this though syntax: `checkv end_to_end input_file.fna output_directory -t 16` `checkv end_to_end ../make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.reads.gz ./ -t 16` (didnt work lol) but it seems like you have to run `export CHECKVDB=/group/ctbrowngrp/virus-references/checkv-db-v1.5` everytime ``` checkv contamination make_atlases/mvx_k15_r1_SM_CTTKX/SM_CTTKX_k15_r1_search_oh0/uvig_409809.fa.cdbg_ids.reads.gz checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5 checkv completeness input_file.fna output_directory -t 16 checkv complete_genomes input_file.fna output_directory checkv quality_summary input_file.fna output_directory ``` checkv completeness input_file.fna output_directory -t 16 ``` checkv completeness ../data/ref_genomes/uvig_409809.fa checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5 ``` ``` checkv end_to_end ../data/ref_genomes/uvig_409809.fa checkv_nbhds/ -t 16 -d /group/ctbrowngrp/virus-references/checkv-db-v1.5 ``` ----- # old notes dump: ISMB - open source computational tracks - write an abstract for this thing - MICROBIOME Build a fake data set of phage to see if method works .. why use sourmash for phage? out performs kracken and bracken- assign signle reads but sourmash looks at whole genomes .. Big goal: work towards a thesis: actual goals: classify and get phage abundance - care more about specific genomes ----- --- how do you know you have a phage genome, and not a piece of a phage genome in your metagenome? you can figure out how much of the phage genome is present by mapping phage ref -> phage neighborhood -> map back to the reference genome a number of the reads will map lol. then I'll just get a percent mapping or something to do: what is a good phage aligner: bwa, bowtie2. What parameters do I need to pull out a phage genome? is there an optimal set of spacegraph cats perameters? protein vs dna: dna,k=21,r=1 do you get a genome? if I search with a protein k=10, r=1 you can get a bigger neighborhood by changing parameters, how do you know this is still accurate? In addition to mapping, you cna run check v OK I like bowtie: read mapping (assumes that reference database has basically everything) binning: (I do care if things dont map) then you assemble? reads -> contigs, contigs -> assign to organisms (this is binning) you can use blast..:? map contigs checkv: compares to complete viral genomes - gives a completeness value k21 radius of 5: if it gives better completeness we could care about hosts so joannes papers: they assemble bin map/assign/make new genomes all the time use the tools they're using maybe ask annie or jess. jess tried to do sgcs anyways average abund, f match (like... 10? ). try some small ones too probab make list of programs Im going to run