# freebayes notes, 2020-10-28 https://galaxyproject.org/tutorials/var_hap/freebayes.pdf https://github.com/ekg/freebayes ## Some parameter thoughts from the repo * general interesting parameters after screening the repo: * Use an input VCF (bgzipped + tabix indexed) to force calls at particular alleles --> I think we do not need this atm * GVCF output allows us to have coverage information about non-called sites, and we can enable it with --gcvcf * For performance reasons we may want to skip regions of extremely high coverage in the reference using the --skip-limit parameter -g, e.g. -g 1000 * default command assumes diploid! --ploidy to change * -C 5 require 5 supporting observations to consider a variant * __Filter__ the output e.g. using reported QUAL and/or depth (DP) or observation count (AO). * Of primary interest to most users is the QUAL field, which estimates the probability that there is a polymorphism at the loci described by the record. In freebayes, this value can be understood as 1 - P(locus is homozygous given the data). It is recommended that users use this value to filter their results, rather than accepting anything output by freebayes as ground truth, e.g.: * ```freebayes -f ref.fa aln.bam | vcffilter -f "QUAL > 20" >results.vcf``` * removes any sites with estimated probability of not being polymorphic less than phred 20 (aka 0.01), or probability of polymorphism > 0.99. ## Literature screen I checked in particular for studies that analyzed SARS-CoV-2 with freebayes and in general used the tool in viral/amplicon context. However, of course, there are also other studies that used e.g. LoFreq for low-frequency variant calling in e.g. SARS-CoV-2 (Spanakis, Nikolaos, et al. "Dominant and rare SARS-Cov2 variants responsible for the COVID-19 pandemic in Athens, Greece." medRxiv (2020).) > Xiao, Minfeng, et al. "Multiple approaches for massively parallel sequencing of HCoV-19 genomes directly from clinical samples." BMC Genome Medicine (2020). --> "multiplex PCR amplicon (amplicon)-based", they used freebayes "Variants were first called with freebayes (v1.3.1) [34] (__parameters: -p 1 -q 20 -m 60 --min-coverage 10 -V__), and the low-confidence variants were removed with snippy-vcf_filter (v3.2) [35] (parameters: --minqual 100 --mincov 10 --minfrac 0.8). The remaining variants in VCF files generated by freebayes were annotated in SARS-CoV-2 genome assemblies and consensus sequences with SNVeff (v4.3) [36] using default parameters. Next, pysamstats v1.1.2 (https://github.com/alimanfoo/pysamstats) (parameters: -type variation_strand --min-baseq 20 -D 1000000) was used to count the number of matches, mismatches, deletions, and insertions at each base to determine the allele frequencies. Variant calls with allele frequencies ≥ 80% were identified as SNVs." * -p --ploidy 1 * -q --min-base-quality 20 * -m --min-mapping-quality 60 * --min-coverage 10 * Require at least this coverage to process a site. default: 0 * -V --binomial-obs-priors-off * Disable incorporation of prior expectations about observations. Uses read placement probability, strand balance probability, and read position (5'-3') probability. > Pfefferle, Susanne, et al. "Low and high infection dose transmission of SARS-CoV-2 in the first COVID-19 clusters in Northern Germany." medRxiv (2020). --> also used freebayes "Variants were called using freebayes Bayesian haplotype caller v1.3.1 (21) with __ploidy and haplotype independent detection__ parameters to generate frequency-based calls for all variants passing input thresholds (__-K -F 0.01__). Input thresholds were set to 10 variant supporting reads with a minimum base quality of 30 (__-C 10 -q 30__). Only high confidence variants present in > 33% of reads within at least one individual sample were included * -K --pooled-continuous * Output all alleles which pass input filters, regardless of genotyping outcome or model. * -F --min-alternate-fraction N * Require at least this fraction of observations supporting an alternate allele within a single individual in the in order to evaluate the position. default: 0.05 * -C --min-alternate-count N * Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position. default: 2 ## Calling variants in non-diploid systems https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/non-dip/tutorial.html#calling-non-diploid-variants * "in non-diploid systems, allele frequencies can range anywhere between 0 and 100% and there could be multiple (not just two) alleles per locus. The main challenge associated with non-diploid variant calling is the difficulty in distinguishing between the sequencing noise (abundant in all NGS platforms) and true low frequency variants." * not focusing on amplicon, but Illumina data * "FreeBayes is widely used for calling variants in diploid systems. However, it can also be used for calling variants in pooled samples where the number of samples is not known. This is the exact scenario we have here: in our sample we have multiple mitochondrial (or bacterial or viral) genomes, but we do not know exactly how many. Thus we will use the --pooled-continuous option of FreeBayes to generate frequency-based variant calls as well as some other options highlighted below." * Population model options * Set ploidy for the analysis”: 1, --ploidy * Assume that samples result from pooled sequencing”: Yes, -J --pooled-discrete * “Output all alleles which pass input filters, regardless of genotyping outcome or model”: Yes * I think this is the same as -K or--pooled-continuous as used as well by Pfefferle et al 2020 * it seems that they use -J (assume that samples result from pooled sequencing) and -K (output all alleles which pass filters) here, but they also state that "The “Population model options” are one of the most important parameter choices to make when calling variants in non-diploid systems." * further parameters: * -m --min-mapping-quality, Exclude alignments from analysis if they have a mapping quality less than”: 20 * Xiao et al. 2020 are quite restrictive here with 60, Pfefferle et al 2020 did not report this param * -q --min-base-quality Exclude alleles from analysis if their supporting base quality less than”: 30 * Pfefferle et al 2020: 30, Xiao et al 2020: 20 * __finally, they also have a nice part about "Filtering Variants" with VCFfilter__ > Kinoti, Wycliff M., et al. "Analysis of intra-host genetic diversity of Prunus necrotic ringspot virus (PNRSV) using amplicon next generation sequencing." PLoS One 12.6 (2017): e0179284. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5478126/ --> RNA virus, amplicon data, used freebayes w/ default parameters > Mohammed, Khadija Said, et al. "Evaluating the performance of tools used to call minority variants from whole genome short-read data." Wellcome open research 3 (2018). http://wrap.warwick.ac.uk/109460/1/WRAP-evaluating-performance-tools-used-call-minority-variants-whole-genome-short-read-data-Nokes-2018.pdf --> quantification of low frequency variants "Our study suggests that combining at least three tools when identifying minority variants is useful" "FreeBayes identified the majority of variants although it was characterised by variable sensitivity and precision in addition to a high false positive rate relative to the other minority variant callers and which varied with sample coverage. LoFreq was the most conservative caller." ## Some general and likely future thoughts * do we mark duplicates? * likely not bc/ "But it is not recommended in PCR-based amplicon sequencing applications where distinct DNA fragments can share the same genome coordinates." * what about paired-end reads that overlap? * merge them with e.g. flash ? or PEAR like here for SARS-CoV-2 amplicon (https://www.medrxiv.org/content/10.1101/2020.06.11.20127332v1.full.pdf)? * why not de novo assembly? mapping against ref --> SPAdes --> PILON