Sergei Pond

@g2dxQFPcSWKAyJOgEWdt3A

Joined on Jan 20, 2021

  • Viruses are as old as life itself, infecting everything from bacteria to humans. Since the very beginning, viruses have left an indelible imprint on the human genome, human history, and medical research. SARS-CoV-2, the viral pathogen responsible for the COVID-19 pandemic, is just the latest in a long line of highly impactful human pathogens, including smallpox, Influenza A virus (IAV), and HIV-1. By some estimates, close to 30% of human proteins are involved in combating viral infections. Viral pathogens, especially RNA viruses such as HIV-1, IAV, and SARS-CoV-2, mutate very quickly, with some viruses generating every possible single-nucleotide mutation in each new viral generation. Immune responses, antiviral drugs, competition between viral strains, and transmission between species and hosts create some of the strongest evolutionary forces reported in evolutionary biology. Viruses are also some of the most sequenced organisms in existence (e.g., ~15M SARS-CoV-2 genomes were sequenced in 2020-2022). The mission of the Center for Viral Evolution at Temple University is to: (1) create computational and statistical learning approaches for the analysis of genomic data from rapidly evolving viral pathogens; (2) develop scalable software tools that process large volumes of viral sequence data and deliver actionable and interpretable results; (3) apply these tools and techniques to learn how past viral evolution informs viruses' present ability to adapt to our responses, and to predict the likely future paths viruses may take.
  • BW Indexed in: https://hackmd.io/@hannahkimincompbio/Sk9T_TIBY Writeup leader: Steven Weaver Google doc draft 1: https://docs.google.com/document/d/1ERRQVBIyBt_98uRQ7f4EvkgT1L2xJ9pNT-pmHBd1MzA/edit?usp=sharing Google doc draft 2: https://docs.google.com/document/d/1rnGZZZrcIzI6YtZlFgXri3j-mknFHXHonn_kK8h8U_g/edit?usp=sharing Project board: https://github.com/users/stevenweaver/projects/2/views/1 Authors: Sergei Pond, Steven Weaver, Jordan D Zehr, Alexander Lucaci, Hannah Kim, Avery Selberg Institutions: iGEM Potential delivery date: November 19th, 2021 (earliest) Abstract Write last
  • Your task is to write a Python script that implements a simple Neighbor Joining algorithm for phylogeny inference based on the Jukes-Cantor distance and runs reasonably fast for ~20 sequences. Do not use "prepackaged" routines from BioPython, other than to read sequence files. Input: a FASTA multiple sequence alignment. Output: the inferred phylogenetic tree with branch lengths (you can choose the format, but Newick format is standard). For example: $ python3 NJ.py --msa test.fas ... (((human:0.01, chimp:0.02):0.03, gorilla:0.03, orangutan:0.03):0.01, gibbon:0.05)
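The task above can be approached along the following lines. This is only a minimal sketch, not a reference solution: the function names (`read_fasta`, `jc_distance`, `neighbor_joining`, `nj_from_fasta`) are hypothetical, sites with gaps or ambiguous characters are simply skipped, negative branch lengths are clamped to 0, and the output is an unrooted Newick tree with a trifurcation at the base. All of these are assumptions made for illustration.

```python
import math
from itertools import combinations


def read_fasta(path):
    """Parse a FASTA alignment into {name: sequence} (no external libraries)."""
    seqs, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = ""
            elif name is not None:
                seqs[name] += line.upper()
    return seqs


def jc_distance(a, b):
    """Jukes-Cantor distance; only sites where both bases are A/C/G/T are used."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def neighbor_joining(names, dist):
    """names: list of unique labels; dist: {frozenset({a, b}): distance}.

    Returns an unrooted tree as a Newick string.
    """
    if len(names) < 3:
        raise ValueError("need at least three sequences")
    d = dict(dist)
    nodes = list(names)

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # Row sums, then pick the pair minimising the NJ criterion Q(i, j).
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        li = 0.5 * get(i, j) + (r[i] - r[j]) / (2 * (n - 2))
        lj = get(i, j) - li
        # The new internal node is labelled by its own Newick subtree.
        u = f"({i}:{max(li, 0.0):.5f},{j}:{max(lj, 0.0):.5f})"
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (get(i, k) + get(j, k) - get(i, j))
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Three nodes remain: resolve them as a star with exact branch lengths.
    a, b, c = nodes
    la = 0.5 * (get(a, b) + get(a, c) - get(b, c))
    lb = 0.5 * (get(a, b) + get(b, c) - get(a, c))
    lc = 0.5 * (get(a, c) + get(b, c) - get(a, b))
    return (f"({a}:{max(la, 0.0):.5f},{b}:{max(lb, 0.0):.5f},"
            f"{c}:{max(lc, 0.0):.5f});")


def nj_from_fasta(path):
    """Read an alignment, compute pairwise JC distances, and build the NJ tree."""
    seqs = read_fasta(path)
    names = list(seqs)
    dist = {frozenset((a, b)): jc_distance(seqs[a], seqs[b])
            for a, b in combinations(names, 2)}
    return neighbor_joining(names, dist)
```

Calling `print(nj_from_fasta("test.fas"))` would print the tree for the alignment in `test.fas`; wrapping this in an `argparse` interface that accepts `--msa` is left out to keep the sketch short.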
  • From Drabeck et al. "Tall" dataset. This dataset has unusual dimensions: it is short (30 codons) and has relatively many (199) sequences. This creates some statistical issues that could be impactful. Many branch lengths are 0 (for example, HyPhy collapses 130 internal tree branches because their lengths are 0). This is not biologically realistic. If possible, it would be better to estimate branch lengths from a longer gene alignment, even if not all species are present. The precision with which non-zero branch lengths are estimated is degraded (and the estimates are likely biased). This could create downstream issues with all methods. Methods which draw power from sequence length (e.g., aBSREL, BUSTED, and also PAML) are going to suffer power loss. Basic data exploration.
  • What you need: A labeled tree (something like this). Parameter settings: we are going to basically ignore everything except the ω distributions here, namely the foreground distribution (3 bins) and the background distribution (3 bins). Suggested distributions are below. Simulated data: for this, we can use https://github.com/veg/hyphy-analyses/tree/master/SimulateMG94 An example call (notice how the distribution is specified as comma-separated ω,weight pairs; the one here is ω = 0.1 (weight 0.5), ω = 0.5 (weight 0.3), ω = 1 (weight 0.2)) would be: hyphy ~/Development/hyphy-analyses/SimulateMG94/SimulateMG94.bf --model BS-REL --tree marine.txt --output sims/BSREL --replicates 100 --branch-variation bs-rel --omegas 0.1,0.5,0.5,0.3,1,0.2 --omegas-Foreground 0.1,0.5,0.5,0.3,1,0.2
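When preparing several such calls, it may help to build the distribution string programmatically. The sketch below assumes only what the example call shows, namely that the value is a flat comma-separated list of ω,weight pairs; the helper names `omegas_arg` and `mean_omega` are hypothetical and not part of HyPhy.

```python
import math


def omegas_arg(bins):
    """Format [(omega, weight), ...] as 'omega1,weight1,omega2,weight2,...'.

    The interleaved layout is inferred from the example call only.
    """
    total = sum(weight for _, weight in bins)
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1")
    return ",".join(f"{omega:g},{weight:g}" for omega, weight in bins)


def mean_omega(bins):
    """Weighted mean of the discrete omega distribution."""
    return sum(omega * weight for omega, weight in bins)


# The 3-bin distribution from the example call.
example_bins = [(0.1, 0.5), (0.5, 0.3), (1, 0.2)]
example_arg = omegas_arg(example_bins)
```

Here `example_arg` reproduces the string passed to `--omegas` in the call above, and `mean_omega(example_bins)` gives a quick summary of how strongly constrained the simulated sequences are on average.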
  • We have an existing grant developing tools for viral sequence analysis using the Galaxy (usegalaxy.org) platform. We developed and validated workflows for calling intra-host variation (https://www.nature.com/articles/s41587-021-01069-1). We also have work from the previous grant cycle on reliable haplotype phasing, detection of artificial recombination, and error correction. This will form the basis of the computational core of the proposal. Aim to submit late Jan 2022. Central hypothesis. In a sufficiently large fraction of HIV+ individuals who are not fully suppressed and become infected with SARS-CoV-2, the evolution of SARS-CoV-2 is measurably different from that in other (HIV- and HIV+ but controlled) hosts. This will manifest as a longer duration of SARS-CoV-2 infection (marginal in some, extensive in others), affording the virus more time for intra-host evolution and the development of multiple mutations. This has clear implications for the development of immune escape and other variants in SARS-CoV-2, and it requires renewed focus on HIV management and focused surveillance. Aim 1: Establish and sample a study cohort of HIV+ individuals with SARS-CoV-2 infections. Need to know the following (sample size TBD, what is practical?)
  • Consider a phylogenetic tree (P) with branches partitioned into two sets: those with a phenotype/trait of interest (T or test) and those without (B or background); see Fig XX.A for an example. To study the evolution of a coding sequence along P, we extend the Branch-Site Unrestricted Statistical Test for Episodic Diversification (BUSTED) codon evolutionary framework [ref] to allow independent selective regimes on T and B. A selective regime, ST or SB, is described by a distribution of ω substitution rate ratios with K discrete bins (ω1 ≤ ω2 ≤ ... ≤ 1 ≤ ωK; Fig XX.B). For a given site and a branch in T (B), ω is modeled as a random draw from ST (SB), independent of other branches, and the site phylogenetic likelihood is computed as the expectation over all possible draws. The K ω values and K-1 weight parameters of each distribution are inferred from the data using maximum likelihood. TODO: add a sentence summarizing why BUSTED is a good model. To determine whether or not diversifying positive selection is associated with the phenotype/trait for a given gene, we: (1) Estimate the ω distributions for T and B branches from the data, using maximum likelihood. (2) Test for episodic diversifying selection (ωK > 1) on branches with the phenotype (T), using a likelihood ratio test (LRT), as described in the BUSTED paper. (3) Test for episodic diversifying selection (ωK > 1) on branches without the phenotype (B). (4) Test whether or not the ω distributions are different between T and B branches. This is done by fitting the null model where the entire tree has a shared ω distribution, and performing a likelihood ratio test (2K-1 degrees of freedom) versus the model from step (1).
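The final comparison, shared versus separate ω distributions, reduces to a standard likelihood ratio test once the two log-likelihoods are available. The sketch below is an illustration under stated assumptions: it uses a plain χ² reference distribution with 2K-1 degrees of freedom, as stated in the text; the function names are hypothetical; and the χ² tail probability is computed with the standard library so that no external package is required. It does not cover the tests for ωK > 1, which are described in the BUSTED paper.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square distribution with integer df.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1),
    starting from Q(1, y) = exp(-y) (even df) or Q(1/2, y) = erfc(sqrt(y)) (odd df).
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def compare_omega_distributions(lnl_shared, lnl_separate, k_bins):
    """LRT of the shared-distribution null against separate T and B distributions.

    Returns (test statistic, degrees of freedom, p-value).
    """
    df = 2 * k_bins - 1  # K omegas plus K - 1 weights freed in the alternative
    stat = max(0.0, 2.0 * (lnl_separate - lnl_shared))
    return stat, df, chi2_sf(stat, df)
```

For K = 3 the test has 5 degrees of freedom, so a log-likelihood improvement of about 5.54 units (a statistic of about 11.07) corresponds to p ≈ 0.05.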
  • https://github.com/spond/SARS-CoV-2-variation/blob/master/variation/variant-annotation.csv Headers: Coord,Ref,Codon,GlobalF,First,Last,Clade0.75,PS,NS,LastSel,Pred,Haplo Coord: 0-based genomic coordinate of the codon start. Ref: reference codon at this position. Codon: alternate codon at this position. GlobalF: frequency of this codon in global (GISAID) data. First: date first sampled
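A file with these headers can be read with Python's standard `csv` module. The sketch below interprets only the columns described above (`Coord`, `Ref`, `Codon`, `GlobalF`) and leaves the rest untouched; the function name `read_variants` is hypothetical, and the single data row is a made-up placeholder used purely to exercise the code, not a real record from the file.

```python
import csv
import io

HEADER = "Coord,Ref,Codon,GlobalF,First,Last,Clade0.75,PS,NS,LastSel,Pred,Haplo"


def read_variants(handle):
    """Read variant annotations, interpreting only the documented columns."""
    rows = []
    for rec in csv.DictReader(handle):
        ref, alt = rec["Ref"], rec["Codon"]
        rows.append({
            # Coord is 0-based; many genome browsers expect 1-based positions.
            "codon_start_1based": int(rec["Coord"]) + 1,
            "ref": ref,
            "alt": alt,
            "n_nucleotide_changes": sum(a != b for a, b in zip(ref, alt)),
            "global_freq": float(rec["GlobalF"]),
        })
    return rows


# Placeholder row (NOT real data); undocumented columns are left empty.
placeholder = ",".join(["100", "GAT", "GGT", "0.5"] + [""] * 8)
variants = read_variants(io.StringIO(HEADER + "\n" + placeholder + "\n"))
```

For the real file, the same function can be called on `open("variant-annotation.csv", newline="")`, assuming the file has been downloaded locally under that name.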
  • Improving method performance for smaller alignments/test sets (HyPhy versions 2.5.33 and later). FEL (Fixed Effects Likelihood) is a tool that we originally developed in 2005 to perform a "non-parametric" test of natural selection acting on individual alignment sites. The method essentially estimates (site-by-site) a pair of evolutionary substitution rates, α (synonymous substitutions) and β (non-synonymous substitutions), and performs a statistical hypothesis test of whether α = β. If the null hypothesis is rejected at some significance level (e.g., p ≤ 0.05), then selection is inferred: negative/purifying if α > β and positive/diversifying if α < β. The significance for the test is derived using standard asymptotics, which work well if the sample size is large enough. In this case, the sample size is approximately the number of tested branches, which can be small (even ONE!). Motivated in part by our own analysis of small samples (~20 sequences) of canine and feline coronaviruses, we modified FEL to use a parametric bootstrap at each site to obtain significance. This is of course much more expensive (×K, where K is the number of bootstrap replicates), but should result in a more accurate definition of the null distribution of the test statistic and better detection of non-neutral evolution. Another context in which FEL is often used is the estimation of site-by-site dN/dS (ω). These estimates are going to be quite noisy, and generally we do not recommend using them directly. But when coupled with some degree of uncertainty assessment, the estimates will be more useful. To make this possible, we added an option to compute profile likelihood estimates for each site.
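The bootstrap logic described above can be illustrated in a few lines. This is a sketch under assumptions, not HyPhy's implementation: the function names are hypothetical, the null test statistics are assumed to have been computed already from K datasets simulated under α = β, and the "+1" correction in the p-value is one common convention that may differ from what FEL actually reports.

```python
def bootstrap_pvalue(observed_lrt, null_lrts):
    """Fraction of null replicates at least as extreme as the observed statistic.

    Uses the (count + 1) / (K + 1) convention so the p-value is never exactly 0.
    """
    if not null_lrts:
        raise ValueError("need at least one bootstrap replicate")
    exceed = sum(1 for stat in null_lrts if stat >= observed_lrt)
    return (exceed + 1) / (len(null_lrts) + 1)


def classify_site(alpha, beta, p_value, threshold=0.05):
    """Label a site from its rate estimates and the test p-value."""
    if p_value > threshold:
        return "neutral"  # null hypothesis alpha = beta is not rejected
    # beta < alpha means dN/dS < 1 (purifying); beta > alpha means dN/dS > 1.
    return "negative" if beta < alpha else "positive"
```

With K replicates the smallest attainable p-value under this convention is 1/(K + 1), so reaching p ≤ 0.05 requires at least 19 replicates per site.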