Single Cell RNA Velocity

--- title: 'Single Cell RNA Velocity' disqus: hackmd --- # Table of Contents [TOC] # Model of transcriptional dynamics The model of transcriptional dynamics captures ranscriptional induction (“on” phase) and repression (“off” phase) of unspliced precursor mRNAs $u(t)$ with state- ependent rates $\alpha^{(k)}$, its splicing into mature mRNAs $s(t)$ with rate $\beta$ (i.e. removing introns from pre-mRNAs and joining adjacent exons to produce spliced mRNAs) and eventual degradation with rate $\gamma.$ # Velocyto ## Theoretical description of RNA velocity ### abundance of unspliced mRNA #### First of all, use uniform symbols. $$\begin{aligned} P(t) &= (P_0 - \frac{\beta}{\alpha_1})e^{-\alpha_1 t} + \frac{\beta}{\alpha_1}\\ u(t) &= (u_0 - \frac{\alpha}{\beta})e^{-\beta t} + \frac{\alpha}{\beta} \\ \text{Here we set }\beta = 1:\\ &= (u_0 - \alpha)e^{- t} + \alpha\\ &= u_0 e^{- t} - \alpha e^{- t} + \alpha\\ &= \alpha(1-e^{-t}) + u_0 e^{-t} \end{aligned} $$ This is exactly the expression of $u(t)$ in velocyto paper. #### Now let's play with spliced mRNA quantification. **The normalized degradation rate $\gamma$ varies among genes and needs to be estimated in a gene-specifc manner.** #### First of all, I will uniform the symbols and replace $\beta$ with $1$. $$\begin{aligned} M(t) &= \bigg( M_0 - \frac{\alpha_1 P_0 - \beta}{\alpha_2 - \alpha_1} - \frac{\beta}{\alpha_2} \bigg)e^{-\alpha_2 t} + \frac{\alpha_1 P_0 - \beta}{\alpha_2 - \alpha_1}e^{-\alpha_1t} + \frac{\beta}{\alpha_2}\\ s(t) &= \bigg( s_0 - \frac{\beta u_0 - \alpha}{\gamma - \beta} - \frac{\alpha}{\gamma} \bigg)e^{-\gamma t} + \frac{\beta u_0 - \alpha}{\gamma - \beta}e^{-\beta t} + \frac{\alpha}{\gamma}\\ \text{Here we set }\beta = 1:\\ &= \bigg( s_0 - \frac{u_0 - \alpha}{\gamma - 1} - \frac{\alpha}{\gamma} \bigg)e^{-\gamma t} + \frac{u_0 - \alpha}{\gamma - 1}e^{-t} + \frac{\alpha}{\gamma} \end{aligned} $$ #### This equation looks sophasticated, so let's play with the one in velocyto $$\begin{aligned} s(t) &= \frac{e^{-t(1+\gamma)}\bigg[e^{t(1+\gamma)} \alpha(\gamma -1) + e^{t\gamma}(u_0-\alpha)\gamma + e^t \big(\alpha - \gamma (s_0 + u_0 + s_0 \gamma)\big)\bigg]}{\gamma(\gamma-1)}\\ &= \frac{\alpha(\gamma -1) + e^{-t}(u_0-\alpha)\gamma + e^{-t\gamma} \big(\alpha - \gamma (s_0 + u_0 + s_0 \gamma)\big)}{\gamma(\gamma-1)}\\ &= \bigg(\frac{\alpha - \gamma (s_0 + u_0 + s_0 \gamma)}{\gamma (\gamma - 1)} \bigg)e^{-t\gamma } + \frac{\alpha}{\gamma} + \frac{u_0 - \alpha}{\gamma-1} e^{-t}\\ &= \bigg(\frac{\alpha}{\gamma (\gamma - 1)} - \frac{u_0}{\gamma - 1} - \frac{s_0( \gamma + 1)}{\gamma - 1} \bigg)e^{-t\gamma } + \frac{\alpha}{\gamma} + \frac{u_0 - \alpha}{\gamma-1} e^{-t}\\ &= \frac{-1}{\gamma -1} \bigg( s_0( \gamma + 1) + u_0 - \frac{\alpha}{\gamma} \bigg)e^{-\gamma t} + \frac{u_0 - \alpha}{\gamma-1} e^{-t} + \frac{\alpha}{\gamma} \end{aligned} $$ #### In summary: $$\begin{aligned}\text{in 2011 velocity}: s(t)&= &\bigg( &s_0 &+ \frac{ u_0 -\alpha}{1-\gamma} &- \frac{\alpha}{\gamma} &\bigg)e^{-\gamma t} + \frac{u_0 - \alpha}{\gamma - 1}e^{-t} + \frac{\alpha}{\gamma}\\ \text{In velocyto} : s(t)&= \frac{-1}{\gamma -1} &\bigg( &s_0( \gamma + 1) &+ u_0 &- \frac{\alpha}{\gamma} &\bigg)e^{-\gamma t} + \frac{u_0 - \alpha}{\gamma-1} e^{-t} + \frac{\alpha}{\gamma} \end{aligned} $$ What's the difference between these equations? why? ### Estimation and extrapolation #### Model II. Constant unspliced molecules assumption $$\begin{aligned} \frac{ds}{dt} &= u_0 - \gamma s(t)\\ \frac{ds}{dt} &= -\gamma \times (s(t) - \frac{u_0}{\gamma})\\ \frac{1}{s - \frac{u_0}{r}} ds &= -\gamma dt\\ \int_{s_0}^{s(t)} \frac{1}{s - \frac{u_0}{r}} ds &= - \int_{0}^{t} \gamma dt\\ \ln \bigg(s - \frac{u_0}{r}\bigg) \bigg|_{s_0}^{s(t)} &= -\gamma t\\ \ln(s(t) - \frac{u_0}{\gamma}) - \ln(s(0) - \frac{u_0}{\gamma}) &= -\gamma t\\ \ln\frac{s(t) - \frac{u_0}{\gamma}}{s(0) - \frac{u_0}{\gamma}}&= -\gamma t\\ \frac{s(t) - \frac{u_0}{\gamma}}{s(0) - \frac{u_0}{\gamma}}&= e^{-\gamma t}\\ s(t) &= s_0 e^{-\gamma t} + \frac{u_0}{\gamma}(1 - e^{-\gamma t}) \end{aligned} $$ **Assuming gene independence, the overall RNA velocity of the cell is a multidimensional vector comprised of the individual gene velocities.** ## Estimation framework features: 1. Independent normalization of spliced and unspliced counts allows to control for genome-wide variation of splicing rates between cells. 2. Our model incorporates a gene-specific offset to account for background signals that could originate from other transcripts or alignment errors. 3. The robustness of the RNA velocity estimates can be improved by pooling of transcript counts across 𝑘 most similar cells (see “Cell nearest neighbor (kNN) pooling” section below) or across well-correlated genes. ### Estimation of RNA velocity **For each gene**, the normalized degradation rate $\gamma$ was determined using a least squares fit: $$u \sim \gamma \times s $$ where $u = U/N_u$, $s = S/N_s$, where $U$ and $S$ are the number of unspliced and spliced counts, respectively, and $N_u$ and $N_s$ are the total numbers of unspliced and spliced molecules observed **in a given cell**, respectively. **Each gene of each cell has a $\gamma$** - Under Model I, the velocity component v for a given gene in a given cell was assumed to be $$v = u − \gamma s − o$$ Here $o$ was taken to be mean $u$ across cells where s=0, which is used to account for background signals that could originate from other transcripts or alignment errors. If most of cells are not in steady state, only cells whose $s$ value are within the top and bottom 5% for that gene. No regression weights were used. The extrapolated counts of **a given gene in a given cell** was then determined as $$s_t = \max(0, s_0 + vt) $$ **Each gene of each cell has a $v$ and $s_t$** - Under Model II, the displacement of spliced mRNA $\Delta 𝑠$ **for a given gene in a given cell** was estimated assuming constant $u$ as $$\Delta s(𝑡) =\bigg(\frac{\hat u}{\gamma} − 𝑠 \bigg) (1 − 𝑒^{-\gamma t}) $$ where $\hat u$ is the offset-adjusted unspliced count, calculated as $$\hat u = \max(0, u − o) $$ and using the default extrapolation time $𝑡 = 1$. Here use $\hat u$ because $o$ and $\gamma$ fit were estimated using a different logic. The extrapolated counts of a given gene in a given cell was then determined as $$S_t = \max(0, S + \Delta s(t) N_s) $$ where $S$ is the non-normalized spliced count for a gene in a given cell. **Here do $\Delta s(t) N_s$ because $\Delta s(t)$ is normalized, so we need to return it to its original magnitude** The normalized extrapolated counts were then calculated as $𝑠_t = \frac{S_t}{\hat N}$, where $\hat N$ is the extrapolated total size of the cell $\hat N= 𝑁_s + 𝑆_t − 𝑆$. $$\begin{aligned} \Delta s(t) &= s(t) - s_0\\ &= s_0 e^{- \gamma t} + \frac{u_0}{\gamma}(1 - e^{- \gamma t}) - s_0\\ \text{Here } s_0 \text{ is called } s\text{, } u_0 \text{ is called } \hat u:\\ &= s e^{- \gamma t} + \frac{\hat u}{\gamma}(1 - e^{- \gamma t}) - s\\ &= \frac{\hat u}{\gamma}(1 - e^{- \gamma t}) -s (1 − 𝑒^{-\gamma t})\\ &= \bigg(\frac{\hat u}{\gamma} − 𝑠 \bigg) (1 − 𝑒^{-\gamma t}) \end{aligned} $$ ### Errors due to unequal losses. Counts of spliced and unspliced mRNA molecules are subject to gene-specific losses during sample preparation. If these losses are equal, observed counts U and S will be a linear rescaling of the true counts $\hat U$ and $\hat S$: $$\begin{aligned} U &= k \hat U\\ S &= k \hat S \end{aligned} $$ If if the gene-specific loss of unspliced molecules $U$ differs from $S$, say by a factor $f$, then our estimates of \frac{ds}{dt} will be off by the factor $f$. 1. At steady state, instead of estimating $\gamma = \frac{\hat U}{\hat S}$,we will be estimating $f\gamma = \frac{f \hat U}{\hat S}$. 2. Away from steady state, our estimates of the velocity will be: $$f \hat U(t) - f \gamma \hat S(t) = f \bigg( \hat U(t) - \gamma \hat S(t) \bigg) = f \frac{d \hat S}{dt}$$. As $\frac{ds}{dt} = u_0 - \gamma s(t)$, the effect of this factor is to increase $\beta$ a little bit. ### Structure-based estimation of RNA velocity STEPS: 1. using gene-relative model described above to estimated initial $\gamma_r$. 2. fit a generalized additive model(GAM) to predict gene-specific values of $\gamma$ based on combination of gene structure parameters: - the number of exons - total intronic length - the number of predicted internal priming sites. 3. Calculate the unspliced count $M = log2(\frac{u}{\hat \gamma s + \hat o})$. $M$ value can be directly compared between genes, and we expect that co-regulated genes show similar $M$ values during up- and down- regulation. The $M$ values were refined by applying trimmed mean procedure across closely-correlated genes,and then are used to extrapolate the future expression state of the cell. - The analysis considers only genes with more than one annotated exon ($n_e > 2$), with total exonic length $l_e > 500bp$, and total intronic length $l_i > 3kbp$. Then a global generalized additive model was constructed to fit the dependency $\gamma_r$ on the structural parameters of the gene and its expression magnitude:$$\gamma_r \sim K_2(s,l_i) + K_1 \Bigg( log_{10} \bigg(\frac{l_i+1}{l_e+1} \bigg) \Bigg) + K_1(n_e) + K_1(n_c) + K_1(n_d)$$ - where $K_1$ and $K_2$ denote 1D and 2D local smoothing kernels. $s$ denotes spliced expression magnitude of a gene, and $n_e$ the number of expressed exons in a gene. $n_c$ and $n_d$ denote the number of concordant and discordant internal priming sites, respectively. - The model was fit, weighting the observations of each gene by the square root of its total spliced counts across the dataset. - The fitting procedure also omitted genes with disproportionate amount of unspliced counts (likely due to unannotated non-coding transcripts). To do so, total unspliced counts were modeled as $$\sum_{cells} u \sim \sum_{cells}s + \frac{l_i}{l_e}$$ - using a generalized linear model with normal distribution and log link and genes with pearson deviance exceeding 3 were omitted when fitting the model. 4. Using cells whose pearson deviance exceeding 3, the structure-based steady-state $\gamma_s$ is calculated using the formulation of $\gamma_r$. 5. The log ratio of observed to expected unspliced counts as $$m_s = log \bigg( \frac{u}{\gamma_s s + o_r} \bigg)$$where $o_r$ is the gene specific offset determined as described in the gene-relative model. 6. As we expect $m$ estimates of individual genes to be noisy, the next step used $k$ nearest genes to stabilize $m$ estimates for each gene. - Specifically, for a given gene $g$, $m$ was estimated as a trimmed mean of the $m_s$ values for $k$ genes most correlated with gene $g$ across the entire dataset. Default $k=15$ was used, with top and bottom 5 genes trimmed. These robust $m$ estimates were then used to estimate RNA velocity as described in “Gene kNN pooling”. ### Cell nearest neighbor (kNN) pooling. To improve slope $\gamma$ and offset $o$ estimation, we pooled count data across k most correlated genes, assessed using Pearson linear correlation distance on $log(s + 1)$ values to get $\hat \gamma$ and $\hat o$. k was adapted to match the sparsity and size of the dataset. The velocity of the original gene was then evaluated based on the assumption that the observed vs. expected (in steady-state) ratio of the unspliced signal should be similar between co-regulated genes, and therefore also similar for the pooled gene counts. **The extrapolated state was calculated using the initial (non-pooled) count values.** steps: 1. calculate $\hat \gamma$ and $\hat o$ using pooling counts across k most correlated genes. 2. calculate the log ratio of observed to expected unspliced count for a given gene and a given cell: $$m = \log_2(\frac{u}{\hat \gamma s + \hat o})\\ $$ 3. calculate gene specific $\gamma$ for each gene of each cell by only considering cells with $s > 0$. For a single gene of a cell, given $v = u - \gamma s - o = \hat u - \gamma s$ $$\begin{aligned} m &= \log_2(\frac{\hat u}{\hat \gamma s})\\ m &= \log_2(\hat u) - \log_2(\hat \gamma s)\\ \log_2(\hat u) &= m + \log_2(\hat \gamma s)\\ \gamma &= 2^{log_2(\hat u / s) - m} \end{aligned} $$ here $s$ and $\hat u$ ($\hat u$ is the offset-adjusted unspliced count $\hat u = \max(0, u − o)$) are count for the gene of the cell, $m$ is calculated using pooling counts, and $\hat o$ is collapsed into $\hat u$. In model 2, under the assumption of constant $u$, $\Delta s(t)$ was estimated as $$\Delta s(t) = (s + \epsilon)[e^{-\gamma t}(1-2^m) + 2^m] - s $$ where $\epsilon = 10^{-4}$ ### Visualization of cell velocities. #### PCA-based visualization Now we have a gene by cell **extrapolated state** matrix. We can visualize it by projecting it to the PCs. $$\text{velocity extrapolation vector of the cell i} = vt $$ #### t-SNE based visualization places the velocity arrow in the direction in which expression difference is best correlated with the estimated velocity vector, controlling for the cell density distribution. We calculated a transition probability matrix P by applying an exponential kernel on the Pearson correlation coefficient between the velocity vector and cell state difference vectors: Transformation function $$\varrho(x) = sign(x) \sqrt{|x|} $$ Given the $\varrho$ transformed expression vector of cell $i$ $s_i$ and expression vector of cell $j$ $s_j$, the difference vector $r_{ij}$ is defined as $$r_{ij} = \varrho (s_j - s_i) $$ The $\varrho$ transformed velocity extrapolation vector of the cell i $d_i$ is defined as $$d_i = \varrho (vt) $$ We calculated a transition probability matrix $P$ by applying an exponential kernel on the Pearson correlation coefficient between $r_{ij}$ and $d_i$: $$P_{ij} = exp \bigg( \frac{corr(r_{ij},d_i)}{\sigma} \bigg) $$ $P$ is row-normalized so that $\sum_j P_{ij} = 1$. The transition probabilities $P_{ij}$ were used as weights to compute the extrapolation vector. Given embedding matrix $X$ whose column vectors are the embedding for all cells: $$X = [x_1, x_2,...,x_{n-1}, x_n] $$ The predicted velocity displacement of a cell was calculated as: $$\Delta x_i = \sum_j \bigg( P_{ij} - \frac{1}{n} \bigg) \frac{x_j - x_i}{||x_j - x_i||} $$ Where substracting $\frac{1}{n}$ corrects the estimate for the non-uniform density of points in the embedding. For large dataset, we visualize regular grid vector field instead of individual cell velocity arrow. Given Gaussian kernel smoothing function $K_\sigma$: $$K_\sigma(a,b) = exp \bigg( \frac{-||a-b||^2}{2 \sigma ^2} \bigg) $$ The grid vector field was estimated by applying Gaussian kernel smoothing to the velocity vectors of cells around each grid point: $$\Delta x_{grid} = \sum_i K_\sigma (x_{grid},x_i) \Delta x_i $$ The size of the grid (number of grid points) was chosen depending on the visual scale of the figure. #### Diffusion start and end-point modeling To find the set of cells that correspond to the differentiation starting point and end points, we used a markov process where repectively the transition probability matrix $T$, each entry is defined as: $$T_{ij} = p W_{ij} + (1-p)D_{ij} $$ Where $W$ (row sum normalized) corresponds to the local Brownian motion component: $$W_{ij} = K_{\sigma_W}(x_i,x_j) $$ and D(rowsum normalized) represents the local velocity driven drift: $$D_{ij} = P_{ij}K_{\sigma_D}(x_i,x_j) $$ Here we used two Gaussian kernel$K_\sigma$, where $K_{\sigma_W}$ corresponds to the local Brownian motion, $K_{\sigma_D}$ is used to smooth the diffusion process. To avoid the influence of the local density of points on the result, we downsampled the dataset selecting the nearest neighbors of a uniform grid. $\sigma_D$ was set t0 the average distance between neighboring points and $\sigma_W = \sigma_D /2$. Backward diffusion (i.e. diffusion against the velocity bias) was performed by transposing and row-normalizing transition probability matrix $T$. In both directions of the diffusion, we started from a uniform distribution and performed 2500 iterations. # scVelo ## Preparing the scRNA-seq data for velocity estimation 1. Annotations of unspliced/spliced reads were obtained using velocyto CLI. Alternatively, reads can be pseudo-aligned with kallisto. 2. The count matrices are size-normalized to the median of total molecules across cells. 3. The top 2,000 highly variable genes are selected out of those that pass a minimum threshold of 20 expressed counts commonly for spliced and uns1pliced mRNA. 4. A nearest neighbor graph (with 30 neighbors) was calculated based on euclidean distances in PCA space (with 30 principal components) on logarithmized spliced counts. 5. For velocity estimation, first and second order moments (means and uncentered variances) are computed for each cell across its 30 nearest neighbors. 6. After velocity estimation, the gene space can be further restricted to genes that pass a minimum threshold for the coefficient of determination ($R^2$, derived from the steady-state model) or gene likelihood $(P ((u; s)j(\theta; \eta))$, derived from the dynamical model). ## Restate velocyto's method ### STEP 1. estimate $\gamma$ It claims that velocyto is a steady state model because the $\gamma$ is estimated using steady state properties. IN velocyto, it obtain on average a constant transcriptional state ($\frac{ds}{dt} = 0$)using cells lower and upper quantiles in phase space(extreme quantiles). $$\begin{aligned} v(t) = \frac{ds}{dt} = \beta u - \gamma s &= 0\\ \frac{u}{s} &= \frac{\beta}{\gamma} \\ \text{we call }\gamma' = \frac{u}{s}\\ \gamma' &= \frac{\beta}{\gamma} \end{aligned}$$ Here $\gamma'$ is the steady-state ratio of unspliced to spliced mRNA, it indicates where mRNA synthesis and degradation are in balance. It can be solved analytically via a least square fit: $$\gamma' = \frac{u^Ts}{||s||^2} $$ where $u= (u_1, ..., u_n)$ and $s=(s_1, ..., s_n)$ are sized-normalized counts vectors for a specific gene across selected n cells. If centers $u$ and $s$ by a basal transcription offset $$o = \bar u - \bar s \gamma', o\geq0 $$ where $\bar u$ and $\bar s$ are the means of $u$ and $s$, respectively, to account for basal transcription, the steady-state ratio is then given by $$\gamma' = \frac{Cov(u,s)}{Var(s)} $$ ### Step 2. Extrapolation. Now let's assume $\gamma$ is a constant, so velocities can be computed as deviations from this steady-state ratio: $$v_i = u_i - \gamma' s_i $$ where $v$ is a vector which has the same length with $u$ and $s$. While a constant transcriptional state is reflected by zero velocity, the direction and relative speed during a dynamic process is given by the sign and magnitude of non-zero velocity. Further, velocities only depend on one ratio instead of absolute values of kinetic rates, which technically corresponds to measuring all entities in units of splicing rate, thus effectively assuming one common splicing rate $\beta = 1$ across all genes. ## Dynamical model In recognition that steady states are not always captured and that splicing rates differ between genes, we establish a framework that does not rely on these restrictions. **The analytical solution to the gene-specific rate quations** ### unspliced mRNA abundance $$\begin{aligned} P(t) &= (P_0 - \frac{\beta}{\alpha_1})e^{-\alpha_1 t} + \frac{\beta}{\alpha_1}\\ u(t) &= (u_0 - \frac{\alpha}{\beta})e^{-\beta t} + \frac{\alpha}{\beta} \\ \text{Here we set } \tau = t - t^{(k)}_0, \alpha = \alpha^{(k)}:\\ &= (u_0 - \alpha)e^{ -\beta \tau} + \frac{\alpha^{(k)}}{\beta}\\ &= u_0 e^{ -\beta \tau} - \frac{\alpha^{(k)}}{\beta} e^{ -\beta \tau} + \frac{\alpha^{(k)}}{\beta}\\ &= u_0 e^{-\beta \tau} + \frac{\alpha^{(k)}}{\beta}(1-e^{-\beta \tau}) \end{aligned} $$ In summary, $$\begin{aligned}\text{in 2011 velocity}: u(t) &= u_0 e^{-\beta \tau} + \frac{\alpha^{(k)}}{\beta}(1-e^{-\beta \tau})\\ \text{In velocyto} : u(t) &= u_0 e^{-\beta \tau} + \frac{\alpha^{(k)}}{\beta}(1-e^{-\beta \tau}) \\ \text{In scVelo} : u(t) &= u_0 e^{-\beta \tau} + \frac{\alpha^{(k)}}{\beta}(1-e^{-\beta \tau}) \end{aligned} $$ ### spliced mRNA abundance $$\begin{aligned} \text{in 2011 velocity}: s(t) &= \bigg( s_0 - \frac{\beta u_0 - \alpha}{\gamma - \beta} - \frac{\alpha}{\gamma} \bigg)e^{-\gamma t} + \frac{\beta u_0 - \alpha}{\gamma - \beta}e^{-\beta t} + \frac{\alpha}{\gamma}\\ &= s_0 e^{-\gamma t} - \frac{\beta u_0 - \alpha}{\gamma - \beta} e^{-\gamma t} - \frac{\alpha}{\gamma} e^{-\gamma t} + \frac{\beta u_0 - \alpha}{\gamma - \beta}e^{-\beta t} + \frac{\alpha}{\gamma}\\ &= s_0 e^{-\gamma t} + \frac{\alpha}{\gamma} - \frac{\alpha}{\gamma} e^{-\gamma t} + \frac{\beta u_0 - \alpha}{\gamma - \beta}e^{-\beta t} - \frac{\beta u_0 - \alpha}{\gamma - \beta} e^{-\gamma t} \\ &= s_0 e^{-\gamma t} + \frac{\alpha}{\gamma}(1 - e^{-\gamma t}) + \frac{ \alpha - \beta u_0}{\gamma - \beta}( e^{-\gamma t} - e^{-\beta t} ) \end{aligned} $$ In summary, $$\begin{aligned} \text{in 2011 velocity}: s(t) &= s_0 e^{-\gamma t} + \frac{\alpha}{\gamma}(1 - e^{-\gamma t}) + \frac{ \alpha - \beta u_0}{\gamma - \beta}( e^{-\gamma t} - e^{-\beta t} )\\ \text{In scVelo}: s(t) &= s_0 e^{-\gamma t} + \frac{\alpha}{\gamma}(1 - e^{-\gamma t}) + \frac{ \alpha - \beta u_0}{\gamma - \beta}( e^{-\gamma t} - e^{-\beta t} )\\ \end{aligned} $$ with parameters of reaction rates $\theta = (\alpha^{(k)}, \beta,\gamma)$, cell-specific time points $t \in (t_1, ...,t_N)$, and initial conditions $u_0 = u(t_0)$, $s_0 = s(t_0)$. In this model, parameters are not constants anymore, they claim that each cell has 1. a transcriptional state k, so for N cells, there is an additional paramter set $k = (k_1, ..., k_N)$. The number of potential steady states equals the number of transcriptional states. 2. Consequently, there are state-dependent transcription rate $\alpha^{(k)}$, 3. initial conditions $u_0^{(k)}$ and $s_0^{(k)}$, 4. and the time duration of switching states $t_0^{(k)}$. Consider a transition from state $k = i \rightarrow i+1$ (e.g. from induction to homeostasis), the initial condition of state $k = i+1$ are given by evaluating the trajectory of state $k = i$ at its switching point $$\begin{aligned} u_0^{i+1} \big( t_0^{i+1}, \theta^{(i)} \big) &= u_0^{(i)} + \frac{\alpha^{(i)}}{\beta} \big(1 - e^{-\beta t_0^{(i+1)}} \big)\\ s_0^{i+1} \big( t_0^{i+1}, \theta^{(i)} \big) &= u_0^{i} + \frac{\alpha^{(i)}}{\gamma}(1 - e^{-\gamma t_0^{i+1}}) + \frac{ \alpha^{(i)} }{\gamma - \beta}( e^{-\gamma t_0^{i+1}} - e^{-\beta t_0^{i+1}} ) \end{aligned} $$ - [ ] I cannot see how to get this equation, so I assume this is a approximation of their $s(t)$ formulation. I think they have typo. I will dig into their code to see. $$ s_0^{i+1} \big( t_0^{i+1}, \theta^{(i)} \big) = u_0^{i} + \frac{\alpha^{(i)}}{\beta} \bigg( \frac{\beta}{\gamma} (1 - e^{-\gamma t_0^{i+1}}) + \frac{\beta}{\gamma - \beta}( e^{-\gamma t_0^{i+1}} - e^{-\beta t_0^{i+1}} ) \bigg) $$ > If I try to match the $s$ with $s(t)$ I got $$ u_0^{(i)} = s_0^{(i)}e^{\gamma t} - \frac{\beta u_0^{(i)}}{\gamma - \beta} (e^{- \gamma t_0} - e^{- \beta t_0}) $$ we know all $u_0$ and $s_0$ are in the steady state, so $\frac{u_0}{s_0} = \frac{\beta}{\gamma}$, after massaged the equation, I get $$ \beta + \beta^2 e^{-\gamma t} + \beta e^{-\beta t} = \gamma + e^{-\gamma t} $$ Nothing happend. Being at state $k = i$, abundances can potentially reach their steady state in the limit, $$ (u_\infty^{(i)},s_\infty^{(i)}) = (\frac{\alpha^{(i)}}{\beta}, \frac{\alpha^{(i)}}{\gamma}) $$ ## Parameter inference In the following, we consider four phases, induction ($k=1$), and repression ($k=0$) each with an associated potential steady state ($k=ss_1$, $k=ss_0$). ### Clarification To recover the splicing kinetics, we need to estimate: 1. reaction rates $\theta^{(k)}$ 2. cell specific latent time point $t_i$ for each cell, it couples the measurement to the differential equations system by mapping measurement to the phase trajectory. 3. state $k_i$ for each cell. 4. switching time point $t_0^{(k)}$ of transitioning to another state. Let the model estimate be $$\hat x(t,\theta) = (\hat u(t,\theta), \hat s(t,\theta)) $$ and the observations be $$x^{obs}_i \sim N(\hat x(t,\theta), \sigma ^2) $$ ### Maximum Likelihood Assume 1. gene-specific $\sigma$ is constant across cells within one transcriptional state. 2. the observations are i.i.d The negative log-likelihood to be minimized is given by a normal distribution: $$ l(\theta) = log(\sqrt{2\pi \sigma^2}) + \frac{1}{2\sigma ^2} \sum_i \bigg|\bigg|x(\theta|t_i) - x_i^{obs}\bigg|\bigg|^2 $$ where $\theta = (\alpha ^{(k)}, \beta, \gamma)$, $i$ repsects to cell $i$ ### Expectation-maximization Given cell-specific latent time points $t = (t_1, ..., t_N)$ for N cells, $x^{obs}_i$ can be mapped to $\hat x(t|\theta)$ to get a optimized $t_i$ for each cell. EM is used to optimize $t$ and $\theta$ by iterating between **finding the optimal paramters of kinetic rates** and **the latent variables of time and state**. #### EM is initialized with parameters derived from steady-state model, to be specific: $$\begin{aligned} \alpha^{(k_i)} &= \begin{cases} \max(s) & \text{if } k_i = 1\\ 0 & \text{if } k_i = 0 \end{cases}\\ \beta &= 1\\ \gamma' &= \frac{u^Ts}{||s||^2}\\ k_i &=\begin{cases} 1 & \text{if } u_i - \gamma's_i > 0\\ 0 & \text{else} \end{cases} \end{aligned} $$ where $u$ and $s$ are size-normalized unspliced and spliced counts from extreme quantile cells. To understand the dimensionality, here we clarify that: - rate $\beta, \gamma$ are constants. - rates $\alpha^{(k)}$ is gene-specific and state-specific, for each state of each gene of each cell $i$, there is a $\alpha^{(k_i)}$. - Abundance $u(t),s(t)$ are gene-specific. - Latent time point $t_i \in (t_1, ..., t_N)$ is cell-specific. - transcriptional state $k_i \in \{0, ss_0, 1, ss_1\}$ is cell-specific. - Standard deviation $\sigma$ in normal distribution $N(\hat x(t,\theta), \sigma ^2)$ is constant across cells within one transcriptional state. #### After initialization, the EM algorithm iteratively applies the following two steps: - E-step: 1. For each state $k_i \in (0, ss_0, 1, ss_1)$ of each cell $i \in (1,...,N)$, Given $\hat x(t|\theta)$ parametrized by the current estimate of $\theta^{(k)}$, we assign a latent time $t_i$ to the observed value $x^{obs}_i$ by minimizing the distance to the phase trajectory $(\hat x(t|\theta^{(k)}))$ in each transcriptional state. To be specific, exact optimal time assignment is applied in the last iteration. During iteration, the latent time is approximated explicitly for efficiency reason by:$$s(t) = \tilde \beta u(t) + \frac{\alpha}{\gamma} - \tilde \beta \frac{ \alpha}{\beta} + \bigg(s_0 - \frac{\alpha}{\gamma} - \beta (u(t) - \frac{\alpha}{\beta}) \bigg)e^{-\gamma \tau}$$ 3. State log-likelihoods are then assigned to each cell, so each possible state of each cell has a log-likelihood. State likelihoods are determined by the distance of the observations to the four segments of the phase trajectory, parametrized by kinetic rates and latent time. 4. For each cell $i \in (1,...,N)$, by integrating over all log-likelihoods for all transcriptional states $k_i \in (0, ss_0, 1, ss_1)$, we yield an expected value of the log-likelihood. - M-step: The parameter set $\theta$ is updated to best fit the expected value of the log-likelihood. all transient states(induction, repression, active steady state, and inactive steady state) are explicitly modeled. The state likelihoods are determined by the distance of the observations. The approximate optumal laten time $\tau_i$ is calculated by: $$ \begin{aligned} \text{Given: } u(t) &= (u_0 - \frac{\alpha}{\beta})e^{-\beta \tau} + \frac{\alpha}{\beta} \\ e^{-\beta \tau} &= \frac{u(t) - \frac{\alpha}{\beta}}{u_0 - \frac{\alpha}{\beta}}\\ \text{Given: } s(t) &= s_0 e^{-\gamma \tau} + \frac{\alpha}{\gamma}(1 - e^{-\gamma \tau}) + \frac{ \alpha - \beta u_0}{\gamma - \beta}( e^{-\gamma \tau} - e^{-\beta \tau} )\\ &= s_0 e^{-\gamma \tau} + \frac{\alpha}{\gamma}(1 - e^{-\gamma \tau}) + (\frac{ \beta}{\gamma - \beta} \times \frac{ \alpha}{\beta} - \frac{ \beta}{\gamma - \beta} \times u_0) ( e^{-\gamma \tau} - e^{-\beta \tau} )\\ \text{Here we set } \tilde \beta := \frac{\beta}{\gamma - \beta}\\ s(t) &= s_0 e^{-\gamma \tau} + \frac{\alpha}{\gamma}(1 - e^{-\gamma \tau}) + \tilde \beta( \frac{ \alpha}{\beta} - u_0) ( e^{-\gamma \tau} - e^{-\beta \tau} )\\ \text{Here we set } e^{-\beta \tau} = \frac{u(t) - \frac{\alpha}{\beta}}{u_0 - \frac{\alpha}{\beta}}\\ s(t) &= s_0 e^{-\gamma \tau} + \frac{\alpha}{\gamma}(1 - e^{-\gamma \tau}) + \tilde \beta( \frac{ \alpha}{\beta} - u_0) ( e^{-\gamma \tau} - \frac{u(t) - \frac{\alpha}{\beta}}{u_0 - \frac{\alpha}{\beta}} )\\ &= s_0 e^{-\gamma \tau} + \frac{\alpha}{\gamma}(1 - e^{-\gamma \tau}) + \tilde \beta( \frac{ \alpha}{\beta} - u_0) e^{-\gamma \tau} + \tilde \beta (u(t) - \frac{\alpha}{\beta})\\ &= \tilde \beta u(t) + \frac{\alpha}{\gamma} - \tilde \beta \frac{ \alpha}{\beta} + \bigg(s_0 - \frac{\alpha}{\gamma} - \tilde\beta (u_0 - \frac{\alpha}{\beta}) \bigg)e^{-\gamma \tau} \end{aligned} $$ $\tilde \beta$ constitutes the linear dependence of unspliced on spliced mRNA. If we denote $$ \begin{aligned} \tilde s(t) &:= s(t) - \tilde \beta u(t) = \frac{\alpha}{\gamma} - \tilde \beta \frac{ \alpha}{\beta} + \bigg(s_0 - \frac{\alpha}{\gamma} - \tilde \beta (u_0 - \frac{\alpha}{\beta}) \bigg)e^{-\gamma \tau}\\ \tilde s_\infty &:= s_\infty - \tilde \beta u_\infty = \frac{\alpha}{\gamma} - \tilde \beta \frac{ \alpha}{\beta}\\ \tilde s(t) - \tilde s_\infty &= \bigg(s_0 - \frac{\alpha}{\gamma} - \tilde \beta (u_0 - \frac{\alpha}{\beta}) \bigg)e^{-\gamma \tau}\\ &= \bigg((s_0 - \tilde \beta u_0) - (\frac{\alpha}{\gamma} - \tilde \beta \frac{\alpha}{\beta}) \bigg)e^{-\gamma \tau}\\ &= \bigg(\tilde s_0 - \tilde s_\infty \bigg)e^{-\gamma \tau} \\ \tau_i &= - \frac{1}{\gamma} log \bigg( \frac{\tilde s_i - \tilde s_\infty}{\tilde s_{0,i} - \tilde s_\infty}\bigg) \end{aligned} $$ for genes with $\beta > \gamma$, $\tau_i$ is obtained by this equation. For genes with $\beta \leq \gamma$, we can instead directly take the inverse of $u(t)$ by $$\begin{aligned} u(t) &= (u_0 - \frac{\alpha}{\beta})e^{-\beta t} + \frac{\alpha}{\beta}\\ u_\infty &= (u_0 - \frac{\alpha}{\beta})e^{-\infty} + \frac{\alpha}{\beta} = \frac{\alpha}{\beta}\\ \text{Here we set } \frac{\alpha}{\beta} = u_\infty\\ u(t) &= (u_0 - u_\infty)e^{-\beta t} + u_\infty\\ \tau_i &= - \frac{1}{\beta} log \bigg( \frac{u_i - u_\infty}{u_{0,i} - u_\infty}\bigg) \end{aligned} $$ ### Computing transition probabilities from velocities Assuming that velocities truthfully describe the actual dynamics locally, we estimate transition probabilities of cell-to-cell transitions. Here they claim that for $d$ genes across $n$ cells: - transition probability matrix is $n \times n$ - gene expression matrix $S \in \mathbb{R}^{n \times d}$ - velocity vector $\vec v = (\vec v_1, \vec v, ..., \vec v_n)$ is $n \times 1$, among which $\vec v_i \in \mathbb{R}^d$ predict the change in gene expression vector of cell $i$, which is denoted as $\vec s_i \in \mathbb{R}^d$. Cell $i$ is expected to transite towards cell $j$ when the expression difference vector $\delta_{ij} = s_j - s_i$ matches the predicted change according to the velocity vector $v_i$, this is calculated by calcualting the cosine similarity between two vectors: $$ \pi_{ij} = cos \angle (\delta_{ij},v_i) = \frac{\delta_{ij}^T v_i}{||\delta_{ij}||\times||v_i||} $$ where $\pi_{ii} = 0$. This similarity solely in directionality, not in magnitude, and range from $-1$ (opposite direction) over $0$ (orthogonal, thus maximally dissimilar) to $1$ (identical direction). The resulting similarity matrix $\pi$ encodes a graph, which we refer to as `velocity graph`. Optionally, a variance stabilizing transformation can be applied for $\delta$ and $v$ vefore similarity calculation. An exponential kernel is applied in order to transform the cosine correlation into transition probabilities: $$ \tilde \pi_{ij} = \frac{1}{z_i} exp(\frac{\pi_{ij}}{\sigma_i^2}) $$ with row normalization factors $z_i = \sum_j exp(\frac{\pi_{ij}}{\sigma_i^2})$, and kernel width parameters $\sigma_i$ optionally adjusted for each cell locally (across neighboring cells) The transition probabilities are aggregated into a **transition matrix or transport map $\tilde \pi$** describing the markov chain of the differentiation process. - [ ](**what is a distribution?**) **A distribution of cells** $\mu = (\mu_1, ..., \mu_n)$, where $\mu_i \in \{0,1\}$ can be pushed through the transport map to obtain its descendant distribution. Reversely, a distribution $\mu$ can also be pulled back through the transport map to obtain its ancestors, $$\begin{aligned} \mu^{des} &= \mu \cdot \tilde \pi\\ \mu^{anc} &= \mu \cdot \tilde \pi ^T\\ \end{aligned} $$ A descendant or ancestor distribution of a set of cells $\mathcal{S} = {s_1, ..., s_n}$ can be obtained by setting $\mu_i = 1_{s_i \in \mathcal{S}}$, where $1$ denotes the indicator function. ## Gene-shared latent time After inferring parameters $\theta$ of kinetic rates, a (**cell specific?**)**gene-shared latent time** is computed as follows, which mean each cell has a latent time. 1. **gene-specific time points** of well-fitted genes (with a likelihood of at least 0.1) are normalized to a common overall timescale. - [ ] **Where are these time points from?** 3. The root cells $o$ of the differentiation process are obtained by computing the stationary states $\mu^*$ satistifying$$\mu^* = \mu^* \tilde \pi^T $$ which is simply given by the left eigenvectors of $\tilde \pi^T$ corresponding to an eigenvalues of $1$. 4. Now, for every root cell $o$, we compute the p-quantile $Q^g_p$ of all respective gene-specific time increasements across all genes $g$ by $$t_{i,o} = Q^g_p(\vec t^g_i - \vec t^g_o)$$where - $t_{i,o}$ is the time increasement of cell $i$ to cell $o$ - $\vec t_i^g$ is the gene-specific timepoints vector for cell $i$ - $\vec t_i^o$ is the gene-specific timepoints vector for cell $o$ - Here p is chosen such that it maximizes the correlation between the resulting gene-shared time-course vector $t_o = (t_{0,o},...,t_{N,o})$ and its **convolution** across local neighborhood of cells. - [ ]what is convolution? - A high correlation with the convolution of latent time improves robustness and consistency in the estimate - The rationile behind taking the $p$-quantile is the adaption to a non-uniform density of cells along the time-course as cells often tend to accumulate in later time points (velocyto did this when computing t-SNE arrows). - In summary, for each root cell we find the respective time increments that achieve best overall accordance with the learned dynamics and yield local coherence. 5. Finally, for robustness, by regressing the gene-shared latent time-course against its neighborhood convolution, we detect inconsistent time points and replace them with their convolutions. ## Projection of velocities into the embedding The projection of velocities into a lower-dimensional embedding (e.g. t-SNE) for a cell $i$ is obtained on the basis of a transition matrix $\tilde \pi$ from last section. The position of cells in an embedding, such as t-SNE or UMAP, are described by a set of vectors $\tilde s = (\tilde s_1, ..., \tilde s_n)$. Given the normalized differences of the embedding positions $\tilde \delta_{ij} = \frac{\tilde s_j - \tilde s_i}{||\tilde s_j - \tilde s_i||}$, the embedded velocity is estimated as the expected displacements w.r.t. the transition matrix $$ \tilde v_i = \mathbb{E}_{\tilde \pi_i \cdot}[\tilde \delta_{i \cdot}] = \sum_{j \ne i}\bigg( \tilde \pi_{ij} - \frac{1}{n} \bigg) \tilde \delta_{ij} $$ where subtracting $\frac{1}{n}$ corrects for the non-uniform density of points in the embedding. The directional flow is visualized as single-cell velocities or streamlines in any embedding. ## Accounting for stochasticity through second-order moments The model for velocity estimation can be extended with higher order moments, obtained by treating transcription, splicing and degradation as probabilistic events. In this regard, the probabilities of all possible reactions corresponding to these events occurring within an infinitesimal time interval $(t,t+dt]$ are provided as follows: $$\begin{aligned} &\mathbb{P}(u_{t+dt} = u_t + 1 &, s_{t+dt} &= s_t &) &= \alpha \; dt\\ &\mathbb{P}(u_{t+dt} = u_t - 1 &, s_{t+dt} &= s_t + 1 &) &= \beta u_t \; dt\\ &\mathbb{P}(u_{t+dt} = u_t &, s_{t+dt} &= s_t - 1 &) &= \gamma s_t \; dt \end{aligned} $$ The first equation means that during $(t,t+dt]$, the probablility of spliced mRNA $+1$ is $\alpha \; dt$. > $1^{st}$ uncentered moment: $\mu'_1 = \mu$ > $2^{st}$ uncentered moment: $\mu'_2 = \mu^2 + \sigma^2$ From these probabilites, the time derivative for the **uncentered moment** $<u^l_t s^k_t> = \mathbb{E}[u^l_t s^k_t] = \sum_i \mathbb{P}_i x_i$ is derived as $$ \frac{d \langle u^l_t s^k_t \rangle}{dt} = \bigg\langle \alpha \big((u_t + 1)^l s_t^k - u^l_t s^k_t \big) \bigg\rangle + \bigg\langle \beta u_t \big((u_t-1)^l(s_t +1)^k - u^l_t s^k_t \big) \bigg\rangle + \bigg\langle \gamma s_t \big(u_t^l(s_t-1)^k - u^l_t s^k_t \big) \bigg\rangle $$ where $l$ and $k$ are the order of $u_t$ and $s_t$. Hence, the first- and second-order dynamics are given by replacing $l$ and $k$ with $0,1,2$ respectively: \begin{aligned} \frac{d \langle u_t \rangle}{dt} &= \alpha - \beta \langle u_t \rangle\\ \frac{d \langle s_t \rangle}{dt} &= \beta \langle u_t \rangle - \gamma \langle s_t \rangle\\ \frac{d \langle u^2_t \rangle}{dt} &= \alpha + 2\alpha \langle u_t \rangle + \beta \langle u_t \rangle - 2 \beta \langle u_t^2 \rangle\\ \frac{d \langle u_t s_t \rangle}{dt} &= \alpha \langle s_t \rangle + \beta \langle u_t^2 \rangle - \beta \langle u_t s_t \rangle - \gamma \langle u_t s_t \rangle \\ \frac{d \langle s^2_t \rangle}{dt} &= \beta \langle u_t \rangle + 2 \beta \langle u_t s_t \rangle + \gamma \langle s_t \rangle - 2\gamma \langle s_t^2 \rangle \end{aligned} The moments of **each cell** are computed among a preset number of nearest neighbors of the corresponding cell. $$ \underbrace{ \begin{pmatrix} \langle u_t \rangle \\ \langle u_t \rangle + 2 \langle u_t s_t \rangle \end{pmatrix} }_{u'} = \gamma \; \underbrace{\begin{pmatrix} \langle s_t \rangle \\ 2\langle s^2_t \rangle - \langle s_t \rangle \end{pmatrix} }_{s'} + \epsilon' $$ where $\mathbb E [\epsilon' | s'] = 0$ the expectation of $\epsilon'$ given $s'$ is a variable depends on $s'$. $Cov[\epsilon' | s'] = \Omega$. - [ ] Why there is $Cov$? $[\epsilon'|s']$ is a sigle variable. The steady-state ratio can be solved explicitly by generalized least squares and is given by $$ \gamma = \bigg( s'^T \Omega^{-1} s' \bigg)^{-1} s'^T \Omega^{-1} u' $$ The stochastic model thereby exploits not only the relationship between unspliced and spliced mRNA abundances, but also their covariation ### Validation metrics To validate the coherence of the velocity vector field, we define a consistency score for each cell $i$ as the mean correlation of its velocity $v_i$ with velocities from neighboring cells, $$ c_i = \big\langle cor(v_i,v_j) \big\rangle_j $$ To validate the contribution of a selection of genes(e.g. top likelihood-ranked genes) to the overall inferred dynamics, we defined a reconstructability score as follow: The velocity graph (velocity matrix, $\pi$) consisting of correlations between velocities and cell-to-cell transitions, is computed twice: 1. including all genes yielding $\pi$ (cosine similarity), and once 2. only including the selection of genes (in section **gene shared latent time**). The reconstrictability score is defined as the median correlation of outgoing transition from cell $i$ to all cells that it can potentially transition to, i.e., $$ r = median_i corr(\pi_i, \pi'_i) \text{ , across all cells } i $$ ###### tags: `Papers`