# Methods sections for Ki's paper ## Quality control of scRNA-Seq - Alexey made UMAPs showing how well genotypes and batches mixed. - He checked percentage of mitochondrial genes. He did not remove any cells due to too high percentages. He similarly checked the percentage of expression of mup genes, i.e. genes whose name starts with "mup" (whatever these are). - For the hepatocytes, he performed clustering with Seurat and found one cluster with strong expression of markers indicating other cell types than hepatocytes (markers for LSECs, stAellate cells, and/or Kupffer cells) and removed the cells of this cluster for subsequent analysis - For the LSECs, he did the same and also found one cluster that needed removal - I have some plots, if needed. ## Assignment of zonation position ### "Raw" position score To each sequenced cell $c$, we assigned a "zonation position" $\eta_c$, which is a quantity indicating the cell's position along the central-to-portal zonation axis, as inferred from the cell's expression of zonation markers. We summed up the cell's expression of portal and central markers, yielding sums $S_c^\text{P}$ and $S_c^\text{P}$, and transform this into a zonation position $\eta$, such that a cell expressing only the portal and none of the central marker genes is assigned $\eta_c=0$, while a cell expressing only central and no portal markers gets $\eta_c=1$. Values in between indicate mixed expression and hence an intermediate position. Details on how the expression sum are calculated are given below. For the hepatocytes, we used the following marker list: - markers for hepatocytes near portal vein: $L_\text{P}$ = {Alb*, Apof, Arg1, Apom, Asgr2, Asl*, Ass1*, Atp5a1, Cps1, C1s1, C8b, Cpt2, Cyp2f2*, Tkfc, Eef1b2, Elovl2, Fads1, Fbp1, Gc, Gnmt, Hsd17b13, Ifitm3, Igf1, Igfals, Ndufb10, Pck1, Pigr, S100a1, Serpina1c, Serpina1e, Serpind1, Serpinf1, Trf, Uqcrh, Vtn} - markers for hepatocytes near central vein: $L_\text{C}$ = {Alad, Aldh1a1, C6, Nat8f2, Cpox, Csad, Cyb5a, Cyp1a2, Cyp2c37, Cyp2c50, Cyp2e1*, Cyp3a11, Glu*, Gstm1, Hpd, Lect2, Mgst1, Oat, Pon1, Prodh, Rgn, Slc16a10}. These markers are the ones given in the Itzovitz paper [INSERT REF], to which we have added six more genes, marked with an asterisk. We write $k_{gc}$ for the UMI count for gene $g$ in cell $c$, $s_c=\sum_g k_{gc}$ for the total UMI count of cell $c$, and $x_{gc} = k_{gc} / s_c$ for the fraction of UMIs of cell $c$ that fall to gene $g$. For all marker genes, we normalize these fractions by dividing them by the gene's maximum expression, $y_{gc} = x_{gc} / x^\text{max}_g$. The maximum $x^\text{max}_g$ is here taken over all hepatocytes within a genotype, i.e., it is taken separately for the wildtype and the KO cells. Now, we calculate for each cell $c$ the sum of the maximum-normalized expression of the portal markers, $S_c^\text{P} = \sum_{g\in L_\text{P}} y_{gc}$, and of the central markers, $S_c^\text{C} = \sum_{g\in L_\text{C}} y_{gc}$, and get a fraction from this: $\eta_c = S_c^\text{C}\, /\, ( S_c^\text{P} + S_c^\text{C} )$. For the LSECs, we used the same method, and the following marker list: - markers for LSECs near central vein: Ang, Cd81, Emcn, Gas6, Gja1, Gja4, Id4, Ier3, Lfng, Plvap, Ptgs1, Rspo3, Tcim, Wnt2, Wnt9b. - markers for LSECs near portal vein: Angpt2, Cxcl9, Dll4, Efnb1, Efnb2, Esm1, Id2, Insr, Jak1, Lama4, Ltbp4, Ly6a, Msr1, Ntn4, Osmr. ### Ranked position score Our knock-out influences zonation, specifically, it weakens the "contrast" of expression differences between periportal and pericentral cells. Some of our markers no longer differ in expression in the knock-out. Fortunately, other marker genes are still informative and therefore, we can still assign position information. However, the lost markers "dilute" to signal and hence, the distribution of $\eta_c$ over cells differs markedly between wildtype and knockout: the values for $\eta_c$ tend to be lower in KO [or, is it higher? CHECK]. Furthermore, while we may assume that $\eta$ is monotonously related to the cells' actual positions along the central-to-portal axis, there is no reason to assume this relation to be linear, nor that it is not affected by treatment or batch effects. Hence, it seems prudent to use the $\eta$ values only to order the cells but to not make use of their absolute values. Therefore, we replace each cell's $\eta$ value with the cell's rank, scaled to the unit interval, using R's `cume_dist` function. In other words: Given a list of $n$ cells with their $\eta$ values, we order the cells by increasing $\eta$, and then assign to each cell a rank score, which is simply the cell's (zero-based) rank in the list, divided by $n-1$, such that the first (most central-like) cell gets score 0, then the score increases by $1/n$ for each cell, until the last (most portal-like) cell gets score 1. In the following section, this score will be referred to as `etaq`. Importantly, the sorting is done separately for each experimental batch, i.e. each cell's $\eta$ value gets ranked within a list comprising only cells with the same genotype, from the same mouse, and processed in the same plate. In this manner, we avoid technical batch effects influencing the result and also even out the signal dilution due to the knockout described before. ## Smoothing gene expression along zonation ### Spline regression In order to present the expression of a gene along the zonation axis as a smooth curve, we used spline regression via generalized linear models of the negative-binomial family with logarithmic link with a 4-dimensional B-spline basis. It seems most concise to describe the details by providing minimal R code. Assume that `data` is a data frame (table) with one row for each cell and columns as follows: `count` is the UMI count for the gene of interest, `total` is the cell's total UMI count (sumemd over all genes), `etaq` is the cell's rank-normalized position score obtained as described above, and `mouse` is a factor indicating which mouse the cell is from (and thereby also implying the genotype). The data frame contains rows for all hepatocytes (or all LSECs), combining both genotypes. For spline regression, we use a basis of 4 cubic B splines, as produced in R by the `bs` ("basis spline") function from R's core package "splines": ``` spline_matrix <- splines::bs( data$etaq, df=4, intercept=TRUE, Boundary.knots = c(0,1) ) ``` The actual regression is then performed by ``` MASS::glm.nb( count ~ spline_matrix + 0 + offset(log(total)), modelframe ) ``` Here, `glm.nb` is the GLM fit function for negative-binomial GLMs from the MASS package [Cite MASS book], `+ 0` causes the model to be fitted without intercept (because the intercept is already accounted for in the spline matrix), and the `offset` argument accounts for the total cell count, using a logarithm as this is the canonical and default link function for the negative-binomial GLM. The regression yields coefficients $\beta_{rm}$, one for each of the 4 spline bases ($r=1,..,4$) and each mouse $m$. A smoothing spline for the gene of interest can then be obtained for each mouse $m$ as $\mu_m(\eta_\text{Q}) = \exp\left( \sum_{r} \beta_{rm}B_r(\eta_\text{Q})\right)$. Here, $\eta_Q$ is the x axis for the plot, running from 0 to 1 and indicating the rank-normalized zonation position, and the $B_r(x)$ are the four basis functions of the standard 4th degree B spline basis (as returned by the `bs` call given above, when using $x$ as first argument). The fitted value $\mu(\eta_Q)$ is the expected expression fraction of the gene in a cell at position rank score $\eta_Q$. In our plots, y axes depicting $\mu(\eta_Q)$ are shown with a square-root scaling in order to improve homoscedasticity (because the square root is the variance-stabilizing transformation for the Poisson distribution). In order to allow the reader to visually assess reproducibility of a gene's expression pattern across biological replicates, we calculate splines separately for each mouse and plot one curve per mouse. Note that it is also possible to formally test for statistical significance by performing ANOVA on the coefficients $\beta_{rm}$ with respect to the mouse genotype. [TODO: DECIDE WHETHER TO INCLUDE OR REMOVE LAST PARAGRAPH/SENTENCE.]