Try   HackMD

Methods sections for Ki's paper

Quality control of scRNA-Seq

  • Alexey made UMAPs showing how well genotypes and batches mixed.
  • He checked percentage of mitochondrial genes. He did not remove any cells due to too high percentages. He similarly checked the percentage of expression of mup genes, i.e. genes whose name starts with "mup" (whatever these are).
  • For the hepatocytes, he performed clustering with Seurat and found one cluster with strong expression of markers indicating other cell types than hepatocytes (markers for LSECs, stAellate cells, and/or Kupffer cells) and removed the cells of this cluster for subsequent analysis
  • For the LSECs, he did the same and also found one cluster that needed removal
  • I have some plots, if needed.

Assignment of zonation position

"Raw" position score

To each sequenced cell

c, we assigned a "zonation position"
ηc
, which is a quantity indicating the cell's position along the central-to-portal zonation axis, as inferred from the cell's expression of zonation markers. We summed up the cell's expression of portal and central markers, yielding sums
ScP
and
ScP
, and transform this into a zonation position
η
, such that a cell expressing only the portal and none of the central marker genes is assigned
ηc=0
, while a cell expressing only central and no portal markers gets
ηc=1
. Values in between indicate mixed expression and hence an intermediate position. Details on how the expression sum are calculated are given below.

For the hepatocytes, we used the following marker list:

  • markers for hepatocytes near portal vein:
    LP
    = {Alb*, Apof, Arg1, Apom, Asgr2, Asl*, Ass1*, Atp5a1, Cps1, C1s1, C8b, Cpt2, Cyp2f2*, Tkfc, Eef1b2, Elovl2, Fads1, Fbp1, Gc, Gnmt, Hsd17b13, Ifitm3, Igf1, Igfals, Ndufb10, Pck1, Pigr, S100a1, Serpina1c, Serpina1e, Serpind1, Serpinf1, Trf, Uqcrh, Vtn}
  • markers for hepatocytes near central vein:
    LC
    = {Alad, Aldh1a1, C6, Nat8f2, Cpox, Csad, Cyb5a, Cyp1a2, Cyp2c37, Cyp2c50, Cyp2e1*, Cyp3a11, Glu*, Gstm1, Hpd, Lect2, Mgst1, Oat, Pon1, Prodh, Rgn, Slc16a10}.

These markers are the ones given in the Itzovitz paper [INSERT REF], to which we have added six more genes, marked with an asterisk.

We write

kgc for the UMI count for gene
g
in cell
c
,
sc=gkgc
for the total UMI count of cell
c
, and
xgc=kgc/sc
for the fraction of UMIs of cell
c
that fall to gene
g
. For all marker genes, we normalize these fractions by dividing them by the gene's maximum expression,
ygc=xgc/xgmax
. The maximum
xgmax
is here taken over all hepatocytes within a genotype, i.e., it is taken separately for the wildtype and the KO cells.

Now, we calculate for each cell

c the sum of the maximum-normalized expression of the portal markers,
ScP=gLPygc
, and of the central markers,
ScC=gLCygc
, and get a fraction from this:
ηc=ScC/(ScP+ScC)
.

For the LSECs, we used the same method, and the following marker list:

  • markers for LSECs near central vein: Ang, Cd81, Emcn, Gas6, Gja1, Gja4, Id4, Ier3, Lfng, Plvap, Ptgs1, Rspo3, Tcim, Wnt2, Wnt9b.
  • markers for LSECs near portal vein: Angpt2, Cxcl9, Dll4, Efnb1, Efnb2, Esm1, Id2, Insr, Jak1, Lama4, Ltbp4, Ly6a, Msr1, Ntn4, Osmr.

Ranked position score

Our knock-out influences zonation, specifically, it weakens the "contrast" of expression differences between periportal and pericentral cells. Some of our markers no longer differ in expression in the knock-out. Fortunately, other marker genes are still informative and therefore, we can still assign position information. However, the lost markers "dilute" to signal and hence, the distribution of

ηc over cells differs markedly between wildtype and knockout: the values for
ηc
tend to be lower in KO [or, is it higher? CHECK].

Furthermore, while we may assume that

η is monotonously related to the cells' actual positions along the central-to-portal axis, there is no reason to assume this relation to be linear, nor that it is not affected by treatment or batch effects. Hence, it seems prudent to use the
η
values only to order the cells but to not make use of their absolute values.

Therefore, we replace each cell's

η value with the cell's rank, scaled to the unit interval, using R's cume_dist function. In other words: Given a list of
n
cells with their
η
values, we order the cells by increasing
η
, and then assign to each cell a rank score, which is simply the cell's (zero-based) rank in the list, divided by
n1
, such that the first (most central-like) cell gets score 0, then the score increases by
1/n
for each cell, until the last (most portal-like) cell gets score 1. In the following section, this score will be referred to as etaq.

Importantly, the sorting is done separately for each experimental batch, i.e. each cell's

η value gets ranked within a list comprising only cells with the same genotype, from the same mouse, and processed in the same plate. In this manner, we avoid technical batch effects influencing the result and also even out the signal dilution due to the knockout described before.

Smoothing gene expression along zonation

Spline regression

In order to present the expression of a gene along the zonation axis as a smooth curve, we used spline regression via generalized linear models of the negative-binomial family with logarithmic link with a 4-dimensional B-spline basis.

It seems most concise to describe the details by providing minimal R code.

Assume that data is a data frame (table) with one row for each cell and columns as follows: count is the UMI count for the gene of interest, total is the cell's total UMI count (sumemd over all genes), etaq is the cell's rank-normalized position score obtained as described above, and mouse is a factor indicating which mouse the cell is from (and thereby also implying the genotype). The data frame contains rows for all hepatocytes (or all LSECs), combining both genotypes.

For spline regression, we use a basis of 4 cubic B splines, as produced in R by the bs ("basis spline") function from R's core package "splines":

spline_matrix <- splines::bs( data$etaq, df=4, 
  intercept=TRUE, Boundary.knots = c(0,1) )

The actual regression is then performed by

MASS::glm.nb( count ~ spline_matrix + 0 + offset(log(total)),
    modelframe )

Here, glm.nb is the GLM fit function for negative-binomial GLMs from the MASS package [Cite MASS book], + 0 causes the model to be fitted without intercept (because the intercept is already accounted for in the spline matrix), and the offset argument accounts for the total cell count, using a logarithm as this is the canonical and default link function for the negative-binomial GLM.

The regression yields coefficients

βrm, one for each of the 4 spline bases (
r=1,..,4
) and each mouse
m
. A smoothing spline for the gene of interest can then be obtained for each mouse
m
as
μm(ηQ)=exp(rβrmBr(ηQ))
. Here,
ηQ
is the x axis for the plot, running from 0 to 1 and indicating the rank-normalized zonation position, and the
Br(x)
are the four basis functions of the standard 4th degree B spline basis (as returned by the bs call given above, when using
x
as first argument). The fitted value
μ(ηQ)
is the expected expression fraction of the gene in a cell at position rank score
ηQ
.

In our plots, y axes depicting

μ(ηQ) are shown with a square-root scaling in order to improve homoscedasticity (because the square root is the variance-stabilizing transformation for the Poisson distribution).

In order to allow the reader to visually assess reproducibility of a gene's expression pattern across biological replicates, we calculate splines separately for each mouse and plot one curve per mouse. Note that it is also possible to formally test for statistical significance by performing ANOVA on the coefficients

βrm with respect to the mouse genotype. [TODO: DECIDE WHETHER TO INCLUDE OR REMOVE LAST PARAGRAPH/SENTENCE.]