Cell type annotation
Obtaining clusters of cells is fairly straightforward, but it is more difficult to determine what biological state is represented by each of those clusters. Doing so requires us to bridge the gap between the current dataset and prior biological knowledge, and the latter is not always available in a consistent and quantitative manner. Indeed, even the concept of a “cell type” is not clearly defined, with most practitioners possessing an “I’ll know it when I see it” intuition that is not amenable to computational analysis. As such, interpretation of scRNA-seq data is often manual and a common bottleneck in the analysis workflow.
To expedite this step, we can use various computational approaches that exploit prior information to assign meaning to an uncharacterized scRNA-seq dataset. The most obvious approach is to directly compare our expression profiles to published reference datasets where each sample or cell has already been annotated with its putative biological state by domain experts.
A conceptually straightforward annotation approach is to compare the single-cell expression profiles with previously annotated reference datasets. Labels can then be assigned to each cell in our uncharacterized test dataset based on the most similar reference sample(s), for some definition of “similar”. Any published and labelled RNA-seq dataset (bulk or single-cell) can be used as a reference, though its reliability depends greatly on the expertise of the original authors who assigned the labels in the first place.
SingleR (Aran et al. 2019) is one such method: it assigns labels to cells based on the reference samples with the highest Spearman rank correlations, using only the marker genes between pairs of labels to focus on the relevant differences between cell types. It also performs a fine-tuning step for each cell where the correlations are recomputed with just the marker genes for the top-scoring labels. This aims to resolve any ambiguity between those labels by removing noise from irrelevant markers for other labels.
SingleR detects markers in a pairwise manner between labels in the reference dataset. Specifically, for each label of interest, it performs pairwise comparisons to every other label in the reference and identifies the genes that are upregulated in the label of interest for each comparison. The initial score calculation is then performed on the union of marker genes across all comparisons for all labels. This approach ensures that the selected subset of features will contain genes that distinguish each label from any other label. It also allows the fine-tuning step to aggressively improve resolution by only using marker genes from comparisons where both labels have scores close to the maximum.
The original (“classic”) marker detection algorithm used in Aran et al. (2019) identified marker genes based on their log-fold changes in each pairwise comparison. Specifically, it used the genes with the largest positive differences in the per-label median log-expression values between labels. The number of genes taken from each pairwise comparison was defined as $500 \cdot (2/3)^{\log_2(n)}$, where $n$ is the number of unique labels in the reference; this scheme aimed to reduce the number of genes (and thus the computational time) as the number of labels and pairwise comparisons increased. Classic mode is primarily intended for reference datasets that have little or no replication, a description that covers many of the bulk-derived references and precludes more complicated marker detection procedures.
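To make this concrete, the sketch below shows what a minimal classic-mode annotation might look like. The test object `sce.test` is a hypothetical log-normalized SingleCellExperiment, and the bulk-derived reference from the celldex package is just one illustrative choice; both are assumptions rather than part of any specific workflow.

```r
# Minimal sketch of classic-mode annotation with a bulk-derived reference.
# 'sce.test' is a hypothetical log-normalized SingleCellExperiment.
library(SingleR)
library(celldex)

ref.bulk <- HumanPrimaryCellAtlasData()  # bulk reference with curated labels

pred.classic <- SingleR(test = sce.test, ref = ref.bulk,
                        labels = ref.bulk$label.main)  # classic marker detection by default
table(pred.classic$labels)
```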
Controlling marker detection
We have already introduced the classic approach in the previous chapter, but it is similarly straightforward to perform marker detection with conventional statistical tests. In particular, we identify top-ranked markers based on pairwise Wilcoxon rank sum tests or t-tests between labels; this allows us to account for the variability across cells to choose genes that are robustly upregulated in each label.
The availability of variance-aware marker detection methods is most relevant for reference datasets that contain a reasonable number (i.e., at least two) of replicate samples for each label. An obvious use case is that of single-cell datasets that are used as a reference to annotate other single-cell datasets. It is also possible for users to supply their own custom marker lists to SingleR(), facilitating the incorporation of prior biological knowledge into the annotation process.
Annotation with test-based marker detection
First, we set up our reference single-cell dataset, computing log-normalized expression values as discussed previously.
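A sketch of this set-up, assuming the reference is a SingleCellExperiment named `sce.ref` with author-provided labels in `sce.ref$label` (both names are hypothetical):

```r
# Compute log-normalized expression values for the reference.
library(scuttle)
sce.ref <- logNormCounts(sce.ref)
```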
We then run SingleR() as described previously, but with a marker detection mode that considers the variance of expression across cells. Here, we will use the Wilcoxon rank sum test to identify the top markers for each pairwise comparison between labels. This is slower but more appropriate for single-cell data than the default marker detection algorithm, as the latter may fail for low-coverage data where the median for each label is often zero.
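A sketch of such a call, continuing with the hypothetical `sce.test` and `sce.ref` objects from above:

```r
# Wilcoxon-based marker detection across the single-cell reference profiles.
library(SingleR)
pred.wilcox <- SingleR(test = sce.test, ref = sce.ref,
                       labels = sce.ref$label, de.method = "wilcox")
table(pred.wilcox$labels)
```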
By default, the function will take the top 10 genes from each pairwise comparison between labels. A larger number of markers increases the robustness of the annotation by ensuring that relevant genes are not omitted, especially if the reference dataset has study-specific effects that cause uninteresting genes to dominate the top set. However, this comes at the cost of increasing noise and computational time.
Defining custom markers
The marker detection in SingleR() is built on top of the testing framework in scran, so most of the options available there can also be used here. For example, we could use the t-test and test against a log-fold change threshold, as described previously.
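A sketch of a t-test-based call against a log-fold change threshold of 1; here we assume that `de.args` is passed through to the underlying scran pairwise test, so check the documentation of your installed SingleR version for the exact argument names.

```r
# t-test marker detection with a log-fold change threshold of 1.
pred.t <- SingleR(test = sce.test, ref = sce.ref, labels = sce.ref$label,
                  de.method = "t", de.args = list(lfc = 1))
```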
However, users can also construct their own marker lists with any DE testing machinery. For example, we can perform pairwise binomial tests to identify genes that are differentially detected (i.e., have differences in the proportion of cells with non-zero counts) between labels in the reference dataset. We then take the top 10 marker genes from each pairwise comparison, obtaining a list of lists of character vectors containing the identities of the markers for each comparison.
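A sketch of this custom marker detection using scran, with the object names carried over from above:

```r
# Pairwise binomial tests between reference labels, keeping genes that are
# more frequently detected in the first label of each comparison.
library(scran)
binom.out <- pairwiseBinom(sce.ref, groups = sce.ref$label, direction = "up")

# Top 10 markers per pairwise comparison, as a list of lists of gene names.
custom.markers <- getTopMarkers(binom.out$statistics, binom.out$pairs, n = 10)
```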
Once we have this list of lists, we supply it to SingleR() via the genes= argument, which causes the function to bypass the internal marker detection and use the supplied gene sets instead. The most obvious benefit of this approach is that the user achieves greater control over the markers, allowing integration of prior biological knowledge to obtain more relevant genes and a more robust annotation.
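Continuing the sketch, the custom list of lists is supplied via genes=:

```r
# Bypass internal marker detection and use the binomial-test markers instead.
pred.custom <- SingleR(test = sce.test, ref = sce.ref,
                       labels = sce.ref$label, genes = custom.markers)
table(pred.custom$labels)
```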
In some cases, markers may only be available for specific labels rather than for pairwise comparisons between labels. This is accommodated by supplying a named list of character vectors to genes=. Note that this is likely to be less powerful than the list-of-lists approach as information about pairwise differences is discarded.
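For illustration only, a sketch with per-label marker sets; the gene symbols are conventional immune markers chosen for the example, and we assume (purely for this toy case) that the reference contains only these two labels:

```r
# Named list of per-label markers; names must match the reference labels.
label.markers <- list(
    `B cells` = c("MS4A1", "CD79A", "CD79B"),
    `T cells` = c("CD3D", "CD3E", "CD2")
)
pred.simple <- SingleR(test = sce.test, ref = sce.ref,
                       labels = sce.ref$label, genes = label.markers)
```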
The most obvious diagnostic reported by SingleR() is the nested matrix of per-cell scores in the scores field. This contains the correlation-based scores prior to any fine-tuning for each cell (row) and reference label (column). Ideally, we would see unambiguous assignments where, for any given cell, one label’s score is clearly larger than the others.
To check whether this is indeed the case, we use the plotScoreHeatmap() function to visualize the score matrix. Here, the key is to examine the spread of scores within each cell, i.e., down the columns of the heatmap. Similar scores for a group of labels indicates that the assignment is uncertain for those columns, though this may be acceptable if the uncertainty is distributed across closely related cell types. (Note that the assigned label for a cell may not be the visually top-scoring label if fine-tuning is applied, as only the pre-tuned scores are directly comparable across all labels.)
We can also display other metadata information for each cell by setting clusters= or annotation_col=. This is occasionally useful for examining potential batch effects, differences in cell type composition between conditions, relationships to clusters from an unsupervised analysis and so on. For example, displaying the donor of origin for each cell can show that each cell type has contributions from multiple donors, which is reassuring as it indicates that our assignments are not (purely) driven by donor effects.
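A sketch of both calls, where the donor column is a hypothetical annotation assumed to be stored in `sce.test$donor`:

```r
# Heatmap of the pre-tuned scores for each cell.
plotScoreHeatmap(pred.wilcox)

# Same heatmap, annotated with a hypothetical per-cell donor of origin.
plotScoreHeatmap(pred.wilcox,
    annotation_col = data.frame(donor = sce.test$donor,
                                row.names = colnames(sce.test)))
```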
Based on the deltas across cells
We identify poor-quality or ambiguous assignments based on the per-cell “delta”, i.e., the difference between the score for the assigned label and the median score across all labels for each cell. Our assumption is that most of the labels in the reference are not relevant to any given cell. Thus, the median across all labels can be used as a measure of the baseline correlation, while the gap from the assigned label to this baseline can be used as a measure of the assignment confidence.
Low deltas indicate that the assignment is uncertain, possibly because the cell’s true label does not exist in the reference. An obvious next step is to apply a threshold on the delta to filter out these low-confidence assignments. We use the delta rather than the assignment score as the latter is more sensitive to technical effects. For example, changes in library size affect the technical noise and can increase or decrease all scores for a given cell, while the delta is somewhat more robust as it focuses on the differences between scores within each cell.
By default, SingleR() will set a threshold on the delta for each label using an outlier-based strategy. Specifically, we identify cells with deltas that are small outliers relative to the deltas of other cells with the same label. This assumes that, for any given label, most cells assigned to that label are correct. We focus on outliers to avoid difficulties with setting a fixed threshold, especially given that the magnitudes of the deltas are about as uninterpretable as the scores themselves. Pruned labels are reported in the pruned.labels field, where low-quality assignments are replaced with NA.
However, the default pruning parameters may not be appropriate for every dataset. For example, if one label is consistently misassigned, the assumption that most cells are correctly assigned will not be appropriate. In such cases, we can revert to a fixed threshold by manually calling the underlying pruneScores() function with min.diff.med=. The example below discards cells with deltas below an arbitrary threshold of 0.5, where higher thresholds correspond to greater assignment certainty.
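A sketch of this manual pruning, continuing with the hypothetical pred.wilcox object from above:

```r
# Flag cells whose delta falls below the fixed threshold of 0.5.
to.remove <- pruneScores(pred.wilcox, min.diff.med = 0.5)
summary(to.remove)

# Replace the labels of the flagged cells with NA.
manual.pruned <- pred.wilcox$labels
manual.pruned[to.remove] <- NA
```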
This entire process can be visualized using the plotDeltaDistribution() function, which displays the per-label distribution of the deltas across cells. We can use this plot to check that the outlier-based pruning behaved sensibly. Labels with especially low deltas may warrant some additional caution in their interpretation.
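For example, a sketch of this diagnostic; the show= value selects the delta from the median, as we understand the current SingleR interface:

```r
# Per-label distribution of deltas across cells.
plotDeltaDistribution(pred.wilcox, show = "delta.med")
```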
If fine-tuning was performed, we can apply an even more stringent filter based on the difference between the highest and second-highest scores after fine-tuning. Cells will only pass the filter if they are assigned to a label that is clearly distinguishable from any other label. In practice, this approach tends to be too conservative, as assignments involving closely related labels are heavily penalized.
Based on marker gene expression
Another simple yet effective diagnostic is to examine the expression of the marker genes for each label in the test dataset. The marker genes used for each label are reported in the metadata() of the SingleR() output, so we can simply retrieve them to visualize their (usually log-transformed) expression values across the test dataset. For example, we can use the plotHeatmap() function from scater to examine the expression of the markers used to identify a particular label.
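A sketch for a single label, using the hypothetical objects from above and assuming that "B cells" is one of the reference labels:

```r
# Retrieve the per-label marker sets used by SingleR().
all.markers <- metadata(pred.wilcox)$de.genes

# Attach the assigned labels to the test dataset for plotting.
sce.test$labels <- pred.wilcox$labels

# Heatmap of the markers used to identify B cells, across all test cells.
library(scater)
plotHeatmap(sce.test, order_columns_by = "labels",
            features = unique(unlist(all.markers$`B cells`)))
```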
If a cell in the test dataset is confidently assigned to a particular label, we would expect it to have strong expression of that label’s markers. We would also hope that those markers are biologically meaningful; in this case, we do observe strong upregulation of CD20 (MS4A1) in the B cells, which is reassuring and gives greater confidence to the correctness of the assignment. If the identified markers are not meaningful or not consistently upregulated, some skepticism towards the quality of the assignments is warranted.
In practice, the heatmap may be overwhelmingly large if there are too many reference-derived markers. To resolve this, we can prune the set of markers to focus on the most interesting genes based on their expression profiles in the test data. For example, we can limit the heatmap to the top genes with the strongest evidence for upregulation in the test dataset given the assigned labels; such genes are effectively markers in both the reference and test datasets. As a diagnostic plot, this is much more amenable to quick inspection to check that the expected genes are present.
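One way to do this, as a sketch: rank genes in the test data with scran's findMarkers() using the assigned labels, then keep the reference-derived markers with the best empirical ranks (the choice of 30 genes is arbitrary).

```r
# Rank genes by their evidence for upregulation in each assigned label.
library(scran)
empirical <- findMarkers(sce.test, groups = sce.test$labels, direction = "up")

# Keep the reference-derived B cell markers with the best empirical ranks.
lab <- "B cells"
ref.markers <- unique(unlist(all.markers[[lab]]))
gene.ranks <- match(ref.markers, rownames(empirical[[lab]]))
best <- ref.markers[order(gene.ranks)][1:30]

plotHeatmap(sce.test, order_columns_by = "labels", features = best)
```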
It is straightforward to repeat this process for all labels by wrapping this code in a loop. Note that plotHeatmap() is not the only function that can be used for this visualization; we could also use plotDots() to create a Seurat-style dot plot, or we could use other heatmap plotting functions such as dittoHeatmap() from dittoSeq.
In general, the heatmap provides a more interpretable diagnostic visualization than the plots of scores and deltas. However, it does require more effort to inspect and may not be feasible for large numbers of labels. It is also difficult to use a heatmap to determine the correctness of assignments for closely related labels.
In some cases, we may wish to use multiple references to annotate a test dataset. This yields a more comprehensive set of cell types than is covered by any individual reference, especially when differences in resolution are considered. However, it is not trivial due to the presence of batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators.
Several strategies are available to combine inferences from multiple references:
• using reference-specific labels in a combined reference
• using harmonized labels in a combined reference
• combining scores across multiple references
Using reference-specific labels
In this strategy, each label is defined in the context of its reference dataset. This means that a label - say, “B cell” - in reference dataset X is considered to be different from a “B cell” label in reference dataset Y. Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts.
We can easily implement this approach by combining the expression matrices together and pasting the reference name onto the corresponding character vector of labels, as sketched below. This modification ensures that the downstream SingleR() call will treat each label-reference combination as a distinct entity.
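A sketch of this combination, assuming two hypothetical references `ref1` and `ref2` stored as SummarizedExperiments with a "logcounts" assay and labels in their `label` columns:

```r
library(SummarizedExperiment)

# Restrict to genes shared by both references and combine their log-counts.
shared <- intersect(rownames(ref1), rownames(ref2))
combined <- cbind(assay(ref1, "logcounts")[shared, ],
                  assay(ref2, "logcounts")[shared, ])

# Prefix each label with its reference of origin.
combined.labels <- c(paste0("ref1.", ref1$label), paste0("ref2.", ref2$label))

pred.prefixed <- SingleR(test = sce.test, ref = combined, labels = combined.labels)
table(pred.prefixed$labels)
```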
However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results. The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin.
Combined diagnostics
Many of the SingleR diagnostic plots described above will naturally operate on these combined results. For example, when multiple references are supplied to SingleR() at once, we can create a heatmap of the scores in each of the individual references as well as for the recomputed scores in the combined results. Note that scores are only recomputed for the labels predicted in the individual references, so all labels outside of those are simply set to NA and appear as large blank swathes in the heatmap.
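For example, a sketch of running SingleR() with both hypothetical references supplied at once (the third strategy listed above) and then visualizing the scores:

```r
# Annotate against both references simultaneously; SingleR() reports the
# per-reference results and a combined best call for each cell.
pred.multi <- SingleR(test = sce.test,
                      ref = list(REF1 = ref1, REF2 = ref2),
                      labels = list(ref1$label, ref2$label))

# Heatmaps of the scores from each individual reference plus the recomputed
# scores in the combined results.
plotScoreHeatmap(pred.multi)
```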
Sharing information during marker detection
One of the major problems with using multiple references is the presence of study-specific nomenclature. For example, the concept of a B cell may be annotated as "B cells" in one reference, "B_cells" in another, and then "B" or "B-cell" in others. We can overcome this by using harmonized labels where the same cell type is assigned the same label across references, simplifying interpretation and ensuring that irrelevant discrepancies in labelling do not interfere with downstream analyses.
A more advanced approach is to share information across references during the marker detection stage. This is done by favoring genes that exhibit consistent upregulation in multiple references, which increases the likelihood that those markers will generalize to other datasets. For classic marker detection, we achieve this by calling getClassicMarkers() on multiple references to obtain markers for use in SingleR(); the same effect can be achieved for the test-based methods by setting block=. We then use these improved markers by passing them to genes= as described before. In this case, we specify com.markers twice in a list to indicate that we are using them for both of our references, as shown in the sketch below.
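A sketch of this sharing, reusing the hypothetical `ref1` and `ref2` objects and assuming that their `label` columns already use the same harmonized vocabulary:

```r
# Markers detected within each reference, then merged across references.
com.markers <- getClassicMarkers(ref = list(ref1, ref2),
                                 labels = list(ref1$label, ref2$label))

# Use the shared markers for both references.
pred.harmonized <- SingleR(test = sce.test,
                           ref = list(ref1, ref2),
                           labels = list(ref1$label, ref2$label),
                           genes = list(com.markers, com.markers))
```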
It is worth noting that, in the above code, the marker genes are still identified within each reference, and the statistics are then merged across references to identify the top markers. This ensures that we do not directly compare expression values across references, which reduces the susceptibility of marker detection to batch effects.
The most obvious problem with this approach is that it assumes that harmonized labels are available. This is usually not true and requires some manual mapping of the author-provided labels to a common vocabulary. The mapping process also runs the risk of discarding relevant information about the biological status (e.g., activation status, disease condition) if there is no obvious counterpart for that state in the ontology.
Manual label harmonization
The matchReferences() function provides a simple approach to label harmonization between two references. Each reference is used to annotate the other, and the probability of mutual assignment between each pair of labels is computed, i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other, and vice versa? Probabilities close to 1 in both directions indicate a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.
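As one possible pairing, the sketch below compares the Blueprint/ENCODE and Human Primary Cell Atlas references from celldex; any two labelled references could be substituted here.

```r
# Mutual assignment probabilities between the labels of two references.
library(SingleR)
library(celldex)
bpe <- BlueprintEncodeData()
hpca <- HumanPrimaryCellAtlasData()

matched <- matchReferences(bpe, hpca, bpe$label.main, hpca$label.main)
pheatmap::pheatmap(matched)
```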
This can be used to guide harmonization, enforcing a consistent vocabulary between two sets of labels. However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies. In the example above, the cancer-related labels are considered unique to each reference, while smooth muscle cells in one reference are incorrectly matched to fibroblasts in the Blueprint data. CD4+ and CD8+ T cells are also both assigned to "T cells", so some decision about the acceptable resolution of the harmonized labels is required here.
As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering.
Adjusting resolution
We can use the ontology graph to adjust the resolution of the reference labels by rolling up overly-specific terms to their last common ancestor (LCA). The findCommonAncestors() utility takes a set of terms and returns a list of potential LCAs for various subsets of those terms. Users can inspect this list to identify LCAs at the desired resolution and then map their descendent terms to those LCAs.
We can also use this function to synchronize multiple sets of terms to the same resolution. For example, consider a reference dataset that provides highly resolved annotation of immune cell types. The findCommonAncestors() output specifies the origins of the descendents for each LCA, allowing us to focus on LCAs that have representatives in both sets of terms.
For example, we might notice that the mouse RNA-seq reference only has a single “T cell” term. To synchronize resolution across references, we would need to roll up all of the finely resolved T cell subsets in the other reference into that LCA. Of course, this results in some loss of precision and information; whether this is an acceptable price for simpler interpretation is a decision that is left to the user.
Cluster-level annotation
The default philosophy of SingleR is to perform annotation of each individual cell in the test dataset. An alternative strategy is to perform annotation of aggregated profiles for groups or clusters of cells.
By passing clusters= to SingleR(), we direct the function to compute an aggregated profile per cluster. Annotation is then performed on the cluster-level profiles rather than at the single-cell level. This has the major advantage of being much faster to compute as there are obviously fewer clusters than cells; it is also easier to interpret as it directly returns the likely cell type identity of each cluster.
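A sketch of this mode, assuming cluster assignments from an earlier clustering step are stored in colLabels(sce.test):

```r
# Annotate aggregated per-cluster profiles rather than individual cells.
pred.cluster <- SingleR(test = sce.test, ref = sce.ref,
                        labels = sce.ref$label, clusters = colLabels(sce.test))
table(pred.cluster$labels)
```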
This approach assumes that each cluster in the test dataset corresponds to exactly one reference label. If a cluster actually contains a mixture of multiple labels, this will not be reflected in its lone assigned label. (We note that it would be very difficult to determine the composition of the mixture from the SingleR() scores.) Indeed, there is no guarantee that the clustering is driven by the same factors that distinguish the reference labels, decreasing the reliability of the annotations when novel heterogeneity is present in the test dataset. The default per-cell strategy is safer and provides more information about the ambiguity of the annotations, which is important for closely related labels where a close correspondence between clusters and labels cannot be expected.
As a practical summary: if it makes biological sense in the context of the experiment, clusters can be merged together before annotation. Alternatively, the clustering resolution can simply be tweaked until each cluster has at least some minimum number of unique differentially expressed genes; as long as each cluster is defined by its own set of differentially expressed genes, the annotation should be reasonable. The clustree package (https://lazappi.github.io/clustree/articles/clustree.html) is also useful for examining how cluster assignments change across resolutions.
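A sketch of the clustree usage, assuming clusterings at several resolutions are stored in colData columns sharing a common prefix (here "res.", a hypothetical naming scheme):

```r
# Visualize how cells move between clusters as the resolution changes.
library(clustree)
clustree(sce.test, prefix = "res.")
```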