# 2 QC and Clustering

### Step 1: Setup in Google Colab

To run this notebook in Google Colab, follow these steps:

1. **Upload the Notebook**: Click "File" > "Upload notebook" and select the `.ipynb` file.
2. **Mount Google Drive (Optional)**: To save files or access datasets stored in Google Drive, mount your drive by running:

   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```

   This makes it easy to import and export files.
3. **Install Necessary Packages**: Make sure all required libraries are available. If the notebook uses packages that are not pre-installed in Colab, install them in a cell like the following:

   ```python
   !pip install scanpy anndata matplotlib
   ```

### Step 2: Run the Notebook Cells

- Execute the cells in sequence to carry out the analysis: data preprocessing, quality control, and downstream clustering.
- To execute a cell, click on it and press `Shift + Enter`, or click the play button next to the cell.

### Step 3: Quality Control: Removing Low-Quality Cells

Removing low-quality cells is a critical preprocessing step in single-cell transcriptomics. Low-quality cells introduce unwanted noise that can significantly bias downstream analyses such as clustering and differential expression, leading to misleading biological interpretations. The main criteria for filtering cells are:

1. **Low Number of Genes Detected**: Cells with very few detected genes are typically low quality, possibly representing empty droplets or dead cells, and are therefore excluded.
2. **High Proportion of Mitochondrial Counts**: High mitochondrial gene expression often indicates stressed or dying cells, which are not suitable for further analysis.

To address these quality issues, we use the parameter `min_counts` to filter out potentially problematic cells, and `min_cells` to filter out genes detected in too few cells.
- **`min_cells`**: The minimum number of cells a gene must be detected in to be retained in the dataset. For example, `min_cells=3` excludes any gene detected in fewer than 3 cells, focusing the analysis on genes expressed widely enough to be informative.
- **`min_counts`**: The minimum number of total counts required for a cell to be retained. `min_counts=200` filters out any cell with fewer than 200 total transcript counts, which likely represents an empty droplet or noise rather than a true biological cell.

Here's an example snippet to set these thresholds using `scanpy`:

```python
import scanpy as sc

adata = sc.read("your_data.h5ad")
sc.pp.filter_cells(adata, min_counts=200)
sc.pp.filter_genes(adata, min_cells=3)
```

### Step 4: Function `is_outlier` for Removing Low-Quality Cells

The `is_outlier` function identifies outlier cells based on QC metrics such as the number of counts, number of detected genes, and mitochondrial gene percentage, so that cells deviating strongly from the typical patterns in the dataset can be excluded. A typical implementation looks like this:

```python
import numpy as np

def is_outlier(obs, metric, min_quantile=0.05, max_quantile=0.95):
    lower_bound = np.quantile(obs[metric], min_quantile)
    upper_bound = np.quantile(obs[metric], max_quantile)
    return (obs[metric] < lower_bound) | (obs[metric] > upper_bound)
```

The function computes the lower and upper quantiles of the specified metric and marks cells outside these bounds as outliers. Adjust the quantile bounds to control how stringently low-quality cells are removed.
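As a quick sanity check, the quantile logic can be exercised on synthetic per-cell values before touching real data. The `n_genes` column and the simulated numbers below are illustrative only:

```python
import numpy as np
import pandas as pd

def is_outlier(obs, metric, min_quantile=0.05, max_quantile=0.95):
    lower_bound = np.quantile(obs[metric], min_quantile)
    upper_bound = np.quantile(obs[metric], max_quantile)
    return (obs[metric] < lower_bound) | (obs[metric] > upper_bound)

# 96 typical cells plus 4 extreme ones (simulated gene-detection counts)
rng = np.random.default_rng(0)
obs = pd.DataFrame({
    "n_genes": np.concatenate([rng.normal(2000, 100, 96), [50, 80, 9000, 12000]])
})

mask = is_outlier(obs, "n_genes")
# with 100 cells and 5%/95% quantile bounds, 5 cells fall below and 5 above
print(mask.sum())  # 10
```

Note that a pure quantile cutoff always flags a fixed fraction of cells, even in a perfectly clean dataset, which is why the bounds should be checked against the actual metric distributions before filtering.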
To use this function to create a boolean mask and filter your AnnData object:

```python
adata = adata[~is_outlier(adata.obs, 'n_genes'), :]
```

Only cells within the specified range of the metric are retained for downstream analysis, resulting in higher-quality data for meaningful biological interpretation.

### Step 5: Proportional Fitting Normalization

After removing low-quality cells, normalization is crucial to correct for differences in sequencing depth between cells; this prevents technical artifacts from confounding biological insights. Proportional fitting normalization, also known as **PFlog1pPF normalization**, scales the count data so that each cell has approximately the same total number of counts, followed by a log-transformation to stabilize variance.

Here is a code snippet implementing proportional fitting normalization:

```python
# Retain original raw counts
adata.layers["counts"] = adata.X.copy()

# Proportional fitting to the mean cell depth:
# expression is normalized to account for differences in total counts across cells
proportional_fitting = sc.pp.normalize_total(adata, target_sum=None, inplace=False)

# log1p transform to reduce the dynamic range and stabilize variance
adata.layers["log1pPF_normalization"] = sc.pp.log1p(proportional_fitting["X"])

# Proportional fitting applied again to the log-transformed data
adata.layers["PFlog1pPF_normalization"] = sc.pp.normalize_total(
    adata, target_sum=None, layer="log1pPF_normalization", inplace=False
)["X"]
```

- **`sc.pp.normalize_total()`** adjusts the counts in each cell so that they sum to a common target (with `target_sum=None`, the mean total count across cells), making cells comparable in terms of total RNA content.
- **`sc.pp.log1p()`** applies a log-transformation that compresses the scale of highly expressed genes more than lowly expressed ones, mitigating the effect of outlier counts and making the data more suitable for downstream analysis.
- Using both proportional fitting and log transformation yields a more uniform distribution of counts, improving interpretability and robustness in subsequent analyses.

In addition, there is an optional normalization method using **Pearson residuals** for more advanced applications:

```python
# Pearson residual normalization for further removal of technical effects
#adata.layers["Pearson_normalized"] = sc.experimental.pp.normalize_pearson_residuals(adata, inplace=False)["X"]
```

This normalization can be particularly beneficial for removing systematic technical effects and improving the comparability of expression values across samples, especially in cases with pronounced batch effects or other technical biases.

### Step 6: Highly Variable Genes Selection

Selecting highly variable genes (HVGs) focuses the analysis on genes that drive biological diversity across cells. HVGs show substantial variance relative to their mean expression, reflecting meaningful biological processes rather than technical noise. To identify HVGs using `scanpy`:

```python
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var['highly_variable']]
```

This selects genes with high dispersion relative to their mean expression, which are likely to be biologically informative. Retaining only HVGs can significantly improve the efficiency and interpretability of downstream analyses.

### Step 7: Triku for Highly Variable Genes Selection

**Triku** is an advanced tool for selecting highly variable genes that considers local variability across the distribution of gene expression, often capturing more subtle biological patterns than variance-based methods.
To use Triku, install it with:

```python
!pip install triku
```

Then apply it to the AnnData object. Triku operates in place, writing its feature selection to `adata.var`, and recent versions expect PCA and the neighbors graph (`sc.pp.pca`, `sc.pp.neighbors`) to have been computed beforehand:

```python
import triku as tk

tk.tl.triku(adata)
```

This method often provides an enhanced set of highly variable genes, particularly useful when dealing with complex and heterogeneous datasets.

### Step 8: Principal Component Analysis (PCA)

PCA is a fundamental dimensionality reduction technique that captures the major axes of variation in the data. In single-cell transcriptomics, PCA reduces the high-dimensional gene expression matrix to a smaller set of components representing the most significant sources of variability. To compute PCA in `scanpy`:

```python
sc.pp.pca(adata, n_comps=50)
```

The data is reduced to 50 principal components that retain the most informative variation, allowing subsequent analyses like clustering and visualization to proceed with reduced noise and computational cost.

### Step 9: Neighbors Calculation

The next step is to build a graph of cell-to-cell similarities, which is required for clustering and visualization. Nearest neighbors are computed on the principal components:

```python
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
```

- **`n_neighbors`**: The number of neighbors considered for each cell. This parameter determines how tightly or loosely cells are connected and should be chosen with care.
- **`n_pcs`**: The number of principal components used, which affects the granularity of the similarity graph. More PCs may capture more complex relationships, at the cost of potentially adding noise.

### Step 9.1: Batch Effect Correction Using Harmony Integration

Batch effects are a common issue in single-cell RNA sequencing, particularly when combining data from multiple experiments or conditions.
To correct for batch effects and ensure that clustering is driven by biological variability rather than technical artifacts, we use **Harmony Integration**. Harmony operates directly on the PCA representation:

```python
import scanpy.external as sce

sce.pp.harmony_integrate(adata, key='batch')
```

- **`key='batch'`**: The column in `adata.obs` containing the batch assignments (e.g., different experimental conditions or runs).

Harmony iteratively adjusts the principal components to minimize batch-driven differences. The corrected embedding is stored in `adata.obsm['X_pca_harmony']` and should be used for downstream steps, e.g., `sc.pp.neighbors(adata, use_rep='X_pca_harmony')`, so that clustering and visualization reflect the batch-corrected space.

### Step 10: Uniform Manifold Approximation and Projection (UMAP)

UMAP is a powerful dimensionality reduction technique for visualizing high-dimensional data. It seeks to preserve both local and global structure, making it well suited to visualizing relationships between cell populations. To compute UMAP:

```python
sc.tl.umap(adata)
```

Visualize the embedding, colored by the Leiden clusters computed in the next step:

```python
sc.pl.umap(adata, color=['leiden'])
```

UMAP visualization is particularly useful for identifying clusters of cells and exploring biological relationships between subpopulations.

### Step 11: Clustering Using the Leiden Algorithm

The **Leiden** algorithm is a community detection method that operates on the k-nearest-neighbor graph of cell similarities, helping to identify distinct cell types or states within the dataset. To perform clustering with Leiden:

```python
sc.tl.leiden(adata, resolution=0.5)
```

- **`resolution`**: Controls the granularity of clustering. A higher resolution produces more clusters, potentially capturing finer subpopulations, while a lower resolution yields fewer, broader clusters. Choosing an optimal resolution often requires iterative testing and biological validation.
### Step 12: Annotating Clusters Based on Known Marker Genes

Once the clusters are identified, annotation is crucial to assign biological meaning to each cluster. It involves comparing the expression of known marker genes across clusters to identify their potential identities. To visualize marker gene expression:

```python
sc.pl.dotplot(adata, ['CD3D', 'MS4A1', 'LYZ'], groupby='leiden')
```

In this example, **`CD3D`** is used as a T-cell marker, **`MS4A1`** as a B-cell marker, and **`LYZ`** as a myeloid cell marker. Comparing the expression of these genes across clusters makes it possible to infer the likely cell type represented by each cluster.

You can also use a heatmap for a more detailed visualization of marker gene expression:

```python
sc.pl.heatmap(adata, var_names=['CD3D', 'MS4A1', 'LYZ'], groupby='leiden', cmap='viridis')
```

Annotating clusters accurately is an iterative process that requires domain knowledge and often cross-referencing with published data or single-cell atlases. Successful annotation allows for meaningful biological interpretation of the identified cell populations, providing insights into cellular heterogeneity and function in the studied system.