# Data Integration

## 1. Setting Up Google Colab for Execution

To run this notebook in Google Colab, follow these steps:

1. **Upload the notebook to Google Drive** (optional, but recommended for saving changes).
2. **Open Google Colab** and load the notebook.
3. **Install required dependencies** by running the following command at the top of your notebook:
   ```python
   !pip install scanpy celltypist scvi-tools harmony-pytorch hyperopt "ray[tune]" anndata2ri
   ```
4. **Mount Google Drive** if your dataset is stored there:
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```
5. **Restart the runtime** after installation to ensure all dependencies load correctly.

---

## 2. Creating CellTypist Models

### What is CellTypist?
[CellTypist](https://www.science.org/doi/10.1126/science.abl5197) is a tool for **automated cell type annotation** of single-cell RNA sequencing (scRNA-seq) data. It uses a trained model to classify cells based on their gene expression profiles.

### Steps to Use CellTypist:

1. **Load CellTypist**
   ```python
   import scanpy as sc
   import celltypist
   from celltypist import models
   ```
2. **Check Available Models**
   ```python
   models.get_all_models()
   ```
3. **Download and Load a Model**
   ```python
   models.download_models(model='Immune_All_Low.pkl')
   model = models.Model.load(model='Immune_All_Low.pkl')
   ```
4. **Run Cell Annotation**
   ```python
   # my_adata should hold log1p-normalized expression (10,000 counts per cell)
   predictions = celltypist.annotate(my_adata, model=model)
   ```
5. **Visualize Annotation Results**
   ```python
   adata_annotated = predictions.to_adata()  # adds "predicted_labels" and "conf_score" to .obs
   sc.pl.umap(adata_annotated, color="predicted_labels")  # requires a precomputed UMAP
   ```

---

## 3. scVI Label Transfer

### What is scVI?
scVI (Single-cell Variational Inference) is a deep learning framework for modeling single-cell transcriptomic data. It is used for:

- **Dimensionality reduction**
- **Batch correction**
- **Label transfer** (assigning cell identities based on reference datasets)

### How to Perform Label Transfer with scVI:

1. **Prepare the Dataset**
   ```python
   import scanpy as sc
   import scvi
   # Register the AnnData object and its batch column with scVI
   scvi.model.SCVI.setup_anndata(adata, batch_key="batch_column")
   ```
2. **Train the scVI Model**
   ```python
   scvi_model = scvi.model.SCVI(adata)
   scvi_model.train()
   ```
3. **Infer Labels from a Reference Dataset**

   The shared latent space is the basis for label transfer: once reference and query cells are embedded together, labels can be propagated from annotated reference cells to their nearest query neighbors, or predicted with scANVI (`scvi.model.SCANVI`).
   ```python
   adata.obsm["X_scvi"] = scvi_model.get_latent_representation()
   ```
4. **Visualize Results**
   ```python
   sc.pp.neighbors(adata, use_rep="X_scvi")
   sc.tl.umap(adata)
   sc.pl.umap(adata, color="cell_type")
   ```

---

## 4. Difference Between Integration and Batch Correction

### Batch Correction
- Focuses on removing **technical variation** caused by different sequencing runs, protocols, or platforms.
- Maintains the original biological structure of the data.
- Example methods: **Harmony, ComBat, MNN correct**.

### Integration
- Aims to **merge datasets** from different sources to create a unified reference space.
- Can **remove** some biological variation if not carefully applied.
- Example methods: **Seurat’s CCA, Harmony, scVI**.

![Running the Notebook in Google Colab - visual selection (1)](https://hackmd.io/_uploads/ry9RVqpvyl.svg)

---

## 5. When to Integrate Data and the Tradeoffs

### When to Integrate Data
- When comparing datasets from **different platforms, conditions, or batches**.
- When building a **reference atlas** for multiple tissue or disease conditions.
- When clustering **heterogeneous cell types** across datasets.

### Tradeoffs of Integration
- **Risk of overcorrection**: May remove real biological differences.
- **Computationally expensive**: Requires more memory and time.
- **Potential loss of dataset-specific signatures**: Important features may be averaged out.

---

## 6. When NOT to Integrate Data

### Integration is NOT Recommended for:
- **Cancer Studies**: Tumor heterogeneity is biologically meaningful and should not be masked.
- **Immune Response Studies**: Subtle immune states may be lost if overcorrected.
- **Time-course Experiments**: Cells from different time points should be analyzed separately.
- **Highly Specific Conditions**: If batch effects are minimal, integration is unnecessary.
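
Whether batch effects are "minimal" can be estimated before committing to integration. The sketch below is one possible heuristic, not a method used elsewhere in this notebook: for each cell it measures the fraction of its k nearest neighbors that come from the same batch. The function name `same_batch_neighbor_fraction` and the toy data are assumptions made for illustration. With two equally sized batches, a value near 0.5 suggests good mixing, while a value near 1.0 suggests a strong batch effect.

```python
import numpy as np

def same_batch_neighbor_fraction(embedding, batches, k=15):
    """Mean fraction of each cell's k nearest neighbors that share its batch."""
    embedding = np.asarray(embedding, dtype=float)
    batches = np.asarray(batches)
    # Pairwise squared Euclidean distances (fine for a few thousand cells).
    sq_norms = (embedding ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbor
    neighbor_idx = np.argsort(dists, axis=1)[:, :k]
    same = batches[neighbor_idx] == batches[:, None]
    return float(same.mean())

rng = np.random.default_rng(0)
batches = np.repeat(["a", "b"], 200)

# Well-mixed toy data: both batches are drawn from the same distribution.
mixed = rng.normal(size=(400, 10))
# Toy data with a strong batch effect: batch "b" is shifted in every dimension.
shifted = mixed.copy()
shifted[batches == "b"] += 5.0

mixed_score = same_batch_neighbor_fraction(mixed, batches)
shifted_score = same_batch_neighbor_fraction(shifted, batches)
print(f"well mixed: {mixed_score:.2f}, batch effect: {shifted_score:.2f}")
```

On real data the same function could be called as `same_batch_neighbor_fraction(adata.obsm["X_pca"], adata.obs["batch_column"].to_numpy())`. The value expected under perfect mixing depends on the relative batch sizes, so the score should be read against that baseline, not against a fixed threshold.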
![Running the Notebook in Google Colab - visual selection (2)](https://hackmd.io/_uploads/B1Y4Bc6w1e.svg)

---

## 7. Harmony-Based Batch Integration

### What is `harmony_integrate`?
[`harmony_integrate`](https://www.nature.com/articles/s41592-019-0619-0) is a function used to **remove batch effects** and integrate scRNA-seq datasets while preserving biological variation. Scanpy exposes it as `sc.external.pp.harmony_integrate`; the example in this section uses `harmonize` from the `harmony-pytorch` package, another implementation of the same Harmony algorithm.

### How Integration Works:
- **Step 1:** Compute **Principal Components (PCs)** from the dataset.
- **Step 2:** Apply **batch alignment** iteratively using Harmony.
- **Step 3:** Store the integrated embedding for downstream analysis.

### Running Harmony Integration:
```python
import scanpy as sc
from harmony import harmonize  # provided by the harmony-pytorch package

# Load a preprocessed AnnData object; PCA must already be stored in adata.obsm["X_pca"]
adata = sc.read_h5ad("my_data.h5ad")

# Run Harmony on the PCA embedding and store the batch-corrected embedding
adata.obsm["X_harmony"] = harmonize(adata.obsm["X_pca"], adata.obs, batch_key="batch_column")
```

---

## 8. Checking Integration Quality

After running Harmony, assess the integration quality with the following checks:

### A. **UMAP Visualization Before and After Integration**
```python
sc.pp.neighbors(adata, use_rep='X_pca')
sc.tl.umap(adata)
sc.pl.umap(adata, color='batch_column', title='Before Integration')

sc.pp.neighbors(adata, use_rep='X_harmony')
sc.tl.umap(adata)
sc.pl.umap(adata, color='batch_column', title='After Integration')
```

### B. **Silhouette Scores for Batch Effect Evaluation**
Because the score is computed on batch labels, lower is better here: values near 0 or below indicate well-mixed batches, while values near 1 indicate that cells still separate by batch.
```python
from sklearn.metrics import silhouette_score
silhouette_score(adata.obsm["X_harmony"], adata.obs["batch_column"])
```

### C. **Graph Connectivity Analysis**
PAGA reuses the neighbor graph computed on `X_harmony` in step A.
```python
sc.tl.paga(adata, groups='cell_type_column')
sc.pl.paga(adata, color='batch_column')
```

---

## 9. Annotation Using scVI & CellTypist

To improve annotation results:
- Use **[scVI](https://www.nature.com/articles/s41592-018-0229-2) latent space representation** for more accurate clustering.
- Apply **CellTypist** for fine-grained label transfer.
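
Example pipeline: CellTypist predicts labels from gene expression, and clusters computed on the scVI embedding refine those predictions through majority voting.
```python
sc.pp.neighbors(adata, use_rep="X_scvi")
sc.tl.leiden(adata, key_added="leiden_scvi")
sc.tl.umap(adata)
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True, over_clustering="leiden_scvi")
adata = predictions.to_adata()  # adds "predicted_labels" and "majority_voting" to adata.obs
sc.pl.umap(adata, color="majority_voting")
```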
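
Label transfer in a latent space such as `X_scvi` can, in its simplest form, be a nearest-neighbor vote: each unlabeled cell takes the most common label among its closest reference cells. The sketch below illustrates that idea on synthetic data using only NumPy. It is not the scVI or CellTypist implementation, and the function name `knn_label_transfer`, the toy embeddings, and the cell type names are assumptions made for this example.

```python
import numpy as np

def knn_label_transfer(ref_latent, ref_labels, query_latent, k=5):
    """Assign each query cell the majority label of its k nearest reference cells."""
    ref_latent = np.asarray(ref_latent, dtype=float)
    query_latent = np.asarray(query_latent, dtype=float)
    classes, ref_codes = np.unique(np.asarray(ref_labels), return_inverse=True)
    # Squared Euclidean distance from every query cell to every reference cell.
    dists = ((query_latent[:, None, :] - ref_latent[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Count votes per class among the k neighbors of each query cell.
    rows = np.arange(query_latent.shape[0])
    votes = np.zeros((rows.size, classes.size), dtype=int)
    for col in range(k):
        votes[rows, ref_codes[nearest[:, col]]] += 1
    predicted = classes[votes.argmax(axis=1)]
    confidence = votes.max(axis=1) / k  # share of neighbors agreeing with the call
    return predicted, confidence

rng = np.random.default_rng(0)
# Toy "latent space": two well-separated cell types in 10 dimensions.
ref_latent = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(6, 1, (100, 10))])
ref_labels = np.array(["T cell"] * 100 + ["B cell"] * 100)
query_latent = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(6, 1, (20, 10))])

predicted, confidence = knn_label_transfer(ref_latent, ref_labels, query_latent)
print(predicted[:3], predicted[-3:], confidence.min())
```

With real data, `ref_latent` and `query_latent` would be the rows of `adata.obsm["X_scvi"]` belonging to annotated and unannotated cells respectively. The confidence value gives a simple way to flag cells whose neighbors disagree, which may deserve manual review.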