# Data Integration
## 1. Setting Up Google Colab for Execution
To run this notebook in Google Colab, follow these steps:
1. **Upload the notebook to Google Drive** (optional, but recommended for saving changes).
2. **Open Google Colab** and load the notebook.
3. **Install required dependencies** by running the following command at the top of your notebook:
```python
!pip install scanpy celltypist scvi-tools hyperopt "ray[tune]" anndata2ri harmony-pytorch
```
4. **Mount Google Drive** if your dataset is stored there:
```python
from google.colab import drive
drive.mount('/content/drive')
```
5. **Restart the runtime** after installation to ensure all dependencies load correctly.
---
## 2. Creating CellTypist Models
### What is CellTypist?
[CellTypist](https://www.science.org/doi/10.1126/science.abl5197) is a tool for **automated cell type annotation** of single-cell RNA sequencing (scRNA-seq) data. It classifies cells from their gene expression profiles using pre-trained logistic regression models.
### Steps to Use CellTypist:
1. **Load CellTypist**
```python
import celltypist
from celltypist import models
```
2. **Check Available Models**
```python
models.get_all_models()
```
3. **Download and Load a Model**
```python
models.download_models(model='Immune_All_Low.pkl')  # downloads to the local model cache
model = models.Model.load(model='Immune_All_Low.pkl')
```
4. **Run Cell Annotation**
```python
# adata.X is expected to hold log1p-normalized expression (10,000 counts per cell)
predictions = celltypist.annotate(adata, model=model)
```
5. **Visualize Annotation Results**
```python
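# Compare predictions against existing labels; assumes adata.obs["cell_type"] holds reference annotations
celltypist.dotplot(predictions, use_as_reference="cell_type", use_as_prediction="predicted_labels")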
```
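CellTypist's built-in models expect log1p-normalized expression scaled to 10,000 counts per cell, so raw counts need preprocessing before the annotation step. In Scanpy this is typically `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`. The NumPy sketch below illustrates the same arithmetic on a toy matrix; the function name `normalize_log1p` and the count values are hypothetical and serve only as an illustration.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# Toy matrix: 3 cells x 4 genes of raw counts (hypothetical values)
raw = np.array([[10, 0, 5, 5], [1, 1, 1, 1], [0, 0, 0, 0]])
normed = normalize_log1p(raw)

# Undoing the log shows that every non-empty cell now sums to 10,000
cell_sums = np.expm1(normed).sum(axis=1)
```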
---
## 3. scVI Label Transfer
### What is scVI?
scVI (Single-cell Variational Inference) is a deep learning framework for modeling single-cell transcriptomic data. It is used for:
- **Dimensionality reduction**
- **Batch correction**
- **Label transfer** (assigning cell identities based on reference datasets)
### How to Perform Label Transfer with scVI:
1. **Prepare the Dataset**
```python
import scanpy as sc
import scvi
# scVI models raw counts; register the batch column on the model class
scvi.model.SCVI.setup_anndata(adata, batch_key="batch_column")
```
2. **Train the scVI Model**
```python
model = scvi.model.SCVI(adata)
model.train()
```
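3. **Extract the Latent Representation for Label Transfer**
```python
adata.obsm["X_scvi"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scvi")  # neighbor graph on the scVI embedding
```
4. **Visualize Results**
```python
sc.tl.umap(adata)
sc.pl.umap(adata, color="cell_type")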
```
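The scVI latent space does not assign labels by itself. Labels are transferred on top of it, for example with scANVI (the semi-supervised model in scvi-tools) or with a nearest-neighbor vote between reference and query cells embedded in the same space. The sketch below shows the second idea in plain NumPy. It is a generic illustration, not the scvi-tools API: `knn_transfer`, the synthetic coordinates, and the label names are all assumptions made for the example.

```python
import numpy as np

def knn_transfer(ref_latent, ref_labels, query_latent, k=5):
    """Assign each query cell the majority label among its k nearest reference cells."""
    ref_labels = np.asarray(ref_labels)
    # Pairwise Euclidean distances, shape (n_query, n_ref)
    dists = np.linalg.norm(query_latent[:, None, :] - ref_latent[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    predicted = []
    for idx in nearest:
        values, counts = np.unique(ref_labels[idx], return_counts=True)
        predicted.append(values[np.argmax(counts)])
    return np.array(predicted)

rng = np.random.default_rng(0)
# Synthetic stand-in for a joint latent space: two well-separated populations
ref_latent = np.vstack([rng.normal(0, 0.5, (50, 10)), rng.normal(5, 0.5, (50, 10))])
ref_labels = ["T cell"] * 50 + ["B cell"] * 50
query_latent = np.vstack([rng.normal(0, 0.5, (5, 10)), rng.normal(5, 0.5, (5, 10))])

transferred = knn_transfer(ref_latent, ref_labels, query_latent)
```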
---
## 4. Difference Between Integration and Batch Correction
### Batch Correction
- Focuses on removing **technical variation** caused by different sequencing runs, protocols, or platforms.
- Maintains the original biological structure of the data.
- Example methods: **Harmony, ComBat, MNN correction (mnnCorrect)**.
### Integration
- Aims to **merge datasets** from different sources to create a unified reference space.
- Can **remove** some biological variation if not carefully applied.
- Example methods: **Seurat’s CCA, Harmony, scVI**.

---
## 5. When to Integrate Data and the Tradeoffs
### When to Integrate Data
- When comparing datasets from **different platforms, conditions, or batches**.
- When building a **reference atlas** for multiple tissue or disease conditions.
- When clustering **heterogeneous cell types** across datasets.
### Tradeoffs of Integration
- **Risk of overcorrection**: May remove real biological differences.
- **Computationally expensive**: Requires more memory and time.
- **Potential loss of dataset-specific signatures**: Important features may be averaged out.
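
The overcorrection risk is easiest to see when batch and biology are confounded. The toy example below uses a deliberately naive correction (per-batch mean centering, which is not one of the methods named in this document) on one hypothetical gene measured in a control batch and a treated batch. All values are simulated assumptions. The real difference between conditions disappears after correction.

```python
import numpy as np

rng = np.random.default_rng(1)
# One gene, two samples that double as batches: control vs. treated (simulated data)
control = rng.normal(loc=2.0, scale=0.3, size=200)
treated = rng.normal(loc=4.0, scale=0.3, size=200)  # genuine biological up-regulation

def center_per_batch(values):
    """Naive correction: subtract each batch's own mean."""
    return values - values.mean()

# Difference in mean expression between conditions, before and after correction
diff_before = treated.mean() - control.mean()
diff_after = center_per_batch(treated).mean() - center_per_batch(control).mean()
print(f"before: {diff_before:.2f}, after: {diff_after:.2f}")
```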
---
## 6. When NOT to Integrate Data
### Integration is NOT Recommended for:
- **Cancer Studies**: Tumor heterogeneity is biologically meaningful and should not be masked.
- **Immune Response Studies**: Subtle immune states may be lost if overcorrected.
- **Time-course Experiments**: When time point is confounded with batch, integration can erase genuine temporal changes.
- **Datasets with Minimal Batch Effects**: If batches already mix well before correction, integration is unnecessary.

---
## 7. Harmony-Based Batch Integration
### What is `harmony_integrate`?
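[`harmony_integrate`](https://www.nature.com/articles/s41592-019-0619-0) (available as `scanpy.external.pp.harmony_integrate`) is Scanpy's wrapper around the Harmony algorithm. It **removes batch effects** and integrates scRNA-seq datasets while preserving biological variation. The example in this section calls `harmonize` from the `harmony-pytorch` package, a PyTorch implementation of the same algorithm.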
### How Integration Works:
- **Step 1:** Compute **Principal Components (PCs)** from the dataset.
- **Step 2:** Apply **batch alignment** iteratively using Harmony.
- **Step 3:** Store the integrated embedding for downstream analysis.
### Running Harmony Integration:
```python
import scanpy as sc
from harmony import harmonize
# Load a preprocessed AnnData object (PCA already stored in adata.obsm["X_pca"], e.g. via sc.pp.pca)
adata = sc.read_h5ad("my_data.h5ad")
# Run Harmony integration
adata.obsm["X_harmony"] = harmonize(adata.obsm["X_pca"], adata.obs, batch_key="batch_column")
```
---
## 8. Checking Integration Quality
After running Harmony, it's essential to assess integration quality. Here are key methods:
### A. **UMAP Visualization Before and After Integration**
```python
sc.pp.neighbors(adata, use_rep='X_pca')
sc.tl.umap(adata)
sc.pl.umap(adata, color='batch_column', title='Before Integration')
sc.pp.neighbors(adata, use_rep='X_harmony')
sc.tl.umap(adata)
sc.pl.umap(adata, color='batch_column', title='After Integration')
```
### B. **Silhouette Scores for Batch Effect Evaluation**
```python
from sklearn.metrics import silhouette_score
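# Scored on batch labels: values near 0 or below mean batches are well mixed; values near 1 mean they remain separated
silhouette_score(adata.obsm["X_harmony"], adata.obs["batch_column"])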
```
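When the silhouette score is computed on batch labels, lower is better, which is the opposite of its usual reading for clusters. The from-scratch NumPy version below makes the formula visible and contrasts two simulated embeddings. It is an illustrative sketch (the function `mean_silhouette` and the data are assumptions); in practice `sklearn.metrics.silhouette_score` is the call to use.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean of (b - a) / max(a, b) over all points (assumes every group has >= 2 points)."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself from its own group
        a = dists[i, own].mean()  # mean distance to cells of the same batch
        # mean distance to the closest other batch
        b = min(dists[i, labels == g].mean() for g in groups if g != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
batch = np.array(["A"] * 100 + ["B"] * 100)
# Simulated embeddings: batches far apart vs. drawn from the same distribution
separated = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
mixed = rng.normal(0, 1, (200, 2))

score_separated = mean_silhouette(separated, batch)  # high: batch effect remains
score_mixed = mean_silhouette(mixed, batch)          # close to 0: batches overlap
```

A score that stays high after integration suggests that batch structure still dominates the embedding.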
### C. **Graph Connectivity Analysis**
```python
sc.tl.paga(adata, groups='cell_type_column')
sc.pl.paga(adata, color='batch_column')
```
---
## 9. Annotation Using scVI & CellTypist
To improve annotation results:
- Use **[scVI](https://www.nature.com/articles/s41592-018-0229-2) latent space representation** for more accurate clustering.
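- Apply **CellTypist** for fine-grained label transfer.

Example pipeline:
```python
# CellTypist classifies on gene expression; the scVI embedding defines the clusters used for majority voting
sc.pp.neighbors(adata, use_rep="X_scvi")
sc.tl.leiden(adata, key_added="scvi_leiden")
predictions = celltypist.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True, over_clustering="scvi_leiden")
adata = predictions.to_adata()  # adds "predicted_labels" and "majority_voting" to adata.obs
sc.pl.umap(adata, color="majority_voting")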
```
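One way to combine the two tools is to cluster cells in the scVI latent space and let each cluster adopt its most common per-cell CellTypist label. CellTypist's `majority_voting=True` option refines predictions along these lines. The sketch below is a simplified illustration of the voting idea, not CellTypist's implementation; the function `majority_vote` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(cluster_ids, predicted_labels):
    """Replace each cell's label with the most frequent label in its cluster."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, predicted_labels):
        votes[cluster][label] += 1
    winner = {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
    return [winner[cluster] for cluster in cluster_ids]

# Toy input: cluster assignments (e.g. from Leiden on the scVI embedding) and per-cell predictions
clusters = ["0", "0", "0", "1", "1", "1"]
per_cell = ["T cell", "T cell", "NK cell", "B cell", "B cell", "B cell"]

smoothed = majority_vote(clusters, per_cell)
```

Here the single `NK cell` call in cluster `0` is overruled by its neighbors. This smooths noisy per-cell predictions but can also hide rare populations that do not form their own cluster.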