> 🚨 This is NOT a tutorial on how to process and cluster single-cell data using Scanpy. Please refer to the excellent [Scanpy Tutorial](https://scanpy.readthedocs.io/en/stable/tutorials/index.html).
# Motivation
This walkthrough is inspired by my colleague, who recently downloaded a `.h5ad` file from CELLxGENE but had only worked with Seurat before, and by my collaborators from non-biological fields who have no prior experience with single-cell data.
The idea is to make the input/output basics of Scanpy and the `.h5ad` / `.h5` file structures more accessible for newcomers and those coming from different backgrounds.
# 1. Loading data into Scanpy
Scanpy works primarily with the **AnnData** object. It has:
- `.X` → expression matrix (cells × genes, usually sparse)
- `.obs` → cell metadata (dataframe)
- `.var` → gene metadata (dataframe)
- `.uns` → unstructured data (analysis parameters, settings, colors, etc.)
- `.obsm` → matrices aligned to cells (e.g., PCA, UMAP)
- `.varm` → matrices aligned to genes
- `.layers` → alternative expression matrices (raw, normalized, etc.)
## 💡 Quick orientation of AnnData
- Cells = rows → stored in `adata.obs` (metadata about each cell, including CellID/Index).
- Genes = columns → gene names are usually in `adata.var.index`.
- Expression matrix = counts of each gene in each cell → stored in `adata.X`.
- A copy of the expression matrix saved before later processing steps (such as gene filtering or scaling) may be kept in `adata.raw.X`.
## Example: Load a built-in dataset
This built-in dataset contains roughly 3k processed peripheral blood mononuclear cells (PBMCs) from 10x Genomics. PBMCs are immune cells from human blood, and this is one of the most commonly used reference datasets in the single-cell field.
```{python}
import scanpy as sc
adata = sc.datasets.pbmc3k_processed() # 3k PBMC dataset
print(adata)
```
Output:
```
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
```
- Gene names are usually stored in `adata.var.index`.
- The count matrix (expression of every gene in every cell) is in `adata.X`, or in `adata.raw.X` for the copy saved before later processing steps.
# 2. Reading from files
## a. From `.h5ad` (Scanpy format)
```{python}
adata = sc.read_h5ad("adata.h5ad")
```
## b. From `.h5` (10x Genomics format)
```{python}
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
```
# 3. Writing to files
## a. Save as `.h5ad`
```{python}
adata.write("adata.h5ad") # saves everything
adata.write("adata_compressed.h5ad", compression="gzip") # smaller file, but slower to write and to read back
```
## b. Save as text
```{python}
adata.write_csvs("output_folder/", skip_data=False) # exports multiple CSV files (obs, var, ...); skip_data=False also writes the expression matrix
```
# 4. Exploring the .h5ad file structure with h5py
Since an `.h5ad` file is an HDF5 file, you can inspect it with `h5py`:
```{python}
import h5py
with h5py.File("adata.h5ad", "r") as f:
    print(list(f.keys()))  # top-level groups, e.g. X, obs, var
```
# ✅ Now you know how to:
1. Read from .h5ad and .h5
2. Write to .h5ad and .csv
3. Peek inside the HDF5 structure with h5py
> Learn more about AnnData at https://anndata.readthedocs.io/en/stable/