A Basic Scanpy I/O walkthrough

> 🚨 This is NOT a tutorial on how to process and cluster single-cell data using Scanpy. For that, please refer to the excellent [Scanpy Tutorial](https://scanpy.readthedocs.io/en/stable/tutorials/index.html).

# Motivation

This walkthrough was inspired by a colleague who recently downloaded a `.h5ad` file from CELLxGENE but had only worked with Seurat before, and by collaborators from non-biological fields who have no prior experience with single-cell data. The idea is to make the input/output basics of Scanpy and the `.h5ad` / `.h5` file structures more accessible to newcomers and to people coming from different backgrounds.

# 1. Loading data into Scanpy

Scanpy works primarily with the **AnnData** object. It has:

- `.X` → expression matrix (cells × genes, usually sparse)
- `.obs` → cell metadata (dataframe)
- `.var` → gene metadata (dataframe)
- `.uns` → unstructured data (analysis settings, plot colors, etc.)
- `.obsm` → matrices aligned to cells (e.g., PCA, UMAP)
- `.varm` → matrices aligned to genes
- `.layers` → alternative expression matrices (raw, normalized, etc.)

## 💡 Quick orientation of AnnData

- Cells = rows → metadata about each cell, including the cell ID (index), is stored in `adata.obs`.
- Genes = columns → gene names are usually in `adata.var.index`.
- Expression matrix = counts of each gene in each cell → stored in `adata.X`.
- A copy of the expression matrix from before later processing steps (for example, before gene filtering and scaling) may be kept in `adata.raw.X`.

## Example: Load a built-in dataset

This built-in dataset is a processed set of about 3k peripheral blood mononuclear cells (PBMCs) from 10x Genomics. PBMCs are immune cells from human blood, and this is one of the most widely used reference datasets in the single-cell field.
```{python}
import scanpy as sc

adata = sc.datasets.pbmc3k_processed()  # 3k PBMC dataset
print(adata)
```

Output:

```
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
```

- The gene names are usually stored in `adata.var.index`.
- The count matrix (gene expression of every cell) is in `adata.X`, or in `adata.raw.X` for the version saved before later processing.

# 2. Reading from files

## a. From `.h5ad` (Scanpy format)

```{python}
adata = sc.read_h5ad("adata.h5ad")
```

## b. From `.h5` (10x Genomics format)

```{python}
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
```

# 3. Writing to files

## a. Save as `.h5ad`

```{python}
adata.write("adata.h5ad")  # saves everything
# gzip gives a smaller file, but is slower to write and to read back
adata.write("adata_compressed.h5ad", compression="gzip")
```

## b. Save as text

```{python}
# exports multiple CSV files (obs, var, ...); skip_data=False also writes X
adata.write_csvs("output_folder/", skip_data=False)
```

# 4. Exploring the `.h5ad` file structure with h5py

Since an `.h5ad` file is an HDF5 file, you can inspect it with `h5py`:

```{python}
import h5py

with h5py.File("adata.h5ad", "r") as f:
    print(list(f.keys()))  # top-level groups, e.g. X, obs, var
```

# ✅ Now you know how to:

1. Read from `.h5ad` and `.h5`
2. Write to `.h5ad` and `.csv`
3. Peek inside the HDF5 structure with h5py

> Learn more about AnnData at https://anndata.readthedocs.io/en/stable/