# Code snippets to High-dimensional transcriptional profiling MATLAB offers tools and workflows for analyzing high-dimensional transcriptional profiling data from techniques like single-cell RNA sequencing (scRNA-seq) or bulk RNA-seq, focusing on gene expression analysis, clustering, visualization, and dimensionality reduction. Here's an overview of using MATLAB to analyze high-dimensional transcriptional data from individual cells, from preprocessing to visualization. ![Untitled-2023-08-12-1441.excalidraw](https://hackmd.io/_uploads/SJ1VQeFZyg.svg) ### 1. Loading and Preprocessing Data In MATLAB, you can start by loading raw gene expression data, which is typically stored as a matrix where rows represent genes and columns represent cells. Several formats, such as CSV, MAT, or HDF5, can be read using MATLAB's `readmatrix` or `importdata` functions. ```matlab % Example: Load gene expression matrix expressionData = readmatrix('gene_expression.csv'); % Load raw data geneNames = readcell('gene_names.csv'); % Load gene names cellNames = readcell('cell_names.csv'); % Load cell names % Basic Preprocessing expressionData = log1p(expressionData); % Log transformation to normalize data ``` ### 2. Quality Control and Filtering Filter low-quality cells and genes with low counts to remove noise. Basic metrics include: - Removing cells with low gene counts (indicating dead or unhealthy cells). - Removing genes with low expression across cells. ```matlab % Set thresholds for quality control minGeneCount = 200; minCellCount = 5; % Filter cells and genes geneFilter = sum(expressionData > 0, 2) >= minGeneCount; % Genes expressed in enough cells cellFilter = sum(expressionData > 0, 1) >= minCellCount; % Cells with enough genes expressed expressionData = expressionData(geneFilter, cellFilter); geneNames = geneNames(geneFilter); cellNames = cellNames(cellFilter); ``` ### 3. Normalization and Scaling Normalization controls for differences in library size and sequencing depth, and scaling centers the data. ```matlab % Normalize expression to account for sequencing depth normalizedData = expressionData ./ sum(expressionData, 1) * median(sum(expressionData, 1)); % Z-score scaling for gene-wise normalization scaledData = zscore(normalizedData, 0, 2); % Normalize each gene's expression ``` ### 4. Dimensionality Reduction (PCA, t-SNE, UMAP) Dimensionality reduction helps visualize and interpret high-dimensional data. MATLAB’s Statistics and Machine Learning Toolbox offers methods like PCA, and there are packages for t-SNE and UMAP. ```matlab % Perform PCA [coeff, score, latent] = pca(scaledData'); % Visualize first two principal components figure; scatter(score(:,1), score(:,2), 10, 'filled'); title('PCA of Gene Expression'); xlabel('PC1'); ylabel('PC2'); % Perform t-SNE tsneData = tsne(scaledData', 'Algorithm', 'barneshut', 'NumDimensions', 2); figure; scatter(tsneData(:,1), tsneData(:,2), 10, 'filled'); title('t-SNE of Gene Expression'); xlabel('t-SNE 1'); ylabel('t-SNE 2'); ``` For UMAP, use the MATLAB Toolbox "umap" available on MATLAB File Exchange. ### 5. Clustering Cells into Subpopulations Clustering reveals distinct cell populations based on transcriptional similarity. MATLAB’s `kmeans` or `cluster` functions are useful, and hierarchical clustering can visualize relationships among clusters. ```matlab % k-means clustering numClusters = 5; % Set the number of clusters [idx, C] = kmeans(score(:, 1:10), numClusters); % Clustering on the top 10 PCs % Visualize clusters figure; gscatter(score(:,1), score(:,2), idx); title('Cell Clusters in PCA Space'); xlabel('PC1'); ylabel('PC2'); ``` ### 6. Differential Gene Expression To identify marker genes that characterize each cluster, compare gene expression across clusters using t-tests or ANOVA. ```matlab % Perform differential expression analysis cluster1 = expressionData(:, idx == 1); cluster2 = expressionData(:, idx == 2); [~, pvals] = ttest2(cluster1', cluster2', 'Vartype', 'unequal'); adjustedPvals = mafdr(pvals, 'BHFDR', true); % FDR correction ``` ### 7. Visualization and Annotation Heatmaps and dot plots help visualize expression patterns. MATLAB’s `heatmap` function can be customized to highlight key genes and clusters. ```matlab % Select top marker genes for clusters topGenes = geneNames(adjustedPvals < 0.05); % Significant genes expressionSubset = scaledData(strcmp(geneNames, topGenes), :); % Generate heatmap figure; heatmap(expressionSubset, 'Colormap', parula, 'GridVisible', 'off'); title('Expression Heatmap of Marker Genes'); xlabel('Cells'); ylabel('Genes'); ``` ### 8. Gene Set Enrichment and Pathway Analysis (Optional) For pathway enrichment, you can use Gene Set Enrichment Analysis (GSEA) with MATLAB’s bioinformatics toolbox if gene sets are preloaded. ```matlab % Load gene sets and perform GSEA (requires additional libraries for gene sets) geneSet = load('pathway_genes.mat'); gseaResults = geneont('File', 'gene_ontology.obo'); ``` ### Summary MATLAB provides robust options for preprocessing, clustering, and visualizing high-dimensional transcriptional data, especially with basic gene expression analysis workflows. While other languages like Python offer more single-cell-specific libraries, MATLAB is suitable for custom analysis pipelines and advanced visualization, particularly for researchers already comfortable with its environment. For high-level analysis, combining MATLAB with tools like R (e.g., Seurat) or Python (e.g., Scanpy) can be a powerful approach to extract biological insights from single-cell RNA-seq data. In Python, high-dimensional transcriptional profiling of cells, especially for single-cell RNA sequencing (scRNA-seq) data, is typically handled using libraries like **Scanpy** and **Seurat (through the SeuratDisk package to bridge R and Python)**. Scanpy is particularly powerful, as it provides comprehensive preprocessing, visualization, and clustering tools within a single framework. Here's a guide to using Python for high-dimensional transcriptional profiling: ### 1. Installing and Importing Necessary Packages To get started, install Scanpy and other essential libraries: ```bash pip install scanpy[leiden] anndata matplotlib seaborn ``` Then, import them in your Python environment: ```python import scanpy as sc import anndata as ad import matplotlib.pyplot as plt import seaborn as sns import numpy as np ``` ### 2. Loading and Preprocessing Data Typically, scRNA-seq data is stored as a matrix (cells × genes). If you have a CSV file or other common formats, you can load it into an AnnData object (the primary data structure in Scanpy): ```python # Load data from a CSV file where rows are genes and columns are cells adata = sc.read_csv('gene_expression.csv').transpose() # Load gene and cell names (optional) adata.var_names = np.loadtxt('gene_names.csv', dtype=str) # Genes adata.obs_names = np.loadtxt('cell_names.csv', dtype=str) # Cells ``` ### 3. Quality Control and Filtering For scRNA-seq, it’s essential to filter cells and genes based on counts. This step removes dead or low-quality cells and non-informative genes. ```python # Filter genes that are expressed in less than 3 cells and cells with fewer than 200 genes sc.pp.filter_genes(adata, min_cells=3) sc.pp.filter_cells(adata, min_genes=200) # Calculate QC metrics and filter based on quality adata.var['mt'] = adata.var_names.str.startswith('MT-') # Mitochondrial genes sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) # Filter cells based on mitochondrial gene content and total counts adata = adata[adata.obs['pct_counts_mt'] < 5, :] ``` ### 4. Normalization and Scaling Normalize gene expression counts and scale each gene to have unit variance: ```python # Normalize data so each cell has the same total count sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) # Log-transform data # Scale data to unit variance and zero mean for PCA sc.pp.scale(adata, max_value=10) ``` ### 5. Dimensionality Reduction (PCA, t-SNE, UMAP) Dimensionality reduction is essential for visualizing high-dimensional data in 2D or 3D. ```python # PCA sc.tl.pca(adata, svd_solver='arpack') # Plot the explained variance ratio of the PCs sc.pl.pca_variance_ratio(adata, log=True) # Compute neighborhood graph for clustering and embedding sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) # UMAP sc.tl.umap(adata) sc.pl.umap(adata, color='total_counts') # t-SNE (optional) sc.tl.tsne(adata) sc.pl.tsne(adata, color='total_counts') ``` ### 6. Clustering Cells into Subpopulations Scanpy provides several clustering methods, including the **Leiden algorithm**, which is particularly effective for scRNA-seq data. ```python # Cluster cells using the Leiden algorithm sc.tl.leiden(adata, resolution=0.5) # Adjust resolution as needed sc.pl.umap(adata, color='leiden') # Visualize clusters on UMAP plot ``` ### 7. Differential Gene Expression Identify marker genes that are differentially expressed between clusters. ```python # Identify marker genes for each cluster sc.tl.rank_genes_groups(adata, 'leiden', method='t-test') # Can use other methods like 'wilcoxon' sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False) ``` ### 8. Visualization of Gene Expression To visualize expression levels of specific genes or marker genes, use Violin plots or heatmaps. ```python # Violin plot for a specific gene sc.pl.violin(adata, ['GeneA', 'GeneB'], groupby='leiden') # Heatmap for top marker genes across clusters sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, groupby='leiden', cmap='viridis') ``` ### 9. Gene Set Enrichment and Pathway Analysis Gene Set Enrichment Analysis (GSEA) can link transcriptional changes to biological pathways. For this, Scanpy doesn’t have native support, so you may use packages like **gseapy**. ```bash pip install gseapy ``` ```python import gseapy as gp # Run enrichment analysis for a specific list of genes gene_list = adata.var_names[adata.var['highly_variable']].tolist() enr = gp.enrichr(gene_list=gene_list, gene_sets='KEGG_2016', organism='Human', outdir='enrichr_results') ``` ### 10. Integrating and Analyzing Additional Data Modalities For more advanced analyses, consider integrating other omics data (e.g., spatial transcriptomics or ATAC-seq). Libraries like **scvi-tools** and **Anndata** enable complex multimodal analyses. ### Full Example of Workflow ```python import scanpy as sc import matplotlib.pyplot as plt # Load and preprocess adata = sc.read_csv('gene_expression.csv').transpose() sc.pp.filter_genes(adata, min_cells=3) sc.pp.filter_cells(adata, min_genes=200) adata.var['mt'] = adata.var_names.str.startswith('MT-') sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True) adata = adata[adata.obs['pct_counts_mt'] < 5, :] sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.scale(adata, max_value=10) # Dimensionality reduction and clustering sc.tl.pca(adata) sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) sc.tl.umap(adata) sc.tl.leiden(adata, resolution=0.5) # Plotting sc.pl.umap(adata, color=['leiden', 'total_counts', 'pct_counts_mt']) # Differential expression sc.tl.rank_genes_groups(adata, 'leiden', method='t-test') sc.pl.rank_genes_groups(adata, n_genes=5, sharey=False) ``` ### Summary This pipeline provides an efficient way to process and analyze high-dimensional transcriptional data in Python using Scanpy. The workflow can be extended for multimodal analysis by integrating with additional libraries and tools. ### :point_right:[More on viadean.notion.site](https://viadean.notion.site/High-dimensional-transcriptional-profiling-of-cells-1361ae7b9a328082a918ef3498e2c90b?pvs=4)