Try   HackMD

Code snippets to High-dimensional transcriptional profiling

MATLAB offers tools and workflows for analyzing high-dimensional transcriptional profiling data from techniques like single-cell RNA sequencing (scRNA-seq) or bulk RNA-seq, focusing on gene expression analysis, clustering, visualization, and dimensionality reduction. Here's an overview of using MATLAB to analyze high-dimensional transcriptional data from individual cells, from preprocessing to visualization.

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

1. Loading and Preprocessing Data

In MATLAB, you can start by loading raw gene expression data, which is typically stored as a matrix where rows represent genes and columns represent cells. Several formats, such as CSV, MAT, or HDF5, can be read using MATLAB's readmatrix or importdata functions.

% Example: Load gene expression matrix
expressionData = readmatrix('gene_expression.csv');  % Load raw data
geneNames = readcell('gene_names.csv');  % Load gene names
cellNames = readcell('cell_names.csv');  % Load cell names

% Basic Preprocessing
expressionData = log1p(expressionData); % Log transformation to normalize data

2. Quality Control and Filtering

Filter low-quality cells and genes with low counts to remove noise. Basic metrics include:

  • Removing cells with low gene counts (indicating dead or unhealthy cells).
  • Removing genes with low expression across cells.
% Set thresholds for quality control
minGeneCount = 200;
minCellCount = 5;

% Filter cells and genes
geneFilter = sum(expressionData > 0, 2) >= minGeneCount; % Genes expressed in enough cells
cellFilter = sum(expressionData > 0, 1) >= minCellCount; % Cells with enough genes expressed
expressionData = expressionData(geneFilter, cellFilter);
geneNames = geneNames(geneFilter);
cellNames = cellNames(cellFilter);

3. Normalization and Scaling

Normalization controls for differences in library size and sequencing depth, and scaling centers the data.

% Normalize expression to account for sequencing depth
normalizedData = expressionData ./ sum(expressionData, 1) * median(sum(expressionData, 1));

% Z-score scaling for gene-wise normalization
scaledData = zscore(normalizedData, 0, 2); % Normalize each gene's expression

4. Dimensionality Reduction (PCA, t-SNE, UMAP)

Dimensionality reduction helps visualize and interpret high-dimensional data. MATLAB’s Statistics and Machine Learning Toolbox offers methods like PCA, and there are packages for t-SNE and UMAP.

% Perform PCA
[coeff, score, latent] = pca(scaledData');

% Visualize first two principal components
figure;
scatter(score(:,1), score(:,2), 10, 'filled');
title('PCA of Gene Expression');
xlabel('PC1');
ylabel('PC2');

% Perform t-SNE
tsneData = tsne(scaledData', 'Algorithm', 'barneshut', 'NumDimensions', 2);
figure;
scatter(tsneData(:,1), tsneData(:,2), 10, 'filled');
title('t-SNE of Gene Expression');
xlabel('t-SNE 1');
ylabel('t-SNE 2');

For UMAP, use the MATLAB Toolbox "umap" available on MATLAB File Exchange.

5. Clustering Cells into Subpopulations

Clustering reveals distinct cell populations based on transcriptional similarity. MATLAB’s kmeans or cluster functions are useful, and hierarchical clustering can visualize relationships among clusters.

% k-means clustering
numClusters = 5; % Set the number of clusters
[idx, C] = kmeans(score(:, 1:10), numClusters); % Clustering on the top 10 PCs

% Visualize clusters
figure;
gscatter(score(:,1), score(:,2), idx);
title('Cell Clusters in PCA Space');
xlabel('PC1');
ylabel('PC2');

6. Differential Gene Expression

To identify marker genes that characterize each cluster, compare gene expression across clusters using t-tests or ANOVA.

% Perform differential expression analysis
cluster1 = expressionData(:, idx == 1);
cluster2 = expressionData(:, idx == 2);
[~, pvals] = ttest2(cluster1', cluster2', 'Vartype', 'unequal');
adjustedPvals = mafdr(pvals, 'BHFDR', true); % FDR correction

7. Visualization and Annotation

Heatmaps and dot plots help visualize expression patterns. MATLAB’s heatmap function can be customized to highlight key genes and clusters.

% Select top marker genes for clusters
topGenes = geneNames(adjustedPvals < 0.05); % Significant genes
expressionSubset = scaledData(strcmp(geneNames, topGenes), :);

% Generate heatmap
figure;
heatmap(expressionSubset, 'Colormap', parula, 'GridVisible', 'off');
title('Expression Heatmap of Marker Genes');
xlabel('Cells');
ylabel('Genes');

8. Gene Set Enrichment and Pathway Analysis (Optional)

For pathway enrichment, you can use Gene Set Enrichment Analysis (GSEA) with MATLAB’s bioinformatics toolbox if gene sets are preloaded.

% Load gene sets and perform GSEA (requires additional libraries for gene sets)
geneSet = load('pathway_genes.mat');
gseaResults = geneont('File', 'gene_ontology.obo');

Summary

MATLAB provides robust options for preprocessing, clustering, and visualizing high-dimensional transcriptional data, especially with basic gene expression analysis workflows. While other languages like Python offer more single-cell-specific libraries, MATLAB is suitable for custom analysis pipelines and advanced visualization, particularly for researchers already comfortable with its environment. For high-level analysis, combining MATLAB with tools like R (e.g., Seurat) or Python (e.g., Scanpy) can be a powerful approach to extract biological insights from single-cell RNA-seq data.

In Python, high-dimensional transcriptional profiling of cells, especially for single-cell RNA sequencing (scRNA-seq) data, is typically handled using libraries like Scanpy and Seurat (through the SeuratDisk package to bridge R and Python). Scanpy is particularly powerful, as it provides comprehensive preprocessing, visualization, and clustering tools within a single framework. Here's a guide to using Python for high-dimensional transcriptional profiling:

1. Installing and Importing Necessary Packages

To get started, install Scanpy and other essential libraries:

pip install scanpy[leiden] anndata matplotlib seaborn

Then, import them in your Python environment:

import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

2. Loading and Preprocessing Data

Typically, scRNA-seq data is stored as a matrix (cells × genes). If you have a CSV file or other common formats, you can load it into an AnnData object (the primary data structure in Scanpy):

# Load data from a CSV file where rows are genes and columns are cells
adata = sc.read_csv('gene_expression.csv').transpose()

# Load gene and cell names (optional)
adata.var_names = np.loadtxt('gene_names.csv', dtype=str)  # Genes
adata.obs_names = np.loadtxt('cell_names.csv', dtype=str)  # Cells

3. Quality Control and Filtering

For scRNA-seq, it’s essential to filter cells and genes based on counts. This step removes dead or low-quality cells and non-informative genes.

# Filter genes that are expressed in less than 3 cells and cells with fewer than 200 genes
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

# Calculate QC metrics and filter based on quality
adata.var['mt'] = adata.var_names.str.startswith('MT-')  # Mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Filter cells based on mitochondrial gene content and total counts
adata = adata[adata.obs['pct_counts_mt'] < 5, :]

4. Normalization and Scaling

Normalize gene expression counts and scale each gene to have unit variance:

# Normalize data so each cell has the same total count
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)  # Log-transform data

# Scale data to unit variance and zero mean for PCA
sc.pp.scale(adata, max_value=10)

5. Dimensionality Reduction (PCA, t-SNE, UMAP)

Dimensionality reduction is essential for visualizing high-dimensional data in 2D or 3D.

# PCA
sc.tl.pca(adata, svd_solver='arpack')

# Plot the explained variance ratio of the PCs
sc.pl.pca_variance_ratio(adata, log=True)

# Compute neighborhood graph for clustering and embedding
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

# UMAP
sc.tl.umap(adata)
sc.pl.umap(adata, color='total_counts')

# t-SNE (optional)
sc.tl.tsne(adata)
sc.pl.tsne(adata, color='total_counts')

6. Clustering Cells into Subpopulations

Scanpy provides several clustering methods, including the Leiden algorithm, which is particularly effective for scRNA-seq data.

# Cluster cells using the Leiden algorithm
sc.tl.leiden(adata, resolution=0.5)  # Adjust resolution as needed
sc.pl.umap(adata, color='leiden')  # Visualize clusters on UMAP plot

7. Differential Gene Expression

Identify marker genes that are differentially expressed between clusters.

# Identify marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')  # Can use other methods like 'wilcoxon'
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)

8. Visualization of Gene Expression

To visualize expression levels of specific genes or marker genes, use Violin plots or heatmaps.

# Violin plot for a specific gene
sc.pl.violin(adata, ['GeneA', 'GeneB'], groupby='leiden')

# Heatmap for top marker genes across clusters
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, groupby='leiden', cmap='viridis')

9. Gene Set Enrichment and Pathway Analysis

Gene Set Enrichment Analysis (GSEA) can link transcriptional changes to biological pathways. For this, Scanpy doesn’t have native support, so you may use packages like gseapy.

pip install gseapy
import gseapy as gp

# Run enrichment analysis for a specific list of genes
gene_list = adata.var_names[adata.var['highly_variable']].tolist()
enr = gp.enrichr(gene_list=gene_list, gene_sets='KEGG_2016', organism='Human', outdir='enrichr_results')

10. Integrating and Analyzing Additional Data Modalities

For more advanced analyses, consider integrating other omics data (e.g., spatial transcriptomics or ATAC-seq). Libraries like scvi-tools and Anndata enable complex multimodal analyses.

Full Example of Workflow

import scanpy as sc
import matplotlib.pyplot as plt

# Load and preprocess
adata = sc.read_csv('gene_expression.csv').transpose()
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 5, :]
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)

# Dimensionality reduction and clustering
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

# Plotting
sc.pl.umap(adata, color=['leiden', 'total_counts', 'pct_counts_mt'])

# Differential expression
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=5, sharey=False)

Summary

This pipeline provides an efficient way to process and analyze high-dimensional transcriptional data in Python using Scanpy. The workflow can be extended for multimodal analysis by integrating with additional libraries and tools.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
More on viadean.notion.site