# Code snippets to High-dimensional transcriptional profiling
MATLAB offers tools and workflows for analyzing high-dimensional transcriptional profiling data from techniques like single-cell RNA sequencing (scRNA-seq) or bulk RNA-seq, focusing on gene expression analysis, clustering, visualization, and dimensionality reduction. Here's an overview of using MATLAB to analyze high-dimensional transcriptional data from individual cells, from preprocessing to visualization.

### 1. Loading and Preprocessing Data
In MATLAB, you can start by loading raw gene expression data, which is typically stored as a matrix where rows represent genes and columns represent cells. Several formats, such as CSV, MAT, or HDF5, can be read using MATLAB's `readmatrix` or `importdata` functions.
```matlab
% Example: Load gene expression matrix
expressionData = readmatrix('gene_expression.csv'); % Load raw data
geneNames = readcell('gene_names.csv'); % Load gene names
cellNames = readcell('cell_names.csv'); % Load cell names
% Basic Preprocessing
expressionData = log1p(expressionData); % Log transformation to normalize data
```
### 2. Quality Control and Filtering
Filter low-quality cells and genes with low counts to remove noise. Basic metrics include:
- Removing cells with low gene counts (indicating dead or unhealthy cells).
- Removing genes with low expression across cells.
```matlab
% Set thresholds for quality control
minGeneCount = 200;
minCellCount = 5;
% Filter cells and genes
geneFilter = sum(expressionData > 0, 2) >= minGeneCount; % Genes expressed in enough cells
cellFilter = sum(expressionData > 0, 1) >= minCellCount; % Cells with enough genes expressed
expressionData = expressionData(geneFilter, cellFilter);
geneNames = geneNames(geneFilter);
cellNames = cellNames(cellFilter);
```
### 3. Normalization and Scaling
Normalization controls for differences in library size and sequencing depth, and scaling centers the data.
```matlab
% Normalize expression to account for sequencing depth
normalizedData = expressionData ./ sum(expressionData, 1) * median(sum(expressionData, 1));
% Z-score scaling for gene-wise normalization
scaledData = zscore(normalizedData, 0, 2); % Normalize each gene's expression
```
### 4. Dimensionality Reduction (PCA, t-SNE, UMAP)
Dimensionality reduction helps visualize and interpret high-dimensional data. MATLAB’s Statistics and Machine Learning Toolbox offers methods like PCA, and there are packages for t-SNE and UMAP.
```matlab
% Perform PCA
[coeff, score, latent] = pca(scaledData');
% Visualize first two principal components
figure;
scatter(score(:,1), score(:,2), 10, 'filled');
title('PCA of Gene Expression');
xlabel('PC1');
ylabel('PC2');
% Perform t-SNE
tsneData = tsne(scaledData', 'Algorithm', 'barneshut', 'NumDimensions', 2);
figure;
scatter(tsneData(:,1), tsneData(:,2), 10, 'filled');
title('t-SNE of Gene Expression');
xlabel('t-SNE 1');
ylabel('t-SNE 2');
```
For UMAP, use the MATLAB Toolbox "umap" available on MATLAB File Exchange.
### 5. Clustering Cells into Subpopulations
Clustering reveals distinct cell populations based on transcriptional similarity. MATLAB’s `kmeans` or `cluster` functions are useful, and hierarchical clustering can visualize relationships among clusters.
```matlab
% k-means clustering
numClusters = 5; % Set the number of clusters
[idx, C] = kmeans(score(:, 1:10), numClusters); % Clustering on the top 10 PCs
% Visualize clusters
figure;
gscatter(score(:,1), score(:,2), idx);
title('Cell Clusters in PCA Space');
xlabel('PC1');
ylabel('PC2');
```
### 6. Differential Gene Expression
To identify marker genes that characterize each cluster, compare gene expression across clusters using t-tests or ANOVA.
```matlab
% Perform differential expression analysis
cluster1 = expressionData(:, idx == 1);
cluster2 = expressionData(:, idx == 2);
[~, pvals] = ttest2(cluster1', cluster2', 'Vartype', 'unequal');
adjustedPvals = mafdr(pvals, 'BHFDR', true); % FDR correction
```
### 7. Visualization and Annotation
Heatmaps and dot plots help visualize expression patterns. MATLAB’s `heatmap` function can be customized to highlight key genes and clusters.
```matlab
% Select top marker genes for clusters
topGenes = geneNames(adjustedPvals < 0.05); % Significant genes
expressionSubset = scaledData(strcmp(geneNames, topGenes), :);
% Generate heatmap
figure;
heatmap(expressionSubset, 'Colormap', parula, 'GridVisible', 'off');
title('Expression Heatmap of Marker Genes');
xlabel('Cells');
ylabel('Genes');
```
### 8. Gene Set Enrichment and Pathway Analysis (Optional)
For pathway enrichment, you can use Gene Set Enrichment Analysis (GSEA) with MATLAB’s bioinformatics toolbox if gene sets are preloaded.
```matlab
% Load gene sets and perform GSEA (requires additional libraries for gene sets)
geneSet = load('pathway_genes.mat');
gseaResults = geneont('File', 'gene_ontology.obo');
```
### Summary
MATLAB provides robust options for preprocessing, clustering, and visualizing high-dimensional transcriptional data, especially with basic gene expression analysis workflows. While other languages like Python offer more single-cell-specific libraries, MATLAB is suitable for custom analysis pipelines and advanced visualization, particularly for researchers already comfortable with its environment. For high-level analysis, combining MATLAB with tools like R (e.g., Seurat) or Python (e.g., Scanpy) can be a powerful approach to extract biological insights from single-cell RNA-seq data.
In Python, high-dimensional transcriptional profiling of cells, especially for single-cell RNA sequencing (scRNA-seq) data, is typically handled using libraries like **Scanpy** and **Seurat (through the SeuratDisk package to bridge R and Python)**. Scanpy is particularly powerful, as it provides comprehensive preprocessing, visualization, and clustering tools within a single framework. Here's a guide to using Python for high-dimensional transcriptional profiling:
### 1. Installing and Importing Necessary Packages
To get started, install Scanpy and other essential libraries:
```bash
pip install scanpy[leiden] anndata matplotlib seaborn
```
Then, import them in your Python environment:
```python
import scanpy as sc
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
```
### 2. Loading and Preprocessing Data
Typically, scRNA-seq data is stored as a matrix (cells × genes). If you have a CSV file or other common formats, you can load it into an AnnData object (the primary data structure in Scanpy):
```python
# Load data from a CSV file where rows are genes and columns are cells
adata = sc.read_csv('gene_expression.csv').transpose()
# Load gene and cell names (optional)
adata.var_names = np.loadtxt('gene_names.csv', dtype=str) # Genes
adata.obs_names = np.loadtxt('cell_names.csv', dtype=str) # Cells
```
### 3. Quality Control and Filtering
For scRNA-seq, it’s essential to filter cells and genes based on counts. This step removes dead or low-quality cells and non-informative genes.
```python
# Filter genes that are expressed in less than 3 cells and cells with fewer than 200 genes
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)
# Calculate QC metrics and filter based on quality
adata.var['mt'] = adata.var_names.str.startswith('MT-') # Mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Filter cells based on mitochondrial gene content and total counts
adata = adata[adata.obs['pct_counts_mt'] < 5, :]
```
### 4. Normalization and Scaling
Normalize gene expression counts and scale each gene to have unit variance:
```python
# Normalize data so each cell has the same total count
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata) # Log-transform data
# Scale data to unit variance and zero mean for PCA
sc.pp.scale(adata, max_value=10)
```
### 5. Dimensionality Reduction (PCA, t-SNE, UMAP)
Dimensionality reduction is essential for visualizing high-dimensional data in 2D or 3D.
```python
# PCA
sc.tl.pca(adata, svd_solver='arpack')
# Plot the explained variance ratio of the PCs
sc.pl.pca_variance_ratio(adata, log=True)
# Compute neighborhood graph for clustering and embedding
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
# UMAP
sc.tl.umap(adata)
sc.pl.umap(adata, color='total_counts')
# t-SNE (optional)
sc.tl.tsne(adata)
sc.pl.tsne(adata, color='total_counts')
```
### 6. Clustering Cells into Subpopulations
Scanpy provides several clustering methods, including the **Leiden algorithm**, which is particularly effective for scRNA-seq data.
```python
# Cluster cells using the Leiden algorithm
sc.tl.leiden(adata, resolution=0.5) # Adjust resolution as needed
sc.pl.umap(adata, color='leiden') # Visualize clusters on UMAP plot
```
### 7. Differential Gene Expression
Identify marker genes that are differentially expressed between clusters.
```python
# Identify marker genes for each cluster
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test') # Can use other methods like 'wilcoxon'
sc.pl.rank_genes_groups(adata, n_genes=10, sharey=False)
```
### 8. Visualization of Gene Expression
To visualize expression levels of specific genes or marker genes, use Violin plots or heatmaps.
```python
# Violin plot for a specific gene
sc.pl.violin(adata, ['GeneA', 'GeneB'], groupby='leiden')
# Heatmap for top marker genes across clusters
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, groupby='leiden', cmap='viridis')
```
### 9. Gene Set Enrichment and Pathway Analysis
Gene Set Enrichment Analysis (GSEA) can link transcriptional changes to biological pathways. For this, Scanpy doesn’t have native support, so you may use packages like **gseapy**.
```bash
pip install gseapy
```
```python
import gseapy as gp
# Run enrichment analysis for a specific list of genes
gene_list = adata.var_names[adata.var['highly_variable']].tolist()
enr = gp.enrichr(gene_list=gene_list, gene_sets='KEGG_2016', organism='Human', outdir='enrichr_results')
```
### 10. Integrating and Analyzing Additional Data Modalities
For more advanced analyses, consider integrating other omics data (e.g., spatial transcriptomics or ATAC-seq). Libraries like **scvi-tools** and **Anndata** enable complex multimodal analyses.
### Full Example of Workflow
```python
import scanpy as sc
import matplotlib.pyplot as plt
# Load and preprocess
adata = sc.read_csv('gene_expression.csv').transpose()
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 5, :]
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.scale(adata, max_value=10)
# Dimensionality reduction and clustering
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
# Plotting
sc.pl.umap(adata, color=['leiden', 'total_counts', 'pct_counts_mt'])
# Differential expression
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=5, sharey=False)
```
### Summary
This pipeline provides an efficient way to process and analyze high-dimensional transcriptional data in Python using Scanpy. The workflow can be extended for multimodal analysis by integrating with additional libraries and tools.
### :point_right:[More on viadean.notion.site](https://viadean.notion.site/High-dimensional-transcriptional-profiling-of-cells-1361ae7b9a328082a918ef3498e2c90b?pvs=4)