## What is a mutational signature?
Different mutational processes often generate different combinations of mutation types, termed ‘signatures’. [Reference](https://www.nature.com/articles/nature12477)
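
For single-base substitutions, the "mutation types" are commonly the 96 trinucleotide channels (six substitution classes, each with four possible 5' and four possible 3' neighbouring bases), which is what `context_type="96"` refers to in the SigProfiler example in these notes. A minimal sketch that enumerates them; the label format `A[C>T]G` and the function name are illustrative choices, not tied to any particular tool:

```python
from itertools import product

# The six substitution classes, written with the pyrimidine (C or T) as the reference base
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"

def sbs96_channels():
    """Return the 96 channel labels in the form 5'[REF>ALT]3', e.g. 'A[C>T]G'."""
    return [f"{five}[{sub}]{three}"
            for sub in SUBSTITUTIONS
            for five, three in product(BASES, repeat=2)]

channels = sbs96_channels()
print(len(channels), channels[:3])  # 96 ['A[C>A]A', 'A[C>A]C', 'A[C>A]G']
```

A signature is then a probability vector over these 96 channels, and a sample's mutation catalog is a count vector over the same channels.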

## Non-negative matrix factorization (NMF)
NMF decomposes the mutation catalog matrix (mutation types × samples) into a signature matrix and an exposure matrix. Applied to large cancer cohorts, it revealed numerous mutational signatures.
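
The decomposition can be sketched with plain NumPy. This is a toy illustration using the standard multiplicative update rules on simulated data, not the procedure of any specific tool (published tools add replicates, resampling, and model selection on top). The function name `nmf`, the matrix sizes, and the simulated catalog are all assumptions made for the example:

```python
import numpy as np

def nmf(V, k, n_iter=1000, seed=0, eps=1e-10):
    """Factor a non-negative matrix V (channels x samples) into
    W (channels x k signatures) and H (k x samples exposures)."""
    rng = np.random.default_rng(seed)
    n_channels, n_samples = V.shape
    W = rng.random((n_channels, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared-error objective keep W and H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    # Rescale so each signature (column of W) sums to 1; move the scale into H
    scale = W.sum(axis=0)
    return W / scale, H * scale[:, None]

# Simulated catalog: 2 hypothetical signatures over 96 channels, 20 samples
rng = np.random.default_rng(1)
true_W = rng.dirichlet(np.ones(96), size=2).T    # 96 x 2
true_H = rng.integers(50, 500, size=(2, 20))     # 2 x 20
V = true_W @ true_H

W, H = nmf(V, k=2)
rel_error = np.abs(V - W @ H).sum() / V.sum()
print(W.shape, H.shape, rel_error)
```

Columns of `W` play the role of signatures and rows of `H` the role of per-sample exposures (mutation counts attributed to each signature).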

### Tools:
Many tools have been developed for mutational signature analysis, such as deconstructSigs, SigProfiler, MutationalPatterns, SignatureAnalyzer, sigflow, and sigminer.
1. deconstructSigs:
* One of the older tools, with many citations accumulated in the scientific literature
* User-friendly, especially for researchers familiar with R
* Well-suited for clinical applications and smaller studies
* Robust performance in assigning known signatures to individual samples
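```r
library(deconstructSigs)
library(BSgenome.Hsapiens.UCSC.hg19)

# Read the VCF records. read.table() skips every line starting with "#",
# including the "#CHROM" header, so the column names are assigned by hand.
vcf <- read.table("path/to/your/vcf_file.vcf", sep = "\t",
                  stringsAsFactors = FALSE)[, c(1, 2, 4, 5)]
colnames(vcf) <- c("CHROM", "POS", "REF", "ALT")
# A VCF has no sample column, so add one ("sample1" is a placeholder ID).
vcf$Sample <- "sample1"
# Chromosome names must match the BSgenome object ("chr1", not "1").

# Create the mutation matrix (one row per sample)
mut_matrix <- mut.to.sigs.input(mut.ref = vcf,
                                sample.id = "Sample",
                                chr = "CHROM",
                                pos = "POS",
                                ref = "REF",
                                alt = "ALT",
                                bsg = BSgenome.Hsapiens.UCSC.hg19,
                                sig.type = "SBS")

# Run deconstructSigs; sample.id must be a row name of mut_matrix
sigs <- whichSignatures(tumor.ref = mut_matrix,
                        signatures.ref = signatures.cosmic,
                        sample.id = "sample1",
                        contexts.needed = TRUE,
                        tri.counts.method = "exome2genome")
# According to the package documentation, "exome2genome" is recommended for exome data.
# signatures.ref can be replaced by another COSMIC signature set.

# Plot results
plotSignatures(sigs)
```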
2. SigProfiler:
* Developed by the team that originally described mutational signatures
* Part of a comprehensive suite of tools for signature analysis
* Gaining popularity, especially for larger studies and more complex analyses
* SigProfilerAssignment is increasingly used for single-sample analysis
```python
from SigProfilerExtractor import sigpro as sig
from SigProfilerAssignment import Analyzer as Analyze

# The reference genome must be installed once beforehand, e.g.:
#   from SigProfilerMatrixGenerator import install as genInstall
#   genInstall.install("GRCh37")

# Define input and output directories (input_dir holds the VCF files)
input_dir = "/path/to/input/directory"
output_dir = "/path/to/output/directory"

if __name__ == "__main__":  # guard needed because the extractor uses multiprocessing
    # Run SigProfilerExtractor (de novo extraction)
    sig.sigProfilerExtractor(input_type="vcf",
                             output=output_dir,
                             input_data=input_dir,
                             reference_genome="GRCh37",
                             minimum_signatures=1,
                             maximum_signatures=3,
                             nmf_replicates=100)

    # Run SigProfilerAssignment (fit known COSMIC signatures)
    Analyze.cosmic_fit(samples=input_dir,
                       output=output_dir,
                       input_type="vcf",
                       context_type="96",
                       genome_build="GRCh37",
                       make_plots=True,
                       sample_reconstruction_plots=True,
                       exclude_signature_subgroups=None,
                       cosmic_version=3.4)
```
::: warning
SigProfilerExtractor:
De novo extraction of mutational signatures from a set of samples. Use it when you want to discover new signatures, or to extract the signatures active in your dataset, without relying on predefined signatures. The tool performs non-negative matrix factorization (NMF) on the mutation matrix, identifies the optimal number of signatures in the dataset, and reports the extracted signatures together with their relative contributions in each sample.

SigProfilerAssignment:
Assigns known mutational signatures to individual samples. Use it when you want to determine how much established signatures (e.g., COSMIC signatures) contribute to your samples. Given a set of predefined signatures, the tool estimates the contribution of each known signature to the mutational profile of each sample.

[Reference](https://pmc.ncbi.nlm.nih.gov/articles/PMC10369904/)
:::
3. MutationalPatterns:
* Offers a good balance of functionality and ease of use
* Provides comprehensive visualization options
* Integrates well with other bioinformatics workflows in R
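
Assigning known signatures to a sample, as deconstructSigs and SigProfilerAssignment do, amounts to finding non-negative exposures whose weighted sum of reference signatures best reproduces the sample's catalog. The sketch below illustrates that idea with a simple multiplicative update in NumPy. It is not the algorithm of either tool, and the 4-channel reference matrix is hypothetical (real reference sets use 96 channels):

```python
import numpy as np

def fit_exposures(v, W, n_iter=2000, eps=1e-10):
    """Estimate non-negative exposures h such that W @ h approximates the catalog v."""
    n_signatures = W.shape[1]
    h = np.full(n_signatures, v.sum() / n_signatures)  # start from equal exposures
    WtV, WtW = W.T @ v, W.T @ W
    for _ in range(n_iter):
        # Multiplicative update: h stays non-negative, signatures W stay fixed
        h *= WtV / (WtW @ h + eps)
    return h

# Hypothetical reference: 3 signatures (columns, each summing to 1) over 4 channels
W = np.array([[0.7, 0.1, 0.1],
              [0.1, 0.7, 0.1],
              [0.1, 0.1, 0.7],
              [0.1, 0.1, 0.1]])
# Simulated sample: 100 mutations from signature 1, none from 2, 50 from signature 3
v = W @ np.array([100.0, 0.0, 50.0])

h = fit_exposures(v, W)
print(np.round(h, 1))
```

Unlike de novo extraction, the signature matrix is never updated here; only the per-sample exposures are estimated.
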
## QC metrics
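1. Cosine Similarity
* Definition: Cosine similarity measures the angle between two vectors and ranges from -1 to 1 (from 0 to 1 for non-negative vectors such as mutation counts). The closer the value is to 1, the more similar the two vectors are.
* Interpretation: For mutational signatures, a cosine similarity close to 1 between two signatures means their mutation patterns are very similar and may arise from the same mutational process.
2. L1 Norm (%)
* Definition: The L1 norm (also called the Manhattan distance) is the sum of the absolute differences between two vectors. L1 Norm % expresses this distance as a percentage.
* Interpretation: For mutational signatures, L1 Norm % measures the gap between the observed and the expected mutation counts. A lower L1 Norm % indicates a better model fit.
3. L2 Norm (%)
* Definition: The L2 norm (or Euclidean distance) is the square root of the sum of squared differences between two vectors. L2 Norm % likewise expresses this distance as a percentage.
* Interpretation: L2 Norm % is another measure of how well the signatures fit. It is generally more sensitive to large deviations, so it better captures the influence of extreme values.
4. KL Divergence
* Definition: Kullback-Leibler divergence (KL divergence) measures how one probability distribution differs from another. It quantifies the information lost when one distribution is used to approximate the other.
* Interpretation: In mutational signature analysis, KL divergence evaluates the information loss between the observed and the expected mutation profile. A higher KL divergence indicates a substantial difference between the two.
5. Correlation
* Definition: Correlation measures the strength of the linear relationship between two variables, usually expressed as the Pearson correlation coefficient.
* Interpretation: In mutational signature analysis, correlation helps show whether different signatures share common patterns or are related to each other. A coefficient close to 1 indicates that the signatures are highly consistent across certain mutation features.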
(Summarized by Perplexity)
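
A minimal sketch of how these five metrics could be computed when comparing an observed catalog with its reconstruction. Two choices here are assumptions, since the notes do not specify them: the percentages are taken relative to the norm of the observed vector, and KL divergence is computed on profiles normalised to sum to 1. The function name `fit_metrics` and the example counts are made up for illustration:

```python
import math

def _norm(x, p):
    """p-norm of a vector given as a list of numbers."""
    return sum(abs(a) ** p for a in x) ** (1 / p)

def fit_metrics(observed, reconstructed, eps=1e-12):
    """Compare two equal-length, non-negative count vectors."""
    diff = [o - r for o, r in zip(observed, reconstructed)]
    dot = sum(o * r for o, r in zip(observed, reconstructed))
    cosine = dot / (_norm(observed, 2) * _norm(reconstructed, 2))
    # Assumption: percentages are relative to the norm of the observed vector
    l1_pct = 100 * _norm(diff, 1) / _norm(observed, 1)
    l2_pct = 100 * _norm(diff, 2) / _norm(observed, 2)
    # KL divergence on profiles normalised to sum to 1; eps avoids log(0)
    p = [o / sum(observed) for o in observed]
    q = [r / sum(reconstructed) for r in reconstructed]
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)
    # Pearson correlation coefficient
    mo, mr = sum(observed) / len(observed), sum(reconstructed) / len(reconstructed)
    cov = sum((o - mo) * (r - mr) for o, r in zip(observed, reconstructed))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sr = math.sqrt(sum((r - mr) ** 2 for r in reconstructed))
    return {"cosine": cosine, "l1_pct": l1_pct, "l2_pct": l2_pct,
            "kl": kl, "pearson": cov / (so * sr)}

metrics = fit_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(metrics)
```

A good fit shows cosine similarity and correlation near 1, with the two norm percentages and the KL divergence near 0.
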
## Number of input variants matters

[Reference](https://www.nature.com/articles/s41467-024-53711-6)
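
One intuition for why the variant count matters: a sample's observed catalog is a finite random draw from its underlying mutation profile, so a catalog with few mutations is a noisy estimate of that profile. The simulation below is only an illustration of this sampling effect, not a reproduction of the referenced paper. The 96-channel profile and the function names are hypothetical:

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_similarity(profile, n_mutations, n_rep=200, seed=0):
    """Average cosine similarity between catalogs of n_mutations sampled
    mutations and the profile they were drawn from."""
    rng = random.Random(seed)
    channels = range(len(profile))
    total = 0.0
    for _ in range(n_rep):
        counts = [0] * len(profile)
        for idx in rng.choices(channels, weights=profile, k=n_mutations):
            counts[idx] += 1
        total += cosine(counts, profile)
    return total / n_rep

# Hypothetical 96-channel profile (not a real signature), normalised to sum to 1
weights = [1 + (i % 4) for i in range(96)]
profile = [w / sum(weights) for w in weights]

few = mean_similarity(profile, n_mutations=20)
many = mean_similarity(profile, n_mutations=2000)
print(f"20 mutations: {few:.2f}, 2000 mutations: {many:.2f}")
```

With few mutations the sampled catalog resembles its own source profile only loosely, which limits how reliably any tool can assign signatures to it.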