[Bioinformatics] Sequence Variant Nomenclature

## Sequence Variant Nomenclature

Sequence variant nomenclature is the discipline of naming sequence variants. The rules of the Human Genome Variation Society (HGVS) are the naming convention currently accepted by the academic community[[1]](https://kknews.cc/news/53oeqvk.html).

---

## Level

Following the [central dogma](https://zh.wikipedia.org/zh-tw/%E4%B8%AD%E5%BF%83%E6%B3%95%E5%89%87) of molecular biology, "DNA → RNA → protein", the same mutation can be described in several ways depending on the level from which it is viewed. A different reference sequence or a different level of description (DNA, RNA or protein) yields a different name for the same variant. The reference sequences in common use are the ==genomic reference sequence (prefix "g.")==, the ==coding DNA reference sequence (prefix "c.")==, the non-coding DNA reference sequence (prefix "n."), the RNA reference sequence (prefix "r.") and the ==protein reference sequence (prefix "p.")==. In everyday use, ==g.==, ==c.== and ==p.== are the three most common prefixes.

### genomic reference (g.)

Every base of the DNA is numbered linearly along each chromosome, from the p-arm (short arm) to the q-arm (long arm). The coordinates depend on the reference used: the same variant may have different coordinates in the human genome builds hg19 and hg38.

### coding sequence (c.)

Positions are numbered along the coding sequence of each gene; the translation start (start codon) and translation stop (stop codon) mark the first and last positions. A gene may have several coding transcripts, so the same variant can have different coding sequence numbers depending on the transcript chosen.

### protein sequence (p.)

The coding sequence of each gene can be translated into a protein sequence according to the [genetic code](https://zh.wikipedia.org/zh-hant/%E9%81%97%E4%BC%A0%E5%AF%86%E7%A0%81). If a gene has several coding transcripts, it may produce different proteins, and the protein sequence numbering changes accordingly.

---

## Numbering

The figure below is a simple HGVS illustration of the numbering rules for the three levels, genomic reference (g.), coding sequence (c.) and protein sequence (p.), across different regions (exon, intron, UTR)[[2]](https://varnomen.hgvs.org/bg-material/numbering/).

![](https://i.imgur.com/qBEKm8q.png =150%x)

### Exonic

Setting the other regions aside and taking the exon as an example, the variant marked by the red box in the figure above is named at each level as follows:

:::info
* Genomic reference sequence: **g.306C>A**.
* Coding DNA reference sequence: **c.6C>A**.
* Protein reference sequence: **p.D2E**.
:::

In practice, a variant name follows the rules above and also carries information such as the chromosome accession, the coding DNA reference sequence version and the protein sequence version. Take **BRAF V600E**, a variant common in cancer, as an example:

:::info
* Genomic reference sequence: **NC_000007.13:g.140453136A>T**
  > The nucleotide A at position 140453136 of chromosome 7 is replaced by T. (GRCh37)
* Coding DNA reference sequence: **NM_004333.6(BRAF):c.1799T>A**
  > The nucleotide T at coding sequence position 1799 of the BRAF coding transcript *NM_004333.6* is replaced by A.
* Protein reference sequence: **NP_004324.2:p.V600E**
  > The amino acid V (valine) at protein sequence position 600 of the BRAF protein *NP_004324.2* is replaced by E (glutamate).
:::

### Intronic

> * nucleotides at the 5’ end of an intron are numbered relative to the last nucleotide of the directly upstream exon, followed by a “+” (plus) and their position into the intron, like c.87+1, c.87+2, c.87+3, …
> * nucleotides at the 3’ end of an intron are numbered relative to the first nucleotide of the directly downstream exon, followed by a “-” (minus) and their position out of the intron, like …, c.88-3, c.88-2, c.88-1.

A variant in an intron is named by its position relative to the nearest exon: positions near the 5' end of the intron use "+", and positions near the 3' end use "-". Taking the variants marked by the green and blue boxes in the figure above:

* The variant in the green box is named at each level as:
  > g.315G>C, c.12+3G>C, p.?
* The variant in the blue box is named at each level as:
  > g.411A>G, c.13-2A>G, p.?

A position very close to an exon (a common definition is within ±3) may affect RNA splicing; these regions are usually referred to separately as splice sites.

### UTR (untranslated region)

> * nucleotides upstream (5’) of the ATG-translation initiation codon (start) are marked with a “-” (minus) and numbered c.-1, c.-2, c.-3, etc. (i.e. going further upstream)
> * nucleotides downstream (3’) of the translation termination codon (stop) are marked with a “*” (asterisk) and numbered c.*1, c.*2, c.*3, etc. (i.e. going further downstream)
> * there is no nucleotide c.0.

A variant in a UTR is named by its position relative to the translation start and stop codons: positions in the 5' UTR use "-", and positions in the 3' UTR use "*". Taking the variants marked by the orange and purple boxes in the figure above:

* The variant in the orange box is named at each level as:
  > g.299C>G, c.-2C>G, p.?
* The variant in the purple box is named at each level as:
  > g.1633A>T, c.*3A>T, p.?

The effect of an intronic or UTR variant on the protein sequence cannot be derived from the reference sequence alone, so the protein-level name is usually written with a "?" (p.?).

---

## Variant type (DNA level)

The naming rules differ by variant type, such as substitution, deletion, insertion, duplication and deletion-insertion, as shown in the figure below:

![](https://i.imgur.com/7KHAMjB.png =50%x)

### Substitution

> * Definition: a sequence change where, compared to a reference sequence, ==one nucleotide is replaced by one other nucleotide==.
> * Format: “prefix”“position_substituted”“reference_nucleotide”“>”“new_nucleotide”, e.g. g.123A>G, c.6T>C

### Deletion

> * Definition: a sequence change where, compared to a reference sequence, ==one or more nucleotides are not present (deleted)==.
> * Format: “prefix”“position(s)_deleted”“del”, e.g. g.123_127del, c.4del

### Insertion

> * Definition: a sequence change where, compared to the reference sequence, ==one or more nucleotides are inserted and where the insertion is not a copy of a sequence== immediately 5'.
> * Format: “prefix”“positions_flanking”“ins”“inserted_sequence”, e.g. g.123_124insAGC, c.3_4insC

### Duplication

> * Definition: a sequence change where, compared to a reference sequence, ==a copy of one or more nucleotides are inserted directly 3' of the original copy of that sequence==.
> * Format: “prefix”“position(s)_duplicated”“dup”, e.g. g.123_345dup, c.6dup

### Deletion-insertion

> * Definition: a sequence change where, compared to a reference sequence, ==one or more nucleotides are replaced by one or more other nucleotides== and which is not a substitution, inversion or conversion.
> * Format: “prefix”“position(s)_deleted”“delins”“inserted_sequence”, e.g. g.123_127delinsAG, c.4delinsAC

## Variant type (Protein level)

The naming rules differ by variant type, such as substitution, deletion, insertion, duplication and frameshift, as shown in the figure below:

![](https://i.imgur.com/lB0wKZW.png)

### Substitution

> * Definition: a sequence change where, compared to a reference sequence, ==one amino acid is replaced by one other amino acid==.
> * Format: “prefix”“amino_acid”“position”“new_amino_acid”, e.g. p.Arg54Ser
>### Missense
>> p.D3E / p.Asp3Glu
>### Nonsense
>> p.Q4X / p.Q4* / p.Gln4Ter
>### Synonymous
>> p.D3= / p.Asp3=

### Non-frameshift deletion/duplication/insertion/deletion-insertion

> * Definition: a sequence change between the translation initiation (start) and termination (stop) codon where, compared to a reference sequence, one or more amino acids are deleted/duplicated/inserted.
> * Format: “prefix”“amino_acid(s)+position(s)_deleted”“del”, e.g. p.Cys76_Glu79del
> * Format: “prefix”“amino_acid(s)+position(s)_duplicated”“dup”, e.g. p.Cys76_Glu79dup
> * Format: “prefix”“amino_acids+positions_flanking”“ins”“inserted_sequence”, e.g. p.Lys23_Leu24insArgSerGln
> * Format: “prefix”“amino_acid(s)+position(s)_deleted”“delins”“inserted_sequence”, e.g. p.Arg123_Lys127delinsSerAsp
> * AKA ==in-frame== deletion/duplication/insertion/deletion-insertion, which does not shift the reading frame.

### Frameshift insertion/deletion/duplication

> * Definition: a sequence change between the translation initiation (start) and termination (stop) codon where, compared to a reference sequence, translation shifts to another reading frame.
> * Format: “prefix”“amino_acid”“position”“new_amino_acid”“fs”“Ter”“position_termination_site”, e.g. p.(Arg123LysfsTer34)
> * The shortest frameshift variant possible contains “fsTer2”
> * Example: p.Arg97ProfsTer23 (short p.Arg97fs) / p.Arg97Profs*23: a variant with Arg97 as the first amino acid changed, shifting the reading frame, replacing it for a Pro and terminating at position Ter23.

### Alleles

Square brackets indicate whether variants are in cis or in trans:

> NM_004006.2:c.[2376G>C;3103del] => in-cis
> NP_003997.1:p.[Ser68Arg;Asn594del] => in-cis
> NM_004006.2:c.[2376G>C];[3103del] => in-trans
> NP_003997.1:p.[Ser68Arg];[Asn594del] => in-trans

==Parentheses== mark protein changes that have ==not been confirmed experimentally==:

> Predicted consequences, i.e. without experimental evidence (no RNA or protein sequence analysed), should be given in parentheses inside the square brackets.
> NP_003997.1:p.[(Ser68Arg;Asn594del)] => in-cis, predicted consequence
> NP_003997.1:p.[(Ser68Arg)];[(Asn594del)] => in-trans, predicted consequence

### 3 prime rule

![](https://hackmd.io/_uploads/ryn0z-Hth.png)

https://www.sophiagenetics.com/science-hub/hgvs-nomenclature/

### Reference:

1. https://kknews.cc/news/53oeqvk.html
2. https://varnomen.hgvs.org/
3. https://onlinelibrary.wiley.com/doi/epdf/10.1002/%28SICI%291098-1004%28200001%2915%3A1%3C7%3A%3AAID-HUMU4%3E3.0.CO%3B2-N

## supplement

1. The reference sequence is numbered (left to right, short arm to long arm, p to q) in the 5' to 3' direction.
2. The reference sequence is the plus strand.
3. If the transcribed sequence (coding sequence) is identical to the reference sequence, the template strand is the minus strand.
4. If the transcribed sequence is complementary to the reference sequence, the template strand is the plus strand.
5. Transcription proceeds 5' to 3' along the synthesized strand (i.e. the transcribed sequence, the coding sequence).
6. The 5' end of the transcribed sequence is upstream, and the 3' end is downstream.

###### tags: `genomics`
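
## Example: classifying simple DNA-level descriptions

The DNA-level formats and the c. numbering rules described in this note can be sketched in a few lines of Python. This is a minimal sketch, not an HGVS validator: the names `parse`, `region_of` and `Variant` are hypothetical, the regular expression is a simplified assumption that covers only the g. and c. forms quoted in this note (substitution, del, dup, ins, delins), and it does not check positions against any reference sequence.

```python
import re
from typing import NamedTuple, Optional

# One position: plain (1799), 5' UTR (-2), 3' UTR (*3) or intronic (12+3, 13-2).
POS = r"[-*]?\d+(?:[+-]\d+)?"
POS_PARTS = re.compile(r"(?P<anchor>[-*]?)(?P<base>\d+)(?P<offset>[+-]\d+)?")

# Simplified grammar (an assumption of this sketch, not the full HGVS syntax).
DESCRIPTION = re.compile(
    rf"(?P<prefix>[gc])\.(?P<start>{POS})(?:_(?P<end>{POS}))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"|(?P<op>delins|del|dup|ins)(?P<seq>[ACGT]*))"
)

KINDS = {"del": "deletion", "dup": "duplication",
         "ins": "insertion", "delins": "deletion-insertion"}


class Variant(NamedTuple):
    reference: Optional[str]  # text before ":", e.g. "NM_004333.6(BRAF)"
    prefix: str               # "g" or "c"
    start: str
    end: Optional[str]
    kind: str
    region: str


def region_of(prefix: str, position: str) -> str:
    """Classify a c. position by its notation; g. positions are just genomic."""
    if prefix == "g":
        return "genomic"
    parts = POS_PARTS.fullmatch(position)
    if parts["offset"]:           # e.g. 12+3 or 13-2
        return "intron"
    if parts["anchor"] == "-":    # e.g. -2
        return "5' UTR"
    if parts["anchor"] == "*":    # e.g. *3
        return "3' UTR"
    return "coding exon"


def parse(description: str) -> Variant:
    """Split off an optional reference accession and classify the variant."""
    reference, _, body = description.rpartition(":")
    m = DESCRIPTION.fullmatch(body)
    if m is None:
        raise ValueError(f"unsupported description: {description!r}")
    op = m["op"]  # None for a substitution
    if op in ("ins", "delins") and not m["seq"]:
        raise ValueError("the inserted sequence must be stated")
    if op == "ins" and m["end"] is None:
        raise ValueError("an insertion needs two flanking positions")
    kind = "substitution" if op is None else KINDS[op]
    return Variant(reference or None, m["prefix"], m["start"], m["end"],
                   kind, region_of(m["prefix"], m["start"]))


for text in ("NM_004333.6(BRAF):c.1799T>A", "c.12+3G>C", "c.*3A>T",
             "g.123_127delinsAG"):
    print(parse(text))
```

The region is read purely from the notation of the position ("+"/"-" offsets, a leading "-" or "*"), which mirrors how the numbering rules encode location; deciding whether a position really falls in an exon would require the transcript's exon boundaries, which this sketch does not have.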