dbSNP
===
###### tags: `Genomics/Tertiary Analysis/Databases`
###### tags: `Genomics`, `SNP`, `dbSNP`, `Variant`, `NCBI`, `Sequence Ontology`
<br>
[TOC]
<br>
## Introduction
> A database that collects SNPs (single nucleotide polymorphisms)
- **Official site**
- https://www.ncbi.nlm.nih.gov/snp/
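- **Full name**
- database of SNP
- The Single Nucleotide Polymorphism Database
- A database of polymorphisms covering single-base substitutions and short insertions and deletions
- **Purpose**
- Supplements and supports GenBank
- [GenBank](https://zh.wikipedia.org/wiki/GenBank)
- Receives data for more than a million kinds of organisms from laboratories around the world
- **Created by**
- Established jointly by NCBI and NHGRI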
- NCBI
- National Center for Biotechnology Information (U.S.)
- NHGRI
- National Human Genome Research Institute (U.S.)
- **Contents**
- Nucleotide sequences from any organism
- **Origin of rs numbers**
- NCBI reviews each submitted SNP (classification and verification) and then assigns it an rs number
- rs stands for RefSNP (reference SNP)
- The prefix rs is followed by a number
- For example, [rs03](https://ncbi.nlm.nih.gov/snp/rs03), [rs003](https://ncbi.nlm.nih.gov/snp/rs003), and [rs000003](https://ncbi.nlm.nih.gov/snp/rs000003) are all the same as [rs3](https://ncbi.nlm.nih.gov/snp/rs3) (leading zeros are allowed)
- [rs7412](https://www.ncbi.nlm.nih.gov/snp/rs7412)
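- [rs12345678](https://www.ncbi.nlm.nih.gov/snp/rs12345678) (an arbitrary example, but this ID really exists)
- **What an rs record contains**
- Position information
- Flanking sequence
- Allele frequency distribution
- **Current status**
- Since 2017-09-01, dbSNP accepts only human variation data and no longer accepts variation data from other organisms ([wiki](https://en.wikipedia.org/wiki/DbSNP#1._Source))
- **Accession prefixes (what the codes mean)**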
- [Overview of dbSNP](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3078622/)
- ss: Submitted SNP (the submitted SNP data)
- rs: Reference SNP cluster (also called refSNP)
- The accession number assigned after review
- One rs can group multiple ss records
- **FTP**
- https://ftp.ncbi.nih.gov/snp/ (entry point)
- https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/ (older build)
- https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/ (latest release)
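The leading-zero behaviour of rs numbers noted in the list above can be sketched as a small normalizer. This is only an illustration, not NCBI's own logic: the helper name `normalize_rsid` is hypothetical, and it assumes an ID is simply `rs` followed by digits.

```python
import re

# Hypothetical helper (not an NCBI API): canonicalize an rs ID by
# dropping leading zeros, assuming the form "rs" + digits.
_RSID = re.compile(r"^rs0*(\d+)$", re.IGNORECASE)

def normalize_rsid(rsid: str) -> str:
    """Return the canonical form of an rs ID, e.g. 'rs003' -> 'rs3'."""
    match = _RSID.match(rsid.strip())
    if match is None:
        raise ValueError(f"not an rs ID: {rsid!r}")
    return "rs" + match.group(1)

# The three padded spellings from the list collapse to the same ID.
print({normalize_rsid(x) for x in ["rs03", "rs003", "rs000003", "rs3"]})
```

Normalizing IDs this way before joining tables avoids treating `rs03` and `rs3` as different variants.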
<br>
## dbSNP subset types
- [[UCSC] Short Genetic Variants from dbSNP release 155](https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=dbSnp155Composite)
- **All dbSNP (155)** (the complete dbSNP set)
the entire set (1.02 billion for hg19, 1.06 billion for hg38)
- **Common dbSNP (155)** (all dbSNP, restricted to MAF >= 1%)
approximately 15 million variants with a minor allele frequency (MAF) of at least 1% (0.01) in the 1000 Genomes Phase 3 dataset. Variants in the Mult. subset (below) are excluded.
- **ClinVar dbSNP (155)** (dbSNP variants referenced by ClinVar)
approximately 820,000 variants mentioned in ClinVar. Note: this includes both benign and pathogenic (as well as uncertain) variants. Variants in the Mult. subset (below) are excluded.
- **Mult. dbSNP (155)** (variants whose sequence maps to more than one chromosome)
variants that have been mapped to multiple chromosomes
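The 1% cut-off used by the Common subset can be illustrated with a short check. This is a minimal sketch under stated assumptions: the minor allele frequency is taken as the second-largest value in an ordered allele-frequency list (the convention the older dbSNP VCF header uses for `CAF`), missing values are passed as `None`, and the function names are hypothetical. The actual UCSC track is built from 1000 Genomes Phase 3 frequencies and may apply further rules.

```python
from typing import List, Optional

def minor_allele_frequency(freqs: List[Optional[float]]) -> Optional[float]:
    """Second-largest known frequency; None if fewer than two are known."""
    known = sorted((f for f in freqs if f is not None), reverse=True)
    return known[1] if len(known) >= 2 else None

def is_common(freqs: List[Optional[float]], threshold: float = 0.01) -> bool:
    """True when the minor allele frequency reaches the threshold (default 1%)."""
    maf = minor_allele_frequency(freqs)
    return maf is not None and maf >= threshold

# Frequencies written as in a CAF list: reference allele first, then ALT alleles.
print(is_common([0.8494, 0.1506]))             # two alleles
print(is_common([0.252, None, 0.748, None]))   # multi-allelic with missing values
print(is_common([0.999992, 0.000008]))         # very rare alternate allele
```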
<br>
## dbSNP database versions
### 00-All.vcf.gz vs GCF_000001405.39.gz
| file | size | date | ref |
|------|------|------|-----|
| [00-All.vcf.gz](https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/) | 15G | 2018-04-23 11:40 | [GRCh38.p7<br>(GCF_000001405.33)](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.33/) |
| [GCF_000001405.39.gz](https://ftp.ncbi.nih.gov/snp/latest_release/VCF/) | 24G | 2021-05-25 10:24 | [GRCh38.p13<br>(GCF_000001405.39)](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/) |
- 00-All.vcf.gz
https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/
- GCF_000001405.39.gz
https://ftp.ncbi.nih.gov/snp/latest_release/VCF/
- 00-All.vcf.gz vs GCF_000001405.39.gz
[](https://i.imgur.com/KVYPgrl.png)
...
[](https://i.imgur.com/Vjlp7vI.png)
...
[](https://i.imgur.com/ajV3zm6.png)
### VCF vs JSON
- ### rs429358
- **Web**
https://www.ncbi.nlm.nih.gov/snp/rs429358
- AF (allele frequency) information

- **VCF (from 00-All.vcf.gz)**
> 19 44908684 rs429358 T C . . RS=429358;RSPOS=44908684;dbSNPBuildID=80;SSR=0;SAO=1;VP=0x050368000a05150536130100;GENEINFO=APOE:348;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;ASP;VLD;G5;HD;GNO;KGPhase1;KGPhase3;LSD;MTP;OM;CAF=0.8494,0.1506;COMMON=1;TOPMED=0.84440303261977573,0.15559696738022426
>
- AF (allele frequency) information
- TOPMED=0.84440303261977573,0.15559696738022426
(is only the largest project recorded?)
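To look at the frequency tags in a record like the one above, the INFO column can be split into keys and values. The sketch below uses an abbreviated copy of the rs429358 INFO string; `parse_info` and `parse_freqs` are hypothetical helper names, and flag-style keys (no `=`) are stored as `True` by assumption.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into {key: value}; flag keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def parse_freqs(value: str) -> list:
    """Turn '0.8494,0.1506' into floats; a '.' entry becomes None."""
    return [None if v == "." else float(v) for v in value.split(",")]

# Abbreviated INFO column of the rs429358 record from 00-All.vcf.gz.
info = ("RS=429358;RSPOS=44908684;VC=SNV;PM;CAF=0.8494,0.1506;COMMON=1;"
        "TOPMED=0.84440303261977573,0.15559696738022426")
fields = parse_info(info)
for tag in ("CAF", "TOPMED"):
    print(tag, parse_freqs(fields[tag]))
```

In this older file the record carries two frequency tags, `CAF` (1000 Genomes) and `TOPMED`, each listing the reference allele first.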
<br>
## User guide
- [Searching NCBI’s dbSNP database](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3078622/)
- color coded
- near the 3’ or 5’ end (white)
- intron (yellow)
- non-synonymous nonsense and missense (red)
- synonymous (green)
- frame-shift (blue)
<br>
## [dbSNP2.0 VCF File](https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/)
> References
> - [Tertiary Analysis / VCF Format](https://hackmd.io/6rATKTvURVSKia8K_9kBeQ)
### [Fixed Columns](https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/#loc_all_info)
| Column Name | Public Term | Description |
|-------------|-------------|-------------|
| CHROM | Chromosome | The RefSeq identifier for this chromosome. |
| POS | Position | The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. |
| ID | Identifier | The RefSNP ID. When a RefSNP is found at more than one location, the ID will be suffixed with a hyphen, and number. |
| REF | Reference Base(s) | Each base must be one of A, C, G, T, or N. Multiple bases are permitted. Tools processing VCF files are not required to preserve case in the allele strings. |
| ALT | Alternate Base(s) | Comma separated list of alternate, non-reference alleles recorded in dbSNP from supporting data. Options are base strings made up of the bases A, C, G, T, or N. |
| QUAL | Quality | In the dbSNP VCF, this field is always '.' (no value). |
| FILTER | Filter Status | In the dbSNP VCF, this field is always '.' (no value). |
| INFO | Additional Information | Contains additional information for the reported variation. INFO fields are encoded as a series of semicolon-separated short keys with optional values in the format: \<key>=\<data>[,data]. See Table 2: VCF INFO Tags for more information. |
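The eight fixed columns in the table above map naturally onto a small record type. The following is a sketch only: `VcfRecord` and `parse_fixed_columns` are hypothetical names, the input is assumed to be one tab-separated data line, and the INFO string is an abbreviated copy of the rs3131962 record shown elsewhere in this note.

```python
from typing import List, NamedTuple

class VcfRecord(NamedTuple):
    chrom: str            # RefSeq accession in the redesigned dbSNP VCF
    pos: int              # 1-based reference position
    variant_id: str       # RefSNP ID
    ref: str
    alt: List[str]        # comma-separated in the file
    qual: str             # always '.' in the dbSNP VCF
    filter_status: str    # always '.' in the dbSNP VCF
    info: str

def parse_fixed_columns(line: str) -> VcfRecord:
    """Split one tab-separated VCF data line into its 8 fixed columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    return VcfRecord(chrom, int(pos), vid, ref, alt.split(","), qual, flt, info)

line = "NC_000001.11\t821224\trs3131962\tA\tC,G,T\t.\t.\tRS=3131962;VC=SNV;GNO;COMMON"
record = parse_fixed_columns(line)
print(record.chrom, record.pos, record.variant_id, record.ref, record.alt)
```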
<br>
### [Variant Types (New)](https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/#var_type)
> Variant types
| Public term | Abbreviation | SO ID |
|--------------------|-------------------|------------|
| Single Nucleotide Variant | SNV | [SO:0001483](http://www.sequenceontology.org/browser/current_svn/term/SO:0001483) |
| Insertion | INS | [SO:0000667](http://www.sequenceontology.org/browser/current_svn/term/SO:0000667) |
| Deletion | DEL | [SO:0000159](http://www.sequenceontology.org/browser/current_svn/term/SO:0000159) |
| Indel (insertion or deletion) | INDEL | [SO:1000032](http://www.sequenceontology.org/browser/current_svn/term/SO:1000032) |
| Multiple Nucleotide Variation | MNV | [SO:0002007](http://www.sequenceontology.org/browser/current_svn/term/SO:0002007) |
| Identity (same as reference, no variation) | NOALT (no sample required) | [SO:0002073](http://www.sequenceontology.org/browser/current_svn/term/SO:0002073) |
- ### **SO**: Sequence Ontology
http://www.sequenceontology.org/browser/
- ### [SNV (single nucleotide variant)](http://www.sequenceontology.org/browser/current_svn/term/SO:0001483)
[](https://i.imgur.com/wvsqlMN.png)
- SNV
<br>
### Variant Types (Old)
> Variant types
- [Snp class description](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_legend.cgi?legend=snpClass)
| # | class | description |
|---|-------|--------------------------|
| 1 | SNV | single nucleotide variation |
| 2 | DIV | deletion/insertion variation |
| 3 | HETEROZYGOUS | variable, but undefined at nucleotide level |
| 4 | STR | short tandem repeat (microsatellite) variation |
| 5 | NAMED | insertion/deletion variation of named repetitive element |
| 6 | NO VARIATION | sequence scanned for variation, but none observed |
| 7 | MIXED | cluster contains submissions from 2 or more allelic classes |
| 8 | MNV | multiple nucleotide variation with alleles of common length > 1 |
- [Old vs New](https://hackmd.io/kN7MzZZjTsKlOXLYy3gV5Q#00-Allvcfgz-vs-GCF_00000140539gz)
- Old: 00-All.vcf (ref=GRCh38.p7)
```1 105774 rs1236596781 GT G . . RS=1236596781;RSPOS=105775;dbSNPBuildID=151;SSR=0;SAO=0;VP=0x050000080005000002000200;GENEINFO=LOC100996442:100996442;WGT=1;VC=DIV;INT;ASP;TOPMED=0.78549471202854230,0.21450528797145769```
- `VC=DIV`
- New: GCF_000001405.39.gz (ref=GRCh38.p13)
```NC_000001.11 105774 rs1236596781 GT G . . RS=1236596781;dbSNPBuildID=151;SSR=0;GENEINFO=LOC100996442:100996442;VC=INDEL;INT;R3;GNO;FREQ=Korea1K:0.8267,0.1733|TOMMO:0.8335,0.1665|dbGaP_PopFreq:0.7329,0.2671;COMMON```
- `VC=INDEL`
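Beyond the `VC` value, the two records above also differ in which INFO keys they carry. A quick way to see this is to compare the key sets and to unpack the new `FREQ` tag, whose layout (`study:ref,alt|study:ref,alt`) is read off the example itself. The helper names `info_keys` and `parse_freq` are hypothetical, and the result describes only this one record pair, not the formats in general.

```python
def info_keys(info: str) -> set:
    """Return the set of INFO keys, ignoring their values."""
    return {item.split("=", 1)[0] for item in info.split(";") if item}

def parse_freq(value: str) -> dict:
    """Split 'StudyA:0.8,0.2|StudyB:0.7,0.3' into {study: [freq strings]}."""
    result = {}
    for entry in value.split("|"):
        study, _, freqs = entry.partition(":")
        result[study] = freqs.split(",")
    return result

old_info = ("RS=1236596781;RSPOS=105775;dbSNPBuildID=151;SSR=0;SAO=0;"
            "VP=0x050000080005000002000200;GENEINFO=LOC100996442:100996442;"
            "WGT=1;VC=DIV;INT;ASP;TOPMED=0.78549471202854230,0.21450528797145769")
new_info = ("RS=1236596781;dbSNPBuildID=151;SSR=0;GENEINFO=LOC100996442:100996442;"
            "VC=INDEL;INT;R3;GNO;FREQ=Korea1K:0.8267,0.1733|TOMMO:0.8335,0.1665|"
            "dbGaP_PopFreq:0.7329,0.2671;COMMON")

only_old = sorted(info_keys(old_info) - info_keys(new_info))
only_new = sorted(info_keys(new_info) - info_keys(old_info))
print("only in old:", only_old)
print("only in new:", only_new)

freq_value = dict(item.split("=", 1) for item in new_info.split(";") if "=" in item)["FREQ"]
print(parse_freq(freq_value))
```

For this pair, per-project tags such as `TOPMED` appear only in the old record, while the new record groups frequencies by study under `FREQ`.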
<br>
<hr>
<br>
## Annotation fields
> - Annotated with pbrun vcfanno
> - Version: `00-All.vcf.gz`
- ### Added FILTER field
| ID | Description |
|----|-------------|
| PASS | All filters passed |
- ### Added INFO fields (`dbsnp_` is a user-defined prefix)
| ID | Number / Type<br>Description |
|----|-----------------------------|
| dbsnp_ASP | 0 / Flag<br>Is Assembly specific. This is set if the variant only maps to one assembly |
| dbsnp_ASS | 0 / Flag<br>In acceptor splice site FxnCode = 73 |
| dbsnp_CAF | . / String<br>An ordered, comma delimited list of allele frequencies based on 1000Genomes, starting with the reference allele followed by alternate alleles as ordered in the ALT column. Where a 1000Genomes alternate allele is not in the dbSNPs alternate allele set, the allele is added to the ALT column. The minor allele is the second largest value in the list, and was previously reported in VCF as the GMAF. This is the GMAF reported on the RefSNP and EntrezSNP pages and VariationReporter |
| dbsnp_CDA | 0 / Flag<br>Variation is interrogated in a clinical diagnostic assay |
| dbsnp_CFL | 0 / Flag<br>Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies. |
| dbsnp_COMMON | 1 / Integer<br>RS is a common SNP. A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency. |
| dbsnp_DSS | 0 / Flag<br>In donor splice-site FxnCode = 75 |
| dbsnp_G5 | 0 / Flag<br>>5% minor allele frequency in 1+ populations |
| dbsnp_G5A | 0 / Flag<br>>5% minor allele frequency in each and all populations |
| dbsnp_GENEINFO | 1 / String<br>Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (\|) |
| dbsnp_GNO | 0 / Flag<br>Genotypes available. The variant has individual genotype (in SubInd table). |
| dbsnp_HD | 0 / Flag<br>Marker is on high density genotyping kit (50K density or greater). The variant may have phenotype associations present in dbGaP. |
| dbsnp_INT | 0 / Flag<br>In Intron FxnCode = 6 |
| dbsnp_KGPhase1 | 0 / Flag<br>1000 Genome phase 1 (incl. June Interim phase 1) |
| dbsnp_KGPhase3 | 0 / Flag<br>1000 Genome phase 3 |
| dbsnp_LSD | 0 / Flag<br>Submitted from a locus-specific database |
| dbsnp_MTP | 0 / Flag<br>Microattribution/third-party annotation(TPA:GWAS,PAGE) |
| dbsnp_MUT | 0 / Flag<br>Is mutation (journal citation, explicit fact): a low frequency variation that is cited in journal and other reputable sources |
| dbsnp_NOC | 0 / Flag<br>Contig allele not present in variant allele list. The reference sequence allele at the mapped position is not present in the variant allele list, adjusted for orientation. |
| dbsnp_NOV | 0 / Flag<br>Rs cluster has non-overlapping allele sets. True when rs set has more than 2 alleles from different submissions and these sets share no alleles in common. |
| dbsnp_NSF | 0 / Flag<br>Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44 |
| dbsnp_NSM | 0 / Flag<br>Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42 |
| dbsnp_NSN | 0 / Flag<br>Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41 |
| dbsnp_OM | 0 / Flag<br>Has OMIM/OMIA |
| dbsnp_OTH | 0 / Flag<br>Has other variant with exactly the same set of mapped positions on NCBI reference assembly. |
| dbsnp_PM | 0 / Flag<br>Variant is Precious(Clinical,Pubmed Cited) |
| dbsnp_PMC | 0 / Flag<br>Links exist to PubMed Central article |
| dbsnp_R3 | 0 / Flag<br>In 3' gene region FxnCode = 13 |
| dbsnp_R5 | 0 / Flag<br>In 5' gene region FxnCode = 15 |
| dbsnp_REF | 0 / Flag<br>Has reference A coding region variation where one allele in the set is identical to the reference sequence. FxnCode = 8 |
| dbsnp_RS | 1 / Integer<br>dbSNP ID (i.e. rs number) |
| dbsnp_RSPOS | 1 / Integer<br>Chr position reported in dbSNP |
| dbsnp_RV | 0 / Flag<br>RS orientation is reversed |
| dbsnp_S3D | 0 / Flag<br>Has 3D structure - SNP3D table |
| dbsnp_SAO | 1 / Integer<br>Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both |
| dbsnp_SLO | 0 / Flag<br>Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out |
| dbsnp_SSR | 1 / Integer<br>Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other |
| dbsnp_SYN | 0 / Flag<br>Has synonymous A coding region variation where one allele in the set does not change the encoded amino acid. FxnCode = 3 |
| dbsnp_TOPMED | . / String<br>An ordered, comma delimited list of allele frequencies based on TOPMed, starting with the reference allele followed by alternate alleles as ordered in the ALT column. The TOPMed minor allele is the second largest value in the list. |
| dbsnp_TPA | 0 / Flag<br>Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data) |
| dbsnp_U3 | 0 / Flag<br>In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53 |
| dbsnp_U5 | 0 / Flag<br>In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55 |
| dbsnp_VC | 1 / String<br>Variation Class |
| dbsnp_VLD | 0 / Flag<br>Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data. |
| dbsnp_VP | 1 / String<br>Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf |
| dbsnp_WGT | 1 / Integer<br>Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 or more |
| dbsnp_WTD | 0 / Flag<br>Is Withdrawn by submitter If one member ss is withdrawn by submitter, then this bit is set. If all member ss' are withdrawn, then the rs is deleted to SNPHistory |
| dbsnp_<br>dbSNPBuildID | 1 / Integer<br>First dbSNP Build for RS |
- ### Other added header lines
- `##bcftools_normVersion=`
`1.7+htslib-1.9`
- `##bcftools_normCommand=`
`norm -m- -o aed5ad61ea26e520e12aaf647567d31b06296fdb.norm.vcf /workspace/datasets/germline/HG002/wes-output/output.vcf; Date=Wed Dec 28 11:19:03 2022`
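Two of the integer INFO fields in the table above are coded values: `dbsnp_SSR` is described as a sum of reason codes, and `dbsnp_SAO` is a simple lookup. A minimal decoding sketch follows; the code tables are copied from the field descriptions, the function names are hypothetical, and treating SSR as a bit mask is an assumption based on the "added together" wording.

```python
# Codes copied from the SSR and SAO descriptions in the INFO table.
SSR_CODES = {1: "Paralog", 2: "byEST", 4: "oldAlign",
             8: "Para_EST", 16: "1kg_failed", 1024: "other"}
SAO_CODES = {0: "unspecified", 1: "Germline", 2: "Somatic", 3: "Both"}

def decode_ssr(value: int) -> list:
    """Expand a summed SSR value into the reason names it contains.

    Bits that are not listed in the table are ignored.
    """
    if value == 0:
        return ["unspecified"]
    return [name for bit, name in SSR_CODES.items() if value & bit]

def decode_sao(value: int) -> str:
    """Translate an SAO code; unlisted codes are reported as 'unknown'."""
    return SAO_CODES.get(value, "unknown")

print(decode_ssr(0))      # the value seen in the examples in this note
print(decode_ssr(3))      # 1 + 2
print(decode_sao(1))
```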
<br>
## Annotation example
> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
> chr1 14590 . G A 31.6 . AC=1;AF=0.5;AN=2;BaseQRankSum=0.674;DP=2;ExcessHet=3.0103;FS=0;MLEAC=1;MLEAF=0.5;MQ=22;MQRankSum=0;QD=15.82;ReadPosRankSum=0.674;SOR=0.693;dbsnp_RS=707679;dbsnp_RSPOS=14590;dbsnp_RV;dbsnp_VP=0x050000040005040102000100;dbsnp_GENEINFO=DDX11L1:100287102|WASH7P:653635;dbsnp_dbSNPBuildID=86;dbsnp_SAO=0;dbsnp_SSR=0;dbsnp_WGT=1;dbsnp_VC=SNV;dbsnp_R3;dbsnp_ASP;dbsnp_VLD;dbsnp_GNO;dbsnp_TOPMED=0.93632931957186544,0.06367068042813455 GT:AD:DP:GQ:PL 0/1:1,1:2:39:39,0,39
| Column | value |
|--------|-------|
| #CHROM | chr1 |
| POS | 14590 |
| ID | . |
| REF | G |
| ALT | A |
| QUAL | 31.6 |
| FILTER | . |
| INFO | (expanded in the table below) |
| FORMAT | GT:AD:DP:GQ:PL |
| sample | 0/1:1,1:2:39:39,0,39 |
| INFO | value |
|------|-------|
| AC | 1 |
| AF | 0.5 |
| AN | 2 |
| BaseQRankSum | 0.674 |
| DP | 2 |
| ExcessHet | 3.0103 |
| FS | 0 |
| MLEAC | 1 |
| MLEAF | 0.5 |
| MQ | 22 |
| MQRankSum | 0 |
| QD | 15.82 |
| ReadPosRankSum | 0.674 |
| SOR | 0.693 |
| dbsnp_RS | 707679 |
| dbsnp_RSPOS | 14590 |
| dbsnp_RV |
| dbsnp_VP | 0x050000040005040102000100 |
| dbsnp_GENEINFO | DDX11L1:100287102\|WASH7P:653635 |
| dbsnp_dbSNPBuildID | 86 |
| dbsnp_SAO | 0 |
| dbsnp_SSR | 0 |
| dbsnp_WGT | 1 |
| dbsnp_VC | SNV |
| dbsnp_R3 |
| dbsnp_ASP |
| dbsnp_VLD |
| dbsnp_GNO |
| dbsnp_TOPMED | 0.93632931957186544,0.06367068042813455 |
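The split shown in the table above, caller fields versus `dbsnp_`-prefixed fields, can be reproduced in a few lines. This is a sketch with a hypothetical helper name (`split_annotations`) and an abbreviated copy of the example INFO string; flags are stored as `True` by assumption. Because the ID column of the annotated record is `.`, the rs number is rebuilt here from `dbsnp_RS`.

```python
PREFIX = "dbsnp_"  # the user-defined prefix used during annotation

def split_annotations(info: str, prefix: str = PREFIX):
    """Separate caller INFO fields from prefix-tagged dbSNP fields."""
    caller, dbsnp = {}, {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        value = value if sep else True      # flag keys carry no value
        if key.startswith(prefix):
            dbsnp[key[len(prefix):]] = value
        else:
            caller[key] = value
    return caller, dbsnp

# Abbreviated INFO column of the annotated chr1:14590 record.
info = ("AC=1;AF=0.5;AN=2;DP=2;dbsnp_RS=707679;dbsnp_RSPOS=14590;"
        "dbsnp_RV;dbsnp_VC=SNV;dbsnp_R3")
caller, dbsnp = split_annotations(info)
rsid = "rs" + dbsnp["RS"]
print(rsid, sorted(caller), sorted(dbsnp))
```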
<br>
<hr>
<br>
## Annotation errors?
### Case1
- ref: Homo_sapiens_assembly38.fasta
- alt: HG002.novaseq.wes_agilent.100x
- dbsnp: 00-All.vcf.gz
- **Abnormal**
```
#CHROM POS ID REF ALT INFO
chr1 788418 . CAG C dbsnp_VC=DIV,DIV;dbsnp_CAF=0.1989,0.8011,0.1989,0.8011;dbsnp_TOPMED=0.17555428134556574,0.82444571865443425
```
- The annotation result carries 3 variant frequencies: `0.8011,0.1989,0.8011`
- dbSNP
```
#00-All.vcf.gz
1 788418 rs77445403 C G . . RS=77445403;RSPOS=788418;dbSNPBuildID=131;SSR=0;SAO=0;VP=0x050100080005000102000100;GENEINFO=LOC105378580:105378580;WGT=1;VC=SNV;SLO;INT;ASP;GNO;TOPMED=0.99999203618756371,0.00000796381243628
#GCF_000001405.39.gz
NC_000001.11 788418 rs77445403 C G . . RS=77445403;dbSNPBuildID=131;SSR=0;GENEINFO=LINC01409:105378580;VC=SNV;INT;R5;GNO;FREQ=GnomAD:1,1.428e-05|TOPMED:1,2.267e-05|dbGaP_PopFreq:1,0
```
- [rs77445403](https://www.ncbi.nlm.nih.gov/snp/rs77445403)

- Only 1 variant frequency
- **Normal**
```
#CHROM POS ID REF ALT INFO
chr1 821224 . A G dbsnp_VC=SNV;dbsnp_CAF=0.252,.,0.748,.;dbsnp_TOPMED=0.36704415137614678,0.00375095565749235,0.62918100152905198,0.00002389143730886
```
- There are 3 variant frequencies: `.,0.748,.`
- dbSNP
```
#00-All.vcf.gz
1 821224 rs3131962 A C,G,T . . RS=3131962;RSPOS=821224;RV;dbSNPBuildID=103;SSR=0;SAO=0;VP=0x050100000005170536000104;WGT=1;VC=SNV;SLO;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPhase3;NOV;CAF=0.252,.,0.748,.;COMMON=1;TOPMED=0.36704415137614678,0.00375095565749235,0.62918100152905198,0.00002389143730886
#GCF_000001405.39.gz
NC_000001.11 821224 rs3131962 A C,G,T . . RS=3131962;dbSNPBuildID=103;SSR=0;VC=SNV;GNO;FREQ=1000Genomes:0.252,.,0.748,.|GENOME_DK:0.025,.,0.975,.|GnomAD:0.2708,.,0.7292,.|GoNL:0.1493,.,0.8507,.|KOREAN:0.1762,.,0.8238,.|Korea1K:0.1654,.,0.8346,.|NorthernSweden:0.14,.,0.86,.|Qatari:0.2269,.,0.7731,.|SGDP_PRJ:0.1447,.,0.8553,.|Siberian:0.1346,.,0.8654,.|TOMMO:0.1976,.,0.8024,.|dbGaP_PopFreq:0.2479,0,0.7521,0;COMMON
```
- [rs3131962](https://www.ncbi.nlm.nih.gov/snp/rs3131962)
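The manual comparison in Case1 can be made mechanical by checking each `dbsnp_`-prefixed value against the dbSNP record found at the same position. The sketch below only compares strings and does not explain why a mismatch occurs; the function names are hypothetical, the INFO strings are abbreviated copies of the records above, and it assumes the listed dbSNP record is the one the annotation was expected to come from.

```python
def info_dict(info: str) -> dict:
    """Parse an INFO string into {key: value}; flags map to True."""
    fields = {}
    for item in info.split(";"):
        if item:
            key, sep, value = item.partition("=")
            fields[key] = value if sep else True
    return fields

def annotation_mismatches(annotated_info: str, dbsnp_info: str,
                          prefix: str = "dbsnp_") -> dict:
    """Return {key: (annotated, source)} for prefixed keys that differ."""
    source = info_dict(dbsnp_info)
    diffs = {}
    for key, value in info_dict(annotated_info).items():
        if key.startswith(prefix):
            name = key[len(prefix):]
            if source.get(name) != value:
                diffs[name] = (value, source.get(name))
    return diffs

# chr1:788418 (abnormal case) against rs77445403 from 00-All.vcf.gz
abnormal = annotation_mismatches(
    "dbsnp_VC=DIV,DIV;dbsnp_CAF=0.1989,0.8011,0.1989,0.8011;"
    "dbsnp_TOPMED=0.17555428134556574,0.82444571865443425",
    "RS=77445403;RSPOS=788418;VC=SNV;SLO;INT;ASP;GNO;"
    "TOPMED=0.99999203618756371,0.00000796381243628")

# chr1:821224 (normal case) against rs3131962 from 00-All.vcf.gz
normal = annotation_mismatches(
    "dbsnp_VC=SNV;dbsnp_CAF=0.252,.,0.748,.;"
    "dbsnp_TOPMED=0.36704415137614678,0.00375095565749235,0.62918100152905198,0.00002389143730886",
    "RS=3131962;RSPOS=821224;VC=SNV;CAF=0.252,.,0.748,.;COMMON=1;"
    "TOPMED=0.36704415137614678,0.00375095565749235,0.62918100152905198,0.00002389143730886")

print(sorted(abnormal))   # keys whose annotated value differs from the record
print(normal)             # empty when every annotated value matches
```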

<br>
## References
<br>
## References (still to be read)
- [dbSNP数据库](https://wenku.baidu.com/view/4e6b882c647d27284b7351e2.html)
- [从dbSNP里提取SNP的Asian频率](https://www.jianshu.com/p/42303428cc76)