# Human genome study series: 1000 Genomes Project Structural Variation sequencing
###### tags: `Lab`
## Introduction: 1000 Genomes Project
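After the first human whole-genome sequence was completed in 2003, researchers turned their attention to analyzing genomic variation. Early related work such as the HapMap Project set out to detect variants in the genome (especially common variants). The project introduced here, the 1000 Genomes Project (1KGP), ran from 2008 to 2015. Its aim was to collect DNA samples from multiple populations and detect variant sites of different frequencies across the genome (mainly those with MAF > 1%).
Some of the samples were carried over from the HapMap Project, and others came from research cohorts. The sample types include DNA and cell lines, all of which are currently held in the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research.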
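The 1KGP was carried out in several phases. The most important phases and their results are summarized below:
* pilot phase: an important early milestone. Whole-genome sequencing was completed, but the mean depth was shallow, so only common variants could be found.
* phase 1: compared with earlier studies, the same set of samples was analyzed with newer sequencing technology and tools, and the results improved modestly.
* phase 3: the most important publication before the project ended. It improved on several fronts: it expanded both the population diversity and the total number of samples, increased the sequencing depth, and added assay methods (microarray genotyping), which further sharpened the exploration of rare variants. Among the publications, Sudmant et al. is the one most often discussed and compared (as of 2023/3/3, the 1KGP consortium's own paper had 38 citations, whereas Sudmant et al. had 2022).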

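Although the 1KGP formally concluded in 2015, its samples have continued to be used in research. Common approaches include resequencing, adding relatives (related samples, trio samples), and expanding the total number of samples. The best-known use is by the Human Genome Structural Variation Consortium (HGSVC).
Other related studies and data links are collected on the [IGSR](https://www.internationalgenome.org/1000-genomes-summary) website.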
## Research Paper
### An integrated map of structural variation in 2,504 human genomes
* **Sudmant, P., et al. *Nature* 526, 75–81 (2015)**
* introduction: analyzes SVs using 1KGP data plus long reads, with particular attention to detecting and analyzing SV classes other than DEL
* material & method:
    * Illumina WGS data (∼100 bp reads, mean 7.4-fold coverage) from 2,504 individuals
    * GRCh37 reference, aligned with BWA and mrsFAST
    * SV calling with multiple algorithms (the tools are compared in the next article)
    * SV refinement with PacBio long-read data
* result:
    * characteristics of the phase 3 integrated SV callset
        * novelty rate relative to earlier 1KGP releases and the Database of Genomic Variants (DGV): an SV counts as novel if its reciprocal overlap with known SVs is < 50%
        * FDR estimation using deep-coverage Complete Genomics (CG) sequencing as the truth set --> most false negatives stem from the Illumina short-read data, which suggests short reads are still inadequate for SV detection: read length and breakpoint extension are limited, and the sequencing depth was shallow to begin with. With deeper coverage, might insertions of at least 300 bp be handled better?
        * breakpoint assembly showed a mean boundary precision of 0–15 bp for all SV types excluding DUP and INV (my guess is that this was done with bedtools)
    * population genetic properties
        * at VAF >= 2% nearly all SVs are shared across populations --> wrong description?
        * LD estimation with neighboring SNV data: if r^2 > 0.6, the SV is considered to be in linkage disequilibrium with the SNV --> LD is reported as r-squared, i.e. the correlation between the two
        * heterozygosity rate among populations
        * population stratification by VAF concordance
    * functional impact of SVs
        * intersect SVs, binned by AF, with classes of genic and intergenic functional elements to see whether they fall in functionally relevant regions
        * as deletion sizes increase, these SVs become rarer (P < 2.2 × 10^(-16); linear model, F-test)
        * a number of disease associations previously detected by GWAS may be attributable to SVs --> previously reported GWAS candidates were correlated with SVs to check whether the sites are highly correlated (in strong LD, r^2 > 0.8)
    * SV clustering and complexity
        * 3,163 regions where SVs seemed to cluster (>2 SVs mapping within 500 bp)
* discussion and remark
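
The 50% reciprocal-overlap novelty criterion noted in the results above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `Interval` type, the function names, and the toy coordinates are all assumptions made for this example, and intervals are treated as half-open `[start, end)`.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical half-open genomic interval [start, end) on one chromosome."""
    chrom: str
    start: int
    end: int


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of overlap/len(a) and overlap/len(b)."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def is_novel(call: Interval, known: list[Interval], threshold: float = 0.5) -> bool:
    """A call is novel if no known SV reaches the reciprocal-overlap threshold."""
    return all(reciprocal_overlap(call, k) < threshold for k in known)


# Toy data (made up for illustration only).
known_svs = [Interval("chr1", 1000, 2000), Interval("chr1", 5000, 9000)]
print(is_novel(Interval("chr1", 1100, 2100), known_svs))  # overlaps a known SV by 90%
print(is_novel(Interval("chr1", 1900, 6000), known_svs))  # only partial overlaps
```

Taking the minimum of the two fractions is what makes the overlap "reciprocal": a small SV nested inside a much larger one covers 100% of itself but only a sliver of the larger event, so it is still reported as novel.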
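
The r^2 thresholds mentioned in the notes above (0.6 for SV-SNV linkage, 0.8 for GWAS hits) can be illustrated with a genotype-based estimate. The sketch below assumes unphased allele dosages (0, 1, or 2 alternate alleles per sample) and computes the squared Pearson correlation between them. This is a common approximation, not necessarily the estimator used in the paper, and the function name and toy dosages are hypothetical.

```python
def genotype_r2(x: list[int], y: list[int]) -> float:
    """Squared Pearson correlation between two allele-dosage vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("dosage vectors must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic site: r^2 is undefined, reported as 0 in this sketch
    return cov * cov / (var_x * var_y)


# Toy dosages for eight samples (made up for illustration only).
sv_dosage = [0, 1, 2, 0, 1, 0, 2, 1]
snv_dosage = [0, 1, 2, 0, 1, 0, 2, 0]
r2 = genotype_r2(sv_dosage, snv_dosage)
print(f"r^2 = {r2:.3f}; tagged by SNV at r^2 > 0.6: {r2 > 0.6}")
```

Because r^2 is symmetric and bounded between 0 and 1, a single cutoff can be applied regardless of which variant is treated as the tag.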
### High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios
* **Byrska-Bishop et al., *Cell* 2022 Sep 1;185(18):3426-3440**
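* introduction: the 1kGP, which ended in 2015, was limited by shallow WGS depth and by its sample size, and many later studies have extended the value of this resource in different ways. This study adds 698 related samples (completing 602 trios), bringing the cohort to 3,202 samples, and raises the sequencing depth in order to re-detect and compare SNV/SV results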
* material & method:
    * Illumina WGS data (NovaSeq 6000, mean 30-fold coverage) from 3,202 individuals
    * GRCh38 reference
    * SNV/INDEL: GATK HaplotypeCaller
    * SV: svtools, Absinthe, and GATK-SV (the last pipeline is the same one used for the [gnomAD database](/Ei6cobaCQiOtr277YH7dYA))
* result:
    * Small variation across the 3,202 1kGP samples
        * cohort- and sample-level variant counts, with novel SNVs defined by comparison to dbSNP
        * mutation counts are consistent with gnomAD and TOPMed
    * False discovery rate among small variants
        * compared GT calls of NA12878 (HG001) to the Genome in a Bottle (GIAB) NA12878 truth set v3.3.2 in confident regions
        * FDR differs between easy- and difficult-to-sequence regions (Krusche et al., Nat Biotechnol 2019)
        * from singleton discovery, they found that the elevated FDR was caused by somatic artifacts from cell-line propagation
        * re-evaluating FDR with GIAB v4.2.1 shows a 5% increase in FPs; the latest truth set uses additional technologies such as long reads to reduce noise
    * Structural variation across the 3,202 1kGP samples
        * compared SV calls between the 3 callers and estimated the size and allele frequency distributions at both cohort and sample level
        * in functional annotation, they found that SVs altered 162 genes per genome, which is consistent with previous studies:
            * gnomAD-SV predicted 180 genes altered by SVs in each genome (Collins et al., 2020)
            * a recent HGSVC study estimated 189 SVs per genome (Ebert et al., 2021)
    * Comparison of the SNV/SV calls to the 1kGP phase 3 call set
        * compared the results with those from 2015, restricted to the same 2,504 samples
        * SNV recall is high (98.3% using phase 3 data as the truth set) with high AF concordance
        * the largest gain is in rare/singleton variants
        * FDR estimation: the high-coverage set has a lower FDR than phase 3
        * in SV detection, 2-fold more SV sites were discovered; caller selection may contribute to the different results (Table: SV caller comparison between 2015 phase 3 and 2022 high-coverage)
        * SV benchmarking with long-read samples as the truth set shows that the high-coverage set has higher precision
    * Haplotype phasing and imputation performance
        * haplotype phasing used pedigree-based correction with both SHAPEIT (chr1-22) and Eagle (chrX)
        * SV phasing combines the phased SNV VCF with the SV VCF
        * a set of 110 diverse samples from the Simons Genome Diversity Project was imputed, showing high imputation accuracy at MAF > 5% (using HGSVC long-read data as the truth set)
* discussion and remark
    * family-based design for inheritance evaluation, plus high coverage
    * gains from technological advances in the high-coverage resource relative to phase 3 (especially in rare, singleton, and non-coding region variants)
    * in FDR evaluation, somatic variants that arise in the cell lines over time may affect the mutation rate
    * most existing reference imputation panels, such as the HRC and TOPMed, do not yet include SVs
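
The FDR, recall, and precision figures quoted in the notes above all derive from comparing a call set against a truth set. The sketch below shows the arithmetic under a simplifying assumption: variants are matched by exact `(chrom, pos, ref, alt)` identity. Real benchmarks are typically restricted to confident regions and normalize variant representation before comparing, which this example does not attempt. The `benchmark` function and the toy variants are hypothetical.

```python
Variant = tuple[str, int, str, str]  # (chrom, pos, ref, alt); hypothetical representation


def benchmark(calls: set[Variant], truth: set[Variant]) -> dict[str, float]:
    """Compute precision, recall, and FDR by exact-match set comparison."""
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return {"precision": precision, "recall": recall, "fdr": fdr}


# Toy call and truth sets (made up for illustration only).
truth_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr2", 900, "T", "C"),
}
call_set = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr3", 50, "A", "T"),  # not in the truth set, so counted as a false positive
}
print(benchmark(call_set, truth_set))
```

Note that FDR is simply 1 - precision, so a switch to a stricter truth set (as with the GIAB v4.2.1 re-evaluation) changes FP counts and therefore both numbers at once.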
## Reference
* https://www.internationalgenome.org/1000-genomes-summary
* https://www.nature.com/articles/nature15394
* https://www.sciencedirect.com/science/article/pii/S0092867422009916?via%3Dihub