## CNV calling approaches:


#### CNV callers can combine one or more approaches, including read-pair (RP), read-depth (RD), split-read (SR), and assembly (AS) algorithms.

#### As shown in the figure, GATK gCNV appears to be more sensitive than the other tools at calling CNVs shorter than 1,000 bp in WES samples.
> GATK gCNV called mainly CNVs shorter than 500 bp in WES and 500–1,000 bp in WGS. GATK gCNV called shorter CNVs than any other tool.
#### Although GATK gCNV calls a large number of CNVs, filtering those calls becomes the next challenge.
> GATK gCNV recall was best for both WES and WGS data, followed by Lumpy, DELLY, cn.MOPS, and Manta. All tools performed poorly on the WES dataset. While recall for WGS in all tools, except CLC Genomics Benchmark, was fair, precision was lacking for all the tools, with a maximum precision of 66.7%. Tools that called a higher total number of CNVs, also had higher recall, but lower precision.
[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8699073/)
## ExomeDepth
* Advantage: easy to use
* How to interpret the result?
> The average confidence (Bayes factor) determined by ExomeDepth for true positive CNV events was ==45.04== (Supplementary Table S7, min.=6.4, max.=76.8) and the average ratio of sequencing reads between test and reference samples for deletions was 0.61 (Supplementary Table S7, min.=0.539, max.=0.745) and 1.4 for the sole duplication event. [Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5427176)
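
One way to use these numbers is to check whether a call's Bayes factor and read ratio look like the true positives reported in that study. The sketch below assumes the ExomeDepth calls were exported to CSV with the standard `type`, `BF`, and `reads.ratio` columns. The example rows are made up, and the ranges describe a single dataset, so they are not validated cutoffs. The `ratio > 1.0` rule for duplications is an assumption, because only one duplication was reported.

```python
import csv
import io

# Ranges reported for true-positive events in the study quoted above.
# They describe one dataset and are NOT validated cutoffs.
TP_BF_MIN = 6.4
TP_DEL_RATIO_RANGE = (0.539, 0.745)


def resembles_reported_true_positive(row):
    """Return True if BF (and the read ratio, for deletions) falls in the reported ranges."""
    bf = float(row["BF"])
    ratio = float(row["reads.ratio"])
    if bf < TP_BF_MIN:
        return False
    if row["type"] == "deletion":
        low, high = TP_DEL_RATIO_RANGE
        return low <= ratio <= high
    # Assumption: only one duplication (ratio 1.4) was reported, so no range
    # exists; simply require more reads than expected from the reference.
    return ratio > 1.0


# Made-up rows using ExomeDepth's output column names.
EXAMPLE_CSV = """type,nexons,chromosome,start,end,BF,reads.expected,reads.observed,reads.ratio
deletion,3,1,1000,5000,45.0,500,305,0.61
deletion,1,2,2000,2500,3.2,100,70,0.70
duplication,4,7,9000,15000,30.1,400,560,1.40
"""

calls = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
flags = [resembles_reported_true_positive(c) for c in calls]
print(flags)
```

The second row is rejected because its Bayes factor is below the lowest value reported for a true positive.
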
## GATK Germline CNV (gCNV)
The GATK gCNV workflow consists of four steps:

1. CollectReadCounts
2. DetermineGermlineContigPloidy
3. GermlineCNVCaller
4. PostprocessGermlineCNVCalls

[Reference](https://www.nature.com/articles/s41588-023-01449-0)
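
The four steps can be sketched as a list of command lines for a small cohort. This is a minimal sketch, not the official pipeline: it only builds the commands and runs nothing. It assumes cohort mode, a single unscattered shard, and an interval list that has already been prepared. The file names, the `gcnv` output directory, and the `ploidy`/`cohort` prefixes are hypothetical. The flag names follow the GATK 4 tool documentation as I recall it, so verify them with `gatk <ToolName> --help` for your version.

```python
def gcnv_cohort_commands(bams, intervals, ploidy_priors, seq_dict, out_dir="gcnv"):
    """Build argv lists for the four gCNV steps in cohort mode (nothing is executed)."""
    counts = [f"{out_dir}/counts/sample_{i}.hdf5" for i in range(len(bams))]
    common = ["-L", intervals, "--interval-merging-rule", "OVERLAPPING_ONLY"]
    inputs = [arg for c in counts for arg in ("-I", c)]

    commands = []
    # Step 1: one read-count file per sample.
    for bam, count in zip(bams, counts):
        commands.append(["gatk", "CollectReadCounts", "-I", bam, *common, "-O", count])

    # Step 2: per-sample contig ploidy, written to <out_dir>/ploidy-calls.
    commands.append(["gatk", "DetermineGermlineContigPloidy", *common, *inputs,
                     "--contig-ploidy-priors", ploidy_priors,
                     "--output", out_dir, "--output-prefix", "ploidy"])

    # Step 3: denoising model and CNV calls for the whole cohort.
    commands.append(["gatk", "GermlineCNVCaller", "--run-mode", "COHORT", *common, *inputs,
                     "--contig-ploidy-calls", f"{out_dir}/ploidy-calls",
                     "--output", out_dir, "--output-prefix", "cohort"])

    # Step 4: per-sample interval and segment VCFs.
    for i in range(len(bams)):
        commands.append(["gatk", "PostprocessGermlineCNVCalls",
                         "--model-shard-path", f"{out_dir}/cohort-model",
                         "--calls-shard-path", f"{out_dir}/cohort-calls",
                         "--contig-ploidy-calls", f"{out_dir}/ploidy-calls",
                         "--sample-index", str(i),
                         "--sequence-dictionary", seq_dict,
                         "--output-genotyped-intervals", f"{out_dir}/sample_{i}.intervals.vcf.gz",
                         "--output-genotyped-segments", f"{out_dir}/sample_{i}.segments.vcf.gz",
                         "--output-denoised-copy-ratios", f"{out_dir}/sample_{i}.denoised.tsv"])
    return commands


cmds = gcnv_cohort_commands(["a.bam", "b.bam"], "targets.interval_list",
                            "ploidy_priors.tsv", "ref.dict")
for cmd in cmds:
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.
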
* Additional step in the new version of GATK (4.4.0): JointGermlineCNVSegmentation (Beta)
> The GermlineCNVCaller learns a denoising model per scattered shard while consistently calling CNVs across the shards. The tool models systematic biases and CNVs simultaneously, which allows for sensitive detection of both rare and common CNVs. PostprocessGermlineCNVCall is used to consolidate the scattered GermlineCNVCaller results, perform segmentation, and call the copy number states. This tool generates per-interval and per-segment sample calls in VCF format for each sample processed. ==The final step deployed in the pipeline is JointGermlineCNVSegmentation, which combines the gCNV segments and calls across samples to more accurately identify artifacts if found in too many control samples==.
>
[Reference 1](https://gatk.broadinstitute.org/hc/en-us/articles/13832774541211-JointGermlineCNVSegmentation-BETA-)
[Reference 2](https://www.mdpi.com/2075-4426/12/5/667)
Supplementary files:
https://console.cloud.google.com/storage/browser/gatk-best-practices/cnv_germline_pipeline;tab=objects?authuser=0&prefix=&forceOnObjectsSortingFiltering=false
## Limitations of CNV detection using NGS:
> * Perhaps the greatest problem in using NGS to discover structural variation is the nature of the data.
> * Owing to the complex nature of human genomes (for example, widespread common repeats and segmental duplications), there is considerable read-mapping ambiguity.
> * Longer reads and inserts are needed to ameliorate this bias by increasing the specificity in read mapping.
> * It is estimated, however, that >1.5% of the human genome cannot be covered uniquely even with read lengths of 1 kb.
[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4108431/)
## A modest improvement in diagnostic yield

[Reference](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10379589/)
## How to benchmark CNV callers?
There is no public gold-standard CNV callset so far.
Benchmarking callsets from previous studies are therefore needed as references:
1. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4617611/
(1000 Genomes Project)
2. https://www.nature.com/articles/s41598-021-93878-2
(The Supplementary Tables contain 110,050 exons: 6,853 true-positive exons with CNVs and 103,197 true-negative exons without CNVs)
3. https://jmg.bmj.com/content/55/11/735
(1000 Genomes Project Gold)
4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227574/
(1000 Genomes Project Gold Standard + Roche NimbleGen 42 million aCGH Gold Standard)
5. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8045368/
(Supplementary Table S8, with 1,912 deletions and 110 duplications, is compiled from 3 published call sets)
6. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5402652/
(Additional file 5: Spreadsheet 1)
7. https://www.mdpi.com/2072-6694/13/24/6283
(Their methodology seems easier to understand and can be used as a reference)
8. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7671382/
9. https://pubmed.ncbi.nlm.nih.gov/34257369/
10. https://github.com/bioinformatics-IBCH/Comparison-study-of-germline-CNV-calling-tools/blob/master/analysis.py
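
Once a reference callset has been chosen, precision and recall can be computed by matching calls to truth events. The sketch below is one common approach, not the method of any specific paper above. It assumes half-open coordinates and counts a match when two events of the same type on the same chromosome share at least 50% reciprocal overlap. The 50% threshold, the tuple layout, and the example intervals are all assumptions for illustration.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for (chrom, start, end, type) events."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def precision_recall(calls, truth, min_overlap=0.5):
    """Precision = matched calls / all calls; recall = matched truth events / all truth events."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls
    )
    matched_truth = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth
    )
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


# Made-up events: (chromosome, start, end, type).
truth = [("1", 1000, 2000, "DEL"), ("1", 5000, 9000, "DUP"), ("2", 100, 600, "DEL")]
calls = [("1", 1100, 2100, "DEL"), ("1", 5000, 6000, "DUP"),
         ("3", 10, 500, "DEL"), ("2", 100, 600, "DEL")]

precision, recall = precision_recall(calls, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the duplication call covers only a quarter of the true duplication, so it counts as both a false positive and a missed truth event. For exon-level truth sets, the same counting can be done per exon instead of per event.
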
## Variant filtration
> * CNVs were filtered based on their mode of inheritance, gCNV quality scores (QS) (QS>50; developer recommendations are QS>50 for duplications, >100 for deletions, and >400 for homozygous deletions, see Babadi et al. for details), and their frequency in the Broad CMG callset. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10593084/)
> * A CNV is defined as high-confidence by GATK-gCNV (see Babadi et al. for details) if:
>     1. The CNV is present in a high-quality sample (with ≤ 200 autosomal raw CNV calls, of which at least 35 have QS >20)
>     2. The sample frequency of the call is ≤ 0.01 within the Broad callset
>     3. The number of overlapped exons is ≥ 3
>     4. The QS score is equal or greater than the QS threshold (QS>50 for duplications, >100 for deletions, and >400 for homozygous deletions)
>
> (https://www.nature.com/articles/s41588-023-01449-0)
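
The high-confidence criteria above can be sketched as a small filter. This is an illustration, not GATK code: the dictionary keys (`kind`, `qs`, `n_exons`, `sample_frequency`) are hypothetical names for values that would have to be extracted from the gCNV segment VCFs and a cohort frequency table. Sample-level QC (criterion 1) is passed in as a precomputed boolean. The quoted text mixes "equal or greater" with ">", so the boundary handling is an assumption.

```python
# QS thresholds from the quoted developer recommendations.
QS_THRESHOLDS = {"DUP": 50, "DEL": 100, "HOM_DEL": 400}


def is_high_confidence(call, sample_is_high_quality):
    """Apply criteria 1-4; `call` uses hypothetical field names."""
    return (
        sample_is_high_quality                       # criterion 1, computed elsewhere
        and call["sample_frequency"] <= 0.01         # criterion 2
        and call["n_exons"] >= 3                     # criterion 3
        # Criterion 4: ">=" follows "equal or greater"; the source also writes ">".
        and call["qs"] >= QS_THRESHOLDS[call["kind"]]
    )


# Made-up calls for illustration.
example_calls = [
    {"kind": "DEL", "qs": 150, "n_exons": 4, "sample_frequency": 0.002},
    {"kind": "DUP", "qs": 40, "n_exons": 5, "sample_frequency": 0.001},
    {"kind": "HOM_DEL", "qs": 450, "n_exons": 2, "sample_frequency": 0.001},
    {"kind": "DEL", "qs": 300, "n_exons": 6, "sample_frequency": 0.05},
]

results = [is_high_confidence(c, sample_is_high_quality=True) for c in example_calls]
print(results)
```

Only the first call passes: the second fails on QS, the third on exon count, and the fourth on cohort frequency.
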

## True negative set?
> Any bases that are in the union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.18_2mindatasets_5minYesNoRatio.bed.gz bed file intervals and not in the vcf ==should be considered highly confident homozygous reference (for snps and short indels).== This bed file excludes regions/variant locations that are uncertain due to low coverage, genotypes called in < 2 datasets, locations with unresolved discordant genotypes, locations where most datasets have evidence of bias (systematic sequencing errors, local alignment problems, mapping problems, or abnormal allele balance), ==variants inside possible deletions, known segmental duplications, and structural variants reported in dbVar for NA12878==. In all, this excludes ~23% of the non-N bases in the GRCh37 reference assembly. To assess false positive and false negative rates, it is important to compare only variants inside the bed file regions.
https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/analysis/NIST_union_callsets_06172013/README.NIST.v2.18.txt
union13callableMQonlymerged_addcert_nouncert_excludesimplerep_excludesegdups_excludedecoy_excludeRepSeqSTRs_noCNVs_v2.18_2mindatasets_5minYesNoRatio.bed.gz
## Limitations of GATK gCNV
> Second, we have optimized GATK-gCNV for detecting rare CNVs at a site frequency <1%; ==common CNVs (frequency >1%) can be assessed, but it becomes challenging to disentangle true polymorphic CNVs segregating in the general population from technical biases==. For these variants, the performance of GATK-gCNV in the present implementation is lower than for rare CNVs (Supplementary Table 2).
> 
https://www.perplexity.ai/search/shi-yong-gatk-germline-cnvdui-Kywjpgl2Sxmg.GO_x59TqA
###### tags: `bioinformatic tools`