## What are GIAB genome-stratifications?
GIAB (Genome in a Bottle) Stratifications are ==a collection of genomic regions== categorized into different types, designed ==to aid in the evaluation and comparison of genomics tools and pipelines==. These stratifications are used to benchmark the accuracy of variant calling tools across various genomic complexities and challenges. Here's a brief overview of the types of GIAB Stratifications:
- **Low Complexity**: Includes regions with different types and sizes of low complexity sequence, such as homopolymers, Short Tandem Repeats (STRs), Variable Number of Tandem Repeats (VNTRs), and other locally repetitive sequences.
- **Functional Technically Difficult**: Consists of functional or potentially functional regions that are also likely to be technically difficult to sequence.
- **Genome Specific**: Covers regions that are difficult because of potentially difficult variation in a specific NIST/GIAB sample (e.g. NA12878), including regions containing putative compound heterozygous variants, small regions containing multiple phased variants, and regions with potential structural or copy number variation.
- **Functional Regions**: Stratifies variants inside and outside coding regions.
- **GC Content**: Contains regions with different ranges of GC content.
- **Mappability**: Includes regions where short read mapping can be challenging.
- **Ancestry**: *Available only for GRCh38*, this category includes regions with inferred patterns of local ancestry.
- **Segmental Duplications**: Focuses on regions with segmental duplications or regions with non-trivial self-chain alignments.
- **Union**: Contains unions of the general types of difficult regions, as well as the union of any type of difficult region or complex variant.
- **Other Difficult**: Highly variable regions such as the VDJ and MHC, regions near gaps or errors in the reference, and rDNA (specific to CHM13).
- **XY**: Regions specific to chromosomes X and Y, such as the PAR, XTR, and ampliconic regions.
These stratifications are essential for benchmarking tools like [hap.py](https://github.com/Illumina/hap.py/tree/master), allowing researchers to evaluate how well their variant calling algorithms perform under different genomic conditions. The stratifications are organized into subdirectories within the GIAB GitHub repository, each containing associated READMEs, scripts, and notebooks used to generate the various stratifications. Additionally, `.tsv` files are provided for use when benchmarking with [hap.py](https://github.com/Illumina/hap.py/tree/master), offering paths to stratifications relative to the reference directory [3].
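
As a rough illustration of how such a `.tsv` could be consumed outside hap.py, the sketch below reads a two-column table (stratification name, BED path relative to the table's directory) and checks that each BED file exists. The two-column layout, the helper `read_strat_tsv`, and every file name in the demo are assumptions made for illustration; inspect the files in the release you download before relying on them.

```r
# Hypothetical helper: read a stratification table with two tab-separated
# columns (name, BED path relative to the directory holding the table).
read_strat_tsv <- function(tsv_path) {
  strat <- read.table(tsv_path, sep = "\t", header = FALSE,
                      col.names = c("name", "rel_path"),
                      stringsAsFactors = FALSE)
  # Resolve each relative path against the table's own directory.
  strat$path <- file.path(dirname(tsv_path), strat$rel_path)
  strat$exists <- file.exists(strat$path)
  strat
}

# Self-contained demo in a temporary directory (all file names are made up).
demo_dir <- tempfile("strat_demo")
dir.create(file.path(demo_dir, "GCcontent"), recursive = TRUE)
file.create(file.path(demo_dir, "GCcontent", "demo_gc30to55_slop50.bed.gz"))
writeLines(
  c("gc30to55_slop50\tGCcontent/demo_gc30to55_slop50.bed.gz",
    "missing_example\tGCcontent/not_there.bed.gz"),
  file.path(demo_dir, "demo-stratifications.tsv")
)

strat <- read_strat_tsv(file.path(demo_dir, "demo-stratifications.tsv"))
print(strat[, c("name", "exists")])
```

A check like this catches a table that was moved away from its BED files before a long benchmarking run starts.
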
Citations:
[1] https://www.biorxiv.org/content/10.1101/2023.10.27.563846v1.full
[2] https://github.com/genome-in-a-bottle/genome-stratifications
[3] https://www.nist.gov/programs-projects/genome-bottle
[4] https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/
## The effect of GC content
The effect of GC content on Next-Generation Sequencing (NGS) is primarily characterized by biases in library preparation, which can lead to uneven coverage of genomic regions. These biases are influenced by the thermodynamics of DNA polymerase activity during PCR amplification, which is a critical step in preparing samples for sequencing. Here's a summary of the key effects:
- **Uneven Coverage**: Regions with extreme GC content (either very high or very low) can be underrepresented or overrepresented in the final sequencing library. This unevenness affects the ability to accurately call variants in these regions, potentially skewing the results of genomic analyses.
- **PCR Amplification Bias**: The PCR amplification step, which increases the quantity of DNA fragments available for sequencing, introduces biases. It amplifies some sequences more efficiently than others, so genomic regions with different GC contents end up unequally represented. This is particularly problematic for regions with extreme GC content, where the very high or very low proportion of G and C bases makes fragments harder to amplify.
- **Impact on Variant Calling**: Uneven coverage and biases in library preparation can compromise the accuracy of variant calling, especially in regions with extreme GC content. This can lead to false positives or negatives in the identification of genetic variations, which is crucial for understanding disease mechanisms and developing personalized treatments.
:::info
High GC → thermodynamically stable → more energy is required to denature and separate the DNA strands during PCR amplification → lower coverage because the PCR process is less efficient.
Low GC → poor representation in the sequencing library, leading to sparse coverage and difficulty in distinguishing true variants from sequencing artifacts.
:::
GIAB defines the ranges of GC content for genomic regions by identifying areas with specific fractions of G and C bases. This categorization is crucial because different sequencing technologies exhibit distinct error profiles in regions with varying GC content. Specifically, ==GC-rich and AT-rich regions tend to have reduced coverage and precision in variant calling==. To address this, GIAB identifies regions with specific GC percentages in 5% increments from 15% to 85%. This approach allows for a detailed understanding of how different genomic regions behave under various sequencing conditions, which is essential for accurate variant detection and interpretation.
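
The binning idea can be sketched in base R as follows. This is only an illustration of assigning a sequence window to a 5% GC bin between 15% and 85%; the function names, the bin labels, and the boundary convention (lower bound inclusive) are assumptions, and it is not the pipeline GIAB used to build the released files.

```r
# GC percentage of a sequence, ignoring anything that is not A, C, G or T.
gc_percent <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  100 * sum(bases %in% c("G", "C")) / sum(bases %in% c("A", "C", "G", "T"))
}

# Assign a GC percentage to a 5% bin between 15% and 85%, with open-ended
# bins below 15% and above 85% (labels are illustrative).
gc_bin <- function(gc) {
  breaks <- c(0, seq(15, 85, by = 5), 100)
  labels <- c("lt15",
              paste0(seq(15, 80, by = 5), "to", seq(20, 85, by = 5)),
              "gt85")
  as.character(cut(gc, breaks = breaks, labels = labels,
                   right = FALSE, include.lowest = TRUE))
}

gc_bin(gc_percent("ATGCATGCAT"))  # 4 of 10 bases are G/C -> "40to45"
gc_bin(gc_percent("ATATATATAT"))  # 0% GC -> "lt15"
gc_bin(gc_percent("GGGCCCGGGC"))  # 100% GC -> "gt85"
```
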
**Relevant files**:

- **`gc30to55`**: This part of the file name gives the GC range of the regions in the file: GC content between 30% and 55%.
- **`gclt30orgt55`**: This part of the file name gives the GC range of the regions in the file: GC content less than 30% or greater than 55%.
- **`_slop50`**: This suffix indicates that each region has been extended by 50 base pairs on either side. The name comes from `bedtools slop`, which pads intervals by a fixed number of bases and clips them at the chromosome ends; it is not an abbreviation. As a result, variants in the sequence immediately flanking a GC-extreme window fall into the same stratification as the window itself.
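
Extending intervals by a fixed number of bases on each side, as the `_slop50` suffix describes, can be sketched in base R. The helper `slop_bed`, the column names, and the chromosome size below are hypothetical; the sketch only mirrors the pad-and-clip behaviour of `bedtools slop` and is not how the released files were produced.

```r
# Pad BED-style intervals (0-based start, exclusive end) by `pad` bases on
# each side and clip them to the chromosome boundaries.
slop_bed <- function(bed, chrom_sizes, pad = 50L) {
  bed$start <- pmax(bed$start - pad, 0L)
  bed$end <- pmin(bed$end + pad, unname(chrom_sizes[bed$chrom]))
  bed
}

# Toy input: a made-up chromosome of 1000 bp with two intervals.
chrom_sizes <- c(chr1 = 1000L)
bed <- data.frame(chrom = c("chr1", "chr1"),
                  start = c(20L, 900L),
                  end = c(120L, 980L),
                  stringsAsFactors = FALSE)

padded <- slop_bed(bed, chrom_sizes)
print(padded)
# The first interval becomes 0-170 (clipped at the chromosome start),
# the second becomes 850-1000 (clipped at the chromosome end).
```
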
:::warning
A previous study assessed GC bias across many commonly used platforms by sequencing multiple genomes (with mean GC contents ranging from 28.9% to 62.4%). It concluded that sequencing libraries prepared for the MiSeq and NextSeq platforms exhibited major GC biases, with problems becoming increasingly severe outside the 45–65% GC range [5]. Although GIAB provides both `gclt25orgt65` and `gclt30orgt55`, its `alldifficultregions.bed.gz` was generated treating `gclt25orgt65` as the problematic GC content region.
:::
[5] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7016772/
[6] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639258/
[7] https://assets.thermofisher.com/TFS-Assets/BID/Technical-Notes/impact-gc-bias-library-preparation-tech-note.pdf
## Overlap specific regions with GIAB-stratification using R
```
library(GenomicRanges)

strat_dir <- "GRCh37@all"

# Read a BED file (plain or gzipped) into a GRanges object.
# BED starts are 0-based and ends are exclusive, whereas GRanges is 1-based
# and inclusive, so the start is shifted by 1 to avoid spurious overlaps
# between adjacent intervals.
read_bed <- function(path) {
  bed <- read.table(path, header = FALSE, stringsAsFactors = FALSE)
  GRanges(seqnames = bed$V1,
          ranges = IRanges(start = bed$V2 + 1, end = bed$V3))
}

# Collect every stratification BED except the sample-specific ones.
all_bed <- list.files(strat_dir, pattern = "\\.bed\\.gz$", recursive = TRUE)
all_bed <- all_bed[!grepl("GenomeSpecific", all_bed, fixed = TRUE)]

target_region_gr <- read_bed("CAP_confusion_region.bed")

# One logical column per stratification: TRUE where a target region overlaps it.
overlapped <- vapply(all_bed, function(f) {
  overlapsAny(target_region_gr, read_bed(file.path(strat_dir, f)))
}, logical(length(target_region_gr)))
# vapply returns a plain vector when there is a single target region,
# so force a regions x stratifications matrix.
overlapped <- matrix(overlapped, nrow = length(target_region_gr),
                     dimnames = list(NULL, all_bed))

# Comma-separated list of overlapping stratifications for each target region.
target_region_gr$overlapped <- apply(overlapped, 1, function(hit) {
  paste(all_bed[hit], collapse = ",")
})

result <- cbind(as.data.frame(target_region_gr), overlapped)
```
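
For small inputs, the overlap test itself can be illustrated without Bioconductor. The sketch below is a toy base R version that works directly on BED coordinates (0-based start, exclusive end); the helper `bed_overlaps_any` and the example intervals are made up, and the pairwise comparison would be far too slow for genome-wide stratification files.

```r
# TRUE for each target interval that overlaps at least one stratification
# interval on the same chromosome. With half-open coordinates, two intervals
# overlap when each one starts before the other ends.
bed_overlaps_any <- function(target, strat) {
  vapply(seq_len(nrow(target)), function(i) {
    any(strat$chrom == target$chrom[i] &
          strat$start < target$end[i] &
          target$start[i] < strat$end)
  }, logical(1))
}

target <- data.frame(chrom = c("1", "1", "2"),
                     start = c(100L, 500L, 100L),
                     end = c(200L, 600L, 200L),
                     stringsAsFactors = FALSE)
strat <- data.frame(chrom = c("1", "1"),
                    start = c(150L, 600L),
                    end = c(160L, 700L),
                    stringsAsFactors = FALSE)

hits <- bed_overlaps_any(target, strat)
print(hits)  # TRUE FALSE FALSE
```

The second target ends at 600, exactly where a stratification interval starts, so the two are adjacent rather than overlapping. This is the case that is miscounted when BED starts are passed to a 1-based interval type without adding 1.
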
https://hackmd.io/@sK-GgpcqTNWnutbxzOoG2g/ByjWu11j_#NA12878-HG001