--- title: Quantification of alignment-wide major AF tags: OMM description: Personal hackmd notes image: https://partechshaker.com/wp-content/uploads/2018/10/logo_square.png robots: noindex, nofollow GA: UA-165598729-1 --- Quantification of alignment-wide *major AF* === :::info Quantify the major AF for all positions in the alignment and not only for positions detected by variant callers ::: [TOC] Make a local copy for testing --- ```bash! mkdir -p bam scp pmuench@grid.bifo.helmholtz-hzi.de:/net/sgi/oligomm_ab/OligoMM-report/databases/omm_new/joined_reference_curated.fasta bam scp pmuench@grid.bifo.helmholtz-hzi.de:/net/sgi/oligomm_ab/OligoMM-report/databases/omm_new/joined_reference_curated.fasta.fai bam scp pmuench@grid.bifo.helmholtz-hzi.de:/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam bam/ scp pmuench@grid.bifo.helmholtz-hzi.de:/net/sgi/oligomm_ab/OligoMM-report/processed/1423_S34/1423_S34_bwa_mapped_omm_sorted.bam.bai bam/ cd bam ``` Quantification of major AF --- AF can be quanttified by ```bash! samtools mpileup -uv -Q 0 -f joined_reference_curated.fasta -r "Akkermansia_muciniphila:1-250" 1423_S34_bwa_mapped_omm_sorted.bam > first_100.vcf nano first_100.vcf ``` where `-Q` parameter is controlling the minimal quality of the read. On default it seems there are many reads filtered out. Example for output focusing on one postion with two variants (2 reads with `G` and 1 with `C`) ![](https://i.imgur.com/BgZfwDd.png) ![](https://i.imgur.com/EvdsFyv.png) the ouptut for this particular position using the `samtools` command from above ```bash Akkermansia_muciniphila 237 . C G,T,<X> 0 . DP=54;I16=48,3,1,2,1887,72557,111,4107,3060,183600,180,10800,1100,26006,75,1875;QS=0.944165,0.0372233,0.0186117,0;VDB=0.0481132;SGB=-0.511536;RPB=0.730199;MQB=1;MQSB=1;BQB=1;MQ0F=0 PL 0,89,247,120,231,255,154,253,255,255 ``` In this case `QS=0.944165,0.0372233,0.0186117` is the AF of the three alleles (ref, alternative 1, alternative 2). So the _major AF_ is $max(QS)$ Apply major AF quantification on a bam file --- :::info - **working directory** `/home/aime/bam` ::: A script that outputs the major AF using [bcftools query](https://samtools.github.io/bcftools/howtos/query.html) ```bash! bcftools query -f '%CHROM %POS %QS\n' file.bcf ``` find max of QS field (not optimized) ```bash! samtools mpileup -uv -Q 0 -f joined_reference_curated.fasta 1423_S34_bwa_mapped_omm_sorted.bam > 1423_S34_bwa_mapped_omm_sorted.vcf bcftools view test.vcf -o file.bcf bcftools query -f '%CHROM\t%POS\t%QS]\n' file.bcf | cut -f3 | sed "s/\,/\ /g" | awk '{m=$1;for(i=1;i<=NF;i++)if($i<m)m=$i;print NR" ",m}' ``` ### Quantify frequency of _major allele_ Max value on the **QS** field describes the frequency of the major allele ```bash $ cat bin.awk BEGIN { delta = (delta == "" ? 0.1 : delta) } { bucketNr = int(($0+delta) / delta) cnt[bucketNr]++ numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr) } END { for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) { end = beg + delta printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr] beg = end } } ``` ```bash! # from bam to vcf without Q filtering samtools mpileup -v -Q 0 -f joined_reference_curated.fasta 1423_S34_bwa_mapped_omm_sorted.bam > 1423_S34_bwa_mapped_omm_sorted.vcf.gz # add -r "Akkermansia_muciniphila:1-100000" for testing # from vcf to bcf bcftools view -O b 1423_S34_bwa_mapped_omm_sorted.vcf.gz -o 1423_S34_bwa_mapped_omm_sorted.bcf.gz # extract major allele frequency bcftools query -f '%QS\n' 1423_S34_bwa_mapped_omm_sorted.bcf.gz | awk -F',' '{m=$1;for(i=1;i<=NF;i++)if($i>m)m=$i;print m}' > major_allele.dat # create histogram data awk -f hist.awk major_allele.dat > major_allele_bins.dat # using Rscript specified below Rscript plotHist.R -in major_allele_bins.dat -out major_allele_bins.dat.pdf ``` on sample `1423_S34_bwa_mapped_omm_sorted.bam` the output is ```bash $ less major_allele_bins.dat 0.0 0.1 141 0.1 0.2 0 0.2 0.3 0 0.3 0.4 4 0.4 0.5 983 0.5 0.6 765 0.6 0.7 1643 0.7 0.8 5650 0.8 0.9 19809 0.9 1.0 4888812 1.0 1.1 30253736 ``` ![](https://i.imgur.com/vxHXWfP.png) ### Quantify frequency of _reference allele_ First value on the **QS** field describes the frequency of the reference allele - `QS(1)`$=1$ if there is no variant on this position at all - `QS(1)`$=0$ if there all reads have on this position variants ```bash $ cat bin.awk BEGIN { delta = (delta == "" ? 0.1 : delta) } { bucketNr = int(($0+delta) / delta) cnt[bucketNr]++ numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr) } END { for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) { end = beg + delta printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr] beg = end } } ``` ```bash! # from bam to vcf without Q filtering samtools mpileup -v -Q 0 -f joined_reference_curated.fasta 1423_S34_bwa_mapped_omm_sorted.bam > 1423_S34_bwa_mapped_omm_sorted.vcf.gz # add -r "Akkermansia_muciniphila:1-100000" for testing # from vcf to bcf bcftools view -O b 1423_S34_bwa_mapped_omm_sorted.vcf.gz -o 1423_S34_bwa_mapped_omm_sorted.bcf.gz # extract Ref allele frequency bcftools query -f '%QS\n' 1423_S34_bwa_mapped_omm_sorted.bcf.gz | awk -F',' '{print $1}' > ref_allele.dat # create histogram data awk -f hist.awk delta='0.05' ref_allele.dat > ref_allele_bins.dat # using Rscript specified below Rscript plotHist.R -in ref_allele_bins.dat -out ref_allele_bins.pdf ``` on example files (subset `-r "Akkermansia_muciniphila:1-100000`)the output is ```bash $ less ref_allele_bins.dat 0.0 0.1 0 0.1 0.1 0 0.1 0.2 0 0.2 0.2 0 0.2 0.2 0 0.2 0.3 0 0.3 0.3 0 0.3 0.4 0 0.4 0.4 0 0.4 0.5 0 0.5 0.5 0 0.5 0.6 0 0.6 0.7 0 0.7 0.7 0 0.7 0.8 0 0.8 0.8 0 0.8 0.9 0 0.9 0.9 0 0.9 1.0 1 1.0 1.0 15 1.0 1.1 486 ``` and on full `1423_S34_bwa_mapped_omm_sorted.bam` file ``` $ less ref_allele_bins_0.1.dat 0.0 0.1 4602 0.1 0.2 80 0.2 0.3 107 0.3 0.4 146 0.4 0.5 1164 0.5 0.6 564 0.6 0.7 1505 0.7 0.8 5550 0.8 0.9 19727 0.9 1.0 4888735 1.0 1.1 30249363 ``` generated script to visualize AF ```r! #!/usr/bin/env Rscript # Philipp C. Muench (muench@hsph.harvard.edu) # script generates plot from binned AF output # usage: Rscript plotHist.R -in ref_allele_bins.dat -out plot.pdf suppressPackageStartupMessages(require(optparse)) suppressPackageStartupMessages(require(ggplot2)) option_list = list( make_option(c("-i", "--in"), action="store", default=NA, type='character', help="path to input file"), make_option(c("-o", "--out"), action="store", default='plot.pdf', type='character', help="path to pdf output"), make_option(c("-v", "--verbose"), action="store_true", default=TRUE, help="Should the program print extra stuff out? [default %default]"), make_option(c("-q", "--quiet"), action="store_false", dest="verbose", help="Make the program not be verbose."), make_option(c("-c", "--cvar"), action="store", default="this is c", help="a variable named c, with a default [default %default]") ) opt = parse_args(OptionParser(option_list=option_list)) if (opt$v) message(paste("Processing", opt$i)) readInput <- function(input_path){ stopifnot(!is.na(input_path)) df <- as.data.frame(read.table(input_path, header = F)) stopifnot(ncol(df) == 3) stopifnot(nrow(df) > 0) colnames(df) <- c("from", "to", "value") df$bin_name <- paste0("[", df$from, "-", df$to, "[") return(df) } genPlot <- function(df){ p <- ggplot2::ggplot(df, aes(x = bin_name, y = value)) + ggplot2::geom_bar(stat="identity") p <- p + theme_classic() + xlab("binned AF of reference") + ylab("occurence in alignment (log10 scaled)") p <- p + scale_y_log10() return(p) } if(!is.na(opt$i) & !is.na(opt$o)) { p <- genPlot(readInput(opt$i)) pdf(file = opt$o, width = 5, height = 4) plot(p) dev.off() } else { message("Please specify input file (--in) and output file (--out))") } ``` on `1423_S34_bwa_mapped_omm_sorted.bam` output is ![](https://i.imgur.com/ZxgBtYG.png) Double check major AF --- seems there are cases where the major AF is smaller than 0.5, which can be the case that major AF is here defined different. Get the nt position of such a event `major_allele.dat` contains 35,171,543 nts ```bash! cd /home/aime/bam R ``` ```R! dat <- read.csv2("major_allele.dat", header=F) which(as.numeric(as.matrix(dat$V1)) < 0.5) ``` ```R [1] 3705432 7428417 8339222 12785599 12801802 12807677 12865010 12905780 [9] 12905781 12905782 12905783 12907990 12907991 12915636 12936736 13096169 [17] 13096170 13136228 13142274 13142275 13149856 13250571 13263228 13267494 [25] 13323095 13350365 13400822 13400823 13425049 13426211 13494452 13582886 [33] 13635763 13682095 13704124 13728217 13787279 13787280 13990743 14027600 [41] 14038141 14056181 14123013 14158764 14335171 14338352 14338353 14342744 [49] 14371168 14433333 14496178 14496179 14496180 14496181 14534117 14534118 [57] 14534119 14567962 14613302 14742008 14803768 14859559 14863397 15174238 [65] 15231979 15284586 15615977 16637530 16877490 16877492 17647532 17647534 [73] 18158427 18240129 18520823 18723505 18850812 18850814 19203345 20329624 [81] 21141689 21323219 21827503 22068387 22083740 22614336 22633436 22633437 [89] 22633438 22633439 22692987 22755898 22777267 22777268 22837963 22885442 [97] 22885443 22888770 22888771 22945621 22945622 22989151 22989152 23037227 [105] 23045858 23045859 23058519 23114390 23118597 23199150 23246013 23287156 [113] 23302827 23303618 23303619 23309544 23728154 23902557 24232221 24232222 [121] 24232223 25236090 25370697 26959843 26959844 26959845 26959846 26959847 [129] 26959848 26959849 26959850 26959851 26959852 26959853 26959854 26959855 [137] 26959856 26959857 26959858 26959859 26959860 26959861 27129000 27177304 [145] 27212336 27584024 27603379 27632768 27671015 27671016 27763343 27789162 [153] 27867524 28116117 28116118 28116119 28116120 28116121 28116122 28116123 [161] 28116124 28223377 28237198 28533586 28544344 28560235 28785291 29530967 [169] 29626228 29646069 29646076 29798525 29861169 30152468 30974311 31749134 [177] 31905289 32156585 33719644 ``` This file shouuld contail all genomes starting with _A. Muc_, so these coordinates are relative to the first position in the reference. The order of the reference genomes: ``` $ grep ">" joined_reference_curated.fasta >Akkermansia_muciniphila >Acutalibacter_muris >Bifidobacterium_animalis_YL2_1 >Bifidobacterium_animalis_YL2_2 >B_caecimuris >Blautia_coccoides >Clostridium_innocuum >Clostridioforme >Enterococcus_faecalis >F_plautii_1 >F_plautii_2 >Lactobacillus_reuteri_I49_1 >Lactobacillus_reuteri_I49_2 >Lactobacillus_reuteri_I49_3 >Muribaculum_intestinale >T_muris ``` with sizes ``` $ less joined_reference_curated.fasta.fai Akkermansia_muciniphila 2737357 25 60 61 Acutalibacter_muris 3802913 2783026 60 61 Bifidobacterium_animalis_YL2_1 1133782 6649353 60 61 Bifidobacterium_animalis_YL2_2 888144 7802064 60 61 B_caecimuris 4800606 8705025 60 61 Blautia_coccoides 5128582 13585661 60 61 Clostridium_innocuum 4469084 18799742 60 61 Clostridioforme 7157610 23343328 60 61 Enterococcus_faecalis 3025655 30620255 60 61 F_plautii_1 3743035 33696351 60 61 F_plautii_2 70670 37501783 60 61 Lactobacillus_reuteri_I49_1 2045096 37573660 60 61 Lactobacillus_reuteri_I49_2 14338 39652870 60 61 Lactobacillus_reuteri_I49_3 4170 39667476 60 61 Muribaculum_intestinale 3307069 39671741 60 61 T_muris 2887949 43033937 60 61 ``` Neighborhood around the first < 0.5 major AF ``` 0.993274 0.992812 0.966161 0.89012 0.975505 0.48508 <- 0.976565 0.99231 0.990577 0.991922 0.996273 ``` Open sample in [3.123.8.123/catgenome/](3.123.8.123/catgenome/)