---
title: 'Heterozygosity Methods: ROHan, ANGSD, VCFTools and GenomeScope'
disqus: hackmd
---
Heterozygosity Methods: ROHan, ANGSD, VCFTools and GenomeScope
===
**Summary of heterozygosity results for Warren:**
- GenomeScope (not listed here): 0.19 #this is based on raw reads from our reference lynx
- ANGSD (below): 0.23 #this method uses whole genomes (bams aligned to our mLynCan4 reference)
- ROHan: (still running) #same input as ANGSD
- PLINK: 0.25, 321 SNPs/Mb #this is a SNP-based method
Re: the inbreeding coefficients, there is a pretty large range. These were calculated using VCFTools (below), but I would like to validate the estimates with at least one more method because some of them are REALLY high. I should note that these relationships didn't show up clearly in my two relatedness analyses, so I am a little suspicious.
```
> min(plink$F)
0.02024 #this is LIC60, a lynx from Maine
> max(plink$F)
0.95235 #this is cb42, one of the lynx (now extirpated) from what was formerly the lynx population on Cape Breton Island. The lynx from CBI have the three highest inbreeding coefficients (0.95, 0.94, 0.820) followed by Newfoundland lynx. The highest inbreeding coefficient (F) for Maine lynx is 0.45905 (LIC27B)
```
## Objective
Heterozygosity has turned out to be a metric we've spent more time on than originally intended. This is mostly because our original intention was not just to generate a heterozygosity metric for lynx (global and by population) but ALSO to make comparisons to other felids, including Iberian lynx. This has proven more challenging than expected. Why? Because published papers have each measured heterozygosity with different methods: a few genome-wide, some SNP-based, and some using microsatellites. Here we will try two methods (ROHan and ANGSD) that calculate heterozygosity genome-wide from bam files aligned to our mLynCan4v1p reference genome for Canada lynx. We will also use a VCF (SNP) based method through PLINK, and compare the three estimates.
A review by [Miller et al. 2014 in Heredity](https://www.nature.com/articles/hdy201399) suggests that SNP-based estimates are generally robust, as long as the number of SNPs used is enough to ensure power (as is the case here). I still have reservations about making interspecies comparisons because papers present so few details on how heterozygosity is estimated. Ultimately, we will take our heterozygosity estimate and compare it to others, including the relevant information (e.g. how many loci and what method).
A couple of nice interspecies heterozygosity comparisons were done by:
- Godoy et al.: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1090-1
- Cho et al.: https://www.nature.com/articles/ncomms3433?ref=driverlayer.com#accession-codes
## Background on ROH and heterozygosity
Refer Jen Vashon to these explanations for help w/ interpretation:
Diploid organisms have minor differences between corresponding pairs of autosomes. Differences of a single nucleotide create heterozygous sites. Very recent inbreeding can lead to large genomic stretches that are deserts of heterozygous sites. Such deserts are called runs of homozygosity (ROH).
Because they lack heterozygous sites, ROHs can cause global estimates of heterozygosity to be underestimated. Such global estimates of the rate of heterozygous sites are useful for inferring population genetics parameters such as effective population size (Ne).
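To make the Ne connection concrete (this is the standard textbook relationship, added here for interpretation purposes rather than taken from any of the lynx papers): under the neutral model, the expected per-site heterozygosity is set by the population mutation rate theta = 4Ne*mu, so a global heterozygosity estimate can be turned into a rough Ne once a per-site mutation rate mu is assumed.
```
% expected per-site heterozygosity under the standard neutral model
H \approx \frac{\theta}{1 + \theta}, \qquad \theta = 4 N_e \mu
\quad\Longrightarrow\quad
N_e \approx \frac{H}{4 \mu (1 - H)}
```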
## Table of Contents
[TOC]
## Method 1: ROHan genome-wide heterozygosity and runs of homozygosity (ROH) -- single samples, averaged by population
### Background
ROHan is a Bayesian framework to estimate local rates of heterozygosity, infer runs of homozygosity (ROH) and compute global (genome-wide) rates of heterozygosity. ROHan can work on modern and ancient samples with signs of ancient DNA damage (not relevant to our samples, but good to know!).
We have already run pretty robust ROH analyses using detectRUNS (R), so we are more interested in the heterozygosity estimate here.
### Installing dependencies
Working directory is
/project/uma_lisa_komoroske/Tanya/scripts/ROHan
This program has been installed as a shared module (thanks MGHPCC!)
module load anaconda2/4.4.0
module load rohan/1.0
module load gcc/4.9.2
### Download the git (done, ignore this)
#Do a "git clone --depth 1 https://github.com/grenaud/rohan.git"
#Once done, simply: cd rohan && make #this gave errors
### Calculate ts/tv ratio
We're going to get that from VCFtools, whose --TsTv option calculates the transition/transversion ratio in bins of a user-defined size, using only bi-allelic SNPs; the resulting output file has the suffix ".TsTv". What size bin should we use? We can try several bin sizes and see how sensitive Ts/Tv is to the bin size parameter...
module load vcftools/0.1.14
Usage is:
vcftools --gzvcf /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p_lynxonly/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_all_mLynCan4_v1.p_HighQualSites_processed.vcf.gz --TsTv 1000
Results (this is just for lynx):
Outputting Ts/Tv in bins of 10000bp
Ts/Tv ratio: 1.78
Outputting Ts/Tv in bins of 1000bp
Ts/Tv ratio: 1.78
It does not appear that the Ts/Tv ratio is sensitive to bin size. We will use **1.78 as our input for ROHan**.
### Preparing bams
Our bams and bam indices for each individual are located in /project/uma_lisa_komoroske/Tanya/analyses/bams_mLynCan4_v1
Provide ROHan the same reference used for mapping located here:
/project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa
### Name your chromosomes
Create a file with the names of the autosomes, one chromosome per line, e.g. chr1, chr2, ... #this isn't asked for anywhere in the function, so why?
### Run ROHan (for modern samples)
I think LIC46_RemoveBadReads.bam has the highest coverage, so we will start with that one.
Usage is:
```
rohan --rohmu 2e-5 -o LIC46_mLynCan4v1p /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa /project/uma_lisa_komoroske/Tanya/analyses/bams_mLynCan4_v1/LIC46_RemoveBadReads.bam
```
Error: Cannot find file /../preComputated/coverageprior/correctionCov_12.bin containing pre-computed coverage weights. We appear to be missing some kind of file of pre-computed coverage weights matched to each genome's coverage (LIC46 is ~12X). Will remedy this and re-run.
### TO DO
We need to average these by population and include comparisons to the bobcat and the hybrid, but some of these runs are still going, so I will report on that when they are done.
## Method 2: ANGSD -- single samples, then averaged by population
This second method is a global (genome-wide) heterozygosity estimate from ANGSD: a basic proportion of heterozygous sites divided by the adjusted genome size (excluding regions of the genome with low confidence). For a diploid single sample, the heterozygosity is simply the second value in the SFS/AFS. This is likely a better estimate than the PLINK (SNP-based) heterozygosity proportion because it covers the whole genome (the input is an aligned bam). Let's run ANGSD for each individual and then average by population.
### Installing dependencies
module load angsd/0.919
1. You might want an interactive session to try this!
bsub -q interactive -R rusage[mem=24000] -n 1 -W 90 -Is bash
2. Run ANGSD on a single sample (or use the wrapper in step 3 below)
bsub -q short -R rusage[mem=8000] -n 4 -W 4:00 "angsd -i /project/uma_lisa_komoroske/Tanya/analyses/bams_mLynCan4_v1/a109_RemoveBadReads.bam -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -fold 1 -GL 6 -doCounts 1"
3. Use Wrapper
I created a wrapper script so that ANGSD would run all 69 of our bams in one job. Given that a single bam used 537 MB and ~1 h on the cluster, I will allot the following (trying just two individuals first to be sure it works):
bsub -q short -R rusage[mem=2000] -n 2 -W 3:00 -R span\[hosts=1\] "wrapper_angsd_het.sh"
#8161046
4. RealSFS
It was really tricky to get this job to run. I would follow these exact job submission parameters (100231 MB MAX memory used).
Run this job from the following directory: /share/pkg/angsd/0.919/bin/
```
bsub -q short -R rusage[mem=80000] -n 2 -W 4:00 -R span\[hosts=1\] "./realSFS /project/uma_lisa_komoroske/Tanya/scripts/angsd/allbams/angsdput.saf.idx >/project/uma_lisa_komoroske/Tanya/scripts/angsd/allbams/est.ml"
```
I have a feeling that even though I ran the wrapper, it didn't end up running both samples. I think the angsd command needs to have a -bam/-b (list of BAM/CRAM files) instead of the -i (single sample) flag.
5. Calculate a basic heterozygosity proportion in R
a<-scan("/project/uma_lisa_komoroske/Tanya/scripts/angsd/est.ml") #read the SFS estimated by realSFS
a[2]/sum(a) #heterozygous sites divided by total sites = genome-wide heterozygosity
### Result:
Heterozygosity estimate: 0.23. This is similar to the 0.19 result from GenomeScope. We still need to finish running each bam and then average by population, but this is a good indicator :)
See more about this het proportion method here:
http://www.popgen.dk/angsd/index.php/Heterozygosity
### TO DO:
We still need to average these results by population and compare to bobcat, but a few samples are still running. A sketch of how the per-population averaging could look is below.
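Once each bam's est.ml is finished, the averaging itself is just bookkeeping in R. A minimal sketch, assuming (hypothetically) one `<sample>.est.ml` file per individual in the allbams directory and a two-column sample-to-population map (`pop_map.txt`) that I would still need to make:
```
# per-sample heterozygosity from the realSFS output, then averaged by population
# (the file naming pattern and pop_map.txt are assumptions, not files that exist yet)
sfs_files <- list.files("/project/uma_lisa_komoroske/Tanya/scripts/angsd/allbams",
                        pattern = "\\.est\\.ml$", full.names = TRUE)

het_per_sample <- sapply(sfs_files, function(f) {
  a <- scan(f, quiet = TRUE)
  a[2] / sum(a)                      # heterozygous sites / total sites, as above
})

samples <- sub("\\.est\\.ml$", "", basename(sfs_files))
pop_map <- read.table("pop_map.txt", col.names = c("sample", "population"))

res <- merge(data.frame(sample = samples, het = het_per_sample), pop_map, by = "sample")
aggregate(het ~ population, data = res, FUN = mean)  # population-level averages
```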
## Method 3: VCFtools -- all lynx (no bobcats or hybrid included here)
We used the full suite of diversity metrics from VCFTools, including individual heterozygosity. The VCFTools --het function calculates heterozygosity on a per-individual basis, including the inbreeding coefficient, F, for each individual using a method of moments. The resulting file has the suffix ".het".
Note that this is a SNP-based method with an input VCF, not a "whole-genome" method based on aligned bams. I find these methods to be tricky, because VCF files go through so much filtering for quality control (LD pruning, etc.). In order to get an accurate heterozygosity ratio, the number of heterozygous SNPs (considered MAF > 0.2 here) is divided by an adjusted genome size (2.41 Gb) correcting for low-confidence regions of the genome. My understanding from the Miller et al. 2014 review (linked above) is that these SNP-based heterozygosity estimates are robust, depending on the number of loci used, which in this case should be fine. :)
Usage is:
```
vcftools --gzvcf /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_allexcept2_mLynCan4_v1.p_HighQualSites_processed.vcf.gz --het --freqx --out output_het
```
The VCFtools output was then loaded into R for further inspection and to calculate the het ratio
[in R]
```
het$sumgenotyped = rowSums(het[,c(5,6,7)]) #sum of all genotyped (non-missing) genotype counts per SNP
het$total61 = rowSums(het[,c(8,9)]) #sum of all genotyped AND missing, should be 61
het$proportionhets <- het[,c(6)]/het[,c(8)] #heterozygote proportion per SNP
#count SNPs that are heterozygous (MAF > 0.2)
hetSNPs <- het[which(het$proportionhets > 0.2), ] #this should actually be the proportion with MAF > 0.2, following the methods of Godoy et al. and GenomeScope
```
### Results
```
773660 heterozygous SNPs / 3070089 total SNPs = 0.2519992 #this is similar to ANGSD (0.23) and GenomeScope (0.19)

#dividing the total number of heterozygous SNVs by the genome size in Mb (1 Mb = 1,000,000 bp; adjusted genome size = 2410 Mb)
773660/2410 = 321.0207 heterozygous SNPs per Mb #this is comparable to the SNPs/Mb for Eurasian lynx (279) and is higher than Iberian lynx, but Iberian lynx have the lowest SNPs/Mb known (due to a high % of the genome being in ROH). Looks good!
```
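As a quick sanity check on those two numbers (values copied from the counts above; 2410 is the adjusted genome size expressed in Mb):
```
het_snps   <- 773660      # heterozygous SNPs (MAF > 0.2)
total_snps <- 3070089     # total SNPs in the filtered VCF
genome_mb  <- 2410        # adjusted genome size in Mb (2.41 Gb)

het_snps / total_snps     # ~0.252, the heterozygosity ratio
het_snps / genome_mb      # ~321 heterozygous SNPs per Mb
```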
### Re: Inbreeding Coefficient
There is a pretty large range; see my skepticism above.
```
> min(plink$F)
[1] 0.02024 #this is LIC60, a lynx from Maine
> max(plink$F)
[1] 0.95235 #this is cb42, one of the lynx (now extirpated) from what was formerly the lynx population on Cape Breton Island. The lynx from CBI have the three highest inbreeding coefficients (0.95, 0.94, 0.820) followed by Newfoundland lynx. The highest inbreeding coefficient (F) for Maine lynx is 0.45905 (LIC27B)
```
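One cheap way to double-check these values: the F that VCFtools reports is a method-of-moments estimate, F = (O(Hom) - E(Hom)) / (N_sites - E(Hom)), so it can be recomputed by hand from the observed and expected homozygote counts in the .het output and compared against the reported column. A minimal sketch, assuming the output file output_het.het has the standard five-column layout (INDV, O(HOM), E(HOM), N_SITES, F):
```
# recompute the method-of-moments inbreeding coefficient from the VCFtools --het output
het <- read.table("output_het.het", header = TRUE)
colnames(het) <- c("INDV", "O_HOM", "E_HOM", "N_SITES", "F")

het$F_check <- (het$O_HOM - het$E_HOM) / (het$N_SITES - het$E_HOM)

summary(het$F - het$F_check)                          # should be ~0 everywhere
het[order(-het$F), c("INDV", "F", "F_check")][1:5, ]  # the most inbred individuals (the CBI lynx)
```
This doesn't make F any more believable on its own (it is the same estimator), but it rules out a parsing mistake; a genuinely independent check would be something like PLINK's --het or a ROH-based F.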
## Method 4: GenomeScope --single sample
This was the original heterozygosity estimate on LIC74 (our reference lynx) from raw reads using GenomeScope. You can read more about the GenomeScope method [here](https://www.biorxiv.org/content/biorxiv/suppl/2017/02/28/075978.DC2/075978-1.pdf).
You can read my tutorial on interpreting GenomeScope profiles [here](https://hackmd.io/6wOHeys3QImjwGsuYIzytA)
We used GenomeScope via DNANexus (part of the VGP QC for genome assemblies) with 31-mers.
### Results:
The resultant heterozygosity estimate was 0.19. This is fairly in line with the estimate from ANGSD (0.23), and likely more reliable because it is based on a genome with ridiculously high coverage, whereas the ANGSD estimate is based on genotype likelihoods from low-coverage whole genomes, and has some proportion of missing (low confidence) calls in each genome.
## Appendix and FAQ
Some literature to bookmark for later:
https://www.genetics.org/content/205/1/317
Re: lynx, I find it kind of annoying that there's no simple estimate of the heterozygosity ratio in any of these earlier papers. I guess I will rely on the Ho/He, which is buried somewhere in the supplemental material; I have yet to uncover exactly where, though.
https://www.nrcresearchpress.com/doi/abs/10.1139/cjz-2014-0227#.Xv51JJNKhQI
https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.2945
https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.3223
https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15131?casa_token=cKGQ1DCw0sQAAAAA%3AteSkBAGeKyInZCR4Ow7b3iDOoalyH3axOeh0bf2GpYG89QNUfTn9cE_PZTrX-6EHdHIBNKq655vZ
I want to read this preprint and include it in the discussion re: diversity metrics and extinction risk (they may not be as correlated as we think):
https://arxiv.org/pdf/2007.02569.pdf
:::info
**Find this document incomplete?** Leave a comment!
:::
###### tags: `chapter2_popgen_metrics` `ROHan` `ANGSD` `VCFTools` `Inbreeding Coefficient` `heterozygosity ratio`