angsd - HackMD

# angsd By Tanya Lama #### Objectives #### Requirements ``` module load anaconda2/4.4.0 module load GATK/3.5 module load java/1.8.0_77 module load R/3.6.0 module load gcc/8.1.0 ``` What are the dependencies for angsd to run? ``` create --name angsd python==2.7.15 bwa==0.7.17 samtools==1.4.1 picard==2.17.8 gatk4 angsd source activate angsd ``` #### Move all bams to /bams directory ``` cp -r /project/uma_lisa_komoroske/Tanya/analyses/step6_RemoveBadReads/LIT2/LIT2_RemoveBadReads.bai /project/uma_lisa_komoroske/Tanya/analyses/bams ``` #### Index bams No need to index bams. We have BAI-format indices for all BAM files (this is the default product of samtools index) just move the .bai files over with the .bam files #### Make a list of bams for angsd ``` ls /project/uma_lisa_komoroske/Tanya/analyses/bams/*.bam > bam.filelist ``` #### SNP and genotype calling SNPs are called based on their allele frequencies by -doMaf. Basically, they will call a SNP if a site has a minor allele frequency significantly different from 0. (Note: how about really minor allele?) ### MAF for every basepair angsd -bam bam.filelist -doMajorMinor 2 -doMaf 8 -doCounts 1 -out out ### SNP calling -- trying this first we analyse data from bam files (-bam bam.files), calculate the genotype likelihood using the samtools method (-GL 1), infer the major and minor alleles (-doMajorMinor 1), estimate the allele frequencies assuming known minor (-doMAF 2) and only keep those sites that have a p-value less than 1e-6 of for being variable. the results are given in the file angsdSNPs.mafs.gz: The columns are the chromosome, the position, the major allele, the minor allele, the minor allele estimate, the allele frequency, the p-value and the number of individuals with information. ``` bsub -q long -W 300:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "angsd -bam bam.filelist -GL 1 -out /project/uma_lisa_komoroske/Tanya/analyses/bams/angsdSNPs2 -doMaf 2 -SNP_pval 1e-6 -doMajorMinor 1" #trying this on 2 separate runs, not knowing how long it could take... ``` ### Genotype Likelihoods angsd -bam bam.filelist -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 2 -SNP_pval 2e-6 -out genolike -nThreads 10 ### Genotype calling in one step angsd -bam bam.filelist -GL 2 -out gatk_outfile -doMaf 2 -doMajorMinor 1 -SNP_pval 1e-6 -doGeno 5 -doPost 1 -postCutoff 0.95 ### Global estimates of heterozygosity As above, we split the bamlist into our three populations based on PCA and sNMF. Next we use angsd to estimate global heterozygosity ratio for each individual in each subpopulation Usage is: ``` angsd -bam NFLD_bam.filelist -anc ancestral.fa -dosaf 1 -gl 1 bsub -q short -R rusage[mem=8000] -n 1 -W 1:00 “angsd -bam southSLR_bam.filelist -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -fold 1 -GL 6 -doCounts 1 -out southSLR” bsub -q short -R rusage[mem=8000] -n 1 -W 1:00 "angsd -i /project/uma_lisa_komoroske/Tanya/download/bams_mLynCan4_v1.p_lynx/a202_RemoveBadReads.bam -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -fold 1 -gl 6 -doCounts 1 -out a202" #Then bsub -q short -R rusage[mem=80000] -n 2 -W 4:00 -R span\[hosts=1\] "./realSFS /project/uma_lisa_komoroske/Tanya/scripts/angsd/angsd_het/NFLD/angsdput.saf.idx >/project/uma_lisa_komoroske/Tanya/scripts/angsd/angsd_het/NFLD/est.ml" #Then in R, re run the following calculation: #This is the final calculation from the ANGSD whole genome heterozygosity ratio calculation a<- read.table("/project/uma_lisa_komoroske/Tanya/scripts/ANGSD/est.ml") a[2]/sum(a) = 0.2338132 ``` bsub -q long -W 12:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "angsd -bam NFLD_bam.filelist -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -gl 1" bsub -q long -W 12:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "angsd -bam northSLR_bam.filelist -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -gl 1" bsub -q long -W 12:00 -R rusage[mem=16000] -n 4 -R span\[hosts=1\] "angsd -bam southSLR_bam.filelist -anc /project/uma_lisa_komoroske/Tanya/download/refs/mLynCan4_v1.p/GCF_007474595.1_mLynCan4_v1.p_genomic.fa -dosaf 1 -gl 1" ## Global estimate of heterozygosity (vcftools) module load vcftools/0.1.14 bsub -q interactive -R rusage[mem=24000] -n 1 -W 90 -Is bash ### northSLR heterozygosity vcftools --gzvcf /project/uma_lisa_komoroske/Tanya/download/VCFs/mLynCan4_v1.p_lynx/6SV_unfiltered_northSLR_lynx/6SV_northSLR.vcf.gz --het --out northSLR_output_het #only one output can be called at a time, so if we want to run --freqx we have to run it separately vcftools --gzvcf /project/uma_lisa_komoroske/Tanya/download/VCFs/mLynCan4_v1.p_lynx/6SV_unfiltered_northSLR_lynx/6SV_northSLR.vcf.gz --freqx --out northSLR_output_freqx ### northSLR freqx plink --vcf /project/uma_lisa_komoroske/Tanya/download/VCFs/mLynCan4_v1.p_lynx/6SV_unfiltered_northSLR_lynx/6SV_northSLR.vcf.gz --freqx --out northSLR_output_freq --allow-extra-chr #### In R het <-read.table("/project/uma_lisa_komoroske/Tanya/scripts/vcftools_metrics/heterozygosity/northSLR_output_freq.frqx", header=TRUE, fill=TRUE) het$sumgenotyped = rowSums(het[,c(5,6,7)]) #sum of all genotyped (non-missing) het$total61 = rowSums(het[,c(8,9)]) #sum of all genotyped AND missing, should be 61 het$sumhets = rowSums(het[,c(8,9)]) #sum of all genotyped AND missing, should be 61 het$proportionhets <- rowSums(het[2:9], na.rm=TRUE)/8 #error het$proportionhets <- het[,c(6)]/het[,c(8)] #count SNPs that are heterozygous (MAF > 0.2) hetSNPs<- het[which(het$proportionhets>0.2), ] #this actually should be the proportion with MAF >0.2 according to methods from Godoy et al. and GenomeScope. ### southSLR heterozygosity /project/uma_lisa_komoroske/Tanya/download/VCFs/mLynCan4_v1.p_lynx/6SV_unfiltered_southSLR_lynx/6SV_southSLR.vcf.gz ### NFLD heterozygosity /project/uma_lisa_komoroske/Tanya/download/VCFs/mLynCan4_v1.p_lynx/6SV_unfiltered_NFLD_lynx/6SV_NFLD.vcf.gz