# Parnassius project

###### tags: `RADSeq` `STACKS`

---
author: Emma Fetterly

---

**Resources:**
1. STACKS website: https://catchenlab.life.illinois.edu/stacks/
2. Bioinformatics guide by Eric C. Anderson: https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to guide: https://speciationgenomics.github.io/mapping_reference/
4. Slurm script how-to from NU: https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
5. Globus - file management: https://www.globus.org/

# vcftools introduction

Dylan's example code for filtering the populations.snps.vcf file:
```
vcftools --vcf input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out renamed.output.vcf
```

### VCF Parameters:

**Depth:**
```
--min-meanDP <float>
--max-meanDP <float>
```
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.

**Hardy-Weinberg Equilibrium:**
```
--hwe <float>
```
Assesses sites for Hardy-Weinberg equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.

**Missing data:**
```
--max-missing <float>
```
Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
```
--max-missing-count <integer>
```
Exclude sites with more than this number of missing genotypes over all individuals.

**Phased:**
```
--phased
```
Excludes all sites that contain unphased genotypes.

**Miscellaneous filtering:**
```
--minQ <float>
```
Includes only sites with Quality value above this threshold.

# VCF filtering: preliminary analysis

Following the workflow from https://speciationgenomics.github.io/filtering_vcfs/ to calculate summary stats for this dataset.

1. Make a new `vcf` directory in the PASM directory and move the minimally filtered and r80-filtered files into it
```
mkdir vcf
```
2. Unzip the VCF file, then make a directory for the vcftools output and set shortcuts for the VCF file and output prefix
```
gunzip r80_8_filtered.vcf.gz
```
```
mkdir vcftools
full_vcf=/projects/p32037/PASM/r80_8_filtered.vcf
OUT=/projects/p32037/PASM/vcftools
```

Calculate:
- Allele frequency: `vcftools --vcf $full_vcf --freq2 --out $OUT --max-alleles 2`
- Mean depth per individual: `vcftools --vcf $full_vcf --depth --out $OUT`
- Mean depth per site: `vcftools --vcf $full_vcf --site-mean-depth --out $OUT`
- Site quality: `vcftools --vcf $full_vcf --site-quality --out $OUT`
- Proportion of missing data per individual: `vcftools --vcf $full_vcf --missing-indv --out $OUT`
- Proportion of missing data per site: `vcftools --vcf $full_vcf --missing-site --out $OUT`
- Heterozygosity and inbreeding per individual: `vcftools --vcf $full_vcf --het --out $OUT`
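(All of these summaries can also be run on the gzipped file directly with `--gzvcf`, skipping the gunzip step, e.g.:)
```
vcftools --gzvcf r80_8_filtered.vcf.gz --freq2 --out $OUT --max-alleles 2
```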
### Moving these outputs into R...

1. Variant quality
Produced a file where the quality is -1 for each site - what is going on here?
2. Variant mean depth
Based on this, it looks like we could set our **minimum depth to 10x and maximum depth to 50x** (~2x our mean depth).

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 1.00 | 10.50 | 18.40 | 26.52 | 29.82 | 3924.00 |

![Screenshot 2024-01-18 at 2.00.41 PM](https://hackmd.io/_uploads/HkGFQWvtT.png)
3. Variant missingness
The mean here is 0.6899 and the median is 0.7333, which seems very high - we may have to use a low % missing data here (50% call rate?).

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.000 | 0.5556 | 0.7333 | 0.6899 | 0.889 | 0.9333 |

![Screenshot 2024-01-18 at 2.06.08 PM](https://hackmd.io/_uploads/rkE6VWPYT.png)
4. Minor allele frequency
Maybe set this to 0.1 or 0.15 based on the data?

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.01351 | 0.1000 | 0.18750 | 0.22612 | 0.35000 | 0.5 |

![Screenshot 2024-01-18 at 2.10.38 PM](https://hackmd.io/_uploads/H1k0S-wKT.png)

---

### Individual based statistics

1. Mean depth per individual
The range is pretty extreme here (2.071 - 127.826), suggesting an issue with individual sequencing depth.

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 2.071 | 8.997 | 29.206 | 32.607 | 49.057 | 127.826 |

![Screenshot 2024-01-18 at 2.12.31 PM](https://hackmd.io/_uploads/Bk1rUZDta.png)
2. Proportion of missing data per individual
The proportion of missing data per individual is large (0.35 - 0.96).

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.3576 | 0.4575 | 0.7652 | 0.6899 | 0.8497 | 0.9611 |

![Screenshot 2024-01-18 at 2.13.17 PM](https://hackmd.io/_uploads/SyAPIbPt6.png)
3. Heterozygosity and inbreeding coefficient per individual
Range of -0.17 to 0.39.

| Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| -0.17667 | 0.03222 | 0.06950 | 0.08637 | 0.12000 | 0.39499 |

![Screenshot 2024-01-18 at 2.14.02 PM](https://hackmd.io/_uploads/S1oqU-vK6.png)

Based on these results, here are some settings that may be useful, although I have concerns about the quality and missingness metrics:
`MAF=0.15` `MISS=0.75` `MIN_DEPTH=10` `MAX_DEPTH=50`

# Applying VCF filters

Operating in the reference_stacks directory. First attempt: input is the populations.snps.vcf file from the populations run (this has all individuals/all populations), output as populations_stackstry1.ouput.vcf, using the "default parameters" from Dylan's notes and from the analysis above.

First load the vcftools module: `module load vcftools/0.1.17`

Apply the filters....
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.8 --recode --out populations_stackstry1.ouput.vcf
```
```
vcftools --vcf r80_8_filtered.vcf --maf 0.05 --recode --out r80_8_filtered_maf.vcf
```
```
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf populations.snps.vcf
	--mac 3
	--max-alleles 2
	--maxDP 50
	--max-meanDP 50
	--min-alleles 2
	--minDP 5
	--min-meanDP 5
	--max-missing 0.8
	--out populations_stackstry1.ouput.vcf
	--recode
	--remove-indels

After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 23789 Sites
No data left for analysis!
Run Time = 1.00 seconds
```
0 sites kept...not great!
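To see which filter is doing the damage, each site-level filter can be applied on its own and the kept-site counts compared (a rough sketch; genotype-level flags like `--minDP` only remove sites in combination with `--max-missing`, so they are left out here):
```
# run each filter alone against the raw VCF and report how many sites survive
for f in "--min-meanDP 5" "--max-meanDP 50" "--mac 3" "--max-missing 0.8"; do
  echo "== $f =="
  vcftools --vcf populations.snps.vcf $f --remove-indels 2>&1 | grep "kept"
done
```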
Trying again, but changing --max-missing to 0.5 (and --minDP to 10):
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out populations_stackstry2.ouput.vcf
```
Output:
```
(base) [ewf7555@quser32 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out populations_stackstry2.ouput.vcf

VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf populations.snps.vcf
	--mac 3
	--max-alleles 2
	--maxDP 50
	--max-meanDP 50
	--min-alleles 2
	--minDP 10
	--min-meanDP 5
	--max-missing 0.5
	--out populations_stackstry2.ouput.vcf
	--recode
	--remove-indels

After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 1.00 seconds
```

Trying again with 0.49:
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.49 --recode --out populations_stackstry3.ouput.vcf

VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf populations.snps.vcf
	--mac 3
	--max-alleles 2
	--maxDP 50
	--max-meanDP 50
	--min-alleles 2
	--minDP 10
	--min-meanDP 5
	--max-missing 0.49
	--out populations_stackstry3.ouput.vcf
	--recode
	--remove-indels

After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 0.00 seconds
```

This is the same output; I don't think 145 sites is good enough. Some populations have a lot of missing data...particularly GM and MW.

Remove GM and MW? We would have to re-run gstacks and populations to do this (?). Maybe not...

Changed --max-missing to 0.6 and --minDP to 5, adding --maf 0.2:
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.6 --maf 0.2 --recode --out populations_stackstry3.ouput.vcf

Parameters as interpreted:
	--vcf populations.snps.vcf
	--mac 3
	--maf 0.2
	--max-alleles 2
	--maxDP 50
	--max-meanDP 50
	--min-alleles 2
	--minDP 5
	--min-meanDP 5
	--max-missing 0.6
	--out populations_stackstry3.ouput.vcf
	--recode
	--remove-indels

After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 13 out of a possible 23789 Sites
```

# Troubleshooting

## Notes on removing "bad apples"

Primarily from Cerca et al. 2021, "Removing the bad apples: A simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms", https://doi.org/10.1111/2041-210X.13562

Steps they used:
1. Run STACKS with the complete dataset (all individuals) - this is called the "unclean" dataset
2. Run STACKS for each population individually, then use vcftools to investigate missingness (`--missing-indv`) and depth of coverage (`--depth`) - see the sketch after this list
3. With the whole dataset in mind, they labelled individuals as keep or remove (bad apples), following a general strategy which included: (a) retaining a minimum of two individuals per population or species; (b) designating a threshold for missing data, based on the average missing data for each whole dataset (≥40% missing data for Stygocapitella, ≥30% for Euhadra, ≥65% for Dendrilla, ≥40% for Anthochaera)
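A sketch of step 2 for this project's populations (the per-population VCF names here are hypothetical; the codes are the populations in this dataset):
```
# per-population missingness and depth checks, per Cerca et al. step 2
for pop in DP GM HP IB MW PS SP SR; do
  vcftools --vcf ${pop}.snps.vcf --missing-indv --out ${pop}  # writes ${pop}.imiss
  vcftools --vcf ${pop}.snps.vcf --depth --out ${pop}         # writes ${pop}.idepth
done
```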
## Zoe's code for Clarkia data

This is after running denovo_map.pl (no filtering):
```
#### Quality filtering for CB. /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter

vcftools --vcf populations.snps.vcf --missing-indv
# It looks like one individual, CB-58, failed, and CB-73 and CB-77 did not do well. Remove them to keep 32 individuals

# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset

#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1
# kept 3568

#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4
# kept 2492

#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset
#kept 458
```

So, trying this approach with my data, starting with the populations.snps.vcf file. Checking CACO data for missing individuals, doing this filtering in a new directory:

`mkdir zfilter`

`vcftools --vcf populations.snps.vcf --missing-indv`

Inspecting the output file out.imiss, it looks like there are quite a few individuals with greater than 50% missing data... Making a list of these individuals:
```
awk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv

##Output file - 32 individuals have greater than 50% missing data
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1506B.sort
IB1519.sort
IB1522B.sort
IB1524B.sort
IB1528B.sort
IB2687.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
PS608B.sort
PS613A.sort
PS614A.sort
PS615A.sort
SP772B.sort
SP790.sort
SP796B.sort
SR908B.sort
SR912.sort
SR914.sort
SR922.sort
```

Maybe see how many have 80% missing data?
```
awk '$5 > 0.8' out.imiss | cut -f1 > verylowDP.indv

#looks like 17 individuals had >80% missing
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1522B.sort
IB1524B.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```

90%??
```
awk '$5 > 0.9' out.imiss | cut -f1 > superlowDP.indv

#Output: 8 individuals
INDV
GM544.sort
GM556.sort
GM559.sort
GM564.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```

Looks like GM and MW are not great in general. Why would entire populations not sequence well?? I guess we will remove them and continue on for now.
```
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove superlowDP.indv --recode --recode-INFO-all --out subset

Excluding individuals in 'exclude' list
After filtering, kept 37 out of 45 Individuals
Outputting VCF file...
After filtering, kept 23789 out of a possible 23789 Sites
```
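Rather than re-running the awk filter by hand for each cutoff, a quick sweep (a sketch, using the out.imiss produced above; column 5 is F_MISS) shows how many individuals each threshold would flag:
```
# count individuals above each missingness threshold
for t in 0.5 0.6 0.7 0.8 0.9; do
  n=$(awk -v t=$t 'NR > 1 && $5 > t' out.imiss | wc -l)
  echo "F_MISS > $t: $n individuals"
done
```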
**Mean depth values**

Mean depth of 15:
```
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1
# kept 5077 sites
```

Mean depth of 5:
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --recode --recode-INFO-all --out fil7

VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf subset.recode.vcf
	--recode-INFO-all
	--min-meanDP 5
	--out fil7
	--recode

After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 12512 out of a possible 23789 Sites
Run Time = 1.00 seconds
```

**Now on to the missing data...**

Missing data at 10%:
```
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4
# 577 out of a possible 23789 Sites
```

Missing data at 10% but mean depth of 5:
```
vcftools --vcf subset.recode.vcf --min-meanDP 5 --max-missing 0.90 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset
```

**Final filtering**

Try #1 with Zoe's parameters:
```
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset
#kept 424 (Zoe's kept 458)
```

Jeremie's idea: min mean depth of 5, maf of 0.15, hwe 0.05, no max-missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset

VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf subset.recode.vcf
	--recode-INFO-all
	--maf 0.15
	--max-alleles 2
	--hwe 0.05
	--min-meanDP 5
	--out filtered.final2.subset
	--recode

After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 5151 out of a possible 23789 Sites
Run Time = 1.00 seconds
```

Jeremie's idea: min mean depth of 5, maf of 0.15, no hwe, no max-missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --recode --recode-INFO-all --out filtered.final3.subset

VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf subset.recode.vcf
	--recode-INFO-all
	--maf 0.15
	--min-meanDP 5
	--out filtered.final3.subset
	--recode

After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 6616 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
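To compare these runs side by side, the kept-site counts can be pulled straight from the recoded files (a sketch; it just counts non-header lines):
```
# tally the sites kept by each filtering run
for f in filtered.final.subset filtered.final2.subset filtered.final3.subset; do
  printf "%s:\t" "$f"
  grep -vc '^#' ${f}.recode.vcf
done
```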
**Jeremie's thoughts!**

Make a dataset that is in Hardy-Weinberg equilibrium and a set that is not; if the linkage is evenly distributed, then we don't have to filter for linkage.

**Checked to see what the R2 is**
```
vcftools --vcf filtered.final.subset.recode.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
Looking at linkage...there are a lot of pairs/loci that are linked (with an R2 above 0.8, going up to 1).

Following https://speciationgenomics.github.io/pca/

Checking output with plink and pruning for linkage:
```
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr \
  --set-missing-var-ids @:# \
  --indep-pairwise 50 10 0.1 --out plink_zfilter

#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: plink_zfilter-temporary.bed + plink_zfilter-temporary.bim + plink_zfilter-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink_zfilter.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964686.
424 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Pruned 15 variants from chromosome 27, leaving 10.
Pruned 31 variants from chromosome 28, leaving 15.
Pruned 7 variants from chromosome 29, leaving 8.
Pruned 16 variants from chromosome 30, leaving 12.
Pruned 0 variants from chromosome 31, leaving 1.
Pruned 31 variants from chromosome 32, leaving 13.
Pruned 25 variants from chromosome 33, leaving 10.
Pruned 23 variants from chromosome 34, leaving 10.
Pruned 33 variants from chromosome 35, leaving 13.
Pruned 17 variants from chromosome 36, leaving 8.
Pruned 37 variants from chromosome 37, leaving 10.
Pruned 21 variants from chromosome 38, leaving 10.
Pruned 33 variants from chromosome 39, leaving 14.
Pruned 0 variants from chromosome 40, leaving 1.
Pruning complete. 289 of 424 variants removed.
Marker lists written to plink_zfilter.prune.in and plink_zfilter.prune.out .
Segmentation fault
```

```
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_zfilter.prune.in --make-bed --pca --out CACO_zfil

#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: CACO_zfil-temporary.bed + CACO_zfil-temporary.bim + CACO_zfil-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to CACO_zfil.nosex .
--extract: 135 variants remaining.
Using up to 51 threads (change this with --threads).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964765.
135 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Relationship matrix calculation complete.
--pca: Results saved to CACO_zfil.eigenval and CACO_zfil.eigenvec .
--make-bed to CACO_zfil.bed + CACO_zfil.bim + CACO_zfil.fam ... done.
Segmentation fault
```

Going to R to see what we have here...
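As a quick check before R, the proportion of variance explained by each PC can be read straight from the eigenvalues (a sketch, assuming the CACO_zfil.eigenval written above, one eigenvalue per line):
```
# percent variance explained per principal component
awk '{ sum += $1; ev[NR] = $1 } END { for (i = 1; i <= NR; i++) printf "PC%d\t%.1f%%\n", i, 100 * ev[i] / sum }' CACO_zfil.eigenval
```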
---

```
### Filtering for SNP panel. Using the 458 SNPs called - maf 17%, in HWE, 10% missing data, and mean depth 15X. /data/clarkia/sv/cb/snp.panel

#### Now have to filter the original fasta file to get the filtered fasta sequences. /data/Dimensions/clarkia/sv/cb/snp.panel/fasta.filter
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter

cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one #only keep the first column of the vcf file, which has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers #remove all the header rows of the vcf file
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep #add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa #sort the original fasta to only keep the high quality ones

#### Now we need to sort the fasta file to get SNPs with at least 50 bp on either side of the SNP. First get the length of characters in each line of the file, then combine them together.
awk '{ print length }' filtered.fasta.fa > count # get the length of each line in the fasta file
sed -n '1~2!p' count > length #keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus #paste the locus ID and the length of that locus

#### Now get the position of each SNP
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file, which has the high quality SNPs
sed '1,15d' keep.col.two > positions #remove headers of the vcf file

#### Combine the locus ID, length and position
paste list.to.keep length positions | column -s $'\t' -t > length.position #combine and have them formatted into tabulated columns

#### Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position), so that you only keep the loci with a difference greater than 50, meaning that there are at least 50 bp after the SNP. We have 346 loci.
awk '(NR>0) && ($3 > 50 ) ' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 ) ' difference > right.left.50

#### Filter the original fasta file to only keep these loci
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa

### Now we must filter to remove SNPs that are not the same between replicates. Need to filter the vcf file to include just the 346 filtered loci.
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers

#### Copy the text below into a new file called vcfkeeping. These are the column names that we want.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CB-110 CB-110.2 CB-24 CB-24.2 CB-65 CB-65.2 CB-96 CB-96.2

#### Filter the full vcf file without headers to only keep the columns we want.
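# (a note on the one-liner below: head -1 takes the header row, tr splits it into
#  one field name per line, nl numbers those lines, grep -Fwf vcfkeeping keeps only
#  the field numbers whose names appear in vcfkeeping, and paste -sd, joins the
#  numbers into the comma-separated list that cut -f expects)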
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf #### Keep only those 346 filtered loci grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf #for some reason this is keeping ten extra loci #so try adding prefix to chromosome number to first column in each file sed -e 's/^/chr/' filtered.chrom > filtered.chr sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf #try again grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf #now remove the chr string awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf #copy and paste header from previous file into new file ( sed 1q rm.ind.chr.vcf; sed 1d replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf #### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310. populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap --genepop ##### I compared the loci that aren't the same in replicates in excel after uploading the genepop file. Note that you will loose some loci because if they aren't variable in the replicates, they won't be included in the genepop file. We of the starting 348 loci have 310 variable loci in the replicates to compare. The ID in the genepop file match up with the ids in the vcf file that you used to generate the genepop file. Make sure to use the GenePop ID from the replicate genepop file since this has ONLY the variable loci that you can compare. I then made a list of those to exclude. Now we have 202 SNPs. You will need to get the fasta loci ID from the genepop ID using the vcf file (the genepop file is ordered 1...x) ### Final filtering. /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter. Using the loci that we are keeping. Make a list.to.keep with their locus ids then filter the original fasta file. grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa #sort the original fasta to only keep the high quality ones #### Now filter the vcf file. cat list.to.keep | cut -d"_" -f2 > snps.keep #get only the chromosome number sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr' sed -e 's/^/chr/' snps.keep > snps.keep.chr #add 'chr' grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string #copy and paste header from previous file into new file ( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf #### Filter out linked loci. /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter ## Had to reformat the vcf file to include a header bgzip -c filtered.vcf > filtered2.vcf.gz tabix -p vcf filtered2.vcf.gz /usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf #can't get this to work ## Checked to see what the R2 vcftools --vcf filtered.vcf --plink --out filtered.plink plink --file filtered.plink --r2 --noweb #### There are two pairs or loci that are linked (i.e. have an R2 above 80). Remove them and filter the vcf file. 
We have 200 SNPs.
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste the header from the previous file into the new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf

#### Run populations. Upload this genepop file into R to look at heterozygosity of each locus at each population. Decided to keep all 310 loci for now. We can use the final.fasta.fa file to begin developing the CB SNP panel.
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop

#### Only keeping a subset of loci that are highly variable across all three populations. Filter the final vcf file to only get those highly variable SNPs and remove replicates. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter

##### Run populations again to run the genepop file in R
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf

##### Filter the final vcf using the 111 most variable SNPs. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste the header from the previous file into the new file
( sed 1q filtered.vcf; sed 1d filtered ) > filtered2 && mv filtered2 final.vcf

##### Run populations again
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf

#### Get SNP info for all final SNPs and make a master spreadsheet with all SNP info.
vcftools --vcf final.p.snps.vcf --missing-indv
#Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".
vcftools --vcf final.p.snps.vcf --missing-site
#Generates a file containing the mean depth per individual. This file has the suffix ".idepth".
vcftools --vcf final.p.snps.vcf --depth
#Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".
vcftools --vcf final.p.snps.vcf --site-mean-depth

##### Format for submission to sequencing facility
###### Filter fasta to get those we want
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # Make list of loci we want to keep w/ correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter original fasta

###### Re-formatting the fasta file so that the locus name is in the first column and the sequence is in the second.
paste -d " " - - < final.fasta.fa > final.fasta2

###### Now doing some re-formatting for the submission form to Minnesota. Need to get the SNP with the major and minor alleles into the fasta file. Make a file with the sequence in the first column, position in the second column, and [allele1/allele2] in the third column. Then run the code below (had to get help on this on Stack Exchange).
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp > submission.txt
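# For example, given the tab-separated line "ACGTACGT<TAB>4<TAB>[T/C]", the
# one-liner above prints "ACG[T/C]ACGT": the base at the given position is
# swapped for the bracketed major/minor alleles (values here are illustrative)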
```

**Bedtools intersect**

`bedtools intersect -a <file> -b <file>`
- `-a` = BAM/BED/GFF/VCF file "A". Each feature in A is compared to B in search of overlaps. Use "stdin" if passing A with a UNIX pipe.
- `-b` = One or more BAM/BED/GFF/VCF file(s) "B". Use "stdin" if passing B with a UNIX pipe.

Use bedtools slop to add a buffer of a given size around each SNP:
```
bedtools slop -i input.vcf -g genome_file.txt -b 1000 | bedtools intersect -a - -b annotations.gff -wa -wb > intersect_1000_output.txt
```

**First we need to create a genome .txt file**
```
samtools faidx /projects/p32037/PASM/genome/ncbi_dataset/data/GCA_907164705.1/GCA_907164705.1_Parnassius_apollo_genomic.fna
cut -f1,2 /projects/p32037/PASM/genome/ncbi_dataset/data/GCA_907164705.1/GCA_907164705.1_Parnassius_apollo_genomic.fna.fai > genome_file.txt
```

**Bash script: "snp_annotation.sh"**
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=10GB
#SBATCH --job-name=PASM_intersect_1000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=PASM_intersect_1000.error
#SBATCH --output=PASM_intersect_1000.out

# Input: the genome file created above
GENOME_FILE=/projects/p32037/PASM/genome_file.txt

# Convert VCF to BED format
awk 'BEGIN {FS="\t"; OFS="\t"} !/^#/ {print $1, $2-1, $2, $3}' lfmm_filtered_vcf.vcf > temp_input.bed

# Expand SNP regions by 10,000 bp and intersect with annotations
bedtools slop -i temp_input.bed -g "$GENOME_FILE" -b 10000 | \
bedtools intersect -a - -b /projects/p32037/PASM/genome/ncbi_dataset/data/GCA_907164705.1/genomic.gff -wao > intersect_output_10000.tsv

# Remove the temporary BED file
rm temp_input.bed

# Confirmation message
echo "Intersection completed. Results saved to intersect_output_10000.tsv."
```
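A possible follow-up for summarizing the `-wao` output (a sketch, assuming the output columns are the 4 BED fields followed by the 9 GFF fields, so the GFF feature type lands in column 7 and the attributes in column 13):
```
# keep only gene features and report the SNP ID with the overlapping gene's attributes
awk -F'\t' '$7 == "gene"' intersect_output_10000.tsv | cut -f4,13 | sort -u > snp_gene_hits.tsv
```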