SNP Outlier Analysis

--- title: 'SNP Outlier Analysis' disqus: hackmd --- SNP Outlier Analysis === By Tanya Lama ## Goal We are looking for hallmarks of local adaptation by searching for outliers among our mLynCan4_v1.p SNPset (generated via GATK, and filtered and LD-pruned further using R (see HackMD or ask Tanya for these scripts)). ## Objectives 1. Estimate locus-by-locus FST and heterozygosity at each SNP ~53k SNP dataset "mLynCan4_v1.p.vcf.gz" using PLINK 2. Use the ProcessFST_STACKS.R script to evaluate FST outliers based on number of standard deviations from the mean 3. Plot FST versus alignment position color coded by reference genome chromosome. 4. Confirm outlier detection using the Beaumont and Nichols method, also used by Moen et al. 5. Confirm dually using the BayeScan method 6. Annotate outlier SNPs using --annotate via PLINK and SNPEff (R) ## Table of Contents [TOC] # Methods Estimating FST using PLINK --- 1. Request an interactive job bsub -q interactive -R rusage[mem=24000] -n 1 -W 30 -Is bash 2. Load dependencies module load anaconda2/4.4.0 module load plink/1.90b6.9 module load htslib/1.9 3. Estimating locus-by-locus FST in PLINK See: https://www.cog-genomics.org/plink/1.9/basic_stats#fst Tips on --within from: https://www.biostars.org/p/258167/ 4. Provide a set of subpopulations defined in assigned_mLynCan4v1p_filtered_reduced.clst The format should be THREE colums ``` <FID> <IID> <GROUP> name name north ``` In our situation, without relevant "FID" family-ID information, we will list individual ID (IID) twice, and direct PLINK to use the --double-id flag. 5. FST estimates for each variant Given a set of subpopulations defined via --within assigned_mLynCan4v1p_filtered_reduced.clst, --fst writes FST estimates for each autosomal diploid variant (computed using the method introduced in Weir BS, Cockerham CC (1984. Estimated F-statistics each variant are saved to objectname.fst, and reports raw and weighted global means to objectname.log This is a basic port of VCFtools --weir-fst-pop. The VCFtools implementation also provides windowed modes, which we have not ported (--recode vcf may be handy there, but we did not use it). 6. Usage is plink --vcf mLynCan4v1p_filtered_reduced.vcf.gz --within assigned_mLynCan4v1p_filtered_reduced.clst --double-id --fst --make-bed --out fst_mLynCan4v1p_filtered_reduced --allow-extra-chr Note: we had to include the --allow-extra-chr flag at the end Output is in /project/uma_lisa_komoroske/Tanya/outlier 7. Next Step the output we will be using in our ProcessFST_STACKS.R script is /outlier/fst_mLynCan4v1p_filtered_reduced.fst Evaluate outliers in R ([see Rmarkdown here](https://rpubs.com/tlama/outlier)) --- The ProcessFST.R script takes the output file fst_mLynCan4v1p_filtered_reduced.fst from PLINK (/outlier folder) and evaluates outliers based on the number of standard deviations from the mean. 1. Download the fst_mLynCan4v1p_filtered_reduced.fst from the cluster or GIT, we can run this locally on RStudio. 2. Data check Dimensions: N rows should equal the number of SNPs in our input dataset. This is true. Here we have 53K rows, one for each SNP. dim(fdat) 53265 5 Position: Most of our SNPs are in unique positions. This makes sense because we pruned clusters of SNPs >3 within 10bp of any given region, and the input SNP set was thouroughly filtered and pruned (should be in relative linkage equilibrium) length(unique(fdat$POS)) 53258 3. Plot the distribution of FST values ![](https://i.imgur.com/zEz6qbr.png) Here we can see that most of the FST values lie around 0, and some are slightly negative. Question: Shouldn't all FST values be between 0 and 1? Apparently, this is not un-common. Negative Fst values should be effectively interpreted as **zero** values. A zero value for Fst means that there is **no genetic differentiation between the populations** considered at that locus. 4. Set cutoff outside 3 standard deviations of the mean. Auteri et al. had this set at cutoff 9 but that seems extreme, and for us was >2.0 (FST limits are 0,1) so we settled for 3 sd of the mean as our cuttoff, and found that there are a high proportion of SNPs that either have gone to fixation (FST=1) OR are outliers (>3SD of the mean) 5. Plotted using ggplot ![](https://i.imgur.com/Zn01Xi3.png) We can see that MANY of these SNPs have gone to fixation (FST=1), or rather, complete differentiation between populations of study. 6. We saved all 1169 candidate "outlier" SNPs in an object We also plotted them here: ![](https://i.imgur.com/SOKallg.png) Pull outlier loci from our 6V.vcf (incl. ALL INFO columns) --- 1. Return to the terminal /outlier folder 2. Request an interactive session bsub -q interactive -R rusage[mem=24000] -n 1 -W 120 -Is bash 3. Load Dependencies module load anaconda2/4.4.0 module load bcftools/1.9 4. Open the SelectLoci.csv generated by our ProcessFST.R script. You will need the CHR and POS columns 5. In /project/uma_lisa_komoroske/Tanya/scripts/outlier use the nano command to open a new file copy/paste the CHR and POS columns from SelectedLoci.csv. save as positions_outlier_loci.txt 6. Use bcftools to create a new VCF called outlier_loci.vcf.gz, which selects the positions from /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p_lynxonly/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_all_mLynCan4_v1.p_HighQualSites_processed.vcf.gz to a new VCF called "outlier_loci". bcftools view -T positions_outlier_loci.txt /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p_lynxonly/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_all_mLynCan4_v1.p_HighQualSites_processed.vcf.gz > outlier_loci.vcf.gz 7. Confirm this new VCF has all the expected INFO columns, and is limited to the 1,169 candidate outliers we identified in R. The last SNP in the VCF should read NW_022059738.1 3530 --correct 9. We will proceed to annotating this outlier_loci.vcf.gz Annotate SNPs using SNPEff --- 1. Load dependencies module load anaconda2/4.4.0 #module load snpEff_snpSift/4.3T module load java/1.8.0_171 1a. Enter an interactive session as above 2. **Confirm** whether the mLynCan4_v1p genome is loaded as a database in SnpEff (highly unlikely) $ java -jar snpEff.jar databases 3. No? We will have to [Build a snpEff database](https://www.biostars.org/p/50963/) then In order to build our own database we had to locally install snpEff in the project space /bin The reference genome we used for all of our Canada lynx work is the RefSeq assembly accession: GCF_007474595.1 (latest) on NCBI [here](https://www.ncbi.nlm.nih.gov/assembly/GCF_007474595.1/) Therefore this is the assembly that needs to be input to snpEff as a new database so that the CHR names and positions will match up with our VCF 1. cd /project/uma_lisa_komoroske/bin/snpEff 2. mkdir data/mLynCan4 3. wget the exact .fna.gz and .gtf.gz for RefSeq GCF_007474595.1 4. unzip both files 5. mv gtf.gz to genes.gtf (unzipped) 6. mv fa.gz to sequences.fa (unzipped) 7. edit the snpEff.config file with the following text. location is somewhat irrelevant. I inserted this text after the "Mouse" genomes [# Lynx genome, version mLynCan4 mLynCan4.genome : Lynx] 8. cd /snpEff and run the following command "java -jar snpEff.jar build -gtf22 -v mLynCan4" 9. Review the updates: Genome name: 'Lynx' Genome version: 'mLynCan4' Genes: 29,216 Protein coding genes 19,761 N chromosomes: 66 Chromosomes (these names match our VCF): 'NC_044303.1' 'NC_044310.1' 'NC_044306.1' 'NC_044304.1' 'NC_044311.1' 'NC_044307.1' 'NC_044308.1' 'NC_044309.1' 'NC_044305.1' 'NC_044321.1' 'NC_044312.1' 'NC_044314.1' 'NC_044315.1' 'NC_044313.1' 'NC_044320.1' 'NC_044319.1' 'NC_044317.1' 'NC_044316.1' 'NC_044318.1' 'NW_022059695.1' 'NW_022059696.1' 'NW_022059694.1' 'NW_022059697.1' 'NW_022059698.1' 'NW_022059699.1' 'NW_022059700.1' 'NW_022059692.1' 'NW_022059701.1' 'NW_022059702.1' 'NW_022059703.1' 'NW_022059704.1' 'NW_022059706.1' 'NW_022059705.1' 'NW_022059693.1' 'NW_022059707.1' 'NW_022059708.1' 'NW_022059709.1' 'NW_022059710.1' 'NW_022059711.1' 'NW_022059712.1' 'NW_022059713.1' 'NW_022059714.1' 'NW_022059715.1' 'NW_022059716.1' 'NW_022059717.1' 'NW_022059718.1' 'NW_022059719.1' 'NW_022059720.1' 'NW_022059721.1' 'NW_022059722.1' 'NW_022059723.1' 'NW_022059725.1' 'NW_022059724.1' 'NW_022059726.1' 'NW_022059727.1' 'NW_022059728.1' 'NW_022059729.1' 'NW_022059730.1' 'NW_022059731.1' 'NW_022059732.1' 'NW_022059733.1' 'NW_022059734.1' 'NW_022059735.1' 'NW_022059736.1' 'NW_022059737.1' 'NW_022059738.1' 4. Use our mLynCan4 database to annotate outlier_loci.vcf. Note: the outlier_loci.vcf we have created is NOT gzipped. snpEff will not work if we include the .gz suffix. ``` java -Xmx20G -jar /project/uma_lisa_komoroske/bin/snpEff/snpEff.jar mLynCan4 ./copy_outlier_loci.vcf > copy_outlier_loci.ann.vcf ``` 7. Optional: Dump the contents of a database to a bed file java -Xmx4g -jar snpEff.jar dump -v -bed mLynCan4 > mLynCan4.bed 8. Optional: Same, but a tab separated TXT file (that can be loaded into R). java -Xmx4g -jar snpEff.jar dump -v -txt mLynCan4 > mLynCan4.txt Post-Processing snpEff Annotations using snpSift --- We'll be using the snpSift function to pull relevant information out the messy ANN/INFO column. This will help us find what variants are most interesting and relevant to future inquiry. 1. java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields outlier_loci.ann.vcf "ANN[*].GENE:" | awk -F"|" '{print $4}' 2. A list of genes is output here: We will save this as a CSV that we can then annotate manuall with information from NCBI as outlier_genes_snpSift.xls Example: RPL21-GTF3A LOC115510674-LOC115525805 DCLK1 CCNA1-SERTM1 3. Let's pull a bit more information (CHR, POS, REF/ALT) java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields outlier_loci.ann.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$6,$7,$8,$10,$12}' OFS="\t" > snpsift_genes_outlier_loci.txt 4. Download snpsift_genes_outlier_loci.txt locally 5. Annotate manually (look up genes) I really want to pull the part that says "gene_variant" java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields outlier_loci.ann.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$6,$8}' OFS="\t" > snpsift_genes_outlier_loci.txt 6. Just to be SURE we aren't missing anything, I want to annotate the FULL (unpruned) SNPset and look at the genes there, see why they weren't included in the FST outlier test (maybe they didn't meet the cutoff) e.g. clock genes? That was fun. We learned a few things. Located all of the SNPs in the CLOCK gene and their locations. A handful are upstream or downstream variants. Mostly intron "MODIFIER"s, sadly none were "HIGH" or "MODERATE". One more place to look: let's run annotate and snpSift on our VCF that includes bobcat AND lynx (not just bobcat) to see if we turn up any extra outliers there... Detour into functional gene sets --- We know there are gene-variant SNPs ~10 of them at least in a number of circadian rhythm (e.g. CLOCK) and thermal tolerance genes, because I've seen them myself on IGV. I want to know if they might have dropped out during LD pruning from our mLynCan4_v1p SNPset and thus from the outlier_loci list. Let's compile them carefully from the 6SV.vcf (filtered, but before LD pruning and cluster) 1. Let's annotate the raw VCF before (6SV.vcf) and after (mLynCan4_filtered_reduced) pruning, and see where they end up **6SV**: java -Xmx20G -jar /project/uma_lisa_komoroske/bin/snpEff/snpEff.jar mLynCan4 /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p_lynxonly/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_all_mLynCan4_v1.p_HighQualSites_processed.vcf.gz > 6SV.ann.vcf **mLynCan4v1p**: java -Xmx20G -jar /project/uma_lisa_komoroske/bin/snpEff/snpEff.jar mLynCan4 ./mLynCan4v1p_filtered_reduced.vcf.gz > mLynCan4v1p_filtered_reduced.ann.vcf **mLynCan4 (bobcat and hybrid)** java -Xmx20G -jar /project/uma_lisa_komoroske/bin/snpEff/snpEff.jar mLynCan4 /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_all/allsamples_nodups.vcf.gz > lynx_bob_hybrid.ann.vcf 2. snpSift mLynCan4v1p_filtered_reduced.ann.vcf java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields mLynCan4v1p_filtered_reduced.ann.vcf CHROM POS REF ALT "ANN[*].GENE:" > snpsift_genes_mLynCan4v1p_filtered_reduced.txt 3. snpSift 6SV.ann.vcf java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields 6SV.ann.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $2,$6,$7,$8,$10,$12}' OFS="\t" > snpsift_genes_6SV.txt 4. Finding We've noted that a large number of SNPs in genes/gene families of interest have been dropped from mLynCan4_v1p and thus from out outliers list. This is likely because loci are dropped due to clustering (>3SNPs within 10bp) when pruning for linkage-diseq. In order to investigate these "lost" loci, we are going to pull a list or ~57 genes of interested related to circadian rhythm and thermal tolerance. 5. snpSift "gene list" from 6SV.ann.vcf java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar filter "(ANN[*].GENE = 'ARNTL') | (ANN[*].GENE = 'AANAT')| (ANN[*].GENE = 'ARNTL2')| (ANN[*].GENE = 'CLOCK')| (ANN[*].GENE = 'CRY1')| (ANN[*].GENE = 'CRY2')| (ANN[*].GENE = 'CSNK1A1')| (ANN[*].GENE = 'CSNK1D')| (ANN[*].GENE = 'MTNR1A')| (ANN[*].GENE = 'MTNR1B')| (ANN[*].GENE = 'NR1D1')| (ANN[*].GENE = 'NR1D2')| (ANN[*].GENE = 'PER1')| (ANN[*].GENE = 'PER2')| (ANN[*].GENE = 'PER3')| (ANN[*].GENE = 'RORA')| (ANN[*].GENE = 'RORB')| (ANN[*].GENE = 'RORC')| (ANN[*].GENE = 'RXRB')| (ANN[*].GENE = 'TIMELESS')| (ANN[*].GENE = 'TIPIN')| (ANN[*].GENE = 'ATG14')| (ANN[*].GENE = 'FOXOS')| (ANN[*].GENE = 'HLF')| (ANN[*].GENE = 'NPAS2')| (ANN[*].GENE = 'NR1D1')| (ANN[*].GENE = 'NFIL3')| (ANN[*].GENE = 'NPAS3')| (ANN[*].GENE = 'HSPD1')| (ANN[*].GENE = 'HSPA8')| (ANN[*].GENE = 'HSF1')| (ANN[*].GENE = 'HSPA6')| (ANN[*].GENE = 'HSP')| (ANN[*].GENE = 'HSF1')| (ANN[*].GENE = 'HIF')| (ANN[*].GENE = 'HSPA6')| (ANN[*].GENE = 'HSP')| (ANN[*].GENE = 'HSF1')| (ANN[*].GENE = 'HIF')| (ANN[*].GENE = 'HSP70')| (ANN[*].GENE = 'SPDEF')| (ANN[*].GENE = 'HBM')| (ANN[*].GENE = 'MZT2B')| (ANN[*].GENE = 'ADIF')| (ANN[*].GENE = 'SELM')| (ANN[*].GENE = 'HBP1')| (ANN[*].GENE = 'MYADM')| (ANN[*].GENE = 'PTPRC')| (ANN[*].GENE = 'HSPH1')| (ANN[*].GENE = 'HSPA1A')| (ANN[*].GENE = 'DUSP7')| (ANN[*].GENE = 'PPM1A')| (ANN[*].GENE = 'GRB2')| (ANN[*].GENE = 'PRKCB')| (ANN[*].GENE = 'JUN')| (ANN[*].GENE = 'CRK')| (ANN[*].GENE = 'HSP90AA')" 6SV.ann.vcf > snpsift_circ_therm_genelist_6SV.txt #use genelist to make VCF for LFMM 7. Calculate FST at each position in snpsift_circ_therm_genelist_6SV.txt plink --vcf snpsift_circ_therm_genelist_6SV.vcf --within assigned_genelist_6SV.clst --double-id --fst --make-bed --out fst_circ_therm_genelist_6SV --allow-extra-chr We identified 1650 SNPs within a 57-gene list 8. Re-run ProcessFST outlier test, just for SNPs in our 54-gene geneset. [See details on this analysis on my Rpubs here](https://rpubs.com/tlama/outlier_circ_therm) In Summary, we identified 88 outliers ![](https://i.imgur.com/c7NoQc4.png) 9. Pull positions of 88 outliers within circ_therm_genelist bcftools view -T positions_circ_therm_genelist_6SV_outliers.txt /project/uma_lisa_komoroske/Tanya/Rgenomics/mLynCan4_v1.p_lynxonly/6_SV_rmClusterSNP_BiSNP_SV_HardFilter_SV_all_mLynCan4_v1.p_HighQualSites_processed.vcf.gz > outlier_loci_circ_therm_genelist_6SV.vcf.gz 10. Use snpEff and our mLynCan4 database to annotate outlier_loci_circ_therm_genelist_6SV.vcf.gz. Note: the outlier_loci.vcf we have created is NOT gzipped. snpEff will not work if we include the .gz suffix. java -Xmx20G -jar /project/uma_lisa_komoroske/bin/snpEff/snpEff.jar mLynCan4 ./outlier_loci_circ_therm_genelist_6SV.vcf > outlier_loci_circ_therm_genelist_6SV.ann.vcf 11. Use SnpSift to pull annotation info java -jar /project/uma_lisa_komoroske/bin/snpEff/SnpSift.jar extractFields outlier_loci_circ_therm_genelist_6SV.ann.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$6,$7, $8, $10, $12}' OFS="\t" > snpsift_genes_circ_therm_genelist_outliers.txt 12. Saved as SelectedLoci_circ_therm_genelist_6SV.csv in the project_canada_lynx folder. ## Summary: 88 outlier SNPs, but it doesn't look like any of them are "high" effect -- mostly modifiers. I've seen very few "high" effect SNPs in the set generally speaking. Still, interesting that there were so many outliers (88) in such a small set of genes. Calculate transition vs transversion ratio --- For each sample java -jar SnpSift.jar tstv hom s.vcf Discard: STACKS *populations* method --- Use the populations function of STACKS to analyze a population of individual samples. We will compute a number of population genentics stats and export a variety of standard outputs. A popmap specifying the individual and population is submitted to the program. At each nucleotide position, we will estimate * expected/observed het * π * FIS The populations program will compare all populations pairwise to compute FST. Given our data is reference aligned, a kernel-smoothed FST will also be calculated. The populations program will also compute haplotype-based population genetics statistics including: * haplotype diversity * ΦST * FST’ We tried to use the STACKS populations module to generate the locus-by-locus ΦST necessary for our first (of three) outlier approaches. **Lesson learned**: STACKS struggles with input VCFs from other programs including GATK (our upstream pipeline) and ipyrad. This is unfortunate! But ΦST is fairly simple to estimate, so we'll find another way of doing so. Regardless, here is some brief documentation of what we ran: 1. See the bottom of https://hackmd.io/@tlama/NGSPipeline where we used the SelectVariants tool in gatk4, to remove NULL or NOCALL genotypes (./.) from our final 6SV.vcf.gz resultant of step11_filter_high_qual_variants. The resultant VCF, purged_nocalls_2.vcf.gz includes a very low proportion of NOCALLS and all biallelic SNPs. We used this VCF as the input for STACKS populations module, as well as trying several others (snpsets from PLINK, R, filtered, unfiltered, etc) 2. Request an interactive job bsub -q interactive -R rusage[mem=24000] -n 1 -W 120 -Is bash 3. Load dependences module load anaconda2/4.4.0 module load stacks/2.52 4. Generate the required *popmap* Use *nano* to create a population map that indicates the sample name and population, then save as popmap. Formatting should be (no header). The individual names should be identical to those listed in your target VCF. use the *less* command to scroll down and confirm naming conventions in your VCF. Save the text file you've created as "popmap" ``` indiv ->tab population LIC11 1 ``` 5. Usage is: populations -V /project/uma_lisa_komoroske/Tanya/analyses/step11_filter_high_qual_variants/mLynCan4_v1.p_lynxonly/purged_nocalls_2.vcf.gz -M /project/uma_lisa_komoroske/Tanya/scripts/STACKS/popmap2 --fstats -O /project/uma_lisa_komoroske/Tanya/scripts/STACKS/ --vcf --verbose Note: because we are on an interactive node, we can call the populations command directly rather than submitting a job through bsub. # LFMM Out next step is to use WorldClim Variables in a GEA (genotype-environment-association) framework to evaluate whether See: [Landscape genomics provides evidence of climate‐associated genetic variation in Mexican populations of Quercus rugosa](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6231481/) 1. download WorldClim data in R 2. mismatch between today and 50 year projected climate data 3. what genes are significantly associated w/ climate variables 4. what is their gene function? 5. # Appendix and FAQ :::info **Find this document incomplete?** Leave a comment! ::: ###### tags: `chapter3_adaptation` `Genomics` `Outlier` `SNP` `R` `PLINK` `FST` `heterozygosity` `snpEff` `Annotation` # ScratchPad Next: validate using het AND FST per the method described in: 1. Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society of London. Series B: Biological Sciences. 1996 Dec 22;263(1377):1619-26. 2. Moen T, Hayes B, Nilsen F, Delghandi M, Fjalestad KT, Fevolden SE, Berg PR, Lien S. Identification and characterisation of novel SNP markers in Atlantic cod: evidence for directional selection. BMC genetics. 2008 Dec 1;9(1):18. The Moen et al. paper in particular has some great outlier visuals from applying the Beaumonth method. We will try to replicate this! ![](https://i.imgur.com/iQQnE9b.png) Estimating heterozygosity --- We want to calculating heterozygosity for each SNPs. After studying the plink guide, I have calculated the heterozygosity using the following script. plink --make-bed --file purebred411_qc --freqx --out freqx_411 And I got this output: CHR SNP A1 A2 C(HOM A1) C(HET) C(HOM A2) C(HAP A1) C(HAP A2) C(MISSING) 1 AX-85111653 3 1 45 187 178 0 0 1 1 AX-85043398 2 4 45 186 180 0 0 0 1 AX-85051079 4 2 5 71 335 0 0 0 1 AX-85154093 4 2 5 72 332 0 0 2 1 AX-85063459 3 1 56 199 155 0 0 1 https://www.biostars.org/p/356195/ https://www.biostars.org/p/356195/ Next Steps: 1) intergenic vs. gene-variant structure 2) run outlier analysis on just the circadian rhythm? 3) structuring analyses in a way that's more cohesive 4) moving toward candidate genes in our genome vs. what those look like in other felids? outgroup? 5) comparing against other genomes? -- raw SRA data? SNPs within an intron vs gene-variant, downstream, upstream, intergenic? intergenic SNPs = no function, can treat them as neutral markers? fun FST and Structure on neutral dataset, 2 samples were collected north but clustered genetically with the south -- are these samples males for females? good question from Warren. For some of these genes that I've found of interest -- should consider talking to Jose Antonio to see if we can compare to some Iberian Lynx Puma data (tropical) (iberian -- hot), puma (both) tiger (siberian and tropical) # Manuscript Prep Target Journal:Nature Scientific Reports? Title Authors Abstract Keywords Intro Results Discussion Conclusion Materials & Methods ROH -- how to interpre the ROH output from vcftools? Diff ROH, length, distribution among individuals/chromosomes. ROH detected in all 3 populations. 7 individuals no ROH (all on Gaspe Peninsula and south of S Lawrence River). Long ROH more indicative of recent Ne. ###### tags: `chapter3_adaptation` `