# STACKS - A guide to assembling loci and SNPs for ddRADSeq data
###### tags: `RADSeq` `STACKS` `Pairedend Sequencing` `Population Genetics`
---
author: Dylan Cohen
---
Link to STACKS website:
https://catchenlab.life.illinois.edu/stacks/
This guide is meant to help users with the basic workflow and to produce output files for population genetic analyses.
------
STACKS OVERVIEW
1. process_radtags - demultiplex and clean reads
2. denovo_map.pl - this will run through a number of sub-programs (ustacks, cstacks, sstacks, tsv2bam, gstacks, and populations) within STACKS. These can also be run individually.
3. populations - use to fine-tune output parameters and create output files for use elsewhere. Optional.
----
*** If you are using Buxbaumia you should be able to call STACKS without using the full path to where the program is installed.
1. `process_radtags -h`
process_radtags is the first step in the STACKS pipeline. It takes your raw sequencing files and prepares them for locus and SNP calling.
HINT: Most programs have associated help files, and you can often see them by calling the command (e.g., process_radtags) with -h or --help or something along those lines. This will usually return information about the command, similar to ?function in R.
EXAMPLE process_radtags CODE:
```
nohup process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &> ptags2.out &
```
nohup - tells Buxbaumia not to kill/stop your job if you lose your connection or close out of the session. This is similar to screen; you will be able to keep using the same terminal while the command runs in the background.
process_radtags - the command to call the program.
-1 - the read 1 files
-2 - the read 2 files
-i - the input format of the raw files (gzfastq = gzipped FASTQ)
-b - the barcodes file pairing each barcode identifier with a sample name, for example:
```
AACCA A_hubrichtii_7
CGATC A_arenaria_TX
TCGAT A_orientalis_14
TGCAT A_cil_tenuifolia_188
```
Note the orientation: barcode identifier followed by the sample name, with a tab in between. The barcode file can be made in a text editor (BBEdit) and uploaded to the server using Cyberduck or FileZilla, or written directly on the server using nano.
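If you would rather build the barcode file directly on the server, here is a minimal sketch using `printf` (it simply recreates the example pairs above; replace them with your own barcodes and sample names, keeping the fields tab-separated):
```
# Sketch only: write the example barcode/sample pairs above to a file named Pbar,
# with a literal tab (\t) between the barcode and the sample name.
printf 'AACCA\tA_hubrichtii_7\nCGATC\tA_arenaria_TX\nTCGAT\tA_orientalis_14\nTGCAT\tA_cil_tenuifolia_188\n' > Pbar
```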
-o - the output directory, which should be created before you call process_radtags.
To make a directory: `mkdir <name>` (e.g., `mkdir Phy_out`)
To change to a directory: `cd <name>` or `cd /full/path`; `cd ..` moves up to the parent directory, and `cd -` returns to the previous directory you were in.
--inline_null -> this tells STACKS where the sample barcodes are located (inline in the sequence of the single-end read, with no index/second barcode)
--renz_1 ecoRI - restriction enzyme 1
--renz_2 mspI - restriction enzyme 2
***** Make sure to change these if using different enzymes!
These are the adapters Fant/Skogen lab uses
--adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT
--adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
--adapter_mm 2 - We are allowing two mismatched bases in the adapters.
-r - rescue barcodes and RAD-Tag cut sites.
-c - clean data; remove any read with an uncalled base.
-q - discard reads with low quality scores.
***The ends of raw reads are usually very messy, so most people truncate/trim reads to avoid problems with assembling loci downstream. We have 150 bp reads and truncate to 100.
To inspect raw, non-demultiplexed files:
```
zless <file name>   # e.g., zless Amorphy_S1_R1_001.fastq.gz
```
-t 100 - truncate final read length to this value.
-s 10 - the quality score limit used with -q; reads whose average score within the sliding window drops below this value are discarded (10 is the default).
&> ptags2.out & - redirects the screen output/log messages to ptags2.out and runs the job in the background.
Example SLURM SCRIPT for RADTAGS
```
#!/bin/bash
#SBATCH --account= ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time= ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20G ## how much RAM you need per CPU/core (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name= ## the job name shown when you run squeue -u <netid>
#SBATCH --mail-type= ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=
#SBATCH --error= # Enter a error log file name
#SBATCH --output= # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0
process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Notes on Slurm scripts: Make sure to fill in your account name, how much time the job is given to run, the number of nodes, cores, and GB of memory, the job name, etc. nohup and & are not needed in Slurm scripts because a Slurm submission already runs in the background.
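To submit and monitor the job, the usual Slurm commands apply (the script name and netid below are placeholders):
```
sbatch radtags_slurm.sh      # submit the job script (hypothetical file name)
squeue -u <your_netid>       # check the status of your queued/running jobs
scancel <job_id>             # cancel a job if something went wrong
```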
Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample, and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
STACKS outputs four different files for each sample if you are using paired-end reads:
sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
sample_name.1.fq.gz
sample_name.2.fq.gz
We only use the files without .rem; the .rem files (which hold "remainder" reads whose mate was discarded) can be deleted:
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
If you are planning on generating loci/SNPs from multiple sub-libraries, you will need to move all of the demultiplexed files into one folder; if not, skip ahead to the de novo pipeline below.
Moving files can be done either with the Linux copy/move commands followed by the path to the new folder, or by transferring them manually with Cyberduck or FileZilla. I prefer the Linux commands as they are less clunky.
cp - copy command (this will duplicate files, leaving a copy in your current directory)
mv - move command (this will move your files and not leave a copy in the current directory). mv can also be used to rename files/folders if you are staying in the same directory: `mv name.txt name2.txt` renames name.txt to name2.txt.
* - the asterisk is a wildcard that matches any characters, so it can be used to select all files with the same ending or a similar pattern. For example, `cp *.1.fq.gz <destination>` copies every file ending in .1.fq.gz to the directory of your choice (see the sketch below).
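For example, a rough sketch of pooling the demultiplexed reads from two sub-libraries into a single folder (the directory names here are hypothetical):
```
# Hypothetical directory names: pool demultiplexed reads from two sub-libraries
mkdir all_samples
cp sublib1_out/*.fq.gz all_samples/
cp sublib2_out/*.fq.gz all_samples/
# or, to move rather than copy:
# mv sublib1_out/*.fq.gz all_samples/
```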
---
# Denovo Assembly
---
This is the true workhorse of the STACKS pipeline. During denovo_map.pl, STACKS will run the following sub-commands: ustacks, cstacks, sstacks, tsv2bam, gstacks, and populations. populations can also be rerun after the de novo assembler to fine-tune parameters such as r and p, create different output files, and calculate some basic pop-gen stats.
Below are short descriptions of the key parameters most users of STACKS will investigate; they are covered in detail in [Paris et al. 2017 - Lost in parameter space](https://doi.org/10.1111/2041-210X.12775): m (the minimum read depth required to form a stack, or putative allele, within an individual), M (the number of mismatches allowed between stacks within an individual), and n (the number of mismatches allowed between loci when building the catalog across individuals). My suggestion would be to start with the default parameters (m = 3, M = 2, n = 1) until you are comfortable running the de novo mapper. I would also set r = 0.80 (the "r80" rule), as this seems to be the convention for most population genetics studies, but if you have a lot of missing data explore different r and p values. Once you are comfortable calling the pipeline and understand its outputs, then I suggest moving on to exploring parameters.

In order to run denovo_map.pl you will need:
1. All the .fq.gz (gzipped FASTQ) sample files in one directory.
2. A population map file.
Start off by creating a sample folder and moving the samplename.1.fq.gz and samplename.2.fq.gz files you wish to run through the de novo assembler pipeline into that folder. Next you will need to create a population map file. It should have no header and just two columns: the first with sample IDs and the second with the population assignment. Make sure to include a tab between the sample ID and the population (a rough sketch for generating a starting map from your file names follows the example below).
Pop map example:
```
Fug_B13 Bos
Fug_B14 Bos
Fug_B15 Bos
Fug_B24 Bos
Fug_F25 Fife
Fug_F26 Fife
Fug_F36 Fife
Fug_S1 Sev
Fug_S11 Sev
Fug_S12 Sev
Fug_S2 Sev
```
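If the sample files are already in one folder, a rough sketch for generating a starting pop map from the file names (the `samples/` folder and `popmap` file name are placeholders, and the population column still has to be edited by hand):
```
# Sketch only: one line per sample (the .1.fq.gz suffix stripped) plus a
# placeholder population column; edit the second column afterwards.
cd samples
for f in *.1.fq.gz; do
    printf '%s\tPOP\n' "${f%.1.fq.gz}"
done > ../popmap
```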
Here is an example code for the denovo_map.pl
```
denovo_map.pl -m 4 -M 3 -n 3 --samples /data/amsonia/raw/data-06-28-2022/FUG_100 --popmap /data/amsonia/raw/data-06-28-2022/FUG_100/Fug_map -o /data/amsonia/raw/data-06-28-2022/r60/FUG/4_3_3 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```
Remember to check out the help document associated with the assembler ```denovo_map.pl -h ```
This will fill you in on some of the parameters we are using; the STACKS homepage has a fuller description of each.
Briefly,
-m, -M, -n -> the key STACKS parameters; leaving them out will run the defaults (m = 3, M = 2, n = 1)
--samples -> this is the flag followed by the path to your sample directory/folder you wish to run
--popmap -> if the map isn't in the same folder as the script, then you will have to provide the entire path
-o -> this is the output directory, which you will need to set up (mkdir) prior to running.
-T -> how many cores/processors you wish to use
--paired -> this tells STACKS you are working with paired-end data
-X "populations" -> this is the populations flag that you will need to use if you want to change populations parameters during the de novo assembly. Note that you can also rerun populations on its own, so if you missed something on the initial run don't fret (see the example after this list).
-X "populations:-r 0.8" -> minimum percentage of individuals in a population required to process a locus for that population.
-X "populations: --write-random-snp" -> restrict data analysis to one random SNP per locus.
-X "populations: --vcf" — output SNPs and haplotypes in Variant Call Format (VCF).
There are many more options for both the de novo mapper and the populations program; these are just some of the basic options to get started.
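For reference, populations can also be rerun by itself against an existing denovo_map.pl output directory, so filters or output formats can be changed without redoing the whole assembly. A sketch reusing the paths from the example above (the `pops_rerun` output folder is hypothetical; create it with mkdir beforehand to be safe):
```
# Sketch: rerun populations on an existing assembly with the same filters/outputs
populations -P /data/amsonia/raw/data-06-28-2022/r60/FUG/4_3_3 \
            -M /data/amsonia/raw/data-06-28-2022/FUG_100/Fug_map \
            -O /data/amsonia/raw/data-06-28-2022/r60/FUG/4_3_3/pops_rerun \
            -r 0.8 --write-random-snp --vcf -t 10
```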
I like to run my code in bash scripts. Here is an example of one of my scripts to run denovo_map.pl:
```
#!/bin/bash
nohup denovo_map.pl -m 3 -M 2 -n 2 --samples /data/amsonia/raw/data-08-30-2022/LONG_100 --popmap /data/amsonia/raw/data-08-30-2022/LONG_100/Long_map -o /data/amsonia/raw/data-08-30-2022/r80/Long/3_2_2 -T 10 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf" &> deLONG.out &
```
***
You can use the same Slurm script format from process_radtags for the de novo assembly; just modify the script, replacing the process_radtags command line with your denovo_map.pl command.
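For example, the tail end of the modified Slurm script might look like this sketch, reusing the paths from the bash script above (keep the same #SBATCH header as in the process_radtags script and adjust the module version to whatever your system provides):
```
# ... #SBATCH header lines as in the process_radtags Slurm script ...
module purge
module load stacks/2.64-gcc-9.2.0

denovo_map.pl -m 3 -M 2 -n 2 \
  --samples /data/amsonia/raw/data-08-30-2022/LONG_100 \
  --popmap /data/amsonia/raw/data-08-30-2022/LONG_100/Long_map \
  -o /data/amsonia/raw/data-08-30-2022/r80/Long/3_2_2 \
  -T 10 --paired \
  -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```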
Make sure to check the output logs: either the file you redirected the log output to, or the denovo_map.log file found in the output directory. There are a lot of important statistics to consider, including depth of coverage and the number of loci and SNPs found.
Depth of coverage example output:
Depths of Coverage for Processed Samples:
Fug_B13: 56.78x
Fug_B14: 47.15x
Fug_B15: 42.69x
Fug_B17: 70.58x
Fug_B18: 101.69x
Fug_B19: 48.78x
Fug_B20: 47.97x
Fug_B21: 40.05x
Fug_B22: 37.53x
Example output from populations.log (found within the output directory):
Removed 72480 loci that did not pass sample/population constraints from 72874 loci.
Kept 394 loci, composed of 78132 sites; 870 of those sites were filtered, 284 variant sites remained.
Number of loci with PE contig: 394.00 (100.0%);
Mean length of loci: 188.30bp (stderr 2.48);
Number of loci with SE/PE overlap: 169.00 (42.9%);
Mean length of overlapping loci: 158.34bp (stderr 2.32); mean overlap: 26.97bp (stderr 0.40);
Mean genotyped sites per locus: 192.68bp (stderr 2.31).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
Bos: 6.6802 samples per locus; pi: 0.20574; all/variant/polymorphic sites: 47713/172/132; private alleles: 25
Fife: 6.9333 samples per locus; pi: 0.1711; all/variant/polymorphic sites: 26633/90/58; private alleles: 19
Sev: 6.8323 samples per locus; pi: 0.21004; all/variant/polymorphic sites: 44579/161/127; private alleles: 32
---
## TODO: Populations filtering
---
---
# Optimizing STACKS parameters
---
This will be a brief guide to optimizing your STACKS parameters based on [Mastretta-Yanes et al. 2015](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.12291) and Lost in Parameter Space (link above). The Mastretta-Yanes method suggests including multiple replicate samples in each sub-library. The idea is to create a population map that includes all replicate samples from across the sub-libraries, but keeps the total number of populations to one: regardless of where the samples came from, the pop map should assign the same population to every sample. Next, you test different values of the main STACKS parameters (m, M, n) while keeping r = 0.80. I recommend starting with lower values first: 3,2,2; 4,2,2; 5,2,2; 3,3,3; 3,4,4; 3,5,5; etc. The goal is to maximize the SNP output while minimizing the distance between replicate samples. After you have run several test assemblies, we will visualize the distance between sample replicates using an MDS plot produced with the program PLINK.
### NOTE: This section of the tutorial assumes you have plink and vcftools preinstalled on your system. If you are using the CBG server Buxbaumia, they are both installed and can be called as follows:
```
vcftools <options>   # for help: man vcftools
plink <options>      # for help: plink -h
```
First create your population map file. Regardless of what population these samples come from, they are all treated as part of the same population. This pop map should include all of your replicated samples. Make sure each replicate sample has a unique ID before running the optimization, for example:
```
Sample_A_Sub1 test
Sample_A_Sub2 test
Sample_A_Sub3 test
Sample_A_Sub4 test
Sample_B_Sub1 test
Sample_B_Sub2 test
Sample_B_Sub3 test
Sample_B_Sub4 test
```
Next set up your test directory with folders for each test run. Before you start a run, make sure you have changed the script to match your pop map and output directory, as well as the test parameters you want to use. We will be using PLINK to approximate distances, so you will need to either convert your VCF output to plink format or pass the plink output flag as part of the populations parameters during the assembly. To do this add:
```
-X "populations: --plink"
```
to the end of your denovo_map.pl command line, but before the redirect to your log output file. Alternatively, you can use vcftools to create the plink format:
Create a new directory, copy your populations.snps.vcf file into it, and change into that directory. Then run the vcftools line below:
```
vcftools --vcf populations.snps.vcf --plink --out plink
```
Then run:
```
plink --file plink --make-bed --out binary --noweb
```
Followed by:
```
plink --file plink --cluster --mds-plot 3 --noweb
```
To download the results, use either Cyberduck or secure copy (scp). Save them in a folder labeled with the main STACKS parameters of that run (e.g., 3_2_2, 4_2_2).
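A generic scp sketch, run from your local machine (user name, host, and paths are all placeholders):
```
# Placeholders throughout: copy the MDS results from the server to a local folder
mkdir -p 3_2_2
scp user@server:/path/to/test_runs/3_2_2/plink.mds ./3_2_2/
```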
Use R to plot the file that ends with .mds and label each point with its sample name. Ideally, replicates will cluster together; **if they do not**, you will need to take a closer look at some of the basic stats for the run and use VCFtools to filter for missing data and other potential issues.
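Here is a minimal plotting sketch using base R through `Rscript`, assuming the default PLINK output name `plink.mds` and that it sits in your current folder:
```
# Sketch: plot the first two MDS dimensions and label points with their sample IDs
Rscript -e '
mds <- read.table("plink.mds", header = TRUE)
pdf("mds_plot.pdf")
plot(mds$C1, mds$C2, xlab = "MDS dimension 1", ylab = "MDS dimension 2")
text(mds$C1, mds$C2, labels = mds$IID, pos = 3, cex = 0.7)
dev.off()
'
```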
## VCFTOOLS - A Short Guide
---
---
In addition to the MDS plots, you will need to extract the number of loci and SNPs obtained from each de novo assembly (found in your output directory). View the contents of populations.log (e.g., `cat populations.log`):
---
Locus/sample distributions will be written to '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.log.distribs'.
populations parameters selected:
Percent samples limit per population: 0.8
Locus Population limit: 1
Percent samples overall: 0
Minor allele frequency cutoff: 0
Maximum observed heterozygosity cutoff: 1
Applying Fst correction: none.
Pi/Fis kernel smoothing: off
Fstats kernel smoothing: off
Bootstrap resampling: off
Parsing population map...
The population map contained 30 samples, 3 population(s), 1 group(s).
Working on 30 samples.
Working on 3 population(s):
Bos: Fug_B13, Fug_B14, Fug_B15, Fug_B17, Fug_B18, Fug_B19, Fug_B20, Fug_B21, Fug_B22, Fug_B24
Fife: Fug_F25, Fug_F26, Fug_F27, Fug_F29, Fug_F30, Fug_F32, Fug_F33, Fug_F34, Fug_F35, Fug_F36
Sev: Fug_S1, Fug_S11, Fug_S12, Fug_S2, Fug_S3, Fug_S4, Fug_S6, Fug_S7, Fug_S8, Fug_S9
Working on 1 group(s) of populations:
defaultgrp: Bos, Fife, Sev
SNPs and calls will be written in VCF format to '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.snps.vcf'
Raw haplotypes will be written to '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.haplotypes.tsv'
Population-level summary statistics will be written to '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.sumstats.tsv'
Population-level haplotype summary statistics will be written to '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.hapstats.tsv'
Processing data in batches:
* load a batch of catalog loci and apply filters
* compute SNP- and haplotype-wise per-population statistics
* write the above statistics in the output files
* export the genotypes/haplotypes in specified format(s)
More details in '/data/amsonia/raw/data-08-30-2022/r80/Fug/3_2_2/populations.log.distribs'.
Now processing...
Batch 1
Batch 2
Batch 3
## Pay close attention to these outputs, as they will help you decide which parameters are best.
**Removed 109107 loci that did not pass sample/population constraints from 137010 loci.**
**Kept 27903 loci, composed of 6522968 sites; 30204 of those sites were filtered, 24608 variant sites remained.**
I keep a separate Excel/Google Sheets log of all my trial/optimization runs and record the assembly name, the number of loci that were kept (27903 here), and the number of variant sites (24608). These stats, along with the r80 parameter, are covered in detail in the Paris et al. 2017 paper (a quick grep sketch for pulling these lines out of each run follows the log excerpt below).
Number of loci with PE contig: 27903.00 (100.0%);
Mean length of loci: 223.77bp (stderr 0.21);
Number of loci with SE/PE overlap: 3863.00 (13.8%);
Mean length of overlapping loci: 204.75bp (stderr 0.28); mean overlap: 23.71bp (stderr 0.07);
Mean genotyped sites per locus: 225.15bp (stderr 0.20).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
Bos: 9.2835 samples per locus; pi: 0.16382; all/variant/polymorphic sites: 3659112/13567/7937; private alleles: 1676
Fife: 9.3419 samples per locus; pi: 0.17307; all/variant/polymorphic sites: 4168473/15623/9713; private alleles: 2353
Sev: 9.357 samples per locus; pi: 0.21065; all/variant/polymorphic sites: 5779699/22551/17637; private alleles: 4287
Populations is done.
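A quick way to pull those two key lines out of every test run for your spreadsheet (the directory layout below is hypothetical; point the glob at your own test folders):
```
# Hypothetical layout: one folder per parameter combination (3_2_2, 4_2_2, ...)
grep -H "Kept" /path/to/test_runs/*/populations.log
grep -H "Removed" /path/to/test_runs/*/populations.log
```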
---
After you have determined the "best" set of parameters, rerun the de novo assembly on the entire data set. All samples will have to be put into one folder, and a population map with the real population IDs will need to be made. The script for the de novo assembly will also need to be updated accordingly with the sample input, pop map, and output paths.
---
### *** UNDER CONSTRUCTION BELOW ***
### POST STACKS filtering ::TODO::
### Below is extra -
In all studies, researchers should explore parameter space to optimize and obtain the best dataset.
To learn about optimizing STACKS parameters consult these papers:
Lost in Parameter [Space](https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210X.12775)
-> Basically argues to use the "r80" method
Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly [optimization](https://scholar.google.com/scholar?cluster=17514714704445809315&hl=en&as_sdt=0,14) for population genetic inference
-> The idea is that you want to minimize the distance between replicates in an ordination (e.g., PCA or MDS); the dataset with the shortest distances between replicates should be the most optimal.