# Mapping Castilleja ddRAD data to a reference genome
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Resources:**
1. Link to STACKS website:
https://catchenlab.life.illinois.edu/stacks/
*This guide is meant to help users with the basic workflow and to produce output files for population genetic analyses.*
2. Bioinformatics guide by Eric C. Anderson
https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to-guide
https://speciationgenomics.github.io/mapping_reference/
4. Slurm script how-to from NU https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
5. Globus - file management
https://www.globus.org/
------
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. mapping ddRADseq reads to reference genome with bwa + samtools
3. STACKS: gstacks
4. STACKS: populations
5. filtering with VCFtools and PLINK
----
# Quest basics
## Logging on
Log on to Quest with `ssh` (it will ask for your NetID password):
```
ssh ewf7555@quest.northwestern.edu
```
To make a directory: `mkdir` + name
To change to a directory: `cd` + name or full path. `cd ..` moves up to the parent directory, and `cd -` returns to the previous directory you were in.
Moving files can be done either with the Linux copy/move commands followed by the path to the new folder, or manually with Cyberduck or FileZilla. I prefer the Linux commands as they are less clunky.
`cp` - copy command (this duplicates files, leaving a copy in your current directory)
`mv` - move command (this moves your files and does not leave a copy in the current directory). `mv` can also be used to rename files/folders if you are not moving to a new directory: `mv name.txt name2.txt` renames name.txt to name2.txt.
`*` - the asterisk is a wildcard; it selects all files with the same ending or a similar pattern. `cp *.1.fq.gz <destination>` copies every file ending in .1.fq.gz to the directory of your choice.
For example, to move all .gz files to a directory called "gz_reads":
```
mv *.gz /projects/p32037/Castilleja_emma/readfiles_all/gz_reads
```
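Before any bulk `mv` with a wildcard, it can be worth previewing what the pattern will grab. A small sketch (the file names and directories here are made up for illustration):

```shell
# Preview a wildcard before moving files in bulk.
workdir=$(mktemp -d)
touch "$workdir"/a.1.fq.gz "$workdir"/a.2.fq.gz "$workdir"/notes.txt
cd "$workdir"

count=$(ls *.gz | wc -l)       # how many files would '*.gz' grab?
echo "moving $count file(s)"

mkdir -p gz_reads
mv *.gz gz_reads/              # notes.txt does not match and is untouched
```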
## Castilleja data folder
Jeremie's copy "Castilleja" won't allow for edits so I have a copy called "Castilleja_emma"
I did this on Globus - cp or mv functions wouldn't work
File path:
```
$ cd /projects/p32037/Castilleja_emma
```
This folder contains the index barcodes in .txt files and the sequence data in .fastq files
There should be 12 indices with 8 samples in each index group.
:::danger
I only see barcode indices for 12, 2 (twice - .tsv and .txt), 4, 5, 6, 7
:::
This is okay, the other fastq files are from another species in the same run.
```
[ewf7555@quser34 Castilleja]$ ls
barcodes_index12.txt
barcodes_index2.tsv
barcodes_index2.txt
barcodes_index4.txt
barcodes_index5.txt
barcodes_index6.txt
barcodes_index7.txt
C85HWANXX_s5_1_BC_X.fastq
C85HWANXX_s5_1_illumina12index_10_SL136389.fastq
C85HWANXX_s5_1_illumina12index_11_SL136390.fastq
C85HWANXX_s5_1_illumina12index_12_SL136384.fastq
C85HWANXX_s5_1_illumina12index_1_SL136385.fastq
C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
C85HWANXX_s5_1_illumina12index_3_SL136386.fastq
C85HWANXX_s5_1_illumina12index_4_SL136380.fastq
C85HWANXX_s5_1_illumina12index_5_SL136381.fastq
C85HWANXX_s5_1_illumina12index_6_SL136382.fastq
C85HWANXX_s5_1_illumina12index_7_SL136383.fastq
C85HWANXX_s5_1_illumina12index_8_SL136387.fastq
C85HWANXX_s5_1_illumina12index_9_SL136388.fastq
C85HWANXX_s5_2_BC_X.fastq
C85HWANXX_s5_2_illumina12index_10_SL136389.fastq
C85HWANXX_s5_2_illumina12index_11_SL136390.fastq
C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
C85HWANXX_s5_2_illumina12index_1_SL136385.fastq
C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
C85HWANXX_s5_2_illumina12index_3_SL136386.fastq
C85HWANXX_s5_2_illumina12index_4_SL136380.fastq
C85HWANXX_s5_2_illumina12index_5_SL136381.fastq
C85HWANXX_s5_2_illumina12index_6_SL136382.fastq
C85HWANXX_s5_2_illumina12index_7_SL136383.fastq
C85HWANXX_s5_2_illumina12index_8_SL136387.fastq
C85HWANXX_s5_2_illumina12index_9_SL136388.fastq
```
Look at the raw data sequences
# Process radtags
## Code
1. `process_radtags -h`
process_radtags is the first step in the STACKS pipeline. It takes your raw sequencing files and prepares them for SNP and locus calling.
EXAMPLE process_radtags CODE FROM DYLAN:
```
nohup process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```
My version for Castilleja:
```
process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```
`process_radtags` - the command to call the program.
`-1` - the read 1 file
`-2` - the read 2 file
`-i` - the format of the raw input files (fastq)
`-b` - the barcodes file (in this case: barcodes_index12.txt) with barcode + sample identifier:
AACCA A_hubrichtii_7
CGATC A_arenaria_TX
TCGAT A_orientalis_14
TGCAT A_cil_tenuifolia_188
Note the orientation: barcode followed by the sample name, with a tab in between. The barcode file can be made in a text editor (BBEdit) and uploaded to the server using Cyberduck or FileZilla, or written directly on the server with nano.
`-o` - the output directory, which should be created before you call process_radtags
**I made a new directory within the "Castilleja_emma" folder called "CACOradtags_output" for the first/test group (index 12)**
`--inline_null` - tells STACKS the position of our barcodes for each sample: inline on the single-end read, none on the paired-end read
`--renz_1 ecoRI` - restriction enzyme 1
`--renz_2 mspI` - restriction enzyme 2
**Make sure to change these if using different enzymes!**
These are the adapters the Fant/Skogen lab uses:
`--adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT`
`--adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT`
`--adapter_mm 2` - we are allowing two mismatched bases in the adapters
`-r` - rescue barcodes and RAD-Tag cut sites
`-c` - clean data, remove any read with an uncalled base
`-q` - discard reads with low quality scores
**The ends of raw reads are usually very messy, so most people truncate/trim reads to avoid problems with assembling loci downstream. We have 150 bp reads and truncate to 100.**
`-t 100` - truncate final read length to this value
`-s 10` - the sliding-window score limit used with `-q`: reads whose average quality within the window drops below 10 are discarded (this is the score limit, not a thread count)
`&> ptags2.out &` - in the nohup version, redirects the output/log to ptags2.out and runs the job in the background
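The barcode file format is easy to get subtly wrong, so a quick check before launching process_radtags can save a failed run. A sketch using a hypothetical `barcodes_demo.txt`; the awk test only assumes the tab-separated BARCODE/SAMPLE layout described above:

```shell
# Build a demo barcode file with literal tabs between the two columns.
printf 'AACCA\tA_hubrichtii_7\nCGATC\tA_arenaria_TX\nTCGAT\tA_orientalis_14\nTGCAT\tA_cil_tenuifolia_188\n' > barcodes_demo.txt

# Flag any line that does not have exactly two tab-separated fields,
# or whose first field is not plain ACGT (stray spaces, extra columns).
bad=$(awk -F'\t' 'NF != 2 || $1 !~ /^[ACGT]+$/ { print NR }' barcodes_demo.txt)
if [ -z "$bad" ]; then echo "barcode file OK"; else echo "check lines: $bad"; fi
```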
Example SLURM SCRIPT for RADTAGS from DYLAN's notes:
```
#!/bin/bash
#SBATCH --account= ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time= ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20G ## how much RAM you need per CPU (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name= ## the job name shown when you run squeue -u <netid>
#SBATCH --mail-type= ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=
#SBATCH --error= # Enter a error log file name
#SBATCH --output= # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0
process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
## Slurm scripts
Actual Slurm script I used for Castilleja radtags:
* This was for "index 12"
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=25GB ## how much RAM you need per CPU (this affects your FairShare score so be careful not to ask for more than you need)
#SBATCH --job-name=CACO_radtags ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error # Enter a error log file name
#SBATCH --output=CACO_radtags.out # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Notes on running a Slurm:
What's a slurm?? https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
* open a new file with `nano process_radtags.sh`
* copy and paste the script above (plain text only)
* save and exit with Ctrl+X, answering Y to keep the changes
Run the Slurm with
```
sbatch process_radtags.sh #start slurm script
squeue -u ewf7555 #check status of slurm using NU net ID
```
Notes on Slurm scripts:
Make sure to put your account name into the script, how much time you are giving it to run, number of nodes, cores, and gb, job name, etc.
*nohup and & are not needed with slurm scripts because a slurm submission runs in the background regardless.*
Output:
```
[ewf7555@quser31 Castilleja_emma]$ sh process_radtags.sh
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Filtering reads for adapter sequence:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
2 mismatches allowed to adapter sequence.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 8 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [C85HWANXX_s5_1_illumina12index_12_SL136384.fastq]
Reading data from:
C85HWANXX_s5_1_illumina12index_12_SL136384.fastq and
C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...
11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...
25M...26M...27M...28M...29M...30M...31M...
63459904 total reads; -3137246 ambiguous barcodes; -411809 ambiguous RAD-Tags; +1108469 recovered; -2782870 low quality reads; 55721085 retained reads.
1406894 reads with adapter sequence.
Closing files, flushing buffers...done.
63459904 total sequences
1406894 reads contained adapter sequence (2.2%)
3137246 barcode not found drops (4.9%)
2782870 low quality read drops (4.4%)
411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)
Details logged: 'CACOradtags_output/process_radtags.log'
For a summary, execute one or more:
stacks-dist-extract CACOradtags_output/process_radtags.log total_raw_read_counts
stacks-dist-extract --pretty CACOradtags_output/process_radtags.log per_barcode_raw_read_counts
process_radtags is done.
[ewf7555@quser31 Castilleja_emma]$
```
This finished very quickly (under an hour) with 87.8% retained reads.
Trying again with "index 2" using the same Slurm script but substituting different file names:
Forward file (-1) C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
Reverse file (-2) C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
Barcode file: barcodes_index2.txt
Job title: CACO_radtags2
For now I am putting this in a new output file in the "Castilleja_emma" folder to keep it separate from the index 12 run.
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_2_SL136379.fastq -2 C85HWANXX_s5_2_illumina12index_2_SL136379.fastq -i fastq -b barcodes_index2.txt -o CACOradtags_output2 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Started at 11 am - immediate failure because there were too many columns in the barcodes text file... need to investigate this.
Used the "barcodes_index2.tsv" file instead - this worked.
Submitted the slurm at 11:03 and it finished within 30 minutes, with very similar results to the index 12 run (85.8% retained reads).
Looks good, now I need to do this for index 4, 5, 6, 7
```
mkdir CACOradtags_output4
mkdir CACOradtags_output5
mkdir CACOradtags_output6
mkdir CACOradtags_output7
```
Slurm for process_radtags4.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_4_SL136380.fastq -2 C85HWANXX_s5_2_illumina12index_4_SL136380.fastq -i fastq -b barcodes_index4.txt -o CACOradtags_output4 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags5.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags5
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_5_SL136381.fastq -2 C85HWANXX_s5_2_illumina12index_5_SL136381.fastq -i fastq -b barcodes_index5.txt -o CACOradtags_output5 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags6.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags6
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_6_SL136382.fastq -2 C85HWANXX_s5_2_illumina12index_6_SL136382.fastq -i fastq -b barcodes_index6.txt -o CACOradtags_output6 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags7.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags7
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_7_SL136383.fastq -2 C85HWANXX_s5_2_illumina12index_7_SL136383.fastq -i fastq -b barcodes_index7.txt -o CACOradtags_output7 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
I then ran these slurms; the outputs should be in separate folders (i.e. CACOradtags_output4, CACOradtags_output5, CACOradtags_output6, and CACOradtags_output7), with log files as process_radtags.log in the respective directories.
**Quick results of process radtags**
---
**Index 12:**
63459904 total sequences
1406894 reads contained adapter sequence (2.2%)
3137246 barcode not found drops (4.9%)
2782870 low quality read drops (4.4%)
411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)
**Index 2:**
55589592 total sequences
1055252 reads contained adapter sequence (1.9%)
4051380 barcode not found drops (7.3%)
2372433 low quality read drops (4.3%)
403793 RAD cutsite not found drops (0.7%)
47706734 retained reads (85.8%)
**Index 4:**
39541866 total sequences
723528 reads contained adapter sequence (1.8%)
2560698 barcode not found drops (6.5%)
1591721 low quality read drops (4.0%)
285054 RAD cutsite not found drops (0.7%)
34380865 retained reads (86.9%)
**Index 5:**
112035426 total sequences
2212884 reads contained adapter sequence (2.0%)
4991864 barcode not found drops (4.5%)
4956510 low quality read drops (4.4%)
706719 RAD cutsite not found drops (0.6%)
99167449 retained reads (88.5%)
**Index 6:**
73285042 total sequences
1446848 reads contained adapter sequence (2.0%)
3755980 barcode not found drops (5.1%)
2914390 low quality read drops (4.0%)
496599 RAD cutsite not found drops (0.7%)
64671225 retained reads (88.2%)
**Index 7:**
69138146 total sequences
1210578 reads contained adapter sequence (1.8%)
4440828 barcode not found drops (6.4%)
2743297 low quality read drops (4.0%)
469015 RAD cutsite not found drops (0.7%)
60274428 retained reads (87.2%)
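The retained-read percentages above were read off the end of each log by hand; a one-liner can pull them out instead. A sketch, using a saved snippet in place of the real `CACOradtags_output*/process_radtags.log`:

```shell
# A stand-in for the tail of a process_radtags log.
cat > ptags_summary_demo.txt <<'EOF'
63459904 total sequences
55721085 retained reads (87.8%)
EOF

# Grab the "retained reads" line and keep only the percentage inside the parens.
retained_pct=$(grep 'retained reads' ptags_summary_demo.txt | sed 's/.*(\(.*\)%).*/\1/')
echo "retained: ${retained_pct}%"
```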
Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
STACKS outputs four different files for each sample if you are using paired-end reads:
sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
sample_name.1.fq.gz
sample_name.2.fq.gz
We only use the files without the .rem; those files with a .rem. can be deleted.
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
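After deleting the .rem files it is worth confirming that every sample still has a matching read-1 and read-2 file. A sketch with made-up sample names:

```shell
# Simulate a demultiplexed output directory.
workdir=$(mktemp -d); cd "$workdir"
touch SP790.1.fq.gz SP790.2.fq.gz DP830.1.fq.gz DP830.2.fq.gz

# For every read-1 file, check the matching read-2 file exists.
missing=""
for f in *.1.fq.gz; do
  [ -f "${f%.1.fq.gz}.2.fq.gz" ] || missing="$missing $f"
done
echo "samples missing a read-2 file:${missing:-none}"
```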
If you are planning on generating loci/SNPs for multiple sublibraries, you will need to move all demultiplexed files into one folder; if not, skip below to the denovo pipeline.
(See the `cp`/`mv`/wildcard notes in the Quest basics section above for how to move files.)
**Removed the .rem files and moved all read files into a single directory in "Castilleja_emma" called "readfiles_all"**
I was curious about the raw read quality of these sequences, so I ran FastQC on the read files:
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/Castilleja_emma/readfiles_all/fastqc -t 6 *.fq
```
It is hard to review each file individually... so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once:
```
[ewf7555@quser32 readfiles_all]$ module load multiqc
[ewf7555@quser32 readfiles_all]$ multiqc .
#this searched through the entire read_files directory and found the fastqc reports generated
/// MultiQC 🔍 | v1.11
| multiqc | MultiQC Version v1.19 now available!
| multiqc | Search path : /projects/p32037/Castilleja_emma/readfiles_all
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 575/575
| picard | Found 35 AlignmentSummaryMetrics reports
| fastqc | Found 97 reports
| fastqc | Sample had zero reads: 'merged_DP830'
| multiqc | Compressing plot data
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete
```
# Mapping ddRAD reads to a reference genome
### Preparing the reference:
It is a good idea to make a new directory for your reference files. The reference needs to be uncompressed for this analysis (gunzip it first if needed), then move it into place:
```
mkdir ./reference
mv tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa ./reference/
```
Index the reference genome:
```
bwa index tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
```
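`bwa index` writes five companion files next to the FASTA (.amb, .ann, .bwt, .pac, .sa). A quick presence check before mapping; `demo_ref.fa` is a stand-in and the `touch` line simulates the index output:

```shell
ref=demo_ref.fa                                             # stand-in for the real assembly
for ext in amb ann bwt pac sa; do touch "$ref.$ext"; done   # simulate `bwa index` output

# Verify all five index files sit next to the reference.
ok=1
for ext in amb ann bwt pac sa; do
  [ -f "$ref.$ext" ] || { echo "missing $ref.$ext"; ok=0; }
done
[ "$ok" -eq 1 ] && echo "index looks complete"
```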
This will create several index files with the same basename. They all need to be in the same directory as the reference FASTA in order for mapping to work.
### Map the reads:
Run this from the directory with the read files. For one pair of read files, align the reads to the reference, generating a .sam file:
```
bwa mem -M -t 4 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt sampleID.1.fq.gz sampleID.2.fq.gz > sampleID.sam
```
Code for Castilleja mapping for sample SP790 in index 12 (trial)
```
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
```
Output:
```
[ewf7555@quser34 readfiles_all]$ bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 400000 sequences (40000000 bp)...
[M::process] read 167046 sequences (16704600 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 400000 reads in 235.877 CPU sec, 59.436 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 167046 reads in 100.074 CPU sec, 25.290 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.2.fq SP790.2.fq
[main] Real time: 86.405 sec; CPU: 337.069 sec
(base) [ewf7555@quser34 readfiles_all]$
```
Evaluating with samtools:
```
samtools flagstat SP790.sam
[ewf7555@quser34 readfiles_all]$ samtools flagstat SP790.sam
576010 + 0 in total (QC-passed reads + QC-failed reads)
8964 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
555492 + 0 mapped (96.44% : N/A)
567046 + 0 paired in sequencing
283523 + 0 read1
283523 + 0 read2
0 + 0 properly paired (0.00% : N/A)
546528 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
314768 + 0 with mate mapped to a different chr
19 + 0 with mate mapped to a different chr (mapQ>=5)
(base) [ewf7555@quser34 readfiles_all]$
```
Good mapping percentages are in the 90s. You can experiment with the minimum seed length of bwa mem (lowercase `-k`) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the `-k`, the fewer exact base pairs need to match the reference in order to start mapping a read. The default is 19 when bwa mem is run without `-k`. It can be a good idea to run bwa mem on one sample with several values (default, 16, 14, 12, 10), evaluate which maps the most reads, and then use that value in the loop for the rest of the samples.
Note that uppercase `-K` is a different option (it sets how many input bases bwa processes per batch), which is why runs with `-K 16` and `-K 10` were only processing 2 reads at a time; the output looked like this:
```
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 2 reads in 0.002 CPU sec, 0.001 real sec
[M::process] read 2 sequences (200 bp)...
```
Don't know what the problem is here, but we had ~94% reads mapped...
:::danger
I found a typo! Was mapping both of the reverse files and not forward and reverse... :(
:::
Corrected code:
```
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa DP830.1.fq DP830.2.fq > DP830.sam
```
Output:
```
(base) [ewf7555@quser33 readfiles_all]$ bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 400000 sequences (40000000 bp)...
[M::process] read 167046 sequences (16704600 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 15058, 1, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (123, 173, 208)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 378)
[M::mem_pestat] mean and std.dev: (169.60, 61.93)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 463)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 400000 reads in 309.368 CPU sec, 77.809 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 6232, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (122, 172, 207)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 377)
[M::mem_pestat] mean and std.dev: (168.39, 61.39)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 462)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 167046 reads in 131.863 CPU sec, 33.177 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq
[main] Real time: 112.884 sec; CPU: 442.390 sec
(base) [ewf7555@quser33 readfiles_all]$
```
Evaluate with samtools:
```
[ewf7555@quser33 readfiles_all]$ samtools flagstat SP790.sam
575042 + 0 in total (QC-passed reads + QC-failed reads)
7996 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
560287 + 0 mapped (97.43% : N/A)
567046 + 0 paired in sequencing
283523 + 0 read1
283523 + 0 read2
493096 + 0 properly paired (86.96% : N/A)
548284 + 0 with itself and mate mapped
4007 + 0 singletons (0.71% : N/A)
48272 + 0 with mate mapped to a different chr
8056 + 0 with mate mapped to a different chr (mapQ>=5)
```
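Rather than eyeballing flagstat output for every sample, the mapped and properly-paired percentages can be extracted with awk. A sketch using a saved copy of the key lines from the output above:

```shell
# A stand-in for `samtools flagstat SP790.sam` output.
cat > flagstat_demo.txt <<'EOF'
575042 + 0 in total (QC-passed reads + QC-failed reads)
560287 + 0 mapped (97.43% : N/A)
493096 + 0 properly paired (86.96% : N/A)
EOF

# Field 5 on the "mapped" line and field 6 on the "properly paired" line
# hold the percentage; strip the "(" and "%" characters around it.
mapped=$(awk '/ mapped \(/ { gsub(/[(%]/, "", $5); print $5 }' flagstat_demo.txt)
paired=$(awk '/properly paired/ { gsub(/[(%]/, "", $6); print $6 }' flagstat_demo.txt)
echo "mapped: ${mapped}%  properly paired: ${paired}%"
```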
Looks good...
* 97.43 % mapped and
* 86.96% properly paired
----
Evaluate which maps the most reads and then use the loop for the rest of the samples:
Nora example code:
```
for i in *.1.fq.gz; do
NM=${i//.1.fq.gz/}
REVERSE=${i//.1.fq.gz/.2.fq.gz}
bwa mem -M -t 4 -K 10 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt $i $REVERSE > $NM.sam ; done
```
Actual code I used:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa $i $REVERSE > $NM.sam ; done
```
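The loop's naming relies on bash's `${var//pattern/replacement}` expansion; a minimal demonstration of what it produces for one file name:

```shell
i=SP790.1.fq
NM=${i//.1.fq/}             # suffix stripped -> sample name, used for the .sam output
REVERSE=${i//.1.fq/.2.fq}   # suffix swapped -> matching reverse-read file
echo "$NM $REVERSE"
```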
This looped the process to map all of the samples to the reference genome, with mapping percentages in the high 90s and ~80-90% properly paired reads.
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done
```
As long as your sorted .bam files are looking good, then you can delete the .sam files because they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
I got an error with the sorting function:
:::danger
* DP830 does not have a sort.bam
* DP840 does not have a sort.bam
:::
```
DP830.1.fq HP807.sort.bam MC940.1.fq PS608B.2.fq
DP830.2.fq IB1506B.1.fq MC940.2.fq PS608B.bam
DP830.bam IB1506B.2.fq MC940.bam PS608B.sam
DP830.sam IB1506B.bam MC940.sam PS608B.sort.bam
DP831.1.fq IB1506B.sam MC940.sort.bam PS613A.1.fq
DP831.2.fq IB1506B.sort.bam MC944.1.fq PS613A.2.fq
DP831.bam IB1519.1.fq MC944.2.fq PS613A.bam
DP831.sam IB1519.2.fq MC944.bam PS613A.sam
DP831.sort.bam IB1519.bam MC944.sam PS613A.sort.bam
DP838.1.fq IB1519.sam MC944.sort.bam PS614A.1.fq
DP838.2.fq IB1519.sort.bam MC946.1.fq PS614A.2.fq
DP838.bam IB1522B.1.fq MC946.2.fq PS614A.bam
DP838.sam IB1522B.2.fq MC946.bam PS614A.sam
DP838.sort.bam IB1522B.bam MC946.sam PS614A.sort.bam
DP840.1.fq IB1522B.sam MC946.sort.bam PS615A.1.fq
DP840.2.fq IB1522B.sort.bam MC947.1.fq PS615A.2.fq
DP840.bam IB1524B.1.fq MC947.2.fq PS615A.bam
DP840.sam IB1524B.2.fq MC947.bam PS615A.sam
DP840.sort.bam.tmp.0000.bam IB1524B.bam MC947.sam PS615A.sort.bam
DP840.sort.bam.tmp.0001.bam IB1524B.sam MC947.sort.bam SP772B.1.fq
DP840.sort.bam.tmp.0002.bam IB1524B.sort.bam MW537.1.fq SP772B.2.fq
DP840.sort.bam.tmp.0003.bam IB1527B.1.fq MW537.2.fq SP772B.bam
GM544.1.fq IB1527B.2.fq MW537.bam SP772B.sam
GM544.2.fq IB1527B.bam MW537.sam SP772B.sort.bam
GM544.bam IB1527B.sam MW537.sort.bam SP790.1.fq
GM544.sam IB1527B.sort.bam MW538.1.fq SP790.2.fq
GM544.sort.bam IB1528B.1.fq MW538.2.fq SP790.bam
GM556.1.fq IB1528B.2.fq MW538.bam SP790.sam
GM556.2.fq IB1528B.bam MW538.sam SP790.sort.bam
GM556.bam IB1528B.sam MW538.sort.bam SP792B.1.fq
GM556.sam IB1528B.sort.bam MW542.1.fq SP792B.2.fq
GM556.sort.bam IB2684.1.fq MW542.2.fq SP792B.bam
GM559.1.fq IB2684.2.fq MW542.bam SP792B.sam
GM559.2.fq IB2684.bam MW542.sam SP792B.sort.bam
GM559.bam IB2684.sam MW542.sort.bam SP796B.1.fq
GM559.sam IB2684.sort.bam MW543.1.fq SP796B.2.fq
GM559.sort.bam IB2685.1.fq MW543.2.fq SP796B.bam
GM564.1.fq IB2685.2.fq MW543.bam SP796B.sam
GM564.2.fq IB2685.bam MW543.sam SP796B.sort.bam
GM564.bam IB2685.sam MW543.sort.bam SR908B.1.fq
GM564.sam IB2685.sort.bam NR815.1.fq SR908B.2.fq
GM564.sort.bam IB2686.1.fq NR815.2.fq SR908B.bam
HP801.1.fq IB2686.2.fq NR815.bam SR908B.sam
HP801.2.fq IB2686.bam NR815.sam SR908B.sort.bam
HP801.bam IB2686.sam NR815.sort.bam SR912.1.fq
HP801.sam IB2686.sort.bam NR817.1.fq SR912.2.fq
HP801.sort.bam IB2687.1.fq NR817.2.fq SR912.bam
HP803.1.fq IB2687.2.fq NR817.bam SR912.sam
HP803.2.fq IB2687.bam NR817.sam SR912.sort.bam
HP803.bam IB2687.sam NR817.sort.bam SR914.1.fq
HP803.sam IB2687.sort.bam NR819.1.fq SR914.2.fq
HP803.sort.bam IB2693.1.fq NR819.2.fq SR914.bam
HP805.1.fq IB2693.2.fq NR819.bam SR914.sam
HP805.2.fq IB2693.bam NR819.sam SR914.sort.bam
HP805.bam IB2693.sam NR819.sort.bam SR922.1.fq
HP805.sam IB2693.sort.bam NR820.1.fq SR922.2.fq
HP805.sort.bam IB2695.1.fq NR820.2.fq SR922.bam
HP807.1.fq IB2695.2.fq NR820.bam SR922.sam
HP807.2.fq IB2695.bam NR820.sam SR922.sort.bam
HP807.bam IB2695.sam NR820.sort.bam
HP807.sam IB2695.sort.bam PS608B.1.fq
```
Trying those samples again...
```
[ewf7555@quser33 readfiles_all]$ samtools sort DP840.bam -o DP840.sort.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::hts_open_format] Failed to open file DP840.sort.bam.tmp.0000.bam
samtools sort: failed to create temporary file "DP840.sort.bam.tmp.0000.bam": File exists
(base) [ewf7555@quser33 readfiles_all]$
```
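The "File exists" error above is samtools refusing to clobber the temporary chunk files (`DP840.sort.bam.tmp.*.bam` in the listing) left behind by the earlier interrupted sort. A minimal cleanup sketch to run before retrying the sort (the sample name is hard-coded here as an example):

```shell
#!/bin/bash
# Remove leftover temporary chunks from an interrupted `samtools sort`
# so the re-run can create them fresh.
sample=DP840
count=0
shopt -s nullglob                  # glob expands to nothing if no tmp files exist
for tmp in "${sample}".sort.bam.tmp.*.bam; do
    rm -f "$tmp"
    count=$((count + 1))
done
shopt -u nullglob
echo "removed ${count} stale temp file(s) for ${sample}"
```

After the cleanup, `samtools sort DP840.bam -o DP840.sort.bam` should run without the temporary-file error.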
Writing a Slurm job script - open nano
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=20GB
#SBATCH --job-name=CACO_radtags2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=DP840bamsort.error
#SBATCH --output=DP840bamsort.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
samtools sort DP840.bam -o DP840.sort.bam
```
---
**Evaluating the alignments**
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this template:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
I=input.bam \
O=output.txt
```
Bash script for this, called run_picard.sh, in the output_qual directory:
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/Castilleja_emma/readfiles_all/"
# Output directory
OUTPUT_DIR="/projects/p32037/Castilleja_emma/readfiles_all/output_qual"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
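The headline number in each metrics file is the fraction of read pairs that aligned. As a quick alternative to opening each file, here is a hedged awk sketch that pulls `PCT_PF_READS_ALIGNED` from the `PAIR` row of each `*_metrics.txt` (column and category names follow the documented Picard format - verify them against the header of your own files):

```shell
#!/bin/bash
# Tabulate per-sample alignment rate from Picard CollectAlignmentSummaryMetrics output.
# PCT_PF_READS_ALIGNED and the PAIR category row are the documented Picard names.
aln_rate () {
    awk -F'\t' '
        $1 == "CATEGORY" {                 # header row: find the column index by name
            for (i = 1; i <= NF; i++)
                if ($i == "PCT_PF_READS_ALIGNED") col = i
        }
        $1 == "PAIR" && col { print FILENAME "\t" $col }
    ' "$1"
}
for m in *_metrics.txt; do
    [ -e "$m" ] || continue                # skip if no metrics files in this directory
    aln_rate "$m"
done
```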
This created a lot of files with metrics by sample, although it looks like DP840 didn't work too well
```
DP831.sort_metrics.txt GM564.sort_metrics.txt IB2684.sort_metrics.txt MW542.sort_metrics.txt run_picard.sh
DP838.sort_metrics.txt HP801.sort_metrics.txt IB2685.sort_metrics.txt MW543.sort_metrics.txt SP790.sort_metrics.txt
DP840.sort.bam.tmp.0000_metrics.txt HP803.sort_metrics.txt IB2687.sort_metrics.txt NR819.sort_metrics.txt SP792B.sort_metrics.txt
DP840.sort.bam.tmp.0001_metrics.txt HP805.sort_metrics.txt IB2695.sort_metrics.txt NR820.sort_metrics.txt SP796B.sort_metrics.txt
DP840.sort.bam.tmp.0002_metrics.txt HP807.sort_metrics.txt MC940.sort_metrics.txt PS608B.sort_metrics.txt SR912.sort_metrics.txt
DP840.sort.bam.tmp.0003_metrics.txt IB1519.sort_metrics.txt MC944.sort_metrics.txt PS613A.sort_metrics.txt SR914.sort_metrics.txt
GM556.sort_metrics.txt IB1522B.sort_metrics.txt MC947.sort_metrics.txt PS614A.sort_metrics.txt SR922.sort_metrics.txt
GM559.sort_metrics.txt IB1524B.sort_metrics.txt MW538.sort_metrics.txt PS615A.sort_metrics.txt
```
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags)
[MultiQC report](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/Castilleja_emma/readfiles_all/multiqc_report.html#picard-alignmentsummary ) here

# STACKS: getting set up
### **Making a population map**
Start off by creating a sorted bam folder and moving the sorted bam files into it. Next you will need to create a population map file. There should be no header and just two columns: the first with sample IDs (including the `.sort` suffix if using a reference), and the second with the population assignment. Make sure to put a tab between the sample ID and the population.
Pop map example:
Fug_B13 Bos
Fug_B14 Bos
Fug_B15 Bos
Fug_B24 Bos
Fug_F25 Fife
Fug_F26 Fife
Fug_F36 Fife
Fug_S1 Sev
Fug_S11 Sev
Fug_S12 Sev
Fug_S2 Sev
My popmap file:
saved as popmap.txt in this directory:
`/projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt`
```
DP830.sort DP
DP831.sort DP
DP838.sort DP
GM544.sort GM
GM556.sort GM
GM559.sort GM
GM564.sort GM
HP801.sort HP
HP803.sort HP
HP805.sort HP
HP807.sort HP
IB1506B.sort IB1
IB1519.sort IB1
IB1522B.sort IB1
IB1524B.sort IB1
IB1528B.sort IB1
IB2684.sort IB2
IB2685.sort IB2
IB2686.sort IB2
IB2687.sort IB2
IB2693.sort IB2
IB2695.sort IB2
MC940.sort MC
MC944.sort MC
MC946.sort MC
MC947.sort MC
MW537.sort MW
MW538.sort MW
MW542.sort MW
MW543.sort MW
NR815.sort NR
NR817.sort NR
NR819.sort NR
NR820.sort NR
PS608B.sort PS
PS613A.sort PS
PS614A.sort PS
PS615A.sort PS
SP772B.sort SP
SP790.sort SP
SP796B.sort SP
SR908B.sort SR
SR912.sort SR
SR914.sort SR
SR922.sort SR
```
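A popmap in this format can also be generated from the sorted-BAM filenames instead of typed by hand. This sketch assumes the population code is the leading run of capital letters in each sample ID; the IB1/IB2 populations carry a digit in their code, so those lines would need a hand edit afterwards:

```shell
#!/bin/bash
# Sketch: build a two-column, tab-separated popmap from sorted-BAM filenames.
# Assumption: population code = leading capital letters of the sample ID
# (IB1/IB2 include a digit, so fix those lines by hand).
make_popmap () {
    for f in "$@"; do
        sample=${f%.bam}                   # DP830.sort.bam -> DP830.sort
        id=${sample%.sort}                 # strip .sort to get the raw sample ID
        pop=$(printf '%s' "$id" | sed -E 's/^([A-Z]+).*/\1/')
        printf '%s\t%s\n' "$sample" "$pop"
    done
}
# Usage against a real directory of sorted BAMs:
# make_popmap *.sort.bam > popmap/popmap.txt
```

Redirect the output to `popmap/popmap.txt` and double-check the IB lines before running gstacks.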
# STACKS: gstacks
```
module load stacks/2.64-gcc-9.2.0
```
It is a good idea to run this inside a `screen` session
```
gstacks -I ~/sorted_bams/ -M ~/popmap --rm-pcr-duplicates -O ~/reference_stacks/ -t 8
```
My version
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt --rm-pcr-duplicates -O /projects/p32037/Castilleja_emma/reference_stacks/ -t 8
Locus/sample distributions will be written to '/projects/p32037/Castilleja_emma/reference_stacks/gstacks.log.distribs'.
```
Output:
Removed 24790321 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (97.2%); kept 726115 read pairs.
Genotyped 40974 loci: effective per-sample coverage mean=1.3x, stdev=0.1x, min=1.1x, max=1.5x; mean number of sites per locus: 176.2. A consistent phasing was found for 81688 out of 83342 (98.0%) diploid loci needing phasing.
:::danger
Low coverage and a high proportion of reads removed as PCR duplicates - https://groups.google.com/g/stacks-users/c/qmAxcH7cbus
This forum thread suggests you can't use `--rm-pcr-duplicates` on ddRAD data?
:::
Trying again without `--rm-pcr-duplicates`
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -O /projects/p32037/Castilleja_emma/reference_stacks/ -t 8
```
Results:
```
Read 303512925 BAM records:
kept 61320283 primary alignments (20.5%), of which 32416401 reverse reads
skipped 202953667 primary alignments with insufficient mapping qualities (67.9%)
skipped 15528934 excessively soft-clipped primary alignments (5.2%)
skipped 18950917 unmapped reads (6.3%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 6744731.7 records/sample (218894-16733496)
kept 11.5%-25.4% of these
Built 53572 loci comprising 28903882 forward reads and 25516436 matching paired-end reads; mean insert length was 183.6 (sd: 56.9).
Genotyped 53572 loci:
effective per-sample coverage: mean=51.9x, stdev=38.4x, min=4.3x, max=119.2x
mean number of sites per locus: 157.8
a consistent phasing was found for 101650 of out 148375 (68.5%) diploid loci needing phasing
```
**Only kept 20% of primary alignments??**
* largest proportion is due to poor alignment (skipped 202953667 primary alignments with insufficient mapping qualities 67.9%)
from Google Group:
*the gstacks program is telling you precisely why it skipped the reads – they aligned really poorly.
Take a look at the alignments yourself, say using samtools to look at the CIGAR strings, to get an intuition for how well the reads aligned. Is the reference the same species as your ddRAD data?
Despite throwing out 80% of your data, gstacks still built 50k loci with very high coverage (~50x), so that may still be good, if you can understand why so many reads aligned poorly. Anyway, you can tell gstacks to keep the reads, but you should be cautious with the resulting RAD loci that are built unless you can understand why so much data aligned poorly.*
**Looking at CIGAR strings:**
more info on the process here: Population Genomics Analysis with RAD, Reprised: Stacks 2 https://link.springer.com/protocol/10.1007/978-1-0716-2313-8_7
Looking at the gstacks.log.distribs for per sample info:
```
$ stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > bam_stats_per_sample.tsv
```
This exported the file into bam_stats_per_sample.tsv, moving to R...
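Before (or instead of) R, the kept fraction per sample can be computed directly from the TSV with awk. The column names `sample`, `records`, and `primary_kept` are assumptions here - check them against the header line of your bam_stats_per_sample.tsv:

```shell
#!/bin/bash
# Sketch: per-sample fraction of BAM records kept by gstacks, straight from the TSV.
# Assumes the first line is a column header; names `sample`, `records`,
# `primary_kept` are assumptions -- verify against your file.
kept_frac () {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }   # map names to indices
        { printf "%s\t%.3f\n", $(idx["sample"]), $(idx["primary_kept"]) / $(idx["records"]) }
    ' "$1"
}
# kept_frac bam_stats_per_sample.tsv | sort -t$'\t' -k2,2n
```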
**Primary Alignments kept per sample:**

**Fraction of alignments kept per sample:**

**Primary alignments kept by population:**

**Fraction of total alignments kept by population**

# STACKS: populations
*Using the output from the gstacks run without `--rm-pcr-duplicates`*
```
populations -P /projects/p32037/Castilleja_emma/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -r 0.7 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --hwe -t 8
Removed 32934 loci that did not pass sample/population constraints from 53572 loci.
Kept 20638 loci, composed of 4066480 sites; 19536 of those sites were filtered, 53552 variant sites remained.
3931634 genomic sites, of which 127659 were covered by multiple loci (3.2%).
Mean genotyped sites per locus: 175.42bp (stderr 0.38).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
DP: 3 samples per locus; pi: 0.10782; all/variant/polymorphic sites: 841059/8242/2043; private alleles: 375
GM: 3.2966 samples per locus; pi: 0.1204; all/variant/polymorphic sites: 871058/9555/2705; private alleles: 407
HP: 3.3522 samples per locus; pi: 0.15913; all/variant/polymorphic sites: 1962432/28347/10157; private alleles: 1411
IB1: 4.3691 samples per locus; pi: 0.11759; all/variant/polymorphic sites: 1099134/12722/4232; private alleles: 650
IB2: 5.5786 samples per locus; pi: 0.17175; all/variant/polymorphic sites: 2090579/29229/14642; private alleles: 2659
MC: 3.6263 samples per locus; pi: 0.22013; all/variant/polymorphic sites: 2766092/39055/20186; private alleles: 3633
MW: 3.131 samples per locus; pi: 0.19258; all/variant/polymorphic sites: 2198921/31495/13707; private alleles: 2159
NR: 3.6435 samples per locus; pi: 0.1906; all/variant/polymorphic sites: 2763281/39961/16937; private alleles: 2671
PS: 3.6204 samples per locus; pi: 0.18455; all/variant/polymorphic sites: 2119028/32278/13835; private alleles: 2033
SP: 3 samples per locus; pi: 0.13971; all/variant/polymorphic sites: 1221880/14446/4612; private alleles: 714
SR: 3.4173 samples per locus; pi: 0.18378; all/variant/polymorphic sites: 2177639/31518/13390; private alleles: 2172
Number of variable sites found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 0
GM: 0
HP: 0
IB1: 25
IB2: 345
MC: 0
MW: 0
NR: 0
PS: 0
SP: 0
SR: 0
Number of loci found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 829
GM: 876
HP: 2544
IB1: 1009
IB2: 2187
MC: 3622
MW: 2586
NR: 4353
PS: 2628
SP: 1203
SR: 2559
(more detail in populations.sumstats.tsv and populations.hapstats.tsv)
populations -P /projects/p32037/Castilleja_emma/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -r 0. --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --hwe -t 8
Removed 38108 loci that did not pass sample/population constraints from 53572 loci.
Kept 15464 loci, composed of 3102777 sites; 36470 of those sites were filtered, 23855 variant sites remained.
3021309 genomic sites, of which 76862 were covered by multiple loci (2.5%).
Mean genotyped sites per locus: 179.50bp (stderr 0.44).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
DP: 3 samples per locus; pi: 0.22138; all/variant/polymorphic sites: 844928/3171/1541; private alleles: 137
GM: 4 samples per locus; pi: 0.18413; all/variant/polymorphic sites: 372599/932/428; private alleles: 5
HP: 4 samples per locus; pi: 0.2032; all/variant/polymorphic sites: 916619/4092/2007; private alleles: 92
IB1: 4.3177 samples per locus; pi: 0.22953; all/variant/polymorphic sites: 1103722/5373/3166; private alleles: 181
IB2: 5.5077 samples per locus; pi: 0.28504; all/variant/polymorphic sites: 2100591/15727/11989; private alleles: 1544
MC: 4 samples per locus; pi: 0.28718; all/variant/polymorphic sites: 1990720/12926/8999; private alleles: 915
MW: 4 samples per locus; pi: 0.18457; all/variant/polymorphic sites: 452675/1465/699; private alleles: 37
NR: 4 samples per locus; pi: 0.25021; all/variant/polymorphic sites: 1937957/13666/7851; private alleles: 777
PS: 4 samples per locus; pi: 0.25848; all/variant/polymorphic sites: 1551478/9934/6162; private alleles: 489
SP: 3 samples per locus; pi: 0.25723; all/variant/polymorphic sites: 1227442/6428/3593; private alleles: 206
SR: 4 samples per locus; pi: 0.23942; all/variant/polymorphic sites: 1205358/5598/3352; private alleles: 249
Number of variable sites found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 0
GM: 0
HP: 0
IB1: 25
IB2: 345
MC: 0
MW: 0
NR: 0
PS: 0
SP: 0
SR: 0
Number of loci found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 1379
GM: 560
HP: 1478
IB1: 1671
IB2: 3736
MC: 3497
MW: 700
NR: 3746
PS: 2499
SP: 1968
SR: 1835
(more detail in populations.sumstats.tsv and populations.hapstats.tsv)
Population pair divergence statistics (more in populations.fst_summary.tsv and populations.phistats_summary.tsv):
DP-GM: mean Fst: 0.076843; mean Phi_st: -0.045204; mean Fst': -0.055206; mean Dxy: 0.0022836
DP-HP: mean Fst: 0.13176; mean Phi_st: 0.022563; mean Fst': 0.0047729; mean Dxy: 0.002819
DP-IB1: mean Fst: 0.10093; mean Phi_st: 0.010901; mean Fst': 0.00043457; mean Dxy: 0.0030542
DP-IB2: mean Fst: 0.092756; mean Phi_st: 0.034092; mean Fst': 0.021025; mean Dxy: 0.0030585
DP-MC: mean Fst: 0.10269; mean Phi_st: 0.020933; mean Fst': 0.0098509; mean Dxy: 0.0032087
DP-MW: mean Fst: 0.11275; mean Phi_st: 0.023039; mean Fst': -0.0013835; mean Dxy: 0.002706
DP-NR: mean Fst: 0.15515; mean Phi_st: 0.086261; mean Fst': 0.070418; mean Dxy: 0.0031068
DP-PS: mean Fst: 0.13457; mean Phi_st: 0.083704; mean Fst': 0.06409; mean Dxy: 0.003278
DP-SP: mean Fst: 0.1174; mean Phi_st: -0.015306; mean Fst': -0.030904; mean Dxy: 0.003005
DP-SR: mean Fst: 0.1109; mean Phi_st: 0.018917; mean Fst': 0.0017189; mean Dxy: 0.0027452
GM-HP: mean Fst: 0.083611; mean Phi_st: -0.002185; mean Fst': -0.021962; mean Dxy: 0.002352
GM-IB1: mean Fst: 0.077449; mean Phi_st: -0.010154; mean Fst': -0.019626; mean Dxy: 0.0030394
GM-IB2: mean Fst: 0.074612; mean Phi_st: 0.034112; mean Fst': 0.0055244; mean Dxy: 0.0023244
GM-MC: mean Fst: 0.092171; mean Phi_st: 0.014567; mean Fst': -0.0061395; mean Dxy: 0.0020821
GM-MW: mean Fst: 0.11822; mean Phi_st: 0.014353; mean Fst': -0.0054951; mean Dxy: 0.0021802
GM-NR: mean Fst: 0.1419; mean Phi_st: 0.093246; mean Fst': 0.06632; mean Dxy: 0.0027865
GM-PS: mean Fst: 0.12219; mean Phi_st: 0.062718; mean Fst': 0.039221; mean Dxy: 0.0023461
GM-SP: mean Fst: 0.088249; mean Phi_st: -0.024104; mean Fst': -0.040362; mean Dxy: 0.0023379
GM-SR: mean Fst: 0.092541; mean Phi_st: 0.019877; mean Fst': 0.0066349; mean Dxy: 0.0022495
HP-IB1: mean Fst: 0.10148; mean Phi_st: 0.04359; mean Fst': 0.038607; mean Dxy: 0.003577
HP-IB2: mean Fst: 0.095374; mean Phi_st: 0.073303; mean Fst': 0.068848; mean Dxy: 0.0038316
HP-MC: mean Fst: 0.11275; mean Phi_st: 0.065318; mean Fst': 0.048885; mean Dxy: 0.0034525
HP-MW: mean Fst: 0.11758; mean Phi_st: 0.050438; mean Fst': 0.029142; mean Dxy: 0.002783
HP-NR: mean Fst: 0.16736; mean Phi_st: 0.14848; mean Fst': 0.1364; mean Dxy: 0.0038734
HP-PS: mean Fst: 0.12506; mean Phi_st: 0.088672; mean Fst': 0.077628; mean Dxy: 0.0036764
HP-SP: mean Fst: 0.11814; mean Phi_st: 0.053396; mean Fst': 0.035201; mean Dxy: 0.0032341
HP-SR: mean Fst: 0.11997; mean Phi_st: 0.073848; mean Fst': 0.061435; mean Dxy: 0.0033213
IB1-IB2: mean Fst: 0.077456; mean Phi_st: 0.032603; mean Fst': 0.028078; mean Dxy: 0.0041143
IB1-MC: mean Fst: 0.099536; mean Phi_st: 0.042905; mean Fst': 0.030725; mean Dxy: 0.0036717
IB1-MW: mean Fst: 0.080111; mean Phi_st: 0.009675; mean Fst': -0.0024801; mean Dxy: 0.0028004
IB1-NR: mean Fst: 0.14482; mean Phi_st: 0.1097; mean Fst': 0.10356; mean Dxy: 0.0040678
IB1-PS: mean Fst: 0.10352; mean Phi_st: 0.059911; mean Fst': 0.052647; mean Dxy: 0.0037677
IB1-SP: mean Fst: 0.095132; mean Phi_st: 0.021318; mean Fst': 0.0060615; mean Dxy: 0.003686
IB1-SR: mean Fst: 0.1012; mean Phi_st: 0.051603; mean Fst': 0.042171; mean Dxy: 0.0036588
IB2-MC: mean Fst: 0.095358; mean Phi_st: 0.066416; mean Fst': 0.06204; mean Dxy: 0.0046185
IB2-MW: mean Fst: 0.096803; mean Phi_st: 0.072078; mean Fst': 0.052561; mean Dxy: 0.0030746
IB2-NR: mean Fst: 0.12628; mean Phi_st: 0.12439; mean Fst': 0.13364; mean Dxy: 0.004951
IB2-PS: mean Fst: 0.096559; mean Phi_st: 0.073483; mean Fst': 0.074299; mean Dxy: 0.004654
IB2-SP: mean Fst: 0.08334; mean Phi_st: 0.023605; mean Fst': 0.020569; mean Dxy: 0.0040226
IB2-SR: mean Fst: 0.09741; mean Phi_st: 0.067676; mean Fst': 0.059311; mean Dxy: 0.0039687
MC-MW: mean Fst: 0.096171; mean Phi_st: 0.02607; mean Fst': 0.0024765; mean Dxy: 0.0024299
MC-NR: mean Fst: 0.1325; mean Phi_st: 0.10489; mean Fst': 0.10668; mean Dxy: 0.0043845
MC-PS: mean Fst: 0.11595; mean Phi_st: 0.081511; mean Fst': 0.080328; mean Dxy: 0.0041826
MC-SP: mean Fst: 0.1061; mean Phi_st: 0.039654; mean Fst': 0.026668; mean Dxy: 0.0035108
MC-SR: mean Fst: 0.096817; mean Phi_st: 0.035289; mean Fst': 0.025164; mean Dxy: 0.0034829
MW-NR: mean Fst: 0.14674; mean Phi_st: 0.084415; mean Fst': 0.072261; mean Dxy: 0.0025794
MW-PS: mean Fst: 0.097104; mean Phi_st: 0.020413; mean Fst': 0.0022018; mean Dxy: 0.0025054
MW-SP: mean Fst: 0.10348; mean Phi_st: 0.012837; mean Fst': -0.002628; mean Dxy: 0.0027021
MW-SR: mean Fst: 0.10065; mean Phi_st: 0.026811; mean Fst': 0.0086096; mean Dxy: 0.0025026
NR-PS: mean Fst: 0.15915; mean Phi_st: 0.14338; mean Fst': 0.14553; mean Dxy: 0.0045047
NR-SP: mean Fst: 0.15432; mean Phi_st: 0.094115; mean Fst': 0.088347; mean Dxy: 0.0039141
NR-SR: mean Fst: 0.14261; mean Phi_st: 0.10174; mean Fst': 0.095176; mean Dxy: 0.0037862
PS-SP: mean Fst: 0.11962; mean Phi_st: 0.063339; mean Fst': 0.053603; mean Dxy: 0.0037309
PS-SR: mean Fst: 0.12255; mean Phi_st: 0.077635; mean Fst': 0.068894; mean Dxy: 0.0035355
SP-SR: mean Fst: 0.11013; mean Phi_st: 0.036235; mean Fst': 0.025591; mean Dxy: 0.0032607
```
Now I need to run VCFtools to clean/filter these data
# VCFtools introduction
Dylan's example code for filtering the populations.snps.vcf file:
```
vcftools --vcf input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out renamed.output.vcf
```
### VCF Parameters:
**Depth:**
```
--min-meanDP <float>
--max-meanDP <float>
```
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.
**Hardy-Weinberg Equilibrium:**
```
--hwe <float>
```
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.
**Missing data:**
```
--max-missing <float>
```
Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
```
--max-missing-count <integer>
```
Exclude sites with more than this number of missing genotypes over all individuals.
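These two options are two views of the same cutoff: with N genotyped individuals, `--max-missing m` keeps a site only if at least m x N genotypes are called, i.e. at most N x (1 - m) are missing. A small helper makes the conversion explicit (the epsilon guards against floating-point round-down):

```shell
#!/bin/bash
# Convert a --max-missing proportion (required call rate) into the
# equivalent --max-missing-count for a given number of individuals.
max_missing_count () {
    # $1 = number of individuals, $2 = --max-missing value
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%d\n", int(n * (1 - m) + 1e-9) }'
}
max_missing_count 45 0.50    # 45 individuals at a 50% call rate -> up to 22 missing
```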
**Phased:**
```
--phased
```
Excludes all sites that contain unphased genotypes.
**MISCELLANEOUS FILTERING**
```
--minQ <float>
```
Includes only sites with Quality value above this threshold.
# VCF filtering: Preliminary analysis
Following the workflow from https://speciationgenomics.github.io/filtering_vcfs/ to determine stats for this dataset
1. Make a new VCF directory and copy the populations VCF file into it
```
mkdir vcf
cp /projects/p32037/Castilleja_emma/reference_stacks/populations.snps.vcf ./
```
2. Make a directory for the VCFtools output and set shortcut variables for the VCF file and the output prefix
```
mkdir vcftools
pop_vcf=/projects/p32037/Castilleja_emma/reference_stacks/populations.snps.vcf
OUT=/projects/p32037/Castilleja_emma/reference_stacks/vcftools/caco_pops
```
Calculate:
Allele frequency
`vcftools --gzvcf $pop_vcf --freq2 --out $OUT --max-alleles 2`
Mean depth per indiv
`vcftools --gzvcf $pop_vcf --depth --out $OUT`
Mean depth per site
`vcftools --gzvcf $pop_vcf --site-mean-depth --out $OUT`
Site quality
`vcftools --gzvcf $pop_vcf --site-quality --out $OUT`
Proportion missing data/indiv
`vcftools --gzvcf $pop_vcf --missing-indv --out $OUT`
Proportion missing data/site
`vcftools --gzvcf $pop_vcf --missing-site --out $OUT`
Het and inbred per indiv
`vcftools --gzvcf $pop_vcf --het --out $OUT`
### Moving these outputs into R...
1. Variant quality
Produced a file where the quality is -1 for each site - this is likely because Stacks writes `.` in the VCF QUAL column (populations does not compute a per-site quality score), and vcftools reports the missing value as -1.
2. Variant mean depth
Based on this, it looks like we could set our **minimum to 10x and maximum depth to 50** ( ~ 2x our mean depth)
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 1.00 | 10.50 | 18.40 | 26.52 | 29.82 | 3924.00 |

3. Variant missingness
The mean here is 0.6899 and the median is 0.733, which seems very high - we may have to allow a large proportion of missing data here (50% call rate?).
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.000 | 0.5556 | 0.7333 | 0.6899 | 0.889 | 0.9333 |

4. Minor allele frequency
Maybe set this to 0.1 or 0.15 based on the data?
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.01351 | 0.1000 | 0.18750 | 0.22612 | 0.35000 | 0.5 |

---
### Individual based statistics
1. Mean depth per individual
The range is pretty extreme here, 2.071 - 127.826, suggesting an issue with per-individual sequencing depth.
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 2.071 | 8.997 | 29.206 | 32.607 | 49.057| 127.826 |

2. Proportion of missing data per individual
The proportion of missing data per individual is large, 0.35 - 0.96.
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.3576 | 0.4575 | 0.7652 | 0.6899 | 0.8497 | 0.9611 |

3. Heterozygosity and inbreeding coefficient per individual
Range of -0.17 to 0.39
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| -0.17667 | 0.03222 | 0.06950 | 0.08637 | 0.12000 | 0.39499 |

Based on these results, here are some settings that may be useful, although I have concerns about the quality and missingness metrics:
`MAF=0.15`
`MISS=0.75`
`MIN_DEPTH=10`
`MAX_DEPTH=50`
# Applying VCF filters
Operating in the reference_stacks directory
First attempt: output as
populations_stackstry1.ouput.vcf, using the populations.snps.vcf file from the populations run (this has all individuals/all populations).
Ran using the "default parameters" from Dylan's notes and the analysis above.
First load the vcftools module
`module load vcftools/0.1.17`
Apply the filters....
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.80 --recode --out populations_stackstry1.ouput.vcf
```
```
vcftools
--vcf populations.snps.vcf
--min-meanDP 5
--minDP 5
--max-meanDP 50
--maxDP 50
--mac 3
--remove-indels
--max-alleles 2
--min-alleles 2
--max-missing 0.80
--recode
--out populations_stackstry1.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 5
--min-meanDP 5
--max-missing 0.8
--out populations_stackstry1.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 23789 Sites
No data left for analysis!
Run Time = 1.00 seconds
```
0 sites kept...not great!
Trying again, changing --max-missing to 0.50 and --minDP to 10
```
vcftools
--vcf populations.snps.vcf
--min-meanDP 5
--minDP 10
--max-meanDP 50
--maxDP 50
--mac 3
--remove-indels
--max-alleles 2
--min-alleles 2
--max-missing 0.50
--recode
--out populations_stackstry2.ouput.vcf
```
Output:
```
(base) [ewf7555@quser32 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out populations_stackstry2.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 10
--min-meanDP 5
--max-missing 0.5
--out populations_stackstry2.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Trying again with 0.49
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.49 --recode --out populations_stackstry3.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 10
--min-meanDP 5
--max-missing 0.49
--out populations_stackstry3.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 0.00 seconds
```
This is the same output; I don't think 145 sites is good enough.
Some populations have a lot of missing data, particularly GM and MW.
Remove GM and MW? We would have to re-run gstacks and populations to do this (?)
Maybe not....
Changed --max-missing to 0.6 and --minDP to 5, and added --maf 0.2
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.6 --maf 0.2 --recode --out populations_stackstry3.ouput.vcf
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--maf 0.2
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 5
--min-meanDP 5
--max-missing 0.6
--out populations_stackstry3.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 13 out of a possible 23789 Sites
```
# Troubleshooting
## Notes on removing "Bad apples"
Primarily from Cerca et al. 2021, "Removing the bad apples: A simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms"
https://doi.org/10.1111/2041-210X.13562
Steps they used:
1. Run STACKs with the complete dataset (all individuals) - this is called the "unclean" dataset
2. Run STACKs for each population individually, then use vcftools to investigate missingness (--missing-indv) and depth of coverage (--depth)
3. With the whole dataset in mind, we labelled individuals as to keep or remove (bad apples), following a general strategy which included: (a) retaining a minimum of two individuals per population or species; (b) designating a threshold for missing data, based on the average missing data for each whole dataset(≥40% missing data for Stygocapitella, ≥30% for Euhadra, ≥65% for Dendrilla, ≥40% for Anthochaera).
## Zoe's code for Clarkia data
This is after running the denovo_map.pl (no filtering)
```
#### Quality filtering for CB. /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter
vcftools --vcf populations.snps.vcf --missing-indv
# It looks like one individual CB-58 failed and CB-73 and CB-77 did not do well. Remove them to keep 32 individuals
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 3568
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 2492
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 458
```
So, trying this approach with my data, starting with the populations.snps.vcf file
Checking CACO data for missing individuals:
Doing this filter in a new directory
`mkdir zfilter`
`vcftools --vcf populations.snps.vcf --missing-indv`
Inspecting the output file out.imiss, it looks like there are quite a few individuals with greater than 50% missing data...
Making a list of these individuals
```
awk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv
##Output file - 32 individuals have greater than 50% missing data
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1506B.sort
IB1519.sort
IB1522B.sort
IB1524B.sort
IB1528B.sort
IB2687.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
PS608B.sort
PS613A.sort
PS614A.sort
PS615A.sort
SP772B.sort
SP790.sort
SP796B.sort
SR908B.sort
SR912.sort
SR914.sort
SR922.sort
```
Maybe see how many have 80% missing data?
```
awk '$5 > 0.8' out.imiss | cut -f1 > verylowDP.indv
#looks like 17 individuals had 80% missing
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1522B.sort
IB1524B.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```
90%??
```
awk '$5 > 0.9' out.imiss | cut -f1 > superlowDP.indv
#Output 8 individuals
INDV
GM544.sort
GM556.sort
GM559.sort
GM564.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```
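The three awk passes above can be collapsed into one that tallies every cutoff at once (F_MISS is assumed to be column 5 of out.imiss, as in the commands above; `NR > 1` skips the header so the INDV line is not counted):

```shell
#!/bin/bash
# One-pass tally: how many individuals exceed each missingness cutoff 0.5 .. 0.9.
# Assumes the out.imiss layout used above, with F_MISS in column 5.
imiss_tally () {
    awk 'NR > 1 { for (t = 5; t <= 9; t++) if ($5 > t / 10) n[t]++ }
         END    { for (t = 5; t <= 9; t++) printf "> 0.%d\t%d\n", t, n[t] + 0 }' "$1"
}
# imiss_tally out.imiss
```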
Looks like GM and MW are not great in general.
Why would entire populations not sequence well?
I guess we will remove them and continue on for now.
```
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove superlowDP.indv --recode --recode-INFO-all --out subset
Excluding individuals in 'exclude' list
After filtering, kept 37 out of 45 Individuals
Outputting VCF file...
After filtering, kept 23789 out of a possible 23789 Sites
```
**Mean depth values**
Mean depth of 15
```
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 5077 sites
```
Mean depth of 5
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --recode --recode-INFO-all --out fil7
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--min-meanDP 5
--out fil7
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 12512 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
**Now on to the missing data...**
missing data is 10%
```
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # 577 out of a possible 23789 Sites
```
Missing data at most 10% but mean depth of 5:
`vcftools --vcf subset.recode.vcf --min-meanDP 5 --max-missing 0.90 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset`
**Final filtering**
try # 1 with Zoe's parameters
```
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 424 (Zoe's kept 458)
```
Jeremie's idea:
min depth is 5
maf is 0.15
hwe 0.05
no max missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--maf 0.15
--max-alleles 2
--hwe 0.05
--min-meanDP 5
--out filtered.final2.subset
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 5151 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Jeremie's idea:
min depth is 5
maf is 0.15
no hwe
no max missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --recode --recode-INFO-all --out filtered.final3.subset
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--maf 0.15
--min-meanDP 5
--out filtered.final3.subset
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 6616 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Jeremie's thoughts:
Make one data set that is in Hardy-Weinberg equilibrium
and one that is not.
If linkage is evenly distributed, then we don't have to filter for linkage.
**Checking the R² values**
```
vcftools --vcf filtered.final.subset.recode.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
Looking at linkage... a lot of pairs of loci are linked (R² above 0.8, ranging up to 1)
Following https://speciationgenomics.github.io/pca/
```
# Checking output with plink and pruning for linkage
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out plink_zfilter
#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: plink_zfilter-temporary.bed + plink_zfilter-temporary.bim +
plink_zfilter-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink_zfilter.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964686.
424 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Pruned 15 variants from chromosome 27, leaving 10.
Pruned 31 variants from chromosome 28, leaving 15.
Pruned 7 variants from chromosome 29, leaving 8.
Pruned 16 variants from chromosome 30, leaving 12.
Pruned 0 variants from chromosome 31, leaving 1.
Pruned 31 variants from chromosome 32, leaving 13.
Pruned 25 variants from chromosome 33, leaving 10.
Pruned 23 variants from chromosome 34, leaving 10.
Pruned 33 variants from chromosome 35, leaving 13.
Pruned 17 variants from chromosome 36, leaving 8.
Pruned 37 variants from chromosome 37, leaving 10.
Pruned 21 variants from chromosome 38, leaving 10.
Pruned 33 variants from chromosome 39, leaving 14.
Pruned 0 variants from chromosome 40, leaving 1.
Pruning complete. 289 of 424 variants removed.
Marker lists written to plink_zfilter.prune.in and plink_zfilter.prune.out .
Segmentation fault
```
```
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_zfilter.prune.in --make-bed --pca --out CACO_zfil
#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: CACO_zfil-temporary.bed + CACO_zfil-temporary.bim +
CACO_zfil-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to CACO_zfil.nosex .
--extract: 135 variants remaining.
Using up to 51 threads (change this with --threads).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964765.
135 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Relationship matrix calculation complete.
--pca: Results saved to CACO_zfil.eigenval and CACO_zfil.eigenvec .
--make-bed to CACO_zfil.bed + CACO_zfil.bim + CACO_zfil.fam ... done.
Segmentation fault
```
Going to R to see what we have here...
---
```
### Filtering for SNP panel. Using the 458 SNPs called - has maf 17%, in HWE, 10% missing data and mean depth 15X. /data/clarkia/sv/cb/snp.panel
#### Now have to filter the original fasta file to get the filtered fasta sequences. /data/Dimensions/clarkia/sv/cb/snp.panel/fasta.filter
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one #only keep first column of the vcf file that has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers #remove the 15 header rows of the vcf file
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep #add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa #sort the original fasta to only keep the high quality ones
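#### Toy check of the three matching steps above (demo file names, not part of the pipeline). The grep keeps each matching header line plus the sequence line after it.
printf '12\t55\n98\t30\n' > demo.body                                      # col 1 = locus ID, col 2 = position
cut -f 1 demo.body | sed -e 's/^/>CLocus_/' > demo.keep                    # ">CLocus_12" and ">CLocus_98"
printf '>CLocus_12 [x]\nACGT\n>CLocus_13 [x]\nTTTT\n' > demo.fa
grep -w -A 1 -f demo.keep demo.fa --no-group-separator > demo.filtered.fa  # keeps only CLocus_12 + its sequence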
#### Now we need to sort fasta file to get SNPs with at least 50 bp on either side of SNP. First get the length of characters in each line of the file. Then combine them together
awk '{ print length }' filtered.fasta.fa > count # get the count of each line in the fasta file
sed -n '1~2!p' count > length #keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus #paste the locus ID and the length of that locus
#### Now get the position of each
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file that has the high quality SNPs
sed '1,15d' keep.col.two > positions #remove headers of the vcf file
#### Combine the locus ID, length and position
paste list.to.keep length positions | column -s $'\t' -t > length.position #combine and have them formatted into tabulated columns
#### Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and keep only the loci with a difference greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.
awk '(NR>0) && ($3 > 50 ) ' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 ) ' difference > right.left.50
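#### Toy check of the 50 bp flank filter above (demo data; the real pipeline runs on length.position). The demo puts the difference in column 4 directly; the $5/$4 pairing above still works because awk's default field splitting collapses the empty field the $5 assignment creates.
printf 'locA\t200\t100\nlocB\t200\t30\nlocC\t200\t180\n' > demo.length.position  # id, locus length, SNP position
awk '$3 > 50' demo.length.position > demo.right                                  # drops locB: only 30 bp before the SNP
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' demo.right > demo.diff             # col 4 = bp after the SNP
awk '$4 > 50' demo.diff > demo.both                                              # drops locC: only 20 bp after
# demo.both now holds locA only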
#### Filter original fasta file to only keep these 346 loci
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
### Now we must filter to remove SNPs that are not the same between replicates. Need to filter the vcf file to include just the 346 filtered loci.
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
#### Copy the text below into a new file called vcfkeeping. These are the column names that we want.
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
CB-110
CB-110.2
CB-24
CB-24.2
CB-65
CB-65.2
CB-96
CB-96.2
#### Filter the full vcf file without headers to only keep the columns we want.
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
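#### Toy check of the column-picking one-liner above (demo files only): number the header fields, keep the numbers whose names appear in the keep-list, and feed them to cut.
printf '#CHROM\tPOS\tJUNK\tCB-24\n1\t10\tx\t0/1\n' > demo.tbl
printf '#CHROM\nPOS\nCB-24\n' > demo.keepcols
cut -f"$(head -1 demo.tbl | tr '\t' '\n' | nl | grep -Fwf demo.keepcols | awk '{print $1}' | paste -sd,)" demo.tbl > demo.cut
# demo.cut keeps columns 1, 2 and 4; the JUNK column is gone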
#### Keep only those 346 filtered loci
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
#this keeps ten extra loci, likely because the bare locus numbers also match other numeric fields (e.g. positions) as whole words
#so add a prefix to the chromosome number in the first column of each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
#try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
#now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
#copy and paste header from previous file into new file
( sed 1q rm.ind.chr.vcf; sed 1d replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap --genepop
##### I compared the loci that aren't the same between replicates in Excel after uploading the genepop file. Note that you will lose some loci: if they aren't variable in the replicates, they won't be included in the genepop file. Of the starting 348 loci, we have 310 variable loci in the replicates to compare. The IDs in the genepop file match up with the IDs in the vcf file used to generate it. Make sure to use the GenePop ID from the replicate genepop file, since it has ONLY the variable loci you can compare. I then made a list of those to exclude; now we have 202 SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x).
### Final filtering. /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter. Using the loci that we are keeping. Make a list.to.keep with their locus ids then filter the original fasta file.
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa #sort the original fasta to only keep the high quality ones
#### Now filter the vcf file.
cat list.to.keep | cut -d"_" -f2 > snps.keep #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
#### Filter out linked loci. /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf #can't get this to work
## Checked to see what the R2
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
#### There are two pairs of loci that are linked (i.e. have an R² above 0.8). Remove them and filter the vcf file. We have 200 SNPs
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
#### Run populations. Upload this gene pop file into R to look at heterozygosity of each locus at each population. Decided to keep all 310 loci for now. We can use the final.fasta.fa file to begin developing CB SNP panel.
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop
#### Only keeping a subset of loci that are highly variable across all three populations. Filter final vcf file to only get those highly variable SNPs and remove replicates. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter
##### Run populations again to run genepop file in R
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
##### Filter the final vcf using the 111 most variable SNPs /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q filtered.vcf; sed 1d filtered ) > filtered2 && mv filtered2 final.vcf
##### Run populations again
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
#### Get SNP info for all final SNPs and make master spreadsheet with all SNP info.
#Generates a file reporting the missingness per individual. The file has the suffix ".imiss".
vcftools --vcf final.p.snps.vcf --missing-indv
#Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".
vcftools --vcf final.p.snps.vcf --missing-site
#Generates a file containing the mean depth per individual. This file has the suffix ".idepth".
vcftools --vcf final.p.snps.vcf --depth
#Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".
vcftools --vcf final.p.snps.vcf --site-mean-depth
##### Format for submission to sequencing facility
###### Filter fasta to get those we want
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # Make list of loci we want to keep w/ correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter original fasta
###### Re-formatting fasta file so that the locus name is in the first column and the sequence is in the second.
paste -d " " - - < final.fasta.fa > final.fasta2
###### Now doing some re-formatting for the submission form to Minnesota. Need to get the SNP with the major and minor alleles into the fasta file. Make a file with the sequence in the first column, position in the second column, and [allele1/allele2] in the third column. Then run code below (had to get help on this in Stack Exchange).
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp > submission.txt
```
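The final `awk` one-liner above splices the `[allele1/allele2]` notation into the sequence at the SNP position; here is a quick check on a toy `snp` file (tab-separated: sequence, 1-based position, alleles — demo file names only):

```shell
# Toy snp file: sequence<TAB>position<TAB>[allele1/allele2]
printf 'ACGTACGT\t4\t[T/C]\n' > demo.snp

# Same command as above: the base at the given position is replaced
# by the bracketed allele pair
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' demo.snp > demo.out
# demo.out -> ACG[T/C]ACGT
```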
## Trying Zoe's code with the Castilleja data
To-Do:
- Re-run populations for each population individually
- Then we will separate red and yellow
- plink module load for downstream analysis
Proceeding with creating a PCA with these sites, just for practice...
```
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out CACO
(base) [ewf7555@quser33 plink]$ module load plink/2.001
(base) [ewf7555@quser33 plink]$ plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out CACO
-bash: plink: command not found
```
The plink module won't load? (The plink/2.001 module may only provide the `plink2` binary, not `plink`.)
# Running STACKS for each population individually
In a new directory "single_pops" I imported the results of the original gstacks run and individual population maps for the 11 populations
Code for the original gstacks run:
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt --rm-pcr-duplicates -O /projects/p32037/Castilleja_emma/single_pops/ -t 8
```
New pop map files (by individual population, all .txt located in /projects/p32037/Castilleja_emma/sorted_bams/popmap/):
popmap_SR
popmap_SP
popmap_PS
popmap_NR
popmap_MW
popmap_MC
popmap_IB2
popmap_IB1
popmap_HP
popmap_GM
popmap_DP
Base population code:
```
populations -P ~/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap_POPID -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
Trial run with SR population:
```
populations -P /projects/p32037/Castilleja_emma/single_pops/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap_SR.txt -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
got a `Segmentation fault` error - this likely means the command is running out of memory - is this the right approach? Do we need to add additional filters here?
New approach: try running gstacks separately for each population
1. make new directories for each population in the reference stacks folder
Tried this with one population (DP) but got another segmentation fault
---
## Running STACKS with popmap as one population
**Reference assembly run**
Using the sorted bams from the initial process radtags
making a new directory in the "Castilleja_emma" folder for this run
```
mkdir reference_stacks_singlepop
```
Making a new population map with all of them as one population "caco"
saved as `single_pop.txt` in /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt
had to remove IB1527.sort.bam - missing individual
```
module load stacks/2.64-gcc-9.2.0
```
gstacks code for this run:
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -O /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -t 8
```
Output looks the same as the individual population map
```
Configuration for this run:
Input mode: reference-based
Population map: '/projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt'
Input files: 45, e.g. '/projects/p32037/Castilleja_emma/sorted_bams/DP830.sort.bam'
Output to: '/projects/p32037/Castilleja_emma/reference_stacks_singlepop/'
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)
Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
done.
Read 303512925 BAM records:
kept 61320283 primary alignments (20.5%), of which 32416401 reverse reads
skipped 202953667 primary alignments with insufficient mapping qualities (67.9%)
skipped 15528934 excessively soft-clipped primary alignments (5.2%)
skipped 18950917 unmapped reads (6.3%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 6744731.7 records/sample (218894-16733496)
kept 11.5%-25.4% of these
Built 53572 loci comprising 28903882 forward reads and 25516436 matching paired-end reads; mean insert length was 183.6 (sd: 56.9).
Genotyped 53572 loci:
effective per-sample coverage: mean=51.9x, stdev=38.4x, min=4.3x, max=119.2x
mean number of sites per locus: 157.8
a consistent phasing was found for 101650 out of 148375 (68.5%) diploid loci needing phasing
gstacks is done.
```
Running populations using the new pop map
```
populations -P /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
`Segmentation fault` stopped here
Dylan suggested reducing the r value to 0.5 (a locus must be present in at least 50% of individuals in a population)
`populations -P /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -r 0.5 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8`
move vcf file into new directory for filtering
```
mkdir vcf
cp /projects/p32037/Castilleja_emma/reference_stacks_singlepop/populations.snps.vcf ./
```
Filtering for LD with PLINK - SNPfiltR dataset
```
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out plink_CACO_filter
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.1
--out plink_CACO_filter
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:23:19 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter-temporary.pgen + plink_CACO_filter-temporary.pvar +
plink_CACO_filter-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter-temporary.psam.
4927 variants loaded from plink_CACO_filter-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 4419/4927 variants removed.
Variant lists written to plink_CACO_filter.prune.in and
plink_CACO_filter.prune.out .
End time: Wed Apr 3 22:25:07 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.5 --out plink_CACO_filter.5
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.5 --out plink_CACO_filter.5
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.5.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.5
--out plink_CACO_filter.5
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:29:37 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter.5-temporary.pgen + plink_CACO_filter.5-temporary.pvar
+ plink_CACO_filter.5-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter.5-temporary.psam.
4927 variants loaded from plink_CACO_filter.5-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 2152/4927 variants removed.
Variant lists written to plink_CACO_filter.5.prune.in and
plink_CACO_filter.5.prune.out .
End time: Wed Apr 3 22:29:37 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.3 --out plink_CACO_filter.5
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.3 --out plink_CACO_filter.5
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.5.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.3
--out plink_CACO_filter.5
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:30:34 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter.5-temporary.pgen + plink_CACO_filter.5-temporary.pvar
+ plink_CACO_filter.5-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter.5-temporary.psam.
4927 variants loaded from plink_CACO_filter.5-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 3157/4927 variants removed.
Variant lists written to plink_CACO_filter.5.prune.in and
plink_CACO_filter.5.prune.out .
End time: Wed Apr 3 22:30:34 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
```
Okay, looks good... sticking with --indep-pairwise 50 10 0.3 (note that the 0.3 run reused `--out plink_CACO_filter.5`, so `plink_CACO_filter.5.prune.in` now holds the 0.3 list)
```
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.5.prune.in --make-bed --pca --out caco
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.5.prune.in --make-bed --out caco
plink2 --bfile caco --pca --allow-extra-chr
plink2 --bfile caco --recode vcf --allow-extra-chr --out caco_ld
```
```
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.2.prune.in --make-bed --pca --out caco_2
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.2.prune.in --make-bed --out caco_2
plink2 --bfile caco_2 --pca --allow-extra-chr
plink2 --bfile caco_2 --recode vcf --allow-extra-chr --out caco_ld_2
```