# Mapping Castilleja ddRAD data to a reference genome
###### tags: `RADSeq` `STACKS`
---
author: Emma Fetterly
---
**Resources:**
1. Link to STACKS website:
https://catchenlab.life.illinois.edu/stacks/
*This guide is meant to help users with the basic workflow and to produce output files for population genetic analyses.*
2. Bioinformatics guide by Eric C. Anderson
https://eriqande.github.io/eca-bioinf-handbook/shell-programming.html
3. Speciation & Population Genomics: a how-to-guide
https://speciationgenomics.github.io/mapping_reference/
4. Slurm script how-to from NU https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
5. Globus - file management
https://www.globus.org/
------
**Process Overview**
1. process_radtags - demultiplex and clean reads
2. mapping ddRADseq reads to reference genome with bwa + samtools
3. STACKS: gstacks
4. STACKS: populations
5. filtering with VCFtools and PLINK
----
# Quest basics
## Logging on
Log on to Quest with `ssh` (it will ask for your NetID password):
```
ssh ewf7555@quest.northwestern.edu
```
To make a directory: `mkdir` + name
To change to a directory: `cd` + name or full path. `cd ..` moves up to the parent directory, and `cd -` returns to the previous directory you were in.
Moving files can be done either with the Linux copy/move commands followed by the path to the new folder, or manually with Cyberduck or FileZilla. I prefer the Linux commands as they are less clunky.
`cp` - copy command (this duplicates files, leaving a copy in your current directory)
`mv` - move command (this moves your files and does not leave a copy in the current directory). `mv` can also be used to rename files/folders if you are not moving to a new directory: `mv name.txt name2.txt` renames name.txt to name2.txt.
`*` - the asterisk is a wildcard; it selects all files with the same ending or a similar pattern. `cp *.1.fq.gz <destination>` copies every file ending in .1.fq.gz to the directory of your choice.
For example, to move all .gz files to a directory called "gz_reads":
```
mv *.gz /projects/p32037/Castilleja_emma/readfiles_all/gz_reads
```
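Before any bulk `mv` with a wildcard, it can be worth previewing what the pattern will grab. A small sketch (the file names and directories here are made up for illustration):

```shell
# Preview a wildcard before moving files in bulk.
workdir=$(mktemp -d)
touch "$workdir"/a.1.fq.gz "$workdir"/a.2.fq.gz "$workdir"/notes.txt
cd "$workdir"

count=$(ls *.gz | wc -l)       # how many files would '*.gz' grab?
echo "moving $count file(s)"

mkdir -p gz_reads
mv *.gz gz_reads/              # notes.txt does not match and is untouched
```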
## Castilleja data folder
Jeremie's copy "Castilleja" won't allow for edits so I have a copy called "Castilleja_emma"
I did this on Globus - cp or mv functions wouldn't work
File path:
```
$ cd /projects/p32037/Castilleja_emma
```
This folder contains the index barcodes in .txt files and the sequence data in .fastq files
There should be 12 indices with 8 samples in each index group.
:::danger
I only see barcode indices for 12, 2 (twice - .tsv and .txt), 4, 5, 6, 7
:::
This is okay, the other fastq files are from another species in the same run.
```
[ewf7555@quser34 Castilleja]$ ls
barcodes_index12.txt
barcodes_index2.tsv
barcodes_index2.txt
barcodes_index4.txt
barcodes_index5.txt
barcodes_index6.txt
barcodes_index7.txt
C85HWANXX_s5_1_BC_X.fastq
C85HWANXX_s5_1_illumina12index_10_SL136389.fastq
C85HWANXX_s5_1_illumina12index_11_SL136390.fastq
C85HWANXX_s5_1_illumina12index_12_SL136384.fastq
C85HWANXX_s5_1_illumina12index_1_SL136385.fastq
C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
C85HWANXX_s5_1_illumina12index_3_SL136386.fastq
C85HWANXX_s5_1_illumina12index_4_SL136380.fastq
C85HWANXX_s5_1_illumina12index_5_SL136381.fastq
C85HWANXX_s5_1_illumina12index_6_SL136382.fastq
C85HWANXX_s5_1_illumina12index_7_SL136383.fastq
C85HWANXX_s5_1_illumina12index_8_SL136387.fastq
C85HWANXX_s5_1_illumina12index_9_SL136388.fastq
C85HWANXX_s5_2_BC_X.fastq
C85HWANXX_s5_2_illumina12index_10_SL136389.fastq
C85HWANXX_s5_2_illumina12index_11_SL136390.fastq
C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
C85HWANXX_s5_2_illumina12index_1_SL136385.fastq
C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
C85HWANXX_s5_2_illumina12index_3_SL136386.fastq
C85HWANXX_s5_2_illumina12index_4_SL136380.fastq
C85HWANXX_s5_2_illumina12index_5_SL136381.fastq
C85HWANXX_s5_2_illumina12index_6_SL136382.fastq
C85HWANXX_s5_2_illumina12index_7_SL136383.fastq
C85HWANXX_s5_2_illumina12index_8_SL136387.fastq
C85HWANXX_s5_2_illumina12index_9_SL136388.fastq
```
Look at the raw data sequences
# Process radtags
## Code
1. `process_radtags -h`
process_radtags is the first step in the STACKS pipeline. It takes your raw sequencing files and prepares them for SNP and locus calling.
EXAMPLE process_radtags CODE FROM DYLAN:
```
nohup process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```
My version for Castilleja:
```
process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10 &
```
`process_radtags` - the command to call the program.
`-1` - the read 1 file
`-2` - the read 2 file
`-i` - the format of the raw input files (fastq)
`-b` - the barcodes file (in this case: barcodes_index12.txt) with barcode + sample identifier:
AACCA A_hubrichtii_7
CGATC A_arenaria_TX
TCGAT A_orientalis_14
TGCAT A_cil_tenuifolia_188
Note the orientation: barcode followed by the sample name, with a tab in between. The barcode file can be made in a text editor (BBEdit) and uploaded to the server using Cyberduck or FileZilla, or written directly on the server with nano.
`-o` - the output directory, which should be created before you call process_radtags
**I made a new directory within the "Castilleja_emma" folder called "CACOradtags_output" for the first/test group (index 12)**
`--inline_null` - tells STACKS the position of our barcodes for each sample: inline on the single-end read, none on the paired-end read
`--renz_1 ecoRI` - restriction enzyme 1
`--renz_2 mspI` - restriction enzyme 2
**Make sure to change these if using different enzymes!**
These are the adapters the Fant/Skogen lab uses:
`--adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT`
`--adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT`
`--adapter_mm 2` - we are allowing two mismatched bases in the adapters
`-r` - rescue barcodes and RAD-Tag cut sites
`-c` - clean data, remove any read with an uncalled base
`-q` - discard reads with low quality scores
**The ends of raw reads are usually very messy, so most people truncate/trim reads to avoid problems with assembling loci downstream. We have 150 bp reads and truncate to 100.**
`-t 100` - truncate final read length to this value
`-s 10` - the sliding-window score limit used with `-q`: reads whose average quality within the window drops below 10 are discarded (this is the score limit, not a thread count)
`&> ptags2.out &` - in the nohup version, redirects the output/log to ptags2.out and runs the job in the background
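The barcode file format is easy to get subtly wrong, so a quick check before launching process_radtags can save a failed run. A sketch using a hypothetical `barcodes_demo.txt`; the awk test only assumes the tab-separated BARCODE/SAMPLE layout described above:

```shell
# Build a demo barcode file with literal tabs between the two columns.
printf 'AACCA\tA_hubrichtii_7\nCGATC\tA_arenaria_TX\nTCGAT\tA_orientalis_14\nTGCAT\tA_cil_tenuifolia_188\n' > barcodes_demo.txt

# Flag any line that does not have exactly two tab-separated fields,
# or whose first field is not plain ACGT (stray spaces, extra columns).
bad=$(awk -F'\t' 'NF != 2 || $1 !~ /^[ACGT]+$/ { print NR }' barcodes_demo.txt)
if [ -z "$bad" ]; then echo "barcode file OK"; else echo "check lines: $bad"; fi
```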
Example SLURM SCRIPT for RADTAGS from DYLAN's notes:
```
#!/bin/bash
#SBATCH --account= ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time= ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20G ## how much RAM you need per CPU (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name= ## the job name shown when you run squeue -u <netid>
#SBATCH --mail-type= ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=
#SBATCH --error= # Enter a error log file name
#SBATCH --output= # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0
process_radtags -1 Amorphy_S1_R1_001.fastq.gz -2 Amorphy_S1_R2_001.fastq.gz -i gzfastq -b Pbar -o Phy_out --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
## Slurm scripts
Actual Slurm script I used for Castilleja radtags:
* This was for "index 12"
```
#!/bin/bash
#SBATCH --account=p32037 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=25GB ## how much RAM you need per CPU (this affects your FairShare score so be careful not to ask for more than you need)
#SBATCH --job-name=CACO_radtags ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error # Enter a error log file name
#SBATCH --output=CACO_radtags.out # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_12_SL136384.fastq -2 C85HWANXX_s5_2_illumina12index_12_SL136384.fastq -i fastq -b barcodes_index12.txt -o CACOradtags_output --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Notes on running a Slurm:
What's a slurm?? https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1964
* open a new file with `nano process_radtags.sh`
* copy and paste the script above (plain text only)
* save and exit with Ctrl+X, answering Y to keep the changes
Run the Slurm with
```
sbatch process_radtags.sh #start slurm script
squeue -u ewf7555 #check status of slurm using NU net ID
```
Notes on Slurm scripts:
Make sure to put your account name into the script, how much time you are giving it to run, number of nodes, cores, and gb, job name, etc.
*nohup and & are not needed with slurm scripts because a slurm submission runs in the background regardless.*
Output:
```
[ewf7555@quser31 Castilleja_emma]$ sh process_radtags.sh
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Filtering reads for adapter sequence:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
2 mismatches allowed to adapter sequence.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 8 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [C85HWANXX_s5_1_illumina12index_12_SL136384.fastq]
Reading data from:
C85HWANXX_s5_1_illumina12index_12_SL136384.fastq and
C85HWANXX_s5_2_illumina12index_12_SL136384.fastq
Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...
11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...
25M...26M...27M...28M...29M...30M...31M...
63459904 total reads; -3137246 ambiguous barcodes; -411809 ambiguous RAD-Tags; +1108469 recovered; -2782870 low quality reads; 55721085 retained reads.
1406894 reads with adapter sequence.
Closing files, flushing buffers...done.
63459904 total sequences
1406894 reads contained adapter sequence (2.2%)
3137246 barcode not found drops (4.9%)
2782870 low quality read drops (4.4%)
411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)
Details logged: 'CACOradtags_output/process_radtags.log'
For a summary, execute one or more:
stacks-dist-extract CACOradtags_output/process_radtags.log total_raw_read_counts
stacks-dist-extract --pretty CACOradtags_output/process_radtags.log per_barcode_raw_read_counts
process_radtags is done.
[ewf7555@quser31 Castilleja_emma]$
```
This finished very quickly (under an hour) with 87.8% retained reads.
Trying again with "index 2" using the same Slurm script but substituting different file names:
Forward file (-1) C85HWANXX_s5_1_illumina12index_2_SL136379.fastq
Reverse file (-2) C85HWANXX_s5_2_illumina12index_2_SL136379.fastq
Barcode file: barcodes_index2.txt
Job title: CACO_radtags2
For now I am putting this in a new output file in the "Castilleja_emma" folder to keep it separate from the index 12 run.
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_2_SL136379.fastq -2 C85HWANXX_s5_2_illumina12index_2_SL136379.fastq -i fastq -b barcodes_index2.txt -o CACOradtags_output2 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Started at 11 am - immediate failure because there were too many columns in the barcodes text file... need to investigate this.
Used the "barcodes_index2.tsv" file instead - this worked.
Submitted the slurm at 11:03 and it finished within 30 minutes, with very similar results to the index 12 run (85.8% retained reads).
Looks good, now I need to do this for index 4, 5, 6, 7
```
mkdir CACOradtags_output4
mkdir CACOradtags_output5
mkdir CACOradtags_output6
mkdir CACOradtags_output7
```
Slurm for process_radtags4.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_4_SL136380.fastq -2 C85HWANXX_s5_2_illumina12index_4_SL136380.fastq -i fastq -b barcodes_index4.txt -o CACOradtags_output4 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags5.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags5
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_5_SL136381.fastq -2 C85HWANXX_s5_2_illumina12index_5_SL136381.fastq -i fastq -b barcodes_index5.txt -o CACOradtags_output5 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags6.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags6
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_6_SL136382.fastq -2 C85HWANXX_s5_2_illumina12index_6_SL136382.fastq -i fastq -b barcodes_index6.txt -o CACOradtags_output6 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
Slurm for process_radtags7.sh:
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=25GB
#SBATCH --job-name=CACO_radtags7
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=CACO_radtags.error
#SBATCH --output=CACO_radtags.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
process_radtags -1 C85HWANXX_s5_1_illumina12index_7_SL136383.fastq -2 C85HWANXX_s5_2_illumina12index_7_SL136383.fastq -i fastq -b barcodes_index7.txt -o CACOradtags_output7 --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 10
```
I then ran these slurms; the outputs should be in separate folders (i.e. CACOradtags_output4, CACOradtags_output5, CACOradtags_output6, and CACOradtags_output7), with log files as process_radtags.log in the respective directories.
**Quick results of process radtags**
---
**Index 12:**
63459904 total sequences
1406894 reads contained adapter sequence (2.2%)
3137246 barcode not found drops (4.9%)
2782870 low quality read drops (4.4%)
411809 RAD cutsite not found drops (0.6%)
55721085 retained reads (87.8%)
**Index 2:**
55589592 total sequences
1055252 reads contained adapter sequence (1.9%)
4051380 barcode not found drops (7.3%)
2372433 low quality read drops (4.3%)
403793 RAD cutsite not found drops (0.7%)
47706734 retained reads (85.8%)
**Index 4:**
39541866 total sequences
723528 reads contained adapter sequence (1.8%)
2560698 barcode not found drops (6.5%)
1591721 low quality read drops (4.0%)
285054 RAD cutsite not found drops (0.7%)
34380865 retained reads (86.9%)
**Index 5:**
112035426 total sequences
2212884 reads contained adapter sequence (2.0%)
4991864 barcode not found drops (4.5%)
4956510 low quality read drops (4.4%)
706719 RAD cutsite not found drops (0.6%)
99167449 retained reads (88.5%)
**Index 6:**
73285042 total sequences
1446848 reads contained adapter sequence (2.0%)
3755980 barcode not found drops (5.1%)
2914390 low quality read drops (4.0%)
496599 RAD cutsite not found drops (0.7%)
64671225 retained reads (88.2%)
**Index 7:**
69138146 total sequences
1210578 reads contained adapter sequence (1.8%)
4440828 barcode not found drops (6.4%)
2743297 low quality read drops (4.0%)
469015 RAD cutsite not found drops (0.7%)
60274428 retained reads (87.2%)
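The retained-read percentages above were read off the end of each log by hand; a one-liner can pull them out instead. A sketch, using a saved snippet in place of the real `CACOradtags_output*/process_radtags.log`:

```shell
# A stand-in for the tail of a process_radtags log.
cat > ptags_summary_demo.txt <<'EOF'
63459904 total sequences
55721085 retained reads (87.8%)
EOF

# Grab the "retained reads" line and keep only the percentage inside the parens.
retained_pct=$(grep 'retained reads' ptags_summary_demo.txt | sed 's/.*(\(.*\)%).*/\1/')
echo "retained: ${retained_pct}%"
```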
Once process_radtags has finished running, change directories to your output directory and inspect the process_radtags.log file. This will give you basic information on how many sequences were found per sample and how many passed preliminary filtering.
To view the log file:
```
cat process_radtags.log
```
STACKS outputs four different files for each sample if you are using paired-end reads:
sample_name.rem.1.fq.gz
sample_name.rem.2.fq.gz
sample_name.1.fq.gz
sample_name.2.fq.gz
We only use the files without the .rem; those files with a .rem. can be deleted.
```
rm *.rem.1.fq.gz
rm *.rem.2.fq.gz
```
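After deleting the .rem files it is worth confirming that every sample still has a matching read-1 and read-2 file. A sketch with made-up sample names:

```shell
# Simulate a demultiplexed output directory.
workdir=$(mktemp -d); cd "$workdir"
touch SP790.1.fq.gz SP790.2.fq.gz DP830.1.fq.gz DP830.2.fq.gz

# For every read-1 file, check the matching read-2 file exists.
missing=""
for f in *.1.fq.gz; do
  [ -f "${f%.1.fq.gz}.2.fq.gz" ] || missing="$missing $f"
done
echo "samples missing a read-2 file:${missing:-none}"
```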
If you are planning on generating loci/SNPs for multiple sublibraries, you will need to move all demultiplexed files into one folder; if not, skip below to the denovo pipeline.
(See the `cp`/`mv`/wildcard notes in the Quest basics section above for how to move files.)
**Removed the .rem files and moved all read files into a single directory in "Castilleja_emma" called "readfiles_all"**
I was curious about the raw read quality of these sequences, so I ran FastQC on the read files:
```
module load fastqc/0.12.0
fastqc -o /projects/p32037/Castilleja_emma/readfiles_all/fastqc -t 6 *.fq
```
It is hard to review each file individually... so I ran MultiQC (https://multiqc.info/docs/) in the readfiles_all directory to view all the reports at once:
```
[ewf7555@quser32 readfiles_all]$ module load multiqc
[ewf7555@quser32 readfiles_all]$ multiqc .
#this searched through the entire read_files directory and found the fastqc reports generated
/// MultiQC 🔍 | v1.11
| multiqc | MultiQC Version v1.19 now available!
| multiqc | Search path : /projects/p32037/Castilleja_emma/readfiles_all
| searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 575/575
| picard | Found 35 AlignmentSummaryMetrics reports
| fastqc | Found 97 reports
| fastqc | Sample had zero reads: 'merged_DP830'
| multiqc | Compressing plot data
| multiqc | Report : multiqc_report.html
| multiqc | Data : multiqc_data
| multiqc | MultiQC complete
```
# Mapping ddRAD reads to a reference genome
### Preparing the reference:
It is a good idea to make a new directory for your reference files. The reference needs to be uncompressed for this analysis (gunzip it first if needed), then move it into place:
```
mkdir ./reference
mv tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa ./reference/
```
Index the reference genome:
```
bwa index tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa
```
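`bwa index` writes five companion files next to the FASTA (.amb, .ann, .bwt, .pac, .sa). A quick presence check before mapping; `demo_ref.fa` is a stand-in and the `touch` line simulates the index output:

```shell
ref=demo_ref.fa                                             # stand-in for the real assembly
for ext in amb ann bwt pac sa; do touch "$ref.$ext"; done   # simulate `bwa index` output

# Verify all five index files sit next to the reference.
ok=1
for ext in amb ann bwt pac sa; do
  [ -f "$ref.$ext" ] || { echo "missing $ref.$ext"; ok=0; }
done
[ "$ok" -eq 1 ] && echo "index looks complete"
```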
This will create several index files with the same basename. They all need to be in the same directory as the reference FASTA in order for mapping to work.
### Map the reads:
Run this from the directory with the read files. For one pair of read files, align the reads to the reference, generating a .sam file:
```
bwa mem -M -t 4 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt sampleID.1.fq.gz sampleID.2.fq.gz > sampleID.sam
```
Code for Castilleja mapping for sample SP790 in index 12 (trial)
```
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
```
Output:
```
[ewf7555@quser34 readfiles_all]$ bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 400000 sequences (40000000 bp)...
[M::process] read 167046 sequences (16704600 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 400000 reads in 235.877 CPU sec, 59.436 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 167046 reads in 100.074 CPU sec, 25.290 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.2.fq SP790.2.fq
[main] Real time: 86.405 sec; CPU: 337.069 sec
(base) [ewf7555@quser34 readfiles_all]$
```
Evaluating with samtools:
```
samtools flagstat SP790.sam
[ewf7555@quser34 readfiles_all]$ samtools flagstat SP790.sam
576010 + 0 in total (QC-passed reads + QC-failed reads)
8964 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
555492 + 0 mapped (96.44% : N/A)
567046 + 0 paired in sequencing
283523 + 0 read1
283523 + 0 read2
0 + 0 properly paired (0.00% : N/A)
546528 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
314768 + 0 with mate mapped to a different chr
19 + 0 with mate mapped to a different chr (mapQ>=5)
(base) [ewf7555@quser34 readfiles_all]$
```
Good mapping percentages are in the 90s. You can experiment with the minimum seed length of bwa mem (lowercase `-k`) to see if you get a higher percent mapping to the reference. This is a tradeoff: the lower the `-k`, the fewer exact base pairs need to match the reference in order to start mapping a read. The default is 19 when bwa mem is run without `-k`. It can be a good idea to run bwa mem on one sample with several values (default, 16, 14, 12, 10), evaluate which maps the most reads, and then use that value in the loop for the rest of the samples.
Note that uppercase `-K` is a different option (it sets how many input bases bwa processes per batch), which is why runs with `-K 16` and `-K 10` were only processing 2 reads at a time; the output looked like this:
```
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 2 reads in 0.002 CPU sec, 0.001 real sec
[M::process] read 2 sequences (200 bp)...
```
Don't know what the problem is here, but we had ~94% reads mapped...
:::danger
I found a typo! Was mapping both of the reverse files and not forward and reverse... :(
:::
Corrected code:
```
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa DP830.1.fq DP830.2.fq > DP830.sam
```
Output:
```
(base) [ewf7555@quser33 readfiles_all]$ bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq > SP790.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 400000 sequences (40000000 bp)...
[M::process] read 167046 sequences (16704600 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 15058, 1, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (123, 173, 208)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 378)
[M::mem_pestat] mean and std.dev: (169.60, 61.93)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 463)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 400000 reads in 309.368 CPU sec, 77.809 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 6232, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (122, 172, 207)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 377)
[M::mem_pestat] mean and std.dev: (168.39, 61.39)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 462)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 167046 reads in 131.863 CPU sec, 33.177 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa SP790.1.fq SP790.2.fq
[main] Real time: 112.884 sec; CPU: 442.390 sec
(base) [ewf7555@quser33 readfiles_all]$
```
Evaluate with samtools:
```
[ewf7555@quser33 readfiles_all]$ samtools flagstat SP790.sam
575042 + 0 in total (QC-passed reads + QC-failed reads)
7996 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
560287 + 0 mapped (97.43% : N/A)
567046 + 0 paired in sequencing
283523 + 0 read1
283523 + 0 read2
493096 + 0 properly paired (86.96% : N/A)
548284 + 0 with itself and mate mapped
4007 + 0 singletons (0.71% : N/A)
48272 + 0 with mate mapped to a different chr
8056 + 0 with mate mapped to a different chr (mapQ>=5)
```
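Rather than eyeballing flagstat output for every sample, the mapped and properly-paired percentages can be extracted with awk. A sketch using a saved copy of the key lines from the output above:

```shell
# A stand-in for `samtools flagstat SP790.sam` output.
cat > flagstat_demo.txt <<'EOF'
575042 + 0 in total (QC-passed reads + QC-failed reads)
560287 + 0 mapped (97.43% : N/A)
493096 + 0 properly paired (86.96% : N/A)
EOF

# Field 5 on the "mapped" line and field 6 on the "properly paired" line
# hold the percentage; strip the "(" and "%" characters around it.
mapped=$(awk '/ mapped \(/ { gsub(/[(%]/, "", $5); print $5 }' flagstat_demo.txt)
paired=$(awk '/properly paired/ { gsub(/[(%]/, "", $6); print $6 }' flagstat_demo.txt)
echo "mapped: ${mapped}%  properly paired: ${paired}%"
```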
Looks good...
* 97.43 % mapped and
* 86.96% properly paired
----
Evaluate which maps the most reads and then use the loop for the rest of the samples:
Nora example code:
```
for i in *.1.fq.gz; do
NM=${i//.1.fq.gz/}
REVERSE=${i//.1.fq.gz/.2.fq.gz}
bwa mem -M -t 4 -K 10 /home/bioinfo8/reference_mapping/reference_files/CAJRAG01.1.fsa_nt $i $REVERSE > $NM.sam ; done
```
Actual code I used:
```
for i in *.1.fq; do
NM=${i//.1.fq/}
REVERSE=${i//.1.fq/.2.fq}
bwa mem -M -t 4 /projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa $i $REVERSE > $NM.sam ; done
```
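The loop's naming relies on bash's `${var//pattern/replacement}` expansion; a minimal demonstration of what it produces for one file name:

```shell
i=SP790.1.fq
NM=${i//.1.fq/}             # suffix stripped -> sample name, used for the .sam output
REVERSE=${i//.1.fq/.2.fq}   # suffix swapped -> matching reverse-read file
echo "$NM $REVERSE"
```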
This looped the process to map all of the samples to the reference genome, with mapping percentages in the high 90s and ~80-90% properly paired reads.
Now convert the .sam files into .bam files:
```
for i in *.sam; do
NM=${i//.sam/};
samtools view $i -b -o $NM.bam; done
```
and then sort the .bam files:
```
for i in *.bam; do NM=${i//.bam/}; samtools sort $i -o $NM.sort.bam; done
```
As long as your sorted .bam files are looking good, then you can delete the .sam files because they are quite large and take up a lot of space. I put all my sorted .bams into their own folder.
I got an error with the sorting function:
:::danger
* DP830 does not have a sort.bam
* DP840 does not have a sort.bam
:::
```
DP830.1.fq HP807.sort.bam MC940.1.fq PS608B.2.fq
DP830.2.fq IB1506B.1.fq MC940.2.fq PS608B.bam
DP830.bam IB1506B.2.fq MC940.bam PS608B.sam
DP830.sam IB1506B.bam MC940.sam PS608B.sort.bam
DP831.1.fq IB1506B.sam MC940.sort.bam PS613A.1.fq
DP831.2.fq IB1506B.sort.bam MC944.1.fq PS613A.2.fq
DP831.bam IB1519.1.fq MC944.2.fq PS613A.bam
DP831.sam IB1519.2.fq MC944.bam PS613A.sam
DP831.sort.bam IB1519.bam MC944.sam PS613A.sort.bam
DP838.1.fq IB1519.sam MC944.sort.bam PS614A.1.fq
DP838.2.fq IB1519.sort.bam MC946.1.fq PS614A.2.fq
DP838.bam IB1522B.1.fq MC946.2.fq PS614A.bam
DP838.sam IB1522B.2.fq MC946.bam PS614A.sam
DP838.sort.bam IB1522B.bam MC946.sam PS614A.sort.bam
DP840.1.fq IB1522B.sam MC946.sort.bam PS615A.1.fq
DP840.2.fq IB1522B.sort.bam MC947.1.fq PS615A.2.fq
DP840.bam IB1524B.1.fq MC947.2.fq PS615A.bam
DP840.sam IB1524B.2.fq MC947.bam PS615A.sam
DP840.sort.bam.tmp.0000.bam IB1524B.bam MC947.sam PS615A.sort.bam
DP840.sort.bam.tmp.0001.bam IB1524B.sam MC947.sort.bam SP772B.1.fq
DP840.sort.bam.tmp.0002.bam IB1524B.sort.bam MW537.1.fq SP772B.2.fq
DP840.sort.bam.tmp.0003.bam IB1527B.1.fq MW537.2.fq SP772B.bam
GM544.1.fq IB1527B.2.fq MW537.bam SP772B.sam
GM544.2.fq IB1527B.bam MW537.sam SP772B.sort.bam
GM544.bam IB1527B.sam MW537.sort.bam SP790.1.fq
GM544.sam IB1527B.sort.bam MW538.1.fq SP790.2.fq
GM544.sort.bam IB1528B.1.fq MW538.2.fq SP790.bam
GM556.1.fq IB1528B.2.fq MW538.bam SP790.sam
GM556.2.fq IB1528B.bam MW538.sam SP790.sort.bam
GM556.bam IB1528B.sam MW538.sort.bam SP792B.1.fq
GM556.sam IB1528B.sort.bam MW542.1.fq SP792B.2.fq
GM556.sort.bam IB2684.1.fq MW542.2.fq SP792B.bam
GM559.1.fq IB2684.2.fq MW542.bam SP792B.sam
GM559.2.fq IB2684.bam MW542.sam SP792B.sort.bam
GM559.bam IB2684.sam MW542.sort.bam SP796B.1.fq
GM559.sam IB2684.sort.bam MW543.1.fq SP796B.2.fq
GM559.sort.bam IB2685.1.fq MW543.2.fq SP796B.bam
GM564.1.fq IB2685.2.fq MW543.bam SP796B.sam
GM564.2.fq IB2685.bam MW543.sam SP796B.sort.bam
GM564.bam IB2685.sam MW543.sort.bam SR908B.1.fq
GM564.sam IB2685.sort.bam NR815.1.fq SR908B.2.fq
GM564.sort.bam IB2686.1.fq NR815.2.fq SR908B.bam
HP801.1.fq IB2686.2.fq NR815.bam SR908B.sam
HP801.2.fq IB2686.bam NR815.sam SR908B.sort.bam
HP801.bam IB2686.sam NR815.sort.bam SR912.1.fq
HP801.sam IB2686.sort.bam NR817.1.fq SR912.2.fq
HP801.sort.bam IB2687.1.fq NR817.2.fq SR912.bam
HP803.1.fq IB2687.2.fq NR817.bam SR912.sam
HP803.2.fq IB2687.bam NR817.sam SR912.sort.bam
HP803.bam IB2687.sam NR817.sort.bam SR914.1.fq
HP803.sam IB2687.sort.bam NR819.1.fq SR914.2.fq
HP803.sort.bam IB2693.1.fq NR819.2.fq SR914.bam
HP805.1.fq IB2693.2.fq NR819.bam SR914.sam
HP805.2.fq IB2693.bam NR819.sam SR914.sort.bam
HP805.bam IB2693.sam NR819.sort.bam SR922.1.fq
HP805.sam IB2693.sort.bam NR820.1.fq SR922.2.fq
HP805.sort.bam IB2695.1.fq NR820.2.fq SR922.bam
HP807.1.fq IB2695.2.fq NR820.bam SR922.sam
HP807.2.fq IB2695.bam NR820.sam SR922.sort.bam
HP807.bam IB2695.sam NR820.sort.bam
HP807.sam IB2695.sort.bam PS608B.1.fq
```
Trying those samples again...
```
[ewf7555@quser33 readfiles_all]$ samtools sort DP840.bam -o DP840.sort.bam
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::hts_open_format] Failed to open file DP840.sort.bam.tmp.0000.bam
samtools sort: failed to create temporary file "DP840.sort.bam.tmp.0000.bam": File exists
(base) [ewf7555@quser33 readfiles_all]$
```
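The "File exists" error above is samtools refusing to clobber the temporary chunk files (`DP840.sort.bam.tmp.*.bam` in the listing) left behind by the earlier interrupted sort. A minimal cleanup sketch to run before retrying the sort (the sample name is hard-coded here as an example):

```shell
#!/bin/bash
# Remove leftover temporary chunks from an interrupted `samtools sort`
# so the re-run can create them fresh.
sample=DP840
count=0
shopt -s nullglob                  # glob expands to nothing if no tmp files exist
for tmp in "${sample}".sort.bam.tmp.*.bam; do
    rm -f "$tmp"
    count=$((count + 1))
done
shopt -u nullglob
echo "removed ${count} stale temp file(s) for ${sample}"
```

After the cleanup, `samtools sort DP840.bam -o DP840.sort.bam` should run without the temporary-file error.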
Writing a Slurm job script - open nano
```
#!/bin/bash
#SBATCH --account=p32037
#SBATCH --partition=normal
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=20GB
#SBATCH --job-name=CACO_radtags2
#SBATCH --mail-type=ALL
#SBATCH --mail-user=emmafetterly2025@u.northwestern.edu
#SBATCH --error=DP840bamsort.error
#SBATCH --output=DP840bamsort.out
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
samtools sort DP840.bam -o DP840.sort.bam
```
---
**Evaluating the alignments**
https://gatk.broadinstitute.org/hc/en-us/articles/13832683660955-
Make a shell script from this template:
```
java -jar picard.jar CollectAlignmentSummaryMetrics \
R=tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa \
I=input.bam \
O=output.txt
```
Bash script for this, called run_picard.sh, in the output_qual directory:
```
#!/bin/bash
module load picard/2.21.4
#Reference genome
REFERENCE_GENOME="/projects/p32037/Castilleja_emma/reference/tom-cal2336-mb-hirise-xl3oi__12-09-2022__final_assembly.fa"
# Input directory containing BAM files
INPUT_DIR="/projects/p32037/Castilleja_emma/readfiles_all/"
# Output directory
OUTPUT_DIR="/projects/p32037/Castilleja_emma/readfiles_all/output_qual"
# Run Picard for each BAM file in the input directory
for BAM_FILE in "$INPUT_DIR"/*.bam
do
# Check if there are matching files, else break the loop
[ -e "$BAM_FILE" ] || break
OUTPUT_FILE="${OUTPUT_DIR}/$(basename ${BAM_FILE%.*})_metrics.txt"
picard CollectAlignmentSummaryMetrics \
R=$REFERENCE_GENOME \
I=$BAM_FILE \
O=$OUTPUT_FILE
echo "Picard analysis for $(basename $BAM_FILE) complete. Metrics saved to $OUTPUT_FILE"
done
```
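The headline number in each metrics file is the fraction of read pairs that aligned. As a quick alternative to opening each file, here is a hedged awk sketch that pulls `PCT_PF_READS_ALIGNED` from the `PAIR` row of each `*_metrics.txt` (column and category names follow the documented Picard format - verify them against the header of your own files):

```shell
#!/bin/bash
# Tabulate per-sample alignment rate from Picard CollectAlignmentSummaryMetrics output.
# PCT_PF_READS_ALIGNED and the PAIR category row are the documented Picard names.
aln_rate () {
    awk -F'\t' '
        $1 == "CATEGORY" {                 # header row: find the column index by name
            for (i = 1; i <= NF; i++)
                if ($i == "PCT_PF_READS_ALIGNED") col = i
        }
        $1 == "PAIR" && col { print FILENAME "\t" $col }
    ' "$1"
}
for m in *_metrics.txt; do
    [ -e "$m" ] || continue                # skip if no metrics files in this directory
    aln_rate "$m"
done
```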
This created a lot of files with metrics by sample, although it looks like DP840 didn't work too well
```
DP831.sort_metrics.txt GM564.sort_metrics.txt IB2684.sort_metrics.txt MW542.sort_metrics.txt run_picard.sh
DP838.sort_metrics.txt HP801.sort_metrics.txt IB2685.sort_metrics.txt MW543.sort_metrics.txt SP790.sort_metrics.txt
DP840.sort.bam.tmp.0000_metrics.txt HP803.sort_metrics.txt IB2687.sort_metrics.txt NR819.sort_metrics.txt SP792B.sort_metrics.txt
DP840.sort.bam.tmp.0001_metrics.txt HP805.sort_metrics.txt IB2695.sort_metrics.txt NR820.sort_metrics.txt SP796B.sort_metrics.txt
DP840.sort.bam.tmp.0002_metrics.txt HP807.sort_metrics.txt MC940.sort_metrics.txt PS608B.sort_metrics.txt SR912.sort_metrics.txt
DP840.sort.bam.tmp.0003_metrics.txt IB1519.sort_metrics.txt MC944.sort_metrics.txt PS613A.sort_metrics.txt SR914.sort_metrics.txt
GM556.sort_metrics.txt IB1522B.sort_metrics.txt MC947.sort_metrics.txt PS614A.sort_metrics.txt SR922.sort_metrics.txt
GM559.sort_metrics.txt IB1524B.sort_metrics.txt MW538.sort_metrics.txt PS615A.sort_metrics.txt
```
You can use MultiQC to evaluate the output of these files (see the section on FastQC/MultiQC after process_radtags)
[MultiQC report](https://m-b55e05.a0115.5898.data.globus.org/projects/p32037/Castilleja_emma/readfiles_all/multiqc_report.html#picard-alignmentsummary ) here

# STACKS: getting set up
### **Making a population map**
Start off by creating a sorted bam folder and moving the sorted bam files into it. Next you will need to create a population map file. There should be no header and just two columns: the first with sample IDs (including the `.sort` suffix if using a reference), and the second with the population assignment. Make sure to put a tab between the sample ID and the population.
Pop map example:
Fug_B13 Bos
Fug_B14 Bos
Fug_B15 Bos
Fug_B24 Bos
Fug_F25 Fife
Fug_F26 Fife
Fug_F36 Fife
Fug_S1 Sev
Fug_S11 Sev
Fug_S12 Sev
Fug_S2 Sev
My popmap file:
saved as popmap.txt in this directory:
`/projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt`
```
DP830.sort DP
DP831.sort DP
DP838.sort DP
GM544.sort GM
GM556.sort GM
GM559.sort GM
GM564.sort GM
HP801.sort HP
HP803.sort HP
HP805.sort HP
HP807.sort HP
IB1506B.sort IB1
IB1519.sort IB1
IB1522B.sort IB1
IB1524B.sort IB1
IB1528B.sort IB1
IB2684.sort IB2
IB2685.sort IB2
IB2686.sort IB2
IB2687.sort IB2
IB2693.sort IB2
IB2695.sort IB2
MC940.sort MC
MC944.sort MC
MC946.sort MC
MC947.sort MC
MW537.sort MW
MW538.sort MW
MW542.sort MW
MW543.sort MW
NR815.sort NR
NR817.sort NR
NR819.sort NR
NR820.sort NR
PS608B.sort PS
PS613A.sort PS
PS614A.sort PS
PS615A.sort PS
SP772B.sort SP
SP790.sort SP
SP796B.sort SP
SR908B.sort SR
SR912.sort SR
SR914.sort SR
SR922.sort SR
```
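A popmap in this format can also be generated from the sorted-BAM filenames instead of typed by hand. This sketch assumes the population code is the leading run of capital letters in each sample ID; the IB1/IB2 populations carry a digit in their code, so those lines would need a hand edit afterwards:

```shell
#!/bin/bash
# Sketch: build a two-column, tab-separated popmap from sorted-BAM filenames.
# Assumption: population code = leading capital letters of the sample ID
# (IB1/IB2 include a digit, so fix those lines by hand).
make_popmap () {
    for f in "$@"; do
        sample=${f%.bam}                   # DP830.sort.bam -> DP830.sort
        id=${sample%.sort}                 # strip .sort to get the raw sample ID
        pop=$(printf '%s' "$id" | sed -E 's/^([A-Z]+).*/\1/')
        printf '%s\t%s\n' "$sample" "$pop"
    done
}
# Usage against a real directory of sorted BAMs:
# make_popmap *.sort.bam > popmap/popmap.txt
```

Redirect the output to `popmap/popmap.txt` and double-check the IB lines before running gstacks.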
# STACKS: gstacks
```
module load stacks/2.64-gcc-9.2.0
```
It is a good idea to run this inside a `screen` session
```
gstacks -I ~/sorted_bams/ -M ~/popmap --rm-pcr-duplicates -O ~/reference_stacks/ -t 8
```
My version
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt --rm-pcr-duplicates -O /projects/p32037/Castilleja_emma/reference_stacks/ -t 8
Locus/sample distributions will be written to '/projects/p32037/Castilleja_emma/reference_stacks/gstacks.log.distribs'.
```
Output:
Removed 24790321 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (97.2%); kept 726115 read pairs.
Genotyped 40974 loci: effective per-sample coverage mean=1.3x, stdev=0.1x, min=1.1x, max=1.5x; mean number of sites per locus: 176.2. A consistent phasing was found for 81688 out of 83342 (98.0%) diploid loci needing phasing.
:::danger
Low coverage and a high proportion of reads removed as PCR duplicates - https://groups.google.com/g/stacks-users/c/qmAxcH7cbus
This forum thread suggests you can't use `--rm-pcr-duplicates` on ddRAD data?
:::
Trying again without `--rm-pcr-duplicates`
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -O /projects/p32037/Castilleja_emma/reference_stacks/ -t 8
```
Results:
```
Read 303512925 BAM records:
kept 61320283 primary alignments (20.5%), of which 32416401 reverse reads
skipped 202953667 primary alignments with insufficient mapping qualities (67.9%)
skipped 15528934 excessively soft-clipped primary alignments (5.2%)
skipped 18950917 unmapped reads (6.3%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 6744731.7 records/sample (218894-16733496)
kept 11.5%-25.4% of these
Built 53572 loci comprising 28903882 forward reads and 25516436 matching paired-end reads; mean insert length was 183.6 (sd: 56.9).
Genotyped 53572 loci:
effective per-sample coverage: mean=51.9x, stdev=38.4x, min=4.3x, max=119.2x
mean number of sites per locus: 157.8
a consistent phasing was found for 101650 of out 148375 (68.5%) diploid loci needing phasing
```
**Only kept 20% of primary alignments??**
* largest proportion is due to poor alignment (skipped 202953667 primary alignments with insufficient mapping qualities 67.9%)
from Google Group:
*the gstacks program is telling you precisely why it skipped the reads – they aligned really poorly.
Take a look at the alignments yourself, say using samtools to look at the CIGAR strings, to get an intuition for how well the reads aligned. Is the reference the same species as your ddRAD data?
Despite throwing out 80% of your data, gstacks still built 50k loci with very high coverage (~50x), so that may still be good, if you can understand why so many reads aligned poorly. Anyway, you can tell gstacks to keep the reads, but you should be cautious with the resulting RAD loci that are built unless you can understand why so much data aligned poorly.*
**Looking at CIGAR strings:**
more info on the process here: Population Genomics Analysis with RAD, Reprised: Stacks 2 https://link.springer.com/protocol/10.1007/978-1-0716-2313-8_7
Looking at the gstacks.log.distribs for per sample info:
```
$ stacks-dist-extract gstacks.log.distribs bam_stats_per_sample | grep -v '^#' > bam_stats_per_sample.tsv
```
This exported the file into bam_stats_per_sample.tsv, moving to R...
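Before (or instead of) R, the kept fraction per sample can be computed directly from the TSV with awk. The column names `sample`, `records`, and `primary_kept` are assumptions here - check them against the header line of your bam_stats_per_sample.tsv:

```shell
#!/bin/bash
# Sketch: per-sample fraction of BAM records kept by gstacks, straight from the TSV.
# Assumes the first line is a column header; names `sample`, `records`,
# `primary_kept` are assumptions -- verify against your file.
kept_frac () {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }   # map names to indices
        { printf "%s\t%.3f\n", $(idx["sample"]), $(idx["primary_kept"]) / $(idx["records"]) }
    ' "$1"
}
# kept_frac bam_stats_per_sample.tsv | sort -t$'\t' -k2,2n
```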
**Primary Alignments kept per sample:**

**Fraction of alignments kept per sample:**

**Primary alignments kept by population:**

**Fraction of total alignments kept by population**

# STACKS: populations
*Using the output from the gstacks run without `--rm-pcr-duplicates`*
```
populations -P /projects/p32037/Castilleja_emma/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -r 0.7 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --hwe -t 8
Removed 32934 loci that did not pass sample/population constraints from 53572 loci.
Kept 20638 loci, composed of 4066480 sites; 19536 of those sites were filtered, 53552 variant sites remained.
3931634 genomic sites, of which 127659 were covered by multiple loci (3.2%).
Mean genotyped sites per locus: 175.42bp (stderr 0.38).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
DP: 3 samples per locus; pi: 0.10782; all/variant/polymorphic sites: 841059/8242/2043; private alleles: 375
GM: 3.2966 samples per locus; pi: 0.1204; all/variant/polymorphic sites: 871058/9555/2705; private alleles: 407
HP: 3.3522 samples per locus; pi: 0.15913; all/variant/polymorphic sites: 1962432/28347/10157; private alleles: 1411
IB1: 4.3691 samples per locus; pi: 0.11759; all/variant/polymorphic sites: 1099134/12722/4232; private alleles: 650
IB2: 5.5786 samples per locus; pi: 0.17175; all/variant/polymorphic sites: 2090579/29229/14642; private alleles: 2659
MC: 3.6263 samples per locus; pi: 0.22013; all/variant/polymorphic sites: 2766092/39055/20186; private alleles: 3633
MW: 3.131 samples per locus; pi: 0.19258; all/variant/polymorphic sites: 2198921/31495/13707; private alleles: 2159
NR: 3.6435 samples per locus; pi: 0.1906; all/variant/polymorphic sites: 2763281/39961/16937; private alleles: 2671
PS: 3.6204 samples per locus; pi: 0.18455; all/variant/polymorphic sites: 2119028/32278/13835; private alleles: 2033
SP: 3 samples per locus; pi: 0.13971; all/variant/polymorphic sites: 1221880/14446/4612; private alleles: 714
SR: 3.4173 samples per locus; pi: 0.18378; all/variant/polymorphic sites: 2177639/31518/13390; private alleles: 2172
Number of variable sites found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 0
GM: 0
HP: 0
IB1: 25
IB2: 345
MC: 0
MW: 0
NR: 0
PS: 0
SP: 0
SR: 0
Number of loci found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 829
GM: 876
HP: 2544
IB1: 1009
IB2: 2187
MC: 3622
MW: 2586
NR: 4353
PS: 2628
SP: 1203
SR: 2559
(more detail in populations.sumstats.tsv and populations.hapstats.tsv)
populations -P /projects/p32037/Castilleja_emma/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt -r 0. --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --hwe -t 8
Removed 38108 loci that did not pass sample/population constraints from 53572 loci.
Kept 15464 loci, composed of 3102777 sites; 36470 of those sites were filtered, 23855 variant sites remained.
3021309 genomic sites, of which 76862 were covered by multiple loci (2.5%).
Mean genotyped sites per locus: 179.50bp (stderr 0.44).
Population summary statistics (more detail in populations.sumstats_summary.tsv):
DP: 3 samples per locus; pi: 0.22138; all/variant/polymorphic sites: 844928/3171/1541; private alleles: 137
GM: 4 samples per locus; pi: 0.18413; all/variant/polymorphic sites: 372599/932/428; private alleles: 5
HP: 4 samples per locus; pi: 0.2032; all/variant/polymorphic sites: 916619/4092/2007; private alleles: 92
IB1: 4.3177 samples per locus; pi: 0.22953; all/variant/polymorphic sites: 1103722/5373/3166; private alleles: 181
IB2: 5.5077 samples per locus; pi: 0.28504; all/variant/polymorphic sites: 2100591/15727/11989; private alleles: 1544
MC: 4 samples per locus; pi: 0.28718; all/variant/polymorphic sites: 1990720/12926/8999; private alleles: 915
MW: 4 samples per locus; pi: 0.18457; all/variant/polymorphic sites: 452675/1465/699; private alleles: 37
NR: 4 samples per locus; pi: 0.25021; all/variant/polymorphic sites: 1937957/13666/7851; private alleles: 777
PS: 4 samples per locus; pi: 0.25848; all/variant/polymorphic sites: 1551478/9934/6162; private alleles: 489
SP: 3 samples per locus; pi: 0.25723; all/variant/polymorphic sites: 1227442/6428/3593; private alleles: 206
SR: 4 samples per locus; pi: 0.23942; all/variant/polymorphic sites: 1205358/5598/3352; private alleles: 249
Number of variable sites found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 0
GM: 0
HP: 0
IB1: 25
IB2: 345
MC: 0
MW: 0
NR: 0
PS: 0
SP: 0
SR: 0
Number of loci found to be significantly out of Hardy-Weinberg equilibrium (<0.05):
DP: 1379
GM: 560
HP: 1478
IB1: 1671
IB2: 3736
MC: 3497
MW: 700
NR: 3746
PS: 2499
SP: 1968
SR: 1835
(more detail in populations.sumstats.tsv and populations.hapstats.tsv)
Population pair divergence statistics (more in populations.fst_summary.tsv and populations.phistats_summary.tsv):
DP-GM: mean Fst: 0.076843; mean Phi_st: -0.045204; mean Fst': -0.055206; mean Dxy: 0.0022836
DP-HP: mean Fst: 0.13176; mean Phi_st: 0.022563; mean Fst': 0.0047729; mean Dxy: 0.002819
DP-IB1: mean Fst: 0.10093; mean Phi_st: 0.010901; mean Fst': 0.00043457; mean Dxy: 0.0030542
DP-IB2: mean Fst: 0.092756; mean Phi_st: 0.034092; mean Fst': 0.021025; mean Dxy: 0.0030585
DP-MC: mean Fst: 0.10269; mean Phi_st: 0.020933; mean Fst': 0.0098509; mean Dxy: 0.0032087
DP-MW: mean Fst: 0.11275; mean Phi_st: 0.023039; mean Fst': -0.0013835; mean Dxy: 0.002706
DP-NR: mean Fst: 0.15515; mean Phi_st: 0.086261; mean Fst': 0.070418; mean Dxy: 0.0031068
DP-PS: mean Fst: 0.13457; mean Phi_st: 0.083704; mean Fst': 0.06409; mean Dxy: 0.003278
DP-SP: mean Fst: 0.1174; mean Phi_st: -0.015306; mean Fst': -0.030904; mean Dxy: 0.003005
DP-SR: mean Fst: 0.1109; mean Phi_st: 0.018917; mean Fst': 0.0017189; mean Dxy: 0.0027452
GM-HP: mean Fst: 0.083611; mean Phi_st: -0.002185; mean Fst': -0.021962; mean Dxy: 0.002352
GM-IB1: mean Fst: 0.077449; mean Phi_st: -0.010154; mean Fst': -0.019626; mean Dxy: 0.0030394
GM-IB2: mean Fst: 0.074612; mean Phi_st: 0.034112; mean Fst': 0.0055244; mean Dxy: 0.0023244
GM-MC: mean Fst: 0.092171; mean Phi_st: 0.014567; mean Fst': -0.0061395; mean Dxy: 0.0020821
GM-MW: mean Fst: 0.11822; mean Phi_st: 0.014353; mean Fst': -0.0054951; mean Dxy: 0.0021802
GM-NR: mean Fst: 0.1419; mean Phi_st: 0.093246; mean Fst': 0.06632; mean Dxy: 0.0027865
GM-PS: mean Fst: 0.12219; mean Phi_st: 0.062718; mean Fst': 0.039221; mean Dxy: 0.0023461
GM-SP: mean Fst: 0.088249; mean Phi_st: -0.024104; mean Fst': -0.040362; mean Dxy: 0.0023379
GM-SR: mean Fst: 0.092541; mean Phi_st: 0.019877; mean Fst': 0.0066349; mean Dxy: 0.0022495
HP-IB1: mean Fst: 0.10148; mean Phi_st: 0.04359; mean Fst': 0.038607; mean Dxy: 0.003577
HP-IB2: mean Fst: 0.095374; mean Phi_st: 0.073303; mean Fst': 0.068848; mean Dxy: 0.0038316
HP-MC: mean Fst: 0.11275; mean Phi_st: 0.065318; mean Fst': 0.048885; mean Dxy: 0.0034525
HP-MW: mean Fst: 0.11758; mean Phi_st: 0.050438; mean Fst': 0.029142; mean Dxy: 0.002783
HP-NR: mean Fst: 0.16736; mean Phi_st: 0.14848; mean Fst': 0.1364; mean Dxy: 0.0038734
HP-PS: mean Fst: 0.12506; mean Phi_st: 0.088672; mean Fst': 0.077628; mean Dxy: 0.0036764
HP-SP: mean Fst: 0.11814; mean Phi_st: 0.053396; mean Fst': 0.035201; mean Dxy: 0.0032341
HP-SR: mean Fst: 0.11997; mean Phi_st: 0.073848; mean Fst': 0.061435; mean Dxy: 0.0033213
IB1-IB2: mean Fst: 0.077456; mean Phi_st: 0.032603; mean Fst': 0.028078; mean Dxy: 0.0041143
IB1-MC: mean Fst: 0.099536; mean Phi_st: 0.042905; mean Fst': 0.030725; mean Dxy: 0.0036717
IB1-MW: mean Fst: 0.080111; mean Phi_st: 0.009675; mean Fst': -0.0024801; mean Dxy: 0.0028004
IB1-NR: mean Fst: 0.14482; mean Phi_st: 0.1097; mean Fst': 0.10356; mean Dxy: 0.0040678
IB1-PS: mean Fst: 0.10352; mean Phi_st: 0.059911; mean Fst': 0.052647; mean Dxy: 0.0037677
IB1-SP: mean Fst: 0.095132; mean Phi_st: 0.021318; mean Fst': 0.0060615; mean Dxy: 0.003686
IB1-SR: mean Fst: 0.1012; mean Phi_st: 0.051603; mean Fst': 0.042171; mean Dxy: 0.0036588
IB2-MC: mean Fst: 0.095358; mean Phi_st: 0.066416; mean Fst': 0.06204; mean Dxy: 0.0046185
IB2-MW: mean Fst: 0.096803; mean Phi_st: 0.072078; mean Fst': 0.052561; mean Dxy: 0.0030746
IB2-NR: mean Fst: 0.12628; mean Phi_st: 0.12439; mean Fst': 0.13364; mean Dxy: 0.004951
IB2-PS: mean Fst: 0.096559; mean Phi_st: 0.073483; mean Fst': 0.074299; mean Dxy: 0.004654
IB2-SP: mean Fst: 0.08334; mean Phi_st: 0.023605; mean Fst': 0.020569; mean Dxy: 0.0040226
IB2-SR: mean Fst: 0.09741; mean Phi_st: 0.067676; mean Fst': 0.059311; mean Dxy: 0.0039687
MC-MW: mean Fst: 0.096171; mean Phi_st: 0.02607; mean Fst': 0.0024765; mean Dxy: 0.0024299
MC-NR: mean Fst: 0.1325; mean Phi_st: 0.10489; mean Fst': 0.10668; mean Dxy: 0.0043845
MC-PS: mean Fst: 0.11595; mean Phi_st: 0.081511; mean Fst': 0.080328; mean Dxy: 0.0041826
MC-SP: mean Fst: 0.1061; mean Phi_st: 0.039654; mean Fst': 0.026668; mean Dxy: 0.0035108
MC-SR: mean Fst: 0.096817; mean Phi_st: 0.035289; mean Fst': 0.025164; mean Dxy: 0.0034829
MW-NR: mean Fst: 0.14674; mean Phi_st: 0.084415; mean Fst': 0.072261; mean Dxy: 0.0025794
MW-PS: mean Fst: 0.097104; mean Phi_st: 0.020413; mean Fst': 0.0022018; mean Dxy: 0.0025054
MW-SP: mean Fst: 0.10348; mean Phi_st: 0.012837; mean Fst': -0.002628; mean Dxy: 0.0027021
MW-SR: mean Fst: 0.10065; mean Phi_st: 0.026811; mean Fst': 0.0086096; mean Dxy: 0.0025026
NR-PS: mean Fst: 0.15915; mean Phi_st: 0.14338; mean Fst': 0.14553; mean Dxy: 0.0045047
NR-SP: mean Fst: 0.15432; mean Phi_st: 0.094115; mean Fst': 0.088347; mean Dxy: 0.0039141
NR-SR: mean Fst: 0.14261; mean Phi_st: 0.10174; mean Fst': 0.095176; mean Dxy: 0.0037862
PS-SP: mean Fst: 0.11962; mean Phi_st: 0.063339; mean Fst': 0.053603; mean Dxy: 0.0037309
PS-SR: mean Fst: 0.12255; mean Phi_st: 0.077635; mean Fst': 0.068894; mean Dxy: 0.0035355
SP-SR: mean Fst: 0.11013; mean Phi_st: 0.036235; mean Fst': 0.025591; mean Dxy: 0.0032607
```
Now I need to run VCFtools to clean/filter these data
# VCFtools introduction
Dylan's example code for filtering the populations.snps.vcf file:
```
vcftools --vcf input.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out renamed.output.vcf
```
### VCF Parameters:
**Depth:**
```
--min-meanDP <float>
--max-meanDP <float>
```
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.
**Hardy-Weinberg Equilibrium:**
```
--hwe <float>
```
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.
**Missing data:**
```
--max-missing <float>
```
Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
```
--max-missing-count <integer>
```
Exclude sites with more than this number of missing genotypes over all individuals.
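These two options are two views of the same cutoff: with N genotyped individuals, `--max-missing m` keeps a site only if at least m x N genotypes are called, i.e. at most N x (1 - m) are missing. A small helper makes the conversion explicit (the epsilon guards against floating-point round-down):

```shell
#!/bin/bash
# Convert a --max-missing proportion (required call rate) into the
# equivalent --max-missing-count for a given number of individuals.
max_missing_count () {
    # $1 = number of individuals, $2 = --max-missing value
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%d\n", int(n * (1 - m) + 1e-9) }'
}
max_missing_count 45 0.50    # 45 individuals at a 50% call rate -> up to 22 missing
```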
**Phased:**
```
--phased
```
Excludes all sites that contain unphased genotypes.
**MISCELLANEOUS FILTERING**
```
--minQ <float>
```
Includes only sites with Quality value above this threshold.
# VCF filtering: Preliminary analysis
Following the workflow from https://speciationgenomics.github.io/filtering_vcfs/ to determine stats for this dataset
1. Make a new VCF directory and copy the populations VCF file into it
```
mkdir vcf
cp /projects/p32037/Castilleja_emma/reference_stacks/populations.snps.vcf ./
```
2. Make a directory for the VCFtools output and set shortcut variables for the VCF file and the output prefix
```
mkdir vcftools
pop_vcf=/projects/p32037/Castilleja_emma/reference_stacks/populations.snps.vcf
OUT=/projects/p32037/Castilleja_emma/reference_stacks/vcftools/caco_pops
```
Calculate:
Allele frequency
`vcftools --gzvcf $pop_vcf --freq2 --out $OUT --max-alleles 2`
Mean depth per indiv
`vcftools --gzvcf $pop_vcf --depth --out $OUT`
Mean depth per site
`vcftools --gzvcf $pop_vcf --site-mean-depth --out $OUT`
Site quality
`vcftools --gzvcf $pop_vcf --site-quality --out $OUT`
Proportion missing data/indiv
`vcftools --gzvcf $pop_vcf --missing-indv --out $OUT`
Proportion missing data/site
`vcftools --gzvcf $pop_vcf --missing-site --out $OUT`
Het and inbred per indiv
`vcftools --gzvcf $pop_vcf --het --out $OUT`
### Moving these outputs into R...
1. Variant quality
Produced a file where the quality is -1 for each site - this is likely because Stacks writes `.` in the VCF QUAL column (populations does not compute a per-site quality score), and vcftools reports the missing value as -1.
2. Variant mean depth
Based on this, it looks like we could set our **minimum to 10x and maximum depth to 50** ( ~ 2x our mean depth)
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 1.00 | 10.50 | 18.40 | 26.52 | 29.82 | 3924.00 |

3. Variant missingness
The mean here is 0.6899 and the median is 0.733, which seems very high - we may have to allow a large proportion of missing data here (50% call rate?).
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.000 | 0.5556 | 0.7333 | 0.6899 | 0.889 | 0.9333 |

4. Minor allele frequency
Maybe set this to 0.1 or 0.15 based on the data?
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.01351 | 0.1000 | 0.18750 | 0.22612 | 0.35000 | 0.5 |

---
### Individual based statistics
1. Mean depth per individual
The range is pretty extreme here, 2.071 - 127.826, suggesting an issue with per-individual sequencing depth.
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 2.071 | 8.997 | 29.206 | 32.607 | 49.057| 127.826 |

2. Proportion of missing data per individual
The proportion of missing data per individual is large, 0.35 - 0.96.
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| 0.3576 | 0.4575 | 0.7652 | 0.6899 | 0.8497 | 0.9611 |

3. Heterozygosity and inbreeding coefficient per individual
Range of -0.17 to 0.39
| min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
| ---- | ------- | ------ | ----- | ------- | --- |
| -0.17667 | 0.03222 | 0.06950 | 0.08637 | 0.12000 | 0.39499 |

Based on these results, here are some settings that may be useful, although I have concerns about the quality and missingness metrics:
`MAF=0.15`
`MISS=0.75`
`MIN_DEPTH=10`
`MAX_DEPTH=50`
# Applying VCF filters
Operating in the reference_stacks directory
First attempt: output as
populations_stackstry1.ouput.vcf, using the populations.snps.vcf file from the populations run (this has all individuals/all populations).
Ran using the "default parameters" from Dylan's notes and the analysis above.
First load the vcftools module
`module load vcftools/0.1.17`
Apply the filters....
```
vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.80 --recode --out populations_stackstry1.ouput.vcf
```
```
vcftools
--vcf populations.snps.vcf
--min-meanDP 5
--minDP 5
--max-meanDP 50
--maxDP 50
--mac 3
--remove-indels
--max-alleles 2
--min-alleles 2
--max-missing 0.80
--recode
--out populations_stackstry1.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 5
--min-meanDP 5
--max-missing 0.8
--out populations_stackstry1.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 0 out of a possible 23789 Sites
No data left for analysis!
Run Time = 1.00 seconds
```
0 sites kept...not great!
Trying again, changing --max-missing to 0.50 and --minDP to 10
```
vcftools
--vcf populations.snps.vcf
--min-meanDP 5
--minDP 10
--max-meanDP 50
--maxDP 50
--mac 3
--remove-indels
--max-alleles 2
--min-alleles 2
--max-missing 0.50
--recode
--out populations_stackstry2.ouput.vcf
```
Output:
```
(base) [ewf7555@quser32 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.50 --recode --out populations_stackstry2.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 10
--min-meanDP 5
--max-missing 0.5
--out populations_stackstry2.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Trying again with 0.49
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 10 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.49 --recode --out populations_stackstry3.ouput.vcf
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 10
--min-meanDP 5
--max-missing 0.49
--out populations_stackstry3.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 145 out of a possible 23789 Sites
Run Time = 0.00 seconds
```
This is the same output; I don't think 145 sites is good enough.
Some populations have a lot of missing data, particularly GM and MW.
Remove GM and MW? We would have to re-run gstacks and populations to do this (?)
Maybe not....
Changed --max-missing to 0.6 and --minDP to 5, and added --maf 0.2
```
(base) [ewf7555@quser34 reference_stacks]$ vcftools --vcf populations.snps.vcf --min-meanDP 5 --minDP 5 --max-meanDP 50 --maxDP 50 --mac 3 --remove-indels --max-alleles 2 --min-alleles 2 --max-missing 0.6 --maf 0.2 --recode --out populations_stackstry3.ouput.vcf
Parameters as interpreted:
--vcf populations.snps.vcf
--mac 3
--maf 0.2
--max-alleles 2
--maxDP 50
--max-meanDP 50
--min-alleles 2
--minDP 5
--min-meanDP 5
--max-missing 0.6
--out populations_stackstry3.ouput.vcf
--recode
--remove-indels
After filtering, kept 45 out of 45 Individuals
Outputting VCF file...
After filtering, kept 13 out of a possible 23789 Sites
```
# Troubleshooting
## Notes on removing "Bad apples"
Primarily from Cerca et al. 2021, "Removing the bad apples: A simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms"
https://doi.org/10.1111/2041-210X.13562
Steps they used:
1. Run STACKs with the complete dataset (all individuals) - this is called the "unclean" dataset
2. Run STACKs for each population individually, then use vcftools to investigate missingness (--missing-indv) and depth of coverage (--depth)
3. With the whole dataset in mind, we labelled individuals as to keep or remove (bad apples), following a general strategy which included: (a) retaining a minimum of two individuals per population or species; (b) designating a threshold for missing data, based on the average missing data for each whole dataset(≥40% missing data for Stygocapitella, ≥30% for Euhadra, ≥65% for Dendrilla, ≥40% for Anthochaera).
## Zoe's code for Clarkia data
This is after running the denovo_map.pl (no filtering)
```
#### Quality filtering for CB. /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter
vcftools --vcf populations.snps.vcf --missing-indv
# It looks like one individual CB-58 failed and CB-73 and CB-77 did not do well. Remove them to keep 32 individuals
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 3568
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 2492
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 458
```
So, trying this approach with my data, starting with the populations.snps.vcf file
Checking CACO data for missing individuals:
Doing this filter in a new directory
`mkdir zfilter`
`vcftools --vcf populations.snps.vcf --missing-indv`
Inspecting the output file out.imiss, it looks like there are quite a few individuals with greater than 50% missing data...
Making a list of these individuals
```
awk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv
##Output file - 32 individuals have greater than 50% missing data
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1506B.sort
IB1519.sort
IB1522B.sort
IB1524B.sort
IB1528B.sort
IB2687.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
PS608B.sort
PS613A.sort
PS614A.sort
PS615A.sort
SP772B.sort
SP790.sort
SP796B.sort
SR908B.sort
SR912.sort
SR914.sort
SR922.sort
```
Maybe see how many have 80% missing data?
```
awk '$5 > 0.8' out.imiss | cut -f1 > verylowDP.indv
#looks like 17 individuals had 80% missing
INDV
DP830.sort
DP831.sort
DP838.sort
GM544.sort
GM556.sort
GM559.sort
GM564.sort
HP801.sort
HP803.sort
HP805.sort
HP807.sort
IB1522B.sort
IB1524B.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```
90%??
```
awk '$5 > 0.9' out.imiss | cut -f1 > superlowDP.indv
#Output 8 individuals
INDV
GM544.sort
GM556.sort
GM559.sort
GM564.sort
MW537.sort
MW538.sort
MW542.sort
MW543.sort
```
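The three awk passes above can be collapsed into one that tallies every cutoff at once (F_MISS is assumed to be column 5 of out.imiss, as in the commands above; `NR > 1` skips the header so the INDV line is not counted):

```shell
#!/bin/bash
# One-pass tally: how many individuals exceed each missingness cutoff 0.5 .. 0.9.
# Assumes the out.imiss layout used above, with F_MISS in column 5.
imiss_tally () {
    awk 'NR > 1 { for (t = 5; t <= 9; t++) if ($5 > t / 10) n[t]++ }
         END    { for (t = 5; t <= 9; t++) printf "> 0.%d\t%d\n", t, n[t] + 0 }' "$1"
}
# imiss_tally out.imiss
```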
Looks like GM and MW are not great in general.
Why would entire populations not sequence well?
I guess we will remove them and continue on for now.
```
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove superlowDP.indv --recode --recode-INFO-all --out subset
Excluding individuals in 'exclude' list
After filtering, kept 37 out of 45 Individuals
Outputting VCF file...
After filtering, kept 23789 out of a possible 23789 Sites
```
**Mean depth values**
Mean depth of 15
```
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 5077 sites
```
Mean depth of 5
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --recode --recode-INFO-all --out fil7
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--min-meanDP 5
--out fil7
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 12512 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
**Now on to the missing data...**
missing data is 10%
```
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # 577 out of a possible 23789 Sites
```
Missing data at most 10% but mean depth of 5:
`vcftools --vcf subset.recode.vcf --min-meanDP 5 --max-missing 0.90 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset`
**Final filtering**
try # 1 with Zoe's parameters
```
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 424 (Zoe's kept 458)
```
Jeremie's idea:
min depth is 5
maf is 0.15
hwe 0.05
no max missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --hwe 0.05 --recode --recode-INFO-all --out filtered.final2.subset
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--maf 0.15
--max-alleles 2
--hwe 0.05
--min-meanDP 5
--out filtered.final2.subset
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 5151 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Jeremie's idea:
min depth is 5
maf is 0.15
no hwe
no max missing at all!
```
(base) [ewf7555@quser34 zfilter]$ vcftools --vcf subset.recode.vcf --min-meanDP 5 --maf 0.15 --recode --recode-INFO-all --out filtered.final3.subset
VCFtools - 0.1.17
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf subset.recode.vcf
--recode-INFO-all
--maf 0.15
--min-meanDP 5
--out filtered.final3.subset
--recode
After filtering, kept 37 out of 37 Individuals
Outputting VCF file...
After filtering, kept 6616 out of a possible 23789 Sites
Run Time = 1.00 seconds
```
Jeremie's thoughts:
Make one data set that is in Hardy-Weinberg equilibrium
and one that is not.
If linkage is evenly distributed, then we don't have to filter for linkage.
**Checking the R² values**
```
vcftools --vcf filtered.final.subset.recode.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
Looking at linkage... a lot of pairs of loci are linked (R² above 0.8, ranging up to 1)
Following https://speciationgenomics.github.io/pca/
```
# Checking output with plink and pruning for linkage
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out plink_zfilter
#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: plink_zfilter-temporary.bed + plink_zfilter-temporary.bim +
plink_zfilter-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink_zfilter.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964686.
424 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Pruned 15 variants from chromosome 27, leaving 10.
Pruned 31 variants from chromosome 28, leaving 15.
Pruned 7 variants from chromosome 29, leaving 8.
Pruned 16 variants from chromosome 30, leaving 12.
Pruned 0 variants from chromosome 31, leaving 1.
Pruned 31 variants from chromosome 32, leaving 13.
Pruned 25 variants from chromosome 33, leaving 10.
Pruned 23 variants from chromosome 34, leaving 10.
Pruned 33 variants from chromosome 35, leaving 13.
Pruned 17 variants from chromosome 36, leaving 8.
Pruned 37 variants from chromosome 37, leaving 10.
Pruned 21 variants from chromosome 38, leaving 10.
Pruned 33 variants from chromosome 39, leaving 14.
Pruned 0 variants from chromosome 40, leaving 1.
Pruning complete. 289 of 424 variants removed.
Marker lists written to plink_zfilter.prune.in and plink_zfilter.prune.out .
Segmentation fault
```
```
plink --vcf filtered.final.subset.recode.vcf --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_zfilter.prune.in --make-bed --pca --out CACO_zfil
#Output
192637 MB RAM detected; reserving 96318 MB for main workspace.
--vcf: CACO_zfil-temporary.bed + CACO_zfil-temporary.bim +
CACO_zfil-temporary.fam written.
424 variants loaded from .bim file.
37 people (0 males, 0 females, 37 ambiguous) loaded from .fam.
Ambiguous sex IDs written to CACO_zfil.nosex .
--extract: 135 variants remaining.
Using up to 51 threads (change this with --threads).
Before main variant filters, 37 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.964765.
135 variants and 37 people pass filters and QC.
Note: No phenotypes present.
Relationship matrix calculation complete.
--pca: Results saved to CACO_zfil.eigenval and CACO_zfil.eigenvec .
--make-bed to CACO_zfil.bed + CACO_zfil.bim + CACO_zfil.fam ... done.
Segmentation fault
```
Going to R to see what we have here...
---
```
### Filtering for SNP panel. Using the 458 SNPs called - has maf 17%, in HWE, 10% missing data and mean depth 15X. /data/clarkia/sv/cb/snp.panel
#### Now have to filter the original fasta file to get the filtered fasta sequences. /data/Dimensions/clarkia/sv/cb/snp.panel/fasta.filter
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one #only keep first column of the vcf file that has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers #remove the 15 header rows of the vcf file
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep #add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa #sort the original fasta to only keep the high quality ones
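#### Toy check of the three matching steps above (demo file names, not part of the pipeline). The grep keeps each matching header line plus the sequence line after it.
printf '12\t55\n98\t30\n' > demo.body                                      # col 1 = locus ID, col 2 = position
cut -f 1 demo.body | sed -e 's/^/>CLocus_/' > demo.keep                    # ">CLocus_12" and ">CLocus_98"
printf '>CLocus_12 [x]\nACGT\n>CLocus_13 [x]\nTTTT\n' > demo.fa
grep -w -A 1 -f demo.keep demo.fa --no-group-separator > demo.filtered.fa  # keeps only CLocus_12 + its sequence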
#### Now we need to sort fasta file to get SNPs with at least 50 bp on either side of SNP. First get the length of characters in each line of the file. Then combine them together
awk '{ print length }' filtered.fasta.fa > count # get the count of each line in the fasta file
sed -n '1~2!p' count > length #keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus #paste the locus ID and the length of that locus
#### Now get the position of each
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file that has the high quality SNPs
sed '1,15d' keep.col.two > positions #remove headers of the vcf file
#### Combine the locus ID, length and position
paste list.to.keep length positions | column -s $'\t' -t > length.position #combine and have them formatted into tabulated columns
#### Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and keep only the loci with a difference greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.
awk '(NR>0) && ($3 > 50 ) ' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 ) ' difference > right.left.50
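#### Toy check of the 50 bp flank filter above (demo data; the real pipeline runs on length.position). The demo puts the difference in column 4 directly; the $5/$4 pairing above still works because awk's default field splitting collapses the empty field the $5 assignment creates.
printf 'locA\t200\t100\nlocB\t200\t30\nlocC\t200\t180\n' > demo.length.position  # id, locus length, SNP position
awk '$3 > 50' demo.length.position > demo.right                                  # drops locB: only 30 bp before the SNP
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' demo.right > demo.diff             # col 4 = bp after the SNP
awk '$4 > 50' demo.diff > demo.both                                              # drops locC: only 20 bp after
# demo.both now holds locA only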
#### Filter original fasta file to only keep these 346 loci
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
### Now we must filter to remove SNPs that are not the same between replicates. Need to filter the vcf file to include just the 346 filtered loci.
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
#### Copy the text below into a new file called vcfkeeping. These are the column names that we want.
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
CB-110
CB-110.2
CB-24
CB-24.2
CB-65
CB-65.2
CB-96
CB-96.2
#### Filter the full vcf file without headers to only keep the columns we want.
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
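#### Toy check of the column-picking one-liner above (demo files only): number the header fields, keep the numbers whose names appear in the keep-list, and feed them to cut.
printf '#CHROM\tPOS\tJUNK\tCB-24\n1\t10\tx\t0/1\n' > demo.tbl
printf '#CHROM\nPOS\nCB-24\n' > demo.keepcols
cut -f"$(head -1 demo.tbl | tr '\t' '\n' | nl | grep -Fwf demo.keepcols | awk '{print $1}' | paste -sd,)" demo.tbl > demo.cut
# demo.cut keeps columns 1, 2 and 4; the JUNK column is gone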
#### Keep only those 346 filtered loci
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
#this keeps ten extra loci, likely because the bare locus numbers also match other numeric fields (e.g. positions) as whole words
#so add a prefix to the chromosome number in the first column of each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
#try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
#now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
#copy and paste header from previous file into new file
( sed 1q rm.ind.chr.vcf; sed 1d replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap --genepop
##### I compared the loci that aren't the same between replicates in Excel after uploading the genepop file. Note that you will lose some loci: if they aren't variable in the replicates, they won't be included in the genepop file. Of the starting 348 loci, we have 310 variable loci in the replicates to compare. The IDs in the genepop file match up with the IDs in the vcf file used to generate it. Make sure to use the GenePop ID from the replicate genepop file, since it has ONLY the variable loci you can compare. I then made a list of those to exclude; now we have 202 SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x).
### Final filtering. /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter. Using the loci that we are keeping. Make a list.to.keep with their locus ids then filter the original fasta file.
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa #sort the original fasta to only keep the high quality ones
#### Now filter the vcf file.
cat list.to.keep | cut -d"_" -f2 > snps.keep #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
#### Filter out linked loci. /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf #can't get this to work
## Checked to see what the R2
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
#### There are two pairs of loci that are linked (i.e. have an R² above 0.8). Remove them and filter the vcf file. We have 200 SNPs
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
#### Run populations. Upload this gene pop file into R to look at heterozygosity of each locus at each population. Decided to keep all 310 loci for now. We can use the final.fasta.fa file to begin developing CB SNP panel.
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop
#### Only keeping a subset of loci that are highly variable across all three populations. Filter final vcf file to only get those highly variable SNPs and remove replicates. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter
##### Run populations again to run genepop file in R
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
##### Filter the final vcf using the 111 most variable SNPs /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#copy and paste header from previous file into new file
( sed 1q filtered.vcf; sed 1d filtered ) > filtered2 && mv filtered2 final.vcf
##### Run populations again
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
#### Get SNP info for all final SNPs and make master spreadsheet with all SNP info.
#Generates a file reporting the missingness per individual. The file has the suffix ".imiss".
vcftools --vcf final.p.snps.vcf --missing-indv
#Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".
vcftools --vcf final.p.snps.vcf --missing-site
#Generates a file containing the mean depth per individual. This file has the suffix ".idepth".
vcftools --vcf final.p.snps.vcf --depth
#Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".
vcftools --vcf final.p.snps.vcf --site-mean-depth
##### Format for submission to sequencing facility
###### Filter fasta to get those we want
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # Make list of loci we want to keep w/ correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter original fasta
###### Re-formatting fasta file so that the locus name is in the first column and the sequence is in the second.
paste -d " " - - < final.fasta.fa > final.fasta2
###### Now doing some re-formatting for the submission form to Minnesota. Need to get the SNP with the major and minor alleles into the fasta file. Make a file with the sequence in the first column, position in the second column, and [allele1/allele2] in the third column. Then run code below (had to get help on this in Stack Exchange).
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp > submission.txt
```
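The final `awk` one-liner above splices the `[allele1/allele2]` notation into the sequence at the SNP position; here is a quick check on a toy `snp` file (tab-separated: sequence, 1-based position, alleles — demo file names only):

```shell
# Toy snp file: sequence<TAB>position<TAB>[allele1/allele2]
printf 'ACGTACGT\t4\t[T/C]\n' > demo.snp

# Same command as above: the base at the given position is replaced
# by the bracketed allele pair
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' demo.snp > demo.out
# demo.out -> ACG[T/C]ACGT
```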
## Trying Zoe's code with the Castilleja data
To-Do:
- Re-run populations for each population individually
- Then we will separate red and yellow
- plink module load for downstream analysis
Proceeding with creating a PCA with these sites, just for practice...
```
plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out CACO
(base) [ewf7555@quser33 plink]$ module load plink/2.001
(base) [ewf7555@quser33 plink]$ plink --vcf $VCF --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out CACO
-bash: plink: command not found
```
The plink module won't load? (The plink/2.001 module may only provide the `plink2` binary, not `plink`.)
# Running STACKS for each population individually
In a new directory "single_pops" I imported the results of the original gstacks run and individual population maps for the 11 populations
Code for the original gstacks run:
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap.txt --rm-pcr-duplicates -O /projects/p32037/Castilleja_emma/single_pops/ -t 8
```
New pop map files (by individual population, all .txt located in /projects/p32037/Castilleja_emma/sorted_bams/popmap/):
popmap_SR
popmap_SP
popmap_PS
popmap_NR
popmap_MW
popmap_MC
popmap_IB2
popmap_IB1
popmap_HP
popmap_GM
popmap_DP
Base population code:
```
populations -P ~/reference_stacks/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap_POPID -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
Trial run with SR population:
```
populations -P /projects/p32037/Castilleja_emma/single_pops/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/popmap_SR.txt -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
got a `Segmentation fault` error - this likely means the command is running out of memory - is this the right approach? Do we need to add additional filters here?
New approach: try running gstacks separately for each population
1. make new directories for each population in the reference stacks folder
Tried this with one population (DP) but got another segmentation fault
---
## Running STACKS with popmap as one population
**Reference assembly run**
Using the sorted bams from the initial process radtags
making a new directory in the "Castilleja_emma" folder for this run
```
mkdir reference_stacks_singlepop
```
Making a new population map with all of them as one population "caco"
saved as `single_pop.txt` in /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt
had to remove IB1527.sort.bam - missing individual
```
module load stacks/2.64-gcc-9.2.0
```
gstacks code for this run:
```
gstacks -I /projects/p32037/Castilleja_emma/sorted_bams -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -O /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -t 8
```
Output looks the same as the individual population map
```
Configuration for this run:
Input mode: reference-based
Population map: '/projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt'
Input files: 45, e.g. '/projects/p32037/Castilleja_emma/sorted_bams/DP830.sort.bam'
Output to: '/projects/p32037/Castilleja_emma/reference_stacks_singlepop/'
Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)
Reading BAM headers...
Processing all loci...
1K...
2K...
5K...
10K...
20K...
50K...
done.
Read 303512925 BAM records:
kept 61320283 primary alignments (20.5%), of which 32416401 reverse reads
skipped 202953667 primary alignments with insufficient mapping qualities (67.9%)
skipped 15528934 excessively soft-clipped primary alignments (5.2%)
skipped 18950917 unmapped reads (6.3%)
skipped some suboptimal (secondary/supplementary) alignment records
Per-sample stats (details in 'gstacks.log.distribs'):
read 6744731.7 records/sample (218894-16733496)
kept 11.5%-25.4% of these
Built 53572 loci comprising 28903882 forward reads and 25516436 matching paired-end reads; mean insert length was 183.6 (sd: 56.9).
Genotyped 53572 loci:
effective per-sample coverage: mean=51.9x, stdev=38.4x, min=4.3x, max=119.2x
mean number of sites per locus: 157.8
a consistent phasing was found for 101650 out of 148375 (68.5%) diploid loci needing phasing
gstacks is done.
```
Running populations using the new pop map
```
populations -P /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -r 0.8 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8
```
`Segmentation fault` stopped here
Dylan suggested reducing the r value to 0.5 (a locus must be present in at least 50% of individuals in a population)
`populations -P /projects/p32037/Castilleja_emma/reference_stacks_singlepop/ -M /projects/p32037/Castilleja_emma/sorted_bams/popmap/single_pop.txt -r 0.5 --vcf --genepop --fstats --smooth --ordered-export --fasta-loci --min-maf 0.05 --hwe -t 8`
move vcf file into new directory for filtering
```
mkdir vcf
cp /projects/p32037/Castilleja_emma/reference_stacks_singlepop/populations.snps.vcf ./
```
Filtering for LD with PLINK - SNPfiltR dataset
```
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.1 --out plink_CACO_filter
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.1
--out plink_CACO_filter
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:23:19 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter-temporary.pgen + plink_CACO_filter-temporary.pvar +
plink_CACO_filter-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter-temporary.psam.
4927 variants loaded from plink_CACO_filter-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 4419/4927 variants removed.
Variant lists written to plink_CACO_filter.prune.in and
plink_CACO_filter.prune.out .
End time: Wed Apr 3 22:25:07 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.5 --out plink_CACO_filter.5
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.5 --out plink_CACO_filter.5
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.5.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.5
--out plink_CACO_filter.5
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:29:37 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter.5-temporary.pgen + plink_CACO_filter.5-temporary.pvar
+ plink_CACO_filter.5-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter.5-temporary.psam.
4927 variants loaded from plink_CACO_filter.5-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 2152/4927 variants removed.
Variant lists written to plink_CACO_filter.5.prune.in and
plink_CACO_filter.5.prune.out .
End time: Wed Apr 3 22:29:37 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.3 --out plink_CACO_filter.5
(base) [ewf7555@quser33 plink_snpfiltR]$ plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --indep-pairwise 50 10 0.3 --out plink_CACO_filter.5
PLINK v2.00a2LM 64-bit Intel (24 Jul 2019) www.cog-genomics.org/plink/2.0/
(C) 2005-2019 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink_CACO_filter.5.log.
Options in effect:
--allow-extra-chr
--double-id
--indep-pairwise 50 10 0.3
--out plink_CACO_filter.5
--set-missing-var-ids @:#
--vcf CACO.filtered.vcf.gz
Start time: Wed Apr 3 22:30:34 2024
192784 MiB RAM detected; reserving 96392 MiB for main workspace.
Using up to 52 threads (change this with --threads).
--vcf: 4927 variants scanned.
--vcf: plink_CACO_filter.5-temporary.pgen + plink_CACO_filter.5-temporary.pvar
+ plink_CACO_filter.5-temporary.psam written.
34 samples (0 females, 0 males, 34 ambiguous; 34 founders) loaded from
plink_CACO_filter.5-temporary.psam.
4927 variants loaded from plink_CACO_filter.5-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
--indep-pairwise (11 compute threads): 3157/4927 variants removed.
Variant lists written to plink_CACO_filter.5.prune.in and
plink_CACO_filter.5.prune.out .
End time: Wed Apr 3 22:30:34 2024
(base) [ewf7555@quser33 plink_snpfiltR]$
```
Okay, looks good... sticking with --indep-pairwise 50 10 0.3 (note that the 0.3 run reused `--out plink_CACO_filter.5`, so `plink_CACO_filter.5.prune.in` now holds the 0.3 list)
```
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.5.prune.in --make-bed --pca --out caco
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.5.prune.in --make-bed --out caco
plink2 --bfile caco --pca --allow-extra-chr
plink2 --bfile caco --recode vcf --allow-extra-chr --out caco_ld
```
```
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.2.prune.in --make-bed --pca --out caco_2
plink2 --vcf CACO.filtered.vcf.gz --double-id --allow-extra-chr --set-missing-var-ids @:# --extract plink_CACO_filter.2.prune.in --make-bed --out caco_2
plink2 --bfile caco_2 --pca --allow-extra-chr
plink2 --bfile caco_2 --recode vcf --allow-extra-chr --out caco_ld_2
```