---
title: "Brendan's DENU pt1 - STACKS and SNP Discovery"
author: "Brendan Connolly (p31939)"
breaks: false
tags: RADSeq
---
### Connecting to Servers
```
Quest:
ssh bjc8165@quest.northwestern.edu
Curie:
ssh bjc8165@10.2.0.53
cd /data/labs/Fant/Connolly
```
### Transfer things to JBOD (run on Curie)
```
rsync -rtu --delete SOURCE root@x.x.x.x:/DESTINATION
### Example
#Make md5sum file for everything you're transferring over
for i in *.gz;
do
md5sum $i > $i.md5
done
#Transfer Files Over
rsync -rtu --delete bjc8165@quest.northwestern.edu:/projects/p31939/DENUddRAD/RawReads /data/labs/Fant/Connolly/DENU_ddrad/raw
#Check integrity of transferred files
md5sum -c DENU-3_S3_L001_R1_001.fastq.gz.md5
```
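To verify every transferred file at once rather than one at a time, loop over the .md5 sidecar files (a minimal sketch; it assumes the .md5 files sit in the same directory as the transferred .gz files):
```
#Check all files against their .md5 sidecar files at once
for f in *.md5; do md5sum -c "$f"; done
```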
### Transferring things to local SSD
```
#Run in powershell to set local location to external SSD (D:)
Set-Location D:
#Run in powershell to transfer to SSD
scp -r bjc8165@quest.northwestern.edu:/YOUR/DIRECTORY/TO/TRANSFER D:/DESTINATION/DIRECTORY
###Example
scp -r bjc8165@quest.northwestern.edu:/projects/p31939/DENUddRAD D:/GeneticsBackup/
#Run this in powershell to get text file of every file's md5hash - only files with . in name
Get-FileHash -Algorithm MD5 -Path (Get-ChildItem "D:\GeneticsBackup\DENU_Amplicon\Raw\Trimmed\*.*" -Recurse -force) | export-csv D:\GeneticsBackup\DENU_Amplicon\Raw\Trimmed\trimmed.csv
#Run this in the directory on unix to generate md5sum for every file in the directory
find . -type f -exec md5sum '{}' \; > md5sum.txt
#compare md5 hashes in excel to ensure file integrity
```
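Rather than comparing the hashes in Excel, the two lists can be diffed on the Unix side. A sketch, assuming the Hash column of trimmed.csv has been exported to a plain text file (the name windows.hashes is hypothetical):
```
#Keep only the hashes, lowercase and sorted, then diff the two sets
awk '{print tolower($1)}' md5sum.txt | sort > unix.hashes.sorted
tr 'A-Z' 'a-z' < windows.hashes | sort > windows.hashes.sorted
diff unix.hashes.sorted windows.hashes.sorted && echo "All hashes match"
```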
## Stacks
A really good tutorial on how Stacks works: https://catchenlab.life.illinois.edu/stacks/param_tut.php
## Process_RadTags Script
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=2 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=25GB ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=DENU_S1_radtags ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=S1_RadTags.error
#SBATCH --output=S1_RadTags.out
module purge
module avail
module load stacks/2.64-gcc-9.2.0
process_radtags -1 /projects/p31939/DENUddRAD/DENU-1_S1_L001_R1_001.fastq.gz -2 /projects/p31939/DENUddRAD/DENU-1_S1_L001_R2_001.fastq.gz -i gzfastq -b /projects/p31939/DENUddRAD/s1_bars -o ./S1_Radtags --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 2
```
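For reference, the barcode file passed with -b (s1_bars) uses the standard Stacks format for --inline_null: one inline barcode per line, optionally followed by a tab and the sample name. A hypothetical excerpt (the barcode-to-sample pairings here are made up for illustration):
```
AATTA	184_Control
ATACG	185_Control
CTTGG	65_Flav
```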
This works, but a lot of reads are dropped: cutsites were not found in ~70% of reads. It still retains around 18 million reads. Will try to troubleshoot: play around with the amount trimmed (-t), and add a linker nucleotide to the barcodes (add T).
### Notes for Troubleshooting (6/27/23)
- a lot of cutsites were not found and were dropped across sublibraries (~70%). Not sure why.
- In the process_radtags log, six barcodes that I didn't use (#s 42-48: TCACG, TCAGT, TCCGG, TCTGC, TGGAA, TTACC) show up in every sublibrary, with ~1 million reads found for each. This is a problem in every sublibrary and doesn't make sense. Contamination? (counted directly in the sketch below)
- In every sublibrary, samples that were in wells C2 (AATTA) and G2 (ATACG) have very low read counts for those barcodes, ~10,000. Not sure if there was a problem with these barcodes, or if they received one of the above barcodes by accident.
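To check whether those unexpected barcodes really are present in the raw reads, the inline barcode at the start of read 1 can be counted directly. A sketch (the fastq path is the sublibrary 1 file from the script above):
```
#Count R1 reads that begin with each unexpected barcode
for bc in TCACG TCAGT TCCGG TCTGC TGGAA TTACC; do
  n=$(zcat /projects/p31939/DENUddRAD/DENU-1_S1_L001_R1_001.fastq.gz | awk 'NR % 4 == 2' | grep -c "^$bc")
  echo "$bc $n"
done
```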
### Notes for troubleshooting (7/6/23)
- Ran process_radtags with the 6 above barcodes added to my barcodes .txt file, to see whether more reads are retained now.
Continuing with the pipeline..
## De Novo Assembly
### Example Script
```
#!/bin/bash
#SBATCH --account=p31939 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00 ## Required: How long will the job need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1 ## how many computers/nodes do you need (no default)
#SBATCH --ntasks-per-node=4 ## how many cpus or processors do you need on per computer/node (default value 1)
#SBATCH --mem-per-cpu=20G ## how much RAM do you need per computer/node (this affects your FairShare score so be careful to not ask for more than you need)
#SBATCH --job-name=DeNova_433_63nf5drop ## When you run squeue -u
#SBATCH --mail-type=ALL ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=DeNova_433_63nf5drop.error # Enter a error log file name
#SBATCH --output=DeNova_433_63nf5drop.out # Enter your out.log file name
module purge
module load stacks/2.64-gcc-9.2.0 ##specify environment
denovo_map.pl -m 4 -M 3 -n 3 --samples /projects/p31939/DENUddRAD/demultiplex_70_63nf5_dropped --popmap /projects/p31939/DENUddRAD/pop_map_63nf5drop.txt -o /projects/p31939/DENUddRAD/DeNovo_433_70_63nf5drop -T 4 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```
### Notes (7/6/23)
The script fails unless 63_nf5 is dropped, likely because of its very low read count. However, with default parameters (4_3_3) only 93 loci are retained. I dropped all samples with fewer than 10,000 reads to try to retain more loci (a per-sample read-count sketch follows the rm command below):
- (ATACG) 184_control: 2617 reads
- (CTTGG) 65_flav: 346 reads
- (ATACG) 2,3,141,142-NF2: 2943 reads
- (CTTGG) 60-nf5: 469 reads
- (ATACG) 185_control: 2992 reads
- (CTTGG) 63-65_NF5: 318 reads
- (ATACG) 172_control: 2309 reads
- (CTTGG) 63-65_NF4: 368 reads
```
rm 184_Control.1.fq.gz 184_Control.2.fq.gz 65_Flav.1.fq.gz 65_Flav.2.fq.gz 2_3_141_142_NF2.1.fq.gz 2_3_141_142_NF2.2.fq.gz 60_NF5.1.fq.gz 60_NF5.2.fq.gz 185_Control.1.fq.gz 185_Control.2.fq.gz 63-65_NF5.1.fq.gz 63-65_NF5.2.fq.gz 172_Control.2.fq.gz 172_Control.1.fq.gz 63-65_NF4.1.fq.gz 63-65_NF4.2.fq.gz
```
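The read counts above can be pulled straight from the demultiplexed files, which makes it easy to spot samples under 10,000 reads without opening each log (a minimal sketch; 4 fastq lines = 1 read):
```
#Report the read count of every sample, smallest last
for f in *.1.fq.gz; do echo "$f $(( $(zcat "$f" | wc -l) / 4 ))"; done | sort -k2 -nr
```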
### Notes (7/14/2023)
When I run the de novo pipeline on the demultiplexed samples (with ~70% of reads dropped for unrecognizable cutsites), at most ~230 loci are retained, even after adjusting the de novo parameters. This is too few for amplicon design.
https://docs.google.com/spreadsheets/d/1t60fjJ_DagiukDfUUVaYlj_-WA0dR_P9ZP4KDDpHTVE/edit#gid=0
When checking for cutsites is disabled in process_radtags, I have ~80M reads left, but the de novo assembly then fails to retain any loci since all are filtered out.
- FastQC reports show low quality in the cutsite region. There may be too many mismatches there for the de novo assembly to retain any loci.
- Solution: trim the first 10 bp from every read and then run the de novo assembly.
#### Trimming off cutsites from de-multiplexed samples
Run all of the following:
```
#build per-mate trimming loops (seqtk trimfq -b 5 trims the first 5 bases of each read)
for i in *.1.fq.gz; do echo "seqtk trimfq -b 5 $i > ${i//.1.fq.gz/_trim.1.fq.gz}"; done > trimloop1.sh
for i in *.2.fq.gz; do echo "seqtk trimfq -b 5 $i > ${i//.2.fq.gz/_trim.2.fq.gz}"; done > trimloop2.sh
cat trimloop1.sh trimloop2.sh > trimloop.sh
module load seqtk/Dec17
bash trimloop.sh
#remove the per-mate loop scripts and the untrimmed originals (base names ending in a digit or l/o/v), keeping the *_trim files
rm *1.sh *2.sh *1.1.fq.gz *1.2.fq.gz *2.1.fq.gz *2.2.fq.gz *3.1.fq.gz *3.2.fq.gz *4.2.fq.gz *4.1.fq.gz *5.1.fq.gz *5.2.fq.gz *l.1.fq.gz *l.2.fq.gz *o.1.fq.gz *o.2.fq.gz *v.1.fq.gz *v.2.fq.gz
#rename the trimmed files back to the original names
for i in *.1.fq.gz; do echo "mv $i ${i//_trim.1.fq.gz/.1.fq.gz}"; done > renameloop1.sh
for i in *.2.fq.gz; do echo "mv $i ${i//_trim.2.fq.gz/.2.fq.gz}"; done > renameloop2.sh
cat renameloop1.sh renameloop2.sh > renameloop.sh
bash renameloop.sh
#seqtk writes plain text despite the .gz name, so strip the last 3 characters (.gz) and re-gzip
for file in *; do mv "$file" "${file%???}"; done
for i in *.fq; do echo "gzip $i"; done > gziploop_150bp.sh
bash gziploop_150bp.sh
```
#### FastQC of all demultiplexed files with first 10 bp removed
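A minimal sketch of the FastQC run itself (the module name and output directory are assumptions):
```
module load fastqc
mkdir -p fastqc_trim10
fastqc -o fastqc_trim10 *.fq.gz
```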

### Removing FQ files with missing data (Done by checking file size)
To get more loci, files with a lot of missing data were removed, using three size cutoffs:
1.) 76 biggest samples + Set of Tech 1_4_42_43_NF2 = 80 samples
```
find . -type f -size +1200k
```
2.) 56 biggest + Set of Tech 1_4_42_43_NF2 = 60 samples
```
find . -type f -size +1500k
```
3.) 36 biggest samples + Set of Tech 1_4_42_43_NF2 = 40 samples
```
find . -type f -size +2200k
```
Copy and paste the file list into Excel, sort A-Z, and make sure both *.1.fq.gz and *.2.fq.gz mates are included. Then, using CONCAT in Excel, make a text file with one cp command per file in the following format (or build the list in bash, as in the sketch after the example):
```
cp 103_Appo.2.fq.gz /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/demulti_nocutcheck_trim10_1200kbdrop
```
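The Excel CONCAT step can also be done directly in bash. A sketch that builds the same cp commands from the find output (the destination is the directory from the example above; copyloop.sh is just an illustrative name):
```
#Build one cp command per mate of every sample over the size cutoff, keeping both mates
DEST=/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/demulti_nocutcheck_trim10_1200kbdrop
find . -type f -size +1200k -name "*.fq.gz" | sed 's#.*/##; s/\.[12]\.fq\.gz$//' | sort -u \
  | while read s; do echo "cp $s.1.fq.gz $s.2.fq.gz $DEST"; done > copyloop.sh
bash copyloop.sh
```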
### Run de novo again after removing samples with a low # of loci
## SNP Filtering - VCFtools
### Filtering overview:
- maf = 0.05
- hwe = 0.05
- mac = 3
- not filtering for DP
- make sure there are 50 bp before and after each SNP
- no linked SNPs
- replicates not variable?
```
module load vcftools/current
vcftools --vcf populations.snps.vcf --min-meanDP 0 --mac 3 --max-missing 0.2 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset
vcftools --vcf populations.snps.vcf --max-missing 0.2 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset
rm out.log out.imiss fil1.log fil1.recode.vcf filtered.final.subset.log filtered.final.subset.recode.vcf
```
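A quick sanity check on how many SNPs survive the filters above (the file name comes from the --out prefix; VCF header lines start with #):
```
#Count non-header lines, i.e. SNPs remaining after filtering
grep -vc "^#" filtered.final.subset.recode.vcf
```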
## Filtering Fasta file
Make a new directory for fasta filtering:
```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
```
### 50bp Filtering Script
```
### Before starting script, make a file called "vcfkeeping" with the following lines:
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
Tech_168_Control_S1
Tech_168_Control_S2
Tech_168_Control_S3
Tech_168_Control_S4
Tech_1_4_42_43_NF2_S1
Tech_1_4_42_43_NF2_S2
Tech_1_4_42_43_NF2_S3
Tech_1_4_42_43_NF2_S4
###Script start###
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one
sed '1,15d' keep.col.one > remove.headers
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa
awk '{ print length }' filtered.fasta.fa > count
sed -n '1~2!p' count > length
paste -d \\n list.to.keep length > length.locus
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two
sed '1,15d' keep.col.two > positions
paste list.to.keep length positions | column -s $'\t' -t > length.position
awk '(NR>0) && ($3 > 50 ) ' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 ) ' difference > right.left.50
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
#Creates VCF file with only replicates and above filtered loci
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
head -n1 rm.ind.chr.vcf > replicateheaders.vcf
cat replicate.vcf >> replicateheaders.vcf
rm replicate.vcf
mv replicateheaders.vcf replicate.vcf
head -n14 filtered.final.subset.recode.vcf > vcfheaders.vcf
cat replicate.vcf >> vcfheaders.vcf
rm replicate.vcf
mv vcfheaders.vcf replicate.vcf
#Creates VCF file with all samples but only above filtered loci
sed -e 's/^/chr/' allvcf.headers > all.50bpfiltered.chr.vcf
grep -w -F -f filtered.chr all.50bpfiltered.chr.vcf > all.50bpfiltered.vcf
awk -F'chr' '{print $2}' all.50bpfiltered.vcf > all.50bp.vcf
head -n1 all.50bpfiltered.chr.vcf > allheaders.vcf
cat all.50bp.vcf >> allheaders.vcf
rm all.50bpfiltered.vcf
mv allheaders.vcf 50bpfiltered.vcf
head -n14 filtered.final.subset.recode.vcf > vcfheaders2.vcf
cat 50bpfiltered.vcf >> vcfheaders2.vcf
rm 50bpfiltered.vcf
mv vcfheaders2.vcf 50bpfiltered.vcf
rm rm.ind.chr.vcf reps.filtered.vcf rm.ind.vcf filtered.chr right.50 difference right.left.50 keep.col.one remove.headers length count keep.col.two length.position positions length.locus list.to.keep filtered.chrom filtered.id allvcf.headers all.50bpfiltered.chr.vcf all.50bp.vcf filtered.fasta.fa
```
### 50bp Filtering Script Annotated
```
##only keep the first column of the vcf file, which has the SNPs that passed quality filtering
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one
##remove all the header rows
sed '1,15d' keep.col.one > remove.headers
#add the fasta label to be able to match below
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep
#filter the original fasta to only keep the high quality loci
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa
```
Now we need to sort the fasta file to get SNPs with at least 50 bp on either side of the SNP. First get the length of each line of the file, then combine them together.
```
##get the character count of each line in the fasta file
awk '{ print length }' filtered.fasta.fa > count
#keep every other line, so you only have the length of the sequence (not the locus ID)
sed -n '1~2!p' count > length
#paste the locus ID and the length of that locus
paste -d \\n list.to.keep length > length.locus
```
Now get the position of each SNP.
```
#only keep the second column of the vcf file, which has the positions of the high quality SNPs
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two
#remove the headers of the vcf file
sed '1,15d' keep.col.two > positions
```
Combine the locus ID, length, and position.
```
##combine and format them into tabulated columns
paste list.to.keep length positions | column -s $'\t' -t > length.position
```
Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position), and only keep the loci with a difference greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.
```
awk '(NR>0) && ($3 > 50 ) ' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 ) ' difference > right.left.50
```
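A worked example of the two filters, using a hypothetical locus:
```
#Hypothetical row in length.position: ">CLocus_12  142  83" (ID, length, position)
#right.50      : $3 = 83 > 50, so there are >50 bp before the SNP -> kept
#difference    : appends $5 = 142 - 83 = 59 (leaving $4 empty)
#right.left.50 : default whitespace splitting collapses the empty field,
#                so the 59 is read as $4; 59 > 50 -> kept
```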
Filter the original fasta file to only keep these loci:
```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```
Now we must filter out SNPs that are not the same between replicates. First we need to filter the vcf file down to just the 346 filtered loci.
```
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```
Copy the text below into a new file called vcfkeeping. These are the column names that we want to keep. This only needs to include one set of technical replicates - the other two had a lot of missing data - but it might be worth including Tech_1_4_42_43_NF2, as done below.
```
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
Tech_168_Control_S1
Tech_168_Control_S2
Tech_168_Control_S3
Tech_168_Control_S4
Tech_1_4_42_43_NF2_S1
Tech_1_4_42_43_NF2_S2
Tech_1_4_42_43_NF2_S3
Tech_1_4_42_43_NF2_S4
```
Filter the full vcf file without headers to only keep the columns/samples we want.
```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```
Keep only those 346 filtered loci
```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
##for some reason this is keeping ten extra loci
##so try adding prefix to chromosome number to first column in each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
#try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
#now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
#copy and paste header from previous file into new file
head -n1 rm.ind.chr.vcf > replicateheaders.vcf
cat replicate.vcf >> replicateheaders.vcf
rm replicate.vcf
mv replicateheaders.vcf replicate.vcf
```
Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.
```
head -n14 filtered.final.subset.recode.vcf > vcfheaders.vcf
cat replicate.vcf >> vcfheaders.vcf
rm replicate.vcf
mv vcfheaders.vcf replicate.vcf
```
### Are there any variable SNPs in replicates?
```
mkdir replicates.filter
module load stacks/2.64-gcc-9.2.0
populations -V ./replicate.vcf -O ./replicates.filter -M /projects/p31939/DENUddRAD/RadTags_cutcheck_trim10_1200kbDrop/pop_map_1500kbDrop_with2reps --genepop
***OR***
populations -V ./replicate.vcf -O ./replicates.filter -M /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/pop_map_2200kbDrop_with2reps --genepop
```
Download the genepop file, open it in Excel, and see which variant sites are the same within the tech replicates - we want to keep those. Record the locus #s from the genepop file. Next we want to filter the original fasta file (populations.loci.fa) to only keep loci that passed all filtering. First, let's make a list of the loci to keep (a VCF-based cross-check of the replicates follows the block below).
```
#pull the loci #s from the genepop file - the loci numbers here refer to the line number of the snp from the replicate.vcf file
sed -n '2p' replicate.p.snps.genepop > replicate.variable.loci
#Turn the single row of loci into one locus per line
tr , "\n" < replicate.variable.loci > replicate.variable.loci2
#Delete the "_0" that follows each locus
sed 's/..$//' < replicate.variable.loci2 > replicate.variable.loci3
#Make a new file listing the replicate loci we want removed
sed '5d; 23d; 24d; 28d; 32d; 34d; 37d; 38d; 39d; 41d; 42d; 45d; 48d; 62d; 64d' replicate.variable.loci3 > replicate.variable.loci4
cp replicate.variable.loci4 /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_1200kbdrop_with2reps/fasta.filter
rm replicate.variable.loci replicate.variable.loci2 replicate.variable.loci3 replicate.variable.loci4
cd ..
#Pull column 1 from the vcf with loci that passed quality and bp filtering
cut -f 1 replicate.vcf | sed 's/[\t]/,/g' > rep.col.one
#Remove the headers of this vcf
sed '1,15d' rep.col.one > remove.headers.rep
#Make our final list of loci to keep - this command filters the original vcf to remove loci by line numbers (i.e. how loci were named in the .genepop file)
sed "$(sed 's/$/d/' replicate.variable.loci4)" remove.headers.rep > final.loci.to.keep
rm remove.headers.rep rep.col.one
```
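As a cross-check on the Excel comparison, genotype mismatches between technical replicates can be flagged straight from replicate.vcf. A sketch, assuming the four Tech_168_Control replicates are the first four sample columns (columns 10-13, per the vcfkeeping order - verify this before trusting it):
```
#Print CHROM/POS wherever the genotype call (subfield 1 of the sample column)
#differs among the four Tech_168_Control replicates
awk 'BEGIN{FS="\t"} !/^#/ {
  split($10,a,":"); split($11,b,":"); split($12,c,":"); split($13,d,":")
  if (a[1] != b[1] || a[1] != c[1] || a[1] != d[1]) print $1, $2, a[1], b[1], c[1], d[1]
}' replicate.vcf
```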
Filter the original VCF
```
###############
## ##
## ANNOTATED ##
## ##
###############
#get only the chromosome number
cat final.loci.to.keep | cut -d"_" -f2 > snps.keep
#remove header
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
#add 'chr'
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr
#add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr
#filter
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr
#now remove the chr string
awk -F'chr' '{print $2}' filtered.chr > filtered
#copy and paste header from previous file into new file
head -n1 allvcf.headers > headers.temp
cat filtered >> headers.temp
mv headers.temp filtered.vcf
###############
## ##
## Script ##
## ##
###############
cat final.loci.to.keep | cut -d"_" -f2 > snps.keep
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr
sed -e 's/^/chr/' snps.keep > snps.keep.chr
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr
awk -F'chr' '{print $2}' filtered.chr > filtered
head -n1 allvcf.headers > headers.temp
cat filtered >> headers.temp
mv headers.temp filtered.vcf
rm snps.keep allvcf.headers allvcf.headers.chr snps.keep.chr filtered.chr
```
### Testing linkage
```
mkdir linkage.filter
cp filtered.vcf ./linkage.filter
cd linkage.filter
##Need to load plink via conda
conda activate plink
conda install -c bioconda plink
#Make Plink file
vcftools --vcf filtered.vcf --plink --out filtered.plink
#Make it binary - sometimes this doesn't work; if so, manually add an 'a' to the end of the first chromosome number > 23 in the plink.map file
plink --file filtered.plink --make-bed --out binary --noweb --allow-extra-chr
#note from Nora: this will add "scaffold_" to each locus number, fixing the "invalid chromosome" issue:
sed -E "s/^([0-9])/scaffold_\1/g" filtered.vcf > filtered.scaffold.vcf
#Creating table of all pairwise snps that have a r2 value greater than .8
plink --bfile binary --r2 inter-chr --ld-window-r2 0.8 --allow-extra-chr --noweb --out links
```
Note that since these SNPs are linked, you should be able to keep one SNP from each pair. To do this, Nora made a list of loci to remove. This grabs the column with SNP locus A:
```
awk '{print $3}' links.ld | cut -f1 -d":" > linkA
```
And this will remove any duplicate values in that file:
```
awk '!seen[$0]++' linkA > snps.with.linkage
#get rid of the column name
sed '1d' snps.with.linkage > snps.with.linkage1
#add "chr" to the snps to drop
sed -e 's/^/chr/' snps.with.linkage1 > chr.snps.with.link
```
Now remove all the loci listed in chr.snps.with.link from the vcf:
```
#add "chr" to the vcf as well, so the whole-word match only hits column 1
sed -e 's/^/chr/' filtered.vcf > filtered.chr.vcf
grep -wv -F -f chr.snps.with.link filtered.chr.vcf > filtered2.chr.vcf
#remove "chr"
awk -F'chr' '{print $2}' filtered2.chr.vcf > filtered2.vcf
```
Do the plink analysis again with those SNPs removed and see whether there are still linked SNPs. Repeat iteratively until there are no longer any r2 values above 0.8 (inspect the output, e.g. `nano links.ld`). Nora didn't have any more linked SNPs after the first round.
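A sketch of one re-check round (file names follow the steps above; the output prefixes are arbitrary):
```
#Rebuild the plink input from the pruned VCF and re-test for linkage
vcftools --vcf filtered2.vcf --plink --out filtered2.plink
plink --file filtered2.plink --make-bed --out binary2 --noweb --allow-extra-chr
plink --bfile binary2 --r2 inter-chr --ld-window-r2 0.8 --allow-extra-chr --noweb --out links2
wc -l links2.ld   #only the header line left = no remaining pairs with r2 > 0.8
```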
## Parameters summarized
- No checking for cutsite
- Stacks: m = 3; minimum raw reads to form a stack
- Stacks: M = 6; number of mismatches allowed between allele stacks to form loci
- Stacks: n = 6; number of mismatches allowed between loci stacks to form the catalog
- Stacks: r = 0.20; % of individuals that must possess a locus
- MAF = 0.05
- HWE = 0.05; everything in equilibrium
- MAC = 3
- No filtering on minimum mean depth of coverage
- 50 bp on either side of each SNP
## Final Loci for designing Primers:
/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_2200kbdrop_with2reps/fasta.filter/Connolly_96_loci_24Oct.fasta.fa
/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_2200kbdrop_with2reps/fasta.filter/Connolly_extra_loci_20Oct.fasta.fa
## Trying Stacks with reference genome
```
INDS=($(for i in ./Radtags_trim_cutcheck_2200kbdrop/*.1.fq.gz; do echo $(basename ${i%.1*}); done))
#set per-sample file paths (note: no spaces around "=" in bash assignments)
for IND in ${INDS[@]}; do
  FORWARD=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/Radtags_trim_cutcheck_2200kbdrop/${IND}.1.fq.gz
  REVERSE=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/Radtags_trim_cutcheck_2200kbdrop/${IND}.2.fq.gz
  OUTPUT=~/align/${IND}_sort.bam
done
```
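The loop above sets FORWARD, REVERSE, and OUTPUT but never calls an aligner. A minimal sketch of the intended alignment step, assuming bwa and samtools are available and the Aquilegia coerulea reference has been indexed with bwa index (the module names and the REF filename are assumptions):
```
module load bwa samtools   #module names are assumptions
REF=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/reference.fna   #hypothetical filename
mkdir -p ~/align
for IND in ${INDS[@]}; do
  FORWARD=./Radtags_trim_cutcheck_2200kbdrop/${IND}.1.fq.gz
  REVERSE=./Radtags_trim_cutcheck_2200kbdrop/${IND}.2.fq.gz
  OUTPUT=~/align/${IND}_sort.bam
  #align the pair and sort straight to BAM
  bwa mem -t 4 "$REF" "$FORWARD" "$REVERSE" | samtools sort -o "$OUTPUT" -
done
```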