# SNP panel development for Clarkia concinna concinna and Clarkia breweri
## Demultiplexing. First concatenate the two parts of each forward and reverse read file (the data were delivered in two parts to keep file sizes smaller). I didn't label the libraries correctly on the order form, so they are named AmSub1 or AmSub2. These data are in /data/clarkia/sv/raw, with a directory for each library.
```
cat AmSub1-Fant09_S1_L001_R1_001_part1.fastq.gz AmSub1-Fant09_S1_L001_R1_001_part2.fastq.gz > Clarkia1_Fant09_S1_L001_R1_001.fastq.gz
cat AmSub1-Fant09_S1_L001_R2_001_part1.fastq.gz AmSub1-Fant09_S1_L001_R2_001_part2.fastq.gz > Clarkia1_Fant09_S1_L001_R2_001.fastq.gz
```
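A quick sanity check: the read count in the combined file should equal the sum of the two parts (concatenated gzip files are still valid gzip). A minimal sketch, assuming the R1 file names above:
```
# Each fastq record is 4 lines, so divide the line count by 4
for f in AmSub1-Fant09_S1_L001_R1_001_part1.fastq.gz \
         AmSub1-Fant09_S1_L001_R1_001_part2.fastq.gz \
         Clarkia1_Fant09_S1_L001_R1_001.fastq.gz; do
  echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
done
```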
### Use process_radtags in STACKS for demultiplexing. These are paired-end data, so we specify which file is the forward/first read and which is the second. The output folder for the demultiplexed samples is /data/clarkia/sv/samples; the other process_radtags outputs are in the raw folder. Sequences were not trimmed.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
process_radtags -1 ./Clarkia1_Fant09_S1_L001_R1_001.fastq.gz -2 ./Clarkia1_Fant09_S1_L001_R2_001.fastq.gz -i gzfastq -o /data/clarkia/sv/samples -b barcodes --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -s 10
```
### Check read quality with FastQC. This is done in the raw directory for each library. First combine the forward and reverse read files.
```
cat Clarkia2_Fant09_S2_L001_R1_001.fastq.gz Clarkia2_Fant09_S2_L001_R2_001.fastq.gz > allsamples.fq.gz
```
#### Then make a script to run the job. This is done in the raw directory for each library.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>fastqlog.out 2>&1
fastqc allsamples.fq.gz
```
## Clarkia breweri. /data/clarkia/sv/cb/snp.calling
### Optimization of STACKS parameters. This is done in /data/clarkia/sv/cb/snp.calling/opt.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
#Create output directories and define variables for output folders (so the code is cleaner below)
mkdir /data/Dimensions/clarkia/sv/cb/snp.calling/opt/m$1_M$2_n$3
OUT=/data/Dimensions/clarkia/sv/cb/snp.calling/opt/m$1_M$2_n$3
mv log.out ${OUT}/
denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/opt/popmap2 -o ${OUT}/ -T 25 -X "ustacks: --min_cov $1" -X "ustacks: --max_dist $2" -X "cstacks: -n $3" -X "populations:--min-maf 0.05" -X "populations: --write-random-snp" -X "populations: --plink"
OUT=/data/Dimensions/clarkia/sv/cb/snp.calling/opt/m$1_M$2_n$3
cd ${OUT}/
plink --file populations.plink --cluster --mds-plot 2 --noweb
```
#### Run optimizations
```
sh opt.sh 3 2 2 # run
sh opt.sh 4 2 2 # run
sh opt.sh 5 2 2 # run
sh opt.sh 3 3 3 # run
sh opt.sh 3 1 1 # run
sh opt.sh 3 4 4 # run
sh opt.sh 4 3 3 # run
sh opt.sh 4 1 1 # run
```
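The same runs can be driven with a loop; a minimal sketch, assuming opt.sh is the script above and takes m, M, and n as its three positional arguments:
```
# Each triplet is m (ustacks --min_cov), M (ustacks --max_dist), n (cstacks -n)
for params in "3 2 2" "4 2 2" "5 2 2" "3 3 3" "3 1 1" "3 4 4" "4 3 3" "4 1 1"; do
  sh opt.sh $params   # leave $params unquoted so it expands to three separate arguments
done
```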
### Final SNP calling. /data/clarkia/sv/cb/snp.calling/final. Using default parameters, with populations filters so that a locus has to be present in all three populations (-p 3) and in at least 50% of individuals in a population (-r 0.5), with a minimum minor allele frequency of 17%.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap -o /data/Dimensions/clarkia/sv/cb/snp.calling/final -T 25 -X "ustacks: --min_cov 3" -X "ustacks: --max_dist 2" -X "cstacks: -n 2" -X "populations:--min-maf 0.17" -X "populations:-r .5" -X "populations:-p 3" -X "populations: --write-random-snp" -X "populations: --vcf" -X "populations: --genepop" -X "populations: --fasta-loci"
```
#### Quality filtering for CB. /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter
```
vcftools --vcf populations.snps.vcf --missing-indv
# It looks like one individual (CB-58) failed, and CB-73 and CB-77 did not do well. Remove them to keep 32 individuals
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 3568
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 2492
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 458
```
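The rm.ind file passed to --remove is a plain text list of individuals to drop, one ID per line. A minimal sketch for building it, using the individuals flagged above (the 0.5 cutoff in the commented alternative is an assumption):
```
# One individual ID per line
printf '%s\n' CB-58 CB-73 CB-77 > rm.ind
# Alternative: pull high-missingness individuals straight from the --missing-indv output
# (out.imiss; column 5 is F_MISS)
# awk 'NR > 1 && $5 > 0.5 {print $1}' out.imiss > rm.ind
```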
### Filtering for SNP panel. Using the 458 SNPs that passed filtering (MAF of at least 17%, in HWE, at most 10% missing data, and mean depth of at least 15X). /data/clarkia/sv/cb/snp.panel
#### Now filter the original fasta file to get the filtered fasta sequences. /data/Dimensions/clarkia/sv/cb/snp.panel/fasta.filter
```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one #only keep first column of the vcf file that has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers #remove the vcf header rows
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep #add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa #sort the original fasta to only keep the high quality ones
```
#### Now we need to filter the fasta file to keep SNPs with at least 50 bp on either side of the SNP. First get the number of characters in each line of the file, then combine the locus IDs with the sequence lengths.
```
awk '{ print length }' filtered.fasta.fa > count # get the count of each line in the fasta file
sed -n '1~2!p' count > length #keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus #interleave the locus ID and the length of that locus
```
#### Now get the position of each
```
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file that has the high quality SNPs
sed '1,15d' keep.col.two > positions #remove headers of the vcf file
```
#### Combine the locus ID, length and position
```
paste list.to.keep length positions | column -s $'\t' -t > length.position #combine and have them formatted into tabulated columns
```
#### Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and keep only the loci where that difference is greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.
```
awk '(NR>0) && ($3 > 50)' length.position > right.50 #keep loci with more than 50 bp before the SNP
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' right.50 > difference #append length minus position as column 4
awk '(NR>0) && ($4 > 50)' difference > right.left.50 #keep loci with more than 50 bp after the SNP
```
#### Filter the original fasta file to only keep these 346 loci
```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```
### Now we must filter to remove SNPs that are not the same between replicates. Need to filter the vcf file to include just the 346 filtered loci.
```
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```
#### Copy the text below into a new file called vcfkeeping. These are the column names that we want.
```
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
CB-110
CB-110.2
CB-24
CB-24.2
CB-65
CB-65.2
CB-96
CB-96.2
```
#### Filter the full vcf file without headers to only keep the columns we want.
```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```
#### Keep only those 346 filtered loci
```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
#this keeps about ten extra loci, because a bare locus number can also match other whole-word fields in a line (e.g. the POS column or a read-depth value)
#so add a 'chr' prefix to the chromosome number in the first column of each file, forcing matches on the CHROM field only
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
#try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
#now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
#prepend the column-header line from rm.ind.vcf (the un-prefixed file, so 'chr' is not glued onto #CHROM); replicate.vcf itself has no header row
( sed 1q rm.ind.vcf; cat replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
```
#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.
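A minimal sketch of the manual header step, assuming the ## header lines are taken from the original filtered.final.subset.recode.vcf in the filter directory:
```
# Prepend the ## header lines from the original vcf; replicate.vcf already has the #CHROM column line
grep '^##' /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter/filtered.final.subset.recode.vcf > header.tmp
cat header.tmp replicate.vcf > replicate.tmp && mv replicate.tmp replicate.vcf && rm header.tmp
```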
```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap --genepop
```
##### I compared the loci that are not the same between replicates in Excel after uploading the genepop file. Note that you will lose some loci, because loci that aren't variable in the replicates won't be included in the genepop file. Of the starting 348 loci, 310 are variable in the replicates and can be compared. The IDs in the genepop file match the IDs in the vcf file used to generate it. Make sure to use the GenePop ID from the replicate genepop file, since it contains ONLY the variable loci that can be compared. I then made a list of loci to exclude. Now we have 202 SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x)
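As a cross-check on the Excel comparison, the replicate genotype calls can also be compared directly in the vcf. A minimal sketch, assuming the replicate pairs end up in adjacent sample columns 10-17 of replicate.vcf (the order listed in vcfkeeping above):
```
# Print CHROM and POS of any SNP where a replicate pair has two different, non-missing genotype calls
awk 'BEGIN { FS = OFS = "\t" } !/^#/ {
  bad = 0
  for (i = 10; i <= 16; i += 2) {                # columns 10/11, 12/13, 14/15, 16/17 hold the replicate pairs
    split($i, a, ":"); split($(i + 1), b, ":")   # the genotype call is the first colon-separated field
    if (a[1] != "./." && b[1] != "./." && a[1] != b[1]) bad = 1
  }
  if (bad) print $1, $2
}' replicate.vcf > mismatched.snps
```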
### Final filtering. /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter. Using the loci that we are keeping: make a list.to.keep file with their locus IDs, then filter the original fasta file.
```
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa #sort the original fasta to only keep the high quality ones
```
#### Now filter the vcf file.
```
cat list.to.keep | cut -d"_" -f2 > snps.keep #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#prepend the column-header line (the first line of allvcf.headers); the filtered file itself has no header row
( sed 1q allvcf.headers; cat filtered ) > filtered2 && mv filtered2 filtered.vcf
```
#### Filter out linked loci. /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter
```
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf #can't get this to work
## Checked the R2 values with vcftools and plink instead
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
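To list the linked pairs from the PLINK output, a minimal sketch, assuming the default plink.ld output columns (CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B, R2):
```
# Print locus pairs with R2 above 0.8, skipping the header line
awk 'NR > 1 && $7 > 0.8 {print $3, $6, $7}' plink.ld
```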
#### There are two pairs of loci that are linked (i.e. have an R2 above 0.8). Remove one locus from each pair and filter the vcf file. We have 200 SNPs.
```
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#prepend the column-header line (the first line of allvcf.headers); the filtered file itself has no header row
( sed 1q allvcf.headers; cat filtered ) > filtered2 && mv filtered2 filtered.vcf
```
#### Run populations. Upload this genepop file into R to look at the heterozygosity of each locus in each population. Decided to keep all 310 loci for now. We can use the final.fasta.fa file to begin developing the CB SNP panel.
```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop
```
#### Only keep a subset of loci that are highly variable across all three populations. Filter the final vcf file to keep only those highly variable SNPs and remove the replicates. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter
##### Run populations again to generate a genepop file to analyze in R
```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
```
##### Filter the final vcf using the 111 most variable SNPs. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter
```
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#carry over the full header (## lines and the #CHROM line) from filtered.vcf; the filtered file itself has no header rows
( grep '^#' filtered.vcf; cat filtered ) > filtered2 && mv filtered2 final.vcf
```
##### Run populations again
```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
```
#### Get SNP info for all final SNPs and make master spreadsheet with all SNP info.
```
#Generates a file reporting missingness on a per-individual basis. This file has the suffix ".imiss".
vcftools --vcf final.p.snps.vcf --missing-indv
#Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".
vcftools --vcf final.p.snps.vcf --missing-site
#Generates a file containing the mean depth per individual. This file has the suffix ".idepth"
vcftools --vcf final.p.snps.vcf --depth
#Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".
vcftools --vcf final.p.snps.vcf --site-mean-depth
```
##### Format for submission to sequencing facility
###### Filter fasta to get those we want
```
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # Make list of loci we want to keep w/ correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter original fasta
```
###### Re-formatting fasta file so that the locus name is in the first column and the sequence is in the second.
```
paste -d " " - - < final.fasta.fa > final.fasta2
```
###### Now doing some re-formatting for the submission form to Minnesota. Need to get the SNP with the major and minor alleles into the fasta file. Make a file with the sequence in the first column, position in the second column, and [allele1/allele2] in the third column. Then run code below (had to get help on this in Stack Exchange).
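A minimal sketch for building that three-column file (called snp below), assuming final.fasta2 and final.p.snps.vcf list the loci in the same order and that POS in the STACKS vcf is the SNP's position within the locus sequence:
```
# Column 2 of final.fasta2 is the sequence (locus name and sequence are space-separated)
cut -d' ' -f2 final.fasta2 > seq.only
# Pull POS, REF and ALT from the vcf data lines, formatted as: position <tab> [REF/ALT]
grep -v '^#' final.p.snps.vcf | awk 'BEGIN{OFS="\t"} {print $2, "["$4"/"$5"]"}' > pos.alleles
# sequence <tab> position <tab> [allele1/allele2]
paste seq.only pos.alleles > snp
```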
```
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp > submission.txt
```
## Clarkia concinna. /data/clarkia/sv/cc/snp.calling
### Optimization of STACKS parameters. This is done in /data/clarkia/sv/cc/snp.calling/opt.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
#Create output directories and define variables for output folders (so the code is cleaner below)
mkdir /data/Dimensions/clarkia/sv/cc/snp.calling/opt/m$1_M$2_n$3
OUT=/data/Dimensions/clarkia/sv/cc/snp.calling/opt/m$1_M$2_n$3
mv log.out ${OUT}/
denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/opt/popmap2 -o ${OUT}/ -T 25 -X "ustacks: --min_cov $1" -X "ustacks: --max_dist $2" -X "cstacks: -n $3" -X "populations:--min-maf 0.05" -X "populations: --write-random-snp" -X "populations: --plink"
OUT=/data/Dimensions/clarkia/sv/cc/snp.calling/opt/m$1_M$2_n$3
cd ${OUT}/
plink --file populations.plink --cluster --mds-plot 2 --noweb
```
#### Run optimizations
```
sh opt.sh 3 2 2 # run
sh opt.sh 4 2 2 # run
sh opt.sh 5 2 2 # run
sh opt.sh 3 3 3 # run
sh opt.sh 3 1 1 # run
sh opt.sh 3 4 4 # run
sh opt.sh 4 3 3 # run
sh opt.sh 4 1 1 # run
```
### Final SNP calling. /data/clarkia/sv/cc/snp.calling/final. Using default parameters, with populations filters so that a locus has to be present in all three populations (-p 3) and in at least 50% of individuals in a population (-r 0.5), with a minimum minor allele frequency of 17%.
```
#!/bin/bash
#Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1
denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cc/snp.calling/final/popmap -o /data/Dimensions/clarkia/sv/cc/snp.calling/final -T 10 -X "ustacks: --min_cov 3" -X "ustacks: --max_dist 2" -X "cstacks: -n 2" -X "populations:--min-maf 0.17" -X "populations:-r .5" -X "populations:-p 3" -X "populations: --write-random-snp" -X "populations: --vcf" -X "populations: --genepop" -X "populations: --fasta-loci"
```
##### CC-41 was included as a replicate, but I did not include CC-41.2 in the barcode file, so the reads for both are included under CC-41. It can no longer be used as a replicate unless I demultiplex again...
#### Quality filtering for CC. /data/Dimensions/clarkia/sv/cc/snp.calling/final/filter
```
vcftools --vcf populations.snps.vcf --missing-indv
# Remove individuals CC-26, CC-41
# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset
#Includes only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 2832
#Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 1801
#Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset #kept 466
```
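As in the CB section, rm.ind is a plain text list of the individuals to drop, one ID per line; a minimal sketch with the two individuals flagged above:
```
printf '%s\n' CC-26 CC-41 > rm.ind
```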
### Filtering for SNP panel. Using the 466 SNPs that passed filtering (MAF of at least 17%, in HWE, at most 10% missing data, and mean depth of at least 15X). /data/clarkia/sv/cc/snp.panel/fasta.filter
### Now filter the original fasta file to get the filtered fasta sequences.
```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one #only keep first column of the vcf file that has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers #remove the vcf header rows
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep #add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa #sort the original fasta to only keep the high quality ones
```
### Now we need to filter the fasta file to keep SNPs with at least 50 bp on either side of the SNP. First get the number of characters in each line of the file, then combine the locus IDs with the sequence lengths.
```
awk '{ print length }' filtered.fasta.fa > count # get the count of each line in the fasta file
sed -n '1~2!p' count > length #keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus #interleave the locus ID and the length of that locus
```
#### Now get the position of each
```
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file that has the high quality SNPs
sed '1,15d' keep.col.two > positions #remove headers of the vcf file
```
#### Combine the locus ID, length and position
```
paste list.to.keep length positions | column -s $'\t' -t > length.position #combine and have them formatted into tabulated columns
```
#### Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and keep only the loci where that difference is greater than 50, meaning there are at least 50 bp after the SNP. We have 378 loci.
```
awk '(NR>0) && ($3 > 50)' length.position > right.50 #keep loci with more than 50 bp before the SNP
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' right.50 > difference #append length minus position as column 4
awk '(NR>0) && ($4 > 50)' difference > right.left.50 #keep loci with more than 50 bp after the SNP
```
#### Filter original fasta file to only keep the loci
```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```
### Now we must filter to remove SNPs that are not the same between replicates. /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter
#### Need to filter the vcf file to include just the 378 filtered loci.
```
cat filtered.id | cut -d"_" -f2 > filtered.chrom #get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```
#### Copy the text below into a new file called vcfkeeping. These are the column names that we want. CC-324.2 had a lot of missing data, so it was not included.
```
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
CC-100
CC-100.2
CC-102
CC-102.2
CC-110
CC-110.2
CC-331
CC-331.2
```
#### Filter the full vcf file without headers to only keep the columns we want.
```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```
#### Keep only those filtered loci
```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
#this keeps about ten extra loci, because a bare locus number can also match other whole-word fields in a line (e.g. the POS column or a read-depth value)
#so add a 'chr' prefix to the chromosome number in the first column of each file, forcing matches on the CHROM field only
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
#try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
#now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
#prepend the column-header line from rm.ind.vcf (the un-prefixed file, so 'chr' is not glued onto #CHROM); replicate.vcf itself has no header row
( sed 1q rm.ind.vcf; cat replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
```
#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally).
```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cc/snp.calling/final/popmap --genepop
```
##### I compared the loci that are not the same between replicates in Excel after uploading the genepop file. Note that you will lose some loci, because loci that aren't variable in the replicates won't be included in the genepop file. Of the starting 378 loci, 298 are variable in the replicates and can be compared. The IDs in the genepop file match the IDs in the vcf file used to generate it. Make sure to use the GenePop ID from the replicate genepop file, since it contains ONLY the variable loci that can be compared. I then made a list of loci to exclude, leaving the final set of SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x)
### Linkage filtering. /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter
#### Filter the vcf file. Put the snps you're keeping in snps.keep
```
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers #remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#prepend the column-header line (the first line of allvcf.headers); the filtered file itself has no header row
( sed 1q allvcf.headers; cat filtered ) > filtered2 && mv filtered2 filtered.vcf
```
#### Filter out linked loci. PLINK showed that no R2 was above 0.6, so no loci were filtered.
```
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf #can't get this to work
## Checked the R2 values with vcftools and plink instead
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
#### Run populations. Upload this genepop file into R to look at the heterozygosity of each locus in each population.
```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cc/snp.panel/popmap.nr --genepop
```
#### Only keep a subset of loci that are highly variable across all three populations. Filter the final vcf file to keep only those highly variable SNPs and remove the replicates. /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter
##### Filter the final vcf using the 142 most variable SNPs
```
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr #add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr #add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr #filter
awk -F'chr' '{print $2}' filtered.chr > filtered #now remove the chr string
#carry over the full header (## lines and the #CHROM line) from filtered.vcf; the filtered file itself has no header rows
( grep '^#' filtered.vcf; cat filtered ) > filtered2 && mv filtered2 final.vcf
```
##### Run populations again
```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter/final.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cc/snp.panel/popmap.nr --genepop --vcf
```
#### Get SNP info for all final SNPs and make master spreadsheet with all SNP info.
```
#Generates a file reporting missingness on a per-individual basis. This file has the suffix ".imiss".
vcftools --vcf final.p.snps.vcf --missing-indv
#Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".
vcftools --vcf final.p.snps.vcf --missing-site
#Generates a file containing the mean depth per individual. This file has the suffix ".idepth"
vcftools --vcf final.p.snps.vcf --depth
#Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".
vcftools --vcf final.p.snps.vcf --site-mean-depth
```
##### Format for submission to sequencing facility
###### Filter fasta to get those we want
```
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # Make list of loci we want to keep w/ correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter original fasta
```
###### Re-formatting fasta file so that the locus name is in the first column and the sequence is in the second.
```
paste -d " " - - < final.fasta.fa > final.fasta2
```
###### Now doing some re-formatting for the submission form to Minnesota. Need to get the SNP with the major and minor alleles into the fasta file. Make a file with the sequence in the first column, position in the second column, and [allele1/allele2] in the third column. Then run code below (had to get help on this in Stack Exchange).
```
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snps > submission.txt
```
# Secure copy
```
# For windows
scp zdiazmartin@10.2.0.10:/data/Phyllostegia/coancestry/coancestry.txt C:\Users\zdiaz-martin\Desktop\Files
scp zdiazmartin@10.2.0.10:/data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.fasta.fa C:\Users\zdiaz-martin\Desktop\Files
# For Mac
scp zdiazmartin@10.2.0.10:/data/Amorphophallus/jt/snp.calling/opt/m3_M2_n2/populations.sumstats_summary.tsv /Users/zdm/Desktop/Files
```