# SNP panel development for Clarkia concinna concinna and Clarkia breweri

## Demultiplexing.

First combine the forward and reverse reads (the facility delivered them in two parts to create smaller files). I didn't label the libraries correctly on the order form, so they are named AmSub1 or AmSub2. These data are found in /data/clarkia/sv/raw. There is a directory for each library.

```
cat AmSub1-Fant09_S1_L001_R1_001_part1.fastq.gz AmSub1-Fant09_S1_L001_R1_001_part2.fastq.gz > Clarkia1_Fant09_S1_L001_R1_001.fastq.gz
cat AmSub1-Fant09_S1_L001_R2_001_part1.fastq.gz AmSub1-Fant09_S1_L001_R2_001_part2.fastq.gz > Clarkia1_Fant09_S1_L001_R2_001.fastq.gz
```

### Use process_radtags in STACKS for demultiplexing.

Here we have paired-end data, so we specify which file is the forward/first read and which is the second. The output folder for the demultiplexed samples is /data/clarkia/sv/samples. The process_radtags outputs are in the raw folder. Sequences were not trimmed.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1

process_radtags -1 ./Clarkia1_Fant09_S1_L001_R1_001.fastq.gz -2 ./Clarkia1_Fant09_S1_L001_R2_001.fastq.gz -i gzfastq -o /data/clarkia/sv/samples -b barcodes --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -s 10
```

### Check read quality with FastQC.

This is done in the raw directory for each library. First combine both files.

```
cat Clarkia2_Fant09_S2_L001_R1_001.fastq.gz Clarkia2_Fant09_S2_L001_R2_001.fastq.gz > allsamples.fq.gz
```

#### Then make a script to run the job.

This is done in the raw directory for each library.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>fastqlog.out 2>&1

fastqc allsamples.fq.gz
```

## Clarkia breweri. /data/clarkia/sv/cb/snp.calling

### Optimization of STACKS parameters.

This is done in /data/clarkia/sv/cb/snp.calling/opt.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1

# Create the output directory and define a variable for it (so the code is cleaner below)
mkdir /data/Dimensions/clarkia/sv/cb/snp.calling/opt/m$1_M$2_n$3
OUT=/data/Dimensions/clarkia/sv/cb/snp.calling/opt/m$1_M$2_n$3
mv log.out ${OUT}/

denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/opt/popmap2 -o ${OUT}/ -T 25 -X "ustacks: --min_cov $1" -X "ustacks: --max_dist $2" -X "cstacks: -n $3" -X "populations: --min-maf 0.05" -X "populations: --write-random-snp" -X "populations: --plink"

cd ${OUT}/
plink --file populations.plink --cluster --mds-plot 2 --noweb
```

#### Run optimizations

```
sh opt.sh 3 2 2 # run
sh opt.sh 4 2 2 # run
sh opt.sh 5 2 2 # run
sh opt.sh 3 3 3 # run
sh opt.sh 3 1 1 # run
sh opt.sh 3 4 4 # run
sh opt.sh 4 3 3 # run
sh opt.sh 4 1 1 # run
```
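To compare the parameter combinations before looking at the plink MDS plots, the denovo_map.pl log written into each output directory can be skimmed for its coverage summaries. A minimal sketch, assuming the log is the log.out that the script above moves into each m*_M*_n* directory:

```
# Pull the coverage lines out of each optimization run's log for a side-by-side look
for d in /data/Dimensions/clarkia/sv/cb/snp.calling/opt/m*_M*_n*; do
  echo "== $d"
  grep -i "coverage" "$d"/log.out
done
```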
### Final SNP calling. /data/clarkia/sv/cb/snp.calling/final.

Using default parameters, with populations set so that a locus has to be present in all three populations (-p 3) and in at least 50% of individuals in a population (-r 0.5), with a minimum minor allele frequency of 17%.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1

denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap -o /data/Dimensions/clarkia/sv/cb/snp.calling/final -T 25 -X "ustacks: --min_cov 3" -X "ustacks: --max_dist 2" -X "cstacks: -n 2" -X "populations: --min-maf 0.17" -X "populations: -r .5" -X "populations: -p 3" -X "populations: --write-random-snp" -X "populations: --vcf" -X "populations: --genepop" -X "populations: --fasta-loci"
```

#### Quality filtering for CB. /data/Dimensions/clarkia/sv/cb/snp.calling/final/filter

```
vcftools --vcf populations.snps.vcf --missing-indv
# It looks like one individual (CB-58) failed and CB-73 and CB-77 did not do well. Remove them to keep 32 individuals

# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset

# Include only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 3568

# Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 allows no missing data). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 2492

# Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset # kept 458
```

### Filtering for SNP panel. Using the 458 SNPs called above (minimum MAF of 17%, in HWE, at most 10% missing data, and mean depth of at least 15X). /data/clarkia/sv/cb/snp.panel

#### Now filter the original fasta file to get the filtered fasta sequences. /data/Dimensions/clarkia/sv/cb/snp.panel/fasta.filter

```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one # only keep the first column of the vcf file, which has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers # remove all the header rows of the vcf file
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep # add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa # sort the original fasta to only keep the high-quality loci
```
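A quick sanity check at this point (filenames as created above): the number of records in the filtered fasta should match the number of locus IDs in list.to.keep; a mismatch usually means an ID matched more than one record.

```
# Count fasta headers in the filtered file and locus IDs in the keep list; they should agree
grep -c '^>' filtered.fasta.fa
wc -l < list.to.keep
```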
#### Now we need to sort the fasta file to get SNPs with at least 50 bp on either side of the SNP.

First get the length of each line of the file, then combine the locus IDs and lengths.

```
awk '{ print length }' filtered.fasta.fa > count # get the length of each line in the fasta file
sed -n '1~2!p' count > length # keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus # paste the locus ID and the length of that locus
```

#### Now get the position of each SNP

```
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file, which has the positions of the high-quality SNPs
sed '1,15d' keep.col.two > positions # remove the headers of the vcf file
```

#### Combine the locus ID, length and position

```
paste list.to.keep length positions | column -s $'\t' -t > length.position # combine and format into tabulated columns
```

#### Keep SNPs that have 50 bp on either side.

Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and only keep loci with a difference greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.

```
awk '(NR>0) && ($3 > 50)' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' right.50 > difference # add the length-minus-position difference as column 4
awk '(NR>0) && ($4 > 50)' difference > right.left.50
```

#### Filter the original fasta file to only keep these 360 loci

```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```

### Now we must filter to remove SNPs that are not the same between replicates.

Need to filter the vcf file to include just the 346 filtered loci.

```
cat filtered.id | cut -d"_" -f2 > filtered.chrom # get only the chromosome (locus) number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```

#### Copy the text below into a new file called vcfkeeping. These are the column names that we want.

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CB-110 CB-110.2 CB-24 CB-24.2 CB-65 CB-65.2 CB-96 CB-96.2
```

#### Filter the full vcf file without headers to only keep the columns we want.

```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```

#### Keep only those 346 filtered loci

```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf # for some reason this is keeping ten extra loci
# so try adding a 'chr' prefix to the chromosome number in the first column of each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
# try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
# now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
# copy and paste the header line from the previous file into the new file
( sed 1q rm.ind.chr.vcf; sed 1d replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
```

#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally).

This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.

```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cb/snp.calling/final/popmap --genepop
```
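As a rough command-line check to complement the Excel comparison described next, genotype mismatches within a replicate pair can be counted straight from replicate.vcf. A sketch, assuming the standard VCF layout where the first sample column is column 10 (here CB-110 and CB-110.2 are taken to be columns 10 and 11; check your own header):

```
# Count sites where both members of a replicate pair are genotyped but the calls differ
grep -v '^#' replicate.vcf | awk -F'\t' '{
  split($10, a, ":"); split($11, b, ":")   # GT is the first subfield of each sample column
  if (a[1] != "./." && b[1] != "./." && a[1] != b[1]) n++
} END { print n, "mismatching genotypes" }'
```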
##### I compared the loci that aren't the same in the replicates in Excel after uploading the genepop file.

Note that you will lose some loci: if they aren't variable in the replicates, they won't be included in the genepop file. Of the starting 348 loci, we have 310 variable loci in the replicates to compare. The IDs in the genepop file match up with the IDs in the vcf file that you used to generate the genepop file. Make sure to use the genepop ID from the replicate genepop file, since this has ONLY the variable loci that you can compare. I then made a list of those to exclude. Now we have 202 SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x).

### Final filtering. /data/Dimensions/clarkia/sv/cb/snp.panel/replicates.filter.

Using the loci that we are keeping: make a list.to.keep with their locus IDs, then filter the original fasta file.

```
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # sort the original fasta to only keep the high-quality loci
```

#### Now filter the vcf file.

```
cat list.to.keep | cut -d"_" -f2 > snps.keep # get only the chromosome (locus) number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers # remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr # add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr # add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr # filter
awk -F'chr' '{print $2}' filtered.chr > filtered # now remove the chr string
# copy and paste the header line from the previous file into the new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
```

#### Filter out linked loci. /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter

```
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf # can't get this to work

## Checked the R2 values with plink instead
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```
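To see which pairs sit above the cutoff, plink's pairwise output can be scanned directly. A sketch, assuming plink's default output name (plink.ld) and its usual column order, with the two SNP IDs in columns 3 and 6 and R2 in column 7:

```
# List locus pairs with R2 above 0.8 so one member of each pair can be dropped
awk 'NR > 1 && $7 > 0.8 {print $3, $6, $7}' plink.ld
```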
#### There are two pairs of loci that are linked (i.e. have an R2 above 0.8). Remove them and filter the vcf file. We have 200 SNPs

```
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers # remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr # add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr # add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr # filter
awk -F'chr' '{print $2}' filtered.chr > filtered # now remove the chr string
# copy and paste the header line from the previous file into the new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
```

#### Run populations. Upload this genepop file into R to look at the heterozygosity of each locus in each population. Decided to keep all 310 loci for now. We can use the final.fasta.fa file to begin developing the CB SNP panel.

```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop
```

#### Only keeping a subset of loci that are highly variable across all three populations. Filter the final vcf file to only get those highly variable SNPs and remove replicates. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter

##### Run populations again to generate a genepop file for R

```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
```

##### Filter the final vcf using the 111 most variable SNPs. /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter

```
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr # add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr # add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr # filter
awk -F'chr' '{print $2}' filtered.chr > filtered # now remove the chr string
# copy and paste the header line from the previous file into the new file
( sed 1q filtered.vcf; sed 1d filtered ) > filtered2 && mv filtered2 final.vcf
```

##### Run populations again

```
populations -V /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.vcf -O /data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter -M /data/Dimensions/clarkia/sv/cb/snp.panel/popmap.nr --genepop --vcf
```

#### Get SNP info for all final SNPs and make a master spreadsheet with all SNP info.

```
# Missingness per individual; output has the suffix ".imiss"
vcftools --vcf final.p.snps.vcf --missing-indv
# Missingness per site; output has the suffix ".lmiss"
vcftools --vcf final.p.snps.vcf --missing-site
# Mean depth per individual; output has the suffix ".idepth"
vcftools --vcf final.p.snps.vcf --depth
# Mean depth per site averaged across all individuals; output has the suffix ".ldepth.mean"
vcftools --vcf final.p.snps.vcf --site-mean-depth
```

##### Format for submission to sequencing facility

###### Filter the fasta to get the loci we want

```
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # make a list of the loci we want to keep with the correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter the original fasta
```

###### Re-format the fasta file so that the locus name is in the first column and the sequence is in the second.

```
paste -d " " - - < final.fasta.fa > final.fasta2
```

###### Now some re-formatting for the submission form to Minnesota.

Need to get the SNP with the major and minor alleles into the fasta sequence. Make a file ("snp") with the sequence in the first column, the SNP position in the second column, and [allele1/allele2] in the third column, tab-separated. Then run the code below (had to get help on this on Stack Exchange).

```
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp > submission.txt
```
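A worked example of that substitution, using a made-up ten-base sequence, SNP position 5, and allele pair (the real snp file has one tab-separated locus per line):

```
# Build a one-line example input and run the same awk command on it
printf 'ACGTACGTAC\t5\t[A/G]\n' > snp.example
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snp.example
# prints ACGT[A/G]CGTAC (the 5th base replaced by the bracketed alleles)
```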
## Clarkia concinna. /data/clarkia/sv/cc/snp.calling

### Optimization of STACKS parameters.

This is done in /data/clarkia/sv/cc/snp.calling/opt.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1

# Create the output directory and define a variable for it (so the code is cleaner below)
mkdir /data/Dimensions/clarkia/sv/cc/snp.calling/opt/m$1_M$2_n$3
OUT=/data/Dimensions/clarkia/sv/cc/snp.calling/opt/m$1_M$2_n$3
mv log.out ${OUT}/

denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cb/snp.calling/opt/popmap2 -o ${OUT}/ -T 25 -X "ustacks: --min_cov $1" -X "ustacks: --max_dist $2" -X "cstacks: -n $3" -X "populations: --min-maf 0.05" -X "populations: --write-random-snp" -X "populations: --plink"

cd ${OUT}/
plink --file populations.plink --cluster --mds-plot 2 --noweb
```

#### Run optimizations

```
sh opt.sh 3 2 2 # run
sh opt.sh 4 2 2 # run
sh opt.sh 5 2 2 # run
sh opt.sh 3 3 3 # run
sh opt.sh 3 1 1 # run
sh opt.sh 3 4 4 # run
sh opt.sh 4 3 3 # run
sh opt.sh 4 1 1 # run
```

### Final SNP calling. /data/clarkia/sv/cc/snp.calling/final.

Using default parameters, with populations set so that a locus has to be present in all three populations (-p 3) and in at least 50% of individuals in a population (-r 0.5), with a minimum minor allele frequency of 17%.

```
#!/bin/bash
# Tell the computer to make a log file
exec 3>&1 4>&2
trap 'exec 2>&4 1>&3' 0 1 2 3 RETURN
exec 1>log.out 2>&1

denovo_map.pl --samples /data/Dimensions/clarkia/sv/samples --paired --popmap /data/Dimensions/clarkia/sv/cc/snp.calling/final/popmap -o /data/Dimensions/clarkia/sv/cc/snp.calling/final -T 10 -X "ustacks: --min_cov 3" -X "ustacks: --max_dist 2" -X "cstacks: -n 2" -X "populations: --min-maf 0.17" -X "populations: -r .5" -X "populations: -p 3" -X "populations: --write-random-snp" -X "populations: --vcf" -X "populations: --genepop" -X "populations: --fasta-loci"
```

##### CC-41 was included as a replicate, but I didn't include CC-41.2 in the barcode file, so all of its sequences are included as CC-41. It can no longer be used as a replicate unless I demultiplex again...

#### Quality filtering for CC. /data/Dimensions/clarkia/sv/cc/snp.calling/final/filter

```
vcftools --vcf populations.snps.vcf --missing-indv
# Remove individuals CC-26 and CC-41

# Remove individuals with lots of missing data
vcftools --vcf populations.snps.vcf --remove rm.ind --recode --recode-INFO-all --out subset

# Include only sites with mean depth values (over all included individuals) greater than or equal to 15
vcftools --vcf subset.recode.vcf --min-meanDP 15 --recode --recode-INFO-all --out fil1 # kept 2832

# Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 allows no missing data). Here 10% or less missing data is allowed
vcftools --vcf subset.recode.vcf --max-missing 0.90 --recode --recode-INFO-all --out fil4 # kept 1801

# Filter with all parameters including mean depth
vcftools --vcf subset.recode.vcf --min-meanDP 15 --mac 3 --max-missing 0.90 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset # kept 466
```
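The rm.ind file used above is just a plain list of the individuals to drop, one ID per line; a minimal way to build it for this run:

```
# Individuals flagged by the missingness report, one per line, for vcftools --remove
printf 'CC-26\nCC-41\n' > rm.ind
```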
### Filtering for SNP panel. Using the SNPs that passed the quality filtering above (minimum MAF of 17%, in HWE, at most 10% missing data, and mean depth of at least 15X). /data/clarkia/sv/cc/snp.panel/fasta.filter

#### Now filter the original fasta file to get the filtered fasta sequences.

```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one # only keep the first column of the vcf file, which has the SNPs that passed quality filtering
sed '1,15d' keep.col.one > remove.headers # remove all the header rows of the vcf file
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep # add the fasta label to be able to match below
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa # sort the original fasta to only keep the high-quality loci
```

### Now we need to sort the fasta file to get SNPs with at least 50 bp on either side of the SNP.

First get the length of each line of the file, then combine the locus IDs and lengths.

```
awk '{ print length }' filtered.fasta.fa > count # get the length of each line in the fasta file
sed -n '1~2!p' count > length # keep every other line, so you only have the length of the sequence (not the locus ID)
paste -d \\n list.to.keep length > length.locus # paste the locus ID and the length of that locus
```

#### Now get the position of each SNP

```
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two # only keep the second column of the vcf file, which has the positions of the high-quality SNPs
sed '1,15d' keep.col.two > positions # remove the headers of the vcf file
```

#### Combine the locus ID, length and position

```
paste list.to.keep length positions | column -s $'\t' -t > length.position # combine and format into tabulated columns
```

#### Keep SNPs that have 50 bp on either side.

Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and only keep loci with a difference greater than 50, meaning there are at least 50 bp after the SNP.

```
awk '(NR>0) && ($3 > 50)' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $4 = $2 - $3 } 1' right.50 > difference # add the length-minus-position difference as column 4
awk '(NR>0) && ($4 > 50)' difference > right.left.50
```

#### Filter the original fasta file to only keep these loci

```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```

### Now we must filter to remove SNPs that are not the same between replicates. /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter

#### Need to filter the vcf file to include just the 378 filtered loci.

```
cat filtered.id | cut -d"_" -f2 > filtered.chrom # get only the chromosome (locus) number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```

#### Copy the text below into a new file called vcfkeeping. These are the column names that we want. CC-324.2 had a lot of missing data, so it was not included.

```
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CC-100 CC-100.2 CC-102 CC-102.2 CC-110 CB-110.2 CC-331 CC-331.2
```
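As an aside, a possible alternative to the manual column selection in the next step: vcftools can subset individuals directly with --keep, given a plain list of sample IDs (one per line, here the replicate names listed above). A sketch:

```
# Keep only the replicate individuals; vcftools writes the subset to replicates.recode.vcf
printf 'CC-100\nCC-100.2\nCC-102\nCC-102.2\nCC-110\nCB-110.2\nCC-331\nCC-331.2\n' > reps.keep
vcftools --vcf filtered.final.subset.recode.vcf --keep reps.keep --recode --recode-INFO-all --out replicates
```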
#### Filter the full vcf file without headers to only keep the columns we want.

```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```

#### Keep only those filtered loci

```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf # for some reason this is keeping ten extra loci
# so try adding a 'chr' prefix to the chromosome number in the first column of each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
# try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
# now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
# copy and paste the header line from the previous file into the new file
( sed 1q rm.ind.chr.vcf; sed 1d replicate.vcf ) > replicates.vcf && mv replicates.vcf replicate.vcf
```

#### Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally).

```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter/replicate.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/replicates.filter -M /data/Dimensions/clarkia/sv/cc/snp.calling/final/popmap --genepop
```

##### I compared the loci that aren't the same in the replicates in Excel after uploading the genepop file.

Note that you will lose some loci: if they aren't variable in the replicates, they won't be included in the genepop file. Of the starting 378 loci, we have 298 variable loci in the replicates to compare. The IDs in the genepop file match up with the IDs in the vcf file that you used to generate the genepop file. Make sure to use the genepop ID from the replicate genepop file, since this has ONLY the variable loci that you can compare. I then made a list of those to exclude. Now we have SNPs. You will need to get the fasta locus ID from the genepop ID using the vcf file (the genepop file is ordered 1...x).

### Linkage filtering. /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter

#### Filter the vcf file. Put the SNPs you're keeping in snps.keep

```
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers # remove header
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr # add 'chr'
sed -e 's/^/chr/' snps.keep > snps.keep.chr # add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr # filter
awk -F'chr' '{print $2}' filtered.chr > filtered # now remove the chr string
# copy and paste the header line from the previous file into the new file
( sed 1q allvcf.headers; sed 1d filtered ) > filtered2 && mv filtered2 filtered.vcf
```

#### Filter out linked loci. Plink showed that no R2 was above 0.6, so no loci were filtered out.

```
## Had to reformat the vcf file to include a header
bgzip -c filtered.vcf > filtered2.vcf.gz
tabix -p vcf filtered2.vcf.gz
/usr/local/libexec/bcftools/bcftools +prune -m 0.6 -w 100 filtered2.vcf.gz -Ob -o final.vcf # can't get this to work

## Checked the R2 values with plink instead
vcftools --vcf filtered.vcf --plink --out filtered.plink
plink --file filtered.plink --r2 --noweb
```

#### Run populations. Upload this genepop file into R to look at the heterozygosity of each locus in each population.

```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter/filtered.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/linkage.filter/ -M /data/Dimensions/clarkia/sv/cc/snp.panel/popmap.nr --genepop
```
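The genepop file goes into R for the heterozygosity check, but populations should also write per-locus summary statistics alongside it (populations.sumstats.tsv is assumed here, and its column order varies between Stacks versions, so number the header columns first). A sketch:

```
# Number the header columns of populations.sumstats.tsv, then preview a few data rows
grep '^#' populations.sumstats.tsv | tail -1 | tr '\t' '\n' | nl
grep -v '^#' populations.sumstats.tsv | head -5
```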
#### Only keeping a subset of loci that are highly variable across all three populations. Filter the final vcf file to only get those highly variable SNPs and remove replicates. /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter

##### Filter the final vcf using the 142 most variable SNPs

```
sed -e 's/^/chr/' filtered.vcf > allvcf.headers.chr # add 'chr'
sed -e 's/^/chr/' list.to.keep > snps.keep.chr # add 'chr'
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr # filter
awk -F'chr' '{print $2}' filtered.chr > filtered # now remove the chr string
# copy and paste the header line from the previous file into the new file
( sed 1q filtered.vcf; sed 1d filtered ) > filtered2 && mv filtered2 final.vcf
```

##### Run populations again

```
populations -V /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter/final.vcf -O /data/Dimensions/clarkia/sv/cc/snp.panel/final.filter/ -M /data/Dimensions/clarkia/sv/cc/snp.panel/popmap.nr --genepop --vcf
```

#### Get SNP info for all final SNPs and make a master spreadsheet with all SNP info.

```
# Missingness per individual; output has the suffix ".imiss"
vcftools --vcf final.p.snps.vcf --missing-indv
# Missingness per site; output has the suffix ".lmiss"
vcftools --vcf final.p.snps.vcf --missing-site
# Mean depth per individual; output has the suffix ".idepth"
vcftools --vcf final.p.snps.vcf --depth
# Mean depth per site averaged across all individuals; output has the suffix ".ldepth.mean"
vcftools --vcf final.p.snps.vcf --site-mean-depth
```

##### Format for submission to sequencing facility

###### Filter the fasta to get the loci we want

```
sed -e 's/^/>CLocus_/' list.to.keep > final.list.to.keep # make a list of the loci we want to keep with the correct format
grep -w -A 1 -f final.list.to.keep populations.loci.fa --no-group-separator > final.fasta.fa # filter the original fasta
```

###### Re-format the fasta file so that the locus name is in the first column and the sequence is in the second.

```
paste -d " " - - < final.fasta.fa > final.fasta2
```

###### Now some re-formatting for the submission form to Minnesota.

Need to get the SNP with the major and minor alleles into the fasta sequence. Make a file ("snps") with the sequence in the first column, the SNP position in the second column, and [allele1/allele2] in the third column, tab-separated. Then run the code below (had to get help on this on Stack Exchange).

```
awk -F"\t" '{printf "%s%s%s%s",substr($1,1,$2-1),$3,substr($1,$2+1),ORS}' snps > submission.txt
```

## Secure copy

```
# For Windows
scp zdiazmartin@10.2.0.10:/data/Phyllostegia/coancestry/coancestry.txt C:\Users\zdiaz-martin\Desktop\Files
scp zdiazmartin@10.2.0.10:/data/Dimensions/clarkia/sv/cb/snp.panel/final.filter/hs.filter/final.fasta.fa C:\Users\zdiaz-martin\Desktop\Files

# For Mac
scp zdiazmartin@10.2.0.10:/data/Amorphophallus/jt/snp.calling/opt/m3_M2_n2/populations.sumstats_summary.tsv /Users/zdm/Desktop/Files
```