---
title: "Brendan's DENU pt1 - STACKS and SNP Discovery"
breaks: false
tags: RADSeq
---

Author: Brendan Connolly (p31939)

---

### Connecting to Servers

```
# Quest
ssh bjc8165@quest.northwestern.edu

# Curie
ssh bjc8165@10.2.0.53
cd /data/labs/Fant/Connolly
```

### Transfer things to JBOD (do this on Curie)

```
rsync -rtu --delete SOURCE root@x.x.x.x:/DESTINATION

### Example

# Make an md5sum file for everything you're transferring over
for i in *.gz; do
  md5sum $i > $i.md5
done

# Transfer files over
rsync -rtu --delete bjc8165@quest.northwestern.edu:/projects/p31939/DENUddRAD/RawReads /data/labs/Fant/Connolly/DENU_ddrad/raw

# Check integrity of transferred files
md5sum -c DENU-3_S3_L001_R1_001.fastq.gz.md5
```

### Transferring things to local SSD

```
# Run in PowerShell to set the local location to the external SSD (D:)
Set-Location D:

# Run in PowerShell to transfer to the SSD
scp -r bjc8165@quest.northwestern.edu:/YOUR/DIRECTORY/TO/TRANSFER /D:/DESTINATION/DIRECTORY

### Example
scp -r bjc8165@quest.northwestern.edu:/projects/p31939/DENUddRAD /D:/GeneticsBackup/

# Run this in PowerShell to get a text file of every file's MD5 hash - only files with . in the name
Get-FileHash -Algorithm MD5 -Path (Get-ChildItem "D:\GeneticsBackup\DENU_Amplicon\Raw\Trimmed\*.*" -Recurse -Force) | Export-Csv D:\GeneticsBackup\DENU_Amplicon\Raw\Trimmed\trimmed.csv

# Run this in the directory on unix to generate an md5sum for every file in the directory
find -type f -exec md5sum '{}' \; > md5sum.txt

# Compare the md5 hashes in Excel to ensure file integrity
```

## Stacks

A really good tutorial on how Stacks works:
https://catchenlab.life.illinois.edu/stacks/param_tut.php

## File: Process_RadTags

```
#!/bin/bash
#SBATCH --account=p31939            ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal          ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00             ## Required: how long the job will need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1                   ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=2         ## how many cpus or processors you need per computer/node (default value 1)
#SBATCH --mem-per-cpu=25GB          ## how much RAM you need per cpu (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name=DENU_S1_radtags  ## the name shown when you run squeue -u
#SBATCH --mail-type=ALL             ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=S1_RadTags.error
#SBATCH --output=S1_RadTags.out

module purge
module avail
module load stacks/2.64-gcc-9.2.0

process_radtags -1 /projects/p31939/DENUddRAD/DENU-1_S1_L001_R1_001.fastq.gz -2 /projects/p31939/DENUddRAD/DENU-1_S1_L001_R2_001.fastq.gz -i gzfastq -b /projects/p31939/DENUddRAD/s1_bars -o ./S1_Radtags --inline_null --renz_1 ecoRI --renz_2 mspI --adapter_1 ACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2 -r -c -q -t 100 -s 2
```

This works, but a lot of reads are dropped: cut sites are not found in ~70% of reads. It still retains around 18 million reads. To troubleshoot, I will play around with the amount trimmed (-t) and try adding a linker nucleotide (T) to the barcodes.

### Notes for Troubleshooting (6/27/23)

- A lot of cut sites were not found, and those reads were dropped across sublibraries (~70%). Not sure why.
- In the process_radtags log, there are six barcodes that I didn't use (#s 42-48: TCACG, TCAGT, TCCGG, TCTGC, TGGAA, TTACC) that show up in every sublibrary, each with ~1 million reads found. This is a problem in every sublibrary and doesn't make sense. Contamination? (A quick way to tally which barcodes are actually present is sketched below.)
- In every sublibrary, samples that were in wells C2 (AATTA) and G2 (ATACG) have very low read counts for those barcodes, like ~10,000. Not sure if there was a problem with these barcodes, or if they received one of the above barcodes by accident.
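To check which inline barcodes the raw reads actually carry (and whether the six unexpected ones are really there), the first 5 bp of each forward read can be tallied directly. This is a minimal sketch, not part of the original pipeline, run on one of the raw files named above:

```
# Count the 20 most common leading 5-bp motifs in R1; in clean data these
# should correspond to the barcodes listed in the s1_bars file.
zcat DENU-1_S1_L001_R1_001.fastq.gz \
  | awk 'NR % 4 == 2 { print substr($0, 1, 5) }' \
  | sort | uniq -c | sort -rn | head -20
```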
### Notes for Troubleshooting (7/6/23)

- Ran process_radtags with the 6 barcodes above added to my barcodes .txt file, to see if more reads are retained now. Continuing with the pipeline.

## De Novo Assembly

### Example Script

```
#!/bin/bash
#SBATCH --account=p31939                 ## Required: your allocation/account name, i.e. eXXXX, pXXXX or bXXXX
#SBATCH --partition=normal               ## Required: (buyin, short, normal, long, gengpu, genhimem, etc)
#SBATCH --time=12:00:00                  ## Required: how long the job will need to run (remember different partitions have restrictions on this parameter)
#SBATCH --nodes=1                        ## how many computers/nodes you need (no default)
#SBATCH --ntasks-per-node=4              ## how many cpus or processors you need per computer/node (default value 1)
#SBATCH --mem-per-cpu=20G                ## how much RAM you need per cpu (this affects your FairShare score, so be careful not to ask for more than you need)
#SBATCH --job-name=DeNova_433_63nf5drop  ## the name shown when you run squeue -u
#SBATCH --mail-type=ALL                  ## BEGIN, END, FAIL or ALL
#SBATCH --mail-user=brendanconnolly2023@u.northwestern.edu
#SBATCH --error=DeNova_433_63nf5drop.error   # Enter an error log file name
#SBATCH --output=DeNova_433_63nf5drop.out    # Enter your out.log file name

module purge
module load stacks/2.64-gcc-9.2.0  ## specify environment

denovo_map.pl -m 4 -M 3 -n 3 --samples /projects/p31939/DENUddRAD/demultiplex_70_63nf5_dropped --popmap /projects/p31939/DENUddRAD/pop_map_63nf5drop.txt -o /projects/p31939/DENUddRAD/DeNovo_433_70_63nf5drop -T 4 --paired -X "populations:-r 0.8" -X "populations: --write-random-snp" -X "populations: --vcf"
```

### Notes (7/6/23)

The script fails unless 63_nf5 is dropped, likely because of that sample's low read count. However, with default parameters (4_3_3) only 93 loci are retained. I dropped all samples with fewer than 10,000 reads to try to retain more loci (a quick way to find such samples is sketched after these notes):

- (ATACG) 184_control: 2617 reads
- (CTTGG) 65_flav: 346 reads
- (ATACG) 2,3,141,142-NF2: 2943 reads
- (CTTGG) 60-nf5: 469 reads
- (ATACG) 185_control: 2992 reads
- (CTTGG) 63-65_NF5: 318 reads
- (ATACG) 172_control: 2309 reads
- (CTTGG) 63-65_NF4: 368 reads

```
rm 184_Control.1.fq.gz 184_Control.2.fq.gz 65_Flav.1.fq.gz 65_Flav.2.fq.gz \
   2_3_141_142_NF2.1.fq.gz 2_3_141_142_NF2.2.fq.gz 60_NF5.1.fq.gz 60_NF5.2.fq.gz \
   185_Control.1.fq.gz 185_Control.2.fq.gz 63-65_NF5.1.fq.gz 63-65_NF5.2.fq.gz \
   172_Control.1.fq.gz 172_Control.2.fq.gz 63-65_NF4.1.fq.gz 63-65_NF4.2.fq.gz
```

### Notes (7/14/2023)

When I run the de novo pipeline on the demultiplexed samples with 70% of reads dropped for unrecognizable cut sites, a maximum of only ~230 loci is retained, even after adjusting the de novo parameters. This is too low for amplicon design.

https://docs.google.com/spreadsheets/d/1t60fjJ_DagiukDfUUVaYlj_-WA0dR_P9ZP4KDDpHTVE/edit#gid=0

When checking for cut sites is disabled in process_radtags, I have 80M reads left, but the de novo assembly fails to retain any loci since all are filtered out.

- FastQC reports show low quality in the cut-site region. There may be too many mismatches here for the de novo assembly to retain any loci.
- Solution: trim the cut-site region off the start of every read, then rerun the de novo assembly (see the trimming loops in the next section).
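To find samples below the 10,000-read cutoff used above without opening each log, read counts can be pulled directly from the demultiplexed files. A minimal sketch (FASTQ stores four lines per read); run it in the demultiplexed directory:

```
# List any sample whose forward file holds fewer than 10,000 reads.
for f in *.1.fq.gz; do
  n=$(( $(zcat "$f" | wc -l) / 4 ))  # 4 FASTQ lines per read
  [ "$n" -lt 10000 ] && echo "$f: $n reads"
done
```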
#### Trimming off cut sites from de-multiplexed samples

Run all of the following:

```
# Write a trimming command for each read file (seqtk trimfq -b 5 cuts 5 bp off the read start),
# then run them all.
for i in *.1.fq.gz; do echo "seqtk trimfq -b 5 $i > ${i//.1.fq.gz/_trim.1.fq.gz}"; done > trimloop1.sh
for i in *.2.fq.gz; do echo "seqtk trimfq -b 5 $i > ${i//.2.fq.gz/_trim.2.fq.gz}"; done > trimloop2.sh
cat trimloop1.sh trimloop2.sh > trimloop.sh
module load seqtk/Dec17
bash trimloop.sh

# Remove the loop scripts and the untrimmed originals (their names end in digits or l/o/v;
# the _trim files are untouched)
rm *1.sh *2.sh *1.1.fq.gz *1.2.fq.gz *2.1.fq.gz *2.2.fq.gz *3.1.fq.gz *3.2.fq.gz *4.2.fq.gz *4.1.fq.gz *5.1.fq.gz *5.2.fq.gz *l.1.fq.gz *l.2.fq.gz *o.1.fq.gz *o.2.fq.gz *v.1.fq.gz *v.2.fq.gz

# Rename the trimmed files back to the original sample names
for i in *.1.fq.gz; do echo "mv $i ${i//_trim.1.fq.gz/.1.fq.gz}"; done > renameloop1.sh
for i in *.2.fq.gz; do echo "mv $i ${i//_trim.2.fq.gz/.2.fq.gz}"; done > renameloop2.sh
cat renameloop1.sh renameloop2.sh > renameloop.sh
bash renameloop.sh

# seqtk writes plain fastq, so strip the stale .gz extension (last 3 characters) and re-gzip
for file in *; do mv "$file" "${file%???}"; done
for i in *.fq; do echo "gzip $i"; done > gziploop_150bp.sh
bash gziploop_150bp.sh
```

#### FastQC of all demultiplexed files with first 10 bp removed

![](https://hackmd.io/_uploads/rkUsB8k53.png)

### Removing FQ files with missing data (done by checking file size)

To get more loci, files with a lot of missing data were removed. This was done with three size cutoffs:

1.) 76 biggest samples + set of tech 1_4_42_43_NF2 = 80 samples

```
find . -type f -size +1200k
```

2.) 56 biggest samples + set of tech 1_4_42_43_NF2 = 60 samples

```
find . -type f -size +1500k
```

3.) 36 biggest samples + set of tech 1_4_42_43_NF2 = 40 samples

```
find . -type f -size +2200k
```

Copy and paste the file list into Excel, sort A-Z, and make sure both *1.fq.gz and *2.fq.gz are included for each sample. Then, using CONCAT in Excel, make a text file with one line per file in the following format (a bash alternative to the Excel step is sketched below):

```
cp 103_Appo.2.fq.gz /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/demulti_nocutcheck_trim10_1200kbdrop
```
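As an alternative to the Excel step, the cp commands can be generated directly in bash. A rough sketch, assuming it runs inside the demultiplexed directory and using the 1200k cutoff and destination shown above; if either mate of a pair passes the size cutoff, both mates are copied:

```
DEST=/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/demulti_nocutcheck_trim10_1200kbdrop
find . -maxdepth 1 -type f -size +1200k -name '*.fq.gz' \
  | sed 's|^\./||; s/\.[12]\.fq\.gz$//' \
  | sort -u \
  | while read -r sample; do
      echo "cp ${sample}.1.fq.gz ${sample}.2.fq.gz $DEST"
    done > copyloop.sh
# Inspect copyloop.sh, then run: bash copyloop.sh
```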
### Run denovo after removing samples with low # of loci

## SNP Filtering - VCFtools

### Filtering overview:

- MAF = 0.05
- HWE = 0.05
- MAC = 3
- not filtering for DP
- make sure there are 50 bp before and after each SNP
- no linked SNPs
- replicates not variable?

```
module load vcftools/current

vcftools --vcf populations.snps.vcf --min-meanDP 0 --mac 3 --max-missing 0.2 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset

vcftools --vcf populations.snps.vcf --max-missing 0.2 --hwe 0.05 --recode --recode-INFO-all --out filtered.final.subset

rm out.log out.imiss fil1.log fil1.recode.vcf filtered.final.subset.log filtered.final.subset.recode.vcf
```

## Filtering Fasta file

Make a new directory for fasta filtering:

```
mkdir fasta.filter
cp populations.loci.fa fasta.filter/
cp filtered.final.subset.recode.vcf fasta.filter/
cd fasta.filter
```

### Filtering 50bp script

```
### Before starting the script, make a file called "vcfkeeping" with the following lines:
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
Tech_168_Control_S1
Tech_168_Control_S2
Tech_168_Control_S3
Tech_168_Control_S4
Tech_1_4_42_43_NF2_S1
Tech_1_4_42_43_NF2_S2
Tech_1_4_42_43_NF2_S3
Tech_1_4_42_43_NF2_S4

### Script start ###
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one
sed '1,15d' keep.col.one > remove.headers
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa
awk '{ print length }' filtered.fasta.fa > count
sed -n '1~2!p' count > length
paste -d \\n list.to.keep length > length.locus
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two
sed '1,15d' keep.col.two > positions
paste list.to.keep length positions | column -s $'\t' -t > length.position
awk '(NR>0) && ($3 > 50 )' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 )' difference > right.left.50
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa

# Creates a VCF file with only the replicates and the above filtered loci
cat filtered.id | cut -d"_" -f2 > filtered.chrom  # get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf
head -n1 rm.ind.chr.vcf > replicateheaders.vcf
cat replicate.vcf >> replicateheaders.vcf
rm replicate.vcf
mv replicateheaders.vcf replicate.vcf
head -n14 filtered.final.subset.recode.vcf > vcfheaders.vcf
cat replicate.vcf >> vcfheaders.vcf
rm replicate.vcf
mv vcfheaders.vcf replicate.vcf

# Creates a VCF file with all samples but only the above filtered loci
sed -e 's/^/chr/' allvcf.headers > all.50bpfiltered.chr.vcf
grep -w -F -f filtered.chr all.50bpfiltered.chr.vcf > all.50bpfiltered.vcf
awk -F'chr' '{print $2}' all.50bpfiltered.vcf > all.50bp.vcf
head -n1 all.50bpfiltered.chr.vcf > allheaders.vcf
cat all.50bp.vcf >> allheaders.vcf
rm all.50bpfiltered.vcf
mv allheaders.vcf 50bpfiltered.vcf
head -n14 filtered.final.subset.recode.vcf > vcfheaders2.vcf
cat 50bpfiltered.vcf >> vcfheaders2.vcf
rm 50bpfiltered.vcf
mv vcfheaders2.vcf 50bpfiltered.vcf

rm rm.ind.chr.vcf replicate.headers.vcf reps.filtered.vcf rm.ind.vcf filtered.chr right.50 difference right.left.50 keep.col.one remove.headers length count keep.col.two length.position positions length.locus list.to.keep filtered.chrom filtered.id allvcf.headers all.50bpfiltered.chr.vcf all.50bp.vcf filtered.fasta.fa
```
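As a quick sanity check on the script above (run before the final cleanup `rm`), the number of sequences kept in the final FASTA should equal the number of loci that passed the 50 bp filter on both sides. A minimal sketch using files the script creates:

```
# Loci that passed the 50 bp filter on both sides
wc -l < right.left.50

# Sequences in the filtered FASTA (one ">" header line per locus) - should match
grep -c '^>' final.filtered.fasta.fa
```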
### 50bp Filtering Script, Annotated

```
# only keep the first column of the vcf file, which has the SNPs that passed quality filtering
cut -f 1 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.one

# remove all the header rows of the vcf file
sed '1,15d' keep.col.one > remove.headers

# add the fasta label to be able to match below
sed -e 's/^/>CLocus_/' remove.headers > list.to.keep

# filter the original fasta to only keep the high-quality loci
grep -w -A 1 -f list.to.keep populations.loci.fa --no-group-separator > filtered.fasta.fa
```

Now we need to sort the fasta file to get SNPs with at least 50 bp on either side of the SNP. First get the length of each line of the file, then combine them together.

```
# get the character count of each line in the fasta file
awk '{ print length }' filtered.fasta.fa > count

# keep every other line, so you only have the length of the sequence (not the locus ID)
sed -n '1~2!p' count > length

# paste the locus ID and the length of that locus
paste -d \\n list.to.keep length > length.locus
```

Now get the position of each SNP.

```
# only keep the second column of the vcf file, which has the high-quality SNPs
cut -f 2 filtered.final.subset.recode.vcf | sed 's/[\t]/,/g' > keep.col.two

# remove the headers of the vcf file
sed '1,15d' keep.col.two > positions
```

Combine the locus ID, length, and position:

```
# combine and format into tabulated columns
paste list.to.keep length positions | column -s $'\t' -t > length.position
```

Keep SNPs that have 50 bp on either side. Keep every line where column 3 (position) is greater than 50, meaning there are more than 50 bp before the SNP. Then calculate the difference between column 2 (length) and column 3 (position) and only keep the loci where that difference is greater than 50, meaning there are at least 50 bp after the SNP. We have 346 loci.

```
awk '(NR>0) && ($3 > 50 )' length.position > right.50
awk 'BEGIN { OFS = "\t" } { $5 = $2 - $3 } 1' right.50 > difference
awk '(NR>0) && ($4 > 50 )' difference > right.left.50
```

Filter the original fasta file to only keep these loci:

```
cut -f 1 right.left.50 | sed 's/[\t]/,/g' > filtered.id
grep -w -A 1 -f filtered.id populations.loci.fa --no-group-separator > final.filtered.fasta.fa
```

Now we must filter to remove SNPs that are not the same between replicates. We need to filter the vcf file to include just the 346 filtered loci.

```
cat filtered.id | cut -d"_" -f2 > filtered.chrom  # get only the chromosome number
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
```

Copy the text below into a new file called vcfkeeping. These are the column names that we want.

- This only includes one set of technical replicates. The other two had a lot of missing data, but it might be worth including Tech_1_4_42_43_NF2.

```
#CHROM
POS
ID
REF
ALT
QUAL
FILTER
INFO
FORMAT
Tech_168_Control_S1
Tech_168_Control_S2
Tech_168_Control_S3
Tech_168_Control_S4
Tech_1_4_42_43_NF2_S1
Tech_1_4_42_43_NF2_S2
Tech_1_4_42_43_NF2_S3
Tech_1_4_42_43_NF2_S4
```
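The trickiest line in this script is the `cut -f"$(...)"` one-liner that comes next: it turns the column names in vcfkeeping into column numbers for `cut`. A toy demonstration of how it works (toy.tsv and keepcols here are made up for illustration):

```
# Build a tiny 3-column table and a list of column names to keep.
printf 'A\tB\tC\n1\t2\t3\n' > toy.tsv
printf 'A\nC\n' > keepcols

# nl numbers the header fields; grep -Fwf picks the wanted names;
# awk/paste collect their numbers into "1,3" for cut.
cut -f"$(head -1 toy.tsv | tr '\t' '\n' | nl | grep -Fwf keepcols | awk '{print $1}' | paste -sd,)" toy.tsv
# -> prints columns A and C only
```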
Filter the full vcf file without headers to only keep the columns/samples we want:

```
cut -f"$(head -1 allvcf.headers | tr '\t' '\n' | nl | grep -Fwf vcfkeeping | awk '{print $1}' | paste -sd,)" allvcf.headers > rm.ind.vcf
```

Keep only those 346 filtered loci:

```
grep -w -F -f filtered.chrom rm.ind.vcf > reps.filtered.vcf
## for some reason this is keeping ten extra loci,
## so try adding a 'chr' prefix to the chromosome number in the first column of each file
sed -e 's/^/chr/' filtered.chrom > filtered.chr
sed -e 's/^/chr/' rm.ind.vcf > rm.ind.chr.vcf

# try again
grep -w -F -f filtered.chr rm.ind.chr.vcf > reps.filtered.vcf

# now remove the chr string
awk -F'chr' '{print $2}' reps.filtered.vcf > replicate.vcf

# copy the header from the previous file into a new file
head -n1 rm.ind.chr.vcf > replicateheaders.vcf
cat replicate.vcf >> replicateheaders.vcf
rm replicate.vcf
mv replicateheaders.vcf replicate.vcf
```

Now insert the vcf header manually (from the original vcf). Then use populations to make a genepop file to upload into R (locally). This will remove loci that are variable in other individuals but not variable in the replicates, so we are left with 310.

```
head -n14 filtered.final.subset.recode.vcf > vcfheaders.vcf
cat replicate.vcf >> vcfheaders.vcf
rm replicate.vcf
mv vcfheaders.vcf replicate.vcf
```

### Are there any variable SNPs in replicates?

```
mkdir replicates.filter
module load stacks/2.64-gcc-9.2.0

populations -V ./replicate.vcf -O ./replicates.filter -M /projects/p31939/DENUddRAD/RadTags_cutcheck_trim10_1200kbDrop/pop_map_1500kbDrop_with2reps --genepop
```

***OR***

```
populations -V ./replicate.vcf -O ./replicates.filter -M /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/pop_map_2200kbDrop_with2reps --genepop
```

Download the genepop file, open it in Excel, and see which variant sites are the same within tech replicates - we want to keep these. Record the loci #s from the genepop file. We next want to filter the original fasta file (populations.loci.fa) to only keep loci that have passed all filtering. First let's make a list of our loci to keep.

```
# pull the loci #s from the genepop file - the loci numbers here refer to the line number
# of the SNP in the replicate.vcf file
sed -n '2p' replicate.p.snps.genepop > replicate.variable.loci

# turn the single row of loci into multiple rows
tr , "\n" < replicate.variable.loci > replicate.variable.loci2

# delete the "_0" that follows each locus
sed 's/..$//' < replicate.variable.loci2 > replicate.variable.loci3

# make a new file with the replicate loci we want removed
sed '5d; 23d; 24d; 28d; 32d; 34d; 37d; 38d; 39d; 41d; 42d; 45d; 48d; 62d; 64d' replicate.variable.loci3 > replicate.variable.loci4

cp replicate.variable.loci4 /projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_1200kbdrop_with2reps/fasta.filter
rm replicate.variable.loci replicate.variable.loci2 replicate.variable.loci3 replicate.variable.loci4
cd ..

# pull column 1 from the vcf with loci that passed quality and bp filtering
cut -f 1 replicate.vcf | sed 's/[\t]/,/g' > rep.col.one

# remove the headers of this vcf
sed '1,15d' rep.col.one > remove.headers.rep

# make our final list of loci to keep - this command filters the original vcf to remove loci
# by line number (i.e. how loci were named in the .genepop file)
sed "$(sed 's/$/d/' replicate.variable.loci4)" remove.headers.rep > final.loci.to.keep
rm remove.headers.rep rep.col.one
```
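The last command above uses sed to build a sed script on the fly: the inner sed turns each line number in replicate.variable.loci4 into a delete command (`5d`, `23d`, ...), and the outer sed applies those deletions. A toy demonstration (drop.txt and data.txt are made up for illustration):

```
# drop.txt lists line numbers to delete; data.txt is the file to filter.
printf '2\n4\n' > drop.txt
printf 'locus1\nlocus2\nlocus3\nlocus4\n' > data.txt

# Inner sed emits "2d" and "4d"; outer sed applies them, keeping lines 1 and 3.
sed "$(sed 's/$/d/' drop.txt)" data.txt
# -> locus1
# -> locus3
```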
Filter the original VCF:

```
###############
## ANNOTATED ##
###############

# get only the chromosome number
cat final.loci.to.keep | cut -d"_" -f2 > snps.keep

# remove the header
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers

# add 'chr' to the vcf rows
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr

# add 'chr' to the snp list
sed -e 's/^/chr/' snps.keep > snps.keep.chr

# filter
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr

# now remove the chr string
awk -F'chr' '{print $2}' filtered.chr > filtered

# copy the header from the previous file into the new file
head -n1 allvcf.headers > headers.temp
cat filtered >> headers.temp
mv headers.temp filtered.vcf

###############
##  SCRIPT   ##
###############

cat final.loci.to.keep | cut -d"_" -f2 > snps.keep
sed '1,14d' filtered.final.subset.recode.vcf > allvcf.headers
sed -e 's/^/chr/' allvcf.headers > allvcf.headers.chr
sed -e 's/^/chr/' snps.keep > snps.keep.chr
grep -w -F -f snps.keep.chr allvcf.headers.chr > filtered.chr
awk -F'chr' '{print $2}' filtered.chr > filtered
head -n1 allvcf.headers > headers.temp
cat filtered >> headers.temp
mv headers.temp filtered.vcf
rm snps.keep allvcf.headers allvcf.headers.chr snps.keep.chr filtered.chr
```

### Testing linkage

```
mkdir linkage.filter
cp filtered.vcf ./linkage.filter
cd linkage.filter

## Need to load plink via conda
conda activate plink
conda install -c bioconda plink

# Make a Plink file
vcftools --vcf filtered.vcf --plink --out filtered.plink

# Make it binary - sometimes this doesn't work; manually add "a" to the end of any
# first-column chromosome > 23 in the plink.map file
plink --file filtered.plink --make-bed --out binary --noweb --allow-extra-chr

# Note from Nora: this will add "scaffold_" to each locus number, fixing the "invalid chromosome" issue:
sed -E "s/^([0-9])/scaffold_\1/g" filtered.vcf > filtered.scaffold.vcf

# Create a table of all pairwise SNPs that have an r2 value greater than 0.8
plink --bfile binary --r2 inter-chr --ld-window-r2 0.8 --allow-extra-chr --noweb --out links
```

Since these SNPs are linked, you only need to keep one SNP of each pair. To do this, Nora made a list of loci to remove. This grabs the column with SNP locus A:

```
awk '{print $3}' links.ld | cut -f1 -d":" > linkA
```

And this removes any duplicate values in that file:

```
awk '!seen[$0]++' linkA > snps.with.linkage

# get rid of the column name
sed '1d' snps.with.linkage > snps.with.linkage1

# add "chr" to the snps to drop
sed -e 's/^/chr/' snps.with.linkage1 > chr.snps.with.link
```

Now remove all loci in chr.snps.with.link from the vcf:

```
grep -wv -F -f chr.snps.with.link filtered.vcf > filtered2.vcf

# remove "chr"
awk -F'chr' '{print $2}' filtered2.vcf > filtered3.vcf && mv filtered3.vcf filtered2.vcf
```

Do the plink analysis again with those SNPs removed and see if there are still linked SNPs. Do this iteratively until there are no longer any r2 values above 0.8 (a loop sketch follows below). Nora didn't have any more linked SNPs after the first round. To inspect the pairs table: `nano links.ld`
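The "repeat until no r2 above 0.8" step could be automated with something like the following. This is a rough sketch, not the exact commands that were run: it assumes the binary fileset from above, uses the SNP_A column of plink's .ld output, and skips the chr/scaffold renaming workarounds described earlier.

```
# Repeat LD pruning until plink reports no pairs with r2 > 0.8.
while true; do
  plink --bfile binary --r2 inter-chr --ld-window-r2 0.8 --allow-extra-chr --noweb --out links
  # links.ld has a header line; stop when only the header remains.
  if [ "$(wc -l < links.ld)" -le 1 ]; then
    echo "No linked SNP pairs left."
    break
  fi
  # Drop one SNP of each linked pair (locus A), then rebuild the binary fileset.
  awk 'NR > 1 {print $3}' links.ld | cut -f1 -d":" | awk '!seen[$0]++' > drop.list
  grep -wv -F -f drop.list filtered.vcf > tmp.vcf && mv tmp.vcf filtered.vcf
  vcftools --vcf filtered.vcf --plink --out filtered.plink
  plink --file filtered.plink --make-bed --out binary --noweb --allow-extra-chr
done
```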
## Parameters summarized

- No checking for cut sites
- Stacks: m = 3; minimum raw reads to form a stack
- Stacks: M = 6; number of mismatches between allele stacks to form loci
- Stacks: n = 6; number of mismatches between loci stacks to form the catalog
- Stacks: r = 0.20; % of individuals that must possess a locus
- MAF = 0.05
- HWE = 0.05; everything in equilibrium
- MAC = 3
- No filtering on minimum mean depth of coverage
- 50 bp on either side of each SNP

## Final Loci for designing Primers:

```
/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_2200kbdrop_with2reps/fasta.filter/Connolly_96_loci_24Oct.fasta.fa
/projects/p31939/DENUddRAD/RadTags_nocutcheck/9+filesdrop/DeNovo_366_r20_maf.05_trim10_2200kbdrop_with2reps/fasta.filter/Connolly_extra_loci_20Oct.fasta.fa
```

## Trying Stacks with reference genome

```
INDS=($(for i in ./Radtags_trim_cutcheck_2200kbdrop/*.1.fq.gz; do echo $(basename ${i%.1*}); done))

for IND in ${INDS[@]}; do
  # no spaces after '=' (bash would otherwise treat the path as a command);
  # note the '.' before 1.fq.gz/2.fq.gz that was missing in a first attempt
  FORWARD=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/Radtags_trim_cutcheck_2200kbdrop/${IND}.1.fq.gz;
  REVERSE=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/Radtags_trim_cutcheck_2200kbdrop/${IND}.2.fq.gz;
  OUTPUT=~/align/${IND}_sort.bam;
done
```

As written, this loop only sets the file-name variables; no aligner is called yet (see the sketch below).
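To actually produce the sorted BAMs that Stacks' reference-based pipeline (ref_map.pl / gstacks) expects, an alignment step is needed inside the loop. A hedged sketch using bwa mem and samtools; the reference FASTA path (REF) is an assumption, not a file confirmed to exist at that location, and this step was not part of the run recorded above:

```
# Assumed reference FASTA (hypothetical path); index it once before the loop:
#   bwa index "$REF"
REF=/projects/p31939/DENUddRAD/reference_genomes/aquilegia_coerulea_genome/data/GCA_002738505.1/GCA_002738505.1.fna

mkdir -p ~/align
for IND in ${INDS[@]}; do
  FORWARD=./Radtags_trim_cutcheck_2200kbdrop/${IND}.1.fq.gz
  REVERSE=./Radtags_trim_cutcheck_2200kbdrop/${IND}.2.fq.gz
  OUTPUT=~/align/${IND}_sort.bam
  # Align the read pair and sort the output so gstacks can consume it.
  bwa mem -t 4 "$REF" "$FORWARD" "$REVERSE" | samtools sort -o "$OUTPUT" -
done
```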