---
tags: primer hunting
title: Biji 12-Jul-2021
---

[toc]

# Summary

* I think the ones you had been grepping for were both V2 primers, found in the middle of the amplified fragments
* I think the primers used were:
    * 27F AGAGTTTGATCMTGGCTCAG
    * 534R CAGCAGCCGCGGTAAT
* And that they were already removed from the data; below shows what brought me to this thinking 🙂

# Env

```bash
conda create -y -n cutadapt-3.4 -c conda-forge -c bioconda -c defaults cutadapt=3.4
conda activate cutadapt-3.4
```

# Testing those that were `grep`'d

* Fwd: AGTGGCGGACGGGTGAGTAA
    * seems to be labeled as V2 according to [this](https://link.springer.com/article/10.1007/s00248-018-1299-5/tables/1)
* Rev: TGCTGCCTCCCGTAGGAGT
    * also seems to be labeled as V2 according to [this](https://www.nature.com/articles/s41598-018-27757-8)

Just running on one at a time:

## Fwd

```bash
cutadapt -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
```

```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:05 213,348 reads @ 23.9 µs/read; 2.51 M reads/minute
Finished in 5.11 s (24 µs/read; 2.51 M reads/minute).

=== Summary ===

Total reads processed: 213,348
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 213,348 (100.0%)

Total basepairs processed: 56,110,524 bp
Total written (filtered): 56,110,524 bp (100.0%)

=== Adapter 1 ===

Sequence: AGTGGCGGACGGGTGAGTAA; Type: anchored 5'; Length: 20; Trimmed: 0 times
```

**None found**

Trying without anchoring (removing the "^" in front, so the primer need not start exactly at the front of the read):

```bash
cutadapt -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
```

```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:04 213,348 reads @ 22.3 µs/read; 2.69 M reads/minute
Finished in 4.77 s (22 µs/read; 2.69 M reads/minute).

=== Summary ===

Total reads processed: 213,348
Reads with adapters: 175,320 (82.2%)
Reads written (passing filters): 213,348 (100.0%)

Total basepairs processed: 56,110,524 bp
Total written (filtered): 40,536,480 bp (72.2%)

=== Adapter 1 ===

Sequence: AGTGGCGGACGGGTGAGTAA; Type: regular 5'; Length: 20; Trimmed: 175320 times

No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Overview of removed sequences
length count expect max.err error counts
57 1 0.0 2 1
60 2 0.0 2 0 0 2
65 129 0.0 2 0 25 104
68 3 0.0 2 0 0 3
70 9 0.0 2 0 0 9
71 32 0.0 2 0 7 25
72 17 0.0 2 0 6 11
73 3919 0.0 2 231 1460 2228
74 499 0.0 2 303 80 116
75 7631 0.0 2 1296 2419 3916
76 3079 0.0 2 2902 164 13
77 989 0.0 2 74 203 712
78 15 0.0 2 0 6 9
79 245 0.0 2 17 170 58
80 21 0.0 2 3 14 4
81 77 0.0 2 1 7 69
82 1133 0.0 2 2 678 453
83 121 0.0 2 8 71 42
84 4315 0.0 2 429 3670 216
85 2445 0.0 2 177 1264 1004
86 7050 0.0 2 2638 2968 1444
87 17207 0.0 2 2097 5585 9525
88 38167 0.0 2 440 34425 3302
89 6315 0.0 2 1824 3015 1476
90 21637 0.0 2 211 20401 1025
91 35482 0.0 2 22208 12338 936
92 985 0.0 2 33 742 210
93 9500 0.0 2 6858 1300 1342
94 360 0.0 2 164 172 24
95 1821 0.0 2 1091 421 309
96 455 0.0 2 45 238 172
97 175 0.0 2 93 77 5
98 181 0.0 2 3 42 136
99 14 0.0 2 0 3 11
100 13 0.0 2 0 6 7
101 21 0.0 2 2 12 7
102 557 0.0 2 3 484 70
103 898 0.0 2 21 246 631
104 108 0.0 2 15 52 41
105 1769 0.0 2 368 378 1023
106 7777 0.0 2 7211 520 46
107 62 0.0 2 7 33 22
108 7 0.0 2 0 2 5
109 9 0.0 2 0 1 8
112 3 0.0 2 0 3
115 2 0.0 2 1 0 1
116 1 0.0 2 0 1
117 10 0.0 2 9 0 1
118 1 0.0 2 0 0 1
120 2 0.0 2 2
122 1 0.0 2 0 1
129 1 0.0 2 1
140 1 0.0 2 0 0 1
142 1 0.0 2 0 0 1
143 2 0.0 2 0 2
146 2 0.0 2 0 0 2
147 1 0.0 2 0 0 1
148 10 0.0 2 1 8 1
149 4 0.0 2 3 1
155 3 0.0 2 0 3
156 2 0.0 2 1 1
158 1 0.0 2 0 1
159 1 0.0 2 0 1
171 1 0.0 2 0 0 1
176 2 0.0 2 0 0 2
181 1 0.0 2 0 0 1
195 1 0.0 2 1
196 1 0.0 2 0 1
211 1 0.0 2 0 1
212 7 0.0 2 0 7
213 1 0.0 2 0 1
214 1 0.0 2 0 0 1
239 1 0.0 2 0 1
241 2 0.0 2 0 1 1
```

**Found a lot, but they are in the middle of the reads. Looking at the "length" column of the output above, that's the length of things trimmed, ranging from 57 to 241. The majority were around 88-91 bases, which makes sense if these amplicons are V1/V3, and this is a V2 primer.**

## Rev

```bash
cutadapt -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
```

```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:04 213,348 reads @ 21.0 µs/read; 2.85 M reads/minute
Finished in 4.50 s (21 µs/read; 2.85 M reads/minute).

=== Summary ===

Total reads processed: 213,348
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 213,348 (100.0%)

Total basepairs processed: 56,750,568 bp
Total written (filtered): 56,750,568 bp (100.0%)

=== Adapter 1 ===

Sequence: TGCTGCCTCCCGTAGGAGT; Type: anchored 5'; Length: 19; Trimmed: 0 times
```

**None found**

Without anchoring:

```bash
cutadapt -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
```

```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<---------] 00:00:02 213,348 reads @ 13.5 µs/read; 4.43 M reads/minute
Finished in 2.90 s (14 µs/read; 4.42 M reads/minute).

=== Summary ===

Total reads processed: 213,348
Reads with adapters: 185,604 (87.0%)
Reads written (passing filters): 213,348 (100.0%)

Total basepairs processed: 56,750,568 bp
Total written (filtered): 24,175,780 bp (42.6%)

=== Adapter 1 ===

Sequence: TGCTGCCTCCCGTAGGAGT; Type: regular 5'; Length: 19; Trimmed: 185604 times

No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1

Overview of removed sequences
length count expect max.err error counts
14 1 0.0 1 1
59 5 0.0 1 0 5
148 2 0.0 1 2
150 2 0.0 1 2
152 1 0.0 1 1
153 7 0.0 1 3 4
154 36 0.0 1 30 6
155 178 0.0 1 84 94
156 20644 0.0 1 18283 2361
157 4113 0.0 1 3083 1030
158 3684 0.0 1 2444 1240
159 2289 0.0 1 1822 467
160 1901 0.0 1 1623 278
161 900 0.0 1 765 135
162 51 0.0 1 38 13
163 10 0.0 1 7 3
164 7 0.0 1 6 1
165 19 0.0 1 7 12
166 19 0.0 1 9 10
167 9 0.0 1 9
168 133 0.0 1 112 21
169 81 0.0 1 59 22
170 4957 0.0 1 4160 797
171 264 0.0 1 215 49
172 132 0.0 1 93 39
173 1044 0.0 1 820 224
174 556 0.0 1 465 91
175 2947 0.0 1 2365 582
176 30578 0.0 1 25280 5298
177 1057 0.0 1 812 245
178 395 0.0 1 264 131
179 291 0.0 1 109 182
180 4249 0.0 1 3087 1162
181 57254 0.0 1 46282 10972
182 44819 0.0 1 36138 8681
183 2650 0.0 1 2029 621
184 314 0.0 1 238 76
185 3 0.0 1 2 1
201 1 0.0 1 1
204 1 0.0 1 1
```

Cutting off lots again, with the bulk between 156 and 182 bases (biggest counts at 156, 176, 181, and 182). Makes sense again if these are V1/V3 amplicons, and this is a V2 primer 👍

---

So I think that's why you were able to `grep` these, but not really get 'em with cutadapt when anchored: they're just in the middle of the reads 👍

---

# Looking against a reference seq

## Forward read

Took one seq from 11GS_FWD.fastq.gz and blasted it against the 16S db to get a ref sequence.
With that we can see where we align, and can look at the reference for hints of primers (even if they happen to be cut off from our seqs already, which happens sometimes without our knowing, potentially complicating this further, ha).

```
>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 1:N:0:
AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATAACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACTGTGGGCAAAGATGACCTCTTCTATGTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGTTGGTAGGGTAATGGCCTACCAAGGCTA
```

![](https://i.imgur.com/n3oBRDK.png)

Top hit was to this bugger, hitting its seq at its positions 22-285:

```
Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: NR_042754.2  Length: 1399  Number of Matches: 1

Range 1: 22 to 285

Alignment statistics for match #1
Score          Expect  Identities    Gaps       Strand
329 bits(178)  4e-90   237/265(89%)  5/265(1%)  Plus/Plus

Query  1    AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAA  60
            |||||||||||||||||||| |||||||||||||||||||| ||||| ||||||||||||
Sbjct  22   AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAA  81

Query  61   GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATA  120
            ||||||||||||||||||||||||||||||||||||||| ||||||  || |||||||||
Sbjct  82   GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATA  141

Query  121  ACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACT-GTGGGC  179
            || | |||||||||| ||||||||||||||||| ||| | |||||||||| || |||||
Sbjct  142  ACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTC-CTGGTGGGA  200

Query  180  AAAGATGACCTCTTCT--AT-GTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGT  236
            ||||||| ||||||||  |  | ||| |  | |||||| |||||||||||||||||| ||
Sbjct  201  AAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGT  260

Query  237  TGGTAGGGTAATGGCCTACCAAGGC  261
            |||||||||||||||||||||||||
Sbjct  261  TGGTAGGGTAATGGCCTACCAAGGC  285
```

Getting the full length of that ref sequence:

```
>NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
TAGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC
```

Looking at where we aligned to:

```
## in front of our aligned portion
TAGAGTTTGATCCTGGCTCAG
## this is the common 27-F primer noted above (AGAGTTTGATCCTGGCTCAG) with one base in front of it, so suggests the forward primers are cut off already

## aligned portion
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGC
```

**Since the 27F primer is right in front of where our amplicon starts, it suggests these primers were trimmed off already.**

## Reverse read

Got the corresponding reverse read of the one we did above:

```bash
zgrep -A 1 "^@DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749" 11GS_REV.fastq.gz | sed 's/^@/>/'

>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 2:N:0:
GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTCGCGATTAAACAACAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGGCATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCGTAGCATTGGTAGCCCATTACCCTCACA
```

Blasted against our reference sequence from above:

![](https://i.imgur.com/ZXlzliG.png)

Aligned to ref positions 265-527:

```
NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: Query_3809  Length: 1399  Number of Matches: 1

Range 1: 265 to 527

Alignment statistics for match #1
Score          Expect  Identities    Gaps       Strand
357 bits(193)  8e-103  240/263(91%)  1/263(0%)  Plus/Minus

Query  1    GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTC-GCGATTAAACAA  59
            |||||||||||||| | |||||||||||||||||||||| | ||| | |  ||||  ||
Sbjct  527  GCACGGAGTTAGCCCGGGCTTCCTTTGAGGGTACCGTCAGTCCTGACAGGTATTAGCCAT  468

Query  60   CAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGG  119
            ||  |||||||||||||||||||||||||||||| || ||  ||||| ||||||||||||
Sbjct  467  CAGGAGTTTCTTCCCCTCTGACAGAGCTTTACGACCCGAAGGCCTTCTTCACTCACGCGG  408

Query  120  CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA  179
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  407  CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA  348

Query  180  GTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG  239
            |||||||||||||||| ||||| |||||||||||||||||||||||||||||||||||||
Sbjct  347  GTCTGGACCGTGTCTCAGTTCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG  288

Query  240  TAGCATTGGTAGCCCATTACCCT  262
            | || ||||||| ||||||||||
Sbjct  287  TTGCCTTGGTAGGCCATTACCCT  265
```

## Looking at full span of reference sequence that our forward and reverse cover together

* forward read covered 22-285 of ref positions
* reverse read covered 265-527 of ref positions
* so full span covered by reads for this amplified fragment is 22-527

Broken down on reference:

*Before aligned-portion of full fragment*

```
TAGAGTTTGATCCTGGCTCAG
```

**This matches the common position 27 primer (AGAGTTTGATCMTGGCTCAG), with a 'T' in front of it on the ref here**

*Aligned portion of whole amplified fragment in reference (forward and reverse read; spans 506 bases in this ref)*

```
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGC
```

*After aligned-portion*

```
CAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC
```

**First 16 bases of the after aligned-portion are a common position 534 primer (CAGCAGCCGCGGTAAT)**

---

So I think the forward and reverse primers were already trimmed. If it's possible different samples have been treated differently, you may want to check some of them in a similar fashion 🙂

---
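
For checking other samples in a similar fashion, here's a rough sketch of a helper that expands the IUPAC ambiguity codes in a primer (like the `M` in 27F) and reports how many bases sit in front of the first match. The function names `iupac_to_regex` and `primer_offset` are made up for illustration (they're not part of cutadapt), and it assumes uppercase sequences and exact matches only, so unlike cutadapt it won't tolerate mismatches. It only needs `sed` and `awk`:

```bash
# Expand IUPAC ambiguity codes in a primer into one regex character class per base.
# (hypothetical helper for illustration, not part of cutadapt)
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Print how many bases come before the first exact primer match in a sequence,
# or "absent" if the primer isn't found at all.
primer_offset() {
    awk -v re="$(iupac_to_regex "$1")" -v seq="$2" \
        'BEGIN { if (match(seq, re)) print RSTART - 1; else print "absent" }'
}

# 27F against the reference bases in front of the aligned portion
primer_offset AGAGTTTGATCMTGGCTCAG TAGAGTTTGATCCTGGCTCAG           # prints 1
# 534R against the start of the reference bases after the aligned portion
primer_offset CAGCAGCCGCGGTAAT CAGCAGCCGCGGTAATACGGAGGGTCCGAG      # prints 0
# 27F against the start of the forward read used above
primer_offset AGAGTTTGATCMTGGCTCAG AACGAACGCTGGCGGCATGCTTAACACATG  # prints absent
```

On a real file, the sequence lines could be pulled out with something like `gzip -dc 11GS_FWD.fastq.gz | awk 'NR % 4 == 2' | head -n 1000` and fed to `primer_offset` one read at a time. Mostly `0` would suggest the primer is still attached at the front, mostly larger numbers that it sits in the middle of the reads (as with the V2 primers above), and mostly `absent` that it was already removed or was never the primer used.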