---
tags: primer hunting
title: Biji 12-Jul-2021
---
[toc]
# Summary
* I think the ones you had been grepping for are both V2 primers, and they sit in the middle of the amplified fragments
* I think the primers used were:
    * 27F AGAGTTTGATCMTGGCTCAG
    * 534R CAGCAGCCGCGGTAAT
* And that they were already removed from the data. The rest of this note shows what brought me to this thinking 🙂
# Env
```bash
conda create -y -n cutadapt-3.4 -c conda-forge -c bioconda -c defaults cutadapt=3.4
conda activate cutadapt-3.4
```
# Testing those that were `grep`'d
* Fwd: AGTGGCGGACGGGTGAGTAA
    * seems to be labeled as V2 according to [this](https://link.springer.com/article/10.1007/s00248-018-1299-5/tables/1)
* Rev: TGCTGCCTCCCGTAGGAGT
    * also seems to be labeled as V2 according to [this](https://www.nature.com/articles/s41598-018-27757-8)
Just running on one at a time:
## Fwd
```bash
cutadapt -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
```
```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:05 213,348 reads @ 23.9 µs/read; 2.51 M reads/minute
Finished in 5.11 s (24 µs/read; 2.51 M reads/minute).
=== Summary ===
Total reads processed: 213,348
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 213,348 (100.0%)
Total basepairs processed: 56,110,524 bp
Total written (filtered): 56,110,524 bp (100.0%)
=== Adapter 1 ===
Sequence: AGTGGCGGACGGGTGAGTAA; Type: anchored 5'; Length: 20; Trimmed: 0 times
```
**None found**
Trying without anchoring (removing the "^" in front, so the primer no longer needs to sit at the very start of the read):
```bash
cutadapt -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
```
```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:04 213,348 reads @ 22.3 µs/read; 2.69 M reads/minute
Finished in 4.77 s (22 µs/read; 2.69 M reads/minute).
=== Summary ===
Total reads processed: 213,348
Reads with adapters: 175,320 (82.2%)
Reads written (passing filters): 213,348 (100.0%)
Total basepairs processed: 56,110,524 bp
Total written (filtered): 40,536,480 bp (72.2%)
=== Adapter 1 ===
Sequence: AGTGGCGGACGGGTGAGTAA; Type: regular 5'; Length: 20; Trimmed: 175320 times
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2
Overview of removed sequences
length count expect max.err error counts
57 1 0.0 2 1
60 2 0.0 2 0 0 2
65 129 0.0 2 0 25 104
68 3 0.0 2 0 0 3
70 9 0.0 2 0 0 9
71 32 0.0 2 0 7 25
72 17 0.0 2 0 6 11
73 3919 0.0 2 231 1460 2228
74 499 0.0 2 303 80 116
75 7631 0.0 2 1296 2419 3916
76 3079 0.0 2 2902 164 13
77 989 0.0 2 74 203 712
78 15 0.0 2 0 6 9
79 245 0.0 2 17 170 58
80 21 0.0 2 3 14 4
81 77 0.0 2 1 7 69
82 1133 0.0 2 2 678 453
83 121 0.0 2 8 71 42
84 4315 0.0 2 429 3670 216
85 2445 0.0 2 177 1264 1004
86 7050 0.0 2 2638 2968 1444
87 17207 0.0 2 2097 5585 9525
88 38167 0.0 2 440 34425 3302
89 6315 0.0 2 1824 3015 1476
90 21637 0.0 2 211 20401 1025
91 35482 0.0 2 22208 12338 936
92 985 0.0 2 33 742 210
93 9500 0.0 2 6858 1300 1342
94 360 0.0 2 164 172 24
95 1821 0.0 2 1091 421 309
96 455 0.0 2 45 238 172
97 175 0.0 2 93 77 5
98 181 0.0 2 3 42 136
99 14 0.0 2 0 3 11
100 13 0.0 2 0 6 7
101 21 0.0 2 2 12 7
102 557 0.0 2 3 484 70
103 898 0.0 2 21 246 631
104 108 0.0 2 15 52 41
105 1769 0.0 2 368 378 1023
106 7777 0.0 2 7211 520 46
107 62 0.0 2 7 33 22
108 7 0.0 2 0 2 5
109 9 0.0 2 0 1 8
112 3 0.0 2 0 3
115 2 0.0 2 1 0 1
116 1 0.0 2 0 1
117 10 0.0 2 9 0 1
118 1 0.0 2 0 0 1
120 2 0.0 2 2
122 1 0.0 2 0 1
129 1 0.0 2 1
140 1 0.0 2 0 0 1
142 1 0.0 2 0 0 1
143 2 0.0 2 0 2
146 2 0.0 2 0 0 2
147 1 0.0 2 0 0 1
148 10 0.0 2 1 8 1
149 4 0.0 2 3 1
155 3 0.0 2 0 3
156 2 0.0 2 1 1
158 1 0.0 2 0 1
159 1 0.0 2 0 1
171 1 0.0 2 0 0 1
176 2 0.0 2 0 0 2
181 1 0.0 2 0 0 1
195 1 0.0 2 1
196 1 0.0 2 0 1
211 1 0.0 2 0 1
212 7 0.0 2 0 7
213 1 0.0 2 0 1
214 1 0.0 2 0 0 1
239 1 0.0 2 0 1
241 2 0.0 2 0 1 1
```
**Found a lot, but they are in the middle of the reads. The "length" column of the output above is the length of what got trimmed (everything in front of the primer plus the primer itself), ranging from 57 to 241 bases. The majority were around 88-91 bases, which makes sense if these amplicons are V1/V3, and this is a V2 primer.**
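To see where in each read a primer sits, something like the sketch below may help. It is not part of what was run above: `primer_offsets` is a hypothetical helper that only finds exact matches (the cutadapt run above allowed up to 2 errors), and it assumes plain 4-line FASTQ records on stdin. It prints the 0-based offset of the primer for each read that contains it; adding the primer length to that offset should correspond to the "length" cutadapt reports.

```bash
# Hypothetical helper, exact matches only (no mismatches allowed).
# $1 = primer sequence; FASTQ is read from stdin (assumes 4 lines per record).
# Prints the 0-based offset of the primer in every read that contains it.
primer_offsets() {
  awk -v p="$1" 'NR % 4 == 2 {
    i = index($0, p)          # 1-based position of first match, 0 if absent
    if (i > 0) print i - 1
  }'
}
```

Running it as `gzip -dc 11GS_FWD.fastq.gz | primer_offsets AGTGGCGGACGGGTGAGTAA | sort -n | uniq -c` should give a tally of offsets; an offset of 68 plus the 20-base primer would match the 88-base peak above.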
## Rev
```bash
cutadapt -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
```
```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<--------] 00:00:04 213,348 reads @ 21.0 µs/read; 2.85 M reads/minute
Finished in 4.50 s (21 µs/read; 2.85 M reads/minute).
=== Summary ===
Total reads processed: 213,348
Reads with adapters: 0 (0.0%)
Reads written (passing filters): 213,348 (100.0%)
Total basepairs processed: 56,750,568 bp
Total written (filtered): 56,750,568 bp (100.0%)
=== Adapter 1 ===
Sequence: TGCTGCCTCCCGTAGGAGT; Type: anchored 5'; Length: 19; Trimmed: 0 times
```
**None found**
Without anchoring:
```bash
cutadapt -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
```
```
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<---------] 00:00:02 213,348 reads @ 13.5 µs/read; 4.43 M reads/minute
Finished in 2.90 s (14 µs/read; 4.42 M reads/minute).
=== Summary ===
Total reads processed: 213,348
Reads with adapters: 185,604 (87.0%)
Reads written (passing filters): 213,348 (100.0%)
Total basepairs processed: 56,750,568 bp
Total written (filtered): 24,175,780 bp (42.6%)
=== Adapter 1 ===
Sequence: TGCTGCCTCCCGTAGGAGT; Type: regular 5'; Length: 19; Trimmed: 185604 times
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1
Overview of removed sequences
length count expect max.err error counts
14 1 0.0 1 1
59 5 0.0 1 0 5
148 2 0.0 1 2
150 2 0.0 1 2
152 1 0.0 1 1
153 7 0.0 1 3 4
154 36 0.0 1 30 6
155 178 0.0 1 84 94
156 20644 0.0 1 18283 2361
157 4113 0.0 1 3083 1030
158 3684 0.0 1 2444 1240
159 2289 0.0 1 1822 467
160 1901 0.0 1 1623 278
161 900 0.0 1 765 135
162 51 0.0 1 38 13
163 10 0.0 1 7 3
164 7 0.0 1 6 1
165 19 0.0 1 7 12
166 19 0.0 1 9 10
167 9 0.0 1 9
168 133 0.0 1 112 21
169 81 0.0 1 59 22
170 4957 0.0 1 4160 797
171 264 0.0 1 215 49
172 132 0.0 1 93 39
173 1044 0.0 1 820 224
174 556 0.0 1 465 91
175 2947 0.0 1 2365 582
176 30578 0.0 1 25280 5298
177 1057 0.0 1 812 245
178 395 0.0 1 264 131
179 291 0.0 1 109 182
180 4249 0.0 1 3087 1162
181 57254 0.0 1 46282 10972
182 44819 0.0 1 36138 8681
183 2650 0.0 1 2029 621
184 314 0.0 1 238 76
185 3 0.0 1 2 1
201 1 0.0 1 1
204 1 0.0 1 1
```
Cutting off lots again, with the bulk at 181-182 and further peaks at 176 and 156. Makes sense again if these are V1/V3 amplicons, and this is a V2 primer 👍
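Picking the peaks out of a long "Overview of removed sequences" table by eye is error-prone, so here is a rough sketch for ranking them. `top_removed_lengths` is a hypothetical helper, not a cutadapt feature; it assumes the report text is fed on stdin and that the table starts with the `length count ...` header line as in the output above. The rows in the demo are copied from the Rev output.

```bash
# Hypothetical helper: print "count length" for the N most frequent
# removed lengths in a cutadapt report read from stdin (default N = 5).
top_removed_lengths() {
  awk '/^length[ \t]+count/ { in_table = 1; next }   # table header
       in_table && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2, $1 }
       in_table && $1 !~ /^[0-9]+$/ { in_table = 0 } # table ended
      ' | sort -k1,1nr | head -n "${1:-5}"
}

# Demo with four rows taken from the Rev report above
top_removed_lengths 2 <<'EOF'
length count expect max.err error counts
156 20644 0.0 1 18283 2361
176 30578 0.0 1 25280 5298
181 57254 0.0 1 46282 10972
182 44819 0.0 1 36138 8681
EOF
```

On a full run, the report that cutadapt prints to the terminal could be redirected to a file and piped in the same way.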
---
So I think that's why you were able to `grep` these, but not really get 'em with cutadapt: they're just in the middle of the reads 👍
---
# Looking against a reference seq
## Forward read
Took one seq from 11GS_FWD.fastq.gz and blasted it against the 16S db to get a ref sequence. With that we can see where we align, and can look at the reference for hints of primers (even if they happen to be cut off from our seqs already, which sometimes happens without our knowing, potentially complicating this further, ha)
```
>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 1:N:0:
AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATAACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACTGTGGGCAAAGATGACCTCTTCTATGTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGTTGGTAGGGTAATGGCCTACCAAGGCTA
```

Top hit was to this bugger, hitting its seq at positions 22-285:
```
Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: NR_042754.2  Length: 1399  Number of Matches: 1
Range 1: 22 to 285
Alignment statistics for match #1
Score Expect Identities Gaps Strand
329 bits(178) 4e-90 237/265(89%) 5/265(1%) Plus/Plus
Query 1 AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAA 60
|||||||||||||||||||| |||||||||||||||||||| ||||| ||||||||||||
Sbjct 22 AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAA 81
Query 61 GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATA 120
||||||||||||||||||||||||||||||||||||||| |||||| || |||||||||
Sbjct 82 GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATA 141
Query 121 ACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACT-GTGGGC 179
|| | |||||||||| ||||||||||||||||| ||| | |||||||||| || |||||
Sbjct 142 ACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTC-CTGGTGGGA 200
Query 180 AAAGATGACCTCTTCT--AT-GTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGT 236
||||||| |||||||| | | ||| | | |||||| |||||||||||||||||| ||
Sbjct 201 AAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGT 260
Query 237 TGGTAGGGTAATGGCCTACCAAGGC 261
|||||||||||||||||||||||||
Sbjct 261 TGGTAGGGTAATGGCCTACCAAGGC 285
```
Getting full-length of that ref sequence:
```
>NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
TAGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC
```
Looking at where we aligned to:
```
## in front of our aligned portion
TAGAGTTTGATCCTGGCTCAG
## this is the common 27F primer noted above (AGAGTTTGATCMTGGCTCAG, with C at the degenerate M position) with one base in front of it, so suggests the forward primers are cut off already
## aligned portion
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGC
```
**Since the 27F primer is right in front of where our amplicon starts, it suggests these primers were trimmed off already.**
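Since 27F carries a degenerate base (M, meaning A or C), a plain `grep` for the primer string would miss it in the reference. A minimal sketch of one way around that is below; `iupac_to_regex` is a hypothetical helper (not from any tool used here) that expands the standard IUPAC ambiguity codes into regex character classes, and it assumes an uppercase primer.

```bash
# Hypothetical helper: turn an uppercase IUPAC primer into an extended regex.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# The 21 reference bases in front of our aligned portion
ref_prefix=TAGAGTTTGATCCTGGCTCAG

if printf '%s\n' "$ref_prefix" | grep -qE "$(iupac_to_regex AGAGTTTGATCMTGGCTCAG)"; then
  echo "27F found in front of the aligned portion"
fi
```

The same pattern could be passed to `grep -E` against reads, keeping in mind that this is still exact matching apart from the degenerate positions.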
## Reverse read
Got the corresponding reverse read of the one we did above:
```bash
zgrep -A 1 "^@DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749" 11GS_REV.fastq.gz | sed 's/^@/>/'
>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 2:N:0:
GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTCGCGATTAAACAACAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGGCATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCGTAGCATTGGTAGCCCATTACCCTCACA
```
Blasted against our reference sequence from above:

Aligned to ref positions 265-527:
```
NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: Query_3809  Length: 1399  Number of Matches: 1
Range 1: 265 to 527
Alignment statistics for match #1
Score Expect Identities Gaps Strand
357 bits(193) 8e-103 240/263(91%) 1/263(0%) Plus/Minus
Query 1 GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTC-GCGATTAAACAA 59
|||||||||||||| | |||||||||||||||||||||| | ||| | | |||| ||
Sbjct 527 GCACGGAGTTAGCCCGGGCTTCCTTTGAGGGTACCGTCAGTCCTGACAGGTATTAGCCAT 468
Query 60 CAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGG 119
|| |||||||||||||||||||||||||||||| || || ||||| ||||||||||||
Sbjct 467 CAGGAGTTTCTTCCCCTCTGACAGAGCTTTACGACCCGAAGGCCTTCTTCACTCACGCGG 408
Query 120 CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA 179
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 407 CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA 348
Query 180 GTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG 239
|||||||||||||||| ||||| |||||||||||||||||||||||||||||||||||||
Sbjct 347 GTCTGGACCGTGTCTCAGTTCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG 288
Query 240 TAGCATTGGTAGCCCATTACCCT 262
| || ||||||| ||||||||||
Sbjct 287 TTGCCTTGGTAGGCCATTACCCT 265
```
## Looking at full span of reference sequence that our forward and reverse cover together
* forward read covered 22-285 of ref positions
* reverse read covered 265-527 of ref positions
* so full span covered by reads for this amplified fragment is 22-527
Broken down on reference:
*Before aligned-portion of full fragment*
```
TAGAGTTTGATCCTGGCTCAG
```
**This matches the common 27F primer (AGAGTTTGATCMTGGCTCAG), with an extra 'T' in front of it on the ref here**
*Aligned portion of whole amplified fragment in reference (forward and reverse read – spans 506 bases in this ref)*
```
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGC
```
*After aligned-portion*
```
CAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC
```
**First 16 bases of the after-aligned portion are the common 534R primer (CAGCAGCCGCGGTAAT), written here as it reads on the reference's forward strand**
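Because the reverse read aligns Plus/Minus to the reference, a 534R primer still attached to it would show up at the start of the read as the reverse complement of the string above. A small sketch for getting that string follows; `revcomp` is a hypothetical helper that assumes `rev` and `tr` are available and only handles A/C/G/T (IUPAC codes are left uncomplemented).

```bash
# Hypothetical helper: reverse-complement a plain A/C/G/T sequence.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# 534R as it reads on the reference's forward strand
revcomp CAGCAGCCGCGGTAAT
```

This gives ATTACCGCGGCTGCTG, whereas the reverse read above starts with GCACGGAGTTAGCC, which is consistent with the primer already being gone.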
---
So I think the forward and reverse primers were already trimmed. If it's possible that different samples were treated differently, you may want to check some of them in a similar fashion 🙂
---