Summary

  • I think the ones you had been grepping for are both V2 primers, and they sit in the middle of the amplified fragments
  • I think the primers used were:
    • 27F AGAGTTTGATCMTGGCTCAG
    • 534R CAGCAGCCGCGGTAAT
  • And I think they were already removed from the data; below is what brought me to this thinking 🙂

Env

conda create -y -n cutadapt-3.4 -c conda-forge -c bioconda -c defaults cutadapt=3.4

conda activate cutadapt-3.4

Testing those that were grep'd

  • Fwd: AGTGGCGGACGGGTGAGTAA
    • seems to be labeled as V2 according to this
  • Rev: TGCTGCCTCCCGTAGGAGT
    • also seems to be labeled as V2 according to this

Just running on one at a time:

Fwd

cutadapt -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[  8<--------] 00:00:05       213,348 reads  @     23.9 µs/read;   2.51 M reads/minute
Finished in 5.11 s (24 µs/read; 2.51 M reads/minute).

=== Summary ===

Total reads processed:                 213,348
Reads with adapters:                         0 (0.0%)
Reads written (passing filters):       213,348 (100.0%)

Total basepairs processed:    56,110,524 bp
Total written (filtered):     56,110,524 bp (100.0%)

=== Adapter 1 ===

Sequence: AGTGGCGGACGGGTGAGTAA; Type: anchored 5'; Length: 20; Trimmed: 0 times

None found

Trying without anchoring (removing the "^" in front, so the primer need not start exactly at the front of the read):

cutadapt -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g AGTGGCGGACGGGTGAGTAA -o 11GS_FWD-test.fq.gz 11GS_FWD.fastq.gz
Processing reads on 1 core in single-end mode ...
[  8<--------] 00:00:04       213,348 reads  @     22.3 µs/read;   2.69 M reads/minute
Finished in 4.77 s (22 µs/read; 2.69 M reads/minute).

=== Summary ===

Total reads processed:                 213,348
Reads with adapters:                   175,320 (82.2%)
Reads written (passing filters):       213,348 (100.0%)

Total basepairs processed:    56,110,524 bp
Total written (filtered):     40,536,480 bp (72.2%)

=== Adapter 1 ===

Sequence: AGTGGCGGACGGGTGAGTAA; Type: regular 5'; Length: 20; Trimmed: 175320 times

No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Overview of removed sequences
length	count	expect	max.err	error counts
57	1	0.0	2	1
60	2	0.0	2	0 0 2
65	129	0.0	2	0 25 104
68	3	0.0	2	0 0 3
70	9	0.0	2	0 0 9
71	32	0.0	2	0 7 25
72	17	0.0	2	0 6 11
73	3919	0.0	2	231 1460 2228
74	499	0.0	2	303 80 116
75	7631	0.0	2	1296 2419 3916
76	3079	0.0	2	2902 164 13
77	989	0.0	2	74 203 712
78	15	0.0	2	0 6 9
79	245	0.0	2	17 170 58
80	21	0.0	2	3 14 4
81	77	0.0	2	1 7 69
82	1133	0.0	2	2 678 453
83	121	0.0	2	8 71 42
84	4315	0.0	2	429 3670 216
85	2445	0.0	2	177 1264 1004
86	7050	0.0	2	2638 2968 1444
87	17207	0.0	2	2097 5585 9525
88	38167	0.0	2	440 34425 3302
89	6315	0.0	2	1824 3015 1476
90	21637	0.0	2	211 20401 1025
91	35482	0.0	2	22208 12338 936
92	985	0.0	2	33 742 210
93	9500	0.0	2	6858 1300 1342
94	360	0.0	2	164 172 24
95	1821	0.0	2	1091 421 309
96	455	0.0	2	45 238 172
97	175	0.0	2	93 77 5
98	181	0.0	2	3 42 136
99	14	0.0	2	0 3 11
100	13	0.0	2	0 6 7
101	21	0.0	2	2 12 7
102	557	0.0	2	3 484 70
103	898	0.0	2	21 246 631
104	108	0.0	2	15 52 41
105	1769	0.0	2	368 378 1023
106	7777	0.0	2	7211 520 46
107	62	0.0	2	7 33 22
108	7	0.0	2	0 2 5
109	9	0.0	2	0 1 8
112	3	0.0	2	0 3
115	2	0.0	2	1 0 1
116	1	0.0	2	0 1
117	10	0.0	2	9 0 1
118	1	0.0	2	0 0 1
120	2	0.0	2	2
122	1	0.0	2	0 1
129	1	0.0	2	1
140	1	0.0	2	0 0 1
142	1	0.0	2	0 0 1
143	2	0.0	2	0 2
146	2	0.0	2	0 0 2
147	1	0.0	2	0 0 1
148	10	0.0	2	1 8 1
149	4	0.0	2	3 1
155	3	0.0	2	0 3
156	2	0.0	2	1 1
158	1	0.0	2	0 1
159	1	0.0	2	0 1
171	1	0.0	2	0 0 1
176	2	0.0	2	0 0 2
181	1	0.0	2	0 0 1
195	1	0.0	2	1
196	1	0.0	2	0 1
211	1	0.0	2	0 1
212	7	0.0	2	0 7
213	1	0.0	2	0 1
214	1	0.0	2	0 0 1
239	1	0.0	2	0 1
241	2	0.0	2	0 1 1

Found a lot, but they are in the middle of the reads. The "length" column of the output above is the length of what was trimmed off (everything up to and including the primer match), ranging from 57 to 241. The majority were around 88-91 bases, which makes sense if these amplicons are V1/V3, and this is a V2 primer.
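To make the reading of the "length" column concrete, here is a minimal Python sketch that tallies a handful of rows copied from the forward-primer table above and reports the most common trimmed length. The row subset and the helper name `summarize_lengths` are my own picks for illustration, not anything cutadapt provides.

```python
# A few rows (length, count) copied from the "Overview of removed sequences"
# table above (forward primer, unanchored run). This is a subset, not the full table.
ROWS = """\
86 7050
87 17207
88 38167
89 6315
90 21637
91 35482
93 9500
106 7777
"""

def summarize_lengths(text):
    """Return (most common trimmed length, fraction of listed reads at 87-91)."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        length, count = line.split()[:2]
        counts[int(length)] = int(count)
    total = sum(counts.values())
    mode = max(counts, key=counts.get)
    near = sum(c for l, c in counts.items() if 87 <= l <= 91)
    return mode, near / total

mode, frac = summarize_lengths(ROWS)
print(f"most common trimmed length: {mode}")
print(f"share of these rows at 87-91 bases: {frac:.0%}")
```

Assuming no indels in the match, a trimmed length of 88 with a 20-base primer means the primer match starts at base 69 of the read, so well inside it rather than at the front.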

Rev

cutadapt -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g ^TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[  8<--------] 00:00:04       213,348 reads  @     21.0 µs/read;   2.85 M reads/minute
Finished in 4.50 s (21 µs/read; 2.85 M reads/minute).

=== Summary ===

Total reads processed:                 213,348
Reads with adapters:                         0 (0.0%)
Reads written (passing filters):       213,348 (100.0%)

Total basepairs processed:    56,750,568 bp
Total written (filtered):     56,750,568 bp (100.0%)

=== Adapter 1 ===

Sequence: TGCTGCCTCCCGTAGGAGT; Type: anchored 5'; Length: 19; Trimmed: 0 times

None found

Without anchoring:

cutadapt -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
This is cutadapt 3.4 with Python 3.9.4
Command line parameters: -g TGCTGCCTCCCGTAGGAGT -o 11GS_REV-test.fq.gz 11GS_REV.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8<---------] 00:00:02       213,348 reads  @     13.5 µs/read;   4.43 M reads/minute
Finished in 2.90 s (14 µs/read; 4.42 M reads/minute).

=== Summary ===

Total reads processed:                 213,348
Reads with adapters:                   185,604 (87.0%)
Reads written (passing filters):       213,348 (100.0%)

Total basepairs processed:    56,750,568 bp
Total written (filtered):     24,175,780 bp (42.6%)

=== Adapter 1 ===

Sequence: TGCTGCCTCCCGTAGGAGT; Type: regular 5'; Length: 19; Trimmed: 185604 times

No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1

Overview of removed sequences
length	count	expect	max.err	error counts
14	1	0.0	1	1
59	5	0.0	1	0 5
148	2	0.0	1	2
150	2	0.0	1	2
152	1	0.0	1	1
153	7	0.0	1	3 4
154	36	0.0	1	30 6
155	178	0.0	1	84 94
156	20644	0.0	1	18283 2361
157	4113	0.0	1	3083 1030
158	3684	0.0	1	2444 1240
159	2289	0.0	1	1822 467
160	1901	0.0	1	1623 278
161	900	0.0	1	765 135
162	51	0.0	1	38 13
163	10	0.0	1	7 3
164	7	0.0	1	6 1
165	19	0.0	1	7 12
166	19	0.0	1	9 10
167	9	0.0	1	9
168	133	0.0	1	112 21
169	81	0.0	1	59 22
170	4957	0.0	1	4160 797
171	264	0.0	1	215 49
172	132	0.0	1	93 39
173	1044	0.0	1	820 224
174	556	0.0	1	465 91
175	2947	0.0	1	2365 582
176	30578	0.0	1	25280 5298
177	1057	0.0	1	812 245
178	395	0.0	1	264 131
179	291	0.0	1	109 182
180	4249	0.0	1	3087 1162
181	57254	0.0	1	46282 10972
182	44819	0.0	1	36138 8681
183	2650	0.0	1	2029 621
184	314	0.0	1	238 76
185	3	0.0	1	2 1
201	1	0.0	1	1
204	1	0.0	1	1

Cutting off lots again, with the bulk at 181-182 and smaller peaks at 156 and 176. Makes sense again if these are V1/V3 amplicons, and this is a V2 primer 👍


So I think that's why you were able to grep these, but not really get 'em with cutadapt: they're just in the middle of the reads 👍
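As a sanity check on "in the middle of the reads", here's a rough Python sketch that slides each primer along the read pair examined in the reference-sequence section below and reports where the best ungapped match ends. It only counts mismatches (no indels), so it is a simplification of what cutadapt does. `best_hit` is a made-up helper name, and the read strings are pasted from the BLAST query lines further down (alignment gap removed).

```python
FWD_PRIMER = "AGTGGCGGACGGGTGAGTAA"
REV_PRIMER = "TGCTGCCTCCCGTAGGAGT"

# First 120 bases of the forward read (BLAST query lines 1-60 and 61-120)
FWD_READ = (
    "AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAA"
    "GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATA"
)
# First 199 bases of the matching reverse read
REV_READ = (
    "GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTCGCGATTAAACAA"
    "CAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGG"
    "CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA"
    "GTCTGGACCGTGTCTCCGTT"
)

def best_hit(primer, read):
    """Return (1-based end position, mismatches) of the best ungapped match.

    Ties go to the earliest position in the read.
    """
    best = None
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(primer, window))
        if best is None or mismatches < best[1]:
            best = (start + len(primer), mismatches)
    return best

for name, primer, read in [("fwd", FWD_PRIMER, FWD_READ),
                           ("rev", REV_PRIMER, REV_READ)]:
    end, mm = best_hit(primer, read)
    print(f"{name}: primer match ends at base {end} with {mm} mismatch(es)")
```

For this read pair the matches end at bases 88 (forward) and 181 (reverse), which are also the most frequent trimmed lengths in the two cutadapt tables above.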


Looking against a reference seq

Forward read

Took one seq from 11GS_FWD.fastq.gz and blasted it against the 16S db to get a ref sequence. With that we can see where we align, and can look at the reference for hints of primers (even if they happen to be cut off from our seqs already, which sometimes happens without our knowing, potentially complicating this further, ha)

>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 1:N:0:
AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATAACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACTGTGGGCAAAGATGACCTCTTCTATGTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGTTGGTAGGGTAATGGCCTACCAAGGCTA

Top hit was to this bugger, hitting its seq at positions 22-285:

Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: NR_042754.2  Length: 1399  Number of Matches: 1
Range 1: 22 to 285
Alignment statistics for match #1
Score	Expect	Identities	Gaps	Strand
329 bits(178)	4e-90	237/265(89%)	5/265(1%)	Plus/Plus
Query  1    AACGAACGCTGGCGGCATGCTTAACACATGCAAGTCGAACGAGAAAGTTTCCTTCGGGAA  60
            |||||||||||||||||||| |||||||||||||||||||| ||||| ||||||||||||
Sbjct  22   AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAA  81

Query  61   GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAATCTGCCCTATGGTCTGGGATA  120
            ||||||||||||||||||||||||||||||||||||||| ||||||  || |||||||||
Sbjct  82   GCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATA  141

Query  121  ACCTTTCGAAAGGGGGGCTAATACCGGATAAGCCCACGGAGACTTCGGTCACT-GTGGGC  179
            || | |||||||||| ||||||||||||||||| ||| | |||||||||| || ||||| 
Sbjct  142  ACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTC-CTGGTGGGA  200

Query  180  AAAGATGACCTCTTCT--AT-GTTATCGCTATCGGATGAGTCCGCGGCCCATTAGCTCGT  236
            ||||||| ||||||||  |  | ||| |  | |||||| |||||||||||||||||| ||
Sbjct  201  AAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGT  260

Query  237  TGGTAGGGTAATGGCCTACCAAGGC  261
            |||||||||||||||||||||||||
Sbjct  261  TGGTAGGGTAATGGCCTACCAAGGC  285

Getting full-length of that ref sequence:

>NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
TAGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC

Looking at where we aligned to:

  ## in front of our aligned portion
TAGAGTTTGATCCTGGCTCAG

    ## this is the common 27F primer noted above (AGAGTTTGATCMTGGCTCAG, where M = A or C; this ref has C there) with one extra base in front of it

  ## aligned portion
AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGC

Since the 27F primer is right in front of where our amplicon starts, it suggests these primers were trimmed off already.
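A small Python sketch of that check, in case it's handy for other samples: it compares the degenerate 27F primer against the reference bases sitting in front of the aligned portion. The `IUPAC` table only holds the codes needed here, and `matches_degenerate` is a hypothetical helper name.

```python
# Only the IUPAC codes needed for this primer; M stands for A or C.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC"}

PRIMER_27F = "AGAGTTTGATCMTGGCTCAG"
# Reference bases in front of the aligned portion (ref positions 1-21)
REF_PREFIX = "TAGAGTTTGATCCTGGCTCAG"

def matches_degenerate(primer, seq):
    """True if seq has the primer's length and fits every IUPAC code in it."""
    return len(primer) == len(seq) and all(
        base in IUPAC[code] for code, base in zip(primer, seq)
    )

# Drop the single leading base, then compare the remaining 20 bases.
site = REF_PREFIX[1:]
print("27F fits the ref bases before the alignment:", matches_degenerate(PRIMER_27F, site))
print("primer site ends at ref position", len(REF_PREFIX))
```

The site ends at ref position 21 and our read starts aligning at 22, so the read begins on the very next base after the primer.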

Reverse read

Got the corresponding reverse read of the one we did above:

zgrep -A 1 "^@DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749" 11GS_REV.fastq.gz | sed 's/^@/>/'
>DE18INS60510:189:000000000-AJN0U:1:1101:8782:1749 2:N:0:
GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTCGCGATTAAACAACAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGGCATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGAGTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCGTAGCATTGGTAGCCCATTACCCTCACA

Blasted against our reference sequence from above:

Aligned to ref positions 265-527:

NR_042754.2 Geothermobacter ehrlichii strain SS015 16S ribosomal RNA, partial sequence
Sequence ID: Query_3809  Length: 1399  Number of Matches: 1
Range 1: 265 to 527
Alignment statistics for match #1
Score	Expect	Identities	Gaps	Strand
357 bits(193)	8e-103	240/263(91%)	1/263(0%)	Plus/Minus
Query  1    GCACGGAGTTAGCCGGTGCTTCCTTTGAGGGTACCGTCAATACTGTC-GCGATTAAACAA  59
            |||||||||||||| | |||||||||||||||||||||| | ||| | |  ||||  || 
Sbjct  527  GCACGGAGTTAGCCCGGGCTTCCTTTGAGGGTACCGTCAGTCCTGACAGGTATTAGCCAT  468

Query  60   CAATAGTTTCTTCCCCTCTGACAGAGCTTTACGATCCTAAAACCTTCATCACTCACGCGG  119
            ||  |||||||||||||||||||||||||||||| || ||  ||||| ||||||||||||
Sbjct  467  CAGGAGTTTCTTCCCCTCTGACAGAGCTTTACGACCCGAAGGCCTTCTTCACTCACGCGG  408

Query  120  CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA  179
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  407  CATTGCTGCGTCAGGCTTTCGCCCATTGCGCAAAATTCCCCACTGCTGCCTCCCGTAGGA  348

Query  180  GTCTGGACCGTGTCTCCGTTCCCGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG  239
            |||||||||||||||| ||||| |||||||||||||||||||||||||||||||||||||
Sbjct  347  GTCTGGACCGTGTCTCAGTTCCAGTGTGGCTGATCATCCTCTCAGACCAGCTACCCATCG  288

Query  240  TAGCATTGGTAGCCCATTACCCT  262
            | || ||||||| ||||||||||
Sbjct  287  TTGCCTTGGTAGGCCATTACCCT  265

Looking at full span of reference sequence that our forward and reverse cover together

  • forward read covered 22-285 of ref positions
  • reverse read covered 265-527 of ref positions
  • so full span covered by reads for this amplified fragment is 22-527
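The span arithmetic, as a tiny Python sketch (inclusive 1-based coordinates, as BLAST reports them; `covered_span` is just an illustrative name):

```python
fwd = (22, 285)   # ref positions covered by the forward read
rev = (265, 527)  # ref positions covered by the reverse read

def covered_span(a, b):
    """Return (start, end, length, overlap) for two inclusive ranges."""
    start, end = min(a[0], b[0]), max(a[1], b[1])
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return start, end, end - start + 1, overlap

start, end, length, overlap = covered_span(fwd, rev)
print(f"span {start}-{end}: {length} bases, reads overlap by {overlap} bases")
```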

Broken down on reference:

Before aligned-portion of full fragment

TAGAGTTTGATCCTGGCTCAG

This matches the common 27F primer (AGAGTTTGATCMTGGCTCAG), with an extra 'T' in front of it on the ref here

Aligned portion of whole amplified fragment in reference (forward and reverse read; spans 506 bases in this ref)

AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGCGAAAGCTTCCTTCGGGAAGCGAGTAGAGTGGCGCACGGGTGAGTAACACGTGGATAACCTGCCCGGTGATCTGGGATAACATCTCGAAAGGGGTGCTAATACCGGATAAGCTCACAGGGACTTCGGTCCTGGTGGGAAAAGATGGCCTCTTCTTGAAAGCTATTGTCACCGGATGGGTCCGCGGCCCATTAGCTAGTTGGTAGGGTAATGGCCTACCAAGGCAACGATGGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGCGAAAGCCTGACGCAGCAATGCCGCGTGAGTGAAGAAGGCCTTCGGGTCGTAAAGCTCTGTCAGAGGGGAAGAAACTCCTGATGGCTAATACCTGTCAGGACTGACGGTACCCTCAAAGGAAGCCCGGGCTAACTCCGTGC

After aligned-portion

CAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTGTTCGAAATTATTGGGCGTAAAGCGCGTGTAGGCGGTCCGTTAAGTCTGATGTGAAAGCCCGGGGCTCAACCTCGGAAGTGCATTGGAAACTGGCGGACTTGAGTACGGGAGAGGGAAGTGGAATTCCGAGTGTAGGGGTGAAATCCGTAGATATTCGGAGGAACACCGGTGGCGAAGGCGGCTTCCTGGACCGATACTGACGCTGAGACGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGGGTACTAGGTGTTGCGGGTATTGACCCCTGCAGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTTGACATCCCGATCGTATCCCATGGAAACATGGGAGTCAGTTCGGCTGGATCGGTGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCCTTAGTTGCCATCATTCAGTTGGGCACTCTAGGGAGACTGCCGGTGTTAAACCGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCCTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAAGGGTAGCGATACCGTGAGGTGGAGCCAATCCCAAAAAGCCGGTCTCAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTTGGAATCGCTAGTAATCGCGTATCAGCATGACGCGGTAATACGTGCCCGGGC

The first 16 bases after the aligned portion are the common 534R primer site (CAGCAGCCGCGGTAAT, written here as it reads on the reference's forward strand)
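Since the reverse read is on the opposite strand, it can help to flip things into the same orientation before comparing. Here's a minimal Python sketch doing that; `revcomp` is a plain helper written for this note, and the short sequences are copied from the reference span and the reverse read above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

SITE_534 = "CAGCAGCCGCGGTAAT"   # as it reads on the reference's forward strand
REF_SPAN_END = "AACTCCGTGC"     # last 10 bases of the aligned span (ref 518-527)
REV_READ_START = "GCACGGAGTT"   # first 10 bases of the reverse read

# How the primer site would look at the start of a reverse read if it were still there
print("site in reverse-read orientation:", revcomp(SITE_534))
# The reverse read instead starts on the ref base right before the site
print("reverse read starts at end of aligned span:",
      revcomp(REV_READ_START) == REF_SPAN_END)
```

So the reverse read starts on the base immediately next to the 534 site rather than with the site itself, the same picture as on the forward side.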


So I think the forward and reverse primers were already trimmed. If it's possible that different samples were treated differently, you may want to check some of them in a similar fashion 🙂