ATB download: https://github.com/sourmash-bio/sourmash/issues/3247
GTDB files: https://github.com/sourmash-bio/sourmash/issues/3183#issuecomment-2362160753
Build a picklist for the GTDB genomes we're interested in (Sulcia and Buchnera):
head -1 gtdb-rs220.lineages.csv > sublineages.csv
egrep -i 'buchnera|sulcia' gtdb-rs220.lineages.csv >> sublineages.csv
Extract just those genomes from GTDB RS220:
sourmash sig cat --picklist sublineages.csv:ident:ident gtdb-rs220-k21.zip -o sublineages.sig.zip
Search ATB v0.2 for overlaps:
sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv -k 21
(Takes about 10-15 minutes )
Parse results:
>>> df = pd.read_csv('sublineages.x.atb.csv')
>>> df2 = df[df['max_containment_ani'] >= 0.9]
>>> df2[['match_name', 'max_containment_ani']]
match_name max_containment_ani
23861 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 0.907215
23879 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 0.921266
23901 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 0.910469
23906 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 0.997631
23909 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 1.000000
23911 NZ_CP056771.1 Buchnera aphidicola (Aphis gossy... 0.999172
Eliminate all ATB genomes already labeled as Buchnera:
>>> df3 = df[~df['match_name'].str.contains('Buchnera')]
>>> len(df3)
26764
>>> df3['max_containment'].max()
np.float64(0.0448430493273542)
>>> df3.sort_values(by='max_containment')[['match_name', 'max_containment']]
match_name max_containment
21457 NZ_KB944588.1 Enterococcus faecalis ATCC 19433... 0.010017
96 NZ_BNJW01000001.1 Enterococcus lactis strain C... 0.010017
26119 NZ_LT700212.1 Actinomyces provencensis strain ... 0.010017
25967 NZ_MVFY01000010.1 Citrobacter portucalensis st... 0.010017
25963 NZ_PRDB01000030.1 Klebsiella michiganensis str... 0.010017
... ... ...
23667 JAGGDU010000001.1 Trichodesmium erythraeum GBR... 0.042601
23596 JAGGDU010000001.1 Trichodesmium erythraeum GBR... 0.042601
23525 JAGGDU010000001.1 Trichodesmium erythraeum GBR... 0.042601
24492 NZ_JPOS01000001.1 Phaeodactylibacter xiamenens... 0.044643
24451 NZ_JPOS01000001.1 Phaeodactylibacter xiamenens... 0.044843
So, there's a 4.5% overlap in k-mers with some Buchnera queries for JAGGDU010000001.1
and NZ_JPOS01000001.1
, which is probably not enough to call it a Buchnera genome… The estimated ANI is in the 80% range (column max_containment_ani
).
tl;dr no matches to Sulcia, only one match to Buchnera, in all the ATB genomes.
Download the CSV file here for more exploration: https://farm.cse.ucdavis.edu/~ctbrown/buchnera-sulcia.manysearch.zip
with 4 CPUs:
Command being timed: "sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv.2 -k 21 -c 4"
User time (seconds): 10967.28
System time (seconds): 98.50
Percent of CPU this job got: 385%
Elapsed (wall clock) time (h:mm:ss or m:ss): 47:51.17
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 20568304
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 506284
Minor (reclaiming a frame) page faults: 4419875
Voluntary context switches: 63584
Involuntary context switches: 120076
Swaps: 0
File system inputs: 129128808
File system outputs: 21096
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
For some background: sourmash relies on estimating overlaps using k-mers. From this one can estimate Jaccard, containment, and ANI between sets. We've found these estimates are pretty reliable for a wide range of parameters.
I tend to use sourmash in a way that avoids reaching conclusions too quickly, because biology data is so messy :). But I didn't properly caveat the above results!
Technically, what I show above is that there are substantial overlaps between Buchnera and Sulcia reference genomes from GTDB, and > 2000 SRA "isolate" shotgun sets.
My first concern was: did I screw this up?? So that's what most of the below addresses. I'll revisit the questions you asked after I make sure I'm not imagining things!
I've updated the matches CSV with more information, explored below.
Looking at the 4th best match, SRR27437571, it has:
So, whatever this sample is supposed to be, it contains approximately 256,000 21-mers that belong to the Buchnera genome in GTDB.
I would say this is either contamination in the Buchnera genome or Buchnera is present in this sample (or both).
Further analysis of SRR27437571 suggests that it is actually a metagenome, with a substantial portion of Buchnera present in it. I used sourmash gather to do this.
See the sourmash gather CSV for SRR27437571, sort by f_unique_weighted
, look at match name - has about 3x coverage (median_abund), and Buchnera is about 0.0034% of the sample. So a very small match.
Note that I did this analysis with k=51, so much more specific than k=21 (but less sensitive to evolutionary distance). We can do some alignments if we need to verify it but I'd bet good money that it's there
I took about 10 samples that matched to Buchnera or Sulcia, and did the metagenome breakdown as above.
understand why buchnera/sulcia not in some of them
Not sure. More goes here.