AllTheBacteria Buchnera and Sulcia search w/sourmash

ATB download: https://github.com/sourmash-bio/sourmash/issues/3247

GTDB files: https://github.com/sourmash-bio/sourmash/issues/3183#issuecomment-2362160753

Build a picklist for the GTDB genomes we're interested in (Sulcia and Buchnera):

head -1 gtdb-rs220.lineages.csv > sublineages.csv
egrep -i 'buchnera|sulcia' gtdb-rs220.lineages.csv   >> sublineages.csv

Extract just those genomes from GTDB RS220:

sourmash sig cat --picklist sublineages.csv:ident:ident gtdb-rs220-k21.zip -o sublineages.sig.zip

Search ATB v0.2 for overlaps:

sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv -k 21

(Takes about 10-15 minutes )

Parse results:

>>> df = pd.read_csv('sublineages.x.atb.csv')
>>> df2 = df[df['max_containment_ani'] >= 0.9]
>>> df2[['match_name', 'max_containment_ani']]
                                              match_name  max_containment_ani
23861  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.907215
23879  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.921266
23901  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.910469
23906  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.997631
23909  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             1.000000
23911  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.999172

Eliminate all ATB genomes already labeled as Buchnera:

>>> df3 = df[~df['match_name'].str.contains('Buchnera')]
>>> len(df3)
26764
>>> df3['max_containment'].max()
np.float64(0.0448430493273542)
>>> df3.sort_values(by='max_containment')[['match_name', 'max_containment']]
                                              match_name  max_containment
21457  NZ_KB944588.1 Enterococcus faecalis ATCC 19433...         0.010017
96     NZ_BNJW01000001.1 Enterococcus lactis strain C...         0.010017
26119  NZ_LT700212.1 Actinomyces provencensis strain ...         0.010017
25967  NZ_MVFY01000010.1 Citrobacter portucalensis st...         0.010017
25963  NZ_PRDB01000030.1 Klebsiella michiganensis str...         0.010017
...                                                  ...              ...
23667  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23596  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23525  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
24492  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044643
24451  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044843

So, there's a 4.5% overlap in k-mers with some Buchnera queries for JAGGDU010000001.1 and NZ_JPOS01000001.1, which is probably not enough to call it a Buchnera genome The estimated ANI is in the 80% range (column max_containment_ani).

tl;dr no matches to Sulcia, only one match to Buchnera, in all the ATB genomes.

Download the CSV file here for more exploration: https://farm.cse.ucdavis.edu/~ctbrown/buchnera-sulcia.manysearch.zip

benchmark info

with 4 CPUs:

        Command being timed: "sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv.2 -k 21 -c 4"
        User time (seconds): 10967.28
        System time (seconds): 98.50
        Percent of CPU this job got: 385%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 47:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20568304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 506284
        Minor (reclaiming a frame) page faults: 4419875
        Voluntary context switches: 63584
        Involuntary context switches: 120076
        Swaps: 0
        File system inputs: 129128808
        File system outputs: 21096
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Followup

sourmash calculates overlaps; ANI calculation; and reports

For some background: sourmash relies on estimating overlaps using k-mers. From this one can estimate Jaccard, containment, and ANI between sets. We've found these estimates are pretty reliable for a wide range of parameters.

I tend to use sourmash in a way that avoids reaching conclusions too quickly, because biology data is so messy :). But I didn't properly caveat the above results!

Technically, what I show above is that there are substantial overlaps between Buchnera and Sulcia reference genomes from GTDB, and > 2000 SRA "isolate" shotgun sets.

My first concern was: did I screw this up?? So that's what most of the below addresses. I'll revisit the questions you asked after I make sure I'm not imagining things!

I've updated the matches CSV with more information, explored below.

Looking at the 4th best match, SRR27437571, it has:

  • containment of 38%
  • intersect_hashes of 256 => 256 KB of predicted overlap
  • a query containment ANI of > 95%, based on k-mer matches within the query

So, whatever this sample is supposed to be, it contains approximately 256,000 21-mers that belong to the Buchnera genome in GTDB.

I would say this is either contamination in the Buchnera genome or Buchnera is present in this sample (or both).

A more detailed analysis of the SRR27437571 sample (Buchnera 3rd match)

Further analysis of SRR27437571 suggests that it is actually a metagenome, with a substantial portion of Buchnera present in it. I used sourmash gather to do this.

See the sourmash gather CSV for SRR27437571, sort by f_unique_weighted, look at match name - has about 3x coverage (median_abund), and Buchnera is about 0.0034% of the sample. So a very small match.

Note that I did this analysis with k=51, so much more specific than k=21 (but less sensitive to evolutionary distance). We can do some alignments if we need to verify it but I'd bet good money that it's there

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

A larger analysis

I took about 10 samples that matched to Buchnera or Sulcia, and did the metagenome breakdown as above.

understand why buchnera/sulcia not in some of them

Where does this leave us?

Not sure. More goes here.

Select a repo