AllTheBacteria Buchnera and Sulcia search w/sourmash

ATB download: https://github.com/sourmash-bio/sourmash/issues/3247

GTDB files: https://github.com/sourmash-bio/sourmash/issues/3183#issuecomment-2362160753

Build a picklist for the GTDB genomes we're interested in (Sulcia and Buchnera):

head -1 gtdb-rs220.lineages.csv > sublineages.csv
egrep -i 'buchnera|sulcia' gtdb-rs220.lineages.csv   >> sublineages.csv

Extract just those genomes from GTDB RS220:

sourmash sig cat --picklist sublineages.csv:ident:ident gtdb-rs220-k21.zip -o sublineages.sig.zip

Search ATB v0.2 for overlaps:

sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv -k 21

(Takes about 10-15 minutes )

Parse results:

>>> df = pd.read_csv('sublineages.x.atb.csv')
>>> df2 = df[df['max_containment_ani'] >= 0.9]
>>> df2[['match_name', 'max_containment_ani']]
                                              match_name  max_containment_ani
23861  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.907215
23879  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.921266
23901  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.910469
23906  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.997631
23909  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             1.000000
23911  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.999172

Eliminate all ATB genomes already labeled as Buchnera:

>>> df3 = df[~df['match_name'].str.contains('Buchnera')]
>>> len(df3)
26764
>>> df3['max_containment'].max()
np.float64(0.0448430493273542)
>>> df3.sort_values(by='max_containment')[['match_name', 'max_containment']]
                                              match_name  max_containment
21457  NZ_KB944588.1 Enterococcus faecalis ATCC 19433...         0.010017
96     NZ_BNJW01000001.1 Enterococcus lactis strain C...         0.010017
26119  NZ_LT700212.1 Actinomyces provencensis strain ...         0.010017
25967  NZ_MVFY01000010.1 Citrobacter portucalensis st...         0.010017
25963  NZ_PRDB01000030.1 Klebsiella michiganensis str...         0.010017
...                                                  ...              ...
23667  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23596  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23525  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
24492  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044643
24451  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044843

So, there's a 4.5% overlap in k-mers with some Buchnera queries for JAGGDU010000001.1 and NZ_JPOS01000001.1, which is probably not enough to call it a Buchnera genome… The estimated ANI is in the 80% range (column max_containment_ani).

tl;dr no matches to Sulcia, only one match to Buchnera, in all the ATB genomes.

Download the CSV file here for more exploration: https://farm.cse.ucdavis.edu/~ctbrown/buchnera-sulcia.manysearch.zip

benchmark info

with 4 CPUs:

        Command being timed: "sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv.2 -k 21 -c 4"
        User time (seconds): 10967.28
        System time (seconds): 98.50
        Percent of CPU this job got: 385%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 47:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20568304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 506284
        Minor (reclaiming a frame) page faults: 4419875
        Voluntary context switches: 63584
        Involuntary context switches: 120076
        Swaps: 0
        File system inputs: 129128808
        File system outputs: 21096
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Followup

sourmash calculates overlaps; ANI calculation; and reports

For some background: sourmash relies on estimating overlaps using k-mers. From this one can estimate Jaccard, containment, and ANI between sets. We've found these estimates are pretty reliable for a wide range of parameters.

I tend to use sourmash in a way that avoids reaching conclusions too quickly, because biology data is so messy :). But I didn't properly caveat the above results!

Technically, what I show above is that there are substantial overlaps between Buchnera and Sulcia reference genomes from GTDB, and > 2000 SRA "isolate" shotgun sets.

My first concern was: did I screw this up?? So that's what most of the below addresses. I'll revisit the questions you asked after I make sure I'm not imagining things!

I've updated the matches CSV with more information, explored below.

Looking at the 4th best match, SRR27437571, it has:

containment of 38%
intersect_hashes of 256 => 256 KB of predicted overlap
a query containment ANI of > 95%, based on k-mer matches within the query

So, whatever this sample is supposed to be, it contains approximately 256,000 21-mers that belong to the Buchnera genome in GTDB.

I would say this is either contamination in the Buchnera genome or Buchnera is present in this sample (or both).

A more detailed analysis of the SRR27437571 sample (Buchnera 3rd match)

Further analysis of SRR27437571 suggests that it is actually a metagenome, with a substantial portion of Buchnera present in it. I used sourmash gather to do this.

See the sourmash gather CSV for SRR27437571, sort by f_unique_weighted, look at match name - has about 3x coverage (median_abund), and Buchnera is about 0.0034% of the sample. So a very small match.

Note that I did this analysis with k=51, so much more specific than k=21 (but less sensitive to evolutionary distance). We can do some alignments if we need to verify it but I'd bet good money that it's there

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

A larger analysis

I took about 10 samples that matched to Buchnera or Sulcia, and did the metagenome breakdown as above.

understand why buchnera/sulcia not in some of them

Where does this leave us?

Not sure. More goes here.

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`	在筆記中貼入程式碼
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.