what we're doing, and some associated chopportunities.
(Oct 19, 2020 lab meeting)
github.com/spacegraphcats/spacegraphcats/
Because communities are a diverse mix of strains,
This is also true of metatranscriptomes and ~RNAseq…
reads -> k-mers/De Bruijn graph -> compact De Bruijn graph (cDBG)
spacegraphcats indexes cDBGs and partitions them into pieces, based around "dominating nodes" that cover the cDBG at some radius r.
"piecing the graph" -> every cDBG node (blue) belongs to a dom node (purple)
This is easy for small graphs, but…
so, Taylor and I have been experimenting with "pulling" annotations (gene names; abundances; taxonomy) from cDBG nodes "up" to the pieces.
This means every purple dominating node (== piece) gets annotated with all of the features under it.
This is a first step to interpreting k-mers/sequence NOT in reference genome, but graph-adjacent to it.
we think we're mostly looking at minor variants (gene A) but
how do we evaluate the biological plausibility of this!?
…some tests are under way… cc taylor ;)
some preliminary results from taxonomy –
88512 dom nodes have no sourmash hashes
903 dom nodes have exactly one sourmash hash
37 dom nodes have two or more sourmash hashes
rank of dom node lca count of dom nodes with that rank
-------------------- ---------------------------------
strain 35
species 2
14149 dom nodes have no sourmash hashes
796 dom nodes have exactly one sourmash hash
86 dom nodes have two or more sourmash hashes
rank of dom node lca count of dom nodes with that rank
-------------------- ---------------------------------
strain 76
species 10
side note: this also lets us measure abundances directly on the graph…
(and where are there …chopportunities!)
for cdbg_node in cdbg_nodes: # v--- the tricky bit is building cdbg_annots! annots = cdbg_annots.get(cdbg_node, []) dom_id = catlas.cdbg_to_dom[cdbg_node] dom_annots[dom_id].update(annots)
cdbg_annots = defaultdict(set) for gene_name, gene_seq in screed.open(gene_sequences): kmers = khmer.sequence_to_kmers(gene_seq) # do the search - the tricky bit, in more detail cdbg_nodes = catlas.nodes_by_kmers(kmers) # save annotations for cdbg_node_id in cdbg_nodes: cdbg_annots[cdbg_node_id].add(gene_name)
to build a fast mapping { k-mer: cdbg_node_id }
for billions of k-mers, we take advantage of a feature of cDBGs: each k-mer is present in no more than one cDBG node.
we then use minimal perfect hash functions (MPHF) to map { kmer: offset }
and use offset
to index into an linear array.
this is how we use the minimal perfect hash functions (MPHF):
mphf = mphf_build_table(all_kmers) lookups = {} for cdbg_node_id, sequence in cdbg_nodes.items(): for kmer in khmer.sequence_to_kmers(sequence): mphf_id = mphf.kmer_to_hash(kmer) lookups[mphf_id] = cdbg_node_id
another important thing spacegraphcats does is retrieve the underlying reads that match to any given query. steps:
for the last step, we need a many-to-many mapping: { cdbg_id : read_id }
.
this is many to many because each read may overlap with multiple cDBG nodes, and each cDBG node will usually contain many reads.
what's a good way to do this for 100s of millions of reads, and 10s of millions of cDBG nodes??
fini.