# spacegraphcats ho! lab meeting
what we're doing, and some associated chopportunities.
(Oct 19, 2020 lab meeting)
github.com/spacegraphcats/spacegraphcats/
:warning: here be dragons... and maybe (ok, definitely!) some technical debt
---
![](https://i.imgur.com/FImHSsM.png)
---
![](https://i.imgur.com/dYt9Ak1.png)
---
## background: motivation for graph-based approaches in metagenomics
---
![](https://i.imgur.com/KDnDxM7.png)
---
![](https://i.imgur.com/raLyeIJ.png)
---
## MAPPING AND ASSEMBLY BAD
Because communities are a diverse mix of strains,
* cannot reliably use reference genomes from other samples;
* cannot reliably use references from the same sample, w/o haplotype-level resolution;
* cannot reliably co-assemble samples (will merge/collapse strains)
This is also true of metatranscriptomes and ~RNAseq...
---
reads -> k-mers/De Bruijn graph -> compact De Bruijn graph (cDBG)
![](https://i.imgur.com/dpie2B4.png)
---
spacegraphcats indexes cDBGs and partitions them into pieces, based around "dominating nodes" that cover the cDBG at some radius r.
![](https://i.imgur.com/VtXnKza.png)
---
"piecing the graph" -> every cDBG node (blue) belongs to a dom node (purple)
![](https://i.imgur.com/SCukfgH.png)
---
This is easy for small graphs, but...
![](https://i.imgur.com/8F3Usvy.png)
---
so, Taylor and I have been experimenting with "pulling" annotations (gene names; abundances; taxonomy) from cDBG nodes "up" to the pieces.
![](https://i.imgur.com/ljs4ESi.png)
---
This means every purple dominating node (== piece) gets annotated with all of the features under it.
![](https://i.imgur.com/ljs4ESi.png)
---
## what does this do? why do we do it?
---
This is a first step to interpreting k-mers/sequence NOT in reference genome, but graph-adjacent to it.
![](https://i.imgur.com/WD64XuY.png)
---
we think we're mostly looking at minor variants (gene A) but :shrug:
![](https://i.imgur.com/w0jr8A2.png)
---
## benchmarking considerations...
how do we evaluate the biological plausibility of this!?
...some tests are under way... cc taylor ;)
some preliminary results from taxonomy --
---
## taxonomy results for r=1 on mock community data:
88512 dom nodes have no sourmash hashes
903 dom nodes have exactly one sourmash hash
37 dom nodes have two or more sourmash hashes
```
rank of dom node lca count of dom nodes with that rank
-------------------- ---------------------------------
strain 35
species 2
```
---
## taxonomy results for r=5 on mock community data:
14149 dom nodes have no sourmash hashes
796 dom nodes have exactly one sourmash hash
86 dom nodes have two or more sourmash hashes
```
rank of dom node lca count of dom nodes with that rank
-------------------- ---------------------------------
strain 76
species 10
```
---
side note: this also lets us measure abundances directly on the graph... :grin:
![](https://i.imgur.com/ljs4ESi.png)
---
## how do we do it, code-wise?
(and where are there ...chopportunities!)
---
## expanding annotations from cDBG nodes to dom nodes
```python=
for cdbg_node in cdbg_nodes:
# v--- the tricky bit is building cdbg_annots!
annots = cdbg_annots.get(cdbg_node, [])
dom_id = catlas.cdbg_to_dom[cdbg_node]
dom_annots[dom_id].update(annots)
```
---
## building cdbg_annots often relies on k-mers
```python=
cdbg_annots = defaultdict(set)
for gene_name, gene_seq in screed.open(gene_sequences):
kmers = khmer.sequence_to_kmers(gene_seq)
# do the search - the tricky bit, in more detail
cdbg_nodes = catlas.nodes_by_kmers(kmers)
# save annotations
for cdbg_node_id in cdbg_nodes:
cdbg_annots[cdbg_node_id].add(gene_name)
```
---
## nodes_by_kmers relies on trickiness to *scale*
to build a fast mapping `{ k-mer: cdbg_node_id }` for billions of k-mers, we take advantage of a feature of cDBGs: each k-mer is present in no more than one cDBG node.
we then use minimal perfect hash functions (MPHF) to map `{ kmer: offset }` and use `offset` to index into an linear array.
---
this is how we use the minimal perfect hash functions (MPHF):
```python=
mphf = mphf_build_table(all_kmers)
lookups = {}
for cdbg_node_id, sequence in cdbg_nodes.items():
for kmer in khmer.sequence_to_kmers(sequence):
mphf_id = mphf.kmer_to_hash(kmer)
lookups[mphf_id] = cdbg_node_id
```
* this could maybe be replaced by using the MQF (Mostafa and Tamer's work).
* other thoughts?
---
## retrieving reads for cDBG contigs
another important thing spacegraphcats does is retrieve the underlying *reads* that match to any given query. steps:
* find IDs of cDBG contigs containing query k-mers
* extract dom nodes for those cDBG contigs
* extract cDBG IDs under those dom nodes
* find read IDs that share k-mers with those cDBG IDs
---
## building a read index
for the last step, we need a many-to-many mapping: `{ cdbg_id : read_id }`.
this is many to many because each read may overlap with multiple cDBG nodes, and each cDBG node will usually contain many reads.
what's a good way to do this for 100s of millions of reads, and 10s of millions of cDBG nodes??
---
## current strategy
* currently relies on a janky minimizer-like strategy using khmer tagging (pls no)
* not actually 100% sure how it works, upon reading code 😱
* very slow, & definitely non-optimal in any way you care to name 😂
---
## future strategies??
* cannot use MPHF because it only enables a one-to-many mapping.
* can we use better minimizers, maybe?
* one observation: all multi-cDBG-node reads must overlap with k-mers at an end of the cDBG node... can we use that? cc camille for insights.
---
## thanks!
fini.
![](https://i.imgur.com/0DMy0rR.png)
{"metaMigratedAt":"2023-06-15T14:26:59.461Z","metaMigratedFrom":"Content","title":"spacegraphcats ho! lab meeting","breaks":true,"contributors":"[{\"id\":\"fbac64b8-20e4-4eb4-85a6-d4048a601d72\",\"add\":6207,\"del\":413}]"}