# spacegraphcats ho! lab meeting what we're doing, and some associated chopportunities. (Oct 19, 2020 lab meeting) github.com/spacegraphcats/spacegraphcats/ :warning: here be dragons... and maybe (ok, definitely!) some technical debt --- ![](https://i.imgur.com/FImHSsM.png) --- ![](https://i.imgur.com/dYt9Ak1.png) --- ## background: motivation for graph-based approaches in metagenomics --- ![](https://i.imgur.com/KDnDxM7.png) --- ![](https://i.imgur.com/raLyeIJ.png) --- ## MAPPING AND ASSEMBLY BAD Because communities are a diverse mix of strains, * cannot reliably use reference genomes from other samples; * cannot reliably use references from the same sample, w/o haplotype-level resolution; * cannot reliably co-assemble samples (will merge/collapse strains) This is also true of metatranscriptomes and ~RNAseq... --- reads -> k-mers/De Bruijn graph -> compact De Bruijn graph (cDBG) ![](https://i.imgur.com/dpie2B4.png) --- spacegraphcats indexes cDBGs and partitions them into pieces, based around "dominating nodes" that cover the cDBG at some radius r. ![](https://i.imgur.com/VtXnKza.png) --- "piecing the graph" -> every cDBG node (blue) belongs to a dom node (purple) ![](https://i.imgur.com/SCukfgH.png) --- This is easy for small graphs, but... ![](https://i.imgur.com/8F3Usvy.png) --- so, Taylor and I have been experimenting with "pulling" annotations (gene names; abundances; taxonomy) from cDBG nodes "up" to the pieces. ![](https://i.imgur.com/ljs4ESi.png) --- This means every purple dominating node (== piece) gets annotated with all of the features under it. ![](https://i.imgur.com/ljs4ESi.png) --- ## what does this do? why do we do it? --- This is a first step to interpreting k-mers/sequence NOT in reference genome, but graph-adjacent to it. ![](https://i.imgur.com/WD64XuY.png) --- we think we're mostly looking at minor variants (gene A) but :shrug: ![](https://i.imgur.com/w0jr8A2.png) --- ## benchmarking considerations... how do we evaluate the biological plausibility of this!? ...some tests are under way... cc taylor ;) some preliminary results from taxonomy -- --- ## taxonomy results for r=1 on mock community data: 88512 dom nodes have no sourmash hashes 903 dom nodes have exactly one sourmash hash 37 dom nodes have two or more sourmash hashes ``` rank of dom node lca count of dom nodes with that rank -------------------- --------------------------------- strain 35 species 2 ``` --- ## taxonomy results for r=5 on mock community data: 14149 dom nodes have no sourmash hashes 796 dom nodes have exactly one sourmash hash 86 dom nodes have two or more sourmash hashes ``` rank of dom node lca count of dom nodes with that rank -------------------- --------------------------------- strain 76 species 10 ``` --- side note: this also lets us measure abundances directly on the graph... :grin: ![](https://i.imgur.com/ljs4ESi.png) --- ## how do we do it, code-wise? (and where are there ...chopportunities!) --- ## expanding annotations from cDBG nodes to dom nodes ```python= for cdbg_node in cdbg_nodes: # v--- the tricky bit is building cdbg_annots! annots = cdbg_annots.get(cdbg_node, []) dom_id = catlas.cdbg_to_dom[cdbg_node] dom_annots[dom_id].update(annots) ``` --- ## building cdbg_annots often relies on k-mers ```python= cdbg_annots = defaultdict(set) for gene_name, gene_seq in screed.open(gene_sequences): kmers = khmer.sequence_to_kmers(gene_seq) # do the search - the tricky bit, in more detail cdbg_nodes = catlas.nodes_by_kmers(kmers) # save annotations for cdbg_node_id in cdbg_nodes: cdbg_annots[cdbg_node_id].add(gene_name) ``` --- ## nodes_by_kmers relies on trickiness to *scale* to build a fast mapping `{ k-mer: cdbg_node_id }` for billions of k-mers, we take advantage of a feature of cDBGs: each k-mer is present in no more than one cDBG node. we then use minimal perfect hash functions (MPHF) to map `{ kmer: offset }` and use `offset` to index into an linear array. --- this is how we use the minimal perfect hash functions (MPHF): ```python= mphf = mphf_build_table(all_kmers) lookups = {} for cdbg_node_id, sequence in cdbg_nodes.items(): for kmer in khmer.sequence_to_kmers(sequence): mphf_id = mphf.kmer_to_hash(kmer) lookups[mphf_id] = cdbg_node_id ``` * this could maybe be replaced by using the MQF (Mostafa and Tamer's work). * other thoughts? --- ## retrieving reads for cDBG contigs another important thing spacegraphcats does is retrieve the underlying *reads* that match to any given query. steps: * find IDs of cDBG contigs containing query k-mers * extract dom nodes for those cDBG contigs * extract cDBG IDs under those dom nodes * find read IDs that share k-mers with those cDBG IDs --- ## building a read index for the last step, we need a many-to-many mapping: `{ cdbg_id : read_id }`. this is many to many because each read may overlap with multiple cDBG nodes, and each cDBG node will usually contain many reads. what's a good way to do this for 100s of millions of reads, and 10s of millions of cDBG nodes?? --- ## current strategy * currently relies on a janky minimizer-like strategy using khmer tagging (pls no) * not actually 100% sure how it works, upon reading code 😱 * very slow, & definitely non-optimal in any way you care to name 😂 --- ## future strategies?? * cannot use MPHF because it only enables a one-to-many mapping. * can we use better minimizers, maybe? * one observation: all multi-cDBG-node reads must overlap with k-mers at an end of the cDBG node... can we use that? cc camille for insights. --- ## thanks! fini. ![](https://i.imgur.com/0DMy0rR.png)
{"metaMigratedAt":"2023-06-15T14:26:59.461Z","metaMigratedFrom":"Content","title":"spacegraphcats ho! lab meeting","breaks":true,"contributors":"[{\"id\":\"fbac64b8-20e4-4eb4-85a6-d4048a601d72\",\"add\":6207,\"del\":413}]"}
    384 views