# Presentation plan ## Myself and Supervisor I am Andrey Chertkov, the 3rd year student. My supervisor is Nickolay Kudasov. My project is Succinct purely functional de Bruijn assembly graph representations. ## What the project is about ### Genomes sequencing and assembling - Today we cannot read all DNA at one time. But we can read the small chain of the DNA, and then combine them in one. - Read sequence is similar to substrings of the string. - The task is to find a way to combine substrings in one string - Original de Bruijn graph with Euler Path solves the problem of combining string such that they contain all substrings. ### De Bruijn Graph solution ![](https://i.imgur.com/VvXYZFx.png) - Space consumption is a pressing practical problem for DNA assembly with de Bruijn graph-based algorithms. ### The solution is Succinct Data Structures - In Computer Science, there are three ways to store data: - implicit if it takes $Z + O (1)$ bits of space, - succinct if it takes $Z + o (Z)$ bits of space, and - compact if it takes $O (Z)$ bits of space. - A succinct data structure is a data structure that uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. - Succinct data structures are often presented in bit array representation - Two main operations on bit array for succinct data structures - $rank(i)$ : returns the number of elements equal to $1$ up to position $i$ - $select(i)$ : returns the position of the $i$-th occurrence of $1$ ### Succinct de Bruijn Graph ![](https://i.imgur.com/dMW5Pis.png) - The main operation in de Bruijn Graph is to find the successor of the node. - For find successor we may use combination of the $rank(i)$ and $select(i)$ - $succ(node) = \{select(r) | r \in [rank(4*node), rank(4*node + 4)]\}$ ## preliminary plan (Project Proposal) - Feb 9 (Weeks 1–3): Research modern succinct data structures and de Bruijn graph - Feb 16 (Week 4): Implement naive purely functional de Bruijn graph and give complexity analysis - Mar 1 (Weeks 5–6): Research approaches to succinct purely functional data structures - Mar 15 (Weeks 8): Design succinct purely functional de Bruijn graph with theoretical complexity analysis - Apr 5 (Weeks 9-10): Implement succinct purely functional de Bruijn graph. - Apr 12 (Weeks 12): Add test suite and benchmarks for naive and succinct versions - Apr 19 (Week 13): Compare purely functional implementations with an original ephemeral data structure - Apr 26 (Week 14): Package and document implementation ## What have I done (Iteration 2) - Found base paper - Naive realization - Base tests ## demo(?) ## Changed plan - Mar 15 (Weeks 8): Research approaches to succinct purely functional data structures. Do refactoring of the code. - Mar 22 (Weeks 9): Design succinct purely functional de Bruijn graph with theoretical complexity analysis - Apr 5 (Weeks 10-11): Implement succinct purely functional de Bruijn graph. - Apr 12 (Weeks 12): Add test suite and benchmarks for naive and succinct versions - Apr 19 (Week 13): Compare purely functional implementations with an original ephemeral data structure - Apr 26 (Week 14): Package and document implementation ## The end - https://github.com/AndreyChertckov/Succinct-de-Bruijn-Graph - Thank you for your attention # Presentation Text ## Intro Hello everyone, My name is Andrey Chertkov and I would like to present you the current status of my project, **Succinct purely functional de Bruijn assembly graph representations**. My supervisers are Nickolay Kudasov and Manuel Mazzara. Let's start. ## Genome sequencing and assembling Lets start from the basics of genome sequencing and assembling. Today we cannot read all DNA sequence at once. But we can read small chains of the DNA one by one and then combine them. Read chains are substrings of the resulting string. The task of finding a way to combine substrings in one string, this is a common computer science problem. ## De Bruijn Graph This problem was solved by Nicolaas Govert de Bruijn. The solution is called de Bruijn graph. De Bruijn graph is a directed graph representing overlaps between sequences of symbols. ## Succinct data structures The problem with using original de Bruijn Graph is high memory consumtion. In Computer Science, there are three ways to store data: - implicit if it takes Z + O(1) bits of space - succinct if it takes Z + o(Z) bits of space - compact if it takes O(Z) bits of space Where Z is the information-theoretical optimal number of bits needed to store some data. Implicit way is not suitable because of high time complexity of operations, but succinct way works well for de Bruijn Graph. A succinct data structure is a data structure that uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. ## Succinct data structures - cont. Succinct data structures are implemented using bit arrays. Two main operations over succinct bit arrays are $select$ and $rank$. - $rank(i)$ : returns the number of elements equal to 1 up to position i - $select(i)$ : returns the position of the i-th occurrence of 1 ## Succinct de Bruijn Graph Succinct de Bruijn Graph is representend using bit array in which each cell signifies existence of the edge: if a cell contains 1, then the edge exists. The main operation in de Bruijn Graph is to find the successor of the node. To find the successor we may use combination of the $rank(i)$ and $select(i)$, such as the one shown on the slide. ## Purely functional succinct data structures The main difference between an arbitrary data structure and a purely functional one is that the latter is (strongly) immutable. This restriction ensures the data structure has the advantages of immutable objects: (full) persistency, quick copy of objects, and thread safety. Efficient purely functional data structures may require the use of lazy evaluation and memoization. ## Why purely functional de Bruijn Graph Purely functional data structures gives us persistance, that allows us to reuse of common subgraphs to save memory, access to different versions of the genome may allow analysis of many variations / mutations of the genome, an efficient fusion operation may allow efficient parallel construction ## Why Haskell Haskell is purely functional language, so it is convenient to build purely functional data structures using Haskell. ## What have I Done - Found base paper - Naive realization - Test suite - Materials for next improvements - Documentation ## Current plan Current plan you can see on this slide. In the near future I plan to create effective rank and select. Also I plan to research an approaches to create purely functional succinct data structures. ## The end Thank you for your attention