Anton Nekrutenko
    • Create new note
    • Create a note from template
      • Sharing URL Link copied
      • /edit
      • View mode
        • Edit mode
        • View mode
        • Book mode
        • Slide mode
        Edit mode View mode Book mode Slide mode
      • Customize slides
      • Note Permission
      • Read
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Write
        • Only me
        • Signed-in users
        • Everyone
        Only me Signed-in users Everyone
      • Engagement control Commenting, Suggest edit, Emoji Reply
    • Invite by email
      Invitee

      This note has no invitees

    • Publish Note

      Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

      Your note will be visible on your profile and discoverable by anyone.
      Your note is now live.
      This note is visible on your profile and discoverable online.
      Everyone on the web can find and read all notes of this public team.
      See published notes
      Unpublish note
      Please check the box to agree to the Community Guidelines.
      View profile
    • Commenting
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
      • Everyone
    • Suggest edit
      Permission
      Disabled Forbidden Owners Signed-in users Everyone
    • Enable
    • Permission
      • Forbidden
      • Owners
      • Signed-in users
    • Emoji Reply
    • Enable
    • Versions and GitHub Sync
    • Note settings
    • Note Insights
    • Engagement control
    • Transfer ownership
    • Delete this note
    • Save as template
    • Insert from template
    • Import from
      • Dropbox
      • Google Drive
      • Gist
      • Clipboard
    • Export to
      • Dropbox
      • Google Drive
      • Gist
    • Download
      • Markdown
      • HTML
      • Raw HTML
Menu Note settings Versions and GitHub Sync Note Insights Sharing URL Create Help
Create Create new note Create a note from template
Menu
Options
Engagement control Transfer ownership Delete this note
Import from
Dropbox Google Drive Gist Clipboard
Export to
Dropbox Google Drive Gist
Download
Markdown HTML Raw HTML
Back
Sharing URL Link copied
/edit
View mode
  • Edit mode
  • View mode
  • Book mode
  • Slide mode
Edit mode View mode Book mode Slide mode
Customize slides
Note Permission
Read
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Write
Only me
  • Only me
  • Signed-in users
  • Everyone
Only me Signed-in users Everyone
Engagement control Commenting, Suggest edit, Emoji Reply
  • Invite by email
    Invitee

    This note has no invitees

  • Publish Note

    Share your work with the world Congratulations! 🎉 Your note is out in the world Publish Note

    Your note will be visible on your profile and discoverable by anyone.
    Your note is now live.
    This note is visible on your profile and discoverable online.
    Everyone on the web can find and read all notes of this public team.
    See published notes
    Unpublish note
    Please check the box to agree to the Community Guidelines.
    View profile
    Engagement control
    Commenting
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    • Everyone
    Suggest edit
    Permission
    Disabled Forbidden Owners Signed-in users Everyone
    Enable
    Permission
    • Forbidden
    • Owners
    • Signed-in users
    Emoji Reply
    Enable
    Import from Dropbox Google Drive Gist Clipboard
       owned this note    owned this note      
    Published Linked with GitHub
    Subscribed
    • Any changes
      Be notified of any changes
    • Mention me
      Be notified of mention me
    • Unsubscribe
    Subscribe
    --- tags: BMMB554-23 --- ![](https://i.imgur.com/phZHgSu.png) # Lecture 21: Assembly I --- *k*-mers and graphs :::info Somehow I missed [**this**](https://www.nature.com/articles/d41586-023-00512-4) article. Take a look. It describes new players on the sequencing market. ::: The following is based on three primary sources: - [Ben Langmead's Teaching Materials](http://www.langmead-lab.org/teaching-materials/) - [Pevzner and Compeau Bioinformatics Book](http://www.amazon.com/Bioinformatics-Algorithms-Active-Learning-Approach/dp/0990374602) (referred in text as "CP") - [Mike Schatz's Teaching Materials](http://schatz-lab.org/teaching/) ## Key concepts ### Long and short reads are assembled using different approaches --- ![](https://i.imgur.com/ZOqtnk8.png) Image from [Mike Schatz](http://schatz-lab.org/teaching/) --- Overlap graphs and de Bruijn graphs are two commonly used approaches for genome assembly from high-throughput sequencing data. Overlap graphs represent the sequences in a genome as nodes in a graph, where edges connect the nodes that overlap by a certain length. The length of the overlap depends on the sequencing technology used and the quality of the reads. The graph is then traversed to assemble the genome. Overlap graphs are useful for assembling small genomes with high coverage, but they become more difficult to use for larger and more complex genomes due to the computational complexity of the graph traversal. De Bruijn graphs represent the sequences in a genome as a set of $k$-mers, which are all possible substrings of length $k$. Each $k$-mer is represented as a node in the graph, and edges connect $k$-mers that overlap by $k-1$ bases. The graph is then traversed to assemble the genome. De Bruijn graphs are particularly useful for assembling large and complex genomes with low coverage, as they are less computationally demanding than overlap graphs and can handle errors in the sequencing data. In summary, the main difference between overlap graphs and de Bruijn graphs is the way they represent the sequences in a genome. Overlap graphs use nodes to represent sequences that overlap, while de Bruijn graphs use nodes to represent k-mers. The choice of graph depends on the size and complexity of the genome, the coverage and quality of the sequencing data, and the computational resources available. ### Assembly quality depends on coverage, read length, and base quality ---- ![](https://i.imgur.com/vehfXFU.png) Image from [Mike Schatz](http://schatz-lab.org/teaching/) ---- ### The concept of coverage *Coverage* is the number of reads covering a particular position in the genome. For example, in the following case: ``` TAA AAT ATG <- "reads" (15 bases total) TGT GTT ------- TAATGTT <- "genome" (7 bases) ------- 0123456 ``` The *Coverage* at positions 1 and 6 is *1*, at positions 1 and 5 is *2*, and at position 2, 3, and 4 is *3*. <br>The *Average Coverage* will be $\frac{15}{7}\approx2\times$ Below is another, slightly more realistic example where average coverage is $\frac{177}{35}\approx7\times$ ``` CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA <- 177 bases TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT ----------------------------------- GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT <- 35 bases ----------------------------------- | | | | | 0 10 20 30 34 ``` Let's try to simulate coverage using a simple [python script](https://colab.research.google.com/drive/1rHJedtUIXzURyijGnbacTl4pWSGT6RoJ?usp=sharing). ## $k$-mer composition Genomes are strings of text. When we sequence genomes we use sequencing machines that generate reads. For now let's assume that all reads have the same length $k$ and every $k$-mer is sequenced only once. We will relax these assumptions later in this lecture. Thus sequencing a genome generates a large list of $k$-mers. Suppose we are dealing with a *very* short genome `TATGGGGTGC`. Its $k$-mer composition (note the subscript) $Composition_k(Text)$ is the collection of all $k$-mer substrings (including repeated ones). When $k$ = 3 we get (basically we split sequence into windows of length 3 sliding window by 1 base every time): $$ Composition_3(\texttt{TATGGGGTGC}) = \texttt{ATG, GGG, GGG, GGT, GTG, TAT, TGC, TGG} $$ Note that we have listed *k*-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in $\texttt{TATGGGGTGC}$. We have done this because the correct ordering of the reads is unknown when they are generated (i.e., a sequencing machine does not generate reads in any particular order). ----- ![](https://i.imgur.com/XgAAqpb.png) ![](https://i.imgur.com/yPmGs3D.png) ![](https://i.imgur.com/LqWdU6Q.png) ![](https://i.imgur.com/lEks8zI.png) Images from [Mike Schatz](http://schatz-lab.org/teaching/) ----- ## Assembly by overlap In real life we don't know that and our goal is to determine genome sequence given a scrambled collection of $k$-mers. Let's consider the following collection of 3-mers representing a hypothetical genome: $$ \texttt{AAT ATG GTT TAA TGT} $$ Let's "tile" $k$-mers if they overlap in $k-1$ nucleotides: ``` TAA AAT ATG TGT GTT ------- TAATGTT ``` Now let's apply it to slightly longer "genome" with the following 3-mer composition sorted in a lexicographic order: $$ \texttt{AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT} $$ `TAA` looks like a great beginning and we are continuing: ``` 1 TAA 2 AAT 3 ATG 4 TGT 5 GTT ------- TAATGTT ``` There is nothing in the original 3-mer composition, which starts with `TT`. Let's track back and instead of `TGT` in step 4 insert `TGC`: ``` 1 TAA 2 AAT 3 ATG 4 TGC 5 GCC 6 CCA 7 CAT 8 ATG 9 TGG 10 GGA 11 GAT 12 ATG 13 TGT 14 GTT ---------------- TAATGCCATGGATGTT ``` We only used 14 3-mers from the total of 15, so our genome is shorter (we have extra parts!). This difficulty is related to the fact that there are three repeated `ATG` motifs in this genome and as a result each `ATG` can be extended by either `TGG`, `TGC`, or `TGT`. # The First and the Second laws of assembly The goal of assembly process is to reconstruct an unknown genome sequence given a collection of scrambled sequencing reads: ----- ``` CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA <- Reads (Given) TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT ----------------------------------- ??????????????????????????????????? <- Genome (Unknown) ``` <small>**The goal of assembly process**. Given sequencing reads reconstruct underlying genome sequence. </small> ----- We've seen that this can (in principle) be accomplished by finding overlaps. We also discussed the concept of the coverage. We can now formulate the two first assembly laws. ## The First Assembly Law: Overlaps imply co-location Let's define terms **Prefix** and **Suffix** using string $\texttt{TAA}$ as an example: * $Prefix(\texttt{TAA}) = \texttt{TA}$ * $Suffix(\texttt{TAA}) = \texttt{AA}$ The First law states that if a *suffix* of one read is similar to a *prefix* of another read ... ``` TCTATATCTCGGCTCTAGG <- read 1 ||||||| ||||||| TATCTCGACTCTAGGCC <- read 2 ``` ... then they may overlap (may be derived from the same location) within the genome. ``` TCTATATCTCGGCTCTAGG <- read 1 ------------------------------------- AGCGTTCTATATCTCGGCTCTAGGCCGTGCAGGACGT <- genome ------------------------------------- TATCTCGACTCTAGGCC <- read 2 ``` Note that in the above example suffix of the first read is *not* exactly identical to the prefix of the second read: they differ by a G-to-A substitution. Such differences are quite common in real life and may be caused by: * **sequencing errors** - experimental or computational artifacts of DNA sequencing procedures. * **allelic differences** - organisms such as human are diploid (and others, such as wheat are hexaploid) which maternal and paternal genomes being different at a number of genomic sites. * **polymorphic sites** - DNA that is being sequenced is usually isolated from a large number of cells (e.g., white blood cells) or individuals (bacterial and viral cultures). Natural variation present in these cell (or viral particle) populations will manifest itself as these differences. ## The Second Assembly Law: The higher the coverage, the better The Second law states that higher coverage leads to more frequent and longer overlaps: ``` CTAGGCCCTCAATTTTT TATCTCGACTCTAGGCCCTCA <- Low coverage GGCGTCTATATCT ----------------------------------- GGCGTCTATATCTCGGCTCTAGGCCCTCATTTTTT <- Genome ----------------------------------- CTAGGCCCTCAATTTTT CTCTAGGCCCTCAATTTTT GGCTCTAGGCCCTCATTTTTT CTCGGCTCTAGCCCCTCATTTT TATCTCGACTCTAGGCCCTCA <- Higher coverage TATCTCGACTCTAGGCC TCTATATCTCGGCTCTAGG GGCGTCTATATCTCG GGCGTCGATATCT GGCGTCTATATCT ``` # Solving assembly problem with graphs We can solve assembly challenge using overlaps between sequencing reads. However, to solve this problem effectively we need to first represent all overlaps in a way that would facilitate further analysis. *Directed graphs* help achieving this. ## Directed graphs Finding overlaps is identical to building a *directed graph* where directed *edges* connect *nodes* representing overlapping reads: ------ ![](https://i.imgur.com/TeoO58f.png) **Directed graph** representing overlapping reads. (Image from [Ben Langmead](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf)) ------ For example, the string reconstruction we have seen earlier (with the difference of inserting `GGG` in line 10): ``` 1 TAA 2 AAT 3 ATG 4 TGC 5 GCC 6 CCA 7 CAT 8 ATG 9 TGG 10 GGG 11 GGA 12 GAT 13 ATG 14 TGT 15 GTT ----------------- TAATGCCATGGGATGTT ``` can be represented as a following directed graph (or genome path): ----- ![](https://i.imgur.com/yLV2ywo.png) **Genome path**. Trimers composing the $\texttt{TAATGCCATGGGATGTT}$ sequence represented as the "genome" path. (Fig. 4.6 from CP). In this path a suffix of a 3-mer is equal to prefix of the next 3-mer. ----- **However**, we do not know the actual genome! All we have in real life is a collection of reads. Let's first build an overlap graph by connecting two 3-mers if suffix of one is equal to the prefix of the other: ------ ![](https://i.imgur.com/QMDqdT3.png) **Overlap graph**. All possible overlap connections for our 3-mer collection. (Adopted from Fig. 4.7 from CP)</small> ------ So to determine the sequence of the underlying genome we are looking a path in this graph that visits every node (3-mer) once. Such path is called [Hamiltonial path](https://en.wikipedia.org/wiki/Hamiltonian_path) and it may not be unique. For example for our 3-mer collection there two possible Hamiltonian paths: ------ <small>Path 1</small>: ![](https://i.imgur.com/XX6iLya.png) <small>Path 2</small>: ![](https://i.imgur.com/qKjciNl.png) **Two Hamiltonian paths for the 15 3-mers**. Edges spelling "genomes" $\texttt{TAATGCCATGGGATGTT}$ and $\texttt{TAATGGGATGCCATGTT}$ are highlighted in black. (Adopted Fig. 4.9. from CP) ------ The reason for this "duality" is the fact that we have a *repeat*: 3-mer $\texttt{ATG}$ is present three times in our data (<font color="green">green</font>, <font color="red">red</font>, and <font color="blue">blue</font>). As we will see later repeats cause a lot of trouble in genome assembly. ## Finding overlaps In the example above we had a collection of 3-mers and were always looking for overlaps of length two. In real life things may not be so "regular". Suppose we have two reads: ``` Read X CTCTAGGCC Read Y TAGGCCCTC ``` what is the overlap between these two reads? For now we will define overlap of $length - l$ suffix of Read X matches $length - l$ prefix of Read Y, where $l$ is given. To find these overlap we look in Read Y for instances $length - l$ suffix of Read X. We will start with some minimal match of length $k$. Once a match is found it will be extended to the left to verify that the entire prefix of Read Y matches: ----- ![](https://i.imgur.com/SfMJyou.png) **Finding overlaps** between Read X and Read Y. As a result we represent two two reads are connected nodes: ![](https://i.imgur.com/TLIlJY7.png) Number above the edge shows the length of the overlap. (Adopted from [Ben Langmead](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf)). ------ While with just two reads the problem may seen quite straightforward. Let now consider a set of reads representing a very short genome $\texttt{GTACGTACGAT}$: ``` GTACGT TACGTA CGTACG ACGTAC GTACGA TACGAT ``` Building an overlap graph with overlap of $length \geq 4$ will give us the following: ------ ![](https://i.imgur.com/1CYAJZU.png) You can see that there is a path through this graph that would spell out the original genome sequence $\texttt{GTACGTACGAT}$: ![](https://i.imgur.com/EqCAOOc.png) Here we are lucky enough to have all nodes having a single outgoing edge with the highest number (the length of overlap). ------ ## The Shortest Common Superstring (SCS) Problem The problem of reconstructing genome using the overlap graph that we have just illustrated can be initially formulated as the *Shortest Common Superstring (SCS)* problem. It states: *given a collection of strings S, find SCS(S), which is the shortest string that contains all strings from the set S as substrings*. For simplicity let's suppose that we have the following set of strings $S$: $$ S: \texttt{BAA AAB BBA ABA ABB BBB AAA BAB} $$ one way of getting a string that would contain all of these as substrings will simply be concatenating them: $$ Concat(S): \texttt{BAAAABBBAABAABBBBBAAABAB}\ (length = 24) $$ this, however, is not the *shortest* superstring that contains all strings from $S$. Instead the SCS is (just trust us here): $$ SCS(S): \texttt{AAABBBABAA}\ (length = 10) $$ It looks like finding SCS for a set of sequencing reads may just be what we need to produce a genome assembly. But how can this work in practice? One potential idea is to order the strings in some way and "reduce" them into a superstring (following examples are from Ben Langmead): ------- ![](https://i.imgur.com/mbxTZkP.png) Let's look at the first two strings. They can be "reduced" to `AAAB`: ![](https://i.imgur.com/MQpMhlo.png) The next two add an `A`: ![](https://i.imgur.com/uWgPadO.png) Third and fourth add `BB`: ![](https://i.imgur.com/VfYhAm9.png) Continuing this we will eventually get `AAABABBAABABBABB`: ![](https://i.imgur.com/SOpEv4V.png) But `AAABABBAABABBABB` is the shortest only for this particular ordering. So let's reorder and try again: ![](https://i.imgur.com/2BaHk2F.png) Now we did better, but maybe we can do even better. (Adopted from [Ben Langmead](http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf)). ------- Ultimately we need to try all possible ordering and pick the shortest among all. Using this approach is we have $S$ strings we will need to do $S!$ tries. This can quickly get impossible. For our set of eight strings $8! = 40320$. If we get, say, a 1,000,000 reads from an Illumina machine then the factorial of a million is not going to be an attractive analysis option. ## Shortest common superstring: Greedy approach As we've seen it will be impossible to assemble the genome using SCS logic. There is a simplification called *Greedy* approach to SCS problem. Let's take the following set of "reads": $$ S: \texttt{AAA AAB ABB BBA BBB} $$ and first build an overlap graph: ----- ![](https://i.imgur.com/BOoTHZs.png) **An overlap graph** for set $S: \texttt{AAA AAB ABB BBA BBB}$. ------ Next, we start collapsing the nodes to maximize the overlap (and hence to decrease the length of the SCS we are trying to construct): ------ In the graph below there are multiple ties: nodes with outgoing edges of identical weights (e.g., edges pointing from `ABB` to both `BBA` and `BBB` have weight of two. Remember, that the weight is the length of overlap between two nodes' labels). In this situation we will break ties by randomly picking an edge to traverse. Let's pick <font color="red">`AAA` &#8594; `AAB`</font>: ![](https://i.imgur.com/bIhrqvZ.png) We then merge `AAA` and `AAB` into an SCS containing both, which will be `AAAB`: ![](https://i.imgur.com/P5clLDe.png) Now let's pick edge <font color="red">`ABB` &#8594; `BBB`</font>: ![](https://i.imgur.com/83S13tY.png) Collapse the nodes: ![](https://i.imgur.com/yA60d0Y.png) Pick <font color="red">`ABBB` &#8594; `BBA`</font>: ![](https://i.imgur.com/8dg1Hpn.png) Collapse the nodes: ![](https://i.imgur.com/XGwvKve.png) Pick <font color="red">`AAAB` &#8594; `ABBBA`</font>: ![](https://i.imgur.com/qnrXfIU.png) Collapse again and now we are left with a superstring of length 7: ![](https://i.imgur.com/wlpnOd7.png) ----- The above procedure can be computed *very* quickly. But there is a catch: it does not guarantee that it will give us truly the shortest superstring. It really depends on how we choose edges. Below is another example of using the same dataset in which we traverse graph an a slightly different way: ----- We start the same way as before by choosing <font color="red">`AAA` &#8594; `AAB`</font>: ![](https://i.imgur.com/tBcJ6zj.png) Merge `AAA` and `AAB`: ![](https://i.imgur.com/IuuOf2d.png) But now we pick a different edge <font color="red">`ABB` &#8594; `BBA`</font>: ![](https://i.imgur.com/ykn8gar.png) Collapsing these nodes dramatically changes the graph: ![](https://i.imgur.com/umFl7yv.png) Now we pick <font color="red">`AAAB` &#8594; `ABBA`</font> as this is the edge with the highest weight: ![](https://i.imgur.com/iiSruJx.png) Collapsing it produces two nodes that are not connected to each other: ![](https://i.imgur.com/9P14fSF.png) ------ And the SCS of these two will be a concatenation `AAABBABBB` of length 9. Thus a greedy approach may produce different answers. However, it is a sufficient approximation as the superstring yielded this way will not be more than ~2.5 times longer than the true SCS ([Gusfield](https://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198) 16.17.1). # The Third Law of Assembly: Repeats are Evil! Let's again apply Greedy SCS to a different "genome". Suppose we want to reconstruct the phrase: ``` a_long_long_long_time ``` from all 6-mers that overlap by at least 3 characters. The list of 6-mers is: ``` ng_lon _long_ a_long long_l ong_ti ong_lo long_t g_long g_time ng_tim ``` An overlap graph will look like this: ----- ![](https://i.imgur.com/B1HhK6L.png) **An overlap graph** for with overlap length $\geq 3$. ------ If we proceed with Greedy SCS we will follow the following trajectory through the graph: ----- ![](https://i.imgur.com/2qCFl0c.png) To make things even clearer let's isolate the path: ![](https://i.imgur.com/67Y0tcb.png) The total overlap here (the sum of edge weights) is $4+5+5+5+5+5+5+5+5=44$ ------ The problem is that $4+5+5+5+5+5+5+5+5=44$ gives us `a_long_long_time` as the shortest superstring: ``` a_long long_l ong_lo ng_lon g_long _long_ long_t ong_ti ng_tim g_time ---------------- a_long_long_time ``` We are missing one instance of `long` in this string. The following graph shows the path that would return the *correct* string: ----- ![](https://i.imgur.com/xjRznUF.png) A path yielding the correct string with three repeats. The total overlap here is $5+3+3+5+4+4+5+5+5=39$ ----- The above graph with overlap $5+3+3+5+4+4+5+5+5=39$ is actually *worse* than the previous path if our goal is to find the shortest superstring: ``` a_long _long_ ng_lon long_l ong_lo g_long long_t ong_ti ng_tim g_time --------------------- a_long_long_long_time ``` ## Are we really looking for the shortest superstring? As we've seen above the shortest common superstring (SCS) is: 1. **Difficult to obtain** as Greedy SCS algorithm does not guarantee funding it. So the answer we get may be longer that the real genome we are trying to assemble. 2. **May be shorter than we want** because if the genome contains repeats that are longer than the reads we are using, Greedy SCS will collapse them and make assembly shorter that the genome we are trying to get. Let's talk about an alternative way to represent the relationship between $k$-mers that may give us a more efficient algorithm. # de Bruijn graphs [Nicolaas de Bruijn](https://en.wikipedia.org/wiki/Nicolaas_Govert_de_Bruijn) had a purely theoretical interest of constructing $k$-universal strings for an arbitrary value of $k$. A $k$-universal string contains every possible $k$-mer only once: ------ ![](https://i.imgur.com/ya9yVT1.png) **de Bruijn graph**. From [Compeau:2011](http://www.nature.com/nbt/journal/v29/n11/abs/nbt.2023.html) ------ This problem is equivalent to a string reconstruction problem we have been talking about above: finding a $k$-universal string is equivalent to finding a Hamiltonian path in an overlap graph constructed from all $k$-mers. Yet finding a Hamiltonian path in a really large graph (representing a real genome) is not a tractable problem as we have seen. Instead de Bruijn decided to represent $k$-mer composition in a graph using a slightly different logic. Again, suppose we have a "genome" $\texttt{TAATGCCATGGGATGTT}$ split in a collection of 3-mers: ``` TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT ``` We will assign 3-mers to _edges_ instead or _nodes_: ----- ![](https://i.imgur.com/p3ZaiGA.png) **$k$-mers as edges**. Edges represented by 3-mers connect nodes representing the overlaps. (Adopted from Fig. 4.12 from CP). ------ This graph can be simplified by gluing identical nodes together: ----- ![](https://i.imgur.com/tIEATZQ.png) ![](https://i.imgur.com/E3xLpHk.png) Here the complexity of the graph is reduced by first gluing redundant <font color="red">`AT`</font> nodes ![](https://i.imgur.com/k4PK6Os.png) ![](https://i.imgur.com/nFCwH2F.png) Next, <font color="blue">`TG`</font> nodes are merged ![](https://i.imgur.com/25mq4mc.png) ![](https://i.imgur.com/PCk9xcU.png) And, finally the two <font color="green">`GG`</font> nodes are resolved. (Adopted from Fig. 4.13 from CP) ------ Because we now represent _k_-mers as edges (rather than nodes), our problem has morphed into finding a path that visits every _edge_ once, or an [Eulerian Path](https://en.wikipedia.org/wiki/Eulerian_path): ------ ![](https://i.imgur.com/2rGUmCi.png) **Eulerian paths for the 15 3-mers**. Numbering of edges provides a way to reconstruct the original "genome". (Adopted from Fig. 4.15 from CP) ------ ## Euler's Theorem Some definitions: * **Balanced node** - a node where the number of incoming edges is equal to the number of outgoing edges * **Balanced graph** - a graph where all nodes are balanced * **Strongly connected graph** - any node can be reached from any other node **Euler's Theorem**: >Every balanced, strongly connected directed graph is Eulerian. Let's apply Euler's Theorem to a classical problem: The bridges of Köninsberg problem. Here the question is: *Can you walk through all of Köninsberg traversing every bridge exactly one time?* In other words: *Is there a Eulerian path through the city of Köninsberg?* ----- ![](https://i.imgur.com/QzmL8g4.png) **Köninsberg and Euler's Theorem**. (a) A map of old Königsberg, in which each area of the city is labeled with a different color point. (b) The Königsberg Bridge graph, formed by representing each of four land areas as a node and each of the city's seven bridges as an edge. (From [Campeau:2011](http://www.nature.com/nbt/journal/v29/n11/abs/nbt.2023.html#close)) ------ By looking at this graph we can see that it is *unblanaced*. If one arrives to, say, the <font color="orange">orange</font> node from the <font color="blue">blue</font> node there are two ways to get out. Thus there is no way to see all of the city and traverse every bridge once! ---- <iframe src="https://www.google.com/maps/embed?pb=!1m14!1m8!1m3!1d73747.90922499954!2d20.451778999511713!3d54.71627037775103!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x0%3A0x0!2zNTTCsDQzJzAwLjAiTiAyMMKwMzEnMDAuMCJF!5e0!3m2!1sen!2sus!4v1615904421852!5m2!1sen!2sus" width="600" height="450" style="border:0;" allowfullscreen="" loading="lazy"></iframe> Another problem with bridges of Köninsberg, is that Köninsberg no longer exists. ----- ## Repeats are still a challenge Let's look at the de Bruijn graph from above again. But this time let's drop edge numbering and pretend that the genome is not really known to us (as is usually the case in real life): ------ ![](https://i.imgur.com/pYXzYox.png) **Eulerian paths for the 15 3-mers**. ------ In the original sequence `TAATGCCATGGGATGTT` $k$-mer <font color="red">`AT`</font> is present 3 times and $k$-mer <font color="blue">`TG`</font> is found twice. Thus *multiple* Eulerian walks are now possible like this: ------ ![](https://i.imgur.com/Ruv2DPV.png) **Possible path #1**. Here after we reach <font color="blue">`TG`</font> node we turn **up**. ------- The above path spells out: ``` TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT ----------------- TAATGCCATGGGATGTT ``` Yet there is an alternative: ------ ![](https://i.imgur.com/TvUb8Se.png) **Possible path #2**. Here after we reach <font color="blue">`TG`</font> node we turn **dow**. ------- Which spells: ``` TAA AAT ATG TGG GGG GGA GAT ATG TGC GCC CCA CAT ATG TGT GTT ----------------- TAATGGGATGCCATGTT ``` Note how different these are (`|` = same; `*` = different): ``` TAATGCCATGGGATGTT |||||**|||**||||| TAATGGGATGCCATGTT ``` and only one of them is correct. <font color="orange">Repeats are evil!</font> ## $k$-mer size affects repeat resolution In the above example we have used $k$-mer size of 3. But what if we try 4 or 5? Below are DeBruijn graphs for different values of $k$: ------ $k = 3$: ```graphviz // DeBruijn graph digraph { GG [label=GG] TA [label=TA] CA [label=CA] GA [label=GA] CC [label=CC] GT [label=GT] AA [label=AA] GC [label=GC] TG [label=TG] AT [label=AT] GG -> GG GG -> GA TA -> AA CA -> AT GA -> AT CC -> CA GT -> TT AA -> AT GC -> CC TG -> GC TG -> GG TG -> GT AT -> TG AT -> TG AT -> TG } ``` This :point_up: is our original graph. $k = 4$: ```graphviz digraph { ATG [label=ATG] GCC [label=GCC] TAA [label=TAA] CCA [label=CCA] TGG [label=TGG] TGC [label=TGC] GGA [label=GGA] GAT [label=GAT] CAT [label=CAT] AAT [label=AAT] TGT [label=TGT] GGG [label=GGG] ATG -> TGC ATG -> TGG ATG -> TGT GCC -> CCA TAA -> AAT CCA -> CAT TGG -> GGG TGC -> GCC GGA -> GAT GAT -> ATG CAT -> ATG AAT -> ATG TGT -> GTT GGG -> GGA } ``` Here complexity is decreasing, but we still have the problem with having `ATG` twice. $k = 5$: ```graphviz digraph { TAAT [label=TAAT] CATG [label=CATG] ATGC [label=ATGC] CCAT [label=CCAT] GGAT [label=GGAT] TGGG [label=TGGG] GGGA [label=GGGA] TGCC [label=TGCC] AATG [label=AATG] ATGG [label=ATGG] GCCA [label=GCCA] ATGT [label=ATGT] GATG [label=GATG] TAAT -> AATG CATG -> ATGG ATGC -> TGCC CCAT -> CATG GGAT -> GATG TGGG -> GGGA GGGA -> GGAT TGCC -> GCCA AATG -> ATGC ATGG -> TGGG GCCA -> CCAT ATGT -> TGTT GATG -> ATGT } ``` In this case there is only one path. This because our $k$ is larger that the repeat size, so we can resolve it accurately. ------ This is why technologies producing long sequencing reads stimulate so much enthusiasm - they will allow to resolve and produce accurate assembly of large genomes.

    Import from clipboard

    Paste your markdown or webpage here...

    Advanced permission required

    Your current role can only read. Ask the system administrator to acquire write and comment permission.

    This team is disabled

    Sorry, this team is disabled. You can't edit this note.

    This note is locked

    Sorry, only owner can edit this note.

    Reach the limit

    Sorry, you've reached the max length this note can be.
    Please reduce the content or divide it to more notes, thank you!

    Import from Gist

    Import from Snippet

    or

    Export to Snippet

    Are you sure?

    Do you really want to delete this note?
    All users will lose their connection.

    Create a note from template

    Create a note from template

    Oops...
    This template has been removed or transferred.
    Upgrade
    All
    • All
    • Team
    No template.

    Create a template

    Upgrade

    Delete template

    Do you really want to delete this template?
    Turn this template into a regular note and keep its content, versions, and comments.

    This page need refresh

    You have an incompatible client version.
    Refresh to update.
    New version available!
    See releases notes here
    Refresh to enjoy new features.
    Your user state has changed.
    Refresh to load new user state.

    Sign in

    Forgot password

    or

    By clicking below, you agree to our terms of service.

    Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox Sign in with Wallet
    Wallet ( )
    Connect another wallet

    New to HackMD? Sign up

    Help

    • English
    • 中文
    • Français
    • Deutsch
    • 日本語
    • Español
    • Català
    • Ελληνικά
    • Português
    • italiano
    • Türkçe
    • Русский
    • Nederlands
    • hrvatski jezik
    • język polski
    • Українська
    • हिन्दी
    • svenska
    • Esperanto
    • dansk

    Documents

    Help & Tutorial

    How to use Book mode

    Slide Example

    API Docs

    Edit in VSCode

    Install browser extension

    Contacts

    Feedback

    Discord

    Send us email

    Resources

    Releases

    Pricing

    Blog

    Policy

    Terms

    Privacy

    Cheatsheet

    Syntax Example Reference
    # Header Header 基本排版
    - Unordered List
    • Unordered List
    1. Ordered List
    1. Ordered List
    - [ ] Todo List
    • Todo List
    > Blockquote
    Blockquote
    **Bold font** Bold font
    *Italics font* Italics font
    ~~Strikethrough~~ Strikethrough
    19^th^ 19th
    H~2~O H2O
    ++Inserted text++ Inserted text
    ==Marked text== Marked text
    [link text](https:// "title") Link
    ![image alt](https:// "title") Image
    `Code` Code 在筆記中貼入程式碼
    ```javascript
    var i = 0;
    ```
    var i = 0;
    :smile: :smile: Emoji list
    {%youtube youtube_id %} Externals
    $L^aT_eX$ LaTeX
    :::info
    This is a alert area.
    :::

    This is a alert area.

    Versions and GitHub Sync
    Get Full History Access

    • Edit version name
    • Delete

    revision author avatar     named on  

    More Less

    Note content is identical to the latest version.
    Compare
      Choose a version
      No search result
      Version not found
    Sign in to link this note to GitHub
    Learn more
    This note is not linked with GitHub
     

    Feedback

    Submission failed, please try again

    Thanks for your support.

    On a scale of 0-10, how likely is it that you would recommend HackMD to your friends, family or business associates?

    Please give us some advice and help us improve HackMD.

     

    Thanks for your feedback

    Remove version name

    Do you want to remove this version name and description?

    Transfer ownership

    Transfer to
      Warning: is a public team. If you transfer note to this team, everyone on the web can find and read this note.

        Link with GitHub

        Please authorize HackMD on GitHub
        • Please sign in to GitHub and install the HackMD app on your GitHub repo.
        • HackMD links with GitHub through a GitHub App. You can choose which repo to install our App.
        Learn more  Sign in to GitHub

        Push the note to GitHub Push to GitHub Pull a file from GitHub

          Authorize again
         

        Choose which file to push to

        Select repo
        Refresh Authorize more repos
        Select branch
        Select file
        Select branch
        Choose version(s) to push
        • Save a new version and push
        • Choose from existing versions
        Include title and tags
        Available push count

        Pull from GitHub

         
        File from GitHub
        File from HackMD

        GitHub Link Settings

        File linked

        Linked by
        File path
        Last synced branch
        Available push count

        Danger Zone

        Unlink
        You will no longer receive notification when GitHub file changes after unlink.

        Syncing

        Push failed

        Push successfully