# Diff - SoK # Diff ## Semantic ## AST ## Patdiff ## Graphtage ### How Graphtage Works In general, optimally mapping one graph to another cannot be executed in polynomial time[^1], and is therefore not tractable for graphs of any useful size[^2]. This is true even for restricted classes of graphs like DAGs[^3]. However, trees and forests are a special case that *can* be mapped in polynomial time, with reasonable constraints on the types of edits possible. Graphtage exploits this. ### Why Mapping Trees is Complex Ordered nodes in the tree (*e.g.*, JSON lists) and, in particular, mappings (*e.g.*, JSON dicts) are challenging. Most extant diffing algorithms and utilities assume that the structures are ordered. Take this JSON as an example: <table> <tbody> <tr class="odd"> <td>Original JSON</td> <td>Modified JSON</td> </tr> <tr class="even"> <td><pre class="json"><code>{ &quot;foo&quot;: [1, 2, 3, 4], &quot;bar&quot;: &quot;testing&quot; }</code></pre></td> <td><pre class="json"><code>{ &quot;foo&quot;: [2, 3, 4, 5], &quot;zab&quot;: &quot;testing&quot;, &quot;woo&quot;: [&quot;foobar&quot;] }</code></pre></td> </tr> </tbody> </table> Existing tools effectively canonicalize the JSON (*e.g.*, sort dictionary elements by key and format lists with one item per line), and then perform a traditional diff: ```bash cat original.json | jq -M --sort-keys > original.canonical.json cat modified.json | jq -M --sort-keys > modified.canonical.json diff -u original.canonical.json modified.canonical.json ``` ```diff { - "bar": "testing", "foo": [ - 1, 2, 3, - 4 - ] + 4, + 5 + ], + "woo": [ + "foobar" + ], + "zab": "testing" } ``` Not entirely useful, particularly if the input files are large. The problem is that changing dict keys breaks the diff: Since "bar" was changed to "zab", the canonical representation changes and they are considered separate edits (lines 2 and 15 of the diff). ### Matching Ordered Sequences Graphtage matches ordered sequences like lists using an "online"[^4], "constructive"[^5] implementation of the Levenshtein distance metric[^6], similar to the Wagner–Fischer algorithm[^7]. The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered. This is implemented in the `graphtage.levenshtein` module. ### Matching Unordered Collections Dicts are matched by solving the minimum weight matching problem[^8] on the complete bipartite graph from key/value pairs in the source dict to key/value pairs in the destination dict. This is implemented in the `graphtage.matching` module. ## Patdiff ### Patience diff algorithm Based on Bram Cohen’s patience diff algorithm, patdiff is optimized for diffing code and config files. > Patdiff includes both an OCaml library (Patdiff_lib) as well as a command-line tool (patdiff). #### Word-level refinement Rather than displaying a small change to a line as a wholesale removal and addition of that line, patdiff shows the word-level diff of that line. All the usual features Recursive diffing of directories Extensively configurable output (markers, colors, location format, context) Whitespace-aware diffing ## Footnotes [1]: see https://en.wikipedia.org/wiki/Graph_isomorphism_problem [2]: Unless $*P* = *N**P*$ [3]: see <https://en.wikipedia.org/wiki/Directed_acyclic_graph> [4]: see <https://en.wikipedia.org/wiki/Online_algorithm> [5]: see <https://en.wikipedia.org/wiki/Constructive_proof> [6]: see <https://en.wikipedia.org/wiki/Levenshtein_distance> [7]: <https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm> [8]: <https://en.wikipedia.org/wiki/Assignment_problem> Johannes Köbler and Uwe Schöning and Jacobo Torán The Graph Isomorphism Problem: Its Structural Complexity Progress in Theoretical Computer Science. Birkhäuser, Boston, MA, (1993). preface, TOC, etc.. cited in KaibelSchwartz2002.references.bib Skiena, Steve 1.5.9 Graph Isomorphism in the Stony Brook Algorithm Repository Stony Brook University 2001 with reference to GMT - Graph Matching Toolkit Delta: an ontology for the distribution of differences between RDF graphs https://www.w3.org/DesignIssues/Diff.html