# Diff - SoK
# Diff
## Semantic
## AST
## Patdiff
## Graphtage
### How Graphtage Works
In general, optimally mapping one graph to another cannot be executed in polynomial time[^1], and is therefore not tractable for graphs of any useful size[^2]. This is true even for restricted classes of graphs like DAGs[^3]. However, trees and forests are a special case that *can* be mapped in polynomial time, with reasonable constraints on the types of edits possible. Graphtage exploits this.
### Why Mapping Trees is Complex
Ordered nodes in the tree (*e.g.*, JSON lists) and, in particular,
mappings (*e.g.*, JSON dicts) are challenging. Most extant diffing
algorithms and utilities assume that the structures are ordered. Take
this JSON as an example:
<table>
<tbody>
<tr class="odd">
<td>Original JSON</td>
<td>Modified JSON</td>
</tr>
<tr class="even">
<td><pre class="json"><code>{
"foo": [1, 2, 3, 4],
"bar": "testing"
}</code></pre></td>
<td><pre class="json"><code>{
"foo": [2, 3, 4, 5],
"zab": "testing",
"woo": ["foobar"]
}</code></pre></td>
</tr>
</tbody>
</table>
Existing tools effectively canonicalize the JSON (*e.g.*, sort dictionary elements by key and format lists with one item per line), and then perform a traditional diff:
```bash
cat original.json | jq -M --sort-keys > original.canonical.json
cat modified.json | jq -M --sort-keys > modified.canonical.json
diff -u original.canonical.json modified.canonical.json
```
```diff
{
- "bar": "testing",
"foo": [
- 1,
2,
3,
- 4
- ]
+ 4,
+ 5
+ ],
+ "woo": [
+ "foobar"
+ ],
+ "zab": "testing"
}
```
Not entirely useful, particularly if the input files are large. The problem is that changing dict keys breaks the diff: Since "bar" was changed to "zab", the canonical representation changes and they are considered separate edits (lines 2 and 15 of the diff).
### Matching Ordered Sequences
Graphtage matches ordered sequences like lists using an "online"[^4],
"constructive"[^5] implementation of the Levenshtein distance metric[^6], similar to the Wagner–Fischer algorithm[^7]. The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered. This is implemented in the `graphtage.levenshtein` module.
### Matching Unordered Collections
Dicts are matched by solving the minimum weight matching problem[^8] on the complete bipartite graph from key/value pairs in the source dict to key/value pairs in the destination dict. This is implemented in the `graphtage.matching` module.
## Patdiff
### Patience diff algorithm
Based on Bram Cohen’s patience diff algorithm, patdiff is optimized for diffing code and config files.
> Patdiff includes both an OCaml library (Patdiff_lib) as well as a command-line tool (patdiff).
#### Word-level refinement
Rather than displaying a small change to a line as a wholesale removal and addition of that line, patdiff shows the word-level diff of that line.
All the usual features Recursive diffing of directories
Extensively configurable output (markers, colors, location format, context) Whitespace-aware diffing
## Footnotes
[1]: see https://en.wikipedia.org/wiki/Graph_isomorphism_problem
[2]: Unless $*P* = *N**P*$
[3]: see <https://en.wikipedia.org/wiki/Directed_acyclic_graph>
[4]: see <https://en.wikipedia.org/wiki/Online_algorithm>
[5]: see <https://en.wikipedia.org/wiki/Constructive_proof>
[6]: see <https://en.wikipedia.org/wiki/Levenshtein_distance>
[7]: <https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm>
[8]: <https://en.wikipedia.org/wiki/Assignment_problem>
Johannes Köbler and Uwe Schöning and Jacobo Torán The Graph Isomorphism Problem: Its Structural Complexity Progress in Theoretical Computer Science. Birkhäuser, Boston, MA, (1993).
preface, TOC, etc.. cited in KaibelSchwartz2002.references.bib
Skiena, Steve 1.5.9 Graph Isomorphism in the Stony Brook Algorithm Repository Stony Brook University 2001
with reference to GMT - Graph Matching Toolkit
Delta: an ontology for the distribution of differences between RDF graphs
https://www.w3.org/DesignIssues/Diff.html