# DDlog 2.0 RVSDG Dataflow Optimization
Background:
- [RVSDG: An Intermediate Representation for Optimizing Compilers](https://arxiv.org/pdf/1912.05036.pdf)
- [RVSDG: An Intermediate Representation for the Multi-Core Era](https://www.sjalander.com/research/pdf/sjalander-mcc2018.pdf)
- [Rewrite Rule Inference Using Equality Saturation](https://arxiv.org/pdf/2108.10436.pdf)
- [Semantic Reasoning about the Sea of Nodes](https://hal.inria.fr/hal-01723236/file/sea-of-nodes-hal.pdf)
- [Optimizing compilation with the Value State Dependence Graph](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf)
As an intermediate representation we should use the RVSDG to represent both the timely-level dataflow graph and the inter-operator expressions (map functions, for instance). Using this unified representation to encompass both aspects of ddlog programs brings a number of benefits:
- Intrinsic optimization, if we know both the contents of the dataflow graph and the contents of each function within it we can perform both easy and complex dataflow (in the compiler sense, not the paradigm) optimizations such as `relation.filter(|_| false) => empty_relation()` and `.map(|_| 10).filter(|x| x == 10) => .map(|_| 10)`
- Trivial dead code elimination on both a dataflow graph and expression level, this allows us to remove functions that are never called, relations that are never used and close inputs that are never read from
- Trivial de-duplication of the dataflow graph to allow for minimal re-computation (this extends to both collections and arrangements)
- Modular compiler internals. Since the RVSDG is very abstract it's very decoupled from both the frontend and the final codegen target, allowing easy swapping of both or even using the RVSDG as a target for entirely third-party languages or libraries
For those unfamiliar, the RVSDG is a way to represent programs as an abstract graph with only four distinct components:
- Nodes, these are atomic expressions within the program such as `1`, `add 1, 2` or `map`
- Regions, these are nodes that themselves contain sub-nodes. Regions are used for representing both control flow and effect flow, for example the phi node represents an `if ... { ... } else { ... }` and the theta node represents a `loop { ... }`
- Value edges, these are edges between nodes that express data dependence such as an `add` node having two incoming value edges from a `1` and a `2` node
- Effect edges, these edges express state dependence, such as two `print()` calls being sequential *relative to each other* (this is an important note as the RVSDG is inherently unstructured). Effect edges won't be used very much within the ddlog compiler, but they do have an important, albeit sparse, role to play in it
With these four primitives we can express programs in a very abstract and inherently normalizing way. That is to say, many different source language programs can (and will) look identical once translated into an RVSDG due to it's inherent lack of ordering. Instead of `add 1, 2; add 2, 3` and `add 2, 3, add 1, 2` being seen as two separate programs, the RVSDG simply sees

This (as mentioned before) also comes into effect when analyzing the dataflow graph as a whole since both the nodes of the dataflow graph and the expressions they contain can be unified, cutting the total program's potential runtime in half:
```prolog
input relation R(x: usize)
output relation X(x: usize)
X(x) :- R(x), x + 5, x > 10.
X(x) :- R(x), x + 5, x < 4.
```
Ignoring the questionable nature of this program, the raw translation of this program into a graph would render this

From which we can see that the map operation is trivially reducible, that's duplicated effort we've done and we can remove it

That's a rather trivial example that could definitely be optimized further (say, by turning multiple filters on one relation that are then concatenated into a single filter), but it expresses the intent quite nicely.
Another optimization I mentioned was dead code elimination, with two simple rules we can do all the DCE on dataflow graphs that we need to
1. Any node that cannot reach an output node is dead
2. Any node that is not reachable by an input node is dead
 
*Both of these graphs would be considered dead code*
So to wrap things up, I believe that the RVSDG is an optimal representation for compiling and optimizing our dataflow graphs and the expressions within them