# Numba RVSDG Datastructure Proposal
Problem Statement
-----------------
The transformational algorithms LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE as
described in Bahmann2015 are to be implemented in this repository. The idea is
to use these algorithms to regularize Python bytecode such that it becomes
"structured". Using the terms of the paper, to take an existing Control Flow
Graph (CFG) and restructure it into a Structured Control Flow Graph (SCFG). The
algorithms consist of two parts conceptually: restructuring the CFG such that
regions (Loop, Head, Branch and Tail) can be identified clearly. That is to say
there is "data" in the form of the blocks of a CFG and analysis in the form or
regions that are "overlayed" on top of the data.
The Story so far
----------------
The approach implemented by myself (Val) up until the commit
c75b0d93dd4022c5f1d70e30b6363adafc990dcb was to "mix" control-flow (blocks) and
analysis (regions). That is, whenever a region was identified in the CFG, all
the blocks pertaining to this region were removed and a Region block with an
additional SCFG inside was inserted in it's place. So when traversing the final
SCFG, an iterator may return a block or a region and whoever is consuming this
would need to be aware or this nested, recursive and hierarchical structure. It
was a difficult mental model and eventually programming new features against
this became unnecessarily hard.
As pointed out by Stuart, mixing data and analysis will lead to a problem
factory and I strongly believe this statement to be correct. It is my opinion
that data and analysis MUST be kept separate to ease the mental load on the
programmer. Having worked on this code for several months, this is indeed my
experience.
Moving Forward
--------------
However, there are really two phases to these algorithms: construction and
reflection. During construction the graph is basically still a CFG and the
structures must be identified to turn it into a SCFG. Because there are
multiple region types, we need to go back and forth between inspecting the
graph, adding blocks and identifying regions. For example, we can not identify
head/branch/tail regions until loop regions have been identified.
Also, different stages of the algorithm need to view the SCFG from different
perspectives. For example, the initial detection of a top-level loop-region
must view the graph in a cyclic way. That is to say: any potential backedges
need to be retained. However, to find nested regions inside other loop regions,
the backedge of the loop-region in which we wish to find other loops must be
taken into account, otherwise the Strongly Connected Components algorithm will
only find the outermost loop.
Another example is that during BRANCH-RESTRUCTURE, we must view any looping
regions as "blocks" that is --- when doing the analysis --- we abstract away
all the blocks inside the loop and present an entirely acyclic graph to the
algorithm. The nested regions are irrelevant in this case and need not be
presented. A third example is rendering the graph, in this case, we need a view
that makes all blocks and all regions available, since we need to render all of
them. The conclusion is that we will need multiple views, each one presenting
the data and analysis contained in the SCFG in a different fashion. I.e. the
underlying datastructures must not mix data and analysis, but any given view
can stitch together data and analysis in a pre-defined way for any potential
consumers to operate on.
In order to facilitate this multi-view paradigm, I would like to propose the
following structure. I will use Python pseudocode to outline the structures.
Some bits may be missing, but I think this should be sufficient to convey the
ideas I have and can translate into real code from here.
The data in the SCFG is stored in the following fashion, which, implementation
details omitted, should be sufficient for storing all information.
```python
class Block:
# Block has a name. Other block types may have more info, e.g. Python
# bytecode
name: str
class Region:
# Region has a name
name: str
# Regions can be meta, loop, head, tail, branch
kind: str
# The header block of this region
header: str
# The exiting block of this region
exiting: str
class SCFG:
# This maps the names of the blocks to the block objects, which may contain
# additional information such as assembly instructions or Python byetcode
blocks: dict[str, Block]
# This is the actual graph, contains a mapping of block names to block
# names they point to, i.e. the edges of the graph. Any block name in the
# list of block names must also be a key in this mapping. I.e. this mapping
# must be self-contained.
edges: dict[str, list[str]]
# These are any identified back-edges. This must also be self conatined.
backs: dict[str, list[str]]
# This is the top-level region
meta_region: str
# This is the region storage. It maps region names to the Region objects,
# which themselves contain the header and exiting blocks of this region
regions: dict[str, Region]
# This is the region tree which stores the hierarchical relationship
# between regions. The root node of this tree is the meta-region (once the
# graph is fully analysed).
region_tree: dict[str, list[str]]
```
Now, importantly, this must be presented in different views. My proposal here
is to implement different types of `GraphView`s. A concept taken from NetworkX.
The idea is that a view is like a facade
(https://en.wikipedia.org/wiki/Facade_pattern) pattern. A class that follows a
given (abstract) interface forwards calls and modify things. In this case the
API is for a directed graph. A graph structure needs to be able to a) iterate all the
items/nodes/vertices (ideally sorted in some fashion) and b) for any given
item/node/vertex return the target items/nodes/vertices that it points to.
This then allows us to create different views, depending on how we need to see
the graph. Looking for nested loops in a loop (subgraph/region)? Use the
`AcyclicBlockView`. This will "zoom in" to the loops until it can no longer
find any strongly connected components, and thus finding all loops and nested
loops. Are you in the meta-region and want to find the top-level head, branch
and tail regions, use the `RegionView` to abstract away any exiting top-level
loops. Do you need to iterate over all blocks and regions in a certain order
and depth such that you can render them? You'll need to implement the
`RenderingView` who's `iter` is to be implemented such that it returns the
correct elements from the scfg either regions or blocks and their metadata.
The `__getitem__` returns the edges from the block-graph so they can be
rendered (there are no edges between regions, in the final representation to be
rendered).
My proposal allows for a fully flexible view model with concrete views that are
easy to test and easy to extend or invent depending on whatever lens the graph
needs to be seen in. As described above there are already a few different
internal use-cases that would make use of these views. The would also serve as
a uniform API to access the SCFG which would make it much easier to refactor
the internal representation, while maintaining passing test coverage.
```python
class AbstractGraphView:
def __init__(self, scfg):
self.scfg = scfg
def __getitem__(self, item):
# Is expected to return a list[str, str] where each string must be an
# item in the graph
raise NotImplementedError
def __iter__(self)
raise NotImplementedError
class CyclicBlockView(AbstractGraphView):
def __getitem__(self, item):
return self.scfg.edges[item]
def __iter__(self):
return self.scfg.edges.keys()
class AcyclicBlockView(AbstractGraphView):
def __getitem__(self, item):
return [e for e in self.scfg.edges[item]
if e not in self.scfg.backs[item]]
def __iter__(self):
return self.scfg.edges.keys()
class RegionView(AbstractGraphView):
def __init__(self, scfg, region):
self.scfg = scfg
self.region = region
self.nested_regions = scfg.region_tree[region]
self.blocks_in_region = ... # use one of the views above
def __getitem__(self, item):
if item in self.scfg.regions:
# If the item we are looking at is a region...
exiting = self.scfg.regions[item].exiting
# Check if the region points to another region, e.g. two loops
# behind each other, in that case, return the region name
for r in self.nested_regions
if exiting == r.header:
return [r.header]
# Otherwise return the block name, if it is in the region
if exiting in self.blocks_in_region:
return [exiting]
else:
return []
else:
outs = self.scfg.edges[item]
# replace any arcs into a nested region with an arc to region name
for i,o in enumerate(outs):
for r in self.nested_regions
if o == r.header
outs[i] = r.header
# remove any edges pointing outside the region
outs = [e for e in outs if e in self.blocks_in_region]
return outs
def __iter__(self):
# Use the __getitem__ from self.region.header until it no longer
# returns anything.
pass
```