Numba RVSDG Datastructure Proposal

# Numba RVSDG Datastructure Proposal Problem Statement ----------------- The transformational algorithms LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE as described in Bahmann2015 are to be implemented in this repository. The idea is to use these algorithms to regularize Python bytecode such that it becomes "structured". Using the terms of the paper, to take an existing Control Flow Graph (CFG) and restructure it into a Structured Control Flow Graph (SCFG). The algorithms consist of two parts conceptually: restructuring the CFG such that regions (Loop, Head, Branch and Tail) can be identified clearly. That is to say there is "data" in the form of the blocks of a CFG and analysis in the form or regions that are "overlayed" on top of the data. The Story so far ---------------- The approach implemented by myself (Val) up until the commit c75b0d93dd4022c5f1d70e30b6363adafc990dcb was to "mix" control-flow (blocks) and analysis (regions). That is, whenever a region was identified in the CFG, all the blocks pertaining to this region were removed and a Region block with an additional SCFG inside was inserted in it's place. So when traversing the final SCFG, an iterator may return a block or a region and whoever is consuming this would need to be aware or this nested, recursive and hierarchical structure. It was a difficult mental model and eventually programming new features against this became unnecessarily hard. As pointed out by Stuart, mixing data and analysis will lead to a problem factory and I strongly believe this statement to be correct. It is my opinion that data and analysis MUST be kept separate to ease the mental load on the programmer. Having worked on this code for several months, this is indeed my experience. Moving Forward -------------- However, there are really two phases to these algorithms: construction and reflection. During construction the graph is basically still a CFG and the structures must be identified to turn it into a SCFG. Because there are multiple region types, we need to go back and forth between inspecting the graph, adding blocks and identifying regions. For example, we can not identify head/branch/tail regions until loop regions have been identified. Also, different stages of the algorithm need to view the SCFG from different perspectives. For example, the initial detection of a top-level loop-region must view the graph in a cyclic way. That is to say: any potential backedges need to be retained. However, to find nested regions inside other loop regions, the backedge of the loop-region in which we wish to find other loops must be taken into account, otherwise the Strongly Connected Components algorithm will only find the outermost loop. Another example is that during BRANCH-RESTRUCTURE, we must view any looping regions as "blocks" that is --- when doing the analysis --- we abstract away all the blocks inside the loop and present an entirely acyclic graph to the algorithm. The nested regions are irrelevant in this case and need not be presented. A third example is rendering the graph, in this case, we need a view that makes all blocks and all regions available, since we need to render all of them. The conclusion is that we will need multiple views, each one presenting the data and analysis contained in the SCFG in a different fashion. I.e. the underlying datastructures must not mix data and analysis, but any given view can stitch together data and analysis in a pre-defined way for any potential consumers to operate on. In order to facilitate this multi-view paradigm, I would like to propose the following structure. I will use Python pseudocode to outline the structures. Some bits may be missing, but I think this should be sufficient to convey the ideas I have and can translate into real code from here. The data in the SCFG is stored in the following fashion, which, implementation details omitted, should be sufficient for storing all information. ```python class Block: # Block has a name. Other block types may have more info, e.g. Python # bytecode name: str class Region: # Region has a name name: str # Regions can be meta, loop, head, tail, branch kind: str # The header block of this region header: str # The exiting block of this region exiting: str class SCFG: # This maps the names of the blocks to the block objects, which may contain # additional information such as assembly instructions or Python byetcode blocks: dict[str, Block] # This is the actual graph, contains a mapping of block names to block # names they point to, i.e. the edges of the graph. Any block name in the # list of block names must also be a key in this mapping. I.e. this mapping # must be self-contained. edges: dict[str, list[str]] # These are any identified back-edges. This must also be self conatined. backs: dict[str, list[str]] # This is the top-level region meta_region: str # This is the region storage. It maps region names to the Region objects, # which themselves contain the header and exiting blocks of this region regions: dict[str, Region] # This is the region tree which stores the hierarchical relationship # between regions. The root node of this tree is the meta-region (once the # graph is fully analysed). region_tree: dict[str, list[str]] ``` Now, importantly, this must be presented in different views. My proposal here is to implement different types of `GraphView`s. A concept taken from NetworkX. The idea is that a view is like a facade (https://en.wikipedia.org/wiki/Facade_pattern) pattern. A class that follows a given (abstract) interface forwards calls and modify things. In this case the API is for a directed graph. A graph structure needs to be able to a) iterate all the items/nodes/vertices (ideally sorted in some fashion) and b) for any given item/node/vertex return the target items/nodes/vertices that it points to. This then allows us to create different views, depending on how we need to see the graph. Looking for nested loops in a loop (subgraph/region)? Use the `AcyclicBlockView`. This will "zoom in" to the loops until it can no longer find any strongly connected components, and thus finding all loops and nested loops. Are you in the meta-region and want to find the top-level head, branch and tail regions, use the `RegionView` to abstract away any exiting top-level loops. Do you need to iterate over all blocks and regions in a certain order and depth such that you can render them? You'll need to implement the `RenderingView` who's `iter` is to be implemented such that it returns the correct elements from the scfg either regions or blocks and their metadata. The `__getitem__` returns the edges from the block-graph so they can be rendered (there are no edges between regions, in the final representation to be rendered). My proposal allows for a fully flexible view model with concrete views that are easy to test and easy to extend or invent depending on whatever lens the graph needs to be seen in. As described above there are already a few different internal use-cases that would make use of these views. The would also serve as a uniform API to access the SCFG which would make it much easier to refactor the internal representation, while maintaining passing test coverage. ```python class AbstractGraphView: def __init__(self, scfg): self.scfg = scfg def __getitem__(self, item): # Is expected to return a list[str, str] where each string must be an # item in the graph raise NotImplementedError def __iter__(self) raise NotImplementedError class CyclicBlockView(AbstractGraphView): def __getitem__(self, item): return self.scfg.edges[item] def __iter__(self): return self.scfg.edges.keys() class AcyclicBlockView(AbstractGraphView): def __getitem__(self, item): return [e for e in self.scfg.edges[item] if e not in self.scfg.backs[item]] def __iter__(self): return self.scfg.edges.keys() class RegionView(AbstractGraphView): def __init__(self, scfg, region): self.scfg = scfg self.region = region self.nested_regions = scfg.region_tree[region] self.blocks_in_region = ... # use one of the views above def __getitem__(self, item): if item in self.scfg.regions: # If the item we are looking at is a region... exiting = self.scfg.regions[item].exiting # Check if the region points to another region, e.g. two loops # behind each other, in that case, return the region name for r in self.nested_regions if exiting == r.header: return [r.header] # Otherwise return the block name, if it is in the region if exiting in self.blocks_in_region: return [exiting] else: return [] else: outs = self.scfg.edges[item] # replace any arcs into a nested region with an arc to region name for i,o in enumerate(outs): for r in self.nested_regions if o == r.header outs[i] = r.header # remove any edges pointing outside the region outs = [e for e in outs if e in self.blocks_in_region] return outs def __iter__(self): # Use the __getitem__ from self.region.header until it no longer # returns anything. pass ```