# Static detection of data dependency cycles

---

# Introduction

Given a program written in [*Relang*](https://github.com/pmikolajczyk41/relang/blob/main/relang/grammar.pest), we want to check in the preprocessing (static analysis) phase whether there is any chance of creating a data dependency cycle.

Regarding the language itself, what actually matters is:
- **Language constructs:**
  - nested function declarations
  - returning a value from a function
  - variable declarations with an initializer that is a combination of other variables' values, constants and function call results
  - no variable mutation
- **Visibility:**
  - a variable is visible only after its declaration
  - a function is visible in the whole scope of its declaration
  - functions and variables defined in the ancestor functions are visible (with shadowing)

The simplest example of a data dependency cycle is:
```
let x = x; // assuming that no `x` has already been defined in some ancestor function
```
While this easy case could have been detected by the name resolution procedure, our goal is to also catch situations like:
```
fn f() {
    return g();
}

let x = f();

fn g() {
    return x;
}
```
Here we obviously cannot compute `x`, since its value depends on itself, yet this time we comply with the name resolution rules. Hence, we need another process for detecting cyclic dependencies.

Note, however, that we are not trying to discover *any* infinite computation. For example, in:
```
fn f() {
    let x = g();
    return x;
}

fn g() {
    let y = f();
    return y;
}
```
calling `f()` would never finish, but it is not a data dependency that is to blame.

# Definitions and assumptions

**Assumption**: Every function and variable has a unique name (any program can be trivially converted to meet this rule).

**Assumption**: Every function has a single special variable that is used for returning from it (even across different control flow branches). While this isn't achievable with the language's grammar, any program can be easily modified to be compliant.
For example:
```
fn original_function() {
    if _ {
        return a() · b · c();
    } else {
        return d · e();
    }
}

-------------------------

fn converted_function() {
    let return_var;
    if _ {
        return_var = a() · b · c();
    } else {
        return_var = d · e();
    }
    return return_var;
}
```

**Definition [*function forest, function level*]**: Functions form a natural forest according to their nesting. From that, every function inherits a level, i.e. its depth in that forest.

**Definition [*frame*]**: A frame is a tuple containing:
1. some unique ID
2. the function name
3. the call data (arguments passed to the function)
4. the list of variable names defined in this function

**Definition [*frame tree*]**: Executing a program yields a tree of frames according to the function calls that were made. Similarly, a static analysis of the program code yields a tree of frames according to any *possible* function calls. This tree might be infinite, but it will be regular. The only difference is that instead of having concrete values for the call data, we will have languages describing all possible arguments (based on the callee's signature and, optionally, some auxiliary analysis for narrowing the domain). For the rest of this document, we will be working with static frame trees (the ones that can be constructed before runtime).

**Definition [*item*]**: An item is a pair: (variable identifier, frame identifier). Obviously, we require that the variable belongs to the frame.

As one might expect, we will be searching for dependency cycles on items. So the last thing to do is to formally define the dependency relation.

**Definition [*dependency relation*]**: We define a relation on the set of variable identifiers.
For every variable declaration instruction:
```
let x = func(expr) · eq_var · arg3 · const · var;
```
we add to our relation the pairs `(x, y)`, where `y` is:
- any variable name used in the expression `expr`
- the return variable of `func`
- the variable `var`

(We ignore both equation variables and the current function's arguments, as they cannot contribute to a dependency cycle.)

While the relation is defined only on variable identifiers (so that it is computable statically), we can extend it to items in a natural way. Let `(x, fr)` and `(x', fr')` be two items. They are in the dependency relation if and only if:
- `(x, x')` is in the base relation
- if `x'` is a return variable, then `fr'` (in the frame tree) must be either a child of `fr` or the closest ancestor of `fr` corresponding to the callee (it might be `fr` as well)
- if `x'` is just a variable, then `fr'` is the closest ancestor of `fr` containing `x'` (it might be `fr` as well)

To avoid confusion, we refer to the first relation as the *base relation*.

# Problem statement

Given the definitions above, we can now formally state the main problem: given a program (complying with all the assumptions) and its (static) frame tree, decide whether there exists any cycle on its items (along the dependency relation).

## Examples

Let us take a look at the example mentioned earlier (modified a bit):
```
                     | <simplified code>
fn f() {             | fn f() {
    let f_ret = g(); |     return g();
    return f_ret;    | }
}                    |
                     |
let x = f();         | let x = f();
                     |
fn g() {             | fn g() {
    let g_ret = x;   |     return x;
    return g_ret;    | }
}                    |
```
The base relation contains: `(x, f_ret)`, `(f_ret, g_ret)` and `(g_ret, x)`. (This obviously yields a cycle, but this alone might not witness a data dependency cycle!)
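Checking the base relation for cycles is mechanical. Below is a minimal Python sketch (the adjacency-map encoding and the `has_cycle` helper are our own illustration, not part of Relang's tooling); keep in mind that a cycle in the base relation alone does not yet witness a data dependency cycle:

```python
# Base relation of the example above, as an adjacency map:
# an edge x -> y means "x depends on y".
base_relation = {
    "x": ["f_ret"],
    "f_ret": ["g_ret"],
    "g_ret": ["x"],
}

def has_cycle(relation):
    """Detect a cycle in a directed graph via DFS with three-colour marking."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in relation}

    def visit(v):
        colour[v] = GREY
        for w in relation.get(v, []):
            c = colour.get(w, WHITE)
            if c == GREY:           # back edge: cycle found
                return True
            if c == WHITE and visit(w):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in list(relation))

print(has_cycle(base_relation))  # → True
```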
The frame tree is quite simple:
```
---------------------------
| frame 0, <unnamed>, [], |  // frame id, fn id, call data
| [x]                     |  // variable ids
---------------------------
            |
---------------------------
| frame 1, f, [],         |
| [f_ret]                 |
---------------------------
            |
---------------------------
| frame 2, g, [],         |
| [g_ret]                 |
---------------------------
```
The corresponding item relation contains (we write `i` instead of `frame i`):
- `((x, 0), (f_ret, 1))`
- `((f_ret, 1), (g_ret, 2))`
- `((g_ret, 2), (x, 0))`

which clearly contains a cycle.

For contrast, we can construct a program with the same (cyclic) base relation, but without an actual data dependency cycle:
```
                     | <simplified code>
fn f() {             | fn f() {
    let f_ret = g(); |     return g();
    return f_ret;    | }
}                    |
                     |
fn g() {             | fn g() {
    let x = f();     |     let x = f();
    let g_ret = x;   |     return x;
    return g_ret;    | }
}                    |
```
Although again `(x, f_ret)`, `(f_ret, g_ret)` and `(g_ret, x)` belong to the base relation, the frame tree doesn't witness any cycle:
```
-----------------
| i, f, [],     |
| [f_ret]       |
-----------------
        |
------------------
| i + 1, g, [],  |
| [x, g_ret]     |
------------------
        |
------------------
| i + 2, f, [],  |
| [f_ret]        |
------------------
        |
------------------
| i + 3, g, [],  |
| [x, g_ret]     |
------------------
       ...
```
The item relation contains the pairs `((f_ret, i), (g_ret, i+1))`, `((g_ret, i+1), (x, i+1))` and `((x, i+1), (f_ret, i+2))` for every frame `i` corresponding to `f`. The frame indices strictly increase along any chain, so this doesn't yield any cycle.

**Note**: There might be cases where the cycle is not realizable - for example, there may not exist any starting argument that would cause a cycle at runtime. However, we must be conservative and not employ any argument / control-flow analysis, as in general this reduces to the halting problem.

# Solution

We need to find a 'static' way of representing the dynamic (runtime) structures (namely, the frame tree and the dependency relation on items). We base our solution on the notion of *good paths*.
**Definition [*good path*]**: We define a *good path at level `k`* as a finite walk along the base relation (i.e. a sequence of variable identifiers consistent with the dependency relation), such that:
- all the variables come from functions at levels at least `k`
- all the variables from level `k` belong to the same function
- the first variable comes from level `k`
- none of the variables from the top (`k`) level, excluding the first one, is a return variable
- between these variables, the sequence is a concatenation of good paths at level `k+1`

In other words, a good path `p` is a sequence of variables for which there exists a function `f` at level `k`, such that `p = (v0,) · q0 · (v1,) · q1 · ... · (vn,) · qn`, where:
- `v0`, `v1`, ..., `vn` are variables declared in `f`
- `v1`, ..., `vn` are not the return variable of `f`
- `q0`, `q1`, ..., `qn` are concatenations of good paths at level `k+1`
- all variables in `q0`, `q1`, ..., `qn` come from functions at levels at least `k+1`

**Definition [*good cycle*]**: A good cycle is a good path whose first and last variable are the same. What is more, none of the top-level variables (including the first one) is a return variable.

**Theorem**: The item dependency relation contains a cycle iff there exists a good cycle.

As described in the **Algorithm** section, we can run a polynomial procedure that statically analyzes the program and detects whether there is any good cycle.

# Proof of the theorem

## Good cycle → cyclic data dependency

**Observation**: Let `u`, `v` be variables and let `fr` be a frame containing `u`. If `(u, v)` belongs to the base relation, then there exists exactly one frame `fr'` such that the items `(u, fr)` and `(v, fr')` are in the dependency relation.
*Proof: It follows directly from the way we extend the base relation to items - no matter whether `v` is a return variable or a standard one, we are guaranteed to find a corresponding frame `fr'` witnessing the dependency.*

We are given a good cycle at some level `k`: `(v0, v1, v2, ..., vn, v0)`. Let `f` be the function at level `k` which contains the declaration of `v0`. Our goal is to find a set of frames such that `((v0, fr0), (v1, fr1), (v2, fr2), ..., (v0, fr0))` is an item dependency cycle.

Let `fr0` be any frame that corresponds to `f` (here we assume that `f` is reachable from the root - `main` - function; the other case is not interesting in this document). According to the observation above, we can start our traversal from `(v0, fr0)` and simply follow the transitions from the good cycle. This way we will surely end up in some item `(v0, fr')`. Since none of the variables from function `f` that appear in our cycle is a return variable, we are sure that `fr0 = fr'` - we couldn't have made any call to `f` which would move us to another frame for `f`.

## Item cycle → good cycle

### Setting

Let `((v0, fr0), (v1, fr1), ..., (vn, frn), (v0, fr0))` be an item dependency cycle. Let us also assume that it is a simple cycle - otherwise we can reduce it to a simple sub-cycle. Let `k` be the highest (smallest in value) level of any function that corresponds to any of these frames. By shifting the cycle, we ensure that `v0` belongs to such a function, and let us call this function `f`. Our goal is to show that `(v0, v1, ..., vn, v0)` is a good cycle at level `k`.

### Single top function

In this section we will show that an item dependency cycle can never leave the subtree of `f` (in terms of the function nesting tree). This will ensure that all top-level variables in `(v0, v1, ..., vn, v0)` belong to the same function `f`. What is more, we will prove that none of them is a return variable.

**Observation**: Let `f1`, `f2`, ..., `fi` be a family of directly nested functions (i.e.
`f2` is defined directly in `f1`, `f3` is defined directly in `f2`, and so on). Let `fri` be a frame in the frame tree corresponding to `fi`. On the path from `fri` to the root we will find, successively, frames corresponding to all of `f{i-1}`, ..., `f2` and `f1` (in that order, but possibly with some other frames in between).

*Proof: By induction on `i`. It is obvious for `i` = 1. For `i` = 2, notice that `f2` can be called only from `f1` or from a function that is a descendant of `f2`. However, if we consider the first call to `f2`, then it must have been made from `f1` (we can only call functions at most one level deeper than the current one). Hence, there must surely be some frame `fr1` corresponding to `f1` above `fr2`.*

*For `i` > 2 we directly use the induction hypothesis, first for `f2`, `f3`, ..., `fi` and then for `f1`, `f2`.*

**Observation**: `v0` cannot be a return variable.

*Proof: Otherwise, we would not be able to come back to `fr0`.*

**Observation**: None of the variables from the item dependency cycle that come from `f` is a return variable. In other words, `f` is never called while traversing such a cycle.

*Proof: Assume otherwise and let `vi` be the last return variable from `f`. That means that the suffix of the sequence looks like: `(vi, fri), ..., (vn, frn), (v0, fr0)`, where `fri` ≠ `fr0`.*

*Notice that all the variables `vi`, `v{i+1}`, ..., `vn` must come from functions in the subtree of `f` - otherwise, in order to come back to `f`, we would have to call it directly, which contradicts the choice of `i`. Let us take a look at the frame stack for this suffix (a path in the corresponding part of the frame tree).*
```
--------------------
| fr0, f, _        |
| [v0, vi, ...]    |
--------------------
        ...
--------------------
| fri, f, _        |
| [v0, vi, ...]    |
--------------------
        ...
--------------------  // there must be some frame below `fri`: if we do not call
| frj, _, _        |  // anything from `fri`, then we would stay there forever and
| [vj, ...]        |
                      // thus never return to `fr0`; notice that we cannot reach
--------------------  // any variable from some ancestor function, because we are
                      // already at level `k`
```
*From one of the frames (weakly) below `fri` we must start reaching `fr0` using at least one transition that isn't a function call, i.e. through a variable reference. However, by the first observation in this section, every such variable that we could reference appears (at least once) below `fri` and shadows all its appearances above `fri`. Therefore we cannot come back to `fr0`.*

*In other words - from any `frj` below `fri` we are not able to reach any of the frames above `fri`.*

**Observation**: We never leave the subtree of `f`.

*Proof: Assume otherwise. We can only leave the subtree of `f` by calling some sibling function of `f` (we cannot reach any higher than level `k`). However, in order to return to `fr0` (or, in general, to the subtree of `f`), we would have to call `f` directly (`f`'s descendants are not visible), which contradicts the previous observation.*

### Good form

The last thing to show about `(v0, v1, ..., vn, v0)` is that the parts between variables from `f` are concatenations of good paths at level `k+1`. Once this is shown, the proof of the theorem is complete.

Let `a` and `b` be consecutive variables from `f` in our sequence, and let `u1, u2, ..., um` be the variables between them. For `m` = 0, we are done right away. Similarly, for `m` = 1, it is easy to see that `u1` must be a return variable of some child function of `f` at level `k+1`. From now on, we assume that `m` > 1.

We can greedily partition this subsequence into good paths at level `k+1` as follows. Throughout the procedure we maintain a list of good paths at level `k+1`. Initially, there is only one path: `(u1,)`. We start processing the sequence from `u2`. For every encountered variable `u` we check:
1.
   if it is a return variable of some function at level `k+1`: we create a new path `(u,)`, append it to our list, and continue
2. otherwise, we append it to the current (last) path and continue

In every such path it is clear that all the top-level variables belong to the same function at level `k+1` and none but the first one is a return variable. In order to show that the variables between them form good paths at level `k+2`, we use the same (recursive) argument.

# Algorithm

We can naturally derive a polynomial algorithm for detecting good cycles during static program analysis. The procedure runs in rounds - one round for every level of the function nesting tree. In every round we 'uncover' a new layer of the tree (precisely, after the `i`th round we will have uncovered the deepest `i` levels). Together with the functions, we uncover some of the base relation edges, but not all of them. At the end of a round we run a simple BFS on the visible relation edges to detect any cycle - if we find one, it is a good cycle at the corresponding level.

In the first round we uncover the deepest level together with all induced relation edges. In every consecutive round we uncover:
- all the missing edges induced on the already visible variables
- all functions from the next level
- all edges going out of a fresh variable (those that target a fresh variable but come from an old one remain hidden)

The correctness of the algorithm follows directly from its construction and the theorem from the previous section.
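To make the round structure concrete: the procedure visits the levels of the function forest from the deepest one upwards. Below is a minimal Python sketch of deriving that round order from the nesting structure; the child-map encoding and all names (`children`, `levels_by_round`, the `main`/`f`/`g` example) are hypothetical illustrations, not part of any actual implementation:

```python
from collections import defaultdict

# Function forest as a child map: each function maps to the functions
# declared directly inside it. Roots are the functions nested in nothing.
# (A hypothetical encoding of the first example: `f` and `g` inside `main`.)
children = {"main": ["f", "g"], "f": [], "g": []}

def levels_by_round(children):
    """Group functions by their depth in the function forest, deepest
    level first - matching the round order of the algorithm."""
    nested = {c for cs in children.values() for c in cs}
    roots = [fn for fn in children if fn not in nested]

    # Assign every function its level (depth in the forest).
    level = {}
    stack = [(root, 0) for root in roots]
    while stack:
        fn, depth = stack.pop()
        level[fn] = depth
        for child in children.get(fn, []):
            stack.append((child, depth + 1))

    by_level = defaultdict(list)
    for fn, depth in level.items():
        by_level[depth].append(fn)
    # Round 1 uncovers the deepest layer, so emit deepest level first.
    return [sorted(by_level[d]) for d in sorted(by_level, reverse=True)]

print(levels_by_round(children))  # → [['f', 'g'], ['main']]
```

Each emitted group corresponds to one round of the algorithm: the functions (and their variables) uncovered in that round, after which the cycle check runs on the edges visible so far.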