Introduction to Dataflow

# Introduction to Dataflow *This doc assumes that you are familiar with the basic constructs and type definitions of Mir.* The Mir dataflow framework is a tool for helping you answer certain analysis questions about Mir. That means to understand how to use this tool, we need to primarily answer two questions: 1. What are the possible analysis results that can be returned from a dataflow analysis? 2. What are the functions/types that I have to define/implement in order to use the dataflow analysis framework? We'll answer those questions two questions in order: ### What are the analysis results returned by a dataflow analysis? Throughout this doc, I'll go through two examples: `MaybeStorageLive` and a simplified version of `DataflowConstProp`. The `MaybeStorageLive` analysis wants to know which locals have "live storage," and when. In other words, the type of the analysis result is something like: ``` type StorageLiveAnalysisResult = Map<mir::Location, BitSet<Local>>; ``` If `result[location][local]` is false, that means that `local` does not have live storage at `location`. If it's true, then it may or may not have live storage (this is why it's called "maybe" storage live). For `DataflowConstProp`, the result type looks like this: ``` enum MaybeConstant { Unknown, Constant(mir::Const), Uninitialized } type ConstPropAnalysisResult = Map<mir::Location, Map<Local, MaybeConstant>>; ``` If `result[location][local]` is `Uninitialized`, then that the local is definitely uninitialized at the given point in the CFG. If it's `Constant(c)`, then the value of that local is definitely the constant `c`. If it's `Unknown`, then nothing can be known. In general, there are a properties that the type of the analysis result must satisfy. First off, dataflow analysis in rustc always produces one analysis result for each `mir::Location`. In other words, the analysis result must always look like `Map<mir::Location, Foo>`. We call this `Foo` the "analysis domain." The two examples above both have an analysis domain that is of the form `Map<Local, Bar>`. That is common, but not required. Furthermore, every analysis domain must be what is known as a "join semi-lattice." Concretely, this means it needs to satisfy two properties: 1. There must be a way to "join" two analysis domains, producing a third one that is more conservative than both of the inputs. - For `MaybeStorageLive`, this means we need a way to combine two `BitSet<Local>`s in a way that yields a more conservative result. We do this as follows: For each local, if either of the inputs say that that local may be storage live, then the output says that it will as well. Thinking about this a little bit, we can see that that's just the union operation on bitsets. - For `DataflowConstProp`, we also combine the analysis results on a per-local basis, but the logic is a bit more complicated: ```rust match (first_analysis_local, second_analysis_local) { (Uninitialized, Uninitialized) => Uninitialized, (Uninitalized, Constant(c)) => Constant(c), (Constant(c), Uninitialized) => Constant(c), (Constant(c), Constant(c)) => Constant(c), _ => Unknown, } ``` Mostly, this is doing the same thing though. It's finding the most conservative result for the local that represents the results of both the input locals. 2. There must be a "bottom" value for the analysis domain - the bottom value is always the least conservative value, and represents the strongest analysis result. For `MaybeStorageLive`, this is "nothing has live storage". For `DataflowConstProp` this is "everything is uninit." In rustc, the relevant trait is [`AnalysisDomain`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir_dataflow/framework/trait.AnalysisDomain.html#associatedtype.Domain), although that trait also contains some things we haven't gotten to yet. ### What are the function/types you have to implement? Once you've chosen the analysis domain for your analysis, the dataflow framework only requires you to implement two functions before you can start getting results out. The first of these is that you have to specify what the analysis results at the beginning of the start block should be. This is typically pretty simple: For `MaybeStorageLive`, those locals that always have live storage have live storage. Everything else doesn't. For `DataflowConstProp`, function arguments have an `Unknown` value. Everything else is `Uninitialized`. The second is a bit more complicated; it's known as the "transfer function" and it's what defines the main logic of your analysis. The transfer function defines how your analysis result changes across a statement or terminator. It's easiest to explain with examples: For `MaybeStorageLive`: ```rust= fn transfer_statement(statement: mir::Statement, domain: &mut BitSet<Local>) { match statement { StorageLive(l) => { domain.insert(l); } StorageDead(l) => { domain.remove(l); } _ => (), } } fn transfer_terminator(_: mir::Terminator, _: &mut BitSet<Local>) { // Storage liveness can't be changed in terminators } ``` For `ConstPropDataflow`, the logic is too complicated to write out completely, but here's pseudocode for a part of the logic: - If the statement/terminator assigns to a local `l`: - If the RHS of that assignment is a constant `c`, set the state of `l` to `Constant(c)` - If the RHS of that assignment is another local `k`, set the state of `l` to a copy of the state of `k`. - If the RHS of that assignment is any other value, set the state of `l` to unknown. In order to be sound, the analysis also needs to do some handling of assignments to subplaces of locals (`_l.field`), but I'll skip that here. In rustc, the transfer function is found in the required methods on the [`Analysis`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir_dataflow/trait.Analysis.html) trait. ### How is this implemented? So how does the framework actually compute a result, given these definitions? Unfortunately, going through the actual implementation would make this doc unacceptably long, so I won't do it. Instead, I'm going to give a very brief description of how this could be implemented. This description is only intended to show a correct algorithm. This algorithm is extremely slow, and rustc is significantly smarter. 1. Make a `results: Map<mir::Location, AnalysisDomain>`. Initialize everything with the bottom value, except the location at the beginning of the start block, which gets the value the user provided for it. 2. For each `mir::Location` except the one at the beginning of the start block: - If the location is not the beginning of a block, set the result at that location to the output of calling the transfer function on the result for the previous location. - If the location is at the beginning of a block, set the result to the output of `join`ing the results at the ends (ie after the terminators) of all the predecessor blocks. 3. Repeat 2, iterating until you reach a fixed point. The iteration order in the second step is not specified. It turns out not to matter for correctness (it matters greatly for performance of course). ### What about direction? `MaybeLiveLocals` (different from `MaybeStorageLiveLocals`) is an analysis that attempts to answer "for which locals is the value in that local going to be read later." Locals with values that are going to be read again later are called "live," otherwise they're called "dead." Trying to use this framework to implement that analysis runs into a problem though; the transfer function is going the wrong way. The transfer functions we've looked at so far specify the state of the world after a statement in terms of the state of the world before the statement. `MaybeLiveLocals` wants to do the exact opposite - the analysis result for a given local depends on what is going to happen to that local in the future, not what has happened to it in the past. rustc supports this by letting you specify a [direction](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir_dataflow/framework/trait.AnalysisDomain.html#associatedtype.Direction) on the analysis domain. Setting the direction to backwards essentially just makes the entire algorithm reverse direction. ### What about `GenKillAnalysis` Many analyses satisfy these two conditions: 1. The analysis domain is a `BitSet<Whatever>`. 2. The transfer function does not have to *read* from the previous result, only write to it. `MaybeStorageLiveLocals` is such an analysis. `MaybeLiveLocals` is too. `ConstPropDataflow` is not - it fails both conditions. Analyses which satisfy these conditions can be written via the [`GenKillAnalysis`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_mir_dataflow/framework/trait.GenKillAnalysis.html) trait. That makes them somewhat easier to write, but the primary benefit of a `GenKillAnalysis` is actually that it allows for a significantly more performant implementation of the main algorithm.