"Software Apocalypse"; more symbolic execution

:::warning Announcements - Final project proposal due next Thursday - DM me with questions/concerns about groups - SMT 2 (Symbolic Execution) out Monday - [MIT PL Review](https://plr.csail.mit.edu) workshop Friday April 25 (attending can bump your participation grade) ::: :::info Today - Paper discussion: "The Coming Software Apocalypse" - More symbolic execution - At least 20 minutes of final project group discussion ::: ## Paper discussion: ["The Coming Software Apocalypse"](https://www.theatlantic.com/technology/archive/2017/09/saving-the-world-from-code/540393/) We broke into groups to discuss the following starter questions, plus questions you brought. - Why is software more complex to reason about than something like building a bridge? - Do you agree that “what you see is what you get” interfaces can replace most software use cases? - Who should be held responsible for catastrophic software bugs? # More symbolic execution To understand symbolic execution implementations, we need to answer: how do we actually represent programs as data? This is really a programming languages topic, but we'll need it for the next assignment. ## Programs as data/inputs How would we represent the following snippet of C as data in an analysis program? ```C x = ((1 * 2) + (5 - 1)); ``` ## Representing programs as data Programming language implementations (whatever actually lets you **run** your code) first need to go from the keystrokes we type into our IDE (e.g., strings) to treating the program itself as data, with program representations in memory that are easier to perform analysis on. ## A bit on parsing For any language implementation, the first step is _parsing_: consuming a string, character by character, and producing a program representation in memory. This includes details like, for example, are parentheses balanced? [Parsing](https://en.wikipedia.org/wiki/Parsing) is a rich and interesting subfield, but not necessarily the most useful to understand for working programmers, so we won't focus on it. For SMT 2, we'll use an existing parsing library for C. ## Abstract Syntax Trees (ASTs) Usually, the next program representation after the source string is an _abstract syntax tree_. An AST represents the structure of a program (or pieces of it), with the outermost operators at the top of each subtree. For example, the abstract syntax tree for the warmup snippets: An AST for `((1 * 2) + (5 - 1))` is: ``` + / \ * - / \ / \ 1 2 5 1 ``` `+` is the outer-most operator, so it is the root of the entire tree. `*` and `-` are each the root of their own subtrees. Note that order matters: `1` comes before `2` in the multiplication, so `1` is the left subtree/leaf. The `x =` part may have a more complex representation in an actual representation, but we can think of it as: ``` assign / | var + | / \ x * - / \ / \ 1 2 5 1 ``` After creating the AST, actual implementations have a choice in how to further evaluate this program. There are two primary methods: - An _interpreter_ traverses the AST to compute results **directly** (typically by using the _host_ or _implementation_ language operators). For example, when evaluating `((1 * 2) + (5 - 1))`, an interpreter would first evaluate `(1 * 2)`, then `(5 - 1)`, and finally apply `+` to the results. The `*`, `+`, `-` being invoked would be based on what language the interpreter is written in: e.g. Python's operators. Python is usually interpreted. - A _compiler_ traverses the AST and translates it to a lower-level representation: either bytecode (e.g., Java) or machine code (e.g., C). This lower-level code is then executed by a lower-level system: a language runtime or virtual machine, or directly by the hardware. We can think of symbolic execution as a distinct method of evaluating a program. We can symbolically evaluate programs that are normally compiled or interpreted. We can think of the implementation of symbolic execution engine in two ways: - An interpreter that uses symbolic rather than concrete inputs. - A compiler from the input language to SMT ## Processing input ASTs ASTs are trees. We can think of our symbolic executor as a post-order (children processed before parents, as in our example) tree traversal! :::success From an algorithmic standpoint, what's the easiest way to do a post-order traversal of a tree? ::: <details> <summary>Think, then click! </summary> <br> Via a recursive function! We'll recur on each subterm, then perform the term's operation on the results. </details> <br> For example, if we had an `expr_to_smt` recursive helper, we could implement on case like so: ```python def expr_to_smt(e): if e.op and e.op == "+": # process left child/subexpression left = expr_to_smt(e.left) # process right child/subexpression right = expr_to_smt(e.right) # process the operation itself: here, a symbolic add return left + right ``` We also need to decide on our representation for leaf nodes, or the base cases. More on this next time. ## Distinction between expressions and statements Expressions produce values (e.g., `2`, `1 + 2`). Statements are done for their side effects (e.g., `x = 5`; `if ...`; statements roughly map to `;` or `{}` constructs in C). More on when we start SMT 2 on Monday! # Final part of class: project group discussion Reminder to contact me with any questions about groups/project ideas.