Ivy, Part 1: Vyper’s Syntax, Semantics, and the Case for ASTs

This article starts a series on the AST interpretation of Vyper. I built [Ivy](https://github.com/cyberthirst/ivy), an AST interpreter for Vyper, with the goal of an easy-to-read **executable specification** of the language in Python. Ivy primarily serves as a fuzzing oracle for differential testing of the Vyper compiler.

In this part we’ll establish core terminology from programming-language implementation. The actual interpreter design will follow in later parts.

## Syntax and Semantics

First, let's talk about syntax and semantics. A program starts out as a string, which is the compiler's input. In the first stage of compilation, this string is transformed into so-called tokens (e.g. `IF` or `FOR` tokens).

Syntactic rules dictate how programs can be structured, i.e. **what token sequences are valid**. These rules are defined through a **grammar**: a set of recursive production rules which dictate how to produce syntactically valid code. A **parser** checks the token sequence against the grammar and builds a tree structure.

Take the grammar rule `IfStmt → 'if' Expr ':' Body ('else' ':' Body)?`. It says: to generate an `if` statement, first generate the `if` token, then an expression, then a colon token, and so on. For example, the sequence `if if a > b: body` is invalid: the grammar requires the `if` token to be followed by an `Expr`, and the second `if` is not an expression.

Example input string to the compiler:

```python
def foo(i: bool):
    a: uint256 = 0
    if i:
        a = 42
```

And a slice of the corresponding lexer tokens:

```
1: NAME 'def' (1,0)-(1,3)
2: NAME 'foo' (1,4)-(1,7)
3: OP '(' (1,7)-(1,8)
4: NAME 'i' (1,8)-(1,9)
5: OP ':' (1,9)-(1,10)
6: NAME 'bool' (1,11)-(1,15)
7: OP ')' (1,15)-(1,16)
8: OP ':' (1,16)-(1,17)
9: NEWLINE '\n' (1,17)-(1,18)
```

Syntax gives us the **structure**; semantics gives **meaning** to the syntactic elements. Semantics is usually split into two types: static and dynamic.
Static semantics are enforced by the compiler at compile time. Compile-time checks include type checking, scoping, name binding, and storage vs memory rules.

Dynamic semantics then define what actually happens when you execute a program that has passed the static checks. They define what it means to do a step in a loop, what the evaluation order is, what comparisons do, or what the gas effects are. For instance, a dynamic rule for an `if` is: evaluate `cond`; if it yields true, execute `body`; otherwise execute `orelse` if there is one; then continue after the `if`.

**Where does Ivy sit?** Ivy delegates lexing, parsing, and static checks to the Vyper compiler front end, then interprets the **typed AST** to model the dynamic semantics. If there are bugs in the front end, then Ivy shares them. We evaluated what types of bugs are serious and where they appear in the compiler, and concluded that most of them are in the dynamic semantics part.

## What's an AST?

Ivy is an AST interpreter, so what is an AST? Compilers work in stages. First they run the *lexer*, which turns the input string (your program) into tokens. Then the parser runs. The parser checks that the token sequence satisfies the rules of the grammar and uses the grammar to transform the sequence of tokens into the abstract syntax tree. It's abstract because it contains no redundant syntax such as whitespace or syntactic sugar.

For Vyper, the root of the AST is the `Module` node. The module has a body. The body may contain nodes like `VariableDef` for storage or immutable variables, or `FunctionDef` nodes; those recursively contain statements and expressions, down to leaves like literals and variable references.

After the AST goes through the compiler front end, the nodes have:

- a node kind (e.g., `Call(func, args, keywords)` for function calls) assigned by the parser, and
- an inferred language type (e.g., `int8`) assigned by the type checker.

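
To make the tree structure and the dynamic `if` rule a bit more tangible, here is a minimal sketch. It relies on assumptions: it uses Python's standard `ast` module as a stand-in for the Vyper front end (the example program happens to be valid Python syntax), so the nodes are Python's, they carry no inferred Vyper types, and `uint256` is just an unchecked annotation. The helpers `node_kinds`, `eval_expr`, `exec_stmt`, and `run_foo` are hypothetical names made up for this illustration; they are not Ivy's or Vyper's API.

```python
import ast

SOURCE = """
def foo(i: bool):
    a: uint256 = 0
    if i:
        a = 42
"""

# Python's parser stands in for the Vyper front end here (an assumption).
module = ast.parse(SOURCE)
func = module.body[0]  # the FunctionDef node for `foo`


def node_kinds(node, depth=0):
    """Return (depth, node kind) pairs for the whole tree, in pre-order."""
    kinds = [(depth, type(node).__name__)]
    for child in ast.iter_child_nodes(node):
        kinds.extend(node_kinds(child, depth + 1))
    return kinds


def eval_expr(node, env):
    """Expressions yield a value: only literals and variable reads are handled."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return env[node.id]
    raise NotImplementedError(type(node).__name__)


def exec_stmt(node, env):
    """Statements yield no value: they update `env` or sequence other statements."""
    if isinstance(node, ast.AnnAssign):      # a: uint256 = 0
        env[node.target.id] = eval_expr(node.value, env)
    elif isinstance(node, ast.Assign):       # a = 42
        env[node.targets[0].id] = eval_expr(node.value, env)
    elif isinstance(node, ast.If):
        # The dynamic rule: evaluate the condition, then run `body` or `orelse`.
        branch = node.body if eval_expr(node.test, env) else node.orelse
        for stmt in branch:
            exec_stmt(stmt, env)
    else:
        raise NotImplementedError(type(node).__name__)


def run_foo(i):
    """Execute the body of `foo` with argument `i` and return the final `a`."""
    env = {"i": i}
    for stmt in func.body:
        exec_stmt(stmt, env)
    return env["a"]


# Print the tree as indented node kinds, then run both branches of the `if`.
for depth, kind in node_kinds(module):
    print("  " * depth + kind)
print(run_foo(True), run_foo(False))  # → 42 0
```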

AST corresponding to the example program:

![image](https://hackmd.io/_uploads/rySmPmEsxl.png)

### Why not IR or Bytecode?

So why should we interpret the AST? 1) The AST is very close to the source language, and 2) it's easy.

One of Ivy's goals is to be the executable spec. By interpreting the AST we get a clear correspondence between each syntactic construct and its semantics. If we lowered the AST, e.g. to bytecode, the correspondence wouldn't stay so clear. Also, translating the AST into bytecode is non-trivial; it would increase the dev time and introduce bugs.

Why do many production interpreters use bytecode instead of the AST? Mainly performance. ASTs are pointer-rich and heterogeneous: walking them hurts locality of reference and causes frequent cache misses, and the page walks add up. Bytecode is linear and compact, a structure more amenable to high performance. In Ivy we don't need speed; we only strive for correctness, and thus the AST is the ideal representation.

### Statements and Expressions

**Statements** and **expressions** are the basic building blocks of all programs. These concepts will come in handy later, so I'll explain them now.

Typically, all AST nodes are either statements or expressions. A function definition is a statement, and so is an `if` node or a `for` loop node.

A statement is a syntactic form that does not yield a value to its surrounding context; it sequences computation or introduces bindings (e.g., `a: uint256 = 10`). An expression evaluates to a value and can appear where a value is expected (e.g., `1`, `a`, `foo(42) // 12`).

## What do I mean by an AST interpreter?

In this series, an AST interpreter is a Python program that traverses Vyper’s typed AST and performs effects according to **Vyper’s dynamic semantics embedded in the EVM model**.

There are some practical problems:

1) The EVM has no concept of the Vyper AST. We'll have to swap bytecode interpretation for Vyper AST interpretation.
2) Vyper semantics live inside EVM semantics. An external call in Vyper corresponds to an EVM message call. To stay faithful, Ivy has to model enough of the EVM to make those interactions meaningful (accounts, storage, logs, call stack, return data).

Ivy must therefore give precise meanings to Vyper constructs (statements like loops or assignments, the storage model, or calls to builtins) and hook them to EVM notions (account creation, calls, logs, reverts, and returns). The practical aspects of this will be discussed later in the series.

## Closing

We covered the essentials: tokens → grammar → AST (syntax), static vs dynamic semantics (meaning), why Ivy interprets the AST (correctness & clarity), and the role of statements vs expressions. Next time we’ll look under the hood of the Ivy interpreter.