Special thanks go to Charles Cooper, fubuloubu, banteg, and Tanguy Rocher for feedback and review.
gm 🐍snekooors, as you all well know, Vyper is a contract-oriented, Pythonic programming language that targets the Ethereum Virtual Machine (EVM). But do you know about the inner workings of the Vyper compiler itself? This article attempts to shed some light on how the Vyper compiler works and delves into the different phases of the compilation. In other words, we examine what happens beneath the surface when you have a Vyper contract, e.g. Foo.vy, and invoke vyper Foo.vy, which by default outputs the compiled bytecode.
Enjoy the ride 🫡!
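In a nutshell, the Vyper compiler translates the Vyper programming language into EVM bytecode, which is built from (currently) 144 opcodes (including PUSH0). The compilation itself can be broken down into 10 distinct steps (see also the Python module phases.py for a code-level summary):
1. Generate the Vyper AST
2. Validate the literal nodes
3. Fold constants
4. Validate the semantics
5. Generate public variable getters and remove unused statements
6. Set the data positions for storage and immutable variables
7. Generate VyperIR
8. Optimise VyperIR
9. Compile VyperIR into EVM assembly
10. Compile EVM assembly into bytecode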
In the first step, the Vyper source code is converted into an abstract syntax tree (AST). The full logic lives in the Python module vyper.ast. This conversion takes place in several consecutive steps:
1. The source is pre-parsed, e.g. by validating the @version pragma against the current compiler version or translating the keywords enum, event, interface, or struct into the Python keyword class. See pre_parser.py for the full code logic.
2. The function parse_to_ast parses a Vyper source string and generates basic Vyper AST nodes.
3. annotation.py contains the AnnotatingVisitor class, used to annotate and modify the Python AST prior to converting it to a Vyper AST.
4. The Vyper AST nodes themselves are defined in nodes.py.
Next, the literal nodes in the AST are (lightly) validated via validate_literal_nodes in validation.py. Examples that throw at this stage include an address literal foo whose checksum is incorrect: the validation fails because the checksum does not match.
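To make the first two steps a bit more tangible, here is a minimal sketch in plain Python. It is not the code from vyper.ast: the helper names toy_pre_parse and toy_validate_literals are hypothetical, and the two literal rules (an even number of hex digits, at most ten decimal places) are assumptions picked for illustration. It only mimics the idea of rewriting a Vyper-only keyword into class so that Python's own parser can build an AST, and then walking the literal nodes.

```python
import ast
import re


def toy_pre_parse(source):
    """Rewrite Vyper-only declaration keywords to `class` so Python can parse the source."""
    # Only matches the keyword at the start of a line, followed by a name and a colon.
    pattern = r"^(enum|event|interface|struct)(\s+\w+\s*:)"
    return re.sub(pattern, r"class\2", source, flags=re.MULTILINE)


def toy_validate_literals(source):
    """Parse already pre-parsed source and return a list of complaints about number literals."""
    tree = ast.parse(source)
    problems = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Constant):
            continue
        # Skip strings, bools, None, ...: this sketch only looks at number literals.
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            continue
        # Python's AST stores the parsed value, so recover the literal as written.
        text = ast.get_source_segment(source, node)
        if text.lower().startswith("0x") and len(text[2:]) % 2 == 1:
            problems.append(f"line {node.lineno}: hex literal {text} has an odd number of digits")
        elif "." in text and len(text.split(".")[1]) > 10:
            problems.append(f"line {node.lineno}: decimal literal {text} has more than ten decimal places")
    return problems


SOURCE = """
struct Point:
    x: uint256
    y: uint256

a: constant(uint256) = 0x123
b: constant(decimal) = 1.12345678901
"""

for problem in toy_validate_literals(toy_pre_parse(SOURCE)):
    print(problem)
```

The real compiler converts the Python AST into its own node classes before validating; the sketch skips that conversion and inspects the Python nodes directly.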
Now constant folding takes place. Constant folding is a generic compiler optimisation technique that replaces run-time computations with compile-time computations. Thus, the fold function in folding.py replaces references to constants in the AST with their values. Furthermore, constant expressions are evaluated at compile time. Examples are:
- 3+2 becomes 5 (arithmetic operations),
- ["foo", "bar"][1] becomes "bar" (references to literal arrays),
- min(1,2) becomes 1 (built-in functions applied to literals).
Next, the semantics of the program are validated via validate_semantics in __init__.py. The structure and the types of the program are checked, and type annotations are added to the AST. A full discussion of how the semantic validation is carried out is beyond the scope of this article, but it can be broadly divided into 4 steps:
1. The module-level statements are analysed; see ModuleAnalyzer for further insights.
2. Module-wide properties are validated, such as the absence of method ID collisions (see validate_unique_method_ids) or circular references (see the function _find_cyclic_call).
3. The function bodies are analysed; see FunctionNodeVisitor for further insights.
4. The AST is annotated via StatementAnnotationVisitor and ExpressionAnnotationVisitor, which are eventually called to annotate the expression nodes of the AST.
In the next step, getter functions are generated for all public variables via generate_public_variable_getters, and any unused statements are removed from the AST via remove_unused_statements. These two functions are bundled in the Python module expansion.py via the expand_annotated_ast function.
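The constant folding step described earlier lends itself to a small sketch. The following uses Python's own ast module (Python 3.9 or later) rather than vyper.ast, and the names ToyFolder and toy_fold are hypothetical. It only covers the three example shapes mentioned above (integer arithmetic, subscripts of literal lists, and min/max applied to literals) and ignores Vyper's typing rules, so please read it as an illustration of the technique and not as what fold actually does.

```python
import ast
import operator

# Only a handful of operators and built-ins are covered in this sketch.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_BUILTINS = {"min": min, "max": max}


def _is_int(node):
    """True for integer literal nodes (bools are excluded on purpose)."""
    return isinstance(node, ast.Constant) and type(node.value) is int


class ToyFolder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first
        op = _BIN_OPS.get(type(node.op))
        if op is not None and _is_int(node.left) and _is_int(node.right):
            value = op(node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value=value), node)
        return node

    def visit_Subscript(self, node):
        self.generic_visit(node)
        seq, index = node.value, node.slice
        if (
            isinstance(seq, ast.List)
            and _is_int(index)
            and all(isinstance(e, ast.Constant) for e in seq.elts)
            and 0 <= index.value < len(seq.elts)
        ):
            return seq.elts[index.value]  # the selected literal replaces the subscript
        return node

    def visit_Call(self, node):
        self.generic_visit(node)
        if (
            isinstance(node.func, ast.Name)
            and node.func.id in _BUILTINS
            and len(node.args) >= 2
            and not node.keywords
            and all(_is_int(a) for a in node.args)
        ):
            value = _BUILTINS[node.func.id](*(a.value for a in node.args))
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def toy_fold(expression):
    """Fold one expression given as source text and return the folded source text."""
    tree = ast.parse(expression, mode="eval")
    folded = ast.fix_missing_locations(ToyFolder().visit(tree))
    return ast.unparse(folded.body)


print(toy_fold("3 + 2"))
print(toy_fold("x + min(1, 2) * (3 + 2)"))
```

Visiting the children before the parent is what lets nested expressions collapse in a single pass: by the time a node is inspected, its operands are already literals whenever that is possible.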
Now it's time to determine the data positions in storage and bytecode for all storage and immutable variables. The compiler achieves this via data_positions.py.
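As a rough mental model of this step, the sketch below hands out consecutive 32-byte storage slots to storage variables and consecutive byte offsets to immutables. It is not the logic of data_positions.py: the function name toy_layout, the input format, the dictionary keys, and the assumption that every variable starts on a fresh slot (no packing, no slots reserved for anything else) are simplifications made purely for illustration.

```python
import math

SLOT_SIZE = 32  # bytes per EVM storage slot


def toy_layout(storage_vars, immutable_vars):
    """Both arguments are lists of (name, size_in_bytes) in declaration order."""
    layout = {"storage": {}, "immutables": {}}

    slot = 0
    for name, size in storage_vars:
        # Assumption: each variable occupies whole slots and is never packed with a neighbour.
        n_slots = max(1, math.ceil(size / SLOT_SIZE))
        layout["storage"][name] = {"slot": slot, "n_slots": n_slots}
        slot += n_slots

    offset = 0
    for name, size in immutable_vars:
        # Immutables live in the bytecode, so they get a byte offset instead of a slot.
        layout["immutables"][name] = {"offset": offset, "length": size}
        offset += size

    return layout


layout = toy_layout(
    storage_vars=[("owner", 20), ("total", 32), ("blob", 64)],
    immutable_vars=[("FEE", 32), ("ADMIN", 32)],
)
print(layout)
```

The point to take away is the split into two address spaces: storage variables are located by slot number, whereas immutables are located by an offset into the deployed code.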
The Vyper AST is turned into a lower-level intermediate representation (IR) called VyperIR via the function generate_ir_for_module in the Python codegen module module.py. VyperIR is used as a high-level assembly, and its grammar is built from s-expressions of the form (s_expr).
I would like to point out that a new IR design, called venomIR, is in the works. For background, see the original discussion and the main PR.
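Since VyperIR is written as s-expressions, a tiny reader is enough to get a feeling for its shape. The sketch below turns a string such as (seq (mstore 0 (add 1 2)) (return 0 32)) into nested Python lists. The function names tokenize and parse_s_expr are hypothetical, the sample is a hand-written illustration and not real compiler output, and the actual VyperIR has more node forms than this reader distinguishes.

```python
def tokenize(text):
    """Split the text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_s_expr(tokens):
    """Consume tokens from the front and return one atom or one nested list."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse_s_expr(tokens))
        if not tokens:
            raise SyntaxError("missing closing parenthesis")
        tokens.pop(0)  # drop the closing ")"
        return node
    if token == ")":
        raise SyntaxError("unexpected closing parenthesis")
    try:
        return int(token, 0)  # integer literals, decimal or 0x-prefixed
    except ValueError:
        return token  # anything else is kept as a symbol


sample = "(seq (mstore 0 (add 1 2)) (return 0 32))"
print(parse_s_expr(tokenize(sample)))
```

Each list starts with the operation name followed by its arguments, which may themselves be lists. That uniform shape is what makes an IR like this easy to traverse and rewrite.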
The time has arrived for various optimisations of the VyperIR representation via optimizer.py. It's important to note that further constant folding also takes place at this stage (a typical example is the simplification of safemath expressions); see e.g. the function _optimize_binop.
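To illustrate the kind of rewrite that happens at this stage, here is a toy optimiser that works on IR given as nested Python lists (for instance ["add", 1, 2] for (add 1 2)). It is a sketch under simplifying assumptions and not a copy of _optimize_binop: the function name toy_optimize is hypothetical, and only add and mul are handled, by folding literal operands and dropping the identity elements 0 and 1.

```python
UINT256_MOD = 2**256  # EVM arithmetic wraps around at 256 bits


def toy_optimize(ir):
    """Recursively simplify nested-list IR such as ["add", 1, ["mul", "x", 1]]."""
    if not isinstance(ir, list) or not ir:
        return ir  # literals and symbols are already as simple as they get
    op, *args = ir
    args = [toy_optimize(arg) for arg in args]  # optimise the operands first

    if op in ("add", "mul") and len(args) == 2:
        left, right = args
        if isinstance(left, int) and isinstance(right, int):
            result = left + right if op == "add" else left * right
            return result % UINT256_MOD
        identity = 0 if op == "add" else 1
        if right == identity:
            return left  # x + 0 -> x, x * 1 -> x
        if left == identity:
            return right  # 0 + x -> x, 1 * x -> x
    return [op, *args]


print(toy_optimize(["mstore", 0, ["add", ["mul", "x", 1], 0]]))
print(toy_optimize(["add", 1, 2]))
```

Note that the sketch deliberately does not rewrite (mul x 0) to 0: if x had side effects, dropping it would change the behaviour of the program, which is exactly the sort of detail a real optimiser has to be careful about.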
The VyperIR representation is turned into EVM assembly by calling compile_to_assembly in the IR Python module compile_ir.py.
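The core idea of this step can be sketched in a few lines: the EVM is a stack machine, so a nested expression is flattened by emitting the arguments first and the operation last. The code below works on IR given as nested Python lists (for instance ["add", 1, 2] for (add 1 2)). It is not compile_to_assembly: the function name toy_compile_to_assembly is hypothetical, only integer literals, seq, and plain opcodes are supported, and things like labels, jumps, or cleaning up unused stack items are left out entirely.

```python
def toy_compile_to_assembly(ir):
    """Flatten nested-list IR into a flat list of mnemonics and push values."""
    if isinstance(ir, int):
        if not 0 <= ir < 2**256:
            raise ValueError("literal does not fit into 32 bytes")
        n_bytes = max(1, (ir.bit_length() + 7) // 8)
        return [f"PUSH{n_bytes}", ir]
    if not isinstance(ir, list) or not ir or not isinstance(ir[0], str):
        raise ValueError(f"unsupported IR node: {ir!r}")

    op, *args = ir
    assembly = []
    if op == "seq":
        # A sequence simply runs its children one after another.
        for child in args:
            assembly += toy_compile_to_assembly(child)
        return assembly

    # Push the arguments in reverse so that the first argument ends up on top of the stack.
    for arg in reversed(args):
        assembly += toy_compile_to_assembly(arg)
    assembly.append(op.upper())
    return assembly


ir = ["seq", ["mstore", 0, ["add", 1, 2]], ["return", 0, 32]]
print(toy_compile_to_assembly(ir))
```

Pushing the arguments in reverse order matters because an opcode such as MSTORE pops its first operand (the memory offset) from the top of the stack.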
Finally, the EVM assembly is turned into bytecode via assembly_to_evm.
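This last step is essentially a table lookup, which the following sketch tries to show. It is a toy assembler and not assembly_to_evm: the function name toy_assembly_to_evm and the input format (a flat list of mnemonics, with each PUSHn followed by its integer value) are assumptions for illustration, and only a handful of opcodes are included. The opcode values themselves are the standard EVM ones.

```python
# A small excerpt of the EVM opcode table.
OPCODES = {
    "STOP": 0x00,
    "ADD": 0x01,
    "MUL": 0x02,
    "MSTORE": 0x52,
    "PUSH0": 0x5F,
    "RETURN": 0xF3,
}


def toy_assembly_to_evm(assembly):
    """Translate a flat list of mnemonics and push values into raw bytecode."""
    out = bytearray()
    items = iter(assembly)
    for item in items:
        if item.startswith("PUSH") and item != "PUSH0":
            n_bytes = int(item[4:])
            if not 1 <= n_bytes <= 32:
                raise ValueError(f"invalid push width: {item}")
            value = next(items)  # the value that follows the PUSHn mnemonic
            out.append(0x5F + n_bytes)  # PUSH1 is 0x60, ..., PUSH32 is 0x7f
            out += value.to_bytes(n_bytes, "big")
        else:
            out.append(OPCODES[item])
    return bytes(out)


assembly = ["PUSH1", 2, "PUSH1", 1, "ADD", "PUSH1", 0, "MSTORE", "PUSH1", 32, "PUSH1", 0, "RETURN"]
print(toy_assembly_to_evm(assembly).hex())  # 600260010160005260206000f3
```

The real function additionally has to resolve jump destinations and data sections, which is where most of its complexity comes from.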
Let me emphasise that semantic analysis and code generation are the most complex parts of the compilation, and a full deep dive into them is beyond the scope of this article. We leave these deep dives as an opportunity for future articles.
Let's go through the different steps sequentially with an example Foo.vy:
vyper -f ast Foo.vy
As a side note, I prettified the original ast output with a bash script (h/t Tanguy Rocher).
vyper -f layout Foo.vy
VyperIR Representation via vyper -f ir_runtime Foo.vy
For the contract creation code IR, you can simply invoke vyper -f ir Foo.vy.
vyper -f asm Foo.vy
vyper -f bytecode Foo.vy
vyper -f bytecode_runtime Foo.vy
Ok, you've come a long way, but you made it. Congratulations 🎉!
Ok, fair, I do understand that this is a lot to digest, but what would life be without challenges 🤓. I still hope this was an insightful read! The best way to see if you have grasped everything is to go to Vyper's main repository (now 😉) and tackle an existing issue or ask King Charles for advice on where you can best contribute. Furthermore, if you simply help others understand how this complex machine works, you are already adding a lot of value. See you deep down in the issues and PRs 🫡!