# CIP-XXX: Multi-lambdas in Plutus Core

## Abstract

This CIP proposes adding lambdas and applications which process multiple arguments at once to Plutus Core. This allows the evaluator to process them more efficiently, enabling faster programs.

## Motivation

Plutus Core is a small functional programming language. As such, its most fundamental building blocks are _lambda abstractions_ and _applications_. Since Plutus Core is so small, improvements to its fundamental pieces can have significant effects on program runtime. Most of what Plutus Core programs _do_ is to apply lambda abstractions! If we can make that faster, even if only in some cases, that can be a big deal.

Today, Plutus Core has _unary_ lambda abstractions and applications. That means that programs can contain functions that take _one_ argument, and apply those to _one_ argument at a time. Functions that take multiple arguments are defined by _currying_, i.e. by writing nested unary lambdas and nested unary applications. This costs us no expressive power, but since many functions that people actually want to use take multiple arguments, we might expect that we could improve things by having a dedicated way to do that.

Indeed, specific support for non-unary functions is a staple amongst production programming languages, even those that are very committed to currying and partial application, such as Haskell.

## Specification

See Appendix 1 for changes to Typed Plutus Core.

We _modify_ the term constructs for lambda abstractions and applications as follows:

- Lambda abstractions now have a _non-empty list_ of names that they bind
- Applications now have a _non-empty list_ of arguments to apply

In the concrete syntax of Plutus Core, that looks like this:

```
(lam n1 ... nn t)
[ f a1 ... an ]
```

### Term reduction

These rules are given as extensions of the existing Plutus Core specification term reduction rules. See Appendix 3 for a discussion of one way of implementing them in a CEK machine.
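To make the rules that follow concrete, here is a small executable Haskell sketch of the modified syntax together with the three beta rules given below. All names (`Term`, `subst`, `step`, `eval`) are illustrative, not the production implementation; substitution is naive and assumes the closed example terms used here.

```haskell
{-# LANGUAGE LambdaCase #-}
module Main where

-- A toy AST with multi-lambdas and multi-applications.
data Term
  = Var String
  | Lam [String] Term   -- multi-lambda; non-emptiness is assumed, not enforced
  | App Term [Term]     -- multi-application
  | Con Integer
  deriving (Eq, Show)

-- Naive substitution; adequate for the closed example terms used here.
subst :: [(String, Term)] -> Term -> Term
subst env = \case
  Var n     -> maybe (Var n) id (lookup n env)
  Lam ns t  -> Lam ns (subst [p | p@(n, _) <- env, n `notElem` ns] t)
  App f as  -> App (subst env f) (map (subst env) as)
  c@(Con _) -> c

-- One head-reduction step, mirroring BETA_exact, BETA_over and BETA_partial.
step :: Term -> Maybe Term
step (App (Lam ns t) as)
  | length as == length ns =                -- BETA_exact
      Just (subst (zip ns as) t)
  | length as > length ns =                 -- BETA_over: spill the extras
      Just (App (subst (zip ns as) t) (drop (length ns) as))
step (App (App f@(Lam _ _) as) bs) =        -- BETA_partial: accumulate arguments
      Just (App f (as ++ bs))
step _ = Nothing                            -- under-applications stay stuck

-- Iterate head reduction until stuck.
eval :: Term -> Term
eval t = maybe t eval (step t)

main :: IO ()
main = print (eval (App (App (Lam ["x","y"] (App (Var "x") [Var "y"]))
                             [Lam ["z"] (Var "z")])
                        [Con 2]))
-- prints: Con 2
```

Note that under-applied lambdas are deliberately left alone by `step`, matching the decision below to treat them as the semantic values for partially-applied functions.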
The rule for reducing the parts of an application is straightforward, although see Appendix 2 for a discussion of evaluation order.

$$
\mathrm{APPLY}\frac{f \rightarrow f',\ a_i \rightarrow a'_i}{[f\ a_1 \ldots a_n] \rightarrow [f'\ a'_1 \ldots a'_n]}
$$

We add a reduction rule for reducing applications where the number of arguments matches exactly.

$$
\mathrm{BETA_{exact}}\frac{n=m}{[(\mathrm{lam}\ n_1 \ldots n_n\ t)\ a_1 \ldots a_m] \rightarrow [n_1 \mapsto a_1 \ldots n_n \mapsto a_n]t}
$$

And also one for reducing applications which have too many arguments.[^overapply]

[^overapply]: These rules are very similar and could be merged, except that we don't have nullary applications, and so we need to produce a term without an application in the exact case.

$$
\mathrm{BETA_{over}}\frac{n<m}{[(\mathrm{lam}\ n_1 \ldots n_n\ t)\ a_1 \ldots a_m] \rightarrow [[n_1 \mapsto a_1 \ldots n_n \mapsto a_n]t\ a_{n+1}\ldots a_m]}
$$

We do not provide a rule for under-applications in isolation; we leave them as stuck terms and treat them as our semantic values for partially-applied functions. As such, we include a rule to accumulate additional arguments until the lambda is fully satisfied.[^syntax-directed]

[^syntax-directed]: It may not be obvious that these rules are syntax-directed: in the case of a nested application where the inner application matches $\mathrm{BETA_{exact}}$ or $\mathrm{BETA_{over}}$, we could apply either of those rules, or we could apply $\mathrm{BETA_{partial}}$ first. However, both of these produce the same result in the end, so we can choose either approach.

$$
\mathrm{BETA_{partial}}\frac{}{[[(\mathrm{lam}\ n_1 \ldots n_n\ t)\ a_1 \ldots a_x]\ a_{x+1} \ldots a_m ] \rightarrow [(\mathrm{lam}\ n_1 \ldots n_n\ t)\ a_1 \ldots a_m]}
$$

## Rationale

### Why do we expect this to be faster?
A priori, evaluating a nested lambda with a nested application has to do more work than evaluating a multi-lambda with a multi-application, because it does not know a) that the nested lambdas _are_ lambdas and not other terms, and b) that it has all the arguments already.

Consider the following example: `[[(lam x (lam y t)) a] b]` vs `[(lam x y t) a b]`

1. When we start evaluating the body of `(lam x ...)`, we do not _know_ that it is a lambda. It could be any arbitrary term that will _evaluate_ to a lambda, e.g. `(lam x [(lam z (lam y t)) (con 1)])`. Whereas with the multi-lambda we _know_ that we have two arguments to pass and nothing else.
2. Because we need to evaluate the body of `(lam x ...)` and it could be an arbitrary term, we need to construct an appropriate environment for that evaluation that contains _only_ the addition of `x` mapped to `a`. After doing that evaluation (trivially), we will then extend the environment again to get a combined environment that maps `x` to `a` and `y` to `b`. Whereas with the multi-lambda we _only_ need to construct the combined environment.

Another way of putting this is that we know that we don't need to do any evaluation work until the function has all of its arguments, which saves us some overhead. Furthermore, we may be able to construct environments more efficiently if we are able to add entries to them in bulk.

However, we will gain some countervailing overhead. Because we still want to support partial application of multi-lambdas, we need to manage the accumulation of arguments in an efficient way that doesn't cancel out the gains we got before. For example, consider `[[(lam x y t) a] b]`. We would like this to be as efficient as `[[(lam x (lam y t)) a] b]` is today, even though it will need to do some book-keeping to accumulate the arguments until it has both `a` and `b` available.

### Performance assessment

We implemented a prototype version of multi-lambda support.
We ran our standard `nofib` benchmark suite testing the following scenarios:

1. Baseline, using the non-multi-lambda implementation
2. Just switching to the multi-lambda implementation but not changing anything else (so all lambdas and applications remain unary)
3. As 2, but with an optimization pass that aggressively translates nested lambdas and applications into multi-lambdas and multi-applications

At the moment 2 is >2x slower than 1, and 3 is >1.5x slower than 1 :( Investigation continues.

## Appendices

### Appendix 1: Changes to Typed Plutus Core

The changes to Typed Plutus Core are trivial and mirror those for Untyped Plutus Core almost exactly. The main difference is that lambdas require types for each argument.

```
(lam n1 ty1 ... nn tyn t)
[ f a1 ... an ]
```

The typing rules for lambdas and applications change as follows.

$$
\frac{\Gamma, n_i : ty_i \vdash t : rty}{\Gamma \vdash (\mathrm{lam}\ n_1\ ty_1 \ldots n_n\ ty_n\ t) : ty_1 \rightarrow \ldots \rightarrow ty_n \rightarrow rty}
$$

$$
\frac{\Gamma \vdash f : ty_1 \rightarrow \ldots \rightarrow ty_n \rightarrow rty,\ \Gamma \vdash a_i : ty_i}{\Gamma \vdash [f\ a_1 \ldots a_n] : rty}
$$

Term reduction is the same.

### Appendix 2: Evaluation order

Strictly speaking, we have said everything we need to say about evaluation already. Since Plutus Core has only one effect, failure, and it is not observable _which_ failure was triggered, there is no need to specify evaluation order. However, it is helpful to specify this for three reasons:

1. It affects the implementation of evaluators, such as the CEK machine used by the Haskell Cardano node.
2. The current Plutus Core implementation does have a tracing effect, which allows evaluation order to be observed. This is not specified and cannot affect validation, so in a sense it doesn't matter, but it would still be nicer to pin down how it should behave.
3. We might conceivably add other observable effects and make evaluation order observable later.
There are at least the following options for evaluation order:

1. Evaluate all the children of an application before doing any substitutions.
   a. Evaluate the function and the arguments from left to right
   b. Evaluate the function and the arguments from right to left
   c. Evaluate the function and the arguments in some other order
2. Evaluate only as many arguments as are needed in order to start reducing the function, leaving the others unevaluated until they are needed. This implies evaluating the function first.
   a. Evaluate the needed arguments from left to right
   b. Evaluate the needed arguments from right to left
   c. Evaluate the needed arguments in some other order

To illustrate some options, here is an example evaluation using orders 1a and 2a.

Input:

```
[f a1 a2 a3]
```

assuming

```
f --> (lam n1 n2 t)
t --> (lam n3 (n1 + n2 + n3))
a1 --> (con 1)
a2 --> (con 2)
a3 --> (con 3)
```

Reduction trace using order 1a:

```
[f a1 a2 a3]
[(lam n1 n2 t) a1 a2 a3]
[(lam n1 n2 t) (con 1) a2 a3]
[(lam n1 n2 t) (con 1) (con 2) a3]
[(lam n1 n2 t) (con 1) (con 2) (con 3)]
[[(con 1)/n1, (con 2)/n2]t (con 3)]
[[(con 1)/n1, (con 2)/n2](lam n3 (n1 + n2 + n3)) (con 3)]
[(lam n3 ((con 1) + (con 2) + n3)) (con 3)]
[(con 3)/n3]((con 1) + (con 2) + n3)
(con 1) + (con 2) + (con 3)
(con 6)
```

Reduction trace using order 2a:

```
[f a1 a2 a3]
[(lam n1 n2 t) a1 a2 a3]
[(lam n1 n2 t) (con 1) a2 a3]
[(lam n1 n2 t) (con 1) (con 2) a3]
[[(con 1)/n1, (con 2)/n2]t a3]
[[(con 1)/n1, (con 2)/n2](lam n3 (n1 + n2 + n3)) a3]
[(lam n3 ((con 1) + (con 2) + n3)) a3]
[(lam n3 ((con 1) + (con 2) + n3)) (con 3)]
[(con 3)/n3]((con 1) + (con 2) + n3)
(con 1) + (con 2) + (con 3)
(con 6)
```

The author's preference is order 1a, because:

1. Evaluating from left to right is intuitive, and corresponds to the order in which the arguments are required by the function.
2. Evaluating all the arguments in an application before progressing is more consistent with how strict function application usually works.
3. Evaluating all the sub-terms of the construct before doing anything is more consistent with how other constructs are treated.

However, 1a does _not_ have the property that `[f a b]` and `[[f a] b]` have the same evaluation order. That means that an optimization which turns nested applications into multi-applications may change evaluation order, which would be unsound if we have effects that we care about. We can get this property using the other orders, specifically:

- 1b can achieve this so long as we always evaluate from right to left in _all cases_. That means e.g. we evaluate `b` first for both `[f a b]` _and_ `[[f a] b]`, which is a bit counter-intuitive. This is famously the approach taken by OCaml (see the [Zinc report][zinc], section 2.2.3).
- 2a can achieve this, but at the cost of interleaving the evaluation of the partial applications with the evaluation of the arguments, potentially interfering with efficiency.

### Appendix 3: CEK machine

This section gives _one_ possible implementation of the reduction rules in the CEK machine, as an extension to the existing CEK machine described in the Plutus Core Specification. Evaluation order follows option 1a from Appendix 2, i.e. left-to-right.

At a high level, the idea is:

1. Applications are handled by an accumulator frame, which processes the function and arguments by tracking those still to do and those which have been done. When we return into the accumulator frame, either we pop off the next todo item and evaluate it; or there is nothing to do, and we switch into an "apply arguments" frame.
2. Lambda values track an accumulated environment to use for evaluating the body, and a "remaining arity" field that tracks how many more arguments are needed. When we return into an "apply arguments" frame, we either:
   a. If the number of arguments matches exactly: extend the environment and evaluate the body.
   b. If there are more arguments than needed: extend the environment and evaluate the body, returning into a frame with the remaining arguments.
   c. If there are fewer arguments than needed: extend the environment, decrement the "remaining arity", and return the lambda value.

$$
\begin{align*}
s; \rho &\vartriangleright [F\ N_1 \ldots N_n]\\
&\mapsto\\
s, \mathrm{Accum}(N_1 \ldots N_n,\emptyset{}); \rho &\vartriangleright F \\
\\
s; \rho &\vartriangleright (\mathrm{lam}\ n_1 \ldots n_n\ T)\\
&\mapsto\\
s; \rho &\vartriangleleft ((\mathrm{lam}\ n_1 \ldots n_n\ T), n, \rho) \\
\\
s, \mathrm{Accum}(N_1 \ldots N_n, V_1 \ldots V_{m-1}); \rho &\vartriangleleft V_m\\
&\mapsto\\
s, \mathrm{Accum}(N_2 \ldots N_n, V_1 \ldots V_m); \rho &\vartriangleright N_1 \\
\\
s, \mathrm{Accum}(\emptyset{}, F\ V_1 \ldots V_{m-1}); \rho &\vartriangleleft V_m\\
&\mapsto\\
s, [\_\ V_1 \ldots V_m]; \rho &\vartriangleleft F\\
\\
s, [\_\ V_1 \ldots V_m]; \rho &\vartriangleleft ((\mathrm{lam}\ n_1 \ldots n_n\ T), a, \rho')\\
&\mapsto \textrm{if $m = a$}\\
s; \rho'[n_{n-a+1} \mapsto V_1\ldots n_n \mapsto V_m] &\vartriangleright T\\
\\
s, [\_\ V_1 \ldots V_m]; \rho &\vartriangleleft ((\mathrm{lam}\ n_1 \ldots n_n\ T), a, \rho')\\
&\mapsto \textrm{if $m > a$}\\
s, [\_\ V_{a+1}\ \ldots V_m]; \rho'[n_{n-a+1} \mapsto V_1\ldots n_n \mapsto V_a] &\vartriangleright T\\
\\
s, [\_\ V_1 \ldots V_m]; \rho &\vartriangleleft ((\mathrm{lam}\ n_1 \ldots n_n\ T), a, \rho')\\
&\mapsto \textrm{if $m < a$}\\
s; \rho &\vartriangleleft ((\mathrm{lam}\ n_1 \ldots n_n\ T), a-m, \rho'[n_{n-a+1} \mapsto V_1\ldots n_{n-(a-m)} \mapsto V_m])\\
\end{align*}
$$

The "apply arguments" frame is separate from the accumulator frame, since we need to manipulate the "apply arguments" frame in a number of ways when actually returning into it, so it's helpful to be able to represent it differently.

An excellent reference for implementing behaviour like this is ["Making a fast curry"][fast-curry].
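The frame discipline described above can be sketched as a small executable Haskell machine. All names (`Term`, `Value`, `Frame`, `compute`, `ret`, `run`) are hypothetical; this is an illustration of the accumulator and "apply arguments" frames only, not the production CEK machine, and it omits builtins, delay/force, costing, and error handling.

```haskell
{-# LANGUAGE LambdaCase #-}
module Main where

import qualified Data.Map.Strict as Map

type Name = String

data Term
  = Var Name
  | Lam [Name] Term   -- multi-lambda
  | App Term [Term]   -- multi-application
  | Con Integer
  deriving (Eq, Show)

type Env = Map.Map Name Value

data Value
  = VCon Integer
  | VLam [Name] Term Int Env  -- names, body, remaining arity, accumulated env
  deriving (Eq, Show)

data Frame
  = Accum [Term] [Value] Env  -- terms still to do, values done (reversed)
  | ApplyArgs [Value]         -- evaluated arguments awaiting a function value

-- Compute phase: evaluate a term in an environment against a frame stack.
compute :: [Frame] -> Env -> Term -> Value
compute s rho = \case
  Var n    -> ret s (rho Map.! n)
  Con i    -> ret s (VCon i)
  Lam ns t -> ret s (VLam ns t (length ns) rho)
  App f as -> compute (Accum as [] rho : s) rho f   -- function first (order 1a)

-- Return phase: feed a value into the topmost frame.
ret :: [Frame] -> Value -> Value
ret [] v = v
ret (Accum (n : todo) done rho : s) v =             -- pop the next todo item
  compute (Accum todo (v : done) rho : s) rho n
ret (Accum [] done _ : s) v =                       -- nothing left: apply
  case reverse (v : done) of                        -- head is the function value
    f : args -> ret (ApplyArgs args : s) f
    []       -> error "impossible: applications are non-nullary"
ret (ApplyArgs args : s) (VLam ns t a rho')
  | m == a    = compute s rho'' t                             -- exact
  | m >  a    = compute (ApplyArgs (drop a args) : s) rho'' t -- over-application
  | otherwise = ret s (VLam ns t (a - m) rho'')               -- partial
  where
    m     = length args
    -- bind the next min(m, a) still-unbound names to the arguments
    rho'' = foldr (uncurry Map.insert) rho'
              (zip (drop (length ns - a) ns) (take a args))
ret (ApplyArgs _ : _) (VCon _) = error "applying a non-function"

run :: Term -> Value
run = compute [] Map.empty

main :: IO ()
main = print (run (App (Lam ["x","y"] (Lam ["z"] (Var "x")))
                       [Con 1, Con 2, Con 3]))
-- prints: VCon 1
```

The `main` example exercises over-application (three arguments against a binary lambda whose body is another lambda); a partially-applied `VLam` simply carries its extended environment and reduced arity until the remaining arguments arrive.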
We diverge from the approach in ["Making a fast curry"][fast-curry] in a number of ways: rather than having a PAP object for partial applications, and deconstructing it by pushing the partially-applied arguments back into an application frame, we instead incrementally build up the environment for the eventual function application. This turned out to be much faster in the prototype.

[zinc]: https://xavierleroy.org/bibrefs/Leroy-ZINC.html
[fast-curry]: https://www.microsoft.com/en-us/research/publication/make-fast-curry-pushenter-vs-evalapply/