Lecture 5: Programs as Data (Interp)

:::warning **Announcements** - `A2: Functions` due Thursday - Next help hours: - Mine 6:30-8:30 tonight - Grace's 3-5pm tomorrow ::: :::success Agenda for today - Wrap up from last lecture: `fold` ([prior notes](https://hackmd.io/@avh/SJOzCjFjgl#fold-reducecombine-to-result)), group exercise - Representing programs as data - Abstract syntax trees, just a bit on parsing - Interpreters vs compilers - Interpreter basics: - A Racket #lang with types: plait - Operating on programs as data - Calculator example ::: We've now spent about 2 weeks learning functional programming, including some insights on how languages work at a deeper level (evaluation rules, tail call recursion, desugaring, etc). :::info In the next two lectures, we'll focus more heavily on ==*language implementation*==. Understanding how programming languages are actually *implemented* will make us better programmers. Today's lecture in particular is designed to give you the skills to complete `A3: Interp`, where you will build your first interpreter for a language we'll call `MiniRacket`. ::: # Representing programs as data Programming language implementations (whatever actually lets you **run** your code) first need to go from the keystrokes we type into our IDE (e.g., strings) to treating the program itself as data, with program representations in memory that are easier to perform analysis on. ## A bit on parsing For any language implementation, the first step is _parsing_: consuming a string, character by character, and producing a program representation in memory. This includes details like, for example, whether parentheses in Racket are balanced. [Parsing](https://en.wikipedia.org/wiki/Parsing) is a rich and interesting subfield, but not necessarily the most useful to understand for working programmers, so we won't have time to focus on it in CS251. I will typically provide parsers where needed for assignments. ## Abstract Syntax Trees (ASTs) Usually, the next program representation after the source string is an _abstract syntax tree_. An AST represents the structure of a program (or pieces of it), with the outer-most operators at the top of each subtree. For example, the abstract syntax tree for the warmup snippets: An AST for `(+ (* 1 2) (- 5 1))` is: ``` + / \ * - / \ / \ 1 2 5 1 ``` `+` is the outer-most operator, so it is the root of the entire tree. `*` and `-` are each the root of their own subtrees. Note that order matters: `1` comes before `2` in the multiplication, so `1` is the left subtree/leaf. Note how the AST keeps the details needed to evaluate the program (which numbers are used, the order of the operations) while loosing some syntactic details (the parentheses). This is where the "abstract" in the name comes from. An AST for `(define y 3)`: ``` define / \ y 3 ``` An AST for `(define f (lambda (x) (+ x y)))`: ``` define / \ f lambda / \ x + / \ x y ``` Sometimes AST edges will be labeled: for example, we could label the first use of `x` with `argument` and the sum as `body` of the `lambda`. To the second warmup question: next lecture, we'll talk more about additional information language implementations need to track for lambdas. After creating the AST, we have a choice in how to further evaluate this program. There are two primary methods: - An ==_interpreter_== traverses the AST to compute results **directly** (typically by using the _host_ or _implementation_ language operators). For example, when evaluating `(+ (* 1 2) (- 5 1))`, an interpreter would first evaluate `(* 1 2)`, then `(- 5 1)`, and finally apply `+` to the results. The `*`, `+`, `-` being invoked would be based on what language the interpreter is written in: e.g. Racket's operators. Python and Racket are both usually interpreted. - A ==_compiler_== traverses the AST and translates it to a lower-level representation: either bytecode instructions (e.g., Java) or machine code instructions (e.g., C, Rust). These instructions are then executed by a lower-level system: a language runtime or virtual machine, or directly by the hardware. Modern languages might do a mix of interpreting and compiling to produce results. The same language can have both interpreter and compiler implementations. For example, JavaScript is usually "just-in-time (JIT)" compiled, which is a mix of these two primary strategies in that some aspects of the program are compiled to hardware instructions _while_ the program is being executed. Interpreters are often slower, but are easier because they often directly map to the ==evaluation rules== of a language's defined semantics. We'll build several interpreters throughout the remainder of this course, starting with `A3: Interp` next week! # Operating on programs as data Part of the power of Racket is being able to build and work in many different sublanguages for different purposes. For A3, we'll use a new Racket sublanguage: `#lang: plait` (named for the optional textbook, [Programming Languages: Application and Interpretation](https://www.plai.org/), + Typed). You can find the `plait` language documentation, including a tutorial, [here](https://docs.racket-lang.org/plait/). `plait` shares most of its syntax with Racket. :::info Install Plait by going into DrRacket, File | Package Manager… | Do What I Mean, and type `plait` (all lower-case) into the box before pressing `install`. DrRacket will find and install the latest version of `plait`. Then, use `#lang plait` at the start of your file. ::: # Calculator example Rather than start off with showing a full interpreter in class, we will instead build a simplified interpreter. Our simplified essentially will be a calculator for a math-based language, with expression and constants, but no functions, let-bindings, etc (those, you will implement yourself in `A3: Interp`!). ### `CalcLang` definition We want to support a simple calculator language with the following grammar: ``` e ::= n ; number | const ; constant | (+ e1 e2) ; addition ``` To start with, expressions can be either numbers, pre-defined mathematical constants, or additions on two expression arguments. :::success How can we translate this to a programming language implementation? We want numbers, constants, and additions to all "be" expressions. (Fun note: Alexa once got a similar question in a tech finance interview). ::: :::spoiler Think, then click! In Java, we might express this by defining an `Expression` class with subclasses for `Num` `Const` and `Addition`. The `Addition` subclass could have fields, `left` and `right`, each of type `Expression`. But then how each subclass differed in behavior might be spread out amongst the subclass definitions. ::: In `plait`, as with many functional languages, we'll instead use a *recursive data definition*, also known as an abstract data type (ADT) definition: ```racket (define-type MathExpr ; we define a new type (e-num [value : Number]) ; one variant: a MathExpr can be a number (e-const [name : Symbol]) ; next variant: a MathExpr can be a named mathematical constant (e-add [left : MathExpr] [right : MathExpr])) ; or, a MathExpr can be the addition of two other MathExprs ``` This definition says we have a new type, `MathExpr`, that can be one of 3 variants: 1. an `e-num` (read as: expression number) that contains a `value` of built-in type `Number`, or 2. an `e-const` that contains a `name` of built-in type `Symbol`, or 3. a recursive `e-add` that contains two `MathExpr`s, one named `left` and one named `right`. You can read more about the `define-type` keyword in the [plait documentation](https://docs.racket-lang.org/plait/Definitions.html#%28form._%28%28lib._plait%2Fmain..rkt%29._define-type%29%29). This type of recursive type definition, also known as an [algebraic data type](https://en.wikipedia.org/wiki/Algebraic_data_type), is common across many languages: plait, OCaml (which we will start learning next week), Haskell, and Rust use them heavily, while they are also now supported in some form in many more languages, including Java, C++, Python, TypeScript, and Scala. Now, we can write functions that take in a `MathExpr`, and define different behavior based on which of the 3 variants the data is. ## Symbols vs. strings A note on the `e-const` definition: Racket's [`Symbol` type](https://docs.racket-lang.org/reference/symbols.html) is distinct from its `String` type, and is a more general language term that is often used for _names_. Symbols differ from strings in that they cannot be broken down into substrings or characters: the symbol `'alexa` is equal to itself and not equal to `"alexa"`; only the latter can be broken into substrings. This allows languages to implement symbols more efficiently than strings: a symbol can be _interned_ to store the mapping from an id (which takes less space than a string) to the full name. Going back to our overall `MathExpr` type: we can now write a `calc` function to interpret nested math expressions! ## Examples of using `plait` First, some examples in the interactions window: `plait`, unlike Racket, tells us both _the type of a term_ and its value after evaluation. 1 unsurprisingly has builtin type `Number`: ```racket > 1 - Number 1 ``` Some `MathExpr`s: ```racket > (e-num 1) - MathExpr (e-num 1) > (e-const 'pi) - MathExpr (e-const 'pi) > (e-add (e-num 1) (e-const 'pi)) - MathExpr (e-add (e-num 1) (e-const 'pi)) ``` Note, plait gives us several things from a type definition: - As shown, we can construct values of each variant type using its name, e.g., `(e-num n)` where `n` is a `Number`. - We can use the variant name + a `?` to ask if a `MathExpr` is that variant: ```racket > (e-add? (e-num 1)) - Boolean #f ``` - We can access each field of the variant via `variant-id-field-id`, for example: ```racket > (define e (e-add (e-num 1) (e-const 'pi))) > (e-add-left e) - MathExpr (e-num 1) > (e-add-right e) - MathExpr (e-const 'pi) ``` ## Defining `calc` Our first `calc` case is easy: if we want to evaluate `MathExpr`s to final numeric values, then an `e-num` just needs to return its number field: ```racket (define (calc (e : MathExpr)) : Number ; e has type MathExpr ^ ^ calc returns a Number (type-case MathExpr e ; type, then the thing we want to case on ; [() ()] shape of each case: matching on variant for e, then result ; e-nums just evaluate to their value [(e-num value) value] ; report an error for unhandled cases, for now [else (error 'calc-error "cannot do calculation")] )) ``` `type-case` is similar to `cond` in `#lang racket`, except it matches an expression (here, `e`) to its variant (in each case). This is how we define different behavior per variant. ## Our first recursive case Our first recursive case is `e-add`. To `calc` (or evaluate/interpret) an add expression, we need to first `calc`/eval/interp each subexpression, then add the result: ```racket [(e-add x y) (+ (calc x) (calc y))] ``` This might seem a little redundant, but what we are implementing is that the semantics of `add` in this `MathExpression` match Racket's `+` semantics. Next, we'll see how we can define *different* semantics than the host language via the interpreter. ## Using a `hash` in `plait` To support constants, we'll use a `hash` map from `Symbol` to `Number`. Note that the way we're doing this in class is not the best style (it's overly verbose), but I'm doing it this way to show you some `hash` functions you'll need for A3. We can define a `hash` on these types as: ```racket (define-type-alias Constants (Hashof Symbol Number)) ``` :::warning We started to see this part in Monday's lecture, but will go over it in more detail in the next lecture. ::: We can then create an **immutable** hash. Like immutable lists, operations on this hash produce new hashes, rather than modifying the existing hash. ```racket (define (create-constants) ; Make an initially empty immutable hash (let* ([init (hash '())] ; Add pi [just-pi (hash-set init 'pi 3.14159)] ; Add e [more (hash-set just-pi 'e 2.71828)]) more)) ; Use a top-level binding in this example, because our consts ; won't change. Note this is *not* what you'll do for A3: Interp, since ; there, expressions can add bindings. (define consts (create-constants)) ``` `hash-ref` is the function to look up a value in a hash given its key. Note that when we look up a key in our `hash`, plait makes a different design decision than e.g. Python. Rather than producing a runtime error, plait returns a `none` value when a key is in the hash: ```racket > (hash-ref consts 'alexa) - (Optionof Number) (none) ``` This `none` is one variant of an `Option` algebraic data type: another pattern that is common across many languages, including OCaml, Rust, and newer versions of e.g. Java. In high level terms, an `Option` wraps another type and has two variants: - `some v` indicates there is some result, `v` - `none` indicates there is not a result Option types are nice for (at least!) two reasons: 1. They require programmers to explicitly handle no result, instead of falling back to e.g. `key not found` errors or null pointer exceptions. 2. Data structures that return option types sometimes allow us to write more efficient lookup code. For example, in Python, a programming might write the following to print a value if it is present in a dictionary: ``` if key in dict: # Check if key is there print(key[dict]) # Get value if so ``` This pattern checks the dictionary twice, one to check if `key` is present, and once to get its value. While both of these are amortized constant time, this is still somewhat redundant work. If a hashmap returns an option type, as we'll see below, we query the hashmap once, then do a very quick (worst case still constant time) check for whether it is `some` or `none`. Here is our case for looking up constants as part of `calc`: ```racket (define (calc (e : MathExpr)) : Number (type-case MathExpr e [(e-num value) value] [(e-const name) (let ([n (hash-ref consts name)]) (if (some? n) ; if result (some-v n) ; access the value field (error 'calc-error "constant not found")))] ; ... ``` #### Full `calc.rkt` file at the end of Monday's lecture ```racket #lang plait ;; Define a data type, MathExpr, that can either be a ;; numeric value, and addition of two arguments (left or ;; right), or a pre-defined mathematical constant. ;; This is our AST for our lecture language (define-type MathExpr ; Number is a type built in to plait (e-num [value : Number]) ; Symbol is a type built in to plait (e-const [name : Symbol]) ; Like our grammar, this type definition is recursive (e-add [left : MathExpr] [right : MathExpr])) ;; Note: this is similar to variables on A3 interp (define (create-consts) (let* ([init (hash '())] [justpi (hash-set init 'pi 3.14)]) justpi)) (define consts (create-consts)) ;; "calc" is our interpreter on these mathematical ;; expressions. (define (calc (e : MathExpr)) : Number (type-case MathExpr e [(e-num value) value] [(e-const name) ; Lookup the name in our constant hash table (let ([res (hash-ref consts name)]) (if (some? res) (some-v res) (error 'calc-error "const not found")))] ; (error 'calc-error "TODO, need to implement")] [(e-add left right) (+ (calc left) (calc right))])) ``` Next class, we'll go over the `hash` logic in more detail, as well as do a group exercise on implementing *different* semantics than Racket's semantics for `divide`.