---
title: PLs-and-Degrees-of-Freedom
author: bynx
tags: pl
---

# Programming Language Resolutions

## What is _Resolution_?

_Resolution_ is defined here à la _display resolution_: as degrees of freedom vary, so too does our observable. In QFT this term is known as _dimensionality_ or _cardinality_; we use the two interchangeably.

Classical representations of programming languages aim to reduce dimensionality in an effort to ease lexical, syntactic, and semantic analysis. The result is a _discrete resolution_ of a provided input sample program, $P$. Examples of such discrete resolutions can be seen in the [Canonical Representations](/kLE1slUeS1qZyjOAoeV4Xg##Canonical-Representations) section below.

## A Modest Proposition

Given a vector representation of a sample program, $P_{\lvert\psi\rangle}$, renormalization-group transformations of $\lvert P \rangle$ will validate universality for the canonical representations $P_{\gamma}$, $P_{\rho}$, $P_{\tau}$. As a result, dimensionality can be preserved while maintaining a discrete resolution, i.e., logically equivalent semantic mutations exist for all semantic mutations at all resolutions of $P$.

### Challenges

Assigning a semantic label to a code snippet (such as a name to a method) is an example of a class of problems that require a compact semantic descriptor of a snippet. The question is how to represent code snippets in a way that captures some semantic information, is reusable across programs, and can be used to predict properties such as a label for the snippet. This leads to two challenges:

1. Representing a snippet in a way that enables RG transformations
1. Learning which parts of the representation are relevant to prediction of the desired property, and learning the order of importance of those parts
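As a toy instance of the labeling task, consider a snippet whose original name has been stripped. The function body and the target label (`reverseArray`) below are purely illustrative, not taken from any particular dataset:

```javascript
// Hypothetical instance of the semantic-labeling task: given the body
// below with its name stripped, a model working from a compact semantic
// descriptor should predict a label such as "reverseArray".
function unknownLabel(items) {
  const result = [];
  // Walk the input back-to-front, copying each element.
  for (let i = items.length - 1; i >= 0; i--) {
    result.push(items[i]);
  }
  return result;
}

console.log(unknownLabel([1, 2, 3])); // [ 3, 2, 1 ]
```

Note that nothing in the body mentions "reverse"; any useful descriptor must capture the backwards iteration structurally rather than lexically.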

## Canonical Representations

### Source Code

### Concrete Syntax Tree / Parse Tree / Derivation Tree

### Abstract Syntax Tree

A tree representation of the abstract _syntactic structure_ of source code

$\quad \rightarrow$ **_Node_**_: construct, such as a statement or loop_
$\quad \rightarrow$ **_Edge_**_: containment relationship_

This makes it possible to apply all kinds of **syntax-directed** translations/transformations. An AST is a simplified parse tree: it retains the syntactic structure of the code.

**Definition 1 (Abstract Syntax Tree).** An Abstract Syntax Tree $(AST)$ for a program sample $P$ is a tuple

$$
\Large\color{orchid}{P_{AST} = \lgroup \eta ,\tau ,\chi , s, \delta, \phi \rgroup}
$$

where:

- $\eta$ is a **set of nonterminal nodes**
- $\tau$ is a **set of terminal nodes**
- $\chi$ is a **set of values**
- $s \in \eta$ is the **root node**
- $\delta : \eta \rightarrow (\eta \bigcup \tau )^{\ast}$ is a function that maps a nonterminal node to the list of its children
- $\phi : \tau \rightarrow \chi$ is a function that maps a terminal node to an associated value

Every node except the root appears exactly once across all the lists of children.

Next, we define $AST$ paths. For convenience, in the rest of this section we assume that all definitions refer to a single $P_{AST}$. An $AST$ path is a path between nodes in the AST, starting from one terminal, ending in another terminal, and passing through an intermediate nonterminal that is a common ancestor of both terminals.
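The tuple of Definition 1 can be sketched concretely. Below is a minimal encoding for the statement `x = 7;`; the node names (`AssignExpr`, `NameExpr`, `IntegerLiteralExpr`) follow no particular parser's conventions and are chosen only for illustration:

```javascript
// A minimal, illustrative encoding of Definition 1 for `x = 7;`.
const ast = {
  eta: ["AssignExpr"],                      // set of nonterminal nodes
  tau: ["NameExpr", "IntegerLiteralExpr"],  // set of terminal nodes
  chi: ["x", 7],                            // set of values
  s: "AssignExpr",                          // root node
  // delta: maps a nonterminal to the ordered list of its children
  delta: { AssignExpr: ["NameExpr", "IntegerLiteralExpr"] },
  // phi: maps a terminal to its associated value
  phi: { NameExpr: "x", IntegerLiteralExpr: 7 },
};

// Every node except the root appears exactly once among the children.
const children = Object.values(ast.delta).flat();
console.log(children); // [ 'NameExpr', 'IntegerLiteralExpr' ]
```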

More formally:

**Definition 2 (AST path).** An $AST$-path of length $\kappa$ is a sequence of the form $n_{1}d_{1} \dots n_{\kappa}d_{\kappa} n_{\kappa+1}$ where:

- $n_{1},n_{\kappa+1} \in \tau$ are terminals
- $\forall i \in [2..\kappa]: n_{i} \in \eta$ are nonterminals
- $\forall i \in [1..\kappa]: d_{i} \in \{\uparrow,\downarrow\}$ are movement directions (either up or down in the tree), i.e.,

\begin{align}
(d_{i}\ &=\ \uparrow) \ \Rightarrow\ n_{i} \in \delta (n_{i+1} ) \\
(d_{i}\ &=\ \downarrow)\ \Rightarrow\ n_{i+1} \in \delta (n_{i})
\end{align}

For an $AST$-path $\rho \in P$, we use $start(\rho)$ to denote $n_{1}$ $-$ the starting terminal of $\rho$ $-$ and $end(\rho)$ to denote $n_{\kappa+1}$ $-$ its final terminal. Using this definition we define a path-context as a tuple of an AST path and the values associated with its terminals:

**Definition 3 (AST path-context).** Given an $AST$ path $\rho$, its path-context is a triplet $\lgroup \chi_{s}, \rho, \chi_{\tau} \rgroup$ where $\chi_{s} = \phi(start(\rho))$ and $\chi_{\tau} = \phi(end(\rho))$ are the values associated with the start and end terminals of $\rho$, respectively. That is, a path-context describes two actual tokens together with the syntactic path between them.

*Example 1.* A possible path-context that represents the statement $x = 7;$ would be:

$$
\color{orchid}{\lgroup\ x,\ (NameExpr\ \uparrow\ AssignExpr\ \downarrow\ IntegerLiteralExpr),\ 7\ \rgroup}
$$

### Directed Acyclic Graph

$DAG$s are similar to an $AST$ but with _a unique node for each value._

### Control Flow Graph

A representation, using graph notation, of all paths that might be traversed through a program during its execution. A directed graph (usually for a single procedure) in which:

- Each node is a single basic block
- There is an edge $b_1 \rightarrow b_2$ if control _may_ flow from the last statement of $b_1$ to the first statement of $b_2$ in some execution
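A CFG can be sketched as an adjacency structure over basic blocks. The procedure shape and block names below are assumptions for illustration (a single `if`/`else` diamond):

```javascript
// Illustrative CFG for a hypothetical procedure with one branch:
//   b1: entry;  b2: then-branch;  b3: else-branch;  b4: join/exit.
// Nodes are basic blocks; an edge [b1, b2] means control MAY flow
// from the last statement of b1 to the first statement of b2.
const cfg = {
  nodes: ["b1", "b2", "b3", "b4"],
  edges: [
    ["b1", "b2"], // branch taken
    ["b1", "b3"], // branch not taken
    ["b2", "b4"],
    ["b3", "b4"],
  ],
};

// Successors of a block: every block control may reach next.
const successors = (b) =>
  cfg.edges.filter(([from]) => from === b).map(([, to]) => to);

console.log(successors("b1")); // [ 'b2', 'b3' ]
```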

_Note: $CFG$s are actually conservative approximations of the control flow._

### Code Property Graph

A code property graph is a property graph, $G = (V,E,λ,μ)$, constructed from the $AST$, $CFG$, and $PDG$ of source code with:

\begin{align}
\color{orchid}{V} &\color{orchid}{= V_{A}} \\
\color{orchid}{E} &\color{orchid}{= E_{A}\ \bigcup\ E_{C}\ \bigcup\ E_{P}} \\
\color{orchid}{λ} &\color{orchid}{= λ_{A}\ \bigcup\ λ_{C}\ \bigcup\ λ_{P}} \\
\color{orchid}{μ} &\color{orchid}{= μ_{A}\ \bigcup\ μ_{P}}
\end{align}

where we combine the labeling and property functions with a slight abuse of notation.

[J. Ferrante, et al. _The program dependence graph and its use in optimization_ (1987)][pdg-doi]

[pdg-doi]: https://dl.acm.org/doi/abs/10.1145/24039.24041

### Program Dependence Graph

A _directed_ graph representing dependencies among code:

- **Control** dependence
  - $A$ is control-dependent on $B$ if $B$'s execution decides whether or not $A$ is executed
- **Data** dependence (Data Dependency Graph $-$ $DDG$)
  - $A$ is data-dependent on $B$ if $A$ uses a variable defined in $B$

A $PDG$ contains both **_control dependence edges_** and **_data dependence edges_**.

### Points-to Graph

For a given program location and any object `reference/pointer`, compute all the possible `objects/variables` it may/must refer/point to:

- Connect together analyzed program semantics for individual methods
- Essential to expand intra-procedural analysis to inter-procedural analysis
- Detect consistent usage of resources
  - File `open`/`close`, `lock`/`unlock`, `malloc`/`free`

### Property Graph

A _property graph_ $G=(V,E,λ,μ)$ is a _directed, edge-labeled, attributed multigraph_ where $V$ is a **set of nodes**, $E$ is a **set of directed edges**, and $λ : E → Σ$ is an edge labeling function assigning a label from the alphabet $Σ$ to each edge.

Properties can be assigned to edges and nodes by the function:

$$
\large{\color{orchid}{μ:(V\ \bigcup E)×K→S}}
$$

where $K$ is the set of _property keys_ and $S$ the set of _property values_.

### Call Graph

A directed graph representing the _caller-callee relationship_ between methods/functions

$\quad \rightarrow$ **Node**: methods/functions
$\quad \rightarrow$ **Edges**: calls

### Static Single Assignment Form

A program is in $SSA$ form $iff$:

1. **each variable** is assigned a value in **exactly one** statement
1. **each variable** use is **dominated by the definition**

The $SSA$ Graph is a _directed graph_ in which:

$\quad \rightarrow$ **Node**: all definitions and uses of $SSA$ variables
$\quad \rightarrow$ **Edges**: $\{(d,\ u)\}$ where $u$ uses the $SSA$ variable defined in $d$

### Three-Address Code

A term used to describe many different representations of the form:

$$
\Large{\color{orchid}{\psi\ =\ \theta_{fn}\ \ op_{1}\ \ op_{2}\ \ op_{3}}}
$$

where $\psi \in \Psi$ and $\theta \in \Theta$ are our statement and operation values, respectively $-$ the $op_{i}$ are _optional_ parameters passed to $\theta_{fn}$.

### Domain-Specific Languages

We define $A+B$ as the _disjoint union_ of two sets, $A$ and $B$:

\begin{align}
\large \color{orchid}{A+B} &\color{orchid}{\equiv \large A \ \ {\bigcup}^{\ \ast} \ B} \\
&\color{orchid}{\equiv \large \ (\ A×{\{0\}}\ ) \ \bigcup \ (\ B×{\{1\}}\ )} \\
&\color{orchid}{\equiv \large \ A^{*} \ \bigcup \ B^{\ *}}
\end{align}

_Note: Sequences in the **free monoid** are denoted as a vector: $\large{\vec{v} \in A^{\ast}}$_

### Context-free Grammar

A context-free grammar is a tuple $(\Sigma, X, R, s)$ where $\Sigma$ and $X$ are finite sets called the **_terminals_** and the **_non-terminals_**, respectively.

- $R \subseteq X \times (X + \Sigma)^\star$ is a finite set of **_production rules_**
- $s \in X$ is called the **_start symbol_**

The **_language_** of a context-free grammar is given by:

$$
\large{\color{orchid}{L(G) = \{\ \vec{u} \in \Sigma^\star\ \vert\ s \to_R \vec{u}\ \}}}
$$

where the rewriting relation, $(\to_R) \subseteq (\Sigma + X)^\star \times (\Sigma + X)^\star$, is traditionally defined as the reflexive-transitive closure of the following directed graph:

$$
\large\color{orchid}{\{\ \ (\vec{u}{\ x\ }\vec{w},\ \ \vec{u}\ \vec{v}\ \vec{w})\quad \Big\vert \quad \vec{u}, \vec{w} \in (\Sigma+X)^{\ast},\ \ (x, \vec{v}) \in R\ \ \}}
$$

Note: Given a set $S$, the **_free monoid_** on $S$ is the set $S^{\ast}$ of all finite sequences of elements $s \in S$, made into a monoid using concatenation.

### Intermediate Representations

### Assembly Instructions

### Machine Code

---

## Language Translation Sample

Let's solidify the above with an example in JavaScript:

### JavaScript Source Code

```javascript=
/*
 * ECMA-262/11
 */
const v0 = 1337;
const v1 = {a: v0};

for (let v2 in v1) {
    const v3 = 1337;
    v2 = v3;
    {
        const v4 = {a: v1};
        v2 = v4;
    }
}
```

### Abstract Syntax Tree (Domain-Specific Language)

```==
/*
 * Custom ES AST
 */
'const' BindExpr ';'
'const' BindExpr ';'
ForInExpr '{'
    'const' BindExpr ';'
    BindExpr ';'
    '{'
        'const' BindExpr ';'
        BindExpr ';'
    '}'
'}'
```

### FuzzIL (Intermediate Representation) Three-Address Code

```swift=
/*
 * FuzzIL Translation
 */
v0 <- LoadInteger 1337
v1 <- CreateObject ['a':v0]
BeginForIn v1 -> v2
    BeginBlockStatement
        v3 <- LoadInteger 1337
        Reassign v2, v3
        BeginBlockStatement
            v4 <- CreateObject ['a':v1]
            Reassign v2, v4
        EndBlockStatement
    EndBlockStatement
EndForIn
```

### JavaScriptCore Bytecode

```==
bb#1
[ 0] enter
[ 1] get_scope dst:loc4
[ 3] mov dst:loc5, src:loc4
[ 6] check_traps
[ 7] mov dst:loc6, src:Undefined(const0)
[ 10] resolve_scope dst:loc7, scope:loc4, var:0, resolveType:GlobalProperty, localScopeDepth:0
[ 17] put_to_scope scope:loc7, var:0, value:Int32: 1337(const1), getPutInfo:1049600<DoNotThrowIfNotFound|GlobalProperty|ConstInitialization|NotStrictMode>, symbolTableOrScopeDepth:0, offset:0
[ 25] resolve_scope dst:loc7, scope:loc4, var:1, resolveType:GlobalProperty, localScopeDepth:0
[ 32] new_object dst:loc8, inlineCapacity:1
[ 36] resolve_scope dst:loc9, scope:loc4, var:1, resolveType:GlobalProperty, localScopeDepth:0
[ 43] get_from_scope dst:loc10, scope:loc9, var:1, getPutInfo:2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, localScopeDepth:0, offset:0
[ 51] put_by_id base:loc8, property:2, value:loc10, flags:IsDirect
[ 57] put_to_scope scope:loc7, var:1, value:loc8, getPutInfo:1049600<DoNotThrowIfNotFound|GlobalProperty|ConstInitialization|NotStrictMode>, symbolTableOrScopeDepth:0, offset:0
[ 65] mov dst:loc6, src:Undefined(const0)
[ 68] mov dst:loc7, src:<JSValue()>(const2)
[ 71] resolve_scope dst:loc8, scope:loc4, var:1, resolveType:GlobalProperty, localScopeDepth:0
[ 78] get_from_scope dst:loc9, scope:loc8, var:1, getPutInfo:2048<ThrowIfNotFound|GlobalProperty|NotInitialization|NotStrictMode>, localScopeDepth:0, offset:0
[ 86] mov dst:loc8, src:loc9
[ 89] get_property_enumerator dst:loc9, base:loc8
[ 92] get_enumerable_length dst:loc10, base:loc9
[ 95] mov dst:loc11, src:Int32: 0(const3)
Successors: [ #2 ]
```
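Note that the FuzzIL listing above is not itself in $SSA$ form: `v2` is assigned in three places. A hedged sketch of how the loop body would rename under the $SSA$ definition given earlier (the fresh `v2_i` names and the placement of any $\phi$-selection are illustrative, not FuzzIL syntax):

```==
BeginForIn v1 -> v2_0          // loop header defines a fresh v2 each iteration
    v3   <- LoadInteger 1337
    v2_1 <- v3                 // was: Reassign v2, v3
    v4   <- CreateObject ['a':v1]
    v2_2 <- v4                 // was: Reassign v2, v4
EndForIn                       // a use after the loop would select a value via φ
```

Each variable is now assigned in exactly one statement, and every use is dominated by its definition.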