PowerShell Binary AST

---
RFC: RFC<four digit unique incrementing number assigned by Committee, this shall be left blank by the author>
Author: Nikita Baksalyar
Status: Draft
SupercededBy: <link to another RFC>
Version: 1.0
Area: <Area within the PowerShell language>
Comments Due: <Date for submitting comments to current draft (minimum 1 month)>
Plan to implement: <Yes | No>
---

# PowerShell Binary AST

This RFC proposes a new serialization format for PowerShell Abstract Syntax Trees (ASTs) to make them more interoperable and compact.

## Motivation

As a developer wanting to work with PowerShell scripts in a language outside the .NET ecosystem, I can serialize my scripts into a universal binary format, so that I can store them in a compact way and manipulate AST nodes in my language.

## Specification

There are two main user profiles for this feature:

* Users who want to embed PowerShell within their applications outside the .NET ecosystem. A standard approach to binary serialization would make it possible to generate serializers automatically, so that PowerShell code can be generated and AST transformations performed from multiple programming languages.
* Users who are interested in storing or transferring PowerShell scripts in a compact format (for cases when storage is expensive, e.g. on a blockchain). Naturally, a serialized script will also lack comments and whitespace.

With scripts serialized in a binary format, PowerShell would gain an additional efficiency advantage: it would only need to hydrate AST data structures instead of parsing a text file from scratch.

### Implementation

The Abstract Syntax Tree (AST) API is already public [1] and can be used both by PowerShell users from within the shell itself and by developers integrating with PowerShell as a library. The public API comprises multiple classes corresponding to AST nodes (e.g., `BinaryExpressionAst` for expressions like `$a + $b`, `IfStatementAst` for `if` statements, etc.).
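As a rough illustration of the existing public API described above, the following sketch parses the `$a + $b` example and inspects the resulting `BinaryExpressionAst` node. It uses only the current `System.Management.Automation.Language` types and is not part of the proposed serialization format. The variable names are chosen for illustration.

```powershell
# Parse a script fragment into an AST without executing it.
# Single quotes keep $a and $b from being expanded by the shell.
$tokens = $null
$errors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput(
    '$a + $b', [ref]$tokens, [ref]$errors)

# Find the first binary expression node anywhere in the tree.
$binary = $ast.Find(
    { param($node) $node -is [System.Management.Automation.Language.BinaryExpressionAst] },
    $true)

# Each node exposes its children and metadata as typed properties.
$binary.GetType().Name      # BinaryExpressionAst
$binary.Operator            # Plus
$binary.Left.Extent.Text    # $a
$binary.Right.Extent.Text   # $b
```

Properties such as `Operator`, `Left`, and `Right` are the kind of per-node fields that a serialization schema would have to describe.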
We propose adding another description schema for AST nodes in a type definition language like Protocol Buffers. This would make it possible to automatically generate the corresponding plain data structures and the serialization and deserialization routines. Consequently, we will need to implement a hydration routine that translates plain data into full AST nodes that the PowerShell interpreter can understand and use. Similarly, users wanting to work with the PowerShell AST in other languages will only need a tool that automatically generates plain data structures and serialization/deserialization code from the schema description we provide.

To simplify keeping the serialization schema up to date, we propose using the Roslyn Syntax API [2]. It is possible to automatically extract metadata about structures and fields from the AST classes listed in the corresponding file [3] and generate output in the target type description language of our choice. This step can run automatically at build or packaging time, which would reduce the maintenance burden.

[1] It's available in the `System.Management.Automation.Language` namespace.

[2] Available as a part of the .NET Compiler Platform SDK.

[3] `src/System.Management.Automation/engine/parser/ast.cs`

## Alternate Proposals and Considerations

### Schema-less formats

Schema-less serialization has the advantage of requiring less code maintenance when a new language construct is added. In this case, the serialization code does not need to be modified because we don't describe AST node types in any external type description language. It can work similarly to JSON serialization, using reflection to find AST node types and fields.

There are many formats that can be used for schema-less serialization, for instance:

- BSON.
  This format is similar to JSON in many ways but it's more compact because it doesn't contain whitespace, doesn't require quoting for strings, and doesn't have complex syntax rules. It's used by the NoSQL database MongoDB to represent and store data in a compact way.
- CBOR (Concise Binary Object Representation). This format is also similar to JSON and BSON. It is standardized as IETF [RFC 8949](https://datatracker.ietf.org/doc/html/rfc8949).

Disadvantages of the schema-less approach:

- It requires storing more meta-information about AST node types, making it less compact than schema-based serialization.
- With typed binary serialization formats like Protocol Buffers, it's easier to generate code in other languages for interoperability.

### Binary serializer implemented in PowerShell

It should be possible to implement a new serialization format as a PowerShell utility script. However, this would be less efficient because of the extra level of indirection, and if we want to package this functionality as part of the PowerShell distribution, it offers little advantage over implementing the same features in C#. This approach would also require implementing PowerShell as a target language for the serialization code generator.