# Proto4 Thought Experiment

###### tags: `protobuf` `design` `idea`

Some ideas around what a proto4 might look like. _**This is just speculation/fun**_.

- **Non-goal**: This shouldn't be called proto4 😜
- **Non-goal**: Source compatibility with proto2/3 (though perhaps offer a way to convert to/from .proto?)
- **Goal**: wire compatibility.
- **Goal**: remove the [legacy limitations](https://reasonablypolymorphic.com/blog/protos-are-wrong/) of protos (this is a terrible article but it has _some_ valid points).
- **Goal**: improve language ergonomics where possible
- **Goal**: consume and produce FileDescriptorSets so we can leverage the full Proto ecosystem
- **Goal**: create a syntax that can easily be formatted (like gofumpt)

# Language changes

## Simplify the existing type system, add some things

- Remove support for varint types (except bool).
- Unify on a common syntax for maps, lists and user-defined generics.
- Add generics.
- Add sum types, remove `oneof`.
- Enums as with proto3.

### Scalars

Varint encoding is [computationally inefficient](https://stackoverflow.com/a/24642169). Remove support for varint types, except bool.

Supported scalar types are `int32`, `int64`, `uint32`, `uint64`, `float32`, `float64`, `bool`.

> This does mean we won't be able to import .proto files at all. How much of a problem is this? FileDescriptorSets won't be usable, for example. This could be a deal breaker.
> [name=Alec Thomas]

As an alternative to completely removing support, fixed-size types could be the default but we support varint/zigzag encoding with type modifiers, eg.

```protobuf=
1: zigzag int32 age
2: varint int32 weight
```

### Bytes

Bytes remain as-is because the proto wire format has no 8-bit data type to allow `byte` and `list<byte>`.

### Lists

Use the same syntax as maps and user generics, `list<TYPE>`.
### Maps

Maps are declared in the same way but we remove many of the existing limitations:

- Map keys can be any scalar type, strings, or enums, but not bytes because these are often mutable values with no stable hash.
- Map values can be arbitrarily nested lists/maps.

### Messages

Messages are encoded as with proto3, but the syntax differs slightly in that field IDs are prefixes:

```=
message NAME {
  ID: TYPE NAME
}
```

eg.

```protobuf=
message Message {
  1: string name
}
```

### Sum types

First-class support for sum types. As a sum type does not have a meaningful zero value, it must always be optional.

```=
sum NAME {
  ID: TYPE
}
```

eg.

```protobuf=
sum SumType {
  1: Message
  2: int64
}
```

See the [generated code examples](#Generated-code-examples) below for what this might look like as generated Go code.

### Generics

> This is the feature I'm least sure about. I can see its utility, but I personally have never needed generics in Protobufs, so I'm not sure it's useful in practice.
> [name=Alec Thomas]

First-class support for user-defined generics. If the target language does not support generics, the generated code will use monomorphisation.

The syntax is similar to many other languages:

```protobuf=
message Message<NAME, ...> {
  ...
}
```

eg.

```protobuf=
message LinkedList<T> {
  1: T value
  2: LinkedList<T>? next
}
```

## Semi-colons

Semi-colons are supported but not required. eg. these are all equivalent:

```protobuf=
enum Enum { ZERO = 0; ONE = 1 }

enum Enum {
  ZERO = 0
  ONE = 1
}

enum Enum
{
  ZERO = 0
  ONE = 1
}
```

## Prefix IDs

This makes it much more obvious what the ID for a particular field is, and additionally allows for default values if desired.

```protobuf=
1: string name
2: int64 age = -1
```

## Enums

Enumerated constants are identical to proto3, except there *must* be a zero value and it is the default if not provided.
```protobuf=
enum Enum {
  NONE = 0
  ONE = 1
}
```

## Improved optional syntax

Just a nice bit of syntactical sugar inspired by a bunch of other languages.

```protobuf=
1: string? name
```

## RPC Syntax

Simplified slightly.

```protobuf=
service Service {
  Method(Request) Response
}
```

## Define options

Being able to extend arbitrary messages leads to confusing and hard to understand behaviour. Drop support for extending anything but the various option types, and those only via dedicated syntax:

```protobuf=
option message {
  50001: bool redacted
}

option field {
  50001: bool redacted
}

option file {
  50001: bool redacted
}
```

## Applying options

Specifying options is very verbose, and the syntax for values is some weird proto-specific language. We will simplify the former and use JavaScript Object syntax (ie. JSON without quotes on keys).

- Disallow nested key references through message fields such as `foo.bar.waz`.
- The key will always be in the form `[<pkg>.]<name>`, where `<pkg>` is only necessary to disambiguate keys.
- Don't require fully qualified references unless there is ambiguity. eg. don't require `google.api.http`, only `http`.

The new syntax is modelled off Python/Java-like decorators:

```protobuf=
@redacted(true)
1: string name
```

May also be on a single line.
```protobuf=
@redacted(true) 2: int64 age
```

As a special case, boolean options may omit the true value:

```protobuf=
@redacted 2: int64 age
```

### Example

New syntax:

```protobuf=
service Furniture {
  @http(post="/v1/shelves", body="shelf")
  CreateShelf(CreateShelfRequest) Shelf

  @http(get="/v1/shelves/{shelf}")
  GetShelf(GetShelfRequest) Shelf
}
```

Old syntax:

```protobuf=
service Furniture {
  rpc CreateShelf(CreateShelfRequest) returns (Shelf) {
    option (google.api.http) = {
      post: "/v1/shelves"
      body: "shelf"
    };
  }
  rpc GetShelf(GetShelfRequest) returns (Shelf) {
    option (google.api.http) = {
      get: "/v1/shelves/{shelf}"
    };
  }
}
```

# Semantic changes

## Simplified types

Varint types are not supported for user types, though they are used internally by the encoding. See the [Protobuf Encoding documentation](https://developers.google.com/protocol-buffers/docs/encoding) for more information.

This vastly simplifies implementations and opens up the possibility of zero (or close to zero) overhead marshalling/unmarshalling in some languages.

Supported scalar types are the following, all little-endian encoded without varint encoding.

| Type        | Encoding         |
| ----------- | ---------------- |
| `int32`     | 4 bytes          |
| `uint32`    | 4 bytes          |
| `int64`     | 8 bytes          |
| `uint64`    | 8 bytes          |
| `float32`   | 4 bytes          |
| `float64`   | 8 bytes          |

Variable-length types:

| Type         | Encoding                          |
| ------------ | --------------------------------- |
| `bytes`      | Varint length + bytes             |
| `string`     | Varint length + utf8 string       |
| `list<TYPE>` | Varint length + repeated elements |

## Semantic errors

All (most) of the linter checks in eg. Buf should be compiler errors in proto4.

## New types?

First-class support for duration/timestamps. No need to import the types explicitly?

| Type        | Encoding         |
| ----------- | ---------------- |
| `duration`  | 8 bytes (ns)     |
| `timestamp` | 8 bytes (ns)     |

eg.

```protobuf=
1: duration delay = 1m30s
```

## Don’t support capturing "unknown fields"?
Controversial, but I've needed this approximately zero times and it complicates the implementation quite a bit.

As an alternative to _removing_ support we could also store these out of band rather than in the Go struct itself. For example in Go:

```go
// Requires the "runtime" and "sync" packages.
var unknown sync.Map

msg, unknownFields := decodeMsg(msgBytes)
// msg must be a pointer, and keyFor must not retain it (eg. key on its
// address), otherwise the finalizer never runs.
runtime.SetFinalizer(msg, func(m proto4.Message) { unknown.Delete(keyFor(m)) })
unknown.Store(keyFor(msg), unknownFields)
```

Then to retrieve unknown fields for a message:

```go
func UnknownFields(msg proto4.Message) []byte {
  value, ok := unknown.Load(keyFor(msg))
  if !ok {
    return nil
  }
  return value.([]byte)
}
```

## First class support for redaction?

```protobuf=
message User {
  @redacted
  1: string ssn
}
```

# RPC

## Syntax

```protobuf=
@serviceoption
service Service {
  @methodoption
  Method(stream Request) stream Response
}
```

## Global functions?

Support for "global" functions? These would go in a package-scoped service.

```protobuf=
func Func(Request) Response
```

Equivalent to:

```protobuf=
service Package {
  Func(Request) Response
}
```

## Always generate a service interface?

The issue is that different RPC implementations may want different signatures, such as functional options, but perhaps there can be a "common" base interface that they can all implement.

# Package management

Modelled on Go, imports are absolute source control references. Also as with Go, all files in a directory are combined into a single namespace.

The package is always referenced by the last path component unless an alias is provided.

```protobuf=
import "github.com/protocolbuffers/protobuf/src/google/protobuf"
// Import as a local alias.
import "github.com/protocolbuffers/protobuf/src/google/protobuf" as protoalias

message Message {
  1: protobuf.Message message
  2: protoalias.Message aliased_message
}
```

There are no language-specific package options; packages are always deterministically mapped from their import path.
In Java this becomes:

```java=
package com.github.protocolbuffers.protobuf.src.google.protobuf;
```

A `proto4.mod` file might need to exist to tell the tooling where the files in the local tree reside in the package namespace. Similar to `go.mod`, eg.

```
$ cat proto4.mod
package github.com/example/myproject
$ cat service/service.proto4
message Message {}
```

This package would be importable as `github.com/example/myproject/service`.

# Code generation

In order to be compatible with the proto3 ecosystem, we will almost certainly have to use FileDescriptors as our intermediate format. The problem is that FileDescriptors don't have the semantic information we need, such as `sum` type information. We should be able to work around this by leveraging options to direct the proto4 code generators.

eg. This .proto4 file:

```protobuf=
sum SumType {
  1: Message
  2: int64
}
```

Might result in the following FileDescriptor (in .proto form):

```protobuf=
message SumType {
  option (proto4.oneof) = true;
  oneof value {
    Message first = 1;
    int64 second = 2;
  }
}
```

## Invocation

Simplify the code generation stage. Something like the following?

```
$ cat proto4.mod
package github.com/example/service
$ cat example/message.proto4
message Message {}
$ proto4 build --go --java ./...
$ find src
src/main/java/com/github/example/service/example/Message.java
```

# Go code generation

## Non-pointer zero values

For default zero message values, use non-pointer structs to reduce garbage:

```protobuf=
message Outer {
  message Inner {
    1: string name
  }
  1: Inner inner_value
  2: Inner? inner_optional
}
```

Generates the following Go code:

```go=
type OuterInner struct {
  Name string
}

type Outer struct {
  InnerValue    OuterInner
  InnerOptional *OuterInner
}
```

## Always generate marshal/unmarshal methods

Relying on reflection [is slow](https://github.com/LesnyRumcajs/grpc_bench/wiki/2022-01-11-bench-results), and because we've significantly simplified how protos are encoded, generating marshal/unmarshal methods should be relatively straightforward and _significantly_ faster.

## Where possible unmarshal arrays/bytes with zero-copy?

Given that binary-encoded proto4 never uses varints for scalar values, scalar lists can be mapped directly to the corresponding Go values.

eg.

```protobuf=
1: list<int32> ids
```

The binary encoding of `list<int32> ids = [1, 2, 3, 4, 5]` would be the varint length followed by five 4-byte little-endian elements:

```
0x5
0x1 0x0 0x0 0x0
0x2 0x0 0x0 0x0
0x3 0x0 0x0 0x0
0x4 0x0 0x0 0x0
0x5 0x0 0x0 0x0
```

Once the length is decoded, the Go decoder _could_ type cast the underlying `[]byte` values of the buffer to `[]int32`, bypassing any copying.

The downside here would be that the slice reference would keep the entire Proto byte buffer in memory. This may not be a good idea in practice.

---

# Full Example

```protobuf=
// Package name is always a simple identifier.
//
// File options are applied to the package?
@my_file_option("yes")
package example

// No "syntax" - use extension .proto4

// No semi-colons!

// Enumerated constants are identical to proto3, except there must
// be a zero value and it is the default if not provided.
enum Enum {
  NONE = 0
  ONE = 1
}

// Support for sum types. Must always be optional? How else do we
// model a default? Error on serialisation?
sum SumType {
  1: Message
  2: int64
}

// Generics?
message LinkedList<Value> {
  1: Value value
  2: LinkedList<Value>? next
}

message Message {
  // If not provided use default value of type (0 in this case).
  1: int64 id
  // "?" is optional.
  2: string? name
  // Valid because we've provided a default value?
  3: SumType sum_type = 123
  // Valid, optional.
  4: SumType? valid_sum_type
  // Don't have repeated fields? Syntactic sugar but nice?
  5: list<SumType?> friends
  // User-defined generic types.
  6: LinkedList<string> strings
  // Arbitrarily nested types? How do we encode this? See below.
  7: map<string, list<LinkedList<SumType?>>> things
}

// Translation of option examples from [here](<https://scalapb.github.io/docs/user_defined_options/>).
option file {
  50000: string? my_file_option
}

message MyMessageOption {
  1: int32? priority
}

option message {
  50001: MyMessageOption? my_message_option
}

// Applying options.
option my_file_option = "yes"

@my_message_option(priority=123)
// Same as above but with explicit package to disambiguate.
@pkg.my_message_option(priority=123)
message MyMessage {
  @field_option(true)
  1: string my_field
}
```

# Generated code examples

Here’s how sum types, arbitrarily nested types, and generics might be generated in a couple of languages, and their equivalent proto3 representation.

This proto4:

```protobuf=
sum SumType {
  1: Message
  2: int64
}

// Generics?
message LinkedList<Value> {
  1: Value value
  2: LinkedList<Value>? next
}

message Message {
  7: map<string, list<LinkedList<SumType>>> things
}
```

Becomes this TypeScript:

```typescript=
type SumType = Message | number

type LinkedList<Value> = {
  value: Value
  next?: LinkedList<Value>
}

type Message = {
  // ...
}
```

Becomes this Go:

```go=
type SumType interface { sumType() }

type SumTypeMessage Message
func (SumTypeMessage) sumType() {}

type SumTypeInt64 int64
func (SumTypeInt64) sumType() {}

type LinkedList[Value any] struct {
  Value Value
  Next  *LinkedList[Value]
}

type Message struct {
  Things map[string][]LinkedList[SumType]
}
```

And the proto3 representation might be something like this:

```protobuf=
message SumType {
  option (proto4.oneof) = true;
  oneof value {
    Message first = 1;
    int64 second = 2;
  }
}

message LinkedList__SumType {
  optional SumType value = 1;
  optional LinkedList__SumType next = 2;
}

message Message {
  message __anonymous_type_0 {
    repeated LinkedList__SumType _ = 1;
  }
  map<string, __anonymous_type_0> things = 7;
}
```

# Other wild ideas

## Require an envelope that allows for indexed random access

Currently it's impossible to access a random nested value in an encoded protobuf without completely decoding it. We could _require_ some kind of envelope type that indexes into the real structure. The index could be ID-based, so it could be fairly compact.