# TSON Specification

It's really just a play on JSON with the T meaning "table", but if you like, it can also be backronymed as Table Serialization Object Notation.

This is a work-in-progress specification.

## Data Types

| Tag | Name          | Summary                                                |
| --- | ------------- | ------------------------------------------------------ |
| 0   | None          | Empty value. 0-byte representation.                    |
| 1   | Integer       | Variable length signed integer.                        |
| 2   | Float32       | 32-bit floating point number.                          |
| 3   | Float64       | 64-bit floating point number.                          |
| 4   | String        | UTF-8 encoded text. Length is VarUInt.                 |
| 5   | FixedIntArray | An array of fixed-size integers.                       |
| 6   | List          | Single element type.                                   |
| 7   | Tuple         | A group of disjoint element types, of fixed length.    |
| 8   | Record        | Fixed name-value mapping (e.g. a struct).              |
| 9   | Dictionary    | Key to value mapping. Keys are not limited to strings. |
| 10  | Union         | Allows an alternation between multiple types.          |

## Format

### Structure

Files **must** start with these two bytes:

| magic | version |
| ----- | ------- |
| 0x72  | 0x00    |

After this comes the document schema, and after that comes the payload.

To parse a document, the schema first needs to be read into a tree-based description, and then the payload is interpreted by applying rules based on the schema.

The schema describes, from the root downwards, the structure of the document that will follow. This starts with the type description of the root tag.

### `varuint`

For compact encoding, integers are stored in a variable length representation.

The most-significant bit (0x80) of each byte is used as a continuation marker: if it is set to 1, another byte follows. The remaining 7 bits hold part of the integer. The payload is stored in big-endian order.

As a consequence, numbers in the range 0 to 127 (inclusive) are represented using 1 byte.

### `varsint`

Signed integers in the variable length encoding scheme.
These are stored the same way as `varuint`, but the least significant bit of the final integer additionally stores the sign bit. The sign bit is in two's complement format, which means there is no negative zero. A sign bit of 0 indicates a positive number, while 1 indicates a negative number.

### Strings

Strings consist of a length value followed by the payload.

| len     | str<sub>i</sub> |
| ------- | --------------- |
| varuint | byte            |

Strings **must** be valid UTF-8 text. Invalid strings should be stored as a byte array tagged with a usage hint instead.

### Type descriptions

Type descriptions start with a 1-byte integer which contains the type tag, and end with a string for the usage hint.

| tag  | content | usage  |
| ---- | ------- | ------ |
| byte | ...     | string |

The contents of the descriptor depend on the tag:

Primitives (None, String, Integer, Float32, Float64) have no content (zero-length).

Fixed int arrays have a length marker, followed by an integer type byte (`prim`).

| tag  | len     | prim | usage  |
| ---- | ------- | ---- | ------ |
| 0x05 | varuint | byte | string |

The high bit of the `prim` byte is 1 if the type is signed, 0 if not. The lower 7 bits encode the exponent of a power of two that gives the length of the integer in bits. This exponent must not be higher than 7 (128 bits). In the most common case, a byte array, `prim` is set to 3. Arrays of 1-, 2-, and 4-bit integers are padded at the end of the array to the next byte boundary.

| exponent | size in bits   |
| -------- | -------------- |
| 0        | 1              |
| 1        | 2              |
| 2        | 4 (nibble)     |
| 3        | 8 (byte/octet) |
| 4        | 16             |
| 5        | 32             |
| 6        | 64             |
| 7        | 128            |

Lists have a length marker and then a type descriptor. The element type **must not** be None.

| tag  | len     | elt       | usage  |
| ---- | ------- | --------- | ------ |
| 0x06 | varuint | type desc | string |

Tuples have a length value N followed by N type descriptors. Implementations **should** omit fields that have a value of None.

| tag  | len     | elt<sub>i</sub> | usage  |
| ---- | ------- | --------------- | ------ |
| 0x07 | varuint | type desc       | string |

Records have a length value N followed by N field descriptions. Implementations **should** omit fields that have a value of None.

| tag  | len     | name<sub>i</sub> | type<sub>i</sub> | usage  |
| ---- | ------- | ---------------- | ---------------- | ------ |
| 0x08 | varuint | string           | type desc        | string |

Dictionaries have two type descriptors for the key and value types. The key **must not** be None. The value **may** be None, which **should** be interpreted as a set rather than a dictionary.

| tag  | key       | value     | usage  |
| ---- | --------- | --------- | ------ |
| 0x09 | type desc | type desc | string |

Unions have a length value N followed by N variants.

| tag  | len     | name<sub>i</sub> | type<sub>i</sub> | usage  |
| ---- | ------- | ---------------- | ---------------- | ------ |
| 0x0a | varuint | string           | type desc        | string |

Unions that only have 1 variant **should not** be emitted. Unions that have 0 variants **must not** be emitted. Implementations **should** avoid generating union variants that are not used in the payload that follows.

### Payload Format

The root of the document (a value) is described using the root of the schema (a type description).

How a particular value is decoded depends on its type description.

For primitives:

- None reads nothing.
- String reads a string, described above.
- Integer reads a varsint, described above.
- Float32 reads 4 bytes.
- Float64 reads 8 bytes.

For arrays, the behavior depends on the length marker.

For arrays with a length marker of 0, read a varuint N and then read N elements of the element type.

| len     | element<sub>i</sub> |
| ------- | ------------------- |
| varuint | element type        |

For arrays with a non-zero length marker M, read M elements of the element type.

| element<sub>i</sub> |
| ------------------- |
| element type        |

For tuples, read each element type sequentially.

| element<sub>i</sub>      |
| ------------------------ |
| element<sub>i</sub> type |

For records, read the value types of each field in the order they were written in the schema.

| value<sub>i</sub> |
| ----------------- |
| value type        |

For dictionaries, read a length value N and then read N key/value pairs.

| len     | key<sub>i</sub> | value<sub>i</sub> |
| ------- | --------------- | ----------------- |
| varuint | key type        | value type        |

For unions, read a varuint index N, and then read a value using the type descriptor of the Nth variant.

| index   | value                   |
| ------- | ----------------------- |
| varuint | variant<sub>index</sub> |

## Usage hints

Usage hints are designed as a way for implementations to pass through native data types and allow for 1:1 encode and decode. A few common usage hints are pre-defined to maximize compatibility between languages and universality of tooling.

Unlike raw data types, new usage hints can be added without existing implementations needing to be adapted to support them.

Implementations should define their own namespace when adding new formats not specified here. This includes sub-formats like those of `datetime`. For example, one might define a `json:null` type to distinguish between `null` and lack-of-existence in JSON.

| URN                     | Base Type              | Summary                               |
| ----------------------- | ---------------------- | ------------------------------------- |
| `tson:bool`             | Numeric[^4]            | 0 for false, 1 for true.              |
| `tson:uuid`             | `FixedIntArray<u8>`    | UUID stored as big endian bytes.      |
| `tson:uuid`             | String                 | UUID stored as a hex string.[^3]      |
| `tson:display/hex`      | Integer, FixedIntArray | Display as hexadecimal (base 16).     |
| `tson:display/octal`    | Integer, FixedIntArray | Display as octal (base 8).            |
| `tson:display/binary`   | Integer, FixedIntArray | Display as binary (base 2).           |
| `tson:datetime/unix`    | Integer                | Seconds since Jan 1, 1970 UTC.[^1]    |
| `tson:datetime/iso8601` | String                 | ISO 8601 formatted string.            |
| `tson:datetime/http`    | String                 | [RFC 7231][1] formatted string.       |
| `tson:string/utf16`     | `FixedIntArray<u16>`   | UTF-16 text with BOM.[^2]             |
| `tson:unit/bytes`       | Numeric[^4]            | Represents a quantity in bytes.[^1]   |
| `tson:unit/bits`        | Numeric[^4]            | Represents a quantity in bits.[^1]    |
| `tson:unit/seconds`     | Numeric[^4]            | Represents a quantity in seconds.[^1] |

[1]: https://tools.ietf.org/html/rfc7231#section-7.1.1.1

[^1]: Scalable.

[^2]: `tson:string/utf16/be` and `tson:string/utf16/le` may be used in order to omit the BOM. This format should generally be assumed to potentially contain orphaned surrogate pairs.

[^3]: This format should generally be avoided in favor of the array representation. UUIDs stored as a string should be assumed to potentially have wrapping curly braces `{}` and included dashes `-`.

[^4]: Integer, Float32, Float64.

Scalable units may have query parameters `mul` and `div`. These two combine to form a ratio that the number should be scaled by to obtain the base unit. Both numbers must be integers, but may be written using scientific notation.

### Examples

- `tson:unit/bytes?mul=1000`: Kilobytes (KB).
- `tson:unit/bytes?mul=1048576`: Mebibytes (MiB).
- `tson:unit/seconds?div=1e9`: Nanoseconds (ns).
- `tson:unit/seconds?div=60`: 60Hz timer ticks.
- `tson:datetime/unix?div=1e3`: Milliseconds since epoch.
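
The scaling rule for `mul` and `div` can be illustrated with a small Python sketch. It is not part of the specification: the function names `parse_integer`, `scale_factor`, and `to_base_unit` are hypothetical, and treating a missing `mul` or `div` as 1 is an assumption, since the text above does not state a default. Exact fractions are used so that ratios such as `div=60` are not rounded.

```python
from fractions import Fraction
from urllib.parse import parse_qs


def parse_integer(text):
    """Parse an integer that may use scientific notation, e.g. '1e9'."""
    value = Fraction(text)  # Fraction accepts strings such as '1e9' or '60'
    if value.denominator != 1:
        raise ValueError(f"not an integer: {text!r}")
    return int(value)


def scale_factor(hint):
    """Split a usage hint into its base name and its mul/div ratio."""
    name, _, query = hint.partition("?")
    params = parse_qs(query)
    # Assumption: an absent parameter behaves like 1 (no scaling).
    mul = parse_integer(params.get("mul", ["1"])[0])
    div = parse_integer(params.get("div", ["1"])[0])
    return name, Fraction(mul, div)


def to_base_unit(value, hint):
    """Scale a stored number to the base unit named by the hint."""
    _, ratio = scale_factor(hint)
    return value * ratio


if __name__ == "__main__":
    # 3 MiB expressed in bytes.
    print(to_base_unit(3, "tson:unit/bytes?mul=1048576"))
    # 1500 milliseconds since the epoch expressed in seconds.
    print(to_base_unit(1500, "tson:datetime/unix?div=1e3"))
```

A decoder that does not recognize the scaling parameters can still read the underlying number unchanged, which matches the intent that usage hints do not require existing implementations to be adapted.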