TSON Specification

The name is really just a play on JSON, with the T meaning "table"; if you like, it can also be read as the backronym Table Serialization Object Notation.

This is a work-in-progress specification.

Data Types

Tag  Name           Summary
0    None           Empty value. 0-byte representation.
1    Integer        Variable length signed integer.
2    Float32        32-bit floating point number.
3    Float64        64-bit floating point number.
4    String         UTF-8 encoded text. Length is VarUInt.
5    FixedIntArray  An array of fixed-size integers.
6    List           Single element type.
7    Tuple          A group of disjoint element types, of fixed length.
8    Record         Fixed name-value mapping (e.g. a struct).
9    Dictionary     Key to value mapping. Keys are not limited to strings.
10   Union          Allows an alternation between multiple types.

Format

Structure

Files must start with these two bytes:

magic version
0x72 0x00

After this comes the document schema, and after that comes the payload. To parse a document, the schema first needs to be read into a tree-based description, and then the payload is interpreted by applying rules based on the schema.

The schema describes, from the root downwards, the structure of the document that will follow. This starts with the type description of the root tag.
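As a minimal sketch of the first parsing step, the two header bytes can be checked before the schema is read. This assumes the whole file is held in a bytes object; check_header is a hypothetical helper name, not something the specification defines.

```python
MAGIC = 0x72    # first byte of every file
VERSION = 0x00  # second byte: format version


def check_header(data: bytes) -> int:
    """Validate the two header bytes and return the offset where the schema starts."""
    if len(data) < 2:
        raise ValueError("file too short to contain a header")
    if data[0] != MAGIC:
        raise ValueError(f"bad magic byte 0x{data[0]:02x}")
    if data[1] != VERSION:
        raise ValueError(f"unsupported version 0x{data[1]:02x}")
    return 2
```

The returned offset is where a schema reader would begin with the root type description.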

varuint

For compact encoding, integers are stored in a variable length representation. The most-significant bit (0x80) is used as a continuation marker - if it is set to 1 then there is another byte following. The remaining 7 bits are parts of the integer. The payload is stored in big-endian order.

It follows naturally that numbers in the range 0 to 127 (inclusive) are represented using 1 byte.
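The varuint rules can be sketched as follows. The function names encode_varuint and decode_varuint are illustrative only, and the sketch assumes the input is a bytes object held in memory.

```python
def encode_varuint(n: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, most significant group first."""
    if n < 0:
        raise ValueError("varuint cannot encode negative numbers")
    groups = [n & 0x7F]  # final byte: continuation bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)  # earlier bytes: continuation bit set
        n >>= 7
    return bytes(reversed(groups))


def decode_varuint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_pos) for the varuint that starts at data[pos]."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos
```

For instance, 300 needs two groups (0b0000010 and 0b0101100), so it encodes to the bytes 0x82 0x2C.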

varsint

Signed integers in the variable length encoding scheme.

These are stored the same way as varuint, except that the least significant bit of the decoded integer holds the sign.

A sign bit of 0 indicates a non-negative number, while 1 indicates a negative number. Negative values follow the two's complement convention, which means there's no negative zero.
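The text leaves the exact bit layout open, so the following is only one possible reading, stated as an assumption: the low bit is the sign, the remaining bits hold the value for non-negative numbers, and for negative numbers they hold the bitwise complement (so sign 1 with remaining bits 0 means -1). The helper names are hypothetical.

```python
def signed_to_unsigned(n: int) -> int:
    """Map a signed integer to the unsigned value that would be written as a varuint.

    Assumed layout: bit 0 is the sign, the upper bits are n (if n >= 0) or ~n (if n < 0).
    """
    if n >= 0:
        return n << 1
    return (~n << 1) | 1


def unsigned_to_signed(u: int) -> int:
    """Inverse of signed_to_unsigned under the same assumed layout."""
    upper = u >> 1
    return ~upper if u & 1 else upper
```

Under this reading every unsigned value maps to exactly one signed value, which matches the statement that there is no negative zero.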

Strings

Strings consist of a length value followed by the payload.

len      str_i
varuint  byte

Strings must be valid UTF-8 text. Invalid strings should be stored as a byte array tagged with a usage hint instead.
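A rough sketch of string encoding and decoding follows. It assumes the length counts UTF-8 bytes rather than code points, which the byte-typed payload column suggests but the text does not state outright. All function names are illustrative.

```python
def encode_varuint(n: int) -> bytes:
    """Big-endian 7-bit groups; every byte except the last has 0x80 set."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(groups))


def decode_varuint(data: bytes, pos: int) -> tuple[int, int]:
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def encode_string(text: str) -> bytes:
    """Length prefix (assumed to be a byte count) followed by the UTF-8 payload."""
    payload = text.encode("utf-8")
    return encode_varuint(len(payload)) + payload


def decode_string(data: bytes, pos: int = 0) -> tuple[str, int]:
    """Return (text, next_pos); raises UnicodeDecodeError on invalid UTF-8."""
    length, pos = decode_varuint(data, pos)
    end = pos + length
    if end > len(data):
        raise ValueError("truncated string payload")
    return data[pos:end].decode("utf-8"), end
```

Strict decoding is used on purpose: since strings must be valid UTF-8, a decoder can reject anything else instead of silently replacing bytes.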

Type descriptions

Type descriptions start with a 1-byte integer which contains the type tag, and end with a string for the usage hint.

tag   content           usage
byte  (depends on tag)  string

The contents of the descriptor depend on the tag:

Primitives (None, String, Integer, Float32, Float64) have no content (zero-length).

Fixed int arrays have a length marker, followed by an integer type byte (prim).

tag   len      prim  usage
0x05  varuint  byte  string

The high bit of the prim byte is 1 if the type is signed, 0 if not. The lower 7 bits encode the exponent of a power of two that gives the width of the integer in bits. The exponent must not be higher than 7 (128 bits). In the most common case, a byte array, prim = 3.

Arrays for 1, 2, and 4-bit integers are padded to the nearest byte boundary at the end of the array.

# size in bits
0 1
1 2
2 4 (nibble)
3 8 (byte/octet)
4 16
5 32
6 64
7 128
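The prim byte and the padding rule can be illustrated with a short sketch. decode_prim and packed_size are hypothetical helper names, and the sketch only computes sizes; it makes no claim about bit order inside a byte.

```python
def decode_prim(prim: int) -> tuple[bool, int]:
    """Split a prim byte into (signed, width_in_bits)."""
    signed = bool(prim & 0x80)   # high bit: signedness
    exponent = prim & 0x7F       # low 7 bits: power-of-two exponent
    if exponent > 7:
        raise ValueError("integer width above 128 bits is not allowed")
    return signed, 1 << exponent


def packed_size(count: int, bits: int) -> int:
    """Bytes occupied by `count` integers of `bits` width, padded to a whole byte."""
    return (count * bits + 7) // 8
```

For example, three 4-bit integers occupy 12 bits and are padded to 2 bytes, while widths of 8 bits or more never need padding.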

Lists have a length marker and then a type descriptor. The element type must not be None.

tag   len      elt        usage
0x06  varuint  type desc  string

Tuples have a length value N followed by N type descriptors. Implementations should omit fields that have a value of None.

tag   len      elt_i      usage
0x07  varuint  type desc  string

Records have a length value N followed by N field descriptions. Implementations should omit fields that have a value of None.

tag   len      name_i  type_i     usage
0x08  varuint  string  type desc  string

Dictionaries have two type descriptors, one for the key type and one for the value type. The key must not be None. The value may be None, in which case the mapping should be interpreted as a set rather than a dictionary.

tag   key        value      usage
0x09  type desc  type desc  string

Unions have a length value N followed by N variants.

tag   len      name_i  type_i     usage
0x0a  varuint  string  type desc  string

Unions that only have 1 variant should not be emitted. Unions that have 0 variants must not be emitted. Implementations should avoid generating union variants that are not used in the payload that follows.
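The descriptor layouts above lend themselves to a small recursive reader. The sketch below is one way it could look, not a reference implementation: the Reader class, read_type, and the dictionary keys used for the resulting tree are all assumptions made for illustration.

```python
PRIMITIVE_TAGS = {0, 1, 2, 3, 4}  # None, Integer, Float32, Float64, String


class Reader:
    """Cursor over an in-memory byte string."""

    def __init__(self, data: bytes, pos: int = 0):
        self.data, self.pos = data, pos

    def byte(self) -> int:
        b = self.data[self.pos]
        self.pos += 1
        return b

    def varuint(self) -> int:
        value = 0
        while True:
            b = self.byte()
            value = (value << 7) | (b & 0x7F)
            if not b & 0x80:
                return value

    def string(self) -> str:
        n = self.varuint()  # assumed to be a byte count
        raw = self.data[self.pos:self.pos + n]
        self.pos += n
        return raw.decode("utf-8")


def read_type(r: Reader) -> dict:
    """Read one type description: tag, tag-specific content, then the usage hint."""
    tag = r.byte()
    node = {"tag": tag}
    if tag in PRIMITIVE_TAGS:
        pass  # primitives have no content
    elif tag == 5:  # FixedIntArray: len, prim
        node["len"] = r.varuint()
        node["prim"] = r.byte()
    elif tag == 6:  # List: len, element type
        node["len"] = r.varuint()
        node["elt"] = read_type(r)
    elif tag == 7:  # Tuple: N element types
        node["elts"] = [read_type(r) for _ in range(r.varuint())]
    elif tag in (8, 10):  # Record / Union: N (name, type) pairs
        count = r.varuint()
        node["fields"] = [(r.string(), read_type(r)) for _ in range(count)]
    elif tag == 9:  # Dictionary: key type, value type
        node["key"] = read_type(r)
        node["value"] = read_type(r)
    else:
        raise ValueError(f"unknown type tag {tag}")
    node["usage"] = r.string()
    return node
```

As a usage example, the bytes 08 02 02 "id" 01 00 04 "name" 04 00 00 describe a record with an Integer field id and a String field name, all with empty usage hints.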

Payload Format

The root of the document (a value) is described using the root of the schema (a type description).

How a particular value is decoded depends on its type description.

For primitives:

  • None reads nothing.
  • String reads a string, described above.
  • Integer reads a varsint, described above.
  • Float32 reads 4 bytes.
  • Float64 reads 8 bytes.

For arrays, the behavior is different depending on the length marker. For arrays with a length marker of 0, read a varuint N and then read N elements of the element type.

len      element_i
varuint  element type

For arrays with a non-zero length marker M, read M elements of the element type.

element_i
element type

For tuples, read each element type sequentially.

element_i
element_i type

For records, read the value types of each field in the order they were written in the schema.

value_i
value type

For dictionaries, read a length value N and then read N key/value pairs.

len      key_i     value_i
varuint  key type  value type

For unions, read a varuint index N, and then read a value of the type described by the Nth variant.

index    value
varuint  variant_index
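The payload rules can be sketched as a recursive function that walks a schema tree. Several things here are assumptions rather than part of the specification: the schema is represented as plain dictionaries with hypothetical keys (tag, len, elt, elts, fields, key, value), Integer uses a sign-in-low-bit mapping where negative values store the bitwise complement, and union values are returned as (variant name, value) pairs. Float32, Float64 and FixedIntArray are left out because the byte and bit order of their payloads is not stated in this section.

```python
def read_varuint(data: bytes, pos: int) -> tuple[int, int]:
    value = 0
    while True:
        b = data[pos]
        pos += 1
        value = (value << 7) | (b & 0x7F)
        if not b & 0x80:
            return value, pos


def read_value(node: dict, data: bytes, pos: int = 0):
    """Decode one value described by `node`; returns (value, next_pos)."""
    tag = node["tag"]
    if tag == 0:  # None reads nothing
        return None, pos
    if tag == 1:  # Integer (assumed layout: bit 0 = sign, rest = n or ~n)
        u, pos = read_varuint(data, pos)
        return (~(u >> 1) if u & 1 else u >> 1), pos
    if tag == 4:  # String: byte length, then UTF-8 payload
        n, pos = read_varuint(data, pos)
        return data[pos:pos + n].decode("utf-8"), pos + n
    if tag == 6:  # List: length comes from the schema unless the marker is 0
        n = node["len"]
        if n == 0:
            n, pos = read_varuint(data, pos)
        items = []
        for _ in range(n):
            item, pos = read_value(node["elt"], data, pos)
            items.append(item)
        return items, pos
    if tag == 7:  # Tuple: each element type in order
        items = []
        for elt in node["elts"]:
            item, pos = read_value(elt, data, pos)
            items.append(item)
        return tuple(items), pos
    if tag == 8:  # Record: field values in schema order
        record = {}
        for name, ftype in node["fields"]:
            value, pos = read_value(ftype, data, pos)
            record[name] = value
        return record, pos
    if tag == 9:  # Dictionary: count, then key/value pairs (keys must be hashable here)
        n, pos = read_varuint(data, pos)
        result = {}
        for _ in range(n):
            key, pos = read_value(node["key"], data, pos)
            value, pos = read_value(node["value"], data, pos)
            result[key] = value
        return result, pos
    if tag == 10:  # Union: variant index, then a value of that variant's type
        index, pos = read_varuint(data, pos)
        name, vtype = node["fields"][index]
        value, pos = read_value(vtype, data, pos)
        return (name, value), pos
    raise NotImplementedError(f"tag {tag} is not covered by this sketch")
```

With a record schema of an Integer field id and a variable-length String list tags, the payload bytes 05 02 01 "a" 02 "bc" decode to id = -3 and tags = ["a", "bc"] under the assumptions above.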

Usage hints

Usage hints are designed as a way for implementations to pass through native data types and allow for 1:1 encode and decode. A few common usage hints are pre-defined to maximize compatibility between languages and universality of tooling.

Unlike raw data types, new usage hints can be added without existing implementations needing to be adapted to support them.

Implementations should define their own namespace when adding new formats not specified here. This includes sub-formats like those of datetime. For example, one might define a json:null type to distinguish between null and lack-of-existence in JSON.

URN                    Base Type               Summary
tson:bool              Numeric[1]              0 for false, 1 for true.
tson:uuid              FixedIntArray<u8>       UUID stored as big endian bytes.
tson:uuid              String                  UUID stored as a hex string.[2]
tson:display/hex       Integer, FixedIntArray  Display as hexadecimal (base 16).
tson:display/octal     Integer, FixedIntArray  Display as octal (base 8).
tson:display/binary    Integer, FixedIntArray  Display as binary (base 2).
tson:datetime/unix     Integer                 Seconds since Jan 1, 1970 UTC.[3]
tson:datetime/iso8601  String                  ISO 8601 formatted string.
tson:datetime/http     String                  RFC 7231 formatted string.
tson:string/utf16      FixedIntArray<u16>      UTF-16 text with BOM.[4]
tson:unit/bytes        Numeric[1]              Represents a quantity in bytes.[3]
tson:unit/bits         Numeric[1]              Represents a quantity in bits.[3]
tson:unit/seconds      Numeric[1]              Represents a quantity in seconds.[3]

Scalable units may have the query parameters mul and div. Together they form a ratio (mul divided by div) by which the stored number is multiplied to obtain the value in the base unit. Both numbers must be integers, but may be written using scientific notation.

Examples

  • tson:unit/bytes?mul=1000: Kilobytes (KB).
  • tson:unit/bytes?mul=1048576: Mebibytes (MiB).
  • tson:unit/seconds?div=1e9: Nanoseconds (ns).
  • tson:unit/seconds?div=60: 60Hz timer ticks.
  • tson:datetime/unix?div=1e3: Milliseconds since epoch.
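The scaling rule can be sketched with the Python standard library. parse_hint and to_base_unit are hypothetical names, and treating a missing parameter as 1 is an assumption; the text does not spell out defaults.

```python
from decimal import Decimal
from fractions import Fraction
from urllib.parse import parse_qs


def parse_hint(hint: str) -> tuple[str, Fraction]:
    """Split a usage hint into (name, ratio), where ratio = mul / div."""
    name, _, query = hint.partition("?")
    params = parse_qs(query)

    def integer(key: str) -> int:
        text = params.get(key, ["1"])[0]  # assumed default of 1 when absent
        number = Decimal(text)            # accepts scientific notation such as 1e9
        if number != number.to_integral_value():
            raise ValueError(f"{key} must be an integer, got {text!r}")
        return int(number)

    return name, Fraction(integer("mul"), integer("div"))


def to_base_unit(value, hint: str):
    """Scale a stored number into the base unit named by the hint."""
    _, ratio = parse_hint(hint)
    return value * ratio
```

Using Fraction keeps integer inputs exact: 1500 with tson:datetime/unix?div=1e3 becomes 3/2 seconds rather than a rounded float.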

  1. Integer, Float32, Float64.

  2. This format should be generally avoided in favor of the array representation. UUIDs stored as a string should be assumed to potentially have wrapping curly braces {} and included dashes -.

  3. Scalable.

  4. tson:string/utf16/be and tson:string/utf16/le may be used in order to omit the BOM. This format should generally be assumed to potentially contain orphaned surrogate pairs.