TSON Specification

The name is really just a play on JSON with the T meaning "table", but if you like, it can also be backronymed as Table Serialization Object Notation.

This is a work-in-progress specification.

Data Types

Tag	Name	Summary
0	None	Empty value. 0-byte representation.
1	Integer	Variable length signed integer.
2	Float32	32-bit floating point number.
3	Float64	64-bit floating point number.
4	String	UTF-8 encoded text. Length is VarUInt.
5	FixedIntArray	An array of fixed-size integers.
6	List	Single element type.
7	Tuple	A group of disjoint element types, of fixed length.
8	Record	Fixed name-value mapping (e.g. a struct).
9	Dictionary	Key to value mapping. Keys are not limited to strings.
10	Union	Allows an alternation between multiple types.

Format

Structure

Files must start with these two bytes:

magic	version
0x72	0x00

After this comes the document schema, and after that comes the payload. To parse a document, the schema first needs to be read into a tree-based description, and then the payload is interpreted by applying rules based on the schema.

The schema describes, from the root downwards, the structure of the document that will follow. This starts with the type description of the root tag.
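As an illustration of the two-byte header described above, here is a minimal sketch of a header check. The function name `split_header` is hypothetical, and rejecting any version other than 0x00 is an assumption; the specification only states which bytes a file must start with.

```python
MAGIC = 0x72    # magic byte from the header table
VERSION = 0x00  # version byte from the header table


def split_header(data: bytes) -> bytes:
    """Validate the two-byte header and return the rest (schema, then payload)."""
    if len(data) < 2:
        raise ValueError("document is shorter than the two-byte header")
    if data[0] != MAGIC:
        raise ValueError(f"bad magic byte: {data[0]:#04x}")
    if data[1] != VERSION:
        # Assumption: unknown versions are rejected rather than parsed leniently.
        raise ValueError(f"unsupported version: {data[1]:#04x}")
    return data[2:]
```

The returned bytes begin with the root type description, which is then followed by the payload.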

`varuint`

For compact encoding, integers are stored in a variable length representation. The most significant bit (0x80) of each byte is used as a continuation marker: if it is set to 1, another byte follows. The remaining 7 bits hold part of the integer. The 7-bit groups are stored in big-endian order, most significant group first.

As a consequence, numbers in the range 0 to 127 (inclusive) are represented using 1 byte.
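The following is a minimal sketch of the varuint encoding, assuming that "big-endian" means the most significant 7-bit group is written first. The function names `encode_varuint` and `decode_varuint` are illustrative and not part of the specification.

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, most significant first."""
    if value < 0:
        raise ValueError("varuint cannot encode negative numbers")
    groups = [value & 0x7F]  # final byte: continuation bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes: continuation bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_varuint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a varuint starting at pos; return (value, next position)."""
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varuint")
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # continuation bit clear: this was the last byte
            return value, pos
```

Under this reading, 127 encodes as the single byte 0x7f, while 128 needs two bytes: 0x81 0x00.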

`varsint`

Signed integers in the variable length encoding scheme.

These are stored the same way as varuint, but the least significant bit of the decoded integer stores the sign: 0 indicates a non-negative number, while 1 indicates a negative number.

Negative numbers follow two's complement conventions, which means there's no negative zero.
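The text above does not spell out the exact mapping between a signed value and the stored unsigned integer. The sketch below shows one plausible reading, which is an assumption rather than a confirmed part of the format: the sign occupies the lowest bit, and negative values store the bitwise complement in the remaining bits, so that no bit pattern is spent on a negative zero. The function names are hypothetical.

```python
def signed_to_stored(n: int) -> int:
    """Map a signed integer to the unsigned value that would be written as a varuint.

    Assumed mapping: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    """
    if n >= 0:
        return n << 1            # sign bit 0
    return ((~n) << 1) | 1       # sign bit 1, complement in the upper bits


def stored_to_signed(raw: int) -> int:
    """Inverse of signed_to_stored."""
    upper = raw >> 1
    return ~upper if raw & 1 else upper
```

With this mapping every stored value corresponds to exactly one signed integer, and small magnitudes of either sign stay within a single byte.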

Strings

Strings consist of a length value followed by the payload.

len	str_i
varuint	byte

Strings must be valid UTF-8 text. Invalid strings should be stored as a byte array tagged with a usage hint instead.
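A minimal sketch of the string layout follows. It assumes the length counts UTF-8 bytes rather than characters (the table lists each str_i as a byte) and that varuint groups are written most significant first. All function names are illustrative.

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, most significant first."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(groups))


def decode_varuint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def encode_string(text: str) -> bytes:
    """Length prefix (in bytes) followed by the UTF-8 payload."""
    payload = text.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
    return encode_varuint(len(payload)) + payload


def decode_string(data: bytes, pos: int = 0) -> tuple[str, int]:
    """Return (text, next position); rejects truncated or invalid UTF-8 input."""
    length, pos = decode_varuint(data, pos)
    end = pos + length
    if end > len(data):
        raise ValueError("truncated string")
    return data[pos:end].decode("utf-8"), end  # raises UnicodeDecodeError if invalid
```

Text that fails the UTF-8 check in `encode_string` is the case the paragraph above directs to a byte array with a usage hint.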

Type descriptions

Type descriptions start with a 1-byte integer which contains the type tag, and end with a string for the usage hint.

tag	content	usage
byte	…	string

The contents of the descriptor depend on the tag:

Primitives (None, String, Integer, Float32, Float64) have no content (zero-length).

Fixed int arrays have a length marker, followed by an integer type byte (prim).

tag	len	prim	usage
0x05	varuint	byte	string

The high bit of the prim byte is 1 if the type is signed, 0 if not. The lower 7 bits encode a power of two representing the length of the integer in bits. This exponent must not be higher than 7 (128 bits). The most common case, a byte array, sets prim = 3.

Arrays of 1, 2, and 4-bit integers are padded at the end of the array up to the next byte boundary.

#	size in bits
0	1
1	2
2	4 (nibble)
3	8 (byte/octet)
4	16
5	32
6	64
7	128
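The prim byte and the padding rule can be sketched as follows. The helper names are hypothetical; the bit layout (sign in 0x80, exponent in the low 7 bits, exponent at most 7) comes from the text above.

```python
def decode_prim(prim: int) -> tuple[bool, int]:
    """Split a prim byte into (signed, width in bits)."""
    signed = bool(prim & 0x80)
    exponent = prim & 0x7F
    if exponent > 7:
        raise ValueError(f"integer width exponent {exponent} exceeds 7 (128 bits)")
    return signed, 1 << exponent


def encode_prim(signed: bool, bits: int) -> int:
    """Build a prim byte from a signedness flag and a power-of-two width."""
    if bits not in (1, 2, 4, 8, 16, 32, 64, 128):
        raise ValueError(f"unsupported integer width: {bits}")
    exponent = bits.bit_length() - 1  # log2 of a power of two
    return (0x80 if signed else 0x00) | exponent


def packed_size(count: int, bits: int) -> int:
    """Bytes occupied by `count` integers of `bits` width, padded to a whole byte."""
    return (count * bits + 7) // 8
```

For example, three 4-bit integers occupy two bytes, with the last nibble left as padding.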

Lists have a length marker and then a type descriptor. The element type must not be None.

tag	len	elt	usage
0x06	varuint	type desc	string

Tuples have a length value N followed by N type descriptors. Implementations should omit fields that have a value of None.

tag	len	elt_i	usage
0x07	varuint	type desc	string

Records have a length value N followed by N field descriptions. Implementations should omit fields that have a value of None.

tag	len	name_i	type_i	usage
0x08	varuint	string	type desc	string

Dictionaries have two type descriptors for the key and value types. The key must not be None. The value may be None, which should be interpreted as a set rather than a dictionary.

tag	key	value	usage
0x09	type desc	type desc	string

Unions have a length value N followed by N variants.

tag	len	name_i	type_i	usage
0x0a	varuint	string	type desc	string

Unions that only have 1 variant should not be emitted. Unions that have 0 variants must not be emitted. Implementations should avoid generating union variants that are not used in the payload that follows.

Payload Format

The root of the document (a value) is described using the root of the schema (a type description).

How a particular value is decoded depends on its type description.

For primitives:

None reads nothing.
String reads a string, described above.
Integer reads a varsint, described above.
Float32 reads 4 bytes.
Float64 reads 8 bytes.

For arrays, the behavior is different depending on the length marker. For arrays with a length marker of 0, read a varuint N and then read N elements of the element type.

len	element_i
varuint	element type

For arrays with a non-zero length marker M, read M elements of the element type.

element_i
element type

For tuples, read one value of each element type sequentially, in schema order.

element_i
element_i type

For records, read the value of each field in the order the fields were written in the schema.

value_i
value type

For dictionaries, read a length value N and then read N key/value pairs.

len	key_i	value_i
varuint	key type	value type

For unions, read a varuint index N, and then read a value of the type described by the Nth variant.

index	value
varuint	variant_index type
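The payload rules for the non-numeric types can be sketched as a recursive decoder driven by the schema. Everything here is illustrative: the schema is written as hypothetical nested tuples instead of parsed type descriptions, union indices are assumed to be zero-based, and Integer, Float32, Float64 and FixedIntArray are left out because their sign mapping, byte order and bit packing are not pinned down in the text above.

```python
def read_varuint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next position); 7-bit groups, most significant first."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def read_value(schema: tuple, data: bytes, pos: int = 0):
    """Decode one value of the given schema node; return (value, next position)."""
    kind = schema[0]
    if kind == "none":                      # None reads nothing
        return None, pos
    if kind == "string":
        length, pos = read_varuint(data, pos)
        return data[pos:pos + length].decode("utf-8"), pos + length
    if kind == "list":                      # ("list", length_marker, element)
        _, marker, element = schema
        count = marker
        if marker == 0:                     # marker 0: count is stored in the payload
            count, pos = read_varuint(data, pos)
        items = []
        for _ in range(count):
            item, pos = read_value(element, data, pos)
            items.append(item)
        return items, pos
    if kind == "tuple":                     # ("tuple", [element, ...])
        items = []
        for element in schema[1]:
            item, pos = read_value(element, data, pos)
            items.append(item)
        return tuple(items), pos
    if kind == "record":                    # ("record", [(name, type), ...])
        record = {}
        for name, field_type in schema[1]:
            record[name], pos = read_value(field_type, data, pos)
        return record, pos
    if kind == "dict":                      # ("dict", key_type, value_type)
        _, key_type, value_type = schema
        count, pos = read_varuint(data, pos)
        result = {}
        for _ in range(count):
            key, pos = read_value(key_type, data, pos)   # key must be hashable here
            result[key], pos = read_value(value_type, data, pos)
        return result, pos
    if kind == "union":                     # ("union", [(name, type), ...])
        index, pos = read_varuint(data, pos)
        name, variant_type = schema[1][index]
        value, pos = read_value(variant_type, data, pos)
        return (name, value), pos
    raise ValueError(f"unsupported schema node: {kind}")
```

A dictionary whose value type is None decodes to keys mapped to `None`, which matches the set interpretation described for the schema.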

Usage hints

Usage hints are designed as a way for implementations to pass through native data types and allow for 1:1 encode and decode. A few common usage hints are pre-defined to maximize compatibility between languages and universality of tooling.

Unlike raw data types, new usage hints can be added without existing implementations needing to be adapted to support them.

Implementations should define their own namespace when adding new formats not specified here. This includes sub-formats like those of datetime. For example, one might define a json:null type to distinguish between null and lack-of-existence in JSON.

URN	Base Type	Summary
`tson:bool`	Numeric^[1]	0 for false, 1 for true.
`tson:uuid`	FixedIntArray<u8>	UUID stored as big endian bytes.
`tson:uuid`	String	UUID stored as a hex string.^[2]
`tson:display/hex`	Integer, FixedIntArray	Display as hexadecimal (base 16).
`tson:display/octal`	Integer, FixedIntArray	Display as octal (base 8).
`tson:display/binary`	Integer, FixedIntArray	Display as binary (base 2).
`tson:datetime/unix`	Integer	Seconds since Jan 1, 1970 UTC.^[3]
`tson:datetime/iso8601`	String	ISO 8601 formatted string.
`tson:datetime/http`	String	RFC 7231 formatted string.
`tson:string/utf16`	FixedIntArray<u16>	UTF-16 text with BOM.^[4]
`tson:unit/bytes`	Numeric^[1]	Represents a quantity in bytes.^[3]
`tson:unit/bits`	Numeric^[1]	Represents a quantity in bits.^[3]
`tson:unit/seconds`	Numeric^[1]	Represents a quantity in seconds.^[3]

Scalable units may have query parameters mul and div. Together these form a ratio, mul/div, by which the stored number is multiplied to obtain the value in the base unit. Both parameters must be integers, but may be written using scientific notation. A parameter that is omitted is treated as 1.

Examples

tson:unit/bytes?mul=1000: Kilobytes (KB).
tson:unit/bytes?mul=1048576: Mebibytes (MiB).
tson:unit/seconds?div=1e9: Nanoseconds (ns).
tson:unit/seconds?div=60: 60Hz timer ticks.
tson:datetime/unix?div=1e3: Milliseconds since epoch.
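The scaling rule can be sketched as below. The helper names `parse_scaled_int` and `unit_ratio` are hypothetical, and treating a missing mul or div as 1 is an assumption drawn from the examples above.

```python
from decimal import Decimal
from fractions import Fraction
from urllib.parse import parse_qs


def parse_scaled_int(text: str) -> int:
    """Parse an integer that may be written in scientific notation, e.g. '1e9'."""
    number = Decimal(text)
    if number != number.to_integral_value():
        raise ValueError(f"not an integer: {text!r}")
    return int(number)


def unit_ratio(hint: str) -> tuple[str, Fraction]:
    """Split a usage hint into (base hint, mul/div ratio)."""
    base, _, query = hint.partition("?")
    params = parse_qs(query)
    mul = parse_scaled_int(params.get("mul", ["1"])[0])  # assumed default: 1
    div = parse_scaled_int(params.get("div", ["1"])[0])  # assumed default: 1
    return base, Fraction(mul, div)


def to_base_unit(value: int, hint: str) -> Fraction:
    """Scale a stored number into the base unit named by the hint."""
    _, ratio = unit_ratio(hint)
    return value * ratio
```

For instance, `to_base_unit(1500, "tson:datetime/unix?div=1e3")` yields `Fraction(3, 2)`, that is 1.5 seconds since the epoch. Using `Fraction` keeps the result exact; a consumer that prefers floats can convert at the end.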

^[1] Integer, Float32, Float64.
^[2] This format should generally be avoided in favor of the array representation. UUIDs stored as a string may have wrapping curly braces {} and may include dashes -.
^[3] Scalable.
^[4] tson:string/utf16/be and tson:string/utf16/le may be used in order to omit the BOM. Text in this format may contain unpaired (orphaned) surrogates.