Why CBOR? Video Transcript

## Transcript

I'm Wolf McNally, Lead Researcher for Blockchain Commons. In this video I'm going to explain in detail one of the key architectural decisions we've made, how it's impacting our projects, and how we expect our community of developers and supporters to benefit from it.

Blockchain Commons is a "not-for-profit" social benefit corporation that advocates for the creation of open, interoperable, secure, and compassionate digital infrastructure. Our goal is to enable people to control their own digital destiny and maintain their human dignity online.

The research and development mission of Blockchain Commons includes the development of open source technical specifications, reference implementations, and tooling that help developers solve common problems with hardware and software that needs to be decentralized and secure, preserve privacy, and enhance human independence.

Blockchain Commons is working to build a stack of technologies that can be easily adopted by developers. Part of our aim is to invent new solutions where needed, and integrate existing "best of breed" solutions where they already exist.

Many of the solutions we're working on have a fundamental need to serialize structured binary data. You'd like to be able to create structured data, then move it through the network, store it electronically or even on hard physical media, and finally receive that data in a different place, possibly by a different agent, running different software. Because requirements for serializing data vary, this basic task has seen a lot of approaches, all of which involve tradeoffs.

So in this video, I'm going to focus on our serialization format of choice, and how we reached that decision.

As I already mentioned, we wanted a *structured, binary* format.
Many of the applications we're concerned with involve cryptographic keys, signatures, and other forms of data best represented as binary. Text formats like JSON require the use of additional encoding layers like Base64, adding bulk and complexity, especially when you'd like to continue parsing down inside that data!

Second, we wanted the serialized structured data to be as *concise* as possible. This means that small structures should result in messages of no more bytes than reasonably necessary. A format like BSON, for example, has a surprisingly large serialization footprint, as it trades off conciseness for the ability to easily update it in place in a database.

Third, we wanted a format that is *self-describing*. This means that the serialized data contains the associated metadata that describes its semantics. Self-describing formats can be *schemaless*, which makes a lot of sense in a world where both ends of a communication relationship may be evolving at high speed. As with JSON, which is fundamentally schemaless, we also wanted the option to support formal schemas as the need arises, but didn't want to tie developers to specific schema processors or toolchains.

Fourth, we wanted a format that works well in *constrained environments*, like special purpose embedded systems and the Internet of Things. This means the codec implementations should be straightforward and efficiently implementable in a minimum number of lines of code.

Fifth, we wanted a format that is not closely tied to any particular hardware or software platform, or any specific programming language.

And finally, we wanted a format that has had many experienced eyes on it, and that means a format that has been through the standards process. This also means that exemplary specifications exist, along with multiple reference implementations and test vectors.
Being adopted as a standard also reduces resistance to adoption, therefore increasing the likelihood that there is an active community of developers and projects relying on the code and tools that support the standard.

This led us to selecting CBOR: the Concise Binary Object Representation.

Binary formats all have the primary drawback that you can't simply examine them in a text editor. But CBOR tooling is very good, and it is quite easy to see a dump that breaks down well-formed CBOR into its constituents. With a little more effort, known tags can automatically be displayed, making understanding the semantics even easier.

But it gets even better. The CBOR Diagnostic Notation moves above the byte level and uses a JSON-like text syntax, including square brackets for arrays and curly braces for maps (analogous to JSON objects). CBOR Diagnostic Notation is designed to round-trip with the CBOR binary encoding, but it is primarily intended as a tool for development and debugging. If you encounter some unfamiliar CBOR, you can always parse it into diagnostic notation to start exploring it: no external schema is needed.

The structure you're seeing here is an instance of Gordian Envelope. Blockchain Commons has tools that take examining the structure of envelopes to the next level. This is the same structure in Envelope Notation. You can now see it's just the subject "Alice" with a single assertion having the predicate "knows" and the object "Bob". The Blockchain Commons reference implementation tools even include output in the graphical Mermaid format, making the structure of envelopes (especially complex ones) even easier to understand.

So while CBOR is a binary format, in our experience the gains far outweigh the costs.

CBOR's designers weren't kidding when they put "concise" in its name.
When encoding a small structure like this nested array of integers, BSON weighs in at a hefty 34 bytes, Abstract Syntax Notation One (DER encoding) weighs in at 13, EXI4JSON uses 11, RFC 713 uses 7, and CBOR only needs 5! The only popular structured binary serialization format that matches it is the less-capable and never-standardized MessagePack, upon which the design of CBOR was based.

CBOR's light weight is due to its consistent use of a single-byte header for every element. In fact, that single-byte header can be used as a jump-table for super-fast CBOR decoding. It defines whether the element is an unsigned integer, a negative integer, a byte string, a UTF-8 text string, an array, a map, a tagged item, or certain special "simple" values like `true`, `false`, and `null`. Tags allow developers to define extended and composite data types.

Devices in the Internet of Things and other embedded systems often operate under tight constraints of processing power, storage, and bandwidth. CBOR is designed to be simple to encode and decode, with one popular C++ codec consisting of only about 900 lines of code, and one Python implementation having fewer than 500 lines!

Over 200 GitHub repositories are tagged CBOR. Numerous CBOR codecs exist across many popular languages, and sites like CBOR.io exist as hubs to point developers to documentation, implementations, and other tools.

Last but not least, CBOR is *standardized*. In addition to RFC 8949, which is the core CBOR spec, the IETF Datatracker shows 23 RFCs, as well as 16 active Internet Drafts, that reference CBOR in their titles. Another important RFC is 8610: The Concise Data Definition Language (CDDL), which is a schema description notation for CBOR. Blockchain Commons uses CDDL to describe all of our CBOR-based structures, notably those of our Gordian Envelope Internet Draft.
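
As a rough illustration of the single-byte header described above, here is a minimal Python sketch of how that byte is built: the high three bits carry the major type and the low five bits carry a small argument, with larger arguments spilling into the following bytes. The function names are made up for this example and do not come from any particular codec. It also assumes the nested array in the size comparison is `[1, [2, 3]]`, the example RFC 8949 uses in its own comparison appendix; the transcript itself does not spell the array out.

```python
# Minimal sketch of CBOR's header byte (RFC 8949, section 3).
# Helper names are illustrative only, not from any real CBOR library.

MAJOR_TYPES = [
    "unsigned integer", "negative integer", "byte string", "text string",
    "array", "map", "tag", "simple/float",
]

def encode_head(major: int, value: int) -> bytes:
    """Header byte plus the shortest big-endian argument that holds `value`."""
    if value < 24:
        # Small arguments fit directly in the low five bits of the header.
        return bytes([(major << 5) | value])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):
            return bytes([(major << 5) | info]) + value.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")

def encode(item) -> bytes:
    """Encode non-negative integers and (nested) lists only."""
    if isinstance(item, int) and not isinstance(item, bool) and item >= 0:
        return encode_head(0, item)
    if isinstance(item, list):
        return encode_head(4, len(item)) + b"".join(encode(x) for x in item)
    raise TypeError(f"unsupported type: {type(item).__name__}")

def describe_header(byte: int) -> str:
    """Split a header byte into major type (high 3 bits) and additional info (low 5 bits)."""
    return f"{MAJOR_TYPES[byte >> 5]}, additional info {byte & 0x1F}"

encoded = encode([1, [2, 3]])
print(encoded.hex(" "))             # 82 01 82 02 03
print(len(encoded))                 # 5
print(describe_header(encoded[0]))  # array, additional info 2
```

Because the major type lives in the top three bits, a decoder can dispatch on `byte >> 5` with a simple eight-entry table, which is the jump-table idea mentioned above.
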
Additionally, IANA, the Internet Assigned Numbers Authority, maintains a registry of CBOR tags, which helps developers coordinate extending CBOR data types.

But there's one more thing that makes CBOR a particularly great choice for what we're doing at Blockchain Commons, and particularly for our requirements for Gordian Envelope. Envelopes are "smart documents," and one of several things that makes them smart is that for a particular set of semantics to encode, there is a single unique way of encoding it. This is particularly important for cryptographic constructs like hashing and signing.

Not many other data serialization formats have a standard way to do this. Some, like ASN.1, define a specific encoding method like DER to do this. Before text formats like JSON can be used as smart documents, the JSON has to be "canonicalized" using a rather involved algorithm that transforms the document to enable repeatability, at the expense of human readability. Furthermore, there aren't many implementations of the JSON Canonicalization Scheme, and in fact there are several ongoing competing efforts.

One might ask: if you're going to sacrifice readability, why not just go with a binary format with a single, standard, canonical form? CBOR has one: RFC 8949 defines rules for deterministic encoding, and Gordian envelopes require all CBOR they contain to be deterministically encoded.

And that's why Blockchain Commons is excited about working with CBOR, and we're looking forward to hearing your questions and ideas.
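
To make the point about deterministic encoding and hashing concrete, here is a small Python sketch under stated assumptions. It implements only two of the core deterministic rules from RFC 8949 section 4.2 (shortest-form arguments, and map entries sorted by the bytewise order of their encoded keys) for a handful of types. The helper names are hypothetical, and the plain map used here is not the actual Gordian Envelope structure; it only borrows the "Alice knows Bob" vocabulary from the example above.

```python
import hashlib
import json

def encode_head(major: int, value: int) -> bytes:
    """Header byte plus the shortest big-endian argument that holds `value`."""
    if value < 24:
        return bytes([(major << 5) | value])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):
            return bytes([(major << 5) | info]) + value.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")

def encode(item) -> bytes:
    """Deterministically encode non-negative ints, text strings, lists, and dicts."""
    if isinstance(item, int) and not isinstance(item, bool) and item >= 0:
        return encode_head(0, item)
    if isinstance(item, str):
        data = item.encode("utf-8")
        return encode_head(3, len(data)) + data
    if isinstance(item, list):
        return encode_head(4, len(item)) + b"".join(encode(x) for x in item)
    if isinstance(item, dict):
        # Sort entries by the bytewise order of their *encoded* keys,
        # so insertion order never affects the output.
        pairs = sorted((encode(k), encode(v)) for k, v in item.items())
        return encode_head(5, len(item)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type: {type(item).__name__}")

a = {"knows": "Bob", "name": "Alice"}
b = {"name": "Alice", "knows": "Bob"}  # same content, different insertion order

# Plain JSON text depends on insertion order unless it is canonicalized first.
print(json.dumps(a) == json.dumps(b))  # False

# The deterministic encoding is identical, so the digests match too.
print(encode(a) == encode(b))          # True
print(hashlib.sha256(encode(a)).digest() == hashlib.sha256(encode(b)).digest())  # True
```

With a single agreed byte sequence for each value, two parties can hash or sign the same data independently and compare results without a separate canonicalization pass.
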