Floats in ATProto
-----------------

Creation date: 2025-10-16

Note: I was part of the IPLD team that created the specs/information that is linked from this FAQ.

I'd like to see floats being allowed in Lexicons, especially for geo data. It's common for web APIs to encode geometries as [GeoJSON]. A polygon looks like this:

```json
{
  "type": "Polygon",
  "coordinates": [
    [
      [112.48331, -22.56463],
      [114.38091, -34.67519],
      [147.95734, -44.72361],
      [156.66985, -26.37422],
      [134.39716, -10.88458],
      [112.48331, -22.56463]
    ]
  ]
}
```

Encoding these numbers as strings or as Base64-encoded binary would make working with those geometries less pleasant.

ATProto disallows floats in its data model and refers to the IPLD specs for the reason. IPLD has a huge problem space; ATProto uses a small subset of that space. Hence I think the problems that arise from floats, as brought up by the IPLD specs, don't apply to ATProto. On the contrary, not having floats actually introduces the problems mentioned there.

When I say "floats" I mean binary64, the 64-bit [IEEE 754 double-precision binary floating-point format].

The ["Why No Floats?" section in the ATProto data model spec] states:

> In short, de-serializing in to machine-native format, then later re-encoding, is not always consistent. This is definitely true for special values and corner-cases, but can even be true with "normal" float values on less-common architectures.

I'll focus on "normal" floats, which are neither NaN nor infinity, as those are forbidden by DAG-CBOR anyway.

Earlier in that section, it refers to the [floating point section in the IPLD tricky choices design document]. There the "base problem" section says:

> Codecs which are binary tend to also be able to use base-2 numbers naturally, and so for these codecs there are typically no issues. (CBOR and DAG-CBOR are examples of this.)

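
As a minimal sketch of that point, the snippet below writes a few of the GeoJSON coordinates from above as CBOR-style 64-bit floats (the initial byte `0xfb` followed by the eight big-endian binary64 bytes) and reads them back. The helper names `encode_float64` and `decode_float64` are made up for this illustration, and this is not a full CBOR implementation; it only shows that a binary64 value survives a binary round trip bit for bit.

```python
import struct

# A few coordinates from the GeoJSON polygon above.
coordinates = [112.48331, -22.56463, 114.38091, -34.67519]

# CBOR initial byte for a 64-bit float (major type 7, additional info 27).
CBOR_FLOAT64 = 0xFB


def encode_float64(value: float) -> bytes:
    """Hypothetical helper: encode a float the way CBOR stores a binary64."""
    return bytes([CBOR_FLOAT64]) + struct.pack(">d", value)


def decode_float64(data: bytes) -> float:
    """Hypothetical helper: decode the 9 bytes produced by encode_float64."""
    if len(data) != 9 or data[0] != CBOR_FLOAT64:
        raise ValueError("not a CBOR 64-bit float")
    return struct.unpack(">d", data[1:])[0]


round_trips = []
for value in coordinates:
    encoded = encode_float64(value)
    decoded = decode_float64(encoded)
    # Decoding and re-encoding must yield exactly the same bytes.
    round_trips.append(decoded == value and encode_float64(decoded) == encoded)

print(all(round_trips))
```

No decimal text is involved at any point, so there is nothing that could be formatted differently by different implementations.
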

If only binary formats are used, things are fine (as long as there's a single representation; for CBOR that would mean using 64-bit floats only). But IPLD allows any representation; you could even think of serializing the data model as Markdown. This is where the next sentence comes into play:

> Codecs which are textual and human readable tend to use base-10 numbers, and so for these codecs there are typically conversion issues. (JSON and DAG-JSON are examples of this!)

So it applies to textual representations that still want a serialization that is as deterministic as possible. DAG-JSON is one instance of that, and IPLD could have endlessly more, like the fictional Markdown one.

But the good news is that ATProto only has two specific serialization formats, CBOR and JSON. CBOR is not a problem as it's a binary format with proper float support, so let's look at JSON.

The ["Relationship With IPLD" section of the ATProto data model spec] states:

> The atproto JSON encoding is not designed to be byte-deterministic, and the CBOR representation is used when data needs to be cryptographically signed or hashed.

This means that the ATProto JSON representation doesn't have the goal of being a deterministic encoding. So it wouldn't matter if two different implementations produced different textual decimal representations of the same float. An example would be 0.099999999 and 0.1, which map to the same bit pattern (shown here with 32-bit floats, `'>f'`, to keep the decimals short):

```python
>>> import struct
>>> struct.unpack('>l', struct.pack('>f', 0.099999999))[0]
1036831949
>>> struct.unpack('>l', struct.pack('>f', 0.1))[0]
1036831949
```

The important thing would be that they round-trip correctly. I personally have seen cases where two different programming languages/environments don't output the same decimal representation for a certain float. But so far I've never encountered the reverse, that a certain decimal representation leads to a different float.
I'm not saying it doesn't exist, I just think that from a practical perspective it's not a problem.

Now I'd like to come to the point that the current recommendations are an even bigger problem than using floats directly. The recommendation in the "Why No Floats?" section on what to use instead is:

> …we recommend encoding the floats as strings or even bytes. This provides a safe default round-trip representation.

It is true that the data now round-trips without issues between the JSON and the CBOR representation. But the problems occur when you parse the data and encode it again. Because the decimal representation is stored as a string, you introduce the problem of textual formats into the world of binary formats.

Let's take the [`community.lexicon.location.geo` lexicon] as an example. There the longitude and latitude are encoded as strings. So it looks like this (for brevity I leave out other keys):

```json
{
  "latitude": "-33.80",
  "longitude": "151.21"
}
```

Your application parses the data in order to display it on a map. You use Python and parse the coordinates via [`float()`]. The user can drag the location around and hit a save button. In this case, the location isn't dragged at all. The implementation calls [`str()`] on the current numbers to encode them again as strings. The result is:

```json
{
  "latitude": "-33.8",
  "longitude": "151.21"
}
```

The `latitude` is missing the trailing `0`. Now the JSON is encoded as CBOR and, as it's a string, the change carries over. So the serialization changed, although it shouldn't have.

If it had been a float in CBOR, it would already have been encoded as `-33.8` (without the trailing zero) in the JSON output. But even if the JSON had contained `-33.80`, once it's a Python float, it would have serialized into CBOR as the same, correct float.

The other recommendation is using bytes. That should work, but a human cannot easily read it in the JSON output.
It would look like this:

```json
{
  "latitude": "wgczMw",
  "longitude": "Qxc1ww"
}
```

And it would need a separate Base64 decoding step for each of those. I don't see how this would be better than using CBOR floats directly.

[GeoJSON]: https://en.wikipedia.org/wiki/GeoJSON
[`float()`]: https://docs.python.org/3/library/functions.html#float
[`str()`]: https://docs.python.org/3/library/functions.html#func-str
[`community.lexicon.location.geo` lexicon]: https://github.com/lexicon-community/lexicon/blob/2bf2cbbfd3058d710f8c468307ef7e003bc22383/community/lexicon/location/geo.json
["Why No Floats?" section in the ATProto data model spec]: https://atproto.com/specs/data-model#why-no-floats
[IEEE 754 double-precision binary floating-point format]: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
[floating point section in the IPLD tricky choices design document]: https://ipld.io/design/tricky-choices/numeric-domain/#floating-point
["Relationship With IPLD" section of the ATProto data model spec]: https://atproto.com/specs/data-model#relationship-with-ipld
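
The re-encoding problem described in the `community.lexicon.location.geo` example can be reproduced in a few lines of Python. This is a rough sketch: the variable names and the `decode_float32` helper are made up for illustration, and the Base64 part assumes what the sample values `wgczMw` and `Qxc1ww` suggest, namely big-endian 32-bit floats with the Base64 padding stripped.

```python
import base64
import json
import struct

# 1. Coordinate stored as a string: parse it, then encode it again.
stored = "-33.80"
resaved = str(float(stored))
string_changed = stored != resaved  # the trailing zero is lost

# 2. Coordinate stored as a float: the binary64 bytes (as CBOR would carry
#    them) are identical no matter how the decimal text was written.
bytes_from_long_text = struct.pack(">d", json.loads("-33.80"))
bytes_from_short_text = struct.pack(">d", json.loads("-33.8"))
float_changed = bytes_from_long_text != bytes_from_short_text

# A float also survives a trip through Python's JSON encoder unchanged.
json_round_trip_ok = json.loads(json.dumps(-33.8)) == -33.8


# 3. Coordinate stored as bytes: every value needs an extra decoding step.
def decode_float32(text: str) -> float:
    """Hypothetical helper; assumes unpadded Base64 of a big-endian 32-bit float."""
    padded = text + "=" * (-len(text) % 4)
    return struct.unpack(">f", base64.b64decode(padded))[0]


latitude = decode_float32("wgczMw")
longitude = decode_float32("Qxc1ww")

print(string_changed, float_changed, json_round_trip_ok)
print(latitude, longitude)
```

The string representation changes on a parse/encode cycle, while the float's bytes do not. The decoded Base64 values are the 32-bit approximations of -33.8 and 151.21.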