ipld pathing, data model, and strings options
=============================================
(where "text" means "strictly unicode (which?)")
(where "string" means "it's secretly bytes but we really nudge you hard to be utf8 strings")
(the final name we choose for either of these concepts may vary; we use them with these distinctions in this document only.)
top row is the things we have in the data model Kind enum
(other kinds like list|int|... are presumed still present in all cases, since they're not in discussion.
for some options on the left column, the definition of 'map' in the kinds enumeration may change, see detailed descriptions for those cases.)
left column is how we solve the definition of maps
- These are related: Remember, the data model needs to self-describe.
- Concretely, this can be seen in Selectors.
- What is this line in the selectors spec supposed to say: https://github.com/ipld/specs/blame/fd3697982f031405ffa00fff71801d3759d06f1f/selectors/selectors.md#L77
- Key highlights of this: Selectors are describing *other data* in the Data Model, so, one way or another, what currently is stated as "String" on that line **must** _**match**_ (not be close -- _match_) the domain that defines what map keys are.
| | `{string\|bytes}` | `{text\|bytes}` | `{string\|bytes\|text}` |
| ---------------- | ---------------- | --------------- | --------------------- |
| `mapkey=string` | AA | - | AC |
| `mapkey=bytes` | BA | BB | BC |
| `mapkey=text` | - | CB | CC |
| `mapkey=key` | DA | DB | DC |
| `mapkey=[mixed]` | EA | EB | EC |
| `[many mapkind]` | FA | FB | FC |
### AA
**What is it**: Map keys are strings. Strings are sequences of 8-bit bytes. The Data Model knows of both string kind and bytes kind.
**Prognosis**: Plausible. (Clearly, since it's what go-ipld-prime is already shipping.)
- Extremely clear and non-lossy in the data model.
- Moderately frustrating to implement in some ecosystems which have standard libraries that wish strings were stricter.
- ... but generally still possible. Almost all languages and ecosystems we've examined have escape valves for this. (Generally: check how that language does FFI, and you'll find your answer.)
- Clear in most codecs; escapable in those that have restrictions; or, codecs can simply be "limited" (and that's OKAY).
- Some codecs can encode 8-bit byte sequences quickly and without remark (this is unambiguous to do in CBOR, even if not spec compliant).
- Some codecs have limitations (e.g. JSON) but can use escaping such as UTF8-C8.
- New codecs can define their own escaping (which may be more efficient than UTF8-C8).
- Or codecs can simply not support all strings that are potentially in range of the Data Model: limited codecs are a valid part of the IPLD ecosystem.
- Friendly to partial implementations.
- Your language doesn't have a full Unicode table and validator implemented yet? No problem.
- Fast.
- Checking strings for range limitations requires an O(n) check, generally speaking. Not doing so is certainly faster than doing so.
- Can contain other domains without difficulty.
- For example: one could encode "bigint" values as binary data in a map key, and with this definition, not be troubled.
- N.b. we don't *recommend* doing this, but that it's possible to describe is interesting.
- Retains definition and continues to work even when encountering data in the wild that does not match a stricter definition of strings.
- Given how often we see implementations of systems which don't do unicode strictness checking in practice, and the evolutionary pressure for fast systems which selects against such checking, we should be concerned with how we handle data that doesn't pass some particular unicode definition. We should be able to round-trip such data.
- Though UTF-8 has become common, data certainly exists in the world which either predates the prevalence of UTF-8, or simply chooses not to be restricted to UTF-8 for other reasons. We should be able to handle such data.
- Filenames are one massive example of such data. People tend to consider these strings (and wouldn't generally want them rendered as hex); but they also aren't restricted to the domain UTF-8.
- (Filesystems may also be an interesting example of how systems with strictness evolve: NTFS, for example, technically forbids creating files containing colon characters... in the UI. A Linux NTFS driver? Can and will. A Windows NTFS driver? Will read this and live with it; what else is it supposed to do? Fail to mount the entire filesystem?)
- Downside: not all text is renderable by a rendering platform that supports UTF-8.
- ... but this is of dubious consequence: consider the ZERO WIDTH NON-JOINER (U+200C) character -- not all UTF-8 is exactly "renderable", either.
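To make the AA bullets above concrete, here is a minimal sketch in Go, whose built-in `string` type happens to already be "a sequence of 8-bit bytes that is usually UTF-8". The filename and the `roundTrip` helper are invented for illustration; nothing here is go-ipld-prime API.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// roundTrip stores a key in a map and reads it back by iterating,
// returning the key exactly as the map holds it.
func roundTrip(key string) string {
	m := map[string]int{key: 1}
	for k := range m {
		return k
	}
	return ""
}

func main() {
	// A Latin-1 encoded filename (illustrative): a lone 0xE9 byte is not valid UTF-8.
	name := "caf\xe9.txt"

	// This is the O(n) strictness check that option AA does not require.
	fmt.Println(utf8.ValidString(name)) // false

	// The key still survives unchanged, byte for byte.
	got := roundTrip(name)
	fmt.Println(got == name, len(got)) // true 8
}
```

The point of the sketch is only that nothing breaks: the key is not valid UTF-8, yet it is stored, compared, and returned without loss, and the validation pass is optional work layered on top.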
### AB
Undefined. (Contradiction: map keys can't be string kind if the data model doesn't contain a string kind!)
### AC
**What is it**: Map keys are strings. Strings are sequences of 8-bit bytes; Text is a unicode-only sequence. The Data Model knows of both string kind and bytes kind and text kind. The string kind merely _suggests_ unicode; the text kind enforces it.
**Prognosis**: Not really viable. Implausible due to codec realities; also, of very dubious desirability.
- Fairly nonsensical and impossible in codecs we consider dear.
- Distinguishability problem: how do you tell "string" and "text" apart in CBOR? There aren't enough "major types" in the spec to support this.
- We could add more tags -- DAG-CBOR *is* distinct from CBOR, after all. But would it be a good idea?
- Same question for other codecs. JSON also only has one kind of string. And we could not extend it easily without creating data-control-plane mismatches or else creating an entirely new format.
- The distinction that this would offer to creators of new data is not one that anyone wants:
- There are protocols that _say_ string but don't mean UTF8; and there are those that _say_ string and do mean UTF8; there are none in human history that do both, in addition to a distinct bytes concept.
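The CBOR distinguishability problem above can be sketched as follows. CBOR (RFC 8949) stores a 3-bit major type in the top bits of each item's initial byte; major type 2 is a byte string and major type 3 is a text string, and there is no third string-like major type. The `kindFor` function and its kind names are hypothetical, not taken from any DAG-CBOR implementation.

```go
package main

import "fmt"

// majorType extracts the CBOR major type (the top 3 bits) from the
// initial byte of a data item.
func majorType(initial byte) byte {
	return initial >> 5
}

// kindFor maps a CBOR major type to a hypothetical Data Model kind name.
func kindFor(major byte) string {
	switch major {
	case 2:
		return "bytes"
	case 3:
		// Under option AC this would have to be "string" OR "text",
		// but the wire format carries nothing that says which.
		return "string"
	default:
		return "other"
	}
}

func main() {
	// 0x43 starts a byte string of length 3; 0x63 starts a text string of length 3.
	fmt.Println(majorType(0x43), kindFor(majorType(0x43))) // 2 bytes
	fmt.Println(majorType(0x63), kindFor(majorType(0x63))) // 3 string
}
```

With only two string-like major types available, a decoder has two answers to give for three kinds, which is why AC would need extra tags or a format change.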
### BA
**What is it**: Map keys are bytes. Strings are sequences of 8-bit bytes. The Data Model knows of both string kind and bytes kind.
**Prognosis**: Plausible, but not very pedagogically pleasing.
- Saying "map keys are bytes" gives many people the wrong idea.
- Many people will think this is too complex, and walk away from the project before learning more.
- Many people will think this means maps are unreadable, and walk away from the project before learning more.
- People who do implement things in this ecosystem will feel encouraged to use unreadable map keys, which we do not desire.
- Given that map keys *usually are* strings, we'd still want to make APIs in libraries prefer strings, which seems to require some doublethink.
### BB
**What is it**: Map keys are bytes. Text is a unicode-only sequence. The Data Model knows about the bytes kind and the text kind (there is nothing called string).
**Prognosis**: Plausible, but lots of weird double-think required, and doesn't necessarily suggest good interfaces.
- All of the "gives people the wrong idea" from BA still applies here. Even if this is a technically possible choice, it is extremely dubious from a didactics perspective.
- Many people will think this is too complex, and walk away from the project before learning more.
- Many people will think this means maps are unreadable, and walk away from the project before learning more.
- People who do implement things in this ecosystem will feel encouraged to use unreadable map keys, which we do not desire.
- Given that existing DAG-CBOR data uses the CBOR "string" type in map keys, this would require some serious doublethink to make sense of, even if we can make it technically defined.
- The thing we call "bytes" in the data model would be serialized as "string" in this codec.
- It would be forevermore impossible to make the "bytes" marker in this codec appear in maps.
### BC
**What is it**: you know what, read it out yourself. The table is pretty clear, isn't it?
**Prognosis**: Nonviable.
- Same reasons as AC.
- Distinguishability problem: how do you tell "string" and "text" apart?
- Remains error prone in practical use and brings awkward questions to end-user attention and thus doesn't really solve many problems.
- All of the "gives people the wrong idea" from BA still applies here. Even if this is a technically possible choice, it is extremely dubious from a didactics perspective.
- Spoiler for future `*C` entries: they're always going to be the worst of their row, because they'll always have the problems in AC plus whatever else is the worst implication of their row.
### CA
Undefined. (Contradiction: map keys can't be text kind if the data model doesn't contain a text kind!)
### CB
**What is it**: Map keys are text. Text is a unicode-only sequence. The Data Model knows about the bytes kind and the text kind (there is nothing called string).
**Prognosis**: Nonviable.
- There is already plenty of data in the wild which has non-unicode data in map keys.
- If we want filenames from POSIX filesystems to fit in map keys (and we do), this would be a contradiction.
### CC
**What is it**: Map keys are text, and the Data Model has strings and text and bytes.
**Prognosis**: Nonviable. Good lord, it's literally the cross product of "no" and "no". Nope.
- Same problems as AC.
- AND the same problems as CB.
- In case you didn't see the spoiler earlier: `*C` entries are always going to be the worst of their row, because they'll always have the problems in AC plus whatever else is the worst implication of their row.
### DA
**What is it**: Map keys have their own special kind called "key", which is not actually a Data Model Kind at all, and acts sort of like a union of string|bytes|int. Strings are sequences of 8-bit bytes. The Data Model knows of both string kind and bytes kind.
(Note: `int` is added to this pseudo-union because it's a neat idea; you may wish to evaluate this as "DA" and "DA+int" if those seem interestingly distinct to you.)
**Prognosis**: Interesting, but implausible due to codec realities. Also, if DA worked, one of DB or EA or EB might be considered superior.
- Some codecs we consider dear would find this impossible to represent distinguishably.
- e.g., JSON: telling apart string keys and bytes keys is impossible there (unless we added More Stuff, like TJSON... but that's a whole different outcome than DAGJSON).
- adding int to the pseudo-union exhibits the same problem.
- Some codecs are fine with this: CBOR has solid definitions for various key kinds, for example.
- Having map keys be a weird special not-exactly-a-kind is deeply bewildering for implementation consistency.
- Means map keys aren't a `Node` at all, and that would then break many more things.
- How would iterating on a map with IPLD Schema type info communicate the key type if it can't pass the value around as a `Node`?
- Honestly hard to even enumerate how many things this would shatter.
- Option EA is basically the same as DA, except this problem doesn't apply for EA -- and there are no drawbacks to EA compared to DA, so, we can essentially disregard DA because EA is strictly superior.
- Doesn't really make the unicode lovers happy either, same as the rest of the `*A` column.
- Option DB is probably superior to DA, because it gives more guarantees on this.
- Doesn't help describe what to do if we encounter strings in the wild that don't fit our desired encoding.
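The JSON distinguishability problem mentioned above can be demonstrated with Go's standard `encoding/json`, used here purely as a stand-in for any plain JSON codec (it is not DAG-JSON). The `encode` helper is invented for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode marshals a value to JSON text, panicking on failure to keep the sketch short.
func encode(v interface{}) string {
	out, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	withIntKey := encode(map[int]string{1: "a"})
	withStringKey := encode(map[string]string{"1": "a"})

	fmt.Println(withIntKey)                  // {"1":"a"}
	fmt.Println(withStringKey)               // {"1":"a"}
	fmt.Println(withIntKey == withStringKey) // true
}
```

Two maps with different key kinds serialize to identical bytes, so a decoder looking only at those bytes cannot recover which kind was intended.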
### DB
**What is it**: Map keys have their own special kind called "key", which is not actually a Data Model Kind at all, and acts sort of like a union of text|bytes|int. Text is a unicode-only sequence. The Data Model knows of both text kind and bytes kind.
**Prognosis**: Interesting, but implausible due to codec realities. (Generally, same as DA; this provides slightly more guarantees at the expense of being slightly more fragile.)
- Same distinguishability problem as DA: in some codecs we consider dear, we have no place to stash the data necessary to distinguish the different key kinds.
- This is fairly killer.
- Does get us an explicit unicode-strict type, which some may enjoy!
- Does not help describe problem of what we would do with data in the wild that self-describes as a string but has non-unicode byte sequences.
- (e.g. there are still "strings" in CBOR in the wild, as well as DAG-PB, etc, that won't be interpretable as strict "text", both for historical reasons and going forward due to non-strict implementations regardless of what the spec says; what do you want to do about this _when_ (not _if_) a library encounters it?)
- Most of the `*B` column has this issue.
### DC
I'm tired.
The `*C` column continues to have the same problems. Let's just skip this one?
### EA
**What is it**: Maps support mixed kinds of keys -- the same map can have some keys which are strings, others which are bytes, and yet others which are ints. Strings are sequences of 8-bit bytes. The Data Model knows of both string kind and bytes kind.
(Note: int is added to this mix because it’s a neat idea; you may wish to evaluate this as “EA” and “EA+int” if those seem interestingly distinct to you.)
**Prognosis**: Interesting! Might even be quite nice! But implausible due to codec realities (similar to DA).
- Some codecs we consider dear would find this impossible to represent distinguishably.
- e.g., JSON: telling apart string keys and bytes keys is impossible there (unless we added More Stuff, like TJSON... but that's a whole different outcome than DAGJSON).
- adding int to the pseudo-union exhibits the same problem.
- Some codecs are fine with this: CBOR has solid definitions for various key kinds, for example.
- Just having maps in IPLD specified as supporting mixed key kinds is probably preferable to introducing a weird special not-exactly-a-kind -- in other words, this is superior to approach DA.
- There are some questions raised for library design by mixed key kinds...
- Probably, this means libraries would want to define a MapIterator's `Next` method as `Next() (key Node, value Node, maybeError)`.
- This is probably fine?
- Implementations may have fun making this efficient.
- Most languages without native compiler understandings of sum types will have a bit of a sad time here. It will always be *possible* to implement -- but will probably require 3x the memory per key (for small keys; amortizes for larger keys).
- Equality operations may be tricky to implement correctly.
- But it's possible!
- Sorting operations, where necessary, are slightly trickier to define.
- But it's possible!
- Describing the Data Model in terms of other Data Model values gets a bit trickier.
- We'd need a union (probably kinded) where map keys need to be described.
- This comes up in Selectors.
- This would be a bit of a change to the currently shipped Selectors, but possibly viable, and perhaps even backwards compatible.
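The library-design questions above (iteration, equality, sorting, and per-key memory) can be sketched in Go, a language without native sum types. Everything here is hypothetical: the `Key`, `Entry`, and `MapIterator` types are invented for illustration, values are plain strings rather than `Node`s, and the int < string < bytes ordering is an arbitrary choice, not something any spec defines.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sort"
	"strings"
)

// Kind tags which slot of a Key is in use (hypothetical names).
type Kind uint8

const (
	KindInt Kind = iota
	KindString
	KindBytes
)

// Key is a hand-rolled sum type: a tag plus one slot per possible kind.
// The unused slots are where the extra per-key memory goes.
type Key struct {
	Kind Kind
	Int  int64
	Str  string
	Byt  []byte
}

// Compare orders keys by kind first (an arbitrary choice for this sketch),
// then by value within the same kind.
func (a Key) Compare(b Key) int {
	if a.Kind != b.Kind {
		if a.Kind < b.Kind {
			return -1
		}
		return 1
	}
	switch a.Kind {
	case KindInt:
		switch {
		case a.Int < b.Int:
			return -1
		case a.Int > b.Int:
			return 1
		}
		return 0
	case KindString:
		return strings.Compare(a.Str, b.Str)
	default:
		return bytes.Compare(a.Byt, b.Byt)
	}
}

// Equal treats the int 1, the string "1", and the bytes "1" as three different keys.
func (a Key) Equal(b Key) bool { return a.Compare(b) == 0 }

// Entry pairs a key with a value; values are plain strings to keep this short.
type Entry struct {
	Key   Key
	Value string
}

// MapIterator walks entries in order and reports io.EOF when exhausted.
type MapIterator struct {
	entries []Entry
	pos     int
}

// Next returns the next key and value, or io.EOF once the entries run out.
func (it *MapIterator) Next() (Key, string, error) {
	if it.pos >= len(it.entries) {
		return Key{}, "", io.EOF
	}
	e := it.entries[it.pos]
	it.pos++
	return e.Key, e.Value, nil
}

func main() {
	entries := []Entry{
		{Key{Kind: KindBytes, Byt: []byte("1")}, "bytes key"},
		{Key{Kind: KindString, Str: "1"}, "string key"},
		{Key{Kind: KindInt, Int: 1}, "int key"},
	}
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].Key.Compare(entries[j].Key) < 0
	})
	it := &MapIterator{entries: entries}
	for {
		_, v, err := it.Next()
		if err != nil {
			break
		}
		fmt.Println(v) // prints "int key", "string key", "bytes key" in that order
	}
}
```

Comparing kind before value is what keeps equality and sorting well-defined across mixed keys; the cost is that every key carries a tag and unused slots, which is the memory overhead the bullets above anticipate.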
### EB
**What is it**: Maps support mixed kinds of keys -- the same map can have some keys which are text, others which are bytes, and yet others which are ints. Text is a unicode-only sequence. The Data Model knows of both text kind and bytes kind.
**Prognosis**: Interesting, but implausible due to codec realities. (Generally, same as EA; this provides slightly more guarantees at the expense of being slightly more fragile.)
- Like EA, but stricter on unicode, which may make some happy.
- Does not help describe problem of what we would do with data in the wild that self-describes as a string but has non-unicode byte sequences.
### EC
**What is it**: Maps support mixed kinds of keys. The Data Model has strings and text and bytes.
**Prognosis**: `*C` column: as usual, this is just not really viable.
### FA
**What is it**: Maps in the data model come in several distinct kinds: instead of `map`, there's `map_with_bytes_key` and `map_with_string_key` and `map_with_int_key`.
**Prognosis**: Possibly interesting, but usually does not bring joy, and also is implausible due to codec realities.
- Speaking generally, this approach just does not seem popular with anyone. Adding a bunch of kinds to the data model is ishy at the best of times, and adding such verbose ones just doesn't sound fun at all.
- Like EA, distinguishability problems in codecs we already hold dear.
- No codec we've yet interacted with has demarcations in the areas that this design would wish for them.
- Retrofitting existing codecs (like dag-cbor or dag-json) to contain this kind of information is not really possible without a major change (milder attempts will flunk into mixing-data-and-control-plane issues).
- Retrofitting existing codecs (like dag-cbor or dag-json) to contain this information would make the per-key indicators redundant. Again, a total format change would be more advisable.
- Seemingly very little actual upsides to this, compared to any of the other alternatives.
### FB
**What is it**:
**Prognosis**:
- TODO finish
### FC
**What is it**:
**Prognosis**:
- TODO finish
### MORE
We could add an entire new column: `{bytes}` -- no string, no text.
In General
----------
- Remember, we aren't writing a single codec. We can't easily just patch our definition of encodings to make more things distinguishable.
- This is what makes the `*C` column generally nonviable throughout this entire report.
- ... although even if we could do this, it would still leave questions about whether we _should_.
- Remember, a lot of other projects about serialization have some information about what structure they expect in the data. We explicitly avoid this in IPLD: we want codecs to be able to parse data into the Data Model with no additional supporting information (we sometimes refer to this as "context-free" codecs).
- By contrast: in a library like (e.g.) Serde, the decode function is `(serialData, aRustType) --deserialize--> (aRustValue)` -- the rust type *is an input that provides more information*.
- In IPLD, because we want context-free decodes, our function has less: it's only `(serialData) --deserialize--> (dataModelNode)`.
- This is critical to understanding the distinguishability problem: if you cannot tell, from the serial data alone, which Data Model Kind the resultant Node will be, then the operation would be undefined (and the IPLD spec would be broken, at least certainly for that codec).
- _Other serialization libraries might not have problems with this, even where IPLD will, because **we set a higher bar** for context-free operation_.
- Remember, round-tripping is important.
- We consider this an even higher priority goal in IPLD than many other ecosystems and formats and implementation libraries do: the final strictness of content addressing makes round-tripping even _more_ load-bearing than it usually is.
- Consider chanting: "Normalization Is Mutation".
- Remember, limited-domain codecs and limited-domain libraries are totally fine.
- We hope the IPLD specs are reasonably friendly to partial implementations as well as intentionally limited-domain implementations.
- It's just important that we be able to _specify_ and _talk clearly_ about these things.
- Remember, nothing is ever "free" in terms of developer efforts:
- I don't think we should place undue worry on the idea that people are going to need to write code for codecs when making a new IPLD library.
- Frankly, it's in general inevitable. We only get to aim for greater or lesser degrees.
- We've already seen that trying to use off-the-shelf codecs is often a bad/risky idea, because it results in various imprecision tolerance that doesn't match our specs, or requires patching to support links anyway, or etc etc.
- Off-the-shelf codecs can be fine for partial implementations, but for "complete" IPLD libraries, we really shouldn't coddle the idea that they're always going to work out. We can _hope_, but we shouldn't _presume_.
- If you can use some parts of off the shelf existing codec libraries in your language ecosystem, that's great. But don't be afraid to copy, fork, and expand.
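
The contrast between typed and context-free decoding described above can be sketched with Go's `encoding/json`, standing in for Serde-style libraries. Both function names are invented for this sketch, and the decoded Go values are only a loose analogue of a Data Model `Node`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// decodeContextFree takes only the serial data: (serialData) -> (value).
func decodeContextFree(serial []byte) (interface{}, error) {
	var v interface{}
	err := json.Unmarshal(serial, &v)
	return v, err
}

// decodeWithType also takes a target type, here fixed to map[int]string:
// (serialData, aType) -> (aValue). The type is an extra input.
func decodeWithType(serial []byte) (map[int]string, error) {
	var v map[int]string
	err := json.Unmarshal(serial, &v)
	return v, err
}

func main() {
	serial := []byte(`{"1":"a"}`)

	free, _ := decodeContextFree(serial)
	typed, _ := decodeWithType(serial)

	fmt.Printf("%T\n", free)  // map[string]interface {}
	fmt.Printf("%T\n", typed) // map[int]string
}
```

The same bytes yield int keys only when the caller supplies the type. A context-free decoder has no such hint, so whatever distinguishes key kinds has to be present in the serial data itself.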