# String Lengths: Bytes or Characters?

Farcaster currently specifies the allowed lengths of strings in bytes. For example, a cast's text field can hold 320 bytes. Usually, this means 320 characters, but because of [how UTF-8 works](#Appendix-A-UTF-Code-Points-and-Bytes), it can be fewer. Currency symbols, fancy quotation marks and other common glyphs take up multiple bytes.

Character lengths would be simpler for most users to understand, but they have other implications that negatively affect protocol stability, developer experience and user experience. Applications like Twitter do not use true character lengths and expose [some of this complexity](#Appendix-B-What-does-Twitter-do) to users. What should Farcaster do?

### Proposal: Use Bytes

Calculate cast text length in bytes and limit it to 320. The text `hi` takes 2 bytes, `ħi` takes 3 bytes and `ħī` takes 4 bytes due to UTF-8 encoding.

Denote cast mentions with byte positions in a separate array in the object. Mentions do not count against the cast length, and up to 10 mentions are allowed. The text `hi @foo anđ @bar` becomes `{text: "hi anđ ", mentions: [{fid: 1, pos: 3}, {fid: 2, pos: 8}]}`, assuming foo and bar have fids 1 and 2 respectively. Farcaster can also open source libraries and packages to make composing casts easier for developers.

This approach has the most beneficial side effects for all parties:

- Users benefit because the max cast size is unchanged and storage limits don't need to be lowered (10,000 casts per user).
- The protocol benefits because DDOS attacks are 4x harder and clients and users are incentivized to use [efficient unicode representations](https://unicode.org/reports/tr15/).
- Developers benefit because byte counting is less error prone.

### FAQ

**Isn't this worse for users?**

No, using character limits is worse for users in other ways. They get ~2x less storage and their casts are pruned more quickly.
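The byte counting described in the proposal can be sketched in a few lines of TypeScript, using the built-in `TextEncoder` (which always produces UTF-8). The names `MAX_CAST_BYTES`, `byteLength` and `validateCastText` are illustrative, not part of the protocol:

```typescript
// Minimal sketch of byte-based cast length validation, assuming the
// 320-byte limit proposed above. All names here are hypothetical.
const MAX_CAST_BYTES = 320;
const encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

function byteLength(text: string): number {
  return encoder.encode(text).length;
}

function validateCastText(text: string): boolean {
  return byteLength(text) <= MAX_CAST_BYTES;
}

byteLength("hi"); // → 2
byteLength("ħi"); // → 3 (ħ is U+0127, two bytes in UTF-8)
byteLength("ħī"); // → 4
```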
Also, users on Twitter face this problem today, since emojis take 2 characters, and they seem to be OK with the experience.

**Why do users get less storage if we use characters?**

Hubs have finite storage and place limits on how many casts users can store, which is 10,000 today. A cast limited to 320 bytes is always at most 320 bytes, but a cast limited to 320 characters can be anywhere between 320 and 1,280 bytes. Hubs would have to reduce the limit from 10,000, and the new limit would likely be closer to 5,000.

**Can we switch to a purely bytes-based storage system?**

A purely bytes-based system, where a user has N bytes of total storage and no limit on cast size, solves this issue but introduces others. A client could accidentally generate a 1 MB cast and wipe out a user's history. An approach with a total byte budget would need to be a hybrid with multiple limits, which is not a clear winner over the current model.

## Appendix A: UTF, Code Points and Bytes

### UTF-8

- A character may consist of one or more code points.
- UTF-8 encodes each code point as 1 to 4 bytes.
- Common Latin characters (`a`) take up one code point and one byte.
- Common emoji (`😊`) take up one code point and 4 bytes.
- Uncommon glyphs (🇯🇵, 🙋🏿) take up 2 code points and up to 8 bytes.

### UTF-16

- A character may consist of one or more code points.
- UTF-16 encodes each code point as 2 or 4 bytes.
- UTF-16 is generally less efficient than UTF-8, except for the code point range U+0800 to U+FFFF, which takes 3 bytes in UTF-8 but only 2 bytes in UTF-16.

## Appendix B: What does Twitter do?

Twitter does a custom mapping that seems to assign every unicode character a length of 1 or 2, even if it takes up more bytes. The test suite [here](https://github.com/twitter/twitter-text/blob/master/conformance/validate.yml) provides some examples.
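The code point and byte counts in Appendix A can be checked directly in TypeScript: string spread iterates by code point, `TextEncoder` yields UTF-8 bytes, and `.length` counts UTF-16 code units. The function name `measure` is illustrative:

```typescript
// Verifies the code point and byte counts from Appendix A.
const utf8 = new TextEncoder();

function measure(s: string) {
  return {
    codePoints: [...s].length,        // spread iterates by code point
    utf8Bytes: utf8.encode(s).length, // UTF-8 byte length
    utf16Bytes: s.length * 2,         // .length counts UTF-16 code units
  };
}

measure("a");  // → { codePoints: 1, utf8Bytes: 1, utf16Bytes: 2 }
measure("😊"); // → { codePoints: 1, utf8Bytes: 4, utf16Bytes: 4 }
measure("🇯🇵"); // → { codePoints: 2, utf8Bytes: 8, utf16Bytes: 8 }
```

Note that the flag glyph is two regional-indicator code points, each of which needs 4 bytes in UTF-8, which is why byte limits and "character" counts diverge so sharply for such glyphs.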