owned this note
owned this note
Published
Linked with GitHub
# My many years confusion on Unicode & UTF-8
---
## Format
- We will start with **questions**
- Not structured
---
## How to participate
- Feel free to interupt and raise discussiona at anytime!
- especially if you have little prior knowledge on Unicode (like me from last week)
---
## Q1: Unicode v.s. UTF-8?
- charset v.s. encoding
---
## Unicode is a charset
- Unicode Charater Set (UCS) is a **charset**, a mapping between **characters** and **code points**
- Other charsets: ASCII
| Grapheme (character) | R | 讚 | 👍 | ... |
| --- | --- | --- | --- | --- |
| Unicode Code Points | [U+0052](https://www.compart.com/en/unicode/U+0052) | [U+8B9A](https://www.compart.com/en/unicode/U+8B9A) | [U+1F44D](https://www.compart.com/en/unicode/U+1F44D) | ... |
---

---
## UTF-8 is a encoding
- UTF = Unicode Transformation Format
- UTF-8 is a **encoding**, the algorithm for converting **Code Points** to **Bytes**
- Other encoding: UTF-32, BIG5, GB2312 ...
---
## Let's see UTF-32 first
<small>
| Grapheme (character) | R | 讚 | 👍 |
| --- | --- | --- | --- |
| Unicode Code Points | [U+0052](https://www.compart.com/en/unicode/U+0052) | [U+8B9A](https://www.compart.com/en/unicode/U+8B9A) | [U+1F44D](https://www.compart.com/en/unicode/U+1F44D) ||
| UTF-32 Encoding | `0x00000052` | `0x00008B9A` | `0x0001F44D` |
</small>
- What are some problem?
---
## UTF-8 is a (better) encoding
<small>
| Grapheme (character) | R | 讚 | 👍 |
| --- | --- | --- | --- |
| Unicode Code Points | [U+0052](https://www.compart.com/en/unicode/U+0052) | [U+8B9A](https://www.compart.com/en/unicode/U+8B9A) | [U+1F44D](https://www.compart.com/en/unicode/U+1F44D) |
| UTF-32 Encoding | `0x00000052` | `0x00008B9A` | `0x0001F44D` |
| UTF-8 encoding | `0x52` <br> `[82]` | `0xE8 0xAE 0x9A` <br> `[232, 174, 154]` | ` 0xF0 0x9F 0x91 0x8D` <br> `[240, 159, 145, 141]` |
</small>
---
## UTF-8 is a (better) encoding
- Variable length encoding
- "R" takes up 1 byte v.s. "讚" takes up 3 bytes
- Save space
- Can't to jump charactor of arbitrary index
- Same ordering as unicode
---

---
### Ruby demo
```txt
Ruby讚👍🏽
```
```ruby!
content = File.open('my-utf8-file.txt').read
# => "Ruby讚👍🏽\n"
# Why Ruby know how?
Encoding.default_external
# => #<Encoding:UTF-8>
content.length # Count by charactors!
# => 8
content.split('')
# => ["R", "u", "b", "y", "讚", "👍", "🏽", "\n"]
content.bytes
# => [82, 117, 98, 121, 232, 174, 154, 240, 159, 145, 141,
# 240, 159, 143, 189, 10]
content.split('').map{|char| [char, char.bytes]}.to_h
# => {"R"=>[82],
# "u"=>[117],
# "b"=>[98],
# "y"=>[121],
# "讚"=>[232, 174, 154],
# "👍"=>[240, 159, 145, 141],
# "🏽"=>[240, 159, 143, 189],
# "\n"=>[10]}
content.split('').map{|char| [char, char.bytes, char.bytes.map{|byte| byte.to_s(2)}]}
# =>
# [["R", [82], ["1010010"]],
# ["u", [117], ["1110101"]],
# ["b", [98], ["1100010"]],
# ["y", [121], ["1111001"]],
# ["讚", [232, 174, 154],
# ["11101000", "10101110", "10011010"]],
# ["👍", [240, 159, 145, 141],
# ["11110000", "10011111", "10010001", "10001101"]],
# ["🏽", [240, 159, 143, 189],
# ["11110000", "10011111", "10001111", "10111101"]],
# ["\n", [10], ["1010"]]]
# Look for the pattern:
# - `0xxxxxxx`
# - `1110xxxx 10xxxxxx 10xxxxxx`
# - `11110xxx 10xxxxxx 10xxxxxx`
```
---
## Q: What does "UTF-8 is backward compatible w/ ASCII" actually means?
~~A valid JavaScript program is a valid TypeScript program.~~
A valid ASCII byte string is a valid UTF-8 byte string.
```ruby!
"Ruby".encode("ASCII").bytes
# => [82, 117, 98, 121]
"Ruby".encode("UTF-32").bytes
# => [0, 0, 254, 255, 0, 0, 0, 82, 0, 0, 0, 117, 0, 0, 0, 98, 0, 0, 0, 121]
"Ruby".encode("UTF-8").bytes
# => [82, 117, 98, 121]
"Ruby".encode("UTF-16").bytes
# => [254, 255, 0, 82, 0, 117, 0, 98, 0, 121]
```
---
## Q: Why the hard-coded `4E00-9FFF` for chinese?
- Let's check out [Unicode Planes & BMP](https://zh.wikipedia.org/zh-tw/Unicode%E5%AD%97%E7%AC%A6%E5%B9%B3%E9%9D%A2%E6%98%A0%E5%B0%84#%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A7%8D%E5%B9%B3%E9%9D%A2)
- Planes are just visualization for code points grouping

- Caveats
- it’s actually CJVK
- missing some newer CJK
---
## Q: Why Mojibake / 亂碼 / Buchstabensalat?
A encoded byte string may not contain its encoding. We must guess which encoding to use.
> 
> From [Programming with Unicode - 3.11 Mojibake](https://unicodebook.readthedocs.io/definitions.html#mojibake-1)
---
## Q: "Unicode" is a "encoding" options in Windows?

- UTF-8 means `UTF-8`
- Unicode big endian means `UTF-16BE`
- Unicode means `UTF-16LE` 😱
<!--
## Q: Why is 新同文堂 awesome?
## Q: BOM and endiness
Endieness
- UTF-16 quicklook
- BOM (byte order mark)
- `0xFF 0xFE ...`: little endian
- `0xFE 0xFF`: big endian
- Gulliver's Travels
- BOM in `UTF-8`!?
- Why we don’t like UTF-16?
- Conclusion
- Caveats
## Q: Why `UTF-8` has no endieness problem?
There is no ambiguiaty for how to interpret UTF-8 in terms of endieness.
ref: https://superuser.com/a/1648865/954056
## Q: [UTF-8] Why so many `10xxxxxx`?
Can quickly pick up in the middle.
## Q: What is `surrogate`?
-->
---
# Reference
<small>
- [YouTube - Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more](https://www.youtube.com/watch?v=ut74oHojxqo)
- Only 10 mins
- Visual demo
- [內核恐慌 Podcast](https://pan.icu/) & [字谈字暢 Podcast](https://www.thetype.com/typechat/)
- [EP026:Kerning Panic·字谈字串(二)](https://www.thetype.com/typechat/ep-026/)
- [Programming with Unicode](https://unicodebook.readthedocs.io)
- A comprehensive but succint little book
- [pjchender - 瞭解網頁中看不懂的編碼:Unicode 在 JavaScript 中的使用](https://pjchender.dev/webdev/guide-unicode/)
- Wikipedia
- [Unicode](https://zh.wikipedia.org/zh-tw/Unicode)
- [UTF-8](https://zh.wikipedia.org/zh-tw/UTF-8)
- [Unicode Plane](https://zh.wikipedia.org/zh-tw/Unicode%E5%AD%97%E7%AC%A6%E5%B9%B3%E9%9D%A2%E6%98%A0%E5%B0%84#%E5%9F%BA%E6%9C%AC%E5%A4%9A%E6%96%87%E7%A7%8D%E5%B9%B3%E9%9D%A2)
- [Unicode、UTF-8、UTF-16,終於懂了](https://www.readfog.com/a/1638084002220969984)
- A great quick refresher in the future
</small>
---
## Bonus
- If you like computing history and trivia:
- ⭐[Command Line Hero](https://www.redhat.com/en/command-line-heroes/season-1) podcast by Red Hat
- recording 👍
- storytelling 👍
<!-- - ⭐ How to implement a search engine in 150 lines of Python -->