Section 5: HTML Document Representation

This section covers two topics:

What abstract characters may be part of an HTML document
How those characters may be represented in a file or when transferred over the Internet

5.1 The Document Character Set

A document character set consists of:

A Repertoire: A set of abstract characters
Code positions: A set of integer references to characters in the repertoire.

e.g. the ASCII character set, whose repertoire of 128 characters is assigned the code positions 0 to 127. (Figure: ASCII table, image not available.)

Since the ASCII character set is not sufficient for a global information system such as the Web, HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646].

The character set defined in [ISO10646] is character-by-character equivalent to Unicode.
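
The split between repertoire and code positions can be explored from Python, whose `str` type is indexed by Unicode code points. This is only a minimal sketch: the helper name `describe` is made up for illustration, and it relies on the standard `unicodedata` module for character names.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code position (in hex) and the Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# ASCII characters keep the same code positions in UCS / Unicode.
assert ord("A") == 65
assert chr(65) == "A"

# Characters outside ASCII simply have larger code positions.
print(describe("A"))
print(describe("\u00e5"))  # "a" with a ring above
print(describe("\u65e5"))  # a CJK ideograph
```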

5.2 Character encodings

A character encoding (also called a charset) is a method of converting a sequence of bytes into a sequence of characters.
On the Web, this works as follows:
- Servers send HTML documents to user agents as a stream of bytes.
- User agents interpret them as a sequence of characters.
A simple one-byte-per-character encoding is not sufficient for text over a character repertoire as large as [ISO10646].
There are several different encodings of parts of [ISO10646], in addition to encodings of the entire character set (such as UCS-4).
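
To make the byte/character distinction concrete, the sketch below encodes one character with three encodings and checks which encodings cover a given string. It assumes Python's built-in codecs; `utf-32-be` is used as a stand-in for UCS-4, and `can_encode` is a hypothetical helper, not part of any standard.

```python
text = "\u00e5"  # "a" with a ring above, code position 229 (U+00E5)

# One abstract character, three different byte sequences.
latin1 = text.encode("iso-8859-1")  # 1 byte
utf8 = text.encode("utf-8")         # 2 bytes
ucs4 = text.encode("utf-32-be")     # 4 bytes, stand-in for UCS-4

print(latin1.hex(), utf8.hex(), ucs4.hex())

# Decoding with the wrong encoding gives wrong characters, not an error.
mojibake = utf8.decode("iso-8859-1")

def can_encode(s: str, encoding: str) -> bool:
    """Report whether `encoding` covers every character in `s`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ISO-8859-1 covers only part of the document character set.
print(can_encode("\u65e5", "iso-8859-1"))
print(can_encode("\u65e5", "utf-8"))
```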

5.2.1 Choosing an encoding

Servers and proxies may change a character encoding on the fly (called transcoding) to meet the requests of user agents.
Servers and proxies do not have to serve a document in a character encoding that covers the entire document character set.
Commonly used character encodings include ISO-8859-1, ISO-8859-5, EUC-JP, and UTF-8 (an encoding of ISO 10646).

5.2.2 Specifying the character encoding

Server
- The server determines which character encoding applies to a document. Common approaches:
  - Examine the first few bytes of the document.
  - Check against a database of known files and encodings.
  - Let web masters control the charset configuration directly.
- Web masters should use these mechanisms to send out a "charset" parameter whenever possible.
User agent
- User agents use the following information, from highest priority to lowest, to determine which character encoding applies:
  1. An HTTP "charset" parameter in the "Content-Type" field of an HTTP header.
```
Content-Type: text/html; charset=EUC-JP
```
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset":
```
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
```
    - It should only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed).
    - META declarations should appear as early as possible in the HEAD element.
  3. The "charset" attribute set on an element that designates an external resource.
- The user agent may also use heuristics and user settings (that is, guess the encoding itself or fall back to a built-in default encoding).
- User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.
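
The priority order above can be sketched as a small lookup chain. This is an illustration only, not how any real browser is implemented: the function names, the 1024-byte scan window, the regular expressions, and the `utf-8` fallback (standing in for "heuristics and user settings") are all assumptions.

```python
import re
from typing import Optional

_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)
_META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>', re.IGNORECASE
)

def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if present."""
    if not value:
        return None
    match = _CHARSET_RE.search(value)
    return match.group(1) if match else None

def charset_from_meta(body: bytes) -> Optional[str]:
    """Look for a META Content-Type declaration near the start of the document.

    The bytes are read as ASCII, which is why META only works for encodings
    in which ASCII-valued bytes stand for ASCII characters.
    """
    match = _META_RE.search(body[:1024])
    if not match:
        return None
    return charset_from_content_type(match.group(0).decode("ascii", errors="replace"))

def pick_encoding(http_content_type: Optional[str], body: bytes,
                  link_charset: Optional[str] = None,
                  fallback: str = "utf-8") -> str:
    """Apply the priorities: HTTP header, then META, then the linking element."""
    return (charset_from_content_type(http_content_type)
            or charset_from_meta(body)
            or link_charset
            or fallback)

page = b'<HEAD><META http-equiv="Content-Type" content="text/html; charset=EUC-JP"></HEAD>'
print(pick_encoding("text/html; charset=ISO-8859-5", page))  # HTTP header wins
print(pick_encoding("text/html", page))                      # falls through to META
```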

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless, so user agents must not assume any default value for the "charset" parameter.

If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard.

5.3 Character references

Character references are used when:

A given character encoding cannot express all characters of the document character set.
Hardware or software configurations do not allow users to input some document characters.

Authors may use SGML character references to enter any character from the document character set.

Character references in HTML appear in two forms:

Numeric character references (either decimal or hexadecimal).
Character entity references.

5.3.1 Numeric character references

Numeric character references specify the code position of a character in the document character set.

e.g.

&#229; (in decimal) represents the letter "a" with a small circle above it.
&#xE5; (in hexadecimal) represents the same character.
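
As a quick check, Python's standard `html` module resolves both numeric forms (`&#229;` and `&#xE5;`) to the same character, and the `xmlcharrefreplace` error handler goes the other way when an encoding cannot express a character. The helper name `to_ascii_with_references` is made up for this sketch.

```python
import html

# Decimal and hexadecimal references name the same code position (229 == 0xE5).
decimal_form = html.unescape("&#229;")
hex_form = html.unescape("&#xE5;")
print(decimal_form == hex_form == "\u00e5")

def to_ascii_with_references(text: str) -> str:
    """Replace every character that ASCII cannot express with a decimal
    numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_ascii_with_references("sm\u00e5"))
```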

5.3.2 Character entity references

To give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references.
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
HTML 4 does not define a character entity reference for every character in the document character set.
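
A short sketch of these entity references in practice, assuming Python's standard `html` and `html.entities` modules (the latter exposes the HTML 4 entity names as `name2codepoint` and `codepoint2name`). The sample markup string is arbitrary.

```python
import html
from html.entities import codepoint2name, name2codepoint

# Entity names map to code positions in the document character set.
for name in ("lt", "gt", "amp", "quot"):
    print(name, name2codepoint[name], chr(name2codepoint[name]))

# Escaping markup-significant characters so they display literally.
escaped = html.escape('<a href="x">Q&A</a>')
print(escaped)
restored = html.unescape(escaped)

# Not every character has an entity name; such characters need a numeric
# character reference instead.
print(codepoint2name.get(0xE5))    # has an HTML 4 entity name
print(codepoint2name.get(0x65E5))  # no entity name defined
```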

5.4 Undisplayable characters

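A user agent may not be able to render all characters in a document meaningfully, for instance because it lacks a suitable font. In such cases, the specification recommends the following: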

Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
If missing characters are presented using their numeric representation, use the hexadecimal (not decimal) form since this is the form used in character set standards.
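
One possible way to present missing characters in hexadecimal is sketched below. The bracketed `[U+XXXX]` placeholder format, the function names, and the "ASCII only" display test are assumptions made for illustration; the specification does not prescribe any of them.

```python
from typing import Callable

def placeholder(ch: str) -> str:
    """Format an undisplayable character using its hexadecimal code position."""
    return f"[U+{ord(ch):04X}]"

def render(text: str, displayable: Callable[[str], bool]) -> str:
    """Keep characters the (hypothetical) display can show, flag the rest."""
    return "".join(ch if displayable(ch) else placeholder(ch) for ch in text)

# Pretend the display can only show ASCII characters.
ascii_only = lambda ch: ord(ch) < 128
print(render("sm\u00e5", ascii_only))
```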
