Section 8: Language information and text direction

  • document link
  • purpose: internationalization of HTML (the lang and dir attributes)

8.1 Specifying the language of content: the lang attribute

This attribute specifies the base language of an element's attribute values and text content. User agents can use this information in several ways, including:

  • Assisting search engines
  • Assisting speech synthesizers
  • Helping a user agent select glyph variants for high quality typography
  • Helping a user agent choose a set of quotation marks
  • Helping a user agent make decisions about hyphenation, ligatures, and spacing
  • Assisting spell checkers and grammar checkers

8.1.1 Language codes

The lang attribute's value is a language code. Language codes consist of a primary code and a possibly empty series of subcodes:

language-code = primary-code ( "-" subcode )*

Example:
en : English
en-US : U.S. version of English
en-cockney : Cockney version of English

  • Two-letter primary codes are reserved for ISO 639 language abbreviations
  • Any two-letter subcode is understood to be an ISO 3166 country code
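
The grammar above can be sketched as a small parser. This is only an illustration under stated assumptions: `LanguageCode`, `parse_language_code`, and `country_of` are hypothetical names, and the code checks shape only, not whether a code is actually registered.

```python
from typing import NamedTuple, Optional, Tuple


class LanguageCode(NamedTuple):
    primary: str
    subcodes: Tuple[str, ...]


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as "en-US" into its primary code and subcodes."""
    primary, *subcodes = code.split("-")
    if not primary or any(not sub for sub in subcodes):
        raise ValueError(f"malformed language code: {code!r}")
    # Language codes are case-insensitive, so normalize to lower case.
    return LanguageCode(primary.lower(), tuple(s.lower() for s in subcodes))


def country_of(code: str) -> Optional[str]:
    """Return the country code when the first subcode has two letters."""
    parsed = parse_language_code(code)
    if parsed.subcodes:
        first = parsed.subcodes[0]
        if len(first) == 2 and first.isalpha():
            return first.upper()
    return None


print(parse_language_code("en-cockney"))
print(country_of("en-US"))
```

Here "en-US" yields a country code, while "en-cockney" does not, because its subcode is longer than two letters.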

8.1.2 Inheritance of language codes

An element inherits language code information according to the following order of precedence (highest to lowest):

  1. The lang attribute set for the element itself
  2. The closest parent element that has the lang attribute set
  3. The HTTP "Content-Language" header (which may be configured in a server)
  4. User agent default values and user preferences
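
The precedence order above can be sketched as a lookup that walks up the element tree. The `Element` class, the `effective_language` function, and the "en" fallback are assumptions made for illustration; a real user agent works on its own DOM and preference store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    tag: str
    lang: Optional[str] = None
    parent: Optional["Element"] = None


def effective_language(
    element: Element,
    content_language: Optional[str] = None,
    user_agent_default: str = "en",  # assumed default, not from the notes
) -> str:
    # Steps 1 and 2: the element's own lang, then the closest ancestor's.
    node: Optional[Element] = element
    while node is not None:
        if node.lang:
            return node.lang
        node = node.parent
    # Step 3: the HTTP Content-Language header, if the server sent one.
    if content_language:
        return content_language
    # Step 4: user agent defaults and user preferences.
    return user_agent_default


html_el = Element("html", lang="en")
para = Element("p", parent=html_el)
quote = Element("q", lang="he", parent=para)
print(effective_language(quote), effective_language(para))
```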

8.1.3 Interpretation of language codes

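  • A language code should be interpreted by user agents as a hierarchy of tokens rather than a single token.
  • When adjusting rendering according to language information, a user agent should always favor an exact match, but should also consider a match on the primary code to be sufficient.

Example: for the element below, a user agent should prefer style information matching the following codes, in this order: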

<HTML lang="en-US">

  1. en-US
  2. en
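
A minimal sketch of this matching rule, assuming a hypothetical `best_match` helper that picks among the language codes a style sheet provides:

```python
from typing import Iterable, Optional


def best_match(lang: str, available: Iterable[str]) -> Optional[str]:
    """Pick the available language code that best fits an element's lang."""
    by_lower = {code.lower(): code for code in available}
    wanted = lang.lower()
    # An exact match always wins.
    if wanted in by_lower:
        return by_lower[wanted]
    # Otherwise fall back to the more general primary code.
    primary = wanted.split("-")[0]
    return by_lower.get(primary)


print(best_match("en-US", ["en", "en-US", "fr"]))
print(best_match("en-US", ["en", "fr"]))
```

The sketch only falls back from the full code to the primary code; it does not try intermediate prefixes of longer codes.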

8.2 Specifying the direction of text and tables: the dir attribute

This attribute specifies the base direction of directionally neutral text in an element's content and attribute values.

  • LTR: Left-to-right text
  • RTL: Right-to-left text

Example: to mark up a Hebrew quotation, it is more intuitive to write the following than to insert the equivalent Unicode directional formatting characters by hand:

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

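The Hebrew script is the writing system used for Hebrew and other Jewish languages,
and it predates English by thousands of years. Hebrew is an example of bidirectional text:
Hebrew letters are read and written from right to left, while numbers run from left to right.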

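8.2.0 A brief introduction to Hebrew

Hebrew was the common language of the ancient Jewish people and is one of the oldest languages still in use. It holds an exalted place in religion: in antiquity, the Bible and the scriptures of Judaism were written in Hebrew. Today, Israel is the country that has Hebrew as an official language.

  • Hebrew belongs to the Semitic languages, and its script is written from right to left

  • Hebrew script writes only consonants; vowels are generally not written

    For example, take the English sentence:
    You should love your parents with all your heart.

    To write it the way Hebrew is written, first strip the vowels from every word:
    Y shld lv yr prnts wth ll yr hrt.

    Then write the result from right to left:
    .trh ry ll htw stnrp ry vl dlhs Y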
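
The two steps of this analogy can be reproduced with a short sketch. The function names are made up for illustration, and the code only mimics the English analogy; it says nothing about real Hebrew orthography.

```python
def strip_vowels(sentence: str) -> str:
    """Drop a, e, i, o, u (either case), keeping consonants and spacing."""
    return "".join(ch for ch in sentence if ch.lower() not in "aeiou")


def write_right_to_left(sentence: str) -> str:
    """Reverse the character order to imitate right-to-left writing."""
    return sentence[::-1]


original = "You should love your parents with all your heart."
consonants_only = strip_vowels(original)
print(consonants_only)
print(write_right_to_left(consonants_only))
```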

8.2.1 Introduction to the bidirectional algorithm

Consider the following example text:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

The characters are stored in the computer in logical order, indexed from 0:

0 => e
1 => n
...
49 => 6

The way to display this sentence depends on which language is predominant.

  • English

    english1 2WERBEH english3 4WERBEH english5 6WERBEH
             <------          <------          <------
                H                H                H
    ------------------------------------------------->
                           E

  • Hebrew

    6WERBEH english5 4WERBEH english3 2WERBEH english1
            ------->         ------->         ------->
                E                E                E
    <-------------------------------------------------
                           H
    
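The two renderings above can be reproduced with a toy model. This is not the Unicode Bidirectional Algorithm: it assumes, as the example does, that upper-case words stand in for Hebrew and that a whole word (including its digit) shares one direction. The function names are hypothetical.

```python
def is_rtl_word(word: str) -> bool:
    # Toy convention: any upper-case letter marks a "Hebrew" word.
    return any(ch.isupper() for ch in word)


def display_order(text: str, base: str) -> str:
    """Return the visual order of `text` for base direction "ltr" or "rtl"."""
    words = text.split(" ")
    # Each right-to-left word is shown with its characters reversed.
    shown = [w[::-1] if is_rtl_word(w) else w for w in words]
    # With a right-to-left base, the words themselves also run right to left.
    if base == "rtl":
        shown.reverse()
    return " ".join(shown)


TEXT = "english1 HEBREW2 english3 HEBREW4 english5 HEBREW6"
print(display_order(TEXT, "ltr"))
print(display_order(TEXT, "rtl"))
```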

8.2.2 Inheritance of text direction information

The bidirectional algorithm requires a base text direction for each block of text. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting dir on a nested element overrides the inherited value.

Inline elements, on the other hand, do not inherit the dir attribute. An inline element without a dir attribute therefore does not open an additional level of embedding for the bidirectional algorithm.
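
For block-level elements, the rule can be sketched as a walk up the chain of enclosing blocks. The `Block` class and `base_direction` function are hypothetical and model block-level elements only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    tag: str
    dir: Optional[str] = None
    parent: Optional["Block"] = None


def base_direction(block: Block) -> str:
    """Return the base direction in effect for a block-level element."""
    node: Optional[Block] = block
    while node is not None:
        if node.dir:
            # The attribute value is case-insensitive ("RTL" or "rtl").
            return node.dir.lower()
        node = node.parent
    # No enclosing block sets dir, so the default applies.
    return "ltr"


body = Block("body", dir="RTL")
div = Block("div", parent=body)
para = Block("p", dir="ltr", parent=div)
print(base_direction(div), base_direction(para))
```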

8.2.3 Setting the direction of embedded text

The bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (see 8.2.1).

However, in general only one level of embedding can be accounted for. Take the same example text again:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

  • English

    english1 2WERBEH english3 4WERBEH english5 6WERBEH
             <------          <------          <------
                H                H                H
    ------------------------------------------------->
                           E

  • Hebrew

    6WERBEH english5 4WERBEH english3 2WERBEH english1
            ------->         ------->         ------->
                E                E                E
    <-------------------------------------------------
                           H
    
  • English as the base language of the whole sentence, with an embedded Hebrew phrase that itself contains English: the author must supply additional information by delimiting the second embedding explicitly

    english1 4WERBEH english3 2WERBEH english5 6WERBEH
                     ------->
                        E
             <-----------------------
                        H
    ------------------------------------------------->
                        E

    english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6
    
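The effect of the explicit embedding can be reproduced with a toy model. It is not the Unicode Bidirectional Algorithm: upper-case words stand in for Hebrew, each word has a single direction, and exactly one embedded span is supported. All function names are hypothetical.

```python
def is_rtl_word(word: str) -> bool:
    # Toy convention: any upper-case letter marks a "Hebrew" word.
    return any(ch.isupper() for ch in word)


def display_run(text: str, base: str) -> str:
    """Visual order of a run with no nested embedding."""
    shown = [w[::-1] if is_rtl_word(w) else w for w in text.split(" ")]
    if base == "rtl":
        shown.reverse()
    return " ".join(shown)


def display_with_embedding(before: str, embedded: str, after: str,
                           base: str, embedded_dir: str) -> str:
    """Lay out `embedded` with its own direction inside the outer run."""
    parts = [
        display_run(before, base),
        display_run(embedded, embedded_dir),  # like <SPAN dir="...">
        display_run(after, base),
    ]
    if base == "rtl":
        parts.reverse()
    return " ".join(parts)


result = display_with_embedding(
    "english1", "HEBREW2 english3 HEBREW4", "english5 HEBREW6",
    base="ltr", embedded_dir="rtl",
)
print(result)
```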

8.2.4 Overriding the bidirectional algorithm: the BDO element

<!ELEMENT BDO - - (%inline;)*          -- I18N BiDi over-ride -->
<!ATTLIST BDO
  %coreattrs;                          -- id, class, style, title --
  lang        %LanguageCode; #IMPLIED  -- language code --
  dir         (ltr|rtl)      #REQUIRED -- directionality --
  >

dir is a mandatory attribute that specifies the base direction of the element's text content.

  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

Some situations may arise when the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider a document containing the same text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

but assume that this text has already been put in visual order, as can happen in email. The above might then be stored, including line breaks, as:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

This conflicts with the bidirectional algorithm, because that algorithm would invert 2WERBEH, 4WERBEH, and 6WERBEH a second time, displaying the Hebrew words left-to-right instead of right-to-left.

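Solution: put the excerpt in a PRE element (to preserve the line breaks) and wrap each line in a BDO element whose dir attribute is set to LTR: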

<PRE>
<BDO dir="LTR">english1 2WERBEH english3</BDO>
<BDO dir="LTR">4WERBEH english5 6WERBEH</BDO>
</PRE>
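
The double inversion, and how an override avoids it, can be shown with a toy model. As before, this is not the real bidirectional algorithm: upper-case words stand in for Hebrew, and `display_line` with its `override` parameter is a hypothetical stand-in for what BDO does.

```python
from typing import Optional


def is_rtl_word(word: str) -> bool:
    # Toy convention: any upper-case letter marks a "Hebrew" word.
    return any(ch.isupper() for ch in word)


def display_line(stored: str, override: Optional[str] = None) -> str:
    """Visual order of one line in a left-to-right block."""
    if override == "ltr":
        # Override: every character is laid out left to right as stored.
        return stored
    if override == "rtl":
        # Override: every character is laid out right to left.
        return stored[::-1]
    # No override: the algorithm reverses each right-to-left word.
    return " ".join(
        w[::-1] if is_rtl_word(w) else w for w in stored.split(" ")
    )


visual_line = "english1 2WERBEH english3"
print(display_line(visual_line))                  # Hebrew inverted again
print(display_line(visual_line, override="ltr"))  # shown as stored
```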

8.2.5 Character references for directionality and joining control

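Since ambiguities sometimes arise as to the directionality of certain characters (for example, punctuation), the Unicode specification includes characters that enable their proper resolution. Unicode also includes characters that control joining behavior where this is necessary (for example, with some Arabic letters).

HTML defines entities for these characters: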

<!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
<!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
<!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
<!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->
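
As a quick check, the four entities can be decoded with Python's standard library, which knows the same names. The `describe` helper is a hypothetical name used only for this illustration.

```python
import html
import unicodedata

# Entity names and decimal code points, as listed in the DTD excerpt.
ENTITIES = {"zwnj": 8204, "zwj": 8205, "lrm": 8206, "rlm": 8207}


def describe(name: str):
    """Decode a named entity and return (code point, Unicode name)."""
    char = html.unescape(f"&{name};")
    return ord(char), unicodedata.name(char)


for entity_name in ENTITIES:
    code_point, unicode_name = describe(entity_name)
    print(f"&{entity_name}; = U+{code_point:04X} {unicode_name}")
```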

8.2.6 The effect of style sheets on bidirectionality

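Because the bidirectional algorithm relies on the distinction between inline and block-level elements, special care is needed when a style sheet changes an element's rendering from one to the other.

When an inline element that does not have a dir attribute is transformed to the style of a block-level element by a style sheet, it inherits the dir attribute from its closest parent block element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to the style of an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.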
