
$localize - legacy message id handling

Background

In Angular i18n, messages in component templates are translated by matching a message id. Message ids can be custom, provided by the developer (specified by the @@id syntax), or computed by a digest function. Translations of i18n messages are commonly stored in files that have one of three XML-based formats:

  • XLIFF 1.2
  • XLIFF 2
  • XMB/XTB (also used by goog.getMsg())

The digest function for XLIFF 1.2 is different to the other two formats.

Note that none of these formats specifies how to compute message ids. It is up to the tools reading and writing them to agree on a digest function.

Pre-$localize

Before $localize, the Angular compiler translated messages during compilation of the template. At this point it has access to the translation format and so can apply the appropriate digest function to compute the message id for translation lookup.

Moreover, the Angular compiler has access to the original HTML. Some of the information used to compute the message ids is only available in the original HTML source.

Post-$localize

With $localize, translation is done much later in the build pipeline after the Angular compilation has completed. At this point, the only information available is what is passed to the $localize function.

In other words, only the static parts of the template string, the substitution expressions and message metadata blocks are available for computing the message id.

For example, in the following tagged string:

$localize `:greeting|Home page user greeting:Hello, ${user.name}:name:!`

The only information available is:

meaning: 'greeting'
description: 'Home page user greeting'
message parts: ['Hello, ', '!']
placeholder names: ['name']
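To make this concrete, here is a minimal sketch of what any tag function receives for that string. The name captureTagInput and the sampleUser object are hypothetical stand-ins for illustration; this is not the $localize implementation. The meaning, description, message parts and placeholder names listed above all have to be parsed out of these raw strings.

```typescript
// Hypothetical stand-in for $localize: it only records what a tag function is given.
interface TagInput {
  rawParts: string[];      // static text, still containing the ':...:' blocks
  expressions: unknown[];  // evaluated substitution expressions
}

function captureTagInput(parts: TemplateStringsArray, ...expressions: unknown[]): TagInput {
  return { rawParts: parts.raw.slice(), expressions };
}

// Assumed sample data, not part of the original document.
const sampleUser = { name: 'Ada' };
const captured = captureTagInput`:greeting|Home page user greeting:Hello, ${sampleUser.name}:name:!`;
// captured.rawParts    -> [':greeting|Home page user greeting:Hello, ', ':name:!']
// captured.expressions -> ['Ada']
```

Nothing about the original HTML or the translation file format appears in either array.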

Translation problems

The original HTML and the format of the translation files (and so the digest function) are no longer available at the time of translation. This raises some problems that need to be addressed.

Obscure canonical message strings

The current canonical message string is difficult (if not impossible) to compute only from the information passed to $localize.

For example, given the following HTML:

"<p i18n>
Press <b>cancel</b>
to stop {{job}} job
</p>"

The canonical message string would be:

"
Press <ph tag name="START_BOLD_TEXT">cancel</ph name="CLOSE_BOLD_TEXT">[
to stop ,<ph name="INTERPOLATION">job</ph>, job
]"

Note that the sequence of static and interpolated text gets wrapped in [... , ... ] to look like an array.

The equivalent $localize call would be:

$localize `
Press ${"�#1�"}:START_BOLD_TEXT:cancel${"/�#1�"}:CLOSE_BOLD_TEXT:
to stop ${"�0�"}:INTERPOLATION: job
`;

or something similar to:

$localize(
  [
    'Press ',
    ':START_BOLD_TEXT:cancel',
    ':CLOSE_BOLD_TEXT:\nto stop ',
    ':INTERPOLATION: job\n'
  ],
  '�#1�',
  '/�#1�',
  '�0�'
);

The grouping markers cannot be computed from the messageParts and expressions without a significant amount of parsing (and possibly guesswork).

Without knowledge of the original HTML it is not possible for $localize to compute the message id.

Unknown digest function

Currently, XLIFF 1.2 uses a different digest function from the other two formats. For example, given the message from the previous section, the computed message ids are:

  • XLIFF 2 and XMB/XTB: 7056919470098446707
  • XLIFF 1.2: ec1d033f2436133c14ab038286c4f5df4697484a

The previous implementation can cope with this because translation was done in the Angular compiler, which knew what format the translations were in and so what digest function to use in computing the message ids.

Without knowledge of the format of the translations (i.e. what digest function should be used), it is not possible for $localize to compute the message id.

Whitespace resilience

The current conversion of HTML to a canonical message string is resilient to some changes in the source message but not others. The following changes do not affect it:

  • Expressions being interpolated can change
  • Whitespace within ICU expressions can change

Significantly, though, whitespace outside ICU expressions is always included in the canonical message string, regardless of whether the component whose template contains the message has preserveWhitespaces set to true.

The $localize calls contain message strings where whitespace has been collapsed (unless preserveWhitespaces: true).

Without knowledge of the original HTML it is not possible for $localize to compute the message id in cases where whitespace has been collapsed.

Proposed Solution

To avoid these problems, the Ivy compiler should use a single digest function, common to all translation formats, that can be computed using only the information available to $localize.

Translation would be achieved by computing the message id from the $localize call and matching against a set of translations keyed off the message id.
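As a rough sketch of this lookup, assume translations are held in a plain object keyed by message id, and that each entry stores its static parts and placeholder names. The names SketchTranslation, renderSketch and translateSketch, and the id 'greeting-id', are hypothetical and chosen only for illustration.

```typescript
// Hypothetical shape of a message: static parts with a placeholder name between each pair.
interface SketchTranslation {
  parts: string[];
  placeholders: string[];
}

// Interleaves the static parts with the substitution for each named placeholder.
function renderSketch(message: SketchTranslation, substitutions: Record<string, string>): string {
  let text = message.parts[0];
  for (let i = 0; i < message.placeholders.length; i++) {
    text += substitutions[message.placeholders[i]] + message.parts[i + 1];
  }
  return text;
}

// Uses the translation stored under `messageId`, or falls back to the source message.
function translateSketch(
  translations: Record<string, SketchTranslation>,
  messageId: string,
  source: SketchTranslation,
  substitutions: Record<string, string>,
): string {
  const found = Object.prototype.hasOwnProperty.call(translations, messageId);
  return renderSketch(found ? translations[messageId] : source, substitutions);
}

// Assumed sample data: one French translation stored under a made-up id.
const frenchSketch: Record<string, SketchTranslation> = {
  'greeting-id': { parts: ['Bonjour, ', ' !'], placeholders: ['name'] },
};
const sourceGreeting: SketchTranslation = { parts: ['Hello, ', '!'], placeholders: ['name'] };

const translatedGreeting = translateSketch(frenchSketch, 'greeting-id', sourceGreeting, { name: 'Ada' });
// translatedGreeting -> 'Bonjour, Ada !'
const untranslatedGreeting = translateSketch(frenchSketch, 'unknown-id', sourceGreeting, { name: 'Ada' });
// untranslatedGreeting -> 'Hello, Ada!'
```

Keying substitutions by placeholder name, rather than by position, lets a translation reorder placeholders without changing the lookup.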

Extraction of messages (message ids and source messages) may be achieved directly from bundled code (containing calls to $localize) without any dependence on the Angular compiler.

Since this would be a breaking change for current applications, whose translation files might contain message ids computed using legacy digest functions, we should implement the following:

Common digest function

PR: https://github.com/angular/angular/pull/32867

Both XLIFF 2 and XMB/XTB use the same digest function. The new common digest function should use the same hashing function as these but compute the canonical message string in a way that is resilient to whitespace changes (if appropriate) and can be computed from the information provided to $localize alone.

It will be possible to compute the message id in the $localize function. Therefore there will be no need to pass around message ids, unless they are custom ids provided by the developer.

The digest function will work as follows:

  • Generate a canonical message string by joining the tagged string message parts together with generated placeholders of the form {$...}.
  • Compute a hash using the current computeMsgId(message, meaning) function.

Some examples:

$localize `abc${1}def${2}`
    -> 'abc{$PH}def{$PH_1}'
        -> '6223022895014632549'

$localize `abc${1}:custom:def${2}:custom2:`
    -> 'abc{$custom}def{$custom2}'
        -> '8479809234660862889'

$localize `:meaning|description:abc`
    -> 'abc'
        -> '1071947593002928768'

$localize `:@@custom-id:abc`
    -> ...
        -> 'custom-id'
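The first step (building the canonical message string) might be sketched as follows. The function name canonicalMessageSketch is hypothetical. Numbering unnamed placeholders by position (PH, PH_1, ...) is an assumption that matches the first example, and escaped colons are not handled. The hashing step is omitted because it relies on the existing computeMsgId function.

```typescript
// Matches a leading ':...:' block (metadata on the first part, placeholder name on the others).
const LEADING_BLOCK_SKETCH = /^:([^:]*):/;

// Builds a canonical message string from the static parts of a tagged string.
function canonicalMessageSketch(messageParts: string[]): string {
  // Drop the optional metadata block from the first part.
  let canonical = messageParts[0].replace(LEADING_BLOCK_SKETCH, '');
  for (let i = 1; i < messageParts.length; i++) {
    const match = LEADING_BLOCK_SKETCH.exec(messageParts[i]);
    // Assumption: unnamed placeholders are numbered by position.
    const fallbackName = i === 1 ? 'PH' : 'PH_' + (i - 1);
    const placeholderName = match ? match[1] : fallbackName;
    const rest = match ? messageParts[i].slice(match[0].length) : messageParts[i];
    canonical += '{$' + placeholderName + '}' + rest;
  }
  return canonical;
}

const canonicalUnnamed = canonicalMessageSketch(['abc', 'def', '']);
// canonicalUnnamed -> 'abc{$PH}def{$PH_1}'
const canonicalNamed = canonicalMessageSketch(['abc', ':custom:def', ':custom2:']);
// canonicalNamed -> 'abc{$custom}def{$custom2}'
const canonicalWithMetadata = canonicalMessageSketch([':meaning|description:abc']);
// canonicalWithMetadata -> 'abc'
```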

By enabling message id generation from $localize calls there is no need to add computed message ids to the generated template code. This keeps the size of the bundles down, especially for runtime translation, where calls to $localize are not inlined.

If the component is not set to preserveWhitespaces: true then canonical message strings generated from its templates will have already had their whitespace collapsed.

Computed message ids are resilient to trivial whitespace changes, unless the component specifically preserves whitespace in its template.

In order to localize strings within application code (e.g. in an Angular service) the developer would call $localize directly. The message ids can be computed directly from application code calls to $localize.

Localized messages, within application code, are supported out of the box.

Legacy mode

PR: https://github.com/angular/angular/pull/32937

For initial backward compatibility with pre-Ivy translation files, we shall provide a legacy mode in the Angular compiler.

In this mode we will compute the old message id using the appropriate digest function and pass it through to the $localize call as a custom id. (This is basically what is happening already in the code but only for the XMB/XLIFF 2 format.)

In the pre-Ivy world, translations are done in the Angular compiler, where a translation format must be provided (via the compiler option i18nFormat). We can make use of this in the legacy mode.

If this format option is provided then the Angular compiler should add the legacy message ids to the $localize calls as custom ids inside a metadata block.
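One way to picture the resulting metadata block is the sketch below. The helper name addLegacyIdSketch is hypothetical and this is not compiler code. It follows the :(meaning|)?(description)?(@@id)?: layout described under "Message metadata", and the id value is borrowed from the XLIFF 2 example earlier in this document purely as sample input.

```typescript
// Builds a metadata block that carries a legacy message id as a custom id.
function addLegacyIdSketch(meaning: string, description: string, legacyId: string): string {
  let block = '';
  if (meaning) {
    block += meaning + '|';   // the meaning is optional and ends with '|'
  }
  block += description;       // the description is optional
  block += '@@' + legacyId;   // the legacy id is passed as a custom id
  return ':' + block + ':';
}

const legacyBlock = addLegacyIdSketch('greeting', 'Home page user greeting', '7056919470098446707');
// legacyBlock -> ':greeting|Home page user greeting@@7056919470098446707:'
const legacyBlockIdOnly = addLegacyIdSketch('', '', '7056919470098446707');
// legacyBlockIdOnly -> ':@@7056919470098446707:'
```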

Old translation files can continue to be used until the developer is ready to migrate.

Migration tool

Implement a tool that converts each translation file to the new message id format. Due to the legacy mode this can be a secondary activity.

Concepts

Message id

A string that uniquely identifies a message to be translated. These can be custom (provided by a developer) or computed via a digest function.

Digest function

A function that converts a message to a hash value that can be used to look up a translation.

Digest functions typically implement the following three steps:

  • Convert the message to a canonical string representation
  • Combine the canonical string with an optional meaning string
  • Compute a hash value from this combined string
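The three steps can be illustrated with a toy digest function. Everything here is an assumption made for illustration: the canonical form simply collapses whitespace, the meaning is appended in square brackets, and the hash is a 32-bit FNV-1a-style function over UTF-16 code units. None of these are the functions Angular uses.

```typescript
// Step 1: convert the message to a canonical string (here: collapse whitespace).
function canonicalizeSketch(message: string): string {
  return message.replace(/\s+/g, ' ').trim();
}

// Step 3 helper: a stand-in 32-bit FNV-1a-style hash, returned as 8 hex digits.
function toyHashSketch(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, keeping 32 bits.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ('00000000' + hash.toString(16)).slice(-8);
}

function digestSketch(message: string, meaning: string = ''): string {
  const canonical = canonicalizeSketch(message);
  // Step 2: combine the canonical string with the optional meaning.
  const combined = meaning ? canonical + '[' + meaning + ']' : canonical;
  // Step 3: hash the combined string.
  return toyHashSketch(combined);
}

// Whitespace-only differences in the source produce the same id in this sketch.
const digestSpaced = digestSketch('Hello,   world');
const digestCompact = digestSketch('Hello, world');
```

Because the meaning is part of the hashed string, the same text with two different meanings would normally receive two different ids.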

Canonical message string

A string that represents the message to be translated, which is resilient to irrelevant changes, such as the original text of expressions being interpolated, or certain whitespace changes.

Meaning

A string, associated with a message, that indicates the particular meaning of a message that might otherwise be ambiguous. For example, the English word "right" could be translated to more than one French word, e.g. "droit" or "vrai".

In Angular, meanings are assigned to messages via a message metadata string.

Message metadata

Additional information about a message, included in the template string literal tagged with the $localize function via "blocks" delimited by colon characters (:).

The meaning, description and custom id block must be at the start of the string:

$localize `:(meaning|)?(description)?(@@id)?:message string`

In this block the meaning, description and id are optional and delimited by | and @@ respectively.
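A minimal parser for the text between the colons could look like the sketch below. The names MetadataSketch and parseMetadataBlockSketch are hypothetical, and the sketch assumes that neither | nor @@ appears inside the meaning or description.

```typescript
interface MetadataSketch {
  meaning: string;
  description: string;
  customId: string;
}

// Parses the text between the colons of a metadata block,
// e.g. 'greeting|Home page user greeting@@my-id'. Missing parts become ''.
function parseMetadataBlockSketch(block: string): MetadataSketch {
  const idIndex = block.indexOf('@@');
  const customId = idIndex === -1 ? '' : block.slice(idIndex + 2);
  const beforeId = idIndex === -1 ? block : block.slice(0, idIndex);
  const pipeIndex = beforeId.indexOf('|');
  const meaning = pipeIndex === -1 ? '' : beforeId.slice(0, pipeIndex);
  const description = pipeIndex === -1 ? beforeId : beforeId.slice(pipeIndex + 1);
  return { meaning, description, customId };
}

const parsedFull = parseMetadataBlockSketch('meaning|description@@id');
// parsedFull -> { meaning: 'meaning', description: 'description', customId: 'id' }
const parsedIdOnly = parseMetadataBlockSketch('@@custom-id');
// parsedIdOnly -> { meaning: '', description: '', customId: 'custom-id' }
const parsedDescriptionOnly = parseMetadataBlockSketch('Home page user greeting');
// parsedDescriptionOnly -> { meaning: '', description: 'Home page user greeting', customId: '' }
```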

Placeholder name blocks appear directly after a substitution:

$localize `Hello, ${person.name}:name:. Welcome to the game.`