Translation-friendly authoring, section Automatic translation and HTML


How to prevent translation

Paradoxically, one of the most serious practical problems in translating Web documents automatically is how to prevent translation of various parts of the document.

Consider the section of this document with the example of a text in English and its translation into French. Quite obviously, if that section is to be translated (into French or into some other language), the example text in French should not be translated, especially not by applying to it algorithms and dictionaries for translating from English to some language! (However, that's what Babelfish currently does.)

To take a simpler and more common example, consider a text in English with the proper name "John Birch" in it. When translating to Italian, for example, how can we prevent a program from translating "Birch" as "Betulla" (using the Italian word for birch)? Someone might suggest heuristics based on the use of capital letters, but that would be rather ineffective; it would fail entirely when translating from German, for example, since in German all nouns are spelled with a capital initial.
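The weakness of the capital-letter heuristic can be illustrated with a small sketch. The function name `guess_proper_names` and both sample sentences are invented for this illustration; the heuristic shown is the naive one discussed above, not a description of what any actual translation program does.

```python
import re

def guess_proper_names(sentence):
    """Naive heuristic: treat every capitalised word, except the
    first word of the sentence, as a proper name."""
    words = re.findall(r"[^\W\d_]+", sentence)  # runs of letters only
    return [w for w in words[1:] if w[0].isupper()]

english = "Yesterday John Birch planted a tree in the garden."
german = "Gestern pflanzte John Birch einen Baum im Garten."

print(guess_proper_names(english))  # ['John', 'Birch']
print(guess_proper_names(german))   # ['John', 'Birch', 'Baum', 'Garten']
```

In the German sentence the ordinary nouns "Baum" and "Garten" are flagged along with the name, so the heuristic would leave them untranslated.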

It seems obvious that some method of marking words as proper names is needed. That's not sufficient, however. There are other words, too, which must not be translated. Examples range from code-like things appearing in texts about computer languages (like the element name BODY in HTML or the keyword case in C) to linguistic texts speaking about words. It is obvious that when translating a text which discusses the English language, sample English words (as in "the plural of ox is oxen") must not be translated.

It should be noted that one cannot deduce from the word itself, as a string in a text, whether it should be translated, no matter how large a glossary we use. For example, the word "John" in a name like "John Birch" must remain as such, whereas "king John" must become "kuningas Juhana" in Finnish and "John the Baptist" must become "Jean-Baptiste" in French.

It seems that Babelfish regards the contents of the following HTML elements as something that shall not be translated: ADDRESS, CITE, and SAMP. For all of these, one can present arguments in favor of treating them as "literals" which are not to be translated. On the other hand, counterarguments could be presented, and at least the CODE element would be an obvious candidate to be added.
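A translation program could implement this kind of treatment roughly as follows. This is only a sketch built on Python's standard `html.parser` module; the names `LITERAL_ELEMENTS`, `LiteralAwareTranslator` and `translate_html` are invented here, and the `translate_text` argument stands in for whatever translation engine is actually used. Comments and declarations in the input are simply dropped to keep the example short.

```python
from html.parser import HTMLParser

# Elements treated as literals, following the observation above.
LITERAL_ELEMENTS = {"address", "cite", "samp"}

class LiteralAwareTranslator(HTMLParser):
    def __init__(self, translate_text):
        super().__init__(convert_charrefs=False)
        self.translate_text = translate_text
        self.depth = 0   # > 0 while inside a literal element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in LITERAL_ELEMENTS:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: copy it through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in LITERAL_ELEMENTS and self.depth > 0:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside a literal element is copied, other text is translated.
        self.out.append(data if self.depth else self.translate_text(data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def translate_html(html, translate_text):
    parser = LiteralAwareTranslator(translate_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

# str.upper is a stand-in "translator" that makes the effect visible.
sample = "<p>See <cite>Moby Dick</cite> and type <samp>ls</samp> now.</p>"
print(translate_html(sample, str.upper))
```

Adding CODE to the set of literal elements, as suggested above, would be a one-word change to `LITERAL_ELEMENTS`.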

But basically what is needed is a better official specification of the semantics of phrase markup elements in HTML. In the process of creating such specifications, the questions of translation should be explicitly discussed.

Discussion is needed to determine which is the best approach to preventing translations. Alternatives include:

  1. Specifying which HTML elements are to remain invariant in translations, at least by default.
  2. Introducing a phrase element (which might be called LIT) for specifying that a piece of text is a literal which is to remain unchanged in translations. A set of CLASS attribute values (such as CLASS="person" for personal names) might be introduced to specify the class of literals, mainly for style sheet purposes; they might have some relevance in translation, too.
  3. Introducing an attribute for specifying that the content of the element is a literal which is to remain unchanged in translations.

The first alternative could hardly be the only solution. It would require additional methods both for specifying that other element instances are translation-invariant and for specifying that normally translation-invariant elements are to be translated.
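One way to picture how the first and third alternatives might work together is the following sketch. Everything specific in it is an assumption made for illustration: the attribute name `literal` with the values `yes` and `no` is hypothetical and not part of any HTML specification, the default set of elements is just the one observed above, and the names `LiteralScanner` and `scan` are invented. The scanner assumes that every non-void element has an explicit end tag.

```python
from html.parser import HTMLParser

DEFAULT_LITERAL = {"address", "cite", "samp"}          # assumed default set
VOID = {"br", "hr", "img", "input", "link", "meta"}    # elements without end tags

class LiteralScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = [False]   # literal state of each open element
        self.segments = []     # (text, is_literal) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        override = dict(attrs).get("literal")   # hypothetical attribute
        if override == "yes":
            state = True                 # explicitly marked as a literal
        elif override == "no":
            state = False                # explicitly marked as translatable
        elif tag in DEFAULT_LITERAL:
            state = True                 # invariant by default
        else:
            state = self.stack[-1]       # inherit from the enclosing element
        self.stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.segments.append((data, self.stack[-1]))

def scan(html):
    scanner = LiteralScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.segments

doc = ('<p>Ask <span literal="yes">John Birch</span> about '
       '<cite literal="no">the Bible</cite>.</p>')
for text, is_literal in scan(doc):
    print(repr(text), "keep" if is_literal else "translate")
```

Here the SPAN element is made translation-invariant although it is not in the default set, and the CITE element is opened up for translation although it is.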