section Automatic translation and HTML
Paradoxically, one of the most serious practical problems in translating Web documents automatically is how to prevent translation of various parts of the document.
Consider the section of this document with the example of a text in English and its translation into French. Quite obviously, if that section is to be translated (into French or into some other language), the example text in French should not be translated, especially not by applying to it algorithms and dictionaries for translating from English to some language! (However, that's what Babelfish currently does.)
To take a simpler and more common example, consider a text in English with a proper name "John Birch" in it. When translating to Italian, for example, how can we prevent a program from translating "Birch" as "Betulla" (using the Italian word for birch)? Someone might suggest heuristics based on the use of capital letters, but that would be rather ineffective - it would fail entirely when translating from German, for example, since in German all nouns are spelled with a capital initial.
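The weakness of the capital-letter heuristic can be made concrete with a minimal sketch. The function name `guess_proper_names` and both sample sentences are invented for this illustration (the German sentence uses a made-up name, "Johann Birke"); this is not how any particular translation program works.

```python
import re

def guess_proper_names(sentence: str) -> list[str]:
    """Naive heuristic: every capitalized word that is not the first
    word of the sentence is assumed to be a proper name."""
    words = re.findall(r"[^\W\d_]+", sentence)  # runs of letters only
    return [w for w in words[1:] if w[0].isupper()]

english = "Yesterday John Birch planted a birch in the garden."
german = "Gestern pflanzte Johann Birke eine Birke im Garten."

print(guess_proper_names(english))  # ['John', 'Birch']
print(guess_proper_names(german))   # ['Johann', 'Birke', 'Birke', 'Garten']
```

For the English sentence the guess happens to be right, but for the German one the common nouns "Birke" (the tree) and "Garten" are flagged as names too, so the heuristic would leave them untranslated.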
It seems obvious that some method of marking words as proper names is needed. That's not sufficient, however. There are other words, too, that should not be translated. Examples range from code-like things appearing in texts about computer languages (like the element name BODY in HTML or the keyword case in C) to sample words discussed in linguistic texts.
It is obvious that when translating a text which discusses the English language, sample English words (as in "the plural of ox is oxen") must not be translated.
It should be noted that one cannot deduce from the word itself, as a string in a text, whether it should be translated, no matter how large the glossaries we use. For example, the word "John" in a name like "John Birch" must remain as such, whereas "King John" must become "kuningas Juhana" in Finnish and "John the Baptist" must become "Jean-Baptiste" in French.
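A small sketch may show why a glossary alone cannot settle the matter. The glossary `GLOSSARY_EN_FI` and the function `translate_word_by_word` are hypothetical, with entries taken from the Finnish example above; a lookup keyed on the string has no way to see the context.

```python
# Hypothetical English-to-Finnish glossary, keyed on the lowercased word.
GLOSSARY_EN_FI = {"king": "kuningas", "john": "Juhana"}

def translate_word_by_word(text: str, glossary: dict[str, str]) -> str:
    """Replace each word found in the glossary; keep all other words as they are."""
    return " ".join(glossary.get(word.lower(), word) for word in text.split())

print(translate_word_by_word("King John", GLOSSARY_EN_FI))   # kuningas Juhana
print(translate_word_by_word("John Birch", GLOSSARY_EN_FI))  # Juhana Birch
```

The second result is wrong, since the name should have stayed "John Birch". Removing the entry for "John" would break the first case instead, so the decision has to come from something outside the word itself, such as markup.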
It seems that Babelfish regards the contents of the following HTML elements as something that should not be translated:
For all of these, one can present arguments in favor of
treating them as "literals" which are not to be translated.
On the other hand, counterarguments could be presented, and
at least the
CODE element would be an obvious candidate to be
But basically what is needed is a better official specification of the semantics of phrase markup elements in HTML. In the process of creating such specifications, the questions of translation should be explicitly discussed.
Discussion is needed to determine which is the best approach to preventing translations. Alternatives include:
LIT) for specifying that a piece of text is a literal which is to remain unchanged in translations. A set of CLASS attribute values (such as CLASS="person" for persons' names) might be introduced to specify the class of literals, mainly for style sheet purposes; they might have some relevance in translation, too.
The first alternative could hardly be the only solution. It would require additional methods both for specifying that other element instances are translation-invariant and for specifying that normally translation-invariant elements are to be translated.
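How a translating program might combine these mechanisms can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the set `INVARIANT_ELEMENTS`, the CLASS values "literal" and "translatable", and the class name `LiteralAwareTranslator` are invented, and LIT is only the hypothetical element mentioned above. The sketch uses Python's standard `html.parser` module and assumes properly nested markup without empty elements such as BR.

```python
from html import escape
from html.parser import HTMLParser

# Assumptions for illustration only, not part of any specification.
INVARIANT_ELEMENTS = {"lit", "code"}  # invariant by default
FORCE_LITERAL = "literal"             # marks any element instance as invariant
FORCE_TRANSLATE = "translatable"      # overrides a default-invariant element

class LiteralAwareTranslator(HTMLParser):
    def __init__(self, translate):
        super().__init__()
        self.translate = translate  # callable: str -> str
        self.out = []
        self.protected = []         # one flag per currently open element

    def _current(self):
        return self.protected[-1] if self.protected else False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if FORCE_TRANSLATE in classes:
            state = False
        elif FORCE_LITERAL in classes or tag in INVARIANT_ELEMENTS:
            state = True
        else:
            state = self._current()  # inherit from the enclosing element
        self.protected.append(state)
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.protected:
            self.protected.pop()
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data if self._current() else self.translate(data)
        self.out.append(escape(text, quote=False))

def translate_html(source, translate):
    parser = LiteralAwareTranslator(translate)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)

sample = ('<p>The keyword <code>case</code> was explained by '
          '<lit class="person">John Birch</lit>.</p>')
# str.upper stands in for a real translation function.
print(translate_html(sample, str.upper))
```

With `str.upper` as the stand-in, the running text comes out in capitals while "case" and "John Birch" are left untouched. The two CLASS values correspond to the two additional methods that the first alternative would require.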