Translation-friendly authoring,
especially in HTML for the WWW

People who write HTML documents for the WWW should become aware of the possibilities and problems opened by automatic translation services such as Babelfish. This is a rather new area, and most of the discussion here concerns changes needed both to the HTML language and to the translation software. However, practical tips are also given wherever "translation-friendly" techniques can be suggested for the present situation. Some of the suggestions are equally applicable to other forms of authoring and to human translation. A collaborative effort by authors of documents and implementors of translation techniques is needed, and designers and implementors of markup languages like HTML should get involved, too.

For concreteness, this presentation describes the suggested guidelines first, in order to give the reader an idea of the relative simplicity of the actions needed. Naturally, the guidelines have been derived from observations, reasoning and experiments presented in later sections.

Try translating this document using Babelfish, to get some idea of the possibilities and problems.

Jukka Korpela

Practical guidelines for authors

Guidelines on natural language usage

These guidelines apply to the textual content of documents, irrespective of the presence of HTML markup or some other markup. Mostly the guidelines apply to all forms of translation - human, automatic, or combined.

Guidelines on HTML markup

An example of linking to an entry in a special dictionary when using a word which may cause problems to translation programs (and even human translators or human readers of the original text):

There is a <A HREF="http://wagner.princeton.edu/foldoc/cgi-script?action=Search%3A&amp;query=workaround" TITLE="a description of the word 'workaround'">workaround</A> to this problem.

This looks like the following on your current browser:

There is a workaround to this problem.

Interestingly, when Babelfish is used, the link gets converted to a link through Babelfish, so the reader of the translated document, when following the link, will get a translated version of the dictionary entry! This is often, probably most often, very nice. But how can you write a link which does not get translated that way, for example a definite link to the original in English?
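The link conversion described above can be sketched as follows. This is only an illustration of the idea, not Babelfish's actual implementation: the gateway address `translator.example` and the function name `rewrite_links` are assumptions made up for the example, and only double-quoted absolute HREF values are handled.

```python
import html
import re
from urllib.parse import quote

# Hypothetical gateway address; this is NOT the real Babelfish URL scheme.
GATEWAY = "http://translator.example/translate?url="

# Matches HREF="http://..." or HREF="https://..." (double-quoted values only).
HREF_PATTERN = re.compile(r'(href\s*=\s*")(https?://[^"]*)(")', re.IGNORECASE)


def rewrite_links(markup: str) -> str:
    """Make every absolute link point to the gateway instead of the target."""

    def through_gateway(match: re.Match) -> str:
        # Undo HTML escaping (&amp; -> &) before encoding the target address.
        target = html.unescape(match.group(2))
        # Encode the whole target so it can travel as one query parameter.
        return match.group(1) + GATEWAY + quote(target, safe="") + match.group(3)

    return HREF_PATTERN.sub(through_gateway, markup)
```

A rewriting step of this kind treats all links alike, which is exactly why the author of the original has no obvious way to ask for a particular link to be left alone.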

Why automatic translation is realistic

Introduction: automatic translation within easy reach

In December 1997, Web users started asking each other: "Have you noticed the 'Translate' links on AltaVista search results?" The popular AltaVista search engine had started suggesting, in a not so prominent way, that users can get automatically generated translations of documents. When a user had sent a set of keywords to AltaVista, it returned a list of documents matching the keywords as previously, but now there were "Translate" links in the following style:

4. Writing for Translation
[URL: www.stc.org/region2/pit/www/bpencil/vol3...ep/translat.htm]
Writing for Translation. by Roz Treger and Nancy Ott. More and more companies are marketing their products globally. As technical communicators, we are...
Last modified 11-Sep-97 - page size 10K - in English [ Translate ]

That document, by the way, is related to our topic and is definitely worth reading.

By following the "Translate" link, the user would get a page containing a form like the following:

To translate, type plain text or the address (URL) of a Web page here:
Translate from:

The user could then request a translation into one of a set of languages, namely French, Italian, German, Portuguese, and Spanish. (For a document in one of these languages, one could request a translation into English.) In our example, a translation into German would begin as follows:

Schreiben für Übersetzung

durch Roz Treger und Nancy Ott

Immer mehr Firmen sind Marketing ihre Produkte global. Als technische Verbindungen produzieren wir Material, das von den Benutzern in vielen unterschiedlichen Ländern gelesen werden und in einige Sprachen übersetzt werden kann. Wir müssen dieses in Betracht ziehen, wenn wir Schreiben es sind. Übersetzung ist ein Re-Ausdruck von Ideen in einer anderen Sprache, nicht in einem eins-zu-eins Ersatz von Wörtern und in den Phrasen. Übersetzung-freundliches Schreiben stellt Ideen offenbar und durchweg dar und läßt sie einfacher, damit Übersetzer und non-native englische Lautsprecher sie verstehen. Soviel wie möglich, ist es kulturell Neutrales.

This is far from being good German, but anyone who knows German reasonably well understands what the document is about.

Notice that the translation is based on the document as written in the HTML markup language, which indicates the structure of the document, e.g. denoting parts of the text as headings. The translation preserves, or at least tries to preserve, the markup. (Babelfish can translate plain text, too, of course.)

There are many deficiencies in the translation. For example, the English phrase "are marketing" has been translated as if "marketing" were a noun! But the translation is far from being a naive word-by-word conversion. It constructs sentences according to German grammar, which often uses a word order quite different from that of English.

Some clarifications

First we should distinguish between automatic translation, simple word replacement and consulting a dictionary. Companies seem to market dictionary programs sometimes as "translators". A program which allows the user to get a dictionary entry for a word, with corresponding words, or "translations", in one's native language, can definitely be very useful; but it is definitely not a language translator. Similarly, a program which "translates" simply by replacing each word by its "equivalent" in another language should not be called a translator, although it can be useful for some purposes in special cases. What we mean by automatic translation here is a process which involves at least some minimal grammatical analysis of the source text and generation of corresponding text in the target language. The depth and quality of the process may vary a lot.

A large number of automatic translation programs are available. For a list of some of them, please refer to Information on Computer-Assisted Translation Software by the Oxford University Language Centre. A short document Do you have any information on automatic translation software? by LTG lists a few services especially for translating Web pages. See also Langenscheidts T1 Test Drive which demonstrates translation of German and Spanish (in plain text format) into English.

It is very easy to see that automatic translation does not (currently) work well for poetry, for example. It is a cheap amusement to use a tool for something it is not intended for and laugh at the result. Babelfish takes a good attitude on this; its Translation Tips urge people to try that, too:

Cheap Entertainment
Idioms and slang -- phrases like "the whole nine yards" or "what's up, Doc?" in American English -- are notoriously hard to translate well, particularly when the computer doesn't know the context of the phrase. Try a few for some good laughs.

Cheap Entertainment, Part 2
Remember the old game Gossip -- where one person whispers something to the next person, then the second to the third, and so on, then everyone has a good laugh about what comes out at the end? Try that with translations. Just start with one languages, then translate that to another, then another, then another, then back to the original.

After such good laughs, perhaps people are willing to consider what else the software can do, and how it performs in the areas it was designed for.

Examples

Translation from English to French

This example consists of a text of mine, an extract from How to use images in communication in general and on the Web in particular, and its translation into French by Babelfish:

English original, followed by its French translation:

"You can never use too many images"

A very large number of published documents contain text only. They often look boring, and they are often written in obscure language, using mile-long sentences and cryptic technical terms, using one font only, perhaps even without headings. Such style, or lack of style, might be the one you are strongly expected to follow when writing eg scientific or technical reports, legal documents, or administrative papers. It is natural to think that such documents would benefit from a few illustrative images. (However, just adding illustration might be rather useless, if the text remains obscure and unstructured.)

It is too easy to go to the other extreme when trying to avoid the boring plain text syndrome. This is especially true on the Web, where it is relatively easy technically to add illustration, for instance by picking images from various existing collections. Many people seem to think that you can't have too many images. If they can't find a suitable image, they use an unsuitable one.

When people say that one image tells more than a thousand words, they tend to overlook the fact that what the image says might be true or false, relevant or off-topic, useful or disturbing, constructive or tasteless. (I won't bother to refute the saying by pointing out that there are images which say nothing. However, I cannot resist the temptation to remark that oddly enough the saying itself is expressed using words.)

" vous pouvez ne jamais utiliser trop d'images "

Un nombre très grand de documents édités contiennent le texte seulement. Ils regardent souvent le sondage, et ils sont souvent écrits en langage obscur, en utilisant de mille-longues phrases et termes techniques cachés, en utilisant une fonte seulement, peut-être même sans entêtes. Un tel modèle, ou manque de modèle, pourrait être celui on s'attend à ce que fortement que vous suiviez en écrivant par exemple des états scientifiques ou techniques, les documents juridiques, ou les papiers administratifs. Il est normal de penser que de tels documents tireraient bénéfice de quelques images d'illustration. (cependant, l'illustration ajoutante juste pourrait être plutôt inutile, si le texte demeure obscur et non structuré.)

Il est trop facile d'aller à l'autre extrémité en essayant d'éviter le syndrome ordinaire ennuyeux des textes. C'est particulièrement vrai sur le Web, où il est relativement facile techniquement d'ajouter l'illustration, par exemple en sélectionnant des images de diverses collections existantes. Beaucoup de gens semblent penser que vous ne pouvez pas avoir trop d'images. S' ils ne peuvent pas trouver une image appropriée, ils utilisent peu convenable.

Quand les gens disent qu' une image indique plus que mille mots, ils tendent à donner sur le fait que ce que l'image indique pourrait être vrai ou faux, approprié ou hors fonction-sujet, utile ou dérangeant, constructif ou insipide. (je ne prendrai pas la peine de réfuter l'énonciation en précisant qu'il y a des images qui n'indiquent rien. Cependant, je ne puis pas résister à la tentation de remarquer qu'assez curieusement l'énonciation elle-même est exprimée en utilisant des mots.)

The result certainly isn't good French--it is easy to see that there are errors even in the use of capital letters to begin a sentence--but I suppose it conveys the basic message in the original. Perhaps it should be mentioned that I wrote the text before I started thinking about the use of automatic translation for Web pages, and thus it has not been "tuned".

Translation from Finnish to English

This example is a short translation (of a fragment of an article on machine translation) from Finnish to English by TranSmart, a demo of software by Kielikone.

Finnish original:

Kääntäminen lisääntyy eri syistä jatkuvasti. Teknologia tuottaa yhä mutkikkaampia laitteita ja yhä laajempia asennus- ja käyttöoppaita. Vientituotteiden ohjeet pitää kääntää asiakkaiden kielille. Samaten patenttihakemusten ja tieteellisten artikkelien määrä kasvaa.

English translation:

The translation increases continuously for different reasons. The technology produces increasingly complicated devices and the increasingly wide installation guides and use guides. The instructions of the export products must be translated into the customers' languages. The number likewise of the patent applications and of the scientific articles increases.

The translation given by TranSmart contains some text in italics, indicating that the italicized text is a translation for a compound word in the original. (Often this means that the translation is not idiomatically correct but usually helps in getting the idea.)

Obviously, the translation is far from being idiomatically and stylistically perfect. Yet, one can understand the content pretty well.

The incorrect use of articles in the translation is mainly caused by the fact that the Finnish language lacks both definite and indefinite articles.

In which phase could a WWW document be translated?

For HTML documents on the WWW, automatic translation can be requested at several alternative phases:

It would be interesting to discuss the pros and cons of each method, but here we will only mention that fully automatic translation will in practice be needed for transient documents which are generated dynamically instead of being static files.

Naturally, these methods could be combined. For example, a browser might request (on the basis of language preferences set by the user) a document in a specific language, and the server could check whether it actually has a suitable translation; if not, it could check whether it can find a program for generating a translation; and if this fails, it could send the document in a language lower in the user's preferences, in which case the browser could check whether it can translate the document. (Some care might be needed to prevent the situation where a server sends an automatically generated translation which needs to be translated again to the user's preferred language, instead of sending the original.)
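The server side of such a combined scheme could look roughly like the sketch below. All names (`choose_response`, `stored`, `translators`) are invented for the illustration and do not describe any existing server. The sketch assumes that `stored` holds only the original and human-made versions, so a machine translation is never sent as the fallback that would have to be translated again.

```python
from typing import Callable


def choose_response(
    preferences: list[str],
    stored: dict[str, str],
    original_lang: str,
    translators: dict[tuple[str, str], Callable[[str], str]],
) -> tuple[str, str, bool]:
    """Return (language, text, machine_translated) for one request.

    preferences   -- the user's languages, most preferred first
    stored        -- versions kept on the server (original and human-made)
    original_lang -- language of the original document (must be in stored)
    translators   -- translation programs, keyed by (source, target) language
    """
    top = preferences[0]

    # 1. Does the server actually have a version in the requested language?
    if top in stored:
        return top, stored[top], False

    # 2. If not, can it find a program for generating a translation?
    translator = translators.get((original_lang, top))
    if translator is not None:
        return top, translator(stored[original_lang]), True

    # 3. Otherwise send a stored version in a language lower in the preferences.
    for lang in preferences[1:]:
        if lang in stored:
            return lang, stored[lang], False

    # 4. Last resort: the original, which the browser may try to translate.
    return original_lang, stored[original_lang], False
```

For instance, with an English original, a stored Finnish version and an English-to-German program, a user preferring German and then Finnish would get a machine translation into German, while a user preferring French and then Finnish would get the stored Finnish text.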

What could automatic translation be good for?

Even a coarse and erroneous translation can give an idea of what the text is about. This is crucial on the Web where one can easily access a huge number of documents in various languages and needs to sort out what's relevant. If one finds a document in a strange language, automatic translation helps in deciding whether it is worth a closer look. If it looks really interesting, one could perhaps afford a translation made by a human translator.

A relatively good automatic translation could serve as the basis for a human translator. Admittedly, there is a risk that the human translator produces an unnecessarily low-quality translation that way, since automatic translation tends to reflect the structure of the source language too much. In any case, automatic translation is currently used that way, and it can save a lot of time e.g. by saving the human translator from the boring work of translating simple sentences, doing dictionary lookup, etc.

Automatic translation can be useful alongside the original when the reader knows the source language to some extent but not fluently. Depending on how well he knows it, he could use either the original or the translation as the primary text, consulting the other one when problems are encountered.

An author who speaks several languages could use automatic translation of his documents as an extra check for clarity and grammatical correctness. For example, having written a document in English I could request an automatic translation into German, then read through the translation. Translation errors may well indicate problems in the original text: typographic errors not detected by spelling checkers (for instance an error which happens to produce another word of the language), or grammatical structures that are too complicated or ambiguous. If a translation program cannot correctly handle a piece of text, this might result from features which also prevent a human reader from understanding it or make him understand it incorrectly, especially if the language used is not his native language.

The question remains whether "standalone" automatic translation can be feasible. That is, could one use a fully automatically produced translation as the only form in which a document is accessed? The answer is that it depends on the nature of the text and on the translation program. Currently Babelfish is already used to some extent that way in Usenet discussions, using grammatically simple language.

As the translation programs improve, they can be used for more complicated texts. Naturally, the question remains whether one can rely on an automatically generated translation. One might answer with another question: Don't humans err? In fact, in many details automatic translation can be more reliable, since computer programs do boring work more conscientiously than people do.

If documents are translated by human translators, each translation needs to be updated whenever the original is changed. This means a lot of boring work - some of which might be automated with suitable tools - especially for documents which are updated very frequently. And a large number of Web pages are generated dynamically, on the basis of a user request and e.g. a search from a database. A simple query report might contain grammatically very simple language and be easily translatable by computer.

For the majority of all uses of all documents on the Web, automatic translation is the only feasible way of access for anyone who does not know well the language in which the document was written. If a document exists in Portuguese only and you don't know Portuguese, you either utilize automatic translation or you can't read the document at all, except in rare cases where you can afford to order a human-made translation or find someone who does the job for you for free.

Automatic translation and HTML

Introduction

This section discusses those problems and solutions in automatic translation which are specific to the situation where the source document is in HTML format. Neither this section nor this document as a whole discusses the general problems of automatic translation. For such issues, please refer to the extensive directory The ACL NLP/CL Universe.

Automatic translation can be significantly improved if authors and their tools are involved. This means that the problems of translation are taken into account by the author or his assistants when the original work is created.

For existing documents, some relatively simple modifications may greatly improve translatability. This will be demonstrated next by an example.

Example: modifying a simple document for translatability

As an example of how modifications to a document can improve translatability, I have taken a short page which tells some numeric and other facts about the university where I work, HUT. Such simple fact pages could be expected to be relatively easily translatable, since they do not contain grammatically complex structures. Moreover, automatic translatability is essential since such pages can be interesting to people speaking different languages, and one hardly wants to allocate resources to maintaining such pages in many languages by hand.

Note: The example document and its modified form and their translations are not embedded into this document. Instead, links to them are provided. In a typical graphical browser, such as Internet Explorer or Netscape, on Windows for example, you can use the rightmost button of the mouse when following a link (instead of the normal use of the leftmost button), then select the alternative Open in New Window in the menu that opens. You can then move the window to another position on the screen and resize it suitably, e.g. so that you can view different versions side by side.

The original page is a short fact sheet, Helsinki University of Technology in a Nutshell. In its French translation by Babelfish, there are several obvious failings (most of which you probably notice even if you don't know French):

In other translations, there are similar failings but also some different problems. For example:

In order to solve some of the problems detected, I constructed an experimental modified page by applying the methods described in the first section (guidelines on natural language usage and guidelines on HTML markup). Its French translation (by Babelfish) is considerably better than that of the original. The remaining flaws (such as "professeurs d'associé" instead of "professeurs associés") are probably things that can be fixed only by improving the translation program.

Notes on the changes:

You may wish to compare the presentation of the modified document (in English) on your browser with a screenshot of what it looks like in one browsing situation, viewed on Internet Explorer 4.0 with stylesheet support on. (It isn't quite what it should be, due to deficiencies in stylesheet support.)

You may wish to look at the other translations of the modified document:
German translation, Italian translation, Portuguese translation, Spanish translation

The Portuguese translation is the most problematic. In addition to the "nutshell" problem mentioned above, the change of the English spelling "vicerector" to "vice-rector" caused a new problem: it's now translated as "vice-vice-rector"!

Logical markup and translation

To be written...

Multilingualization ("internationalization") of HTML

To be written... Need to consider the different roles of the LANG attribute for example.

How to prevent translation

Paradoxically, one of the most serious practical problems in translating Web documents automatically is how to prevent translation of various parts of the document.

Consider the section of this document with the example of a text in English and its translation into French. Quite obviously, if that section is to be translated (into French or into some other language), the example text in French should not be translated, especially not by applying to it algorithms and dictionaries for translating from English to some language! (However, that's what Babelfish currently does.)

To take a simpler and more common example, consider a text in English with a proper name "John Birch" in it. When translating to Italian, for example, how can we prevent a program from translating "Birch" as "Betulla" (using the Italian word for birch)? Someone might suggest heuristics based on the use of capital letters, but that would be rather ineffective - it would fail entirely when translating from German, for example, since in German all nouns are spelled with a capital initial.
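Why the capital-letter heuristic is ineffective can be seen from a small sketch. The function name and the two sample sentences are made up for this illustration. The heuristic flags every capitalized word which does not begin the sentence.

```python
def flag_proper_names(sentence: str) -> list[str]:
    """Naive heuristic: a capitalized word not starting the sentence is a name."""
    words = sentence.split()
    return [word for word in words[1:] if word[:1].isupper()]


# English: the heuristic finds the name and leaves the common noun alone.
english = "Yesterday John Birch planted a birch"

# German: every noun is capitalized, so common nouns get flagged as names.
german = "Gestern pflanzte der Mann eine Birke"

print(flag_proper_names(english))  # ['John', 'Birch']
print(flag_proper_names(german))   # ['Mann', 'Birke']
```

In the German sentence, "Mann" and "Birke" are ordinary nouns which should be translated, yet the heuristic would protect them as if they were names.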

It seems obvious that some method of marking words as proper names is needed. That's not sufficient, however. There are other words too which shall not be translated. Examples range from code-like things appearing in texts about computer languages (like the element name BODY in HTML or the keyword case in C) to linguistic texts speaking about words. It is obvious that when translating a text which discusses the English language, sample English words (like in "the plural of ox is oxen") must not be translated.

It should be noted that one cannot deduce from the word itself, as a string in a text, whether it should be translated, no matter how large a glossary we use. For example, the word "John" in a name like "John Birch" must remain as such, whereas "King John" must become "kuningas Juhana" in Finnish and "John the Baptist" must become "Jean-Baptiste" in French.

It seems that Babelfish regards the contents of the following HTML elements as something that shall not be translated: ADDRESS, CITE, and SAMP. For all of these, one can present arguments in favor of treating them as "literals" which are not to be translated. On the other hand, counterarguments could be presented, and at least the CODE element would be an obvious candidate to be added.
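The observed behaviour can be imitated with a short sketch which splits a document into text segments and marks those inside ADDRESS, CITE, or SAMP as not to be translated. This is an assumption about how such a rule might be implemented, not a description of Babelfish's internals; the class and function names are made up, and the sketch relies on every such element having an explicit end tag.

```python
from html.parser import HTMLParser

# Elements whose content Babelfish seems to leave untranslated.
LITERAL_ELEMENTS = {"address", "cite", "samp"}


class SegmentCollector(HTMLParser):
    """Collect (text, translatable) pairs from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.literal_depth = 0  # number of open literal elements
        self.segments: list[tuple[str, bool]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in LITERAL_ELEMENTS:
            self.literal_depth += 1

    def handle_endtag(self, tag):
        if tag in LITERAL_ELEMENTS and self.literal_depth > 0:
            self.literal_depth -= 1

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.segments.append((data, self.literal_depth == 0))


def split_segments(markup: str) -> list[tuple[str, bool]]:
    collector = SegmentCollector()
    collector.feed(markup)
    collector.close()
    return collector.segments


print(split_segments("<P>See <CITE>Writing for Translation</CITE> for details.</P>"))
# [('See ', True), ('Writing for Translation', False), (' for details.', True)]
```

Adding CODE to the set would be a one-word change, which shows how much depends on an agreed list of such elements.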

But basically what is needed is a better official specification of the semantics of phrase markup elements in HTML. In the process of creating such specifications, the questions of translation should be explicitly discussed.

Discussion is needed to determine which is the best approach to preventing translations. Alternatives include:

  1. Specifying which HTML elements are to remain invariant in translations, at least by default.
  2. Introducing a phrase element (which might be called LIT) for specifying that a piece of text is a literal which is to remain unchanged in translations. A set of CLASS attribute values (such as CLASS="person" for person's names) might be introduced to specify the class of literals, mainly for style sheet purposes; they might have some relevance in translation, too.
  3. Introducing an attribute for specifying that the content of the element is a literal which is to remain unchanged in translations.

The first alternative could hardly be the only solution. It would require additional methods both for specifying that other element instances are translation-invariant and for specifying that normally translation-invariant elements are to be translated.
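How the three alternatives might work together is sketched below. The LIT element is the one suggested above; the attribute name LITERAL with the values "yes" and "no" is purely an assumption invented for this example, as are the class and function names. The sketch also assumes that every element has an explicit end tag.

```python
from html.parser import HTMLParser

# Alternative 1: elements which are literals by default.
# Alternative 2: the suggested LIT element is simply one more of them.
DEFAULT_LITERAL = {"address", "cite", "samp", "lit"}


class LiteralAwareCollector(HTMLParser):
    """Collect (text, translatable) pairs, honouring defaults and overrides."""

    def __init__(self) -> None:
        super().__init__()
        self.stack = [False]  # literal state per open element; top level is translatable
        self.segments: list[tuple[str, bool]] = []

    def handle_starttag(self, tag, attrs):
        # Alternative 3: a hypothetical LITERAL attribute overrides the default.
        override = dict(attrs).get("literal")
        if override == "yes":
            literal = True
        elif override == "no":
            literal = False
        elif tag in DEFAULT_LITERAL:
            literal = True
        else:
            literal = self.stack[-1]  # inherit from the enclosing element
        self.stack.append(literal)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.segments.append((data, not self.stack[-1]))


def split_segments(markup: str) -> list[tuple[str, bool]]:
    collector = LiteralAwareCollector()
    collector.feed(markup)
    collector.close()
    return collector.segments
```

With this sketch, the text inside `<LIT>ox</LIT>` and `<SPAN LITERAL="yes">John Birch</SPAN>` would be left alone, whereas the content of `<CITE LITERAL="no">...</CITE>` would be translated although CITE is a literal by default.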

Proposed improvements to HTML and translation techniques

Proposed improvements to translation techniques

The following list indicates some deficiencies and problems in Babelfish noted by me when using it. The list is by no means exhaustive and not even systematic.

To be continued...

Proposed improvements to the HTML language

This is a very preliminary "wish list". Most probably some of the problems discussed here should be solved by introducing a more general construct than the one proposed here, or solved outside HTML, e.g. by improving the translation software. Sorry, you probably don't understand very much of this unless you know the HTML language rather well.

To be continued...


Jukka Korpela, jkorpela@malibutelecom.com