Translation-friendly authoring,
especially in HTML for the WWW

People who write HTML documents for the WWW should become aware of the possibilities and problems opened by automatic translation services such as Babelfish. This is a rather new area, and most of the discussion here concerns changes needed both to the HTML language and to the translation software. However, practical tips are also given on topics where "translation-friendly" techniques can be suggested in the present situation. Some of the suggestions are equally applicable to other forms of authoring and to human translation. A collaborative effort by authors of documents and implementors of translation techniques is needed, and designers and implementors of markup languages like HTML should get involved, too.

For concreteness, this presentation describes the suggested guidelines first, in order to give the reader an idea of the relative simplicity of the actions needed. Naturally, the guidelines have been derived from observations, reasoning and experiments presented in later sections.

Try translating this document using Babelfish to get some idea of the possibilities and problems.

Jukka Korpela

Practical guidelines for authors

Guidelines on natural language usage

These guidelines apply to the textual content of documents, irrespective of the presence of HTML markup or some other markup. Mostly the guidelines apply to all forms of translation - human, automatic, or combined.

Make your material available, at least as one option, as a set of small pages, each consisting of a logical unit such as a section or subsection or perhaps a large table. (As a concrete practical point, Babelfish says, in its help file, that it translates "a maximum of 5k of text in an html page".) When using HTML, the pages should of course be interlinked. "Small" means, speaking very roughly, at most two pages when printed on paper with typical settings. This is useful mainly for practical reasons, such as restrictions in freely accessible translation services and evaluation versions of programs, and also for efficiency reasons: it is faster to translate a short text separately than as part of a large text, of course. (In principle, translating as part of a large text may produce better quality, since the translation program can make use of the context.)
Use normal language, avoiding idiomatic expressions, dialect and slang words, and technical terms outside their normal scope. Figurative expressions are risky, unless the metaphor is widespread among languages. Fixed idiomatic expressions as such pose no fundamental challenge to translation software; it is quite easy to make a program check a large list of fixed phrases and use predefined translations for them, instead of translating "word by word". But current translation programs are not very good at idioms, and even in the long run idioms will cause problems in cases where it is context-dependent whether a phrase is to be interpreted literally or idiomatically. Naturally, you should not impoverish your language to make your documents more suitable to simple translators, just to notice somewhat later that newer software could handle richer language better. The point here is that you should think about your language and abstain from using idiomatic and figurative language needlessly, keeping in mind both human readers whose native language might be different from yours and automatic translators which are unable to read anything between the lines.
In particular, say things directly instead of using hidden humour, sarcasm, or implicit language. On the Internet, Wiio's law "if a message can be understood in different ways, it will be understood in just that way which does the most harm" applies particularly widely. Attempts to be sarcastic will fail even more often when read through automatic translation.
Write simple sentences which are reasonably short. The longer and the more complex the sentence, the more probable it is that an automatic translator (or a human reader!) parses it wrongly.
Write words and phrases in full form, avoiding abbreviations, except very common ones like "etc". (If you really need to use abbreviations in HTML authoring, you may consider using the ABBR element to specify an expansion of an abbreviation.)
Prefer words with specific meaning to words which have a large set of different meanings. For example, instead of using a word like "issue" in the meaning 'subject of a discourse', consider using "topic", since "issue" has several meanings and a translation program would have great difficulties in selecting a correct one.
Formulate sentences so that ambiguous words have suitable local context to give a clue to translation software (and people). For example, the English word "type" has quite many meanings, as a verb or as a noun, and this might confuse a translator. If your intention is to give an instruction to type something, please begin it with something like "please type".
Keep phrases together if feasible. For example, don't use a list header like "The goal of the university is to:" but instead attach the word "to" to each verb in the list, making it clear (even to an intellectually challenged translator program) that it is an infinitive form.
As an interim solution, prefer spellings like "DejaNews" and "AltaVista" to "Deja News" and "Alta Vista", since a simple translator might well translate the latter alternatives word by word (producing e.g. "Nouvelles De Deja"!).
Use spelling checkers. A spelling error is very often corrected (perhaps unconsciously) by a human reader but may cause serious trouble to translation programs (as well as indexers and other software). For HTML documents in English, you could use e.g. a simple online spelling checker named WebSter's Dictionary.
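
Two of the guidelines above, writing abbreviations with their expansions and keeping phrases together, can be illustrated with a small HTML fragment. This is only a sketch: the list items are invented for illustration, and it is an assumption that a translation program would benefit from the second form or make any use of the ABBR element.

```html
<!-- Harder to translate: "to" is separated from the verbs it belongs to -->
<P>The goal of the university is to:</P>
<UL>
  <LI>educate engineers</LI>
  <LI>promote research</LI>
</UL>

<!-- Easier to translate: each item is a complete infinitive phrase -->
<P>The goals of the university are the following:</P>
<UL>
  <LI>to educate engineers</LI>
  <LI>to promote research</LI>
</UL>

<!-- An abbreviation with its expansion given in the TITLE attribute -->
<P>This text was written for the <ABBR TITLE="World Wide Web">WWW</ABBR>.</P>
```

In the second list, each item can be translated on its own without knowledge of the introductory phrase.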

Guidelines on HTML markup

Validate and check your documents using suitable software. Errors in HTML markup, even if they cause no visible problems when the document is viewed on some browser, may cause unpredictable results in automatic translation, especially in the future when programs utilize the structural information conveyed by HTML markup.
Use logical markup instead of abusing HTML elements to achieve a desired physical effect. For example, if you use the H1 element just to have a paragraph presented in very large font, a translation program may assume it is a 1st level heading (since that's what H1 really means); this might imply that when translating from French to English, all words except a few small words will have a capital initial, according to common usage in English headings!
Specify the LANG attribute at least for the entire document and for major parts (such as block quotations) within it if they are in another language. (This currently seems to have no effect on Babelfish, but in the long run such markup is crucial for good translation.)
When you need to use a word or phrase which is probably not found in normal dictionaries, provide a link to a definition. This probably won't help the translation process, but it may help people who read your text either translated or as such. For example, for computer-related jargon you can often find suitable definitions in some of the Internet glossaries. See example below.
Use suitable markup to designate expressions which should not be translated despite looking like words in the source language. There is currently no good markup for this, but as an interim solution to this problem for Babelfish, you could use the SAMP element. For example, to prevent Babelfish from translating the abbreviation "HUT" into something that means a hut in the target language, you could write <SAMP>HUT</SAMP>. A drawback is that the text will then appear in monospaced font on most browsers, but by using style sheets you can suggest that it be rendered in a normal font.
For textual information, use normal text instead of images with text embedded into them. This applies to navigational "panels", too.
Use tables (TABLE elements) instead of preformatted text (PRE element) for tabular material. If you "line up" things using preformatted text, the lining up is almost certainly lost in translation. (In fact, Babelfish seems to screw up preformatted blocks rather badly.)
When setting up a link, try to make sure that the link text is a phrase that can be reasonably translated as a single entity. The same applies to text level markup such as emphasis. For example, if you emphasize a preposition (as in it is not <EM>on</EM> the table but <EM>under</EM> it) there will be difficulties when translating to a language where suffixes are used for things expressed by prepositions in English. (A human translator might be able to find a suitable circumlocution.)
Use entities for characters outside the ASCII repertoire, e.g. &ouml; instead of ö (o dieresis, or o umlaut), especially for characters which do not belong to the normal alphabet of the main language of the document. (This can be rather inconvenient, but it may help to circumvent a bug in Babelfish.)
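
Several of the markup guidelines above could be combined as in the following sketch. The page content, the class name notranslate and the style rule are assumptions made for illustration. The effect of SAMP is the one reported above for Babelfish; other translation programs may behave differently.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<!-- LANG on the root element declares the main language of the document -->
<HTML LANG="en">
<HEAD>
<TITLE>A hypothetical example page</TITLE>
<STYLE TYPE="text/css">
/* Suggest the surrounding font for SAMP elements that are used
   only to protect an expression from translation */
SAMP.notranslate { font-family: inherit; }
</STYLE>
</HEAD>
<BODY>
<!-- H1 is used because this really is the first-level heading -->
<H1>Facts about the university</H1>

<!-- Interim technique: SAMP around an expression that must stay as it is -->
<P>The abbreviation <SAMP CLASS="notranslate">HUT</SAMP> is used
in this document.</P>

<!-- A block quotation in another language carries its own LANG attribute -->
<BLOCKQUOTE LANG="fr">
<P>Une image vaut mille mots.</P>
</BLOCKQUOTE>

<!-- A character outside ASCII written as an entity (o with dieresis) -->
<P>The name K&ouml;ln contains a letter outside the English alphabet.</P>
</BODY>
</HTML>
```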

An example of linking to an entry in a special dictionary when using a word which may cause problems to translation programs (and even human translators or human readers of the original text):

There is a <A HREF= "http://wagner.princeton.edu/foldoc/cgi-script?action=Search%3A&query=workaround" TITLE="a description of the word 'workaround'">workaround</A> to this problem.

This looks like the following on your current browser:

There is a workaround to this problem.

Interestingly, using Babelfish, the link gets converted to a link through Babelfish, so the reader of the translated document, when following the link, will get a translated version of the dictionary entry! This is often - probably most often - very nice, but how can you write a link which does not get translated that way, for example a definite link to the original in English?

Why automatic translation is realistic

Introduction: automatic translation within easy reach

In December 1997, Web users started asking each other: "Have you noticed the 'Translate' links on AltaVista search results?" The popular AltaVista search engine had started suggesting, in a not so prominent way, that users can get automatically generated translations of documents. When a user had sent a set of keywords to AltaVista, it returned a list of documents matching the keywords as previously, but now there were "Translate" links in the following style:

4. Writing for Translation
[URL: www.stc.org/region2/pit/www/bpencil/vol3...ep/translat.htm]
Writing for Translation. by Roz Treger and Nancy Ott. More and more companies are marketing their products globally. As technical communicators, we are...
Last modified 11-Sep-97 - page size 10K - in English [ Translate ]

That document, by the way, is related to our topic and is definitely worth reading.

By following the "Translate" link, the user would get a page containing a form like the following:

To translate, type plain text or the address (URL) of a Web page here:

http://www.stc.org/region2/pit/www/bpencil/vol34/01_sep/translat.htm

Translate from:

The user could then request a translation into one of a set of languages, namely French, Italian, German, Portuguese, and Spanish. (For a document in one of these languages, one could request a translation into English.) In our example, a translation into German would begin as follows:

Schreiben für Übersetzung

durch Roz Treger und Nancy Ott

Immer mehr Firmen sind Marketing ihre Produkte global. Als technische Verbindungen produzieren wir Material, das von den Benutzern in vielen unterschiedlichen Ländern gelesen werden und in einige Sprachen übersetzt werden kann. Wir müssen dieses in Betracht ziehen, wenn wir Schreiben es sind. Übersetzung ist ein Re-Ausdruck von Ideen in einer anderen Sprache, nicht in einem eins-zu-eins Ersatz von Wörtern und in den Phrasen. Übersetzung-freundliches Schreiben stellt Ideen offenbar und durchweg dar und läßt sie einfacher, damit Übersetzer und non-native englische Lautsprecher sie verstehen. Soviel wie möglich, ist es kulturell Neutrales.

This is far from being good German, but anyone who knows German reasonably well understands what the document is about.

Notice that the translation is based on the document as written in the HTML markup language, which indicates the structure of the document, e.g. denoting parts of the text as headings. The translation preserves, or at least tries to preserve, the markup. (Babelfish can translate plain text, too, of course.)

There are many deficiencies in the translation. For example, the English phrase "are marketing" has been translated as if "marketing" were a noun! But the translation is far from being a naive word by word conversion. It constructs sentences according to German grammar, which e.g. often uses a word order quite different from English.

Some clarifications

First we should distinguish between automatic translation, simple word replacement and consulting a dictionary. Companies seem to market dictionary programs sometimes as "translators". A program which allows the user to get a dictionary entry for a word, with corresponding words, or "translations", in one's native language, can definitely be very useful; but it is definitely not a language translator. Similarly, a program which "translates" simply by replacing each word by its "equivalent" in another language should not be called a translator, although it can be useful for some purposes in special cases. What we mean by automatic translation here is a process which involves at least some minimal grammatical analysis of the source text and generation of corresponding text in the target language. The depth and quality of the process may vary a lot.

A large number of automatic translation programs are available. For a list of some of them, please refer to Information on Computer-Assisted Translation Software by the Oxford University Language Centre. A short document Do you have any information on automatic translation software? by LTG lists a few services especially for translating Web pages. See also Langenscheidts T1 Test Drive which demonstrates translation of German and Spanish (in plain text format) into English.

It is very easy to see that automatic translation does not (currently) work well for poetry, for example. It is a cheap amusement to use a tool for something it is not intended for and laugh at the result. Babelfish takes a good attitude on this; its Translation Tips urge people to try that, too:

Cheap Entertainment
Idioms and slang -- phrases like "the whole nine yards" or "what's up, Doc?" in American English -- are notoriously hard to translate well, particularly when the computer doesn't know the context of the phrase. Try a few for some good laughs.
Cheap Entertainment, Part 2
Remember the old game Gossip -- where one person whispers something to the next person, then the second to the third, and so on, then everyone has a good laugh about what comes out at the end? Try that with translations. Just start with one languages, then translate that to another, then another, then another, then back to the original.

After such good laughs, perhaps people are willing to consider what else the software can do, and how it performs in the areas it was designed for.

Examples

Translation from English to French

This example consists of a text of mine, an extract from How to use images in communication in general and on the Web in particular, and its translation into French by Babelfish:

English original, followed by its French translation:

"You can never use too many images"

A very large number of published documents contain text only. They often look boring, and they are often written in obscure language, using mile-long sentences and cryptic technical terms, using one font only, perhaps even without headings. Such style, or lack of style, might be the one you are strongly expected to follow when writing eg scientific or technical reports, legal documents, or administrative papers. It is natural to think that such documents would benefit from a few illustrative images. (However, just adding illustration might be rather useless, if the text remains obscure and unstructured.)

It is too easy to go to the other extreme when trying to avoid the boring plain text syndrome. This is especially true on the Web, where it is relatively easy technically to add illustration, for instance by picking images from various existing collections. Many people seem to think that you can't have too many images. If they can't find a suitable image, they use an unsuitable one.

When people say that one image tells more than a thousand words, they tend to overlook the fact that what the image says might be true or false, relevant or off-topic, useful or disturbing, constructive or tasteless. (I won't bother to refute the saying by pointing out that there are images which say nothing. However, I cannot resist the temptation to remark that oddly enough the saying itself is expressed using words.)

" vous pouvez ne jamais utiliser trop d'images "

Un nombre très grand de documents édités contiennent le texte seulement. Ils regardent souvent le sondage, et ils sont souvent écrits en langage obscur, en utilisant de mille-longues phrases et termes techniques cachés, en utilisant une fonte seulement, peut-être même sans entêtes. Un tel modèle, ou manque de modèle, pourrait être celui on s'attend à ce que fortement que vous suiviez en écrivant par exemple des états scientifiques ou techniques, les documents juridiques, ou les papiers administratifs. Il est normal de penser que de tels documents tireraient bénéfice de quelques images d'illustration. (cependant, l'illustration ajoutante juste pourrait être plutôt inutile, si le texte demeure obscur et non structuré.)

Il est trop facile d'aller à l'autre extrémité en essayant d'éviter le syndrome ordinaire ennuyeux des textes. C'est particulièrement vrai sur le Web, où il est relativement facile techniquement d'ajouter l'illustration, par exemple en sélectionnant des images de diverses collections existantes. Beaucoup de gens semblent penser que vous ne pouvez pas avoir trop d'images. S' ils ne peuvent pas trouver une image appropriée, ils utilisent peu convenable.

Quand les gens disent qu' une image indique plus que mille mots, ils tendent à donner sur le fait que ce que l'image indique pourrait être vrai ou faux, approprié ou hors fonction-sujet, utile ou dérangeant, constructif ou insipide. (je ne prendrai pas la peine de réfuter l'énonciation en précisant qu'il y a des images qui n'indiquent rien. Cependant, je ne puis pas résister à la tentation de remarquer qu'assez curieusement l'énonciation elle-même est exprimée en utilisant des mots.)

The result certainly isn't good French--it is easy to see that there are errors even in the use of capital letters to begin a sentence--but I suppose it conveys the basic message in the original. Perhaps it should be mentioned that I wrote the text before I started thinking about the use of automatic translation for Web pages, and thus it has not been "tuned".

Translation from Finnish to English

This example is a short translation (of a fragment of an article on machine translation) from Finnish to English by TranSmart, a demo of software by Kielikone.

Finnish original	English translation
Kääntäminen lisääntyy eri syistä jatkuvasti. Teknologia tuottaa yhä mutkikkaampia laitteita ja yhä laajempia asennus- ja käyttöoppaita. Vientituotteiden ohjeet pitää kääntää asiakkaiden kielille. Samaten patenttihakemusten ja tieteellisten artikkelien määrä kasvaa.	The translation increases continuously for different reasons. The technology produces increasingly complicated devices and the increasingly wide installation guides and use guides. The instructions of the export products must be translated into the customers' languages. The number likewise of the patent applications and of the scientific articles increases.

The translation given by TranSmart contains some text in italics, indicating that the italicized text is a translation for a compound word in the original. (Often this means that the translation is not idiomatically correct but usually helps in getting the idea.)

Obviously, the translation is far from being idiomatically and stylistically perfect. Yet, one can understand the content pretty well.

The incorrect use of articles in the translation is mainly caused by the fact that the Finnish language lacks both definite and indefinite articles.

In which phase could a WWW document be translated?

For HTML documents on the WWW, automatic translation can be requested in several alternative phases:

By the author (or a person assisting the author) in order to produce a translated document which will then be put onto the WWW as a separate document, perhaps after it has been checked by a human translator. The author might link the versions of his document in different languages together, using normal HTML links or using the LINK element. It is also possible, in principle at least, to organize things so that all the versions are accessible using a single address (URL); the preferred version would be picked up according to the so-called language negotiation mechanism.
By the server when the document is requested. That is, after an HTTP request for a document, the WWW server could determine that the fulfillment of the request requires translation, and it would invoke an automatic translation program. The server could cache the translated document so that future requests might be answered without a new translation process.
By the browser after receiving the document, e.g. using techniques like those of Globalink. In the future, a browser might invoke a translator automatically, then just display the result so that the naive user doesn't even realize that a translation was performed! More probably, and more preferably, the browser could indicate that it is a translation and provide a simple method for viewing the original, perhaps alongside the translation.
By the user, who could either ask some translation program on his computer to do the task or send a request, over a network, to an online service which then returns a translated document. The latter is how Babelfish works.
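
For the first alternative, the language versions might be tied together as in the following sketch, which uses both the LINK element and normal links. The file names are hypothetical. Support for LINK elements varies between browsers, so the visible links in the body should not be omitted.

```html
<HEAD>
<TITLE>Translation-friendly authoring</TITLE>
<!-- Hypothetical addresses of the other language versions.
     HREFLANG names the language of the linked document;
     LANG names the language of the TITLE text. -->
<LINK REL="alternate" HREFLANG="fr" LANG="fr"
      TITLE="Version fran&ccedil;aise" HREF="authoring.fr.html">
<LINK REL="alternate" HREFLANG="de" LANG="de"
      TITLE="Deutsche Version" HREF="authoring.de.html">
</HEAD>
<BODY>
<!-- The same relationships as normal links that every browser shows -->
<P>This document is also available
<A HREF="authoring.fr.html" HREFLANG="fr" LANG="fr">en fran&ccedil;ais</A> and
<A HREF="authoring.de.html" HREFLANG="de" LANG="de">auf Deutsch</A>.</P>
</BODY>
```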

It would be interesting to discuss the pros and cons of each method, but here we will only mention that fully automatic translation will in practice be needed for transient documents which are generated dynamically instead of being static files.

Naturally, these methods could be combined. For example, a browser might request (on the basis of language preferences set by the user) a document in a specific language, and the server could check whether it actually has a suitable translation; if not, it could check whether it can find a program for generating a translation; and if this fails, it could send the document in a language lower in the user's preferences, in which case the browser could check whether it can translate the document. (Some care might be needed to prevent the situation where a server sends an automatically generated translation which needs to be translated again to the user's preferred language, instead of sending the original.)

What could automatic translation be good for

Even a coarse and erroneous translation can give an idea of what the text is about. This is crucial on the Web where one can easily access a huge number of documents in various languages and needs to sort out what's relevant. If one finds a document in a strange language, automatic translation helps in deciding whether it is worth a closer look. If it looks really interesting, one could perhaps afford a translation made by a human translator.

A relatively good automatic translation could serve as the basis for a human translator. Admittedly, there is a risk that the human translator produces an unnecessarily low-quality translation that way, since automatic translation tends to reflect the structure of the source language too much. In any case, automatic translation is currently used that way, and it can save a lot of time e.g. by saving the human translator from the boring work of translating simple sentences, doing dictionary lookup, etc.

Automatic translation can be useful alongside the original when the reader knows the source language to some extent but not fluently. Depending on how well he knows it, he could use either the original or the translation as the primary text, consulting the other one when problems are encountered.

An author who speaks several languages could use automatic translation of his documents as an extra check for clarity and grammatical correctness. For example, having written a document in English I could request an automatic translation into German, then read through the translation. Translation errors may well indicate problems in the original text, such as typographic errors not detected by spelling checkers (for example an error which happens to produce another word of the language) or too complicated or ambiguous grammatical structures. If a translation program cannot correctly handle a piece of text, this might result from features which also prevent a human reader from understanding it or make him understand it incorrectly, especially if the language used is not his native language.

The question remains whether "standalone" automatic translation can be feasible. That is, could one use a fully automatically produced translation as the only form in which a document is accessed? The answer is that it depends on the nature of the text and on the translation program. Currently Babelfish is already used to some extent that way in Usenet discussions, using grammatically simple language.

As translation programs improve, they can be used for more complicated texts. Naturally, the question remains whether one can rely on an automatically generated translation. One might answer with another question: Don't humans err? In fact, in many details automatic translation can be more reliable, since computer programs do boring work more conscientiously than people do.

If documents are translated by human translators, each translation needs to be updated whenever the original is changed. This means a lot of boring work - some of which might be automated with suitable tools - especially for documents which are updated very frequently. And a large number of Web pages are generated dynamically, on the basis of a user request and e.g. a search from a database. A simple query report might contain grammatically very simple language and be easily translatable by computer.

For the majority of all uses of all documents on the Web, automatic translation is the only feasible way of access for anyone who does not know well the language in which the document was written. If a document exists in Portuguese only and you don't know Portuguese, you either utilize automatic translation or you can't read the document at all, except in rare cases where you can afford to order a translation made by a human translator or find someone who does the job for you for free.

Automatic translation and HTML

Introduction

This section discusses those problems and solutions in automatic translation which are specific to the situation where the source document is in HTML format. This section (or this document) does not discuss the general problems of automatic translation. For such topics, please refer to the extensive directory The ACL NLP/CL Universe.

Automatic translation can be significantly improved if authors and their tools are involved. This means that the problems of translation are taken into account by the author or his assistants when the original work is created.

For existing documents, some relatively simple modifications may greatly improve translatability. This will be demonstrated next by an example.

Example: modifying a simple document for translatability

As an example of how modifications to a document can improve translatability, I have taken a short page which tells some numeric and other facts about the university where I work, HUT. Such simple fact pages could be expected to be relatively easily translatable, since they do not contain grammatically complex structures. Moreover, automatic translatability is essential since such pages can be interesting to people speaking different languages, and one hardly wants to allocate resources to maintaining such pages in many languages by hand.

Note: The example document and its modified form and their translations are not embedded into this document. Instead, links to them are provided. In a typical graphical browser, such as Internet Explorer or Netscape, on Windows for example, you can use the rightmost button of the mouse when following a link (instead of the normal use of the leftmost button), then select the alternative Open in New Window in the pulldown menu opened. You can then move the window to another position on the screen and resize it suitably, e.g. so that you can view different versions side by side.

The original page is a short fact sheet, Helsinki University of Technology in a Nutshell. In its French translation by Babelfish, there are several obvious failings (most of which you probably notice even if you don't know French):

many texts have not been translated at all
some texts are very confusing in the translation, since Babelfish has taken e.g. the word "Twelwe" as a proper name (kept untranslated), not as a misspelling of "Twelve"
the expression "13 degree programmes" is translated as if "13 degree" were an attribute to "programme"; that is, Babelfish analyzed the structure differently from the intended one
an expression like "under-" has been taken as a preposition (translated into the corresponding French preposition) instead of being part of the expression "under- and postgraduate".

In other translations, there are similar failings but also some different problems. For example:

In the German translation, the word "state" was translated as "Zustand". Obviously, for a noun with as many meanings as "state" has, a translation program is unlikely to pick the correct equivalent without some help from the context. Changing "state" to "state budget" would improve the situation: it would be translated into the single word "Staatshaushalt". (One should of course be careful with such changes to text which modify the meaning. In this context, the modification would probably be acceptable.)
In the Italian translation, the name "Esa" has been replaced by "SEC". Presumably it was interpreted as some abbreviation and translated using the corresponding Italian abbreviation! And the word "marks" was translated as "contrassegni" instead of being taken as a currency name. This problem can be circumvented by removing the word from the original; it is redundant due to the appearance of the currency abbreviation FIM.
In the Portuguese translation, the word "nutshell" is left untranslated. This is interesting, compared with the fact that in the French translation the phrase "in a nutshell" has been replaced by the idiomatic equivalent "en un mot" (literally, 'in a word').
In the Spanish translation, the name "Räisänen" has been changed to "R5ais5anen", i.e. the letter ä has become 5a. This is obviously an error in the way the translator processes data at the character level and should be fixed there. On the other hand, modifying the document by presenting ä as the entity &auml; circumvents the problem.
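
The entity workaround in the last item can be automated. The following is a minimal Python sketch, not part of the experiment described here; the function name to_entities is invented for illustration. It relies on the standard library table html.entities.codepoint2name and assumes it is applied to text content only, not to markup.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity reference.

    Hypothetical helper: named entities are used where the standard
    table has one, numeric character references otherwise.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                          # plain ASCII passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # e.g. ä becomes &auml;
        else:
            out.append(f"&#{code};")                # no named entity available
    return "".join(out)


print(to_entities("Räisänen"))  # prints R&auml;is&auml;nen
```

Whether a particular translation service handles the entity form better than the raw character has to be tested case by case; the sketch only produces the alternative spelling.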

In order to solve some of the problems detected, I constructed an experimental modified page by applying the methods described in the first section (guidelines on natural language usage and guidelines on HTML markup). Its French translation (by Babelfish) is considerably better than that of the original. The remaining flaws (such as "professeurs d'associé" instead of "professeurs associés") are probably things that can be fixed only by improving the translation program.

Notes on the changes:

The original document uses an ADDRESS element twice, and Babelfish treats ADDRESS in "don't translate this" mode. The illogical use of ADDRESS for something that really isn't a normal address thus causes unwanted phenomena in automatic translation. In the first case, the ADDRESS tags were simply removed. In the second, they were replaced by SMALL tags; it seems natural to suggest that technical information about the maintenance of a document should appear in a smaller font than normal.
On the other hand, some other parts of the text, namely people's names and the abbreviation HUT (which was in an ADDRESS element), need to be protected from any attempt to translate them. This was done using the "SAMP hack"; it has the drawback that words so marked are presented in a monospaced ("typewriter") font on many browsers by default. Style sheets are used to suggest another rendering, small-caps.
The logo of the university is used in the original page as one image, which contains the logo symbol and the name of the university (in English). Naturally, texts presented in images are not translated by automatic translation programs (although something like that might be done, in principle). Therefore, using an image which contains the logo symbol only and the name as text leads to better translatability.

You may wish to compare the presentation of the modified document (in English) on your browser with a screenshot of what it looks like in one browsing situation, viewed on Internet Explorer 4.0 with stylesheet support on. (It isn't quite what it should be, due to deficiencies in stylesheet support.)

You may wish to look at the other translations of the modified document:

German translation
Italian translation
Portuguese translation
Spanish translation

The Portuguese translation is the most problematic. In addition to the "nutshell" problem mentioned above, the change of the English spelling "vicerector" to "vice-rector" caused a new problem: it's now translated as "vice-vice-rector"!

Logical markup and translation

To be written...

Multilingualization ("internationalization") of HTML

To be written... Need to consider the different roles of the LANG attribute for example.

How to prevent translation

Paradoxically, one of the most serious practical problems in translating Web documents automatically is how to prevent translation of various parts of the document.

Consider the section of this document with the example of a text in English and its translation into French. Quite obviously, if that section is to be translated (into French or into some other language), the example text in French should not be translated, especially not by applying to it algorithms and dictionaries for translating from English to some language! (However, that's what Babelfish currently does.)

To take a simpler and more common example, consider a text in English with a proper name "John Birch" in it. When translating to Italian, for example, how can we prevent a program from translating "Birch" as "Betulla" (using the Italian word for birch)? Someone might suggest heuristics based on the use of capital letters, but that would be rather ineffective - it would fail entirely when translating from German, for example, since in German all nouns are spelled with a capital initial.

It seems obvious that some method of marking words as proper names is needed. That's not sufficient, however. There are other words, too, which must not be translated. Examples range from code-like things appearing in texts about computer languages (like the element name BODY in HTML or the keyword case in C) to linguistic texts speaking about words. It is obvious that when translating a text which discusses the English language, sample English words (as in "the plural of ox is oxen") must not be translated.

It should be noted that one cannot deduce from the word itself, as a string in a text, whether it should be translated, no matter how large the glossaries we use. For example, the word "John" in a name like "John Birch" must remain as such, whereas "king John" must become "kuningas Juhana" in Finnish and "John the Baptist" must become "Jean-Baptiste" in French.

It seems that Babelfish regards the contents of the following HTML elements as something that shall not be translated: ADDRESS, CITE, and SAMP. For all of these, one can present arguments in favor of treating them as "literals" which are not to be translated. On the other hand, counterarguments could be presented, and at least the CODE element would be an obvious candidate to be added.
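
To make the observed behaviour concrete, here is a small Python sketch of how a translation program might separate literal content from translatable text. It is an illustration under assumptions, not a description of how Babelfish is implemented: the class and function names are invented, and the element set simply mirrors the three elements listed above.

```python
from html.parser import HTMLParser

# Elements whose content appeared to be left untranslated (see above).
# Adding "code" here would implement the suggested extension.
LITERAL_ELEMENTS = {"address", "cite", "samp"}


class SegmentCollector(HTMLParser):
    """Split the text of a document into (text, translate?) segments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.literal_depth = 0   # > 0 while inside a literal element
        self.segments = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag in LITERAL_ELEMENTS:
            self.literal_depth += 1

    def handle_endtag(self, tag):
        if tag in LITERAL_ELEMENTS and self.literal_depth > 0:
            self.literal_depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.segments.append((data, self.literal_depth == 0))


def segments(html_text):
    parser = SegmentCollector()
    parser.feed(html_text)
    parser.close()
    return parser.segments


sample = "<P>The chairman is <SAMP>John Birch</SAMP>, not a tree.</P>"
for text, translate in segments(sample):
    print("translate" if translate else "keep     ", repr(text))
```

Only the segments flagged for translation would be passed to the translation step; the others would be copied to the output unchanged.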

But basically what is needed is a better official specification of the semantics of phrase markup elements in HTML. In the process of creating such specifications, the questions of translation should be explicitly discussed.

Discussion is needed to determine which is the best approach to preventing translations. Alternatives include:

Specifying which HTML elements are to remain invariant in translations, at least by default.
Introducing a phrase element (which might be called LIT) for specifying that a piece of text is a literal which is to remain unchanged in translations. A set of CLASS attribute values (such as CLASS="person" for person's names) might be introduced to specify the class of literals, mainly for style sheet purposes; they might have some relevance in translation, too.
Introducing an attribute for specifying that the content of the element is a literal which is to remain unchanged in translations.

The first alternative could hardly be the only solution. It would require additional methods both for specifying that other element instances are translation-invariant and for specifying that normally translation-invariant elements are to be translated.

Proposed improvements to HTML and translation techniques

Proposed improvements to translation techniques

The following list indicates some deficiencies and problems in Babelfish that I have noted when using it. The list is by no means exhaustive and not even systematic.

Instead of simply translating from one language to another, the languages being specified by the user, a translation program should operate on a multilingual basis, not a bilingual one. That is, it should accept data where different languages may appear and it should produce a result where different languages might be used, depending on language preferences. For example, a user might request a translation into Finnish but with texts in Swedish and English passed through as such; and for other languages, if direct translation from the language into Finnish is not available, he might prefer an English translation to a translation from the original via English.
Babelfish seems to ignore the LANG attribute entirely. In addition to using the LANG specified for the HTML element in order to determine the basic language in the document, a translation program should check the LANG (and HREFLANG) attributes in contained elements and leave texts written in other languages than the basic source language untranslated, or translate them using algorithms and lexica for the language specified.
No attempt is made to translate texts in attributes like TITLE and ALT. It would be quite essential to have them translated, too. Notice that ALT is crucial for accessibility.
Any text within CODE, KBD, and SAMP elements should be left untranslated by default.
The PRE elements are messed up, since the translation does not preserve line breaks.
Babelfish often translates words assuming a specific technical meaning even for words which are much more often used in another meaning. For instance, the word "reader" gets translated into "program de lectura" in Spanish!
Babelfish converts notations like 3.2 in English assuming that they are decimal numbers, making it 3,2 in French for example. This is incorrect when the notation is actually something else, such as a program version number. (An addition to the HTML language might be needed to distinguish between the cases. As an interim solution, translation programs should refrain from trying conversion between different notations for decimal numbers.)
Some ISO 8859-1 characters are not handled correctly in some cases.
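
Several of the items above (LANG, TITLE and ALT, and the CODE, KBD, and SAMP elements) concern how a translation program selects the text it works on. The following Python sketch shows one way such a selection step could look. It is only an illustration: all names are invented, the sample markup and its Finnish phrase are made up, text is stripped of surrounding whitespace for readability, and the code assumes that end tags are not omitted.

```python
from html.parser import HTMLParser

# Empty elements of HTML 4; they have no end tag, so they are never pushed.
VOID = {"area", "base", "basefont", "br", "col", "frame", "hr",
        "img", "input", "isindex", "link", "meta", "param"}
LITERAL = {"code", "kbd", "samp"}   # left untranslated by default
TEXT_ATTRS = ("title", "alt")       # attribute values that carry text


class LangAwareCollector(HTMLParser):
    """Collect (language, text, translate?) units from a document."""

    def __init__(self, default_lang="en"):
        super().__init__(convert_charrefs=True)
        # Stack of open elements: (tag, effective language, inside literal?)
        self.stack = [("#root", default_lang, False)]
        self.units = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        _, parent_lang, parent_literal = self.stack[-1]
        lang = attrs.get("lang") or parent_lang   # LANG is inherited
        literal = parent_literal or tag in LITERAL
        for name in TEXT_ATTRS:
            if attrs.get(name):
                self.units.append((lang, attrs[name], not literal))
        if tag not in VOID:
            self.stack.append((tag, lang, literal))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            _, lang, literal = self.stack[-1]
            self.units.append((lang, data.strip(), not literal))


def collect(html_text, default_lang="en"):
    parser = LangAwareCollector(default_lang)
    parser.feed(html_text)
    parser.close()
    return parser.units


def to_translate(units, source_lang):
    """Texts that are in the source language and not marked as literal."""
    return [text for lang, text, translate in units
            if translate and lang.lower().split("-")[0] == source_lang]


sample = (
    '<HTML LANG="en"><BODY>'
    '<P>The motto is <SPAN LANG="fi">tieto on valtaa</SPAN>.</P>'
    '<P><IMG SRC="logo.gif" ALT="University logo"> Type <KBD>ls</KBD> here.</P>'
    '</BODY></HTML>'
)
print(to_translate(collect(sample), "en"))
```

In this sketch the Finnish phrase and the KBD content are left out of the English-to-X translation step, while the ALT text is included. A multilingual translator could instead route the units tagged "fi" to a Finnish-to-X translation.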

To be continued...

Proposed improvements to the HTML language

This is a very preliminary "wish list". Most probably some of the problems discussed here should be solved by introducing a more general construct than the one proposed here, or solved outside HTML, e.g. by improving the translation software. Sorry, you probably don't understand very much of this unless you know the HTML language rather well.

It is very important in translations that the translator knows the context. A human translator or an advanced automatic translator could first process the entire text and deduce what topics it discusses, then use this information to select suitable alternatives among dictionary entries. It might still be useful to assist translators by using markup which specifies the context, for an entire document, or for a part of it, or even for a single word or phrase. Some general classification (e.g. UDC, the Universal Decimal Classification) might be used. When a translator knows that the text discusses mathematics, for example, it (or he) is in a much better position to translate terms like "field" or "product" appropriately. In other languages, these words as mathematical terms may have equivalents quite different from the words that correspond to "field" or "product" in everyday language. On the other hand, a mathematical text might use "field" in its everyday meaning, too!
Markup for indicating "mode" or "style" such as sarcasm, jocularity, or poetry might be added. On Usenet, people sometimes use pseudo-HTML like "<SARCASM>...</SARCASM>". Perhaps translation programs might one day benefit from such markup. (It might also be used e.g. by speech-based user agents in order to select an appropriate tone of voice.)
Introduce markup for indicating uncertainty of enclosed text. Additional information indicating the numerical degree of uncertainty and an explanation of the reason for uncertainty might be introduced. Translators - both automatic and human - might use such markup for indicating that the translation might be incorrect. The explanation generated might include the corresponding fragment in the original text. (Notice that in our example of Finnish to English translation, the translator uses italics to denote one kind of uncertainty. Naturally, logical markup would be needed instead of physical, but current HTML has no suitable element.) The markup could also be used for other purposes, such as in an HTML document containing the text of an old manuscript where some words are less certain than others, e.g. interpolated.
In conjunction with uncertainties, markup with variants could be used. That is, at the HTML level one could specify different variants for a piece of text. One of the variants could be marked as the "normal" one to be presented to the user by default, distinguished in a manner similar to links, and letting the user access the other variants and their explanations. Naturally, translators should generate such constructs sparingly.
Introduce "joiner" markup which means that two or more words are to be treated as a whole. This would be especially useful for languages like English which use phrases consisting of several words rather than compound words. Translation might be greatly assisted by indicating how words belong together. (Notice that the existing SPAN element and the no-break space &nbsp; do not logically mean the same as a "joiner" markup would.)
Include ORIGINAL into the set of standardized values for the REL attribute. It would indicate a link to the original version from which the current version was translated. Translator programs should leave such links intact instead of converting them to links through a translator (as Babelfish now seems to do to all links). Naturally, a translator program, when asked to translate a document to language X, should check whether the document itself refers to its original which is written in X. This could be important when following links in a manner which goes through translations; it might prevent the situation where the user gets a translation of a document from language Y to X instead of getting an existing original in X!
Include markup for presenting the author's estimate of translatability. This should be in the HTML markup, not in metadata outside it. For example, a prose document might contain a quotation of a poem, and the author might wish to denote that it has low translatability. This might imply, depending on software and user preferences, that parts of a document are not translated, or that translations would be indicated as being less reliable than usual. At the HTML level, this might involve the use of the "uncertainty element" mentioned above.
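
The REL="ORIGINAL" proposal in the list above implies a simple check that a translator program could make before translating. The Python sketch below illustrates it. Note that ORIGINAL is only the value proposed here, not a standardized REL value, and that the function name and the file name in the sample are invented for the example.

```python
from html.parser import HTMLParser


class OriginalLinkFinder(HTMLParser):
    """Collect links marked with the proposed REL value "original"."""

    def __init__(self):
        super().__init__()
        self.originals = []   # list of (hreflang, href)

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        # REL may hold several space-separated values.
        rel_values = (attrs.get("rel") or "").lower().split()
        if "original" in rel_values and attrs.get("href"):
            hreflang = (attrs.get("hreflang") or "").lower()
            self.originals.append((hreflang, attrs["href"]))


def original_in(html_text, target_lang):
    """Return the address of an original written in target_lang, or None."""
    finder = OriginalLinkFinder()
    finder.feed(html_text)
    finder.close()
    for lang, href in finder.originals:
        if lang.split("-")[0] == target_lang.lower():
            return href
    return None


page = '<LINK REL="original" HREFLANG="fi" HREF="nutshell-fi.html">'
print(original_in(page, "fi"))   # address of the Finnish original
print(original_in(page, "fr"))   # None: no French original, so translate
```

A translator asked for language X would call such a check first and offer the existing original instead of a translation when one is found.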