People who write HTML documents for the WWW should become aware of the possibilities and problems raised by automatic translation services such as Babelfish. This is a rather new area, and most of the discussion here concerns changes needed both in the HTML language and in the translation software. However, practical tips are also given wherever "translation-friendly" techniques can be suggested for the present situation. Some of the suggestions are equally applicable to other forms of authoring and to human translation. A collaborative effort by authors of documents and implementors of translation techniques is needed, and designers and implementors of markup languages like HTML should get involved, too.
For concreteness, this presentation describes the suggested guidelines first, in order to give the reader an idea of the relative simplicity of the actions needed. Naturally, the guidelines have been derived from observations, reasoning and experiments presented in later sections.
Try translating this document using Babelfish to get some idea of the possibilities and problems.
These guidelines apply to the textual content of documents, irrespective of the presence of HTML markup or some other markup. Mostly the guidelines apply to all forms of translation - human, automatic, or combined.
ABBR element to specify an expansion of an abbreviation.)
Use logical markup instead of abusing HTML elements to achieve a desired physical effect. For example, if you use the H1 element just to have a paragraph presented in a very large font, a translation program may assume it is a 1st level heading (since that's what H1 really means); this might imply that when translating from French to English, all words except a few small words will get a capital initial, in accordance with common usage in English headings!
Use the LANG attribute at least for the entire document and for major parts (such as block quotations) within it if they are in another language. (This currently seems to have no effect on Babelfish, but in the long run such markup is crucial for good translation.)
SAMP element. For example, to prevent Babelfish from translating the abbreviation "HUT" into something that means a hut in the target language, you could write <SAMP>HUT</SAMP>. A drawback is that the text will then appear in a monospaced font on most browsers, but by using style sheets you can suggest that it be rendered in a normal font.
Use tables (TABLE elements) instead of preformatted text (PRE element) for tabular material. If you "line up" things using preformatted text, the lining up is almost certainly lost in translation. (In fact, Babelfish seems to screw up preformatted blocks rather badly.)
it is not <EM>on</EM> the table but <EM>under</EM> it) there will be difficulties when translating to a language where suffixes are used for things expressed by prepositions in English. (A human translator might be able to find a suitable circumlocution.)
ö (o dieresis, or o umlaut), especially for characters which do not belong to the normal alphabet of the main language of the document. (This can be rather inconvenient, but it may help to circumvent a bug in Babelfish.)
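As a rough illustration, several of the markup guidelines above could be combined as in the following sketch. The page title, the French sentence and the table contents are invented placeholders, not taken from any real page; only the abbreviation HUT comes from the example above. The style rule merely suggests a rendering, and a browser may ignore it.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<HTML LANG="en">  <!-- language of the document as a whole -->
<HEAD>
<TITLE>Sample fact sheet (placeholder)</TITLE>
<STYLE TYPE="text/css">
  /* Suggest the surrounding font instead of the default monospaced one */
  SAMP { font-family: inherit; }
</STYLE>
</HEAD>
<BODY>
<!-- H1 is used only because this really is a 1st level heading -->
<H1>Sample fact sheet</H1>

<!-- SAMP protects the abbreviation; ABBR gives an expansion -->
<P><SAMP>HUT</SAMP> has pages on the
<ABBR TITLE="World Wide Web">WWW</ABBR>.</P>

<!-- A quotation in another language carries its own LANG attribute -->
<BLOCKQUOTE LANG="fr">
<P>La traduction automatique est utile.</P>
</BLOCKQUOTE>

<!-- Tabular material as a TABLE, not lined up inside PRE -->
<TABLE SUMMARY="Placeholder figures, for illustration only">
<TR><TH>Item</TH><TH>Count</TH></TR>
<TR><TD>First item</TD><TD>1</TD></TR>
<TR><TD>Second item</TD><TD>2</TD></TR>
</TABLE>
</BODY>
</HTML>
```

Whether a given translation program honours any of this markup is a separate question; the sketch only shows what the author can express.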
An example of linking to an entry in a special dictionary when using a word which may cause problems to translation programs (and even human translators or human readers of the original text):
There is a <A HREF= "http://wagner.princeton.edu/foldoc/cgi-script?action=Search%3A&query=workaround" TITLE="a description of the word 'workaround'">workaround</A> to this problem.
This looks like the following on your current browser:
There is a workaround to this problem.
Interestingly, using Babelfish, the link gets converted to a link through Babelfish, so the reader of the translated document, when following the link, will get a translated version of the dictionary entry! This is often - probably most often - very nice, but how can you write a link which does not get translated that way, for example a definite link to the original in English?
In December 1997, Web users started asking each other: "Have you noticed the 'Translate' links on AltaVista search results?" The popular AltaVista search engine had started suggesting, in a not so prominent way, that users can get automatically generated translations of documents. When a user had sent a set of keywords to AltaVista, it returned a list of documents matching the keywords as previously, but now there were "Translate" links in the following style:
4. Writing for Translation
Writing for Translation. by Roz Treger and Nancy Ott. More and more companies are marketing their products globally. As technical communicators, we are...
Last modified 11-Sep-97 - page size 10K - in English [ Translate ]
That document, by the way, is related to our topic and is definitely worth reading.
By following the "Translate" link, the user would get a page containing a form for requesting the translation.
The user could then request a translation into one of a set of languages, namely French, Italian, German, Portuguese, and Spanish. (For a document in one of these languages, one could request a translation into English.) In our example, a translation into German would begin as follows:
Schreiben für Übersetzung
durch Roz Treger und Nancy Ott
Immer mehr Firmen sind Marketing ihre Produkte global. Als technische Verbindungen produzieren wir Material, das von den Benutzern in vielen unterschiedlichen Ländern gelesen werden und in einige Sprachen übersetzt werden kann. Wir müssen dieses in Betracht ziehen, wenn wir Schreiben es sind. Übersetzung ist ein Re-Ausdruck von Ideen in einer anderen Sprache, nicht in einem eins-zu-eins Ersatz von Wörtern und in den Phrasen. Übersetzung-freundliches Schreiben stellt Ideen offenbar und durchweg dar und läßt sie einfacher, damit Übersetzer und non-native englische Lautsprecher sie verstehen. Soviel wie möglich, ist es kulturell Neutrales.
This is far from being good German, but anyone who knows German reasonably well understands what the document is about.
Notice that the translation is based on the document as written in the HTML markup language, which indicates the structure of the document, e.g. denoting parts of the text as headings. The translation preserves, or at least tries to preserve, the markup. (Babelfish can translate plain text, too, of course.)
There are many deficiencies in the translation. For example, the English phrase "are marketing" has been translated as if "marketing" were a noun! But the translation is far from being a naive word-by-word conversion. It constructs sentences according to German grammar, which often uses a word order quite different from that of English.
First we should distinguish between automatic translation, simple word replacement and consulting a dictionary. Companies seem to market dictionary programs sometimes as "translators". A program which allows the user to get a dictionary entry for a word, with corresponding words, or "translations", in one's native language, can definitely be very useful; but it is definitely not a language translator. Similarly, a program which "translates" simply by replacing each word by its "equivalent" in another language should not be called a translator, although it can be useful for some purposes in special cases. What we mean by automatic translation here is a process which involves at least some minimal grammatical analysis of the source text and generation of corresponding text in the target language. The depth and quality of the process may vary a lot.
A large number of automatic translation programs are available. For a list of some of them, please refer to Information on Computer-Assisted Translation Software by the Oxford University Language Centre. A short document Do you have any information on automatic translation software? by LTG lists a few services especially for translating Web pages. See also Langenscheidt's T1 Test Drive which demonstrates translation of German and Spanish (in plain text format) into English.
It is very easy to see that automatic translation does not (currently) work well for poetry, for example. It is a cheap amusement to use a tool for something it is not intended for and laugh at the result. Babelfish takes a good attitude on this; its Translation Tips urge people to try that, too:
Idioms and slang -- phrases like "the whole nine yards" or "what's up, Doc?" in American English -- are notoriously hard to translate well, particularly when the computer doesn't know the context of the phrase. Try a few for some good laughs.
Cheap Entertainment, Part 2
Remember the old game Gossip -- where one person whispers something to the next person, then the second to the third, and so on, then everyone has a good laugh about what comes out at the end? Try that with translations. Just start with one languages, then translate that to another, then another, then another, then back to the original.
After such good laughs, perhaps people are willing to consider what else the software can do, and how it performs in the areas it was designed for.
This example consists of a text of mine, an extract from How to use images in communication in general and on the Web in particular, and its translation into French by Babelfish:
English original, followed by the French translation:
"You can never use too many images"
A very large number of published documents contain text only. They often look boring, and they are often written in obscure language, using mile-long sentences and cryptic technical terms, using one font only, perhaps even without headings. Such style, or lack of style, might be the one you are strongly expected to follow when writing eg scientific or technical reports, legal documents, or administrative papers. It is natural to think that such documents would benefit from a few illustrative images. (However, just adding illustration might be rather useless, if the text remains obscure and unstructured.)
It is too easy to go to the other extreme when trying to avoid the boring plain text syndrome. This is especially true on the Web, where it is relatively easy technically to add illustration, for instance by picking images from various existing collections. Many people seem to think that you can't have too many images. If they can't find a suitable image, they use an unsuitable one.
When people say that one image tells more than a thousand words, they tend to overlook the fact that what the image says might be true or false, relevant or off-topic, useful or disturbing, constructive or tasteless. (I won't bother to refute the saying by pointing out that there are images which say nothing. However, I cannot resist the temptation to remark that oddly enough the saying itself is expressed using words.)
" vous pouvez ne jamais utiliser trop d'images "
Un nombre très grand de documents édités contiennent le texte seulement. Ils regardent souvent le sondage, et ils sont souvent écrits en langage obscur, en utilisant de mille-longues phrases et termes techniques cachés, en utilisant une fonte seulement, peut-être même sans entêtes. Un tel modèle, ou manque de modèle, pourrait être celui on s'attend à ce que fortement que vous suiviez en écrivant par exemple des états scientifiques ou techniques, les documents juridiques, ou les papiers administratifs. Il est normal de penser que de tels documents tireraient bénéfice de quelques images d'illustration. (cependant, l'illustration ajoutante juste pourrait être plutôt inutile, si le texte demeure obscur et non structuré.)
Il est trop facile d'aller à l'autre extrémité en essayant d'éviter le syndrome ordinaire ennuyeux des textes. C'est particulièrement vrai sur le Web, où il est relativement facile techniquement d'ajouter l'illustration, par exemple en sélectionnant des images de diverses collections existantes. Beaucoup de gens semblent penser que vous ne pouvez pas avoir trop d'images. S' ils ne peuvent pas trouver une image appropriée, ils utilisent peu convenable.
Quand les gens disent qu' une image indique plus que mille mots, ils tendent à donner sur le fait que ce que l'image indique pourrait être vrai ou faux, approprié ou hors fonction-sujet, utile ou dérangeant, constructif ou insipide. (je ne prendrai pas la peine de réfuter l'énonciation en précisant qu'il y a des images qui n'indiquent rien. Cependant, je ne puis pas résister à la tentation de remarquer qu'assez curieusement l'énonciation elle-même est exprimée en utilisant des mots.)
The result certainly isn't good French--it is easy to see that there are errors even in the use of capital letters to begin a sentence--but I suppose it conveys the basic message in the original. Perhaps it should be mentioned that I wrote the text before I started thinking about the use of automatic translation for Web pages, and thus it has not been "tuned".
This example is a short translation (of a fragment of an article on machine translation) from Finnish to English by TranSmart, a demo of software by Kielikone.
|Finnish original||English translation|
|Kääntäminen lisääntyy eri syistä jatkuvasti. Teknologia tuottaa yhä mutkikkaampia laitteita ja yhä laajempia asennus- ja käyttöoppaita. Vientituotteiden ohjeet pitää kääntää asiakkaiden kielille. Samaten patenttihakemusten ja tieteellisten artikkelien määrä kasvaa.||The translation increases continuously for different reasons. The technology produces increasingly complicated devices and the increasingly wide installation guides and use guides. The instructions of the export products must be translated into the customers' languages. The number likewise of the patent applications and of the scientific articles increases.|
The translation given by TranSmart contains some text in italics, indicating that the italicized text is a translation for a compound word in the original. (Often this means that the translation is not idiomatically correct but usually helps in getting the idea.)
Obviously, the translation is far from being idiomatically and stylistically perfect. Yet, one can understand the content pretty well.
The incorrect use of articles in the translation is mainly caused by the fact that the Finnish language lacks both definite and indefinite articles.
For HTML documents on the WWW, automatic translation can be requested at several alternative stages:
LINK element. It is also possible, in principle at least, to organize things so that all the versions are accessible using a single address (URL); the preferred version would be picked up according to the so-called language negotiation mechanism.
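As a sketch of how LINK elements can point from a document to versions of it in other languages, the HEAD of the English fact sheet might contain something like the following. The file names are hypothetical, and nothing here says how the translated versions were produced.

```html
<HEAD>
<TITLE>Helsinki University of Technology in a Nutshell</TITLE>
<!-- Hypothetical file names; one LINK per available language version -->
<LINK REL="alternate" HREFLANG="fr" TYPE="text/html"
      HREF="nutshell.fr.html" TITLE="French version">
<LINK REL="alternate" HREFLANG="de" TYPE="text/html"
      HREF="nutshell.de.html" TITLE="German version">
</HEAD>
```

With language negotiation, by contrast, the address stays the same for all versions: the browser states the user's language preferences in its request, and the server selects the version to send.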
It would be interesting to discuss the pros and cons of each method, but here we will only mention that fully automatic translation will in practice be needed for transient documents which are generated dynamically instead of being static files.
Naturally, these methods could be combined. For example, a browser might request (on the basis of language preferences set by the user) a document in a specific language, and the server could check whether it actually has a suitable translation; if not, it could check whether it can find a program for generating a translation; and if this fails, it could send the document in a language lower in the user's preferences, in which case the browser could check whether it can translate the document. (Some care might be needed to prevent the situation where a server sends an automatically generated translation which needs to be translated again to the user's preferred language, instead of sending the original.)
Even a coarse and erroneous translation can give an idea of what the text is about. This is crucial on the Web where one can easily access a huge number of documents in various languages and needs to sort out what's relevant. If one finds a document in a strange language, automatic translation helps in deciding whether it is worth a closer look. If it looks really interesting, one could perhaps afford a translation made by a human translator.
A relatively good automatic translation could serve as a basis for a human translator. Admittedly, there is a risk that the human translator produces an unnecessarily low-quality translation that way, since automatic translation tends to reflect the structure of the source language too much. In any case, automatic translation is currently used that way, and it can save a lot of time e.g. by saving the human translator from the boring work of translating simple sentences, doing dictionary lookup, etc.
Automatic translation can be useful alongside the original when the reader knows the source language to some extent but not fluently. Depending on how well he knows it, he could use either the original or the translation as the primary text, consulting the other one when problems are encountered.
An author who speaks several languages could use automatic translation of his documents as an extra check for clarity and grammatical correctness. For example, having written a document in English I could request an automatic translation into German, then read through the translation. Translation errors may well indicate problems in the original text, such as typographic errors not detected by spelling checkers (for instance an error which happens to produce another word of the language) or overly complicated or ambiguous grammatical structures. If a translation program cannot correctly handle a piece of text, this might result from features which also prevent a human reader from understanding it or make him understand it incorrectly, especially if the language used is not his native language.
The question remains whether "standalone" automatic translation can be feasible. That is, could one use a fully automatically produced translation as the only form in which a document is accessed? The answer is that it depends on the nature of the text and on the translation program. Currently Babelfish is already used to some extent that way in Usenet discussions, using grammatically simple language.
As translation programs improve, they can be used for more complicated texts. Naturally, the question remains whether one can rely on an automatically generated translation. One might answer with another question: Don't humans err? In fact, in many details automatic translation can be more reliable, since computer programs do boring work more conscientiously than people do.
If documents are translated by human translators, each translation needs to be updated whenever the original is changed. This means a lot of boring work - some of which might be automated with suitable tools - especially for documents which are updated very frequently. And a large number of Web pages are generated dynamically, on the basis of a user request and e.g. a database search. A simple query report might contain grammatically very simple language and be easily translatable by computer.
This section discusses those problems and solutions in automatic translation that are specific to the situation where the source document is in HTML format. It does not discuss the general problems of automatic translation; for such issues, please refer to the extensive directory The ACL NLP/CL Universe.
Automatic translation can be significantly improved if authors and their tools are involved. This means that the problems of translation are taken into account by the author or his assistants when the original work is created.
For existing documents, some relatively simple modifications may greatly improve translatability. This will be demonstrated next by an example.
As an example of how modifications to a document can improve translatability, I have taken a short page which tells some numeric and other facts about the university where I work, HUT. Such simple fact pages could be expected to be relatively easily translatable, since they do not contain grammatically complex structures. Moreover, automatic translatability is essential since such pages can be interesting to people speaking different languages, and one hardly wants to allocate resources to maintaining such pages in many languages by hand.
The original page is a short fact sheet, Helsinki University of Technology in a Nutshell. In its French translation by Babelfish, there are several obvious failings (most of which you probably notice even if you don't know French):
In other translations, there are similar failings but also some different problems. For example:
ä circumvents the problem.
In order to solve some of the problems detected, I constructed an experimental modified page by applying the methods described in the first section (guidelines on natural language usage and guidelines on HTML markup). Its French translation (by Babelfish) is considerably better than that of the original. The remaining flaws (such as "professeurs d'associé" instead of "professeurs associés") are probably things that can be fixed only by improving the translation program.
Notes on the changes:
ADDRESS element, which is treated in "don't translate this" mode by Babelfish. The illogical use of ADDRESS for something that really isn't a normal address thus causes unwanted phenomena in automatic translation. In the first case, the ADDRESS tags were simply removed. In the latter case, they were replaced by SMALL tags; it seems natural to suggest that technical information about the maintenance of a document should appear in a smaller font than normal.
ADDRESS element, namely the abbreviation HUT) need to be protected from any attempt to translate them. This was done using the "SAMP hack"; it has the drawback that words so marked are presented in a monospaced ("typewriter") font on many browsers by default. Style sheets are used to suggest another rendering, small-caps.
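The changes described in these notes can be sketched roughly as follows. This is a fragment only, and the text inside the elements is placeholder text rather than the actual content of the fact sheet.

```html
<STYLE TYPE="text/css">
  /* Suggest small-caps in the surrounding font instead of monospace */
  SAMP { font-family: inherit; font-variant: small-caps; }
</STYLE>

<!-- Before: maintenance information marked up as an address,
     which Babelfish leaves untranslated -->
<ADDRESS>Maintenance information (placeholder text)</ADDRESS>

<!-- After: the same information as small text, open to translation -->
<P><SMALL>Maintenance information (placeholder text)</SMALL></P>

<!-- The abbreviation protected from translation with the SAMP hack -->
<P>Basic facts about <SAMP>HUT</SAMP> (placeholder sentence).</P>
```

The style rule is only a suggestion to the browser, so the monospaced default may still appear where style sheets are unsupported or switched off.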
You may wish to compare the presentation of the modified document (in English) on your browser with a screenshot of what it looks like in one browsing situation, viewed on Internet Explorer 4.0 with stylesheet support on. (It isn't quite what it should be, due to deficiencies in stylesheet support.)
You may wish to look at the other translations of the modified document:
|German translation||Italian translation||Portuguese translation||Spanish translation|
The Portuguese translation is the most problematic. In addition to the "nutshell" problem mentioned above, the change of the English spelling "vicerector" to "vice-rector" caused a new problem: it's now translated as "vice-vice-rector"!
To be written...
To be written...
Need to consider the different roles of the
attribute for example.
Paradoxically, one of the most serious practical problems in translating Web documents automatically is how to prevent translation of various parts of the document.
Consider the section of this document with the example of a text in English and its translation into French. Quite obviously, if that section is to be translated (into French or into some other language), the example text in French should not be translated, especially not by applying to it algorithms and dictionaries for translating from English to some language! (However, that's what Babelfish currently does.)
To take a simpler and more common example, consider a text in English with a proper name "John Birch" in it. When translating to Italian, for example, how can we prevent a program from translating "Birch" as "Betulla" (using the Italian word for birch)? Someone might suggest heuristics based on the use of capital letters, but that would be rather ineffective - it would fail entirely when translating from German, for example, since in German all nouns are spelled with a capital initial.
It seems obvious that some method of marking words as proper names is needed. That's not sufficient, however. There are other words, too, which shall not be translated. Examples range from code-like things appearing in texts about computer languages (like the element name BODY in HTML or the keyword case in C) to linguistic texts speaking about words. It is obvious that when translating a text which discusses the English language, sample English words (like in "the plural of ox is oxen") must not be translated.
It should be noted that one cannot deduce from the word itself, as a string in a text, whether it should be translated, however large the glossaries we use. For example, the word "John" in a name like "John Birch" must remain as such, whereas "King John" must become "kuningas Juhana" in Finnish and "John the Baptist" must become "Jean-Baptiste" in French.
It seems that Babelfish regards the contents of the following HTML elements as something that shall not be translated:
For all of these, one can present arguments in favor of treating them as "literals" which are not to be translated. On the other hand, counterarguments could be presented, and at least the CODE element would be an obvious candidate to be
But basically what is needed is a better official specification of the semantics of phrase markup elements in HTML. In the process of creating such specifications, the questions of translation should be explicitly discussed.
Discussion is needed to determine which is the best approach to preventing translations. Alternatives include:
LIT) for specifying that a piece of text is a literal which is to remain unchanged in translations. A set of CLASS attribute values (such as CLASS="person" for persons' names) might be introduced to specify the class of literals, mainly for style sheet purposes; they might have some relevance in translation, too.
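To make the last alternative concrete, markup along the following lines could be imagined. Note that LIT is only the element name suggested here; it is not part of HTML, and no browser or translation program is known to give it any special treatment. The sentences are invented for illustration.

```html
<!-- Hypothetical markup: LIT is a proposed element, not valid HTML -->
<P><LIT CLASS="person">John Birch</LIT> explains how the
<LIT>BODY</LIT> element is used.</P>

<!-- What can be written at present: the SAMP workaround -->
<P><SAMP>John Birch</SAMP> explains how the
<SAMP>BODY</SAMP> element is used.</P>
```

The second form relies on the observed behaviour of Babelfish rather than on any defined meaning of SAMP, which is exactly why a dedicated construct is being discussed.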
The first alternative could hardly be the only solution. It would require additional methods both for specifying that other element instances are translation-invariant and for specifying that normally translation-invariant elements are to be translated.
The following list indicates some deficiencies and problems in Babelfish that I have noted when using it. The list is by no means exhaustive and not even systematic.
LANG attribute entirely. In addition to using the LANG specified for the HTML element in order to determine the basic language of the document, a translation program should check the LANG (and HREFLANG) attributes in contained elements and leave texts written in languages other than the basic source language untranslated, or translate them using algorithms and lexica for the language specified.
ALT. It would be quite essential to have them translated, too. Notice that ALT is crucial for accessibility.
SAMP elements should be left untranslated by default.
PRE elements are messed up, since the translation does not preserve line breaks.
To be continued...
This is a very preliminary "wish list". Most probably some of the problems discussed here should be solved by introducing a more general construct than the one proposed here, or solved outside HTML, e.g. by improving the translation software. Sorry, you probably don't understand very much of this unless you know the HTML language rather well.
SPAN element and the no-break space do not logically mean the same as a "joiner" markup would.)
ORIGINAL into the set of standardized values for the REL attribute. It would indicate a link to the original version from which the current version was translated. Translator programs should leave such links intact instead of converting them to links through a translator (as Babelfish now seems to do to all links). Naturally, a translator program, when asked to translate a document to language X, should check whether the document itself refers to its original which is written in X. This could be important when following links in a manner which goes through translations; it might prevent the situation where the user gets a translation of a document from language Y to X instead of getting an existing original in X!
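In a translated document, the proposed link type might be used roughly as follows. ORIGINAL is the value suggested in this wish list, not a standardized link type, and the URL is a placeholder.

```html
<!-- Hypothetical: REL="ORIGINAL" is a proposal, not a standardized value -->
<HEAD>
<TITLE>Titre de la traduction (placeholder)</TITLE>
<LINK REL="ORIGINAL" HREFLANG="en"
      HREF="http://www.example.org/original.html">
</HEAD>

<!-- The same relationship expressed on a visible link in the body -->
<P><A REL="ORIGINAL" HREFLANG="en"
      HREF="http://www.example.org/original.html">English original</A></P>
```

A translation program that recognized the value could leave both links untouched and, when asked for English, fetch the original instead of translating the translation.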
To be continued...