section Automatic translation and HTML
Example: modifying a simple document for translatability
As an example of how modifications to a document can improve
translatability, I have taken
a short page which gives some numeric and other facts about the
university where I work.
Such simple fact pages could be expected to be relatively
easily translatable, since they
do not contain grammatically complex structures.
Translatability is essential since such pages can
be interesting to people speaking different languages, and
one hardly wants to allocate resources to maintaining
such pages in many languages by hand.
Note: The example document and its modified form
and their translations are not embedded into this document.
Instead, links to them are provided.
In a typical graphical browser, such as Internet Explorer or Netscape,
on Windows for example,
you can use the rightmost
button of the mouse when following a link
(instead of the normal use of the leftmost button), then select
Open in New Window
in the pulldown menu opened.
You can then move the window to another position on
the screen and resize it suitably, e.g. so that you can view
different versions side by side.
The original page
is a short fact sheet,
Helsinki University of Technology in a Nutshell.
In the French translation by Babelfish, there are several obvious failings
(most of which you probably notice even if you don't know French):
- many texts have not been translated at all
- some texts are very
confusing in the translation, since Babelfish has taken e.g.
the word "Twelwe" as a proper name (kept untranslated), not as
a misspelling of "Twelve"
- the expression "13 degree programmes" is translated as if
"13 degree" were an attribute to "programme"; that is, Babelfish
analyzed the structure differently from the intended one
- an expression like "under-" has been
taken as a preposition (translated into the corresponding French
preposition) instead of being part of the expression
"under- and postgraduate".
In other translations,
there are similar failings but also some different problems.
- In the German translation,
the word "state" was translated as "Zustand". Obviously,
for a noun with as
many meanings as "state" has, a translation program is unlikely
to pick a correct equivalent without some help from the context.
Changing "state" to "state budget" would improve the situation:
it would be translated into the single word
"Staatshaushalt". (One should of course be careful with
such changes to text which modify the meaning. In this context,
the modification would probably be acceptable.)
- In the Italian translation,
the name "Esa" has been replaced by "SEC". Presumably it was
interpreted as some abbreviation
and translated using the corresponding Italian abbreviation!
And the word "marks" was translated as "contrassegni"
instead of being taken as a currency name. This problem can
be circumvented by removing the word from the original;
it is redundant due to the appearance of the currency
- In the Portuguese translation,
the word "nutshell" is left untranslated. This is interesting,
compared with the fact that in the French translation the
phrase "in a nutshell" has been replaced by the idiomatic
equivalent "en un mot" (literally, 'in a word').
- In the Spanish translation,
the name "Räisänen" has been changed to "R5ais5anen", i.e.
the letter ä has become 5a. This is obviously an error in the
way the translator processes data at the character level and
should be fixed there. On the other hand, modifying the document
by presenting ä as the
circumvents the problem.
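The character-level workaround in the last item can be sketched in Python. This is only an illustration under stated assumptions: the text does not preserve which notation was used for ä, so the sketch assumes named HTML entities (taken from Python's standard `html.entities` table) with numeric character references as a fallback. The function name `escape_non_ascii` is hypothetical.

```python
from html.entities import codepoint2name


def escape_non_ascii(text):
    """Replace each non-ASCII character with an HTML character reference.

    Named entities (e.g. &auml;) are used where the HTML 4 entity table
    has one; other characters become numeric references (e.g. &#337;).
    ASCII characters are passed through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#" + str(cp) + ";")
    return "".join(out)


print(escape_non_ascii("Räisänen"))
```

With such a preprocessing step the translator only ever sees ASCII input. Whether a given translator handles character references correctly is a separate question that would need to be tested.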
In order to solve some of the problems detected,
I constructed an experimental modified document
by applying the methods described in the first section
(guidelines on natural language usage and
guidelines on HTML markup).
The French translation of the modified document
is considerably better than that of the original.
The remaining flaws (such as
"professeurs d'associé" instead of
"professeurs associés") are probably things that can be fixed
only by improving the translation program.
Notes on the changes:
- The original document twice uses an
ADDRESS element, which appears to trigger a
"don't translate this" mode in Babelfish.
The illogical use of
ADDRESS for something that really
isn't a normal address thus causes unwanted phenomena in
translation. In the first occurrence of
ADDRESS, the tags were simply removed.
In the latter case, they were replaced by markup that suggests a smaller font;
it seems natural that technical information about
the maintenance of a document should appear in a smaller font.
- On the other hand, some other parts of the text
(and one part which was in an
ADDRESS element, namely the
abbreviation HUT) need to be
protected from any attempt to translate them.
This was done using markup that Babelfish leaves untranslated;
it has the drawback that words so marked are presented in a monospaced
("typewriter") font on many browsers by default. Style sheets are
used to suggest another rendering, small-caps.
- The logo of the university is used in the original page as
one image, which contains the logo symbol and the name of
the university (in English). Naturally, texts presented in
images are not translated by automatic translation programs
(although something like that might be done, in principle).
Therefore, using an image which contains the logo symbol only
and the name as text leads to better translatability.
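The protection of untranslatable words described in the notes above could be automated roughly as follows. This is a minimal sketch, not the method actually used for the example page: the text does not preserve which element was chosen, so the default tag `code` is an assumption, and the helper name `protect_terms` is hypothetical.

```python
import re
from html import escape


def protect_terms(text, terms, tag="code"):
    """Wrap each listed term in <tag>...</tag> and HTML-escape the rest.

    `tag` defaults to "code" purely as an assumption; substitute whatever
    element the translator in question is known to leave untranslated.
    """
    if not terms:
        return escape(text)
    # Try longer terms first so that a long term wins over its own prefix.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    # Match whole words only, so "HUT" is not protected inside "HUTCH".
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    parts = []
    pos = 0
    for m in pattern.finditer(text):
        parts.append(escape(text[pos:m.start()]))
        parts.append("<" + tag + ">" + escape(m.group(0)) + "</" + tag + ">")
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


print(protect_terms("HUT is a university.", ["HUT"]))
```

To counter the default monospaced rendering, a style sheet rule on the same element using the CSS declaration `font-variant: small-caps` would suggest the small-caps presentation mentioned above.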
You may wish to compare the presentation of the
modified document (in English)
on your browser
with an image
of what it looks like in one browsing situation:
viewed on Internet Explorer 4.0 with style sheet support on.
(It isn't quite what it should be, due to deficiencies
in style sheet support.)
You may wish to look at the other translations of the modified document:
The Portuguese translation is the most problematic.
In addition to the
"nutshell" problem mentioned above,
the change of the English spelling "vicerector" to
"vice-rector" caused a new problem:
it's now translated as