The Effect of Computer Technology on the Design of Artificial Languages

Preface

The design of artificial languages such as Esperanto is commonly regarded as a fruitless effort, since such languages do not tend to gain much popularity. However, the objective need for them is great and increasing. Computers and information technology have made it possible to process languages automatically. Automatic translation is one aspect of this, and it would benefit from the use of a clearly defined and structured intermediate language. Even more important aspects are the automatic conversion of texts between spoken and written form, and the need for a suitable language for man-machine interaction. It would be ideal to use a well-designed language (with the expressive power of natural languages), or subsets of it, as the control languages of computer programs.

Computers can be programmed to process complicated and irregular languages, but computational efficiency is a very important issue, since we wish to process large amounts of text (or speech) and since the processing inherently requires resources which are large even compared with the capabilities of modern computers. For instance, World Wide Web search engines operate on a very large and rapidly growing amount of textual information.

This paper will mainly discuss the effect of computer technology on the design of artificial languages. There are, of course, many other aspects, which will be discussed only briefly.

Definability

A language suitable for automatic processing should be defined formally as far as possible. A formal description of the syntax would make it easier to write software for both analyzing and generating the language. An official dictionary should use a rigorously defined formalism to indicate properties of words, such as word class and the transitiveness of verbs. Ideally, a subset of the language itself should act as the metalanguage.

Some features of semantics could be defined formally. This applies in particular to the meanings of derivative suffixes: they should be expressed using a notation which specifies the meaning as an analytic expression.
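
As a rough illustration, the following Python sketch records the meaning of each suffix as a function from the meaning of the base word to the meaning of the derived word. The suffix forms and glosses here are invented for the example, not proposals for an actual vocabulary.

    # A minimal sketch: the semantics of derivative suffixes as analytic expressions.
    # The suffix forms ("ist", "ej") and the glosses are hypothetical examples.
    suffix_semantics = {
        "ist": lambda base: f"person professionally occupied with {base}",
        "ej": lambda base: f"place intended for {base}",
    }

    # The meaning of a derived word is computed from the meaning of the base word,
    # instead of being listed separately in a dictionary.
    print(suffix_semantics["ist"]("music"))     # person professionally occupied with music
    print(suffix_semantics["ej"]("learning"))   # place intended for learning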

Modularity

In compiler technology it is customary to make a clear distinction between lexical, syntactic, and semantic analysis. This approach should be applied to the construction of an artificial language:

Alphabet

Texts are produced, transmitted, and recorded using computers in rapidly increasing amounts. The most common character set is still ASCII, which contains the English letters (A-Z, a-z) and some punctuation and special characters. Although wider character sets have been defined and standardized, general support for them cannot yet be assumed.

Thus, the alphabet for an artificial language should be the English alphabet or a subset of it, without any diacritic marks.

Phonetics

Automatic generation and recognition of speech are now technically possible, and future development will soon make them economically feasible in a large number of applications. The effort needed to generate, and especially to recognize, speech depends strongly on the regularity of the phonetic structure of the language.

Several phonetic features are desirable. One question is the treatment of vowel length:

The normal principle in modern Indo-European languages is that the vowel of an open stressed syllable is somewhat longer than other vowels, and there is no technical objection to this. It provides additional support for distinguishing the stressed syllable of a word and is easy to generate. However, since the stressed syllable is the first one, this would lead to unfamiliar pronunciation of familiar words in many cases.

Grammatical categories

Let us first consider the need for grammatical categories such as number, gender, and tenses, without yet considering how they are to be expressed.

Grammatical gender is definitely an atavistic feature. It is even undesirable to have to express the natural sex of a being unless it is relevant in the context. For example, there should be a neutral (or more exactly "utral") pronoun that covers both "he" and "she".

Number is astonishingly often an unnecessary or even harmful category. Consider how often one has to say something like "one or more" if one wants to be exact. And consider how often number is specified by means other than grammatical ones; in a phrase like "two horses" the plural ending "-s" is of course redundant. Also notice the illogical use of the plural in questions like "How many horses have you got?"; the answer might well be "one" or "none", not a plural numeral. We can easily dispense with number as a category. Whenever desired, one can use an auxiliary word denoting "one" or "more than one".

Analogous reasoning might lead us to omit tenses as well. For instance, in the sentence "yesterday I worked ten hours" the grammatical element "-ed" is as redundant as "-s" above, because of the adverbial "yesterday". Thus one could well have the rule that the one and only finite form of a verb does not as such specify the time in any way, so that e.g. the past tense is expressed by an adverbial that refers to the past; one should of course have a one-word adverbial that means "in the past" and can be used when a more specific one is not available. However, it is useful to have the category of tense for participles, and it would be irregular not to have it for finite forms. More importantly, it would be difficult to express complex temporal relations (e.g. "I will have written") without tenses; using the three natural tenses (past, present, and future), Esperanto makes it possible to express them nicely.

Modes of verbs are hardly needed, in general, since the desired meanings can be expressed using adverbs, different conjunctions for various types of sentences, etc. Modes like the subjunctive are difficult to learn and to use in languages like French, and they seldom carry any useful meaning. The only exception to a "modeless" system of conjugation could be the imperative. The imperative could be used to denote an impersonal instruction or suggestion, as opposed to personal commands, wishes, etc., which require delicate distinctions of degrees of politeness and imperativeness, best expressed using adverbs.

Definiteness (as expressed by the indefinite and definite articles of many languages) may appear to be a necessary category. However, articles are often entirely redundant, and the exact rules for using articles are among the most difficult features of English grammar. The following solution is suggested: an indefinite and a definite article exist, but they are to be used only when it is desired to explicitly designate a being as "previously unknown" (indefinite) or as "known" (definite). In phrases like "the sun" or "the best of all possible worlds" the article is redundant, and in phrases corresponding to "I saw a fox" or "I saw the fox" an article should be used only if one wants to be explicit about the matter.

Analytic or synthetic

Obviously the language should have a systematic grammatical structure which is easy to recognize and generate automatically. It is not obvious, however, whether this is better achieved by analytic or by synthetic methods.

It is often said that languages develop towards analyticity, but this is probably partly due to the irregularities of synthetic methods. More regularly synthetic languages, such as the Fenno-Ugrian ones, have tended to preserve and even extend the synthetic features of their grammars.

From the viewpoint of automatic processability there is no strong objection to synthetic methods. On the contrary, congruence may ease the task of recognizing the grammatical relations between words of a sentence. Moreover, word derivation is essentially synthetic, and the distinction between derivation and inflexion is to a great extent a mere convention.

However, a purely analytic language would be slightly better for automatic processing. This applies not so much to computer programs as to their human users: it is easier to write a command for searching for a word if one knows that the word always occurs in exactly the same spelling.

If synthetic methods are introduced, they must be regular: an affix shall not depend on the base word and shall not affect its form.

A very important aspect is that it should be possible to "parse" a word into morphemes "mechanically", without any semantic or lexical information, solely on the basis of knowledge about the possible affixes. Concretely, base words should not have a beginning or an ending which is the same as a derivational prefix or suffix, respectively. Thus, if we have the prefix "re-" (as in Latin and in many international words), then no base word and no other prefix should begin with "re-". This is not easy to achieve. Actually, conflicting prefixes or suffixes like "re-" and "retro-" are not so problematic - we can simply recognize the longest possible affix. We might allow some conflicts to be resolved by (small) vocabularies of exceptional words (e.g. base words beginning with "re-").

There is, however, a very simple approach which solves the essential problems: use only one way of attaching affixes - and suffixing is the obvious approach, since it is more productive in current languages. Thus a word would be morphematically parsed "backwards": simply check if the word ends with a suffix of the language, remove the suffix and apply the same test to the rest of the word etc., until no suffix can be found. Then one can lexically check that the remaining word exists in the basic vocabulary; if not, the word is assumed to be a "foreign" word (a proper noun). Notice that the "suffixing only" principle also removes a semantic problem: assuming that we have a word of the form prefix+base+suffix, is it to be understood as (prefix+base)+suffix or as prefix+(base+suffix)? (Such issues are often real problems. For instance, many people misunderstand the word "atheism" as if it consisted of the negation prefix "a-" and the word "theism"!)
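
To make the procedure concrete, here is a minimal Python sketch of such backwards parsing. The suffix inventory and base vocabulary are invented for the example, and the longest-match rule mentioned above is applied when several suffixes could match.

    # A minimal sketch of "suffixing only" morphological parsing.
    # The suffixes and base vocabulary are hypothetical examples.
    SUFFIXES = {"no", "ta", "mi"}           # each of the form consonant(s) + vowel
    BASE_WORDS = {"kanto", "viro", "domo"}  # each ends in a vowel

    def parse(word):
        """Strip suffixes from the end until a known base word remains."""
        found = []
        while word not in BASE_WORDS:
            for s in sorted(SUFFIXES, key=len, reverse=True):  # longest match first
                if word.endswith(s) and len(word) > len(s):
                    found.append(s)
                    word = word[:-len(s)]
                    break
            else:
                return None  # no suffix matches: assume a "foreign" word
        return word, list(reversed(found))

    print(parse("kantotami"))  # ('kanto', ['ta', 'mi'])
    print(parse("London"))     # None: treated as a foreign word (proper noun)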

Some very simple and easily definable and recognizable prefixes could be accepted (e.g. the prefix "non-", denoting simple negation of the rest of the word).

Suffixal inflexion and derivation according to the above-mentioned principles require that each word (excluding those particles which are not used as base words) end in a manner which allows agglutinative suffixing, i.e. suffixing which involves no change in the base word or in the suffix. An obvious way of achieving this is that each word (or, for verbs, the stem of the word) ends with a vowel and a suffix normally consists of one or more consonants followed by a vowel.
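
Under these constraints the morpheme boundaries are recoverable from shape alone, as the following Python sketch checks. The five-vowel inventory a, e, i, o, u is an assumption made only for the example.

    # A minimal sketch of the phonotactic constraints on base words and suffixes.
    # The five-vowel inventory is an assumption for the example.
    import re

    VOWEL = "[aeiou]"
    CONSONANT = "[b-df-hj-np-tv-z]"

    def valid_base(word):
        """A base word (or verb stem) must end with a vowel."""
        return re.fullmatch(rf".*{VOWEL}", word) is not None

    def valid_suffix(suffix):
        """A suffix consists of one or more consonants followed by a vowel."""
        return re.fullmatch(rf"{CONSONANT}+{VOWEL}", suffix) is not None

    print(valid_base("kanto"), valid_suffix("ta"))  # True True
    print(valid_base("kant"), valid_suffix("ata"))  # False False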

It should be as easy as possible to recognize the constituents of a sentence, such as subject, predicate, and adverbials. This is often difficult in natural languages, and sentences may be genuinely ambiguous, so that the decomposition into constituents is ultimately based on semantic reasoning. This causes serious problems for nonnative users of the language and even more serious problems for computer programs.

A simple principle which makes the recognition of constituents easier is that whenever there are adjacent nominal clauses (with no prepositions between them) they are "codeterminative", i.e. they form a composite expression which denotes the entity determined by them together. This means that we need not have a distinction between nouns and adjectives. Semantically a word can have a wide range of meanings; when the context is clear, a word can be used as such to denote a specific (context-defined) entity; and when desired, additional codeterminative words can be added to select a particular meaning. Thus, a word might mean "good" both as a noun (a good person or a good thing) and as an adjective.

Notice that the principle of codeterminativeness requires a grammatical sign for the object. In an analytic language, the sign would be a preposition corresponding to the accusative case. (One might even consider marking the subject, to make the structure more regular.)

We can define the language so that an adverbial which begins with a preposition consists of all nouns following the preposition, up to (and excluding) the end of the sentence, or a preposition, or a finite form of a verb, whichever occurs first.
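
The rule is simple enough to state as a short program. In the following Python sketch, the word categories are looked up from small example sets; in the language itself they would be recognizable from the word forms, and the example words are hypothetical English stand-ins.

    # A minimal sketch of the adverbial rule; categories are given directly.
    PREPOSITIONS = {"in", "with"}
    FINITE_VERBS = {"walked"}

    def adverbials(tokens):
        """Group each preposition with all the nouns that follow it, stopping
        at the end of the sentence, the next preposition, or a finite verb."""
        result, i = [], 0
        while i < len(tokens):
            if tokens[i] in PREPOSITIONS:
                j = i + 1
                while (j < len(tokens)
                       and tokens[j] not in PREPOSITIONS
                       and tokens[j] not in FINITE_VERBS):
                    j += 1
                result.append(tokens[i:j])
                i = j
            else:
                i += 1
        return result

    print(adverbials("I walked in big forest with friend".split()))
    # [['in', 'big', 'forest'], ['with', 'friend']]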

Word order

Rules for word order can be set up relatively freely. For the automatic generation of sentences, a strict word order would be easier. Languages with "free" word order use variations in the order to indicate nuances. For automatic processing, it would be better to express nuances by using adverbs and by having partial synonyms for important words. Adverbs are definitely to be preferred, since synonyms cause serious problems in search operations: it is difficult to search for texts discussing a phenomenon if there is a large set of alternative words for it.

From the computational point of view, the most natural word order would be VSO (verb, subject, object), since that corresponds to the normal syntax of subprogram calls. Both the subject and the objects (direct and indirect) are comparable to arguments of a subprogram call, whereas the predicate verb corresponds to the subprogram name. Adverbials can be regarded as optional arguments, so they should logically appear after other arguments. Notice that normal imperative sentences, which are so common in languages for controlling computers, have the form VSO with the subject omitted.
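
As an illustration of the analogy (a sketch only; the verb and arguments are invented for the example), a sentence in this word order reads exactly like a call:

    # A minimal sketch of the analogy between VSO word order and a subprogram call.
    def sentence(verb, subject, obj, *adverbials):
        """The verb plays the role of the subprogram name; subject and object
        are its required arguments, adverbials its optional trailing ones."""
        return " ".join([verb, subject, obj, *adverbials])

    print(sentence("see", "I", "fox", "in forest"))  # see I fox in forest

    # An imperative simply omits the subject, like an operating system command:
    # "copy file1 file2".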

Word derivation

Word derivation should be extensive, to keep vocabularies small, and it should generally be based on suffixes. Each suffix is defined by its actual phonetic (and literal) appearance, its role as deverbal or denominal, the class of the derived word (noun or verb), and its semantic meaning, either as a function of the meaning of the base word or as "to be defined". The latter option means that there is no generic predefined meaning; in that case, the meaning of each word formed using the suffix must be defined separately and listed in a dictionary. (In natural languages, diminutive suffixes are typically morphemes which are thought of as indicating smallness only, but in fact they usually belong to the latter category. A cigarette is not a small cigar in reality, just metaphorically.)
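
For illustration, the four defining properties of a suffix could be recorded as follows. This is a Python sketch; the example suffixes are hypothetical.

    # A minimal sketch of a formal suffix definition with the four properties
    # listed above. The example suffixes are hypothetical.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Suffix:
        form: str                                # phonetic (and literal) appearance
        attaches_to: str                         # "verb" (deverbal) or "noun" (denominal)
        produces: str                            # class of the derived word
        meaning: Optional[Callable[[str], str]]  # analytic meaning, or None = "to be defined"

    agent = Suffix("nto", "verb", "noun", lambda base: f"one who {base}s")
    opaque = Suffix("cxo", "noun", "noun", None)  # derived words listed in the dictionary

    print(agent.meaning("sing"))  # one who sings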

Composite phrases consisting of two or more words should be preferred to word composition for several reasons.

Thus, word composition should be restricted to cases where a one-word term is obviously needed. It should not be regarded as a fully productive tool, and accepted compound words should be listed in vocabularies. Consequently, the semantics of word composition need not be as well defined as that of suffixes.

Concluding remarks

This article has outlined an optimal international language starting from the idea of automatic processability. For further design, it would be necessary to fix such things as the structure of morphemes and the basis of word creation. There have been several basic strategies of word creation in artificial languages. (Somewhat surprisingly, the one that would most naturally suggest itself in the modern world, using English vocabulary as the basis, is one of the rarest.)

The Loglan language had several design criteria similar to those presented here, probably mainly because they were considered useful for purely human communication as well. It has often been said that Loglan is too logical to gain popularity among human beings, since normal fluent speech does not follow logical forms. On the other hand, Loglan has a feature which is definitely unnatural: its vocabulary is very artificial. It might be a useful experiment to construct a language with a structure similar to that of Loglan but with a Latin- or English-based vocabulary.