Computers can be programmed to process complicated and irregular languages, but computational efficiency is a very important issue, since we wish to process large amounts of text (or speech) and since the processing inherently requires resources which are large even compared with the capabilities of modern computers. For instance, World Wide Web search engines operate on very large and rapidly growing amounts of textual information.
This paper will mainly discuss the effect of computer technology on the design of artificial languages. There are, of course, many other aspects, which will be discussed only briefly.
Some features of semantics could be defined formally. This applies in particular to the meanings of derivational suffixes: they should be expressed in a notation which specifies the meaning as an analytic expression.
Thus, the alphabet for an artificial language should be the English alphabet or a subset of it, without any diacritic marks.
The following features are desirable:
Grammatical gender is definitely an atavistic feature. It is even undesirable to have to express the natural sex of a being unless it is relevant in the context. For example, there should be a neutral (or more exactly "utral") pronoun that covers both "he" and "she".
Number is astonishingly often an unnecessary or even harmful category. Consider how often one has to say something like "one or more" if one wants to be exact. And consider how often number is specified by other than grammatical methods; in a phrase like "two horses" the plural ending "-s" is of course redundant. Also notice the illogical use of plural in questions like "How many horses have you got?"; the answer might well be "one" or "none", not a plural numeral. We can easily dispense with number as a category. Whenever desired one can use an auxiliary word denoting "one" or "more than one".
Analogous reasoning might lead us to omit tenses as well. For instance, in the sentence "yesterday I worked ten hours" the grammatical element "-ed" is as redundant as "-s" because of the adverbial "yesterday". Thus one could well have the rule that the one and only finite form of a verb does not as such specify the time in any way, so that e.g. past tense is expressed by having an adverbial that refers to the past; one should of course have a one-word adverbial that means "in the past", to be used when a more specific one cannot be used. However, it is useful to have the category of tense for participles, and it would be irregular not to have it for finite forms. More importantly, it would be difficult to express complex temporal relations (e.g. "I will have written") without tenses; using the three natural tenses (past, present, and future), Esperanto allows them to be expressed nicely.
Modes of verbs are hardly needed, in general, since the desired meanings can be expressed using adverbs, different conjunctions for various types of sentences, etc. Modes like the subjunctive are difficult to learn and to use in languages like French, and they seldom carry any useful meaning. The only exception to the "modeless" system of conjugation could be the imperative. The imperative could be used to denote an impersonal instruction or suggestion, as opposed to personal commands, wishes, etc., which require delicate distinctions of degrees of politeness and imperativeness, best expressed using adverbs.
Definiteness (as expressed by indefinite and definite articles in many languages) may appear to be a necessary category. However, articles are often entirely redundant, and the exact rules for using articles are among the most difficult features of English grammar. The following solution is suggested: an indefinite and a definite article exist, but they are to be used only when it is desired to explicitly designate a being as "previously unknown" (or indefinite) or as "known" (or definite). In phrases like "the sun" or "the best of all possible worlds" the article is redundant, and in phrases corresponding to "I saw a fox" or "I saw the fox" an article should be used only if one wants to be explicit about the matter.

Analytic or synthetic

Obviously the language should have a systematic grammatical structure which is easy to recognize and generate automatically. It is not obvious, however, whether this is better achieved by analytic or by synthetic methods.
It is often said that languages develop towards analyticity. But this is probably partly due to the irregularities in synthetic methods. Languages with more regular synthetic morphology, such as the Fenno-Ugrian ones, have tended to preserve and even extend synthetic features in their grammars.
From the viewpoint of automatic processability there is no strong objection to synthetic methods. On the contrary, congruence may ease the task of recognizing the grammatical relations between words of a sentence. Moreover, word derivation is essentially synthetic, and the distinction between derivation and inflexion is to a great extent a mere convention.
However, a purely analytic language would be slightly better for automatic processing. This applies not so much to computer programs as to their human users: it is easier to write a command for searching for a word if one knows that the word always occurs in exactly the same spelling.
If synthetic methods are introduced, they must be regular: the form of an affix shall not depend on the base word, and the affix shall not affect the form of the base word.
A very important aspect is that it should be possible to "parse" a word into morphemes "mechanically", without any semantic or lexical information, solely on the basis of knowledge about the possible affixes. Concretely, base words should not have a beginning or an ending which is the same as a derivational prefix or suffix, respectively. Thus, if we have the prefix "re-" (as in Latin and in many international words), then no base word and no other prefix should begin with "re-". This is not easy to achieve. Actually, conflicting prefixes such as "re-" and "retro-" are not so problematic: we can simply recognize the longest possible affix. We might allow some conflicts to be resolved by (small) vocabularies of exceptional words (e.g. base words beginning with "re-").
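The longest-match rule, together with a small exception vocabulary, can be sketched in a few lines. The prefix list and the exception word below are invented for illustration; a real language design would fix both.

```python
# Hypothetical sketch of longest-match prefix recognition.
# The prefixes and the exception list are invented examples.
PREFIXES = ["retro", "re", "non"]
EXCEPTIONS = {"real"}  # base words that merely look prefixed

def strip_prefix(word):
    """Return (prefix, rest), or (None, word) if no prefix applies."""
    if word in EXCEPTIONS:
        return None, word
    # Try the longest prefixes first, so "retro-" wins over "re-".
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            return p, word[len(p):]
    return None, word
```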
There is, however, a very simple approach which solves the essential problems: use only one way of attaching affixes, and suffixing is the obvious choice, since it is more productive in current languages. Thus a word would be parsed morphemically "backwards": simply check whether the word ends with a suffix of the language, remove the suffix, apply the same test to the rest of the word, and so on, until no suffix can be found. Then one can check lexically that the remaining word exists in the basic vocabulary; if not, the word is assumed to be a "foreign" word (a proper noun). Notice that the "suffixing only" principle also removes a semantic problem: assuming that we have a word of the form prefix+base+suffix, is it to be understood as (prefix+base)+suffix or as prefix+(base+suffix)? (Such issues are often real problems. For instance, many people misunderstand the word "atheism" as if it consisted of the negation prefix "a-" and the word "theism"!)
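The backward parse described above is straightforward to implement. A minimal sketch in Python, with an invented suffix set and base vocabulary:

```python
# Sketch of the "suffixing only" backward parse. The suffixes and
# base words are invented examples, not a proposal for the language.
SUFFIXES = {"ta", "no", "mi"}
BASES = {"domo", "labo"}

def parse_word(word):
    """Strip suffixes from the end, repeatedly, then check the base.
    Returns (base, [suffixes]), or (word, None) for a foreign word."""
    suffixes = []
    while True:
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)   # keep suffixes in surface order
                word = word[:-len(s)]
                break
        else:
            break  # no suffix found: stop stripping
    if word in BASES:
        return word, suffixes
    return word, None  # unknown base: assumed foreign word / proper noun
```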
Some very simple and easily definable and recognizable prefixes could be accepted (e.g. the prefix "non-", denoting simple negation of the rest of the word).
Suffixal inflexion and derivation according to the above-mentioned principles requires that each word (excluding such particles which are not used as base words) end in a manner which allows agglutinative suffixing, i.e. suffixing which involves no change in the base word or in the suffix. An obvious way of achieving this is that each word (or, for verbs, the stem of the word) ends with a vowel and a suffix normally consists of one or more consonants followed by a vowel.
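This constraint can be stated as a word-shape pattern. The sketch below assumes the plain ASCII alphabet advocated earlier; one simple shape that guarantees the property is (C*V)+, i.e. a word is a sequence of units each consisting of zero or more consonants followed by a vowel, so every word (and every suffix) ends in a vowel.

```python
import re

# Sketch of the phonotactic constraint, assuming a plain ASCII alphabet.
VOWEL = "[aeiou]"
CONS = "[bcdfghjklmnpqrstvwxyz]"
WORD = re.compile(f"(?:{CONS}*{VOWEL})+")

def well_formed(word):
    """True if the word fits the (C*V)+ shape, so that agglutinative
    suffixing (consonants + vowel) never produces a clash."""
    return WORD.fullmatch(word) is not None
```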
It should be as easy as possible to recognize the constituents of a sentence, such as subject, predicate, and adverbials. This is often difficult in natural languages, and sentences may be genuinely ambiguous, so that the decomposition into constituents is ultimately based on semantic reasoning. This causes serious problems for nonnative users of the language and even more serious problems for computer programs.
A simple principle which makes the recognition of constituents easier is that whenever there are adjacent nominal phrases (with no prepositions between them) they are "codeterminative", i.e. they form a composite expression which denotes the entity determined by them together. This means that we need not have a distinction between nouns and adjectives. Semantically a word can have a wide range of meanings; when the context is clear, a word can be used as such to denote a specific (context-defined) entity; and when desired, additional codeterminative words can be added to select a particular meaning. Thus, a word might mean "good" both as a noun (a good person or a good thing) and as an adjective.
Notice that the principle of codeterminativeness requires a grammatical sign for the object. In an analytic language, the sign would be a preposition corresponding to the accusative case. (One might even consider marking the subject, to make the structure more regular.)
We can define the language so that an adverbial which begins with a preposition consists of the preposition and all nouns following it, extending up to (but excluding) the next preposition or the next finite form of a verb, or to the end of the sentence, whichever comes first.
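This rule can be implemented as a single left-to-right scan. The sketch below takes pre-tagged words as input, since in the designed language the word class would be recognizable from the word's shape; the tags ('N' noun, 'P' preposition, 'V' finite verb) and the words themselves are invented.

```python
# Sketch of the adverbial rule: a phrase starts at a preposition and
# extends over the following nouns, ending at the end of the sentence,
# at the next preposition, or at a finite verb.
def adverbials(tagged):
    """Return the list of adverbial phrases in a tagged sentence."""
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1] == "P":
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "N":
                j += 1
            spans.append([word for word, _ in tagged[i:j]])
            i = j
        else:
            i += 1
    return spans
```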
From the computational point of view, the most natural word order would be VSO (verb, subject, object), since that corresponds to the normal syntax of subprogram calls. Both the subject and the objects (direct and indirect) are comparable to arguments of a subprogram call, whereas the predicate verb corresponds to the subprogram name. Adverbials can be regarded as optional arguments, so they should logically appear after other arguments. Notice that normal imperative sentences, which are so common in languages for controlling computers, have the form VSO with the subject omitted.
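The analogy can be made literal: a VSO sentence maps directly onto a call in which the verb names the subprogram, the subject and object are its positional arguments, and adverbials are optional trailing arguments. A toy sketch (all names invented):

```python
# Sketch of the VSO ~ subprogram-call analogy. The verb selects the
# "subprogram"; subject and object are arguments; adverbials are
# optional extras, so they appear last.
def interpret(verb, subject, obj=None, *adverbials):
    parts = [f"{subject} performs '{verb}'"]
    if obj is not None:
        parts.append(f"on {obj}")
    if adverbials:
        parts.append("(" + ", ".join(adverbials) + ")")
    return " ".join(parts)

# The sentence "give John book yesterday" (VSO plus an adverbial)
# becomes the call interpret("give", "John", "book", "yesterday").
```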
Composite phrases consisting of two or more words should be preferred to word composition for the following reasons:
The Loglan language had several design criteria similar to those presented here, probably mainly because they were considered useful for purely human communication as well. It has often been said that Loglan is too logical for human beings to gain popularity, since normal fluent speech does not follow logical forms. On the other hand, Loglan has a feature which is definitely unnatural: its vocabulary is very artificial. It might be a useful experiment to construct a language with a structure similar to that of Loglan but with a Latin- or English-based vocabulary.