The Unicode 3.0 standard specifies "line breaking behavior" of characters in a manner which I find very confusing. This document tries to present the ideas the way I've understood them, and partly criticizes them. The reader is assumed to have a basic understanding of what Unicode is.
This document is partly out of date, now that Unicode 4.0 has been published. However, I think the changes do not substantially affect the basic points in this document.
There are several different descriptions of "line breaking behavior" in the Unicode standard. In subchapter 5.15 of the printed book, there is a description under the heading Line Boundaries, on pp. 129-132. It describes some "Character Classes" and says:
Note: For a precise specification of these classes, see Unicode Technical Report #14, "Line Breaking Properties," on the CD-ROM or the up-to-date version on the Unicode Web site. (The classes have slightly different names here for consistency.)
In fact, the names (or identifiers) of classes are very confusingly different, like "Insep" vs. "IN Inseparable", and there are many more classes in the Report (now called UAX #14). The book presents the line breaking rules in a semi-formalized manner which uses strange symbols (explained on p. 125): ÷ means a break is allowed, × means a break is not allowed, and some additional rules related to spaces are given in prose. The Report contains similar rules, in a more formalized notation, and naturally different in content since there are more classes. The Report also contains the chapter 7 Pair-table Based Implementation with a tabular presentation of some of the rules.
The Report alternately uses the words "class" and "line breaking property". It lists, in various ways, characters which have a given property. But it seems that the intended idea is that the ultimate reference for "classification" in this respect is the Unicode character database, more specifically the LineBreak.txt file.
Note: Most character properties are specified in the UnicodeData.txt file, to be read according to UnicodeData File Format. For some reason, line breaking properties have not been integrated into that file.
It seems that the intended authoritative specification of line-breaking properties (both normative and informative) consists of the first part (before Table 1) of chapter 2 Definitions and chapter 6 Line Breaking Algorithm of UAX #14, together with the LineBreak.txt file. The former describes the rules in terms of LineBreak properties; the latter assigns a LineBreak property to each character. All the rest is attempted explanations or illustrations.
Of course, that authoritative specification is rather formal, and some explanations, like verbal characterizations of the properties (classes), reasons for the rules, and examples of characters in each class, etc., might be useful. But this would require consistency in notations and a more logical structure of presentation.
The following is an extract of LineBreak.txt, covering the printable ISO Latin 1 (ISO 8859-1) character repertoire. In some cases, entries have been omitted, using an ellipsis (...) to indicate that a range of lines has the obvious content.
0020;SP # SPACE
0021;EX # EXCLAMATION MARK
0022;QU # QUOTATION MARK
0023;AL # NUMBER SIGN
0024;PR # DOLLAR SIGN
0025;PO # PERCENT SIGN
0026;AL # AMPERSAND
0027;QU # APOSTROPHE
0028;OP # LEFT PARENTHESIS
0029;CL # RIGHT PARENTHESIS
002A;AL # ASTERISK
002B;PR # PLUS SIGN
002C;IS # COMMA
002D;HY # HYPHEN-MINUS
002E;IS # FULL STOP
002F;SY # SOLIDUS
0030;NU # DIGIT ZERO
...
0039;NU # DIGIT NINE
003A;IS # COLON
003B;IS # SEMICOLON
003C;AL # LESS-THAN SIGN
003D;AL # EQUALS SIGN
003E;AL # GREATER-THAN SIGN
003F;EX # QUESTION MARK
0040;AL # COMMERCIAL AT
0041;AL # LATIN CAPITAL LETTER A
...
005A;AL # LATIN CAPITAL LETTER Z
005B;OP # LEFT SQUARE BRACKET
005C;PR # REVERSE SOLIDUS
005D;CL # RIGHT SQUARE BRACKET
005E;AL # CIRCUMFLEX ACCENT
005F;AL # LOW LINE
0060;AL # GRAVE ACCENT
0061;AL # LATIN SMALL LETTER A
...
007A;AL # LATIN SMALL LETTER Z
007B;OP # LEFT CURLY BRACKET
007C;BA # VERTICAL LINE
007D;CL # RIGHT CURLY BRACKET
007E;AL # TILDE
00A0;GL # NO-BREAK SPACE
00A1;AI # INVERTED EXCLAMATION MARK
00A2;PO # CENT SIGN
00A3;PR # POUND SIGN
00A4;PR # CURRENCY SIGN
00A5;PR # YEN SIGN
00A6;AL # BROKEN BAR
00A7;AI # SECTION SIGN
00A8;AI # DIAERESIS
00A9;AL # COPYRIGHT SIGN
00AA;AI # FEMININE ORDINAL INDICATOR
00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00AC;AL # NOT SIGN
00AD;BA # SOFT HYPHEN
00AE;AL # REGISTERED SIGN
00AF;AL # MACRON
00B0;PO # DEGREE SIGN
00B1;PR # PLUS-MINUS SIGN
00B2;AI # SUPERSCRIPT TWO
00B3;AI # SUPERSCRIPT THREE
00B4;BB # ACUTE ACCENT
00B5;AL # MICRO SIGN
00B6;AI # PILCROW SIGN
00B7;AI # MIDDLE DOT
00B8;AI # CEDILLA
00B9;AI # SUPERSCRIPT ONE
00BA;AI # MASCULINE ORDINAL INDICATOR
00BB;QU # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BC;AI # VULGAR FRACTION ONE QUARTER
00BD;AI # VULGAR FRACTION ONE HALF
00BE;AI # VULGAR FRACTION THREE QUARTERS
00BF;AI # INVERTED QUESTION MARK
00C0;AL # LATIN CAPITAL LETTER A WITH GRAVE
00C1;AL # LATIN CAPITAL LETTER A WITH ACUTE
00C2;AL # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3;AL # LATIN CAPITAL LETTER A WITH TILDE
00C4;AL # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5;AL # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6;AI # LATIN CAPITAL LETTER AE
00C7;AL # LATIN CAPITAL LETTER C WITH CEDILLA
00C8;AL # LATIN CAPITAL LETTER E WITH GRAVE
00C9;AL # LATIN CAPITAL LETTER E WITH ACUTE
00CA;AL # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB;AL # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC;AL # LATIN CAPITAL LETTER I WITH GRAVE
00CD;AL # LATIN CAPITAL LETTER I WITH ACUTE
00CE;AL # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF;AL # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0;AI # LATIN CAPITAL LETTER ETH
00D1;AL # LATIN CAPITAL LETTER N WITH TILDE
00D2;AL # LATIN CAPITAL LETTER O WITH GRAVE
00D3;AL # LATIN CAPITAL LETTER O WITH ACUTE
00D4;AL # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5;AL # LATIN CAPITAL LETTER O WITH TILDE
00D6;AL # LATIN CAPITAL LETTER O WITH DIAERESIS
00D7;AI # MULTIPLICATION SIGN
00D8;AI # LATIN CAPITAL LETTER O WITH STROKE
00D9;AL # LATIN CAPITAL LETTER U WITH GRAVE
00DA;AL # LATIN CAPITAL LETTER U WITH ACUTE
00DB;AL # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC;AL # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD;AL # LATIN CAPITAL LETTER Y WITH ACUTE
00DE;AI # LATIN CAPITAL LETTER THORN
00DF;AI # LATIN SMALL LETTER SHARP S
00E0;AI # LATIN SMALL LETTER A WITH GRAVE
00E1;AI # LATIN SMALL LETTER A WITH ACUTE
00E2;AL # LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3;AL # LATIN SMALL LETTER A WITH TILDE
00E4;AL # LATIN SMALL LETTER A WITH DIAERESIS
00E5;AL # LATIN SMALL LETTER A WITH RING ABOVE
00E6;AI # LATIN SMALL LETTER AE
00E7;AL # LATIN SMALL LETTER C WITH CEDILLA
00E8;AI # LATIN SMALL LETTER E WITH GRAVE
00E9;AI # LATIN SMALL LETTER E WITH ACUTE
00EA;AI # LATIN SMALL LETTER E WITH CIRCUMFLEX
00EB;AL # LATIN SMALL LETTER E WITH DIAERESIS
00EC;AI # LATIN SMALL LETTER I WITH GRAVE
00ED;AI # LATIN SMALL LETTER I WITH ACUTE
00EE;AL # LATIN SMALL LETTER I WITH CIRCUMFLEX
00EF;AL # LATIN SMALL LETTER I WITH DIAERESIS
00F0;AI # LATIN SMALL LETTER ETH
00F1;AL # LATIN SMALL LETTER N WITH TILDE
00F2;AI # LATIN SMALL LETTER O WITH GRAVE
00F3;AI # LATIN SMALL LETTER O WITH ACUTE
00F4;AL # LATIN SMALL LETTER O WITH CIRCUMFLEX
00F5;AL # LATIN SMALL LETTER O WITH TILDE
00F6;AL # LATIN SMALL LETTER O WITH DIAERESIS
00F7;AI # DIVISION SIGN
00F8;AI # LATIN SMALL LETTER O WITH STROKE
00F9;AI # LATIN SMALL LETTER U WITH GRAVE
00FA;AI # LATIN SMALL LETTER U WITH ACUTE
00FB;AL # LATIN SMALL LETTER U WITH CIRCUMFLEX
00FC;AI # LATIN SMALL LETTER U WITH DIAERESIS
00FD;AL # LATIN SMALL LETTER Y WITH ACUTE
00FE;AI # LATIN SMALL LETTER THORN
00FF;AL # LATIN SMALL LETTER Y WITH DIAERESIS
The LineBreak.txt file is of a rather simple format, described at the beginning of the file (on lines beginning with #). Each entry consists of one line, containing three fields: Unicode value (code number, four hexadecimal digits); LineBreak property, two characters; and Unicode name (which is purely comment-like here, since the code number identifies the character).
So, for example, the entry 00B0;PO # DEGREE SIGN says that Unicode character U+00B0 (which has the name DEGREE SIGN) has PO as the LineBreak property, i.e. belongs to class PO (which is, by the way, abbreviated from the word "postfix" - not very mnemonic, is it?).
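As a rough illustration of how simple the entry format is, the following Python sketch reads such lines into a lookup table. The function names parse_entry and load_properties are made up for this example, and the sketch accepts both the "#" and the ";" form of the name field seen in this document. It does not handle the range entries mentioned in the file header.

```python
def parse_entry(line):
    """Split one data line into (code point, LineBreak property, name)."""
    # Accept both "00B0;PO # DEGREE SIGN" and "00B0;PO;DEGREE SIGN".
    data, _, comment = line.partition("#")
    fields = [field.strip() for field in data.split(";")]
    code, prop = fields[0], fields[1]
    name = fields[2] if len(fields) > 2 and fields[2] else comment.strip()
    return int(code, 16), prop, name


def load_properties(lines):
    """Build a {code point: property} table, skipping comments and blanks."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, prop, _name = parse_entry(line)
        table[code] = prop
    return table


sample = [
    "# a comment line, as at the beginning of the file",
    "0025;PO # PERCENT SIGN",
    "00B0;PO # DEGREE SIGN",
    "0037;NU;DIGIT SEVEN",
]
table = load_properties(sample)
print(table[0x00B0])
```

The name field is returned by parse_entry but not stored, which matches its comment-like role: the code number alone identifies the character.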
The description at the beginning of the file gives the following additional information:
Last>are omitted, as in
UnicodeData.txt. For example, the following means that all characters between 3400 and 4DB5 have the value
See chapters 3 Conformance and 4 Character Properties of the printed standard for definitions of "normative" and "informative". It is hard to say what they really mean. On p. 73 the standard seems to say that applying normative properties is an absolute requirement for conforming implementations whereas informative properties can be freely overridden. The natural idea is that informative properties are just suggested defaults. But on p. 40 the standard says: "Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics". This is rather vague; which normative behavior is overridable? Specifically, what about the normative line breaking properties?
In principle, it is relatively straightforward to apply the rules, once we know what they really are. Consider for example the question whether line breaks are allowed within the string /%7ej. The LineBreak properties of the characters in it are found from LineBreak.txt:
002F;SY;SOLIDUS
0025;PO;PERCENT SIGN
0037;NU;DIGIT SEVEN
0065;AL;LATIN SMALL LETTER E
006A;AL;LATIN SMALL LETTER J
Then, taking the characters in order and applying the rules in Line Breaking Algorithm in order (as they have been specified to apply), we find:
before %, a line break is permitted, since no rule forbids it and the last rule, LB 20, says "break everywhere else"
before 7, a line break is permitted on the same grounds
before e, no line break is allowed, according to rule LB 17 (the pair here is NU followed by AL)
before j, no line break is allowed, according to rule LB 19 (the pair here is AL followed by AL)
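The walk-through above can be sketched in Python as follows. This is only a toy: the rule list contains just the three rules named in the text (LB 17, LB 19 and LB 20), in that order, and the way each rule is written as a pair of properties is my reading of the example, not a transcription of UAX #14. The names PROPS, RULES, decide and break_report are invented for the sketch.

```python
# LineBreak properties of the five example characters, as listed above.
PROPS = {"/": "SY", "%": "PO", "7": "NU", "e": "AL", "j": "AL"}

# Toy rule list in precedence order: (name, property before, property after,
# break allowed?). None matches any property. Only the rules named in the
# text are included; the real algorithm has many more.
RULES = [
    ("LB 17", "NU", "AL", False),
    ("LB 19", "AL", "AL", False),
    ("LB 20", None, None, True),   # "break everywhere else"
]


def decide(a, b):
    """Return (rule name, allowed) for a break between characters a and b."""
    pa, pb = PROPS[a], PROPS[b]
    for name, before, after, allowed in RULES:
        if before in (None, pa) and after in (None, pb):
            return name, allowed
    raise AssertionError("unreachable: LB 20 matches every pair")


def break_report(text):
    """List (character, rule, allowed) for the position before each character."""
    return [(b, *decide(a, b)) for a, b in zip(text, text[1:])]


for char, rule, allowed in break_report("/%7ej"):
    verdict = "break allowed" if allowed else "no break"
    print(f"before {char!r}: {verdict} ({rule})")
```

Run on the five characters, the sketch reports breaks allowed before % and 7 (both by LB 20) and no break before e (LB 17) or j (LB 19), matching the list above.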
At each step, when considering whether a line break is permitted between two consecutive characters A and B, we need to consider the LineBreak properties of both characters. A rule that prohibits a line break before B might appear earlier in the list of rules (i.e. have higher precedence) than a rule that permits a line break after A. In such a case, a line break is not permitted. To take a trivial example, if there are two consecutive hyphen-minus characters, then a line break between them is not allowed. Rule LB 15b allows a break after hyphen-minus (HY ÷), but rule LB 15 forbids a break before hyphen-minus (× HY). It is the order of the rules (which is a precedence order) that decides which rule "wins", not the order of the characters under study.
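The effect of precedence can be shown with a small Python sketch that tries the rules in list order and stops at the first match. Only the two rules discussed here are modelled, each reduced to the single HY case mentioned in the text; the function first_match and its fallback for unmatched pairs are assumptions made for the illustration.

```python
def first_match(rules, pa, pb):
    """Return (rule name, allowed) from the first rule matching the pair."""
    for name, before, after, allowed in rules:
        if before in (None, pa) and after in (None, pb):
            return name, allowed
    # Hypothetical fallback for this sketch only: nothing matched.
    return None, True


# Each rule: (name, property before, property after, break allowed?).
LB15 = ("LB 15", None, "HY", False)    # × HY: no break before hyphen-minus
LB15B = ("LB 15b", "HY", None, True)   # HY ÷: break after hyphen-minus

# Two consecutive hyphen-minus characters, rules in their specified order.
print(first_match([LB15, LB15B], "HY", "HY"))
# The same pair with the order of the rules swapped, for comparison.
print(first_match([LB15B, LB15], "HY", "HY"))
# Hyphen-minus followed by a letter: LB 15 does not match, so LB 15b applies.
print(first_match([LB15, LB15B], "HY", "AL"))
```

With the rules in their specified order the pair HY HY is decided by LB 15 and no break is allowed; swapping the two rules would flip the outcome, which is exactly why the order of the rules matters.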
I became interested in line breaking properties when I noticed that Internet Explorer 4 applies some "interesting" methods when formatting an HTML document for display. For example, it can
break a string like
This doesn't make much sense, but it formally complies with the Unicode line breaking rules. So the first issue to be considered is that the rules come with no warnings about applying them.
See my document Word division in IE for some dirty details. Note that some of the odd breaks (which have to some extent been fixed in IE 5) might reflect wrong interpretations of the line breaking rules. And this would not be surprising; it's easy to get confused with those rules.
In Latin and some other scripts, line breaks in positions other than after spaces (and explicit line and paragraph break control codes, of course) should be regarded as exceptions. But the standard more or less takes the position "break everywhere" except when specifically forbidden. And those exceptions are just rough guesses, really. The rules as a whole forbid quite a few perfectly reasonable line breaks, like before the solidus in "it's in directory /usr/spool", and allow some really absurd breaks, like after either solidus in our example.
It is very confusing to see a string broken into lines just because some mechanical rules have allowed it. A string which is a mixture of Latin letters, digits, and various symbols is most probably part of some special notation, such as a URL, or a variable in a programming language, or some code. It is unacceptable to have a string like foo%bar broken, especially when it occurs with no indication of what has happened. It can even distort information or corrupt data.
It would probably be best to remove all prohibitions against line breaks after spaces. After all, the no-break space can be used instead of a normal space in such cases, or language-specific higher-level protocols can be applied (e.g. to prevent line breaks in French text between a space and a question mark).
Normally a sequence of characters should not be split across two lines when applying generic (language-independent) rules, unless it is clearly justified on the basis of the usage properties of the characters. This means that basically just a space is an allowed line break point in a Latin script, though a few punctuation characters (mainly hyphens and dashes) might be considered too. For quite a few characters used in non-Latin scripts, line breaks are accepted before and after in any known usage, so the Unicode standard should just list them.
Rules for "emergency breaks", like breaking a URL across two lines to prevent absurdly long lines, might be developed. But they would belong to higher protocol levels. And it is always a matter of judgement between two evils, like allowing a (somewhat) overfull line versus splitting a string that should not be broken, or perhaps reducing font size. In particular, in URLs it should suffice to break a line after a solidus or an ampersand. And breaking a URL into several lines should always be accompanied by the use of suitable delimiters, as recommended in appendix E of RFC 2396.
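To make the idea concrete, here is a minimal Python sketch of such an emergency break for URLs. It is not a proposal for a standard rule, just one possible higher-level policy: it breaks only after a solidus or an ampersand, prefers an overfull line to a break anywhere else, and assumes angle brackets as the delimiters. The function name break_url and the width parameter are invented for the example.

```python
import re

# Break opportunities: only directly after a solidus or an ampersand.
BREAK_AFTER = re.compile(r"[/&]")


def break_url(url, width):
    """Split a delimited URL into lines of at most `width` characters
    where possible, breaking only after '/' or '&'."""
    text = "<" + url + ">"          # delimiters, assumed to be angle brackets
    lines = []
    while len(text) > width:
        cuts = [m.end() for m in BREAK_AFTER.finditer(text)]
        fitting = [c for c in cuts if c <= width]
        later = [c for c in cuts if width < c < len(text)]
        if fitting:
            cut = fitting[-1]       # last break point that still fits
        elif later:
            cut = later[0]          # accept an overfull line instead
        else:
            break                   # no break point left; keep the rest whole
        lines.append(text[:cut])
        text = text[cut:]
    lines.append(text)
    return lines


for line in break_url("http://www.example.com/some/long/path?a=1&b=2", 24):
    print(line)
```

Joining the returned lines gives back the delimited URL unchanged, so a reader who sees the opening and closing delimiters can reassemble it without guessing.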
Date of creation: 2000-10-11. Last update: 2000-10-11.
Minor corrections 2002-04-08.
Added an extract of