The Unicode standard specifies “line breaking behavior” of characters in a very confusing manner. This document tries to present the ideas the way I’ve understood them, and partly criticizes them. The reader is assumed to have a basic understanding of what Unicode is.
Previously, there were different descriptions of “line breaking behavior” in the Unicode standard. They have now been collected into Unicode Standard Annex #14, Line Breaking Properties (UAX #14). Despite being issued as a separate document, it is an integral part of the standard.
The Annex discusses the line breaking rules in different ways. It is not obvious which parts are the ultimate definitions. Apparently, the longish chapter 5 (Line Breaking Properties) is explanatory, or “narrative” as it calls itself. Chapter 7, Pair-table Based Implementation, with a tabular presentation of some of the rules, looks descriptive, too: it explains a possible implementation.
It seems that the intended authoritative specification of line-breaking properties (both normative and informative) consists of the first part (before Table 1) of chapter 2 Definitions and the formalized rules in chapter 6 Line Breaking Algorithm of UAX #14, and the LineBreak.txt file. The former describes the rules in terms of LineBreak properties; the latter assigns a LineBreak property to each character. All the rest is attempted explanations or illustrations (and often just confusing).
The following is an extract of LineBreak.txt, covering the printable ISO Latin 1 (ISO 8859-1) character repertoire. In some cases, entries have been omitted, using ellipsis (...) to indicate that a range of lines has the obvious content.
0020;SP # SPACE
0021;EX # EXCLAMATION MARK
0022;QU # QUOTATION MARK
0023;AL # NUMBER SIGN
0024;PR # DOLLAR SIGN
0025;PO # PERCENT SIGN
0026;AL # AMPERSAND
0027;QU # APOSTROPHE
0028;OP # LEFT PARENTHESIS
0029;CL # RIGHT PARENTHESIS
002A;AL # ASTERISK
002B;PR # PLUS SIGN
002C;IS # COMMA
002D;HY # HYPHEN-MINUS
002E;IS # FULL STOP
002F;SY # SOLIDUS
0030;NU # DIGIT ZERO
...
0039;NU # DIGIT NINE
003A;IS # COLON
003B;IS # SEMICOLON
003C;AL # LESS-THAN SIGN
003D;AL # EQUALS SIGN
003E;AL # GREATER-THAN SIGN
003F;EX # QUESTION MARK
0040;AL # COMMERCIAL AT
0041;AL # LATIN CAPITAL LETTER A
...
005A;AL # LATIN CAPITAL LETTER Z
005B;OP # LEFT SQUARE BRACKET
005C;PR # REVERSE SOLIDUS
005D;CL # RIGHT SQUARE BRACKET
005E;AL # CIRCUMFLEX ACCENT
005F;AL # LOW LINE
0060;AL # GRAVE ACCENT
0061;AL # LATIN SMALL LETTER A
...
007A;AL # LATIN SMALL LETTER Z
007B;OP # LEFT CURLY BRACKET
007C;BA # VERTICAL LINE
007D;CL # RIGHT CURLY BRACKET
007E;AL # TILDE
00A0;GL # NO-BREAK SPACE
00A1;AI # INVERTED EXCLAMATION MARK
00A2;PO # CENT SIGN
00A3;PR # POUND SIGN
00A4;PR # CURRENCY SIGN
00A5;PR # YEN SIGN
00A6;AL # BROKEN BAR
00A7;AI # SECTION SIGN
00A8;AI # DIAERESIS
00A9;AL # COPYRIGHT SIGN
00AA;AI # FEMININE ORDINAL INDICATOR
00AB;QU # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
00AC;AL # NOT SIGN
00AD;BA # SOFT HYPHEN
00AE;AL # REGISTERED SIGN
00AF;AL # MACRON
00B0;PO # DEGREE SIGN
00B1;PR # PLUS-MINUS SIGN
00B2;AI # SUPERSCRIPT TWO
00B3;AI # SUPERSCRIPT THREE
00B4;BB # ACUTE ACCENT
00B5;AL # MICRO SIGN
00B6;AI # PILCROW SIGN
00B7;AI # MIDDLE DOT
00B8;AI # CEDILLA
00B9;AI # SUPERSCRIPT ONE
00BA;AI # MASCULINE ORDINAL INDICATOR
00BB;QU # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BC;AI # VULGAR FRACTION ONE QUARTER
00BD;AI # VULGAR FRACTION ONE HALF
00BE;AI # VULGAR FRACTION THREE QUARTERS
00BF;AI # INVERTED QUESTION MARK
00C0;AL # LATIN CAPITAL LETTER A WITH GRAVE
00C1;AL # LATIN CAPITAL LETTER A WITH ACUTE
00C2;AL # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3;AL # LATIN CAPITAL LETTER A WITH TILDE
00C4;AL # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5;AL # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6;AL # LATIN CAPITAL LETTER AE
00C7;AL # LATIN CAPITAL LETTER C WITH CEDILLA
00C8;AL # LATIN CAPITAL LETTER E WITH GRAVE
00C9;AL # LATIN CAPITAL LETTER E WITH ACUTE
00CA;AL # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB;AL # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC;AL # LATIN CAPITAL LETTER I WITH GRAVE
00CD;AL # LATIN CAPITAL LETTER I WITH ACUTE
00CE;AL # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF;AL # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0;AL # LATIN CAPITAL LETTER ETH
00D1;AL # LATIN CAPITAL LETTER N WITH TILDE
00D2;AL # LATIN CAPITAL LETTER O WITH GRAVE
00D3;AL # LATIN CAPITAL LETTER O WITH ACUTE
00D4;AL # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5;AL # LATIN CAPITAL LETTER O WITH TILDE
00D6;AL # LATIN CAPITAL LETTER O WITH DIAERESIS
00D7;AI # MULTIPLICATION SIGN
00D8;AL # LATIN CAPITAL LETTER O WITH STROKE
00D9;AL # LATIN CAPITAL LETTER U WITH GRAVE
00DA;AL # LATIN CAPITAL LETTER U WITH ACUTE
00DB;AL # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC;AL # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD;AL # LATIN CAPITAL LETTER Y WITH ACUTE
00DE;AL # LATIN CAPITAL LETTER THORN
00DF;AL # LATIN SMALL LETTER SHARP S
00E0;AL # LATIN SMALL LETTER A WITH GRAVE
00E1;AL # LATIN SMALL LETTER A WITH ACUTE
00E2;AL # LATIN SMALL LETTER A WITH CIRCUMFLEX
00E3;AL # LATIN SMALL LETTER A WITH TILDE
00E4;AL # LATIN SMALL LETTER A WITH DIAERESIS
00E5;AL # LATIN SMALL LETTER A WITH RING ABOVE
00E6;AL # LATIN SMALL LETTER AE
00E7;AL # LATIN SMALL LETTER C WITH CEDILLA
00E8;AL # LATIN SMALL LETTER E WITH GRAVE
00E9;AL # LATIN SMALL LETTER E WITH ACUTE
00EA;AL # LATIN SMALL LETTER E WITH CIRCUMFLEX
00EB;AL # LATIN SMALL LETTER E WITH DIAERESIS
00EC;AL # LATIN SMALL LETTER I WITH GRAVE
00ED;AL # LATIN SMALL LETTER I WITH ACUTE
00EE;AL # LATIN SMALL LETTER I WITH CIRCUMFLEX
00EF;AL # LATIN SMALL LETTER I WITH DIAERESIS
00F0;AL # LATIN SMALL LETTER ETH
00F1;AL # LATIN SMALL LETTER N WITH TILDE
00F2;AL # LATIN SMALL LETTER O WITH GRAVE
00F3;AL # LATIN SMALL LETTER O WITH ACUTE
00F4;AL # LATIN SMALL LETTER O WITH CIRCUMFLEX
00F5;AL # LATIN SMALL LETTER O WITH TILDE
00F6;AL # LATIN SMALL LETTER O WITH DIAERESIS
00F7;AI # DIVISION SIGN
00F8;AL # LATIN SMALL LETTER O WITH STROKE
00F9;AL # LATIN SMALL LETTER U WITH GRAVE
00FA;AL # LATIN SMALL LETTER U WITH ACUTE
00FB;AL # LATIN SMALL LETTER U WITH CIRCUMFLEX
00FC;AL # LATIN SMALL LETTER U WITH DIAERESIS
00FD;AL # LATIN SMALL LETTER Y WITH ACUTE
00FE;AL # LATIN SMALL LETTER THORN
00FF;AL # LATIN SMALL LETTER Y WITH DIAERESIS
LineBreak.txt
The LineBreak.txt file is of rather simple format, described at the beginning of the file (on lines beginning with #). Basically, each entry consists of one line, containing three fields: Unicode value (code number, four hexadecimal digits); value of the LineBreak property, two characters; and Unicode name (which is purely comment-like here, since the code number identifies the character uniquely).
So for example,
00B0;PO # DEGREE SIGN
says that for the Unicode character U+00B0 (which has the name DEGREE SIGN), the value of the LineBreak property is PO, i.e. the character belongs to line breaking class PO (which is, by the way, abbreviated from the word “postfix”; not very mnemonic, is it?).
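The description at the beginning of the file gives the following additional information:
The LineBreak property is normative for the values BK, CR, LF, CM, SG, GL, CB, SP, ZW, NL, WJ, JL, JV, JT, H2, H3.
The property is informative for the values XX, OP, CL, QU, NS, EX, SY, IS, PR, PO, NU, AL, ID, IN, HY, BB, BA, SA, AI, B2.
Code points that are not listed explicitly in the file have the value XX.
A range of characters can be given in a single entry. For example, the line
4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>
says that every code point in the range has the value ID for the LineBreak property.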
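The file format described above is simple enough to read with a few lines of code. The following Python sketch is only an illustration, not part of the standard: the function names `parse_linebreak_lines` and `line_break_class` and the small `SAMPLE` extract are made up for this example, and a real program would read the complete LineBreak.txt file instead. It accepts both single entries and ranges, and it falls back to XX for code points that are not listed.

```python
def parse_linebreak_lines(lines):
    """Parse LineBreak.txt-style lines into (first, last, class) tuples."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment (name) part
        if not data:
            continue  # blank line or comment-only line
        code, cls = (field.strip() for field in data.split(";")[:2])
        first, _, last = code.partition("..")  # "4E00..9FBB" or just "00B0"
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi, cls))
    return ranges


def line_break_class(ranges, ch):
    """Return the LineBreak value of ch, or XX if it is not listed."""
    cp = ord(ch)
    for lo, hi, cls in ranges:
        if lo <= cp <= hi:
            return cls
    return "XX"


# A tiny hand-picked extract, used here instead of the real file.
SAMPLE = """\
# comment lines start with a number sign
0025;PO # PERCENT SIGN
002F;SY # SOLIDUS
00B0;PO # DEGREE SIGN
4E00..9FBB;ID # <CJK Ideograph, First>..<CJK Ideograph, Last>
"""

ranges = parse_linebreak_lines(SAMPLE.splitlines())
print(line_break_class(ranges, "\u00b0"))  # DEGREE SIGN
print(line_break_class(ranges, "\u4e2d"))  # a CJK ideograph inside the range
```

A linear scan over the ranges is slow for the full file; sorting the ranges and using binary search would be the obvious refinement, left out here to keep the sketch short.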
See chapters 3 Conformance and 4 Character Properties of the standard for definitions of “normative” and “informative.” It is hard to say what they really mean. The standard seems to say that applying normative properties is an absolute requirement for conforming implementations that use the property, whereas informative properties can be freely overridden. The natural idea is that informative properties are just suggested defaults. But the standard also says: “Some normative behavior is default behavior; this behavior can be overridden by higher-level protocols. However, in the absence of such protocols, the behavior must be observed so as to follow the character semantics.” This is rather vague; which normative behavior is overridable? Specifically, what about the normative line breaking properties?
In principle, it is relatively straightforward to apply the rules, once we know what they really are. Consider for example the question whether line breaks are allowed within the string /%7ej. The LineBreak properties of the characters in it are found from LineBreak.txt:
002F;SY;SOLIDUS
0025;PO;PERCENT SIGN
0037;NU;DIGIT SEVEN
0065;AL;LATIN SMALL LETTER E
006A;AL;LATIN SMALL LETTER J
Then, taking the characters in order and applying the rules in Line Breaking Algorithm in order (as they have been specified to apply), we find:
between / and %, a line break is permitted, since no rule forbids it and the last rule LB 20 says “break everywhere else”
between % and 7, a line break is permitted on the same grounds
between 7 and e, no line break is allowed, according to rule LB 17: NU × AL
between e and j, no line break is allowed, according to rule LB 19: AL × AL
Somewhat surprisingly, a line break is not allowed before / even after a space (rule LB 8).
At each step, when considering whether a line break is permitted between two consecutive characters A and B, we need to consider the LineBreak properties of both characters. If there is no rule (formulated in terms of the LineBreak property values) that forbids a line break between A and B, or after A in general, or before B in general, then a line break is permitted between them.
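The pairwise check described above can be sketched in Python as follows. This is a deliberately tiny illustration under stated assumptions: it contains only the rules quoted in this document (LB 8, LB 17, LB 19 and the final LB 20), with the rule numbers as used here, and the names `CLASSES`, `NO_BREAK_BEFORE`, `NO_BREAK_BETWEEN` and `break_allowed` are invented for the example. It is not an implementation of the full algorithm in UAX #14.

```python
# LineBreak values of the characters used in the example, as listed above.
CLASSES = {"/": "SY", "%": "PO", "7": "NU", "e": "AL", "j": "AL", " ": "SP"}

# Rules that forbid a break before a class in general (only the one quoted).
NO_BREAK_BEFORE = {"SY": "LB 8"}

# Rules that forbid a break between a specific pair of classes.
NO_BREAK_BETWEEN = {("NU", "AL"): "LB 17", ("AL", "AL"): "LB 19"}


def break_allowed(a, b):
    """Return (allowed, rule) for a line break between characters a and b."""
    class_a, class_b = CLASSES[a], CLASSES[b]
    if class_b in NO_BREAK_BEFORE:
        return False, NO_BREAK_BEFORE[class_b]
    if (class_a, class_b) in NO_BREAK_BETWEEN:
        return False, NO_BREAK_BETWEEN[(class_a, class_b)]
    return True, "LB 20"  # no rule forbids it: "break everywhere else"


def report(text):
    """Check every pair of adjacent characters in text."""
    return [(a, b) + break_allowed(a, b) for a, b in zip(text, text[1:])]


for a, b, allowed, rule in report("/%7ej"):
    verdict = "break permitted" if allowed else "no break"
    print(f"{a!r} {b!r}: {verdict} ({rule})")
```

The order of the checks matters, since the rules are specified to apply in order: the first rule that matches decides, and only if none matches does the final “break everywhere else” rule apply.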
I became interested in line breaking properties when I noticed that Internet Explorer 4 applies some “interesting” methods when formatting an HTML document for display. For example, it can break a string like a-b into a- and b. This doesn’t make much sense, but it formally complies with the Unicode line breaking rules. So the first issue to be considered is that the rules give no warning against applying them indiscriminately.
See my document Word division in IE for some dirty details. Note that some of the odd breaks (which have to some extent been fixed in IE 5) might reflect wrong interpretations of the line breaking rules. And this would not be surprising; it’s easy to get confused with those rules.
In the Latin script and some other scripts, line breaks in positions other than after spaces (and explicit line and paragraph break control codes, of course) should be regarded as exceptions. But the standard more or less takes the position “break everywhere” except when specifically forbidden. And those exceptions are just rough guesses, really. The rules as a whole forbid quite a few perfectly reasonable line breaks, like before the solidus in “it’s in directory /usr/spool,” and allow some really absurd breaks, like after either solidus in our example.
It is very confusing to see a string broken across lines just because some mechanical rules have allowed it. A string that is a mixture of Latin letters, digits, and various symbols is most probably part of some special notation, such as a URL, a variable in a programming language, or some code. It is unacceptable to have a string like foo%bar broken, especially when this happens with no indication of what has happened. It can even distort information or corrupt data.
It would probably be best to remove all prohibitions against line breaks after spaces. After all, the no-break space can be used instead of a normal space in such cases, or language-specific higher-level protocols can be applied (e.g. to prevent line breaks in French text between a space and a question mark).
Normally a sequence of characters should not be split across two lines when applying generic (language-independent) rules, unless this is clearly justified on the basis of the usage properties of the characters. This means that basically just a space is an allowed line break point in a Latin script, though a few punctuation characters (mainly hyphens and dashes) might be considered too. For quite a few characters used in non-Latin scripts, line breaks are accepted before and after in any known usage, so the Unicode standard should just list them.
Rules for “emergency breaks,” like breaking a URL across two lines to prevent absurdly long lines, might be developed. But they would belong to higher protocol levels. And it is always a matter of judgement between two evils, like allowing a (somewhat) overfull line versus splitting a string that should not be broken, or perhaps reducing font size. In particular, in URLs it should suffice to break a line after a solidus or an ampersand. And breaking a URL across several lines should always be accompanied by the use of suitable delimiters, as recommended in Appendix C of RFC 3986.
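As an illustration of what such a higher-level “emergency break” rule could look like, here is a minimal Python sketch. It is only one possible approach, not anything defined by Unicode or by the RFC: the function name `break_url`, the greedy strategy and the `width` parameter are assumptions made for this example. It breaks only after a solidus or an ampersand, accepts an overfull line when no such position fits, and wraps the URL in angle brackets, one of the delimiter styles the RFC mentions.

```python
def break_url(url, width):
    """Split a delimited URL into lines, breaking only after '/' or '&'.

    Lines are at most `width` characters long where a permitted break
    position exists; otherwise an overfull line is accepted.
    """
    text = "<" + url + ">"  # delimiters show where the URL starts and ends
    lines = []
    start = 0
    while len(text) - start > width:
        window_end = start + width
        # Latest permitted break position inside the current window.
        cut = max(text.rfind("/", start, window_end),
                  text.rfind("&", start, window_end)) + 1
        if cut <= start:
            # Nothing fits: take the next permitted position, if any,
            # and live with an overfull line.
            later = [p for p in (text.find("/", window_end),
                                 text.find("&", window_end)) if p != -1]
            if not later:
                break
            cut = min(later) + 1
        lines.append(text[start:cut])
        start = cut
    lines.append(text[start:])
    return lines


for line in break_url("http://www.example.com/a/b?x=1&y=2", 20):
    print(line)
```

Joining the resulting lines gives back the delimited URL unchanged, so a reader (or a program) can undo the breaks by removing the line breaks between the angle brackets.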