Character encodings in Nvu

This document describes the character encodings supported by the Nvu web page editor. The encodings are briefly described, and their usefulness on the web is commented on.

For the concept of character encoding, please consult my tutorial on character code issues or the Nvu help.

By default, Nvu saves an HTML document in ISO-8859-1 (ISO Latin 1) encoding. You can choose another encoding by selecting Save And Change Character Encoding in the File menu. This first opens a dialog window where you can choose from a large set of encodings.

The window contains two parts, Page Title and Character Encoding, and a checkbox named Export to Text, as well as OK and Cancel buttons. The Character Encoding part contains a large list box with the names of encodings. The encodings appear with a common name followed by a more official (MIME) name in parentheses. However, not all names in parentheses are official, and they may differ from the exact official spelling.

When you save a document in a particular encoding, Nvu generates a meta tag that specifies that encoding, using its official name, e.g. <meta content="text/html; charset=IBM864" http-equiv="content-type">. (Usually such tags are used with the http-equiv and content attributes in the reverse order, but the order is not significant.)
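
To illustrate that the attribute order does not matter, here is a minimal Python sketch that reads the declared encoding out of such a meta tag. The class and function names (MetaCharsetFinder, find_meta_charset) are made up for this example and have nothing to do with Nvu itself; only the standard library html.parser module is used.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects the charset declared in a content-type meta tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # Attribute order is irrelevant once the attributes are in a dict.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html; charset=IBM864".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip()


def find_meta_charset(html_text):
    """Return the charset named in the first content-type meta tag, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


nvu_style = '<meta content="text/html; charset=IBM864" http-equiv="content-type">'
usual_style = '<meta http-equiv="content-type" content="text/html; charset=IBM864">'
print(find_meta_charset(nvu_style))    # IBM864
print(find_meta_charset(usual_style))  # IBM864
```

Both spellings of the tag give the same result, since the parser hands over the attributes as name and value pairs regardless of their order.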

Nvu represents characters as entity references (e.g., &eacute; for é) or as character references (e.g., &#1049; for Й) to the extent that characters cannot be written as such in the selected encoding. Moreover, it may use such representations for some characters even if they could appear as such. This depends on the settings; select Tools, Preferences, Advanced, Special characters to view and modify them.
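
The fallback to character references can be imitated outside Nvu. The following Python sketch is not Nvu's own code; it relies on the standard xmlcharrefreplace error handler, which writes numeric character references only (it never produces named entity references). The function name encode_with_references is an assumption made for this example.

```python
def encode_with_references(text, encoding):
    """Encode text; characters the encoding lacks become numeric character references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "é Й"

# ISO-8859-1 contains é (byte 0xE9) but not the Cyrillic letter Й (U+0419, decimal 1049).
print(encode_with_references(sample, "iso-8859-1"))  # b'\xe9 &#1049;'

# UTF-8 can represent both characters directly, so no references are needed.
print(encode_with_references(sample, "utf-8"))       # b'\xc3\xa9 \xd0\x99'
```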

Practical notes:

The default encoding is Western (ISO-8859-1). It is rather universally supported by browsers.
Other commonly used and very widely supported encodings are Western (Windows-1252) and Unicode (UTF-8).
Other encodings may work well within a particular user community, but for the worldwide audience, UTF-8 is usually the best choice if you do not wish to use ISO-8859-1.
Many ISO and Windows encodings work well, too.
DOS code pages, though supported by many browsers, do not offer benefits over ISO and Windows encodings.
Unregistered encodings should not be expected to work on the WWW. In particular, many of them work on Internet Explorer only if the user downloads and installs additional software, and this may often fail for various reasons.
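
One way to apply the notes above is to test which of the widely supported encodings can represent a given text. The Python sketch below is only an illustration: the function name pick_encoding and the candidate order (ISO-8859-1 first, then Windows-1252, then UTF-8) are assumptions of this example, not anything that Nvu does.

```python
def pick_encoding(text, candidates=("iso-8859-1", "windows-1252", "utf-8")):
    """Return the first candidate encoding that can represent every character of text."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # some character is missing from this encoding; try the next one
        return name
    raise ValueError("none of the candidate encodings can represent the text")


print(pick_encoding("café"))    # iso-8859-1
print(pick_encoding("5 €"))     # windows-1252 (the euro sign is not in ISO-8859-1)
print(pick_encoding("Привет"))  # utf-8
```

Since UTF-8 can represent any text, the function always succeeds with the default candidates; the error branch only matters if you pass a narrower list.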

The following table presents the entries in the character encoding menu. The second column contains the name actually appearing in the meta tag that Nvu generates. Practical notes are given in the third column. The word “unregistered” means that the encoding is not registered according to MIME specifications.

Menu entry	Charset name	Notes
Arabic (IBM-864)	IBM864	DOS code page for Arabic, cp864
Arabic (ISO-8859-6)	ISO-8859-6	ISO Latin/Arabic
Arabic (MacArabic)	x-mac-arabic	Macintosh encoding for Arabic, unregistered
Arabic (Windows-1256)	windows-1256	Windows Arabic
Armenian (ARMSCII-8)	armscii-8	“Armenian ASCII”, unregistered
Baltic (ISO-8859-13)	ISO-8859-13	ISO Latin 7, “Baltic Rim”
Baltic (ISO-8859-4)	ISO-8859-4	ISO Latin 4, “North European”
Baltic (Windows-1257)	windows-1257	Windows Baltic
Celtic (ISO-8859-14)	ISO-8859-14	ISO Latin 8; no wide support
Central European (IBM-852)	IBM852	DOS code page for Central European, cp852
Central European (ISO-8859-2)	ISO-8859-2	ISO Latin 2
Central European (MacCE)	x-mac-ce	Macintosh encoding for Central European, unregistered
Central European (Windows-1250)	windows-1250	Windows Latin 2
Chinese Simplified (GB18030)	gb18030	Newer encoding for Chinese in Simplified writing system
Chinese Simplified (GB2312)	gb2312	Older encoding for Chinese in Simplified writing system
Chinese Simplified (GBK)	x-gbk	An extension of GB2312 (MIME name: GBK)
Chinese Simplified (HZ)	HZ-GB-2312	An encoding designed for E-mail
Chinese Simplified (ISO-2022-CN)	ISO-2022-CN	ISO 2022 based encoding for Chinese
Chinese Traditional (Big5)	Big5	Chinese encoding, used especially in Taiwan
Chinese Traditional (Big5-HKSCS)	Big5-HKSCS	Chinese encoding, used especially in Hong Kong
Chinese Traditional (EUC-TW)	x-euc-tw	Chinese encoding, unregistered
Croatian (MacCroatian)	x-mac-croatian	Macintosh encoding for Croatian, unregistered
Cyrillic (IBM-855)	IBM855	DOS code page for Cyrillic, cp855
Cyrillic (ISO-8859-5)	ISO-8859-5	ISO Latin/Cyrillic
Cyrillic (ISO-IR-111)	ISO-IR-111	ECMA Cyrillic
Cyrillic (KOI8-R)	KOI8-R	Russian version of KOI8
Cyrillic (MacCyrillic)	x-mac-cyrillic	Macintosh encoding for Cyrillic, unregistered
Cyrillic (Windows-1251)	windows-1251	Windows Cyrillic
Cyrillic/Russian (CP-866)	IBM866	DOS code page for Russian
Cyrillic/Ukrainian (KOI8-U)	KOI8-U	Ukrainian version of KOI8
Cyrillic/Ukrainian (MacUkrainian)	x-mac-ukrainian	Macintosh encoding for Ukrainian
Farsi (MacFarsi)	x-mac-farsi	Macintosh encoding for Farsi (Persian), unregistered
Georgian (GEOSTD8)	GEOSTD8	Encoding for the Georgian language, unregistered
Greek (ISO-8859-7)	ISO-8859-7	ISO Latin/Greek
Greek (MacGreek)	x-mac-greek	Macintosh encoding for Greek, unregistered
Greek (Windows-1253)	windows-1253	Windows Greek
Gujarati (MacGujarati)	x-mac-gujarati	Macintosh encoding for Gujarati, unregistered
Gurmukhi (MacGurmukhi)	x-mac-gurmukhi	Macintosh encoding for Gurmukhi, unregistered
Hebrew (IBM-862)	IBM862	DOS code page for Hebrew, cp862
Hebrew (ISO-8859-8-I)	ISO-8859-8-I	ISO-8859-8 (ISO Latin/Hebrew) in logical order
Hebrew (MacHebrew)	x-mac-hebrew	Macintosh encoding for Hebrew, unregistered
Hebrew (Windows-1255)	windows-1255	Windows Hebrew
Hindi (MacDevanagari)	x-mac-devanagari	Macintosh encoding for Devanagari, unregistered
Icelandic (MacIcelandic)	x-mac-icelandic	Macintosh encoding for Icelandic, unregistered
Japanese (EUC-JP)	EUC-JP	Common Japanese encoding
Japanese (ISO-2022-JP)	ISO-2022-JP	Another common Japanese encoding
Japanese (Shift_JIS)	Shift_JIS	Yet another common Japanese encoding
Korean (EUC-KR)	EUC-KR	Common Korean encoding
Nordic (ISO-8859-10)	ISO-8859-10	ISO Latin 6, “Nordic” (Sámi etc.); no wide support
Romanian (ISO-8859-16)	ISO-8859-16	ISO Latin 10; no wide support
Romanian (MacRomanian)	x-mac-romanian	Macintosh encoding for Romanian, unregistered
South European (ISO-8859-3)	ISO-8859-3	ISO Latin 3, for Maltese and Esperanto; no wide support
Thai (ISO-8859-11)	ISO-8859-11	ISO Latin/Thai
Thai (TIS-620)	TIS-620	Encoding for Thai, national standard
Thai (Windows-874)	windows-874	Windows Thai
Turkish (IBM-857)	IBM857	DOS code page for Turkish, cp857
Turkish (ISO-8859-9)	ISO-8859-9	ISO Latin 5
Turkish (MacTurkish)	x-mac-turkish	Macintosh encoding for Turkish, unregistered
Turkish (Windows-1254)	windows-1254	Windows Turkish
Unicode (UTF-16 Big Endian)	UTF-16BE	UTF-16 Big Endian (high byte first)
Unicode (UTF-16 Little Endian)	UTF-16LE	UTF-16 Little Endian (low byte first)
Unicode (UTF-16)	UTF-16	UTF-16, with endianness to be inferred
Unicode (UTF-32 Big Endian)	UTF-32BE	UTF-32 Big Endian
Unicode (UTF-32 Little Endian)	UTF-32LE	UTF-32 Little Endian
Unicode (UTF-32)	UTF-32	UTF-32, with endianness to be inferred
Unicode (UTF-8)	UTF-8	UTF-8, the preferred Unicode encoding on the Internet
User Defined	x-user-defined	Unspecified encoding, usually for use with specific font
Western (IBM-850)	IBM850	DOS code page for West European languages, cp850
Western (ISO-8859-1)	ISO-8859-1	ISO Latin 1, the default encoding
Western (ISO-8859-15)	ISO-8859-15	ISO Latin 9, with euro sign, not widely supported
Western (MacRoman)	x-mac-roman	Macintosh encoding for Western European, unregistered
Western (Windows-1252)	windows-1252	Windows Latin 1
Vietnamese (TCVN)	x-viet-tcvn5712	TCVN 5712, VISCII-2, unregistered
Vietnamese (Windows-1258)	windows-1258	Windows Vietnamese
Vietnamese (VISCII)	VISCII	“Vietnamese extension to ASCII”
Vietnamese (VPS)	x-viet-vps	VPS, unregistered
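
The difference between the UTF-16 entries in the table can be seen from the bytes they produce. This Python sketch uses Python's own codec names (utf-16-be, utf-16-le, utf-16), which are spelled differently from the charset names in the table; the helper name utf16_variants is made up for the example.

```python
import codecs


def utf16_variants(text):
    """Return the bytes produced for text by each of the three UTF-16 variants."""
    return {
        "UTF-16BE": text.encode("utf-16-be"),  # high byte first, no byte order mark
        "UTF-16LE": text.encode("utf-16-le"),  # low byte first, no byte order mark
        "UTF-16": text.encode("utf-16"),       # starts with a byte order mark
    }


variants = utf16_variants("A")
print(variants["UTF-16BE"])  # b'\x00A'
print(variants["UTF-16LE"])  # b'A\x00'

# With plain UTF-16, Python writes a byte order mark first, so that a reader
# can infer the byte order; which order is used depends on the platform.
bom = variants["UTF-16"][:2]
print(bom in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True
```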

Note: Nvu generates the unregistered name x-gbk for the GBK encoding, although this encoding has a MIME registration under the name GBK. To fix this, you can change the meta tag in the Source mode in Nvu and save the file.
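
If many files need this correction, it could also be scripted. The following Python sketch is a simple text substitution under the assumption that the string charset=x-gbk occurs only in the meta tag; the function name fix_gbk_charset is hypothetical, and the function works on text that has already been read from the file.

```python
import re


def fix_gbk_charset(html_text):
    """Replace the unregistered charset name x-gbk with the registered name GBK."""
    # Keep the "charset=" part as written and swap only the encoding name.
    return re.sub(r"(charset\s*=\s*)x-gbk\b", r"\1GBK", html_text, flags=re.IGNORECASE)


before = '<meta content="text/html; charset=x-gbk" http-equiv="content-type">'
print(fix_gbk_charset(before))
# <meta content="text/html; charset=GBK" http-equiv="content-type">
```

Only the declared name changes; the bytes of the document stay as they are, which is appropriate here because x-gbk and GBK name the same encoding.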