Character encodings in Nvu

This document describes the character encodings supported by the Nvu web page editor. Each encoding is briefly described, with comments on its usefulness on the web.

For the concept of character encoding, please consult my tutorial on character code issues or the Nvu help.

By default, Nvu saves an HTML document in ISO-8859-1 (ISO Latin 1) encoding. You can choose another encoding by selecting Save And Change Character Encoding in the File menu. This first opens a dialog window where you can select from a large set of encodings.

The window contains two parts, Page Title and Character Encoding, and a checkbox named Export to Text, as well as OK and Cancel buttons. The Character Encoding part contains a large listbox with the names of encodings. Each encoding appears with a common name followed by a more official (MIME) name in parentheses. However, not all names in parentheses are official, and they may differ from the exact official spelling.

When you save a document in a particular encoding, Nvu generates a meta tag that specifies that encoding, using its official name, e.g. <meta content="text/html; charset=IBM864" http-equiv="content-type">. (Usually such tags are written with the http-equiv attribute before the content attribute, but the order is not significant.)
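To illustrate that the attribute order in such a meta tag does not matter, here is a minimal Python sketch that reads the declared charset from a tag like the one Nvu generates. The names MetaCharsetFinder and declared_charset are made up for this example; it uses only the standard html.parser module and is not how Nvu or any browser works internally.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Hypothetical helper: remembers the first charset declared in a meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # attrs is a list of (name, value) pairs; a dict makes the order irrelevant.
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content") or ""
        # The content value looks like "text/html; charset=IBM864".
        for part in content.split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value:
                self.charset = value.strip()


def declared_charset(html_text):
    """Return the charset named in the first content-type meta tag, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


# Both attribute orders yield the same result.
nvu_style = '<meta content="text/html; charset=IBM864" http-equiv="content-type">'
usual_style = '<meta http-equiv="content-type" content="text/html; charset=IBM864">'
print(declared_charset(nvu_style), declared_charset(usual_style))
```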

Nvu represents characters as entity references (e.g., &eacute; for é) or as character references (e.g., &#1049; for Й) whenever they cannot be written directly in the selected encoding. Moreover, it may use such representations for some characters even if they could be written directly. This depends on the settings; select Tools, Preferences, Advanced, Special characters to view and modify them.
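The fallback described above can be sketched in Python. This is only an illustration of the idea, not Nvu's actual algorithm: the function name escape_unencodable is hypothetical, and the sketch handles only the case where a character cannot be encoded, not the preference settings. It uses the standard html.entities table, which covers the HTML 4 entity names.

```python
from html.entities import codepoint2name


def escape_unencodable(text, encoding):
    """Replace characters that `encoding` cannot represent with references.

    A named entity reference is used when HTML 4 defines one; otherwise a
    decimal character reference is used.
    """
    pieces = []
    for ch in text:
        try:
            ch.encode(encoding)
            pieces.append(ch)  # representable as such, keep it
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            pieces.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(pieces)


# é exists in ISO-8859-1 but Й does not, so only Й becomes a reference.
print(escape_unencodable("é Й", "iso-8859-1"))
# In plain ASCII neither character exists, so both are replaced.
print(escape_unencodable("é Й", "ascii"))
```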

Practical notes:

The following table presents the entries in the character encoding menu. The second column contains the name actually appearing in the meta tag that Nvu generates. Practical notes are given in the third column. The word “unregistered” means that the encoding is not registered according to MIME specifications.

Menu entry | Charset name | Notes
Arabic (IBM-864) | IBM864 | DOS code page for Arabic, cp864
Arabic (ISO-8859-6) | ISO-8859-6 | ISO Latin/Arabic
Arabic (MacArabic) | x-mac-arabic | Macintosh encoding for Arabic, unregistered
Arabic (Windows-1256) | windows-1256 | Windows Arabic
Armenian (ARMSCII-8) | armscii-8 | “Armenian ASCII”, unregistered
Baltic (ISO-8859-13) | ISO-8859-13 | ISO Latin 7, “Baltic Rim”
Baltic (ISO-8859-4) | ISO-8859-4 | ISO Latin 4, “North European”
Baltic (Windows-1257) | windows-1257 | Windows Baltic
Celtic (ISO-8859-14) | ISO-8859-14 | ISO Latin 8; no wide support
Central European (IBM-852) | IBM852 | DOS code page for Central European, cp852
Central European (ISO-8859-2) | ISO-8859-2 | ISO Latin 2
Central European (MacCE) | x-mac-ce | Macintosh encoding for Central European, unregistered
Central European (Windows-1250) | windows-1250 | Windows Latin 2
Chinese Simplified (GB18030) | gb18030 | Newer encoding for Chinese in Simplified writing system
Chinese Simplified (GB2312) | gb2312 | Older encoding for Chinese in Simplified writing system
Chinese Simplified (GBK) | x-gbk | An extension of GB2312 (MIME name: GBK)
Chinese Simplified (HZ) | HZ-GB-2312 | An encoding designed for E-mail
Chinese Simplified (ISO-2022-CN) | ISO-2022-CN | ISO 2022 based encoding for Chinese
Chinese Traditional (Big5) | Big5 | Chinese encoding, used especially in Taiwan
Chinese Traditional (Big5-HKSCS) | Big5-HKSCS | Chinese encoding, used especially in Hong Kong
Chinese Traditional (EUC-TW) | x-euc-tw | Chinese encoding, unregistered
Croatian (MacCroatian) | x-mac-croatian | Macintosh encoding for Croatian, unregistered
Cyrillic (IBM-855) | IBM855 | DOS code page for Cyrillic, cp855
Cyrillic (ISO-8859-5) | ISO-8859-5 | ISO Latin/Cyrillic
Cyrillic (ISO-IR-111) | ISO-IR-111 | ECMA Cyrillic
Cyrillic (KOI8-R) | KOI8-R | Russian version of KOI8
Cyrillic (MacCyrillic) | x-mac-cyrillic | Macintosh encoding for Cyrillic, unregistered
Cyrillic (Windows-1251) | windows-1251 | Windows Cyrillic
Cyrillic/Russian (CP-866) | IBM866 | DOS code page for Russian
Cyrillic/Ukrainian (KOI8-U) | KOI8-U | Ukrainian version of KOI8
Cyrillic/Ukrainian (MacUkrainian) | x-mac-ukrainian | Macintosh encoding for Ukrainian
Farsi (MacFarsi) | x-mac-farsi | Macintosh encoding for Farsi (Persian), unregistered
Georgian (GEOSTD8) | GEOSTD8 | Encoding for the Georgian language, unregistered
Greek (ISO-8859-7) | ISO-8859-7 | ISO Latin/Greek
Greek (MacGreek) | x-mac-greek | Macintosh encoding for Greek, unregistered
Greek (Windows-1253) | windows-1253 | Windows Greek
Gujarati (MacGujarati) | x-mac-gujarati | Macintosh encoding for Gujarati, unregistered
Gurmukhi (MacGurmukhi) | x-mac-gurmukhi | Macintosh encoding for Gurmukhi, unregistered
Hebrew (IBM-862) | IBM862 | DOS code page for Hebrew, cp862
Hebrew (ISO-8859-8-I) | ISO-8859-8-I | ISO-8859-8 (ISO Latin/Hebrew) in logical order
Hebrew (MacHebrew) | x-mac-hebrew | Macintosh encoding for Hebrew, unregistered
Hebrew (Windows-1255) | windows-1255 | Windows Hebrew
Hindi (MacDevanagari) | x-mac-devanagari | Macintosh encoding for Devanagari, unregistered
Icelandic (MacIcelandic) | x-mac-icelandic | Macintosh encoding for Icelandic, unregistered
Japanese (EUC-JP) | EUC-JP | Common Japanese encoding
Japanese (ISO-2022-JP) | ISO-2022-JP | Another common Japanese encoding
Japanese (Shift_JIS) | Shift_JIS | Yet another common Japanese encoding
Korean (EUC-KR) | EUC-KR | Common Korean encoding
Nordic (ISO-8859-10) | ISO-8859-10 | ISO Latin 6, “Nordic” (Sámi etc.); no wide support
Romanian (ISO-8859-16) | ISO-8859-16 | ISO Latin 10; no wide support
Romanian (MacRomanian) | x-mac-romanian | Macintosh encoding for Romanian, unregistered
South European (ISO-8859-3) | ISO-8859-3 | ISO Latin 3, for Maltese and Esperanto; no wide support
Thai (ISO-8859-11) | ISO-8859-11 | ISO Latin/Thai
Thai (TIS-620) | TIS-620 | Encoding for Thai, national standard
Thai (Windows-874) | windows-874 | Windows Thai
Turkish (IBM-857) | IBM857 | DOS code page for Turkish, cp857
Turkish (ISO-8859-9) | ISO-8859-9 | ISO Latin 5
Turkish (MacTurkish) | x-mac-turkish | Macintosh encoding for Turkish, unregistered
Turkish (Windows-1254) | windows-1254 | Windows Turkish
Unicode (UTF-16 Big Endian) | UTF-16BE | UTF-16 Big Endian (high byte first)
Unicode (UTF-16 Little Endian) | UTF-16LE | UTF-16 Little Endian (low byte first)
Unicode (UTF-16) | UTF-16 | UTF-16, with endianness to be inferred
Unicode (UTF-32 Big Endian) | UTF-32BE | UTF-32 Big Endian
Unicode (UTF-32 Little Endian) | UTF-32LE | UTF-32 Little Endian
Unicode (UTF-32) | UTF-32 | UTF-32, with endianness to be inferred
Unicode (UTF-8) | UTF-8 | UTF-8, the preferred Unicode encoding on the Internet
User Defined | x-user-defined | Unspecified encoding, usually for use with specific font
Western (IBM-850) | IBM850 | DOS code page for West European languages, cp850
Western (ISO-8859-1) | ISO-8859-1 | ISO Latin 1, the default encoding
Western (ISO-8859-15) | ISO-8859-15 | ISO Latin 9, with euro sign, not widely supported
Western (MacRoman) | x-mac-roman | Macintosh encoding for Western European, unregistered
Western (Windows-1252) | windows-1252 | Windows Latin 1
Vietnamese (TCVN) | x-viet-tcvn5712 | TCVN 5712, VISCII-2, unregistered
Vietnamese (Windows-1258) | windows-1258 | Windows Vietnamese
Vietnamese (VISCII) | VISCII | “Vietnamese extension to ASCII”
Vietnamese (VPS) | x-viet-vps | VPS, unregistered
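
For the UTF-16 entries in the table, the difference between the fixed-endianness variants and plain UTF-16 can be shown with a short Python sketch. The function name detect_utf16_byte_order is hypothetical, and the sketch assumes the data is already known to be some form of UTF-16; it only looks for a byte order mark at the start and does not try to tell UTF-16 from UTF-32.

```python
import codecs


def detect_utf16_byte_order(data):
    """Infer the byte order of UTF-16 data from its byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "UTF-16LE"
    return "unknown"


# The fixed-endianness encodings write no byte order mark.
print("A".encode("utf-16-be"))  # high byte first
print("A".encode("utf-16-le"))  # low byte first

# With a byte order mark in front, the order can be inferred.
print(detect_utf16_byte_order(codecs.BOM_UTF16_BE + "A".encode("utf-16-be")))
print(detect_utf16_byte_order(codecs.BOM_UTF16_LE + "A".encode("utf-16-le")))
```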

Note: Nvu generates the unregistered name x-gbk for the GBK encoding, although this encoding has a MIME registration under the name GBK. To fix this, change the meta tag in the Source mode in Nvu and save the file.
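
If many files are affected, the same correction could be made outside Nvu with a small script. The following Python sketch is one possible approach, under the assumption that the meta tag has the form shown earlier in this document; the function name fix_gbk_charset is made up for this example, and the sketch operates on the document text only, leaving reading and writing of the file to the caller.

```python
import re

# Matches "charset=x-gbk" inside a meta tag, keeping everything before the name.
_X_GBK_IN_META = re.compile(r"(<meta\b[^>]*charset\s*=\s*)x-gbk\b", re.IGNORECASE)


def fix_gbk_charset(html_text):
    """Replace the unregistered name x-gbk with the registered name GBK."""
    return _X_GBK_IN_META.sub(r"\1GBK", html_text)


tag = '<meta content="text/html; charset=x-gbk" http-equiv="content-type">'
print(fix_gbk_charset(tag))
```

Only the name in the meta tag changes; the bytes of the document body are left as they are, since x-gbk and GBK denote the same encoding.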