Extended Unix Code

Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.

The most commonly used EUC codes are variable-width encodings in which a character belonging to an ISO/IEC 646 compliant coded character set (such as ASCII) takes one byte, and a character belonging to a 94×94 coded character set (such as GB 2312) takes two bytes. The EUC-CN form of GB 2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, including an initial shift code, whereas a single character in EUC-TW can take up to four bytes.

Modern applications are more likely to use UTF-8, which supports all of the characters of the EUC codes and more, and is generally more portable, with fewer vendor deviations and errors. However, EUC is still very popular, especially EUC-KR for South Korea.

Encoding structure

Relationship between packed EUC and other 8-bit ISO 2022 profiles

The structure of EUC is based on the ISO/IEC 2022 standard, which specifies a system of graphical character sets that can be represented with a sequence of the 94 7-bit bytes 0x21–7E, or alternatively 0xA1–FE if an eighth bit is available. This allows for sets of 94 graphical characters, or 8,836 (94²) characters, or 830,584 (94³) characters. Although initially 0x20 and 0x7F were always the space and delete characters and 0xA0 and 0xFF were unused, later editions of ISO/IEC 2022 allowed the use of the bytes 0xA0 and 0xFF (or 0x20 and 0x7F) within sets under certain circumstances, allowing the inclusion of 96-character sets. The ranges 0x00–1F and 0x80–9F are used for C0 and C1 control codes.

EUC is a family of 8-bit profiles of ISO/IEC 2022, as opposed to 7-bit profiles such as ISO-2022-JP. As such, only ISO 2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2 and G3, or as code sets 0, 1, 2 and 3) can be represented with the EUC scheme. The G0 set is set to an ISO/IEC 646 compliant coded character set such as US-ASCII, ISO 646:KR (KS X 1003) or ISO 646:JP (the lower half of JIS X 0201) and invoked over GL (i.e. 0x21–0x7E, with the most significant bit cleared).^[1] If US-ASCII is used, this makes the code an extended ASCII encoding; the most common deviation from US-ASCII is that 0x5C (backslash in US-ASCII) is often used to represent a yen sign in EUC-JP (see below) and a won sign in EUC-KR.

The other code sets are invoked over GR (i.e. with the most significant bit set). Hence, to get the EUC form of a character, the most significant bit of each coding byte is set (equivalent to adding 128 to each 7-bit coding byte, or adding 160 to each number in the kuten code); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or to the extended code. Characters in code sets 2 and 3 are prefixed with the control codes SS2 (0x8E) and SS3 (0x8F) respectively, and invoked over GR. Besides the initial shift code, any byte outside the range 0xA0–0xFF appearing in a character from code sets 1 through 3 is not a valid EUC code.^[1]
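The arithmetic in the preceding paragraph can be sketched in a few lines of Python. The function name `kuten_to_euc` is a hypothetical helper introduced here for illustration; it assumes a two-byte 94×94 set invoked over GR and does not check whether the position is actually assigned in any particular standard.

```python
def kuten_to_euc(row: int, cell: int) -> bytes:
    """Return the two GR bytes for a row/cell (kuten) position in a 94x94 set."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Adding 0x20 gives the 7-bit ISO 2022 byte (0x21-0x7E);
    # setting the most significant bit then adds 0x80, for a total of 0xA0 (160).
    return bytes([(row + 0x20) | 0x80, (cell + 0x20) | 0x80])


# JIS X 0208 row 16, cell 1 is the kanji 亜, so the result should agree
# with Python's built-in euc_jp codec.
print(kuten_to_euc(16, 1).hex())      # b0a1
print("亜".encode("euc_jp").hex())    # b0a1
```

Because every byte produced this way is at least 0xA1, a decoder can tell extended-code bytes from ISO 646 bytes by testing the high bit alone.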

The EUC code itself does not make use of the announcer or designation sequences from ISO 2022.^[1] However, the code specification is equivalent to the following sequence of four ISO 2022 announcer sequences, with the meanings shown below.^[1]

Individual sequence	Hexadecimal	Feature of EUC denoted
`ESC SP C`	`1B 20 43`	ISO-8 (8-bit, G0 in GL, G1 in GR)
`ESC SP Z`	`1B 20 5A`	G2 accessed using SS2
`ESC SP [`	`1B 20 5B`	G3 accessed using SS3
`ESC SP \`	`1B 20 5C`	Single shifts invoke over GR

Fixed-width format

The ISO 2022-based variable-width encoding described above is sometimes referred to as the EUC packed format, which is the encoding format usually labelled as EUC. However, internal processing of EUC data may make use of a fixed-width transformation format called the EUC complete two-byte format. This represents:^[2]

Code set 0 as two bytes in the range 0x21–0x7E (except that the first may be 0x00).
Code set 1 as two bytes in the range 0xA0–0xFF (except that the first may be 0x80).
Code set 2 as a byte in the range 0x20–0x7E (or 0x00) followed by a byte in the range 0xA0–0xFF.
Code set 3 as a byte in the range 0xA0–0xFF (or 0x80) followed by a byte in the range 0x21–0x7E.

Initial bytes of 0x00 and 0x80 are used in cases where the code set uses only one byte. There is also a four-byte fixed-length format.^[2] These fixed-length encoding formats are suited to internal processing and are not usually encountered in interchange.
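As a rough illustration of the rules above, the following Python sketch converts packed EUC-JP into the complete two-byte layout. It is not a reference implementation of any registered charset: the function name `packed_to_fixed_euc_jp` is hypothetical, the code-set allocation is the EUC-JP one (one-byte code sets 0 and 2, two-byte code sets 1 and 3), and the input is assumed to be well-formed packed EUC-JP.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control codes


def packed_to_fixed_euc_jp(data: bytes) -> bytes:
    """Convert well-formed packed EUC-JP to two bytes per character (illustrative)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Code set 0 uses one byte, so it is padded with a leading 0x00.
            out += bytes([0x00, b])
            i += 1
        elif b == SS2:
            # Code set 2 uses one GR byte after SS2: 0x00, then the GR byte.
            out += bytes([0x00, data[i + 1]])
            i += 2
        elif b == SS3:
            # Code set 3: first byte stays in GR, second byte has its high bit cleared.
            out += bytes([data[i + 1], data[i + 2] & 0x7F])
            i += 3
        else:
            # Code set 1: two GR bytes, copied unchanged.
            out += data[i:i + 2]
            i += 2
    return bytes(out)


packed = b"A" + b"\xb0\xa1" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
print(packed_to_fixed_euc_jp(packed).hex(" "))  # 00 41 b0 a1 00 b1 b0 21
```

In the fixed-width form the shift bytes disappear: the code set is recoverable from the high bits of the two bytes alone, which is what makes the layout convenient for internal indexing.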

EUC-JP is registered with the IANA in both formats: the packed format as "EUC-JP" or "csEUCPkdFmtJapanese", and the fixed-width format as "csEUCFixWidJapanese".^[3] Only the packed format is included in the WHATWG Encoding Standard used by HTML5.^[4]

EUC-CN

EUC-CN

MIME / IANA	GB2312
Alias(es)	csGB2312
Language(s)	Simplified Chinese, English, Russian
Standard	GB 2312 (1980)
Classification	Extended ASCII, variable-width encoding, CJK encoding, EUC
Extends	US-ASCII
Extensions	748, GBK, GB 18030, x-mac-chinesesimp
Transforms / Encodes	GB 2312
Succeeded by	GBK, GB 18030

EUC-CN^[5] is the usual encoded form of the GB 2312 standard for simplified Chinese characters. Unlike the case of Japanese JIS X 0208 and ISO-2022-JP, GB 2312 is not normally used in a 7-bit ISO 2022 code version,^[a] although a variant form called HZ (which delimits GB 2312 text with ASCII sequences) was sometimes used on USENET.

An ASCII character is represented in its usual encoding. A character from GB 2312 is represented by two bytes, both from the range 0xA1–0xFE.
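A minimal sketch of how a decoder might split an EUC-CN byte string into characters using only the byte ranges just described. The helper name `split_euc_cn` is an assumption made for this example; it checks structure only and does not verify that a two-byte code is actually assigned in GB 2312.

```python
def split_euc_cn(data: bytes) -> list[bytes]:
    """Split EUC-CN bytes into one-byte ASCII and two-byte GB 2312 sequences."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII (and C0 controls) keep their usual single-byte encoding.
            chars.append(data[i:i + 1])
            i += 1
        else:
            pair = data[i:i + 2]
            # A GB 2312 character needs two bytes, both in 0xA1-0xFE.
            if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
                raise ValueError(f"invalid EUC-CN sequence at offset {i}")
            chars.append(pair)
            i += 2
    return chars


encoded = "a中文".encode("gb2312")   # Python labels EUC-CN as "gb2312"
print([c.hex() for c in split_euc_cn(encoded)])  # ['61', 'd6d0', 'cec4']
```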

Related Mainland Chinese encoding systems

748 code

An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB 2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.

GBK and GB 18030

GBK is an extension to GB 2312. It defines an extended form of the EUC-CN encoding capable of representing a larger array of CJK characters sourced largely from Unicode 1.1, including traditional Chinese characters and characters used only in Japanese. It is not, however, a true EUC code, because ASCII bytes may appear as trail bytes (and C1 bytes, not limited to the single shifts, may appear as lead or trail bytes), due to a larger encoding space being required.
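The structural difference can be observed with Python's built-in `gbk` and `gb2312` codecs. The helper `describe_gbk` below is a hypothetical name used only for this sketch; it reports whether a single character is also encodable as GB 2312 and whether all of its GBK bytes stay within the GR range 0xA1–0xFE that a true EUC code would require.

```python
def describe_gbk(ch: str) -> dict:
    """Report how one character is encoded in GBK (illustrative helper)."""
    data = ch.encode("gbk")
    try:
        ch.encode("gb2312")
        in_gb2312 = True
    except UnicodeEncodeError:
        in_gb2312 = False
    return {
        "bytes": data.hex(),
        "in_gb2312": in_gb2312,
        # A two-byte EUC character has both bytes in 0xA1-0xFE.
        "all_bytes_in_gr": all(0xA1 <= b <= 0xFE for b in data),
    }


print(describe_gbk("中"))   # a GB 2312 character: same bytes as in EUC-CN
print(describe_gbk("丂"))   # U+4E02, a GBK-only character outside the GR plane
```

For the GBK-only character, `all_bytes_in_gr` is false, which is why a decoder that relies on the EUC byte ranges cannot be reused for GBK unchanged.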

Variants of GBK are implemented by Windows code page 936 (the Microsoft Windows code page for simplified Chinese), and by IBM's code page 1386.

The Unicode-based GB 18030 character encoding defines an extension of GBK capable of encoding the entirety of Unicode. However, Unicode encoded as GB 18030 is a variable-width encoding which may use up to four bytes per character, due to an even larger encoding space being required. Being an extension of GBK, it is a superset of EUC-CN but is not itself a true EUC code. Being a Unicode encoding, its repertoire is identical to that of other Unicode transformation formats such as UTF-8.
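A short check of the variable width, using Python's standard `gb18030` codec. The function name `gb18030_lengths` is made up for this example.

```python
def gb18030_lengths(text: str) -> list[int]:
    """Return the number of GB 18030 bytes used by each character of text."""
    return [len(ch.encode("gb18030")) for ch in text]


sample = "a中\U0001F600"          # ASCII letter, GB 2312 hanzi, emoji
print(gb18030_lengths(sample))    # [1, 2, 4]

# Any Unicode string should survive a round trip, since the repertoire is all of Unicode.
assert sample.encode("gb18030").decode("gb18030") == sample
```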

Mac OS Chinese Simplified

Other EUC-CN variants deviating from the EUC mechanism include the Mac OS Chinese Simplified script (known as Code page 10008 or x-mac-chinesesimp).^[6] It uses the bytes 0x80, 0x81, 0x82, 0xA0, 0xFD, 0xFE and 0xFF for the U with umlaut (ü), two special font metric characters, the non-breaking space, the copyright sign (©), the trademark sign (™) and the ellipsis (…) respectively.^[5] This differs in what is regarded as a single-byte character versus the first byte of a two-byte character from both EUC (where, of those, 0xFD and 0xFE are defined as lead bytes) and GBK (where, of those, 0x81, 0x82, 0xFD and 0xFE are defined as lead bytes).

This use of 0xA0, 0xFD, 0xFE and 0xFF matches Apple's Shift_JIS variant.

Besides these changes to the lead byte range, the other distinctive feature of the double-byte portion of Mac OS Chinese Simplified is the inclusion of two extensions to the basic GB 2312-80 set in rows 6 and 8.^[5] These are considered "standard extensions to GB 2312", neither of which is proprietary to Apple: the row 8 extension was taken from GB 6345.1,^[5] both extensions are included by GB/T 12345 (the Traditional Chinese variant of GB 2312),^[7] and both extensions are included by GB 18030 (the successor to GB 2312).^[8]

EUC-JP

EUC-JP

MIME / IANA	EUC-JP
Alias(es)	Unixized JIS (UJIS), csEUCPkdFmtJapanese
Language(s)	Japanese, English, Russian
Classification	Extended ISO 646, variable-width encoding, CJK encoding, EUC
Extends	US-ASCII or ISO 646:JP
Transforms / Encodes	JIS X 0208, JIS X 0212, JIS X 0201
Succeeded by	EUC-JISx0213

EUC-JIS-2004
Alias(es)	EUC-JISx0213
Language(s)	Japanese, Ainu, English, Russian
Standard	JIS X 0213
Classification	Extended ASCII, variable-width encoding, CJK encoding, EUC
Extends	US-ASCII
Transforms / Encodes	JIS X 0213, JIS X 0201 (Kana)
Preceded by	EUC-JP

EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201. Other names for this encoding include Unixized JIS (or UJIS) and AT&T JIS.^[2] Since August 2018, EUC-JP has been used by 0.1% of all web pages,^[9] while 2.8% of Japanese-language websites use it (less than either Shift JIS or UTF-8). It is called Code page 954 by IBM.^[10]^[11] Microsoft has two code page numbers for this encoding (51932 and 20932).

This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards, and without ASCII bytes appearing as trail bytes (unlike Shift JIS).
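The trail-byte difference from Shift JIS is easy to see with Python's built-in `shift_jis` and `euc_jp` codecs. The helper `ascii_range_bytes` is a hypothetical name for this sketch; it lists the bytes of an encoded string that fall below 0x80.

```python
def ascii_range_bytes(text: str, encoding: str) -> list[int]:
    """Return the bytes below 0x80 in the encoded form of text."""
    return [b for b in text.encode(encoding) if b < 0x80]


# The katakana ソ has 0x5C (the ASCII backslash position) as its Shift JIS trail byte.
print(ascii_range_bytes("ソ", "shift_jis"))  # [92]
print(ascii_range_bytes("ソ", "euc_jp"))     # []
```

Software that scans bytes for ASCII delimiters such as the backslash can therefore process EUC-JP text safely, whereas the same scan may split a Shift JIS character in two.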

A related and partially compatible encoding, called EUC-JISx0213 or EUC-JIS-2004, encodes JIS X 0201 and JIS X 0213^[12] (similarly to Shift_JISx0213, its Shift_JIS-based counterpart).

Compared to EUC-CN or EUC-KR, EUC-JP did not become as widely adopted on PC and Macintosh systems in Japan, which used Shift JIS or its extensions (Windows code page 932 on Microsoft Windows, and MacJapanese on classic Mac OS), although it became heavily used by Unix or Unix-like operating systems (except for HP-UX). Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.

Characters are encoded as follows:

As an EUC/ISO 2022 compliant encoding, the C0 control characters, space and DEL are represented as in ASCII.
A graphical character from ASCII (code set 0) is represented as its usual one-byte representation, in the range 0x21 – 0x7E. While some variants of EUC-JP encode the lower half of JIS X 0201 here, most encode ASCII,^[13] including the W3C/WHATWG Encoding standard used by HTML5,^[14] and so does EUC-JIS-2004.^[12] While this means that 0x5C is typically mapped to Unicode as U+005C REVERSE SOLIDUS (the ASCII backslash), U+005C may be displayed as a Yen sign by certain Japanese-locale fonts, e.g. on Microsoft Windows, for compatibility with the lower half of JIS X 0201.^[15]^[16]
A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE. This differs from the ISO-2022-JP representation by having the high bit set. This code set may also contain vendor extensions in some EUC-JP variants. In EUC-JIS-2004, the first plane of JIS X 0213 is encoded here, which is effectively a superset of standard JIS X 0208.^[12]
A character from the upper half of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second being the usual JIS X 0201 representation in the range 0xA1 – 0xDF. This set may contain IBM vendor extensions in some variants.
A character from JIS X 0212 (code set 3) is represented in EUC-JP by three bytes, the first being 0x8F, the following two being in the range 0xA1–0xFE, i.e. with the high bit set. In addition to standard JIS X 0212, code set 3 of some EUC-JP variants may also contain extensions in rows 83 and 84 to represent characters from IBM's Shift JIS extensions which lack standard JIS X 0212 mappings, which may be coded in either of two layouts, one defined by IBM themselves and one defined by the OSF.^[17]^[18] In EUC-JIS-2004, the second plane of JIS X 0213 is encoded here,^[12] which does not collide with the allocated rows in standard JIS X 0212.^[19] Some implementations of EUC-JIS-2004, such as the one used by Python, allow both JIS X 0212 and JIS X 0213 plane 2 characters in this set.^[19]
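The four cases listed above can be sketched as a small classifier that walks an EUC-JP byte string and labels each character with its code set. The name `euc_jp_code_sets` is an assumption of this example. The check is structural only: it accepts any trail byte in 0xA1–0xFE, including after SS2, so vendor extensions are not rejected, and it does not verify that a code is assigned.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift control codes


def euc_jp_code_sets(data: bytes) -> list[tuple[int, bytes]]:
    """Return (code set number, byte sequence) for each character in data."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, length = 0, 1      # ASCII or JIS X 0201 lower half
        elif lead == SS2:
            code_set, length = 2, 2      # half-width kana: SS2 + one byte
        elif lead == SS3:
            code_set, length = 3, 3      # JIS X 0212: SS3 + two bytes
        elif 0xA1 <= lead <= 0xFE:
            code_set, length = 1, 2      # JIS X 0208: two GR bytes
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        seq = data[i:i + length]
        # Every byte after the lead must be present and lie in 0xA1-0xFE.
        if len(seq) != length or not all(0xA1 <= b <= 0xFE for b in seq[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        out.append((code_set, seq))
        i += length
    return out


for code_set, seq in euc_jp_code_sets("Aｱ亜".encode("euc_jp")):
    print(code_set, seq.hex())
```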

Related Japanese encoding methods

Vendor extensions to EUC-JP (from, for example, the Open Software Foundation, IBM or NEC) were often allocated within the individual code sets,^[17]^[18] as opposed to using invalid EUC sequences (as in popular extensions of EUC-CN and EUC-KR).

However, some vendor-specific encodings are partially compatible with EUC-JP, due to encoding JIS X 0208 over GR, but do not follow the packed EUC structure. Often, these do not include use of the single shifts from EUC-JP, and are thus not straight extensions of EUC-JP, with the exception of Super DEC Kanji.

DEC Kanji

Digital Equipment Corporation defines two variants of EUC-JP only partly conforming to the EUC packed format, but also bearing some resemblance to the complete two-byte format. The overall format of the "DEC Kanji" encoding mostly corresponds to fixed-width (complete two-byte) EUC; however, code set 0 is not required to be left-padded with null bytes (similarly to the packed format).^[20] JIS X 0208 is, as usual, used for code set 1; code set 2 (half-width katakana) is absent; code set 3 is encoded like the two-byte fixed width format (i.e. without a shift byte and with only the first high bit set), but used for two-byte user defined characters rather than being specified for JIS X 0212.^[20] In the basic "DEC Kanji" encoding, only the first 31 rows of code set 3 are used for user-defined characters: rows 32 through 94 are reserved, similarly to the unused rows in code set 1.^[21]

The "Super DEC Kanji" encoding accepts codes both from the "DEC Kanji" encoding and from packed-format EUC, for a total of five code-sets.^[20] It also allows the entire user defined code set, and the unused rows at the ends of the JIS X 0208 and JIS X 0212 code sets (rows 85–94 and 78–94 respectively), to be used for user-defined characters.^[21]

HP-16

Hewlett-Packard defines an encoding referred to as "HP-16". This accompanies their "HP-15" encoding, which is a variant of Shift JIS. HP-16 encodes JIS X 0208 using the same bytes as in EUC-JP, but does not use the single shift codes (thus omitting code sets 2 and 3), and adds three user-defined regions which do not follow the packed-format EUC structure:^[20]

Lead bytes 0xA1–C2, trail bytes 0x21–7E
Lead bytes 0xC3–E3, trail bytes 0x21–3F
Lead bytes 0xC3–E1, trail bytes 0x40–64

IKIS

The IKIS (Interactive Kanji Information System) encoding used by Data General resembles EUC-JP without single shifts, i.e. with only code sets 0 and 1. Half-width katakana are instead included in row 8 of JIS X 0208 (colliding with the box-drawing characters added to the standard in 1983). JIS X 0208 rows 9 through 12 are used for user-defined characters.^[20]^[21]

Adaptations of EUC-JP for EBCDIC

KEIS (Kanji-processing Extended Information System) is an EBCDIC encoding used by Hitachi,^[21] with double-byte characters (a DBCS-Host encoding) included using shifting sequences, making it a stateful encoding. Specifically, the sequence 0x0A 0x41 switches to single-byte mode and the sequence 0x0A 0x42 switches to double-byte mode.^[b] However, JIS X 0208 characters are encoded using the same byte sequences used to encode them in EUC-JP. This results in duplicate encodings for the ideographic space—0x4040 per the DBCS-Host code structure, and 0xA1A1 as in EUC-JP. This differs from IBM's DBCS-Host encoding for Japanese, the layout of which builds on versions which predate JIS X 0208 altogether. The lead byte range is extended back to 0x59, out of which the lead bytes 0x81–A0 are designated for user-defined characters,^[20] and the remainder are used for corporate-defined characters, including both kanji and non-kanji.^[21]

JEF (Japanese-processing Extended Feature)^[21] is an EBCDIC encoding used on Fujitsu FACOM mainframes, contrasting with FMR (a variant of Shift JIS) used on Fujitsu PCs. Like KEIS, JEF is a stateful encoding, switching to a double-byte DBCS-Host mode using shifting sequences (where 0x29 switches to single-byte mode and 0x28 switches to double-byte mode).^[22] Also similarly to KEIS, JIS X 0208 codes are represented the same as in EUC-JP.^[20] The lead byte range is extended back to 0x41, with 0x80–A0 designated for user definition; lead bytes 0x41–7F are assigned row numbers 101 through 163 for kuten purposes, although row 162 (lead byte 0x7E) is unused.^[20]^[21] Rows 101 through 148 are used for extended kanji, while rows 149 through 163 are used for extended non-kanji.^[21]

EUC-KR

EUC-KR
EUC-KR code structure
MIME / IANA	EUC-KR
Alias(es)	Wansung, IBM-970
Language(s)	Korean, English, Russian
Standard	KS X 2901 (KS C 5861)
Classification	Extended ISO 646, variable-width encoding, CJK encoding, EUC
Extends	US-ASCII or ISO 646:KR
Extensions	Mac OS Korean, IBM-949, Unified Hangul Code (Windows-949)
Transforms / Encodes	KS X 1001
Succeeded by	Unified Hangul Code (web standards)

EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601)^[23]^[24] and either ISO 646:KR (KS X 1003, formerly KS C 5636) or US-ASCII, depending on variant. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it as EUC-KR.

A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1–0xFE) and a character from KS X 1003 or US-ASCII (G0, code set 0) takes one byte in GL (0x21–0x7E).
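Going in the decoding direction, the row and cell of a KS X 1001 character can be recovered from its two GR bytes by subtracting 0xA0 from each. The helper name `euc_to_row_cell` is hypothetical, and the sketch relies on Python's built-in `euc_kr` codec only to produce sample bytes.

```python
def euc_to_row_cell(pair: bytes) -> tuple[int, int]:
    """Map a two-byte GR sequence back to its (row, cell) position in a 94x94 set."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in 0xA1..0xFE")
    return pair[0] - 0xA0, pair[1] - 0xA0


encoded = "한글".encode("euc_kr")
print(encoded.hex(" "))                 # c7 d1 b1 db
print(euc_to_row_cell(encoded[:2]))     # (39, 49)
```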

It is usually referred to as Wansung (Korean: 완성, romanized: Wanseong, lit. 'precomposed^[25]') in the Republic of Korea. When used with ASCII, it is called Code page 970 by IBM.^[26]^[27]^[28] It is implemented as Code page 20949 ("Korean Wansung")^[29]^[30] and Code page 51949 ("EUC Korean") by Microsoft.^[29]

As of April 2021, 0.1% of all web pages globally use EUC-KR,^[9] but the global figure is misleading: 10.5% of South Korean web pages use it (South Korea being the only country the encoding is meant for),^[31] making it the most popular non-Unicode encoding for any language or web domain, while only 6.0% of web pages in the Korean language use it (which would make UTF-8 less popular in South Korea than in seemingly any other country).^[32] Including extensions, it is the most widely used legacy character encoding in Korea on all three major platforms (macOS, other Unix-like OSes, and Windows), but its use has been shifting very slowly to UTF-8 as the latter gains popularity, especially on Linux and macOS.

As with most other encodings, UTF-8 is now preferred for new use, solving problems with consistency between platforms and vendors.

Related Korean encoding systems

Unified Hangul Code

A common extension of EUC-KR is the Unified Hangul Code (통합형 한글 코드, Tonghabhyeong Hangeul Kodeu,^[33] or 통합 완성형, Tonghab Wansunghyung), which is the default Korean codepage on Microsoft Windows. It is given the code page number 949 by Microsoft, and 1261^[34] or 1363^[35] by IBM. IBM's code page 949 is a different, unrelated, EUC-KR extension.

Unified Hangul Code extends EUC-KR by using codes which do not conform to the EUC structure to incorporate additional syllable blocks, completing the coverage of the composed syllable blocks available in Johab and Unicode. The W3C/WHATWG Encoding Standard used by HTML5 incorporates the Unified Hangul Code extensions into its definition of EUC-KR.^[36]
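A small sketch of the structural difference, using Python's built-in `cp949` codec for Unified Hangul Code. The helper `in_euc_kr_plane` is a hypothetical name; it only tests whether a two-byte code lies inside the 0xA1–0xFE by 0xA1–0xFE plane that plain EUC-KR uses for KS X 1001.

```python
def in_euc_kr_plane(code: bytes) -> bool:
    """True if code is two bytes that both lie in the EUC-KR GR range."""
    return len(code) == 2 and all(0xA1 <= b <= 0xFE for b in code)


# 한 is part of KS X 1001, so its Unified Hangul Code bytes are the EUC-KR bytes.
print(in_euc_kr_plane("한".encode("cp949")))   # True

# 똠 is a syllable block absent from KS X 1001; Unified Hangul Code places it
# at a code that falls outside the EUC structure.
print(in_euc_kr_plane("똠".encode("cp949")))   # False
```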

Mac OS Korean (HangulTalk)

Other encodings incorporating EUC-KR as a subset include the Mac OS Korean script (known as Code page 10003 or x-mac-korean),^[6] which was used by HangulTalk (MacOS-KH), the Korean localisation of the classic Mac OS. It was developed by Elex Computer (일렉스), who were at the time the authorised distributor of Apple Macintosh computers in South Korea.^[37]^[21]

HangulTalk adds extension characters with lead bytes between 0xA1 and 0xAD, both in unused space within the EUC-KR GR plane (trail bytes 0xA1–0xFE), and using non-EUC codes outside of it (trail bytes 0x41–0xA0). Some of these characters are font-style-independent stylised dingbats.^[21] Many of these characters do not have exact Unicode mappings, and Apple software maps these cases variously to combining sequences, to approximate mappings with an appended private-use character as a modifier for round-trip purposes, or to private-use characters.^[38]

Apple also uses certain single-byte codes outside of the EUC-KR plane for additional characters: 0x80 for a required space, 0x81 for a won sign (₩), 0x82 for an en dash (–), 0x83 for a copyright sign (©), 0x84 for a wide underscore (＿) and 0xFF for an ellipsis (…).^[38] Although none of these additional single-byte codes are within the lead byte range of plain EUC-KR (unlike Apple's extensions to EUC-CN, see above), some are within the lead byte range of Unified Hangul Code (specifically, 0x81, 0x82, 0x83 and 0x84).

EUC-KP

Similarly to KS X 1001, the North Korean KPS 9566 standard is typically used in EUC form; in these contexts, it is sometimes referred to as EUC-KP.^[39] More recent editions of the standard extend the EUC representation with characters using non-EUC two-byte codes, in a similar manner to Unified Hangul Code.^[40]

EUC-TW

EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94x94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Variants of Big5 are much more common than EUC-TW, although Big5 only encodes the first two planes of CNS 11643 hanzi, while UTF-8 is becoming more common.

As an EUC/ISO 2022 encoding, the C0 control characters, ASCII space and DEL are encoded as in ASCII.
A graphical character from US-ASCII (G0, code set 0) is encoded in GL as its usual single byte representation (0x21–0x7E).
A character from CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1–0xFE).
A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes:
- The first byte is always 0x8E (Single Shift 2).
- The second byte (0xA1–0xB0) indicates the plane, the number of which is obtained by subtracting 0xA0 from that byte.
- The third and fourth bytes are in GR (0xA1–0xFE).

Note that plane 1 of CNS 11643 is therefore encoded twice: once as code set 1 and once as part of code set 2.
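The byte layout described in this section can be sketched directly from plane, row, and cell numbers. The function names `euc_tw_encode` and `euc_tw_decode` and the `four_byte` flag are assumptions of this example; the code works on CNS 11643 positions only and includes no mapping to or from Unicode.

```python
SS2 = 0x8E  # Single Shift 2


def euc_tw_encode(plane: int, row: int, cell: int, four_byte: bool = False) -> bytes:
    """Encode a CNS 11643 plane/row/cell position as EUC-TW bytes."""
    if not (1 <= plane <= 16 and 1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("plane must be 1..16, row and cell 1..94")
    gr = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 1 and not four_byte:
        return gr                            # code set 1: two GR bytes
    return bytes([SS2, 0xA0 + plane]) + gr   # code set 2: SS2, plane byte, two GR bytes


def euc_tw_decode(seq: bytes) -> tuple[int, int, int]:
    """Return (plane, row, cell) for one two-byte or four-byte EUC-TW sequence."""
    if len(seq) == 4 and seq[0] == SS2 and 0xA1 <= seq[1] <= 0xB0:
        plane, body = seq[1] - 0xA0, seq[2:]
    elif len(seq) == 2:
        plane, body = 1, seq
    else:
        raise ValueError("not a two-byte or four-byte EUC-TW sequence")
    if not all(0xA1 <= b <= 0xFE for b in body):
        raise ValueError("row and cell bytes must be in 0xA1..0xFE")
    return plane, body[0] - 0xA0, body[1] - 0xA0


short = euc_tw_encode(1, 36, 1)                  # c4 a1
long = euc_tw_encode(1, 36, 1, four_byte=True)   # 8e a1 c4 a1
print(short.hex(" "), "|", long.hex(" "))
print(euc_tw_decode(short) == euc_tw_decode(long))  # True: plane 1 has two encodings
```

A decoder that compares strings byte for byte may want to normalise plane 1 characters to one of the two forms first, since both decode to the same position.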

Notes

^ 7-bit ISO 2022 code versions supporting GB 2312 include ISO-2022-CN (with shift codes) and ISO-2022-JP-2 (without shift codes), both of which also support other non-ASCII sets.
^ These sequences match the hexadecimal forms shown by DEC^[22] and the decimal forms (10 65 and 10 66) listed by Lunde.^[20] Lunde lists the hexadecimal forms for both as 0xA0 0x42, seemingly in error.

References

^ a b c d IBM. "Character Data Representation Architecture (CDRA)". pp. 157–162.
^ a b c Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. pp. 242–244. ISBN 9780596800925.
^ "Character Sets". IANA.
^ "4.2. Names and labels". Encoding Standard. WHATWG.
^ a b c d "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.
^ a b "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.
^ Lunde, Ken (1998). Appendix F: GB/T 12345 (PDF). CJKV Information Processing. O'Reilly Media. ISBN 9781565922242.
^ Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.
^ a b "Historical trends in the usage of character encodings for websites". W3Techs.
^ "CCSID 954 information document". Archived from the original on 2016-03-27.
^ International Components for Unicode (ICU), ibm-954_P101-2007.ucm, 2002-12-03
^ a b c d "JIS X 0213 Code Mapping Tables". x0213.org.
^ "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.
^ "EUC-JP decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte, return a code point whose value is byte."
^ "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
^ Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".
^ a b "4.2 Review Process of Rules for Code Set Conversion Between eucJP-open and UCS". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.
^ a b Lunde, Ken (13 January 2009). "Appendix J: Japanese Character Sets" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.
^ a b Chang, Hyeshik. "Readme for CJKCodecs". cPython. Python Software Foundation.
^ a b c d e f g h i Lunde, Ken (13 January 2009). "Appendix F: Vendor Encoding Methods" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.
^ a b c d e f g h i j Lunde, Ken (2009). "Appendix E: Vendor Character Set Standards" (PDF). CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (2nd ed.). Sebastopol, CA: O'Reilly. ISBN 978-0-596-51447-1.
^ a b "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq.
^ "KS X 1001:1992" (PDF).
^ "KS C 5601:1987" (PDF). 1988-10-01.
^ Lunde, Ken (2009). "Chapter 3: Character Set Standards". CJKV Information Processing. p. 146. ISBN 978-0596514471.
^ "CCSID 970". IBM Globalization. IBM. Archived from the original on 2014-12-01.
^ "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.
^ International Components for Unicode (ICU), ibm-970_P110_P110-2006_U2.ucm, 2002-12-03
^ a b "Code Page Identifiers". Windows Dev Center. Microsoft.
^ Julliard, Alexandre. "dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file". make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project.
^ "Distribution of Character Encodings among websites that use .kr". w3techs.com. Retrieved 2021-04-01.
^ "Distribution of Character Encodings among websites that use Korean". w3techs.com. Retrieved 2020-07-03.
^ "한글 코드에 대하여" (in Korean). W3C. Archived from the original on 2013-05-24. Retrieved 2019-01-07.
^ In ucnv_lmb.cpp, a file originating from IBM and included in the International Components for Unicode source tree, the lead byte 0x11 is commented as referring to "Korean: ibm-1261" after the definition of ULMBCS_GRP_KO, and is mapped to the "windows-949" ICU codec in the OptGroupByteToCPName array later in the file.
^ "Coded character set identifiers - CCSID 1363", IBM Globalization, IBM, archived from the original on 2014-11-29
^ "5. Indexes (§ index EUC-KR)", Encoding Standard, WHATWG
^ Gil, Hojin. "HangulTalk: De facto standard Hangul environment for Mac". Guide to using Hangul on Macintosh.
^ a b Apple (2005-04-05). "Map (external version) from Mac OS Korean encoding to Unicode 3.2 and later". Unicode Consortium.
^ Kim, Kyongsok (2002-11-30). "3-way cross-reference tables - KS X 1001, KPS 9566, and UCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 N2564.
^ Chung, Jaemin (2018-01-05). "Information on the most recent version of KPS 9566 (KPS 9566-2011?)" (PDF). UTC L2/18-011.

External links

EUC-JP codeset table (minus the ASCII and halfwidth parts)
Code Page Identifiers
GB18030-2000 – The New Chinese National Standard
The New Generation of Pre-Press Software in China – mentions the 748 code
Description of the EUC-TW code (in Chinese)
Manual page of EUC-JISX0213 in the Perl Encode module
International Register of Coded Character Sets to be Used With Escape Sequence – section 2.4 (p.14f.) with the coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
Chinese, Japanese, and Korean character set standards and encoding systems

[6] 7-bit ISO 2022 code versions supporting GB 2312 include ISO-2022-CN (with shift codes) and ISO-2022-JP-2 (without shift codes), both of which also support other non-ASCII sets.

[24] These sequences match the hexadecimal forms shown by DEC^[22] and the decimal forms (10 65 and 10 66) listed by Lunde.^[20] Lunde lists the hexadecimal forms for both as 0xA0 0x42, seemingly in error.

[cdra-1] IBM. "Character Data Representation Architecture (CDRA)". pp. 157–162.

[lunde-2] Lunde, Ken (2008). CJKV Information Processing: Chinese, Japanese, Korean, and Vietnamese Computing. O'Reilly. pp. 242–244. ISBN 9780596800925.

[ianaeuc-3] "Character Sets". IANA.

[4] "4.2. Names and labels". Encoding Standard. WHATWG.

[macsimchinese-5] "Map (external version) from Mac OS Chinese Simplified encoding to Unicode 3.0 and later". Apple, Inc.

[msdnlabels-7] "Encoding.WindowsCodePage Property - .NET Framework (current version)". MSDN. Microsoft.

[cjkv-12345-8] Lunde, Ken (1998). Appendix F: GB/T 12345 (PDF). CJKV Information Processing. O'Reilly Media. ISBN 9781565922242.

[gb18030-9] Standardization Administration of China (SAC) (2005-11-18). GB 18030-2005: Information Technology—Chinese coded character set.

[w3techs-10] "Historical trends in the usage of character encodings for websites". W3Techs.

[11] "CCSID 954 information document". Archived from the original on 2016-03-27.

[12] International Components for Unicode (ICU), ibm-954_P101-2007.ucm, 2002-12-03

[x0213org-13] "JIS X 0213 Code Mapping Tables". x0213.org.

[w3cxmleuc-14] "Ambiguities in conversion from Japanese EUC to Unicode (Non-Normative)". XML Japanese Profile. W3C.

[15] "EUC-JP decoder". Encoding Standard. WHATWG. "If byte is an ASCII byte, return a code point whose value is byte."

[16] "3.1.1 Details of Problems". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.

[17] Kaplan, Michael S. (2005-09-17). "When is a backslash not a backslash?".

[osfibmextensions-18] "4.2 Review Process of Rules for Code Set Conversion Between eucJP-open and UCS". Problems and Solutions for Unicode and User/Vendor Defined Characters. The Open Group Japan. Archived from the original on 1999-02-03. Retrieved 2019-08-14.

[lundeJ-19] Lunde, Ken (13 January 2009). "Appendix J: Japanese Character Sets" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.

[hyeshik-20] Chang, Hyeshik. "Readme for CJKCodecs". cPython. Python Software Foundation.

[lundeF-21] ^ a b c d e f g h i Lunde, Ken (13 January 2009). "Appendix F: Vendor Encoding Methods" (PDF). CJKV Information Processing (2nd ed.). ISBN 978-0-596-51447-1.

[lunde2009appE-22] Lunde, Ken (2009). "Appendix E: Vendor Character Set Standards" (PDF). CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing (2nd ed.). Sebastopol, CA: O'Reilly. ISBN 978-0-596-51447-1.

[decunix-23] "2: Codesets and Codeset Conversion". DIGITAL UNIX Technical Reference for Using Japanese Features. Digital Equipment Corporation, Compaq.

[ksx-25] "KS X 1001:1992" (PDF).

[ksc-26] "KS C 5601:1987" (PDF). 1988-10-01.

[27] Lunde, Ken (2009). "Chapter 3: Character Set Standards". CJKV Information Processing. p. 146. ISBN 978-0596514471.

[28] "CCSID 970". IBM Globalization. IBM. Archived from the original on 2014-12-01.

[29] "ibm-970_P110_P110-2006_U2 (alias euc-kr)". Converter Explorer - ICU Demonstration. International Components for Unicode.

[30] International Components for Unicode (ICU), ibm-970_P110_P110-2006_U2.ucm, 2002-12-03

[winids-31] "Code Page Identifiers". Windows Dev Center. Microsoft.

[32] Julliard, Alexandre. "dump_krwansung_codepage: build Korean Wansung table from the KSX1001 file". make_unicode: Generate code page .c files from ftp.unicode.org descriptions. Wine Project.

[33] "Distribution of Character Encodings among websites that use .kr". w3techs.com. Retrieved 2021-04-01.

[34] "Distribution of Character Encodings among websites that use Korean". w3techs.com. Retrieved 2020-07-03.

[35] "한글 코드에 대하여" (in Korean). W3C. Archived from the original on 2013-05-24. Retrieved 2019-01-07.

[36] In ucnv_lmb.cpp, a file originating from IBM and included in the International Components for Unicode source tree, the lead byte 0x11 is commented as referring to "Korean: ibm-1261" after the definition of ULMBCS_GRP_KO, and is mapped to the "windows-949" ICU codec in the OptGroupByteToCPName array later in the file.

[37] "Coded character set identifiers - CCSID 1363", IBM Globalization, IBM, archived from the original on 2014-11-29

[whatwgext-38] "5. Indexes (§ index EUC-KR)", Encoding Standard, WHATWG

[39] Gil, Hojin. "HangulTalk: De facto standard Hangul environment for Mac". Guide to using Hangul on Macintosh.

[mackoreantxt-40] Apple (2005-04-05). "Map (external version) from Mac OS Korean encoding to Unicode 3.2 and later". Unicode Consortium.

[41] Kim, Kyongsok (2002-11-30). "3-way cross-reference tables - KS X 1001, KPS 9566, and UCS" (PDF). ISO/IEC JTC 1/SC 2/WG 2 N2564.

[42] Chung, Jaemin (2018-01-05). "Information on the most recent version of KPS 9566 (KPS 9566-2011?)" (PDF). UTC L2/18-011.
