On 2009/04/07 3:07, Corinna Vinschen wrote:
> > I would nevertheless be glad if you would write something up about them,
> > so we have it in the records should we ever re-examine this issue.

Just for information. Sorry this has become very long; I thought I should write some background information so that others can follow.

On 2009/04/04 4:20, Corinna Vinschen wrote:
> > Windows. http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
> > contains all supported codepages/charsets. If you look for
> > the codepages 50220-50222, you'll see they are all called ISO 2022
> > Japanese. In Cygwin I'm using 50220 for JIS. Is that correct?
> > Or should I rather use one of 50221 or 50222?

A short answer: standard ISO-2022-JP as specified in RFC1468 should be used.

If you look at RFC1468, you will see that there are two standard character sets for Japanese, JIS X 0201 and JIS X 0208.

The first version of JIS X 0201 dates back to 1969, when everything was single byte. It is a Japanese extension of ASCII that adds Katakana characters.
http://en.wikipedia.org/wiki/JIS_X_0201
For example, 'HALFWIDTH KATAKANA LETTER A' (U+FF71) is

$ printf "\xff\x71"| iconv -f UTF-16BE -t JISX0201-1976| od -t x1 -t a
0000000  b1
          1
0000001

Note that this is a single byte with the highest bit set to 1. 0x00-0x7f is called the "Roman" set, and 0x80-0xff is called the "Kana" set.

Then came the multibyte era, and the Japanese Standards Association made a new standard for multibyte characters. This was JIS X 0208, and it encompassed the Hiragana and Katakana characters and many of the frequently used Kanji. For example, 'KATAKANA LETTER A' (U+30A2) is

$ printf "\x30\xa2"| iconv -f UTF-16BE -t ISO-IR-87| od -t x1 -t a
0000000  25  22
          %   "
0000002

Note that this is identical to the ASCII string %" . That is why something was needed to distinguish between ASCII and JIS X 0208. So the escape sequences were devised as in RFC1468. This is ISO-2022-JP, the so-called "JIS" encoding. For example, 'KATAKANA LETTER A' (U+30A2) is

$ printf "\x30\xa2"| iconv -f UTF-16BE -t ISO-2022-JP| od -t x1 -t a
0000000  1b  24  42  25  22  1b  28  42
        esc   $   B   %   "  esc   (   B
0000010

Note that %" is delimited by escape sequences.

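For readers without iconv at hand, the three byte sequences above can be reproduced with a minimal Python sketch. It rests on assumptions that are not part of the original mail: Python has no bare JIS X 0201 or ISO-IR-87 codec, so `shift_jis` and `euc_jp` are used here only as stand-ins to expose the raw codes, and the function names are made up for illustration.

```python
# Sketch only. shift_jis and euc_jp are stand-ins (an assumption of this
# example), because Python ships no bare JIS X 0201 / ISO-IR-87 codec.

def jisx0201_byte(ch):
    """Single-byte JIS X 0201 code of a halfwidth Katakana.

    Shift_JIS keeps the JIS X 0201 Kana set at the same byte values.
    """
    return ch.encode("shift_jis")


def jisx0208_bytes(ch):
    """7-bit JIS X 0208 code: the EUC-JP bytes with the highest bit cleared."""
    return bytes(b & 0x7F for b in ch.encode("euc_jp"))


def iso2022jp_bytes(text):
    """ISO-2022-JP (RFC1468) bytes, including the escape sequences."""
    return text.encode("iso2022_jp")


def hexdump(data):
    """Format bytes like the x1 row of od."""
    return " ".join(f"{b:02x}" for b in data)


if __name__ == "__main__":
    print(hexdump(jisx0201_byte("\uff71")))    # HALFWIDTH KATAKANA LETTER A
    print(hexdump(jisx0208_bytes("\u30a2")))   # KATAKANA LETTER A, bare code
    print(hexdump(iso2022jp_bytes("\u30a2")))  # same, wrapped in escapes
```

The second and third results differ only by the ESC $ B prefix and the ESC ( B suffix, which is the whole point of the escape sequences: without them the two bytes are indistinguishable from the ASCII string %" .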
As is written in RFC1468, ISO-2022-JP was made for email. The original SMTP transfers 7 bits and clears the highest bit to zero. So any encoding that sets the highest bit to 1 cannot be used over SMTP. Thus, support for the JIS X 0201 Kana set was not included in ISO-2022-JP. For example, 'HALFWIDTH KATAKANA LETTER A' (U+FF71) is

$ printf "\xff\x71"| iconv -f UTF-16BE -t ISO-2022-JP
iconv: (stdin):1:0: cannot convert

This is the expected behavior for standard ISO-2022-JP. Halfwidth Katakana in JIS X 0201 is not supported.

But then came the time when Windows computers got connected to the net. Japanese Windows had used its own extension (CP932) of the SHIFT_JIS encoding, but it had to support ISO-2022-JP for email. Then they found that some characters in CP932 were not supported in ISO-2022-JP. Instead of respecting the RFC1468 standard, Microsoft made their own modifications to (or deviations from) it.

(1) They added Microsoft-specific characters, signs and symbols.
(2) They added JIS X 0201 halfwidth Katakana.

(1) was a big problem, and it has caused compatibility problems with Mac/Linux/Unix. But I will not go into details because every Cygwin user is using Windows.

For (2), Microsoft made three kinds of modification. These are codepages 50220-50222.

CP50220 forces all halfwidth Katakana into their fullwidth (or double-byte) counterparts in JIS X 0208. For example, a sequence of 'KATAKANA LETTER A' (U+30A2) and 'HALFWIDTH KATAKANA LETTER A' (U+FF71) becomes

$ printf "\x30\xa2\xff\x71"|ruby -r nkf -ne 'print(NKF.nkf("-W16B -X -j", $_))'| od -t x1 -t a
0000000  1b  24  42  25  22  25  22  1b  28  42
        esc   $   B   %   "   %   "  esc   (   B
0000012

Note that both 'KATAKANA LETTER A' and 'HALFWIDTH KATAKANA LETTER A' are converted into the same 'KATAKANA LETTER A', or %" , and the pair is delimited by two escape sequences. Thus two files whose names differ only in these characters cannot be distinguished. The conversion CP50220 -> UTF-16 -> CP50220 is not guaranteed to yield a result identical to the original.

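The three points above (the 7-bit constraint, the rejection of halfwidth Katakana by standard ISO-2022-JP, and the lossy folding done by CP50220) can be sketched in Python as follows. This is only an illustration under stated assumptions: Unicode NFKC normalization is used as a stand-in for the CP50220 halfwidth-to-fullwidth mapping and is not claimed to match what Windows does in every case, and all function names are hypothetical.

```python
import unicodedata


def clear_high_bit(data):
    """What a 7-bit transport does to each byte: the highest bit is lost."""
    return bytes(b & 0x7F for b in data)


def fold_halfwidth_katakana(text):
    """Replace halfwidth Katakana (U+FF61..U+FF9F) with fullwidth forms.

    Assumption: NFKC stands in for the CP50220 mapping.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if "\uff61" <= ch <= "\uff9f" else ch
        for ch in text
    )


def encode_strict(text):
    """Standard ISO-2022-JP; raises UnicodeEncodeError on halfwidth Katakana."""
    return text.encode("iso2022_jp")


def encode_folding(text):
    """CP50220-like behaviour: fold to fullwidth first, then encode."""
    return encode_strict(fold_halfwidth_katakana(text))


if __name__ == "__main__":
    # 0xb1 (halfwidth A in JIS X 0201) turns into the ASCII digit "1".
    print(clear_high_bit(b"\xb1"))

    original = "\u30a2\uff71"  # KATAKANA A + HALFWIDTH KATAKANA A
    try:
        encode_strict(original)
    except UnicodeEncodeError as err:
        print("strict ISO-2022-JP refuses:", err.reason)

    encoded = encode_folding(original)
    restored = encoded.decode("iso2022_jp")
    print(encoded)
    print("round trip preserved:", restored == original)
```

The round trip comes back as two fullwidth characters, so the information about which character was halfwidth is gone once the text has been encoded this way.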
CP50221 introduced another escape sequence, ESC ( I, to designate JIS X 0201 halfwidth Katakana. For example, a sequence of 'KATAKANA LETTER A' (U+30A2) and 'HALFWIDTH KATAKANA LETTER A' (U+FF71) becomes

$ printf "\x30\xa2\xff\x71"|ruby -r nkf -ne 'print(NKF.nkf("-W16B -x -j", $_))'| od -t x1 -t a
0000000  1b  24  42  25  22  1b  28  49  31  1b  28  42
        esc   $   B   %   "  esc   (   I   1  esc   (   B
0000014

Note that the 'KATAKANA LETTER A' part is the same as in CP50220 (1b 24 42 25 22), but the highest bit of 0xb1 for 'HALFWIDTH KATAKANA LETTER A' is dropped to give 0x31, and it is preceded by esc ( I .

CP50222 uses yet another escape sequence. It combines the RFC1468-defined ESC ( J with SHIFT-OUT (0x0e) / SHIFT-IN (0x0f). I do not know of any utility to simulate it, but if I create a file whose name is a sequence of 'KATAKANA LETTER A' (U+30A2) and 'HALFWIDTH KATAKANA LETTER A' (U+FF71), run chcp 50222, redirect the output of dir to some file, and run od on it, I get

 1b  24  42  25  22  1b  28  4a  0e  31  1b  28  42
esc   $   B   %   "  esc   (   J  so   1  esc   (   B

Note that the 'KATAKANA LETTER A' part is the same as in CP50220 (1b 24 42 25 22), but the highest bit of 0xb1 for 'HALFWIDTH KATAKANA LETTER A' is dropped to give 0x31, and it is preceded by esc ( J and so (0x0e). (si (0x0f) is not used here because the character is at the end.)

In contrast to CP50220, CP50221 and CP50222 support conversion of halfwidth Katakana to and from UTF-16. But their output is not valid ISO-2022-JP and is not guaranteed to work with all applications. For example, iconv cannot handle it correctly. CP50220 produces valid ISO-2022-JP, but as I wrote before, two files with halfwidth and double-byte Kana cannot be distinguished.

Conclusion: users should be careful when using CP50220-50222. They should stick to standard ISO-2022-JP.

> > No, no. I implemented the full three-byte variation as it is, for
> > instance, implemented in glibc as well. Cygwin converts the CP 20932
> > doublebyte sequences for JIS X 0212 into Unix compatible triplebyte
> > sequences and vice versa.
> > So Cygwin applications see a Unix-like
> > eucJP implementation, not a CP 20932 implementation.

Thanks, that is more than I expected.

--
neomjp