Re: [1.7] Support for CJK Character Sets

neomjp Tue, 07 Apr 2009 06:02:19 -0700

On 2009/04/07 3:07, Corinna Vinschen wrote:
> > I would nevertheless be glad if you would write something up about them,
> > so we have it in the records should we ever re-examine this issue.


        Just for information. Sorry it has become very long... I
thought I should write some background information so that others can
understand...

On 2009/04/04 4:20, Corinna Vinschen wrote:
> > Windows.  http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx
> > contains all supported codepages/charsets.  If you look for
> > the codepages 50220-50222, you'll see they are all called ISO 2022
> > Japanese.  In Cygwin I'm using 50220 for JIS.  Is that correct?
> > Or should I rather use one of 50221 or 50222?

        A short answer: standard ISO-2022-JP as specified in RFC1468
should be used.

        If you look at RFC1468, you will see that there are two standard
character sets for Japanese, JIS X 0201 and JIS X 0208.

        The first version of JIS X 0201 dates back to 1969, when
everything was single byte. It is a Japanese extension of the ASCII code
that adds Katakana characters.
http://en.wikipedia.org/wiki/JIS_X_0201
For example, 'HALFWIDTH KATAKANA LETTER A' (U+FF71) is

$ printf "\xff\x71"| iconv -f UTF-16BE -t JISX0201-1976| od -t x1 -t a
0000000  b1
          1
0000001

        Note that this is a single byte with the highest bit set.
0x00-0x7f are called the "Roman" set, and 0x80-0xff the "Kana" set.
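
        The same check can be made from Python. This is only a sketch:
Python ships no standalone JIS X 0201 codec, so the shift_jis codec is
assumed here as a stand-in, because Shift_JIS keeps halfwidth Katakana at
their 8-bit JIS X 0201 positions.

```python
# U+FF71 HALFWIDTH KATAKANA LETTER A
halfwidth_a = "\uff71"

# Assumption: shift_jis stands in for 8-bit JIS X 0201 here.
encoded = halfwidth_a.encode("shift_jis")

print(encoded.hex())            # b1
print(len(encoded))             # 1
print(bool(encoded[0] & 0x80))  # True: the highest bit is set
```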

        Then came the multibyte era, and the Japanese Standards Association
made a new standard for multibyte characters. This was JIS X 0208, and it
encompassed Hiragana and Katakana characters and many of the frequently
used Kanji. For example, 'KATAKANA LETTER A' (U+30A2) is

$ printf "\x30\xa2"| iconv -f UTF-16BE -t ISO-IR-87| od -t x1 -t a
0000000  25  22
          %   "
0000002

        Note that this is identical to the ASCII string %" . That is why
something was needed to distinguish between ASCII and JIS X 0208. So
escape sequences were devised, as in RFC1468. This is ISO-2022-JP, the
so-called "JIS" encoding. For example, 'KATAKANA LETTER A' (U+30A2) is

$ printf "\x30\xa2"| iconv -f UTF-16BE -t ISO-2022-JP| od -t x1 -t a
0000000  1b  24  42  25  22  1b  28  42
        esc   $   B   %   " esc   (   B
0000010

        Note that %" is delimited by escape sequences.
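
        The collision with ASCII and the escape-sequence framing can also
be reproduced in Python. A minimal sketch, assuming the standard euc_jp
and iso2022_jp codecs; the raw JIS X 0208 code is recovered from EUC-JP
by clearing the highest bit of each byte.

```python
katakana_a = "\u30a2"  # KATAKANA LETTER A

# EUC-JP stores a JIS X 0208 character as its two 7-bit code bytes
# with the highest bit set, so clearing that bit recovers the raw code.
raw_jisx0208 = bytes(b & 0x7F for b in katakana_a.encode("euc_jp"))
print(raw_jisx0208)   # b'%"'  (indistinguishable from the ASCII string %")

# ISO-2022-JP wraps the same two bytes in ESC $ B ... ESC ( B.
iso2022 = katakana_a.encode("iso2022_jp")
print(iso2022)        # b'\x1b$B%"\x1b(B'
```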

        As is written in RFC1468, ISO-2022-JP was made for email.
The original SMTP transfers 7 bits and clears the highest bit to zero.
So any encoding that sets the highest bit cannot be used over SMTP.
Thus, support for the Kana set of JIS X 0201 was not included in
ISO-2022-JP.
For example, 'HALFWIDTH KATAKANA LETTER A' (U+FF71) is

$ printf "\xff\x71"| iconv -f UTF-16BE -t ISO-2022-JP
iconv: (stdin):1:0: cannot convert

        This is the expected behavior for standard ISO-2022-JP. Halfwidth
Katakana in JIS X 0201 is not supported.
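
        Python's iso2022_jp codec behaves the same way, and the reason
is easy to see once the highest bit is cleared the way a 7-bit transport
would do it. A small sketch (the variable names are only illustrative):

```python
halfwidth_a = "\uff71"  # HALFWIDTH KATAKANA LETTER A

# Standard ISO-2022-JP has no place for halfwidth Katakana.
try:
    halfwidth_a.encode("iso2022_jp")
    rejected = False
except UnicodeEncodeError:
    rejected = True
print(rejected)   # True

# A 7-bit channel would turn the JIS X 0201 byte 0xb1 into ASCII "1".
stripped = 0xB1 & 0x7F
print(hex(stripped), chr(stripped))   # 0x31 1
```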

        But then came the time when Windows computers got connected to
the net. Japanese Windows had used its own extension (CP932) of the
SHIFT_JIS encoding, but it had to support ISO-2022-JP for email. Then
Microsoft found that some characters in CP932 were not supported in
ISO-2022-JP. Instead of respecting the RFC1468 standard, Microsoft made
its own modifications to (or deviations from) it.

(1) They added Microsoft-specific characters, signs and symbols.
(2) They added JIS X 0201 halfwidth Katakana.

        (1) was a big problem, and it has caused compatibility problems
with Mac/Linux/Unix. But I will not go into details because every cygwin
user is using Windows.

        For (2), Microsoft made three different modifications. These are
codepages 50220-50222.

        CP50220 forces all halfwidth Katakana into their fullwidth (or
double byte) counterparts in JIS X 0208. For example, a sequence of
'KATAKANA LETTER A' (U+30A2) and 'HALFWIDTH KATAKANA LETTER A' (U+FF71)
becomes

$ printf "\x30\xa2\xff\x71"|ruby -r nkf -ne 'print(NKF.nkf("-W16B -X -j", $_))'| od -t x1 -t a
0000000  1b  24  42  25  22  25  22  1b  28  42
        esc   $   B   %   "   %   " esc   (   B
0000012

        Note that both 'KATAKANA LETTER A' and 'HALFWIDTH KATAKANA LETTER A'
are converted into the same 'KATAKANA LETTER A', or %", and the pair is
enclosed by two escape sequences. Thus two files that differ only in
these characters cannot be distinguished. The conversion of CP50220 ->
UTF-16 -> CP50220 is not guaranteed to yield a result identical to the
original.
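
        The loss of information can be imitated in Python. This is a
sketch under an assumption: Python has no cp50220 codec, so NFKC
normalization is used here to imitate the halfwidth-to-fullwidth folding
before encoding to standard ISO-2022-JP. The real CP50220 tables may
differ in details.

```python
import unicodedata

text = "\u30a2\uff71"  # fullwidth A followed by halfwidth A

# Assumption: NFKC imitates the CP50220-style folding of halfwidth Katakana.
folded = unicodedata.normalize("NFKC", text)
encoded = folded.encode("iso2022_jp")
print(encoded)   # b'\x1b$B%"%"\x1b(B'

# Decoding cannot tell which of the two characters was halfwidth.
round_trip = encoded.decode("iso2022_jp")
print(round_trip == text)   # False
```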

        CP50221 introduced another escape sequence, ESC ( I, to designate
JIS X 0201 halfwidth Katakana. For example, a sequence of 'KATAKANA
LETTER A' (U+30A2) and 'HALFWIDTH KATAKANA LETTER A' (U+FF71) becomes

$ printf "\x30\xa2\xff\x71"|ruby -r nkf -ne 'print(NKF.nkf("-W16B -x -j", $_))'| od -t x1 -t a
0000000  1b  24  42  25  22  1b  28  49  31  1b  28  42
        esc   $   B   %   " esc   (   I   1 esc   (   B
0000014

        Note that the 'KATAKANA LETTER A' part is the same as in CP50220
(1b 24 42 25 22), but the highest bit of 0xb1 for 'HALFWIDTH KATAKANA
LETTER A' is cleared to give 0x31, and it is preceded by esc ( I .
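
        Python has no cp50221 codec either, but its iso2022_jp_ext codec
also designates JIS X 0201 Katakana with ESC ( I, so it is assumed here
as a stand-in to show that this scheme round-trips. It is not claimed to
match CP50221 for other characters.

```python
text = "\u30a2\uff71"  # fullwidth A followed by halfwidth A

# Assumption: iso2022_jp_ext is a stand-in, not CP50221 itself.
encoded = text.encode("iso2022_jp_ext")
print(encoded.hex(" "))

# The ESC ( I designation is present, and the halfwidth byte is 7-bit.
has_kana_escape = b"\x1b(I" in encoded
seven_bit = all(b < 0x80 for b in encoded)
print(has_kana_escape, seven_bit)

# Unlike the folding approach, decoding restores the original string.
round_trip = encoded.decode("iso2022_jp_ext")
print(round_trip == text)
```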

        CP50222 uses yet another scheme: the combination of the
RFC1468-defined ESC(J and SHIFT-OUT(0x0e)/SHIFT-IN(0x0f). I do not know
of any utility to simulate it. But if, for example, I create a file whose
name is the sequence 'KATAKANA LETTER A' (U+30A2), 'HALFWIDTH KATAKANA
LETTER A' (U+FF71), run chcp 50222, redirect the output of dir to a
file, and run od on that file, I get

 1b  24  42  25  22  1b  28  4a  0e  31  1b  28  42
esc   $   B   %   " esc   (   J  so   1 esc   (   B

        Note that the 'KATAKANA LETTER A' part is the same as in CP50220
(1b 24 42 25 22), but the highest bit of 0xb1 for 'HALFWIDTH KATAKANA
LETTER A' is cleared to give 0x31, and it is preceded by esc ( J and
so(0x0e). (si(0x0f) is not used here because it is at the end.)
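
        Since no utility is known to simulate CP50222, the dump above
can at least be taken apart mechanically. The following is a hypothetical
helper (annotate is not part of any library) that only labels escape
sequences and shift characters. It is not a decoder and assumes that
every escape sequence is three bytes long, as in the examples here.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Three-byte designations that appear in this message or in RFC1468.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b(I": "JIS X 0201 Katakana",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}

def annotate(data):
    """Label each escape sequence, shift character and data byte."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = bytes(data[i:i + 3])
            out.append("designate " + DESIGNATIONS.get(seq, "unknown"))
            i += 3
        elif data[i] == SO:
            out.append("shift out")
            i += 1
        elif data[i] == SI:
            out.append("shift in")
            i += 1
        else:
            out.append("byte 0x%02x" % data[i])
            i += 1
    return out

# The bytes observed with chcp 50222 above.
cp50222_dump = bytes.fromhex("1b2442 2522 1b284a 0e 31 1b2842")
labels = annotate(cp50222_dump)
for label in labels:
    print(label)
```

Run on the CP50221 bytes instead, the same helper shows a "designate
JIS X 0201 Katakana" step and no shift out.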

        In contrast to CP50220, CP50221 and CP50222 support conversion
of halfwidth Katakana to and from UTF-16. But their output is not valid
ISO-2022-JP, and it is not guaranteed to work with all applications. For
example, iconv cannot handle it correctly. CP50220 produces valid
ISO-2022-JP, but as I wrote before, two files with halfwidth and double
byte Kana cannot be distinguished.

        Conclusion: users should be careful when using CP50220-50222.
They should stick to standard ISO-2022-JP.

> > No, no.  I implemented the full three-byte variation as it is, for
> > instance, implemented in glibc as well.  Cygwin converts the CP 20932
> > doublebyte sequences for JIS X 0212 into Unix compatible triplebyte
> > sequences and vice versa.  So Cygwin applications see a Unix-like
> > eucJP implementation, not a CP 20932 implementation.

        Thanks, that is more than I expected.
--
neomjp
