Corinna Vinschen wrote on Wednesday, May 13, 2009 10:30:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>>> I propose that the filename encoding in C locale uses UTF-8 instead
>>> of SO/UTF-8.
>>>
>>> There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did
>> you try it?  Does it, for instance, help for the issue reported in
>> the thread starting at
>> http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread,
> I'm at a loss how to solve this problem in a generic way.
>

I may be dense, as all of my internationalization experience is from the
late '90s. But in my experience the only solution for this is a conscious
effort on the part of the user (or admin).

> The problem is that the filename changes dependent on the
> character set used in $LANG.  The reason is that every time a
> multibyte filename has to be generated, it has to be
> converted from UTF-16 to multibyte.
>
> For instance, taking one of the filenames from Lenik's
> example.  It's stored on the filesystem as the UTF-16
> sequence \u684c \u9762.  If I set LANG to en_US.UTF-8, a
> readdir(2) call returns the multibyte sequence
>
>   0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>   0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>   0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the
> application, the idea of the filename differs.  That's not
> exactly helpful for interoperability between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
>
>     But what if an application sets LANG="xxxx.SJIS" and tries to
>     create a file using SJIS character encoding?  Should the file be
>     created using the SJIS->UTF-16 conversion or should open fail with
>     EILSEQ?  That's not good.
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.

If nothing is set, use UTF-8, as it will work with existing code.

>     If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
>     Cygwin uses that to convert pathnames.  If the application uses
>     setlocale, Cygwin uses that setting to convert pathnames.
>
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.

This is the user's problem to resolve.

>
> Any better ideas?
>

Not necessarily better, but here is a chart:

Sys:    App:    function expects/returns
NULL    NULL    UTF-8
C/UA    NULL    UTF-8
NULL    C/UA    UTF-8
C/UA    C/UA    UTF-8
SPEC    NULL    System Locale
SPEC    C/UA    UTF-8
NULL    SPEC    Application Locale
C/UA    SPEC    Application Locale
SPEC    SPEC    Application Locale

Key:
Sys  = System's current locale
App  = Application's current locale
NULL = No setting
C/UA = C or any Unicode aware locale
SPEC = Some other locale (e.g. SJIS)

-jason

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-                                                               -
- Jason Pyeron                     PD Inc. http://www.pdinc.us  -
- Principal Consultant             10 West 24th Street #100     -
- +1 (443) 269-1555 x333           Baltimore, Maryland 21218    -
-                                                               -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
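
P.S. In case the chart is easier to read as code, here is a rough Python sketch of the same lookup. It is only an illustration of the table, not anything Cygwin does today. The names `classify` and `filename_charset` are made up for this example, and treating only "C" and "UTF-8" as the C/UA group is my simplifying assumption.

```python
from typing import Optional

# Assumption for this sketch: only these charsets count as "C/UA"
# (C or Unicode aware). A real implementation would need a fuller list.
UNICODE_AWARE = {"C", "UTF-8"}


def classify(charset: Optional[str]) -> str:
    """Map a charset setting onto the chart's NULL / C/UA / SPEC groups."""
    if charset is None:
        return "NULL"
    if charset.upper() in UNICODE_AWARE:
        return "C/UA"
    return "SPEC"


def filename_charset(sys_cs: Optional[str], app_cs: Optional[str]) -> str:
    """Return the charset a filename function expects/returns per the chart."""
    sys_kind = classify(sys_cs)
    app_kind = classify(app_cs)
    if app_kind == "SPEC":
        # Last three rows: a specific application locale always wins.
        return app_cs
    if sys_kind == "SPEC" and app_kind == "NULL":
        # SPEC/NULL row: fall back to the system locale.
        return sys_cs
    # Every remaining row of the chart is UTF-8.
    return "UTF-8"


# The filename from Lenik's example, \u684c \u9762.
name = "\u684c\u9762"
utf8_bytes = name.encode(filename_charset(None, None))
gbk_bytes = name.encode(filename_charset("GBK", None))
```

With nothing set anywhere, the name comes back as the UTF-8 bytes Corinna listed (0xe6 0xa1 0x8c 0xe9 0x9d 0xa2). With a GBK system locale and no application locale, it comes back as the GBK bytes (0xd7 0xc0 0xc3 0xe6).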