On Jul 28 13:33, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> >> >> Trouble is, the hack will also only work correctly if the whole UTF-8
> >> >> sequence for the non-BMP character is passed at once. If you pass the
> >> >> bytes one-by-one instead, and assuming the bug above wasn't there,
> >> >> you'd get this:
> >> >
> >> > Yes, I know.  The real trouble is, I don't know how that can be fixed
> >> > in a still sort-of-POSIXy way.
> >>
> >> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
> >> because apps that check the mbrtowc() return code (and not the written
> >> wc) against zero will interpret a low surrogate as string end. An
> >> alternative might be to just return an error when there's no compliant
> >> way to return the low surrogate. Do you think either of these are
> >> worth pursuing?
> >
> > I'm thinking of faking a valid return of 1 (or 2, or 3) after the third byte
> > has been read.  Three bytes are sufficient to create the first surrogate
> > half in wc.
>
> Great idea!
>
> I wouldn't even say it's fake, because as you say, you definitely have
> a high surrogate after three bytes. So just return the number of bytes
> actually used. It's also valid to leave it in a non-initial state
> after that; consider it the surrogate shift state or some such. And if
> the first byte in the next call isn't actually a valid fourth byte,
> just return an error.
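
To make the three-byte point concrete, here is a minimal sketch of the bit arithmetic. It is not the newlib code; the helper names high_surrogate_from_3 and low_surrogate_from_2 are hypothetical, and the sketch assumes the bytes have already been validated as parts of a 4-byte UTF-8 sequence.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the newlib implementation.
   Both assume the input bytes were already validated as parts of a 4-byte
   UTF-8 sequence (lead byte 0xF0..0xF4, continuation bytes 0x80..0xBF).  */

/* High surrogate from the first three bytes b0, b1, b2.  */
static uint16_t
high_surrogate_from_3 (uint8_t b0, uint8_t b1, uint8_t b2)
{
  /* Bits 20..10 of the code point: 3 bits from b0, 6 bits from b1 and
     the top 2 payload bits of b2.  */
  uint32_t top = ((uint32_t) (b0 & 0x07) << 8)
                 | ((uint32_t) (b1 & 0x3f) << 2)
                 | ((uint32_t) (b2 & 0x30) >> 4);

  /* Subtracting 0x10000 from the code point is the same as subtracting
     0x40 from its bits 20..10, so the fourth byte is not needed here.  */
  return (uint16_t) (0xd800 | (top - 0x40));
}

/* Low surrogate from the low 4 payload bits of b2 plus the 6 payload
   bits of the fourth byte b3.  Part of b2 has to be kept as state
   between the two calls.  */
static uint16_t
low_surrogate_from_2 (uint8_t b2, uint8_t b3)
{
  return (uint16_t) (0xdc00
                     | ((uint32_t) (b2 & 0x0f) << 6)
                     | (uint32_t) (b3 & 0x3f));
}
```

For U+1F600 (UTF-8 bytes F0 9F 98 80) the first helper gives 0xD83D from F0 9F 98 alone, and the second gives 0xDE00 once the fourth byte arrives.
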
I proposed a patch:
http://sourceware.org/ml/newlib/2009/msg00781.html


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat