On 14/05/15 01:30, Bruno Haible wrote:
> [CCing bug-gnulib to share the understanding about i18n issues]
>
> Pádraig Brady wrote on 13.05.2015:
>> MB_LEN_MAX was changed from 6 to 16 with:
>> https://sourceware.org/git/?p=glibc.git;a=commit;f=include/limits.h;h=d64b6ad075
>> Do you know why the value 16 is used exactly?
>
> This was motivated either by the desire to be completely future-proof
> for the next 30 years (and you don't know what kinds of encodings will
> be invented).
>
> Or because for a couple of months Ulrich Drepper & François Pinard were
> considering to add locales with stateful encodings such as ISO-2022-JP-2.
> This later turned out to be not worth the effort (as the user experience
> with filenames and shell in such locales was found to be terrible).

Excellent. Info like that is nigh on impossible to search for.

>> BTW I see MB_LEN_MAX is 4 on musl libc.
>
> The value of 4 is sufficient to accommodate all stateless encodings in
> use, including UTF-8 (which was restricted from max. 6 to 4 bytes by
> an ISO standard) and GB18030. But it's not necessarily future-proof.

Right. A good summary of the UTF-8 6 -> 4 bytes thing is at:
https://stijndewitt.wordpress.com/2014/08/09/max-bytes-in-a-utf-8-char/
I see that MB_CUR_MAX is 6 for UTF-8 on glibc.
I wonder could that be reduced to 4?
I see that one has to be more careful with the _compile time_ constant
MB_LEN_MAX, though it would be tempting to reduce it to 8 at least,
requiring a recompile for the unlikely case of supporting
legacy stateful encodings.

>> I was worried that it implied that wctomb() might convert a wide char to
>> _multiple_ encoded chars
>> for some character/encoding combinations?
>
> No, neither POSIX nor glibc supports locales with encodings where
> a wide char would correspond to multiple characters or where a
> character would correspond to multiple wide chars.

This was my key question answered.

> In particular,
> this prevented EUC-JISX0213 from being used as a locale encoding in
> glibc [1], thus accelerating the move to UTF-8.

Interesting, though EUC-JISX0213 might now be supported
with newer Unicode standards that include the appropriate chars?

>> For example iso-2022-kr can have up to 7 bytes per encoded char,
>> so maybe wctomb() might output two of those for some wide chars,
>> and the extra two bytes were added for alignment?
>
> Yes, this was part of the considerations regarding stateful encodings.
>
>> Specifically why I'm wondering about this is to size the
>> output buffer for wctomb() appropriately.
>> Note the linux man page for wctomb() says to use MB_CUR_MAX,
>> while the freebsd man page says to use MB_LEN_MAX
>
> That's simply because MB_CUR_MAX is not a compile-time constant,
> and therefore for a long time the declaration of a local variable
>   char buf[MB_CUR_MAX];
> required GCC or C++, and the FreeBSD people are not keen adopters
> of GCC extensions.
>
>> I also asked this at:
>> http://stackoverflow.com/q/30222107/4421
>
> Bruno
>
> [1] https://sourceware.org/git/?p=glibc.git;a=blob;f=iconvdata/euc-jisx0213.c

thanks!
Pádraig.
