control: severity -1 normal
control: tag -1 + upstream
control: retitle -1 libc6: mb* functions consider C locale as 7-bit (128 characters) instead of 8-bit (256 characters) since POSIX Issue 7 TC2/Issue 8

On 2022-08-21 16:23, наб wrote:
> Package: libc6
> Version: 2.33-8
> Severity: important
>
> Dear Maintainer,
>
> Consider the following reproducer:

[ snip ]

> This breaks all programs that expect to process text/data portably,
> since in LC_ALL=C half of all bytes collapse to one character

"breaks" is a bit strong there given that this behaviour of the C locale
has been there for decades. Note also that the C.UTF-8 locale helps
there, even if I agree that it should also work with the POSIX locale.

> (for sort this means that they all collate equally, &c., &c.)!

It depends on what is used for sorting. For instance the sort(1) utility
behaves correctly with the C locale.

> Consider a diff of XBD 6.2 ("Character Encoding"), Issue 7 vs Issue 7 TC2:
> -- >8 --
> @@ -1768,9 +1664,13 @@
>
> <h3><a name="tag_06_02"> 6.2 </a>Character Encoding</h3>
>
> -<p>The POSIX locale contains the characters in <a href="#tagtcjh_3">Portable Character Set</a> , which have the properties listed
> -in <a href="../basedefs/V1_chap07.html#tag_07_03_01"><i>LC_CTYPE</i></a> . In other locales, the presence, meaning, and
> -representation of any additional characters are locale-specific.</p>
> +<p>The POSIX locale shall contain 256 single-byte characters including the characters in <a href="#tagtcjh_3">Portable Character
> +Set</a> and <a href="#tagtcjh_4">Non-Portable Control Characters</a>, which have the properties listed in <a href=
> +"../basedefs/V1_chap07.html#tag_07_03_01"><i>LC_CTYPE</i></a>. It is unspecified whether characters not listed in those two tables
> +are classified as <b>punct</b> or <b>cntrl</b>, or neither. Other locales shall contain the characters in <a href=
> +"#tagtcjh_3">Portable Character Set</a> and may contain any or all of the control characters identified in <a href=
> +"#tagtcjh_4">Non-Portable Control Characters</a>; the presence, meaning, and representation of any additional characters are
> +locale-specific.</p>
>
> <p>In locales other than the POSIX locale, a character may have a state-dependent encoding. There are two types of these
> encodings:</p>
> -- >8 --

That comes from bug 663. However for the functions listed in that bug,
only the mb* functions are affected. The strcasecmp, strncasecmp,
toupper, tolower and is* functions behave as in the standard.

Anyway please bring this issue upstream, as it has to be solved there.

Regards
Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net
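
For illustration only (this is not the reproducer snipped from the report), here is a minimal sketch of how the 7-bit versus 8-bit difference can be observed through mbrtowc(). The helper names mb_accepts_byte and count_accepted_bytes are hypothetical and were made up for this example. The caller is assumed to have selected the locale first, e.g. with setlocale(LC_ALL, "C").

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not taken from the bug report: returns 1 if
 * mbrtowc() decodes the single non-NUL byte `b` as one complete
 * character in the current locale, and 0 if the byte is rejected
 * (typically (size_t)-1 with EILSEQ) or treated as incomplete. */
static int mb_accepts_byte(unsigned char b)
{
    mbstate_t st;
    wchar_t wc;
    char buf[1];
    size_t n;

    buf[0] = (char)b;
    memset(&st, 0, sizeof st);   /* start from the initial shift state */
    n = mbrtowc(&wc, buf, 1, &st);
    return n == 1;
}

/* Hypothetical helper: counts how many of the byte values 1..255 are
 * accepted as single-byte characters in the current locale. NUL is
 * skipped because mbrtowc() reports it with a return value of 0. */
static int count_accepted_bytes(void)
{
    int count = 0;
    int b;

    for (b = 1; b <= 255; b++)
        count += mb_accepts_byte((unsigned char)b);
    return count;
}
```

After setlocale(LC_ALL, "C"), a libc that treats the C locale as 7-bit would be expected to report 127 from count_accepted_bytes(), while one that follows the Issue 7 TC2 wording quoted above would be expected to report 255.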