Re: [PATCH, resend] Handle multibyte codepoint width properly

Vladimir 'φ-coder/phcoder' Serbinenko Thu, 05 Apr 2012 12:48:49 -0700

On 05.04.2012 21:18, Bruno Haible wrote:
> Hi Vladimir,
> mbsnwidth returns -1 in such a case only if the option MBSW_REJECT_INVALID
> is passed as third argument. If you pass 0, mbsnwidth will not return -1;
> instead, it will assume width 1 for every invalid byte or unprintable
> character.
Ok, will use mbsnwidth instead then.
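
To make the lenient mode described above concrete, here is a minimal sketch
built only on the standard mbrtowc() and wcwidth(). The name lenient_width
is hypothetical and this is not gnulib's mbsnwidth(); it follows the
behaviour as described in the quoted text (never return -1, count one
column for an invalid byte or an unprintable character), and the real
function may treat some cases differently.

```c
#define _XOPEN_SOURCE 700   /* makes wcwidth() visible on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of gnulib: display width of the N bytes
   at S in the current locale.  It never fails; anything it cannot
   measure is assumed to occupy one column.  */
static int
lenient_width (const char *s, size_t n)
{
  const char *p = s;
  const char *end = s + n;
  mbstate_t state;
  int width = 0;

  assert (s != NULL);
  memset (&state, 0, sizeof state);

  while (p < end)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, p, (size_t) (end - p), &state);
      int w;

      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Invalid or truncated sequence: count one column for this
             byte, step over it and restart from the initial state.  */
          width += 1;
          p += 1;
          memset (&state, 0, sizeof state);
          continue;
        }
      if (len == 0)
        len = 1;                /* a NUL byte still consumes one byte */

      w = wcwidth (wc);
      width += (w >= 0 ? w : 1);    /* unprintable: assume one column */
      p += len;
    }
  return width;
}
```

With this sketch an ESC byte inside an ASCII string is counted as one
column, which is exactly why escape sequences remain a separate problem.
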
>>> - The function __argp_get_display_len looks very similar to mbsnwidth(),
>> Remaining is the issue due to escape sequences.
> What is the use case? PO file editors are not required to support editing
> of strings with control characters. msgfmt warns when a message in a PO file
> contains an unusual control character like ESC.
Unfortunately it does not warn often enough (see my post on bug-gettext,
which has not been answered yet). In particular, it accepts the file
cpio/ko.po from the TP without any warning even though it contains loads
of \e. Some other control characters, such as \b, are ignored as well.
(In reality the file uses an unsupported ISO-2022 variant, an encoding
that relies on many escapes, and not EUC-KR as it claims.)


>> it is used in en@boldquot
> Ah, right. But I don't know how frequently it is used; maybe I and Simon
> were the only persons to ever use this? If we want to support this, not
> only mbswidth has to be modified, but basically any code that uses
> wcwidth - including libunistring. So, until this is discussed (and possibly
> generalized to more languages than 'en'), I propose to get away without
> it.
Ok. In the long term I see only two options: deprecate en@boldquot or
fix all those places. I don't mind if boldquot gets deprecated.
>> Done but the test is valid only for UTF-8 locales. Should I force some
>> specific locale? It's impossible to make a test working in all locales
>> since in case of e.g. ASCII we don't have such characters at all.
> In such a situation, it is best to split the test into two parts: a part
> that can be executed on every machine, and a part which can only be executed
> on a system with a UTF-8 locale. This way, the first part is not skipped
> just because the system has no UTF-8 locale.
Ok, will do. Can I include all the "normal" tests in the UTF-8 test for
simplicity?
> Please take a look how it's done in module 'mbsstr-tests':
>   - test-mbsstr1.c is a test that doesn't need a particular locale.
>   - test-mbsstr2.c is a test that requires a UTF-8 locale. We use the
>     French one for simplicity. (If a system does not have fr_FR.UTF-8
>     installed, it would be unlikely that it has ru_RU.UTF-8 installed.)
>   - test-mbsstr2.sh is a wrapper script that uses the LOCALE_FR_UTF8
>     value, determined by m4/locale-fr.m4, and invokes test-mbsstr2.
Ok.
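
A rough sketch of how the locale-dependent half of such a split test could
report "skipped" instead of "failed". Everything here is an assumption for
illustration: the function name run_utf8_part is hypothetical, the value
"none" is assumed to mean that no suitable locale was found at configure
time, and 77 is the exit status that Automake's test driver treats as SKIP.
It is not the code of test-mbsstr2.c.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#define TEST_SKIPPED 77     /* reported as SKIP by Automake's test driver */

/* Hypothetical driver for the part of a test that needs a UTF-8 locale.
   Returns 0 on success, 1 on failure, TEST_SKIPPED if the locale is
   not available on this system.  */
static int
run_utf8_part (const char *locale_name)
{
  mbstate_t state;
  wchar_t wc;

  /* Assumption: "none" means no such locale was detected.  */
  if (locale_name == NULL || strcmp (locale_name, "none") == 0)
    return TEST_SKIPPED;
  if (setlocale (LC_ALL, locale_name) == NULL)
    return TEST_SKIPPED;

  /* U+00E9 is the two bytes C3 A9 in UTF-8 and must decode as a single
     character in a UTF-8 locale.  */
  memset (&state, 0, sizeof state);
  if (mbrtowc (&wc, "\303\251", 2, &state) != 2)
    return 1;
  return 0;
}
```

The locale-independent checks would live in a separate program that never
calls this function, so they still run on machines without the locale.
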
> +      if (wc == '\e' && ptr + 3 < end
> +          && ptr[1] == '[' && (ptr[2] == '0' || ptr[2] == '1')
> +          && ptr[3] == 'm')
> '\e' is not portable, only GCC supports it. Use '\x1b' or '\033' instead.
>
> Also, the test  ptr + 3 < end  is wrong. Should be written as
>   end - ptr > 3
> instead. (Think of ptr = 0xFFFFFFFD, end = 0xFFFFFFFE on a 32-bit machine.)
> Sure, on many systems this won't matter, because this memory range is
> either unmapped or occupied by the stack. But in general you have no guarantee
> that the memory page from 0xFFFFC000..0xFFFFFFFF will not be used for 
> malloc().
I have already been bitten by this once on sparc64 with GRUB :(
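
For reference, a minimal sketch of the check with both review comments
applied: '\033' instead of '\e', and the length test written as a
subtraction so that no pointer beyond the buffer is ever formed. The
function name sgr_escape_len is hypothetical, and unlike the patch it
works on raw bytes rather than on an already decoded wide character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if PTR points at "ESC [ 0 m" or "ESC [ 1 m",
   return the length of that sequence (4), otherwise 0.  END points one
   past the last valid byte.  */
static size_t
sgr_escape_len (const char *ptr, const char *end)
{
  assert (ptr <= end);

  /* end - ptr is a plain byte count and cannot wrap around, whereas
     ptr + 3 might point past the end of the object.  */
  if (end - ptr > 3
      && ptr[0] == '\033'       /* portable spelling of ESC */
      && ptr[1] == '['
      && (ptr[2] == '0' || ptr[2] == '1')
      && ptr[3] == 'm')
    return 4;
  return 0;
}
```

A caller computing display width would skip the returned number of bytes
without adding any columns.
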
> Bruno
>
>


-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko
