Re: horrible utf-8 performance in wc

Bruno Haible Thu, 08 May 2008 06:21:11 -0700

> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                             linepos += width;
>                           if (iswspace (wide_char))
>                             goto mb_word_separator;
> +                         else if (uc_combining_class (wide_char) != 0)
> +                           chars--; /* don't count combining chars */
>                           in_word = true;
>                         }
>                       break;


If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m'
is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.

However, this patch assumes a UTF-8 locale. Recall that on some systems
(Solaris, FreeBSD, ...), in the EUC-JP locale for example, the wide-character
representation of a double-byte character is unrelated to Unicode: the mbrtowc
routine just combines the two bytes into a single wchar_t with a bit of
shifting and masking; no conversion to Unicode takes place.
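
A minimal sketch of how code could guard against this, in standard C only.
ISO C99 defines the macro __STDC_ISO_10646__ when wchar_t values are
ISO 10646 (Unicode) code points; an implementation whose wchar_t encoding
depends on the locale does not make that promise. The helper name
decode_char is hypothetical and is not part of wc or gnulib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not from wc or gnulib): decode one multibyte
   character with mbrtowc and report whether the resulting wchar_t may
   be treated as a Unicode code point.  Returns what mbrtowc returns.  */
static size_t
decode_char (const char *s, size_t n, mbstate_t *state,
             wchar_t *wc, bool *is_unicode)
{
  assert (s != NULL && state != NULL && wc != NULL && is_unicode != NULL);

#ifdef __STDC_ISO_10646__
  /* The implementation promises that wchar_t holds ISO 10646 values.  */
  *is_unicode = true;
#else
  /* No such promise: wchar_t may be a locale-specific packing of bytes.  */
  *is_unicode = false;
#endif

  return mbrtowc (wc, s, n, state);
}
```

When *is_unicode comes back false, the wchar_t should not be passed to a
function such as uc_combining_class, which expects a Unicode code point.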

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(using locale_charset() as encoding). It's defined in gnulib's "uniconv.h" file.
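
To illustrate the idea of converting first and classifying afterwards, here
is a rough sketch. Note that it does not use u32_conv_from_encoding: so that
it builds without gnulib, it swaps in POSIX iconv, and the encoding name is
passed in by the caller. The helper name to_code_points is hypothetical, and
the sketch assumes the iconv implementation knows the name "UTF-32LE".

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert SRC (SRCLEN bytes in ENCODING) to a
   malloc'ed array of Unicode code points and store their number in
   *COUNT.  Returns NULL on any failure.  */
static uint32_t *
to_code_points (const char *encoding, const char *src, size_t srclen,
                size_t *count)
{
  assert (encoding != NULL && src != NULL && count != NULL);

  iconv_t cd = iconv_open ("UTF-32LE", encoding);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Assumption: at most one code point per input byte.  If that does
     not hold, iconv fails with E2BIG and we return NULL.  */
  size_t outsize = 4 * srclen + 4;
  unsigned char *raw = malloc (outsize);
  uint32_t *result = malloc ((srclen + 1) * sizeof *result);
  if (raw == NULL || result == NULL)
    {
      free (raw);
      free (result);
      iconv_close (cd);
      return NULL;
    }

  char *in = (char *) src;      /* iconv never writes through this */
  char *out = (char *) raw;
  size_t inleft = srclen;
  size_t outleft = outsize;
  size_t rc = iconv (cd, &in, &inleft, &out, &outleft);
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &out, &outleft);  /* flush any state */
  iconv_close (cd);
  if (rc == (size_t) -1)
    {
      free (raw);
      free (result);
      return NULL;
    }

  /* Assemble little-endian 4-byte units into code points.  */
  size_t n = (outsize - outleft) / 4;
  for (size_t i = 0; i < n; i++)
    result[i] = (uint32_t) raw[4 * i]
                | ((uint32_t) raw[4 * i + 1] << 8)
                | ((uint32_t) raw[4 * i + 2] << 16)
                | ((uint32_t) raw[4 * i + 3] << 24);
  free (raw);
  *count = n;
  return result;
}
```

A caller would pass the locale's encoding name (locale_charset() in a gnulib
project, or nl_langinfo (CODESET) as a plain POSIX stand-in) and could then
hand each element of the result to uc_combining_class.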

Bruno



_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
