> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                      linepos += width;
>                    if (iswspace (wide_char))
>                      goto mb_word_separator;
> +                  else if (uc_combining_class (wide_char) != 0)
> +                    chars--; /* don't count combining chars */
>                    in_word = true;
>                  }
>                break;

If you want a tool to ignore combining characters (not 'wc -m', since
'wc -m' is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale, or at least that
wchar_t values are Unicode code points. Recall that on some systems
(Solaris, FreeBSD, ...), in an EUC-JP locale for example, the
wide-character representation of a double-byte character is unrelated to
Unicode: the mbrtowc routine just combines the two bytes into a single
wchar_t with a bit of shifting and masking; no conversion to Unicode takes
place. uc_combining_class expects a Unicode code point, so passing it such
a wchar_t gives a meaningless result.

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and
similar APIs, you can do so through the gnulib function
u32_conv_from_encoding (using locale_charset() as the encoding). It is
declared in gnulib's "uniconv.h" file.

Bruno
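
To illustrate the "convert to Unicode first, then classify" order of
operations, here is a minimal sketch. It is not the gnulib code path: it
uses POSIX iconv as a stand-in for u32_conv_from_encoding, and a
hypothetical helper is_combining_mark (which only knows the U+0300..U+036F
block) as a stand-in for uc_combining_class. The function name
count_noncombining and the "UCS-4LE" target name are assumptions of this
sketch; the encoding argument would be the locale's charset, for example
the result of locale_charset() or nl_langinfo (CODESET).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uc_combining_class (): only recognizes the
   Combining Diacritical Marks block, U+0300..U+036F.  */
static int
is_combining_mark (uint32_t uc)
{
  return uc >= 0x0300 && uc <= 0x036F;
}

/* Count the characters in SRC (SRCLEN bytes in ENCODING), skipping
   combining marks.  The bytes are converted to Unicode code points first,
   so the result does not depend on how the platform lays out wchar_t.
   Return (size_t) -1 if the conversion fails.  */
static size_t
count_noncombining (const char *encoding, const char *src, size_t srclen)
{
  iconv_t cd = iconv_open ("UCS-4LE", encoding);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  char *in = (char *) src;      /* iconv does not modify the input bytes */
  size_t inleft = srclen;
  size_t count = 0;
  int failed = 0;

  while (inleft > 0)
    {
      unsigned char out[256];
      char *outp = (char *) out;
      size_t outleft = sizeof out;

      size_t rc = iconv (cd, &in, &inleft, &outp, &outleft);
      if (rc == (size_t) -1 && errno != E2BIG)
        {
          failed = 1;           /* invalid or truncated input sequence */
          break;
        }

      /* Decode the little-endian 4-byte units produced in this round.  */
      size_t produced = sizeof out - outleft;
      assert (produced % 4 == 0);
      for (size_t i = 0; i < produced; i += 4)
        {
          uint32_t uc = (uint32_t) out[i]
                        | (uint32_t) out[i + 1] << 8
                        | (uint32_t) out[i + 2] << 16
                        | (uint32_t) out[i + 3] << 24;
          if (!is_combining_mark (uc))
            count++;
        }
    }

  iconv_close (cd);
  return failed ? (size_t) -1 : count;
}
```

With this, count_noncombining ("UTF-8", "e\xcc\x81", 3) counts one
character for "e" followed by U+0301. The same call with an EUC-JP byte
string and "EUC-JP" as the encoding goes through the same Unicode-based
check, which is the point of converting before classifying.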
