Pádraig Brady wrote:
> In the first 65535 code points there are also 404 chars which are
> not classed as combining in the unicode database, but are classed
> as zero width in the glibc locale data at least (zero-width space
> being one of them like you mentioned). I determined this with the
> attached progs:
>
> ./zw | python unidata.py | grep " 0 " | wc -l

Hi Pádraig,

Wow, I knew there were some stand-alone zero-width characters, but I
had no idea there were so many!

I poked around a little in gnulib and found a function for determining
the combining class of a Unicode character. I think the attached patch
does what you were intending to do, and it also counts all of the
stand-alone zero-width characters you found:

----
$ ./zw | python unidata.py | grep " 0 " | perl packu.pl | src/wc -m
404
$ src/wc -m 2char
2 2char
----

Please note that this requires a re-run of `./bootstrap', since it
needs to bring some extra stuff in from gnulib.

Hope that helps.

Bo

diff --git a/bootstrap.conf b/bootstrap.conf
index 8bde0ad..ef5a328 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -82,6 +82,7 @@ gnulib_modules="
stpncpy
strftime
strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
+ unictype/combining-class
unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
uptime
useless-if-before-free
diff --git a/src/wc.c b/src/wc.c
index 61ab485..ed6630c 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -32,6 +32,8 @@
#include "readtokens0.h"
#include "safe-read.h"
+#include "unictype.h"
+
#if !defined iswspace && !HAVE_ISWSPACE
# define iswspace(wc) \
((wc) == to_uchar (wc) && isspace (to_uchar (wc)))
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
linepos += width;
if (iswspace (wide_char))
goto mb_word_separator;
+ else if (uc_combining_class (wide_char) != 0)
+ chars--; /* don't count combining chars */
in_word = true;
}
break;
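
For anyone who wants to sanity-check the counting rule without rebuilding coreutils, here is a rough Python sketch of the same idea. It is not part of the patch: `count_chars` and the sample string are made up for illustration, and it relies on Python's `unicodedata.combining`, which reports the canonical combining class from Python's bundled Unicode data rather than from gnulib. The rule is the one the patch applies: a code point is skipped only when its combining class is nonzero, so stand-alone zero-width characters (class 0) are still counted.

```python
import unicodedata


def count_chars(text: str) -> int:
    """Count code points, skipping those with a nonzero combining class.

    Hypothetical helper for illustration; it mirrors the check
    `uc_combining_class (wide_char) != 0` in the patch above.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Illustrative input: "e" + U+0300 COMBINING GRAVE ACCENT, then U+00E9.
sample = "e\u0300\u00e9"

print(len(sample))            # raw code points, combining mark included
print(count_chars(sample))    # combining mark skipped
print(count_chars("\u200b"))  # ZERO WIDTH SPACE has class 0, so it counts
```

Results could differ slightly from the patched wc if Python and gnulib carry different Unicode database versions.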
Attachment: packu.pl (Perl program)
eÌé
