Hi Ben,

> > "grapheme" or "grapheme cluster"? I'm a bit confused: The Unicode 3.0
> > book uses the term "grapheme" to denote the entity that users consider
> > to be a single character, but UAX #29 nowadays calls it "grapheme cluster".
>
> I am being a little sloppy with terminology.  My take-away from
> the Unicode glossary definitions is that a "grapheme" is a
> user-perceived character, and a "grapheme cluster" is the
> sequence of code points that make up a grapheme.

Hmm. In <http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf> they use
only the term "grapheme cluster". So, to me it appears that "grapheme" is
an older term that they wanted to get away from. Therefore your exclusive
use of "grapheme cluster" in unigbrk.h is perfect.

One tiny improvement to your patches: In C source code, use octal escapes
instead of hexadecimal escapes. The cc compiler on some platforms (IRIX 6.5
or HP-UX 10.20 or something like that) handles only octal escapes correctly.


2011-01-01  Bruno Haible  <br...@clisp.org>

	Avoid use of hexadecimal escapes.
	* tests/unigbrk/test-uc-is-grapheme-break.c (main): Use octal escapes
	instead of hexadecimal escapes.

--- tests/unigbrk/test-uc-is-grapheme-break.c.orig	Sat Jan  1 12:52:02 2011
+++ tests/unigbrk/test-uc-is-grapheme-break.c	Sat Jan  1 12:34:04 2011
@@ -1,5 +1,5 @@
 /* Grapheme cluster break function test.
-   Copyright (C) 2010 Free Software Foundation, Inc.
+   Copyright (C) 2010-2011 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU Lesser General Public License as published
@@ -97,12 +97,12 @@
           ucs4_t next;
 
           p += strspn (p, " \t\r\n");
-          if (!strncmp (p, "\xc3\xb7" /* ÷ */, 2))
+          if (!strncmp (p, "\303\267" /* ÷ */, 2))
             {
               should_break = true;
               p += 2;
             }
-          else if (!strncmp (p, "\xc3\x97" /* × */, 2))
+          else if (!strncmp (p, "\303\227" /* × */, 2))
             {
               should_break = false;
               p += 2;
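
For reference, here is a minimal sketch of the byte equivalence the patch
relies on: the octal escapes "\303\267" and "\303\227" denote the UTF-8
bytes 0xC3 0xB7 (U+00F7, ÷) and 0xC3 0x97 (U+00D7, ×). The macro names and
the helpers classify_marker and has_bytes are hypothetical, made up for this
illustration; they are not part of gnulib or of the test file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encodings of the two markers, written with octal escapes only.
   U+00F7 DIVISION SIGN is bytes 0xC3 0xB7;
   U+00D7 MULTIPLICATION SIGN is bytes 0xC3 0x97.  */
#define BREAK_MARK    "\303\267"  /* ÷ */
#define NO_BREAK_MARK "\303\227"  /* × */

/* Hypothetical helper: return 1 if S starts with ÷, 0 if it starts
   with ×, and -1 otherwise.  */
static int
classify_marker (const char *s)
{
  if (!strncmp (s, BREAK_MARK, 2))
    return 1;
  if (!strncmp (s, NO_BREAK_MARK, 2))
    return 0;
  return -1;
}

/* Hypothetical helper: return 1 if the string LIT consists of exactly
   the N bytes in EXPECTED, 0 otherwise.  */
static int
has_bytes (const char *lit, const unsigned char *expected, size_t n)
{
  return strlen (lit) == n && memcmp (lit, expected, n) == 0;
}
```

The expected bytes can be given as integer constants in an unsigned char
array (for example { 0xC3, 0xB7 }), so the check itself needs no
hexadecimal escapes inside a string literal.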