Re: new modules for grapheme cluster breaking

Bruno Haible Sat, 01 Jan 2011 03:54:45 -0800

Hi Ben,

> > "grapheme" or "grapheme cluster"? I'm a bit confused: The Unicode 3.0
> > book uses the term "grapheme" to denote the entity that users consider
> > to be a single character, but UAX #29 nowadays calls it "grapheme cluster".
> 
> I am being a little sloppy with terminology.  My take-away from
> the Unicode glossary definitions is that a "grapheme" is a
> user-perceived character, and a "grapheme cluster" is the
> sequence of code points that make up a grapheme.


Hmm. In <http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf> they use
only the term "grapheme cluster". So, to me it appears that "grapheme" is
an older term that they wanted to get away from. Therefore your exclusive
use of "grapheme cluster" in unigbrk.h is perfect.

One tiny improvement to your patches: In C source code, use octal escapes
instead of hexadecimal escapes. The cc compiler on some platform (IRIX 6.5
or HP-UX 10.20 or something like that) handles only octal escapes correctly.
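
To illustrate that the octal spellings denote the same bytes, here is a
minimal sketch. It is not part of the patch; the macro names and the helpers
has_bytes and check_octal_escapes are hypothetical and exist only for this
example. It checks that "\303\267" and "\303\227" are the two-byte UTF-8
encodings of U+00F7 (÷) and U+00D7 (×), using hexadecimal integer constants
(not hexadecimal string escapes) for the expected byte values.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* UTF-8 encodings spelled with octal escapes, as in the patch below.
   U+00F7 (÷) is the bytes C3 B7; U+00D7 (×) is the bytes C3 97.  */
#define DIVISION_SIGN "\303\267"
#define MULTIPLICATION_SIGN "\303\227"

/* Hypothetical helper: return true if S consists of exactly the two
   bytes B0 and B1.  */
static bool
has_bytes (const char *s, unsigned char b0, unsigned char b1)
{
  return strlen (s) == 2
         && (unsigned char) s[0] == b0
         && (unsigned char) s[1] == b1;
}

/* Hypothetical self-check: the octal spellings denote the intended bytes.
   Octal 303 is 0xC3, octal 267 is 0xB7, octal 227 is 0x97.  */
static void
check_octal_escapes (void)
{
  assert (has_bytes (DIVISION_SIGN, 0xC3, 0xB7));
  assert (has_bytes (MULTIPLICATION_SIGN, 0xC3, 0x97));
}
```

Calling check_octal_escapes () from a test's main function should pass
silently. A side benefit of octal escapes, independent of the compiler issue:
an octal escape ends after at most three digits, whereas a hexadecimal escape
absorbs every hexadecimal digit that follows it, so octal escapes cannot
accidentally merge with the next characters of the string.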


2011-01-01  Bruno Haible  <br...@clisp.org>

        Avoid use of hexadecimal escapes.
        * tests/unigbrk/test-uc-is-grapheme-break.c (main): Use octal escapes
        instead of hexadecimal escapes.

--- tests/unigbrk/test-uc-is-grapheme-break.c.orig      Sat Jan  1 12:52:02 2011
+++ tests/unigbrk/test-uc-is-grapheme-break.c   Sat Jan  1 12:34:04 2011
@@ -1,5 +1,5 @@
 /* Grapheme cluster break function test.
-   Copyright (C) 2010 Free Software Foundation, Inc.
+   Copyright (C) 2010-2011 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published
@@ -97,12 +97,12 @@
           ucs4_t next;
 
           p += strspn (p, " \t\r\n");
-          if (!strncmp (p, "\xc3\xb7" /* ÷ */, 2))
+          if (!strncmp (p, "\303\267" /* ÷ */, 2))
             {
               should_break = true;
               p += 2;
             }
-          else if (!strncmp (p, "\xc3\x97" /* × */, 2))
+          else if (!strncmp (p, "\303\227" /* × */, 2))
             {
               should_break = false;
               p += 2;
