Re: uc_width and wcwidth optimization

Bruno Haible Tue, 13 Dec 2011 02:33:33 -0800

Hello,

Alexander V. Lukyanov wrote:
> Attached is the patch to optimize performance of wcwidth, uc_width and
> uc{8,16,32}_width functions.
> 
> The optimization is caching of is_cjk_encoding() and using
> nl_langinfo(CODESET) before the complex locale_charset() to check if the
> charset has changed.


Thanks for the patch, but I cannot apply it in this form:
  1) The uc_width change modifies the public API of libunistring.
     You can introduce new API in <uniwidth.h>, but changing the signature
     of an existing function is not possible.
  2) The wcwidth change is a good idea, but unfortunately it is not
     multithread-safe. Different threads can have different locales, so a
     global variable used as a cache will not always give correct results.

I'm attaching the benchmark program I'm experimenting with. So far, it seems
that locale_charset() is really slow, whereas the is_cjk stuff is not a big
speed problem.

I would love to have locale_charset be either faster or use some thread-safe
cache. Do you have an idea how to realize this?
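
One possible direction, as a minimal sketch only: keep the cache in
thread-local storage and key it on the string returned by
nl_langinfo (CODESET), so that each thread consults only its own cached
value. The helper name current_codeset_is_utf8 and the 64-byte buffer are
hypothetical. The code assumes a C11 compiler (_Thread_local) and a platform
that provides nl_langinfo, and the plain strcmp against "UTF-8" merely stands
in for the canonicalization that locale_charset() performs.

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical per-thread cache; not part of gnulib.  Each thread has its
   own copy, so threads with different locales do not interfere.  */
static _Thread_local char cached_codeset[64];
static _Thread_local int cached_is_utf8;

int
current_codeset_is_utf8 (void)
{
  const char *codeset = nl_langinfo (CODESET);

  if (strcmp (codeset, cached_codeset) != 0)
    {
      /* First call in this thread, or the codeset has changed: recompute.
         A real implementation would do the expensive lookup here.  */
      if (strlen (codeset) < sizeof cached_codeset)
        strcpy (cached_codeset, codeset);
      else
        cached_codeset[0] = '\0';   /* too long to cache; recompute next time */
      cached_is_utf8 = (strcmp (codeset, "UTF-8") == 0);
    }
  return cached_is_utf8;
}
```

In this sketch every call still pays for nl_langinfo and one strcmp. Whether
that is cheap enough would have to be measured with the benchmark.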

> Besides, uc_width is used in wcwidth for cjk encodings as designed.

-  if (STREQ (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
+  if (cached_is_utf8_encoding || cached_is_cjk_encoding)
     {
       /* We assume that in a UTF-8 locale, a wide character is the same as a
          Unicode character.  */
-      return uc_width (wc, encoding);
+      return uc_width (wc, cached_is_cjk_encoding);
     }

This won't work portably: as the comment says, only in UTF-8 locales do we
know that a wchar_t represents a Unicode character. In locales with encodings
such as EUC-JP or GB18030 you cannot assume anything about how libc has
defined the wchar_t values.
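
The distinction can be sketched as follows. This is an illustration under
assumptions, not the gnulib code: the enum and the function
choose_width_source are hypothetical names, and the fallback for non-UTF-8
encodings is assumed to be the system's own wcwidth.

```c
#include <string.h>

/* Hypothetical names, for illustration only.  */
enum width_source
{
  WIDTH_FROM_UNICODE_TABLE,   /* wchar_t is known to be a Unicode character */
  WIDTH_FROM_SYSTEM_WCWIDTH   /* wchar_t values are libc-defined */
};

enum width_source
choose_width_source (const char *encoding)
{
  /* Only for UTF-8 may a wchar_t be passed to a Unicode width lookup.
     CJK encodings such as EUC-JP or GB18030 must not take that path.  */
  if (strcmp (encoding, "UTF-8") == 0)
    return WIDTH_FROM_UNICODE_TABLE;
  return WIDTH_FROM_SYSTEM_WCWIDTH;
}
```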

Bruno
-- 
In memoriam The victims of the Massacre of Margarita Belén 
<http://en.wikipedia.org/wiki/Massacre_of_Margarita_Belén>

#define _GNU_SOURCE 1
#include <config.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
/* Accumulates the computed widths so that the result of the loop is used.  */
int total;
int main (void)
{
  const char *teststring = "\
This is a list of ways to say hello in various languages.\n\
Its purpose is to illustrate a number of scripts.\n\
\n\
---------------------------------------------------------\n\
Amharic (አማርኛ)  ሠላም\n\
Arabic                  ﺍﻟﺴﻼﻡ ﻋﻠﻴﻜﻢ\n\
Czech (česky)           Dobrý den\n\
Danish (Dansk)          Hej, Goddag\n\
English                 Hello\n\
Esperanto               Saluton\n\
Estonian                Tere, Tervist\n\
FORTRAN                 PROGRAM\n\
Finnish (Suomi)         Hei\n\
French (Français)       Bonjour, Salut\n\
German (Deutsch Nord)   Guten Tag\n\
German (Deutsch Süd)    Grüß Gott\n\
Greek (Ελληνικά)        Γειά σας\n\
Hebrew                  שלום\n\
Italiano                Ciao, Buon giorno\n\
Lao(ພາສາລາວ)            ສະບາຍດີ, ຂໍໃຫ້ໂຊກດີ\n\
Maltese                 Ciao\n\
Nederlands, Vlaams      Hallo, Dag\n\
Norwegian (Norsk)       Hei, God dag\n\
Polish                  Dzień dobry, Hej\n\
Russian (Русский)       Здравствуйте!\n\
Slovak                  Dobrý deň\n\
Spanish (Español)       ¡Hola!\n\
Swedish (Svenska)       Hej, Goddag\n\
Thai (ภาษาไทย)          สวัสดีครับ, สวัสดีค่ะ\n\
\n\
Tigrigna (ትግርኛ) ሰላማት\n\
Turkish (Türkçe)        Merhaba\n\
Vietnamese (Tiếng Việt) Chào bạn\n\
\n\
Japanese (日本語)               こんにちは, ｺﾝﾆﾁﾊ\n\
Chinese (中文,普通话,汉语)      你好\n\
Cantonese (粵語,廣東話)         早晨, 你好\n\
Korean (한글)                   안녕하세요, 안녕하십니까\n\
\n\
Difference among chinese characters in GB, JIS, KSC, BIG5:\n\
        GB   -- 元气  开发\n\
        JIS  -- 元気  開発\n\
        KSC  -- 元氣  開發\n\
        BIG5 -- 元氣  開發\n\
\n\
Just for a test of JISX0212: 騏驎 (the second character is of JISX0212)\n\
";

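  /* Without a UTF-8 locale, mbtowc fails on the first non-ASCII byte and
     the timings become meaningless, so stop early in that case.  */
  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)
    {
      fprintf (stderr, "cannot set locale en_US.UTF-8\n");
      return 1;
    }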

#define REPEAT 10000
  {
    int repeat;
    for (repeat = 0; repeat < REPEAT; repeat++)
      {
        int width = 0;
        const char *cp = teststring;
        while (*cp != '\0')
          {
            wchar_t wc;
            int i = mbtowc (&wc, cp, 6);
            if (i < 0) break;
            width += wcwidth (wc);
            cp += i;
          }
        total += width;
      }
  }

  return 0;
}
/*
In UTF-8 locale:
glibc implementation:   104 µs
gnulib implementation: 1050 µs
                         80 µs for the mbtowc call
                        960 µs for the wcwidth call
                        900 µs for the locale_charset call
                         40 µs for the STREQ call
                         20 µs for the uc_width call

 Compile-command:
 gcc -Wall -I.. bench-wcwidth.c lib/wcwidth.c lib/localcharset.c lib/uniwidth/width.c -Ilib -DLIBDIR='"/usr/share"' -DHAVE_WORKING_O_NOFOLLOW=0
 */
