Hello,

Alexander V. Lukyanov wrote:
> Attached is the patch to optimize performance of wcwidth, uc_width and
> uc{8,16,32}_width functions.
>
> The optimization is caching of is_cjk_encoding() and using
> nl_langinfo(CODESET) before the complex locale_charset() to check if the
> charset has changed.

Thanks for the patch, but I cannot use it like this:

1) The uc_width change modifies public API of libunistring. You can introduce
   new API in <uniwidth.h>, but changing the signature of an existing function
   is impossible.

2) The wcwidth change is a good idea, but unfortunately is not
   multithread-safe. Different threads can have different locales, therefore
   a global variable as a cache won't always lead to correct results.

I'm attaching the benchmark program I'm experimenting with. So far, it seems
that locale_charset() is really slow, whereas the is_cjk stuff is not a big
speed problem. I would love to have locale_charset be either faster or use
some thread-safe cache. Do you have an idea how to realize this?

> Besides, uc_width is used in wcwidth for cjk encodings as designed.

-  if (STREQ (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
+  if (cached_is_utf8_encoding || cached_is_cjk_encoding)
     {
       /* We assume that in a UTF-8 locale, a wide character is the same as a
          Unicode character.  */
-      return uc_width (wc, encoding);
+      return uc_width (wc, cached_is_cjk_encoding);
     }

This won't work portably: The comment says that only in UTF-8 locales we know
that a wchar_t represents a Unicode character. In locales with encodings such
as EUC-JP or GB18030 you cannot assume anything about how libc has defined
the wchar_t values.

Bruno
-- 
In memoriam The victims of the Massacre of Margarita Belén
<http://en.wikipedia.org/wiki/Massacre_of_Margarita_Belén>

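One possible direction for the thread-safe cache asked about above, as a minimal sketch rather than a proposed patch: keep the cache in thread-local storage and key it on the string returned by nl_langinfo (CODESET), so that a locale change within the same thread invalidates it. The names charset_is_utf8, slow_is_utf8 and the cache variables are hypothetical and do not exist in gnulib; slow_is_utf8 merely stands in for the expensive locale_charset() lookup. The sketch assumes C11 _Thread_local and that nl_langinfo reports the calling thread's current locale.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>
#include <strings.h>

/* Hypothetical per-thread cache; none of these names exist in gnulib.  */
static _Thread_local char cached_codeset[64];
static _Thread_local int cached_is_utf8;
static _Thread_local int cache_valid;

/* Stand-in for the expensive path (locale_charset () plus comparison).  */
static int
slow_is_utf8 (const char *codeset)
{
  return strcasecmp (codeset, "UTF-8") == 0
         || strcasecmp (codeset, "UTF8") == 0;
}

/* Return 1 if the calling thread's locale uses UTF-8, 0 otherwise.  */
static int
charset_is_utf8 (void)
{
  const char *codeset = nl_langinfo (CODESET);

  assert (codeset != NULL);
  if (cache_valid && strcmp (codeset, cached_codeset) == 0)
    return cached_is_utf8;      /* fast path: codeset unchanged */

  /* Slow path: recompute, then remember the codeset name if it fits.  */
  cached_is_utf8 = slow_is_utf8 (codeset);
  cache_valid = strlen (codeset) < sizeof cached_codeset;
  if (cache_valid)
    strcpy (cached_codeset, codeset);
  return cached_is_utf8;
}
```

With this shape, each call costs one nl_langinfo call and one strcmp on the fast path. Whether that is actually cheaper than locale_charset() would have to be measured with the benchmark.
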
#define _GNU_SOURCE 1
#include <config.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int total;

int main ()
{
  const char *teststring = "\
This is a list of ways to say hello in various languages.\n\
Its purpose is to illustrate a number of scripts.\n\
\n\
---------------------------------------------------------\n\
Amharic (አማርኛ) ሠላም\n\
Arabic ﺍﻟﺴﻼﻡ ﻋﻠﻴﻜﻢ\n\
Czech (česky) Dobrý den\n\
Danish (Dansk) Hej, Goddag\n\
English Hello\n\
Esperanto Saluton\n\
Estonian Tere, Tervist\n\
FORTRAN PROGRAM\n\
Finnish (Suomi) Hei\n\
French (Français) Bonjour, Salut\n\
German (Deutsch Nord) Guten Tag\n\
German (Deutsch Süd) Grüß Gott\n\
Greek (Ελληνικά) Γειά σας\n\
Hebrew שלום\n\
Italiano Ciao, Buon giorno\n\
Lao(ພາສາລາວ) ສະບາຍດີ, ຂໍໃຫ້ໂຊກດີ\n\
Maltese Ciao\n\
Nederlands, Vlaams Hallo, Dag\n\
Norwegian (Norsk) Hei, God dag\n\
Polish Dzień dobry, Hej\n\
Russian (Русский) Здравствуйте!\n\
Slovak Dobrý deň\n\
Spanish (Español) ¡Hola!\n\
Swedish (Svenska) Hej, Goddag\n\
Thai (ภาษาไทย) สวัสดีครับ, สวัสดีค่ะ\n\
\n\
Tigrigna (ትግርኛ) ሰላማት\n\
Turkish (Türkçe) Merhaba\n\
Vietnamese (Tiếng Việt) Chào bạn\n\
\n\
Japanese (日本語) こんにちは, コンニチハ\n\
Chinese (中文,普通话,汉语) 你好\n\
Cantonese (粵語,廣東話) 早晨, 你好\n\
Korean (한글) 안녕하세요, 안녕하십니까\n\
\n\
Difference among chinese characters in GB, JIS, KSC, BIG5:\n\
GB -- 元气 开发\n\
JIS -- 元気 開発\n\
KSC -- 元氣 開發\n\
BIG5 -- 元氣 開發\n\
\n\
Just for a test of JISX0212: 騏驎 (the second character is of JISX0212)\n\
";

  setlocale (LC_ALL, "en_US.UTF-8");

#define REPEAT 10000
  {
    int repeat;

    for (repeat = 0; repeat < REPEAT; repeat++)
      {
        int width = 0;
        const char *cp = teststring;

        while (*cp != '\0')
          {
            wchar_t wc;
            int i = mbtowc (&wc, cp, 6);
            if (i < 0)
              break;
            width += wcwidth (wc);
            cp += i;
          }
        total += width;
      }
  }
  return 0;
}

/*
In UTF-8 locale:
  glibc implementation:   104 µs
  gnulib implementation: 1050 µs
    80 µs for the mbtowc call
    960 µs for the wcwidth call
      900 µs for the locale_charset call
      40 µs for the STREQ call
      20 µs for the uc_width call

Compile-command: gcc -Wall -I.. bench-wcwidth.c lib/wcwidth.c lib/localcharset.c lib/uniwidth/width.c -Ilib -DLIBDIR='"/usr/share"' -DHAVE_WORKING_O_NOFOLLOW=0
*/