Re: speeding up `wc -m`

Pádraig Brady Mon, 21 May 2018 19:27:06 -0700

On 21/05/18 11:00, Bruno Haible wrote:
> Hi Pádraig,
> 
>> $ yes áááááááááááááááááááá | head -n100000 > mbc.txt
>> $ yes 12345678901234567890 | head -n100000 > num.txt
>>
>> ===== Before ====
>>
>> $ time src/wc -m < mbc.txt
>> 2100000
>> real    0m0.186s
>>
>> $ time src/wc -m < num.txt
>> 2100000
>> real    0m0.056s
> 
> Here's my take on improving this. I'm attaching draft patches that have
> this effect on the timings:
> 
> * On glibc:
> 
>              num     mbc
>   Before    0.056   0.152
>   After     0.057   0.089
>   -------
>   Speedup    1.0     1.7
>   factor
> 
> * On macOS 10.13:
> 
>              num     mbc
>   Before    0.153   0.229
>   After     0.042   0.112
>   -------
>   Speedup    3.6     2.0
>   factor
> 
> Basically, the two problems that the profiling found were:
> 
>   * It is pointless to call locale_charset repeatedly, because the
>     locale won't change while 'wc' is running.
> 
>   * glibc has a slow mbrtowc() implementation for UTF-8 locales.
> 
> Both problems can be addressed with the "abstract factory" design patterns.
> Namely, instead of using the generic 'wcwidth'/'mbrtowc' function each
> time, let the program produce an optimized 'wcwidth'/'mbrtowc' function
> [pointer] once, and then call this optimized function pointer repeatedly
> for each character.
> 
> While at it, let me also do the same for the initialization of an mbstate_t,
> because on macOS the mbstate_t is 128 bytes long but only the first 12 bytes
> actually matter.
> 
> This factory of function pointers side-steps the portability problems of
> 'locale_t'.
> 
> Notes:
>   - When you use these new gnulib modules, you are programming against an API
>     that is very similar to POSIX, but not exactly POSIX.
>   - The platform-specific #ifs have to be adjusted, by the help of configure
>     tests.
>   - mbrtowc-factory needs a unit test (for which I have a draft).


Wow thanks for doing all that.

It's also worth noting that using wcwidth-factory etc.
will always use the replacement routines on utf8,
which would be a disadvantage on code size and divergence.

Another disadvantage is the change in API which is a bit awkward,
and would need tweaking elsewhere in coreutils to take advantage
of the speedup.

The first locale_charset() issue at least could be dealt with internally
to gnulib with a simple gnulib level config.  The attached is a proposed
solution to the charset issue, that would just require depending
on the wchar-single gnulib module to indicate locales don't change across calls.

Now if utf8_mbrtowc() is about 4.5 and 2.3 times faster than mbrtowc()
on glibc and macOS respectively, it's probably worthwhile to replace 
unconditionally.
Though it would be cool to do that behind the wchar-single setting
which could setup the appropriate calls/pointers internally to rpl_mbrtowc()?

As for the reduced mbstate settings, it might not be worth
complicating the interface for this perf gain?

cheers,
Pádraig.

From 3081860737968ae9d461c1dc5ab2e31f94918d35 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com>
Date: Sun, 20 May 2018 22:11:12 -0700
Subject: [PATCH] wchar-single: a new module to enable optimizations in wchar
 replacements

* lib/mbrtowc.c (mbrtowc): Only check locale_charset() once if
GNULIB_WCHAR_SINGLE is enabled.
* lib/wcwidth.c (wcwidth): Likewise.
---
 lib/mbrtowc.c        | 14 ++++++++++++--
 lib/wcwidth.c        | 15 +++++++++++++--
 modules/wchar-single | 21 +++++++++++++++++++++
 3 files changed, 46 insertions(+), 4 deletions(-)
 create mode 100644 modules/wchar-single

diff --git a/lib/mbrtowc.c b/lib/mbrtowc.c
index 22ac2c4..dc3597f 100644
--- a/lib/mbrtowc.c
+++ b/lib/mbrtowc.c
@@ -138,9 +138,19 @@ mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
         goto invalid;
       /* Here MB_CUR_MAX > 1 and 0 < m < 4.  */
       {
-        const char *encoding = locale_charset ();
+        static int utf8_charset = -1;
+        static const char *encoding;
 
-        if (STREQ_OPT (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0, 0))
+# if GNULIB_WCHAR_SINGLE
+        if (utf8_charset == -1)
+# endif
+          {
+            encoding = locale_charset ();
+            utf8_charset = STREQ_OPT (encoding,
+                                      "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0);
+          }
+
+        if (utf8_charset)
           {
             /* Cf. unistr/u8-mblen.c.  */
             unsigned char c = (unsigned char) p[0];
diff --git a/lib/wcwidth.c b/lib/wcwidth.c
index 5fb090e..d03a50f 100644
--- a/lib/wcwidth.c
+++ b/lib/wcwidth.c
@@ -30,9 +30,20 @@ int
 wcwidth (wchar_t wc)
 #undef wcwidth
 {
+  static int utf8_charset = -1;
+  static const char *encoding;
+
+#if GNULIB_WCHAR_SINGLE
+  if (utf8_charset == -1)
+#endif
+    {
+      encoding = locale_charset ();
+      utf8_charset = STREQ_OPT (encoding,
+                                "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0);
+    }
+
   /* In UTF-8 locales, use a Unicode aware width function.  */
-  const char *encoding = locale_charset ();
-  if (STREQ_OPT (encoding, "UTF-8", 'U', 'T', 'F', '-', '8', 0, 0, 0 ,0))
+  if (utf8_charset)
     {
       /* We assume that in a UTF-8 locale, a wide character is the same as a
          Unicode character.  */
diff --git a/modules/wchar-single b/modules/wchar-single
new file mode 100644
index 0000000..e047de0
--- /dev/null
+++ b/modules/wchar-single
@@ -0,0 +1,21 @@
+Description:
+Enable more efficient wchar replacements, where we know
+the locale charset will not change between calls.
+
+Files:
+
+Depends-on:
+wchar
+
+configure.ac:
+gl_MODULE_INDICATOR([wchar-single])
+
+Makefile.am:
+
+Include:
+
+License:
+LGPLv2+
+
+Maintainer:
+all
-- 
2.9.3

Re: speeding up `wc -m`

Reply via email to