[Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow.

goughostt at gmail dot com via Gcc-bugs Mon, 18 Jan 2021 02:37:41 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723


            Bug ID: 98723
           Summary: On Windows with CP936 encoding, regex compiles very
                    slow.
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goughostt at gmail dot com
  Target Milestone: ---

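Example code:

#include <clocale>   // std::setlocale
#include <regex>
#include <iostream>
#include <locale>
int main() {
   // Switch the C locale to the system default (CP936 on the affected machine).
   std::setlocale(LC_ALL, "");
   std::regex rgx{"[a-z][a-z][a-z]"};
   std::cerr<<rgx.mark_count()<<std::endl;
   return 0;
}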

Build and run this in a mingw64 environment (gcc 10.2.0): the program blocks
for a long time while compiling the regex.
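To put a number on the delay, the regex construction can be timed on its own. The following is only a minimal sketch; the helper name time_regex_compile_ms is made up for illustration and is not part of the report or of libstdc++.

```cpp
#include <chrono>
#include <regex>

// Hypothetical helper (not from the report): returns the wall-clock time,
// in milliseconds, spent constructing a std::regex from `pattern`.
double time_regex_compile_ms(const char* pattern)
{
    const auto start = std::chrono::steady_clock::now();
    std::regex rgx{pattern};
    const auto stop = std::chrono::steady_clock::now();

    // Touch the object so the construction cannot be discarded as unused.
    volatile unsigned marks = rgx.mark_count();
    (void)marks;

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_regex_compile_ms("[a-z][a-z][a-z]") once before and once after std::setlocale(LC_ALL, "") should show whether the locale switch is what makes compilation slow.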

My findings:

compiling '[a-z]' caches match information for all 256 char values;
for each char, std::collate<char>::do_transform() is called;
do_transform() uses the return value of strxfrm() to size its buffer;
on Windows, strxfrm() returns INT_MAX to indicate an error;
when the system encoding is CP936, strxfrm() fails for any char > 0x7f;
thus, compiling '[a-z]' repeatedly allocates buffers of INT_MAX + 1 chars
(about 2 GiB each).
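The failing case described above can be probed directly, without going through std::regex. The sketch below assumes, as stated in the findings, that an error is signalled by a return value of INT_MAX; the function name count_strxfrm_failures is invented for this example.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

// Hypothetical probe (not from the report): for every non-zero char value,
// ask strxfrm() how large the transformed one-character string would be and
// count the values for which it reports an INT_MAX-sized (error) result.
int count_strxfrm_failures()
{
    int failures = 0;
    for (int c = 1; c < 256; ++c) {
        const char s[2] = { static_cast<char>(c), '\0' };
        // With a count of 0 the destination may be null; only the
        // required length is returned.
        const std::size_t needed = std::strxfrm(nullptr, s, 0);
        if (needed >= static_cast<std::size_t>(INT_MAX))
            ++failures;
    }
    return failures;
}
```

In the default "C" locale no failures are expected. If the analysis above is right, calling the function after std::setlocale(LC_ALL, "") on a CP936 Windows system should report failures for the char values above 0x7f.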

Issues:

1. Regex compilation is affected by the current locale, through the call to
strxfrm(), even when std::regex::collate is not set.

2. The code in bits/locale_classes.tcc should handle the documented error
return of strxfrm() on Windows:

         size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
         // If the buffer was not large enough, try again with the
         // correct size.
         if (__res >= __len)
           {
             __len = __res + 1;
             delete [] __c, __c = 0;
             __c = new _CharT[__len];
             __res = _M_transform(__c, __p, __len);
           }
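One possible shape for such a check is sketched below as a standalone function. It is not the libstdc++ code and not a proposed patch: the name transform_checked, the use of std::vector, and the choice to throw std::runtime_error are all assumptions made for illustration, and the sketch ignores the embedded-NUL handling a real do_transform() would need.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: same retry-with-larger-buffer idea as the snippet
// above, but an INT_MAX result is treated as an error instead of a size.
std::string transform_checked(const char* p)
{
    std::size_t len = std::strlen(p) * 2 + 1;   // initial guess
    std::vector<char> buf(len);
    std::size_t res = std::strxfrm(buf.data(), p, len);

    // Assumption from the report: Windows returns INT_MAX on failure.
    if (res >= static_cast<std::size_t>(INT_MAX))
        throw std::runtime_error("strxfrm failed");

    // Buffer too small: retry once with the size strxfrm() asked for.
    if (res >= len) {
        len = res + 1;
        buf.assign(len, '\0');
        res = std::strxfrm(buf.data(), p, len);
        if (res >= len)
            throw std::runtime_error("strxfrm failed");
    }
    return std::string(buf.data(), res);
}
```

With a check like this, a failing character costs one small allocation and an exception (or whatever error path is chosen) rather than an allocation of INT_MAX + 1 chars.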
