https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723
Bug ID: 98723
Summary: On Windows with CP936 encoding, regex compiles very slow.
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: goughostt at gmail dot com
Target Milestone: ---

Example code:

    #include <clocale>   // std::setlocale
    #include <regex>
    #include <iostream>
    #include <locale>

    int main() {
        std::setlocale(LC_ALL, "");
        std::regex rgx{"[a-z][a-z][a-z]"};
        std::cerr << rgx.mark_count() << std::endl;
        return 0;
    }

When built and run in a mingw64 environment (gcc 10.2.0), the program
blocks for a long time while compiling the regex.

My findings:

  - compiling '[a-z]' needs to cache info for all 256 chars;
  - for each char, a call to std::collate<char>::do_transform() is made;
  - do_transform() uses the result of strxfrm() to allocate a buffer;
  - on Windows, strxfrm() returns INT_MAX to indicate an error;
  - if char > 0x7f and the system encoding is CP936, strxfrm() fails;
  - thus, compiling '[a-z]' repeatedly allocates large buffers: each
    failed call is treated as "buffer too small", so the retry
    allocates INT_MAX + 1 chars.

Issues:

1. Regex compilation is affected by the current locale, through the
   call to strxfrm(), even if std::regex::collate is not set.

2. The code in bits/locale_classes.tcc should handle the documented
   return conditions of strxfrm() on Windows:

    size_t __res = _M_transform(__c, __p, __len);   //*** calls strxfrm()
    // If the buffer was not large enough, try again with the
    // correct size.
    if (__res >= __len)
      {
        __len = __res + 1;
        delete [] __c, __c = 0;
        __c = new _CharT[__len];
        __res = _M_transform(__c, __p, __len);
      }
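
A minimal sketch of how issue 2 might be handled, written as standalone
code rather than as a patch to libstdc++. The names checked_transform,
transform_fn and libc_transform are hypothetical, and returning the
untransformed input on failure is only one possible policy, not something
the library is known to do. It also ignores embedded NUL characters, which
the real do_transform() has to deal with. The key assumption is that a
result of INT_MAX or more is an error indicator (as documented for
strxfrm() on Windows) and must not be used as a buffer size.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical signature, same shape as strxfrm(dest, src, count).
using transform_fn = std::size_t (*)(char*, const char*, std::size_t);

// Thin wrapper so the C library function can be passed as a plain pointer.
inline std::size_t libc_transform(char* dest, const char* src, std::size_t n)
{
    return std::strxfrm(dest, src, n);
}

// Hypothetical helper: returns the collation key for `src`, or a copy of
// `src` itself when the transform reports an error.
inline std::string checked_transform(const char* src,
                                     transform_fn xfrm = libc_transform)
{
    std::size_t len = (std::strlen(src) + 1) * 2;   // initial guess
    std::string buf(len, '\0');
    std::size_t res = xfrm(&buf[0], src, len);

    // Assumption: INT_MAX (or more) signals failure, not a required size.
    // Bail out before attempting a multi-gigabyte allocation.
    if (res >= static_cast<std::size_t>(INT_MAX))
        return std::string(src);

    // Ordinary "buffer too small" case: retry once with the exact size.
    if (res >= len) {
        len = res + 1;
        buf.assign(len, '\0');
        res = xfrm(&buf[0], src, len);
        if (res >= len)                 // still failing: give up, fall back
            return std::string(src);
    }

    buf.resize(res);                    // drop the unused tail
    return buf;
}
```

Calling checked_transform("abc") uses the C library's strxfrm() under the
current locale; passing a second argument substitutes another transform,
which makes the failure path easy to exercise without a CP936 system.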