[Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

            Bug ID: 98723
           Summary: On Windows with CP936 encoding, regex compiles very slow.
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goughostt at gmail dot com
  Target Milestone: ---

example code:

    #include <clocale>
    #include <iostream>
    #include <regex>
    int main() {
        std::setlocale(LC_ALL, "");
        std::regex rgx{"[a-z][a-z][a-z]"};
        std::cerr<[...]

[...]::do_transform() is made;
do_transform() will use the result of strxfrm() to allocate a buffer;
on Windows, strxfrm() returns INT_MAX to indicate an error;
if char > 0x7f and the system encoding is CP936, strxfrm() will fail;
thus, compiling '[a-z]' will repeatedly allocate large buffers.

issues:

1. Regex compilation is affected by the current locale even if
   std::regex::collate is not set, because it calls strxfrm().

2. The code in bits/locale_classes.tcc should handle the documented return
   conditions of strxfrm() on Windows:

    size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
    // If the buffer was not large enough, try again with the
    // correct size.
    if (__res >= __len)
      {
        __len = __res + 1;
        delete [] __c, __c = 0;
        __c = new _CharT[__len];
        __res = _M_transform(__c, __p, __len);
      }
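
One way to picture the handling asked for in issue 2 is a wrapper that treats an INT_MAX-sized result as an error instead of a buffer size. The following is only a minimal sketch, not libstdc++ code: the name `transform_checked` is hypothetical, and falling back to the untransformed input on failure is an assumption made for illustration, not a proposal for what do_transform() should return.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper for illustration; this is not libstdc++ code.
// Returns the strxfrm() sort key for `in` under the current C locale, or
// `in` unchanged when the transformation is reported as failed.
std::string transform_checked(const std::string& in)
{
    // strxfrm() stops at the first NUL, so this sketch requires none inside.
    assert(in.find('\0') == std::string::npos);

    // With a size of 0 the destination may be null; the return value is the
    // key length, excluding the terminating NUL.
    const std::size_t need = std::strxfrm(nullptr, in.c_str(), 0);

    // The report states that Windows returns INT_MAX to indicate an error.
    // Treating such a length as a failure on every platform is an
    // assumption of this sketch.
    if (need >= static_cast<std::size_t>(INT_MAX))
        return in;  // assumed fallback: use the input as its own key

    std::vector<char> buf(need + 1);
    const std::size_t res = std::strxfrm(buf.data(), in.c_str(), buf.size());
    if (res >= buf.size())  // still does not fit, or failed: give up
        return in;
    return std::string(buf.data(), res);
}
```

With a check of this kind, a failing call costs two short strxfrm() calls rather than an allocation of roughly INT_MAX bytes for each of the affected characters.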
[Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

--- Comment #2 from goughost ---
That may be acceptable for issue 2. But additional fixes are needed;
otherwise, users cannot use regex after calling setlocale(LC_ALL,"") in such
a situation.

Can the regex compiler work without calling _M_transform? (at least when
std::regex::collate is not set)

On the other hand, maybe the error condition can be handled by the regex
compiler code. To some extent, the bug is in the regex compiler. Building the
cache for '\xee' calls strxfrm() with "\xee\x00", which is not a valid string
if the current encoding is UTF-8. Also, on GNU/Linux, the resulting strings of
such (successful) calls might not help in building the cache.

Examples of calling strxfrm() on GNU/Linux with various locales.
(Note that, in the cases where Windows fails, Linux gives trivial results.)

// C
input 61 00, errno 0, res 1, outbuf: 61
input 62 00, errno 0, res 1, outbuf: 62
input aa 00, errno 0, res 1, outbuf: aa
input bb 00, errno 0, res 1, outbuf: bb

// C.UTF-8
input 61 00, errno 0, res 1, outbuf: 63
input 62 00, errno 0, res 1, outbuf: 64
input aa 00, errno 0, res 1, outbuf: 03
input bb 00, errno 0, res 1, outbuf: 03

// en_US.UTF-8
input 61 00, errno 0, res 10, outbuf: 51 01 02 01 02 01 00 00 00 00
input 62 00, errno 0, res 10, outbuf: 5e 01 02 01 02 01 00 00 00 00
input aa 00, errno 0, res 5, outbuf: 01 01 01 01 03
input bb 00, errno 0, res 5, outbuf: 01 01 01 01 03

// zh_CN.GB2312
input 61 00, errno 0, res 11, outbuf: e1 a9 bd 01 02 01 02 01 00 00 61
input 62 00, errno 0, res 11, outbuf: e1 a9 be 01 02 01 02 01 00 00 62
input aa 00, errno 0, res 5, outbuf: 01 01 01 01 03
input bb 00, errno 0, res 5, outbuf: 01 01 01 01 03
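
The program that produced the listing above is not included in the comment. A rough sketch of how such lines could be generated is shown here; the helper name `describe_strxfrm`, the 64-byte output buffer, and the exact formatting are assumptions chosen to mirror the listing.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: calls strxfrm() on the one-byte string {byte, 0}
// under the current C locale and formats the outcome like the listing.
std::string describe_strxfrm(unsigned char byte)
{
    const char in[2] = { static_cast<char>(byte), '\0' };
    char out[64] = {};  // assumed large enough for a single-byte input

    errno = 0;
    const std::size_t res = std::strxfrm(out, in, sizeof out);
    const int err = errno;  // saved before any other library call

    char piece[128];
    std::snprintf(piece, sizeof piece,
                  "input %02x 00, errno %d, res %lu, outbuf:",
                  static_cast<unsigned>(byte), err,
                  static_cast<unsigned long>(res));
    std::string line = piece;

    // Dump the key bytes only if the result fit into the buffer; an error
    // value such as INT_MAX would otherwise read out of bounds.
    if (res < sizeof out) {
        for (std::size_t i = 0; i < res; ++i) {
            std::snprintf(piece, sizeof piece, " %02x",
                          static_cast<unsigned>(
                              static_cast<unsigned char>(out[i])));
            line += piece;
        }
    }
    return line;
}
```

To compare locales, call `std::setlocale(LC_ALL, name)` with each locale name first and then print `describe_strxfrm()` for the bytes of interest, for example 0x61, 0x62, 0xaa, and 0xbb.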
[Bug libstdc++/106028] New: std::filesystem::path lacks conversion to native mbs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106028

            Bug ID: 106028
           Summary: std::filesystem::path lacks conversion to native mbs
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: goughostt at gmail dot com
  Target Milestone: ---

The following assumes the C++17 standard.

On Windows (mingw), a path can be built from UTF-8 (`u8path`) or from a native
mbs (`path(const char*)`), and converted to a UTF-8 string (`path.string()` and
`path.u8string()`). It seems impossible to get a native mbs string back
(easily).

I suggest that `path.string()` return the native mbs string instead. That
would allow one to `printf` the path on Windows. Otherwise, I cannot find an
easy way (without explicit encoding conversion) for a mingw cross-compiled
binary to print paths to the Cmd console.

As I understand it, the suggested change conforms to the standard. The
developers also noted this in the code (XXX):

    if (__str_codecvt_out_all(__wfirst, __wlast, __u8str, __cvt))
      {
        if constexpr (is_same_v<_CharT, char>)
          return __u8str; // XXX assumes native ordinary encoding is UTF-8.
        else
          {
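
For reference, the "explicit encoding conversion" that the report wants to avoid might look roughly like the sketch below, which goes through the wide form of the path. The helper name `wide_to_native_mbs` is hypothetical. The sketch assumes that the current C locale (for example after `setlocale(LC_ALL, "")`) describes the encoding the console expects, which may not hold on every system.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper, not part of std::filesystem: converts a wide string
// to the multibyte encoding of the current C locale using wcstombs().
// Returns false if some character cannot be represented in that encoding.
bool wide_to_native_mbs(const std::wstring& wide, std::string& out)
{
    // MB_CUR_MAX is the largest number of bytes one character can take in
    // the current locale, so this buffer always has room for the result
    // plus the terminating NUL.
    std::vector<char> buf(wide.size() * MB_CUR_MAX + 1);

    const std::size_t n = std::wcstombs(buf.data(), wide.c_str(), buf.size());
    if (n == static_cast<std::size_t>(-1))
        return false;  // conversion failed

    out.assign(buf.data(), n);
    return true;
}
```

Given a `std::filesystem::path p`, a caller would pass `p.wstring()` and, on success, hand `out.c_str()` to `printf`.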