[Bug libstdc++/98723] New: On Windows with CP936 encoding, regex compiles very slow.

2021-01-18 Thread goughostt at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

Bug ID: 98723
   Summary: On Windows with CP936 encoding, regex compiles very
slow.
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: goughostt at gmail dot com
  Target Milestone: ---

example code:

#include <clocale>
#include <iostream>
#include <regex>
int main() {
   std::setlocale(LC_ALL, "");
   std::regex rgx{"[a-z][a-z][a-z]"};
   std::cerr<[...]
[...]::do_transform() is made;
do_transform() uses the return value of strxfrm() to size its buffer;
on Windows, strxfrm() returns INT_MAX to indicate an error;
if a char is > 0x7f and the system encoding is CP936, strxfrm() fails;
thus, compiling '[a-z]' repeatedly allocates very large buffers.

issues:

1. regex compilation is affected by the current locale, through calls to
strxfrm(), even if std::regex::collate is not set.

2. the code in bits/locale_classes.tcc should handle the documented error
return of strxfrm() on Windows:

  size_t __res = _M_transform(__c, __p, __len); //*** calls strxfrm()
  // If the buffer was not large enough, try again with the
  // correct size.
  if (__res >= __len)
    {
      __len = __res + 1;
      delete [] __c, __c = 0;
      __c = new _CharT[__len];
      __res = _M_transform(__c, __p, __len);
    }
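
A minimal sketch of the kind of check issue 2 asks for, written against plain
strxfrm() rather than libstdc++ internals. The helper name guarded_transform
and the fallback of returning the untransformed bytes are assumptions made for
illustration; this is not the fix that was applied to libstdc++.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not libstdc++ code): transform `s` with strxfrm() and
// fall back to the untransformed bytes when strxfrm() reports a failure.
std::string guarded_transform(const std::string& s) {
    errno = 0;
    // With a zero-sized destination, strxfrm() only reports the needed length.
    const std::size_t need = std::strxfrm(nullptr, s.c_str(), 0);
    // Windows documents INT_MAX as the error return; errno is checked as well.
    if (need >= static_cast<std::size_t>(INT_MAX) || errno != 0)
        return s;  // assumed fallback: keep the raw bytes, allocate nothing
    std::vector<char> buf(need + 1);
    const std::size_t got = std::strxfrm(buf.data(), s.c_str(), buf.size());
    if (got >= buf.size())
        return s;  // the second call disagreed with the first; give up
    return std::string(buf.data(), got);
}
```

The point of the sketch is only that the error value is tested before it is
used as a buffer size, so a failing strxfrm() never turns into a huge
allocation.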

[Bug libstdc++/98723] On Windows with CP936 encoding, regex compiles very slow.

2021-01-18 Thread goughostt at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98723

--- Comment #2 from goughost <goughostt at gmail dot com> ---
That may be acceptable for issue 2.
But additional fixes are needed; otherwise, users cannot use regex after calling
setlocale(LC_ALL,"") in such a situation.
Can the regex compiler work without calling _M_transform? (at least when
std::regex::collate is not set)
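
To illustrate the question, here is a sketch of a per-byte match cache for a
range such as [a-z] that compares raw byte values and never calls strxfrm().
The function build_range_cache is hypothetical and is not how libstdc++ builds
its cache; whether the regex compiler may skip the transform when
std::regex::collate is not set is exactly what is being asked.

```cpp
#include <bitset>

// Hypothetical sketch (not the libstdc++ implementation): a 256-entry match
// cache for a bracket range, filled by comparing raw byte values, so the
// result does not depend on the C locale and cannot fail for bytes > 0x7f.
std::bitset<256> build_range_cache(unsigned char first, unsigned char last) {
    std::bitset<256> cache;
    for (unsigned c = 0; c < 256; ++c)
        cache[c] = (first <= c && c <= last);
    return cache;
}
```

With build_range_cache('a', 'z'), a byte such as 0xee is simply marked as not
matching, without any locale-dependent call.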

On the other hand, maybe the error condition can be handled in the regex
compiler code.
To some extent, the bug is in the regex compiler.
Building the cache for '\xee' calls strxfrm() with "\xee\x00", which is not a
valid string if the current encoding is UTF-8.
Also, on GNU/Linux, the strings resulting from such (successful) calls might
not help in building the cache.

Examples of calling strxfrm() on GNU/Linux with various locales.
(Note that in the cases where Windows fails, Linux gives trivial results.)

// C
input 61 00, errno 0, res 1, outbuf:  61
input 62 00, errno 0, res 1, outbuf:  62
input aa 00, errno 0, res 1, outbuf:  aa
input bb 00, errno 0, res 1, outbuf:  bb

// C.UTF-8
input 61 00, errno 0, res 1, outbuf:  63
input 62 00, errno 0, res 1, outbuf:  64
input aa 00, errno 0, res 1, outbuf:  03
input bb 00, errno 0, res 1, outbuf:  03

// en_US.UTF-8
input 61 00, errno 0, res 10, outbuf:  51 01 02 01 02 01 00 00 00 00
input 62 00, errno 0, res 10, outbuf:  5e 01 02 01 02 01 00 00 00 00
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03

// zh_CN.GB2312 
input 61 00, errno 0, res 11, outbuf:  e1 a9 bd 01 02 01 02 01 00 00 61
input 62 00, errno 0, res 11, outbuf:  e1 a9 be 01 02 01 02 01 00 00 62
input aa 00, errno 0, res 5, outbuf:  01 01 01 01 03
input bb 00, errno 0, res 5, outbuf:  01 01 01 01 03
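
The program that produced the lines above is not included in this message. A
probe along the following lines could generate comparable output; the function
name probe_strxfrm, the 64-byte output buffer, and the exact formatting are
assumptions.

```cpp
#include <cerrno>
#include <clocale>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical probe: transform the one-byte string {byte, 0} under the given
// locale and format the result in the style of the listing above.
std::string probe_strxfrm(const char* locale_name, unsigned char byte) {
    if (!std::setlocale(LC_ALL, locale_name))
        return "locale not available";
    const char in[2] = { static_cast<char>(byte), '\0' };
    char out[64] = {};
    errno = 0;
    const std::size_t res = std::strxfrm(out, in, sizeof out);
    const int err = errno;
    char piece[128];
    std::snprintf(piece, sizeof piece, "input %02x 00, errno %d, res %zu, outbuf: ",
                  static_cast<unsigned>(byte), err, res);
    std::string text(piece);
    // Dump the transformed bytes only if they fit in the buffer; an error
    // value such as INT_MAX on Windows is reported through `res` alone.
    const std::size_t shown = res < sizeof out ? res : 0;
    for (std::size_t i = 0; i < shown; ++i) {
        std::snprintf(piece, sizeof piece, " %02x",
                      static_cast<unsigned>(static_cast<unsigned char>(out[i])));
        text += piece;
    }
    return text;
}
```

Calling it for the bytes 0x61, 0x62, 0xaa and 0xbb under each locale name
gives one line per input; locales that are not installed are reported instead
of probed.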

[Bug libstdc++/106028] New: std::filesystem::path lacks conversion to native mbs

2022-06-18 Thread goughostt at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106028

Bug ID: 106028
   Summary: std::filesystem::path lacks conversion to native mbs
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: goughostt at gmail dot com
  Target Milestone: ---

The following assumes the C++17 standard.

On Windows (mingw), a path can be built from UTF-8 (`u8path`) or from a native
mbs (`path(const char*)`) and converted to a UTF-8 string (`path.string()` and
`path.u8string()`).
It seems impossible to get a native mbs string back (easily).

I suggest that `path.string()` return the native mbs string instead.
That would allow one to `printf` the path on Windows.
Otherwise, I cannot find an easy way (without explicit encoding conversion) for
a mingw cross-compiled binary to print paths to the Cmd console.

As I understand it, the suggested change conforms to the standard.
The developers also noted this in the code (see the XXX comment):
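
For reference, the explicit conversion mentioned above might look like the
following sketch, which goes through `path.wstring()` and `wcstombs()`. The
helper name to_native_mbs is hypothetical, and the sketch assumes that the
current C locale (for example after setlocale(LC_ALL, "")) selects the
encoding the console expects.

```cpp
#include <cstddef>
#include <cstdlib>
#include <filesystem>
#include <string>

// Hypothetical workaround: convert a path to a multibyte string in the
// encoding of the current C locale. Returns an empty string if some character
// cannot be represented in that encoding.
std::string to_native_mbs(const std::filesystem::path& p) {
    const std::wstring wide = p.wstring();
    // With a null destination, POSIX and the Microsoft CRT both return the
    // number of bytes required, or (size_t)-1 on a conversion error.
    const std::size_t need = std::wcstombs(nullptr, wide.c_str(), 0);
    if (need == static_cast<std::size_t>(-1))
        return std::string();
    std::string out(need, '\0');
    if (need != 0)
        std::wcstombs(&out[0], wide.c_str(), need);  // fills exactly `need` bytes
    return out;
}
```

The result can then be passed to `printf("%s\n", ...)`. Having `path.string()`
do this directly would make such a helper unnecessary.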

  if (__str_codecvt_out_all(__wfirst, __wlast, __u8str, __cvt)) {
    if constexpr (is_same_v<_CharT, char>)
      return __u8str; // XXX assumes native ordinary encoding is UTF-8.
    else {