https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63776

--- Comment #8 from Tim Shen <timshen at gcc dot gnu.org> ---
I'm not sure how you call boost::regex in your code; here's what I did:

// g++ b.cc -lboost_regex -licuuc
#include <boost/regex/icu.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <locale>
#include <string>
using namespace boost;

int main() {
    std::locale loc("en_US.UTF-8");
    std::string s(u8"Ī");
    u32regex re = make_u32regex("[[:alpha:]]");
    std::cout << u32regex_match(s.data(), s.data() + s.size(), re) << "\n";
    return 0;
}


If this is how UTF-8 matching is done with boost, then I don't think
std::regex_match and boost::u32regex_match (notice that it's not
boost::regex_match) have the same semantics.

A user who calls boost::u32regex_match is explicitly telling the library: "I
want a Unicode match here; here's my regex object, of type u32regex; please
do the decoding and matching for me." u32regex is actually
boost::basic_regex< ::UChar32, icu_regex_traits>, with library-defined
regex traits. u32regex_match, in turn, accepts only a u32regex; it takes no
user-defined regex_traits type.

I don't think std::regex_match<BiIter, Alloc, char, RegexTraits> should be
responsible for decoding a char string into a wchar_t string and then calling
std::regex_match<AnotherBiIter, AnotherAlloc, wchar_t,
std::regex_traits<wchar_t>>, leaving the user-defined RegexTraits potentially
unused.

Instead, the user can manually decode the UTF-8 string (sadly, we don't have a
standard iterator adaptor that converts a UTF-8 char iterator into a char32_t
iterator) and call std::regex_match<..., wchar_t, ...>.
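
To make the manual route concrete, here is a rough sketch of what I have in
mind. decode_utf8 and utf8_regex_match are hypothetical helper names, not part
of the standard library or boost. The sketch assumes wchar_t is wide enough to
hold any code point (true with glibc, not on Windows), and the decoder only
checks the lead/continuation byte structure.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 into one wchar_t per code point.
// Assumption: wchar_t can hold any code point (32-bit wchar_t).
// Overlong encodings and surrogates are NOT rejected in this sketch.
std::wstring decode_utf8(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Fold the 6 payload bits of each continuation byte into cp.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical wrapper: decode first, then match on wchar_t, so the
// regex sees code points rather than UTF-8 bytes.
bool utf8_regex_match(const std::string& utf8, const std::wregex& re) {
    const std::wstring decoded = decode_utf8(utf8);
    return std::regex_match(decoded, re);
}
```

With this, utf8_regex_match(s, std::wregex(L".")) treats "Ī" as a single
character, while std::regex(".") applied to the raw bytes sees two chars.
Whether a class like [[:alpha:]] then accepts "Ī" should still depend on the
locale imbued into the std::wregex.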

This is just my understanding, so it's quite possible that I'm missing something.

Thoughts?
