[Bug libstdc++/71500] regex::icase only works on first character in a range

timshen at gcc dot gnu.org Sat, 11 Jun 2016 22:29:47 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71500


--- Comment #8 from Tim Shen <timshen at gcc dot gnu.org> ---
(In reply to Michael Duggan from comment #7)
> Hmm...  Okay.  For the sake of argument, I am going to make the
> following claims: 

Yeah, thanks for the arguments, we should at least get this clear.

> a) Ignoring case _requires_ a locale.  Without a locale, how do you
>    determine the case of a character anyway?  Especially if the
>    character is not a char?  Even std::toupper and std::towuppwer are
>    bound to the "C" locale.
> 
>    As such, you can't say you're not to use the locale when collate is
>    off.  Otherwise icase wouldn't work ever unless collate was
>    specified.

Not necessarily. My interpretation of "icase" is whether regex should call
traits_type::translate or traits_type::translate_nocase when comparing two
characters. No locale is directly involved - a user-defined regex traits type
may ignore the locale and inject its own weird stuff into translate_nocase.
It's just, not surprisingly, std::regex_traits<T> "happens to" use locale to
implement translate_nocase.

> b) You can still use regex_traits::transform in an icase scenario.  The
>    rule for [A-B], matching X should be:
> 
>      transform(a) <= transform(x) <= transform(b)
> 
>    In the icase scenario, I would posit that the result of this _or_
> 
>      transform(a) <= transform(toggle_case(x)) <= transform(b)
> 
>    is correct.

+1.

I looked at boost source, it seems that "collate" is interpreted as whether
regex should call traits_type::transform when doing range matching:
https://github.com/boostorg/regex/blob/9059bfb5c6287b0c579bfa4be5160b44c8cc2957/include/boost/regex/v4/perl_matcher.hpp#L198

I don't like the way the documentation put it, but the definition above seems
accurate and clear. transform/collate seems to be an unrelated concern.


With above two claims hold, my point is, the way I can think of to implement
toggle_case:
  char_type basic_regex<...>::toggle_case(char_type c) {
    const auto& ctype =
std::use_facet<std::ctype<char_type>>(this->_M_traits.getloc());
    auto lower = ctype.tolower(c);
    if (lower == c) return ctype.toupper(c);
    return lower;
  }

it uses locale's definition of uppercase/lowercase, not translate_nocase's
definition. This result might surprise a picky user.

There is a correct way to implement this:
  if (icase)
    c = traits.translate_nocase(c);
  else
    c=  traits.translate(c);
  for (pair<char_type, char_type> range : regex.range_set)
    for (char_type e = range.first; e != range.second+1; ++e) {
      if (icase)
        e = traits.translate_nocase(e);
      else
        e = traits.translate(e);
      if (collate) {
        if (traits.transform(e) == traits.transform(c))
          return true;
      } else {
        if (e == c)
          return true;
      }
    }
  }
  return false;

But it's terribly inefficient and not practical at all.

[Bug libstdc++/71500] regex::icase only works on first character in a range

Reply via email to