This is another bug in computing the fastmap. I overlooked it when fixing the fastmap mess because it usually does not show up with _LIBC. However, the bug is present in that case too.
The bug is that whenever a range appears at the beginning of the regex, the regex must be tried at every possible multibyte character, so the fastmap has to include every multibyte lead byte. The reason _LIBC masks the bug is that there is almost always a collation symbol for each possible multibyte-character lead byte, so all the lead bytes are generally already part of the fastmap.

A simple reproducer is the following sed script:

  $ echo 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя' | ./bad-sed -e 's/[а-я]/!/g'
  абвгдеёжзийклмнопрстуфхцчшщъыьэюя
  $ echo 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя' | ./good-sed -e 's/[а-я]/!/g'
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

2009-11-25  Paolo Bonzini  <bonz...@gnu.org>

        * lib/regcomp.c (re_compute_fastmap_iter): Add all multibyte lead
        characters when a multibyte character range is included.
---
 ChangeLog     |    6 ++++++
 lib/regcomp.c |    2 +-
 2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index fcdf307..54c5514 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,9 @@
+2009-11-25  Paolo Bonzini  <bonz...@gnu.org>
+
+        regex: Fix fastmap for multibyte character ranges.
+        * lib/regcomp.c (re_compute_fastmap_iter): Add all multibyte lead
+        characters when a multibyte character range is included.
+
 2009-11-22  Andy Wingo  <wi...@pobox.com>
 
         version-etc: work also with AM_INIT_AUTOMAKE's no-define option
diff --git a/lib/regcomp.c b/lib/regcomp.c
index 6472ff6..6aef405 100644
--- a/lib/regcomp.c
+++ b/lib/regcomp.c
@@ -383,7 +383,7 @@ re_compile_fastmap_iter (regex_t *bufp, const re_dfastate_t *init_state,
                  applies to multibyte character sets; for single byte character
                  sets, the SIMPLE_BRACKET again suffices.  */
               if (dfa->mb_cur_max > 1
-                  && (cset->nchar_classes || cset->non_match
+                  && (cset->nchar_classes || cset->non_match || cset->nranges
 # ifdef _LIBC
                       || cset->nequiv_classes
 # endif /* _LIBC */
-- 
1.6.5.2