[ https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563380#comment-17563380 ]
Andriy Redko edited comment on LUCENE-10642 at 7/6/22 6:33 PM:
---------------------------------------------------------------

Thanks for checking it [~uschindler], the common replacements \t \n \r do work. Indeed, the error was not thrown before but now it is (so the impact of using escape sequences is more apparent). The error is also confusing because the implementation explicitly references the javadoc with character classes and escape sequences but does not detect the latter properly. From the user's perspective, it is not intuitive why character classes should be denoted with two backslashes (\\) but escape sequences with only one (\). I think we could make it more convenient for users by allowing escape sequences to be used the same way as character classes (at least, this is how the javadoc describes it). Anyway, the fix seems to be simple, but please feel free to close the issue if there is no interest in supporting that. Thank you!

was (Author: reta):
Thanks for checking it [~uschindler], the common replacements \t \n \r do work. Indeed, the error was not thrown before but now it does (so the impact of using escape sequences is more apparent). The error is also confusing because the implementation references explicitly the javadoc with character classes and escape sequences but does not detect latter properly. From the user perspective, is it non-intuitive why the character classes should be denoted with two slashes \\ but escape sequences with \, I think we could make it more convenient for users allow usage of escape sequences the same way as character classes (at least, this is the way javadoc describes that). Anyway, fix seems to be simple but please feel free to close the issue if there is no interest in supporting that. Thank you!
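
For comparison, here is a minimal sketch of how java.util.regex.Pattern (the class whose javadoc the issue cites) treats the one-backslash and two-backslash spellings in Java source. It says nothing about Lucene's RegExp itself; the class name EscapeDemo and its helper methods are made up for illustration.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Java source "\n": the compiler emits a single line-feed character,
    // so the regex engine sees a literal LF and matches it as-is.
    static boolean literalNewlineMatches() {
        return Pattern.matches("\n", "\n");
    }

    // Java source "\\n": the regex engine receives backslash + 'n' and
    // resolves it itself as the newline escape sequence.
    static boolean escapedNewlineMatches() {
        return Pattern.matches("\\n", "\n");
    }

    // Java source "\\d": backslash + 'd' is a predefined character class.
    static boolean digitClassMatches() {
        return Pattern.matches("\\d", "7");
    }

    public static void main(String[] args) {
        System.out.println("literal \\n matches:  " + literalNewlineMatches());
        System.out.println("escaped \\\\n matches: " + escapedNewlineMatches());
        System.out.println("class \\\\d matches:   " + digitClassMatches());
    }
}
```

With java.util.regex both spellings of the newline match, which is the behaviour the comment above asks Lucene to mirror for escape sequences.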
> Regexp query: escape sequences are treated as character classes
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10642
>             Project: Lucene - Core
>          Issue Type: Bug
>   Affects Versions: 9.0, 9.1, 9.2, 9.3
>            Reporter: Andriy Redko
>            Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], which was
> caused by [2], [3]. In a nutshell, the regression causes escape
> sequences (like \n, \r, \t, ...) to be treated as character classes
> (specifically,
> [https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs]).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does
> not consider characters that denote an escaped construct. A simple test to
> reproduce, which fails with IllegalArgumentException("invalid character class"):
>
> {noformat}
> public class TestRegexpQuery extends LuceneTestCase {
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}
>
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]
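
The distinction the issue describes, between a character that names a predefined character class and one that denotes an escaped construct, can be sketched roughly as below. This is a hypothetical illustration only, not the code of RegExp::matchPredefinedCharacterClass and not the proposed fix. The class EscapeClassifier, the enum Kind and the method classify are invented names. The set of class letters (d, D, s, S, w, W) is an assumption about what the parser supports, and the escape letters (t, n, r, f, a, e) are taken from the java.util.regex.Pattern javadoc.

```java
public class EscapeClassifier {
    // Hypothetical categories for the character that follows a backslash.
    enum Kind { CHARACTER_CLASS, ESCAPE_SEQUENCE, LITERAL }

    // Decide how the character after a backslash should be interpreted.
    static Kind classify(char afterBackslash) {
        switch (afterBackslash) {
            // Assumed predefined character classes: \d \D \s \S \w \W
            case 'd': case 'D': case 's': case 'S': case 'w': case 'W':
                return Kind.CHARACTER_CLASS;
            // Escape sequences listed in the Pattern javadoc: \t \n \r \f \a \e
            case 't': case 'n': case 'r': case 'f': case 'a': case 'e':
                return Kind.ESCAPE_SEQUENCE;
            // Anything else is treated here as an escaped literal, e.g. \. or \*
            default:
                return Kind.LITERAL;
        }
    }

    public static void main(String[] args) {
        for (char c : new char[] {'d', 'n', '.'}) {
            System.out.println("\\" + c + " -> " + classify(c));
        }
    }
}
```

A parser that checks only the first group and rejects every other letter would report \n as an invalid character class, which matches the symptom in the reproduction test above.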