[jira] [Comment Edited] (LUCENE-10642) Regexp query: escape sequences are treated as character classes

Andriy Redko (Jira) Wed, 06 Jul 2022 11:34:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17563380#comment-17563380
 ]


Andriy Redko edited comment on LUCENE-10642 at 7/6/22 6:33 PM:
---------------------------------------------------------------

Thanks for checking it [~uschindler], the common replacements \t \n \r do
work. Indeed, the error was not thrown before but now it is (so the impact of
using escape sequences is more apparent). The error is also confusing because
the implementation explicitly references the javadoc with character classes and
escape sequences but does not detect the latter properly. From the user's
perspective, it is non-intuitive why character classes should be denoted
with two backslashes but escape sequences with only one. I think we could make
it more convenient for users by allowing escape sequences to be used the same
way as character classes (at least, this is the way the javadoc describes it).
Anyway, the fix seems to be simple, but please feel free to close the issue if
there is no interest in supporting that. Thank you!
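
To illustrate the two escaping layers mentioned above, here is a minimal
sketch that uses only java.util.regex.Pattern (the class whose javadoc the
issue refers to), not Lucene's RegExp. The class name EscapeLayersDemo and the
helper matches are made up for this example. It shows that a Java string
literal with one backslash already contains the control character, while a
literal with two backslashes passes a backslash plus a letter to the regex
engine:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it exercises java.util.regex only, not Lucene.
public class EscapeLayersDemo {

    // Returns true if the whole input matches the given java.util.regex pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String newline = "\n";

        // One backslash in source: the Java compiler substitutes a real newline,
        // so the regex engine sees a single literal character.
        System.out.println("\"\\n\" has length " + "\n".length());

        // Two backslashes in source: the regex engine receives backslash + 'n'
        // and has to interpret the escape sequence itself.
        System.out.println("\"\\\\n\" has length " + "\\n".length());

        // Both spellings match a newline in java.util.regex.
        System.out.println(matches("\n", newline));
        System.out.println(matches("\\n", newline));

        // A predefined character class always needs the doubled backslash.
        System.out.println(matches("\\d", "7"));
    }
}
```

In java.util.regex all three matches calls succeed, which is the behaviour a
user would probably expect when the same strings are passed to a regexp query.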


was (Author: reta):
Thanks for checking it [~uschindler], the common replacements \t \n \r do
work. Indeed, the error was not thrown before but now it is (so the impact of
using escape sequences is more apparent). The error is also confusing because
the implementation explicitly references the javadoc with character classes and
escape sequences but does not detect the latter properly. From the user's
perspective, it is non-intuitive why character classes should be denoted
with two backslashes \\ but escape sequences with \. I think we could make it
more convenient for users by allowing escape sequences to be used the same way
as character classes (at least, this is the way the javadoc describes it).
Anyway, the fix seems to be simple, but please feel free to close the issue if
there is no interest in supporting that. Thank you!

> Regexp query: escape sequences are treated as character classes
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10642
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 9.0, 9.1, 9.2, 9.3
>            Reporter: Andriy Redko
>            Priority: Major
>
> An interesting issue has been reported to the OpenSearch project [1], which
> has been caused by [2], [3]. In a nutshell, the regression causes escape
> sequences (like \n, \r, \t, ...) to be treated as character classes
> (specifically,
> [https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#bs]).
> The problematic function is RegExp::matchPredefinedCharacterClass, which does
> not consider characters that denote an escaped construct. A simple test to
> reproduce, which fails with IllegalArgumentException("invalid character
> class"):
>  
> {noformat}
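> public class TestRegexpQuery extends LuceneTestCase {
>   // Expected: one hit per query; the test currently fails with
>   // IllegalArgumentException("invalid character class").
>   public void testEscapeSequences() throws IOException {
>     assertEquals(1, regexQueryNrHits("\\n"));
>     assertEquals(1, regexQueryNrHits("[\\n]"));
>   }
> }
> {noformat}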
>  
> [1] [https://github.com/opensearch-project/OpenSearch/issues/3781]
> [2] 
> [https://github.com/apache/lucene/commit/1efce5444dd40142c55c5a3a30eeebc7b86796c3]
> [3] 
> [https://github.com/apache/lucene/commit/819e668ce2fcfcf86b652a191cdbe0fad0a8ffce]
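
The distinction the issue description draws, between a backslash that
introduces a predefined character class and one that introduces an escape
sequence, can be sketched as follows. This is not Lucene's implementation and
not the proposed fix: the class EscapeKindSketch, the enum Kind and the methods
classify and resolve are hypothetical names, and the chosen letter sets
(\d \D \s \S \w \W as classes, \n \r \t as escapes) are an assumption taken
from the java.util.regex.Pattern javadoc.

```java
// Hypothetical sketch; none of these names exist in Lucene.
public class EscapeKindSketch {

    enum Kind { CHARACTER_CLASS, ESCAPE_SEQUENCE, LITERAL }

    // Classifies the character that follows a backslash in a regexp.
    static Kind classify(char c) {
        switch (c) {
            case 'd': case 'D': case 's': case 'S': case 'w': case 'W':
                return Kind.CHARACTER_CLASS;
            case 'n': case 'r': case 't':
                return Kind.ESCAPE_SEQUENCE;
            default:
                // Assumption: anything else is the escaped character itself.
                return Kind.LITERAL;
        }
    }

    // Resolves an escape sequence letter to the character it denotes.
    static char resolve(char c) {
        switch (c) {
            case 'n': return '\n';
            case 'r': return '\r';
            case 't': return '\t';
            default:
                throw new IllegalArgumentException("not an escape sequence: " + c);
        }
    }

    public static void main(String[] args) {
        for (char c : new char[] {'d', 'n', 't', '.'}) {
            System.out.println("\\" + c + " -> " + classify(c));
        }
    }
}
```

A parser that checks for escape sequences before falling back to the character
class lookup would avoid reporting \n as an invalid character class.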



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
