Mark Harwood created LUCENE-9370:
------------------------------------

             Summary: RegExpQuery should error for inappropriate use of \ 
character in input
                 Key: LUCENE-9370
                 URL: https://issues.apache.org/jira/browse/LUCENE-9370
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/search
    Affects Versions: master (9.0)
            Reporter: Mark Harwood


The RegExp class is too lenient when parsing user input. This can confuse or 
mislead users, and it causes backwards-compatibility issues as we enhance regex 
support.

In normal regular expression syntax the backslash is used to:
*  escape a reserved character like  \. 
*  use certain unreserved characters in a shorthand context e.g. \d means 
digits [0-9]
 
The leniency bug in RegExp is that it adds an extra rule to this list - any 
backslashed characters that don't satisfy the above rules are taken literally. 
For example, there's no reason to put a backslash in front of the letter "p" 
but we accept \p as the letter p.

Java's Pattern class will throw a parse exception given a meaningless backslash 
like \p.
We should too.
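
For reference, a minimal sketch of how the java.util.regex.Pattern behaviour 
described above could be checked. The class and method names 
(PatternBackslashCheck, isRejected) are made up for illustration; only 
Pattern.compile and PatternSyntaxException come from the JDK.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternBackslashCheck {

    /** Returns true if java.util.regex.Pattern refuses to compile the expression. */
    static boolean isRejected(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            // Pattern reports malformed expressions, including unsupported
            // backslash escapes, with this unchecked exception.
            return true;
        }
    }

    public static void main(String[] args) {
        // Note: "\\" in a Java string literal is a single backslash in the regex.
        System.out.println("\\. rejected: " + isRejected("\\.")); // escaped reserved character
        System.out.println("\\d rejected: " + isRejected("\\d")); // shorthand for digits
        System.out.println("\\p rejected: " + isRejected("\\p")); // bare \p, the example from this issue
    }
}
```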

In [LUCENE-9336|https://issues.apache.org/jira/browse/LUCENE-9336] we added 
support for common regex shorthands such as `\d`. Sadly this is a breaking 
change, because the leniency had allowed \d to be accepted as the letter d 
without an exception. Users were likely silently missing results they were 
hoping for, and we created a BWC problem for ourselves by filling in the gaps.

I propose we follow other regex parsers and throw an error on inappropriate 
use of backslashes.
This will be another breaking change, so it should target 9.0.
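
A rough sketch of the kind of strict check proposed here, not the actual RegExp 
parser change. Everything in it is an assumption made for illustration: the 
class and method names, the RESERVED set (the real set depends on which 
optional syntax is enabled) and the SHORTHAND letters. It also ignores quoted 
strings and character classes, which a real parser would have to handle.

```java
class StrictEscapeCheck {

    // Assumed set of reserved characters that may legitimately be escaped.
    private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

    // Assumed set of letters that form a shorthand class after a backslash.
    private static final String SHORTHAND = "dDsSwW";

    /**
     * Throws IllegalArgumentException if a backslash is followed by a character
     * that is neither reserved nor a known shorthand, or if it ends the input.
     */
    static void validateEscapes(String regex) {
        for (int i = 0; i < regex.length(); i++) {
            if (regex.charAt(i) != '\\') {
                continue;
            }
            if (i + 1 >= regex.length()) {
                throw new IllegalArgumentException(
                    "trailing backslash at position " + i);
            }
            char next = regex.charAt(i + 1);
            if (RESERVED.indexOf(next) < 0 && SHORTHAND.indexOf(next) < 0) {
                throw new IllegalArgumentException(
                    "invalid escape \\" + next + " at position " + i);
            }
            i++; // skip the escaped character so "\\\\" is read as one escape
        }
    }

    public static void main(String[] args) {
        validateEscapes("a\\.b\\d"); // accepted: escaped dot and digit shorthand
        try {
            validateEscapes("\\p");  // rejected under the rules assumed above
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast in this way would turn the silent "\p means p" case into an 
error the user sees at parse time.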




