eraneverlaw opened a new issue, #12561:
URL: https://github.com/apache/lucene/issues/12561

   ### Description
   
   The `UAX29URLEmailTokenizerImpl.jflex` code matches commas as part of email 
local part, as well as invalid leading, trailing, or consecutive periods. 
Examples of bad matches: `foo,b...@yahoo.com`, `,b...@yahoo.com`, 
`b...@yahoo.com`, `.b...@gmail.com`, `foo......@gmail.com`. 
   
   
https://github.com/apache/lucene/blame/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/UAX29URLEmailTokenizerImpl.jflex#L188
   
   Relevant code:
   ```
   EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
   EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
   EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
   ```
   
   The problem is that `EMAILatomText` matches periods and commas, which it was 
not meant to do. Commas are invalid in the local part unless it is quoted 
(`EMAILquotedString` supports that). Periods are only supposed to be valid 
where `EMAILlocalPart` allows them, as separators between labels. With the bug, 
the `("." {EMAILlabel})*` part never comes into play, because periods are 
already matched inside `EMAILlabel`.
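
A rough way to see this behavior outside of JFlex is to approximate the three definitions with `java.util.regex`. This is only a sketch: it assumes Java's character-class range semantics match JFlex's for this class, it leaves out `EMAILquotedString`, and the class name, method name, and sample local parts are made up for illustration.

```java
import java.util.regex.Pattern;

class BuggyLocalPartDemo {
    // Same characters as the EMAILatomText quoted in this issue; "+-/" is read as a range.
    static final String ATOM_TEXT = "[A-Za-z0-9!#$%&'*+-/=?\\^_`{|}~]";
    // Simplification: EMAILquotedString is omitted from the label.
    static final String LABEL = ATOM_TEXT + "+";
    // Approximates EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
    static final Pattern LOCAL_PART = Pattern.compile(LABEL + "(\\." + LABEL + ")*");

    static boolean matchesLocalPart(String s) {
        return LOCAL_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical local parts: one valid, four that should be rejected.
        String[] samples = {"foo.bar", "foo,bar", ",bar", ".bar", "foo......"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesLocalPart(s));
        }
    }
}
```

Under these assumptions all five samples are accepted, including the ones with commas and with leading, trailing, or consecutive periods.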
   
   So why does it match commas and periods?
   
   See the ASCII table:
   
   
![image](https://github.com/apache/lucene/assets/40778421/68691685-a661-4aeb-bab3-1753ead4ee41)
   
   Now look for a hidden range that matches these 5 characters: `+`, `,`, `-`, 
`.` and `/`. It's the `+-/` in `EMAILatomText` above. Clearly copy-pasted from 
someplace like this:
   
![image](https://github.com/apache/lucene/assets/40778421/1874739b-14c9-47ce-9a4d-93fda529980b)
   
   The dash acts as a range operator here and, coincidentally (or is it?), it 
is also the middle character of that range. A literal dash therefore still 
matches, which may have kept this bug from being discovered sooner.
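
To make the hidden range concrete, here is a small sketch that lists every character between `+` and `/` in ASCII order. The class and method names are hypothetical and exist only for this illustration.

```java
class HiddenRangeDemo {
    // Lists every character in the inclusive range from..to, which is how
    // a character class interprets "+-/" when the dash is not escaped.
    static String charsInRange(char from, char to) {
        StringBuilder sb = new StringBuilder();
        for (char c = from; c <= to; c++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(charsInRange('+', '/')); // prints "+,-./"
    }
}
```

The comma (0x2C) and period (0x2E) sit on either side of the dash (0x2D), so both end up in the class.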
   
   The solution is as simple as escaping the dash:
   ```
   EMAILatomText = [A-Za-z0-9!#$%&'*+\-/=?\^_`{|}~]
   ```
   You could of course also move the dash to the end or the beginning of the 
character class, where it is taken literally, but that seems a bit more 
fragile. Escaping also keeps the "minus" after the "plus", as in the original 
order this was copied from.
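
As a sanity check of both variants, the sketch below again uses `java.util.regex` as a stand-in for JFlex, on the assumption that the escaped dash and the trailing dash are treated as literals in both. `EMAILquotedString` is omitted, and the class name, method names, and sample strings are hypothetical.

```java
import java.util.regex.Pattern;

class FixedAtomTextDemo {
    // Variant 1: dash escaped in place, as proposed in this issue.
    static final String ESCAPED = "[A-Za-z0-9!#$%&'*+\\-/=?\\^_`{|}~]";
    // Variant 2: dash moved to the end of the class, where it is literal.
    static final String DASH_LAST = "[A-Za-z0-9!#$%&'*+/=?\\^_`{|}~-]";

    // Approximates EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
    // with EMAILlabel reduced to {EMAILatomText}+ (no quoted strings).
    static Pattern localPart(String atomText) {
        String label = atomText + "+";
        return Pattern.compile(label + "(\\." + label + ")*");
    }

    static boolean accepts(String atomText, String s) {
        return localPart(atomText).matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"foo.bar", "foo-bar", "foo+bar/baz",
                            "foo,bar", ".bar", "foo......", "foo..bar"};
        for (String s : samples) {
            System.out.println(s + " escaped=" + accepts(ESCAPED, s)
                + " dashLast=" + accepts(DASH_LAST, s));
        }
    }
}
```

With either variant, the first three samples should be accepted and the last four rejected, since periods are now only allowed between labels.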
   
   ### Version and environment details
   
   Current and all older versions I believe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org