eraneverlaw opened a new issue, #12561: URL: https://github.com/apache/lucene/issues/12561
### Description

The `UAX29URLEmailTokenizerImpl.jflex` grammar matches commas as part of the email local part, as well as invalid leading, trailing, or consecutive periods. Examples of bad matches: `foo,b...@yahoo.com`, `,b...@yahoo.com`, `b...@yahoo.com`, `.b...@gmail.com`, `foo......@gmail.com`.

https://github.com/apache/lucene/blame/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/email/UAX29URLEmailTokenizerImpl.jflex#L188

Relevant code:

```
EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
```

The problem is that `EMAILatomText` matches periods and commas, which it was not meant to do. Commas are invalid in the local part unless it is quoted (`EMAILquotedString` supports that). Periods are supposed to be valid only where `EMAILlocalPart` allows them, but with the bug the `("." {EMAILlabel})*` part never matches, because the periods are already consumed by `EMAILlabel`.

As for why it matches commas and periods, see the ASCII table: [image]

Look for a hidden range that matches 5 characters. It's the `+-/` above: inside a character class this is the range from `+` (0x2B) to `/` (0x2F), which covers `+`, `,`, `-`, `.`, and `/`. Clearly copy-pasted from someplace like this: [image]

The dash acts as a range operator here, and coincidentally (or is it?) it is also the middle character of that range, which may have made the bug harder to discover, since a literal dash still matches.

The solution is as simple as escaping the dash:

```
EMAILatomText = [A-Za-z0-9!#$%&'*+\-/=?\^_`{|}~]
```

You can of course also move the dash to the beginning or the end of the character class, where it is taken literally, but that seems a bit more fragile, and escaping keeps the "minus" after the "plus", as in the original order this was copied from.

### Version and environment details

Current and all older versions, I believe.
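The hidden range described above can be reproduced outside JFlex. The following is a minimal sketch in Python, whose `re` module treats `+-/` inside a character class as a range in the same way. It is not the tokenizer itself: the names `BUGGY_ATOM_TEXT`, `FIXED_ATOM_TEXT`, and `local_part_pattern` are made up for illustration, and the quoted-string alternative of `EMAILlabel` is left out for brevity.

```python
import re

# Character class as quoted in the issue: the unescaped "+-/" is a range.
BUGGY_ATOM_TEXT = r"[A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]"
# Same class with the dash escaped, as the issue proposes.
FIXED_ATOM_TEXT = r"[A-Za-z0-9!#$%&'*+\-/=?\^_`{|}~]"


def local_part_pattern(atom_text):
    """Simplified local part: label ("." label)*, without quoted strings."""
    label = "(?:" + atom_text + ")+"
    return re.compile(label + r"(?:\." + label + ")*")


# Every character from '+' (0x2B) through '/' (0x2F) falls inside the range.
hidden_range = [chr(c) for c in range(ord("+"), ord("/") + 1)]

buggy = local_part_pattern(BUGGY_ATOM_TEXT)
fixed = local_part_pattern(FIXED_ATOM_TEXT)

print(hidden_range)
for candidate in ["foo.bar", "foo-bar", "foo,bar", ".bar", "foo......"]:
    print(
        repr(candidate),
        "buggy:", bool(buggy.fullmatch(candidate)),
        "fixed:", bool(fixed.fullmatch(candidate)),
    )
```

With the dash escaped, `foo.bar` and `foo-bar` are still accepted, while the comma and the leading or repeated periods are rejected, because periods can then only be consumed by the explicit `"."` separator between labels.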