romseygeek commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534586713

   The tokenizer is based on http://unicode.org/reports/tr29/, which has rules 
for handling dots that appear in numbers or in URLs, but it does seem that URLs 
that have a number before a dot are not handled here. The relevant rule, I think, 
is http://unicode.org/reports/tr29/#WB6, which tells the tokenizer not to break 
on letter + dot + letter, and then WB11 tells it not to break on number + dot + 
number, but there's nothing about number + dot + letter. Possibly that is because 
there are also a bunch of cases where we *do* actually want to break here?
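
To make the three cases concrete, here is a toy sketch in Python. It is not Lucene's StandardTokenizer and not the full TR29 algorithm: `tokenize` is a hypothetical helper that models only the dot handling described above (keep the dot between two letters or between two digits, break otherwise), and the sample inputs are made up for illustration.

```python
def tokenize(text: str) -> list[str]:
    """Toy model of the dot rules discussed above (assumption: dots only).

    - letter + dot + letter -> no break (in the spirit of WB6/WB7)
    - digit  + dot + digit  -> no break (in the spirit of WB11/WB12)
    - anything else around a dot, or any other non-alphanumeric -> break
    """
    tokens: list[str] = []
    current: list[str] = []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        keep_dot = ch == "." and (
            (prev.isalpha() and nxt.isalpha())
            or (prev.isdigit() and nxt.isdigit())
        )
        if keep_dot:
            current.append(ch)
        elif current:
            # Break here: flush the token collected so far.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    for sample in ["example.com", "3.14", "site1.com"]:
        print(sample, "->", tokenize(sample))
```

Under these simplified rules, `example.com` and `3.14` each stay as one token, while `site1.com` splits at the dot because the character before it is a digit and the character after it is a letter.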


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

