[GitHub] [lucene] romseygeek commented on issue #12264: Shouldn't StandardTokenizer keep aplanum dot joined?

via GitHub Thu, 04 May 2023 04:16:00 -0700


romseygeek commented on issue #12264:
URL: https://github.com/apache/lucene/issues/12264#issuecomment-1534586713


   The tokenizer is based on http://unicode.org/reports/tr29/, which has rules 
for handling dots that appear in numbers or in URLs, but it does seem that URLs 
that have a number before a dot are not handled here. The relevant rule, I 
think, is http://unicode.org/reports/tr29/#WB6, which tells the tokenizer not 
to break on letter + dot + letter; WB11 then tells it not to break on number + 
dot + number. There is nothing about number + dot + letter, though, possibly 
because there are also a bunch of cases where we *do* actually want to break 
here.
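
To make the three cases concrete, here is a rough Python sketch of just the dot handling described above. It is a toy model, not Lucene's StandardTokenizer and not the full UAX #29 algorithm: the names `keeps_dot` and `toy_tokenize` are made up for illustration, and "letter" and "number" are approximated with `str.isalpha()` and `str.isdigit()`.

```python
def keeps_dot(before: str, after: str) -> bool:
    """Return True if a '.' between `before` and `after` stays inside the token."""
    if before.isalpha() and after.isalpha():
        return True   # letter + dot + letter (the WB6 case)
    if before.isdigit() and after.isdigit():
        return True   # number + dot + number (the WB11 case)
    return False      # mixed cases such as number + dot + letter: break


def toy_tokenize(text: str) -> list:
    """Split text into alphanumeric tokens, keeping a dot only where keeps_dot allows."""
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            current += ch
        elif (ch == "." and current and i + 1 < len(text)
              and keeps_dot(text[i - 1], text[i + 1])):
            current += ch  # dot sits between two characters of the same class
        else:
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


print(toy_tokenize("example.com"))  # letter + dot + letter
print(toy_tokenize("3.14"))         # number + dot + number
print(toy_tokenize("abc1.com"))     # number + dot + letter
```

Under these assumptions the first two inputs stay as single tokens, while `abc1.com` is split at the dot, which is the behaviour the issue is asking about.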


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

