msfroh commented on PR #12750: URL: https://github.com/apache/lucene/pull/12750#issuecomment-1828469855
I was looking into this and the approach used for (Edge)NGramTokenizer back in 2013: https://github.com/apache/lucene/commit/a03e38d5d05008aaef969a200071c03a1d6cb991

The solution there is to *always* set the position increment and length to 1: https://github.com/apache/lucene/blob/8ef6a0da56878177ff8d6880c92e8f7d0321d076/lucene/analysis/common/src/java/org/apache/lucene/analysis/ngram/NGramTokenizer.java#L186-L187

With that change, your test passes (but I had to change every other test): https://github.com/msfroh/lucene/commit/0d05366c65a79aabc407e0662537520ba9c56737

Given that it's not backward-compatible, I imagine it would have to be a change for 10.0?

Also, whatever we do should probably be applied to ReversePathHierarchyTokenizer too.
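To illustrate what "always set the position increment and length to 1" means for path-style tokens, here is a minimal plain-Java sketch. It does not use Lucene's TokenStream or attribute API; `PathTokenSketch`, `Token`, and `tokenize` are hypothetical names made up for this example, and the splitting logic is a simplification rather than the tokenizer's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class PathTokenSketch {

    /** Hypothetical stand-in for a term plus its position attributes. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }

        @Override
        public String toString() {
            return term + " (posInc=" + positionIncrement + ", posLen=" + positionLength + ")";
        }
    }

    /**
     * Emits one token per path prefix. Every token gets a position
     * increment of 1 and a position length of 1, mirroring the approach
     * described above. Simplified: no skip, replacement, or buffer handling.
     */
    static List<Token> tokenize(String path, char delimiter) {
        List<Token> tokens = new ArrayList<>();
        for (int i = 1; i <= path.length(); i++) {
            boolean atEnd = i == path.length();
            // A prefix ends just before each delimiter and at the end of the input.
            if (atEnd || path.charAt(i) == delimiter) {
                tokens.add(new Token(path.substring(0, i), 1, 1));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("/a/b/c", '/')) {
            System.out.println(t);
        }
    }
}
```

With the input `/a/b/c`, the sketch yields the prefixes `/a`, `/a/b`, and `/a/b/c`, each one position after the previous token.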