Hi Rick,

Quoting Robert Muir's comments on https://issues.apache.org/jira/browse/LUCENE-2167
(he's referring to the word break rules in UAX#29 [1] when he says "the standard"):

> i actually am of the opinion StandardTokenizer should follow unicode standard
> tokenization. then we can throw subjective decisions away, and stick with a
> standard.

> I think it would be really nice for StandardTokenizer to adhere straight to
> the standard as much as we can with jflex [....] Then its name would actually
> make sense.

[1] Unicode Standard Annex #29: Unicode Text Segmentation
<http://unicode.org/reports/tr29/>

--
Steve
www.lucidworks.com

> On Jan 10, 2018, at 10:09 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 1/10/2018 2:27 PM, Rick Leir wrote:
>> I did not express that clearly.
>> The reference guide says "The Classic Tokenizer preserves the same behavior
>> as the Standard Tokenizer of Solr versions 3.1 and previous."
>> So I am curious to know why they changed StandardTokenizer after 3.1 to
>> break on hyphens, when it seems to me to work better the old way?
>
> I really have no idea. Those are Lucene classes, not Solr. Maybe someone
> who was around for whatever discussions happened on Lucene lists back in
> those days will comment.
>
> I wasn't able to find the issue where ClassicTokenizer was created, and I
> couldn't find any information discussing the change.
>
> If I had to guess why StandardTokenizer was updated this way, I think it was
> to accommodate searches where people were looking for one word in text
> where that word was part of something larger with a hyphen, and it wasn't
> being found. There was probably a discussion among the developers about what
> a typical Lucene user would want, so they could decide what they would have
> the standard tokenizer do.
>
> Likely because there was a vocal segment of the community reliant on the old
> behavior, they preserved that behavior in ClassicTokenizer, but updated the
> standard one to do what they felt would be normal for a typical user.
>
> Obviously *your* needs do not fall in line with what was decided ... so the
> standard tokenizer isn't going to work for you.
>
> Thanks,
> Shawn
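
P.S. For anyone who wants to see the hyphen difference concretely, here is a
minimal sketch that runs both tokenizers over the same text. It assumes a
Lucene 7.x-era classpath (lucene-core plus lucene-analyzers-common); the class
name HyphenDemo and the input string are just for illustration, and the exact
tokens can vary between Lucene versions.

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class HyphenDemo {

        // Run one tokenizer over the text and print the tokens it emits.
        static void printTokens(String label, Tokenizer tok, String text) throws Exception {
            tok.setReader(new StringReader(text));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            StringBuilder sb = new StringBuilder(label).append(": ");
            while (tok.incrementToken()) {
                sb.append('[').append(term).append("] ");
            }
            tok.end();
            tok.close();
            System.out.println(sb.toString().trim());
        }

        public static void main(String[] args) throws Exception {
            String text = "re: m37-xq due 03-09";

            // StandardTokenizer follows the UAX#29 word break rules and splits
            // at the hyphens; expect roughly: [re] [m37] [xq] [due] [03] [09]
            printTokens("standard", new StandardTokenizer(), text);

            // ClassicTokenizer keeps the pre-3.1 behavior: a hyphenated token
            // containing a digit is treated as a product number and left whole;
            // expect roughly: [re] [m37-xq] [due] [03-09]
            printTokens("classic", new ClassicTokenizer(), text);
        }
    }

The same comparison can be made from a Solr schema by swapping
solr.StandardTokenizerFactory for solr.ClassicTokenizerFactory in the field
type and looking at the Analysis screen in the admin UI.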