On 1/10/2018 2:27 PM, Rick Leir wrote:
I did not express that clearly.
The reference guide says "The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous."

So I am curious to know why they changed StandardTokenizer after 3.1 to break on hyphens, when the old behavior seems to me to work better?

I really have no idea. Those are Lucene classes, not Solr. Maybe someone who was around for whatever discussions happened on Lucene lists back in those days will comment.

I wasn't able to find the issue where ClassicTokenizer was created, and I couldn't find any information discussing the change.

If I had to guess why StandardTokenizer was changed this way, I think it was to accommodate searches where someone looked for a single word that only appeared in the index as part of a larger hyphenated term, so the search wasn't finding it. There was probably a discussion among the developers about what a typical Lucene user would want, to decide how the standard tokenizer should behave.
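To make the trade-off concrete, here is a toy sketch (plain regexes, not actual Lucene tokenizer grammars) of the two behaviors being discussed: one tokenizer that splits on hyphens, and one that keeps hyphenated runs together. The function names and regexes are my own illustration, not Lucene code.

```python
import re

def split_on_hyphens(text):
    # Sketch of the newer behavior: hyphens act as token breaks,
    # so each part of a hyphenated compound is individually searchable.
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def keep_hyphenated(text):
    # Sketch of the older behavior Rick prefers: a hyphenated run
    # stays together as a single token.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())

doc = "A state-of-the-art tokenizer"
print(split_on_hyphens(doc))  # ['a', 'state', 'of', 'the', 'art', 'tokenizer']
print(keep_hyphenated(doc))   # ['a', 'state-of-the-art', 'tokenizer']

# A search for the bare word "art" only matches under the first behavior:
print("art" in split_on_hyphens(doc))  # True
print("art" in keep_hyphenated(doc))   # False
```

With the splitting behavior, a query for "art" matches the document above; with the old behavior, only a query containing the full hyphenated token does. Which is "better" depends entirely on the data and the users, which is presumably why both tokenizers were kept.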

Likely because a vocal segment of the community relied on the old behavior, they preserved it in ClassicTokenizer, while updating the standard one to do what they felt a typical user would expect.

Obviously *your* needs do not fall in line with what was decided ... so the standard tokenizer isn't going to work for you.

Thanks,
Shawn
