Hi Rick,

Quoting Robert Muir’s comments on 
https://issues.apache.org/jira/browse/LUCENE-2167 (he’s referring to the word 
break rules in UAX#29[1] when he says “the standard”):
 
> i actually am of the opinion StandardTokenizer should follow unicode standard 
> tokenization. then we can throw subjective decisions away, and stick with a 
> standard.

> I think it would be really nice for StandardTokenizer to adhere straight to 
> the standard as much as we can with jflex [....] Then its name would actually 
> make sense.
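
For a concrete picture of the difference, here is a small program you can run
(a sketch only: I'm assuming a Lucene 7.x classpath, where both tokenizers
live in org.apache.lucene.analysis.standard; the class name and sample text
are my own):

  import java.io.IOException;
  import java.io.StringReader;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.standard.ClassicTokenizer;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class TokenizerCompare {
    // Collect the terms a tokenizer emits for the given text.
    static List<String> tokens(Tokenizer tok, String text) throws IOException {
      List<String> out = new ArrayList<>();
      tok.setReader(new StringReader(text));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      while (tok.incrementToken()) {
        out.add(term.toString());
      }
      tok.end();
      tok.close();
      return out;
    }

    public static void main(String[] args) throws IOException {
      String text = "the AS-400 wi-fi manual";
      // StandardTokenizer follows the UAX#29 default word-break rules:
      // a hyphen is always a break, so AS-400 -> [AS, 400].
      System.out.println("standard: " + tokens(new StandardTokenizer(), text));
      // ClassicTokenizer keeps hyphenated tokens that contain a digit
      // whole (it treats them as product/model numbers), so AS-400 survives.
      System.out.println("classic:  " + tokens(new ClassicTokenizer(), text));
    }
  }

On my reading of the two grammars, that should print something like:

  standard: [the, AS, 400, wi, fi, manual]
  classic:  [the, AS-400, wi, fi, manual]

The interesting case is AS-400: the UAX#29 default rules always break on a
hyphen, while the classic grammar keeps a hyphenated token whole when it
contains a digit. Note that the purely alphabetic wi-fi splits under both.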


[1] Unicode Standard Annex #29: Unicode Text Segmentation 
<http://unicode.org/reports/tr29/>
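
And if the pre-3.1 behavior suits your data better, ClassicTokenizer is still
available to Solr. A minimal field type sketch (the type name text_classic is
just an example):

  <fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- pre-3.1 StandardTokenizer grammar: keeps product numbers,
           email addresses, and hostnames as single tokens -->
      <tokenizer class="solr.ClassicTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>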

--
Steve
www.lucidworks.com

> On Jan 10, 2018, at 10:09 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 1/10/2018 2:27 PM, Rick Leir wrote:
>> I did not express that clearly.
>> The reference guide says "The Classic Tokenizer preserves the same behavior 
>> as the Standard Tokenizer of Solr versions 3.1 and previous."
>> So I am curious: why was StandardTokenizer changed after 3.1 to break on 
>> hyphens, when the old way seems to me to work better?
> 
> I really have no idea.  Those are Lucene classes, not Solr.  Maybe someone 
> who was around for whatever discussions happened on Lucene lists back in 
> those days will comment.
> 
> I wasn't able to find the issue where ClassicTokenizer was created, and I 
> couldn't find any information discussing the change.
> 
> If I had to guess why StandardTokenizer was updated this way, I think it was 
> to accommodate searches where people were looking for one word that appeared 
> in the text only as part of a larger hyphenated term, and so wasn't being 
> found.  There was probably a discussion among the developers about what a 
> typical Lucene user would want, so they could decide what the standard 
> tokenizer should do.
> 
> Likely because a vocal segment of the community relied on the old behavior, 
> they preserved it in ClassicTokenizer, but updated the standard one to do 
> what they felt would be normal for a typical user.
> 
> Obviously *your* needs do not fall in line with what was decided ... so the 
> standard tokenizer isn't going to work for you.
> 
> Thanks,
> Shawn
