On 5/17/2013 10:26 AM, Kai Gülzau wrote:
> Is there some StandardTokenizer implementation which does not break
> words on hyphens?
>
> I think it would be more flexible to retain hyphens and use a
> WordDelimiterFilterFactory to split these tokens.
You can use the whitespace tokenizer with WDF. This is what I did for my
index up through 3.5.

In 4.x, I wanted to be able to use the CJK filters. The CJK filters don't
work with the whitespace tokenizer, only with the ICU or standard
tokenizers. Until recently, the ICU tokenizer was just as aggressive on
punctuation as the standard one, so it wouldn't work either.

Thanks to SOLR-4123, I was able to change tokenizers and still use WDF.
That issue adds the ability to change how the ICU tokenizer works via a
rule file. Here is my fieldType:

http://pastie.org/private/tjd9pk6sfgohyhpfbpn7q

The custom rule capability on the ICU tokenizer is in 4.1 or later, and
the Latin-break-only-on-whitespace.rbbi file that I am using in my schema
can be found in the Solr source code.

https://issues.apache.org/jira/browse/SOLR-4123
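For anyone on 3.x who just wants the whitespace + WDF combination, a
rough sketch (the field name and WDF parameter values here are examples,
not copied from my actual schema):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- whitespace tokenizer keeps hyphenated words as single tokens -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- WDF then splits on hyphens and other intra-word punctuation -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>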
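In case the pastie link goes away, the general shape of the 4.x approach
looks like this. Again, a sketch only: the filter chain and parameter
values are illustrative, not the exact contents of the pastie.

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Custom rules (SOLR-4123): break Latin-script text only on
           whitespace, leave other scripts on the default ICU rules.
           The .rbbi file goes in the core's conf directory. -->
      <tokenizer class="solr.ICUTokenizerFactory"
                 rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
      <!-- WDF now receives intact hyphenated tokens to split -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" preserveOriginal="1"/>
      <!-- CJK bigramming still works because the ICU tokenizer sets the
           token types that the CJK filters rely on -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks,
Shawn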