On 5/17/2013 10:26 AM, Kai Gülzau wrote:
> Is there some StandardTokenizer implementation which does not break
> words on hyphens?
>
> I think it would be more flexible to retain hyphens and use a
> WordDelimiterFilterFactory to split these tokens.
You can use the whitespace tokenizer with WDF. This is what I did for my
index up through 3.5.

In 4.x, I wanted to be able to use the CJK filters. The CJK filters don't
work with the whitespace tokenizer, only with the ICU or standard
tokenizers. Until recently, the ICU tokenizer was just as aggressive on
punctuation as the standard one, so it wouldn't work either.

Thanks to SOLR-4123, I was able to change tokenizers and still use WDF.
That issue adds the ability to change how the ICU tokenizer works via a
rule file. Here is my fieldType:

http://pastie.org/private/tjd9pk6sfgohyhpfbpn7q

The custom rule capability on the ICU tokenizer is in 4.1 or later, and
the Latin-break-only-on-whitespace.rbbi file that I am using in my schema
can be found in the Solr source code.

https://issues.apache.org/jira/browse/SOLR-4123
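For anyone on 3.x who just wants the whitespace + WDF combination, a
rough sketch (the field name and WDF parameter values here are examples,
not copied from my actual schema):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- whitespace tokenizer keeps hyphenated words as single tokens -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- WDF then splits on hyphens and other intra-word punctuation -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>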
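In case the pastie link goes away, the general shape of the 4.x approach
looks like this. Again, a sketch only: the filter chain and parameter
values are illustrative, not the exact contents of the pastie.

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Custom rules (SOLR-4123): break Latin-script text only on
           whitespace, leave other scripts on the default ICU rules.
           The .rbbi file goes in the core's conf directory. -->
      <tokenizer class="solr.ICUTokenizerFactory"
                 rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
      <!-- WDF now receives intact hyphenated tokens to split -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" preserveOriginal="1"/>
      <!-- CJK bigramming still works because the ICU tokenizer sets the
           token types that the CJK filters rely on -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks,
Shawn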