I cannot seem to get the combination of behaviors I want from the tokenizer/filter options in Solr.

Right now I am using WhitespaceTokenizer. It does not split on punctuation, which is the behavior I want, because I handle that myself later. I then use WordDelimiterFilter with preserveOriginal so that documents containing text like "Word1-Word2" can be found by a search for word1word2 as well as by the two words individually.
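
For reference, the relevant part of my analyzer chain looks roughly like this (a sketch from memory, not copied verbatim from my schema):

    <fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- whitespace-only tokenization; punctuation stays attached to tokens -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- WDF splits on the punctuation itself, keeps the original token,
             and catenates the parts, so "Word1-Word2" can match word1, word2,
             word1word2, and the original form -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                catenateWords="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>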

I am extremely interested in the Unicode behavior of ICUTokenizer, but I cannot disable its punctuation-splitting behavior and let WDF handle it instead, which causes recall problems. There is no filter I can run after tokenization to undo the splitting, either. Looking at ICUTokenizer.java, I do not see any way to write my own tokenizer that does what I need.
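
For example, with a chain like the one below (again just a sketch), ICUTokenizer has already split "Word1-Word2" apart and discarded the hyphen before WordDelimiterFilter ever sees the tokens, so preserveOriginal and catenateWords have nothing to work with:

    <analyzer>
      <!-- ICUTokenizer turns "Word1-Word2" into "Word1" and "Word2" and
           drops the hyphen -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- by this point the original hyphenated token is already gone -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              catenateWords="1"
              preserveOriginal="1"/>
    </analyzer>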

I have this problem with pretty much every tokenizer other than WhitespaceTokenizer. There are situations where I would like to use some of the others, but their punctuation-splitting behavior is a major problem for me.

Do I have any options? I have never looked at the ICU code from IBM, so I don't know if it would require major surgery there.

Thanks,
Shawn
