I cannot seem to get the combination of behaviors I want from the tokenizer/filter options in Solr.

Right now I am using WhitespaceTokenizer. It does not split on punctuation, which is the behavior I want, because I handle that myself later. I then use WordDelimiterFilter with preserveOriginal so that documents containing text like "Word1-Word2" can be found by a search for word1word2 as well as by the two words individually.
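
For reference, the relevant part of my analyzer chain looks roughly like this (a sketch from memory, not copied verbatim from my schema):

    <fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- whitespace-only tokenization; punctuation stays attached to tokens -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- WDF splits on the punctuation itself, keeps the original token,
             and catenates the parts, so "Word1-Word2" can match word1, word2,
             word1word2, and the original form -->
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                catenateWords="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>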

I am extremely interested in the Unicode behavior of ICUTokenizer, but I cannot disable its punctuation-splitting behavior and let WDF handle it instead, which causes recall problems. There is no filter I can run after tokenization to undo the splitting, either. Looking at ICUTokenizer.java, I do not see any way to write my own tokenizer that does what I need.
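
For example, with a chain like the one below (again just a sketch), ICUTokenizer has already split "Word1-Word2" apart and discarded the hyphen before WordDelimiterFilter ever sees the tokens, so preserveOriginal and catenateWords have nothing to work with:

    <analyzer>
      <!-- ICUTokenizer turns "Word1-Word2" into "Word1" and "Word2" and
           drops the hyphen -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- by this point the original hyphenated token is already gone -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              catenateWords="1"
              preserveOriginal="1"/>
    </analyzer>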

I have this problem with pretty much every tokenizer other than WhitespaceTokenizer. There are situations where I would like to use some of the others, but their punctuation-splitting behavior is a major problem for me.

Do I have any options? I have never looked at the ICU code from IBM, so I don't know if it would require major surgery there.

Thanks,
Shawn
