I can't seem to get the behavior I want from any of the
tokenizer/filter combinations in Solr.
Right now I am using WhitespaceTokenizer. This does not split on
punctuation, which is the behavior I want, because I do this myself
later. I use WordDelimiterFilter with preserveOriginal so that
documents with text in the form "Word1-Word2" can be found by a
search for word1word2 as well as by the two words individually.
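For reference, the analyzer described above can be sketched in schema.xml roughly like this. The fieldType name and the extra WordDelimiterFilter flags (generateWordParts, catenateWords, the trailing LowerCaseFilter) are assumptions to make the example complete, not a copy of my actual config:

```xml
<!-- Sketch of the current analysis chain; names and extra flags are assumed. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace only; punctuation stays inside tokens. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WDF then sees "Word1-Word2" intact and can emit the original,
         the parts, and the catenated form. -->
    <filter class="solr.WordDelimiterFilterFactory"
            preserveOriginal="1"
            generateWordParts="1"
            catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```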
I am extremely interested in the Unicode behavior of ICUTokenizer, but I
cannot disable its punctuation-splitting behavior and let WDF handle
punctuation properly, which causes recall problems. There is no filter I
can run after tokenization to undo the split, either. Looking at
ICUTokenizer.java, I do not see any way to write my own tokenizer that
does what I need.
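To make the recall problem concrete, here is a small Python sketch of the interaction. The word_delimiter function is a hypothetical simplification of what WordDelimiterFilter does with preserveOriginal and catenateWords, not the real Lucene logic; the point is that WDF can only emit the preserved original and the catenated form if the tokenizer has not already split on punctuation:

```python
import re

def whitespace_tokenize(text):
    # Emulates WhitespaceTokenizer: split on whitespace only,
    # leaving punctuation inside the tokens.
    return text.split()

def word_delimiter(tokens, preserve_original=True, catenate_words=True):
    # Rough, simplified sketch of WordDelimiterFilter's subword handling.
    out = []
    for tok in tokens:
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", tok) if p]
        if len(parts) > 1:
            if preserve_original:
                out.append(tok)          # keep "Word1-Word2" itself
            out.extend(parts)            # "Word1", "Word2"
            if catenate_words:
                out.append("".join(parts))  # "Word1Word2"
        else:
            out.append(tok)
    return out

# Whitespace tokenization hands WDF the hyphenated token intact:
print(word_delimiter(whitespace_tokenize("see Word1-Word2 here")))
# -> ['see', 'Word1-Word2', 'Word1', 'Word2', 'Word1Word2', 'here']

# A punctuation-splitting tokenizer (ICU/Standard) has already produced
# ["Word1", "Word2"], so neither the original nor the catenated form
# can ever be generated:
print(word_delimiter(["see", "Word1", "Word2", "here"]))
# -> ['see', 'Word1', 'Word2', 'here']
```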
I have this problem with pretty much all of the tokenizers other than
Whitespace. There are situations where I would like to use some of the
others, but the punctuation-splitting behavior is a major problem for me.
Do I have any options? I have never looked at the ICU code from IBM, so
I don't know whether it would require major surgery there.
Thanks,
Shawn
Subject: Solr/Lucene Tokenizers - cannot get the behavior I need
From: Shawn Heisey