Chris Hostetter wrote:
: I need to tokenize my field on whitespaces, html, punctuation, apostrophe
: but if I use HTMLStripStandardTokenizerFactory it strips only html....
: but no apostrophes
you might consider using one of the HTML Tokenizers, and then use a
PatternReplaceFilterFactory to strip the apostrophes ... or if you know
java write a simple Tokenizer that uses the HTMLStripReader.
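
A minimal sketch of the first option as a schema.xml field type. The
field type name "text_html" and the apostrophe pattern are assumptions
made for illustration and are not from the original message; the factory
class names are the stock Solr ones.

```xml
<!-- Hypothetical field type; the name "text_html" is illustrative only. -->
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Strips HTML markup, then tokenizes with the StandardTokenizer. -->
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <!-- Assumed pattern: removes every apostrophe left inside a token. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="'" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

Note that a pattern replace of this kind only rewrites the text of each
token (o'neil becomes oneil); it does not split one token into two.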
in the long run, changing the HTMLStripReader to be usable as a
"CharFilter" so it can work with any Tokenizer is probably the way we'll
go -- but i don't think anyone has started working on a patch for that.
I opened:
https://issues.apache.org/jira/browse/SOLR-1343
Koji