[Using SolrCloud 4.4.0] I have a text field where the data will sometimes be delimited by whitespace, and sometimes by underscore. For example, both of the following are possible input values:
Group_EN_1000232142_blah_1000232142abc_foo
Group EN 1000232142 blah 1000232142abc foo

What I'd like to do is have underscores treated as spaces for tokenization purposes. I've tried using a PatternReplaceFilterFactory with:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all" />
      </analyzer>
    </fieldType>

but that seems to do the pattern replacement within each token, rather than splitting tokens into multiple tokens based on the pattern. So with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up with a single token of "group en 1000232142 blah 1000232142abc foo" rather than what I want, which is 6 tokens: "group", "en", "1000232142", "blah", "1000232142abc", "foo".

Is there a way to configure the behavior I'm looking for, or would I need to write a custom tokenizer?

Thanks!
-Greg
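
For reference, one configuration that may be worth trying is a char filter in place of the token filter. A char filter rewrites the raw text before the tokenizer sees it, so the underscores would already be spaces when StandardTokenizer runs. The sketch below assumes solr.PatternReplaceCharFilterFactory is available in this Solr version and has not been tested against this setup. Only the index analyzer is shown; the query analyzer would presumably need the same change.

    <analyzer type="index">
      <!-- assumption: replace "_" with " " in the character stream, before tokenization -->
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

If this behaves as expected, the Analysis screen in the Solr admin UI should show six tokens for "Group_EN_1000232142_blah_1000232142abc_foo".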