This is exactly what I needed. Thank you!

-Greg

On Wed, Sep 25, 2013 at 2:48 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> Use the char filter instead:
> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Greg Preston
> Sent: Wednesday, September 25, 2013 5:43 PM
> To: solr-user@lucene.apache.org
> Subject: How to always tokenize on underscore?
>
> [Using SolrCloud 4.4.0]
>
> I have a text field where the data will sometimes be delimited by
> whitespace, and sometimes by underscore. For example, both of the
> following are possible input values:
>
> Group_EN_1000232142_blah_1000232142abc_foo
> Group EN 1000232142 blah 1000232142abc foo
>
> What I'd like to do is have underscores treated as spaces for
> tokenization purposes. I've tried using a PatternReplaceFilterFactory
> with:
>
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="_"
>             replacement=" " replace="all" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="_"
>             replacement=" " replace="all" />
>   </analyzer>
> </fieldType>
>
> but that seems to do the pattern replacement on each token, rather
> than splitting tokens into multiple tokens based on the pattern. So
> with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up
> with a single token of "group en 1000232142 blah 1000232142abc foo"
> rather than what I want, which is 6 tokens: "group", "en",
> "1000232142", "blah", "1000232142abc", "foo".
>
> Is there a way to configure for the behavior I'm looking for, or would
> I need to write a custom tokenizer?
>
> Thanks!
>
> -Greg
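
For anyone finding this thread later, here is a minimal sketch of how the suggested char filter might be wired into the field type from the original message. It is an untested illustration, not the exact configuration that was deployed: it assumes the same `text_general` field type and uses the `pattern` and `replacement` attributes of `solr.PatternReplaceCharFilterFactory`. A char filter runs on the raw character stream before the tokenizer, so the underscores have already become spaces by the time `StandardTokenizerFactory` sees the text.

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: replace every underscore with a space BEFORE tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same char filter at query time so query terms split the same way -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, the input "Group_EN_1000232142_blah_1000232142abc_foo" should come out as the six tokens asked for in the original question. The Analysis screen in the Solr admin UI is a convenient way to check this, and documents indexed under the old analyzer chain need to be reindexed before the change takes effect for them.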