Use the char filter instead:
http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html
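
Something like this (untested) should do it, keeping your existing
tokenizer and lowercase filter -- the charFilter rewrites underscores to
spaces before the tokenizer ever sees the text, so StandardTokenizer then
splits on them like ordinary whitespace:

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>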

-- Jack Krupansky

-----Original Message----- From: Greg Preston
Sent: Wednesday, September 25, 2013 5:43 PM
To: solr-user@lucene.apache.org
Subject: How to always tokenize on underscore?

[Using SolrCloud 4.4.0]

I have a text field where the data will sometimes be delimited by
whitespace, and sometimes by underscores.  For example, both of the
following are possible input values:

Group_EN_1000232142_blah_1000232142abc_foo
Group EN 1000232142 blah 1000232142abc foo

What I'd like to do is have underscores treated as spaces for
tokenization purposes.  I've tried using a PatternReplaceFilterFactory
with:

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" " replace="all"/>
     </analyzer>
   </fieldType>

but that seems to do the pattern replacement on each token, rather
than splitting tokens into multiple tokens based on the pattern.  So
with the input "Group_EN_1000232142_blah_1000232142abc_foo" I end up
with a single token of "group en 1000232142 blah 1000232142abc foo"
rather than what I want, which is 6 tokens: "group", "en",
"1000232142", "blah", "1000232142abc", "foo".

Is there a way to configure the behavior I'm looking for, or would
I need to write a custom tokenizer?

Thanks!

-Greg
