Stripping leading/trailing punctuation with SOLR-1653

Shawn Heisey Tue, 31 Aug 2010 07:24:23 -0700

I am trying to use PatternReplaceCharFilterFactory (SOLR-1653) to strip leading and trailing punctuation from terms. It's not working. This was previously discussed here as part of something I was trying with WordDelimiterFilterFactory, but I think it needs its own thread now.

I seem to be having two problems, based on what I can see. The first problem is that analysis shows the PatternReplaceCharFilterFactory applied in a different order than I have configured it: it's going first. The other problem is that it's eating all my text, leaving any fields of that type (which is most of my index!) completely empty. A screenshot showing the issue:


http://www.elyograg.org/punct_analysis.png

Here's my entire fieldType definition, but the same thing happens when I replace the pattern with a very basic "([0-9]*)(.*)([0-9]*)" and the input value with "9dummy".

<fieldType name="text" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">

<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
          pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
          replaceWith="$2"
        />
<filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="1"
          preserveOriginal="1"
        />

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>

<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
          pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
          replaceWith="$2"
        />
<filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="1"
        />

<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>

<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
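
One way to narrow this down is to run the same two patterns through plain java.util.regex, outside Solr. The sketch below is only that: it assumes the char filter applies the pattern to the whole input with replace-all semantics, which is an assumption and not something confirmed against the SOLR-1653 patch. The class name PunctStripCheck, the helper apply, and the sample input "(hello)," are hypothetical and exist only for illustration.

```java
import java.util.regex.Pattern;

public class PunctStripCheck {
    // Hypothetical helper: replace every match of the pattern in the input,
    // using plain java.util.regex (no Solr classes involved).
    static String apply(String pattern, String replacement, String input) {
        return Pattern.compile(pattern).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // The basic test pattern and input from the message above.
        System.out.println(apply("([0-9]*)(.*)([0-9]*)", "$2", "9dummy"));
        // prints "dummy"

        // The punctuation pattern from the fieldType; the input is made up.
        System.out.println(apply("(\\p{Punct}*)(.*)(\\p{Punct}*)", "$2", "(hello),"));
        // prints "hello),"
    }
}
```

In plain Java neither pattern empties the input, so under these assumptions the regular expression alone does not account for the empty fields. The second result also shows that the greedy (.*) swallows the trailing punctuation before the third group can match it, so this pattern would only strip leading punctuation even when the filter behaves as expected.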

Am I doing something wrong, or is this a bug?

Thanks,
Shawn
