I am trying to use PatternReplaceCharFilterFactory (SOLR-1653) to
strip leading and trailing punctuation from terms. It's not working.
This was previously discussed here as part of something I was trying
with WordDelimiterFilterFactory, but I think it needs its own thread now.
I seem to be having two problems. First, the analysis page shows
PatternReplaceCharFilterFactory being applied first, not in the position
where I configured it (after the tokenizer and ASCIIFoldingFilterFactory).
Second, it's eating all my text, leaving every field of that type (which
is most of my index!) completely empty. A screenshot showing the issue:
http://www.elyograg.org/punct_analysis.png
The same thing happens when I replace the pattern with a very basic
"([0-9]*)(.*)([0-9]*)" and use "9dummy" as the input value. Here's my
entire fieldType definition:
<fieldType name="text" class="solr.TextField" sortMissingLast="true"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
                replaceWith="$2"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="1"
            preserveOriginal="1"
    />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\p{Punct}*)(.*)(\p{Punct}*)"
                replaceWith="$2"
    />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="1"
            stemEnglishPossessive="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"
    />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
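For reference, here is a minimal sketch of what these patterns do in plain
java.util.regex, outside Solr. The class and helper names are made up for
illustration, and it assumes the char filter applies the pattern roughly
the way Matcher.replaceAll does, which I have not verified. The last
pattern (anchored, with a reluctant middle group) is not in my config; it
is only there for comparison.

```java
import java.util.regex.Pattern;

class PatternReplaceDemo {
    // Hypothetical helper: plain regex replacement, no Solr involved.
    static String apply(String pattern, String replaceWith, String input) {
        return Pattern.compile(pattern).matcher(input).replaceAll(replaceWith);
    }

    public static void main(String[] args) {
        String digits = "([0-9]*)(.*)([0-9]*)";
        String punct = "(\\p{Punct}*)(.*)(\\p{Punct}*)";
        // Comparison only: anchored, reluctant middle group (not in my schema).
        String anchored = "^(\\p{Punct}*)(.*?)(\\p{Punct}*)$";

        System.out.println(apply(digits, "$2", "9dummy"));     // dummy
        // The greedy (.*) swallows the trailing digit, so group 3 is empty.
        System.out.println(apply(digits, "$2", "9dummy9"));    // dummy9
        System.out.println(apply(punct, "$2", "(hello),"));    // hello),
        System.out.println(apply(anchored, "$2", "(hello),")); // hello
    }
}
```

So in plain Java the text is not emptied, which means regex semantics
alone do not seem to explain the empty output in Solr. The sketch does
show that the greedy middle group keeps trailing punctuation, so my
pattern would not strip it even if the filter ran as I expected.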
Am I doing something wrong, or is this a bug?
Thanks,
Shawn