PatternReplaceFilterFactory creating empty string as a term

Shawn Heisey Tue, 05 Oct 2010 15:25:09 -0700

I am developing a new schema. It has a pattern filter that trims leading and trailing punctuation from terms.


<filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
replacement="$2"
/>

It is resulting in empty terms, because there are situations in the analyzer stream where a term happens to be composed of nothing but punctuation. This problem is not happening in production. I want those terms removed.
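To illustrate how the empty term arises, here is a minimal Java sketch that applies the same pattern and replacement with plain java.util.regex. It assumes the filter replaces all matches the way Matcher.replaceAll does; the class name PunctTrimDemo and the helper trim are hypothetical names used only for this illustration and are not part of Solr.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Solr or Lucene.
class PunctTrimDemo {
    // Same pattern as in the schema: leading punctuation, lazy body, trailing punctuation.
    static final Pattern TRIM =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Keep only group 2, mirroring replacement="$2".
    static String trim(String term) {
        return TRIM.matcher(term).replaceAll("$2");
    }

    public static void main(String[] args) {
        String[] samples = {"(usa)", "photo.", "--", "..."};
        for (String term : samples) {
            // Brackets make an empty result visible in the output.
            System.out.println("[" + term + "] -> [" + trim(term) + "]");
        }
    }
}
```

For a term made up entirely of punctuation, the first group consumes every character, so group 2 matches nothing and the replacement is the empty string. The regex has no way to drop the token itself; it can only rewrite its text.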

This blank term is at the top of the list by term frequency. Out of 7.6 million documents, 4.8 million of them have it. From TermsComponent:


<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">19106</int>
</lst>
<lst name="terms">
<lst name="catchall">
<int name="">4830648</int>
<int name="usa">1863264</int>
<int name="photo">1743551</int>
<int name="new">1544314</int>
<int name="de">1455691</int>
<int name="during">1412551</int>
<int name="los">1408855</int>
<int name="united">1368594</int>
<int name="2009">1271103</int>
<int name="la">1204441</int>
</lst>
</lst>
</response>

Is there any existing way to remove empty terms during analysis? I tried TrimFilterFactory but that made no difference. Is this a bug in PatternReplaceFilterFactory?


Shawn
